Qiita National Student Competition (全国学生対抗戦) Advent Calendar 2024

I Tried Building a Model That Classifies Rock-Paper-Scissors Hands with MediaPipe

MediaPipe

Posted at 2024-12-18

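Introduction

In my previous article I used MediaPipe for hand tracking. Continuing from there, this time I built a model that classifies rock-paper-scissors (janken) hand shapes with MediaPipe.

In this article, we use MediaPipe to extract hand landmarks and build a system that classifies the hand shape (rock, scissors, or paper) with a machine learning model built with TensorFlow. The article walks through the following steps:

1. Collecting hand landmark data
2. Preprocessing the data
3. Building and training the model
4. Real-time classification with the model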

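1. Collecting hand landmark data

Goal:
Classifying rock-paper-scissors hands requires hand landmark (keypoint) data. In this section we use MediaPipe to detect hand landmarks and save the collected data as a CSV file.

Role of MediaPipe Hands:
MediaPipe Hands detects hand landmarks with high accuracy: 21 landmarks per hand, each with x, y, and z coordinates. These landmarks are used later to train the machine learning model.

Collection flow:

・Capture camera frames
・Detect hand landmarks
・Save the coordinate data

Here is the code.
For rock, scissors, and paper in turn, press Enter to start recording your own hand, then press q (with the preview window focused) to finish recording that gesture. Collect as much varied data as you can, for example by moving your hand closer to the camera or rotating it. More varied data tends to give higher prediction accuracy.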

import cv2
import mediapipe as mp
import pandas as pd
import os

csv_file = "data/hand_landmarks.csv"

# Set up MediaPipe Hands
mp_hands = mp.solutions.hands
hands = mp_hands.Hands(static_image_mode=False, max_num_hands=1, min_detection_confidence=0.5)
mp_drawing = mp.solutions.drawing_utils

# Set up the camera
cap = cv2.VideoCapture(0)

# Rows of collected hand landmark coordinates
landmarks_data = []

def collect_landmarks(label):
    global landmarks_data
    count = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            print("Failed to capture image")
            break

        # Flip the image horizontally (mirror view)
        frame = cv2.flip(frame, 1)

        # Convert the image to RGB
        image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

        # Detect hand landmarks
        results = hands.process(image)

        # Convert the image back to BGR
        frame = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

        # If landmarks were detected, collect the data
        if results.multi_hand_landmarks:
            for hand_landmarks in results.multi_hand_landmarks:
                landmarks = []
                for lm in hand_landmarks.landmark:
                    landmarks.extend([lm.x, lm.y, lm.z])
                landmarks.append(label)
                landmarks_data.append(landmarks)
                mp_drawing.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)
                count += 1
                print(f"Collected {count} samples for {label}")

        # Show the image
        cv2.imshow('Hand Landmarks', frame)

        # Stop when the 'q' key is pressed
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    # landmarks_data accumulates across labels, so report this label's own count
    print(f'Collected {count} data points for {label}')

# Prompt for each gesture in turn
input("Press Enter to collect data for Rock (グー)...")
collect_landmarks('rock')

input("Press Enter to collect data for Scissors (チョキ)...")
collect_landmarks('scissors')

input("Press Enter to collect data for Paper (パー)...")
collect_landmarks('paper')

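# Save the data to a CSV file
try:
    # Make sure the output directory exists
    os.makedirs(os.path.dirname(csv_file), exist_ok=True)
    # Rows are interleaved (x0, y0, z0, x1, ...), so name the columns in that order
    columns = [f'{axis}{i}' for i in range(21) for axis in ('x', 'y', 'z')] + ['label']
    df = pd.DataFrame(landmarks_data, columns=columns)
    df.to_csv(csv_file, index=False)
    print(f'Data saved successfully to {csv_file}')
except Exception as e:
    print(f'Error saving data: {e}')

# Release resources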
cap.release()
cv2.destroyAllWindows()
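
Each saved row interleaves the coordinates landmark by landmark (x0, y0, z0, x1, ...), because the collection loop calls landmarks.extend([lm.x, lm.y, lm.z]) once per landmark. The following is a minimal sketch that checks this layout without a camera. FakeLandmark and flatten_landmarks are hypothetical stand-ins for illustration and are not part of MediaPipe.

```python
from collections import namedtuple

# Hypothetical stand-in for a MediaPipe landmark (real ones expose .x, .y, .z).
FakeLandmark = namedtuple("FakeLandmark", ["x", "y", "z"])

NUM_LANDMARKS = 21  # MediaPipe Hands reports 21 landmarks per hand


def flatten_landmarks(landmarks, label):
    """Flatten landmarks into [x0, y0, z0, x1, ...] and append the label."""
    row = []
    for lm in landmarks:
        row.extend([lm.x, lm.y, lm.z])
    row.append(label)
    return row


# Column names in the same interleaved order as the row values.
columns = [f"{axis}{i}" for i in range(NUM_LANDMARKS) for axis in "xyz"] + ["label"]

# Dummy hand: landmark i gets coordinates (i, i + 0.1, i + 0.2).
fake_hand = [FakeLandmark(float(i), i + 0.1, i + 0.2) for i in range(NUM_LANDMARKS)]
row = flatten_landmarks(fake_hand, "rock")

print(len(columns), len(row))  # 64 64
print(columns[:4], row[:4])
```

Each row therefore has 63 feature values (21 landmarks times 3 coordinates) plus one label, and the column names only describe the data correctly if they follow the same interleaved order.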

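2. Preprocessing the data

Goal:
Put the collected hand landmark data into a form the machine learning model can learn from. The main steps are:

・Split the data into landmark data (features) and hand type (labels)
・Encode the labels as numbers
・Split the data into training and test sets

Key point:
The labels (rock, scissors, paper) are converted to integers, which are easier for machine learning to handle, and then one-hot encoded for classification.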
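
To make the encoding concrete, here is a small dependency-free sketch of the two steps. It assumes the same behavior as scikit-learn's LabelEncoder, which assigns integers to the classes in sorted order. fit_label_index and one_hot are hypothetical helper names used only for this illustration.

```python
def fit_label_index(labels):
    """Map each distinct label to an integer, in sorted order (as LabelEncoder does)."""
    classes = sorted(set(labels))
    return {name: idx for idx, name in enumerate(classes)}


def one_hot(index, num_classes):
    """Return a one-hot list with 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


labels = ["rock", "scissors", "paper", "rock"]
label_index = fit_label_index(labels)
encoded = [label_index[name] for name in labels]
categorical = [one_hot(i, len(label_index)) for i in encoded]

print(label_index)  # {'paper': 0, 'rock': 1, 'scissors': 2}
print(encoded)      # [1, 2, 0, 1]
print(categorical[0])
```

Because the order is alphabetical rather than the order in which the gestures were recorded, any code that later turns a predicted class index back into a label has to use paper = 0, rock = 1, scissors = 2. With the real encoder, label_encoder.classes_ shows the same ordering.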

If TensorFlow is not installed yet, install it with pip install tensorflow. The code below also imports scikit-learn (pip install scikit-learn).

Here is the code.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

# Load the data
df = pd.read_csv("data/hand_landmarks.csv")

# Split into features (X) and labels (y)
X = df.drop('label', axis=1).values
y = df['label'].values

# Encode labels as integers (classes are sorted: paper=0, rock=1, scissors=2)
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# One-hot encoding (for 3-class classification: rock, scissors, paper)
y_categorical = to_categorical(y_encoded, num_classes=3)

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y_categorical, test_size=0.2, random_state=42)
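
As a rough illustration of what an 80/20 split does, the sketch below shuffles a dummy dataset and checks how many samples of each class end up in the test set. split_rows is a hypothetical helper; it does not reproduce scikit-learn's exact shuffling, only the idea.

```python
import random
from collections import Counter


def split_rows(rows, test_ratio=0.2, seed=42):
    """Shuffle rows and split them into (train, test) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Dummy dataset: 50 samples per class, features omitted for brevity.
rows = [(i, label) for label in ("rock", "scissors", "paper") for i in range(50)]
train, test = split_rows(rows)

print(len(train), len(test))                # 120 30
print(Counter(label for _, label in test))  # per-class counts in the test set
```

A plain shuffle can leave the classes slightly unbalanced between the two sets. If that matters for your data, train_test_split accepts a stratify argument (for example stratify=y_encoded) that keeps the class proportions equal in both splits.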

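3. Building and training the model

Goal:
Build and train a neural network that classifies hand landmarks as rock, scissors, or paper.

Key points:
・The input is the hand landmark coordinates (63 values: 21 landmarks × x, y, z)
・The output layer has 3 classes (rock, scissors, paper)
・A Dropout layer is added to reduce overfitting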
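
For readers new to Dropout, here is a minimal pure-Python sketch of the idea (inverted dropout): during training each unit is zeroed with probability rate and the survivors are scaled by 1 / (1 - rate), while at inference the values pass through unchanged. The dropout function below is an illustration only, not the Keras implementation.

```python
import random


def dropout(values, rate, training, rng):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(values)  # inference: pass values through unchanged
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random(0)
activations = [1.0] * 8

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
infer_out = dropout(activations, rate=0.5, training=False, rng=rng)

print(train_out)  # each entry is either 0.0 or 2.0
print(infer_out)  # same as the input
```

Because a different random subset of units is dropped at every training step, the network cannot rely on any single unit, which is why the layer helps against overfitting on a small hand-collected dataset.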

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Build the model
model = Sequential()

# Input layer
model.add(Dense(64, input_shape=(X_train.shape[1],), activation='relu'))

# Hidden layer
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))  # Dropout layer to reduce overfitting

# Output layer (3 classes: rock, scissors, paper)
model.add(Dense(3, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Print the model summary
model.summary()

model.summary() prints each layer of the model along with its parameter count.
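
The parameter counts in the summary can be checked by hand. The sketch below assumes the input has 63 features (21 landmarks with x, y, and z each), as produced by the collection script; dense_params is a hypothetical helper name.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


n_features = 21 * 3  # 21 landmarks, each with x, y, z

layer_params = [
    dense_params(n_features, 64),  # first Dense(64)
    dense_params(64, 64),          # second Dense(64)
    dense_params(64, 3),           # output Dense(3); Dropout adds no parameters
]

print(layer_params)       # [4096, 4160, 195]
print(sum(layer_params))  # 8451
```

With fewer than ten thousand parameters the network is small, which helps it run fast enough for per-frame prediction.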

Next, we train, evaluate, and save the model.

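import os

# Train the model (note: the test split doubles as validation data here)
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy * 100:.2f}%")

# Save the model (create the output directory first)
os.makedirs('model', exist_ok=True)
model.save('model/janken_model.h5')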
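
The model ends in a softmax layer and is trained with categorical cross-entropy. As a rough sketch of what those two pieces compute, the following pure-Python version turns three raw scores into probabilities and compares the loss for a correct and an incorrect target. The function names and the example scores are made up for illustration; Keras computes the same quantities internally in its own way.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probs, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot_target, probs))


probs = softmax([2.0, 1.0, 0.1])
loss_correct = categorical_crossentropy([1.0, 0.0, 0.0], probs)  # true class = index 0
loss_wrong = categorical_crossentropy([0.0, 0.0, 1.0], probs)    # true class = index 2

print([round(p, 3) for p in probs])
print(loss_correct < loss_wrong)  # True
```

The loss is small when the model puts high probability on the true class and large otherwise, which is what training minimizes.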

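4. Real-time classification with the model

Goal:
Use the trained model to classify the hand shape from camera video in real time.

Key points:

・Capture camera frames
・Detect landmarks with MediaPipe
・Classify with the trained model
・Draw the result on screen

Here is the code.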

import cv2
import mediapipe as mp
import numpy as np
from tensorflow.keras.models import load_model

# Set up MediaPipe Hands
mp_hands = mp.solutions.hands
hands = mp_hands.Hands(static_image_mode=False, max_num_hands=1, min_detection_confidence=0.5)
mp_drawing = mp.solutions.drawing_utils

# Load the trained model
model = load_model('model/janken_model.h5')

# Extract the landmark coordinates as a flat list
def extract_landmark_data(hand_landmarks):
    landmarks = []
    for lm in hand_landmarks.landmark:
        landmarks.extend([lm.x, lm.y, lm.z])
    return landmarks

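# Classify the hand shape
def predict_hand_shape(landmarks):
    # Reshape the landmarks to (1, n_features)
    landmarks = np.array(landmarks).reshape(1, -1)

    # Predict with the model (verbose=0 silences the per-call progress bar)
    prediction = model.predict(landmarks, verbose=0)
    predicted_class = np.argmax(prediction)

    # Map the class index to its label (LabelEncoder order: paper, rock, scissors)
    if predicted_class == 0:
        return "paper"
    elif predicted_class == 1:
        return "rock"
    else:
        return "scissors"

# Set up the camera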
cap = cv2.VideoCapture(0)

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Flip the image horizontally (mirror view)
    frame = cv2.flip(frame, 1)

    # Convert the image to RGB
    image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    # Detect hand landmarks
    results = hands.process(image)

    # Convert the image back to BGR
    frame = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

    # If landmarks were detected, classify the hand shape
    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            # Extract the landmark data
            landmarks = extract_landmark_data(hand_landmarks)

            # Classify the rock-paper-scissors hand
            hand_shape = predict_hand_shape(landmarks)

            # Draw the result
            cv2.putText(frame, f"Predicted: {hand_shape}", (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2, cv2.LINE_AA)

            # Draw the landmarks
            mp_drawing.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)

    # Show the image
    cv2.imshow('Janken Detection', frame)

    # Quit with the 'q' key
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Release resources
cap.release()
cv2.destroyAllWindows()
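
The if/elif chain in predict_hand_shape can also be written as a lookup into a list of class names. The sketch below shows that variant on plain probability lists so it runs without a model. The min_confidence threshold and the "unknown" label are assumptions added for illustration and are not part of the code above.

```python
# Sorted order matches what LabelEncoder produced during training.
CLASS_NAMES = ["paper", "rock", "scissors"]


def decode_prediction(probabilities, min_confidence=0.0):
    """Return (label, confidence); label is "unknown" below `min_confidence`."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    confidence = probabilities[best]
    if confidence < min_confidence:
        return "unknown", confidence
    return CLASS_NAMES[best], confidence


print(decode_prediction([0.1, 0.7, 0.2]))                        # ('rock', 0.7)
print(decode_prediction([0.4, 0.35, 0.25], min_confidence=0.6))  # ('unknown', 0.4)
```

In the real script, the probabilities would come from model.predict(...)[0], and a threshold like this is one possible way to avoid showing a label when the hand is in between gestures.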

Running this script classifies the hand shape in real time and draws the predicted label on the camera image.
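
Per-frame predictions can flicker when the hand is moving between gestures. One optional refinement, which is not part of the script above, is to take a majority vote over the last few frames. PredictionSmoother is a hypothetical helper sketched here under that assumption.

```python
from collections import Counter, deque


class PredictionSmoother:
    """Majority vote over the most recent `window` predictions."""

    def __init__(self, window=5):
        self.history = deque(maxlen=window)

    def update(self, label):
        self.history.append(label)
        # most_common(1) returns [(label, count)] for the most frequent label
        return Counter(self.history).most_common(1)[0][0]


smoother = PredictionSmoother(window=5)
frames = ["rock", "rock", "paper", "rock", "rock"]  # one flickering frame
smoothed = [smoother.update(label) for label in frames]
print(smoothed)
```

In the main loop this would wrap the call as smoother.update(predict_hand_shape(landmarks)). A larger window gives a steadier label at the cost of a slower reaction to real gesture changes.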

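Conclusion

In this article I walked through how to use MediaPipe to extract hand landmarks and build a system that classifies rock-paper-scissors hands (rock, scissors, paper). Starting from collecting the hand data, then building the machine learning model, and finally classifying hands in real time, I hope you were able to experience the whole flow.

What I felt while building this is that collecting the hand data is a very important part. Model performance depends heavily on the collected data, so how carefully you work at the collection stage seems to be the key. Also, once the real-time classification was working, it was a lot of fun to move my own hand and see the result appear on screen.

Ideas for next steps:

・This time I focused on rock-paper-scissors, but it would also be interesting to recognize other gestures and motions, or to try building a more accurate model.

・The model could also be used in games, educational apps, or gesture-controlled devices.

I used this hand classification model to build a rock-paper-scissors game as a web app! I hope you will try making your own original game or app too!

Feel free to give it a try.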
