IPFactory (Current Students) Advent Calendar 2024

Real-time HandLandmark detection with the new MediaPipe solution

Posted at 2024-12-17, last updated at 2024-12-17

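Introduction

I had played with real-time face detection in MediaPipe before, but the library has since moved to its new solutions and many things work differently. The official documentation only provides sample code for detection from images, and real-time detection did not go smoothly when I tried it. I got it to the point where it runs, so I am writing it up here, even though there is not much substance to it.

Here is the code we end up with.
If you just want to run it, copy and paste the code below, download the model file as described in the "Download the model" section that follows, and change 'hand_landmarker.task' on line 15 to the correct path of the downloaded file.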

import mediapipe as mp
import cv2 as cv
import time

BaseOptions = mp.tasks.BaseOptions
HandLandmarker = mp.tasks.vision.HandLandmarker
HandLandmarkerOptions = mp.tasks.vision.HandLandmarkerOptions
HandLandmarkerResult = mp.tasks.vision.HandLandmarkerResult
VisionRunningMode = mp.tasks.vision.RunningMode

def print_result(result: HandLandmarkerResult, output_image: mp.Image, timestamp_ms: int):
    print('hand landmarker result: {}'.format(result))

options = HandLandmarkerOptions(
    base_options=BaseOptions(model_asset_path='hand_landmarker.task'),
    running_mode=VisionRunningMode.LIVE_STREAM,
    result_callback=print_result)
    
with HandLandmarker.create_from_options(options) as landmarker:
    cap = cv.VideoCapture(0)
    
    if not cap.isOpened():
        print("Cannot open camera")
        exit()
        
    while True:
        ret, frame = cap.read()
    
        if not ret:
            print("Can't receive frame (stream end?). Exiting ...")
            break
            
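        gray = cv.cvtColor(frame, cv.COLOR_BGR2GRAY)

        # OpenCV frames are BGR; ImageFormat.SRGB expects RGB channel order.
        rgb = cv.cvtColor(frame, cv.COLOR_BGR2RGB)
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb)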
        frame_timestamp_ms = int(time.time() * 1000)
        landmarker.detect_async(mp_image, frame_timestamp_ms)

        cv.imshow('frame', gray)
        
        if cv.waitKey(1) == ord('q'):
            break
    
    cap.release()
    cv.destroyAllWindows()

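Hand landmarks detection guide for Python
This is the hand detection page of the official MediaPipe documentation. Let's start by looking at it.
If you have a Python environment, install mediapipe with the familiar pip (pip install mediapipe) and it becomes importable.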

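Download the model

With the previous solution I don't think a model file had to be downloaded in advance, but apparently you now need to keep a trained model locally.
Official documentation on the models
Click the "HandLandmarker (full)" link on that page to download the hand_landmarker.task file. This is the trained model, so save it somewhere easy to find.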

Create the task

The "Create the task" section of the official guide contains minimal sample code. Click the Live stream tab to see the sample for real-time detection.

Create the task

import mediapipe as mp

BaseOptions = mp.tasks.BaseOptions
HandLandmarker = mp.tasks.vision.HandLandmarker
HandLandmarkerOptions = mp.tasks.vision.HandLandmarkerOptions
HandLandmarkerResult = mp.tasks.vision.HandLandmarkerResult
VisionRunningMode = mp.tasks.vision.RunningMode

# Create a hand landmarker instance with the live stream mode:
def print_result(result: HandLandmarkerResult, output_image: mp.Image, timestamp_ms: int):
    print('hand landmarker result: {}'.format(result))

options = HandLandmarkerOptions(
    base_options=BaseOptions(model_asset_path='/path/to/model.task'),
    running_mode=VisionRunningMode.LIVE_STREAM,
    result_callback=print_result)
with HandLandmarker.create_from_options(options) as landmarker:
  # The landmarker is initialized. Use it here.
  # ...

I don't really understand it yet. But it comes from the official documentation, so for now let's paste it into a local .py file.

Create the task, line 10: def print_result(...

def print_result(result: HandLandmarkerResult, output_image: mp.Image, timestamp_ms: int):
    print('hand landmarker result: {}'.format(result))

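I looked into it as best I could: print_result is a callback function, and it appears to be required for real-time detection. In live stream mode the results are delivered asynchronously to this callback rather than returned from the detection call.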
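Printing from the callback is enough to confirm that detection works, but a callback can also hand the result over to the main loop. The following is only a minimal sketch, assuming the callback may be invoked from a different thread than the capture loop (hence the lock). LatestResult and its methods are hypothetical names, not part of MediaPipe, and the strings stand in for real HandLandmarkerResult objects.

```python
import threading


class LatestResult:
    """Thread-safe holder for the most recent detection result (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._result = None
        self._timestamp_ms = -1

    def update(self, result, output_image, timestamp_ms):
        # Same parameter order as the print_result callback in this article.
        with self._lock:
            # Keep only results that are newer than the one already held.
            if timestamp_ms > self._timestamp_ms:
                self._result = result
                self._timestamp_ms = timestamp_ms

    def get(self):
        # Return the latest result together with its timestamp.
        with self._lock:
            return self._result, self._timestamp_ms


latest = LatestResult()
# Stand-in calls; in the real program the library would invoke latest.update.
latest.update("result for frame A", None, 100)
latest.update("stale result", None, 50)
```

Under these assumptions, passing result_callback=latest.update in place of print_result would let the capture loop call latest.get() whenever it wants to use the newest landmarks.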

Create the task, line 13: options = ...

options = HandLandmarkerOptions(
    base_options=BaseOptions(model_asset_path='/path/to/model.task'),
    running_mode=VisionRunningMode.LIVE_STREAM,
    result_callback=print_result)

options holds a HandLandmarkerOptions instance.

In base_options=BaseOptions(model_asset_path='/path/to/model.task'), replace '/path/to/model.task' with the location of the hand_landmarker.task file you downloaded.
running_mode=VisionRunningMode.LIVE_STREAM selects live stream mode, which is the mode for real-time detection.
result_callback=print_result registers print_result as the callback function. If you gave your callback a different name, change it here as well: result_callback=<your callback name>.

Create the task, line 17: with HandLandmarker...

with HandLandmarker.create_from_options(options) as landmarker:

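HandLandmarker.create_from_options(options) creates a HandLandmarker object from options, which holds the model path and the callback function.
I was not familiar with the with-as syntax, so at first I had no idea what this did, but it apparently saves you from having to call close yourself. Because of as landmarker, the HandLandmarker object is bound to the name landmarker.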
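To make the with-as behavior concrete, here is a small sketch using a stand-in class instead of the real HandLandmarker. FakeLandmarker is a hypothetical name used only for illustration; it shows how an object that defines __enter__ and __exit__ gets closed automatically when the with block ends.

```python
class FakeLandmarker:
    """Stand-in for an object that must be closed after use (hypothetical)."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        # The value returned here is what gets bound by "as landmarker".
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Called when the with block ends, even if an exception was raised.
        self.close()
        return False  # do not swallow exceptions


with FakeLandmarker() as landmarker:
    inside = landmarker.closed  # still open inside the block

after = landmarker.closed  # closed once the block has ended
```

The name landmarker remains bound after the block, but the object behind it has already been closed, which is why all detection calls have to stay inside the with block.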

The comment that follows says, in effect, "the landmarker is initialized, go ahead and use it ^^", so we will add our processing inside the indented with block.

Prepare data

import mediapipe as mp

# Use OpenCV’s VideoCapture to start capturing from the webcam.

# Create a loop to read the latest frame from the camera using VideoCapture#read()

# Convert the frame received from OpenCV to a MediaPipe’s Image object.
mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=numpy_frame_from_opencv)

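The processing we need to write is noted in the comments.

Since we are adding processing inside the with block, let's append this to the earlier code to build a skeleton. The import statement is a duplicate, so remove it.

Work in progress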

import mediapipe as mp

BaseOptions = mp.tasks.BaseOptions
HandLandmarker = mp.tasks.vision.HandLandmarker
HandLandmarkerOptions = mp.tasks.vision.HandLandmarkerOptions
HandLandmarkerResult = mp.tasks.vision.HandLandmarkerResult
VisionRunningMode = mp.tasks.vision.RunningMode

# Create a hand landmarker instance with the live stream mode:
def print_result(result: HandLandmarkerResult, output_image: mp.Image, timestamp_ms: int):
    print('hand landmarker result: {}'.format(result))

options = HandLandmarkerOptions(
    base_options=BaseOptions(model_asset_path='/path/to/model.task'),
    running_mode=VisionRunningMode.LIVE_STREAM,
    result_callback=print_result)
with HandLandmarker.create_from_options(options) as landmarker:
    # The landmarker is initialized. Use it here.
    # ...
    
    ### added ↓ ###
    # Use OpenCV’s VideoCapture to start capturing from the webcam.
    
    # Create a loop to read the latest frame from the camera using VideoCapture#read()
    
    # Convert the frame received from OpenCV to a MediaPipe’s Image object.
    mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=numpy_frame_from_opencv)
    ### up to here ↑ ###

Prepare data, line 3: # Use OpenCV’s VideoCapture...

# Use OpenCV’s VideoCapture to start capturing from the webcam.

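The comment tells us to use OpenCV's VideoCapture to start capturing from the webcam.
The next comment line likewise tells us to create a loop that reads the latest frame from the camera with VideoCapture's read(). OpenCV has an official VideoCapture tutorial covering both, so let's copy and paste from there.
Capture Video from Camera tutorial

Capture Video from Camera tutorial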

import numpy as np
import cv2 as cv

cap = cv.VideoCapture(0)
if not cap.isOpened():
    print("Cannot open camera")
    exit()
while True:
    # Capture frame-by-frame
    ret, frame = cap.read()

    # if frame is read correctly ret is True
    if not ret:
        print("Can't receive frame (stream end?). Exiting ...")
        break
    # Our operations on the frame come here
    gray = cv.cvtColor(frame, cv.COLOR_BGR2GRAY)
    # Display the resulting frame
    cv.imshow('frame', gray)
    if cv.waitKey(1) == ord('q'):
        break

# When everything done, release the capture
cap.release()
cv.destroyAllWindows()

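This is simple code that grabs images from the webcam, displays each frame in grayscale in a window, and exits when the q key is pressed.

Now add the code from the Capture Video from Camera tutorial to the skeleton we built.

Work in progress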

import mediapipe as mp
### added ↓ ###
import numpy as np
import cv2 as cv
### up to here ↑ ###

BaseOptions = mp.tasks.BaseOptions
HandLandmarker = mp.tasks.vision.HandLandmarker
HandLandmarkerOptions = mp.tasks.vision.HandLandmarkerOptions
HandLandmarkerResult = mp.tasks.vision.HandLandmarkerResult
VisionRunningMode = mp.tasks.vision.RunningMode

# Create a hand landmarker instance with the live stream mode:
def print_result(result: HandLandmarkerResult, output_image: mp.Image, timestamp_ms: int):
    print('hand landmarker result: {}'.format(result))

options = HandLandmarkerOptions(
    base_options=BaseOptions(model_asset_path='/path/to/model.task'),
    running_mode=VisionRunningMode.LIVE_STREAM,
    result_callback=print_result)
with HandLandmarker.create_from_options(options) as landmarker:
    # The landmarker is initialized. Use it here.
    # ...

    # Use OpenCV’s VideoCapture to start capturing from the webcam.
    
    # Create a loop to read the latest frame from the camera using VideoCapture#read()
    ### added ↓ ###
    cap = cv.VideoCapture(0)
    if not cap.isOpened():
        print("Cannot open camera")
        exit()
    while True:
        # Capture frame-by-frame
        ret, frame = cap.read()
    
        # if frame is read correctly ret is True
        if not ret:
            print("Can't receive frame (stream end?). Exiting ...")
            break
        # Our operations on the frame come here
        gray = cv.cvtColor(frame, cv.COLOR_BGR2GRAY)
        # Display the resulting frame
        cv.imshow('frame', gray)
        if cv.waitKey(1) == ord('q'):
            break
    
    # When everything done, release the capture
    cap.release()
    cv.destroyAllWindows()
    ### up to here ↑ ###
    # Convert the frame received from OpenCV to a MediaPipe’s Image object.
    mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=numpy_frame_from_opencv)

I simply pasted it as-is below the comment from "Prepare data, line 3: # Use OpenCV’s VideoCapture...".

Prepare data, line 7: mp_image =...

mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=numpy_frame_from_opencv)

This line converts the frame received from OpenCV into a MediaPipe Image object. The image data has to be converted into this form before MediaPipe can process it. Since the conversion applies to each frame, the line belongs inside the loop that reads frames, so we move it there.

Work in progress

import mediapipe as mp
import numpy as np
import cv2 as cv

BaseOptions = mp.tasks.BaseOptions
HandLandmarker = mp.tasks.vision.HandLandmarker
HandLandmarkerOptions = mp.tasks.vision.HandLandmarkerOptions
HandLandmarkerResult = mp.tasks.vision.HandLandmarkerResult
VisionRunningMode = mp.tasks.vision.RunningMode

# Create a hand landmarker instance with the live stream mode:
def print_result(result: HandLandmarkerResult, output_image: mp.Image, timestamp_ms: int):
    print('hand landmarker result: {}'.format(result))

options = HandLandmarkerOptions(
    base_options=BaseOptions(model_asset_path='/path/to/model.task'),
    running_mode=VisionRunningMode.LIVE_STREAM,
    result_callback=print_result)
with HandLandmarker.create_from_options(options) as landmarker:
    # The landmarker is initialized. Use it here.
    # ...

    # Use OpenCV’s VideoCapture to start capturing from the webcam.
    
    # Create a loop to read the latest frame from the camera using VideoCapture#read()
    cap = cv.VideoCapture(0)
    if not cap.isOpened():
        print("Cannot open camera")
        exit()
    while True:
        # Capture frame-by-frame
        ret, frame = cap.read()
    
        # if frame is read correctly ret is True
        if not ret:
            print("Can't receive frame (stream end?). Exiting ...")
            break
        # Our operations on the frame come here
        gray = cv.cvtColor(frame, cv.COLOR_BGR2GRAY)
        ### added ↓ ###
        # Convert the frame received from OpenCV to a MediaPipe’s Image object.
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=numpy_frame_from_opencv)
        ### up to here ↑ ###

        # Display the resulting frame
        cv.imshow('frame', gray)
        if cv.waitKey(1) == ord('q'):
            break
    
    # When everything done, release the capture
    cap.release()
    cv.destroyAllWindows()

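The line mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=numpy_frame_from_opencv) at the bottom of the work-in-progress file has been moved, together with the comment above it, into the processing inside the loop.

Next, change the value passed as the argument. The code says data=numpy_frame_from_opencv, but the data argument needs the actual image data, so change it to frame, the variable that holds each frame image. Strictly speaking, OpenCV frames are in BGR order while ImageFormat.SRGB expects RGB, so it is safer to convert with cv.cvtColor(frame, cv.COLOR_BGR2RGB) before passing the frame; this step passes frame as-is to keep the change small.

Work in progress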

import mediapipe as mp
import numpy as np
import cv2 as cv

BaseOptions = mp.tasks.BaseOptions
HandLandmarker = mp.tasks.vision.HandLandmarker
HandLandmarkerOptions = mp.tasks.vision.HandLandmarkerOptions
HandLandmarkerResult = mp.tasks.vision.HandLandmarkerResult
VisionRunningMode = mp.tasks.vision.RunningMode

# Create a hand landmarker instance with the live stream mode:
def print_result(result: HandLandmarkerResult, output_image: mp.Image, timestamp_ms: int):
    print('hand landmarker result: {}'.format(result))

options = HandLandmarkerOptions(
    base_options=BaseOptions(model_asset_path='/path/to/model.task'),
    running_mode=VisionRunningMode.LIVE_STREAM,
    result_callback=print_result)
with HandLandmarker.create_from_options(options) as landmarker:
    # The landmarker is initialized. Use it here.
    # ...

    # Use OpenCV’s VideoCapture to start capturing from the webcam.
    
    # Create a loop to read the latest frame from the camera using VideoCapture#read()
    cap = cv.VideoCapture(0)
    if not cap.isOpened():
        print("Cannot open camera")
        exit()
    while True:
        # Capture frame-by-frame
        ret, frame = cap.read()
    
        # if frame is read correctly ret is True
        if not ret:
            print("Can't receive frame (stream end?). Exiting ...")
            break
        # Our operations on the frame come here
        gray = cv.cvtColor(frame, cv.COLOR_BGR2GRAY)
        # Convert the frame received from OpenCV to a MediaPipe’s Image object.
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame) ### ← changed

        # Display the resulting frame
        cv.imshow('frame', gray)
        if cv.waitKey(1) == ord('q'):
            break
    
    # When everything done, release the capture
    cap.release()
    cv.destroyAllWindows()

Run the task

# Send live image data to perform hand landmarks detection.
# The results are accessible via the `result_callback` provided in
# the `HandLandmarkerOptions` object.
# The hand landmarker must be created with the live stream mode.
landmarker.detect_async(mp_image, frame_timestamp_ms)

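The comments make three points:
live image data is sent in order to perform hand landmark detection;
the results are accessible via the result_callback provided in the HandLandmarkerOptions object;
the hand landmarker must be created in live stream mode.

Sending the live image data refers to landmarker.detect_async(mp_image, frame_timestamp_ms) on line 5 of "Run the task", so we deal with it next.

Accessing the results via result_callback is what we set up near the beginning in "Create the task, line 10: def print_result(...", where the callback function was defined, so that is fine for now.

Creating the hand landmarker in live stream mode was already done in "Create the task, line 13: options = ...", so that is fine too.

Run the task, line 5: landmarker.detect_async...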

landmarker.detect_async(mp_image, frame_timestamp_ms)

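As for where to add this call: to give the conclusion first, it goes right after the mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame) line from "Prepare data, line 7: mp_image =..." that we just moved into the loop.
MediaPipe provides a detailed code example.
Code example
However, it has no example for live stream mode (which is why I am writing this article).
Still, it shows the processing flow in MediaPipe, so we use it as a reference to check the steps for running detection.

The first part of the example is about visualization, so we skip it here. The important part is the last block.

Code example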

# STEP 1: Import the necessary modules.
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# STEP 2: Create an HandLandmarker object.
base_options = python.BaseOptions(model_asset_path='hand_landmarker.task')
options = vision.HandLandmarkerOptions(base_options=base_options,
                                       num_hands=2)
detector = vision.HandLandmarker.create_from_options(options)

# STEP 3: Load the input image.
image = mp.Image.create_from_file("image.jpg")

# STEP 4: Detect hand landmarks from the input image.
detection_result = detector.detect(image)

# STEP 5: Process the classification result. In this case, visualize it.
annotated_image = draw_landmarks_on_image(image.numpy_view(), detection_result)
cv2_imshow(cv2.cvtColor(annotated_image, cv2.COLOR_RGB2BGR))

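We can see code similar to what we have written so far.

STEP 1: import the required modules
STEP 2: set the options and create a HandLandmarker object
STEP 3: convert the image into a form MediaPipe can process
STEP 4: detect hands in the converted image
STEP 5: process the result, in this case by visualizing it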

detection_result = detector.detect(image) in STEP 4 of the code example and landmarker.detect_async(mp_image, frame_timestamp_ms) from "Run the task, line 5: landmarker.detect_async..." sit at the same stage of the processing flow.
The STEP 4 line detects hands in the image that was converted for MediaPipe and stores the result in detection_result.

The detector in detection_result = detector.detect(image) is, as STEP 2 of the code example shows, a HandLandmarker object created with create_from_options(options).
As described in "Create the task, line 17: with HandLandmarker...", landmarker is also bound to a HandLandmarker object, so detector and landmarker refer to the same kind of object.

The .detect(image) part of STEP 4 is what detects hands in the image converted for MediaPipe.

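The "Run the task" section of the guide says the following.

The Hand Landmarker uses the detect, detect_for_video and detect_async functions to trigger inferences.

In other words, .detect is used for detection in image mode, .detect_for_video in video mode, and .detect_async in live stream mode.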
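The correspondence between running mode and detection method can be written down as a small lookup. This is only an illustrative sketch: METHOD_FOR_MODE, run_detection and RecordingLandmarker are hypothetical names, and the stand-in class merely records which method was called, using the same method names and argument order as the guide.

```python
# Running mode name -> name of the method that triggers inference in that mode.
METHOD_FOR_MODE = {
    "IMAGE": "detect",
    "VIDEO": "detect_for_video",
    "LIVE_STREAM": "detect_async",
}


def run_detection(landmarker, mode, image, timestamp_ms=None):
    """Call the method that matches the running mode (hypothetical helper)."""
    method = getattr(landmarker, METHOD_FOR_MODE[mode])
    if mode == "IMAGE":
        return method(image)
    # Video and live stream modes also need the frame timestamp.
    if timestamp_ms is None:
        raise ValueError("timestamp_ms is required in VIDEO and LIVE_STREAM modes")
    return method(image, timestamp_ms)


class RecordingLandmarker:
    """Stand-in that records calls instead of running a model."""

    def __init__(self):
        self.calls = []

    def detect(self, image):
        self.calls.append(("detect", image))
        return "result"

    def detect_for_video(self, image, timestamp_ms):
        self.calls.append(("detect_for_video", image, timestamp_ms))
        return "result"

    def detect_async(self, image, timestamp_ms):
        # In live stream mode nothing is returned; the callback gets the result.
        self.calls.append(("detect_async", image, timestamp_ms))
        return None


fake = RecordingLandmarker()
live_return = run_detection(fake, "LIVE_STREAM", "frame", 33)
```

In this article only the LIVE_STREAM row matters, which is why the loop calls detect_async and reads the result in the callback.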

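This shows that detector.detect(image) in STEP 4 of the code example and landmarker.detect_async(mp_image, frame_timestamp_ms) from "Run the task, line 5: landmarker.detect_async..." do the same job. The difference is that detect_async returns nothing and delivers its result to the callback. The difference in arguments is by design; the "Run the task" section says the following.

Note the following:
When running in video mode or live stream mode, you must also provide the Hand Landmarker task with the timestamp of the input frame.

So in landmarker.detect_async(mp_image, frame_timestamp_ms), mp_image is the image converted for MediaPipe and frame_timestamp_ms is the timestamp.

The image is first converted into a form MediaPipe can process, and hands are then detected in the converted image. That is why, as stated at the start, the call is added after the mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame) line from "Prepare data, line 7: mp_image =...".
We also need to create the timestamp frame_timestamp_ms. It is passed as an argument, so write it first. For that we import time.

Work in progress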

import mediapipe as mp
import numpy as np
import cv2 as cv
import time ### ← added

BaseOptions = mp.tasks.BaseOptions
HandLandmarker = mp.tasks.vision.HandLandmarker
HandLandmarkerOptions = mp.tasks.vision.HandLandmarkerOptions
HandLandmarkerResult = mp.tasks.vision.HandLandmarkerResult
VisionRunningMode = mp.tasks.vision.RunningMode

# Create a hand landmarker instance with the live stream mode:
def print_result(result: HandLandmarkerResult, output_image: mp.Image, timestamp_ms: int):
    print('hand landmarker result: {}'.format(result))

options = HandLandmarkerOptions(
    base_options=BaseOptions(model_asset_path='/path/to/model.task'),
    running_mode=VisionRunningMode.LIVE_STREAM,
    result_callback=print_result)
with HandLandmarker.create_from_options(options) as landmarker:
    # The landmarker is initialized. Use it here.
    # ...

    # Use OpenCV’s VideoCapture to start capturing from the webcam.
    
    # Create a loop to read the latest frame from the camera using VideoCapture#read()
    cap = cv.VideoCapture(0)
    if not cap.isOpened():
        print("Cannot open camera")
        exit()
    while True:
        # Capture frame-by-frame
        ret, frame = cap.read()
    
        # if frame is read correctly ret is True
        if not ret:
            print("Can't receive frame (stream end?). Exiting ...")
            break
        # Our operations on the frame come here
        gray = cv.cvtColor(frame, cv.COLOR_BGR2GRAY)
        # Convert the frame received from OpenCV to a MediaPipe’s Image object.
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame)
        ### added ↓ ###
        frame_timestamp_ms = int(time.time() * 1000)
        landmarker.detect_async(mp_image, frame_timestamp_ms)
        ### up to here ↑ ###

        # Display the resulting frame
        cv.imshow('frame', gray)
        if cv.waitKey(1) == ord('q'):
            break
    
    # When everything done, release the capture
    cap.release()
    cv.destroyAllWindows()
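One caveat about the timestamp: live stream mode expects the timestamps to keep increasing, and time.time() follows the wall clock, which can jump if the system clock is adjusted. A possible alternative, shown here only as a sketch, is to derive the value from time.monotonic(). FrameClock is a hypothetical helper name, not something provided by MediaPipe.

```python
import time


class FrameClock:
    """Produces strictly increasing millisecond timestamps (hypothetical helper)."""

    def __init__(self):
        self._start = time.monotonic()
        self._last_ms = -1

    def next_ms(self):
        # Milliseconds elapsed since the clock was created.
        elapsed_ms = int((time.monotonic() - self._start) * 1000)
        # Two frames inside the same millisecond would repeat a value,
        # so bump the timestamp by one in that case.
        if elapsed_ms <= self._last_ms:
            elapsed_ms = self._last_ms + 1
        self._last_ms = elapsed_ms
        return elapsed_ms


clock = FrameClock()
stamps = [clock.next_ms() for _ in range(5)]
```

With this, frame_timestamp_ms = clock.next_ms() could stand in for int(time.time() * 1000) in the loop, assuming the clock is created once before the loop starts.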

Finishing up

Here is the version tidied up by removing unnecessary comments and the like.

import mediapipe as mp
import cv2 as cv
import time

BaseOptions = mp.tasks.BaseOptions
HandLandmarker = mp.tasks.vision.HandLandmarker
HandLandmarkerOptions = mp.tasks.vision.HandLandmarkerOptions
HandLandmarkerResult = mp.tasks.vision.HandLandmarkerResult
VisionRunningMode = mp.tasks.vision.RunningMode

def print_result(result: HandLandmarkerResult, output_image: mp.Image, timestamp_ms: int):
    print('hand landmarker result: {}'.format(result))

options = HandLandmarkerOptions(
    base_options=BaseOptions(model_asset_path='hand_landmarker.task'),
    running_mode=VisionRunningMode.LIVE_STREAM,
    result_callback=print_result)
    
with HandLandmarker.create_from_options(options) as landmarker:
    cap = cv.VideoCapture(0)
    
    if not cap.isOpened():
        print("Cannot open camera")
        exit()
        
    while True:
        ret, frame = cap.read()
    
        if not ret:
            print("Can't receive frame (stream end?). Exiting ...")
            break
            
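        gray = cv.cvtColor(frame, cv.COLOR_BGR2GRAY)

        # OpenCV frames are BGR; ImageFormat.SRGB expects RGB channel order.
        rgb = cv.cvtColor(frame, cv.COLOR_BGR2RGB)
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb)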
        frame_timestamp_ms = int(time.time() * 1000)
        landmarker.detect_async(mp_image, frame_timestamp_ms)

        cv.imshow('frame', gray)
        
        if cv.waitKey(1) == ord('q'):
            break
    
    cap.release()
    cv.destroyAllWindows()

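numpy was imported but never used, so it is safe to remove.

When you run the code, the camera image is shown in grayscale in a window. The terminal fills with lines like hand landmarker result: HandLandmarkerResult(handedness=[], hand_landmarks=[], hand_world_landmarks=[]). When a hand comes into view of the camera, a list of coordinates is printed instead.

Once coordinates are printed, a hand has been detected, so we are done!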
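As a next step, the printed coordinates can be turned into pixel positions. In the result, hand_landmarks holds one list of 21 landmarks per detected hand, and each landmark's x and y are normalized by the image width and height. The sketch below uses a stand-in Landmark dataclass instead of a real result object; Landmark and to_pixel are hypothetical names, and the frame size of 640x480 is an assumption for illustration.

```python
from dataclasses import dataclass


@dataclass
class Landmark:
    """Stand-in for one normalized landmark (hypothetical, for illustration)."""
    x: float
    y: float
    z: float


INDEX_FINGER_TIP = 8  # position of the index fingertip among the 21 hand landmarks


def to_pixel(landmark, width, height):
    """Convert normalized x/y to pixel coordinates, clamped to the image."""
    px = max(0, min(int(landmark.x * width), width - 1))
    py = max(0, min(int(landmark.y * height), height - 1))
    return px, py


# Build a fake hand of 21 landmarks and place the index fingertip somewhere.
hand = [Landmark(0.0, 0.0, 0.0) for _ in range(21)]
hand[INDEX_FINGER_TIP] = Landmark(0.5, 0.25, -0.02)

tip = to_pixel(hand[INDEX_FINGER_TIP], 640, 480)
```

The clamping is there because a landmark can fall slightly outside the 0 to 1 range when part of the hand is off screen.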

This article is the day 17 entry of the IPFactory Advent Calendar 2024.
