A multimodal embedding game that turns video into words in real time and shows them on screen.

Last updated at 2024-10-07. Posted at 2024-10-07.

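Short story: "The Tokyo Programmer and the Words in the Video"

In a downtown Tokyo district, on a night street glittering with neon, the protagonist Shota sat alone in a quiet cafe. Lately he had been absorbed in the concept of "multimodal embedding".

"Turning video into words sounds really interesting," Shota murmured to himself. Ideas for analyzing video and converting the information it yields into words were racing through his head. He believed this technology could put ordinary everyday moments into words and create stories that anyone could relate to.

A webcam sat in front of him, capturing his movements and the surrounding scene in real time. He grabbed the video data and converted it into numbers through a process called "embedding". The colors and motion in the video were turned into words by his program.

"This is it! I can capture this moment!" Shota said excitedly. He typed a sample English text into the text box and pressed the embedding button. The program analyzed the text and converted it into a numeric vector. He then converted the data from the surrounding video into numbers in the same way, and by comparing these vectors he was able to find the "words" hidden in the video.

As time passed, Shota felt his work moving forward. The feature that finds the closest word for the video was implemented, and the various things happening in the cafe were converted into words by his program one after another. Lively conversations, the sound of coffee being sipped, and laughter from the next table spread out before him like living data.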

When you paste in the sample text, vocabulary vectors are generated and the video can then be converted into words.
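
One way the vocabulary step could be prepared is to split the pasted sample text into unique words before embedding them. The sketch below is an assumption about how such a word list might be built, not code from this article; `buildVocabulary` is a hypothetical helper name. Each resulting word could then be passed to the Universal Sentence Encoder's `model.embed` to obtain one vector per word.

```javascript
// Hypothetical helper (not part of the article's code): extract a list of
// unique, lowercased words from the pasted sample text.
function buildVocabulary(text) {
    // Match runs of ASCII letters, allowing inner apostrophes (e.g. "kenichi's").
    const tokens = text.toLowerCase().match(/[a-z]+(?:'[a-z]+)*/g) || [];
    // A Set drops repeats while preserving first-seen order.
    return [...new Set(tokens)];
}

const vocabulary = buildVocabulary("Kenichi used NumPy. Next, Kenichi tried CuPy.");
console.log(vocabulary);
```

Lowercasing and deduplicating keeps the list short, which matters because every word costs one embedding call or one row in a batched call.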

In the demo, a smartphone playing YouTube is held up to the camera.

Paste the code into a text editor such as Notepad and save it with the file name "index.html". Then open the saved file in a browser to run the code.

Sample text.

Short Story: "A Tokyo Programmer and His Adventures with High-Dimensional Neural Network Tensors"

In a busy corner of Shinjuku, in the heart of Tokyo, lived a futuristic programmer named Kenichi Sato. Kenichi was fascinated by cutting-edge technology and wrote code every day. His latest project was to develop an algorithm to efficiently process huge tensor calculations.

One evening, Kenichi was looking at the night view from the window of a high-rise building and was thinking deeply about a problem he was challenging himself with. He was trying to compare the calculation of complex tensors on a CPU and a GPU and measure their performance. The core of the project was to measure the time it takes to process data with different tensor sizes and find the most efficient approach.

Kenichi first performed tensor calculations using the classic method, the CPU. He used NumPy to perform a simple process of stacking tensors and saving the calculation results. This method certainly worked, but he noticed that the calculation time was long.

Next, Kenichi tried to use CuPy to perform calculations on a GPU. CuPy was able to significantly speed up calculations by leveraging the parallel processing power of the GPU. He adopted an approach of creating a mesh grid of tensors and calculating all elements simultaneously. This reduced the processing time and he was very pleased with the results.

Finally, Kenichi wrote his own kernel code using PyCUDA to run the calculations on the GPU. By utilizing the powerful features of PyCUDA, tensor calculations became even more efficient and allowed precise control. His code was designed to maximize memory management and calculation performance.

After a few weeks, Kenichi plotted the measured processing time on a graph. The graph showed the processing time of the CPU, CuPy, and PyCUDA as the tensor size increased. The results were clear. CuPy and PyCUDA performed well even with larger tensor sizes, but CuPy was found to be the most efficient.

Kenichi's confidence in the knowledge and experience he gained through this project increased his enthusiasm for a new project. Looking out at the night view of Tokyo, he was thinking about his next challenge. His mind was filled with anticipation and excitement at how technology would evolve in the future.

And so, Kenichi took one step at a time towards his new computing adventure.

The code for the multimodal embedding game that turns video into words in real time and shows them on screen:


<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Real-Time Word Generation</title>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/universal-sentence-encoder"></script>
    <style>
        /* Styles for the camera canvas */
        #cameraCanvas {
            width: 320px;
            height: 240px;
            border: 1px solid black;
        }
        /* Styles for the word display */
        #wordDisplay {
            font-size: 24px;
            color: black;
            margin-top: 10px;
        }
    </style>
</head>
<body>
    <h1>Real-Time Word Generation</h1>

    <!-- Sample text input box and button -->
    <input type="text" id="sampleText" placeholder="Enter sample text">
    <button onclick="embedText()">Embed</button>

    <p id="embeddingStatus">Waiting for embedding...</p>
    <p id="wordDisplay"></p>

    <!-- Video element that shows the webcam feed -->
    <video id="webcam" autoplay playsinline width="320" height="240"></video>
    <!-- The width/height attributes make the drawing buffer match the CSS size (the default is 300x150) -->
    <canvas id="cameraCanvas" width="320" height="240"></canvas>
    <button onclick="startWebcam()">Start webcam</button>

    <script>
        let model; // holds the loaded model
        let textEmbedding; // holds the embedding of the input text
        let wordDictionary = []; // holds the word dictionary
        let webcamVideo = document.getElementById('webcam'); // webcam video element
        let cameraCanvas = document.getElementById('cameraCanvas'); // camera canvas
        let cameraContext = cameraCanvas.getContext('2d'); // 2D drawing context of the canvas

        // Load the Universal Sentence Encoder model
        async function loadModel() {
            model = await use.load(); // load the model asynchronously
            document.getElementById('embeddingStatus').innerText = "Model loaded.";
        }

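        // Embed the input text
        async function embedText() {
            const status = document.getElementById('embeddingStatus');
            if (!model) { // the button can be clicked before use.load() has resolved
                status.innerText = "The model is still loading. Please try again shortly.";
                return;
            }
            const sampleText = document.getElementById('sampleText').value; // read the input text
            const embeddings = await model.embed([sampleText]); // compute the text embedding
            textEmbedding = embeddings.arraySync()[0]; // copy the embedding into a plain array
            embeddings.dispose(); // free the tensor now that the values are copied
            status.innerText = "Text embedded.";
            generateWordDictionary(textEmbedding); // build the word dictionary from the embedding
        }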

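        // Build a simple word dictionary from the embedding
        function generateWordDictionary(embedding) {
            // Placeholder dictionary (kept simple): every entry is the same embedding scaled by a constant.
            // Cosine similarity ignores scale, so these entries tie up to rounding error.
            wordDictionary = [
                { word: 'hello', vector: embedding.map(v => v * 0.9) }, // approximate vector
                { word: 'world', vector: embedding.map(v => v * 1.1) },
                { word: 'tensorflow', vector: embedding.map(v => v * 1.2) }
            ];
            console.log("Word dictionary generated.");
        }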

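        let captureTimer = null; // interval handle, so repeated clicks do not stack timers

        // Start the webcam
        function startWebcam() {
            navigator.mediaDevices.getUserMedia({ video: true }).then(stream => {
                webcamVideo.srcObject = stream; // attach the webcam stream to the video element
                // Capture a frame once per second and look up a word
                if (captureTimer === null) {
                    captureTimer = setInterval(findClosestWord, 1000);
                }
            }).catch(err => {
                // Permission was denied or no camera is available
                document.getElementById('embeddingStatus').innerText = "Could not start the webcam: " + err.message;
            });
        }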

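        // Find the word closest to the current webcam frame
        function findClosestWord() {
            if (wordDictionary.length === 0) { // nothing to compare against until a text has been embedded
                document.getElementById('wordDisplay').innerText = "Embed a sample text first.";
                return;
            }
            // Capture a frame from the camera
            cameraContext.drawImage(webcamVideo, 0, 0, cameraCanvas.width, cameraCanvas.height);
            let imageData = cameraContext.getImageData(0, 0, cameraCanvas.width, cameraCanvas.height); // read the pixel data
            let grayscaleVector = convertToGrayscaleVector(imageData); // convert to a grayscale vector
            let resizedVector = resizeVector(grayscaleVector, 512); // match the embedding dimension

            // Compute cosine similarity and look for the closest word
            let closestWord = findMostSimilarWord(resizedVector);
            document.getElementById('wordDisplay').innerText = `Closest word: ${closestWord}`; // show the word
        }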

        // Convert image data to a grayscale vector
        function convertToGrayscaleVector(imageData) {
            let grayscaleVector = [];
            for (let i = 0; i < imageData.data.length; i += 4) {
                // Convert each RGBA pixel to grayscale by averaging R, G, and B
                let grayscale = (imageData.data[i] + imageData.data[i + 1] + imageData.data[i + 2]) / 3;
                grayscaleVector.push(grayscale / 255); // normalize to [0, 1]
            }
            return grayscaleVector; // return the grayscale vector
        }

        // Resize a vector to the target dimension
        function resizeVector(vector, targetDim) {
            if (vector.length > targetDim) {
                return vector.slice(0, targetDim); // too long: keep only the first targetDim values
            } else if (vector.length < targetDim) {
                return [...vector, ...Array(targetDim - vector.length).fill(0)]; // too short: pad with zeros
            }
            return vector; // already the right size
        }

        // Compute cosine similarity
        function cosineSimilarity(vectorA, vectorB) {
            return tf.tidy(() => {
                const a = tf.tensor1d(vectorA);
                const b = tf.tensor1d(vectorB);
                const dotProduct = tf.dot(a, b).arraySync(); // dot product
                const magnitudeA = tf.norm(a).arraySync(); // magnitude of vector A
                const magnitudeB = tf.norm(b).arraySync(); // magnitude of vector B
                const denominator = magnitudeA * magnitudeB;
                // An all-black frame gives a zero vector; return 0 instead of NaN
                return denominator === 0 ? 0 : dotProduct / denominator;
            });
        }

        // Find the most similar word
        function findMostSimilarWord(grayscaleVector) {
            let maxSimilarity = -Infinity; // best cosine similarity seen so far
            let closestWord = ''; // closest word seen so far

            // Loop over the word dictionary and compute each similarity
            wordDictionary.forEach(entry => {
                let similarity = cosineSimilarity(grayscaleVector, entry.vector); // compute the similarity
                if (similarity > maxSimilarity) {
                    maxSimilarity = similarity; // update the best similarity
                    closestWord = entry.word; // update the closest word
                }
            });

            return closestWord; // return the closest word
        }

        // Load the model when the page loads
        loadModel();
    </script>
</body>
</html>
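
A note on the placeholder dictionary in the listing: cosine similarity depends only on a vector's direction, so multiplying one embedding by 0.9, 1.1, and 1.2 produces entries that score the same against any camera frame. The following is a minimal plain-JavaScript sketch of that property, written without TensorFlow.js. The `cosine` function and the sample vectors are illustrative assumptions, not values from the article.

```javascript
// Plain-JavaScript cosine similarity (illustrative; the page itself uses tf.dot and tf.norm).
function cosine(a, b) {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    const denom = Math.sqrt(normA) * Math.sqrt(normB);
    return denom === 0 ? 0 : dot / denom; // treat zero vectors as "no similarity"
}

const frame = [0.2, 0.7, 0.1, 0.9];      // made-up stand-in for a grayscale frame vector
const embedding = [0.5, -0.3, 0.8, 0.1]; // made-up stand-in for a text embedding

// Scaled copies of one embedding, as in the placeholder dictionary.
const scores = [0.9, 1.1, 1.2].map(k => cosine(frame, embedding.map(v => v * k)));
console.log(scores); // the three values agree up to floating-point rounding
```

For the game to tell words apart, each dictionary entry would need its own direction, for example by embedding every vocabulary word separately.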
