Neural Group Advent Calendar 2025

Running Whisper Entirely in the Browser: Implementing a Serverless, High-Accuracy ASR Demo

Posted at 2025-12-15

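1. Background

Demand for automatic speech recognition (ASR) is growing across applications such as meeting minutes, subtitle generation, and voice commands. Whisper, released by OpenAI in 2022, is open source yet achieves accuracy comparable to commercial services, and it has done a great deal to make speech recognition widely accessible.

However, when you try to build Whisper into an actual product, you run into the following issues.

API cost: The OpenAI Whisper API is billed by usage, so processing large volumes of audio gets expensive
Privacy: Sending highly confidential meeting audio to an external server raises concerns
Latency: The delay from network round trips is fatal for real-time use cases

One approach to these issues that has been drawing attention recently is local inference in the browser. Thanks to advances in WebAssembly (Wasm) and WebGPU, AI models that once could only run on a server can increasingly run in the browser on the user's own device.

This article explains how to use this "edge AI" approach to run Whisper entirely inside the browser.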

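2. TL;DR

We implemented a demo that runs OpenAI Whisper directly in the browser (Wasm), without going through a Python server.
We use Transformers.js as the library, so everything from loading the ONNX model to inference is done in JavaScript alone.
The article walks through a concrete "edge AI" implementation pattern that helps with privacy and server cost reduction.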

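3. Prerequisites and environment

The implementation in this article was verified in the environment below. No build tool is used; everything is loaded from a CDN to keep the demo portable.

Item	Details	Notes
Browser	Chrome / Edge / Safari	Latest version with WebAssembly support recommended
Library	Transformers.js (v2.13.0)	Official JS library from Hugging Face
Model	Xenova/whisper-tiny.en	Quantized, in ONNX format (about 40MB)
Backend	ONNX Runtime Web	Runs on the Wasm backend

When testing locally, always serve the page from localhost or over https, because of browser security restrictions (microphone access and so on). The VS Code Live Server extension is handy for this.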

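What is Transformers.js?

Normally, running a machine learning model on the web means either using TensorFlow.js or calling the low-level ONNX Runtime Web API directly. Either way, implementing the preprocessing (tokenization, audio feature extraction) yourself is very costly.

Transformers.js offers almost the same API design as the Python transformers library and runs all of this tedious pre- and post-processing in JavaScript, which makes it a game-changing library.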

4. Architecture

Traditionally, audio data was sent to an API server. Here, the entire processing flow stays inside the browser.

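5. Implementation highlights

The implementation is very simple and fits in a single HTML file. The main code blocks are explained below.

A. Importing and configuring the library

The pipeline API hides the complexity of managing ONNX sessions.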

import { pipeline, env } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.13.0';

// Disable the local model lookup (avoids failed requests and CORS issues during local development)
env.allowLocalModels = false;
// Cache the model with the browser's Cache API
env.useBrowserCache = true;

B. Loading the model

Build a pipeline by specifying the ASR (automatic speech recognition) task.

// Use the lightweight English-only tiny model
const MODEL_NAME = 'Xenova/whisper-tiny.en';

// The first run downloads the model (about 40MB)
const transcriber = await pipeline('automatic-speech-recognition', MODEL_NAME);

To support Japanese, specify the multilingual model 'Xenova/whisper-tiny' instead.
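With a multilingual model it can also help to state the spoken language explicitly rather than leave it to the model. As far as I know, the pipeline call accepts `language` and `task` options, as in the library's documented examples. The sketch below derives those options from the model name. `buildTranscribeOptions` is a hypothetical helper (not part of Transformers.js), and both the "names ending in .en are English-only" rule and the 'japanese' default are assumptions made for this article's use case.

```javascript
// Hypothetical helper (not part of Transformers.js): derive call-time options
// from the model name. Assumes English-only checkpoints end with ".en".
function buildTranscribeOptions(modelName, language = 'japanese') {
    const isEnglishOnly = modelName.endsWith('.en');
    return {
        // English-only models are pinned to English; otherwise use the
        // language the caller expects to hear.
        language: isEnglishOnly ? 'en' : language,
        task: 'transcribe',
    };
}

// Usage sketch (in the browser, with a loaded pipeline):
// const result = await transcriber(audioData, buildTranscribeOptions(MODEL_NAME));
```

Keeping this in one small function means the model dropdown and the inference call cannot drift apart when more models are added.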

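C. Handling audio input (important)

This is the biggest pitfall in the implementation. Whisper models expect input sampled at 16,000Hz (16kHz). If you pass data at the browser default (44.1kHz or 48kHz), recognition accuracy drops sharply or nothing is recognized at all.

Here we handle this by forcing the sample rate when creating the AudioContext.

// Create the AudioContext at 16kHz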
const audioContext = new (window.AudioContext || window.webkitAudioContext)({ 
    sampleRate: 16000 
});

const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const source = audioContext.createMediaStreamSource(stream);

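// Create a processor node to receive the audio data
// (AudioWorklet is recommended for production; ScriptProcessor is used here to keep everything in a single file)
const processor = audioContext.createScriptProcessor(4096, 1, 1);
const audioChunks = [];

processor.onaudioprocess = (e) => {
    const inputData = e.inputBuffer.getChannelData(0);
    // Store a copy of inputData (Float32Array), since the browser may reuse the buffer
    audioChunks.push(new Float32Array(inputData));
};

// The handler only fires once the nodes are connected
source.connect(processor);
processor.connect(audioContext.destination);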
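If a browser does not honor the requested sample rate (checking `audioContext.sampleRate` after creation is a cheap safeguard), the captured samples have to be resampled before inference. The following is a minimal sketch of that fallback using linear interpolation. `resampleLinear` is a hypothetical helper introduced here, and it applies no low-pass filter, so it trades some audio quality for simplicity.

```javascript
// Hypothetical fallback: resample a mono Float32Array to the target rate
// with linear interpolation (no anti-aliasing filter is applied).
function resampleLinear(input, fromRate, toRate = 16000) {
    if (fromRate === toRate) return input;

    const ratio = fromRate / toRate;
    const outLength = Math.floor((input.length * toRate) / fromRate);
    const output = new Float32Array(outLength);

    for (let i = 0; i < outLength; i++) {
        const pos = i * ratio;                            // position in the source signal
        const left = Math.floor(pos);                     // sample just before pos
        const right = Math.min(left + 1, input.length - 1); // sample just after pos
        const frac = pos - left;                          // distance from left to pos
        output[i] = input[left] * (1 - frac) + input[right] * frac;
    }
    return output;
}

// Usage sketch (in the browser):
// const audio16k = resampleLinear(audioData, audioContext.sampleRate);
```

For production-quality audio, an OfflineAudioContext or a proper filter-based resampler would likely be a better choice than this sketch.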

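D. Running inference

After recording stops, pass the accumulated audio data (a Float32Array) directly to the transcriber function.

// audioData: a single Float32Array built by concatenating the recorded chunks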
const result = await transcriber(audioData);

console.log(result);
// Output: { text: " Hello world. This is a test." }
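One thing to keep in mind for longer recordings: Whisper itself works on 30-second windows. To my knowledge the Transformers.js pipeline accepts `chunk_length_s` and `stride_length_s` options for processing longer audio in overlapping chunks. The sketch below picks those options from the recording length. `chunkingOptionsFor` is a hypothetical helper, and the 5-second stride is simply the value used in the library's documentation examples, not something tuned for this demo.

```javascript
// Hypothetical helper (not part of Transformers.js): choose call-time options
// based on how long the recording is.
const WHISPER_WINDOW_SECONDS = 30;

function chunkingOptionsFor(numSamples, sampleRate = 16000) {
    const durationSeconds = numSamples / sampleRate;
    // Short clips fit in a single window, so no chunking options are needed.
    if (durationSeconds <= WHISPER_WINDOW_SECONDS) return {};
    // Longer clips: process in 30-second chunks with overlapping strides.
    return { chunk_length_s: WHISPER_WINDOW_SECONDS, stride_length_s: 5 };
}

// Usage sketch (in the browser, with a loaded pipeline):
// const result = await transcriber(audioData, chunkingOptionsFor(audioData.length));
```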

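6. Troubleshooting / Notes

UX for the initial load time

Because a model file (about 40MB or more) has to be downloaded, the first visit involves a wait of several seconds to tens of seconds. A loading indicator such as "Preparing model..." is essential. Defining an explicit caching strategy with a Service Worker is also effective.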
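A model is typically split across several files, so a progress bar driven by per-file events tends to jump back and forth. Below is a minimal sketch that aggregates those events into one overall percentage. `createProgressTracker` is a hypothetical helper, and the event shape (`status`, `file`, `loaded`, `total`) is an assumption about what `progress_callback` receives, so it is worth logging a few events to confirm before relying on it.

```javascript
// Hypothetical helper: aggregate per-file download events into one percentage.
// Assumes each event looks like { status, file, loaded, total }.
function createProgressTracker() {
    const files = new Map(); // file name -> { loaded, total }

    return function onProgress(event) {
        if (event.status === 'progress' && event.total > 0) {
            files.set(event.file, { loaded: event.loaded, total: event.total });
        }
        let loaded = 0;
        let total = 0;
        for (const f of files.values()) {
            loaded += f.loaded;
            total += f.total;
        }
        // Overall percentage across every file seen so far.
        return total === 0 ? 0 : Math.round((loaded / total) * 100);
    };
}

// Usage sketch (in the browser):
// const track = createProgressTracker();
// await pipeline('automatic-speech-recognition', MODEL_NAME, {
//     progress_callback: (e) => { progressEl.style.width = `${track(e)}%`; },
// });
```

The percentage can still dip when a new file is first announced, because the known total grows at that moment.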

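Memory limits on mobile devices

On iOS Safari and similar browsers, the amount of memory available to WebAssembly is strictly limited.

Recommended: quantized whisper-tiny or whisper-base models
Not recommended: models of size small and above carry a high risk of crashing

Cross-Origin Isolation (multithreading)

To get the most out of ONNX Runtime Web (multithreaded processing via SharedArrayBuffer), the web server must send the following response headers.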

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

In environments where you cannot control headers, such as GitHub Pages, run in single-threaded mode (the default).
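Whether the page is actually isolated can be checked at runtime through the standard `crossOriginIsolated` global. The following sketch chooses a thread count from it. `chooseWasmThreads` is a hypothetical helper and the cap of 4 threads is an arbitrary assumption. The `env.backends.onnx.wasm.numThreads` setting shown in the usage comment is, as far as I know, how Transformers.js v2 exposes the ONNX Runtime Web thread count.

```javascript
// Hypothetical helper: decide how many Wasm threads to request.
// The cap of 4 is an arbitrary assumption, not a measured optimum.
function chooseWasmThreads(isCrossOriginIsolated, hardwareConcurrency = 1) {
    // Without cross-origin isolation SharedArrayBuffer is unavailable,
    // so multithreading cannot work.
    if (!isCrossOriginIsolated) return 1;
    return Math.max(1, Math.min(4, hardwareConcurrency));
}

// Usage sketch (in the browser, before creating the pipeline):
// env.backends.onnx.wasm.numThreads =
//     chooseWasmThreads(self.crossOriginIsolated, navigator.hardwareConcurrency);
```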

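Improving inference speed

By default, inference runs on the Wasm backend (CPU). Consider using the WebGPU backend (support is being strengthened in Transformers.js v3 and later). Be aware of browser compatibility, though.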
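Because WebGPU is not available everywhere, a feature check with a Wasm fallback is a reasonable pattern. The sketch below shows the idea. `pickDevice` is a hypothetical helper, and the usage comment assumes the v3 package (`@huggingface/transformers`) with its `device` option, which differs from the v2 setup used in the rest of this article.

```javascript
// Hypothetical helper: prefer WebGPU when the browser exposes it,
// otherwise fall back to the Wasm (CPU) backend.
function pickDevice(nav) {
    return nav && 'gpu' in nav ? 'webgpu' : 'wasm';
}

// Usage sketch, assuming Transformers.js v3 (@huggingface/transformers):
// const transcriber = await pipeline(
//     'automatic-speech-recognition',
//     MODEL_NAME,
//     { device: pickDevice(navigator) },
// );
```

Note that the presence of `navigator.gpu` does not guarantee a usable adapter, so wrapping the pipeline creation in a try/catch that retries with 'wasm' would make this more robust.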

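7. Summary of pros and cons

Pros

No usage fees: There are no API charges and no inference server costs.
Privacy: Audio data never leaves the device. This is ideal for apps that handle confidential information, such as meeting minutes tools.
Offline operation: Once the model is cached, it works even without a network connection.

Cons

Initial load: A download of tens to hundreds of MB is required, so it is not suited to a site's first page view.
Device dependence: Performance depends on the user's device. It may run slowly on older smartphones.
Accuracy: The model sizes that can run in a browser are limited, so you cannot reach the top accuracy of a model like Whisper large-v3.

8. Conclusion

This article explained how to run Whisper without a server.

With Transformers.js, client-side inference can be implemented through an API that feels familiar even to Python engineers.
Resampling to 16kHz is the key point of the implementation.
From the standpoint of privacy and cost, browser inference can be a strong option.

The "edge AI" approach, which uses the power of the user's device without worrying about usage-based API billing, is likely to become an important skill set in web development.

Why not use this demo code as a starting point for a browser-only voice memo app or a real-time captioning app?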

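Appendix: Demo code

Below is a fully working demo that implements what this article describes. Save it as an HTML file and open it through a local server (VS Code Live Server, etc.) to try it right away.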

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Whisper Browser Demo - Serverless ASR</title>
    <style>
        * {
            box-sizing: border-box;
        }

        body {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
            max-width: 640px;
            margin: 0 auto;
            padding: 20px;
            background: #f5f5f5;
            color: #333;
        }

        h1 {
            font-size: 1.5rem;
            text-align: center;
            margin-bottom: 8px;
        }

        .subtitle {
            text-align: center;
            color: #666;
            font-size: 0.9rem;
            margin-bottom: 24px;
        }

        .card {
            background: white;
            border-radius: 12px;
            padding: 24px;
            box-shadow: 0 2px 8px rgba(0,0,0,0.1);
            margin-bottom: 16px;
        }

        .status-section {
            text-align: center;
            margin-bottom: 20px;
        }

        #status {
            font-size: 0.95rem;
            color: #666;
            margin-bottom: 12px;
        }

        #status.loading {
            color: #f59e0b;
        }

        #status.ready {
            color: #10b981;
        }

        #status.recording {
            color: #ef4444;
        }

        #status.processing {
            color: #3b82f6;
        }

        .progress-container {
            width: 100%;
            height: 6px;
            background: #e5e7eb;
            border-radius: 3px;
            overflow: hidden;
            display: none;
        }

        .progress-container.visible {
            display: block;
        }

        #progress {
            height: 100%;
            background: linear-gradient(90deg, #3b82f6, #8b5cf6);
            border-radius: 3px;
            transition: width 0.3s ease;
            width: 0%;
        }

        .controls {
            display: flex;
            gap: 12px;
            justify-content: center;
            margin-bottom: 20px;
        }

        button {
            padding: 12px 24px;
            font-size: 1rem;
            border: none;
            border-radius: 8px;
            cursor: pointer;
            transition: all 0.2s ease;
            font-weight: 500;
        }

        button:disabled {
            opacity: 0.5;
            cursor: not-allowed;
        }

        #recordBtn {
            background: #ef4444;
            color: white;
        }

        #recordBtn:hover:not(:disabled) {
            background: #dc2626;
        }

        #recordBtn.recording {
            animation: pulse 1.5s infinite;
        }

        @keyframes pulse {
            0%, 100% { transform: scale(1); }
            50% { transform: scale(1.05); }
        }

        #stopBtn {
            background: #6b7280;
            color: white;
        }

        #stopBtn:hover:not(:disabled) {
            background: #4b5563;
        }

        .result-section h2 {
            font-size: 1rem;
            margin-bottom: 12px;
            color: #374151;
        }

        #result {
            min-height: 120px;
            padding: 16px;
            background: #f9fafb;
            border: 1px solid #e5e7eb;
            border-radius: 8px;
            font-size: 1rem;
            line-height: 1.6;
            white-space: pre-wrap;
            word-wrap: break-word;
        }

        #result:empty::before {
            content: "The transcription will appear here...";
            color: #9ca3af;
        }

        .info {
            font-size: 0.8rem;
            color: #9ca3af;
            text-align: center;
            margin-top: 16px;
        }

        .model-select {
            margin-bottom: 20px;
        }

        .model-select label {
            display: block;
            font-size: 0.9rem;
            color: #374151;
            margin-bottom: 8px;
        }

        .model-select select {
            width: 100%;
            padding: 10px 12px;
            font-size: 1rem;
            border: 1px solid #d1d5db;
            border-radius: 8px;
            background: white;
            cursor: pointer;
        }

        .model-select select:disabled {
            background: #f3f4f6;
            cursor: not-allowed;
        }

        .timer {
            font-size: 2rem;
            font-weight: bold;
            color: #ef4444;
            text-align: center;
            margin-bottom: 16px;
            font-variant-numeric: tabular-nums;
            display: none;
        }

        .timer.visible {
            display: block;
        }
    </style>
</head>
<body>
    <h1>🎤 Whisper Browser Demo</h1>
    <p class="subtitle">Speech recognition that runs fully locally, with no server</p>

    <div class="card">
        <div class="model-select">
            <label for="modelSelect">Select a model:</label>
            <select id="modelSelect">
                <option value="Xenova/whisper-tiny.en">whisper-tiny.en (English only, fastest)</option>
                <option value="Xenova/whisper-tiny">whisper-tiny (multilingual, supports Japanese)</option>
                <option value="Xenova/whisper-base.en">whisper-base.en (English only, higher accuracy)</option>
                <option value="Xenova/whisper-base">whisper-base (multilingual, higher accuracy)</option>
            </select>
        </div>

        <div class="status-section">
            <div id="status">Please load a model</div>
            <div class="progress-container" id="progressContainer">
                <div id="progress"></div>
            </div>
        </div>

        <div class="timer" id="timer">00:00</div>

        <div class="controls">
            <button id="recordBtn" disabled>🎙️ Start recording</button>
            <button id="stopBtn" disabled>⏹️ Stop and transcribe</button>
        </div>

        <div class="result-section">
            <h2>📝 Transcript</h2>
            <div id="result"></div>
        </div>

        <p class="info">
            Note: The first run downloads the model (about 40 to 150MB)<br>
            Note: Audio data is never sent to an external server
        </p>
    </div>

    <script type="module">
        import { pipeline, env } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.13.0';

        // Configuration
        env.allowLocalModels = false;
        env.useBrowserCache = true;

        // DOM elements
        const modelSelect = document.getElementById('modelSelect');
        const statusEl = document.getElementById('status');
        const progressContainer = document.getElementById('progressContainer');
        const progressEl = document.getElementById('progress');
        const recordBtn = document.getElementById('recordBtn');
        const stopBtn = document.getElementById('stopBtn');
        const resultEl = document.getElementById('result');
        const timerEl = document.getElementById('timer');

        // State
        let transcriber = null;
        let audioContext = null;
        let mediaStream = null;
        let processor = null;
        let audioChunks = [];
        let isRecording = false;
        let timerInterval = null;
        let recordingStartTime = null;

        // Update the status message
        function setStatus(message, className = '') {
            statusEl.textContent = message;
            statusEl.className = className;
        }

        // Update the recording timer
        function updateTimer() {
            if (!recordingStartTime) return;
            const elapsed = Math.floor((Date.now() - recordingStartTime) / 1000);
            const minutes = Math.floor(elapsed / 60).toString().padStart(2, '0');
            const seconds = (elapsed % 60).toString().padStart(2, '0');
            timerEl.textContent = `${minutes}:${seconds}`;
        }

        // Load the model
        async function loadModel() {
            const modelName = modelSelect.value;

            setStatus('Loading model...', 'loading');
            progressContainer.classList.add('visible');
            progressEl.style.width = '0%';
            
            recordBtn.disabled = true;
            modelSelect.disabled = true;

            try {
                transcriber = await pipeline('automatic-speech-recognition', modelName, {
                    progress_callback: (progress) => {
                        if (progress.status === 'progress') {
                            const percent = Math.round((progress.loaded / progress.total) * 100);
                            progressEl.style.width = `${percent}%`;
                            setStatus(`Downloading... ${percent}%`, 'loading');
                        } else if (progress.status === 'done') {
                            progressEl.style.width = '100%';
                        }
                    }
                });

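                setStatus('✅ Ready! You can start recording', 'ready');
                recordBtn.disabled = false;
                // Re-enable the selector so another model can be chosen after loading
                modelSelect.disabled = false;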
                
                setTimeout(() => {
                    progressContainer.classList.remove('visible');
                }, 1000);

            } catch (error) {
                console.error('Model loading error:', error);
                setStatus('❌ Failed to load the model', '');
                modelSelect.disabled = false;
            }
        }

        // Start recording
        async function startRecording() {
            try {
                // Create the AudioContext at 16kHz
                audioContext = new (window.AudioContext || window.webkitAudioContext)({
                    sampleRate: 16000
                });

                mediaStream = await navigator.mediaDevices.getUserMedia({ 
                    audio: {
                        channelCount: 1,
                        sampleRate: 16000
                    }
                });

                const source = audioContext.createMediaStreamSource(mediaStream);
                
                // Use ScriptProcessor (AudioWorklet is more involved in a single-file setup)
                processor = audioContext.createScriptProcessor(4096, 1, 1);
                audioChunks = [];

                processor.onaudioprocess = (e) => {
                    if (isRecording) {
                        const inputData = e.inputBuffer.getChannelData(0);
                        audioChunks.push(new Float32Array(inputData));
                    }
                };

                source.connect(processor);
                processor.connect(audioContext.destination);

                isRecording = true;
                recordingStartTime = Date.now();
                
                // Start the timer
                timerEl.classList.add('visible');
                timerEl.textContent = '00:00';
                timerInterval = setInterval(updateTimer, 1000);

                setStatus('🔴 Recording...', 'recording');
                recordBtn.disabled = true;
                recordBtn.classList.add('recording');
                stopBtn.disabled = false;
                modelSelect.disabled = true;

            } catch (error) {
                console.error('Recording error:', error);
                setStatus('❌ Failed to access the microphone', '');
            }
        }

        // Stop recording and transcribe
        async function stopRecording() {
            isRecording = false;

            // Stop the timer
            clearInterval(timerInterval);
            timerEl.classList.remove('visible');

            // Release resources
            if (processor) {
                processor.disconnect();
                processor = null;
            }
            if (mediaStream) {
                mediaStream.getTracks().forEach(track => track.stop());
                mediaStream = null;
            }
            if (audioContext) {
                await audioContext.close();
                audioContext = null;
            }

            recordBtn.classList.remove('recording');
            stopBtn.disabled = true;

            if (audioChunks.length === 0) {
                setStatus('⚠️ No audio was recorded', '');
                recordBtn.disabled = false;
                modelSelect.disabled = false;
                return;
            }

            // Concatenate the audio chunks
            const totalLength = audioChunks.reduce((acc, chunk) => acc + chunk.length, 0);
            const audioData = new Float32Array(totalLength);
            let offset = 0;
            for (const chunk of audioChunks) {
                audioData.set(chunk, offset);
                offset += chunk.length;
            }

            setStatus('🔄 Transcribing...', 'processing');
            resultEl.textContent = '';

            try {
                const result = await transcriber(audioData, {
                    language: modelSelect.value.includes('.en') ? 'en' : null,
                    task: 'transcribe'
                });

                resultEl.textContent = result.text.trim() || '(no speech recognized)';
                setStatus('✅ Transcription complete!', 'ready');

            } catch (error) {
                console.error('Transcription error:', error);
                setStatus('❌ Transcription failed', '');
                resultEl.textContent = `Error: ${error.message}`;
            }

            recordBtn.disabled = false;
            // startRecording() disabled the selector, so re-enable it here
            modelSelect.disabled = false;
        }

        // Event listeners
        modelSelect.addEventListener('change', loadModel);
        recordBtn.addEventListener('click', startRecording);
        stopBtn.addEventListener('click', stopRecording);

        // Initial load
        loadModel();
    </script>
</body>
</html>
