Implementing Real-Time Translation and Summarization with GPT-4o-mini

Posted at 2024-09-08

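1. Overview

This article introduces a system that adds real-time summarization to a real-time translation feature built on GPT-4o-mini. In meetings and long discussions, it makes the content easier to grasp at a glance, which is expected to make meetings considerably more efficient.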

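2. Features and usage

The system has the following features:

Speech transcription
- While recognition is still in progress, the interim text is shown in a lighter (gray, italic) font.
- Once recognition is finalized, the text is committed and the input is complete.
Real-time translation
- Finalized text is translated immediately with GPT-4o-mini.
- The prompt is written to suppress unnecessary output.
Real-time summarization
- A summary is generated after every third translation.
- Each summary reuses the previous one, so the content is built up cumulatively.
- This makes rambling conversation easier to condense and keeps the summary consistent.
- The formatting is carried over each time as well, which tends to improve summary quality.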
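
The cumulative summarization described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used in the article: the class name `RollingSummarizer` and the stand-in `fake_summarize` function are hypothetical, and a real version would call the model where the stand-in is used.

```python
from typing import Callable, List


class RollingSummarizer:
    """Refreshes a running summary after every `interval` translations."""

    def __init__(self, summarize: Callable[[str, str], str], interval: int = 3):
        self.summarize = summarize  # (all_translations, previous_summary) -> new summary
        self.interval = interval
        self.translations: List[str] = []
        self.summary = ""

    def add_translation(self, text: str) -> bool:
        """Store one translation; return True if the summary was refreshed."""
        self.translations.append(text)
        if len(self.translations) % self.interval != 0:
            return False
        # Pass everything translated so far together with the previous summary,
        # so the new summary builds on the old one instead of starting over.
        self.summary = self.summarize("\n".join(self.translations), self.summary)
        return True


def fake_summarize(text: str, previous_summary: str) -> str:
    # Hypothetical stand-in for the model call, used only to show the flow.
    return f"summary of {len(text.splitlines())} lines"
```

Calling `add_translation` three times with this stand-in refreshes the summary only on the third call, which mirrors the "every third translation" rule.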

3. Code

Folder structure

app.py
templates/index.html

[ app.py ]

from flask import Flask, render_template, request, jsonify
import requests

app = Flask(__name__)

# API key hard-coded for this demo; replace it with your own and keep it out of version control
API_KEY = "YOUR_APIKEY"

# Route to serve the frontend
@app.route('/')
def index():
    return render_template('index.html')

# Endpoint to process translations
@app.route('/translate', methods=['POST'])
def translate():
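    data = request.get_json(silent=True) or {}
    # Fall back to an empty list so the unpacking below cannot fail on a missing field
    message_history = data.get('messageHistory') or []

    api_url = "https://api.openai.com/v1/chat/completions"

    # System prompt, kept in Japanese because the output language is Japanese. In English, roughly:
    # "You are a subtitle translation tool for ServiceNow meetings. Turn the following
    # speech-recognition text into vivid, easy-to-read Japanese. The result is displayed by the
    # system, so remove everything except the result. Do not output the 」「 symbols."
    prompt = "あなたはServiceNowの会議での字幕翻訳ツールです。次の音声文字認識テキストを臨場感あふれるかつ読みやすい日本語にしてください。結果をシステムに表示するため結果以外の文字は必ず削除してください。」「の記号は出力禁止。"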
    
    payload = {
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system", "content": prompt},
            *message_history
        ],
        "max_tokens": 1000
    }

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}"
    }

    try:
        # A timeout keeps a stalled API call from blocking the request indefinitely
        response = requests.post(api_url, headers=headers, json=payload, timeout=30)
        response.raise_for_status()
        gpt_response = response.json()
        translated_text = gpt_response['choices'][0]['message']['content']
        return jsonify({'translation': translated_text})
    except requests.exceptions.RequestException as e:
        return jsonify({'error': str(e)}), 500

# Enhanced summarization logic combining previous translations and summary
@app.route('/summarize', methods=['POST'])
def summarize():
    data = request.get_json(silent=True) or {}
    text = data.get('text') or ''  # Translations field content
    previous_summary = data.get('previousSummary') or ''  # Summary field content (empty on the first call)

    api_url = "https://api.openai.com/v1/chat/completions"

    # Combined prompt for creating a new summary. The Japanese part reads roughly:
    # "Please create easy-to-read meeting minutes. Rich text:"
    prompt = f"読みやすい議事録を作成してください。リッチテキスト:\n{text}\n\nPrevious Summary:\n{previous_summary}"
    
    payload = {
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system", "content": "Summarize the content below"},
            {"role": "user", "content": prompt}
        ],
        "max_tokens": 3000
    }

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}"
    }

    try:
        # A timeout keeps a stalled API call from blocking the request indefinitely
        response = requests.post(api_url, headers=headers, json=payload, timeout=60)
        response.raise_for_status()
        gpt_response = response.json()
        summary_text = gpt_response['choices'][0]['message']['content'].strip()
        return jsonify({'summary': summary_text})
    except requests.exceptions.RequestException as e:
        return jsonify({'error': str(e)}), 500

# Run the Flask development server; debug mode is meant for local use only
if __name__ == '__main__':
    app.run(debug=True)
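
app.py above keeps the API key in a constant, which is fine for a quick demo. One possible alternative, shown here as a minimal sketch, is to read the key from an environment variable at startup. The helper name `load_api_key` and the variable name `OPENAI_API_KEY` are assumptions made for this example and are not part of the original code.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in `env_var`, or fail with a clear message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Failing early is easier to diagnose than a 401 on the first request.
        raise RuntimeError(f"Set the {env_var} environment variable before starting the app.")
    return key
```

With this helper, the constant could become `API_KEY = load_api_key()`, so the key never has to appear in the source file.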

[ templates/index.html ]

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Speech Recognition with GPT-4o-mini</title>
    <style>
        body { font-family: Arial, sans-serif; margin: 0; padding: 20px; display: flex; justify-content: center; height: 100vh; box-sizing: border-box; }
        .container { display: flex; flex-direction: column; width: 100%; max-width: 1600px; height: 100%; }
        .column-wrapper { display: flex; justify-content: space-between; gap: 20px; flex-grow: 1; height: 100%; }
        .column { flex: 1; padding: 10px; display: flex; flex-direction: column; height: 100%; }
        h2 { margin-top: 0; text-align: center; font-size: 1.5em; }
        #output, #gptResponse, #summary { 
            flex-grow: 1; 
            border: 1px solid #ccc; 
            padding: 10px; 
            margin-top: 10px;
            overflow-y: auto;  /* Scrollable fields */
            white-space: pre-wrap; 
            height: 100%; /* Adjust height for scroll */
            font-size: 1em;  /* Adjust font size for better readability */
            line-height: 1.5em;  /* Adjust line height for better spacing */
        }
        .button-group { text-align: center; margin-top: 10px; display: flex; justify-content: center; gap: 10px; }
        button { padding: 10px 20px; background-color: #4CAF50; color: white; border: none; cursor: pointer; margin: 0 5px; }
        button:hover { background-color: #45a049; }
        button:disabled { background-color: #cccccc; cursor: not-allowed; }
        select { width: 100%; padding: 5px; margin-top: 5px; }
        .interim { color: gray; font-style: italic; }
        .error { color: red; font-weight: bold; }
        #summary { 
            overflow-y: auto; 
            height: 100%; 
            white-space: normal;  /* Collapse whitespace; the rendered Markdown HTML supplies its own breaks */
            background-color: #f8f9fa; 
            padding: 10px;
            word-wrap: break-word;  /* Ensure long words are wrapped */
        }
        .full-width { width: 100%; text-align: center; margin-bottom: 10px; }
    </style>
    <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>  <!-- marked.js CDN -->
</head>
<body>
    <div class="container">
        <div class="full-width">
            <select id="languageSelect">
                <option value="en-US">English</option>
            </select>
        </div>
        <div class="button-group">
            <button id="startButton">Start</button>
            <button id="stopButton" disabled>Stop</button>
            <button id="clearButton">Clear</button>
        </div>
        <div class="column-wrapper">
            <div class="column column-left">
                <h2>Speech Recognition</h2>
                <div id="output"></div>
            </div>
            <div class="column">
                <h2>GPT-4o-mini Translation</h2>
                <div id="gptResponse"></div>
            </div>
            <div class="column">
                <h2>Summary</h2>
                <div id="summary"></div> <!-- Markdown will be rendered here -->
            </div>
        </div>
    </div>

    <script>
        const startButton = document.getElementById('startButton');
        const stopButton = document.getElementById('stopButton');
        const clearButton = document.getElementById('clearButton');
        const output = document.getElementById('output');
        const gptResponse = document.getElementById('gptResponse');
        const summary = document.getElementById('summary');
        const languageSelect = document.getElementById('languageSelect');

        let recognition;
        let finalTranscript = '';
        let messageHistory = [];
        let interimTranscript = '';
        let translationCount = 0;
        let accumulatedSummary = '';  // Stores the cumulative summary

        function startRecognition() {
            recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
            recognition.lang = languageSelect.value;
            recognition.interimResults = true;
            recognition.continuous = true;

            recognition.onresult = (event) => {
                interimTranscript = '';  // Reset interim transcript

                for (let i = event.resultIndex; i < event.results.length; i++) {
                    if (event.results[i].isFinal) {
                        finalTranscript += event.results[i][0].transcript + ' ';
                        processText(event.results[i][0].transcript);  // Only send finalized text
                    } else {
                        interimTranscript += event.results[i][0].transcript;
                    }
                }

                updateOutput(finalTranscript, interimTranscript);  // Update both final and interim texts
            };

            recognition.onerror = (event) => {
                console.error("Error: ", event.error);
            };

            recognition.start();
            startButton.disabled = true;
            stopButton.disabled = false;
        }

        function stopRecognition() {
            if (recognition) {
                recognition.stop();
                startButton.disabled = false;
                stopButton.disabled = true;
            }
        }

        function clearOutput() {
            output.textContent = '';
            gptResponse.textContent = '';
            summary.innerHTML = '';  // Clear rich text content
            finalTranscript = '';
            messageHistory = [];
            accumulatedSummary = '';  // Clear accumulated summary
            translationCount = 0;     // Reset translation count
        }

        // Append a line to a panel as plain text so model output is never parsed as HTML
        function appendLine(target, text, className) {
            const node = className ? document.createElement('span') : document.createTextNode('');
            if (className) node.className = className;
            node.textContent = text;
            target.append(node, '\n');
        }

        async function processText(text) {
            messageHistory.push({ role: 'user', content: text });

            try {
                const response = await fetch('/translate', {
                    method: 'POST',
                    headers: { 'Content-Type': 'application/json' },
                    body: JSON.stringify({ messageHistory })
                });

                const data = await response.json();
                if (data.translation) {
                    appendLine(gptResponse, data.translation);
                    messageHistory.push({ role: 'assistant', content: data.translation });
                    translationCount++;

                    // Refresh the summary after every third translation
                    if (translationCount % 3 === 0) {
                        await updateSummary();
                    }
                } else if (data.error) {
                    appendLine(gptResponse, data.error, 'error');
                }
            } catch (error) {
                appendLine(gptResponse, `Translation failed: ${error.message}`, 'error');
            }

            autoScroll();  // Auto-scroll after translation
        }

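        function updateOutput(finalText, interimText) {
            output.textContent = finalText;  // Insert the transcript as text, not HTML
            if (interimText) {
                // Interim (not yet finalized) text is shown in gray italics
                const span = document.createElement('span');
                span.className = 'interim';
                span.textContent = interimText;
                output.appendChild(span);
            }
            autoScroll();
        }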

        function autoScroll() {
            output.scrollTop = output.scrollHeight;
            gptResponse.scrollTop = gptResponse.scrollHeight;
            summary.scrollTop = summary.scrollHeight;
        }

        async function updateSummary() {
            // Send plain text so no HTML markup (such as error spans) reaches the summarizer
            const allTranslations = gptResponse.textContent;
            try {
                const response = await fetch('/summarize', {
                    method: 'POST',
                    headers: { 'Content-Type': 'application/json' },
                    body: JSON.stringify({ text: allTranslations, previousSummary: accumulatedSummary })
                });

                const data = await response.json();
                if (data.summary) {
                    accumulatedSummary = data.summary;
                    // marked.js renders Markdown to HTML but does not sanitize it
                    summary.innerHTML = marked.parse(accumulatedSummary);
                } else if (data.error) {
                    summary.textContent = `Error: ${data.error}`;
                }
            } catch (error) {
                // Report summary failures here rather than as a translation error
                summary.textContent = `Summary failed: ${error.message}`;
            }
        }

        startButton.addEventListener('click', startRecognition);
        stopButton.addEventListener('click', stopRecognition);
        clearButton.addEventListener('click', clearOutput);
    </script>
</body>
</html>
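
The page sends the entire messageHistory with every /translate request, so the payload keeps growing over a long meeting. One way to bound it, sketched here under assumptions, is to keep only the most recent messages on the server before building the payload. The helper name `trim_history` and the limit of 20 messages are hypothetical choices for this example, not part of the original app.

```python
from typing import Dict, List


def trim_history(messages: List[Dict[str, str]], max_messages: int = 20) -> List[Dict[str, str]]:
    """Keep at most `max_messages` of the newest chat messages."""
    recent = list(messages[-max_messages:]) if max_messages > 0 else []
    # If the cut landed between a user message and its translation,
    # drop the orphaned assistant reply so the history starts with a user turn.
    while recent and recent[0].get("role") != "user":
        recent.pop(0)
    return recent
```

In `translate()`, this could be applied as `message_history = trim_history(message_history)` before the payload is assembled. The trade-off is that older utterances no longer provide context for the translation.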

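4. Closing

Real-time translation and summarization with GPT-4o-mini is especially useful as a tool for making long meetings and discussions more efficient. It can help keep the conversation from drifting off topic and keep the meeting productive. Going forward, I would like to keep improving it, with support for more languages and model upgrades in mind.

I sincerely hope it helps improve your productivity.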
