Booting a Jetson Orin Nano from an NVMe SSD, running Whisper, and controlling home appliances through Dify's MCP

Last updated at 2025-06-16, posted at 2025-05-21

Introduction

I picked up a Seeed Studio Jetson Orin Nano 8GB on AliExpress, so here is how I put it to use.

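The official procedure for installing onto an NVMe SSD requires an x86 Ubuntu machine, which I simply do not have.
Instead I copied the microSD card with dd, a method that has become old-school on the Raspberry Pi. (It is not guaranteed to work, so use it at your own risk.)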

From there I ran speech-to-text (Whisper) in the style of an AI assistant and connected it to an AI agent.

Prerequisites

Jetson Orin Nano 8GB
Firmware 36.4.3 (luckily already the latest)
microSD (64GB, with JetPack 6.2)
NVMe SSD
Dify on k8s (for the chatbot)
Some web server (for the web app)

Switching from microSD boot to NVMe SSD boot on a single machine

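With the initial setup finished, install the NVMe SSD and boot the OS.
Switch to CLI mode with init 3 (init 1 would really be better), then copy the microSD card to the NVMe SSD with dd.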

$ sudo init 3
$ sudo sync
$ sudo sync
$ sudo sync
$ sudo dd if=/dev/mmcblk0 of=/dev/nvme0n1 bs=4M conv=fsync status=progress
63136858112 bytes (63 GB, 59 GiB) copied, 731 s, 86.4 MB/s
15072+1 records in
15072+1 records out
63218647040 bytes (63 GB, 59 GiB) copied, 733.685 s, 86.2 MB/s

Run fsck once, then grow the ext4 partition with resizepart in parted.

$ sudo e2fsck /dev/nvme0n1p1
e2fsck 1.46.5 (30-Dec-2021)
/dev/nvme0n1p1: recovering journal
Setting free inodes count to 3515114 (was 3515141)
Setting free blocks count to 8882250 (was 8882497)
/dev/nvme0n1p1: clean, 238486/3753600 files, 6169526/15051776 blocks
$ sudo parted /dev/nvme0n1
GNU Parted 3.4
Using /dev/nvme0n1
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p
Warning: Not all of the space available to /dev/nvme0n1 appears to be used, you can fix the GPT to use all of the space (an extra 853299248
blocks) or continue with the current setting?
Fix/Ignore? F
Model: TS500GMTE110Q-E (nvme)
Disk /dev/nvme0n1: 500GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name                Flags
 2      1049kB  135MB   134MB                A_kernel
 3      135MB   136MB   786kB                A_kernel-dtb
 4      136MB   169MB   33.2MB               A_reserved_on_user
 5      170MB   304MB   134MB                B_kernel
 6      304MB   305MB   786kB                B_kernel-dtb
 7      305MB   338MB   33.2MB               B_reserved_on_user
 8      339MB   423MB   83.9MB               recovery
 9      423MB   423MB   524kB                recovery-dtb
10      424MB   491MB   67.1MB  fat32        esp                 boot, esp
11      491MB   575MB   83.9MB               recovery_alt
12      575MB   575MB   524kB                recovery-dtb_alt
13      576MB   643MB   67.1MB               esp_alt
14      643MB   1062MB  419MB                UDA
15      1062MB  1565MB  503MB                reserved
 1      1566MB  63.2GB  61.7GB  ext4         APP

(parted) resizepart 1 500GB
(parted) quit

Information: You may need to update /etc/fstab.

Repair the filesystem with e2fsck -fy, then run resize2fs.

$ sudo e2fsck -fy /dev/nvme0n1p1
e2fsck 1.46.5 (30-Dec-2021)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Inode 1175669 ref count is 1, should be 2.  Fix? yes
 :
Free inodes count wrong (3515114, counted=3515115).
Fix? yes

/dev/nvme0n1p1: ***** FILE SYSTEM WAS MODIFIED *****
/dev/nvme0n1p1: 238485/3753600 files (0.2% non-contiguous), 6168438/15051776 blocks

$ sudo resize2fs /dev/nvme0n1p1
resize2fs 1.46.5 (30-Dec-2021)
Resizing the filesystem on /dev/nvme0n1p1 to 121688104 (4k) blocks.
The filesystem on /dev/nvme0n1p1 is now 121688104 (4k) blocks long.
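The numbers in the logs above can be cross-checked with a little arithmetic. The following is a minimal sketch, assuming that the "extra blocks" reported by parted are 512-byte sectors and that the block count reported by resize2fs uses 4096-byte blocks; the helper name `to_gb` is made up for this illustration.

```python
BLOCK_SIZE = 4096    # resize2fs reports "(4k) blocks"
SECTOR_SIZE = 512    # assumption: parted counts 512B logical sectors


def to_gb(n_bytes: int) -> float:
    """Convert bytes to decimal gigabytes, the unit parted prints."""
    return n_bytes / 1000**3


dd_bytes = 63_218_647_040                  # bytes copied by dd
unused_bytes = 853_299_248 * SECTOR_SIZE   # space parted said was unused
fs_bytes = 121_688_104 * BLOCK_SIZE        # filesystem size after resize2fs

print(f"copied image     : {to_gb(dd_bytes):.1f} GB")
print(f"unused after dd  : {to_gb(unused_bytes):.1f} GB")
print(f"whole disk       : {to_gb(dd_bytes + unused_bytes):.1f} GB")
print(f"ext4 after resize: {to_gb(fs_bytes):.1f} GB")
```

Under these assumptions the copied image plus the unused space adds up to the full 500GB disk, and the resized ext4 filesystem is roughly the disk size minus the 1566MB that precedes the APP partition.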

If that succeeds, mount the partition at /mnt/nvme and change the root device in the boot configuration to the NVMe SSD.

$ sudo mkdir -p /mnt/nvme
$ sudo mount /dev/nvme0n1p1 /mnt/nvme
$ sudo perl -pi -e "s/mmcblk0p1/nvme0n1p1/g" /mnt/nvme/boot/extlinux/extlinux.conf

It is fine if the APPEND line now contains something like root=/dev/nvme0n1p1.

$ cat /mnt/nvme/boot/extlinux/extlinux.conf
TIMEOUT 30
DEFAULT primary

MENU TITLE L4T boot options

LABEL primary
      MENU LABEL primary kernel
      LINUX /boot/Image
      INITRD /boot/initrd
      APPEND ${cbootargs} root=/dev/nvme0n1p1 rw rootwait rootfstype=ext4 mminit_loglevel=4 console=ttyTCU0,115200 firmware_class.path=/etc/firmware fbcon=map:0 nospectre_bhb video=efifb:off console=tty0

# When testing a custom kernel, it is recommended that you create a backup of
# the original kernel and add a new entry to this file so that the device can
# fallback to the original kernel. To do this:
#
# 1, Make a backup of the original kernel
#      sudo cp /boot/Image /boot/Image.backup
#
# 2, Copy your custom kernel into /boot/Image
#
# 3, Uncomment below menu setting lines for the original kernel
#
# 4, Reboot

# LABEL backup
#    MENU LABEL backup kernel
#    LINUX /boot/Image.backup
#    INITRD /boot/initrd
#    APPEND ${cbootargs}
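The same check can be scripted instead of being done by eye. This is only a sketch: `root_device` is a hypothetical helper written for this article, not part of any NVIDIA tooling, and it handles just the simple LABEL/APPEND layout shown above.

```python
import re
from typing import Optional


def root_device(conf_text: str, label: str = "primary") -> Optional[str]:
    """Return the root= value on the APPEND line of the given LABEL entry."""
    current = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out entries
        parts = line.split(None, 1)
        keyword = parts[0].upper()
        if keyword == "LABEL" and len(parts) == 2:
            current = parts[1].strip()
        elif keyword == "APPEND" and current == label and len(parts) == 2:
            # Match "root=" as its own argument, not "rootfstype=" or "rootwait".
            m = re.search(r"(?:^|\s)root=(\S+)", parts[1])
            return m.group(1) if m else None
    return None


sample = """\
LABEL primary
      MENU LABEL primary kernel
      APPEND ${cbootargs} root=/dev/nvme0n1p1 rw rootwait rootfstype=ext4
"""
print(root_device(sample))
```

To use it against the copied system, read /mnt/nvme/boot/extlinux/extlinux.conf and pass its contents to `root_device` before unmounting.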

Unmount the partition, shut down the OS, remove the microSD card, and power the board on.
If it does not boot from the NVMe SSD, try resetting the BIOS settings.

$ sudo umount /mnt/nvme
$ sudo shutdown -h now

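Next, switch to the Super power mode and make the board boot to the CLI permanently. Memory is unified (shared between CPU and GPU), so spending it on a desktop would be a waste.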

$ sudo nvpmodel -m 1
$ sudo systemctl set-default multi-user.target
Removed /etc/systemd/system/default.target.
Created symlink /etc/systemd/system/default.target → /lib/systemd/system/multi-user.target.
$ sudo systemctl isolate multi-user.target

Deploying the speech-to-text service

Update the OS packages first. I think this takes less time after moving to NVMe.
Then prepare the environment so that the AI containers can be used.

$ git clone https://github.com/dusty-nv/jetson-containers
Cloning into 'jetson-containers'...
remote: Enumerating objects: 31968, done.
remote: Counting objects: 100% (1420/1420), done.
remote: Compressing objects: 100% (651/651), done.
remote: Total 31968 (delta 1072), reused 808 (delta 768), pack-reused 30548 (from 4)
Receiving objects: 100% (31968/31968), 223.20 MiB | 11.01 MiB/s, done.
Resolving deltas: 100% (21327/21327), done.

$ bash jetson-containers/install.sh
+++ readlink -f jetson-containers/install.sh
++ dirname /home/haomei/jetson-containers/install.sh
+ ROOT=/home/haomei/jetson-containers
+ INSTALL_PREFIX=/usr/local/bin
++ lsb_release -rs
+ LSB_RELEASE=22.04
+ '[' 22.04 = 24.04 ']'
 :
 :
Successfully installed DockerHub-API-0.5 furl-2.1.4 orderedmultidict-1.0.1 pyyaml-6.0.2 tabulate-0.9.0 termcolor-3.1.0 wget-3.2
+ sudo ln -sf /home/haomei/jetson-containers/autotag /usr/local/bin/autotag
+ sudo ln -sf /home/haomei/jetson-containers/jetson-containers /usr/local/bin/jetson-containers

Since the goal is speech-to-text, build the whisper container image.

$ jetson-containers run --name whisper $(autotag whisper)
Namespace(packages=['whisper'], prefer=['local', 'registry', 'build'], disable=[''], user='dustynv', output='/tmp/autotag', quiet=False, verbose=False)
-- L4T_VERSION=36.4.3  JETPACK_VERSION=6.2  CUDA_VERSION=12.6
-- Finding compatible container image for ['whisper']

Couldn't find a compatible container for whisper, would you like to build it? [y/N] y

The build took about 35 minutes. (This depends on your network environment.)

Once the container starts, JupyterLab comes up, so open it in a browser and play around with it.
I do not use it, so I exited right away with [Ctrl]+[d].

whisper:r36.4-cu126-22.04
V4L2_DEVICES:
### ARM64 architecture detected
### Jetson Detected
SYSTEM_ARCH=tegra-aarch64
+ sudo docker run --runtime nvidia --env NVIDIA_DRIVER_CAPABILITIES=compute,utility,graphics -it --rm --network host --shm-size=8g --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /home/haomei/jetson-containers/data:/data -v /etc/localtime:/etc/localtime:ro -v /etc/timezone:/etc/timezone:ro --device /dev/snd -e PULSE_SERVER=unix:/run/user/1000/pulse/native -v /run/user/1000/pulse:/run/user/1000/pulse --device /dev/bus/usb --device /dev/i2c-0 --device /dev/i2c-1 --device /dev/i2c-2 --device /dev/i2c-4 --device /dev/i2c-5 --device /dev/i2c-7 --device /dev/i2c-9 --name whisper whisper:r36.4-cu126-22.04
allow 10 sec for JupyterLab to start @ https://192.168.1.64:8888 (password nvidia)
JupterLab logging location:  /var/log/jupyter.log  (inside the container)
root@orin:/opt/whisper#

Using the built container image as a base, turn it into an OpenAI API-compatible speech-to-text service.

$ mkdir whisper
$ cd whisper
~/whisper$ mkdir app
~/whisper$ mkdir model
~/whisper$ cat <<EOF > app/main.py
import os
import tempfile
import warnings
import logging
import numpy as np

from fastapi import FastAPI, File, UploadFile, HTTPException, Depends, Form
from fastapi.middleware.cors import CORSMiddleware
from fastapi.security.api_key import APIKeyHeader
import whisper
import librosa

warnings.filterwarnings("ignore")
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

API_KEY = os.getenv("API_KEY", "default_api_key")
API_KEY_NAME = "Authorization"
api_key_header = APIKeyHeader(name=API_KEY_NAME, auto_error=True)

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

def verify_api_key(api_key: str = Depends(api_key_header)):
    if api_key != API_KEY:
        raise HTTPException(status_code=403, detail="Invalid API Key")
    return api_key

model_name = os.getenv("MODEL", "turbo")
model = whisper.load_model(model_name, download_root="/model")
threshold_db = float(os.getenv("THRESHOLD_DB", "-50"))

@app.post("/v1/audio/transcriptions")
async def transcribe_audio(
    file: UploadFile = File(...),
    model_param: str = Form("whisper-1"),
    prompt: str = Form(""),
    response_format: str = Form("json"),
    temperature: float = Form(0.0),
    language: str = Form("ja"),
    api_key: str = Depends(verify_api_key)
):
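    # Persist the upload to a temp file so librosa and whisper can read it by path.
    audio_bytes = await file.read()
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp_in:
        tmp_in.write(audio_bytes)
        tmp_in.flush()
        input_path = tmp_in.name

    try:
        # Skip transcription when the clip is quieter than THRESHOLD_DB.
        y, sr = librosa.load(input_path, sr=None)
        rms = np.sqrt(np.mean(y**2))
        db = 20 * np.log10(rms + 1e-6)
        if db < threshold_db:
            return {"text": ""}

        # Use the language form field (default "ja") instead of hard-coding it.
        result = model.transcribe(input_path, language=language)
        return {"text": result["text"]}
    finally:
        # delete=False leaves the temp file behind, so remove it explicitly.
        os.remove(input_path)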
EOF
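The silence gate in main.py converts the RMS amplitude of the clip to decibels and returns an empty text when it falls below THRESHOLD_DB (default -50). The sketch below reproduces that calculation with the standard library only, so the numbers can be tried without numpy or librosa; the function names are made up for this illustration and samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math
from typing import Sequence


def rms_db(samples: Sequence[float], eps: float = 1e-6) -> float:
    """Level of the clip in dB relative to full scale, as computed in main.py."""
    if not samples:
        return 20 * math.log10(eps)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # eps keeps log10 defined for pure digital silence (floor of about -120 dB).
    return 20 * math.log10(rms + eps)


def is_silent(samples: Sequence[float], threshold_db: float = -50.0) -> bool:
    """True when the clip would be skipped instead of sent to Whisper."""
    return rms_db(samples) < threshold_db


quiet = [0.001] * 1600   # constant amplitude 0.001, about -60 dB
speech = [0.1] * 1600    # constant amplitude 0.1, about -20 dB
print(round(rms_db(quiet), 1), is_silent(quiet))
print(round(rms_db(speech), 1), is_silent(speech))
```

With the default of -50 dB, an amplitude of roughly 0.003 is the dividing line, so raising THRESHOLD_DB makes the service ignore more background noise.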

Create a Dockerfile. Check the image name for FROM in the build messages or with sudo docker images.

~/whisper$ cat <<EOF > Dockerfile
FROM whisper:r36.4-cu126-22.04
RUN pip install openai-whisper fastapi uvicorn python-multipart librosa
WORKDIR /app
VOLUME /model
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--log-level", "info"]
EOF

Then build the image and start it, using the AI container's launch command as a reference.

~/whisper$ sudo docker build -t whisper:api .
~/whisper$ sudo docker run --runtime nvidia -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,graphics -e API_KEY=whisper_api_key -dt --restart always --name whisper --network host --shm-size=8g -v /home/xxx/whisper/model:/model -v /home/xxx/whisper/app:/app whisper:api
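Once the container is running, the endpoint can be exercised from Python before wiring up the browser. The following is a minimal sketch using only the standard library; `build_transcription_request` is a hypothetical helper, and the base URL http://localhost:8000 is an assumption derived from the uvicorn port in the Dockerfile and the host network mode. Note that main.py compares the Authorization header with API_KEY as-is, so the key is sent without a Bearer prefix.

```python
import urllib.request
import uuid


def build_transcription_request(base_url: str, api_key: str, audio: bytes,
                                filename: str = "audio.wav",
                                language: str = "ja") -> urllib.request.Request:
    """Build a multipart/form-data POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain form fields read by main.py.
    for name, value in (("response_format", "json"), ("language", language)):
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    # The uploaded audio goes in the "file" field.
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode("utf-8")
        + audio + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": api_key,  # raw key, matching verify_api_key
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Usage (needs the running service and a real audio file):
#   import json
#   with open("sample.wav", "rb") as f:
#       req = build_transcription_request("http://localhost:8000",
#                                         "whisper_api_key", f.read())
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["text"])
```

Building the request separately from sending it makes it easy to inspect the headers and body when the service answers with 403 or 422.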

Integrating with the chatbot (Dify AI agent)

Create a web app (HTML and JavaScript) that records audio in the browser and sends it to the whisper API. (Chrome is assumed.)
The recognized text returned by whisper is then sent to the Dify workflow API.

<!DOCTYPE html>
<html lang="ja">
<head>
  <meta charset="UTF-8">
  <!-- Viewport setting for smartphones -->
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Speech-to-text and chatbot integration</title>
  <style>
    body {
      font-family: Arial, sans-serif;
      max-width: 800px;
      margin: 0 auto;
      padding: 20px;
    }
    button {
      padding: 10px 15px;
      margin: 5px;
      cursor: pointer;
    }
    #result, #botResponse {
      border: 1px solid #ccc;
      padding: 10px;
      margin: 10px 0;
      min-height: 100px;
      max-height: 200px;
      overflow-y: auto;
    }
    .rms-container {
      margin: 10px 0;
      position: relative;
      height: 20px;
    }
    .rms-bar {
      height: 100%;
      background: red;
      width: 0%;
      transition: width 0.1s, background 0.1s;
    }
    .rms-threshold-line {
      position: absolute;
      top: 0;
      height: 100%;
      width: 2px;
      background: blue;
      left: 50%;
    }
    label {
      display: block;
      margin: 10px 0;
    }
  </style>
</head>
<body>
  <h2>Speech-to-text and chatbot integration</h2>
  
  <div>
    <label>RMS threshold:
      <input type="range" id="threshold" min="0" max="0.1" step="0.01" value="0.05" />
      <span id="thresholdValue">0.05</span>
    </label>
    <div class="rms-container">
      <div>RMS: <span id="rmsValue">0.000</span></div>
      <div class="rms-bar" id="rmsBar"></div>
      <div class="rms-threshold-line" id="thresholdLine"></div>
    </div>
    <label>Silence detection (sec): <input type="number" id="silenceSeconds" value="1" min="0.1" max="2" step="0.1" /></label>
    <label>Segment length (sec): <input type="number" id="maxSegmentSeconds" value="10" min="1" max="10" step="1" /></label>
    <br><br>
  </div>

  <button id="start">Start recording</button>
  <button id="stop">Stop recording</button>

  <h3>Speech recognition result</h3>
  <div id="result"></div>

  <h3>Chatbot response</h3>
  <div id="botResponse"></div>

  <script>
    let stream, audioCtx, analyser, currentRecorder;
    let recording = false;
    let silenceThreshold = 0.05;
    let silenceDuration = 1000; // ms
    let maxSegmentDuration = 10000; // ms; matches the default (10) of the maxSegmentSeconds input
    let thresholdExceedDuration = 200;
    let thresholdExceedStart = null;
    let silenceStart = null, segmentTimer;
    let segmentThresholdExceeded = false;

    // Array holding the recognition results (up to the last 3)
    const resultLines = [];
    // Array holding the chatbot responses
    const botResponses = [];
    let latestText = "";

    const resultDiv = document.getElementById('result');
    const botResponseDiv = document.getElementById('botResponse');
    const rmsValueEl = document.getElementById('rmsValue');
    const rmsBarEl = document.getElementById('rmsBar');
    const thresholdLineEl = document.getElementById('thresholdLine');

    function initSettings() {
      const thresholdEl = document.getElementById('threshold');
      const thresholdValEl = document.getElementById('thresholdValue');
      thresholdEl.addEventListener('input', () => {
        silenceThreshold = parseFloat(thresholdEl.value);
        thresholdValEl.textContent = thresholdEl.value;
        updateThresholdLine();
      });
      document.getElementById('silenceSeconds').addEventListener('change', e => {
        silenceDuration = parseFloat(e.target.value) * 1000;
      });
      document.getElementById('maxSegmentSeconds').addEventListener('change', e => {
        maxSegmentDuration = parseFloat(e.target.value) * 1000;
      });
      updateThresholdLine();
    }

    function updateThresholdLine() {
      const ratio = Math.min(1, silenceThreshold / 0.1);
      thresholdLineEl.style.left = (ratio * 100) + '%';
    }

    async function startRecording() {
      stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      audioCtx = new (window.AudioContext || window.webkitAudioContext)();
      const source = audioCtx.createMediaStreamSource(stream);
      analyser = audioCtx.createAnalyser();
      analyser.fftSize = 1024;
      source.connect(analyser);
      recording = true;
      startSegment();
      monitorSilence();
    }

    function startSegment() {
      currentRecorder = new MediaRecorder(stream);
      segmentThresholdExceeded = false;
      let audioChunks = [];
      currentRecorder.ondataavailable = e => {
        audioChunks.push(e.data);
      };
      currentRecorder.onstop = () => {
        if (!segmentThresholdExceeded) {
          if (recording) startSegment();
          return;
        }
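        // Note: Chrome's MediaRecorder normally produces WebM/Opus, so the 'audio/wav'
        // label is nominal; decoding relies on the server detecting the real format.
        const blob = new Blob(audioChunks, { type: 'audio/wav' });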
        sendToWhisperAPI(blob);
        if (recording) startSegment();
      };
      currentRecorder.start();
      segmentTimer = setTimeout(() => {
        if (currentRecorder.state === 'recording') {
          currentRecorder.stop();
          clearTimeout(segmentTimer);
        }
      }, maxSegmentDuration);
    }

    function monitorSilence() {
      const data = new Uint8Array(analyser.fftSize);
      analyser.getByteTimeDomainData(data);
      let sum = 0;
      for (let i = 0; i < data.length; i++) {
        const val = (data[i] - 128) / 128;
        sum += val * val;
      }
      const rms = Math.sqrt(sum / data.length);
      rmsValueEl.textContent = rms.toFixed(3);
      const barWidth = Math.min(100, (rms / 0.1) * 100);
      rmsBarEl.style.width = barWidth + '%';
      rmsBarEl.style.background = (rms < silenceThreshold) ? 'red' : 'green';
      if (rms >= silenceThreshold) {
        if (thresholdExceedStart === null) {
          thresholdExceedStart = Date.now();
        }
        if (Date.now() - thresholdExceedStart >= thresholdExceedDuration) {
          segmentThresholdExceeded = true;
        }
      } else {
        thresholdExceedStart = null;
      }
      if (rms < silenceThreshold) {
        if (!silenceStart) {
          silenceStart = Date.now();
        } else if (Date.now() - silenceStart > silenceDuration) {
          if (currentRecorder && currentRecorder.state === 'recording') {
            currentRecorder.stop();
            clearTimeout(segmentTimer);
          }
          silenceStart = null;
        }
      } else {
        silenceStart = null;
      }
      if (recording) requestAnimationFrame(monitorSilence);
    }

    function stopRecording() {
      recording = false;
      if (currentRecorder && currentRecorder.state === 'recording') {
        currentRecorder.stop();
      }
      if (stream) {
        stream.getTracks().forEach(track => track.stop());
      }
      if (audioCtx) {
        audioCtx.close();
      }
    }

    async function sendToChatbotAPI(text) {
      const url = 'http://dify/v1/workflows/run';
      const payload = {
        inputs: {
          input: text
        },
        response_mode: "blocking",
        user: "node-red"
      };
      try {
        const response = await fetch(url, {
          method: 'POST',
          headers: {
            'Content-Type': 'application/json',
            'Authorization': 'Bearer app-1234abc'
          },
          body: JSON.stringify(payload)
        });
        if (!response.ok) throw new Error(`HTTP error: ${response.status}`);
        const data = await response.json();
        
        // Get the chatbot response
        const botResponse = data.data.outputs.text || "Could not get a response";
        
        // Append the response to the array and refresh the display
        botResponses.push(botResponse);
        if (botResponses.length > 3) botResponses.shift();
        botResponseDiv.innerHTML = botResponses.join('<br><hr><br>');
        botResponseDiv.scrollTop = botResponseDiv.scrollHeight;
        
        return botResponse;
      } catch (error) {
        console.error('Chatbot API communication error:', error);
        const errorMessage = 'The chatbot failed to respond.';
        botResponses.push(errorMessage);
        if (botResponses.length > 3) botResponses.shift();
        botResponseDiv.innerHTML = botResponses.join('<br><hr><br>');
        return errorMessage;
      }
    }

    async function sendToWhisperAPI(blob) {
      const formData = new FormData();
      formData.append('file', blob, 'audio.wav');
      formData.append('response_format', 'json');
      fetch('http://whisper/v1/audio/transcriptions', {
        method: 'POST',
        headers: { 'Authorization': 'whisper_api_key' },
        body: formData
      })
      .then(res => res.json())
      .then(async data => {
        if (data.text) {
          latestText = data.text;
          resultLines.push(data.text);
          if (resultLines.length > 3) resultLines.shift();
          resultDiv.innerHTML = resultLines.join('<br>');
          resultDiv.scrollTop = resultDiv.scrollHeight;
          
          // Send the transcription result to the chatbot API
          await sendToChatbotAPI(data.text);
        }
      })
      .catch(err => console.error('Speech recognition error:', err));
    }

    document.getElementById('start').addEventListener('click', startRecording);
    document.getElementById('stop').addEventListener('click', stopRecording);
    window.addEventListener('load', initSettings);
  </script>
</body>
</html>
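The call that sendToChatbotAPI makes can also be reproduced outside the browser, which is handy for testing the Dify workflow on its own. This is a rough sketch: both helper names are hypothetical, the host http://dify and the key app-1234abc are the placeholders from the page above, and the output key "text" depends on how the workflow's end node is defined.

```python
import json
import urllib.request


def build_workflow_request(base_url: str, app_key: str, text: str,
                           user: str = "node-red") -> urllib.request.Request:
    """Build a blocking POST to the workflow run endpoint used by the web app."""
    payload = {
        "inputs": {"input": text},     # "input" is the workflow's input variable
        "response_mode": "blocking",   # wait for the full result, no streaming
        "user": user,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/workflows/run",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {app_key}",
        },
    )


def extract_reply(body: dict, fallback: str = "(no reply)") -> str:
    """Pull data.outputs.text out of the response, tolerating missing keys."""
    outputs = (body.get("data") or {}).get("outputs") or {}
    return outputs.get("text") or fallback


req = build_workflow_request("http://dify", "app-1234abc", "Turn off the light")
print(req.full_url)
print(extract_reply({"data": {"outputs": {"text": "The light is off."}}}))

# To actually send it:
#   with urllib.request.urlopen(req) as resp:
#       print(extract_reply(json.load(resp)))
```

Unlike the JavaScript version, `extract_reply` does not raise when `data` or `outputs` is missing, which makes failed workflow runs easier to tell apart from network errors.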

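I had Claude 3.7 Sonnet write it, but the look was below what I had imagined.
To have spoken audio recognized as sentences, I had it implement the following features.
・RMS threshold: the sound level above which audio counts as speech. In a quiet place 0.01 is fine.
・Silence detection: when the input stays silent for the configured number of seconds (the speaker is judged to have finished), the recording is sent to the whisper API.
・Segment length: the number of seconds after which the recording is sent to whisper regardless.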
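How the three settings interact is easier to see as a small state machine. The sketch below is a Python re-statement of the logic in monitorSilence and startSegment, not the page's actual code: the class name `Segmenter` is made up, time is passed in explicitly instead of being read from a clock, and the 200 ms "voiced" requirement corresponds to thresholdExceedDuration in the JavaScript.

```python
from typing import Optional


class Segmenter:
    """Decide when a recording segment ends and whether it is worth sending."""

    def __init__(self, threshold: float = 0.05, silence_ms: int = 1000,
                 max_segment_ms: int = 10000, voiced_ms: int = 200) -> None:
        self.threshold = threshold
        self.silence_ms = silence_ms
        self.max_segment_ms = max_segment_ms
        self.voiced_ms = voiced_ms
        self.segment_start: Optional[int] = None
        self.loud_since: Optional[int] = None
        self.quiet_since: Optional[int] = None
        self.voiced = False

    def feed(self, now_ms: int, rms: float) -> Optional[str]:
        """Return "send" or "discard" when the segment ends, otherwise None."""
        if self.segment_start is None:
            self.segment_start = now_ms
        if rms >= self.threshold:
            if self.loud_since is None:
                self.loud_since = now_ms
            # Only sound that stays loud long enough counts as speech.
            if now_ms - self.loud_since >= self.voiced_ms:
                self.voiced = True
            self.quiet_since = None
        else:
            self.loud_since = None
            if self.quiet_since is None:
                self.quiet_since = now_ms
        silence_hit = (self.quiet_since is not None
                       and now_ms - self.quiet_since > self.silence_ms)
        too_long = now_ms - self.segment_start >= self.max_segment_ms
        if not (silence_hit or too_long):
            return None
        action = "send" if self.voiced else "discard"
        # Start the next segment immediately, as startSegment() does.
        self.segment_start = now_ms
        self.voiced = False
        self.quiet_since = None
        return action


seg = Segmenter()
events = []
for t in range(0, 3000, 100):          # one RMS reading every 100 ms
    rms = 0.2 if t <= 500 else 0.0     # 0.6 s of speech, then silence
    action = seg.feed(t, rms)
    if action:
        events.append((t, action))
print(events)
```

In this run the speech segment is sent once the silence has lasted longer than one second, and the following segment, which contains only silence, is discarded rather than sent to whisper.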

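The Dify workflow contains an agent, and that agent is configured with an MCP server running inside my home (also built with Dify).
Because it relies on Function Calling, I use qwen3:14b, which supports tools (Function Calling). A model that is too weak will not call the tool (MCP), so it is better to choose a reasonably capable model.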

For the agent strategy I use Agent Strategies (Support MCP Tools).

For the MCP server I use the Dify plugin "mcp-server".
A workflow that turns the lights off is registered there. Give the workflow a name and description that the LLM can easily recognize.

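It is my own setup, so it may not be much of a reference, but the workflow that turns the lights off is the simple one below.
It only POSTs infrared data to my homemade learning remote server.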

The learning remote is a Raspberry Pi Zero 2 WH (64-bit Raspbian) fitted with Bit Trade One's IR remote HAT. The container app running on it is published below.

Conclusion

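There is a slight lag, but I was able to run speech-to-text on the Jetson Orin Nano 8GB and connect it to my existing chatbot server. Everything, including Kubernetes, runs locally.
Response time is still an issue, but I would like to keep experimenting to see what other uses it is good for.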
