Benchmarking gpt-oss:120b (RTX 4090 48GB + RTX 5090 32GB = 80GB)

Posted at 2025-10-03

80GB video memory configuration

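As shown in the figure below, the setup uses two GPUs, one installed in the PC and one connected externally, and both are recognized by the system.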

The 48GB RTX 4090 is covered in a separate article:
"Building a machine to run gpt-oss:120b" (gpt-oss:120bを動かすマシンの作成)

nvidia-smi

PS C:\Users\mitsuyasukentaro\Downloads\llama-b6673-bin-win-cuda-12.4-x64> nvidia-smi
Fri Oct  3 13:51:15 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 581.42                 Driver Version: 581.42         CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090      WDDM  |   00000000:01:00.0  On |                  N/A |
|  0%   44C    P0             73W /  598W |    1035MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090      WDDM  |   00000000:02:00.0 Off |                  Off |
| 30%   30C    P8             12W /  450W |       0MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
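As a quick sanity check on the 80GB figure, the memory totals reported by nvidia-smi above can be added up and compared with the model size. The sketch below is arithmetic only: the per-GPU totals are copied from the output above, the 59.02 GiB model size is taken from the llama-bench output later in this article, and the helper name `total_vram_gib` is made up for illustration. It does not account for the KV cache or runtime overhead, so the headroom is an upper bound.

```python
# Total VRAM per GPU in MiB, copied from the nvidia-smi output above.
VRAM_MIB = {"RTX 5090": 32607, "RTX 4090": 49140}

# Model size in GiB as reported by llama-bench later in this article.
MODEL_SIZE_GIB = 59.02


def total_vram_gib(vram_mib):
    """Sum the per-GPU totals (MiB) and convert the result to GiB."""
    return sum(vram_mib.values()) / 1024


total = total_vram_gib(VRAM_MIB)
headroom = total - MODEL_SIZE_GIB  # ignores KV cache and runtime overhead

print(f"total VRAM: {total:.2f} GiB")
print(f"headroom after model weights: {headroom:.2f} GiB")
```

Under these assumptions the two cards provide just under 80 GiB in total, leaving roughly 20 GiB beyond the model weights.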

Ollama performance

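Based on touch-sp's article "OllamaのベンチマークをとるプログラムをOllama Python Libraryを使って書き換えました" (a rewrite of an Ollama benchmarking program using the Ollama Python Library), I measured generation speed (tokens/second) with the following Python code.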

from ollama import Client

client = Client(
  host="http://127.0.0.1:11434"
)

def client_chat():
    response = client.chat(
        model="gpt-oss:120b",
        messages=[
            {
                "role": "user",
                "content": "how to make gui with 3 buttons in pyside6"
            },
        ],
        options={
            "num_ctx": 8192,
        },
    )

    # eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    return response.eval_count, response.eval_duration

if __name__ == "__main__":

    total_tokens = 0
    total_time = 0
    for _ in range(3):
        eval_count, eval_duration = client_chat()
        total_tokens += eval_count
        total_time += eval_duration
    # eval_duration is in nanoseconds, so multiply by 10**9 to get tokens per second.
    rate = (total_tokens / total_time) * 10**9

    print(f"tokens per second: {rate:.2f} tokens/second")

The result was as follows.

tokens per second: 135.34 tokens/second
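The figure above is an aggregate over three runs: total generated tokens divided by total generation time. To see how much individual runs vary, the same conversion can be applied per run. The sketch below is a minimal illustration, assuming durations in nanoseconds as Ollama reports them; the helper name `tokens_per_second` and the sample values are hypothetical and are not measurements from this article. The same helper should also work for the `prompt_eval_count` and `prompt_eval_duration` fields of the response if prompt processing speed is of interest.

```python
NS_PER_SECOND = 10**9


def tokens_per_second(token_count, duration_ns):
    """Convert a token count and a duration in nanoseconds to tokens/second."""
    if duration_ns <= 0:
        raise ValueError("duration_ns must be positive")
    return token_count / duration_ns * NS_PER_SECOND


# Hypothetical (eval_count, eval_duration) pairs, NOT results from this article.
sample_runs = [(1000, 8_000_000_000), (1500, 10_000_000_000)]

# Per-run rates show the spread between runs.
per_run = [tokens_per_second(count, ns) for count, ns in sample_runs]

# Aggregate rate, computed the same way as in the benchmark script above.
aggregate = tokens_per_second(
    sum(count for count, _ in sample_runs),
    sum(ns for _, ns in sample_runs),
)

for i, rate in enumerate(per_run, start=1):
    print(f"run {i}: {rate:.2f} tokens/second")
print(f"aggregate: {aggregate:.2f} tokens/second")
```

Note that the aggregate weights each run by its duration, so it is not the same as the plain average of the per-run rates.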

llama.cpp performance

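I increased the prompt size step by step up to 65536 tokens and saw no drastic drop in performance: prompt processing peaked at about 5264 t/s at pp4096 and was still about 3461 t/s at pp65536, while token generation (tg128) ran at about 177 t/s.
This is fast enough to make practical use of gpt-oss:120b.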

PS C:\Users\mitsuyasukentaro\Downloads\llama-b6673-bin-win-cuda-12.4-x64> .\llama-bench -ngl 999 -m "C:\Users\mitsuyasukentaro\AppData\Local\llama.cpp\ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf" -fa 1 -mmp 0 -p 512,1024,2048,4096,8192,16384,32768,65536
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from C:\Users\mitsuyasukentaro\Downloads\llama-b6673-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from C:\Users\mitsuyasukentaro\Downloads\llama-b6673-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Users\mitsuyasukentaro\Downloads\llama-b6673-bin-win-cuda-12.4-x64\ggml-cpu-icelake.dll
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA,RPC   | 999 |  1 |    0 |           pp512 |      4141.10 ± 50.31 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA,RPC   | 999 |  1 |    0 |          pp1024 |      4784.26 ± 30.82 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA,RPC   | 999 |  1 |    0 |          pp2048 |      5123.90 ± 26.35 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA,RPC   | 999 |  1 |    0 |          pp4096 |      5263.70 ± 11.01 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA,RPC   | 999 |  1 |    0 |          pp8192 |      5179.07 ± 21.21 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA,RPC   | 999 |  1 |    0 |         pp16384 |      4873.97 ± 22.30 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA,RPC   | 999 |  1 |    0 |         pp32768 |       4324.36 ± 9.41 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA,RPC   | 999 |  1 |    0 |         pp65536 |      3460.73 ± 27.31 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA,RPC   | 999 |  1 |    0 |           tg128 |        177.13 ± 1.92 |

build: d64c8104 (6673)
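To put the prompt processing numbers in perspective, the sketch below expresses each result as a fraction of the peak throughput and estimates how long a prompt of that size would take to process. The mean t/s values are copied from the llama-bench table above (the ± spread is dropped); the helper names are made up for illustration, and the time estimate simply divides prompt length by the measured throughput, which is an assumption rather than a separate measurement.

```python
# Mean prompt processing throughput (t/s) per prompt size,
# copied from the llama-bench table above.
PP_TPS = {
    512: 4141.10,
    1024: 4784.26,
    2048: 5123.90,
    4096: 5263.70,
    8192: 5179.07,
    16384: 4873.97,
    32768: 4324.36,
    65536: 3460.73,
}


def relative_to_peak(results):
    """Return each throughput as a fraction of the highest one."""
    peak = max(results.values())
    return {size: tps / peak for size, tps in results.items()}


def prefill_seconds(prompt_tokens, tps):
    """Estimate the time to process a prompt at a given throughput."""
    return prompt_tokens / tps


relative = relative_to_peak(PP_TPS)
for size, tps in PP_TPS.items():
    seconds = prefill_seconds(size, tps)
    print(f"pp{size:<6} {tps:8.2f} t/s  {relative[size]:6.1%} of peak  ~{seconds:5.2f} s")
```

By this estimate, throughput at 65536 tokens is roughly two thirds of the peak, and processing a prompt of that size takes on the order of 19 seconds.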
