Benchmarking gpt-oss:120b (RTX 4090 48GB + RTX 5090 32GB = 80GB)


The 80GB video memory configuration

As shown in the figure below, the system was set up so that both GPUs are recognized: one installed inside the PC and one attached via an external connection.

(Figure: GPU configuration with one card inside the PC and one attached via an external connection)

The 48GB RTX 4090 was covered in an earlier article: gpt-oss:120bを動かすマシンの作成 (Building a machine to run gpt-oss:120b).

nvidia-smi

PS C:\Users\mitsuyasukentaro\Downloads\llama-b6673-bin-win-cuda-12.4-x64> nvidia-smi
Fri Oct  3 13:51:15 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 581.42                 Driver Version: 581.42         CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090      WDDM  |   00000000:01:00.0  On |                  N/A |
|  0%   44C    P0             73W /  598W |    1035MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090      WDDM  |   00000000:02:00.0 Off |                  Off |
| 30%   30C    P8             12W /  450W |       0MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
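nvidia-smi shows both cards, 32607 MiB on the RTX 5090 and 49140 MiB on the RTX 4090, for roughly 80GB in total. As a quick cross-check from Python, a minimal sketch like the following (assuming a CUDA-enabled PyTorch build; device indices may not match nvidia-smi ordering) enumerates the devices and sums their memory:

import torch

# Enumerate CUDA devices and add up their total VRAM.
total_gib = 0.0
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    gib = props.total_memory / 1024**3
    total_gib += gib
    print(f"GPU {i}: {props.name} - {gib:.1f} GiB")

print(f"total VRAM: {total_gib:.1f} GiB")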

Ollama performance

Following touch-sp's article 「OllamaのベンチマークをとるプログラムをOllama Python Libraryを使って書き換えました」 (rewriting an Ollama benchmark program using the Ollama Python Library), I ran a speed benchmark (tokens/second) with the following Python code.

from ollama import Client

client = Client(host="http://127.0.0.1:11434")

def client_chat():
    response = client.chat(
        model="gpt-oss:120b",
        messages=[
            {
                "role": "user",
                "content": "how to make gui with 3 buttons in pyside6",
            },
        ],
        options={
            "num_ctx": 8192,
        },
    )
    # eval_count is the number of generated tokens,
    # eval_duration is the generation time in nanoseconds.
    return response.eval_count, response.eval_duration

if __name__ == "__main__":

    total_tokens = 0
    total_time = 0
    # Average over 3 runs.
    for _ in range(3):
        eval_count, eval_duration = client_chat()
        total_tokens += eval_count
        total_time += eval_duration
    # eval_duration is in nanoseconds, so multiply by 10^9 to get tokens/second.
    rate = (total_tokens / total_time) * 10**9

    print(f"tokens per second: {rate:.2f} tokens/second")

The result is as follows.

tokens per second: 135.34 tokens/second
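The same response object also reports prompt_eval_count and prompt_eval_duration, so prompt-processing speed can be measured alongside generation speed. A minimal sketch along these lines (durations are in nanoseconds; this is not part of the result above):

from ollama import Client

client = Client(host="http://127.0.0.1:11434")

def client_chat_full():
    response = client.chat(
        model="gpt-oss:120b",
        messages=[{"role": "user", "content": "how to make gui with 3 buttons in pyside6"}],
        options={"num_ctx": 8192},
    )
    # prompt_eval_* covers prompt processing, eval_* covers token generation;
    # both durations are reported by Ollama in nanoseconds.
    pp_rate = response.prompt_eval_count / response.prompt_eval_duration * 10**9
    tg_rate = response.eval_count / response.eval_duration * 10**9
    return pp_rate, tg_rate

if __name__ == "__main__":
    pp, tg = client_chat_full()
    print(f"prompt processing: {pp:.2f} t/s, generation: {tg:.2f} t/s")

Note that Ollama may reuse a cached prompt when the same request is repeated against a loaded model, in which case prompt_eval_count can be very small.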

llama.cpp performance

I increased the prompt size in steps up to 65536 tokens; prompt-processing throughput falls off only gradually, with no sharp drop.
This is more than enough speed to make practical use of gpt-oss:120b.

PS C:\Users\mitsuyasukentaro\Downloads\llama-b6673-bin-win-cuda-12.4-x64> .\llama-bench -ngl 999 -m "C:\Users\mitsuyasukentaro\AppData\Local\llama.cpp\ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf" -fa 1 -mmp 0 -p 512,1024,2048,4096,8192,16384,32768,65536
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from C:\Users\mitsuyasukentaro\Downloads\llama-b6673-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from C:\Users\mitsuyasukentaro\Downloads\llama-b6673-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Users\mitsuyasukentaro\Downloads\llama-b6673-bin-win-cuda-12.4-x64\ggml-cpu-icelake.dll
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA,RPC   | 999 |  1 |    0 |           pp512 |      4141.10 ± 50.31 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA,RPC   | 999 |  1 |    0 |          pp1024 |      4784.26 ± 30.82 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA,RPC   | 999 |  1 |    0 |          pp2048 |      5123.90 ± 26.35 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA,RPC   | 999 |  1 |    0 |          pp4096 |      5263.70 ± 11.01 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA,RPC   | 999 |  1 |    0 |          pp8192 |      5179.07 ± 21.21 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA,RPC   | 999 |  1 |    0 |         pp16384 |      4873.97 ± 22.30 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA,RPC   | 999 |  1 |    0 |         pp32768 |       4324.36 ± 9.41 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA,RPC   | 999 |  1 |    0 |         pp65536 |      3460.73 ± 27.31 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA,RPC   | 999 |  1 |    0 |           tg128 |        177.13 ± 1.92 |

build: d64c8104 (6673)
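To put these numbers in perspective, they can be converted into rough wall-clock estimates (a back-of-the-envelope sketch based on the measured values above; the 1000-token response length is just an example):

# Rough wall-clock estimates from the llama-bench results above.
pp_65536 = 3460.73    # prompt processing with a 65536-token prompt, tokens/s
tg = 177.13           # token generation (tg128), tokens/s

prompt_tokens = 65536
output_tokens = 1000  # hypothetical response length

prompt_seconds = prompt_tokens / pp_65536
generate_seconds = output_tokens / tg

print(f"processing a {prompt_tokens}-token prompt: ~{prompt_seconds:.1f} s")
print(f"generating {output_tokens} tokens: ~{generate_seconds:.1f} s")
# Roughly 19 s to ingest a full 64K-token prompt and about 5.6 s per 1000 generated tokens.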