Inspired by this article.
software and hardware
- Ubuntu 24.04.3 LTS x86_64 + Intel i7-4770 + GeForce RTX 3060 LHR 12GB + Mem 16GB
- Ubuntu 24.04.3 LTS on Windows 11 x86_64(WSL2) + Ryzen 9 5900X + GeForce RTX 5060 Ti 16GB + Mem 64GB
- MacBook Air (M1, Late 2020) + Mem 16GB
driver and CUDA toolkit (native)
preparation
$ curl -LO https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4328 100 4328 0 0 47797 0 --:--:-- --:--:-- --:--:-- 48088
$ sudo apt install ./cuda-keyring_1.1-1_all.deb
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Note, selecting 'cuda-keyring' instead of './cuda-keyring_1.1-1_all.deb'
The following NEW packages will be installed:
cuda-keyring
0 upgraded, 1 newly installed, 0 to remove and 13 not upgraded.
Need to get 0 B/4,328 B of archives.
After this operation, 18.4 kB of additional disk space will be used.
Get:1 /tmp/cuda-keyring_1.1-1_all.deb cuda-keyring all 1.1-1 [4,328 B]
Selecting previously unselected package cuda-keyring.
(Reading database ... 88298 files and directories currently installed.)
Preparing to unpack .../tmp/cuda-keyring_1.1-1_all.deb ...
Unpacking cuda-keyring (1.1-1) ...
Setting up cuda-keyring (1.1-1) ...
Scanning processes...
Scanning processor microcode...
Scanning linux images...
Running kernel seems to be up-to-date.
The processor microcode seems to be up-to-date.
No services need to be restarted.
No containers need to be restarted.
No user sessions are running outdated binaries.
No VM guests are running outdated hypervisor (qemu) binaries on this host.
$ sudo apt update
$ sudo apt install cuda-drivers cuda-toolkit cudnn
$ sudo apt install build-essential clang cmake libomp-dev libcurl4-openssl-dev
$ sudo reboot
As of 2025/10/03, using the 5060 Ti natively required the open kernel driver (e.g. nvidia-driver-580-open) rather than the proprietary cuda-drivers package. Installing the proprietary driver from the .run file reportedly works, but I have not verified it.
dmesg: NVRM: installed in this system requires use of the NVIDIA open kernel modules.
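A minimal sketch for switching to the open driver (assuming the nvidia-driver-580-open package from the same CUDA apt repository; the exact driver version may differ):

$ sudo apt install nvidia-driver-580-open
$ sudo reboot
# After rebooting, check which kernel module flavor is loaded:
# the open modules report "Dual MIT/GPL", the proprietary ones "NVIDIA"
$ modinfo nvidia | grep -i license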
for WSL2
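On WSL2 the GPU driver is supplied by the Windows host, so cuda-drivers must not be installed inside the guest; only the toolkit is needed. A sketch, assuming the same cuda-keyring setup as above:

$ sudo apt update
$ sudo apt install cuda-toolkit cudnn
$ sudo apt install build-essential clang cmake libomp-dev libcurl4-openssl-dev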
check devices
nvidia-smi
$ nvidia-smi
Sat Oct 4 11:56:18 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3060 On | 00000000:01:00.0 Off | N/A |
| 0% 41C P8 13W / 180W | 1MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
add environment variables
~/.bashrc
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
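To confirm the toolkit is on the PATH (a quick check):

$ source ~/.bashrc
$ nvcc --version   # should report the CUDA 13.0 toolkit installed above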
llama.cpp
git clone
$ git clone https://github.com/ggml-org/llama.cpp.git
Cloning into 'llama.cpp'...
remote: Enumerating objects: 64283, done.
remote: Counting objects: 100% (106/106), done.
remote: Compressing objects: 100% (84/84), done.
remote: Total 64283 (delta 67), reused 22 (delta 22), pack-reused 64177 (from 3)
Receiving objects: 100% (64283/64283), 169.25 MiB | 16.09 MiB/s, done.
Resolving deltas: 100% (46688/46688), done.
Building with full parallelism on a weak machine died from OOM. In that case it is safer to omit -j entirely or cap it at around 4, as in the sketch below.
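For example, capping the build step at four parallel jobs:

$ cmake --build build --config Release -j 4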
build llama.cpp
$ cd llama.cpp/
$ cmake -B build -DGGML_CUDA=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release
-- The C compiler identification is Clang 18.1.3
-- The CXX compiler identification is Clang 18.1.3
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/clang - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/clang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMAKE_BUILD_TYPE=Release
-- Found Git: /usr/bin/git (found version "2.43.0")
-- The ASM compiler identification is Clang with GNU-like command-line
-- Found assembler: /usr/bin/clang
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- GGML_SYSTEM_ARCH: x86
-- Including CPU backend
-- Found OpenMP_C: -fopenmp=libomp (found version "5.1")
-- Found OpenMP_CXX: -fopenmp=libomp (found version "5.1")
-- Found OpenMP: TRUE (found version "5.1")
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native
-- Found CUDAToolkit: /usr/local/cuda/targets/x86_64-linux/include (found version "13.0.88")
-- CUDA Toolkit found
-- Using CUDA architectures: native
-- The CUDA compiler identification is NVIDIA 13.0.88
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- CUDA host compiler is GNU 13.3.0
-- Including CUDA backend
-- ggml version: 0.9.4
-- ggml commit: 898acba6
-- Found CURL: /usr/lib/x86_64-linux-gnu/libcurl.so (found version "8.5.0")
-- Configuring done (7.1s)
-- Generating done (0.2s)
-- Build files have been written to: /home/ubuntu/llama.cpp/build
$ cmake --build build --config Release -j
[ 0%] Building C object examples/gguf-hash/CMakeFiles/sha256.dir/deps/sha256/sha256.c.o
[ 1%] Building C object examples/gguf-hash/CMakeFiles/sha1.dir/deps/sha1/sha1.c.o
[ 1%] Building CXX object tools/mtmd/CMakeFiles/llama-llava-cli.dir/deprecation-warning.cpp.o
(output truncated)
[100%] Linking CXX executable ../../bin/llama-server
[100%] Built target llama-server
$ ./build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 6688 (898acba6)
built with Ubuntu clang version 18.1.3 (1ubuntu1) for x86_64-pc-linux-gnu
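As a quick smoke test, the llama-server binary built above can serve the model used in the benchmarks below (a sketch using standard llama-server flags; -ngl 99 offloads all layers to the GPU):

$ ./build/bin/llama-server -m ~/gpt-oss-20b-mxfp4.gguf -ngl 99 --port 8080
# exposes an OpenAI-compatible API at http://localhost:8080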
benchmark
Model: ggml-org/gpt-oss-20b-GGUF
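The GGUF file was downloaded beforehand; one way to fetch it (a sketch, assuming the huggingface_hub CLI is installed):

$ pip install -U "huggingface_hub[cli]"
$ huggingface-cli download ggml-org/gpt-oss-20b-GGUF gpt-oss-20b-mxfp4.gguf --local-dir ~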
1. RTX 3060 LHR 12GB
llama-bench
$ ./build/bin/llama-bench -m ~/gpt-oss-20b-mxfp4.gguf -fa 0 -p 512,2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | pp512 | 1941.97 ± 13.68 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | pp2048 | 1653.15 ± 2.37 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | tg128 | 95.47 ± 0.19 |
build: 898acba6 (6688)
$ ./build/bin/llama-bench -m ~/gpt-oss-20b-mxfp4.gguf -fa 1 -p 512,2048,8192
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp512 | 2277.15 ± 20.66 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp2048 | 2284.11 ± 4.67 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp8192 | 2174.17 ± 7.34 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 99.03 ± 0.07 |
build: 898acba6 (6688)
2. RTX 5060 Ti 16GB
llama-bench
$ ./build/bin/llama-bench -m ~/gpt-oss-20b-mxfp4.gguf -fa 0 -p 512,2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | pp512 | 3470.53 ± 49.19 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | pp2048 | 2913.32 ± 5.37 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | tg128 | 123.74 ± 0.75 |
build: 898acba6 (6688)
$ ./build/bin/llama-bench -m ~/gpt-oss-20b-mxfp4.gguf -fa 1 -p 512,2048,8192,32768,65536,131072
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp512 | 4136.18 ± 18.09 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp2048 | 3502.15 ± 1549.37 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp8192 | 4056.16 ± 6.62 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp32768 | 2913.48 ± 311.66 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp65536 | 2084.95 ± 77.79 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp131072 | 1342.23 ± 15.29 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 120.97 ± 2.21 |
build: 898acba6 (6688)
3. MacBook Air (M1, Late 2020)
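On Apple Silicon the Metal backend is enabled by default, so no CUDA-specific flags are needed (a minimal build sketch):

% cmake -B build -DCMAKE_BUILD_TYPE=Release
% cmake --build build --config Release -j 4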
llama-cli
% ./build/bin/llama-cli --version
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 5.958 sec
ggml_metal_device_init: GPU name: Apple M1
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 12713.12 MB
version: 6688 (898acba6)
built with Apple clang version 17.0.0 (clang-1700.3.19.1) for arm64-apple-darwin25.0.0
llama-bench
% ./build/bin/llama-bench -m ~/gpt-oss-20b-MXFP4.gguf
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.018 sec
ggml_metal_device_init: GPU name: Apple M1
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 12713.12 MB
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
/Users/hoge/src/llama.cpp/ggml/src/ggml-metal/ggml-metal-context.m:241: fatal error
(lldb) process attach --pid 57612
error: attach failed: this is a non-interactive debug session, cannot get permission to debug processes.
zsh: abort ./build/bin/llama-bench -m ~/gpt-oss-20b-MXFP4.gguf
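The abort is presumably memory pressure: the 11.27 GiB of weights plus KV cache brushes up against the M1's recommendedMaxWorkingSetSize of about 12.7 GB. If so, one untested workaround sketch would be to offload fewer layers:

% ./build/bin/llama-bench -m ~/gpt-oss-20b-MXFP4.gguf -ngl 16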
Model: lmstudio-community/gemma-3-12b-it-GGUF
llama-bench
% ./build/bin/llama-bench -m ~/gemma-3-12b-it-Q4_K_M.gguf
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.017 sec
ggml_metal_device_init: GPU name: Apple M1
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 12713.12 MB
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma3 12B Q4_K - Medium | 6.79 GiB | 11.77 B | Metal,BLAS | 4 | pp512 | 79.66 ± 0.08 |
| gemma3 12B Q4_K - Medium | 6.79 GiB | 11.77 B | Metal,BLAS | 4 | tg128 | 7.85 ± 0.01 |
build: 898acba6 (6688)
# For reference, the RTX 3060:
#| model | size | params | backend | ngl | test | t/s |
#| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
#| gemma3 12B Q4_K - Medium | 6.79 GiB | 11.77 B | CUDA | 99 | pp512 | 1236.21 ± 3.94 |
#| gemma3 12B Q4_K - Medium | 6.79 GiB | 11.77 B | CUDA | 99 | tg128 | 39.18 ± 0.03 |
notes
A site with benchmark results for the RTX 4070 and Ryzen AI MAX+ 395: