Inspired by this article.
software and hardware
- Ubuntu 24.04.3 LTS x86_64 + Intel i7-4770 + GeForce RTX 3060 LHR 12GB + Mem 16GB
- Ubuntu 24.04.3 LTS on Windows 11 x86_64(WSL2) + Ryzen 9 5900X + GeForce RTX 5060 Ti 16GB + Mem 64GB
- MacBook Air (M1, Late 2020) + Mem 16GB
driver and CUDA toolkit (native)
preparation
$ curl -LO https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4328 100 4328 0 0 47797 0 --:--:-- --:--:-- --:--:-- 48088
$ sudo apt install ./cuda-keyring_1.1-1_all.deb
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Note, selecting 'cuda-keyring' instead of './cuda-keyring_1.1-1_all.deb'
The following NEW packages will be installed:
cuda-keyring
0 upgraded, 1 newly installed, 0 to remove and 13 not upgraded.
Need to get 0 B/4,328 B of archives.
After this operation, 18.4 kB of additional disk space will be used.
Get:1 /tmp/cuda-keyring_1.1-1_all.deb cuda-keyring all 1.1-1 [4,328 B]
Selecting previously unselected package cuda-keyring.
(Reading database ... 88298 files and directories currently installed.)
Preparing to unpack .../tmp/cuda-keyring_1.1-1_all.deb ...
Unpacking cuda-keyring (1.1-1) ...
Setting up cuda-keyring (1.1-1) ...
Scanning processes...
Scanning processor microcode...
Scanning linux images...
Running kernel seems to be up-to-date.
The processor microcode seems to be up-to-date.
No services need to be restarted.
No containers need to be restarted.
No user sessions are running outdated binaries.
No VM guests are running outdated hypervisor (qemu) binaries on this host.
$ sudo apt update
$ sudo apt install cuda-drivers cuda-toolkit cudnn
$ sudo apt install build-essential clang cmake libomp-dev libcurl4-openssl-dev
$ sudo reboot
As of 2025/10/03, using the 5060 Ti natively required the open kernel driver (e.g. nvidia-driver-580-open) rather than the proprietary cuda-drivers package. Installing the proprietary driver from the .run file reportedly works, but I have not verified it.
dmesg: NVRM: installed in this system requires use of the NVIDIA open kernel modules.
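A minimal sketch for switching to the open driver (assuming the nvidia-driver-580-open package from the same CUDA apt repository; the exact driver version may differ):

$ sudo apt install nvidia-driver-580-open
$ sudo reboot
# After rebooting, check which kernel module flavor is loaded:
# the open modules report "Dual MIT/GPL", the proprietary ones "NVIDIA"
$ modinfo nvidia | grep -i license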
for WSL2
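On WSL2 the GPU driver is supplied by the Windows host, so cuda-drivers must not be installed inside the guest; only the toolkit is needed. A sketch, assuming the same cuda-keyring setup as above:

$ sudo apt update
$ sudo apt install cuda-toolkit cudnn
$ sudo apt install build-essential clang cmake libomp-dev libcurl4-openssl-dev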
check devices
nvidia-smi
$ nvidia-smi
Sat Oct 4 11:56:18 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3060 On | 00000000:01:00.0 Off | N/A |
| 0% 41C P8 13W / 180W | 1MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
add environment variables
~/.bashrc
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
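To confirm the toolkit is on the PATH (a quick check):

$ source ~/.bashrc
$ nvcc --version   # should report the CUDA 13.0 toolkit installed above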
llama.cpp
git clone
$ git clone https://github.com/ggml-org/llama.cpp.git
Cloning into 'llama.cpp'...
remote: Enumerating objects: 64283, done.
remote: Counting objects: 100% (106/106), done.
remote: Compressing objects: 100% (84/84), done.
remote: Total 64283 (delta 67), reused 22 (delta 22), pack-reused 64177 (from 3)
Receiving objects: 100% (64283/64283), 169.25 MiB | 16.09 MiB/s, done.
Resolving deltas: 100% (46688/46688), done.
Building with full parallelism on a weak machine died from OOM. In that case it is safer to omit -j entirely or cap it at around 4, as in the sketch below.
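For example, capping the build step at four parallel jobs:

$ cmake --build build --config Release -j 4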
build llama.cpp
$ cd llama.cpp/
$ cmake -B build -DGGML_CUDA=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release
-- The C compiler identification is Clang 18.1.3
-- The CXX compiler identification is Clang 18.1.3
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/clang - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/clang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMAKE_BUILD_TYPE=Release
-- Found Git: /usr/bin/git (found version "2.43.0")
-- The ASM compiler identification is Clang with GNU-like command-line
-- Found assembler: /usr/bin/clang
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- GGML_SYSTEM_ARCH: x86
-- Including CPU backend
-- Found OpenMP_C: -fopenmp=libomp (found version "5.1")
-- Found OpenMP_CXX: -fopenmp=libomp (found version "5.1")
-- Found OpenMP: TRUE (found version "5.1")
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native
-- Found CUDAToolkit: /usr/local/cuda/targets/x86_64-linux/include (found version "13.0.88")
-- CUDA Toolkit found
-- Using CUDA architectures: native
-- The CUDA compiler identification is NVIDIA 13.0.88
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- CUDA host compiler is GNU 13.3.0
-- Including CUDA backend
-- ggml version: 0.9.4
-- ggml commit: 898acba6
-- Found CURL: /usr/lib/x86_64-linux-gnu/libcurl.so (found version "8.5.0")
-- Configuring done (7.1s)
-- Generating done (0.2s)
-- Build files have been written to: /home/ubuntu/llama.cpp/build
$ cmake --build build --config Release -j
[ 0%] Building C object examples/gguf-hash/CMakeFiles/sha256.dir/deps/sha256/sha256.c.o
[ 1%] Building C object examples/gguf-hash/CMakeFiles/sha1.dir/deps/sha1/sha1.c.o
[ 1%] Building CXX object tools/mtmd/CMakeFiles/llama-llava-cli.dir/deprecation-warning.cpp.o
(output truncated)
[100%] Linking CXX executable ../../bin/llama-server
[100%] Built target llama-server
$ ./build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 6688 (898acba6)
built with Ubuntu clang version 18.1.3 (1ubuntu1) for x86_64-pc-linux-gnu
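As a quick smoke test, the llama-server binary built above can serve the model used in the benchmarks below (a sketch using standard llama-server flags; -ngl 99 offloads all layers to the GPU):

$ ./build/bin/llama-server -m ~/gpt-oss-20b-mxfp4.gguf -ngl 99 --port 8080
# exposes an OpenAI-compatible API at http://localhost:8080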
benchmark
Model: ggml-org/gpt-oss-20b-GGUF
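The GGUF file was downloaded beforehand; one way to fetch it (a sketch, assuming the huggingface_hub CLI is installed):

$ pip install -U "huggingface_hub[cli]"
$ huggingface-cli download ggml-org/gpt-oss-20b-GGUF gpt-oss-20b-mxfp4.gguf --local-dir ~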
1. RTX 3060 LHR 12GB
llama-bench
$ ./build/bin/llama-bench -m ~/gpt-oss-20b-mxfp4.gguf -fa 0 -p 512,2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | pp512 | 1941.97 ± 13.68 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | pp2048 | 1653.15 ± 2.37 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | tg128 | 95.47 ± 0.19 |
build: 898acba6 (6688)
$ ./build/bin/llama-bench -m ~/gpt-oss-20b-mxfp4.gguf -fa 1 -p 512,2048,8192
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp512 | 2277.15 ± 20.66 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp2048 | 2284.11 ± 4.67 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp8192 | 2174.17 ± 7.34 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 99.03 ± 0.07 |
build: 898acba6 (6688)
2. RTX 5060 Ti 16GB
llama-bench
$ ./build/bin/llama-bench -m ~/gpt-oss-20b-mxfp4.gguf -fa 0 -p 512,2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | pp512 | 3470.53 ± 49.19 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | pp2048 | 2913.32 ± 5.37 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | tg128 | 123.74 ± 0.75 |
build: 898acba6 (6688)
$ ./build/bin/llama-bench -m ~/gpt-oss-20b-mxfp4.gguf -fa 1 -p 512,2048,8192,32768,65536,131072
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp512 | 4136.18 ± 18.09 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp2048 | 3502.15 ± 1549.37 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp8192 | 4056.16 ± 6.62 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp32768 | 2913.48 ± 311.66 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp65536 | 2084.95 ± 77.79 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp131072 | 1342.23 ± 15.29 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 120.97 ± 2.21 |
build: 898acba6 (6688)
3. MacBook Air (M1, Late 2020)
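On Apple Silicon the Metal backend is enabled by default, so no CUDA-specific flags are needed (a minimal build sketch):

% cmake -B build -DCMAKE_BUILD_TYPE=Release
% cmake --build build --config Release -j 4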
llama-cli
% ./build/bin/llama-cli --version
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 5.958 sec
ggml_metal_device_init: GPU name: Apple M1
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 12713.12 MB
version: 6688 (898acba6)
built with Apple clang version 17.0.0 (clang-1700.3.19.1) for arm64-apple-darwin25.0.0
llama-bench
% ./build/bin/llama-bench -m ~/gpt-oss-20b-MXFP4.gguf
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.018 sec
ggml_metal_device_init: GPU name: Apple M1
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 12713.12 MB
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
/Users/hoge/src/llama.cpp/ggml/src/ggml-metal/ggml-metal-context.m:241: fatal error
(lldb) process attach --pid 57612
error: attach failed: this is a non-interactive debug session, cannot get permission to debug processes.
zsh: abort ./build/bin/llama-bench -m ~/gpt-oss-20b-MXFP4.gguf
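The abort is presumably memory pressure: the 11.27 GiB of weights plus KV cache brushes up against the M1's recommendedMaxWorkingSetSize of about 12.7 GB. If so, one untested workaround sketch would be to offload fewer layers:

% ./build/bin/llama-bench -m ~/gpt-oss-20b-MXFP4.gguf -ngl 16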
Model: lmstudio-community/gemma-3-12b-it-GGUF
llama-bench
% ./build/bin/llama-bench -m ~/gemma-3-12b-it-Q4_K_M.gguf
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.017 sec
ggml_metal_device_init: GPU name: Apple M1
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 12713.12 MB
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma3 12B Q4_K - Medium | 6.79 GiB | 11.77 B | Metal,BLAS | 4 | pp512 | 79.66 ± 0.08 |
| gemma3 12B Q4_K - Medium | 6.79 GiB | 11.77 B | Metal,BLAS | 4 | tg128 | 7.85 ± 0.01 |
build: 898acba6 (6688)
# For reference, the RTX 3060:
#| model | size | params | backend | ngl | test | t/s |
#| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
#| gemma3 12B Q4_K - Medium | 6.79 GiB | 11.77 B | CUDA | 99 | pp512 | 1236.21 ± 3.94 |
#| gemma3 12B Q4_K - Medium | 6.79 GiB | 11.77 B | CUDA | 99 | tg128 | 39.18 ± 0.03 |
notes
A site with benchmark results for the RTX 4070 and Ryzen AI MAX+ 395: