Building llama.cpp with CUDA, and benchmarking it

Posted at 2025-10-04

Inspired by this article.

software and hardware

  1. Ubuntu 24.04.3 LTS x86_64 + Intel i7-4770 + GeForce RTX 3060 LHR 12GB + Mem 16GB
  2. Ubuntu 24.04.3 LTS on Windows 11 x86_64(WSL2) + Ryzen 9 5900X + GeForce RTX 5060 Ti 16GB + Mem 64GB
  3. MacBook Air (M1, Late 2020) + Mem 16GB

driver and cuda toolkit (native)

preparation
$ curl -LO https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4328  100  4328    0     0  47797      0 --:--:-- --:--:-- --:--:-- 48088

$ sudo apt install ./cuda-keyring_1.1-1_all.deb
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Note, selecting 'cuda-keyring' instead of './cuda-keyring_1.1-1_all.deb'
The following NEW packages will be installed:
  cuda-keyring
0 upgraded, 1 newly installed, 0 to remove and 13 not upgraded.
Need to get 0 B/4,328 B of archives.
After this operation, 18.4 kB of additional disk space will be used.
Get:1 /tmp/cuda-keyring_1.1-1_all.deb cuda-keyring all 1.1-1 [4,328 B]
Selecting previously unselected package cuda-keyring.
(Reading database ... 88298 files and directories currently installed.)
Preparing to unpack .../tmp/cuda-keyring_1.1-1_all.deb ...
Unpacking cuda-keyring (1.1-1) ...
Setting up cuda-keyring (1.1-1) ...
Scanning processes...
Scanning processor microcode...
Scanning linux images...

Running kernel seems to be up-to-date.

The processor microcode seems to be up-to-date.

No services need to be restarted.

No containers need to be restarted.

No user sessions are running outdated binaries.

No VM guests are running outdated hypervisor (qemu) binaries on this host.

$ sudo apt update

$ sudo apt install cuda-drivers cuda-toolkit cudnn

$ sudo apt install build-essential clang cmake libomp-dev libcurl4-openssl-dev

$ sudo reboot

As of 2025/10/03, using the 5060 Ti natively required the open driver (e.g. nvidia-driver-580-open) rather than the proprietary cuda-drivers packages. Installing the proprietary driver from the .run file reportedly works, but I have not verified this.

dmesg: NVRM: installed in this system requires use of the NVIDIA open kernel modules.
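If that message shows up, switching to the open kernel modules should clear it. A minimal sketch, assuming nvidia-driver-580-open is available from the repository configured above:

$ sudo apt install nvidia-driver-580-open
$ sudo reboot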

for WSL2
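On WSL2 the GPU driver comes from the Windows host, so only the toolkit goes inside the distro, from the wsl-ubuntu repository. A minimal sketch, assuming the repository path from NVIDIA's CUDA-on-WSL guide (no cuda-drivers here; the Windows-side driver is used):

$ curl -LO https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb
$ sudo apt install ./cuda-keyring_1.1-1_all.deb
$ sudo apt update
$ sudo apt install cuda-toolkit cudnn
$ sudo apt install build-essential clang cmake libomp-dev libcurl4-openssl-dev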

check devices

nvidia-smi
$ nvidia-smi
Sat Oct  4 11:56:18 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        On  |   00000000:01:00.0 Off |                  N/A |
|  0%   41C    P8             13W /  180W |       1MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

add environment variables

~/.bashrc
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
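To confirm the toolkit is actually picked up from the new PATH (a quick sanity check; the path below assumes the exports above and a default /usr/local/cuda install):

$ source ~/.bashrc
$ which nvcc
/usr/local/cuda/bin/nvcc
$ nvcc --version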

llama.cpp

git clone
$ git clone https://github.com/ggml-org/llama.cpp.git
Cloning into 'llama.cpp'...
remote: Enumerating objects: 64283, done.
remote: Counting objects: 100% (106/106), done.
remote: Compressing objects: 100% (84/84), done.
remote: Total 64283 (delta 67), reused 22 (delta 22), pack-reused 64177 (from 3)
Receiving objects: 100% (64283/64283), 169.25 MiB | 16.09 MiB/s, done.
Resolving deltas: 100% (46688/46688), done.

Building with full parallelism on the weaker machine got killed by OOM. In that case it is safer to drop -j entirely or cap it at around 4, as in the sketch below.
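A capped variant of the build command used later in this section (same invocation, just -j4 instead of an unlimited -j):

$ cmake --build build --config Release -j4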

build llama.cpp
$ cd llama.cpp/

$ cmake -B build -DGGML_CUDA=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release
-- The C compiler identification is Clang 18.1.3
-- The CXX compiler identification is Clang 18.1.3
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/clang - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/clang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMAKE_BUILD_TYPE=Release
-- Found Git: /usr/bin/git (found version "2.43.0")
-- The ASM compiler identification is Clang with GNU-like command-line
-- Found assembler: /usr/bin/clang
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- GGML_SYSTEM_ARCH: x86
-- Including CPU backend
-- Found OpenMP_C: -fopenmp=libomp (found version "5.1")
-- Found OpenMP_CXX: -fopenmp=libomp (found version "5.1")
-- Found OpenMP: TRUE (found version "5.1")
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native
-- Found CUDAToolkit: /usr/local/cuda/targets/x86_64-linux/include (found version "13.0.88")
-- CUDA Toolkit found
-- Using CUDA architectures: native
-- The CUDA compiler identification is NVIDIA 13.0.88
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- CUDA host compiler is GNU 13.3.0
-- Including CUDA backend
-- ggml version: 0.9.4
-- ggml commit:  898acba6
-- Found CURL: /usr/lib/x86_64-linux-gnu/libcurl.so (found version "8.5.0")
-- Configuring done (7.1s)
-- Generating done (0.2s)
-- Build files have been written to: /home/ubuntu/llama.cpp/build

$ cmake --build build --config Release -j
[  0%] Building C object examples/gguf-hash/CMakeFiles/sha256.dir/deps/sha256/sha256.c.o
[  1%] Building C object examples/gguf-hash/CMakeFiles/sha1.dir/deps/sha1/sha1.c.o
[  1%] Building CXX object tools/mtmd/CMakeFiles/llama-llava-cli.dir/deprecation-warning.cpp.o
(output truncated)
[100%] Linking CXX executable ../../bin/llama-server
[100%] Built target llama-server

$ ./build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 6688 (898acba6)
built with Ubuntu clang version 18.1.3 (1ubuntu1) for x86_64-pc-linux-gnu
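Once the version check passes, the same binaries can be exercised directly. A sketch using the model file from the benchmarks below (-m is the GGUF path, -ngl 99 offloads all layers to the GPU, --port is where llama-server listens):

$ ./build/bin/llama-cli -m ~/gpt-oss-20b-mxfp4.gguf -ngl 99 -p "Hello"
$ ./build/bin/llama-server -m ~/gpt-oss-20b-mxfp4.gguf -ngl 99 --port 8080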

benchmark

Model: ggml-org/gpt-oss-20b-GGUF
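The GGUF file used below was downloaded beforehand. One way to fetch it (a sketch, assuming the file name matches the local one in the benchmark commands and that the huggingface_hub CLI is installed):

$ pip install -U huggingface_hub
$ huggingface-cli download ggml-org/gpt-oss-20b-GGUF gpt-oss-20b-mxfp4.gguf --local-dir ~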

1. RTX3060 LHR 12GB

llama-bench
$ ./build/bin/llama-bench -m ~/gpt-oss-20b-mxfp4.gguf -fa 0 -p 512,2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |           pp512 |      1941.97 ± 13.68 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |          pp2048 |       1653.15 ± 2.37 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |           tg128 |         95.47 ± 0.19 |

build: 898acba6 (6688)

$ ./build/bin/llama-bench -m ~/gpt-oss-20b-mxfp4.gguf -fa 1 -p 512,2048,8192
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |  1 |           pp512 |      2277.15 ± 20.66 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |  1 |          pp2048 |       2284.11 ± 4.67 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |  1 |          pp8192 |       2174.17 ± 7.34 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |  1 |           tg128 |         99.03 ± 0.07 |

build: 898acba6 (6688)

2. RTX5060Ti 16GB

llama-bench
$ ./build/bin/llama-bench -m ~/gpt-oss-20b-mxfp4.gguf -fa 0 -p 512,2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |           pp512 |      3470.53 ± 49.19 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |          pp2048 |       2913.32 ± 5.37 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |           tg128 |        123.74 ± 0.75 |

build: 898acba6 (6688)

$ ./build/bin/llama-bench -m ~/gpt-oss-20b-mxfp4.gguf -fa 1 -p 512,2048,8192,32768,65536,131072
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |  1 |           pp512 |      4136.18 ± 18.09 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |  1 |          pp2048 |    3502.15 ± 1549.37 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |  1 |          pp8192 |       4056.16 ± 6.62 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |  1 |         pp32768 |     2913.48 ± 311.66 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |  1 |         pp65536 |      2084.95 ± 77.79 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |  1 |        pp131072 |      1342.23 ± 15.29 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |  1 |           tg128 |        120.97 ± 2.21 |

build: 898acba6 (6688)

3. MacBook Air (M1, Late 2020)

llama-cli
% ./build/bin/llama-cli --version
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 5.958 sec
ggml_metal_device_init: GPU name:   Apple M1
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 12713.12 MB
version: 6688 (898acba6)
built with Apple clang version 17.0.0 (clang-1700.3.19.1) for arm64-apple-darwin25.0.0
llama-bench
% ./build/bin/llama-bench -m ~/gpt-oss-20b-MXFP4.gguf
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.018 sec
ggml_metal_device_init: GPU name:   Apple M1
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 12713.12 MB
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
/Users/hoge/src/llama.cpp/ggml/src/ggml-metal/ggml-metal-context.m:241: fatal error
(lldb) process attach --pid 57612
error: attach failed: this is a non-interactive debug session, cannot get permission to debug processes.
zsh: abort      ./build/bin/llama-bench -m ~/gpt-oss-20b-MXFP4.gguf

🤔 llama-bench with gpt-oss-20b aborts with a Metal fatal error on the M1, so I tried a different model instead.

Model: lmstudio-community/gemma-3-12b-it-GGUF

llama-bench
% ./build/bin/llama-bench -m ~/gemma-3-12b-it-Q4_K_M.gguf
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.017 sec
ggml_metal_device_init: GPU name:   Apple M1
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 12713.12 MB
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | Metal,BLAS |       4 |           pp512 |         79.66 ± 0.08 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | Metal,BLAS |       4 |           tg128 |          7.85 ± 0.01 |

build: 898acba6 (6688)

# For reference, the RTX 3060:
#| model                          |       size |     params | backend    | ngl |            test |                  t/s |
#| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
#| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |           pp512 |       1236.21 ± 3.94 |
#| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |           tg128 |         39.18 ± 0.03 |

As a note for later

A site with benchmark results for the RTX 4070 and Ryzen AI MAX+ 395
