GPT-OSS 20B Performance Comparison (MLX vs GGUF + Flash Attention)

Last updated at 2025-08-17. Posted at 2025-08-16.

TL;DR

GGUF + Flash Attention came out best.

Machine spec

Apple M1 Max
64 GB

Test condition

Runtime

LM Studio: 0.3.23 (Build 3)
Metal llama.cpp: v1.46.0
LM Studio MLX: v0.22.2
Harmony: v0.3.4

Models

- MLX 8bit
    - https://huggingface.co/lmstudio-community/gpt-oss-20b-MLX-8bit
- MLX 4bit
    - https://huggingface.co/nightmedia/gpt-oss-20b-q4-hi-mlx
- GGUF MXFP4
    - https://huggingface.co/lmstudio-community/gpt-oss-20b-GGUF

| Model | File size |
| --- | --- |
| MLX 8bit | 22.26 GB |
| MLX 4bit | 13.11 GB |
| GGUF MXFP4 | 12.11 GB |

Model settings

Context window: 32768

Results

These measurements were taken with various other apps running, so treat the memory usage figures as rough estimates only.

Test #1 Code Review (20k context)

| Model | Run #1 | Run #2 | Output Quality | RAM Usage (System) |
| --- | --- | --- | --- | --- |
| MLX 8bit | 29.75 tok/sec, 5609 tokens, 90.57s to first token | 28.80 tok/sec, 4959 tokens, 80.18s to first token | Great | 65% |
| MLX 4bit | 33.37 tok/sec, 4324 tokens, 86.27s to first token | 33.86 tok/sec, 5686 tokens, 87.70s to first token | Bad | 56% |
| GGUF MXFP4 (w/o Flash attention) | 9.33 tok/sec, 4081 tokens, 79.15s to first token | 9.44 tok/sec, 5028 tokens, 78.39s to first token | Great | 59% |
| GGUF MXFP4 (w/ Flash attention) | 39.47 tok/sec, 4400 tokens, 43.55s to first token | 38.16 tok/sec, 5016 tokens, 46.09s to first token | Great | 55% |

I was surprised by how much the time to first token differs between MLX and GGUF.
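
To put the time-to-first-token gap in perspective, the figures from the Test #1 table can be turned into a rough prompt-processing speed. The sketch below assumes the prompt is about 20,000 tokens (taken from the "20k context" label; the exact count is not reported) and that time to first token is dominated by prompt processing. The names `TTFT_SECONDS`, `ASSUMED_PROMPT_TOKENS`, and `prefill_tokens_per_second` are made up for this illustration.

```python
from statistics import mean

# Seconds to first token for runs #1 and #2 of Test #1, copied from the table.
TTFT_SECONDS = {
    "MLX 8bit": (90.57, 80.18),
    "MLX 4bit": (86.27, 87.70),
    "GGUF MXFP4 (w/o Flash attention)": (79.15, 78.39),
    "GGUF MXFP4 (w/ Flash attention)": (43.55, 46.09),
}

# Assumption: roughly 20,000 prompt tokens ("20k context"); not an exact figure.
ASSUMED_PROMPT_TOKENS = 20_000


def prefill_tokens_per_second(ttft_runs, prompt_tokens=ASSUMED_PROMPT_TOKENS):
    """Approximate prompt-processing speed: prompt tokens / mean time to first token."""
    return prompt_tokens / mean(ttft_runs)


if __name__ == "__main__":
    for name, runs in TTFT_SECONDS.items():
        print(
            f"{name}: mean TTFT {mean(runs):.2f}s, "
            f"~{prefill_tokens_per_second(runs):.0f} prompt tok/sec"
        )
```

Because the prompt length is only an estimate, the absolute numbers are approximate, but the ratios between configurations do not depend on it.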

MLX 4bit flags "typos" in code that contains no mistakes.
Its inference speed also loses to GGUF with Flash Attention, so there may be no reason to use it.
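
The generation-speed comparison can be made concrete by averaging the two Test #1 runs and taking ratios. This is only a small arithmetic sketch over the table values; `GENERATION_TOK_PER_SEC` and `speedup` are names introduced here for illustration.

```python
from statistics import mean

# Generation speed (tok/sec) for runs #1 and #2 of Test #1, copied from the table.
GENERATION_TOK_PER_SEC = {
    "MLX 8bit": (29.75, 28.80),
    "MLX 4bit": (33.37, 33.86),
    "GGUF MXFP4 (w/o Flash attention)": (9.33, 9.44),
    "GGUF MXFP4 (w/ Flash attention)": (39.47, 38.16),
}


def speedup(faster, baseline, table=GENERATION_TOK_PER_SEC):
    """Ratio of the mean tok/sec of `faster` to the mean tok/sec of `baseline`."""
    return mean(table[faster]) / mean(table[baseline])


if __name__ == "__main__":
    fa = "GGUF MXFP4 (w/ Flash attention)"
    for baseline in GENERATION_TOK_PER_SEC:
        if baseline != fa:
            print(f"{fa} vs {baseline}: {speedup(fa, baseline):.2f}x")
```

With only two runs per configuration, these ratios indicate the size of the gap rather than a precise measurement.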

MLX 4bit's answer

| Issue | Current code | Fix / suggestion |
| --- | --- | --- |
| `__future__` import typo | `from __future__ import annotations` | `from __future__ import annotations` (or simply `import __future__`). |
| `argparse` typo | `import argparse` | `import argparse`. |
| Headers name | `"User-Agent"` vs `"User-Agent"`. The correct header is `User-Agent`; case‑insensitive but standard. | Keep consistent. |

Test #2 Summarization (4k input tokens)

| Model | Run #1 | Run #2 | Output Quality |
| --- | --- | --- | --- |
| MLX 8bit | 40.23 tok/sec, 1029 tokens, 17.10s to first token | 40.87 tok/sec, 955 tokens, 16.01s to first token | Great |
| MLX 4bit | 51.61 tok/sec, 1130 tokens, 14.31s to first token | 50.15 tok/sec, 1016 tokens, 14.96s to first token | Bad |
| GGUF MXFP4 (w/o Flash attention) | 26.41 tok/sec, 1836 tokens, 6.56s to first token | 29.64 tok/sec, 1022 tokens, 0.45s to first token | Great |
| GGUF MXFP4 (w/ Flash attention) | 54.60 tok/sec, 982 tokens, 5.97s to first token | 57.16 tok/sec, 1003 tokens, 0.37s to first token | Great |

MLX 4bit ignored the constraints given in the system prompt.
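
Since output lengths differ between runs, one way to compare the Test #2 results is to estimate the wall-clock time for a fixed-length summary as time to first token plus output tokens divided by generation speed. The sketch below uses only the run #1 values and assumes a 1,000-token output and that the reported tok/sec excludes the time to first token; both are assumptions, and `RUN1`, `ASSUMED_OUTPUT_TOKENS`, and `estimated_total_seconds` are names invented for this example.

```python
# (tok/sec, seconds to first token) for run #1 of Test #2, copied from the table.
RUN1 = {
    "MLX 8bit": (40.23, 17.10),
    "MLX 4bit": (51.61, 14.31),
    "GGUF MXFP4 (w/o Flash attention)": (26.41, 6.56),
    "GGUF MXFP4 (w/ Flash attention)": (54.60, 5.97),
}

# Assumption: normalize every configuration to the same 1,000-token output.
ASSUMED_OUTPUT_TOKENS = 1_000


def estimated_total_seconds(tok_per_sec, ttft_seconds, output_tokens=ASSUMED_OUTPUT_TOKENS):
    """Estimated wall-clock time: time to first token plus generation time."""
    return ttft_seconds + output_tokens / tok_per_sec


if __name__ == "__main__":
    for name, (tok_per_sec, ttft) in RUN1.items():
        total = estimated_total_seconds(tok_per_sec, ttft)
        print(f"{name}: ~{total:.1f}s for {ASSUMED_OUTPUT_TOKENS} output tokens")
```

Run #2 is left out here because the GGUF rows show sub-second time to first token on the second run, which would make the totals hard to compare with the MLX rows.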

Conclusion

I had assumed MLX would be the best choice on a Mac, but GGUF with Flash Attention enabled turned out to be better.

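Even with the bit width matched between MLX and GGUF, GGUF was faster and its answers were more accurate.
(Is the conversion to MLX hurting quality? Or is MXFP4 just that good?)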

I had been using MLX 8bit as my daily driver, but I am switching to GGUF + Flash Attention.

Notes

Flash Attention can be toggled on the model load screen.
Note that the option only appears for GGUF models.
