
GPT-OSS 20B Performance Comparison (MLX vs GGUF + Flash Attention)

Posted at 2025-08-16

TL;DR

GGUF + Flash Attention looks like the best option.

Machine spec

Apple M1 Max
64 GB

Test condition

Runtime

LM Studio: 0.3.23 (Build 3)
Metal llama.cpp: v1.46.0
LM Studio MLX: v0.22.2
Harmony: v0.3.4

Models

- MLX 8bit
    - https://huggingface.co/lmstudio-community/gpt-oss-20b-MLX-8bit
- MLX 4bit
    - https://huggingface.co/nightmedia/gpt-oss-20b-q4-hi-mlx
- GGUF MXFP4
    - https://huggingface.co/lmstudio-community/gpt-oss-20b-GGUF
| Model | File size |
| --- | --- |
| MLX 8bit | 22.26 GB |
| MLX 4bit | 13.11 GB |
| GGUF MXFP4 | 12.11 GB |

Model settings

Context window: 32768

Results

Measurements were taken while various other apps were running, so treat the memory usage figures as rough estimates.

Test #1 Code Review (20k context)

| Model | #1 | #2 | Output Quality | RAM Usage (System) |
| --- | --- | --- | --- | --- |
| MLX 8bit | 29.75 tok/sec<br>5609 tokens<br>90.57s to first token | 28.80 tok/sec<br>4959 tokens<br>80.18s to first token | Great | 65% |
| MLX 4bit | 33.37 tok/sec<br>4324 tokens<br>86.27s to first token | 33.86 tok/sec<br>5686 tokens<br>87.70s to first token | Bad | 56% |
| GGUF MXFP4 (w/o Flash Attention) | 9.33 tok/sec<br>4081 tokens<br>79.15s to first token | 9.44 tok/sec<br>5028 tokens<br>78.39s to first token | Great | 59% |
| GGUF MXFP4 (w/ Flash Attention) | 39.47 tok/sec<br>4400 tokens<br>43.55s to first token | 38.16 tok/sec<br>5016 tokens<br>46.09s to first token | Great | 55% |

I was surprised by how different the time to first token is between MLX and GGUF.

MLX 4bit flagged "typos" in code that wasn't actually wrong.
Its inference speed also loses to GGUF with Flash Attention, so there may be no reason to use it.
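As a quick back-of-the-envelope check, here is the Flash Attention speedup computed in plain Python from the Test #1 run #1 figures above:

```python
# GGUF MXFP4 figures from Test #1 (run #1) in the table above
no_fa_tps, fa_tps = 9.33, 39.47      # tok/sec without / with Flash Attention
no_fa_ttft, fa_ttft = 79.15, 43.55   # seconds to first token

# Generation throughput speedup from enabling Flash Attention
gen_speedup = fa_tps / no_fa_tps
# Prompt-processing (time-to-first-token) speedup
ttft_speedup = no_fa_ttft / fa_ttft

print(f"generation: {gen_speedup:.2f}x faster")   # ~4.2x
print(f"first token: {ttft_speedup:.2f}x faster") # ~1.8x
```

So on this 20k-context workload, Flash Attention helps prompt processing noticeably and generation throughput dramatically.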

MLX 4bit's answer:

| Issue | Current code | Fix / suggestion |
| --- | --- | --- |
| `__future__` import typo | `from __future__ import annotations` | `from __future__ import annotations` (or simply `import __future__`). |
| argparse typo | `import argparse` | `import argparse`. |
| Headers name | "User-Agent" vs "User-Agent". | The correct header is User-Agent; case-insensitive but standard. Keep consistent. |

Test #2 Summarization (4k input token)

| Model | #1 | #2 | Output Quality |
| --- | --- | --- | --- |
| MLX 8bit | 40.23 tok/sec<br>1029 tokens<br>17.10s to first token | 40.87 tok/sec<br>955 tokens<br>16.01s to first token | Great |
| MLX 4bit | 51.61 tok/sec<br>1130 tokens<br>14.31s to first token | 50.15 tok/sec<br>1016 tokens<br>14.96s to first token | Bad |
| GGUF MXFP4 (w/o Flash Attention) | 26.41 tok/sec<br>1836 tokens<br>6.56s to first token | 29.64 tok/sec<br>1022 tokens<br>0.45s to first token | Great |
| GGUF MXFP4 (w/ Flash Attention) | 54.60 tok/sec<br>982 tokens<br>5.97s to first token | 57.16 tok/sec<br>1003 tokens<br>0.37s to first token | Great |

MLX 4bit ignored the constraints given in the System Prompt.

Conclusion

I had assumed MLX would be the best choice on a Mac, but GGUF with Flash Attention enabled turned out to be better.

Even when matching the bit width between MLX and GGUF, GGUF was faster and its answers were more accurate.
(Is the MLX conversion causing problems? Or is MXFP4 just that good?)

I had been using MLX 8bit daily, but I'm switching to GGUF + Flash Attention.

Notes

Flash Attention can be toggled on the model-load screen.
Note that the option only appears for the GGUF version.
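If you run llama.cpp directly instead of through LM Studio, the equivalent switch is the `--flash-attn` (`-fa`) flag. A minimal invocation sketch; the model path is a placeholder, and flag syntax may vary between llama.cpp versions:

```shell
# Hypothetical model path; -c matches the 32768 context window used above.
# -fa / --flash-attn enables Flash Attention (GGUF / llama.cpp only).
llama-cli -m ./gpt-oss-20b-MXFP4.gguf -c 32768 -fa \
  -p "Review the following code: ..."
```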

[Screenshot 2025-08-17 at 1.26.10.jpg: Flash Attention toggle on the model-load screen]
