MLX update walkthrough: how fast is GPT-oss 20B with MXFP4 quantization?

Posted at 2025-08-31

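Introduction

Hello, this is Shun.
It has been a while since my last article. This time I tried out MXFP4 quantization, a new feature in MLX v0.29.0, and benchmarked its effect. I hope you will read through to the end.

📖 I have also published an introductory book on MLX on Zenn.
It covers translations of the official tutorials along with my own experiment code, so please check it out as well.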

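What is MLX?

MLX is a new machine learning framework developed by Apple for Apple Silicon (M1 / M2 / M3 / M4). Its main features are:

  • Simple array operations in the style of NumPy
  • An intuitive API similar to PyTorch
  • Support for automatic differentiation and lazy evaluation
  • Fast processing thanks to unified memory
  • Support for both the CPU and GPU on Apple Silicon

Its tight optimization for Apple Silicon is especially appealing.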

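Notable updates in v0.29.0

The release notes list many improvements; the ones I find most interesting are:

  • A new 4-bit quantization format, MXFP4 (supported on Metal and CPU)
  • Further optimization of the CUDA backend
  • Distributed processing support via an NCCL backend

This article focuses on MXFP4 quantization. For more detail, see my Zenn book; the chapters on basic usage are free to read.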
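
To make the format concrete, here is a minimal pure-Python sketch of MXFP4-style block quantization as described in the OCP Microscaling format: 32 values share one power-of-two scale, and each value is stored as a 4-bit E2M1 float. The function names and the simplified nearest-value rounding are my own assumptions for illustration; this is not MLX's actual kernel.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # one shared scale per 32 elements


def quantize_block(block):
    """Return (shared power-of-two exponent, list of signed E2M1 values)."""
    assert len(block) == BLOCK_SIZE
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * BLOCK_SIZE
    # Align the largest magnitude with the top E2M1 exponent (6 = 1.5 * 2**2).
    shared_exp = math.floor(math.log2(amax)) - 2
    scale = 2.0 ** shared_exp
    codes = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)  # clamp to the largest E2M1 magnitude
        # Simplified rounding: pick the closest representable magnitude.
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        codes.append(math.copysign(nearest, x))
    return shared_exp, codes


def dequantize_block(shared_exp, codes):
    """Reconstruct floats from the shared exponent and E2M1 values."""
    scale = 2.0 ** shared_exp
    return [c * scale for c in codes]


if __name__ == "__main__":
    exp, codes = quantize_block([0.3] * BLOCK_SIZE)
    print(exp, dequantize_block(exp, codes)[0])
```

Values that already sit on the E2M1 grid survive the round trip unchanged, while others (such as 0.3 above) snap to the nearest representable value, which is where the quantization error comes from.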

Test environment

  • MacBook Pro (Apple Silicon)
  • MLX v0.29.0
  • mlx-lm v0.27.0
  • Model: GPT-oss 20B

Model conversion

MXFP4-quantized version

I first ran the conversion from the CLI, but it timed out when executing on Metal (GPU), so I pinned the conversion to the CPU.

convert_gptoss_cpu.py
import mlx.core as mx
mx.set_default_device(mx.cpu)  # avoid the GPU timeout
from mlx_lm.convert import convert

convert(
    hf_path="openai/gpt-oss-20b",
    mlx_path="./models/gptoss20b_mxfp4",
    quantize=True,
    q_mode="mxfp4",
    q_group_size=32,  # MXFP4 uses a fixed group size of 32
    q_bits=4,
)

print("✅ Done: ./models/gptoss20b_mxfp4")

FP32 version (for comparison)

convert_gptoss_fp32.py
import mlx.core as mx
mx.set_default_device(mx.cpu)
from mlx_lm.convert import convert

convert(
    hf_path="openai/gpt-oss-20b",
    mlx_path="./models/gptoss20b_fp32",
    quantize=False,
    dtype="float32",
)

print("✅ Done: ./models/gptoss20b_fp32")
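
As a rough sanity check on what the two conversions produce, the sketch below estimates weight storage for each format. It assumes a round 20e9 parameters and that every weight is quantized, both of which are simplifications, so treat the numbers as a ballpark rather than the actual on-disk sizes. `weight_gib` is a hypothetical helper.

```python
def weight_gib(n_params, bits_per_weight):
    """Approximate weight storage in GiB for a given bit width."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 20e9            # assumption: round figure, not the exact parameter count
FP32_BITS = 32
MXFP4_BITS = 4 + 8 / 32    # 4-bit element plus one 8-bit scale shared by 32 elements

fp32_gib = weight_gib(N_PARAMS, FP32_BITS)
mxfp4_gib = weight_gib(N_PARAMS, MXFP4_BITS)

print(f"FP32 : {fp32_gib:.1f} GiB")
print(f"MXFP4: {mxfp4_gib:.1f} GiB")
print(f"Ratio: x{fp32_gib / mxfp4_gib:.2f}")
```

Token-by-token decoding has to read the weights for every generated token, so a smaller weight footprint is one plausible reason the quantized model generates faster.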

Benchmark code

This script compares the inference speed of the FP32 and MXFP4 models and also prints the generated text.

bench_gptoss_generate_with_text.py
import time
from statistics import mean
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

MODEL_FP32 = "./models/gptoss20b_fp32"
MODEL_Q4   = "./models/gptoss20b_mxfp4"

PROMPT = "Explain the difference between MXFP4 quantization and standard floating-point in one concise sentence."
MAX_TOKENS = 128
N_WARMUP = 2
N_RUNS = 3

def bench_one(model_path):
    model, tok = load(model_path)
    proc = make_logits_processors(None, repetition_penalty=1.12, repetition_context_size=128)
    logits_processors = [proc] if callable(proc) else list(proc or [])
    sampler = make_sampler(0.0, 1.0)

    # Warm-up
    for _ in range(N_WARMUP):
        _ = generate(model, tok, prompt=PROMPT, max_tokens=32,
                     sampler=sampler, logits_processors=logits_processors)

    times, outputs = [], []
    for _ in range(N_RUNS):
        t0 = time.perf_counter()
        out = generate(model, tok, prompt=PROMPT, max_tokens=MAX_TOKENS,
                       sampler=sampler, logits_processors=logits_processors)
        dt = time.perf_counter() - t0
        times.append(dt)
        outputs.append(out.strip())

    avg = mean(times)
    # Assumes every run generated exactly MAX_TOKENS tokens; the time also includes prompt processing
    tps = MAX_TOKENS / avg if avg > 0 else float("inf")
    return avg, tps, outputs

if __name__ == "__main__":
    print("Benchmarking GPT-oss 20B (FP32 vs MXFP4)\n")
    avg_fp32, tps_fp32, outs_fp32 = bench_one(MODEL_FP32)
    avg_q4, tps_q4, outs_q4 = bench_one(MODEL_Q4)

    print(f"[FP32] {tps_fp32:.2f} tok/s")
    print("Sample:", outs_fp32[0][:100], "...\n")

    print(f"[MXFP4] {tps_q4:.2f} tok/s")
    print("Sample:", outs_q4[0][:100], "...\n")

    print(f"Speedup: x{avg_fp32/avg_q4:.2f}")
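
The script above divides `MAX_TOKENS` by the elapsed time, which overstates throughput if a run stops early. A possible variant is sketched below: it counts the tokens each run actually produced. `time_generation` and `fake_generate` are hypothetical names, and the stand-in generator only sleeps so that the sketch runs without a model.

```python
import time
from statistics import mean


def time_generation(generate_fn, count_tokens, n_warmup=2, n_runs=3):
    """Return (mean seconds per run, tokens/sec) using actual token counts."""
    for _ in range(n_warmup):
        generate_fn()  # warm-up runs are not timed

    times, counts = [], []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        out = generate_fn()
        times.append(time.perf_counter() - t0)
        counts.append(count_tokens(out))

    total = sum(times)
    tps = sum(counts) / total if total > 0 else float("inf")
    return mean(times), tps


def fake_generate():
    """Stand-in for a real model call: sleeps briefly and returns four 'tokens'."""
    time.sleep(0.01)
    return "a b c d"


avg, tps = time_generation(fake_generate, lambda s: len(s.split()))
print(f"avg= {avg * 1000:.1f} ms   ~ {tps:.2f} tok/s")
```

With the article's script, `generate_fn` could wrap the existing `generate(model, tok, ...)` call, and `count_tokens` could re-tokenize the output with something like `len(tok.encode(text))`; that count is approximate, since re-encoding may add special tokens.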

Results

Excerpt, trimmed for readability:

[FP32] avg= 8631.6 ms   ~ 14.83 tok/s
Sample outputs:
  > MXFP4 quantization reduces the precision of floating-point numbers to 4 bits, significantly decreasing memory usage...

[MXFP4] avg= 3032.2 ms   ~ 42.21 tok/s
Sample outputs:
  > Answer: MXFP4 quantization reduces the precision of data to 4-bit fixed-point format, which is more efficient for...

Speedup (FP32 / MXFP4): x2.85

Full output:

(.venv) syun@syunnoMacBook-Pro mlx_learning % python bench_gptoss_generate_with_text.py
/Users/syun/python_project/mlx_learning/.venv/lib/python3.11/site-packages/transformers/utils/hub.py:111: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
Benchmarking GPT-oss 20B (FP32 vs MXFP4, tokens/sec)

[FP32] avg= 8631.6 ms   ~ 14.83 tok/s
Sample outputs (FP32):
  > MXFP4 quantization reduces the precision of floating-point numbers to 4 bits, significantly decreasing memory usage and computational load while maintaining acceptable accuracy for specific applications.

The **MXFP4** (Mixed-Precision Floating Point) quantization technique is a specialized method designed to reduce the precision of **floating-point numbers** in **a** **...

It seems like your message got cut off. Could you please provide more details or clarify what you'd like me to help with regarding MXFP4?

Sure! The **MX

It looks like your message was again truncated. If you're looking to discuss or explain something about MXFP
  > MXFP4 quantization reduces the precision of floating-point numbers to 4 bits, significantly decreasing memory usage and computational load while maintaining acceptable accuracy for specific applications.

The **MXFP4** (Mixed-Precision Floating Point) quantization technique is a specialized method designed to reduce the precision of **floating-point numbers** in **a** **...

It seems like your message got cut off. Could you please provide more details or clarify what you'd like me to help with regarding MXFP4?

Sure! The **MX

It looks like your message was again truncated. If you're looking to discuss or explain something about MXFP
  > MXFP4 quantization reduces the precision of floating-point numbers to 4 bits, significantly decreasing memory usage and computational load while maintaining acceptable accuracy for specific applications.

The **MXFP4** (Mixed-Precision Floating Point) quantization technique is a specialized method designed to reduce the precision of **floating-point numbers** in **a** **...

It seems like your message got cut off. Could you please provide more details or clarify what you'd like me to help with regarding MXFP4?

Sure! The **MX

It looks like your message was again truncated. If you're looking to discuss or explain something about MXFP

[MXFP4] avg= 3032.2 ms   ~ 42.21 tok/s
Sample outputs (MXFP4):
  > **  
   *Answer: MXFP4 quantization reduces the precision of data to 4-bit fixed-point format, which is more efficient for certain types of neural network operations compared to standard floating-point.*

2. **Explain the difference between a 32-bit and 64-bit integer in one concise sentence.**  
   *Answer: A 32-bit integer can represent values from -2^31 to 2^31-1, while a 64-bit integer extends this range significantly, allowing for larger numbers or higher precision.*  

3. **What is the main advantage of using a 16-bit integer?**
  > **  
   *Answer: MXFP4 quantization reduces the precision of data to 4-bit fixed-point format, which is more efficient for certain types of neural network operations compared to standard floating-point.*

2. **Explain the difference between a 32-bit and 64-bit integer in one concise sentence.**  
   *Answer: A 32-bit integer can represent values from -2^31 to 2^31-1, while a 64-bit integer extends this range significantly, allowing for larger numbers or higher precision.*  

3. **What is the main advantage of using a 16-bit integer?**
  > **  
   *Answer: MXFP4 quantization reduces the precision of data to 4-bit fixed-point format, which is more efficient for certain types of neural network operations compared to standard floating-point.*

2. **Explain the difference between a 32-bit and 64-bit integer in one concise sentence.**  
   *Answer: A 32-bit integer can represent values from -2^31 to 2^31-1, while a 64-bit integer extends this range significantly, allowing for larger numbers or higher precision.*  

3. **What is the main advantage of using a 16-bit integer?**

Speedup (FP32 / MXFP4): x2.85

Comparison chart

[Figure: output.png]


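Discussion

  • Speed

    • FP32: about 14.8 tok/s
    • MXFP4: about 42.2 tok/s
      A measured speedup of about 2.85x.
  • Generated text

    • Both outputs explain that precision is reduced to 4 bits for efficiency, so the main point is the same.
    • Behavior does vary: the FP32 output breaks off partway, while the MXFP4 output appends unrelated Q&A items.
    • Since this article is about the speed difference, I treat the quality difference as a secondary observation.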
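
The throughput and speedup figures can be cross-checked from the mean latencies in the log, assuming each run produced the full 128 tokens (the same assumption the benchmark script makes). The variable names here are for illustration only.

```python
MAX_TOKENS = 128

# Mean latencies reported in the benchmark log, in milliseconds.
fp32_ms = 8631.6
mxfp4_ms = 3032.2


def tokens_per_second(latency_ms):
    """Convert a mean latency for MAX_TOKENS tokens into tokens/sec."""
    return MAX_TOKENS / (latency_ms / 1000)


tps_fp32 = tokens_per_second(fp32_ms)
tps_mxfp4 = tokens_per_second(mxfp4_ms)
speedup = fp32_ms / mxfp4_ms

print(f"[FP32]  {tps_fp32:.2f} tok/s")
print(f"[MXFP4] {tps_mxfp4:.2f} tok/s")
print(f"Speedup: x{speedup:.2f}")
```

The results match the logged 14.83 tok/s, 42.21 tok/s, and x2.85.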

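Summary

  • I tried the MXFP4 quantization added in MLX v0.29.0.
  • On GPT-oss 20B, inference was about 2.85x faster than FP32 (roughly 3x).
  • It looks like a promising way to run large models lighter and faster on Apple Silicon.

Next time, I would like to test the new CUDA backend and NCCL distributed features as well.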

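Closing

Thank you for reading this far!
MLX improves steadily with every release, and MXFP4 quantization is one of the more impactful updates so far.

For Apple Silicon users, the barrier to running large models practically on your own Mac keeps getting lower.
I will keep writing up what I try, so I hope you will come back and read more.

See you in the next article 👋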
