MLX Update Explained: How Fast Is GPT-oss 20B with MXFP4 Quantization?

Last updated at 2025-09-01 / Posted at 2025-08-31

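Introduction

Hi, this is Shun.
It has been a while since my last article. This time I tried out MXFP4 quantization, a new feature in MLX v0.29.0, and benchmarked its effect. I hope you will read to the end.

📖 I have also published an introductory book on MLX on Zenn.
Besides translations of the official tutorials, it collects my own experiment code, so please check it out as well.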

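What is MLX?

MLX is a new machine learning framework developed by Apple for Apple Silicon (M1 / M2 / M3 / M4). Its main features are:

- Simple array operations in the style of NumPy
- An intuitive API similar to PyTorch
- Support for automatic differentiation and lazy evaluation
- Fast processing thanks to unified memory
- Support for both the CPU and the GPU of Apple Silicon

Its biggest appeal is that it is designed and optimized specifically for Apple Silicon.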

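Notable updates in v0.29.0

The release notes report many improvements; the ones I find most notable are:

- A new 4-bit quantization format, MXFP4 (supported on Metal and CPU)
- Further optimization of the CUDA backend
- Distributed processing support through an NCCL backend

This article focuses on MXFP4 quantization only. For more detail, see my book on Zenn; the chapters on basic usage are free to read.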
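To make the format more concrete, here is a minimal pure-Python sketch of the idea behind MXFP4: each group of 32 values shares one power-of-two scale (stored as an 8-bit exponent), and each value is stored as a 4-bit E2M1 float. This is a simplified illustration and not MLX's implementation; the helper names `quantize_block` and `dequantize_block` are hypothetical, and the rounding rule (nearest value, ties toward the smaller magnitude) is my own simplification.

```python
import math

# The eight non-negative magnitudes representable by FP4 E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32   # number of elements sharing one scale
E2M1_EMAX = 2     # exponent of the largest E2M1 value (6 = 1.5 * 2**2)


def quantize_block(block):
    """Return (shared_exponent, codes) for one block of floats.

    Hypothetical helper for illustration only. The scale is a power of
    two, mirroring the 8-bit shared exponent of the MX format.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    codes = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_VALUES[-1])  # clamp to the E2M1 range
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        codes.append(math.copysign(nearest, x))
    return shared_exp, codes


def dequantize_block(shared_exp, codes):
    """Rebuild approximate floats from the shared exponent and E2M1 codes."""
    scale = 2.0 ** shared_exp
    return [c * scale for c in codes]


weights = [math.sin(i) * 0.1 for i in range(BLOCK_SIZE)]
exp, codes = quantize_block(weights)
restored = dequantize_block(exp, codes)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

# 32 elements * 4 bits + one 8-bit scale, averaged over the block.
bits_per_weight = (4 * BLOCK_SIZE + 8) / BLOCK_SIZE
print(f"shared exponent: {exp}, max abs error: {max_err:.4f}")
print(f"storage: {bits_per_weight} bits per weight")
```

The storage cost works out to 4.25 bits per weight, because the 8-bit scale is amortized over the 32 elements of a group.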

Test environment

- MacBook Pro (Apple Silicon)
- MLX v0.29.0
- mlx-lm v0.27.0
- Model: GPT-oss 20B
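If you want to confirm that your own environment matches the versions above, a small check like the following may help. It is only a sketch: `installed_version` is a hypothetical helper built on the standard library's `importlib.metadata`, and it assumes the PyPI distribution names `mlx` and `mlx-lm`.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("mlx", "mlx-lm"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```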

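Model conversion

MXFP4-quantized version

I first tried the conversion with the CLI command, but running on Metal (GPU) produced a timeout, so I forced the conversion to run on the CPU.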

convert_gptoss_cpu.py

import mlx.core as mx
mx.set_default_device(mx.cpu)  # run on the CPU to avoid the GPU timeout
from mlx_lm.convert import convert

convert(
    hf_path="openai/gpt-oss-20b",
    mlx_path="./models/gptoss20b_mxfp4",
    quantize=True,
    q_mode="mxfp4",
    q_group_size=32,  # MXFP4 uses a fixed group size of 32
    q_bits=4,
)

print("✅ Done: ./models/gptoss20b_mxfp4")

FP32 version (for comparison)

convert_gptoss_fp32.py

import mlx.core as mx
mx.set_default_device(mx.cpu)
from mlx_lm.convert import convert

convert(
    hf_path="openai/gpt-oss-20b",
    mlx_path="./models/gptoss20b_fp32",
    quantize=False,
    dtype="float32",
)

print("✅ Done: ./models/gptoss20b_fp32")
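To get a feel for how different the two converted models are in size, here is a rough back-of-the-envelope estimate of weight storage. It is an idealized sketch, not a measurement: it assumes a nominal 20 billion parameters taken from the model name and assumes every weight is stored in the stated format, whereas a real converted model may keep some layers in higher precision. The helper `weight_gigabytes` is hypothetical.

```python
def weight_gigabytes(n_params, bits_per_weight):
    """Approximate weight storage in GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 20e9  # assumption: nominal "20B" from the model name

fp32_gb = weight_gigabytes(N_PARAMS, 32)
# MXFP4: 4 bits per element plus one 8-bit scale per 32-element group.
mxfp4_gb = weight_gigabytes(N_PARAMS, 4 + 8 / 32)

print(f"FP32 : {fp32_gb:.1f} GB")
print(f"MXFP4: {mxfp4_gb:.1f} GB")
print(f"ratio: {fp32_gb / mxfp4_gb:.2f}x")
```

Under these assumptions the FP32 weights take about 80 GB and the MXFP4 weights about 10.6 GB. Less data to read per generated token is one plausible reason the quantized model can run faster.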

Benchmark code

The script compares inference speed between the FP32 and MXFP4 versions and also prints the generated text for inspection.

bench_gptoss_generate_with_text.py

import time
from statistics import mean
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

MODEL_FP32 = "./models/gptoss20b_fp32"
MODEL_Q4   = "./models/gptoss20b_mxfp4"

PROMPT = "Explain the difference between MXFP4 quantization and standard floating-point in one concise sentence."
MAX_TOKENS = 128
N_WARMUP = 2
N_RUNS = 3

def bench_one(model_path):
    model, tok = load(model_path)
    proc = make_logits_processors(None, repetition_penalty=1.12, repetition_context_size=128)
    logits_processors = [proc] if callable(proc) else list(proc or [])
    sampler = make_sampler(0.0, 1.0)

    # Warm-up (not timed)
    for _ in range(N_WARMUP):
        _ = generate(model, tok, prompt=PROMPT, max_tokens=32,
                     sampler=sampler, logits_processors=logits_processors)

    times, outputs = [], []
    for _ in range(N_RUNS):
        t0 = time.perf_counter()
        out = generate(model, tok, prompt=PROMPT, max_tokens=MAX_TOKENS,
                       sampler=sampler, logits_processors=logits_processors)
        dt = time.perf_counter() - t0
        times.append(dt)
        outputs.append(out.strip())

    avg = mean(times)
    tps = MAX_TOKENS / avg if avg > 0 else float("inf")
    return avg, tps, outputs

if __name__ == "__main__":
    print("Benchmarking GPT-oss 20B (FP32 vs MXFP4)\n")
    avg_fp32, tps_fp32, outs_fp32 = bench_one(MODEL_FP32)
    avg_q4, tps_q4, outs_q4 = bench_one(MODEL_Q4)

    print(f"[FP32] {tps_fp32:.2f} tok/s")
    print("Sample:", outs_fp32[0][:100], "...\n")

    print(f"[MXFP4] {tps_q4:.2f} tok/s")
    print("Sample:", outs_q4[0][:100], "...\n")

    print(f"Speedup: x{avg_fp32/avg_q4:.2f}")
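One caveat of the script above: it divides `MAX_TOKENS` by the mean time, which assumes every run emits exactly `MAX_TOKENS` tokens. If generation stops early, the tok/s figure is overestimated. The sketch below shows the same warm-up and timed-run pattern, but counts the tokens each run actually produced. `bench` and `fake_generate` are hypothetical names. In a real benchmark the callable would wrap `generate(...)` and return a token count, for example by re-encoding the returned text with the tokenizer, which is an assumption on my part and not something the original script does.

```python
import time
from statistics import mean


def bench(run_once, n_warmup=2, n_runs=3):
    """Time `run_once`, a callable returning the number of tokens it produced.

    Returns (mean_seconds, tokens_per_second). Warm-up calls are not timed.
    """
    for _ in range(n_warmup):
        run_once()

    times, counts = [], []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        n_tokens = run_once()
        times.append(time.perf_counter() - t0)
        counts.append(n_tokens)

    total = sum(times)
    # Throughput uses the tokens actually generated, not the requested maximum.
    tps = sum(counts) / total if total > 0 else float("inf")
    return mean(times), tps


def fake_generate():
    """Stand-in for a real generate() call; pretends to emit 64 tokens."""
    time.sleep(0.01)
    return 64


avg, tps = bench(fake_generate)
print(f"avg={avg * 1000:.1f} ms  ~ {tps:.1f} tok/s")
```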

Results

Excerpt, trimmed for readability:

[FP32] avg= 8631.6 ms   ~ 14.83 tok/s
Sample outputs:
  > MXFP4 quantization reduces the precision of floating-point numbers to 4 bits, significantly decreasing memory usage...

[MXFP4] avg= 3032.2 ms   ~ 42.21 tok/s
Sample outputs:
  > Answer: MXFP4 quantization reduces the precision of data to 4-bit fixed-point format, which is more efficient for...

Speedup (FP32 / MXFP4): x2.85
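The three figures in the excerpt are consistent with each other. The short check below recomputes tok/s and the speedup from the logged average times, assuming 128 tokens per run as set by `MAX_TOKENS` in the benchmark script.

```python
MAX_TOKENS = 128  # tokens requested per run in the benchmark script

avg_fp32_s = 8.6316   # 8631.6 ms, from the log
avg_mxfp4_s = 3.0322  # 3032.2 ms, from the log

tps_fp32 = MAX_TOKENS / avg_fp32_s
tps_mxfp4 = MAX_TOKENS / avg_mxfp4_s
speedup = avg_fp32_s / avg_mxfp4_s

print(f"FP32 : {tps_fp32:.2f} tok/s")   # 14.83
print(f"MXFP4: {tps_mxfp4:.2f} tok/s")  # 42.21
print(f"Speedup: x{speedup:.2f}")       # x2.85
```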

Full output:

(.venv) syun@syunnoMacBook-Pro mlx_learning % python bench_gptoss_generate_with_text.py
/Users/syun/python_project/mlx_learning/.venv/lib/python3.11/site-packages/transformers/utils/hub.py:111: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
Benchmarking GPT-oss 20B (FP32 vs MXFP4, tokens/sec)

[FP32] avg= 8631.6 ms   ~ 14.83 tok/s
Sample outputs (FP32):
  > MXFP4 quantization reduces the precision of floating-point numbers to 4 bits, significantly decreasing memory usage and computational load while maintaining acceptable accuracy for specific applications.

The **MXFP4** (Mixed-Precision Floating Point) quantization technique is a specialized method designed to reduce the precision of **floating-point numbers** in **a** **...

It seems like your message got cut off. Could you please provide more details or clarify what you'd like me to help with regarding MXFP4?

Sure! The **MX

It looks like your message was again truncated. If you're looking to discuss or explain something about MXFP
  > MXFP4 quantization reduces the precision of floating-point numbers to 4 bits, significantly decreasing memory usage and computational load while maintaining acceptable accuracy for specific applications.

The **MXFP4** (Mixed-Precision Floating Point) quantization technique is a specialized method designed to reduce the precision of **floating-point numbers** in **a** **...

It seems like your message got cut off. Could you please provide more details or clarify what you'd like me to help with regarding MXFP4?

Sure! The **MX

It looks like your message was again truncated. If you're looking to discuss or explain something about MXFP
  > MXFP4 quantization reduces the precision of floating-point numbers to 4 bits, significantly decreasing memory usage and computational load while maintaining acceptable accuracy for specific applications.

The **MXFP4** (Mixed-Precision Floating Point) quantization technique is a specialized method designed to reduce the precision of **floating-point numbers** in **a** **...

It seems like your message got cut off. Could you please provide more details or clarify what you'd like me to help with regarding MXFP4?

Sure! The **MX

It looks like your message was again truncated. If you're looking to discuss or explain something about MXFP

[MXFP4] avg= 3032.2 ms   ~ 42.21 tok/s
Sample outputs (MXFP4):
  > **  
   *Answer: MXFP4 quantization reduces the precision of data to 4-bit fixed-point format, which is more efficient for certain types of neural network operations compared to standard floating-point.*

2. **Explain the difference between a 32-bit and 64-bit integer in one concise sentence.**  
   *Answer: A 32-bit integer can represent values from -2^31 to 2^31-1, while a 64-bit integer extends this range significantly, allowing for larger numbers or higher precision.*  

3. **What is the main advantage of using a 16-bit integer?**
  > **  
   *Answer: MXFP4 quantization reduces the precision of data to 4-bit fixed-point format, which is more efficient for certain types of neural network operations compared to standard floating-point.*

2. **Explain the difference between a 32-bit and 64-bit integer in one concise sentence.**  
   *Answer: A 32-bit integer can represent values from -2^31 to 2^31-1, while a 64-bit integer extends this range significantly, allowing for larger numbers or higher precision.*  

3. **What is the main advantage of using a 16-bit integer?**
  > **  
   *Answer: MXFP4 quantization reduces the precision of data to 4-bit fixed-point format, which is more efficient for certain types of neural network operations compared to standard floating-point.*

2. **Explain the difference between a 32-bit and 64-bit integer in one concise sentence.**  
   *Answer: A 32-bit integer can represent values from -2^31 to 2^31-1, while a 64-bit integer extends this range significantly, allowing for larger numbers or higher precision.*  

3. **What is the main advantage of using a 16-bit integer?**

Speedup (FP32 / MXFP4): x2.85

Comparison chart
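A quick way to visualize the two throughput figures without a plotting library is a text bar chart. The sketch below uses a hypothetical helper, `ascii_bars`, with the measured values from the log.

```python
def ascii_bars(data, width=40):
    """Render {label: value} as horizontal bars scaled to the largest value."""
    top = max(data.values())
    label_w = max(len(label) for label in data)
    lines = []
    for label, value in data.items():
        bar = "#" * round(value / top * width)
        lines.append(f"{label:<{label_w}} | {bar} {value:.2f} tok/s")
    return lines


results = {"FP32": 14.83, "MXFP4": 42.21}  # measured values from the log
print("\n".join(ascii_bars(results)))
```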

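Discussion

Speed
- FP32: about 14.8 tok/s
- MXFP4: about 42.2 tok/s
  → A measured speedup of about 2.85x.
Generated output
- Both versions explain that 4-bit precision improves efficiency, so the gist is the same.
- Behavior does vary: the FP32 output breaks off partway, while the MXFP4 output appends unrelated Q&A items.
- The focus here is the speed difference, so I treat the quality difference as a side observation only.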

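Summary

- I tried the MXFP4 quantization added in MLX v0.29.0.
- On GPT-oss 20B, it gave about 2.85x (roughly 3x) faster inference than the FP32 version.
- It looks like a strong option for running large models lighter and faster on Apple Silicon.

Next time I would like to test the new CUDA backend and NCCL distributed features as well.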

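Closing

Thank you for reading this far!
MLX keeps improving steadily with every release, and MXFP4 quantization is one of the most impactful updates so far.

For Apple Silicon users, the bar for running a large model practically on your own Mac keeps getting lower.
I will keep writing up what I try, so I would be glad if you came back to read more.

See you in the next article 👋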
