If You're Playing with LLMs on a Mac, It Has to Be mlx_lm! ~Inference Edition~

Last updated at 2025-01-24 / Posted at 2025-01-23

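Introduction

I mentioned before that I took 大規模言語モデル2024 (Large Language Models 2024), a course run by the Matsuo-Iwasawa Lab at the University of Tokyo.
I'm "still" experimenting on Google Colab. It has been about a month since the course ended, and I'm suffering from K崎-san withdrawal. (<- the course's project leader, who is amazingly good at facilitation)

But that's not the point. The Google Colab costs are starting to add up, and when I look for alternatives there is Kaggle and various other options. Hopping from service to service like a migratory bird seems inefficient, though...
I'm doing other things while training runs anyway, so does training time really matter that much? That line of thought is how I landed on using mlx_lm.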

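What got me started

Well, I bought a new MacBook Air. At the New Year's sale! Until now I had an M1 Air with 16GB. It's still perfectly comfortable, but I bought one anyway.
An M3 with 24GB of memory.
M4 Macs are already out, but I deliberately went with the M3.
I want an M4, but the Mac mini, while fine on price, isn't portable, and the MacBook Pro is heavy. In other words, it has to be an Air. So I deliberately went with the M3.

Yep, that's what started all this. lol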

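Inference with mlx_lm

If you look at the page linked here, you can most likely get inference working

The rough flow is as follows

For inference from the command line
Pass the model name and the prompt as command-line arguments (that's all)

For inference in Jupyter Notebook

Convert a model uploaded to Hugging Face into mlx_lm format
Run inference with the mlx_lm library

That's super easy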

Create a virtual environment

This time I set up the following environment

Python==3.11.8
mlx_lm==0.21.1

↑ First time I've tried writing it this way. lol

I built the environment with venv and pip.
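To confirm that the activated environment really has these versions, a small check like the following can be run inside it. This is a minimal sketch: `installed_version` and `describe_environment` are hypothetical helper names, and it assumes the package is registered under the PyPI distribution name `mlx-lm`.

```python
import sys
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def describe_environment(package_names):
    """Collect the Python version and the versions of the given packages."""
    python_version = ".".join(str(part) for part in sys.version_info[:3])
    report = {"Python": python_version}
    for name in package_names:
        report[name] = installed_version(name) or "not installed"
    return report


if __name__ == "__main__":
    # "mlx-lm" is assumed to be the distribution name used by pip.
    for name, version in describe_environment(["mlx-lm"]).items():
        print(f"{name}=={version}")
```

If the package shows up as "not installed", the virtual environment is probably not the one that is currently active.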

Alright, let's run inference

Exciting

Inference from the command line

Running inference from the command line

mlx_lm.generate --model ikedachin/dummy-llm-00B --prompt "USER: おはようございます。 ASSISTANT: " --max-tokens 1024 --ignore-chat-template

Like this.
ikedachin/dummy-llm-00B is a dummy name, so replace it with the model you want to use.
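If you would rather launch the same command from a Python script or a notebook cell, one option is to assemble the argument list in code. The sketch below only builds and prints the command; `build_generate_command` is a hypothetical helper name, and the model name is still the dummy one.

```python
import shlex


def build_generate_command(model, prompt, max_tokens=1024, ignore_chat_template=True):
    """Build the argument list for the mlx_lm.generate command-line tool."""
    command = [
        "mlx_lm.generate",
        "--model", model,
        "--prompt", prompt,
        "--max-tokens", str(max_tokens),
    ]
    if ignore_chat_template:
        command.append("--ignore-chat-template")
    return command


command = build_generate_command(
    "ikedachin/dummy-llm-00B",  # dummy repository name, replace with a real one
    "USER: おはようございます。 ASSISTANT: ",
)
# Print the shell-quoted form of the command.
print(shlex.join(command))
```

To actually run it, the list can be passed to `subprocess.run(command, check=True)`. Passing a list rather than a single string avoids shell quoting problems with the spaces in the prompt.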

Inference in a Notebook

Let's get things ready.

1. Convert to mlx_lm format

The Python code is written up here on GitHub.

Converting to mlx_lm format from the command line

mlx_lm.convert --hf-path ikedachin/dummy-llm-00B -q

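This creates a folder named mlx_model in the current folder and places the safetensors files in it.
That's all.

You can also write it as shown below,
but it fails with an error on the second run, so it's better to add an if statement or similar.

Converting to mlx_lm format in Jupyter Notebook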

from mlx_lm import convert

convert(
    hf_path="ikedachin/dummy-llm-00B",  # Hugging Face repository (this one is a dummy)
    mlx_path="dummy-llm-00B",  # Local output folder (set explicitly so it doesn't clash with the command-line run)
    dtype="bfloat16",  # Got a bit fancy and went with bfloat16
)
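The error on the second run is presumably raised because the output folder already exists. One way to add the "if" guard is sketched below. `convert_once` is a hypothetical wrapper name, and the conversion function is passed in as an argument so that the guard itself does not depend on mlx_lm.

```python
from pathlib import Path


def convert_once(convert_fn, hf_path, mlx_path, **convert_kwargs):
    """Run the conversion only when the output folder does not exist yet.

    Returns True if the conversion ran, False if it was skipped.
    """
    if Path(mlx_path).exists():
        print(f"{mlx_path} already exists, skipping conversion")
        return False
    convert_fn(hf_path=hf_path, mlx_path=mlx_path, **convert_kwargs)
    return True


# Usage in the notebook (assumes mlx_lm is installed):
# from mlx_lm import convert
# convert_once(convert, "ikedachin/dummy-llm-00B", "dummy-llm-00B", dtype="bfloat16")
```

With this guard the cell can be re-run safely. To force a fresh conversion, delete the output folder first.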

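By the way, you can find the arguments of the convert function by looking at this part of convert.py. I've added notes for the parts I understand. I haven't tried all of them, so please leave a comment if anything is wrong.

Argument settings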

import argparse


def configure_parser() -> argparse.ArgumentParser:
    """
    Configures and returns the argument parser for the script.

    Returns:
        argparse.ArgumentParser: Configured argument parser.
    """
    parser = argparse.ArgumentParser(
        description="Convert Hugging Face model to MLX format"
    )
    
    # Hugging Face repository to convert
    parser.add_argument("--hf-path", type=str, help="Path to the Hugging Face model.")
    
    # Where to save the model converted to mlx_lm format
    parser.add_argument(
        "--mlx-path", type=str, default="mlx_model", help="Path to save the MLX model." 
    )
    # Pass this flag to convert to a quantized model
    parser.add_argument(
        "-q", "--quantize", help="Generate a quantized model.", action="store_true"
    )
    # Not quite sure about this one (it appears to be how many weights share one scale/bias when quantizing)
    parser.add_argument(
        "--q-group-size", help="Group size for quantization.", type=int, default=64
    )
    # Number of bits per weight used for quantization
    parser.add_argument(
        "--q-bits", help="Bits per weight for quantization.", type=int, default=4
    )
    # Floating-point type for the non-quantized parameters; the default is float16
    parser.add_argument(
        "--dtype",
        help="Type to save the non-quantized parameters.",
        type=str,
        choices=["float16", "bfloat16", "float32"],
        default="float16",
    )
    # Probably the name of your own Hugging Face repository to upload the converted model to (my guess)
    parser.add_argument(
        "--upload-repo",
        help="The Hugging Face repo to upload the model to.",
        type=str,
        default=None,
    )
    # Probably set this when turning a quantized model back into float (my guess)
    parser.add_argument(
        "-d",
        "--dequantize",
        help="Dequantize a quantized model.",
        action="store_true",
        default=False,
    )
    return parser
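Note how the option names relate to the keyword arguments used in the notebook: argparse turns `--hf-path` into `hf_path`, `--q-bits` into `q_bits`, and so on. The sketch below uses a reduced, illustrative copy of the parser (only four of the options) to show the resulting dictionary. That these names line up with the keyword arguments of `convert` is an assumption based on the notebook call shown earlier.

```python
import argparse

# A reduced, illustrative copy of the parser: only four of the options are reproduced.
parser = argparse.ArgumentParser(description="Convert Hugging Face model to MLX format")
parser.add_argument("--hf-path", type=str)
parser.add_argument("--mlx-path", type=str, default="mlx_model")
parser.add_argument("-q", "--quantize", action="store_true")
parser.add_argument("--q-bits", type=int, default=4)

# Same arguments as the command-line example: mlx_lm.convert --hf-path ... -q
args = parser.parse_args(["--hf-path", "ikedachin/dummy-llm-00B", "-q"])

# Hyphens in option names become underscores in the parsed attribute names.
convert_kwargs = vars(args)
print(convert_kwargs)
```

The options that were not given on the command line show up with their defaults, which is why the command-line run writes to mlx_model and quantizes to 4 bits.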

2. Run inference in Jupyter Notebook

Importing the libraries

from mlx_lm import load, generate

Loading the model

model, tokenizer = load('./mlx_model/')  # Default output path. Use this to load the model converted from the command line.

Now, at last

Inference

response = generate(
    model,
    tokenizer,
    prompt='What is 1 devide by 0?',
    verbose=True,
    max_tokens=512,
)
# ==========
# 
# What is the answer to the question "What is 1 devide by 0?"
# The answer to this question is "0." This is because there is no number that can be divided by zero.
# 
# ==========
# Prompt: 9 tokens, 81.901 tokens-per-sec
# Generation: 42 tokens, 43.034 tokens-per-sec
# Peak memory: 4.458 GB

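See, it worked.
So happy.

With verbose=True you get the prompt and generation token counts, the tokens per second, and the peak memory usage.
At least at first, it's probably better to leave it on so you can keep an eye on things.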
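As a quick sanity check on what these figures mean, the elapsed time of each phase can be recovered by dividing the token count by the tokens-per-second value. The sketch below just redoes that arithmetic with the numbers from the run above. `elapsed_seconds` is a hypothetical helper name.

```python
def elapsed_seconds(token_count, tokens_per_second):
    """Recover the elapsed time from a token count and a throughput figure."""
    return token_count / tokens_per_second


# Figures copied from the verbose output of the run above.
prompt_seconds = elapsed_seconds(9, 81.901)
generation_seconds = elapsed_seconds(42, 43.034)

print(f"prompt: {prompt_seconds:.2f} s")
print(f"generation: {generation_seconds:.2f} s")
```

So in that run the prompt took roughly a tenth of a second to process and the answer took about one second to generate.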

Well, let's add a small tweak.
The thing is, this model used prompts like the following.

prompt

prompt = '''
# 指示
1+1の答えを教えてください
# 回答
2
'''

Something like this. So I tried the following.

An example tweak

kwargs = {
    'max_tokens': 128, 
    'verbose': True,
}


def generate_by_mlx(prompt, **kwargs):
    text = f'''
### 指示
{prompt}
### 回答
'''
    return generate(
        model,
        tokenizer,
        prompt=text,
        **kwargs
    )

response = generate_by_mlx('What is 1 devide by 0?', **kwargs)
print(response)

# ==========
# 1 ÷ 0 = undefined
# ==========
# Prompt: 18 tokens, 194.408 tokens-per-sec
# Generation: 6 tokens, 48.678 tokens-per-sec
# Peak memory: 2.330 GB
# 1 ÷ 0 = undefined # output of the print statement

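Doing it this way seems easier to use.

It will probably matter later, so let's look at the special tokens and the model structure.
This is one example, taken from a certain model.

3. Look at the special tokens

Special tokens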

print(tokenizer._tokenizer)

# PreTrainedTokenizerFast(name_or_path='mlx_model', vocab_size=99574, model_max_length=4096, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '<SEP|LLM-jp>', 'pad_token': '<PAD|LLM-jp>', 'cls_token': '<CLS|LLM-jp>', 'mask_token': '<MASK|LLM-jp>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
# 	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	3: AddedToken("<MASK|LLM-jp>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	4: AddedToken("<PAD|LLM-jp>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	5: AddedToken("<CLS|LLM-jp>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	6: AddedToken("<SEP|LLM-jp>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	7: AddedToken("<EOD|LLM-jp>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# }
# )
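The full printout is hard to scan, so a compact role-to-token table may be handier. The sketch below relies on the `special_tokens_map` attribute and the `convert_tokens_to_ids` method of Hugging Face tokenizers. `summarize_special_tokens` is a hypothetical helper name, and passing `tokenizer._tokenizer` assumes the wrapper returned by `load` exposes the underlying tokenizer that way, as in the print call above.

```python
def summarize_special_tokens(tokenizer):
    """Map each special token role to its (token string, token id) pair."""
    summary = {}
    for role, token in tokenizer.special_tokens_map.items():
        # Skip list-valued entries such as additional_special_tokens.
        if isinstance(token, str):
            summary[role] = (token, tokenizer.convert_tokens_to_ids(token))
    return summary


# Usage in the notebook, assuming `tokenizer` came from mlx_lm's load():
# for role, (token, token_id) in summarize_special_tokens(tokenizer._tokenizer).items():
#     print(f"{role}: {token!r} -> {token_id}")
```

Knowing which string and id act as the end-of-sequence token is the part most likely to matter when building training data later.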

4. Look at the model structure

Model structure

print(model)

# Model(
#   (model): LlamaModel(
#     (embed_tokens): QuantizedEmbedding(99584, 3072, group_size=64, bits=4)
#     (layers.0): TransformerBlock(
#       (self_attn): Attention(
#         (q_proj): QuantizedLinear(input_dims=3072, output_dims=3072, bias=False, group_size=64, bits=4)
#         (k_proj): QuantizedLinear(input_dims=3072, output_dims=3072, bias=False, group_size=64, bits=4)
#         (v_proj): QuantizedLinear(input_dims=3072, output_dims=3072, bias=False, group_size=64, bits=4)
#         (o_proj): QuantizedLinear(input_dims=3072, output_dims=3072, bias=False, group_size=64, bits=4)
#         (rope): RoPE(128, traditional=False)
#       )
#       (mlp): MLP(
#         (gate_proj): QuantizedLinear(input_dims=3072, output_dims=8192, bias=False, group_size=64, bits=4)
#         (down_proj): QuantizedLinear(input_dims=8192, output_dims=3072, bias=False, group_size=64, bits=4)
#         (up_proj): QuantizedLinear(input_dims=3072, output_dims=8192, bias=False, group_size=64, bits=4)
#       )
#       (input_layernorm): RMSNorm(3072, eps=1e-05)
#       (post_attention_layernorm): RMSNorm(3072, eps=1e-05)
#     )
# 
#          << omitted >>
# 
#     (layers.27): TransformerBlock(
#       (self_attn): Attention(
#         (q_proj): QuantizedLinear(input_dims=3072, output_dims=3072, bias=False, group_size=64, bits=4)
#         (k_proj): QuantizedLinear(input_dims=3072, output_dims=3072, bias=False, group_size=64, bits=4)
#         (v_proj): QuantizedLinear(input_dims=3072, output_dims=3072, bias=False, group_size=64, bits=4)
#         (o_proj): QuantizedLinear(input_dims=3072, output_dims=3072, bias=False, group_size=64, bits=4)
#         (rope): RoPE(128, traditional=False)
#       )
#       (mlp): MLP(
#         (gate_proj): QuantizedLinear(input_dims=3072, output_dims=8192, bias=False, group_size=64, bits=4)
#         (down_proj): QuantizedLinear(input_dims=8192, output_dims=3072, bias=False, group_size=64, bits=4)
#         (up_proj): QuantizedLinear(input_dims=3072, output_dims=8192, bias=False, group_size=64, bits=4)
#       )
#       (input_layernorm): RMSNorm(3072, eps=1e-05)
#       (post_attention_layernorm): RMSNorm(3072, eps=1e-05)
#     )
#     (norm): RMSNorm(3072, eps=1e-05)
#   )
#   (lm_head): QuantizedLinear(input_dims=3072, output_dims=99584, bias=False, group_size=64, bits=4)
# )
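The printed dimensions are enough for a back-of-the-envelope estimate of the parameter count and the size of the quantized weights. The sketch below makes two assumptions that are not stated in the output: the omitted blocks run contiguously from layers.0 to layers.27 (28 blocks), and each group of 64 weights carries one 16-bit scale and one 16-bit bias on top of the 4-bit values.

```python
# Dimensions read off the printed structure above.
hidden = 3072
intermediate = 8192
vocab = 99584
num_layers = 28  # assumes the omitted blocks run contiguously from layers.0 to layers.27

attention_weights = 4 * hidden * hidden      # q_proj, k_proj, v_proj, o_proj
mlp_weights = 3 * hidden * intermediate      # gate_proj, down_proj, up_proj
per_layer = attention_weights + mlp_weights  # RMSNorm weights are ignored as negligible

embedding_weights = vocab * hidden           # embed_tokens
lm_head_weights = hidden * vocab             # lm_head
total = num_layers * per_layer + embedding_weights + lm_head_weights

# Assumption: each group of 64 weights stores one 16-bit scale and one 16-bit bias,
# which adds 32 / 64 = 0.5 bits per weight on top of the 4 quantized bits.
bits_per_weight = 4 + 32 / 64
approx_bytes = total * bits_per_weight / 8

print(f"parameters: {total:,}")
print(f"approx. weight size: {approx_bytes / 1e9:.2f} GB")
```

Under these assumptions the model has about 3.8 billion parameters and the 4-bit weights take a little over 2 GB, which is why it fits comfortably in 24GB of memory.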

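Closing thoughts

If all you want is inference, it was super easy.
Basically, if you read the GitHub repository and get your hands dirty, you should be able to do it.
For that, it helps to know how to use Hugging Face, some Python, and a bit about LLMs.

If you get stuck, you're probably missing one of those three. Yes, I'm pretty short on knowledge myself, but it works out somehow, so give it a try. lol

Next I want to try training. I'm going to start looking into it now.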
