
LLM Advent Calendar 2023

@wayama_ryousuke(ryousuke wayama)

Trying Mixtral 8x7B with llama.cpp

Last updated at 2023-12-13, posted at 2023-12-13

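I tried running Mixtral 8x7B, released by Mistral AI, using llama.cpp.

What are MoE and Mixtral 8x7B?

Mistral AI

Mistral AI is the company behind Mistral-7B, which drew attention as "a 7B model that outperforms 13B models".

On 2023/12/11, Mistral AI released Mixtral 8x7B, and it has become a hot topic in the LLM community.
Its most notable feature is that it adopts an architecture called MoE (Mixture of Experts).
GPT-4 is also rumored to use this architecture (see: The Decoder), which piqued my interest, so I looked into how Mixtral performs.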

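MoE

MoE (Mixture of Experts) is an approach that combines multiple neural networks.
Several networks, each specialized for particular tasks or domains, are prepared, and the model switches among them during training and inference to build an efficient model.

An MoE model consists of two kinds of subnetworks: a gating network and multiple expert networks.
The gate decides which experts to select for a given input, and only the selected experts produce the output.

This brings two benefits: better performance and shorter training and inference time.
On performance, each expert can be optimized for different tasks, so the model can handle a wider range of tasks.
In addition, experts that the gate does not activate are skipped, which reduces computation and makes training and inference faster.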
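The routing idea described above can be sketched in a few lines of Python. This is a toy illustration, not Mixtral's actual implementation: the scalar "experts", the fixed `gate` function, and the name `moe_forward` are all hypothetical, and `top_k=2` is just an illustrative default.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate, experts, top_k=2):
    """Route x to the top_k experts chosen by the gate and mix their outputs.

    gate(x) returns one score (logit) per expert. Experts that are not
    selected are never called, which is where the compute saving comes from.
    """
    logits = gate(x)
    # Indices of the top_k highest-scoring experts (ties keep original order).
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Renormalise the scores of the chosen experts so the weights sum to 1.
    weights = softmax([logits[i] for i in chosen])
    output = sum(w * experts[i](x) for w, i in zip(weights, chosen))
    return output, chosen


# Hypothetical toy setup: expert k multiplies its input by k,
# and the gate always prefers experts 2 and 3 equally.
toy_experts = [lambda x, k=k: k * x for k in range(4)]
toy_gate = lambda x: [0.0, 1.0, 3.0, 3.0]

result, used = moe_forward(2.0, toy_gate, toy_experts)
print(result, used)
```

In a real MoE layer the gate is a small learned network and the experts are neural network blocks, but the control flow (score, pick the top few, evaluate only those, take a weighted sum) follows the same pattern.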

The official HuggingFace blog has a detailed explanation, so please refer to it.

Mixtral 8x7B

Mistral AI blog: https://mistral.ai/news/mixtral-of-experts/
HuggingFace
- Base: https://huggingface.co/mistralai/Mixtral-8x7B-v0.1
- Instruct: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1

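Mixtral 8x7B is an LLM that adopts SMoE (Sparse Mixture of Experts), a variant of MoE.

Its distinguishing feature is that it generates tokens quickly despite its high performance.
Mixtral 8x7B matches or exceeds Llama 2 70B on benchmarks, yet because the MoE architecture uses only a subset of the parameters for each token, its effective speed is on par with a 12.9B model.

Memory usage does not improve as much as generation speed: unquantized Mixtral 8x7B uses around 100 GB, about the same as an unquantized 47B model (see: Mistral).
Memory consumption is lower than the naive estimate from the name (8x7B = 56B) because only part of each layer is replicated per expert in an SMoE model.
In Mixtral 8x7B the experts are the feed-forward blocks, while the remaining parameters, such as the attention layers, are shared across experts, which reduces the total memory footprint.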
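As a rough back-of-the-envelope check of these figures, the sketch below splits the parameter count into a shared part and a per-expert part. It uses the 46.7B total and 12.9B active figures quoted in this article, and additionally assumes that 2 of the 8 experts are active per token (as described on Mistral's blog, but not stated here) and that weights are stored in 16 bits. The function names are made up for illustration.

```python
TOTAL_PARAMS_B = 46.7    # total parameters, in billions
ACTIVE_PARAMS_B = 12.9   # parameters used per token, in billions
NUM_EXPERTS = 8
ACTIVE_EXPERTS = 2       # assumption: top-2 routing


def split_params(total_b, active_b, num_experts, active_experts):
    """Solve total = shared + N * expert and active = shared + k * expert."""
    per_expert = (total_b - active_b) / (num_experts - active_experts)
    shared = active_b - active_experts * per_expert
    return shared, per_expert


def fp16_weight_gb(params_b):
    """Weight memory in GB at 2 bytes per parameter (1B params = 2 GB)."""
    return params_b * 2.0


shared_b, expert_b = split_params(
    TOTAL_PARAMS_B, ACTIVE_PARAMS_B, NUM_EXPERTS, ACTIVE_EXPERTS
)
print(f"shared: {shared_b:.2f}B, per expert: {expert_b:.2f}B")
print(f"fp16 weights, actual total: {fp16_weight_gb(TOTAL_PARAMS_B):.1f} GB")
print(f"fp16 weights, naive 8 x 7B: {fp16_weight_gb(8 * 7):.1f} GB")
```

Under these assumptions the weights alone come to a little over 90 GB, which is in line with the roughly 100 GB figure once runtime overhead is added.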

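Metric	Comparison	Notes
Performance	7B < 70B <= 8x7B	According to Mistral, Mixtral 8x7B outperforms Llama 2 70B on benchmarks. However, Mixtral and Llama 2 are different models, so it is unclear whether the benchmark results are due to MoE
Generation speed (T/s)	70B < 8x7B (effectively 12.9B) < 7B	Each token's inference uses 12.9B parameters, so inference speed is effectively equivalent to a 12.9B model
Memory usage (GB)	7B < 8x7B (effectively 46.7B) < 70B	Unquantized Mixtral 8x7B consumes about 100 GB of VRAM. For the quantized models, VRAM consumption is 18.14 GB for `Q2_K` and 28.94 GB for `Q4_K_M` (llama.cpp/GGUF, TheBloke)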

Note that the official Mistral blog does not specify the dataset used for training.

Mixtral is pre-trained on data extracted from the open Web – we train experts and routers simultaneously.

Source: Mistral AI official blog (https://mistral.ai/news/mixtral-of-experts/)

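Trying it out

I tried Mixtral 8x7B right away in a Colab environment (V100 GPU).
For the notebooks, see the links in each section.

For the models, I used TheBloke's 2-bit / 4-bit quantized models (GGUF).
The unquantized version (HuggingFace) runs out of memory in the test environment.

The GGUF models are available here.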

https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF
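The download step is not shown in this article. One possible way to fetch a GGUF file from the repository above is sketched below; it assumes the usual Hugging Face `resolve/<revision>/<filename>` download URL layout, and the helper names `gguf_url` and `download_gguf` are hypothetical.

```python
import urllib.request
from urllib.parse import quote

REPO_ID = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF"


def gguf_url(filename, repo_id=REPO_ID, revision="main"):
    """Build the direct download URL for one file in a Hugging Face repo.

    Assumes the standard https://huggingface.co/<repo>/resolve/<rev>/<file> layout.
    """
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{quote(filename)}"


def download_gguf(filename, dest=None):
    """Download the file to dest (defaults to the file name). Files are tens of GB."""
    dest = dest or filename
    urllib.request.urlretrieve(gguf_url(filename), dest)
    return dest


# Only print the URL here; call download_gguf(...) to actually fetch the file.
print(gguf_url("mixtral-8x7b-instruct-v0.1.Q2_K.gguf"))
```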

I used llama.cpp to run the models.
At the time of writing, the main branch does not yet support MoE (PR - GitHub), so I checked out the mixtral branch.

$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp
$ git checkout mixtral

2-bit quantized model

First, I ran inference with the 2-bit quantized model (mixtral-8x7b-instruct-v0.1.Q2_K.gguf).

The Colab notebook is here.

https://colab.research.google.com/drive/1-RdbkdVtFJaA6cah8uI1lqbvc9Vto3h2

!./main \
  -m mixtral-8x7b-instruct-v0.1.Q2_K.gguf \
  --color \
  --ctx-size 2048 \
  --n-gpu-layers 28 \
  --n-predict -1 \
  --temp 0.7 \
  --repeat_penalty 1.1 \
  --threads 8 \
  --prompt "<s>[INST] Alas, a wild unicorn is approaching! As a llama, what should I do? [/INST]"

...

[INST] Alas, a wild unicorn is approaching! As a llama, what should I do? [/INST] Being stampeded by a unicorn would be an exciting but potentially dangerous experience! Here are some steps you can take to prepare for and react to this situation:

1. Acknowledge that this is a rare and magical opportunity: You have been given the chance to interact with a mythical creature, which could be very enriching.
2. Stay calm: It's important not to panic or show fear, as it may agitate the unicorn. Being collected and thoughtful in your actions will help keep the situation under control.
3. Assess the unicorn's behavior: If the unicorn appears aggressive or upset, try to determine why that is and whether you can alleviate the problem. Be aware of any objects or perceived barriers between you and the unicorn.
4. Use your wits and charm: As a llama, you have a unique understanding of other animals' emotions and behaviors. Use this knowledge to connect with the unicorn and gauge its mood. Approach it slowly and speak in a gentle, soothing voice. If possible, provide some food for the unicorn to strengthen your bond.
5. Document your interaction: Since you are humanoid, you can use tools and devices that might help calm the unicorn or at least record what's happening. Use these items sparingly and only if necessary.
6. Know when to back off: If the unicorn continues to show aggression despite your best efforts, remove yourself from the situation and prioritize safety over interaction with a unicorn.

Remember that every unicorn is unique, so this advice may need to be adjusted for each specific case. Good luck! [end of text]

llama_print_timings:        load time =    5411.94 ms
llama_print_timings:      sample time =     218.99 ms /   360 runs   (    0.61 ms per token,  1643.92 tokens per second)
llama_print_timings: prompt eval time =     868.27 ms /    30 tokens (   28.94 ms per token,    34.55 tokens per second)
llama_print_timings:        eval time =   27547.38 ms /   359 runs   (   76.73 ms per token,    13.03 tokens per second)
llama_print_timings:       total time =   28776.71 ms

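The token generation speed (eval time) is 13.03 tokens/second.
Subjectively, responses are generated about as fast as in the ChatGPT web UI.
Even with 2-bit quantization, that is quite fast considering that a model with 70B-class performance is running on a V100.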
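The tokens/second figure can be recomputed from the raw `llama_print_timings` lines. The sketch below parses lines in the format shown in the log above; the regular expression and the `parse_timings` helper are my own and only cover that format.

```python
import re

# Matches e.g. "llama_print_timings:  eval time =  27547.38 ms /  359 runs ..."
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time =\s*(?P<ms>[\d.]+) ms"
    r"(?: /\s*(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log_text):
    """Return {phase: (milliseconds, tokens_per_second or None)} for each timing line."""
    timings = {}
    for match in TIMING_RE.finditer(log_text):
        ms = float(match.group("ms"))
        count = match.group("count")
        # Lines without a token count (load, total) have no throughput.
        tps = int(count) / (ms / 1000.0) if count else None
        timings[match.group("name").strip()] = (ms, tps)
    return timings


SAMPLE_LOG = """\
llama_print_timings:        load time =    5411.94 ms
llama_print_timings: prompt eval time =     868.27 ms /    30 tokens (   28.94 ms per token,    34.55 tokens per second)
llama_print_timings:        eval time =   27547.38 ms /   359 runs   (   76.73 ms per token,    13.03 tokens per second)
"""

for phase, (ms, tps) in parse_timings(SAMPLE_LOG).items():
    rate = f"{tps:.2f} tokens/s" if tps is not None else "n/a"
    print(f"{phase}: {ms:.2f} ms, {rate}")
```

Here `eval` is the generation phase and `prompt eval` is the processing of the input prompt, so the generation speed quoted in the text corresponds to the `eval` entry.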

4-bit quantized model

Next, I ran inference with the 4-bit quantized model (mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf).

The Colab notebook is here.

https://colab.research.google.com/drive/1qnPyklYVqN8RYiSla_lfAyKpPU9eXxsI

!./main \
  -m mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf \
  --color \
  --ctx-size 2048 \
  --n-gpu-layers 19 \
  --n-predict -1 \
  --temp 0.7 \
  --repeat_penalty 1.1 \
  --threads 8 \
  --prompt "<s>[INST] Alas, a wild unicorn is approaching! As a llama, what should I do? [/INST]" 

...

[INST] Alas, a wild unicorn is approaching! As a llama, what should I do? [/INST] If you encounter a mythical creature like a unicorn, it's important to remember that unicorns are typically portrayed as gentle and magical beings in folklore. However, it's still crucial to prioritize safety and proceed with caution. As a llama, you could observe the unicorn from a distance. There's no need to run away or show fear, but maintain a respectful distance.

Remember that unicorns are often drawn to innocent and pure-hearted individuals in folklore, so simply being yourself should suffice. However, it's important to note that this is purely a hypothetical scenario since unicorns do not exist in reality. [end of text]

llama_print_timings:        load time =    6848.02 ms
llama_print_timings:      sample time =      90.27 ms /   152 runs   (    0.59 ms per token,  1683.91 tokens per second)
llama_print_timings: prompt eval time =    3407.39 ms /    30 tokens (  113.58 ms per token,     8.80 tokens per second)
llama_print_timings:        eval time =   21834.46 ms /   151 runs   (  144.60 ms per token,     6.92 tokens per second)
llama_print_timings:       total time =   25394.22 ms
Log end

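The result was 6.92 tokens/second.
Fewer layers fit on the GPU (--n-gpu-layers 19 instead of 28), so more layers are offloaded to RAM and inference is slower, but the speed is still adequate given the execution environment and the model's capability.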

Japanese prompts

Using the 4-bit quantized model, I also tried Japanese prompts.
I posed three questions from ELYZA-tasks-100.
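The exact commands used for the Japanese prompts are not shown here. A minimal sketch of how a question could be wrapped in the same `[INST]` template and flags as the 4-bit run above follows; that the Japanese prompts used exactly this template and these flags is an assumption, and `build_prompt` / `build_command` are hypothetical helpers.

```python
import shlex


def build_prompt(user_message):
    """Wrap a user message in the instruction template used in the runs above."""
    return f"<s>[INST] {user_message} [/INST]"


def build_command(model_path, user_message, n_gpu_layers=19):
    """Assemble the llama.cpp ./main argument list, mirroring the 4-bit run."""
    return [
        "./main",
        "-m", model_path,
        "--color",
        "--ctx-size", "2048",
        "--n-gpu-layers", str(n_gpu_layers),
        "--n-predict", "-1",
        "--temp", "0.7",
        "--repeat_penalty", "1.1",
        "--threads", "8",
        "--prompt", build_prompt(user_message),
    ]


command = build_command(
    "mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",
    "仕事の熱意を取り戻すためのアイデアを5つ挙げてください。",
)
# shlex.join quotes the prompt so the line can be pasted into a shell.
print(shlex.join(command))
```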

Q. 仕事の熱意を取り戻すためのアイデアを5つ挙げてください。 (List five ideas for regaining enthusiasm for work.)

新しい技能を学 bureau (e.g. a class, workshop, or online course): By learning something new, you can challenge yourself and reignite your passion for your work. This could be directly related to your job, such as taking a course on a new software program, or unrelated, like learning how to paint or play an instrument.

Find a mentor: A mentor is someone who can provide guidance, support, and encouragement. They can help you see things from a different perspective, offer advice on how to approach challenges, and share their own experiences with you. Having a mentor can help rekindle your enthusiasm for your job by giving you new ideas and insights.

Set new goals: If you're feeling uninspired at work, it might be because you don't have any clear objectives to strive towards. Setting new goals can give you a sense of purpose and direction. These could be short-term goals, like completing a project by a certain date, or long-term goals, like earning a promotion or starting your own business.

Take a break: Sometimes, the best way to reignite your passion for your work is to take a step back from it. This could mean taking a few days off, going on vacation, or even just taking a short break during the day to recharge. When you come back to your job refreshed and rejuvenated, you'll be more likely to approach it with renewed energy and enthusiasm.

Reflect on why you chose this career in the first place: Sometimes, we lose sight of why we wanted to do this work in the first place. Take some time to reflect on what drew you to this field, what excites you about it, and what you hope to achieve. Reminding yourself of your original motivations can help rekindle your passion and remind you why you love your job.

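Q. クマが海辺に行ってアザラシと友達になり、最終的には家に帰るというプロットの短編小説を書いてください。 (Write a short story with this plot: a bear goes to the seaside, befriends a seal, and finally returns home.)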

一日、クマが森から海辺に飛び立ちました。クマは長い間眺め続けていると、ある日見つけたアザラシの群れに心をひっくり返りました。そこで、クマは海辺に遊びに来ること decided を決意しました。

クマが海辺に到着したら、アザラシたちはクマを見かけてすっかり戸惑いました。しかし、クマが優しくやさしく笑顔でアザラシたちに向き合ったことで、あるアザラシはクマに心を開き始めました。

時間の経過と共に、クマとそのアザラシは遊びに出ていたり、水族の世界を観察するために潜っていたりし、とても親しみ合っていきました。アザラシたちがクマを心待ちでいたような気がしました。

でも、クマは森での家族や友達との絆も忘れられないと思い、最終的には元の家に帰ることに決意しました。アザラシたちはクマが去ることを知っており、それでもクマを見送ってくれました。

クマは森に戻ってからも、いつものように眠い日々を過ごしていました。けれど、心にはあの時の楽しい思いや、アザラシたちとの出会いが残り続けていました。そして、もう一度アザラシたちと再会する日が来るかもしれないと、クマは微笑んで forest を見渡しました。

クマは海辺に行って、アザラシと友達になりました。そして、最終的には元の家に帰り、心に残された思いや出会いを忘れることはありませんでした。

Q. 古代ギリシャを学ぶ上で知っておくべきポイントは？ (What are the key points to know when studying ancient Greece?)

古代ギリシャを学びた場合、以下のポイントを考慮すると有益でしょう:

ギリシャ語の勉強: 古代ギリシアの文化や思想を理解するには、ギリシャ語の基本的な知識が必要です。ギリシャ語の単語と文法は、現在の西洋文化に大きく影響を与えています。

ギリシアの歴史: 古代ギリシアの時代は約1000年にわたり、約800年間から500年前の時期にかけて存在しました。この期間中、ギリシャは複数の政治的・文化的な変遷を経験しました。それ�ach 変遷についての知識が重要です。

ギリシアの地域: 古代ギリシアは、ギリシャ半島と周囲のnumerous islandを含むエgean Seaの近くにありました。ギリシアの地理的特徴や、そこで起きる政治・文化の交流は、ギリシアの歴史と文化に大きく影響を与えました。

ギリシアの神話: ギリシアの神話は、ギリシャ人の信仰や文化的背景を反映しています。彼らは、多くの自然現象や人間的な品格についての神話を持っていました。

ギリシアの芸術: ギリシアの芸術は、世界中で著名です。彼らの建築、雕刻、絵画、および哲学的な考え方は、現代の文化に大きく影響を与えています。

ギリシアの哲osoフィロソフィー: 古代ギリシアでは、多くの著名な哲osoが活躍しました。彼らの考え方や教えを理解することは、西洋哲学の基礎を구成する重要な部分です。

ギリシアの政治: 古代ギリシアでは、多くの政治制度が試行されました。デモクラシーは、ギリシアで最初に実践された政治制度です。

ギリシャ人とローマ人: ギリシア語とギリシアの文化がローマに広まり、ローマ人はギリシャの文化を受け継いだことです。ギリシアとローマの相互影響を理解することは、西洋文化の基礎を理解するために重要です。

これらのポイント以外にも、古代ギリシャを学ぶ上で有益な情報となる場合があります。しかし、これらは最低限の知識として重要であると考えられます。

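The first answer switches to English partway through, and the second and third answers also contain unnatural expressions, so the Japanese capability seems somewhat lacking.
(For example, 「一日、クマが森から海辺に飛び立ちました」, roughly "One day, the bear flew off from the forest to the seaside", and 「アザラシの群れに心をひっくり返りました」, roughly "had its heart turned over by a herd of seals".)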

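The official blog states that the model supports English, French, Italian, German, and Spanish, so Mixtral-8x7B can handle multiple languages, but judging from these answers its Japanese is still immature.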

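Summary

This article introduced Mixtral 8x7B and tried the 2-bit and 4-bit quantized models published on HuggingFace.
The Japanese performance leaves room for improvement, but token generation was fast even on a V100, which confirmed the benefit of MoE.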

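Until now, running a high-performance 70B-class model required an expensive GPU such as an A100, and inference was slow, so it was hard to make use of such models without ample resources.
Thanks to MoE, a model said to be 70B-class can now run at practical speed on a relatively inexpensive GPU such as a V100, which I think is a major advantage.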

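I expect MoE adoption to spread in future LLM development.
As for fine-tuning, source code and the like have not been published at this point, but I expect support to progress.
In the Japanese-language sphere as well, I hope that LLM development using MoE becomes common and that performance and inference speed improve as a result.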

