
Yet another quick look at Mistral AI's Mixtral-8x22B-v0.1 (EXL2-quantized version)

Posted at 2024-04-11

A lot of things seem to have been released this week.

Introduction

Mistral AI has released a new LLM, Mixtral-8x22B-v0.1.

magnet:?xt=urn:btih:9238b09245d0d8cd915be09927769d5f7584c1c9&dn=mixtral-8x22b&tr=udp%3A%2F%https://t.co/2UepcMGLGd%3A1337%2Fannounce&tr=http%3A%2F%https://t.co/OdtBUsbeV5%3A1337%2Fannounce
— Mistral AI (@MistralAI) April 10, 2024

It is an 8x22B Mixture of Experts model (presumably the same architecture as Mixtral 8x7B?).
It is a base model that has not been instruction-tuned.
The license appears to be Apache 2.0, as far as I can tell.

Judging from the benchmark results posted on Reddit, its performance looks comparable to Command R+.

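This is a very large model (naively 8 x 22B = 176B parameters in total, although the experts share the attention layers, so the actual count should be lower). An EXL2-quantized version was published almost immediately, so I use it here to run some simple inference.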

I ran the test on Databricks on AWS.
The DBR version is 14.3 ML and the cluster type is g5.12xlarge.
ExLlamaV2 is the inference engine, and I use a model quantized at 3.25bpw (bits per weight).
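
As a rough sanity check on why a 3.25bpw quantization suits this instance type, the weight size can be estimated from the parameter count and the bits per weight. The sketch below is only back-of-the-envelope arithmetic. It assumes the naive 176B upper bound for the parameter count, takes the g5.12xlarge as 4 x A10G GPUs at a nominal 24 GB each, and uses a hypothetical 8 GB reserve for the KV cache and activations. The function names are made up for illustration.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits(n_params: float, bits_per_weight: float,
         total_vram_gb: float, reserve_gb: float = 8.0) -> bool:
    """True if the weights fit while leaving reserve_gb free (hypothetical margin)."""
    return weight_size_gb(n_params, bits_per_weight) + reserve_gb <= total_vram_gb


N_PARAMS = 176e9        # assumption: naive 8 x 22B upper bound
TOTAL_VRAM_GB = 4 * 24  # assumption: 4 GPUs, nominally 24 GB each

for bpw in (3.25, 4.0, 16.0):
    size = weight_size_gb(N_PARAMS, bpw)
    ok = fits(N_PARAMS, bpw, TOTAL_VRAM_GB)
    print(f"{bpw:>5} bpw: ~{size:.1f} GB of weights, fits: {ok}")
```

Under these assumptions the 3.25bpw weights leave some headroom, while an unquantized 16-bit copy would need several times the available GPU memory.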

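Because I am using an EXL2-quantized model, please note that quality is considerably degraded compared with the original model.
These are the results of a casual trial, nothing more.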

Step1. Package installation

Install the packages required for inference.
This part is the same as what I did for Command R+.

%pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu118
%pip install ninja
%pip install -U flash-attn --no-build-isolation

%pip install https://github.com/turboderp/exllamav2/releases/download/v0.0.18/exllamav2-0.0.18+cu118-cp310-cp310-linux_x86_64.whl

dbutils.library.restartPython()
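
After the Python restart, it can be worth confirming which versions were actually installed. The following is a minimal sketch using only the standard library. The helper name `installed_versions` is made up for illustration, and the distribution names are assumed to match the `%pip install` commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(
    ["torch", "ninja", "flash-attn", "exllamav2"]
).items():
    print(f"{name}: {version or 'not installed'}")
```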

Step2. Loading the model

Download the EXL2-quantized model in advance, then load it from that location.
This time I used the model quantized at 3.25bpw.
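
The download itself is not shown in this article. One possible way to do it is `snapshot_download` from `huggingface_hub`, sketched below. The repository ID `turboderp/Mixtral-8x22B-v0.1-exl2` and the branch name `3.25bpw` are assumptions inferred from the directory name used in `model_directory`, and the helper names are made up for illustration.

```python
from pathlib import Path


def snapshot_dir(base_dir: str, repo_id: str, revision: str) -> Path:
    """Build a local directory name in the style used by model_directory."""
    return Path(base_dir) / f"models--{repo_id.replace('/', '--')}--{revision}"


def download_exl2(base_dir: str, repo_id: str, revision: str) -> str:
    """Download one branch of a Hugging Face repository under base_dir."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = snapshot_dir(base_dir, repo_id, revision)
    return snapshot_download(
        repo_id=repo_id, revision=revision, local_dir=str(target)
    )


# Assumed values, inferred from the directory name used in this article.
BASE_DIR = "/Volumes/training/llm/model_snapshots"
REPO_ID = "turboderp/Mixtral-8x22B-v0.1-exl2"
REVISION = "3.25bpw"

print(snapshot_dir(BASE_DIR, REPO_ID, REVISION))
# download_exl2(BASE_DIR, REPO_ID, REVISION)  # uncomment to actually download
```

The download call is left commented out because the files are large. Only the path construction runs by default.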

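Note that, as a rule of thumb, quantization at 3.25bpw degrades quality considerably (especially in Japanese) unless the calibration data is chosen carefully, so I will treat the results as a rough reference only.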

from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache_Q4,
    ExLlamaV2Tokenizer,
)

from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler


batch_size = 1
cache_max_seq_len = 4096

model_directory = "/Volumes/training/llm/model_snapshots/models--turboderp--Mixtral-8x22B-v0.1-exl2--3.25bpw/"

config = ExLlamaV2Config(model_directory)

model = ExLlamaV2(config)
print("Loading model: " + model_directory)

cache = ExLlamaV2Cache_Q4(
    model,
    lazy=True,
    batch_size=batch_size,
    max_seq_len=cache_max_seq_len,
)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

# Sampling settings
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.0
settings.top_k = 50
settings.top_p = 0.9
settings.token_repetition_penalty = 1.05

max_new_tokens = 128
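
For readers unfamiliar with the sampling parameters set above, the following is a generic, library-independent sketch of how temperature, top-k and top-p narrow down the candidate tokens. It is not ExLlamaV2's implementation, and the function name and the toy logits are made up for illustration. A temperature of zero is treated here as plain greedy selection, which is an assumption about the intent of that setting.

```python
import math


def filter_candidates(logits, top_k, top_p, temperature=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Treat a non-positive temperature as greedy decoding (assumption).
    if temperature <= 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
print(filter_candidates(toy_logits, top_k=3, top_p=0.8))
print(filter_candidates(toy_logits, top_k=3, top_p=0.8, temperature=0.0))
```

With greedy selection only one candidate survives, so in this sketch the top-k and top-p values make no difference once the temperature is zero.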

Step3. Batch inference

Since this is a base model, I supply only the beginning of a text and let the model generate the continuation.

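# Prompts to run inference on
prompts = [
    "Hi, my name is ",
    "Databricks is ",
    # "To explain in detail what Databricks is, "
    "Databricksが何かを詳細に説明すると、",
    # "If I were to name the one cutest character in Madoka Magica, "
    "まどか☆マギカで一番可愛い人の名前を一人あげると、",
    # "Python code that creates a list of 10 random elements and sorts it looks like this."
    "ランダムな10個の要素からなるリストを作成してソートするコードをPythonで書くと、以下のようになる。",
    # "The name of the current Prime Minister of Japan is "
    "現在の日本の首相の名前は、",
    # "I am running a marathon. I just overtook the person in 3rd place. My current position is "
    "私は今マラソンをしています。今3位の人を抜きました。今の順位が何位かというと、",
]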

# Split the prompts into batches
batches = [prompts[i : i + batch_size] for i in range(0, len(prompts), batch_size)]

collected_outputs = []
for b, batch in enumerate(batches):

    print(f"Batch {b + 1} of {len(batches)}...")

    outputs = generator.generate_simple(
        batch, settings, max_new_tokens, seed=1234, add_bos=True
    )

    collected_outputs += outputs

# Print the results
for o in collected_outputs:
    print("---------------------------------------")
    print(o.strip())
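
The batching step in the code above is a plain list-slicing idiom. The standalone sketch below shows how the number of batches changes with `batch_size`. The helper name `make_batches` and the placeholder prompts are made up for illustration.

```python
def make_batches(items, batch_size):
    """Split items into consecutive chunks of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]


prompts = [f"prompt {n}" for n in range(7)]  # stand-ins for the seven prompts

for size in (1, 2, 4):
    batches = make_batches(prompts, size)
    print(f"batch_size={size}: batch lengths {[len(b) for b in batches]}")
```

With `batch_size = 1`, seven prompts become seven batches. Note that the cache in Step2 is created with the same `batch_size`, so the two values would need to be changed together.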

Output

Batch 1 of 7...
Batch 2 of 7...
Batch 3 of 7...
Batch 4 of 7...
Batch 5 of 7...
Batch 6 of 7...
Batch 7 of 7...
---------------------------------------
Hi, my name is ***** and I am a *****.

I have been in the industry for over 10 years now and have worked with some of the biggest names in the business. I have also had the pleasure of working with some of the most talented people in the world.

I am currently based in London but travel all over the world to work on various projects. I love what I do and feel very lucky to be able to make a living from it.

If you would like to get in touch with me, please feel free to do so via the contact page on this website.
---------------------------------------
Databricks is 100% cloud-native. It’s built on top of cloud services, and it’s designed to help customers take advantage of the benefits of the cloud.

But what if you want to run Databricks in your own data center? What if you want to run Databricks on premises?

Databricks has a solution for that: Databricks on AWS Outposts.

## What is AWS Outposts?

AWS Outposts is a fully managed service that extends AWS infrastructure, AWS services, APIs, and tools to virtually any data center,
---------------------------------------
Databricksが何かを詳細に説明すると、データエンジニアはどのようにしてデータを処理し、データサイエンティストはどのようにしてデータを分析するかを学ぶことができます。

# データエンジニアリング

データエンジニアリングは、データを処理し、データサイエンティストがデータを分析するための基�
---------------------------------------
まどか☆マギカで一番可愛い人の名前を一人あげると、その人が死ぬ。

## 1. Introduction

Magical girls are a staple of the anime industry, and have been for decades. They’re cute, they’re cool, and they’re often very powerful. But what happens when you take away all the magic? What if these girls were just normal people with no powers to speak of? That’s the premise behind Madoka Magica, one of the most popular magical girl series in recent years. In this article, we’ll be taking a look at some of the cutest characters from Madoka Magica
---------------------------------------
ランダムな10個の要素からなるリストを作成してソートするコードをPythonで書くと、以下のようになる。

```
import random

l = [random.randint(0, 9) for _ in range(10)]
print(l)
l.sort()
print(l)
```

これをRubyで書くと、以下のようになる。

```
require 'prime'

l = Array.new(10) { rand(10) }
p l
l.sort!
p l
```

Pythonの方が短い。

Rubyでは
---------------------------------------
現在の日本の首相の名前は、アベーのシンゾウです。

Abe Shinzo is the current Prime Minister of Japan.

The kanji for his name are:

- 安倍 (あべ) Abe
- 晋三 (しんぞう) Shinzo

The first kanji, 安, means “peaceful” or “safe.” The second kanji, 倍, means “double” or “twice.”

The first kanji in his given name, 晋, means “promotion” or “
---------------------------------------
私は今マラソンをしています。今3位の人を抜きました。今の順位が何位かというと、4位です。

I am running a marathon right now. I just passed the third place person. My current position is fourth place.

私は今マラソンをしています。今3位の人を抜きました。今の順位が何位かというと、4位です。

I am running a marathon right now. I just passed the third place person. My current position is fourth place.

私は今マラソンをしています。今3位の人を�

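Hmm, the output is underwhelming. It is also a bit ominous (the Madoka Magica continuation reads "that person dies").

I suspect this is largely due to quantization, but in these examples the Japanese performance is honestly not great.
There also seem to be quite a few hallucinations overall: for example, it names Abe Shinzo as the current prime minister, and it answers 4th place to the marathon question, where the correct answer is 3rd.
For a proper test, I would probably need to use the full model (or a quantized model with somewhat more bits).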

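Summary

For a proper evaluation, I think it is better to use the unquantized model or one quantized at 4 bits or more.
Please treat these results as a reference example of a trial that did not go very well.

It is hard to set up an environment that can properly run a model of this size, but perhaps the trend toward larger models is unavoidable.
On the other hand, including DBRX, I expect performance gains from Mixture of Experts to continue for some time.
I am looking forward to rapid progress in this area, downsizing included.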
