
Yet another write-up: trying Mistral.AI's Mistral-Nemo-Instruct (quantized) on Databricks

Posted at 2024-07-26

Far too many models were released this week...
So I am going to try them out one by one.

Introduction

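Mistral.AI has released Mistral-Nemo.

It appears to be intended as a replacement for Mistral 7B. The parameter count has grown, but on benchmarks it outperforms models of a similar parameter size.
(That said, given the larger parameter size, this is perhaps only to be expected...)

It also adopts a new tokenizer called Tekken, which tokenizes languages other than English more efficiently.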
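
As a rough illustration of what "tokenizer efficiency" means here, the sketch below computes the average number of characters covered by one token; a higher value means fewer tokens for the same text. The helper `chars_per_token` and the stand-in `byte_encode` encoder are hypothetical names introduced for this example and are not part of Tekken or ExLlamaV2.

```python
from typing import Callable, Iterable, Sequence


def chars_per_token(
    texts: Iterable[str],
    encode: Callable[[str], Sequence[int]],
) -> float:
    """Average number of characters represented by one token."""
    texts = list(texts)
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    if total_tokens == 0:
        raise ValueError("encode produced no tokens")
    return total_chars / total_tokens


def byte_encode(text: str) -> list[int]:
    """Stand-in encoder for demonstration only: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


# ASCII text costs one byte per character with the stand-in encoder.
print(chars_per_token(["hello"], byte_encode))
# Japanese text costs three bytes per character, so the ratio drops.
print(chars_per_token(["こんにちは"], byte_encode))
```

To compare real tokenizers, pass a function that wraps each tokenizer's encode call and returns a flat list of token ids for the same set of sample texts.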

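So, as usual, here is a quick trial.

The test was run on Databricks on AWS.
The DBR version is 15.3 ML and the cluster type is g5.xlarge.
ExLlamaV2 is used as the inference engine.

Because a quantized model is used, please note that the output can differ from that of the original model.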
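
To give a sense of why quantization matters on this cluster, here is a back-of-the-envelope estimate of weight size. It assumes roughly 12 billion parameters (taken from the model name) and compares 16 bits per weight with the 5.0 bits per weight of the build loaded later in this post. A g5.xlarge provides a single 24 GB A10G GPU, and the estimate ignores the KV cache and activation memory, so treat it as a sketch rather than a measurement.

```python
def quantized_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


n_params = 12e9  # assumption: roughly 12B parameters, per the model name

for bpw in (16.0, 5.0):
    print(f"{bpw:>4} bpw: {quantized_weight_gib(n_params, bpw):.2f} GiB")
# 16.0 bpw: 22.35 GiB
#  5.0 bpw: 6.98 GiB
```

Under these assumptions the 16-bit weights alone would nearly fill the GPU, while the 5.0 bpw build leaves room for the cache.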

Step1. Install packages

Install Flash-Attention and ExLlamaV2.

# %pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.1/flash_attn-2.6.1+cu123torch2.3cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
# %pip install https://github.com/turboderp/exllamav2/releases/download/v0.1.7/exllamav2-0.1.7+cu121.torch2.3.1-cp311-cp311-linux_x86_64.whl

dbutils.library.restartPython()
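
The wheels above are built for a specific combination of Python (cp311), PyTorch 2.3 and CUDA, so it can be worth confirming what is actually installed after the restart. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and the distribution names passed to it are assumed to match the wheel names.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for pkg in ("flash_attn", "exllamav2", "torch"):
    print(f"{pkg}: {installed_version(pkg) or 'not installed'}")
```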

Step2. Load the model

The model below was downloaded in advance and is loaded from that location.
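
The download step itself is not shown in this post. One possible way to do it is sketched below with `snapshot_download` from `huggingface_hub`. The repository id `turboderp/Mistral-Nemo-Instruct-12B-exl2` and the `5.0bpw` revision are assumptions inferred from the directory name used in the loading code, and `local_snapshot_dir` and `download_model` are hypothetical helper names.

```python
def local_snapshot_dir(base_dir: str, repo_id: str, revision: str) -> str:
    """Build a directory name in the style used by this post (an assumed convention)."""
    return f"{base_dir.rstrip('/')}/models--{repo_id.replace('/', '--')}--{revision}/"


def download_model(base_dir: str, repo_id: str, revision: str) -> str:
    """Download one revision of a Hugging Face repository into base_dir."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_snapshot_dir(base_dir, repo_id, revision)
    snapshot_download(repo_id=repo_id, revision=revision, local_dir=target)
    return target


# Assumed values, inferred from the model directory name.
BASE_DIR = "/Volumes/training/llm/model_snapshots"
REPO_ID = "turboderp/Mistral-Nemo-Instruct-12B-exl2"
REVISION = "5.0bpw"

print(local_snapshot_dir(BASE_DIR, REPO_ID, REVISION))
# Call download_model(BASE_DIR, REPO_ID, REVISION) once to fetch the files.
```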

import time

from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache_Q4,
    ExLlamaV2Tokenizer,
)
from exllamav2.generator import (
    ExLlamaV2DynamicGenerator,
    ExLlamaV2DynamicJob,
    ExLlamaV2Sampler,
)

batch_size = 1
cache_max_seq_len = 8192

model_directory = "/Volumes/training/llm/model_snapshots/models--turboderp--Mistral-Nemo-Instruct-12B-exl2--5.0bpw/"

config = ExLlamaV2Config(model_directory)
config.arch_compat_overrides()

model = ExLlamaV2(config)
print("Loading model: " + model_directory)

cache = ExLlamaV2Cache_Q4(
    model,
    lazy=True,
    batch_size=batch_size,
    max_seq_len=cache_max_seq_len,
) 
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(
    model=model,
    cache=cache,
    tokenizer=tokenizer,
    max_batch_size=1024,
    max_q_size=1,
)

gen_settings = ExLlamaV2Sampler.Settings(
    token_repetition_penalty=1.1,
    temperature=0.1,
    top_k=0,
    top_p=0.6,
)
max_new_tokens = 512
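
The loading code uses a 4-bit quantized cache (`ExLlamaV2Cache_Q4`) with `max_seq_len` of 8192. The sketch below estimates how much memory such a cache needs compared with a 16-bit one. The architecture values (40 layers, 8 key/value heads, head dimension 128) are assumptions not stated in this post, and the estimate ignores the scale factors a quantized cache also stores, so the real figure would be somewhat larger.

```python
def kv_cache_gib(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bits_per_value: float,
    batch_size: int = 1,
) -> float:
    """Approximate KV cache size in GiB (keys plus values, no quantization overhead)."""
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bits_per_value / 8 / 2**30


# Assumed model shape; seq_len matches cache_max_seq_len in the loading code.
shape = dict(num_layers=40, num_kv_heads=8, head_dim=128, seq_len=8192)

print(f"FP16 cache: {kv_cache_gib(**shape, bits_per_value=16):.4f} GiB")  # 1.2500
print(f"Q4 cache:   {kv_cache_gib(**shape, bits_per_value=4):.4f} GiB")   # 0.3125
```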

Step3. Batch inference

I ask the same questions as in an earlier article and collect the answers.

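# Prompts to run inference on (kept in Japanese as tested; English glosses in comments)
prompts = [
    "Hello, what is your name?",
    # "What is Databricks? Please explain in detail."
    "Databricksとは何ですか？詳細に教えてください。",
    # "Who is the cutest character in Madoka Magica?"
    "まどか☆マギカでは誰が一番かわいい?",
    # "Write Python code that creates a list of 10 random elements and sorts it."
    "ランダムな10個の要素からなるリストを作成してソートするコードをPythonで書いてください。",
    # "Who is the current Prime Minister of Japan?"
    "現在の日本の首相は誰？",
    # "You are running a marathon. You just passed the person in 3rd place.
    #  What is your position now? Think step by step."
    "あなたはマラソンをしています。今3位の人を抜きました。あなたの今の順位は何位ですか?ステップバイステップで考えてください。",
]

# System prompt: "You are a helpful AI assistant."
system_prompt = """あなたは親切なAIアシスタントです。"""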


# Format the prompt
def format_prompt(sp, p):
    return (
        f"[INST]{sp}\n\n{p}[/INST]"
    )

print()
print("Creating jobs...")

completions = []

for idx, p in enumerate(prompts):
    f_prompt = format_prompt(system_prompt, p)
    completions.append(f"Q: {p}\n")
    prompt_ids = tokenizer.encode(
        f_prompt,
        encode_special_tokens=True,
        add_bos=True,
    )
    job = ExLlamaV2DynamicJob(
        input_ids=prompt_ids,
        gen_settings=gen_settings,
        max_new_tokens=max_new_tokens,
        identifier=idx,
        stop_conditions = [tokenizer.eos_token_id],
    )
    generator.enqueue(job)

# Generate

print()
print("Generating...")

num_completions = 0
num_tokens = 0
time_begin = time.time()

while generator.num_remaining_jobs():
    results = generator.iterate()

    bsz = len(set([r["identifier"] for r in results]))

    for result in results:
        if not result["eos"]: continue

        idx = result["identifier"]
        response = result["full_completion"]
        completions[idx] += f"A: {response.lstrip()}"

        # Measure performance
        num_completions += 1
        num_tokens += result["new_tokens"]
        elapsed_time = time.time() - time_begin
        rpm = num_completions / (elapsed_time / 60)
        tps = num_tokens / elapsed_time
        print()
        print("---------------------------------------------------------------------------")
        print(f"Current batch size: {bsz}")
        print(f"Avg. completions/minute: {rpm:.2f}")
        print(f"Avg. output tokens/second: {tps:.2f}")
        print("---------------------------------------------------------------------------")

        # Print the inference result
        print()
        print(f"Completion {idx}:")
        print()
        print(completions[idx])
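
For reference, the `format_prompt` helper in the code above places the system prompt and the user question inside a single `[INST] ... [/INST]` block, and the BOS token is added separately by `tokenizer.encode(..., add_bos=True)`. The small sketch below reproduces the helper on its own so the resulting string can be inspected; the English sample strings are placeholders for illustration.

```python
def format_prompt(sp: str, p: str) -> str:
    """Wrap a system prompt and a user prompt in one [INST] block."""
    return f"[INST]{sp}\n\n{p}[/INST]"


example = format_prompt(
    "You are a helpful AI assistant.",
    "Hello, what is your name?",
)
print(repr(example))
# '[INST]You are a helpful AI assistant.\n\nHello, what is your name?[/INST]'
```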

The responses are shown below. (Because of how the jobs are processed, they do not appear in the order of the question list.)

Output


Creating jobs...

Generating...

---------------------------------------------------------------------------
Current batch size: 6
Avg. completions/minute: 86.17
Avg. output tokens/second: 31.59
---------------------------------------------------------------------------

Completion 0:

Q: Hello, what is your name?
A: I don't have a name, but you can call me Assistant. How can I help you today?

---------------------------------------------------------------------------
Current batch size: 5
Avg. completions/minute: 34.50
Avg. output tokens/second: 22.14
---------------------------------------------------------------------------

Completion 4:

Q: 現在の日本の首相は誰？
A: 現在の日本の首相は、第100代内閣総理大臣であり、自由民主党所属の岸田文雄です。彼は2021年10月4日に就任し、現在もその地位にあります。

---------------------------------------------------------------------------
Current batch size: 4
Avg. completions/minute: 43.25
Avg. output tokens/second: 35.80
---------------------------------------------------------------------------

Completion 5:

Q: あなたはマラソンをしています。今3位の人を抜きました。あなたの今の順位は何位ですか?ステップバイステップで考えてください。
A: 今、私はマラソン中です。先ほど3位の選手を追い抜きました。この行為により、私の現在の順位は4位となったことを意味します。なぜなら、私は他に2人の選手（1位と2位）がいるからです。したがって、私の今の順位は4位です。

---------------------------------------------------------------------------
Current batch size: 3
Avg. completions/minute: 46.57
Avg. output tokens/second: 45.60
---------------------------------------------------------------------------

Completion 2:

Q: まどか☆マギカでは誰が一番かわいい?
A: 「まどか☆マギカ」は人気のあるアニメ作品であり、キャラクターたちもそれぞれ魅力的です。しかし、誰が一番可愛いかについては主観的な見解に過ぎません。各々のキャラクターには独自の個性と魅力がありますので、視聴者の好みによって異なる回答になるでしょう。

---------------------------------------------------------------------------
Current batch size: 2
Avg. completions/minute: 35.05
Avg. output tokens/second: 50.93
---------------------------------------------------------------------------

Completion 3:

Q: ランダムな10個の要素からなるリストを作成してソートするコードをPythonで書いてください。
A: ```python
import random

# ランダムに10個の整数を生成
numbers = [random.randint(1, 100) for _ in range(10)]

print("元のリスト:", numbers)

# リストを昇順にソート
numbers.sort()

print("ソートされたリスト:", numbers)
```

このコードは、ランダムに10個の整数を生成し、それらを昇順にソートします。出力例:
```
元のリスト: [42, 85, 37, 96, 54, 21, 78, 19, 63, 7]
ソートされたリスト: [7, 19, 21, 37, 42, 54, 63, 78, 85, 96]
```

---------------------------------------------------------------------------
Current batch size: 1
Avg. completions/minute: 22.01
Avg. output tokens/second: 57.95
---------------------------------------------------------------------------

Completion 1:

Q: Databricksとは何ですか？詳細に教えてください。
A: Databricksは、データ処理と人工知能（AI）の分野で使用されるオープンソースのデータ処理プラットフォームです。このプラットフォームは、Apache Sparkをベースとしており、大規模なデータセットの処理、分析、機械学習などのタスクに使用されます。Databricksは、クラウド環境（例えばAzure、AWS、GCP）やオンプレミス環境で実行することができます。

以下は、Databricksの主要な機能と特徴です：

1. **Unified Data Analytics**: Databricksは、異なるソースからのデータを収集し、処理し、分析するための統合されたプラットフォームを提供します。これは、データエンジニアリング、データサイエンス、機械学習などのタスクに役立ちます。
2. **Spark ベース**: Databricksは、Apache Sparkをコア技術として使用しています。Sparkは高い並列性と柔軟性を持つ分散型データ処理エンジンであり、大規模なデータセットの処理に最適化されています。
3. **Delta Lake**: Databricksは、Delta Lakeというオープンソースのストレージレイヤーも提供しています。Delta Lakeは、Spark SQLと結合して使用され、高速なデータ読み込み、書き込み、クエリ処理を可能にします。また、データのバージョン管理や履歴データの保存にも対応しています。
4. **Machine Learning**: Databricksには、MLflowというオープンソースの機械学習パイプライン・フレームワークが組み込まれています。MLflowを使用すると、モデルの開発、訓練、デプロイメント、そしてモニタリングが容易になります。
5. **Workload Isolation**: Databricksは、異なるユーザーやチーム間でワークロードを孤立させる機能を提供しています。これにより、資
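
Completions are printed as each job finishes, which is why the output above is not in question order. If ordered output is preferred, results can be stored by their `identifier` and printed after the loop ends. The sketch below simulates this with hand-written result dictionaries that use the same keys as the generation loop (`identifier`, `eos`, `full_completion`); `collect_in_order` is a hypothetical helper and the sample data is made up for illustration.

```python
from typing import Optional


def collect_in_order(results: list[dict], num_prompts: int) -> list[Optional[str]]:
    """Place each finished completion at the index given by its identifier."""
    ordered: list[Optional[str]] = [None] * num_prompts
    for r in results:
        if r.get("eos"):
            ordered[r["identifier"]] = r["full_completion"]
    return ordered


# Simulated results arriving out of order, as they do with dynamic batching.
finished = [
    {"identifier": 2, "eos": True, "full_completion": "third"},
    {"identifier": 0, "eos": True, "full_completion": "first"},
    {"identifier": 1, "eos": True, "full_completion": "second"},
]

for idx, text in enumerate(collect_in_order(finished, 3)):
    print(f"Completion {idx}: {text}")
```

In the code from this post the `completions` list is already indexed by `idx`, so iterating over it after the `while` loop would give the same ordering.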

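Summary

I tried Mistral.AI's Mistral-Nemo-Instruct.
Note that it does not (yet) appear to be supported as a Databricks "provisioned throughput Foundation Model".

Mistral has also released Mistral Large, although it is not licensed for commercial use, and it is generating a lot of excitement.

Meta has released Llama3.1 as well, and the momentum around local LLMs is remarkable.
I plan to try Llama3.1 too.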
