Trying out the LLM "Llama3" with 4-bit quantization

Last updated at 2024-07-24 / Posted at 2024-04-21

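Overview

I tried running Llama3, which was announced two days ago, with 4-bit quantization
It uses about 6 GB of GPU VRAM (5,934 MB reserved after loading the model, 6,488 MB after inference)
The Llama3 vocabulary has grown substantially, from 32000 tokens (Llama2) to 128256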
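
To get a feel for what the larger vocabulary costs, the sketch below counts the weights in a single token embedding matrix for both vocabulary sizes. The hidden size of 4096 is an assumption taken from the published Llama 2 7B and Llama 3 8B model configs, not from this article, and the helper name is hypothetical.

```python
# Assumption: hidden size 4096, as in the published Llama 2 7B / Llama 3 8B configs.
HIDDEN_SIZE = 4096


def embedding_params(vocab_size: int, hidden_size: int = HIDDEN_SIZE) -> int:
    """Number of weights in one (vocab_size x hidden_size) embedding matrix."""
    return vocab_size * hidden_size


llama2_vocab = 32000
llama3_vocab = 128256

llama2_emb = embedding_params(llama2_vocab)
llama3_emb = embedding_params(llama3_vocab)

print(f"Llama 2 embedding matrix: {llama2_emb:,} parameters")
print(f"Llama 3 embedding matrix: {llama3_emb:,} parameters")
print(f"Vocabulary growth factor: {llama3_vocab / llama2_vocab:.3f}x")
```

Under this assumption the embedding matrix grows by the same factor as the vocabulary, roughly 4x. If the output projection is not tied to the input embedding, it grows by the same amount again.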

Demo

The Meta-Llama-3-8B-Instruct model tried in this article is hosted on our company's Playground!
You can try it out as an actual chat

Llama3.1 was released on 2024-07-23, so the chat has been switched to that model
https://chatstream.net/?model_id=meta_llama_3_1_8b_instruct&ws_name=chat_app_en

Environment

NVIDIA RTX A5000
Python 3.11.4

Source code

The script loads Llama3 with 4-bit quantization via bitsandbytes and checks the GPU memory usage

import transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, set_seed
from datetime import datetime

def cout_memory_availability(label):
    """
    Print the GPU memory usage together with the current time.
    :param label: label describing when the measurement was taken
    :return: None
    """

    # Get the current time and format it (the format string uses Japanese date/time suffixes)
    now = datetime.now()
    formatted_now = now.strftime("%Y年%m月%d日%H時%M分%S秒")

    total_memory = torch.cuda.get_device_properties(0).total_memory / (1024 ** 2)  # total GPU memory in MB
    reserved_memory = torch.cuda.memory_reserved(0) / (1024 ** 2)  # memory reserved by PyTorch in MB
    available_memory = total_memory - reserved_memory  # memory not yet reserved by PyTorch

    # Output labels are in Japanese: GPUの総メモリ = total GPU memory,
    # 予約済みメモリ = reserved memory, 利用可能なメモリ = available memory
    print(f"{formatted_now} - {label}: GPUの総メモリ: {total_memory:.2f} MB")
    print(f"{formatted_now} - {label}: 予約済みメモリ: {reserved_memory:.2f} MB")
    print(f"{formatted_now} - {label}: 利用可能なメモリ: {available_memory:.2f} MB")


set_seed(42)  # fix the random seed

cout_memory_availability("モデル読み込み前")  # label: "before loading the model"

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Create the 4-bit quantization config
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

# Load the model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=quantization_config,
                                             device_map="auto")

cout_memory_availability("モデル読み込み後")  # label: "after loading the model"

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    model_kwargs={"torch_dtype": torch.bfloat16},
)

messages = [
    {"role": "system",
     "content": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions."},
    {"role": "user", "content": "Who are you?"},
]

prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(f"prompt:{prompt}")

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

print(outputs[0]["generated_text"])


cout_memory_availability("推論後")  # label: "after inference"
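
To illustrate what 4-bit quantization does to a weight tensor, here is a minimal pure-Python sketch of blockwise absmax quantization. It is a simplification, not the bitsandbytes implementation: it uses a uniform signed integer grid (-7 to 7), whereas bitsandbytes uses FP4 or NF4 code tables. The block size of 64 and all function names are assumptions for illustration.

```python
from typing import List, Tuple

BLOCK_SIZE = 64   # assumed block size for this illustration
LEVELS = 7        # signed 4-bit grid: integers in [-7, 7]


def quantize_block(block: List[float]) -> Tuple[List[int], float]:
    """Quantize one block to 4-bit integers plus one float scale (absmax)."""
    absmax = max(abs(x) for x in block) or 1.0  # avoid division by zero
    codes = [round(x / absmax * LEVELS) for x in block]
    return codes, absmax


def dequantize_block(codes: List[int], absmax: float) -> List[float]:
    """Map the 4-bit integers back to approximate float values."""
    return [c / LEVELS * absmax for c in codes]


def quantize(weights: List[float]) -> List[Tuple[List[int], float]]:
    """Split the weights into blocks and quantize each block independently."""
    return [
        quantize_block(weights[i:i + BLOCK_SIZE])
        for i in range(0, len(weights), BLOCK_SIZE)
    ]


def dequantize(blocks: List[Tuple[List[int], float]]) -> List[float]:
    """Rebuild the approximate weights from all quantized blocks."""
    out: List[float] = []
    for codes, absmax in blocks:
        out.extend(dequantize_block(codes, absmax))
    return out


# Toy weights: a smooth ramp in [-1, 1)
weights = [(i - 64) / 64 for i in range(128)]
restored = dequantize(quantize(weights))
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max reconstruction error: {max_error:.4f}")
```

Each code fits in 4 bits, and one scale is stored per block, which is where the memory saving over 16-bit weights comes from. The price is the small reconstruction error the script prints.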

Results

Prompt format

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_msg_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ model_answer_1 }}<|eot_id|>

Prompt (input)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.<|eot_id|><|start_header_id|>user<|end_header_id|>

Who are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
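
The prompt above can be reproduced by hand, which makes the format easier to see. The following is a minimal sketch that follows the template shown in this section; the function name is hypothetical, and the tokenizer's own `apply_chat_template` remains the source of truth for the real format.

```python
from typing import Dict, List


def build_llama3_prompt(messages: List[Dict[str, str]],
                        add_generation_prompt: bool = True) -> str:
    """Hand-rolled approximation of the Llama 3 Instruct chat format."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Each turn: role header, blank line, content, end-of-turn marker
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


messages = [
    {"role": "system",
     "content": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions."},
    {"role": "user", "content": "Who are you?"},
]

print(build_llama3_prompt(messages))
```

Building the string manually is mainly useful for understanding or debugging. In real code, calling `apply_chat_template` as the script does avoids drifting from the model's actual template.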

Generated text (output)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.<|eot_id|><|start_header_id|>user<|end_header_id|>

Who are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Nice to meet you! I'm LLaMA, an AI assistant developed by Meta AI that can understand and respond to human input in a helpful and accurate manner. I'm here to assist you with any questions, tasks, or topics you'd like to discuss.

I'm trained on a massive dataset of text from the internet and can generate human-like responses to a wide range of topics, from science and history to entertainment and culture. I can also help with tasks such as language translation, summarization, and more.

Feel free to ask me anything, and I'll do my best to provide a helpful and informative response. What's on your mind?

(Japanese translation of the output above)

「はじめまして！私はMeta AIによって開発されたAIアシスタントのLLaMAです。人間の入力を理解し、役立つ正確な方法で応答することができます。どんな質問、タスク、話題についてもお手伝いします。

私はインターネット上の膨大なテキストデータセットで訓練されており、科学や歴史からエンターテインメント、文化に至るまで幅広いトピックに対して人間のような反応を生成することができます。言語翻訳、要約などのタスクにも対応しています。

何でもお尋ねください。できる限りお役に立てるよう努めます。何か気になることはありますか？」

It replied properly and politely, just as the system prompt asked. Looking forward to working with you, Llama3!
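
The generated text shown in this section starts with the prompt, because the text-generation pipeline returns the prompt together with the completion by default. Passing `return_full_text=False` to the pipeline call is one way to get only the new text. As an alternative, here is a small sketch that strips the prompt afterwards; the helper name and the toy strings are hypothetical.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
EOT = "<|eot_id|>"


def extract_reply(generated_text: str, prompt: str) -> str:
    """Return only the newly generated text, without the echoed prompt."""
    if generated_text.startswith(prompt):
        reply = generated_text[len(prompt):]
    else:
        # Fallback: take everything after the last assistant header
        reply = generated_text.rsplit(ASSISTANT_HEADER, 1)[-1]
    # Drop an end-of-turn marker if one was decoded into the text
    return reply.replace(EOT, "").strip()


# Toy example standing in for the real pipeline output
example_prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Who are you?<|eot_id|>" + ASSISTANT_HEADER + "\n\n"
)
example_output = example_prompt + "Nice to meet you! I'm LLaMA."

print(extract_reply(example_output, example_prompt))
```

With the script in this article, the call would look like `extract_reply(outputs[0]["generated_text"], prompt)`.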

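GPU memory usage

The log labels are in Japanese: GPUの総メモリ is the total GPU memory, 予約済みメモリ is the memory reserved by PyTorch, and 利用可能なメモリ is the remainder.

After loading the model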

2024年04月21日08時58分40秒 - モデル読み込み後: GPUの総メモリ: 24563.50 MB
2024年04月21日08時58分40秒 - モデル読み込み後: 予約済みメモリ: 5934.00 MB
2024年04月21日08時58分40秒 - モデル読み込み後: 利用可能なメモリ: 18629.50 MB

After inference

2024年04月21日08時58分48秒 - 推論後: GPUの総メモリ: 24563.50 MB
2024年04月21日08時58分48秒 - 推論後: 予約済みメモリ: 6488.00 MB
2024年04月21日08時58分48秒 - 推論後: 利用可能なメモリ: 18075.50 MB
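
As a sanity check on the roughly 6 GB figure, the sketch below estimates the weight footprint from the model's layer shapes. Everything here is an assumption, not something measured in this article: the architecture values come from the published Llama 3 8B config, only the linear layers inside the transformer blocks are assumed to be stored in 4 bits, the embedding and output head are assumed to stay in 16-bit, and one float32 scale per 64 weights is assumed. Norm weights, the CUDA context, and allocator overhead are ignored.

```python
# Assumed Llama 3 8B architecture values (from the published model config,
# not from this article).
HIDDEN = 4096
INTERMEDIATE = 14336
LAYERS = 32
KV_DIM = 1024          # 8 key/value heads x 128 dims per head
VOCAB = 128256
MIB = 1024 ** 2

# Linear layers inside each transformer block (assumed to be quantized)
attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM   # q, o and k, v projections
mlp = 3 * HIDDEN * INTERMEDIATE                    # gate, up, down projections
linear_params = LAYERS * (attn + mlp)

# Input embedding and output head (assumed to stay in 16-bit)
embed_params = 2 * VOCAB * HIDDEN

quantized_bytes = linear_params * 0.5      # 4 bits per weight
scale_bytes = linear_params / 64 * 4       # one float32 scale per 64 weights (assumed)
embed_bytes = embed_params * 2             # 2 bytes per 16-bit weight

estimate_mib = (quantized_bytes + scale_bytes + embed_bytes) / MIB
print(f"estimated weight footprint: {estimate_mib:.0f} MiB")

# Values reported in the logs above
reserved_after_load = 5934.00
reserved_after_inference = 6488.00
extra_for_inference = reserved_after_inference - reserved_after_load
print(f"extra memory reserved during inference: {extra_for_inference:.0f} MB")
```

Under these assumptions the estimate lands a little below the reserved memory reported after loading, which is plausible given what the estimate leaves out. The difference between the two log snapshots is the extra memory PyTorch reserved while generating.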

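Bonus

Perhaps because this is a much-anticipated new Llama model, the Hugging Face servers seemed to be congested. I had to re-run the download many times, and getting the model took a whole day.

As of this morning (2024/4/21), the download count is already remarkable.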
