
How to run a 13B LLM on the free tier of Google Colab (no OOM! Fugaku-LLM-13B worked too!)

Last updated at 2024-05-26, posted at 2024-05-19

How to run 7B and 13B LLMs on the free tier of Google Colab (no OOM!)

Conclusion

Load the model in 4-bit or 8-bit using bitsandbytes (quantization).
(Memory consumption goes down, but in exchange the LLM's accuracy drops.)
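As a rough illustration of why quantization avoids OOM, the sketch below estimates the memory needed for the weights alone at different bit widths. The 13e9 parameter count is a round figure for a "13B" model, and the 15 GiB budget for a free-tier Colab GPU is an assumption; the actual GPU and usable memory vary, and activations, the KV cache, and quantization overhead are not counted.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 13e9        # round figure for a "13B" model (assumption)
GPU_BUDGET_GIB = 15.0    # assumed usable memory on a free-tier Colab GPU

for bits in (16, 8, 4):
    needed = weight_memory_gib(NUM_PARAMS, bits)
    verdict = "fits" if needed < GPU_BUDGET_GIB else "does not fit"
    print(f"{bits:>2}-bit weights: {needed:5.1f} GiB ({verdict} in {GPU_BUDGET_GIB} GiB)")
```

Under these assumptions, 16-bit weights exceed the budget, while 8-bit and 4-bit weights leave some headroom for generation.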

All you need to do is add load_in_8bit=True or load_in_4bit=True to the arguments of AutoModelForCausalLM.from_pretrained.
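To make the choice of bit width explicit, here is a minimal sketch of a helper that returns the extra keyword arguments described above. The name `quantization_kwargs` is hypothetical and only for illustration; the two flags are the ones named in this article.

```python
def quantization_kwargs(bits: int) -> dict:
    """Return the extra from_pretrained keyword arguments for a bit width.

    Hypothetical helper: 4 and 8 map to the flags used in this article,
    16 means "no quantization" and adds nothing.
    """
    if bits == 4:
        return {"load_in_4bit": True}
    if bits == 8:
        return {"load_in_8bit": True}
    if bits == 16:
        return {}
    raise ValueError(f"unsupported bit width: {bits}")


# Intended use (not executed here, since it downloads a model):
#   model = AutoModelForCausalLM.from_pretrained(
#       model_path, device_map="auto", **quantization_kwargs(4))
print(quantization_kwargs(4))
```

Newer transformers releases emit a deprecation warning for these flags and prefer a `BitsAndBytesConfig` object passed as `quantization_config`, so check which form your installed version expects.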

(The 7B example has been removed, because 7B ran on the GPU even without bitsandbytes.)

Working example

Installing accelerate and bitsandbytes

For some reason a plain install produces a mysterious error, so as a workaround, uninstall both packages and then install them again.

!pip uninstall accelerate bitsandbytes -y
!pip install accelerate
!pip install -i https://pypi.org/simple/ bitsandbytes
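After reinstalling, it can help to confirm which versions actually ended up in the runtime. The following is a small sketch using only the standard library; the helper name `installed_versions` is hypothetical, and including transformers in the list is my addition for convenience.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(["accelerate", "bitsandbytes", "transformers"]).items():
    print(f"{name}: {ver if ver is not None else 'NOT INSTALLED'}")
```

If a package shows as not installed, or the version did not change after the reinstall, restarting the Colab runtime before importing is worth trying.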

Loading 13B in 4-bit

In theory it should have worked, and Fugaku-LLM did indeed run.
What's more, even loaded in 4-bit, it returned a proper answer!

Source code

Note: Fugaku-LLM cannot be used until you agree to its terms of use and so on, so you need to log in to Hugging Face.

from huggingface_hub import login
# Log in to Hugging Face ("HF_TOKEN" is a placeholder; pass your own access token)
login("HF_TOKEN")

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

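model_path = "Fugaku-LLM/Fugaku-LLM-13B-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# load_in_4bit=True quantizes the weights with bitsandbytes while loading;
# device_map="auto" lets accelerate place the model on the available GPU.
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16,
                                             device_map="auto", load_in_4bit=True)
model.eval()  # inference mode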

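# System text, in English: "Below is an instruction that describes a task.
# Write a response that appropriately satisfies the request."
system_example = "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"
# Instruction, in English: "Please tell me the origin of the name of the supercomputer Fugaku."
instruction_example = "スーパーコンピュータ「富岳」の名前の由来を教えてください。"

# "### 指示:" (instruction) and "### 応答:" (response) are the prompt markers; keep them in Japanese.
prompt = f"{system_example}\n\n### 指示:\n{instruction_example}\n\n### 応答:\n"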

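# Tokenize the prompt into a PyTorch tensor of token IDs
input_ids = tokenizer.encode(prompt,
                             add_special_tokens=False,
                             return_tensors="pt")
# Generate up to 128 new tokens, sampling at a low temperature
tokens = model.generate(
    input_ids.to(device=model.device),
    max_new_tokens=128,
    do_sample=True,
    temperature=0.1,
    top_p=1.0,
    repetition_penalty=1.0,
    top_k=0
)
# Decode the result (the output contains the prompt followed by the answer)
out = tokenizer.decode(tokens[0], skip_special_tokens=True)
print(out)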
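Because the decoded output contains the prompt as well as the answer, it can be handy to wrap the prompt format in a function and cut the answer out afterwards. The sketch below assumes the same "### 指示:" / "### 応答:" layout used above; the helper names `build_prompt` and `extract_response` are hypothetical.

```python
RESPONSE_MARKER = "### 応答:\n"


def build_prompt(system: str, instruction: str) -> str:
    """Assemble the prompt in the same layout as the source code above."""
    return f"{system}\n\n### 指示:\n{instruction}\n\n{RESPONSE_MARKER}"


def extract_response(generated_text: str) -> str:
    """Return only the text after the first response marker.

    If the marker is missing, the whole text is returned unchanged
    apart from surrounding whitespace.
    """
    if RESPONSE_MARKER not in generated_text:
        return generated_text.strip()
    _, _, response = generated_text.partition(RESPONSE_MARKER)
    return response.strip()


# Stand-in for tokenizer.decode(...) output, so this runs without a model
fake_output = build_prompt("system text", "instruction text") + "model answer\n"
print(extract_response(fake_output))
```

With the real model, `extract_response(out)` would be applied to the decoded string in place of `fake_output`.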
