Trying out the small LLM "Gemma-2B"

Posted at 2024-07-01

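Introduction

Google has an LLM called Gemma.

Its distinguishing feature is high performance for its size, and the documentation says it can run directly on a developer's PC. Indeed, the 2B model (about 2 billion parameters) appears to be intended to run on a CPU as well. Inference quality is also good: according to the technical report, the 7B model matches or exceeds LLaMA 2 at the same 7B scale.
In this post, I ran inference with Gemma-2B on the CPU and the GPU of my home machine.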

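Preparing to load the model

Logging in to Hugging Face
https://huggingface.co
The pretrained Gemma models are published on Hugging Face. If you already have an account, click [Log In]. If not, create one from [Sign Up].
Requesting model access from Google
https://huggingface.co/google/gemma-2b-it
This is the page for Gemma-2b-it. I decided to use the instruction-tuned model (it) rather than the base model. Using the model requires an access request to Google, so fill in the required fields and submit the form.
Creating a Hugging Face access token
https://huggingface.co/settings/tokens
Create an access token so that the Hugging Face model can be accessed from code or the CLI.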
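
Rather than pasting the token into every script, it can be read from an environment variable. The following is a minimal sketch, assuming the token has been exported as HF_TOKEN beforehand; the helper name load_hf_token is hypothetical and not part of any library.

```python
import os


def load_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the Hugging Face access token stored in an environment variable.

    `load_hf_token` is a hypothetical helper for this article, not a library API.
    """
    # Fall back to the real process environment unless a mapping is supplied.
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"{var_name} is not set; create a token at "
            "https://huggingface.co/settings/tokens and export it first."
        )
    return token
```

The result could then be passed as token=load_hf_token() to the from_pretrained calls shown later, which keeps the secret out of the source file.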

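Installing the required modules

Install the Python modules needed for your environment. (The scripts below also import torch, so PyTorch is assumed to be installed already.)

$ pip install transformers
$ pip install accelerate # only needed for GPU execution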

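Running inference on the CPU

Let's run the tutorial code.
One thing to note: the from_pretrained calls for AutoTokenizer and AutoModelForCausalLM need the issued access token passed via token=. In addition, generate accepts max_new_tokens to cap the number of generated tokens, so set it high enough to get a complete result. (Without it, the library default caps the output at about 20 tokens.)
The first run takes a while because the model has to be downloaded. From the second run onward, it loads from the already-downloaded checkpoint.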

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

token = "***" # the access token you created
tokenizer = AutoTokenizer.from_pretrained(
    "google/gemma-2b-it",
    token=token)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",
    token=token,
    torch_dtype=torch.bfloat16
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**input_ids, max_new_tokens=1000)
print(tokenizer.decode(outputs[0]))

Running inference on the GPU

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

token = "***" # the access token you created
tokenizer = AutoTokenizer.from_pretrained(
    "google/gemma-2b-it",
    token=token)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",
    device_map="auto",
    token=token,
    torch_dtype=torch.bfloat16
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=1000)
print(tokenizer.decode(outputs[0]))
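
The CPU and GPU scripts differ only in device_map="auto" and the .to("cuda") call, so one option is to keep a single script and choose the device at run time. The following is a minimal sketch of that idea; pick_device is a hypothetical helper name, and it falls back to "cpu" when PyTorch is missing or no CUDA device is visible.

```python
def pick_device():
    """Return "cuda" when a CUDA GPU is usable from PyTorch, otherwise "cpu".

    `pick_device` is a hypothetical helper for this article.
    """
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print("running on:", device)
```

With this in place, the tokenizer output in the scripts above could be moved with .to(device) instead of a hard-coded "cuda".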

Both ran successfully (the GPU took about 10 seconds, the CPU about 4 minutes).
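
Timings like these can be collected with a small wrapper around the call being measured. This is a minimal sketch using only the standard library; the helper name timed is hypothetical, and a call to sum stands in for model.generate so that the snippet runs without loading the model.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the scripts above this would be
#   outputs, seconds = timed(model.generate, **input_ids, max_new_tokens=1000)
result, seconds = timed(sum, range(1_000_000))
print(f"result={result}, elapsed={seconds:.3f}s")

# Rough speed-up implied by the figures reported in this article
# (about 4 minutes on the CPU vs. about 10 seconds on the GPU).
cpu_seconds, gpu_seconds = 4 * 60, 10
print(f"GPU was roughly {cpu_seconds / gpu_seconds:.0f}x faster")
```

Note that a single run includes one-off costs such as warm-up, so averaging several runs would give a steadier number.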

The poem about machine learning that it composed

Machines, they weave and they learn, From the data, they discern.
Algorithms, a symphony, Unleash the power of the machine.
Data as the canvas, a masterpiece, Machine paints, a new perspective.
From the past, the future takes flight, With the wisdom of machines, day and night.
Algorithms, a dance of the mind, Unleash the power of the machine.
Learning, adapting, a constant flow, The machine's wisdom, a story to know.
So let the machines, with their might, Help us solve problems, day and night.
From healthcare to finance, they take the lead, A future of possibilities, indeed.
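(Google Translate's Japanese rendering, translated back into English) Machines weave and learn. From the data, they discern.
Algorithms, a symphony, unleash the power of the machine.
A masterpiece with data as its canvas, machine paint, a new perspective.
From the past the future takes flight, making full use of the wisdom of machines day and night.
Algorithms, a dance of the mind, unleash the power of the machine.
Learning, adaptation, a constant flow, the wisdom of machines, a story worth knowing.
So let the machines show their might. Please help solve problems day and night.
From healthcare to finance, they take the lead, truly a future of possibilities.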

Execution environment

CPU: Intel Core(TM) i7-9700 CPU
GPU: NVIDIA GeForce RTX 2080 Ti (11GB GDDR6)
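
As a rough sanity check on why the model fits on this GPU, the memory taken by the weights can be estimated from the parameter count and the data type. The sketch below assumes this article's round figure of about 2 billion parameters and 2 bytes per parameter for bfloat16; it ignores activations, the KV cache, and framework overhead, so real usage will be higher. The function name weight_memory_gib is hypothetical.

```python
# Bytes used to store one parameter in each data type.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params, dtype="bfloat16"):
    """Estimate the memory taken by model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


params = 2_000_000_000  # "2B" as quoted in this article (an approximation)
for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: about {weight_memory_gib(params, dtype):.1f} GiB")
```

Under these assumptions the bfloat16 weights come to a little under 4 GiB, which leaves headroom within the 11 GB of the RTX 2080 Ti.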
