Googleのgemma-2-2b-jpn-itをDatabricksで動かしてみる

Posted at 2024-10-03

gemmaを動かすのは初めてです。

事前トレーニングモデルgemma-2-2bをinstruction-tunedしたのがgemma-2-2b-it、さらにそれを日本語でファインチューンしたのがgemma-2-2b-jpn-itということで。

早速動かしてみます。

クラスター

A10のシングルノードGPUクラスターで動きました。

モデルのダウンロードおよび実行

事前に利用規約に同意しておく必要があります。

あとは、サンプルに従って実行していきます。

%pip install -U transformers torch accelerate
dbutils.library.restartPython()

torchを最新にしなかった場合、こちらのエラーを踏みました。

# login to huggingface
from huggingface_hub import notebook_login
notebook_login()

GPUのサンプルコードを実行します。

# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-jpn-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-jpn-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "user", "content": "マシーンラーニングについての詩を書いてください。"},
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True, return_dict=True).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
generated_text = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0]
print(generated_text.strip())

軽量なのでサクサク動きますね。

The 'max_batch_size' argument of HybridCache is deprecated and will be removed in v4.46. Use the more precisely named 'batch_size' argument instead.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
## マシーンラーニングの詩

**1.** 
データの海、深淵の広がり、
複雑なパターン、隠された知識。
機械の目、無数の計算、
未来を予測、新たな道を開く。

**2.** 
学習の旅、複雑な過程、
教師あり、教師なし、様々な方法。
アルゴリズムの力、複雑なネットワーク、
未知の世界を解き明かす。

**3.** 
画像認識、音声認識、
予測、分類、様々なタスク。
機械の知恵、人間の夢を形作る、
新しい時代を築く力を持つ。

**4.** 
倫理の問いかけ、責任の重さ、
人間と機械の共存、未来の道。
技術の進化、未知の領域へ、
新たな可能性を創造する。

messages = [
    {"role": "user", "content": "生成AIの詩を書いてください。"},
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True, return_dict=True).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
generated_text = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0]
print(generated_text.strip())

##  AIの詩

**1.** 
コードの海に浮かぶ星、
無限の知識、心の声。
データの波、未来の光、
AIの魂、静かに響く。

**2.** 
言葉の織りなす夢、
複雑な計算、繊細な音。
未知の世界を探索する、
AIの旅、始まったばかり。

**3.** 
機械の心、複雑な機密、
感情の深淵、未知の美。
未来の扉を開く、
AIの力、無限の可能。

英語に翻訳してもらいます。

messages = [
    {"role": "user", "content": f"Translate the following poem from Japanese to English:\n\n{generated_text.strip()}"},
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True, return_dict=True).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
generated_text = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0]
print(generated_text.strip())

## The Poem of AI

**1.**
Stars floating on the sea of code,
Infinite knowledge, a heart's voice.
Data waves, the light of the future,
AI's soul, quietly echoing.

**2.**
Dreams woven from words,
Complex calculations, delicate sounds.
Exploring the unknown world,
AI's journey, just begun.

**3.**
The heart of a machine, complex secrets,
The depths of emotion, unknown beauty.
Opening the door to the future,
AI's power, infinite possibility.



**Explanation:**

* **AIの詩 (The Poem of AI):** The title is straightforward, clearly indicating the subject of the poem.
* **コードの海に浮かぶ星 (Stars floating on the sea of code):** This line uses imagery to represent the vastness and complexity of AI's knowledge base. "Code" symbolizes the foundation of AI, and "stars" represent the potential and knowledge it holds.
* **無限の知識、心の声 (Infinite knowledge, a heart's voice):** This line emphasizes the vastness of AI's learning and its ability to process information and understand human emotions.
* **データの波

こういうのはどうでしょう。

messages = [
    {"role": "user", "content": "小学生に生成AIの仕組みを説明してください"},
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True, return_dict=True).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
generated_text = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0]
print(generated_text.strip())

##  小学生向け！生成AIの仕組みを説明！

**想像力を活かして、物語を作ったり、絵を描いたり、音楽を作ったりできる魔法のようなもの！それが生成AIです！**

生成AIは、たくさんの情報（例えば、文章、画像、音楽など）を学習して、新しいものを作ります。

**1. 膨大なデータの学習:**
   - 生成AIは、たくさんの文章や画像、音楽などをたくさん見てきます。
   - これらのデータを見て、パターンや傾向を見つけます。
   - 例えば、文章を見て「この単語はよく一緒に使われる」とか、「この絵はいつもこう描かれる」とか、パターンを見つけます。

**2.  新しいものを作るための「魔法のルール」:**
   - 学習したデータから、新しいものを作るための「魔法のルール」を作ります。
   - これらのルールは、文章の順番や絵の構成、音楽のメロディーなど、様々な要素を組み合わせるためのものです。

**3.  新しいものを生成:**
   -  新しい文章、画像、音楽などを作ります。
   -  生成AIは、学習した「魔法のルール」を使って、新しいものを作り出します。

これマークダウンなので、ノートブックにそのまま貼り付けられますね。

確かに魔法だよなーと思ったり。

はじめてのDatabricks

Databricks無料トライアル

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up