Building an anime quiz model with the small pretrained models Sarashina2.2 0.5B and 3B

Posted at 2025-04-02

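Introduction

Hi, I'm Shun. The other day I came across an anime quiz dataset that had been published on X (formerly Twitter), so I tried fine-tuning a small language model (SLM) on it as a change of pace.

This is the dataset I used: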

umiyuki/Ani-Bench-JP

Models used

This time I used two small pretrained models from the Sarashina2.2 series published by SB Intuitions.

sbintuitions/sarashina2.2-0.5b
sbintuitions/sarashina2.2-3b

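What is Sarashina2.2?

Sarashina2.2 is a Japanese large language model that was built with particular attention to math and coding tasks in Japanese.

Features

Although the models are specialized for Japanese tasks, they also score very well on math and coding tasks, as shown below.

Model	NIILC	JMMLU	MGSM-ja	JHumanEval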
Sarashina2-7B	62.2	42.5	7.2	12.8
Sarashina2-70B	66.1	62.7	56.4	22.0
Sarashina2.2-3B	62.2	52.7	62.4	39.6
Sarashina2.2-0.5B	34.6	28.8	21.2	15.2
Sarashina2.2-1B	47.2	38.4	38.8	21.3

In particular, even the 3B model beats Sarashina2-70B on MGSM-ja (62.4 vs. 56.4) and JHumanEval (39.6 vs. 22.0).
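As a quick check of that comparison, here is a minimal sketch that encodes the two relevant table rows and lists the benchmarks where the 3B model scores higher. The dictionary and function names are only illustrative; the numbers are copied from the table above.

```python
# Scores copied from the table above (higher is better on every benchmark).
SCORES = {
    "Sarashina2-70B": {"NIILC": 66.1, "JMMLU": 62.7, "MGSM-ja": 56.4, "JHumanEval": 22.0},
    "Sarashina2.2-3B": {"NIILC": 62.2, "JMMLU": 52.7, "MGSM-ja": 62.4, "JHumanEval": 39.6},
}


def benchmarks_won(challenger: str, baseline: str) -> list[str]:
    """Return the benchmarks on which `challenger` scores strictly higher than `baseline`."""
    return [
        name
        for name, score in SCORES[challenger].items()
        if score > SCORES[baseline][name]
    ]


wins = benchmarks_won("Sarashina2.2-3B", "Sarashina2-70B")
print(wins)  # ['MGSM-ja', 'JHumanEval']
```

The 70B model is still ahead on NIILC and JMMLU, so the claim only covers the math and coding columns.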

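Details are covered on the official blog.

Of course, I have not investigated this carefully myself, so I can't say how it really compares with other Japanese models. In my own experiments, though, the results on anime quizzes were better than I expected.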

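Environment

MacBook Pro M4 32GB
Uses MPS (Metal Performance Shaders)

Experiment setup

Dataset: umiyuki/Ani-Bench-JP (about 100 questions)
Fine-tuning targets: 0.5B / 3B
Epochs: 10 for 0.5B, 5 for 3B
Batch size: 8 for 0.5B, 4 for 3B
Learning rate: the default for 0.5B (5e-5 in TrainingArguments), 3e-5 for 3B
For 3B, max_grad_norm=1.0 is set explicitly (gradient clipping to suppress abnormal updates; this is also the TrainingArguments default)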

Training progress

0.5B training log (10 epochs)

Training loss (train_loss, the average over the whole run): 1.34
Time taken: about 2 minutes

warnings.warn(
{'train_runtime': 129.1086, 'train_samples_per_second': 6.971, 'train_steps_per_second': 0.929, 'train_loss': 1.3420065561930339, 'epoch': 10.0}
100%|██████████████████████████████████████████| 120/120 [02:09<00:00,  1.08s/it]

3B training log (5 epochs)

Training loss (train_loss, the average over the whole run): 0.8468
Time taken: about 50 minutes

warnings.warn(
{'train_runtime': 3011.0643, 'train_samples_per_second': 0.149, 'train_steps_per_second': 0.038, 'train_loss': 0.8468221913213315, 'epoch': 5.0}
100%|██████████████████████████████████████████| 115/115 [50:11<00:00, 26.18s/it]
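The step counts and throughput in both logs can be sanity-checked with a little arithmetic. The sketch below assumes the dataset has exactly 100 questions (the article only says "about 100"), so that the 90/10 split leaves 90 training examples, and that there is no gradient accumulation. The function name is illustrative.

```python
import math

# Assumption: the dataset has exactly 100 questions ("about 100" in the article),
# so train_test_split(test_size=0.1) leaves 90 training examples.
TRAIN_EXAMPLES = 90


def total_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for a run with no gradient accumulation on a single device."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # the last partial batch still counts
    return steps_per_epoch * epochs


steps_05b = total_steps(TRAIN_EXAMPLES, batch_size=8, epochs=10)
steps_3b = total_steps(TRAIN_EXAMPLES, batch_size=4, epochs=5)
print(steps_05b, steps_3b)  # 120 115

# Throughput reported in the logs is (examples seen) / runtime in seconds.
throughput_05b = round(TRAIN_EXAMPLES * 10 / 129.1086, 3)
throughput_3b = round(TRAIN_EXAMPLES * 5 / 3011.0643, 3)
print(throughput_05b, throughput_3b)  # 6.971 0.149
```

Both results match the progress bars (120/120 and 115/115) and the train_samples_per_second values in the logs, which suggests the 90-example assumption is at least consistent with the runs.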

Fine-tuning code

python ft_sarashina2-05b.py

# ft_sarashina2-05b.py
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
import torch

# 1. Load the dataset
dataset = load_dataset("umiyuki/Ani-Bench-JP", split="test")
print("Dataset columns:", dataset.column_names)
print("Sample:", dataset[0])

# 2. Train / eval split
split_dataset = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split_dataset["train"]
eval_dataset = split_dataset["test"]

# 3. tokenizer & model
model_name = "sbintuitions/sarashina2.2-0.5b"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Add <eos> as the end-of-sequence token if it is not already in the vocabulary
if "<eos>" not in tokenizer.get_vocab():
    tokenizer.add_special_tokens({"eos_token": "<eos>"})

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
# A token may have been added to the tokenizer, so resize the model's embeddings to match
model.resize_token_embeddings(len(tokenizer), mean_resizing=False)

# 4. Dataset preprocessing
def tokenize_function(batch):
    texts = [
        f"問題: {q}\n番組名: {bn}\n答え: {a} <eos>"
        for q, bn, a in zip(batch["問題"], batch["番組名"], batch["答え"])
    ]
    tokenized_outputs = tokenizer(texts, truncation=True, padding=True)
    tokenized_outputs["labels"] = tokenized_outputs["input_ids"].copy()

    # tokenized_outputs = tokenizer(texts, truncation=True, max_length=256, padding="max_length")
    # tokenized_outputs["labels"] = tokenized_outputs["input_ids"].copy()
    return tokenized_outputs

tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_eval = eval_dataset.map(tokenize_function, batched=True)

# 5. Trainer
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=10,
    per_device_train_batch_size=8,
    save_steps=1000,
    save_total_limit=2,
    logging_steps=500,
    evaluation_strategy="steps",
    eval_steps=500
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
)

trainer.train()
trainer.save_model("./results/ft_sarashina2-05b_anime_quiz")
tokenizer.save_pretrained("./results/ft_sarashina2-05b_anime_quiz")
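One thing to be aware of in the script above: tokenize_function pads every example with padding=True and then copies input_ids straight into labels, so the padded positions also contribute to the loss. A common alternative is to set the labels at padded positions to -100, the value the Hugging Face causal-LM loss ignores. The following is a minimal sketch of that idea, not what was used for the results in this article; mask_padding and the toy ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value that the Hugging Face causal-LM loss skips


def mask_padding(input_ids: list[list[int]], attention_mask: list[list[int]]) -> list[list[int]]:
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Toy batch: the second row is padded with a hypothetical pad id of 0.
toy_input_ids = [[11, 12, 13, 14], [21, 22, 0, 0]]
toy_attention_mask = [[1, 1, 1, 1], [1, 1, 0, 0]]
labels = mask_padding(toy_input_ids, toy_attention_mask)
print(labels)  # [[11, 12, 13, 14], [21, 22, -100, -100]]
```

Inside tokenize_function this would replace the .copy() line with something like mask_padding(tokenized_outputs["input_ids"], tokenized_outputs["attention_mask"]), assuming the tokenizer returns an attention mask. The reported loss values would change with this variant, so they would not be directly comparable with the logs above.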

python ft_sarashina2-3b.py

# ft_sarashina2-3b.py
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
import torch

# 1. Load the dataset
dataset = load_dataset("umiyuki/Ani-Bench-JP", split="test")
print("Dataset columns:", dataset.column_names)
print("Sample:", dataset[0])

# 2. Train / eval split
split_dataset = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split_dataset["train"]
eval_dataset = split_dataset["test"]

# 3. tokenizer & model
model_name = "sbintuitions/sarashina2.2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Add <eos> as the end-of-sequence token if it is not already in the vocabulary
if "<eos>" not in tokenizer.get_vocab():
    tokenizer.add_special_tokens({"eos_token": "<eos>"})

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
# A token may have been added to the tokenizer, so resize the model's embeddings to match
model.resize_token_embeddings(len(tokenizer), mean_resizing=False)

# 4. Dataset preprocessing
def tokenize_function(batch):
    texts = [
        f"問題: {q}\n番組名: {bn}\n答え: {a} <eos>"
        for q, bn, a in zip(batch["問題"], batch["番組名"], batch["答え"])
    ]
    tokenized_outputs = tokenizer(texts, truncation=True, padding=True)
    tokenized_outputs["labels"] = tokenized_outputs["input_ids"].copy()

    # tokenized_outputs = tokenizer(texts, truncation=True, max_length=256, padding="max_length")
    # tokenized_outputs["labels"] = tokenized_outputs["input_ids"].copy()
    return tokenized_outputs

tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_eval = eval_dataset.map(tokenize_function, batched=True)

# 5. Trainer
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    learning_rate=3e-5,   # ← 2e-5 to 5e-5 is the recommended range for 3B
    max_grad_norm=1.0,    # ← gradient clipping to suppress abnormal updates
    save_steps=1000,
    save_total_limit=2,
    logging_steps=500,
    evaluation_strategy="steps",
    eval_steps=500
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
)

trainer.train()
trainer.save_model("./results/ft_sarashina2-3b_anime_quiz")
tokenizer.save_pretrained("./results/ft_sarashina2-3b_anime_quiz")
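The two training scripts differ only in the model name, the output path, and a few hyperparameters. As a rough sketch of how those differences could be collected in one place, the snippet below builds the keyword arguments for TrainingArguments from a small config table. RunConfig, RUNS, and training_kwargs are hypothetical names introduced here; the values are the ones listed in the experiment setup.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunConfig:
    """Per-model settings; the class and field names are illustrative."""
    model_name: str
    output_name: str
    epochs: int
    batch_size: int
    learning_rate: Optional[float] = None  # None means "leave the Trainer default"
    max_grad_norm: Optional[float] = None  # None means "leave the Trainer default"


RUNS = {
    "0.5b": RunConfig("sbintuitions/sarashina2.2-0.5b", "ft_sarashina2-05b_anime_quiz", 10, 8),
    "3b": RunConfig("sbintuitions/sarashina2.2-3b", "ft_sarashina2-3b_anime_quiz", 5, 4, 3e-5, 1.0),
}


def training_kwargs(cfg: RunConfig) -> dict:
    """Build keyword arguments for TrainingArguments, omitting unset optional values."""
    kwargs = {
        "output_dir": "./results",
        "num_train_epochs": cfg.epochs,
        "per_device_train_batch_size": cfg.batch_size,
    }
    if cfg.learning_rate is not None:
        kwargs["learning_rate"] = cfg.learning_rate
    if cfg.max_grad_norm is not None:
        kwargs["max_grad_norm"] = cfg.max_grad_norm
    return kwargs


print(training_kwargs(RUNS["3b"]))
```

The resulting dictionary could be passed as TrainingArguments(**training_kwargs(cfg), ...) together with the save, logging, and eval settings shared by both scripts.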

Inference code (with a base-model comparison)

python main_ft_05b.py

# main_ft_05b.py
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed
import torch

def main():
    # prompt = "問題: 主人公の名前は何ですか？\n番組名: 魔法少女まどか☆マギカ\n答え:"
    # prompt = "問題: 幻影旅団の団長の名前は？\n番組名: HUNTER×HUNTER\n答え:"
    
    prompt = "問題: さやかが魔法少女になった理由は何ですか？\n番組名: 魔法少女まどか☆マギカ\n答え:"

    set_seed(123)

    # Base model
    base_model_name = "sbintuitions/sarashina2.2-0.5b"
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )

    # FT
    ft_model_path = "./results/ft_sarashina2-05b_anime_quiz"
    ft_tokenizer = AutoTokenizer.from_pretrained(ft_model_path)
    ft_model = AutoModelForCausalLM.from_pretrained(
        ft_model_path,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )

    # Pipelines
    base_pipeline = pipeline("text-generation", model=base_model, tokenizer=tokenizer)
    ft_pipeline = pipeline("text-generation", model=ft_model, tokenizer=ft_tokenizer)

    generation_kwargs = {
        "max_length": 100,
        "do_sample": True,
        "temperature": 0.7,
        "top_k": 50,
        "top_p": 0.95,
        "pad_token_id": tokenizer.pad_token_id,
        "eos_token_id": ft_tokenizer.eos_token_id,   # ← stop condition
        "num_return_sequences": 1
    }

    # Generate
    base_output = base_pipeline(prompt, **generation_kwargs)
    ft_output = ft_pipeline(prompt, **generation_kwargs)

    print("----- ベースモデルの出力 -----")
    print(base_output[0]['generated_text'])
    print("\n----- ファインチューニング済みモデルの出力 -----")
    print(ft_output[0]['generated_text'])

if __name__ == "__main__":
    main()

python main_ft_3b.py

# main_ft_3b.py
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed
import torch

def main():
    prompt = "問題: 主人公の名前は何ですか？\n番組名: 魔法少女まどか☆マギカ\n答え:"
    # prompt = "問題: 幻影旅団の団長の名前は？\n番組名: HUNTER×HUNTER\n答え:"
    
    # prompt = "問題: さやかが魔法少女になった理由は何ですか？\n番組名: 魔法少女まどか☆マギカ\n答え:"

    set_seed(123)

    # Base model
    base_model_name = "sbintuitions/sarashina2.2-3b"
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )

    # FT
    ft_model_path = "./results/ft_sarashina2-3b_anime_quiz"
    ft_tokenizer = AutoTokenizer.from_pretrained(ft_model_path)
    ft_model = AutoModelForCausalLM.from_pretrained(
        ft_model_path,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )

    # Pipelines
    base_pipeline = pipeline("text-generation", model=base_model, tokenizer=tokenizer)
    ft_pipeline = pipeline("text-generation", model=ft_model, tokenizer=ft_tokenizer)

    generation_kwargs = {
        "max_length": 100,
        "do_sample": True,
        "temperature": 0.7,
        "top_k": 50,
        "top_p": 0.95,
        "pad_token_id": tokenizer.pad_token_id,
        "eos_token_id": ft_tokenizer.eos_token_id,   # ← stop condition
        "num_return_sequences": 1
    }

    # Generate
    base_output = base_pipeline(prompt, **generation_kwargs)
    ft_output = ft_pipeline(prompt, **generation_kwargs)

    print("----- ベースモデルの出力 -----")
    print(base_output[0]['generated_text'])
    print("\n----- ファインチューニング済みモデルの出力 -----")
    print(ft_output[0]['generated_text'])

if __name__ == "__main__":
    main()
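Both scripts print generated_text as-is, which includes the prompt and, for the base model, whatever it keeps writing until max_length. If only the answer is needed, a small post-processing step helps. The helper below is a sketch under the assumption that the answer is the first non-empty line after the prompt; extract_answer and EOS_TEXT are hypothetical names, not part of the scripts above.

```python
EOS_TEXT = "<eos>"  # the end-of-sequence marker appended during fine-tuning


def extract_answer(generated_text: str, prompt: str) -> str:
    """Return only the first answer line that follows the prompt."""
    # The generated text starts with the prompt, followed by the continuation.
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    # Drop anything after the end-of-sequence marker, in case it was decoded as text.
    continuation = continuation.split(EOS_TEXT, 1)[0]
    # Keep the first non-empty line; a base model tends to keep writing new questions.
    for line in continuation.splitlines():
        if line.strip():
            return line.strip()
    return ""


prompt = "問題: 主人公の名前は何ですか？\n番組名: 魔法少女まどか☆マギカ\n答え:"
ft_text = prompt + " 鹿目まどか "
base_text = prompt + " まどマギ\n質問2: まどかはどうしてキュゥべえに狙われているのですか？"
print(extract_answer(ft_text, prompt))    # 鹿目まどか
print(extract_answer(base_text, prompt))  # まどマギ
```

The two sample strings are shortened from the outputs shown in the results section.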

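Results

3B results

The raw logs are hard to read, so here is a summary first. The 0.5B model was run with the same prompts. In the output, the header "ファインチューニング済みモデルの出力" means "fine-tuned model output" and "ベースモデルの出力" means "base model output".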

prompt = "問題: 主人公の名前は何ですか？\n番組名: 魔法少女まどか☆マギカ\n答え:"

prompt = "問題: 幻影旅団の団長の名前は？\n番組名: HUNTER×HUNTER\n答え:"

prompt = "問題: さやかが魔法少女になった理由は何ですか？\n番組名: 魔法少女まどか☆マギカ\n答え:"

----- ファインチューニング済みモデルの出力 -----
問題: 主人公の名前は何ですか？
番組名: 魔法少女まどか☆マギカ
答え: 鹿目まどか 
----- ファインチューニング済みモデルの出力 -----
問題: 幻影旅団の団長の名前は？
番組名: HUNTER×HUNTER
答え: クロロ＝ルシルフル 
----- ファインチューニング済みモデルの出力 -----
問題: さやかが魔法少女になった理由は何ですか？
番組名: 魔法少女まどか☆マギカ
答え: 上条恭介の手を治すため

Both models got these right, which made me happy!

The detailed run logs follow.

0.5B

(.venv) syun@syunnoMacBook-Pro Sarashina2 % python main_ft_05b.py      
Device set to use mps
Device set to use mps
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
----- ベースモデルの出力 -----
問題: 主人公の名前は何ですか？
番組名: 魔法少女まどか☆マギカ
答え: まどマギ
質問2: まどかはどうしてキュゥべえに狙われているのですか？
答え: ほむらから「幸せの囁き」を聞いて、それが魔法少女になったことが原因です。
質問3: まどかの家族はいますか？
答え: いいえ、まどかには家族はありません。
質問4: まどかの友達は誰ですか？
答え: まどかには友達はいません

----- ファインチューニング済みモデルの出力 -----
問題: 主人公の名前は何ですか？
番組名: 魔法少女まどか☆マギカ
答え: 鹿目まどか 
(.venv) syun@syunnoMacBook-Pro Sarashina2 % python main_ft_05b.py
Device set to use mps
Device set to use mps
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
----- ベースモデルの出力 -----
問題: 幻影旅団の団長の名前は？
番組名: HUNTER×HUNTER
答え: クラピカ
幻影旅団団長の名前はクラピカ。
ハンターハンター 第252話 「幻影旅団（2） 」
クラピカ ハンターハンター 幻影旅団 HUNTER×HUNTER
ハンターハンター 第220話 「幻影旅団（1） 」
ハンターハンター 第236話 「

----- ファインチューニング済みモデルの出力 -----
問題: 幻影旅団の団長の名前は？
番組名: HUNTER×HUNTER
答え: クロロ＝ルシルフル 
(.venv) syun@syunnoMacBook-Pro Sarashina2 % python main_ft_05b.py
Device set to use mps
Device set to use mps
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
----- ベースモデルの出力 -----
問題: さやかが魔法少女になった理由は何ですか？
番組名: 魔法少女まどか☆マギカ
答え: まどかは、自分を助けてくれたキュゥべえに恋心を抱いていた。

## 問6

次の文章は、2011年に放送されたテレビアニメ「魔法少女リリカルなのはStrikerS」の登場人物・フェイト・テスタロッサの台詞です。

「それが、私の願い。」

この台詞について、最も適切なものを次の選択肢から選び

----- ファインチューニング済みモデルの出力 -----
問題: さやかが魔法少女になった理由は何ですか？
番組名: 魔法少女まどか☆マギカ
答え: 上条恭介の手を治すため

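Impressions

Even the 0.5B model gave reasonably good answers after training, but I have not plotted any curves or run a proper evaluation, so this article is again written in the spirit of playing around. I don't think there is much technical depth here. GPT also pointed out that the dataset is small and questioned whether the model is really learning, but for now my impression is that it learns well even from a relatively small dataset (about 90 questions).

Going forward, I'd like to try increasing the amount of data and applying this to other genres.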
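Since no quantitative evaluation was done here, one simple next step would be an exact-match score on the held-out eval split that the training scripts already create. The following is only a sketch of such a metric with hypothetical function names. The toy lists are for illustration and are not measured results.

```python
import unicodedata


def normalize(text: str) -> str:
    """NFKC-normalize and strip whitespace so width variants compare equal."""
    return unicodedata.normalize("NFKC", text).strip()


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Toy data for illustration only; these are not measured results.
preds = ["鹿目まどか ", "クラピカ", "クロロ=ルシルフル"]
refs = ["鹿目まどか", "クロロ＝ルシルフル", "クロロ＝ルシルフル"]
score = exact_match_accuracy(preds, refs)
print(score)
```

Exact match is strict for free-form answers such as "上条恭介の手を治すため", so a looser comparison or manual review may be more appropriate for reason-type questions. Scoring only questions from the eval split would also help separate memorization of training questions from actual generalization.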

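Closing

Thank you for reading this far!

This was purely a hobby experiment, but thanks to Sarashina2.2, which SB Intuitions has released publicly, I was able to get good results in a short time.

If I find the time, I'd like to try other genres and other datasets.

If you have questions or advice, please leave a comment!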
