〖Transformers × Embeddings × Decoding〗I fed an entire notebook to an LLM and had it write the whole Qiita article

Last updated at 2026-01-20 / Posted at 2026-01-20

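Note: this article was produced by feeding an LLM (ChatGPT) an experiment notebook I wrote to deepen my understanding of Transformers, embeddings, and decoding. The LLM picked up all of the notebook's code, execution results (outputs), and in-cell comments (observations), and restructured them into a Qiita article. I made no corrections other than fixing obvious typos.

The focus is on recording what the notebook tried, what results came out, and what I concluded from them, so that I can still follow it when I reread it later.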

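Intended readers

You have started using Transformers (Hugging Face Transformers), but what the tokenizer does, what an embedding is, and what decoding is have not yet connected in your head
You have seen the terms Top-k / Top-p / Temperature, but want to verify by hand what they actually control
You are the "run an experiment to be convinced" type (like me)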

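What this article covers (chapter outline)

Background on Transformers / embeddings / decoding (supplement from outside the notebook)
Walk through every experiment in the notebook, in cell order
- Tokenizer behavior (English vs Japanese)
- Extracting and visualizing embedding vectors
- Checking the behavior of Top-k / Temperature / Greedy / Top-p
- Checking strengths and weaknesses by task (knowledge vs creative)
Finally, a summary and references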

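Background: Transformers / embeddings / decoding (supplement from outside the notebook)

Transformers (roughly)

The Transformer is a neural network architecture built around attention (self-attention), and it is the foundation of LLMs (large language models).
Hugging Face Transformers is a library that covers the whole path for Transformer models: loading, tokenizing, inference, and generation.

Official docs (Text generation / generation strategies)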
https://huggingface.co/docs/transformers/en/main_classes/text_generation
https://huggingface.co/docs/transformers/en/generation_strategies

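Embedding

When tokens (token ids) are fed to the model, they are first converted to embedding vectors.
The embeddings are learned as a matrix of shape (V, d), vocabulary size V by dimension d, and a token id works as the key that selects one row.

In this notebook, the embedding matrix of Mistral-7B shows up as

vocabulary size V = 32000
embedding dimension d = 4096

(this is confirmed later in the article).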
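
To make the "token id selects a row" picture concrete, here is a minimal sketch using only the standard library. The 5 × 3 table and its values are made up for illustration (they are not Mistral's weights), and `embed` is a hypothetical helper name.

```python
# Toy embedding table: V = 5 tokens, d = 3 dimensions (values are made up).
embedding_matrix = [
    [0.1, 0.2, 0.3],    # row for token id 0
    [0.0, -0.5, 0.9],   # row for token id 1
    [0.7, 0.7, -0.1],   # row for token id 2
    [-0.3, 0.4, 0.0],   # row for token id 3
    [0.5, 0.5, 0.5],    # row for token id 4
]


def embed(token_ids):
    """Look up one row per token id; the result has shape (len(token_ids), d)."""
    return [embedding_matrix[i] for i in token_ids]


vectors = embed([1, 4, 2])
print(len(vectors), len(vectors[0]))  # number of tokens, embedding dimension
```

In the real model the same lookup happens on a (32000, 4096) tensor, which is what `emb[token_ids]` does in the notebook code later on.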

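Decoding (how the next token is chosen during generation)

For the next token, the model outputs a probability distribution over the entire vocabulary.
Decoding (the generation strategy) is how a single next token is chosen from that distribution.

Hugging Face also has a summary of decoding strategies.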
https://huggingface.co/docs/transformers/en/generation_strategies

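Representative decoding methods (overview / pros / cons)

These are organized up front to make the notebook easier to follow.
(Note: the notebook mainly covers the methods other than Beam Search. Beam Search is only outlined.)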

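Greedy Decoding (always pick the highest probability: deterministic)

Overview: at every step, pick the argmax (the most probable token)
Pros
- Deterministic and highly reproducible (same input → same output)
- Fast
- Tends to be stable for short answers to knowledge questions
Cons
- Text tends to become monotonous and repetitive (so-called text degeneration)
- Creativity and diversity are hard to get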
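
As a minimal sketch of the argmax loop described above, the following uses a hypothetical hand-written next-token table (`NEXT_TOKEN_PROBS`, all values made up) in place of a real model.

```python
# Hypothetical toy "model": the next-token distribution depends only on the last token.
NEXT_TOKEN_PROBS = {
    "<s>": {"The": 0.6, "A": 0.4},
    "The": {"capital": 0.7, "city": 0.3},
    "capital": {"is": 0.9, "was": 0.1},
    "is": {"Tokyo": 0.8, "Kyoto": 0.2},
    "Tokyo": {"</s>": 1.0},
}


def greedy_decode(start="<s>", max_steps=10):
    """At every step, append the single most probable next token (argmax)."""
    tokens = [start]
    for _ in range(max_steps):
        dist = NEXT_TOKEN_PROBS.get(tokens[-1])
        if dist is None:          # no known continuation in the toy table
            break
        next_token = max(dist, key=dist.get)
        tokens.append(next_token)
        if next_token == "</s>":  # stop at end-of-sequence
            break
    return tokens


print(greedy_decode())
```

Because nothing is sampled, calling `greedy_decode()` repeatedly returns the same sequence every time, which is the "deterministic" property listed under the pros.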

Reference: Hugging Face "How to generate text"
https://huggingface.co/blog/how-to-generate

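Top-k Sampling (restrict to the top k tokens, then sample)

Overview: keep only the k most probable tokens, renormalize, and randomly sample one of them
Pros
- Adds randomness, so outputs become more diverse
- Cuts off some of the odd low-probability tokens (noise)
Cons
- Choosing k is hard (too small behaves like Greedy, too large lets noise in)
- On knowledge questions, when it misses, it simply misses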
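
A minimal sketch of "keep the top k, renormalize, sample", using only the standard library. The distribution is made up, and `top_k_filter` / `sample` are hypothetical helper names, not Transformers APIs.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize so they sum to 1."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}


def sample(probs, rng):
    """Draw one token according to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up next-token distribution.
probs = {"Tokyo": 0.70, "Kyoto": 0.15, "Osaka": 0.10, "banana": 0.05}

filtered = top_k_filter(probs, k=2)   # only "Tokyo" and "Kyoto" survive
rng = random.Random(0)                # fixed seed so the draws are repeatable
draws = [sample(filtered, rng) for _ in range(5)]
print(filtered)
print(draws)
```

With `k=1` the filter leaves a single token, so sampling degenerates to Greedy; a large k keeps "banana" in play, which is the noise trade-off mentioned above.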

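Top-p (Nucleus Sampling: sample from the set that covers cumulative probability p)

Overview: add up probabilities from the most probable token down, take the smallest set (the nucleus) whose cumulative probability exceeds p, and sample from it
Pros
- The candidate set shrinks when the distribution is sharp and widens when it is flat
- Often easier to work with than Top-k because the number of candidates adapts to the situation
Cons
- Too large a p lets noise in; too small a p is almost deterministic
- Randomness reduces reproducibility
- Depending on the task, the output can drift off topic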

Top-p (Nucleus Sampling) was proposed in: Holtzman et al. (2019)
https://arxiv.org/abs/1904.09751

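Beam Search (keep several candidates while searching)

Overview: at each step keep several top candidates (the beams) and search for a sequence with high overall likelihood
Pros
- Often strong on tasks where the answer is fairly fixed, such as translation
- May find a better sequence than Greedy
Cons
- Can become monotonous and repetitive in open-ended generation (stories and the like)
- Computationally heavy (proportional to the beam width)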
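
The notebook does not run Beam Search, so the following is only a toy sketch of the idea under stated assumptions: a hypothetical next-token table `NEXT` (made-up values) and a hypothetical `beam_search` helper. It shows a case where keeping two beams finds a more likely sequence than keeping one.

```python
import math

# Hypothetical toy "model": next-token probabilities depend only on the last token.
NEXT = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
}


def beam_search(num_beams, steps):
    """Keep the num_beams best partial sequences by total log-probability."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens)
    for _ in range(steps):
        candidates = []
        for score, tokens in beams:
            dist = NEXT.get(tokens[-1])
            if dist is None:              # nothing to extend with: carry over
                candidates.append((score, tokens))
                continue
            for tok, p in dist.items():
                candidates.append((score + math.log(p), tokens + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:num_beams]    # prune to the beam width
    return beams


greedy_like = beam_search(num_beams=1, steps=2)[0][1]
best = beam_search(num_beams=2, steps=2)[0][1]
print(greedy_like)
print(best)
```

With one beam the search commits to "a" (0.6) and ends at a sequence of probability 0.30, while two beams keep "b" alive and reach "b z" at 0.36. The inner loop also runs once per beam, which is where the cost proportional to the beam width comes from.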

Reference: Transformers generation parameters (num_beams and others)
https://huggingface.co/docs/transformers/en/main_classes/text_generation

Reading the notebook (in cell order from here)

From here on, I follow the notebook's cells in order and summarize
what was tried, what came out, and what I concluded from it.

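0. Goals of the experiment (top of the notebook)

The top of the notebook lists what I wanted to do in order to understand decoding.

Check the tokenizer's behavior and get a grip on how token ids map to strings
Extract embedding vectors and visualize whether words close in meaning are also close as vectors
Actually run decoding (Top-k / Temperature / Greedy / Top-p) and check its behavior
Test whether the methods have strengths and weaknesses depending on the task

The rest of the article follows exactly this flow.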

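1. Environment setup (Colab cell)

The notebook has a cell labeled "run this cell only on Google Colab".
This is roughly a pip install transformers accelerate bitsandbytes kind of cell (environment-dependent), so for the article it is enough to note that everything was run on Colab.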

2. Loading the model and trying generation (Mistral-7B)

What was tried

Load mistralai/Mistral-7B-v0.1 and run generate() on an arbitrary prompt.

Model card:
https://huggingface.co/mistralai/Mistral-7B-v0.1

Code (the corresponding notebook cell)

# mistralai/Mistral-7B-v0.1
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Quantization settings (4-bit NF4 weights, float16 compute)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.float16)

# Load the tokenizer and the model
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config,
    dtype=torch.float16
)
# device_map="auto" already places the quantized weights on the GPU,
# so an explicit model.to('cuda') is not needed here

# Try next-token prediction with the default parameters
input = "The capital of Japan is"
encoded_input = tokenizer(input, return_tensors='pt').to(model.device)
output = model.generate(**encoded_input)

print(tokenizer.decode(output[0]))

Output (notebook execution result)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
<s> The capital of Japan is Tokyo. It is the largest city in the world. It is also the most populous metropolitan
'\n# openai-community/gpt2-xl\n# https://huggingface.co/openai-community/gpt2-xl\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\n# モデルのロード\nmodel_name = "gpt2-xl"\ntokenizer = AutoTokenizer.from_pretrained(model_name)\nmodel = AutoModelForCausalLM.from_pretrained(model_name).to(\'cuda\')\n\n# 試しに適当なパラメータでNext Token Predictionしてみる\ninput = "The capital of Japan is"\nencoded_input = tokenizer(input, return_tensors=\'pt\').to(model.device)\noutput = model.generate(**encoded_input)\n\nprint(tokenizer.decode(output[0]))\n'

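Observations (supplement)

Generation itself works normally and produces The capital of Japan is Tokyo...
Setting pad_token_id... is the usual Transformers message saying that no pad token is set, so eos is used in its place for open-ended generation.
The gpt2-xl code string in the latter half of the output looks like a string literal left in the notebook cell being echoed as the cell's value (probably an experiment memo that was kept as a string).
→ For the article, the takeaway is just that memos like this can leak into the output, so cleaning up cell outputs at the end makes things easier to read.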

3. Checking tokenizer behavior (English vs Japanese)

3.1 Encoding and decoding an English sentence

Code

# Tokenizer encode/decode for an English sentence

input = "The capital of Japan is"

# Result of encoding and then decoding with the tokenizer
token_ids = tokenizer.encode(input)
print(token_ids)
print(tokenizer.decode(token_ids))

# Mapping between each token (id) and its text
for token_id in token_ids:
    print(f"{token_id}: {tokenizer.decode([token_id])}")

Output

[1, 415, 5565, 302, 4720, 349]
<s> The capital of Japan is
1: <s>
415: The
5565: capital
302: of
4720: Japan
349: is

Notebook comment (observation)

Encoding English text prepends 1: <s> (BOS: Beginning of Sentence), the token that marks the start of the text.

3.2 Encoding and decoding a Japanese sentence

Code

# Tokenizer encode/decode for a Japanese sentence

input = "日本の首都は"

token_ids = tokenizer.encode(input)
print(tokenizer.encode(input))
print(tokenizer.decode(token_ids))

for token_id in token_ids:
    print(f"{token_id}: {tokenizer.decode([token_id])}")

Output

[1, 28705, 29142, 29119, 28993, 29993, 29460, 29277]
<s> 日本の首都は
1: <s>
28705: 
29142: 日
29119: 本
28993: の
29993: 首
29460: 都
29277: は

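Notebook comment (observation)

Encoding Japanese text prepends 1: <s> (BOS) and also 28705, which is ▁, the marker SentencePiece uses for whitespace and word boundaries. Strictly speaking, ▁ is an ordinary vocabulary piece rather than a special token like <s>.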

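3.3 Why `▁` behaves differently for English and Japanese (including later observations)

The notebook has a section titled "Why the special token (▁) is added differently for English and Japanese" that runs an additional check.

Code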

word = "Hello world"

token_ids = tokenizer.encode(word, add_special_tokens=True)
print("With special tokens:", token_ids)
print("Decoded:", tokenizer.decode(token_ids))

token_ids = tokenizer.encode(word)
print("\nUsing .encode():", token_ids)
print("Decoded:", tokenizer.decode(token_ids))

token_ids = tokenizer.encode(word, add_special_tokens=False)
print("\nWithout special tokens:", token_ids)
print("Decoded:", tokenizer.decode(token_ids))

Output

With special tokens: [1, 22557, 1526]
Decoded: <s> Hello world

Using .encode(): [1, 22557, 1526]
Decoded: <s> Hello world

Without special tokens: [22557, 1526]
Decoded: Hello world

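Observations

At least for "Hello world", ▁ never appears as a separate token (this comes down to the tokenizer's internal representation and how decoded text is displayed).
SentencePiece builds subwords from raw text including whitespace, so word-initial positions and word boundaries tend to look different between English (words separated by spaces) and Japanese (basically no spaces). A likely reading of the outputs above: in English the marker is merged into the first piece (something like ▁Hello), while for the Japanese text there is no piece that merges ▁ with 日, so ▁ is left over as its own token 28705.
The SentencePiece paper itself is the best summary:
https://arxiv.org/abs/1808.06226
(The key point is that it does not assume pre-tokenized input, i.e. subwords can be learned from raw sentences.)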
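
To illustrate the whitespace handling described above, here is a deliberately simplified sketch of the pre-processing step only. It assumes the tokenizer prepends a dummy leading space before replacing spaces with ▁, which is consistent with the outputs above but is not confirmed by the notebook; `to_piece_input` is a hypothetical helper, not a SentencePiece API, and no actual subword splitting is performed.

```python
WS = "\u2581"  # "▁", the character SentencePiece uses to mark whitespace


def to_piece_input(text, add_dummy_prefix=True):
    """Simplified pre-processing: optional leading space, then spaces become "▁"."""
    if add_dummy_prefix:
        text = " " + text
    return text.replace(" ", WS)


print(to_piece_input("Hello world"))    # ▁Hello▁world
print(to_piece_input("日本の首都は"))     # ▁日本の首都は
```

In the English string every ▁ sits directly in front of a word, so it can be absorbed into pieces such as ▁Hello. In the Japanese string the single leading ▁ has to stand alone unless the vocabulary happens to contain a piece starting with ▁日, which would explain the extra 28705 token.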

4. Checking the vocabulary size and the embedding matrix

4.1 Vocabulary size (vocab size)

Code

# Check the tokenizer's vocabulary size
print(tokenizer.vocab_size)

Output

32000

4.2 Looking at the shape of the embedding matrix

Code

# Check the distributed representation (embedding vectors)

# Extract the embedding matrix
emb = model.get_input_embeddings().weight

print(emb.shape)
print(emb[0])
print(emb[0].shape)

Output

torch.Size([32000, 4096])
tensor([-4.0588e-03,  1.6499e-04, -4.6997e-03,  ..., -1.8597e-04,
        -9.9945e-04,  4.0531e-05], device='cuda:0', dtype=torch.float16,
       grad_fn=<SelectBackward0>)
torch.Size([4096])

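Notebook comment (observation)

The vocabulary size is 32000, and each token has a 4096-dimensional vector.
→ This matches the picture of the embedding as a (V, d) matrix exactly.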

4.3 Browsing the token_id → token table

Code

import pandas as pd

# Build a table of token_id → token
df = pd.DataFrame({
    "token_id": list(range(tokenizer.vocab_size)),
    "token": [tokenizer.decode([i]) for i in range(tokenizer.vocab_size)]
})

df

Output (notebook display, partial)

       token_id  token
0             0  <unk>
1             1  <s>
2             2  </s>
3             3  (control character)
4             4  (control character)
...         ...    ...
31995     31995      အ
31996     31996      執
31997     31997      벨
31998     31998      ゼ
31999     31999      梦

[32000 rows x 2 columns]

(Not every row is displayed, so in practice it is easier to inspect individual token ids of interest with decode([id]).)

5. Visualizing embedding vectors (cosine similarity / PCA)

5.1 Function to convert a word into an embedding vector

Code

# Function to convert a word into an embedding vector

import numpy as np

def get_word_embedding(word: str, device):
    # encode (without special tokens) -> token ids
    token_ids = tokenizer.encode(word, add_special_tokens=False)
    token_ids = torch.tensor(token_ids).to(device)

    # embedding matrix
    emb = model.get_input_embeddings().weight

    # average the subword embeddings (subword mean)
    word_emb = emb[token_ids].mean(dim=0)
    return word_emb

5.2 Words to visualize

Code

# List of words to visualize
words = [
    # royalty group
    "king", "queen", "prince", "princess",
    # sports group
    "baseball", "soccer", "tennis", "basketball",
    # people group
    "man", "boy", "woman", "lady",
    # animal group
    "dog", "cat", "cow", "fish",
    # loosely related group
    "table", "idea", "character", "sea"
]

5.3 Building the embedding matrix (20 words × 4096 dimensions)

Code

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import torch

# Compute the embedding of each word and stack them into a 2-D matrix
with torch.no_grad():
    emb_list = [get_word_embedding(w, model.device).detach().cpu().numpy() for w in words]
emb_matrix = np.stack(emb_list, axis=0)    # (num_words, hidden_dim)
print("emb_matrix shape:", emb_matrix.shape)

Output

emb_matrix shape: (20, 4096)

5.4 Cosine similarity heatmap

Code

# Heatmap of cosine similarity

# Compute the cosine similarity matrix between words
cos_sim_matrix = cosine_similarity(emb_matrix)  # shape: (num_words, num_words)

plt.figure(figsize=(8, 8))
im = plt.imshow(cos_sim_matrix, interpolation="nearest")

plt.colorbar(im, fraction=0.046, pad=0.04)
plt.xticks(ticks=range(len(words)), labels=words, rotation=90)
plt.yticks(ticks=range(len(words)), labels=words)
plt.title("Cosine similarity between word embeddings (Mistral-7B)")

plt.tight_layout()
plt.show()

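Visualization

Notebook comment (observation, with later additions)

Apart from the loosely related group (table, idea, character, sea), words tend to be close to the other words in their own group.
fish and cow are exceptionally far from the rest of the animal group; my guess is that, as a sea creature and a farm animal respectively, they are conceptually far from pets (cat, dog).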
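
For reference, the two operations behind the heatmap (averaging subword vectors into a word vector, then comparing word vectors by cosine similarity) can be sketched with the standard library alone. The 3-dimensional vectors are made up, and `mean_vector` / `cosine` are hypothetical stand-ins for `emb[token_ids].mean(dim=0)` and scikit-learn's `cosine_similarity`.

```python
import math


def mean_vector(vectors):
    """Average several subword vectors component-wise into one word vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional subword vectors.
word_a = mean_vector([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # word split into two subwords
word_b = mean_vector([[1.0, 1.0, 0.0]])                    # single-subword word
word_c = mean_vector([[0.0, 0.0, 1.0]])                    # single-subword word

print(cosine(word_a, word_b))  # same direction, so close to 1
print(cosine(word_a, word_c))  # orthogonal, so 0
```

Cosine similarity ignores vector length and only compares direction, which is why the averaged two-subword vector and the longer single-subword vector still come out as nearly identical.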

5.5 Scatter plot after compressing to 2 dimensions with PCA

Code

# Scatter plot after reducing to 2 dimensions with PCA

pca = PCA(n_components=2)
emb_2d = pca.fit_transform(emb_matrix)  # shape: (num_words, 2)

plt.figure(figsize=(8, 8))
plt.scatter(emb_2d[:, 0], emb_2d[:, 1])

for i, w in enumerate(words):
    plt.text(emb_2d[i, 0], emb_2d[i, 1], w)

plt.title("PCA projection of word embeddings (Mistral-7B)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.axhline(0, linewidth=0.5)
plt.axvline(0, linewidth=0.5)
plt.grid(True, linewidth=0.3)

plt.tight_layout()
plt.show()

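Visualization

Notebook comment (observation, with later additions)

Words in the same group are close together, but it feels off that the people group sits closer to the loosely related group than to the royalty group.
→ This is probably an effect of the information lost when dense, high-dimensional vectors are compressed to 2 dimensions (the first two principal components likely capture only a small share of the variance).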

6. Decoding (Top-k / Temperature / Greedy / Top-p)

This is the main part.
The notebook goes through Top-k → Temperature → Greedy → Top-p in that order, and finally checks the behavior per task.

6.1 Top-k Sampling: first, look at the next-token distribution

Code

# Try Top-k Sampling

input = "The capital of Japan is"
encoded_input = tokenizer(input, return_tensors='pt').to(model.device)   # return PyTorch tensors

# Convert the logits for next-token prediction into probabilities
with torch.no_grad():
    outputs = model(**encoded_input)
    last_logits = outputs.logits[0, -1]   # logits at the last position
    probs = torch.nn.functional.softmax(last_logits, dim=-1)

# Token ids of the 10 most likely tokens
top_vals, top_ids = probs.topk(10)
top_vals = top_vals.detach().cpu().numpy()
top_ids = top_ids.detach().cpu().tolist()
labels = [f"{t_id}: {tokenizer.decode(t_id)}" for t_id in top_ids]

# Plot the result
plt.figure(figsize=(12, 6))
plt.bar(range(len(top_vals)), top_vals)
plt.xticks(range(len(labels)), labels, rotation=90)
plt.ylabel('Probability')
plt.title('Top-10 token')
plt.tight_layout()
plt.show()

Visualization

The point here is that you get a feel for the model holding a probability distribution over which token is likely to come next.

6.2 Temperature: changing how peaked the probability distribution is

Code

# Check how the probabilities of the top candidates change with Temperature

temperatures = [0.4, 1.0, 1.6]
top_vals_list, labels_list = [], []

for T in temperatures:
    probs_at_T = (last_logits / T).softmax(dim=-1)

    top_vals, top_ids = probs_at_T.topk(10)

    # GPU → CPU
    top_vals = top_vals.detach().cpu()
    top_ids = top_ids.detach().cpu()

    top_vals_list.append(top_vals)

    labels = [f"{tid.item()}: {tokenizer.decode(tid)}" for tid in top_ids]
    labels_list.append(labels)

# Plot
fig, axes = plt.subplots(1, 3, figsize=(16, 6), sharey=True)

for ax, T, top_vals, labels in zip(axes, temperatures, top_vals_list, labels_list):
    ax.bar(range(len(top_vals)), top_vals)
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=90)
    ax.set_ylabel("Probability")
    ax.set_title(f"Temperature={T}")

plt.tight_layout()
plt.show()

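Visualization

Notebook comment (observation, with later additions)

Temperature is the parameter that adjusts how peaked the distribution over next-token candidates is. The notebook quotes the following formula, where z_w is the logit of token w and V is the vocabulary size:

\text{Softmax}(z_w, T) = \frac{e^{z_w / T}}{\sum_{j=1}^{V} e^{z_j / T}}

The closer T gets to 0, the more peaked the distribution; the larger T is, the flatter it becomes and the more randomness there is.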
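
The formula can be checked numerically with a few lines of standard-library Python. This is a minimal sketch on made-up logits; `softmax_with_temperature` is a hypothetical helper that mirrors `(last_logits / T).softmax(dim=-1)` from the notebook code.

```python
import math


def softmax_with_temperature(logits, T):
    """Softmax of logits / T, matching the formula above."""
    scaled = [z / T for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]                  # made-up logits for three tokens

for T in (0.4, 1.0, 1.6):                 # same temperatures as the notebook
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

The printed probability of the top token shrinks as T grows, which is the flattening seen in the three bar charts.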

Reference (the explanation cited in the notebook):
https://cohere.com/llmu/parameters-for-controlling-outputs

6.3 Greedy Decoding (the output changes with and without <s>)

Code (first with `add_special_tokens=True`)

text = "The capital of Japan is"

for i in range(5):
    encoded_input = tokenizer(text, return_tensors="pt", add_special_tokens=True).to(model.device)
    with torch.no_grad():
        output = model.generate(**encoded_input, max_new_tokens=20, do_sample=False)  # greedy
    print(f"--------------- {i+1}th iteration result ---------------")
    print(tokenizer.decode(output[0]))

Output

--------------- 1th iteration result ---------------
<s> The capital of Japan is Tokyo. It is the largest city in the world. It is also the most populous metropolitan
--------------- 2th iteration result ---------------
<s> The capital of Japan is Tokyo. It is the largest city in the world. It is also the most populous metropolitan
--------------- 3th iteration result ---------------
<s> The capital of Japan is Tokyo. It is the largest city in the world. It is also the most populous metropolitan
--------------- 4th iteration result ---------------
<s> The capital of Japan is Tokyo. It is the largest city in the world. It is also the most populous metropolitan
--------------- 5th iteration result ---------------
<s> The capital of Japan is Tokyo. It is a

Code (next with `add_special_tokens=False`)

text = "The capital of Japan is"

for i in range(5):
    encoded_input = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
    with torch.no_grad():
        output = model.generate(**encoded_input, max_new_tokens=20, do_sample=False)  # greedy
    print(f"--------------- {i+1}th iteration result ---------------")
    print(tokenizer.decode(output[0]))

Output

--------------- 1th iteration result ---------------
The capital of Japan is Tokyo.

## What is the capital of Japan?

Tokyo

## What
--------------- 2th iteration result ---------------
The capital of Japan is Tokyo.

## What is the capital of Japan?

Tokyo

## What
... (same from here on)

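Notebook heading: `Why the output changes depending on whether <s> (BOS) is present`

Roughly speaking, with <s> the model behaves as if it is generating from the start of a text, and without it as if it is continuing from the middle of one, so the generation habits change.
(In practice too, being careless about this with Instruct models or chat templates makes the behavior change abruptly.)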

6.4 Checking with other countries (reproducing the tendency)

Code (excerpt)

# Try a few different countries
texts = [
    "The capital of Japan is",
]

for t in texts:
    print("------------------------------ 1th text ------------------------------")
    for add_sp in [True, False]:
        print(f"--------------- add_special_tokens: {add_sp} ---------------")
        encoded_input = tokenizer(t, return_tensors="pt", add_special_tokens=add_sp).to(model.device)
        with torch.no_grad():
            output = model.generate(**encoded_input, max_new_tokens=60, do_sample=False)
        print(tokenizer.decode(output[0]))

Output (excerpt)

------------------------------ 1th text ------------------------------
--------------- add_special_tokens: True ---------------
<s> The capital of Japan is Tokyo. It is the largest city in the world. It is also the most populous metropolitan area in the world. Tokyo is a city of contrasts. ...
--------------- add_special_tokens: False ---------------
The capital of Japan is Tokyo.

## What is the capital of Japan?

Tokyo

## What is the capital of Japan and why?

Tokyo

## What is the capital of Japan and why

Notebook comment (observation)

"Tokyo" itself comes out reliably with either setting, but what follows changes a lot depending on whether <s> is present.

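7. Understanding Top-P Sampling (Nucleus Sampling) by implementing it yourself

This was personally the best part.
Writing just top_p=... as a Transformers argument hides what actually happens, so writing the top-p filter yourself speeds up understanding a lot.

Original Top-p paper: Holtzman et al. (2019)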
https://arxiv.org/abs/1904.09751

7.1 A hand-written implementation of Top-P Sampling

Code

# Hand-written implementation of Top-P Sampling

import torch

def sample_top_p(probs, p=0.9):
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

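    # Mask tokens outside the nucleus (the smallest top set whose cumulative probability exceeds p)
    sorted_indices_to_remove = cumulative_probs > p
    # Shift the mask right by one so the token that crosses p is still kept
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    # Always keep the most probable token
    sorted_indices_to_remove[..., 0] = False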

    sorted_probs[sorted_indices_to_remove] = 0
    sorted_probs = sorted_probs / sorted_probs.sum()

    next_token = torch.multinomial(sorted_probs, num_samples=1)
    next_token_id = sorted_indices[next_token]
    return next_token_id
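
To see which tokens end up in the nucleus for a given p, the same filtering logic can be traced on a tiny distribution. This is a standard-library re-sketch for illustration, not the notebook's torch function; `nucleus` and `sample_top_p_toy` are hypothetical names and the probabilities are made up.

```python
import random


def nucleus(probs, p=0.9):
    """Return [(token_index, renormalized_prob)] for the smallest top set whose mass exceeds p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative > p:        # the token that crosses p is still kept
            break
    total = sum(probs[i] for i in kept)
    return [(i, probs[i] / total) for i in kept]


def sample_top_p_toy(probs, p, rng):
    """Sample one token index from the renormalized nucleus."""
    kept = nucleus(probs, p)
    indices = [i for i, _ in kept]
    weights = [w for _, w in kept]
    return rng.choices(indices, weights=weights, k=1)[0]


probs = [0.05, 0.5, 0.15, 0.3]    # made-up distribution over 4 token ids

print(nucleus(probs, p=0.9))      # token ids 1, 3, 2 survive; id 0 is cut
print(nucleus(probs, p=0.2))      # only the top token survives
print(sample_top_p_toy(probs, 0.9, random.Random(0)))
```

With p=0.2 the nucleus collapses to the single most probable token, which matches the near-Greedy behavior observed in section 7.3 when top_p is small.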

7.2 A quick trial (generating with Top-p)

Code

# Quick trial

input = "The capital of Japan is"
encoded_input = tokenizer(input, return_tensors='pt').to(model.device)

with torch.no_grad():
    output = model.generate(
        **encoded_input,
        do_sample=True,
        top_p=0.9,
        temperature=1.4,
        max_new_tokens=20
    )

print(tokenizer.decode(output[0]))

Output

<s> The capital of Japan is Tokyo, and if you are going to visit Japan, this is one of the cities that you should

7.3 Small top_p and small temperature → tends to be deterministic

Code

# Small top_p and small temperature, several runs -> a fixed output pattern is expected

for i in range(5):
    encoded_input = tokenizer("The capital of Japan is", return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **encoded_input,
            do_sample=True,
            top_p=0.2,
            temperature=0.2,
            max_new_tokens=20
        )
    print(f"{i+1}th iter: ", tokenizer.decode(output[0]))

Output

1th iter:  <s> The capital of Japan is Tokyo. It is the largest city in the world. It is also the most populous metropolitan
2th iter:  <s> The capital of Japan is Tokyo. It is the largest city in the world. It is also the most populous metropolitan
... (same from here on)

Notebook comment (observation)

The output stays fixed across runs, behaving almost like Greedy.

7.4 Small top_p with a larger temperature → more randomness

Code

# Keep top_p small and raise the temperature

for i in range(5):
    encoded_input = tokenizer("The capital of Japan is", return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **encoded_input,
            do_sample=True,
            top_p=0.2,
            temperature=1.4,
            max_new_tokens=20
        )
    print(f"{i+1}th iter: ", tokenizer.decode(output[0]))

Output (excerpt)

1th iter:  <s> The capital of Japan is one of the most fascinating cities in the world. It is a place where you can find everything you
2th iter:  <s> The capital of Japan is Tokyo. It is the largest city in the world with a population of 37 million people.
3th iter:  <s> The capital of Japan is Tokyo. Tokyo is the most populous metropolis in the world. The city is a place
4th iter:  <s> The capital of Japan is Tokyo. It is one of the most populous cities in the world. Tokyo is the center of
5th iter:  <s> The capital of Japan is Tokyo. Tokyo is the most populous metropolis in the world. The city has many attra

Notebook comment (observation)

Randomness increases, and a slightly different sentence comes out each time.

7.5 Large top_p and large temperature → diverse but prone to breaking down

Code

# Raise both top_p and temperature

for i in range(5):
    encoded_input = tokenizer("The capital of Japan is", return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **encoded_input,
            do_sample=True,
            top_p=0.9,
            temperature=1.4,
            max_new_tokens=20
        )
    print(f"{i+1}th iter: ", tokenizer.decode(output[0]))

Output (excerpt)

1th iter:  <s> The capital of Japan is certainly diverse with overcrowded monotons streets. Around the important parts of Tokyo wait the
2th iter:  <s> The capital of Japan is our next destination on around Asia itinerary. We were planned to visit Japan during autumn and weather
3th iter:  <s> The capital of Japan is situated where Tokyo I addressed and subsequently turned into what is today Ka tower tun as well. This place
4th iter:  <s> The capital of Japan is channelling the enthusiasm and international collaboration that inspired the High-Level Forum on Covid-1
5th iter:  <s> The capital of Japan is known primarily for shining screensaver-sized billboards, rainstorm riddled street corners and stuffed

Notebook comment (observation)

On top of the randomness, the sentences become much more varied, but incoherent content also creeps in more easily.

8. Testing strengths and weaknesses by task (knowledge / creative)

8.1 Knowledge questions: high randomness makes misses more likely

Code

# Pairs of (top_p, temperature) parameters

pairs = [
    (0.2, 0.2),
    (0.9, 1.4),
]

questions = [
    "What is the capital of United States?",
    "When did World War II end?",
    "Who painted the Mona Lisa?",
]

for top_p, temp in pairs:
    print(f"--------------- top_p: {top_p}, temperature: {temp} ---------------")
    for q in questions:
        encoded_input = tokenizer("<s> " + q + " ", return_tensors="pt").to(model.device)
        with torch.no_grad():
            output = model.generate(
                **encoded_input,
                do_sample=True,
                top_p=top_p,
                temperature=temp,
                max_new_tokens=40
            )
        print(tokenizer.decode(output[0]))

Output (excerpt)

--------------- top_p: 0.2, temperature: 0.2 ---------------
<s> What is the capital of United States? 

Washington, D.C.

Washington, D.C. is the capital of the United States of America. It is
<s> When did World War II end? 

The war in Europe ended on May 8, 1945, when Germany surrendered. The war in the Pacific ended
<s> Who painted the Mona Lisa? 

The Mona Lisa is a half-length portrait of a woman by the Italian artist Leonardo da Vinci, which has been acclaimed
--------------- top_p: 0.9, temperature: 1.4 ---------------
<s> What is the capital of United States? Every High School student knows that it is Washington D. C. So why this question here on Online Quizzes an India channel?. Because education should
<s> When did World War II end? Only folks with short memories believe that event was the sinking of the doomed container ship European Archer running off Block Island today when a couple of
<s> Who painted the Mona Lisa? http://changetypefict.weebly.com Embibe’s Top educators teaching Live Math Science classes since 200

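Notebook comment (observation, with later additions)

With Greedy Decoding (or settings close to it) the knowledge questions are answered correctly, whereas the highly random Top-P Sampling does not answer them directly.
→ For knowledge questions, near-deterministic settings (low temperature, low top_p) have the advantage.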

8.2 Story openers: Greedy is fixed, sampling is diverse

Code

# Input texts

texts = [
    "<s> Once upon a time,",
    "<s> Long, long ago,",
    "<s> In the beginning,",
]

print("--------------- Greedy Decoding (3 runs) ---------------")
for i in range(3):
    print(f"\n--- Run {i+1} ---")
    for t in texts:
        encoded_input = tokenizer(t, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output = model.generate(
                **encoded_input,
                do_sample=False,
                max_new_tokens=30
            )
        print(tokenizer.decode(output[0]))

print("\n--------------- Top-p Sampling (3 runs, top_p=0.9, temperature=1.4) ---------------")
for i in range(3):
    print(f"\n--- Run {i+1} ---")
    for t in texts:
        encoded_input = tokenizer(t, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output = model.generate(
                **encoded_input,
                do_sample=True,
                top_p=0.9,
                temperature=1.4,
                max_new_tokens=30
            )
        print(tokenizer.decode(output[0]))

Output (excerpt)

--------------- Greedy Decoding (3 runs) ---------------

--- Run 1 ---
<s> Once upon a time, there was a little girl who loved to read. She read everything she could get her hands on, and she loved to write stories of her own.
<s> Long, long ago, in a galaxy far, far away, I was a young man. I was a young man who was a fan of the original Star Wars tril
<s> In the beginning, there was the Word.

And the Word was with God.

And the Word was God.

And the Word became flesh and

--- Run 2 ---
(same as Run 1)

--- Run 3 ---
(same as Run 1)

--------------- Top-p Sampling (3 runs, top_p=0.9, temperature=1.4) ---------------

--- Run 1 ---
<s> Once upon a time, ghostly Ford Stratas fouaghd customers kultree raettiile. Faore vie fellows Tay toved Straight abiudi
<s> Long, long ago, his friend Jimmy gave Dylan a Sixth Generation/ (6G, or 'The plug radio') Airwave after learning some differences about Clan Albums
<s> In the beginning, people could hardly believe what they heard. The smartphone coming from Microsoft—of PC and word processor fame, it seemed—supplies presentation and story

--- Run 2 ---
<s> Once upon a time, asking if fire horses should be free on people's territory was received with visible anger on the forum while allowing them to graze free on unsc
...

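Notebook comment (observation, with later additions)

The highly random Top-P Sampling returns a different story-like continuation every time, whereas Greedy Decoding returns the same continuation every time.
→ If you want creativity, as in stories, sampling is the better fit, but broken text also appears easily, so tuning top_p / temperature matters.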

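Summary (notes to self)

The tokenizer does more than map strings to token ids: it also adds BOS <s> and the SentencePiece whitespace marker ▁, both of which affect the context the model sees.
The embedding matrix can be extracted with shape (vocab_size, hidden_dim), and even with subword averaging you can see a roughly meaning-like closeness.
Decoding is ultimately about how to choose from the next-token distribution:
- Temperature changes how peaked the distribution is
- Top-k / Top-p cut down the candidate set
- Greedy is deterministic and stable
The task dependence is clear:
- Knowledge tasks favor near-deterministic settings (low temperature, low top_p, and so on)
- Creative tasks need diversity, so sampling helps (but it also breaks down more easily)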

References

Hugging Face Transformers: Generation strategies
https://huggingface.co/docs/transformers/en/generation_strategies
Hugging Face Transformers: Text generation API
https://huggingface.co/docs/transformers/en/main_classes/text_generation
Hugging Face Blog: How to generate text (decoding overview)
https://huggingface.co/blog/how-to-generate
Top-p (Nucleus Sampling) original paper: Holtzman et al. (2019)
https://arxiv.org/abs/1904.09751
SentencePiece original paper: Kudo & Richardson (2018)
https://arxiv.org/abs/1808.06226
Mistral-7B-v0.1 model card
https://huggingface.co/mistralai/Mistral-7B-v0.1
