Engineering with a Small GPT: The Game of Building My Very Own GPT

Last updated at 2024-08-31. Posted at 2024-08-31.

Recap of the previous installment.

Title: "My Very Own GPT"

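Night deepened over Tokyo, and the city was wrapped in silence. The noise of the crowds faded into the distance, and in a room of an office building a lone programmer, Kenichi Tanaka, sat in front of his computer. Before him stretched a screen of seemingly endless code. With every keystroke he grew more absorbed.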

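Kenichi was working on his latest project: developing a small version of the language model known as GPT (Generative Pre-trained Transformer). His goal was to make a general-purpose GPT model lighter and to customize it for his personal needs. Building a GPT of his own was, for him, both the ultimate challenge and a pleasure.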

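"This model is a tool for giving shape to my ideas," Kenichi murmured. His thoughts moved back and forth between equations and algorithms. A GPT model learns from an enormous amount of text data and can generate natural, human-like text. What Kenichi was trying to build, however, was a scaled-down version: a more efficient model optimized for a specific purpose.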

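Each time his hand moved the mouse, training text was fed into the model. The text Kenichi had written was a story about the Tokyo he loved and about how far his understanding of AI had deepened. It was his way of conveying to the model the story that lived at the heart of the code he was working on.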

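As training progressed, Kenichi watched over how his model grew. The model learned from the text he provided and gradually began to generate sentences the way he intended. Seeing the results, he felt that his very own GPT was truly taking shape.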

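"I can't wait to see what stories this little GPT of mine will spin," Kenichi said with a smile. The model he had created was not just a program but a work infused with his thoughts and passion. The text generated at his fingertips was natural, as if he had thought of it himself, and carried deep meaning.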

As the night wore on, Kenichi sipped his coffee and planned new challenges for the GPT he had built. He pondered his next step, wondering what inspiration this model might bring in the future and what possibilities it held.

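Before morning came and the streets of Tokyo grew lively again, Kenichi sat smiling in front of his computer, his heart swelling at the many new stories his very own GPT would spin. His challenge had only just begun, and he was thrilled by the unknown possibilities that lay ahead.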

Execution results.

Generated text: epoch 75

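Kenichi's next challenge was to explore the possibilities of AI even further. One day, Kenichi was tasked with designing a new AI model called MOE (Mixture of Experts). One night, he was working on an AI model that he thought would work well at night. After finishing his work, the next task was decided. He was to take on a challenge presented by four AI experts. Each of the four experts was given a challenge that would determine how they would perform in the next contest. Kenji decided based on how he felt about the current challenge. Until now, everything had been decided by a single vote. For the winner to become the new Kenjoba, all the votes had to be cast. This was the most meaningful challenge of the day for Kenjoba. Now that it was over, it was time to face the opponent. The opponent was Kenjago's opponent and was waiting for him to make the final decision.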

Generated text: epoch 150

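Kenichi's next challenge was to explore the possibilities of AI even further. To achieve this, he realized that by multiplying four-dimensional tensors together, each dimension is linked and a new perspective is formed. This perspective is called a "perspective." Kenichi realized that this simple computational challenge would lead to a deep understanding of AI. After this challenge, Ken was convinced that he could gain advanced AI knowledge. His challenge may seem like nothing more than the manipulation of numbers, but his mind was still convinced. Feeling infinite possibilities beyond imagination, he realized how this infinite understanding could support his advanced knowledge and understanding. Kenji realized that the next day was the day he would walk through the streets of Tokyo at night. Until now, there had been no journey he had dreamed of. Now, that day had come for him. His dream had come true. Next, he faced the difficult challenges that awaited him every day. But he knew that the challenge had to overcome every hurdle.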

Generated text: epoch 250

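Kenichi's next challenge was to explore the possibilities of AI even further. Simulation is the core of language models and can represent natural language. Furthermore, multiple AI models exist within one AI, each with its own specialized knowledge. To achieve this, Kenichi realized that multiplying four-dimensional tensors together forms a three-dimensional tensor. This dramatically improves the AI's understanding. After the tensor calculation, Kenichi realized how simple tensors are, but the number of four dimensions was still unknown territory for him. He realized that this was the true form of his infinite possibilities. Feeling the depth of the infinite possibilities that lie beyond tensors, he began writing new code for his next goal. Now, tensil was an infinite challenge. Millions of tensors ...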

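Generated text: epoch 350

Kenichi's next challenge was to explore the possibilities of AI even further. For a long time, AI had been developing advanced algorithms and data processing day after day, working to process information from computational models. He was fascinated by the depth of knowledge and understanding of deep AI models. One day, as part of a project, he was tasked with designing a new computational algorithm using four-dimensional tensors. Until now, he had worked on two-dimensional and three-dimensional tensor calculations, but four-dimensional calculations were unknown territory for him. He realized that four-dimensional calculations support the advanced knowledge, understanding, and manipulation of AI, and he was fascinated. The thoughts were his own, and his mind was filled with tension and the infinite possibilities that lie beyond tension. After finishing his work, Kenichi walked through the streets of Tokyo at night and once again felt the importance of his ideas. Even if tens of millions of people lived in his world, he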

The code below was run on a Google Colab GPU.

# Install the required libraries
!pip install datasets transformers

from datasets import load_dataset
from transformers import Trainer, TrainingArguments, AutoTokenizer, AutoModelForCausalLM
from google.colab import files
import torch

# Prepare the training text
training_text = """
Kenichi Tanaka, a programmer living in downtown Tokyo, was an engineer working at a startup company developing AI for language models. His work was surrounded by complex algorithms and data processing every day, and he was deeply exploring how language models build knowledge and generate natural language. One day, Kenichi was tasked with designing a new computational model using four-dimensional tensors as part of a project. Until now, he had worked with two-dimensional and three-dimensional tensors, but four-dimensional tensors were an unknown territory for him. He was fascinated by the depth of tensors and decided to take on the challenge.

Kenichi thought about the computational process behind the language model. Words are embedded into vectors, and these vectors are linked together to form matrices. Furthermore, multiple matrices are linked together to form a three-dimensional tensor. Then, by multiplying these three-dimensional tensors together, a context tensor is obtained, and a meaningful matrix can be extracted from it. By converting this matrix into text, the response of the language model can be obtained. Kenichi realized that this simple tensor multiplication was the true form of the language model.

Furthermore, Kenichi was excited by the idea of a new AI model called MOE (Mixture of Experts). In MOE, multiple experts exist within one AI, each with their own specialized knowledge. To achieve this, he realized that by multiplying four-dimensional tensors together, each expert can process information from a different perspective. This dramatically improves the AI's understanding.

Kenichi finished writing the code to perform calculations using three-dimensional tensors and implemented a program to draw the results on a canvas. He built a language model expressed in tensor calculations and generated text. He reaffirmed that tensor calculations are the core of language models, and was convinced that this would pave the way to achieving deep AI understanding. After finishing his work, Kenichi walked through the streets of Tokyo at night and felt the importance of his work once again. Even though tensor calculations may seem like nothing more than the manipulation of numbers, he realized that they support the advanced knowledge and understanding of AI. Feeling the infinite possibilities that lie beyond tensors in his mind, he began writing new code for his next challenge.
"""

# Save the training text to a file
with open("training_text.txt", "w", encoding="utf-8") as f:
    f.write(training_text)

# Prepare the model and tokenizer
model_name = "gpt2"  # GPT-2 is used here as an example
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Add a padding token to the GPT-2 tokenizer (it has none by default)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model = AutoModelForCausalLM.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))

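# Create the dataset (the "text" loader yields one example per line)
dataset = load_dataset("text", data_files={"train": "training_text.txt"})
# Drop blank lines; otherwise they become examples made only of padding
dataset = dataset.filter(lambda example: example["text"].strip() != "")

def tokenize_function(examples):
    # Tokenize to produce input_ids and attention_mask
    tokenized_output = tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)
    # Use input_ids as labels, with padding positions set to -100 so the loss ignores them
    tokenized_output["labels"] = [
        [token if mask == 1 else -100 for token, mask in zip(ids, attention)]
        for ids, attention in zip(tokenized_output["input_ids"], tokenized_output["attention_mask"])
    ]
    return tokenized_output

# Tokenize the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# Return the data as PyTorch tensors
tokenized_datasets.set_format("torch", columns=["input_ids", "attention_mask", "labels"])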

# Set the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="steps",  # called "evaluation_strategy" in older transformers releases
    eval_steps=500,
    logging_steps=100,
    save_steps=500,
    num_train_epochs=75,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
)

# Create the Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],  # select the "train" split
    eval_dataset=tokenized_datasets["train"],   # evaluated on the training data (no held-out split here)
)

# Run training
trainer.train()

# Save the trained model
model.save_pretrained("./trained_model")
tokenizer.save_pretrained("./trained_model")

# Text generation
prompt_text = "Kenichi's next challenge was to explore the possibilities of AI even further."

# Pick the device and move the model to it
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()  # turn off dropout for generation

# Encode the prompt (input_ids and attention_mask) on the same device as the model
inputs = tokenizer(prompt_text, return_tensors="pt").to(device)

# Generate text with the model
output = model.generate(
    **inputs,
    max_length=200,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    do_sample=True,  # without this, top_k/top_p/temperature are ignored (greedy decoding)
    top_k=50,
    top_p=0.95,
    temperature=1.0,
    pad_token_id=tokenizer.pad_token_id,
)

# Decode and display the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Text: \n", generated_text)
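
A quick sanity check on the schedule settings in the script above, offered as a rough sketch rather than a description of what Trainer does internally. It assumes one training example per line of training_text.txt, a single GPU, and no gradient accumulation. The helper name `optimizer_steps` and the constants are hypothetical, introduced here only for illustration.

```python
import math


def optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Approximate optimizer steps on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# Values copied from the TrainingArguments in the script above.
BATCH_SIZE = 2
EPOCHS = 75
WARMUP_STEPS = 500
EVAL_AND_SAVE_STEPS = 500

# The training text has 4 paragraphs; counting the blank lines around them
# gives 8 lines. Both counts are shown because the result depends on whether
# blank lines are kept as examples (an assumption either way).
for num_examples in (4, 8):
    total = optimizer_steps(num_examples, BATCH_SIZE, EPOCHS)
    print(
        f"{num_examples} examples: {total} steps, "
        f"warmup finishes: {total >= WARMUP_STEPS}, "
        f"eval/save triggers: {total >= EVAL_AND_SAVE_STEPS}"
    )
```

Under these assumptions a 75-epoch run ends after roughly 150 to 300 steps, which is before the 500-step warmup completes and before the first evaluation or checkpoint at step 500. For a dataset this small, lowering warmup_steps, eval_steps, and save_steps, or training for more epochs as in the later results above, may be worth trying.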
