Paper Overview: Textbooks Are All You Need #AI

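Overview

Paper covered
With a model size of 1.3B parameters and a dataset of about 7B tokens, the model (phi-1) achieves 50.6% on HumanEval (pass@1) and 55.5% on MBPP (pass@1).
It is pretrained on synthetic data generated by GPT-3.5 together with web data, then fine-tuned on exercise data. Both are textbook-quality data.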
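
For reference, pass@1 is the fraction of problems solved by a single sampled completion. Below is a minimal sketch of the standard unbiased pass@k estimator that was introduced alongside HumanEval. The per-problem results are hypothetical, and these notes do not say how many samples per problem were drawn for the numbers above.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list, k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical results for three problems, 10 samples each (not from the paper).
results = [(10, 10), (10, 5), (10, 0)]
print(benchmark_pass_at_k(results, k=1))
```

With k=1 the estimator reduces to c/n per problem, so the benchmark score is simply the average success rate.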

Dataset contents

The Stack and StackOverflow (6B tokens)
Synthetic data generated by GPT-3.5 (1B tokens)
Python exercise data (180M tokens)

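Dataset preprocessing

The Stack and StackOverflow

A subset of the samples is annotated with GPT-4, which is asked whether each sample "has educational value for a student learning basic coding".
A random forest classifier is then trained on these annotations to predict the quality of the remaining samples.
1. The output embeddings of a pretrained CodeGen model are used as the classifier's features.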
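
The filtering step above can be sketched roughly as follows. This is only an illustration under stated assumptions: the embeddings are random stand-ins rather than real CodeGen outputs, and the embedding size, the labels, and the 0.5 cutoff are all hypothetical. It uses scikit-learn's `RandomForestClassifier`; the paper's actual hyperparameters are not given in these notes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
dim = 16  # toy embedding size; real code-model embeddings are much larger

# Stand-in for embeddings of the GPT-4-annotated subset (random clusters, NOT real data).
labeled_emb = np.vstack([
    rng.normal(loc=1.0, size=(200, dim)),   # annotated as having educational value
    rng.normal(loc=-1.0, size=(200, dim)),  # annotated as having little value
])
labels = np.array([1] * 200 + [0] * 200)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(labeled_emb, labels)

# Stand-in for embeddings of the remaining, unannotated samples.
unlabeled_emb = np.vstack([
    rng.normal(loc=1.0, size=(50, dim)),
    rng.normal(loc=-1.0, size=(50, dim)),
])

# Column 1 of predict_proba is the probability of class 1 (classes_ is [0, 1]).
p_high = clf.predict_proba(unlabeled_emb)[:, 1]
threshold = 0.5  # hypothetical cutoff
keep = p_high >= threshold
print(f"kept {keep.sum()} of {len(keep)} samples")
```

The point of this design is cost: the expensive model only labels a small subset, and the cheap classifier extends those labels to the rest of the corpus.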

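Synthetic data generated by GPT-3.5

Basically, the prompts appear to have been written by hand and fed to GPT-3.5 to produce the synthetic dataset.
The prompts apparently achieve diversity by placing constraints on the topics, which center on reasoning and algorithms.
The paper gives no details, so the exact procedure is unclear.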

Below is an example of the generated textbook text.

To begin, let us define singular and nonsingular matrices. A matrix is said to be singular if its
determinant is zero. On the other hand, a matrix is said to be nonsingular if its determinant is not
zero. Now, let's explore these concepts through examples.

Example 1: Consider the matrix A = np.array([[1, 2], [2, 4]]). We can check if this matrix is
singular or nonsingular using the determinant function. We can define a Python function,
`is_singular(A)`, which returns true if the determinant of A is zero, and false otherwise.

import numpy as np
def is_singular(A):
	det = np.linalg.det(A)
	if det == 0:
		return True
	else:
		return False
A = np.array([[1, 2], [2, 4]])
print(is_singular(A)) # True

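Python exercise data

Generated with GPT-3.5.
Each exercise is the docstring of a function that still needs to be completed, and the model predicts the continuation.
The data is decontaminated so that it does not contain evaluation datasets such as HumanEval.

Below is an example problem.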

def valid_guessing_letters(word: str, guesses: List[str]) -> List[str]:
	"""
	Returns a list of valid guessing letters, which are letters that have not been guessed yet and
	are present in the word.
	Parameters:
	word (str): The word to guess.
	guesses (List[str]): A list of letters that have already been guessed.
	Returns:
	List[str]: A list of valid guessing letters.
	"""
	valid_letters = []
	for letter in word:
		if letter not in guesses and letter not in valid_letters:
			valid_letters.append(letter)
	return valid_letters
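
As for the decontamination mentioned above, these notes do not describe how it was done. One common approach is to drop training samples that share a long token n-gram with a benchmark item. The following is a minimal sketch of that idea only, not the paper's procedure: whitespace tokenization, the n-gram length of 8, and the toy strings are all assumptions.

```python
def token_ngrams(text: str, n: int) -> set:
    """Set of whitespace-token n-grams in text (empty if text has fewer than n tokens)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(samples: list, benchmark: list, n: int = 8) -> list:
    """Keep only samples that share no n-gram with any benchmark item."""
    benchmark_grams = set()
    for item in benchmark:
        benchmark_grams |= token_ngrams(item, n)
    return [s for s in samples if not (token_ngrams(s, n) & benchmark_grams)]


# Toy data (hypothetical, not taken from HumanEval or the training set).
benchmark = ["def add(a, b): return the sum of the two numbers a and b"]
samples = [
    "def plus(x, y): return the sum of the two numbers a and b",
    "def is_even(n): return True when n is divisible by two",
]
clean = decontaminate(samples, benchmark)
print(clean)
```

Exact n-gram matching misses paraphrased problems, so a check like this is a lower bound on overlap rather than a guarantee of a clean dataset.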

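Model architecture

Nothing unusual here.
A decoder-only transformer that uses the FlashAttention implementation of multi-head attention (MHA).
The MHA and MLP layers are arranged in parallel, as in CodeGen.
The 1.3B model consists of 24 layers, a hidden dimension of 2048, an MLP inner dimension of 8192, and 32 attention heads of dimension 64 each.
It uses the same tokenizer as codegen-350M-mono.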
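
As a sanity check on the configuration above, the parameter count can be estimated from the listed dimensions. This is a rough back-of-the-envelope sketch: it ignores biases and layer norms, and `vocab_size` is a placeholder assumption of about 50k because the vocabulary size is not stated in these notes.

```python
n_layers = 24
d_model = 2048
d_mlp = 8192
n_heads = 32
d_head = 64
vocab_size = 50_000  # ASSUMPTION: placeholder, the real value depends on the tokenizer

# The heads together span the full hidden dimension.
assert n_heads * d_head == d_model

attn_params = 4 * d_model * d_model   # Q, K, V and output projection matrices
mlp_params = 2 * d_model * d_mlp      # up-projection and down-projection matrices
block_params = n_layers * (attn_params + mlp_params)
embed_params = vocab_size * d_model   # input token embedding only

print(f"transformer blocks: {block_params / 1e9:.2f}B")
print(f"token embedding:    {embed_params / 1e9:.2f}B")
print(f"total (approx.):    {(block_params + embed_params) / 1e9:.2f}B")
```

The blocks alone come to about 1.2B parameters, and the embedding adds roughly 0.1B, which is consistent with the stated 1.3B model size. An output layer that is not tied to the embedding would add about another 0.1B.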

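Training

The training setup is also standard.
The model is trained with a next-token prediction loss on sequences of length 2048 sliced from the dataset array.
Training uses fp16 with the AdamW optimizer, a linear-warmup-linear-decay learning rate schedule, and attention and residual dropout of 0.1.
Training runs on 8 Nvidia A100 GPUs using deepspeed.
Pretraining uses The Stack and StackOverflow data together with the synthetic data generated by GPT-3.5.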
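- Effective batch size 1024 (including data parallelism and gradient accumulation), maximum learning rate 1e-3, 750 warmup steps, weight decay 0.1, 36,000 steps in total. The checkpoint at 24,000 steps, which corresponds to about 8 epochs and a little over 50B training tokens, is used as the base model.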
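Fine-tuning uses the Python exercise data, with essentially the same settings as pretraining.
- Effective batch size 256, 50 warmup steps, maximum learning rate 1e-4, weight decay 0.01, 6,000 steps in total, with a checkpoint saved every 1,000 steps.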
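
The schedule and token budget implied by these settings can be sketched as below. The assumptions here are mine: warmup starts at 0, the decay ends at 0 on the final step, and every sequence in a batch is a full 2048 tokens. The notes do not specify the schedule's end points.

```python
def lr_at(step: int, max_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to max_lr, then linear decay to 0 at total_steps (assumed end points)."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return max_lr * remaining / (total_steps - warmup_steps)


# Pretraining settings: max LR 1e-3, 750 warmup steps, 36,000 total steps.
for step in (0, 375, 750, 18_375, 36_000):
    print(step, lr_at(step, max_lr=1e-3, warmup_steps=750, total_steps=36_000))

seq_len = 2048
pretrain_tokens_per_step = 1024 * seq_len  # effective batch size x sequence length
finetune_tokens_per_step = 256 * seq_len

print(f"pretraining, 36,000 steps: {pretrain_tokens_per_step * 36_000 / 1e9:.1f}B tokens")
print(f"steps needed for 50B tokens: {50e9 / pretrain_tokens_per_step:.0f}")
print(f"fine-tuning, 6,000 steps: {finetune_tokens_per_step * 6_000 / 1e9:.1f}B tokens")
```

Under these assumptions a full 36,000-step run would process about 75.5B tokens, while 50B tokens corresponds to roughly 24,000 steps. Fine-tuning processes about 3.1B tokens, which is many passes over the 180M-token exercise data.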