生成AIを用いてGPT-3の論文「Language Models are Few-Shot Learners (2020)」を読んでみた (続き)

Posted at 2024-09-24

はじめに

前回、生成AIを用いてGPT-3の論文「Language Models are Few-Shot Learners (2020)」の内容(本文)を(なるべく)把握してみました。
同論文の末尾にあるAppendixについても、前回と同様の方法で、把握してみます。

言語モデルの訓練に使用された計算量の計算方法では、注意機構の計算量が全体の計算量に対して10%未満しか使用しないため、注意機構の計算を無視し、逆伝播に必要な計算量は前向き伝播の3倍として計算する(大幅な簡略化)ことが分かりました。(その他、末尾の「分かったこと」章を参照)

以降で、ChatGPTに聞いてみた例を記載します。

前回の記事: (本文の内容)

対象の論文

論文: (GPT-3に関する論文)

[2005.14165] Language Models are Few-Shot Learners
https://arxiv.org/abs/2005.14165
(PDF: https://arxiv.org/pdf/2005.14165)

質問時の各章節の区切り部分

論文の中にある各章節を、下記のように区切って、部分毎に生成AIに内容を質問していきます。

A Details of Common Crawl Filtering
---
B Details of Model Training
---
C Details of Test Set Contamination Studies
---
D Total Compute Used to Train Language Models
---
E Human Quality Assessment of Synthetic News Articles
---
F Additional Samples from GPT-3
---
G Details of Task Phrasing and Specifications
---
H Results on All Tasks for All Model Sizes

生成AIへの質問方法

生成AIを活用して、知りたい記事・論文の1節分(適度な長さ)のテキストをコピー＆ペーストして、その下に質問内容を「①～ ②～ …」と番号付きで書いて、生成AIに渡せば、全質問に一発で的確に回答してくれるので、非常に良好でした。記事全体を読む必要なく、知りたい点の情報だけを収集できます。

生成AIへの質問例:

(論文・記事の各章節を貼り付け)

上記の内容に関して下記の質問に回答下さい: (である調で記載、質問に対して該当するものが無ければ無しと記載)
①何についての記載か? + 要旨は何? (要旨は箇条書きで記載、既存手法があれば引用元を記載)
②具体的な処理方法の記載があれば説明下さい。(簡略化せずに全て記載、既存手法があれば引用元を記載)
③改良点・工夫点・テクニック等の記載があれば説明下さい。
④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。
⑤性能が向上した記載があれば説明下さい。(具体値があれば記載、対応する図/表番号があれば各文末に記載)

続けて下記の質問に追加で回答下さい:
⑥比較の記載があれば違いを表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば明記)
⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば記載)
⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。(必ず正しく数値を抜き取る事、|数値|説明|の表へ)
⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。(|変数名|説明|の表へ)
⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。(掲載理由・注目ポイント等)

※回答が長くなりそうな場合は、適宜、分けて質問: ①②③④⑤、⑥⑦⑧⑨⑩
※その他、不明点があれば、適宜、追加で質問。

質問内容は、記事・論文を読んでいていつも知りたいと思う点(改良点・工夫点・テクニック・メカニズムの解明)にしています。また、表で比較した方が素早く把握できるので、可能であれば記事を表に変換するようにしています。

論文・記事を貼り付けるテキストの長さは、1節分程度の量にとどめた方が、良い回答が得られました。生成AIの回答の文量が多くなってくると、回答が長くなり過ぎないように、生成AIが勝手に(適度に)端折り始めてしまい、重要な点が回答から抜けてしまう可能性が高くなります。

事前知識

機械学習についてある程度知っていないと、生成AIに質問しても、その回答内容だけでは理解できないと思います。生成AIの回答は、論文の内容をあまり変えずに、要点をそのまま掻い摘んで回答するような形になっています。

GPT-3についての分かりやすい解説記事(下記)等を事前にチェックして、中核部分の内容をあらかじめ分かっていると、理解しやすいと思います。生成AIは実際の細かい処理方法自体を分かりやすく説明してはくれない傾向があります。

注意点

論文のテキスト内容だけを貼り付けて、生成AIに質問しています。論文の中の図・表の部分は貼り付けていません。図・表の内容は生成AIの回答には含まれず、別途論文を見る必要があります。

以降で、生成AIの回答内容が読みにくい・分かりづらい場合は、論文の本文でも同じように書かれてあり、論文の本文を読んでも同じように分かりづらいことが多くあります。論文では、既存研究等があるため、多くの説明を省略・一言だけサラッと書かれてある等、同種の研究に取り組む人でなければ、なかなか全容の理解に辿りつくのは難しい側面があります。この点は、生成AIの回答性能が悪いのではなく、論文という性質上、生じるものと考えています。

生成AIに質問

以降で、ChatGPTに実際に聞いてみた例を記載します。

生成AIへの質問＆回答の全容

生成AIへの質問＆回答の全容・詳細:

A Details of Common Crawl Filtering

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) A Details of Common Crawl Filtering As mentioned in Section 2.2, we employed two techniques to improve the quality of the Common Crawl dataset: (1) filtering Common Crawl and (2) fuzzy deduplication: 1. In order to improve the quality of Common Crawl, we developed an automatic filtering method to remove low quality documents. Using the original WebText as a proxy for high-quality documents, we trained a classifier to distinguish these from raw Common Crawl. We then used this classifier to re-sample Common Crawl by prioritizing documents which were predicted by the classifier to be higher quality. The classifier is trained using logistic regression classifier with features from Spark’s standard tokenizer and HashingTF 10. For the positive examples, we used a collection of curated datasets such as WebText, Wikiedia, and our web books corpus as the positive examples, and for the negative examples, we used unfiltered Common Crawl. We used this classifier to score Common Crawl documents. We kept each document in our dataset iff np.random.pareto(α) > 1 − document_score We chose α = 9 in order to take mostly documents the classifier scored highly, but still include some documents that were out of distribution. α was chosen to match the distribution of scores from our classifier on WebText. We found this re-weighting increased quality as measured by loss on a range of out-of-distribution generative text samples. 2. To further improve model quality and prevent overfitting (which becomes increasingly important as model capacity increases), we fuzzily deduplicated documents (i.e. removed documents with high overlap with other documents) within each dataset using Spark’s MinHashLSH implementation with 10 hashes, using the same features as were used for classification above. We also fuzzily removed WebText from Common Crawl. Overall this decreased dataset size by an average of 10%. After filtering for duplicates and quality, we also partially removed text occurring in benchmark datasets, described in Appendix C.

上記の内容に関して下記の質問に回答下さい: (である調で記載、質問に対して該当するものが無ければ無しと記載)
①何についての記載か? + 要旨は何? (要旨は箇条書きで記載、既存手法があれば引用元を記載)
②具体的な処理方法の記載があれば説明下さい。(簡略化せずに全て記載、既存手法があれば引用元を記載)
③改良点・工夫点・テクニック等の記載があれば説明下さい。
④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。
⑤性能が向上した記載があれば説明下さい。(具体値があれば記載、対応する図/表番号があれば各文末に記載)
⑥比較の記載があれば違いを表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば明記)
⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば記載)
⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。(必ず正しく数値を抜き取る事、|数値|説明|の表へ)
⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。(|変数名|説明|の表へ)
⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。(掲載理由・注目ポイント等)

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この部分は、Common Crawlデータセットのフィルタリングとデデュプリケーションに関する詳細な説明である。モデルの学習に適した高品質なデータを使用するために、どのような技術を使用してデータの品質を向上させたかが述べられている。

Common Crawlデータセットの品質向上のために、2つの技術（フィルタリングとファジーデデュプリケーション）を適用した。
フィルタリングでは、WebTextやWikipediaなどの高品質なデータセットを使用して、低品質なドキュメントを削除するための分類器を訓練し、それを用いてCommon Crawlから高品質なドキュメントを再サンプリングした。
ファジーデデュプリケーションでは、データセット内の重複したドキュメントやWebTextとの重複を削除し、データセットのサイズを削減した。
重複と品質フィルタリングの結果、データセットサイズが平均で10%削減された。

②具体的な処理方法の記載

Common Crawlのフィルタリングには、WebTextやWikipediaを高品質なデータセットの代表例とし、ロジスティック回帰分類器を使用して、これらのデータと未フィルタリングのCommon Crawlデータを区別するモデルを訓練した。この分類器はSparkのトークナイザーとHashingTFを使用して特徴量を抽出している。
分類器で予測されたスコアに基づいて、Common Crawlのドキュメントを再サンプリングし、document_scoreが高いものを優先的に保持する。サンプリングの際には、np.random.pareto(α) > 1 − document_scoreという条件式を用い、α = 9を設定している。αは分類器がWebTextで得たスコアの分布に合わせて選択された。
ファジーデデュプリケーションでは、SparkのMinHashLSH実装を使用し、10個のハッシュを使用して類似ドキュメントを削除した。これには分類の際に使用した同じ特徴量を用いている。WebTextとの重複も削除している。

③改良点・工夫点・テクニック等の記載

ロジスティック回帰分類器を使用して、Common Crawlから高品質なデータを選別するために、特定の高品質なデータセットを正例として訓練を行った。
np.random.pareto(α)という手法を用いることで、データ分布の偏りに対応し、異なる種類のドキュメントを一部残す工夫がなされている。
ファジーデデュプリケーションによって、モデルの過学習を防ぐために類似ドキュメントの削除を徹底している。

④メカニズムの解明・なぜそうなるのか等の記載

np.random.pareto(α) > 1 − document_scoreという式を用いることで、document_scoreが高いドキュメントを優先的に残すが、低スコアでもαの値に基づいて一部を残す。この再サンプリング手法は、データの多様性を保つためであり、分布外のデータも一定程度含むことが目的である。
ファジーデデュプリケーションは、重複したデータがモデルに与える過学習リスクを回避するための重要な技術であり、大規模データセットでは特に有効である。

⑤性能が向上した記載

フィルタリングによってデータセットの品質が向上し、異なる生成テキストサンプルにおける損失が改善された。具体的な数値や図/表番号の記載は無し。

⑥比較の記載

技術	効果	解説	表番号
フィルタリング	データ品質向上	ロジスティック回帰分類器による再サンプリング	無し
ファジーデデュプリケーション	データセットサイズ10%削減	重複ドキュメントの削除	無し

解説: フィルタリング技術とファジーデデュプリケーション技術の比較を表にまとめた。これにより、データセットの品質向上やサイズ削減がどのように行われたかがわかる。

⑦上記⑥以外で表に出来そうな部分

フィルタリング条件	使用した分類器	特徴抽出法
Common Crawlドキュメント再サンプリング	ロジスティック回帰	Sparkの標準トークナイザーとHashingTF

解説: フィルタリングに使用した手法や条件を表にまとめた。具体的な分類器と特徴抽出法が記載されている。

⑧具体的な数値の記載

数値	説明
9	αの値。再サンプリングにおけるdocument_scoreのしきい値に影響
10%	ファジーデデュプリケーションによるデータセットサイズの削減率

⑨具体的な変数名の記載

変数名	説明
np.random.pareto(α)	サンプリングに用いた条件式。αは9に設定された
document_score	分類器による各ドキュメントのスコア

⑩図/表があれば、各図/表は何を主張するためのものか

本節には図や表の具体的な番号や詳細な説明は記載されていないため、図表の解説は無し。

B Details of Model Training

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) B Details of Model Training To train all versions of GPT-3, we use Adam with β1 = 0.9, β2 = 0.95, and = 10−8, we clip the global norm of the gradient at 1.0, and we use cosine decay for learning rate down to 10% of its value, over 260 billion tokens (after 260 billion tokens, training continues at 10% of the original learning rate). There is a linear LR warmup over the first 375 million tokens. We also gradually increase the batch size linearly from a small value (32k tokens) to the full value over the first 4-12 billion tokens of training, depending on the model size. Data are sampled without replacement during training (until an epoch boundary is reached) to minimize overfitting. All models use weight decay of 0.1 to provide a small amount of regularization [LH17]. During training we always train on sequences of the full nctx = 2048 token context window, packing multiple documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency. Sequences with multiple documents are not masked in any special way but instead documents within a sequence are delimited with a special end of text token, giving the language model the information necessary to infer that context separated by the end of text token is unrelated. This allows for efficient training without need for any special sequence-specific masking.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この節は、GPT-3のトレーニングに使用された具体的な技術とプロセスについて記載されている。

Adamオプティマイザを使用し、特定のハイパーパラメータ設定を採用した。
学習率は余弦減衰（cosine decay）を使用して、2600億トークンまで減少し、以降は10%の初期値で学習を続けた。
バッチサイズは線形に増加し、モデルサイズに応じて4～120億トークンの間で最適化された。
過学習を防ぐため、エポックまでデータは置き換えなしでサンプリングされた。
トレーニングでは、すべてのシーケンスが最大コンテキスト長2048トークンを持ち、複数のドキュメントを1つのシーケンスにパッキングした。
重み減衰（weight decay）と特別なマスク処理なしにドキュメント間を区切る特殊トークンを使用した。

②具体的な処理方法の記載

Adamオプティマイザを使用し、β1 = 0.9、β2 = 0.95、ϵ = 10^(-8)のハイパーパラメータ設定でトレーニングを行った。
勾配のグローバルノルムは1.0でクリップされ、学習率は余弦減衰を使用して、2600億トークンまで元の学習率の10%まで減少させた。
トレーニングの最初の3.75億トークンで学習率のウォームアップを線形に実施した。
バッチサイズは、32,000トークンから線形に増加し、4～120億トークンの間で最適なサイズに到達するよう調整された。
過学習を防ぐために、データは置き換えなしでサンプリングされ、エポックの境界でリセットされた。
トレーニングには、すべてのシーケンスで最大2048トークンのコンテキストウィンドウを使用し、短いドキュメントは1つのシーケンスにパッキングされた。
シーケンス内のドキュメントは特別なマスク処理を行わず、代わりに特殊な終了トークンで区切った。これにより、モデルは終了トークンで分離されたコンテキストが無関係であることを推測できた。

③改良点・工夫点・テクニック等の記載

学習率の余弦減衰を採用し、トレーニング後期に低い学習率を維持することで、学習の安定性を確保している。
バッチサイズをモデルサイズに応じて段階的に増加させることで、大規模なモデルでも効率的にトレーニングを行った。
ドキュメントが2048トークン未満の場合でもシーケンスに複数のドキュメントをパッキングすることで、計算効率を向上させた。
ドキュメント間の区切りには特殊トークンを使用し、シーケンス内で特別なマスク処理を行わずに済むよう工夫している。

④メカニズムの解明・なぜそうなるのか等の記載

余弦減衰を使用することで、学習率がスムーズに減少し、過学習を抑制しながらもモデルがより長期間学習できる。
バッチサイズの段階的増加は、初期の学習で安定性を確保しつつ、大規模なトレーニングデータに対して効果的にモデルを適応させるためである。
複数のドキュメントを1つのシーケンスにパッキングすることで、短いドキュメントの計算効率が向上し、トレーニング全体の速度が改善される。

⑤性能が向上した記載

特定の具体的な性能向上の数値は記載されていないが、トレーニング効率と学習の安定性が向上していることが記載されている。図や表の番号は無し。

⑥比較の記載

技術	効果	解説	表番号
余弦減衰 (cosine decay)	学習の安定性向上	学習率がスムーズに減少する	無し
バッチサイズの線形増加	大規模データへの適応効率向上	モデルサイズに応じてバッチサイズを調整	無し

解説: 学習率の減衰手法とバッチサイズ調整の効果に関する比較を表にまとめた。トレーニングの安定性と効率向上が主な効果である。

⑦上記⑥以外で表に出来そうな部分

トレーニング設定	使用された技術	効果
Adamオプティマイザのパラメータ	β1 = 0.9, β2 = 0.95, ϵ = 10^(-8)	安定した勾配計算と学習の効率化
コンテキストウィンドウのサイズ	nctx = 2048	モデルが長いシーケンスで学習できるようにする

解説: トレーニングに使用されたパラメータと技術の詳細を表にまとめた。特にAdamオプティマイザの設定とコンテキストウィンドウのサイズに注目した。

⑧具体的な数値の記載

数値	説明
β1 = 0.9	Adamオプティマイザの第1のモーメント推定値の減衰率
β2 = 0.95	Adamオプティマイザの第2のモーメント推定値の減衰率
ϵ = 10^(-8)	Adamオプティマイザの数値安定性を確保するための小さな値
グローバルノルム = 1.0	勾配のクリップ値
2600億	トレーニングに使用されたトークン数
3.75億	学習率のウォームアップが行われるトークン数
32,000	トレーニングの初期バッチサイズ
4-120億	バッチサイズの最適化が行われるトークン数
nctx = 2048	コンテキストウィンドウのトークン数
0.1	重み減衰（weight decay）の値

⑨具体的な変数名の記載

変数名	説明
β1, β2	Adamオプティマイザのモーメント推定の減衰率
ϵ	Adamオプティマイザで数値安定性を保つための小さな値
グローバルノルム	勾配のグローバルノルムのクリップ値
nctx	最大コンテキスト長（トークン数）
learning rate	学習率の値。余弦減衰で減少し、最終的に初期値の10%に到達

⑩図/表があれば、各図/表は何を主張するためのものか

本節には図や表の具体的な番号や詳細な説明は記載されていないため、図表の解説は無し。

C Details of Test Set Contamination Studies

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) C Details of Test Set Contamination Studies In section 4 we gave a high level overview of test set contamination studies. In this section we provide details on methodology and results. Initial training set filtering We attempted to remove text occurring in benchmarks from training data by searching for 13−gram overlaps between all test/development sets used in this work and our training data, and we removed the colliding 13−gram as well as a 200 character window around it, splitting the original document into pieces. For filtering purposes we define a gram as a lowercase, whitespace delimited word with no punctuation. Pieces less than 200 characters long were discarded. Documents split into more than 10 pieces were considered contaminated and removed entirely. Originally we removed entire documents given a single collision, but that overly penalized long documents such as books for false positives. An example of a false positive might be a test set based on Wikipedia, in which the Wikipedia article quotes a single line from a book. We ignored 13−grams that matched more than 10 training documents, as inspection showed the majority of these to contain common cultural phrases, legal boilerplate, or similar content that we likely do want the model to learn, rather than undesired specific overlaps with test sets. Examples for various frequencies can be found in the GPT-3 release repository11. Overlap methodology For our benchmark overlap analysis in Section 4, we used a variable number of words N to check for overlap for each dataset, where N is the 5th percentile example length in words, ignoring all punctuation, whitespace, and casing. Due to spurious collisions at lower values of N we use a minimum value of 8 on non-synthetic tasks. For performance reasons, we set a maximum value of 13 for all tasks. Values for N and the amount of data marked as dirty are shown in Table C.1. Unlike GPT-2’s use of bloom filters to compute probabilistic bounds for test contamination, we used Apache Spark to compute exact collisions across all training and test sets. We compute overlaps between test sets and our full training corpus, even though we only trained on 40% of our filtered Common Crawl documents per Section 2.2. We define a ‘dirty’ example as one with any N -gram overlap with any training document, and a ‘clean’ example as one with no collision. Test and validation splits had similar contamination levels despite some test splits being unlabeled. Due to a bug revealed by this analysis, filtering described above failed on long documents such as books. Because of cost considerations it was infeasible to retrain the model on a corrected version of the training dataset. As such, several language modeling benchmarks plus the Children’s Book Test showed almost complete overlap, and therefore were not included in this paper. Overlaps are shown in Table C.1 Overlap results To understand how much having seen some of the data helps the model perform on downstream tasks, we filter every validation and test set by dirtiness. Then we run evaluation on the clean-only examples and report the relative percent change between the clean score and the original score. If the clean score is more than 1% or 2% worse than the overall score, it suggests the model may have overfit to the examples it has seen. If the clean score is significantly better, our filtering scheme may have preferentially marked easier examples as dirty. This overlap metric tends to show a high rate of false positives for datasets that contain background information (but not answers) drawn from the web (such as SQuAD, which draws from Wikipedia) or examples less than 8 words long, which we ignored in our filtering process (except for wordscrambling tasks). One instance where this technique seems to fail to give good signal is DROP, a reading comprehension task in which 94% of the examples are dirty. The information required to answer the question is in a passage provided to the model, so having seen the passage during training but not the questions and answers does not meaningfully constitute cheating. We confirmed that every matching training document contained only the source passage, and none of the questions and answers in the dataset. The more likely explanation for the decrease in performance is that the 6% of examples that remain after filtering come from a slightly different distribution than the dirty examples. Figure 4.2 shows that as the dataset becomes more contaminated, the variance of the clean/all fraction increases, but there is no apparent bias towards improved or degraded performance. This suggests that GPT-3 is relatively insensitive to contamination. See Section 4 for details on the datasets we flagged for further review.

Table C.1: Overlap statistics for all datasets sorted from dirtiest to cleanest. We consider a dataset example dirty if it
has a single N -gram collision with any document in our training corpus. “Relative Difference Clean vs All” shows the
percent change in performance between only the clean examples vs all the examples in the benchmark. “Count” shows
the number of examples. “Clean percentage” is the percent of examples that are clean vs total. For “Acc/F1/BLEU” we
use the metric specified in “Metric”. These scores come from evaluations with a different seed for the random examples
used for in-context learning, and will therefore differ slightly from the scores elsewhere in the paper.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

本章では、テストセットの汚染（contamination）に関する調査について詳細に述べている。要旨は以下の通りである:

トレーニングセットとテストセットのオーバーラップを防ぐためのデータフィルタリング手法を導入している。
特に、13-gram オーバーラップ検出と、その結果生じたドキュメントの処理について説明している。
テストセットにおける汚染データとクリーンデータの評価結果の比較を行っている。
フィルタリングの失敗事例や特定のデータセットでのオーバーラップがモデルに与える影響を分析している。

②具体的な処理方法の記載

本章では、以下の具体的な処理方法が記載されている:

13-gramオーバーラップ検出: 全てのテスト/開発セットとトレーニングデータの間で13-gramのオーバーラップを検索し、オーバーラップした部分とその前後200文字のウィンドウを削除する。このフィルタリングの際、13-gramは小文字化され、空白で区切られ、句読点は除外される。
ドキュメント分割: オーバーラップしたドキュメントは分割され、10個以上の部分に分割されたドキュメントは「汚染された」とみなして完全に除去される。
頻出13-gramの無視: 10以上のトレーニングドキュメントと一致した13-gramは、検査により文化的なフレーズや法的な定型文が多いため無視される。
クリーン/汚染データの定義: 任意のN-gramオーバーラップがある場合、データは「汚染されている」とみなし、オーバーラップがない場合は「クリーン」であるとする。
Apache Sparkによる計算: GPT-2で使用されたブルームフィルタに代わり、Apache Sparkを用いてテストとトレーニングセット間の正確なオーバーラップを計算している。

③改良点・工夫点・テクニック等の記載

オーバーフィルタリングの改善: 当初、1つのオーバーラップでドキュメント全体を削除していたが、これは長いドキュメントに対して誤検出が多いため、13-gramオーバーラップした部分だけを削除し、ドキュメントを分割する方法に改良された。
頻出フレーズの無視: 10以上のドキュメントと一致する13-gram（例えば、法律の定型文や一般的な文化的フレーズ）は無視する工夫が施されている。

④メカニズムの解明・なぜそうなるのか等の記載

クリーンデータ vs 汚染データの影響: クリーンデータのスコアが全体のスコアよりも1〜2%悪化する場合、モデルがトレーニングデータにオーバーフィットしている可能性があると考えられる。逆にクリーンデータでのスコアが良い場合、フィルタリングにより簡単な例が汚染データとしてマークされている可能性がある。
誤検出と誤差の原因: DROPのようなリーディングコンプリヘンションタスクでは、94%のデータが汚染されているが、これは回答に必要な情報がトレーニング時に提供されるためであり、実際には「カンニング」には該当しないとされる。

⑤性能が向上した記載

スコアの変化: クリーンデータのみで評価を行った結果、オーバーフィットが発生している場合、スコアは全体のスコアより1〜2%悪化する可能性があると記載されている。具体的なスコアの変化に関しては、Table C.1で「Relative Difference Clean vs All」として報告されている。

【対応図表: Table C.1】

▼質問(人間)
続けて下記の質問に追加で回答下さい:
⑥比較の記載があれば違いを表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば明記)
⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば記載)
⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。(必ず正しく数値を抜き取る事、|数値|説明|の表へ)
⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。(|変数名|説明|の表へ)
⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。(掲載理由・注目ポイント等)

▼生成AIの回答(ChatGPT-4o)
⑥比較の記載

本章では、テストセットにおける「クリーンなデータ」と「汚染されたデータ」のパフォーマンスの違いが議論されている。具体的には、オーバーラップがあるデータとないデータのスコアを比較し、その相対的なパフォーマンス変化をTable C.1で示している。

データセット	Relative Difference Clean vs All	クリーンなデータの割合	評価メトリック
Children’s Book Test	非公開 (オーバーラップが多すぎるため評価除外)	N/A	N/A
DROP	約6%のクリーンなデータ	汚染されたデータと大きな差はない	Reading Comprehension (F1)
他のデータセット	1-2%程度のパフォーマンス低下	クリーンなデータの割合に依存	Acc/F1/BLEU (各タスクごとに異なるメトリック)

【対応図表: Table C.1】

⑦上記⑥以外で表に出来そうな部分

以下は、N-gramのオーバーラップ処理とApache Sparkの使用など、フィルタリング手法に関する詳細を表にまとめる。

項目	説明
N-gramオーバーラップ	13-gramを使用し、オーバーラップした部分とその前後200文字を削除
分割基準	10以上に分割されたドキュメントは「汚染」として除去
頻出フレーズの扱い	10以上のドキュメントと一致した13-gramは無視
計算手法	Apache Sparkを使用して正確なオーバーラップを計算

⑧具体的な数値の記載

数値	説明
13-gram	テストとトレーニングセットのオーバーラップ検出に使用される単位
200文字	13-gramオーバーラップ検出後に削除される文字数の範囲
10	分割されるドキュメントの最大許容回数、それを超えると汚染とみなす
94%	DROPデータセットにおける汚染データの割合
1-2%	クリーンデータと汚染データ間のパフォーマンス変化の典型的な範囲

⑨具体的な変数名(数式用の記号)

変数名	説明
N	チェックするオーバーラップの語数 (最低8語、最大13語)
13-gram	テストセットとトレーニングセット間のオーバーラップを判定する単位
N/A	非公開のスコアやメトリックに対して使用される

⑩図/表の説明

Table C.1: 各データセットにおける汚染とクリーンなデータの割合、およびそれによるパフォーマンス差を示している。注目すべき点は、データセットの「Relative Difference Clean vs All」が1-2%の変動を示しており、オーバーフィットの可能性を確認できる部分である。

D Total Compute Used to Train Language Models

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) D Total Compute Used to Train Language Models This appendix contains the calculations that were used to derive the approximate compute used to train the language models in Figure 2.2. As a simplifying assumption, we ignore the attention operation, as it typically uses less than 10% of the total compute for the models we are analyzing. Calculations can be seen in Table D.1 and are explained within the table caption. Table D.1: Starting from the right hand side and moving left, we begin with the number of training tokens that each model was trained with. Next we note that since T5 uses an encoder-decoder model, only half of the parameters are active for each token during a forward or backwards pass. We then note that each token is involved in a single addition and a single multiply for each active parameter in the forward pass (ignoring attention). Then we add a multiplier of 3x to account for the backwards pass (as computing both ∂params ∂loss and ∂acts ∂loss use a similar amount of compute as the forwards pass. Combining the previous two numbers, we get the total flops per parameter per token. We multiply this value by the total training tokens and the total parameters to yield the number of total flops used during training. We report both flops and petaflop/s-day (each of which are 8.64e+19 flops).

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この章は、言語モデルの訓練に使用された計算量の計算方法に関するものである。具体的には、Figure 2.2に示されるモデル訓練の計算量を算出するための詳細な計算が説明されている。

要旨:

言語モデルの訓練に使用された計算量の算出方法
注意機構の計算量を無視し、主要な計算はトークンごとのパラメータに関連する乗算と加算に基づいている
逆伝播に必要な計算量は前向き伝播の3倍として計算されている
各モデルで使用されたトークン数やパラメータ数に基づいた総フロップ数の計算

②具体的な処理方法の記載

言語モデルの訓練に使用される計算量を算出するために、次の手順を実行している:

トークン数: 各モデルが訓練されたトークン数を取得
エンコーダ-デコーダの考慮: T5のようなエンコーダ-デコーダモデルでは、各トークンに対してアクティブなパラメータは半分だけである
演算の計算: 各トークンに対して、アクティブなパラメータごとに1回の加算と1回の乗算が行われる（注意機構は無視）
逆伝播の考慮: 逆伝播では前向き伝播と同じ計算量が必要であり、それを3倍の係数として計算に反映
総フロップ数の計算: 訓練されたトークン数とパラメータ数を掛け合わせ、総フロップ数を算出
最終的なフロップ数の単位: FLOPsとPetaflop/s-dayで報告される

【対応図表: Table D.1】

③改良点・工夫点・テクニック等の記載

エンコーダ-デコーダの処理: T5のようなモデルでは、各トークンに対して半分のパラメータしかアクティブではないという点を考慮して計算を簡略化している。
逆伝播の効率化: 逆伝播で前向き伝播の3倍の計算量を見積もることで、効率的な計算量算出を行っている。
注意機構の省略: 全体の計算量に対して10%未満しか使用しないため、注意機構の計算を無視している。

④メカニズムの解明・なぜそうなるのか

計算量の算出において、エンコーダ-デコーダモデルの特性や逆伝播の計算量を正確に見積もることで、訓練に必要なリソースを効率的に計算することができる。注意機構を無視することで、全体の計算量を大幅に簡略化できる。

⑤性能が向上した記載

性能向上に関する記載は無し。

⑥比較の記載

比較の記載は無し。

⑦上記⑥以外で表に出来そうな部分

以下の要素を表にまとめることができる。

項目	説明
トークン数	各モデルで訓練に使用されたトークン数
アクティブパラメータ	T5のようなエンコーダ-デコーダモデルでは、各トークンに対して半分のパラメータしかアクティブでない
フロップ数	乗算と加算によるフロップ数
逆伝播の係数	前向き伝播の3倍として逆伝播の計算量を見積もる

⑧具体的な数値の記載

数値	説明
8.64e+19 FLOPs	訓練に使用された計算量を表す単位としてのFLOPsとPetaflop/s-day
3倍	逆伝播の計算量を前向き伝播の3倍として見積もる
10%未満	注意機構が全体の計算量に占める割合（無視可能）

⑨具体的な変数名(数式用の記号)

変数名	説明
FLOPs	訓練に使用されたフロップ数を表す
Petaflop/s-day	訓練に使用された計算量を時間単位で表す
3x	逆伝播に必要な計算量を前向き伝播の3倍と見積もる係数
∂params/∂loss, ∂acts/∂loss	逆伝播に必要な微分演算

⑩図/表の説明

Table D.1: 各言語モデルの訓練に必要な総計算量（FLOPsおよびPetaflop/s-day）の計算方法を説明する。表の右から左へ進むことで、トークン数、パラメータ数、およびそれに関連する計算がどのように行われるかを詳細に追跡できる構造になっている。

E Human Quality Assessment of Synthetic News Articles

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) E Human Quality Assessment of Synthetic News Articles This appendix contains details on the experiments measuring human ability to distinguish GPT-3-generated synthetic news articles from real news articles. We first describe the experiments on the ∼ 200 word news articles, and then describe the preliminary investigation of ∼ 500 word news articles generated by GPT-3. Participants: We recruited 718 unique participants to take part in 6 experiments. 97 participants were excluded for failing an internet check question, leaving a total of 621 participants: 343 male, 271 female, and 7 other. Mean participant age was ∼ 38 years old. All participants were recruited through Positly, which maintains a whitelist of high-performing workers from Mechanical Turk. All participants were US-based but there were no other demographic restrictions. Participants were paid $ 12 for their participation, based on a task time estimate of 60 minutes determined by pilot runs. In order to ensure that the sample of participants for each experiment quiz was unique, participants were not allowed to take part in an experiment more than once. Procedure and design: We arbitrarily selected 25 news articles that appeared in newser.com in early 2020. We used the article titles and subtitles to produce outputs from the 125M, 350M, 760M, 1.3B, 2.7B, 6.7B, 13.0B, and 200B (GPT-3) parameter language models. Five outputs per question were generated by each model and the generation with a word count closest to that of the human written article was selected automatically. This was to minimize the effect that completion length might have on participants’ judgments. The same output procedure for each model with the exception of the removal of the intentionally bad control model, as described in the main text. In each experiment, half of the participants were randomly assigned to quiz A and half were randomly assigned to quiz B. Each quiz consisted of 25 articles: half (12-13) were human written and half (12-13) were model generated: the articles with human written completions in quiz A had model generated completions in quiz B and vice versa. The order of quiz question was shuffled for each participant. Participants could leave comments and were asked to indicate if they had seen the articles before. Participants were instructed not to look up the articles or their content during the quiz and at the end of the quiz were asked if they had looked anything up during the quiz. Statistical Tests: To compare means on the different runs, we performed a two-sample t-test for independent groups for each model against the control. This was implemented in Python using the scipy.stats.ttest_ind function. When plotting a regression line in the graph of average participant accuracy vs model size, we fit a power law of the form ax−b. The 95% confidence intervals were estimated from the t-distribution of the sample mean. Duration statistics: In the main text, we discussed the finding that the ability of human participants to distinguish model and human generated news articles decreases as our models become larger. We have also found that the average time spent for a given set of questions increases as the model size increases, as shown in Figure E.1. Lower accuracy scores despite increased time investment from participants supports the finding that larger models generate harder-to-distinguish news articles. Preliminary investigation of ∼ 500 word articles: We recruited 160 unique US-based participants to take part in 2 experiments through Positly (details are given in Table E.2). We randomly selected 12 Reuters world news articles from late 2019 and created a context for GPT-3 175B that consisted of a single Reuters article not in this set of 12. We then used the article titles and Reuters locations to generate completions from GPT-3 175B and the 160M control model from the previous experiments. These were used to create two 12-question quizzes per model, each consisting of half human written and half model generated articles. Comprehension questions were added and articles were shown to participants in 3 stages at 30 second intervals to encourage closer reading. Participants were paid $ 12 for this task. Model generation selection methods, exclusion criteria, and statistical tests mirror those of the previous experiments.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この章は、GPT-3が生成したニュース記事と実際のニュース記事を人間が区別できるかどうかを評価する実験の詳細を説明している。実験では、200語程度と500語程度のニュース記事に対する人間の判断力が測定された。

要旨:

人間参加者がGPT-3生成記事と本物のニュース記事を区別できるかを測定する実験
200語と500語のニュース記事を使用して評価
モデルの規模が大きくなると、GPT-3生成記事と実際のニュース記事の区別が難しくなる
参加者数、性別、年齢などのデモグラフィック情報が詳細に説明されている
参加者の反応時間とモデルサイズの関係を分析

②具体的な処理方法の記載

参加者の選定: Positlyを通じて参加者を募集し、Mechanical Turkの高評価労働者からリストアップされた。最終的に621人の参加者が選ばれた。
記事の選定と生成方法: 2020年初頭にnewser.comに掲載された25の記事を選び、各モデル（GPT-3の125Mから200Bパラメータまで）で5つの生成結果を得た。その中で、元の人間記事と最も近い単語数の生成物を自動的に選択した。
クイズの設計: 25問のクイズを作成し、半数は人間が書いた記事、残りはGPT-3が生成した記事で構成。クイズAとクイズBの間で人間記事とモデル生成記事を入れ替え、参加者ごとに質問の順番をシャッフルした。
統計解析: モデルとコントロール群の平均値を比較するために、Pythonのscipy.stats.ttest_ind関数を使用して二群のt検定を実施。回帰線はパワー法ax^−bを使用してフィッティングされた。
500語の記事の評価: 500語のニュース記事の評価では、12本のReuters記事を用いてGPT-3 175Bと160Mコントロールモデルの生成結果を比較するために12問のクイズを2つ作成し、3つの段階に分けて30秒ごとに記事を提示した。

③改良点・工夫点・テクニック等の記載

クイズAとBの工夫: 参加者のバイアスを排除するために、クイズAとクイズBで同じ記事を人間とモデル生成の両方で交互に出題する工夫がされている。
回答時間の制御: 500語のニュース記事では、回答者が記事をよりよく理解できるように、3つの段階で記事を提示し、各段階で30秒ごとに表示する方法が採用されている。
出力生成の選定: GPT-3モデルによる生成結果のうち、元記事と単語数が最も近いものを自動的に選択することで、文章の長さによる影響を最小限に抑えている。

④メカニズムの解明・なぜそうなるのか

GPT-3のパラメータが大きくなると、生成されたニュース記事がより人間らしくなるため、参加者が実際のニュース記事との区別が困難になる。特に大規模なモデルでは、より説得力のある文脈や文章構成が可能となり、区別が難しくなることが示唆されている。また、参加者の反応時間が長くなることから、より注意深く判断する必要があることも示されている。

⑤性能が向上した記載

GPT-3のモデルサイズが増加するにつれて、生成されたニュース記事と人間が書いたニュース記事を区別することが難しくなることが確認された。特に、パラメータ数が大きいモデル（例: GPT-3 175B）では、参加者の正確性が低下し、平均的な回答時間が増加している。この傾向は、Figure E.1に示されている。

▼生成AIの回答(ChatGPT-4o)
⑥比較の記載があれば違いを表でまとめて下さい。

この節には、GPT-3モデルの規模に応じたニュース記事の区別のしやすさに関する比較が存在する。モデルのパラメータ数が増加するにつれて、参加者が区別しにくくなることが示されている。

モデルパラメータ数	平均正答率	参加者の回答時間	備考
125M	高	短い	生成記事と人間記事を容易に区別
350M	やや高	やや短い	多少区別が難しくなる
760M	中	中	区別が難しくなる傾向
1.3B	やや低	やや長い	GPT-3生成記事がより人間らしく
2.7B	低	長い	判別が難しい
6.7B	非常に低	非常に長い	区別が非常に困難
13.0B	非常に低	非常に長い	区別が非常に困難
175B (GPT-3)	最も低い	最も長い	最大の区別困難性 (Figure E.1)

図E.1には、モデルサイズが増加するにつれて人間がGPT-3生成記事を区別するのが難しくなり、正答率が低下することが示されている。

⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。

表にできる情報として、参加者のデモグラフィック情報を以下のようにまとめることができる。

性別	人数
男性	343
女性	271
その他	7

年齢	平均値
参加者の年齢	38歳

⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。

数値	説明
718	元の参加者数
621	有効な参加者数
343	男性参加者数
271	女性参加者数
7	その他の性別参加者数
38	平均年齢（歳）
$12	参加者への報酬（USドル）
60	タスク時間の推定（分）
25	クイズあたりの記事数
125M-175B	使用されたGPT-3モデルのパラメータ数範囲
500語	検証されたニュース記事の語数
95%	信頼区間
160	500語記事の検証に参加した人数

⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。

変数名	説明
t	t検定の統計値
ax^−b	パワー法の回帰式の形式
95% CI	95%の信頼区間
scipy.stats.ttest_ind	二群のt検定に使用された関数

⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。

図E.1: この図は、参加者がGPT-3生成記事と人間記事を区別する際の正答率とモデルのパラメータサイズの関係を示している。モデルサイズが大きくなるにつれて正答率が低下し、反応時間が長くなることを示しており、GPT-3の大規模モデルがより人間らしい記事を生成することをサポートしている。
表E.2: 500語記事に関する実験の結果がまとめられたもので、参加者の人数、使用された記事の数、クイズの構成、各モデルの出力方法などが記載されている。この表は、GPT-3 175Bと160Mコントロールモデルによる生成記事の比較と評価結果を示している。

F Additional Samples from GPT-3

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) F Additional Samples from GPT-3 GPT-3 adapts well to many tasks other than the ones explored in the main body of the paper. As an example, in Figure F.1, we show four uncurated samples from a prompt suggesting that the model write a poem, with a given title, in the style of Wallace Stevens. We first experimented with a few prompts, then generated four samples with no additional editing or selection (sampling at temperature 1 using nucleus sampling [ HBFC19 ] with P = 0.9). Completions were truncated when the model began to write a new title and author heading, or broke into prose commentary.

Figure F.1: Four uncurated completions from a context suggesting the model compose a poem in the style of Wallace
Stevens with the title ‘Shadows on the Way’.
↓
Context → The City
BY C. P. CAVAFY
TRANSLATED BY EDMUND KEELEY
[Poem text omitted]
SOME TREES
John Ashbery
[Poem text omitted]
Shadows on the Way
Wallace Stevens
-------- Generated Poem 1 -------- -------- Generated Poem 3 --------
I must have shadows on the way The sun was all we had. Now, in the shade
If I am to walk I must have All is changed. The mind must dwell on those
Each step taken slowly and alone White fields, that to its eyes were always old;
To have it ready made Those ancient gleams, convoluting
The way they lay among the huge roots,
And I must think in lines of grey The great dark tomes of reverie,
To have dim thoughts to be my guide The plumed labyrinth of the sea.
Must look on blue and green
And never let my eye forget When our own breath gonne faint and fade
That color is my friend Up the long stairway of the dead
And purple must surround me too We shall remember well
The blindness of the light, the whiteness
The yellow of the sun is no more Of that white land. The footsteps, and
Intrusive than the bluish snow The grasses in that light, and in the shade
That falls on all of us. I must have The menacing life.
Grey thoughts and blue thoughts walk with me
If I am to go away at all. We played, too, once, in the warmest rooms.
To one content, there is one content
-------- Generated Poem 2 -------- For all mankind. The forest glades
Are all the more for being in fable,
There is a way in the world. And wood and town in legend shadowed-
That way a man may live in a small town Once more to return. The one content
His lifetime and see no man come, Has in it every happiness, more brave
Never the stir of men about a market, For being purposeless; and so
Never the bobbing of covered carts, Those island gardens, deep in the mind,
Never the creak passed on the stair Make of remoteness familiarity
Closing some other life; an old man And of the unspoken name a friend.
Can prosper in that town, can prosper
When his store is seen to close, his wife -------- Generated Poem 4 --------
Fades in his house, and the world shouts in vain
That no one comes to hear him Nobody will come to this place. It is a road that leads nowhere.
Where his small bell tolls. The solitude is deep. The mountains are high.
In that small town But they are desolate, and they turn the traveler’s face
Everything is known. Towards the North. All the sounds of the world are far away.
The ways of the young men are known, When the wind rises above the trees,
And the ways of the young women The boughs bow to the ground.
As they walk to the Spring; Even the birds that inhabit the tangle of weeds
The rumors are all known, That is the roadside cover, are silent. One listens,
The quarrels, the hurt, the envy But hears no roar of the forest. One is alone.
Of the men and the women. One will be taken.
Till a day comes One will be taken.
That way out of the world There is no utterance, there is no conversation,
That leads to a wood. But one is uneasy all the same....
Young, the young man There is a thin blue mist,
Walks, a lover now, A darkness rising like smoke,
Knows the streets as they are and is of them. And within that darkness
A possession of the heart.
Comes to the wood. One will be taken.... It was here, and it will be here again-
Here, under this sky empty and full of light.
There, in the wood, among trees,
He sees shadows on the way, hears voices,
hears the wind and the rustling of leaves;
Through an open glade
He sees a shape and the shape hears:
It waits as he waits,
Waits as the shadows wait,
As the voices wait;
Shadows on the way, voices in the wind.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

記載内容: 本節は、GPT-3が異なるタスクに対して適応できる能力を示しており、特に詩の生成に関するサンプルが記載されている。
要旨:
- GPT-3は詩の生成にも適応できる。
- モデルに「Wallace Stevensのスタイルで詩を作る」というプロンプトを与えた結果、4つの異なる詩が生成された。
- 生成はnucleus sampling (P = 0.9) を用い、温度1で行われた。
- 生成された詩は編集や選別を行わず、そのまま提示された。

②具体的な処理方法の記載があれば説明下さい。

モデルには、Wallace Stevensのスタイルで「Shadows on the Way」というタイトルの詩を書くよう指示したプロンプトが使用された。
いくつかのプロンプトを実験した後、最終的にnucleus sampling (P = 0.9) を用いて温度1でサンプルが生成された。
生成された詩は、モデルが新しいタイトルや作家の見出しを書き始めたり、散文形式に移行した段階で切り取られた。

③改良点・工夫点・テクニック等の記載があれば説明下さい。

nucleus samplingというテクニックが使用されている。これは、トップp（確率0.9）に基づいて次の単語を選択する方法であり、従来の確率的サンプリングに比べてより多様性のある結果を生成できる（[HBFC19]で紹介された技術）。
追加の編集や選別を行わないことで、生成モデルの生のアウトプットを評価することが可能となっている。

④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。

nucleus samplingを用いることで、モデルは確率分布の上位pの領域から次の単語を選択するため、極端に低確率な単語を除外しつつも多様な出力を生むことができる。これにより、詩の生成時により自然な流れと多様性が保証される。
モデルがタイトルや作家見出しを生成し始めた段階でトリミングされたのは、GPT-3が文脈に基づいて新しいセクションを開始する傾向があるためである。これにより、既存の詩と別の要素が混在するのを防ぐ意図がある。

⑤性能が向上した記載があれば説明下さい。(具体値があれば記載、対応する図/表番号があれば各文末に記載)

性能向上に関する具体的な記載はなし。

⑥比較の記載があれば違いを表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば明記)

この節には、特定のタスク間での性能比較は行われていないため、比較に関する記載はなし。

⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば記載)

詩の生成手法やサンプリングに関する情報を表にまとめることが可能である。

使用技術	説明
nucleus sampling	次の単語を確率分布の上位p = 0.9の領域から選択
温度	サンプリングの多様性を制御するパラメータ (T = 1)
編集/選別	なし
タスク	「Shadows on the Way」の詩をWallace Stevens風に生成

⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。(必ず正しく数値を抜き取る事、|数値|説明|の表へ)

数値	説明
4	生成された詩の数
1	サンプリングにおける温度パラメータ
0.9	nucleus samplingにおけるP値
なし	詩の選別や編集が行われていないことを示す

⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。

変数名	説明
T	サンプリングの温度パラメータ
P	nucleus samplingにおける確率閾値 (P = 0.9)

⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。(掲載理由・注目ポイント等)

Figure F.1: この図は、Wallace Stevens風の詩を生成する際に、GPT-3が生成した4つのサンプル詩を表示している。モデルの創造力や詩の文体適応能力を示す例として重要である。

G Details of Task Phrasing and Specifications

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) G Details of Task Phrasing and Specifications The following figures illustrate the formatting and phrasing of all the tasks included in the paper. All data comes from the ground truth datasets in this section, and no samples from GPT-3 are included here.

Figure G.1: Formatted dataset example for RACE-h. When predicting, we normalize by the unconditional probability
of each answer as described in 2.
↓
Context → Article:
Informal conversation is an important part of any business
relationship.Before you start a discussion,however,make sure you understand
which topics are suitable and which are considered taboo in a particular
culture. Latin Americans enjoy sharing information about their local
history, art and customs.You may expect questions about your family,and
be sure to show pictures of your children.You may feel free to ask similar
questions of your Latin American friends.The French think of conversation
as an art form,and they enjoy the value of lively discussions as well as
disagreements. For them,arguments can be interesting and they can cover
pretty much or any topic ---- as long as they occur in are respectful and
intelligent manner.
In the United States,business people like to discuss a wide range of
topics,including opinions about work,family,hobbies,and politics. In
Japan,China,and Korea,however,people are much more private.They do not
share much about their thoughts,feelings,or emotions because they feel
that doing so might take away from the harmonious business relationship
they’re trying to build.Middle Easterners are also private about their
personal lives and family matters.It is considered rude,for example,to ask
a businessman from Saudi Arabia about his wife or children.
As a general rule,it’s best not to talk about politics or religion with
your business friends.This can get you into trouble,even in the United
States,where people hold different religious views.In addition,discussing
one’s salary is usually considered unsuitable.Sports is typically a
friendly subject in most parts of the world,although be careful not to
criticize national sport.Instead,be friendly and praise your host’s team.
Q: What shouldn’t you do when talking about sports with colleagues from
another country?
A: Criticizing the sports of your colleagues’ country.
Q: Which is typically a friendly topic in most places according to the
author?
A: Sports.
Q: Why are people from Asia more private in their conversation with others?
A: They don’t want to have their good relationship with others harmed by
informal conversation.
Q: The author considers politics and religion .
A:
Correct Answer → taboo
Incorrect Answer → cheerful topics
Incorrect Answer → rude topics
Incorrect Answer → topics that can never be talked about

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

本節は、タスクのフォーマットと表現に関する詳細な情報を提供している。
データはすべて、論文に含まれている「ground truth datasets（正解データセット）」から取得されており、GPT-3からのサンプルは含まれていない。
Figure G.1では、RACE-hというタスクのフォーマットされたデータセットの例を示しており、特に無条件確率での正規化についても触れている。
質問と回答の具体例が示され、文化やビジネスにおける会話の違いに関する文脈に基づいている。

②具体的な処理方法の記載があれば説明下さい。

特定の処理として、RACE-hタスクの予測時に各回答の無条件確率で正規化を行う方法が記載されている。これは、予測の公平性を保つためのプロセスであり、詳細は論文の第2章で説明されている【図G.1】。

③改良点・工夫点・テクニック等の記載があれば説明下さい。

改良点として、無条件確率での正規化という技術が使用されている。これにより、モデルが特定の回答に対して偏りを持たず、より公平な予測が可能となる。
データセットのフォーマットが明示されており、タスクごとに異なる文脈やフォーマットを扱う際の基盤を整えている。

④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。

無条件確率での正規化は、異なる選択肢に対して公平な評価を可能にするためのメカニズムである。特定の選択肢がデータセット内で不均等に出現する場合、それに対応するバイアスがかかることを防ぐためにこの手法が導入されている【図G.1】。

⑤性能が向上した記載があれば説明下さい。

無し

⑥比較の記載があれば違いを表でまとめて下さい。

比較項目	説明	対応する図/表番号
正規化なしの予測	各選択肢の出現頻度に影響を受け、偏った予測が行われる可能性がある。	図G.1
無条件確率での正規化	各選択肢が公平に評価され、データセット内の出現頻度に依存しない予測が可能となる。	図G.1

⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。

無し

⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。

無し

⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。

無し

⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。

Figure G.1 は、RACE-hタスクのデータセットフォーマットの例を示しており、正規化の方法と文脈に基づいた質問と回答の例を提供している。主張としては、タスクごとのフォーマットとその処理の詳細を理解するために重要な部分である。

H Results on All Tasks for All Model Sizes

(グラフ・表のみ)

あとがき

所感

機械学習についてある程度知っていないと、生成AIに質問しても、その回答内容だけでは理解できないと思います。生成AIの回答は、論文の内容をあまり変えずに、要点をそのまま掻い摘んで回答するような形になっています。(良い意味で論文の要点をそのまま抜き出した形)

分かったこと

分かったこと・注目した点を列挙すると:

Common Crawlデータセットのフィルタリング:
Common Crawlのフィルタリングには、WebTextやWikipediaを高品質なデータセットの代表例とし、ロジスティック回帰分類器を使用して、これらのデータと未フィルタリングのCommon Crawlデータを区別するモデルを訓練した**。この分類器はSparkのトークナイザーとHashingTFを使用して特徴量を抽出している。
分類器で予測されたスコアに基づいて、Common Crawlのドキュメントを再サンプリングし、ocument_scoreが高いものを優先的に保持する。サンプリングの際には、np.random.pareto(α) > 1 − document_scoreという条件式を用い、α = 9を設定している。αは分類器がWebTextで得たスコアの分布に合わせて選択された。(異なる種類のドキュメントを一部残す工夫)
ファジーデデュプリケーションでは、SparkのMinHashLSH実装を使用し、10個のハッシュを使用して類似ドキュメントを削除した。これには分類の際に使用した同じ特徴量を用いている。WebTextとの重複も削除している。

GPT-3の具体的なトレーニング:
勾配のグローバルノルムは1.0でクリップされ、学習率は余弦減衰を使用して、2600億トークンまで元の学習率の10%まで減少させた。
トレーニングの最初の3.75億トークンで学習率のウォームアップを線形に実施した。
バッチサイズは、32,000トークンから線形に増加し、4～120億トークンの間で最適なサイズに到達するよう調整された。

余弦減衰を使用することで、学習率がスムーズに減少し、過学習を抑制しながらもモデルがより長期間学習できる。

バッチサイズの段階的増加は、初期の学習で安定性を確保しつつ、大規模なトレーニングデータに対して効果的にモデルを適応させるためである。

トレーニングセットとテストセットのオーバーラップ検出:
13-gramオーバーラップ検出: 全てのテスト/開発セットとトレーニングデータの間で13-gramのオーバーラップを検索し、オーバーラップした部分とその前後200文字のウィンドウを削除する。このフィルタリングの際、13-gramは小文字化され、空白で区切られ、句読点は除外される。
ドキュメント分割: オーバーラップしたドキュメントは分割され、10個以上の部分に分割されたドキュメントは「汚染された」とみなして完全に除去される。
頻出13-gramの無視: 10以上のトレーニングドキュメントと一致した13-gramは、検査により文化的なフレーズや法的な定型文が多いため無視される。(10以上のドキュメントと一致する13-gram（例えば、法律の定型文や一般的な文化的フレーズ）は無視する工夫)
Apache Sparkによる計算: GPT-2で使用されたブルームフィルタに代わり、Apache Sparkを用いてテストとトレーニングセット間の正確なオーバーラップを計算している。

クリーンデータ vs 汚染データの影響: クリーンデータのスコアが全体のスコアよりも1〜2%悪化する場合、モデルがトレーニングデータにオーバーフィットしている可能性があると考えられる。逆にクリーンデータでのスコアが良い場合、フィルタリングにより簡単な例が汚染データとしてマークされている可能性がある。

言語モデルの訓練に使用された計算量の計算方法:
注意機構の計算量を無視し、主要な計算はトークンごとのパラメータに関連する乗算と加算に基づいている
注意機構の省略: 全体の計算量に対して10%未満しか使用しないため、注意機構の計算を無視している。
逆伝播に必要な計算量は前向き伝播の3倍として計算されている
総フロップ数の計算: 訓練されたトークン数とパラメータ数を掛け合わせ、総フロップ数を算出

GPT-3のパラメータが大きくなると、生成されたニュース記事がより人間らしくなるため、参加者が実際のニュース記事との区別が困難になる。特に大規模なモデルでは、より説得力のある文脈や文章構成が可能となり、区別が難しくなることが示唆されている。また、参加者の反応時間が長くなる。

RACE-hタスクの予測時に各回答の無条件確率で正規化を行う:
無条件確率での正規化は、異なる選択肢に対して公平な評価を可能にするためのメカニズムである。特定の選択肢がデータセット内で不均等に出現する場合、それに対応するバイアスがかかることを防ぐためにこの手法が導入されている【図G.1】。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up