生成AIを用いてGPT-2の論文「Language Models are Unsupervised Multitask Learners (2019)」を読んでみた

Posted at 2024-09-21

はじめに

生成AIを用いてGPT-2の論文「Language Models are Unsupervised Multitask Learners (2019)」の内容を(なるべく)把握してみました。(生成AIが)論文の記載内容を始めから最後まで読んで、実際にどのような記載があるのかを把握します。

(論文の分かりやすい解説記事は見るのですが、実際の論文までチェックしないので、生成AIを使って内容を把握してみました。)

ゼロショットに重きを置いた論文で、モデルの容量（パラメータ数）はゼロショットタスク転移の成功に不可欠であり、容量が増加することで性能も向上する、という主張をしていることが分かりました。
全体の趣向として、zero-shotもしくはfew-shotでも精度が出るような汎用的なモデルを目指していることが分かりました。
(末尾の「分かったこと」章を参照)

以降で、ChatGPTに聞いてみた例を記載します。

他例: 同類の方法を使って読んでみた結果

対象の論文

論文: (GPT-2に関する論文)

Language Models are Unsupervised Multitask Learners (2019)
https://paperswithcode.com/paper/language-models-are-unsupervised-multitask
(PDF: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)

質問時の各章節の区切り部分

論文の中にある各章節を、下記のように区切って、部分毎に生成AIに内容を質問していきます。

Abstract
---
1.-Introduction
---
2.-Approach
---
2.1. Training Dataset
---
2.2. Input Representation
---
2.3. Model
---
3.-Experiments
3.1. Language Modeling
---
3.2. Children’s Book Test
3.3. LAMBADA
3.4. Winograd Schema Challenge
3.5. Reading Comprehension
---
3.6. Summarization
3.7. Translation
3.8. Question Answering
---
4.-Generalization vs Memorization
---
5.-Related Work
---
6.-Discussion
---
7.-Conclusion
---
8.-Appendix A: Samples
8.1. Model capacity
8.2. Text Memorization
8.3. Diversity
8.4. Robustness

生成AIへの質問方法

生成AIを活用して、知りたい記事・論文の1節分(適度な長さ)のテキストをコピー＆ペーストして、その下に質問内容を「①～ ②～ …」と番号付きで書いて、生成AIに渡せば、全質問に一発で的確に回答してくれるので、非常に良好でした。記事全体を読む必要なく、知りたい点の情報だけを収集できます。

生成AIへの質問例:

(論文・記事の各章節を貼り付け)

上記の内容に関して下記の質問に回答下さい: (である調で記載、質問に対して該当するものが無ければ無しと記載、対応する図/表番号があれば記載)
①何についての記載か? + 要旨は何? (要旨は箇条書きで記載)
②改良点・工夫点・テクニック等の記載があれば説明下さい。
③性能が向上した記載があれば説明下さい。(具体値があれば記載、対応する図/表番号があれば各文末に記載)
④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。
⑤具体的な処理方法の記載があれば説明下さい。(簡略化せずに全て記載、既存手法の適用であれば引用元を記載)

続けて下記の質問に追加で回答下さい:
⑥比較の記載があれば違いを表でまとめて下さい。(対応する図/表番号があれば明記)
⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。(対応する図/表番号があれば記載)
⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。(|数値|説明|の表へ)
⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。(|変数名|説明|の表へ)
⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。(掲載理由・注目ポイント等)

※回答が長くなりそうな場合は、適宜、分けて質問: ①②③④⑤、⑥⑦⑧⑨⑩
※その他、不明点があれば、適宜、追加で質問。

質問内容は、記事・論文を読んでいていつも知りたいと思う点(改良点・工夫点・テクニック・メカニズムの解明)にしています。また、表で比較した方が素早く把握できるので、可能であれば記事を表に変換するようにしています。

論文・記事を貼り付けるテキストの長さは、1節分程度の量にとどめた方が、良い回答が得られました。生成AIの回答の文量が多くなってくると、回答が長くなり過ぎないように、生成AIが勝手に(適度に)端折り始めてしまい、重要な点が回答から抜けてしまう可能性が高くなります。

事前知識

機械学習についてある程度知っていないと、生成AIに質問しても、その回答内容だけでは理解できないと思います。生成AIの回答は、論文の内容をあまり変えずに、要点をそのまま掻い摘んで回答するような形になっています。

GPT-2についての分かりやすい解説記事(下記)等を事前にチェックして、中核部分の内容をあらかじめ分かっていると、理解しやすいと思います。生成AIは実際の細かい処理方法自体を分かりやすく説明してはくれない傾向があります。

注意点

論文のテキスト内容だけを貼り付けて、生成AIに質問しています。論文の中の図・表の部分は貼り付けていません。図・表の内容は生成AIの回答には含まれず、別途論文を見る必要があります。

以降で、生成AIの回答内容が読みにくい・分かりづらい場合は、論文の本文でも同じように書かれてあり、論文の本文を読んでも同じように分かりづらいことが多くあります。論文では、既存研究等があるため、多くの説明を省略・一言だけサラッと書かれてある等、同種の研究に取り組む人でなければ、なかなか全容の理解に辿りつくのは難しい側面があります。この点は、生成AIの回答性能が悪いのではなく、論文という性質上、生じるものと考えています。

生成AIに質問

以降で、ChatGPTに実際に聞いてみた例を記載します。

生成AIへの質問＆回答の全容

生成AIへの質問＆回答の全容・詳細:

Abstract

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) Abstract Natural language processing tasks, such as ques- tion answering, machine translation, reading com- prehension, and summarization, are typically approached with supervised learning on task- specific datasets. We demonstrate that language models begin to learn these tasks without any ex- plicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the an- swers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and in- creasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested lan- guage modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain co- herent paragraphs of text. These findings suggest a promising path towards building language pro- cessing systems which learn to perform tasks from their naturally occurring demonstrations.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この部分では、自然言語処理タスクに対するゼロショット学習の成功を示した研究結果が記載されている。具体的には、新しいデータセット「WebText」を使用してトレーニングされたGPT-2モデルが、明示的な教師なしでさまざまな自然言語処理タスクを学習できることを示している。

要旨

自然言語処理タスク（質問応答、機械翻訳、読解、要約など）に対して、タスク固有のデータセットを使用せずにゼロショットで学習を行う。
WebTextという新しいデータセットを使用してトレーニングされた言語モデルは、明示的な教師なしでタスクを学習し始めることができる。
CoQAデータセットにおいて、タスク固有のデータなしでF1スコア55を達成し、既存のベースラインシステムの3つを上回る性能を発揮。
モデルの容量（パラメータ数）はゼロショットタスク転移の成功に不可欠であり、容量が増加することで性能も向上する。
最大のモデルであるGPT-2は1.5Bパラメータを持つTransformerモデルであり、ゼロショット設定でテストされた8つの言語モデリングデータセットのうち7つで最先端の結果を達成した。
しかし、GPT-2は依然としてWebTextのデータ量に対して過適合していない。
モデルのサンプルは、改善を反映し、論理的な段落を生成することができる。

②改良点・工夫点・テクニック等の記載

WebTextという大規模な新しいデータセットを使用して、従来のタスク固有のデータセットに依存しないトレーニング方法を工夫した。
モデルの容量を増加させることで、ゼロショット学習の性能を向上させる工夫を行った。
大規模なTransformerアーキテクチャを採用し、パラメータを1.5Bに増やすことで、モデルの学習能力を最大化する取り組みを行った。

③性能が向上した記載

CoQAデータセットにおいて、トレーニングデータ（127,000+の例）を使用せずに、F1スコア55を達成しており、3つのベースラインシステムの性能を上回った。
GPT-2はゼロショット設定で、テストされた8つの言語モデリングデータセットのうち7つで最先端の結果を達成した (具体的なデータセットの詳細は本文中に記載されていないが、性能向上は言及されている)。

④メカニズムの解明・なぜそうなるのか等の記載

モデルの容量（パラメータ数）はゼロショット学習の成功に直接的な影響を与えるとされており、モデルのサイズが増えることで性能が向上する理由は、より多くの情報を保持し、タスク間での転移学習能力が高まるためと考えられる。
WebTextのような大規模で多様なデータセットを使用することで、タスク固有のデータセットなしでもタスクを学習できるようになるという点が、モデルのゼロショット学習能力を向上させた。

⑤具体的な処理方法の記載

モデルは、Transformerアーキテクチャを採用し、1.5Bのパラメータを持つGPT-2を使用した。
トレーニングデータセットとして「WebText」を使用し、これは数百万のウェブページから収集されたデータである。
質問応答タスクにおいては、ドキュメントに質問を条件として与えることで、言語モデルが回答を生成する。
CoQAデータセットでの性能は、ゼロショット設定でテストされ、F1スコア55を達成した。

⑥表に出来そうな部分

タスク	モデル	F1スコア	使用データセット	特徴
質問応答 (CoQA)	GPT-2 (1.5B)	55	WebText	明示的なトレーニングデータなしでベースラインを上回る
言語モデリング (7/8)	GPT-2 (1.5B)	最先端	WebText	ゼロショットで7/8のデータセットで最先端の結果を達成

⑦上記⑥以外で表に出来そうな部分

無し

⑧具体的な数値の記載

GPT-2のパラメータ数: 1.5B
CoQAデータセットでのF1スコア: 55
使用したトレーニングデータセット: WebText

⑨具体的な変数名の記載

無し

⑩各図/表の主張ポイント

無し (この要約においては、具体的な図や表に関する言及がないため)。

1. Introduction

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 1. Introduction Machine learning systems now excel (in expectation) at tasks they are trained for by using a combination of large datasets, high-capacity models, and supervised learning (Krizhevsky et al., 2012) (Sutskever et al., 2014) (Amodei et al., 2016). Yet these systems are brittle and sensitive to slight changes in the data distribution (Recht et al., 2018) and task specification (Kirkpatrick et al., 2017). Current sys- tems are better characterized as narrow experts rather than competent generalists. We would like to move towards more general systems which can perform many tasks – eventually without the need to manually create and label a training dataset for each one. The dominant approach to creating ML systems is to col- lect a dataset of training examples demonstrating correct behavior for a desired task, train a system to imitate these behaviors, and then test its performance on independent and identically distributed (IID) held-out examples. This has served well to make progress on narrow experts. But the often erratic behavior of captioning models (Lake et al., 2017), reading comprehension systems (Jia & Liang, 2017), and image classifiers (Alcorn et al., 2018) on the diversity and variety of possible inputs highlights some of the short- comings of this approach. Our suspicion is that the prevalence of single task training on single domain datasets is a major contributor to the lack of generalization observed in current systems. Progress towards robust systems with current architectures is likely to require training and measuring performance on a wide range of domains and tasks. Recently, several benchmarks have been proposed such as GLUE (Wang et al., 2018) and decaNLP (McCann et al., 2018) to begin studying this. Multitask learning (Caruana, 1997) is a promising frame- work for improving general performance. However, mul- titask training in NLP is still nascent. Recent work re- ports modest performance improvements (Yogatama et al., 2019) and the two most ambitious efforts to date have trained on a total of 10 and 17 (dataset, objective) pairs respectively (McCann et al., 2018) (Bowman et al., 2018). From a meta-learning perspective, each (dataset, objective) pair is a single training example sampled from the distribution of datasets and objectives. Current ML systems need hundreds to thousands of examples to induce functions which generalize well. This suggests that multitask training many need just as many effective training pairs to realize its promise with current approaches. It will be very difficult to continue to scale the creation of datasets and the design of objectives to the degree that may be re- quired to brute force our way there with current techniques. This motivates exploring additional setups for performing multitask learning. The current best performing systems on language tasks utilize a combination of pre-training and supervised fine- tuning. This approach has a long history with a trend to- wards more flexible forms of transfer. First, word vectors were learned and used as inputs to task-specific architec- tures (Mikolov et al., 2013) (Collobert et al., 2011), then the contextual representations of recurrent networks were transferred (Dai & Le, 2015) (Peters et al., 2018), and re- cent work suggests that task-specific architectures are no longer necessary and transferring many self-attention blocks is sufficient (Radford et al., 2018) (Devlin et al., 2018). These methods still require supervised training in order to perform a task. When only minimal or no supervised data is available, another line of work has demonstrated the promise of language models to perform specific tasks, such as commonsense reasoning (Schwartz et al., 2017) and sentiment analysis (Radford et al., 2017). In this paper, we connect these two lines of work and con- tinue the trend of more general methods of transfer. We demonstrate language models can perform down-stream tasks in a zero-shot setting – without any parameter or archi- tecture modification. We demonstrate this approach shows potential by highlighting the ability of language models to perform a wide range of tasks in a zero-shot setting. We achieve promising, competitive, and state of the art results depending on the task.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この章では、自然言語処理タスクにおける機械学習システムの現状と課題、またゼロショット学習のアプローチが提案されている。従来の手法の限界を指摘し、より一般的なシステムを構築するためのマルチタスク学習や転移学習の可能性について述べている。

要旨

現在の機械学習システムは、単一タスクに特化した狭い専門家としての性能を発揮するが、一般化能力が不足している。
マルチタスク学習は、より一般的なシステムを構築するための有望なアプローチであるが、現時点では成果は限られている。
転移学習の手法が進化しており、特に言語モデルがタスクをゼロショットで実行できる可能性がある。
本論文では、ゼロショット学習による下流タスクの遂行能力を示し、競争力のある結果を達成することを目指している。

②改良点・工夫点・テクニック等の記載

マルチタスク学習の課題として、現在のシステムではタスクごとのデータセットの収集と設計が大規模に必要であり、それをスケールアップするのは難しいことを指摘している。
言語モデルにおいて、タスク固有のアーキテクチャの設計が不要であり、自己注意層（self-attention blocks）の転移だけで十分であることを提案している。
ゼロショット学習の可能性を探るために、転移学習と未加工の言語モデルの性能向上を組み合わせた。

③性能が向上した記載

言語モデルのゼロショット設定での実験により、特定のタスクにおいて最先端の結果や競争力のある結果が得られたとされている。
詳細な数値は記載されていないが、ゼロショット学習で期待できる成果について言及している。

④メカニズムの解明・なぜそうなるのか等の記載

マルチタスク学習が一般化能力を向上させる可能性がある理由として、単一タスクではなく、広範なタスク間で学習することで、より頑健なシステムが構築されることが期待されるためである。
言語モデルが自己注意層を通じて、複数のタスクに適応する能力を持ち、これがタスク固有の設計なしでの転移学習を可能にしている。

⑤具体的な処理方法の記載

従来の手法では、大規模なデータセットを収集し、各タスクに対して個別のアーキテクチャを設計していた。
マルチタスク学習では、複数のデータセットやタスクを一括してトレーニングし、一般化されたパフォーマンスを得ることを目指している。
転移学習の手法では、まず事前学習を行い、その後、特定のタスクに対して微調整を行うアプローチが一般的である。しかし、ゼロショット学習では、事前学習されたモデルをそのまま使用して、タスクに対するアーキテクチャ変更や追加トレーニングを行わない。

▼質問(人間)
続けて下記の質問に追加で回答下さい:
⑥比較の記載があれば違いを表でまとめて下さい。(対応する図/表番号があれば明記)
⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。(対応する図/表番号があれば記載)
⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。(|数値|説明|の表へ)
⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。(|変数名|説明|の表へ)
⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。(掲載理由・注目ポイント等)

▼生成AIの回答(ChatGPT-4o)
⑥比較の記載

記載された比較に基づいて、以下の表を作成する。具体的な図/表番号は記載されていないため、該当する箇所に「図/表無し」とする。

アプローチ	特徴	結果	図/表番号
単一タスク学習	単一タスクに対して大規模なデータセットで学習	専門的なタスクには優れているが、一般化が難しい	図/表無し
マルチタスク学習	複数タスクを一度に学習	汎用的な性能の向上が期待されるが、限界もある	図/表無し
転移学習	事前学習されたモデルを使用し、特定のタスクに微調整	良好な性能を示すが、依然としてタスク固有のデータが必要	図/表無し
ゼロショット学習	事前学習のみでタスク固有のデータを使わずに実行	競争力のある性能を達成、最先端の結果を示す場合もある	図/表無し

⑦上記⑥以外で表に出来そうな部分

本文中で触れられているさまざまな言語モデルアプローチについて、表で整理する。

アプローチ	特徴	例
単語ベクトル学習	タスク固有のアーキテクチャに対して単語ベクトルを利用	Mikolov et al. (2013)
文脈ベクトル学習	リカレントネットワークの文脈表現を転移学習に利用	Dai & Le (2015), Peters et al. (2018)
自己注意層の転移学習	自己注意層の転移だけでタスク遂行が可能になる	Radford et al. (2018), Devlin et al. (2018)

⑧具体的な数値の記載

この章には具体的な数値が記載されていないため、数値の表は作成できない。

⑨具体的な変数名(数式用の記号)の記載

この章には具体的な数式や変数名が記載されていないため、該当する表は無し。

⑩図/表があれば、各図/表は何を主張するためのものか

この章では図や表が言及されていないため、図/表の説明は無し。

参考:

モデルの仕組みはGPTとほぼ同じ。
特定のタスクに特化した教師あり学習は行わず、より大きな言語コーパスを使って、より大きなモデルの言語モデルを事前学習させることにより、zero-shot、もしくはfew-shotのセッティングでも精度が出るような汎用的なモデルを目指す。

より大きなデータセット
GPT: BookCorpusデータセット → GPT-2: Webをクローリングして作成したWebText(40GB)
より大きなモデル
GPT: 12レイヤーの1億17百万パラメータ → GPT-2: 48レイヤーの15億42百万パラメータ

2. Approach

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 2. Approach At the core of our approach is language modeling. Lan- guage modeling is usually framed as unsupervised distri- bution estimation from a set of examples (x1, x2, ..., xn) each composed of variable length sequences of symbols (s1, s2, ..., sn). Since language has a natural sequential or- dering, it is common to factorize the joint probabilities over symbols as the product of conditional probabilities (Jelinek & Mercer, 1980) (Bengio et al., 2003): p(x) = n∏ i=1 p(sn|s1, ..., sn−1) (1) This approach allows for tractable sampling from and es- timation of p(x) as well as any conditionals of the form p(sn−k, ..., sn|s1, ..., sn−k−1). In recent years, there have been significant improvements in the expressiveness of mod- els that can compute these conditional probabilities, such as self-attention architectures like the Transformer (Vaswani et al., 2017). Learning to perform a single task can be expressed in a probabilistic framework as estimating a conditional distri- bution p(output|input). Since a general system should be able to perform many different tasks, even for the same input, it should condition not only on the input but also on the task to be performed. That is, it should model p(output|input, task). This has been variously formalized in multitask and meta-learning settings. Task conditioning is often implemented at an architectural level, such as the task specific encoders and decoders in (Kaiser et al., 2017) or at an algorithmic level such as the inner and outer loop optimization framework of MAML (Finn et al., 2017). But as exemplified in McCann et al. (2018), language provides a flexible way to specify tasks, inputs, and outputs all as a sequence of symbols. For example, a translation training example can be written as the sequence (translate to french, english text, french text). Like- wise, a reading comprehension training example can be written as (answer the question, document, question, answer). McCann et al. (2018) demon- strated it was possible to train a single model, the MQAN, to infer and perform many different tasks on examples with this type of format. Language modeling is also able to, in principle, learn the tasks of McCann et al. (2018) without the need for explicit supervision of which symbols are the outputs to be pre- dicted. Since the supervised objective is the the same as the unsupervised objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective. In this slightly toy setting, the concerns with density estimation as a principled training objective discussed in (Sutskever et al., 2015) are side stepped. The problem instead becomes whether we are able to, in practice, optimize the unsuper- vised objective to convergence. Preliminary experiments confirmed that sufficiently large language models are able to perform multitask learning in this toy-ish setup but learning is much slower than in explicitly supervised approaches. While it is a large step from the well-posed setup described above to the messiness of “language in the wild”, Weston (2016) argues, in the context of dialog, for the need to develop systems capable of learning from natural language directly and demonstrated a proof of concept – learning a QA task without a reward signal by using forward prediction of a teacher’s outputs. While dialog is an attractive approach, we worry it is overly restrictive. The internet contains a vast amount of information that is passively available without the need for interactive communication. Our speculation is that a language model with sufficient capacity will begin to learn to infer and perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement. If a language model is able to do this it will be, in effect, performing unsupervised multitask learning. We test whether this is the case by analyzing the performance of language models in a zero-shot setting on a wide variety of tasks.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この章は、言語モデリングを中心としたアプローチについて説明している。要旨は以下の通りである。

言語モデリングは、シンボル列の条件付き確率を推定する手法であり、文脈に基づいて次のシンボルを予測する形式である。
言語モデルは、単一タスクの学習だけでなく、複数のタスクを条件に応じて実行することも可能である。
言語によるタスクの指定は、タスク、入力、出力をシンボル列として表現する柔軟な手段を提供する。
言語モデルは、明示的な監督なしでも、自然言語のシーケンスからタスクを学び、ゼロショットで実行できる可能性がある。

②改良点・工夫点・テクニック等の記載

タスクを条件に応じて学習するために、言語そのものを使ってタスク、入力、出力をシンボル列として表現する柔軟な手法を導入している。このアプローチは、従来のアーキテクチャ的なタスク特化型のエンコーダー・デコーダーの使用を不要にしている。
自然言語からタスクを学習するという未解決の問題に対して、対話を使用する従来のアプローチを拡張し、インターネット上の膨大な情報を利用する手法を提案している。

③性能が向上した記載

性能向上に関しては、具体的な数値は記載されていないが、大規模な言語モデルがマルチタスク学習において十分にパフォーマンスを発揮することが示唆されている。また、ゼロショットでのタスク実行においても良好な結果が期待されている。

④メカニズムの解明・なぜそうなるのか等の記載

言語モデルは、シンボル列の条件付き確率を推定することで、次のシンボルを予測するタスクを自然に学習する。タスク、入力、出力をシーケンスとして表現することにより、タスク特化型のアーキテクチャを必要とせず、単一のモデルで複数のタスクを実行できる可能性が高まる。さらに、インターネット上の情報を利用することで、対話に依存せずに自然言語からタスクを学習できることが示されている。

⑤具体的な処理方法の記載

言語モデリングは、シンボル列の条件付き確率を推定することで行われる。具体的には、次のシンボルが文脈に基づいてどうなるかを予測するために、確率分布を学習する。
タスク学習においては、入力とタスクを同時に条件に含めた確率分布 $ p(\text{output} | \text{input}, \text{task}) $ を学習する。
従来のアーキテクチャ的手法ではなく、タスク、入力、出力をシンボル列として表現する方法に基づいて、特定のタスクを明示的に学習させることなく、自然言語からタスクを推測できるようにしている。

▼生成AIの回答(ChatGPT-4o)
⑥比較の記載

以下のように、従来の手法と本アプローチの違いが述べられている。

比較項目	従来の手法	本アプローチ
タスクの条件付け	アーキテクチャレベルで実装 (例: タスク特化型エンコーダー・デコーダー)	言語でタスクを指定 (シンボル列として表現)
学習方式	監督付き学習に基づく明示的なタスク特化型の設定	自然言語のシーケンスからタスクを学習 (ゼロショット学習)
対話の必要性	対話システムを通じてタスクを学習	対話なしで、インターネット上の情報を活用して学習
学習速度	明示的なタスク設定により高速	明示的な監督なしでは学習速度が遅くなる可能性

⑦上記⑥以外で表に出来そうな部分

項目	説明
言語モデリングの目的	シンボル列の条件付き確率を推定することで、次のシンボルを予測する
タスク指定の柔軟性	言語シーケンスを使ってタスクを柔軟に指定できる
マルチタスク学習	単一のモデルで複数のタスクを学習し実行可能

⑧具体的な数値の記載

該当する具体的な数値は記載されていない。

⑨具体的な変数名(数式用の記号)

変数名	説明
$ x_1, x_2, ..., x_n $	変数長のシンボル列
$ s_1, s_2, ..., s_n $	各シンボル
$ p(x) $	シンボル列 $ x $ の確率分布
$ p(s_n	s_1, ..., s_{n-1}) $
$ p(\text{output}	\text{input}, \text{task}) $

⑩図/表があれば、各図/表は何を主張するためのものか

このセクションには図や表の明記がない。

▼質問(人間)
続けて下記の質問に追加で回答下さい:
⑪関連研究の参照番号を全て列挙して、表にまとめて下さい。(元文にある内容のみ記載・類推して付け足しは不要、|参照番号|概要説明|の表へ、関連するもの同士でまとめて並べ替え)

▼生成AIの回答(ChatGPT-4o)
⑪関連研究の参照番号を全て列挙

参照番号	概要説明
Jelinek & Mercer, 1980	言語モデリングの条件付き確率推定に関する研究
Bengio et al., 2003	ニューラルネットワークを用いた言語モデリングの手法
Vaswani et al., 2017	Transformerアーキテクチャの提案、自己注意メカニズム
Kaiser et al., 2017	タスク特化型エンコーダー・デコーダーによるマルチタスク学習
Finn et al., 2017	MAML (Model-Agnostic Meta-Learning) フレームワークの提案
McCann et al., 2018	MQANモデルによるシンボル列の形式を使用したマルチタスク学習
Sutskever et al., 2015	密度推定を使用した学習目標に関する懸念の提起
Weston, 2016	対話システムでの自然言語から学習するモデルに関する研究

これらの研究が、言語モデリングやタスク条件付け、多様なタスク学習の分野における基盤となっている。

2.1. Training Dataset

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 2.1. Training Dataset Most prior work trained language models on a single do- main of text, such as news articles (Jozefowicz et al., 2016), Wikipedia (Merity et al., 2016), or fiction books (Kiros et al., 2015). Our approach motivates building as large and diverse a dataset as possible in order to collect natural lan- guage demonstrations of tasks in as varied of domains and contexts as possible. A promising source of diverse and nearly unlimited text is web scrapes such as Common Crawl. While these archives are many orders of magnitude larger than current language modeling datasets, they have significant data quality issues. Trinh & Le (2018) used Common Crawl in their work on commonsense reasoning but noted a large amount of doc- uments “whose content are mostly unintelligible”. We ob- served similar data issues in our initial experiments with Common Crawl. Trinh & Le (2018)’s best results were achieved using a small subsample of Common Crawl which included only documents most similar to their target dataset, the Winograd Schema Challenge. While this is a pragmatic approach to improve performance on a specific task, we want to avoid making assumptions about the tasks to be performed ahead of time. Instead, we created a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny. The resulting dataset, WebText, contains the text subset of these 45 million links. To extract the text from HTML responses we use a combination of the Dragnet (Peters & Lecocq, 2013) and Newspaper1 content extractors. All re- sults presented in this paper use a preliminary version of WebText which does not include links created after Dec 2017 and which after de-duplication and some heuristic based cleaning contains slightly over 8 million documents for a total of 40 GB of text. We removed all Wikipedia documents from WebText since it is a common data source for other datasets and could complicate analysis due to over- lapping training data with test evaluation tasks.

上記の内容に関して下記の質問に回答下さい: (である調で記載、質問に対して該当するものが無ければ無しと記載、対応する図/表番号があれば記載)
①何についての記載か? + 要旨は何? (要旨は箇条書きで記載)
②改良点・工夫点・テクニック等の記載があれば説明下さい。
③性能が向上した記載があれば説明下さい。(具体値があれば記載、対応する図/表番号があれば各文末に記載)
④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。
⑤具体的な処理方法の記載があれば説明下さい。(簡略化せずに全て記載、既存手法の適用であれば引用元を記載)
⑥比較の記載があれば違いを表でまとめて下さい。(対応する図/表番号があれば明記)
⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。(対応する図/表番号があれば記載)
⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。(|数値|説明|の表へ)
⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。(|変数名|説明|の表へ)
⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。(掲載理由・注目ポイント等)

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

記載内容: 言語モデルのトレーニングデータセットの作成方法に関する記載である。

要旨:

以前の研究では、ニュース記事、Wikipedia、フィクション書籍など、単一ドメインのテキストを使用して言語モデルを訓練していた。
より多様で大規模なデータセットを収集するために、ウェブスクレイピング（主にCommon Crawl）を活用したアプローチが取られる。
Common Crawlのデータは大規模であるが、質の低い文書が多く含まれている。
WebTextという新しいデータセットを作成するために、Redditの「3 karma」以上のリンクを基にしたスクレイピング手法が使用された。
WebTextは45百万件のリンクを含み、約8百万件の文書で40GBのテキストを含む。
Wikipediaの文書は、他のデータセットとの重複を避けるために除外された。

②改良点・工夫点・テクニック等の記載

従来のCommon Crawlの使用例（Trinh & Le, 2018）では、ターゲットデータセットに似たサンプルを選別して性能を向上させていたが、このアプローチは特定のタスクに依存していた。
WebTextは、事前にタスクの仮定を行わずに幅広いドメインの質の高い文書を含むように設計された。
データの質を向上させるために、Redditの「3 karma」以上のリンクを基準にして、人間がフィルタリングしたリンクのみをスクレイピングした。
HTMLからテキストを抽出するためにDragnetとNewspaper1という2つのコンテンツ抽出ツールを使用した。

③性能が向上した記載

Trinh & Le (2018) の研究ではCommon Crawlのデータを選別してWinograd Schema Challengeの性能を向上させたとされているが、具体的な数値は示されていない。

④メカニズムの解明・なぜそうなるのか等の記載

WebTextでは、Redditの「3 karma」以上のリンクを基にすることで、ユーザーにとって興味深い、教育的な、または面白いとされるコンテンツが含まれているため、より質の高いデータセットが得られるとされている。
また、Wikipediaを除外することで、既存のテスト評価タスクと重複するトレーニングデータを避けることができる。

⑤具体的な処理方法の記載

Common Crawlの質の低いデータ問題に対処するために、新しいウェブスクレイピング手法が採用されている。
Redditの「3 karma」以上のリンクに基づいてスクレイピングを行い、質の高い文書を取得する。
DragnetおよびNewspaper1のコンテンツ抽出ツールを使用して、HTMLレスポンスからテキストを抽出する。
2017年12月以降のリンクは含まず、重複排除とヒューリスティックに基づいたクリーニングを行った結果、最終的に40GBのテキストからなる約8百万件の文書が得られた。

⑥比較の記載

手法	データセットの構成方法	タスク依存性	結果
Trinh & Le (2018)	Common Crawlからのサブサンプル	Winograd Schema Challengeに依存	良好な結果を報告
WebText (本論文)	Redditの「3 karma」以上のリンクに基づくスクレイピング	特定タスクに依存しない	より多様で質の高いデータを使用

⑦上記⑥以外で表に出来そうな部分

データセット	文書数	サイズ	生成方法
WebText	8百万件	40GB	Redditの「3 karma」以上のリンクに基づいて生成

⑧具体的な数値の記載

数値	説明
45百万件	Redditリンクの総数
8百万件	WebTextの最終的な文書数
40GB	WebTextの最終的なサイズ
3 karma	Redditで使用したリンク選別基準

⑨具体的な変数名(数式用の記号)

なし

⑩図/表があれば、各図/表は何を主張するためのものか

なし

2.2. Input Representation

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 2.2. Input Representation A general language model (LM) should be able to compute the probability of (and also generate) any string. Current large scale LMs include pre-processing steps such as lower- casing, tokenization, and out-of-vocabulary tokens which restrict the space of model-able strings. While processing Unicode strings as a sequence of UTF-8 bytes elegantly ful- fills this requirement as exemplified in work such as Gillick et al. (2015), current byte-level LMs are not competitive with word-level LMs on large scale datasets such as the One Billion Word Benchmark (Al-Rfou et al., 2018). We observed a similar performance gap in our own attempts to train standard byte-level LMs on WebText. Byte Pair Encoding (BPE) (Sennrich et al., 2015) is a practical middle ground between character and word level language modeling which effectively interpolates between word level inputs for frequent symbol sequences and char- acter level inputs for infrequent symbol sequences. Despite its name, reference BPE implementations often operate on Unicode code points and not byte sequences. These imple- mentations would require including the full space of Uni- code symbols in order to model all Unicode strings. This would result in a base vocabulary of over 130,000 before any multi-symbol tokens are added. This is prohibitively large compared to the 32,000 to 64,000 token vocabularies often used with BPE. In contrast, a byte-level version of BPE only requires a base vocabulary of size 256. However, directly applying BPE to the byte sequence results in sub- optimal merges due to BPE using a greedy frequency based heuristic for building the token vocabulary. We observed BPE including many versions of common words like dog since they occur in many variations such as dog. dog! dog? . This results in a sub-optimal allocation of limited vocabulary slots and model capacity. To avoid this, we pre- vent BPE from merging across character categories for any byte sequence. We add an exception for spaces which sig- nificantly improves the compression efficiency while adding only minimal fragmentation of words across multiple vocab tokens. This input representation allows us to combine the empirical benefits of word-level LMs with the generality of byte-level approaches. Since our approach can assign a probability to any Unicode string, this allows us to evaluate our LMs on any dataset regardless of pre-processing, tokenization, or vocab size.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

記載内容: 言語モデルの入力表現に関する記載である。

要旨:

言語モデルは任意の文字列の確率を計算し、生成できるべきである。
現在の言語モデルでは、小文字化、トークナイズ、未知語トークンなどが事前処理に含まれ、モデリングできる文字列が制約されている。
バイトペアエンコーディング（BPE）は、頻繁に使用されるシンボルのシーケンスに対しては単語レベル、稀なシンボルには文字レベルの入力を使用する中間的なアプローチとして機能する。
既存のBPE実装では、Unicodeコードポイントに基づいているため、非常に大きな基本語彙サイズが必要であるが、バイトレベルのBPEでは語彙サイズが256に抑えられる。
BPEにより、頻出単語の異なるバリエーションが多く含まれ、語彙スロットやモデル容量が最適に活用されていないことが観察された。
本論文では、バイトシーケンスに対して文字カテゴリをまたぐマージを防ぐアプローチを採用し、空白に例外を設けることで効率的な圧縮を実現した。

②改良点・工夫点・テクニック等の記載

通常のBPEは、頻出単語（例: "dog", "dog!", "dog?"）の異なるバリエーションを語彙に多く含めてしまうため、語彙スロットやモデル容量の利用が非効率である。この問題を解決するために、文字カテゴリをまたぐBPEのマージを防止する工夫が施された。
空白に関しては例外を設け、圧縮効率を向上させる一方で、単語が複数の語彙トークンに分断されることを最小限に抑えた。

③性能が向上した記載

性能向上の具体的な数値や比較は記載されていないが、空白の扱いに対する例外の導入によって圧縮効率が向上し、語彙スロットの無駄な割り当てが改善されている。

④メカニズムの解明・なぜそうなるのか等の記載

通常のBPEでは頻繁に出現する単語の多様なバリエーション（例: "dog", "dog!", "dog?"）がすべて個別にトークン化されるため、無駄な語彙スロットの使用が発生する。これを防ぐために、BPEが文字カテゴリをまたいでマージすることを防ぎ、空白に関してのみ例外を設けることで、語彙スロットの最適な割り当てが可能となった。

⑤具体的な処理方法の記載

バイトレベルのBPEを使用し、基本語彙サイズを256に抑える。
BPEの適用に際して、文字カテゴリをまたぐマージを防ぐことで、無駄な語彙スロットの使用を削減する。
空白に関してのみ例外を設け、圧縮効率を向上させながら、単語の分断を最小限に抑える。

⑥比較の記載

モデル	語彙サイズ	処理方法	結果
通常のBPE	32,000〜64,000	文字カテゴリをまたいでマージ	非効率な語彙スロットの使用が観察
本論文のBPE	256	文字カテゴリをまたがずにマージ、空白に例外	圧縮効率が向上、語彙スロットの使用最適化

⑦上記⑥以外で表に出来そうな部分

語彙サイズ	圧縮効率	空白処理
32,000〜64,000	低い	なし
256	高い	あり（空白に例外）

⑧具体的な数値の記載

数値	説明
130,000	Unicodeシンボルを全てカバーするための語彙サイズ（通常のBPE）
32,000〜64,000	一般的なBPEの語彙サイズ
256	本論文のBPEでの基本語彙サイズ

⑨具体的な変数名(数式用の記号)

なし

⑩図/表があれば、各図/表は何を主張するためのものか

なし

2.3. Model

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 2.3. Model We use a Transformer (Vaswani et al., 2017) based archi- tecture for our LMs. The model largely follows the details of the OpenAI GPT model (Radford et al., 2018) with a few modifications. Layer normalization (Ba et al., 2016) was moved to the input of each sub-block, similar to a pre-activation residual network (He et al., 2016) and an additional layer normalization was added after the final self- attention block. A modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initial- ization by a factor of 1/√N where N is the number of residual layers. The vocabulary is expanded to 50,257. We also increase the context size from 512 to 1024 tokens and a larger batchsize of 512 is used.

Parameters Layers dmodel
117M 12 768
345M 24 1024
762M 36 1280
1542M 48 1600
Table 2. Architecture hyperparameters for the 4 model sizes.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

記載内容: 言語モデルに使用されるTransformerベースのアーキテクチャについての記載である。

要旨:

本モデルは、OpenAI GPTモデルの詳細に基づいており、いくつかの修正を加えている。
修正点として、Layer Normalizationを各サブブロックの入力に移動し、追加のLayer Normalizationを最終の自己注意ブロック後に追加している。
残差経路の累積を考慮した修正済みの初期化を使用し、残差層の重みを1/√N (Nは残差層の数)でスケーリングしている。
語彙サイズは50,257に拡張され、コンテキストサイズは512から1024トークンに増加している。
バッチサイズも512に増加している。

②改良点・工夫点・テクニック等の記載

Layer Normalizationを各サブブロックの入力に移動した。これは、He et al. (2016)のプレアクティベーション残差ネットワークに似た手法である。
残差経路の累積に対応するために、初期化を修正し、残差層の重みを1/√N（Nは残差層の数）でスケーリングする手法を導入している。
追加のLayer Normalizationを最終の自己注意ブロック後に追加し、モデルの安定性を向上させた。

③性能が向上した記載

コンテキストサイズを512から1024トークンに増やすことで、モデルがより広範な文脈を扱えるようになり、生成性能が向上した。
語彙サイズを50,257に拡張し、より広範な単語や記号に対応することで、言語モデリングの精度が向上している。

④メカニズムの解明・なぜそうなるのか等の記載

残差経路の重みをスケーリングする理由は、モデルが深くなるにつれて、残差経路に累積する情報が大きくなりすぎる問題を防ぐためである。この手法により、深層モデルでも安定した学習が可能となる。
Layer Normalizationを各サブブロックの入力に移動したことで、各層がより適切に正規化され、勾配消失の問題が軽減されている。

⑤具体的な処理方法の記載

残差層の初期化において、重みを1/√Nでスケーリングし、モデルが深くなっても残差経路の影響を適切に抑える。
コンテキストサイズを1024トークンに増加し、より長い文脈情報を扱うことができるようにした。
Layer Normalizationを各サブブロックの入力に移動し、最終的な自己注意ブロックの後にも追加している。

⑥比較の記載

モデル	Layer Normalizationの位置	残差経路の初期化	語彙サイズ	コンテキストサイズ	バッチサイズ
OpenAI GPT	サブブロック後	通常の初期化	50,257	512	512
本論文のモデル	サブブロック前および自己注意後	1/√Nでスケーリング	50,257	1024	512

⑦上記⑥以外で表に出来そうな部分

パラメータ	説明
1/√N	残差経路の初期化時のスケーリング係数
50,257	語彙サイズ
1024	コンテキストサイズ
512	バッチサイズ

⑧具体的な数値の記載

数値	説明
1/√N	残差経路の初期化時のスケーリング係数
50,257	語彙サイズ
1024	コンテキストサイズ
512	バッチサイズ
117M	パラメータ数 (12層, d=768)
345M	パラメータ数 (24層, d=1024)
762M	パラメータ数 (36層, d=1280)
1542M	パラメータ数 (48層, d=1600)

⑨具体的な変数名(数式用の記号)

変数名	説明
N	残差層の数
dmodel	各層のモデル次元数

⑩図/表があれば、各図/表は何を主張するためのものか

表2: モデルの4つのサイズにおけるハイパーパラメータの違いを示している。各モデルサイズに対する層数、次元数、およびパラメータ数の増加に伴う性能やメモリ消費のトレードオフを示すために用いられている。

3. Experiments

3.1. Language Modeling

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 3. Experiments We trained and benchmarked four LMs with approximately log-uniformly spaced sizes. The architectures are summa- rized in Table 2. The smallest model is equivalent to the original GPT, and the second smallest equivalent to the largest model from BERT (Devlin et al., 2018). Our largest model, which we call GPT-2, has over an order of magni- tude more parameters than GPT. The learning rate of each model was manually tuned for the best perplexity on a 5% held-out sample of WebText. All models still underfit Web- Text and held-out perplexity has as of yet improved given more training time. 3.1. Language Modeling As an initial step towards zero-shot task transfer, we are interested in understanding how WebText LM’s perform at zero-shot domain transfer on the primary task they are trained for – language modeling. Since our model operates on a byte level and does not require lossy pre-processing or tokenization, we can evaluate it on any language model benchmark. Results on language modeling datasets are commonly reported in a quantity which is a scaled or ex- ponentiated version of the average negative log probability per canonical prediction unit - usually a character, a byte, or a word. We evaluate the same quantity by computing the log-probability of a dataset according to a WebText LM and dividing by the number of canonical units. For many of these datasets, WebText LMs would be tested significantly out- of-distribution, having to predict aggressively standardized text, tokenization artifacts such as disconnected punctuation and contractions, shuffled sentences, and even the string which is extremely rare in WebText - occurring only 26 times in 40 billion bytes. We report our main re- sults in Table 3 using invertible de-tokenizers which remove as many of these tokenization / pre-processing artifacts as possible. Since these de-tokenizers are invertible, we can still calculate the log probability of a dataset and they can be thought of as a simple form of domain adaptation. We observe gains of 2.5 to 5 perplexity for GPT-2 with these de-tokenizers. WebText LMs transfer well across domains and datasets, improving the state of the art on 7 out of the 8 datasets in a zero-shot setting. Large improvements are noticed on small datasets such as Penn Treebank and WikiText-2 which have only 1 to 2 million training tokens. Large improvements are also noticed on datasets created to measure long-term dependencies like LAMBADA (Paperno et al., 2016) and the Children’s Book Test (Hill et al., 2015). Our model is still significantly worse than prior work on the One Billion Word Benchmark (Chelba et al., 2013). This is likely due to a combination of it being both the largest dataset and having some of the most destructive pre-processing - 1BW’s sentence level shuffling removes all long-range structure.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

記載内容: 言語モデル（LM）の訓練とベンチマーク、特にWebTextを用いたGPT-2モデルの性能評価についての記載である。

要旨:

4つの異なるサイズの言語モデルを訓練し、ベンチマークを行った。
モデルのサイズはlog-uniformに広がり、小さいものはGPTと同等、最大のものはGPT-2と呼ばれ、GPTよりもパラメータ数が大幅に多い。
モデルの学習率は、WebTextデータセットの5%の保持サンプルで最適なperplexityを達成するように手動で調整された。
言語モデリングのタスクで、zero-shotで異なるドメインへの転移性能を評価した。
WebText LMsは複数の言語モデリングデータセットに対して良好な転移性能を示し、特に小規模データセットでの改善が顕著であった。

②改良点・工夫点・テクニック等の記載

WebText LMはバイトレベルで動作し、ロスのある前処理やトークナイズを必要としない。この特性により、任意の言語モデルベンチマークで評価が可能となっている。
WebTextのデータセット内で発生する標準化されたテキストやトークナイズのアーティファクト（句読点の分離や文のシャッフルなど）を取り除くために、「逆変換可能なデトークナイザー」を使用した。

③性能が向上した記載

逆変換可能なデトークナイザーの使用により、GPT-2モデルで2.5から5のperplexityの改善が観察された【表3】。
WebText LMは、8つのデータセットのうち7つで、zero-shot設定において最先端の性能を達成した。
特にPenn TreebankやWikiText-2といった小規模データセットでは大きな改善が見られた【表3】。
LAMBADAやChildren’s Book Testのような長期依存関係を測定するデータセットでも顕著な改善が見られた【表3】。

④メカニズムの解明・なぜそうなるのか等の記載

WebText LMはバイトレベルで動作し、特定のトークナイゼーションに依存しないため、異なる言語モデリングデータセット間で柔軟な転移が可能である。これにより、異なるドメインのデータセットでも優れた性能を発揮する。
逆変換可能なデトークナイザーを使用することで、データセット間のトークナイズの違いを最小限に抑え、より一貫した評価が可能となった。

⑤具体的な処理方法の記載

各モデルは、WebTextの5%の保持サンプルに対するperplexityを基準に、学習率を手動で調整している。
言語モデリングの評価には、データセットのログ確率を計算し、データセットの標準的な単位で割るという手法を使用している。
逆変換可能なデトークナイザーを用いて、トークナイズのアーティファクトをできる限り除去し、ドメイン適応を簡素化している。

⑥比較の記載

データセット	WebText LMの性能 (zero-shot)	以前のSOTAとの比較	改善点
Penn Treebank	最先端を超える性能	以前のSOTAより高精度	大幅な改善
WikiText-2	最先端を超える性能	以前のSOTAより高精度	大幅な改善
LAMBADA	最先端を超える性能	以前のSOTAより高精度	大幅な改善
Children’s Book Test	最先端を超える性能	以前のSOTAより高精度	大幅な改善
One Billion Word Benchmark	以前のSOTAに劣る	パフォーマンスが低い	文シャッフルによる影響

⑦上記⑥以外で表に出来そうな部分

データセット名	特徴	トークナイズの影響
WebText	標準化されたテキスト	少ない影響
Penn Treebank	小規模データセット	大きな影響
WikiText-2	小規模データセット	大きな影響
LAMBADA	長期依存関係の評価	影響あり
Children’s Book Test	長期依存関係の評価	影響あり
One Billion Word Benchmark	大規模データセット	文のシャッフルによる影響

⑧具体的な数値の記載

数値	説明
2.5 - 5	GPT-2でのperplexityの改善幅
26	WebText内でが出現する回数（40億バイト中）
7/8	WebText LMがzero-shot設定で最先端の性能を超えたデータセット数
1-2百万	Penn TreebankやWikiText-2のトークン数
40億バイト	WebTextデータセットのサイズ

⑨具体的な変数名(数式用の記号)

変数名	説明
perplexity	言語モデルの評価指標、モデルがどれだけ混乱しているかを示す
log-probability	モデルがデータセットに対して与える対数確率
canonical unit	言語モデリングの標準単位（文字、バイト、単語など）

⑩図/表があれば、各図/表は何を主張するためのものか

表3: WebText LMが言語モデリングデータセットに対して達成したzero-shot設定でのperplexityの結果を示している。特に、逆変換可能なデトークナイザーを用いることで得られる性能向上が強調されており、2.5から5のperplexityの改善が観察されている。また、モデルが特定のドメインやデータセットに対してどれほど効果的に転移するかが評価されている。

3.2. Children’s Book Test

3.3. LAMBADA

3.4. Winograd Schema Challenge

3.5. Reading Comprehension

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 3.2. Children’s Book Test The Children’s Book Test (CBT) (Hill et al., 2015) was created to examine the performance of LMs on different cat- egories of words: named entities, nouns, verbs, and preposi- tions. Rather than reporting perplexity as an evaluation met- ric, CBT reports accuracy on an automatically constructed cloze test where the task is to predict which of 10 possible choices for an omitted word is correct. Following the LM approach introduced in the original paper, we compute the probability of each choice and the rest of the sentence con- ditioned on this choice according to the LM, and predict the one with the highest probability. As seen in Figure 2 performance steadily improves as model size is increased and closes the majority of the gap to human performance on this test. Data overlap analysis showed one of the CBT test set books, The Jungle Book by Rudyard Kipling, is in WebText, so we report results on the validation set which has no significant overlap. GPT-2 achieves new state of the art results of 93.3% on common nouns and 89.1% on named entities. A de-tokenizer was applied to remove PTB style tokenization artifacts from CBT. 3.3. LAMBADA The LAMBADA dataset (Paperno et al., 2016) tests the ability of systems to model long-range dependencies in text. The task is to predict the final word of sentences which require at least 50 tokens of context for a human to successfully predict. GPT-2 improves the state of the art from 99.8 (Grave et al., 2016) to 8.6 perplexity and increases the accuracy of LMs on this test from 19% (Dehghani et al., 2018) to 52.66%. Investigating GPT-2’s errors showed most predictions are valid continuations of the sentence, but are not valid final words. This suggests that the LM is not using the additional useful constraint that the word must be the final of the sentence. Adding a stop-word filter as an approximation to this further increases accuracy to 63.24%, improving the overall state of the art on this task by 4%. The previous state of the art (Hoang et al., 2018) used a different restricted prediction setting where the outputs of the model were constrained to only words that appeared in the context. For GPT-2, this restriction is harmful rather than helpful since 19% of answers are not in context. We use a version of the dataset without preprocessing. 3.4. Winograd Schema Challenge The Winograd Schema challenge (Levesque et al., 2012) was constructed to measure the capability of a system to perform commonsense reasoning by measuring its ability to resolve ambiguities in text. Recently Trinh & Le (2018) demonstrated significant progress on this challenge using LMs, by predicting the resolution of the ambiguity with higher probability. We follow their problem formulation and visualize the performance of our models with both full and partial scoring techniques in Figure 3. GPT-2 improves state of the art accuracy by 7%, achieving 70.70%. The dataset is quite small with only 273 examples so we recommend reading Trichelair et al. (2018) to help contextualize this result. 3.5. Reading Comprehension The Conversation Question Answering dataset (CoQA) Reddy et al. (2018) consists of documents from 7 different domains paired with natural language dialogues between a question asker and a question answerer about the document. CoQA tests reading comprehension capabilities and also the ability of models to answer questions that depend on conversation history (such as “Why?”). Greedy decoding from GPT-2 when conditioned on a doc- ument, the history of the associated conversation, and a final token A: achieves 55 F1 on the development set. This matches or exceeds the performance of 3 out of 4 base- line systems without using the 127,000+ manually collected question answer pairs those baselines were trained on. The supervised SOTA, a BERT based system (Devlin et al., 2018), is nearing the 89 F1 performance of humans. While GPT-2’s performance is exciting for a system without any su- pervised training, some inspection of its answers and errors suggests GPT-2 often uses simple retrieval based heuristics such as answer with a name from the document in response to a who question.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

何についての記載か: GPT-2の様々なベンチマークでの性能評価について。
要旨:
- Children’s Book Test (CBT): GPT-2の性能は、特に名詞と固有名詞での精度が高く、新しい最先端の結果を達成した。
- LAMBADA: 長距離依存性の予測に関して、GPT-2は従来の手法より大幅に性能が向上したが、文の最後の単語を予測する際の制約を考慮していない問題がある。
- Winograd Schema Challenge: GPT-2は常識推論タスクで7%の精度向上を達成し、70.70%の精度に到達した。
- Reading Comprehension (CoQA): GPT-2は未学習でのテストで、いくつかのベースラインシステムに匹敵する性能を示したが、単純な検索ベースのヒューリスティックに頼る傾向がある。

②改良点・工夫点・テクニック等の記載

Children’s Book Test (CBT): PTBスタイルのトークン化アーティファクトを取り除くためにデトークナイザーを使用した。
LAMBADA: 文の最後の単語を予測する際に、追加の制約（ストップワードフィルター）を導入し、精度を向上させた。
Winograd Schema Challenge: GPT-2の性能向上は、部分得点と完全得点の両方で視覚化された。
CoQA: 未学習のGPT-2が、質問に応じて文書と会話の履歴に基づいて回答を生成する手法を採用。

③性能が向上した記載

Children’s Book Test (CBT): GPT-2は、共通名詞で93.3%、固有名詞で89.1%の精度を達成した。
LAMBADA: GPT-2は、従来の99.8から8.6へのパープレキシティの大幅な向上と、精度の19%から52.66%への改善を達成した。さらにストップワードフィルターを適用することで、63.24%の精度に向上した。
Winograd Schema Challenge: GPT-2は、精度が70.70%に達し、従来の手法を7%上回った。
Reading Comprehension (CoQA): GPT-2はF1スコア55を達成し、いくつかのベースラインシステムを超える性能を示した。

④メカニズムの解明・なぜそうなるのか等の記載

LAMBADA: GPT-2のエラー解析によって、多くの予測が文の妥当な続きであるが、文の最後の単語としては妥当ではないことが判明した。この問題を解決するために、ストップワードフィルターが導入され、精度が向上した。

⑤具体的な処理方法の記載

Children’s Book Test (CBT): 各選択肢に対する残りの文をGPT-2で条件付けて確率を計算し、最も高い確率の選択肢を予測した。
LAMBADA: 文の最後の単語を予測する際に、少なくとも50トークンの文脈が必要であることを考慮し、ストップワードフィルターを使用してさらなる精度向上を図った。
Winograd Schema Challenge: トレーニングされたGPT-2モデルを使用し、部分得点と完全得点の両方でタスクを評価した。
Reading Comprehension (CoQA): GPT-2は、文書と会話履歴に基づいて回答を生成し、Greedy decodingを用いてF1スコアを計算した。

⑥比較の記載

ベンチマーク	GPT-2の結果	従来の結果	精度の向上幅
CBT (共通名詞)	93.3%	-	新しい最先端結果
CBT (固有名詞)	89.1%	-	新しい最先端結果
LAMBADA	52.66%	19%	+33.66%
LAMBADA (ストップワードフィルター適用)	63.24%	-	+4%
Winograd Schema Challenge	70.70%	-	+7%
CoQA (F1スコア)	55	-	-

⑦上記⑥以外で表に出来そうな部分

表に出来そうな部分は特に無し。

⑧具体的な数値の記載

数値	説明
93.3%	CBTにおける共通名詞の精度
89.1%	CBTにおける固有名詞の精度
52.66%	LAMBADAの精度
63.24%	LAMBADAにストップワードフィルターを適用した場合の精度
70.70%	Winograd Schema Challengeにおける精度
55	CoQAにおけるF1スコア
99.8 → 8.6	LAMBADAにおけるパープレキシティの改善

⑨具体的な変数名(数式用の記号)

具体的な変数名の記載無し。

⑩図/表があれば、各図/表は何を主張するためのものか

Figure 2: CBTにおけるモデルサイズの増加による性能向上を視覚化したもので、人間のパフォーマンスに近づく様子を示す。
Figure 3: Winograd Schema Challengeにおけるモデルの部分得点と完全得点の視覚化を示し、GPT-2の性能が向上していることを示す。

長めの文章に対して、①～⑩の多くの質問をしたため、1つ1つの回答が非常に端的なものになってしまっている。

言語データセットの性質や言語の評価指標について知っていないと、内容がよく分からない。論文の本文でも同様に書かれてある。

3.6. Summarization

3.7. Translation

3.8. Question Answering

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 3.6. Summarization We test GPT-2’s ability to perform summarization on the CNN and Daily Mail dataset (Nallapati et al., 2016). To in- duce summarization behavior we add the text TL;DR: after the article and generate 100 tokens with Top-k random sam- pling (Fan et al., 2018) with k = 2 which reduces repetition and encourages more abstractive summaries than greedy de- coding. We use the first 3 generated sentences in these 100 tokens as the summary. While qualitatively the generations resemble summaries, as shown in Table 14, they often focus on recent content from the article or confuse specific details such as how many cars were involved in a crash or whether a logo was on a hat or shirt. On the commonly reported ROUGE 1,2,L metrics the generated summaries only begin to approach the performance of classic neural baselines and just barely outperforms selecting 3 random sentences from the article. GPT-2’s performance drops by 6.4 points on the aggregate metric when the task hint is removed which demonstrates the ability to invoke task specific behavior in a language model with natural language. 3.7. Translation We test whether GPT-2 has begun to learn how to translate from one language to another. In order to help it infer that this is the desired task, we condition the language model on a context of example pairs of the format english sentence = french sentence and then after a fi- nal prompt of english sentence = we sample from the model with greedy decoding and use the first generated sentence as the translation. On the WMT-14 English-French test set, GPT-2 gets 5 BLEU, which is slightly worse than a word-by-word substitution with a bilingual lexicon in- ferred in previous work on unsupervised word translation (Conneau et al., 2017b). On the WMT-14 French-English test set, GPT-2 is able to leverage its very strong English language model to perform significantly better, achieving 11.5 BLEU. This outperforms several unsupervised machine translation baselines from (Artetxe et al., 2017) and (Lample et al., 2017) but is still much worse than the 33.5 BLEU of the current best unsupervised machine translation approach (Artetxe et al., 2019). Performance on this task was sur- prising to us, since we deliberately removed non-English webpages from WebText as a filtering step. In order to con- firm this, we ran a byte-level language detector2 on WebText which detected only 10MB of data in the French language which is approximately 500x smaller than the monolingual French corpus common in prior unsupervised machine trans- lation research. 3.8. Question Answering A potential way to test what information is contained within a language model is to evaluate how often it generates the correct answer to factoid-style questions. Previous showcas- ing of this behavior in neural systems where all information is stored in parameters such as A Neural Conversational Model (Vinyals & Le, 2015) reported qualitative results due to the lack of high-quality evaluation datasets. The recently introduced Natural Questions dataset (Kwiatkowski et al., 2019) is a promising resource to test this more quantita- tively. Similar to translation, the context of the language model is seeded with example question answer pairs which helps the model infer the short answer style of the dataset. GPT-2 answers 4.1% of questions correctly when evalu- ated by the exact match metric commonly used on reading comprehension datasets like SQUAD.3 As a comparison point, the smallest model does not exceed the 1.0% accu- racy of an incredibly simple baseline which returns the most common answer for each question type (who, what, where, etc...). GPT-2 answers 5.3 times more questions correctly, suggesting that model capacity has been a major factor in the poor performance of neural systems on this kind of task as of yet. The probability GPT-2 assigns to its generated answers is well calibrated and GPT-2 has an accuracy of 63.1% on the 1% of questions it is most confident in. The 30 most confident answers generated by GPT-2 on develop- ment set questions are shown in Table 5. The performance of GPT-2 is still much, much, worse than the 30 to 50% range of open domain question answering systems which hybridize information retrieval with extractive document question answering (Alberti et al., 2019).

▼生成AIの回答(ChatGPT-4o mini)
①何についての記載か? + 要旨は何?

何についての記載か: GPT-2の要約、翻訳、質問応答に関する性能評価について。
要旨:
- 要約: CNNおよびDaily Mailデータセットにおいて、GPT-2は抽象的な要約を生成する能力を示すが、特定の詳細の混乱が見られる。ROUGEメトリクスでは古典的なニューラルベースラインに近づくものであった。
- 翻訳: 英語からフランス語、フランス語から英語への翻訳タスクにおいて、GPT-2はBLEUスコアを達成したが、従来の無監視機械翻訳手法には及ばなかった。
- 質問応答: GPT-2は自然言語の質問に対して4.1%の正答率を示し、モデルのキャパシティが重要な要因であることを示唆しているが、オープンドメイン質問応答システムに比べてパフォーマンスは劣る。

②改良点・工夫点・テクニック等の記載

要約: TL;DR:というテキストを加えることで要約の生成を促進し、Top-kランダムサンプリングを使用して重複を減少させ、より抽象的な要約を生成した。
翻訳: 翻訳タスクにおいて、例文ペアを用いて文脈を設定し、最後のプロンプトを使用して生成された文を翻訳として用いた。
質問応答: 自然言語の質問応答ペアで文脈を与えることにより、短い回答スタイルを推論させる手法を採用した。

③性能が向上した記載

要約: GPT-2の生成する要約は、選択された3つの文を用いる場合、ランダムに選んだ文を超えるわずかな改善を示したが、定量的には古典的なニューラルベースラインには及ばなかった。
翻訳: WMT-14英仏テストセットでGPT-2は5 BLEUを達成し、WMT-14仏英テストセットで11.5 BLEUを達成した。この結果は複数の無監視機械翻訳のベースラインを上回ったが、最先端の33.5 BLEUには劣った。
質問応答: GPT-2は4.1%の正答率を示し、最小モデルの1.0%に対し5.3倍の正答を達成した。

④メカニズムの解明・なぜそうなるのか等の記載

要約: TL;DR:の追加は、要約の生成を誘発する役割を果たしており、これによりGPT-2はより要約に特化した生成を行った。
翻訳: GPT-2は英語の強力な言語モデルを利用することで、フランス語から英語への翻訳性能が向上した。
質問応答: 質問応答でのモデルの正答率は、モデルのキャパシティに大きく依存している。

⑤具体的な処理方法の記載

要約: CNNおよびDaily Mailデータセットにおいて、記事の後にTL;DR:を追加し、Top-kランダムサンプリング（k=2）を使用して100トークンを生成し、最初の3文を要約として選択した。
翻訳: 例文ペア（英語文=フランス文）で文脈を設定し、最終プロンプトとして英語文=を用いてモデルからサンプリングし、生成された最初の文を翻訳として使用した。
質問応答: 自然言語の質問応答ペアで文脈を与え、短い回答形式での推論を促すようにした。

⑥比較の記載

タスク	GPT-2の結果	従来の結果	精度の向上幅
要約	-	古典的なニューラルベースライン	わずかに改善
翻訳 (英仏)	5 BLEU	-	-
翻訳 (仏英)	11.5 BLEU	-	-
質問応答	4.1%	1.0%	+3.1%

⑦上記⑥以外で表に出来そうな部分

表に出来そうな部分は特に無し。

⑧具体的な数値の記載

数値	説明
5 BLEU	WMT-14英仏テストセットにおけるGPT-2のスコア
11.5 BLEU	WMT-14仏英テストセットにおけるGPT-2のスコア
4.1%	質問応答におけるGPT-2の正答率
1.0%	最小モデルの正答率

⑨具体的な変数名(数式用の記号)

具体的な変数名の記載無し。

⑩図/表があれば、各図/表は何を主張するためのものか

Table 14: 要約の品質を示すもので、生成された要約が特定の詳細を混乱させる傾向があることを示している。
Table 5: GPT-2の質問応答における最も自信のある30の回答を示し、モデルの応答の信頼性を評価するためのもの。

4. Generalization vs Memorization

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 4. Generalization vs Memorization Recent work in computer vision has shown that common im- age datasets contain a non-trivial amount of near-duplicate images. For instance CIFAR-10 has 3.3% overlap between train and test images (Barz & Denzler, 2019). This results in an over-reporting of the generalization performance of ma- chine learning systems. As the size of datasets increases this issue becomes increasingly likely which suggests a similar phenomena could be happening with WebText. Therefore it is important to analyze how much test data also shows up in the training data. To study this we created Bloom filters containing 8-grams of WebText training set tokens. To improve recall, strings were normalized to contain only lower-cased alphanumeric words with a single space as a delimiter. The Bloom filters were constructed such that the false positive rate is upper bounded by 1 108 . We further verified the low false positive rate by generating 1M strings, of which zero were found by the filter. These Bloom filters let us calculate, given a dataset, the percentage of 8-grams from that dataset that are also found in the WebText training set. Table 6 shows this overlap anal- ysis for the test sets of common LM benchmarks. Common LM datasets’ test sets have between 1-6% overlap with Web- Text train, with an average of overlap of 3.2%. Somewhat surprisingly, many datasets have larger overlaps with their own training splits, with an average of 5.9% overlap. Our approach optimizes for recall, and while manual inspec- tion of the overlaps shows many common phrases, there are many longer matches that are due to duplicated data. This is not unique to WebText. For instance, we discovered that the test set of WikiText-103 has an article which is also in the training dataset. Since there are only 60 articles in the test set there is at least an overlap of 1.6%.4 Potentially more worryingly, 1BW has an overlap of nearly 13.2% with its own training set according to our procedure. For the Winograd Schema Challenge, we found only 10 schemata which had any 8-gram overlaps with the WebText training set. Of these, 2 were spurious matches. Of the remaining 8, only 1 schema appeared in any contexts that gave away the answer. For CoQA, about 15% of documents in the news domain are already in WebText and the model performs about 3 F1 better on these. CoQA’s development set metric reports the average performance over 5 different domains and we measure a gain of about 0.5-1.0 F1 due to overlap across the various domains. However, no actual training questions or answers are in WebText since CoQA was released after the cutoff date for links in WebText. On LAMBADA, the average overlap is 1.2%. GPT-2 per- forms about 2 perplexity better on examples with greater than 15% overlap. Recalculating metrics when excluding all examples with any overlap shifts results from 8.6 to 8.7 perplexity and reduces accuracy from 63.2% to 62.9%. This very small change in overall results is likely due to only 1 in 200 examples having significant overlap. Overall, our analysis suggests that data overlap between WebText training data and specific evaluation datasets pro- vides a small but consistent benefit to reported results. How- ever, for most datasets we do not notice significantly larger overlaps than those already existing between standard train- ing and test sets, as Table 6 highlights. Understanding and quantifying how highly similar text im- pacts performance is an important research question. Better de-duplication techniques such as scalable fuzzy matching could also help better answer these questions. For now, we recommend the use of n-gram overlap based de-duplication as an important verification step and sanity check during the creation of training and test splits for new NLP datasets. Another potential way of determining whether the perfor- mance of WebText LMs is attributable to memorization is inspecting their performance on their own held-out set. As shown in Figure 4, performance on both the training and test sets of WebText are similar and improve together as model size is increased. This suggests even GPT-2 is still underfitting on WebText in many ways. GPT-2 is also able to write news articles about the discovery of talking unicorns. An example is provided in Table 13.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

記載内容: 一般化と記憶化に関する議論。WebTextのトレーニングデータとテストデータの重複や、他の一般的な言語モデル（LM）ベンチマークデータセットとの重複を分析し、これがパフォーマンスに与える影響を検討している。
要旨:
- 画像データセットにおけるトレーニングとテストデータの重複問題が言語データにも存在する可能性があるため、WebTextと他のLMベンチマークデータセット間の重複を分析。
- Bloomフィルターを使用して8-gramの重複を分析し、約1-6%の重複が見られるが、これは他の標準的なデータセットのトレーニングとテストセット間の重複と大差ない。
- 例として、CoQAのニュースドメインでは約15%のドキュメントがWebTextに含まれており、GPT-2はこれらに対してF1スコアが約3ポイント向上。
- 重複が少ない場合でも、GPT-2はモデルサイズの増加とともにトレーニングセットとテストセット両方で性能が向上する。

②改良点・工夫点・テクニック等の記載

Bloomフィルターを使用して8-gramの重複を効率的に検出し、メモリの効率性と精度のバランスを取っている。これにより、低い偽陽性率（1/10^8）を実現している。
重複検出のために、文字列を正規化してアルファベット小文字と数字のみを使用し、スペースで区切る処理を行った。

③性能が向上した記載

CoQAのニュースドメインでは、ドキュメントの15%がWebTextに含まれており、これによりF1スコアが約3ポイント向上している。
LAMBADAデータセットでは、重複が15%以上ある場合、GPT-2は2パープレキシティ分の性能向上を見せた。

④メカニズムの解明・なぜそうなるのか等の記載

トレーニングデータとテストデータの重複がパフォーマンス向上に寄与する可能性が示唆されている。特に、CoQAやLAMBADAのような重複が認められるデータセットにおいて、パフォーマンスが向上することが確認されている。

⑤具体的な処理方法の記載

WebTextトレーニングデータとLMベンチマークのテストセット間の重複を分析するために、Bloomフィルターが使用された。8-gramの重複検出のために、文字列を小文字のアルファベットと数字に正規化し、単一スペースで区切った文字列を使用している。
Bloomフィルターは、1/10^8の偽陽性率を上限とするよう設計され、検証のために1Mの文字列を生成し、全てがフィルターによって検出されないことを確認している。

参考: Bloomフィルター

▼生成AIの回答(ChatGPT-4o)
⑥比較の記載があれば違いを表でまとめて下さい。

データセット	WebTextとの重複率	重複が性能に与える影響	図/表番号
CoQA	15%	F1スコアが約3ポイント向上	記載なし
LAMBADA	1.2%	2パープレキシティ分の性能向上	記載なし
WikiText-103	1.6%	記載なし	記載なし
1BW	13.2%	記載なし	記載なし
その他のLMベンチマーク	平均3.2%	記載なし	表6

⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。

データセット	トレーニングセット内の重複率	WebTextとの重複率	図/表番号
Common LMデータセット	平均5.9%	平均3.2%	表6
WikiText-103	1.6%	1.6%	記載なし
1BW	13.2%	13.2%	記載なし

⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。

数値	説明
3.3%	CIFAR-10のトレーニングとテストの画像の重複率
1/10^8	Bloomフィルターの偽陽性率
1.6%	WikiText-103のトレーニングとテストの重複率
13.2%	1BWのトレーニングとテストの重複率
15%	CoQAのニュースドメインの重複率
3 F1	CoQAの重複によるF1スコアの向上
0.5-1.0 F1	CoQAの他のドメインにおける重複による向上
1.2%	LAMBADAの重複率
2	LAMBADAでのパープレキシティの向上
8.6 → 8.7	重複除外後のパープレキシティの変化
63.2% → 62.9%	重複除外後の正確性の変化

⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。

変数名	説明
F1	CoQAにおける性能指標
パープレキシティ	LAMBADAにおける言語モデルの評価指標
Bloomフィルター	重複検出のために使用されるデータ構造

⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。

表6: WebTextトレーニングデータとLMベンチマークテストセット間の8-gram重複率を示しており、重複が一般的なトレーニングとテストセット間の重複率と同程度であることを主張している。
図4: WebTextのトレーニングセットとテストセットにおけるGPT-2の性能の変化を示し、モデルサイズの増加に伴って両セットで性能が向上することを示している。

「⑥比較の記載」「⑧具体的な数値の記載」等の数値の抜き出しは、全て正しい。良好。

5. Related Work

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 5. Related Work A significant portion of this work measured the performance of larger language models trained on larger datasets. This is similar to the work of Jozefowicz et al. (2016) which scaled RNN based language models on the 1 Billion Word Benchmark. Bajgar et al. (2016) also previously improved results on the Children’s Book Test by creating a much larger training dataset out of Project Gutenberg to supplement the standard training dataset. Hestness et al. (2017) conducted a thorough analysis of how the performance of various deep learning models changes as a function of both model capac- ity and dataset size. Our experiments, while much noisier across tasks, suggest similar trends hold for sub-tasks of an objective and continue into the 1B+ parameter regime. Interesting learned functionality in generative models has been documented before such as the cells in an RNN language model performing line-width tracking and quote/comment detection Karpathy et al. (2015). More in- spirational to our work was the observation of Liu et al. (2018) that a model trained to generate Wikipedia articles also learned to translate names between languages. Previous work has explored alternative approaches to filter- ing and constructing a large text corpus of web pages, such as the iWeb Corpus (Davies, 2018). There has been extensive work on pre-training methods for language tasks. In addition to those mentioned in the introduction, GloVe (Pennington et al., 2014) scaled word vector representation learning to all of Common Crawl. An influential early work on deep representation learning for text was Skip-thought Vectors (Kiros et al., 2015). McCann et al. (2017) explored the use of representations derived from machine translation models and Howard & Ruder (2018) improved the RNN based fine-tuning approaches of (Dai & Le, 2015). (Conneau et al., 2017a) studied the transfer performance of representations learned by natural language inference models and (Subramanian et al., 2018) explored large-scale multitask training. (Ramachandran et al., 2016) demonstrated that seq2seq mod- els benefit from being initialized with pre-trained language models as encoders and decoders. More recent work has shown that LM pre-training is helpful when fine-tuned for difficult generation tasks like chit-chat dialog and dialog based question answering systems as well (Wolf et al., 2019) (Dinan et al., 2018).

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

何についての記載か
この節では、以前の研究との関連性について説明している。特に、大規模な言語モデルのトレーニングやデータセットのスケーリングに関する過去の研究に焦点を当てている。

要旨

Jozefowiczら (2016) やBajgarら (2016) は、大規模なデータセットを使ってRNNベースのモデルやChildren's Book Testの結果を改善した。
Hestnessら (2017) は、モデルの容量とデータセットサイズが性能に与える影響を分析し、本研究の結果も類似の傾向を示している。
Karpathyら (2015) は、生成モデルが学習する興味深い機能について記録しており、Liuら (2018) も同様の現象を観察した。
Davies (2018) のiWeb Corpusなど、大規模なテキストコーパスの構築やフィルタリングに関する研究が行われている。
言語タスクにおける事前学習方法に関する多くの研究があり、GloVeやSkip-thought Vectorsなどが例として挙げられている。
Ramachandranら (2016) やWolfら (2019) は、事前学習した言語モデルを使ったシーケンス生成や対話システムでの改善を示した。

②改良点・工夫点・テクニック等の記載があれば説明下さい。
以前の研究と同様に、大規模なデータセットを使用してモデルの性能を向上させるアプローチが取られている。また、生成モデルが学習する機能（例: 翻訳能力や行幅追跡）に関する観察が、従来の研究と比較して新たなインスピレーションを与えている。さらに、事前学習モデルを使ったフィルタリングや多タスク学習により、異なる領域間の知識転移が可能になることが示されている。

③性能が向上した記載があれば説明下さい。
具体的な性能向上については記載がないが、過去の研究に基づくと、大規模なデータセットや事前学習された言語モデルを使用することで、さまざまなタスクの性能が向上することが確認されている。特に、Wolfら (2019) の研究では、対話システムや質問応答システムにおいて事前学習されたモデルが効果を発揮していることが述べられている。

④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。
生成モデルが特定の機能（例: 翻訳や行幅追跡）を学習するメカニズムについては、Liuら (2018) とKarpathyら (2015) の研究が示唆している。モデルが大量のデータに触れることで、特定のタスクに特化した機能を自動的に学習する可能性がある。これは、データの多様性とモデルのキャパシティが、隠れたパターンを捉える能力を向上させるためと考えられる。

⑤具体的な処理方法の記載があれば説明下さい。
処理方法としては、主に大規模なデータセットを使用してモデルを事前学習する手法が強調されている。また、Seq2Seqモデルにおいて、事前学習された言語モデルをエンコーダーやデコーダーとして使用するアプローチも紹介されている (Ramachandranら, 2016)。具体的なアルゴリズムやデータの処理手法については言及がない。

⑥比較の記載があれば違いを表でまとめて下さい。

研究者	アプローチ	改善されたタスク	図/表番号
Jozefowiczら (2016)	RNNベースの言語モデルを1 Billion Word Benchmarkにスケール	言語モデリング	記載なし
Bajgarら (2016)	GutenbergプロジェクトのデータセットでChildren’s Book Testを改善	質問応答	記載なし
Liuら (2018)	Wikipedia生成モデルで名前の翻訳能力を学習	翻訳	記載なし
Ramachandranら (2016)	Seq2Seqモデルに事前学習されたLMを使用	シーケンス生成	記載なし
Wolfら (2019)	対話システムで事前学習モデルを使用	対話型QA	記載なし

⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。

研究者	アプローチ	データセット	図/表番号
Davies (2018)	iWeb Corpusでのテキストフィルタリング	Webページ	記載なし
Penningtonら (2014)	GloVeでの語彙ベクトル学習	Common Crawl	記載なし
Kirosら (2015)	Skip-thought Vectorsでテキスト表現学習	記載なし	記載なし
Conneauら (2017a)	NLIモデルでの転移学習	記載なし	記載なし

⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。

数値	説明
1 Billion	Jozefowiczらが使用した1 Billion Word Benchmark
2016	BajgarらがGutenbergプロジェクトで使用した年
2017	Hestnessらが様々なモデルの性能を分析した年
2018	LiuらがWikipedia生成モデルでの翻訳機能を観察した年

⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。

変数名	説明
LM	言語モデル (Language Model)
RNN	再帰型ニューラルネットワーク (Recurrent Neural Network)
Seq2Seq	シーケンス間の変換モデル (Sequence to Sequence Model)

⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。

この節では図や表に関する言及がなく、該当する図/表は存在しない。

▼生成AIの回答(ChatGPT-4o)
⑪関連研究の参照番号を全て列挙して、表にまとめて下さい。

参照番号	概要説明
Jozefowicz et al. (2016)	RNNベースの言語モデルを1 Billion Word Benchmarkでスケールさせた研究
Bajgar et al. (2016)	Project Gutenbergのデータセットを使ってChildren's Book Testの結果を改善した研究
Hestness et al. (2017)	モデル容量とデータセットサイズが性能に与える影響を分析した研究
Karpathy et al. (2015)	RNNモデルが行幅追跡や引用/コメント検出といった興味深い機能を学習する現象を観察した研究
Liu et al. (2018)	Wikipedia生成モデルが名前を言語間で翻訳する能力を学習する現象を観察した研究
Davies (2018)	iWeb Corpusを使ったWebページのフィルタリングおよびテキストコーパスの構築に関する研究
Pennington et al. (2014)	GloVeを用いてCommon Crawl全体で語彙ベクトル表現を学習した研究
Kiros et al. (2015)	Skip-thought Vectorsでのテキスト表現学習に関する初期の研究
McCann et al. (2017)	機械翻訳モデルから導出された表現の使用を探究した研究
Howard & Ruder (2018)	RNNベースのファインチューニングアプローチの改善に関する研究
Conneau et al. (2017a)	自然言語推論 (NLI) モデルで学習された表現の転移性能を研究
Subramanian et al. (2018)	大規模マルチタスク学習に関する研究
Ramachandran et al. (2016)	Seq2Seqモデルのエンコーダーおよびデコーダーとして事前学習された言語モデルを使うアプローチを提案した研究
Wolf et al. (2019)	対話システムや質問応答システムにおいて、事前学習された言語モデルを使うことで性能が向上したことを示した研究
Dinan et al. (2018)	チャット型対話や質問応答システムにおける言語モデルの事前学習の有効性を示した研究

6. Discussion

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 6. Discussion Much research has been dedicated to learning (Hill et al., 2016), understanding (Levy & Goldberg, 2014), and criti- cally evaluating (Wieting & Kiela, 2019) the representations of both supervised and unsupervised pre-training methods. Our results suggest that unsupervised task learning is an additional promising area of research to explore. These findings potentially help explain the widespread success of pre-training techniques for down-stream NLP tasks as we show that, in the limit, one of these pre-training techniques begins to learn to perform tasks directly without the need for supervised adaption or modification. On reading comprehension the performance of GPT-2 is competitive with supervised baselines in a zero-shot setting. However, on other tasks such as summarization, while it is qualitatively performing the task, its performance is still only rudimentary according to quantitative metrics. While suggestive as a research result, in terms of practical applica- tions, the zero-shot performance of GPT-2 is still far from use-able. We have studied the zero-shot performance of WebText LMs on many canonical NLP tasks, but there are many addi- tional tasks that could be evaluated. There are undoubtedly many practical tasks where the performance of GPT-2 is still no better than random. Even on common tasks that we evaluated on, such as question answering and translation, language models only begin to outperform trivial baselines when they have sufficient capacity. While zero-shot performance establishes a baseline of the potential performance of GPT-2 on many tasks, it is not clear where the ceiling is with finetuning. On some tasks, GPT-2’s fully abstractive output is a significant departure from the extractive pointer network (Vinyals et al., 2015) based outputs which are currently state of the art on many question answering and reading comprehension datasets. Given the prior success of fine-tuning GPT, we plan to in- vestigate fine-tuning on benchmarks such as decaNLP and GLUE, especially since it is unclear whether the additional training data and capacity of GPT-2 is sufficient to over- come the inefficiencies of uni-directional representations demonstrated by BERT (Devlin et al., 2018).

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

本節は、事前学習とゼロショット学習に関する議論である。主にGPT-2の性能に焦点を当て、自然言語処理（NLP）タスクにおけるモデルの可能性と限界を論じている。

事前学習技術の成功は、タスク自体を学習することによる可能性がある。
GPT-2は、特定のタスクにおいてゼロショット設定での性能が良好であるが、他のタスクではまだ基本的な性能にとどまっている。
ゼロショットでの限界はあるが、ファインチューニングによる改善の可能性がある。
GPT-2は、いくつかのNLPタスクでトリビアルなベースラインを上回るが、他の多くのタスクでは性能が低いままである。
GPT-2の能力はBERTのような双方向モデルに対抗できるかどうかは不明であり、今後の調査が必要である。

② 改良点・工夫点・テクニック等の記載

ゼロショット学習によってGPT-2がいくつかのタスクにおいてベースラインに匹敵するパフォーマンスを発揮する点は、事前学習技術の大きな進展として注目される。しかし、タスクによっては完全な性能を発揮するためにファインチューニングが依然として必要である。

③ 性能が向上した記載

読解タスクにおいて、GPT-2はゼロショット設定で教師ありベースラインと競合する性能を発揮している。
他のタスク（例: 要約）では、定性的にタスクを遂行しているが、定量的なメトリクスではまだ基本的な性能にとどまっている。

④ メカニズムの解明・なぜそうなるのか等の記載

事前学習により、モデルがタスク自体を学習する可能性があることが示唆されている。このことが、事前学習技術がNLPタスクにおいて広く成功している理由の一つであると考えられる。
GPT-2のゼロショット性能がタスクによって大きく異なるのは、タスクの複雑さやモデルの容量に依存するためである。

⑤ 具体的な処理方法の記載

WebText LMsを使用してゼロショット学習を行い、いくつかの標準的なNLPタスクで性能を評価した。タスクには、質問応答、翻訳、要約などが含まれている。
GPT-2のゼロショット学習の性能は、十分な容量を持つ言語モデルがタスクを遂行できるかどうかに依存する。

⑥ 比較の記載

比較項目	GPT-2 ゼロショット性能	ファインチューニングの必要性	現状の限界
読解タスク	教師ありベースラインと競合	一部のタスクでは不要	いくつかのタスクでは限界あり
要約タスク	定性的に遂行しているが、定量的には不十分	要約タスクの性能向上にはファインチューニングが必要	基本的なレベルの性能にとどまる
質問応答・翻訳	ベースラインをわずかに上回る性能を発揮している	ベースラインを超えるにはモデルの容量が重要	十分な容量がなければトリビアルな性能に留まる

⑦ 上記⑥以外で表に出来そうな部分

課題	ゼロショット性能	ファインチューニングの必要性
GPT-2 の抽象的出力	抽象的なゼロショット出力が、抽出型出力と異なる	GPT-2の出力はまだ改善が必要
BERTとの比較	GPT-2は一方向モデルであり、BERTは双方向モデルを採用	GPT-2の限界を打破するために双方向アプローチが必要か不明

⑧ 具体的な数値の記載

数値	説明
1B+	GPT-2のパラメータ数
ベースライン	ゼロショットで競合する教師ありベースライン性能
GPT-2容量	モデルの容量が十分でないとタスク性能はトリビアルに留まる

⑨ 具体的な変数名(数式用の記号)

本文には具体的な数式や変数名の記載はない。

⑩ 図/表があれば、各図/表は何を主張するためのものか

該当無し

7. Conclusion

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 7. Conclusion When a large language model is trained on a sufficiently large and diverse dataset it is able to perform well across many domains and datasets. GPT-2 zero-shots to state of the art performance on 7 out of 8 tested language model- ing datasets. The diversity of tasks the model is able to perform in a zero-shot setting suggests that high-capacity models trained to maximize the likelihood of a sufficiently varied text corpus begin to learn how to perform a surprising amount of tasks without the need for explicit supervision.5

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

本節は、GPT-2の大規模言語モデルにおけるゼロショット学習の総括である。モデルが多様なタスクにおいて明示的な教師なしで良好な性能を発揮する点が強調されている。

大規模かつ多様なデータセットで訓練された言語モデルは、複数のドメインやデータセットで優れた性能を発揮する。
GPT-2は、テストされた8つの言語モデリングデータセットのうち7つでゼロショットの最先端性能を達成した。
高容量モデルが多様なタスクをゼロショットで遂行できるのは、十分に多様なテキストコーパスを学習することで、明示的な教師なしにタスクを遂行する方法を学び始めるからである。

② 改良点・工夫点・テクニック等の記載

本研究では、大規模で多様なデータセットを用いてモデルを訓練することが、ゼロショット学習において重要であることを示している。モデルが単一のタスクに限定されず、多様なタスクに適応する能力を持つ点は特筆すべき改良点である。

③ 性能が向上した記載

GPT-2は、テストされた8つの言語モデリングデータセットのうち7つでゼロショットの最先端性能を達成している【図表番号5】。

④ メカニズムの解明・なぜそうなるのか等の記載

多様なテキストコーパスを学習した高容量モデルが、タスクの遂行方法を暗黙的に学習することが、ゼロショット学習の成功の鍵であるとされている。これは、明示的な教師なしでタスクを学習できる理由を説明している。

⑤ 具体的な処理方法の記載

大規模な言語モデル（GPT-2）を用いて、多様なテキストコーパスを学習し、その後ゼロショット学習で様々なタスクを評価する。具体的には、言語モデリングデータセットでゼロショット性能をテストし、他のタスクでもその能力を評価している。

⑥ 比較の記載

項目	GPT-2 ゼロショット性能	他のモデルとの比較	図/表番号
言語モデリングデータセット	8つのデータセットのうち7つで最先端性能	他のゼロショット学習モデルに勝る	図表番号5

「③ 性能が向上した記載」の「【図表番号5】」の参照は間違い。論文の本文中には記載なし。

8. Appendix A: Samples

8.1. Model capacity

8.2. Text Memorization

8.3. Diversity

8.4. Robustness

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 8. Appendix A: Samples 8.1. Model capacity To complement the reported perplexity gains of bigger LMs on WebText show in Figure 4, Tables 7 through 11 show side-by-side completions of the smallest WebText LM and GPT-2 on random unseen WebText test set articles. 8.2. Text Memorization We observe some memorizing behavior in GPT-2 on longer strings that are repeated many times in the dataset such as famous quotes or speeches. For example, when conditioned on the first sentence and a half of the Gettysburg Address (which occurs approximately 40 times throughout WebText), an argmax decode from GPT-2 recovers the speech. Even when sampling without truncation, we find that the model copies the speech for awhile before drifting, albeit in a similar style. It typically drifts within 100-200 tokens, and displays widening diversity once it drifts. To quantify how often exact memorization shows up in samples, we generated samples from GPT-2 conditioned on WebText test set articles and compared the overlap rates of GPT-2’s generations to the overlap rates of the ground-truth completions. The results of this analysis are shown below and suggest that GPT-2 repeats text from the training set less often then the baseline rate of held-out articles. 8.3. Diversity Table 12 shows multiple completions of the same random WebText test set context, showing the diversity of completions with standard sampling settings. 8.4. Robustness Table 13 shows the previously mentioned talking unicorns news article. We find the model to be capable of handling out of dis- tribution contexts, but the quality of these samples is generally lower.

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

本節は、GPT-2の性能を示すために提供されたサンプルやその評価に関する補遺である。具体的には、モデル容量、テキストの記憶、生成の多様性、ロバスト性について記載されている。

モデル容量: 図4で示されたWebTextに関するパープレキシティの向上に補完する形で、小規模なWebText LMとGPT-2のサイドバイサイドの比較がテーブル7〜11で提供されている。
テキストの記憶: GPT-2は長い文字列や繰り返しデータセット内で出現する有名な引用やスピーチを記憶する傾向があるが、トークン数100～200程度でそのパターンから逸脱し、生成の多様性が拡大する。
多様性: テーブル12では、同じWebTextテストセットコンテキストに対する複数の生成結果を示し、標準的なサンプリング設定での多様性が確認されている。
ロバスト性: テーブル13では、分布外のコンテキストに対する生成の例として、ユニコーンに関するニュース記事が示され、質は低いが対応可能であることが確認された。

② 改良点・工夫点・テクニック等の記載

モデル容量の改善: WebText LMとGPT-2のパープレキシティ向上を比較し、モデル容量を増加させることで生成品質が向上する点を評価している。
テキストの記憶: GPT-2が特定の長いテキストを記憶する傾向を示すが、一定のトークン数を超えると異なるスタイルの生成が始まる。この特性により、多様性を保ちながら生成が進むよう工夫されている。
多様性の確認: 標準的なサンプリング設定において、同じコンテキストに対する異なる生成が提示されており、多様な出力が得られることを確認している。

③ 性能が向上した記載

モデル容量を増加させることで、WebTextにおけるパープレキシティが向上している点が図4に示されている【図4】。
GPT-2のテキスト生成において、記憶した内容からの逸脱後に多様性が拡大する挙動が確認されており、これは性能の向上に寄与している【テーブル7～12】。

④ メカニズムの解明・なぜそうなるのか等の記載

モデルが長い文字列や繰り返し出現するテキストを記憶し、その後記憶から逸脱して多様な生成を行うメカニズムについて言及されている。この挙動は、GPT-2が訓練データの特定の部分を暗記し、同時にその後の生成で新しいパターンを導入するためと説明されている。

⑤ 具体的な処理方法の記載

GPT-2はWebTextデータセットを使用して訓練され、訓練データから生成されたテキストのパープレキシティを評価する。さらに、複数のサンプリング設定を用いて生成の多様性や記憶挙動を評価し、生成されたテキストと元のデータセットとの重複率を比較する。

⑥ 比較の記載

項目	小規模 WebText LM	GPT-2	図/表番号
テキスト生成の比較	単純な生成結果	複雑で多様な生成結果	テーブル7～11
記憶率	高頻度で繰り返されるテキストを再現	記憶からの逸脱後、多様な生成が進行	テーブル12
分布外コンテキストでの応答	質の低い生成	質が低いが対応可能	テーブル13

⑦ 上記⑥以外で表に出来そうな部分

項目	説明
記憶の逸脱	トークン数100～200で記憶したテキストから逸脱する
多様性	記憶から逸脱後に生成されるテキストの多様性が増加
分布外対応	分布外のコンテキストでも、一定の応答が得られる

⑧ 具体的な数値の記載

数値	説明
100～200	GPT-2が記憶から逸脱するトークン数
40	Gettysburg Addressが出現する回数

⑨ 具体的な変数名(数式用の記号)

該当なし。

⑩ 図/表の説明

図4: WebTextにおけるモデル容量とパープレキシティの関係を示し、モデル容量が大きくなるほどパープレキシティが低下し、性能が向上する点を主張している。
テーブル7～11: 小規模WebText LMとGPT-2のサイドバイサイド比較を示し、GPT-2の生成性能がより高いことを主張している。
テーブル12: 同じコンテキストに対する複数の生成結果を示し、GPT-2の生成の多様性を確認するためのもの。
テーブル13: 分布外コンテキストでの応答例を示し、GPT-2が低品質ながらも対応可能である点を示している。

あとがき

所感

機械学習についてある程度知っていないと、生成AIに質問しても、その回答内容だけでは理解できないと思います。生成AIの回答は、論文の内容をあまり変えずに、要点をそのまま掻い摘んで回答するような形になっています。(良い意味で論文の要点をそのまま抜き出した形)

分かったこと

分かったこと・注目した点を列挙すると:

モデルの仕組みはGPTとほぼ同じ。
特定のタスクに特化した教師あり学習は行わず、より大きな言語コーパスを使って、より大きなモデルの言語モデルを事前学習させることにより、zero-shot、もしくはfew-shotのセッティングでも精度が出るような汎用的なモデルを目指す。

自然言語処理タスクに対するゼロショット学習の成功を示した研究結果が記載される。
新しいデータセット「WebText」(数百万のウェブページから収集、大規模で多様なデータセット)を使用してトレーニングされたGPT-2モデルが、教師なしでさまざまな自然言語処理タスクを学習できることを示している。

モデルの容量（パラメータ数）はゼロショットタスク転移の成功に不可欠であり、容量が増加することで性能も向上する。

最大のモデルであるGPT-2は1.5Bパラメータを持つTransformerモデルであり、ゼロショット設定でテストされた8つの言語モデリングデータセットのうち7つで最先端の結果を達成した。
CoQAデータセットにおいて、トレーニングデータ（127,000+の例）を使用せずに、F1スコア55を達成しており、3つのベースラインシステムの性能を上回った。

現在の機械学習システムは、単一タスクに特化した狭い専門家としての性能を発揮するが、一般化能力が不足している。
マルチタスク学習は、より一般的なシステムを構築するための有望なアプローチであるが、現時点では成果は限られている。
転移学習の手法が進化しており、特に言語モデルがタスクをゼロショットで実行できる可能性がある。
本論文では、ゼロショット学習による下流タスクの遂行能力を示し、競争力のある結果を達成することを目指している。

より一般的なシステムを構築するためのマルチタスク学習や転移学習の可能性について述べている。
言語モデルにおいて、タスク固有のアーキテクチャの設計が不要であり、自己注意層（self-attention blocks）の転移だけで十分であることを提案している。

マルチタスク学習(複数のデータセットやタスクを一括して学習)が一般化能力を向上させる可能性がある理由として、単一タスクではなく、広範なタスク間で学習することで、より頑健なシステムが構築されることが期待されるためである。

言語モデリング＝シンボル列の条件付き確率を推定する手法であり、文脈に基づいて次のシンボルを予測する形式である。

タスクを条件に応じて学習するために、言語そのものを使ってタスク、入力、出力をシンボル列として表現する柔軟な手法を導入している。このアプローチは、従来のアーキテクチャ的なタスク特化型のエンコーダー・デコーダーの使用を不要にしている。

インターネット上の膨大な情報を利用する手法を提案している。

言語モデルのトレーニングデータセットの作成方法:
以前の研究では、ニュース記事、Wikipedia、フィクション書籍など、単一ドメインのテキストを使用して言語モデルを訓練していた。
より多様で大規模なデータセットを収集するために、ウェブスクレイピング（主にCommon Crawl）を活用したアプローチが取られる。
Common Crawlのデータは大規模であるが、質の低い文書が多く含まれている。
WebTextという新しいデータセットを作成するために、Redditの「3 karma」以上のリンクを基にしたスクレイピング手法が使用された。
WebTextは45百万件のリンクを含み、約8百万件の文書で40GBのテキストを含む。
Wikipediaの文書は、他のデータセットとの重複を避けるために除外された。

バイトペアエンコーディング（BPE）は、頻繁に使用されるシンボルのシーケンスに対しては単語レベル、稀なシンボルには文字レベルの入力を使用する中間的なアプローチとして機能する。
既存のBPE実装では、Unicodeコードポイントに基づいているため、非常に大きな基本語彙サイズが必要であるが(32,000〜64,000)、バイトレベルのBPEを使用し、基本語彙サイズを256に抑える。
効率的な圧縮を実現した。

通常のBPEでは頻繁に出現する単語の多様なバリエーション（例: "dog", "dog!", "dog?"）がすべて個別にトークン化されるため、無駄な語彙スロットの使用が発生する。これを防ぐために、BPEが文字カテゴリをまたいでマージすることを防ぎ、空白に関してのみ例外を設けることで、語彙スロットの最適な割り当てが可能となった。

言語モデルに使用されるTransformerモデル:
本モデルは、OpenAI GPTモデルの詳細に基づいており、いくつかの修正を加えている。
修正点として、Layer Normalizationを各サブブロックの入力に移動し、追加のLayer Normalizationを最終の自己注意ブロック後に追加している。
残差経路の累積を考慮した修正済みの初期化を使用し、残差層の重みを1/√N (Nは残差層の数)でスケーリングしている。
語彙サイズは50,257に拡張され、コンテキストサイズは512から1024トークンに増加している。
バッチサイズも512に増加している。

残差経路の重みをスケーリングする理由は、モデルが深くなるにつれて、残差経路に累積する情報が大きくなりすぎる問題を防ぐためである。
Layer Normalizationを各サブブロックの入力に移動したことで、各層がより適切に正規化され、勾配消失の問題が軽減されている。

モデルの学習率は、WebTextデータセットの5%の保持サンプルで最適なperplexityを達成するように手動で調整された。

言語モデリングのタスクで、zero-shotで異なるドメインへの転移性能を評価した。

WebText LMはバイトレベルで動作し、ロスのある前処理やトークナイズを必要としない。この特性により、任意の言語モデルベンチマークで評価が可能となっている。

WebText LMは、8つのデータセットのうち7つで、zero-shot設定において最先端の性能を達成した。
特にPenn TreebankやWikiText-2といった小規模データセットでは大きな改善が見られた【表3】。
LAMBADAやChildren’s Book Testのような長期依存関係を測定するデータセットでも顕著な改善が見られた【表3】。

要約: CNNおよびDaily Mailデータセットにおいて、GPT-2は抽象的な要約を生成する能力を示すが、特定の詳細の混乱が見られる。
要約: TL;DR:というテキストを加えることで要約の生成を促進し、Top-kランダムサンプリング（k=2）を使用して100トークンを生成し、最初の3文を要約として選択した。これにより重複を減少させ、より抽象的な要約を生成した。
↓
要約: GPT-2の生成する要約は、選択された3つの文を用いる場合、ランダムに選んだ文を超えるわずかな改善を示したが、定量的には古典的なニューラルベースラインには及ばなかった。

学習データWebTextとLMベンチマークデータセット間の重複を分析:
Bloomフィルターを使用して8-gramの重複を分析し(文字列を小文字のアルファベットと数字に正規化し、単一スペースで区切った文字列を使用)、約1-6%の重複が見られるが、これは他の標準的なデータセットのトレーニングとテストセット間の重複と大差ない。
LAMBADAデータセットでは、重複が15%以上ある場合、GPT-2は2パープレキシティ分の性能向上を見せた。

モデルが大量のデータに触れることで、特定のタスクに特化した機能を自動的に学習する可能性がある。これは、データの多様性とモデルのキャパシティが、隠れたパターンを捉える能力を向上させるためと考えられる。(Liuら (2018) とKarpathyら (2015) の研究が示唆している。)

GPT-2モデルの可能性と限界を論じている:
GPT-2は、特定のタスクにおいてゼロショット設定での性能が良好であるが、他のタスクではまだ基本的な性能にとどまっている。
ゼロショットでの限界はあるが、ファインチューニングによる改善の可能性がある。
GPT-2は、いくつかのNLPタスクでトリビアルなベースラインを上回るが、他の多くのタスクでは性能が低いままである。
GPT-2の能力はBERTのような双方向モデルに対抗できるかどうかは不明であり、今後の調査が必要である。

多様なテキストコーパスを学習した高容量モデルが、タスクの遂行方法を暗黙的に学習することが、ゼロショット学習の成功の鍵であるとされている。これは、明示的な教師なしでタスクを学習できる可能性を示している。

テキストの記憶: GPT-2は長い文字列や繰り返しデータセット内で出現する有名な引用やスピーチを記憶する傾向があるが、トークン数100～200程度でそのパターンから逸脱し、生成の多様性が拡大する。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up