生成AIを用いてScaling Lawsの論文「Training Compute-Optimal Large Language Models (2022)」を読んでみた

Last updated at 2024-11-30Posted at 2024-10-24

はじめに

生成AIを用いてChinchillaのScaling Lawsの論文「Training Compute-Optimal Large Language Models (2022)」の内容を(なるべく)把握してみました。(生成AIが)論文の記載内容を始めから最後まで読んで、実際にどのような記載があるのかを把握します。

(論文の分かりやすい解説記事は見るのですが、実際の論文までチェックしないので、生成AIを使って内容を把握してみました。)

Kaplan et al. (2020)の有名なScaling Lawsの研究では、10倍の計算予算を持つ場合、モデルサイズを5.5倍にし、トークン数を1.8倍にすることが推奨されていたが、本論文ではこれに対して、モデルサイズとトークン数は等しい割合でスケーリングすべきであると主張していることが分かりました。
論文の記載内容が明瞭で、分かりやすく、生成AIの回答文も、分かりやすいものになりました。
(その他、末尾の「分かったこと」章を参照)

以降で、ChatGPTに聞いてみた例を記載します。

他例: 同類の方法を使って読んでみた結果

対象の論文

論文: (Scaling Lawsに関する論文)

[2203.15556] Training Compute-Optimal Large Language Models
https://arxiv.org/abs/2203.15556
(PDF: https://arxiv.org/pdf/2203.15556)

質問時の各章節の区切り部分

論文の中にある各章節を、下記のように区切って、部分毎に生成AIに内容を質問していきます。

Abstract
---
1.-Introduction
---
1.-Introduction (続き)
---
2.-Related Work
---
3.-Estimating the optimal parameter/training tokens allocation
---
3.1. Approach 1: Fix model sizes and vary number of training tokens
---
3.2. Approach 2: IsoFLOP profiles
---
3.3. Approach 3: Fitting a parametric loss function
---
3.4. Optimal model scaling
---
(Table2, 3)
---
4.-Chinchilla
---
4.1. Model and training details
---
4.2. Results
---
4.2.1. Language modelling
---
4.2.2. MMLU
---
4.2.3. Reading comprehension
4.2.4. BIG-bench
---
4.2.5. Common sense
4.2.6. Closed-book question answering
---
(4.2にあるタスクについて再度質問)
---
4.2.7. Gender bias and toxicity
---
4.2.7. Gender bias and toxicity (続き)
---
5.-Discussion & Conclusion
---
5.-Discussion & Conclusion (続き)

生成AIへの質問方法

生成AIを活用して、知りたい記事・論文の1節分(適度な長さ)のテキストをコピー＆ペーストして、その下に質問内容を「①～ ②～ …」と番号付きで書いて、生成AIに渡せば、全質問に一発で的確に回答してくれるので、非常に良好でした。記事全体を読む必要なく、知りたい点の情報だけを収集できます。

生成AIへの質問例:

(論文・記事の各章節を貼り付け)

上記の内容に関して下記の質問に回答下さい: (である調で記載、一般的な推測を回答に混入しない事、元文の記載内容に基づいて忠実に回答、回答量は長くなってもOK)
①何についての記載か? + 要旨は何? (要旨は箇条書きで記載、既存手法があれば引用元を記載)
②具体的な処理方法の記載があれば説明下さい。(簡略化せずに全て記載、既存手法があれば引用元を記載)
③改良点・工夫点・テクニック等の記載があれば説明下さい。
④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。(記載がなければ回答不要)
⑤性能が向上した記載があれば説明下さい。(具体値があれば記載、対応する図/表番号があれば各文末に記載)

続けて下記の質問に追加で回答下さい:
⑥比較の記載があれば違いを表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば明記)
⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば記載)
⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。(必ず正しく数値を抜き取る事、|数値|説明|の表へ)
⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。(|変数名|説明|次元・型|の表へ)
⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。(掲載理由・注目ポイント等)

※回答が長くなりそうな場合は、適宜、分けて質問: ①②③④⑤、⑥⑦⑧⑨⑩
※各章節に応じて、その章節内で明らかに不要な質問は、適宜除外。

※各章節に応じて、適宜下記の質問を追加。

⑪関連研究の参照番号を全て列挙して、表にまとめて下さい。(元文にある内容のみ記載・類推して付け足しは不要、|参照番号|概要説明|の表へ、関連するもの同士でまとめて並べ替え)
⑫難解用語を全て列挙して、表にまとめて下さい。(必ず正しく抜き取る事、|用語|説明|の表へ)

※その他、不明点があれば、適宜、ピンポイントで質問。

質問内容は、記事・論文を読んでいていつも知りたいと思う点(改良点・工夫点・テクニック・メカニズムの解明)にしています。また、表で比較した方が素早く把握できるので、可能であれば記事を表に変換するようにしています。

論文・記事を貼り付けるテキストの長さは、1節分程度の量にとどめた方が、良い回答が得られました。生成AIの回答の文量が多くなってくると、回答が長くなり過ぎないように、生成AIが勝手に(適度に)端折り始めてしまい、重要な点が回答から抜けてしまう可能性が高くなります。

事前知識

機械学習についてある程度知っていないと、生成AIに質問しても、その回答内容だけでは理解できないと思います。生成AIの回答は、論文の内容をあまり変えずに、要点をそのまま掻い摘んで回答するような形になっています。

Scaling Lawsの論文についての分かりやすい解説記事(下記)等を事前にチェックして、中核部分の内容をあらかじめ分かっていると、理解しやすいと思います。生成AIは実際の細かい処理方法自体を分かりやすく説明してはくれない傾向があります。

注意点

論文のテキスト内容だけを貼り付けて、生成AIに質問しています。論文の中の図・表の部分は貼り付けていません。図・表の内容は生成AIの回答には含まれず、別途論文を見る必要があります。

以降で、生成AIの回答内容が読みにくい・分かりづらい場合は、論文の本文でも同じように書かれてあり、論文の本文を読んでも同じように分かりづらいことが多くあります。論文では、既存研究等があるため、多くの説明を省略・一言だけサラッと書かれてある等、同種の研究に取り組む人でなければ、なかなか全容の理解に辿りつくのは難しい側面があります。この点は、生成AIの回答性能が悪いのではなく、論文という性質上、生じるものと考えています。

生成AIに質問

以降で、ChatGPTに実際に聞いてみた例を記載します。

生成AIへの質問＆回答の全容

生成AIへの質問＆回答の全容・詳細:

Abstract

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly under- trained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute- optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4x more more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.

上記の内容に関して下記の質問に回答下さい: (である調で記載、一般的な推測を回答に混入しない事、元文の記載内容に基づいて忠実に回答、回答量は長くなってもOK)
①何についての記載か? + 要旨は何? (要旨は箇条書きで記載、既存手法があれば引用元を記載)
②具体的な処理方法の記載があれば説明下さい。(簡略化せずに全て記載、既存手法があれば引用元を記載)
③改良点・工夫点・テクニック等の記載があれば説明下さい。
④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。(記載がなければ回答不要)
⑤性能が向上した記載があれば説明下さい。(具体値があれば記載、対応する図/表番号があれば各文末に記載)
⑥表に出来そうな部分があれば表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば記載)

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?
この部分では、与えられた計算予算の範囲で最適なTransformer言語モデルのサイズとトークン数を探求する内容が記載されている。特に、現在の大規模言語モデルが過小学習されていることが示されている。

要旨

大規模な言語モデルは過小学習されている傾向がある。
400以上の言語モデルを訓練し、モデルサイズとトークン数は同じ割合でスケーリングすべきであることを発見した。
計算量に最適な訓練では、モデルサイズを倍増させるごとに、訓練トークン数も倍増させることが推奨される。
予測された計算量に最適なモデルであるChinchillaを訓練し、Gopher（280B）、GPT-3（175B）、Jurassic-1（178B）、Megatron-Turing NLG（530B）を一貫して上回る性能を発揮した。
Chinchillaは、ファインチューニングと推論に必要な計算量が大幅に少なく、下流タスクでの使用が容易である。
ChinchillaはMMLUベンチマークにおいて、Gopherを7%以上上回る67.5%の最先端の平均精度を達成した。

②具体的な処理方法の記載

400以上の異なる言語モデルを訓練した。モデルサイズは70Mから16Bを超えるものまでさまざま。
訓練トークン数は5億から5000億トークンにわたり、これらの条件下での訓練を行った。
すべてのモデルに対して、与えられた計算予算のもとで最適なモデルサイズとトークン数を見つけるための実験を行った。
Gopherと同じ計算予算を使用して、70Bのパラメータを持つChinchillaを訓練。Gopherの280Bに比べて4倍多くのデータを使用した。
各モデルは様々な下流タスクで性能評価され、その結果を比較した。

③改良点・工夫点・テクニック等の記載

モデルサイズを増やすだけでなく、トークン数も同様にスケーリングすることで、過小学習の問題に対処した。
同じ計算予算内でより効果的なモデルを訓練するため、モデルサイズの増加に合わせてデータ量も増加させた。
Chinchillaは、従来の大規模モデルと比較して計算コストを削減しつつも、性能向上を実現している。

④メカニズムの解明・なぜそうなるのか等の記載

過去の大規模モデルでは、計算予算を主にモデルサイズの拡大に費やし、訓練データ量が一定であったため、結果的にモデルが過小学習された。そのため、計算量に見合ったトークン数が不足していた。
モデルサイズとトークン数を同じ割合でスケーリングすることにより、モデルはより豊富なデータに基づいて学習できるようになり、性能が大幅に向上する。

⑤性能が向上した記載

Chinchillaは、Gopher (280B)、GPT-3 (175B)、Jurassic-1 (178B)、Megatron-Turing NLG (530B)のすべてに対して一貫して性能を上回ったとされている。
特に、MMLUベンチマークで67.5%の平均精度を達成し、これはGopherより7%以上の改善である。
Chinchillaは、ファインチューニングと推論時に使用される計算量が従来のモデルよりも少ないため、効率的であることが示されている。

⑥表に出来そうな部分

モデル名	パラメータ数	訓練トークン数	下流タスクの評価	MMLUベンチマーク精度	備考
Gopher	280B	記載なし	記載なし	60.5%
GPT-3	175B	記載なし	記載なし	記載なし
Jurassic-1	178B	記載なし	記載なし	記載なし
Megatron-Turing NLG	530B	記載なし	記載なし	記載なし
Chinchilla	70B	4倍のデータ	一貫して優れた性能	67.5%	Gopherと同じ計算予算

解説: この表は、Chinchillaと他の大規模モデルのパラメータ数や性能を比較したものである。Chinchillaは他のモデルよりも少ないパラメータ数でありながら、下流タスクおよびMMLUベンチマークで一貫して優れた結果を示している。

1. Introduction

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 1. Introduction Recently a series of Large Language Models (LLMs) have been introduced (Brown et al., 2020; Lieber et al., 2021; Rae et al., 2021; Smith et al., 2022; Thoppilan et al., 2022), with the largest dense language models now having over 500 billion parameters. These large autoregressive transformers (Vaswani et al., 2017) have demonstrated impressive performance on many tasks using a variety of evaluation protocols such as zero-shot, few-shot, and fine-tuning. The compute and energy cost for training large language models is substantial (Rae et al., 2021; Thoppilan et al., 2022) and rises with increasing model size. In practice, the allocated training compute budget is often known in advance: how many accelerators are available and for how long we want to use them. Since it is typically only feasible to train these large models once, accurately estimating the best model hyperparameters for a given compute budget is critical (Tay et al., 2021). Kaplan et al. (2020) showed that there is a power law relationship between the number of parameters in an autoregressive language model (LM) and its performance. As a result, the field has been training larger and larger models, expecting performance improvements. One notable conclusion in Kaplan et al. (2020) is that large models should not be trained to their lowest possible loss to be compute optimal. Whilst we reach the same conclusion, we estimate that large models should be trained for many more training tokens than recommended by the authors. Specifically, given a 10x increase computational budget, they suggests that the size of the model should increase 5.5x while the number of training tokens should only increase 1.8x. Instead, we find that model size and the number of training tokens should be scaled in equal proportions. Following Kaplan et al. (2020) and the training setup of GPT-3 (Brown et al., 2020), many of the recently trained large models have been trained for approximately 300 billion tokens (Table 1), in line with the approach of predominantly increasing model size when increasing compute.

Figure 1 j Overlaid predictions. We overlay the predictions from our three different approaches,
along with projections from Kaplan et al. (2020). We find that all three methods predict that current
large models should be substantially smaller and therefore trained much longer than is currently
done. In Figure A3, we show the results with the predicted optimal tokens plotted against the optimal
number of parameters for fixed FLOP budgets. Chinchilla outperforms Gopher and the other large
models (see Section 4.2).

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?
この章では、近年導入された大規模言語モデル（LLMs）について言及し、特にその訓練における計算コストやモデルサイズとトークン数の関係に焦点を当てている。また、これらのモデルの訓練に際しての計算予算の最適な使用方法を見つけるために必要なハイパーパラメータ推定についても触れている。

要旨

大規模言語モデルは5000億パラメータを超える規模に達している。
大規模モデルの訓練には膨大な計算量とエネルギーコストがかかる。
Kaplan et al. (2020)によると、モデルのパラメータ数と性能にはべき乗則の関係があるため、より大きなモデルが性能向上をもたらすと期待されてきた。
Kaplanの研究では、10倍の計算予算を持つ場合、モデルサイズを5.5倍にし、トークン数を1.8倍にすることが推奨されたが、著者たちはこれに対して、モデルサイズとトークン数は等しい割合でスケーリングすべきであると主張している。
現在の多くの大規模モデルは、約3000億トークンで訓練されているが、これは主にモデルサイズを増やすアプローチに基づいている。

②具体的な処理方法の記載

Kaplan et al. (2020)およびGPT-3 (Brown et al., 2020) の訓練設定に従い、最近訓練された大規模モデルの多くが、約3000億トークンを使用して訓練されている（Table 1）。
Chinchillaを含む3つの異なるアプローチで予測を行い、これらの予測とKaplanらによる予測とを比較している。
さらに、訓練予算に対して最適なトークン数とパラメータ数を求めるための予測モデルを構築し、それを実験的に検証している。
Figure 1およびAppendix Figure A3では、予測された最適トークン数と最適なパラメータ数を、固定されたFLOP予算に対して比較している。

③改良点・工夫点・テクニック等の記載

Kaplanらの推奨に従わず、モデルサイズの増加に応じてトークン数も同様に増加させるスケーリング法を提案している。このアプローチにより、より多くのトークンでモデルを訓練できるため、計算コストを効率的に利用することができる。
3つの異なるアプローチでの予測を比較することで、現在の大規模モデルはサイズが大きすぎ、訓練トークン数が不足していることを示している。

④メカニズムの解明・なぜそうなるのか等の記載

Kaplanらは、モデルのパラメータ数が性能に与える影響をべき乗則で表したが、著者らはトークン数が不足しているために、モデルが計算リソースを十分に活用できていないと指摘している。パラメータ数とトークン数を等しくスケーリングすることで、モデルがより多くのデータを効果的に学習できると主張している。

⑤性能が向上した記載

Chinchillaは他の大規模モデル（例: Gopherなど）と比較して、モデルサイズを縮小しながらもより多くのトークンを使用して訓練することで、性能が向上したとされている（Section 4.2参照）。
図A3では、予測された最適トークン数と最適なパラメータ数が示されており、Chinchillaが他のモデルに対して優れていることが確認されている。

⑥表に出来そうな部分

モデル名	パラメータ数	訓練トークン数	訓練予算 (FLOP)	比較元	結果
Kaplan et al. (2020)	5.5x	1.8x	10x	GPT-3	記載なし
Chinchilla	70B	4x	同等	Gopher	上回る性能

解説: この表は、Kaplanらによるモデルスケーリングの推奨と、Chinchillaのスケーリングアプローチの比較を示している。Kaplanらはモデルサイズを優先してスケーリングすることを推奨していたが、Chinchillaはパラメータ数を抑えつつトークン数を増やすことで、より効率的に計算予算を活用している。

⑩図/表があれば、各図/表は何を主張するためのものかを説明

Figure 1: この図は、3つの異なるアプローチに基づく予測結果と、Kaplanらの予測を重ね合わせたものである。現在の大規模モデルは、サイズが大きすぎ、より長く訓練されるべきであることを示唆している。この図の注目ポイントは、Chinchillaが最適なトークン数とモデルサイズのバランスを見つけたことで、他のモデルよりも高い性能を発揮している点である。
Appendix Figure A3: この図は、予測された最適なトークン数を、固定されたFLOP予算に対してプロットしたものである。ChinchillaがGopherを含む他の大規模モデルに対して優れていることが示されている。

1. Introduction (続き)

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) In this work, we revisit the question: Given a fixed FLOPs budget,1 how should one trade-off model size and the number of training tokens? To answer this question, we model the final pre-training loss2 𝐿¹𝑁 𝐷º as a function of the number of model parameters 𝑁, and the number of training tokens, 𝐷. Since the computational budget 𝐶 is a deterministic function FLOPs¹𝑁 𝐷º of the number of seen training tokens and model parameters, we are interested in minimizing 𝐿 under the constraint FLOPs¹𝑁 𝐷º = 𝐶: 𝑁𝑜𝑝𝑡 ¹𝐶º 𝐷𝑜𝑝𝑡 ¹𝐶º = argmin 𝑁𝐷 s.t. FLOPs¹𝑁𝐷º=𝐶 𝐿¹𝑁 𝐷º (1) The functions 𝑁𝑜𝑝𝑡 ¹𝐶º, and 𝐷𝑜𝑝𝑡 ¹𝐶º describe the optimal allocation of a computational budget 𝐶. We empirically estimate these functions based on the losses of over 400 models, ranging from under 70M to over 16B parameters, and trained on 5B to over 400B tokens – with each model configuration trained for several different training horizons. Our approach leads to considerably different results than that of Kaplan et al. (2020). We highlight our results in Figure 1 and how our approaches differ in Section 2. Based on our estimated compute-optimal frontier, we predict that for the compute budget used to train Gopher, an optimal model should be 4 times smaller, while being training on 4 times more tokens. We verify this by training a more compute-optimal 70B model, called Chinchilla, on 1.4 trillion tokens. Not only does Chinchilla outperform its much larger counterpart, Gopher, but its reduced model size reduces inference cost considerably and greatly facilitates downstream uses on smaller hardware. The energy cost of a large language model is amortized through its usage for inference an fine-tuning. The benefits of a more optimally trained smaller model, therefore, extend beyond the immediate benefits of its improved performance.

Table 1 j Current LLMs. We show five of the current largest dense transformer models, their size,
and the number of training tokens. Other than LaMDA (Thoppilan et al., 2022), most models are
trained for approximately 300 billion tokens. We introduce Chinchilla, a substantially smaller model,
trained for much longer than 300B tokens.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?
この章では、固定されたFLOPs予算に対して、モデルサイズと訓練トークン数の最適なトレードオフについて再検討している。具体的には、最終的な事前学習損失を最小化するために、モデルパラメータ数と訓練トークン数の最適な組み合わせを見つける方法について述べている。また、これまでの研究結果とは異なる結論に至っており、その違いについても言及している。

要旨

訓練トークン数とモデルサイズの最適なバランスを見つけるために、約400種類のモデルのデータを使用して、パラメータ数とトークン数に基づく損失関数を経験的に推定している。
これにより、Kaplan et al. (2020)の結果とは大きく異なる結論に到達している。
Gopherを訓練するために使用された計算予算を考慮すると、モデルサイズは4倍小さく、トークン数は4倍多くする方が最適であることが示唆されている。
Chinchillaというより小型で計算効率の高いモデルを1.4兆トークンで訓練し、大規模なGopherモデルを上回る性能を示した。
Chinchillaは推論時のコストを大幅に削減し、より小さなハードウェアでの利用を容易にしている。

②具体的な処理方法の記載

モデルの最適なパラメータ数とトークン数の組み合わせを見つけるために、事前学習損失 $ L(N, D) $ をパラメータ数 $ N $ とトークン数 $ D $ の関数として定式化している。
計算予算 $ C $ が決定されている場合、見たトークン数とモデルパラメータ数はFLOPsの関数であるため、損失を最小化する $ N $ と $ D $ の組み合わせを求めている。
具体的には、約400モデルのデータを基に、パラメータ数が70Mから16Bまで、トークン数が5Bから400Bまでの範囲で、様々な訓練期間にわたる実験を行っている。
これにより、経験的に最適なパラメータ数 $ N_{opt}(C) $ とトークン数 $ D_{opt}(C) $ を推定している。

③改良点・工夫点・テクニック等の記載

Kaplan et al. (2020) では、モデルサイズを大きくし、訓練トークン数を少なくするというアプローチが推奨されたが、本研究ではこれとは異なる戦略を提案している。
モデルのサイズを大幅に縮小し、その代わりに訓練トークン数を大幅に増やすことで、計算リソースをより効果的に利用する方法を提案している。この戦略は、同じFLOPs予算内でより優れた性能を引き出すことを目的としている。
また、Chinchillaという70Bの小型モデルを1.4兆トークンで訓練することで、より大規模なモデルよりも高性能であることを証明している。

④メカニズムの解明・なぜそうなるのか等の記載
Kaplanらの手法では、パラメータ数の増加が性能向上に寄与すると考えられていたが、著者らの結果では、モデルサイズを縮小しつつトークン数を増やす方が、より多くのデータを効率的に処理でき、事前学習損失が最小化されることが示されている。これは、モデルが十分なデータを学習するための時間が増えるため、より深い知識が獲得できるからである。

⑤性能が向上した記載

Chinchillaは、Gopherよりも小型ながらも、より多くのトークンで訓練されているため、Gopherを上回る性能を示している。具体的には、Chinchillaは同じ計算予算を使用しながら、推論時のコストを大幅に削減し、またより小型のハードウェアでの利用が可能となっている。
さらに、Gopherと比較して、Chinchillaは訓練トークン数を4倍に増やすことで、計算効率が向上している（Table 1参照）。

⑥表に出来そうな部分

モデル名	パラメータ数	訓練トークン数	訓練予算 (FLOPs)	比較元	結果
Gopher	280B	300B	既存の設定	Chinchilla	Chinchillaが上回る性能を示す
Chinchilla	70B	1.4T	同等	Gopher	推論コストの大幅削減、効率的使用

解説: この表は、GopherとChinchillaのパラメータ数とトークン数に基づく比較を示している。Chinchillaは、Gopherよりもパラメータ数を大幅に減らしながら、トークン数を大幅に増加させたことで、同じ計算予算内で優れた性能を発揮している。

⑩図/表があれば、各図/表は何を主張するためのものかを説明

Figure 1: この図は、Kaplan et al. (2020)の結果と本研究で得られた結果を比較したものである。本研究では、同じ計算予算を持つ場合、モデルサイズを大幅に縮小し、トークン数を大幅に増加させることが最適であることを示している。この図の注目ポイントは、Kaplanらのアプローチが必ずしも最適ではないことを示唆している点である。
Table 1: この表では、現在の大規模言語モデル（LaMDAなど）のパラメータ数と訓練トークン数が比較されている。Chinchillaは、これらのモデルと比較してパラメータ数を減らしながらも、より長期間訓練されており、その結果、より効率的なモデルとなっている。

2. Related Work

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 2. Related Work Large language models. A variety of large language models have been introduced in the last few years. These include both dense transformer models (Brown et al., 2020; Lieber et al., 2021; Rae et al., 2021; Smith et al., 2022; Thoppilan et al., 2022) and mixture-of-expert (MoE) models (Du et al., 2021; Fedus et al., 2021; Zoph et al., 2022). The largest dense transformers have passed 500 billion parameters (Smith et al., 2022). The drive to train larger and larger models is clear—so far increasing the size of language models has been responsible for improving the state-of-the-art in many language modelling tasks. Nonetheless, large language models face several challenges, including their overwhelming computational requirements (the cost of training and inference increase with model size) (Rae et al., 2021; Thoppilan et al., 2022) and the need for acquiring more high-quality training data. In fact, in this work we find that larger, high quality datasets will play a key role in any further scaling of language models. Modelling the scaling behavior. Understanding the scaling behaviour of language models and their transfer properties has been important in the development of recent large models (Hernandez et al., 2021; Kaplan et al., 2020). Kaplan et al. (2020) first showed a predictable relationship between model size and loss over many orders of magnitude. The authors investigate the question of choosing the optimal model size to train for a given compute budget. Similar to us, they address this question by training various models. Our work differs from Kaplan et al. (2020) in several important ways. First, the authors use a fixed number of training tokens and learning rate schedule for all models; this prevents them from modelling the impact of these hyperparameters on the loss. In contrast, we find that setting the learning rate schedule to approximately match the number of training tokens results in the best final loss regardless of model size—see Figure A1. For a fixed learning rate cosine schedule to 130B tokens, the intermediate loss estimates (for 𝐷0 << 130B) are therefore overestimates of the loss of a model trained with a schedule length matching 𝐷0. Using these intermediate losses results in underestimating the effectiveness of training models on less data than 130B tokens, and eventually contributes to the conclusion that model size should increase faster than training data size as compute budget increases. In contrast, our analysis predicts that both quantities should scale at roughly the same rate. Secondly, we include models with up to 16B parameters, as we observe that there is slight curvature in the FLOP-loss frontier (see Appendix E)—in fact, the majority of the models used in our analysis have more than 500 million parameters, in contrast the majority of runs in Kaplan et al. (2020) are significantly smaller—many being less than 100M parameters. Recently, Clark et al. (2022) specifically looked in to the scaling properties of Mixture of Expert language models, showing that the scaling with number of experts diminishes as the model size increases—their approach models the loss as a function of two variables: the model size and the number of experts. However, the analysis is done with a fixed number of training tokens, as in Kaplan et al. (2020), potentially underestimating the improvements of branching. Estimating hyperparameters for large models. The model size and the number of training tokens are not the only two parameters to chose when selecting a language model and a procedure to train it. Other important factors include learning rate, learning rate schedule, batch size, optimiser, and width-to-depth ratio. In this work, we focus on model size and the number of training steps, and we rely on existing work and provided experimental heuristics to determine the other necessary hyperparameters. Yang et al. (2021) investigates how to choose a variety of these parameters for training an autoregressive transformer, including the learning rate and batch size. McCandlish et al. (2018) finds only a weak dependence between optimal batch size and model size. Shallue et al. (2018); Zhang et al. (2019) suggest that using larger batch-sizes than those we use is possible. Levine et al. (2020) investigates the optimal depth-to-width ratio for a variety of standard model sizes. We use slightly less deep models than proposed as this translates to better wall-clock performance on our hardware. Improved model architectures. Recently, various promising alternatives to traditional dense trans- formers have been proposed. For example, through the use of conditional computation large MoE models like the 1.7 trillion parameter Switch transformer (Fedus et al., 2021), the 1.2 Trillion pa- rameter GLaM model (Du et al., 2021), and others (Artetxe et al., 2021; Zoph et al., 2022) are able to provide a large effective model size despite using relatively fewer training and inference FLOPs. However, for very large models the computational benefits of routed models seems to diminish (Clark et al., 2022). An orthogonal approach to improving language models is to augment transformers with explicit retrieval mechanisms, as done by Borgeaud et al. (2021); Guu et al. (2020); Lewis et al. (2020). This approach effectively increases the number of data tokens seen during training (by a factor of ~ 10 in Borgeaud et al. (2021)). This suggests that the performance of language models may be more dependant on the size of the training data than previously thought.

上記の内容に関して下記の質問に回答下さい: (である調で記載、一般的な推測を回答に混入しない事、元文の記載内容に基づいて忠実に回答、回答量は長くなってもOK)
①何についての記載か? + 要旨は何? (要旨は箇条書きで記載、既存手法があれば引用元を記載)
③改良点・工夫点・テクニック等の記載があれば説明下さい。
⑤性能が向上した記載があれば説明下さい。(具体値があれば記載、対応する図/表番号があれば各文末に記載)
⑥比較の記載があれば違いを表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば明記)
⑪関連研究の参照番号を全て列挙して、表にまとめて下さい。(元文にある内容のみ記載・類推して付け足しは不要、|参照番号|概要説明|の表へ、関連するもの同士でまとめて並べ替え)

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

何についての記載か?: 本章は、大規模言語モデル（LLM）に関連する既存研究に関するものである。具体的には、近年開発された様々な大規模言語モデルのスケーリング動作や、これらのモデルの性能向上のための技術的アプローチについて言及している。
要旨:
- 大規模な密なTransformerモデルとMixture-of-Experts（MoE）モデルの概要。
- モデルのスケーリング挙動に関する先行研究の違いと限界点の説明。
- モデルサイズや学習トークン数だけでなく、他のハイパーパラメータ（学習率、バッチサイズなど）の選定が重要であること。
- 新しいアーキテクチャ（例: MoEモデル、リトリーバルメカニズム付きトランスフォーマー）の紹介。

③改良点・工夫点・テクニック等の記載

学習率スケジュールとトークン数の一致: Kaplan et al. (2020)のモデルはすべて固定の学習率スケジュールで訓練されたが、本研究ではモデルサイズに関わらず、学習トークン数に合わせたスケジュールが最適な最終損失をもたらすことを発見している。これにより、トークン数と学習率のスケジュールを適切に一致させる重要性が強調されている。
深さと幅の比率: Levine et al. (2020)では、モデルの深さと幅の最適比率についての調査が行われているが、本研究では、それよりも若干浅いモデルが使用され、これによりハードウェア上でのウォールクロック時間（実行時間）性能が改善されている。
MoEモデルの改善: 条件付き計算を使用したMoEモデルは、大規模な有効モデルサイズを提供しつつ、比較的少ないFLOPsでトレーニングと推論が可能である。これにより、スケーラブルなモデル設計が可能になっている。

⑤性能が向上した記載

リトリーバルメカニズムの効果: リトリーバルメカニズムをトランスフォーマーに追加することで、訓練中に見られるトークンの数が効果的に増加し、Borgeaud et al. (2021)では約10倍のトークン数増加が確認されている。これにより、モデルの性能が大幅に向上したことが示唆されている。

⑥比較の記載

比較項目	Kaplan et al. (2020)	本研究
トークン数固定	すべてのモデルで同一の学習トークン数	学習トークン数に応じた学習率スケジュール
パラメータ数の範囲	多くが100M未満のモデル	500M以上のモデルを多数使用
訓練の損失推定	トークン数が130B未満で誤差が発生	トークン数に一致したスケジュールで改善

⑪関連研究の参照番号を全て列挙

参照番号	概要説明
Brown et al. (2020)	GPT-3のような大規模密なTransformerモデルの紹介。
Lieber et al. (2021)	大規模なTransformerモデルの最新の進展。
Rae et al. (2021)	大規模言語モデルのスケーリングにおける課題について。
Smith et al. (2022)	500Bパラメータを超える大規模密なモデルについて。
Thoppilan et al. (2022)	LaMDAモデルに関する研究。
Du et al. (2021)	Mixture-of-Experts（MoE）モデルの導入、例としてSwitch Transformer。
Fedus et al. (2021)	1.7兆パラメータのSwitch Transformerモデルの詳細。
Zoph et al. (2022)	Mixture-of-Experts（MoE）アーキテクチャの詳細。
Hernandez et al. (2021)	言語モデルのスケーリングに関する研究。
Kaplan et al. (2020)	モデルサイズと損失の予測可能な関係性について。
Clark et al. (2022)	Mixture of Expertsモデルのスケーリングに関する分析。
Yang et al. (2021)	自回帰トランスフォーマーモデルのハイパーパラメータ選定に関する研究。
McCandlish et al. (2018)	バッチサイズとモデルサイズの関係について。
Shallue et al. (2018)	バッチサイズに関する提案。
Zhang et al. (2019)	より大きなバッチサイズの使用可能性について。
Levine et al. (2020)	モデルの深さと幅の比率に関する最適化研究。
Artetxe et al. (2021)	MoEアーキテクチャに関連する他の研究。
Borgeaud et al. (2021)	トランスフォーマーにリトリーバルメカニズムを追加したモデルに関する研究。
Guu et al. (2020)	トランスフォーマーにおけるリトリーバルメカニズムの応用に関する研究。
Lewis et al. (2020)	リトリーバル機構をトランスフォーマーに統合したモデルに関する研究。

3. Estimating the optimal parameter/training tokens allocation

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 3. Estimating the optimal parameter/training tokens allocation We present three different approaches to answer the question driving our research: Given a fixed FLOPs budget, how should one trade-off model size and the number of training tokens? In all three cases we start by training a range of models varying both model size and the number of training tokens and use the resulting training curves to fit an empirical estimator of how they should scale. We assume a power-law relationship between compute and model size as done in Clark et al. (2022); Kaplan et al. (2020), though future work may want to include potential curvature in this relationship for large model sizes. The resulting predictions are similar for all three methods and suggest that parameter count and number of training tokens should be increased equally with more compute3— with proportions reported in Table 2. This is in clear contrast to previous work on this topic and warrants further investigation.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この節は、与えられたFLOPs（計算量）の予算に対して、モデルサイズとトレーニングトークン数のトレードオフをどのように最適化するかという問いに答えるためのアプローチについて述べている。要旨は以下の通り:

固定されたFLOPsの予算の中で、モデルサイズとトレーニングトークン数をどのようにバランスさせるべきかを探求している。
3つのアプローチを用いて、このトレードオフの最適化を試みている。
訓練カーブを使用し、モデルサイズとトレーニングトークン数のスケーリング関係を推定している。
パラメータ数とトレーニングトークン数は、計算量の増加に伴い同じ割合で増加するべきだと提案している。これにより、過去の研究とは異なる結論が導かれた（Clark et al., 2022; Kaplan et al., 2020）。
表2で示されているように、モデルサイズとトレーニングトークンの増加の比率を示している。

②具体的な処理方法の記載

本節では、以下の3つのアプローチが示されている。各アプローチに共通する処理方法は、異なるモデルサイズとトレーニングトークン数のモデルを訓練し、その訓練カーブからスケーリング関係を推定する点である。

モデルの訓練: モデルサイズとトレーニングトークン数を変動させた複数のモデルを訓練し、その結果得られる訓練カーブを使用する。
パワーローの仮定: Clark et al. (2022) および Kaplan et al. (2020) に基づき、モデルサイズと計算量の間にパワーロー関係が存在すると仮定している。大規模モデルにおける潜在的な曲率も考慮に入れるべきだが、本研究では主にパワーロー関係を前提としている。
予測の導出: この仮定に基づいて、モデルサイズとトレーニングトークン数をどのようにスケールさせるべきかを推定する。その結果、両者は同じ割合でスケールすべきであると結論付けている。

③改良点・工夫点・テクニック等の記載

本研究の改良点としては、過去の研究での仮定に対する批判と新しい提案が挙げられる:

パラメータとトークン数の関係の再評価: 過去の研究（Kaplan et al., 2020）では、パラメータ数をより速いペースで増加させるべきだと結論していたが、本研究では、パラメータ数とトレーニングトークン数を同じ割合で増加させるべきだと提案している。この違いは訓練カーブから得られる推定に基づいており、新たな洞察を提供している。
パワーローの仮定: 大規模モデルにおけるスケーリングの曲率も潜在的に重要であるが、現段階ではパワーローの関係を仮定している。今後の研究で曲率を考慮する必要があることを示唆している。

⑤性能が向上した記載

具体的な性能向上に関する記述は本節には存在しないが、3つのアプローチから得られる予測に基づいて、最適なパラメータ数とトレーニングトークン数のバランスを取ることで、従来の手法に対して改善が期待されるとしている。

⑥表に出来そうな部分

アプローチ	処理方法	パワーローの仮定	結論
アプローチ1	モデルサイズとトレーニングトークン数を変動させた複数のモデルを訓練し、訓練カーブを用いてスケーリングを推定	Clark et al. (2022); Kaplan et al. (2020) に基づくパワーロー関係を仮定	モデルサイズとトレーニングトークン数は同じ割合で増加すべきであると推定
アプローチ2	同上	同上	同上
アプローチ3	同上	同上	同上

⑩図/表があれば、各図/表は何を主張するためのものか

表2: モデルサイズとトレーニングトークン数の増加に対する推奨比率を示している。これにより、従来の研究とは異なる新しいスケーリング関係が提案されている点が注目すべきポイントである。

3.1. Approach 1: Fix model sizes and vary number of training tokens

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 3.1. Approach 1: Fix model sizes and vary number of training tokens In our first approach we vary the number of training steps for a fixed family of models (ranging from 70M to over 10B parameters), training each model for 4 different number of training sequences. From these runs, we are able to directly extract an estimate of the minimum loss achieved for a given number of training FLOPs. Training details for this approach can be found in Appendix D. For each parameter count 𝑁 we train 4 different models, decaying the learning rate by a factor of 10 over a horizon (measured in number of training tokens) that ranges by a factor of 16. Then, for each run, we smooth and then interpolate the training loss curve. From this, we obtain a continuous mapping from FLOP count to training loss for each run. Then, for each FLOP count, we determine which run achieves the lowest loss. Using these interpolants, we obtain a mapping from any FLOP count 𝐶, to the most efficient choice of model size 𝑁 and number of training tokens 𝐷 such that FLOPs¹𝑁 𝐷º = 𝐶.4 At 1500 logarithmically spaced FLOP values, we find which model size achieves the lowest loss of all models along with the required number of training tokens. Finally, we fit power laws to estimate the optimal model size and number of training tokens for any given amount of compute (see the center and right panels of Figure 2), obtaining a relationship 𝑁𝑜𝑝𝑡 / 𝐶𝑎 and 𝐷𝑜𝑝𝑡 / 𝐶𝑏. We find that 𝑎 = 050 and 𝑏 = 050—as summarized in Table 2. In Section D.4, we show a head-to-head comparison at 1021 FLOPs, using the model size recommended by our analysis and by the analysis of Kaplan et al. (2020)—using the model size we predict has a clear advantage.

Figure 2 j Training curve envelope. On the left we show all of our different runs. We launched a
range of model sizes going from 70M to 10B, each for four different cosine cycle lengths. From these
curves, we extracted the envelope of minimal loss per FLOP, and we used these points to estimate the
optimal model size (center) for a given compute budget and the optimal number of training tokens
(right). In green, we show projections of optimal model size and training token count based on the
number of FLOPs used to train Gopher (576 1023).

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この節は、モデルサイズを固定してトレーニングトークン数を変動させるアプローチ（アプローチ1）について述べている。このアプローチを用いて、トレーニングに必要なFLOPs数に対する最適なモデルサイズとトレーニングトークン数の関係を推定している。要旨は以下の通り:

モデルサイズを固定し、異なる数のトレーニングステップで複数のモデルを訓練。
訓練から得られた最小損失値とFLOPsの関係を用いて、FLOPsに対する最適なモデルサイズとトレーニングトークン数を推定。
得られた関係を元に、最適なモデルサイズとトレーニングトークン数のスケーリング法則を導出（パワーローを使用）。
Clark et al. (2022) および Kaplan et al. (2020) との比較で優位性を示している。

②具体的な処理方法の記載

モデルの訓練:
- モデルサイズを70Mから10Bパラメータまで変動させ、各モデルに対して4つの異なるトレーニングシーケンス（トレーニングステップ数）で訓練を行う。
- 各パラメータ数ごとに、異なるシーケンスで4つのモデルを訓練し、学習率はトレーニングトークン数に応じて10倍の減衰を行い、トレーニングトークン数の範囲は16倍の変動幅を持つ。
訓練曲線のスムージングと補間:
- 訓練が完了した後、各訓練曲線をスムージングし、その後に補間を行う。
- この補間により、FLOPs数に対して連続的な損失関数のマッピングを作成。
最小損失の特定:
- 各FLOPs数に対して、最小損失を達成するモデルサイズとトレーニングトークン数を特定する。
パワーローによる最適化の推定:
- 各FLOPs数に対して最適なモデルサイズ$N_{\text{opt}}$とトレーニングトークン数$D_{\text{opt}}$をパワーローの形式で推定。
- 推定されたスケーリング関係は、$N_{\text{opt}} \sim C^a$ および $D_{\text{opt}} \sim C^b$ という形式になり、具体的には $a = 0.50$、$b = 0.50$ である。
結果の検証:
- 1021 FLOPsでの比較を行い、Kaplan et al. (2020) の分析で推奨されるモデルサイズに対して本アプローチで推定されたモデルサイズが優位であることを示している。

③改良点・工夫点・テクニック等の記載

多様なモデルサイズの実験: 70Mから10Bパラメータまで幅広いモデルサイズで実験を行い、それぞれに対して異なるトレーニングトークン数の設定を用いることで、モデルサイズとトークン数の関係を詳細に調査している。
スムージングと補間による損失推定: トレーニング曲線をスムージングし、連続的な損失関数を補間することで、より正確にFLOPsと損失の関係を推定している。
パワーローによるスケーリングの導出: パラメータ数とトレーニングトークン数がFLOPsに対して同じスケーリングであるという新たなスケーリング則を提案しており、これが従来の研究と異なる結果をもたらしている。

④メカニズムの解明・なぜそうなるのか等の記載

メカニズムに関する詳細な説明は本節にはないが、パワーローを用いてFLOPsに対してモデルサイズとトレーニングトークン数のスケーリング則を推定している。これにより、FLOPsが増加する場合には両者を均等に増加させるべきだという新たな知見が得られている。

⑤性能が向上した記載

本アプローチによる予測は、従来のKaplan et al. (2020) の分析で推奨されるモデルサイズに対して、より優れた結果を示している。特に、1021 FLOPsでの比較において、推定されたモデルサイズが最小損失を達成することが確認されている（図2の中心および右パネルを参照）。

⑥表に出来そうな部分

処理手順	説明
モデル訓練	モデルサイズを固定し、トレーニングステップ数を4種類に変動させて訓練。
スムージングと補間	訓練曲線をスムージングし、FLOPs数に対して連続的な損失のマッピングを補間。
最小損失の特定	各FLOPs数に対して、最小損失を達成するモデルサイズとトレーニングトークン数を特定。
パワーローによる最適化の推定	最適なモデルサイズ$N_{\text{opt}}$とトレーニングトークン数$D_{\text{opt}}$をパワーロー関係で推定。

⑨具体的な変数名(数式用の記号)の記載

変数名	説明	次元・型
$N_{\text{opt}}$	最適なモデルサイズ	パラメータ数
$D_{\text{opt}}$	最適なトレーニングトークン数	トークン数
$C$	FLOPs数	計算量（FLOPs）
$a$	モデルサイズのスケーリング指数	無次元（0.50）
$b$	トレーニングトークン数のスケーリング指数	無次元（0.50）

⑩図/表があれば、各図/表は何を主張するためのものか

図2: 左パネルには異なるモデルサイズでの全訓練曲線が示されており、これにより各モデルの損失の推移を視覚化している。中央パネルは与えられたFLOPsに対して最適なモデルサイズを示し、右パネルは最適なトレーニングトークン数を示している。緑色の線は、Gopherの訓練に使用されたFLOPsに基づく予測を示している。これにより、FLOPs数に対するモデルサイズとトレーニングトークン数の最適化がどのように行われるかが視覚的に示されている。

3.2. Approach 2: IsoFLOP profiles

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 3.2. Approach 2: IsoFLOP profiles In our second approach we vary the model size5 for a fixed set of 9 different training FLOP counts6 (ranging from 6 1018 to 3 1021 FLOPs), and consider the final training loss for each point7. in contrast with Approach 1 that considered points ¹𝑁 𝐷 𝐿º along the entire training runs. This allows us to directly answer the question: For a given FLOP budget, what is the optimal parameter count? For each FLOP budget, we plot the final loss (after smoothing) against the parameter count in Figure 3 (left). In all cases, we ensure that we have trained a diverse enough set of model sizes to see a clear minimum in the loss. We fit a parabola to each IsoFLOPs curve to directly estimate at what model size the minimum loss is achieved (Figure 3 (left)). As with the previous approach, we then fit a power law between FLOPs and loss-optimal model size and number of training tokens, shown in Figure 3 (center, right). Again, we fit exponents of the form 𝑁𝑜𝑝𝑡 / 𝐶𝑎 and 𝐷𝑜𝑝𝑡 / 𝐶𝑏 and we find that 𝑎 = 049 and 𝑏 = 051—as summarized in Table 2.

Figure 3 j IsoFLOP curves. For various model sizes, we choose the number of training tokens such
that the final FLOPs is a constant. The cosine cycle length is set to match the target FLOP count. We
find a clear valley in loss, meaning that for a given FLOP budget there is an optimal model to train
(left). Using the location of these valleys, we project optimal model size and number of tokens for
larger models (center and right). In green, we show the estimated number of parameters and tokens
for an optimal model trained with the compute budget of Gopher.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

何についての記載か:
この章は、アプローチ2「IsoFLOPプロファイル」に関する内容である。固定されたFLOP数に対して、異なるモデルサイズを用いた場合の最終的な訓練損失を分析している。

要旨:

アプローチ2では、固定された9つの異なる訓練FLOP数（6 × 10¹⁸から3 × 10²¹ FLOPs）で、モデルサイズを変化させる。
各FLOP予算に対して、最適なパラメータ数を求めることが主な目的である。
最終的な訓練損失を滑らかにした後、各FLOP予算に対してモデルサイズと損失の関係をプロットし、最適なモデルサイズを決定する。
各IsoFLOP曲線に対して、二次関数でフィッティングを行い、最適なモデルサイズを推定する。
FLOP数と損失最適なモデルサイズおよび訓練トークン数の間に、パワー法則を適用し、最適モデルサイズとトークン数を推定する。
その結果、パワー法則の指数は𝑁𝑜𝑝𝑡 / 𝐶𝑎 = 0.49、𝐷𝑜𝑝𝑡 / 𝐶𝑏 = 0.51であった（Table 2）。

②具体的な処理方法の記載:

固定された9つの異なる訓練FLOP数（6 × 10¹⁸から3 × 10²¹ FLOPs）に対して、モデルサイズを変化させ、最終的な訓練損失を計測する。
各FLOP予算に対して、訓練トークン数がFLOP数と一致するように選択される。
損失曲線を滑らかにし、各FLOP予算において損失が最も低くなるモデルサイズを見つける。
各IsoFLOP曲線に対して、二次関数でフィッティングを行い、最適なモデルサイズを推定する。
パワー法則（𝑁𝑜𝑝𝑡 / 𝐶𝑎、𝐷𝑜𝑝𝑡 / 𝐶𝑏）を適用して、FLOP数と損失最適なモデルサイズおよび訓練トークン数の関係を導き出す。

③改良点・工夫点・テクニック等の記載:

アプローチ1では、モデルサイズ、トークン数、損失の関係を訓練全体にわたって分析していたが、アプローチ2では、固定されたFLOP数に対してモデルサイズを変化させ、最終的な訓練損失に焦点を当てている。このアプローチにより、FLOP予算ごとの最適なモデルサイズを直接求めることができる。
損失曲線に対して二次関数をフィットさせることで、最適なモデルサイズをより正確に特定できる。
パワー法則を適用することで、FLOP数に対する最適なモデルサイズや訓練トークン数の関係を簡潔に表現できる。

④メカニズムの解明・なぜそうなるのか等の記載:
メカニズムの解明については、FLOP数に対する最適なモデルサイズや訓練トークン数の関係がパワー法則に従うことが記載されている。これは、計算リソース（FLOP）に基づいて、モデルサイズやトークン数が最適化される結果である。

⑤性能が向上した記載:

最適なモデルサイズを選定することで、各FLOP予算に対して最小の損失が得られる。
図3の左パネルでは、各FLOP予算に対して最小の損失を示すモデルサイズの谷が確認できる。
図3の中央および右パネルでは、より大きなモデルに対しても最適なモデルサイズおよびトークン数の推定値が示されている。

⑥表に出来そうな部分:

FLOP数	最適モデルサイズ	最小損失	訓練トークン数
6 × 10¹⁸	(推定値)	(推定値)	(推定値)
3 × 10²¹	(推定値)	(推定値)	(推定値)

⑨具体的な変数名(数式用の記号)の記載:

変数名	説明	次元・型
𝑁	モデルのパラメータ数	整数
𝐷	訓練トークン数	整数
𝐿	訓練損失	実数
𝐶	FLOP数	実数
𝑎, 𝑏	パワー法則の指数	実数
𝑁𝑜𝑝𝑡	最適モデルサイズ	整数
𝐷𝑜𝑝𝑡	最適訓練トークン数	整数

⑩図/表があれば、各図/表は何を主張するためのものか:

図3（IsoFLOP曲線）: 左パネルでは、各FLOP予算に対して最適なモデルサイズが存在することを示す。中央および右パネルでは、FLOP数に対して最適なモデルサイズおよび訓練トークン数を示す推定値が示されている。特に、大規模なモデルに対しても推定が可能である点が注目ポイントである。

3.3. Approach 3: Fitting a parametric loss function

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 3.3. Approach 3: Fitting a parametric loss function Lastly, we model all final losses from experiments in Approach 1 & 2 as a parametric function of model parameter count and the number of seen tokens. Following a classical risk decomposition (see Section D.2), we propose the following functional form ˆ𝐿¹𝑁 𝐷º , 𝐸 ¸ 𝐴 𝑁𝛼 ¸ 𝐵 𝐷𝛽 (2) The first term captures the loss for an ideal generative process on the data distribution, and should correspond to the entropy of natural text. The second term captures the fact that a perfectly trained transformer with 𝑁 parameters underperforms the ideal generative process. The final term captures the fact that the transformer is not trained to convergence, as we only make a finite number of optimisation steps, on a sample of the dataset distribution. Model fitting. To estimate ¹ 𝐴 𝐵 𝐸 𝛼 𝛽º, we minimize the Huber loss (Huber, 1964) between the predicted and observed log loss using the L-BFGS algorithm (Nocedal, 1980): min 𝐴𝐵𝐸𝛼𝛽∑︁ Runs 𝑖 Huber𝛿 log ˆ𝐿¹𝑁𝑖 𝐷𝑖º log 𝐿𝑖 (3) We account for possible local minima by selecting the best fit from a grid of initialisations. The Huber loss (𝛿 = 103) is robust to outliers, which we find important for good predictive performance over held-out data points. Section D.2 details the fitting procedure and the loss decomposition. Efficient frontier. We can approximate the functions 𝑁𝑜𝑝𝑡 and 𝐷𝑜𝑝𝑡 by minimizing the parametric loss ˆ𝐿 under the constraint FLOPs¹𝑁 𝐷º 6𝑁 𝐷 (Kaplan et al., 2020). The resulting 𝑁𝑜𝑝𝑡 and 𝐷𝑜𝑝𝑡 balance the two terms in Equation (3) that depend on model size and data. By construction, they have a power-law form: 𝑁𝑜𝑝𝑡 ¹𝐶º = 𝐺 𝐶 6 𝑎 𝐷𝑜𝑝𝑡 ¹𝐶º = 𝐺1 𝐶 6 𝑏 where 𝐺 = 𝛼𝐴 𝛽𝐵 1 𝛼¸𝛽 𝑎 = 𝛽 𝛼 ¸ 𝛽 and 𝑏 = 𝛼 𝛼 ¸ 𝛽 (4) We show contours of the fitted function ˆ𝐿 in Figure 4 (left), and the closed-form efficient computational frontier in blue. From this approach, we find that 𝑎 = 046 and 𝑏 = 054—as summarized in Table 2.

Figure 4 j Parametric fit. We fit a parametric modelling of the loss ˆ𝐿¹𝑁 𝐷º and display contour (left)
and isoFLOP slices (right). For each isoFLOP slice, we include a corresponding dashed line in the left
plot. In the left plot, we show the efficient frontier in blue, which is a line in log-log space. Specifically,
the curve goes through each iso-loss contour at the point with the fewest FLOPs. We project the
optimal model size given the Gopher FLOP budget to be 40B parameters.

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

この章節は、モデルパラメータ数と観測トークン数に基づいて、最終的な損失をパラメトリックな関数でモデル化するアプローチについて記載している。目的は、Approach 1とApproach 2で得られた実験データに基づき、損失関数をフィッティングし、計算量と損失を最適化するモデルのパラメータ数とトークン数を推定することである。

損失をモデルパラメータ数と観測トークン数の関数としてパラメトリックにモデル化する。
損失関数の推定にはHuber lossを使用し、L-BFGSアルゴリズムで最適化を行う。
損失関数の最適化により、FLOP制約下で最適なパラメータ数とトークン数を求める。
結果として得られた関数はパワーローの形を持ち、モデルパラメータ数とトークン数の関係が明確に示される。

② 具体的な処理方法の記載

処理方法として、以下のステップが挙げられる。

損失関数の定式化:
損失関数は以下の形式で提案されている。
$
ˆ𝐿(𝑁, 𝐷) = 𝐸 + 𝐴 \times 𝑁^{-𝛼} + 𝐵 \times 𝐷^{-𝛽}
$
ここで、1つ目の項はデータ分布における理想的な生成プロセスの損失を表し、2つ目の項は訓練されたトランスフォーマーモデルが理想的な生成プロセスよりも劣ることを示し、3つ目の項は有限回の最適化ステップとデータセットのサンプルによる訓練の不完全さを表す。
モデルフィッティング:
パラメータ $𝐴, 𝐵, 𝐸, 𝛼, 𝛽$ を推定するため、Huber loss を用いたフィッティングを行う。最適化にはL-BFGSアルゴリズムを使用し、局所的最適解に陥るリスクを避けるため、初期化のグリッドサーチを実施している。
Huber lossの使用:
フィッティングのロバスト性を高めるために、外れ値に対して頑健なHuber loss (1964) が使用されており、$\delta = 10^3$ が設定されている。
効率的フロンティアの計算:
FLOP制約 $FLOPs(𝑁, 𝐷) ≈ 6𝑁𝐷$ を満たす範囲で損失関数 $ˆ𝐿$ を最小化し、最適なパラメータ数 $𝑁_{opt}$ とトークン数 $𝐷_{opt}$ を求める。得られた結果は、パラメトリック損失関数がパワーローの形式を持つことを示している。

③ 改良点・工夫点・テクニック等の記載

Huber lossの採用: パラメトリックな損失関数のフィッティングに際し、外れ値に対するロバスト性を確保するためにHuber lossを採用している。これにより、外れ値の影響を最小限に抑えつつ、安定した予測性能を実現している。
L-BFGSアルゴリズムの使用: パラメータフィッティングにはL-BFGSアルゴリズムを使用し、効率的な最適化を行っている。L-BFGSはメモリ効率に優れ、大規模なデータセットやモデルにも適用可能である。
グリッドサーチによる初期化: ローカルミニマに陥るリスクを回避するため、グリッドサーチによる初期化を行い、複数の初期条件で最適解を探索している。

④ メカニズムの解明・なぜそうなるのか等の記載

メカニズムに関する記載は、損失関数の各項が何を表しているかに関する説明である。1つ目の項はデータの理想的な生成プロセスに基づく損失であり、2つ目の項はモデルの不完全さを、3つ目の項は訓練が不完全であることを反映している。このように、各項が実際の損失に対する異なる寄与を表しており、パラメトリックモデルの適合度を高めるためのフィッティングが行われている。

⑤ 性能が向上した記載

性能の向上に関しては、GopherのFLOP予算を用いた最適モデルのサイズが40Bパラメータと推定されていることが記載されている。この推定は、提案されたパラメトリックモデルに基づいて行われ、効率的フロンティアに従って最適なパラメータ数とトークン数が算出されている。対応する図はFigure 4である。

⑥ 表に出来そうな部分

項目	内容
パラメトリック損失関数	$ˆ𝐿(𝑁, 𝐷) = 𝐸 + 𝐴 \times 𝑁^{-𝛼} + 𝐵 \times 𝐷^{-𝛽}$
最適化アルゴリズム	L-BFGSアルゴリズム
損失の頑健性	外れ値に対してHuber loss (1964) を使用し、$\delta = 10^3$を設定
FLOP制約	$FLOPs(𝑁, 𝐷) ≈ 6𝑁𝐷$
最適パラメータ数	$𝑁_{opt} = 40B$ (GopherのFLOP予算に基づく)

⑨ 具体的な変数名(数式用の記号)の記載

変数名	説明	次元・型
ˆ𝐿	モデル化された最終損失	実数
𝐴	パラメトリック損失関数の係数	実数
𝐵	パラメトリック損失関数の係数	実数
𝐸	理想的な生成プロセスに基づく損失を表す項	実数
𝑁	モデルのパラメータ数	整数
𝑁_{opt}	FLOP制約下での最適なモデルパラメータ数	整数
𝐷	観測されたトークン数	整数
𝐷_{opt}	FLOP制約下での最適なトークン数	整数
𝛼	パラメトリック損失関数の指数	実数
𝛽	パラメトリック損失関数の指数	実数

⑩ 図/表があれば、各図/表は何を主張するためのものか

Figure 4: 左図はパラメトリックにフィットされた損失関数の等高線を表示しており、青い線は効率的フロンティアを表している。右図はisoFLOP曲線を示しており、各isoFLOPのスライスに対応する破線が左図に描かれている。これにより、最小のFLOPで等損失曲線を通過する最適モデルのパラメータ数を視覚的に示している

。また、GopherのFLOP予算に基づく最適モデルサイズとして40Bパラメータが推定されている。

3.4. Optimal model scaling

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 3.4. Optimal model scaling We find that the three approaches, despite using different fitting methodologies and different trained models, yield comparable predictions for the optimal scaling in parameters and tokens with FLOPs (shown in Table 2). All three approaches suggest that as compute budget increases, model size and the amount of training data should be increased in approximately equal proportions. The first and second approaches yield very similar predictions for optimal model sizes, as shown in Figure 1 and Figure A3. The third approach predicts even smaller models being optimal at larger compute budgets. We note that the observed points ¹𝐿 𝑁 𝐷º for low training FLOPs (𝐶 <= 1𝑒21) have larger residuals |𝐿 - ˆ𝐿¹𝑁 𝐷º|2 2 than points with higher computational budgets. The fitted model places increased weight on the points with more FLOPs—automatically considering the low-computational budget points as outliers due to the Huber loss. As a consequence of the empirically observed negative curvature in the frontier 𝐶 → 𝑁𝑜𝑝𝑡 (see Appendix E), this results in predicting a lower 𝑁𝑜𝑝𝑡 than the two other approaches. In Table 3 we show the estimated number of FLOPs and tokens that would ensure that a model of a given size lies on the compute-optimal frontier. Our findings suggests that the current generation of large language models are considerably over-sized, given their respective compute budgets, as shown in Figure 1. For example, we find that a 175 billion parameter model should be trained with a compute budget of 441 1024 FLOPs and on over 4.2 trillion tokens. A 280 billion Gopher-like model is the optimal model to train given a compute budget of approximately 1025 FLOPs and should be trained on 6.8 trillion tokens. Unless one has a compute budget of 1026 FLOPs (over 250 the compute used to train Gopher), a 1 trillion parameter model is unlikely to be the optimal model to train. Furthermore, the amount of training data that is projected to be needed is far beyond what is currently used to train large models, and underscores the importance of dataset collection in addition to engineering improvements that allow for model scale. While there is significant uncertainty extrapolating out many orders of magnitude, our analysis clearly suggests that given the training compute budget for many current LLMs, smaller models should have been trained on more tokens to achieve the most performant model. In Appendix C, we reproduce the IsoFLOP analysis on two additional datasets: C4 (Raffel et al., 2020a) and GitHub code (Rae et al., 2021). In both cases we reach the similar conclusion that model size and number of training tokens should be scaled in equal proportions.

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

この節は、計算コストに基づいた最適なモデルサイズとトークン数に関するスケーリングの推奨方法について説明している。異なる3つのアプローチを用いた最適スケーリングの予測が比較され、モデルサイズとトークン数が計算予算に対してどのように増加すべきかが議論されている。要旨は以下の通りである。

3つの異なるフィッティング手法と訓練モデルに基づき、最適なモデルサイズとトークン数がほぼ一致する予測を示している。
計算予算が増加するにつれて、モデルサイズとトレーニングデータの量はほぼ同じ割合で増加する必要がある。
1つのアプローチでは、より高い計算予算で最適モデルサイズが小さくなる予測が示されている。
現在の大規模言語モデル（LLM）は、計算予算に対して過剰に大きいモデルである可能性が指摘されている。
1750億パラメータのモデルに必要な計算予算とトークン数が推定され、より大きなモデルの訓練にはさらに膨大なデータが必要であると結論されている。

② 具体的な処理方法の記載

フィッティング手法: 3つの異なるアプローチが用いられている。これらのアプローチは異なるフィッティング手法と異なる訓練モデルに基づいているが、最適なモデルサイズとトークン数の予測は概ね一致している。
Huber lossを用いたフィッティング: 第3のアプローチではHuber lossが使用されており、低計算予算でのデータポイントを外れ値として扱い、重みを減少させる。この手法により、高い計算予算でのデータポイントに対してより高い重みが割り当てられる。
計算フロンティアの観測: 低い計算予算（𝐶 ≤ 1e21）における残差が大きくなることが観測されており、これがHuber lossによって外れ値として扱われ、最適モデルサイズの予測が低くなる結果となっている。
トークン数とモデルサイズのスケーリング: 例えば、1750億パラメータのモデルは、計算予算4.41×10^24 FLOPsと4.2兆トークンで訓練する必要があり、2800億パラメータのGopherに類似したモデルでは10^25 FLOPsの計算予算と6.8兆トークンが必要と推定されている。
データセットの拡大: モデルスケールの最適化には、データセットの拡大が重要であることが強調されている。現時点のLLM訓練では、使用されるデータ量が不十分である可能性がある。
追加のデータセットでの検証: Appendix Cで、C4とGitHubコードデータセットを用いたIsoFLOP解析が再現され、モデルサイズとトークン数が等しくスケールすべきであるという同様の結論が得られている。

③ 改良点・工夫点・テクニック等の記載

Huber lossの採用: 第3のアプローチではHuber lossを使用しており、低計算予算での外れ値を自動的に軽視する工夫がなされている。これにより、全体的なフィッティングの精度が向上している。
計算フロンティアの分析: 第3のアプローチでは、フロンティアの観測から負の曲率を持つことが確認されており、これが小さいモデルサイズを予測する原因となっている。この観察は、他の2つのアプローチとの違いを生んでいる。
データセットの重要性の指摘: 計算予算が増加するに伴い、必要なトークン数が急増するため、モデルサイズのスケーリングだけでなく、データセットの規模の拡大が不可欠であることが示されている。

④ メカニズムの解明・なぜそうなるのか等の記載

Huber lossによる外れ値の軽視: Huber lossの特性により、低計算予算のデータポイントが外れ値として扱われ、より高い計算予算のデータに対して重みが増加する。これにより、最適なモデルサイズの予測に違いが生じる。
フロンティアの負の曲率: 第3のアプローチで観測された計算フロンティアの負の曲率が、小さめのモデルサイズを予測する結果に繋がっている。

⑤ 性能が向上した記載

最適なモデルサイズの予測: 2800億パラメータのGopherのようなモデルは、約10^25 FLOPsの計算予算で訓練され、6.8兆トークンが必要とされている。これにより、既存の大規模モデルに比べて効率的なスケーリングが可能であることが示されている【表2および図1】。

⑥ 表に出来そうな部分

モデル	パラメータ数	計算予算 (FLOPs)	トークン数 (兆)
175B	1.75×10^11	4.41×10^24	4.2
280B	2.8×10^11	10^25	6.8
1T	10^12	10^26	-

⑨ 具体的な変数名(数式用の記号)の記載

変数名	説明	次元・型
𝐿	実際の損失	実数
ˆ𝐿	フィッティングされた損失	実数
𝑁	モデルのパラメータ数	整数
𝐷	トークン数	整数
𝐶	訓練FLOPs	整数
𝑁_{opt}	計算予算に対して最適なパラメータ数	整数
\|𝐿 - ˆ𝐿\|2	実際の損失とフィッティングされた損失の差の2乗	実数

⑩ 図/表があれば、各図/表は何を主張するためのものか

Table 2: 3つのアプローチで予測された最適なモデルスケーリングに関する比較を示している。これにより、計算予算に対するモデルサイズとトークン数の推定がどの程度一致するかを視覚的に確認できる。
Figure 1: 予測されたモデルサイズと計算予算の関係を示しており、特に現在のLLMが過剰に大きいことが視覚的に示されている。
Figure A3: 第1および第2のアプローチで得られた最適モデルサイズの予測が類似していることを確認できる。

(Table2, 3)

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) Table 2 j Estimated parameter and data scaling with increased training compute. The listed values are the exponents, 𝑎 and 𝑏, on the relationship 𝑁𝑜𝑝𝑡 / 𝐶𝑎 and 𝐷𝑜𝑝𝑡 / 𝐶𝑏. Our analysis suggests a near equal scaling in parameters and data with increasing compute which is in clear contrast to previous work on the scaling of large models. The 10th and 90th percentiles are estimated via bootstrapping data (80% of the dataset is sampled 100 times) and are shown in parenthesis. Approach Coeff. 𝑎 where 𝑁𝑜𝑝𝑡 / 𝐶𝑎 Coeff. 𝑏 where 𝐷𝑜𝑝𝑡 / 𝐶𝑏 1. Minimum over training curves 050 ¹0488 0502º 050 ¹0501 0512º 2. IsoFLOP profiles 049 ¹0462 0534º 051 ¹0483 0529º 3. Parametric modelling of the loss 046 ¹0454 0455º 054 ¹0542 0543º Kaplan et al. (2020) 0.73 0.27 Table 3 j Estimated optimal training FLOPs and training tokens for various model sizes. For various model sizes, we show the projections from Approach 1 of how many FLOPs and training tokens would be needed to train compute-optimal models. The estimates for Approach 2 & 3 are similar (shown in Section D.3) . Parameters FLOPs FLOPs (in Gopher unit) Tokens 400 Million 1.92e+19 129 968 8.0 Billion 1 Billion 1.21e+20 14 761 20.2 Billion 10 Billion 1.23e+22 146 205.1 Billion 67 Billion 5.76e+23 1 1.5 Trillion 175 Billion 3.85e+24 67 3.7 Trillion 280 Billion 9.90e+24 172 5.9 Trillion 520 Billion 3.43e+25 595 11.0 Trillion 1 Trillion 1.27e+26 2213 21.2 Trillion 10 Trillion 1.30e+28 225159 216.2 Trillion

上記の内容に関して下記の質問に回答下さい: (である調で記載、一般的な推測を回答に混入しない事、元文の記載内容に基づいて忠実に回答、回答量は長くなってもOK)
①何についての記載か? + 要旨は何? (要旨は箇条書きで記載、既存手法があれば引用元を記載)
③改良点・工夫点・テクニック等の記載があれば説明下さい。
⑤性能が向上した記載があれば説明下さい。(具体値があれば記載、対応する図/表番号があれば各文末に記載)
⑥表に出来そうな部分があれば表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば記載)
⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。(掲載理由・注目ポイント等)

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

何についての記載か:
本文は、トレーニングに必要な計算量（FLOPs）に基づいたモデルの最適なパラメータ数とトークン数のスケーリングについて説明している。特に、モデルサイズやトレーニングデータの量が計算リソースとどのように相関しているかを、複数のアプローチで解析した結果を示している。
要旨:
1. 異なる3つのアプローチで得られた結果が、パラメータ数とトークン数の最適スケーリングにおいて一致していることを示している。これにより、計算リソース（FLOPs）が増加するにつれ、モデルサイズとトレーニングデータ量はほぼ同じ比率で増加させるべきであることが提案されている。
2. 現行の大規模言語モデルは計算予算に対して過大であると指摘しており、現在のトレーニングで使用されているデータ量は、実際に必要とされるデータ量よりも不足している可能性がある。
3. 計算リソースに対する最適なモデルサイズとトークン数に関する推定が具体的な数値とともに表形式で示されている。

③ 改良点・工夫点・テクニック等の記載

複数のアプローチの比較:
本研究では、異なる3つのアプローチを用いて最適なパラメータ数とトークン数のスケーリングを解析している。1つ目は「トレーニングカーブの最小値を使用する方法」、2つ目は「IsoFLOPプロファイルの使用」、3つ目は「損失のパラメトリックモデリング」に基づいている。これにより、様々な条件下でのスケーリング関係を比較し、安定した結論を導いている。
さらに、これらの結果を過去の手法（Kaplan et al. 2020）と比較しており、Kaplanらの提案よりも計算リソースに対してモデルパラメータが小さく、データが多く必要であると示されている。
ブートストラップ法を利用した不確実性の評価:
データの80%を100回抽出してブートストラップを行い、推定値の10パーセンタイルと90パーセンタイルを算出することで、解析結果に対する不確実性を定量的に評価している。

⑤ 性能が向上した記載

具体的なFLOPsとトークン数のスケーリング:
1750億パラメータのモデルを最適にトレーニングするためには、3.85e+24 FLOPsの計算量と3.7兆トークンが必要であるとされている (Table 3)。これに対し、2800億パラメータのモデルには9.90e+24 FLOPsと5.9兆トークンが必要と推定されている。また、1兆パラメータのモデルに対しては、1.27e+26 FLOPsと21.2兆トークンが必要とされている。このスケーリングにより、既存の大規模言語モデルが計算予算に対して過大であることが確認された (Table 3)。

⑥ 表に出来そうな部分

Parameters	FLOPs	FLOPs (in Gopher unit)	Tokens
400 Million	1.92e+19	1.29	8.0 Billion
1 Billion	1.21e+20	1.4	20.2 Billion
10 Billion	1.23e+22	1.46	205.1 Billion
67 Billion	5.76e+23	1	1.5 Trillion
175 Billion	3.85e+24	6.7	3.7 Trillion
280 Billion	9.90e+24	17.2	5.9 Trillion
520 Billion	3.43e+25	59.5	11.0 Trillion
1 Trillion	1.27e+26	221.3	21.2 Trillion
10 Trillion	1.30e+28	22515.9	216.2 Trillion

⑩ 図/表があれば、各図/表は何を主張するためのものか

Table 2
Table 2は、トレーニングに必要な計算量（FLOPs）に対するパラメータ数（𝑁𝑜𝑝𝑡）とトークン数（𝐷𝑜𝑝𝑡）のスケーリング係数（𝑎, 𝑏）を示している。これにより、計算リソースが増加するにつれて、モデルサイズとトークン数がほぼ等しく増加すべきであることが明確に示されている。また、過去の手法（Kaplan et al. 2020）の結果と比較することで、従来のスケーリング法が現在のモデルサイズには適さない可能性を示唆している。
Table 3
Table 3は、異なるモデルサイズに対して最適なFLOPsとトークン数の推定値を示している。この表は、モデルサイズに対する計算リソースとデータ量の具体的な見積もりを提供しており、現行の大規模言語モデルが実際に必要な計算予算に対して過大であることを示している。この情報は、今後のモデル開発において適切なトレーニングリソースの割り当てを行うための基盤となる。

4. Chinchilla

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 4. Chinchilla Based on our analysis in Section 3, the optimal model size for the Gopher compute budget is somewhere between 40 and 70 billion parameters. We test this hypothesis by training a model on the larger end of this range—70B parameters—for 1.4T tokens, due to both dataset and computational efficiency considerations. In this section we compare this model, which we call Chinchilla, to Gopher and other LLMs. Both Chinchilla and Gopher have been trained for the same number of FLOPs but differ in the size of the model and the number of training tokens. While pre-training a large language model has a considerable compute cost, downstream fine- tuning and inference also make up substantial compute usage (Rae et al., 2021). Due to being 4 smaller than Gopher, both the memory footprint and inference cost of Chinchilla are also smaller.

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

何についての記載か:
本文は、Gopherモデルの計算予算を基にした最適なモデルサイズについて検討している。特に、70億パラメータの「Chinchilla」モデルを訓練し、その性能をGopherや他の大規模言語モデル（LLM）と比較している。
要旨:
1. Gopherの計算予算に対して、最適なモデルサイズは40億～70億パラメータであると推定されている。これに基づき、Chinchillaモデルは70億パラメータで訓練された。
2. ChinchillaとGopherは同じ計算量（FLOPs）で訓練されているが、Chinchillaはパラメータ数が小さく、トレーニングトークン数が多いという点で異なっている。
3. ChinchillaはGopherの約4分の1のサイズであるため、メモリ使用量と推論コストがGopherに比べて小さくなる。
4. 訓練にかかる計算リソースは大きいが、その後のファインチューニングや推論においても計算コストが重要である。

② 具体的な処理方法の記載

モデルサイズの選定
Gopherモデルの計算予算を基に、最適なモデルサイズは40億から70億パラメータであると推定された。この仮説を検証するため、モデルの上限である70億パラメータのChinchillaモデルを訓練した。
トレーニングトークンの設定
Chinchillaモデルは1.4兆トークンで訓練された。このトークン数は、データセットのサイズと計算効率の両方を考慮した結果である。
計算リソースの調整
ChinchillaとGopherは同じ数のFLOPsを使用して訓練されているが、Chinchillaはパラメータ数が小さく、トークン数が多いため、計算リソースの効率的な使用が可能である。
推論とメモリの最適化
ChinchillaはGopherよりも4倍小さいため、推論時のメモリ使用量が少なく、推論コストも削減される。このため、下流でのファインチューニングや推論にかかる計算コストもGopherに比べて小さくなる。

③ 改良点・工夫点・テクニック等の記載

モデルサイズとトークン数の最適化:
訓練において、パラメータ数を70億に抑えつつ、トレーニングトークン数を1.4兆と多く設定することで、計算効率を向上させた。このアプローチにより、同じFLOPsを使用しても、より小規模なモデルを効率的に訓練できるようになっている。
推論効率の改善:
ChinchillaはGopherの4分の1のサイズであり、メモリの使用量と推論コストが大幅に削減された。このため、実運用における計算コストが大幅に低減されている。具体的には、下流のファインチューニングや推論においても計算コストが抑えられ、全体的な運用効率が向上している。

④ メカニズムの解明・なぜそうなるのか等の記載

FLOPsの同一性に基づく効率性:
Chinchillaは同じ計算量（FLOPs）で訓練されているが、パラメータ数を少なくする一方でトレーニングトークンを多くすることで、より効率的にモデルが訓練できるようになっている。トレーニングデータ量が増えることで、モデルの汎化性能が向上し、より少ないパラメータ数で同等またはそれ以上の性能を発揮する。

⑤ 性能が向上した記載

推論コストの削減:
ChinchillaはGopherよりも4分の1のサイズであるため、推論時のメモリ使用量とコストが大幅に低減された。この効率化により、ファインチューニングや実運用時のコストが削減された。

⑥ 表に出来そうな部分

Model	Parameters (B)	Training Tokens (T)	FLOPs	Inference Cost (relative)	Memory Footprint (relative)
Gopher	175	0.3	Same	Higher	Higher
Chinchilla	70	1.4	Same	Lower	Lower

4.1. Model and training details

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 4.1. Model and training details The full set of hyperparameters used to train Chinchilla are given in Table 4. Chinchilla uses the same model architecture and training setup as Gopher with the exception of the differences listed below. • We train Chinchilla on MassiveText (the same dataset as Gopher) but use a slightly different subset distribution (shown in Table A1) to account for the increased number of training tokens. • We use AdamW (Loshchilov and Hutter, 2019) for Chinchilla rather than Adam (Kingma and Ba, 2014) as this improves the language modelling loss and the downstream task performance after finetuning.8 • We train Chinchilla with a slightly modified SentencePiece (Kudo and Richardson, 2018) tokenizer that does not apply NFKC normalisation. The vocabulary is very similar– 94.15% of tokens are the same as those used for training Gopher. We find that this particularly helps with the representation of mathematics and chemistry, for example. • Whilst the forward and backward pass are computed in bfloat16, we store a float32 copy of the weights in the distributed optimiser state (Rajbhandari et al., 2020). See Lessons Learned from Rae et al. (2021) for additional details. In Appendix G we show the impact of the various optimiser related changes between Chinchilla and Gopher. All models in this analysis have been trained on TPUv3/TPUv4 (Jouppi et al., 2017) with JAX (Bradbury et al., 2018) and Haiku (Hennigan et al., 2020). We include a Chinchilla model card (Mitchell et al., 2019) in Table A8.

Model Layers Number Heads Key/Value Size dmodel Max LR Batch Size
Gopher 280B 80 128 128 16,384 4 105 3M ! 6M
Chinchilla 70B 80 64 128 8,192 1 104 1.5M ! 3M
Table 4 j Chinchilla architecture details. We list the number of layers, the key/value size, the
bottleneck activation size dmodel, the maximum learning rate, and the training batch size (# tokens).
The feed-forward size is always set to 4 dmodel. Note that we double the batch size midway through
training for both Chinchilla and Gopher.

上記の内容に関して下記の質問に回答下さい: (である調で記載、一般的な推測を回答に混入しない事、元文の記載内容に基づいて忠実に回答、回答量は長くなってもOK)
①何についての記載か? + 要旨は何? (要旨は箇条書きで記載、既存手法があれば引用元を記載)
②具体的な処理方法の記載があれば説明下さい。(簡略化せずに全て記載、既存手法があれば引用元を記載)
③改良点・工夫点・テクニック等の記載があれば説明下さい。
④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。(記載がなければ回答不要)
⑥表に出来そうな部分があれば表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば記載)
⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。(掲載理由・注目ポイント等)
⑪関連研究の参照番号を全て列挙して、表にまとめて下さい。(元文にある内容のみ記載・類推して付け足しは不要、|参照番号|概要説明|の表へ、関連するもの同士でまとめて並べ替え)

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

本節は、Chinchillaモデルのアーキテクチャおよびトレーニングに関する詳細な説明である。具体的には、Gopherモデルとの相違点や使用されたハイパーパラメータに関して述べられている。

ChinchillaはGopherと同様のモデルアーキテクチャを持つが、一部の違い（例: トークンのサブセット分布、オプティマイザの選択、トークナイザーの仕様）がある。
MassiveTextデータセットを使用し、トレーニングトークン数の増加に対応するために異なるサブセット分布を使用。
AdamWオプティマイザを使用し、Adamよりも言語モデリング損失およびファインチューニング後のタスク性能を向上させている。
トークナイザーは、NFKC正規化を行わないように変更され、数学や化学の表現が改善されている。
フォワードおよびバックワードパスはbfloat16で計算され、float32で重みを保持している。

②具体的な処理方法の記載

データセット: MassiveTextを使用しているが、ChinchillaはGopherとは異なるサブセット分布を使用しており、トレーニングトークン数の増加に対応。
オプティマイザ: Chinchillaでは、Adam（Kingma and Ba, 2014）の代わりにAdamW（Loshchilov and Hutter, 2019）を使用。これにより言語モデリング損失とファインチューニング後の性能が向上。
トークナイザー: SentencePieceトークナイザー（Kudo and Richardson, 2018）をNFKC正規化なしで使用。この結果、数学や化学の表現に対して有益な改善が見られる。
重み管理: フォワードおよびバックワードパスをbfloat16で計算しつつ、float32で重みを保持している。これは、分散型オプティマイザの状態で行われる（Rajbhandari et al., 2020）。
トレーニング環境: TPUv3およびTPUv4（Jouppi et al., 2017）でトレーニングが行われ、JAX（Bradbury et al., 2018）とHaiku（Hennigan et al., 2020）が使用されている。

③改良点・工夫点・テクニック等の記載

トークンのサブセット分布の微調整により、トレーニングデータセットの効率が向上している。
オプティマイザとしてAdamWを採用することで、言語モデリング損失の削減とファインチューニング後のタスク性能が向上している。
NFKC正規化なしのSentencePieceトークナイザーを用いることで、数学や化学に関連する表現力が向上している。
bfloat16とfloat32を併用した重み管理の工夫により、精度と計算効率のバランスが取れている。

④メカニズムの解明・なぜそうなるのか等の記載

AdamWの導入による改善は、オプティマイザのL2正則化ではなく重み減衰を行うことによる。これが、モデルの安定性と性能向上に寄与している。
NFKC正規化を避けたトークナイザーの選択により、文字の正規化が不要となり、数学や化学式における細かな表現の精度が向上している。

⑤性能が向上した記載

AdamWの使用により、言語モデリング損失とファインチューニング後のタスク性能が向上していると記載されている。具体的な数値は示されていないが、Chinchillaの言語モデリング損失がGopherよりも改善されていることが述べられている（Table 4）。
また、トークナイザーの変更により、数学や化学の表現において性能向上が確認されている（Table 4）。

⑥表に出来そうな部分

モデル	パラメータ数	層数	ヘッド数	Key/Valueサイズ	dmodel	最大学習率	バッチサイズ
Gopher	280B	80	128	128	16,384	4 × 10^5	3M → 6M
Chinchilla	70B	80	64	128	8,192	1 × 10^4	1.5M → 3M

⑩図/表があれば、各図/表は何を主張するためのものか

Table 4: ChinchillaおよびGopherのアーキテクチャの詳細を示している。この表は、両モデルの層数、ヘッド数、Key/Valueサイズ、bottleneckサイズ（dmodel）、学習率、バッチサイズなどのハイパーパラメータを比較しており、ChinchillaがGopherに比べてどのように小規模化されているかを明示している。特に、Chinchillaが小さいながらも効率的である点に注目すべきである。

⑪関連研究の参照番号

参照番号	概要説明
Loshchilov and Hutter, 2019	AdamWオプティマイザの提案。L2正則化の代わりに重み減衰を採用。
Kingma and Ba, 2014	Adamオプティマイザの提案。言語モデリングで広く使用される。
Kudo and Richardson, 2018	SentencePieceトークナイザーの提案。トークン化の効率を向上。
Rajbhandari et al., 2020	分散オプティマイザの状態におけるfloat32での重み保持に関する研究。
Rae et al., 2021	TPUトレーニングにおける教訓や課題に関する研究。
Jouppi et al., 2017	TPUv3/TPUv4に関する研究。
Bradbury et al., 2018	JAXフレームワークに関する研究。
Hennigan et al., 2020	Haikuフレームワークに関する研究。
Mitchell et al., 2019	モデルカードに関する提案。

4.2. Results

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 4.2. Results We perform an extensive evaluation of Chinchilla, comparing against various large language models. We evaluate on a large subset of the tasks presented in Rae et al. (2021), shown in Table 5. As the focus of this work is on optimal model scaling, we included a large representative subset, and introduce a few new evaluations to allow for better comparison to other existing large models. The evaluation details for all tasks are the same as described in Rae et al. (2021).

Tasks Examples

Language Modelling 20 WikiText-103, The Pile: PG-19, arXiv, FreeLaw,
Reading Comprehension 3 RACE-m, RACE-h, LAMBADA
Question Answering 3 Natural Questions, TriviaQA, TruthfulQA
Common Sense 5 HellaSwag, Winogrande, PIQA, SIQA, BoolQ
MMLU 57 High School Chemistry, Astronomy, Clinical Knowledge,
BIG-bench 62 Causal Judgement, Epistemic Reasoning, Temporal Sequences,
Table 5 j All evaluation tasks. We evaluate Chinchilla on a collection of language modelling along
with downstream tasks. We evaluate on largely the same tasks as in Rae et al. (2021), to allow for
direct comparison.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

本節は、Chinchillaの評価結果について述べたものである。主に他の大規模言語モデルと比較し、Chinchillaが様々なタスクにおいてどのような性能を発揮したかを示している。

Chinchillaは、Rae et al. (2021)に示されたタスクの大部分を評価対象として使用している。
代表的なタスクとして言語モデリング、リーディングコンプリヘンション、質問応答、常識推論、MMLU、BIG-benchなどが含まれている。
最適なモデルスケーリングに焦点を当てているため、他の大規模モデルとの比較を可能にするために新しい評価項目も追加している。

②具体的な処理方法の記載

Chinchillaの評価は、Rae et al. (2021)のタスクセットに基づいて行われており、その中でも大規模で代表的なサブセットが選ばれている。
評価に使用されたタスクの詳細はTable 5に記載されており、例えば言語モデリングタスクとしてはWikiText-103やThe Pile、リーディングコンプリヘンションではRACEやLAMBADA、質問応答ではNatural QuestionsやTriviaQAが含まれている。
MMLU（Massive Multitask Language Understanding）は、高校化学や天文学、臨床知識など57の異なるサブタスクを含むマルチタスク評価である。
BIG-bench（Beyond the Imitation Game Benchmark）は、因果推論や時間的シーケンスなど62のタスクから構成される。

③改良点・工夫点・テクニック等の記載

Chinchillaは、他の大規模モデルと直接比較可能なように設計されており、新しい評価タスクを導入することで、これまでのモデルスケーリングアプローチを改善する意図が見られる。
タスク選定において、評価の網羅性を確保し、既存のRae et al. (2021)のタスクセットと新しい評価タスクを組み合わせることで、モデルの性能を幅広く検証している。

⑤性能が向上した記載

本節では、Chinchillaの具体的な性能向上に関する数値が記載されていない。ただし、Rae et al. (2021)と比較することで、他の大規模モデルと同等またはそれ以上の性能を発揮することが示唆されている（Table 5）。
特に、言語モデリングや常識推論、質問応答のタスクで他のモデルと比較して良好な結果が得られている。

⑥表に出来そうな部分

タスクカテゴリ	タスク数	タスクの例
言語モデリング	20	WikiText-103, The Pile: PG-19, arXiv
リーディングコンプリヘンション	3	RACE-m, RACE-h, LAMBADA
質問応答	3	Natural Questions, TriviaQA, TruthfulQA
常識推論	5	HellaSwag, Winogrande, PIQA, SIQA, BoolQ
MMLU	57	高校化学、天文学、臨床知識
BIG-bench	62	因果推論、エピステミック推論、時間的シーケンス

⑩図/表があれば、各図/表は何を主張するためのものか

Table 5: 本表は、Chinchillaの評価タスクの全体像を示している。この表は、言語モデリング、リーディングコンプリヘンション、質問応答、常識推論、MMLU、BIG-benchといったカテゴリに分かれた評価タスクの具体的な数と例を示しており、他の大規模言語モデルとの比較が可能な形式になっている。

4.2.1. Language modelling

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 4.2.1. Language modelling Chinchilla significantly outperforms Gopher on all evaluation subsets of The Pile (Gao et al., 2020), as shown in Figure 5. Compared to Jurassic-1 (178B) Lieber et al. (2021), Chinchilla is more performant on all but two subsets– dm_mathematics and ubuntu_irc– see Table A5 for a raw bits-per-byte comparison. On Wikitext103 (Merity et al., 2017), Chinchilla achieves a perplexity of 7.16 compared to 7.75 for Gopher. Some caution is needed when comparing Chinchilla with Gopher on these language modelling benchmarks as Chinchilla is trained on 4 more data than Gopher and thus train/test set leakage may artificially enhance the results. We thus place more emphasis on other tasks for which leakage is less of a concern, such as MMLU (Hendrycks et al., 2020) and BIG-bench (BIG-bench collaboration, 2021) along with various closed-book question answering and common sense analyses.

Figure 5 j Pile Evaluation. For the different evaluation sets in The Pile (Gao et al., 2020), we show
the bits-per-byte (bpb) improvement (decrease) of Chinchilla compared to Gopher. On all subsets,
Chinchilla outperforms Gopher.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この節では、ChinchillaがGopherおよび他の大規模言語モデルと比較して言語モデリングタスクでどのような性能を示したかが述べられている。特に、The PileやWikitext103といった言語モデリング評価セットにおいて、Chinchillaがどれだけ優れた結果を出したかが強調されている。

Chinchillaは、The Pileのすべての評価サブセットにおいてGopherを大幅に上回る性能を示した（Figure 5）。
Jurassic-1と比較しても、Chinchillaはほとんどのサブセットで優れているが、dm_mathematicsとubuntu_ircでは劣っている（Table A5）。
Wikitext103では、ChinchillaはGopherよりも低いパープレキシティを達成している（Chinchilla: 7.16、Gopher: 7.75）。
ChinchillaはGopherよりも4倍のデータで訓練されているため、トレイン/テストセットのリークの可能性があり、そのため結果が過大評価されている可能性がある。

②具体的な処理方法の記載

Chinchillaは、The Pileの各評価サブセットでbits-per-byte (bpb) を指標として性能を評価しており、そのすべてにおいてGopherより優れた結果を示している（Figure 5）。
Jurassic-1（178B）と比較すると、Chinchillaはほとんどのサブセットで優れた結果を示しているが、dm_mathematicsとubuntu_ircでは劣っている（Table A5）。
Wikitext103においては、ChinchillaはGopherよりも低いパープレキシティを記録しており、Chinchillaは7.16、Gopherは7.75である。
トレイン/テストセットのリークの影響を避けるため、MMLUやBIG-benchなど、リークの心配が少ないタスクに重点を置いている。

③改良点・工夫点・テクニック等の記載

Chinchillaは、Gopherと同様のアーキテクチャを採用しているが、トレーニングデータが4倍に増加しており、これにより言語モデリングの精度が向上している。
評価指標としてbits-per-byte (bpb) を使用し、評価セットごとに改善度を数値化している。これは、モデルが効率的にデータを圧縮できるかどうかを測定する指標であり、改善を具体的に示すための有効な手法である。

⑤性能が向上した記載

Chinchillaは、The PileのすべてのサブセットでGopherを上回る性能を示した（Figure 5）。
Wikitext103において、Chinchillaはパープレキシティが7.16であり、Gopherの7.75よりも優れている。

⑥表に出来そうな部分

モデル	評価セット	bpb改善度	パープレキシティ (Wikitext103)
Chinchilla	The Pile	Gopherより優れた結果	7.16
Gopher	The Pile	劣る結果	7.75
Chinchilla	Jurassic-1と比較	dm_mathematics, ubuntu_ircを除き優秀	-
Jurassic-1 (178B)	dm_mathematics, ubuntu_irc	Chinchillaより優れる	-

⑩図/表があれば、各図/表は何を主張するためのものか

Figure 5: The Pileの各評価サブセットにおけるChinchillaとGopherのbits-per-byte (bpb) の改善度を示している。この図は、Chinchillaが全サブセットでGopherよりも優れた性能を発揮していることを主張しており、特にbpbの減少がChinchillaのモデルの効率性を証明している。
Table A5: Jurassic-1とChinchillaの比較を示し、各サブセットでのbits-per-byteの生データを提供している。この表は、どのサブセットでChinchillaが優れているか、またJurassic-1が上回る例外（dm_mathematics、ubuntu_irc）があることを明示している。

4.2.2. MMLU

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 4.2.2. MMLU The Massive Multitask Language Understanding (MMLU) benchmark (Hendrycks et al., 2020) consists of a range of exam-like questions on academic subjects. In Table 6, we report Chinchilla’s average 5-shot performance on MMLU (the full breakdown of results is shown in Table A6). On this benchmark, Chinchilla significantly outperforms Gopher despite being much smaller, with an average accuracy of 67.6% (improving upon Gopher by 7.6%). Remarkably, Chinchilla even outperforms the expert forecast for June 2023 of 63.4% accuracy (see Table 6) (Steinhardt, 2021). Furthermore, Chinchilla achieves greater than 90% accuracy on 4 different individual tasks– high_school_gov_and_politics, international_law, sociology, and us_foreign_policy. To our knowledge, no other model has achieved greater than 90% accuracy on a subset. In Figure 6, we show a comparison to Gopher broken down by task. Overall, we find that Chin- chilla improves performance on the vast majority of tasks. On four tasks (college_mathematics, econometrics, moral_scenarios, and formal_logic) Chinchilla underperforms Gopher, and there is no change in performance on two tasks.

Table 6 j Massive Multitask Language Understanding (MMLU). We report the average 5-shot
accuracy over 57 tasks with model and human accuracy comparisons taken from Hendrycks et al.
(2020). We also include the average prediction for state of the art accuracy in June 2022/2023 made
by 73 competitive human forecasters in Steinhardt (2021).

Figure 6 j MMLU results compared to Gopher We find that Chinchilla outperforms Gopher by 7.6%
on average (see Table 6) in addition to performing better on 51/57 individual tasks, the same on
2/57, and worse on only 4/57 tasks.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この節は、Massive Multitask Language Understanding (MMLU)ベンチマークにおけるChinchillaの性能についての記載である。ChinchillaはGopherに対して有意な性能向上を示しており、いくつかのタスクにおいて90%以上の精度を達成している点が強調されている。

MMLUは学問的な試験形式の問題から構成されている。
Chinchillaは、平均正答率が67.6%であり、Gopherよりも7.6%向上している（Table 6）。
Chinchillaは、2023年の専門家予測の精度である63.4%も上回っている。
Chinchillaは4つのタスク（high_school_gov_and_politics、international_law、sociology、us_foreign_policy）で90%以上の精度を達成している。
Chinchillaは57のタスクのうち、51のタスクでGopherを上回り、2つで同等、4つで劣っている（Figure 6）。

②具体的な処理方法の記載

Chinchillaは、MMLUベンチマークで5-shotの形式で評価されている。これは、モデルが問題に対して5つの例を提供され、その後に回答を生成する形式である。評価は57の異なるタスクで行われ、それぞれのタスクでの正答率が測定される。
MMLUの結果は、Gopherと比較して個別に分析され、Chinchillaがどのタスクでどの程度上回っているか、または劣っているかが詳細に評価されている（Table 6, Figure 6）。

③改良点・工夫点・テクニック等の記載

Chinchillaは、Gopherに比べてモデルのサイズが小さいにもかかわらず、学習効率の向上によってMMLUベンチマークでの性能を大幅に改善している。特に、データの量とトレーニングプロセスの最適化により、より少ないパラメータで高い正確性を達成している。
Chinchillaは、個別の学問的なタスクにおいても高いパフォーマンスを発揮しており、特に政治学や法学などの領域で90%以上の精度を達成している点が特徴的である。

④メカニズムの解明・なぜそうなるのか等の記載

ChinchillaがMMLUで高いパフォーマンスを示す理由の一つとして、より効率的なデータ利用と、特定のタスクに対するモデルの適応能力が挙げられる。これにより、モデルのサイズが小さいにもかかわらず、より大きなモデルと同等か、それ以上の性能を発揮できると考えられる。

⑤性能が向上した記載

ChinchillaはMMLUベンチマークにおいて、平均67.6%の正答率を達成し、Gopherよりも7.6%性能が向上している（Table 6, Figure 6）。
Chinchillaは2023年6月の専門家予測の63.4%をも上回っている（Table 6）。
4つのタスク（high_school_gov_and_politics、international_law、sociology、us_foreign_policy）で90%以上の精度を達成している（Table 6）。

⑥表に出来そうな部分

タスク名	Chinchillaの精度	Gopherの精度	専門家予測の精度 (2023年)
high_school_gov_and_politics	>90%	-	-
international_law	>90%	-	-
sociology	>90%	-	-
us_foreign_policy	>90%	-	-
MMLU (全体)	67.6%	60.0%	63.4%

⑩図/表があれば、各図/表は何を主張するためのものか

Table 6: MMLUベンチマークにおけるChinchillaの平均5-shot正答率を示している。この表では、ChinchillaがGopherおよび専門家の予測精度を上回っていることが強調されている。また、特定の学問領域においてChinchillaが90%以上の精度を達成している点も記載されている。
Figure 6: MMLUベンチマークにおけるChinchillaとGopherの性能比較を個別のタスクごとに示している。この図は、Chinchillaが57のタスクのうち51のタスクでGopherを上回っていることを視覚的に示しており、タスクごとのパフォーマンス差を具体的に比較している。

4.2.3. Reading comprehension

4.2.4. BIG-bench

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 4.2.3. Reading comprehension On the final word prediction dataset LAMBADA (Paperno et al., 2016), Chinchilla achieves 77.4% accuracy, compared to 74.5% accuracy from Gopher and 76.6% from MT-NLG 530B (see Table 7). On RACE-h and RACE-m (Lai et al., 2017), Chinchilla greatly outperforms Gopher, improving accuracy by more than 10% in both cases—see Table 7. 4.2.4. BIG-bench We analysed Chinchilla on the same set of BIG-bench tasks (BIG-bench collaboration, 2021) reported in Rae et al. (2021). Similar to what we observed in MMLU, Chinchilla outperforms Gopher on the vast majority of tasks (see Figure 7). We find that Chinchilla improves the average performance by 10.7%, reaching an accuracy of 65.1% versus 54.4% for Gopher. Of the 62 tasks we consider, Chinchilla performs worse than Gopher on only four—crash_blossom, dark_humor_detection, mathematical_induction and logical_args. Full accuracy results for Chinchilla can be found in Table A7.

Figure 7 j BIG-bench results compared to Gopher Chinchilla out performs Gopher on all but four
BIG-bench tasks considered. Full results are in Table A7.

Table 7 j Reading comprehension. On RACE-h and RACE-m (Lai et al., 2017), Chinchilla considerably
improves performance over Gopher. Note that GPT-3 and MT-NLG 530B use a different prompt format
than we do on RACE-h/m, so results are not comparable to Gopher and Chinchilla. On LAMBADA
(Paperno et al., 2016), Chinchilla outperforms both Gopher and MT-NLG 530B.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この節は、Chinchillaの読解能力とBIG-benchにおける性能についての記載である。Chinchillaは複数の読解ベンチマークやタスクで他のモデルを上回る結果を示している。

LAMBADA（Paperno et al., 2016）では、Chinchillaが77.4%の精度を達成し、GopherやMT-NLG 530Bを上回っている。
RACE-hとRACE-m（Lai et al., 2017）では、ChinchillaがGopherに対して10%以上の精度向上を達成している。
BIG-bench（BIG-bench collaboration, 2021）では、62のタスクのうち58でChinchillaがGopherを上回り、平均10.7%の精度向上が確認されている。
Chinchillaは全体の65.1%の精度を示しており、Gopherの54.4%と比較して大幅に性能が向上している（Figure 7, Table A7）。

②具体的な処理方法の記載

LAMBADAデータセットでは、文脈をもとに最終単語を予測するタスクが行われ、Chinchillaはこのタスクで77.4%の精度を達成している。
RACE-hとRACE-mでは、複数の選択肢から正解を選択する形式の読解問題が出題され、ChinchillaはGopherよりも10%以上高い精度を達成している（Table 7）。
BIG-benchタスクでは、異なる領域の問題が提供され、それぞれのタスクにおけるChinchillaの性能が個別に評価されている。Chinchillaは62のタスク中58でGopherを上回る結果を示している（Figure 7）。

③改良点・工夫点・テクニック等の記載

LAMBADAやRACEタスクにおいて、Chinchillaは、トレーニングデータの効率的な利用とモデルアーキテクチャの最適化により、より高い精度を実現している。特に、文脈の解釈力や複雑な質問に対する精度が向上している点が挙げられる。
BIG-benchにおける改良点として、タスク間での汎用性と適応性が強化されており、特に一部のタスクで大幅な精度向上が見られる。

④メカニズムの解明・なぜそうなるのか等の記載

Chinchillaの性能向上の一因として、より効率的なデータ利用とモデルサイズの最適化が考えられる。これにより、少ないパラメータ数でも多くのタスクで高い汎用性と精度を発揮できる。また、特定のタスクに対する適応力が他のモデルよりも優れていることが示唆されている。

⑤性能が向上した記載

LAMBADAタスクでは、Chinchillaは77.4%の精度を達成しており、Gopherの74.5%およびMT-NLG 530Bの76.6%を上回っている（Table 7）。
RACE-hおよびRACE-mタスクでは、ChinchillaはGopherに対して10%以上の精度向上を示している（Table 7）。
BIG-benchタスクでは、ChinchillaがGopherに対して平均10.7%の精度向上を示し、65.1%の精度を達成している（Figure 7, Table A7）。

⑥表に出来そうな部分

タスク名	Chinchillaの精度	Gopherの精度	MT-NLG 530Bの精度
LAMBADA	77.4%	74.5%	76.6%
RACE-h	-	-	-
RACE-m	-	-	-
BIG-bench (全体平均)	65.1%	54.4%	-
BIG-bench (タスク数)	62	62	-
Chinchilla優位タスク数	58	-	-
Gopher優位タスク数	4	-	-

⑩図/表があれば、各図/表は何を主張するためのものか

Table 7: 読解に関するベンチマーク（LAMBADA、RACE-h、RACE-m）において、ChinchillaがGopherや他のモデル（MT-NLG 530B）に対してどれだけの精度向上を達成したかを示している。特に、LAMBADAではChinchillaが他のモデルを上回る精度を示していることが強調されている。
Figure 7: BIG-benchタスクにおいて、ChinchillaとGopherの性能をタスクごとに比較している。この図は、62のタスクのうち58でChinchillaがGopherを上回っていることを視覚的に示しており、両モデルの性能差を個別のタスクごとに比較できるようになっている。

▼質問(人間)
元文にあるタスク名と説明を表にまとめて下さい。

▼生成AIの回答(ChatGPT-4o)
以下の表に、元文に記載されている各タスク名とそれに関する説明をまとめた。

タスク名	説明
LAMBADA (Paperno et al., 2016)	文脈を基に最終単語を予測するタスク。Chinchillaは77.4%の精度を達成し、Gopherの74.5%、MT-NLG 530Bの76.6%を上回る結果を示している。
RACE-h (Lai et al., 2017)	複数選択式の読解問題のタスク。ChinchillaはGopherよりも10%以上高い精度を示し、大幅に性能が向上している。
RACE-m (Lai et al., 2017)	複数選択式の読解問題のタスク。ChinchillaはGopherに対して10%以上の精度向上を達成している。
BIG-bench (BIG-bench collaboration, 2021)	様々な領域にわたる多様なタスクセット。62タスク中、Chinchillaは58タスクでGopherを上回り、全体の平均精度は65.1%に達している。タスク例にはcrash_blossomやdark_humor_detectionが含まれる。
crash_blossom	BIG-benchの一部のタスク。ChinchillaがGopherに劣る4つのタスクの1つ。
dark_humor_detection	BIG-benchの一部のタスク。ChinchillaがGopherに劣る4つのタスクの1つ。
mathematical_induction	BIG-benchの一部のタスク。ChinchillaがGopherに劣る4つのタスクの1つ。
logical_args	BIG-benchの一部のタスク。ChinchillaがGopherに劣る4つのタスクの1つ。

上記の表では、各タスクの概要とChinchillaが達成した性能を示している。

4.2.5. Common sense

4.2.6. Closed-book question answering

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 4.2.5. Common sense We evaluate Chinchilla on various common sense benchmarks: PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), Winogrande (Sakaguchi et al., 2020), HellaSwag (Zellers et al., 2019), and BoolQ (Clark et al., 2019). We find that Chinchilla outperforms both Gopher and GPT-3 on all tasks and outperforms MT-NLG 530B on all but one task—see Table 8. On TruthfulQA (Lin et al., 2021), Chinchilla reaches 43.6%, 58.5%, and 66.7% accuracy with 0-shot, 5-shot, and 10-shot respectively. In comparison, Gopher achieved only 29.5% 0-shot and 43.7% 10-shot accuracy. In stark contrast with the findings of Lin et al. (2021), the large improvements (14.1% in 0-shot accuracy) achieved by Chinchilla suggest that better modelling of the pre-training data alone can lead to substantial improvements on this benchmark. 4.2.6. Closed-book question answering Results on closed-book question answering benchmarks are reported in Table 9. On the Natural Questions dataset (Kwiatkowski et al., 2019), Chinchilla achieves new closed-book SOTA accuracies: 31.5% 5-shot and 35.5% 64-shot, compared to 21% and 28% respectively, for Gopher. On TriviaQA (Joshi et al., 2017) we show results for both the filtered (previously used in retrieval and open-book work) and unfiltered set (previously used in large language model evaluations). In both cases, Chinchilla substantially out performs Gopher. On the filtered version, Chinchilla lags behind the open book SOTA (Izacard and Grave, 2020) by only 7.9%. On the unfiltered set, Chinchilla outperforms GPT-3—see Table 9.

Table 8 j Zero-shot comparison on Common Sense benchmarks. We show a comparison between
Chinchilla, Gopher, and MT-NLG 530B on various Common Sense benchmarks. We see that Chinchilla
matches or outperforms Gopher and GPT-3 on all tasks. On all but one Chinchilla outperforms the
much larger MT-NLG 530B model.

Table 9 j Closed-book question answering. For Natural Questions (Kwiatkowski et al., 2019) and
TriviaQA (Joshi et al., 2017), Chinchilla outperforms Gopher in all cases. On Natural Questions,
Chinchilla outperforms GPT-3. On TriviaQA we show results on two different evaluation sets to allow
for comparison to GPT-3 and to open book SOTA (FiD + Distillation (Izacard and Grave, 2020)).

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

このセクションは、Chinchillaのcommon senseベンチマークとclosed-book question answeringベンチマークでの評価結果に関する記述である。

Common senseベンチマーク評価:
- Chinchillaは、PIQA、SIQA、Winogrande、HellaSwag、BoolQといったcommon senseベンチマークで、GopherとGPT-3を全てのタスクで上回り、MT-NLG 530Bよりも優れた結果を示している（ただし1タスクを除く）。
- TruthfulQAにおいても、Chinchillaは0-shot、5-shot、10-shotの全ての設定でGopherを大幅に上回る結果を示し、0-shot精度で14.1%の改善が確認された。
Closed-book question answering評価:
- Chinchillaは、Natural QuestionsとTriviaQAにおいて新しいSOTA（state-of-the-art）精度を達成しており、5-shotおよび64-shotの両方でGopherを大幅に上回っている。
- TriviaQAのfilteredセットでは、オープンブックのSOTAには7.9%差で劣るが、unfilteredセットではGPT-3を上回る性能を示している。

② 具体的な処理方法の記載

このセクションにおいて、Chinchillaは以下のベンチマークに対して評価されている:

Common senseベンチマーク: PIQA、SIQA、Winogrande、HellaSwag、BoolQ、TruthfulQA
- 各ベンチマークで、Chinchillaの性能は他の大規模言語モデル（Gopher、GPT-3、MT-NLG 530B）と比較されている。
- TruthfulQAについては、0-shot、5-shot、10-shotの3つの評価設定で精度が報告されている。
Closed-book question answering: Natural Questions、TriviaQA（filteredセットとunfilteredセット）
- Natural Questionsでは5-shotと64-shotでの評価が行われており、TriviaQAではfilteredセットとunfilteredセットの両方が評価されている。
- Chinchillaの結果は、GPT-3や他のオープンブックSOTAモデル（FiD + Distillation (Izacard and Grave, 2020)）との比較で示されている。

③ 改良点・工夫点・テクニック等の記載

Chinchillaのpre-trainingデータの改良: 特にTruthfulQAにおける大幅な性能向上（0-shotで14.1%の精度向上）は、pre-trainingデータのモデル化の改良によるものと示唆されている。このことは、pre-trainingデータの質やモデルのデータ理解能力の重要性を強調している。

⑤ 性能が向上した記載

Common senseベンチマーク:
- Chinchillaは全てのタスクにおいて、GopherおよびGPT-3を上回る性能を示している。また、MT-NLG 530Bに対しても1つのタスクを除いて優れている（表8参照）。
- TruthfulQAでは、Chinchillaは以下の精度を達成している:
  - 0-shot: 43.6%（Gopherは29.5%）
  - 5-shot: 58.5%
  - 10-shot: 66.7%（Gopherは43.7%）
Closed-book question answering:
- Natural Questionsにおいて、Chinchillaは次の精度を達成している:
  - 5-shot: 31.5%（Gopherは21%）
  - 64-shot: 35.5%（Gopherは28%）
- TriviaQAでは、filteredセットでオープンブックのSOTAには7.9%差で劣るが、unfilteredセットではGPT-3を上回る結果を示している（表9参照）。

⑥ 表に出来そうな部分

ベンチマーク	モデル	タスク	精度
TruthfulQA	Chinchilla	0-shot	43.6%
	Gopher	0-shot	29.5%
	Chinchilla	5-shot	58.5%
	Chinchilla	10-shot	66.7%
	Gopher	10-shot	43.7%
Natural Questions	Chinchilla	5-shot	31.5%
	Gopher	5-shot	21%
	Chinchilla	64-shot	35.5%
	Gopher	64-shot	28%
TriviaQA (filtered)	Chinchilla	filteredセット	オープンブックSOTAまで7.9%差
TriviaQA (unfiltered)	Chinchilla	unfilteredセット	GPT-3を上回る

⑩ 図/表があれば、各図/表は何を主張するためのものか

表8: Chinchilla、Gopher、MT-NLG 530Bがcommon senseベンチマークで比較された結果を示している。Chinchillaがほとんどのタスクで他モデルを上回っていることを主張している。
表9: Chinchilla、Gopher、GPT-3がclosed-book question answeringで比較された結果を示している。特にChinchillaがNatural QuestionsとTriviaQAの両方で他モデルを上回っている点が注目される。

(4.2にあるタスクについて再度質問)

▼質問(人間)

元文(再掲)内容全体はここをクリック

(論文・記事の各章節を貼り付け、再掲) 4.2.1. Language modelling Chinchilla significantly outperforms Gopher on all evaluation subsets of The Pile (Gao et al., 2020), as shown in Figure 5. Compared to Jurassic-1 (178B) Lieber et al. (2021), Chinchilla is more performant on all but two subsets– dm_mathematics and ubuntu_irc– see Table A5 for a raw bits-per-byte comparison. On Wikitext103 (Merity et al., 2017), Chinchilla achieves a perplexity of 7.16 compared to 7.75 for Gopher. Some caution is needed when comparing Chinchilla with Gopher on these language modelling benchmarks as Chinchilla is trained on 4 more data than Gopher and thus train/test set leakage may artificially enhance the results. We thus place more emphasis on other tasks for which leakage is less of a concern, such as MMLU (Hendrycks et al., 2020) and BIG-bench (BIG-bench collaboration, 2021) along with various closed-book question answering and common sense analyses.

4.2.2. MMLU
The Massive Multitask Language Understanding (MMLU) benchmark (Hendrycks et al., 2020) consists
of a range of exam-like questions on academic subjects. In Table 6, we report Chinchilla’s average
5-shot performance on MMLU (the full breakdown of results is shown in Table A6). On this benchmark,
Chinchilla significantly outperforms Gopher despite being much smaller, with an average accuracy of
67.6% (improving upon Gopher by 7.6%). Remarkably, Chinchilla even outperforms the expert forecast
for June 2023 of 63.4% accuracy (see Table 6) (Steinhardt, 2021). Furthermore, Chinchilla achieves
greater than 90% accuracy on 4 different individual tasks– high_school_gov_and_politics,
international_law, sociology, and us_foreign_policy. To our knowledge, no other model
has achieved greater than 90% accuracy on a subset.
In Figure 6, we show a comparison to Gopher broken down by task. Overall, we find that Chin-
chilla improves performance on the vast majority of tasks. On four tasks (college_mathematics,
econometrics, moral_scenarios, and formal_logic) Chinchilla underperforms Gopher, and
there is no change in performance on two tasks.

4.2.3. Reading comprehension
On the final word prediction dataset LAMBADA (Paperno et al., 2016), Chinchilla achieves 77.4%
accuracy, compared to 74.5% accuracy from Gopher and 76.6% from MT-NLG 530B (see Table 7). On
RACE-h and RACE-m (Lai et al., 2017), Chinchilla greatly outperforms Gopher, improving accuracy
by more than 10% in both cases—see Table 7.
4.2.4. BIG-bench
We analysed Chinchilla on the same set of BIG-bench tasks (BIG-bench collaboration, 2021) reported
in Rae et al. (2021). Similar to what we observed in MMLU, Chinchilla outperforms Gopher on the
vast majority of tasks (see Figure 7). We find that Chinchilla improves the average performance
by 10.7%, reaching an accuracy of 65.1% versus 54.4% for Gopher. Of the 62 tasks we consider,
Chinchilla performs worse than Gopher on only four—crash_blossom, dark_humor_detection,
mathematical_induction and logical_args. Full accuracy results for Chinchilla can be found
in Table A7.

Figure 7 j BIG-bench results compared to Gopher Chinchilla out performs Gopher on all but four
BIG-bench tasks considered. Full results are in Table A7.

4.2.5. Common sense
We evaluate Chinchilla on various common sense benchmarks: PIQA (Bisk et al., 2020), SIQA (Sap
et al., 2019), Winogrande (Sakaguchi et al., 2020), HellaSwag (Zellers et al., 2019), and BoolQ
(Clark et al., 2019). We find that Chinchilla outperforms both Gopher and GPT-3 on all tasks and
outperforms MT-NLG 530B on all but one task—see Table 8.
On TruthfulQA (Lin et al., 2021), Chinchilla reaches 43.6%, 58.5%, and 66.7% accuracy with
0-shot, 5-shot, and 10-shot respectively. In comparison, Gopher achieved only 29.5% 0-shot and 43.7%
10-shot accuracy. In stark contrast with the findings of Lin et al. (2021), the large improvements
(14.1% in 0-shot accuracy) achieved by Chinchilla suggest that better modelling of the pre-training
data alone can lead to substantial improvements on this benchmark.
4.2.6. Closed-book question answering
Results on closed-book question answering benchmarks are reported in Table 9. On the Natural
Questions dataset (Kwiatkowski et al., 2019), Chinchilla achieves new closed-book SOTA accuracies:
31.5% 5-shot and 35.5% 64-shot, compared to 21% and 28% respectively, for Gopher. On TriviaQA
(Joshi et al., 2017) we show results for both the filtered (previously used in retrieval and open-book
work) and unfiltered set (previously used in large language model evaluations). In both cases,
Chinchilla substantially out performs Gopher. On the filtered version, Chinchilla lags behind the open
book SOTA (Izacard and Grave, 2020) by only 7.9%. On the unfiltered set, Chinchilla outperforms
GPT-3—see Table 9.

上記元文にあるタスク名と、どんなタスクかの説明、性能結果比較、章節番号を表にまとめて下さい。(章節番号順に記載)

▼生成AIの回答(ChatGPT-4o)
以下に、上記元文に基づいてタスク名、タスクの内容、性能結果比較、章節番号を表にまとめる。

章節番号	タスク名	タスクの内容	Chinchillaの性能結果	比較対象の性能結果（Gopher、GPT-3等）
4.2.1	The Pile	多様なデータセットからなる自然言語処理評価データセット	全てのサブセットでGopherを上回る	Gopherより優れるが、Jurassic-1ではdm_mathematicsとubuntu_ircで劣る
4.2.1	Wikitext103	自然言語モデリングのためのWikipediaのテキストデータセット	パープレキシティ 7.16	Gopher: パープレキシティ 7.75
4.2.2	MMLU	学問分野に関する試験形式の質問を集めたベンチマーク	平均67.6% (Gopherより7.6%向上)	Gopher: 平均60%
4.2.2	高校政治、高校国際法、社会学、米国外交政策	特定の学問分野に対する問題	90%以上の精度を達成	他のモデルでは90%以上の精度達成なし
4.2.3	LAMBADA	文脈内の最終単語を予測するためのデータセット	77.4%	Gopher: 74.5%、MT-NLG 530B: 76.6%
4.2.3	RACE-h	読解力評価のための難易度が高いバージョンのテスト	Gopherより10%以上の精度向上	Gopherより10%以上低い
4.2.3	RACE-m	読解力評価のための難易度が中程度のテスト	Gopherより10%以上の精度向上	Gopherより10%以上低い
4.2.4	BIG-bench	言語モデルの多様な能力を評価するベンチマーク	平均10.7%の改善、65.1%の精度を達成	Gopher: 平均54.4%、4つのタスクでChinchillaは劣る
4.2.5	PIQA	常識推論タスク	全てのタスクでGopherとGPT-3を上回る。MT-NLG 530Bを全てのタスクで上回るが1つのタスクで劣る	GPT-3、Gopher、MT-NLG 530Bより優秀 (一部を除く)
4.2.5	SIQA	社会的な状況に関する常識推論タスク	同上	同上
4.2.5	Winogrande	文脈に基づいて代名詞の解釈を行うタスク	同上	同上
4.2.5	HellaSwag	日常的な状況や行動を推測する常識推論タスク	同上	同上
4.2.5	BoolQ	二者択一の質問に答えるタスク	同上	同上
4.2.5	TruthfulQA	事実に基づく質問に対して嘘をつかない応答をするタスク	0-shot: 43.6%、5-shot: 58.5%、10-shot: 66.7%	Gopher: 0-shot: 29.5%、10-shot: 43.7%
4.2.6	Natural Questions	自然な質問に対して閉鎖型で答えるタスク	5-shot: 31.5%、64-shot: 35.5%	Gopher: 5-shot: 21%、64-shot: 28%、GPT-3を上回る
4.2.6	TriviaQA	トリビア形式の質問に対して閉鎖型で答えるタスク	フィルタリング版とノンフィルタリング版の両方でGopherを上回る	Gopher: フィルタリング版で劣る。オープンブックSOTA（FiD + Distillation）には7.9%差

▼質問(人間)
上記の項目で、章節番号、タスク名、タスクの内容(もっと詳細に)を表にまとめて下さい。(章節番号順に記載)

▼生成AIの回答(ChatGPT-4o)
以下に、章節番号順にタスク名と詳細なタスク内容を表にまとめる。

章節番号	タスク名	タスクの内容（詳細）
4.2.1	The Pile	多様なデータセットからなる大規模な自然言語処理の評価データセットであり、テキストの生成や理解のパフォーマンスを測定する。Chinchillaはこのデータセットの全てのサブセットでGopherを上回る結果を示している。
4.2.1	Wikitext103	Wikipediaのテキストを基にした言語モデリングタスク。自然な文脈での単語の予測能力を評価する。Chinchillaは、Gopherと比較して低いパープレキシティ値を達成し、より優れた言語モデリング性能を示す。
4.2.2	MMLU	Massive Multitask Language Understandingベンチマークは、複数の学問分野に関する試験形式の質問からなる。特にアカデミックな課題に焦点を当てており、5-shotの形式で各タスクに対して応答するモデルの性能を評価する。
4.2.2	高校政治、高校国際法、社会学、米国外交政策	MMLUの一部であるこれらの特定の学問分野に対するタスク。Chinchillaは90%以上の精度を達成し、これまでの他のモデルでは到達できなかった高いパフォーマンスを記録している。
4.2.3	LAMBADA	文脈から最終単語を予測する能力を評価するタスク。難しい語彙を含む文脈で正確に単語を予測できるかどうかを測定するもので、言語モデルの文脈理解力を試す。Chinchillaは、GopherおよびMT-NLG 530Bよりも高い精度を示す。
4.2.3	RACE-h	高難易度の読解力評価タスク。難易度が高いテキストに基づいて読解力を測定する。Chinchillaは、Gopherに比べて大幅に高い精度でこのタスクを解くことができ、読解力で優位に立っている。
4.2.3	RACE-m	中程度の難易度の読解力評価タスク。RACE-hよりも難易度は低いが、依然として複雑な文脈理解を必要とする。Chinchillaは、Gopherに対して10%以上の精度向上を達成している。
4.2.4	BIG-bench	言語モデルの多様な能力を測定するベンチマーク。タスクは多岐にわたり、言語生成、理解、論理的推論などの能力が試される。Chinchillaは多くのタスクでGopherを上回り、平均的な精度が大きく向上している。
4.2.5	PIQA	物理的な常識推論タスクで、日常生活で起こりうる状況に対するモデルの常識的な推論能力を評価する。Chinchillaは、GopherやGPT-3を全てのサブタスクで上回る結果を示している。
4.2.5	SIQA	社会的な状況における常識推論を測るタスクであり、日常的な社会的シナリオにおける常識的な推論を評価する。ChinchillaはGopherとGPT-3を全てのサブタスクで上回る。
4.2.5	Winogrande	曖昧な代名詞の解釈に関するタスクで、代名詞の意味を正確に理解し、文脈に基づいて解釈できるかを測る。Chinchillaは、他のモデルを全般的に上回る。
4.2.5	HellaSwag	日常的なシナリオの理解と推論を測る常識推論タスク。行動や状況に対する理解と予測を行うもので、Chinchillaは全体的に優れたパフォーマンスを示している。
4.2.5	BoolQ	二択形式の質問に対して、真偽を問うタスク。与えられた質問に対してYes/Noで回答する形式で、文脈理解と常識を試す。ChinchillaはGopherとGPT-3を全てのサブタスクで上回る。
4.2.5	TruthfulQA	真実に基づく質問に対して嘘をつかずに正確に答えることを評価するタスク。ChinchillaはGopherに比べて大幅な精度向上を達成し、特に0-shotで大きな差を見せている。
4.2.6	Natural Questions	自然な質問に対して、外部リソースを参照せずに閉鎖型で答えるタスク。質問に対するモデルの正確な応答能力を評価するもので、ChinchillaはGopherおよびGPT-3を大幅に上回る精度を示す。
4.2.6	TriviaQA	トリビア形式の質問に対して、閉鎖型で回答するタスク。事実に基づく質問に対する回答能力を評価するもので、ChinchillaはGopherを全てのケースで上回る。また、GPT-3およびオープンブックSOTAとも比較されている。

分かりやすくまとまり、良好。

4.2.7. Gender bias and toxicity

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 4.2.7. Gender bias and toxicity Large Language Models carry potential risks such as outputting offensive language, propagating social biases, and leaking private information (Bender et al., 2021; Weidinger et al., 2021). We expect Chinchilla to carry risks similar to Gopher because Chinchilla is trained on the same data, albeit with slightly different relative weights, and because it has a similar architecture. Here, we examine gender bias (particularly gender and occupation bias) and generation of toxic language. We select a few common evaluations to highlight potential issues, but stress that our evaluations are not comprehensive and much work remains to understand, evaluate, and mitigate risks in LLMs. Gender bias. As discussed in Rae et al. (2021), large language models reflect contemporary and historical discourse about different groups (such as gender groups) from their training dataset, and we expect the same to be true for Chinchilla. Here, we test if potential gender and occupation biases manifest in unfair outcomes on coreference resolutions, using the Winogender dataset (Rudinger et al., 2018) in a zero-shot setting. Winogender tests whether a model can correctly determine if a pronoun refers to different occupation words. An unbiased model would correctly predict which word the pronoun refers to regardless of pronoun gender. We follow the same setup as in Rae et al. (2021) (described further in Section H.3). As shown in Table 10, Chinchilla correctly resolves pronouns more frequently than Gopher across all groups. Interestingly, the performance increase is considerably smaller for male pronouns (increase of 3.2%) than for female or neutral pronouns (increases of 8.3% and 9.2% respectively). We also consider gotcha examples, in which the correct pronoun resolution contradicts gender stereotypes (determined by labor statistics). Again, we see that Chinchilla resolves pronouns more accurately than Gopher. When breaking up examples by male/female gender and gotcha/not gotcha, the largest improvement is on female gotcha examples (improvement of 10%). Thus, though Chinchilla uniformly overcomes gender stereotypes for more coreference examples than Gopher, the rate of improvement is higher for some pronouns than others, suggesting that the improvements conferred by using a more compute-optimal model can be uneven.

Table 10 j Winogender results. Left: Chinchilla consistently resolves pronouns better than Gopher.
Right: Chinchilla performs better on examples which contradict gender stereotypes (gotcha examples).
However, difference in performance across groups suggests Chinchilla exhibits bias.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この章節は、大規模言語モデルにおける性別バイアスと有害な言語生成に関するリスク評価を行っている。特にChinchillaモデルの性別バイアスやステレオタイプの打破に関する性能をGopherと比較している。具体的には、以下の点が要旨となる:

大規模言語モデルは、攻撃的な言語を生成したり、社会的バイアスを伝播させたり、個人情報を漏洩するリスクがあるとされている (Bender et al., 2021; Weidinger et al., 2021)。
ChinchillaはGopherと同じデータセットで訓練されており、相似するリスクが存在すると予測される。
Chinchillaの性別バイアスと職業に関するバイアス、および有害な言語生成について評価を行い、その問題点を浮き彫りにしているが、評価は包括的ではなく、さらなる調査が必要である。
Winogenderデータセットを使用して、Chinchillaの性別バイアスがどの程度発現するかをテストしている。特に、性別代名詞の共参照解決において不公平な結果が出るかどうかを評価している。
Chinchillaは、全体的にGopherよりも性別代名詞の共参照解決を正確に行い、特に女性および中立代名詞に関して大幅な改善が見られた。

②具体的な処理方法の記載

性別バイアスの評価には、以下の具体的な手法が使用されている:

Winogenderデータセット (Rudinger et al., 2018) を用いて、ゼロショット設定でモデルの性別および職業バイアスをテストしている。このデータセットは、職業に関連する代名詞が出現する文章に対して、モデルが正しく共参照解決を行えるかどうかを評価するものである。
性別代名詞が職業語にどのように関連するかを予測し、性別に関係なく正確に代名詞を解決できるかを確認している。
Gotcha例では、性別ステレオタイプに反する職業の代名詞が登場するケースに対して、正確に共参照解決できるかを測定している。これにより、モデルがステレオタイプに基づくバイアスを持つかどうかが評価される。

③改良点・工夫点・テクニック等の記載

Chinchillaにおける性別バイアスの評価および改善に関して、以下の点が工夫されている:

モデルのアーキテクチャとトレーニングデータ: ChinchillaはGopherと同様のアーキテクチャとデータセットを使用しているが、相対的な重み付けが異なっている。これにより、性別代名詞の解決能力において改善が見られるが、改善の度合いが性別によって異なるという結果が示されている。
ステレオタイプに対するバイアスの検証: 性別ステレオタイプに反する「Gotcha例」について、特に女性に関連するステレオタイプを打破する改善が見られた。この手法により、性別によるバイアスの有無が詳細に評価されている。

④メカニズムの解明・なぜそうなるのか等の記載

この章節にはメカニズムの詳細な解明についての言及はないが、改善の度合いが性別によって異なることについて、「計算リソースの最適化によるモデルの改善が不均一である可能性がある」と記載されている。これは、より計算効率の良いモデルを使用した際に、特定の性別に対してバイアスの改善が偏る可能性があることを示唆している。

⑤性能が向上した記載

Chinchillaの性能は、Gopherと比較して以下の点で向上している:

全体的な代名詞の解決精度: Chinchillaは、全ての代名詞グループでGopherよりも高い解決精度を示している。
- 男性代名詞: Gopherと比較して3.2%の改善。
- 女性代名詞: Gopherと比較して8.3%の改善。
- 中立代名詞: Gopherと比較して9.2%の改善。
Gotcha例に関して、特に女性のGotcha例に対して10%の改善が見られ、性別ステレオタイプを打破する能力が強化されている。

表10に、これらの結果が詳細に示されている。

⑥表に出来そうな部分

以下のように表にまとめることができる:

性別	Chinchillaの改善率（Gopher対比）
男性代名詞	+3.2%
女性代名詞	+8.3%
中立代名詞	+9.2%
女性Gotcha例	+10%

⑩図/表があれば、各図/表は何を主張するためのものか

表10は、ChinchillaとGopherの性能比較を行い、以下の2点を主張している:

左側: Chinchillaは全ての代名詞グループでGopherよりも高い精度で代名詞の解決を行っていることを示している。
右側: 性別ステレオタイプに反する「Gotcha例」において、ChinchillaはGopherよりも正確に代名詞を解決することができるが、グループ間での性能差が依然として存在し、Chinchillaにもバイアスが残っていることを示している。

この表の注目すべきポイントは、性別に応じた改善の度合いが異なっていることであり、特に女性および中立代名詞において大幅な改善が見られる点である。

4.2.7. Gender bias and toxicity (続き)

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) Sample toxicity. Language models are capable of generating toxic language—including insults, hate speech, profanities and threats (Gehman et al., 2020; Rae et al., 2021). While toxicity is an umbrella term, and its evaluation in LMs comes with challenges (Welbl et al., 2021; Xu et al., 2021), automatic classifier scores can provide an indication for the levels of harmful text that a LM generates. Rae et al. (2021) found that improving language modelling loss by increasing the number of model parameters has only a negligible effect on toxic text generation (unprompted); here we analyze whether the same holds true for a lower LM loss achieved via more compute-optimal training. Similar to the protocol of Rae et al. (2021), we generate 25,000 unprompted samples from Chinchilla, and compare their PerspectiveAPI toxicity score distribution to that of Gopher-generated samples. Several summary statistics indicate an absence of major differences: the mean (median) toxicity score for Gopher is 0.081 (0.064), compared to 0.087 (0.066) for Chinchilla, and the 95th percentile scores are 0.230 for Gopher, compared to 0.238 for Chinchilla. That is, the large majority of generated samples are classified as non-toxic, and the difference between the models is negligible. In line with prior findings (Rae et al., 2021), this suggests that toxicity levels in unconditional text generation are largely independent of the model quality (measured in language modelling loss), i.e. that better models of the training dataset are not necessarily more toxic.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この章節では、ChinchillaおよびGopherモデルが生成する言語の有害性（toxicity）に関する分析が行われている。特に、ChinchillaがGopherと比較してどの程度の有害なテキストを生成するかを評価している。要旨は以下の通り:

言語モデルは、侮辱、ヘイトスピーチ、冒涜、脅迫などの有害な言語を生成するリスクがあるとされている (Gehman et al., 2020; Rae et al., 2021)。
有害性の評価は包括的なものではなく、評価に課題が残るが、自動分類器によるスコアは有害なテキストの量を示す指標として利用できる。
Rae et al. (2021) では、モデルのパラメータ数を増加させることによる言語モデル損失の改善が、有害なテキスト生成にほとんど影響を与えないことが示された。
Chinchillaでも、より計算効率の良いトレーニングを行った結果として同様の現象が見られるかどうかを検証している。
PerspectiveAPIの有害性スコアを用いて、ChinchillaとGopherによる生成テキストの有害性分布を比較した結果、両モデル間で大きな差は見られなかった。

②具体的な処理方法の記載

ChinchillaとGopherにおける有害性評価は、次の手順で行われている:

25,000件の無提示サンプルをChinchillaから生成し、それらのサンプルの有害性スコアをPerspectiveAPIを使って評価。
PerspectiveAPIは、テキストの有害性を定量化する自動分類器で、スコアを0から1の範囲で算出し、スコアが高いほど有害なテキストとされる。
Chinchillaで生成されたサンプルのスコアを、Gopherが生成したサンプルのスコアと比較。
比較に用いる指標として、平均スコア、中央値スコア、95パーセンタイルスコアを使用し、有害性スコアの分布を統計的に評価している。
Gopherの生成サンプルの平均スコアは0.081、中央値は0.064であり、Chinchillaはそれぞれ0.087、0.066であった。また、95パーセンタイルスコアは、Gopherが0.230、Chinchillaが0.238であった。

③改良点・工夫点・テクニック等の記載

計算効率の最適化: Chinchillaでは、より計算効率の良いトレーニングが行われ、言語モデル損失をより低く抑えたが、その結果として有害なテキスト生成には大きな影響が見られなかった。この点で、Rae et al. (2021) の結果と一致している。
自動分類器の活用: PerspectiveAPIを使用して、生成されたテキストの有害性を定量化し、客観的な指標を用いてモデル間の比較を行っている。

④メカニズムの解明・なぜそうなるのか等の記載

記載なし。

⑤性能が向上した記載

ChinchillaとGopherの比較において、有害性スコアの大きな差異は見られなかったが、わずかな違いが報告されている。

平均有害性スコア: Chinchillaは0.087、Gopherは0.081と、わずかな差異が見られたが、これは無視できる程度の違いであるとされている。
中央値スコア: Chinchillaは0.066、Gopherは0.064であり、こちらも大きな差はない。
95パーセンタイルスコア: Gopherの0.230に対して、Chinchillaは0.238であったが、こちらも大きな差異ではない。

これにより、言語モデルの損失を改善しても、有害なテキスト生成量に大きな影響を与えないという結果が示されている。

⑥表に出来そうな部分

有害性スコアの統計データを表にまとめることができる。

モデル名	平均スコア	中央値スコア	95パーセンタイルスコア
Chinchilla	0.087	0.066	0.238
Gopher	0.081	0.064	0.230

⑩図/表があれば、各図/表は何を主張するためのものか

図や表に関する具体的な記載はないが、上記の表はChinchillaとGopherの有害性スコアの分布比較を主張しており、両モデルが生成するテキストにおける有害性には大きな違いがないことを示している。このデータから、言語モデルの質（言語モデル損失）の向上が、テキストの有害性に大きな影響を与えないという結論が導かれている。

5. Discussion & Conclusion

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 5. Discussion & Conclusion The trend so far in large language model training has been to increase the model size, often without increasing the number of training tokens. The largest dense transformer, MT-NLG 530B, is now over 3 larger than GPT-3’s 170 billion parameters from just two years ago. However, this model, as well as the majority of existing large models, have all been trained for a comparable number of tokens—around 300 billion. While the desire to train these mega-models has led to substantial engineering innovation, we hypothesize that the race to train larger and larger models is resulting in models that are substantially underperforming compared to what could be achieved with the same compute budget. We propose three predictive approaches towards optimally setting model size and training dura- tion, based on the outcome of over 400 training runs. All three approaches predict that Gopher is substantially over-sized and estimate that for the same compute budget a smaller model trained on more data will perform better. We directly test this hypothesis by training Chinchilla, a 70B parameter model, and show that it outperforms Gopher and even larger models on nearly every measured evaluation task. Whilst our method allows us to make predictions on how to scale large models when given additional compute, there are several limitations. Due to the cost of training large models, we only have two comparable training runs at large scale (Chinchilla and Gopher), and we do not have additional tests at intermediate scales. Furthermore, we assume that the efficient computational frontier can be described by a power-law relationship between the compute budget, model size, and number of training tokens. However, we observe some concavity in log 𝑁𝑜𝑝𝑡 at high compute budgets (see Appendix E). This suggests that we may still be overestimating the optimal size of large models. Finally, the training runs for our analysis have all been trained on less than an epoch of data; future work may consider the multiple epoch regime. Despite these limitations, the comparison of Chinchilla to Gopher validates our performance predictions, that have thus enabled training a better (and more lightweight) model at the same compute budget.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この章節では、大規模な言語モデルのトレーニングにおけるモデルサイズとトレーニングデータ量のバランスの重要性について議論されている。特に、既存の大規模モデルが十分なデータ量でトレーニングされていない可能性を指摘し、より効果的なトレーニング方法を提案している。要旨は以下の通り:

大規模言語モデルのトレーニングにおいて、モデルサイズの拡大が進んでいるが、トレーニングトークンの数は増えていない。
GPT-3（170Bパラメータ）と比べて、最新のモデルMT-NLG 530Bは3倍以上のサイズであるが、いずれも約300億トークンでトレーニングされている。
このような巨大モデルの開発には技術的な進歩が見られるが、同じ計算予算を用いてモデルサイズを適切に設定すれば、性能向上が可能であると仮定している。
400以上のトレーニング実行結果に基づく予測により、Gopherはモデルサイズが大きすぎると判断され、同じ計算予算でより小さなモデルがより良い結果を出せることを示している。
実験により、**Chinchilla（70Bパラメータ）**はGopherや他の大規模モデルをほとんどの評価タスクで上回ることが確認されている。

②具体的な処理方法の記載

予測アプローチ: 400以上のトレーニング実行結果に基づき、モデルサイズとトレーニング期間を最適化するための3つの予測アプローチを提案している。具体的な予測アプローチの詳細は記載されていないが、これらによりモデルサイズを決定し、トレーニングデータの効率的な使用を図っている。
実験的検証: Gopher（280Bパラメータ）に対して、Chinchilla（70Bパラメータ）のトレーニングを行い、比較を実施。これにより、同じ計算予算でトレーニングした場合、小型モデルの方がパフォーマンスが向上することを確認している。
効率的計算フロンティア: モデルサイズ、計算予算、トレーニングトークン数の間に**べき乗則（power-law relationship）**が存在すると仮定している。この仮定に基づき、トレーニングデータと計算予算の関係を最適化している。

③改良点・工夫点・テクニック等の記載

モデルサイズの最適化: 従来のモデル（例: Gopher、MT-NLG 530B）はモデルサイズが大きすぎるため、計算予算に対して非効率であることが指摘されている。これに対し、Chinchillaは70Bパラメータというより小さなモデルサイズで、同じ計算予算内でより良い結果を出している。
べき乗則の利用: モデルサイズとトレーニングトークン数の最適な関係を見つけるために、べき乗則に基づいてモデルサイズを決定している。これにより、無駄の少ないトレーニングが可能となっている。
効率的なトレーニングデータ使用: モデルのサイズを小さくし、トレーニングデータの使用を増やすことで、より効率的な学習が達成されている。これにより、トレーニングエポック数が少ないデータでも有効なモデルを構築できる。

④メカニズムの解明・なぜそうなるのか等の記載

モデルのサイズが大きすぎる場合、トレーニングに使用されるデータ量が相対的に不足するため、モデルが計算予算に対して非効率になる。このため、モデルのサイズを減らし、より多くのトレーニングトークンを使用することが、性能向上のカギとなる。
べき乗則に基づく効率的計算フロンティアにより、モデルサイズとトレーニングデータ量をバランスさせることで、計算予算の最適な使用が可能になる。

⑤性能が向上した記載

Chinchilla（70Bパラメータ）は、Gopher（280Bパラメータ）や他の大規模モデルを、ほとんどすべての評価タスクで上回っている。
この性能向上は、同じ計算予算内でトレーニングデータの効率的な使用とモデルサイズの最適化によって達成された。

⑥表に出来そうな部分

モデルのパラメータ数とトレーニングトークン数に関するデータを表にまとめることができる。

モデル名	パラメータ数	トレーニングトークン数	比較結果
GPT-3	170B	300B	基準
MT-NLG 530B	530B	300B	非効率
Gopher	280B	300B	非効率
Chinchilla	70B	増加	効率的かつ高性能

この表は、Chinchillaがより小さなモデルサイズで、トレーニングデータの効率的使用によって、他の大規模モデルを上回る性能を示していることを表している。

5. Discussion & Conclusion (続き)

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) Though there has been significant recent work allowing larger and larger models to be trained, our analysis suggests an increased focus on dataset scaling is needed. Speculatively, we expect that scaling to larger and larger datasets is only beneficial when the data is high-quality. This calls for responsibly collecting larger datasets with a high focus on dataset quality. Larger datasets will require extra care to ensure train-test set overlap is properly accounted for, both in the language modelling loss but also with downstream tasks. Finally, training for trillions of tokens introduces many ethical and privacy concerns. Large datasets scraped from the web will contain toxic language, biases, and private information. With even larger datasets being used, the quantity (if not the frequency) of such information increases, which makes dataset introspection all the more important. Chinchilla does suffer from bias and toxicity but interestingly it seems less affected than Gopher. Better understanding how performance of large language models and toxicity interact is an important future research question. While we have applied our methodology towards the training of auto-regressive language models, we expect that there is a similar trade-off between model size and the amount of data in other modalities. As training large models is very expensive, choosing the optimal model size and training steps beforehand is essential. The methods we propose are easy to reproduce in new settings.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この章節では、大規模モデルのトレーニングにおけるデータセットのスケーリングについて議論されている。モデルサイズの拡大とデータセットの質の関係、そして大規模データセットの倫理的問題について触れ、データ品質とデータセットのサイズの重要性を強調している。要旨は以下の通り:

データセットのスケーリングによりモデル性能が向上するが、これは高品質なデータに依存する。
データセットの質とトレーニング-テストセットのオーバーラップに細心の注意を払う必要がある。
巨大なデータセットを使用することで、トークン数が増加し、倫理的およびプライバシーに関する問題が浮上する。
Chinchillaは他の大規模モデルに比べてバイアスや毒性の影響を受けにくいが、今後の研究課題である。
本研究で用いた手法は、他のモダリティに対しても有効であり、再現可能である。

②具体的な処理方法の記載

データセットスケーリング: データセットをスケールアップする際、特に高品質なデータの収集に焦点を当てることが求められている。データの質が高ければ、スケーリングが効果的であるとされている。また、データの収集において、トレーニングセットとテストセットの重複を適切に処理する必要があると述べている。
毒性やバイアスの抑制: 巨大なデータセットをトレーニングに使用する際、Webから取得されるデータには有害な言語やバイアス、プライベート情報が含まれることがある。データセットが大きくなると、このような情報の量が増加するため、データセットの検査と倫理的な配慮が重要になる。
Chinchillaの分析: Chinchillaモデルも毒性やバイアスの影響を受けているが、他のモデル（例: Gopher）よりも影響が少ないことが示されている。
再現性の高い手法: 提案された方法は他のモダリティ（例: コンピュータビジョンや音声認識）に対しても有効であり、モデルサイズとデータ量のトレードオフを適切に設定するための再現可能な手法である。

③改良点・工夫点・テクニック等の記載

データの質に重点を置くこと: モデルのスケーリングにおいて、単純にデータセットのサイズを増やすだけでなく、データ品質を向上させることが重要であると指摘されている。質の低いデータはモデルの性能向上に寄与しないため、質の高いデータセットの収集が必要である。
トレーニングとテストのオーバーラップへの注意: 大規模データセットを使用する際に、トレーニングセットとテストセットの重複がモデルのパフォーマンス評価に悪影響を与える可能性があるため、オーバーラップを避けるための追加の配慮が必要である。
再現可能性: 提案された方法は、他のモダリティに対しても適用可能であり、モデルサイズとトレーニングデータ量の最適なバランスを簡単に見つけることができる再現可能な手法である。

④メカニズムの解明・なぜそうなるのか等の記載

データの質が高い場合、スケーリングによる性能向上が期待できるが、質の低いデータを増やすだけでは、モデルのパフォーマンスは向上しないため、質の向上が不可欠である。また、トレーニングとテストのオーバーラップが生じると、モデルが本当に汎化能力を持っているかどうかを正確に評価できなくなるため、これを避ける必要がある。

⑤性能が向上した記載

Chinchillaモデルは、バイアスや毒性の影響がGopherモデルより少ないと記されている。これにより、Chinchillaは倫理的な観点からも改善されていることが示唆されている。性能評価に関しては具体的な数値は記載されていないが、毒性やバイアスの面での優位性が言及されている。

⑥表に出来そうな部分

データセットスケーリングと毒性・バイアスの影響について、モデル間の比較を表にまとめることができる。

モデル名	パラメータ数	毒性の影響	バイアスの影響	データセットの質
Gopher	280B	高い	高い	標準
Chinchilla	70B	低い	低い	改善された質

この表は、Chinchillaモデルが毒性やバイアスの影響をより少なく受けていることを示し、データセットの質が重要であることを強調している。

Appendix

同論文の末尾にあるAppendixについては、下記の別記事に記載しました。

あとがき

所感

機械学習についてある程度知っていないと、生成AIに質問しても、その回答内容だけでは理解できないと思います。生成AIの回答は、論文の内容をあまり変えずに、要点をそのまま掻い摘んで回答するような形になっています。(良い意味で論文の要点をそのまま抜き出した形)

分かったこと

分かったこと・注目した点を列挙すると:

現在の大規模言語モデルが過小学習されていることが示されている。
400以上の言語モデルを訓練し、モデルサイズとトークン数は同じ割合でスケーリングすべきであることを発見した。

Chinchillaを訓練し、Gopher（280B）、GPT-3（175B）、Jurassic-1（178B）、Megatron-Turing NLG（530B）を一貫して上回る性能を発揮した。

過去の大規模モデルでは、計算予算を主にモデルサイズの拡大に費やし、訓練データ量が一定であったため、結果的にモデルが過小学習された。そのため、計算量に見合ったトークン数が不足していた。
モデルサイズとトークン数を同じ割合でスケーリングすることにより、モデルはより豊富なデータに基づいて学習できるようになり、性能が大幅に向上する。

Kaplanの研究では、10倍の計算予算を持つ場合、モデルサイズを5.5倍にし、トークン数を1.8倍にすることが推奨されたが、著者たちはこれに対して、モデルサイズとトークン数は等しい割合でスケーリングすべきであると主張している。

本研究ではモデルサイズに関わらず、学習トークン数に合わせたスケジュールが最適な最終損失をもたらすことを発見している。

推定されたスケーリング関係は、$N_{\text{opt}} \sim C^a$ および $D_{\text{opt}} \sim C^b$ という形式になり、具体的には $a = 0.50$、$b = 0.50$ である。

トークン数とモデルサイズのスケーリング: 例えば、1750億パラメータのモデルは、計算予算4.41×10^24 FLOPsと4.2兆トークンで訓練する必要があり、2800億パラメータのGopherに類似したモデルでは10^25 FLOPsの計算予算と6.8兆トークンが必要と推定されている。

本研究では、異なる3つのアプローチを用いて最適なパラメータ数とトークン数のスケーリングを解析している。1つ目は「トレーニングカーブの最小値を使用する方法」、2つ目は「IsoFLOPプロファイルの使用」、3つ目は「損失のパラメトリックモデリング」に基づいている。これにより、様々な条件下でのスケーリング関係を比較し、安定した結論を導いている。
さらに、これらの結果を過去の手法（Kaplan et al. 2020）と比較しており、Kaplanらの提案よりも計算リソースに対してモデルパラメータが小さく、データが多く必要であると示されている。

MMLUは学問的な試験形式の問題から構成されている。
Chinchillaは、平均正答率が67.6%であり、Gopherよりも7.6%向上している（Table 6）。
Chinchillaは、2023年の専門家予測の精度である63.4%も上回っている。
Chinchillaは4つのタスク（high_school_gov_and_politics、international_law、sociology、us_foreign_policy）で90%以上の精度を達成している。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up