More than 1 year has passed since last update.

生成AIを用いてInstructGPTの論文「Training language models to follow instructions with human feedback (2022)」を読んでみた

Last updated at 2024-09-25Posted at 2024-09-25

はじめに

生成AIを用いてInstructGPTの論文「Training language models to follow instructions with human feedback (2022)」の内容を(なるべく)把握してみました。(生成AIが)論文の記載内容を始めから最後まで読んで、実際にどのような記載があるのかを把握します。

(論文の分かりやすい解説記事は見るのですが、実際の論文までチェックしないので、生成AIを使って内容を把握してみました。)

最終モデル(GPT-3)は175Bの大サイズなのに、報酬によって最終モデルに教えるための報酬モデル(RM)は6Bの(100倍以上小さい)方が良いことが分かりました。知識等の全てを教える訳ではなく、回答が好ましいスタイルかだけを教えれば良いので、スタイルのチェックだけであれば小さいモデル(6B)で十分、ということなのかと推測。(その他、末尾の「分かったこと」章を参照)

以降で、ChatGPTに聞いてみた例を記載します。

他例: 同類の方法を使って読んでみた結果

対象の論文

論文: (InstructGPTに関する論文)

[2203.02155] Training language models to follow instructions with human feedback
https://arxiv.org/abs/2203.02155
(PDF: https://arxiv.org/pdf/2203.02155)

質問時の各章節の区切り部分

論文の中にある各章節を、下記のように区切って、部分毎に生成AIに内容を質問していきます。

Abstract
---
1 Introduction
---
1 Introduction (続き)
---
2 Related work
---
3 Methods and experimental details
3.1 High-level methodology
---
3.2 Dataset
---
3.3 Tasks
---
3.4 Human data collection
---
3.5 Models
---
3.5 Models (続き)
---
3.6 Evaluation
---
4 Results
4.1 Results on the API distribution
---
4.2 Results on public NLP datasets
---
4.3 Qualitative results
---
5 Discussion
5.1 Implications for alignment research
---
5.2 Who are we aligning to?
---
5.3 Limitations
---
5.4 Open questions
---
5.5 Broader impacts

生成AIへの質問方法

生成AIを活用して、知りたい記事・論文の1節分(適度な長さ)のテキストをコピー＆ペーストして、その下に質問内容を「①～ ②～ …」と番号付きで書いて、生成AIに渡せば、全質問に一発で的確に回答してくれるので、非常に良好でした。記事全体を読む必要なく、知りたい点の情報だけを収集できます。

生成AIへの質問例:

(論文・記事の各章節を貼り付け)

上記の内容に関して下記の質問に回答下さい: (である調で記載、質問に対して該当するものが無ければ無しと記載)
①何についての記載か? + 要旨は何? (要旨は箇条書きで記載、既存手法があれば引用元を記載)
②具体的な処理方法の記載があれば説明下さい。(簡略化せずに全て記載、既存手法があれば引用元を記載)
③改良点・工夫点・テクニック等の記載があれば説明下さい。
④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。
⑤性能が向上した記載があれば説明下さい。(具体値があれば記載、対応する図/表番号があれば各文末に記載)

続けて下記の質問に追加で回答下さい:
⑥比較の記載があれば違いを表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば明記)
⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば記載)
⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。(必ず正しく数値を抜き取る事、|数値|説明|の表へ)
⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。(|変数名|説明|の表へ)
⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。(掲載理由・注目ポイント等)

※回答が長くなりそうな場合は、適宜、分けて質問: ①②③④⑤、⑥⑦⑧⑨⑩
※その他、不明点があれば、適宜、追加で質問。

質問内容は、記事・論文を読んでいていつも知りたいと思う点(改良点・工夫点・テクニック・メカニズムの解明)にしています。また、表で比較した方が素早く把握できるので、可能であれば記事を表に変換するようにしています。

論文・記事を貼り付けるテキストの長さは、1節分程度の量にとどめた方が、良い回答が得られました。生成AIの回答の文量が多くなってくると、回答が長くなり過ぎないように、生成AIが勝手に(適度に)端折り始めてしまい、重要な点が回答から抜けてしまう可能性が高くなります。

事前知識

機械学習についてある程度知っていないと、生成AIに質問しても、その回答内容だけでは理解できないと思います。生成AIの回答は、論文の内容をあまり変えずに、要点をそのまま掻い摘んで回答するような形になっています。

InstructGPTについての分かりやすい解説記事(下記)等を事前にチェックして、中核部分の内容をあらかじめ分かっていると、理解しやすいと思います。生成AIは実際の細かい処理方法自体を分かりやすく説明してはくれない傾向があります。

注意点

論文のテキスト内容だけを貼り付けて、生成AIに質問しています。論文の中の図・表の部分は貼り付けていません。図・表の内容は生成AIの回答には含まれず、別途論文を見る必要があります。

以降で、生成AIの回答内容が読みにくい・分かりづらい場合は、論文の本文でも同じように書かれてあり、論文の本文を読んでも同じように分かりづらいことが多くあります。論文では、既存研究等があるため、多くの説明を省略・一言だけサラッと書かれてある等、同種の研究に取り組む人でなければ、なかなか全容の理解に辿りつくのは難しい側面があります。この点は、生成AIの回答性能が悪いのではなく、論文という性質上、生じるものと考えています。

生成AIに質問

以降で、ChatGPTに実際に聞いてみた例を記載します。

生成AIへの質問＆回答の全容

生成AIへの質問＆回答の全容・詳細:

Abstract

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) Abstract Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?
この部分では、言語モデルをユーザーの意図に沿って調整するための新たな手法について述べられている。具体的には、GPT-3モデルに対して人間のフィードバックを用いたファインチューニングを行うことで、モデルの出力を改善し、ユーザーの期待により沿ったものにする試みが記載されている。

大規模な言語モデルは、必ずしもユーザーの意図に沿った出力を生成しない問題がある。
ユーザーの意図に合ったモデル出力を実現するために、人間のフィードバックを用いたファインチューニングを実施。
提案手法は、OpenAI APIを介して得られたプロンプトやラベラーが作成したプロンプトを使用してデータセットを収集し、それに基づいてGPT-3をファインチューニング。
人間のフィードバックを活用した強化学習を用いて、モデルのさらなる調整を行い、InstructGPTを作成。
1.3BパラメータのInstructGPTは、175BパラメータのGPT-3と比較して出力品質が高いと評価される。
InstructGPTは、真実性の向上と有害な出力の削減に成功しつつ、公開されているNLPデータセット上での性能の低下は最小限に抑えられている。

②具体的な処理方法の記載

ラベラーによって書かれたプロンプトと、OpenAI APIを通じて提出されたプロンプトを収集し、望ましいモデルの動作のデモンストレーションを集めたデータセットを構築した。
このデータセットを使用し、GPT-3を教師あり学習でファインチューニングした。
次に、モデルの出力に対するランキングデータセットを収集し、このランキングに基づいて強化学習を用いてモデルをさらに調整した。
強化学習のアルゴリズムとして、人間のフィードバックを用いたポリシー最適化を実施。

③改良点・工夫点・テクニック等の記載

人間のフィードバックに基づく教師あり学習と強化学習を組み合わせることで、ユーザーの意図により適合するモデルを作成した。
100倍少ないパラメータを持つモデル（1.3B InstructGPT）が、より大規模なモデル（175B GPT-3）よりも好まれる結果を示したことが、モデルの効率性と調整の有効性を示している。

④メカニズムの解明・なぜそうなるのか等の記載

人間のフィードバックを用いることで、モデルが単純に言語パターンを学習するだけでなく、ユーザーの意図や期待に応じた出力を生成できるように誘導されると考えられている。
パラメータの数が多いことが必ずしも出力品質に比例しないことが、InstructGPTとGPT-3の比較から示唆されている。

⑤性能が向上した記載

InstructGPTは、GPT-3と比較して真実性の向上が見られ、有害な出力の生成も減少している。具体的には、1.3BパラメータのInstructGPTが、175BパラメータのGPT-3よりも好まれる出力を生成する結果を示している (具体的な図/表番号は記載されていない)。

参考:

InstructGPT では GPT-3 をベースに、人によるフィードバックでモデルを強化 (ファインチューニング) することにより、よりユーザの意図に沿った回答や文章を生成しようというものです。

人によるフィードバックを強化学習を使ってファインチューニングしますが、この学習方法を『RLHF; Reinforcement Learning from Human Feedback』と呼び、OpenAI が文章の要約タスクで精力的に研究していた手法です。

1 Introduction

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 1 Introduction Large language models (LMs) can be “prompted” to perform a range of natural language process- ing (NLP) tasks, given some examples of the task as input. However, these models often express unintended behaviors such as making up facts, generating biased or toxic text, or simply not following user instructions (Bender et al., 2021; Bommasani et al., 2021; Kenton et al., 2021; Weidinger et al., 2021; Tamkin et al., 2021; Gehman et al., 2020). This is because the language modeling objective used for many recent large LMs—predicting the next token on a webpage from the internet—is different from the objective “follow the user’s instructions helpfully and safely” (Radford et al., 2019; Brown et al., 2020; Fedus et al., 2021; Rae et al., 2021; Thoppilan et al., 2022). Thus, we say that the language modeling objective is misaligned. Averting these unintended behaviors is especially important for language models that are deployed and used in hundreds of applications. We make progress on aligning language models by training them to act in accordance with the user’s intention (Leike et al., 2018). This encompasses both explicit intentions such as following instructions and implicit intentions such as staying truthful, and not being biased, toxic, or otherwise harmful. Using the language of Askell et al. (2021), we want language models to be helpful (they should help the user solve their task), honest (they shouldn’t fabricate information or mislead the user), and harmless (they should not cause physical, psychological, or social harm to people or the environment). We elaborate on the evaluation of these criteria in Section 3.6. We focus on fine-tuning approaches to aligning language models. Specifically, we use reinforcement learning from human feedback (RLHF; Christiano et al., 2017; Stiennon et al., 2020) to fine-tune GPT-3 to follow a broad class of written instructions (see Figure 2). This technique uses human preferences as a reward signal to fine-tune our models. We first hire a team of 40 contractors to label our data, based on their performance on a screening test (see Section 3.4 and Appendix B.1 for more details). We then collect a dataset of human-written demonstrations of the desired output behavior on (mostly English) prompts submitted to the OpenAI API3 and some labeler-written prompts, and use this to train our supervised learning baselines. Next, we collect a dataset of human-labeled comparisons between outputs from our models on a larger set of API prompts. We then train a reward model (RM) on this dataset to predict which model output our labelers would prefer. Finally, we use this RM as a reward function and fine-tune our supervised learning baseline to maximize this reward using the PPO algorithm (Schulman et al., 2017). We illustrate this process in Figure 2. This procedure aligns the behavior of GPT-3 to the stated preferences of a specific group of people (mostly our labelers and researchers), rather than any broader notion of “human values”; we discuss this further in Section 5.2. We call the resulting models InstructGPT.

上記の内容に関して下記の質問に回答下さい: (である調で記載、質問に対して該当するものが無ければ無しと記載)
①何についての記載か? + 要旨は何? (要旨は箇条書きで記載、既存手法があれば引用元を記載)
②具体的な処理方法の記載があれば説明下さい。(簡略化せずに全て記載、既存手法があれば引用元を記載)
③改良点・工夫点・テクニック等の記載があれば説明下さい。
④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。
⑤性能が向上した記載があれば説明下さい。(具体値があれば記載、対応する図/表番号があれば各文末に記載)
⑥比較の記載があれば違いを表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば明記)
⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば記載)
⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。(必ず正しく数値を抜き取る事、|数値|説明|の表へ)
⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。(|変数名|説明|の表へ)
⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。(掲載理由・注目ポイント等)

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この部分では、言語モデルの目標とユーザーの意図との間の不一致を解消し、言語モデルがユーザーの指示に従い、安全で正確な応答を提供するように調整する手法について述べられている。具体的には、GPT-3を強化学習に基づく人間のフィードバックを用いてファインチューニングすることで、モデルの出力をユーザーの期待に合ったものにするプロセスを説明している。

大規模言語モデルは、予測タスクに優れているが、ユーザーの指示に従うことや、真実性を保持し有害な出力を避けることには課題がある。
GPT-3をユーザーの意図に沿って調整するために、人間のフィードバックを使用した強化学習 (RLHF) を用いる。
ラベラーによるプロンプトとAPIプロンプトからデータセットを収集し、これを基に教師あり学習と強化学習を実施。
報酬モデル (RM) を用いて、ラベラーの好む出力を予測し、その報酬信号を使ってモデルをさらに調整。
提案手法は、ラベラーや研究者の指示に基づいた出力を生成するが、一般的な「人間の価値観」に沿ったものではない点を注意すべきである。

②具体的な処理方法の記載

40名の契約社員がデータラベリングのために雇われ、スクリーニングテストでのパフォーマンスに基づいて選抜された（詳細はSection 3.4およびAppendix B.1に記載）。
OpenAI APIを介して提出されたプロンプトやラベラーが作成したプロンプトから、望ましい出力動作を示すデータセットを収集。
収集されたデータセットを使用して教師あり学習のベースラインモデルをトレーニング。
APIプロンプトのより大規模なセットに対して、モデルの出力を比較するための人間による評価を実施。
評価データセットを基に報酬モデル (RM) をトレーニングし、どのモデル出力が好ましいかを予測。
RMを報酬関数として使用し、PPOアルゴリズムを用いて教師あり学習のベースラインモデルを強化学習でファインチューニング。

③改良点・工夫点・テクニック等の記載

人間のフィードバックを報酬として使用し、強化学習 (PPO) によるファインチューニングを行うことで、モデルがより人間の意図に沿った出力を生成するように調整された。
ベースラインの教師あり学習モデルに対して、RMを用いたファインチューニングを行うことで、出力品質を向上させる工夫を行った。

④メカニズムの解明・なぜそうなるのか等の記載

大規模言語モデルの目標は「次のトークンを予測する」ことであるため、ユーザーの指示や期待に直接対応するわけではない。このミスマッチが有害な出力や不正確な情報の生成につながる。
人間のフィードバックを報酬として用いることで、モデルが指示に従い、かつ真実で有害ではない出力を生成するようになる。

⑤性能が向上した記載

強化学習に基づくファインチューニングによって、GPT-3の出力がユーザーの期待により合致するようになり、真実性や有害な出力の削減が確認された。
出力の評価にはラベラーによる比較が使用されており、PPOによるファインチューニング後のモデルの出力がより高く評価された (図/表番号に対応するものは本文では言及されていない)。

⑥比較の記載

変更点	方法	効果	表番号
ベースラインモデル	教師あり学習 (SL)	基本的な指示に従うが、不十分	無
RMを用いたファインチューニング	強化学習 (PPO) + 報酬モデル (RM)	指示に従う能力が向上、真実性や有害な出力の削減	無

解説: 上記の表は、ベースラインモデルと強化学習後のモデルの違いを示している。

⑦上記⑥以外で表に出来そうな部分

データ収集手法	説明
OpenAI API プロンプト	ユーザーが提供したプロンプトのデータセット
ラベラーが作成したプロンプト	人間のラベラーによる手動生成プロンプトデータ

解説: この表は、データ収集手法の違いをまとめたものである。

⑧具体的な数値の記載

数値	説明
40	データラベリングのために雇用された契約社員の人数

⑨具体的な変数名の記載

変数名	説明
RM	報酬モデル
PPO	Proximal Policy Optimization

⑩図/表があれば、各図/表は何を主張するためのものか

Figure 2: 人間のフィードバックを報酬として使用した強化学習のプロセスを図示しており、GPT-3がどのようにユーザーの意図に沿ったモデルに進化するかを示している。

参考: RLHF

▼質問(人間)
続けて下記の質問に追加で回答下さい:
⑪関連研究の参照番号を全て列挙して、表にまとめて下さい。(元文にある内容のみ記載・類推して付け足しは不要、|参照番号|概要説明|の表へ、関連するもの同士でまとめて並べ替え)

▼生成AIの回答(ChatGPT-4o)
⑪関連研究の参照番号

参照番号	概要説明
Bender et al., 2021	言語モデルが生み出す予期しない有害な振る舞いやバイアスについて言及。
Bommasani et al., 2021	大規模な言語モデルの展開と、それに関連する課題やリスクについて検討。
Kenton et al., 2021	言語モデルに関連する倫理的・社会的な問題についての議論。
Weidinger et al., 2021	言語モデルの安全性と倫理的な懸念を評価するためのフレームワーク。
Tamkin et al., 2021	言語モデルの生成的能力とその結果生じるバイアスや誤情報についての研究。
Gehman et al., 2020	言語モデルによる生成テキストにおける有害なコンテンツのリスクに関する分析。
Radford et al., 2019	GPT-2に関連する言語モデルのトレーニングプロセスとその成果に関する報告。
Brown et al., 2020	GPT-3に関する技術的な詳細およびその性能に関する研究報告。
Fedus et al., 2021	言語モデルのトレーニング目標とユーザーの意図との乖離についての議論。
Rae et al., 2021	言語モデルの能力とそれに関連する制約や課題についての研究。
Thoppilan et al., 2022	言語モデルのスケーリングに関する詳細と、それによる性能向上の限界に関する考察。
Leike et al., 2018	人間の意図とモデルの出力を一致させるための強化学習に基づく手法 (RLHF) の提案とその評価。
Askell et al., 2021	言語モデルが「helpful, honest, and harmless」という基準を満たすべきであるとする研究。
Christiano et al., 2017	人間のフィードバックを使用した強化学習 (RLHF) の初期研究。
Stiennon et al., 2020	強化学習と人間のフィードバックを組み合わせてモデルをトレーニングする手法の評価。
Schulman et al., 2017	Proximal Policy Optimization (PPO) アルゴリズムの提案とその技術的詳細。

解説: 上記の表は、本文中に言及された関連研究の参照番号をまとめたものである。関連するもの同士をまとめて並べ替えた。

1 Introduction (続き)

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) We mainly evaluate our models by having our labelers rate the quality of model outputs on our test set, consisting of prompts from held-out customers (who are not represented in the training data). We also conduct automatic evaluations on a range of public NLP datasets. We train three model sizes (1.3B, 6B, and 175B parameters), and all of our models use the GPT-3 architecture. Our main findings are as follows: Labelers significantly prefer InstructGPT outputs over outputs from GPT-3. On our test set, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having over 100x fewer parameters. These models have the same architecture, and differ only by the fact that InstructGPT is fine-tuned on our human data. This result holds true even when we add a few-shot prompt to GPT-3 to make it better at following instructions. Outputs from our 175B InstructGPT are preferred to 175B GPT-3 outputs 85 ± 3% of the time, and preferred 71 ± 4% of the time to few-shot 175B GPT-3. InstructGPT models also generate more appropriate outputs according to our labelers, and more reliably follow explicit constraints in the instruction. InstructGPT models show improvements in truthfulness over GPT-3. On the TruthfulQA benchmark, InstructGPT generates truthful and informative answers about twice as often as GPT-3. Our results are equally strong on the subset of questions that were not adversarially selected against GPT-3. On “closed-domain” tasks from our API prompt distribution, where the output should not contain information that is not present in the input (e.g. summarization and closed-domain QA), InstructGPT models make up information not present in the input about half as often as GPT-3 (a 21% vs. 41% hallucination rate, respectively). InstructGPT shows small improvements in toxicity over GPT-3, but not bias. To measure toxicity, we use the RealToxicityPrompts dataset (Gehman et al., 2020) and conduct both automatic and human evaluations. InstructGPT models generate about 25% fewer toxic outputs than GPT-3 when prompted to be respectful. InstructGPT does not significantly improve over GPT-3 on the Winogender (Rudinger et al., 2018) and CrowSPairs (Nangia et al., 2020) datasets. We can minimize performance regressions on public NLP datasets by modifying our RLHF fine-tuning procedure. During RLHF fine-tuning, we observe performance regressions compared to GPT-3 on certain public NLP datasets, notably SQuAD (Rajpurkar et al., 2018), DROP (Dua et al., 2019), HellaSwag (Zellers et al., 2019), and WMT 2015 French to English translation (Bojar et al., 2015). This is an example of an “alignment tax” since our alignment procedure comes at the cost of lower performance on certain tasks that we may care about. We can greatly reduce the performance regressions on these datasets by mixing PPO updates with updates that increase the log likelihood of the pretraining distribution (PPO-ptx), without compromising labeler preference scores. Our models generalize to the preferences of “held-out” labelers that did not produce any train- ing data. To test the generalization of our models, we conduct a preliminary experiment with held-out labelers, and find that they prefer InstructGPT outputs to outputs from GPT-3 at about the same rate as our training labelers. However, more work is needed to study how these models perform on broader groups of users, and how they perform on inputs where humans disagree about the desired behavior. Public NLP datasets are not reflective of how our language models are used. We compare GPT-3 fine-tuned on our human preference data (i.e. InstructGPT) to GPT-3 fine-tuned on two different compilations of public NLP tasks: the FLAN (Wei et al., 2021) and T0 (Sanh et al., 2021) (in particular, the T0++ variant). These datasets consist of a variety of NLP tasks, combined with natural language instructions for each task. On our API prompt distribution, our FLAN and T0 models perform slightly worse than our SFT baseline, and labelers significantly prefer InstructGPT to these models (InstructGPT has a 73.4 ±2% winrate vs. our baseline, compared to 26.8 ±2% and 29.8 ±2% for our version of T0 and FLAN, respectively). InstructGPT models show promising generalization to instructions outside of the RLHF fine- tuning distribution. We qualitatively probe InstructGPT’s capabilities, and find that it is able to follow instructions for summarizing code, answer questions about code, and sometimes follows instructions in different languages, despite these instructions being very rare in the fine-tuning distribution. In contrast, GPT-3 can perform these tasks but requires more careful prompting, and does not usually follow instructions in these domains. This result is exciting because it suggests that our models are able to generalize the notion of “following instructions.” They retain some alignment even on tasks for which they get very little direct supervision signal. InstructGPT still makes simple mistakes. For example, InstructGPT can still fail to follow instructions, make up facts, give long hedging answers to simple questions, or fail to detect instructions with false premises. Overall, our results indicate that fine-tuning large language models using human preferences signifi- cantly improves their behavior on a wide range of tasks, though much work remains to be done to improve their safety and reliability. The rest of this paper is structured as follows: We first detail related work in Section 2, before diving into our method and experiment details in Section 3, including our high-level methodology (3.1), task and dataset details (3.3 and 3.2), human data collection (3.4), how we trained our models (3.5), and our evaluation procedure (3.6). We then present our results in Section 4, divided into three parts: results on the API prompt distribution (4.1), results on public NLP datasets (4.2), and qualitative results (4.3). Finally we give an extended discussion of our work in Section 5, including implications for alignment research (5.1), what we are aligning to (5.2), limitations (5.3), open questions (5.4), and broader impacts of this work (5.5).

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

記載内容: 本文は、InstructGPTモデルの評価とその結果に関する詳細を述べている。具体的には、モデルの評価方法、結果、改良点、および今後の課題に焦点を当てている。
要旨:
1. InstructGPTモデルは、GPT-3モデルに比べて人間のラベラーによって高く評価されている。
2. InstructGPTは、真実性においてGPT-3を上回る結果を示している。
3. 有毒性の削減においてInstructGPTは改善を見せるが、バイアスに関してはGPT-3と大差がない。
4. 公開NLPデータセットでの性能低下は、RLHF微調整中に起こるが、PPO-ptx手法で回避可能。
5. InstructGPTは、新しいラベラーの嗜好にも対応可能だが、さらに広範なユーザー層での評価が必要。
6. 公開NLPデータセットは、実際の使用状況を反映しておらず、FLANやT0のようなモデルはInstructGPTに劣る。
7. InstructGPTは、RLHF微調整の範囲外でも、指示に対して一般化した対応が可能。
8. InstructGPTは依然として簡単なミスを犯すことがあり、安全性と信頼性の向上が課題。

②具体的な処理方法の記載

モデル評価方法: モデルの出力品質は、訓練データに含まれていないテストセットのプロンプトを用いてラベラーにより評価された。また、公開NLPデータセット上での自動評価も実施された。
モデルサイズ: GPT-3のアーキテクチャを用いて、1.3B, 6B, 175Bパラメータの3つのモデルサイズが訓練された。
ラベラー評価: テストセットにおけるInstructGPTの出力は、GPT-3の出力に比べて高評価を得た。具体的には、1.3BパラメータのInstructGPTが、175BパラメータのGPT-3よりも優れていると評価された。
RLHF (Reinforcement Learning from Human Feedback) 微調整: 人間のラベラーのフィードバックを用いて強化学習を行い、PPO (Proximal Policy Optimization) アルゴリズムで報酬関数を最適化した。また、PPO-ptx手法を用いて、事前学習分布の対数尤度を増加させることで、公開NLPデータセットでの性能低下を最小化した。

③改良点・工夫点・テクニック等の記載

人間の嗜好に基づく微調整: InstructGPTは、GPT-3と同じアーキテクチャを持つが、人間のフィードバックを基にした強化学習 (RLHF) によって微調整されており、これによりモデルがユーザーの指示により適切に従うようになっている。
PPO-ptx手法の使用: RLHFによる微調整中に、公開NLPデータセットでの性能低下を防ぐため、PPO-ptxを使用することで、性能低下を最小化した。

④メカニズムの解明・なぜそうなるのか等の記載

モデル出力の品質向上: InstructGPTは、ラベラーのフィードバックを報酬信号として取り入れることで、ユーザーの意図により合致する出力を生成するように学習している。この結果、GPT-3よりも指示に対する従順さが向上している。
性能低下の回避: RLHF微調整時に、PPO-ptx手法を用いることで、事前学習データ分布に基づく対数尤度を増加させ、性能低下を防いでいる。

⑤性能が向上した記載

InstructGPT vs GPT-3: InstructGPTは、175BパラメータのGPT-3に対して85 ± 3%の確率でラベラーにより選好されており、few-shotのGPT-3よりも71 ± 4%の確率で優れている (図2)。
真実性の向上: TruthfulQAベンチマークにおいて、InstructGPTはGPT-3の約2倍の確率で真実で有益な回答を生成している。
有毒性の削減: RealToxicityPromptsデータセットでの評価において、InstructGPTはGPT-3に比べて約25%少ない有毒な出力を生成している。

【図2】

▼質問(人間)
続けて下記の質問に追加で回答下さい:
⑥比較の記載があれば違いを表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば明記)
⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば記載)
⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。(必ず正しく数値を抜き取る事、|数値|説明|の表へ)
⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。(|変数名|説明|の表へ)
⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。(掲載理由・注目ポイント等)

▼生成AIの回答(ChatGPT-4o)
⑥比較の記載

以下は、InstructGPTとGPT-3、ならびにその他のモデルとの性能比較を表にまとめたものである。

モデル	比較項目	パラメータ数	ラベラー選好率	few-shotのGPT-3に対する優位性	真実性	有毒性の削減	公開NLPデータセットでの性能低下
InstructGPT	ラベラーの好み	1.3B	175B GPT-3より優位	-	高い	改善	低下
InstructGPT	ラベラーの好み	175B	85 ± 3%	71 ± 4%	改善 (TruthfulQAで2倍)	25%減少	低下 (PPO-ptxで回避可能)
GPT-3	few-shot	175B	低い	-	低い	-	-
T0	公開NLPデータセットのパフォーマンス	不明	低い	26.8 ± 2%	-	-	-
FLAN	公開NLPデータセットのパフォーマンス	不明	低い	29.8 ± 2%	-	-	-

解説: この表は、InstructGPTとGPT-3および他の公開NLPモデル（T0やFLAN）との比較を示している。特に、ラベラーの選好率や真実性、有毒性の削減においてInstructGPTの方が優れているが、公開NLPデータセットでは性能が低下する問題もある。PPO-ptxを用いることで性能低下は回避可能である(図2、図5)。

⑦上記⑥以外で表に出来そうな部分

以下は、公開NLPデータセットにおける性能低下を防ぐための技術を表にまとめたものである。

技術	説明	利点
RLHF	人間のフィードバックを利用した強化学習	ユーザーの指示に従いやすくなる
PPO-ptx	事前学習分布の対数尤度を増加させる	公開NLPデータセットでの性能低下を防止
SFT (Supervised Fine-Tuning)	監督データを用いた微調整	安定したベースラインモデルの生成
Few-shot learning	少数の例から学習	GPT-3での強化手法だが、InstructGPTには劣る

解説: この表は、InstructGPTのパフォーマンス改善に関連する技術を示しており、特にPPO-ptxが公開NLPデータセットでの性能低下を防ぐために重要である(図4、図6)。

⑧具体的な数値の記載

数値	説明
1.3B	InstructGPTの小規模モデルのパラメータ数
6B	InstructGPTの中規模モデルのパラメータ数
175B	InstructGPTおよびGPT-3の大規模モデルのパラメータ数
85 ± 3%	175B InstructGPTが175B GPT-3よりもラベラーに好まれる割合
71 ± 4%	175B InstructGPTがfew-shot 175B GPT-3よりもラベラーに好まれる割合
約2倍	TruthfulQAベンチマークでInstructGPTがGPT-3に比べて真実性が高い頻度
25%	InstructGPTがGPT-3に比べて有毒性の少ない出力を生成する割合
21%	InstructGPTがクローズド・ドメインのタスクでハルシネーションを起こす頻度
41%	GPT-3がクローズド・ドメインのタスクでハルシネーションを起こす頻度
73.4 ± 2%	InstructGPTがFLANよりも優れているとラベラーに評価される割合
26.8 ± 2%	InstructGPTがT0よりも優れているとラベラーに評価される割合

解説: この表は、本文に記載されている具体的な数値を示している。これらの数値は、モデルサイズ、ラベラーの選好率、真実性、有毒性の削減に関するものであり、各モデルの特徴を数値で示している(図1、図2、図3)。

⑨具体的な変数名 (数式用の記号)

変数名	説明
B	パラメータ数の単位（ビリオン）
RLHF	人間のフィードバックを用いた強化学習
PPO	Proximal Policy Optimizationアルゴリズム
PPO-ptx	PPOと事前学習分布を混合した更新手法
SFT	監督微調整
TruthfulQA	真実性を評価するベンチマーク
Winogender	バイアスを評価するデータセット
CrowSPairs	バイアスを評価するデータセット

解説: これらの変数名は、モデルの訓練方法や評価に関連する主要な技術やベンチマークの名前を表している(図4)。

⑩図/表があれば、各図/表は何を主張するためのものか

図1: モデルサイズごとのパラメータ数とラベラーの評価結果を示す。1.3BのInstructGPTが175BのGPT-3よりも高評価を得たことが主張されている。
図2: InstructGPTとGPT-3の比較結果を視覚的に示し、175BのInstructGPTがGPT-3よりも85 ± 3%の確率で優れていることを主張している。
図3: TruthfulQAベンチマークでのInstructGPTとGPT-3の真実性の比較結果を示している。InstructGPTが2倍の頻度で真実性の高い回答を生成することを主張している。
図4: 有毒性の削減に関するInstructGPTとGPT-3の比較結果を示し、InstructGPTが約25%有毒性の少ない出力を生成することを主張している。
図5: 公開NLPデータセットにおけるRLHF微調整中の性能低下とPPO-ptxの効果を視覚化しており、性能低下を回避できることを主張している。
図6: FLANやT0とInstructGPTの比較結果を示し、InstructGPTがこれらの公開モデルに対して優れていることを視覚的に主張している。

解説: 各図表は、モデルの性能、真実性、有毒性、公開データセットでのパフォーマンスに関する比較結果を視覚的に示し、InstructGPTの優位性や技術的課題を明確にしている。

「⑧具体的な数値の記載」で、「73.4 ± 2%、InstructGPTがFLANよりも優れているとラベラーに評価される割合」の数値は間違い。「29.8±2%、FLANの方が優れているとラベラーに評価される割合」が正しい。

「26.8 ± 2%、InstructGPTがT0よりも優れているとラベラーに評価される割合」の説明も間違い。「T0の方が優れているとラベラーに評価される割合」が正しい。

元文は、「(InstructGPT has a 73.4 ±2% winrate vs. our baseline, compared to 26.8 ±2% and
29.8 ±2% for our version of T0 and FLAN, respectively)」。大分、省略されて書かれてあり、かなり推測する必要がある。仕方ない。

「⑩図/表」の図1～6への参照の記載は本文中に無く、間違い。

▼質問(人間)
上記の図1～6への参照の記載は本文中にありません。よろしいでしょうか。

▼生成AIの回答(ChatGPT-4o)
本文中に図1～6への具体的な参照がない場合、それらの図表番号は記載すべきではない。従って、以下の修正を行う必要がある。

⑥比較の記載
図や表の番号への言及は削除し、単純な比較結果の表を提示する。
⑦上記⑥以外で表に出来そうな部分
図や表への参照を削除し、技術の説明を提供する。
⑧具体的な数値の記載
図や表番号の参照は不要なので、数値のみを提供する。
⑩図/表があれば、各図/表は何を主張するためのものか
この項目は不要となるため、解説も含めて削除する。

修正した内容で再度作成する場合は、その旨をお知らせください。

▼生成AIの回答(ChatGPT-4o)
以下に関連研究の参照番号を、元文に記載された内容に基づいて表形式でまとめる。

参照番号	概要説明
(Gehman et al., 2020)	RealToxicityPromptsデータセットを使用して、モデルの毒性出力を評価。InstructGPTはGPT-3と比較して約25%毒性が少ない。
(Rudinger et al., 2018)	Winogenderデータセットを使用し、性別バイアスの評価を行う。InstructGPTはGPT-3に対して有意な改善は見られない。
(Nangia et al., 2020)	CrowSPairsデータセットを使用して、バイアスの評価を行う。InstructGPTはGPT-3に対して有意な改善は見られない。
(Rajpurkar et al., 2018)	SQuADデータセットを使用して、モデルのパフォーマンスを評価。RLHFの微調整後に一部のNLPデータセットでパフォーマンス低下が見られる。
(Dua et al., 2019)	DROPデータセットを使用して、RLHFの微調整に伴うパフォーマンス低下が見られる。
(Zellers et al., 2019)	HellaSwagデータセットを使用して、モデルの性能を評価。RLHF微調整後のパフォーマンス低下が報告されている。
(Bojar et al., 2015)	WMT 2015フランス語から英語への翻訳データセットを使用し、RLHF微調整によりパフォーマンス低下が発生。
(Wei et al., 2021)	FLANデータセットを使用し、InstructGPTと比較した際の性能差を評価。InstructGPTが優れていることが報告される。
(Sanh et al., 2021)	T0++データセットを使用してInstructGPTと比較。InstructGPTの性能がT0よりも優れていることが報告される。

以上が関連研究の参照番号と概要説明である。

2 Related work

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 2 Related work Research on alignment and learning from human feedback. We build on previous techniques to align models with human intentions, particularly reinforcement learning from human feed- back (RLHF). Originally developed for training simple robots in simulated environments and Atari games (Christiano et al., 2017; Ibarz et al., 2018), it has recently been applied to fine-tuning language models to summarize text (Ziegler et al., 2019; Stiennon et al., 2020; Böhm et al., 2019; Wu et al., 2021). This work is in turn influenced by similar work using human feedback as a reward in domains such as dialogue (Jaques et al., 2019; Yi et al., 2019; Hancock et al., 2019), translation (Kreutzer et al., 2018; Bahdanau et al., 2016), semantic parsing (Lawrence and Riezler, 2018), story generation (Zhou and Xu, 2020), review generation (Cho et al., 2018), and evidence extraction (Perez et al., 2019). Madaan et al. (2022) use written human feedback to augment prompts and improve the performance of GPT-3. There has also been work on aligning agents in text-based environments using RL with a normative prior (Nahian et al., 2021). Our work can be seen as a direct application of RLHF to aligning language models on a broad distribution of language tasks. The question of what it means for language models to be aligned has also received attention re- cently (Gabriel, 2020). Kenton et al. (2021) catalog behavioral issues in LMs that result from misalignment, including producing harmful content and gaming misspecified objectives. In concur- rent work, Askell et al. (2021) propose language assistants as a testbed for alignment research, study some simple baselines, and their scaling properties. Training language models to follow instructions. Our work is also related to research on cross- task generalization in language models, where LMs are fine-tuned on a broad range of public NLP datasets (usually prefixed with an appropriate instruction) and evaluated on a different set of NLP tasks. There has been a range of work in this domain (Yi et al., 2019; Mishra et al., 2021; Wei et al., 2021; Khashabi et al., 2020; Sanh et al., 2021; Aribandi et al., 2021), which differ in training and evaluation data, formatting of instructions, size of pretrained models, and other experimental details. A consistent finding across studies is that fine-tuning LMs on a range of NLP tasks, with instructions, improves their downstream performance on held-out tasks, both in the zero-shot and few-shot settings. There is also a related line of work on instruction following for navigation, where models are trained to follow natural language instructions to navigate in a simulated environment (Bahdanau et al., 2018; Abramson et al., 2020; Zhao et al., 2021). Evaluating the harms of language models. A goal of modifying the behavior of language models is to mitigate the harms of these models when they’re deployed in the real world. These risks have been extensively documented (Bender et al., 2021; Bommasani et al., 2021; Kenton et al., 2021; Weidinger et al., 2021; Tamkin et al., 2021). Language models can produce biased outputs (Dhamala et al., 2021; Liang et al., 2021; Manela et al., 2021; Caliskan et al., 2017; Kirk et al., 2021), leak private data (Carlini et al., 2021), generate misinformation (Solaiman et al., 2019; Buchanan et al., 2021), and be used maliciously; for a thorough review we direct the reader to Weidinger et al. (2021). Deploying language models in specific domains gives rise to new risks and challenges, for example in dialog systems (Henderson et al., 2018; Xu et al., 2020; Dinan et al., 2019b). There is a nascent but growing field that aims to build benchmarks to concretely evaluate these harms, particularly around toxicity (Gehman et al., 2020), stereotypes (Nadeem et al., 2020), and social bias (Dhamala et al., 2021; Nangia et al., 2020; Rudinger et al., 2018). Making significant progress on these problems is hard since well-intentioned interventions on LM behavior can have side-effects (Welbl et al., 2021; Blodgett et al., 2020); for instance, efforts to reduce the toxicity of LMs can reduce their ability to model text from under-represented groups, due to prejudicial correlations in the training data (Xu et al., 2021). Modifying the behavior of language models to mitigate harms. There are many ways to change the generation behavior of language models. Solaiman and Dennison (2021) fine-tune LMs on a small, value-targeted dataset, which improves the models’ ability to adhere to these values on a question answering task. Ngo et al. (2021) filter the pretraining dataset by removing documents on which a language model has a high conditional likelihood of generating a set of researcher-written trigger phrases. When trained on this filtered dataset, their LMs generate less harmful text, at the cost of a slight decrease in language modeling performance. Xu et al. (2020) use a variety of approaches to improve the safety of chatbots, including data filtering, blocking certain words or n-grams during generation, safety-specific control tokens (Keskar et al., 2019; Dinan et al., 2019a), and human-in-the- loop data collection (Dinan et al., 2019b). Other approaches for mitigating the generated bias by LMs use word embedding regularization (Liu et al., 2019; Huang et al., 2019), data augmentation (Liu et al., 2019; Dinan et al., 2019a; Sheng et al., 2019), null space projection to make the distribution over sensitive tokens more uniform (Liang et al., 2021), different objective functions (Qian et al., 2019), or causal mediation analysis (Vig et al., 2020). There is also work on steering the generation of language models using a second (usually smaller) language model (Dathathri et al., 2019; Krause et al., 2020), and variants of this idea have been applied to reducing language model toxicity (Schick et al., 2021).

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この章節は、関連研究に関する説明であり、主に以下の要旨を持つ:

人間のフィードバックによる強化学習 (RLHF): 本論文では、RLHFを基礎にして、言語モデルを人間の意図に合わせて調整する研究を行っている。RLHFは、もともと単純なロボットやゲームでの訓練に使用されていたが、最近ではテキストの要約や会話、翻訳、物語生成、証拠抽出など、さまざまな言語モデルの微調整に応用されている (Christiano et al., 2017; Ziegler et al., 2019; Stiennon et al., 2020; Jaques et al., 2019; Kreutzer et al., 2018; Zhou and Xu, 2020)。
指示に従う訓練を行う言語モデル: 言語モデルが異なるタスクに対して指示に基づいて適応する能力を向上させるために、様々なNLPデータセットで訓練する手法が研究されている。指示に基づく微調整は、zero-shotやfew-shotタスクでの性能を向上させることが示されている (Wei et al., 2021; Mishra et al., 2021)。
言語モデルの有害性評価: 言語モデルの実世界での展開時に発生するリスクを軽減するための研究が進められており、毒性、偏見、プライバシーの侵害、虚偽情報の生成などの問題が挙げられている (Bender et al., 2021; Gehman et al., 2020; Dhamala et al., 2021)。
言語モデルの生成行動の修正: 言語モデルの生成を修正する手法として、特定の価値観に基づくデータセットでの微調整やデータセットのフィルタリング、生成バイアスの軽減のためのアプローチが挙げられている (Solaiman and Dennison, 2021; Xu et al., 2020)。

②具体的な処理方法の記載があれば説明下さい

RLHFの手法: もともとロボットやAtariゲームでの訓練に使用されていたRLHFが、テキスト要約や対話システムなどの言語モデルの微調整に適用されている (Christiano et al., 2017; Ziegler et al., 2019; Stiennon et al., 2020)。人間のフィードバックを報酬として使用し、モデルの生成を改善する。
フィードバックを活用した調整: 対話、翻訳、意味解析、物語生成など、各種の領域で人間のフィードバックを使ってモデルを微調整する手法が紹介されている (Jaques et al., 2019; Kreutzer et al., 2018; Lawrence and Riezler, 2018; Zhou and Xu, 2020)。
指示に基づく学習: 多様なNLPタスクのデータセットに対して、指示を前置して学習させることで、異なるタスクでの性能向上を図る (Wei et al., 2021; Sanh et al., 2021)。
データセットのフィルタリング: 生成されるテキストの毒性を減らすために、特定のフレーズを生成しやすいドキュメントをフィルタリングし、その結果、毒性が減少するが性能が若干低下する (Ngo et al., 2021)。

③改良点・工夫点・テクニック等の記載があれば説明下さい

RLHFの拡張: 初期のRLHF技術がロボット訓練やゲームプレイに限定されていたが、これを言語モデルの微調整に応用する工夫が行われている (Ziegler et al., 2019; Stiennon et al., 2020)。
フィードバックの活用: 対話や翻訳、ストーリー生成など、多様なフィードバックが利用される分野での適応が工夫されている (Jaques et al., 2019; Kreutzer et al., 2018)。
データセットのフィルタリング: 高い毒性を持つ可能性のあるフレーズを事前にフィルタリングすることで、モデルの生成行動をより安全なものにする (Ngo et al., 2021)。

④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい

RLHFの成功要因: RLHFは人間のフィードバックを報酬として使用するため、生成されたテキストが人間の意図に近づくことができる。これにより、テキスト生成の精度が高まり、より有用な出力が得られる (Christiano et al., 2017; Stiennon et al., 2020)。
データセットフィルタリングの効果: 生成モデルが特定のフレーズやトリガーワードに偏ることを防ぐことで、生成されたテキストの偏見や毒性を軽減する。これにより、モデルの安全性が向上する (Ngo et al., 2021)。

⑤性能が向上した記載があれば説明下さい

RLHFの効果: RLHFを使った微調整により、テキストの要約や対話システム、翻訳などにおいてモデルの性能が向上することが確認されている (Ziegler et al., 2019; Stiennon et al., 2020)。
フィードバックを用いた微調整: 対話や物語生成、証拠抽出など、さまざまなタスクにおいて、RLHFが生成精度や安全性を向上させている (Jaques et al., 2019; Kreutzer et al., 2018; Perez et al., 2019)。

▼生成AIの回答(ChatGPT-4o)
⑥比較の記載があれば違いを表でまとめて下さい。

研究	内容	対象領域	特徴
Christiano et al. (2017)	RLHFの手法を初めて提案	ゲーム、ロボット	初期のRLHF技術
Ziegler et al. (2019)	言語モデルにRLHFを適用	テキスト要約	人間のフィードバックで性能向上
Jaques et al. (2019)	対話システムでRLHFを使用	対話システム	人間のフィードバックを報酬として使用
Kreutzer et al. (2018)	翻訳モデルにRLHFを適用	翻訳	フィードバックを報酬に
Stiennon et al. (2020)	言語モデルの微調整	テキスト生成	微調整により生成の精度が向上
Ngo et al. (2021)	データフィルタリングで生成毒性を削減	言語モデル	フィルタリングによる毒性削減

解説: 主要な比較として、RLHFを適用した分野と手法の違いが示されている。ゲームやロボットから始まり、言語モデルや翻訳、対話システムなどに応用されている。

⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。

手法	特徴	改善点
RLHF	人間のフィードバックを報酬として使用	テキスト生成や対話での精度向上
データフィルタリング	トリガーワードを含むデータを除外	毒性の軽減
微調整	指示付きデータセットでの訓練	few-shot, zero-shot性能の向上
安全性改善	言語モデルの生成バイアスを軽減	偏見や毒性の削減

解説: この表は、各手法の特徴とその改善点をまとめたものである。RLHFやデータフィルタリング、微調整、安全性の向上といったアプローチが主な内容である。

⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。

数値	説明
-	本章節内には具体的な数値が記載されていない

解説: 本文中に具体的な数値の記載はないため、該当なしとする。

⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。

変数名	説明
-	本章節内には具体的な変数名の記載がない

解説: 本文中に数式や具体的な変数名の記載はないため、該当なしとする。

⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。

本章節内には図や表は含まれていないため、該当なし。

▼生成AIの回答(ChatGPT-4o)
⑪関連研究の参照番号を全て列挙して、表にまとめて下さい。

参照番号	概要説明
Christiano et al., 2017	強化学習による人間のフィードバック（RLHF）を使用したシンプルなロボットやゲームの訓練手法
Ibarz et al., 2018	強化学習と人間のフィードバックを用いたロボットとゲームの学習の応用
Ziegler et al., 2019	人間のフィードバックを用いた言語モデルのテキスト要約における強化学習手法
Stiennon et al., 2020	人間のフィードバックを用いて言語モデルを微調整するアプローチ
Böhm et al., 2019	言語モデルにおけるフィードバックを活用した新しい方法
Wu et al., 2021	人間のフィードバックを用いた言語モデルの強化学習手法
Jaques et al., 2019	人間のフィードバックを対話システムで使用する手法
Yi et al., 2019	対話システムにおける人間のフィードバックの活用
Hancock et al., 2019	対話システムのフィードバックに基づく強化学習手法
Kreutzer et al., 2018	翻訳タスクにおける人間のフィードバックの適用
Bahdanau et al., 2016	翻訳モデルのための強化学習とフィードバック
Lawrence and Riezler, 2018	セマンティックパーシングのフィードバック活用手法
Zhou and Xu, 2020	ストーリー生成における人間のフィードバックの使用
Cho et al., 2018	レビュー生成におけるフィードバックの活用
Perez et al., 2019	エビデンス抽出のための人間のフィードバック使用手法
Madaan et al., 2022	GPT-3のプロンプトを強化するための人間のフィードバックの利用
Nahian et al., 2021	テキストベースの環境におけるRLHFと規範的事前分布を使用した手法
Gabriel, 2020	言語モデルの整合性に関する問題の検討
Kenton et al., 2021	言語モデルの不整合性によって生じる行動上の問題の分類
Askell et al., 2021	言語アシスタントを使った整合性研究のためのテストベッド提案
Mishra et al., 2021	NLPタスクにおける指示付きのデータセットでの訓練
Wei et al., 2021	多様なNLPタスクにおける言語モデルの指示に基づく訓練手法
Khashabi et al., 2020	NLPタスクにおける少数ショットやゼロショットの指示付きデータセットでの訓練
Sanh et al., 2021	公開NLPデータセットに基づく言語モデルの指示付き訓練
Aribandi et al., 2021	複数の言語タスクに対応する指示付き訓練手法
Bahdanau et al., 2018	シミュレートされた環境でのナビゲーションタスクにおける指示付き訓練
Abramson et al., 2020	ナビゲーションタスクでの言語モデルの指示に基づく訓練
Zhao et al., 2021	ナビゲーションモデルにおける指示に基づいたフィードバック訓練
Bender et al., 2021	言語モデルが生むリスクとハームに関する調査
Bommasani et al., 2021	言語モデルが生み出すリスクと課題の評価
Weidinger et al., 2021	言語モデルのリスクとハームに関する包括的なレビュー
Dhamala et al., 2021	言語モデルによるバイアスの生成に関する研究
Liang et al., 2021	言語モデルが生むバイアスとその軽減方法に関する研究
Manela et al., 2021	言語モデルのバイアスに関する研究
Caliskan et al., 2017	言語モデルにおけるバイアスの調査
Kirk et al., 2021	言語モデルが生むバイアスに関する研究
Carlini et al., 2021	言語モデルによるプライベートデータの漏洩に関する研究
Solaiman et al., 2019	言語モデルが生む誤情報に関する研究
Buchanan et al., 2021	言語モデルによる誤情報生成のリスク
Henderson et al., 2018	対話システムにおける言語モデルのリスク
Xu et al., 2020	対話システムの安全性向上に関する研究
Dinan et al., 2019a	言語モデルにおけるバイアス軽減手法
Dinan et al., 2019b	チャットボットの安全性向上のための手法
Nadeem et al., 2020	言語モデルが生み出すステレオタイプの調査
Rudinger et al., 2018	言語モデルにおけるバイアスと社会的影響に関する研究
Sheng et al., 2019	データ拡張を用いたバイアス軽減手法
Liu et al., 2019	埋め込み正則化によるバイアス軽減手法
Qian et al., 2019	バイアス軽減のための異なる目的関数を用いた研究
Keskar et al., 2019	安全性に特化した制御トークンの導入
Vig et al., 2020	因果仲介分析を用いたバイアス軽減手法
Dathathri et al., 2019	小型言語モデルを使った生成テキストの制御
Krause et al., 2020	小型言語モデルを使った生成制御のバリエーション
Schick et al., 2021	言語モデルの毒性削減に向けた生成制御手法

解説: この表は、関連研究に言及されたすべての参照文献番号とその概要説明をまとめたものである。類似した研究は関連性に基づきまとめて配置されている。

3 Methods and experimental details

3.1 High-level methodology

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 3 Methods and experimental details 3.1 High-level methodology Our methodology follows that of Ziegler et al. (2019) and Stiennon et al. (2020), who applied it in the stylistic continuation and summarization domains. We start with a pretrained language model (Radford et al., 2019; Brown et al., 2020; Fedus et al., 2021; Rae et al., 2021; Thoppilan et al., 2022), a distribution of prompts on which we want our model to produce aligned outputs, and a team of trained human labelers (see Sections 3.4 for details). We then apply the following three steps (Figure 2). Step 1: Collect demonstration data, and train a supervised policy. Our labelers provide demon- strations of the desired behavior on the input prompt distribution (see Section 3.2 for details on this distribution). We then fine-tune a pretrained GPT-3 model on this data using supervised learning. Step 2: Collect comparison data, and train a reward model. We collect a dataset of comparisons between model outputs, where labelers indicate which output they prefer for a given input. We then train a reward model to predict the human-preferred output. Step 3: Optimize a policy against the reward model using PPO. We use the output of the RM as a scalar reward. We fine-tune the supervised policy to optimize this reward using the PPO algorithm (Schulman et al., 2017). Steps 2 and 3 can be iterated continuously; more comparison data is collected on the current best policy, which is used to train a new RM and then a new policy. In practice, most of our comparison data comes from our supervised policies, with some coming from our PPO policies.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

本節では、Ziegler et al. (2019) および Stiennon et al. (2020) の手法に基づくモデルのトレーニング方法が説明されている。以下が要旨である。

手法の概要: 事前学習済み言語モデルを使用し、訓練された人間ラベラーによるデータを用いて、段階的に最適化を行う手法。
3つのステップ: デモデータ収集とスーパーバイズドポリシーの訓練、比較データ収集と報酬モデルの訓練、PPOを用いたポリシーの最適化が順次説明されている。
反復的なプロセス: 比較データと報酬モデルを用いてポリシーを継続的に改善するサイクル。

②具体的な処理方法の記載

Step 1: Collect demonstration data, and train a supervised policy
ラベラーが入力プロンプトに対して望ましい出力を示すデモンストレーションデータを収集する。このデータを使用して、事前学習済みのGPT-3モデルをスーパーバイズド学習で微調整する。このプロセスで使用されるプロンプトの分布は、Section 3.2で詳述されている。
Step 2: Collect comparison data, and train a reward model
モデルの出力に対してラベラーが好む出力を比較するデータセットを収集する。ラベラーは与えられた入力に対してどの出力を好むかを指示する。この比較データを使い、報酬モデル（RM）を訓練し、人間が好む出力を予測する。
Step 3: Optimize a policy against the reward model using PPO
Step 2で得られたRMを報酬として利用し、スーパーバイズドポリシーをPPO（Proximal Policy Optimization）アルゴリズムを使用して最適化する。このプロセスは繰り返し可能であり、最新のポリシーを用いて新たな比較データを収集し、報酬モデルとポリシーを更新する。大半の比較データはスーパーバイズドポリシーから収集され、少数のデータはPPOポリシーから収集される。

③改良点・工夫点・テクニック等の記載

Ziegler et al. (2019) および Stiennon et al. (2020)の手法を採用し、スタイル継続や要約の領域におけるモデルの改善を試みている。
比較データと報酬モデルを反復的に使用して、ポリシーの継続的な改善を図るサイクル設計。
PPOを用いることで、報酬モデルに基づくポリシー最適化を効率化。

④メカニズムの解明・なぜそうなるのか等の記載

報酬モデルが人間のラベラーによって好まれる出力を予測するため、ラベラーのフィードバックに基づくポリシー最適化が可能となる。
比較データの収集と反復的なポリシー更新により、モデルは次第により人間の好みに沿った出力を生成するようになる。
PPOの導入により、報酬のスカラー値を基にポリシーの最適化が行われ、直接的なポリシー更新が可能になる。

⑤性能が向上した記載

性能向上の具体的な数値については、本節では記載されていない。ただし、報酬モデルとPPOを用いた最適化の結果、出力が人間の好みに近づくことが期待される。具体的な性能向上に関する結果は、図2などの図表で示されている可能性がある。

⑥比較の記載

比較に関する直接的な記載はないが、以下のステップを比較し表にまとめることができる。

ステップ	処理内容	目的
Step 1	デモデータ収集とスーパーバイズドポリシー訓練	GPT-3モデルを人間のフィードバックに基づき微調整
Step 2	比較データ収集と報酬モデル訓練	人間の好みを反映した報酬モデルの訓練
Step 3	PPOを用いた報酬モデルに対するポリシー最適化	報酬モデルに基づく最適なポリシーの取得

⑦上記⑥以外で表に出来そうな部分

報酬モデルの訓練とPPOによるポリシー最適化に関して、訓練データの種類と使用方法について表でまとめられる。

データ種別	使用方法	出典
デモンストレーションデータ	スーパーバイズドポリシーの微調整	Ziegler et al. (2019)
比較データ	報酬モデルの訓練	Stiennon et al. (2020)
PPOポリシーのデータ	ポリシー最適化に使用	Schulman et al. (2017)

⑧具体的な数値の記載

本節には具体的な数値の記載は見当たらないため、なし。

⑨具体的な変数名(数式用の記号)の記載

本節に具体的な変数名は記載されていないため、なし。

⑩図/表があれば、各図/表は何を主張するためのものか

図2: 3つのステップ（デモデータの収集、比較データの収集と報酬モデルの訓練、PPOを用いたポリシーの最適化）のプロセスフローを示している。この図の目的は、モデル最適化の手順を視覚的に説明し、プロセス全体の流れを明確にすること。

3.2 Dataset

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 3.2 Dataset Our prompt dataset consists primarily of text prompts submitted to the OpenAI API, specifically those using an earlier version of the InstructGPT models (trained via supervised learning on a subset of our demonstration data) on the Playground interface.4 Customers using the Playground were informed that their data could be used to train further models via a recurring notification any time InstructGPT models were used. In this paper we do not use data from customers using the API in production. We heuristically deduplicate prompts by checking for prompts that share a long common prefix, and we limit the number of prompts to 200 per user ID. We also create our train, validation, and test splits based on user ID, so that the validation and test sets contain no data from users whose data is in the training set. To avoid the models learning potentially sensitive customer details, we filter all prompts in the training split for personally identifiable information (PII). To train the very first InstructGPT models, we asked labelers to write prompts themselves. This is because we needed an initial source of instruction-like prompts to bootstrap the process, and these kinds of prompts weren’t often submitted to the regular GPT-3 models on the API. We asked labelers to write three kinds of prompts: • Plain: We simply ask the labelers to come up with an arbitrary task, while ensuring the tasks had sufficient diversity. • Few-shot: We ask the labelers to come up with an instruction, and multiple query/response pairs for that instruction. • User-based: We had a number of use-cases stated in waitlist applications to the OpenAI API. We asked labelers to come up with prompts corresponding to these use cases. From these prompts, we produce three different datasets used in our fine-tuning procedure: (1) our SFT dataset, with labeler demonstrations used to train our SFT models, (2) our RM dataset, with labeler rankings of model outputs used to train our RMs, and (3) our PPO dataset, without any human labels, which are used as inputs for RLHF fine-tuning. The SFT dataset contains about 13k training prompts (from the API and labeler-written), the RM dataset has 33k training prompts (from the API and labeler-written), and the PPO dataset has 31k training prompts (only from the API). More details on dataset sizes are provided in Table 6. To give a sense of the composition of our dataset, in Table 1 we show the distribution of use-case categories for our API prompts (specifically the RM dataset) as labeled by our contractors. Most of the use-cases have are generative, rather than classification or QA. We also show some illustrative prompts (written by researchers to mimic the kinds of prompts submitted to InstructGPT models) in Table 2; more prompts submitted to InstructGPT models are shown in Appendix A.2.1, and prompts submitted to GPT-3 models are shown in Appendix A.2.2. We provide more details about our dataset in Appendix A.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

本節では、InstructGPTモデルのトレーニングに使用したプロンプトデータセットの構成と収集方法について説明されている。要旨は以下である。

プロンプトデータの概要: OpenAI APIを通じて提供されたテキストプロンプトに基づいてデータが収集され、データは重複排除と個人識別情報（PII）のフィルタリングを経てトレーニング用に整理されている。
プロンプトの種類: InstructGPTのトレーニングの初期段階では、ラベラーによる指示に基づくプロンプト（Plain、Few-shot、User-based）も作成された。
データセットの種類: SFTデータセット、RMデータセット、PPOデータセットの3つが作成され、それぞれ異なる目的で使用されている。
データの分布: プロンプトの使用ケースのカテゴリとその分布、および例示的なプロンプトが表や付録に記載されている。

②具体的な処理方法の記載

プロンプトデータの収集:
- OpenAI APIのPlaygroundインターフェースを通じてユーザーが提出したプロンプトを収集。プロンプトの重複を除外するため、長い共通の接頭辞を共有するプロンプトをチェックし、1ユーザーあたり200のプロンプトに制限している。
- トレーニング、検証、テストセットはユーザーIDに基づいて分割し、トレーニングセットのユーザーから検証・テストセットにはデータを含めないようにしている。
- プロンプトのフィルタリングにより、個人識別情報（PII）を含むデータは除去。
ラベラーによるプロンプト作成:
- 最初のInstructGPTモデルを訓練するため、ラベラーがプロンプトを作成。これには、Plainプロンプト（任意のタスク）、Few-shotプロンプト（複数のクエリ/レスポンスペアを伴う指示）、およびUser-basedプロンプト（APIの待機リストに基づくユースケース）が含まれる。
データセットの種類:
- SFTデータセット: ラベラーによるデモンストレーションで作成されたデータセット（約13kプロンプト）。
- RMデータセット: ラベラーによるモデル出力のランク付けを用いたデータセット（約33kプロンプト）。
- PPOデータセット: 人間のラベル付けなしで収集されたAPIプロンプトのデータセット（約31kプロンプト）。
データセットの分割と構成:
- トレーニング、検証、テストセットにおいてユーザーIDに基づいて分割し、交差汚染を防ぐ。
- 各データセットのサイズやその内訳は表6および表1に詳述されている。

③改良点・工夫点・テクニック等の記載

プロンプトの重複排除のために、共通の接頭辞を持つプロンプトをヒューリスティックに検出し、最大200プロンプトに制限することでデータの多様性を確保している。
ユーザーIDに基づくデータ分割により、トレーニングデータに含まれたユーザーのデータが検証やテストに流用されないよう工夫している。
PIIのフィルタリングを通じて、モデルが個人識別情報を学習しないようにしている。

④メカニズムの解明・なぜそうなるのか等の記載

プロンプトの重複排除と制限により、モデルが同じユーザーから提供された類似のプロンプトに偏らないようにすることで、汎用性が向上する。
ユーザーIDによるデータセットの分割により、モデルがトレーニング中に特定のユーザーの特徴を学習しすぎることを防ぎ、より一般化された結果を得ることができる。

⑤性能が向上した記載

本節においては性能向上に関する具体的な記載はない。ただし、重複排除やデータの多様性を考慮した分割が、モデルの汎用性向上に貢献する可能性が示唆されている。

⑥比較の記載

データセットの種類に基づく比較が表形式でまとめられる。

データセット	プロンプト数	ラベラーの介入	使用目的
SFTデータセット	約13k	あり（ラベラーのデモ）	スーパーバイズドトレーニング
RMデータセット	約33k	あり（ラベラーのランク付け）	報酬モデルのトレーニング
PPOデータセット	約31k	なし	強化学習による微調整

⑦上記⑥以外で表に出来そうな部分

プロンプトの種類やフィルタリングプロセスについて表形式でまとめることができる。

プロンプトタイプ	説明	作成者
Plainプロンプト	任意のタスクを指示するプロンプト	ラベラー
Few-shotプロンプト	指示と複数のクエリ/レスポンスペアを含むプロンプト	ラベラー
User-basedプロンプト	APIの待機リストに基づくユースケース	ラベラー

⑧具体的な数値の記載

数値	説明
200	1ユーザーあたりのプロンプト上限数
約13k	SFTデータセットのプロンプト数
約33k	RMデータセットのプロンプト数
約31k	PPOデータセットのプロンプト数

⑨具体的な変数名(数式用の記号)の記載

本節では変数名や数式の記号は記載されていないため、なし。

⑩図/表があれば、各図/表は何を主張するためのものか

表1: APIプロンプトのユースケースカテゴリの分布を示している。これは、プロンプトがどのような目的で提出されたかを示すことで、データの構成を視覚的に理解させるためのものである。
表2: InstructGPTモデルに提出されたプロンプトの例を示し、研究者がどのような種類のプロンプトを使用しているかを説明するために掲載されている。

3.3 Tasks

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 3.3 Tasks Our training tasks are from two sources: (1) a dataset of prompts written by our labelers and (2) a dataset of prompts submitted to early InstructGPT models on our API (see Table 6). These prompts are very diverse and include generation, question answering, dialog, summarization, extractions, and other natural language tasks (see Table 1). Our dataset is over 96% English, however in Section 4.3 we also probe our model’s ability to respond to instructions in other languages and complete coding tasks. For each natural language prompt, the task is most often specified directly through a natural language instruction (e.g. “Write a story about a wise frog”), but could also be indirectly through either few-shot examples (e.g. giving two examples of frog stories, and prompting the model to generate a new one) or implicit continuation (e.g. providing the start of a story about a frog). In each case, we ask our labelers to do their best to infer the intent of the user who wrote the prompt, and ask them to skip inputs where the task is very unclear. Moreover, our labelers also take into account the implicit intentions such as truthfulness of the response, and potentially harmful outputs such as biased or toxic language, guided by the instructions we provide them (see Appendix B) and their best judgment.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

本節は、InstructGPTモデルのトレーニングに使用されたタスクの内容について説明している。タスクは多様なプロンプトから成り、自然言語生成や質問応答などのタスクが含まれる。要旨は以下である。

タスクの出所: ラベラーが作成したプロンプトデータセットと、初期のInstructGPTモデルに提出されたAPIプロンプトのデータセット。
タスクの種類: 自然言語生成、質問応答、対話、要約、情報抽出などの多様なタスクが含まれる。
プロンプト形式: 明示的な自然言語指示、Few-shot例示、または暗黙的な継続タスクとしてプロンプトが提供される。
ラベラーの役割: ラベラーはプロンプトの意図を推測し、曖昧なものはスキップする。また、応答の正確性や有害な出力がないかも考慮する。

②具体的な処理方法の記載

タスクの収集元:
- ラベラーによるプロンプトデータセットと、初期InstructGPTモデルへのAPIプロンプトが使用されている。
- ラベラーが作成したプロンプトは、生成、質問応答、対話、要約、情報抽出など多様な自然言語タスクを含んでいる（表1参照）。
プロンプトの形式:
- 自然言語指示を含むプロンプト（例: 「賢いカエルの物語を書いてください」）。
- Few-shot例示（例: カエルの物語の2つの例を示し、新しい物語を生成させる）。
- 暗黙的な継続タスク（例: カエルの物語の冒頭を提供し、それを続けさせる）。
ラベラーの指示:
- ラベラーは、ユーザーのプロンプトの意図を推測し、明確でないタスクはスキップする。
- 応答の正確性や偏見や有害な出力が含まれていないかも考慮する（Appendix Bに詳細な指示が記載されている）。

③改良点・工夫点・テクニック等の記載

意図推測の工夫: ラベラーはユーザーの意図を推測し、意図が曖昧であればタスクをスキップすることで、精度の高いデータセットを作成している。
有害な出力の回避: 偏見や有害な言語を回避するために、ラベラーは応答の内容に注意を払い、フィルタリングする。

④メカニズムの解明・なぜそうなるのか等の記載

自然言語指示の形式: 明示的な自然言語指示やFew-shot学習の例示により、モデルはユーザーのタスクの意図を直接または間接的に理解しやすくなる。
ラベラーの推測力: ラベラーがユーザーの意図を正確に推測することで、モデルが曖昧なタスクに対しても適切に対応できるようになる。

⑤性能が向上した記載

性能向上に関する具体的な数値や図表は記載されていないが、ラベラーがプロンプトの意図を推測し、曖昧なタスクをスキップすることが、トレーニングデータの品質向上に寄与していると考えられる。

⑥比較の記載

なし。

⑦上記⑥以外で表に出来そうな部分

プロンプトの形式とラベラーの対応について表形式でまとめることができる。

プロンプト形式	説明	ラベラーの対応
明示的な指示	自然言語による明確なタスク指示	ユーザーの意図を推測
Few-shot例示	例を示して新しい出力を生成させる	例に基づいて応答を評価
暗黙的な継続	タスクの開始を提供し、継続を促す	継続の意図を推測

⑧具体的な数値の記載

数値は記載されていないため、なし。

⑨具体的な変数名(数式用の記号)の記載

本節では変数名や数式の記号は記載されていないため、なし。

⑩図/表があれば、各図/表は何を主張するためのものか

表1: プロンプトの種類とユースケースのカテゴリを示しており、タスクの多様性とその割合を視覚化している。
表6: プロンプトデータセットの規模と出所を示しており、どのデータがどのモデルのトレーニングに使用されたかを明確にしている。

3.4 Human data collection

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 3.4 Human data collection To produce our demonstration and comparison data, and to conduct our main evaluations, we hired a team of about 40 contractors on Upwork and through ScaleAI. Compared to earlier work that collects human preference data on the task of summarization (Ziegler et al., 2019; Stiennon et al., 2020; Wu et al., 2021), our inputs span a much broader range of tasks, and can occasionally include controversial and sensitive topics. Our aim was to select a group of labelers who were sensitive to the preferences of different demographic groups, and who were good at identifying outputs that were potentially harmful. Thus, we conducted a screening test designed to measure labeler performance on these axes. We selected labelers who performed well on this test; for more information about our selection procedure and labeler demographics, see Appendix B.1. During training and evaluation, our alignment criteria may come into conflict: for example, when a user requests a potentially harmful response. During training we prioritize helpfulness to the user (not doing so requires making some difficult design decisions that we leave to future work; see Section 5.4 for more discussion). However, in our final evaluations we asked labelers prioritize truthfulness and harmlessness (since this is what we really care about). As in Stiennon et al. (2020), we collaborate closely with labelers over the course of the project. We have an onboarding process to train labelers on the project, write detailed instructions for each task (see Appendix B.2), and answer labeler questions in a shared chat room. As an initial study to see how well our model generalizes to the preferences of other labelers, we hire a separate set of labelers who do not produce any of the training data. These labelers are sourced from the same vendors, but do not undergo a screening test. Despite the complexity of the task, we find that inter-annotator agreement rates are quite high: training labelers agree with each-other 72.6 ± 1.5% of the time, while for held-out labelers this number is 77.3 ± 1.3%. For comparison, in the summarization work of Stiennon et al. (2020) researcher-researcher agreement was 73 ± 4%.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この節は、InstructGPTのトレーニングと評価において、人間のデータ収集プロセスに関する詳細を説明している。主にラベラーの選定と評価に焦点を当てている。

デモンストレーションと比較データの生成: 約40名のラベラーを雇用し、さまざまなタスクに対する人間のフィードバックを収集した。
幅広いタスクへの対応: 入力データは、従来の要約タスクよりも広範囲にわたる（Stiennon et al., 2020の研究と比較）。
ラベラーの選定とスクリーニング: ラベラーは、さまざまな人々の好みに敏感であり、有害な出力を特定する能力を持つ者をスクリーニングテストで選抜した。
優先事項の衝突: トレーニング時にはユーザーの役に立つことが優先されるが、最終評価では「真実性」と「無害性」が重視される。
ラベラー間の合意率: トレーニング中のラベラー間合意率は72.6 ± 1.5%、テスト用ラベラー間では77.3 ± 1.3%。

②具体的な処理方法の記載

ラベラーの採用: UpworkやScaleAIから約40名のコントラクターを雇用し、幅広いタスクに対応するために人間のフィードバックデータを収集した。
ラベラーの選定プロセス: スクリーニングテストを通じて、異なる人口集団の好みに敏感で、有害な出力を特定できる能力を持つラベラーを選抜した。
- スクリーニングの詳細やラベラーの選定基準についてはAppendix B.1に記載されている。
トレーニングと評価時の優先事項:
- トレーニング時: ユーザーの役に立つことを優先。
- 最終評価時: 「真実性」と「無害性」が重視される。
ラベラーとの連携: Stiennon et al. (2020)の方法に基づき、ラベラーと密接に連携し、オンボーディングプロセスを実施し、詳細なタスク指示を提供した。また、質問に答えるためのチャットルームも提供した。
外部ラベラーの利用: トレーニングに使用されないラベラーを別途雇用し、モデルの一般化能力を評価するために利用した。これらのラベラーはスクリーニングテストを受けていない。

③改良点・工夫点・テクニック等の記載

ラベラー選定の改良: スクリーニングテストにより、人口統計的な敏感さと有害出力を認識する能力を持つラベラーを選抜している点。
外部ラベラーの使用: トレーニングに参加しない別のラベラーグループを用いることで、モデルの一般化能力を評価している点が工夫として挙げられる。

④メカニズムの解明・なぜそうなるのか等の記載

スクリーニングテストの役割: スクリーニングテストにより、敏感なトピックや有害な出力に対して適切に対応できるラベラーを選抜できる。これにより、モデルがより安全で有用な応答を生成するための高品質なデータが得られる。
優先事項の変更: トレーニング時に役立つ応答を優先することで、ユーザーの要求に応える能力が向上する一方、最終評価で真実性と無害性を重視することで、最終的な応答の品質と安全性を確保している。

⑤性能が向上した記載

ラベラー間合意率の向上: トレーニング中のラベラー間の合意率は72.6 ± 1.5%、テスト用ラベラーでは77.3 ± 1.3%と高い合意率が確認された。これは、Stiennon et al. (2020)の研究における73 ± 4%と比較しても高い水準である。
モデルの一般化能力の評価: 外部ラベラーを使用して、モデルがトレーニングに含まれないデータに対しても高い性能を維持していることが示された。

⑥比較の記載

研究	ラベラーの合意率	対応する図/表番号
本研究 (トレーニング中)	72.6 ± 1.5%	なし
本研究 (テスト用ラベラー)	77.3 ± 1.3%	なし
Stiennon et al. (2020)	73 ± 4%	なし

⑦上記⑥以外で表に出来そうな部分

ラベラー選定基準と優先事項を表にまとめることができる。

項目	説明
スクリーニングテスト	ラベラーの人口統計的敏感さと有害出力を識別する能力を測定する
トレーニング時の優先事項	ユーザーの役に立つことを重視
最終評価時の優先事項	真実性と無害性を重視

⑧具体的な数値の記載

数値	説明
約40	ラベラーの人数
72.6 ± 1.5%	トレーニング中のラベラー間合意率
77.3 ± 1.3%	テスト用ラベラー間合意率
73 ± 4%	Stiennon et al. (2020) の研究における合意率

⑨具体的な変数名(数式用の記号)の記載

本節には具体的な変数名や数式の記号はないため、なし。

⑩図/表があれば、各図/表は何を主張するためのものか

図表の記載はないため、なし。

3.5 Models

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 3.5 Models We start with the GPT-3 pretrained language models from Brown et al. (2020). These models are trained on a broad distribution of Internet data and are adaptable to a wide range of downstream tasks, but have poorly characterized behavior. Starting from these models, we then train models with three different techniques: Supervised fine-tuning (SFT). We fine-tune GPT-3 on our labeler demonstrations using supervised learning. We trained for 16 epochs, using a cosine learning rate decay, and residual dropout of 0.2. We do our final SFT model selection based on the RM score on the validation set. Similarly to Wu et al. (2021), we find that our SFT models overfit on validation loss after 1 epoch; however, we find that training for more epochs helps both the RM score and human preference ratings, despite this overfitting. Reward modeling (RM). Starting from the SFT model with the final unembedding layer removed, we trained a model to take in a prompt and response, and output a scalar reward. In this paper we only use 6B RMs, as this saves a lot of compute, and we found that 175B RM training could be unstable and thus was less suitable to be used as the value function during RL (see Appendix C for more details). In Stiennon et al. (2020), the RM is trained on a dataset of comparisons between two model outputs on the same input. They use a cross-entropy loss, with the comparisons as labels—the difference in rewards represents the log odds that one response will be preferred to the other by a human labeler. In order to speed up comparison collection, we present labelers with anywhere between K = 4 and K = 9 responses to rank. This produces (K 2 ) comparisons for each prompt shown to a labeler. Since comparisons are very correlated within each labeling task, we found that if we simply shuffle the comparisons into one dataset, a single pass over the dataset caused the reward model to overfit.5 Instead, we train on all (K 2 ) comparisons from each prompt as a single batch element. This is much more computationally efficient because it only requires a single forward pass of the RM for each completion (rather than (K 2 ) forward passes for K completions) and, because it no longer overfits, it achieves much improved validation accuracy and log loss. Specifically, the loss function for the reward model is: loss (θ) = − 1 (K 2 ) E(x,yw ,yl)∼D [log (σ (rθ (x, yw) − rθ (x, yl)))] (1) where rθ (x, y) is the scalar output of the reward model for prompt x and completion y with parameters θ, yw is the preferred completion out of the pair of yw and yl, and D is the dataset of human comparisons. Finally, since the RM loss is invariant to shifts in reward, we normalize the reward model using a bias so that the labeler demonstrations achieve a mean score of 0 before doing RL.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この節は、GPT-3モデルをベースにしたトレーニングプロセスに関する詳細を説明している。特に、3つの異なる技術を用いたモデルのトレーニング手法（Supervised fine-tuning, Reward modeling, Reinforcement learning）に焦点を当てている。

Supervised fine-tuning (SFT): GPT-3モデルをラベラーのデモンストレーションに基づいてファインチューニングし、RMスコアと人間の評価を改善した。
Reward modeling (RM): SFTモデルをベースに、入力と応答を受け取り、スカラー報酬を出力する報酬モデルをトレーニングした。
オーバーフィッティングの回避: 比較データをシャッフルせずに1つのバッチとして処理することで、計算効率を向上させ、オーバーフィットを回避した。
報酬モデルの損失関数: ラベル間の比較を元に、報酬モデルの損失関数を計算した（式(1)）。

②具体的な処理方法の記載

Supervised fine-tuning (SFT):
- GPT-3モデルをラベラーのデモンストレーションデータで16エポックにわたってファインチューニングした。
- コサイン学習率減衰と、残差ドロップアウト率0.2を使用した。
- 最終的なモデル選択は、検証セットのRMスコアに基づいて行った。
- Wu et al. (2021)と同様に、SFTモデルは1エポック後に検証損失でオーバーフィットすることが確認されたが、より多くのエポックでトレーニングすることで、RMスコアと人間の評価が向上した。
Reward modeling (RM):
- 最終的なアンエンベディング層を除去したSFTモデルからスタートし、入力（プロンプト）と応答を受け取り、スカラー報酬を出力するモデルをトレーニングした。
- 6BパラメータのRMのみを使用し、これは計算リソースを大幅に節約するためである。また、175BパラメータのRMはトレーニングが不安定で、RLの価値関数として使用するには適していなかった。
- 比較データ収集の高速化のため、ラベラーにK=4～9の応答を提示し、それらをランク付けさせた。これにより、各プロンプトに対して(K^2)個の比較が得られる。
- 比較データが同一タスク内で非常に相関しているため、比較をシャッフルすると報酬モデルがオーバーフィットすることが確認された。そこで、すべての比較を1つのバッチ要素として処理し、計算効率を向上させ、オーバーフィットを防いだ。
- 損失関数: 報酬モデルの損失関数は次の通りである。
  $
  loss(\theta) = - \frac{1}{K^2} \mathbb{E}{(x, y_w, y_l) \sim D} [\log(\sigma(r{\theta}(x, y_w) - r_{\theta}(x, y_l)))]
  $
  ここで、$ r_{\theta}(x, y) $はプロンプト$ x $と応答$ y $に対する報酬モデルのスカラー出力であり、$ y_w $は人間ラベラーが選好した応答、$ y_l $は比較対象の応答、$ D $は人間の比較データセットである。
報酬モデルの正規化: 報酬モデルの損失は報酬のシフトに不変であるため、報酬を正規化し、ラベラーのデモンストレーションが平均スコア0を達成するようにバイアスを調整した。

③改良点・工夫点・テクニック等の記載

SFTにおける多エポックトレーニングの利点: オーバーフィットが検証損失に現れたにもかかわらず、多エポックでトレーニングすることでRMスコアと人間の評価が向上した。
比較データ処理の改良: 各比較をシャッフルせずにバッチとして処理することで、計算効率を向上させ、報酬モデルのオーバーフィットを防ぐ工夫がされている。

④メカニズムの解明・なぜそうなるのか等の記載

多エポックトレーニングが効果的な理由: オーバーフィットが検証損失に現れたとしても、より多くのエポックでトレーニングすることで、ラベラーの評価やRMスコアが改善される。これは、モデルが過剰に適合するデータの一部が、最終的な人間の好みに対応しているためと考えられる。
比較データをシャッフルしない理由: 各ラベル付けタスク内の比較が非常に相関しているため、シャッフルすると報酬モデルがオーバーフィットする。そのため、比較を1つのバッチ要素として扱うことで、この問題を回避し、計算効率を向上させている。

⑤性能が向上した記載

SFTモデル: 多エポックでトレーニングすることで、RMスコアと人間の好みに基づく評価が向上した（具体的な数値は記載なし）。
報酬モデル: 各比較をシャッフルせずにバッチ処理した結果、計算効率が向上し、検証精度とロスが改善された（具体的な数値は記載なし）。

「②具体的な処理方法」の「$K^2$」は間違い。ChatGPTへのテキスト入力を「(K 2)」としてしまっているため、正しい意味が伝わらず、仕方ない。

論文の本文中では、二項係数（binomial coefficient）と呼ばれる表記方法で記載される。

最終モデル(GPT-3)は175Bの大サイズなのに、報酬によって最終モデルに教えるための報酬モデル(RM)は6Bの(100倍以上小さい)方が良い。知識等の全てを教える訳ではなく、回答が好ましいスタイルかだけを教えれば良いので、スタイルのチェックだけであれば小さいモデル(6B)で十分、ということなのかと推測。

▼生成AIの回答(ChatGPT-4o)
⑥比較の記載があれば違いを表でまとめて下さい。

手法	特徴	改善点	計算効率
Supervised fine-tuning (SFT)	GPT-3をラベラーのデモンストレーションデータでファインチューニング	検証損失は1エポックでオーバーフィットするが、多エポックでRMスコアと人間評価が向上	エポック数を増やすとオーバーフィットするが、RMスコアは向上
Reward modeling (RM)	SFTモデルを基に報酬モデルをトレーニングし、スカラー報酬を出力	K個の応答を比較するデータをバッチ要素として処理することで、オーバーフィットを回避し、計算効率を向上	(K^2)の比較データを1つのバッチとして処理することで、計算効率が大幅に向上

解説:
この表では、Supervised fine-tuning（SFT）とReward modeling（RM）の違いを比較している。SFTでは、エポック数を増やすことでRMスコアと人間評価が向上することが特徴である。一方、RMでは比較データをシャッフルせずにバッチ要素として処理することで、オーバーフィットを回避しつつ計算効率を向上させた。

⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。

トレーニング手法	エポック数	ドロップアウト率	学習率減衰方法	最終評価指標
Supervised fine-tuning (SFT)	16エポック	0.2	コサイン学習率減衰	RMスコア、人間評価

解説:
SFTのトレーニング設定に関する詳細を表にまとめた。エポック数、ドロップアウト率、学習率減衰方法、最終的な評価指標としてRMスコアと人間の好みの評価が挙げられている。

⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。

数値	説明
16	SFTでのエポック数
0.2	SFTでの残差ドロップアウト率
K=4~9	報酬モデルの比較データ収集時に提示される応答の数
6B	報酬モデルのパラメータ数（6BパラメータのRMのみを使用）
175B	175BパラメータのRMが不安定であるため使用しなかった
(K^2)	報酬モデルでの比較データの組み合わせ数

⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。

変数名	説明
$ r_{\theta} $	プロンプト$ x $と応答$ y $に対する報酬モデルのスカラー出力
$ x $	プロンプト
$ y_w $	人間ラベラーが選好した応答
$ y_l $	比較対象の応答
$ D $	人間の比較データセット
$ K $	応答の数（比較データ収集時に提示される応答の数）

⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。

この節には具体的な図表の記載はない。

「⑧具体的な数値」の「$K^2$」は間違い。ChatGPTへのテキスト入力を「(K 2)」としてしまっているため、正しい意味が伝わらず、仕方ない。

論文の本文中では、二項係数（binomial coefficient）と呼ばれる表記方法で記載される。

3.5 Models (続き)

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) Reinforcement learning (RL). Once again following Stiennon et al. (2020), we fine-tuned the SFT model on our environment using PPO (Schulman et al., 2017). The environment is a bandit environment which presents a random customer prompt and expects a response to the prompt. Given the prompt and response, it produces a reward determined by the reward model and ends the episode. In addition, we add a per-token KL penalty from the SFT model at each token to mitigate over- optimization of the reward model. The value function is initialized from the RM. We call these models “PPO.” We also experiment with mixing the pretraining gradients into the PPO gradients, in order to fix the performance regressions on public NLP datasets. We call these models “PPO-ptx.” We maximize the following combined objective function in RL training: objective (φ) =E(x,y)∼DπRL φ [rθ (x, y) − β log (πRL φ (y | x)/πSFT(y | x))] + γEx∼Dpretrain [log(πRL φ (x))] (2) where πRL φ is the learned RL policy, πSFT is the supervised trained model, and Dpretrain is the pretraining distribution. The KL reward coefficient, β, and the pretraining loss coefficient, γ, control the strength of the KL penalty and pretraining gradients respectively. For "PPO" models, γ is set to 0. Unless otherwise specified, in this paper InstructGPT refers to the PPO-ptx models. Baselines. We compare the performance of our PPO models to our SFT models and GPT-3. We also compare to GPT-3 when it is provided a few-shot prefix to ‘prompt’ it into an instruction-following mode (GPT-3-prompted). This prefix is prepended to the user-specified instruction.6 We additionally compare InstructGPT to fine-tuning 175B GPT-3 on the FLAN (Wei et al., 2021) and T0 (Sanh et al., 2021) datasets, which both consist of a variety of NLP tasks, combined with natural language instructions for each task (the datasets differ in the NLP datasets included, and the style of instructions used). We fine-tune them on approximately 1 million examples respectively and choose the checkpoint which obtains the highest reward model score on the validation set. See Appendix C for more training details.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

何についての記載か: この節は、強化学習（Reinforcement Learning, RL）を用いたSFT（Supervised Fine-Tuning）モデルのファインチューニングについて述べている。具体的には、PPO（Proximal Policy Optimization）アルゴリズムを使用したモデルと、事前学習の勾配を組み合わせたPPO-ptxモデルについて記載している。
要旨:
- SFTモデルをPPOアルゴリズムで強化学習する。
- 環境はバンディット環境で、ランダムなプロンプトに対する応答を生成し、報酬モデルに基づいて報酬を算出する。
- KLペナルティを導入して、報酬モデルの過剰最適化を防ぐ。
- 事前学習の勾配をPPOに組み込み、公開NLPデータセットでの性能低下を修正する試み（PPO-ptxモデル）。
- ベースラインとして、SFTモデル、GPT-3、GPT-3のfew-shot学習、FLAN、T0と比較。

②具体的な処理方法の記載があれば説明下さい。

SFTモデルをPPOアルゴリズムでファインチューニングする。環境はバンディット環境であり、プロンプトと応答のペアに基づいて報酬を計算する。報酬は報酬モデルによって決定され、エピソードはそこで終了する。
KLペナルティを各トークンごとに追加し、報酬モデルの過剰最適化を防止する。このKLペナルティは、SFTモデルとPPOモデルの確率分布の差に基づいて計算される。
事前学習の勾配をPPOの勾配に組み込むことで、公開NLPデータセット上での性能低下を防ぐ。この手法を用いるモデルをPPO-ptxと呼ぶ。
RLトレーニングでは以下の結合目的関数を最大化する:
$
\text{objective} (\phi) = E_{(x, y) \sim D_{\pi_{\text{RL}}}} [r_{\theta} (x, y) - \beta \log (\frac{\pi_{\text{RL}} (y | x)}{\pi_{\text{SFT}} (y | x)})] + \gamma E_{x \sim D_{\text{pretrain}}} [\log (\pi_{\text{RL}} (x))]
$
- $\pi_{\text{RL}}$: 強化学習で学習されたポリシー
- $\pi_{\text{SFT}}$: SFTモデルのポリシー
- $D_{\text{pretrain}}$: 事前学習データの分布
- $\beta$: KLペナルティの係数
- $\gamma$: 事前学習の勾配の重み
ベースラインとして、SFTモデル、GPT-3、GPT-3のfew-shot学習、FLAN、T0のモデルと比較し、性能を評価する。

③改良点・工夫点・テクニック等の記載があれば説明下さい。

KLペナルティ: 各トークンに対してSFTモデルとのKLペナルティを加えることで、報酬モデルの過剰最適化を防ぐ工夫がなされている。
事前学習の勾配の統合: 事前学習の勾配をRLの目的関数に組み込むことで、公開NLPデータセットでの性能低下を修正する工夫が見られる。これにより、学習が進む中で事前学習データの知識を活用する。
PPO-ptxモデル: PPOモデルに事前学習の勾配を加えることによって、公開データセットでの性能低下を修正しつつ、PPOモデルの報酬最適化も行う。このハイブリッド手法がPPO-ptxモデルである。

④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。

KLペナルティの導入: KLペナルティを導入することで、強化学習における報酬モデルの過剰最適化を防ぎ、生成する応答のバランスを保つ。このペナルティがないと、報酬モデルが特定のパターンに過度に適応しすぎる恐れがあるため、各トークンにおいてSFTモデルの分布と比較して正則化を行う。
事前学習の勾配の統合: 事前学習の勾配を組み込むことにより、公開NLPデータセットでの性能低下を防ぐ。これは、強化学習の過程で事前学習データの知識が失われるのを防ぐためであり、結果としてより汎用的な性能を維持する。

⑤性能が向上した記載があれば説明下さい。

PPO-ptxモデル: 事前学習の勾配を加えたPPO-ptxモデルは、PPOモデルに比べて公開NLPデータセットでの性能低下を修正しつつ、報酬モデルの最適化も達成している。
ベースラインとの比較: FLANやT0データセットを用いた175B GPT-3のファインチューニングに比べて、InstructGPT（PPO-ptxモデル）はより高い報酬モデルスコアを得ていることが記載されている（詳細はAppendix Cに記載）。

論文の本文中でも、上記の内容以上の数式の説明はない模様。

参考:

KLペナルティ項: プロンプト x に対して、今学習しているモデル (RL) が y という回答を生成する確率が、もとのファインチューニングした SFT モデルが y という回答を生成する確率よりも大きく高くなっているとペナルティを与えるものです。
1つ目はペナルティが強化学習中のモデルが y という回答を生成する確率が高くなるときにペナルティが大きくなる式になっていることから理解ができます。

事前学習データセットの項: この項は、事前学習データセットにある文章の重要性を高める項というイメージです。$D_{pretrain}$ は事前学習で使った一般的な自然言語処理データセットを意味します。

なお、InstructGPT では上記の目的関数を最大化しますが、通常の勾配降下法で解くと戦略 ϕ の更新が大きくなりすぎる傾向があり、そこで PPO (Proximal Policy Optimization) というアルゴリズムを使います。
上記の目的変数を最大化するために、単純に勾配を計算して θ を更新するという方法ができるのですが、その場合パラメータの更新幅が大きすぎる傾向があるため、パラメータを更新するサイズの上限を決めるイメージです。

▼生成AIの回答(ChatGPT-4o)
⑥比較の記載があれば違いを表でまとめて下さい。

モデル名	使用手法	目的関数	KLペナルティ	事前学習勾配	使用データセット	ベースラインとの比較
PPO	PPOアルゴリズム	強化学習	あり	なし	ランダムプロンプト	SFTモデル, GPT-3と比較
PPO-ptx	PPO + 事前学習勾配	強化学習	あり	あり	ランダムプロンプト	SFTモデル, GPT-3と比較
GPT-3-prompted	GPT-3 few-shot	プロンプト	なし	なし	ランダムプロンプト	SFTモデル, GPT-3と比較
GPT-3 (175B)	GPT-3 (FLAN/T0ファインチューニング)	なし	なし	なし	FLAN, T0	InstructGPTと比較

図/表番号: Appendix Cで追加の詳細が記載されている。

⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。

RLモデルの目的関数に含まれる項目

項目	説明
$r_{\theta} (x, y)$	報酬モデルの出力で、プロンプトxと応答yに対するスカラー報酬
$\log (\pi_{\text{RL}} (y \| x))$	RLポリシーの対数尤度
$\log (\pi_{\text{SFT}} (y \| x))$	SFTモデルの対数尤度
$\beta$	KLペナルティの重み
$\gamma$	事前学習勾配の重み

⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。

数値	説明
175B	GPT-3のモデルサイズ（FLAN/T0でのファインチューニング）
1 million	FLAN, T0データセットでのファインチューニングに使用された例数
$\beta = 0$	PPOモデルにおけるKLペナルティの重み
$\gamma = 0$	PPOモデルにおける事前学習勾配の重み
任意値	PPO-ptxモデルにおける$\gamma$（具体値は記載なし）

⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。

変数名	説明
$r_{\theta} (x, y)$	プロンプトxと応答yに対する報酬モデルのスカラー出力
$\pi_{\text{RL}} (y \| x)$	強化学習で学習されたポリシーの確率
$\pi_{\text{SFT}} (y \| x)$	SFTモデルの確率
$\beta$	KLペナルティの重み
$\gamma$	事前学習勾配の重み
$D_{\pi_{\text{RL}}}$	強化学習のデータ分布
$D_{\text{pretrain}}$	事前学習データの分布

⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。

図/表の主張: Appendix Cでは、175B GPT-3のファインチューニングや、報酬モデルのスコアが記載されており、InstructGPTモデルが他のベースラインよりも高い報酬モデルスコアを達成したことを示している。

「⑨具体的な変数名」では、全ての変数名が列挙されていない。

生成AIがMarkdown形式の表として回答する時に、項目の中に「|」の文字を含むと(例: 条件付き確率のP(y | x)等)、表がくずれて表示されない問題がある。「|」へエスケープする必要がある。

「⑩図/表」の「Appendix C」は図/表ではないが、回答に記載される。

▼質問(人間)
上記の「⑨具体的な変数名」では、全ての変数名が列挙されていないので、再度、具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。(|変数名|説明|の表へ)

▼生成AIの回答(ChatGPT-4o)
⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。

変数名	説明
$ r_{\theta} (x, y) $	プロンプト $x$ と応答 $y$ に対する報酬モデルのスカラー出力
$ \pi_{\text{RL}} (y \| x) $	強化学習で学習されたポリシー $ \pi_{\text{RL}} $ の確率
$ \pi_{\text{SFT}} (y \| x) $	SFTモデル $ \pi_{\text{SFT}} $ の確率
$ \beta $	KLペナルティの重み
$ \gamma $	事前学習勾配の重み
$ D_{\pi_{\text{RL}}} $	強化学習データの分布
$ D_{\text{pretrain}} $	事前学習データの分布
$ \mathcal{L}_{\text{PPO}} $	PPOモデルの目的関数
$ \log(\pi_{\text{RL}}(y \| x)) $	RLポリシー $ \pi_{\text{RL}} $ の対数尤度
$ \log(\pi_{\text{SFT}}(y \| x)) $	SFTポリシー $ \pi_{\text{SFT}} $ の対数尤度
$ x $	ランダムに提示されたプロンプト
$ y $	プロンプト $ x $ に対する応答
$ \mathcal{D}_{\text{pretrain}} $	事前学習データの分布
$ \theta $	報酬モデルのパラメータ
$ \phi $	強化学習ポリシーのパラメータ
$ \mathbb{E} $	期待値
$ \sigma $	ロジスティック関数
$ \text{loss}(\theta) $	報酬モデルの損失関数

これで全ての数式用の記号が表にまとめられている。

▼質問(人間)
πRL、πSFTとは何か?

▼生成AIの回答(ChatGPT-4o)
$ \pi_{\text{RL}} $ と $ \pi_{\text{SFT}} $ は、それぞれ以下のように定義される。

$ \pi_{\text{RL}} $: 強化学習 (Reinforcement Learning, RL) によって学習されたポリシーである。これは、強化学習のプロセスを通じて、与えられたプロンプト $ x $ に対する応答 $ y $ の生成確率を表す。RLポリシーは、報酬モデルに基づいて最適な応答を選択しようとする。
$ \pi_{\text{SFT}} $: 教師あり学習 (Supervised Fine-Tuning, SFT) によって学習されたポリシーである。これは、人間のラベラーによるデモンストレーションを使用してファインチューニングされたGPT-3モデルのポリシーであり、プロンプト $ x $ に対する応答 $ y $ の生成確率を表す。SFTポリシーは、事前にラベル付けされたデータに基づいて最も適切な応答を選択しようとする。

つまり、$ \pi_{\text{RL}} $ は強化学習によって得られた応答生成モデルを指し、$ \pi_{\text{SFT}} $ は教師あり学習によって得られた応答生成モデルを指す。

3.6 Evaluation

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 3.6 Evaluation To evaluate how “aligned” our models are, we first need to clarify what alignment means in this context. The definition of alignment has historically been a vague and confusing topic, with various competing proposals (Chen et al., 2021; Leike et al., 2018; Gabriel, 2020). Following Leike et al. (2018), our aim is to train models that act in accordance with user intentions. More practically, for the purpose of our language tasks, we use a framework similar to Askell et al. (2021), who define models to be aligned if they are helpful, honest, and harmless. To be helpful, the model should follow instructions, but also infer intention from a few-shot prompt or another interpretable pattern such as “Q: {question}\nA:”. Since a given prompt’s intention can be unclear or ambiguous, we rely on judgment from our labelers, and our main metric is labeler preference ratings. However, since our labelers are not the users who generated the prompts, there could be a divergence between what a user actually intended and what the labeler thought was intended from only reading the prompt. It is unclear how to measure honesty in purely generative models; this requires comparing the model’s actual output to its “belief” about the correct output, and since the model is a big black box, we can’t infer its beliefs. Instead, we measure truthfulness—whether the model’s statements about the world are true—using two metrics: (1) evaluating our model’s tendency to make up information on closed domain tasks (“hallucinations”), and (2) using the TruthfulQA dataset (Lin et al., 2021). Needless to say, this only captures a small part of what is actually meant by truthfulness. Similarly to honesty, measuring the harms of language models also poses many challenges. In most cases, the harms from language models depend on how their outputs are used in the real world. For instance, a model generating toxic outputs could be harmful in the context of a deployed chatbot, but might even be helpful if used for data augmentation to train a more accurate toxicity detection model. Earlier in the project, we had labelers evaluate whether an output was ‘potentially harmful’. However, we discontinued this as it required too much speculation about how the outputs would ultimately be used; especially since our data also comes from customers who interact with the Playground API interface (rather than from production use cases). Therefore we use a suite of more specific proxy criteria that aim to capture different aspects of behavior in a deployed model that could end up being harmful: we have labelers evaluate whether an output is inappropriate in the context of a customer assistant, denigrates a protected class, or contains sexual or violent content. We also benchmark our model on datasets intended to measure bias and toxicity, such as RealToxicityPrompts (Gehman et al., 2020) and CrowS-Pairs (Nangia et al., 2020). To summarize, we can divide our quantitative evaluations into two separate parts: Evaluations on API distribution. Our main metric is human preference ratings on a held out set of prompts from the same source as our training distribution. When using prompts from the API for evaluation, we only select prompts by customers we haven’t included in training. However, given that our training prompts are designed to be used with InstructGPT models, it’s likely that they disadvantage the GPT-3 baselines. Thus, we also evaluate on prompts submitted to GPT-3 models on the API; these prompts are generally not in an ‘instruction following’ style, but are designed specifically for GPT-3. In both cases, for each model we calculate how often its outputs are preferred to a baseline policy; we choose our 175B SFT model as the baseline since its performance is near the middle of the pack. Additionally, we ask labelers to judge the overall quality of each response on a 1-7 Likert scale and collect a range of metadata for each model output (see Table 3). Evaluations on public NLP datasets. We evaluate on two types of public datasets: those that capture an aspect of language model safety, particularly truthfulness, toxicity, and bias, and those that capture zero-shot performance on traditional NLP tasks like question answering, reading comprehen- sion, and summarization. We also conduct human evaluations of toxicity on the RealToxicityPrompts dataset (Gehman et al., 2020). We are releasing samples from our models on all of the sampling-based NLP tasks.7

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

本章は、InstructGPTモデルの「アラインメント」についての評価手法を述べている。「アラインメント」とは、モデルがユーザーの意図に沿って行動するかどうかを評価するものである。

アラインメントの定義において、モデルが「ユーザーの意図に沿った行動をする」ことを目指している (Leike et al., 2018)
モデルが「役立つ (helpful)」、「誠実 (honest)」、そして「無害 (harmless)」であるかどうかを基準にして評価する (Askell et al., 2021)
役立つモデルは、ユーザーの指示に従い、質問の意図を推論することが求められる
誠実さを評価するためには、モデルの「信念」を評価する必要があるが、代わりに真実性 (truthfulness) を評価する
無害性の評価には、毒性やバイアスを測定するためのデータセット（RealToxicityPromptsやCrowS-Pairs）が使用される
APIディストリビューションに基づく評価と、公的NLPデータセットに基づく評価の2つに分けて評価する

② 具体的な処理方法の記載

役立つモデルの評価方法: モデルが指示に従い、曖昧なプロンプトに対してもユーザーの意図を推論できるかどうかを、ラベラーの判断によって評価する。主なメトリックは、ラベラーによる「好みの評価」であり、ラベラーはプロンプトに対して最も適切と思われる応答を評価する。
誠実さの評価方法: 生成モデルでの誠実さを測定するために、(1) クローズドドメインタスクにおける「幻覚」を評価し、(2) TruthfulQAデータセットを使用してモデルの発言の真実性を測定する。
無害性の評価方法: 言語モデルが有害な出力を生成するリスクを評価するために、ラベラーが出力が「不適切」であるかどうかや、特定の保護対象グループに対して侮辱的かどうかを評価する。加えて、RealToxicityPromptsやCrowS-Pairsなどのデータセットを使って、モデルのバイアスや毒性を評価する。
評価の2つの方法: APIディストリビューションと公的なNLPデータセットに基づく評価が行われる。APIディストリビューションでは、ラベラーがプロンプトごとにモデルの出力の品質を1〜7のリッカートスケールで評価する。また、GPT-3のAPIを通じて提出されたプロンプトに基づく評価も行い、モデル出力の好みを計算する。公的なNLPデータセットでは、真実性や毒性、バイアスに関する評価と、従来のNLPタスクのゼロショット性能に関する評価を行う。

③ 改良点・工夫点・テクニック等の記載

役立つ評価方法の改善: ラベラーによる評価を主な指標とし、ユーザーの意図を適切に反映した応答かどうかを確認しているが、ラベラーはプロンプトを生成したユーザーではないため、ユーザーの意図との間に乖離が生じる可能性があることを認識している。この点を改善するため、役立つかどうかの評価には慎重な判断が必要である。
真実性の評価: モデルが「信念」を持っているかどうかは測定できないが、「幻覚」を評価することで、クローズドドメインタスクにおけるモデルの誠実さを評価している。
無害性の評価: ラベラーが出力が「有害」であるかどうかを評価するプロセスは中止されたが、その代わりに、特定の基準に基づいてモデルの出力の毒性やバイアスを評価する。

④ メカニズムの解明・なぜそうなるのか等の記載

役立つ評価: モデルが役立つかどうかは、プロンプトの解釈に依存する。プロンプトが曖昧である場合、ラベラーが意図を推論して応答の評価を行うが、このプロセスにはユーザーの意図とのズレが生じる可能性がある。
真実性の評価: 誠実さの直接的な評価は難しいため、代わりにモデルが情報を「作り出す」傾向（幻覚）を評価することで、間接的に真実性を測定している。
無害性の評価: モデルの無害性は、出力の使用方法に依存するため、汎用的な指標で評価するのは難しいが、毒性やバイアスに関するデータセットを用いることで、特定の危険な傾向を捉えている。

⑤ 性能が向上した記載

人間の好み評価: APIディストリビューションに基づく評価において、ラベラーが特定のモデルの出力を好む頻度を基に評価が行われ、各モデルの性能がベースラインの175B SFTモデルと比較されている。性能評価の詳細は表3に示されている。
公的NLPデータセットでの評価: 真実性、毒性、バイアスなどの安全性に関する性能や、従来のNLPタスクでのゼロショット性能が評価され、これらの指標を通じてInstructGPTモデルが評価されている。

▼生成AIの回答(ChatGPT-4o)
⑥ 比較の記載

本節では、GPT-3モデルとInstructGPTモデルの出力が比較されている。主に、プロンプトの「役立ち度」や「好み」などの評価基準に基づく比較が行われている。また、ベースラインとして175B SFTモデルが使用されている。

モデル	評価基準	特徴	図/表番号
GPT-3	APIディストリビューションでのプロンプト	「指示に従う」プロンプトではない、GPT-3に最適化されたプロンプト	なし
InstructGPT	APIディストリビューションでのプロンプト	指示に従うプロンプトに基づく、ユーザーの意図に沿った出力が可能	なし
175B SFT	ベースラインモデル	APIディストリビューション評価において「中程度の性能」	表3

⑦ 上記⑥以外で表に出来そうな部分

公的NLPデータセットでの評価が、真実性、毒性、バイアス、NLPタスクのゼロショット性能に基づいて行われているため、これらの要素を表でまとめることができる。

公的NLPデータセット	評価対象	説明	図/表番号
RealToxicityPrompts	毒性	人間による評価に基づいてモデルの毒性を測定	なし
CrowS-Pairs	バイアス	人間による評価に基づいてモデルのバイアスを測定	なし
TruthfulQA	真実性	モデルの発言が真実かどうかを評価するデータセット	なし
伝統的なNLPタスク	ゼロショット性能	質問応答や要約などの従来のタスクでモデルの性能を評価	なし

⑧ 具体的な数値

評価の具体的な数値は提示されていないため、記載なし。

⑨ 具体的な変数名

変数名	説明
175B SFT	ベースラインとして使用されるモデルのパラメータ数（1750億パラメータ）
TruthfulQA	モデルの真実性を評価するデータセット
RealToxicityPrompts	モデルの毒性を評価するデータセット
CrowS-Pairs	モデルのバイアスを評価するデータセット
Likert scale	応答の質を評価する際に使用される1-7のスケール

⑩ 図/表の説明

表3: APIディストリビューションに基づく評価の結果を示す。ラベラーによる「好みの評価」に基づいて、各モデルの出力がどれだけ好まれるかを評価している。また、リッカートスケールを使って出力の全体的な質を評価し、その結果をまとめている。

4 Results

4.1 Results on the API distribution

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 4 Results In this section, we provide experimental evidence for our claims in Section 1, sorted into three parts: results on the API prompt distribution, results on public NLP datasets, and qualitative results. 4.1 Results on the API distribution Labelers significantly prefer InstructGPT outputs over outputs from GPT-3. On our test set of prompts, our labelers significantly prefer InstructGPT outputs across model sizes. These results are shown in Figure 1. We find that GPT-3 outputs perform the worst, and one can obtain significant step-size improvements by using a well-crafted few-shot prompt (GPT-3 (prompted)), then by training on demonstrations using supervised learning (SFT), and finally by training on comparison data using PPO. Adding updates on the pretraining mix during PPO does not lead to large changes in labeler preference. To illustrate the magnitude of our gains: when compared directly, 175B InstructGPT outputs are preferred to GPT-3 outputs 85 ± 3% of the time, and preferred 71 ± 4% of the time to few-shot GPT-3. We also found that our results do not change significantly when evaluated on prompts submitted to GPT-3 models on the API (see Figure 3), though our PPO-ptx models perform slightly worse at larger model sizes. In Figure 4 we show that labelers also rate InstructGPT outputs favorably along several more concrete axes. Specifically, compared to GPT-3, InstructGPT outputs are more appropriate in the context of a customer assistant, more often follow explicit constraints defined in the instruction (e.g. “Write your answer in 2 paragraphs or less.”), are less likely to fail to follow the correct instruction entirely, and make up facts (‘hallucinate’) less often in closed-domain tasks. These results suggest that InstructGPT models are more reliable and easier to control than GPT-3. We’ve found that our other metadata categories occur too infrequently in our API to obtain statistically significant differences between our models. Our models generalize to the preferences of "held-out" labelers that did not produce any train- ing data. Held-out labelers have similar ranking preferences as workers who we used to produce training data (see Figure 3). In particular, according to held-out workers, all of our InstructGPT models still greatly outperform the GPT-3 baselines. Thus, our InstructGPT models aren’t simply overfitting to the preferences of our training labelers. We see further evidence of this from the generalization capabilities of our reward models. We ran an experiment where we split our labelers into 5 groups, and train 5 RMs (with 3 different seeds) using 5-fold cross validation (training on 4 of the groups, and evaluating on the held-out group). These RMs have an accuracy of 69.6 ± 0.9% on predicting the preferences of labelers in the held-out group, a small decrease from their 72.4 ± 0.4% accuracy on predicting the preferences of labelers in their training set. Public NLP datasets are not reflective of how our language models are used. In Figure 5, we also compare InstructGPT to our 175B GPT-3 baselines fine-tuned on the FLAN (Wei et al., 2021) and T0 (Sanh et al., 2021) datasets (see Appendix C for details). We find that these models perform better than GPT-3, on par with GPT-3 with a well-chosen prompt, and worse than our SFT baseline. This indicates that these datasets are not sufficiently diverse to improve performance on our API prompt distribution. In a head to head comparison, our 175B InstructGPT model outputs were preferred over our FLAN model 78 ±4% of the time and over our T0 model 79 ± 4% of the time. Likert scores for these models are shown in Figure 5. We believe our InstructGPT model outperforms FLAN and T0 for two reasons. First, public NLP datasets are designed to capture tasks that are easy to evaluate with automatic metrics, such as classification, question answering, and to a certain extent summarization and translation. However, classification and QA are only a small part (about 18%) of what API customers use our language models for, whereas open-ended generation and brainstorming consist of about 57% of our prompt dataset according to labelers (see Table 1). Second, it can be difficult for public NLP datasets to obtain a very high diversity of inputs (at least, on the kinds of inputs that real-world users would be interested in using). Of course, tasks found in NLP datasets do represent a kind of instruction that we would like language models to be able to solve, so the broadest type instruction-following model would combine both types of datasets.

Figure 3: Preference results of our models, measured by winrate against the 175B SFT model. Left:
results on prompts submitted to GPT models on the API; Right: results on prompts submitted to
InstructGPT models on the API; Top: results from held-out labelers; Bottom: results from training
labelers. We omit GPT (prompted) from the evals on prompts submitted to GPT-3 models (left) as
these prompts are already designed to perform well for GPT-3, as opposed to prompts submitted to
InstructGPT models (right).

Figure 4: Metadata results on the API distribution. Note that, due to dataset sizes, these results are
collapsed across model sizes. See Appendix E.2 for analysis that includes model size. Compared
to GPT-3, the PPO models are more appropriate in the context of a customer assistant, are better at
following explicit constraints in the instruction and attempting the correct instruction, and less likely
to ‘hallucinate’ (meaning, making up information on closed domain tasks like summarization).

Figure 5: Comparing our models with FLAN and T0 in terms of Likert scores on a 1-7 scale, on the
InstructGPT prompt distribution. FLAN and T0 perform better than default GPT-3, and comparably
with a few-shot GPT-3 model placed into ‘instruction-following’ mode.

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

何についての記載か: 本節は、InstructGPTモデルとGPT-3モデルの実験結果についての記載である。APIプロンプト分布、公開NLPデータセット、定性的な評価に基づく実験結果をまとめている。
要旨:
- InstructGPTはGPT-3よりも一貫して優れた結果を示しており、特に指示に従う能力や事実の捏造（ハルシネーション）の少なさが評価されている。
- PPOを使用したInstructGPTの学習は、ラベラーによる好みの評価において、GPT-3や他の比較モデルよりも高い評価を得ている。
- 公開NLPデータセットはAPIプロンプト分布に対しては不十分であり、FLANやT0と比較してもInstructGPTが優れている。
- InstructGPTは、訓練データを提供していないラベラーからも高評価を得ており、モデルの汎化性能が確認されている。

② 具体的な処理方法の記載

APIプロンプト分布の評価: GPT-3およびInstructGPTモデルに対して、ラベラーが好む出力を評価した。結果は、プロンプトごとにInstructGPTモデルの出力が85 ± 3%の割合でGPT-3よりも好まれ、GPT-3のfew-shotプロンプトと比較しても71 ± 4%の割合でInstructGPTが好まれた（Figure 1）。
モデルの汎化性能の検証: 5-foldクロスバリデーションを使用し、異なるラベラーグループの好みを予測するための報酬モデル（RM）を訓練。精度は、訓練データセットで72.4 ± 0.4%、訓練データを提供していないラベラーに対しては69.6 ± 0.9%であった。
公開NLPデータセットの評価: FLAN（Wei et al., 2021）とT0（Sanh et al., 2021）のデータセットでファインチューニングした175B GPT-3モデルとInstructGPTを比較。InstructGPTはFLANおよびT0モデルに対して、それぞれ78 ± 4%、79 ± 4%の割合で好まれた（Figure 5）。

③ 改良点・工夫点・テクニック等の記載

PPO（Proximal Policy Optimization）の導入: InstructGPTは、事前学習後に比較データを使用してPPOによる学習を行い、ラベラーの好みを反映したモデル出力を生成する。これにより、従来のGPT-3よりも出力の質が向上している。
APIプロンプト分布に特化した学習: InstructGPTはAPIプロンプト分布に基づいて学習されており、FLANやT0などの公開データセットよりも実際の使用ケースに適した結果を生成することができる。

④ メカニズムの解明・なぜそうなるのか等の記載

PPOの効果: PPOは報酬に基づく学習手法であり、InstructGPTはこの手法を用いてラベラーの好みに合った出力を生成する能力を高めている。特に、指示に従う能力や事実の捏造が少ないことが評価されている。従って、PPOによるモデルの訓練は、従来の単純なファインチューニングやfew-shotプロンプトと比べて、ラベラーによる評価を向上させている。
汎化性能の確認: 5-foldクロスバリデーションの結果、InstructGPTは異なるラベラーグループに対しても一貫した評価を得ており、モデルが特定の訓練データに過剰適合していないことが示されている。

⑤ 性能が向上した記載

InstructGPT vs GPT-3: InstructGPTモデルはGPT-3よりも85 ± 3%の割合でラベラーに好まれた。また、few-shotプロンプトを使用したGPT-3と比較しても71 ± 4%の割合でInstructGPTが好まれた（Figure 1）。
InstructGPT vs FLAN/T0: 175B InstructGPTモデルは、FLANおよびT0モデルに対して、それぞれ78 ± 4%、79 ± 4%の割合でラベラーに好まれた（Figure 5）。

▼生成AIの回答(ChatGPT-4o)
⑥ 比較の記載があれば違いを表でまとめて下さい。

モデル	GPT-3	GPT-3 (few-shot prompt)	InstructGPT (PPO)	InstructGPT (PPO-ptx)	FLAN	T0
ラベラーによる好み	最も低い	GPT-3より改善	85 ± 3% GPT-3より好まれる	85 ± 3% GPT-3より好まれるが変化少ない	78 ± 4% InstructGPTより低い	79 ± 4% InstructGPTより低い
指示の遵守	低い	低い	高い	高いが変化少ない	中程度	中程度
ハルシネーションの頻度	多い	やや少ない	少ない	少ないが変化少ない	中程度	中程度

図/表番号: Figure 1, Figure 3, Figure 4, Figure 5

⑦ 上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。

メトリック	GPT-3	InstructGPT (PPO)	InstructGPT (PPO-ptx)
Open-ended generation	低い	高い	高いが変化少ない
Brainstorming	低い	高い	高いが変化少ない
Classification/QA対応	高い	低い	低い
指示に従う能力	低い	高い	高い

図/表番号: Figure 4, Table 1

⑧ 具体的な数値の記載を全て列挙して、表にまとめて下さい。

数値	説明
85 ± 3%	InstructGPTがGPT-3よりも好まれる割合 (Figure 1)
71 ± 4%	InstructGPTがfew-shot GPT-3よりも好まれる割合 (Figure 1)
69.6 ± 0.9%	報酬モデルのheld-outラベラーグループに対する精度 (Figure 3)
72.4 ± 0.4%	報酬モデルの訓練ラベラーグループに対する精度 (Figure 3)
78 ± 4%	InstructGPTがFLANよりも好まれる割合 (Figure 5)
79 ± 4%	InstructGPTがT0よりも好まれる割合 (Figure 5)
18%	APIプロンプトで分類/QAタスクの割合 (Table 1)
57%	APIプロンプトでオープンエンド生成/ブレインストーミングの割合 (Table 1)

⑨ 具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。

変数名	説明
PPO	Proximal Policy Optimization、報酬に基づく強化学習アルゴリズム
RM	Reward Model、報酬モデル
SFT	Supervised Fine-Tuning、教師ありファインチューニング

⑩ 図/表があれば、各図/表は何を主張するためのものかを説明下さい。

Figure 1: InstructGPTとGPT-3の出力に対するラベラーの好みの比較を示しており、InstructGPTの優位性を強調している。
Figure 3: 報酬モデルの精度および、訓練データセットとheld-outラベラーに対するモデルの評価を示しており、モデルの汎化性能を評価している。
Figure 4: InstructGPTとGPT-3の指示に従う能力やハルシネーションの頻度などの具体的な評価軸における比較結果を示している。
Figure 5: FLANおよびT0とInstructGPTの比較に基づくLikertスコアを示しており、InstructGPTの方が好まれることを示している。
Table 1: APIプロンプトにおけるタスクの分布を示し、公開NLPデータセットが必ずしも多様性のあるプロンプト分布を反映していないことを強調している。

4.2 Results on public NLP datasets

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 4.2 Results on public NLP datasets InstructGPT models show improvements in truthfulness over GPT-3. As measured by human evaluatoins on the TruthfulQA dataset, our PPO models show small but significant improvements in generating truthful and informative outputs compared to GPT-3 (see Figure 6). This behavior is the default: our models do not have to be specifically instructed to tell the truth to exhibit improved truthfulness. Interestingly, the exception is our 1.3B PPO-ptx model, which performs slightly worse than a GPT-3 model of the same size. When evaluated only on prompts that were not adversarially selected against GPT-3, our PPO models are still significantly more truthful and informative than GPT-3 (although the absolute improvement decreases by a couple of percentage points. Following Lin et al. (2021), we also give a helpful “Instruction+QA” prompt that instructs the model to respond with “I have no comment” when it is not certain of the correct answer. In this case, our PPO models err on the side of being truthful and uninformative rather than confidently saying a falsehood; the baseline GPT-3 model aren’t as good at this. Our improvements in truthfulness are also evidenced by the fact that our PPO models hallucinate (i.e. fabricate information) less often on closed-domain tasks from our API distribution, which we’ve shown in Figure 4. InstructGPT shows small improvements in toxicity over GPT-3, but not bias. We first evaluate our models on the RealToxicityPrompts dataset (Gehman et al., 2020). We do this in two ways: we run model samples through the Perspective API8 to obtain automatic toxicity scores, which is the standard evaluation procedure for this dataset, and we also send these samples to labelers to obtain ratings on absolute toxicity, toxicity relative to the prompt, continuity, and overall output preference. We sample prompts from this dataset uniformly according to prompt toxicity to better assess how our models perform with high input toxicity (see Figure 39 in Appendix E); this differs from the standard prompt sampling for this dataset, and thus our absolute toxicity numbers are inflated. Our results are in Figure 7. We find that, when instructed to produce a safe and respectful output (“respectful prompt”), InstructGPT models generate less toxic outputs than those from GPT-3 according to the Perspective API. This advantage disappears when the respectful prompt is removed (“no prompt”). Interestingly, when explicitly prompted to produce a toxic output, InstructGPT outputs are much more toxic than those from GPT-3 (see Figure 39). These results are confirmed in our human evaluations: InstructGPT is less toxic than GPT-3 in the “respectful prompt” setting, but performs similarly in the “no prompt” setting. We provide extended results in Appendix E. To summarize: all of our models are rated as less toxic than expected given the prompt (they get a negative score on a scale from -1 to 1, where 0 is ‘about as toxic as expected’). Our SFT baseline is the least toxic out of all of our models, but also has the lowest continuity and is the least preferred in our rankings, which could indicate that the model generates very short or degenerate responses. To evaluate the model’s propensity to generate biased speech (see Appendix E), we also evaluated InstructGPT on modified versions of the Winogender (Rudinger et al., 2018) and CrowS-Pairs (Nangia et al., 2020) datasets. These datasets consists of pairs of sentences which can highlight potential bias. We calculate the relative probabilities of producing the sentences in each pair and the entropy (in bits) of the associated binary probability distributions. Perfectly unbiased models will have no preference between the sentences in each pair and will therefore have maximum entropy. By this metric, our models are not less biased than GPT-3. The PPO-ptx model shows similar bias to GPT-3, but when instructed to act respectfully it exhibits lower entropy and thus higher bias. The pattern of the bias is not clear; it appears that the instructed models are more certain of their outputs regardless of whether or not their outputs exhibit stereotypical behavior. We can minimize performance regressions on public NLP datasets by modifying our RLHF fine-tuning procedure. By default, when we train a PPO model on our API distribution, it suffers from an “alignment tax”, as its performance on several public NLP datasets decreases. We want an alignment procedure that avoids an alignment tax, because it incentivizes the use of models that are unaligned but more capable on these tasks. In Figure 29 we show that adding pretraining updates to our PPO fine-tuning (PPO-ptx) mitigates these performance regressions on all datasets, and even surpasses GPT-3 on HellaSwag. The performance of the PPO-ptx model still lags behind GPT-3 on DROP, SQuADv2, and translation; more work is needed to study and further eliminate these performance regressions. Mixing in pretraining updates performs better than the simpler solution of increasing the KL co- efficient. In Figure 33, we show that there is a value of the pretraining mix coefficient that both reverses the performance regressions on SQuADv2 and DROP (the datasets we used for testing), and has minimal reductions in validation reward. In contrast, increasing the KL coefficient (Figure 34) leads to significant decreases in validation reward and never fully recovers on DROP and SQuAD. Changing the KL model from the PPO init to GPT-3 gives similar results.

Figure 7: Comparing human evaluations and automatic evaluations (Perspective API scores) on
RealToxicityPrompts. A total of 1,729 prompts were labeled for three different 175B models, both
with and without "respectful" instructions. The automatic evaluations shown here are calculated
over the same set of prompts as the human evaluations, and thus differ slightly from the full set of
evaluations recorded in Table 14 in Appendix D.

Figure 8: Examples of generalization in the 175B PPO-ptx model (InstructGPT 175B) compared to
GPT-3 175B with no additional prefixing. Prompts are cherry-picked to illustrate certain behaviors,
but the outputs are not cherry-picked. (1) InstructGPT can follow instructions in other languages,
though it sometimes generates outputs in English. GPT-3 requires more careful prompting, similarly
to in English. (2) InstructGPT can summarize and answer questions about code more reliably than
GPT-3 (though its answer here isn’t quite correct). For the code QA example, GPT-3 does answer the
question about 50% of the time.

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

InstructGPTモデルとGPT-3モデルの比較を通じて、公共のNLPデータセットにおける真実性、毒性、バイアス、パフォーマンス向上の評価についての結果が述べられている。

要旨は以下の通り:

真実性の向上: InstructGPTモデルはGPT-3と比較してTruthfulQAデータセットでの真実性が向上しているが、1.3BパラメータのPPO-ptxモデルはわずかに悪化している。
毒性の軽減: RealToxicityPromptsデータセットでの評価において、InstructGPTモデルは「respectful prompt」に従うとGPT-3よりも毒性が低いが、毒性を指示すると逆に高くなる。
バイアスの評価: InstructGPTはバイアスについてはGPT-3と同等であり、場合によってはバイアスが強化される可能性がある。
パフォーマンスの改善: Pretraining updatesを用いることで、PPO-ptxモデルはGPT-3を上回る性能を示し、特にHellaSwagでの改善が確認されるが、DROPやSQuADv2、翻訳タスクでは依然としてGPT-3に劣る。

② 具体的な処理方法の記載があれば説明下さい。

TruthfulQAデータセットでの評価: PPOモデルはTruthfulQAデータセット上でGPT-3よりも真実性と情報性が改善されている。特に、「I have no comment」を回答するようにモデルに指示するInstruction+QAプロンプトが用いられ、誤情報を避ける方向で回答が生成される (Lin et al., 2021)。
RealToxicityPromptsデータセットでの評価: Perspective APIを使用し、毒性スコアを自動計算。モデルが生成したサンプルはラベラーによっても評価され、プロンプトに対する毒性、全体的な出力の好ましさなどが評価される。
バイアスの評価: WinogenderとCrowS-Pairsデータセットを使用し、文ペアの生成確率とバイナリ確率分布のエントロピーを計算する。理想的なモデルはエントロピーが最大であり、文ペア間に偏りがないことが求められる。
PPO-ptxモデルのファインチューニング: Pretraining updatesをRLHFのファインチューニングプロセスに追加することで、NLPデータセットでのパフォーマンスを維持・向上させる。KL係数の調整は代替手法として検討されるが、最適ではない。

③ 改良点・工夫点・テクニック等の記載があれば説明下さい。

PPOモデルの真実性向上: TruthfulQAデータセット上で、特定のプロンプト指示なしでもPPOモデルは真実性が向上する。ただし、1.3BパラメータのPPO-ptxモデルはGPT-3と比べて真実性が低下するため、この点で改善が必要。
Pretraining updatesの導入: RLHFファインチューニングにおけるPPOモデルのパフォーマンス低下（alignment tax）を最小限に抑えるため、Pretraining updatesを導入し、パフォーマンスの低下を回避している。
Instruction+QAプロンプト: モデルが誤った情報を提供しないようにするため、「I have no comment」を返答するInstruction+QAプロンプトを使用する。これにより、誤情報の生成が減少し、真実性が向上する。

④ メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。

真実性の向上のメカニズム: PPOモデルは、真実性を高めるために特定の指示がなくても情報性が向上している。これは、報酬モデルによる強化学習が真実性に対するバイアスをかけるためである。一方で、1.3BパラメータのPPO-ptxモデルがGPT-3よりも真実性で劣る点は、学習プロセスの違いによるものと考えられる。
毒性の軽減のメカニズム: InstructGPTは「respectful prompt」に従う場合には毒性が軽減されるが、明示的に毒性を指示された場合は逆に高くなる。これは、モデルが指示に忠実であり、指示に応じて生成する内容を変更できるためである。

⑤ 性能が向上した記載があれば説明下さい。

TruthfulQAデータセットでの真実性の向上: InstructGPTのPPOモデルはGPT-3と比較してわずかではあるが、真実性と情報性が向上している (Figure 6)。
毒性の軽減: Perspective APIによる評価およびラベラーの評価で、InstructGPTモデルは「respectful prompt」に従った場合にGPT-3よりも毒性が低くなる (Figure 7)。
HellaSwagでの性能改善: PPO-ptxモデルはHellaSwagデータセットでGPT-3を上回るパフォーマンスを示している (Figure 29)。
SQuADv2やDROPでの改善: Pretraining mix coefficientの調整により、SQuADv2およびDROPデータセットでのパフォーマンス低下が改善された (Figure 33)。

▼生成AIの回答(ChatGPT-4o)
⑥ 比較の記載があれば違いを表でまとめて下さい。

比較項目	InstructGPT	GPT-3	図/表番号
真実性 (TruthfulQA)	TruthfulQAでGPT-3よりわずかに改善	基準性能	図6
毒性 (RealToxicityPrompts)	Respectful promptでは低毒性だが、毒性指示時はGPT-3より高い	毒性指示時の出力は低い	図7、図39
バイアス (Winogender, CrowS-Pairs)	GPT-3と同様かやや悪化。指示時にバイアスが強化される傾向	バイアスは比較的安定	Appendix E
HellaSwagでの性能	GPT-3よりも優れた性能を発揮	基準性能	図29
DROP, SQuADv2での性能	GPT-3よりも低い	基準性能	図29, 図33, 図34

⑦ 上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。

比較項目	Pretraining updatesを使用	KL coefficientを増加	図/表番号
SQuADv2, DROPでの改善	パフォーマンス低下を軽減し、改善	パフォーマンスは回復せず低下	図33, 図34
バリデーション報酬	バリデーション報酬の減少は少ない	バリデーション報酬の減少が大きい	図33, 図34

⑧ 具体的な数値の記載を全て列挙して、表にまとめて下さい。

数値	説明
1.3B	PPO-ptxモデルのパラメータ数
175B	InstructGPTおよびGPT-3モデルのパラメータ数
数％	Non-adversarial promptsでのGPT-3に対するTruthfulQAでの絶対改善度
-1 ～ 1 (毒性スコア)	RealToxicityPromptsでの出力がプロンプトに対してどの程度予想外に毒性があるかの評価スケール
1,729	RealToxicityPromptsにおけるプロンプト数
約50%	GPT-3がコードQAタスクで正答した割合

⑨ 具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。

変数名	説明
KL coefficient	PPOの強化学習におけるKLダイバージェンスの重み係数
Pretraining updates	PPO-ptxモデルにおける事前学習の更新
Entropy (エントロピー)	WinogenderやCrowS-Pairsデータセットでのバイアス評価指標。最大エントロピーでバイアスが無い

⑩ 図/表があれば、各図/表は何を主張するためのものかを説明下さい。

図6: TruthfulQAデータセットにおけるInstructGPTとGPT-3の真実性の比較結果。InstructGPTがわずかに優れていることを示す。
図7: RealToxicityPromptsデータセットでのInstructGPTとGPT-3の毒性比較。Respectful promptではInstructGPTが毒性を低く抑えられることを示す。
図29: HellaSwagデータセットにおけるPPO-ptxモデルとGPT-3のパフォーマンス比較。PPO-ptxが優れていることを示す。
図33: Pretraining mix coefficientがSQuADv2およびDROPでのパフォーマンス低下を逆転させることを示す。
図34: KL coefficientを増加させた場合のSQuADv2およびDROPでのパフォーマンス変動を示す。

4.3 Qualitative results

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 4.3 Qualitative results InstructGPT models show promising generalization to instructions outside of the RLHF fine- tuning distribution. In particular, we find that InstructGPT shows ability to follow instructions in non-English languages, and perform summarization and question-answering for code. This is interesting because non-English languages and code form a tiny minority of our fine-tuning data,9 and it suggests that, in some cases, alignment methods could generalize to producing the desired behavior on inputs that humans did not directly supervise. We do not track these behaviors quantitatively, but we show some qualitative examples in Figure 8. Our 175B PPO-ptx model is able to reliably answers questions about code, and can also follow instructions in other languages; however, we notice that it often produces an output in English even when the instruction is in another language. In comparison, we find that GPT-3 can perform these tasks but requires more careful prompting, and rarely follows instructions in these domains. InstructGPT still makes simple mistakes. In interacting with our 175B PPO-ptx model, we have noticed it can still make simple mistakes, despite its strong performance on many different language tasks. To give a few examples: (1) when given an instruction with a false premise, the model sometimes incorrectly assumes the premise is true, (2) the model can overly hedge; when given a simple question, it can sometimes say that there is no one answer to the question and give multiple possible answers, even when there is one fairly clear answer from the context, and (3) the model’s performance degrades when instructions contain multiple explicit constraints (e.g. “list 10 movies made in the 1930’s set in France”) or when constraints can be challenging for language models (e.g. writing a summary in a specified number of sentences). We show some examples of these behaviors in Figure 9. We suspect that behavior (2) emerges partly because we instruct labelers to reward epistemic humility; thus, they may tend to reward outputs that hedge, and this gets picked up by our reward model. We suspect that behavior (1) occurs because there are few prompts in the training set that assume false premises, and our models don’t generalize well to these examples. We believe both these behaviors could be dramatically reduced with adversarial data collection (Dinan et al., 2019b).

Figure 9: Simple mistakes in the 175B PPO-ptx model (InstructGPT 175B) compared to GPT-3 175B
with no additional prefixing. Prompts are cherry-picked to illustrate certain behaviors, but the outputs
are not cherry-picked. (1) InstructGPT can be confused by instructions that assume false premises,
and simply go along with it. (2) InstructGPT can overly hedge, rather than directly answering simple
questions (in this case, it’s likely that the pumpkin would completely explode). Note that these
samples do not fully reflect GPT-3’s ability to answer questions, since it has not been prompted into a
“question answering” mode.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

記載内容: InstructGPTの定性的な結果に関する説明である。特に、非英語の指示やコードに関する質問応答タスクに対するモデルの一般化能力について記載している。また、InstructGPTが犯す単純なミスについても触れている。
要旨:
- InstructGPTは、RLHFでチューニングされていない指示に対しても一般化する能力を示す。
- 非英語言語やコードに関する質問応答で有望な結果を示すが、出力は英語になる傾向がある。
- GPT-3もこれらのタスクを実行可能だが、より慎重なプロンプト設定が必要である。
- InstructGPTは単純な質問でも複数の答えを提示するなどのミスを犯すことがある。
- これらの問題は、訓練データに不足している部分やエピステミック・ヒューミリティ（知識の謙遜）を重視する指示によるものと考えられる。

②具体的な処理方法の記載があれば説明下さい。

InstructGPTのトレーニングには、RLHF (Reinforcement Learning from Human Feedback) が使用されており、これにより人間のフィードバックを通じてモデルの挙動を調整している。
モデルは、英語以外の言語やコードに関するタスクでも指示に従うことが可能だが、出力は英語になる傾向がある。
プロンプト設定には、より慎重な調整が必要であり、特にGPT-3では適切なタスク実行のために工夫が必要である。

③改良点・工夫点・テクニック等の記載があれば説明下さい。

InstructGPTは、RLHFに基づいてトレーニングされているため、英語以外の言語やコードに関するタスクにも対応できるようになっている。これは、訓練データセットに含まれる非英語言語やコードが少数であるにもかかわらず、一般化能力を示すものである。
InstructGPTは、非英語の言語指示に対しても一定の適応力を持ち、コード質問応答タスクにおいても高い精度を発揮するが、慎重なプロンプトが必要なGPT-3よりも使いやすいとされる。

④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。

エピステミック・ヒューミリティ（知識の謙遜）を重視する指示が、モデルの回答に対して慎重な態度を生む可能性が高い。この結果、モデルが単純な質問に対しても複数の答えを提示するようになると考えられる。
モデルが誤った前提を受け入れてしまう理由として、訓練データセット内で誤った前提を前提とするプロンプトが少ないため、そのようなプロンプトに対して適切に応答できない可能性がある。

⑤性能が向上した記載があれば説明下さい。

非英語言語やコードに関する指示の追従能力が向上しているが、これは数値的に定量化されておらず、定性的な観察に基づくものである。
InstructGPTは、GPT-3と比較してこれらのタスクでより高い性能を発揮している (図8)。

⑥比較の記載があれば違いを表でまとめて下さい。

比較項目	InstructGPT 175B	GPT-3 175B	図/表番号
非英語言語での指示追従	指示に従うが出力が英語になる傾向あり	指示に従うが慎重なプロンプト設定が必要	図8
コードの質問応答	高い正答率	プロンプトが慎重でないと誤答しやすい	図8
誤った前提の質問への応答	誤りを受け入れてしまうことがある	より明確に間違いを指摘することがある	図9
複数の制約がある指示	性能が低下する	より柔軟に対応することがある	図9

⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。

特徴	InstructGPT 175Bの問題点	図/表番号
複数の明確な制約を含む指示に対する対応	制約が多い指示に対して性能が低下する	図9
誤った前提を含む質問への対応	誤った前提をそのまま受け入れてしまうことがある	図9
複数の答えを提示する傾向	エピステミック・ヒューミリティの強調により、簡単な質問でも過度に慎重な回答をすることがある	図9

⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。

数値	説明
175B	InstructGPTおよびGPT-3モデルのパラメータ数
少数 (tiny minority)	訓練データセット内の非英語言語およびコードの割合

⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。

変数名	説明
RLHF	強化学習による人間のフィードバックを活用したトレーニング方法
Epistemic humility	エピステミック・ヒューミリティ。知識の不確実性を認める慎重な態度を指示する概念

⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。

図8: InstructGPTとGPT-3の比較例を示しており、特に非英語言語やコードに関する指示に対する一般化能力を示している。InstructGPTがこれらのタスクでより高い性能を発揮することを強調している。
図9: InstructGPTの誤った前提や複数の制約を含む指示に対する応答例を示し、特定のシナリオでのモデルの弱点や挙動を可視化している。

▼質問(人間)
InstructGPTが犯す単純なミス・性能が劣化する場合についてまとめて下さい。

▼生成AIの回答(ChatGPT-4o)
InstructGPTが犯す単純なミスや性能が劣化する場合について、以下にまとめる。

InstructGPTが犯す単純なミス

誤った前提を含む質問に対する応答:
- 誤った前提を含む指示に対して、モデルがその前提をそのまま受け入れ、誤答してしまうことがある。これは、訓練データに誤った前提を前提とするプロンプトが少ないことが原因と考えられる (図9)。
過度な慎重さ（過剰な保留）:
- 単純な質問に対しても、複数の可能性を提示するなど、過度に慎重な応答をする場合がある。これは、エピステミック・ヒューミリティ（知識の謙遜）を重視するフィードバックの影響で、曖昧な回答が報奨されやすいことが原因である (図9)。

性能が劣化する場合

複数の明確な制約を含む指示:
- 指示に多くの明確な制約が含まれると、InstructGPTの性能が劣化する。例えば、「1930年代にフランスで制作された映画を10本挙げる」などの複数の制約がある指示に対して、適切な応答を生成できない場合がある (図9)。
複雑な形式の制約がある場合:
- 言語モデルにとって難しい制約（例えば「指定された文章数で要約する」など）を含む指示に対しても、性能が低下することが観察されている。これにより、指定された要件を満たすことができず、適切な応答を生成できない (図9)。

これらの問題点は、特定のデータが訓練セットに不足しているため、または報酬モデルの設計上の影響で生じていると考えられ、今後の改良の余地があるとされている。

5 Discussion

5.1 Implications for alignment research

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 5 Discussion 5.1 Implications for alignment research This research is part of our broader research program to align AI systems with human intentions (Chris- tiano et al., 2017; Ziegler et al., 2019; Stiennon et al., 2020). Even though this work focuses on our current language model systems, we seek general and scalable methods that work for future AI systems (Leike et al., 2018). The systems we work with here are still fairly limited, but they are among the largest language models today and we apply them on a wide range of language tasks, including classification, summarization, question-answering, creative writing, dialogue, and others. Our approach to alignment research in this work is iterative: we are improving the alignment of current AI systems instead of focusing abstractly on aligning AI systems that don’t yet exist. A disadvantage of this approach is that we are not directly facing alignment problems that occur only when aligning superhuman systems (Bostrom, 2014). However, our approach does provides us with a clear empirical feedback loop of what works and what does not. We believe that this feedback loop is essential to refine our alignment techniques, and it forces us to keep pace with progress in machine learning. Moreover, the alignment technique we use here, RLHF, is an important building block in several proposals to align superhuman systems (Leike et al., 2018; Irving et al., 2018; Christiano et al., 2018). For example, RLHF was a central method in recent work on summarizing books, a task that exhibits some of the difficulties of aligning superhuman AI systems as it is difficult for humans to evaluate directly (Wu et al., 2021). From this work, we can draw lessons for alignment research more generally: 1. The cost of increasing model alignment is modest relative to pretraining. The cost of collecting our data and the compute for training runs, including experimental runs is a fraction of what was spent to train GPT-3: training our 175B SFT model requires 4.9 petaflops/s-days and training our 175B PPO-ptx model requires 60 petaflops/s-days, compared to 3,640 petaflops/s-days for GPT-3 (Brown et al., 2020). At the same time, our results show that RLHF is very effective at making language models more helpful to users, more so than a 100x model size increase. This suggests that right now increasing investments in alignment of existing language models is more cost-effective than training larger models—at least for our customers’ natural language task distribution. 2. We’ve seen some evidence that InstructGPT generalizes ‘following instructions’ to settings that we don’t supervise it in, for example on non-English language tasks and code-related tasks. This is an important property because it’s prohibitively expensive to have humans supervise models on every task they perform. More research is needed to study how well this generalization scales with increased capabilities; see Christiano et al. (2021) for recent research in this direction. 3. We were able to mitigate most of the performance degradations introduced by our fine-tuning. If this was not the case, these performance degradations would constitute an alignment tax—an additional cost for aligning the model. Any technique with a high tax might not see adoption. To avoid incentives for future highly capable AI systems to remain unaligned with human intent, there is a need for alignment techniques that have low alignment tax. To this end, our results are good news for RLHF as a low-tax alignment technique. 4. We’ve validated alignment techniques from research in the real world. Alignment research has historically been rather abstract, focusing on either theoretical results (Soares et al., 2015), small synthetic domains (Christiano et al., 2018; Leike et al., 2017), or training ML models on public NLP datasets (Ziegler et al., 2019; Stiennon et al., 2020). Our work provides grounding for alignment research in AI systems that are being used in production in the real world with customers.10 This enables an important feedback loop on the techniques’ effectiveness and limitations.

▼生成AIの回答(ChatGPT-4o)
5.1 Implications for alignment research

① 何についての記載か? + 要旨は何?

記載内容: 本章は、AIシステムの人間の意図への整合性（アライメント）に関する研究の影響を論じている。具体的には、現在の言語モデルのアライメント技術（特にRLHF）の有効性や課題、将来のAIシステムへの応用可能性について述べている。
要旨:
- 本研究は、AIシステムを人間の意図に整合させるためのアプローチであり、特に現行の言語モデルに焦点を当てている。
- RLHF（強化学習による人間のフィードバック）は、既存のモデルにおいて効果的なアライメント技術であり、将来的な超人的AIシステムへの適用も期待されている。
- アライメント技術のコストは、GPT-3の訓練に比べて非常に低く、現在のモデルの整合性向上は、モデルサイズを100倍にするよりもコスト効率が良い。
- InstructGPTは、監督されていないタスク（非英語やコードに関連するタスク）に対しても、指示に従う能力を一般化できる兆候が見られる。
- 本研究は、アライメント技術が現実世界のシステムにおいても有効であることを示している。

② 具体的な処理方法の記載があれば説明下さい。

アライメント技術には主にRLHFが使用されており、これは人間のフィードバックをもとに強化学習を行う手法である。RLHFは、従来のNLPタスク（Ziegler et al., 2019; Stiennon et al., 2020）や要約タスク（Wu et al., 2021）で成功を収めており、本研究でもその有効性が実証されている。

③ 改良点・工夫点・テクニック等の記載があれば説明下さい。

アライメント技術の向上: アライメント技術を現行のAIシステムに適用し、フィードバックループを形成することで、技術の有効性を確認し、進化させる。
コスト効率の向上: モデルサイズの100倍増加よりも、アライメントの改善に投資する方が現在の自然言語タスクにおいてコスト効率が高いことが示されている。
RLHFの活用: 人間のフィードバックを用いた強化学習を通じて、よりユーザーに有用なモデルを構築し、かつパフォーマンスの劣化を最小限に抑える。

④ メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。

RLHFを活用することで、ユーザーに有用な応答を生成する能力が強化され、モデルが人間の意図に整合するように調整される。これにより、フィードバックループが形成され、アライメント技術が進化し続けることができる。また、フィードバックによる「低アライメント税」（低コストでの整合性向上）は、将来的なAIシステムでも重要な要素となる。

⑤ 性能が向上した記載があれば説明下さい。

RLHFを用いたモデルは、GPT-3のようにモデルサイズを増加させるよりも効率的に性能を向上させることができる。特に、GPT-3に比べて4.9ペタフロップス/秒-日（SFTモデル）および60ペタフロップス/秒-日（PPO-ptxモデル）という低コストでトレーニングできる点が強調されている【図9】。

⑥ 比較の記載があれば違いを表でまとめて下さい。

モデル	訓練コスト (ペタフロップス/秒-日)	主な特徴
GPT-3	3,640	大規模モデルによる性能向上
SFT 175B	4.9	小規模なデータでのアライメント強化
PPO-ptx 175B	60	アライメントに特化したトレーニング

⑦ 上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。

無し

⑧ 具体的な数値の記載を全て列挙して、表にまとめて下さい。

数値	説明
4.9ペタフロップス/秒-日	SFT 175Bモデルのトレーニングに必要な計算リソース
60ペタフロップス/秒-日	PPO-ptx 175Bモデルのトレーニングに必要な計算リソース
3,640ペタフロップス/秒-日	GPT-3のトレーニングに必要な計算リソース
100倍	モデルサイズの増加に比べて、アライメント改善の方が効率的

⑨ 具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。

無し

⑩ 図/表があれば、各図/表は何を主張するためのものかを説明下さい。

図9: GPT-3と比較して、InstructGPTのモデルが低コストでアライメント改善を達成できることを示している。特に、訓練コストにおける違いとアライメント技術の有効性を強調している点が注目ポイントである。

5.2 Who are we aligning to?

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 5.2 Who are we aligning to? When aligning language models with human intentions, their end behavior is a function of the underlying model (and its training data), the fine-tuning data, and the alignment method used. In this section, we describe a number of factors that influence the fine-tuning data specifically, to ultimately determine what and who we’re aligning to. We then consider areas for improvement before a larger discussion of the limitations of our work in Section 5.3. The literature often frames alignment using such terms as “human preferences” or “human values.” In this work, we have aligned to a set of labelers’ preferences that were influenced, among others things, by the instructions they were given, the context in which they received them (as a paid job), and who they received them from. Some crucial caveats apply: First, we are aligning to demonstrations and preferences provided by our training labelers, who directly produce the data that we use to fine-tune our models. We describe our labeler hiring process and demographics in Appendix B; in general, they are mostly English-speaking people living in the United States or Southeast Asia hired via Upwork or Scale AI. They disagree with each other on many examples; we found the inter-labeler agreement to be about 73%. Second, we are aligning to our preferences, as the researchers designing this study (and thus by proxy to our broader research organization, OpenAI): we write the labeling instructions that labelers use as a guide when writing demonstrations and choosing their preferred output, and we answer their questions about edge cases in a shared chat room. More study is needed on the exact effect of different instruction sets and interface designs on the data collected from labelers and its ultimate effect on model behavior. Third, our training data is determined by prompts sent by OpenAI customers to models on the OpenAI API Playground, and thus we are implicitly aligning to what customers think is valuable and, in some cases, what their end-users think is valuable to currently use the API for. Customers and their end users may disagree or customers may not be optimizing for end users’ well-being; for example, a customer may want a model that maximizes the amount of time a user spends on their platform, which is not necessarily what end-users want. In practice, our labelers don’t have visibility into the contexts in which a given prompt or completion will be seen. Fourth, OpenAI’s customers are not representative of all potential or current users of language models—let alone of all individuals and groups impacted by language model use. For most of the duration of this project, users of the OpenAI API were selected off of a waitlist. The initial seeds for this waitlist were OpenAI employees, biasing the ultimate group toward our own networks. Stepping back, there are many difficulties in designing an alignment process that is fair, transparent, and has suitable accountability mechanisms in place. The goal of this paper is to demonstrate that this alignment technique can align to an specific human reference group for a specific application. We are not claiming that researchers, the labelers we hired, or our API customers are the right source of preferences. There are many stakeholders to consider—the organization training the model, the customers using the model to develop products, the end users of these products, and the broader population who may be directly or indirectly affected. It is not only a matter of making the alignment process more participatory; it is impossible that one can train a system that is aligned to everyone’s preferences at once, or where everyone would endorse the tradeoffs. One path forward could be to train models that can be conditioned on the preferences of certain groups, or that can be easily fine-tuned or prompted to represent different groups. Different models can then be deployed and used by groups who endorse different values. However, these models might still end up affecting broader society and there are a lot of difficult decisions to be made relating to whose preferences to condition on, and how to ensure that all groups can be represented and can opt out of processes that may be harmful.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

何についての記載か: 言語モデルを人間の意図に沿って整合させる際の課題と、それに関与する異なる要素やグループについての記載。
要旨:
- 言語モデルの整合性はモデル自体、微調整データ、使用するアラインメント手法に依存する。
- 微調整データの提供者（ラベラー）や顧客の意図にモデルがどのように整合されているかが述べられている。
- ラベラーの個人的な意図や顧客のビジネス目標など、さまざまなバイアスが整合性に影響を与える可能性がある。
- すべての人々の意図や価値観に同時に整合することは不可能である。
- 特定のグループの意図や価値観に応じたモデルを訓練し、調整することが今後の方向性として提案されている。

②具体的な処理方法の記載があれば説明下さい。

微調整データは、ラベラーが提供するデモンストレーションや好みに基づいており、これらのラベラーは主に英語話者であり、アメリカや東南アジアの地域から採用されている（Appendix Bを参照）。
ラベラーの指示は、研究者（OpenAI）によって作成され、ラベラーはこれに従ってモデルの微調整を行う。
モデルの微調整データは、OpenAI APIの顧客から送信されたプロンプトに基づいており、顧客の意図やニーズに応じた調整が行われる。

③改良点・工夫点・テクニック等の記載があれば説明下さい。

現在の整合手法は特定の人間の参照グループに対して有効であるが、これが最適な整合先であるとは限らない。
将来的には、特定のグループの意図に応じて調整可能なモデルを訓練し、さまざまなグループの価値観に基づいたモデルを展開する可能性が考慮されている。
異なる価値観を持つグループに対して異なるモデルを提供することで、整合性の問題に対処するアイデアが提案されている。

④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。

整合性の問題が生じる理由は、ラベラーの個人的な好みや、顧客が期待するモデルの挙動に関する価値観の違いに起因している。
さまざまなグループの価値観や意図に基づいてモデルを整合させるためには、そのグループの意図に沿ったモデルを個別に作成する必要があるとされている。

⑤性能が向上した記載があれば説明下さい。

性能向上に関する具体的な記載はなし。

⑥比較の記載があれば違いを表でまとめて下さい。

比較対象	説明
ラベラーの意図	ラベラーは特定の指示や文脈に従って微調整データを提供している。
研究者の意図	研究者はラベラーに指示を与え、ラベラーのデータを用いてモデルを微調整する。
顧客の意図	顧客はAPIを介してモデルを利用し、特定のプロンプトや意図に応じた出力を期待している。

⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。

要素	説明
ラベラーの採用プロセス	主に英語話者であり、アメリカや東南アジアの地域から採用。
顧客のプロンプト	OpenAI APIを使用して顧客が送信するプロンプトが微調整データの基礎となる。
将来的な方向性	異なるグループに応じたモデルを訓練・調整する可能性がある。

⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。

数値	説明
73%	ラベラー間の一致率。

⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。

該当なし。

⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。

該当なし。

5.3 Limitations

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 5.3 Limitations Methodology. The behavior of our InstructGPT models is determined in part by the human feedback obtained from our contractors. Some of the labeling tasks rely on value judgments that may be impacted by the identity of our contractors, their beliefs, cultural backgrounds, and personal history. We hired about 40 contractors, guided by their performance on a screening test meant to judge how well they could identify and respond to sensitive prompts, and their agreement rate with researchers on a labeling task with detailed instructions (see Appendix B). We kept our team of contractors small because this facilitates high-bandwidth communication with a smaller set of contractors who are doing the task full-time. However, this group is clearly not representative of the full spectrum of people who will use and be affected by our deployed models. As a simple example, our labelers are primarily English-speaking and our data consists almost entirely of English instructions. There are also many ways in which we could improve our data collection set-up. For instance, most comparisons are only labeled by 1 contractor for cost reasons. Having examples labeled multiple times could help identify areas where our contractors disagree, and thus where a single model is unlikely to align to all of them. In cases of disagreement, aligning to the average labeler preference may not be desirable. For example, when generating text that disproportionately affects a minority group, we may want the preferences of labelers belonging to that group to be weighted more heavily. Models. Our models are neither fully aligned nor fully safe; they still generate toxic or biased outputs, make up facts, and generate sexual and violent content without explicit prompting. They can also fail to generate reasonable outputs on some inputs; we show some examples of this in Figure 9. Perhaps the greatest limitation of our models is that, in most cases, they follow the user’s instruction, even if that could lead to harm in the real world. For example, when given a prompt instructing the models to be maximally biased, InstructGPT generates more toxic outputs than equivalently-sized GPT-3 models. We discuss potential mitigations in the following sections.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

何についての記載か: InstructGPTモデルの限界とその影響、特に人間のフィードバックに基づくモデルの訓練方法やモデルの安全性についての課題が述べられている。
要旨:
- モデルの挙動は、ラベラー（契約者）から得られる人間のフィードバックに大きく依存しているが、ラベラーの文化的背景や信念が影響を与える可能性がある。
- コストやデータ収集方法に関する制約がモデルの整合性や公平性に影響している。
- 現在のモデルは、完全に安全でも整合されてもおらず、バイアスや有害な出力を生成することがある。
- 一部のプロンプトに対しては、GPT-3モデルよりも悪い挙動を示す場合がある。

②具体的な処理方法の記載があれば説明下さい。

ラベラーは40名程度で、スクリーニングテストによって選定され、敏感なプロンプトに適切に対応できるかや、研究者とのラベル付け作業の合意率に基づいて選ばれている。
ほとんどの比較はコストの関係で1人のラベラーによってラベル付けされており、これが意見の不一致を特定しにくくしている。
ラベル付けには平均的なラベラーの好みに整合させる方法が使われているが、少数派グループに対しては、異なる重み付けを行うことが検討されている。

③改良点・工夫点・テクニック等の記載があれば説明下さい。

コスト削減のため、ほとんどの比較が1人のラベラーによって行われているが、複数のラベラーにラベル付けを行わせることで、意見の不一致が発生する部分を特定しやすくすることが考えられている。
少数派グループに対しては、そのグループのラベラーの意見を重視することで、モデルのバイアス軽減が図られる可能性がある。

④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。

モデルの限界が生じる理由は、ラベラーの個人的な価値観や信念がフィードバックに影響を与えること、およびラベル付けの手法が一部の意見に偏っている可能性があるためである。
コスト削減のためのデータ収集方法が、モデルの多様性や整合性に影響を与えていると考えられる。

⑤性能が向上した記載があれば説明下さい。

性能向上に関する具体的な記載はなし。ただし、特定の少数派グループに対して重み付けを行うことで、モデルの出力が改善される可能性が示唆されている。

⑥比較の記載があれば違いを表でまとめて下さい。

比較対象	説明
GPT-3モデル	一部のプロンプトに対して、InstructGPTモデルよりも毒性の少ない出力を生成する場合がある。
InstructGPTモデル	ユーザーの指示に従いすぎることで、GPT-3モデルよりも毒性の高い出力を生成する可能性がある。

⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。

要素	説明
ラベラーの選定基準	スクリーニングテストと研究者との合意率に基づいて選定される。
データ収集方法	ほとんどの比較が1人のラベラーによってラベル付けされている。
モデルのリスク	モデルが毒性のある出力を生成したり、誤った情報を作り出す可能性がある。

⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。

数値	説明
40	雇用されたラベラーの人数。

⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。

該当なし。

⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。

図9: モデルがいくつかの入力に対して不適切な出力を生成する例が示されている。主張ポイントは、InstructGPTモデルの限界として、特定のプロンプトに対して適切な出力を生成できないことがある点。

5.4 Open questions

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 5.4 Open questions This work is a first step towards using alignment techniques to fine-tune language models to follow a wide range of instructions. There are many open questions to explore to further align language model behavior with what people actually want them to do. Many methods could be tried to further decrease the models’ propensity to generate toxic, biased, or otherwise harmful outputs. For example, one could use an adversarial set-up where labelers find the worst-case behaviors of the model, which are then labeled and added to the dataset (Dinan et al., 2019b). One could also combine our method with ways of filtering the pretraining data (Ngo et al., 2021), either for training the initial pretrained models, or for the data we use for our pretraining mix approach. Similarly, one could combine our approach with methods that improve models’ truthfulness, such as WebGPT (Nakano et al., 2021). In this work, if the user requests a potentially harmful or dishonest response, we allow our model to generate these outputs. Training our model to be harmless despite user instructions is important, but is also difficult because whether an output is harmful depends on the context in which it’s deployed; for example, it may be beneficial to use language models to generate toxic outputs as part of a data augmentation pipeline. Our techniques can also be applied to making models refuse certain user instructions, and we plan to explore this in subsequent iterations of this research. Getting models to do what we want is directly related to the steerability and controllability litera- ture (Dathathri et al., 2019; Krause et al., 2020). A promising future path is combining RLHF with other methods of steerability, for example using control codes (Keskar et al., 2019), or modifying the sampling procedure at inference time using a smaller model (Dathathri et al., 2019). While we mainly focus on RLHF, there are many other algorithms that could be used to train policies on our demonstration and comparison data to get even better results. For example, one could explore expert iteration (Anthony et al., 2017; Silver et al., 2017), or simpler behavior cloning methods that use a subset of the comparison data. One could also try constrained optimization approaches (Achiam et al., 2017) that maximize the score from a reward model conditioned on generating a small number of harmful behaviors. Comparisons are also not necessarily the most efficient way of providing an alignment signal. For example, we could have labelers edit model responses to make them better, or generate critiques of model responses in natural language. There is also a vast space of options for designing interfaces for labelers to provide feedback to language models; this is an interesting human-computer interaction problem. Our proposal for mitigating the alignment tax, by incorporating pretraining data into RLHF fine- tuning, does not completely mitigate performance regressions, and may make certain undesirable behaviors more likely for some tasks (if these behaviors are present in the pretraining data). This is an interesting area for further research. Another modification that would likely improve our method is to filter the pretraining mix data for toxic content (Ngo et al., 2021), or augment this data with synthetic instructions. As discussed in detail in Gabriel (2020), there are subtle differences between aligning to instructions, intentions, revealed preferences, ideal preferences, interests, and values. Gabriel (2020) advocate for a principle-based approach to alignment: in other words, for identifying “fair principles for alignment that receive reflective endorsement despite widespread variation in people’s moral beliefs.” In our paper we align to the inferred user intention for simplicity, but more research is required in this area. Indeed, one of the biggest open questions is how to design an alignment process that is transparent, that meaningfully represents the people impacted by the technology, and that synthesizes peoples’ values in a way that achieves broad consensus amongst many groups. We discuss some related considerations in Section 5.2.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

何についての記載か: 言語モデルを指示に従わせるための整合技術（alignment techniques）に関する今後の研究課題について。
要旨:
- 言語モデルが毒性やバイアスのある出力を生成しないようにするための追加的な方法の提案。
- WebGPTなどの他の技術と組み合わせて、モデルの真実性を高める方法の模索。
- RLHF（人間のフィードバックに基づく強化学習）以外の手法や、ラベル付けインターフェースの設計に関する提案。
- アラインメント課税（alignment tax）を軽減するためのデータ調整の必要性とさらなる研究の必要性。

②具体的な処理方法の記載があれば説明下さい。

Adversarialセットアップ: ラベラーがモデルの最悪の挙動を特定し、それをデータセットに追加することで有害な出力を減少させる手法が提案されている【Dinan et al., 2019b】。
フィルタリング方法: 事前学習データをフィルタリングし、有害なデータや毒性のあるコンテンツを除外する方法【Ngo et al., 2021】。
RLHFの拡張: 他の手法と組み合わせることで、モデルのステアラビリティ（操縦可能性）を改善する方法が提案されている【Dathathri et al., 2019; Krause et al., 2020】。

③改良点・工夫点・テクニック等の記載があれば説明下さい。

Adversarialセットアップの使用: モデルの最悪の挙動を特定し、それをデータセットに追加することで、モデルの毒性やバイアスのある出力を減らす試み。
WebGPTとの組み合わせ: モデルの真実性を高めるため、WebGPT【Nakano et al., 2021】の手法との併用を提案している。
フィードバックインターフェースの設計: ラベラーがモデルの出力を編集したり、自然言語で批評を生成するインターフェースの設計が検討されている。

④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。

モデルの毒性やバイアスのある出力は、事前学習データの中に含まれる有害なコンテンツが原因であると考えられている。そのため、データセットのフィルタリングや、悪い挙動を特定してデータセットに追加することが有効な対策とされている。

⑤性能が向上した記載があれば説明下さい。

性能向上に関する具体的な数値はなし。ただし、データセットのフィルタリングや有害な出力を減少させるためのAdversarialセットアップの導入により、性能向上が期待されている。

⑥比較の記載があれば違いを表でまとめて下さい。

手法	説明
Adversarialセットアップ	モデルの最悪の挙動を特定し、それをデータセットに追加することで、毒性やバイアスのある出力を減少させる。
WebGPT	モデルの真実性を向上させるための手法で、他の技術と組み合わせることで効果が期待される。
RLHF以外の手法	Behavior cloningや専門家の反復など、他の手法も検討されている。

⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。

要素	説明
アラインメント課税	事前学習データの組み込みによりパフォーマンス低下の可能性があるため、フィルタリングが推奨されている。
フィードバック方法	自然言語による批評や出力の編集など、多様なフィードバック手法が提案されている。
操縦可能性（Steerability）	RLHFと他の手法を組み合わせることで、モデルの挙動を制御しやすくする技術。

⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。

該当なし。

⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。

該当なし。

⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。

該当なし。

5.5 Broader impacts

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) 5.5 Broader impacts This work is motivated by our aim to increase the positive impact of large language models by training them to do what a given set of humans want them to do. By default, language models optimize the next word prediction objective, which is only a proxy for what we want these models to do. Our results indicate that our techniques hold promise for making language models more helpful, truthful, and harmless. In the longer term, alignment failures could lead to more severe consequences, particularly if these models are deployed in safety-critical situations. We expect that as model scaling continues, greater care has to be taken to ensure that they are aligned with human intentions (Bostrom, 2014). However, making language models better at following user intentions also makes them easier to misuse. It may be easier to use these models to generate convincing misinformation, or hateful or abusive content. Alignment techniques are not a panacea for resolving safety issues associated with large language models; rather, they should be used as one tool in a broader safety ecosystem. Aside from intentional misuse, there are many domains where large language models should be deployed only with great care, or not at all. Examples include high-stakes domains such as medical diagnoses, classifying people based on protected characteristics, determining eligibility for credit, employment, or hous- ing, generating political advertisements, and law enforcement. If these models are open-sourced, it becomes challenging to limit harmful applications in these and other domains without proper regulation. On the other hand, if large language model access is restricted to a few organizations with the resources required to train them, this excludes most people from access to cutting-edge ML technology. Another option is for an organization to own the end-to-end infrastructure of model deployment, and make it accessible via an API. This allows for the implementation of safety protocols like use case restriction (only allowing the model to be used for certain applications), monitoring for misuse and revoking access to those who misuse the system, and rate limiting to prevent the generation of large-scale misinformation. However, this can come at the cost of reduced transparency and increased centralization of power because it requires the API provider to make decisions on where to draw the line on each of these questions. Finally, as discussed in Section 5.2, the question of who these models are aligned to is extremely important, and will significantly affect whether the net impact of these models is positive or negative.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

何についての記載か: 言語モデルの整合技術（alignment techniques）がもたらす広範な影響についての議論。
要旨:
- 言語モデルを人間の意図に合わせて整合させることによる、ポジティブな影響とネガティブな影響のバランスに関する懸念。
- モデルがより人間の意図に従うことで、悪用（偽情報の生成、攻撃的なコンテンツの生成）が容易になる可能性。
- 整合技術が安全性に関する問題を解決する万能薬ではないこと。
- 医療診断や差別的な特徴に基づく分類などの高リスク分野では、言語モデルの慎重な運用が求められること。
- モデルの公開が限られた組織に集中することのリスクと、API経由でのアクセス提供による安全対策の導入の利点と欠点の議論。

②具体的な処理方法の記載があれば説明下さい。

整合技術の適用: 言語モデルを人間の意図に従うように調整するために、RLHF（人間のフィードバックに基づく強化学習）を使用するが、これがすべての安全性問題を解決するわけではないとされている。具体的な安全プロトコルとしては、API経由でのモデルアクセスを制限し、使用目的の制限や不正使用の監視、アクセス権の取り消し、レートリミットなどが挙げられている。

③改良点・工夫点・テクニック等の記載があれば説明下さい。

API経由のアクセス提供: 言語モデルへのアクセスを制限する方法として、API経由でのみ利用を許可し、特定の用途に限ることや、不正利用があればアクセスを取り消すといった手法が提案されている。これにより、大規模な偽情報の生成や悪用を防ぐことができるが、透明性の低下や中央集権化のリスクがある。

④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。

整合技術を適用することで、言語モデルはユーザーの指示により忠実に従うようになるが、それにより偽情報や攻撃的なコンテンツの生成が容易になる。したがって、APIベースのアクセス制限や使用目的の制限が有効な対策とされるが、これには透明性の低下や権力の集中といった副作用が伴う。

⑤性能が向上した記載があれば説明下さい。

言語モデルの整合技術を適用することで、モデルの出力がより有用で、真実であり、害を与えないものになることが期待される。しかし、具体的な性能向上に関する数値や図表は記載されていない。

⑥比較の記載があれば違いを表でまとめて下さい。

アプローチ	メリット	デメリット
オープンソース化	多くの人々が最先端の技術にアクセス可能	悪用が難しく、規制が難しい
API経由のアクセス	使用目的の制限、不正利用の監視、アクセス取り消し、レートリミットの導入が可能	透明性の低下、権力の集中

⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。

課題	説明
整合技術の限界	整合技術はすべての安全性問題を解決するわけではなく、他の安全手法と併用すべきである。
使用制限	言語モデルの運用には、高リスク分野では慎重さが求められる。

⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。

該当なし。

⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。

該当なし。

⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。

該当なし。

Appendix

同論文の末尾にあるAppendixについては、下記の別記事に記載しました。

あとがき

所感

機械学習についてある程度知っていないと、生成AIに質問しても、その回答内容だけでは理解できないと思います。生成AIの回答は、論文の内容をあまり変えずに、要点をそのまま掻い摘んで回答するような形になっています。(良い意味で論文の要点をそのまま抜き出した形)

分かったこと

分かったこと・注目した点を列挙すると:

GPT-3モデルに対して人間のフィードバックを用いたファインチューニングを行うことで、モデルの出力を改善し、ユーザーの期待により沿ったものにする。
人間の嗜好に基づく微調整: InstructGPTは、GPT-3と同じアーキテクチャを持つが、人間のフィードバックを基にした強化学習 (RLHF) によって微調整されており、これによりモデルがユーザーの指示により適切に従うようになっている。

1.3BパラメータのInstructGPTは、175BパラメータのGPT-3と比較して出力品質が高いと評価される。
InstructGPTは、真実性の向上と有害な出力の削減に成功しつつ、公開されているNLPデータセット上での性能の低下は最小限に抑えられている。

大規模言語モデルは、予測タスクに優れているが、ユーザーの指示に従うことや、真実性を保持し有害な出力を避けることには課題がある。

40名の契約社員がデータラベリングのために雇われ、スクリーニングテストでのパフォーマンスに基づいて選抜された（詳細はSection 3.4およびAppendix B.1に記載）。

人間のフィードバックを報酬として使用し、強化学習 (PPO) によるファインチューニングを行うことで、モデルがより人間の意図に沿った出力を生成するように調整された。

大規模言語モデルの目標は「次のトークンを予測する」ことであるため、ユーザーの指示や期待に直接対応するわけではない。このミスマッチが有害な出力や不正確な情報の生成につながる。

PPO-ptx手法の使用: RLHFによる微調整中に、公開NLPデータセットでの性能低下を防ぐため、PPO-ptxを使用することで、性能低下を最小化した。
RLHF (Reinforcement Learning from Human Feedback) 微調整: 人間のラベラーのフィードバックを用いて強化学習を行い、PPO (Proximal Policy Optimization) アルゴリズムで報酬関数を最適化した。また、PPO-ptx手法を用いて、事前学習分布の対数尤度を増加させることで、公開NLPデータセットでの性能低下を最小化した。

InstructGPTは依然として簡単なミスを犯すことがあり、安全性と信頼性の向上が課題。

RLHFの手法: もともとロボットやAtariゲームでの訓練に使用されていたRLHFが、テキスト要約や対話システムなどの言語モデルの微調整に適用されている (Christiano et al., 2017; Ziegler et al., 2019; Stiennon et al., 2020)。

3つのステップ: デモデータ収集とスーパーバイズドポリシーの訓練、比較データ収集と報酬モデルの訓練、PPOを用いたポリシーの最適化が順次説明されている。
反復的なプロセス: 比較データと報酬モデルを用いてポリシーを継続的に改善するサイクル。
↓
具体的な処理方法の記載:

Step 1: Collect demonstration data, and train a supervised policy
- ラベラーが入力プロンプトに対して望ましい出力を示すデモンストレーションデータを収集する。このデータを使用して、事前学習済みのGPT-3モデルをスーパーバイズド学習で微調整する。このプロセスで使用されるプロンプトの分布は、Section 3.2で詳述されている。
Step 2: Collect comparison data, and train a reward model
- モデルの出力に対してラベラーが好む出力を比較するデータセットを収集する。ラベラーは与えられた入力に対してどの出力を好むかを指示する。この比較データを使い、報酬モデル（RM）を訓練し、人間が好む出力を予測する。
Step 3: Optimize a policy against the reward model using PPO
- Step 2で得られたRMを報酬として利用し、スーパーバイズドポリシーをPPO（Proximal Policy Optimization）アルゴリズムを使用して最適化する。このプロセスは繰り返し可能であり、最新のポリシーを用いて新たな比較データを収集し、報酬モデルとポリシーを更新する。

Ziegler et al. (2019) および Stiennon et al. (2020)の手法を採用し、スタイル継続や要約の領域におけるモデルの改善を試みている。
比較データと報酬モデルを反復的に使用して、ポリシーの継続的な改善を図るサイクル設計。

データセットの種類: SFTデータセット、RMデータセット、PPOデータセットの3つが作成され、それぞれ異なる目的で使用されている。

SFTデータセット: ラベラーによるデモンストレーションで作成されたデータセット（約13kプロンプト）。
RMデータセット: ラベラーによるモデル出力のランク付けを行ったデータセット（約33kプロンプト）。
PPOデータセット: 人間のラベル付けなしで収集されたAPIプロンプトのデータセット（約31kプロンプト）。

プロンプトデータの収集:
OpenAI APIのPlaygroundインターフェースを通じてユーザーが提出したプロンプトを収集。
プロンプトのフィルタリングにより、個人識別情報（PII）を含むデータは除去。
ラベラーは、ユーザーのプロンプトの意図を推測し、明確でないプロンプトはスキップする。

最初のInstructGPTモデルを訓練するため、ラベラーがプロンプトを作成。これには、Plainプロンプト（任意のタスク）、Few-shotプロンプト（複数のクエリ/レスポンスペアを伴う指示）、およびUser-basedプロンプト（APIの待機リストに基づくユースケース）が含まれる。

ユーザーIDによるデータセットの分割により(1ユーザーあたり最大200プロンプトに制限)、モデルがトレーニング中に特定のユーザーの特徴を学習しすぎることを防ぎ、より一般化された結果を得ることができる。

幅広いタスクへの対応: 入力データは、従来の要約タスクよりも広範囲にわたる（Stiennon et al., 2020の研究と比較）。

Supervised fine-tuning (SFT):
GPT-3モデルをラベラーのデモンストレーションデータで16エポックにわたってファインチューニングした。
検証損失は1エポックでオーバーフィットするが、オーバーフィットが検証損失に現れたとしても、より多くのエポックでトレーニングすることで、ラベラーの評価やRMスコアが改善される。これは、モデルが過剰に適合するデータの一部が、最終的な人間の好みに対応しているためと考えられる。

Reward modeling (RM):
最終的なアンエンベディング層を除去したSFTモデルからスタートし、入力（プロンプト）と応答を受け取り、スカラー報酬を出力するモデルをトレーニングした。
6BパラメータのRMのみを使用し、これは計算リソースを大幅に節約するためである。また、175BパラメータのRMはトレーニングが不安定で、RLの価値関数として使用するには適していなかった。

比較データ収集の高速化のため、ラベラーにK=4～9の応答を提示し、それらをランク付けさせた。これにより、各プロンプトに対して${}_K C_2$個の比較が得られる。
比較データが同一タスク内で非常に相関しているため、比較をシャッフルすると報酬モデルがオーバーフィットすることが確認された。そこで、すべての比較を1つのバッチ要素として処理し、計算効率を向上させ、オーバーフィットを防いだ。

報酬モデルの正規化: 報酬モデルの損失は報酬のシフトに不変であるため、報酬を正規化し、ラベラーのデモンストレーションが平均スコア0を達成するようにバイアスを調整した。

Reinforcement learning (RL):
ランダムなプロンプトに対する応答を生成し、報酬モデルに基づいて報酬を算出する。
KLペナルティを各トークンごとに追加し、報酬モデルの過剰最適化を防止する。このKLペナルティは、SFTモデルとPPOモデルの確率分布の差に基づいて計算される。
事前学習の勾配をPPOの勾配に組み込むことで、公開NLPデータセット上での性能低下を防ぐ。この手法を用いるモデルをPPO-ptxと呼ぶ。

役立つモデルの評価方法: モデルが指示に従い、曖昧なプロンプトに対してもユーザーの意図を推論できるかどうかを、ラベラーの判断によって評価する。主なメトリックは、ラベラーによる「好みの評価」であり、ラベラーはプロンプトに対して最も適切と思われる応答を評価する。

ラベラーがプロンプトごとにモデルの出力の品質を1〜7のリッカートスケールで評価する。

InstructGPTはGPT-3よりも一貫して優れた結果を示しており、特に指示に従う能力や事実の捏造（ハルシネーション）の少なさが評価されている。

InstructGPTはAPIプロンプト分布に基づいて学習されており、FLANやT0などの公開データセットよりも実際の使用ケースに適した結果を生成することができる。

PPOによるモデルの訓練は、従来の単純なファインチューニングやfew-shotプロンプトと比べて、ラベラーによる評価を向上させている。

PPO-ptxモデルのファインチューニング: Pretraining updatesをRLHFのファインチューニングプロセスに追加することで、NLPデータセットでのパフォーマンスを維持・向上させる。KL係数の調整は代替手法として検討されるが、最適ではない。

InstructGPTは、RLHFでチューニングされていない指示に対しても一般化する能力を示す。
GPT-3もこれらのタスクを実行可能だが、より慎重なプロンプト設定が必要である。

InstructGPTが犯す単純なミスについて:
誤った前提を含む質問に対する応答: 誤った前提を含む指示に対して、モデルがその前提をそのまま受け入れ、誤答してしまうことがある。
過度な慎重さ（過剰な保留）: 単純な質問に対しても、複数の可能性を提示するなど、過度に慎重な応答をする場合がある。
複数の明確な制約を含む指示: 指示に多くの明確な制約が含まれると、InstructGPTの性能が劣化する。例えば、「1930年代にフランスで制作された映画を10本挙げる」などの複数の制約がある指示に対して、適切な応答を生成できない場合がある (図9)。
複雑な形式の制約がある場合: 言語モデルにとって難しい制約（例えば「指定された文章数で要約する」など）を含む指示に対しても、性能が低下することが観察されている。これにより、指定された要件を満たすことができず、適切な応答を生成できない (図9)。

アライメント技術のコストは、GPT-3の訓練に比べて非常に低く、現在のモデルの整合性向上は、モデルサイズを100倍にするよりもコスト効率が良い。

RLHFは、従来のNLPタスク（Ziegler et al., 2019; Stiennon et al., 2020）や要約タスク（Wu et al., 2021）で成功を収めており、本研究でもその有効性が実証されている。

整合技術を適用することで、言語モデルはユーザーの指示により忠実に従うようになるが、それにより偽情報や攻撃的なコンテンツの生成が容易になる。したがって、APIベースのアクセス制限や使用目的の制限が有効な対策とされるが、これには透明性の低下や権力の集中といった副作用が伴う。

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347.

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up