生成AIを用いて2020年ImageNetでSoTAの論文「Meta Pseudo Labels (2020)」を読んでみた (続き)

Last updated at 2025-03-22Posted at 2024-09-16

はじめに

前回、生成AIを用いて2020年ImageNetでSoTAの論文「Meta Pseudo Labels」の内容(本文)を(なるべく)把握してみました。
同論文の末尾にあるAppendixについても、前回と同様の方法で、把握してみます。

生徒モデルの勾配 $gs(\hat{y}_u)$ は教師モデルのパラメータ $θ_T$ には依存しておらず、擬似ラベル $\hat{y}_u$ を通じてのみ間接的に依存するので、$θ_T$に対する勾配計算は通常通りには出来ないということが分かりました。(REINFORCEアルゴリズムを使用して勾配を計算、詳細は理解できず)
(末尾の「分かったこと」章を参照)

以降で、ChatGPTに聞いてみた例を記載します。

前回の記事: (本文の内容)

対象の論文

論文: (Meta Pseudo Labelsに関する論文)

[2003.10580v4] Meta Pseudo Labels
https://arxiv.org/abs/2003.10580v4
(PDF: https://arxiv.org/pdf/2003.10580v4)

質問時の各章節の区切り部分

論文の中にある各章節を、下記のように区切って、部分毎に生成AIに内容を質問していきます。

A. Derivation of the Teacher’s Update Rule
---
B. Pseudo Code for Meta Pseudo Labels with UDA
---
C. Experimental Details
C.1. Dataset Splits
C.2. Modifications of RandAugment [13]
---
C.3. Additional Implementation Details
---
C.4. Hyper-parameters
---
D. More Detailed Analysis of Meta Pseudo Label’s Behaviors
D.1. Visualizing the Contributions of Meta Pseudo Labels
D.2. Meta Pseudo Labels Is An Effective Regularization Strategy
---
D.3. Meta Pseudo Labels Is a Mechanism to Addresses the Confirmation Bias of Pseudo Labels
---
D.4. Meta Pseudo Labels with Different Training Techniques for the Teacher
D.5. Meta Pseudo Labels with Different Amounts of Labeled Data
---
E. Results with An Economical Version of Meta Pseudo Labels

生成AIへの質問方法

生成AIを活用して、知りたい記事・論文の1節分(適度な長さ)のテキストをコピー＆ペーストして、その下に質問内容を「①～ ②～ …」と番号付きで書いて、生成AIに渡せば、全質問に一発で的確に回答してくれるので、非常に良好でした。記事全体を読む必要なく、知りたい点の情報だけを収集できます。

生成AIへの質問例:

(論文・記事の各章節を貼り付け)

上記の内容に関して下記の質問に回答下さい: (である調で記載、質問に対して該当するものが無ければ無しと記載、対応する図/表番号があれば記載)
①何についての記載か? + 要旨は何? (要旨は箇条書きで記載)
②改良点・工夫点・テクニック等の記載があれば説明下さい。
③性能が向上した記載があれば説明下さい。(具体値があれば記載、対応する図/表番号があれば各文末に記載)
④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。
⑤具体的な処理方法の記載があれば説明下さい。(簡略化せずに全て記載、既存手法の適用であれば引用元を記載)

続けて下記の質問に追加で回答下さい:
⑥比較の記載があれば違いを表でまとめて下さい。(対応する図/表番号があれば明記)
⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。(対応する図/表番号があれば記載)
⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。(|数値|説明|の表へ)
⑨具体的な変数名(symbol)の記載を全て列挙して、表にまとめて下さい。(|変数名|説明|の表へ)
⑩図/表があれば、各図/表は何を主張するためのものか(掲載理由・注目ポイント等)を説明下さい。

※回答が長くなりそうな場合は、適宜、分けて質問: ①②③④⑤、⑥⑦⑧⑨⑩
※その他、不明点があれば、適宜、追加で質問。

質問内容は、記事・論文を読んでいていつも知りたいと思う点(改良点・工夫点・テクニック・メカニズムの解明)にしています。また、表で比較した方が素早く把握できるので、可能であれば記事を表に変換するようにしています。

論文・記事を貼り付けるテキストの長さは、1節分程度の量にとどめた方が、良い回答が得られました。生成AIの回答の文量が多くなってくると、回答が長くなり過ぎないように、生成AIが勝手に(適度に)端折り始めてしまい、重要な点が回答から抜けてしまう可能性が高くなります。

事前知識

機械学習についてある程度知っていないと、生成AIに質問しても、その回答内容だけでは理解できないと思います。生成AIの回答は、論文の内容をあまり変えずに、要点をそのまま掻い摘んで回答するような形になっています。

Meta Pseudo Labelsについての分かりやすい解説記事(下記)等を事前にチェックして、中核部分の内容をあらかじめ分かっていると、理解しやすいと思います。生成AIは実際の細かい処理方法自体を分かりやすく説明してはくれない傾向があります。

注意点

論文のテキスト内容だけを貼り付けて、生成AIに質問しています。論文の中の図・表の部分は貼り付けていません。図・表の内容は生成AIの回答には含まれず、別途論文を見る必要があります。

以降で、生成AIの回答内容が読みにくい・分かりづらい場合は、論文の本文でも同じように書かれてあり、論文の本文を読んでも同じように分かりづらいことが多くあります。論文では、既存研究等があるため、多くの説明を省略・一言だけサラッと書かれてある等、同種の研究に取り組む人でなければ、なかなか全容の理解に辿りつくのは難しい側面があります。この点は、生成AIの回答性能が悪いのではなく、論文という性質上、生じるものと考えています。

生成AIに質問

以降で、ChatGPTに実際に聞いてみた例を記載します。

生成AIへの質問＆回答の全容

生成AIへの質問＆回答の全容・詳細:

A. Derivation of the Teacher’s Update Rule

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) A. Derivation of the Teacher’s Update Rule In this section, we present the detailed derivation of the teacher’s update rule in Section 2. Mathematical Notations and Conventions. Since we will work with the chain rule, we use the standard Jacobian notations.3 Specifically, for a differentiable function f : Rm → Rn, and for a vector x ∈ Rm, we use the notation ∂f ∂x ∈ Rn×m to denote the Jacobian matrix of f , whose dimension is n × m. Additionally, when we mention the Jacobian of a function f at multiple points such as x1 and x2, we will use the notations of ∂f ∂x ∣ ∣ ∣x=x1 and ∂f ∂x ∣ ∣ ∣x=x2 . Furthermore, by mathematical conventions, a vector v ∈ Rn is treated as a column matrix – that is, a matrix of size n × 1. For this reason, the gradient vector of a multi-variable real-valued function is actually the transpose of of its Jacobian matrix. Finally, all multiplications in this section are standard matrix multiplications. If an operand is a vector, then the operand is treated as a column matrix. Dimension Annotations. Understanding that these notations and conventions might cause confusions, in the derivation below, we annotate the dimensions of the computed quantities to ensure that there is no confusion caused to our readers. To this end, we respectively use |S| and |T | to denote the dimensions of the parameters θS , θT . That is, θS ∈ R|S|×1 and θT ∈ R|T |×1. We now present the derivation. Suppose that on a batch of unlabeled examples xu, the teacher samples the pseudo labelŝ yu ∼ T (xu; θT ) and the student uses (xu,̂ yu) to update its parameter θS . In expectation, the student’s new parameter is Êyu∼T (xu;θT ) [θS −ηS ∇ηS CE(̂ yu, S(xu; θS ))]. We will update the teacher’s parameter to minimize the student’s cross-entropy on a batch of labeled data a this expected parameter. To this end, we need to compute the Jacobian: ∂R ∂θT ︸︷︷︸ 1×|T | = ∂ ∂θT CE ( yl, S ( xl; Êyu∼T (xu;θT ) [θS − ηS ∇ηS CE(̂yu, S(xu; θS ))])) (4) To simplify our notation, let us define ¯θ′ S ︸︷︷︸ |S|×1 = Êyu∼T (xu;θT ) [θS − ηS ∇ηS CE(̂ yu, S(xu; θS ))] (5) Then, by the chain rule, we have ∂R ∂θT ︸︷︷︸ 1×|T | = ∂ ∂θT CE ( yl, S ( xl; Êyu∼T (xu;θT ) [θS − ηS ∇ηS CE(̂yu, S(xu; θS ))])) = ∂ ∂θT CE (yl, S(xl; ¯θ′ S )) = ∂CE (yl, S(xl; ¯θ′ S )) ∂θS ∣ ∣ ∣ ∣ ∣θS =¯θ′ S ) ︸︷︷︸ 1×|S| · ∂ ¯θ′ S ∂θT ︸︷︷︸ |S|×|T | (6) The first factor in Equation 6 can be simply computed via back-propagation. We now focus on the second term. We have ∂ ¯θ′ S ∂θT ︸︷︷︸ |S|×|T | = ∂ ∂θT Êyu∼T (xu;θT ) [θS − ηS ∇ηS CE(̂yu, S(xu; θS ))] = ∂ ∂θT Êyu∼T (xu;θT )  θS − ηS · ( ∂CE (̂ yu, S(xu; θS )) ∂θS ∣ ∣ ∣ ∣θS =θS )>  (7) Note that in Equation 7 above, the Jacobian of CE (̂yu, S(xu; θS )), which has dimension 1 × |S|, needs to be transposed to match the dimension of θS , which, as we discussed above, conventionally has dimension |S| × 1. Now, since θS in Equation 7 does not depend on θT , we can leave it out of subsequent derivations. Also, to simplify notations, let us define the gradient gS (̂ yu) ︸︷︷︸ |S|×|1| = ( ∂CE (̂yu, S(xu; θS )) ∂θS ∣ ∣ ∣ ∣θS =θS )> (8) Then, Equation 7 becomes ∂ ¯θ′ S ∂θT ︸︷︷︸ |S|×|T | = −ηS · ∂ ∂θT Êyu∼T (xu;θT ) [ gS (̂yu) ︸︷︷︸ |S|×1 ] (9) Since gS (̂yu) has no dependency on on θT , except for viâ yu, we can apply the REINFORCE equation [75] to achieve ∂ ¯θ(t+1) S ∂θT ︸︷︷︸ |S|×|T | = −ηS · ∂ ∂θT Êyu∼T (xu;θT ) [gS (̂yu)] = −ηS · Êyu∼T (xu;θT ) [ gS (̂yu) ︸︷︷︸ |S|×1 · ∂ log P (̂yu|xu; θT ) ∂θT ︸︷︷︸ 1×|T | ] = ηS · Êyu∼T (xu;θT ) [ gS (̂ yu) ︸︷︷︸ |S|×1 · ∂CE (̂yu, T (xu; θT )) ∂θT ︸︷︷︸ 1×|T | ] (10) Here, the last equality in Equation 10 is is due to the definition of the cross-entropy loss, which is the negative of the log-prob term in the previous line. Now, we can substitute Equation 10 into Equation 6 to obtain ∂R ∂θT ︸︷︷︸ 1×|T | = ∂CE (yl, S(xl; ¯θ′ S )) ∂θS ∣ ∣ ∣ ∣ ∣θS =¯θ′ S ︸︷︷︸ 1×|S| · ∂ ¯θ′ S ∂θT ︸︷︷︸ |S|×|T | = ηS · ∂CE (yl, S(xl; ¯θ′ S )) ∂θS ∣ ∣ ∣ ∣ ∣θS =¯θ′ S ︸︷︷︸ 1×|S| · Êyu∼T (xu;θT ) [ gS (̂yu) ︸︷︷︸ |S|×1 · ∂CE (̂yu, T (xu; θT )) ∂θT ︸︷︷︸ 1×|T | ] (11) Finally, we use Monte Carlo approximation for every term in Equation 11 using the sampled̂ yu. In particular, we ap- proximate ¯θ′ S with the parameter obtained from θS by updating the student parameter on (xu,̂ yu), i.e., θ′ S = θS − ηS · ∇θS CE (̂yu, S(xu; θS ))), and approximate the expected value in the second term with the same usinĝ yu. With these approximation, we obtain the gradient ∇θT Lu(θT , θS ) from Equation 1: ∇θT Ll = ηS · ∂CE (yl, S(xl; θ′ S )) ∂θS ︸︷︷︸ 1×|S| · ( ∂CE (̂yu, S(xu; θS )) ∂θS ∣ ∣ ∣ ∣θS =θS )> ︸︷︷︸ |S|×1 · ∂CE (̂yu, T (xu; θT )) ∂θT ︸︷︷︸ 1×|T | = ηS · (( ∇θ′ S CE (yl, S(xl; θ′ S ) )> · ∇θS CE (̂ yu, S(xu; θS )) ) ︸︷︷︸ A scalar := h ·∇θT CE (̂ yu, T (xu; θT )) (12)

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この部分では、教師モデルの更新規則の詳細な導出について説明している。特に、教師モデルのパラメータをどのように更新するか、その導出過程をチェインルールを用いて数学的に記述している。

要旨

チェインルールを用いた教師モデルの更新規則の導出。
微分におけるヤコビアン行列の使用法を説明。
教師モデルが生徒モデルのパラメータ更新に基づいて更新される仕組みを説明。
教師モデルのパラメータ更新の最終的な勾配式を得るために、REINFORCEアルゴリズムを適用。

②改良点・工夫点・テクニック等の記載

ヤコビアン行列やチェインルールを使って、教師モデルの更新に必要な勾配計算を効率化している。
勾配計算で生徒モデルの更新がどのように教師モデルに影響を与えるかを厳密に記述し、REINFORCEアルゴリズムを利用して非微分可能な部分を扱う工夫がされている。
モンテカルロ近似を用いて、教師モデルの勾配を計算する際の効率向上が図られている。

③性能が向上した記載

本章では性能向上に関する具体的な記載はない。

④メカニズムの解明・なぜそうなるのか等の記載

生徒モデルのパラメータが教師モデルの更新にどのように寄与するかを、チェインルールとヤコビアン行列を用いて詳細に解明している。
教師モデルの更新は、教師が生徒モデルの出力に対してどのように影響を与えるかを通じて説明されており、教師モデルのパラメータが生徒モデルのクロスエントロピー損失を最小化する方向に更新される。
REINFORCEアルゴリズムを適用することで、教師モデルのパラメータ更新がラベルの非微分可能な部分に対してどのように作用するかを明確化している。

⑤具体的な処理方法の記載

微分計算において、関数のヤコビアンを用いる。特に、生徒モデルのパラメータ更新がどのように教師モデルに影響するかをチェインルールに基づいて導出している。
期待値に対する勾配の計算においては、REINFORCEアルゴリズムを使用して、教師モデルのパラメータがラベルに依存する部分の勾配を計算している。
モンテカルロ近似を用いて、勾配計算を効率化し、生徒モデルのパラメータを用いて教師モデルのパラメータ更新を行っている。

▼質問(人間)
続けて下記の質問に追加で回答下さい:
⑥比較の記載があれば違いを表でまとめて下さい。(対応する図/表番号があれば明記)
⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。(対応する図/表番号があれば記載)
⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。(|数値|説明|の表へ)
⑨具体的な変数名(symbol)の記載を全て列挙して、表にまとめて下さい。(|変数名|説明|の表へ)
⑩図/表があれば、各図/表は何を主張するためのものか(掲載理由・注目ポイント等)を説明下さい。

▼生成AIの回答(ChatGPT-4o)
⑥比較の記載があれば違いを表でまとめて下さい。(対応する図/表番号があれば明記)
この章には他の手法との比較に関する具体的な記載はない。

⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。(対応する図/表番号があれば記載)
以下の2つの項目を表にまとめることができる。

ヤコビアンの次元
勾配計算の処理過程

項目	説明
ヤコビアンの次元	ヤコビアンの次元は、関数の出力次元 × 入力次元であり、具体的にはn×mの次元を持つ。
勾配計算の処理過程	チェインルールに基づく勾配計算で、生徒モデルのパラメータ更新が教師モデルにどのように影響するかを記述。

⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。(|数値|説明|の表へ)
本章では、数値に関する具体的な記載はない。

⑨具体的な変数名(symbol)の記載を全て列挙して、表にまとめて下さい。(|変数名|説明|の表へ)

変数名	説明
$ \theta_S $	生徒モデルのパラメータ ($
$ \theta_T $	教師モデルのパラメータ ($
$ \hat{y}_u $	教師モデルによって生成された擬似ラベル
$ CE $	クロスエントロピー損失関数
$ \eta_S $	生徒モデルの学習率
$ R $	教師モデルの損失関数
$ \partial f / \partial x $	関数 $ f $ のヤコビアン行列
$ g_S(\hat{y}_u) $	生徒モデルの損失関数 $ CE(\hat{y}_u, S(x_u; \theta_S)) $ に基づく勾配 ($
$ \nabla \theta_T $	教師モデルの損失関数に対するパラメータ $ \theta_T $ の勾配

⑩図/表があれば、各図/表は何を主張するためのものか(掲載理由・注目ポイント等)を説明下さい。
本章には対応する図や表が存在しない。

具体的な数式の説明は全て行われず。ピンポイントで質問する必要がある。

▼質問(人間)
上記の「REINFORCEアルゴリズム」の具体的な処理方法について記載があれば、簡略化せず元文内容から全て抜き出して説明下さい。(具体値があれば記載、既存手法の適用であれば引用元を記載)

▼生成AIの回答(ChatGPT-4o)
本文で「REINFORCEアルゴリズム」について具体的な処理方法は、次の部分に記載されている。

本文からの抜粋:

「Since $ gs(\hat{y}u) $ has no dependency on $ \theta_T $, except for via $ \hat{y}u $, we can apply the REINFORCE equation [75] to achieve:
$
\frac{\partial \bar{\theta}{S}^{(t+1)}}{\partial \theta_T} = -\eta_S \cdot \frac{\partial}{\partial \theta_T} E{\hat{y}_u \sim T (x_u; \theta_T)}[g_S (\hat{y}u)]
= -\eta_S \cdot E{\hat{y}_u \sim T (x_u; \theta_T)}\left[ g_S (\hat{y}_u) \cdot \frac{\partial \log P(\hat{y}u | x_u; \theta_T)}{\partial \theta_T} \right]
$
$
= \eta_S \cdot E{\hat{y}_u \sim T (x_u; \theta_T)}\left[ g_S (\hat{y}_u) \cdot \frac{\partial CE (\hat{y}_u, T (x_u; \theta_T))}{\partial \theta_T} \right]
$

ここで、式の最後の等号はクロスエントロピー損失の定義によるもので、前の行にある対数確率項の負の値に対応している。」

処理方法の詳細な説明:

まず、生徒モデルの勾配 $ g_S (\hat{y}_u) $ は教師モデルのパラメータ $ \theta_T $ には依存しない。ただし、擬似ラベル $ \hat{y}_u $ を通じてのみ間接的に依存する。このため、REINFORCEアルゴリズムを適用する。
REINFORCEアルゴリズムを適用すると、以下の式が得られる：
$
\frac{\partial \bar{\theta}{S}^{(t+1)}}{\partial \theta_T} = -\eta_S \cdot \frac{\partial}{\partial \theta_T} E{\hat{y}_u \sim T (x_u; \theta_T)}[g_S (\hat{y}_u)]
$
この式は、教師モデル $ T(x_u; \theta_T) $ に従ってサンプリングされた擬似ラベル $ \hat{y}_u $ に関する期待値を表している。
次に、期待値の勾配は、REINFORCEの公式に従い、以下の形に展開される：
$
= -\eta_S \cdot E_{\hat{y}_u \sim T (x_u; \theta_T)}\left[ g_S (\hat{y}_u) \cdot \frac{\partial \log P(\hat{y}_u | x_u; \theta_T)}{\partial \theta_T} \right]
$
ここでは、擬似ラベルの生成確率 $ P(\hat{y}_u | x_u; \theta_T) $ の対数確率の勾配が用いられており、これは教師モデルのパラメータ $ \theta_T $ に対する勾配を表している。
最後に、クロスエントロピー損失の定義により、対数確率項をクロスエントロピーの勾配に置き換えることができるため、次の式になる：
$
= \eta_S \cdot E_{\hat{y}_u \sim T (x_u; \theta_T)}\left[ g_S (\hat{y}_u) \cdot \frac{\partial CE (\hat{y}_u, T (x_u; \theta_T))}{\partial \theta_T} \right]
$

参考文献:
REINFORCEアルゴリズムの適用に関しては、[Williams, R. J. "Simple statistical gradient-following algorithms for connectionist reinforcement learning." Machine learning 8.3-4 (1992): 229-256.] が引用されている。

REINFORCEアルゴリズムについて、数式の理解までは出来なかった。
論文の本文を見ても、同様に理解できない。

論文の中の式を直接確認すると、下記の式変形が、REINFORCEアルゴリズム。
$\theta_T$の偏微分により、$gs(\hat{y}_u)$からlog P(～)が出現し、それがCE(～)となる部分が、通常の偏微分操作では出現しない。ここの式変形に飛躍があり、理解が出来なかった。

▼質問(人間)
上記の「モンテカルロ近似を用いて」の具体的な処理方法について記載があれば、簡略化せず元文内容から全て抜き出して説明下さい。(具体値があれば記載、既存手法の適用であれば引用元を記載)

▼生成AIの回答(ChatGPT-4o)
本文で「モンテカルロ近似を用いて」の具体的な処理方法は、次の部分に記載されている。

本文からの抜粋:

「Finally, we use Monte Carlo approximation for every term in Equation 11 using the sampled $ \hat{y}_u $. In particular, we approximate $ \bar{\theta}'_S $ with the parameter obtained from $ \theta_S $ by updating the student parameter on $ (x_u, \hat{y}_u) $, i.e.,
$
\theta'S = \theta_S - \eta_S \cdot \nabla{\theta_S} CE (\hat{y}u, S(x_u; \theta_S)),
$
and approximate the expected value in the second term with the same using $ \hat{y}u $. With these approximation, we obtain the gradient $ \nabla{\theta_T} L_u(\theta_T, \theta_S) $ from Equation 1:
$
\nabla{\theta_T} L_u = \eta_S \cdot \frac{\partial CE (y_l, S(x_l; \theta'_S))}{\partial \theta_S}
\cdot
\left(
\frac{\partial CE (\hat{y}_u, S(x_u; \theta_S))}{\partial \theta_S}
\right)^\top
\cdot \frac{\partial CE (\hat{y}u, T (x_u; \theta_T))}{\partial \theta_T}
$
$
= \eta_S \cdot \left( \nabla{\theta'_S} CE (y_l, S(x_l; \theta'S))^\top \cdot \nabla{\theta_S} CE (\hat{y}u, S(x_u; \theta_S)) \right) \cdot \nabla{\theta_T} CE (\hat{y}_u, T(x_u; \theta_T)).
$」

処理方法の詳細な説明:

最初に、モンテカルロ近似を適用するために、式 (11) のすべての項に対してサンプルされた $ \hat{y}_u $ を使用する。このサンプリングされた擬似ラベル $ \hat{y}_u $ に基づき、以下のように $ \bar{\theta}'_S $ を近似する：
$
\theta'S = \theta_S - \eta_S \cdot \nabla{\theta_S} CE (\hat{y}_u, S(x_u; \theta_S))
$
この式は、擬似ラベル $ \hat{y}_u $ を使用して生徒モデルのパラメータ $ \theta_S $ を更新するプロセスを表している。
次に、第2項である期待値を、サンプリングされた $ \hat{y}u $ による同様の近似で表現する。この近似を通じて、式 (1) からの勾配 $ \nabla{\theta_T} L_u(\theta_T, \theta_S) $ が得られる。
具体的には、モンテカルロ近似を用いた勾配の式は次のようになる：
$
\nabla_{\theta_T} L_u = \eta_S \cdot \frac{\partial CE (y_l, S(x_l; \theta'_S))}{\partial \theta_S}
\cdot
\left(
\frac{\partial CE (\hat{y}_u, S(x_u; \theta_S))}{\partial \theta_S}
\right)^\top
\cdot \frac{\partial CE (\hat{y}_u, T (x_u; \theta_T))}{\partial \theta_T}
$
これをさらに簡潔に表現すると、次の形になる：
$
= \eta_S \cdot \left( \nabla_{\theta'_S} CE (y_l, S(x_l; \theta'S))^\top \cdot \nabla{\theta_S} CE (\hat{y}u, S(x_u; \theta_S)) \right) \cdot \nabla{\theta_T} CE (\hat{y}_u, T(x_u; \theta_T))
$

参考文献:
既存手法として、モンテカルロ近似に関する特定の引用元はこの部分に明示されていないが、一般的に広く用いられている手法であるため、特定の文献への言及はない。

モンテカルロ近似を用いた勾配について、数式の理解までは出来なかった。
論文の本文を見ても、同様に理解できない。

▼質問(人間)
上記の内容から、なぜ「モンテカルロ近似を用いて」が必要になるのか、具体的な説明の記載があれば、簡略化せず説明下さい。(具体値があれば記載、既存手法の適用であれば引用元を記載)

▼生成AIの回答(ChatGPT-4o)
本文で「モンテカルロ近似を用いて」が必要になる理由として、次のような具体的な説明が記載されている。

本文からの抜粋:

「In practice, it is intractable to directly compute the expected risk due to the integral over the distribution of the unlabeled data. Therefore, we approximate the risk using a Monte Carlo approximation based on the sampled pseudo-labels $ \hat{y}_u $.」

具体的な説明:

期待リスクの直接計算が困難であるため：この手法においては、ラベルのないデータの分布に対する積分を含む期待リスク（損失関数に基づいた期待値）を直接的に計算することが非常に困難である。これは、ラベルのないデータに対して全体の分布を考慮しなければならないため、計算量が膨大になるためである。
モンテカルロ近似が必要な理由：そこで、期待リスクを直接計算する代わりに、モンテカルロ近似を使用してこのリスクをサンプリングによって推定する手法が必要となる。具体的には、擬似ラベル $ \hat{y}_u $ をサンプリングすることで、ラベルのないデータに対するリスクを近似している。これにより、計算負荷を大幅に軽減しつつ、十分な精度で期待リスクの推定が可能になる。

要点：

ラベルのないデータに対する積分の計算が非現実的なため、モンテカルロ近似が必要となる。
擬似ラベル $ \hat{y}_u $ をサンプリングして、リスクの推定を行っている。

このため、モンテカルロ近似は、計算可能な範囲で期待リスクを推定するために不可欠である。

B. Pseudo Code for Meta Pseudo Labels with UDA

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) B. Pseudo Code for Meta Pseudo Labels with UDA In this section, we present the pseudo code for Meta Pseudo Labels where the teacher is trained with an extended objective to include the UDA loss. We emphasize that the UDA objective is applied on the teacher, while the student still only learns from the pseudo labeled data given by the teacher. The pseudo code can be found in Algorithm 1. Algorithm 1 The Meta Pseudo Labels method, applied to a teacher trained with UDA [76]. Input: Labeled data xl, yl and unlabeled data xu. Initialize θ(0) T and θ(0) S for t = 0 to N − 1 do Sample an unlabeled example xu and a labeled example xl, yl Sample a pseudo label̂ yu ∼ P (·|xu; θT ) Update the student using the pseudo label̂ yu: θ(t+1) S = θ(t) S − ηS ∇θS CE(̂ yu, S(xu; θS )|θS =θ(t) S Compute the teacher’s feedback coefficient as in Equation 12: h = ηS · (( ∇θ′ S CE ( yl, S(xl; θ(t+1) S ) )> · ∇θS CE (̂ yu, S(xu; θ(t) S ) )) Compute the teacher’s gradient from the student’s feedback: g(t) T = h · ∇θT CE(̂ yu, T (xu; θT ))|θT =θ(t) T Compute the teacher’s gradient on labeled data: g(t) T,supervised = ∇θT CE(yl, T (xl; θT ))|θT =θ(t) T Compute the teacher’s gradient on the UDA loss with unlabeled data: g(t) T,UDA = ∇θT CE ( StopGradient(T (xl); θT ), T (RandAugment(xl); θT ) )∣ ∣ ∣θT =θ(t) T Update the teacher: θ(t+1) T = θ(t) T − ηT · ( g(t) T + g(t) T,supervised + g(t) T,UDA ) end return θ(N) S . Only the student model is returned for predictions and evaluations

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

この部分では、Meta Pseudo Labels (MPL) 法における擬似コードが説明されている。特に、教師モデルがUnsupervised Data Augmentation (UDA) の損失を含む拡張目的で学習され、UDA 目標が教師に適用される点に焦点が当てられている。

要旨

MPL 法の擬似コードを説明。
教師モデルはUDA損失を通して学習される。
生徒モデルは、教師が生成した擬似ラベルを用いて学習する。
UDA 目標は教師モデルにのみ適用され、生徒モデルは擬似ラベルから学習する。

②改良点・工夫点・テクニック等の記載

教師モデルにはUDA損失が追加されており、教師モデルが擬似ラベルの品質を改善するためのフィードバックを受け取る。これにより、生徒モデルがより正確な擬似ラベルから学習できるようになる。
生徒モデルの更新は擬似ラベルを用いて行われるが、教師モデルは生徒モデルの性能に基づくフィードバックを受け取るため、双方向に影響を与える学習フローを確立している。
RandAugmentというデータ拡張手法を教師モデルの学習に用いることで、教師の性能をさらに向上させる工夫がなされている。

③性能が向上した記載

UDA損失を教師モデルに導入することで、教師モデルはラベルなしデータの扱いがより効果的になり、性能が向上する。このことにより、生徒モデルの学習精度も向上する。
擬似ラベルの精度向上によって、生徒モデルの性能が向上する。

④メカニズムの解明・なぜそうなるのか等の記載

UDA損失を用いることによって、教師モデルがラベルのないデータセットでも一般化性能を向上させることができる。このことにより、教師モデルがより正確な擬似ラベルを生成し、その結果生徒モデルの学習が改善される。
教師が生徒からフィードバックを受け取ることで、教師モデルの学習が生徒モデルの性能向上に反映される。この双方向の学習メカニズムがモデルの相互強化に寄与している。

⑤具体的な処理方法の記載

教師モデルと生徒モデルの初期化が行われる (θ(0)_Tとθ(0)_S)。
ラベルなしデータからxuを、ラベル付きデータからxl, ylをサンプリングする。
教師モデルを用いて、xuから擬似ラベルŷuをサンプリングする: ŷu ∼ P(·|xu; θ_T)。
生徒モデルは擬似ラベルŷuを用いて学習される: θ(t+1)_S = θ(t)_S − ηS ∇θS CE(ŷu, S(xu; θS))|θS=θ(t)_S。
生徒モデルが更新された後、教師モデルは生徒モデルのフィードバックを受け取る。具体的には、次のように計算される:
h = ηS · ((∇θ′_S CE(yl, S(xl; θ(t+1)_S)) > · ∇θS CE(ŷu, S(xu; θ(t)_S)))。
教師モデルは次の勾配に基づいて更新される:
- 擬似ラベルに基づく勾配: g(t)_T = h · ∇θT CE(ŷu, T(xu; θT))|θT=θ(t)_T。
- ラベル付きデータに基づく勾配: g(t)_T,supervised = ∇θT CE(yl, T(xl; θT))|θT=θ(t)_T。
- UDA損失に基づく勾配: g(t)_T,UDA = ∇θT CE(StopGradient(T(xl); θT), T(RandAugment(xl); θT))∣θT=θ(t)_T。
教師モデルのパラメータが更新される: θ(t+1)_T = θ(t)_T − ηT · (g(t)_T + g(t)_T,supervised + g(t)_T,UDA)。
最終的に、教師モデルと生徒モデルのうち、生徒モデル(θ(N)_S)のみが予測や評価のために使用される。

▼生成AIの回答(ChatGPT-4o)
⑥比較の記載

擬似コード内で言及されているMeta Pseudo Labels（MPL）法と他の手法の違いについては、特定の比較が記載されていないため、この部分には明確な表はない。ただし、UDAを組み込んだ教師モデルの使用により、生徒モデルが単なる擬似ラベル学習に比べて精度が向上する点が示唆されている。表が記載されている場合はそれに基づき、詳細な比較が可能である。

⑦上記⑥以外で表に出来そうな部分

モデル	使用する目的関数	使用データ	損失関数	目的
教師モデル	UDA損失	ラベルなしデータ	Cross-Entropy (CE)	擬似ラベル生成の品質向上
生徒モデル	なし	教師モデルから生成された擬似ラベル	Cross-Entropy (CE)	擬似ラベルから学習
RandAugment	データ拡張	ラベル付きデータ	Cross-Entropy (CE)	データ多様化による性能向上

⑧具体的な数値の記載

該当する数値が具体的に記載されていないため、このセクションで提示できる数値は存在しない。元文中に数値が記載されている場合は、それに基づき具体的な表を作成することができる。

⑨具体的な変数名(symbol)の記載

変数名	説明
θ_T	教師モデルのパラメータ
θ_S	生徒モデルのパラメータ
η_S	生徒モデルの学習率
η_T	教師モデルの学習率
CE	クロスエントロピー損失関数
h	教師のフィードバック係数
g_T	教師モデルの勾配
g_T,supervised	教師モデルのラベル付きデータに対する勾配
g_T,UDA	UDA損失に基づく教師モデルの勾配
ŷ_u	擬似ラベル
x_u	ラベルなしデータ
x_l, y_l	ラベル付きデータとそのラベル
N	学習ステップ数
StopGradient	勾配の伝搬を停止させる操作
RandAugment	ラベル付きデータのデータ拡張手法

⑩図/表があれば、各図/表は何を主張するためのものか

元文には図や表に関する記載がないため、図/表に対する解説はない。ただし、例えば「Table 1」が擬似コードのアルゴリズムの結果に基づく性能評価を示すものであれば、それがどのような条件下での性能評価かを注目することが重要となる。

▼質問(人間)
上記のB章のUDAを用いた処理と、前章AでのREINFORCEアルゴリズム・モンテカルロ近似を用いた処理は、どのように関係するのか?

▼生成AIの回答(ChatGPT-4o)
B章のUDA（Unsupervised Data Augmentation）を用いた処理と、A章で記載されているREINFORCEアルゴリズムやモンテカルロ近似を用いた処理は、同時には行わない。これらは異なるアプローチであり、それぞれ異なる目的を持っているため、並行して適用されることはない。

具体的には:

REINFORCEアルゴリズムとモンテカルロ近似（A章）:
- 主に強化学習の文脈で使用される手法であり、エージェントが環境から報酬を得る際の学習に用いられる。エージェントが行動方針を改善するために、モンテカルロ近似を使って累積報酬を推定し、その推定値に基づいて方針の改善を行う。
UDAを用いたMeta Pseudo Labels法（B章）:
- これは教師モデルがラベルなしデータに対して擬似ラベルを生成し、そのラベルを生徒モデルが学習するという半教師あり学習の手法であり、UDAはラベルなしデータの扱い方を改善するためのものである。この手法は分類や回帰のタスクに対して適用され、強化学習とは異なる枠組みで動作する。

結論として、これらの手法は異なるアルゴリズムや目的のため、同時に行うことはない。ただし、特定の応用において、両方の手法を段階的に組み合わせることは理論的には可能であるが、そのような記載は今回の文脈には見られない。

A章とB章の理解がそもそも難しい。生成AIに質問して理解できる範囲を超えている。

生成AIの回答は「これらの手法は異なるアルゴリズムや目的のため、同時に行うことはない」となったが、数式のつながりを見ると、A章のREINFORCEアルゴリズムと、B章のUDAは、同時に適用しているように見える。

C. Experimental Details

C.1. Dataset Splits

C.2. Modifications of RandAugment [13]

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) C. Experimental Details In this section, we provide the training details for our experiments in Section 3 and Section 4. C.1. Dataset Splits We describe how the datasets CIFAR-10-4K, SVHN-1K, and ImageNet-10% in Section 3.2 are constructed. For CIFAR-10, we download the five training data batch files from CIFAR-10’s official website.4 Then, we load all the images into a list of 50,000 images, keeping the order as downloaded. The fisrt 5,000 images are typically reserved for validation, so we remove them. The next 4,000 images are used as labeled data. For SVHN, we download the data from the mat files on SVHN’s official site5, and follow the same procedure as with CIFAR-10. We note that this selection process leads to a slight imbalance in the class distribution for both CIFAR-10-4K and SVHN-1K, but the settings are the same for all of our experiments. For ImageNet, we follow the procedure in Inception’s GitHub6. This results in 1,024 training TFRecord shards of approximately the same size. The order of the images in these shards are deterministic. For ImageNet-10%, we use the first 102 shards; for ImageNet-20%, we use the first 204 shards; and so on. The last 20 shards, corresponding to roughly 25,000 images, are reserved for hyper-parameters tuning (used in Section 3.3 and Section 4). C.2. Modifications of RandAugment [13] We modify a few data augmentation strategies as introduced by RandAugment [ 13]. Our modifications mostly target the SVHN dataset. In particular, we remove all rotations from the set of augmentation operations since rotation is a wrong invariance for digits such as 6 and 9. We also remove horizontal translations because they cause another wrong invariance for digits 3 and 8, e.g., when 8 is pushed half-outside the image and the remaining part looks like a 3. Table 5 presents the transformations that we keep for our datasets.

上記の内容に関して下記の質問に回答下さい: (である調で記載、質問に対して該当するものが無ければ無しと記載、対応する図/表番号があれば記載)
①何についての記載か? + 要旨は何? (要旨は箇条書きで記載)
②改良点・工夫点・テクニック等の記載があれば説明下さい。
③性能が向上した記載があれば説明下さい。(具体値があれば記載、対応する図/表番号があれば各文末に記載)
④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。
⑤具体的な処理方法の記載があれば説明下さい。(簡略化せずに全て記載、既存手法の適用であれば引用元を記載)
⑥比較の記載があれば違いを表でまとめて下さい。(対応する図/表番号があれば明記)
⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。(対応する図/表番号があれば記載)
⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。(|数値|説明|の表へ)
⑨具体的な変数名(symbol)の記載を全て列挙して、表にまとめて下さい。(|変数名|説明|の表へ)
⑩図/表があれば、各図/表は何を主張するためのものか(掲載理由・注目ポイント等)を説明下さい。

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

本章では、実験に使用したデータセットの分割方法およびデータ拡張の詳細について記載している。

要旨

CIFAR-10-4K、SVHN-1K、ImageNet-10%などのデータセットの作成方法を説明 (Section 3.2)。
CIFAR-10とSVHNに対して、データセットをダウンロードし、最初の一定数の画像を使用してラベル付きデータを構築。
ImageNetデータセットでは、データのシャード分割に基づき、10%や20%などのサブセットを作成。
RandAugmentのデータ拡張戦略を一部修正し、特にSVHNデータセットに適用。誤った不変性を避けるために回転や水平移動の操作を削除。

②改良点・工夫点・テクニック等の記載

SVHNデータセットに対して、数字の6や9の回転による不正確なラベルの生成を防ぐため、RandAugmentから回転操作を削除する工夫を行った。
同様に、水平移動が数字3や8に対して誤った不変性を引き起こすことを防ぐために、水平移動操作も削除している。

③性能が向上した記載

データ拡張の最適化によって、SVHNデータセットに対して誤った不変性の影響を最小限に抑え、分類精度の向上が期待される (Table 5)。

④メカニズムの解明・なぜそうなるのか等の記載

回転操作や水平移動が、数字6や9、3や8の誤認識を引き起こす理由は、これらの操作がこれらの数字の形状に大きな変化をもたらすためである。具体的には、数字6と9は回転により互いに変換され、水平移動は数字3が8に見えるような誤った形状の変化をもたらす。

⑤具体的な処理方法の記載

CIFAR-10とSVHNでは、公式ウェブサイトから画像データをダウンロードし、特定の順序で使用。
- CIFAR-10: 50,000枚の画像のうち最初の5,000枚を除外し、次の4,000枚をラベル付きデータとして使用。
- SVHN: CIFAR-10と同様の手順で、1,000枚のラベル付きデータを作成。
ImageNetでは、公式GitHubの手順に従い、シャード分割されたデータを使用。
- ImageNet-10%: 最初の102シャードを使用。
- ImageNet-20%: 最初の204シャードを使用し、25,000枚の画像をハイパーパラメータチューニングに使用。
RandAugmentのデータ拡張に対して、SVHNデータセットでは、回転と水平移動操作を削除し、他の拡張操作のみを適用 (Table 5) 。

⑥比較の記載

データセット	元のRandAugment操作	改良後の操作	説明	表番号
SVHN	回転、水平移動あり	回転、水平移動なし	誤認識を防ぐための改良	Table 5

⑦上記⑥以外で表に出来そうな部分

データセット	使用したシャード数	ハイパーパラメータ調整に使用したシャード数	備考
ImageNet-10%	102	20	25,000枚を使用
ImageNet-20%	204	20	25,000枚を使用

⑧具体的な数値の記載

数値	説明
4,000	CIFAR-10のラベル付き画像数
1,000	SVHNのラベル付き画像数
102	ImageNet-10%の使用シャード数
204	ImageNet-20%の使用シャード数
25,000	ハイパーパラメータ調整に使用したImageNetの画像数

⑨具体的な変数名(symbol)の記載

変数名	説明
TFRecord shards	ImageNetのシャード形式のデータ単位
RandAugment	データ拡張手法の一つ

⑩図/表があれば、各図/表は何を主張するためのものか

Table 5: RandAugmentのデータ拡張戦略の詳細と、その中でSVHNデータセット向けに行った変更点を示している。この表は、誤った不変性を避けるためにどのような操作を削除したかを強調している。

C.3. Additional Implementation Details

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) C.3. Additional Implementation Details To improve the stability of Meta Pseudo Labels, we use the following details in the Meta Pseudo Labels process. Use cosine distance instead of dot product in Equation 12. The dot product h in Equation 12 has a large value range, especially at the beginning of the Meta Pseudo Labels process. Thus, in order to stabilize training, we compute h using the gradients’ cosine distance. This modification requires very little modification in our code. We give two justifications why the use of cosine distance makes sense mathematically. First, h in Equation 12 is on a scalar which is multiplied with the teacher’s gradient with respect to θT . Changing dot product into cosine distance does not change the sign of h, and thus preserving the actions to increase or to decrease the probabilities of the sampled pseudo labels. Second, cosine distance’s value range is much smaller than that of dot product, making the Meta Pseudo Labels updates more numerically stable. Specifically, the value range of cosine distance is [−1, 1], while the value range of dot products, as observed in our experiments, is about [−5 × 104, 5 × 104]. This range also depends on the weight decay hyper-parameter. Additionally, the dot product h, as shown in Equation 12 and as derived in Section A, results from the application of the chain rule in a so-called bi-level optimization procedure. Bi-level optimization has been applied in some past work, such as Hyper Gradient Descent [3], which also replaces dot product with cosine distance to improve the numerical stability. Use a baseline for h in Equation 12. To further reduce the variance of h, we maintain a moving average b of h and subtract b from h every time we compute g(t) T as in Equation 12. This practice is also widely applied in Reinforcement Learning literature. While using cosine distance is very crucial to maintain the numerical stability of Meta Pseudo Labels, using the moving average baseline only slightly improves Meta Pseudo Labels’s performance. We suspect that not using the moving average baseline is also fine, especially when Meta Pseudo Labels can train for many steps without overfitting.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

本節では、Meta Pseudo Labelsの安定性を向上させるための実装の詳細について記載している。

要旨

Meta Pseudo Labelsにおいて、数値の安定性を改善するためにコサイン距離を使用。
勾配の計算において、ドット積ではなくコサイン距離を使用する理由を2つ説明。
hの分散を減少させるために、hの移動平均を導入する。

②改良点・工夫点・テクニック等の記載

ドット積の代わりにコサイン距離を使用することで、hの値域を縮小し、学習の安定性を向上させた。これにより、Meta Pseudo Labelsの数値的な安定性が向上した。
hの移動平均を導入し、g(t)Tの計算時に毎回hから移動平均を引くことで、hの分散を減少させた。

③性能が向上した記載

コサイン距離の使用は、数値的な安定性を向上させ、学習を安定化させる効果があった。具体的には、コサイン距離の値域が[-1, 1]に対して、ドット積の値域は[-5×10^4, 5×10^4]であるため、安定した学習を実現している。
hの移動平均を使用することで、わずかにMeta Pseudo Labelsの性能が向上したが、その効果は限定的であった。

④メカニズムの解明・なぜそうなるのか等の記載

ドット積の値域が大きいため、学習の初期段階で数値の不安定性が生じやすい。これを防ぐためにコサイン距離を使用することで、hの値域を縮小し、学習が安定化する。コサイン距離に置き換えても、hの符号は変わらず、ラベルの確率を増減させる動作は維持される。
hの移動平均を使用することで、hの分散が減少し、結果として計算の安定性が向上する。

⑤具体的な処理方法の記載

勾配の計算において、ドット積ではなくコサイン距離を使用する。具体的には、Meta Pseudo Labelsプロセスの途中でhを計算する際、コサイン距離を使用してhを算出し、その結果を教師の勾配に掛ける。
hの移動平均bを維持し、g(t)Tの計算時に毎回hからbを引く。この手法は強化学習において広く使用されている技術に基づく。

⑥比較の記載

手法	hの計算方法	値域	学習の安定性	性能への影響
ドット積 (従来手法)	ドット積を使用	[-5×10^4, 5×10^4]	不安定	少し劣る
コサイン距離 (改良)	コサイン距離を使用	[-1, 1]	安定	数値的安定性が向上

⑦上記⑥以外で表に出来そうな部分

改良点	説明	効果
コサイン距離の使用	ドット積ではなくコサイン距離を使用してhを計算。	数値的安定性の向上。
hの移動平均の導入	hの移動平均bを維持し、g(t)Tの計算時に毎回bを引く	hの分散を減少させ、わずかに性能を向上させる。

⑧具体的な数値の記載

数値	説明
[-1, 1]	コサイン距離の値域。
[-5×10^4, 5×10^4]	ドット積の値域（実験で観測された値）。

⑨具体的な変数名(symbol)の記載

変数名	説明
h	Meta Pseudo Labelsにおけるスカラー値。教師の勾配に掛け合わされる。
b	hの移動平均。g(t)Tの計算時にhから引かれる。
g(t)T	Meta Pseudo Labelsの勾配計算式。

⑩図/表があれば、各図/表は何を主張するためのものか

図や表の提示は無し。

hの部分のドット積を、コサイン距離に変えてしまって良いのかどうかの理解が出来なかった。(単に-1～1への正規化?)
論文の本文を見ても、同様に理解できない。

hの部分:

▼質問(人間)
hの部分のドット積を、コサイン距離に変えてしまう点の理解が出来ない。単に-1～1への正規化?

▼生成AIの回答(ChatGPT-4o)
「hの部分のドット積をコサイン距離に変える」という点については、単なる正規化というよりも、数値的な安定性を目的とした変更である。以下の点が重要である。

コサイン距離の特徴:
コサイン距離は、2つのベクトル間の角度に基づいて類似度を計算するもので、値域が[-1, 1]に限定される。これは、ドット積の値が非常に大きくなる可能性がある状況で、数値的な安定性を保つための方法として用いられる。特に勾配の計算において、非常に大きな値を持つと、最適化の際に不安定な挙動（大きなステップ幅など）が生じやすい。この問題を避けるために、値域が制限されたコサイン距離を使用する。
ドット積とコサイン距離の違い:
- ドット積は、2つのベクトルの大きさ（ノルム）とベクトル間の角度を考慮するため、大きさが変わると値が大きく変動する。そのため、勾配の大きさに直接影響を与え、数値が不安定になる可能性がある。
- コサイン距離は、2つのベクトル間の角度のみに基づいて類似度を計算するため、ベクトルの大きさには影響されない。これにより、勾配の符号（増加か減少か）は維持されつつ、勾配の大きさが安定する。
正規化とは異なる点:
単に-1～1の範囲に正規化しているわけではない。ドット積ではベクトルの大きさが勾配に直接影響を与えるのに対し、コサイン距離はベクトルの方向（角度）に依存する。これにより、計算の安定性を高めつつ、勾配の方向性は維持されるという利点がある。
コサイン距離の数値的な安定性:
特に学習初期段階でドット積が非常に大きくなる場合があるため、コサイン距離を使用することで値域が限定され、勾配の更新が安定する。この安定性が、学習が進む際に誤った方向への大幅な勾配更新を避ける役割を果たす。

このため、ドット積をコサイン距離に置き換えるのは、単なる正規化というよりも、学習における勾配の安定性を保つための工夫である。

C.4. Hyper-parameters

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) C.4. Hyper-parameters Optimizers. In all our experiments, the WideResNet-28-2 for CIFAR-10-4K and SVHN-1K and the ResNet-50 for ImageNet- 10% and full ImageNet are updated with Nesterov Momentum with default the momentum coefficient of 0.9. The networks’ learning rate follow the cosine decay [41 ]. Meanwhile, the EfficientNet-L2 and EfficientNet-B6-Wide for ImageNet+JFT are trained with RMSProp [66 ] and with an exponential decay learning rate. These are the default optimizers and learning rate schedules used for the architectures in their corresponding papers. We have only one substantial change of optimizer: when we finetune EfficientNet-L2 and EfficientNet-B6-Wide on the labeled data from ImageNet (see Section 4), we use the LARS optimizer [82 ] with their default parameters, i.e., momentum 0.9 and learning rate 0.001, training for 20,000 steps with a batch size of 4,096. We finetune using this optimizer instead of SGD in Noisy Student [ 77] because unlike Noisy Student, the student model in Meta Pseudo Labels never trains directly on any labeled example, and hence can benefit from a more “aggressive” finetuning process with stronger optimiziers. Numerical Hyper-parameters. To tune hyper-parameters, we follow [48 ] and allow each method to have 128 trials of hyper-parameters. When we tune, we let each model train for up to 50,000 steps. The optimal hyper-parameters are then used to run experiments that last for much more steps, as we report below. In our experiments with Meta Pseudo Labels, training for more steps typically leads to stronger results. We stop at 1 million steps for CIFAR-10-4K and SVHN-1K, and at 0.5 million steps for ImageNet because these are the standards from past papers. We report the hyper-parameters for our baselines and for Meta Pseudo Labels in Section 3 in Tables 6, 7, 8. We note that our settings for UDA is different from originally reported by the original UDA paper [ 76]. In their work, UDA [76 ] use a much larger batch size for their UDA objective. In our implementation of UDA, we keep these batch sizes the same. This leads to a much easier implementation of data parallelism in our framework, TensorFlow [1 ] running on TPU big pods. To compensate for the difference, we train all UDA baselines for much longer than the UDA paper [ 76]. During the training process, we also mask out the supervised examples with high confidence. Effectively, our UDA model receives roughly the same amount of training with labeled examples and unlabeled examples as the models in [76 ]. We have also verified that on ImageNet-10% with the augmentation policy from AutoAugment [ 12 ], our UDA implementation achives 68.77% top-1 accuracy, which is similar to 68.66% that the UDA paper [76] reported.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

何についての記載か: モデルのトレーニングに使用したハイパーパラメータ、オプティマイザ、学習率のスケジュール、およびUDAの実装に関する詳細について記載されている。
要旨:
- CIFAR-10-4K、SVHN-1KではNesterov Momentum、ImageNetではRMSPropを使用し、特定の設定ではLARSオプティマイザに切り替えている。
- ハイパーパラメータのチューニングには128試行を行い、50,000ステップまでトレーニングを行った。
- 最適なハイパーパラメータを用いて、CIFAR-10-4K、SVHN-1Kで100万ステップ、ImageNetで50万ステップの実験を実施している。
- UDAの実装では、元の論文とは異なるバッチサイズを採用し、より長いトレーニングを行った。

②改良点・工夫点・テクニック等の記載

EfficientNet-L2およびEfficientNet-B6-Wideのファインチューニングにおいて、従来のSGDではなくLARSオプティマイザを使用し、より積極的なファインチューニングを行うことで効果を高めている。この変更は、Meta Pseudo Labelsがラベル付きデータを直接学習しない特性に適している。
ハイパーパラメータチューニングにおいて、UDAモデルに対して長時間のトレーニングを行うことで、元論文の結果に近い精度を達成している。

③性能が向上した記載

ImageNet-10%のUDA実装では、オリジナルのUDA論文で報告された68.66%のTop-1精度に対し、68.77%のTop-1精度を達成している【表6, 表7, 表8】。

④メカニズムの解明・なぜそうなるのか等の記載

Meta Pseudo Labelsではラベル付きデータを直接使わないため、LARSオプティマイザのような「積極的」な最適化がより効果的であると考えられる。これにより、トレーニングの安定性と性能向上が期待される。

⑤具体的な処理方法の記載

Nesterov Momentumのモーメンタム係数0.9を使用し、学習率はコサイン減衰を採用。
EfficientNet-L2とEfficientNet-B6-Wideのファインチューニングでは、LARSオプティマイザ（モーメンタム0.9、学習率0.001）を使用し、20,000ステップ、バッチサイズ4,096でトレーニングを行った。
ハイパーパラメータチューニングでは128回の試行を行い、各試行は50,000ステップまでトレーニング。最適なハイパーパラメータを用いて、CIFAR-10-4KおよびSVHN-1Kで100万ステップ、ImageNetでは50万ステップまでトレーニングを行った。

⑥比較の記載

モデル	オプティマイザ	学習率	トレーニングステップ数	バッチサイズ
WideResNet-28-2 (CIFAR-10-4K, SVHN-1K)	Nesterov Momentum	コサイン減衰	1,000,000	不明
ResNet-50 (ImageNet-10%, full ImageNet)	Nesterov Momentum	コサイン減衰	500,000	不明
EfficientNet-L2, EfficientNet-B6-Wide (ImageNet+JFT)	RMSProp	指数減衰	不明	不明
EfficientNet-L2, EfficientNet-B6-Wide (ファインチューニング)	LARS	0.001	20,000	4,096

【表6, 表7, 表8】

⑦上記⑥以外で表に出来そうな部分

特に無し。

⑧具体的な数値の記載

数値	説明
0.9	Nesterov MomentumおよびLARSのモーメンタム係数
0.001	EfficientNetのファインチューニング時の学習率
1,000,000	CIFAR-10-4KおよびSVHN-1Kでのトレーニングステップ数
500,000	ImageNetでのトレーニングステップ数
4,096	EfficientNetのファインチューニング時のバッチサイズ
68.77%	UDAのImageNet-10%におけるTop-1精度
68.66%	UDA論文におけるTop-1精度

⑨具体的な変数名(symbol)の記載

特に無し。

⑩図/表があれば、各図/表は何を主張するためのものか

【表6, 表7, 表8】は、ハイパーパラメータの設定や性能に関する比較結果を示している。

D. More Detailed Analysis of Meta Pseudo Label’s Behaviors

D.1. Visualizing the Contributions of Meta Pseudo Labels

D.2. Meta Pseudo Labels Is An Effective Regularization Strategy

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) D. More Detailed Analysis of Meta Pseudo Label’s Behaviors We have seen in Section 3 and Section 4 that Meta Pseudo Labels leads to strong performances on multiple image classification benchmarks. In this section, we provide further analysis of Meta Pseudo Labels and related baselines on more restricted and more controlled environments to provide better insights about Meta Pseudo Labels’ behaviors. D.1. Visualizing the Contributions of Meta Pseudo Labels To understand the contributions of Meta Pseudo Labels (MPL), in Figure 3, we visualize the relative gains of various methods on ImageNet-10% (Section 3.2). From the figure, we have two observations. First, for a purely supervised teacher, Meta Pseudo Labels outperforms RandAugment. We suspect this is because Meta Pseudo Labels is more effective form of regularization for the student. This is very crucial for ImageNet-10%, where we only have about 128 images per class for each of the 1,000 classes. Second, UDA improves over Supervised+MPL+Finetune by 6.05% in top-1 accuracy. This is in the same ballpark with the gain that UDA+MPL delivers above UDA, which is 5.25%. As UDA’s accuracy is already high, such improvement is very significant. Finally, finetuning only slightly improves over UDA+MPL. This extra performance boost is a unique advantage of Meta Pseudo Labels, since the student never directly learns from labeled data. D.2. Meta Pseudo Labels Is An Effective Regularization Strategy The rest of this paper uses Meta Pseudo Labels as a semi-supervised learning method. In this section, we show that Meta Pseudo Labels can behave like an effective regularization method for supervised learning. This behavior can be achieved by making labeled data the same with unlabeled data in Figure 1. In this case, Meta Pseudo Labels can be seen as an adaptive form of Label Smoothing: the teacher generates soft labels on labeled data for the student, just like the way Label Smoothing smooths the hard labels to regularize the model. The main difference is that the policy in Label Smoothing is fixed, whereas the policy of the teacher in Meta Pseudo Labels is adaptive to enhance the student’s performance. To confirm the effect of Meta Pseudo Labels, we compare the method to Supervised Learning and Label Smoothing on CIFAR-10-4K and SVHN-1K. All models and settings are the same as in Section 3.2, except that we do not use RandAugment and we restrict the unlabeled data to the same set of labeled data. We choose CIFAR-10-4K and SVHN-1K for this experiment because Label Smoothing is typically already used in ImageNet models. The results are shown in Table 9. As can be seen from the table, Meta Pseudo Labels achieves 83.71% on CIFAR-10-4K and 91.89% on SVHN-1K. Both of these are significantly better than the accuracy obtained by supervised learning with and without Label Smoothing. This shows the importance of feedback in Meta Pseudo Labels.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

何についての記載か: Meta Pseudo Labels（MPL）の動作に関する詳細な分析を行い、その有効性や他の手法との比較を行っている。
要旨:
- MPLがImageNet-10%などの複数の画像分類ベンチマークにおいて優れた性能を発揮することを示している。
- MPLは、学生モデルに対する効果的な正則化手法として機能し、特にラベル付きデータが少ない環境で有効である。
- MPLは、Label Smoothingの適応的な形式として解釈でき、通常のLabel Smoothingに比べて効果が高いことを実証している。

②改良点・工夫点・テクニック等の記載

MPLを使用することで、教師モデルが適応的に学生モデルにフィードバックを提供し、Label Smoothingのような固定された正則化手法よりも効果的な学習が可能になる点が工夫として挙げられる。

③性能が向上した記載

ImageNet-10%において、MPLを使用することで、純粋な教師モデルを使用したRandAugmentよりも優れた性能を示している。
UDA+MPLは、UDA単独に比べて5.25%のTop-1精度の向上を達成している。また、UDAはSupervised+MPL+Finetuneよりも6.05%の向上を示している【図3】。
CIFAR-10-4Kで83.71%、SVHN-1Kで91.89%の精度を達成しており、これらの結果はLabel Smoothingや通常の教師付き学習よりも高い【表9】。

④メカニズムの解明・なぜそうなるのか等の記載

MPLは、学生モデルに対するフィードバックを通じて、教師モデルが適応的に学習を進めるため、Label Smoothingのような固定された正則化手法よりも効果的な学習が可能になる。これにより、特にデータが少ない環境での性能が向上する。

⑤具体的な処理方法の記載

MPLでは、教師モデルが適応的に学生モデルにラベルを生成し、それを用いて学習を進める。この手法は、Label Smoothingと類似しているが、固定された正則化手法ではなく、適応的なフィードバックを通じて学生モデルの性能を向上させる。
実験では、CIFAR-10-4KとSVHN-1Kを使用し、同じラベル付きデータセットをラベルなしデータとして扱う。

⑥比較の記載

手法	CIFAR-10-4K 精度	SVHN-1K 精度	説明
教師付き学習	不明	不明	通常の教師付き学習
Label Smoothing	不明	不明	固定された正則化手法
Meta Pseudo Labels	83.71%	91.89%	MPLによる適応的なフィードバック

【表9】

⑦上記⑥以外で表に出来そうな部分

特に無し。

⑧具体的な数値の記載

数値	説明
6.05%	UDAがSupervised+MPL+Finetuneより向上したTop-1精度
5.25%	UDA+MPLがUDA単独より向上したTop-1精度
83.71%	CIFAR-10-4KでのMeta Pseudo Labelsの精度
91.89%	SVHN-1KでのMeta Pseudo Labelsの精度

⑨具体的な変数名(symbol)の記載

特に無し。

⑩図/表があれば、各図/表は何を主張するためのものか

図3: ImageNet-10%における様々な手法の相対的な精度向上を示し、MPLがRandAugmentやUDAと比較して有効であることを視覚化している。

D.3. Meta Pseudo Labels Is a Mechanism to Addresses the Confirmation Bias of Pseudo Labels

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) D.3. Meta Pseudo Labels Is a Mechanism to Addresses the Confirmation Bias of Pseudo Labels In this section, we show empirical evidence that Meta Pseudo Labels helps to address the teacher’s confirmation bias [2 ] in Pseudo Labels. To this end, we analyze the training accuracy of the teacher and the student in Meta Pseudo Labels from our experiments for CIFAR-10-4K and ImageNet-10% in Section 3.2. In Figure 4, we plot the accuracy percentage at each training batch throughout the training process of a teacher and a student in Meta Pseudo Labels. We also plot the same data for a supervised model. From the figure, we have two observations: • On CIFAR-10-4K (Figure 4-Left), the student’s training accuracy in Meta Pseudo Labels is much lower that of the same network in Supervised Learning. As CIFAR-10-4K has very few labeled data, if the teacher converges quickly like in Supervised Learning, it will not generalize to the unlabeled data and hence will teach the student in inaccurate pseudo labels. In contrast, Figure 4-Left shows that both the teacher and student in Meta Pseudo Labels converge much slower. To see this, note that in Meta Pseudo Labels, the student’s training accuracy is measured by how much it agrees with the teacher’s pseudo labels. Therefore, the student in Meta Pseudo Labels having a lower training accuracy means that the student often disagrees with the pseudo labels that the teacher samples. This disagreement forces the teacher to constantly updates its weights to generate better pseudo labels, and makes it hard for the student to converge as the student has to learn from the teacher’s changing pseudo labels. This behavior prevents both the teacher and the student from the premature convergence that causes the confirmation bias in Supervised Learning and Pseudo Labels. • On ImageNet-10% (Figure 4-Right), the student also disagrees with the teacher’s pseudo labels, as shown in the student’s low training accuracy. Additionally, we observe that the teacher’s training accuracy surges up faster than the supervised model’s accuracy. We suspect that this is beneficial for the student learning, since ImageNet has 1,000 classes so in order to effectively teach the student to do well on the labeled dataset, the teacher has to become more accurate. Therefore, the feedback from the student is beneficial for the teacher’s learn as well. This trend of high training accuracy only changes at the end of the training procedure, where the training accuracy of Supervised Learning surpasses those of the teacher and the student in Meta Pseudo Labels. From this last sign, we suspect that the supervised model has overfitted to the small set of labeled training examples in ImageNet-10%, which will causes the confirmation bias if this supervised model is used to generate pseudo labels for another student model to learn from.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

何についての記載か: Meta Pseudo Labels（MPL）が擬似ラベルの確認バイアス（confirmation bias）を解決するメカニズムについての記載。
要旨:
- 擬似ラベルの教師モデルの確認バイアスをMPLがどのように軽減するか、CIFAR-10-4KおよびImageNet-10%の実験結果を用いて説明している。
- MPLでは、教師と学生のモデルが訓練中に異なるため、確認バイアスの原因となる早期収束が回避され、モデルがより良い擬似ラベルを生成できる。

②改良点・工夫点・テクニック等の記載

MPLでは、学生モデルが教師モデルの擬似ラベルに常に同意しないことが重要であり、これにより教師モデルは擬似ラベルの品質を継続的に改善することが求められる。この点が、確認バイアスの軽減につながる工夫である。

③性能が向上した記載

CIFAR-10-4Kでは、MPLを使用した学生モデルの訓練精度が教師付き学習モデルよりも低く、これは擬似ラベルの質の向上に寄与している。
ImageNet-10%では、教師モデルの訓練精度が非常に早く向上し、これが学生モデルの学習に役立っている【図4】。

④メカニズムの解明・なぜそうなるのか等の記載

MPLでは、学生モデルが教師モデルの擬似ラベルに対して常に異議を唱えることで、教師モデルが擬似ラベルの精度を改善し続ける必要がある。このプロセスにより、モデルの早期収束と確認バイアスが回避され、モデルが一般化しやすくなる。特にCIFAR-10-4Kのようにラベル付きデータが少ない場合、教師モデルが早期に収束すると、誤った擬似ラベルが生成されやすくなるため、MPLによる遅い収束が有効である。

⑤具体的な処理方法の記載

MPLでは、学生モデルの訓練精度は、教師モデルが生成する擬似ラベルにどれだけ一致するかで測定される。この不一致が、教師モデルの重みの更新を促し、より良い擬似ラベルを生成する。
CIFAR-10-4KとImageNet-10%での実験では、訓練バッチごとの教師と学生のモデルの精度をプロットし、確認バイアスの回避メカニズムを視覚的に示している。

⑥比較の記載

データセット	モデル	訓練精度	説明
CIFAR-10-4K	MPL（学生）	低い	MPLにより擬似ラベルの不一致が継続的に改善される
CIFAR-10-4K	教師付き学習	高い	早期収束による確認バイアスの影響
ImageNet-10%	MPL（学生）	低い	学生モデルが擬似ラベルに同意せず、教師モデルが改善を続ける
ImageNet-10%	教師付き学習	高い	ラベル付きデータのみに過度に適合している可能性がある

【図4】

⑦上記⑥以外で表に出来そうな部分

特に無し。

⑧具体的な数値の記載

数値	説明
CIFAR-10-4K（左図）	MPLの学生モデルの訓練精度が低い
ImageNet-10%（右図）	教師モデルの精度が早期に上昇

⑨具体的な変数名(symbol)の記載

特に無し。

⑩図/表があれば、各図/表は何を主張するためのものか

図4: CIFAR-10-4KとImageNet-10%における、MPLおよび教師付き学習の訓練バッチごとの精度を視覚化。これにより、MPLが擬似ラベルの確認バイアスを軽減するメカニズムが示されている。

D.4. Meta Pseudo Labels with Different Training Techniques for the Teacher

D.5. Meta Pseudo Labels with Different Amounts of Labeled Data

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) D.4. Meta Pseudo Labels with Different Training Techniques for the Teacher In Sections 3 and Section 4, we have presented Meta Pseudo Labels results where the teacher is trained with UDA. In Table 10, we further show that on CIFAR-10-4K, Meta Pseudo Labels improves over different teachers trained with different techniques, including Pseudo Labels [36], Mixup [85], and RandAugment. These results indicate that Meta Pseudo Labels is effective with all techniques. Additionally, the results suggest that better training techniques for the teacher tend to result in better students. D.5. Meta Pseudo Labels with Different Amounts of Labeled Data We study how much Meta Pseudo Labels improves as more labeled data becomes available. To this end, we experiment with 10%, 20%, 40%, 80%, and 100% of the labeled examples in ImageNet. We compare Meta Pseudo Labels with supervised learning and RandAugment. We plot the results in Figure 5. From the figure, it can be seen that Meta Pseudo Labels delivers substantial gains with less data, but plateaus as more labeled data becomes available. This result suggests that Meta Pseudo Labels is more effective for low-resource image classification problems.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

何についての記載か: Meta Pseudo Labels（MPL）が異なる訓練手法やラベル付きデータの量に対してどのように効果を示すかについての記載。
要旨:
- MPLは、Pseudo Labels、Mixup、RandAugmentなどの異なる教師モデルの訓練手法に対しても有効であることを示している。
- MPLは、ラベル付きデータが少ない場合に大きな性能向上を示し、データ量が増えるにつれて効果が減少する。

②改良点・工夫点・テクニック等の記載

MPLは、異なる教師モデルの訓練手法（Pseudo Labels, Mixup, RandAugment）に対しても一貫して効果を発揮する点が改良点である。また、ラベル付きデータの利用量に応じて、MPLの効果を確認するための実験を実施し、少量データに対する有効性を示している。

③性能が向上した記載

CIFAR-10-4Kにおいて、MPLはPseudo Labels、Mixup、RandAugmentなどの教師モデルの訓練手法に対して性能向上を示している【表10】。
ImageNetにおいて、MPLはラベル付きデータが少ないほど大きな性能向上を示し、10%、20%、40%、80%、100%のデータ量での性能を比較した結果、少量データで特に有効であることが示されている【図5】。

④メカニズムの解明・なぜそうなるのか等の記載

MPLは、教師モデルが生成する擬似ラベルを学生モデルが学習するため、教師モデルの訓練手法が異なっても、その擬似ラベルの品質が学生モデルの性能に直接影響する。このため、より高度な訓練手法で教師モデルを訓練すれば、MPLの効果も向上する。また、ラベル付きデータが少ない場合には擬似ラベルの効果が大きく、MPLが特に有効である。

⑤具体的な処理方法の記載

CIFAR-10-4Kでは、異なる訓練手法（Pseudo Labels、Mixup、RandAugment）を用いて教師モデルを訓練し、MPLがそれぞれの手法に対して効果を示すかを検証している【表10】。
ImageNetでは、ラベル付きデータの量を10%、20%、40%、80%、100%と変化させ、MPLと教師付き学習およびRandAugmentの性能を比較している【図5】。

⑥比較の記載

訓練手法	MPLの性能向上	対応する図/表
Pseudo Labels	向上あり	表10
Mixup	向上あり	表10
RandAugment	向上あり	表10

ラベル付きデータ量	MPLの性能	対応する図/表
10%	大幅な性能向上	図5
20%	性能向上	図5
40%	性能向上	図5
80%	性能向上	図5
100%	増加が減少する	図5

⑦上記⑥以外で表に出来そうな部分

特に無し。

⑧具体的な数値の記載

数値	説明
CIFAR-10-4K	Pseudo Labels、Mixup、RandAugmentに対してMPLの性能が向上
ImageNet-10%	MPLが大幅な性能向上を示す
ImageNet-20%	MPLが性能向上を示す
ImageNet-40%	MPLが性能向上を示す
ImageNet-80%	MPLが性能向上を示す
ImageNet-100%	性能向上が減少する

⑨具体的な変数名(symbol)の記載

特に無し。

⑩図/表があれば、各図/表は何を主張するためのものか

表10: 異なる訓練手法を用いた教師モデルに対するMPLの効果を示しており、どの手法に対しても性能が向上することを示している。
図5: ImageNetにおけるラベル付きデータの割合に対するMPLの効果を示しており、少量データで特に大きな性能向上が見られることを視覚的に表している。

E. Results with An Economical Version of Meta Pseudo Labels

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) E. Results with An Economical Version of Meta Pseudo Labels Meta Pseudo Labels requires storing both the teacher model and the student model in memory. For model architectures with a large memory footprint, such as EfficientNet-L2 and EfficientNet-B6-Wide in our experiments, this memory footprint exceeds 16G of available memory in our accelerators. While we have implemented a hybrid data-model parallelism in Section 4 which allows us to run Meta Pseudo Labels with large model architectures, the tradeoff is a slow and expensive training procedure. To allow a more efficient training of large models with Meta Pseudo Labels, we design a more economical alternative to instantiate the teacher, termed Reduced Meta Pseudo Labels. In Reduced Meta Pseudo Labels, we first train a large teacher model T to convergence. Next, we use T to pre-compute all target distributions for the student’s training data. Importantly, until this step, the student model has not been loaded into memory, effectively avoiding the large memory footprint of Meta Pseudo Labels. Then, we parameterize a reduced teacher T ′ as a small and efficient network, such as a multi-layered perceptron (MLP), to be trained the along with student. This reduced teacher T ′ takes as input the distribution predicted by the large teacher T and outputs a calibrated distribution for the student to learn. Intuitively, Reduced Meta Pseudo Labels works reasonably well because the large teacher T is reasonably accurate, and hence many actions of the reduced teacher T ′ would be close to an identity map, which can be handled by an MLP. Meanwhile, Reduced Meta Pseudo Labels retains the benefit of Meta Pseudo Labels, as the teacher T ′ can still adapt to the learning state of the student θT . To evaluate whether Meta Pseudo Labels can scale to problems with a large number of labeled examples, we now turn to full labeled sets of CIFAR-10, SVHN and ImageNet. We use out-of-domain unlabeled data for CIFAR-10 and ImageNet. We experiment with Reduced Meta Pseudo Labels whose memory footprint allows our large-scale experiments. We show that the benefit of Meta Pseudo Labels, i.e., having a teacher that adapts to the student’s learning state throughout the student’s learning, stil extends to large datasets with more advanced architectures and out-of-domain unlabeled data. Model Architectures. For our student model, we use EfficinetNet-B0 for CIFAR-10 and SVHN, and use EfficientNet-B7 for ImageNet. Meanwhile, our teacher model is a small 5-layer perceptron, with ReLU activation, and with a hidden size of 128 units for CIFAR-10 and of 512 units for ImageNet. Labeled Data. Per standard practices, we reserve 4,000 examples of CIFAR-10, 7,300 examples from SVHN, and 40 data shards of ImageNet for hyper-parameter tuning. This leaves about 45,000 labeled examples for CIFAR-10, 65,000 labeled examples for SVHN, and 1.23 million labeled examples for ImageNet. As in Section 3.2, these labeled data serve as both the validation data for the student and the pre-training data for the teacher. Unlabeled Data. For CIFAR-10, our unlabeled data comes from the TinyImages dataset which has 80 million images [ 67]. For SVHN, we use the extra images that come with the standard training set of SVHN which has about 530,000 images. For ImageNet, our unlabeled data comes from the YFCC-100M dataset which has 100 million images [65 ]. To collect unlabeled data relevant to the tasks at hand, we use the pre-trained teacher to assign class distributions to images in TinyImages and YFCC-100M, and then keep K images with highest probabilities for each class. The values of K are 50,000 for CIFAR-10, 35,000 for SVHN, and 12,800 for ImageNet. Baselines. We compare Reduced Meta Pseudo Labels to NoisyStudent [77 ], because it can be directly compared to Reduced Meta Pseudo Labels. In fact, the only difference between NoisyStudent and Reduced Meta Pseudo Labels is that Reduced Meta Pseudo Labels has a teacher that adapts to the student’s learning state. Results. As presented in Table 11, Reduced Meta Pseudo Labels outperforms NoisyStudent on both CIFAR-10 and ImageNet, and is on-par with NoisyStudent on SVHN. In particular, on ImageNet, Meta Pseudo Labels with EfficientNet-B7 achieves a top-1 accuracy of 86.87%, which is 1.06% better than the strong baseline NoisyStudent. On CIFAR-10, Meta Pseudo Labels leads to an improvement of 0.34% in accuracy on NoisyStudent, marking a 19% error reduction. For SVHN, we suspect there are two reasons of why the gain of Reduced Meta Pseudo Labels is not significant. First, NoisyStudent already achieves a very high accuracy. Second, the unlabeled images are high-quality, which we know by manual inspection. Meanwhile, for many ImageNet categories, there are not sufficient images from YFCC100M, so we end up with low-quality or out-of-domain images. On such noisy data, Reduced Meta Pseudo Labels’s adaptive adjustment becomes more crucial for the student’s performance, leading to more significant gain.

▼生成AIの回答(ChatGPT-4o)
①何についての記載か? + 要旨は何?

何についての記載か: Meta Pseudo Labels (MPL) のメモリ効率を改善する「Reduced Meta Pseudo Labels」についての記載。
要旨:
- MPLを大規模なモデルで効率的に動作させるために、Reduced Meta Pseudo Labelsが提案されている。
- Reduced Meta Pseudo Labelsは、事前に大規模な教師モデルを訓練し、その出力を元に、メモリ効率の高い小型の教師モデルを使うことで、MPLのメリットを維持しつつ、メモリ使用量を大幅に削減する。
- CIFAR-10、SVHN、ImageNetといった大規模データセットで、Reduced Meta Pseudo Labelsの効果が検証され、特にImageNetではNoisyStudentを上回る性能を示している。

②改良点・工夫点・テクニック等の記載

Reduced Meta Pseudo Labelsの主な改良点は、メモリ消費を削減するために、以下の2段階プロセスが導入された点である:
1. 大規模な教師モデルを事前に訓練し、その結果を使って学生モデルのトレーニングデータのターゲット分布を事前に計算する。
2. メモリ効率の高い小型の教師モデル（多層パーセプトロンなど）を使い、事前に計算されたターゲット分布を調整しながら学生モデルを訓練する。
このアプローチにより、MPLのメリットを保ちながら、大規模モデルを効率的に訓練できる。

③性能が向上した記載

Reduced Meta Pseudo Labelsは、CIFAR-10およびImageNetにおいてNoisyStudentを上回る性能を示している【表11】。
- ImageNetでのトップ1精度は86.87%で、NoisyStudentを1.06%上回る【表11】。
- CIFAR-10ではNoisyStudentよりも0.34%の精度向上を達成し、誤差率の19%削減を記録している【表11】。
- SVHNでは、NoisyStudentとほぼ同等の性能を示している【表11】。

④メカニズムの解明・なぜそうなるのか等の記載

Reduced Meta Pseudo Labelsが効果的である理由は、大規模な教師モデルが精度の高い予測分布を生成し、それをもとに小型の教師モデルが調整を行うことで、学生モデルの性能が向上するためである。
特に、ImageNetのようなノイズの多いデータセットでは、Reduced Meta Pseudo Labelsの適応的な調整が効果を発揮し、NoisyStudentよりも大きな性能向上をもたらす。

⑤具体的な処理方法の記載

Reduced Meta Pseudo Labelsのプロセス:
1. まず、大規模な教師モデル $ T $ を完全に訓練し、学生モデルのトレーニングデータに対するターゲット分布を事前計算する。この段階では、学生モデルはメモリにロードされないため、MPLの大きなメモリ消費を回避する。
2. 次に、事前計算されたターゲット分布を基に、小型で効率的な教師モデル $ T' $ （多層パーセプトロン）を用いて、学生モデルと同時に訓練する。この小型モデルは、大規模教師モデルの分布を調整しながら学生モデルに学習させる。
データセットの使用:
- CIFAR-10: TinyImagesデータセットからラベルなしデータを使用し、教師モデルで割り当てられたクラス分布に基づいて、50,000枚の画像を選択。
- SVHN: SVHN標準トレーニングセットからの追加画像（約530,000枚）を使用。
- ImageNet: YFCC-100Mデータセットから12,800枚のラベルなしデータを使用。
実験モデル:
- 学生モデル: CIFAR-10とSVHNにEfficientNet-B0、ImageNetにEfficientNet-B7を使用。
- 教師モデル: CIFAR-10には128ユニットの5層パーセプトロン、ImageNetには512ユニットの5層パーセプトロンを使用。

▼生成AIの回答(ChatGPT-4o)
⑥比較の記載があれば違いを表でまとめて下さい。(対応する図/表番号があれば明記)

以下に、Reduced Meta Pseudo LabelsとNoisyStudentの比較を表形式でまとめる。

データセット	Reduced Meta Pseudo Labels (精度)	NoisyStudent (精度)	比較結果	図/表番号
CIFAR-10	86.87%	85.81%	Reduced Meta Pseudo Labelsが1.06%向上	表11
ImageNet	0.34%	---	Reduced Meta Pseudo Labelsが19%誤差削減	表11
SVHN	同等の性能	---	SVHNでは性能差はほとんどなし	表11

⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。(対応する図/表番号があれば記載)

教師モデルと学生モデルの組み合わせ、および使用されたデータセットを表形式にまとめる。

データセット	教師モデル	学生モデル	ラベルありデータ数	ラベルなしデータ数	図/表番号
CIFAR-10	5層パーセプトロン (128ユニット)	EfficientNet-B0	45,000	50,000 (TinyImages)	なし
SVHN	5層パーセプトロン (128ユニット)	EfficientNet-B0	65,000	530,000 (SVHN Extra) (←間違い、35000)	なし
ImageNet	5層パーセプトロン (512ユニット)	EfficientNet-B7	1.23百万	12,800 (YFCC-100M)	なし

⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。(|数値|説明|の表へ)

数値	説明
86.87%	ImageNetにおけるReduced Meta Pseudo Labelsのトップ1精度【表11】
85.81%	ImageNetにおけるNoisyStudentのトップ1精度【表11】
0.34%	CIFAR-10におけるReduced Meta Pseudo LabelsとNoisyStudentの精度差【表11】
19%	CIFAR-10における誤差率削減（NoisyStudentとの比較）【表11】
50,000	CIFAR-10で使用されたラベルなしデータの数（TinyImagesから）
530,000 (←間違い、35000)	SVHNで使用されたラベルなしデータの数（SVHN Extraから）
12,800	ImageNetで使用されたラベルなしデータの数（YFCC-100Mから）

⑨具体的な変数名(symbol)の記載を全て列挙して、表にまとめて下さい。(|変数名|説明|の表へ)

変数名	説明
$ T $	事前に訓練された大規模な教師モデル
$ T' $	メモリ効率の高い小型の教師モデル（多層パーセプトロン）
$ \theta_T $	学生モデルの学習状態
$ K $	各クラスで保持されたラベルなしデータの数

⑩図/表があれば、各図/表は何を主張するためのものか(掲載理由・注目ポイント等)を説明下さい。

表11: Reduced Meta Pseudo LabelsとNoisyStudentの性能比較結果を示しており、ImageNetやCIFAR-10における性能向上が視覚的に確認できる。特に、ImageNetでの1.06%の精度向上が注目点である。

あとがき

所感

機械学習についてある程度知っていないと、生成AIに質問しても、その回答内容だけでは理解できないと思います。生成AIの回答は、論文の内容をあまり変えずに、要点をそのまま掻い摘んで回答するような形になっています。(良い意味で論文の要点をそのまま抜き出した形)

分かったこと

分かったこと・注目した点を列挙すると:

REINFORCEアルゴリズム？(別の文献[75]を適用したもの)…よく分からず
期待値に対する勾配の計算においては、REINFORCEアルゴリズムを使用して、教師モデルのパラメータがラベルに依存する部分の勾配を計算している。
生徒モデルの勾配 $gs(y^u)$ は教師モデルのパラメータ $θ_T$ には依存しない。ただし、擬似ラベル $y^u$ を通じてのみ間接的に依存する。このため、REINFORCEアルゴリズムを適用する。

モンテカルロ近似を用いる？…よく分からず
モンテカルロ近似を用いて勾配を計算し、生徒モデルのパラメータを用いて教師モデルのパラメータ更新を行っている。
擬似ラベル $y^u$ をサンプリングすることで、ラベルのないデータに対するリスクを近似している。これにより、計算負荷を大幅に軽減しつつ、十分な精度で期待リスク（損失関数に基づいた期待値）の推定が可能になる。

Meta Pseudo Labels (MPL) 法では、教師モデルがUnsupervised Data Augmentation (UDA) の損失を含む拡張目的で学習される。
UDA 目標は教師モデルにのみ適用され、生徒モデルは擬似ラベルから学習する。
教師モデルにはUDA損失が追加されており、教師モデルが擬似ラベルの品質を改善するためのフィードバックを受け取る。
教師モデルは次の勾配に基づいて更新される: 3つ全て単純に加算して最終的な勾配とする。

擬似ラベルに基づく勾配: g(t)_T
ラベル付きデータに基づく勾配: g(t)_T,supervised
UDA損失に基づく勾配: g(t)_T,UDA

RandAugmentのデータ拡張戦略を一部修正とは:
SVHNデータセットに対して、数字の6や9の回転による不正確なラベルの生成を防ぐため、RandAugmentから回転操作を削除する工夫を行った。
同様に、水平移動が数字3や8に対して誤った不変性を引き起こすことを防ぐために、水平移動操作も削除している。

勾配の計算において、ドット積ではなくコサイン距離を使用する。具体的には、Meta Pseudo Labelsプロセスの途中でhを計算する際、コサイン距離を使用してhを算出し、その結果を教師の勾配に掛ける。これにより、Meta Pseudo Labelsの数値的な安定性が向上した。
hの移動平均を導入し、g(t)Tの計算時に毎回hから移動平均を引くことで、hの分散を減少させた。
hの移動平均を使用することで、わずかにMeta Pseudo Labelsの性能が向上したが、その効果は限定的であった。

hの部分のドット積をコサイン距離に置き換える:
ドット積は、2つのベクトルの大きさ（ノルム）とベクトル間の角度を考慮するため、大きさが変わると値が大きく変動する。そのため、勾配の大きさに直接影響を与え、数値が不安定になる可能性がある。
コサイン距離は、2つのベクトル間の角度のみに基づいて類似度を計算するため、ベクトルの大きさには影響されない。これにより、勾配の符号（増加か減少か）は維持されつつ、勾配の大きさが安定する。

ハイパーパラメータのチューニングには128試行を行い、50,000ステップまでトレーニングを行った。

MPLは、生徒モデルに対する効果的な正則化手法として機能。特にラベル付きデータが少ない環境で有効である。

MPLでは、教師と学生のモデルが訓練中に異なるため、確認バイアスの原因となる早期収束が回避され、モデルがより良い擬似ラベルを生成できる。

Reduced Meta Pseudo Labelsのプロセス:
まず、大規模な教師モデル T を完全に訓練し、学生モデルのトレーニングデータに対するターゲット分布を事前計算する。この段階では、学生モデルはメモリにロードされないため、MPLの大きなメモリ消費を回避する。
次に、事前計算されたターゲット分布を基に、小型で効率的な教師モデル T′（多層パーセプトロン）を用いて、学生モデルと同時に訓練する。この小型モデルは、大規模教師モデルの分布を調整しながら学生モデルに学習させる。
小型で効率的な教師モデル T′: CIFAR-10には128ユニットの5層パーセプトロン、ImageNetには512ユニットの5層パーセプトロンを使用。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up