
Deep Double Descent: Where Bigger Models and More Data Hurt [2 OUR RESULTS] [Paper, DeepL translation]

Last updated at 2020-06-13, posted at 2020-06-10

This article is more or less a memo for myself.
Most of it is DeepL output.
If you find any mistakes, I would be glad if you pointed them out.

Translation source
[Deep Double Descent: Where Bigger Models and More Data Hurt](https://arxiv.org/abs/1912.02292)

2 OUR RESULTS

Translation

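To state the hypothesis more precisely, we define the notion of effective model complexity. A training procedure $\mathcal{T}$ is defined as any procedure that takes a set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ of labeled training samples as input and outputs a classifier $\mathcal{T}(S)$ that maps data to labels. The effective model complexity of $\mathcal{T}$ (w.r.t. a distribution $\mathcal{D}$) is defined as the maximum number of samples $n$ on which $\mathcal{T}$ achieves, on average, $\approx 0$ training error.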

Definition 1 (Effective Model Complexity) The Effective Model Complexity (EMC) of a training procedure $\mathcal{T}$, with respect to a distribution $\mathcal{D}$ and a parameter $\epsilon > 0$, is defined as follows.

$$ \mathrm{EMC} _ {\mathcal{D},\epsilon} (\mathcal{T}) := \max \{ n \mid \mathbb{E} _ {S \sim \mathcal{D}^n} [\mathrm{Error} _ S(\mathcal{T}(S))] \leq \epsilon \} $$

Here, $\mathrm{Error} _ S(M)$ is the mean error of the model $M$ on the training samples $S$.
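As a rough illustration of Definition 1, the sketch below estimates EMC by brute force: for each candidate $n$ it averages the training error over a few freshly drawn sample sets and keeps the largest $n$ whose average stays at or below $\epsilon$. Everything here is hypothetical and not taken from the paper: the names `estimate_emc`, `draw_samples`, and `make_memorizer`, the toy distribution, and the capacity-limited "memorizer" that stands in for a neural-network training procedure. The maximum is also taken over a finite grid of candidates only, and the expectation is replaced by a Monte Carlo average.

```python
import random
from statistics import mean


def train_error(model, samples):
    """Mean 0/1 error of `model` on the samples it was trained on."""
    return mean(1.0 if model(x) != y else 0.0 for x, y in samples)


def estimate_emc(train, draw_samples, candidate_ns, eps=0.1, trials=5, seed=0):
    """Largest candidate n whose average training error is <= eps."""
    rng = random.Random(seed)
    best = 0
    for n in candidate_ns:
        errors = []
        for _ in range(trials):
            samples = draw_samples(n, rng)  # S ~ D^n
            errors.append(train_error(train(samples), samples))
        if mean(errors) <= eps:
            best = max(best, n)
    return best


def draw_samples(n, rng):
    """Toy distribution (assumption): distinct integer inputs, label in 0..9."""
    xs = rng.sample(range(1_000_000), n)
    return [(x, (x * 7919) % 10) for x in xs]


def make_memorizer(capacity):
    """Toy training procedure (assumption): memorize at most `capacity` samples."""
    def train(samples):
        table = dict(samples[:capacity])
        # -1 is never a valid label, so every unmemorized point is an error.
        return lambda x: table.get(x, -1)
    return train


emc = estimate_emc(make_memorizer(50), draw_samples, range(1, 101))
print(emc)  # 55 for this toy procedure
```

With a capacity of 50 the training error at size $n > 50$ is exactly $(n - 50)/n$, which stays at or below $0.1$ up to $n = 55$. The example is meant to show that EMC is a property of the whole training procedure and the distribution, not of a parameter count.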

The main hypothesis can be stated informally as follows.

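Hypothesis 1 (Generalized Double Descent hypothesis, informal) For any natural data distribution $\mathcal{D}$, neural-network-based training procedure $\mathcal{T}$, and small $\epsilon > 0$, consider the task of predicting labels based on $n$ samples from $\mathcal{D}$. Then the following holds.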

Under-parameterized regime. If $\mathrm{EMC} _ {\mathcal{D},\epsilon} (\mathcal{T})$ is sufficiently smaller than $n$, any perturbation of $\mathcal{T}$ that increases its effective complexity decreases the test error.

Over-parameterized regime. If $\mathrm{EMC} _ {\mathcal{D},\epsilon} (\mathcal{T})$ is sufficiently larger than $n$, any perturbation of $\mathcal{T}$ that increases its effective complexity decreases the test error.

Critically parameterized regime. If $\mathrm{EMC} _ {\mathcal{D},\epsilon} (\mathcal{T}) \approx n$, a perturbation of $\mathcal{T}$ that increases its effective complexity may either decrease or increase the test error.
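The three regimes can be read as a comparison between EMC and $n$. The sketch below makes that comparison explicit. Note that the hypothesis gives no formal meaning to "sufficiently smaller" or "sufficiently larger", so the relative width `rel_width` used here is an arbitrary placeholder, and the function name `regime` is likewise hypothetical.

```python
def regime(emc, n, rel_width=0.2):
    """Label the regime from EMC and the number of training samples n.

    `rel_width` is an assumed half-width of the critical interval,
    expressed as a fraction of n; the paper does not specify one.
    """
    if emc < (1 - rel_width) * n:
        return "under-parameterized"
    if emc > (1 + rel_width) * n:
        return "over-parameterized"
    return "critically parameterized"


print(regime(55, 1000))    # under-parameterized
print(regime(1000, 1000))  # critically parameterized
print(regime(5000, 1000))  # over-parameterized
```

Under the hypothesis, only the middle label is the one where increasing effective complexity may hurt test error.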

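Hypothesis 1 is informal in several respects. We have no principled way of choosing the parameter $\epsilon$ (and currently use $\epsilon = 0.1$ as a heuristic). We also do not yet have a formal specification of "sufficiently smaller" and "sufficiently larger". Our experiments suggest that there is a critical interval around the interpolation threshold, where $\mathrm{EMC} _ {\mathcal{D},\epsilon} (\mathcal{T}) = n$: below and above this interval, increasing complexity helps performance, whereas inside it, increasing complexity may hurt performance. The width of the critical interval depends on both the distribution and the training procedure, in ways we do not yet fully understand.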

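We believe that Hypothesis 1 sheds light on the interaction between optimization algorithms, model size, and test performance, and helps reconcile some of the competing intuitions about them. The main result of this paper is an experimental validation of Hypothesis 1 under a variety of settings: we consider several natural choices of datasets, architectures, and optimization algorithms, and we move the "interpolation threshold" by varying the number of model parameters, the length of training, the amount of label noise in the distribution, and the number of training samples.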

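Model-wise Double Descent. In Section 5, we study the test error of models of increasing size, for a fixed, large number of optimization steps. We show that "model-wise double-descent" occurs across various modern datasets (CIFAR-10, CIFAR-100, IWSLT'14 de-en, with varying amounts of label noise), model architectures (CNNs, ResNets, Transformers), optimizers (SGD, Adam), numbers of training samples, and training procedures (data augmentation and regularization). Moreover, the peak in test error occurs systematically at the interpolation threshold. In particular, we demonstrate realistic settings in which bigger models are worse.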

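Epoch-wise Double Descent. In Section 6, we study the test error of a fixed, large architecture over the course of training. In settings similar to those above, we demonstrate a corresponding peak in test error when models are trained just long enough to reach $\approx 0$ training error. The test error of a large model first decreases (at the beginning of training), then increases (around the critical regime), and then decreases once more (at the end of training). In other words, training longer can correct overfitting.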
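The decrease, increase, decrease pattern described above can be checked mechanically on a recorded test-error curve. The following is a minimal sketch; the helper names `descent_phases` and `is_double_descent` are hypothetical, and the curve is synthetic, made up purely to show the shape (it is not data from the paper). Real curves are noisy, so some smoothing would likely be needed before applying a check like this.

```python
def descent_phases(errors):
    """Collapse a curve into its runs of direction, e.g. ['down', 'up', 'down']."""
    phases = []
    for prev, cur in zip(errors, errors[1:]):
        if cur == prev:
            continue  # flat steps do not start a new phase
        direction = "down" if cur < prev else "up"
        if not phases or phases[-1] != direction:
            phases.append(direction)
    return phases


def is_double_descent(errors):
    """True when the curve goes down, then up, then down again."""
    return descent_phases(errors) == ["down", "up", "down"]


# Synthetic test error per epoch (illustrative values only).
curve = [0.60, 0.35, 0.25, 0.30, 0.38, 0.33, 0.24, 0.20]
print(descent_phases(curve))     # ['down', 'up', 'down']
print(is_double_descent(curve))  # True
```

The same check applies to a model-wise curve if the list is indexed by model size instead of epoch.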

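Sample-wise Non-monotonicity. In Section 7, we study the test error of a fixed model and training procedure as the number of training samples varies. Consistent with our generalized double-descent hypothesis, we observe distinct test behavior in the "critical regime", where the number of samples is close to the maximum the model can fit. This often appears as a long plateau region, in which taking significantly more data might not help when training to completion (as is the case for CNNs on CIFAR-10). Moreover, we show settings (Transformers on IWSLT'14 en-de) in which this appears as a peak, where, for a fixed architecture and training procedure, more data actually hurts.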

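Remarks on Label Noise. We observe all forms of double descent most strongly in settings where the training set contains label noise (as is often the case when collecting training data in the real world). However, we also show several realistic settings that exhibit a peak in test error even without label noise: ResNets (Figure 4a) and CNNs (Figure 20) on CIFAR-100, and Transformers on IWSLT'14 (Figure 8). Furthermore, all of our experiments show clearly different test behavior in the critical regime: in the noiseless case the test error often appears as a "plateau", which develops into a peak when label noise is added. See Section 8 for further discussion.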
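To make "label noise" concrete, here is a small sketch of one way to inject it into a training set. It assumes a noise model in which each label is replaced, with probability `p`, by a uniformly random different class, and in which the noise is drawn once rather than per epoch. This section does not spell out the noise model, so treat that choice, and the function name `add_label_noise`, as assumptions.

```python
import random


def add_label_noise(labels, num_classes, p, seed=0):
    """Return a copy of `labels` where each entry is flipped with probability p.

    A flipped label is drawn uniformly from the other num_classes - 1 classes.
    """
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < p:
            wrong = rng.randrange(num_classes - 1)
            # Shift past the true class so the new label is never equal to y.
            noisy.append(wrong if wrong < y else wrong + 1)
        else:
            noisy.append(y)
    return noisy


clean = [i % 10 for i in range(1000)]
noisy = add_label_noise(clean, num_classes=10, p=0.2)
flipped = sum(a != b for a, b in zip(clean, noisy))
print(flipped / len(clean))  # empirical noise rate, close to p
```

Fixing the seed keeps the corrupted labels identical across runs, which matters when comparing models of different sizes on the same noisy training set.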

Original text

To state our hypothesis more precisely, we define the notion of effective model complexity. We define a training procedure $\mathcal{T}$ to be any procedure that takes as input a set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ of labeled training samples and outputs a classifier $\mathcal{T}(S)$ mapping data to labels. We define the effective model complexity of $\mathcal{T}$ (w.r.t. distribution $\mathcal{D}$) to be the maximum number of samples $n$ on which $\mathcal{T}$ achieves on average $\approx 0$ training error.

Definition 1 (Effective Model Complexity) The Effective Model Complexity (EMC) of a training procedure $\mathcal{T}$, with respect to distribution $\mathcal{D}$ and parameter $\epsilon > 0$, is defined as:

$$ \mathrm{EMC} _ {\mathcal{D},\epsilon} (\mathcal{T}) := \max \{ n \mid \mathbb{E} _ {S \sim \mathcal{D}^n} [\mathrm{Error} _ S(\mathcal{T}(S))] \leq \epsilon \} $$

where $\mathrm{Error} _ S(M)$ is the mean error of model $M$ on train samples $S$.

Our main hypothesis can be informally stated as follows:

Hypothesis 1 (Generalized Double Descent hypothesis, informal) For any natural data distribution $\mathcal{D}$, neural-network-based training procedure $\mathcal{T}$, and small $\epsilon > 0$, if we consider the task of predicting labels based on $n$ samples from $\mathcal{D}$ then:

Under-parameterized regime. If $\mathrm{EMC} _ {\mathcal{D},\epsilon} (\mathcal{T})$ is sufficiently smaller than $n$, any perturbation of $\mathcal{T}$ that increases its effective complexity will decrease the test error.

Over-parameterized regime. If $\mathrm{EMC} _ {\mathcal{D},\epsilon} (\mathcal{T})$ is sufficiently larger than $n$, any perturbation of $\mathcal{T}$ that increases its effective complexity will decrease the test error.

Critically parameterized regime. If $\mathrm{EMC} _ {\mathcal{D},\epsilon} (\mathcal{T}) \approx n$, then a perturbation of $\mathcal{T}$ that increases its effective complexity might decrease or increase the test error.

Hypothesis 1 is informal in several ways. We do not have a principled way to choose the parameter $\epsilon$ (and currently heuristically use $\epsilon = 0.1$). We also are yet to have a formal specification for "sufficiently smaller" and "sufficiently larger". Our experiments suggest that there is a critical interval around the interpolation threshold when $\mathrm{EMC} _ {\mathcal{D},\epsilon} (\mathcal{T}) = n$: below and above this interval increasing complexity helps performance, while within this interval it may hurt performance. The width of the critical interval depends on both the distribution and the training procedure in ways we do not yet completely understand.

We believe Hypothesis 1 sheds light on the interaction between optimization algorithms, model size, and test performance and helps reconcile some of the competing intuitions about them. The main result of this paper is an experimental validation of Hypothesis 1 under a variety of settings, where we considered several natural choices of datasets, architectures, and optimization algorithms, and we changed the "interpolation threshold" by varying the number of model parameters, the length of training, the amount of label noise in the distribution, and the number of train samples.

Model-wise Double Descent. In Section 5, we study the test error of models of increasing size, for a fixed large number of optimization steps. We show that "model-wise double-descent" occurs for various modern datasets (CIFAR-10, CIFAR-100, IWSLT'14 de-en, with varying amounts of label noise), model architectures (CNNs, ResNets, Transformers), optimizers (SGD, Adam), number of train samples, and training procedures (data-augmentation, and regularization). Moreover, the peak in test error systematically occurs at the interpolation threshold. In particular, we demonstrate realistic settings in which bigger models are worse.

Epoch-wise Double Descent. In Section 6, we study the test error of a fixed, large architecture over the course of training. We demonstrate, in similar settings as above, a corresponding peak in test performance when models are trained just long enough to reach $\approx 0$ train error. The test error of a large model first decreases (at the beginning of training), then increases (around the critical regime), then decreases once more (at the end of training)—that is, training longer can correct overfitting.

Sample-wise Non-monotonicity. In Section 7, we study the test error of a fixed model and training procedure, for varying number of train samples. Consistent with our generalized double-descent hypothesis, we observe distinct test behavior in the "critical regime", when the number of samples is near the maximum that the model can fit. This often manifests as a long plateau region, in which taking significantly more data might not help when training to completion (as is the case for CNNs on CIFAR-10). Moreover, we show settings (Transformers on IWSLT'14 en-de), where this manifests as a peak, and for a fixed architecture and training procedure, more data actually hurts.

Remarks on Label Noise. We observe all forms of double descent most strongly in settings with label noise in the train set (as is often the case when collecting train data in the real-world). However, we also show several realistic settings with a test-error peak even without label noise: ResNets (Figure 4a) and CNNs (Figure 20) on CIFAR-100; Transformers on IWSLT'14 (Figure 8). Moreover, all our experiments demonstrate distinctly different test behavior in the critical regime, often manifesting as a "plateau" in the test error in the noiseless case which develops into a peak with added label noise. See Section 8 for further discussion.
