Deep Double Descent: Where Bigger Models and More Data Hurt [7 SAMPLE-WISE NON-MONOTONICITY] [Paper, DeepL translation]

Posted at 2020-06-14. Last updated at 2020-06-14.

This article is more or less a memo for myself.
Most of it is DeepL output.
If you find any mistakes, I would appreciate it if you pointed them out.

Source
[Deep Double Descent: Where Bigger Models and More Data Hurt](https://arxiv.org/abs/1912.02292)

Previous: [6 EPOCH-WISE DOUBLE DESCENT]
Next: [8 CONCLUSION AND DISCUSSION]

7 SAMPLE-WISE NON-MONOTONICITY

Translation

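In this section, we examine the effect of varying the number of training samples for a fixed model and training procedure. So far, for model-wise and epoch-wise double descent, we explored behavior in the critical regime, where $\mathrm{EMC} _ {\mathcal{D},\epsilon} (\mathcal{T}) \approx n$, by varying the EMC. Here, we explore the critical regime by varying the number of training samples $n$ instead. By increasing $n$, the same training procedure $\mathcal{T}$ can switch from being effectively over-parameterized to being effectively under-parameterized.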
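As a reminder for myself, the critical-regime condition above relies on the definition of effective model complexity (EMC) given earlier in the paper (Section 3, as far as I recall). It is written out here only for reference; $\mathrm{Error}_S(M)$ denotes the mean error of model $M$ on the training samples $S$.

$$
\mathrm{EMC} _ {\mathcal{D},\epsilon} (\mathcal{T}) := \max \left\{ n \;\middle|\; \mathbb{E} _ {S \sim \mathcal{D}^n} \left[ \mathrm{Error} _ S (\mathcal{T}(S)) \right] \le \epsilon \right\}
$$

Read this way, the EMC is the largest number of samples that $\mathcal{T}$ can fit to within training error $\epsilon$ on average. Holding $\mathcal{T}$ fixed and increasing $n$ therefore moves the setup from $n < \mathrm{EMC}$ (effectively over-parameterized), through $n \approx \mathrm{EMC}$ (critical), to $n > \mathrm{EMC}$ (effectively under-parameterized).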

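We show that increasing the number of samples has two distinct effects on the graph of test error versus model complexity. On the one hand, (as expected) increasing the number of samples shrinks the area under the curve. On the other hand, it also "shifts the curve to the right", increasing the model complexity at which test error peaks.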

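These twin effects are shown in Figure 11a. Note that there is a range of model sizes where the effects "cancel out", so that having 4× more training samples does not improve test performance when training to completion. Outside the critically-parameterized regime, for sufficiently under- or over-parameterized models, having more samples helps. This phenomenon is corroborated by Figure 12, which shows test error as a function of both model size and sample size, in the same setting as Figure 11a.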

In some settings, these two effects combine to yield a regime of model sizes where more data actually hurts test performance, as in Figure 3 (see also Figure 11b). Note that this phenomenon is not unique to DNNs: more data can hurt even for linear models (see Appendix D).
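To convince myself of the linear-model remark, here is a minimal sketch in Python. It is not the experiment from the paper's appendix: the setup (isotropic Gaussian inputs, a fixed unit-norm `beta`, minimum-norm least squares via `np.linalg.lstsq`, and the values of `d`, `noise_std`, and `trials`) is my own assumption, chosen only to illustrate sample-wise non-monotonicity. With the input dimension `d` fixed, the test error should peak when the number of samples `n` is close to `d`.

```python
import numpy as np


def min_norm_test_error(n, d=50, noise_std=0.5, trials=20, seed=0):
    """Average test MSE of minimum-norm least squares with n train samples.

    Assumed toy setup (not from the paper): x ~ N(0, I_d),
    y = x @ beta + Gaussian noise, with a fixed unit-norm beta.
    """
    rng = np.random.default_rng(seed)
    beta = np.ones(d) / np.sqrt(d)  # ground-truth weights, norm 1
    errors = []
    for _ in range(trials):
        X = rng.standard_normal((n, d))
        y = X @ beta + noise_std * rng.standard_normal(n)
        # lstsq returns the minimum-norm solution when n < d
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        # For isotropic inputs, the expected squared error on a fresh
        # sample is ||beta_hat - beta||^2 plus the noise variance.
        errors.append(np.sum((beta_hat - beta) ** 2) + noise_std ** 2)
    return float(np.mean(errors))


sample_sizes = [10, 25, 40, 50, 60, 100, 200]
errors = {n: min_norm_test_error(n) for n in sample_sizes}
for n, err in errors.items():
    print(f"n={n:4d}  test MSE={err:.3f}")
```

In this toy setting, going from `n=25` to `n=50` samples makes the test error worse, and only well past `n = d` does more data help again. The peak sits at the interpolation threshold `n ≈ d`, which plays the role of the critical regime here.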

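(a) Model-wise double descent for 5-layer CNNs on CIFAR-10, for varying dataset sizes. Top: there is a range of model sizes (shaded green) where training on 2× more samples does not improve test error. Bottom: there is a range of model sizes (shaded red) where training on 4× more samples does not improve test error.
(b) Sample-wise non-monotonicity. Test loss (per-word perplexity) as a function of the number of training samples, for two transformer models trained to completion on IWSLT’14. For both model sizes, there is a regime where more samples hurt performance. Compare with Figure 3, which shows model-wise double descent in the identical setting.
Figure 11: Sample-wise non-monotonicity.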

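Figure 12: Left: test error as a function of model size and number of training samples, for 5-layer CNNs on CIFAR-10 + $20\%$ noise. Note that the ridge of high test error again lies along the interpolation threshold. Right: three slices of the left plot, showing the effect of more data for models of different sizes. Note that, when training to completion, more data helps for small and large models, but does not help for near-critically-parameterized models (green).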

Original text

In this section, we investigate the effect of varying the number of train samples, for a fixed model and training procedure. Previously, in model-wise and epoch-wise double descent, we explored behavior in the critical regime, where $\mathrm{EMC} _ {\mathcal{D},\epsilon} (\mathcal{T}) \approx n$, by varying the EMC. Here, we explore the critical regime by varying the number of train samples $n$. By increasing $n$, the same training procedure $\mathcal{T}$ can switch from being effectively over-parameterized to effectively under-parameterized.

We show that increasing the number of samples has two different effects on the test error vs. model complexity graph. On the one hand, (as expected) increasing the number of samples shrinks the area under the curve. On the other hand, increasing the number of samples also has the effect of “shifting the curve to the right” and increasing the model complexity at which test error peaks.

These twin effects are shown in Figure 11a. Note that there is a range of model sizes where the effects “cancel out”, and having 4× more train samples does not help test performance when training to completion. Outside the critically-parameterized regime, for sufficiently under- or over-parameterized models, having more samples helps. This phenomenon is corroborated in Figure 12, which shows test error as a function of both model and sample size, in the same setting as Figure 11a.

In some settings, these two effects combine to yield a regime of model sizes where more data actually hurts test performance as in Figure 3 (see also Figure 11b). Note that this phenomenon is not unique to DNNs: more data can hurt even for linear models (see Appendix D).

(a) Model-wise double descent for 5-layer CNNs on CIFAR-10, for varying dataset sizes. Top: There is a range of model sizes (shaded green) where training on 2× more samples does not improve test error. Bottom: There is a range of model sizes (shaded red) where training on 4× more samples does not improve test error.
(b) Sample-wise non-monotonicity. Test loss (per-word perplexity) as a function of number of train samples, for two transformer models trained to completion on IWSLT’14. For both model sizes, there is a regime where more samples hurt performance. Compare to Figure 3, of model-wise double-descent in the identical setting.
Figure 11: Sample-wise non-monotonicity.

Figure 12: Left: Test error as a function of model size and number of train samples, for 5-layer CNNs on CIFAR-10 + $20\%$ noise. Note the ridge of high test error again lies along the interpolation threshold. Right: Three slices of the left plot, showing the effect of more data for models of different sizes. Note that, when training to completion, more data helps for small and large models, but does not help for near-critically-parameterized models (green).
