
Deep Double Descent: Where Bigger Models and More Data Hurt 【4 EXPERIMENTAL SETUP】【Paper, DeepL Translation】

Posted at 2020-06-14, last updated at 2020-06-14

This article is more or less a memo for my own use.
Most of it is a DeepL translation.
I would appreciate it if you could point out any mistakes.

Source
[Deep Double Descent: Where Bigger Models and More Data Hurt](https://arxiv.org/abs/1912.02292)

4 EXPERIMENTAL SETUP

Translation

We briefly describe the experimental setup here; see Appendix B for full details. We consider three families of architectures: ResNets, standard CNNs, and Transformers.

ResNets: We parameterize a family of ResNet18s (He et al. (2016)) by scaling the width (number of filters) of the convolutional layers. Specifically, we use layer widths $[k, 2k, 4k, 8k]$ for varying $k$. The standard ResNet18 corresponds to $k = 64$.

Standard CNNs: We consider a simple family of 5-layer CNNs, with 4 convolutional layers of widths $[k, 2k, 4k, 8k]$ for varying $k$, followed by a fully-connected layer. For context, the CNN with width $k = 64$ reaches over $90\%$ test accuracy on CIFAR-10 with data augmentation.

Transformers: We consider the 6-layer encoder-decoder of Vaswani et al. (2017), as implemented by Ott et al. (2019). We scale the size of the network by changing the embedding dimension $d_{model}$ and setting the width of the fully-connected layers proportionally $(d_{ff} = 4 \cdot d_{model})$.

For ResNets and CNNs, we train with cross-entropy loss and the following optimizers: (1) Adam with learning rate 0.0001 for 4K epochs; (2) SGD with learning rate $\propto \frac{1}{\sqrt{T}}$ for 500K gradient steps. Transformers are trained for 80K gradient steps, with $10\%$ label smoothing and no dropout.
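As a rough illustration of the width parameterization described above, the sketch below lists the convolutional widths $[k, 2k, 4k, 8k]$ for a given $k$ and counts the parameters of a 5-layer CNN built from them. The 3x3 kernels, 3 input channels, 10 output classes, per-channel biases, absence of normalization layers, and a fully-connected layer fed by exactly $8k$ features are all assumptions made for illustration (this section does not state them; Appendix B of the paper has the actual details), and the function names are hypothetical.

```python
def layer_widths(k):
    """Return the four convolutional widths [k, 2k, 4k, 8k] for width parameter k."""
    return [k, 2 * k, 4 * k, 8 * k]


def cnn_param_count(k, in_channels=3, kernel_size=3, num_classes=10):
    """Count weights and biases of a 4-conv + 1-FC network.

    Assumptions (not from the section): square kernels, one bias per output
    channel, no normalization layers, and spatial pooling down to 1x1 so the
    fully-connected layer sees exactly 8k features.
    """
    total = 0
    prev = in_channels
    for width in layer_widths(k):
        # Conv layer: (input channels * output channels * kernel area) + biases.
        total += prev * width * kernel_size * kernel_size + width
        prev = width
    # Fully-connected layer: weights plus biases.
    total += prev * num_classes + num_classes
    return total


if __name__ == "__main__":
    for k in (1, 8, 64):
        print(k, layer_widths(k), cnn_param_count(k))
```

Under these assumptions each convolution's parameter count is a product of two widths, so the total grows roughly quadratically in $k$; sweeping $k$ therefore covers a wide range of model sizes with a single knob.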

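Label Noise. In our experiments, label noise of probability $p$ refers to training on samples that have the correct label with probability $(1 - p)$ and a uniformly random incorrect label otherwise (label noise is sampled only once, not per epoch). Figure 1 plots test error on the noisy distribution, while the remaining figures plot test error with respect to the clean distribution (the two curves are just linear rescalings of one another).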
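The label-noise procedure described above can be sketched in a few lines of plain Python. This is a minimal sketch, not the authors' code: the function name `apply_label_noise`, the integer label encoding `0..num_classes-1`, and the fixed `seed` are assumptions for illustration.

```python
import random


def apply_label_noise(labels, p, num_classes, seed=0):
    """Return a copy of labels where each entry is replaced, with probability p,
    by a label drawn uniformly from the num_classes - 1 incorrect classes."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < p:
            # Draw an offset in [1, num_classes - 1] and wrap around, so the
            # new label is uniform over the incorrect classes and never y.
            y = (y + rng.randrange(1, num_classes)) % num_classes
        noisy.append(y)
    return noisy


if __name__ == "__main__":
    clean = [i % 10 for i in range(20)]
    # Sample the noise once, before training, and reuse it for every epoch.
    noisy = apply_label_noise(clean, p=0.2, num_classes=10, seed=0)
    changed = sum(a != b for a, b in zip(clean, noisy))
    print(changed, "of", len(clean), "labels changed")
```

Calling the function once and storing the result matches the "sampled only once" condition; re-running it inside the training loop with a fresh seed would instead resample the noise every epoch.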

Original text

We briefly describe the experimental setup here; full details are in Appendix B. We consider three families of architectures: ResNets, standard CNNs, and Transformers. ResNets: We parameterize a family of ResNet18s (He et al. (2016)) by scaling the width (number of filters) of convolutional layers. Specifically, we use layer widths $[k, 2k, 4k, 8k]$ for varying $k$. The standard ResNet18 corresponds to $k = 64$. Standard CNNs: We consider a simple family of 5-layer CNNs, with 4 convolutional layers of widths $[k, 2k, 4k, 8k]$ for varying $k$, and a fully-connected layer. For context, the CNN with width $k = 64$, can reach over $90\%$ test accuracy on CIFAR-10 with data-augmentation. Transformers: We consider the 6 layer encoder-decoder from Vaswani et al. (2017), as implemented by Ott et al. (2019). We scale the size of the network by modifying the embedding dimension $d_{model}$, and setting the width of the fully-connected layers proportionally $(d_{ff} = 4 \cdot d_{model})$. For ResNets and CNNs, we train with cross-entropy loss, and the following optimizers: (1) Adam with learning-rate 0.0001 for 4K epochs; (2) SGD with learning rate $\propto \frac{1}{\sqrt{T}}$ for 500K gradient steps. We train Transformers for 80K gradient steps, with $10\%$ label smoothing and no drop-out.
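The SGD setting above only states that the learning rate is proportional to $\frac{1}{\sqrt{T}}$. One minimal way to write such a schedule, assuming $T$ counts gradient steps starting from 0, is sketched below. The `base_lr` value and the `1 + step` offset (which avoids division by zero at the first step) are assumptions, not values given in this section, and `inverse_sqrt_lr` is a hypothetical name.

```python
import math


def inverse_sqrt_lr(step, base_lr=0.1):
    """Learning rate proportional to 1 / sqrt(T).

    base_lr and the +1 offset are illustrative assumptions; the section gives
    only the proportionality, not the constant.
    """
    return base_lr / math.sqrt(1 + step)


if __name__ == "__main__":
    for step in (0, 3, 99, 9999):
        print(step, inverse_sqrt_lr(step))
```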

Label Noise. In our experiments, label noise of probability $p$ refers to training on a samples which have the correct label with probability $(1 - p)$, and a uniformly random incorrect label otherwise (label noise is sampled only once and not per epoch). Figure 1 plots test error on the noisy distribution, while the remaining figures plot test error with respect to the clean distribution (the two curves are just linear rescaling of one another).
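The remark that the two curves are linear rescalings of one another can be checked with a short calculation. This is a sketch under assumptions of my own, not notation from the paper: there are $K$ classes, the classifier $f$ has test error $e$ on clean labels $y$, and the noisy test label $\tilde{y}$ is produced by the same noise process, independently of the input $x$ and of $f$. If $f$ is right about the clean label (probability $1 - e$), it disagrees with the noisy label exactly when the label was flipped (probability $p$). If $f$ is wrong (probability $e$), it agrees with the noisy label only when the flip happens to land on $f(x)$ (probability $\frac{p}{K - 1}$). Hence:

$$
\begin{aligned}
\Pr[f(x) \ne \tilde{y}] &= (1 - e)\,p + e\left(1 - \frac{p}{K - 1}\right) \\
&= p + e\left(1 - \frac{pK}{K - 1}\right)
\end{aligned}
$$

The noisy test error is therefore an affine function of the clean test error $e$. For example, with $K = 10$ and $p = 0.2$ it equals $0.2 + \frac{7}{9} e$.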
