
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks [4 VGG AND RESNET FOR CIFAR10] [Paper, DeepL Translation]

Last updated at 2020-07-05. Posted at 2020-07-05.

This article is more or less a note to myself.
Most of it is DeepL output.
I would appreciate it if you pointed out any mistakes.

Source
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

Previous: [3 WINNING TICKETS IN CONVOLUTIONAL NETWORKS]
Next: [5 DISCUSSION]

4 VGG AND RESNET FOR CIFAR10

Translation

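Here we study the lottery ticket hypothesis on networks that evoke the architectures and techniques used in practice. Specifically, we consider VGG-style deep convolutional networks (VGG-19 on CIFAR10, Simonyan & Zisserman (2014)) and residual networks (Resnet-18 on CIFAR10, He et al. (2016)).¹ These networks are trained with batch normalization, weight decay, decreasing learning rate schedules, and data augmentation. We continue to find winning tickets for all of these architectures; however, iterative pruning, our method for finding them, is sensitive to the particular learning rate used. In these experiments, rather than measuring early-stopping time (which, for these larger networks, is entangled with the learning rate schedule), we plot accuracy at several points during training to illustrate the relative rates at which accuracy improves.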

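Global pruning. On Lenet and Conv-2/4/6, each layer is pruned separately at the same rate. For Resnet-18 and VGG-19, we modify this strategy slightly: we prune these deeper networks globally, removing the lowest-magnitude weights collectively across all convolutional layers. In Appendix I.1, we find that global pruning identifies smaller winning tickets for Resnet-18 and VGG-19. Our conjectured explanation for this behavior is as follows: in these deeper networks, some layers have far more parameters than others. For example, the first two convolutional layers of VGG-19 have $1728$ and $36864$ parameters, while the last has $2.35$ million. When all layers are pruned at the same rate, the smaller layers become bottlenecks and prevent us from identifying the smallest possible winning tickets. Global pruning makes it possible to avoid this pitfall.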
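The contrast between layer-wise and global magnitude pruning can be illustrated with a minimal Python sketch. The toy layers, the function names (`layerwise_prune`, `global_prune`, `remaining`), and the 50% rate are assumptions made for illustration and are not taken from the paper. The sketch only masks weights; resetting the surviving weights to their initial values and retraining are left out.

```python
def layerwise_prune(layers, rate):
    """Zero the `rate` fraction of smallest-magnitude weights in each layer separately."""
    pruned = {}
    for name, weights in layers.items():
        k = int(len(weights) * rate)  # number of weights removed from this layer
        order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
        drop = set(order[:k])  # indices of the k smallest magnitudes
        pruned[name] = [0.0 if i in drop else w for i, w in enumerate(weights)]
    return pruned


def global_prune(layers, rate):
    """Zero the `rate` fraction of smallest-magnitude weights pooled over all layers."""
    pool = [(abs(w), name, i) for name, ws in layers.items() for i, w in enumerate(ws)]
    k = int(len(pool) * rate)  # number of weights removed from the whole network
    drop = {(name, i) for _, name, i in sorted(pool)[:k]}
    return {
        name: [0.0 if (name, i) in drop else w for i, w in enumerate(ws)]
        for name, ws in layers.items()
    }


def remaining(pruned):
    """Count surviving weights per layer (assumes no weight was exactly 0.0 before pruning)."""
    return {name: sum(1 for w in ws if w != 0.0) for name, ws in pruned.items()}


# Hypothetical toy network: a small layer with large weights, a large layer with many small ones.
layers = {
    "conv_small": [0.9, -0.8, 0.7, -0.6],
    "conv_large": [0.05, -0.04, 0.03, -0.02, 0.01, 0.5, -0.4, 0.3, 0.2, -0.1, 0.06, -0.07],
}

print(remaining(layerwise_prune(layers, 0.5)))
print(remaining(global_prune(layers, 0.5)))
```

The toy weights are chosen so that the small layer holds the largest magnitudes. Layer-wise pruning at 50% removes half of the small layer regardless, whereas global pruning takes all eight removed weights from the large layer and leaves the small layer intact, which is the bottleneck effect conjectured above.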

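VGG-19. We study the variant of VGG-19 adapted for CIFAR10 by Liu et al. (2019); we use the same training regime and hyperparameters: $160$ epochs ($112,480$ iterations) of SGD with momentum ($0.9$), decreasing the learning rate by a factor of $10$ at $80$ and $120$ epochs. This network has $20$ million parameters. Figure 7 shows the results of iterative pruning and random reinitialization on VGG-19 at two initial learning rates: $0.1$ (used in Liu et al. (2019)) and $0.01$. At the higher learning rate, iterative pruning does not find winning tickets, and performance is no better than when the pruned networks are randomly reinitialized. At the lower learning rate, however, the usual pattern reemerges: subnetworks remain within $1$ percentage point of the original accuracy while $P_m \geq 3.5\%$. (They are not winning tickets, since they do not match the original accuracy.) When randomly reinitialized, the subnetworks lose accuracy as they are pruned, in the same manner as in the other experiments throughout this paper. Although these subnetworks learn faster than the unpruned network early in training (Figure 7 left), this accuracy advantage erodes later in training because of the lower initial learning rate. Even so, these subnetworks still learn faster than when they are reinitialized.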
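The step schedule described above can be written out as a short Python sketch. The figure of $703$ iterations per epoch follows from the numbers in the text ($112,480 / 160$). The function name `step_lr` and the convention that each drop takes effect from the first iteration of epochs $80$ and $120$ (counting epochs from zero) are assumptions, not details confirmed by the paper.

```python
def step_lr(iteration, base_lr=0.1, iters_per_epoch=703, drop_epochs=(80, 120), factor=10.0):
    """Learning rate at `iteration` under step decay (a sketch, not the authors' code)."""
    epoch = iteration // iters_per_epoch  # zero-indexed epoch containing this iteration
    drops = sum(1 for e in drop_epochs if epoch >= e)  # how many drops have already happened
    return base_lr / (factor ** drops)


# Iterations at which the assumed drops occur: 80 * 703 and 120 * 703.
first_drop = 80 * 703
second_drop = 120 * 703
print(first_drop, second_drop)
print(step_lr(0), step_lr(first_drop), step_lr(second_drop))
```

Under these assumptions the learning rate changes at iterations 56240 and 84360, so the 30K checkpoint in Figure 7 falls before the first drop, the 60K checkpoint between the two drops, and the 112K checkpoint after both.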

Figure 7: Test accuracy (at 30K, 60K, and 112K iterations) of VGG-19 when iteratively pruned.

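To bridge the gap between the lottery ticket behavior of the lower learning rate and the accuracy advantage of the higher learning rate, we explore the effect of linear learning rate warmup from $0$ to the initial learning rate over $k$ iterations. Training VGG-19 with warmup ($k = 10000$, green line) at learning rate $0.1$ improves the test accuracy of the unpruned network by about one percentage point. Warmup makes it possible to find winning tickets, exceeding this initial accuracy when $P_m \geq 1.5\%$.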
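Linear warmup as described here is a straight ramp from $0$ to the initial learning rate over the first $k$ iterations. The sketch below assumes the rate is simply held at the target afterwards (the step decay is omitted), and the name `warmup_lr` is hypothetical.

```python
def warmup_lr(iteration, target_lr=0.1, k=10000):
    """Linear warmup: ramp from 0 to `target_lr` over the first `k` iterations, then hold."""
    if iteration >= k:
        return target_lr
    return target_lr * iteration / k


# A few sample points along the ramp for the VGG-19 setting (k = 10000, target 0.1).
for it in (0, 2500, 5000, 10000, 30000):
    print(it, warmup_lr(it))
```

In a full training loop this factor would be combined with the decay schedule, for example by scaling the scheduled rate by `min(1, iteration / k)`; how the paper's implementation combines the two is not stated in this section.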

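Resnet-18. Resnet-18 (He et al., 2016) is a $20$-layer convolutional network with residual connections designed for CIFAR10. It has $271,000$ parameters. We train the network for $30,000$ iterations with SGD with momentum ($0.9$), decreasing the learning rate by a factor of $10$ at $20,000$ and $25,000$ iterations. Figure 8 shows the results of iterative pruning and random reinitialization at learning rates $0.1$ (used in He et al. (2016)) and $0.01$. These results largely mirror those for VGG: iterative pruning finds winning tickets at the lower learning rate but not at the higher one. The accuracy of the best winning tickets at the lower learning rate ($89.5\%$ when $41.7\% \geq P_m \geq 21.9\%$) falls short of the original network's accuracy at the higher learning rate ($90.5\%$). At the lower learning rate, the winning ticket again learns faster at first (left plots of Figure 8), but later in training it falls behind the unpruned network trained at the higher learning rate (right plot). Winning tickets trained with warmup close the accuracy gap with the unpruned network at the higher learning rate, reaching $90.5\%$ test accuracy with learning rate $0.03$ (warmup, $k = 20000$) at $P_m = 27.1\%$. For these hyperparameters, we still find winning tickets when $P_m \geq 11.8\%$. Even with warmup, however, we could not find hyperparameters for which winning tickets could be identified at the original learning rate, $0.1$.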

Figure 8: Test accuracy (at 10K, 20K, and 30K iterations) of Resnet-18 when iteratively pruned.

Original text

Here, we study the lottery ticket hypothesis on networks evocative of the architectures and techniques used in practice. Specifically, we consider VGG-style deep convolutional networks (VGG-19 on CIFAR10—Simonyan & Zisserman (2014)) and residual networks (Resnet-18 on CIFAR10—He et al. (2016)).² These networks are trained with batchnorm, weight decay, decreasing learning rate schedules, and augmented training data. We continue to find winning tickets for all of these architectures; however, our method for finding them, iterative pruning, is sensitive to the particular learning rate used. In these experiments, rather than measure early-stopping time (which, for these larger networks, is entangled with learning rate schedules), we plot accuracy at several moments during training to illustrate the relative rates at which accuracy improves.

Global pruning. On Lenet and Conv-2/4/6, we prune each layer separately at the same rate. For Resnet-18 and VGG-19, we modify this strategy slightly: we prune these deeper networks globally, removing the lowest-magnitude weights collectively across all convolutional layers. In Appendix I.1, we find that global pruning identifies smaller winning tickets for Resnet-18 and VGG-19. Our conjectured explanation for this behavior is as follows: For these deeper networks, some layers have far more parameters than others. For example, the first two convolutional layers of VGG-19 have $1728$ and $36864$ parameters, while the last has $2.35$ million. When all layers are pruned at the same rate, these smaller layers become bottlenecks, preventing us from identifying the smallest possible winning tickets. Global pruning makes it possible to avoid this pitfall.
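The layer sizes quoted above can be reproduced with simple arithmetic. The sketch assumes $3 \times 3$ kernels, channel widths of $3 \to 64 \to 64$ for the first two layers and $512 \to 512$ for the last, and no bias terms; these match the standard VGG configuration but are not spelled out in this section, and `conv_params` is a hypothetical helper.

```python
def conv_params(in_channels, out_channels, kernel=3):
    """Number of weights in a square convolutional layer, ignoring biases."""
    return kernel * kernel * in_channels * out_channels


first = conv_params(3, 64)     # RGB input to 64 channels
second = conv_params(64, 64)   # 64 to 64 channels
last = conv_params(512, 512)   # 512 to 512 channels

print(first, second, last)
print(last // first)  # how many times larger the last layer is than the first
```

Under these assumptions the last layer holds more than a thousand times as many weights as the first, which makes it plausible that pruning every layer at the same rate exhausts the small early layers long before the large late ones.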

VGG-19. We study the variant VGG-19 adapted for CIFAR10 by Liu et al. (2019); we use the same training regime and hyperparameters: $160$ epochs ($112,480$ iterations) with SGD with momentum ($0.9$) and decreasing the learning rate by a factor of $10$ at $80$ and $120$ epochs. This network has $20$ million parameters. Figure 7 shows the results of iterative pruning and random reinitialization on VGG-19 at two initial learning rates: $0.1$ (used in Liu et al. (2019)) and $0.01$. At the higher learning rate, iterative pruning does not find winning tickets, and performance is no better than when the pruned networks are randomly reinitialized. However, at the lower learning rate, the usual pattern reemerges, with subnetworks that remain within $1$ percentage point of the original accuracy while $P_m \geq 3.5\%$. (They are not winning tickets, since they do not match the original accuracy.) When randomly reinitialized, the subnetworks lose accuracy as they are pruned in the same manner as other experiments throughout this paper. Although these subnetworks learn faster than the unpruned network early in training (Figure 7 left), this accuracy advantage erodes later in training due to the lower initial learning rate. However, these subnetworks still learn faster than when reinitialized.

Figure 7: Test accuracy (at 30K, 60K, and 112K iterations) of VGG-19 when iteratively pruned.

To bridge the gap between the lottery ticket behavior of the lower learning rate and the accuracy advantage of the higher learning rate, we explore the effect of linear learning rate warmup from $0$ to the initial learning rate over $k$ iterations. Training VGG-19 with warmup ($k = 10000$, green line) at learning rate $0.1$ improves the test accuracy of the unpruned network by about one percentage point. Warmup makes it possible to find winning tickets, exceeding this initial accuracy when $P_m \geq 1.5\%$.

Resnet-18. Resnet-18 (He et al., 2016) is a $20$-layer convolutional network with residual connections designed for CIFAR10. It has $271,000$ parameters. We train the network for $30,000$ iterations with SGD with momentum ($0.9$), decreasing the learning rate by a factor of $10$ at $20,000$ and $25,000$ iterations. Figure 8 shows the results of iterative pruning and random reinitialization at learning rates $0.1$ (used in He et al. (2016)) and $0.01$. These results largely mirror those of VGG: iterative pruning finds winning tickets at the lower learning rate but not the higher learning rate. The accuracy of the best winning tickets at the lower learning rate ($89.5\%$ when $41.7\% \geq P_m \geq 21.9\%$) falls short of the original network’s accuracy at the higher learning rate ($90.5\%$). At the lower learning rate, the winning ticket again initially learns faster (left plots of Figure 8), but falls behind the unpruned network at the higher learning rate later in training (right plot). Winning tickets trained with warmup close the accuracy gap with the unpruned network at the higher learning rate, reaching $90.5\%$ test accuracy with learning rate $0.03$ (warmup, $k = 20000$) at $P_m = 27.1\%$. For these hyperparameters, we still find winning tickets when $P_m \geq 11.8\%$. Even with warmup, however, we could not find hyperparameters for which we could identify winning tickets at the original learning rate, $0.1$.

Figure 8: Test accuracy (at 10K, 20K, and 30K iterations) of Resnet-18 when iteratively pruned.

¹ See Figure 2 and Appendix I for details on the networks, hyperparameters, and training regimes.
² See Figure 2 and Appendix I for details on the networks, hyperparameters, and training regimes.
