
Rigging the Lottery: Making All Tickets Winners 【4 Empirical Evaluation】【Paper DeepL Translation】

Last updated at 2020-07-26, posted at 2020-07-26

This article is more or less a memo for myself.
It is almost entirely a DeepL translation.
I would appreciate it if you could point out any mistakes.

Translation source
Rigging the Lottery: Making All Tickets Winners

Previous: 【3 Rigging The Lottery】
Next: 【5 Discussion & Conclusion】

4 Empirical Evaluation

Japanese translation

我々の実験には, ImageNet-2012 (Russakovsky et al., 2015) および CIFAR-10 (Krizhevsky et al.) データセット上の CNNs を用いた画像分類と, WikiText-103 データセット (Merity et al., 2016) 上の RNNs を用いた文字ベースの言語モデリングが含まれる. 我々のプルーニングベースラインに TensorFlow モデルプルーニングライブラリ(Zhu & Gupta, 2018) を使用している. 他の 3 つのベースライン (SET, SNFS, SNIP) とともに, 我々の方法のTensorflow (Abadi et al., 2015) の実装はここにある. 学習ステップを $M$ 因子で増加させる場合, 学習レートスケジュールのアンカーエポックとマスク更新スケジュールの終了イテレーションも同じ因子でスケーリングされる; 我々はこのスケーリングを添え字 (例えば, RigL$_{M \times}$) で示す.

4.1 ImageNet-2012 Dataset

本節のすべての実験では, オプティマイザとして運動量のある SGD を使用する. オプティマイザの運動量係数を 0.9, L2正則化係数を 0.0001, ラベルスムージング (Szegedy et al., 2016) を 0.1 に設定した. 学習率のスケジュールは, 線形ウォームアップから始まり, エポック 5 で 1.6 の最大値に達し, その後エポック30, 70, 90 で 10 の係数で落とされる. 我々は 32000 ステップで 4096 のバッチサイズでネットワークを訓練する. 我々の訓練パイプラインは, ランダムフリップとクロップを含む標準的なデータオーギュメンテーションを使用している. このセクションのすべてのネットワークでは, 一様な layer-wise sparsity を使用する場合, 最初のレイヤを密にしている.

4.1.1 ResNet-50

図2-左は, 80% のスパースな ResNet-50 を学習した場合の各手法の性能をまとめたものである. また, 同等のパラメータ数を持つ小さな密なネットワークも学習する. すべてのスパースなネットワークは, 特に指定がない限り, レイヤごとに一様な sparsity 分布を用い, コサイン更新スケジュール ($\alpha$ = 0.3，$\Delta T$ = 100) を用いている. 全体的に, すべての手法の性能が訓練時間とともに向上することが観察された. したがって, 各手法について, オリジナルの訓練ステップの最大 5 倍の訓練ステップで拡張訓練を実行した.

図2: (左) ImageNet-2012 分類タスクにおける様々な動的スパーストレーニング手法の性能. ResNet-50 アーキテクチャを使用している. スパースなネットワークは 80% のスパースなネットワークで, 層ごとに均一な sparsity 分布を持っている. 各曲線の点は, 1 から 5 までの乗数 (0.5 から 2 の間でスケーリングされたプルーニングを除く) での個々のトレーニングの実行に対応している. 各乗数で 3 回のトレーニングを繰り返し, 平均精度を報告する. 標準的な密な Resnet-50 の学習に必要な FLOP 数とその性能を赤の破線で示す. (右) RigL の拡張訓練を行った場合の性能.

Gale et al. (2019), Evci et al. (2019), Frankle et al. (2019), およびMostafa & Wang (2019) が指摘しているように, 固定の sparsity を持つネットワークを最初からトレーニングすると (静的に), パフォーマンスが劣ることになる. 同じ数のパラメータを持つ小さな密なネットワークをトレーニングすると, 静的よりも良い結果が得られるが, 動的スパースモデルの性能にはかなわない. 同様に SET は Small-Dense よりも性能が向上するが, 精度は 75% 程度で飽和する. 勾配情報を用いて新規接続を成長させる手法 (RigLとSNFS) の方が精度は高いが, RigL が最も高い精度を達成しており, 他の手法よりも少ない FLOP 数で一貫して成長させることができる.

異なるアプリケーションやシナリオでは, 推論のための FLOPs 数に制限が必要となる場合があることを考慮して, 様々な sparsity レベルでの本手法の性能を調査した. 前述したように, 本手法の強みは, 学習中のリソース要件が一定であることであり, 学習や推論の制約条件に合わせて sparsity を選択できることである. 図2-右では, 異なる sparsity での我々の手法の性能を示し, 1.5 倍のトレーニングステップを使用する Gale et al. (2019) のプルーニング結果と比較している. FLOPs に関して公正な比較を行うために, 他のすべての手法の学習スケジュールを 5 倍にスケールしている. なお, 学習を拡張しても, RigL を用いたスパースなネットワークの学習には, (80% のスパースな RigL-ERK を除いて) プルーニング法と比較して FLOPs が少ないことに注意が必要である.

我々の手法である一定の sparsity 分布を持つ RigL は, すべての sparsity レベルにおいて大きさに基づく反復的なプルーニングの性能を上回り, 学習に必要な FLOP 数も少なくて済む. Erdos-Renyi-Kernel (ERK) を用いたスパースネットワークでは, さらに優れた性能が得られる. 例えば, 96.5% の sparsity を持つ ResNet-50 は, 72.75% という驚くべきトップ 1 精度を達成しており, Gale et al. (2019) が報告した拡張された大きさによるプルーニングの結果よりも約 3.5% 高い. 先に観察されたように, より小さな密なモデル (同じ数のパラメータを持つ) や, 静的な接続性を持つスパースモデルでは, 同等のレベルの性能を発揮することはできない.

スパースな学習方法のより細かい比較を表 2 に示す. 一様なスパース分布を用い, FLOP/メモリフットプリントが (1-S) で直接スケールする手法は, 表の最初のサブグループに配置されている. 第 2 のサブグループには, 同じパラメータ数で推論に多くの FLOP を必要とする DSR や ERK sparsity 分布を持つネットワークが含まれる. 最後のサブグループには, 密なモデルを学習するのに比例した空間と作業を必要とする手法が含まれる.

表2：80% と 90% のスパース ResNet-50s のトレーニングにおけるスパーストレーニング法の性能とコスト. 訓練とテストに必要な FLOP は, 密なモデルの FLOP で正規化されている (FLOP の計算方法の詳細については付録 H を参照). 下付きの方法は学習時間を再スケーリングしたものを示し, '*' は他の場所で報告された結果を示す. (ERK) は Erdos-Renyi-Kernel sparsity 分布を持つスパースネットワークに対応する. 密なモデルの 20% のパラメータと 42% の FLOP を用いて, RigL$_{5 \times}$(ERK) は 77.1% のトップ 1 精度を達成している.

4.1.2 MobileNet

MobileNet は, リソース制約のある設定で著しく良好な性能を発揮するコンパクトなアーキテクチャである. 分離可能な畳み込みを持つコンパクトな性質のため, スパース化が困難であることが知られている (Zhu & Gupta, 2018). このセクションでは, MobileNet-v1 (Howard et al., 2017) と MobileNet-v2 (Sandler et al., 2018) に我々の手法を適用する. その低いパラメータ数のために, 第 1 層を密にしておき, 残りの層をスパース化するために, ERK および均一 sparsity 分布を使用する.

ベースラインと同様に RigL で学習したスパースモバイルネットの性能を図 3 に示す. このセクションでは, すべての実行に対して拡張学習 (元のステップ数の 5 倍) を行っている. MobileNets は ResNet-50 アーキテクチャと比較して sparsity に敏感であるが, RigL は高い sparsity でのスパースな MobileNets の学習に成功し, これまでに報告されているプルーニング結果を上回る性能を示した.

図3: (左) RigL は ImageNet-2012 データセット上の Sparse MobileNets の性能を大幅に向上させ, Zhu & Gupta (2018) が報告したプルーニング結果を上回っている. 密な MobileNets の性能を赤線で示している. (右) 推論 FLOP で提示されたスパース MobileNet-v1 アーキテクチャの性能. ERK 分布を持つネットワークは, 同じパラメータ数でより良い性能を得るが, 実行にはより多くの FLOPs を要する. RigL (Big-Sparse) を用いてより広いスパースモデルをトレーニングすると, 密なモデルよりも大幅に性能が向上する.

スパースなモデルの利点を実証するために, 次に, sparsity を利用して密なベースラインと同じ FLOPs とパラメータの総数を維持しながら, より広い MobileNet を訓練する. スパースな MobileNet-v1 は, 幅乗数 1.98, 一定の 75% の sparsity さで, 密なベースラインと同じ FLOPs とパラメータ数を持っている. このネットワークを RigL で学習すると, Top-1 の精度が 4.3% 向上する.

4.2 Character Level Language Modelling

先行研究のほとんどは, ビジョンネットワーク上でのスパーストレーニングしか検討していない (例外は, TIMIT (Garofolo et al., 1993) データセット上で LSTM (Hochreiter & Schmidhuber, 1997) をトレーニングした最も古い研究 Deep Rewiring (Bellec et al., 2018) である). これらの技術を完全に理解するためには, 異なるデータセット上で異なるアーキテクチャを調べることが重要である. Kalchbrenner et al. (2018) は, スパースGRU (Cho et al., 2014) が音声のモデリングに非常に効果的であることを発見したが, 彼らが使用したデータセットは利用できない. 公開されている WikiText103 (Merity et al., 2016) のデータセット上で文字レベルの言語モデリングを行う際に, 似たような特徴 (データセットサイズと語彙サイズがほぼ同じ) を持つプロキシタスクを選択した.

我々のネットワークは, 次元数 128 の共有埋め込み, 語彙サイズ256, 状態サイズ 512 の GRU, 256 単位と 128 単位の 2 つの線形層からなる GRU 状態からの読み出しから構成されている. 次のステップ予測タスクを, 標準的なクロスエントロピー損失, Adam オプティマイザ, 学習率 7e-4, L2正則化係数 5e-4, シーケンス長 512, バッチサイズ 32, 10 よりも大きい値の勾配絶対値クリッピングを用いて訓練した. すべてのモデルで sparsity を 75% に設定し, 20 万回の反復実行を行った. 大きさによるプルーニング (Zhu & Gupta, 2018) で sparsity を誘導する場合, イテレーション数 50,000 回から 150,000 回の間でプルーニングを 1,000 回の頻度で実行する. 我々は, 一様なスパース分布を持つスパースネットワークを初期化し, $\alpha$ = 0.1、$\Delta T$ = 100 のコサイン更新スケジュールを使用する. これまでの実験とは異なり, 我々は訓練の最後までマスクを更新し続けている.

図 4-左では, トレーニング終了時の様々なソリューションのステップごとの検証ビットを報告している. 各手法について拡張実行を行い, 訓練時間の増加に伴ってどのように変化するかを確認している. 前述したように, SET は他の動的訓練法に比べて性能が悪く, 訓練時間の増加に伴って性能はわずかに改善される. 一方, RigL と SNFS の性能は, 訓練ステップ数が増えるにつれて常に向上している. RigL は他のスパースな訓練手法の性能を上回っているが, この設定ではプルーニングの性能には及ばない.

4.3 WideResNet-22-2 on CIFAR-10

また, CIFAR-10 画像分類ベンチマークにおける RigL の性能を評価する. 我々は, 250 エポック (97656ステップ) のために 2 の幅の乗数を使用して 22 層のワイド残差ネットワーク (Zagoruyko & Komodakis, 2016) を訓練する. 学習率は 0.1 から始まり, 30,000 回のイテレーションごとに 5 倍にスケールダウンされる. L2正則化係数は 5e-4, バッチサイズは 128, 運動量係数は 0.9 である. マスク更新のための最終的な反復を除いて, RigL に特有のハイパーパラメータは ImageNet 実験と同じにしている. マスク更新間隔を変えた場合の結果は, 付録 I を参照されたい.

図 4-右は, 様々な sparsity レベルにおける RigL の最終的な精度を示している. 密なベースラインでは　94.1%　のテスト精度が得られている; 驚くべきことに, 50% のスパースなネットワークの中には, sparsity の正則化の側面を示す密なベースラインよりも一般化が進んでいるものがある. sparsity が高くなると, スタティックネットワークとプルーニングソリューションの間にパフォーマンスのギャップが見られる. スタティックネットワークのトレーニング時間が長くなると, 最終的な性能への影響は限定的になるようである. 一方, RigL は, トレーニングに必要なリソースのほんの一部で, pruning と同等の性能を発揮する.

図 4: (左) 文字レベルの言語モデリングタスクにおける様々なスパース学習法の最終的なテスト損失. クロスエントロピー損失はビットに変換されている (nats から). (右) CIFAR-10 タスクにおけるスパースな WideResNet-22-2 のテスト精度.

4.4 Analyzing the performance of RigL

このセクションでは, sparsity 分布, 更新スケジュール, および動的接続が我々の手法の性能に与える影響を検討する. SET と SNFS の結果は類似しており, 付録 C と F で議論されている.

マスク初期化の影響: 図 5-左は, RigL で学習したスパースな ResNet-50 の最終的なテスト精度に, sparsity 分布がどのように影響するかを示している. Erdos-Renyi-Kernel (ERK) は他の 2 つの分布よりも一貫して優れた性能を発揮する. ERK は, パラメータの少ない層の sparsities を下げることで, より多くのパラメータを自動的に割り当てている. この再割り当ては, ERK が他の分布よりも大きなマージンで他の分布を凌駕するような高い sparsity レベルでネットワークの容量を維持するために重要であると思われる. ERK 分布は一様分布に比べて性能は優れているが, 約 2 倍の FLOP を必要とする. このことは, 両モデルともパラメータ数が同じであるにもかかわらず, 精度と計算効率の間の興味深いトレードオフを浮き彫りにしている.

更新間隔と頻度の影響: 図 5-右は, 更新間隔 $\Delta T \in$ [50, 100, 500, 1000] と初期ドロップ率$\alpha \in$ [0.1, 0.3, 0.5] について, 本手法の性能を評価したものである. 最も精度が良いのは, 初期ドロップ率を 0.3 または 0.5 とし, 100 回ごとにマスクを更新した場合である. 注目すべきことは, 更新間隔を頻繁に (例えば 1000 回の繰り返し) にしても, RigL は 73.5% 以上の精度を達成しているということである.

図 5: (左) 異なる sparsity マスクを用いた異なる sparsities での RigL の性能 (右) コサインスケジュールでのアブレーション研究. その他の方法は付録にある.

動的接続の影響: Frankle et al. (2019) と Mostafa & Wang (2019) は, 静的スパーストレーニングの方が動的スパーストレーニングよりも高い損失を持つ解に収束することを観測した. 図 6-左では, 静的スパースな訓練によって得られた解とプルーニングによって得られた解の間に横たわる損失のランドスケープを調べ, 前者が後者から隔離された盆地にあるかどうかを理解する. この 2 つの間で線形補間を行うと, 予想通りの結果が得られ, 損失の高い障壁が存在することがわかる. しかし, これは 2 点間の無限に多くのパスのうちの 1 つに過ぎない; 最適化は, 曲線に沿った損失を最小化するなどの制約条件を条件として, 解を結ぶパラメトリック曲線を見つけるために使用することができる (Garipov et al., 2018; Draxler et al., 2018). 例えば, Garipov et al. (2018) は, 2 つの解の間のエネルギーが低い $2^{nd}$ 次ベジェ曲線を見つけることによって, 異なる密な解が同じ盆地に横たわっていることを示した. 彼らの方法に従って, 我々は 2 つのスパースな解の間に 2 次, 3 次のベジェ曲線を見つけることを試みる. 驚くべきことに, 3 次曲線を用いても, 損失の高い障壁のない経路を見つけることができなかった. これらの結果は, 静的スパーストレーニングは, 改良解から分離された局所的な最小値で行き詰まる可能性があることを示唆している. 一方、密空間全体にわたって二次ベジェ曲線を最適化すると, 改善された解へのほぼ単調なパスを見つけることができる.

図 6-右では, 静的スパーストレーニングで見つかった準最適解から RigL を訓練しており, 静的スパーストレーニングでの再訓練ではローカル最小値を逃れることができないのに対し, RigL はローカル最小値を逃れることができることを示している. RigL は, まず最小の大きさを持つ接続を除去する. これらの接続を除去することで, 損失への影響が最小になることが示されているためである (Han et al., 2015; Evci, 2018). 次に, これらの接続は損失を最も早く減らすことが期待されるため, 勾配が高い接続をアクティブにする. 我々は付録 A で, RigL が鞍点方向を高勾配の次元に置き換えることで, 悪い臨界点を脱出するという仮説を立てている.

図 6: (左) magnitude pruning モデル (0.0) と static sparsity (1.0) で学習したモデルの補間曲線上の様々な点で評価された学習損失. (右) 静的スパース解からの RigL 法とスタティック法の学習損失と最終的な精度.

Original text

Our experiments include image classification using CNNs on the ImageNet-2012 (Russakovsky et al., 2015) and CIFAR-10 (Krizhevsky et al.) datasets and character-based language modelling using RNNs with the WikiText-103 dataset (Merity et al., 2016). We use the TensorFlow Model Pruning library (Zhu & Gupta, 2018) for our pruning baselines. A TensorFlow (Abadi et al., 2015) implementation of our method along with three other baselines (SET, SNFS, SNIP) can be found here. When increasing the training steps by a factor M, the anchor epochs of the learning rate schedule and the end iteration of the mask update schedule are also scaled by the same factor; we indicate this scaling with a subscript (e.g. RigL$_{M \times}$).

4.1 ImageNet-2012 Dataset

In all experiments in this section, we use SGD with momentum as our optimizer. We set the momentum coefficient of the optimizer to 0.9, L2 regularization coefficient to 0.0001, and label smoothing (Szegedy et al., 2016) to 0.1. The learning rate schedule starts with a linear warm up reaching its maximum value of 1.6 at epoch 5 which is then dropped by a factor of 10 at epochs 30, 70 and 90. We train our networks with a batch size of 4096 for 32000 steps which roughly corresponds to 100 epochs of training. Our training pipeline uses standard data augmentation, which includes random flips and crops. For all the networks in this section, when using uniform layer-wise sparsity, we keep the very first layer dense.
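
The learning rate schedule described above can be sketched in a few lines. This is only an illustration, not the authors' code: the function names are hypothetical, the warm-up is assumed to start from 0 (the text only says where it ends), and the ImageNet-2012 training-set size of 1,281,167 images is a figure not stated in this section.

```python
def imagenet_lr(epoch, peak=1.6, warmup_epochs=5, drop_epochs=(30, 70, 90)):
    """Learning rate at a (possibly fractional) epoch.

    Assumption: the linear warm-up starts at 0 and reaches `peak` at
    `warmup_epochs`; afterwards the rate is divided by 10 at each drop epoch.
    """
    if epoch < warmup_epochs:
        return peak * epoch / warmup_epochs
    lr = peak
    for drop_epoch in drop_epochs:
        if epoch >= drop_epoch:
            lr /= 10.0
    return lr


def approx_epochs(steps, batch_size, dataset_size):
    """Number of passes over the data for a given step count and batch size."""
    return steps * batch_size / dataset_size


# Assumption: 1,281,167 training images in ImageNet-2012.
print(approx_epochs(32000, 4096, 1_281_167))  # a little over 100 epochs
```

With these settings the rate is 1.6 from epoch 5 to 30, 0.16 until epoch 70, 0.016 until epoch 90, and 0.0016 afterwards.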

4.1.1 ResNet-50

Figure 2-left summarizes the performance of various methods on training an 80% sparse ResNet-50. We also train small dense networks with equivalent parameter count. All sparse networks use a uniform layer-wise sparsity distribution unless otherwise specified and a cosine update schedule ($\alpha$ = 0.3, $\Delta T$ = 100). Overall, we observe that the performance of all methods improves with training time; thus, for each method we run extended training with up to 5$\times$ the training steps of the original.
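
As a rough illustration of the cosine update schedule, the sketch below uses the cosine-annealing form from the paper's method section (Section 3, not reproduced on this page), $f(t) = \frac{\alpha}{2}\left(1 + \cos\frac{\pi t}{T_{end}}\right)$. The function names and the default `t_end` value are placeholders chosen for this example, not values taken from this section. The `multiplier` argument mirrors the RigL$_{M \times}$ convention of scaling the end iteration by the same factor as the training length.

```python
import math


def drop_fraction(step, alpha=0.3, t_end=25_000, multiplier=1):
    """Fraction of active connections to replace at `step`.

    `t_end` is a placeholder for the last mask-update iteration; it is scaled
    by `multiplier` when training is extended.
    """
    end = t_end * multiplier
    if step >= end:
        return 0.0
    return 0.5 * alpha * (1.0 + math.cos(math.pi * step / end))


def is_update_step(step, delta_t=100, t_end=25_000, multiplier=1):
    """True when the mask should be updated: every `delta_t` steps before the end."""
    return 0 < step < t_end * multiplier and step % delta_t == 0
```

The fraction starts at $\alpha$ and decays smoothly to 0 at the end iteration, so early updates rewire many connections and late updates rewire few.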

Figure 2: (left) Performance of various dynamic sparse training methods on the ImageNet-2012 classification task. We use the ResNet-50 architecture. Sparse networks are 80% sparse with a uniform layer-wise sparsity distribution. Points at each curve correspond to the individual training runs with training multipliers from 1 to 5 (except pruning which is scaled between 0.5 and 2). We repeat training 3 times at every multiplier and report the mean accuracies. The number of FLOPs required to train a standard dense ResNet-50 along with its performance is indicated with a dashed red line. (right) Performance of RigL at different sparsity levels with extended training.

As noted by Gale et al. (2019), Evci et al. (2019), Frankle et al. (2019), and Mostafa & Wang (2019), training a network with fixed sparsity from scratch (Static) leads to inferior performance. Training a small dense network with the same number of parameters gets better results than Static, but fails to match the performance of dynamic sparse models. Similarly, SET improves on Small-Dense but saturates around 75% accuracy, indicating the limits of growing new connections randomly. Methods that use gradient information to grow new connections (RigL and SNFS) obtain higher accuracies; of these, RigL achieves the highest accuracy while consistently requiring fewer FLOPs than the other methods.

Given that different applications or scenarios might require a limit on the number of FLOPs for inference, we investigate the performance of our method at various sparsity levels. As mentioned previously, one strength of our method is that its resource requirements are constant throughout training and we can choose the level of sparsity that fits our training and/or inference constraints. In Figure 2-right we show the performance of our method at different sparsities and compare it with the pruning results of Gale et al. (2019), which use 1.5x the original 32k training iterations. To make a fair comparison with regard to FLOPs, we scale the learning schedule of all other methods by 5x. Note that even after extending the training, it takes fewer FLOPs to train sparse networks using RigL (except for the 80% sparse RigL-ERK) compared to the pruning method.

RigL, our method with constant sparsity distribution, exceeds the performance of magnitude-based iterative pruning at all sparsity levels while requiring fewer FLOPs to train. Sparse networks that use the Erdos-Renyi-Kernel (ERK) sparsity distribution obtain even greater performance. For example, ResNet-50 with 96.5% sparsity achieves a remarkable 72.75% Top-1 Accuracy, around 3.5% higher than the extended magnitude pruning results reported by Gale et al. (2019). As observed earlier, smaller dense models (with the same number of parameters) or sparse models with a static connectivity cannot perform at a comparable level.

A more fine-grained comparison of sparse training methods is presented in Table 2. Methods using a uniform sparsity distribution and whose FLOP/memory footprint scales directly with (1-S) are placed in the first sub-group of the table. The second sub-group includes DSR and networks with the ERK sparsity distribution, which require a higher number of FLOPs for inference with the same parameter count. The final sub-group includes methods that require space and work proportional to training a dense model.

Table 2: Performance and cost of sparse training methods on training 80% and 90% sparse ResNet-50s. FLOPs needed for training and test are normalized with the FLOPs of a dense model (see Appendix H for details on how FLOPs are calculated). Methods with a subscript indicate a rescaled training time, and ‘*’ indicates results reported elsewhere. (ERK) corresponds to the sparse networks with Erdos-Renyi-Kernel sparsity distribution. RigL$_{5 \times}$ (ERK) achieves 77.1% Top-1 Accuracy using only 20% of the parameters of a dense model and 42% of its FLOPs.

4.1.2 MobileNet

MobileNet is a compact architecture that performs remarkably well in resource constrained settings. Due to its compact nature with separable convolutions it is known to be difficult to sparsify (Zhu & Gupta, 2018). In this section we apply our method to MobileNet-v1 (Howard et al., 2017) and MobileNet-v2 (Sandler et al., 2018). Due to its low parameter count we keep the first layer dense, and use ERK and Uniform sparsity distributions to sparsify the remaining layers.

The performance of sparse MobileNets trained with RigL, as well as that of the baselines, is shown in Figure 3. We do extended training (5x of the original number of steps) for all runs in this section. Although MobileNets are more sensitive to sparsity compared to the ResNet-50 architecture, RigL successfully trains sparse MobileNets at high sparsities and exceeds the performance of previously reported pruning results.

Figure 3: (left) RigL significantly improves the performance of Sparse MobileNets on the ImageNet-2012 dataset and exceeds the pruning results reported by Zhu & Gupta (2018). Performance of the dense MobileNets is indicated with red lines. (right) Performance of sparse MobileNet-v1 architectures presented with their inference FLOPs. Networks with ERK distribution get better performance with the same number of parameters but take more FLOPs to run. Training wider sparse models with RigL (Big-Sparse) yields a significant performance improvement over the dense model.

To demonstrate the advantages of sparse models, next, we train wider MobileNets while keeping the FLOPs and total number of parameters the same as the dense baseline using sparsity. A sparse MobileNet-v1 with width multiplier 1.98 and constant 75% sparsity has the same FLOPs and parameter count as the dense baseline. Training this network with RigL yields an impressive 4.3% absolute improvement in Top-1 Accuracy.
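
A quick back-of-the-envelope check shows why a width multiplier of 1.98 pairs with 75% sparsity. The sketch below rests on a simplifying assumption that is not stated in the text: the cost is dominated by layers whose input and output channel counts both scale with the width multiplier (for example 1x1 pointwise convolutions), so parameters and FLOPs grow roughly with the square of the width. The dense first layer and the depthwise convolutions are ignored. The function name is hypothetical.

```python
def relative_cost(width_multiplier, sparsity):
    """Approximate parameter/FLOP count relative to the dense width-1.0 model.

    Assumption: cost scales quadratically with width, and a uniformly sparse
    layer keeps a (1 - sparsity) fraction of its connections.
    """
    return width_multiplier ** 2 * (1.0 - sparsity)


print(relative_cost(1.98, 0.75))  # about 0.98, i.e. close to the dense baseline
```

Under this approximation the wider sparse network costs about the same as the dense baseline, which is consistent with the claim in the paragraph above.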

4.2 Character Level Language Modelling

Most prior work has only examined sparse training on vision networks (the exception is the earliest work, Deep Rewiring (Bellec et al., 2018), which trained an LSTM (Hochreiter & Schmidhuber, 1997) on the TIMIT (Garofolo et al., 1993) dataset). To fully understand these techniques it is important to examine different architectures on different datasets. Kalchbrenner et al. (2018) found sparse GRUs (Cho et al., 2014) to be very effective at modeling speech; however, the dataset they used is not available. We choose a proxy task with similar characteristics (dataset size and vocabulary size are approximately the same): character-level language modeling on the publicly available WikiText-103 (Merity et al., 2016) dataset.

Our network consists of a shared embedding with dimensionality 128, a vocabulary size of 256, a GRU with a state size of 512, and a readout from the GRU state consisting of two linear layers with 256 units and 128 units respectively. We train the next step prediction task with the standard cross entropy loss, the Adam optimizer, a learning rate of 7e-4, an L2 regularization coefficient of 5e-4, a sequence length of 512, a batch size of 32 and gradient absolute value clipping of values larger (in magnitude) than 10. We set the sparsity to 75% for all models and run 200,000 iterations. When inducing sparsity with magnitude pruning (Zhu & Gupta, 2018), we perform pruning between iterations 50,000 and 150,000 with a frequency of 1,000. We initialize sparse networks with a uniform sparsity distribution and use a cosine update schedule with $\alpha$ = 0.1 and $\Delta T$ = 100. Unlike the previous experiments we keep updating the mask until the end of the training; we observed this performed slightly better than stopping at iteration 150,000.

In Figure 4-left we report the validation bits per step of various solutions at the end of the training. For each method we perform extended runs to see how they scale with increasing training time. As observed before, SET performs worse than the other dynamic training methods and its performance improves only slightly with increased training time. On the other hand the performance of RigL and SNFS improves constantly with more training steps. Even though RigL exceeds the performance of the other sparse training approaches it fails to match the performance of pruning in this setting.

4.3 WideResNet-22-2 on CIFAR-10

We also evaluate the performance of RigL on the CIFAR-10 image classification benchmark. We train a Wide Residual Network (Zagoruyko & Komodakis, 2016) with 22 layers using a width multiplier of 2 for 250 epochs (97656 steps). The learning rate starts at 0.1 and is scaled down by a factor of 5 every 30,000 iterations. We use an L2 regularization coefficient of 5e-4, a batch size of 128 and a momentum coefficient of 0.9. We keep the hyper-parameters specific to RigL the same as in the ImageNet experiments, except for the final iteration for mask updates, which is adjusted to 75000. Results with different mask update intervals can be found in Appendix I.
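
The step count and the step-wise learning rate decay quoted above can be reproduced with a short sketch. The function names are hypothetical, and the CIFAR-10 training-set size of 50,000 images is a figure not stated in this section.

```python
def total_steps(epochs, dataset_size, batch_size):
    """Whole number of optimizer steps for the given number of epochs."""
    return epochs * dataset_size // batch_size


def cifar_lr(step, base_lr=0.1, decay_factor=5.0, decay_every=30_000):
    """Learning rate that is divided by `decay_factor` every `decay_every` steps."""
    return base_lr / decay_factor ** (step // decay_every)


# Assumption: 50,000 training images in CIFAR-10.
print(total_steps(250, 50_000, 128))  # 97656
```

Over the 97656 steps the rate is therefore reduced three times, at iterations 30,000, 60,000 and 90,000.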

The final accuracy of RigL for various sparsity levels is presented in Figure 4-right. The dense baseline obtains 94.1% test accuracy; surprisingly, some of the 50% sparse networks generalize better than the dense baseline demonstrating the regularization aspect of sparsity. With increased sparsity, we see a performance gap between the Static and Pruning solutions. Training static networks longer seems to have limited effect on the final performance. On the other hand, RigL matches the performance of pruning with only a fraction of the resources needed for training.

Figure 4: (left) Final validation loss of various sparse training methods on character level language modelling task. Cross entropy loss is converted to bits (from nats). (right) Test accuracies of sparse WideResNet-22-2’s on CIFAR-10 task.
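
For reference, the conversion from nats to bits mentioned in the caption is a division by $\ln 2$. A minimal sketch (the function name is hypothetical):

```python
import math


def nats_to_bits(loss_in_nats):
    """Convert a cross entropy value measured in nats to bits."""
    return loss_in_nats / math.log(2.0)
```

A cross entropy computed with the natural logarithm, as most deep learning frameworks do by default, is in nats, so this division is what turns it into the bits-per-step figure reported for Figure 4-left.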

4.4 Analyzing the performance of RigL

In this section we study the effect of sparsity distributions, update schedules, and dynamic connections on the performance of our method. The results for SET and SNFS are similar and are discussed in Appendices C and F.

Effect of Mask Initialization: Figure 5-left shows how the sparsity distribution affects the final test accuracy of sparse ResNet-50s trained with RigL. Erdos-Renyi-Kernel (ERK) performs consistently better than the other two distributions. ERK automatically allocates more parameters to the layers with few parameters by decreasing their sparsities. This reallocation seems to be crucial for preserving the capacity of the network at high sparsity levels where ERK outperforms other distributions by a greater margin. Though it performs better, the ERK distribution requires approximately twice as many FLOPs compared to a uniform distribution. This highlights an interesting tradeoff between accuracy and computational efficiency even though both models have the same number of parameters.
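
To make the reallocation concrete, here is a minimal sketch of how ERK densities could be computed. It follows the definition from the paper's method section (Section 3, not reproduced on this page), where the density of a convolutional layer is proportional to $(n^{l-1} + n^{l} + w^{l} + h^{l}) / (n^{l-1} n^{l} w^{l} h^{l})$. The function name, the layer shapes in the usage example, and the way layers that would exceed density 1 are made fully dense are assumptions of this sketch, not a statement about the authors' implementation. It assumes 0 < sparsity < 1.

```python
def erk_densities(shapes, sparsity):
    """Per-layer densities under an Erdos-Renyi-Kernel style allocation.

    shapes: list of (height, width, in_channels, out_channels) per layer.
    sparsity: overall fraction of weights to remove, with 0 < sparsity < 1.
    """
    sizes = [h * w * n_in * n_out for h, w, n_in, n_out in shapes]
    scores = [(h + w + n_in + n_out) / size
              for (h, w, n_in, n_out), size in zip(shapes, sizes)]
    target = (1.0 - sparsity) * sum(sizes)  # total number of weights to keep

    dense = set()  # layers forced to density 1.0
    while True:
        free = [i for i in range(len(shapes)) if i not in dense]
        budget = target - sum(sizes[i] for i in dense)
        eps = budget / sum(scores[i] * sizes[i] for i in free)
        worst = max(free, key=lambda i: scores[i])
        if eps * scores[worst] <= 1.0:
            break
        # Density would exceed 1: keep this layer dense and solve again.
        dense.add(worst)

    return [1.0 if i in dense else eps * scores[i] for i in range(len(shapes))]


# Hypothetical three-layer network at 80% overall sparsity.
layers = [(3, 3, 16, 32), (3, 3, 32, 64), (1, 1, 64, 10)]
print(erk_densities(layers, 0.8))
```

In this toy example the smallest layer stays fully dense and the largest layer ends up the sparsest, while the total number of kept weights still matches the 80% target. This is the behaviour the paragraph above describes: layers with few parameters receive lower sparsity.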

Effect of Update Schedule and Frequency: In Figure 5-right, we evaluate the performance of our method on update intervals $\Delta T \in$ [50, 100, 500, 1000] and initial drop fractions $\alpha \in$ [0.1, 0.3, 0.5]. The best accuracies are obtained when the mask is updated every 100 iterations with an initial drop fraction of 0.3 or 0.5. Notably, even with infrequent update intervals (e.g. every 1000 iterations), RigL performs above 73.5%.

Figure 5: (left) Performance of RigL at different sparsities using different sparsity masks (right) Ablation study on cosine schedule. Other methods are in the appendix.

Effect of Dynamic connections: Frankle et al. (2019) and Mostafa & Wang (2019) observed that static sparse training converges to a solution with a higher loss than dynamic sparse training. In Figure 6-left we examine the loss landscape lying between a solution found via static sparse training and a solution found via pruning to understand whether the former lies in a basin isolated from the latter. Performing a linear interpolation between the two reveals the expected result – a high-loss barrier – demonstrating that the loss landscape is not trivially connected. However, this is only one of infinitely many paths between the two points; optimization can be used to find parametric curves that connect solutions (Garipov et al., 2018; Draxler et al., 2018) subject to constraints, such as minimizing the loss along the curve. For example Garipov et al. (2018) showed different dense solutions lie in the same basin by finding $2^{nd}$ order Bezier curves with low energy between the two solutions. Following their method, we attempt to find quadratic and cubic Bezier curves between the two sparse solutions. Surprisingly, even with a cubic curve, we fail to find a path without a high-loss barrier. These results suggest that static sparse training can get stuck at local minima that are isolated from improved solutions. On the other hand, when we optimize the quadratic Bezier curve across the full dense space we find a near-monotonic path to the improved solution, suggesting that allowing new connections to grow lends dynamic sparse training greater flexibility in navigating the loss landscape.
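
The two kinds of path discussed above can be written down compactly. The sketch below only evaluates points on a path between two flattened weight vectors; finding a low-loss curve would additionally require optimizing the control point `theta` against the training loss, which is not shown. All names are hypothetical.

```python
def linear_path(w_a, w_b, t):
    """Point at position t in [0, 1] on the straight line from w_a to w_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(w_a, w_b)]


def quadratic_bezier(w_a, theta, w_b, t):
    """Point at position t on a quadratic Bezier curve with control point theta.

    The end points w_a and w_b stay fixed; only theta would be trained.
    """
    return [(1.0 - t) ** 2 * a + 2.0 * t * (1.0 - t) * c + t ** 2 * b
            for a, c, b in zip(w_a, theta, w_b)]
```

Evaluating the loss at several values of `t` along either path gives curves like those in Figure 6-left. When `theta` is restricted to the connections of the sparse solutions, the curve stays in the sparse subspace; letting every entry of `theta` vary corresponds to the "full dense space" case mentioned in the text.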

In Figure 6-right we train RigL starting from the sub-optimal solution found by static sparse training, demonstrating that it is able to escape the local minimum, whereas re-training with static sparse training cannot. RigL first removes connections with the smallest magnitudes, since removing these connections has been shown to have a minimal effect on the loss (Han et al., 2015; Evci, 2018). Next, it activates connections with high gradients, since these connections are expected to decrease the loss fastest. We hypothesize in Appendix A that RigL escapes bad critical points by replacing saddle directions with high gradient dimensions.
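
The drop-and-grow step described above can be illustrated on a single flattened layer. This is a simplified sketch rather than the authors' implementation: the function name is hypothetical, growth candidates are assumed to be all connections that are inactive after the drop, and newly grown connections are initialized to zero as in the paper's method section.

```python
def rigl_update(weights, mask, grads, k):
    """One drop/grow step on a flat layer; returns (new_weights, new_mask).

    weights, grads: lists of floats (grads are dense gradients for all entries).
    mask: list of 0/1 flags marking active connections.
    k: number of connections to replace.
    """
    new_weights = list(weights)
    new_mask = list(mask)

    # Drop: the k active connections with the smallest weight magnitude.
    active = [i for i, m in enumerate(mask) if m]
    for i in sorted(active, key=lambda i: abs(weights[i]))[:k]:
        new_mask[i] = 0
        new_weights[i] = 0.0

    # Grow: the k inactive connections with the largest gradient magnitude.
    inactive = [i for i, m in enumerate(new_mask) if not m]
    for i in sorted(inactive, key=lambda i: abs(grads[i]), reverse=True)[:k]:
        new_mask[i] = 1
        new_weights[i] = 0.0  # grown connections start at zero

    return new_weights, new_mask


weights = [0.5, -0.01, 0.0, 0.3, 0.0, -0.8]
mask = [1, 1, 0, 1, 0, 1]
grads = [0.1, 0.2, 0.9, 0.1, 0.05, 0.3]
print(rigl_update(weights, mask, grads, k=1))
# ([0.5, 0.0, 0.0, 0.3, 0.0, -0.8], [1, 0, 1, 1, 0, 1])
```

The number of active connections is unchanged by the update, so the sparsity of the layer stays constant throughout training.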

Figure 6: (left) Training loss evaluated at various points on interpolation curves between a magnitude pruning model (0.0) and a model trained with static sparsity (1.0). (right) Training loss of RigL and Static methods starting from the static sparse solution, and their final accuracies.
