1 Introduction

Translation

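The parameter and floating-point operation (FLOP) efficiency of sparse neural networks has now been well demonstrated on a variety of problems (Han et al., 2015; Srinivas et al., 2017). Several studies have shown that sparsity can speed up inference for both recurrent neural networks (RNNs) (Kalchbrenner et al., 2018) and convolutional neural networks (ConvNets) (Park et al., 2016; Elsen et al., 2019). At present, the most accurate sparse models are obtained with techniques that require at least the cost of training a dense model in terms of memory and FLOPs (Zhu & Gupta, 2018; Guo et al., 2016), and sometimes considerably more (Molchanov et al., 2017). This paradigm has two main limitations: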

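The maximum size of a sparse model is limited to the largest dense model that can be trained. Even if sparse models are more parameter efficient, pruning cannot be used to train models that are larger and more accurate than the largest possible dense model.
It is inefficient. A large amount of computation must be performed on parameters whose value is zero or that will be zero at inference time.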

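Furthermore, it remains unclear whether the performance of the current best pruning algorithms is an upper bound on the quality of sparse models. Gale et al. (2019) found that three different dense-to-sparse training algorithms all achieve roughly the same sparsity/accuracy trade-off. However, this is far from conclusive proof that no better performance is possible. In this work we show the surprising result that dynamic sparse training, including the method introduced below, can find more accurate models than the current best approaches to pruning an initially dense network. Importantly, our method does not change the FLOPs required to run the model during training, so a specific inference cost can be fixed before training.
The Lottery Ticket Hypothesis (Frankle & Carbin, 2019) proposed that if a sparse neural network can be found by iterative pruning, then that sparse network can be trained from scratch to the same level of accuracy by starting from the original initial conditions. In this paper we introduce a new method for training sparse models that does not need a "lucky" initialization; for this reason, we call our method "The Rigged Lottery", or RigL. We show that this method is: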

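Memory efficient. It requires memory proportional only to the size of the sparse model and never needs to store any quantity that is the size of the dense model. This is in contrast to Dettmers & Zettlemoyer (2019), which requires storing the momentum for all parameters, even those that are zero valued.
Computationally efficient. The amount of computation required to train the model is proportional to the number of nonzero parameters in the model.
Accurate. The performance achieved by the method matches, and sometimes exceeds, that of pruning-based approaches.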

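Our method works by infrequently using instantaneous gradient information to inform a rewiring of the network. We show that this lets the optimization escape local minima in which it would otherwise be trapped if the sparsity pattern stayed static. Crucially, as long as the full gradient information is needed less often than once every 1 / (1 - sparsity) iterations, the overall work remains proportional to the number of nonzero parameters.
Finally, in Appendix B we compare RigL with state-of-the-art structured Bayesian pruning methods. We show that sparse training requires far fewer resources when initially training or "compressing" a network, and that the final architecture found by RigL is smaller and needs fewer FLOPs for inference. This holds even when considering only the cost of training the smaller dense network found by the structured pruning methods.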

Original

The parameter and floating point operation (FLOP) efficiency of sparse neural networks is now well demonstrated on a variety of problems (Han et al., 2015; Srinivas et al., 2017). Multiple works have shown inference time speedups are possible using sparsity for both Recurrent Neural Networks (RNNs) (Kalchbrenner et al., 2018) and Convolutional Neural Networks (ConvNets) (Park et al., 2016; Elsen et al., 2019). Currently, the most accurate sparse models are obtained with techniques that require, at a minimum, the cost of training a dense model in terms of memory and FLOPs (Zhu & Gupta, 2018; Guo et al., 2016), and sometimes significantly more (Molchanov et al., 2017). This paradigm has two main limitations:

The maximum size of sparse models is limited to the largest dense model that can be trained. Even if sparse models are more parameter efficient, we can’t use pruning to train models that are larger and more accurate than the largest possible dense models.
It is inefficient. Large amounts of computation must be performed for parameters that are zero valued or that will be zero during inference.

Additionally, it remains unknown whether the performance of the current best pruning algorithms is an upper bound on the quality of sparse models. Gale et al. (2019) found that three different dense-to-sparse training algorithms all achieve about the same sparsity / accuracy trade-off. However, this is far from conclusive proof that no better performance is possible. In this work we show the surprising result that dynamic sparse training, which includes the method we introduce below, can find more accurate models than the current best approaches to pruning initially dense networks. Importantly, our method does not change the FLOPs required to execute the model during training, allowing one to decide on a specific inference cost prior to training.
The Lottery Ticket Hypothesis (Frankle & Carbin, 2019) hypothesized that if we can find a sparse neural network with iterative pruning, then we can train that sparse network from scratch, to the same level of accuracy, by starting from the original initial conditions. In this paper we introduce a new method for training sparse models without the need for a “lucky” initialization; for this reason, we call our method “The Rigged Lottery” or RigL. We show that this method is:

Memory efficient: It requires memory only proportional to the size of the sparse model. It never requires storing quantities that are the size of the dense model. This is in contrast to Dettmers & Zettlemoyer (2019), which requires storing the momentum for all parameters, even those that are zero valued.
Computationally efficient: The amount of computation required to train the model is proportional to the number of nonzero parameters in the model.
Accurate: The performance achieved by the method matches and sometimes exceeds the performance of pruning-based approaches.
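
To make the memory claim concrete, the following is a minimal sketch that counts stored values under a simplified accounting. The function name `stored_values`, the one-index-per-nonzero storage format, and the layer size are illustrative assumptions, not details taken from the paper.

```python
def stored_values(n_dense_params: int, sparsity: float, dense_momentum: bool) -> int:
    """Count values kept in memory for one layer (simplified accounting).

    Assumptions (not from the paper): each nonzero weight is stored with
    one index, and the optimizer keeps a single momentum buffer.
    """
    nonzero = round(n_dense_params * (1.0 - sparsity))
    weights = nonzero
    indices = nonzero  # assumed coordinate-style storage: one index per weight
    # A method that tracks momentum for every parameter needs a dense buffer;
    # a fully sparse method only needs momentum for the nonzero weights.
    momentum = n_dense_params if dense_momentum else nonzero
    return weights + indices + momentum


n_params = 1_000_000  # hypothetical layer size
for dense_momentum in (False, True):
    total = stored_values(n_params, sparsity=0.9, dense_momentum=dense_momentum)
    print(f"dense_momentum={dense_momentum}: {total} stored values")
```

Under these assumptions, the dense momentum buffer alone is as large as the dense model, which is the overhead the memory-efficiency point above says the method avoids.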

Our method works by infrequently using instantaneous gradient information to inform a re-wiring of the network. We show that this allows the optimization to escape local minima where it would otherwise become trapped if the sparsity pattern were to remain static. Crucially, as long as the full gradient information is needed less often than once every 1 / (1 − sparsity) iterations, the overall work remains proportional to the number of nonzero parameters, that is, to (1 − sparsity).
Finally, in Appendix B we compare RigL with state-of-the-art structured Bayesian pruning methods. We show that sparse training requires far fewer resources when initially training or “compressing” a network and that the final architecture found by RigL is smaller and requires fewer FLOPs for inference. This remains true even when only considering the cost of training the smaller dense network found by the structured pruning methods.
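
The amortization argument in the last sentence can be checked with a small calculation. This is a sketch under a deliberately simplified cost model, assumed here for illustration: a sparse step costs (1 − sparsity) of a dense step, and each full-gradient computation adds the cost of one dense step. The function name `amortized_cost_ratio` and the chosen sparsity are hypothetical.

```python
def amortized_cost_ratio(sparsity: float, update_interval: int) -> float:
    """Average per-step cost, in units of one dense training step.

    Simplified model (an assumption, not the paper's FLOP accounting):
    every step pays the sparse cost, and one extra dense-gradient
    computation is spread over `update_interval` steps.
    """
    density = 1.0 - sparsity
    dense_gradient_overhead = 1.0 / update_interval
    return density + dense_gradient_overhead


sparsity = 0.75
threshold = 1.0 / (1.0 - sparsity)  # full gradient at most once every 4 steps
print(f"threshold interval: {threshold:.0f} iterations")
for interval in (1, 4, 40, 400):
    ratio = amortized_cost_ratio(sparsity, interval)
    # Cost relative to a purely sparse step; stays at or below 2x once the
    # interval reaches the threshold.
    print(f"interval={interval:4d}  vs dense={ratio:.4f}  vs sparse={ratio / (1.0 - sparsity):.2f}x")
```

In this model, once the interval reaches 1 / (1 − sparsity) the dense-gradient overhead is no larger than the sparse cost itself, so the total stays within a constant factor of the sparse cost.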

Rigging the Lottery: Making All Tickets Winners [1 Introduction] [Paper, DeepL translation]
