Reading the EfficientNetV2 paper "EfficientNetV2: Smaller Models and Faster Training" (2021) with generative AI

Last updated at 2025-03-22 / Posted at 2024-08-26

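Introduction

I used generative AI to grasp, as far as possible, the content of the EfficientNetV2 paper "EfficientNetV2: Smaller Models and Faster Training". The AI reads the paper from start to finish so that I can see what it actually says.

(I read accessible explainer articles about papers but rarely check the papers themselves, so I tried using generative AI to get at the content.)

The paper turned out to make two main improvements:
The authors designed a model that removes the depthwise-convolution bottleneck by incorporating the more efficient Fused-MBConv operation.
They balanced training speed and accuracy by gradually increasing the image size during training while adaptively adjusting regularization (e.g. dropout and data augmentation).
(See the "What I learned" chapter at the end.)

The following sections show examples of what I asked ChatGPT.

Other examples: results of reading other papers with the same kind of method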

Target paper

Paper: (the EfficientNetV2 paper)

[2104.00298] EfficientNetV2: Smaller Models and Faster Training
https://arxiv.org/abs/2104.00298
(PDF: https://arxiv.org/pdf/2104.00298)

Section boundaries used for the questions

I split the paper's sections as shown below and asked the generative AI about each part in turn.

Abstract
---
1.-Introduction
---
2.-Related work
---
3.-EfficientNetV2 Architecture Design
3.1. Review of EfficientNet
3.2. Understanding Training Efficiency
---
3.3. Training-Aware NAS and Scaling
---
4.-Progressive Learning
4.1. Motivation
4.2. Progressive Learning with adaptive Regularization
---
5.-Main Results
5.1. ImageNet ILSVRC2012
---
5.2. ImageNet21k
---
5.3. Transfer Learning Datasets
---
6.-Ablation Studies
6.1. Comparison to EfficientNet
---
6.2. Progressive Learning for Different Networks
6.3. Importance of Adaptive Regularization
---
7.-Conclusion

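How I asked the generative AI

I copied and pasted one section's worth of text (a moderate length) from the paper or article, wrote my questions below it numbered ①, ②, …, and handed the whole thing to the AI. It answered every question accurately in one go, which worked very well. This collects just the information I want without reading the whole article.

Example question to the AI:

(paper/article section pasted here)

Please answer the following questions about the content above: (write in a plain declarative style; if nothing applies to a question, write "none"; give the corresponding figure/table numbers if any)
①What is this section about? + What are the key points? + List the corresponding figure/table numbers (key points as bullets; figure/table numbers on a single line)
②Explain any improvements, innovations, or techniques that are described.
③Explain any reported performance gains. (Give specific values and the corresponding figure/table numbers if any)
④Explain any account of the mechanism, i.e. why it works.
⑤If comparisons are described, summarize the differences in a table. (State the corresponding figure/table numbers if any)
⑥If anything other than ⑤ can be tabulated, summarize it in a table. (Give the corresponding figure/table numbers if any)

The questions cover what I always want to know when reading an article or paper: improvements, innovations, techniques, and explanations of the mechanism. Because a table is faster to take in, I also ask for the text to be converted into tables where possible.

Limiting the pasted text to about one section gave better answers. As the volume of the answer grows, the AI starts to abbreviate on its own to keep the answer from getting too long, which makes it more likely that important points are dropped.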
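The workflow above (split the paper at the `---` boundaries, then append the same numbered questions to each chunk) can be sketched in a few lines of Python. This is only an illustration of the manual copy-and-paste procedure: the function names `split_sections` and `build_prompt` are hypothetical, the question wording is an English paraphrase of my template, and no AI service is actually called.

```python
# Minimal sketch: turn a paper (with '---' lines between chunks) into
# one prompt per chunk. All names here are hypothetical helpers.

QUESTIONS = [
    "(1) What is this section about? Give the key points as bullets and list the figure/table numbers.",
    "(2) Explain any improvements, innovations, or techniques that are described.",
    "(3) Explain any reported performance gains, with values and figure/table numbers.",
    "(4) Explain any account of the mechanism, i.e. why it works.",
    "(5) If comparisons are described, summarize the differences in a table.",
    "(6) If anything other than (5) can be tabulated, summarize it in a table.",
]

HEADER = (
    "Please answer the following questions about the content above "
    "(write 'none' if nothing applies):"
)


def split_sections(paper_text, delimiter="---"):
    """Split text into chunks at lines that contain only the delimiter."""
    chunks, current = [], []
    for line in paper_text.splitlines():
        if line.strip() == delimiter:
            chunks.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]  # drop empty chunks


def build_prompt(section_text, questions=QUESTIONS):
    """Place the section text first, then the question template below it."""
    return section_text + "\n\n" + HEADER + "\n" + "\n".join(questions)


if __name__ == "__main__":
    paper = "Abstract\n(text)\n---\n1. Introduction\n(text)"
    for prompt in map(build_prompt, split_sections(paper)):
        print(prompt, end="\n\n")
```

Keeping one chunk per prompt mirrors the observation above that answers get abbreviated when too much text is pasted at once.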

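Prerequisites

Without some background in machine learning, the AI's answers alone will probably not be enough to understand the paper. The answers tend to pick out the paper's key points more or less as written, without much rewording.

It helps to read an accessible explainer article on EfficientNetV2 beforehand so that the core ideas are already familiar. Generative AI tends not to explain the fine-grained processing steps in an easy-to-follow way.

Cautions

Only the text of the paper is pasted into the questions. The figures and tables are not pasted, so their content is not reflected in the AI's answers and has to be checked in the paper itself.

Where the AI's answers below are hard to read or follow, the paper's own text is often written the same way and is just as hard to follow. Because papers build on prior work, they often omit explanations or mention things only in passing, so it is hard to reach a full understanding unless you work on the same kind of research. I take this to be a consequence of how papers are written rather than a weakness of the AI's answers.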

Asking the generative AI

The following shows what I actually asked ChatGPT.

Full questions and answers

Full details of the questions to the AI and its answers:

Abstract

▼Question (human)


(paper/article section pasted here) Abstract This paper introduces EfficientNetV2, a new family of convolutional networks that have faster training speed and better parameter efficiency than previous models. To develop these models, we use a combination of training-aware neural architecture search and scaling, to jointly optimize training speed and parameter efficiency. The models were searched from the search space enriched with new ops such as Fused-MBConv. Our experiments show that EfficientNetV2 models train much faster than state-of-the-art models while being up to 6.8x smaller. Our training can be further sped up by progressively increasing the image size during training, but it often causes a drop in accuracy. To compensate for this accuracy drop, we propose an improved method of progressive learning, which adaptively adjusts regularization (e.g. data augmentation) along with image size. With progressive learning, our EfficientNetV2 significantly outperforms previous models on ImageNet and CIFAR/Cars/Flowers datasets. By pretraining on the same ImageNet21k, our EfficientNetV2 achieves 87.3% top-1 accuracy on ImageNet ILSVRC2012, outperforming the recent ViT by 2.0% accuracy while training 5x-11x faster using the same computing resources. Code is available at https://github.com/google/automl/tree/master/efficientnetv2.

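Please answer the following questions about the content above: (write in a plain declarative style; if nothing applies to a question, write "none"; give the corresponding figure/table numbers if any)
①What is this section about? + What are the key points? + List the corresponding figure/table numbers (key points as bullets; figure/table numbers on a single line)
②Explain any improvements, innovations, or techniques that are described.
③Explain any reported performance gains. (Give specific values and the corresponding figure/table numbers if any)
④Explain any account of the mechanism, i.e. why it works.
⑤If comparisons are described, summarize the differences in a table. (State the corresponding figure/table numbers if any)
⑥If anything other than ⑤ can be tabulated, summarize it in a table. (Give the corresponding figure/table numbers if any)

▼Generative AI's answer (ChatGPT-4o)
①What is this section about? + Key points + Figure/table numbers
This section introduces EfficientNetV2, a new family of convolutional networks, and states that these models train faster and are more parameter-efficient than previous models.

Key points

EfficientNetV2 was developed by combining training-aware neural architecture search and scaling to jointly optimize training speed and parameter efficiency.
The models were searched from a search space enriched with new ops such as Fused-MBConv.
EfficientNetV2 trains much faster than state-of-the-art models while being up to 6.8x smaller.
Progressively increasing the image size during training speeds training up further, but often causes a drop in accuracy.
To compensate for the accuracy drop, the paper proposes an improved progressive learning method that adaptively adjusts regularization (e.g. data augmentation) along with image size.
EfficientNetV2 significantly outperforms previous models on ImageNet and the CIFAR/Cars/Flowers datasets.
With pretraining on ImageNet21k, EfficientNetV2 reaches 87.3% top-1 accuracy on ImageNet ILSVRC2012, 2.0% higher than the recent ViT, while training 5x-11x faster using the same computing resources.

Corresponding figure/table numbers: none

②Improvements, innovations, and techniques

The model was designed using a search space that includes new ops such as Fused-MBConv.
Progressive learning over image size, with regularization adjusted alongside it, speeds up training while avoiding the accuracy drop.

③Reported performance gains

EfficientNetV2 is up to 6.8x smaller than state-of-the-art models and trains much faster.
EfficientNetV2 reaches 87.3% top-1 accuracy on ImageNet ILSVRC2012, 2.0% higher than ViT, while training 5x-11x faster.

④Mechanism / why it works

Gradually increasing the image size and adjusting regularization to match makes it possible to speed up training while avoiding the accuracy drop.

⑤Comparisons

Item	EfficientNetV2	ViT	Table no.
Top-1 accuracy (ImageNet ILSVRC2012)	87.3%	85.3% (implied by the 2.0% gap)	none
Training speed	5x-11x faster	(baseline)	none
Model size	up to 6.8x smaller (vs. state-of-the-art models in general)	not stated for ViT	none

⑥Other content that can be tabulated
None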

1. Introduction

▼Question (human)


(paper/article section pasted here) 1. Introduction Training efficiency is important to deep learning as model size and training data size are increasingly larger. For example, GPT-3 (Brown et al., 2020), with much a larger model and more training data, demonstrates the remarkable capability in few shot learning, but it requires weeks of training with thousands of GPUs, making it difficult to retrain or improve. Training efficiency has gained significant interests recently. For instance, NFNets (Brock et al., 2021) aim to improve training efficiency by removing the expensive batch normalization; Several recent works (Srinivas et al., 2021) focus on improving training speed by adding attention layers into convolutional networks (ConvNets); Vision Transformers (Dosovitskiy et al., 2021) improves training efficiency on large-scale datasets by using Transformer blocks. However, these methods often come with expensive overhead on large parameter size, as shown in Figure 1(b). In this paper, we use an combination of training-aware neural architecture search (NAS) and scaling to improve both training speed and parameter efficiency. Given the parameter efficiency of EfficientNets (Tan & Le, 2019a), we start by systematically studying the training bottlenecks in EfficientNets. Our study shows in EfficientNets: (1) training with very large image sizes is slow; (2) depthwise convolutions are slow in early layers. (3) equally scaling up every stage is sub-optimal. Based on these observations, we design a search space enriched with additional ops such as Fused-MBConv, and apply training-aware NAS and scaling to jointly optimize model accuracy, training speed, and parameter size. Our found networks, named EfficientNetV2, train up to 4x faster than prior models (Figure 3), while being up to 6.8x smaller in parameter size. Our training can be further sped up by progressively increasing image size during training. Many previous works, such as progressive resizing (Howard, 2018), FixRes (Touvron et al., 2019), and Mix&Match (Hoffer et al., 2019), have used smaller image sizes in training; however, they usually keep the same regularization for all image sizes, causing a drop in accuracy. We argue that keeping the same regularization for different image sizes is not ideal: for the same network, small image size leads to small network capacity and thus requires weak regularization; vice versa, large image size requires stronger regularization to combat overfitting (see Section 4.1). Based on this insight, we propose an improved method of progressive learning: in the early training epochs, we train the network with small image size and weak regularization (e.g., dropout and data augmentation), then we gradually increase image size and add stronger regularization. Built upon progressive resizing (Howard, 2018), but by dynamically adjusting regularization, our approach can speed up the training without causing accuracy drop. With the improved progressive learning, our EfficientNetV2 achieves strong results on ImageNet, CIFAR-10, CIFAR-100, Cars, and Flowers dataset. On ImageNet, we achieve 85.7% top-1 accuracy while training 3x - 9x faster and being up to 6.8x smaller than previous models (Figure 1). Our EfficientNetV2 and progressive learning also make it easier to train models on larger datasets. For example, ImageNet21k (Russakovsky et al., 2015) is about 10x larger than ImageNet ILSVRC2012, but our EfficientNetV2 can finish the training within two days using moderate computing resources of 32 TPUv3 cores. By pretraining on the public ImageNet21k, our EfficientNetV2 achieves 87.3% top-1 accuracy on ImageNet ILSVRC2012, outperforming the recent ViT-L/16 by 2.0% accuracy while training 5x-11x faster (Figure 1). Our contributions are threefold:
• We introduce EfficientNetV2, a new family of smaller and faster models. Found by our training-aware NAS and scaling, EfficientNetV2 outperform previous models in both training speed and parameter efficiency.
• We propose an improved method of progressive learning, which adaptively adjusts regularization along with image size. We show that it speeds up training, and simultaneously improves accuracy.
• We demonstrate up to 11x faster training speed and up to 6.8x better parameter efficiency on ImageNet, CIFAR, Cars, and Flowers dataset, than prior art.

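▼Generative AI's answer (ChatGPT-4o)
①What is this section about? + Key points + Figure/table numbers
This section introduces EfficientNetV2, a new model family aimed at improving training efficiency in deep learning, and explains its design principles and the techniques used to improve performance.

Key points

As model sizes and training data grow, training efficiency is becoming increasingly important in deep learning.
EfficientNetV2 was designed by combining training-aware neural architecture search (NAS) and scaling to optimize training speed and parameter efficiency.
EfficientNetV2 removes the bottlenecks of EfficientNet, training up to 4x faster than prior models while being up to 6.8x smaller.
The paper proposes progressive learning, which adjusts image size and regularization together to speed up training without an accuracy drop.
EfficientNetV2 significantly outperforms previous models on the ImageNet, CIFAR-10, CIFAR-100, Cars, and Flowers datasets.

Corresponding figure/table numbers: Figure 1, Figure 3

②Improvements, innovations, and techniques

Training-aware neural architecture search (NAS) and scaling were adopted to optimize training speed and parameter efficiency in the design of EfficientNetV2.
The depthwise-convolution bottleneck was removed by introducing efficient new ops such as Fused-MBConv.
Progressive learning gradually increases the image size during training while adaptively strengthening regularization (e.g. dropout and data augmentation), balancing training speed and accuracy.

③Reported performance gains

With ImageNet21k pretraining, EfficientNetV2 reaches 87.3% top-1 accuracy on ImageNet ILSVRC2012, 2.0% higher than ViT-L/16, while training 5x-11x faster (Figure 1).
Compared with previous models, EfficientNetV2 is up to 6.8x smaller and trains up to 11x faster (Figure 1, Figure 3).

④Mechanism / why it works

Under progressive learning, weak regularization is applied at small image sizes and strong regularization at large image sizes, which prevents overfitting while keeping training fast.

⑤Comparisons

Item	EfficientNetV2	ViT-L/16	Figure/table no.
Top-1 accuracy (ImageNet ILSVRC2012)	87.3%	85.3% (implied by the 2.0% gap)	Figure 1
Training speed	5x-11x faster	(baseline)	Figure 1
Model size	up to 6.8x smaller (vs. previous models in general)	not stated for ViT-L/16	Figure 1, 3

⑥Other content that can be tabulated

Training bottleneck	EfficientNet (before)	EfficientNetV2 (improved)	Figure/table no.
Training speed at large image sizes	slow	sped up by progressive learning	Figure 1
Depthwise convolutions in early layers	slow	sped up with Fused-MBConv	Figure 1, 3
Equal scaling of every stage	sub-optimal	non-uniform scaling	Figure 1, 3

Corresponding figure/table numbers: Figure 1, Figure 3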

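▼Question (human)
"Up to 6.8x smaller" relative to what?

▼Generative AI's answer (ChatGPT-4o)
The phrase "up to 6.8x smaller" means that EfficientNetV2 has up to 6.8x fewer parameters (a smaller model size) than the original EfficientNet and other state-of-the-art models. The basis of comparison is mainly the parameter count of EfficientNetV2 against that of the original EfficientNet and other large models (for example, Vision Transformer).

Specifically, EfficientNetV2 greatly improves training speed while greatly reducing the number of parameters (model size) compared with previous models. The figure of 6.8x indicates that the parameter count of EfficientNetV2 is about 1/6.8 of that of EfficientNet or the other models being compared.

Checking the specific comparisons in the figures and tables may make clear which model this "6.8x smaller" refers to.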

2. Related work

▼Question (human)


(paper/article section pasted here) 2. Related work
Training and parameter efficiency: Many works, such as DenseNet (Huang et al., 2017) and EfficientNet (Tan & Le, 2019a), focus on parameter efficiency, aiming to achieve better accuracy with less parameters. Some more recent works aim to improve training or inference speed instead of parameter efficiency. For example, RegNet (Radosavovic et al., 2020), ResNeSt (Zhang et al., 2020), TResNet (Ridnik et al., 2020), and EfficientNet-X (Li et al., 2021) focus on GPU and/or TPU inference speed; NFNets (Brock et al., 2021) and BoTNets (Srinivas et al., 2021) focus on improving training speed. However, their training or inference speed often comes with the cost of more parameters. This paper aims to significantly improve both training speed and parameter efficiency than prior art.
Progressive training: Previous works have proposed different kinds of progressive training, which dynamically change the training settings or networks, for GANs (Karras et al., 2018), transfer learning (Karras et al., 2018), adversarial learning (Yu et al., 2019), and language models (Press et al., 2021). Progressive resizing (Howard, 2018) is mostly related to our approach, which aims to improve training speed. However, it usually comes with the cost of accuracy drop. Another closely related work is Mix&Match (Hoffer et al., 2019), which randomly sample different image size for each batch. Both progressive resizing and Mix&Match use the same regularization for all image sizes, causing a drop in accuracy. In this paper, our main difference is to adaptively adjust regularization as well so that we can improve both training speed and accuracy. Our approach is also partially inspired by curriculum learning (Bengio et al., 2009), which schedules training examples from easy to hard. Our approach also gradually increases learning difficulty by adding more regularization, but we don’t selectively pick training examples.
Neural architecture search (NAS): By automating the network design process, NAS has been used to optimize the network architecture for image classification (Zoph et al., 2018), object detection (Chen et al., 2019; Tan et al., 2020), segmentation (Liu et al., 2019), hyperparameters (Dong et al., 2020), and other applications (Elsken et al., 2019). Previous NAS works mostly focus on improving FLOPs efficiency (Tan & Le, 2019b;a) or inference efficiency (Tan et al., 2019; Cai et al., 2019; Wu et al., 2019; Li et al., 2021). Unlike prior works, this paper uses NAS to optimize training and parameter efficiency.

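▼Generative AI's answer (ChatGPT-4o)
①What is this section about? + Key points + Figure/table numbers
This section covers work related to EfficientNetV2, in particular research on training and parameter efficiency, progressive training, and neural architecture search (NAS).

Key points

Many works focus on improving parameter efficiency or on improving training/inference speed.
Several progressive training approaches aim to improve training speed, but the resulting accuracy drop is a problem.
NAS has mainly been used to optimize FLOPs efficiency or inference efficiency; this paper uses NAS to optimize training efficiency and parameter efficiency.

Corresponding figure/table numbers: none

②Improvements, innovations, and techniques

The paper points out that progressive resizing and Mix&Match use the same regularization for all image sizes, which causes a drop in accuracy, and proposes a progressive learning method that adaptively adjusts regularization to address this.
It introduces a new approach that uses NAS to optimize both training efficiency and parameter efficiency.

③Reported performance gains

Compared with progressive resizing and Mix&Match, the paper's progressive learning method improves training speed while also avoiding the accuracy drop.

④Mechanism / why it works

The section attributes the accuracy drop to applying the same regularization across different image sizes, and states that adjusting regularization along with image size improves both training speed and accuracy.

⑤Comparisons

Method	Aim / approach	Effect on accuracy
Progressive resizing	Improves training speed	Accuracy drops because regularization is fixed
Mix&Match	Randomly samples a different image size for each batch	Accuracy likewise drops (same regularization for all sizes)
This paper's progressive learning	Improves training speed while adaptively adjusting regularization	Accuracy drop is avoided

Corresponding figure/table numbers: none

⑥Other content that can be tabulated
None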

3. EfficientNetV2 Architecture Design

3.1. Review of EfficientNet

3.2. Understanding Training Efficiency

▼Question (human)


(paper/article section pasted here) 3. EfficientNetV2 Architecture Design In this section, we study the training bottlenecks of EfficientNet (Tan & Le, 2019a), and introduce our training-aware NAS and scaling, as well as EfficientNetV2 models.
3.1. Review of EfficientNet EfficientNet (Tan & Le, 2019a) is a family of models that are optimized for FLOPs and parameter efficiency. It leverages NAS to search for the baseline EfficientNet-B0 that has better trade-off on accuracy and FLOPs. The baseline model is then scaled up with a compound scaling strategy to obtain a family of models B1-B7. While recent works have claimed large gains on training or inference speed, they are often worse than EfficientNet in terms of parameters and FLOPs efficiency (Table 1). In this paper, we aim to improve the training speed while maintaining the parameter efficiency.
3.2. Understanding Training Efficiency We study the training bottlenecks of EfficientNet (Tan & Le, 2019a), henceforth is also called EfficientNetV1, and a few simple techniques to improve training speed.
Training with very large image sizes is slow: As pointed out by previous works (Radosavovic et al., 2020), EfficientNet’s large image size results in significant memory usage. Since the total memory on GPU/TPU is fixed, we have to train these models with smaller batch size, which drastically slows down the training. A simple improvement is to apply FixRes (Touvron et al., 2019), by using a smaller image size for training than for inference. As shown in Table 2, smaller image size leads to less computations and enables large batch size, and thus improves training speed by up to 2.2x. Notably, as pointed out in (Touvron et al., 2020; Brock et al., 2021), using smaller image size for training also leads to slightly better accuracy. But unlike (Touvron et al., 2019), we do not finetune any layers after training. In Section 4, we will explore a more advanced training approach, by progressively adjusting image size and regularization during training.
Depthwise convolutions are slow in early layers but effective in later stages: Another training bottleneck of EfficientNet comes from the extensive depthwise convolutions (Sifre, 2014). Depthwise convolutions have fewer parameters and FLOPs than regular convolutions, but they often cannot fully utilize modern accelerators. Recently, Fused-MBConv is proposed in (Gupta & Tan, 2019) and later used in (Gupta & Akin, 2020; Xiong et al., 2020; Li et al., 2021) to better utilize mobile or server accelerators. It replaces the depthwise conv3x3 and expansion conv1x1 in MBConv (Sandler et al., 2018; Tan & Le, 2019a) with a single regular conv3x3, as shown in Figure 2. To systematically compares these two building blocks, we gradually replace the original MBConv in EfficientNet-B4 with Fused-MBConv (Table 3). When applied in early stage 1-3, Fused-MBConv can improve training speed with a small overhead on parameters and FLOPs, but if we replace all blocks with Fused-MBConv (stage 1-7), then it significantly increases parameters and FLOPs while also slowing down the training. Finding the right combination of these two building blocks, MBConv and Fused-MBConv, is non-trivial, which motivates us to leverage neural architecture search to automatically search for the best combination.
Equally scaling up every stage is sub-optimal: EfficientNet equally scales up all stages using a simple compound scaling rule. For example, when depth coefficient is 2, then all stages in the networks would double the number of layers. However, these stages are not equally contributed to the training speed and parameter efficiency. In this paper, we will use a non-uniform scaling strategy to gradually add more layers to later stages. In addition, EfficientNets aggressively scale up image size, leading to large memory consumption and slow training. To address this issue, we slightly modify the scaling rule and restrict the maximum image size to a smaller value.

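▼Generative AI's answer (ChatGPT-4o)
①What is this section about? + Key points + Figure/table numbers
This section analyzes the bottlenecks in EfficientNet's training efficiency and describes the EfficientNetV2 architecture design choices that address them.

Key points

EfficientNet is a family of models optimized for FLOPs and parameter efficiency.
The section studies EfficientNet's training efficiency and proposes simple improvements.
Large image sizes and depthwise convolutions in early layers are bottlenecks that slow down training.
Training speed is improved by introducing Fused-MBConv and a non-uniform scaling strategy.
Non-uniform scaling is better than uniform scaling for both training speed and parameter efficiency.

Corresponding figure/table numbers: Figure 2, Table 1, Table 2, Table 3

②Improvements, innovations, and techniques

Applying FixRes: training uses a smaller image size than inference, which improves training speed by up to 2.2x (Table 2).
Adopting Fused-MBConv: replacing MBConv with Fused-MBConv in the early stages improves training speed with only a small overhead in parameters and FLOPs (Figure 2, Table 3).
Non-uniform scaling: more layers are gradually added to later stages rather than scaling every stage equally, and the maximum image size is restricted to limit memory consumption and slow training.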
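The difference between equal and non-uniform depth scaling can be illustrated with a small sketch. The paper text quoted above only says that more layers are gradually added to later stages; the exact rule is not given, so `extra_late_layers` below is a hypothetical stand-in. The base layer counts are illustrative per-stage values, and rounding up with `math.ceil` is an assumption.

```python
import math

# Illustrative per-stage layer counts (stage 1 .. stage 7); not from the quoted text.
BASE_LAYERS = [1, 2, 2, 3, 3, 4, 1]


def uniform_scale(layers, depth_coef):
    """Equal scaling: every stage is multiplied by the same depth coefficient."""
    return [math.ceil(depth_coef * n) for n in layers]


def non_uniform_scale(layers, depth_coef, extra_late_layers):
    """Hypothetical non-uniform rule: equal scaling plus extra layers
    for selected later stages, given as {stage_index: extra_layers}."""
    scaled = uniform_scale(layers, depth_coef)
    for stage_index, extra in extra_late_layers.items():
        scaled[stage_index] += extra
    return scaled


if __name__ == "__main__":
    # Depth coefficient 2 doubles the layer count of every stage.
    print(uniform_scale(BASE_LAYERS, 2.0))
    # A smaller coefficient, with extra layers only in two later stages.
    print(non_uniform_scale(BASE_LAYERS, 1.5, {4: 2, 5: 3}))
```

With depth coefficient 2, every stage doubles, which matches the example in the quoted text; the second call concentrates the added depth in later stages instead.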

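③Reported performance gains

Applying FixRes improves training speed by up to 2.2x (Table 2).
Using Fused-MBConv improves training speed, with the effect most pronounced in the early stages (Table 3).

④Mechanism / why it works

Effect of FixRes: using a smaller image size during training reduces memory usage and allows a larger batch size, which speeds up training.
Effect of Fused-MBConv: using Fused-MBConv in place of depthwise convolutions makes better use of modern accelerators, which improves training speed.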
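A rough back-of-the-envelope calculation shows why a smaller training image helps. The sketch assumes that per-image compute and activation memory scale with the number of pixels, which is a simplification, and the image sizes and batch size are illustrative values rather than the settings in Table 2. The helper names are hypothetical.

```python
def relative_pixels(train_size, inference_size):
    """Pixel count of the training image relative to the inference image."""
    return (train_size / inference_size) ** 2


def max_batch_size(base_batch, train_size, inference_size):
    """Batch size that fits in the same memory, assuming activation
    memory per image is proportional to its pixel count."""
    return int(base_batch / relative_pixels(train_size, inference_size))


if __name__ == "__main__":
    ratio = relative_pixels(384, 512)
    print(f"pixels per image: {ratio:.4f} of the original")
    print("batch size in the same memory:", max_batch_size(128, 384, 512))
```

Training at 384 instead of 512 pixels per side leaves about 56% of the pixels per image, so under this assumption the same memory holds a batch roughly 1.8 times larger.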

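⑤Comparisons

Method	Training speed	Parameter efficiency	FLOPs efficiency
EfficientNet (original)	baseline	good	good
FixRes + EfficientNet	up to 2.2x faster	good	good
EfficientNet + Fused-MBConv	faster (early stages)	slightly worse (small overhead)	slightly worse (small overhead)

Corresponding figure/table numbers: Table 2, Table 3

⑥Other content that can be tabulated

None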

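▼Question (human)
How does FixRes work? (Or is it just an existing method applied as is?)

▼Generative AI's answer (ChatGPT-4o)
FixRes is an existing method applied as is. Specifically, the image size used for training is made smaller than the image size used for inference, which reduces memory usage and allows a larger batch size. This improves training speed. A larger image size is used at inference to raise accuracy, but no fine-tuning is performed after training; the model is used as trained.

▼Question (human)
How does Fused-MBConv work? (Or is it just an existing method applied as is?)

▼Generative AI's answer (ChatGPT-4o)
Fused-MBConv is an existing method, proposed in earlier work to make better use of accelerators. It is a variant of the MBConv block (introduced in MobileNetV2) that is aimed in particular at improving training speed.

Specifically, Fused-MBConv replaces the depthwise conv3x3 and the expansion conv1x1 used in the MBConv block with a single regular conv3x3. This avoids the poor accelerator utilization of depthwise convolutions and makes better use of mobile and server accelerators.

When Fused-MBConv is applied to the early stages of EfficientNet (stages 1-3), it improves training speed with only a small overhead in parameters and FLOPs (floating-point operations). However, when Fused-MBConv is used in all stages (stages 1-7), parameters and FLOPs increase significantly and training also slows down, so finding the right combination matters. Neural architecture search (NAS) is therefore used to search for the best combination automatically.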
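To make the parameter overhead concrete, the sketch below counts convolution weights for one MBConv block and one Fused-MBConv block. It is a rough illustration under stated assumptions: only convolution weights are counted (no batch norm, bias, or squeeze-and-excitation), the function names are hypothetical, and the channel widths are illustrative values rather than entries from Table 3.

```python
def mbconv_params(c_in, c_out, expansion, k=3):
    """MBConv: 1x1 expansion conv -> kxk depthwise conv -> 1x1 projection conv."""
    mid = c_in * expansion
    expand = c_in * mid        # 1x1 conv, c_in -> mid channels
    depthwise = k * k * mid    # one kxk filter per channel
    project = mid * c_out      # 1x1 conv, mid -> c_out channels
    return expand + depthwise + project


def fused_mbconv_params(c_in, c_out, expansion, k=3):
    """Fused-MBConv: a single regular kxk conv replaces expansion + depthwise."""
    mid = c_in * expansion
    fused = k * k * c_in * mid  # regular kxk conv, c_in -> mid channels
    project = mid * c_out       # 1x1 conv, mid -> c_out channels
    return fused + project


if __name__ == "__main__":
    # Illustrative narrow (early) and wide (late) stage widths.
    for name, channels, expansion in [("early", 24, 4), ("late", 160, 6)]:
        mb = mbconv_params(channels, channels, expansion)
        fused = fused_mbconv_params(channels, channels, expansion)
        print(f"{name}: MBConv={mb:,} Fused-MBConv={fused:,} extra={fused - mb:,}")
```

Under these assumptions the fused block always has more weights, and the absolute increase grows quickly with channel width, which is consistent with the observation that fusing only the narrow early stages costs little while fusing every stage inflates parameters and FLOPs.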

3.3. Training-Aware NAS and Scaling

▼Question (human)


(paper/article section pasted here) 3.3. Training-Aware NAS and Scaling To this end, we have learned multiple design choices for improving training speed. To search for the best combinations of those choices, we now propose a training-aware NAS.
NAS Search: Our training-aware NAS framework is largely based on previous NAS works (Tan et al., 2019; Tan & Le, 2019a), but aims to jointly optimize accuracy, parameter efficiency, and training efficiency on modern accelerators. Specifically, we use EfficientNet as our backbone. Our search space is a stage-based factorized space similar to (Tan et al., 2019), which consists of the design choices for convolutional operation types {MBConv, Fused-MBConv}, number of layers, kernel size {3x3, 5x5}, expansion ratio {1, 4, 6}. On the other hand, we reduce the search space size by (1) removing unnecessary search options such as pooling skip ops, since they are never used in the original EfficientNets; (2) reusing the same channel sizes from the backbone as they are already searched in (Tan & Le, 2019a). Since the search space is smaller, we can apply reinforcement learning (Tan et al., 2019) or simply random search on much larger networks that have comparable size as EfficientNet-B4. Specifically, we sample up to 1000 models and train each model about 10 epochs with reduced image size for training. Our search reward combines the model accuracy $A$, the normalized training step time $S$, and the parameter size $P$, using a simple weighted product $A \cdot S^w \cdot P^v$, where $w = -0.07$ and $v = -0.05$ are empirically determined to balance the trade-offs similar to (Tan et al., 2019).
EfficientNetV2 Architecture: Table 4 shows the architecture for our searched model EfficientNetV2-S. Compared to the EfficientNet backbone, our searched EfficientNetV2 has several major distinctions: (1) The first difference is EfficientNetV2 extensively uses both MBConv (Sandler et al., 2018; Tan & Le, 2019a) and the newly added fused-MBConv (Gupta & Tan, 2019) in the early layers. (2) Secondly, EfficientNetV2 prefers smaller expansion ratio for MBConv since smaller expansion ratios tend to have less memory access overhead. (3) Thirdly, EfficientNetV2 prefers smaller 3x3 kernel sizes, but it adds more layers to compensate the reduced receptive field resulted from the smaller kernel size. (4) Lastly, EfficientNetV2 completely removes the last stride-1 stage in the original EfficientNet, perhaps due to its large parameter size and memory access overhead.
EfficientNetV2 Scaling: We scale up EfficientNetV2-S to obtain EfficientNetV2-M/L using similar compound scaling as (Tan & Le, 2019a), with a few additional optimizations: (1) we restrict the maximum inference image size to 480, as very large images often lead to expensive memory and training speed overhead; (2) as a heuristic, we also gradually add more layers to later stages (e.g., stage 5 and 6 in Table 4) in order to increase the network capacity without adding much runtime overhead.
Training Speed Comparison: Figure 3 compares the training step time for our new EfficientNetV2, where all models are trained with fixed image size without progressive learning. For EfficientNet (Tan & Le, 2019a), we show two curves: one is trained with the original inference size, and the other is trained with about 30% smaller image size, same as EfficientNetV2 and NFNet (Touvron et al., 2019; Brock et al., 2021). All models are trained with 350 epochs, except NFNets are trained with 360 epochs, so all models have a similar number of training steps. Interestingly, we observe that when trained properly, EfficientNets still achieve pretty strong performance trade-off. More importantly, with our training-aware NAS and scaling, our proposed EfficientNetV2 model train much faster than the other recent models. These results also align with our inference results as shown in Table 7 and Figure 5.

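▼Generative AI's answer (ChatGPT-4o)
①What is this section about? + Key points + Figure/table numbers
This section describes the new NAS (neural architecture search) method and scaling method used to improve the training speed of the EfficientNetV2 models. The corresponding figures/tables are Table 4, Figure 3, Table 7, and Figure 5.

Key points

The paper proposes training-aware NAS to search for the best combination of the design choices identified for improving training speed.
EfficientNet is used as the backbone, and the search covers convolutional operation type, number of layers, kernel size, and expansion ratio.
EfficientNetV2 uses both MBConv and Fused-MBConv in the early layers, improving computational efficiency there.
For scaling, the maximum inference image size is restricted and more layers are added to later stages, increasing capacity without adding much runtime overhead.

Corresponding figure/table numbers: Table 4, Figure 3, Table 7, Figure 5

②Improvements, innovations, and techniques

The NAS search space is reduced by removing unnecessary options such as pooling skip ops and by reusing the channel sizes already searched for the backbone.
A combination of MBConv and Fused-MBConv improves computational efficiency in the early layers.
Smaller expansion ratios and smaller kernel sizes reduce memory access overhead, with more layers added to compensate for the reduced receptive field.
Because the search space is smaller, reinforcement learning or simple random search can be applied to much larger networks of a size comparable to EfficientNet-B4.

③Reported performance gains

EfficientNetV2 trains much faster than other recent models and achieves a better trade-off (see Figure 3, Table 7, Figure 5).

④Mechanism / why it works

Smaller expansion ratios are preferred because they tend to have less memory access overhead; smaller 3x3 kernels are preferred as well, with layers added to compensate for the reduced receptive field, so that efficiency improves while performance is maintained.

⑤Comparisons

Model	Training speed	Trade-off	Figure/table no.
EfficientNet	slower	still a strong trade-off when trained properly	Figure 3
EfficientNetV2	faster	better trade-off	Figure 3, Table 7, Figure 5

⑥Other content that can be tabulated
None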

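▼Question (human)
How exactly does training-aware NAS work?

▼Generative AI's answer (ChatGPT-4o)
The procedure for training-aware NAS is as follows.

Defining the search space:
- EfficientNet is used as the backbone, and the search space is a stage-based factorized space consisting of:
  - Convolutional operation type: {MBConv, Fused-MBConv}
  - Number of layers
  - Kernel size: {3x3, 5x5}
  - Expansion ratio: {1, 4, 6}
Reducing the search space:
- The search space is reduced in two ways:
  1. Removing unnecessary options such as pooling skip ops, which are never used in the original EfficientNets.
  2. Reusing the channel sizes already searched in (Tan & Le, 2019a).
Sampling and training models:
- Up to 1000 models are sampled from the search space, and each is trained for about 10 epochs with a reduced image size.
Defining the reward:
- The reward combines model accuracy $A$, normalized training step time $S$, and parameter size $P$ as a weighted product:
  $A \cdot S^w \cdot P^v$
  where the exponents $w = -0.07$ and $v = -0.05$ are determined empirically to balance the trade-offs.
Applying the search method:
- Because the search space is smaller, reinforcement learning (Tan et al., 2019) or simple random search can be applied to much larger networks of a size comparable to EfficientNet-B4.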
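The reward above is simple enough to write down directly. The sketch below is an illustration only: the function name `search_reward` and the candidate values are hypothetical, and the quoted text does not say how step time and parameter size are normalized, so the units here (step time relative to a reference model, parameters in millions) are assumptions.

```python
def search_reward(accuracy, step_time, params, w=-0.07, v=-0.05):
    """Weighted product A * S^w * P^v with the exponents quoted above.

    Negative exponents mean a slower step time or a larger model
    lowers the reward for the same accuracy.
    """
    return accuracy * (step_time ** w) * (params ** v)


if __name__ == "__main__":
    # Hypothetical candidates: (accuracy, relative step time, params in millions)
    candidates = {
        "baseline": (0.80, 1.0, 20.0),
        "faster, slightly less accurate": (0.79, 0.5, 20.0),
        "more accurate, slower and larger": (0.81, 2.0, 40.0),
    }
    for name, (acc, step, size) in candidates.items():
        print(f"{name}: reward = {search_reward(acc, step, size):.4f}")
```

Because both exponents are small in magnitude, accuracy dominates the reward, and speed and size act as mild penalties rather than hard constraints.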

4. Progressive Learning

4.1. Motivation

4.2. Progressive Learning with adaptive Regularization

▼Question (human)


(paper/article section pasted here) 4. Progressive Learning
4.1. Motivation As discussed in Section 3, image size plays an important role in training efficiency. In addition to FixRes (Touvron et al., 2019), many other works dynamically change image sizes during training (Howard, 2018; Hoffer et al., 2019), but they often cause a drop in accuracy. We hypothesize the accuracy drop comes from the unbalanced regularization: when training with different image sizes, we should also adjust the regularization strength accordingly (instead of using a fixed regularization as in previous works). In fact, it is common that large models require stronger regularization to combat overfitting: for example, EfficientNet-B7 uses larger dropout and stronger data augmentation than the B0. In this paper, we argue that even for the same network, smaller image size leads to smaller network capacity and thus needs weaker regularization; vice versa, larger image size leads to more computations with larger capacity, and thus more vulnerable to overfitting. To validate our hypothesis, we train a model, sampled from our search space, with different image sizes and data augmentations (Table 5). When image size is small, it has the best accuracy with weak augmentation; but for larger images, it performs better with stronger augmentation. This insight motivates us to adaptively adjust regularization along with image size during training, leading to our improved method of progressive learning.
4.2. Progressive Learning with adaptive Regularization Figure 4 illustrates the training process of our improved progressive learning: in the early training epochs, we train the network with smaller images and weak regularization, such that the network can learn simple representations easily and fast. Then, we gradually increase image size but also making learning more difficult by adding stronger regularization. Our approach is built upon (Howard, 2018) that progressively changes image size, but here we adaptively adjust regularization as well. Formally, suppose the whole training has $N$ total steps, the target image size is $S_e$, with a list of regularization magnitude $\Phi_e = \{\phi_e^k\}$, where $k$ represents a type of regularization such as dropout rate or mixup rate value. We divide the training into $M$ stages: for each stage $1 \le i \le M$, the model is trained with image size $S_i$ and regularization magnitude $\Phi_i = \{\phi_i^k\}$. The last stage $M$ would use the targeted image size $S_e$ and regularization $\Phi_e$. For simplicity, we heuristically pick the initial image size $S_0$ and regularization $\Phi_0$, and then use a linear interpolation to determine the value for each stage. Algorithm 1 summarizes the procedure. At the beginning of each stage, the network will inherit all weights from the previous stage. Unlike transformers, whose weights (e.g., position embedding) may depend on input length, ConvNet weights are independent to image sizes and thus can be inherited easily. Our improved progressive learning is generally compatible to existing regularization. For simplicity, this paper mainly studies the following three types of regularization:
• Dropout (Srivastava et al., 2014): a network-level regularization, which reduces co-adaptation by randomly dropping channels. We will adjust the dropout rate $\gamma$.
• RandAugment (Cubuk et al., 2020): a per-image data augmentation, with adjustable magnitude $\epsilon$.
• Mixup (Zhang et al., 2018): a cross-image data augmentation. Given two images with labels $(x_i, y_i)$ and $(x_j, y_j)$, it combines them with mixup ratio $\lambda$: $\tilde{x}_i = \lambda x_j + (1 - \lambda) x_i$ and $\tilde{y}_i = \lambda y_j + (1 - \lambda) y_i$. We would adjust mixup ratio $\lambda$ during training.

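▼Generative AI's answer (ChatGPT-4o)
① What is this about? + What is the gist? + List of figure/table numbers
This section explains the progressive learning method and its motivation. In particular, it proposes adjusting the regularization strength adaptively as the image size changes in order to make training more efficient. The corresponding figure is Figure 4.

Gist

The accuracy drop seen when training with different image sizes is hypothesized to come from unbalanced regularization strength.
Experiments confirm that weak regularization works best for small image sizes and strong regularization works well for large image sizes.
A progressive learning method is proposed that adjusts regularization together with image size during training.
The proposed method is compatible with existing regularization methods (Dropout, RandAugment, Mixup).

Corresponding figure/table numbers: Figure 4, Table 5

② Improvements, ideas, and techniques

Progressive learning is introduced, which improves training efficiency by adjusting the regularization strength according to the image size.
Training is divided into multiple stages, and the image size and regularization strength for each stage are determined by linear interpolation.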
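
As a rough illustration of the staged linear interpolation described above, here is a minimal Python sketch. It is not the paper's Algorithm 1 verbatim: the names `StageConfig`, `lerp`, and `progressive_schedule`, the 0-based stage index, and the start/end values (image size 128 to 300, dropout 0.1 to 0.3, RandAugment magnitude 5 to 15, mixup 0) are assumptions chosen for illustration and are not taken from the pasted text.

```python
from dataclasses import dataclass


@dataclass
class StageConfig:
    """Settings for one training stage (hypothetical container)."""
    image_size: int
    dropout: float
    randaug_magnitude: float
    mixup_alpha: float


def lerp(start, end, t):
    """Linear interpolation between start and end for t in [0, 1]."""
    return start + (end - start) * t


def progressive_schedule(first, last, num_stages):
    """Interpolate image size and regularization from `first` to `last`.

    Assumption: stages are indexed 0..num_stages-1 and the last stage
    uses exactly the target values.
    """
    if num_stages < 2:
        raise ValueError("need at least two stages to interpolate")
    stages = []
    for i in range(num_stages):
        t = i / (num_stages - 1)
        stages.append(StageConfig(
            image_size=int(round(lerp(first.image_size, last.image_size, t))),
            dropout=lerp(first.dropout, last.dropout, t),
            randaug_magnitude=lerp(first.randaug_magnitude,
                                   last.randaug_magnitude, t),
            mixup_alpha=lerp(first.mixup_alpha, last.mixup_alpha, t),
        ))
    return stages


# Illustrative values only (assumed, not quoted from the paper text above).
schedule = progressive_schedule(
    StageConfig(image_size=128, dropout=0.1, randaug_magnitude=5.0, mixup_alpha=0.0),
    StageConfig(image_size=300, dropout=0.3, randaug_magnitude=15.0, mixup_alpha=0.0),
    num_stages=4,
)
for index, stage in enumerate(schedule):
    print(index, stage)
```

Each stage would start from the weights of the previous stage, so only the input pipeline and the regularization settings change between stages.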

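③ Reported performance improvements

Applying weak regularization to small image sizes and strong regularization to large image sizes is shown to improve accuracy (see Table 5).
The proposed method prevents the accuracy drop during training and enables more efficient learning.

④ Explanation of mechanisms / why it works

The paper explains that, for the same network, a small image size leads to smaller network capacity, so weak regularization is appropriate. A large image size, on the other hand, increases computation and capacity, so stronger regularization is needed to prevent overfitting.

⑤ Comparisons

Image size	Regularization strength	Effect on accuracy	Table/figure number
Small	Weak	High accuracy	Table 5
Large	Strong	High accuracy	Table 5

⑥ Other parts that could be made into a table
None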
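
The Mixup rule quoted in the pasted text (x̃_i = λ x_j + (1 − λ) x_i, and the same for the labels) can be sketched in a few lines of plain Python. This is only an illustration: `mixup_pair` and `sample_lambda` are hypothetical helper names, images are represented as flat lists of floats, and drawing λ from a Beta(α, α) distribution is the usual Mixup convention rather than something stated in the pasted section.

```python
import random


def mixup_pair(x_i, y_i, x_j, y_j, lam):
    """Blend two (image, one-hot label) pairs with mixup ratio `lam`.

    Follows the formula in the pasted text:
    x_mix = lam * x_j + (1 - lam) * x_i, and likewise for the labels.
    """
    x_mix = [lam * b + (1.0 - lam) * a for a, b in zip(x_i, x_j)]
    y_mix = [lam * b + (1.0 - lam) * a for a, b in zip(y_i, y_j)]
    return x_mix, y_mix


def sample_lambda(alpha, rng):
    """Draw a mixup ratio from Beta(alpha, alpha) (assumed convention).

    alpha <= 0 is treated as "mixup disabled" and returns 0.0,
    which leaves x_i and y_i unchanged.
    """
    if alpha <= 0.0:
        return 0.0
    return rng.betavariate(alpha, alpha)


x_mix, y_mix = mixup_pair(
    x_i=[0.0, 0.0], y_i=[1.0, 0.0],
    x_j=[1.0, 1.0], y_j=[0.0, 1.0],
    lam=0.25,
)
print(x_mix, y_mix)
```

Under progressive learning, the strength of this augmentation would be raised in later stages, when the image size is larger.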

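▼Question (human)
What exactly does "capacity" mean in "the network's capacity becomes smaller"?

▼Generative AI's answer (ChatGPT-4o)
"Network capacity" refers to the amount and complexity of information a neural network can learn and represent. Concretely, it is affected by factors such as the number of parameters, the depth of the network, and the number of units. A network with larger capacity can learn from more data and capture more complex patterns, but it also carries a higher risk of overfitting.

"The network's capacity becomes smaller" means that, for example, when the input image size is reduced, the network receives less information, so the features and patterns it can learn are limited. Because the network's representational power is constrained as a result, this is described as "capacity becoming smaller".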

5. Main Results

5.1. ImageNet ILSVRC2012

▼Question (human)

(Pasted section of the paper/article) 5. Main Results This section presents our experimental setups, the main results on ImageNet, and the transfer learning results on CIFAR-10, CIFAR-100, Cars, and Flowers.

5.1. ImageNet ILSVRC2012 Setup: ImageNet ILSVRC2012 (Russakovsky et al., 2015) contains about 1.28M training images and 50,000 validation images with 1000 classes. During architecture search or hyperparameter tuning, we reserve 25,000 images (about 2%) from the training set as minival for accuracy evaluation. We also use minival to perform early stopping. Our ImageNet training settings largely follow EfficientNets (Tan & Le, 2019a): RMSProp optimizer with decay 0.9 and momentum 0.9; batch norm momentum 0.99; weight decay 1e-5. Each model is trained for 350 epochs with total batch size 4096. Learning rate is first warmed up from 0 to 0.256, and then decayed by 0.97 every 2.4 epochs. We use exponential moving average with 0.9999 decay rate, RandAugment (Cubuk et al., 2020), Mixup (Zhang et al., 2018), Dropout (Srivastava et al., 2014), and stochastic depth (Huang et al., 2016) with 0.8 survival probability. For progressive learning, we divide the training process into four stages with about 87 epochs per stage: the early stage uses a small image size with weak regularization, while the later stages use larger image sizes with stronger regularization, as described in Algorithm 1. Table 6 shows the minimum (for the first stage) and maximum (for the last stage) values of image size and regularization. For simplicity, all models use the same minimum values of size and regularization, but they adopt different maximum values, as larger models generally require more regularization to combat overfitting. Following (Touvron et al., 2020), our maximum image size for training is about 20% smaller than inference, but we don’t finetune any layers after training.

Results: As shown in Table 7, our EfficientNetV2 models are significantly faster and achieves better accuracy and parameter efficiency than previous ConvNets and Transformers on ImageNet. In particular, our EfficientNetV2-M achieves comparable accuracy to EfficientNet-B7 while training 11x faster using the same computing resources. Our EfficientNetV2 models also significantly outperform all recent RegNet and ResNeSt, in both accuracy and inference speed. Figure 1 further visualizes the comparison on training speed and parameter efficiency. Notably, this speedup is a combination of progressive training and better networks, and we will study the individual impact for each of them in our ablation studies.

Recently, Vision Transformers have demonstrated impressive results on ImageNet accuracy and training speed. However, here we show that properly designed ConvNets with improved training method can still largely outperform vision transformers in both accuracy and training efficiency. In particular, our EfficientNetV2-L achieves 85.7% top-1 accuracy, surpassing ViT-L/16(21k), a much larger transformer model pretrained on a larger ImageNet21k dataset. Here, ViTs are not well tuned on ImageNet ILSVRC2012; DeiTs use the same architectures as ViTs, but achieve better results by adding more regularization.

Although our EfficientNetV2 models are optimized for training, they also perform well for inference, because training speed often correlates with inference speed. Figure 5 visualizes the model size, FLOPs, and inference latency based on Table 7. Since latency often depends on hardware and software, here we use the same PyTorch Image Models codebase (Wightman, 2021) and run all models on the same machine using the batch size 16. In general, our models have slightly better parameters/FLOPs efficiency than EfficientNets, but our inference latency is up to 3x faster than EfficientNets. Compared to the recent ResNeSt that are specially optimized for GPUs, our EfficientNetV2-M achieves 0.6% better accuracy with 2.8x faster inference speed.

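▼Generative AI's answer (ChatGPT-4o)
① What is this about? + What is the gist? + List of figure/table numbers
This section describes the experimental setup and main results on ImageNet ILSVRC2012; its opening sentence also announces the transfer learning results on CIFAR-10, CIFAR-100, Cars, and Flowers. It mainly compares the EfficientNetV2 models with earlier ConvNets and Vision Transformers and reports better results.

Gist

Training settings for the EfficientNetV2 models on ImageNet ILSVRC2012.
The accuracy and parameter efficiency of the EfficientNetV2 models exceed those of earlier ConvNets and Transformers.
EfficientNetV2-M reaches accuracy comparable to EfficientNet-B7 while training 11x faster.
Results showing that ConvNets can still outperform the Vision Transformer (ViT).
The inference latency of the EfficientNetV2 models is up to 3x faster than that of the EfficientNet models.

Corresponding figure/table numbers: Table 6, Table 7, Figure 1, Figure 5

② Improvements, ideas, and techniques

For ImageNet training, the EfficientNetV2 models combine the RMSProp optimizer, RandAugment, Mixup, Dropout, and stochastic depth.
Progressive learning is introduced: training is divided into four stages, with a small image size and weak regularization in the early stage and larger image sizes and stronger regularization in the later stages (see Table 6).
An exponential moving average of the weights (decay rate 0.9999) is used during training.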
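
The learning-rate rule quoted in the pasted text ("warmed up from 0 to 0.256, and then decayed by 0.97 every 2.4 epochs") can be written as a small function. The sketch below rests on several assumptions that the pasted text does not state: a linear warmup lasting 5 epochs (`warmup_epochs` is a hypothetical value), a smooth rather than staircase decay, and decay that starts counting once warmup ends.

```python
def learning_rate(epoch, peak_lr=0.256, warmup_epochs=5.0,
                  decay_rate=0.97, decay_every=2.4):
    """Warmup followed by exponential decay (illustrative sketch).

    `epoch` may be fractional. The warmup length is an assumed value;
    only peak_lr, decay_rate, and decay_every come from the pasted text.
    """
    if epoch < warmup_epochs:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * epoch / warmup_epochs
    # Multiply by decay_rate once per `decay_every` epochs after warmup.
    return peak_lr * decay_rate ** ((epoch - warmup_epochs) / decay_every)


for e in (0, 2.5, 5, 7.4, 100, 350):
    print(e, learning_rate(e))
```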

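③ Reported performance improvements

EfficientNetV2-M reaches accuracy comparable to EfficientNet-B7 while training 11x faster (see Table 7, Figure 1).
EfficientNetV2-L achieves 85.7% top-1 accuracy, surpassing the Vision Transformer ViT-L/16(21k).
EfficientNetV2-M achieves 0.6% better accuracy than ResNeSt with 2.8x faster inference (see Table 7, Figure 5).

④ Explanation of mechanisms / why it works

The training speedup of the EfficientNetV2 models is attributed to the combination of progressive training and better networks (see Figure 1).
Models that train faster tend to be faster at inference as well; the paper states that training speed often correlates with inference speed (see Figure 5).

⑤ Comparisons

Model	Training speed	Accuracy	Inference speed	Table/figure number
EfficientNetV2-M	11x faster (vs. EfficientNet-B7)	Comparable to EfficientNet-B7; 0.6% higher than ResNeSt	2.8x faster (vs. ResNeSt)	Table 7, Figure 1, Figure 5
EfficientNetV2-L	-	85.7%	-	Table 7
Vision Transformer (ViT-L/16(21k))	Less efficient (no value given)	Lower than EfficientNetV2-L	-	Table 7
ResNeSt	-	0.6% lower than EfficientNetV2-M	2.8x slower than EfficientNetV2-M	Table 7, Figure 5

⑥ Other parts that could be made into a table
None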

5.2. ImageNet21k

▼Question (human)

(Pasted section of the paper/article) 5.2. ImageNet21k Setup: ImageNet21k (Russakovsky et al., 2015) contains about 13M training images with 21,841 classes. The original ImageNet21k doesn’t have train/eval split, so we reserve randomly picked 100,000 images as validation set and use the remaining as training set. We largely reuse the same training settings as ImageNet ILSVRC2012 with a few changes: (1) we change the training epochs to 60 or 30 to reduce training time, and use cosine learning rate decay that can adapt to different steps without extra tuning; (2) since each image has multiple labels, we normalize the labels to have sum of 1 before computing softmax loss. After pretrained on ImageNet21k, each model is finetuned on ILSVRC2012 for 15 epochs using cosine learning rate decay.

Results: Table 7 shows the performance comparison, where models tagged with 21k are pretrained on ImageNet21k and finetuned on ImageNet ILSVRC2012. Compared to the recent ViT-L/16(21k), our EfficientNetV2-L(21k) improves the top-1 accuracy by 1.5% (85.3% vs. 86.8%), using 2.5x fewer parameters and 3.6x fewer FLOPs, while running 6x - 7x faster in training and inference. We would like to highlight a few interesting observations:
• Scaling up data size is more effective than simply scaling up model size in high-accuracy regime: when the top-1 accuracy is beyond 85%, it is very difficult to further improve it by simply increasing model size due to the severe overfitting. However, the extra ImageNet21K pretraining can significantly improve accuracy. The effectiveness of large datasets is also observed in previous works (Mahajan et al., 2018; Xie et al., 2020; Dosovitskiy et al., 2021).
• Pretraining on ImageNet21k could be quite efficient. Although ImageNet21k has 10x more data, our training approach enables us to finish the pretraining of EfficientNetV2 within two days using 32 TPU cores (instead of weeks for ViT (Dosovitskiy et al., 2021)). This is more effective than training larger models on ImageNet. We suggest future research on large-scale models use the public ImageNet21k as a default dataset.

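▼Generative AI's answer (ChatGPT-4o)
① What is this about? + What is the gist? + List of figure/table numbers
This section describes the training settings and results for the EfficientNetV2 models on the ImageNet21k dataset. The focus is on how pretraining on ImageNet21k affects model accuracy and efficiency.

Gist

The EfficientNetV2 models are pretrained on ImageNet21k and then fine-tuned on ImageNet ILSVRC2012.
Compared with ViT-L/16(21k), EfficientNetV2-L(21k) improves accuracy with fewer parameters and FLOPs, and is much faster in both training and inference.
In the high-accuracy regime, scaling up the data size is more effective than scaling up the model size.
Pretraining on ImageNet21k is efficient: the pretraining of EfficientNetV2 finishes within two days using 32 TPU cores.

Corresponding figure/table numbers: Table 7

② Improvements, ideas, and techniques

For ImageNet21k training, the number of epochs is reduced to 60 or 30 to cut training time, and cosine learning rate decay is used because it adapts to different step counts without extra tuning.
Because each image has multiple labels, the labels are normalized to sum to 1 before the softmax loss is computed.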
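
The label normalization mentioned above (labels scaled to sum to 1 before the softmax loss) can be illustrated with a small pure-Python sketch. The function names `normalize_labels` and `softmax_cross_entropy` are hypothetical, and the 4-class example is made up; the paper's actual implementation is not shown in the pasted text.

```python
import math


def normalize_labels(multi_hot):
    """Scale a multi-hot label vector so that its entries sum to 1."""
    total = sum(multi_hot)
    if total == 0:
        raise ValueError("at least one label must be set")
    return [value / total for value in multi_hot]


def softmax_cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a target distribution."""
    # Log-sum-exp with the max subtracted for numerical stability.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return -sum(t * (v - log_z) for t, v in zip(target, logits))


# An image tagged with classes 0 and 2 out of 4 (made-up example).
target = normalize_labels([1, 0, 1, 0])
loss = softmax_cross_entropy([0.0, 0.0, 0.0, 0.0], target)
print(target, loss)
```

With the target normalized to a probability distribution, an image with several labels contributes to the loss on the same scale as an image with a single label.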

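③ Reported performance improvements

Compared with ViT-L/16(21k), EfficientNetV2-L(21k) improves top-1 accuracy by 1.5%, from 85.3% to 86.8% (see Table 7).
Compared with ViT-L/16(21k), EfficientNetV2-L(21k) uses 2.5x fewer parameters and 3.6x fewer FLOPs while running 6x - 7x faster in training and inference (see Table 7).

④ Explanation of mechanisms / why it works

Once top-1 accuracy exceeds 85%, simply increasing the model size makes further gains very difficult because of severe overfitting, whereas the extra ImageNet21k pretraining can still improve accuracy significantly.
The ImageNet21k pretraining results indicate that using a larger dataset is effective for high-accuracy models.

⑤ Comparisons

Model	Top-1 accuracy	Parameters	FLOPs	Training speed	Inference speed	Table/figure number
EfficientNetV2-L(21k)	86.8%	2.5x fewer	3.6x fewer	6x - 7x faster	6x - 7x faster	Table 7
ViT-L/16(21k)	85.3%	Baseline	Baseline	Baseline	Baseline	Table 7

⑥ Other parts that could be made into a table
None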

5.3. Transfer Learning Datasets

▼Question (human)

(Pasted section of the paper/article) 5.3. Transfer Learning Datasets Setup: We evaluate our models on four transfer learning datasets: CIFAR-10, CIFAR-100, Flowers and Cars. Table 9 includes the statistics of these datasets. For this experiment, we use the checkpoints trained on ImageNet ILSVRC2012. For fair comparison, no ImageNet21k images are used here. Our finetuning settings are mostly the same as ImageNet training with a few modifications similar to (Dosovitskiy et al., 2021; Touvron et al., 2021): We use smaller batch size 512, smaller initial learning rate 0.001 with cosine decay. For all datasets, we train each model for fixed 10,000 steps. Since each model is finetuned with very few steps, we disable weight decay and use a simple cutout data augmentation.

Results: Table 8 compares the transfer learning performance. In general, our models outperform previous ConvNets and Vision Transformers for all these datasets, sometimes by a non-trivial margin: for example, on CIFAR-100, EfficientNetV2-L achieves 0.6% better accuracy than prior GPipe/EfficientNets and 1.5% better accuracy than prior ViT/DeiT models. These results suggest that our models also generalize well beyond ImageNet.

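▼Generative AI's answer (ChatGPT-4o)
① What is this about? + What is the gist? + List of figure/table numbers
This section reports the evaluation of the EfficientNetV2 models on four transfer learning datasets: CIFAR-10, CIFAR-100, Flowers, and Cars.

Gist

Transfer learning is performed from checkpoints of EfficientNetV2 models trained on ImageNet ILSVRC2012.
For transfer learning, the batch size and initial learning rate are adjusted and each model is fine-tuned for 10,000 steps.
On CIFAR-100 and the other datasets, EfficientNetV2-L outperforms earlier ConvNets and Vision Transformers.

Corresponding figure/table numbers: Table 8, Table 9

② Improvements, ideas, and techniques

For transfer learning, the batch size is reduced to 512 and the initial learning rate is set to 0.001 with cosine decay.
The number of fine-tuning steps is fixed at 10,000; because that is very few steps, weight decay is disabled and a simple cutout data augmentation is used.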
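
For reference, the cosine decay used in this fine-tuning setup (initial learning rate 0.001 over a fixed 10,000 steps) has the following shape. This is a generic sketch of cosine decay, not the paper's code: the function name `cosine_lr`, the absence of warmup, and the decay all the way down to 0 (`final_lr=0.0`) are assumptions.

```python
import math


def cosine_lr(step, total_steps=10_000, initial_lr=0.001, final_lr=0.0):
    """Cosine learning-rate decay from initial_lr down to final_lr."""
    # Clamp progress to [0, 1] so steps outside the range are safe.
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (initial_lr - final_lr) * cosine


for step in (0, 2_500, 5_000, 7_500, 10_000):
    print(step, cosine_lr(step))
```

Because the curve depends only on `step / total_steps`, the same function works unchanged when the number of steps changes, which matches the remark in 5.2 that cosine decay adapts to different step counts without extra tuning.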

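③ Reported performance improvements

On CIFAR-100, EfficientNetV2-L achieves 0.6% better accuracy than prior GPipe/EfficientNets and 1.5% better accuracy than prior ViT/DeiT models (see Table 8).

④ Explanation of mechanisms / why it works

The results suggest that the EfficientNetV2 models also generalize well to datasets other than ImageNet.

⑤ Comparisons

Dataset	Model	Accuracy gain (compared with)	Table/figure number
CIFAR-100	EfficientNetV2-L	+0.6% (GPipe/EfficientNets)	Table 8
CIFAR-100	EfficientNetV2-L	+1.5% (ViT/DeiT)	Table 8

⑥ Other parts that could be made into a table
None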

6. Ablation Studies

6.1. Comparison to EfficientNet

▼Question (human)

(Pasted section of the paper/article) 6. Ablation Studies 6.1. Comparison to EfficientNet In this section, we will compare our EfficientNetV2 (V2 for short) with EfficientNets (Tan & Le, 2019a) (V1 for short) under the same training and inference settings.

Performance with the same training: Table 10 shows the performance comparison using the same progressive learning settings. As we apply the same progressive learning to EfficientNet, its training speed (reduced from 139h to 54h) and accuracy (improved from 84.7% to 85.0%) are better than the original paper (Tan & Le, 2019a). However, as shown in Table 10, our EfficientNetV2 models still outperform EfficientNets by a large margin: EfficientNetV2-M reduces parameters by 17% and FLOPs by 37%, while running 4.1x faster in training and 3.1x faster in inference than EfficientNet-B7. Since we are using the same training settings here, we attribute the gains to the EfficientNetV2 architecture.

Scaling Down: Previous sections mostly focus on large-scale models. Here we compare smaller models by scaling down our EfficientNetV2-S using EfficientNet compound scaling. For easy comparison, all models are trained without progressive learning. Compared to small-size EfficientNets (V1), our new EfficientNetV2 (V2) models are generally faster while maintaining comparable parameter efficiency.

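▼Generative AI's answer (ChatGPT-4o)
① What is this about? + What is the gist? + List of figure/table numbers
This section compares EfficientNetV2 (V2) with the earlier EfficientNet (V1) under the same training and inference settings.

Gist

The performance of EfficientNetV2 and EfficientNet is compared when both are trained with the same settings.
Compared with EfficientNet, EfficientNetV2 substantially reduces parameters and FLOPs and is much faster in both training and inference.
When scaled down to small models, EfficientNetV2 is generally faster than EfficientNet while maintaining comparable parameter efficiency.

Corresponding figure/table numbers: Table 10

② Improvements, ideas, and techniques

Thanks to the EfficientNetV2 architecture, parameters are reduced by 17% and FLOPs by 37% relative to EfficientNet-B7, while training is 4.1x faster and inference is 3.1x faster.

③ Reported performance improvements

EfficientNetV2-M reduces parameters by 17% and FLOPs by 37% relative to EfficientNet-B7, and runs 4.1x faster in training and 3.1x faster in inference (see Table 10).
When the same progressive learning is applied to EfficientNet itself, its training time drops from 139h to 54h and its accuracy rises from 84.7% to 85.0% compared with the original paper.

④ Explanation of mechanisms / why it works

Because the same training settings are used, the gains are attributed to the EfficientNetV2 architecture.

⑤ Comparisons

Model	Parameter reduction	FLOPs reduction	Training speedup	Inference speedup	Table/figure number
EfficientNetV2-M (vs. EfficientNet-B7)	17%	37%	4.1x	3.1x	Table 10

⑥ Other parts that could be made into a table
None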

6.2. Progressive Learning for Different Networks

6.3. Importance of Adaptive Regularization

▼Question (human)

(Pasted section of the paper/article) 6.2. Progressive Learning for Different Networks We ablate the performance of our progressive learning for different networks. Table 12 shows the performance comparison between our progressive training and the baseline training, using the same ResNet and EfficientNet models. Here, the baseline ResNets have higher accuracy than the original paper (He et al., 2016) because they are trained with our improved training settings (see Section 5) using more epochs and better optimizers. We also increase the image size from 224 to 380 for ResNets to further increase the network capacity and accuracy.

As shown in Table 12, our progressive learning generally reduces the training time and meanwhile improves the accuracy for all different networks. Not surprisingly, when the default image size is very small, such as ResNet50(224) with 224x224 size, the training speedup is limited (1.4x speedup); however, when the default image size is larger and the model is more complex, our approach achieves larger gains on accuracy and training efficiency: for ResNet152(380), our approach improves speed up the training by 2.1x with slightly better accuracy; for EfficientNet-B4, our approach improves speed up the training by 2.2x.

6.3. Importance of Adaptive Regularization A key insight from our training approach is the adaptive regularization, which dynamically adjusts regularization according to image size. This paper chooses a simple progressive approach for its simplicity, but it is also a general method that can be combined with other approaches.

Table 13 studies our adaptive regularization on two training settings: one is to progressively increase image size from small to large (Howard, 2018), and the other is to randomly sample a different image size for each batch (Hoffer et al., 2019). Because TPU needs to recompile the graph for each new size, here we randomly sample a image size every eight epochs instead of every batch. Compared to the vanilla approaches of progressive or random resizing that use the same regularization for all image sizes, our adaptive regularization improves the accuracy by 0.7%. Figure 6 further compares the training curve for the progressive approach. Our adaptive regularization uses much smaller regularization for small images at the early training epochs, allowing models to converge faster and achieve better final accuracy.

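▼Generative AI's answer (ChatGPT-4o)
① What is this about? + What is the gist? + List of figure/table numbers
This section discusses the effect of progressive learning on different networks and the importance of adaptive regularization.

Gist

The effect of progressive learning is evaluated on different networks, confirming shorter training time and improved accuracy.
The importance of adaptive regularization is shown: adjusting regularization dynamically according to image size improves accuracy.
With adaptive regularization, accuracy improves over the vanilla progressive and random resizing approaches.

Corresponding figure/table numbers: Table 12, Table 13, Figure 6

② Improvements, ideas, and techniques

Progressive learning shortens training time while improving accuracy, and the effect is especially pronounced when the image size is large and the model is complex.
Adaptive regularization adjusts regularization dynamically according to image size and improves final accuracy compared with using the same regularization for all image sizes.

③ Reported performance improvements

With progressive learning, ResNet152(380) trains 2.1x faster with slightly better accuracy (Table 12).
With progressive learning, EfficientNet-B4 trains 2.2x faster (Table 12).
With adaptive regularization, accuracy improves by 0.7% over the vanilla progressive and random resizing approaches (Table 13).

④ Explanation of mechanisms / why it works

Adaptive regularization applies much weaker regularization to small images in the early training epochs, which lets the model converge faster and reach better final accuracy (see Figure 6).

⑤ Comparisons

Network	Training speedup from progressive learning	Accuracy improvement	Table/figure number
ResNet50(224)	1.4x	-	Table 12
ResNet152(380)	2.1x	Yes (slight)	Table 12
EfficientNet-B4	2.2x	Yes	Table 12

⑥ Other parts that could be made into a table

Training setting	Adaptive regularization	Accuracy improvement	Table/figure number
Progressive resizing	Yes	0.7%	Table 13
Random resizing	Yes	0.7%	Table 13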
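
To make the random-resizing setting in Table 13 more concrete, the sketch below samples a new image size every eight epochs (as the pasted text describes) and picks a dropout rate that grows with the sampled size. The pasted text does not say how regularization is mapped to a sampled size, so the linear mapping, the candidate sizes, the dropout range 0.1 to 0.3, and the names `regularization_for_size` and `sample_plan` are all assumptions for illustration.

```python
import random


def regularization_for_size(size, min_size, max_size, min_reg, max_reg):
    """Map an image size linearly to a regularization value (assumed rule)."""
    t = (size - min_size) / (max_size - min_size)
    return min_reg + (max_reg - min_reg) * t


def sample_plan(total_epochs, resample_every=8,
                sizes=(128, 160, 192, 224, 256, 300), seed=0):
    """Return (start_epoch, image_size, dropout) for each resampling window."""
    rng = random.Random(seed)
    plan = []
    for start in range(0, total_epochs, resample_every):
        size = rng.choice(sizes)
        # Larger sampled size -> stronger regularization.
        dropout = regularization_for_size(size, min(sizes), max(sizes), 0.1, 0.3)
        plan.append((start, size, dropout))
    return plan


plan = sample_plan(total_epochs=32)
for start, size, dropout in plan:
    print(f"epochs {start}-{start + 7}: size={size}, dropout={dropout:.3f}")
```

The vanilla baseline in the comparison would correspond to returning the same dropout value regardless of the sampled size.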

7. Conclusion

▼Question (human)

(Pasted section of the paper/article) 7. Conclusion This paper presents EfficientNetV2, a new family of smaller and faster neural networks for image recognition. Optimized with training-aware NAS and model scaling, our EfficientNetV2 significantly outperforms previous models, while being much faster and more efficient in parameters. To further speed up the training, we propose an improved method of progressive learning, that jointly increases image size and regularization during training. Extensive experiments show our EfficientNetV2 achieves strong results on ImageNet, and CIFAR/Flowers/Cars. Compared to EfficientNet and more recent works, our EfficientNetV2 trains up to 11x faster while being up to 6.8x smaller.

Acknowledgements Special thanks to Lucas Sloan for helping open sourcing. We thank Ruoming Pang, Sheng Li, Andrew Li, Hanxiao Liu, Zihang Dai, Neil Houlsby, Ross Wightman, Jeremy Howard, Thang Luong, Daiyi Peng, Yifeng Lu, Da Huang, Chen Liang, Aravind Srinivas, Irwan Bello, Max Moroz, Futang Peng for their feedback.

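▼Generative AI's answer (ChatGPT-4o)
① What is this about? + What is the gist? + List of figure/table numbers
This section summarizes EfficientNetV2 and its results as the paper's conclusion.

Gist

EfficientNetV2 is a new family of smaller and faster neural networks for image recognition.
Optimized with training-aware NAS and model scaling, EfficientNetV2 significantly outperforms previous models.
To speed up training further, the paper proposes a progressive learning method that jointly increases image size and regularization during training.
EfficientNetV2 achieves strong results on ImageNet and CIFAR/Flowers/Cars; compared with EfficientNet and more recent works, it trains up to 11x faster while being up to 6.8x smaller.

Corresponding figure/table numbers: None

② Improvements, ideas, and techniques

EfficientNetV2 is optimized with training-aware NAS (Neural Architecture Search) and model scaling, and is designed to be more efficient than previous models.
A progressive learning method that increases image size and regularization together during training is proposed to speed up training.

③ Reported performance improvements

Compared with EfficientNet and other recent models, EfficientNetV2 trains up to 11x faster and is up to 6.8x smaller.

④ Explanation of mechanisms / why it works

The performance gains of EfficientNetV2 result from optimizing the model with training-aware NAS and model scaling. In addition, progressive learning makes training more efficient.

⑤ Comparisons

No comparisons are described.

⑥ Other parts that could be made into a table

There is nothing that could be made into a table.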

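Afterword

Impressions

Unless you already know a fair amount about machine learning, I think the generative AI's answers alone will not be enough to understand the paper. The answers mostly pick out the paper's key points as written, without changing much (in a good sense, the key points are extracted as they are).

If you first read an accessible explainer on EfficientNetV2 and get a grasp of its actual processing and core ideas, the answers become easier to follow. Generative AI tends not to explain the fine details of the actual processing in an easy-to-understand way.

What I learned

The main improvements turned out to be the following two:
The model was designed with a search space that incorporates new operations such as the efficient Fused-MBConv, removing the bottleneck caused by depthwise convolutions.
Progressive learning over image size improves training speed while preventing the accuracy drop. Gradually increasing the image size during training while adaptively strengthening regularization (e.g., dropout and data augmentation) balances training speed and accuracy.

Other things I learned or found noteworthy:

The paper points out that methods such as progressive resizing and Mix&Match lose accuracy because they use the same regularization for different image sizes, and it proposes progressive learning that adjusts regularization dynamically to address this.

Large image sizes and depthwise convolutions in the early layers are bottlenecks that slow down training.

FixRes: using a smaller image size for training than for inference (about 20% smaller in this paper) reduces memory usage and allows a larger batch size, which speeds up training. At inference time a larger image size is used to raise accuracy, but no layers are fine-tuned after training; the trained model is used as is.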

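Definition and optimization of the reward function:
A reward is defined by combining the model accuracy A, the normalized training step time S, and the parameter size P:
A⋅S^w⋅P^v
Here, the weight w = −0.07 and the coefficient v = −0.05 are determined empirically to balance the trade-offs.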
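
As a quick sanity check of how this reward behaves, here is a small Python sketch of A⋅S^w⋅P^v with w = −0.07 and v = −0.05. The function name `nas_reward` and the candidate numbers are made up for illustration; only the formula and the two exponents come from the text above.

```python
def nas_reward(accuracy, step_time, params, w=-0.07, v=-0.05):
    """Search reward A * S^w * P^v.

    accuracy:  model accuracy A (e.g. 0.85)
    step_time: normalized training step time S (1.0 = reference)
    params:    parameter size P (normalized here; an assumption)
    """
    return accuracy * (step_time ** w) * (params ** v)


# Hypothetical candidates: same accuracy, different cost.
reference = nas_reward(accuracy=0.85, step_time=1.0, params=1.0)
slower = nas_reward(accuracy=0.85, step_time=2.0, params=1.0)
larger = nas_reward(accuracy=0.85, step_time=1.0, params=2.0)
print(reference, slower, larger)
```

Because both exponents are negative and small, a candidate that is slower or larger is penalized, but only mildly, so accuracy remains the dominant term.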

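The accuracy drop seen when training with different image sizes is hypothesized to come from unbalanced regularization strength.
Experiments confirm that weak regularization works best for small image sizes and strong regularization works well for large image sizes.
A progressive learning method is proposed that adjusts regularization together with image size during training.

EfficientNetV2-M reaches accuracy comparable to EfficientNet-B7 while training 11x faster.
The inference latency of the EfficientNetV2 models is up to 3x faster than that of the EfficientNet models.

Once top-1 accuracy exceeds 85%, simply increasing the model size makes further gains very difficult because of severe overfitting, whereas the extra ImageNet21k pretraining can still improve accuracy significantly.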
