[19.05] EfficientNet
Compound Scaling
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
Common methods to enhance the performance of convolutional neural networks (CNNs) include:
- Deeper Networks: Increasing network depth, such as with ResNet.
- Wider Networks: Increasing network width, like WideResNet.
- Higher Resolution: Increasing the input image resolution.
Each of these methods yields incremental gains on its own, but the authors of this paper argue for a more holistic approach to scaling.
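To make the three scaling knobs concrete, here is a minimal PyTorch-style sketch. The toy network, the name `make_toy_cnn`, and its multiplier arguments are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

def make_toy_cnn(depth_mult=1.0, width_mult=1.0, base_depth=4, base_width=16):
    """Toy conv stack: depth_mult adds layers, width_mult adds channels per layer."""
    depth = max(1, round(base_depth * depth_mult))
    width = max(1, round(base_width * width_mult))
    layers, in_ch = [], 3
    for _ in range(depth):
        layers += [nn.Conv2d(in_ch, width, kernel_size=3, padding=1), nn.ReLU()]
        in_ch = width
    return nn.Sequential(*layers)

deeper = make_toy_cnn(depth_mult=2.0)  # deeper network: more layers (ResNet-style)
wider = make_toy_cnn(width_mult=2.0)   # wider network: more channels (WideResNet-style)

# Higher resolution: the same network simply receives larger inputs.
x_224 = torch.randn(1, 3, 224, 224)
x_320 = torch.randn(1, 3, 320, 320)
```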
Defining the Problem
The authors of this paper suggest a unified approach to scaling models and propose a method called Compound Scaling.
Solving the Problem
Two Key Observations
In this paper, the authors present two critical observations that form the foundation of their compound scaling method:
- Observation 1: Scaling up any single dimension improves accuracy, but the gains saturate as the model grows larger.

  Scaling the depth, width, or resolution of a convolutional neural network (CNN) typically improves its accuracy. However, as the network becomes extremely deep, wide, or high-resolution, the accuracy gains diminish quickly and eventually plateau. This suggests that increasing a single dimension cannot improve performance indefinitely and will ultimately hit a bottleneck.
- Observation 2: Balancing network width, depth, and resolution is crucial for effective model scaling.

  To achieve better accuracy and efficiency within a given resource budget, the network's width, depth, and resolution must be balanced. Scaling a single dimension is straightforward, but its impact is limited; scaling all three dimensions together in a coordinated way uses the available resources more effectively and reaches higher accuracy.
Compound Scaling
Here, the problem of model scaling is redefined:
Each layer of a CNN can be defined as a function $Y_i = \mathcal{F}_i(X_i)$, where $\mathcal{F}_i$ is the operator, $Y_i$ is the output tensor, and $X_i$ is the input tensor. The entire CNN can be represented as a composition of these layers: $\mathcal{N} = \mathcal{F}_k \odot \cdots \odot \mathcal{F}_2 \odot \mathcal{F}_1(X_1)$. Typically, the layers in a CNN are divided into stages, with all layers in a stage sharing the same architecture.
Therefore, a CNN can be defined as:

$$\mathcal{N} = \bigodot_{i=1 \dots s} \mathcal{F}_i^{L_i}\big(X_{\langle H_i, W_i, C_i \rangle}\big)$$

where $\mathcal{F}_i^{L_i}$ denotes the layer $\mathcal{F}_i$ being repeated $L_i$ times in the $i$-th stage, and $\langle H_i, W_i, C_i \rangle$ is the shape of the input tensor of stage $i$.
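As a rough illustration of this stage-based view, the following sketch builds a network as a sequence of stages, each repeating the same layer $L_i$ times. The helper names (`conv_block`, `build_stage`) and the stage specs are assumptions made for illustration:

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One layer F_i: a 3x3 conv followed by BatchNorm and ReLU."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU())

def build_stage(num_repeats, in_ch, out_ch):
    """Stage i: the same layer repeated L_i times (only the first repeat changes channels)."""
    layers = [conv_block(in_ch, out_ch)]
    layers += [conv_block(out_ch, out_ch) for _ in range(num_repeats - 1)]
    return nn.Sequential(*layers)

# (L_i, C_in, C_out) per stage -- illustrative values only.
stage_specs = [(2, 3, 16), (3, 16, 32), (4, 32, 64)]
network = nn.Sequential(*[build_stage(L, cin, cout) for L, cin, cout in stage_specs])
```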
The goal of model scaling is to maximize accuracy within given resource constraints. Specifically, we aim to optimize the model by adjusting its depth ($d$), width ($w$), and resolution ($r$) while keeping the basic network architecture unchanged.
This can be formalized as an optimization problem:

$$\max_{d,\, w,\, r} \ \text{Accuracy}\big(\mathcal{N}(d, w, r)\big)$$

Subject to:

$$\mathcal{N}(d, w, r) = \bigodot_{i=1 \dots s} \hat{\mathcal{F}}_i^{\,d \cdot \hat{L}_i}\big(X_{\langle r \cdot \hat{H}_i,\ r \cdot \hat{W}_i,\ w \cdot \hat{C}_i \rangle}\big)$$

$$\text{Memory}(\mathcal{N}) \leq \text{target\_memory}$$

$$\text{FLOPS}(\mathcal{N}) \leq \text{target\_flops}$$

where $\hat{\mathcal{F}}_i$, $\hat{L}_i$, $\hat{H}_i$, $\hat{W}_i$, and $\hat{C}_i$ are predefined parameters of the baseline network.
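Since FLOPS grow roughly linearly with depth and quadratically with width and resolution, the constraint side of this problem can be sketched as follows. The budget numbers and helper names below are hypothetical; checking the accuracy objective would require actually training each candidate:

```python
def scaled_flops(base_flops, d, w, r):
    """FLOPS grow ~linearly with depth and ~quadratically with width and resolution."""
    return base_flops * d * (w ** 2) * (r ** 2)

def within_budget(d, w, r, base_flops, target_flops):
    """The FLOPS constraint from the optimization problem (memory is checked the same way)."""
    return scaled_flops(base_flops, d, w, r) <= target_flops

# Hypothetical numbers: a 0.4 GFLOPs baseline and a 1.6 GFLOPs budget.
base_flops, budget = 0.4e9, 1.6e9
grid = [1.0, 1.1, 1.2, 1.3, 1.4]
feasible = [(d, w, r) for d in grid for w in grid for r in grid
            if within_budget(d, w, r, base_flops, budget)]
# Maximizing accuracy would mean training each feasible candidate; here we only
# enumerate the feasible region of (d, w, r) triples.
```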
The authors propose a compound scaling method that uses a compound coefficient $\phi$ to uniformly scale the network's width, depth, and resolution:

$$\text{depth: } d = \alpha^{\phi}, \qquad \text{width: } w = \beta^{\phi}, \qquad \text{resolution: } r = \gamma^{\phi}$$

$$\text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \qquad \alpha \geq 1,\ \beta \geq 1,\ \gamma \geq 1$$

where $\alpha$, $\beta$, and $\gamma$ are constants determined by a small grid search on the baseline network, and $\phi$ is a user-specified coefficient that controls how many additional resources are available. Under this constraint, each increment of $\phi$ increases the total FLOPS by roughly a factor of $2$.
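A small numeric sketch of the rule, using the constants the paper reports from its grid search on the baseline ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$). The helper below is illustrative, not the paper's code:

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Compound scaling: d = alpha^phi, w = beta^phi, r = gamma^phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

for phi in range(4):
    d, w, r = compound_scale(phi)
    flops_mult = d * w ** 2 * r ** 2  # ~2^phi, since alpha * beta^2 * gamma^2 ≈ 2
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPS x{flops_mult:.2f}")
```

For example, $\phi = 1$ gives $d = 1.2$, $w = 1.1$, $r = 1.15$, and $1.2 \times 1.1^{2} \times 1.15^{2} \approx 1.92 \approx 2$, i.e. roughly twice the FLOPS of the baseline.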