The Evolution of Neural Network Architectures: A Comprehensive Review of More Than Ten Architectures, from LeNet5 to ENet

Posted on 4/24/2018 1:08:47 PM
LeNet5
LeNet5, born in 1994, is one of the earliest convolutional neural networks and helped propel the field of deep learning. After many successful iterations since 1988, this pioneering work by Yann LeCun was named LeNet5 (see: Gradient-Based Learning Applied to Document Recognition).


LeNet5's architecture was based on the insight that image features are distributed across the entire image, and that convolutions with learnable parameters are an effective way to extract similar features at multiple locations with few parameters. At the time there were no GPUs to help with training, and even CPUs were slow. Being able to save parameters and computation was therefore a key advantage. This is in contrast to using each pixel as a separate input to a large multi-layer neural network. LeNet5 showed that pixels should not be used as individual features in the first layer, because images are strongly spatially correlated, and using individual pixels of the image as separate input features cannot exploit these correlations.

The features of LeNet5 can be summarized as follows (a minimal code sketch comes after the list):

A convolutional neural network uses a sequence of three layers: convolution, pooling, non-linearity → this has probably been the key feature of deep learning for images since this paper!
Use convolutions to extract spatial features
Subsample using the spatial average of maps
Non-linearity in the form of hyperbolic tangent (tanh) or sigmoid
A multi-layer perceptron (MLP) as the final classifier
Sparse connection matrices between layers to avoid large computational costs
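
To make the list above concrete, here is a minimal PyTorch-style sketch of the convolution → pooling → non-linearity pattern followed by an MLP classifier. The layer sizes follow the common LeNet5 description for 32×32 grayscale inputs; treat it as an illustration, not a faithful reproduction of the 1998 model:

```python
import torch
import torch.nn as nn

# Minimal LeNet5-style sketch (illustrative, not the exact original model):
# convolution -> subsampling by spatial averaging -> tanh non-linearity,
# repeated twice, followed by an MLP as the final classifier.
class LeNet5Sketch(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 1x32x32 -> 6x28x28
            nn.AvgPool2d(2),                  # subsample: 6x28x28 -> 6x14x14
            nn.Tanh(),
            nn.Conv2d(6, 16, kernel_size=5),  # 6x14x14 -> 16x10x10
            nn.AvgPool2d(2),                  # 16x10x10 -> 16x5x5
            nn.Tanh(),
        )
        self.classifier = nn.Sequential(      # MLP as the final classifier
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

# Example: a batch of four 32x32 grayscale images -> logits of shape (4, 10).
logits = LeNet5Sketch()(torch.randn(4, 1, 32, 32))
```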

Overall, this network was the origin of much of the recent architecture work, and a true inspiration for many people in the field.

The Gap
From 1998 to 2010, neural networks were in incubation. Most people did not notice their increasing power, while other researchers slowly progressed. More and more data became available thanks to the rise of cell-phone cameras and cheap digital cameras. And computing power was also growing: CPUs were getting faster, and GPUs became general-purpose computing tools. Both trends allowed neural networks to progress, albeit at a slow rate. Both data and computing power made the tasks that neural networks could tackle more and more interesting. After that, everything became clear...

Dan Ciresan Net
In 2010, Dan Claudiu Ciresan and Jurgen Schmidhuber published one of the first implementations of a GPU neural network. This implementation had both the forward and backward passes running on an NVIDIA GTX 280 graphics processor, for a neural network of up to 9 layers.

AlexNet
In 2012, Alex Krizhevsky released AlexNet (see: ImageNet Classification with Deep Convolutional Neural Networks), a deeper and much wider version of LeNet that won the difficult ImageNet competition by a large margin.


AlexNet scaled the insights of LeNet into a much larger neural network that could learn much more complex objects and object hierarchies. The contributions of this work, sketched in code after the list, were:

Use of rectified linear units (ReLU) as non-linearities
Use of the dropout technique to selectively ignore single neurons during training, a way to avoid overfitting of the model
Overlapping max pooling, avoiding the averaging effect of average pooling
Reduce training time with the NVIDIA GTX 580 GPU
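
The first three ideas are easy to show in code. Below is a hedged sketch of one convolutional block, with arbitrary layer sizes rather than AlexNet's exact configuration, combining ReLU, overlapping max pooling (window larger than stride), and dropout:

```python
import torch.nn as nn

# Illustrative sketch of AlexNet's key ideas (layer sizes are arbitrary,
# not the exact AlexNet configuration):
block = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),
    nn.ReLU(inplace=True),                  # ReLU instead of tanh/sigmoid
    nn.MaxPool2d(kernel_size=3, stride=2),  # overlapping max pooling: 3x3 window, stride 2
    nn.Dropout(p=0.5),                      # randomly ignore neurons during training
)
```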

At the time, GPUs offered a much larger number of cores than CPUs and could speed up training roughly 10 times, which in turn made it possible to use larger datasets and bigger images.

The success of AlexNet started a small revolution. Convolutional neural networks became the workhorse of deep learning, which from then on was synonymous with "large neural networks that can now solve useful tasks."

Overfeat
In December 2013, Yann LeCun's lab at New York University proposed Overfeat, a derivative of AlexNet (see: OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks). The paper also proposed learning bounding boxes, which later gave rise to many other papers on the same topic. I believe it is better to learn to segment objects than to learn artificial bounding boxes.

VGG
The VGG networks from the University of Oxford (see: Very Deep Convolutional Networks for Large-Scale Image Recognition) were the first to use much smaller 3×3 filters in each convolutional layer, and also to combine them as a sequence of convolutions.

This seems contrary to the principles of LeNet, where large convolutions were used to capture similar features in an image. Instead of AlexNet's 9×9 or 11×11 filters, filters started to become smaller, dangerously close to the infamous 1×1 convolutions that LeNet tried to avoid, at least in the first layers of the network. But VGG's great advance was the insight that multiple 3×3 convolutions in sequence can emulate the effect of larger receptive fields, such as 5×5 and 7×7. These ideas were later used in more recent network architectures as well, such as Inception and ResNet.
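
To see why sequences of 3×3 filters are attractive, compare parameter counts: two stacked 3×3 convolutions cover the same 5×5 receptive field as a single 5×5 convolution, with fewer parameters and an extra non-linearity in between. A quick sketch (the channel count C = 256 is an arbitrary choice for illustration):

```python
import torch.nn as nn

C = 256  # arbitrary channel count for illustration

# One 5x5 convolution vs. two stacked 3x3 convolutions: both cover a 5x5
# receptive field, but the stacked version has fewer parameters and an
# extra non-linearity in between.
single_5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2)
stacked_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(),
)

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(single_5x5))   # C*C*5*5 + C     = 1,638,656
print(params(stacked_3x3))  # 2*(C*C*3*3 + C) = 1,180,160
```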


VGG networks use multiple 3×3 convolutional layers to represent complex features. Notice that in blocks 3, 4 and 5 of VGG-E, 256×256 and 512×512 3×3 filters are used multiple times in sequence to extract more complex features, as well as combinations of such features. This is effectively like having large 512×512 classifiers with three convolutional layers. That obviously amounts to a massive number of parameters and a lot of learning capacity. But these networks were difficult to train and had to be split into smaller networks with layers added one by one. This was due to the lack of strong ways to regularize the model, or to somehow restrict the massive search space created by the large number of parameters.

VGG also uses large feature maps in many layers, so inference is quite costly at run time. Reducing the number of features, as done in Inception's bottlenecks, would save some of this computational cost.

Network-in-network
The idea of Network-in-network (NiN, see paper: Network In Network) is simple but powerful: use 1×1 convolutions to give the features of a convolutional layer more combinational power.

The NiN architecture uses spatial MLP layers after each convolution, in order to better combine features before the next layer. Again, one can think that 1×1 convolutions go against the original principles of LeNet, but in fact they help combine convolutional features in a better way than simply stacking more convolutional layers. This is different from using raw pixels as input to the next layer. Here 1×1 convolutions are used to spatially combine features across feature maps after convolution, so they effectively use very few parameters, shared across all pixels of those features!


The power of the MLP can greatly increase the effectiveness of individual convolutional features by combining them into more complex groups. This idea has since been used in most recent architectures, such as ResNet, Inception, and their derivatives.

NiN also uses an average pooling layer as part of the final classifier, another practice that would become common. This is done to average the network's response over multiple areas of the input image before classification.
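
Both NiN ideas, 1×1 convolutions that recombine features across channels at every pixel and a global average pooling classifier, can be sketched as follows (channel counts here are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

# Illustrative NiN-style block: a spatial convolution followed by 1x1
# convolutions that act as a per-pixel MLP across channels, then global
# average pooling as the classifier head (channel counts arbitrary).
nin_block = nn.Sequential(
    nn.Conv2d(3, 192, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv2d(192, 160, kernel_size=1), nn.ReLU(),  # per-pixel channel mixing
    nn.Conv2d(160, 96, kernel_size=1), nn.ReLU(),
)
head = nn.AdaptiveAvgPool2d(1)  # average each feature map over all spatial positions

x = torch.randn(2, 3, 32, 32)
out = head(nin_block(x)).flatten(1)  # shape: (2, 96)
```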

GoogLeNet and Inception
Christian Szegedy from Google set out to reduce the computational overhead of deep neural networks, and designed GoogLeNet, the first Inception architecture (see: Going Deeper with Convolutions).

By the fall of 2014, deep learning models were becoming extremely useful for categorizing the content of images and video frames. Most skeptics had stopped doubting that deep learning and neural networks were really back this time, and here to stay. Given the usefulness of these techniques, internet giants like Google were very interested in deploying these architectures efficiently and at scale on their servers.

Christian thought a lot about how deep neural networks could reduce their computational overhead while keeping top performance, for example on ImageNet, or improve performance at the same computational cost.

He and his team came up with the Inception module:


At first glance, this is basically a parallel combination of 1×1, 3×3, and 5×5 convolutional filters. But the great insight of Inception was to use 1×1 convolutional blocks (as in NiN) to reduce the number of features before the expensive parallel branches. This is commonly referred to as a "bottleneck", and is explained in the "Bottleneck layer" section below. A sketch of the module follows.
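
Here is a hedged sketch of that parallel structure, with 1×1 bottlenecks in front of the expensive branches. The branch widths are one illustrative choice, not a claim about GoogLeNet's exact configuration:

```python
import torch
import torch.nn as nn

class InceptionSketch(nn.Module):
    """Illustrative Inception-style module: parallel 1x1 / 3x3 / 5x5 / pooling
    branches, with 1x1 "bottleneck" convolutions reducing the number of
    features before the expensive 3x3 and 5x5 convolutions. Branch widths
    are illustrative choices."""
    def __init__(self, in_ch=192):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, 1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 96, 1), nn.ReLU(),
                                nn.Conv2d(96, 128, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.ReLU(),
                                nn.Conv2d(16, 32, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))

    def forward(self, x):
        # Concatenate all branch outputs along the channel dimension.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

y = InceptionSketch()(torch.randn(1, 192, 28, 28))  # output: (1, 256, 28, 28)
```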

GoogLeNet uses a stem without Inception modules as its initial layers, followed by an average pooling layer plus a softmax classifier, similar to NiN. This classifier performs an extremely small number of operations compared with the classifiers of AlexNet and VGG. This contributed to a very efficient network design; see the paper: An Analysis of Deep Neural Network Models for Practical Applications.

Bottleneck layer
Inspired by NiN, Inception's bottleneck layer reduces the number of features, and thus the number of operations, at each layer, so inference time can be kept low. Before passing data to the expensive convolution modules, the number of features is reduced by a factor of 4. This translates into large savings in computational cost, and was a key to the success of this architecture.

Let us examine this in detail. Say you have 256 features coming in and 256 coming out, and assume the Inception layer only performs 3×3 convolutions. That is 256×256×3×3 multiplications to perform (nearly 590,000 multiply-accumulate, or MAC, operations). This may exceed our computational budget, say, of running this layer in 0.5 milliseconds on a Google server. Instead, we decide to reduce the number of features to be convolved to 64 (i.e. 256/4). In that case, we first perform a 256 → 64 1×1 convolution, then 64 convolutions on all the Inception branches, and then a 1×1 convolution from 64 features back to 256. The operation counts are now:

256×64 × 1×1 = 16,384
64×64 × 3×3 = 36,864
64×256 × 1×1 = 16,384

That is a total of about 70,000 operations, compared with almost 600,000 before: nearly 10 times fewer.
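
This arithmetic is easy to check in a few lines of Python, counting multiply-accumulates per output position and ignoring biases:

```python
# Multiply-accumulate (MAC) counts per output position, ignoring biases.
naive = 256 * 256 * 3 * 3                 # single 3x3 conv, 256 -> 256 features
bottleneck = (256 * 64 * 1 * 1            # 1x1 reduction: 256 -> 64
              + 64 * 64 * 3 * 3           # 3x3 conv on the reduced features
              + 64 * 256 * 1 * 1)         # 1x1 expansion: 64 -> 256
print(naive, bottleneck, naive / bottleneck)  # 589824 69632 -> about 8.5x fewer
```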

And although we perform fewer operations, we lose no generality in this layer. Bottleneck layers have been shown to perform at the state of the art on datasets such as ImageNet, and they are also used in later architectures such as ResNet, which we introduce next.

The reason bottlenecks work is that the input features are correlated, so redundancy can be removed by combining them appropriately with 1×1 convolutions. Then, after convolving this smaller number of features, they can be expanded again into a meaningful combination for the next layer.

Inception V3 (and V2)
Christian and his team are very prolific researchers. In February 2015 they introduced batch-normalized Inception as Inception V2 (see paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift). Batch normalization computes the mean and standard deviation of all the feature maps at the output of a layer, and normalizes their responses with these values. This corresponds to "whitening" the data, so that all the feature maps respond in the same range and with zero mean. This helps training, because the next layer does not have to learn offsets in the input data and can focus on how to best combine the features.
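
A minimal NumPy sketch of the normalization step follows. The learned scale and shift parameters (gamma and beta) and the running statistics used at inference time are omitted for brevity:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize a batch of feature maps to zero mean and unit variance.
    x has shape (batch, channels, height, width); statistics are computed
    per channel over the batch and spatial dimensions. The learned scale
    (gamma) and shift (beta) parameters are omitted for brevity."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(8, 16, 32, 32) * 3.0 + 5.0  # shifted, scaled activations
y = batch_norm(x)
print(y.mean(), y.std())  # approximately 0 and 1
```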

In December 2015, the team released a new version of the Inception modules and the corresponding architecture (see paper: Rethinking the Inception Architecture for Computer Vision). The paper better explains the original GoogLeNet architecture, giving much more detail about the design choices. The original ideas were:

Maximize information flow into the network by carefully constructing networks that balance depth and width. Before each pooling layer, increase the number of feature maps.
When depth is increased, the number of features, or the width of the layer, is also increased systematically.
Use the width increase at each layer to increase the combination of features before the next layer.
Use only 3×3 convolutions where possible, given that filters of 5×5 and 7×7 can be decomposed into multiple 3×3 convolutions. See the illustration below:


As a result, the new Inception became:


Filters can also be decomposed by flattening the convolutions into more complex modules:


Inception modules can also reduce the size of the data by providing pooling while performing the Inception computation. This is basically identical to performing a strided convolution in parallel with a simple pooling layer:


Inception also uses a pooling layer and softmax as the final classifier.

ResNet
December 2015 brought another revolution, at about the same time as Inception V3. ResNet has a simple idea: feed the output of two successive convolutional layers AND also bypass the input to the next layer (see the paper: Deep Residual Learning for Image Recognition).


This is similar to older ideas. But here, the authors bypass TWO layers and apply the idea at a large scale. Bypassing after two layers is a key intuition, because bypassing a single layer did not give much improvement. The two bypassed layers can be seen as a small classifier, or as a network-in-network.

This was also the very first time that networks of more than a hundred layers, even as many as 1000 layers, were trained.

ResNet with a large number of layers began to use a bottleneck layer similar to Inception's:


This layer reduces the number of features at each layer by first using a 1×1 convolution with a smaller output (usually 1/4 of the input), then a 3×3 layer, and then another 1×1 convolution back to a larger number of features. As with the Inception modules, this keeps the computational cost low while providing a rich combination of features. A sketch of such a block follows.
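
Here is a hedged sketch of such a bottleneck residual block, in plain form, without batch normalization and without the projection shortcut that real ResNets use when shapes change:

```python
import torch
import torch.nn as nn

class BottleneckSketch(nn.Module):
    """Illustrative ResNet-style bottleneck block: 1x1 reduce (to 1/4 of the
    input features), 3x3 conv, 1x1 expand, with the input added back (the
    bypass). Batch normalization and the projection shortcut for shape
    changes are omitted to keep the sketch minimal."""
    def __init__(self, channels=256):
        super().__init__()
        mid = channels // 4
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.ReLU(),          # reduce features
            nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU(),    # cheap 3x3 conv
            nn.Conv2d(mid, channels, 1),                     # expand features
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)  # residual: output plus bypassed input

y = BottleneckSketch()(torch.randn(1, 256, 14, 14))  # same shape: (1, 256, 14, 14)
```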

ResNet uses a fairly simple initial layer at its input (the stem): a 7×7 convolutional layer followed by a pooling layer of stride 2. Contrast this with the more complex and less intuitive stems of Inception V3 and V4.

ResNet also uses a pooling layer plus softmax as the final classifier.

Additional insights into ResNet are appearing every day:

ResNet can be seen as both parallel and serial: think of the input as going to many modules in parallel, while the output of each module connects in series.
ResNet can also be thought of as multiple ensembles of parallel or serial modules (see the paper: Residual Networks are Exponential Ensembles of Relatively Shallow Networks).
It has been found that ResNet usually operates on blocks of relatively shallow depth, around 20-30 layers, that act in parallel rather than flowing serially through the entire length of the network.
When the output is fed back to the input, as in an RNN, ResNet can be seen as a more biologically plausible model of the cortex (see paper: Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex).

Inception V4
Here's another version of Inception from Christian and his team, which is similar to Inception V3:


Inception V4 also combines the Inception module and the ResNet module:


In my opinion, this architecture is less clean, and also more complicated and full of non-transparent heuristics. It is hard to understand the choices made, and it is also hard for the authors to justify them.

In this respect, ResNet, a clean and simple network that can be easily understood and modified, may be the better choice.

SqueezeNet
SqueezeNet (see paper: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size) is a recently published architecture that re-works concepts from ResNet and Inception. It shows that a better architecture design can deliver small network models with few parameters, without the need for complex compression algorithms.

ENet
Our team set out to combine all the features of the recent architectures into a very efficient and lightweight network that uses very few parameters and computations while achieving state-of-the-art results. This network architecture is called ENet, and it was designed by Adam Paszke. We have used it for pixel-wise labeling and scene parsing.

For details on ENet, see the paper ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. ENet is an encoder-plus-decoder network. The encoder is a regular CNN designed for classification, while the decoder is an upsampling network that propagates the categories back to the original image size for segmentation. This uses only neural networks, and no other algorithm, to perform image segmentation.

ENet was designed to use the minimum number of resources possible from the start. As a result it has a very small footprint: the encoder and decoder networks together occupy only 0.7 MB at fp16 precision. Even at this tiny size, ENet is similar to, or better than, other pure neural network solutions in segmentation accuracy.

Module analysis
A systematic analysis of CNN modules has been done in the paper Systematic evaluation of CNN advances on the ImageNet, and its findings are very helpful:

Use the ELU non-linearity without batch normalization, or ReLU with it.
Apply a learned colorspace transformation to RGB.
Use a linear learning-rate decay policy (see the sketch after this list).
Use a sum of average pooling and max pooling layers.
Use a mini-batch size of about 128 to 256. If this is too big for your GPU, decrease the learning rate proportionally to the batch size.
Use fully connected layers as convolutions and average the predictions for the final decision.
When investing in increasing the training set size, check that a plateau has not been reached.
The cleanliness of the data matters more than its size.
If you cannot increase the input image size, reduce the stride in the subsequent layers; it has roughly the same effect.
If your network has a complex and highly optimized architecture, as GoogLeNet does, be careful with modifications.
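
As an example of one of these recipes, a linear learning-rate decay policy takes only a few lines; the model, base rate, and epoch count below are arbitrary placeholders:

```python
import torch
import torch.nn as nn

# Illustrative linear learning-rate decay: the rate falls linearly from
# base_lr to zero over num_epochs (all values here are arbitrary choices).
model = nn.Linear(10, 2)
base_lr, num_epochs = 0.1, 30
opt = torch.optim.SGD(model.parameters(), lr=base_lr)

for epoch in range(num_epochs):
    lr = base_lr * (1 - epoch / num_epochs)
    for group in opt.param_groups:
        group["lr"] = lr
    # ... training loop for one epoch goes here ...
```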

Other architectures worth paying attention to
FractalNet (see paper: FractalNet: Ultra-Deep Neural Networks without Residuals) uses a recursive architecture that has not been tested on ImageNet, and is a derivative of, or a more general form of, ResNet.

The Future
We believe that crafting neural network architectures is of paramount importance for the progress of deep learning. Our team highly recommends carefully reading and understanding all the papers mentioned in this article.

But one may wonder why we have to spend so much time crafting architectures, and why we do not instead let the data tell us what to use and how to combine modules. These are good questions, but they are still under active research; one paper to refer to is: Neural networks with differentiable structure.

Note that most of the architectures discussed in this article concern computer vision. Similar neural network architectures have been developed in other areas, and it would also be interesting to study the evolution of architectures for all other tasks.

If you are interested in a comparison of neural network architectures and their computational performance, see the paper: An Analysis of Deep Neural Network Models for Practical Applications.
