Convolutional Neural Networks — Summary of Krizhevsky et al.'s 2012 paper

Ahlemkaabi
Feb 3, 2022 · 5 min read

ImageNet and the ImageNet Challenge:

Dr. Fei-Fei Li is the creator of ImageNet and the ImageNet Challenge, a critical large-scale dataset and benchmarking effort that has driven many of the recent developments in deep learning and AI.
ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories.

Introduction:

(background of the study and statement of purpose.)

In the frame of the ILSVRC competition (the ImageNet Large-Scale Visual Recognition Challenge, run as part of the PASCAL Visual Object Challenge since 2010), Krizhevsky and his colleagues trained one of the largest convolutional neural networks to date on the subsets of ImageNet used in the ILSVRC-2010 and ILSVRC-2012 competitions and achieved results considerably better than the previous state of the art.

In this blog post, I will try to summarize the paper describing this convolutional neural network, also known as AlexNet.

ILSVRC uses a subset of ImageNet with roughly 1000 images in each of 1000 categories.
In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.

Procedures:

(The specifics of what this study involved.)

The paper details how the team implemented their model: how they preprocessed the data, which architecture they chose, which regularization techniques they used and why, along with some details about the hyperparameters.

The dataset (ImageNet):

ILSVRC-2010 is the only version of ILSVRC for which the test set labels are available, so this is the version on which most of the experiments were performed. The team also entered the model in the ILSVRC-2012 competition, for which the test set labels are unavailable.

ImageNet consists of variable-resolution images, while the system requires a constant input dimensionality, so the images had to be down-sampled to a fixed resolution of 256x256 (a minimal sketch follows the list):

  1. Scale a possibly rectangular image so that the shorter side is 256 pixels.
  2. Take the central 256x256 patch as the input image.
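
As a rough illustration, here is a minimal preprocessing sketch using Pillow (my own code, not from the paper): rescale the shorter side to 256 pixels, then keep the central 256x256 crop.

```python
from PIL import Image

def rescale_and_center_crop(path, size=256):
    """Rescale the shorter side to `size`, then take the central size x size patch."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = size / min(w, h)                      # shorter side becomes `size` pixels
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2  # coordinates of the central patch
    return img.crop((left, top, left + size, top + size))
```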

The architecture

Figure: Architecture of the network (from the paper)

The network contains eight learned layers: five convolutional and three fully-connected.
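
For readers who prefer code, here is a hedged PyTorch sketch of those eight layers in a single-GPU form (my own reconstruction, close to the modern torchvision variant; it omits local response normalization and the two-GPU split discussed below, and the padding values are my assumption):

```python
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """Five convolutional layers followed by three fully-connected layers."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        # Expects a 3x224x224 input (the crop size used for training).
        x = self.features(x)
        return self.classifier(x.flatten(1))
```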

In the paper, the team points out that the network does not have a standard architecture, and that its unusual features are part of what makes it so effective.

Here are some of the unusual features of the network’s architecture:

1- ReLU non-linearity

They used the ReLU (Rectified Linear Unit) non-linearity as the neurons' activation function.

This figure from the paper is the best way to explain their choice!

Figure 1 (from the paper): A four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons (dashed line).

  • The learning rates for each network were chosen independently to make training as fast as possible.
  • No regularization of any kind was employed.

Conclusion: Deep convolutional neural networks with ReLUs train several times faster than their equivalents with Tanh units.

The team believes that

Faster learning has a great influence on the performance of large models trained on large datasets.

Figure: different activation functions
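
A tiny illustration (my own, not from the paper) of the two non-linearities being compared: ReLU passes positive inputs through unchanged and never saturates, while tanh flattens out for large inputs, which slows gradient-based learning.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 7)
relu = np.maximum(0.0, x)   # f(x) = max(0, x): non-saturating for positive x
tanh = np.tanh(x)           # squashes into (-1, 1) and saturates for large |x|
print(relu)
print(tanh)
```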

2- Training on multiple GPUs

Training such a large network requires a lot of memory: a single GTX 580 GPU has only 3 GB, so the team spread the network across two GPUs.

Current GPUs are particularly well-suited to cross-GPU parallelization, as they are able to read from and write to one another’s memory directly, without going through host machine memory.

  • Half of the kernels (or neurons) are placed on each GPU.
  • The GPUs communicate only in certain layers (see the sketch below).
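
To make the idea concrete, here is a conceptual sketch (my own, not the authors' code) of splitting one layer's kernels across two devices; it falls back to the CPU so it stays runnable without two GPUs:

```python
import torch
import torch.nn as nn

dev0 = "cuda:0" if torch.cuda.device_count() > 0 else "cpu"
dev1 = "cuda:1" if torch.cuda.device_count() > 1 else "cpu"

# Half of the first layer's 96 kernels live on each device.
conv_a = nn.Conv2d(3, 48, kernel_size=11, stride=4, padding=2).to(dev0)
conv_b = nn.Conv2d(3, 48, kernel_size=11, stride=4, padding=2).to(dev1)

x = torch.randn(1, 3, 224, 224)
# Each device convolves the same input with its own half of the kernels;
# concatenating the feature maps is where cross-GPU communication happens.
out = torch.cat([conv_a(x.to(dev0)), conv_b(x.to(dev1)).to(dev0)], dim=1)
print(out.shape)  # (1, 96, 55, 55)
```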

They also describe local response normalization and overlapping pooling; for more details I recommend reading the original paper (link below)!
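
As a pointer, both techniques have off-the-shelf PyTorch counterparts; the constants below follow the paper (n = 5, alpha = 1e-4, beta = 0.75, k = 2 for normalization; 3x3 pooling windows with a stride of 2, so neighbouring windows overlap):

```python
import torch.nn as nn

# Local response normalization with the paper's constants.
lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)

# Overlapping pooling: the 3x3 window is larger than the stride of 2.
overlapping_pool = nn.MaxPool2d(kernel_size=3, stride=2)
```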

Reducing Overfitting

To train such a large network without substantial overfitting, the team used two approaches:

  • Data Augmentation, in two forms (a sketch of the first form follows this list):
    1- Generating image translations and horizontal reflections (extracting random 224x224 patches and their horizontal reflections from the 256x256 images and training the network on these extracted patches).
    2- Altering the intensities of the RGB channels in training images (they perform PCA color augmentation on the set of RGB pixel values throughout the ImageNet training set).
  • Dropout: setting the output of each hidden neuron to zero with probability 0.5.
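
Here is a hedged sketch of how the first augmentation form and dropout might look with modern torchvision/PyTorch building blocks (my choice of API, not the authors' pipeline; the PCA color augmentation is not shown):

```python
import torch.nn as nn
from torchvision import transforms

# Random 224x224 patches plus horizontal reflections of the 256x256 training images.
train_augment = transforms.Compose([
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

# Dropout: each hidden activation is zeroed with probability 0.5 during training.
dropout = nn.Dropout(p=0.5)
```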

Training

The model was trained with stochastic gradient descent, with a momentum of 0.9, a weight decay of 0.0005, and a batch size of 128 images.
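
Expressed with torch.optim.SGD, those settings would look roughly like this (a sketch, not an exact reproduction of the paper's update rule; `model` is a placeholder, and the 0.01 learning rate comes from the initialization notes below):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder module; stands in for the full network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,              # initial learning rate (see Initialization below)
    momentum=0.9,
    weight_decay=0.0005,
)
```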

Initialization:

  • Weights (in each layer): drawn from a zero-mean Gaussian distribution with a standard deviation of 0.01 (see the sketch after this list).
  • Biases (in the second, fourth, and fifth convolutional layers and in the fully-connected hidden layers): initialized to the constant 1, which accelerates the early stages of learning by providing the ReLUs with positive inputs; the remaining biases were initialized to 0.
  • Learning rate (the same for all layers): initialized at 0.01 and adjusted manually throughout training.
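
A minimal sketch of this scheme (my own; selecting which layers receive a bias of 1 is left to the caller):

```python
import torch.nn as nn

def init_alexnet_style(module, bias_value=0.0):
    """Zero-mean Gaussian weights with std 0.01; bias 1.0 for the layers
    listed above, 0.0 for the others (pass the value per layer)."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=0.01)
        nn.init.constant_(module.bias, bias_value)
```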

Learning time:

  • Five to six days on two NVIDIA GTX 580 3 GB GPUs (90 cycles through the training set of 1.2 million images).

Results:

(the major findings and results.)

The major findings: AlexNet was able to recognize off-center objects, and most of its top-five predicted classes for each image are reasonable.

AlexNet achieved a top-1 error rate of 37.5% and a top-5 error rate of 17.0% on ILSVRC-2010.

In the 2012 challenge, a variant of the model pre-trained on the ImageNet Fall 2011 release brought the top-5 error rate down to 15.3%.

Conclusion:

(Summarizing the researchers’ conclusions.)

Using purely supervised learning, the deep convolutional neural network was able to achieve record-breaking results. The number and types of layers were chosen carefully: the authors note that removing even a single convolutional layer degrades performance, so the network's depth really matters.

Personal Notes:

This study is insightful, particularly in the arguments used to explain why each decision was made, whether a hyperparameter value or a choice of function.
It also contained several concepts that I had to explore further, such as local response normalization and training on multiple GPUs.
