Original article was published by Amir Nejad on Deep Learning on Medium
The LeNet-5 model demonstrated a strong ability to read and classify handwritten digits. Although LeNet-5 performed well on the MNIST dataset, classifying more complicated images such as CIFAR-10 showed that the model’s capacity was too low to learn such complex patterns. The development of more powerful architectures therefore went into hibernation until 2012, when AlexNet was born. AlexNet, proposed by Krizhevsky et al., is widely considered the first deep CNN model. Several developments in the meantime accelerated improvements in neural network classification accuracy, namely:
- Max pooling: A pooling layer reduces the sensitivity of a neural network to the location of a feature in the image. The original LeNet-5 model uses average pooling layers. However, Ranzato et al. demonstrated good results by learning invariant features using max pooling layers. A max pooling layer keeps only the highest activation in each window and discards the weaker ones, so only the most prominent features are passed on. For more information refer to part 1 of this series (link).
- GPU/CUDA programming: In the early development of neural networks, one of the main bottlenecks during model training was computational power, as models were trained primarily on the CPU. In 2007, NVIDIA launched the CUDA (Compute Unified Device Architecture) programming platform to facilitate parallel processing on GPUs (graphics processing units). CUDA enables model training on GPUs and results in much faster training than on CPUs. As a result, training larger networks became possible.
- ReLU activation function: Nair and Hinton demonstrated that rectified linear units (ReLU) improve the classification accuracy of Restricted Boltzmann Machines. A ReLU simply passes any value above zero unchanged and sets any value below zero to zero. The ReLU function is non-saturating: it grows without bound as the input increases, so its gradient does not shrink toward zero for large positive inputs, which helps alleviate the vanishing gradient problem.
- ImageNet: Another catalyst for the success of deep learning in general is the ImageNet database, prepared by Professor Fei-Fei Li’s group at Stanford. ImageNet contains millions of annotated images from thousands of classes. The annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has driven many advancements in convolutional neural network architectures, including AlexNet. The training data is a subset of ImageNet released in 2012 and has 1.2 million images belonging to 1,000 classes. The validation dataset consists of 50,000 images belonging to 1,000 classes (50 images per class). A sample ImageNet image can be seen in the example below:
Note on downloading the data: The official ImageNet website (link) can provide the images to individuals. However, I received no download link after submitting my request. The easiest way to download the images is from the ImageNet Object Localization Challenge (link).
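To make the max pooling and ReLU points from the list above more concrete, here is a minimal sketch in plain Python (no TensorFlow needed). The 4×4 feature map, the 2×2 window, and the helper names `pool2d`, `relu_grad`, and `sigmoid_grad` are all made up for illustration; they are not taken from the original article's code.

```python
import math


def pool2d(grid, size, stride, reduce_fn):
    """Slide a size x size window over a 2-D list and reduce each window."""
    rows = (len(grid) - size) // stride + 1
    cols = (len(grid[0]) - size) // stride + 1
    return [
        [
            reduce_fn([grid[r * stride + i][c * stride + j]
                       for i in range(size) for j in range(size)])
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def mean(values):
    return sum(values) / len(values)


def relu(x):
    """Pass positive values through, set negative values to zero."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def sigmoid_grad(x):
    """Gradient of the sigmoid, a saturating activation, for comparison."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


# Hypothetical feature map with one strong activation per 2x2 window.
feature_map = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 9.0, 0.0, 1.0],
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 8.0],
]

max_pooled = pool2d(feature_map, size=2, stride=2, reduce_fn=max)
avg_pooled = pool2d(feature_map, size=2, stride=2, reduce_fn=mean)

print(max_pooled)  # [[9.0, 2.0], [3.0, 8.0]]
print(avg_pooled)  # [[2.5, 0.75], [1.0, 2.0]]
print(relu_grad(10.0), sigmoid_grad(10.0))  # ReLU stays at 1.0, sigmoid is tiny
```

Max pooling keeps the strongest activation in each window (9 and 8 survive untouched), whereas average pooling dilutes it with the surrounding zeros. The last line shows the non-saturating behaviour: for a large positive input the ReLU gradient is still 1, while the sigmoid gradient has almost vanished.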
AlexNet Model Structure
AlexNet achieved a winning top-5 test error rate of 15.3% in the ILSVRC-2012 competition, compared with 26.2% for the second-best entry. The network architecture is similar to the LeNet-5 model (read more on LeNet-5: Link) but has more convolutional layers and is therefore deeper.
The main activation function used in the model is the non-saturating rectified linear unit (ReLU). The model comprises 8 main layers: 5 convolutional layers and 3 dense layers. The kernel size decreases from 11×11 in the first convolutional layer to 3×3 in the later ones. The first, second, and fifth convolutional layers are each followed by a max pooling layer. The model uses dropout in the first two fully-connected layers to avoid over-fitting.
The model is trained using the stochastic gradient descent (SGD) optimization algorithm. The learning rate is initialized at 0.01, with a momentum of 0.9 and a weight decay of 0.0005. The code snippet to build the AlexNet model in TensorFlow can be seen below:
Note that the optimizer used in the model is gradient descent with momentum and weight decay. This optimizer is located in a separate package called tensorflow_addons (more info can be seen here).
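To show what these hyperparameters do, here is a minimal sketch of a single-weight update with momentum and weight decay, written in plain Python. It follows the update rule described in the AlexNet paper (v ← 0.9·v − 0.0005·ε·w − ε·∇L, then w ← w + v); the function name `sgd_momentum_step` and the toy quadratic loss are assumptions for illustration, and the tensorflow_addons optimizer may apply the decay term in a slightly different way.

```python
def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One SGD step with momentum and weight decay for a single scalar weight.

    v_new = momentum * v - weight_decay * lr * w - lr * grad
    w_new = w + v_new
    """
    v_new = momentum * v - weight_decay * lr * w - lr * grad
    return w + v_new, v_new


# Toy problem (hypothetical): minimize f(w) = 0.5 * w**2, whose gradient is w.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad=w)

print(abs(w) < 1e-2)  # True: the weight has moved close to the minimum at 0
```

The momentum term reuses 90% of the previous step, which smooths the trajectory, while the weight decay term adds a small pull toward zero in every step.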
AlexNet Demo on 2 Classes
Training AlexNet on the entire ImageNet dataset is time-consuming and requires GPU computing capabilities. Therefore, in this section, I am going to demonstrate training an AlexNet-type structure on a subset of the ImageNet dataset consisting of two classes:
n03792782: mountain bike, all-terrain bike, off-roader
n03095699: container ship, container vessel
The training dataset consists of 2,600 images belonging to the two classes. Calling the AlexNet function results in a network with over 62 million trainable parameters, as can be seen below:
Layer (type)                     Output Shape            Param #
conv2d (Conv2D)                  (None, 55, 55, 96)      34944
max_pooling2d (MaxPooling2D)     (None, 27, 27, 96)      0
conv2d_1 (Conv2D)                (None, 27, 27, 256)     614656
max_pooling2d_1 (MaxPooling2D)   (None, 13, 13, 256)     0
conv2d_2 (Conv2D)                (None, 13, 13, 384)     885120
conv2d_3 (Conv2D)                (None, 13, 13, 384)     1327488
conv2d_4 (Conv2D)                (None, 13, 13, 256)     884992
max_pooling2d_2 (MaxPooling2D)   (None, 6, 6, 256)       0
flatten (Flatten)                (None, 9216)            0
dense (Dense)                    (None, 4096)            37752832
dropout (Dropout)                (None, 4096)            0
dense_1 (Dense)                  (None, 4096)            16781312
dropout_1 (Dropout)              (None, 4096)            0
dense_2 (Dense)                  (None, 1000)            4097000
dense_3 (Dense)                  (None, 2)               2002

Total params: 62,380,346
Trainable params: 62,380,346
Non-trainable params: 0
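The output shapes and parameter counts in the summary above can be checked by hand. The sketch below does the arithmetic in plain Python. The summary does not state the input size, strides, or padding, so the values used here (a 227×227×3 input, stride 4 in the first convolution, "same"-style padding in the later ones, and 3×3 pooling with stride 2) are assumptions chosen because they reproduce the numbers in the table; the helper names are made up for illustration.

```python
def conv_out(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (n + 2 * padding - kernel) // stride + 1


def conv_params(kernel, in_ch, out_ch):
    """Weights (kernel * kernel * in_ch per filter) plus one bias per filter."""
    return kernel * kernel * in_ch * out_ch + out_ch


def dense_params(in_units, out_units):
    """Weight matrix plus one bias per output unit."""
    return in_units * out_units + out_units


# Assumed settings: (kernel, out_channels, stride, padding, max_pool_after)
CONV_LAYERS = [
    (11, 96, 4, 0, True),
    (5, 256, 1, 2, True),
    (3, 384, 1, 1, False),
    (3, 384, 1, 1, False),
    (3, 256, 1, 1, True),
]
DENSE_UNITS = [4096, 4096, 1000, 2]  # dense layers as listed in the summary


def alexnet_summary(input_size=227, in_ch=3):
    """Return the shape after each conv block and the total parameter count."""
    size, channels, total = input_size, in_ch, 0
    shapes = []
    for kernel, out_ch, stride, padding, pool in CONV_LAYERS:
        size = conv_out(size, kernel, stride, padding)
        total += conv_params(kernel, channels, out_ch)
        channels = out_ch
        if pool:
            size = conv_out(size, kernel=3, stride=2)  # 3x3 max pool, stride 2
        shapes.append((size, size, channels))
    units = size * size * channels  # flatten
    for out_units in DENSE_UNITS:
        total += dense_params(units, out_units)
        units = out_units
    return shapes, total


shapes, total = alexnet_summary()
print(shapes[-1])  # (6, 6, 256), which flattens to 9216 features
print(total)       # 62380346
```

Almost all of the parameters sit in the dense layers: the first dense layer alone (9216 × 4096 weights plus 4096 biases) accounts for about 37.8 million of the 62.4 million.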
Model Training and Evaluation
The AlexNet model is trained for 90 epochs on the entire training data and validated on 50K images from the validation dataset. An example of training the model on a CPU can be seen below (to train on GPU, use
The fully trained AlexNet model on 2 classes can reach an accuracy of 95%. The learning curves and losses for the training and validation sets can be seen in the following figure. As can be seen, the loss on the validation set stays flat after 20 epochs, and the model does not improve further.
Another way of assessing model performance is the confusion matrix. A confusion matrix is a table that cross-tabulates the true data categories against the categories predicted by the trained classifier. An example of a confusion matrix can be seen here. This confusion matrix is obtained by running the trained neural network classifier on 100 validation images (50 images per category). As can be seen, the model misclassified only 1 image in the bikes category (denoted by 0). However, it misclassified 3 images in the ships category (denoted by 1).
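As a rough illustration of how such a table is built, the sketch below counts (true label, predicted label) pairs in plain Python. The label lists are synthetic: they are simply arranged to match the counts reported above (1 bike and 3 ships misclassified out of 50 each) and are not the model's actual predictions. The function name `confusion_matrix` is a local helper, not a library call.

```python
def confusion_matrix(y_true, y_pred, n_classes=2):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


# Synthetic labels: 0 = bikes, 1 = ships, 50 validation images per class.
y_true = [0] * 50 + [1] * 50
y_pred = [0] * 49 + [1] * 1 + [1] * 47 + [0] * 3

cm = confusion_matrix(y_true, y_pred)
accuracy = sum(cm[i][i] for i in range(len(cm))) / len(y_true)

print(cm)        # [[49, 1], [3, 47]]
print(accuracy)  # 0.96
```

The diagonal holds the correctly classified images and the off-diagonal cells hold the errors, so the 4 misclassified images correspond to an accuracy of 96% on this 100-image sample.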
The 4 misclassified images are illustrated below. It seems the model cannot fully recognize objects in the image when:
- Object is cropped in the image (partially visible object)
- Object is in the background or covered by the surroundings