Source: Deep Learning on Medium
ZF Net 
The winner of the ILSVRC 2013 was a network called ZF Net, which was built by Matthew Zeiler and Rob Fergus. The model was trained on the same ImageNet 2012 dataset that AlexNet was trained on.
Each RGB image was preprocessed by resizing its smallest dimension to 256, cropping the central 256×256 region, subtracting the per-pixel mean (computed across all training images), and then taking 10 different sub-crops of size 224×224.
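The 10-crop step above can be sketched in NumPy: 4 corner crops plus a center crop, each with its horizontal flip. This is a minimal sketch, assuming the image has already been resized and mean-subtracted; the function name `ten_crop` and the random input are illustrative, not from the paper.

```python
import numpy as np

def ten_crop(img, crop=224):
    """Extract the 10 test-time sub-crops: 4 corners + center,
    each also horizontally flipped."""
    h, w = img.shape[:2]
    offsets = [(0, 0), (0, w - crop), (h - crop, 0),
               (h - crop, w - crop), ((h - crop) // 2, (w - crop) // 2)]
    crops = []
    for y, x in offsets:
        c = img[y:y + crop, x:x + crop]
        crops.append(c)
        crops.append(c[:, ::-1])  # horizontal flip
    return np.stack(crops)

# stand-in for a resized, mean-subtracted 256x256 RGB image
img = np.random.rand(256, 256, 3).astype(np.float32)
img -= img.mean()
batch = ten_crop(img)
print(batch.shape)  # (10, 224, 224, 3)
```

At test time, predictions from the 10 crops are averaged.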
The model achieved a test error rate of 14.8%. The model also generalizes well to other datasets: when only the softmax classifier is retrained, it convincingly beat the then state-of-the-art results on the Caltech-101 and Caltech-256 datasets. The architecture is essentially a fine-tuned version of the AlexNet architecture.
Here the authors are addressing two main issues:
· Why do large convolutional network models perform so well?
· How can these networks be improved?
A novel visualization technique that gives insight into the function of the intermediate feature layers and the operation of the classifier is introduced. The technique reveals the input stimuli that excite individual feature maps at any layer in the model. It also allows us to observe the evolution of features during training and to diagnose potential problems with the model.
The visualization technique uses a multi-layered deconvolutional network (deconvnet) to project the feature activations back to input pixel space. An input image is fed into the CNN and activations are computed at each level (forward pass).
Suppose we want to examine the activation of a particular feature map in the 5th convolutional layer. To do this, we store the activations of that one feature map and set all of the other activations in the layer to zero. We then pass the feature maps as input into the deconvnet, which uses the same filters as the original CNN. This input goes through a series of unpooling (reverse max-pooling), rectification, and filtering operations for each preceding layer until input pixel space is reached.
To gain a deeper understanding of deconvnet, I would recommend this presentation by Zeiler. Through the visualization, we’re able to find a model architecture that outperforms AlexNet.
An ablation study is also performed to measure the performance contribution of each model layer. In a separate occlusion-sensitivity experiment, portions of the input image are systematically covered, revealing which parts of the image are important for classification.
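The occlusion experiment amounts to sliding a grey patch across the image and re-classifying at each position. A sketch, with an assumed `score_fn` standing in for a forward pass that returns the true-class probability:

```python
import numpy as np

def occlusion_map(img, score_fn, patch=32, stride=32):
    """Slide a grey patch over the image and record the classifier's
    score at each position; low scores mark regions the model relies on."""
    h, w = img.shape[:2]
    heat = []
    for y in range(0, h - patch + 1, stride):
        row = []
        for x in range(0, w - patch + 1, stride):
            occluded = img.copy()
            occluded[y:y + patch, x:x + patch] = img.mean()  # grey square
            row.append(score_fn(occluded))
        heat.append(row)
    return np.array(heat)

# toy stand-in classifier: mean intensity of the top-left corner
score = lambda im: im[:32, :32].mean()
heat = occlusion_map(np.ones((224, 224, 3)), score)
print(heat.shape)  # (7, 7)
```

Plotting `heat` over the image gives the sensitivity maps shown in the paper.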
This is an 8-layer convnet model that takes 224×224 images (with 3 color planes) as input. The input is convolved with 96 different first-layer filters, each of size 7×7, using a stride of 2 in both x and y. The resulting feature maps are then passed through a rectified linear function, max-pooled within 3×3 regions using stride 2, and contrast-normalized across the feature maps to give 96 different 55×55 feature maps.
Similar operations are repeated in the subsequent layers 2, 3, 4, and 5. The last two layers are fully connected, taking the features from the top convolutional layer as input in vector form (6·6·256 = 9216 dimensions). The final layer is a C-way softmax, where C is the number of classes.
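The flattening and final softmax can be verified in a few lines. This is a sketch, not the trained network: the weight matrix is random and only meant to show the shapes involved.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# layer-5 output: a 6x6 spatial map with 256 channels, flattened
features = np.random.randn(6 * 6 * 256)
print(features.size)  # 9216

C = 1000  # ImageNet has 1000 classes
W = np.random.randn(C, features.size) * 0.01  # illustrative weights
probs = softmax(W @ features)
print(probs.shape)  # (1000,)
```

The C outputs sum to 1 and are interpreted as class probabilities.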
ZF Net provided a better understanding of the inner workings of CNNs and illustrated more ways to improve performance. The visualization approach has also provided insight for improving network architectures more broadly.