ML Paper Challenge Day 8— Going deeper with convolutions

Original article can be found here (source): Deep Learning on Medium

Day 8: 2020.04.19
Paper: Going deeper with convolutions
Category: Model/CNN/Deep Learning/Image Recognition

This paper introduces a new architectural concept called “Inception”, which improves the utilisation of computing resources inside the network. This allows the depth and width of the network to be increased while keeping the computational budget constant.

  • based on the Hebbian principle & intuition of multi-scale processing
  • useful in the context of localisation and object detection

Related Work:

  • Series of fixed Gabor filters of different sizes: handles multiple scales
    -> In “Inception”, all filters are learned
  • Network-in-Network: increases the representational power of neural networks by adding 1 × 1 convolutional layers to the network
    -> In “Inception”, 1 × 1 convolutions are used mainly as dimension-reduction modules to remove computational bottlenecks
  • R-CNN: decomposes the overall detection problem into two subproblems: utilising low-level cues such as colour and texture in order to generate object location proposals in a category-agnostic fashion and using CNN classifiers to identify object categories at those locations.
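The dimension-reduction role of the 1 × 1 convolution is easy to see in isolation: per spatial position, it is just a linear projection across channels. A minimal NumPy sketch (the channel counts and spatial size are illustrative, not taken from the paper):

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: x has shape (H, W, C_in), w has shape (C_in, C_out).
    Per pixel, this is a linear projection across the channel dimension."""
    return x @ w  # matmul broadcasts over the spatial dimensions

rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28, 192))   # feature map: 28x28 spatial, 192 channels
w = rng.standard_normal((192, 16))       # project 192 channels down to 16

y = conv1x1(x, w)
print(y.shape)  # spatial size unchanged, channels reduced: (28, 28, 16)
```

Because the spatial extent of the filter is 1 × 1, no spatial information is mixed; only the channel dimension shrinks, which is what makes it cheap to place in front of larger convolutions.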


  • introduce sparsity and replace the fully connected layers with sparse ones, even inside the convolutions
  • “if the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer after layer by analysing the correlation statistics of the preceding layer activations and clustering neurons with highly correlated outputs.” -> Hebbian principle — neurons that fire together, wire together
  • However, today’s computing infrastructures are very inefficient for numerical calculation on non-uniform sparse data structures
  • clustering sparse matrices into relatively dense sub-matrices tends to give competitive performance for sparse matrix multiplication.
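The idea of clustering a sparse matrix into relatively dense sub-matrices can be illustrated directly: store only the non-zero blocks as small dense arrays and multiply block by block. This is a toy sketch of the data layout, not an optimised kernel:

```python
import numpy as np

B = 4  # block size

def block_sparse_matmul(blocks, shape, x):
    """Multiply a block-sparse matrix by a dense vector.
    blocks: dict mapping (row_block, col_block) -> dense (B, B) array.
    shape: (rows, cols) of the full matrix; x: dense vector of length cols."""
    y = np.zeros(shape[0])
    for (bi, bj), blk in blocks.items():
        # each non-zero block contributes one small dense mat-vec product
        y[bi * B:(bi + 1) * B] += blk @ x[bj * B:(bj + 1) * B]
    return y

rng = np.random.default_rng(2)
# a 16x16 matrix with only 3 of its 16 blocks non-zero
blocks = {(0, 0): rng.standard_normal((B, B)),
          (1, 2): rng.standard_normal((B, B)),
          (3, 3): rng.standard_normal((B, B))}
x = rng.standard_normal(16)
y = block_sparse_matmul(blocks, (16, 16), x)

# check against the equivalent fully dense matrix
dense = np.zeros((16, 16))
for (bi, bj), blk in blocks.items():
    dense[bi * B:(bi + 1) * B, bj * B:(bj + 1) * B] = blk
assert np.allclose(y, dense @ x)
```

Only the dense blocks are touched, so the arithmetic stays in the regime that numerical hardware handles efficiently, which is the observation the paper builds on.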



  • consider how an optimal local sparse structure of a convolutional vision network can be approximated and covered by readily available dense components
  • find the optimal local construction and repeat it spatially
  • a layer-by-layer construction in which one analyses the correlation statistics of the preceding layer and clusters units with highly correlated outputs into groups
  • Each unit from an earlier layer corresponds to some region of the input image and these units are grouped into filter banks
  • In lower layers (the ones close to the input):
    correlated units would concentrate in local regions.
    -> end up with a lot of clusters concentrated in a single region and they can be covered by a layer of 1×1 convolutions in the next layer
  • In upper layers:
    there will be a smaller number of more spatially spread out clusters that can be covered by convolutions over larger patches, and there will be a decreasing number of patches over larger and larger regions
    -> to avoid patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1×1, 3×3 and 5×5
  • As features of higher abstraction are captured by higher layers, their spatial concentration is expected to decrease.
    -> the ratio of 3×3 and 5×5 convolutions should increase as we move to higher layers.
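Putting these pieces together gives the naive Inception module: 1×1, 3×3 and 5×5 convolutions (all with 'same' padding) plus a 3×3 max pooling branch, run in parallel on the same input and concatenated along the channel axis. A minimal NumPy sketch — the spatial size and branch channel counts below are illustrative, not the paper's:

```python
import numpy as np

def conv_same(x, w):
    """'Same'-padded 2D convolution.
    x: (H, W, C_in); w: (k, k, C_in, C_out). Returns (H, W, C_out)."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    H, W, _ = x.shape
    out = np.empty((H, W, w.shape[3]))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k, :]            # (k, k, C_in)
            out[i, j] = np.tensordot(patch, w, axes=3)  # contract k, k, C_in
    return out

def maxpool3x3_same(x):
    """3x3 max pooling with stride 1 and 'same' padding."""
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)), constant_values=-np.inf)
    H, W, _ = x.shape
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = xp[i:i + 3, j:j + 3, :].max(axis=(0, 1))
    return out

def naive_inception(x, w1, w3, w5):
    """Run all branches on the same input and stack their outputs channel-wise."""
    branches = [conv_same(x, w1), conv_same(x, w3), conv_same(x, w5),
                maxpool3x3_same(x)]
    return np.concatenate(branches, axis=-1)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8, 32))         # small input: 8x8, 32 channels
w1 = rng.standard_normal((1, 1, 32, 16))    # 16 filters of size 1x1
w3 = rng.standard_normal((3, 3, 32, 24))    # 24 filters of size 3x3
w5 = rng.standard_normal((5, 5, 32, 8))     #  8 filters of size 5x5

y = naive_inception(x, w1, w3, w5)
print(y.shape)  # (8, 8, 16 + 24 + 8 + 32) = (8, 8, 80)
```

Because every branch preserves the spatial size, the outputs can be concatenated; the module's output channel count is simply the sum of the branch channel counts (the pooling branch passes through all input channels, which is part of what motivates the dimension reduction discussed next).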


  • 5×5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters.
  • Solution: dimension reduction
    1×1 convolutions are used to compute reductions before the expensive 3×3 and 5×5 convolutions.
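The saving can be quantified by counting multiplications. As an illustration (the 28×28×192 input and the reduction to 16 channels are plausible numbers in the regime of the paper's early modules, chosen here for concreteness):

```python
H, W = 28, 28          # spatial size of the feature map
c_in, c_out = 192, 32  # input channels, number of 5x5 filters
c_red = 16             # channels after the 1x1 reduction (illustrative)

# Direct 5x5 convolution over all 192 input channels
direct = H * W * c_out * 5 * 5 * c_in

# 1x1 reduction to 16 channels, then the 5x5 convolution on the reduced map
reduced = H * W * c_red * c_in + H * W * c_out * 5 * 5 * c_red

print(f"direct : {direct:,} multiplications")
print(f"reduced: {reduced:,} multiplications")
print(f"saving : {direct / reduced:.1f}x fewer multiplications")
```

With these numbers the reduced path needs roughly an order of magnitude fewer multiplications, because the expensive 5×5 filters now see 16 channels instead of 192, while the 1×1 reduction itself is comparatively cheap.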