CNN week 1

Image classification vs Object detection

  • Applications: e.g. Neural Style Transfer

Edge detection:

  • Vertical edge detection: a 6×6 grayscale image convolved with a 3×3 filter/kernel (Sobel, Roberts, or Prewitt filter) → a 4×4 output. For images with clear edges, visualize both the image and the Prewitt filter.
  • Vertical vs horizontal
  • Prewitt filter → Sobel filter: the center row/column is weighted more heavily, making it more robust.
  • What about learning the filter automatically? Moving from a hand-coded filter to a data-learned filter makes it more robust, e.g. to rotation: a data-dependent filter.
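The hand-coded vertical-edge detection above can be tried directly. A minimal NumPy sketch (the image values and function name are illustrative assumptions, not from the course):

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' cross-correlation: slide the kernel over the image."""
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

# 6x6 grayscale image: bright left half, dark right half (a vertical edge)
image = np.hstack([np.full((6, 3), 10.0), np.zeros((6, 3))])

# 3x3 vertical-edge filters from the notes
prewitt = np.array([[1, 0, -1],
                    [1, 0, -1],
                    [1, 0, -1]], dtype=float)
sobel = np.array([[1, 0, -1],
                  [2, 0, -2],
                  [1, 0, -1]], dtype=float)  # center row weighted more

edges = conv2d(image, prewitt)
print(edges.shape)  # (4, 4): n - f + 1 = 6 - 3 + 1
print(edges)        # large values in the middle columns, where the edge is
```

The output lights up exactly where bright pixels sit to the left of dark ones, which is what "detecting a vertical edge" means here.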

Padding:

  • The output size depends on how the [f × f] filter “fits” into the [n × n] image: the output is [(n − f + 1) × (n − f + 1)]. “f” is usually odd.
  • Edge pixels are used in fewer filtering steps than center pixels, so information near the border is lost, and the output size shrinks very fast across layers.
  • Solution: pad the input with “p” pixels on each edge to expand it → output of (n + 2p − f + 1) × (n + 2p − f + 1).
  • “Valid” convolution: no padding. “Same” convolution: padding so that output size == input size ⇒ p = (f − 1)/2.
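The valid/same output sizes can be checked with a small helper (a sketch; the function name is my own):

```python
def padded_output_size(n, f, p):
    # Output side length for an n x n image, f x f filter,
    # p pixels of padding on each edge (stride 1).
    return n + 2 * p - f + 1

# "Valid" convolution: no padding, the output shrinks
assert padded_output_size(6, 3, 0) == 4

# "Same" convolution: p = (f - 1) / 2 = 1 keeps the size
assert padded_output_size(6, 3, 1) == 6
```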

Strided convolution:

  • Sweep the filter not 1 pixel at a time, but in jumps of 2 pixels or more ⇒ a smaller output image.
  • Jumping “s” pixels ⇒ output of size [⌊(n + 2p − f)/s⌋ + 1] × [⌊(n + 2p − f)/s⌋ + 1].
  • The typical “convolution” operator in math/signal-processing textbooks flips the filter before multiplying. The “convolution” of deep learning is actually cross-correlation.

Convolution over volume:

  • RGB images have 3 channels (Nc = 3), i.e. 3 stacked matrices ⇒ the filter also has 3 layers, one to filter each channel.
  • The mechanism: at each position, filter the 3 channels with the filter’s 3 layers respectively, then add all of them together to get the final value (1 number).
  • Multiple filters: apply different filters and stack up the results. Each filter serves a different purpose.
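The sum-over-channels mechanism and the stacking of multiple filters can be sketched as follows (shapes and names are my assumptions):

```python
import numpy as np

def conv_volume(image, filters):
    """image: (n, n, n_c); filters: (n_f, f, f, n_c) -> output (m, m, n_f).
    Each filter spans all input channels; the products over every channel
    are summed into a single number per position."""
    n, _, n_c = image.shape
    n_f, f = filters.shape[0], filters.shape[1]
    m = n - f + 1
    out = np.zeros((m, m, n_f))
    for k in range(n_f):                       # one output channel per filter
        for i in range(m):
            for j in range(m):
                out[i, j, k] = np.sum(image[i:i+f, j:j+f, :] * filters[k])
    return out

rgb = np.random.rand(6, 6, 3)          # 6x6 RGB image (n_c = 3)
filters = np.random.rand(2, 3, 3, 3)   # e.g. one vertical-, one horizontal-edge filter
out = conv_volume(rgb, filters)
assert out.shape == (4, 4, 2)          # results stacked, one slice per filter
```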

One layer of the convolutional neural network

  • Input -> filtering -> temporary output -> add bias -> ReLU -> final output
  • The parameters are the elements of these filter cubes, NOT the weights of a fully connected neural network. Fewer parameters ⇒ less overfitting, easier to learn, more robust.
  • Practice: count how many parameters are in the network.
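One way to practice the parameter count: each filter is an f × f × n_c cube of weights plus one bias, independent of the image size (the helper name is my own):

```python
# Parameters of one conv layer: n_f filters, each an f x f x n_c cube
# of weights plus one bias term.
def conv_layer_params(f, n_c, n_f):
    return n_f * (f * f * n_c + 1)

# 10 filters of size 3x3 over a 3-channel input: 10 * (27 + 1) = 280,
# no matter whether the image is 64x64 or 1000x1000.
assert conv_layer_params(3, 3, 10) == 280
```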

Simple convolutional network

  • One convolution layer: image input (3 channels) -> filtering -> output whose size is determined by padding, stride, and filter size, × (number of filters).
  • Input -> many convolution layers -> flattening -> fully connected layers -> outputs.
  • Types of layers in a convolutional network: convolution (conv) + pooling (pool) + fully connected (fc)

Pooling:

  • Max operator: break the image into regions -> pick the maximum value of each region.
  • Hyperparameters: stride (s) and filter size (f); they do not need to be learned.
  • The max operator preserves the key features: if a feature vanishes in the output, it was not a “key” feature.
  • Hands-on example:
  • The average operator can also be used in the pooling step (average pooling).
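A minimal pooling sketch covering both the max and the average operator (square inputs and regions are assumed; the function name is my own):

```python
import numpy as np

def pool2d(image, f, s, mode="max"):
    """Break the image into f x f regions (stride s) and keep the
    max (or average) of each region."""
    n = image.shape[0]
    m = (n - f) // s + 1
    out = np.zeros((m, m))
    op = np.max if mode == "max" else np.mean
    for i in range(m):
        for j in range(m):
            out[i, j] = op(image[i*s:i*s+f, j*s:j*s+f])
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float)
print(pool2d(x, f=2, s=2))   # [[9, 2], [6, 3]]: the max of each 2x2 region
```

Note that pooling has no weights at all: f and s are hyperparameters, so this step adds zero learnable parameters to the network.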

CNN example:

  • Andrew Ng’s convention: convolution + pool = one layer.
  • LeNet-5: input -> layer 1 -> layer 2 -> fully connected layer 3 -> fully connected layer 4 -> softmax -> output
  • The activation size decreases quickly, while the number of parameters increases (still far smaller than fully connected weights).
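The layer-by-layer shrinkage can be traced with the output-size formula. The LeNet-5-style dimensions below (32×32×3 input, 5×5 conv filters, 2×2 pools with stride 2) are my assumption about the lecture’s exact numbers:

```python
# Trace of activation sizes through a LeNet-5-style network (assumed dims).
def out_size(n, f, s, p=0):
    return (n + 2 * p - f) // s + 1

n = 32                       # input: 32 x 32 x 3
n = out_size(n, f=5, s=1)    # conv1, 6 filters  -> 28 x 28 x 6
n = out_size(n, f=2, s=2)    # pool1             -> 14 x 14 x 6
n = out_size(n, f=5, s=1)    # conv2, 16 filters -> 10 x 10 x 16
n = out_size(n, f=2, s=2)    # pool2             ->  5 x  5 x 16
flat = n * n * 16            # flatten to 400 units before the FC layers
assert flat == 400
```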

Why CNN ?

  • Parameter sharing and sparsity of connections
  • Parameter sharing in the filter cube of features: the network learns data-dependent filters over areas/parts of the input image. A filter useful in one part of the image is also useful in other parts ⇒ fewer parameters.
  • Sparsity of connections: each output element depends on only a small number of inputs, since it is the result of applying the filter to one part of the image. Less overfitting, and better translation invariance.
  • How to train? Define a “cost” function and run gradient descent.
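A quick arithmetic check on why parameter sharing matters so much, assuming (as an illustration) a 32×32×3 input mapped to 28×28×6 feature maps:

```python
# Conv layer: six shared 5x5x3 filters plus biases vs. a fully connected
# layer producing the same number of output activations.
conv_params = 6 * (5 * 5 * 3 + 1)          # 456 parameters, shared across positions
fc_params = (32 * 32 * 3) * (28 * 28 * 6)  # ~14.4 million dense weights
assert conv_params == 456
assert fc_params == 14_450_688
```

The conv layer uses roughly 30,000× fewer parameters for the same output shape, which is exactly the “less parameters ⇒ less overfitting” point above.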

Source: Deep Learning on Medium