Image classification vs Object detection
- Applications: e.g. Neural Style Transfer
- Vertical edge detection: convolve a 6×6 grayscale image with a 3×3 filter/kernel (Sobel, Roberts, or Prewitt filter) -> 4×4 output. For images with clear edges, visualize the image alongside the Prewitt filter response.
- Vertical vs horizontal
- Prewitt filter → Sobel filter: weights the center row more heavily, making it more robust
- What about learning the filter automatically? Hand-coded filter -> data-learned filter, robust to rotation. A data-dependent filter.
- The output depends on how many positions the [f x f] filter fits into the [n x n] image: the output is [(n-f+1) x (n-f+1)]. "f" is usually odd.
- Edge pixels are used in fewer filtering positions than center pixels, so information is lost and the output size shrinks very fast.
- Solution: pad the edges with "p" pixels to expand the input image -> (n+2p-f+1) x (n+2p-f+1)
- "Valid" convolution: no padding. "Same" convolution: padding so that input size == output size => p = (f-1)/2
- Instead of sweeping the filter 1 pixel at a time, jump 2 pixels or more per step => smaller output image.
- With stride "s" => output is [⌊(n+2p-f)/s⌋+1] x [⌊(n+2p-f)/s⌋+1]
- The typical "convolution" operator in math/signal-processing textbooks flips the filter before multiplying. The "convolution" of deep learning is actually "cross-correlation".
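The edge-detection example and the sizing rules above can be sketched in NumPy (a minimal sketch; the image values are made up for illustration):

```python
import numpy as np

def conv2d(image, kernel, p=0, s=1):
    """'Convolution' as used in deep learning, i.e. cross-correlation
    (no kernel flip). Output side length: floor((n + 2p - f) / s) + 1."""
    image = np.pad(image, p)
    n, f = image.shape[0], kernel.shape[0]
    out = (n - f) // s + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i*s:i*s+f, j*s:j*s+f]
            result[i, j] = np.sum(patch * kernel)
    return result

# 6x6 image with a sharp vertical edge (bright left half, dark right half)
img = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

sobel_vertical = np.array([[1, 0, -1],
                           [2, 0, -2],
                           [1, 0, -1]], dtype=float)

edges = conv2d(img, sobel_vertical)
print(edges.shape)                             # (4, 4): "valid", n-f+1 per side
print(conv2d(img, sobel_vertical, p=1).shape)  # (6, 6): "same", p = (f-1)/2
print(conv2d(img, sobel_vertical, s=2).shape)  # (2, 2): stride 2
```

The `edges` output is nonzero only at the two columns whose 3×3 window straddles the bright/dark boundary, which is exactly the "vertical edge" the Sobel filter is built to respond to.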
Convolution over volume:
- RGB images have 3 channels (n_c = 3), i.e. 3 stacked matrices => the filter also has 3 layers, one to filter each channel (the layers may differ, depending on the variant).
- The mechanism: at each position, filter the 3 channels with the filter's 3 layers respectively, then add everything together to get the final number (1 number).
- Multiple filters: use different filters and stack up the results. Each filter serves a different purpose.
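The volume mechanism can be sketched as follows (a minimal sketch with made-up random data): each f×f×n_c filter cube is multiplied elementwise against an f×f×n_c patch and summed into one number, and the outputs of the different filters are stacked along the depth axis.

```python
import numpy as np

def conv_volume(image, filters, s=1):
    """Convolve an n x n x n_c image with a stack of f x f x n_c filters.
    Each filter spans all input channels; its products are summed into one
    number per position, and the filters' outputs are stacked depth-wise."""
    n = image.shape[0]
    n_f, f = filters.shape[0], filters.shape[1]
    out = (n - f) // s + 1
    result = np.zeros((out, out, n_f))
    for k in range(n_f):
        for i in range(out):
            for j in range(out):
                patch = image[i*s:i*s+f, j*s:j*s+f, :]   # f x f x n_c cube
                result[i, j, k] = np.sum(patch * filters[k])
    return result

rng = np.random.default_rng(0)
rgb = rng.standard_normal((6, 6, 3))         # 6x6 RGB input (n_c = 3)
filters = rng.standard_normal((2, 3, 3, 3))  # two 3x3x3 filters

out = conv_volume(rgb, filters)
print(out.shape)  # (4, 4, 2): spatial size n-f+1, depth = number of filters
```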
One layer of the convolutional neural network
- Input -> filtering -> temporary output -> add bias -> ReLU -> final output
- The parameters are the elements of these filter cubes, NOT the number of weights of a fully connected network. Fewer parameters => less overfitting, easier to learn, more robust.
- Practice counting how many parameters are in the network
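One layer and the parameter count can be sketched together (a minimal sketch; the 10-filter example is an illustrative assumption): 10 filters of size 3×3×3 plus one bias each give 10 × (27 + 1) = 280 parameters, no matter how large the input image is.

```python
import numpy as np

def conv_layer(image, filters, biases):
    """One conv layer: filtering -> add bias -> ReLU (a minimal sketch)."""
    n = image.shape[0]
    n_f, f = filters.shape[0], filters.shape[1]
    out = n - f + 1
    z = np.zeros((out, out, n_f))
    for k in range(n_f):
        for i in range(out):
            for j in range(out):
                z[i, j, k] = np.sum(image[i:i+f, j:j+f, :] * filters[k]) + biases[k]
    return np.maximum(z, 0)  # ReLU

# Parameter count: 10 filters of size 3x3x3, plus one bias per filter.
filters = np.zeros((10, 3, 3, 3))
biases = np.zeros(10)
n_params = filters.size + biases.size
print(n_params)  # 10 * (3*3*3 + 1) = 280, independent of the input size

a = conv_layer(np.ones((6, 6, 3)), filters, biases)
print(a.shape)   # (4, 4, 10): one output slice per filter
```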
Simple convolutional network
- One convolution layer: input image (3 channels) -> filtering -> output of size (determined by padding, stride, and filter size) x (number of filters).
- Input -> many convolution layers -> flattening -> fully connected layer -> outputs.
- Types of layers in a convolutional network: convolution (CONV) + pooling (POOL) + fully connected (FC)
- Max operator: break the image into regions -> pick the max value of each region.
- Hyperparameters: stride (s) and filter size (f); they do not need to be learned.
- The max operator preserves the key features; if a feature vanishes in the output, it was not a "key" feature.
- Hands-on example:
- The average operator can be used in the pooling step.
- Andrew Ng convention: Convolution+pool = one layer.
- LeNet-5: Input -> layer 1 -> layer 2 -> fully connected layer 3 -> fully connected layer 4 -> softmax -> output
- The activation size decreases quickly, but the number of parameters increases (still far smaller than fully connected weights).
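The pooling step can be sketched as follows (a minimal sketch; the 4×4 example values are made up): the same sliding-window loop as convolution, but applying max or mean over each region instead of a learned filter, so pooling has no parameters to learn.

```python
import numpy as np

def pool2d(image, f=2, s=2, op=np.max):
    """Pooling: slide an f x f window with stride s and apply max or mean.
    f and s are hyperparameters; pooling has no learned parameters."""
    n = image.shape[0]
    out = (n - f) // s + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            result[i, j] = op(image[i*s:i*s+f, j*s:j*s+f])
    return result

img = np.array([[1, 3, 2, 1],
                [4, 6, 5, 0],
                [7, 2, 9, 8],
                [0, 1, 3, 4]], dtype=float)

print(pool2d(img))               # max pooling over 2x2 regions: [[6,5],[7,9]]
print(pool2d(img, op=np.mean))   # average pooling: [[3.5,2.0],[2.5,6.0]]
```

Note how each 2×2 region collapses to its strongest response under max pooling: the large values (6, 7, 9) survive while weak ones disappear, which is the "keep the key feature" behavior described above.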
Why CNNs?
- Parameter sharing and sparsity of connections
- Parameter sharing: within the filter cube, the network learns data-dependent filters from areas/parts of the input images. A feature detector useful in one part is also useful in other parts. Fewer parameters.
- Sparsity of connections: each output element depends on only a small number of inputs, since it is the result of convolving the filter with one part of the image. Less overfitting, and together with parameter sharing this makes the features more robust to translation (translation invariance).
- How to train? We need a "cost" function and gradient descent.
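The "auto-learn the filter" idea from the first section can be sketched end to end (a toy sketch; the random image and the Sobel target are illustrative assumptions, not from the notes): define a squared-error cost between the learned filter's output and a target response, then update the filter by gradient descent. Because the conv output is linear in the weights, the gradient is just the error-weighted sum of input patches.

```python
import numpy as np

# Toy setup: a "true" filter generates the targets the learned filter must match.
rng = np.random.default_rng(1)
img = rng.standard_normal((6, 6))
true_filter = np.array([[1, 0, -1],
                        [2, 0, -2],
                        [1, 0, -1]], dtype=float)  # Sobel, the filter to recover

def conv2d(x, w):
    out = x.shape[0] - w.shape[0] + 1
    return np.array([[np.sum(x[i:i+3, j:j+3] * w) for j in range(out)]
                     for i in range(out)])

target = conv2d(img, true_filter)

w = np.zeros((3, 3))  # start from an uninformed filter
lr = 0.01
for _ in range(2000):
    err = conv2d(img, w) - target          # derivative of the squared-error cost
    grad = np.zeros_like(w)
    for i in range(err.shape[0]):          # gradient = error-weighted input patches
        for j in range(err.shape[1]):
            grad += err[i, j] * img[i:i+3, j:j+3]
    w -= lr * grad                         # one gradient descent step

final_loss = np.sum((conv2d(img, w) - target) ** 2)
print(np.round(w, 2))  # approaches the Sobel filter that generated the targets
```

This is the data-learned filter in miniature: nothing about "vertical edges" is hard-coded, yet the cost plus gradient descent recovers an edge detector from data.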
Source: Deep Learning on Medium