Deep learning techniques in computer vision have surpassed human-level accuracy on some tasks, i.e. a computer can now tell whether a given picture shows a cat or a dog more accurately than a person can. In this post we will get a feel for the mechanics of the process that makes this possible.
Let's represent an image by a point 'x' (a bunch of pixel values) and figure out a set of rules for an image classification task, here recognizing faces. Our visual system is incredibly complex, and each scene goes through a long chain of image processing steps. Let's assume (for the sake of simplicity) that we could weigh a set of binary decisions, such as whether there is an eye in the top left of the image or a mouth at the bottom, to decide whether the image shows a face. Though this seems grossly oversimplified, we will see how this approach (with some tricks and hacks) can eventually outperform humans.
Almost any sort of intelligent task can be seen as passing a set of inputs through a function or mapping (which can be dynamic) that spits out actions or results. For image classification we have to find a mapping from 'x' to 'y', where 'y' indicates whether the image shows a face.
What if we could come up with a series of transformations (mappings) of the input space that neatly split all images into two classes, face and non-face? For instance, I might come up with an arbitrary rule: if the sum of all pixel values of an image is greater than 1000 (assuming all images are of equal size), then it is a face. OK, that's too naive to be useful, but what if we let our algorithm figure out the right set of rules by only providing it with the error on each prediction? This is the fundamental idea behind deep learning. To make our function space rich enough we include non-linearities like ReLU, sigmoid, tanh, etc.
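To make the naive rule concrete, here is a minimal sketch of it in Python. The function name, the flattened 8x8 images and their pixel values are made up for illustration; only the "sum greater than 1000" rule comes from the text.

```python
def naive_face_rule(pixels, threshold=1000):
    """Hand-written rule: call the image a face if its pixel sum exceeds the threshold."""
    return sum(pixels) > threshold


# Two hypothetical 8x8 grayscale images, flattened into lists of 64 values.
dark_image = [10] * 64    # pixel sum = 640
bright_image = [20] * 64  # pixel sum = 1280

print(naive_face_rule(dark_image))    # False
print(naive_face_rule(bright_image))  # True
```

The rule only measures overall brightness, so it says nothing about what is actually in the picture. That is why we would rather let the algorithm find the rules itself.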
Note: Multiple arrows leaving a neuron indicate a single activation duplicated to form inputs for the next layer.
Consider the simple mapping y = mx + c, which takes an input x and maps it to y, with m and c as parameters.
In the case of our image data, x is a vector the size of the image, with m and c as the corresponding vectors of variables whose values we have to tweak. Our aim is to reduce the error. Our mention of non-linearities simply means passing the output y through a non-linear function like sigmoid.
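A single neuron of this kind can be sketched as follows. This is only an illustration under a few assumptions: the names `sigmoid` and `neuron` are mine, the input is a tiny three-pixel "image", and c is taken as one bias number per neuron (a common convention) rather than a vector.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, m, c):
    """Compute y = m.x + c, then pass y through the sigmoid non-linearity."""
    y = sum(m_i * x_i for m_i, x_i in zip(m, x)) + c
    return sigmoid(y)


x = [0.5, 0.1, 0.9]  # hypothetical pixel values
m = [0.0, 0.0, 0.0]  # one weight per pixel, all zero before any learning
c = 0.0              # single bias term (assumption: a scalar, not a vector)

print(neuron(x, m, c))  # 0.5
```

With all parameters at zero the neuron is maximally undecided and outputs 0.5. Tweaking m and c moves that output towards 0 or 1.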
Theoretically the form of the non-linearity doesn't matter (we will discuss this in future posts), so for now take sigmoid as a convenient choice. So each circle in the above figure performs a linear operation followed by a squashing function like sigmoid. Virtually all the neurons are the same, except that each one takes the outputs of the previous layer as its inputs. The output will be 1 or 0, indicating face or non-face (here, since the last neuron also has a sigmoid activation, we actually get values between 0 and 1 indicating the probability of a face). But how will we figure out the right weights? That's the question for the next post (Backpropagation).
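The layered structure described above can be sketched as a forward pass through a tiny network. Everything specific here is an assumption for illustration: the layer sizes (4 inputs, 3 hidden neurons, 1 output), the helper names, and the random, untrained weights. Finding good weights is exactly what this sketch does not do.

```python
import math
import random


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """Each neuron takes all outputs of the previous layer: linear step, then sigmoid."""
    return [
        sigmoid(sum(w * a for w, a in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x, params):
    """Pass the input through every layer in turn and return the single final output."""
    activations = x
    for weights, biases in params:
        activations = layer(activations, weights, biases)
    return activations[0]


def random_params(sizes, seed=0):
    """Untrained placeholder parameters: random weights, zero biases."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
        params.append((weights, [0.0] * n_out))
    return params


params = random_params([4, 3, 1])  # hypothetical sizes: 4 pixels, 3 hidden, 1 output
x = [0.2, 0.7, 0.1, 0.9]           # hypothetical flattened image
p_face = forward(x, params)
print(p_face)  # a value between 0 and 1, read as the probability of "face"
```

Because the weights are random, the printed probability is meaningless for now. The point is only that the same simple neuron, repeated and stacked, produces the final prediction.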
NOTE: It takes some time and practice to really appreciate how nothing but the above structure can perform arbitrarily complex tasks, from image recognition to language modelling. We will be building some along the way.
Source: Deep Learning on Medium