Source: Deep Learning on Medium

Introduction to Deep Learning
What is a Neural Network?
Basic intuition of a neural network, using house price prediction as a toy example.

Classic example of supervised learning using linear regression
Problem statement: predict the price of a house from data describing it.

Technically, if X is the input data we have about the house and Y is the price of that house, we want to find a mapping from X to Y.

Function approximation
In order to see how neural networks solve this problem, we need to start from the simplest neural network: a single neuron. Neurons are the fundamental building blocks of any neural network, and each one maps its inputs to an output via some function.

Simplest Neural Network
A typical neuron takes the sum of all its inputs multiplied by their weights and applies an activation (non-linear) function to that value to produce its final output.
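As a minimal sketch of that mapping (the function and variable names here are mine, not from the article), a single neuron with a ReLU activation looks like:

```python
import numpy as np

def neuron(x, w, b):
    """Weighted sum of inputs plus bias, passed through a ReLU activation."""
    z = np.dot(w, x) + b   # linear combination of inputs and weights
    return max(z, 0.0)     # ReLU: output z if positive, else 0

# a neuron with three inputs
x = np.array([2.0, -1.0, 0.5])   # inputs
w = np.array([0.4, 0.3, -0.2])   # weights
b = 0.1                          # bias
print(neuron(x, w, b))           # prints 0.5
```

Swapping `max(z, 0.0)` for a sigmoid or tanh gives the other common activations.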

Intuition
The core idea of how neural networks approach this problem can only be studied once we extend our input data from single feature to multiple features.

Neural networks read the input features and map them to stronger, more relevant features w.r.t. our target variable.

An important point to note here is that features such as family size, walkability and school quality are derived (learned) by the neural network; they are not present in the input.

Actual Construction of a Neural Network
Supervised Learning
List of popular examples
House price prediction in Real Estate
Predict Ad click event in Online Advertising
Object classification for image tagging
Car detection and localization for autonomous vehicle
Audio to text for voice-based virtual assistant systems
Neural Network examples
Standard neural network
Convolutional neural network
Recurrent neural network
Structured data and Unstructured data
Structured data is usually the database of some application
Unstructured data refers to raw data such as audio or images
Humans are good at interpreting unstructured data
Driving factors behind rise of deep learning
Data (Collecting more labeled data)
Computation (GPU accelerator)
Algorithms (changes such as replacing the sigmoid activation function with ReLU)
Iterative cycle
Overview of rest of the material
The goal is to learn how to build deep neural networks and get them to work.
Basics of neural network programming
Build deep neural network for image classification
Logistic Regression as a Neural Network
Logistic regression is a learning algorithm used to build an ML model for binary classification.
We will study a toy example of a cat vs. non-cat classifier.
Mathematical representation of problem statement
We often look at a batch of training examples together, hence the input vectors are stacked together to form a matrix and the class labels are stacked together to form a vector.
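A quick sketch of that stacking convention (shapes follow the usual (n_x, m) layout, one column per example; the concrete numbers are made up):

```python
import numpy as np

# three training examples, each a column vector with n_x = 2 features
x1 = np.array([[1.0], [2.0]])
x2 = np.array([[3.0], [4.0]])
x3 = np.array([[5.0], [6.0]])

# stack the inputs column-wise: X has shape (n_x, m) = (2, 3)
X = np.hstack([x1, x2, x3])

# stack the class labels into a (1, m) row vector
Y = np.array([[1, 0, 1]])

print(X.shape, Y.shape)   # prints (2, 3) (1, 3)
```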
So far we have looked at inputs and their class label data. Now it’s time to focus on predictions.
In logistic regression we expect the model output to be a probability score, estimating how likely the label is to be one of the pre-defined classes given the input.
We can see how we bridge a non-numerical quantity (the class label) to the numerical value the model outputs. For more explanation refer to the logistic regression article.
A linear regression model doesn't fit this requirement, as it outputs a value that can range from -inf to +inf. We need a probability score, which ranges from 0 to 1.
This is achieved by applying the sigmoid function to the linear combination of weighted inputs.
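A small sketch of that forward computation (function names are mine): the sigmoid squashes the linear combination z = w·x + b into the (0, 1) range.

```python
import numpy as np

def sigmoid(z):
    """Squash any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x):
    """Logistic-regression prediction: sigmoid of the weighted sum."""
    z = np.dot(w, x) + b
    return sigmoid(z)

w = np.array([0.5, -0.25])
b = 0.0
x = np.array([1.0, 2.0])
print(predict(w, b, x))   # z = 0 here, so the output is exactly 0.5
```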
Logistic Regression Cost Function

All we want from the above model is predictions that are similar (as close as possible) to the ground-truth labels (class labels).
Loss / error function: we define a function which takes a prediction and the ground truth as input and outputs a (loss) value that tells us how good or bad our predictions are.
There are many options for such loss functions, but not all of them are desirable. For example, the (L2) squared-error loss maps our prediction to a loss value, but when you start learning the parameters using gradient descent, the optimization problem you get to solve is non-convex with multiple local optima.
Hence we use a different loss function which does the same job as the squared-error loss but gives us a convex optimization problem.
Cost function: the loss function measures a single training example; the cost function measures how good the predictions are over the entire training set.
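The convex loss in question is the cross-entropy loss, and the cost is its average over the training set. A sketch (example values are mine):

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for a single example (y is 0 or 1)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(Y_hat, Y):
    """Cost = mean of the per-example losses over the whole training set."""
    return np.mean(loss(Y_hat, Y))

Y_hat = np.array([0.9, 0.2, 0.8])   # model predictions
Y     = np.array([1,   0,   1])     # ground-truth labels
print(cost(Y_hat, Y))               # small, since predictions are close to labels
```

Note how the loss punishes confident wrong answers: as y_hat approaches 0 for a true label of 1, -log(y_hat) grows without bound.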
Gradient Descent

The optimization problem at hand is to find parameters (w and b) such that the cost function is minimised.
Lower-dimensional visualization of the cost function, which is a convex function
Optimization is an iterative process, where we start at some random point and descend to a new point that brings us closer to the global optimum.
Weight update: we use the concept of partial derivatives from multivariate calculus to find the right direction / step for each parameter such that, if we update that parameter in that direction, the loss value decreases.
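The update rule is simply w ← w − α·(dJ/dw), where α is the learning rate. A minimal sketch on a 1-D convex function (the function and constants are mine, chosen so the minimum is known):

```python
# minimise f(w) = (w - 3)^2 by gradient descent; its derivative is 2*(w - 3)
w = 0.0        # random-ish starting point
alpha = 0.1    # learning rate

for _ in range(100):
    dw = 2 * (w - 3)   # gradient of the cost at the current point
    w -= alpha * dw    # step against the gradient

print(w)   # converges towards the optimum w = 3
```

Because the function is convex, any starting point descends to the same global optimum; that is exactly why the convex cross-entropy cost is preferred over squared error here.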
Derivatives and Computational Graph :

The derivative of the loss function with respect to a model parameter gives us the direction in which we should make a small change so that the loss value decreases.
In order to calculate partial derivatives we leverage a concept in calculus known as the 'chain rule'.
In order to simplify and visualize the chain rule we use the concept of a computation graph, where we interpret our model as a chain of operations forming a graph that connects the inputs on the left to the output on the right.
We can traverse the graph forward, connecting the inputs to the final output, and we can also start at the output and backtrack to see how each intermediate step is computed or linked to its inputs.
Now forward and backward propagation through this graph seem more intuitive.
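A tiny worked sketch of a computation graph (the particular function J = 3(a + bc) and its values are illustrative, not from the article): forward pass left to right, then the chain rule right to left.

```python
# forward pass through the graph: u = b*c, v = a + u, J = 3*v
a, b, c = 5.0, 3.0, 2.0
u = b * c          # u = 6
v = a + u          # v = 11
J = 3 * v          # J = 33

# backward pass: chain rule, starting from the output
dJ_dv = 3.0                # J = 3v, so dJ/dv = 3
dJ_da = dJ_dv * 1.0        # v = a + u, so dv/da = 1
dJ_du = dJ_dv * 1.0        # dv/du = 1
dJ_db = dJ_du * c          # u = b*c, so du/db = c
dJ_dc = dJ_du * b          # du/dc = b

print(J, dJ_da, dJ_db, dJ_dc)   # prints 33.0 3.0 6.0 9.0
```

Each local derivative is cheap; multiplying them along the path from output back to a parameter gives the full partial derivative.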
Gradient Descent for Logistic Regression

Computational graph of the model
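Back-propagating through the logistic-regression graph gives compact gradient formulas: dZ = A − Y, dw = X·dZᵀ/m, db = mean(dZ). A sketch (function names and shapes are my choices, following the (n_x, m) column-per-example convention):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grads(w, b, X, Y):
    """Gradients of the cross-entropy cost for logistic regression.
    X has shape (n_x, m); Y and the predictions A have shape (1, m)."""
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)   # forward pass: predictions
    dZ = A - Y                        # dJ/dz for every example at once
    dw = np.dot(X, dZ.T) / m          # shape (n_x, 1), averaged over examples
    db = np.sum(dZ) / m               # scalar
    return dw, db

w = np.zeros((2, 1)); b = 0.0
X = np.array([[1.0, 2.0], [3.0, 4.0]])
Y = np.array([[1, 0]])
dw, db = grads(w, b, X, Y)
print(dw.ravel(), db)
```

These dw and db then feed the gradient-descent update from the previous section.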
Vectorization and Broadcasting

If we look at a naive implementation of logistic regression we can see that there are two for loops present.
One for loop iterates over all the training examples; the other iterates over all the features (elements) of the input vector.
We use the numpy library, which leverages SIMD instructions to optimize vector and matrix multiplications.
We replace the feature for loop with a dot product between the input and weight vectors, and by stacking all input vectors into a matrix and performing a matrix multiplication we also avoid the for loop over training examples.
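The two versions can be sketched side by side (names are mine); both compute the same predictions, but the vectorized one pushes all the looping into numpy:

```python
import numpy as np

def predict_loop(w, b, X):
    """Naive version: explicit for loops over examples and features."""
    n, m = X.shape
    A = np.zeros(m)
    for i in range(m):              # loop over training examples
        z = b
        for j in range(n):          # loop over features
            z += w[j] * X[j, i]
        A[i] = 1.0 / (1.0 + np.exp(-z))
    return A

def predict_vec(w, b, X):
    """Vectorized version: one matrix product, no Python loops."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, X) + b)))

X = np.random.randn(3, 5)   # 5 examples with 3 features each
w = np.random.randn(3)
b = 0.1
print(np.allclose(predict_loop(w, b, X), predict_vec(w, b, X)))   # prints True
```

On realistic sizes (thousands of examples, hundreds of features) the vectorized version is typically orders of magnitude faster, since numpy dispatches the whole product to optimized SIMD/BLAS routines.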