Deep Learning Specialization, Part One

Source: Deep Learning on Medium

Introduction to Deep Learning

What is a Neural Network?

Basic intuition of a neural network, using house price prediction as a toy example.

A classic example of supervised learning using linear regression.

Problem statement: predict the price of a house by looking at data describing the house.

Technically, if X is the input data we have about the house and Y is the price of that house, we want to find a mapping from X to Y.

Function approximation

To see how a neural network solves this problem, we need to start from the simplest neural network: a single neuron. Neurons are the fundamental building blocks of any neural network, and each one maps its inputs to an output via some function.

Simplest Neural Network

A typical mapping function for a neuron takes the sum of all the inputs multiplied by their weights and applies a non-linear activation function to that value to produce the final output.
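The mapping above can be sketched in a few lines of NumPy. The sigmoid activation and the example numbers here are illustrative assumptions, not values from the course:

```python
import numpy as np

def sigmoid(z):
    """Squash a real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """One neuron: weighted sum of inputs plus a bias, then a non-linear activation."""
    z = np.dot(w, x) + b   # summation of inputs multiplied by their weights
    return sigmoid(z)      # non-linear activation applied to that value

# Illustrative numbers only: a neuron with 3 input features.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.2])
b = 0.1
print(neuron(x, w, b))
```

Any other non-linearity (tanh, ReLU) could be swapped in for the sigmoid without changing the structure.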


The core idea of how neural networks approach this problem only becomes apparent once we extend the input data from a single feature to multiple features.

Neural networks read the input features and map them to stronger and more relevant features with respect to our target variable.

An important point to note here is that features such as family size, walkability, and school quality are derived (learned) by the neural network; they are not present in the input.

Actual Construction of Neural Network

Supervised Learning

List of popular examples

  1. House price prediction in Real Estate
  2. Predict Ad click event in Online Advertising
  3. Object classification for image tagging
  4. Car detection and localization for autonomous vehicles
  5. Audio-to-text transcription for voice-based virtual assistant systems

Neural Network examples

  1. Standard neural network
  2. Convolutional neural network
  3. Recurrent neural network

Structured data and Unstructured data

  • Structured data is typically tabular data, e.g. the database behind some application
  • Unstructured data refers to raw data such as audio or pictures
  • Humans are good at interpreting unstructured data

Driving factors behind rise of deep learning

  • Data (Collecting more labeled data)
  • Computation (GPU accelerator)
  • Algorithms (changes such as replacing the sigmoid activation function with ReLU)
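One reason the sigmoid-to-ReLU switch helped is that the sigmoid's gradient nearly vanishes for large inputs, which slows gradient descent, while ReLU's gradient stays at 1 for any positive input. A small illustrative sketch (numbers are assumptions, not from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # near zero whenever |z| is large (saturation)

def relu_grad(z):
    return (np.asarray(z) > 0).astype(float)  # stays 1 for any positive z

# At z = 10 the sigmoid has saturated, so its gradient has almost vanished,
# while ReLU still passes a full-strength gradient.
print(sigmoid_grad(10.0), relu_grad(10.0))
```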

Iterative cycle

Overview of the rest of the material

  • The goal is to learn how to build deep neural networks and get them to work.
  • Basics of neural network programming
  • Build deep neural network for image classification

Logistic Regression as a Neural Network

  • Logistic regression is a learning algorithm used to build an ML model for binary classification.
  • We will study a toy example: a cat vs. non-cat classifier
  • Mathematical representation of problem statement
  • We often look at a batch of training examples together, so the input vectors are stacked together to form a matrix and the class labels are stacked together to form a vector
  • So far we have looked at inputs and their class labels. Now it’s time to focus on predictions.
  • In logistic regression we expect the model to output a probability score estimating how likely the label is to be one of the pre-defined classes, given the input.
  • This is how we bridge a non-numerical quantity (the class label) to the numerical value the model outputs. For more explanation, refer to a logistic regression article.
  • A linear regression model doesn’t fit the requirement, as it outputs a value that can range from -inf to +inf; we need a probability score that ranges from 0 to 1.
  • This is achieved by applying the sigmoid function to the weighted sum (the linear combination) of the inputs.
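A minimal sketch of that pipeline, with made-up weights and inputs: the linear part z can be any real number, while the sigmoid squashes it into the (0, 1) range a probability needs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative parameters and input (assumptions, not course values).
w = np.array([2.0, -3.0])
b = 0.5
x = np.array([4.0, -1.0])

z = np.dot(w, x) + b   # linear combination: unbounded, here 11.5
y_hat = sigmoid(z)     # probability score, guaranteed to lie in (0, 1)
print(z, y_hat)
```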

Logistic Regression Cost Function

  • All we want from the above model is to give out predictions as close as possible to the ground-truth (class) labels.
  • Loss / error function: we define a function that takes a prediction and the ground truth as input and outputs a (loss) value telling us how good or bad our predictions are.
  • There are many options for such a loss function, but not all of them are desirable. For example, the (L2) squared-error loss maps our prediction to a loss value, but when you start learning parameters using gradient descent, the optimization problem you end up solving is non-convex, with multiple local optima.
  • Hence we use a different loss function that does the same job as the squared-error loss but gives us a convex optimization problem.
  • Cost function: the loss function measures a single training example; the cost function measures how good the predictions are over the entire training set.
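The convex loss in question is the logistic (binary cross-entropy) loss. A small NumPy sketch of the per-example loss and the cost as its average over the training set (the example predictions are made up):

```python
import numpy as np

def loss(y_hat, y):
    """Per-example logistic (cross-entropy) loss: low when y_hat agrees with y."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(y_hat, y):
    """Cost = average loss over the entire training set."""
    return np.mean(loss(y_hat, y))

# Illustrative labels and predictions for a 4-example training set.
y     = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.1, 0.8, 0.6])
print(cost(y_hat, y))
```

Note the last example (label 1, prediction 0.6) contributes the largest loss, as expected.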

Gradient Descent

  • The optimization problem at hand is to find parameters (w and b) such that the cost function is minimised
  • A lower-dimensional visualization of the cost function shows that it is a convex function
  • Optimization is an iterative process, where we start at some random point and descend to a new point that brings us closer to the global optimum.
  • Weight update: we use partial derivatives from multivariate calculus to find, for each parameter, the direction / step such that updating the parameter in that direction decreases the loss value.
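The update rule itself is w := w − α·dw and b := b − α·db, where α is the learning rate. A toy NumPy version (the gradient values here are made-up placeholders, not computed from a real model):

```python
import numpy as np

def gradient_descent_step(w, b, dw, db, alpha=0.01):
    """Move each parameter a small step against its gradient (alpha = learning rate)."""
    return w - alpha * dw, b - alpha * db

# Illustrative parameters and hypothetical gradients.
w = np.array([0.5, -0.3])
b = 0.1
dw = np.array([0.2, -0.4])
db = 0.05

w, b = gradient_descent_step(w, b, dw, db, alpha=0.1)
print(w, b)
```

Repeating this step drives the parameters toward the global optimum of the convex cost.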

Derivatives and Computational Graphs

  • The derivative of the loss function with respect to a model parameter gives us the direction in which we should make a small change so that the loss value decreases.
  • To calculate partial derivatives we leverage a concept from calculus known as the ‘chain rule’
  • To simplify and visualize the chain rule we use a computation graph, where we interpret our model as a chain of operations forming a graph that connects the inputs on the left to the output on the right.
  • We can traverse the graph forward, connecting inputs to the final output, or start at the output and backtrack to see how each intermediate step is computed from its inputs
  • With this view, forward and backward propagation through the graph become much more intuitive

Gradient Descent for Logistic Regression

  • Computational graph of the model
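Applying the chain rule backward through that graph for a single training example gives the standard logistic-regression gradients dz = a − y, dw = x·dz, db = dz. A sketch with illustrative numbers (x, w, b, y are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass for one training example (illustrative values).
x = np.array([1.0, 2.0])
w = np.array([0.3, -0.1])
b = 0.2
y = 1.0

z = np.dot(w, x) + b
a = sigmoid(z)          # prediction

# Backward pass: chain rule applied node by node through the graph.
dz = a - y              # dL/dz for the logistic loss
dw = x * dz             # dL/dw
db = dz                 # dL/db
print(dw, db)
```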

Vectorization and Broadcasting

  • If we look at a naive implementation of logistic regression, we can see that there are two for loops present
  • One for loop iterates over all the training examples; the other iterates over all the features (elements) of the input vector.
  • We use the NumPy library, which leverages SIMD instructions to optimize vector and matrix multiplications
  • We replace the feature loop with a dot product between the input and weight vectors; we can also stack all input vectors into a matrix and use a single matrix multiplication to avoid the loop over training examples as well
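A sketch contrasting the two-loop version with the fully vectorized one, on random illustrative data. Note how the scalar b is broadcast across all columns of the matrix product:

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

rng = np.random.default_rng(0)
n_features, m_examples = 3, 5
X = rng.normal(size=(n_features, m_examples))  # each column is one training example
w = rng.normal(size=(n_features, 1))
b = 0.5

# Naive version: one loop over examples, one loop over features.
A_loop = np.zeros((1, m_examples))
for i in range(m_examples):
    z = b
    for j in range(n_features):
        z += w[j, 0] * X[j, i]
    A_loop[0, i] = sigmoid(z)

# Vectorized version: one matrix multiplication; b is broadcast across columns.
A_vec = sigmoid(np.dot(w.T, X) + b)

print(np.allclose(A_loop, A_vec))  # same result, far fewer Python-level operations
```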