Deep learning with Python

This post is for beginners in deep learning. I describe my experience with the book ‘Deep Learning with Python’ and what I learnt from it.

The book is written by François Chollet, who is also the creator of Keras.


This is a book for anyone who wants to start a career in deep learning or simply has an interest in the field. It covers the main problem classes in deep learning and their solutions. After reading it, you will be equipped to classify images, forecast the weather, generate text and more.

I started this book one week ago. I had taken some machine learning courses, but deep learning was completely new to me. I wanted to understand neural networks for a college project, so I searched for the topic and came across many courses and a lot of study material. In the end, I found this book to be the perfect starting point for learning deep learning.

The author writes well. He makes complex problems feel like a piece of cake. I found this book in the evening and read it through the night. It is engaging to learn about the applications first and implement them afterwards. It makes you confident and keen to read further. I was already interested in the field, but after reading this book I am passionate about becoming a data scientist. I want to explore this field. I want to learn more. I want to contribute to this field.

Thanks to François Chollet for writing this book.

What is inside this book?

The book is divided into two parts:

  1. Fundamentals of Deep Learning
  2. Deep learning in Practice

The first part builds a solid foundation in deep learning and answers some important questions, such as ‘What is deep learning?’, ‘How is it different from AI and machine learning?’ and ‘What can and can’t be done with deep learning?’

Chapter 1) What is deep learning?

“Deep learning, machine learning, artificial intelligence…” Suddenly everyone is talking about them, and they are a hot topic of discussion everywhere. Yet most people have no idea how these terms differ. Do you have a clear understanding of them?

If not, this chapter is for you. It explains these terms and the history behind them.

Artificial Intelligence:

Can we make a machine think? This idea was born in the 1950s, in the era of symbolic AI. The approach was to give the machine some data and a set of rules, and the output was the answer. Until the 1980s, everyone was trying to build the best set of rules for better results. In the 1980s, people shifted their focus to another approach. It was the start of the machine learning era.

Machine Learning:

Give the machine sufficient data and the answers, and let the machine work out the rules. With this approach, a new era of machine learning started in the 1980s. Machine learning is a subfield of artificial intelligence, and deep learning is a subfield of machine learning.

Deep Learning:

Deep learning is an old subfield of machine learning, but everyone is talking about it now for three reasons:

  1. Hardware
  2. Dataset
  3. Algorithms

Recent years have seen immense growth in available data, along with advances in both hardware performance and algorithms.

Other machine learning approaches tend to stop improving once the dataset becomes very large, whereas deep learning keeps delivering better results as more data is generated every moment.

A lot of feature engineering is required in other machine learning approaches. These approaches can learn only 1–2 layers of representation. So, these approaches are also known as shallow learning.

Deep learning requires almost no feature engineering and it can learn multiple layers of representation. Neural Networks are used to learn these representations.

Deep learning shows better results in:

  1. Image Classification
  2. Speech Recognition
  3. Machine Translation
  4. Natural language processing, etc.

The chapter gives a lot more detail about traditional machine learning methods and the future of deep learning. I am not going to discuss those here, to keep the post short.

Chapter 2) Before we begin: the mathematical building blocks of neural networks.

This chapter deals with the mathematics of neural networks. Are you a curious person? Do you want to know how a machine learns?

If yes, this chapter is for you.

Have you heard of terms such as tensors, differentiation and gradient descent?

Don’t worry if the answer is no. This chapter explains these and a few more terms, from the basic to the advanced level. You will also get a first look at a neural network.

The Keras library and the MNIST dataset are used to build and train the first neural network in this book.

Q. What is a neural network?

Ans. A neural network is a mathematical model. The term ‘neural’ comes from neurobiology. Some people say that a neural network is similar to our brain, but this book rejects that analogy.

Q. What are the building blocks of a neural network?

Ans. The core building block of any neural network is the layer. A layer can be thought of as a filter that extracts useful representations from the data. Each layer consists of many subunits, known as neurons.

It may seem difficult at first, but I promise that once you understand the basic mathematics behind it, neural networks will feel easy and will become your first choice when dealing with any machine learning problem.

We need to choose three more things before training our model. These are:

  1. A loss function
  2. An optimizer
  3. Metrics to monitor training and testing.

Let me help you build an intuition for neural networks.

Imagine a ball, and think of this ball as a neuron. Let’s say we have 16 neurons. Arrange them in 4 columns like this:

0 0 0
0 0
0 0 0 0
0 0
0 0 0
0 0

These columns are known as layers. Each layer can have any number of neurons. The first layer is known as the input layer and the last layer is known as the output layer. Layers between the input layer and the output layer are known as hidden layers. A neural network can have any number of layers and any layer can have any number of neurons.

This is the basic architecture of a neural network. Next, imagine a thread between any two neurons in two consecutive layers. This connecting thread is known as a weight. In a fully connected network, every neuron in a layer is connected to every neuron in the previous layer and the next layer. The weights are numbers. For example, neuron 2 of layer 2 might be connected to neuron 5 of layer 3 by a weight of 0.4.

Imagine that you have taken a competitive exam for a job, made up of three sections (physics, chemistry and maths). You get 30 marks in physics, 70 marks in chemistry and 50 marks in maths. You will pass if the weighted sum of all three is more than 50.

Now, if all three subjects carry equal weight (about 0.33 each), the weighted sum is (0.33)*30 + (0.33)*70 + (0.33)*50 = 49.5

You don’t pass the exam.

But what if no one passes the exam? The examination authority has to change the passing criteria. They now give 50% weight to chemistry, 30% weight to maths and 20% weight to physics.

Now, your weighted sum is (0.2)*30 + (0.5)*70 + (0.3)*50 = 56.

Congrats, you pass the exam. The examination authority plays the role of the optimizer in a neural network. It changes the weights and bias values, which control how much of each feature passes through a neuron. Each neuron has a bias value, which acts like the pass mark: it shifts the weighted sum and so decides which values get through.
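
The exam arithmetic above can be checked in a few lines of Python. This is only a sketch of the analogy: the names `weighted_sum` and `PASS_MARK` are made up for illustration and do not come from the book.

```python
# Marks from the exam analogy, in the order physics, chemistry, maths.
marks = [30, 70, 50]
PASS_MARK = 50  # you pass if the weighted sum is above this value


def weighted_sum(values, weights):
    """Multiply each value by its weight and add the results."""
    return sum(v * w for v, w in zip(values, weights))


equal_weights = [0.33, 0.33, 0.33]   # roughly one third each
revised_weights = [0.2, 0.5, 0.3]    # physics 20%, chemistry 50%, maths 30%

before = weighted_sum(marks, equal_weights)
after = weighted_sum(marks, revised_weights)

print(round(before, 2), before > PASS_MARK)  # 49.5 False
print(round(after, 2), after > PASS_MARK)    # 56.0 True
```

The marks never change. Only the weights do, and that is enough to flip the result.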

Suppose you pass the exam and get the job, but the company then finds out that you don’t have the skills the job requires. That is a loss for the company. The ‘loss function’ of a neural network captures the same idea.

The loss function measures how far a predicted value deviates from the actual value. The optimizer’s job is to minimise the loss by changing the weight and bias values.
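
To make the optimizer’s job concrete, here is a minimal sketch of plain gradient descent on a single neuron with one weight and one bias, using mean squared error as the loss. The toy data (which follows y = 2x + 1), the learning rate and the number of steps are all assumptions chosen for illustration. The optimizers that ship with Keras are more sophisticated than this.

```python
# Toy data that follows y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def mse(w, b):
    """Loss: mean squared deviation of the predictions from the targets."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


w, b = 0.0, 0.0          # start from an arbitrary weight and bias
learning_rate = 0.05     # assumed step size
initial_loss = mse(w, b)

for _ in range(2000):
    # Gradients of the MSE loss with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # The "optimizer" step: move against the gradient to reduce the loss.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mse(w, b)
print(initial_loss, final_loss)
print(round(w, 3), round(b, 3))  # close to the true values 2.0 and 1.0
```

Each pass nudges the weight and bias in the direction that lowers the loss, so the loss shrinks as the values approach those behind the data.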

One more function, known as the ‘activation’, has to be set for every layer. An activation is just a mathematical function that transforms the output of a layer. For example, the ‘ReLU’ activation turns every negative output into zero.
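
As a rough sketch, ReLU and a single neuron that uses it can be written in plain Python. The helper names `relu` and `neuron`, and the input numbers, are illustrative only.

```python
def relu(x):
    """ReLU: negative values become zero, positive values pass unchanged."""
    return max(0.0, x)


def neuron(inputs, weights, bias):
    """One neuron: weighted sum of the inputs plus the bias, then ReLU."""
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return relu(total)


print([relu(v) for v in [-2.0, -0.5, 0.0, 1.5]])   # [0.0, 0.0, 0.0, 1.5]
print(neuron([1.0, 2.0], [0.5, -1.0], bias=0.25))  # 0.0, because 0.5 - 2.0 + 0.25 is negative
print(neuron([1.0, 2.0], [0.5, 1.0], bias=0.25))   # 2.75
```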

If you are interested in the mathematics behind these functions, you can go through this chapter. Otherwise, the powerful Keras library takes care of all the background coding, and we only have to choose a few hyperparameters to suit our problem.

To apply it, you follow a few simple steps:

  1. Build model architecture by selecting the number of layers, neurons and activation.
  2. Compile the model by selecting loss function, optimizer and metrics.
  3. Train the model on your data.
  4. Predict on test data.

This is a simple neural network, but it is all we need to build a baseline model with Keras.

Go through this chapter if you want a deep and clear understanding of neural networks.

Chapter 3) Getting started with neural networks

This chapter introduces Keras and components of neural networks.

What problems can be solved using neural networks?

Classes of problems:

  1. Classification
  2. Regression

Classification problems can be further divided into 3 types:

  1. Binary Classification
  2. Multiclass Classification
  3. Multilabel Classification

Binary classification applies when we need to choose between only two options. Example: dog vs cat.

Multiclass classification applies when we need to choose one option from many. Example: a digit (0–9).

Multilabel classification applies when we may need to choose several options at once. Example: river, farm, mountain etc. in a satellite image.

Regression predicts a continuous (floating-point) value. Example: the price of a house.
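
One way to see the difference between these problem types is to look at what the target for a single sample looks like. The sketch below uses made-up class lists and values; `one_hot` and `multi_hot` are hypothetical helper names, not Keras functions.

```python
# Hypothetical label sets, used only to illustrate the shape of the targets.
digit_classes = list(range(10))                 # multiclass: digits 0-9
scene_tags = ["river", "farm", "mountain"]      # multilabel: tags in an image


def one_hot(label, classes):
    """Multiclass target: exactly one position is 1."""
    return [1 if c == label else 0 for c in classes]


def multi_hot(labels, classes):
    """Multilabel target: every label that is present gets a 1."""
    return [1 if c in labels else 0 for c in classes]


binary_target = 1                                    # e.g. 1 = dog, 0 = cat
multiclass_target = one_hot(3, digit_classes)
multilabel_target = multi_hot({"river", "mountain"}, scene_tags)
regression_target = 250000.0                         # e.g. a house price

print(multiclass_target)  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(multilabel_target)  # [1, 0, 1]
```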

The chapter also walks through the basics of Keras, and you can refer to the Keras documentation for more detail.

After the introduction to Keras and neural networks, the chapter solves a binary classification problem.

Dataset — IMDB review dataset

Problem — Classify reviews as positive or negative.

After that, a multi-class classification problem is solved.

Dataset — Reuters Dataset

Problem — Classify Reuters newswires into 46 mutually exclusive topics.

After that, a regression problem is solved.

Dataset — Boston housing

Problem — Boston Housing Price prediction.

You can jump to the specific example that matches your needs.

Chapter 4) Fundamentals of machine learning

Machine learning can be divided into 4 branches:

  1. Supervised Learning
  2. Unsupervised Learning
  3. Self-supervised Learning
  4. Reinforcement Learning

Supervised Learning

Supervised learning means learning a mapping between input data and known targets, which are generally annotated by humans. The labels have to come from these known targets. Example: handwritten digit classification, in which the label can only be a digit from 0 to 9.

Unsupervised Learning

No targets are provided to the model. Instead, the model finds interesting transformations of the data, which can be used for data visualisation, clustering, dimensionality reduction etc.

Self-supervised Learning

It is supervised learning but the labels are not annotated by humans. Labels are generated from input data using a heuristic algorithm.

Reinforcement Learning

It is still in the research phase. Models try to maximise some reward while learning from the environment. It has been used to play Go.

Training, validation and test sets

Data has to be divided into 3 sets before it is fed to the model: training, validation and test sets.

Q. Why do we need a validation set? Why can’t we use only the test set for checking model performance?

Ans. There are two types of variables when building and training any model: parameters and hyperparameters.

The model learns parameters such as weights and biases from the training data, but hyperparameters have to be set by humans. To choose the hyperparameters, we use a validation set. If we used the test set to select them, we would leak some information about the test data into the model. So, it is better to use three sets:

Training set: For training parameters

Validation set: For the selection of hyper-parameters

Test set: For evaluating the final model performance.
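
A simple hold-out split into these three sets might look like the sketch below. The function name `split_dataset`, the 60/20/20 proportions and the fixed seed are assumptions for illustration, not values the book prescribes.

```python
import random


def split_dataset(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the samples and split them into training, validation and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

The test list is set aside until the very end, so the choices made with the validation list cannot leak information about it.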

Data preprocessing

It is difficult for a neural network to map relationships between raw data and labels. It is better to prepare the data before feeding it to the model. We can use vectorization, normalisation, handling of missing values, and feature engineering.


Vectorization

Neural networks only accept data and targets in the form of tensors, typically of floating-point values. Whether the data is images, text or anything else, we need to convert it into tensors first.


Normalisation

If the features do not all take values in a similar range, the model can become biased towards some of them. It is better to normalise all values, for example to the range 0 to 1.
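
Scaling a feature to the 0 to 1 range can be sketched as below. This is min-max scaling, which is one common choice among several; the function name and the sample features are made up for illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so that they fall between 0 and 1."""
    low, high = min(values), max(values)
    if high == low:
        # All values are identical: there is no spread, so map them all to 0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


ages = [18, 30, 60]                   # hypothetical feature with a small range
incomes = [20000, 50000, 120000]      # hypothetical feature with a large range

print(min_max_scale(ages))
print(min_max_scale(incomes))  # [0.0, 0.3, 1.0]
```

In practice the minimum and maximum would be computed on the training set only and then reused for the validation and test sets.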

Handling missing values

We can replace a missing value with 0, provided 0 is not already a meaningful value in the data.

Feature engineering

Feature engineering means applying simple transformations to the input data that make the mapping to the target values simpler. Do these transformations before feeding the data to the model. Often the model can’t find these simple transformations on its own if we give it the raw data.

Overfitting and underfitting

Overfitting is the main issue with any model: the model performs well on the training data but poorly on the test data. Overfitting can be reduced using techniques such as:

  1. Reducing the network size
  2. Adding weight regularisation
  3. Adding dropout
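
Two of these techniques are easy to sketch in plain Python. The code below only illustrates the ideas (an L2 weight penalty and inverted dropout) and is not how Keras implements them. The function names, default strengths and seed are assumptions.

```python
import random


def l2_penalty(weights, strength=0.01):
    """Weight regularisation: an extra loss term that grows with the squared weights."""
    return strength * sum(w * w for w in weights)


def dropout(activations, rate=0.5, rng=None):
    """Randomly zero a fraction of the activations during training.

    Surviving values are scaled by 1 / (1 - rate) so the expected sum is unchanged.
    """
    rng = rng or random.Random(0)   # fixed seed keeps this sketch reproducible
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


print(l2_penalty([0.5, -2.0, 1.0]))              # about 0.0525
print(dropout([1.0, 2.0, 3.0, 4.0], rate=0.5))   # some values zeroed, the rest doubled
```

The penalty is added to the loss so that large weights cost more. Dropout is applied only during training, never when predicting.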

The universal workflow of machine learning

  1. Defining a problem and assembling a dataset
  2. Choosing a measure of success
  3. Deciding on an evaluation protocol
  4. Preparing the data
  5. Developing a model that works better than baseline
  6. Scaling up: Developing a model that overfits
  7. Regularizing the model and tuning hyperparameters

Choosing the last layer activation and loss function for a model

Problem type — Last-layer activation — Loss function

  1. Binary classification — sigmoid — binary_crossentropy
  2. Multiclass, single-label classification — softmax — categorical_crossentropy
  3. Multiclass, multilabel classification — sigmoid — binary_crossentropy
  4. Regression — None (or sigmoid when the targets lie between 0 and 1) — mse
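
For intuition, the two activations and one of the losses named above can be written out by hand. This is a plain-Python sketch of the standard formulas, not the Keras implementations, and the function names are chosen here for readability.

```python
import math


def sigmoid(x):
    """Squash one score into (0, 1); used for binary and multilabel outputs."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1 (multiclass)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def binary_crossentropy(y_true, y_pred):
    """Loss for one binary target (0 or 1) and one predicted probability."""
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))


print(sigmoid(0.0))                      # 0.5
print(softmax([2.0, 1.0, 0.1]))          # three probabilities that sum to 1
print(binary_crossentropy(1, 0.9))       # small loss: confident and correct
print(binary_crossentropy(1, 0.1))       # large loss: confident and wrong
```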

With this, we come to the end of Part 1 of the book.

Chapter 5) Deep learning for computer vision

This chapter deals with computer vision problems like image classification.

Instead of a model made of fully connected dense layers, a convolutional neural network (CNN, or convnet) is used for computer vision problems.

From the application point of view, you don’t need to understand in depth how a convnet works.

Using a pretrained convnet

Instead of building and training a CNN from scratch, we can use a CNN that has been pretrained on a large dataset such as ImageNet, which contains 1.4 million labelled images across 1,000 classes. We use feature extraction to reuse the features the network has already learned from ImageNet.

Steps to build a strong model for computer vision problems:

  1. Extract features with a pretrained CNN.
  2. Unfreeze some of the output-side layers of the CNN and train them (fine-tuning).
  3. Use data augmentation to reduce overfitting.
  4. Tune the hyperparameters.

This chapter also visualises what the CNN learns. We can visualise the intermediate activations, which helps in understanding what is going on and in fine-tuning.

Chapter 6) Deep learning for text and sequences

Recurrent neural networks (RNNs) and 1-D convnets are useful when dealing with text or sequence data.

Working with text data

As discussed before, neural networks take only tensors as input, so text data also needs to be converted into vectors before it is given to the neural network. Text can be vectorized in several ways:

  1. Segment text into words, and transform each word into a vector.
  2. Segment text into characters, and transform each character into a vector.
  3. Extract n-grams of words or characters, and transform each n-gram into a vector.

These different segments, which can be words, characters or n-grams, are called tokens. Breaking text into tokens is called tokenisation. The two major ways to associate a token with a vector are one-hot encoding and token embedding.
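
The three kinds of tokens can be illustrated with a few small helpers. This is a deliberately naive sketch (whitespace splitting, no punctuation handling), and the function names are made up for the example.

```python
def word_tokens(text):
    """Split text into lowercase word tokens on whitespace."""
    return text.lower().split()


def char_tokens(text):
    """Split text into individual characters."""
    return list(text)


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = word_tokens("The cat sat on the mat")
print(words)                  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(char_tokens("cat"))     # ['c', 'a', 't']
print(ngrams(words, 2)[:2])   # [('the', 'cat'), ('cat', 'sat')]
```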

One-hot encoding

We can understand it with an example. Suppose we have a text containing 1000 different words, which we number from 1 to 1000. Each word is then represented by a vector of length 1000 that takes the value 1 at the position of that word and 0 at all other positions. Hence, every word has a unique vector.
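
The same idea on a tiny vocabulary can be sketched as below. Unlike the 1 to 1000 numbering in the example above, this sketch numbers words from 0, following Python’s list indexing. The helper names are illustrative.

```python
def build_vocabulary(tokens):
    """Assign each distinct token an index, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def one_hot_word(token, vocab):
    """Vector of vocabulary size with a single 1 at the token's index."""
    vector = [0] * len(vocab)
    vector[vocab[token]] = 1
    return vector


tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = build_vocabulary(tokens)
print(vocab)                        # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(one_hot_word("sat", vocab))   # [0, 0, 1, 0, 0]
```

With a realistic vocabulary these vectors become very long and almost entirely zero, which is the main drawback of one-hot encoding.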

Word embedding

Word embeddings are lower-dimensional, dense vectors whose values are learned from data, much like the weight matrix of any neural network. We can learn the word embeddings with a separate neural network before feeding the data into the RNN, or we can put an embedding layer in front of the RNN.
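
Conceptually, an embedding is a lookup table with one short dense row per word. The sketch below shows only the lookup, with random starting values; in a real model training would adjust them. The vocabulary, the dimension of 3 and the value range are assumptions for illustration.

```python
import random

rng = random.Random(0)
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}   # hypothetical vocabulary
EMBEDDING_DIM = 3   # far smaller than a realistic vocabulary size

# One dense row per word. Values start random; training would adjust them.
embedding_matrix = [
    [rng.uniform(-0.05, 0.05) for _ in range(EMBEDDING_DIM)]
    for _ in vocab
]


def embed(tokens):
    """Look up the dense vector for each token."""
    return [embedding_matrix[vocab[t]] for t in tokens]


vectors = embed(["the", "cat", "sat"])
print(len(vectors), len(vectors[0]))  # 3 3
```

Each word here is described by 3 numbers instead of a vector as long as the vocabulary, which is what makes embeddings compact compared with one-hot vectors.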

Recurrent neural networks and long short-term memory (LSTM)

As I said for CNNs, you don’t need to go deep into the methodology to apply an RNN.

Further on in this chapter, a temperature forecasting problem is solved.

Dataset — jena_climate

You can go through this chapter if you are working on time series data or text sequences.

Chapter 7) Advanced deep-learning best practices

This chapter is mainly for research purposes. It explains advanced network architectures and introduces TensorBoard for analysing and fine-tuning models.

We can build multi-input and multi-output models using the Keras functional API.

This chapter goes deep into deep learning. If you understood all the chapters before this one, you can try it. I didn’t understand this chapter completely, but it exposed me to advanced techniques that I can learn later.

Chapter 8) Generative deep learning

Until now we have seen classical problems such as classification and regression. But deep learning can also be used to generate art: it can write a script, create a painting, generate a song etc.

Text generation with LSTM

Did you read the news that a neural network wrote a script for the next season of Game of Thrones?

Yes, some researchers did it. They used the previous novels as data to generate the next one in the series.

Personally, I find this the most interesting application of deep learning. This section uses an LSTM to build a text generation model. It is interesting to read and to apply.

DeepDream

Google released DeepDream in 2015. It is an artistic image-modification technique that uses the representations learned by CNNs.

Google it. You will see some really cool images.

Neural Style transfer

How about painting something like Van Gogh? You can do this using neural style transfer. It extracts the content from one image and the style from another and combines them into a single image. If you have used the Prisma app, you will recognise the effect.

Again, google it for some cool examples.

After explaining the code for these two cases, this chapter also explains image generation using variational autoencoders, which can be used for image editing and generation. If you are interested in either, you can go through this section.

Generative adversarial networks

It is an advanced type of network for generating fairly realistic images. It requires heavy computational power.

The concept is to make two neural networks work against each other. The generator network produces an image and the adversary (discriminator) network gives feedback on the authenticity of the image. Each network tries to outdo the other, which results in fairly realistic images.

Chapter 9) Conclusions

This chapter gives a brief recap of everything covered in the book. It also discusses the future, limitations and risks of deep learning.

Further reading

arXiv Sanity Preserver
Keras online documentation
Keras source code

Source: Deep Learning on Medium