Book review Part 1 — Deep Learning with Python

Original article was published on Artificial Intelligence on Medium

Book by François Chollet

This series aims to keep a record of the books I read, highlighting the important and interesting points as I go. I hope you will find something useful or interesting.

  1. In classical programming, the paradigm of symbolic AI, humans input rules (a program) and data to be processed according to these rules, and out come answers (see figure 1.2). With machine learning, humans input data as well as the answers expected from the data, and out come the rules. These rules can then be applied to new data to produce original answers.
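A toy contrast between the two paradigms, as a sketch only. The temperature data, the threshold rule, and the function names are made up for illustration and do not come from the book.

```python
# Classical programming: a human writes the rule.
def rule_based_is_hot(temperature_c):
    return temperature_c >= 30  # hand-picked threshold (an assumption)


# Machine learning: supply data plus expected answers, then search for the rule.
def learn_threshold(samples, answers):
    """Return the candidate threshold that classifies the most samples correctly."""
    best_threshold, best_correct = None, -1
    for candidate in sorted(samples):
        correct = sum((x >= candidate) == y for x, y in zip(samples, answers))
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


temperatures = [12, 18, 25, 31, 35, 40]            # data
labels = [False, False, False, True, True, True]   # expected answers

learned = learn_threshold(temperatures, labels)    # the "rule" that comes out


def learned_is_hot(temperature_c):
    # Apply the learned rule to new data.
    return temperature_c >= learned


print(learned, learned_is_hot(33))
```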

2. Learning, in the context of machine learning, describes an automatic search process for better representations.

So that’s what machine learning is, technically: searching for useful representations of some input data, within a predefined space of possibilities, using guidance from a feedback signal.

You can think of a deep network as a multistage information-distillation operation, where information goes through successive filters and comes out increasingly purified.

So that’s what deep learning is, technically: a multistage way to learn data representations.

3. What makes deep learning different:

a. it offered better performance on many problems.

b. it completely automates what used to be the most crucial step in a machine learning workflow: feature engineering.

4. These are the two essential characteristics of how deep learning learns from data: the incremental, layer-by-layer way in which increasingly complex representations are developed, and the fact that these intermediate incremental representations are learned jointly, each layer being updated to follow both the representational needs of the layer above and the needs of the layer below.

5. A first example: a small dense network that classifies MNIST digits with Keras.

import keras
from keras import models
from keras import layers
from keras.utils import to_categorical

from keras.datasets import mnist

# Load the data: 60,000 training and 10,000 test images of 28x28 pixels
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

train_labels[:20]  # peek at the first 20 integer labels

# Two fully connected layers; the last one outputs 10 class probabilities
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28*28,)))
network.add(layers.Dense(10, activation='softmax'))

network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

# Flatten each image to a vector of 784 values and scale pixels to [0, 1]
train_images = train_images.reshape((60000, 28*28))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28*28))
test_images = test_images.astype('float32') / 255

# One-hot encode the integer labels
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

network.fit(train_images, train_labels, epochs=5, batch_size=128)
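The `to_categorical` calls in the listing turn integer labels into one-hot vectors, which is the format `categorical_crossentropy` expects. Here is a pure-Python sketch of that encoding. The helper `one_hot` and the sample labels are hypothetical stand-ins, not the Keras implementation.

```python
def one_hot(labels, num_classes):
    """Encode integer class labels as one-hot rows (illustrative stand-in)."""
    encoded = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # mark the position of the true class
        encoded.append(row)
    return encoded


encoded = one_hot([5, 0, 4], num_classes=10)  # sample labels, chosen for illustration
for row in encoded:
    print(row)
```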

I learned a lot simply by pressing Shift + Tab and then clicking the ^ at the top right, which shows the docstring. It is very informative.

For example, for

network.add(layers.Dense(512, activation='relu', input_shape=(28*28,)))

the docstring gives:

activation: Activation function to use
(see [activations](../activations.md)).
If you don't specify anything, no activation is applied
(ie. "linear" activation: `a(x) = x`).

If you go to https://keras.io/activations/, you can read about every activation function on one page. Even better, you can pass a backend function directly:

from keras import backend as K
model.add(Dense(64, activation=K.tanh))

This is a big advantage, as you can take full advantage of TensorFlow as well.
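To make the docstring concrete, here is a pure-Python sketch of the three activations mentioned in these notes: the default "linear" one, `relu`, and `softmax`. These are the textbook definitions written by hand for illustration, not the Keras source.

```python
import math


def linear(x):
    # What you get when no activation is specified: a(x) = x
    return x


def relu(x):
    # Negative inputs are clipped to zero
    return max(0.0, x)


def softmax(values):
    # Subtract the max first for numerical stability, then normalize
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


pre_activations = [-2.0, 0.0, 3.0]  # made-up layer outputs
relu_out = [relu(v) for v in pre_activations]
probs = softmax(pre_activations)

print([linear(v) for v in pre_activations])
print(relu_out)
print(probs, sum(probs))
```

The softmax outputs are all positive and sum to 1, which is why the last layer of the network can be read as class probabilities.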

6. Definitions

A tensor is a container for data.

Uncrumpling paper balls is what machine learning is about: finding neat representations for complex, highly folded data manifolds.

A smooth function is one whose curve doesn’t have any abrupt angles.

Continuous function means a small change in x can only result in a small change in y.

Differentiable means “can be derived”: for example, smooth, continuous functions can be derived.

Stochastic is a scientific synonym of random.

Weight tensors, which are attributes of the layers, are where the knowledge of the network persists.

7. Questions:

P48 When talking about the weight update, it says: “with a function f(W) of a tensor, you can reduce f(W) by moving W in the opposite direction from the gradient: for example, W1 = W0 - step * gradient(f)(W0) (where step is a small scaling factor).” Here, “opposite direction” is a bit confusing.
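One way to read "opposite direction": the gradient points toward where f increases fastest, so subtracting a small multiple of it moves W downhill. Here is a one-dimensional sketch. The function f, its minimum at 3, and the step size are assumptions chosen for illustration.

```python
def f(w):
    # A simple bowl-shaped function with its minimum at w = 3
    return (w - 3.0) ** 2


def gradient_f(w):
    # Derivative of f with respect to w
    return 2.0 * (w - 3.0)


step = 0.1
w0 = 0.0

# The gradient at w0 is negative, so "minus the gradient" moves w to the right,
# toward the minimum.
w1 = w0 - step * gradient_f(w0)

# Moving along the gradient instead goes uphill.
w_wrong = w0 + step * gradient_f(w0)

print(f(w0), f(w1), f(w_wrong))
```

The printed values show f dropping after the update W1 = W0 - step * gradient and rising when the sign is flipped.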

P54 For the simple network in item 5, it claims that “After these 5 epochs, the network will have performed 2,345 gradient updates (469 per epoch)”. I expected a number related to the weights, something like 28*28*512 + 512*10, so I am not sure where the 469 comes from.
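A likely reading of those numbers: 469 counts mini-batches per epoch, not weights, because each batch triggers one gradient update. The arithmetic below uses the sample count, batch size, and epochs from the listing in item 5. The parameter count is a separate quantity, shown for comparison, and it includes the bias terms.

```python
import math

num_samples = 60000   # training images
batch_size = 128
epochs = 5

# One gradient update per mini-batch; the last batch of an epoch is smaller.
updates_per_epoch = math.ceil(num_samples / batch_size)
total_updates = updates_per_epoch * epochs

# Trainable parameters: weights plus biases for each Dense layer.
dense_1_params = 28 * 28 * 512 + 512
dense_2_params = 512 * 10 + 10
total_params = dense_1_params + dense_2_params

print(updates_per_epoch, total_updates)   # 469 2345
print(total_params)                       # 407050
```

So every one of the 2,345 updates adjusts all of the roughly 407,000 parameters at once.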

8. p73 Crossentropy is usually the best choice when you’re dealing with models that output probabilities. It measures the distance between probability distributions or, in this case, between the ground-truth distribution and your predictions.
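A small sketch of categorical crossentropy for a single sample may make "distance between distributions" more concrete. The function below is the standard formula written by hand. The probability vectors and the `eps` guard are illustrative assumptions.

```python
import math


def categorical_crossentropy(true_dist, predicted_dist, eps=1e-12):
    """Crossentropy between a ground-truth and a predicted distribution."""
    # eps guards against log(0) when a predicted probability is exactly zero
    return -sum(t * math.log(max(p, eps)) for t, p in zip(true_dist, predicted_dist))


truth = [0.0, 1.0, 0.0]           # one-hot ground truth: class 1
confident = [0.05, 0.90, 0.05]    # prediction close to the truth
unsure = [0.40, 0.30, 0.30]       # prediction far from the truth

loss_confident = categorical_crossentropy(truth, confident)
loss_unsure = categorical_crossentropy(truth, unsure)

print(loss_confident, loss_unsure)
```

The closer the predicted distribution is to the ground truth, the smaller the loss.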

p79 In a stack of Dense layers like that you’ve been using, each layer can only access information present in the output of the previous layer. If one layer drops some information relevant to the classification problem, this information can never be recovered by later layers: each layer can potentially become an information bottleneck.

p86 You should never use in your workflow any quantity computed on the test data, even for something as simple as data normalization.
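In practice this means the mean and standard deviation used for normalization come from the training data only and are then reused on the test data. Here is a minimal sketch with made-up numbers, using the standard library in place of NumPy.

```python
import statistics

train = [2.0, 4.0, 6.0, 8.0]   # illustrative feature values
test = [5.0, 10.0]

# Statistics are computed on the training data only.
mean = statistics.mean(train)
std = statistics.pstdev(train)

train_norm = [(x - mean) / std for x in train]
# The test data reuses the training mean and std; nothing is computed on it.
test_norm = [(x - mean) / std for x in test]

print(train_norm)
print(test_norm)
```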

In general, the less training data you have, the worse overfitting will be, and using a small network is one way to mitigate overfitting.

p87 Because you have so few data points, the validation set would end up being very small (for instance, about 100 examples). As a consequence, the validation scores might change a lot depending on which data points you chose to use for validation and which you chose for training: the validation scores might have a high variance with regard to the validation split. This would prevent you from reliably evaluating your model.
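A common remedy for a noisy validation score is K-fold cross-validation: split the data into K partitions, validate on each in turn while training on the rest, and average the K scores. The sketch below only builds the index splits. The helper name `k_fold_indices` and the sizes are assumptions for illustration.

```python
def k_fold_indices(num_samples, k):
    """Return a list of (train_indices, validation_indices) pairs, one per fold."""
    fold_size = num_samples // k
    folds = []
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        validation = list(range(start, stop))
        # Everything outside the validation slice is used for training.
        # Leftover samples (when num_samples is not divisible by k) stay in training.
        training = list(range(0, start)) + list(range(stop, num_samples))
        folds.append((training, validation))
    return folds


folds = k_fold_indices(num_samples=10, k=5)
for train_idx, val_idx in folds:
    print(val_idx, len(train_idx))
```

Each sample is used for validation exactly once, so the averaged score depends much less on any single split.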

p97 You may ask, why not have two sets: a training set and a test set? You’d train on the training data and evaluate on the test data. Much simpler!
The reason is that developing a model always involves tuning its configuration: for example, choosing the number of layers or the size of the layers (called the hyperparameters of the model, to distinguish them from the parameters, which are the network’s weights). You do this tuning by using as a feedback signal the performance of the model on the validation data. In essence, this tuning is a form of learning: a search for a good configuration in some parameter space. As a result, tuning the configuration of the model based on its performance on the validation set can quickly result in overfitting to the validation set, even though your model is never directly trained on it.

p100 You usually should randomly shuffle your data before splitting it into training and test sets.
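Shuffling matters when the data is stored in some order, for example sorted by class: an unshuffled split could then leave a class out of the training or test set. A minimal sketch follows. The helper `shuffled_split`, the fixed seed, and the toy samples are assumptions for illustration.

```python
import random


def shuffled_split(samples, test_fraction, seed=0):
    """Shuffle a copy of the samples, then split it into (train, test)."""
    rng = random.Random(seed)    # fixed seed so the split is reproducible
    shuffled = samples[:]        # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    num_test = int(len(shuffled) * test_fraction)
    return shuffled[num_test:], shuffled[:num_test]


# Toy dataset stored in class order: five samples of class 0, then five of class 1
ordered = [("sample%d" % i, i // 5) for i in range(10)]

train, test = shuffled_split(ordered, test_fraction=0.2)
print(train)
print(test)
```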

p101 To make learning easier for your network, your data should have the following characteristics:
1. Take small values: typically, most values should be in the 0–1 range.
2. Be homogenous: that is, all features should take values in roughly the same range.
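Min-max scaling is one simple way to get both properties at once. This is a sketch only: the helper `min_max_scale` and the two made-up features are illustrative, and in a real workflow the min and max would be taken from the training data.

```python
def min_max_scale(column):
    """Rescale a list of numbers linearly into the 0-1 range."""
    lowest, highest = min(column), max(column)
    if highest == lowest:
        return [0.0 for _ in column]  # constant feature: nothing to scale
    return [(value - lowest) / (highest - lowest) for value in column]


# Two features on very different scales (illustrative values)
ages = [20, 35, 50]
incomes = [20000, 50000, 80000]

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)

print(scaled_ages)
print(scaled_incomes)
```

After scaling, both features take small values in the same range, so neither dominates just because of its units.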

That’s the essence of feature engineering: making a problem easier by expressing it in a simpler way. It usually requires understanding the problem in depth.

  1. Good features still allow you to solve problems more elegantly while using fewer resources.
  2. Good features let you solve a problem with far less data. If you have only a few samples, then the information value in their features becomes critical.

p104 The fundamental issue in machine learning is the tension between optimization and generalization. Optimization refers to the process of adjusting a model to get the best performance possible on the training data (the learning in machine learning), whereas generalization refers to how well the trained model performs on data it has never seen before. The goal of the game is to get good generalization, of course, but you don’t control generalization; you can only adjust the model based on its training data.

The number of learnable parameters in a model is often referred to as the model’s capacity. Intuitively, a model with more parameters has more memorization capacity and therefore can easily learn a perfect dictionary-like mapping between training samples and their targets — a mapping without any generalization power.
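The "dictionary-like mapping" can be taken literally in a toy example: a lookup table is perfect on the training samples and useless on anything new. The data and the rule y = 2x are made up for illustration.

```python
train_samples = [1, 2, 3, 4]
train_targets = [2, 4, 6, 8]   # generated by the illustrative rule y = 2x

# "Model" with enough capacity to memorize every training pair
memorized = dict(zip(train_samples, train_targets))


def memorizing_model(x):
    return memorized.get(x)    # returns None for inputs it has never seen


def simple_model(x):
    return 2 * x               # one learned parameter captures the pattern


train_accuracy = sum(
    memorizing_model(x) == y for x, y in zip(train_samples, train_targets)
) / len(train_samples)

print(train_accuracy, memorizing_model(5), simple_model(5))
```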

p105 Always keep this in mind: deep learning models tend to be good at fitting to the training data, but the real challenge is generalization, not fitting.