Best Practices: Advanced Deep Learning with Keras

Source: Deep Learning on Medium

Credit: Somendra P

This post explores several tools that bring us closer to developing state-of-the-art models for difficult problems. Using the Keras functional API, we can build graph-like models and share layers across different inputs. Keras callbacks and the TensorBoard visualization tool let us monitor a model during training.

Keras Functional API: Beyond the sequential model

The sequential model is very common when building and training neural networks. It has exactly one input and one output and consists of a linear stack of layers.

Sequential Model

Some scenarios require multimodal input, where we merge data from different sources and process it in a single neural network.

Additionally, many recently developed neural network architectures require a nonlinear topology. There are three important use cases: multi-input models, multi-output models, and graph-like models.


In the functional API, we deal directly with tensors and use layers as functions that take a tensor and return a tensor.

The only part of the code that may seem surprising is the Model object. Here, Model is instantiated from an input tensor and an output tensor. Behind the scenes, Keras retrieves all the layers involved in going from the input tensor to the output tensor and brings them together into a graph-like data structure.
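As a minimal sketch of the idea described above, the following builds a small model with the functional API. The layer sizes and the 64-dimensional input are arbitrary choices for illustration, not values from the original post.

```python
import keras
from keras import layers

# An input tensor: batches of 64-dimensional vectors (size chosen for illustration).
input_tensor = keras.Input(shape=(64,))

# Each layer is called like a function: it takes a tensor and returns a tensor.
x = layers.Dense(32, activation="relu")(input_tensor)
x = layers.Dense(32, activation="relu")(x)
output_tensor = layers.Dense(10, activation="softmax")(x)

# Model gathers every layer on the path from input_tensor to output_tensor.
model = keras.Model(input_tensor, output_tensor)
model.summary()
```

Once built, the model is compiled and trained in the same way as a sequential model.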

Multi-input Models:

The functional API can be used to build models that have multiple inputs. Such models at some point merge their input branches, typically by adding or concatenating them.

This is usually done with a Keras merge operation such as keras.layers.add or keras.layers.concatenate.
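A two-input model might look like the sketch below. The input names, feature sizes, and the sigmoid output are hypothetical and chosen only to show where the merge happens.

```python
import keras
from keras import layers

# Two hypothetical data sources with different feature counts.
input_a = keras.Input(shape=(16,), name="source_a")
input_b = keras.Input(shape=(8,), name="source_b")

# Each input gets its own branch; both end with 32 features.
branch_a = layers.Dense(32, activation="relu")(input_a)
branch_b = layers.Dense(32, activation="relu")(input_b)

# Concatenation joins the branches along the last axis: 32 + 32 = 64 features.
merged = layers.concatenate([branch_a, branch_b], axis=-1)

# Addition is the alternative merge; it needs equal shapes and keeps 32 features.
summed = layers.add([branch_a, branch_b])

output = layers.Dense(1, activation="sigmoid")(merged)
model = keras.Model([input_a, input_b], output)
```

Concatenation keeps the two representations side by side, while addition blends them element-wise, so the choice depends on whether the branches live in a comparable feature space.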

Multi-output models:

In the same way, the functional API can be used to build models with multiple outputs.

Importantly, training such a model requires the ability to specify different loss functions for different heads of the network. For instance, one head might perform regression and another binary classification, and these require different training procedures. But because gradient descent requires us to minimize a scalar, we must combine these losses into a single value in order to train the model. The simplest way to combine different losses is to sum them. In Keras, we can pass either a list or a dictionary of losses to compile to specify a different loss for each output; the resulting loss values are summed into a global loss, which is minimized during training.

Note that very imbalanced loss contributions cause the model's representations to be optimized preferentially for the task with the largest individual loss, at the expense of the other tasks. To remedy this, we can assign different levels of importance to the loss values in their contribution to the final loss.
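The sketch below shows one way to attach a different loss to each head. The head names ("age" and "gender"), the 20-feature input, and the optimizer are assumptions made for illustration.

```python
import keras
from keras import layers

features = keras.Input(shape=(20,), name="features")
shared = layers.Dense(64, activation="relu")(features)

# Two heads on a shared representation (names are hypothetical).
age = layers.Dense(1, name="age")(shared)  # regression head
gender = layers.Dense(1, activation="sigmoid", name="gender")(shared)  # binary head

model = keras.Model(features, [age, gender])

# The dictionary keys must match the names of the output layers.
# Keras sums the per-output losses into the single scalar it minimizes.
model.compile(
    optimizer="rmsprop",
    loss={"age": "mse", "gender": "binary_crossentropy"},
)
```

A list such as loss=["mse", "binary_crossentropy"] would also work, matched to the outputs by position; the dictionary form is less error-prone once outputs are named.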
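In Keras, this weighting is expressed through the loss_weights argument of compile. The sketch below uses hypothetical head names and weights, and the loss magnitudes in the arithmetic at the end are made up purely to show the effect.

```python
import keras
from keras import layers

features = keras.Input(shape=(20,), name="features")
shared = layers.Dense(64, activation="relu")(features)
age = layers.Dense(1, name="age")(shared)
gender = layers.Dense(1, activation="sigmoid", name="gender")(shared)
model = keras.Model(features, [age, gender])

# Scale down the large regression loss and scale up the small classification loss.
model.compile(
    optimizer="rmsprop",
    loss={"age": "mse", "gender": "binary_crossentropy"},
    loss_weights={"age": 0.25, "gender": 10.0},
)

# Illustrative magnitudes only (not measured values).
mse_loss, bce_loss = 4.0, 0.5
unweighted_total = mse_loss + bce_loss               # regression dominates
weighted_total = 0.25 * mse_loss + 10.0 * bce_loss   # classification now counts
```

With the unweighted sum, the regression term accounts for most of the total; after weighting, the classification term contributes the larger share.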

Directed Acyclic Graph of Layers:

With the functional API, we can also implement networks with a complex internal topology. Neural networks in Keras can be arbitrary directed acyclic graphs of layers. The qualifier acyclic is important: these graphs can’t have cycles. It is impossible for a tensor x to become the input of one of the layers that generated x. The only processing loops that are allowed (i.e., recurrent connections) are those internal to recurrent layers.

Several common neural-network components are implemented as graphs:

1. Inception Modules

2. Residual connections.

Inception Modules:

The Inception architecture was developed by Christian Szegedy and his colleagues at Google in 2013–14. It consists of small independent modules, each split into parallel branches. The most basic form of an Inception module has three or four branches, starting with a 1×1 convolution, followed by a 3×3 convolution, and ending with the concatenation of the resulting features.

This setup helps the network separately learn spatial features and channel-wise features, which is more efficient than learning them jointly.
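A simplified three-branch module in this spirit can be sketched as follows. The filter counts, input shape, and the pooling branch are illustrative assumptions, not the exact configuration of any published Inception network.

```python
import keras
from keras import layers

# A hypothetical feature map: 32x32 spatial grid with 64 channels.
x = keras.Input(shape=(32, 32, 64))

# Branch A: a 1x1 convolution only (mixes channels, no spatial context).
branch_a = layers.Conv2D(32, 1, padding="same", activation="relu")(x)

# Branch B: 1x1 convolution followed by a 3x3 spatial convolution.
branch_b = layers.Conv2D(32, 1, padding="same", activation="relu")(x)
branch_b = layers.Conv2D(32, 3, padding="same", activation="relu")(branch_b)

# Branch C: average pooling followed by a 1x1 convolution.
branch_c = layers.AveragePooling2D(3, strides=1, padding="same")(x)
branch_c = layers.Conv2D(32, 1, padding="same", activation="relu")(branch_c)

# "same" padding keeps every branch at 32x32, so they can be concatenated
# along the channel axis: 32 + 32 + 32 = 96 channels.
output = layers.concatenate([branch_a, branch_b, branch_c], axis=-1)
module = keras.Model(x, output)
```

The 1×1 convolutions handle channel-wise mixing, while the 3×3 convolution handles the spatial part, which is the separation described above.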

Residual Connection:

Residual connections were introduced by Microsoft researchers in 2015. They address two major problems in deep learning models: vanishing gradients and representational bottlenecks. In general, adding residual connections to any model that has more than 10 layers is likely to be beneficial.

A residual connection makes the output of an earlier layer available as input to a later layer, effectively creating a shortcut in a sequential network. Rather than being concatenated to the later activation, the earlier output is summed with it, which assumes that both activations have the same shape. If their shapes differ, we can use a linear transformation to reshape the earlier activation into the target shape.
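Both cases can be sketched as below. The input shape and filter counts are arbitrary, and using a 1×1 convolution without activation as the linear transformation is one common choice, assumed here for illustration.

```python
import keras
from keras import layers

x = keras.Input(shape=(32, 32, 64))

# Case 1: shapes match, so the earlier tensor is added back directly.
y = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
y = layers.Conv2D(64, 3, padding="same", activation="relu")(y)
same_shape_out = layers.add([y, x])

# Case 2: the later tensor has more channels and half the resolution.
z = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
z = layers.MaxPooling2D(2, strides=2)(z)

# A strided 1x1 convolution with no activation acts as a linear projection
# that maps x to the shape of z (16x16 grid, 128 channels).
shortcut = layers.Conv2D(128, 1, strides=2, padding="same")(x)
diff_shape_out = layers.add([z, shortcut])
```

In the second case the shortcut is no longer an identity mapping, but it still gives gradients a short linear path around the block.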