Building a deep learning model to judge if you are at risk.

Original article was published on Deep Learning on Medium


Predict vehicle collisions moments before they happen using CNN+LSTMs and Carla!


The project combines CNNs and LSTMs to predict whether a vehicle is on a collision course, using a series of images from the moments before the crash. CNNs are good at image understanding, but without the sequential relation between images we miss out on the temporal information needed to predict how a series of events can cause an incident.


The post assumes you have a basic understanding of CNNs and LSTMs. You don’t have to read the entire thing: this story is more about explaining the challenges faced and the various experiments and optimizations done in building the project, so you can be selective about what you read. Treat it as a starting point or a guide for solving the problem you are facing. Not all knowledge is useful.

Show-me-the-code gang, go here.


A solid simulation environment is needed to collect data. Carla is driving-simulation software that provides environment-level control. Since we need to replay the moments leading up to an accident, it helps that Carla raises red flags whenever a violation is made or an accident is caused. It also lets us use different kinds of agents, from naive to expert ones. This makes Carla a good choice for collecting data. In addition, we get information about the vehicle, climate, street objects, traffic levels, speed and more, which can be vital in complex systems.

The project is entirely built on Python and TensorFlow Keras.

The project can be divided into three stages: Collecting Data, Creating the appropriate Network and Training.

1 Data Collection

1.1 Carla Challenges

Carla is graphics-intensive software and, surprisingly, it did not have a headless mode (on Windows), because of which the running system got really slow after a certain point and crashed often.

  • Carla’s graphics settings were changed to the lowest possible resolution via its configuration files.
  • The frame rate was reduced to 12 fps to decrease the load on the machine, which helped the program collect the entirety of the data without crashes.

1.2 Custom Scripts

Carla has Python APIs that let you create custom agents to drive around. It also has expert agents that run through the map perfectly.

  • Firstly, we have a naive agent that drives the car around the city and takes a photo every 4 frames. We use a naive agent so that we can capture more accidents and violations.
  • Once an accident or a violation occurs, Carla raises a red flag, and from that time step we can take the past 15 images of the episode. Looking at the series of images, each image brings us closer to the accident as time moves forward.
  • Collecting the data for uniform driving is easy: we can just use the autopilot API provided by Carla to drive around and take pictures at the same capture rate for consistency.
  • When a collision occurs, the program keeps only the last 15 time steps and automatically deletes all previous images of the episode to avoid data overflow.
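The “keep only the last 15 time steps” logic above can be sketched with a fixed-length buffer. The callbacks and frame contents here are hypothetical stand-ins, not the project’s actual Carla hooks:

```python
from collections import deque

# Rolling buffer: holds at most the 15 most recent captured frames,
# silently discarding older ones (this avoids the data overflow above).
frame_buffer = deque(maxlen=15)

def on_frame_captured(frame):
    """Hypothetical hook, called with the camera image every 4th frame."""
    frame_buffer.append(frame)

def on_collision():
    """Hypothetical hook for Carla's red flag: snapshot the buffer."""
    return list(frame_buffer)  # the 15 images leading up to the crash

# Simulate 40 captured frames, then a collision.
for i in range(40):
    on_frame_captured(f"frame_{i}")
sample = on_collision()
# sample holds the last 15 frames: frame_25 .. frame_39
```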

1.3 Handling the Data

The collected data had a unique structure and was hard to handle. The total amount of data collected was around 40 GB, which was hard to move around or load into memory, not only because of its size but because of the number of files (210,000). Each sample had 15 time-step images before the incident, each at a resolution of 420 x 280 pixels.

  • The NumPy binary format is used with an integer datatype for efficient storage.
  • The required number of time steps was reduced from 15 to 8, and the resolution to 210 x 140.
  • The images are stored in batches of 8 episodes per file to reduce the number of files and speed up reading.
  • Together, these measures brought the overall data size down to 8 GB and made the logistics faster.
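The batching scheme above can be sketched as follows; the exact layout and helper names are assumptions (the post specifies only 8 episodes per file, 8 time steps, 210 x 140 images, and an integer datatype, taken here as uint8):

```python
import numpy as np

TIME_STEPS, H, W, C = 8, 140, 210, 3   # 8 frames of 210x140 RGB per episode
EPISODES_PER_FILE = 8                  # batch episodes to cut down file count

def pack_episodes(episodes, path):
    """Stack 8 episodes into one (8, 8, 140, 210, 3) uint8 array and save it."""
    batch = np.stack(episodes).astype(np.uint8)  # 1 byte per pixel channel
    np.save(path, batch)

def load_episodes(path):
    """Load one file back as a single array of 8 episodes."""
    return np.load(path)

# Example with random stand-in images:
eps = [np.random.randint(0, 256, (TIME_STEPS, H, W, C))
       for _ in range(EPISODES_PER_FILE)]
pack_episodes(eps, "batch_000.npy")
restored = load_episodes("batch_000.npy")
# restored.shape == (8, 8, 140, 210, 3), dtype uint8
```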

A total of 7000 (episodes) x 2 (classes) = 14000 samples were collected, each containing 8 images, for training the network. Data augmentation is not applied because of the nature of the environment. The samples were collected across the various towns and environmental conditions available in Carla to make the model robust across different conditions.

2 Model Architecture

It is well known that RNNs work best for problems involving sequences or time series, with plain RNN units replaced by LSTMs for the obvious reasons (vanishing gradients). Encoding the given information as a time series works well only for naive data like number sequences; for complex data forms we need a richer embedding method.

  • For images, we can use Convolutional Neural Networks. CNNs are a proven method for extracting spatial information, and we use some standard CNN architectures to extract the features. For each image in the series, we get a feature vector of a fixed size, which is passed into the corresponding LSTM time-step cell.
  • But we don’t need a separate CNN for each image. 3D CNNs can handle this kind of data, but they are better suited to learning spatial correlation than temporal correlation.
  • Instead, we can wrap our CNN layers in the TimeDistributed wrapper to spread them across time steps. This works better than 3D CNNs for extracting the temporal correlation, because a single shared CNN learns the features from the images across time steps.
  • Once these features are extracted as embedding vectors, they are passed into their respective LSTM cells at each time step.
  • These embeddings are then taken through the encoder and passed to the fully connected layers, which learn the classification task.
  • The figure shows the initially proposed network; for the convolutional block, a full VGG network is used to get the image embeddings, distributed in time to perform the convolution at each time step. Though the network seemed logically correct initially, it went through many corrections to achieve the results shown.
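The layout described above can be sketched in Keras as follows. Filter counts, embedding sizes and unit counts are illustrative placeholders, not the tuned values from the post:

```python
from tensorflow.keras import layers, models

TIME_STEPS, H, W, C = 8, 140, 210, 3

# A small VGG-style convolutional block, shared across all time steps.
cnn = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", padding="same",
                  input_shape=(H, W, C)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.GlobalAveragePooling2D(),  # fixed-size feature vector per frame
])

model = models.Sequential([
    # One CNN applied to each of the 8 frames via TimeDistributed.
    layers.TimeDistributed(cnn, input_shape=(TIME_STEPS, H, W, C)),
    layers.LSTM(64),                        # temporal encoding of the 8 embeddings
    layers.Dense(32, activation="relu"),
    layers.Dense(2, activation="softmax"),  # collision vs. uniform driving
])
```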

2.1 Training Issues

  • Training caused memory-exhaustion errors even though the network is relatively small, because the network is distributed in time, which blows up the memory footprint at runtime.
  • Sometimes training would run for a while and then stop after reaching an upper limit; this occurs due to the data size and the time-distribution operation happening in each layer of the CNN.
  • So the batch size was reduced, and network parameters such as filter sizes and embedding sizes were carefully adjusted after a lot of experimentation.

2.2 Overfitting

  • Overfitting has always been the Achilles’ heel of deep learning, so initially I was not surprised that the model overfitted. The network immediately overfitted the given data, reaching 100 per cent accuracy even in epoch 1. Obviously, the results on the test data were really bad.
  • So I went through the standard anti-overfitting procedures: checking the data, checking for biases, adding batch normalization, etc.
  • But the problem, as it turned out, was unique to this kind of architecture. The network overfitted because, for each global network update step, the CNN part of the network is updated 8 times (once per time step).
  • This means the CNN part of the network was learning too much information about the images, while the downstream LSTM layers were not able to keep up with it.
  • The solution was to reduce the size of the VGG network. The CNN was cut almost in half, and the filters were carefully chosen after repeated experimentation with various filter counts and combinations.
  • Also, using two fully connected layers with dropout instead of the proposed single layer gave significant performance improvements, with an accuracy of 65–71%.

2.3 Learning Complex Time Functions

  • Increasing the number of LSTM units in the LSTM layer improved performance a bit, but beyond that point adding more units made no difference.
  • Adding a second LSTM layer to the network, however, did help: it enables the network to learn a more complex time function, and accuracy improved to 86%.
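Stacking a second LSTM layer in Keras only requires the first layer to emit its full output sequence. The unit counts and embedding size here are illustrative, not the post’s tuned values:

```python
from tensorflow.keras import layers, models

# Two stacked LSTM layers over 8 time steps of 128-d frame embeddings.
temporal_head = models.Sequential([
    # return_sequences=True: emit one vector per time step for the next LSTM.
    layers.LSTM(64, return_sequences=True, input_shape=(8, 128)),
    layers.LSTM(64),  # final state summarizes the whole sequence
])
```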

2.4 A Better CNN

  • The VGG layers worked well for the network so far, but it is possible to improve performance by adding smarter layers in place of the naive VGG layers.
  • To implement this idea, a couple of Inception modules with carefully chosen filter counts were employed in place of the VGG layers.
  • It is important to note that ResNet modules were not chosen, because the issue is not a forgetful network; we just needed a better feature extractor.
  • A modification of the fully connected layers was also required to prevent overfitting. This worked amazingly well, leaving the network with a final training accuracy of 93%.
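A minimal sketch of an Inception-style module in the Keras functional API; the branch filter counts are placeholders, not the carefully chosen values from the post:

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1, f3, f5, fpool):
    """Parallel 1x1, 3x3, 5x5 and pooled branches, concatenated on channels."""
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(f5, 5, padding="same", activation="relu")(x)
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(fpool, 1, padding="same", activation="relu")(bp)
    return layers.Concatenate()([b1, b3, b5, bp])

inputs = tf.keras.Input(shape=(140, 210, 3))
x = inception_module(inputs, 16, 32, 8, 8)  # 16+32+8+8 = 64 output channels
feature_extractor = tf.keras.Model(inputs, x)
```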

3 Training

The training process is similar to standard classification tasks: we use softmax cross-entropy as the loss and minimize it with the Adam optimizer.

  • The dimension of the input is (batch size, time, channels, width, height), compared to the traditional (batch size, channels, width, height). LSTMs in general take more time to train than CNNs, and since we are training both, we can expect training to be quite time-consuming for a classification task.
  • It is a decently large network with around 14 million parameters, and it required a good GPU for training; without one, it ran into memory-exhaustion errors. Luckily, I had a GTX 1080 Ti machine, on which training was done within 3 hours. The training and testing metrics are presented in the figures.
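The training setup described above can be sketched as follows. The tiny stand-in network and random data are for illustration only (the real inputs are the 8-step, 210 x 140 image sequences from the collected dataset):

```python
import numpy as np
from tensorflow.keras import layers, models

# Tiny stand-in network with a (batch, time, height, width, channels) input.
model = models.Sequential([
    layers.TimeDistributed(layers.Flatten(), input_shape=(8, 16, 16, 3)),
    layers.LSTM(16),
    layers.Dense(2, activation="softmax"),
])

# Softmax cross-entropy loss minimized with the Adam optimizer, as in the post.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Random placeholder data in place of the real Carla episodes.
x = np.random.rand(4, 8, 16, 16, 3).astype("float32")
y = np.array([0, 1, 0, 1])  # 0 = uniform driving, 1 = collision course
history = model.fit(x, y, epochs=1, verbose=0)
```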