Traffic Sign Detection and Classification through CNN

Original article was published on Deep Learning on Medium

Autonomous cars must make real-time decisions based on their perception of the surroundings. CNN classifier accuracy must be close to 100%, because one wrong classification can cause loss of life and property. I recently built a CNN from scratch to detect and classify traffic signs using open source data.

Problem Statement: Detect and classify traffic signs

Methodology: Build, visualize and refine CNNs


  • Required: TensorFlow 2.x, Keras, Python, Jupyter
  • Recommended: Nvidia GPU, Kaggle (for downloading dataset)
  • Python Libs: sklearn, pandas, math, numpy, cv2, itertools, seaborn

Major Steps:

  • Understand problem space, autonomous driving use-cases, research papers, available datasets, technologies available
  • Analyze, prepare, visualize data set and design software
  • Build CNN model, evaluate, test, predict, visualize results
  • Evaluate next steps with dataset, model, experiments, visualization, etc


GTSRB — German Traffic Sign Recognition Benchmark version at Kaggle.


· Consistently achieving over 99% validation accuracy and around 97–98% test accuracy.

· Identified specific cases for further improvement (dark, reflective or partially hidden signs) and the need for more labeled samples or augmentation for under-represented classes.


· ImageDataGenerator, flow_from_dataframe, etc, to pre-process, standardize, normalize, manipulate, augment, etc.

· The iterator returns one batch of images and labels on demand, without filling RAM or disk.
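The on-demand batching idea can be sketched in plain Python. This is only an illustration of the concept, not how ImageDataGenerator is implemented; the function name `batch_iterator` and the file names are hypothetical.

```python
from typing import Iterator, List, Tuple


def batch_iterator(paths: List[str], labels: List[str],
                   batch_size: int) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (paths, labels) one batch at a time instead of all at once."""
    for start in range(0, len(paths), batch_size):
        end = start + batch_size
        # A real loader would read and decode the image files here,
        # so only `batch_size` images are in memory at any moment.
        yield paths[start:end], labels[start:end]


# Hypothetical file names and labels, for illustration only.
paths = [f"train/img_{i}.png" for i in range(10)]
labels = [f"{i % 3:02d}" for i in range(10)]

batches = list(batch_iterator(paths, labels, batch_size=4))
print([len(batch_paths) for batch_paths, _ in batches])
```

The last batch is smaller than the others whenever the number of samples is not a multiple of the batch size.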


· Relatively new techniques, internal nuances in implementation, insufficient documentation or examples

· ImageDataGenerator fails to generate a finite set of tests for model.evaluate in Tensorflow 2.2
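One common way to bound evaluation over an iterator is to pass an explicit number of steps. The sketch below only computes that number; it assumes the fix is a `steps` argument to model.evaluate, which this write-up does not confirm, and the sample count is hypothetical.

```python
import math


def evaluation_steps(num_samples: int, batch_size: int) -> int:
    """Number of batches needed to see every sample exactly once."""
    if num_samples < 0 or batch_size <= 0:
        raise ValueError("num_samples must be >= 0 and batch_size > 0")
    return math.ceil(num_samples / batch_size)


# Hypothetical test-set size, for illustration only.
steps = evaluation_steps(num_samples=1000, batch_size=64)
# The value would then be passed along the lines of
# model.evaluate(test_iterator, steps=steps)  (assumed usage).
print(steps)
```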

Setup Instructions

I have only tested this on a laptop with an Nvidia GPU, running TensorFlow 2.1.0 on Ubuntu Linux.

Theoretically, this should work on Windows or cloud offerings such as AWS where Jupyter notebooks can run, or in Google Colab, but I have not tested on those environments at the time of this writing.

Laptop Setup Instructions:

Follow detailed instructions to set up Linux (Ubuntu 18.04 at the start, and still working after the upgrade to Ubuntu 20.04), the Nvidia driver (version 440, available from the Linux additional driver options), CUDA, cuDNN and TensorFlow.

Install Python 3. A simpler method is to use Anaconda and follow its detailed instructions.

Note that on Linux, you may need to set up the environment and notebook separately with the following instructions:

conda create -n DL2020 python=3.7

conda activate DL2020

conda env list

pip install tensorflow

pip install jupyter

jupyter notebook


>>> import tensorflow as tf


Several versions of the GTSRB (German Traffic Sign Recognition Benchmark) dataset are available in different formats from the benchmark site. I chose the version on Kaggle because of its layout: the metadata is available as CSV files, which can be pulled into pandas DataFrames for analysis while loading few or no images into memory.

After testing that each of the above works, proceed to download the dataset from Kaggle. Extract the files into the folder where you will create your ipynb file using Jupyter. You need to create this folder structure:

Change all file paths in the dataframes to lower case for cleaner coding, and convert all ClassId values to strings, because the data generators for categorical labels require strings or tuples. So that the labels then sort correctly, pad all 1-digit ClassIds with a leading 0.
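The reason for the zero padding is easiest to see on a few values. This is a minimal sketch in plain Python with hypothetical helper names; on a pandas DataFrame the same steps could be done with `.str.lower()` and `.str.zfill(2)` on the relevant columns.

```python
def normalize_path(path: str) -> str:
    """Lower-case a file path so all paths follow one convention."""
    return path.lower()


def normalize_class_id(class_id: int, width: int = 2) -> str:
    """Convert a numeric class id to a zero-padded string label."""
    return str(class_id).zfill(width)


raw_ids = [10, 2, 1]

# Plain strings sort lexicographically, which puts "10" before "2".
unpadded = sorted(str(i) for i in raw_ids)
# Zero-padded strings sort in the same order as the numbers.
padded = sorted(normalize_class_id(i) for i in raw_ids)

print(unpadded)
print(padded)
# Hypothetical path, for illustration only.
print(normalize_path("Train/20/Sample_Image.PNG"))
```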

The meta, train and test CSV files describe 43 classes of signs in various shapes, colors and sizes. While this dataset could be analyzed in grayscale, I chose RGB so that the approach extends to other countries where color may be needed to distinguish signs.

Visualize classes of symbols in this dataset from the metadata

Visualize images


  • Actual images can be dull, grainy, dark, etc
  • Actual signs can be reflective, damaged, vandalized, covered in snow, bird droppings, etc
  • Application in autonomous driving requires near perfect detection & classification

The images range from bright to dark, grainy to clear, and so on. However, I’ll not further preprocess the images for now because I don’t want to learn for example only how the traffic signs would appear on a sunny day or in the evening or close up or far away. Let’s revisit preprocessing if I need to later. Also leaving all images in color rather than grayscale to keep the model robust enough to later on add classes that may depend on the color of the traffic sign.

Image Classes Vary on Width & Height:


  • Image width & height vary, and cluster differently per class
  • Chose 50 × 50 for the study because the visual median sits around that point, although per-class clusters vary from the 40s to the 70s

Visual inspection suggests that width and height generally cluster around 50, while specific classes sit toward the extremes: for example, class 14 (Stop) averages at the high end and class 17 (No Entry) at the low end. I will note these, but for now keep the image width and height at 50.
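The visual inspection could be backed up with per-class medians. The sketch below uses made-up sizes purely for illustration (they are not measurements from the dataset), and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import median

# (class label, width, height) with made-up values, for illustration only.
samples = [
    ("14", 70, 68), ("14", 64, 66), ("14", 72, 71),
    ("17", 40, 42), ("17", 38, 39), ("17", 44, 41),
    ("02", 50, 51), ("02", 52, 49), ("02", 48, 50),
]


def median_size_per_class(rows):
    """Return {class label: (median width, median height)}."""
    widths = defaultdict(list)
    heights = defaultdict(list)
    for label, width, height in rows:
        widths[label].append(width)
        heights[label].append(height)
    return {label: (median(widths[label]), median(heights[label]))
            for label in widths}


sizes = median_size_per_class(samples)
for label in sorted(sizes):
    print(label, sizes[label])
```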

Image Distribution Across Classes


  • Distribution varies across classes, creating difficulty in interpreting accuracy of predictions
  • Must also consider Precision, Recall and F1 scores

The data set isn’t balanced. This could create difficulties in learning and in interpreting accuracy, skewing against classes with small training sample sizes.

This does not necessarily imply that there is a problem. Hence, I proceed and see the results with the data as is, and then consider data augmentation or other approaches to attend to the problem.
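A quick way to quantify the imbalance is to count samples per class and compare the largest class with the smallest. This is a minimal sketch with hypothetical labels; the function name is my own.

```python
from collections import Counter


def class_distribution(labels):
    """Return per-class counts and the largest-to-smallest class ratio."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio


# Hypothetical labels, for illustration only.
labels = ["01"] * 6 + ["02"] * 3 + ["00"] * 1

counts, imbalance_ratio = class_distribution(labels)
print(dict(counts))
print(imbalance_ratio)
```

A ratio close to 1 means the classes are roughly balanced; a large ratio flags classes that may need more samples or augmentation.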


The baseline model architecture is inspired by CNN models from our course and ones I tried to prepare for homework assignments. The CNN model consists of 2 logical parts –

  • Feature extraction, followed by the
  • Classification

The input to the model is a color image of width and height 50, represented by 3 channels (RGB). The input dimension is therefore 50 x 50 x 3.

For feature extraction from these images, I use convolutional layers with small 3×3 filters, which help summarize the presence of features in an input image. Each uses the ReLU (rectified linear unit) activation. Each layer pads its input so that the output matches the input in width and height.
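The effect of padding on the output size can be checked with the standard size formulas. This sketch assumes the "same" and "valid" padding conventions used by Keras; the helper name is hypothetical.

```python
import math


def conv_output_size(size: int, kernel: int, stride: int = 1,
                     padding: str = "same") -> int:
    """Spatial output size of a convolution along one dimension."""
    if padding == "same":
        # Input is padded so the output depends only on the stride.
        return math.ceil(size / stride)
    if padding == "valid":
        # No padding: the filter must fit entirely inside the input.
        return (size - kernel) // stride + 1
    raise ValueError(f"unknown padding: {padding}")


print(conv_output_size(50, 3, padding="same"))
print(conv_output_size(50, 3, padding="valid"))
```

With a 3×3 filter and stride 1, unpadded convolutions would shrink a 50-pixel side by 2 pixels per layer, which padding avoids.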

I also use BatchNormalization to help each Conv2D layer learn more independently from other layers. It works by reducing the degree to which the hidden unit values change. This layer also helps train faster, converge faster, and makes more activation functions viable.

Batch normalization uses weights as usual but learns its own offset, which makes the bias term of the preceding layer unnecessary. Hence in Keras I set use_bias=False on the convolutional layer and place the normalization before the activation function.
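Why the bias becomes redundant can be shown numerically: subtracting the batch mean removes any constant added before normalization. The sketch below is a simplified normalization over one batch of scalar values, without the learned scale and offset of a real BatchNormalization layer; the numbers are arbitrary.

```python
from statistics import mean, pvariance


def batch_normalize(values, eps=1e-5):
    """Normalize a batch to zero mean and (roughly) unit variance."""
    mu = mean(values)
    var = pvariance(values)
    return [(v - mu) / (var + eps) ** 0.5 for v in values]


# Arbitrary pre-activation values and bias, for illustration only.
pre_activations = [0.5, -1.2, 3.0, 0.7]
bias = 2.5

without_bias = batch_normalize(pre_activations)
with_bias = batch_normalize([v + bias for v in pre_activations])

# The constant bias shifts the mean by the same amount, so it cancels out.
max_difference = max(abs(a - b) for a, b in zip(without_bias, with_bias))
print(max_difference)
```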

These are followed by max pooling layers that extract the most activated presence of a feature based on filters applied in the preceding convolutional layers.
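Max pooling itself is a small operation. The following sketch assumes a 2×2 window with stride 2 (the pool size is not stated here) and works on a plain nested list rather than a tensor.

```python
def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    rows = len(feature_map) // 2 * 2
    cols = len(feature_map[0]) // 2 * 2
    return [
        [max(feature_map[r][c], feature_map[r][c + 1],
             feature_map[r + 1][c], feature_map[r + 1][c + 1])
         for c in range(0, cols, 2)]
        for r in range(0, rows, 2)
    ]


# Arbitrary activations, for illustration only.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 4],
]

pooled = max_pool_2x2(feature_map)
print(pooled)
```

Each output value is the strongest activation in its window, so the feature map halves in width and height while keeping the most prominent responses.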

Together, the convolutional layer and the max pooling layer form a logical block that detects features. These blocks are stacked with the number of filters expanding from 32 to 64 to 128 in my CNN.

The output of the feature extraction part of the model becomes the input into the classification part of the model. For this to work, I must reduce dimensions to a flat structure, and the Flatten layer helps do that. Next to do the actual classification, I use a fully connected layer called Dense layer first to interpret the flattened input and another Dense layer to predict one of the 43 classes of traffic signs.
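To see what Flatten has to handle, the feature map shapes can be traced through the three blocks. This sketch assumes each block preserves width and height in its convolution and then halves them with 2×2 pooling (rounding down); those pooling settings are my assumption, not stated above.

```python
def trace_shapes(size=50, filters=(32, 64, 128)):
    """Return the output shape of each block and the flattened length."""
    shapes = []
    for num_filters in filters:
        # Padded convolution keeps the size; 2x2 pooling halves it.
        size = size // 2
        shapes.append((size, size, num_filters))
    flattened = size * size * filters[-1]
    return shapes, flattened


shapes, flattened = trace_shapes()
print(shapes)
print(flattened)
```

Under these assumptions the Flatten layer turns a 6 x 6 x 128 volume into a vector that the first Dense layer interprets.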

Dropout is a simple technique that randomly drops nodes from the network during training, which has a regularizing effect and reduces overfitting because the remaining nodes must compensate for the missing information. The Keras Dropout layer makes this easy to apply. I use it to reduce overfitting, including in the classification part where the network narrows down to the 43 output classes.
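The mechanism can be sketched in a few lines. This is a simplified "inverted dropout" on a plain list, where surviving values are scaled up so the expected sum stays the same; the rate of 0.5 is only an example, not the rate used in my model.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`, rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0] * 8

dropped = dropout(activations, rate=0.5, rng=rng)
unchanged = dropout(activations, rate=0.0, rng=rng)
print(dropped)
```

Dropout is only active during training; at prediction time all nodes are used.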

I referred to the paper Traffic Sign Classification with Deep Convolutional Neural Networks written by Jacopo Credi which compares various regularization techniques for well-known complex CNNs.

Settings and Hyperparameters

I chose Adam optimizer based on the results from experiments.

Batch size of 64 is a good balance of performance vs time.

Learning rate default of 0.001 gives the best outcomes.

Weight decay rates of 0.0001 and 0.001 appear roughly equivalent from an accuracy perspective, and the difference from the default of 0 is minor.


Test accuracy does not reach the 99% seen in validation, but at about 2% below it is still decent.

However, what does accuracy mean here?

Interpreting accuracy is not straightforward, because it depends on the dataset. Hence I also evaluate precision and recall. From

Understanding Model Accuracy

True Positives (TP): Correctly predicted positive values, meaning the actual class is yes and the predicted class is also yes.

True Negatives (TN): Correctly predicted negative values, meaning the actual class is no and the predicted class is also no.

False positives and false negatives: These occur when the actual class contradicts the predicted class.

False Positives (FP): The actual class is no but the predicted class is yes.

False Negatives (FN): The actual class is yes but the predicted class is no.
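With 43 classes, these four counts are taken per class by treating that class as "yes" and all others as "no". The sketch below does this for one class on hypothetical labels; the function name is my own.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, FN, TN for one class, treating the rest as negative."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical labels, for illustration only.
y_true = ["14", "14", "17", "17", "02", "14"]
y_pred = ["14", "17", "17", "17", "14", "14"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive="14")
print(tp, fp, fn, tn)
```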

I can calculate Accuracy, Precision, Recall and F1 score using the above –

Accuracy: The ratio of correctly predicted observations to total observations. Accuracy is a useful measure only for symmetric datasets, where the numbers of FP and FN are almost the same. Otherwise, you have to look at other metrics to evaluate the performance of your model.

Accuracy = (TP+TN)/(TP+FP+FN+TN)

Precision: The ratio of correctly predicted positive observations to the total predicted positive observations. High precision corresponds to a low FP rate.

Precision = TP/(TP+FP)

Recall (Sensitivity): The ratio of correctly predicted positive observations to all observations in the actual class.

Recall = TP/(TP+FN)

F1 score: The harmonic mean of Precision and Recall. This score therefore takes both false positives and false negatives into account. It is less intuitive than accuracy, but F1 is usually more useful, especially with an uneven class distribution. Accuracy works best if false positives and false negatives have similar cost. If the costs of FP and FN are very different, it is better to look at both Precision and Recall.

F1 Score = 2*(Recall * Precision) / (Recall + Precision)
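The four formulas translate directly into code. This sketch uses hypothetical counts chosen to show how accuracy can look high while recall is noticeably lower.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from the four counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * recall * precision / (recall + precision)
          if (recall + precision) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for one class, for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=870)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The guards against division by zero matter for rare classes, where a model may never predict the class at all.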

Wrong Predictions

Most wrong predictions fall into 2 categories

  • Low sample sizes for class in training
  • Dark or blurry images

Potential solutions

  • Addition of more samples to classes with small sizes of data, and/or Data Augmentation
  • Grayscale images to better learn signs in various lighting conditions
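For the grayscale option, one standard conversion is a weighted sum of the color channels. The sketch below uses the common ITU-R BT.601 luma weights on a single pixel; whether this particular conversion would help here is untested.

```python
def to_gray(pixel):
    """Convert one (R, G, B) pixel to a single luminance value."""
    red, green, blue = pixel
    # BT.601 weights: green contributes most, blue least.
    return 0.299 * red + 0.587 * green + 0.114 * blue


print(round(to_gray((255, 255, 255))))
print(round(to_gray((255, 0, 0))))
print(round(to_gray((0, 0, 0))))
```

Note that the conversion discards color, so it trades robustness to lighting against the ability to distinguish signs by color.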

The wrongly predicted images belong to only a few classes. The concentration of wrong predictions in particular classes indicates both that other classes contain similar-looking images and that the labeled data for these classes is too limited for the model to distinguish between them.

Note: Due to space considerations in this report, use the class labels at the bottom to read class names of all 4 distributions below.

Confusion Matrix

Several Pedestrian signs in the test set were interpreted as other classes. It is valuable to ask data collection teams to source not only more images of the class in question, but also more variations of the classes it is mistaken for.
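The pairs worth raising with a data collection team can be read off the off-diagonal cells of the confusion matrix. This is a minimal sketch that counts (actual, predicted) pairs for wrong predictions only; the labels are hypothetical and the function name is my own.

```python
from collections import Counter


def confused_pairs(y_true, y_pred):
    """Count (actual, predicted) pairs where the prediction was wrong."""
    return Counter((actual, predicted)
                   for actual, predicted in zip(y_true, y_pred)
                   if actual != predicted)


# Hypothetical labels, for illustration only.
y_true = ["27", "27", "27", "18", "18", "11"]
y_pred = ["18", "18", "27", "18", "11", "11"]

pairs = confused_pairs(y_true, y_pred)
most_confused = pairs.most_common(1)
print(most_confused)
```

Each pair names both the class that needs more samples and the class it is mistaken for.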

Visualization to Understand & Improve

I visualize the convnet filters, activations, and heatmaps, including heatmaps superimposed on images.

  • Filters and feature maps help see what the convnets see
  • Heatmaps and superimposed images help understand what the CNN thinks makes a sign that sign, which also helps debug misinterpretations in cases of wrong predictions

Next Steps

There are several next steps for future development and further improvement I would love to take.

  • Data Augmentation to add samples and improve model robustness
  • Transfer learning with addition of grayscale to add nighttime and variable lighting conditions while retaining RGB learnings
  • Experiment with other traffic sign datasets, from Belgium, US, etc
  • Multilabel traffic sign classifications which are more realistic
  • For real life autonomous driving use-cases, learn to interpret signs partly hidden, damaged, vandalized, etc perhaps through autoencoders.