Semantic Segmentation to Solve The Lyft Perception Challenge

Perception, the ability to perceive and make sense of the surrounding environment, is central to self-driving technology. Autonomous vehicles perceive their surroundings using sensors like cameras, lidar and radar. The Lyft Perception Challenge is about identifying the drivable road surface and other vehicles in each frame of a video recorded with a forward-facing camera on a self-driving car.

The camera data for this challenge actually comes from a simulator called CARLA. In fact, much of the software development for self-driving cars now takes place in simulation, where rapid prototyping and iteration are much more efficient than they would be using real hardware in the real world. The camera data from the CARLA simulator looks like this:

Example of a simulated camera image (left) and pixel-wise labels of cars, road, lane markings, pedestrians, etc. (right)

On the left is the simulated camera image and on the right is the corresponding label image, where the color of each pixel identifies what class of object resides in that pixel. The overall goal of this challenge is to train an algorithm using images and labels like these to be able to predict which pixels in some new and previously unseen image correspond to road and other vehicles.

Tools for Solving The Challenge

In Term 1 of the Self-Driving Car Nanodegree Program, students learn a variety of techniques for understanding what’s in an image. They use traditional computer vision techniques like edge detection and color thresholding to identify lane markings and they use deep learning techniques to do object classification and predict steering angles from camera data.

In the final project for vehicle detection, students use a combination of computer vision and machine learning techniques. Most choose to follow along with the lessons in the classroom and use a technique known as histogram of oriented gradients and a support vector machine classifier to identify vehicles, but some elect to incorporate deep learning techniques to detect vehicles instead.

Students who completed the deep learning specialization in Term 3 of the program got a chance to try their hand at using deep learning for semantic segmentation, which is a method for training a neural network to predict what class of object exists in each pixel of an image.

Example result using semantic segmentation to identify the road surface.
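
To make "predict what class of object exists in each pixel" concrete, here is a minimal sketch of the last step of a segmentation network: turning per-pixel class scores into a label map. The function name, the toy scores, and the three-class layout are all made up for illustration; a real network would produce the scores.

```python
import numpy as np

def scores_to_label_map(class_scores):
    """Pick the highest-scoring class for every pixel.

    class_scores: array of shape (height, width, num_classes)
    returns: array of shape (height, width) holding class indices
    """
    return np.argmax(class_scores, axis=-1)

# Toy 2x2 "image" with 3 hypothetical classes: 0 = other, 1 = road, 2 = vehicle
toy_scores = np.array([
    [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1]],
    [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]],
])
toy_labels = scores_to_label_map(toy_scores)
print(toy_labels)
```

Each entry of the result is the class index for one pixel, which is exactly the kind of image shown in the label pictures above.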

For this challenge, the goal is to infer where the road and other vehicles are on a pixel-wise basis, so semantic segmentation is the most natural approach to solving the problem. If you are a Udacity Self-Driving Car student who hasn’t yet encountered semantic segmentation, don’t worry! With the skills you have learned in Term 1, along with some grit and determination, you too can succeed in this challenge!

My Approach

The Lyft Perception Challenge is a competition where the reward is a job interview with Lyft. A key component of being a great engineer is communicating your results to your team and other stakeholders. While winners of the challenge will be selected by ranking on a leaderboard, Lyft also wants to see how you communicate your results, so a required component of the challenge is to write up your results. Here I’ll give a brief overview of how I approached this challenge as an example.

Preprocessing Label Images

The first step for anyone working on this challenge should be to do a bit of preprocessing on the label images. Preprocessing is necessary, first, because the hood of the car itself is visible in all images and labeled as “vehicle”. We don’t want to train our network to simply find the hood of our own car, so these pixels must be set to 0 (or some other non-vehicle label) in the label images. Second, the lane markings have a different value in the label image than the road, and we want to classify them as part of the road surface, so I’ll set the labels for lane markings to be the same as those for road surface. In Python code, defining a function to preprocess labels looks something like this:

import numpy as np

def preprocess_labels(label_image):
    # Work on a copy so the original label image is left untouched
    labels_new = np.copy(label_image)
    # Identify lane marking pixels (label is 6)
    lane_marking_pixels = (label_image[:,:,0] == 6).nonzero()
    # Set lane marking pixels to road (label is 7)
    labels_new[lane_marking_pixels] = 7

    # Identify all vehicle pixels (label is 10)
    vehicle_pixels = (label_image[:,:,0] == 10).nonzero()
    # Isolate vehicle pixels associated with the hood (y-position >= 496)
    hood_indices = (vehicle_pixels[0] >= 496).nonzero()[0]
    hood_pixels = (vehicle_pixels[0][hood_indices],
                   vehicle_pixels[1][hood_indices])
    # Set hood pixel labels to 0
    labels_new[hood_pixels] = 0
    # Return the preprocessed label image
    return labels_new

The result of applying this preprocessing looks like this:

Raw Image (left), Raw Labels (middle), and Preprocessed Labels (right)
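
If you prefer boolean masks to index tuples, the same two preprocessing steps can be sketched as below. This is an alternative sketch, not the code I ran: the function name and constants are my own, it returns a single-channel label map, and the 600x800 demo image size is an assumption used only to exercise the function.

```python
import numpy as np

LANE_MARKING, ROAD, VEHICLE = 6, 7, 10   # label values used in this article
HOOD_START_ROW = 496                     # rows from here down are treated as hood

def preprocess_labels_masked(label_image):
    """Boolean-mask variant of the label preprocessing; returns a 2D label map."""
    labels = label_image[:, :, 0].copy()
    # Merge lane markings into the road class
    labels[labels == LANE_MARKING] = ROAD
    # Clear vehicle pixels in the hood region (the slice is a view into labels)
    hood_region = labels[HOOD_START_ROW:, :]
    hood_region[hood_region == VEHICLE] = 0
    return labels

# Synthetic label image (size is an assumption for this demo)
demo = np.zeros((600, 800, 3), dtype=np.uint8)
demo[300, 100, 0] = LANE_MARKING   # a lane marking pixel
demo[200, 400, 0] = VEHICLE        # another car on the road
demo[550, 400, 0] = VEHICLE        # our own hood
result = preprocess_labels_masked(demo)
```

The lane marking becomes road, the other car keeps its vehicle label, and the hood pixel is cleared.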

After preprocessing all the labels, I’m ready to train my network!

Training The Network

For the network architecture, I went with a customized FCN-AlexNet to assign each pixel to a class. More information on this model can be found here. I trained the model for 10 epochs with a stochastic gradient descent solver using a base learning rate of 0.0001.

Visualization of Fully Convolutional Network for Semantic Segmentation

Performance Evaluation

I ran inference on a validation set using the trained network and achieved an F2 score for vehicles of 0.6685 and an F0.5 score on road surface of 0.9574 (more on F score). My network processes images at 6.06 frames per second (FPS) and the results look like this!
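
For readers who have not met the F-beta score, here is a small sketch of how it can be computed from binary masks. It uses the standard definition, F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall); the function name and the toy masks are mine, and the challenge's own grader may differ in details. With beta = 2, recall counts for more (missing a vehicle is costly); with beta = 0.5, precision counts for more.

```python
import numpy as np

def f_beta_score(predicted, truth, beta):
    """F-beta score for two binary masks of the same shape."""
    predicted = np.asarray(predicted, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    true_pos = np.sum(predicted & truth)
    false_pos = np.sum(predicted & ~truth)
    false_neg = np.sum(~predicted & truth)
    if true_pos == 0:
        return 0.0  # no correct positive pixels, so the score is zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    beta_sq = beta ** 2
    return float((1 + beta_sq) * precision * recall
                 / (beta_sq * precision + recall))

# Toy masks: precision = 3/5, recall = 3/4
truth_mask = np.array([[1, 1, 0, 0],
                       [1, 1, 0, 0]])
pred_mask = np.array([[1, 1, 1, 1],
                      [1, 0, 0, 0]])
f2 = f_beta_score(pred_mask, truth_mask, beta=2)       # recall-weighted
f_half = f_beta_score(pred_mask, truth_mask, beta=0.5) # precision-weighted
```

Because recall is higher than precision in this toy case, the F2 score comes out above the F0.5 score.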

Sunny CARLA Validation Set

Room For Improvement

To improve the performance of my network, I would first gather more data under varied conditions in the simulator, then apply a series of data augmentations to diversify what I had collected. I could also turn to a different architecture like FCN-8, which is designed for more fine-grained predictions. To increase speed, another idea is to include temporal data, possibly in the form of optical flow, to cut down the number of frames that need full inference while retaining high accuracy. Here is an interesting paper looking into using temporal information!
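
As a rough idea of what the augmentation step could look like, here is a sketch of two simple augmentations. I did not use this code in my submission; the function names and the default parameters are assumptions. The key point is that geometric changes such as flips must be applied to the image and its labels together, while photometric changes such as brightness apply to the image only.

```python
import numpy as np

def random_horizontal_flip(image, labels, rng, probability=0.5):
    """Flip the image and its label map left-right together."""
    if rng.random() < probability:
        return image[:, ::-1].copy(), labels[:, ::-1].copy()
    return image, labels

def random_brightness(image, rng, max_delta=0.2):
    """Scale pixel intensities by a random factor; labels are unaffected."""
    factor = 1.0 + rng.uniform(-max_delta, max_delta)
    scaled = image.astype(np.float32) * factor
    return np.clip(scaled, 0, 255).astype(np.uint8)

# Small synthetic example
rng = np.random.default_rng(0)
image = np.arange(2 * 4 * 3, dtype=np.uint8).reshape(2, 4, 3)
labels = np.array([[7, 7, 0, 10],
                   [7, 0, 0, 10]], dtype=np.uint8)

# probability=1.0 forces the flip so the effect is easy to see
flipped_image, flipped_labels = random_horizontal_flip(image, labels, rng, probability=1.0)
brightened = random_brightness(image, rng)
```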

Application to The Real World

The whole point of training an algorithm like this in a realistic-looking simulated world is to come up with something that actually works in the real world! I don’t think my implementation is quite ready to be released into the wild yet, but I couldn’t resist running it on some real-world data to see what kind of results I would get!

In simulation, my implementation does an okay job of figuring out where the road and the vehicles are, but how will this translate to real-world data? Let’s take a look at how it performs on the classic lane-finding video from Self-Driving Car Term 1.

Lane-finding video from Self-Driving Car Term 1

We can see that overall it doesn’t deliver great results, but it’s encouraging that the network does seem able to identify road and vehicles to some degree in many frames. My theory is that this scene is not very similar to the simulator scenes we used to train the network. Perhaps, if we found real-world data closer to what the simulator looks like, we would observe better results.

Data collected from Udacity Carla Car in the streets of California

As you can see, the outcome looks much better in scenes that are close to the setup of the simulator: in this case, a fairly narrow two-lane road with artifacts on both sides of it. This seems to be a promising result. It means that as long as we can build simulators that look approximately like the real-world scenarios we expect to operate in, we will be on our way towards a robust approach for transferring from simulation to the real world.

Here is some code you can use to observe the results of your algorithm on the provided visualization data:

from moviepy.editor import VideoFileClip, ImageSequenceClip
import numpy as np
import scipy, argparse, sys, cv2, os

# Path to the input video, taken from the last command-line argument
file = sys.argv[-1]
if file == '':
    print("Error loading video")
    sys.exit(1)

def your_pipeline(rgb_frame):
    ## Your algorithm here to take rgb_frame and produce binary array outputs!
    # your_function is a placeholder for your own model's inference call
    out = your_function(rgb_frame)

    # Grab cars (label 10) and blank out the hood rows
    car_binary_result = np.where(out == 10, 1, 0).astype('uint8')
    car_binary_result[496:, :] = 0
    car_binary_result = car_binary_result * 255

    # Grab road: lane markings (label 6) plus road surface (label 7)
    road_lines = np.where(out == 6, 1, 0).astype('uint8')
    roads = np.where(out == 7, 1, 0).astype('uint8')
    road_binary_result = (road_lines | roads) * 255

    # Cars go in the red channel, road in the green channel
    overlay = np.zeros_like(rgb_frame)
    overlay[:, :, 0] = car_binary_result
    overlay[:, :, 1] = road_binary_result

    # Blend the overlay onto the original frame
    final_frame = cv2.addWeighted(rgb_frame, 1, overlay, 0.3, 0)
    return final_frame

# Define pathname to save the output video
output = 'segmentation_output_test.mp4'
clip1 = VideoFileClip(file)
clip = clip1.fl_image(your_pipeline)
clip.write_videofile(output, audio=False)
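
If you want to sanity-check the overlay logic before wiring in a model or a video file, a NumPy-only sketch like the one below can help. It is an approximation I wrote for illustration, not part of the provided script: the function name is made up, the blend imitates a weighted add with plain arithmetic (truncating rather than rounding), and the 600x800 frame size is an assumption.

```python
import numpy as np

def overlay_segmentation(rgb_frame, label_map, alpha=0.3):
    """Tint vehicle pixels red and road pixels green on top of rgb_frame."""
    car_mask = (label_map == 10)
    car_mask[496:, :] = False                        # ignore the hood rows
    road_mask = (label_map == 6) | (label_map == 7)  # lane markings + road

    overlay = np.zeros_like(rgb_frame)
    overlay[:, :, 0] = car_mask * 255
    overlay[:, :, 1] = road_mask * 255

    # Weighted add, clipped back into the valid 8-bit range
    blended = rgb_frame.astype(np.float32) + alpha * overlay
    return np.clip(blended, 0, 255).astype(np.uint8)

# Uniform gray frame with a few hand-placed labels
frame = np.full((600, 800, 3), 100, dtype=np.uint8)
label_map = np.zeros((600, 800), dtype=np.uint8)
label_map[300, 400] = 10   # another vehicle
label_map[400, 400] = 7    # road surface
label_map[550, 400] = 10   # hood pixel, should be left untinted
tinted = overlay_segmentation(frame, label_map)
```

Only the vehicle and road pixels should change color; the hood pixel and the background stay at the original gray value.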

Source: Deep Learning on Medium