SLAM in the era of deep learning

Source: Deep Learning on Medium

Deepifying SLAM

1. What is SLAM and why do we need it?

Photo by John Baker on Unsplash

This article is part I of a series that explores the relationship between deep learning and SLAM. We first look at what SLAM is and how it works. This will later give us deeper insight into which parts of the system can be replaced by a learned counterpart, and why. Let’s begin at the beginning. If all goes well, we won’t need a sign to tell us “You are here”, as SLAM will solve this for us.


Simultaneous Localization and Mapping, or SLAM for short, is a relatively well-studied problem in robotics with a two-fold aim:

  • Mapping: building a representation of the environment which for the moment we will call a “map” and
  • Localization: finding where the robot is with respect to the map.

When is SLAM needed?

In GPS-denied environments such as indoors, underground, or underwater, a mobile agent has to rely solely on its on-board sensors to construct a representation of the environment in order to localize itself. This is the scenario in which SLAM is needed. Even in situations where GPS can provide coarse localization, SLAM can be used to provide a fine-grained estimate of the vehicle location.

Why simultaneous?

“Which came first, the chicken or the egg?” is an age-old question which also explains the simultaneous bit of SLAM. If we already have a map, it is relatively easy to localize the robot with respect to it, since we know what to localize against. Similarly, if we know where the robot has been at all times (position as well as orientation), it is easy to construct a map by superimposing sensor measurements in a common frame of reference. Starting with neither a map nor a location (it is customary to take the location of the first measurement as the origin of the map), both localization and mapping need to be done simultaneously to get an updated location of the robot as well as an updated estimate of the map.
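The "known poses" half of the argument can be sketched in a few lines. This is a minimal illustration, assuming a robot moving in a 2D plane whose pose is (x, y, heading) and whose sensor returns points in the robot's own frame; the function names and the example numbers are made up for illustration.

```python
import math

def to_world(pose, point):
    """Transform a point from the robot frame into the common map frame.

    pose  = (x, y, theta): robot position and heading in the map frame.
    point = (px, py): a measurement expressed in the robot's own frame.
    """
    x, y, theta = pose
    px, py = point
    c, s = math.cos(theta), math.sin(theta)
    return (x + c * px - s * py, y + s * px + c * py)

def build_map(poses, scans):
    """Superimpose every scan in the common frame of reference."""
    world_points = []
    for pose, scan in zip(poses, scans):
        world_points.extend(to_world(pose, p) for p in scan)
    return world_points

# Two known poses that both observe the same landmark.
# The first pose is the origin of the map, as is customary.
poses = [(0.0, 0.0, 0.0), (1.0, 0.0, math.pi / 2)]
scans = [[(2.0, 1.0)], [(1.0, -1.0)]]

landmark_map = build_map(poses, scans)
for point in landmark_map:
    print(point)
```

Both measurements land on (approximately) the same map point because the poses are exact. With noisy or unknown poses the two copies of the landmark would no longer coincide, which is why the map and the poses have to be estimated together.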

What’s in a map?

It depends. On what, you ask? On the application. One of the biggest applications of SLAM is localization. We build maps to find out where we are with respect to them, either now or at a later time.

Think of millions of cars on the road in a big city, where everyone helps everyone else to localize by contributing to a shared map. This is needed because a map is not static; it is a living, breathing organism that changes all the time. For localization at large scale, the map can be represented as a set of sparse points corresponding to uniquely identifiable regions in the sensor measurement. Sparse representations are efficient to create, update and share.

Mapping would be the other obvious application. Imagine a mobile agent (you holding a camera) moving inside a property that you want to rent out. You use SLAM magic and out pops a detailed 3D model of the building. The map in this case needs to be a dense surface model of the rental property.

For Virtual Reality/Augmented Reality applications, SLAM serves as the localization backbone. As the end goal is localization, sparse representations can be used.

Depending on the application, the map can consist of various things, ranging from sparse points to a dense representation of the world. Later on we will see how deep learning enables more sophisticated yet sparse representations for a SLAM map.

Odometry, SfM, SLAM

Before proceeding to dissect a modern SLAM system, it is necessary to clear up some confusion about terminology.

  • Odometry in its purest form provides an estimate of the motion of a mobile agent by comparing two consecutive sensor observations, which was the case for laser-based odometry. The work on visual odometry by Nistér et al. extends this to tracking over a number of image frames; however, the focus is still on the motion rather than on the environment representation.
  • Structure from Motion (SfM) deals with an unordered set of images to recover a model of the environment as well as the camera locations. A good example of SfM is “Building Rome in a day” by Agarwal et al.
  • SLAM exploits the sequential nature of observations in a robotics setup. It assumes that, instead of an unordered set of images, the observations come from a temporal sequence (aka a video stream).
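To make the odometry bullet concrete, here is a small sketch of how relative motion estimates between consecutive observations are chained into a trajectory. It assumes planar motion and uses a made-up heading bias of 2 degrees per step to show why odometry on its own drifts; none of the names or numbers come from a particular system.

```python
import math

def compose(pose, delta):
    """Apply a relative motion (expressed in the robot frame) to a pose."""
    x, y, theta = pose
    dx, dy, dtheta = delta
    c, s = math.cos(theta), math.sin(theta)
    return (x + c * dx - s * dy, y + s * dx + c * dy, theta + dtheta)

def integrate(deltas, start=(0.0, 0.0, 0.0)):
    """Chain relative motions into a trajectory (dead reckoning)."""
    trajectory = [start]
    for delta in deltas:
        trajectory.append(compose(trajectory[-1], delta))
    return trajectory

# Drive a 1 m square: move forward 1 m, then turn left 90 degrees, four times.
square = [(1.0, 0.0, math.pi / 2)] * 4
ideal = integrate(square)

# Same path, but every turn is over-estimated by a hypothetical 2 degrees.
biased = [(1.0, 0.0, math.pi / 2 + math.radians(2.0))] * 4
drifted = integrate(biased)

# Distance between where the robot really ends up and where odometry thinks it is.
end_error = math.hypot(drifted[-1][0] - ideal[-1][0],
                       drifted[-1][1] - ideal[-1][1])
print(f"end-point error after one lap: {end_error:.3f} m")
```

The ideal trajectory returns to its starting point, while the biased one does not. Odometry has no way of noticing this, because it never relates the current observation to anything other than the previous one; recognizing a previously visited place and using it to correct the trajectory is what a map makes possible.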

Sensors for SLAM

Sensors can be divided into two categories based on whether they measure the outside world or measure themselves, that is, the internal state of the system.

  • Proprioceptive (from Latin proprius meaning ‘own’ + receptive): IMUs, gyroscopes, compasses. These sensors do not measure any aspect of the environment and are therefore only useful for recovering an estimate of the trajectory of the robot.
  • Exteroceptive: Cameras (Mono, Stereo, More-o), Lasers, LIDARs, RGB-D Sensors, Wifi receivers, Light intensity, etc. Anything that can measure some aspect of the outside world that changes with the position/orientation of the robot can theoretically be used as a sensor for SLAM.

While many different SLAM solutions have been proposed using a combination of proprio- and exteroceptive sensors, going forward we are going to consider the case of Monocular SLAM, that is, SLAM using just a single camera. This is challenging, and therefore interesting, because using only a single camera introduces problems that are not present in a multi-camera or laser-based solution. This will also give an idea of where deep learning can play a role in SLAM.

Problems with monocular SLAM

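  • Scale ambiguity: from images alone, a single camera can recover the scene and its own trajectory only up to an unknown global scale factor, since a scene twice as large viewed from twice as far away produces the same images.
  • Scale drift: because the scale cannot be observed directly, the estimated scale can change gradually along the trajectory as small errors accumulate.
  • Pure rotation: Monocular SLAM dies under pure rotation; it’s that bad. With a single camera, the baseline (translation between two camera positions) is used to estimate the depth of the scene being observed, a process called triangulation. If a camera only rotates, the baseline is zero and no new points can be triangulated. What makes things worse is that the apparent motion on the image plane is greater under rotation than under translation. Effectively, the points for which we knew the depth whizz out of the field of view, and no new points can be estimated since there is no baseline. The result: tracking failure!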
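The role of the baseline can be checked numerically. The sketch below assumes an ideal pinhole camera with a hypothetical focal length of 500 pixels, rotating only about its vertical axis; the points and camera positions are arbitrary example values. It shows three things: scaling the whole scene together with the camera translation leaves the image unchanged, two points at different depths on the same viewing ray stay indistinguishable under pure rotation, and a sideways translation separates them.

```python
import math

def project(point, cam_pos, yaw=0.0, focal=500.0):
    """Project a 3D point into a pinhole camera.

    The camera sits at cam_pos, looks along +z and is rotated by `yaw`
    radians about its vertical (y) axis. Returns the pixel offset (u, v)
    from the image center.
    """
    dx = point[0] - cam_pos[0]
    dy = point[1] - cam_pos[1]
    dz = point[2] - cam_pos[2]
    # Rotate the offset from the world frame into the camera frame.
    c, s = math.cos(yaw), math.sin(yaw)
    xc = c * dx - s * dz
    zc = s * dx + c * dz
    return (focal * xc / zc, focal * dy / zc)

def close(a, b, tol=1e-9):
    """True if two pixel coordinates agree up to rounding error."""
    return all(abs(p - q) < tol for p, q in zip(a, b))

near = (0.5, 0.2, 4.0)
far = tuple(2.5 * v for v in near)      # same viewing ray, 2.5x the depth
origin = (0.0, 0.0, 0.0)
moved = (0.3, 0.0, 0.0)                 # camera after a 0.3 m sideways step

# 1. Scale ambiguity: scale the scene AND the translation by the same factor.
scale = 3.0
big_point = tuple(scale * v for v in near)
big_moved = tuple(scale * v for v in moved)
same_image = close(project(near, moved), project(big_point, big_moved))

# 2. Pure rotation: the camera center does not move, so depth stays hidden.
yaw = math.radians(10.0)
rotation_is_blind = close(project(near, origin, yaw), project(far, origin, yaw))

# 3. Translation: a non-zero baseline makes the two depths separate (parallax).
parallax = abs(project(near, moved)[0] - project(far, moved)[0])

print(same_image, rotation_is_blind, round(parallax, 3))
```

The first check is the scale ambiguity in miniature: nothing in the pixels tells a small scene seen with a small step apart from a large scene seen with a large step. The second and third checks show that the horizontal separation between the two points is zero under rotation and non-zero under translation, which is the signal that triangulation relies on.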

Modules of a modern SLAM system

I say modern to distinguish it from filtering-based methods, which are no longer en vogue. In this series, we are mainly going to focus on the graph-based formulation of SLAM [2].

Let us pause here and consider what we want to achieve with a SLAM system. We want to convert raw sensor measurements into a coherent map and, in the process, recover the location of the robot at every time instant at which a sensor measurement was obtained. The coherent map is expected to be more accurate than the individual sensor measurements. The sensor measurements are the inputs, and the robot poses and a map are the outputs of the SLAM pipeline. In the Graph SLAM formulation, the vertices in the graph are the entities that we want to estimate (outputs): robot positions, locations of points in the world, etc. Edges represent constraints between these entities, which are derived from the raw sensor measurements (inputs).
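As a rough picture of this formulation, consider a robot on a 1D line with three poses and one landmark. The sketch below is an illustration only: the `Edge` class, the vertex names, and all numbers are invented for the example, and a constraint is simplified to a measured offset between two vertices.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    src: str         # vertex the measurement was taken from
    dst: str         # vertex that was measured
    measured: float  # measured offset (dst - src) in meters

# Vertices: the quantities we want to estimate (outputs).
# x0..x2 are robot positions, l0 is a point in the world.
vertices = {"x0": 0.0, "x1": 1.1, "x2": 2.3, "l0": 3.0}

# Edges: constraints derived from sensor measurements (inputs).
edges = [
    Edge("x0", "x1", 1.0),  # odometry between consecutive poses
    Edge("x1", "x2", 1.0),
    Edge("x0", "l0", 3.0),  # landmark observed from x0
    Edge("x2", "l0", 0.9),  # same landmark observed again from x2
]

def residual(edge, vertices):
    """How far the current estimates are from satisfying one constraint."""
    return (vertices[edge.dst] - vertices[edge.src]) - edge.measured

def total_error(edges, vertices):
    """Sum of squared residuals over the whole graph."""
    return sum(residual(e, vertices) ** 2 for e in edges)

for e in edges:
    print(f"{e.src} -> {e.dst}: residual {residual(e, vertices):+.2f}")
print(f"total squared error: {total_error(edges, vertices):.3f}")
```

The measurements disagree slightly with one another, so no assignment of the vertices makes every residual zero. The total squared error is one common way of scoring how well a set of estimates explains all the measurements at once.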

From an implementation perspective, we have a notion of things that can become vertices: the location (pose) of the robot when it made a measurement of the world with its sensors, so that goes into the graph as a vertex. How do we formulate the edges in the graph? This is where we encounter the first module of a modern SLAM system: the front-end.

The front-end is responsible for converting raw sensor measurements into vertices and edges that will go into the graph. It deals with tasks such as feature extraction from images, 3D point initialization, and data association (feature matching), among other things. The front-end serves as a way of abstracting away the actual sensor by converting its measurements into relative constraints (edges) between the entities (vertices) that we want to estimate. Once these constraints have been formed, the back-end is responsible for optimizing the graph to find the best solution.
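To give a feel for what "optimizing the graph" means, here is a toy back-end for poses on a 1D line. It is a sketch under strong assumptions: constraints are plain offsets between poses, all edges are trusted equally, and plain gradient descent stands in for the sparse non-linear least-squares solvers that real back-ends use. The edge values are invented for the example.

```python
def optimize(initial, edges, iterations=500, step=0.1):
    """Minimize the sum of squared edge residuals by gradient descent.

    initial: list of 1D pose estimates; pose 0 is held fixed as the map origin.
    edges:   list of (i, j, measured), meaning "pose j minus pose i was
             measured to be `measured` meters".
    """
    poses = list(initial)
    for _ in range(iterations):
        grad = [0.0] * len(poses)
        for i, j, measured in edges:
            r = (poses[j] - poses[i]) - measured  # residual of this edge
            grad[j] += 2.0 * r
            grad[i] -= 2.0 * r
        for k in range(1, len(poses)):            # pose 0 stays at the origin
            poses[k] -= step * grad[k]
    return poses

# Edges as a front-end might hand them over: three odometry constraints
# and one constraint that relates the last pose directly to the first.
edges = [
    (0, 1, 1.0),
    (1, 2, 1.0),
    (2, 3, 1.0),
    (0, 3, 2.7),  # disagrees with the 3.0 m implied by odometry
]

initial = [0.0, 1.0, 2.0, 3.0]   # initial guess from chaining odometry
optimized = optimize(initial, edges)
print([round(p, 3) for p in optimized])
```

The optimizer spreads the 0.3 m disagreement over all four edges instead of blaming a single measurement: each odometry step shrinks a little and the last pose ends up between the two conflicting estimates. Note that the back-end never sees an image or a laser scan, only the vertices and edges, which is the abstraction the front-end provides.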

Is SLAM solved?

There has been a lot of debate about what this question means. The short answer is “Yes and No”. This is precisely the question that led to the creation of our paper [1], which provides a detailed discussion of the Past, Present, and Future of SLAM.

Next time

We look at the architecture of a modern SLAM system.


  1. Cadena C, Carlone L, Carrillo H, Latif Y, Scaramuzza D, Neira J, Reid I, Leonard JJ. “Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age”. IEEE Transactions on Robotics. 2016;32(6):1309–32.
  2. Grisetti G, Kümmerle R, Stachniss C, Burgard W. “A tutorial on graph-based SLAM”. IEEE Intelligent Transportation Systems Magazine. 2010;2(4):31–43.