Computer Vision at Tesla

Original article was published by Jeremy Cohen on Artificial Intelligence on Medium

3. The Neural Networks

Between the vehicles, the lane lines, the road curbs, the crosswalks, and all the other specific environmental variables, Tesla has a lot of work to do. In fact, they must run at least 50 neural networks simultaneously to make it work. That’s just not possible on standard computers.

Tesla uses a specific architecture called HydraNets, where the backbone is shared.

Similar to transfer learning, where a common block feeds specific blocks trained for related tasks, a HydraNet has a backbone trained on all objects and heads trained for specific tasks. This improves both inference speed and training speed.

Tesla’s neural network

The neural networks are trained using PyTorch, a deep learning framework you might be familiar with.

  • Each image, of dimension (1280, 960, 3), is passed through this specific neural network.
  • The backbone is a modified ResNet-50; the specific modification is the use of dilated convolutions.
  • The heads are based on semantic segmentation: FPN, DeepLab, and U-Net style architectures.
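To make the shared-backbone idea concrete, here is a minimal PyTorch sketch of a HydraNet-style model. This is not Tesla’s actual network: the backbone is a toy stand-in for the modified ResNet-50 (with one dilated convolution), and the two heads, their names, and their channel counts are made up for illustration. Note that PyTorch takes images as (N, C, H, W), so a (1280, 960, 3) image becomes a (1, 3, 960, 1280) tensor.

```python
import torch
import torch.nn as nn

class HydraNet(nn.Module):
    """Toy shared-backbone / multi-head network (illustrative only)."""
    def __init__(self, num_seg_classes=4, num_boxes=10):
        super().__init__()
        # Shared backbone: a stand-in for the modified ResNet-50.
        # The dilated convolution grows the receptive field without
        # shrinking the feature map.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=2, dilation=2), nn.ReLU(),
        )
        # Task-specific heads that all reuse the backbone features.
        self.seg_head = nn.Conv2d(32, num_seg_classes, 1)  # per-pixel classes
        self.det_head = nn.Conv2d(32, num_boxes * 4, 1)    # box regressions

    def forward(self, x):
        features = self.backbone(x)  # computed once, shared by every head
        return {
            "segmentation": self.seg_head(features),
            "detection": self.det_head(features),
        }

net = HydraNet()
out = net(torch.randn(1, 3, 960, 1280))  # one (1280, 960, 3) image, NCHW
```

Because the backbone features are computed once and reused by every head, adding a task costs only that head’s compute, which is why both inference and training get cheaper than running one full network per task.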

I teach both these concepts in my course IMAGE SEGMENTATION: Advanced Techniques for aspiring Computer Vision experts. I designed this course for everyone who knows how backpropagation works—that’s the only requirement, along with beginner-level Python. Segmentation is crucial for Tesla, as almost all of their tasks use it.

Something else Tesla uses is Bird’s Eye View

Sometimes the results of a neural network must be interpreted in 3D. The Bird’s Eye View can help estimate distances and provide a much better and more real understanding of the world.

Smart Summon in Tesla using Bird’s Eye View
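As a rough illustration of what a bird’s-eye-view projection does, here is a tiny inverse-perspective-mapping sketch. The homography `H` below is entirely made up; in a real system it would come from the camera’s calibration (intrinsics, mounting height, and pitch).

```python
import numpy as np

# Hypothetical 3x3 homography mapping image pixels (u, v) to
# ground-plane coordinates in metres. Real values come from calibration.
H = np.array([[0.02,  0.0,    -12.8],
              [0.0,  -0.05,    50.0],
              [0.0,  -0.0008,   1.0]])

def image_to_bev(u, v):
    """Project an image pixel onto the ground plane (bird's-eye view)."""
    x, y, w = H @ np.array([u, v, 1.0])
    return x / w, y / w  # lateral offset, forward distance

# Pixels higher in the image map to points farther from the car:
near = image_to_bev(640, 960)  # bottom-centre pixel
far = image_to_bev(640, 500)   # higher up in the image
```

With these made-up numbers, `near` lands about 8.6 m ahead and `far` about 42 m ahead, both centred laterally. This flattened, top-down view is what makes distances and road layout so much easier to reason about.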

Some tasks run on multiple cameras. Depth estimation, for example, is generally done with stereo cameras: having two cameras helps estimate distances better. Tesla does this with neural networks that regress the depth.

Depth estimation from 2 cameras

Using this stereo vision and sensor fusion, Tesla doesn’t need LiDAR; they can estimate distances from these two cameras alone. The only trick is that the cameras don’t use the same lenses: in the right-hand image, distant objects appear much closer.
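For intuition, classical stereo geometry ties depth to the disparity between the two views through the focal length and the camera baseline. Tesla’s network regresses depth directly rather than matching pixels explicitly, but the sketch below (with made-up numbers) shows the relationship such a network implicitly has to learn.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Pinhole stereo model: depth = focal * baseline / disparity."""
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: 1000 px focal length, 30 cm baseline.
# A 10 px disparity then corresponds to a point 30 m away:
print(depth_from_disparity(1000, 0.30, 10))  # 30.0
```

Depth grows as disparity shrinks, which is why small disparity errors on far-away objects turn into large depth errors — one reason learned depth regression is attractive.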

Tesla also has recurrent tasks such as road layout estimation. The idea is similar: multiple neural networks run separately, and another neural network makes the connection.

Optionally, this neural network can be recurrent so that it involves time.

👉 Tesla’s main problem is that it uses 8 cameras, 16 time steps (recurrent architecture), and a batch size of 32.

It means that for every forward pass, 4,096 images are processed. I don’t know about you, but my MacBook Pro could never support this. In fact, a single GPU couldn’t do it, and neither could two!
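The arithmetic behind that claim is easy to check; the GiB figure below counts only the raw uint8 input pixels, before a single activation or gradient is stored.

```python
cameras, time_steps, batch_size = 8, 16, 32
images_per_pass = cameras * time_steps * batch_size
print(images_per_pass)  # 4096

# Raw input alone, at (1280, 960, 3) uint8 pixels per image:
bytes_per_image = 1280 * 960 * 3
gib = images_per_pass * bytes_per_image / 2**30
print(gib)  # 14.0625 GiB of input per forward pass
```

Activations, gradients, and optimiser state multiply that footprint many times over, which is why a naive one-network-per-image-per-task design is out of reach.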

To solve this problem, Tesla is betting big on the HydraNet architecture. Each camera is processed through a single neural network; everything is then combined in the middle neural network. The amazing thing is that each task requires only a few parts of this gigantic network.

For example, object detection might require only the front camera, the front backbone, and perhaps a second camera; not everything has to be processed identically.

The 8 main neural networks used by Tesla

4. The Training

Network training is done using PyTorch. Multiple tasks are needed, and it can take a lot of time to train all 48 neural network heads. In fact, full training would require 70,000 GPU-hours: almost 8 years on a single GPU.

Tesla is changing the training mode from “round robin” to a “pool of workers”. Here’s the idea: on the left, the long, impossible option; in the middle and on the right, the alternatives they use.

I don’t have a lot of details to share on that part, but essentially, these pool-of-workers approaches parallelize the tasks to make training faster.
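I can only guess at the details, but the general pattern looks something like this: instead of stepping through the heads one at a time (round robin), independent training steps are handed to a pool of workers. The `train_step` below is a placeholder, not Tesla’s code.

```python
from concurrent.futures import ThreadPoolExecutor

def train_step(head_name):
    # Placeholder for one optimisation step on a task-specific head.
    return f"{head_name}: step done"

heads = [f"head_{i}" for i in range(48)]

# Round robin: every head is stepped sequentially, one at a time.
round_robin_results = [train_step(h) for h in heads]

# Pool of workers: steps for independent heads run concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    pool_results = list(pool.map(train_step, heads))
```

In real multi-GPU training the workers would be processes on separate devices (for example via `torch.distributed`), but the scheduling idea is the same: tasks that don’t depend on each other shouldn’t have to wait for each other.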

5. Full Stack Review

I hope you now have a clear idea of how it works there. It’s not impossible to understand, but it’s definitely different from what we might be used to.
Why? Because it involves very complicated real-world problems.

In a perfect world, you wouldn’t need the HydraNet architecture—you’d just use one neural network per image and per task… but that, today, is impossible to do.

In addition to that, Tesla must improve its software continuously.
They must collect and leverage users’ data. After all, they have thousands of vehicles driving out there; it would be stupid not to use their data to improve their models. All of this data is collected, labeled, and used for training, in a process similar to active learning.

Here’s the complete loop.

Tesla’s Full Stack

Let’s define the stack from the bottom to the top.

  • Data — Tesla collects data from the vehicles and a team labels it.
  • GPU Cluster — Tesla uses multiple GPUs (called a cluster) to train their neural networks and run them.
  • DOJO — Tesla uses something they call Dojo to train only a part of the whole architecture for a specific task. It’s very similar to what they do at inference.
  • PyTorch Distributed Training — Tesla uses PyTorch for training.
  • Evaluation — Tesla evaluates network training using loss functions.
  • Cloud Inference — Cloud processing allows Tesla to improve its fleet of vehicles at the same time.
  • Inference @FSD — Tesla built its own computer that has its own Neural Processing Unit (NPU) and GPUs for inference.
  • Shadow Mode — Tesla collects results and data from the vehicles and compares them with the predictions to help improve annotations: it’s a closed-loop!
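To make the closed loop concrete, here is a toy sketch of the shadow-mode idea: the model’s prediction runs silently, is compared with what the human driver actually did, and disagreements are flagged as data worth labeling. Everything here (the function, the threshold, the steering angles) is hypothetical.

```python
def shadow_mode_triage(predictions, driver_actions, threshold=0.5):
    """Flag frames where the model disagrees with the human driver.

    Hypothetical helper: real triggers are far richer, but the idea is
    that disagreement marks data worth labeling (active learning).
    """
    to_label = []
    for frame_id, (pred, actual) in enumerate(zip(predictions, driver_actions)):
        if abs(pred - actual) > threshold:
            to_label.append(frame_id)
    return to_label

# e.g. predicted vs. observed steering angles (radians):
print(shadow_mode_triage([0.1, 0.9, -0.2], [0.1, 0.1, -0.9]))  # [1, 2]
```

Frames 1 and 2 disagree with the driver, so they go back to the labeling team; frame 0 matched the human, so it teaches the model nothing new and is discarded.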

Here is the video that describes everything I just wrote and gathers the images I showed you.