Video Semantic Segmentation - A Novel Approach from NVIDIA

Original article was published on Artificial Intelligence on Medium

4. Experiments and Results

The experiments are conducted on three standard driving-scene benchmarks — Cityscapes, CamVid, and KITTI. The new approach yields significant mIoU improvements on all three datasets.
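Since all of the results below are reported as mIoU (mean Intersection-over-Union), it is worth pinning down the metric. The `mean_iou` helper below is a minimal NumPy sketch for illustration, not the evaluation code from [1]:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection-over-Union, averaged over classes that appear
    in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:              # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2x2 masks: class 0 has IoU 2/3, class 1 has IoU 1/2.
pred = np.array([[0, 0], [1, 1]])
gt   = np.array([[0, 0], [0, 1]])
score = mean_iou(pred, gt, 2)      # ≈ 0.583
```

Benchmark servers compute IoU from confusion matrices accumulated over the whole test set rather than per image, but the per-class intersection/union idea is the same.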

4.1 Cityscapes

  • The Cityscapes dataset contains 5,000 images with high-quality pixel-level annotations. The standard training, validation, and test splits contain 2,975, 500, and 1,525 images, respectively.

4.1.1 Mapillary pre-training and class-uniform sampling

  • The model is pre-trained on the Mapillary Vistas dataset instead of using the ImageNet pre-trained weights. This dataset was chosen because it contains data similar to Cityscapes but has a higher number of training images (18K) and classes (65).
  • Class-uniform sampling is a strategy that ensures all classes are sampled uniformly during training, so that rare classes are not under-represented.
  • Mapillary pre-training improves mIoU by 1.72% over the baseline (76.60% → 78.32%), and class-uniform sampling adds a further 1.14% (78.32% → 79.46%).
Figure 4: Effectiveness of Mapillary pre-training and class uniform sampling from [1]
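The sampling idea can be sketched as follows; `class_uniform_sample` is a hypothetical helper illustrating the round-robin-over-classes strategy, not NVIDIA's implementation:

```python
import random

def class_uniform_sample(image_classes, num_samples, seed=0):
    """Sample image indices so every class is drawn (roughly) equally often.

    image_classes: list where entry i is the set of class ids present
    in image i.
    """
    rng = random.Random(seed)
    # Invert the annotation: class id -> images containing that class.
    by_class = {}
    for idx, classes in enumerate(image_classes):
        for c in classes:
            by_class.setdefault(c, []).append(idx)
    class_ids = sorted(by_class)
    # Round-robin over classes, picking a random containing image each time.
    return [rng.choice(by_class[class_ids[i % len(class_ids)]])
            for i in range(num_samples)]
```

With plain random sampling, an image containing only a rare class (say, "train") would almost never be seen; cycling over classes guarantees it is drawn once per epoch-cycle.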

4.1.2 Label Propagation versus Joint Propagation

  • The results show that joint propagation outperforms label propagation: it improves mIoU by 0.80% over the baseline (79.46% → 80.26%), whereas label propagation improves it by only 0.33% (79.46% → 79.79%).
Figure 5: Comparison between (1) Label Propagation (LP) and Joint Propagation (JP); (2) Video Prediction (VPred) and Video Reconstruction (VRec) (from [1]). The columns indicate different propagation lengths, including a forward (+) and backward (-) propagation.
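The difference between the two schemes can be illustrated with a toy sketch. Here `warp` is a stand-in column shift, not the paper's motion-compensated warping; the point is only which tensors get warped:

```python
import numpy as np

def warp(arr, shift):
    """Toy stand-in for motion-compensated warping: shift columns."""
    return np.roll(arr, shift, axis=1)

def label_propagation(frame_next, label_t, motion):
    # The propagated label is paired with the *real* next frame, so any
    # error in the estimated motion yields a misaligned training pair.
    return frame_next, warp(label_t, motion)

def joint_propagation(frame_t, label_t, motion):
    # Image and label are warped together, so they stay pixel-aligned
    # even when the estimated motion is imperfect.
    return warp(frame_t, motion), warp(label_t, motion)
```

This is exactly why joint propagation wins in Figure 5: its synthesized (frame, label) pairs are internally consistent by construction.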

4.1.3 Video Prediction versus Video Reconstruction

  • Results confirm the expectation that video reconstruction works better than video prediction.
  • The best results are obtained at propagation length = 1.
  • Video reconstruction with joint propagation achieves an improvement of 1.08% mIoU over the baseline (79.46% → 80.54%), as reported in Table 2 of [1].

4.1.4 Effectiveness of Boundary Label Relaxation

  • Boundary label relaxation makes it possible to train a better model on samples propagated over longer lengths. For video reconstruction, the best performance is achieved at propagation length = 3, with an increase in mIoU of 0.81% (80.54% → 81.35%).
Figure 6: Boundary label relaxation leads to higher mIoU at all propagation lengths. The black dashed line represents the baseline (79.46%). (from [1])
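The core of boundary label relaxation is to penalize a boundary pixel with the negative log of the *summed* probability over all classes observed in its neighborhood, rather than the log-probability of one hard label. A minimal sketch of that loss (helper name and toy probabilities are illustrative):

```python
import numpy as np

def relaxed_boundary_loss(probs, border_classes):
    """-log of the summed probability over neighborhood classes.

    probs: 1-D softmax output for one pixel.
    border_classes: class ids present in the pixel's local neighborhood;
    a single-element list recovers ordinary cross-entropy.
    """
    return -np.log(probs[list(border_classes)].sum())

probs = np.array([0.5, 0.4, 0.1])
# Interior pixel: ordinary cross-entropy against the single class 0.
interior = relaxed_boundary_loss(probs, [0])
# Boundary pixel between classes 0 and 1: predicting either class is
# acceptable, so the relaxed loss is strictly smaller.
boundary = relaxed_boundary_loss(probs, [0, 1])
```

Because propagated labels are least trustworthy exactly at object boundaries, softening the loss there is what lets longer propagation lengths help rather than hurt.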

4.1.5 Learned Motion Vectors versus Optical Flow

Figure 7: (a) Learned motion vectors from video reconstruction are better than optical flow in terms of occlusion handling. (b) Learned motion vectors are better at all propagation lengths in terms of mIoU.
  • The dragging car and the doubled rider in figure (a) above illustrate the occlusion artifacts that arise when using optical flow. In figure (b), learned motion vectors perform significantly better than FlowNet2 (an optical-flow approach) at all propagation lengths.
Figure 8: Visual Comparisons on Cityscapes from [1]. The proposed technique leads to better segmentation results as compared to baseline, especially for thin and rare classes, like street light and bicycle (row 1), signs (row 2), person and poles (row 3).
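Whether the motion field comes from FlowNet2 or from the learned motion vectors, propagation ultimately means warping frame t toward frame t+1 with a per-pixel sampling offset. Below is a nearest-neighbour toy sketch of such a backward warp (real pipelines use differentiable bilinear sampling, e.g. `torch.nn.functional.grid_sample`):

```python
import numpy as np

def backward_warp(img, flow):
    """Warp an image with a per-pixel motion field.

    img: (H, W) array; flow: (H, W, 2) integer (dy, dx) offsets telling
    each output pixel where to sample in the source frame.
    """
    h, w = img.shape
    out = np.zeros_like(img)
    for y in range(h):
        for x in range(w):
            sy = np.clip(y + flow[y, x, 0], 0, h - 1)
            sx = np.clip(x + flow[y, x, 1], 0, w - 1)
            out[y, x] = img[sy, sx]
    return out

# Object moving right by one pixel: every output pixel samples one
# column to its left (clipped at the border).
img = np.array([[1, 2, 3],
                [4, 5, 6]])
flow = np.zeros((2, 3, 2), dtype=int)
flow[..., 1] = -1
warped = backward_warp(img, flow)   # → [[1, 1, 2], [4, 4, 5]]
```

The occlusion artifacts in Figure 7(a) arise because optical flow gives every output pixel *some* source to sample, even pixels that were covered or uncovered between frames; the learned motion vectors are trained end-to-end for reconstruction and handle those regions more gracefully.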

4.2 CamVid

  • CamVid comprises 701 densely annotated images of size 720 × 960 taken from five video sequences. The dataset is split into 367 training, 101 validation, and 233 test images. The proposed approach increases mIoU by 1.9% over the baseline (79.8% → 81.7%).
Figure 9: Results of the CamVid test set from [1]. Pre-train indicates the source dataset on which the model is trained.


4.3 KITTI

  • The data format is similar to the Cityscapes dataset, but the resolution is 375 × 1242. The dataset comprises 200 training and 200 test images. Since the dataset is very small, fine-tuning is done with 10-split cross-validation on the 200 training images.
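The 10-split cross-validation setup can be sketched with a generic k-fold splitter (equal fold sizes and the seed are assumptions here, not the exact protocol from [1]):

```python
import numpy as np

def kfold_splits(n_items, k, seed=0):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    folds = np.array_split(order, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

# The KITTI setup: 200 training images, 10 folds, so each fold fine-tunes
# on 180 images and validates on the held-out 20.
splits = list(kfold_splits(200, 10))
```

With only 200 labeled images, a single fixed validation split would waste scarce data and give a noisy model-selection signal; rotating the held-out fold uses every image for both training and validation.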
Figure 10: Results on KITTI test set from [1]
Figure 11: Visual comparison on KITTI between results from [1] and the winning entry of the ROB challenge 2018. Boxes indicate better predictions on whole semantic objects (bus) and thin objects (poles and person), and better separation of confusing classes (sidewalk vs. road, building vs. sky).

For further implementation details and a detailed explanation of the results, refer to Section 4 (Experiments) and the Appendices of [1].