4. Experiments and Results
The experiments are conducted on three standard benchmark datasets — Cityscapes, CamVid, and KITTI. The proposed approach yields significant mIoU improvements on all three.
4.1 Cityscapes
- The Cityscapes dataset contains 5000 images with high-quality pixel-level annotations. The standard training, validation, and test splits contain 2975, 500, and 1525 images, respectively.
4.1.1 Mapillary pre-training and class-uniform sampling
- The model is pre-trained on the Mapillary Vistas dataset instead of using the ImageNet pre-trained weights. This dataset was chosen because it contains data similar to Cityscapes but has a higher number of training images (18K) and classes (65).
- Class-uniform sampling is a strategy that ensures all classes, including rare ones, are sampled uniformly during training (see the sketch after this list).
- Mapillary pre-training improves mIoU by 1.72% over the baseline (76.60% → 78.32%). Class-uniform sampling brings an additional 1.14% mIoU improvement (78.32% → 79.46%).
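A minimal sketch of what class-uniform sampling can look like in Python; the `annotations` mapping and the `samples_per_class` parameter are illustrative assumptions, not the paper's implementation:

```python
import random
from collections import defaultdict

def build_class_index(annotations):
    """Map each class id to the list of image ids that contain it.
    `annotations` is assumed to map image_id -> set of class ids present."""
    class_to_images = defaultdict(list)
    for image_id, class_ids in annotations.items():
        for c in class_ids:
            class_to_images[c].append(image_id)
    return class_to_images

def sample_epoch(class_to_images, samples_per_class=100):
    """Draw the same number of training samples for every class,
    so rare classes are seen as often as frequent ones."""
    epoch = []
    for image_ids in class_to_images.values():
        # Sample with replacement: rare classes may appear in few images.
        epoch.extend(random.choices(image_ids, k=samples_per_class))
    random.shuffle(epoch)
    return epoch
```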
4.1.2 Label Propagation versus Joint Propagation
- The results demonstrate that joint propagation works better than label propagation: joint propagation improves mIoU by 0.80% over the baseline (79.46% → 80.26%), whereas label propagation improves it by only 0.33% (79.46% → 79.79%). The sketch below illustrates the distinction.
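To make the difference concrete, here is a minimal PyTorch sketch, assuming a backward-warping `warp` helper, a precomputed motion field, and labels stored as one-hot float maps; none of these names come from the paper's code:

```python
import torch
import torch.nn.functional as F

def warp(x, flow):
    """Backward-warp a tensor x of shape (N, C, H, W) along a motion
    field `flow` of shape (N, 2, H, W) (channel 0 = dx, channel 1 = dy)."""
    _, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(x.device)  # (2, H, W)
    coords = grid.unsqueeze(0) + flow                         # follow the motion
    # Normalize pixel coordinates to [-1, 1] as grid_sample expects.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(x, torch.stack((gx, gy), dim=-1), align_corners=True)

# Label propagation: warp only the label and pair it with the *real*
# future frame, so any motion error misaligns image and label.
def label_propagation(frame_next, label_t, flow):
    return frame_next, warp(label_t, flow)

# Joint propagation: warp image and label together, so the synthesized
# training pair stays aligned even when the motion estimate is imperfect.
def joint_propagation(frame_t, label_t, flow):
    return warp(frame_t, flow), warp(label_t, flow)
```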
4.1.3 Video Prediction versus Video Reconstruction
- Results indicate that video reconstruction works better than video prediction, as expected.
- Best results are obtained for propagation length = 1.
- Using video reconstruction with joint propagation achieves an improvement of 1.08% mIoU over the baseline (79.46% → 80.54%), as per Table 2 of the original paper. A conceptual sketch of the two variants follows this list.
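The sketch below contrasts the two variants with toy one-layer motion heads standing in for the paper's actual architecture: video prediction estimates motion from past frames only, whereas video reconstruction also conditions on the target frame, which is why its motion vectors can reconstruct the next frame (and its labels) more faithfully.

```python
import torch
import torch.nn as nn

class PredictionMotionNet(nn.Module):
    """Video prediction: estimates a motion field from past frames only."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(6, 2, kernel_size=3, padding=1)  # toy head

    def forward(self, frame_prev, frame_t):
        return self.net(torch.cat([frame_prev, frame_t], dim=1))

class ReconstructionMotionNet(nn.Module):
    """Video reconstruction: also sees the target frame t+1, so the
    motion field can be fitted to reconstruct it exactly."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(9, 2, kernel_size=3, padding=1)  # toy head

    def forward(self, frame_prev, frame_t, frame_next):
        return self.net(torch.cat([frame_prev, frame_t, frame_next], dim=1))
```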
4.1.4 Effectiveness of Boundary Label Relaxation
- Boundary label relaxation makes longer-propagated samples usable for training a better model. For video reconstruction, the best performance is achieved at propagation length = 3, with an mIoU increase of 0.81% (80.54% → 81.35%). A sketch of the relaxed loss term follows.
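A minimal sketch of the relaxed boundary term: at boundary pixels, the loss maximizes the probability of the *union* of classes found in a small neighbourhood instead of one hard label. The `border_mask` input and the `kernel` size are assumptions for illustration, and in the paper this term is combined with standard cross-entropy away from boundaries:

```python
import torch
import torch.nn.functional as F

def relaxed_boundary_loss(logits, one_hot_labels, border_mask, kernel=3):
    """Boundary label relaxation: -log P(union of classes near the pixel).
    logits: (N, C, H, W); one_hot_labels: (N, C, H, W); border_mask: (N, H, W)."""
    probs = F.softmax(logits, dim=1)
    # Max-pooling the one-hot labels marks a class as "present" at a pixel
    # if it occurs anywhere in the kernel x kernel neighbourhood.
    union = F.max_pool2d(one_hot_labels.float(), kernel, stride=1,
                         padding=kernel // 2)
    # -log P(union) = -log sum of P(class) over the neighbouring classes.
    p_union = (probs * union).sum(dim=1).clamp_min(1e-8)
    loss = -torch.log(p_union)
    return loss[border_mask.bool()].mean()
```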
4.1.5 Learned Motion Vectors versus Optical Flow
- The dragging car and the doubled rider in figure (a) illustrate the occlusion artifacts that arise when propagating with optical flow. Figure (b) shows that the learned motion vectors significantly outperform FlowNet2 (the optical-flow approach) at all propagation lengths.
4.2 CamVid
- CamVid comprises 701 densely annotated images of size 720 × 960 taken from five video sequences. The dataset is split into 367 training, 101 validation, and 233 test images. mIoU is increased by 1.9% over the baseline (79.8% → 81.7%).
4.3 KITTI
- The data format is similar to Cityscapes, but the resolution is 375 × 1242. The dataset comprises 200 training and 200 test images. Since the dataset is very small, 10-split cross-validation fine-tuning is performed on the 200 training images, as sketched below.
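A minimal sketch of how such a 10-fold split could be generated; the split logic and the seed are assumptions, not taken from the paper:

```python
import numpy as np

def ten_fold_splits(num_images=200, folds=10, seed=0):
    """Yield (train_ids, val_ids) index pairs for each of the 10 folds."""
    rng = np.random.default_rng(seed)
    ids = rng.permutation(num_images)
    fold_size = num_images // folds  # 20 validation images per fold
    for k in range(folds):
        val = ids[k * fold_size:(k + 1) * fold_size]
        train = np.setdiff1d(ids, val)
        yield train, val

# Fine-tune one model per fold and, e.g., pick the best checkpoint
# by validation mIoU before predicting on the 200 test images.
```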
For further implementation details and a detailed explanation of the results, refer to Section 4: Experiments and the Appendices of the original paper.