Rethinking ImageNet Pre-training

Source: Deep Learning on Medium

A recently published paper, Rethinking ImageNet Pre-training by Kaiming He, Ross Girshick, and Piotr Dollár, compares standard models trained from random initialization against ImageNet pre-trained models on object detection and instance segmentation, and reports some interesting results. This post is a concise summary of the paper and highlights findings that should prompt readers to rethink the ImageNet-style pre-training paradigm in computer vision. Readers interested in the details of the experiments should read the original paper.

The paper reports that competitive object detection and instance segmentation accuracy is achievable when training on COCO from random initialization (‘from scratch’), without any pre-training. These results are obtained with baseline systems and hyper-parameters that were originally optimized for fine-tuning pre-trained models. There is no fundamental obstacle to training from scratch if: (i) appropriate normalization techniques are used for optimization, and (ii) the models are trained long enough to compensate for the lack of pre-training (Figure below).

The model trained from random initialization needs more iterations to converge but converges to a solution that is no worse than the fine-tuning counterpart.
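
The "train longer" condition can be made concrete with a small learning-rate schedule sketch. It assumes a step schedule whose learning-rate drops are anchored to the end of training, so that a longer run simply spends more iterations at the base rate. The function names and every number here (base rate 0.02, drops in the last 60k and 20k iterations, 180k versus 540k total iterations) are illustrative assumptions, not values stated in this summary.

```python
def step_lr(iteration, total_iters, base_lr=0.02,
            drops_before_end=(60_000, 20_000), factor=0.1):
    """Learning rate at `iteration` for a step schedule anchored to the end.

    Each entry in `drops_before_end` is a distance from the final iteration
    at which the rate is multiplied by `factor`. All defaults are
    illustrative assumptions.
    """
    if not 0 <= iteration < total_iters:
        raise ValueError("iteration must be in [0, total_iters)")
    lr = base_lr
    for offset in drops_before_end:
        if iteration >= total_iters - offset:
            lr *= factor
    return lr


def iters_at_base_lr(total_iters, drops_before_end=(60_000, 20_000)):
    """Number of iterations run before the first learning-rate drop."""
    return max(total_iters - max(drops_before_end), 0)


# Hypothetical schedule lengths: a shorter fine-tuning run and a
# from-scratch run that is three times longer.
FINE_TUNE_ITERS = 180_000
FROM_SCRATCH_ITERS = 540_000

for name, total in (("fine-tuning", FINE_TUNE_ITERS),
                    ("from scratch", FROM_SCRATCH_ITERS)):
    print(f"{name}: {iters_at_base_lr(total):,} of {total:,} "
          f"iterations at the base rate")
```

Under these assumed numbers the from-scratch run spends 480,000 iterations at the base rate versus 120,000 for fine-tuning. The extra time at the high learning rate is what stands in for the missing pre-training.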

Main observations

  • Training from scratch on target tasks is possible without architectural changes.
  • Training from scratch requires more iterations to converge sufficiently.
  • Training from scratch can be no worse than its ImageNet pre-training counterparts under many circumstances, down to as few as 10k COCO images.
  • ImageNet pre-training speeds up convergence on the target task.
  • ImageNet pre-training does not necessarily help reduce overfitting unless we enter a very small data regime.
  • ImageNet pre-training helps less if the target task is more sensitive to localization than classification.


  • Is ImageNet pre-training necessary? No, provided that enough target data and computation are available. The experiments show that ImageNet pre-training can help speed up convergence, but does not necessarily improve accuracy unless the target dataset is too small (e.g., <10k COCO images). Training directly on the target data can be sufficient if the dataset is large enough. Looking forward, this suggests that collecting annotations of target data (instead of pre-training data) can be more useful for improving target task performance.
  • Is ImageNet helpful? Yes. ImageNet pre-training has been a critical auxiliary task for the computer vision community's progress. It enabled people to see significant improvements before larger-scale data was available (e.g., in VOC for a long while). It also largely helped to circumvent optimization problems on the target data (e.g., when suitable normalization/initialization methods were lacking). Moreover, ImageNet pre-training shortens research cycles and gives easier access to encouraging results: pre-trained models are widely and freely available today, the pre-training cost does not need to be paid repeatedly, and fine-tuning from pre-trained weights converges faster than training from scratch. We believe that these advantages will still make ImageNet undoubtedly helpful for computer vision research.
  • Do we need big data? Yes. But a generic large-scale, classification-level pre-training set is not ideal if we take into account the extra effort of collecting and cleaning data. The cost of collecting ImageNet has been largely ignored, but the ‘pre-training’ step in the ‘pre-training + fine-tuning’ paradigm is in fact not free when we scale out this paradigm. If the gain from large-scale classification-level pre-training diminishes exponentially, it would be more effective to collect data in the target domain.
  • Shall we pursue universal representations? Yes. The results do not mean deviating from this goal. In fact, the study suggests that the community should be more careful when evaluating pre-trained features, as we now know that even random initialization can produce excellent results.
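
The points above that pre-training is "not free", yet does not need to be paid repeatedly, can be illustrated with some rough bookkeeping. The sketch below counts training images seen under each regime and amortizes the pre-training cost over the number of projects that reuse the same weights. The `Stage` class, the helper function, and all dataset sizes and epoch counts are round, hypothetical figures chosen for illustration. They are not results from this summary, and image counts ignore differences in image resolution, so this is only a crude proxy for compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training stage, measured in images seen (a crude cost proxy)."""
    name: str
    dataset_size: int
    epochs: float

    @property
    def images_seen(self) -> int:
        return round(self.dataset_size * self.epochs)


def amortized_images(pretrain: Stage, target: Stage, reuses: int) -> float:
    """Per-project cost when one pre-training run is shared by `reuses` projects."""
    if reuses < 1:
        raise ValueError("reuses must be at least 1")
    return pretrain.images_seen / reuses + target.images_seen


# All figures below are illustrative assumptions.
IMAGENET_PRETRAIN = Stage("ImageNet pre-training", 1_280_000, 100)
COCO_FINE_TUNE = Stage("COCO fine-tuning", 118_000, 24)
COCO_FROM_SCRATCH = Stage("COCO from scratch", 118_000, 72)

for reuses in (1, 10, 100):
    shared = amortized_images(IMAGENET_PRETRAIN, COCO_FINE_TUNE, reuses)
    print(f"pre-train shared by {reuses:>3} projects: {shared:,.0f} images each; "
          f"from scratch: {COCO_FROM_SCRATCH.images_seen:,} images")
```

With these assumed numbers, a single project that pays for its own pre-training sees far more images than one trained from scratch, while heavy reuse of the same pre-trained weights reverses the comparison. Both readings are consistent with the answers above: pre-training is helpful when its cost is shared, but it is not a free step.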