YOLOv4: Speed & Accuracy

Original article was published on Deep Learning on Medium

YOLO (You Only Look Once), but sharper!


Object detection has matured steadily over the last few years, and ever since R-CNN was released the competition has remained cut-throat. YOLOv4 once again claims state-of-the-art (SOTA) accuracy while maintaining a high processing frame rate. It achieves 43.5% AP (65.7% AP₅₀) on the MS COCO dataset at approximately 65 FPS inference speed on a Tesla V100, as per the graph below. In object detection, high accuracy and precision are only some of the things we want. We also want the model to run smoothly on edge devices such as the Raspberry Pi, Jetson Nano, and Intel boards. Processing real-time streaming video on this low-power, low-cost hardware is both important and challenging, and the need for it is growing in robotics, business, and much more. (Code is shared at the end, with a video walk-through.)

YOLOv4 is twice as fast as EfficientDet with comparable performance.

The YOLOv4 release lists three authors: Alexey Bochkovskiy, the Russian developer who built the Windows version of YOLO, Chien-Yao Wang, and Hong-Yuan Mark Liao. (Unfortunately, the creator of YOLO, Joseph Redmon, announced that he was no longer pursuing computer vision research because of the negative impact of his work.)


As per the authors

Compared with the previous YOLOv3, YOLOv4 has the following advantages:

It is an efficient and powerful object detection model that enables anyone with a 1080 Ti or 2080 Ti GPU to train a super fast and accurate object detector.

The influence of state-of-the-art “Bag-of-Freebies” and “Bag-of-Specials” object detection methods during detector training has been verified.

The modified state-of-the-art methods, including CBN (Cross-iteration batch normalization), PAN (Path aggregation network), etc., are now more efficient and suitable for single GPU training.

Pluggable Architecture

Bag of Freebies (BoF) & Bag of Specials (BoS)

Improvements can be made in the training process (such as data augmentation, handling class imbalance, the cost function, soft labeling, etc.) to improve accuracy. These improvements have no impact on inference speed and are called the “bag of freebies”. Then there is the “bag of specials”, which increases inference time slightly in exchange for a good return in performance. These improvements include enlarging the receptive field, using attention, feature integration such as skip connections and FPN, and post-processing such as non-maximum suppression. In this article, we will discuss how the feature extractor and the neck are designed, as well as all these BoF and BoS goodies.
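
To make the "freebie" idea concrete, here is a minimal sketch of soft labeling in the form of class label smoothing. It only changes the training targets, so inference cost is unaffected. The function name `smooth_labels` and the default `epsilon=0.1` are illustrative assumptions, not values taken from the YOLOv4 code.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Return a smoothed copy of a one-hot label vector.

    Each target becomes y * (1 - epsilon) + epsilon / K, where K is the
    number of classes. epsilon=0.1 is an assumed, commonly used value.
    """
    num_classes = len(one_hot)
    return [y * (1.0 - epsilon) + epsilon / num_classes for y in one_hot]


if __name__ == "__main__":
    # The true class keeps most of the probability mass; the remainder is
    # spread evenly, so the targets still sum to 1.
    print(smooth_labels([0.0, 1.0, 0.0, 0.0]))
```

The smoothed targets discourage the network from becoming over-confident in a single class, which is the motivation usually given for this technique.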

Methodology for fast operating speed in production systems and optimization for parallel computing:

  • For GPU: a small number of groups (1–8) in convolutional layers: CSPResNeXt50 / CSPDarknet53
  • For VPU: grouped convolution, but without Squeeze-and-Excitation (SE) blocks. Specifically, this includes the following models: EfficientNet-lite / MixNet / GhostNet / MobileNetV3

Selection of BoF and BoS in General

To improve object detection training, a typical CNN usually uses the following:

  • Activations: ReLU, leaky-ReLU, parametric-ReLU, ReLU6, SELU, Swish, or Mish
  • Bounding box regression loss: MSE, IoU, GIoU, CIoU, DIoU
  • Data augmentation: CutOut, MixUp, CutMix
  • Regularization method: DropOut, DropPath, Spatial DropOut, or DropBlock
  • Normalization of the network activations by their mean and variance: Batch Normalization (BN), Cross-GPU Batch Normalization (CGBN or SyncBN), Filter Response Normalization (FRN), or Cross-Iteration Batch Normalization (CBN)
  • Skip-connections: Residual connections, Weighted residual connections, Multi-input weighted residual connections, or Cross stage partial connections (CSP)
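
Of the activations listed above, Mish is the one YOLOv4 ends up selecting. As a small sketch, it can be written as x · tanh(softplus(x)). The leaky-ReLU slope of 0.1 used for comparison is an assumption chosen for illustration.

```python
import math


def softplus(x):
    # Numerically stable ln(1 + e^x): avoids overflow for large positive x.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    # Mish activation: x * tanh(softplus(x)). Smooth and non-monotonic.
    return x * math.tanh(softplus(x))


def leaky_relu(x, slope=0.1):
    # Leaky ReLU for comparison; the slope value is an assumed default.
    return x if x > 0 else slope * x


if __name__ == "__main__":
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"x={x:+.1f}  mish={mish(x):+.4f}  leaky_relu={leaky_relu(x):+.4f}")
```

Unlike leaky ReLU, Mish has no kink at zero and lets small negative values through in a smooth way.
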
Mosaic Data Augmentation
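
Mosaic, as described in the YOLOv4 paper, mixes four training images into a single one, so each training sample shows objects in four different contexts. The sketch below is a deliberately simplified version: it assumes four equally sized images stored as nested lists and tiles them in a fixed 2×2 grid, whereas real implementations also apply random scaling and cropping. The function name `mosaic` and the box format `(x1, y1, x2, y2)` are assumptions for illustration.

```python
def mosaic(images, boxes_per_image):
    """Tile four H x W images into one 2H x 2W image and shift their boxes.

    images: list of four images, each a list of H rows with W pixel values.
    boxes_per_image: list of four lists of (x1, y1, x2, y2) pixel boxes.
    """
    assert len(images) == 4 and len(boxes_per_image) == 4
    h, w = len(images[0]), len(images[0][0])
    # (dx, dy) offsets for top-left, top-right, bottom-left, bottom-right.
    offsets = [(0, 0), (w, 0), (0, h), (w, h)]
    canvas = [[0] * (2 * w) for _ in range(2 * h)]
    merged_boxes = []
    for img, boxes, (dx, dy) in zip(images, boxes_per_image, offsets):
        for y in range(h):
            canvas[dy + y][dx:dx + w] = img[y]  # copy one row into place
        for x1, y1, x2, y2 in boxes:
            # Boxes move with their image, so add the same offset.
            merged_boxes.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy))
    return canvas, merged_boxes
```

The key point is that the ground-truth boxes are translated together with their source image, so the combined sample stays correctly labeled.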

Details of YOLOv4

  • Backbone: CSPDarknet53 (CSP + Darknet53)
spatial pyramid pooling layer
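
As a rough sketch of what an SPP block does in YOLO-style detectors: the same feature map is max-pooled with several kernel sizes at stride 1, with padding that preserves the spatial size, and the results are concatenated with the input along the channel axis. The kernel sizes (5, 9, 13) are treated here as an assumption based on the commonly used configuration, and the pure-Python representation (a list of 2D channels) is for illustration only.

```python
def max_pool_same(fmap, k):
    """k x k max pooling with stride 1 that keeps the H x W size.

    Windows are clipped at the borders, which is equivalent to padding
    with negative infinity.
    """
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [fmap[yy][xx]
                      for yy in range(max(0, y - r), min(h, y + r + 1))
                      for xx in range(max(0, x - r), min(w, x + r + 1))]
            row.append(max(window))
        out.append(row)
    return out


def spp_block(channels, kernel_sizes=(5, 9, 13)):
    """Concatenate the input channels with their pooled versions."""
    out = list(channels)
    for k in kernel_sizes:
        out.extend(max_pool_same(c, k) for c in channels)
    return out
```

With C input channels and three kernel sizes, the output has 4C channels at the same resolution, which enlarges the receptive field at little extra cost.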

YOLOv4 uses:

  • Bag of Freebies (BoF) for backbone: CutMix and Mosaic data augmentation, DropBlock regularization, Class label smoothing
  • Bag of Specials (BoS) for backbone: Mish activation, Cross-stage partial connections (CSP), Multi-input weighted residual connections (MiWRC)
  • Bag of Freebies (BoF) for detector: CIoU-loss, CmBN, DropBlock regularization, Mosaic data augmentation, Self-Adversarial Training, Eliminate grid sensitivity, Using multiple anchors for a single ground truth, Cosine annealing scheduler, Optimal hyper-parameters, Random training shapes
  • Bag of Specials (BoS) for detector: Mish activation, SPP-block, SAM-block, PAN path-aggregation block, DIoU-NMS
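
To illustrate one item from the list above, here is a minimal sketch of the CIoU loss for a single pair of boxes. It follows the published CIoU formula: 1 − IoU, plus a normalized center-distance term, plus an aspect-ratio consistency term. The box format `(x1, y1, x2, y2)` and the function names are assumptions, and the boxes are assumed to have positive width and height.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ciou_loss(pred, gt):
    """CIoU loss = 1 - IoU + rho^2 / c^2 + alpha * v."""
    i = iou(pred, gt)
    # Squared distance between the two box centers (rho^2).
    rho2 = (((pred[0] + pred[2]) - (gt[0] + gt[2])) / 2) ** 2 \
         + (((pred[1] + pred[3]) - (gt[1] + gt[3])) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both (c^2).
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2
    # Aspect-ratio consistency term v and its trade-off weight alpha.
    pw, ph = pred[2] - pred[0], pred[3] - pred[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    denom = (1 - i) + v
    alpha = v / denom if denom > 0 else 0.0
    return 1 - i + rho2 / c2 + alpha * v
```

Dropping the `alpha * v` term gives the DIoU form of the loss. Unlike a plain IoU loss, both variants still differ between near and far misses when the boxes do not overlap at all.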

If you are curious to dive deeper into each of the techniques above, please read through https://medium.com/@jonathan_hui/yolov4-c9901eaa8e61 (you will love it).

Comparison of YOLOv4 on Different NVIDIA GPU Architectures (Maxwell, Pascal, Volta)

Final Thoughts from Authors:

A state-of-the-art detector which is faster (FPS) and more accurate (MS COCO AP50…95 and AP50) than all available alternative detectors. The detector described can be trained and used on a conventional GPU with 8–16 GB of VRAM, which makes its broad use possible.

Code and Walk-through (fork the code):

https://www.youtube.com/watch?v=mKAEGSxwOAY (credit to the video's author for the code and video)

Keep Learning!


  1. YOLOv4 paper: https://arxiv.org/abs/2004.10934
  2. YOLOv4 explainer by Jonathan Hui: https://medium.com/@jonathan_hui/yolov4-c9901eaa8e61
  3. Official and maintained code: https://github.com/AlexeyAB/darknet