Instance-level Recognition

Original article was published by Kb Pachauri on Deep Learning on Medium

ResNet variants (152 & 101) both rely on identity shortcut connections that skip one or more layers to tackle the vanishing gradient problem. SeResNeXt is a variant of ResNeXt, an Inception-style network with shortcut connections, where Se refers to the Squeeze-and-Excitation module added to ResNeXt. The SE block models channel interdependencies by adaptively recalibrating the weights of feature maps. EfficientNet is a state-of-the-art image classification network that relies on AutoML to find the best base network and on efficient compound scaling to achieve improved results depending on the available compute resources.

Generalized Mean Pooling (GeM) in the neck network computes the generalized mean of each channel in a tensor. As p𝑘 → ∞, GeM behaves as max pooling; at p𝑘 = 1, it behaves as average pooling. As p𝑘 increases, the contrast of the pooled feature map increases and it focuses on the salient features of the image.

Generalized Mean Pooling Equation. Image from
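As a minimal NumPy sketch (not the competition code), GeM raises each activation to the power p, averages spatially, and takes the p-th root; the function below assumes channel-first feature maps of shape (C, H, W) and clips activations to stay positive before the power:

```python
import numpy as np

def gem(x, p=3.0, eps=1e-6):
    # x: feature maps of shape (C, H, W); pool each channel to one scalar
    x = np.clip(x, eps, None)                 # keep values positive for the power
    return np.mean(x ** p, axis=(1, 2)) ** (1.0 / p)
```

With p = 1 this reduces to average pooling, and for large p it approaches the channel maximum, matching the limits described above.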

PReLU: PReLU is a generalization of leaky ReLU that addresses the dying-neuron problem, which occurs when the data is not normalized or the network weights are not properly initialized.

ReLU (Left) and PReLU (Right). For PReLU, the coefficient of the negative part is adaptively learned. Image from
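For illustration, PReLU can be sketched in NumPy as below; in a real network the negative-part slope alpha is a learned (often per-channel) parameter, fixed here for simplicity:

```python
import numpy as np

def prelu(x, alpha=0.25):
    # Identity for positive inputs, learned slope alpha for negative ones
    return np.where(x > 0, x, alpha * x)
```

Unlike ReLU, negative inputs still produce a nonzero gradient, so neurons cannot permanently "die".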

Arc margin improves on the softmax loss by enforcing higher intra-class similarity and inter-class diversity, distributing the learned embeddings on a hypersphere of radius s. Below is the pseudo-code of the ArcFace loss in MXNet.

Image from
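As a rough NumPy sketch of the idea (names and shapes are illustrative, not the original MXNet pseudo-code): embeddings and class weights are L2-normalized so their dot products are cosines, an angular margin m is added to the ground-truth class angle, and everything is scaled by s before the usual softmax cross-entropy:

```python
import numpy as np

def arcface_logits(embeddings, weights, labels, s=64.0, m=0.5):
    # Normalize so dot products equal cos(theta) between embedding and class center
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = e @ w.T                                  # (batch, num_classes)
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    onehot = np.eye(w.shape[0])[labels]
    # Add the angular margin m only to the ground-truth class, then scale by s
    return s * (onehot * np.cos(theta + m) + (1 - onehot) * cos)
```

The margin shrinks the ground-truth logit, forcing the network to pull same-class embeddings closer together on the hypersphere than plain softmax would.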


  • Models are trained at different image scales [448×448, 568×568, 600×600, 512×512] using albumentations.
  • Each model is trained for 10 epochs with a cosine annealing scheduler.
  • The test set of the 2019 competition which was released with labels is used as validation.
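The cosine annealing schedule mentioned above can be sketched as a simple per-epoch function (lr_max and lr_min are illustrative values, not those from the solution):

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_max=1e-3, lr_min=1e-6):
    # Smoothly decay the learning rate from lr_max to lr_min over training
    cos_t = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cos_t)
```

The rate starts at lr_max, falls slowly at first, fastest mid-training, and settles at lr_min by the final epoch.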


As a post-processing step, re-ranking was done to penalize non-landmark images and improve the GAP metric.

  • Test: Leaderboard test set.
  • Train: Candidate images to determine labels and confidence.
  • Non-Landmark: Images with no landmark from the GLDv2 test set.

Re-ranking steps:

  1. Calculate the cosine similarity between each test and train image (A).
  2. Calculate the average (top-5 or top-10) cosine similarity between each train image and the non-landmark images (B).
  3. Calculate Aᵢⱼ − Bⱼ.
  4. Sum the confidences of the same label and pick the highest.
Post-Processing Re-ranking. Image from
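The re-ranking steps above can be sketched in NumPy as follows (names are illustrative; embeddings are assumed L2-normalized so dot products are cosine similarities):

```python
import numpy as np

def rerank(test_emb, train_emb, nonland_emb, train_labels, topk=5):
    A = test_emb @ train_emb.T                    # (n_test, n_train) similarities
    sims = train_emb @ nonland_emb.T              # train vs. non-landmark
    topk = min(topk, sims.shape[1])
    # Average similarity to the top-k most similar non-landmark images
    B = np.sort(sims, axis=1)[:, -topk:].mean(axis=1)
    conf = A - B[None, :]                         # penalized confidence A_ij - B_j
    preds = []
    for row in conf:
        # Sum confidence per candidate label and pick the best-scoring label
        scores = {}
        for c, lbl in zip(row, train_labels):
            scores[lbl] = scores.get(lbl, 0.0) + c
        preds.append(max(scores, key=scores.get))
    return preds
```

Train images that resemble non-landmark distractors get a large penalty B, so labels supported only by such images are demoted.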

The most important aspect of the above solution is the use of the 2019 competition test set as validation for the re-ranking post-processing after inference, which led to a top leaderboard score of 0.6598, ~1.75x better than the 2019 result.

CVPR 2020 AliProducts Challenge


Backbone networks (EfficientNet-B3, EfficientNet-B4, ResNet50, SeResNext50, SeResNext101) are fine-tuned with the Destruction and Construction Learning (DCL) and Look-into-Object (LIO) methods. Model averaging is used to ensemble all the fine-tuned models, achieving a top-1 error rate of 6.27%.

DCL, as shown in the image below, enhances fine-grained recognition by learning local discriminative regions and features through shuffling of local regions. To prevent the network from learning the noisy patterns this introduces, an adversarial counterpart is proposed to reject Region Confusion Mechanism (RCM)-induced patterns that are not relevant. For more details, kindly check the paper.

DCL Network, Image from
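As a rough illustration of the region shuffling at the heart of RCM, the sketch below splits an image into a k×k grid of patches and permutes them; note the actual RCM restricts swaps to a local neighborhood, which this simplified version does not:

```python
import numpy as np

def region_confusion(img, k=7, rng=np.random.default_rng(0)):
    # Split an (H, W, C) image into a k x k grid and reassemble in shuffled order
    h, w = img.shape[0] // k, img.shape[1] // k
    patches = [img[i*h:(i+1)*h, j*w:(j+1)*w] for i in range(k) for j in range(k)]
    order = rng.permutation(len(patches))
    rows = [np.concatenate([patches[order[i*k + j]] for j in range(k)], axis=1)
            for i in range(k)]
    return np.concatenate(rows, axis=0)
```

The shuffled image destroys global structure while preserving local patches, pushing the network to find discriminative local regions.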

LIO, as shown in the image below, models object structure using self-supervised learning. Object-extent learning helps the backbone network distinguish between foreground and background, while spatial context learning via self-supervision strengthens the structural information in the backbone network. For more details, kindly check the paper.

Look-into-object (LIO) framework. Image from


All images are resized to 256×256, then randomly cropped to 224×224 for training and center-cropped to 224×224 for testing. Train data is augmented using
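The two crop operations can be sketched with NumPy slicing (resizing itself is omitted; sizes are those stated above):

```python
import numpy as np

def random_crop(img, size=224, rng=np.random.default_rng(0)):
    # Take a size x size window at a random valid position (training)
    top = rng.integers(0, img.shape[0] - size + 1)
    left = rng.integers(0, img.shape[1] - size + 1)
    return img[top:top+size, left:left+size]

def center_crop(img, size=224):
    # Take the centered size x size window (testing)
    top = (img.shape[0] - size) // 2
    left = (img.shape[1] - size) // 2
    return img[top:top+size, left:left+size]
```

Random crops at train time act as augmentation, while the deterministic center crop keeps evaluation reproducible.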


All models are trained with an SGD optimizer with manual learning rate decay.

  1. All backbone networks with basic training achieve a top-1 error rate of 20–25%.
  2. All backbone networks are fine-tuned with balanced training, achieving a top-1 error rate of 9–12%. The balanced training set also includes all the validation images for categories with fewer than 30 images.
  3. All backbones are further fine-tuned using DCL on higher-resolution images (448×448), reducing the error rate by a further 1–2%.
  4. All networks are further fine-tuned using the accuracy loss shown in the image below, which directly optimizes for the top-1 error rate, reducing the error rate by ~0.2–0.5%.
Accuracy Loss, Image From
def acc_loss(y_true, y_pred):
    tp = (y_pred * y_true).sum(1)        # soft true positives per sample
    fp = ((1 - y_true) * y_pred).sum(1)  # soft false positives per sample
    acc = tp / (tp + fp)                 # per-sample soft precision
    return 1 - acc.mean()
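On a toy batch with one-hot labels and probability predictions, the quantities in the loss work out as follows (a self-contained NumPy check, not part of the original solution):

```python
import numpy as np

y_true = np.array([[1.0, 0.0], [0.0, 1.0]])  # one-hot labels
y_pred = np.array([[0.9, 0.1], [0.2, 0.8]])  # predicted probabilities

tp = (y_pred * y_true).sum(1)        # soft true positives: [0.9, 0.8]
fp = ((1 - y_true) * y_pred).sum(1)  # soft false positives: [0.1, 0.2]
loss = 1 - (tp / (tp + fp)).mean()   # 1 - mean([0.9, 0.8]) = 0.15
```

Because the loss is built from differentiable soft counts rather than hard argmax decisions, it can be minimized directly with gradient descent.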

The 11 models below are used to calculate the final probabilities:

  • Balanced fine-tuned resnet50, seresnext50, seresnext101, efficientnet-b3, efficientnet-b4
  • DCL fine-tuned resnet50, seresnext50
  • Accuracy loss fine-tuned resnet50, seresnext50, efficientnet-b3
  • LIO fine-tuned resnet50


Instance-level recognition will unlock the true potential of deep learning for semantic image classification and retrieval in eCommerce, travel, media & entertainment, agriculture, etc. Some of the major building blocks of an efficient instance-level solution are:

  • Backbone Network Selection (Residual, Squeeze & Excitation, EfficientNet)
  • Data Augmentation (Albumentation, AutoAugment, Cutout, etc).
  • Loss function (ArcFace, AccuracyLoss).
  • Multi-scale processing.
  • Fine-tuning and post-processing.

Thanks for reading the article, I hope you found this to be helpful. If you did, please share it on your favorite social media so other folks can find it, too. Also, please let me know in the comment section if something is not clear or incorrect.