Neural Networks Intuitions: 6. EAST


EAST — An Efficient and Accurate Scene Text Detector:

a. Architecture: Every single-shot object detector involves 3 major stages:

  1. Feature extraction stage.
  2. Feature fusion stage.
  3. Prediction network.

All single-shot detector variants differ in one of the above three stages, and EAST follows the same paradigm.

b. Input-Output:

  1. The network takes in an input image, which is passed through a set of conv layers (the feature extractor stem) to get four levels of feature maps — f1, f2, f3, f4.
  2. The feature maps are then unpooled (x2), concatenated (along the channel dimension) and passed through a 1×1 conv followed by a 3×3 conv. The reason for merging features from different spatial resolutions is to be able to predict smaller word regions (see the sketch after this list).
Feature merging

3. The final feature volume is then used to make the score and box predictions: a 1×1 conv of depth 1 generates the score map, another 1×1 conv of depth 5 generates RBOX (rotated boxes: four box offsets and a rotation angle), and another 1×1 conv of depth 8 generates QUAD (a quadrangle with 8 offsets).
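
A minimal PyTorch sketch of the merging branch and prediction heads described above. The channel counts, module names and the choice of nearest-neighbour upsampling for the unpooling step are assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EASTMergeAndHeads(nn.Module):
    """Sketch of the feature-merging branch and prediction heads.

    `in_channels` lists the channel counts of the four backbone feature maps,
    deepest (lowest resolution) first; the actual numbers depend on the stem
    and are assumptions here.
    """
    def __init__(self, in_channels=(2048, 1024, 512, 256), mid_channels=(128, 64, 32)):
        super().__init__()
        self.reduce = nn.ModuleList()   # 1x1 convs applied after each concatenation
        self.smooth = nn.ModuleList()   # 3x3 convs that follow each 1x1
        prev = in_channels[0]
        for skip_c, mid_c in zip(in_channels[1:], mid_channels):
            self.reduce.append(nn.Conv2d(prev + skip_c, mid_c, kernel_size=1))
            self.smooth.append(nn.Conv2d(mid_c, mid_c, kernel_size=3, padding=1))
            prev = mid_c
        # Prediction heads: score map (1 ch), RBOX (4 distances + 1 angle), QUAD (8 offsets)
        self.score_head = nn.Conv2d(prev, 1, kernel_size=1)
        self.rbox_head = nn.Conv2d(prev, 5, kernel_size=1)
        self.quad_head = nn.Conv2d(prev, 8, kernel_size=1)

    def forward(self, feats):
        # feats = [deepest, ..., shallowest] feature maps from the stem
        x = feats[0]
        for skip, reduce, smooth in zip(feats[1:], self.reduce, self.smooth):
            x = F.interpolate(x, scale_factor=2, mode="nearest")  # unpool (x2)
            x = torch.cat([x, skip], dim=1)                       # concat along channels
            x = F.relu(reduce(x))                                 # 1x1 conv
            x = F.relu(smooth(x))                                 # 3x3 conv
        score = torch.sigmoid(self.score_head(x))
        rbox = self.rbox_head(x)
        quad = self.quad_head(x)
        return score, rbox, quad
```

With a backbone producing feature maps at strides 32, 16, 8 and 4, upsampling three times in this way yields score and geometry maps at 1/4 of the input resolution.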

c. Loss function:

Before jumping into loss functions, let us try to interpret the network output first.

1. The class head output can be interpreted similarly to a traditional detector’s class output — except here there is only one anchor box per feature map cell, hence the output will be of shape HxWx1, where 1 indicates the number of anchor boxes.

2. But in the case of the box head, the output (of shape HxWx4) should be interpreted at a “pixel” level and there is no concept of an anchor box. Every pixel has 4 numbers associated with it — its distances to the nearest box’s top, left, bottom and right boundaries. The important thing to note here is that the final word-level output is later derived from this per-pixel output.
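
As an illustration, here is a small NumPy sketch of how a single pixel’s 4 distances could be decoded back into a box. It ignores the rotation angle (axis-aligned case only), and the output stride value is an assumption:

```python
import numpy as np

def decode_pixel_box(x, y, dists, stride=4):
    """Recover an axis-aligned box from one pixel's 4 predicted distances.

    (x, y) : pixel location on the output feature map
    dists  : (d_top, d_left, d_bottom, d_right) predicted for that pixel
    stride : ratio between input image and output map resolution (assumed 4)
    """
    d_top, d_left, d_bottom, d_right = dists
    cx, cy = x * stride, y * stride        # pixel location in image coordinates
    return np.array([cx - d_left,          # minx
                     cy - d_top,           # miny
                     cx + d_right,         # maxx
                     cy + d_bottom])       # maxy
```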

c1. Classification Loss

We are all well aware of the class imbalance problem present in object detection datasets. The number of samples for the background class is generally very high, and now that we are treating every 1×1 box (basically every pixel) as an output, the number of background samples becomes huge.

In order to tackle this class imbalance problem, EAST uses a modified version of cross entropy called Balanced/Weighted Cross Entropy.

Balanced Cross Entropy (BCE)

In BCE, the fraction of the highly-represented class is multiplied with the under-represented class’s loss term (and similarly for the highly-represented class’s loss term) in order to control the contribution of the highly- and under-represented classes. Note: background class := highly-represented and foreground class := under-represented.
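
A minimal NumPy sketch of this balanced cross entropy on the score map, following the EAST formulation where the weighting factor β is one minus the fraction of positive (text) pixels; the function and argument names are illustrative:

```python
import numpy as np

def balanced_cross_entropy(y_true, y_pred, eps=1e-7):
    """Balanced cross entropy over the HxW score map.

    y_true : ground-truth score map (1 for text pixels, 0 for background)
    y_pred : predicted score map, values in (0, 1)
    """
    # beta = fraction of background pixels; it up-weights the rarer foreground term
    beta = 1.0 - np.mean(y_true)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)      # avoid log(0)
    loss = -(beta * y_true * np.log(y_pred)
             + (1.0 - beta) * (1.0 - y_true) * np.log(1.0 - y_pred))
    return loss.mean()
```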

Check this blog for a detailed explanation of BCE (Neural Networks Intuitions: 1. Balanced Cross Entropy).

c2. IOU Loss: The IoU loss used here is different from the traditional bounding-box loss.

IOU Loss

Every pixel has 4 numbers associated with it (its distances to the nearest box’s top, left, bottom and right boundaries), from which the IoU between the predicted and ground-truth boxes is computed; the negative log of the IoU is then used as the loss, which penalizes the prediction whenever the IoU is less than 1.

It is pretty evident that the width and height of the gt/pred box can be computed by simply summing their x and y offsets, from which the gt and pred box areas are obtained. To find the width and height of the intersected rectangle, take the element-wise minimum of the corresponding gt and pred distances: the intersection height is min(top_pred, top_gt) + min(bottom_pred, bottom_gt), and the intersection width is min(left_pred, left_gt) + min(right_pred, right_gt).

Now that we have the area of the gt box, the area of the predicted box and the area of the intersected box, we can compute the IoU: the intersection area divided by the union area (gt area + pred area − intersection area).
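
Putting the pieces together, here is a minimal NumPy sketch of the per-pixel IoU loss under the assumptions above; the names are illustrative and masking to text pixels (normally applied before averaging) is omitted for brevity:

```python
import numpy as np

def iou_loss(pred, gt, eps=1e-7):
    """Per-pixel IoU loss over the 4 distance channels (top, left, bottom, right).

    pred, gt : arrays of shape (H, W, 4) holding the predicted / ground-truth
               distances of each pixel to its box's top, left, bottom and right edges.
    Returns the mean of -log(IoU) over all pixels.
    """
    d1_p, d2_p, d3_p, d4_p = [pred[..., i] for i in range(4)]   # top, left, bottom, right
    d1_g, d2_g, d3_g, d4_g = [gt[..., i] for i in range(4)]

    area_pred = (d1_p + d3_p) * (d2_p + d4_p)   # height * width of predicted box
    area_gt   = (d1_g + d3_g) * (d2_g + d4_g)   # height * width of gt box

    h_inter = np.minimum(d1_p, d1_g) + np.minimum(d3_p, d3_g)   # intersection height
    w_inter = np.minimum(d2_p, d2_g) + np.minimum(d4_p, d4_g)   # intersection width
    area_inter = h_inter * w_inter
    area_union = area_pred + area_gt - area_inter

    iou = area_inter / (area_union + eps)
    return -np.log(iou + eps).mean()                             # penalizes IoU < 1
```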