Easiest RPN explained, the core of Faster R-CNN.

Source: Deep Learning on Medium


When I was studying Faster R-CNN and the RPN, I couldn’t find an easy-to-understand article, so I decided to write one myself.

The Japanese version is here.

What is Faster R-CNN ?

Faster R-CNN is an object detection algorithm invented by Microsoft in 2015. It was the first object detector to succeed with an end-to-end implementation using deep learning.

Original: https://arxiv.org/pdf/1506.01497.pdf

The outline is as follows.
① Classify whether each bounding box contains an object or background.
② Classify what the object is and determine the size of the bounding box.
The breakthrough of Faster R-CNN is that it uses a CNN structure called a Region Proposal Network (RPN) in ①. The key update is that the region proposals are generated by deep learning instead of Selective Search, an image-processing method.

cited from original paper

As shown above, the whole pipeline can be trained in a single flow.

Well… even so, the following questions about the RPN remained even after reading several explanation pages.

・What is a “sliding window”?
・How and when are “Anchor boxes” used?
・I can’t picture the shape of the RPN’s output layer.

In this article, I’ll explain these without using mathematical formulas.

① About feature maps.

The feature maps are very simple: they are the output of the final convolutional layer when an input image is passed through a pre-trained model such as VGG16. For example, the structure of VGG16 is as below.


The feature maps use the layers up to the 14*14*512 output; the layers from 7*7*512 onward are not used. If you choose VGG16 as the pre-trained model, four pooling layers are applied, so each feature map is 1/16 the size of the original image, and there are 512 of them.

For example, if the original image is 300*400*3, the feature maps are 18*25*512.
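As a quick sanity check, the 1/16 scaling can be computed directly. This is a minimal sketch; the floor division mirrors what the pooling layers do (the exact rounding behavior at non-divisible sizes is my assumption here):

```python
def feature_map_size(width, height, stride=16):
    """Spatial size of the feature map for a given input image.

    stride=16 corresponds to VGG16's four 2x2 pooling layers (2**4 = 16).
    """
    return width // stride, height // stride

# The article's example: a 300*400 image gives an 18*25 feature map.
fw, fh = feature_map_size(300, 400)
```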

From now on, we use this image as the original. The ground truth is the green rectangle; assume the width is 300 and the height is 400. As for the portrait rights, I got permission directly from her.

② Brief of Anchor and Anchor boxes

After creating the feature maps, we set Anchors. An Anchor is simply a point of the feature map. In addition, we create nine Anchor boxes for each Anchor. The information about these Anchor boxes is the RPN output.

About Anchor

In this example, all 18*25=450 points are Anchors. These 450 points are the centers of the bounding boxes. Because of VGG16’s stride, Anchors appear once every 16 pixels in the original image, so you can see that they appear as below.

Set Anchors on the feature maps and map them onto the original image.
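The mapping from feature-map points back to image coordinates can be sketched as below. The half-stride offset of 8 px that centers each Anchor in its 16×16 cell is my assumption; the article only states that Anchors appear once per 16 pixels:

```python
# One Anchor per feature-map cell; each is placed at the center of the
# 16x16 patch of the original image that the cell corresponds to.
stride = 16
anchor_centers = [(i * stride + stride // 2, j * stride + stride // 2)
                  for i in range(18)      # feature-map width
                  for j in range(25)]     # feature-map height
```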

The centers of the bounding boxes are decided by the Anchors, but the widths and heights have not been decided yet. The role of the Anchor boxes is to decide these.

About Anchor box

You determine
・a standard length
・an aspect ratio
and from each Anchor you create multiple Anchor boxes. For example, if you determine

・standard lengths → 64, 128, 256 (be careful not to exceed the length of the image)
・aspect ratios → 1:1, 1:2, 2:1

then the Anchor boxes for the Anchor at (x,y)=(11,12) are created as below.

Nine (=3*3) Anchor boxes are created for each Anchor.
One caution: when making Anchor boxes, their area has to be kept the same for each standard length.
In other words, when the standard length is 64, the Anchor boxes are as below.
1:1 → 64×64 (=4096)
1:2 → 45×91 (≈4096)
2:1 → 91×45 (≈4096)
In addition, Anchor boxes that protrude outside the image are ignored, so the actual result is as below.

A total of 18*25*9=4050 Anchor boxes are created. By proposing boxes of various shapes here, we can generate candidate boxes that closely match the ground truth regardless of its shape.
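The equal-area rule can be sketched in a few lines. The rounding to whole pixels is my assumption, but it reproduces the 45×91 and 91×45 boxes from the example above:

```python
import math

def anchor_box_sizes(standard_lengths=(64, 128, 256),
                     ratios=((1, 1), (1, 2), (2, 1))):
    """Return (width, height) for every standard-length / aspect-ratio pair,
    keeping the area equal to standard_length**2 within each group."""
    sizes = []
    for s in standard_lengths:
        area = s * s
        for rw, rh in ratios:
            w = round(math.sqrt(area * rw / rh))
            h = round(math.sqrt(area * rh / rw))
            sizes.append((w, h))
    return sizes
```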

All you have to do is compare each Anchor box with the ground truth, and use the result as the RPN output!

③ About a structure of RPN output

The biggest reason why I wanted to write this article is that I found no article explaining the details of the RPN output.

As written in original paper, RPN learns two things.

・Whether the content of an Anchor box is background or an object (the cls layer).
・If it is an object, how far it deviates from the ground truth (the reg layer).

cited from original paper

k is the number of Anchor boxes. In this case, it’s 9.

For background vs. object, calculate the IoU between the ground truth and each Anchor box, and label the box “background” if IoU < 0.3 and “object” if IoU > 0.7. Therefore, 9*2=18 classes are created.
(If you want to know more about IoU, look here.)
Anchor boxes with 0.3 < IoU < 0.7 are not used for learning.
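A minimal sketch of the IoU computation and the labeling rule. The box format (corner coordinates) and the use of −1 to mark the ignored 0.3–0.7 range are my assumptions:

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero-sized if the boxes do not overlap).
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union

def rpn_cls_label(iou_value):
    """1 = object (IoU > 0.7), 0 = background (IoU < 0.3), -1 = ignored."""
    if iou_value > 0.7:
        return 1
    if iou_value < 0.3:
        return 0
    return -1
```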

For the error with the ground truth, four items need to be calculated (the errors of the x coordinate, y coordinate, width, and height), so 9*4=36 classes are created.
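The four errors are parameterized as in the original paper: center offsets normalized by the anchor size, and log-scale ratios for width and height. A sketch, assuming boxes are given as (center_x, center_y, width, height):

```python
import math

def reg_targets(anchor, gt):
    """Regression targets (tx, ty, tw, th) from the Faster R-CNN paper,
    for boxes given as (center_x, center_y, width, height)."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    tx = (gx - ax) / aw          # normalized x offset
    ty = (gy - ay) / ah          # normalized y offset
    tw = math.log(gw / aw)       # log width ratio
    th = math.log(gh / ah)       # log height ratio
    return tx, ty, tw, th
```

A perfectly matching Anchor box yields all zeros, which is why most reg labels stay 0.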

…now, can you imagine the shape of RPN output??

I couldn’t imagine it at all.

・About a background or an object

The quickest way is to see the conclusion. The conclusion is this.

This is arranged in the order of

background label for Anchor box 1|object label for Anchor box 1|background label for Anchor box 2| object label for Anchor box 2|
…| object label for Anchor box 9

for each Anchor. Anchor box 1 has standard length 64 and aspect ratio 1:1; similarly, Anchor box 2 is 64 and 1:2, Anchor box 3 is 64 and 2:1, Anchor box 4 is 128 and 1:1, and so on.

Returning to the image, the Anchor box with standard length 256 and ratio 1:2 (Anchor box 8) achieves IoU > 0.7 with the ground truth.

Therefore, RPN output of (x,y,z)= (11,12,16) is “1”.

Conversely, the Anchor boxes of “64 and 1:1”, “64 and 1:2”, and “64 and 2:1” all have IoU < 0.3. Therefore, the RPN outputs at (x,y,z)=(11,12,1), (11,12,3), (11,12,5), their background channels, are 1.

In this way, all 4050 Anchor boxes are examined: a 1 in the background channel where IoU < 0.3 and a 1 in the object channel where IoU > 0.7 form the background-or-object output.
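Filling the background/object label volume can be sketched as below. Note that the code uses 0-based indices while the article counts from 1, so Anchor box 8’s object channel is index 15 here (channel 16 in the article):

```python
# 18 x 25 x 18 label volume (feature-map x, feature-map y, channels),
# built from plain Python lists for illustration.
cls_labels = [[[0] * 18 for _ in range(25)] for _ in range(18)]

def mark(labels, x, y, box_index, is_object):
    """Place a 1 for Anchor box `box_index` (0-based) at Anchor (x, y).

    Channel layout follows the article: bg box 1, obj box 1, bg box 2, ...
    """
    z = 2 * box_index + (1 if is_object else 0)
    labels[x][y][z] = 1

# Anchor box 8 at Anchor (11,12) in the article's 1-based terms:
mark(cls_labels, 10, 11, 7, True)
```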

・About the error of ground truth

Let me jump straight to the conclusion.

This just changes the 2 to a 4!

This is arranged in the order of

An error of x coordinate for Anchor box 1|An error of y coordinate for Anchor box 1|An error of width for Anchor box 1|An error of height for Anchor box 1|An error of x coordinate for Anchor box 2|…|An error of height for Anchor box 9

Only Anchor boxes with IoU > 0.7 are checked for errors, so most labels are 0. In this example, Anchor box 8 (with (x,y)=(11,12)) has IoU > 0.7, so (x,y,z)=(11,12,29), (11,12,30), (11,12,31), (11,12,32) hold the corresponding values.

In this way, the RPN solves “background or object” as a binary classification and “error from the ground truth” as a regression at the same time.

This is the RPN structure.

Extra: About the sliding window

It’s very simple: it’s just a convolutional layer with a 3*3 filter. I don’t know why the paper calls it a “sliding window” instead of a “convolutional layer”.

There is no need to think complicatedly.


The RPN is built only from convolutional and pooling layers. In general classification problems, a fully connected layer comes at the end, so the input image must be a fixed size, but the RPN can accept images of any size. Therefore, it is a very flexible model.
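The whole head reduces to shape arithmetic: a 3*3 convolution with padding 1 (the “sliding window”) followed by a 1*1 convolution producing 2k=18 channels for cls (the reg head is analogous with 4k=36 channels). The naive pure-Python convolution below is only for illustration, and the 16 intermediate channels (the paper uses 512; shrunk here for speed) and random weights are my assumptions:

```python
import random

def conv(feat, weights, pad):
    """Naive 2D convolution over a [C][H][W] feature map.

    With a 3x3 kernel and pad=1 ("same" padding), the spatial size H x W
    is preserved; with a 1x1 kernel and pad=0, it is preserved trivially.
    """
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    K = len(weights)           # number of output channels
    k = len(weights[0][0])     # kernel height/width
    out = [[[0.0] * W for _ in range(H)] for _ in range(K)]
    for o in range(K):
        for y in range(H):
            for x in range(W):
                s = 0.0
                for c in range(C):
                    for dy in range(k):
                        for dx in range(k):
                            yy, xx = y + dy - pad, x + dx - pad
                            if 0 <= yy < H and 0 <= xx < W:
                                s += feat[c][yy][xx] * weights[o][c][dy][dx]
                out[o][y][x] = s
    return out

# Tiny stand-in for a 512 x 18 x 25 VGG16 feature map.
C, H, W = 8, 6, 9
feat = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]

# The "sliding window": 3x3 conv, padding 1, 16 output channels.
mid_w = [[[[random.random() for _ in range(3)] for _ in range(3)]
          for _ in range(C)] for _ in range(16)]
mid = conv(feat, mid_w, pad=1)

# cls head: 1x1 conv producing 2k = 18 channels; spatial size is unchanged,
# which is why the RPN works on images of any size.
cls_w = [[[[random.random()]] for _ in range(16)] for _ in range(18)]
cls = conv(mid, cls_w, pad=0)
```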

If I may be selfish,
I wish Microsoft had written about the RPN a little more simply :-P


1. https://arxiv.org/pdf/1506.01497.pdf

2. https://towardsdatascience.com/faster-r-cnn-object-detection-implemented-by-keras-for-custom-data-from-googles-open-images-125f62b9141a