Mask R-CNN: what is it and how does it work? Attempt 1

Source: Deep Learning on Medium

Instance segmentation → hard, since we have to separate and count the individual instances in the image → very hard, and instances can overlap.

R-CNN → Fast R-CNN → Faster R-CNN → upgraded → Mask R-CNN.

Basically → the image is first encoded into feature maps → this is the feature-encoding step. (done by a backbone network that was pretrained on image classification).
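At its core the encoding step is stacked convolutions. A minimal sketch of one "valid"-mode 2-D convolution (really cross-correlation) — the function name and the hand-made edge kernel are illustrative assumptions; a real backbone learns its kernels:

```python
def conv2d(img, kernel):
    # "Valid"-mode 2-D cross-correlation on a grayscale image (list of lists).
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)] for r in range(out_h)]

# A hand-made vertical-edge kernel: responds where intensity jumps left-to-right.
img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
edge = [[-1, 1]]
print(conv2d(img, edge))  # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
```

A backbone like ResNet stacks many such learned filters, interleaved with nonlinearities and downsampling, to turn pixels into feature maps.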

FPN → we extract features at different scales → making things much more powerful. (FPN → lets us choose among multiple feature maps).
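"Choosing among multiple feature maps" can be sketched with the FPN paper's level-assignment heuristic, k = k0 + log2(sqrt(w·h)/224): bigger boxes go to coarser pyramid levels. The k0=4 and 224 canonical size follow the paper; the function name and clamping bounds here are assumptions:

```python
import math

def fpn_level(box_w, box_h, k0=4, canonical=224, k_min=2, k_max=5):
    # Assign an RoI to pyramid level k = k0 + log2(sqrt(w*h) / canonical),
    # clamped to the levels that actually exist (P2..P5 assumed here).
    k = k0 + math.log2(max(1e-6, math.sqrt(box_w * box_h) / canonical))
    return int(min(k_max, max(k_min, round(k))))

print(fpn_level(224, 224))  # 4  (canonical-size box → P4)
print(fpn_level(56, 56))    # 2  (small box → fine level P2)
print(fpn_level(448, 448))  # 5  (large box → coarse level P5)
```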

Then the Region Proposal Network creates candidate bounding boxes → super cool. (if objects overlap → it might be hard to decide which anchor boxes to use).

To fit the object better, non-maximum suppression prunes the overlapping proposals. (this is good).

Now for each bounding box → for EACH bounding box, we run a classifier → this is why it is so powerful and dynamic.

Finally → we generate a segmentation mask for each detection. (the ground-truth mask is scaled down to 256×256) → for easier and faster training.
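Scaling a mask down can be sketched with nearest-neighbor resampling — the simplest option for a binary mask. The function name is an assumption, and real implementations typically use bilinear crop-and-resize instead:

```python
def resize_mask(mask, out_h, out_w):
    # Nearest-neighbor downscale of a binary mask given as a list of lists.
    in_h, in_w = len(mask), len(mask[0])
    return [[mask[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)] for r in range(out_h)]

mask = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
print(resize_mask(mask, 2, 2))  # [[1, 0], [0, 0]]
```

The shrunken mask carries the same shape information at a fraction of the memory, which is what makes training on it cheaper.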

The author used different kinds of annotation tools.

The feature-extraction network is the backbone → without it the rest is nothing.

The only difference from Faster R-CNN → is the added mask network → which generates a segmentation mask per detection. (and there is a refinement stage, RoIAlign, as well).

Non-maximum suppression → since occlusion is a huge problem.

Separate the different objects and for each object → give it its own identity. (there are 7 objects in this image).

Quite a small feature map → but we are going to build a pyramid.

The starting anchors → are at predefined locations fixed before training → this is a much easier approach.
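"Predefined locations" means a regular grid of boxes tiled over the feature map, one per (cell, scale, aspect-ratio) combination. A minimal sketch — the function name, single scale, and ratio set are illustrative assumptions:

```python
def make_anchors(feat_h, feat_w, stride, scales, ratios):
    # One anchor per (cell, scale, ratio), centered on the cell, in image coords.
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s * (r ** 0.5)   # aspect ratio r = w / h,
                    h = s / (r ** 0.5)   # keeping area roughly s*s
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors

# A 2x2 feature map at stride 16, one scale, three aspect ratios:
anchors = make_anchors(2, 2, 16, scales=[32], ratios=[0.5, 1.0, 2.0])
print(len(anchors))  # 12 = 2 * 2 cells * 1 scale * 3 ratios
```

Because the grid is fixed, the network never predicts boxes from scratch — it only learns small offsets from these anchors, which is the "much easier approach".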

Computer vision is a cool field → we ask the network to draw bounding boxes on the image → this is very important and has a lot of applications.

The RPN's softmax (object vs. background) → scores multiple regions of the image → proposing those likely to contain some kind of object.
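That scoring step is just a two-way softmax over each anchor's logits. A minimal sketch — the logit values below are made-up numbers for illustration:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of raw scores.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# (object_logit, background_logit) for three hypothetical anchors.
for logits in [(2.0, -1.0), (-0.5, 0.5), (3.0, 0.0)]:
    obj_prob, bg_prob = softmax(logits)
    print(f"object probability: {obj_prob:.3f}")
```

Anchors whose object probability clears a threshold become region proposals; the rest are treated as background.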

Some of the networks use → an FCN (fully convolutional network) → to get the segmentation map or more.

Basically, if the feature-extraction backbone is done well → consider the problem mostly solved.

They just combined two different state-of-the-art models. (coloring is pretty easy → just showing instances with different color maps).