Source: Deep Learning on Medium

The robotic grasp detection problem is to find a way to safely pick up and hold an object. This paper proposes an accurate, real-time approach to robotic grasp detection based on convolutional neural networks. Instead of using sliding windows or region proposal networks (RPNs), the network performs single-stage regression. Classification and regression happen in one pass, so the network identifies the object and a graspable rectangle in a single step.

The problem can be formulated in different ways. Instead of finding the full 3D grasp location and orientation, the authors assume that a good 2D grasp can be projected back to 3D and executed by a robot viewing the scene.

This paper uses a five-dimensional representation for robotic grasps. This representation gives the location and orientation of a parallel plate gripper before it closes on an object. Ground truth grasps are rectangles with a position, size, and orientation: g = {x; y; θ; h; w} where (x, y) is the center of the rectangle, θ is the orientation of the rectangle relative to the horizontal axis, h is the height, and w is the width. The following picture shows it:
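To make the five-dimensional representation concrete, here is a minimal Python sketch that turns a grasp g = {x, y, θ, h, w} into the four corners of its oriented rectangle. The `Grasp` class and `corners` method are hypothetical names for illustration, and the sketch assumes θ is in radians and that w runs along the rectangle's own horizontal axis before rotation; the paper's exact conventions may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class Grasp:
    x: float      # center x
    y: float      # center y
    theta: float  # orientation relative to the horizontal axis (radians, assumed)
    h: float      # height of the rectangle
    w: float      # width of the rectangle

    def corners(self):
        """Return the four corners (x, y) of the oriented rectangle."""
        c, s = math.cos(self.theta), math.sin(self.theta)
        half_w, half_h = self.w / 2, self.h / 2
        # Corners in the rectangle's own frame, then rotated and shifted.
        local = [(-half_w, -half_h), (half_w, -half_h),
                 (half_w, half_h), (-half_w, half_h)]
        return [(self.x + dx * c - dy * s, self.y + dx * s + dy * c)
                for dx, dy in local]


g = Grasp(x=100.0, y=50.0, theta=0.0, h=20.0, w=40.0)
print(g.corners())
```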

Using a five-dimensional representation makes the problem of grasp detection analogous to object detection in computer vision. The only difference is the added term for gripper orientation.

They used AlexNet, which has five convolutional layers with normalization and max-pooling layers, followed by three fully connected (FC) layers. The output layer has 6 neurons for the grasp: four for the location (x, y), height, and width, and two for the sine and cosine of twice the orientation angle.
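The reason for using twice the angle is that a grasp rectangle looks the same after a 180° rotation, so θ and θ + 180° should map to the same target. The sketch below shows one way to encode and decode the angle; the function names are hypothetical, and the order of the two outputs is an assumption.

```python
import math


def encode_angle(theta):
    """Map an orientation (radians) to (sin 2θ, cos 2θ)."""
    return math.sin(2 * theta), math.cos(2 * theta)


def decode_angle(sin2, cos2):
    """Recover an orientation in (-pi/2, pi/2] from (sin 2θ, cos 2θ)."""
    return 0.5 * math.atan2(sin2, cos2)


theta = math.radians(30)
# theta and theta + 180 degrees give the same encoding (up to rounding).
print(encode_angle(theta))
print(encode_angle(theta + math.pi))
print(math.degrees(decode_angle(*encode_angle(theta))))
```

Decoding returns the angle modulo 180°, so an input of 120° comes back as -60°, which describes the same rectangle.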

They used RGB-D images as input to the network. To reuse AlexNet, they simply replaced the blue channel with depth information. It would be possible to change the architecture and train it from scratch, but they used this trick because they wanted to apply transfer learning with a model pre-trained on the ImageNet dataset. They also normalized the depth information to fall between 0 and 255 and replaced missing pixels with 0.
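The channel swap could be sketched with NumPy as follows. This is only an illustration: the function name `rgd_from_rgbd` is hypothetical, missing depth is assumed to be stored as NaN, the image is assumed to be in R, G, B channel order, and per-image min-max scaling is an assumption, since the summary only says the depth falls between 0 and 255.

```python
import numpy as np


def rgd_from_rgbd(rgb, depth):
    """Replace the blue channel of an RGB image with normalized depth.

    rgb:   (H, W, 3) uint8 array in R, G, B order (assumed).
    depth: (H, W) float array, NaN where depth is missing (assumed).
    """
    valid = np.isfinite(depth)
    scaled = np.zeros(depth.shape, dtype=np.float64)  # missing pixels stay 0
    if valid.any():
        d_min, d_max = depth[valid].min(), depth[valid].max()
        span = d_max - d_min
        if span > 0:
            # Min-max scaling to [0, 255]; an assumed normalization scheme.
            scaled[valid] = (depth[valid] - d_min) / span * 255.0
    out = rgb.copy()
    out[..., 2] = np.round(scaled).astype(np.uint8)
    return out


rgb = np.full((2, 2, 3), 7, dtype=np.uint8)
depth = np.array([[0.0, 1.0], [np.nan, 4.0]])
print(rgd_from_rgbd(rgb, depth)[..., 2])
```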

To do the classification and regression tasks simultaneously, they modified the architecture by adding some extra neurons to the output layer that correspond to object categories.

Note: They assume that every image contains a single object.

They also proposed a second model, called MultiGrasp, which generalizes the first. MultiGrasp divides the image into an NxN grid and assumes that there is at most one grasp per grid cell. It predicts one grasp per cell, along with the likelihood that the predicted grasp would be feasible on the object. For a cell to predict a grasp, the center of that grasp must fall within the cell.

The output of this model is an NxNx7 prediction. The first channel is a heatmap of how likely a region is to contain a correct grasp. The other six channels contain the predicted grasp coordinates for that region.
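As a rough illustration of the NxNx7 layout, the sketch below builds a target grid from a list of ground-truth grasps. The function name, the image size of 224, the grid size of 7, the channel order, and the use of raw pixel coordinates are all assumptions made for the example; the paper's actual target construction may differ.

```python
import math


def multigrasp_target(grasps, image_size, n):
    """Build an n x n x 7 target as nested lists.

    grasps: list of (x, y, theta, h, w) with theta in radians.
    Channel 0 is the heatmap; channels 1-6 hold the grasp
    (x, y, h, w, sin 2θ, cos 2θ) in an assumed order.
    """
    target = [[[0.0] * 7 for _ in range(n)] for _ in range(n)]
    cell = image_size / n
    for x, y, theta, h, w in grasps:
        # The cell containing the grasp center is responsible for it.
        col = min(int(x // cell), n - 1)
        row = min(int(y // cell), n - 1)
        # A later grasp in the same cell overwrites an earlier one,
        # matching the "at most one grasp per cell" assumption.
        target[row][col] = [1.0, x, y, h, w,
                            math.sin(2 * theta), math.cos(2 * theta)]
    return target


target = multigrasp_target([(100.0, 50.0, 0.0, 20.0, 40.0)], image_size=224, n=7)
print(target[1][3])
```

With a 224-pixel image and a 7x7 grid, each cell is 32 pixels wide, so a grasp centered at (100, 50) lands in row 1, column 3.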

The first model can be seen as a specific case of the second model with a grid size of 1×1 where the probability of the grasp existing in the single cell is implicitly one.

They trained the network on the Cornell grasping dataset.

Two metrics are commonly used to evaluate grasp detection performance:

Point metric: This metric looks at the distance from the center of the predicted grasp to the center of each of the ground truth grasps. If any of these distances is less than some threshold, the grasp is considered a success. There are a number of issues with this metric, most notably that it does not consider grasp angle or size.
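A minimal sketch of the point metric, assuming centers are (x, y) tuples in pixels; the function name and the threshold values are hypothetical, since the summary does not give a specific threshold.

```python
import math


def point_metric(pred_center, gt_centers, threshold):
    """Success if the predicted center is within `threshold` of any ground truth center."""
    return any(math.dist(pred_center, c) < threshold for c in gt_centers)


print(point_metric((10, 10), [(50, 50), (12, 11)], threshold=5))
print(point_metric((10, 10), [(50, 50), (12, 11)], threshold=2))
```

Note that nothing in this check looks at θ, h, or w, which is exactly the weakness described above.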

Rectangle metric: This metric considers full grasp rectangles. A grasp is counted as correct if both 1) the grasp angle is within 30° of the ground truth grasp and 2) the Jaccard index of the predicted grasp and the ground truth is greater than 25 percent. The Jaccard index is given by J(A, B) = |A ∩ B| / |A ∪ B|, where A is the predicted rectangle and B is the ground truth rectangle.
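The rectangle metric can be sketched in pure Python. This is an illustrative implementation, not the authors' code: it computes the overlap of two rotated rectangles with Sutherland-Hodgman polygon clipping, and it compares angles modulo 180°, which is an assumption based on the symmetry of a grasp rectangle. All function names are hypothetical, and grasps are (x, y, theta, h, w) tuples with theta in radians.

```python
import math


def corners(x, y, theta, h, w):
    """Corners of the rotated rectangle, always in the same winding order."""
    c, s = math.cos(theta), math.sin(theta)
    pts = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(x + dx * c - dy * s, y + dx * s + dy * c) for dx, dy in pts]


def area(poly):
    """Absolute polygon area via the shoelace formula (0 for degenerate input)."""
    return abs(sum(x1 * y2 - x2 * y1
                   for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]))) / 2


def side(p, a, b):
    """Cross product (b - a) x (p - a); >= 0 means p is on the inner side."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def clip(subject, clipper):
    """Sutherland-Hodgman clipping of one convex polygon by another."""
    out = subject
    for a, b in zip(clipper, clipper[1:] + clipper[:1]):
        inp, out = out, []
        for p, q in zip(inp, inp[1:] + inp[:1]):
            dp, dq = side(p, a, b), side(q, a, b)
            if (dp >= 0) != (dq >= 0):
                # Edge pq crosses the clip line: add the crossing point.
                t = dp / (dp - dq)
                out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
            if dq >= 0:
                out.append(q)
    return out


def jaccard(g1, g2):
    a, b = corners(*g1), corners(*g2)
    inter = area(clip(a, b))
    return inter / (area(a) + area(b) - inter)


def angle_diff(t1, t2):
    """Smallest difference between two orientations, modulo 180 degrees (assumed)."""
    d = abs(t1 - t2) % math.pi
    return min(d, math.pi - d)


def rectangle_metric(pred, gts, max_angle=math.radians(30), min_jaccard=0.25):
    return any(angle_diff(pred[2], g[2]) < max_angle and jaccard(pred, g) > min_jaccard
               for g in gts)


pred = (0.0, 0.0, 0.0, 20.0, 40.0)
print(jaccard(pred, (10.0, 0.0, 0.0, 20.0, 40.0)))
print(rectangle_metric(pred, [(10.0, 0.0, 0.0, 20.0, 40.0)]))
```

In the usage example, two 40x20 rectangles shifted by 10 pixels overlap in a 30x20 region, so the Jaccard index is 600 / 1000 = 0.6.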

They use five-fold cross-validation for experimental results. They do two different splits of the data:

1) Image-wise splitting splits images randomly.

2) Object-wise splitting splits object instances randomly, putting all images of the same object into the same cross-validation split.
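The difference between the two splits can be sketched as follows. The sample format, a list of (image_id, object_id) pairs, and both function names are assumptions made for illustration and do not reflect how the Cornell data is actually organized on disk.

```python
import random
from collections import defaultdict


def image_wise_folds(samples, k=5, seed=0):
    """Shuffle individual images and deal them into k folds."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    return [[samples[i] for i in idx[f::k]] for f in range(k)]


def object_wise_folds(samples, k=5, seed=0):
    """Shuffle object instances so all images of one object share a fold."""
    by_object = defaultdict(list)
    for sample in samples:
        by_object[sample[1]].append(sample)
    objects = sorted(by_object)
    random.Random(seed).shuffle(objects)
    return [[s for obj in objects[f::k] for s in by_object[obj]]
            for f in range(k)]


# 40 hypothetical images: 10 objects with 4 images each.
samples = [(i, i // 4) for i in range(40)]
for fold in object_wise_folds(samples):
    print(sorted({obj for _, obj in fold}))
```

With the image-wise split, the same object can appear in both training and test folds, so it tests generalization to new views. The object-wise split tests generalization to objects the network has never seen.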

The paper uses the rectangle metric for its performance evaluation.

The network runs at 13 frames per second (fps) on an NVIDIA Tesla K20 GPU.

The results of their models are as follows. The tables compare these proposed networks:

- Direct regression: The first model without the classification head. It has 6 output neurons for rectangle regression.
- Regression + Classification: The regression network with an added classification head.
- MultiGrasp detection: The second proposed network.