Review Paper: Robotic Grasp Detection using Deep Convolutional Neural Networks

Source: Deep Learning on Medium


In this post, we review this paper.

In the previous review, we saw that the authors used the AlexNet architecture, replaced the blue channel with normalized depth information, and fed the resulting image into the network.
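The blue-to-depth substitution described above can be sketched as follows. This is a minimal illustration, assuming the depth map is normalized to the 0–255 range of the color channels; the function name is illustrative, not from the paper.

```python
import numpy as np

def to_rgd(rgb, depth):
    """Replace the blue channel of an RGB image with normalized depth.

    rgb:   (H, W, 3) uint8 image
    depth: (H, W) float array of raw depth values
    """
    d = depth.astype(np.float64)
    # Rescale depth to 0-255 so it matches the color channels' range.
    d = (d - d.min()) / max(d.max() - d.min(), 1e-8) * 255.0
    rgd = rgb.copy()
    rgd[..., 2] = d.astype(np.uint8)  # channel index 2 = blue in RGB order
    return rgd

rgb = np.zeros((4, 4, 3), dtype=np.uint8)
depth = np.linspace(0.5, 1.5, 16).reshape(4, 4)
print(to_rgd(rgb, depth)[..., 2].max())  # 255
```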

In this paper, the authors propose a network with two 50-layer deep convolutional residual neural networks (ResNet-50) running in parallel to extract features from RGB-D images: one network extracts features from the RGB image, and the other extracts features from the depth channel. The features from both branches are then merged and fed into another convolutional network to predict the grasp configuration. They also propose a version of the model that uses only the RGB component:

– Uni-modal → the model with an RGB image as input.

– Multi-modal → the model with an RGB-D image as input.

They also have a baseline model, which uses an SVM on top of the extracted features.

They train and evaluate the network on the Cornell Grasp Dataset. You can see sample images from this dataset in the picture below:

There are different ways to formulate this problem. Instead of finding the full 3D grasp location and orientation, the authors assume that a good 2D grasp can be projected back to 3D and executed by a robot viewing the scene.

This paper uses a five-dimensional representation for robotic grasps. This representation gives the location and orientation of a parallel plate gripper before it closes on an object. Ground truth grasps are rectangles with a position, size, and orientation: g = {x, y, θ, h, w}, where (x, y) is the center of the rectangle, θ is the orientation of the rectangle relative to the horizontal axis, h is the height, and w is the width.
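To make the representation concrete, here is a small sketch that recovers the four corner points of a grasp rectangle from (x, y, θ, h, w). The function is illustrative, not from the paper; θ is taken in radians here.

```python
import numpy as np

def grasp_corners(x, y, theta, h, w):
    """Four corners of the grasp rectangle g = {x, y, theta, h, w}.

    (x, y) is the center, theta the rotation relative to the
    horizontal axis (radians), h the height, w the width.
    """
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])  # 2-D rotation matrix
    # Corner offsets of an axis-aligned w x h rectangle around the origin.
    offsets = np.array([[-w/2, -h/2], [w/2, -h/2], [w/2, h/2], [-w/2, h/2]])
    return offsets @ R.T + np.array([x, y])

# An unrotated 4-wide, 2-high rectangle centered at the origin:
print(grasp_corners(0.0, 0.0, 0.0, 2.0, 4.0))
# [[-2. -1.] [ 2. -1.] [ 2.  1.] [-2.  1.]]
```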

Using the five-dimensional representation makes grasp detection analogous to object detection in computer vision; the only difference is the added term for gripper orientation.

In this work, the authors assume that the input image contains only one graspable object and that a single grasp has to be predicted for it. The main advantage of this assumption is that the whole image can be processed in a single pass.

Uni-modal Grasp Predictor

In this version, they use an RGB or RGD image as input. This lets them use a ResNet-50 model pre-trained on the ImageNet dataset to extract features from the image. They replace the last fully connected layer of ResNet-50 with two fully connected layers, the last of which predicts the grasp configuration. The uni-modal structure is as follows:
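A rough sketch of the replacement head, i.e. the two fully connected layers sitting on top of the ResNet-50 features. The hidden width (512) and the 5-d output (x, y, θ, h, w) are assumptions for illustration; the paper's exact layer sizes and output parameterization may differ, and the random features below merely stand in for the ResNet-50 backbone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: ResNet-50's pooled feature is 2048-d; the hidden width
# and 5-d grasp output are illustrative choices, not from the paper.
FEAT, HIDDEN, OUT = 2048, 512, 5
W1 = rng.normal(0, 0.01, (FEAT, HIDDEN)); b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.01, (HIDDEN, OUT)); b2 = np.zeros(OUT)

def grasp_head(features):
    """Two fully connected layers replacing ResNet-50's final FC layer."""
    hidden = np.maximum(features @ W1 + b1, 0.0)  # ReLU
    return hidden @ W2 + b2                       # grasp configuration

features = rng.normal(size=(1, FEAT))  # stand-in for ResNet-50 features
print(grasp_head(features).shape)      # (1, 5)
```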

Multi-Modal Grasp Predictor

This version uses RGB-D information as input. The RGB-D data is converted into two images: a plain RGB image and a depth image replicated into a 3-channel image, similar to a gray-to-RGB conversion. These two images are given as input to two independent pre-trained ResNet-50 models acting as feature extractors. Features are taken from the second-to-last layer of both ResNet-50 networks, L2-normalized, concatenated, and sent to an average pooling layer, followed by several fully connected layers. The network architecture can be seen in the following picture:
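The two preprocessing steps described above, tiling the depth channel into 3 channels and L2-normalizing each branch's features before concatenation, can be sketched like this. Both function names are illustrative, and the normalization of raw depth values is an assumption.

```python
import numpy as np

def depth_to_3ch(depth):
    """Tile a single depth channel into 3 channels (like gray -> RGB)."""
    d = depth.astype(np.float64)
    d = (d - d.min()) / max(d.max() - d.min(), 1e-8)  # assumed rescaling
    return np.stack([d, d, d], axis=-1)

def fuse(rgb_feat, depth_feat):
    """L2-normalize each branch's feature vector, then concatenate."""
    r = rgb_feat / np.linalg.norm(rgb_feat)
    d = depth_feat / np.linalg.norm(depth_feat)
    return np.concatenate([r, d])

depth = np.arange(16.0).reshape(4, 4)
print(depth_to_3ch(depth).shape)           # (4, 4, 3)
print(fuse(np.ones(2048), np.ones(2048)).shape)  # (4096,)
```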

Evaluation Metrics

Two metrics are commonly used for evaluating grasp detection performance:

Point metric: This metric looks at the distance from the center of the predicted grasp to the center of each of the ground truth grasps. If any of these distances is less than some threshold, the grasp is considered a success. There are a number of issues with this metric, most notably that it does not consider grasp angle or size.
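The point metric reduces to a simple distance check against all ground-truth centers; a minimal sketch (the threshold value is application-specific and not fixed by the metric itself):

```python
import numpy as np

def point_metric(pred_center, gt_centers, threshold):
    """Success if the predicted grasp center lies within `threshold`
    of ANY ground-truth grasp center. Ignores grasp angle and size."""
    pred = np.asarray(pred_center, dtype=float)
    dists = np.linalg.norm(np.asarray(gt_centers, dtype=float) - pred, axis=1)
    return bool((dists < threshold).any())

print(point_metric((5.0, 5.0), [(4.0, 5.0), (50.0, 50.0)], threshold=2.0))  # True
```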

Rectangle metric: It considers full grasp rectangles. A grasp is considered correct if both 1) the grasp angle is within 30° of the ground truth grasp, and 2) the Jaccard index of the predicted grasp and the ground truth is greater than 25 percent. The Jaccard index of a predicted grasp A and a ground truth grasp B is given by J(A, B) = |A ∩ B| / |A ∪ B|, i.e. the area of their intersection divided by the area of their union.
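The rectangle metric combines an angle check with the Jaccard index (intersection over union) of the two rectangles. The sketch below makes a simplifying assumption: the rectangles are axis-aligned for the area computation, with θ handled only by the 30° check. The paper's grasps are rotated rectangles, whose exact intersection requires polygon clipping and is omitted here.

```python
def jaccard_axis_aligned(a, b):
    """Jaccard index |A∩B| / |A∪B| of two AXIS-ALIGNED rectangles given
    as (x_min, y_min, x_max, y_max). Simplification: the paper's grasp
    rectangles are rotated; rotated intersection needs polygon clipping."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2]-a[0]) * (a[3]-a[1]) + (b[2]-b[0]) * (b[3]-b[1]) - inter
    return inter / union

def rectangle_metric(pred_box, pred_angle, gt_box, gt_angle):
    """Correct if angle within 30 degrees AND Jaccard index > 0.25."""
    angle_ok = abs(pred_angle - gt_angle) <= 30.0
    return angle_ok and jaccard_axis_aligned(pred_box, gt_box) > 0.25

# Overlap 64, union 136 -> Jaccard ~ 0.47 > 0.25; angle off by 10 degrees.
print(rectangle_metric((0, 0, 10, 10), 10.0, (2, 2, 12, 12), 0.0))  # True
```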

They use five-fold cross-validation for the experimental results, with two different splits of the data:

1) Image-wise splitting splits images randomly.

2) Object-wise splitting splits object instances randomly, putting all images of the same object into the same cross-validation split.
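The difference between the two splits is only in how samples are grouped into folds: image-wise splitting shuffles images freely, whereas object-wise splitting keeps all images of an object together. A minimal sketch of the object-wise case (the data layout of (object_id, image_id) pairs is an assumption for illustration):

```python
import random

def object_wise_folds(samples, k=5, seed=0):
    """Split (object_id, image_id) pairs into k folds so that all images
    of a given object land in the same fold (object-wise splitting)."""
    objects = sorted({obj for obj, _ in samples})
    random.Random(seed).shuffle(objects)
    fold_of = {obj: i % k for i, obj in enumerate(objects)}
    folds = [[] for _ in range(k)]
    for obj, img in samples:
        folds[fold_of[obj]].append((obj, img))
    return folds

samples = [(obj, img) for obj in range(10) for img in range(4)]
folds = object_wise_folds(samples)
print([len(f) for f in folds])  # [8, 8, 8, 8, 8]
```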

This paper uses the rectangle metric for its performance evaluation.


The authors evaluate several versions of their model, as shown in the following tables.

The accuracy of different models can be seen in the table below:

The grasp prediction speed on a PC with NVIDIA GeForce GTX 645 GPU and Intel® Core(TM) i7–4770 CPU @ 3.40GHz is as follows:

You can see some examples of predicted grasps in the figure below: