In this blog, I will try to explain the interesting work in the paper by Jiajun Wu et al., MarrNet: 3D Shape Reconstruction via 2.5D Sketches, together with the currently published code, which does not include the crucial code for the "pre-trained" model.
Basically, the goal of the trained model is to reconstruct a voxelized representation given a single image as input. The main method, as implied by the title of the paper, is to first project the single image to three 2.5D sketches: depth, normal, and silhouette. The model then uses these three sketches to reconstruct the 3D voxelized representation corresponding to the input image. To supervise the 3D reconstruction, they designed a novel loss, the "reprojection loss", which projects the predicted voxels back to 2.5D and compares them against the estimated sketches.
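To make the idea of the reprojection loss concrete, here is a minimal sketch of its silhouette term, in my own words rather than the paper's exact formulation: project the occupancy grid along the viewing axis (a soft "any voxel on the ray is occupied") and penalize disagreement with the silhouette mask. The function name and the axis-aligned orthographic projection are my simplifying assumptions; MarrNet uses the actual camera geometry.

```python
import numpy as np

def silhouette_reprojection_loss(voxels, silhouette, eps=1e-7):
    """Toy silhouette reprojection loss (my sketch, not the paper's exact math).

    voxels:     (D, H, W) predicted occupancy probabilities in [0, 1]
    silhouette: (H, W) binary target mask from the 2.5D sketch estimator
    """
    # Soft projection along depth: a pixel is covered unless every
    # voxel on its (axis-aligned) ray is empty.
    proj = 1.0 - np.prod(1.0 - voxels, axis=0)
    proj = np.clip(proj, eps, 1.0 - eps)
    # Binary cross entropy between the projected and target silhouettes.
    return -np.mean(silhouette * np.log(proj)
                    + (1.0 - silhouette) * np.log(1.0 - proj))
```

The key property is that this loss needs only 2.5D supervision at the output side, which is what lets the 3D estimation stage be tuned without paired 3D ground truth.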
As far as I'm concerned, the main novelty of this paper lies in the loss design, which makes the long-standing 2.5D approach plausible. While the whole architecture may look fancy, some underlying parts of the model are not clearly described in the paper, namely the pre-trained model.
Here is the architecture of MarrNet. With some previous experience, we can see that part (a), the 2.5D sketch estimation, is relatively easy. However, for part (b), the information from the 2.5D sketches alone seems insufficient. Indeed, as the paper points out, reconstructing a 3D representation from a single image is ill-posed, and a prior must be incorporated. And in MarrNet, too, they actually incorporate a fairly strong prior to help the 3D shape estimation, via pre-training.
Here is where the paper is ambiguous. The paper claims that "The 3D shape estimation module takes in the masked ground truth depth and normal images as input, and predicts 3D voxels of size 128*128*128 with a binary cross entropy loss", and that it is trained in an encoder-decoder style architecture, but I do have some doubts about this.
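For reference, the per-voxel binary cross entropy they describe is standard and easy to write down; here is a minimal numpy version over an occupancy grid (the 128-cubed shape in the comment is from the paper, the function itself is just my illustration):

```python
import numpy as np

def voxel_bce(pred, target, eps=1e-7):
    """Mean binary cross entropy over a voxel occupancy grid.

    pred:   predicted occupancy probabilities, e.g. shape (128, 128, 128)
    target: ground-truth binary voxels, same shape
    """
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(target * np.log(pred)
                    + (1.0 - target) * np.log(1.0 - pred))
```

So the loss itself is unremarkable; what is hard to believe is that this supervision alone, from masked depth and normal inputs, gets the decoder to such clean 128^3 shapes.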
Is it really possible to train a model that takes 2.5D as input and reconstructs a 3D representation that well? Of course, you could argue that the model has a lot of pre-training on synthetic 2.5D and 3D data. I agree. But the problem is that, as of now, the authors haven't published the code they used for pre-training, so I have another guess as to how they actually pre-trained the model (or could have): using the TL network.
The TL network is mentioned briefly in the paper; the authors only say that the encoder part of the 3D shape estimation module was inspired by it. But if you actually take a look at the TL network, you may, like me, wonder what that inspiration really is.
As shown above, the TL network is a stupidly simple yet really powerful method that uses pre-training to reconstruct a 3D representation from a single image. The pre-training has three stages: 1) voxel to voxel, an autoencoder that learns the intermediate vector representation; 2) image to vector representation, modeled as a regression problem; 3) joint fine-tuning.
In the test stage, the only remaining part is the "image → vector → voxel" path, which is exactly the 3D shape estimation architecture in MarrNet.
So the second possibility for the pre-training may be almost the same as the TL network, except that the image input is replaced by the 2.5D sketches. And of course this would lead to really satisfying results.
The only problem with the second possibility, however, is that if the authors of MarrNet really used this method, I think they would have said more about it in the paper. Since they didn't, I guess they might have used the first method, which would really amaze me.
Now the only thing to do is just to wait for the remaining code…