Source: Deep Learning on Medium
Unsupervised Learning of Object Landmarks through Conditional Image Generation
The idea is to learn keypoints → from the image generation process itself, which is very interesting. (the model is conditioned on geometry differences → this is what sets it apart from other generative models).
The image pair has different viewpoints → the goal of the model is to reproduce the original image. (and, in doing so, actually capture the keypoint differences) → good idea.
Very interesting architecture → how the source and target images are used. (this method is simple and very easy to use → no complicated loss function → a huge advantage).
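As a rough sketch of that source/target split (all function names and shapes here are my own illustration, not the paper's code): the source image supplies appearance features, the target image supplies geometry via keypoint heatmaps, and a generator would combine the two to reconstruct the target.

```python
import numpy as np

rng = np.random.default_rng(0)

def appearance_features(img):
    # stand-in for an appearance encoder: global-average-pool to a vector
    return img.mean(axis=(0, 1))  # shape (C,)

def keypoint_heatmaps(img, k=10):
    # stand-in for the keypoint detector: k spatial heatmaps
    # (random here, for shapes only)
    h, w, _ = img.shape
    return rng.random((k, h, w))

source = rng.random((64, 64, 3))  # same object, viewpoint A
target = rng.random((64, 64, 3))  # same object, viewpoint B

style = appearance_features(source)   # appearance comes from the source
geometry = keypoint_heatmaps(target)  # geometry comes from the target

# a real generator would decode (style, geometry) back into an image;
# training then minimizes the reconstruction error against the target
print(style.shape, geometry.shape)  # (3,) (10, 64, 64)
```

Because only the geometry channel changes between the two views, the reconstruction objective pushes the detector to encode exactly the pose information.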
There is a lot of related work in this field → but most of it is not generative → is a generative approach really optimal? → in unsupervised learning → autoencoders, among other methods, are widely used.
A good case is a factorization process where different elements are decomposed. (some of the methods use LSTMs → would that be a good idea?).
Their method → uses a GAN to learn the structure. (different variations are enforced). (the authors' method can learn from raw videos directly).
So this is a reconstruction loss → but with different input combinations. (and there is a bottleneck process where → the model is forced to learn a restricted representation) → this is needed since we want the model to learn specific information.
Hence the bottleneck acts as a regularizer.
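A minimal sketch of this kind of geometry bottleneck (my own simplified version, assuming the common heatmap → coordinate → Gaussian scheme): each heatmap is collapsed to a single 2-D coordinate and re-rendered as a Gaussian blob, so only keypoint locations can pass through the bottleneck.

```python
import numpy as np

def softargmax(heatmap):
    # collapse one heatmap to an (x, y) coordinate via a spatial softmax
    h, w = heatmap.shape
    p = np.exp(heatmap - heatmap.max())
    p /= p.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    return (p * xs).sum(), (p * ys).sum()

def render_gaussian(x, y, h, w, sigma=2.0):
    # re-render the coordinate as an isotropic Gaussian heatmap
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))

# a heatmap with a single sharp peak at column 20, row 12
hm = np.zeros((64, 64))
hm[12, 20] = 100.0

x, y = softargmax(hm)                      # x ~ 20, y ~ 12
bottleneck_map = render_gaussian(x, y, 64, 64)
```

Everything except the coordinates is thrown away and rebuilt, which is why the model cannot smuggle appearance information through this path.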
The above is evidence that the model learned something more useful. (super interesting). (a separable implementation can be used).
What kind of loss function is used? → a simple one → perceptual loss → super interesting. (a pretrained deep network provides the features for the loss).
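A perceptual loss compares images in the feature space of a fixed network rather than in pixel space. A toy sketch of the idea (the `features` function below is a random-projection stand-in; the paper uses a real pretrained CNN):

```python
import numpy as np

rng = np.random.default_rng(0)
# fixed random projection standing in for a frozen pretrained feature extractor
W = rng.standard_normal((3, 8)) * 0.1

def features(img):
    # map each pixel through the fixed "network", ReLU, then average-pool
    return np.maximum(img @ W, 0).mean(axis=(0, 1))

def perceptual_loss(recon, target):
    # L2 distance between feature representations, not between raw pixels
    return np.sum((features(recon) - features(target)) ** 2)

target = rng.random((32, 32, 3))
recon = target + 0.01 * rng.standard_normal(target.shape)  # near-perfect recon

loss_close = perceptual_loss(recon, target)
loss_far = perceptual_loss(rng.random((32, 32, 3)), target)  # unrelated image
```

A faithful reconstruction scores a much smaller loss than an unrelated image, while small pixel-level perturbations are tolerated, which is what makes this loss forgiving to train with.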
They can train the model with very little data → a very good property. (so reconstruction is needed → which makes the loss function more complicated) → this does seem to be very important.
But high computational power would be needed. (the model can also learn human pose and more). (BBC Pose dataset → the keypoints are learned and tracked automatically).
Adam with weight decay was used. (they really did a lot of experiments → great job).
The authors' self-supervised training method gave the best performance. (this is very impressive). (a linear regressor was placed on top of the unsupervised model).
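That evaluation protocol keeps the unsupervised backbone frozen and fits only a linear map from its landmarks to the annotated ones. A minimal sketch with least squares (the data here is synthetic, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

n, k_unsup, k_annot = 200, 10, 5
# unsupervised landmark coordinates per image, flattened to (n, 2*k_unsup)
X = rng.random((n, 2 * k_unsup))
# pretend the annotated landmarks are an exact linear function of them
true_W = rng.standard_normal((2 * k_unsup, 2 * k_annot))
Y = X @ true_W

# fit only the linear regressor (no bias term, for brevity);
# the unsupervised model itself is never updated
W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
pred = X @ W_hat
error = np.abs(pred - Y).mean()  # ~0 when the relation is truly linear
```

If a simple linear map suffices, the unsupervised landmarks must already carry the pose information, which is the point of the benchmark.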
When the bottleneck is replaced → performance degrades → hence the bottleneck is a fundamental element. (the BBC human pose results were good → but there were some limitations, since no supervised signals were present during training).
Wow, they can even transfer the style. (very impressive).
Is this the best approach…?
In conclusion, the authors developed a very smart way of using image generation for object landmark detection.