Unsupervised Keypoint Learning for Guiding Class-Conditional Video Prediction

Source: Deep Learning on Medium

Proud Korean researchers → what they are doing is learning video keypoints → but there is no ground-truth keypoint data → so how can this be done? (predicting future frames is hard → many methods simply compare the generated frame to the ground-truth frame with an L1 loss, which tends to produce blurry results)
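
For reference, that baseline objective is just a per-pixel L1 loss like the minimal sketch below (PyTorch, with hypothetical tensor shapes; this is not the paper's actual loss, only the comparison point being criticized):

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: (batch, time, channels, height, width)
pred_frames = torch.rand(8, 16, 3, 64, 64)   # frames produced by a video prediction model
gt_frames   = torch.rand(8, 16, 3, 64, 64)   # ground-truth future frames

# Plain per-pixel L1 loss: mean absolute difference over all pixels and frames.
# Optimizing only this tends to wash out high-frequency detail (blurry predictions).
l1_loss = F.l1_loss(pred_frames, gt_frames)
print(l1_loss.item())
```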

There are multiple methods for video prediction → there is no single right way → and each has its own problems. (most recent work uses GANs to make the generated frames more realistic)

As the methods have improved → the generated frames have become sharper and more realistic.

The inference examples show a cropped portion of the image → this is great → for an unsupervised method the results look very realistic. (generating video is a much harder task than image generation → because of the added temporal dimension)

There is a lot of related work → many researchers are tackling this problem → it is an interesting area of research.

Several networks are used → not only to generate the next frames but also to detect the keypoints → very interesting.

They are able to generate a video from a single image → this is good. (additionally, their method is robust and predicts keypoints without any labels) Training is done in two steps → first the keypoint detector (together with an image translator) is learned, then the motion generator. (the keypoint detection is learned through image translation → the model reconstructs one frame from another, so it has to capture the frame-to-frame differences as keypoints)

Wow, such an interesting training process → quite a few models are involved in building the loss function. (It seems that frame differences are what drive the keypoint learning → this is one of the standard recipes for unsupervised keypoint detection: exploit the known image translation between two frames, see the sketch below)
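
Roughly, that recipe looks like the following sketch (PyTorch). The module names, layer sizes, and shapes are my own placeholders rather than the authors' architecture; the point is only that a detector predicts keypoints for a target frame, a translator must rebuild that target frame from a source frame plus those keypoints, and a plain reconstruction loss is enough to make meaningful keypoints emerge:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeypointDetector(nn.Module):
    """Predicts K spatial heatmaps, then reduces them to (x, y) coordinates."""
    def __init__(self, num_keypoints=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_keypoints, 3, padding=1),
        )

    def forward(self, frame):
        heatmaps = self.backbone(frame)                           # (B, K, H, W)
        b, k, h, w = heatmaps.shape
        probs = heatmaps.view(b, k, -1).softmax(dim=-1).view(b, k, h, w)
        ys = torch.linspace(-1, 1, h, device=frame.device)
        xs = torch.linspace(-1, 1, w, device=frame.device)
        y = (probs.sum(dim=3) * ys).sum(dim=2)                    # expected y per keypoint
        x = (probs.sum(dim=2) * xs).sum(dim=2)                    # expected x per keypoint
        return torch.stack([x, y], dim=-1)                        # (B, K, 2)

def keypoints_to_gaussian_maps(kp, size, sigma=0.1):
    """Renders each (x, y) keypoint as a small Gaussian blob."""
    h, w = size
    ys = torch.linspace(-1, 1, h, device=kp.device).view(1, 1, h, 1)
    xs = torch.linspace(-1, 1, w, device=kp.device).view(1, 1, 1, w)
    x = kp[..., 0].view(*kp.shape[:2], 1, 1)
    y = kp[..., 1].view(*kp.shape[:2], 1, 1)
    return torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))

class Translator(nn.Module):
    """Reconstructs the target frame from the source frame plus target keypoints."""
    def __init__(self, num_keypoints=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + num_keypoints, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, source, keypoints, size):
        maps = keypoints_to_gaussian_maps(keypoints, size)        # (B, K, H, W)
        return self.net(torch.cat([source, maps], dim=1))

# One training step on a (source, target) frame pair taken from the same video.
detector, translator = KeypointDetector(), Translator()
opt = torch.optim.Adam(list(detector.parameters()) + list(translator.parameters()), lr=1e-4)

source = torch.rand(4, 3, 64, 64)   # e.g. frame t
target = torch.rand(4, 3, 64, 64)   # e.g. frame t+k from the same clip

kp = detector(target)                             # keypoints of the target frame
recon = translator(source, kp, size=(64, 64))     # try to rebuild the target frame
loss = F.l1_loss(recon, target)                   # reconstruction loss only, no labels
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item())
```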

Part of the loss function is a VGG19-based perceptual loss → comparing feature activations instead of raw pixels → this is already a strong loss to optimize.
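
A minimal sketch of such a perceptual loss (assuming torchvision's pretrained VGG19 and an arbitrary choice of feature layer; the paper's exact layers and weighting may differ):

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Frozen, pretrained VGG19 trunk; the cutoff at layer 16 is an arbitrary choice here.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(pred, target):
    """L1 distance between frozen VGG19 feature maps of two images.
    (Inputs would normally be normalized with ImageNet mean/std first.)"""
    return F.l1_loss(vgg(pred), vgg(target))

pred = torch.rand(2, 3, 128, 128)    # generated frame, values in [0, 1]
target = torch.rand(2, 3, 128, 128)  # reference frame
print(perceptual_loss(pred, target).item())
```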

This is a hard system to optimize overall → several networks are used inside the loss functions in addition to the generating model itself. (a lot of computation is needed)

So only a subset of the data was used → which could introduce some dataset bias. (there was data augmentation as well → this is interesting)

Cartoon animal characters are used as well.

And their method gives the best performance → when we visualize the generated frames we can actually see it. (their outputs are noticeably clearer and cleaner)

This is a great result → it seems that object keypoints really do help video generation. (their method can generate a whole video from just one input image, conditioned on an action class → very impressive, see the sketch below)
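
At inference time the pipeline would look roughly like the sketch below; every module here is a trivial placeholder for the three trained networks (keypoint detector, class-conditional motion generator, keypoint-to-frame translator), so only the data flow is real:

```python
import torch
import torch.nn as nn

class TinyDetector(nn.Module):
    def forward(self, image):                        # image: (B, 3, H, W)
        return torch.zeros(image.size(0), 10, 2)     # placeholder keypoints (B, K, 2)

class TinyMotionGenerator(nn.Module):
    def forward(self, keypoints, action_class, num_frames):
        # Would predict a future keypoint trajectory conditioned on the action class;
        # here it just repeats the initial keypoints.
        return keypoints.unsqueeze(1).repeat(1, num_frames, 1, 1)   # (B, T, K, 2)

class TinyTranslator(nn.Module):
    def forward(self, image, keypoints):
        return image                                 # would render a frame from image + keypoints

def predict_video(image, action_class, num_frames=16):
    detector, motion_gen, translator = TinyDetector(), TinyMotionGenerator(), TinyTranslator()
    kp0 = detector(image)                                    # keypoints of the single input frame
    kp_seq = motion_gen(kp0, action_class, num_frames)       # future keypoint sequence
    frames = [translator(image, kp_seq[:, t]) for t in range(num_frames)]
    return torch.stack(frames, dim=1)                        # (B, T, 3, H, W)

video = predict_video(torch.rand(1, 3, 64, 64), action_class=torch.tensor([3]))
print(video.shape)
```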

The image translator itself does a lot of work → it is also used to separate the foreground from the background.
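
A common way to implement that separation (a sketch under my own assumptions, not necessarily the paper's exact design) is to have the translator predict a foreground image and a soft mask, then composite them over the background:

```python
import torch

def composite(foreground, background, mask):
    """Blend a predicted foreground over a background using a soft mask in [0, 1].

    foreground, background: (B, 3, H, W) images
    mask:                   (B, 1, H, W) per-pixel foreground probability
    """
    return mask * foreground + (1.0 - mask) * background

fg = torch.rand(2, 3, 64, 64)     # e.g. the moving subject, rendered from keypoints
bg = torch.rand(2, 3, 64, 64)     # e.g. the static background from the input frame
mask = torch.rand(2, 1, 64, 64)   # predicted soft mask (random here, for illustration)
frame = composite(fg, bg, mask)
print(frame.shape)
```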

And they got better results across several different datasets. (they report several numerical metrics as well)

When they ran a user study on AMT (Amazon Mechanical Turk) → their results were ranked highly. (and they outperformed other methods even without any ground-truth keypoint data!)

Quite an amazing result.

But there are limitations → the method struggles when there are multiple objects in the scene.

Overall, they were able to generate realistic videos from just one image.