The #paperoftheweek is “Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning”
One of the common issues when working on video data for tasks like object recognition, tracking, or segmentation is the high cost of labeling. Another issue is that common feed-forward architectures process each frame on its own and thus discard the rich temporal structure available in the input data. This can lead, for instance, to temporal irregularities in the results, such as classifications that are inconsistent from frame to frame. Lotter et al. present an approach to training deep neural networks on video data that potentially addresses both issues.
The proposed architecture (“PredNet”) is trained on unlabeled video data. At every time step, its objective is to predict the next video frame. Each layer makes a local prediction of its input, and only the residual of that prediction is fed to the next higher layer, which in turn tries to predict its own incoming input (and so on to the top). Because it has to predict the next frame, the network needs to learn (and is able to learn) a lot about the regularities and dynamics of the visual world in a completely unsupervised fashion. In the following images you can see, for instance, how the network predicts the perspective motion of the street, fills in trees at the border of the image, or predicts the movement of cars traveling in different directions. The internal representations thus seem to be a good starting point for transfer learning.
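To make the residual-forwarding idea described above more concrete, here is a toy sketch in Python/NumPy. It is not the authors' implementation: the paper's layers are learned recurrent convolutional units, whereas the hypothetical `LastInputLayer` below simply predicts that its next input equals its previous one. The names `split_error`, `LastInputLayer`, and `run_hierarchy` are made up for this illustration. The only point is the data flow: each layer predicts its input, and only the rectified positive and negative prediction errors travel upward.

```python
import numpy as np


def split_error(target, prediction):
    """Stack the rectified positive and negative parts of the residual."""
    positive = np.maximum(target - prediction, 0.0)
    negative = np.maximum(prediction - target, 0.0)
    return np.concatenate([positive, negative], axis=0)


class LastInputLayer:
    """Toy stand-in for a learned layer (an assumption for illustration):
    it predicts that its next input equals the input it saw last."""

    def __init__(self):
        self.memory = None

    def step(self, x):
        # Before any input has been seen, predict all zeros.
        prediction = np.zeros_like(x) if self.memory is None else self.memory
        error = split_error(x, prediction)
        self.memory = x.copy()
        return prediction, error


def run_hierarchy(frames, n_layers=2):
    """Feed frames through a stack of layers; return mean error per layer."""
    layers = [LastInputLayer() for _ in range(n_layers)]
    mean_errors = []
    for frame in frames:
        signal = frame
        per_layer = []
        for layer in layers:
            _, error = layer.step(signal)
            per_layer.append(float(error.mean()))
            signal = error  # only the residual is passed to the next layer
        mean_errors.append(per_layer)
    return mean_errors


# A static "video" of three identical frames with shape (channels, height, width).
static_video = [np.ones((1, 4, 4)) for _ in range(3)]
errors = run_hierarchy(static_video)
```

On a static video, the bottom layer's error drops to zero from the second frame on, and the layer above follows one step later. In a trained network, the learned predictors would play the role of the placeholder layer, and the prediction errors would serve as the training signal.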
The authors demonstrate the usefulness of the pre-learned representations on different supervised tasks:
- For a network trained on videos of rotating 3D heads (with known ground truth for angles and velocities), the internal representations are shown to encode the angles and velocities of the observed heads.
- The internal representations of the same network show very competitive performance (compared to other autoencoder-style architectures) when they are used to train a classifier for faces that were not part of the training set.
- Last but not least, a network trained on car-mounted camera data is used for steering angle estimation and outperforms a CNN that was trained end-to-end specifically for this task with an order of magnitude more training data (for the supervised learning step).
In short, the presented architecture appears to be a promising way to reduce labeling costs for videos through unsupervised pre-training, in part by exploiting the rich temporal structure available in videos.
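The supervised experiments listed above share a common pattern: keep the pre-trained representation fixed and fit a simple readout on a small labeled set. Below is a minimal sketch of that pattern in Python/NumPy. Everything in it is an assumption for illustration: the "features" are synthetic random vectors standing in for network activations, the target is a made-up scalar (think of a head angle), and ridge regression was picked here as one plausible readout. The helper names `fit_ridge` and `make_synthetic_data` are hypothetical.

```python
import numpy as np


def fit_ridge(features, targets, alpha=1.0):
    """Closed-form ridge regression: solve (X^T X + alpha * I) w = X^T y."""
    n_features = features.shape[1]
    gram = features.T @ features + alpha * np.eye(n_features)
    return np.linalg.solve(gram, features.T @ targets)


def make_synthetic_data(n_samples, n_features=16, seed=0):
    """Synthetic stand-in for (frozen features, latent variable) pairs."""
    rng = np.random.default_rng(seed)
    # Fixed ground-truth readout shared by the train and test sets.
    true_weights = np.linspace(-1.0, 1.0, n_features)
    features = rng.normal(size=(n_samples, n_features))
    targets = features @ true_weights + 0.01 * rng.normal(size=n_samples)
    return features, targets


# Small labeled set for fitting the readout, separate set for evaluation.
train_x, train_y = make_synthetic_data(200, seed=0)
test_x, test_y = make_synthetic_data(50, seed=1)

weights = fit_ridge(train_x, train_y, alpha=1e-3)
test_mse = float(np.mean((test_x @ weights - test_y) ** 2))
print(f"held-out mean squared error: {test_mse:.5f}")
```

If a simple readout like this already performs well on held-out data, the representation itself carries the information, which is the kind of evidence the experiments above rely on.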
“While great strides have been made in using deep learning algorithms to solve supervised learning tasks, the problem of unsupervised learning — leveraging unlabeled examples to learn about the structure of a domain — remains a difficult unsolved challenge. Here, we explore prediction of future frames in a video sequence as an unsupervised learning rule for learning about the structure of the visual world. We describe a predictive neural network (“PredNet”) architecture that is inspired by the concept of “predictive coding” from the neuroscience literature. These networks learn to predict future frames in a video sequence, with each layer in the network making local predictions and only forwarding deviations from those predictions to subsequent network layers. We show that these networks are able to robustly learn to predict the movement of synthetic (rendered) objects, and that in doing so, the networks learn internal representations that are useful for decoding latent object parameters (e.g. pose) that support object recognition with fewer training views. We also show that these networks can scale to complex natural image streams (car-mounted camera videos), capturing key aspects of both egocentric movement and the movement of objects in the visual scene, and the representation learned in this setting is useful for estimating the steering angle. Altogether, these results suggest that prediction represents a powerful framework for unsupervised learning, allowing for implicit learning of object and scene structure”
Authors: William Lotter, Gabriel Kreiman, and David Cox (Harvard University).
You can read the full article here.
About the author:
Björn Weghenkel, Software Development Intern at Brighter AI.
About Brighter AI:
Brighter AI has developed an innovative privacy solution for visual data: Deep Natural Anonymization. The solution replaces personally identifiable information such as faces and license plates with artificial objects, thereby enabling all AI and analytics use cases, e.g. self-driving cars and smart retail. Check out our open positions!