ICLR 2021 Submission: Deeper VAEs Excel on Natural Image Benchmarks

Original article was published by Synced on Artificial Intelligence on Medium


Variational autoencoders (VAEs) are one of the most popular approaches to unsupervised learning of complicated distributions. A VAE consists of an encoder and a decoder, each built on a standard function approximator (a neural network). VAEs have shown promise in generating many kinds of complicated data, including faces, handwritten digits, and physical models of scenes.
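As a rough illustration of how the encoder and decoder fit together, the sketch below computes a single-sample estimate of the standard VAE training objective (the negative evidence lower bound) in plain Python. It assumes a diagonal Gaussian encoder and a standard normal prior; `toy_encode` and `toy_decode_log_prob` are hypothetical stand-ins for neural networks and are not taken from the paper.

```python
import math
import random

def gaussian_kl(mu, log_var):
    """KL(N(mu, exp(log_var)) || N(0, 1)), summed over latent dimensions."""
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv)
               for m, lv in zip(mu, log_var))

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def toy_encode(x):
    # Hypothetical encoder: returns the mean and log-variance of q(z | x).
    return [0.5 * v for v in x], [0.0 for _ in x]

def toy_decode_log_prob(x, z):
    # Hypothetical decoder: log-density of x under N(z, 1) in each dimension.
    return sum(-0.5 * (xi - zi) ** 2 - 0.5 * math.log(2 * math.pi)
               for xi, zi in zip(x, z))

def negative_elbo(x, encode, decode_log_prob, rng):
    """Single-sample estimate of -ELBO = -log p(x | z) + KL(q(z | x) || p(z))."""
    mu, log_var = encode(x)
    z = reparameterize(mu, log_var, rng)
    return -decode_log_prob(x, z) + gaussian_kl(mu, log_var)

rng = random.Random(0)
loss = negative_elbo([1.0, -2.0, 0.5], toy_encode, toy_decode_log_prob, rng)
```

The negative ELBO is an upper bound on the negative log-likelihood, so a lower value means a tighter guarantee on how well the model fits the data.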

Testing the premise that sufficiently deep VAEs could implement autoregressive models and other more efficient generative models, new research proposes a hierarchical VAE that outperforms PixelCNN in log-likelihood on all natural image benchmarks. The paper is currently under double-blind review for the International Conference on Learning Representations (ICLR) 2021, and so the author and institution identities remain masked.

Starting with the PixelCNN in 2016, autoregressive generative models have traditionally achieved the highest log-likelihoods across modalities, despite counterintuitive modelling assumptions. The paper explores whether sufficiently improved VAEs can outperform autoregressive models, a question the researchers believe has significant practical stakes.
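Log-likelihood results on image benchmarks such as these are commonly reported in bits per dimension, which normalizes the negative log-likelihood by the number of sub-pixels so that images of different sizes can be compared. The helper below is a minimal sketch of that conversion; the function name and the input value are illustrative assumptions, not figures from the paper.

```python
import math

def bits_per_dim(nll_nats, height, width, channels=3):
    """Convert a per-image negative log-likelihood in nats to bits per dimension."""
    num_dims = height * width * channels
    return nll_nats / (num_dims * math.log(2))

# Hypothetical value: an NLL of 3072 * ln(2) nats on a 32x32 RGB image
# corresponds to exactly one bit per sub-pixel.
example_bpd = bits_per_dim(3072 * math.log(2), 32, 32)
```

Lower values are better: a model that assigns higher likelihood to held-out images needs fewer bits per sub-pixel to encode them.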

The paper first presents theoretical justifications for why greater depth could improve VAE performance, then introduces an architecture capable of scaling past 70 layers (compared to 30 layers or fewer in previous work).
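For readers who want the underlying math, a hierarchical VAE with L groups of latent variables is usually written as follows. This is the standard textbook factorization and its evidence lower bound, given here as a sketch; the notation is an assumption and may differ from the paper's.

```latex
p_\theta(x, z) = p_\theta(x \mid z)\, p_\theta(z_1) \prod_{l=2}^{L} p_\theta(z_l \mid z_{<l}),
\qquad
q_\phi(z \mid x) = q_\phi(z_1 \mid x) \prod_{l=2}^{L} q_\phi(z_l \mid z_{<l}, x)

\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
- \mathrm{KL}\big(q_\phi(z_1 \mid x) \,\|\, p_\theta(z_1)\big)
- \sum_{l=2}^{L} \mathbb{E}_{q_\phi(z_{<l} \mid x)}
  \Big[\mathrm{KL}\big(q_\phi(z_l \mid z_{<l}, x) \,\|\, p_\theta(z_l \mid z_{<l})\big)\Big]
```

The chain of conditional priors has the same form as an autoregressive factorization, which gives some intuition for why a sufficiently deep VAE could in principle represent an autoregressive model.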

The researchers trained the very deep VAEs on the natural image datasets CIFAR-10, ImageNet-32, and ImageNet-64, and tested whether greater statistical depth, independent of other factors, could improve performance. Using more stochastic layers but fewer parameters than previous work, the VAEs outperformed GatedPixelCNN/PixelCNN++ models on all tasks.

The researchers demonstrated that their model uses fewer parameters than PixelCNN while generating samples thousands of times more quickly. The proposed model can also easily scale to larger images, and the researchers suggest such strengths may emerge from its learning an efficient hierarchical representation of images.

This paper reflects the machine learning community’s discovery that scaling up VAEs works surprisingly well for image modelling, even compared to more involved generative models that require autoregressive sampling. An NVIDIA paper published in July 2020, for example, shows that deep hierarchical VAEs with carefully designed network architectures can generate high-quality images and achieve state-of-the-art likelihood. The source code has been released to support research on VAE architectures and techniques, which the team hopes will encourage efforts “in further improving VAEs and latent variable models.”

The paper Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images is on OpenReview.