QMNIST: MNIST Dataset expanded with 50k new images

Source: Deep Learning on Medium

By Vlad Tămaș

Facebook AI Research and researchers from New York University have expanded the widely known MNIST dataset.

MNIST is one of the best-known and most widely used datasets for building computer vision systems. A great deal of research in recent years has relied on MNIST, and the dataset has become a baseline for many computer vision problems.

The reason behind this expansion is that the official MNIST test set contains only 10k randomly sampled images, which is often considered too small to provide meaningful confidence intervals.
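To get a feel for why 10k test images can be limiting, here is a rough worked example that is not taken from the paper. It assumes a hypothetical classifier with a 0.5% test error, treats test errors as independent, and uses the simple normal-approximation (Wald) interval for a proportion. The function name `ci_half_width` and the numbers are illustrative only.

```python
import math


def ci_half_width(error_rate, n_samples, z=1.96):
    """Half-width of a normal-approximation (Wald) interval for a proportion.

    z=1.96 corresponds to a 95% confidence level.
    """
    return z * math.sqrt(error_rate * (1.0 - error_rate) / n_samples)


# Hypothetical classifier with a 0.5% test error (illustrative value only).
error_rate = 0.005

# Compare a 10k test set with a 60k test set (10k plus 50k additional images).
for n in (10_000, 60_000):
    hw = ci_half_width(error_rate, n)
    print(f"n = {n:>6}: error = {error_rate:.2%} +/- {hw:.3%}")
```

Under these assumptions the interval is roughly ±0.14 percentage points with 10k images and roughly ±0.06 with 60k. The width shrinks with the square root of the sample size, so six times more test images narrow the interval by a factor of about 2.45.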

Through an iterative process, the researchers set out to reconstruct an additional 50k MNIST-like images. They started from the reconstruction process described in the paper and used the Hungarian algorithm to find the best one-to-one matches between the original MNIST samples and their reconstructed samples.
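The matching step can be illustrated with a minimal sketch. This is not the authors' code: the tiny 4-pixel "images", the pixel-wise squared distance used as the matching cost, and the helper names `squared_distance`, `build_cost_matrix`, and `hungarian` are all assumptions made for illustration. The solver is a compact pure-Python version of the Hungarian algorithm; in practice a library routine such as SciPy's `linear_sum_assignment` would likely be used instead.

```python
def squared_distance(a, b):
    """Sum of squared pixel differences between two flattened images."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_cost_matrix(originals, reconstructions):
    """cost[i][j] = distance between original i and reconstruction j."""
    return [[squared_distance(o, r) for r in reconstructions] for o in originals]


def hungarian(cost):
    """Minimum-cost assignment (rows <= columns), O(n^2 * m).

    Returns a list where entry i is the column matched to row i.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0] * (n + 1)      # row potentials
    v = [0] * (m + 1)      # column potentials
    p = [0] * (m + 1)      # p[j] = row currently matched to column j (1-based)
    way = [0] * (m + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matches along the augmenting path.
        while j0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [None] * n
    for j in range(1, m + 1):
        if p[j]:
            assignment[p[j] - 1] = j - 1
    return assignment


# Toy data: three "original" images and three noisy, shuffled reconstructions.
originals = [
    [0, 0, 255, 255],
    [255, 0, 0, 255],
    [0, 255, 255, 0],
]
reconstructions = [
    [250, 5, 0, 250],   # close to originals[1]
    [3, 250, 252, 4],   # close to originals[2]
    [2, 1, 251, 255],   # close to originals[0]
]

matches = hungarian(build_cost_matrix(originals, reconstructions))
for i, j in enumerate(matches):
    print(f"original {i} <-> reconstruction {j}")
```

Unlike a greedy nearest-neighbour search, the assignment is one-to-one and minimizes the total cost over all pairs, so no reconstructed sample can be claimed by two originals.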

After many iterations of refining the reconstruction algorithm to obtain closer matches between the reconstructed and the original samples, the researchers produced a dataset of 50k additional digit images.

The new QMNIST dataset makes it possible to re-examine existing methods and investigate how well they generalize, since many of them may have overfit to the small official MNIST test set.

The dataset, as well as a detailed explanation of the reconstruction process, can be found on GitHub. The preprint is available on arXiv.