Scanned Document Classification using Computer Vision

Source: Deep Learning on Medium

A deep learning approach to address the scanned document classification problem

In the era of the digital economy, sectors such as banking, insurance, governance, medicine and law still deal with various handwritten notes and scanned documents. In later stages of the business life cycle, maintaining and classifying these documents manually becomes very tedious. A simple and meaningful automated binning of these unclassified documents would make it much easier to maintain and use the information, and would significantly reduce manual effort.

Scanned Documents

The goal of this case study is to develop a deep learning based solution which can automatically classify the documents.

Data: For this case study, we will use the RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) data set which consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. There are 320,000 training images, 40,000 validation images, and 40,000 test images. The images are sized so their largest dimension does not exceed 1000 pixels. The size of this data set is more than 200 GB.

Business-ML problem mapping: We can frame the business problem as a multi-class classification problem. There are 16 classes in the current data set. We need to predict the class of each document based only on the pixel values of the scan, which makes the problem hard. But wait, why can’t we use OCR to extract text and apply NLP techniques? We were also excited about that idea, but low-quality scans resulted in poor text extraction. In practical business scenarios we also have no control over the quality of scans, so models that rely on OCR may generalize poorly even after proper preprocessing.

KPI and Business Constraints: The data set is fairly balanced, so we chose accuracy as the primary metric and micro-averaged F1 score as a secondary metric to penalize wrongly classified data points. We also used the confusion matrix to validate the performance of the model. There is a moderate latency requirement and no specific requirement for interpretability.
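As a minimal sketch of how these three quantities relate, the snippet below computes a confusion matrix, accuracy and micro-averaged F1 with plain NumPy. The function names are illustrative and not taken from the original implementation. One thing worth knowing: when every document has exactly one true label and one predicted label, micro-averaged F1 works out to the same number as accuracy.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def accuracy(cm):
    """Fraction of samples on the diagonal (correct predictions)."""
    return np.trace(cm) / cm.sum()

def micro_f1(cm):
    """Micro-averaged F1: pool TP, FP and FN over all classes first."""
    tp = np.trace(cm)
    fp = cm.sum(axis=0).sum() - tp  # column totals minus correct predictions
    fn = cm.sum(axis=1).sum() - tp  # row totals minus correct predictions
    return 2 * tp / (2 * tp + fp + fn)

# Toy example with 3 classes (the real data set has 16).
y_true = [0, 1, 2, 2]
y_pred = [0, 2, 2, 2]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)
print(accuracy(cm), micro_f1(cm))
```

The off-diagonal cells of the confusion matrix show which pairs of classes get mixed up, which a single accuracy number cannot tell you.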

Can we get anything from the pixel intensity and size of the documents?

Let’s visualize the mean pixel intensity and file size of the documents using box plots.

From the box plot, we can observe that the file sizes of some document types differ considerably from others, but there are also overlaps. For example, the file sizes of class 13 and class 9 are very different, but the file size of class 9 overlaps with those of classes 4, 6 and 7.

We can observe that for 75% of the cases, the mean pixel intensity of class 4 lies between 160 and 230. For approximately 50% of the cases, however, it overlaps with the mean pixel intensity of class 6. The mean pixel intensities of the other classes overlap as well.
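The numbers behind such a box plot can be computed roughly as follows. This is only a sketch, assuming each image is already loaded as a 2D grayscale array; the helper names are hypothetical. It returns the quartiles per class, which are the values a box plot draws as the box edges and the median line.

```python
import numpy as np

def mean_intensity(image):
    """Mean pixel value of one grayscale image given as an array."""
    return float(np.asarray(image, dtype=np.float64).mean())

def intensity_quartiles_by_class(images, labels):
    """Return {class_label: (q1, median, q3)} of per-image mean intensity."""
    per_class = {}
    for img, lab in zip(images, labels):
        per_class.setdefault(lab, []).append(mean_intensity(img))
    return {
        lab: tuple(float(q) for q in np.percentile(vals, [25, 50, 75]))
        for lab, vals in per_class.items()
    }

# Toy data: two images of class 4 and one of class 6.
images = [np.full((2, 2), 100), np.full((2, 2), 200), np.full((2, 2), 50)]
labels = [4, 4, 6]
print(intensity_quartiles_by_class(images, labels))
```

If the interquartile ranges of two classes overlap heavily, mean intensity alone cannot separate them, which matches the observation above.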

Analytical Approach

To solve the problem at hand, we trained convolutional neural networks (CNNs) on the augmented data. We trained the model both with and without data augmentation, and the results were comparable.

High Level Analytical Work Flow Diagram

Great! But how do we decide on the network architecture? And how do we train the network when the data cannot fit in memory all at once?

Training a neural network from scratch takes significant time and computational resources to converge. To avoid this, we used transfer learning: we started with the weights of networks pretrained on the ImageNet data set and retrained them on our data set. The current SOTA model for this kind of problem uses inter- and intra-domain transfer learning, where an image is divided into four parts: header, footer, left body and right body. A pretrained VGG16 model is first trained on the whole images (inter-domain), and this model is then used to train on the image parts (intra-domain).

In this experiment, we took a slightly different approach. Instead of intra-domain transfer learning using VGG16, we trained two models in parallel, VGG16 and InceptionResNetV2, and used a stack of the two as our final model. Our assumption was that, because of their different architectures, the two models would learn different aspects of the images, and that stacking them would result in good generalization. But how did we choose these models? The choice comes from the cross-validation results. We tried various network architectures such as VGG16, VGG19, DenseNet, ResNet and InceptionNet, and selected the best two.

We used the ImageDataGenerator class of Keras to preprocess and load the training data on the fly instead of loading the whole data set into memory at once.
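To show the idea behind loading data on the fly, here is a minimal sketch of a batch generator in plain NumPy. It is not the Keras API and not the code used in this case study; `paths`, `labels` and `load_fn` are hypothetical placeholders for a list of image files, their integer class labels and a function that reads one image into an array. Only one batch of images is held in memory at any time.

```python
import numpy as np

def batch_generator(paths, labels, load_fn, batch_size=32, num_classes=16):
    """Yield (images, one_hot_labels) batches forever, one batch in memory at a time."""
    n = len(paths)
    one_hot = np.eye(num_classes, dtype=np.float32)
    while True:
        order = np.random.permutation(n)  # reshuffle at the start of every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Load only the images of this batch and rescale pixels to [0, 1].
            x = np.stack([load_fn(paths[i]) for i in idx]).astype(np.float32) / 255.0
            y = one_hot[[labels[i] for i in idx]]
            yield x, y

# Usage with a fake loader that returns a white 4x4 single-channel image.
fake_paths = ["a.tif", "b.tif", "c.tif", "d.tif"]
fake_labels = [0, 3, 15, 7]
gen = batch_generator(fake_paths, fake_labels,
                      load_fn=lambda p: np.full((4, 4, 1), 255, dtype=np.uint8),
                      batch_size=2)
x, y = next(gen)
print(x.shape, y.shape)
```

A generator like this can be handed to a training loop that asks for one batch per step, so the memory footprint depends on the batch size rather than on the size of the data set.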

Final Training Phase of VGG16 network

OK. But how to deal with the hyperparameters?

For any CNN, the hyperparameters include the learning rate, pooling size, network size, batch size, choice of optimizer, regularization, input size, etc.

The learning rate plays a significant role in the convergence of a neural network. The loss functions used in deep learning problems are non-convex, which means that finding the global minimum is not easy in the presence of several local minima and saddle points. If the learning rate is too low, training converges slowly; if it is too high, the loss starts oscillating. For this case study, we used a technique called ‘Cyclic Learning Rate’ (CLR), which trains the network in such a way that the learning rate varies cyclically from one training batch to the next.

But why does it work? In CLR, we vary the learning rate between a lower and an upper bound. The periodically higher learning rate helps the optimizer escape if it gets stuck at a saddle point or in a local minimum.
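A common form of CLR is the triangular policy, where the learning rate climbs linearly from a lower bound to an upper bound and back again. The sketch below shows that schedule as a plain function of the batch index. The bounds and the step size are made-up illustration values, not the settings used in this case study, and the article does not state which CLR variant was used.

```python
import math

def triangular_clr(iteration, base_lr=0.001, max_lr=0.006, step_size=2000):
    """Learning rate for a given batch index under the triangular CLR policy.

    step_size is the number of batches in half a cycle (assumed value).
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# The rate starts at base_lr, peaks at max_lr after step_size batches,
# and is back at base_lr after 2 * step_size batches.
for it in (0, 1000, 2000, 3000, 4000):
    print(it, triangular_clr(it))
```

In practice this function would be called once per training batch to set the optimizer's learning rate before the weight update.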

For the other hyperparameters, we developed custom utility functions to check which configuration works better. Suppose that after 10 epochs we have an accuracy of 47%. We use the model at that point as a testing baseline, and with the utility functions we check which configuration (i.e. batch_size/optimizer/learning_rate) will result in better accuracy in future epochs.
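The original utility functions are not shown, so the following is only a guess at their general shape. It assumes a hypothetical callable `evaluate` that restores the baseline checkpoint, trains for a few more epochs with the given settings and returns the validation accuracy.

```python
def pick_best_config(configs, evaluate):
    """Score every candidate configuration and return the best one.

    configs:  dict mapping a name to a dict of settings.
    evaluate: hypothetical callable (settings -> validation accuracy) that
              resumes from the baseline checkpoint before training.
    """
    scores = {name: evaluate(settings) for name, settings in configs.items()}
    best_name = max(scores, key=scores.get)
    return best_name, scores

# Usage with a stand-in evaluate function (no real training happens here).
candidates = {
    "sgd_bs32": {"optimizer": "sgd", "batch_size": 32},
    "adam_bs64": {"optimizer": "adam", "batch_size": 64},
}
fake_scores = {"sgd": 0.52, "adam": 0.55}
best, scores = pick_best_config(candidates, lambda s: fake_scores[s["optimizer"]])
print(best, scores)
```

Starting every candidate from the same baseline keeps the comparison fair, because differences in accuracy then come from the configuration and not from the starting weights.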


We achieved an accuracy of 90.7% using the VGG16 model and 88% using InceptionResNetV2. The proportional stacked model of these two obtained a training accuracy of 97% and a test accuracy of 91.45%.
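The article does not spell out how the two models were combined. One plausible reading of "proportional stacking" is a weighted average of the class probabilities predicted by each model, which is what the sketch below does. The weight and the function name are assumptions made for illustration.

```python
import numpy as np

def weighted_ensemble(probs_a, probs_b, weight_a=0.5):
    """Blend two sets of class probabilities and return (blend, predicted classes).

    probs_a, probs_b: arrays of shape (n_samples, n_classes), e.g. softmax outputs.
    weight_a:         share given to the first model (assumed value); the second
                      model gets 1 - weight_a.
    """
    probs_a = np.asarray(probs_a, dtype=np.float64)
    probs_b = np.asarray(probs_b, dtype=np.float64)
    blend = weight_a * probs_a + (1.0 - weight_a) * probs_b
    return blend, blend.argmax(axis=1)

# Toy example with 2 samples and 2 classes.
vgg_probs = [[0.6, 0.4], [0.3, 0.7]]
inception_probs = [[0.2, 0.8], [0.4, 0.6]]
blend, preds = weighted_ensemble(vgg_probs, inception_probs, weight_a=0.5)
print(blend, preds)
```

The weight could be chosen on the validation set, for example in proportion to each model's validation accuracy.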

You can find the full implementation here.


  1. A. W. Harley, A. Ufkes, K. G. Derpanis, “Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval,” in ICDAR, 2015.