How I Built a Document Classification System using Deep Convolutional Neural Networks!

Source: Deep Learning on Medium

Creating Structured Data

The label files list the images and their categories in the following format:

path/to/the/image.tif category

We write a script that produces two separate pickle files: one for the labels and one for the paths to the images.

import joblib

# the three label files (train/val/test); file names here are placeholders
directories = ['train.txt', 'val.txt', 'test.txt']

for i in range(3):
    paths, labels = [], []
    with open(directories[i], 'r') as f:
        for line in f:
            # each line looks like "path/to/the/image.tif category"
            path, category = line.strip().split(' ')
            paths.append(path)
            labels.append(category)
    joblib.dump(paths, 'paths_%d.pkl' % i)
    joblib.dump(labels, 'labels_%d.pkl' % i)

Once we have our labels and paths separated into files, we can move forward and split our data into three parts: train, test and CV.

Let's check whether there is any variation in image sizes using a few sample images.

import cv2

# train, test and cv hold the image paths separated earlier
train_width = []
test_width = []
cv_width = []
for image in train[:100]:
    im = cv2.imread(image, cv2.IMREAD_GRAYSCALE)
    train_width.append(im.shape[1])
for image in test[:100]:
    im = cv2.imread(image, cv2.IMREAD_GRAYSCALE)
    test_width.append(im.shape[1])
for image in cv[:100]:
    im = cv2.imread(image, cv2.IMREAD_GRAYSCALE)
    cv_width.append(im.shape[1])

Let's visualize the image-width distribution.

import seaborn as sb
from matplotlib import pyplot as plt

widths = (train_width, test_width, cv_width)
for x in widths:
    sb.distplot(x, kde=True)
plt.show()

We see that there is a lot of variance in the image widths across all three sets: train, test and CV.

So resizing images will be an important part of this project.

Checking for class imbalances

This is a fairly balanced dataset, with all the classes having roughly equal representation.

Building Generators and File structures

We are going to have 3 folders, namely train, test and cv.

Each folder will have 16 subfolders, one for each class, numbered 0 to 15.

We move all the data into these folders according to their labels.

Due to computational constraints, we move only 10k images per class into these folders for training.
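The file-moving step above can be sketched with the standard library (a minimal sketch; the function name and folder layout are illustrative, assuming `paths` and `labels` are the lists unpickled earlier):

```python
import os
import shutil

def sort_into_folders(paths, labels, root):
    # move each image into root/<label>/ according to its class label
    for path, label in zip(paths, labels):
        dest = os.path.join(root, str(label))
        os.makedirs(dest, exist_ok=True)
        shutil.move(path, os.path.join(dest, os.path.basename(path)))
```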

As the dataset is huge, we need Keras data generators to load batches from disk to the model for efficient I/O. When initializing the generators we also normalize the data into the range 0 to 1 by dividing pixel values by 255.
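In Keras this normalization is `ImageDataGenerator(rescale=1./255)`; conceptually, the rescaling the generator applies amounts to this (a sketch, not the Keras internals):

```python
import numpy as np

def normalized_batches(images, labels, batch_size=128):
    # yield batches with uint8 pixels rescaled from [0, 255] to [0, 1]
    for i in range(0, len(images), batch_size):
        batch = images[i:i + batch_size].astype('float32') / 255.0
        yield batch, labels[i:i + batch_size]
```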

The Architecture

Oh well, okay, this thing that I dumped above seems super complicated, right?

So let's break it down piece by piece!


1. model_final_res

We are using an InceptionResNet-v2 initialized with weights trained on ImageNet. We resize our images to 128×128 and train this model, unfreezing all the layers.

You might have heard of ResNets and Inception nets, but what is an inception_resnet_v2?

As the name suggests, it is basically an Inception net combined with the residual-connection concept from ResNet.

Residual connections allow shortcuts in the model and have allowed researchers to successfully train even deeper neural networks, which has led to even better performance. This has also enabled significant simplification of the Inception blocks.

We use the Keras applications API to import the inception_resnet_v2 model with ImageNet weights, and set a variable input shape using (None, None, 3). We will be doing something interesting with this later in the blog.

We remove the top of the inception_resnet_v2 model, attach 2 dense layers with dropout, and finally attach a softmax layer of size 16.
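A minimal sketch of this head in Keras (the dense-layer sizes and dropout rate are assumptions, as the post does not state them; `weights=None` here only avoids the ImageNet download, the post uses `weights='imagenet'`):

```python
from keras.applications import InceptionResNetV2
from keras.layers import Dense, Dropout, GlobalAveragePooling2D
from keras.models import Model

base = InceptionResNetV2(include_top=False, weights=None,
                         input_shape=(None, None, 3))
x = GlobalAveragePooling2D()(base.output)  # pool the variable-size feature map
x = Dense(512, activation='relu')(x)       # hypothetical layer sizes
x = Dropout(0.5)(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
out = Dense(16, activation='softmax')(x)   # one unit per document class
model_final_res = Model(base.input, out)
```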

We will be using the old classic SGD with lr=0.1, momentum=0.0, nesterov=False.

We will penalize our learning rate by 10% if the validation accuracy does not improve for 2 epochs, and finally train for 30 epochs with batch_size=128.
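This schedule can be sketched as follows (a sketch of what Keras's ReduceLROnPlateau callback does; since "penalize by 10%" could mean multiplying by 0.1 or 0.9, `factor` is left as a parameter):

```python
def plateau_lr(lr, val_acc_history, patience=2, factor=0.1):
    # multiply the learning rate by `factor` when validation accuracy
    # has not improved for `patience` epochs
    if len(val_acc_history) > patience:
        best_before = max(val_acc_history[:-patience])
        recent_best = max(val_acc_history[-patience:])
        if recent_best <= best_before:
            return lr * factor
    return lr
```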

After training for 30 epochs, we get a decent accuracy of 85% on our test data.

2. model_final_res_2

After we have trained our base model with image size 128×128, we transfer these weights to another inception_resnet_v2 model, which we train on images of size 256×256.

This is a little trick I learned from a course where Jeremy Howard trains on smaller images and then uses those weights to initialize the same architecture for training on larger images.

The reason I think this technique works is that when a smaller image is provided to our architecture, it tries to learn all the minute features under the constraint of the smaller image size. Once it has learned these features under constraints, our model will eventually work well with larger images. This is just like data augmentation, which forces your model to learn better features, acting as a kind of regularization.

So everything remains the same, except we change our image size to 256×256 and train under the same conditions as above for 20 epochs.

Voilà! We get an accuracy of 89.37% on our test data.

model_final_res_2 now becomes our base model.


Transfer Learning involves the transfer of experience obtained by a machine learning model in one domain into another related domain [25]. While document classification and object classification apparently seem like divergent domains, architectures trained on the 1000 class ImageNet dataset have proven to function as generalized feature extractors.

In this work, an inception_resnet_v2 model, trained on ImageNet, is used as the initial weights of our holistic model, thus constituting an initial level-1 (L1) transfer of weights. The L1 transfer, of course, originates from a different domain and is the regular inter-domain form of transfer learning. However, the holistic model trained on whole images of the RVL-CDIP dataset can be thought of as a generalized document feature extractor. The training sets for the region-based models, although containing images of document regions and at a different scale, are still essentially images of documents. Thus, this concept is utilized by setting up another level (L2) of transfer learning in which the region-based models are initialized with weights from the holistic model instead of the original inception_resnet_v2 model.

Using the concept of intra-domain transfer learning, we will now train region-specific models by cropping our images in different ways.

Region Specific Models


We take the first 256 pixel rows from the top and discard the rest.

We do this by building a cropping function and passing that function to our generator.
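For instance, a top-crop preprocessing function might look like this (a sketch; the function name is illustrative, and Keras generators accept such a function via the `preprocessing_function` argument):

```python
import numpy as np

def top_crop(image, height=256):
    # keep only the first `height` pixel rows (the top region)
    return image[:height, :, :]
```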

We now load the weights from model_final_res_2 into a new inception_resnet_v2 model and train it for 5 epochs.

Using only the top region of the images, we get 85.6% accuracy.


We take the last 256 pixel rows from the bottom and discard the rest.

Using only the bottom region of the images, we get 82.09% accuracy.


We take the left-most 256 pixel columns and discard the rest.

Using only the left region of the images, we get 86.77% accuracy.




We take the right-most 256 pixel columns and discard the rest.

Using only the right region of the images, we get 84.36% accuracy.

We use the Adam optimizer with a 0.0001 learning rate to train all the region-specific models.

We now have all the models trained.

Stacked Generalization

Let's use the power of stacking to get SOTA results.

We take all 5 of our models:

  1. Holistic model: model_final_res_2
  2. Top crop model: mode_final_top
  3. Bottom crop model: mode_final_bottom
  4. Left crop model: mode_final_left
  5. Right crop model: mode_final_right

We concatenate the softmax outputs of all five models, for all of the validation data and test data.

We will train a simple 2-layer MLP meta-classifier on the softmax outputs generated from the validation data, and then test it on the test data.
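The feature construction for the meta-classifier is a straightforward concatenation (a sketch; the function name is illustrative):

```python
import numpy as np

def stack_softmax(model_outputs):
    # each array is (n_samples, 16); the result is (n_samples, 16 * n_models),
    # the input features for the 2-layer MLP meta-classifier
    return np.concatenate(model_outputs, axis=1)
```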

After training for 20 epochs, we get an accuracy of 90.5% on both test and train data.


import cv2
import numpy as np
from matplotlib import cm

cmap = cm.get_cmap('tab20')

def height_crop(path, type):
    image = cv2.imread(path)
    image = cv2.resize(image, (256, 512))  # cv2.resize takes (width, height)
    if type == 'bottom':
        image = image[-256:, :, :]  # bottom 256 rows
    else:
        image = image[:256, :, :]   # top 256 rows
    return image

def width_crop(path, type):
    image = cv2.imread(path)
    image = cv2.resize(image, (512, 256))
    if type == 'right':
        image = image[:, -256:, :]  # right-most 256 columns
    else:
        image = image[:, :256, :]   # left-most 256 columns
    return image

def full_image(path):
    image = cv2.imread(path)
    return cv2.resize(image, (256, 256))

def preprocess(im):
    im = im / 255  # normalize to [0, 1]
    return np.expand_dims(im, axis=0)

def predictions(images):
    top_pred = top.predict(images[0])
    # ... remaining model predictions elided in the post
    return prediction

def plot_bar_x(axes, prediction, doc_type):
    sort_index = np.argsort(prediction)[::-1]
    # ... plotting code elided in the post

Our output for a random image is spot on!


We achieved an accuracy of 90.5% with 1/3rd of the data using intra-domain transfer learning followed by stacked generalization.

Using more data could produce better results, and I am working on that.

I'll keep you posted once we beat the 92.4% benchmark!

You can get the entire code at:

Caution: the notebook shows the final result as 93%, which is an error caused by not resetting the generators. After fixing this, we found the accuracy to be 90.5%.