For a specific task I had to solve, I recently came across an interesting paper:
“Table detection is a crucial step in many document analysis applications as tables are used for presenting essential information to the reader in a structured manner. It is a hard problem due to varying layouts and encodings of the tables. Researchers have proposed numerous techniques for table detection based on layout analysis of documents. Most of these techniques fail to generalize because they rely on hand engineered features which are not robust to layout variations. In this paper, we have presented a deep learning based method for table detection. In the proposed method, document images are first pre-processed. These images are then fed to a Region Proposal Network followed by a fully connected neural network for table detection. The proposed method works with high precision on document images with varying layouts that include documents, research papers, and magazines. We have done our evaluations on publicly available UNLV dataset where it beats Tesseract’s state of the art table detection system by a significant margin.”
I decided to give it a try.
So — what do we need to implement this?
Before we go on, make sure you have everything installed to be able to follow the steps described here.
The following will be required to follow the instructions:
- Python 3 (I use Anaconda)
- Luminoth (which will also install TensorFlow)
First, we need the data. Going through the paper I found some links that point to a website with XML files containing the ground truth for the UNLV dataset. To keep things simple, I will provide an already prepared dataset based on those two sources to start with.
You can download the dataset here. Please extract it into a directory named “data”.
In the “data/images/” folder we have 403 image files from different types of documents like this one:
In addition to the images there are also two CSV files (train.csv and val.csv) with the ground truth data for this dataset. Each file has one line for each table found in each image, in the following format:
<filename>, <xmin>, <ymin>, <xmax>, <ymax>, <class> (in our case “class” will always be “table”)
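As a minimal sketch of how lines in this format could be read in Python, the snippet below parses them with the standard `csv` module and groups the boxes by file name. The helper name `read_ground_truth` and the sample rows are made up for illustration, and the sketch assumes integer pixel coordinates; a row whose coordinates are not numeric (for example a header row) is skipped.

```python
import csv
import io
from collections import defaultdict


def read_ground_truth(lines):
    """Group table boxes by image file name.

    `lines` is any iterable of CSV lines in the form
    <filename>,<xmin>,<ymin>,<xmax>,<ymax>,<class>.
    """
    boxes = defaultdict(list)
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) != 6:
            continue  # ignore blank or malformed lines
        filename, xmin, ymin, xmax, ymax, label = (field.strip() for field in row)
        if not xmin.isdigit():
            continue  # non-numeric coordinate, most likely a header row
        boxes[filename].append((int(xmin), int(ymin), int(xmax), int(ymax), label))
    return dict(boxes)


# Hypothetical sample rows, not taken from the real train.csv.
sample = io.StringIO(
    "page_a.png,100,200,900,700,table\n"
    "page_a.png,120,800,880,1100,table\n"
    "page_b.png,50,60,400,300,table\n"
)
ground_truth = read_ground_truth(sample)
print(len(ground_truth["page_a.png"]))  # number of tables listed for page_a.png
```

Grouping by file name is useful because a page with several tables appears on several lines.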
The first lines of the train.csv file look like this:
Preprocessing of images
The first part of the process is preprocessing the images. The text elements in documents are very small, and the network we use is normally applied to detecting real-world objects in images, so we need to transform the images to make their contents easier for the object detection network to interpret.
We will do this in the following steps:
- open the CSV file
- read in all image file names in that file
- for each image:
  - preprocess the image
  - save the image to data/train (for files from train.csv) or to data/val (for files from val.csv)
Let’s do this!
After you have done this there should be two additional directories in your “data” folder: “train” and “val”. These hold the preprocessed image files we will later use for training and for validating the results.
The images in those folders should now look like this:
But before we start training the network there is one additional step that has to be done.
Creating TFRecords for training the network
Now that we have the preprocessed files, the next step is to create the files needed as input for the training. Here we will use the Luminoth framework for the first time.
As Luminoth is based on TensorFlow we need to create TFRecords, which will be used as input for the training process. Luckily, Luminoth has some converters which you can use to transform your dataset accordingly.
To do this we will use the command line tool “lumi” which comes with Luminoth. In the directory where you placed the “data” folder open a terminal or command line and type:
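For reference, the conversion call can be written as follows, here wrapped in a small Python helper so it can also be run from a script. The sketch assumes the `lumi dataset transform` subcommand with its `--type`, `--data-dir`, `--output-dir`, `--split` and `--only-classes` options as shown in the Luminoth documentation, and that the `csv` reader finds train.csv and val.csv in the data directory. Check `lumi dataset transform --help` for your installed version. The helper name is made up for this example.

```python
import shlex
import subprocess


def transform_command(data_dir="data/", output_dir="tfdata/",
                      splits=("train", "val"), only_class="table"):
    """Build the argument list for converting the CSV dataset to TFRecords."""
    cmd = ["lumi", "dataset", "transform",
           "--type", "csv",
           "--data-dir", data_dir,
           "--output-dir", output_dir]
    for split in splits:
        cmd += ["--split", split]  # one --split option per CSV file
    cmd += ["--only-classes", only_class]
    return cmd


cmd = transform_command()
# Print the command in a form that can be pasted into a terminal.
print(" ".join(shlex.quote(part) for part in cmd))
# To actually run it (requires Luminoth to be installed):
# subprocess.run(cmd, check=True)
```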
This will create a folder called “tfdata” with the TFRecords needed for the training of the network.
Training the network
To start the training of the network with Luminoth we need to configure the training process.
This is done by writing a configuration file. There is a sample file available in the Luminoth Git repo, which I used to create a simple configuration (config.yml) for our task at hand:
Save this file to your working directory and we can start the training. Again we will use the “lumi” tool from Luminoth for this, so go to the terminal or command line (you should be in the directory that contains the “data” folder):
This will start the training process and you should see output like this:
Training the network can take quite a while. Once the loss gets close to 1.0 you can stop the training with <Ctrl+C>.
Ok, now we have a trained network — what next?
Using the trained network to make predictions
To use the trained network to make predictions we first need to create a checkpoint.
In the terminal or command line window type the following:
You should see something similar to this:
The last line with “Checkpoint c2155084dca6 created successfully.” holds the important information: the id of the created checkpoint (in this case c2155084dca6).
You need this identifier to make predictions for new images and to load the model into the lumi web server.
First we will use the command line tool to make a prediction (make sure to use the id of your checkpoint instead of c2155084dca6):
You should see something like the following:
The interesting part for us is the “bbox” entry: the numbers give the coordinates of the table area, with x0 = 160, y0 = 657 (upper left corner of the area) and x1 = 2346, y1 = 2211 (lower right corner of the area). This information can be used to mark the area in the original, unprocessed image, which looks like this:
So the network seems to have a good idea where the table can be found on that page.
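To pick the box out of the prediction programmatically, something like the following could be used. It assumes that the prediction is a JSON document whose `objects` list holds entries with `bbox`, `label` and `prob` fields. Only `bbox` is confirmed by the output described above; the other field names, the file name and the probability in the sample line are assumptions.

```python
import json


def table_boxes(prediction_json, min_prob=0.5):
    """Return (x0, y0, x1, y1) tuples for all detected tables."""
    prediction = json.loads(prediction_json)
    boxes = []
    for obj in prediction.get("objects", []):
        if obj.get("label", "table") != "table":
            continue  # some other class
        if obj.get("prob", 1.0) < min_prob:
            continue  # detection is not confident enough
        x0, y0, x1, y1 = obj["bbox"]
        boxes.append((x0, y0, x1, y1))
    return boxes


# Hypothetical output line; only the bbox numbers come from the example above.
line = ('{"file": "page.png", "objects": [{"bbox": [160, 657, 2346, 2211], '
        '"label": "table", "prob": 0.99}]}')
for x0, y0, x1, y1 in table_boxes(line):
    print("width:", x1 - x0, "height:", y1 - y0)
```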
You can try that on your own: take the predict command above together with the id of your trained checkpoint and use an image tool of your choice to draw an area with the given coordinates on the image. You will see that the area fits nicely around the table on the page.
If you want a quick view on the prediction you can also use the small web application that comes with Luminoth, but this can only be used with the preprocessed files.
To use it start the web server with the command (again — make sure to use the id of your checkpoint instead of c2155084dca6):
This will start the web server that comes with Luminoth, where you can upload the preprocessed images to see the predictions, which look like this:
Although this has to be done using the preprocessed images, you still get an idea of how well or badly the network detects the table areas.
In this article I gave a brief overview of how to implement the concept described in the research paper.
But detecting the table area alone is not of practical use: you still need some tools or libraries that can use these area definitions as input to actually get the content of the tables.
For those of you who want to go further on this, I recommend taking a look at tabula-py, a Python library that can take the area definitions as input to improve accuracy when extracting the data from the tables.
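As a rough sketch of that idea: tabula-py's `read_pdf` accepts an `area` argument in the order top, left, bottom, right, measured in PDF points, so a pixel box from the detector has to be reordered and rescaled first. The conversion below assumes the page image was rendered from the PDF at a known resolution (the 300 dpi default is only a placeholder). The helper names and `pdf_path` are hypothetical.

```python
def bbox_to_tabula_area(bbox, dpi=300):
    """Convert (x0, y0, x1, y1) in pixels to [top, left, bottom, right] in points."""
    x0, y0, x1, y1 = bbox
    scale = 72.0 / dpi  # one PDF point is 1/72 inch
    return [y0 * scale, x0 * scale, y1 * scale, x1 * scale]


def extract_table(pdf_path, bbox, page=1, dpi=300):
    """Read the table inside `bbox` from one PDF page (needs tabula-py and Java)."""
    import tabula  # imported here so the conversion works without tabula-py

    return tabula.read_pdf(pdf_path, pages=page, area=bbox_to_tabula_area(bbox, dpi))


# Convert the box from the prediction example, assuming a 300 dpi page image.
area = bbox_to_tabula_area((160, 657, 2346, 2211))
print([round(v, 2) for v in area])
```

Note the change in order: the detector reports x before y, while tabula-py expects the vertical coordinate first.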
Hmm…. maybe a good topic for a second article? ;-)
Source: Deep Learning on Medium