How I broke the captcha barrier for a Legal Tech company — Part 1

Source: Deep Learning on Medium

Photo by Clarisse Meyer on Unsplash

At Riverus, we strive to make legal information accessible to everyone. Fairly often, a few high court forums put publicly important information behind a captcha wall. Given the vast amount of data these forums generate every week, we need to get to that information quickly and with minimal human effort.

Textual alpha-numeric captchas are, for the most part, a solved problem for deep learning in computer vision. Building on that, I’ve created a system that lets me solve a new captcha pattern within 1–2 hours of trial and error, tweaking 12–15 lines of code.

Going forward, I’d like to build a GUI tool around it, to tighten the feedback loop and let a person modify parameters easily.

Main idea

  1. Since the digits in the captcha are almost always of the same fixed width and height and spaced out quite a bit, find the contours to identify individual digits. (Part 1 — You are here)
  2. Build a digit recognizer model from the dataset of individual digits extracted from the captcha images. (Part 2)
  3. At inference time, apply the same preprocessing steps to split the captcha into individual parts, then run the model on these sub-images to obtain the final result. Wrap it in a Flask app to serve it as an API. (Part 3)

I’ll demo one of the easy ones, but the technique and strategy stay the same for any kind of fixed-length alpha-numeric text captcha.

Sample Captcha used by Telangana forum

Bifurcate image into parts

Preprocess image

  • Convert into grayscale (Nothing changes because it’s already in grayscale)
Grayscale image
  • Threshold to increase the black and white contrast (Blur the image before this step if background shapes and lines of different colors aren’t eliminated by thresholding itself) [Reference]
Thresholded image
  • Erode to shrink the white blobs of the digits, or dilate to puff them up, as needed. [Reference]

Note: Erosion is shown here only as additional information, to illustrate the available capabilities.

Slice up the image

To achieve bifurcation –

  • Use findContours() to obtain a list of continuous individual shapes. Ideally there should be a list of 6 contours in this case (non-ideal cases, where digits merge, are handled by the width check described below).
Contours identified by findContours()
Bounding boxes
  • Use the x, y, w, h dimensions to isolate each digit which is our region of interest. I have added some black padding on all sides to give the digit some breathing space.
Bifurcated digits

Sometimes the digits are too close to each other and merge into one white blob, which gets recognized as a single contour. To counter this, a condition is put on the width of a contour. Let’s say the width of a digit is w. If the width of the contour is between 2w and 3w, then it is most probably 2 digits mashed together and needs to be split in half. Similarly, a width between 3w and 4w is split into three, and so on.
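The width rule can be written as a small helper. This is only an illustration of the rule as stated: `split_box` and `digit_width` are hypothetical names, and splitting into equal-width slices is an assumption that works best when the merged digits are of similar width.

```python
def split_box(box, digit_width):
    """Split a bounding box (x, y, w, h) into equal slices if it is too wide.

    A width in [2w, 3w) yields 2 boxes, [3w, 4w) yields 3, and so on,
    where w is `digit_width`. Narrower boxes are returned unchanged.
    """
    x, y, w, h = box
    n = max(1, w // digit_width)  # assumed number of digits in the box

    # Integer cut points, so the slices tile the box with no gaps or overlap.
    edges = [x + (i * w) // n for i in range(n + 1)]
    return [(edges[i], y, edges[i + 1] - edges[i], h) for i in range(n)]
```

For example, with `digit_width=20`, a contour 45 pixels wide is cut into two boxes of 22 and 23 pixels, while one 18 pixels wide is left alone.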

Creation of dataset

Looking at the above digits, you’d think that tesseract-ocr should do the trick. Unfortunately, that’s not the case, which is why a purpose-built model is required.

Now that we are able to obtain individual digits, let’s create our dataset for training a digit recognizer model.

python annotate.py -i CaptchaImages -a data

-i: Folder consisting of all input images

-a: Folder in which the annotations, dataset of individual images will be stored

Process

Running the above command opens a window showing a region of interest, which the user has to annotate correctly. Each annotated image is saved in the folder for its label (0–9).

Part 2 of this series will talk about the various approaches I used to build a model for this particular version. Follow for more updates.