Source: Deep Learning on Medium
How I broke the captcha barrier for a Legal Tech company — Part 1
At Riverus, we strive to make legal information accessible to everyone. Fairly often, a few high court forums restrict access to publicly important information behind a captcha wall. Given the vast amount of data these forums generate every week, we need to get to that information fast and with minimal human effort.
Textual alpha-numeric captchas are for the most part a solved problem in terms of deep learning in computer vision. Building on top of that, I’ve managed to create a system which lets me solve a new captcha pattern within 1–2 hours of trial and error and tweaking 12–15 lines of code.
Going forward, I’d like to build a GUI tool around this, so that there is a tight feedback loop and a person can modify the parameters easily.
The overall approach has three steps:
- Since the digits in the captcha are almost always of the same fixed width and height and spaced out quite a bit, find the contours to identify individual digits. (Part 1 — You are here)
- Build a digit recognizer model based on the dataset created from the captcha images such as above. (Part 2)
- During the inference stage, follow the same preprocessing steps used to split the captcha into individual parts, and run the model on these sub-images to obtain the final result. Wrap it in a Flask app to serve it as an API. (Part 3)
I’ll demo one of the easy ones but the technique and strategy to solve remains the same for any kind of fixed-length alpha-numeric text captchas.
Bifurcate image into parts
- Convert into grayscale (Nothing changes because it’s already in grayscale)
- Threshold to increase the black and white contrast (Blur the image before this step if background shapes and lines of different colors aren’t eliminated by thresholding itself) [Reference]
- Erode to shrink the white blobs of the digits, or dilate to puff them up, as needed. [Reference]
Note: Erosion is shown here only as additional information and to demonstrate the available capabilities.
Slice up the image
To achieve bifurcation:
- Use findContours() to obtain a list of continuous individual shapes. Ideally (non-ideal cases are handled in point 3) there should be a list of 6 contours in this case.
- Use the x, y, w, h dimensions of each contour to isolate each digit, which is our region of interest. I have added some black padding on all sides to give the digit some breathing space.
- Sometimes the digits are too close to each other and merge into one white blob, which gets recognized as a single contour. To counter this, a condition is put on the width of a contour. Let’s say the width of a digit is w. If a contour is wider than a single digit, up to 3w, then most probably it is 2 digits mashed together and it needs to be halved. Similarly for 4w and so on.
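The width check in the last point can be sketched in plain Python. This assumes the bounding boxes have already been collected as (x, y, w, h) tuples, for example by calling cv2.boundingRect on each contour returned by cv2.findContours. The helper name `split_wide_boxes` is hypothetical, and rounding the width to the nearest number of digit widths is a stand-in rule of mine. The exact width bands would need tuning per captcha.

```python
def split_wide_boxes(boxes, digit_width):
    """Split bounding boxes that look like several merged digits.

    boxes: iterable of (x, y, w, h) tuples, one per contour.
    digit_width: expected width of a single digit, in pixels.
    Returns the boxes in left-to-right order, with wide boxes split evenly.
    """
    result = []
    for x, y, w, h in sorted(boxes):  # sorting by x gives reading order
        # Estimate how many digits the box holds (assumed rounding rule).
        count = max(1, int(w / digit_width + 0.5))
        part = w // count
        for i in range(count):
            # The last piece absorbs leftover pixels from integer division.
            width = part if i < count - 1 else w - part * (count - 1)
            result.append((x + i * part, y, width, h))
    return result


# Two contours: one normal digit and one blob about two digits wide.
print(split_wide_boxes([(50, 2, 10, 20), (5, 2, 21, 20)], digit_width=10))
```

Each resulting box can then be cropped out of the thresholded image and padded with black pixels before it is saved or passed to the model.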
Creation of dataset
Looking at the digits above, you’d think that tesseract-ocr should do the trick. Unfortunately that’s not the case, which is why building a specific model is required.
Now that we are able to obtain individual digits, let’s create our dataset for training a digit recognizer model.
python annotate.py -i CaptchaImages -a data
-i: Folder consisting of all input images
-a: Folder in which the annotations, dataset of individual images will be stored
Running the above command opens a window showing a region of interest, which the user has to annotate with the correct digit. Each annotated image is then saved in the folder for its label (0–9).
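To make the resulting dataset layout concrete, here is a rough sketch of how a labelled digit could be stored in one folder per label. This is not the code of annotate.py. The function name `save_labelled_digit`, the running-count file names and the .png extension are all assumptions for illustration.

```python
from pathlib import Path


def save_labelled_digit(image_bytes, label, root="data"):
    """Write one encoded digit image to <root>/<label>/ and return its path."""
    if label not in set("0123456789"):
        raise ValueError(f"label must be a single digit 0-9, got {label!r}")
    folder = Path(root) / label
    folder.mkdir(parents=True, exist_ok=True)
    # Name files by running count so earlier annotations are not overwritten.
    index = len(list(folder.glob("*.png")))
    path = folder / f"{index:05d}.png"
    path.write_bytes(image_bytes)
    return path
```

A folder-per-label layout like this is convenient later on, because many training pipelines can infer the class of an image from its parent folder name.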
Part 2 of this series will talk about the various approaches I used to build a model for this particular version. Follow for more updates.