Deep learning neural networks have achieved great success in pattern recognition. At the same time, text CAPTCHAs are still used in some well-known free email services. It is interesting to know whether neural networks are able to cope with text CAPTCHA recognition, and if so, how.
What is a text CAPTCHA?
A CAPTCHA is a test used to determine whether or not the user is human: a task that is easily solved by any person but should be hard for a computer. One of its common types is text with coalesced symbols (see the example below). The text is additionally subjected to optical distortions.
A CAPTCHA is typically used on login pages to protect against bots sending spam.
Fully convolutional neural network
When the symbols are coalesced, they are usually very difficult to separate with heuristic algorithms, so it is necessary to search for each symbol at every position in the text image. A fully convolutional neural network copes with this task. A fully convolutional network is a convolutional network without fully connected layers: it takes an image as input and produces an image, or several images (center maps), as output.
The number of center maps is equal to the length of the alphabet of characters used in the CAPTCHA. The centers of the characters are marked on the center maps. The scale change that occurs in the network because of the pooling layers is taken into account. An example of the center map for the character “D” is shown below.
In this case, convolutional layers with padding are used so that the image size at the output of each convolutional layer is equal to the image size at its input. The spot profile on a center map is defined by a 2-dimensional Gaussian function with widths of 1.3 and 2.6 pixels.
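Such a target center map can be sketched as follows; the widths of 1.3 and 2.6 pixels come from the text, while the function name and its arguments are my own illustrative choices:

```python
import numpy as np

def center_map(h, w, cx, cy, sx=1.3, sy=2.6):
    """Build one target center map: a 2-D Gaussian spot at (cx, cy).

    sx, sy are the Gaussian widths in pixels (1.3 and 2.6 in the text).
    The map is all zeros if the character does not occur in the captcha.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-(((xs - cx) ** 2) / (2 * sx ** 2) +
                    ((ys - cy) ** 2) / (2 * sy ** 2)))

# One such map is produced per alphabet character; here, a spot for a
# character whose center sits at column 10, row 12 of a 54x24 output.
m = center_map(24, 54, cx=10, cy=12)
```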
At first, the fully convolutional network was tested on the character “R”:
For the test, a small network with 2 pooling layers, trained on a CPU, was used. After making sure that the idea works at least to some extent, a second-hand Nvidia GTX 760 graphics card with 2 GB of memory was bought. This made it possible to train larger networks for all characters of the alphabet and accelerated training approximately 10 times. The network was trained with the Theano library, which is no longer supported.
Training using a generator
Labeling a large data set by hand seemed long and difficult, so it was decided to generate CAPTCHAs with a special script; in this case, the center maps are generated automatically. I selected a font matching the one used in the Hotmail CAPTCHA. The generated CAPTCHAs were similar in style to the real ones:
The final accuracy after training on generated CAPTCHAs was approximately 2 times lower than after training on real CAPTCHAs. Probably, features such as the intersection of characters, their scale, line weight, distortion parameters, etc. are important, and the generator failed to reproduce them. The network trained on generated CAPTCHAs and tested on real CAPTCHAs reached about 10% accuracy. Accuracy here is the percentage of CAPTCHAs recognized properly; a CAPTCHA is considered recognized if all of its characters are recognized correctly. In any case, this experiment showed that the method works and that only the recognition accuracy needed to be increased.
Training with a real data set
For manual annotation of the real CAPTCHA data set, a script with a GUI was written in Matlab:
Here, the circles can be placed and moved with the mouse; a circle marks the center of a character. The manual annotation took 5–15 hours. There are services that perform manual annotation of data sets for a small fee, but, as it turned out, Amazon Mechanical Turk does not work with Russian customers. I placed an order for data set annotation on a freelance site. Unfortunately, the quality of the annotation was far from perfect, and I had to correct it myself. In addition, the search for a suitable person took time (1 week), and it turned out to be quite expensive: $30 for 560 labeled captchas. So I abandoned this approach and eventually ended up using sites for manual recognition of captchas, where the lowest price was $1 for 2000 captchas. But the answer I got there was only a string, so manual placement of the centers could not be avoided. Moreover, the performers in such services made mistakes or even acted in bad faith, typing an arbitrary string in response. As a result, I had to check and correct their errors.
Obviously, the recognition accuracy was not high enough, so the question of architecture selection arose. I was interested in whether one pixel in the output image “sees” the entire character in the input image:
Thus, I consider one pixel in the output image and ask: the values of which pixels in the input image affect the value of this pixel? The idea was the following: if a pixel does not see the whole character, then not all the information about the character is used and the accuracy is worse. To determine the size of this scope (the receptive field), I performed the following experiment: I set all the weights of the convolutional layers to 0.01 and the biases to 0, and fed the network an image in which all pixel values are 0 except for one central pixel. As a result, the network yields a spot at the output:
The shape of this spot is close to a Gaussian function, which raises the question: why is the spot round while the kernels in the convolutional layers are square? (The network uses 3×3 and 5×5 kernels.) My explanation is the central limit theorem. For random variables, even with different distributions, the distribution of their sum equals the convolution of their distributions. Thus, if we convolve any signal with itself many times, the result approaches a Gaussian function, and the width of that Gaussian grows as the square root of the number of convolutions (layers). If, in this network with constant weights, we look at where the pixel values are greater than zero in the output image, we get a square region (see the figure below) whose size is proportional to the sum of the kernel sizes of the convolutional layers.
I used to think that, due to the associative property of convolution, two successive 3×3 convolutions are equivalent to one 5×5 convolution: convolving two 3×3 kernels does give one 5×5 kernel. However, they are not equivalent in expressive power: two 3×3 kernels have 9 × 2 = 18 degrees of freedom, while one 5×5 kernel has 25. As a result, the network output is a Gaussian function whose width is less than the sum of the kernel sizes in the layers. So I found the answer to the question of which pixels at the output are affected by one pixel at the input, although the initial question was the reverse one. Both questions are equivalent, which can be understood from the following figure:
In the figure, imagine that this is a side view of the image, or that the image height equals 1. Each of the pixels A and B has its own area of influence in the output image (shown in blue): DC for A, CE for B. The value of pixel C is influenced by the values of pixels A and B and of all the pixels between them. The distances are equal: AB = DC = CE (taking scaling into account: the network has pooling layers, so the input and output images have different resolutions). As a result, we have the following algorithm for finding the size of the scope:
- Set constant weights in the convolutional layers and set the biases to 0
- Feed the input with an image that has one non-zero pixel
- Measure the spot size at the output
- Multiply the size by a scale factor that accounts for the different resolutions of the input and output images (e.g., with 2 pooling layers in the network, the output resolution is 4 times lower than the input, so the size must be multiplied by 4).
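A minimal 1-D sketch of this measurement, assuming flat constant-weight kernels (a stand-in for the real 2-D network); it also shows the central-limit effect described above, where the spot widens roughly as the square root of the number of layers:

```python
import numpy as np

def spot_after_layers(kernel_sizes, n=101):
    """Feed a single non-zero pixel through a stack of 'convolution
    layers' whose weights are all the same constant (biases 0) and
    return the resulting 1-D response profile."""
    x = np.zeros(n)
    x[n // 2] = 1.0
    for k in kernel_sizes:
        x = np.convolve(x, np.full(k, 1.0 / k), mode="same")
    return x

def width(profile, level=0.5):
    """Rough spot size: full width of the profile at level * max."""
    return int(np.count_nonzero(profile >= level * profile.max()))

# Two size-3 layers vs. eight: the half-max width grows roughly as the
# square root of the number of layers, and the profile shape approaches
# a Gaussian even though each individual kernel is flat.
w2 = width(spot_after_layers([3, 3]))
w8 = width(spot_after_layers([3] * 8))
```

The region where the profile is non-zero, by contrast, grows linearly with the sum of the kernel sizes, matching the square region mentioned above.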
To see which features the network uses, the following experiment was performed. The trained network is fed a CAPTCHA image, and at the output we get images with the marked centers of the characters. Then we choose one detected character and zero out all the center maps except the one that corresponds to this character. This network output is memorized as y0. Then, using gradient descent, we minimize the function

F(x) = Σ x² + c · Σ (y(x) − y0)²,

where x is the input image of the network, y(x) refers to the network output images, and c is a constant selected experimentally (c = 0.1). In this minimization, the input and output of the network are treated as variables and the network weights as constants. The initial value of the variable x is the CAPTCHA image; it is the starting point of the gradient descent optimization. The minimization reduces the pixel values of the input image while constraining the pixel values of the output image, so that only the pixels the network uses for character recognition remain in the input image.
The results of this experiment are as follows:
For character “2”:
For character “5”:
For character “L”:
For character “u”:
The images on the left are the original CAPTCHA images; the image on the right is the optimized image.
The square in the images shows the area where the output is greater than 0; the circles are isolines of the Gaussian function of the scope. The small circle is the isoline at 35% of the maximum value, the large one at 3%. The examples show that the network sees features within its scope. However, the “u” character has features outside the scope. A possible explanation is a partial false positive on the “n” symbol.
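The input-minimization experiment can be illustrated with a toy stand-in: a fixed linear map plays the role of the network, and the loss is assumed to have the form sum(x²) + c·sum((y − y0)²) with c = 0.1; all names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "network": a fixed linear map y = A @ x. The real network is
# convolutional; a linear map is enough to illustrate the minimization.
A = np.zeros((2, 6))
A[0, 0], A[1, 2] = 1.0, 1.0      # the "network" only looks at pixels 0 and 2

x0 = rng.uniform(0.5, 1.0, 6)    # the captcha image: the starting point
y0 = A @ x0                      # memorized network output
c = 0.1                          # constraint weight, as in the text

x = x0.copy()
lr = 0.05
for _ in range(2000):
    # Gradient of F(x) = sum(x**2) + c * sum((A @ x - y0)**2)
    grad = 2 * x + 2 * c * A.T @ (A @ x - y0)
    x -= lr * grad
# Pixels the "network" does not use decay to zero; the pixels it uses
# keep a non-zero value, so only the used pixels remain in the input.
```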
Many experiments with the network architecture were performed; the deeper and wider the network, the more complex the CAPTCHAs it can recognize. The most universal architecture was the following:
The blue font above the arrows shows the number of images (feature maps); c is a convolutional layer, p is a max-pooling layer; the green font below shows the kernel sizes. The convolutional layers use 3×3 and 5×5 kernels without stride, and the pooling layers use a 2×2 patch. There is a ReLU layer (not shown) after each convolutional layer. The input is a single image, and the output is 24 images (the number of characters in the alphabet). The convolutional layers use padding that makes the output of each layer the same size as its input. Padding adds zeros, but this does not affect the operation of the network, because the value of a CAPTCHA background pixel is 0: the network always takes a negative image (white letters on a black background). Padding only slightly slows the network down. Since the network has 2 pooling layers and each pooling halves the resolution, the output resolution is 4 times lower than the input resolution; for example, if the CAPTCHA input size is 216×96, the output will be 24 images of size 54×24.
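As a sanity check on the sizes, assuming size-preserving padded convolutions and 2×2 poolings as described (the exact layer order is in the figure; the sequence below is hypothetical):

```python
def shape_after(layers, h, w):
    """Track feature-map resolution through the net: 'c' layers use
    size-preserving padding, 'p' layers are 2x2 max-poolings that
    halve each dimension."""
    for layer in layers:
        if layer == "p":
            h, w = h // 2, w // 2
    return h, w

# Only the two pooling layers change the resolution:
# a 216x96 captcha yields 54x24 output maps, one per alphabet character.
out = shape_after(["c", "c", "p", "c", "c", "p", "c"], 96, 216)
```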
The transition from the SGD solver to the ADAM solver gave a noticeable boost in training speed, and the final quality improved. The ADAM solver was imported from the Lasagne module and used within Theano code; the learning rate was 0.0005, and L2 regularization was added through the gradient vector. I noticed that the result differed from training to training, which can be explained by the gradient descent algorithm getting stuck in an insufficiently optimal local minimum. I partially solved this problem in the following way: I started the training several times, selected several of the best results, continued to train them for several more epochs, then chose the single best result and trained it for a long time. Thus it was possible to avoid getting stuck in insufficiently optimal local minima, and the final value of the loss function was quite small. The figure shows the evolution of the loss function value:
The x-axis is the number of epochs, the y-axis is the value of the loss function. Different colors show different trainings. The training procedure is roughly the following:
1) start 20 trainings for 10 epochs each
2) select the 10 best results (with the lowest loss) and train them for 100 more epochs
3) choose the single best result and continue training it for another 1500 epochs.
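This best-of-restarts procedure can be sketched on a toy 1-D loss surface; the bumpy function below is a stand-in for the network loss, and the epoch counts follow the steps above:

```python
import numpy as np

rng = np.random.default_rng(1)

def loss(x):
    # A bumpy 1-D loss with many local minima, standing in for the
    # network's loss surface.
    return np.sin(5 * x) + 0.1 * (x - 2) ** 2

def train(x, epochs, lr=0.01):
    """Plain gradient descent with a numerical gradient."""
    for _ in range(epochs):
        g = (loss(x + 1e-5) - loss(x - 1e-5)) / 2e-5
        x -= lr * g
    return x

# 1) start 20 trainings for a few epochs
runs = [train(rng.uniform(-5, 5), 10) for _ in range(20)]
# 2) keep the 10 best (lowest loss) and train them longer
runs = [train(x, 100) for x in sorted(runs, key=loss)[:10]]
# 3) keep the single best result and train it for a long time
best = train(min(runs, key=loss), 1500)
```

A single random start often lands in a poor local minimum of such a surface; the selection between stages discards those runs early.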
This takes about 12 hours. Of course, to save memory, the trainings are performed sequentially; for example, in step 2) the 10 trainings were run one after another. To make this possible, I modified the ADAM solver of Lasagne so that it can save and load the state variables of the solver.
The data set was split into 3 parts to track overfitting of the network:
Part 1: the training data set, used to train the network
Part 2: the test data set, on which the network is checked during training
Part 3: the second test data set, used to test the quality of training after it is finished
Data sets 2 and 3 were small in my case: 160 captchas each. Data set 2 was used to find the optimal threshold, that is, the threshold applied to the output image: if a pixel value exceeds the threshold, a symbol is found at that location. Usually the optimal value of the threshold is 0.3–0.5. If the accuracy on the test data set is considerably lower than the accuracy on the training data set, overfitting has taken place and the data set has to be increased. If the accuracies are roughly the same but not high, then the architecture of the neural network has to be complicated and the training data set increased. There are two ways to complicate the architecture: increase the depth or increase the width of the net.
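The threshold search on data set 2 can be sketched like this; the per-character peak values and labels below are toy data, and the helper names are illustrative, not the author's code:

```python
import numpy as np

def captcha_accuracy(peak_values, labels, threshold):
    """Fraction of captchas in which every character position is
    classified correctly: a character counts as found where its
    center-map peak exceeds the threshold."""
    correct = [(np.array(p) > threshold).tolist() == l
               for p, l in zip(peak_values, labels)]
    return sum(correct) / len(correct)

def best_threshold(peak_values, labels, grid=np.linspace(0.1, 0.9, 81)):
    # Sweep thresholds on the small validation set and keep the one
    # with the highest captcha-level accuracy.
    return max(grid, key=lambda t: captcha_accuracy(peak_values, labels, t))

# Toy data: per-position output peaks and whether a character is there.
peaks = [[0.8, 0.2], [0.6, 0.7]]
labels = [[True, False], [True, True]]
t = best_threshold(peaks, labels)
```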
Image preprocessing also increased the recognition accuracy. An example of the preprocessing is below:
In this case, the least-squares method was used to find the midline of the rotated text. Rotation and scaling are then applied; the scaling uses the average text height. Hotmail often applies a variety of distortions:
These distortions need to be compensated.
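The midline fit can be sketched as follows; for simplicity it is applied to character center coordinates rather than image pixels, and the centers and the slope below are made-up values:

```python
import numpy as np

def fit_midline(centers):
    """Least-squares line y = a*x + b through the character centers;
    the slope a gives the rotation of the text."""
    xs, ys = zip(*centers)
    a, b = np.polyfit(xs, ys, 1)
    return a, b

def derotate(centers):
    """Rotate the centers so the fitted midline becomes horizontal."""
    a, _ = fit_midline(centers)
    angle = np.arctan(a)
    c, s = np.cos(-angle), np.sin(-angle)
    return [(x * c - y * s, x * s + y * c) for x, y in centers]

# Character centers lying on a line tilted by a slope of 0.2:
centers = [(x, 0.2 * x + 5.0) for x in (10, 30, 50, 70)]
flat = derotate(centers)
```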
It is always interesting to read about other people’s failures, so I am going to describe mine here.
There was the problem of a small data set: good recognition required a large data set that had to be labeled manually (about 1000 captchas). I made many attempts to train the network with a small data set. I attempted to train a network on the recognition results of another network, selecting only those CAPTCHA images, and the places in them, where the network was confident; the confidence was determined by the pixel value in the output image. In this way the data set could be enlarged. However, the idea did not work: after several iterations of training, the quality of recognition deteriorated, the network failed to recognize some characters or confused them, and recognition errors accumulated.
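One iteration of this self-training idea can be sketched as follows; all names and the confidence threshold are illustrative:

```python
import numpy as np

def pseudo_label_round(model_predict, unlabeled, confidence=0.9):
    """One self-training iteration: keep only the captchas where the
    network's output peak is above the confidence threshold at every
    character position, and reuse its predictions as labels."""
    new_data = []
    for image in unlabeled:
        peak_values, predicted_labels = model_predict(image)
        if np.min(peak_values) >= confidence:   # confident everywhere
            new_data.append((image, predicted_labels))
    return new_data

# Toy stand-in for the trained network: per-character confidences
# and a predicted string for each of two captcha ids.
fake = {1: ([0.95, 0.97], "AB"), 2: ([0.95, 0.40], "CD")}
picked = pseudo_label_round(lambda im: fake[im], [1, 2])
```

As the text notes, repeating such rounds amplified the network's own errors rather than fixing them.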
Another attempt to train with the small data set was to use a Siamese network. A Siamese network takes a pair of captchas as input, and a data set of N captchas yields N×N pairs, so we could get a lot of training examples. The net converts a captcha into a map of vectors; as the similarity metric between vectors, the scalar product was chosen. The Siamese network was supposed to operate as follows: it compares a portion of the CAPTCHA image with some reference character image, and if it sees the same character up to distortion, it considers that the corresponding character is present at this location of the captcha. The training of the Siamese network turned out to be hard: it often got stuck in a suboptimal local minimum, and its accuracy was noticeably lower than that of a conventional network. Perhaps the problem was a wrong choice of the similarity metric.
There was also the idea of using an autoencoder to pre-train the lower part of the network (the part closer to the input) in order to accelerate training. An autoencoder is a network trained to output the same image it receives as input; its architecture is organized to have a narrow part. Training an autoencoder is unsupervised.
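A minimal stand-in for such an autoencoder (the author's was convolutional): a linear autoencoder with a narrow bottleneck, which under a squared-error loss is equivalent to PCA and can therefore be written directly with an SVD; the data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Images" that really live in a low-dimensional space: 100 samples of
# dimension 64 generated from 4 latent factors.
latent = rng.normal(size=(100, 4))
basis = rng.normal(size=(4, 64))
X = latent @ basis

# Encode = project onto the top-4 principal directions (the narrow
# part), decode = project back; no labels are used anywhere.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
encode = lambda x: x @ Vt[:4].T      # 64 -> 4
decode = lambda z: z @ Vt[:4]        # 4 -> 64

X_rec = decode(encode(X))            # reconstruction of the input
```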
An example of how the autoencoder works:
The first image is the input image, the second image is the output image.
First, I trained the autoencoder. Then I took the lower part of the network, added new untrained layers, and fine-tuned the whole net on the target task. In my case, the use of the autoencoder did not accelerate training of the network.
There was also an example of a CAPTCHA that used color (see below):
On this CAPTCHA, the fully convolutional network method gave no result, even after applying various image preprocessing methods that enhance contrast. I suppose that fully convolutional networks do not cope well with low-contrast images. However, this captcha was recognized with a usual convolutional network with a fully connected layer, with an accuracy of about 50%. The coordinates of the characters were determined with a special heuristic algorithm.
What else could be improved
Automatic annotation of the character centers could be added. Services for manual recognition of captchas return only the recognized strings, so automatic annotation of the centers would fully automate the preparation of the training data set. The idea is to select only those CAPTCHAs in which each symbol occurs once and to train a separate convolutional network for each character. Such a network would answer only one question: is the character present in the CAPTCHA? Then one could check which features the network uses with the method of minimizing the input pixel values (described above). The collected features would make it possible to localize the character, and a fully convolutional network could then be trained on the found character centers.
A text captcha can be recognized by a fully convolutional neural network in most cases, so the time has probably come to stop using text captchas. Google has not been using a text CAPTCHA for a long time; it uses pictures of various objects that need to be recognized by a human.
However, even this problem seems solvable for convolutional networks. We can assume that in the future there will be registration centers for people: for example, live Skype interviews where one person talks to another and checks scans of passports or other documents. The person would then receive a digital signature allowing him or her to automatically log on to any website.
© Maxim A. Vedenev
Source: Deep Learning on Medium