Original article was published by Robin T. White, PhD on Artificial Intelligence on Medium
Applied Computer Vision
OCR and the WordSearch solver AI
Using a custom OCR model, Pytesseract and array manipulation to automatically solve Wordsearches
Recently, I have seen a lot of posts on Sodoku solvers. The algorithm for it is quite simple, usually a backtracking recursion algorithm in only a few lines of code in Python. My favorite video is from Computerphile and Professor Thorsten Altenkirch, who I could watch all day. If using a computer vision approach, a simple model already trained on MNIST is all you need, or something equivalent. A really great video explaining all the steps can be found here.
So where does that leave us with Wordsearch. Well, I was getting bored with seeing multiple versions and videos of Sodoku puzzle solvers. So I instead thought of seeing if anyone had done a Wordsearch solver. To my surprise there wasn’t something quite what I was looking for. Now, that is not to say it people haven’t done this, but I felt like it would be a good project to practice and so decided to not look further into it as I wanted to solve this myself. For this I am using WordSearch.com
After completing this project I did stumble across a nice post from Martin Charles, which is typical to most I have seen where the text needs to be entered by hand.
This was a good project because it touched on the following topics:
* Custom OCR — Training (including getting images) while handling small and imbalanced data, Testing, Deploying
* Image Processing — Finding the grid etc.
* Path finding — The brain. This lead me down a long and interesting path although I stuck with my original simple solution in the end
* Automated control — Getting the computer to solve the puzzle
For the sake of brevity, I will refrain my discussion to the OCR, Path finding and automated control. After all, finding the grid was easy and the same conditions used in the Sodoku solvers, for which there are many, can be used. Here is the result of the image processing which finds the grid box (outlined in red), the words box (outlined in yellow) and each letter bounding box outlined in green.
The first option before training my own model was to use Pytesseract. Unfortunately, however, right off the bat, this didn’t find the letters in a useable manner even after cropping the grid. If I instead cropped each letter, it still wasn’t super accurate and also incredibly slow. However, I did use Pytesseract to obtain the words to search for by passing the words box (outlined in yellow above). This left me with the relatively simple task of making a single letter OCR.
I first started by creating a dataset by gathering images of individual letters and manually creating my dataset. I used a simple script that gathered each letter bounding box, displayed on screen and took as input a letter to label my training data. Since I am using a specific font and images obtained from screenshots, I could get away with a small amount of training data. However, it was incredibly unbalanced. After going through about 5 grids worth of letters, which is about 1000, I only had seen about 3 ‘Q’s, 3 ‘Z’s, 3 ‘F’s and surprisingly a few other small amount of certain letters. On the other hand, I had a pretty high number of a, e, o’s etc. So much so I actually started skipping them.
To solve this, I used a Balanced Data Generator when applying data augmentation such that there would always be the same number of examples for each letter. This used imblearn to generate a random oversampler combined with augmentation in Tensorflow Keras. I found this from another medium post here. This means, a higher number of augmentations would be applied to Q over N to balance out the data. These augmentations were simple; slight rotation, shift, zoom and in the end obtained a high accuracy for validation data and essentially perfect on the test data which is each new Wordsearch. Of course this is only tested on my computer etc. but it does the trick at the moment.
For the model itself, I used just a simple LeNet architecture. This proved to be accurate and sufficient so hey, don’t fix what isn’t broken. This wasn’t real time fast on my GTX 1060 GPU, it took about 10s to classify all 196 letters. I’m not sure though what was the bottleneck; if it was the image processing to get each letter or the classification itself. I did do a resizing to the classic 28×28 image. I could perhaps get away with a smaller image and maybe it would be quicker. I wasn’t too concerned for the time, but maybe a consideration if you want this as a real time app. Here is a final result showing the letter bounding boxes using OpenCV and the found representative letter as a string written on the image.
The grid was then represented as a 2D numpy array of letters, just as if you would have typed it yourself. Also, each word to find was saved as a list. Now we know what to find, we just need to find it.
As I mentioned this led me down an interesting research path, and I probably learned the most here. I started with using Networks to find all paths with start and end points of the 1st and last letter in each word. I then parsed these results into valid moves (up/down, diagonal, horizontal). However, it was very slow from the many many possible combinations when I went full scale. I took a step back and instead made a very simple solution which simply checks the neighbors at each position of the first letter in each word. Then there is a neighbor of the next letter, it then checks each letter in that direction for the length of the word. If it matches the word, we have a solution, and if not we keep looking. A more general approach which I came to find could be to use recursion again keeping only neighbors of subsequent letters in the word. This also allows for non-valid moves, and is a more general path finder. As I said, I kept with my first usable solution. Here is a schematic of the algorithm I used. It is simple and fast.
If you are interested in seeing a really nice general solution, I started a reddit discussion which someone posted a really terrific solution in. Here is the link to it.
I also added some parser to remove things like ‘.’ or ‘-’ which are sometimes present in words.
For this, we can thank our good friend PyAutoGui. The ‘hard part’ was just getting the screen pixel coordinates from the numpy array of letters which we were using to define our grid. I created another array that stored the centroid positions of the letters when we first defined our bounding box before doing the OCR for each letter. I also kept the grid position offset from the screenshot. Simple math will tell us the location, after accounting for differences between OpenCV coords and numpy array position. A single dictionary could have been used for each letter and the corresponding pixel position instead of two separate arrays, but I mean, either way…To simplify the automation in wordsearch.com you can choose either a ‘drag’ action to find the word, or you can use ‘tap’ at the start and end of the word. This was simpler although honestly the same. So if you want to try it out for yourself, I did implement ‘tap’ word selection method. This is in the settings on the wordsearch page. I purposefully added a delay of 1 second and a sleep time of 0.2s between each word just to make sure any internet connection lag or other wouldn’t cause it to go crazy.
All-in-all, this is a pretty satisfying result to watch all the words be found. I certainly learned a lot and got to practice some useful techniques. I am quite happy with the performance.