Smart Sudoku Solver using OpenCV and TensorFlow in Python3

Source: Deep Learning on Medium

Smart Sudoku Solver using OpenCV and TensorFlow in Python3

I love solving Sudokus.

Well, not really but for the sake of a rather dramatic backstory let’s digress for now.

One day I got stumped by a particularly hard puzzle and decided to create a Sudoku Solver, not an ordinary one though but one that tries to recognize and extract a Sudoku from a photo.

Unnecessarily complicating things is my forte

Let’s get started.

Excited, I rubbed my hands, cracked my knuckles, booted up my trusty HP, downloaded the below image to be the “guinea pig”, opened up PyCharm and dove right into it.

Loaded the image in grayscale mode since we aren’t really concerned about the colors here.

Let’s call this little guy “testimg” from now on.

Image Processing

Step 1: Blurring and thresholding

The first step is to threshold the image. It’s a good idea to blur the image before thresholding to reduce noise and/or detail.

Blurring means exactly what the name suggests.

A small illustration of how Gaussian Blur works

Thresholding is the process of converting a grayscale image to a pure black and white image. Every pixel in a grayscale image has a value ranging from 0 (Black) to 255 (White). Based on a certain threshold, any pixel with pixel value lesser than the threshold will be made 0 and any pixel with pixel value higher than the threshold will be made 255.

“The world is either black or white, nothing in between” — Thresholding.

Simple Thresholding is basically where the threshold remains constant. Here’s our testimg after simple thresholding:

Kindly ignore the wrong window name below

Not good at all.

Hmm, what might be the reason?

The reason is that since there might be differences in lighting in different parts of the image, a constant threshold won’t cut it.

Here comes Adaptive Thresholding to the rescue.

“Change the lighting as thou desire, I shall adapt” — Adaptive Thresholding

Adaptive Thresholding as the name suggests changes the value of the threshold based on the neighborhood pixels using a certain function. Two of those functions are Mean and Gaussian.

Adaptive Gaussian Vs Adaptive Mean Thresholding

Step 2: Inversion and Dilation

Inverting as the name suggests implies converting all black pixels to white and white pixels to black.

Dilation in simple terms is a way to thicken boundaries by sort of adding pixels.

Normal vs Dilated Image

After Steps 1 and 2:

I played around with a few combinations of different blurs, thresholding, and dilation along with inversion:

And finally settled on this combination:

The reason is that the borderlines and numbers are solid and thick. We aren’t worried about the inner box lines since they are unnecessary.

Grid Extraction:

Step 1: Detecting the grid

Let’s talk about blobs.

BLOB stands for Binary Large Object and refers to a group of connected pixels in a binary image. Here’s an example to illustrate the meaning better.

Here the red circles are blobs of black connected pixels.

Now let’s take a look at the below image.

Notice something?

The Outer Grid is a BLOB of white pixels!

It’s also easy to see that the area enclosed by the outer grid is pretty large, in fact, the largest.

So all we have to do is find the biggest BLOB which is going to be our board!

How to find the biggest blob?

The answer is flood filling.

Flood filling is basically an algorithm that starts from a certain point (Seed point) and colors all the connected points of the same pixel value.

We started flood filling from a white point inside the larger shape with green color and as you can see, the entire shape is colored green!

Flood filling with gray in progress:


And now all we have to do is flood fill all the blobs except for the largest one with black to get rid of them.

And Voila!

We managed to isolate the outer grid

Remember dilation? Allow me to introduce his sworn rival erosion.

Erosion is a process of thinning boundaries by sort of removing pixels.

Let’s just say that if the right is the original image, then the left image is eroded.

Let’s erode the grid a bit to undo the dilation that we applied earlier.

Step 2: Finding the borderlines

Let’s start by finding all the lines. We can do that using the Hough Line Transform.

Hough Line Transform. Let’s just say that this is a technique to identify straight lines in an image. Read more about this here.

The next step is to find the extreme or borderlines.

Here’s the simple yet elegant logic.

The nearest line from the top with slope almost 0 (almost horizontal) would be the upper edge.

The nearest line from the bottom with slope almost 0 (almost horizontal) would be the lower edge.

The nearest line from the left with slope almost infinity (almost vertical) would be the left edge.

The nearest line from the right with slope almost infinity (almost vertical) would be the right edge.

Surprisingly simple and intuitive isn’t it?

The lines found and drawn.

Step 3: Finding the border points.

Once the lines are found, we can easily find the intersection points with some linear algebra.

Think Determinants.

Borderlines and points plotted on the original image

Step 4: Correcting perspective and cropping the grid out.

Now we need to crop this part out and “correct” the perspective.

You can see that the image looks as if the top part is kinda “farther” away from us than the bottom part.

Warp Perspective. It is a way to correct perspective that makes use of a Perspective transform matrix or Homography matrix to perform the transformation. In simple terms, it corrects the perspective.

This example perfectly illustrates correcting the perspective.

Whew! We finally got the grid.

Step 5: Some image processing

Now for some juicy thresholding and inversion.

Step 6: Cell extraction

Now it’s time to get a knife because we are gonna be slicing this bad boy up into 81 pieces.

*Devilishly smiling* and *Eagerly sharpening a pair of knives*

Here’s how one of the chopped cells looks like:

Very shabby indeed.

Step 7: Cell cleaning

We need to get rid of those dirty white patches in the background but how?

*Brainstorming hard*

Remember flood filling from earlier?

If we flood fill from all the white outer layer points with black, then we can get rid of those ugly white patches !

Huh, guess not.

If the white patches start somewhere from an inner layer, then we have a problem.

*Scratching head as frustration intensifies*

I finally came up with the (not so) brilliant solution of:

Flood filling from the last 2 or 3 layers instead of only the outer layer.

Since it (mostly) worked, I decided to stick with it.

Step 8: Center the number in the image

So I found the bounding box of the number and centered the image by cyclically shifting the pixels.

So how to find the bounding box?

Start from the center of the image and start moving in all 4 directions.

The moment we hit a row or column without any white pixels at all, it means that we reached the end of the number!

The solitary stalwart in a wide expanse of darkness

Digit Recognition

We got the grid and we can proceed to recognize the digits in each of the 81 cells right?

No, there’s a problem.

How do we recognize empty cells beforehand?

*Brainstorming intensifies*

*Stop that and think simple*

Look at the below picture of an empty cell:

Any ideas?

Notice something?

Empty cells don’t have ANY white pixels at all.

So all we have to do is sum up the pixel values of all the pixels and check if it is lesser than say a certain threshold.

The reason for the threshold is that there may be leftover white patches.

I set the threshold to be 5 white pixels (for safety).

Let’s move on to Recognition now.

Time to summon Tensorflow and Sklearn.

I first built and trained a Convolutional Neural Network (CNN) on the MNIST handwritten digits dataset.

The MNIST dataset is a dataset of 70,000 28X28 pixels images of handwritten digits.

Let’s talk about what a CNN is, before that what a Neural Network is:

Neural Network from a very high level of abstraction is an algorithm or rather a series of algorithms that try to mimic the brain.

Composed of layers of nodes or neurons. The math is mind-boggling and involves a lot of calculus, linear algebra, and statistics.

Watch out, these machines getting smarter by the day.

So a CNN is a special class of Neural Networks that is primarily used for Computer Vision.

CNN aims to mimic human vision in layman’s terms.

So a CNN is basically a Neural Network with a bunch of extra layers added.

Next, I turned to a trusty simple Machine Learning Algorithm called K Nearest Neighbours Classifier.

“Tell me who your neighbor is, I will tell you who you are” — KNN

K Nearest Neighbours: As the name might have suggested, in layman terms, the KNN Classifier uses the distance from a neighbor to determine which class a particular entity belongs to.

When we train KNN on Digit recognition, we are basically creating a 10-dimensional space and plotting on it.

Try visualizing that.

Despite both getting around 97 percent accuracy on the MNIST dataset, the KNN performed better on the testimg.

I decided to implement both and code in a variable to control which one is used by the application.


And finally

Solving the puzzle.

A technique called backtracking can be used to easily implement a solution.

Backtracking basically refers to the process of exploring all possible options and if any of the explored options become wrong at some time, then discard those options and continue exploring.

So to solve the Sudoku, we can do this:

  1. Find an empty slot.
  2. Find all the numbers that can be inserted into that slot without violating the row, column or inner box properties.
  3. Try each one of these numbers and recurse. If at least one of these numbers is successful then we have found a solution otherwise no solution.

Here’s the solution screen:


I used Tkinter to build the UI.

Main UI:

All the frames are stacked upon each other and whenever any frame is to be displayed, Tkraise is used to raise the frame.

First frame:

Second frame:

The second frame displays all the stages that the image goes through starting from the initial image until recognition.

Few of the stages:

Third frame:

The third frame displays the recognized Sudoku Grid. Here the user can change any wrongly recognized entries and click on reveal solution to see the solution.

GitHub Project link:


For those wondering about dabbling with it and possibly improving it, I have written a very detailed README and the entire codebase is well commented, so you should not have any problem understanding the code.

Helpful Resources:

Understanding Neural Networks

Machine Learning Basics with the K-Nearest Neighbors Algorithm

AI Shack

SuDoKu Grabber in OpenCV is a tutorial that really helped me out with the grid extraction portion. I followed the same steps listed in this tutorial for the grid extraction portion.

TkInter — Python Wiki

OpenCV-Python Tutorials

Backtracking | Introduction — GeeksforGeeks

Sudoku | Backtracking-7 — GeeksforGeeks