Mario vs. Wario: Image Classification in Python

I remember spending a lot of my preschool years playing games on my favourite Game Boy. Two of my favourite platforming games were Mario and Wario. I remember when my grandmother took a look at the game I was playing and asked me what it was. I explained that it was Super Mario. Sometime later when she saw me playing again, she looked at the screen and said: “Mario again? How long is this game?” But it was a totally different game, Wario. This memory inspired me to play around with image recognition and see whether I can train a classifier that accurately identifies which game a screenshot comes from.

In this article I use two approaches. The basic one is logistic regression, while the more advanced one is a Convolutional Neural Network (using Keras with the TensorFlow backend). I do not focus on explaining the logic or maths behind the algorithms, as there are already plenty of great articles on Medium and elsewhere. Instead, I try to show how a simple, random idea can be quickly converted into a data science project.

For brevity’s sake, I only post a few code snippets, while the entire code is available on my GitHub.

Data preparation

In line with my childhood memories I chose two games for this experiment: Super Mario Land 2: 6 Golden Coins and Wario Land: Super Mario Land 3. I chose them not only because they were my favourites back then, but also because the two games look quite similar, which should make the task a bit harder!

I was wondering what the best way was to get large quantities of screenshots from those games and decided to ‘scrape’ them from playthrough videos on YouTube. Python’s pytube library helps with this task: I can download the whole videos quite effortlessly with a few lines of code.
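A minimal sketch of the download step, assuming the pytube package is installed. The function name `download_playthrough`, the output folder and the choice of the first progressive MP4 stream are illustrative assumptions, not necessarily what the original code does.

```python
from pathlib import Path


def download_playthrough(url: str, out_dir: str = "videos") -> str:
    """Download one YouTube video as MP4 and return the path of the saved file."""
    # Imported lazily so this module can be loaded without pytube installed.
    from pytube import YouTube

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # Progressive streams contain both video and audio in a single file.
    stream = (
        YouTube(url)
        .streams.filter(progressive=True, file_extension="mp4")
        .first()
    )
    if stream is None:
        raise RuntimeError("no progressive MP4 stream found for this video")
    return stream.download(output_path=out_dir)
```

Calling the function once per playthrough URL would leave one MP4 file per game in the chosen folder.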

The next step involves cutting the frames from the videos. To do so, I iterate over all frames (using the OpenCV library) and only save every n-th frame to a designated folder. I decided to use 10k images (5k per game). In both approaches I use the same 80–20 train-test split to ensure comparability.

When scraping the frames, I skip the first 60 seconds of the videos, which contain mostly the opening sequence and menus (I do not do the same at the end of the videos, so some noise might be included, but we’ll see!).
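The frame extraction described above can be sketched roughly as follows. The step `every_n=30`, the file naming and both function names are assumptions made for illustration; only the 60-second skip and the 5k frames per game come from the text. `frames_to_keep` shows the selection logic on its own, while `extract_frames` applies it with OpenCV.

```python
def frames_to_keep(total_frames, fps, skip_seconds=60, every_n=30, max_frames=5000):
    """Return the indices of the frames that would be saved."""
    start = int(round(fps * skip_seconds))  # first frame after the skipped intro
    return list(range(start, total_frames, every_n))[:max_frames]


def extract_frames(video_path, out_dir, skip_seconds=60, every_n=30, max_frames=5000):
    """Save every n-th frame after the intro as a JPEG; return how many were written."""
    import os
    import cv2  # OpenCV, imported lazily so frames_to_keep works without it

    os.makedirs(out_dir, exist_ok=True)
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS)
    start = int(round(fps * skip_seconds))
    index, saved = 0, 0
    while saved < max_frames:
        ok, frame = capture.read()
        if not ok:  # end of the video (or the file could not be read)
            break
        if index >= start and (index - start) % every_n == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        index += 1
    capture.release()
    return saved
```

Picking `every_n` so that the 5,000 frames are spread over the whole playthrough, rather than taken from its first minutes, should give a more varied sample of levels.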

A sample image from Mario class
A sample image from Wario class

After looking at the previews it is obvious that the images are not the same size. That is why I rescale them to 64×64 pixels. Also, for logistic regression I convert the images to greyscale to reduce the number of features for the model from 12,288 (64×64×3) to 4,096 (the CNN will handle 3 colour channels).
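A small sketch of this preprocessing step, assuming the frames are read with OpenCV (which returns them in BGR channel order). The function names and the choice of `INTER_AREA` interpolation are illustrative assumptions.

```python
def n_features(size=64, channels=1):
    """Number of pixel features of a square image with the given side and channels."""
    return size * size * channels


def preprocess(frame_bgr, size=64, grey=True):
    """Resize a BGR frame (as read by OpenCV) to size x size, optionally in greyscale."""
    import cv2  # imported lazily; needs the opencv-python package

    resized = cv2.resize(frame_bgr, (size, size), interpolation=cv2.INTER_AREA)
    if grey:
        resized = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)
    return resized


print(n_features(64, 1), n_features(64, 3))  # 4096 12288
```

Dropping the colour channels cuts the number of inputs to a third, which matters for a linear model that learns one weight per pixel.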

64×64 greyscale image for logistic regression

Logistic Regression

I will start with the simpler model. Logistic regression is one of the basic binary classifiers, i.e., using a set of predictors it assigns each observation to one of two classes.

Having said that, to use logistic regression for solving an image classification problem I first need to prepare the data. The input should be exactly the same as in other models from Scikit-Learn, namely feature matrix X and labels y.
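As a rough illustration of that preparation, each 64×64 greyscale image can be flattened into one row of X. The helper name `build_xy` and the label encoding (0 for Mario, 1 for Wario) are assumptions, and the random arrays only stand in for real frames.

```python
import numpy as np


def build_xy(mario_images, wario_images):
    """Stack greyscale images into a feature matrix X and a label vector y.

    Each image becomes one row of X (64x64 pixels -> 4096 features);
    y holds 0 for Mario and 1 for Wario (an arbitrary, assumed encoding).
    """
    images = list(mario_images) + list(wario_images)
    X = np.stack([np.asarray(img, dtype=np.float32).ravel() for img in images])
    y = np.array([0] * len(mario_images) + [1] * len(wario_images))
    return X, y


# Tiny synthetic stand-ins for real frames, just to show the shapes
rng = np.random.default_rng(0)
mario = [rng.integers(0, 256, size=(64, 64)) for _ in range(3)]
wario = [rng.integers(0, 256, size=(64, 64)) for _ in range(2)]
X, y = build_xy(mario, wario)
print(X.shape, y.shape)  # (5, 4096) (5,)
```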

As the goal of this article is to show how to build an image classifier for a particular problem, I do not focus on tuning the algorithm and use default settings of the logistic regression. Let me jump straight to the results!

Confusion matrix for logistic regression’s predictions on the test set

Above I present the results on the test set, i.e. the part of the data the model could not use for training (20% of the data). This seems pretty awesome and actually maybe a bit too good to be true. Let’s inspect a few cases of correctly/incorrectly classified images.
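For reference, fitting the default model, building the confusion matrix and picking out misclassified test images can be sketched like this. The data here is synthetic (two blobs with different mean brightness), and the stratified split and `random_state` are assumptions, so the numbers will not match the results above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the flattened frames: two classes with different mean brightness
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.3, 0.1, size=(100, 64)), rng.normal(0.7, 0.1, size=(100, 64))])
y = np.array([0] * 100 + [1] * 100)

# 80-20 split; stratifying keeps both classes equally represented (an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression()  # default settings, no tuning
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
misclassified = np.flatnonzero(y_pred != y_test)  # test-set positions to inspect by eye
print(cm)
print("misclassified test positions:", misclassified)
```

The positions in `misclassified` index into the test set, so they can be used to pull up the corresponding images for a visual check.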

Correctly classified images
Misclassified images

The logic behind the wrong classification of 4 out of 5 images is pretty obvious. These are transition screens, where the model has nothing to work with. The second screen comes from the map of levels in Super Mario, which is clearly distinct from the rest of the game (there is no platforming here). However, we can also see that the model correctly classified another map (image 3 from the correctly classified images).

Convolutional Neural Network

This part will obviously be a bit more complex than logistic regression. The first step involves storing the images in a particular way, so Keras can do its magic:


This directory tree shows how I structure folders and files for this particular project. The next part is data augmentation. The idea is to apply some random transformations to the available images to allow the network to see more unique images for training. This should prevent overfitting and result in better generalisation. I only use a few transformations:

  • rescale: the value by which the data is multiplied before any other processing. The original images consist of RGB coefficients in the 0-255 range. Such values can be too high for the model to process (with a typical learning rate), so multiplying by a factor of 1/255 rescales them to the 0-1 range
  • shear_range: for randomly applying shearing transformations
  • zoom_range: for randomly zooming inside pictures
  • horizontal_flip: for randomly flipping half of the images horizontally (relevant when there are no assumptions of horizontal asymmetry, e.g. real-world pictures). I decided not to use this feature, as in the case of video game screenshots it would make no sense (numbers etc.)

When specifying the path for the images I also determine the size of the images I want to feed to the neural network (64×64, the same as with logistic regression).
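The augmentation setup described above might look roughly like this with Keras' `ImageDataGenerator` (an older API that recent TensorFlow releases mark as deprecated). The shear and zoom magnitudes, the batch size and the function name are assumed values; `flow_from_directory` infers the classes from the subfolders of the directory it is given.

```python
# Augmentation settings; the shear and zoom magnitudes are assumed values
AUGMENTATION = {"rescale": 1.0 / 255, "shear_range": 0.2, "zoom_range": 0.2}
TARGET_SIZE = (64, 64)


def make_generators(train_dir, test_dir, batch_size=32):
    """Return (train, test) iterators that read images from class subfolders."""
    # Imported lazily; requires TensorFlow with the Keras preprocessing module.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    train_datagen = ImageDataGenerator(**AUGMENTATION)
    # Test images are only rescaled, never augmented
    test_datagen = ImageDataGenerator(rescale=AUGMENTATION["rescale"])

    train = train_datagen.flow_from_directory(
        train_dir, target_size=TARGET_SIZE, batch_size=batch_size, class_mode="binary"
    )
    test = test_datagen.flow_from_directory(
        test_dir, target_size=TARGET_SIZE, batch_size=batch_size,
        class_mode="binary", shuffle=False,  # keep order so predictions align with labels
    )
    return train, test
```

Note that `horizontal_flip` is left out on purpose, in line with the reasoning in the list above.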

Below I show an example of an image after applying some transformations. We see that the image is stretched at the sides.

Now it is time to define the CNN. Firstly, I initialise 3 convolutional layers. In the first one I also need to specify the shape of the input images (64×64, 3 RGB channels). For the later layers, Keras handles the size automatically. For all of them I use the ReLU (Rectified Linear Unit) activation function.

After the convolutional layers comes flattening. As the last two layers are basically a regular ANN classifier, I need to convert the data from the convolutional layers into a 1D vector. Between the two dense layers I also employ dropout. To put it simply, dropout ignores a specified fraction of neurons (chosen at random) during training. It is a way of preventing overfitting. The last dense layer uses the sigmoid activation function and returns the probability that a given observation belongs to one of the classes. The very last step is basically what logistic regression does.
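A sketch of an architecture matching this description. The filter counts, the 3×3 kernels, the max-pooling layers after each convolution, the dense layer width and the dropout rate are all assumptions (the text does not specify them), and the input shape is given through an `Input` layer rather than in the first convolutional layer. `spatial_size` only illustrates the size bookkeeping that Keras does automatically.

```python
def spatial_size(size=64, n_blocks=3, kernel=3, pool=2):
    """Side length of the feature map after n blocks of (valid conv, max pooling)."""
    for _ in range(n_blocks):
        size = (size - kernel + 1) // pool
    return size


def build_cnn(input_shape=(64, 64, 3), dropout_rate=0.5):
    """Three ReLU conv blocks, then flatten, dense, dropout and a sigmoid output."""
    # Imported lazily; requires TensorFlow.
    from tensorflow.keras import layers, models

    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),                       # feature maps -> 1D vector
        layers.Dense(64, activation="relu"),
        layers.Dropout(dropout_rate),           # randomly zeroes a fraction of units during training
        layers.Dense(1, activation="sigmoid"),  # probability of the positive class
    ])


print(spatial_size())  # 6, so Flatten sees 6 * 6 * 64 = 2304 values
```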

Now it is time to run the CNN (this can take a while…). I use Adam as the optimiser, select binary cross-entropy as the loss function for this binary classification task and use accuracy to evaluate the results (there is no need to use a different metric because in this particular case accuracy is what I am interested in).
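To make the loss and the metric concrete, here is a small NumPy version of both, followed by a sketch of the corresponding Keras calls. The helper names and the number of epochs are assumptions; `compile_and_fit` expects an already built Keras model and data iterators.

```python
import numpy as np


def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


def accuracy(y_true, p_pred, threshold=0.5):
    """Share of observations whose thresholded prediction matches the label."""
    predicted = np.asarray(p_pred) > threshold
    return float(np.mean(predicted == np.asarray(y_true).astype(bool)))


def compile_and_fit(model, train, test, epochs=10):
    """Compile a Keras model with Adam and binary cross-entropy, then train it."""
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model.fit(train, validation_data=test, epochs=epochs)


print(accuracy([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.1]))  # 0.75
```

Confident correct predictions give a loss close to zero, while confident wrong ones are penalised heavily, which is what the optimiser pushes against during training.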

So how did the neural network perform? Let’s see!

Confusion matrix for CNN’s predictions on the test set

Well, the accuracy is lower than in the case of logistic regression, but it is still very good for such a quickly built model. It is possible to fine-tune the network by changing the number of convolutional/dense layers, changing dropout, performing additional transformations on images and so on. It could also be the case that the transformations hide some data from the images (such as the summary bar at the bottom of the images). Actually, I initially suspected that this bar might play a significant role in identifying the images, as it is present in almost all of the screenshots and it is slightly different between the two games. But I will come back to it in a moment.

Now it is time to inspect a few examples of correctly/incorrectly classified images. What stands out at first glance is that in this case there are no such obvious misclassification examples as screen transitions.

Correctly classified images
Misclassified images

Explaining the classification with LIME

As a bonus, I try to explain CNN’s image classification with LIME (Local Interpretable Model-Agnostic Explanations). Being model-agnostic means that LIME can be applied to any machine learning model. It is based on the idea of modifying feature values of a single observation and observing the effect on the prediction. I highly recommend the paper [2] introducing the idea.
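A minimal sketch of how such an explanation could be produced with the `lime` package. The function names, `num_samples` and `num_features` are assumptions. Because LIME expects one probability column per class, the single sigmoid output of the network is expanded into two columns first.

```python
import numpy as np


def as_two_class_probs(p_positive):
    """Turn sigmoid outputs of shape (n, 1) or (n,) into an (n, 2) probability matrix."""
    p = np.asarray(p_positive, dtype=float).reshape(-1, 1)
    return np.hstack([1.0 - p, p])


def explain_image(model, image, num_samples=1000, num_features=10):
    """Return (image, mask) marking superpixels for and against the top predicted class."""
    # Imported lazily; requires the `lime` package.
    from lime import lime_image

    explainer = lime_image.LimeImageExplainer()
    explanation = explainer.explain_instance(
        image.astype("double"),
        classifier_fn=lambda batch: as_two_class_probs(model.predict(batch)),
        top_labels=1,
        hide_color=0,
        num_samples=num_samples,
    )
    return explanation.get_image_and_mask(
        explanation.top_labels[0],
        positive_only=False,  # keep both supporting and opposing regions
        num_features=num_features,
        hide_rest=False,
    )
```

The image passed in should be scaled the same way as the training data (here, pixel values in the 0-1 range), otherwise the model's predictions on the perturbed copies are not meaningful.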

Below I show the results of applying LIME explanation to images. The green area indicates a positive influence towards the predicted class, red – a negative one. In the correctly classified cases the character is always in the green area, which makes sense. However, for the misclassified cases that is not true, and some of the images have only one colour on them. This provides some insight, but I think that even more can be extracted from the LIME explanations with additional tinkering.

LIME explanation for correctly classified images
LIME explanation for misclassified images


In this article I presented how to quickly transform a random idea into an image classification project. Both considered approaches perform well on the dataset and I believe the CNN can achieve a better score given some tuning.

Some potential ideas for further tinkering:

  • adding more games (different platforming games or different instalments of Mario/Wario) to investigate how the models perform within a multi-class environment
  • cutting the images so that they do not contain the summary bar at the bottom (which is similar, yet distinct per game)
  • trying to detect Mario/Wario in the images (object detection problem)

I hope you enjoyed the article. If you have any suggestions for potential improvements to the framework or models, please let me know in the comments!

Code used in this article can be found on my GitHub.


[1] Keras blog post on CNN:

[2] LIME paper:

Disclaimer: I do not own any rights to video games related content or videos from YouTube.

Source: Deep Learning on Medium