Apples and Oranges — what’s the difference?

Original article can be found here (source): Artificial Intelligence on Medium

A picture is worth a thousand words — and a whole lot of numbers

We, as humans, can pretty easily analyze an image qualitatively. A computer, on the other hand, needs numbers. An image is stored as a grid of pixels, and each pixel has 3 values: (R, G, B). For a standard 8-bit image, each of these values is an integer from 0 to 255. The image as a whole, then, is represented as a 3-D array of shape height × width × 3.
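To make the "image as numbers" idea concrete, here is a minimal sketch using NumPy. The 2×2 image is made up for illustration; a real photo would simply have a much larger height and width.

```python
import numpy as np

# A tiny, hypothetical 2x2 RGB image: shape is (height, width, channels).
image = np.array(
    [
        [[255, 0, 0], [0, 255, 0]],      # red pixel, green pixel
        [[0, 0, 255], [255, 255, 255]],  # blue pixel, white pixel
    ],
    dtype=np.uint8,  # 8-bit values, so every entry lies in 0..255
)

print(image.shape)  # (2, 2, 3)
print(image[0, 0])  # the (R, G, B) values of the top-left pixel
```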

Images = numbers?

Convolutional Neural Nets: how computers see

The neural network I used is a Convolutional Neural Network (ConvNet), an architecture commonly used for image-related tasks. The idea behind a ConvNet is essentially a process of abstraction. Let’s start with an image of an apple:

That’s quite the apple!

We want the network to be able to detect features. We do this by using convolutions. Essentially, the network is going to pass a filter or “window” over the image. This filter is our feature detector.

This handy animation, taken from here, helps explain what’s actually happening. NOTE: this is a simplification, as the actual convolutions will be done in 3-D.

By passing the filter over the image, the network produces what is essentially a feature map of the image. This feature map will help the computer extract meaningful features. However, by doing the convolution above we went from a 5×5 grid to only 3×3! To combat this, padding is used. Padding surrounds the input with zeroes. With a stride of 1 and a one-pixel border of zeroes for a 3×3 filter, the output keeps the same spatial size as the input, and the filter can be centred on the edge and corner pixels too.

The padding allows us to preserve the data in the edges and corners, taken from here
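The sliding-window idea and the effect of padding can be sketched in a few lines of NumPy. This is a simplified single-channel version (the real layers work on 3-D data), and the 5×5 input and vertical-edge filter are made-up values for illustration.

```python
import numpy as np

def convolve2d(image, kernel, pad=0):
    """Slide `kernel` over `image` with stride 1 and return the feature map.

    Like most deep learning libraries, this does not flip the kernel
    (strictly speaking it is a cross-correlation).
    """
    if pad > 0:
        # Surround the input with a border of zeroes.
        image = np.pad(image, pad, mode="constant", constant_values=0)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

image = np.arange(25, dtype=float).reshape(5, 5)  # hypothetical 5x5 input
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)       # simple vertical-edge filter

no_padding = convolve2d(image, kernel)           # 5x5 shrinks to 3x3
same_padding = convolve2d(image, kernel, pad=1)  # 5x5 stays 5x5

print(no_padding.shape)    # (3, 3)
print(same_padding.shape)  # (5, 5)
```

The centre of the padded result matches the unpadded one; padding only adds the outer ring of outputs that would otherwise be lost.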

As with any other neural network, an activation function is applied.

model.add(Conv2D(128, 3, padding='same', activation='relu', input_shape=(IMG_HEIGHT, IMG_WIDTH, 3)))
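The `activation='relu'` argument in the line above applies ReLU to every value in the feature map, which just replaces negative values with zero. A minimal NumPy sketch, with made-up feature-map values:

```python
import numpy as np

def relu(x):
    """Rectified linear unit: keep positive values, zero out the rest."""
    return np.maximum(0, x)

feature_map = np.array([[-6.0, 2.5],
                        [0.0, -1.0]])  # hypothetical convolution output
activated = relu(feature_map)
print(activated)
```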

Typically, a pooling layer is added after each convolution layer. Pooling reduces the spatial dimensions of the data, which cuts computation and helps limit overfitting.

What was the point of padding then?

A max-pooling layer does something different from leaving out padding. When the filter slides over our image, it’s detecting features that we want to isolate. From a high-level perspective, however, the precise position where a feature is present isn’t important. Max-pooling is like a “zoom-out”: it allows later convolutions to work on larger areas of the image, since a small set of values after pooling corresponds to a much larger swath pre-pooling. This also makes the network more tolerant of small shifts in where a feature appears, which helps reduce overfitting. And because pooling is applied to feature maps the convolution has already computed, it only discards data that has already been analyzed.

Padding, on the other hand, makes sure the network can analyze every piece of data it has, including the edges and corners. No padding means losing quite a bit of information for analysis, something usually undesirable.

This is an example of max-pooling in action, taken from here
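Here is a rough NumPy sketch of 2×2 max-pooling on a single feature map. The numbers are made up, and the window size of 2 is an assumption (it is a common default, but the article does not state which size was used).

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    h, w = feature_map.shape
    # Drop trailing rows/columns that do not fill a whole window.
    h, w = h - h % size, w - w % size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

feature_map = np.array([
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
])  # hypothetical 4x4 feature map

pooled = max_pool2d(feature_map)
print(pooled)
# [[6 5]
#  [7 9]]
```

Each output value summarizes a 2×2 patch of the input, which is the "zoom-out" described above: the strongest response survives, its exact position within the patch does not.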

A network can use multiple groups of convolution and pooling layers, as seen fit. Before the network can start working on classification, the multi-dimensional data needs to be flattened so the rest of the network can work with it.

A beautiful graphic from colah’s blog, illustrating the structure of a CNN.
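To see how the data changes shape on its way through such a network, here is a small pure-Python sketch that traces the shapes through a stack of convolution and pooling blocks followed by flattening. The 64×64 input and the second filter count are assumptions for illustration; only the 128 filters in the first layer come from the code line earlier in the article.

```python
def trace_shapes(height, width, channels, conv_filters, pool_size=2):
    """List (layer name, output shape) for conv('same') + max-pool blocks, then flatten."""
    shapes = [("input", (height, width, channels))]
    for i, filters in enumerate(conv_filters, start=1):
        # 'same' padding with stride 1 keeps height and width;
        # only the number of channels changes (one per filter).
        channels = filters
        shapes.append((f"conv{i}", (height, width, channels)))
        # Pooling with non-overlapping windows divides height and width.
        height, width = height // pool_size, width // pool_size
        shapes.append((f"pool{i}", (height, width, channels)))
    # Flattening unrolls the 3-D block into one long vector for the Dense layers.
    shapes.append(("flatten", (height * width * channels,)))
    return shapes

# Hypothetical configuration: 64x64 RGB input, two conv/pool blocks.
shapes = trace_shapes(64, 64, 3, conv_filters=[128, 64])
for name, shape in shapes:
    print(f"{name:8s} {shape}")
```

With these assumed sizes, the final pooled block of 16×16×64 values becomes a single vector of 16384 numbers.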

Of course, all these parameters can be tweaked and played with. I played with many of them, training 27 different models (that takes quite a while on a laptop), but the plots were…

Yikes. I wasn’t able to evaluate the performance of specific models, but I had some general ideas

Yeah. Not fun.

After the data is flattened, we’re back in familiar territory with standard, feed-forward neural networks. I’m not going to explain them in this article, but if you want a good explanation, check out Joshua Payne’s article on the subject.

Using my eyeballing and guesswork, I narrowed it down to 4 networks: the combinations of 2 or 3 convolution layers with 1 or 2 Dense layers.

A cleaner, simpler graph:)
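The four candidates described above could be generated with a small Keras helper along these lines. This is a sketch under several assumptions, not the code from the repo: the image size, the 64-unit Dense layers, the choice to count only hidden Dense layers (with a separate sigmoid output for apple vs. orange), and the optimizer and loss are all guesses for illustration.

```python
from tensorflow.keras import layers, models

IMG_HEIGHT, IMG_WIDTH = 64, 64  # assumed; the article does not state the image size

def build_model(n_conv, n_dense):
    """Build a ConvNet with n_conv conv/pool blocks and n_dense hidden Dense layers."""
    model = models.Sequential()
    model.add(layers.Input(shape=(IMG_HEIGHT, IMG_WIDTH, 3)))
    for _ in range(n_conv):
        model.add(layers.Conv2D(128, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D())  # default 2x2 window
    model.add(layers.Flatten())
    for _ in range(n_dense):
        model.add(layers.Dense(64, activation="relu"))  # assumed layer width
    # Single sigmoid unit: an assumed setup for a two-class problem.
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# One model per combination: (2, 1), (2, 2), (3, 1), (3, 2).
candidates = {(c, d): build_model(c, d) for c in (2, 3) for d in (1, 2)}
```

Building the variants from one function keeps everything except the two varied settings identical, which makes their training curves easier to compare.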

For more detail regarding the code and building process, check out my GitHub repo here.

What are the next steps?

A few limitations stand out, and each one suggests a way to improve this model.

  1. The dataset is imbalanced
  2. The sample size is relatively small
  3. No frontend!

In the data, there are more apple pictures than orange pictures, which can bias the model towards predicting apples and make overall accuracy look better than it is. The number of samples is also pretty small; this can be addressed by applying transformations (rotations, reflections, small warps) to augment the data. Lastly, and this is something I plan on working towards in the future, this is all in Python, and as a standalone script its use cases are fairly limited. I want to be able to package this model nicely and have it make effective predictions without the user having to jump through so many hoops (Python is also kind of ugly:)).
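As a rough illustration of the first two points, here is a NumPy sketch of simple augmentation (reflections and 90-degree rotations only, no warps) and of inverse-frequency class weights. The class weighting is one common way to handle imbalance, not something the project currently does, and the tiny image and label counts are made up.

```python
import numpy as np

def augment(image):
    """Return simple label-preserving variants of an (H, W, C) image."""
    return [
        image,
        np.fliplr(image),      # horizontal reflection
        np.flipud(image),      # vertical reflection
        np.rot90(image, k=1),  # 90-degree rotation
        np.rot90(image, k=3),  # 270-degree rotation
    ]

def class_weights(labels):
    """Weight each class inversely to its frequency so rarer classes count more."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

image = np.arange(2 * 3 * 3).reshape(2, 3, 3)  # hypothetical 2x3 RGB image
variants = augment(image)                      # 5 training samples from 1

labels = np.array([0] * 6 + [1] * 2)  # hypothetical: 6 apples (0), 2 oranges (1)
weights = class_weights(labels)       # oranges get the larger weight
print(weights)
```

A dictionary like `weights` can be passed to Keras through the `class_weight` argument of `model.fit`, so that mistakes on the rarer class cost more during training.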

Key takeaways

  • Data that replicates real-life inconsistencies = very good
  • By engineering variability into the dataset, the model can be more robust
  • Overfitting — when the neural net starts memorizing instead of learning
  • Convolutions allow computers to extract high-level features from images
  • Convolutions extract features -> pooling zooms out

Thanks for reading! If you have any questions, comments, or want to talk, feel free to reach out. All code for this project is available on my GitHub!