# Teaching Machines to Recognize Man’s Best Friend

Source: Deep Learning on Medium

## How machines can understand the world around us

A dog is a dog. That sounds like a stupid thing to say, but what *is* a dog? Sure, you can probably recognize a dog instantly if you saw one in an image, but how did you reach that conclusion? If you break the question down, what really makes a dog? If you went back a few thousand years, how would you explain the difference between how a cat looks and how a dog looks?

You could say dogs have four legs and a tail, but so do cats. Even with that description there’s a lot of room for error: a dog with three legs is still a dog. Maybe you could look at the shape of their ears or the colour of their fur. It turns out it’s not that easy to describe how a dog physically looks, and your description might only apply to a certain breed.

The idea of having a computer try to classify an object in an image has been around for a long time, but advancements in the field of Deep Learning have drastically improved how accurate it can be. Why exactly is object detection important?

• Self-driving cars and other autonomous vehicles seeing the road
• Facial recognition, and the growing concerns around it
• Diagnosis from medical images in healthcare

So how exactly are computers able to figure this out?

# How computers show images

Right now you’re reading this article on your computer or your phone (if it wasn’t obvious), and if you zoom in really closely you can see it’s made up of tiny dots/squares called pixels. Each of these pixels projects a single colour that, combined with the rest, can form entire images. That colour value is typically RGB (red, green, blue), and each channel goes from 0 to 255, darkest to brightest. So (255, 0, 0) would be a completely red pixel, (0, 0, 0) a black pixel, and (255, 255, 255) a white pixel. Everything you see on screen is created by millions of pixels.
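To make that concrete, here is a minimal sketch with NumPy of how a (completely made-up) 2×2 image looks as numbers, with each pixel being one (R, G, B) triple:

```python
import numpy as np

# A tiny hypothetical 2x2 image: each pixel is an (R, G, B) triple, 0-255.
image = np.array([
    [[255, 0, 0], [0, 0, 0]],        # red pixel, black pixel
    [[255, 255, 255], [0, 0, 255]],  # white pixel, blue pixel
], dtype=np.uint8)

print(image.shape)  # (2, 2, 3): height, width, 3 colour channels
print(image[0, 0])  # [255 0 0] -> a completely red pixel
```

A real photo works exactly the same way, just with millions of these triples instead of four.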

For a computer trying to understand the contents of an image, having colours isn’t actually that useful. Humans can still tell there’s a dog in the picture if it were black and white. So we can preprocess the data by converting the picture to grayscale. Now each pixel has a single value from 0 to 1 indicating its brightness. It is far more efficient to deal with one number per pixel than to manage three RGB values, and it makes sure colour doesn’t play a major role: a yellow dog is still the same dog as a red one.
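That preprocessing step can be sketched in a few lines. This uses the standard luminance weights for mixing the three channels (a common convention, not something special to dogs) and divides by 255 so each pixel ends up between 0 and 1:

```python
import numpy as np

# Hypothetical 2x2 RGB image (values 0-255).
rgb = np.array([
    [[255, 0, 0], [255, 255, 255]],
    [[0, 0, 0], [128, 128, 128]],
], dtype=np.float64)

# Weighted average of the channels (standard luminance weights),
# then divide by 255 so each pixel is a single brightness in [0, 1].
gray = (rgb @ [0.299, 0.587, 0.114]) / 255.0

print(gray.shape)  # (2, 2): one number per pixel instead of three
```

The white pixel comes out as 1.0, the black pixel as 0.0, and everything else lands somewhere in between.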

While we can see a picture on screen and identify it, the computer still only knows an image as numbers. So how can a computer figure out how these numbers make a dog?

Let’s take inspiration from how humans work. When we see something, our brain does some of its own calculations. First of all, light that bounces off an object enters our eyes. We can see our phone because light bounces off of it. When that light enters our eyes, it’s converted into electric signals transmitted by neurons so our brain can interpret the world we see. Those signals are also interpreted as objects we know. So not only do we get a picture of our dog, we recognize it as one.

If our brain learned how to classify a dog, then a computer can do the same with a neural network. We aren’t discussing any complicated math here, just an understanding of how a neural net works. When we feed the neural net preprocessed information, it does some math to it, some combination of multiplying the numbers and adding them up, then it spits out an answer. In this case we give it an image, or as we now know a collection of pixels, it does math with the pixel values, and finally it tells us how likely it is that the image is a dog, from 0 to 1.
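Here is a minimal sketch of that "multiply, add up, spit out a number" idea: one artificial neuron looking at a flattened four-pixel grayscale image. The weights and bias below are made up for illustration; in a real network they are learned from data:

```python
import numpy as np

def sigmoid(z):
    # Squashes any number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Brightness values of a (hypothetical) flattened 4-pixel image.
pixels = np.array([0.9, 0.1, 0.8, 0.2])

# Made-up weights and bias; a real network learns these.
weights = np.array([0.5, -0.3, 0.4, 0.1])
bias = -0.2

# Multiply each pixel by its weight, add everything up, squash to (0, 1).
score = sigmoid(pixels @ weights + bias)
print(score)  # a single "how dog-like is this?" number between 0 and 1
```

A full network just stacks many of these neurons in layers, but the basic operation at each step is exactly this.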

If we want our neural network to classify multiple things, then we can have it output multiple numbers, with the largest value being what the neural network believes is in the picture. Who knew probability could be so useful!
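The multi-output version is usually done with a softmax, which turns raw scores into probabilities. The class names and scores below are invented for the sketch:

```python
import numpy as np

def softmax(scores):
    # Turns raw scores into probabilities that sum to 1.
    e = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return e / e.sum()

# Made-up raw output scores for three classes.
classes = ["dog", "cat", "bird"]
raw = np.array([2.0, 1.0, 0.1])

probs = softmax(raw)
print(classes[int(np.argmax(probs))])  # the largest value wins
```

Whichever class gets the largest probability is what the network believes is in the picture.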

At first it’s going to perform horribly since it has absolutely no idea how to get from seemingly random pixel values to a single output. But with backpropagation (math not explained here), it can learn to adjust its calculation and become very accurate.

# Mattered order knew who?

The previous neural network has an interesting property: it is permutation invariant. It has no built-in idea of which pixels are next to each other, so if we shuffled the pixels of every image with the same fixed shuffle, it could learn to classify them just as well as the untouched versions. Humans don’t interpret data this way; it’s like trying to compare a finished jigsaw puzzle with one that has its pieces scattered everywhere, nothing makes sense. In short, nearby pixels are related to each other, and this network throws that relationship away.
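A quick sketch of why position means nothing to such a network: if we apply the same shuffle to a (random, made-up) image and to a neuron's weights, the weighted sum comes out identical, so the network never cared where each pixel sat:

```python
import numpy as np

rng = np.random.default_rng(0)

pixels = rng.random(16)    # a flattened 4x4 "image" of random brightness values
weights = rng.random(16)   # one neuron's (hypothetical) learned weights

perm = rng.permutation(16)  # one fixed shuffle of the pixel positions

# Shuffling the image and the weights with the same permutation
# gives exactly the same weighted sum: the neuron never knew
# which pixels were neighbours in the first place.
original = pixels @ weights
shuffled = pixels[perm] @ weights[perm]

print(np.isclose(original, shuffled))  # True
```

So the network can always "unscramble" a fixed shuffle by rearranging its weights, which is another way of saying it learns nothing from pixel positions.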

A dog makes sense because all of its parts are organized in a reasonable way. We can clearly see a body, legs, and a face with all the other stuff on it.

NOW TRY SHUFFLING THE PIXELS. I have no clue what that scrambled mess is anymore, and neither would you. The jumble speaks for itself.

If the neural network could learn to understand features like we do, it could improve drastically.

# What is a convolution and what does it want with me

A convolution slides a small grid of numbers, called a kernel or filter, across the image, multiplying and adding at each position to measure how strongly a particular pattern appears there. Because the hidden layers in a CNN use convolutions, they can learn different features. The first layer might learn to find edges and the next one might find shapes. And because the neural net is trained to learn on its own, it can find its own features. We may only have an understanding of a dog’s body parts, but the CNN might find something even better.
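Here is a bare-bones sketch of a 2D convolution, using a hand-made vertical edge detector on a tiny synthetic image. A CNN would learn kernels like this one on its own during training rather than having them written by hand:

```python
import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over the image (no padding) and take
    # a weighted sum at each position.
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 5x5 grayscale image: dark on the left, bright on the right.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

# A hand-made kernel that responds to dark-to-bright transitions.
kernel = np.array([[-1.0, 1.0]])

edges = convolve2d(image, kernel)
print(edges)  # non-zero only at the dark-to-bright boundary
```

The output lights up only where the brightness changes, which is exactly the "edge" feature an early CNN layer tends to discover.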

If we go crazy and add hundreds of convolutions, the CNN becomes unnecessarily complicated and may run into overfitting: the network might start memorizing the features of the specific dogs it was trained on instead of the general features of dogs.