Can You Find Waldo Faster Than A Computer? Spoiler: You Can’t.



Detecting Objects With Computer Vision

It seems like only yesterday that I was in my single-digit years, entertaining myself on flights and at church with books filled with the most random memory and activity games. The most iconic ones were hands-down the massive illustrations of “Where’s Waldo”; I even had competitions with my friends and family all the time to see who could find him faster.

I have to say that after all that practice, I got pretty good. And it wasn’t as easy as it looked. You’d think a skinny guy wearing blue pants and a striped shirt could be easily spotted, but nope!

Did you try to find Waldo? Even if you did… your time was up a long, long while ago. Too bad! If only there was something you could do to be faster…

The whole point of the game is speed, focus, and concentration. It’s supposed to get kids to think proactively, and exercise their brain muscles. “Preparation for school”, or whatever.

But what if I told you that you needed literally none of that to actually succeed in the game?

See, the other part is winning. Yes I know, I’ve heard the whole spiel on how “the most important thing is to have fun!” but listen. Technically, the person who comes up with an effective strategy is the most consistent winner. That’s the guy you want to be! All you have to do is outsmart the others (without cheating, obviously) and you’re golden. And this actually applies to most things in life. Since there are no real rules to “Where’s Waldo” except that to win you have to be the one to find him first, there’s a different path you can take to be the one on top, every time.


Kids these days also have access to such an insane amount of tech — it’s time they use it for a good cause. The point is, computers are better at this than you. When they look at a really complicated picture, they can process the whole thing at once and find Waldo within seconds.

We humans aren’t capable of this kind of processing power unless we genetically modify ourselves to be insanely intelligent (which isn’t all that unrealistic… but that’s a whole other story). Anyway, your computer’s visual processing abilities surpass those of humans. How on Earth does this happen?

Computer Vision: Technology That Can See

What your average Joe Bloggs might think of computer vision is probably something along the lines of a computer with superpowered eyes that watch you at all times, just like the FBI in your webcam. But that’s wrong. Maybe one day…

Object detection (as a branch of computer vision) is when a computer is able to interpret the contents of a digital image or video without you having to manually enter that information. This technology can break down the different parts of an image and figure out what objects are present. It’s what powers facial recognition, like recognising your family and friends in iPhoto, and what matches criminals against security footage.

Computer vision is not just being able to see what is going on, but understand it too.

YOLO… But It’s Not What You Think

You Only Look Once, aka YOLO, is a system that detects objects in real time. Not gonna lie, I was disappointed when I first learned that YOLO wasn’t “You Only Live Once”, but once I got to know what the real YOLO is, I promise it’s a million times more interesting and cool.

The system is different from a classical model (for example Fast R-CNN) because you don’t get multiple predictions for the same regions of an image; instead, the image is passed through the fully convolutional neural network (FCNN) only once, framing detection as a single regression problem.

Here’s how the YOLO (V3) model works:

  1. Grid: The image is divided into an S x S grid (which you can see from the image on the left). This breaks the picture down into parts while still letting it be ‘read’ as a whole. If the centre of an object falls inside a grid cell, that cell is responsible for detecting that object.
  2. Bounding boxes: The features of the entire image as a whole are used to determine bounding boxes, which are basically just the outlines of where objects could be. It’s important to note that they cover every part of the image and overlap, and a box is often bigger than a single grid cell. Think of it like you’re trying to find your phone that’s somewhere under the covers of your bed — the outlines of the different bumps in your bed would be your bounding boxes!
  3. Confidence scores: The confidence score is an indication of how sure the model is that what it thinks is there is actually there: whether the box contains an object at all, and how accurate the box is. Each box itself is described by its width, its height, and the location of its centre relative to the bounds of the grid cell. In the bed analogy, this is you looking at the different lumps (bounding boxes), comparing their size to the size of your phone, and deciding how likely it is that your phone is actually under each one.
  4. Conditional class probability: These are probabilities conditioned on a grid cell containing an object. With previous versions of YOLO, only one set of class probabilities was predicted per grid cell, which is how you differentiate your different objects (see bottom picture above). But YOLO V3 uses independent logistic classifiers for each class. This allows one object to carry multiple labels at once: the computer finds Waldo, and he is labelled as “Waldo”, “person”, and “boy”. (A rough sketch of how one of these predictions gets decoded follows this list.)
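
If you’re curious what that decoding actually looks like, here’s a rough sketch in Python/NumPy. The cell, anchor, and input-size numbers are made up for illustration, so treat this as the idea rather than code from my model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One raw prediction for a single anchor in a single grid cell:
# [tx, ty, tw, th, objectness, 80 class scores] for the COCO classes.
raw = np.random.randn(85)

cell_x, cell_y = 7, 4          # which cell we're in, on a 13x13 grid
anchor_w, anchor_h = 116, 90   # one of YOLO V3's default anchor sizes (pixels)
stride = 32                    # 416-pixel input / 13 cells = 32 pixels per cell

# Centre of the box: an offset from the cell's corner, scaled back to pixels
box_x = (sigmoid(raw[0]) + cell_x) * stride
box_y = (sigmoid(raw[1]) + cell_y) * stride
# Width and height: the anchor box stretched or shrunk by the prediction
box_w = anchor_w * np.exp(raw[2])
box_h = anchor_h * np.exp(raw[3])

objectness  = sigmoid(raw[4])    # "is there anything here at all?"
class_probs = sigmoid(raw[5:])   # one independent logistic classifier per class

print(box_x, box_y, box_w, box_h, objectness, class_probs.argmax())
```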

Here’s a snippet of code that collects and interprets the data as described above (it also notes how fast the objects are being detected!):

This is just an excerpt of the full code. I coded an object detection model using YOLO V3, which you can find the link to here.

This is the architecture of the model — you can see the different layers.
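
To get a feel for how a network like this is actually run, here’s a minimal sketch using OpenCV’s DNN module, with a simple timer around the single forward pass. The file names are placeholders, so treat it as an outline rather than a copy of my code:

```python
import time
import cv2

# A minimal sketch of one YOLO V3 detection pass with OpenCV's DNN module.
# The file names below are placeholders: point them at your own cfg/weights/image.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
layer_names = net.getLayerNames()
output_layers = [layer_names[i - 1] for i in net.getUnconnectedOutLayers().flatten()]

image = cv2.imread("waldo.jpg")
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)

start = time.time()
outputs = net.forward(output_layers)   # the one and only "look"
print("Detection took %.3f seconds" % (time.time() - start))
print("Number of output scales:", len(outputs))   # 3 for YOLO V3
```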

Manipulating the data

Because of the single regression, the loss for objectness and classification needs to be calculated separately — but still in the same network. The objectness score comes from logistic regression: 1 signifies that the bounding box prior completely overlaps the ground truth object (what the picture actually contains). The model only assigns 1 bounding box prior to each ground truth object this way, and an error there is affected by both the objectness and the classification loss.
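
As a toy illustration of that split (made-up numbers, plain NumPy), here’s how objectness and classification can each be scored with binary cross-entropy and then added back into one loss:

```python
import numpy as np

def bce(pred, target):
    # Binary cross-entropy: the loss behind those logistic-regression scores
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean()

# Toy numbers: the model is 92% sure there's an object in this box,
# and the ground truth says there really is one (target = 1).
objectness_loss = bce(np.array([0.92]), np.array([1.0]))

# Classification is scored separately, one independent yes/no per class,
# which is what lets one object carry several labels at once.
class_pred   = np.array([0.85, 0.70, 0.10])   # e.g. "Waldo", "person", "dog"
class_target = np.array([1.0,  1.0,  0.0])    # it's Waldo AND a person
classification_loss = bce(class_pred, class_target)

print(objectness_loss, classification_loss, objectness_loss + classification_loss)
```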

The YOLO (V3) model also predicts boxes at 3 different scales, in order to support scale variation. That will look something like this:
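
As a rough sketch of the numbers involved, assuming the standard 416×416 input and COCO’s 80 classes (the usual setup, not anything specific to my model):

```python
# For a 416x416 input and 80 classes, the three detection scales sit at
# strides 32, 16 and 8: a coarse, a medium and a fine grid.
input_size, num_classes, anchors_per_scale = 416, 80, 3

for stride in (32, 16, 8):
    grid = input_size // stride
    boxes = grid * grid * anchors_per_scale
    print(f"{grid}x{grid} grid -> {boxes} candidate boxes, "
          f"{anchors_per_scale * (5 + num_classes)} numbers per cell")
```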

To test the model, we multiply the conditional class probability by the individual box confidence prediction. This gives the probability of each class being in a box, weighted by how well the predicted box fits the object.
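
In code, that multiplication is a single line; here’s a toy example with made-up numbers:

```python
import numpy as np

# Toy numbers for one predicted box: its box confidence (objectness)
# and its conditional class probabilities, e.g. "Waldo", "person", "dog".
objectness  = 0.9
class_probs = np.array([0.80, 0.60, 0.05])

# class-specific score = P(class | object) x P(object)
class_scores = class_probs * objectness
print(class_scores, "-> best class index:", class_scores.argmax())
```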

Here is another excerpt of code that loads the pre-trained class names and handles writing to and annotating the different frames of the digital image:
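
The gist of that part looks something like the sketch below, assuming OpenCV and the usual coco.names class list (so an illustration, not a verbatim excerpt):

```python
import cv2

# Load the class names the network was trained on: one name per line in a
# plain-text file ("coco.names" is the conventional file for the COCO classes).
with open("coco.names") as f:
    class_names = [line.strip() for line in f]

def annotate(frame, box, class_id, score):
    """Draw one detection onto a frame: a rectangle plus a text label."""
    x, y, w, h = box
    label = f"{class_names[class_id]}: {score:.2f}"
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.putText(frame, label, (x, y - 5),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
    return frame
```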

The output of a You Only Look Once (YOLO V3) model looks something like the picture below. It’s really clear what exactly you want from the model, and how it can be used across diverse applications, like road-mapping.

Now I hope you understand why finding Waldo would be so easy! You wouldn’t even have to look twice…

Why We Use It

  • It’s considered real-time because it’s fast — capable of processing 45 frames per second. There’s also a faster version with a smaller architecture that can handle 155 frames per second, but it is less accurate.
  • The FCNN learns generalized object representations, so it isn’t tied to one kind of imagery: you can train it and run it on real-world photos as well as artwork.

Looking Ahead; Thinking Big

This technology is already past its preliminary stages: people are using computer vision in their everyday lives. More and more technology is developing in that field, which is exciting because it has such a diverse array of real-world applications.

Combining object detection software with voice feedback is super powerful too. It’s already being used today — people just don’t really know that it exists. People who are blind would be able to get a description of their surroundings 24/7, making them less reliant on their other senses. They could even drive! Although I’m not sure how necessary that would be when we have self-driving cars!

Another possible use case is classifying, tracking, and assisting with inventory for large-scale retail or grocery stores. Medical diagnosis could benefit from this as well, when dealing with external wounds, fractures, spots, or other injuries. But most important of all, it’s technology that helps me find Waldo!

Regardless, there is an incredible amount of potential in this field, and I can’t wait to be part of that future.

Let me know what you think!

Follow me on LinkedIn and Medium for more.

Credits to Ayoosh Kathuria and Aviv Sham from GitHub, plus Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi from the University of Washington, Allen Institute for AI, and Facebook AI Research for their published paper, which you can find here.