Introduction to Computer Vision with Deep Learning: Filters and Kernels

Source: Deep Learning on Medium


Written by Praveen Kumar and Nilesh Singh.

This is how we want you to feel after all the hard work.

What is this fuss about deep learning?

Well, if we were to be strictly technical, I’d say that deep learning is a field of study where we design, develop and deploy Artificial Neural Networks (ANNs) to perform various tasks, including but not limited to object classification, semantic segmentation, image captioning, signal processing and object localization.

Phew, sounds heavy right? Trust me it’s not (at least not as much as the hype around it). We’ll come back to the definition later but first, let’s try to visualize what we are dealing with here.

Okay, so think about this: you are walking down the street and your friend is walking toward you from the opposite direction. You see each other, say hi and move on. Pretty trivial, isn’t it? You wouldn’t give the process a second thought, but there’s actually a lot of biological and electrical activity happening to make it possible:

1. Light from your friend’s face is falling on the retina of your eyes.

2. The cells in the back of your eyes are getting activated based on intensity and wavelength of light.

3. That information is then being sent back to your brain.

4. The brain is filtering it, extracting the relevant information, analyzing it, and trying to find a match for it in memory. As soon as a match is found, your brain tells you that you’re looking at your friend.

Ah, tiring to think about it, right? Now imagine if a computer were to do all this. One could never hand-write rules for it: think of it, so many faces, so much variance, so many permutations to deal with. So what do we do? Shall we embark on our brave journey of writing an “if” condition for every possible permutation?

That would be a little unpleasant, wouldn’t it? So instead we design a neural network (the brain in this case), train it to remember your friend’s face (like your memory), and then use it to recognize the person whenever the computer is shown the same face again.

This is the intuition behind deep learning, neural networks, and computer vision. You might be thinking about the definition now. Please don’t. That’s the mantra: leave all the definitions behind and carry the intuition forward.

Now we look at how we approach that first part of getting visual information and extracting relevant parts.


Sounds like a small conceptual word? It indeed does, but fret not, let me break it down for you. I’ll try to be as intuitive as possible so that you can remember it for a long time (otherwise you know where to look for it).

Let’s consider this picture of a magnificent juicy red apple:

What do you see? I mean, obviously an apple, but then again, what makes an apple look like an apple?

For starters, you could say –

1. Its red body

2. The knob on top

3. The green leaf

4. Its round shape

These features define an apple. So whenever someone describes these features, you’ll make a mental picture of it and immediately know that the person is describing an apple. But how does a computer vision system (CVS) know? For a CVS to know all this, it needs something to extract these different sets of features and then, later on, combine them and produce a result. So what could it be? Some wizardry? Voodoo magic? Or maybe just a simple boring matrix with some numbers?

In computer vision terms, we call these feature extractors (the matrices) kernels, or filters. People often use these words interchangeably (along with a ton of others which we’ll see later).

Now, putting our technical hats back on: a kernel is generally represented as a matrix. The values inside this matrix decide what type of feature(s) we extract from an image. We then do some basic arithmetic with the kernel and the pixel values in the image (multiplying each kernel value with the pixel beneath it and adding up the results) to magically get our features.

Kernels can come in different dimensions, such as 3×3, 5×5 and 7×7. We tend to choose odd-dimension kernels because they have an axis of symmetry: a middle row and a middle column that divide the matrix into two equal halves, which also gives the kernel a single center element to line up with a pixel (don’t worry if this seems a bit difficult to comprehend, we’ll come back to it in later chapters). 3×3 has become the most popular kernel size since the breakthrough in computer vision in 2012. The 3×3 kernel is popular for 2 reasons:

1. It is the smallest odd-dimension kernel after 1×1 (which will be covered later, because we need a few more concepts to understand it).

2. It needs very few computations: 3×3 = 9 multiplications per output value, compared with 5×5 = 25, 7×7 = 49, and so on.
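
To make the arithmetic a little less magical, here is a minimal sketch in Python with NumPy. It is only an illustration, not code from this series: the function name `apply_kernel`, the toy 5×5 image and the hand-picked vertical-edge kernel are all assumptions made for the example.

```python
import numpy as np


def apply_kernel(image, kernel):
    """Slide `kernel` over `image` (no padding, step of 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The patch of pixels currently sitting under the kernel
            patch = image[i:i + kh, j:j + kw]
            # Multiply element-wise and add up: kh * kw multiplications per output value
            output[i, j] = np.sum(patch * kernel)
    return output


# Toy 5x5 grayscale image (assumed for illustration): dark (0) on the left, bright (1) on the right
image = np.array([
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=float)

# Hand-picked 3x3 kernel that responds where brightness rises from left to right
vertical_edge_kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

feature_map = apply_kernel(image, vertical_edge_kernel)
print(feature_map)
# [[0. 3. 3.]
#  [0. 3. 3.]
#  [0. 3. 3.]]
```

The flat, dark part of the image gives 0, while the positions where the kernel straddles the dark-to-bright boundary give 3, so this particular kernel picks out a vertical edge. Each output value costs 3×3 = 9 multiplications here; swapping in a 5×5 kernel would raise that to 25.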

To summarize, we established that everything we see is made up of a set of features which uniquely identify the object. These features need to be extracted from the image if a computer is to comprehend it. This is achieved with some really boring and dull number matrices (called kernels) and some basic arithmetic operations.

We’ll have more insights on it in the later chapters of this series.