Source: Deep Learning on Medium
An Overview of K Nearest Neighbor
*** My warm welcome to all the readers! ***
- Introduction to KNN
- Understanding KNN with Toy Example
- Selection of K value
- Applications of KNN
1. Introduction to KNN
K Nearest Neighbor, in short KNN, is one of many machine learning algorithms. KNN comes under the category of supervised learning, and it can be used for both classification and regression. It works on the basic idea of "classifying a point according to the majority class among its nearest neighbors." Here, 'K' is a natural number, and an odd number is usually preferred. Suppose we have red balls on one side and blue balls on the other, and we place a new ball (color unknown, to be classified) somewhere near them. We need to find out whether the new ball belongs to the red category or the blue category. What we do is take the K, say 5, nearest neighbors of the new ball. If out of those 5 nearest neighbors, 3 are blue balls and 2 are red balls, then based on the majority count we classify the new ball as blue.
2. Understanding KNN with a Toy Example:
Let’s consider classification technique (binary classifier).
I am not taking a big data set with n dimensions, as it becomes hard to picture n dimensions in our mind and understand how the algorithm works. For better and easier understanding, I am creating a toy example.
2.1 Creating Toy Data Frame:
I have created a toy example which contains 10 rows and 2 columns. The column 'Numbers' represents the independent variable (x), with numbers from 1 to 10. The column 'Label' represents the dependent variable (y), with binary class labels '0' and '1'.
Numbers 1 to 5 belong to class 0, and numbers 6 to 10 belong to class 1. Our objective is: given any number (whole number or fraction), the model should predict the class of the new data point, i.e. either 0 or 1.
We assign the 'Numbers' column to 'x' and the 'Label' column to 'y'.
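The original post shows this step as an image; here is a minimal sketch that recreates the toy frame (the names `df`, `x`, and `y` are my own):

```python
import pandas as pd

# Toy frame: numbers 1-10, first five labelled 0, last five labelled 1
df = pd.DataFrame({'Numbers': list(range(1, 11)),
                   'Label': [0] * 5 + [1] * 5})

x = df[['Numbers']]  # independent variable (kept 2-D, as scikit-learn expects)
y = df['Label']      # dependent variable: binary class label
print(df)
```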
2.2 Training Model:
We will now train the model with n_neighbors (K) set to 3 and metric set to 'euclidean'. K as 3 means we consider only the 3 nearest neighbors when deciding whether a given point belongs to class 0 or 1. Here, we find the nearest neighbors using Euclidean distance, also known as the 'L2 norm', which works like the Pythagorean theorem. Below is the code for training the model.
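The training code appears as an image in the original post; a self-contained sketch of what it likely looks like (the name `model` is assumed, and the toy data is rebuilt inline):

```python
from sklearn.neighbors import KNeighborsClassifier

x = [[i] for i in range(1, 11)]   # toy 'Numbers' column
y = [0] * 5 + [1] * 5             # toy 'Label' column

# K = 3 nearest neighbors, distances measured with the Euclidean (L2) metric
model = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
model.fit(x, y)
```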
Note: We can use other distance metrics, such as Manhattan distance (L1 norm), Minkowski distance, Hamming distance, cosine distance, etc.
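As a quick illustration of the two most common metrics, in plain Python (the example points are my own):

```python
from math import dist  # Euclidean distance helper, Python 3.8+

p, q = (1, 2), (4, 6)

# Euclidean / L2 norm: sqrt(3^2 + 4^2) = 5.0
print(dist(p, q))

# Manhattan / L1 norm: |1 - 4| + |2 - 6| = 7
print(sum(abs(a - b) for a, b in zip(p, q)))
```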
2.3 Predicting a Query Point:
Here comes the interesting part — predicting a new point. Let's get into imagination. Picture in your mind all 10 numbers (1 to 10), where 1–5 belong to class 0 and 6–10 belong to class 1. Suppose I want to predict the class of the number '5.4'. The 3 nearest numbers to 5.4 are 4, 5, and 6. Of these 3 nearest points, 2 belong to class 0 and 1 belongs to class 1. We can decide the class of 5.4 in 2 ways: 1- majority based and 2- probability based.
1- Majority: The majority of the points (2) belong to class 0 and the minority (1) belongs to class 1. Hence, the number 5.4 is classified as class 0.
2- Probability: The probability of the number 5.4 belonging to class 0 is 2/3 (≈ 0.667) and to class 1 is 1/3 (≈ 0.333). Based on the highest probability, 5.4 is classified as class 0.
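Both decisions fall directly out of scikit-learn; a self-contained sketch (variable names assumed):

```python
from sklearn.neighbors import KNeighborsClassifier

x = [[i] for i in range(1, 11)]
y = [0] * 5 + [1] * 5
model = KNeighborsClassifier(n_neighbors=3, metric='euclidean').fit(x, y)

# Majority vote among the 3 nearest neighbors (4, 5, 6) -> class 0
print(model.predict([[5.4]]))        # [0]

# Class probabilities: 2/3 for class 0, 1/3 for class 1
print(model.predict_proba([[5.4]]))
```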
3. Selection of K Value:
3.1 Odd K value:
As I mentioned above, K is preferably taken as an odd number, and there is a good and convincing reason for it. If K is even, there is a fair chance that K/2 of the nearest points belong to one class and the other K/2 to the other class. Suppose we take K as 10: if 5 of the nearest points belong to class 0 and the other 5 belong to class 1, it becomes impossible to decide by majority which class the new point belongs to. Avoiding such ties is the reason for taking an odd K value (for a binary classifier; with more than two classes, ties can still occur even with odd K).
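The tie problem is easy to demonstrate on the toy data: with an even K = 2, a point exactly between the two classes gets a 50/50 probability (a sketch with assumed names):

```python
from sklearn.neighbors import KNeighborsClassifier

x = [[i] for i in range(1, 11)]
y = [0] * 5 + [1] * 5

# Even K = 2: the point 5.5 is equidistant from 5 (class 0) and 6 (class 1)
tied = KNeighborsClassifier(n_neighbors=2, metric='euclidean').fit(x, y)
print(tied.predict_proba([[5.5]]))  # a dead 0.5 / 0.5 tie
```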
3.2 Right K value with Scikit Packages:
Selecting the right K value plays a big role in model performance. If we take a wrong K value, it may lead to overfitting or underfitting problems. How do we choose the right K value then? Well, we have many methods, like GridSearchCV, K-fold cross-validation, RandomizedSearchCV, a simple for loop, etc., which we can leverage to find the right K value.
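A minimal GridSearchCV sketch on the toy data (the candidate K list and cv=2 are my own choices, since the toy set has only 10 rows):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

x = [[i] for i in range(1, 11)]
y = [0] * 5 + [1] * 5

# Try a few odd K values; 2-fold cross-validation because the set is tiny
grid = GridSearchCV(KNeighborsClassifier(metric='euclidean'),
                    param_grid={'n_neighbors': [1, 3, 5]},
                    cv=2)
grid.fit(x, y)
print(grid.best_params_)  # the K with the best cross-validated accuracy
```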
Note: As K decreases, the model tends toward overfitting; as K increases, it tends toward underfitting. So selecting the right K value is a major part of building the model.
3.3 Right K value with Graph:
While tuning the hyperparameter with scikit-learn, we obtain different K values and their corresponding accuracy or error. To select the right K value graphically, we plot both train and validation accuracy (or error) against K. I have plotted train and validation accuracy, and we will use that plot to understand the selection.
From the plot, we can see that the train curve gradually declines while the validation curve gradually inclines. We should select a K value at which the train and validation curves are close to each other. In our case, the curves are close at K = 47, so we can select K as 47.
Note: Suppose that from a certain K value, say 43, both the train and validation curves become constant, i.e. they neither decline nor incline. We might then face a dilemma in choosing K, since the curves are close to each other at more than one point. In this case, we should select the smallest such K value: if the curves are close at K = 43, 45, 47, 49, …, then we should select K = 43.
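The graph itself is not reproduced here, but the numbers behind such a plot can be generated like this (a synthetic dataset from `make_classification` stands in for the blog's data, purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in dataset (illustrative only)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# One (K, train accuracy, validation accuracy) row per odd K;
# plotting the two accuracy columns against K gives the curves described above
for k in range(1, 52, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(k, round(knn.score(X_tr, y_tr), 3), round(knn.score(X_va, y_va), 3))
```

Note how train accuracy starts at 1.0 for K = 1 (each point is its own nearest neighbor) and falls as K grows, which is exactly the declining train curve the text describes.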
4. Applications of KNN:
- Recommender systems: sites like Amazon and Flipkart recommend products based on our previously viewed, searched, and bought data; Netflix, IMDb, etc. recommend movies and web series based on our search and watch history.
- Concept search: classifying documents based on similar concepts. There are many more applications as well.
- KNN is a supervised learning algorithm and is used for both classification and regression.
- It is easy to understand and easy to apply with scikit-learn.
- If the data is high-dimensional, KNN becomes time consuming; hence, applying KNN to high-dimensional data is not advisable.
- In most cases, tuning only K will do, unlike other algorithms where more than one hyperparameter must be tuned.
*** Thank you all for reading this blog. Your suggestions are very much appreciated! ***