WHO’s Around You (Nearest Neighbor)



Hello Guys!


K-Nearest Neighbors (KNN) is one of the easiest and most beginner-friendly algorithms to learn. There are just a few points to know and you are done; yes, it really is that easy. One point to note: KNN generally works best when the dataset is reasonably large. Important points related to this:

  • KNN is one of the simplest Machine Learning algorithms.
  • It is based on the Supervised Learning technique.
KNN (Source: Author)
  • KNN finds similar data points according to their properties. In the figure above, KNN grouped the red, blue, and green points based on their similarity.
  • The K-NN algorithm stores all the available data and classifies a new data point based on similarity. Look at the figure above: the black point is a new data point. If we take circle C1, then black → blue because the blue points are closer; but if we take the bigger circle C2, then black → red (meaning black will be classified as red).
  • The K-NN algorithm can be used for Regression as well as Classification, but it is mostly used for Classification problems.
  • K-NN is a non-parametric algorithm, i.e. it does not assume that the underlying data distribution can be described by a finite set of parameters.
  • It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on it.
  • At the training phase, the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category it is most similar to (the short sketch after this list shows this in code).
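To make the C1/C2 idea concrete, here is a minimal sketch showing how the same query point can get a different label depending on K. The toy coordinates below are made up purely for illustration, not taken from any real dataset.

#Toy demo: the predicted label flips as K grows (coordinates are made up)
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 1.0], [1.2, 0.9],          #two "blue" points near the query
              [3.0, 3.0], [3.2, 2.8],
              [2.9, 3.1], [3.1, 3.3]])         #four "red" points farther away
y = np.array(["blue", "blue", "red", "red", "red", "red"])

query = [[1.5, 1.2]]
for k in (2, 6):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print("k =", k, "->", clf.predict(query)[0])   #k=2 -> blue, k=6 -> red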

Working of KNN

As we discussed earlier, KNN uses feature similarity to predict the values of new data points, meaning that a new data point is assigned a category on the basis of the nearest data points. The following steps are performed to implement the algorithm:

  • First, choose the value of K, i.e. the number of nearest data points to consider. The value of K really matters: in the figure above, we saw that when K is low the black point is classified as blue, but when we increase K it is classified as red. Keeping K around 5–8 suits most cases; try running your code multiple times with different K values to increase accuracy.
  • For each point in the test data, do the following (a from-scratch sketch of these steps appears after this list):
Distance formulas (Source: Author)
  • → Calculate the distance between the test point and each row of the training data using any one of the following methods: Euclidean, Manhattan, or Hamming distance. Euclidean is the most commonly used.
  • → Based on the distance values, sort the training rows in ascending order.
  • → Choose the top K rows from the sorted array.
  • → Assign a class to the test point based on the most frequent class among these rows.
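Here is a from-scratch sketch of these four steps, assuming the training data is held in NumPy arrays. The function names (euclidean, manhattan, hamming, knn_predict) are my own for illustration, not from any library.

#From-scratch sketch of the four steps above
import numpy as np
from collections import Counter

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))    #straight-line distance

def manhattan(a, b):
    return np.sum(np.abs(a - b))            #city-block distance

def hamming(a, b):
    return np.sum(a != b)                   #count of differing components

def knn_predict(X_train, y_train, x_test, k=5, distance=euclidean):
    #Step 1: distance from the test point to every row of training data
    distances = [distance(x_test, row) for row in X_train]
    #Steps 2-3: sort in ascending order and keep the top K rows
    nearest = np.argsort(distances)[:k]
    #Step 4: assign the most frequent class among those K rows
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]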

We know that KNN is used for both regression and classification. But what is the difference between KNN regression and KNN classification?

  • KNN regression predicts the output as the mean (local average) of the K nearest neighbors' values.
  • KNN classification predicts the class as the mode (most frequent class) of the K nearest neighbors' labels. The short sketch below shows both.
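A quick sketch of that difference using scikit-learn; the toy data here is made up so the mean/mode behaviour is easy to verify by hand.

#Mean vs. mode: the regressor averages neighbor values, the classifier votes
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[1], [2], [3], [4], [5]])
y_reg = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_clf = np.array([0, 0, 1, 1, 1])

print(KNeighborsRegressor(n_neighbors=3).fit(X, y_reg).predict([[3]]))   #mean of 2.0, 3.0, 4.0 -> [3.]
print(KNeighborsClassifier(n_neighbors=3).fit(X, y_clf).predict([[3]]))  #mode of 0, 1, 1 -> [1]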

Main problems faced with KNN:

KNN doesn’t perform well with a large number of features, so feature engineering becomes an important step when you select the KNN algorithm. To deal with the curse of dimensionality, you can perform principal component analysis (PCA) before applying KNN, or you can use a feature selection approach.
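One common way to wire this up is a scikit-learn pipeline that scales the features and runs PCA before KNN. A minimal sketch, where n_components=10 is just an assumed value you would tune for your data:

#Sketch: PCA before KNN to fight the curse of dimensionality
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

pipe = make_pipeline(
    StandardScaler(),                  #KNN is distance-based, so scale first
    PCA(n_components=10),              #assumed value; tune for your data
    KNeighborsClassifier(n_neighbors=5),
)
#then pipe.fit(X_train, y_train) and pipe.score(X_test, y_test) as usual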

KNN Implementation with Python

#Import all important libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
import seaborn as sns

#Read the dataset (the original article's dataset is not shown, so the
#Iris dataset stands in here to make the snippet run end to end)
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

#Train the MODEL for each candidate value of K
K = []
training = []
test = []
scores = {}
for k in range(2, 21):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    training_score = clf.score(X_train, y_train)
    test_score = clf.score(X_test, y_test)
    K.append(k)
    training.append(training_score)
    test.append(test_score)
    scores[k] = [training_score, test_score]

#Evaluating the model
for key, value in scores.items():
    print(key, ':', value)

#Plotting the training scores graph
ax = sns.stripplot(x=K, y=training)
ax.set(xlabel='values of k', ylabel='Training Score')
plt.show()

#Plotting the testing scores graph
ax = sns.stripplot(x=K, y=test)
ax.set(xlabel='values of k', ylabel='Test Score')
plt.show()

#Plotting both the training and testing scores together
plt.scatter(K, training, color='k', label='training')
plt.scatter(K, test, color='g', label='test')
plt.legend()
plt.show()
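As a small follow-up of my own (not part of the original snippet), the scores dictionary built above makes it easy to pick the K with the best test score:

#Pick the K with the highest test score from the scores dictionary
best_k = max(scores, key=lambda k: scores[k][1])
print('best k:', best_k, 'with test score:', scores[best_k][1])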

A Quick Summary of KNN Algorithm

  • K is a positive integer; values around 2–5 tend to suit small datasets and 5–8 relatively large ones.
  • Iteratively change K to find the best result (the cross-validation sketch below is one way to automate this).
  • KNN makes predictions using the similarity between an input sample and each training instance.
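Rather than tuning K against a single train/test split as above, you can cross-validate. Here is a minimal sketch using scikit-learn's GridSearchCV (it assumes the X_train and y_train from the earlier snippet):

#Tune K with 5-fold cross-validation instead of a single train/test split
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': list(range(2, 21))},
                    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)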

This brings us to the end of this blog. I tried to cover everything I know about KNN. I hope you liked it; if not, please leave feedback in the responses. Thank you, and have a great day!