Machine Learning: Getting Started With K-Nearest Neighbors

What we’ll learn in this article:

  1. What is K-NN?
  2. How the decision surface changes with the value of K
  3. Overfitting and Underfitting for K-NN
  4. Types of distances used in K-NN
  5. Space & time complexity of K-NN
  6. K-NN for regression
  7. Applications of K-NN in real world

K-Nearest Neighbors, or K-NN, is a simple yet powerful Machine Learning algorithm. K-NN can be used for both Classification and Regression. In this blog, we’ll focus mainly on Classification, since that is where K-NN is most often used.

What is K-NN? What does it do? How does this algorithm work?

In simple words, given a labelled data set, K-NN tries to draw a boundary surface between the classes to separate them. When a query point is given to K-NN, it calculates the distance between the query point and the surrounding data points, finds the K nearest ones, and assigns the query point to the class held by the majority of those neighbors.
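
To make this concrete, here is a minimal from-scratch sketch of the idea in Python. The toy data, the `knn_classify` name, and the choice of straight-line (Euclidean) distance are illustrative assumptions, not a fixed recipe:

```python
from collections import Counter
import math

def knn_classify(query, data, labels, k=3):
    """Classify `query` by the majority label among its k nearest neighbors."""
    # 1. Compute the distance from the query point to every training point.
    distances = []
    for point, label in zip(data, labels):
        dist = math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point)))
        distances.append((dist, label))

    # 2. Keep the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]

    # 3. The majority class among those neighbors decides the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data set with two classes, "star" and "triangle".
data = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [3.0, 3.0]]
labels = ["triangle", "triangle", "star", "star"]
print(knn_classify([1.0, 0.9], data, labels, k=3))  # -> 'triangle' (2 votes vs 1)
```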

Example:

[Figure: a three-step illustration of K-NN classifying a new point (source: Google)]

From the above figures:

Picture 1: We have two classes in our data, stars and triangles. A new point is introduced into the data set, and we need to classify it.

Picture 2: The distances from our new data point to the surrounding data points are calculated; the closest of these points are its nearest neighbors.

Picture 3: Let’s say we set K = 3, so the three nearest neighboring points are considered. Among these three neighbors the majority are triangles (triangles = 2, stars = 1), so our new data point is classified as a triangle.

Simple, right? Congrats, you now understand the basic functionality of K-Nearest Neighbors.

You might have a question here: how do we pick the correct value of K?

K in K-NN is a hyperparameter. We usually use an odd number for K (3, 5, 7, …) since the algorithm decides the class of the query point by majority vote, and an odd K avoids ties in a two-class problem (K = 1 is allowed, but as we’ll see below it tends to overfit). We need to be careful while choosing the value of K, because a poor choice can lead to either overfitting or underfitting: K should be neither too small nor too large. A common approach is to pick K by cross-validation, as sketched below.
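
Here is a small sketch of that cross-validation idea using scikit-learn; the iris data set and the range of K values are purely illustrative assumptions:

```python
# Sketch: picking K by 5-fold cross-validation over a grid of odd values.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

param_grid = {"n_neighbors": [3, 5, 7, 9, 11, 13, 15]}  # odd values of K
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print("best K:", search.best_params_["n_neighbors"])
print("cross-validated accuracy:", round(search.best_score_, 3))
```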

So, what are overfitting and underfitting??

Before we get to overfitting and underfitting, let’s first understand what a decision surface is.

Decision Surface: It can be defined as a surface that separates the classes in a given data set.

The surface which is separating the three classes in the above figure is called a decision surface or a decision boundary.

How the decision surface changes when the K value changes:

K = 1

As seen in the above picture, when K = 1 the algorithm tries to create a surface that completely separates the two classes. Suppose there is a stray purple point P1 deep inside the yellow region. If a query point lands in the yellow region and its single nearest neighbor happens to be P1, the query point is classified as purple, even though all the other points surrounding it are yellow. Since K = 1, no other neighbors are considered, so the query point is likely to be misclassified.

When K-NN creates a surface that perfectly separates the two classes of the training data, as shown in figure 1, the model is said to be overfitting.

K = 5, 7, 15, …

As shown in the above figure, as the value of K increases the decision surface becomes smoother.

K = n

Similarly, when we have n data points in the data set and K = n, the algorithm simply assigns every query point to the class with the largest number of data points, no matter where the query lies. This is underfitting in K-NN.
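
The sketch below shows both extremes on a toy two-class data set (assuming scikit-learn and its make_moons generator, which are illustrative choices, not part of the original example):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Two overlapping classes with some noise, split into train and test sets.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 15, len(X_train)):  # a tiny K, a moderate K, and K = n
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"K={k:>3}  train acc={model.score(X_train, y_train):.2f}  "
          f"test acc={model.score(X_test, y_test):.2f}")

# Typically: K=1 is near-perfect on the training data but worse on the test
# data (overfitting), a moderate K balances both, and K = n predicts the
# majority class everywhere (underfitting).
```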

How does the K-NN calculate the distance between two points?

The distance in K-NN can be calculated in three ways:

  1. Euclidean Distance
  2. Manhattan Distance
  3. Minkowski Distance

Euclidean Distance (L2):

Often referred to as the L2 norm of a vector, this is simply the length of the shortest straight line between two data points (vectors). The Euclidean distance between two points X1 and X2 in d dimensions is given by:

d(X1, X2) = sqrt( Σ (X1i − X2i)² ), with the sum taken over the d dimensions i = 1, …, d.

Manhattan Distance (L1):

Often referred to as the L1 norm of a vector, it is the sum of the absolute differences between the coordinates of the two points. The Manhattan distance between two points X1 and X2 in d dimensions is given by:

d(X1, X2) = Σ |X1i − X2i|, with the sum taken over the d dimensions i = 1, …, d.

Minkowski Distance (LP):

Often referred to as the LP norm of a vector, it generalizes the two distances above. The Minkowski distance between two points X1 and X2 in d dimensions is given by:

d(X1, X2) = ( Σ |X1i − X2i|^P )^(1/P), with the sum taken over the d dimensions i = 1, …, d.

If the value of P is 2, then the Minkowski distance is the same as the Euclidean distance.

If the value of P is 1, then the Minkowski distance is the same as the Manhattan distance.

Euclidean distance is used as the default distance metric in most K-NN implementations (for example, scikit-learn’s KNeighborsClassifier defaults to the Minkowski metric with P = 2, which is equivalent to Euclidean distance).
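
A quick NumPy sketch (the two points are made up) that computes all three distances and checks the P = 1 and P = 2 special cases:

```python
import numpy as np

# Two made-up points in d = 3 dimensions.
x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x1 - x2) ** 2))       # L2 norm of the difference
manhattan = np.sum(np.abs(x1 - x2))               # L1 norm of the difference
minkowski = lambda p: np.sum(np.abs(x1 - x2) ** p) ** (1 / p)  # LP norm

print(euclidean, minkowski(2))  # identical: Minkowski with P = 2 is Euclidean
print(manhattan, minkowski(1))  # identical: Minkowski with P = 1 is Manhattan
```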

Space and Time Complexity of K-NN:

The time and space complexity of K-NN are high, and this is considered one of its limitations: a brute-force implementation stores the entire training set (n points in d dimensions) and, for every query, computes the distance to all n of those points.

Space Complexity: O(nd)

Time Complexity: O(nd) per query

Bias & Variance trade-off:

If the value of K is low, the model tends to overfit (low bias, high variance).

If the value of K is high, the model tends to underfit (high bias, low variance).

This is all for the K-NN classification. Now let’s take a look at how K-NN can be implemented for regression.

K-NN for Regression:

As we discussed earlier, in K-NN classification the algorithm finds the K nearest neighbors of the query point and classifies it by majority vote among them.

Similarly, in K-NN regression we again find the K nearest neighbors of the query point, but instead of taking a majority vote we predict the mean (or median) of the neighbors’ target values.

Note: the mean can be strongly affected by outliers, whereas the median is much more robust to them. A small sketch of both options follows.
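
In the sketch below, the 1-D data, the knn_median_predict helper, and the use of scikit-learn’s KNeighborsRegressor (which averages the neighbors’ targets by default) are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # feature values
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])            # target values

# Mean of the K nearest targets (KNeighborsRegressor's default behaviour).
model = KNeighborsRegressor(n_neighbors=3).fit(X, y)
print(model.predict([[2.5]]))

# Hand-rolled median variant for 1-D data, for when outliers are a concern.
def knn_median_predict(x, X, y, k=3):
    nearest = np.argsort(np.abs(X.ravel() - x))[:k]  # indices of the k closest points
    return np.median(y[nearest])

print(knn_median_predict(2.5, X, y))
```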

Applications of K-Nearest Neighbors in real-world:

  1. K-NN can be used for text categorization.
  2. Stock market forecasting is one of the core financial applications of K-NN.
  3. Recommender systems are another area where K-NN can be applied.
  4. K-NN also has many applications in the field of medicine.

For more information on K-NN, you can refer to:

  1. https://scikit-learn.org/stable/modules/neighbors.html
  2. https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

If you learned something new or enjoyed reading this article, please clap it up 👏 and share it so that others will see it. Feel free to leave a comment too.