All about KNN

Originally published by Harshal Ramdham in Artificial Intelligence on Medium.


Here I cover the most frequently asked questions about the KNN algorithm.

1. Explain K-Nearest Neighbors (KNN)?

The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other.
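A minimal sketch of KNN classification using scikit-learn; the toy dataset from make_classification is made up purely for illustration:

```python
# Minimal KNN classification sketch with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy data for illustration; replace with your own dataset.
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# k = 5: each prediction is a majority vote among the 5 closest training points.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))
```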

2. Define distance measures: Euclidean (L2), Manhattan (L1), Minkowski, Hamming?

Minkowski Distance:

The formula for the Minkowski distance between two n-dimensional points x and y is:

d(x, y) = ( Σᵢ |xᵢ − yᵢ|^p )^(1/p)

If p = 1, this is the Manhattan distance.

If p = 2, this is the Euclidean distance.

Hamming Distance:

The Hamming distance represents the number of positions at which two corresponding pieces of data differ. It is often used in error correction and in comparing strings or bit sequences.

Example: Suppose there are two strings, 1011001001 and 1001000011.

1011001001 ⊕ 1001000011 = 0010001010. Since this contains three 1s, the Hamming distance is d(1011001001, 1001000011) = 3.

Manhattan Distance (L1 Norm):

We use the Manhattan distance when we need to measure the distance between two data points along a grid-like path. As mentioned above, it is obtained from the Minkowski formula by setting p to 1.

Let's say we want to calculate the distance d between two data points, x and y.

The distance d is the sum of the absolute differences of their Cartesian coordinates:

d(x, y) = Σᵢ |xᵢ − yᵢ|

Euclidean Distance (L2 Norm):

The Euclidean distance formula can be used to calculate the distance between two data points in a plane.

Euclidean distance is one of the most widely used distance metrics. It is obtained from the Minkowski formula by setting p to 2, which gives:

d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )
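For concreteness, here is a small NumPy sketch of the four distance measures above; the example vectors and strings are made up:

```python
import numpy as np

def minkowski(x, y, p):
    # (sum of |x_i - y_i|^p) ^ (1/p)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def manhattan(x, y):
    # Minkowski with p = 1
    return minkowski(x, y, p=1)

def euclidean(x, y):
    # Minkowski with p = 2
    return minkowski(x, y, p=2)

def hamming(s1, s2):
    # Number of positions at which two equal-length strings differ.
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(manhattan(x, y))                       # 5.0
print(euclidean(x, y))                       # ~3.606
print(hamming("1011001001", "1001000011"))   # 3, matching the example above
```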

3. What is Cosine Distance & Cosine Similarity?

The relation between cosine similarity and cosine distance can be defined as below:

a. Similarity decreases when the distance between two vectors increases.

b. Similarity increases when the distance between two vectors decreases.

Cosine similarity measures the similarity between two points or vectors by the angle between them.

The formulas for cosine similarity and cosine distance are:

cosine_similarity(A, B) = (A · B) / (‖A‖ ‖B‖)

cosine_distance(A, B) = 1 − cosine_similarity(A, B)

Cosine similarity and cosine distance are heavily used in recommendation systems to recommend products to users based on their likes and dislikes.

There are three cases:

Case 1: When the angle between points P1 & P2 is 45 degrees, cosine_similarity = cos 45° ≈ 0.707.

P1 and P2 are about 71% similar.

cosine_distance = 1 − 0.707 ≈ 0.293

Case 2: When two points P1 & P2 are far from each other and the angle between them is 90 degrees, cosine_similarity = cos 90° = 0.

P1 and P2 are dissimilar.

cosine_distance = 1 − 0 = 1

Case 3: When two points P1 & P2 are very near and lie along the same direction, the angle between them is 0 degrees, so cosine_similarity = cos 0° = 1.

P1 and P2 are similar.

cosine_distance = 1 − 1 = 0

Here we see that when the distance is small (points are near each other), the similarity is high, and when the distance is large (points are far from each other), the two points are dissimilar.
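A short NumPy sketch of cosine similarity and cosine distance; the two example points are made up and chosen to be 45 degrees apart, matching Case 1:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)

p1 = np.array([1.0, 0.0])
p2 = np.array([1.0, 1.0])                    # 45 degrees apart from p1
print(round(cosine_similarity(p1, p2), 3))   # 0.707
print(round(cosine_distance(p1, p2), 3))     # 0.293
```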

4. Limitations of KNN?

KNN's time and space complexity is O(n), where n is the number of training points.

  • In real-time use, KNN takes a lot of time and space to evaluate, because both its time and space complexity are O(n).
  • KNN fails for large-scale, low-latency applications.
  • In higher-dimensional spaces, the cost of calculating distances becomes expensive, which hurts performance.
  • KNN is sensitive to outliers and missing values, so we first need to impute the missing values and remove outliers before applying the algorithm.

5. How to handle overfitting and underfitting in KNN?

A model is overfitting when it performs well on the training data but poorly on cross-validation (test) data.

A model is underfitting when it performs badly on both the training and the cross-validation (test) data.

To overcome this, we choose the best value of K using cross-validation or the elbow method: in KNN, a very small K (e.g., K = 1) tends to overfit, while a very large K tends to underfit. We can also resample the dataset and hold out a validation set.
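A minimal sketch of picking K by cross-validation, assuming a scikit-learn workflow; the toy dataset is generated just for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy data for illustration; replace with your own dataset.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

scores = {}
for k in range(1, 31, 2):            # odd K values avoid tied votes
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("Best K:", best_k, "mean CV accuracy:", round(scores[best_k], 3))
```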

6. How to measure the effectiveness of k-NN?

KNN = distance computation + majority vote.

First, break the actual dataset into two parts, train and test:

D_train ∪ D_test = actual dataset

D_train ∩ D_test = ∅

Then train the KNN model using the train data, and use the test dataset to measure the model.

Now measure:

Accuracy = count / n₂

where count = the number of test points for which KNN (trained on D_train) gave the correct class label, and n₂ = the number of points in D_test.

Accuracy lies in [0, 1].
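A tiny worked example of this accuracy computation; the label arrays below are made up for illustration:

```python
import numpy as np

# y_true: labels of D_test; y_pred: labels predicted by KNN trained on D_train.
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 0])

count = np.sum(y_true == y_pred)   # points classified correctly
n2 = len(y_true)                   # number of points in D_test
accuracy = count / n2              # lies in [0, 1]
print(accuracy)                    # 0.75
```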

7. Need for Cross validation?

Suppose we have a dataset of 1000 points, with 70% used for training and 30% for testing. In a basic train-test split, the model trains on the 70% and is evaluated on the remaining 30%. But the 70% and 30% are selected randomly, so the training set may not be representative of the whole dataset, and the measured accuracy can go down.

Because the train-test split is random, the result changes every time we refresh the model or add new data; the data is reshuffled, so the accuracy fluctuates. This is the major disadvantage of a plain train-test split.

To prevent this, we use cross-validation.

https://www.analyticsvidhya.com/blog/2018/05/improve-model-performance-cross-validationin-python-r/

Types:

1. Leave one out cross validation (LOOCV)

2. k-fold cross validation

3. Stratified k-fold cross validation

4. Adversarial validation

5. Cross validation for time series

6. Custom cross validation techniques

8. What is K-fold cross validation?

It’s easy to follow and implement. Below are the steps for it:

1. Randomly split your entire dataset into k "folds"

2. For each fold in your dataset, build your model on the other k − 1 folds. Then test the model on the kth fold to check its effectiveness

3. Record the error you see on each of the predictions

4. Repeat this until each of the k-folds has served as the test set

5. The average of your k recorded errors is called the cross-validation error and will serve as your performance metric for the model

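A minimal sketch of these steps using scikit-learn's KFold; the toy dataset is generated just for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Toy data for illustration; replace with your own dataset.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

errors = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X[train_idx], y[train_idx])                     # build on k-1 folds
    errors.append(1 - knn.score(X[test_idx], y[test_idx]))  # error on kth fold

# Average of the k recorded errors = cross-validation error.
print("Cross-validation error:", np.mean(errors))
```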

9. What is Time based splitting?

In a machine learning workflow we split the given dataset into training and test data.

We can either split randomly or use time-based splitting. For time-based splitting we need a timestamp as one of the attributes/features. For example, an e-commerce website may have reviews for various products, and these reviews can carry timestamps. In such scenarios it's better to use a time-based strategy: first sort the reviews by timestamp, then do the split.

Sort the data by time.

Split: Training (80%) and Testing (20%).
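A short pandas sketch of a time-based split; the column names ("timestamp", "label") and the toy rows are hypothetical:

```python
import pandas as pd

# Hypothetical time-stamped data for illustration.
df = pd.DataFrame({
    "timestamp": pd.date_range("2020-01-01", periods=10, freq="D"),
    "label":     [0, 1, 0, 0, 1, 1, 0, 1, 1, 0],
})

df = df.sort_values("timestamp")   # 1. sort by time
cut = int(len(df) * 0.8)           # 2. first 80% -> train, last 20% -> test
train, test = df.iloc[:cut], df.iloc[cut:]
print(len(train), len(test))       # 8 2
```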

10. Weighted k-NN?

Weighted KNN is a modified version of k nearest neighbors.

a. In this technique we give a weight to each of the k nearest neighbours, typically W = 1/d, where d is the neighbour's distance from the query point.

As the distance decreases, the weight increases, and vice versa.

b. After giving each neighbour its weight, sum the total weight of the negative-class points and the total weight of the positive-class points, then predict the class with the higher total weight, as in the sketch below.
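A minimal from-scratch sketch of weighted KNN for binary labels; the training points and query point are made up for illustration:

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, query, k=3):
    """Predict a binary label by distance-weighted majority vote (W = 1/d)."""
    dists = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k closest points
    weights = 1.0 / (dists[nearest] + 1e-9)    # small epsilon avoids divide-by-zero
    # Sum the weights for each class and pick the heavier one.
    pos = weights[y_train[nearest] == 1].sum()
    neg = weights[y_train[nearest] == 0].sum()
    return 1 if pos > neg else 0

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 1, 1])
print(weighted_knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # 0
```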