This article was originally published by Akshit Kothari in Artificial Intelligence on Medium.
Step 1: Import the libraries
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
Step 2: Read and pre-process the data
#Read the csv in the form of a dataframe (file and column names assumed here)
df = pd.read_csv('seattle_rainfall.csv')
#Removing the null values
df = df.dropna()
#Reset the index to avoid indexing errors after dropping rows
df = df.reset_index(drop=True)
#Converting the boolean RAIN label to 0/1
y = df['RAIN'].replace([False, True], [0, 1])
#Removing the Date feature, and RAIN because it is our label
x = df.drop(['DATE', 'RAIN'], axis=1)
#Splitting the data into train (75%) and test (25%)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
Step 3: Implementing Euclidean distance to find the nearest neighbor
Let’s try to understand Fig. 5 first. We have data points with three features grouped into two clusters, similar to our Seattle Rainfall data. Let’s assume cluster 1 contains the data points for days when it did not rain and cluster 2 the data points for days when it did rain. These two clusters can be thought of as our training data. The three data points outside the two clusters are the testing data for which we have to find the label.
Now, we will use the Euclidean distance to calculate the distance between the training data and the testing data. Other distance metrics, such as the Manhattan distance or the Minkowski distance, could be used instead.
Euclidean Distance: the straight-line distance between two points, computed as the square root of the sum of squared differences between their coordinates.
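To make the definition concrete, here is a quick numeric check with two illustrative points (the values are made up, not from the rainfall data):

```python
import numpy as np

# Two illustrative 2-D points
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Straight-line distance: square root of the summed squared differences
d = np.sqrt(np.sum((p - q) ** 2))
print(d)  # 5.0 (the differences 3 and 4 form a 3-4-5 right triangle)

# np.linalg.norm computes the same quantity
print(np.linalg.norm(p - q))
```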
Since KNN is non-parametric, i.e. it makes no assumptions about the probability distribution of the input, we compute the distance of each test data point from the training data, and the training points nearest to the test point are the ones used to classify it. Let’s see this with the help of an image:
In the image we have 5 data points per feature: 2 classified into one cluster, 2 into the other, and 1 not yet labelled. Once we calculate the Euclidean distance from the unlabelled point to each of these data points, we can decide which cluster it belongs to.
The K in KNN is the number of neighbours we consider. If K is set to 5, for example, the algorithm takes the 5 shortest distances, and the class that appears most frequently among those 5 neighbours becomes the class of the test data point.
To illustrate the purpose of K, Fig. 7 uses only one independent variable with 2 labels, i.e. binary classification. After calculating the distances we mark the 5 nearest neighbours, since K is assigned a value of 5. Because the blue label occurs more frequently among them than the red one (3 blue versus 2 red), we label the test data point as blue. Choosing the value of K is a task in itself in K-NN: it may require multiple iterations, trying different values and evaluating the model on each.
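The one-feature, two-label setup described above can be sketched in a few lines. This is a toy example (the data points and the candidate K values are made up), showing how the vote among the K nearest neighbours is taken for each K we might try:

```python
import numpy as np
from collections import Counter

# Toy 1-D training data with two labels (0 and 1), not the rainfall data
train_x = np.array([1.0, 1.5, 2.0, 8.0, 9.0, 9.5])
train_label = np.array([0, 0, 0, 1, 1, 1])
test_point = 3.0

for k in (1, 3, 5):
    # Distance from the test point to every training point (1-D, so just |a - b|)
    dist = np.abs(train_x - test_point)
    # Labels of the k nearest neighbours
    nearest = train_label[np.argsort(dist)[:k]]
    # Majority vote among those labels
    vote = Counter(nearest).most_common(1)[0][0]
    print(k, vote)
```

Trying several values of K like this, and scoring each resulting model on held-out data, is the usual way the "right" K is found.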
#Wrapping the distance computation in a helper function (the function name
#and its arguments are chosen here for clarity): it classifies one test
#point x against the training set
def knn_predict(x, train_data, train_label, k):
    #Computing Euclidean distance from x to every training point
    dist_ind = np.sqrt(np.sum((train_data - x)**2, axis=1))
    #Concatenating the label with the distance
    main_arr = np.column_stack((train_label, dist_ind))
    #Sorting by distance in ascending order
    main = main_arr[main_arr[:,1].argsort()]
    #Calculating the frequency of the labels based on the value of K
    count = Counter(main[0:k,0])
    keys, vals = list(count.keys()), list(count.values())
    #Returning the most frequent label as the prediction
    return keys[vals.index(max(vals))]
Step 4: Calculating the accuracy and classification report
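The evaluation code is not shown in this extract; a minimal sketch of how the imported sklearn metrics would be applied (the label arrays below are toy placeholders, in the article they would be the true test labels and the KNN predictions):

```python
from sklearn.metrics import classification_report, accuracy_score

# Toy true labels and predictions (illustrative only)
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0]

# Fraction of correct predictions
print(accuracy_score(y_true, y_pred))  # 5 of 6 correct

# Per-class precision, recall and F1
print(classification_report(y_true, y_pred))
```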
We achieved 96% accuracy on the training data and 94% on the test data. Not bad!