
# Getting Technical with the Code

It’s great to know how to use ML libraries like scikit-learn to code algorithms, but what can really level up your ML skills is learning how to build the algorithms from scratch. So, we will be doing just that: creating a KNN classifier from scratch!

NOTE: the link to the code can be found here; however, I recommend going through the blog post before checking out the code to get a complete understanding of what is going on.

First things first, let’s import our libraries:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from collections import Counter

iris = load_iris()
X, y = iris.data, iris.target
```

Yes, the only reason we are importing scikit-learn is to use the iris dataset and to split the data. Besides that, we are using plain NumPy and collections!

Next, let’s create our train and test set:

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1810)
```

Nothing out of the ordinary here: by default, train_test_split holds out 25% of the samples as the test set. Let’s move swiftly along.

Now, I mentioned earlier that feature scaling is an important preprocessing step for KNN. However, our features already lie in a similar range, so we can skip this step. With real-world data, however, it is very rare that we get this lucky.
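If the features did span very different ranges, a standardization step would go right after the split. Here is a minimal sketch using scikit-learn’s StandardScaler (an addition, not part of the original code):

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training split only, then apply it to both splits,
# so no information from the test set leaks into preprocessing
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```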

## Using an OOP Approach for coding the Algorithm

Clean, reusable code is key for any Data Scientist or Machine Learning Engineer. Therefore, following Software Engineering principles, I will be creating a KNN class to make the code reusable and pristine.

First, we define our class name and pass in some parameters. Namely:

- X (the feature matrix)
- y (the label vector)
- n_neighbors (the number of neighbors we desire)

```python
class KNN:
    def __init__(self, X, y, n_neighbors=3):
        self.X = X
        self.y = y
        self.n_neighbors = n_neighbors
```

Next, we convert our Euclidean distance formula from above into code and make it a method of the class:

```python
def euclidean(self, x1, x2):
    # Vectorized straight-line distance between two points
    return np.sqrt(np.sum((x1 - x2) ** 2))
```
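For reference, this is the standard Euclidean distance for two points with $n$ features each:

$$d(x_1, x_2) = \sqrt{\sum_{i=1}^{n} (x_{1,i} - x_{2,i})^2}$$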

Now the heavy lifting. I will initially show you the code and then explain what is going on:

```python
def fit_knn(self, X_test):
    # Note: X_test here is a single test point, not the whole test set.
    # Distance from the test point to every training point
    distances = [self.euclidean(X_test, x) for x in self.X]
    # Indices of the k closest training points
    k_nearest = np.argsort(distances)[:self.n_neighbors]
    # Labels of those k neighbors
    k_nearest_labels = [self.y[i] for i in k_nearest]
    # Majority vote: the most frequent label wins
    most_common = Counter(k_nearest_labels).most_common(1)[0][0]
    return most_common
```

We first create a method named fit_knn that does, well, the actual KNN work: given a single test point, it finds the nearest neighbors and votes on a label. More specifically, the following is being done:

- The distance between the test point and each data point in the training set is measured
- We get the K nearest points (K being our parameter for the number of neighbors, which, in our case, is 3)
- The code looks up the labels of those K nearest neighbors
- The most common label is counted and returned by the method
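Before we can classify the whole test set, we also need a small predict method that applies fit_knn to each test sample in turn (it appears again in the full code below):

```python
def predict(self, X_test):
    # Classify each test sample independently via majority vote
    preds = [self.fit_knn(x) for x in X_test]
    return preds
```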

Finally, to top it all off, we make predictions:

```python
knn = KNN(X_train, y_train)
preds = knn.predict(X_test)
```

Now, let’s evaluate our model and see how well it did classifying the new samples:

```python
accuracy = (preds == y_test).mean()
```

OUT: 1.0
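As a quick sanity check (an addition, not part of the original post), scikit-learn’s own KNeighborsClassifier should reach the same accuracy on this split:

```python
from sklearn.neighbors import KNeighborsClassifier

# Reference implementation with the same number of neighbors
sk_knn = KNeighborsClassifier(n_neighbors=3)
sk_knn.fit(X_train, y_train)
print(sk_knn.score(X_test, y_test))  # should match our from-scratch accuracy
```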

So, the full code is the following:

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1810)


class KNN:
    def __init__(self, X, y, n_neighbors=3):
        self.X = X
        self.y = y
        self.n_neighbors = n_neighbors

    def euclidean(self, x1, x2):
        return np.sqrt(np.sum((x1 - x2) ** 2))

    def fit_knn(self, X_test):
        # Classify a single test point by majority vote of its k neighbors
        distances = [self.euclidean(X_test, x) for x in self.X]
        k_nearest = np.argsort(distances)[:self.n_neighbors]
        k_nearest_labels = [self.y[i] for i in k_nearest]
        most_common = Counter(k_nearest_labels).most_common(1)[0][0]
        return most_common

    def predict(self, X_test):
        preds = [self.fit_knn(x) for x in X_test]
        return preds


knn = KNN(X_train, y_train)
preds = knn.predict(X_test)
accuracy = (preds == y_test).mean()
```
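To try the classifier on a single new flower, you can call predict with a one-row array (the measurements below are hypothetical, chosen for illustration):

```python
# Hypothetical sepal/petal measurements for one new flower
new_flower = np.array([[5.0, 3.4, 1.5, 0.2]])
print(knn.predict(new_flower))  # e.g. [0], i.e. setosa
```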