Machine Learning Algorithms from Start to Finish in Python: KNN

Original article was published by Vagif Aliyev on Artificial Intelligence on Medium


Getting Technical with the code

It’s great to know how to use ML libraries like scikit-learn to code algorithms, but what can really level up your ML skills is learning how to build the algorithms from scratch. So, we will be doing just that; creating a KNNClassifier from scratch!


NOTE: the link to the code can be found here; however, I recommend going through the blog post before checking out the code to get a complete understanding of what is going on.

First things first, let’s import our libraries:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from collections import Counter
iris = load_iris()
X, y = iris.data, iris.target

Yes, the only reason we are importing scikit-learn is to use the iris dataset and to split the data. Besides that, we are using plain NumPy and collections!

Next, let’s create our train and test set:

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1810)

Nothing out of the ordinary here, so let’s move swiftly along.

Now, I mentioned that feature scaling was an important preprocessing step for KNN. However, our data already lies in a similar range, so we can skip this step. In real world data however, it is very rare that we get this lucky.
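For datasets whose features do span very different ranges, a standardization step would be needed before KNN, since the Euclidean distance is dominated by large-scale features. A minimal sketch using scikit-learn's StandardScaler (not part of this tutorial's from-scratch code) might look like:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1810)

scaler = StandardScaler()
# Fit the scaler on the training data only, then apply the same
# transformation to the test data to avoid information leakage.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

After scaling, every training feature has mean 0 and unit variance, so each feature contributes comparably to the distance.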

Using an OOP Approach for coding the Algorithm

Clean and reusable code is key for any Data Scientist or Machine Learning Engineer. Therefore, to follow Software Engineering principles, I will be creating a KNN class to make the code reusable and pristine.

First, we define our class name, and pass in some parameters. Namely,

  • X(the features)
  • y(the label vector)
  • n_neighbors(the number of neighbors we desire)
class KNN:
    def __init__(self, X, y, n_neighbors=3):
        self.X = X
        self.y = y
        self.n_neighbors = n_neighbors

Next, we convert our Euclidean distance formula from above into code and make it a method of the class:

def euclidean(self, x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))
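As a quick sanity check, this distance formula agrees with the classic 3-4-5 right triangle:

```python
import numpy as np

def euclidean(x1, x2):
    # Square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x1 - x2) ** 2))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(euclidean(a, b))  # 5.0
```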

Now the heavy lifting. I will initially show you the code and then explain what is going on:

def fit_knn(self, x_test):
    # Distance from the test point to every training point
    distances = [self.euclidean(x_test, x) for x in self.X]
    # Indices of the n_neighbors closest training points
    k_nearest = np.argsort(distances)[:self.n_neighbors]
    # Labels of those neighbors
    k_nearest_labels = [self.y[i] for i in k_nearest]
    # Majority vote among the neighbors
    most_common = Counter(k_nearest_labels).most_common(1)[0][0]
    return most_common

We first create a method named fit_knn that does, well, fit a KNN to the data! More specifically, for a single test point the following is done:

  1. The distance from the test point to every data point in the training set is measured
  2. The K nearest points are found (K being our n_neighbors parameter, which, in our case, is 3)
  3. The labels of those K nearest neighbors are collected
  4. The most common class among them is counted and returned by the method
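The steps above can be traced on a tiny made-up example (the distances and labels here are hypothetical, just to show what np.argsort and Counter do):

```python
import numpy as np
from collections import Counter

distances = [0.9, 0.1, 0.5, 0.3, 0.7]  # distance to each training point
labels = [2, 0, 1, 0, 1]               # class of each training point

# Indices of the 3 smallest distances, in increasing order of distance
k_nearest = np.argsort(distances)[:3]
k_nearest_labels = [labels[i] for i in k_nearest]
# Majority vote among the 3 nearest neighbors
most_common = Counter(k_nearest_labels).most_common(1)[0][0]
print(k_nearest_labels, most_common)  # prints [0, 0, 1] 0
```

The three closest points (at distances 0.1, 0.3, and 0.5) have labels 0, 0, and 1, so the majority vote returns class 0.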

Finally, to top it all off, we add a predict method that applies fit_knn to every sample in the test set, and make predictions:

def predict(self, X_test):
    return [self.fit_knn(x) for x in X_test]

knn = KNN(X_train, y_train)
preds = knn.predict(X_test)

Now, let’s evaluate our model and see how well it did classifying the new samples:

accuracy = (preds == y_test).mean()

OUT: 1.0
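As a sanity check (not in the original post), the same split can be fed to scikit-learn's own KNeighborsClassifier, which should reach comparable accuracy to our from-scratch version:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1810)

# Same K as our from-scratch classifier
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```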

So, the full code is the following:

import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1810)

class KNN:
    def __init__(self, X, y, n_neighbors=3):
        self.X = X
        self.y = y
        self.n_neighbors = n_neighbors

    def euclidean(self, x1, x2):
        return np.sqrt(np.sum((x1 - x2) ** 2))

    def fit_knn(self, x_test):
        distances = [self.euclidean(x_test, x) for x in self.X]
        k_nearest = np.argsort(distances)[:self.n_neighbors]
        k_nearest_labels = [self.y[i] for i in k_nearest]
        most_common = Counter(k_nearest_labels).most_common(1)[0][0]
        return most_common

    def predict(self, X_test):
        return [self.fit_knn(x) for x in X_test]

knn = KNN(X_train, y_train)
preds = knn.predict(X_test)
accuracy = (preds == y_test).mean()