Writing your first classifier in ML



Episode #4 — How to write your own classifier in Machine Learning


This series will consist of 10 episodes of machine learning for beginners…

In this episode we’re going to get deeper into the matter: we’ll write our own classifier from scratch. In machine learning, this is a big milestone; if you can follow along and do this on your own, it means you understand an important piece of the puzzle. The classifier we’re going to write is a scrappy version of k-Nearest Neighbors.

So, what is k-Nearest Neighbors (KNN)?

k-Nearest Neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., a distance function). A case is classified by a majority vote of its neighbors, being assigned to the class most common among its K nearest neighbors as measured by the distance function. If K = 1, the case is simply assigned to the class of its nearest neighbor.

K! What does it stand for?

K in KNN is a parameter that refers to the number of nearest neighbors included in the majority voting process. K is based on feature similarity; choosing the right value of K is a process called parameter tuning and is important for better accuracy. Finding the value of K is not a simple task. Here are a few ideas on how to pick a value of K:

  1. We can try out various values by trial and error, assuming the training data is unknown (see the sketch after this list);
  2. Smaller values of K can be noisy and give individual points a higher influence on the result;
  3. Larger values of K produce smoother decision boundaries, which means lower variance but increased bias; they are also more computationally intensive;
  4. In general practice, a common choice is K = sqrt(N), where N is the number of samples in your training dataset.
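
To make these ideas concrete, here is a small, hedged sketch (not from the original article) that scores a handful of candidate K values with cross-validation and prints the sqrt(N) rule of thumb, assuming the Iris data used throughout this series:

import math

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Rule of thumb: K close to the square root of the number of samples.
k_rule_of_thumb = int(math.sqrt(len(X)))
print(f"sqrt(N) suggests K around {k_rule_of_thumb}")

# Trial and error: score a few candidate K values with 5-fold cross-validation.
for k in (1, 3, 5, 7, k_rule_of_thumb):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"K={k}: mean accuracy {scores.mean():.3f}")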

For more details on how KNN works, please refer to this article. In this post, we’re concerned with writing the classifier itself.

We’ll again work with the Iris dataset from episode #2, although using a different approach this time, with the help of the train_test_split function, one of the functions inside the model_selection module from scikit-learn. This function helps us split arrays or matrices into random train and test subsets.
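
As a rough sketch of that setup (assuming the Iris data from episode #2 and a 50/50 split; the exact split ratio in the original notebook may differ):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target  # features and labels

# Hold out half of the data for testing; the split is random on every run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)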

Writing the classifier

Now that we know how a KNN classifier works, let’s get our hands dirty. Previously we imported a classifier as seen in the code snippet below.

from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier()

Now we’ll comment these lines out and write our own; the rest of the code will stay the same.

# from sklearn.neighbors import KNeighborsClassifier
# clf = KNeighborsClassifier()

Let’s now implement a class to represent our classifier. I’ll call it testKNN; you can call it whatever you want.

Now let’s see what methods/functions we need to implement:

  • Looking at the classifier’s behavior, we can easily infer that we’ll have to implement fit() and predict() methods to train the classifier and make predictions, respectively.
  • Let’s declare both the fit() and predict() functions. Remember that the fit(self, X_train, y_train) function takes the features (X_train) and labels (y_train) of our training set as input, while predict(self, X_test) takes the features of our testing set and returns the predictions for their labels. The self parameter represents the instance of the class: by using the self keyword we can access the attributes and methods of the class in Python, and it binds the attributes to the given arguments. Our first goal is to get our pipeline working and understand what these methods/functions do. For now, our class looks like the one below:
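
The original snippet is an embedded file that isn’t reproduced here, so this is only a minimal sketch of what that skeleton might look like (the name testKNN follows the text; the bodies are placeholders for now):

class testKNN():
    """A scrappy k-Nearest Neighbors classifier (skeleton)."""

    def fit(self, X_train, y_train):
        """Train the classifier by memorizing the training features and labels."""
        pass

    def predict(self, X_test):
        """Return a list of predicted labels for the test features."""
        pass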

Let’s start with something simple: a random classifier. By random, I mean we’re just going to guess the label.

  1. Let’s add some code to the fit() and predict() functions… I’ve also included some docstrings and comments following PEP 8, the style guide for Python code.
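
The author’s file isn’t embedded here, so the line numbers mentioned in the next paragraph refer to the original gist; a minimal sketch of this random-choice version could look like this:

import random


class testKNN():
    """A scrappy classifier that simply guesses a label at random."""

    def fit(self, X_train, y_train):
        """Memorize the training features and labels."""
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X_test):
        """Guess a random training label for every row of test features."""
        predictions = []
        for row in X_test:
            # Pick any training label at random as our "prediction".
            label = random.choice(self.y_train)
            predictions.append(label)
        return predictions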

From the code above, we initialized the X_train and y_train attributes in the fit(self, X_train, y_train) function (lines 33–34 of the original file) and implemented a random choice in the predict(self, X_test) function (line 36). Here we’re just taking a random training label, label = random.choice(self.y_train), as our prediction for each row in the test features subset (a 2D array/list, line 50), appending it to the predictions list (line 51), and finally returning the entire list of predictions (line 53). At this point our pipeline is running again. Recalling that there are three different types of flowers in the Iris dataset, our accuracy should be around 33%. Let’s run the code.

From the notebook above we can clearly see in line [5] that this time we imported testKNN from the testKNN.py file, clf = testKNN(), after which we make use of the fit() and predict() functions in lines [6] and [7]. This time, though, we cannot use the score() function, because we didn’t define or override it in our testKNN class (we could have done so). Instead we’ll rely on accuracy_score() to check how accurate our model is. The accuracy of our model came out at about 32%; as expected, the accuracy is around 33%, because we’re making predictions at random and we have 3 types of flowers.
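
The notebook itself isn’t reproduced here; a rough sketch of those cells, reusing the train/test split from earlier, might be:

from sklearn.metrics import accuracy_score

from testKNN import testKNN  # our own classifier instead of scikit-learn's

clf = testKNN()
clf.fit(X_train, y_train)          # memorize the training data
predictions = clf.predict(X_test)  # for now, a random guess per test row

# Roughly 33% for the random version, since there are 3 flower classes.
print(accuracy_score(y_test, predictions))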

Now we know the interface for our classifier. Great, but when we started this episode our accuracy was above 90%, so let’s see if we can do better and improve the classifier’s accuracy. To achieve that, we’ll base our implementation on the KNN algorithm.

KNN classifier

Here’s the intuition behind how the algorithm works. In the graph alongside, we can clearly see that there are three (3) classes of data (A, B, and C), each containing its own data points. Let’s imagine the data points we see in each class are the training data we memorized in the fit function for the Iris dataset. If someone asks us to predict the class to which a new point P belongs, how can we do that?

Well, a nearest neighbor classifier works exactly like it sounds: we find the training point that is closest to the testing point. This point is known as the nearest neighbor. Then we predict that the testing point has the same label as that closest training point. For example, we can easily predict that the red point P belongs to class B, because the point nearest to it is in class B.

But what if the point P were equally distant from the nearest points of two or all three classes? How would we find its class label?

That’s where the K comes in. K is the number of neighbors we consider when making our prediction. If K is 1 we just look at the closest training point, but if K is 3 we’d have to look at the three (3) closest training points. That way we have more candidate training points among the classes to base our prediction on: the majority wins, and we infer that the point belongs to the majority class label. There are more details to this algorithm, but this is enough to get us started.
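
As a tiny illustration of that majority vote (just a sketch, not the author’s code), using Python’s collections.Counter:

from collections import Counter

# Labels of the K = 3 training points closest to our test point P.
nearest_labels = ["B", "B", "C"]

# The most common label among the neighbors wins.
predicted_class = Counter(nearest_labels).most_common(1)[0][0]
print(predicted_class)  # -> "B"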

Back to the code

We’ll have to find a way to get the nearest neighbor. To do that, we’ll measure the straight-line distance between two points, just like we do with a ruler. There’s a formula called the Euclidean distance that can help us with that.

Euclidean distance between points P and Q

From the graph above, we can clearly see that the Euclidean distance applies the Pythagorean theorem to find the distance between two points. The distance is the length of the hypotenuse C.
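
Written out for the two-dimensional case, with P = (p1, p2) and Q = (q1, q2), that gives:

d(P, Q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}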

With that concept in mind we can find the distance between two feature vectors. Something cool about the Euclidean distance formula is that it doesn’t restrict us to 2D space. It works the same way regardless of the dimension of the space, which means that with more features/dimensions we simply add more terms to the equation. Isn’t that cool?

For example, the features in our Iris dataset have 4 dimensions (a hypercube), so to compute the distance we just add two more squared terms under the square root. The general formula is:

Euclidean distance formula
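
The formula image isn’t reproduced here; for n feature dimensions, the Euclidean distance between points P and Q reads:

d(P, Q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}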

Now let’s go back and edit the testKNN_Random_Choice.py file so it returns properly predicted labels instead of a random choice.

We commented out the random library, as it’s no longer necessary this time. Instead, we’ll benefit from the scipy library, which is a free and open-source Python library used for scientific and technical computing. The submodule scipy.spatial.distance has a function euclidean(u, v, w=None) that will compute the Euclidean distance for us, and that’s what happens in the function euclidean_distance(self, a, b).
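
Just to see that scipy function in action (a quick check, not part of the classifier itself):

from scipy.spatial import distance

# Straight-line distance between the points (1, 2) and (4, 6): a 3-4-5 triangle.
print(distance.euclidean([1, 2], [4, 6]))  # -> 5.0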

Notice that we also defined the function closest_point(self, row), and this is where the magic happens: in this function we return the label of the training data point closest to the testing point’s features, with the help of the previously defined euclidean_distance function. By doing so we can predict data that our classifier has never seen before. Finally, we append that label to the predictions list for accuracy verification later.

Note: We did not define the value of K; by default K will be equal to one (K = 1). So far our version of KNN checks only the single closest data point.
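
The original testKNN.py isn’t embedded here either, so the following is only a minimal sketch of this 1-nearest-neighbor version, following the method names mentioned in the text:

from scipy.spatial import distance


class testKNN():
    """A scrappy 1-nearest-neighbor classifier (K = 1)."""

    def fit(self, X_train, y_train):
        """Memorize the training features and labels."""
        self.X_train = X_train
        self.y_train = y_train

    def euclidean_distance(self, a, b):
        """Straight-line distance between two feature rows."""
        return distance.euclidean(a, b)

    def closest_point(self, row):
        """Return the label of the training point closest to the given test row."""
        best_dist = self.euclidean_distance(row, self.X_train[0])
        best_index = 0
        for i in range(1, len(self.X_train)):
            dist = self.euclidean_distance(row, self.X_train[i])
            if dist < best_dist:
                best_dist = dist
                best_index = i
        return self.y_train[best_index]

    def predict(self, X_test):
        """For each test row, predict the label of its single closest training point."""
        predictions = []
        for row in X_test:
            predictions.append(self.closest_point(row))
        return predictions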

This time our accuracy is above 90%; we managed to reach 97.3%. Your accuracy may be a bit different because of the randomness of the test split, but as long as it’s above 90% it’s fine.

Pros and Cons of KNN

  • It’s simple and easy to understand;
  • Works reasonably well for some problems;
  • Computationally intensive;
  • Hard to represent relationships between features;

That’s it for this episode, I hope you enjoyed it… See you in episode #5.

Thanks for reading…

References

https://scikit-learn.org/stable/modules/cross_validation.html