Original article was published on Artificial Intelligence on Medium
Image Classification From Scratch With KNN: A Python Implementation
In recent years we have seen vast advances in the field of computer vision, the application of which spans many industries. Convolutional neural networks (CNN) have come to be known as the pre-eminent method in image classification as evidenced by its success in the ImageNet Large Scale Visual Recognition Challenge, and in some narrow cases have even reportedly exceeded human abilities. Although in this post we will explore a similar problem, we will approach it with the much simpler K-Nearest Neighbor(KNN) algorithm and in doing so show that reasonable performance can still be achieved with a less complex classifier. This will be implemented in python with the help of the numpy library.
What is KNN?
KNN is a non-parametric learning model that utilizes local approximation and does not require training. As such it is known as a ‘lazy learner’. The training set is memorized by the algorithm and called during the test step to predict new instances based on a simple majority vote of the labels of nearest neighbors.
The basis for determination of the nearest neighbors is predicated on a distance metric. Examples of common distance metrics include Euclidean, Hamming (categorical data) and Manhattan distances. The appropriate distance metric depends on the structure of the data and will directly affect the classifier’s accuracy. Research has shown that a simple KNN algorithm with the appropriate choice of distance metric can outperform many more sophisticated approaches in image classification tasks.
KNN also requires the selection of the parameter k , the number of neighbors by which to assess the classification of an observation. The interaction between the choice of k and the misclassification rate has been well documented and is an illustration of the classic bias vs variance trade off. When k = 1 the model will overfit on the training data set and become prone to generalization error, conversely when k is larger, the prediction is averaged over a larger neighborhood potentially increasing the misclassification rate.
To illustrate KNN in action, the above example introduces a toy example where the 7 closest neighbors for a new data point has been identified(k=7), by taking a simple sum of the labels of each of the neighbors we can see that the majority class is oranges (5 vs 2 counts) thus the new data point will be classified as an orange.
Compared to more sophisticated methods such as neural networks, KNN is significantly faster in the training phase as no computation is required, however the opposite is true in the testing phase where KNN can quickly become very expensive as both k and the number of dimensions increase. To combat this we will apply dimensionality reduction with PCA.
We will demonstrate the algorithm using the NMIST Fashion data set in python consisting of a training set of 60,000 observations (test set 10,000) of 28 by 28 gray scale images of ten types of clothing.
- Importing the data set
from keras.datasets import fashion_mnistimport numpy as npimport matplotlib.pyplot as plt((trainX, trainY), (testX, testY)) = fashion_mnist.load_data()
#flattening image pixel values into vectorstestX=testX.reshape([10000,784])trainX= trainX.reshape([60000,784])print(trainX.shape)print(trainY.shape)print(testX.shape)print(testY.shape)
2. Defining and applying a numpy implementation of the PCA algorithm.
for more information on this implementation check out this article:
def pca_np(x): #centering data
m = np.mean(x, axis =0)
x_centered = x - m #calculating covariance matrix
eigenvals, eigenvecs = np.linalg.eig(x_cov) #sorting
eigenvecs = eigenvecs[:,i]
eigenvals= eigenvals[i] #returning the eigenvalues, eigenvectors and means of training set
return(eigenvals, eigenvecs, m)
applying PCA to the training set and plotting % variance explained by first 20 principal components:
evals, evecs, train_mean = pca_np(trainX)exp_var =evals/sum(evals)#plot % of variance explainedplt.bar(range(1,21),exp_var[:20])
there are several “elbows” in the above plot. We will arbitrarily retain the first 10 principle components which explains about 72% of the variance in the
data. Now we project training and test data onto the retained eigenvectors:
#number of PCs to retainn =10 X_evecs_n = evecs[:,:n]#projecting the training and test data back onto the retained eigenvectors to get our factors for use in KNNtrainX_factors = np.dot(trainX-train_mean,X_evecs_n)
testX_factors= np.dot(testX-train_mean,X_evecs_n)#checking dimension after PCA
3. Now that we have pre-processed the data let us define the KNN function using numpy with the option of using both the euclidean and manhattan distance metrics.
def knn_np(x_train, y_train, K, X_test, dist = 'euclidean'): #calculating the distance metric for the test data point and each
observation in the training set if dist !='euclidean' : dist = (abs(X_test - x_train)).sum(axis = 1) else: dist = np.sqrt(((X_test - x_train)**2).sum(axis=1))arg_ascending = np.argsort(dist)classes = np.zeros(10)for i in range(k): if y_train[arg_ascending[i]]==0: classes += 1 elif y_train[arg_ascending[i]]==1: classes += 1 elif y_train[arg_ascending[i]]==2: classes += 1 elif y_train[arg_ascending[i]]==3: classes += 1 elif y_train[arg_ascending[i]]==4: classes += 1 elif y_train[arg_ascending[i]]==5: classes += 1 elif y_train[arg_ascending[i]]==6: classes += 1 elif y_train[arg_ascending[i]]==7: classes += 1 elif y_train[arg_ascending[i]]==8: classes += 1 elif y_train[arg_ascending[i]]==9: classes += 1 return np.argmax(classes)
4. As we do not need to train the algorithm, we can apply it directly to the test set. We will again arbitrarily select a value for k of 10. In practice, the value for k should be determined through cross validation.
import timek=10wrong = 0#iterating through every point in test set and applying knnstart_time = time.time()for i in range(testX_factors.shape):prediction = knn_np(trainX_factors, trainY, k, testX_factors[i])if prediction != testY[i]:wrong += 1end_time = time.time()print("Accuracy=", 1-wrong/testX_factors.shape)print("----Runtime:%.0fmin%.0fs----" %((end_time- start_time) // 60, (end_time- start_time) % 60))
as we can see we are able to achieve a decent 81.53% accuracy on the test set with the default euclidean distance metric and at a very reasonable runtime.
running the same exercise again but with the Manhattan distance metric we get a very similar result:
Manhattan distance is generally more robust than euclidean distance as the latter is more sensitive to outliers and scaling issues, however since we’ve applied PCA during the pre-processing step this has helped to center the data and remove some these issues.
In this post I hoped to have demonstrated with a practical example some of the intuition and simplicity between the K-Nearest Neighbor algorithm. The choice of using more sophisticated class of algorithms requires a careful consideration of the trade off between computation costs, complexity and accuracy. This decision should be made with an intimidate understanding of the data in mind as in some cases the correct specification of a simple algorithm may perform on par with its more complex counterparts.
 Zhang H, Berg AC, Maire M, Malik J. SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Conference Paper, 2006.
Murphy KP, Machine Learning: A Probabilistic Perspective, Cambridge, Massachusetts: The MIT Press:2012.