A Complete K Mean Clustering Algorithm From Scratch in Python: Step by Step Guide

Original article was published by Rashida Nasrin Sucky on Artificial Intelligence on Medium


A Complete K Mean Clustering Algorithm From Scratch in Python: Step by Step Guide

Also, How to Use K Mean Clustering Algorithm for Dimensionality Reduction of an Image

What is K mean clustering?

K means clustering is the most popular and widely used unsupervised learning model. It is also called clustering because it works by clustering the data. Unlike supervised learning models, unsupervised models do not use labeled data.

The purpose of this algorithm is not to predict any label. Instead to learn about the dataset better and to label them.

In k mean clustering we cluster the dataset into different groups.

Here is how a k mean clustering algorithm works

  1. The first step is to randomly initialize a few points. These points are called cluster centroids.

In the picture above, the red and blue points are cluster centroids.

You can choose any number of cluster centroids. But the number of cluster centroids has to be less than the total number of data points.

2. The second step is the cluster assignment step. In this step, we need to loop through each of the green dots. Depending on if the dot is closer to the red or blue point, we need to assign it to one of the points.

In other words, color the green points either red or blue depending on if it is closer to blue cluster centroid or red cluster centroid.

3. The next step is to move the cluster centroids. Now, we have to take an average of all the red dots that are assigned to the red cluster centroid and move the red cluster centroid to that average. We need to do the same for the blue cluster centroid.

Now, we have new cluster centroids. We have to go back to number 2, the cluster assignment step. We need to rearrange the dots to the new cluster centroids. After that repeat number three.

Numbers 2 and 3 need to be repeated several times until both the cluster centroids are in suitable positions like this picture below.

Look, we just colored all the green dots as per the cluster centroids they are assigned to. The blue cluster centroid is in the center of the blue cluster and the red cluster centroid is in the center of the red cluster.

It will be a lot more clear in a bit when we will develop the algorithm. We will discuss this in more detail.

Develop the Algorithm

The dataset I am going to use for this algorithm is obtained from Andrew Ng’s machine learning course in Coursera. Here is the step by step guide to developing a k mean algorithm:

1. Import the necessary packages and the dataset

import pandas as pd
import numpy as np
df1 = pd.read_excel('dataset.xlsx', sheet_name='ex7data2_X', header=None)
df1.head()

The dataset has only two columns. I took two featured datasets because it will be easy to visualize. The algorithm will be more understandable to you when you will see the visuals. But the same algorithm will work on a multidimensional dataset as well.

Now, I will follow the three steps I discussed above.

2. The first step was to initialize the centroids randomly.

I will initialize three points randomly from the dataset. For that first, I will choose three numbers between 0 and the length of the dataset.

import random
init_centroids = random.sample(range(0, len(df1)), 3)
init_centroids

The output:

[95, 30, 17]

Use these three numbers as indices and get the data points of these indices.

centroids = []
for i in init_centroids:
centroids.append(df1.loc[i])
centroids

Output:

[0    3.907793
1 5.094647
Name: 95, dtype: float64,
0 2.660466
1 5.196238
Name: 30, dtype: float64,
0 3.007089
1 4.678978
Name: 17, dtype: float64]

These three points are our initial centroids.

I will convert them into a two-dimensional array. Because that’s a more known format to me.

centroids = np.array(centroids)

Output:

array([[3.90779317, 5.09464676],
[2.66046572, 5.19623848],
[3.00708934, 4.67897758]])

3. Implement the cluster assignment step.

In this step, we will loop through all the data points in the dataset.

One data point means one row of data

Let’s take a single row of data and understand how to assign that data to a cluster.

We will calculate the distance of that data from all three centroids. And then assign that data point to the centroid that has the smallest distance to it.

As we can see that we have to calculate a lot of distance between two points. Let’s develop a function to calculate distance.

def calc_distance(X1, X2):
return(sum((X1 - X2)**2))**0.5

Develop a function to assign each data point to a centroid. Our ‘centroid’ array has only three values. So we have three indices: 0, 1, 2. We will assign one of these indices to each data point.

def findClosestCentroids(ic, X):
assigned_centroid = []
for i in X:
distance=[]
for j in ic:
distance.append(calc_distance(i, j))
assigned_centroid.append(np.argmin(distance))
return assigned_centroid

This is the function to assign the data points to a cluster. Let’s use this function to calculate the centroids of each data point:

get_centroids = findClosestCentroids(centroids, X)
get_centroids

Part of the output:

[2,
0,
0,
2,
1,
2,
2,
2,
1,
1,
2,
2,
2,
2,
2,
2,
0,

The total output is long. So, I am showing part of the output here. The first centroid in the output is 2, that means it is assigned to the index 2 of the centroid list.

4. The final step is to move the centroids based on the mean of the data points

In this step, we will take an average of all the data points of each centroid and move the centroid to that average.

For example, We will find the average of all the points that are assigned to the centroid at index 2 and move the centroid 2 to the average. Do the same to centroid at index 0 and 1 as well.

Let’s define a function to do this:

def calc_centroids(clusters, X):
new_centroids = []
new_df = pd.concat([pd.DataFrame(X), pd.DataFrame(clusters, columns=['cluster'])],
axis=1)
for c in set(new_df['cluster']):
current_cluster = new_df[new_df['cluster'] == c][new_df.columns[:-1]]
cluster_mean = current_cluster.mean(axis=0)
new_centroids.append(cluster_mean)
return new_centroids

These are all the functions we needed to develop.

As I discussed before we need to repeat this process of cluster assignment and moving centroids a few times till the centroids are in a suitable position.

For this problem, I choose to repeat the process 10 times. I will keep plotting the centroids and the data after each iteration to show you visually how it works.

for i in range(10):
get_centroids = findClosestCentroids(centroids, X)
centroids = calc_centroids(get_centroids, X)
#print(centroids)
plt.figure()
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1], color='black')
plt.scatter(X[:, 0], X[:, 1], alpha=0.1)
plt.show()

After five iterations, the centroids were set to their optimum position. So they were not changing positions after that.

I suggest, please run all the codes above learn it well before you try dimensionality reduction.

Otherwise, you may feel overwhelmed! Also, I will move a bit faster now, as we already explained the algorithm in detail.

Dimensional Reduction

I want to explain at least one use case of this algorithm. One very useful use case is dimensionality reduction.

Think of an image. There might be so many different pixels in an image. In any computer vision problem, if we can reduce the dimensions of the picture, it will be a lot faster for a device to read that picture! isn’t it?

We can use the algorithm we just developed to reduce the dimensions of pictures.

I am going to use the picture of a frog to demonstrate this:

Image By Author

I have this picture uploaded in the same folder as my notebook. Let’s import this:

import cv2
im = cv2.imread('frog.png')
im

Output:

array([[[  2,  57,  20],
[ 2, 57, 20],
[ 2, 57, 21],
...,
[ 0, 5, 3],
[ 8, 12, 11],
[ 91, 94, 93]], [[ 2, 56, 20],
[ 1, 54, 20],
[ 1, 56, 19],
...,
[ 0, 2, 1],
[ 7, 9, 8],
[ 91, 92, 91]], [[ 2, 55, 20],
[ 2, 53, 19],
[ 1, 54, 18],
...,
[ 2, 4, 2],
[ 8, 11, 9],
[ 91, 93, 91]], ..., [[ 6, 76, 27],
[ 6, 77, 26],
[ 6, 78, 28],
...,
[ 6, 55, 18],
[ 13, 61, 25],
[ 94, 125, 102]], [[ 9, 79, 31],
[ 11, 81, 33],
[ 12, 82, 32],
...,
[ 6, 56, 19],
[ 14, 61, 27],
[ 96, 126, 103]], [[ 43, 103, 63],
[ 44, 107, 66],
[ 46, 106, 66],
...,
[ 37, 81, 50],
[ 47, 88, 59],
[118, 145, 126]]], dtype=uint8)

Check the shape of the array,

im.sgape

Output:

(155, 201, 3)

I will divide the whole array by 255 to make all the values from 0 to 1.

Then reshape it to 155*201 x 3 to make it a two-dimensional array. Because we developed all the functions before for two-dimensional arrays.

im = (im/255).reshape(155*201, 3)

As you can see above, so many different pixel values. We want to reduce it and keep only 10-pixel values.

Let’s initialize 10 random indexes,

random_index = random.sample(range(0, len(im)), 10)

Now, find the centroids as we did in the previous example:

centroids = []
for i in random_index:
centroids.append(im[i])
centroids = np.array(centroids)

Output:

array([[0.00392157, 0.21176471, 0.06666667],
[0.03529412, 0.2627451 , 0.09803922],
[0.29411765, 0.3254902 , 0.26666667],
[0.00784314, 0.18431373, 0.05882353],
[0.29019608, 0.49411765, 0.28235294],
[0.5254902 , 0.61176471, 0.48627451],
[0.04313725, 0.23921569, 0.09803922],
[0.00392157, 0.23529412, 0.0745098 ],
[0.00392157, 0.20392157, 0.04705882],
[0.22352941, 0.48235294, 0.40784314]])

Now I will convert ‘im’ to an array as well,

im = np.array(im)

The data is ready. Now we can proceed to our clustering process. But this time, I will not make the visualization. Because the data is not two-dimensional anymore. So, visualization is not easy.

for i in range(20):
get_centroids = findClosestCentroids(centroids, im)
centroids = calc_centroids(get_centroids, im)

We got the updated centroids now.

centroids

Output:

[0    0.017726
1 0.227360
2 0.084389
dtype: float64,
0 0.119791
1 0.385882
2 0.247633
dtype: float64,
0 0.155117
1 0.492051
2 0.331497
dtype: float64,
0 0.006217
1 0.048596
2 0.019410
dtype: float64,
0 0.258289
1 0.553290
2 0.406759
dtype: float64,
0 0.728167
1 0.764610
2 0.689944
dtype: float64,
0 0.073519
1 0.318513
2 0.170943
dtype: float64,
0 0.035116
1 0.273665
2 0.114766
dtype: float64,
0 0.010810
1 0.144621
2 0.053192
dtype: float64,
0 0.444197
1 0.617780
2 0.513234
dtype: float64]

This is the final step. We will only keep these 10 points.

If you print get_centroids as well you will see the cluster assignment.

Now, we want to loop through the whole array ‘im’ and change the data to its corresponding cluster centroid value. That way we will only have these centroids values.

Instead of changing the original array, I want to make a copy and make the change there.

im_recovered = im.copy()
for i in range(len(im)):
im_recovered[i] = centroids[get_centroids[i]]

As you remember, we changed the dimension of the image, in the beginning, to make it a two-dimensional array. We need to change it to its original shape now.

im_recovered = im_recovered.reshape(155, 201, 3)

Here, I am going to plot the original and the reduced image side by side to show you the difference:

im1 = cv2.imread('frog.png')
import matplotlib.image as mpimg
fig,ax = plt.subplots(1,2)
ax[0].imshow(im1)
ax[1].imshow(im_recovered)
Image by Author

Look, we reduced the dimension of the image so significantly. Still, it looks like a frog! But it will be far faster for a computer to read!

Conclusion

In this article, I explained how a k means clustering works and how to develop a k mean clustering algorithm from scratch. I also explained, how to use this algorithm to reduce the dimension of an image. Please try this with a different image.

Here is the link to the dataset I used in this article.

Here is the GitHub link to the complete code:

Reading Recommendation: