Opening the Black Box of Clustering — KMeans

Source: Artificial Intelligence on Medium

“The unsupervised learning is the way most people will learn in the future. You have this model of how the world works in your head and you’re refining it to predict what you think is going to happen in the future.” — Mark Zuckerberg

Unsupervised learning forms a very niche part of Machine Learning, simply because most tasks have a label on them (supervised). However, in cases where we lack these ‘labelled’ data, clustering methods can help us to find patterns by making inferences on the dataset. Common areas in which clustering is applied includes customer segmentation (for ad-targeting), population analysis (understanding demographics) and also anomaly detection.

Some people view unsupervised learning as a ‘grey area’ in Machine Learning because it can sometimes be hard to interpret the types of clusters that the algorithms output, because there is no single ‘metric’ that can tell us how meaningful those predicted clusters are.


K-means clustering is a distance-based clustering method for finding clusters and cluster centers in a set of unlabelled data. This is a fairly tried and tested method and can be implemented easily using sci-kit learn.

The goal of K-Means is fairly straightforward — to group points that are ‘similar’ (based on distance) together. This is done so by regarding the center of data points as the center of the corresponding clusters (centroids).

The core idea is as such: Update the cluster centroid by iterative computation and the iterative process will be continued until some convergence criteria is met.

How it works:

  1. To start, one chooses the desired number of clusters, say K.
  2. K random points are selected from the dataset to be the cluster centroids.
  3. At each step, select an unassigned data point and check which centroid is closest to it. The definition of closeness is usually measured by a distance metric, normally in the form of euclidean distance.
  4. After each point is allocated a cluster, recalculate the new cluster centroid using the mean of the data points within the same cluster. This serves to optimise the location of the cluster centroid.
  5. Repeat steps 3 and 4 until either a) the centroids have stabilised (do not pass a threshold) or b) the desired number of iterations have been achieved.

Note that the centroids are in fact not actual data points in the dataset.

An example using the Iris Dataset

For those who are not familiar with this dataset, it contains data on variations of iris flowers, with features including sepal_length, sepal_width, petal_length and petal_width. Since this is an unsupervised problem, I will not reveal how many different types of irises’ there are (to find out more about the dataset, please visit this link).

We will use the iris dataset to showcase how K-Means work. Let’s first import the data and visualise it.

import numpy as np
from sklearn import datasets, cluster
import matplotlib.pyplot as plt
%matplotlib inline
iris = datasets.load_iris()
x =[:, [1,3]] # takes the 2nd and 4th column of the data
plt.scatter(x[:,0], x[:,1])
plt.xlabel('Sepal width')
plt.ylabel('Petal width')
Output of above code block

From the above visualisation, it seems like there might be two distinct clusters. Now, let’s use the K-Means algorithm by sci-kit learn.

from sklearn.cluster import KMeansx_ = # Note that in we use all 4 columns here
random_state = 2020
kmeans = KMeans(n_clusters=2, init='k-means++', random_state=random_state)
label = kmeans.labels_
color = ['red','green','blue']
plt.figure(figsize=(8, 6))
plt.scatter(x_fit[:,0], x_fit[:,1], c=[color[i] for i in label], marker='+')
plt.title('Kmeans on Iris dataset with 2 clusters', fontweight='bold', fontsize=14)
K-Means on Iris Dataset

It looks like segregation using n_clusters = 2 yields pretty decent results, except for a small number of data points that may be clustered wrongly (green ones in the centre that should probably be red).

From the above, we determined the number of clusters n_clusters by visually inspecting the data (using two features). It is clear that determining the optimal n_clusters is a very important step in clustering, especially K-Means where this number has to be specified in the algorithm beforehand. By using two features to visually inspect the data is clearly not sufficient as there may be additional clusters if our data is high-dimension and our human judgment may fail miserably (especially with anything more 3-D).

How to determine the optimal number of clusters

Method 1: Elbow Plots

This method computes the total intra-cluster sum of squared errors (SSE) using sklearn’s inertia_ method. The point at which the elbow kinks (gradient sharply changes) suggests that there is diminishing returns and we would generally want to take the n_clusters before this occurs. Intuitively, we want a number k that minimises the SSE, but at the same time an increasing k will naturally reduce the SSE to zero (where every data point is its own cluster). In the below example, n_clusters = 3 seems to be the most optimal.

elbow = []
kmax = 10
for k in range(2, kmax+1):
kmeans = KMeans(n_clusters = k).fit(x_)

plt.plot(np.arange(2,11), elbow)
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia (Intra cluster sum of squares)')
plt.title('Inertia vs n_clusters to determine optimal cluster size', fontweight='bold')
Using Inertia to determine optimal n_clusters

However, this method may not work well when the data is not very clustered. Hence, it may be better to do a cross-check with our next method.

Method 2: Silhouette Plots

The Silhouette Coefficient is essentially what we call the inter-cluster distance. In general, clustering aims to minimise intra-cluster distance while maximising inter-cluster distance. Using both methods will ensure that n_clusters is more optimally defined.

from sklearn.metrics import silhouette_score
sil = []
kmax = 10
for k in range(2, kmax+1):
kmeans = KMeans(n_clusters = k).fit(x_)
labels = kmeans.labels_
sil.append(silhouette_score(x_, labels, metric = 'euclidean'))

plt.plot(np.arange(2,11), elbow)
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Coefficient (Inter-cluster distance)')
plt.title('Silhouette vs n_clusters to determine optimal cluster size', fontweight='bold')
Using Silhouette coefficient to determine optimal n_clusters

Ideally, we would want the silhouette coefficient to be as high as possible (i.e. maximise inter-cluster distance). In this instance, n_clusters = 2 seems to be most optimal.

Reconciling difference between Elbow and Silhouette

What should we do since the elbow plot suggests n_clusters = 3 while the silhouette plot suggests n_clusters = 2 ?

Method 1: Try to visualise the data as much as we can! In this method, we use Principal Component Analysis (PCA) so that we can do a 3-D plot of the data. Since the topic is on clustering, I will not delve into the details of PCA (let me know in the comments if you guys want it covered!).

from sklearn import decomposition
from mpl_toolkits.mplot3d import Axes3D
x_full = iris.datapca = decomposition.PCA()
xr = pca.fit_transform(x_full)
kmeans = cluster.KMeans(n_clusters=3, init='k-means++', random_state=random_state)
label = kmeans.labels_
color = ['red', 'green', 'blue']
plt.figure(figsize=(12, 12))
fig = plt.subplot(1, 1, 1, projection='3d')
fig.scatter(xr[:,0], xr[:,1], xr[:,2], c=[color[i] for i in label])
fig.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], c='yellow', s=100)
fig.set_xlabel('Principal Component 1')
fig.set_ylabel('Principal Component 2')
fig.set_zlabel('Principal Component 3')
3-D plot of Iris Dataset (after PCA)

From the 3-D plot above, it does seem like there might be three different clusters — though not completely certain as the inter-cluster distance is not high.

Method 2: Use descriptive statistics to make sense of the clusters

Descriptive statistics involves using measures such as mean, max, min and median to find out how different the clusters are. In unsupervised learning, there isn’t a ‘right’ answer to the problem at hand and most of the time, finding the optimal number of clusters involve looking at the results of these clusters (i.e. compare the descriptive statistics of these different clusters).

(In this case however, if we want to ‘cheat’, we do know that there is in fact 3 different types of iris variations in the dataset (since it is labelled!)

Evaluation of K-Means


  • Easy to use and easy to understand
  • Suitable for datasets where hierarchical relationship amongst clusters can be easily detected
  • Relatively low time complexity and high computing efficiency which leads to high scalability


  • Not suitable for non-convex data
  • Highly sensitive to initialisation of cluster centroids — it is worth tuning hyperparameters such as n_init and init='random' to ensure that the clustering result you get is consistent
  • Highly sensitive to outliers — the means in K-Means means that outliers will skew the cluster centroid quite significantly — other algorithms that use mode or median will be less prone to outliers
  • Easily drawn into local optimal
  • Clustering result is sensitive to number of clusters

Final Thoughts and Closing Remarks

K-Means is a good model to get started with on any clustering problem as it has the ability to fit on high dimension data. However, it is important to know what kind of distance metric is being used, the sensitivity of the model with regards to outliers and most importantly, the use of domain knowledge to tune the model accordingly.

In the second part of this 3-part series, I will explore a different dataset and use Agglomerative Clustering (one of two variants of hierarchical clustering) for our analysis. I will also go through methods to obtain descriptive statistics of the clusters yielded by the model.

Pro Tip: Explaining the clusters generated is equally (if not more) important than obtaining the clusters themselves!