Source: Artificial Intelligence on Medium

“The unsupervised learning is the way most people will learn in the future. You have this model of how the world works in your head and you’re refining it to predict what you think is going to happen in the future.” —

Mark Zuckerberg

Unsupervised learning forms a very niche part of Machine Learning, simply because most tasks have a label on them (*supervised*). However, in cases where we lack these ‘*labelled*’ data, clustering methods can help us to find patterns by making inferences on the dataset. Common areas in which clustering is applied includes customer segmentation (for ad-targeting), population analysis (understanding demographics) and also anomaly detection.

Some people view unsupervised learning as a ‘*grey area*’ in Machine Learning because it can sometimes be hard to interpret the types of clusters that the algorithms output, because there is no single ‘metric’ that can tell us how meaningful those predicted clusters are.

# K-Means

K-means clustering is a *distance-based* clustering method for finding clusters and cluster centers in a set of unlabelled data. This is a fairly tried and tested method and can be implemented easily using *sci-kit learn*.

The goal of K-Means is fairly straightforward — to group points that are ‘similar’ (based on distance) together. This is done so by regarding the center of data points as the center of the corresponding clusters (centroids).

The core idea is as such: Update the cluster centroid by iterative computation and the iterative process will be continued until some convergence criteria is met.

## How it works:

- To start, one chooses the desired number of clusters, say
*K*. *K*random points are selected from the dataset to be the cluster centroids.- At each step, select an unassigned data point and check which centroid is closest to it. The definition of closeness is usually measured by a distance metric, normally in the form of euclidean distance.
- After each point is allocated a cluster, recalculate the new cluster centroid using the
*mean*of the data points within the same cluster. This serves to optimise the location of the cluster centroid. - Repeat steps 3 and 4 until either
**a)**the centroids have stabilised (do not pass a threshold) or**b)**the desired number of iterations have been achieved.

Note that the centroids are in fact not actual data points in the dataset.

## An example using the Iris Dataset

For those who are not familiar with this dataset, it contains data on variations of *iris* flowers, with features including `sepal_length`

, `sepal_width`

, `petal_length`

and `petal_width`

. Since this is an unsupervised problem, I will not reveal how many different types of irises’ there are (*to find out more about the dataset, please visit this **link*).

We will use the `iris`

dataset to showcase how K-Means work. Let’s first import the data and visualise it.

import numpy as np

from sklearn import datasets, cluster

import matplotlib.pyplot as plt

%matplotlib inlineiris = datasets.load_iris()

x = iris.data[:, [1,3]] # takes the 2nd and 4th column of the data

plt.scatter(x[:,0], x[:,1])

plt.xlabel('Sepal width')

plt.ylabel('Petal width')

plt.show()

From the above visualisation, it seems like there might be two distinct clusters. Now, let’s use the K-Means algorithm by sci-kit learn.

from sklearn.cluster import KMeansx_ = iris.data # Note that in we use all 4 columns here

random_state = 2020kmeans = KMeans(n_clusters=2, init='k-means++', random_state=random_state)

kmeans.fit_transform(x_)

label = kmeans.labels_

color = ['red','green','blue']plt.figure(figsize=(8, 6))

plt.scatter(x_fit[:,0], x_fit[:,1], c=[color[i] for i in label], marker='+')

plt.title('Kmeans on Iris dataset with 2 clusters', fontweight='bold', fontsize=14)

plt.show()

It looks like segregation using `n_clusters = 2`

yields pretty decent results, except for a small number of data points that may be clustered wrongly (green ones in the centre that should probably be red).

From the above, we determined the number of clusters `n_clusters`

by visually inspecting the data (using two features). It is clear that determining the optimal `n_clusters`

is a very important step in clustering, especially K-Means where this number has to be specified in the algorithm beforehand. By using two features to visually inspect the data is clearly not sufficient as there may be additional clusters if our data is high-dimension and our human judgment may fail miserably (especially with anything more 3-D).

## How to determine the optimal number of clusters

**Method 1:** Elbow Plots

This method computes the total **intra-cluster sum of squared errors** (SSE) using sklearn’s `inertia_`

method. The point at which the elbow kinks (gradient sharply changes) suggests that there is diminishing returns and we would generally want to take the `n_clusters`

before this occurs. Intuitively, we want a number `k`

that minimises the SSE, but at the same time an increasing `k`

will naturally reduce the SSE to zero (where every data point is its own cluster). In the below example, `n_clusters = 3`

seems to be the most optimal.

elbow = []

kmax = 10for k in range(2, kmax+1):

kmeans = KMeans(n_clusters = k).fit(x_)

elbow.append(kmeans.inertia_)

plt.figure(figsize=(8,6))

plt.plot(np.arange(2,11), elbow)

plt.xlabel('Number of Clusters')

plt.ylabel('Inertia (Intra cluster sum of squares)')

plt.title('Inertia vs n_clusters to determine optimal cluster size', fontweight='bold')

plt.show()

However, this method may not work well when the data is not very clustered. Hence, it may be better to do a cross-check with our next method.

**Method 2:** Silhouette Plots

The Silhouette Coefficient is essentially what we call the *inter-cluster* distance. In general, clustering aims to minimise intra-cluster distance while maximising inter-cluster distance. Using both methods will ensure that `n_clusters`

is more optimally defined.

from sklearn.metrics import silhouette_score

sil = []

kmax = 10for k in range(2, kmax+1):

kmeans = KMeans(n_clusters = k).fit(x_)

labels = kmeans.labels_

sil.append(silhouette_score(x_, labels, metric = 'euclidean'))

plt.figure(figsize=(8,6))

plt.plot(np.arange(2,11), elbow)

plt.xlabel('Number of Clusters')

plt.ylabel('Silhouette Coefficient (Inter-cluster distance)')

plt.title('Silhouette vs n_clusters to determine optimal cluster size', fontweight='bold')

plt.show()

Ideally, we would want the silhouette coefficient to be as high as possible (i.e. maximise inter-cluster distance). In this instance, `n_clusters = 2`

seems to be most optimal.

**Reconciling difference between Elbow and Silhouette**

What should we do since the elbow plot suggests `n_clusters = 3`

while the silhouette plot suggests `n_clusters = 2`

?

**Method 1:** Try to visualise the data as much as we can! In this method, we use Principal Component Analysis (PCA) so that we can do a 3-D plot of the data. Since the topic is on clustering, I will not delve into the details of PCA (let me know in the comments if you guys want it covered!).

from sklearn import decomposition

from mpl_toolkits.mplot3d import Axes3Dx_full = iris.datapca = decomposition.PCA()

xr = pca.fit_transform(x_full)kmeans = cluster.KMeans(n_clusters=3, init='k-means++', random_state=random_state)

kmeans.fit(xr)

label = kmeans.labels_

color = ['red', 'green', 'blue']plt.figure(figsize=(12, 12))

fig = plt.subplot(1, 1, 1, projection='3d')

fig.scatter(xr[:,0], xr[:,1], xr[:,2], c=[color[i] for i in label])

fig.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], c='yellow', s=100)

fig.set_xlabel('Principal Component 1')

fig.set_ylabel('Principal Component 2')

fig.set_zlabel('Principal Component 3')

plt.show()

From the 3-D plot above, it does seem like there might be three different clusters — though not completely certain as the inter-cluster distance is not high.

**Method 2:** Use descriptive statistics to make sense of the clusters

Descriptive statistics involves using measures such as mean, max, min and median to find out how different the clusters are. In unsupervised learning, there isn’t a ‘right’ answer to the problem at hand and most of the time, finding the optimal number of clusters involve looking at the results of these clusters (i.e. compare the descriptive statistics of these different clusters).

(*In this case however, if we want to ‘cheat’, we do know that there is in fact 3 different types of iris variations in the dataset (since it is labelled!)*

## Evaluation of K-Means

*Advantages*:

- Easy to use and easy to understand
- Suitable for datasets where hierarchical relationship amongst clusters can be easily detected
- Relatively low time complexity and high computing efficiency which leads to high scalability

*Disadvantages*:

- Not suitable for non-convex data
- Highly sensitive to initialisation of cluster centroids — it is worth tuning hyperparameters such as
`n_init`

and`init='random'`

to ensure that the clustering result you get is consistent - Highly sensitive to outliers — the
*means*in K-Means means that outliers will skew the cluster centroid quite significantly — other algorithms that use mode or median will be less prone to outliers - Easily drawn into local optimal
- Clustering result is sensitive to number of clusters

## Final Thoughts and Closing Remarks

K-Means is a good model to get started with on any clustering problem as it has the ability to fit on high dimension data. However, it is important to know what kind of distance metric is being used, the sensitivity of the model with regards to outliers and most importantly, the use of domain knowledge to tune the model accordingly.

In the second part of this 3-part series, I will explore a different dataset and use **Agglomerative Clustering** (one of two variants of hierarchical clustering) for our analysis. I will also go through methods to obtain descriptive statistics of the clusters yielded by the model.

*Pro Tip**: Explaining the clusters generated is equally (if not more) important than obtaining the clusters themselves!*