Original article was published by Satyam Kumar on Artificial Intelligence on Medium
Silhouette Method — Better than Elbow Method to find Optimal Clusters
Deep dive analysis of Silhouette Method to find optimal clusters in k-Means clustering
Hyperparameters are model configuration properties that define the model and remain constant during training. Tuning the hyperparameters changes the design of the model. For K-Means clustering, there are three main hyperparameters to set in order to define the best configuration of the model:
- Initial values of clusters
- Distance measures
- Number of clusters
The initial values of the cluster centers greatly impact the clustering model, and there are various algorithms for initializing them. Distance measures are used to find the distance from each point to the cluster centers, and different distance measures yield different clusters.
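To make the point about distance measures concrete, here is a minimal sketch using only the Python standard library. The two cluster centers and the point are made-up values chosen for illustration, and the helper names (`euclidean`, `manhattan`, `nearest_center`) are hypothetical, not part of any library. It shows the same point being assigned to a different center depending on the distance measure.

```python
import math

def euclidean(p, q):
    """Straight-line (L2) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block (L1) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))

def nearest_center(point, centers, distance):
    """Return the name of the center closest to `point` under `distance`."""
    return min(centers, key=lambda name: distance(point, centers[name]))

# Hypothetical cluster centers and a single point to assign (illustrative values).
centers = {"A": (3.0, 3.0), "B": (0.0, 5.0)}
point = (0.0, 0.0)

by_euclidean = nearest_center(point, centers, euclidean)
by_manhattan = nearest_center(point, centers, manhattan)

print("Euclidean assigns the point to", by_euclidean)
print("Manhattan assigns the point to", by_manhattan)
```

Under Euclidean distance the point is closer to A (about 4.24 versus 5), while under Manhattan distance it is closer to B (5 versus 6). Note that the classic K-Means update step, which takes the mean of each cluster, pairs with Euclidean distance; variants such as k-medians are typically used with Manhattan distance.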
The number of clusters (k) is the most important hyperparameter in K-Means clustering. If we already know the number of clusters to group the data into, then there is no need to tune the value of k. For example, k=10 for the MNIST digit classification dataset.
If we have no idea about the optimal value of k, there are various methods to find it. In this article we will cover two such methods:
- Elbow Method
- Silhouette Method
The Elbow Method is an empirical method to find the optimal number of clusters for a dataset. In this method, we pick a range of candidate values of k, then apply K-Means clustering using each value of k. For each k, find the average distance of each point in a cluster to its centroid, and plot it against k. Pick the value of k where the average distance falls suddenly.
As the number of clusters (k) increases, the average distance decreases. To find the optimal number of clusters (k), observe the plot and find the value of k at which there is a sharp, steep fall in the distance and beyond which the decrease levels off. This is the optimal value of k, where the elbow occurs.
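The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic (four made-up 2-D blobs), K-Means is a small from-scratch Lloyd's iteration with a deterministic farthest-point initialization rather than a library implementation, and all function names (`make_blobs`, `farthest_point_init`, `kmeans`, `average_distance`) are hypothetical helpers written for this sketch.

```python
import math
import random

def make_blobs(blob_centers, points_per_blob, spread, seed):
    """Generate synthetic 2-D points scattered around the given blob centers."""
    rng = random.Random(seed)
    return [(rng.gauss(cx, spread), rng.gauss(cy, spread))
            for cx, cy in blob_centers for _ in range(points_per_blob)]

def farthest_point_init(points, k):
    """Assumed init scheme: first point, then repeatedly the point farthest from chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iterations=50):
    """Minimal Lloyd's algorithm: assign points to the nearest center, then recompute means."""
    centers = farthest_point_init(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Keep the old center if a cluster ends up empty.
        new_centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def average_distance(points, centers):
    """Average distance from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) for c in centers) for p in points) / len(points)

# Synthetic data with four well-separated blobs (illustrative values).
points = make_blobs([(0, 0), (10, 0), (0, 10), (10, 10)],
                    points_per_blob=50, spread=0.7, seed=0)

# Run K-Means for each candidate k and record the average distance.
avg_dist = {k: average_distance(points, kmeans(points, k)) for k in range(1, 8)}

# Crude text plot: look for the k after which the bars stop shrinking quickly.
for k, d in avg_dist.items():
    print(f"k={k}  avg distance={d:.2f}  {'#' * round(d * 5)}")
```

On this synthetic data the average distance should drop steeply up to k=4 and change little afterwards, which is the elbow. In practice, a library such as scikit-learn would normally be used instead of hand-written K-Means; there, the `inertia_` attribute of a fitted `KMeans` model (the sum of squared distances to the nearest centroid) is commonly plotted in place of the average distance.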
In the above plot, there is a sharp fall in average distance at k=2, 3, and 4, which makes it confusing to pick the best value of k. In the plot below, observe the clusters formed for k=2, 3, and 4 along with their average distance.
This data is 2-D, so it’s easy to visualize and pick the best value of k, which is k=4. For higher-dimensional data, where the clusters cannot be inspected visually, we can employ the Silhouette Method to find the best k, which is a better alternative to the Elbow Method.