Original article was published by Bharath K on Artificial Intelligence on Medium
There are lots of clustering algorithms but today we will be focusing mainly of the three most popular and important types of clustering algorithms. These clustering algorithms are —
- Centroid-based clustering (K-means Clustering)
- Connectivity-based clustering (hierarchical clustering)
- Density-based clustering(DBSCAN)
We will analyze each of these algorithms in detail and understand how they exactly work. We will also look at the advantages and limitations of these algorithms. So, without further ado, let us get started!
1. K-means Clustering:
The K-means clustering algorithm is one of the most popular methods for performing clustering analysis. In K-means clustering, the hyperparameter used for evaluation is ‘K.’
‘K’ = No. of clusters.
The number of clusters will determine the type of clustering classification that will be performed. In the above figure, we can assume that the value of K chosen is 3. The value of K will determine the number of centers that will be considered for the process of segregation and grouping.
The “correct” value of the hyperparameter K can be determined with methods like grid search or random search. Hence, we can say that the main objective of K-means is to find the best centroids and group the clusters accordingly in a suitable manner for the particular dataset.
Steps to follow for K-means clustering —
1. Initialization: Randomly picking the n points from the dataset and initialize them. Choose the K value as well.
2. Assignment: For each selected point find the nearest centroid values.
3. Update: Re-compute the centroid values and update them accordingly.
4. Repetition: Repeat the step 2 and step 3 until you reach convergence.
5. Termination: Upon reaching convergence terminate the program.
The above image is a representation for the convergence procedure of K-means.
- Relatively simple to implement and execute.
- Effective and efficient with larger datasets.
- Guarantees convergence at some point.
- Choosing the best hyperparameter ‘k’ can be difficult.
- Issues in computing higher dimensional data.
- It is more prone to error in the presence of outliers or noise.
2. Hierarchical Clustering:
Connectivity-based clustering, also known as hierarchical clustering, is based on the core idea of objects being more related to nearby objects than to objects farther away.
The concept of hierarchical clustering is basically grouping of similar things from either bigger chunks to smaller chunks or vice versa. This point can be understood better when we look at the two types of hierarchical clustering methods which are as follows:
- Agglomerative Clustering — This is a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Divisive Clustering — This is a “top-down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
The agglomerative clustering is usually preferred over the divisive clustering. So, we will look further into the analysis of agglomerative cluster rather than divisive.
The above dendrogram is a representation of how hierarchical clustering and specifically how agglomerative clustering works.
Let us understand the steps involved in the process of agglomerative clustering —
1. Compute the proximity matrix which is basically a matrix containing the closest distance of each of the similarities i.e., the inter/intra cluster distances.
2. Consider each point to be a cluster.
3. Repeat the following step for every point.
4. Merge the two closest points.
5. Update the proximity matrix.
6. Continue until only a single cluster remains.
- Easier to judge the number of clusters looking at the dendrograms.
- Overall easy to implement.
- Very sensitive to outliers.
- Not suitable for larger datasets.
- It has a high time complexity which not be ideal for some applications.
DBSCAN stands for Density-based spatial clustering of applications with noise and is gaining a rising popularity. In DBSCAN, we create clusters that for the areas of higher density than the remainder of the data set. Objects in sparse areas that are required to separate clusters are usually considered to be noise and border points.
DBSCAN utilizes two important hyperparameters in minimum points (or MinPts) and epsilon ε. Before we look at the steps on how to solve these problems, let us analyze how these hyperparameters exactly work.
- MinPts: The larger the data set, the larger the value of minPts should be chosen. minPts must be chosen at least 3.
- Epsilon ‘ϵ’: The value for ϵ can then be chosen by using a k-distance graph, plotting the distance to the k = minPts nearest neighbor. Good values of ϵ are where the plot shows a strong bend like an elbow shape.
For the implementation of the DBSCAN algorithm, the stepwise procedure we need to follow is as described below:
1. Find the points in the ε (eps) neighborhood of every point, and identify the core points with more than minPts neighbors.
2. Label each of the selected points as a core point or border point or noise point. This initial labeling is an important step for the overall functionality.
3. Remove all the noise points from your data because sparse regions do not belong to any clusters.
4. for each core point 'p' not assigned to a cluster create a loop as follows -
a. Create a new cluster with the point 'p'.
b. Add all points that are density connected to into this newly created cluster.
5. Assign each non-core point to a nearby cluster if the cluster is an ε (eps) neighbor, otherwise assign it to noise. Repeat the procedure until convergence is reached.
- Density based clustering methods are highly resistant to noise and outliers.
- It usually works for any shape unlike the two previously mentioned algorithms which have trouble dealing with non-globular (non-convex) shapes.
- They do not have specific requirement for number of clusters to be set unlike K-means.
- DBSCAN cannot cluster data sets well with large differences in densities.
- It is not entirely deterministic and it is prone to errors on a high-dimensionality dataset.
- Since it has two variable hyperparameters, it is susceptible to changes in them.