What is Clustering?

Original article was published on Deep Learning on Medium

Introduction: Have you ever thought of arranging data by similar features without having the actual labels/classes/targets for that data? This article explains the idea, walks through the main types of approach, and lists practical use cases.

Source: heyerlein via unsplash

Clustering is a family of machine learning techniques rather than a single algorithm. It falls under unsupervised machine learning.

Unsupervised Machine Learning

This is the category of algorithms in which we have only the features of the data, i.e. labels/classes/targets are not available.

In other words, for any particular record in the data, we do not know which group it should belong to. This is the defining characteristic of this category of machine learning algorithms.

Clustering Explanation

Since clustering falls under unsupervised machine learning, we do not have target classes for our data.

These algorithms work toward the goal of forming a few groups (clusters) from the data, based on the similarity between records or on patterns found in the data.

In a nutshell, clustering produces several clusters such that the data within each cluster is as similar as possible, and the data in different clusters is as dissimilar as possible.

From the above explanation, it can be concluded that the ultimate goal of clustering is to minimize the intra-cluster distance and maximize the inter-cluster distance.
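
To make the two distances concrete, here is a minimal sketch that measures them for a given grouping. The function names, the toy points, and the choice of Euclidean distance averaged over pairs are all assumptions made for illustration; other definitions of intra- and inter-cluster distance are equally valid.

```python
import math
from itertools import combinations

def mean_intra_cluster_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    distances = [
        math.dist(p, q)
        for points in clusters
        for p, q in combinations(points, 2)
    ]
    return sum(distances) / len(distances)

def mean_inter_cluster_distance(clusters):
    """Average distance between pairs of points that lie in different clusters."""
    distances = [
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    ]
    return sum(distances) / len(distances)

# Hypothetical toy data: two well-separated groups of 2-D points.
good_clustering = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
]
# The same six points split badly, mixing the two groups.
bad_clustering = [
    [(0.0, 0.0), (10.0, 10.0), (1.0, 0.0)],
    [(0.0, 1.0), (10.0, 11.0), (11.0, 10.0)],
]

for name, clustering in [("good", good_clustering), ("bad", bad_clustering)]:
    intra = mean_intra_cluster_distance(clustering)
    inter = mean_inter_cluster_distance(clustering)
    print(f"{name}: intra={intra:.2f}, inter={inter:.2f}")
```

The good grouping should show a small intra-cluster distance and a large inter-cluster distance; the bad grouping pulls the two numbers toward each other.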

Types of Clustering

  1. Partition-based Clustering
  2. Hierarchical Clustering
  3. Density-based Clustering

Partition-based Clustering

  • These algorithms tend to generate sphere-like clusters.
  • They are relatively efficient.
  • They are used for medium or large datasets.
  • Examples: K-Means, Fuzzy C-Means, K-Medians.
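
As an illustration of the partition-based idea, the following is a minimal sketch of K-Means (Lloyd's algorithm) using only the Python standard library. The function name `k_means`, its parameters, the random initialisation from data points, and the toy data are assumptions for this example; it is not a production implementation.

```python
import math
import random

def k_means(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate between assigning points and moving centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k data points chosen at random
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster, so we have converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical toy data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Because each point is assigned to its nearest centroid, the resulting clusters are roughly spherical regions around those centroids. Which group receives label 0 depends on the random initialisation.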

Hierarchical Clustering

  • These algorithms build a tree of clusters (a dendrogram), grouping similar data at each level.
  • They are very intuitive.
  • They are generally well suited to small datasets.
  • Examples: Agglomerative, Divisive.
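
The agglomerative (bottom-up) variant can be sketched in a few lines: start with every point as its own cluster and repeatedly merge the two closest clusters. The sketch below assumes single linkage (distance between the two closest members) and Euclidean distance; the function names and toy data are hypothetical, and the recorded merge history stands in for the tree of clusters.

```python
import math

def single_linkage_distance(cluster_a, cluster_b):
    """Single linkage: distance between the two closest members."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

def agglomerative(points, n_clusters):
    """Bottom-up merging; returns the final clusters and the merge history."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    merges = []
    while len(clusters) > n_clusters:
        # Find the pair of clusters that are closest to each other.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda pair: single_linkage_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # j > i, so removing j leaves index i unchanged.
        del clusters[j]
        clusters[i] = merged
    return clusters, merges

# Hypothetical toy data: two tight pairs and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
clusters, merges = agglomerative(points, n_clusters=3)
print(clusters)
```

This naive version rescans every pair of clusters on each merge, which is one reason hierarchical methods are usually recommended for small datasets.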

Density-based Clustering

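  • They produce clusters of arbitrary shape.
  • They are excellent to use when there is noise in the dataset, since points in low-density regions can be treated as outliers instead of being forced into a cluster.
  • Example: the DBSCAN algorithm.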
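
The density-based idea can be sketched with a simplified DBSCAN written in plain Python. The parameter names `eps` (neighbourhood radius) and `min_samples` (neighbours needed to count as a dense "core" point), the `NOISE` constant, and the toy data are assumptions chosen for this example; the brute-force neighbour search is for clarity, not speed.

```python
import math

NOISE = -1

def dbscan(points, eps, min_samples):
    """Label each point with a cluster index, or NOISE (-1) for outliers."""
    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_samples:  # j is a core point, keep expanding
                queue.extend(j_neighbours)
    return labels

# Hypothetical toy data: two dense squares and one isolated point.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
    (50.0, 50.0),
]
print(dbscan(points, eps=1.5, min_samples=3))
# → [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point is labelled as noise rather than attached to the nearest cluster, which is why density-based methods cope well with noisy data.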

Use Cases of Clustering

Retail/Marketing:

  • Identifying buying patterns of customers.
  • Recommending new movies to customers.
  • Recommending new gadgets to customers, etc.

Banking:

  • Identifying segments of customers (e.g. loyal, likely to churn, etc.).
  • Fraud Detection, etc.

Insurance:

  • Fraud detection in claim analysis, etc.

Publication:

  • Automatic categorisation of news based on its content.
  • Recommending similar news articles.
  • Identifying segments of readers (e.g. loyal, likely to churn, etc.).

Medicine:

  • Characterising patient behaviour with respect to the effect of a medicine.
  • Identifying similar drugs by clustering them.

Biology:

  • Clustering genetic markers to identify family ties/family generations.
  • Identifying a particular species.

Clustering VS Classification

The most significant difference between them is that in classification each record has a corresponding label, whereas in clustering we do not have labels at all.