Clustering the DATA — K-MEANS

Original article was published on Artificial Intelligence on Medium


K — MEANS Clustering | Data Science | ML (Part 8)

Learning from the Clusters formed automatically in SPACE:

Stars clusters in Galaxy (Source: Author)

As the image above shows, groups of stars come together to form clusters. Several clusters are marked; I tried to show as many as possible to make the idea clear. The image makes the point that grouping similar things forms a cluster. K-Means clustering does the same thing with data: it forms clusters by grouping similar data points. In this blog, I try to explain the topic in the easiest way I can.

First of all, can you guess whether K-Means is a supervised or an unsupervised learning algorithm?

I answered this in the very first line: the clusters are formed automatically. The algorithm learns the groups from the data itself, without any labels, so it is an unsupervised learning algorithm. Now we are ready to learn K-Means, beginning with clustering.

ALGORITHM

Clustering is the process of grouping data based on the patterns observed in it, i.e. forming clusters on the basis of similarity between data points. It is a useful way of understanding a dataset by looking at how similar its data points are.

We are given a dataset of items, each described by values for certain features (like a vector). The task is to categorize those items by forming clusters. The algorithm will categorize the items into k groups of similar items. To measure that similarity, we will use the Euclidean distance.
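As a quick illustration of the distance measure mentioned above, here is a minimal sketch in NumPy. The function name `euclidean_distance` and the two sample items are made up for this example; they are not part of any library.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors of equal length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Two hypothetical items, each described by two features
p = [1.0, 2.0]
q = [4.0, 6.0]
print(euclidean_distance(p, q))  # 5.0
```

The smaller this number, the more similar the two items are considered to be.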


Data points within a cluster must be as similar (i.e. near) as possible, and data points in different clusters must be as different (i.e. far apart) as possible.

To achieve this, we use the Euclidean distance. First we place a random centroid in each cluster. Then…

  • Calculate the intra-cluster distance. This is the Euclidean distance from the points of a cluster to the center of that cluster. It should be as small as possible.
  • Calculate the inter-cluster distance. This is the Euclidean distance between the centroids of two clusters. It should be as large as possible.
  • Then calculate the Dunn index. It is the ratio min(inter-cluster distance) / max(intra-cluster distance); a higher value means more compact, better-separated clusters.

Distance in view (Source)
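The three quantities above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper name `dunn_index` is made up, the Dunn index is taken as the smallest inter-cluster distance divided by the largest intra-cluster distance, and the intra-cluster distance of a cluster is taken as the distance from its centroid to its farthest point (other conventions, such as the cluster diameter, are also in use).

```python
import numpy as np

def dunn_index(X, labels):
    """Smallest centroid-to-centroid distance / largest point-to-centroid distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in cluster_ids])

    # Intra-cluster distance: farthest point from its own centroid, per cluster
    intra = [np.linalg.norm(X[labels == c] - centroids[i], axis=1).max()
             for i, c in enumerate(cluster_ids)]

    # Inter-cluster distance: distance between every pair of centroids
    inter = [np.linalg.norm(centroids[i] - centroids[j])
             for i in range(len(cluster_ids))
             for j in range(i + 1, len(cluster_ids))]

    return min(inter) / max(intra)

# Hypothetical toy data: two tight groups far apart
X = [[0, 0], [0, 2], [10, 0], [10, 2]]
good = dunn_index(X, [0, 0, 1, 1])  # sensible grouping
bad = dunn_index(X, [0, 1, 0, 1])   # groups mixed together
print(good, bad)
```

The sensible grouping scores much higher than the mixed one, which is the behaviour we want from a cluster-quality measure.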

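To achieve the above aim, we follow this algorithm.

  • Choose the number of clusters k.
  • Select k points at random as the initial centroids, one per cluster.
  • Assign each data point to its nearest centroid, then move each centroid to the mean of the points assigned to it.
  • Repeat the previous step until the assignments stop changing. The Dunn index can then be calculated to judge the quality of the clusters.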
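For readers who want to see the loop itself, here is a minimal from-scratch sketch of the standard k-means iteration (often called Lloyd's algorithm): assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat. The function name `kmeans`, its parameters and the toy data are assumptions made for this example; it is not the scikit-learn implementation.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (a centroid that lost all its points stays where it is)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the clusters are stable
        centroids = new_centroids
    return centroids, labels

# Hypothetical toy data: two well-separated groups
X = [[1, 1], [1.5, 2], [1, 0.5], [8, 8], [9, 9], [8, 9.5]]
centroids, labels = kmeans(X, k=2)
print(labels)
```

Which group gets label 0 and which gets label 1 depends on the random start, but the first three points should share one label and the last three the other.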

The “points” mentioned above are called means, because each one holds the mean values of the items assigned to it. There are many ways to initialize these means. An intuitive method is to initialize them at random items in the dataset. Another is to initialize them at random values within the boundaries of the dataset (if the items have values in [0,5] for a feature x, we initialize the means with values for x in [0,5]).

A common way to choose the number of clusters k is the Elbow Method.
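The two initialization options described above could look like this in NumPy. Both helper names (`init_from_items`, `init_within_bounds`) and the sample data are hypothetical and exist only for this sketch.

```python
import numpy as np

def init_from_items(X, k, rng):
    """Pick k distinct rows of X as the starting means."""
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx]

def init_within_bounds(X, k, rng):
    """Draw k points uniformly between the per-feature min and max of X."""
    low, high = X.min(axis=0), X.max(axis=0)
    return rng.uniform(low, high, size=(k, X.shape[1]))

# Hypothetical data: five items with two features, feature values in [0, 5]
X = np.array([[0.0, 1.0], [5.0, 2.0], [2.5, 4.0], [1.0, 0.0], [4.0, 5.0]])
rng = np.random.default_rng(42)

means_a = init_from_items(X, k=2, rng=rng)
means_b = init_within_bounds(X, k=2, rng=rng)
print(means_a)
print(means_b)
```

The first option always starts on real data points; the second can start anywhere inside the bounding box of the data, including regions that contain no points.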

The Elbow Method:

Abrupt Decrease at k=3 (Source)

As we increase k, the distortion (the sum of squared distances from each point to its centroid) keeps decreasing. At some value of k the decrease slows down abruptly, forming the tip of an elbow in the plot. That value of k is usually a good choice, because adding more clusters beyond it gives little further improvement.

Implementation of K-Means in Python

# Important note: K-Means is a distance-based algorithm, so features with very different magnitudes can distort the clusters. So let’s first bring all the variables to the same magnitude.
# `data` is assumed to be a numeric pandas DataFrame or NumPy array loaded earlier
# standardizing the data

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
# defining the kmeans function with initialization as k-means++
kmeans = KMeans(n_clusters=2, init='k-means++')
# fitting the k means algorithm on scaled data
kmeans.fit(data_scaled)

# ELBOW PLOTTING
# fitting multiple k-means models and storing the inertia values in a list
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

SSE = []
for cluster in range(1,20):
    # n_jobs was removed from KMeans in scikit-learn 1.0, so it is not passed here
    kmeans = KMeans(n_clusters = cluster, init='k-means++')
    kmeans.fit(data_scaled)
    SSE.append(kmeans.inertia_)
# converting the results into a dataframe and plotting them
frame=pd.DataFrame({'Cluster':range(1,20),'SSE':SSE})
plt.figure(figsize=(12,6))
plt.plot(frame['Cluster'], frame['SSE'], marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

# k means using 5 clusters and k-means++ initialization
# n_jobs was removed from KMeans in scikit-learn 1.0, so it is not passed here
kmeans = KMeans(n_clusters = 5, init='k-means++')
kmeans.fit(data_scaled)
pred = kmeans.predict(data_scaled)
frame = pd.DataFrame(data_scaled)
frame['cluster'] = pred
frame['cluster'].value_counts()
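Because the model is fitted on standardized data, its centroids are expressed in standardized units as well. A small sketch of how they could be mapped back to the original units with `StandardScaler.inverse_transform` follows. The toy `data` array and the `random_state` and `n_init` values are assumptions chosen only to make the example reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
data = np.array([[1.0, 100.0], [2.0, 110.0], [1.5, 105.0],
                 [10.0, 900.0], [11.0, 950.0], [10.5, 925.0]])

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
labels = kmeans.fit_predict(data_scaled)

# cluster_centers_ are in standardized units; undo the scaling to read them
centers_original = scaler.inverse_transform(kmeans.cluster_centers_)
print(centers_original)
```

Reading the centroids in the original units usually makes the clusters much easier to interpret, for example as "low" and "high" groups of a feature.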

K-Means Advantages & Disadvantages

Advantages:

  • Easy to understand
  • Works well on large datasets
  • Adapts easily to changes in the dataset

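Disadvantages:

  • The value of k must be chosen in advance, and a poor choice gives poor clusters
  • Outliers pull the centroids away from the bulk of the data
  • Scales poorly as the number of dimensions grows (dimensionality reduction such as PCA can help)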
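The outlier problem listed above is easy to see with numbers. In this small sketch (the points are made up for illustration) a single far-away point drags the centroid of an otherwise tight cluster a long way from where most of the data sits.

```python
import numpy as np

# A tight cluster of three points
cluster = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
# The same cluster with one outlier added
with_outlier = np.vstack([cluster, [[100.0, 100.0]]])

centroid = cluster.mean(axis=0)
centroid_with_outlier = with_outlier.mean(axis=0)
print(centroid)               # [2. 2.]
print(centroid_with_outlier)  # [26.5 26.5]
```

This is why it is often worth checking for outliers, or removing them, before running K-Means.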

Applications of K-Means

  • Document Classification
  • Customer Segmentation
  • Insurance Fraud Detection
  • Automatic clustering of IT Alerts
  • Market Research

Summary

In this blog, we discussed a basic unsupervised machine learning algorithm, K-Means. I implemented it with scikit-learn and tried to explain it in the easiest way. We also saw a few pros, cons and real-world applications of the algorithm.

I hope this blog post helped you understand K-Means. Share your thoughts, feedback or suggestions in the comments below. Make sure you follow me for similar content. Goodbye, and have a great day.