Original article was published on Artificial Intelligence on Medium

# K — MEANS Clustering | Data Science | ML (Part 8)

Learning from the Clusters formed *automatically* in SPACE:

As we can see in the image above, stars group together to form clusters. Several clusters are marked; I tried to show as many as possible for better understanding. The image makes the idea clear: grouping similar things forms a cluster. **K-MEANS** clustering does the same thing with data, forming clusters by grouping similar data-points. In this blog, I try to explain the topic in the easiest way.

First of all, can you guess whether K-Means is supervised or unsupervised learning?

I answered this in the very first line: the clusters are formed automatically. That means the algorithm learns the groups from the data itself, without labelled examples, so it is undoubtedly an unsupervised learning algorithm. Now we are ready to begin learning K-Means, starting with clustering.

# ALGORITHM

Clustering is the process of grouping data based on the patterns observed in it, i.e. forming clusters on the basis of similarity. It is a way of understanding the given data by observing how similar the data-points are.

We are given a dataset of items, each described by values for certain features (like a vector). The task is to categorize those items by forming clusters. The algorithm categorizes the items into k groups of similar items. To measure that similarity, we use the **Euclidean distance**.
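As a small illustration of the Euclidean distance between two items, here is a minimal sketch in plain Python. The function name `euclidean_distance` and the sample points are made up for this example.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two items given as equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("both items must have the same number of features")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical items, each described by two features.
p = (1.0, 2.0)
q = (4.0, 6.0)
print(euclidean_distance(p, q))  # 5.0
```

The smaller this number, the more similar the two items are considered to be.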

# Data-points **within** a **cluster** must be **similar (i.e. near)**, and data-points of **different clusters** must be as **different (i.e. far)** as possible.

To achieve the above, we use the Euclidean distance. First we place a random centroid in each cluster. Then:

- Calculate the **intra-cluster distance.** This is the Euclidean distance between the points of a cluster and the center of that cluster. This distance should be as **small** as possible.
- Calculate the **inter-cluster distance.** This is the Euclidean distance between the centroids of two clusters. This distance should be as **large** as possible.
- Then calculate the **Dunn Index**. It is the ratio min(inter-cluster distance) / max(intra-cluster distance); the higher the value, the more compact and better separated the clusters are.
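The three quantities can be sketched in a few lines of plain Python. This is one simple centroid-based variant, not the only way to define them: it takes the Dunn index as the smallest inter-cluster distance divided by the largest intra-cluster distance, and it assumes at least two clusters that are not all single points. The helper names and the toy clusters are invented for the example.

```python
import math
from itertools import combinations

def centroid(points):
    """Mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def dunn_index(clusters):
    """min(inter-cluster distance) / max(intra-cluster distance)."""
    centroids = [centroid(pts) for pts in clusters]
    # Intra-cluster distance: how far a point lies from its own centroid.
    max_intra = max(
        math.dist(p, c) for pts, c in zip(clusters, centroids) for p in pts
    )
    # Inter-cluster distance: how far apart two centroids are.
    min_inter = min(math.dist(a, b) for a, b in combinations(centroids, 2))
    return min_inter / max_intra

# Hypothetical clusters: centroids at (0, 1) and (10, 1), every point 1 away from its centroid.
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 0.0), (10.0, 2.0)],
]
print(dunn_index(clusters))  # 10.0
```

Moving the two groups closer together, or spreading the points inside a group, lowers the value.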

To achieve the above aim, we follow this algorithm.

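- Choose the number of clusters k.
- Select k points as the initial centroids (for example, at random).
- Assign each data-point to its nearest centroid, then recompute each centroid as the mean of the points assigned to it.
- Repeat the previous step until the centroids stop changing. The Dunn index can then be used to judge the quality of the resulting clusters.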
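For readers who want to see the loop itself, here is a minimal from-scratch sketch of the standard K-Means iteration (often called Lloyd's algorithm): assign every point to its nearest centroid, move each centroid to the mean of its points, and stop when nothing changes. It is a teaching sketch under simple assumptions (small data, points as tuples, random items as starting centroids), and the function name `kmeans` and its parameters are my own.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct items of the data
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)     # the first three points share one label, the last three the other
print(centroids)
```

Which group gets label 0 and which gets label 1 depends on the random start, so only the grouping is meaningful, not the label numbers.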

The “points” mentioned above are called **MEANS**, because each one holds the mean values of the items categorized under it. To initialize these means, we have several options. An intuitive method is to initialize the means at random items in the data set. Another method is to initialize the means at random values between the boundaries of the data set (if for a feature x the items have values in [0,5], we will initialize the means with values for x in [0,5]).
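The two initialization options just described can be sketched as follows. Both function names are hypothetical, and the sample data is chosen so that feature x lies in [0, 5] as in the example above.

```python
import random

def init_from_items(points, k, rng):
    """Option 1: use k distinct items of the data set as the starting means."""
    return rng.sample(points, k)

def init_within_bounds(points, k, rng):
    """Option 2: draw every coordinate uniformly between that feature's min and max."""
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    return [
        tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
        for _ in range(k)
    ]

# Hypothetical items with features (x, y); x ranges over [0, 5].
points = [(0.0, 10.0), (2.5, 30.0), (5.0, 20.0), (1.0, 15.0)]
rng = random.Random(42)
print(init_from_items(points, 2, rng))     # two of the items above
print(init_within_bounds(points, 2, rng))  # x in [0, 5], y in [10, 30]
```

The first option always starts at real data-points; the second can start in empty regions of the feature space, which may leave a cluster without members at first.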

A common way to determine the number of clusters is the **ELBOW Method**.

# The Elbow Method:

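As we increase the value of k, the distortion (the sum of squared distances from each point to its centroid) keeps decreasing. At a particular value of k the decrease slows down sharply, forming the tip of an elbow in the plot. This value of k is usually a **good** choice, since adding more clusters beyond it brings little further improvement.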
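A tiny worked example may make the elbow easier to picture. The sketch below computes the distortion (sum of squared distances to the cluster mean) for hand-picked groupings of four made-up 1-D values; the data, the groupings and the function name `sse` are all assumptions for illustration.

```python
def sse(clusters):
    """Distortion: summed squared distance of every value to its cluster mean."""
    total = 0.0
    for values in clusters:
        mean = sum(values) / len(values)
        total += sum((x - mean) ** 2 for x in values)
    return total

# Hand-picked groupings of the hypothetical 1-D data [1, 2, 9, 10] for k = 1..4.
groupings = {
    1: [[1, 2, 9, 10]],
    2: [[1, 2], [9, 10]],
    3: [[1, 2], [9], [10]],
    4: [[1], [2], [9], [10]],
}
distortion = {k: sse(g) for k, g in groupings.items()}
print(distortion)  # {1: 65.0, 2: 1.0, 3: 0.5, 4: 0.0}
```

The distortion collapses from 65.0 to 1.0 when going from one cluster to two and barely changes afterwards, so the elbow sits at k = 2 for this data.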

# Implementation of K-Means in Python

# Important NOTE: K-Means is a distance-based algorithm, so features with very different magnitudes can create a problem: the largest-valued features dominate the distance. So let’s first bring all the variables to the same magnitude.
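To see what "the same magnitude" means in practice, here is a small sketch of standardization done by hand: subtract the feature's mean and divide by its standard deviation. The features `income` and `age` and their values are invented, and the sketch assumes a feature is not constant (otherwise the standard deviation is zero). It uses the population standard deviation, which as far as I know is also what scikit-learn's `StandardScaler` uses.

```python
import statistics

def standardize(column):
    """Rescale one feature to zero mean and unit variance (z-scores)."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation
    return [(x - mean) / std for x in column]

# Two hypothetical features on very different scales.
income = [20000.0, 40000.0, 60000.0]
age = [20.0, 40.0, 60.0]

print(standardize(income))
print(standardize(age))
```

Before scaling, a difference in `income` would swamp any difference in `age`; after scaling, both features span the same range and contribute equally to the Euclidean distance.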

    # `data` is assumed to be the numeric dataset (e.g. a pandas DataFrame) loaded earlier
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # standardizing the data
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(data)

    # defining the kmeans function with initialization as k-means++
    kmeans = KMeans(n_clusters=2, init='k-means++')

    # fitting the k means algorithm on scaled data
    kmeans.fit(data_scaled)

    # ELBOW PLOTTING
    # fitting multiple k-means algorithms and storing the values in an empty list
    SSE = []
    for cluster in range(1, 20):
        kmeans = KMeans(n_clusters=cluster, init='k-means++')
        kmeans.fit(data_scaled)
        SSE.append(kmeans.inertia_)

    # converting the results into a dataframe and plotting them
    frame = pd.DataFrame({'Cluster': range(1, 20), 'SSE': SSE})
    plt.figure(figsize=(12, 6))
    plt.plot(frame['Cluster'], frame['SSE'], marker='o')
    plt.xlabel('Number of clusters')
    plt.ylabel('Inertia')
    plt.show()

    # k means using 5 clusters and k-means++ initialization
    kmeans = KMeans(n_clusters=5, init='k-means++')
    kmeans.fit(data_scaled)
    pred = kmeans.predict(data_scaled)

    # number of data-points assigned to each cluster
    frame = pd.DataFrame(data_scaled)
    frame['cluster'] = pred
    frame['cluster'].value_counts()

# K-Means Advantages & Disadvantages

Advantages:

- Easy to understand
- Works well for large data-sets
- Adapts to changes in the data-set

Disadvantages:

- The value of k must be chosen in advance, and a wrong choice gives poor clusters
- Outliers affect the centroids
- Scaling issues as the number of dimensions increases (can be solved with **PCA**)
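The effect of an outlier on a centroid is easy to check by hand, because a centroid is just a mean. In this sketch the cluster and the outlier are made-up values and `mean_point` is a hypothetical helper.

```python
def mean_point(points):
    """Centroid of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

# A tight hypothetical cluster, and the same cluster with one far-away point added.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
outlier = (50.0, 50.0)

print(mean_point(cluster))              # (1.5, 1.5)
print(mean_point(cluster + [outlier]))  # (11.2, 11.2)
```

A single extreme point drags the centroid far away from where the four ordinary points actually lie.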

# Applications of K-Means

- Document Classification
- Customer Segmentation
- Insurance Fraud Detection
- Automatic clustering of IT Alerts
- Market Research

# Summary

In this blog, we discussed a basic unsupervised machine learning algorithm. I tried to implement it in Python and explain it in the easiest way. We also saw a few pros, cons and applications of this algorithm in the real world.

I hope this blog post helped in understanding K-Means. **Comment down your thoughts, feedback or suggestions** below, if any. Make sure you **follow** me for similar content. Goodbye. Have a great day.