K-means Clustering on Ordinal Data

Original article was published on Deep Learning on Medium

Using a mapping to uncover the numeric-like behaviors in your data

Image by Reimund Bertrams from Pixabay

K-means clustering. It’s the holy grail of unsupervised learning. And honestly? I understand why… Sure, there’s a bit of an art form to deciding on the number of clusters you should calculate, but by and large it’s borderline magical to sit back and let the algorithm do its thing. Even so, there’s one very important caveat: k-means clustering only works on numerical data. Right?!

Well… Maybe

In general, attempting to broaden k-means into categorical applications is precarious at best. At its core, k-means assigns each point to the cluster whose center (the mean of its points) is closest, so everything hinges on having a meaningful distance between points. How do we define distance amongst categorical variables? How far is an apple from an orange? Are those closer to blueberries or watermelon? In unordered contexts, it just doesn’t make sense to ask these questions.
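To make the fruit problem concrete, here is a minimal Python sketch. It assumes we force the categories into numbers with one-hot encoding, which is just one common workaround and not something the dataset in this article uses. The fruit list and the `euclidean` helper are made up for illustration.

```python
import math

# Hypothetical fruit categories, one-hot encoded: each fruit gets its own axis.
fruits = ["apple", "orange", "blueberry", "watermelon"]
one_hot = {
    fruit: [1.0 if i == j else 0.0 for j in range(len(fruits))]
    for i, fruit in enumerate(fruits)
}


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Distance between every pair of distinct fruits.
pairwise = {
    (a, b): euclidean(one_hot[a], one_hot[b])
    for a in fruits
    for b in fruits
    if a < b
}

for pair, dist in pairwise.items():
    print(pair, round(dist, 3))
```

Every pair comes out exactly the same distance apart (the square root of 2), so under this encoding an apple is no closer to an orange than it is to a watermelon. The numbers exist, but they carry no information about which categories are alike.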

But what if my categorical data is ordered?

Wow, I’m so glad you asked. Just as you’ve astutely pointed out, there absolutely are contexts where it seems like there should be a way to answer these questions.

Recently, I was looking at some data from KFF regarding the statewide social distancing mandates that have been implemented in the United States in response to COVID-19. After some wrangling, here’s what the first few rows looked like:

(Note: If you want to see my whole process for wrangling and analyzing this dataset, you can check out my R code on GitHub.)

Consider, for instance, the Stay at Home Order attribute. This attribute describes whether or not a state has a Stay at Home order, and which populations are affected by the order. Obviously having no order is different from having a statewide order, which is different from having an order that only applies to high risk groups. And here’s the important part: a statewide order seems “closer” to an order affecting only high risk groups than it does to no order at all.
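That intuition about “closeness” can be sketched in a few lines of Python. The level names, the integer codes, and the `ordinal_distance` helper below are all assumptions made for illustration; they are not the actual labels in the KFF data, and they are not lifted from my R code. The sketch also assumes the levels are evenly spaced, which is a judgment call rather than a fact about the data.

```python
# Assumed ordering, from least to most restrictive. The labels and the
# integer codes are illustrative choices, not taken from the KFF dataset.
STAY_AT_HOME_LEVELS = {
    "No order": 0,
    "High-risk groups only": 1,
    "Statewide": 2,
}


def ordinal_distance(a, b, levels=STAY_AT_HOME_LEVELS):
    """Absolute difference between the integer codes of two ordered levels."""
    return abs(levels[a] - levels[b])


print(ordinal_distance("Statewide", "High-risk groups only"))  # 1
print(ordinal_distance("Statewide", "No order"))  # 2
```

With this mapping, a statewide order sits one step from a high-risk-only order and two steps from no order, which matches the ordering described above. Pick different codes (say 0, 1, 5) and the distances change, so the mapping itself is a modeling decision.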

It seems like we should be able to answer the questions required for k-means clustering. The natural ordering of our data lends itself to further analysis. Sure, it’s not actually numeric data, but it behaves in a roughly numeric way, so what’s holding us back from treating it as such? In the ubiquitous words of Hannah Montana, life’s what you make it so let’s make it possible to perform k-means clustering on ordinal data! (No? That wasn’t her quote? Shoot.)