How to Handle Cyclical Data in Machine Learning

Original article was published by Dario Radečić on Artificial Intelligence on Medium

Encoding cyclical data

One-hot encoding wouldn’t be that wise of thing to do in this case. We’d end up with 23 additional attributes (n — 1), which is terrible for two reasons:

  1. Massive jump in dimensionality — from 2 to 24
  2. No connectivity between attributes — hour 23 doesn’t know it’s followed by hour 0

So, what can we do?

Use a sine an cosine transformations. Here are the formulas we’ll use:

Or, in Python:

import numpy as np last_week['Sin_Hour'] = np.sin(2 * np.pi * last_week['Hour'] / max(last_week['Hour'])) last_week['Cos_Hour'] = np.cos(2 * np.pi * last_week['Hour'] / max(last_week['Hour']))

Awesome! Here’s how the last week of data now looks:

These transformations allowed us to represent time data in a more meaningful and compact way. Just take a look at the last two rows. Sine values are almost identical, but still a bit different. The same goes for every following hour, as it now follows a waveform.

That’s great, but why do we need both functions?

Let’s explore the functions graphically before I give you the answer.

Look at one graph at a time. There’s a problem. The values repeat. Just take a look at the sine function, somewhere between 24 and 48, on the x-axis. If you were to draw a straight line, it would intersect with two points for the same day. That’s not the behavior we want.

To further prove this point, here’s what happens if we draw a scatter plot of both sine and cosine columns:

That’s right; we get a perfect cycle. It only makes sense to represent cyclical data with a cycle, don’t you agree?

That’s all you should know. Let’s wrap things up in the next section.