Original article was published by Dario Radečić on Artificial Intelligence on Medium
Encoding cyclical data
One-hot encoding wouldn’t be a wise choice in this case. We’d end up with 23 additional attributes (n - 1), which is terrible for two reasons:
- Massive jump in dimensionality — from 2 to 24
- No connectivity between attributes — hour 23 doesn’t know it’s followed by hour 0
So, what can we do?
Use sine and cosine transformations. Here are the formulas we’ll use:
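As a sketch of what those formulas look like, write x for the hour value and P for the length of the cycle. The symbol P is notation assumed here; for hours numbered 0 to 23, the natural choice is P = 24.

```latex
x_{\sin} = \sin\!\left(\frac{2\pi x}{P}\right), \qquad
x_{\cos} = \cos\!\left(\frac{2\pi x}{P}\right)
```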
Or, in Python:
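    import numpy as np

    # Hours run from 0 to 23, so the cycle length is 24. Dividing by
    # max(last_week['Hour']) (23) would map hour 23 onto the same point as hour 0.
    hours_in_day = 24
    last_week['Sin_Hour'] = np.sin(2 * np.pi * last_week['Hour'] / hours_in_day)
    last_week['Cos_Hour'] = np.cos(2 * np.pi * last_week['Hour'] / hours_in_day)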
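If you need the same encoding for other cyclical columns (day of week, month), it may help to wrap it in a small helper. The following is a minimal sketch, not part of the original article: the function name `add_cyclical_features`, the `period` parameter, and the toy `demo` frame are all hypothetical and stand in for `last_week`.

```python
import numpy as np
import pandas as pd


def add_cyclical_features(df, column, period):
    """Return a copy of df with Sin_<column> and Cos_<column> columns added.

    `period` is the length of the cycle, e.g. 24 for hours numbered 0 to 23.
    """
    angle = 2 * np.pi * df[column] / period
    out = df.copy()
    out[f"Sin_{column}"] = np.sin(angle)
    out[f"Cos_{column}"] = np.cos(angle)
    return out


# Toy stand-in for the article's last_week frame: one day of hourly rows.
demo = pd.DataFrame({"Hour": range(24)})
encoded = add_cyclical_features(demo, "Hour", period=24)
print(encoded.head())
```

Passing the period explicitly, rather than deriving it from the data, keeps the encoding stable even when a sample happens to miss the largest value in the cycle.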
Awesome! Here’s how the last week of data now looks:
These transformations allowed us to represent time data in a more meaningful and compact way. Just take a look at the last two rows. Sine values are almost identical, but still a bit different. The same goes for every following hour, as it now follows a waveform.
That’s great, but why do we need both functions?
Let’s explore the functions graphically before I give you the answer.
Look at one graph at a time. There’s a problem: the values repeat. Take the sine function somewhere between 24 and 48 on the x-axis. If you were to draw a horizontal line, it would intersect the curve at two points within the same day, meaning two different hours share the same sine value. That’s not the behavior we want.
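The ambiguity is easy to check numerically. Here is a small sketch using only the standard library, assuming a cycle length of 24 hours; the helper names `sin_hour` and `cos_hour` are made up for this illustration.

```python
import math

PERIOD = 24  # hours in a day (assumed cycle length)


def sin_hour(hour):
    """Sine component of the cyclical encoding for a given hour."""
    return math.sin(2 * math.pi * hour / PERIOD)


def cos_hour(hour):
    """Cosine component of the cyclical encoding for a given hour."""
    return math.cos(2 * math.pi * hour / PERIOD)


# Hours 2 and 10 share a sine value (roughly 0.5), so sine alone
# cannot tell them apart.
print(sin_hour(2), sin_hour(10))

# Their cosine values have opposite signs, which resolves the ambiguity.
print(cos_hour(2), cos_hour(10))
```

The same holds in reverse: cosine alone confuses hours such as 2 and 22, and the sine column separates them.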
To further prove this point, here’s what happens if we draw a scatter plot of both sine and cosine columns:
That’s right; we get a perfect circle. It only makes sense to represent cyclical data with a circle, don’t you agree?
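You can confirm this without a plot. The sketch below again assumes a 24-hour cycle and uses a hypothetical `encode` helper. It checks that every encoded hour lies at distance 1 from the origin, and that the step from hour 23 back to hour 0 is as short as any other one-hour step.

```python
import math

PERIOD = 24  # assumed cycle length for an hour-of-day feature


def encode(hour):
    """Map an hour to its (sin, cos) pair."""
    angle = 2 * math.pi * hour / PERIOD
    return math.sin(angle), math.cos(angle)


points = [encode(h) for h in range(PERIOD)]

# Every (sin, cos) pair has radius 1, i.e. it sits on the unit circle.
radii = [math.hypot(s, c) for s, c in points]

# The step from hour 23 to hour 0 is as short as the step from hour 0 to hour 1.
wrap_step = math.dist(points[23], points[0])
first_step = math.dist(points[0], points[1])

print(min(radii), max(radii))
print(wrap_step, first_step)
```

This is the connectivity that one-hot encoding lacks: neighboring hours end up close together in the encoded space, including across midnight.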
That’s all you should know. Let’s wrap things up in the next section.