Source: Deep Learning on Medium
Dispatches from the Nth Dimension
The Curse of Dimensionality - Part III of III
In this series, we’ve done a lot of dimension jumping, reducing and increasing dimensions as needed, always aware that lurking somewhere just in the shadows was a phenomenon known as the curse of dimensionality.
In short, the curse says that as the number of features (dimensions) increases, the amount of data we need to generalize accurately grows exponentially. Every time we add a new feature, we are adding space, and we need to account for that space. In our bathtub example, this was not too much of an issue, as we were only adding one new dimension. But even here, think about the volume of space we needed to sort our balls. It was quite a bit larger than our simple line of sorted balls.
Look at the following:
In the first line, we have 10 data points. Jump up a dimension and we are at 100; another dimension, and we are at 1000, and so on. Some models have hundreds of features, so you can see that this gets out of hand pretty quickly. And these points are not necessarily ‘useful’ information, just space that has to be filled. Suddenly a simple nearest neighbor algorithm has to deal with a tremendous amount of sparsity.
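To put rough numbers on this, here is a small Python sketch. It assumes, purely for illustration, that we want 10 samples along every axis and that points are scattered uniformly at random inside a unit cube. The function names are made up for this example and do not come from any library.

```python
import math
import random


def points_needed(points_per_axis: int, n_dims: int) -> int:
    """Grid points needed to keep `points_per_axis` samples along every axis."""
    return points_per_axis ** n_dims


def mean_nearest_neighbor_distance(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Average distance from each random point to its closest other point."""
    rng = random.Random(seed)
    # Scatter points uniformly in the unit cube [0, 1]^n_dims (an assumption).
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        # Brute-force search over every other point.
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points


# Same density along each axis, more and more points required.
for d in (1, 2, 3, 10):
    print(f"{d:>2} dimensions -> {points_needed(10, d):,} points")

# Same number of points, more and more empty space between them.
for d in (1, 2, 10, 50):
    gap = mean_nearest_neighbor_distance(100, d)
    print(f"{d:>2} dimensions -> mean nearest neighbor distance {gap:.3f}")
```

The first loop reproduces the 10, 100, 1000 progression. The second keeps the number of points fixed at 100 and reports how far away the average nearest neighbor sits as dimensions are added, which is the sparsity a nearest neighbor algorithm runs into.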
It is very tempting to throw a whole mess of features at a model. After all, if one is good, then twenty have to be twenty times better. But this only leads to a whole mess of problems, including overfitted models.
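A toy experiment can make this concrete. The sketch below is only an illustration under made-up assumptions: a synthetic two-class data set with one genuinely informative feature, a varying number of pure-noise features, and a brute-force 1-nearest-neighbor classifier written from scratch. None of the names or numbers come from the earlier parts of this series.

```python
import math
import random


def make_dataset(n_samples, n_noise_features, rng):
    """One informative feature (class 0 near 0.0, class 1 near 1.0) plus noise features."""
    data = []
    for i in range(n_samples):
        label = i % 2  # alternate classes so both are equally represented
        informative = rng.gauss(float(label), 0.2)
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise_features)]
        data.append(([informative] + noise, label))
    return data


def one_nn_accuracy(train, test):
    """Fraction of test points whose closest training point has the same label."""
    correct = 0
    for x, y in test:
        _, predicted = min((math.dist(x, tx), ty) for tx, ty in train)
        correct += predicted == y
    return correct / len(test)


def accuracy_with_noise(n_noise_features, seed=0):
    """Test accuracy of 1-nearest-neighbor with the given number of noise features."""
    rng = random.Random(seed)
    train = make_dataset(100, n_noise_features, rng)
    test = make_dataset(200, n_noise_features, rng)
    return one_nn_accuracy(train, test)


for k in (0, 5, 20):
    print(f"{k:>2} noise features -> test accuracy {accuracy_with_noise(k):.2f}")
```

With only the informative feature, the classifier has an easy job. As noise features pile up, distances are dominated by the noise, and the nearest neighbor is increasingly just a random point the model has memorized rather than a genuinely similar one.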
So how can you avoid the curse of dimensionality? Think back to our discussion of Alfred Hitchcock, and how we have been trained to identify him through just a few pen strokes.
(Breaking Bad, just in case you didn’t get it)
When playing with dimensionality in modeling, it is best to take a similar approach. More is not better; shoot for just enough. You also have to look at the size of the data set. A good rule of thumb is that the more data you have, the more features you can afford to put in play. The problems associated with large feature spaces are pretty complicated, and since this series is meant as an orientation rather than a comprehensive treatment, you may want to check out the math behind it all. There are tons of great resources out there.
I hope you found this series somewhat helpful. Thank you for your time.