Source: Deep Learning on Medium
Distributed Vector Representation : Simplified
Arguably the most essential feature representation method in Machine Learning
Let’s play a simple game. We have three “people” with us — Micheal, Lucy and Gab. We want to know whose playlist matches Lucy’s more closely: Micheal’s or Gab’s? Let me give you some hints. Lucy loves classic rock, and so does Gab, but Micheal is not a fan. Lucy prefers instrumental versions and so does Micheal, but Gab does not fancy them at all. Lucy does not like pop, Micheal outright hates it, but Gab is definitely a pop fanatic!
Was this representation of the information helpful? Can you conclusively say whose playlist matches Lucy’s more closely? Lucy and Micheal share an interest in two song genres, while Lucy and Gab share only one. But can you be sure that the one genre Lucy and Gab share will not outweigh the other two? (I mean, come on… it’s classic rock!)
What if I told you that I have some magic formula that can represent each person’s musical taste as one single value? Lucy can be represented as -0.1, Gab as -0.3 and Micheal as 0.7. That certainly makes the job easier: if you trust the formula, it is clear that Lucy’s playlist matches Gab’s the most, since -0.3 is the value closest to -0.1. But how did I come up with the formula? Let’s take it from the top…
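As a sketch, the “magic formula” boils down to embedding each person as a single number and comparing distances. The values below are the ones quoted above; how such values would actually be learned comes later in the post:

```python
# Hypothetical 1-D "taste" embeddings from the playlist example above.
embeddings = {"Lucy": -0.1, "Gab": -0.3, "Micheal": 0.7}

def match_distance(a, b):
    """Smaller distance means more similar playlists."""
    return abs(embeddings[a] - embeddings[b])

print(round(match_distance("Lucy", "Gab"), 2))      # 0.2
print(round(match_distance("Lucy", "Micheal"), 2))  # 0.8 -> Gab is the closer match
```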
What is Distributed Representation?
Distributed representation refers to feature creation in which the features may or may not have any obvious relation to the original input, but they carry comparative value, i.e. similar inputs have similar features. Converting inputs into a numerical representation (features) is the first step of any Machine Learning algorithm in every domain.
Why is non-distributed representation not enough?
Non-distributed representation (also called one-hot vector representation) adds a new vector dimension for every new input possibility. Obviously, the more unique inputs there are, the longer the feature vector becomes. This kind of representation has two major flaws:
- The distance (or ‘similarity’) between any two feature vectors is the same. In other words, this representation holds no information about how the inputs relate to each other, i.e. it has no comparative value.
- Since every dimension of the vector represents a unique input, this representation cannot handle unseen or unknown inputs.
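The first flaw is easy to verify: with one-hot vectors, every pair of distinct inputs sits at exactly the same distance. A minimal sketch (the shape names here are illustrative):

```python
# One-hot vectors for three inputs: each gets its own dimension.
rectangle = [1, 0, 0]
ellipse = [0, 1, 0]
triangle = [0, 0, 1]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# All pairwise distances are identical (sqrt(2)), so one-hot
# encoding says nothing about which inputs are alike.
print(euclidean(rectangle, ellipse))
print(euclidean(rectangle, triangle))
print(euclidean(ellipse, triangle))
```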
For example, what if we encounter a new input, a circle? If we use non-distributed representation, we have no way of representing the circle, but if we use distributed representation, we can describe a circle in terms of features like vertical, horizontal and ellipse. This representation also helps our model understand that a circle is more like an ellipse than a rectangle.
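The same idea in code, with hypothetical hand-crafted feature values (the numbers are illustrative, not from the original example): the unseen circle still gets a vector, and similarity scores place it nearer the ellipse than the rectangle.

```python
# Hypothetical distributed features: [has_vertical_extent,
# has_horizontal_extent, is_ellipse_like] -- values are made up.
shapes = {
    "rectangle": [0.9, 0.9, 0.0],
    "ellipse": [0.8, 0.9, 1.0],
    "circle": [0.9, 0.9, 1.0],  # unseen input, still representable
}

def cosine(u, v):
    """Cosine similarity: higher means more alike."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

# The circle comes out closer to the ellipse than to the rectangle.
print(cosine(shapes["circle"], shapes["ellipse"]))
print(cosine(shapes["circle"], shapes["rectangle"]))
```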
How do I create distributed vector representation?
There is no single way of creating a distributed vector representation. You can craft features that have a logical relation to the domain and represent your inputs accordingly, as done in the rectangle/ellipse example above (although this process requires domain expertise). Or you can take the more successful approach: use Deep Learning to learn feature representations that cannot be directly explained but hold the comparative value required (like I did in the playlist example at the start of the post).
To create a Deep Learning based distributed vector representation, one first creates a logical feature representation (which is actually a non-distributed representation in most cases) and then passes it through a ‘transformation matrix’ (also known as an embedding matrix) to get the distributed vector representation. This part of the model is usually attached at the start of the complete pipeline so that the representation is learned along with the rest of the task.
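A minimal sketch of that transformation step: multiplying a one-hot vector by the embedding matrix simply selects the matching row, which becomes the input’s dense representation. Here the matrix is random for illustration; in a real pipeline its entries are learned by backpropagation:

```python
import numpy as np

# Hypothetical vocabulary of 5 inputs embedded into 3 dimensions.
# These weights are random stand-ins; in practice they are learned.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(5, 3))

one_hot = np.zeros(5)
one_hot[2] = 1.0  # one-hot vector for input number 2

# One-hot times the transformation (embedding) matrix
# is just a lookup of row 2.
dense = one_hot @ embedding_matrix
print(dense)
```

This is why frameworks implement embedding layers as table lookups rather than actual matrix multiplications: the result is identical, but the lookup is far cheaper.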
There are numerous examples of vector representations that have helped ML pipelines: Word2Vec, Gene2Vec, Prot2Vec, Node2Vec, Doc2Vec, Tweet2Vec, Emoji2Vec etc. The list is enormous, although a major portion of it is collected in this repo.
Why does the vector length matter?
It all seems so easy now. Why waste so much space? Let’s just learn to represent everything as one single numerical value, right? Nope! Let me walk you through it.
What if we need to represent 3 inputs that are all equally similar to each other? Can we represent them with a one-dimensional vector? No! Why? Because three points can only be pairwise equidistant if they form a triangle, and a triangle is two-dimensional. Now, what if I need to represent 4 equally similar inputs? They form a pyramid, so we need three dimensions. You know the drill…
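The triangle claim can be checked directly: in two dimensions the corners of an equilateral triangle are all the same distance apart, which no arrangement of three distinct points on a line can achieve.

```python
import math

# Corners of an equilateral triangle in 2-D: every pair is
# equally far apart, which is impossible in 1-D.
a = (0.0, 0.0)
b = (1.0, 0.0)
c = (0.5, math.sqrt(3) / 2)

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

print(dist(a, b), dist(b, c), dist(a, c))  # all approximately 1.0
```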
Every time we increase the feature dimension, the sparsity of our feature representation grows exponentially. Distributed vector representation using Deep Learning is rooted in the fact that we only need an approximate (not necessarily exact) representation that holds comparative value. So the feature dimension must be chosen wisely: large enough to represent the inputs accurately, but small enough to avoid extreme sparsity in the feature space.
Distributed Vector Representation as a core idea is extremely popular and is widely used across various domains. Multiple methods of numerical feature representation have been proposed across a variety of problem statements. I believe that the core idea of such a representation is always going to be essential to any Machine Learning problem (even if the exact methods evolve significantly).