Original article was published by Pranjal Soni on Deep Learning on Medium
This article is based on week 2 of the Sequence Models course on Coursera. In it, I try to summarise and explain the concepts of word representation and word embedding.
Word Representation:
In natural language processing, we generally represent words through a vocabulary in which every word is a one-hot encoded vector. Suppose we have a vocabulary (V) of 10,000 words.
V = [a, aaron, …, zulu, <UNK>]
Suppose the word ‘Man’ is at position 5391 in the vocabulary; it can then be represented by the one-hot encoded vector O₅₃₉₁. The position of the 1 in the sparse vector O₅₃₉₁ is the index of the word Man in the vocabulary.
O₅₃₉₁ = [0,0,0,0,...,1,...,0,0,0]
In the same way, the other words in the vocabulary are represented by one-hot encoded vectors: Woman (O₉₈₅₃), King (O₄₉₁₄), Queen (O₇₁₅₇), Apple (O₄₅₆), Orange (O₆₂₅₇).
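As a minimal sketch of this lookup, the snippet below builds one-hot vectors in plain Python. The eight-word vocabulary and the helper name one_hot are assumptions made for illustration; they stand in for the 10,000-word vocabulary V described above.

```python
# Toy vocabulary: a tiny, hypothetical stand-in for the 10,000-word vocabulary V.
vocab = ["a", "apple", "king", "man", "orange", "queen", "woman", "<UNK>"]
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return the one-hot vector for `word`; unknown words map to <UNK>."""
    index = word_to_index.get(word, word_to_index["<UNK>"])
    vector = [0] * len(vocab)
    vector[index] = 1  # a single 1 at the word's position in the vocabulary
    return vector


print(one_hot("man"))  # [0, 0, 0, 1, 0, 0, 0, 0]
```

The vector is as long as the vocabulary and contains a single 1, which is why it is called sparse.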
But this representation is not an effective input for sequence models, because it treats every word as an isolated symbol, so the algorithm cannot capture any relationship between different words.
Suppose we train our model on the sentence:
I want a glass of orange juice.
And we want to predict the next word for the sentence:
I want a glass of apple _____.
Even though the two examples are almost the same and our algorithm is well trained, it fails to predict the next word in the test example. The reason is that, with one-hot encoding, the inner product between any two different one-hot vectors is 0. The Euclidean distance does not help either: it is the same (√2) for every pair of different words, so apple is no closer to orange than it is to king or man.
We know that the next word should be juice in our example, but the algorithm cannot find any relationship between the words of the two sentences, so it fails to predict it.
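The limitation can be checked numerically. The sketch below reuses the vocabulary positions quoted earlier (Apple 456, Orange 6257, King 4914) and shows that every pair of different one-hot vectors has inner product 0 and the same Euclidean distance, √2. The helper names are assumptions for illustration.

```python
import math


def one_hot(index, size):
    """Build a one-hot vector of length `size` with a 1 at `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


size = 10000  # vocabulary size from the text
apple = one_hot(456, size)
orange = one_hot(6257, size)
king = one_hot(4914, size)

# The inner product is 0 for every pair of different words.
print(dot(apple, orange), dot(apple, king))

# The distance is identical for every pair, so it carries no similarity signal.
print(euclidean(apple, orange) == euclidean(apple, king))  # True
```

Because both measures give the same value for every pair, the model has no way to tell that apple is more like orange than like king.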
To solve this problem we turn to word embeddings, which are featurized representations of words. For each word in the vocabulary, we can learn a set of features and their values.
Instead of a sparse one-hot encoded vector, we use a dense feature vector for each word. We can take different properties and assign each word a weight that says how strongly the property applies to it. For example, the Gender property applies strongly to man, woman, king, and queen but is unrelated to apple and orange, so those four words get weights of large magnitude while apple and orange get weights close to zero. In this way we establish a relationship between words that share properties. Our algorithm can now see that apple and orange are similar, and predict the next word accordingly.
In the image shown above, 300 properties are taken for each word and collected into a vector, so the word man is now represented by e₅₃₉₁:
e₅₃₉₁ = [-1, 0.01, 0.03, 0.04, ...]
Here e denotes the embedding vector.
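To make the idea concrete, here is a small sketch that compares dense vectors with cosine similarity. The four features and nearly all of the numbers are made up for illustration; only the row for man reuses the four values shown above. Real embeddings are learned from data and have many more dimensions (300 in this article).

```python
import math

# Hypothetical 4-feature embeddings: [gender, royal, age, food].
# Only "man" reuses the values from the text; the rest are invented.
embeddings = {
    "man":    [-1.00, 0.01, 0.03, 0.04],
    "woman":  [1.00, 0.02, 0.02, 0.01],
    "king":   [-0.95, 0.93, 0.70, 0.02],
    "queen":  [0.97, 0.95, 0.69, 0.01],
    "apple":  [0.00, -0.01, 0.03, 0.95],
    "orange": [0.01, 0.00, -0.02, 0.97],
}


def cosine(u, v):
    """Cosine similarity: near 1 for similar directions, near 0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print("apple vs orange:", round(cosine(embeddings["apple"], embeddings["orange"]), 3))
print("apple vs man:   ", round(cosine(embeddings["apple"], embeddings["man"]), 3))
```

With these toy values, apple and orange score close to 1 while apple and man score close to 0, which is the kind of relationship that one-hot vectors cannot express.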
We can use the t-SNE (t-distributed stochastic neighbor embedding) algorithm to visualize these word vectors in a 2-D plot.
In this graph, we can see that words which share properties are neighbors, and words which have no property in common are far apart.
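A projection like this can be sketched with scikit-learn's TSNE class, assuming numpy and scikit-learn are installed. The six toy vectors below are made-up values, not learned embeddings, and the perplexity and random_state settings are arbitrary choices for such a tiny data set. With so few points the layout may not show clean clusters; real plots use thousands of learned vectors.

```python
import numpy as np
from sklearn.manifold import TSNE

words = ["man", "woman", "king", "queen", "apple", "orange"]

# Hypothetical 4-feature embeddings, one row per word (illustrative values only).
vectors = np.array([
    [-1.00, 0.01, 0.03, 0.04],
    [1.00, 0.02, 0.02, 0.01],
    [-0.95, 0.93, 0.70, 0.02],
    [0.97, 0.95, 0.69, 0.01],
    [0.00, -0.01, 0.03, 0.95],
    [0.01, 0.00, -0.02, 0.97],
])

# perplexity must be smaller than the number of samples (6 here).
tsne = TSNE(n_components=2, perplexity=2.0, init="random", random_state=0)
points = tsne.fit_transform(vectors)  # shape: (6, 2)

for word, (x, y) in zip(words, points):
    print(f"{word:>6}: ({x:.2f}, {y:.2f})")
```

The resulting 2-D coordinates can then be passed to any plotting library to draw the scatter plot.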