Preprocessing raw data is a necessary step in the neural network training pipeline. Computers (and neural networks) cannot understand raw text or images. The input has to be converted into numerical vectors that are meaningful within the constraints of the scenario. What are those constraints to begin with?

In the context of text, each word must be identifiable by a unique vector that represents the word. Unique representation of words is a fairly trivial problem to solve. Let us go ahead and look at an example that could make this clear. Suppose we want to convert the following sentence into a bunch of numerical vectors:

*‘I enjoy playing squash’*

We start with a vector of zeroes whose length equals the number of words in our corpus, four in this case: [0,0,0,0]. Next, each word is represented by placing a ‘1’ in its own position as we go through the corpus (‘I’ would be represented by [1,0,0,0], ‘enjoy’ by [0,1,0,0] and so on). This method is called **One Hot Encoding**.
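The steps above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `one_hot_encode` is made up for this post, and it assumes words are separated by single spaces.

```python
def one_hot_encode(sentence):
    """Return a dict mapping each distinct word to its one-hot vector."""
    # Build the vocabulary in order of first appearance.
    vocabulary = []
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

    vectors = {}
    for index, word in enumerate(vocabulary):
        vector = [0] * len(vocabulary)  # start from all zeroes
        vector[index] = 1               # mark the slot that belongs to this word
        vectors[word] = vector
    return vectors


vectors = one_hot_encode("I enjoy playing squash")
for word, vector in vectors.items():
    print(word, vector)
```

Each vector is as long as the vocabulary, which is why this representation grows quickly on a real corpus.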

While the above solution addresses the problem of unique representation of words, the method is extremely inefficient: it leads to sparse vectors that take up a lot of space, since each vector is as long as the number of distinct words in the corpus. More importantly, one hot encoding does not capture the context of words. What do we mean by capturing context? Words with similar meaning should cluster together if their respective vectors were to be plotted in some high dimensional space.
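To make the point about context concrete, here is a small sketch that compares vectors using cosine similarity, a common closeness measure that this post does not otherwise cover. The dense vectors are hand-picked, hypothetical values chosen only for illustration; they do not come from any trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction, 0 = unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# One-hot vectors: every pair of distinct words looks equally unrelated.
one_hot = {"squash": [1, 0, 0], "tennis": [0, 1, 0], "table": [0, 0, 1]}

# Hypothetical dense vectors (made-up numbers): related words point the same way.
dense = {"squash": [0.9, 0.8], "tennis": [0.8, 0.9], "table": [-0.7, 0.1]}

print(cosine_similarity(one_hot["squash"], one_hot["tennis"]))
print(cosine_similarity(dense["squash"], dense["tennis"]))
print(cosine_similarity(dense["squash"], dense["table"]))
```

With one-hot vectors the similarity between any two different words is zero, so there is nothing for similar words to cluster around. Dense vectors leave room for ‘squash’ to sit closer to ‘tennis’ than to ‘table’.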

Without getting too much into the math involved, let us try to understand how the problem can be tackled, or in other words, how one can arrive at **word embeddings**. The first point to note is that neural networks cannot ‘understand’ language like humans do but rather use a statistical approach to extract patterns from text. As with most statistical approaches, accuracy generally improves with the amount of data available.

The first step to achieve word embeddings is to initialize all words using one-hot encoding. Next, we need to decide on the length of the final vector of each word, chosen arbitrarily. Each value in the vector is called a weight. The weights are initialized randomly at the beginning. Now that we understand the first step, let us understand the crux of word embeddings.
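A rough sketch of this first step follows, assuming a final vector length of 3 and weights drawn uniformly from -1 to 1 (both arbitrary choices made for this example). It also shows why the one-hot vector is useful here: multiplying it by the weight matrix simply picks out the row of weights that belongs to the word.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

vocabulary = ["I", "enjoy", "playing", "squash"]
embedding_size = 3  # length of the final word vectors, chosen arbitrarily

# One row of randomly initialized weights per word in the vocabulary.
weights = [[random.uniform(-1.0, 1.0) for _ in range(embedding_size)]
           for _ in vocabulary]


def one_hot(index, size):
    """Vector of zeroes with a single 1 at the given position."""
    vector = [0] * size
    vector[index] = 1
    return vector


def embed(one_hot_vector, weights):
    """Multiply a one-hot row vector by the weight matrix."""
    columns = len(weights[0])
    return [sum(one_hot_vector[row] * weights[row][col]
                for row in range(len(weights)))
            for col in range(columns)]


index = vocabulary.index("enjoy")
vector = embed(one_hot(index, len(vocabulary)), weights)
print(vector)          # the current (still random) vector for 'enjoy'
print(weights[index])  # the same numbers: row 1 of the weight matrix
```

In practice libraries skip the multiplication and look the row up directly, but the result is the same.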

The idea is to look at a window of words at each step, say 4 in our case. The model tries to predict the center word of the window given the words that occur around it. Specifically, it works with conditional probabilities, for example the probability of ‘enjoy’ given ‘I’ in our sentence. **Using conditional probabilities, we try to predict the center word, given the context of words occurring around it!**
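Here is a minimal sketch of how a sentence could be sliced into (context, center) training examples. The function name and the `context_size` parameter are assumptions made for this example: `context_size` counts the words taken on each side of the center word, which is one of several ways to define the window.

```python
def context_center_pairs(words, context_size=1):
    """Pair every word with the words found up to `context_size` positions around it."""
    pairs = []
    for position, center in enumerate(words):
        left = words[max(0, position - context_size):position]
        right = words[position + 1:position + 1 + context_size]
        pairs.append((left + right, center))
    return pairs


pairs = context_center_pairs("I enjoy playing squash".split())
for context, center in pairs:
    print(context, "->", center)
```

Each pair is one training example: the network sees the context words and is asked to assign a high probability to the center word.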

But how do we go from numerical word vectors to probabilities? The Softmax function helps us do exactly that. The Softmax function essentially computes the dot product of the center word vector with the context word vector, exponentiates the result, and then normalizes by dividing by the sum of the same quantity over every word in the vocabulary. In essence, Softmax is a good fit here because its outputs always lie between 0 and 1 and sum to 1, which is exactly what we need from probabilities.
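The calculation described above can be sketched as follows. The vectors are hypothetical toy values, not trained weights, and subtracting the largest score before exponentiating is a standard numerical-stability trick that does not change the result.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracted for numerical stability only
    exponentials = [math.exp(score - highest) for score in scores]
    total = sum(exponentials)
    return [value / total for value in exponentials]


# Hypothetical vectors (made-up numbers for illustration).
context_vector = [0.5, -0.2, 0.1]
center_vectors = {
    "I": [0.1, 0.3, -0.4],
    "enjoy": [0.9, -0.5, 0.2],
    "playing": [-0.3, 0.8, 0.5],
    "squash": [0.2, 0.1, 0.0],
}

scores = [dot(context_vector, vector) for vector in center_vectors.values()]
probabilities = softmax(scores)
for word, probability in zip(center_vectors, probabilities):
    print(word, round(probability, 3))
```

The word whose vector has the largest dot product with the context vector receives the highest probability, here ‘enjoy’.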

Once we have the conditional probabilities, they are compared to the ‘ground truth’, since we know from the corpus which words actually occur together. Accordingly, a loss is computed through an objective function. The error is propagated back and the weights are updated to minimize it. Once training finishes, the updated weights are the final word vectors. Finally, we find word vectors with similar words clustering together in a high dimensional space!
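To give a feel for one round of "compute the loss, push the error back, update the weights", here is a deliberately simplified sketch. It assumes cross-entropy as the objective function and plain gradient descent with a learning rate of 0.5, and it updates only the candidate word vectors while keeping the context vector fixed. A real model would update every weight and repeat this over the whole corpus; all the numbers are made up.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    highest = max(scores)  # subtracted for numerical stability only
    exponentials = [math.exp(score - highest) for score in scores]
    total = sum(exponentials)
    return [value / total for value in exponentials]


def training_step(context_vector, word_vectors, true_index, learning_rate=0.5):
    """Apply one gradient-descent update in place; return the loss measured before it."""
    scores = [dot(context_vector, vector) for vector in word_vectors]
    probabilities = softmax(scores)
    loss = -math.log(probabilities[true_index])  # cross-entropy against the true word

    for index, vector in enumerate(word_vectors):
        target = 1.0 if index == true_index else 0.0
        error = probabilities[index] - target  # gradient of the loss w.r.t. this word's score
        for d in range(len(vector)):
            vector[d] -= learning_rate * error * context_vector[d]
    return loss


words = ["I", "enjoy", "playing", "squash"]
word_vectors = [
    [0.1, 0.3, -0.4],
    [0.9, -0.5, 0.2],
    [-0.3, 0.8, 0.5],
    [0.2, 0.1, 0.0],
]
context_vector = [0.5, -0.2, 0.1]
true_index = words.index("enjoy")  # the word that really occurs in this context

loss_before = training_step(context_vector, word_vectors, true_index)
loss_after = training_step(context_vector, word_vectors, true_index)
print(loss_before, loss_after)
```

The second loss is smaller than the first: the update nudged the vector for ‘enjoy’ towards the context vector and the other vectors away from it.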

I do not understand the math. What should I take away from this post?

Word embeddings are used to derive meaningful, dense and unique vectors for each word of a corpus. If plotted, similar words would cluster in a high dimensional space. The vectors are derived using a neural network that essentially calculates conditional probabilities, with the assumption that similar words appear close to each other. There are tools out there that learn word embeddings from a vocabulary of words for any task that we wish to carry out.

For example, the embedding layer in Keras learns the word vectors for any corpus that we would like to work with. While we might be able to learn word vectors for any corpus, keep in mind that the word vectors are effective only if the corpus is large. A large corpus means higher computation time. If limited by the size of the corpus and computation power (most of us are), it is common to use pre-trained word embeddings. The GloVe project from Stanford provides downloadable word vectors that one can use in the first layer of a neural network.
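The downloadable GloVe vectors come as plain text files in which each line holds a word followed by its vector values, separated by spaces. Below is a minimal sketch of reading that layout. The function name is made up for this post, and the two sample lines are invented numbers in the same layout, not real GloVe values.

```python
import io


def load_word_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Made-up sample standing in for a real file (illustrative values only).
sample = io.StringIO("squash 0.12 -0.40 0.33\ntennis 0.10 -0.35 0.30\n")
vectors = load_word_vectors(sample)
print(vectors["squash"])
```

With a real download, the same function should work on an opened file, for example `open("glove.6B.100d.txt", encoding="utf-8")`, assuming that file has been fetched and unzipped locally.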

Word embeddings are extremely simple to incorporate in Python using Keras. Below is a code snippet that sets up an embedding layer. The arguments to the embedding layer are:

1. Total number of words in the vocabulary

2. Length of the final word vectors

3. Length of each input sequence (the number of word indices per sample)

from keras.models import Sequential

from keras.layers import Embedding, Flatten, Dense

model = Sequential()

model.add(Embedding(5000, 32, input_length=5))  # 5000-word vocabulary, 32-dimensional vectors, 5 word indices per input

Finally, is it always good to use pre-trained word embeddings for neural networks? Unfortunately, it's not. Some tasks, such as sentiment analysis, can work better if the word vectors are randomly initialized and learned along with the rest of the network. Both random initialization and pre-trained word embeddings should be tried out on the specific task to compare performance.

**Bibliography:**

1. Deep Learning applied to NLP (Stanford University)

2. Deep Learning with Python (Chollet)

3. https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/

4. http://dataaspirant.com/2017/03/07/difference-between-softmax-function-and-sigmoid-function/

Source: Deep Learning on Medium