How Word Embeddings Are Created: Context-Counting Semantics (in brief)


Let’s think a step ahead.

Is it possible to create a vector space for all English words that has this same “Closer in space is Closer in meaning” property?

To answer this, we need to know the answer to the question:

What does meaning mean?

No one really knows, but one theory popular among computational linguists, computer scientists, and other people who make search engines is the Distributional Hypothesis, which states that:

Linguistic items with similar distributions have similar meanings.

Here “similar distributions” means the same thing as “similar contexts”.

For example, consider the following sentences, which all provide a similar context:

It was really cold yesterday.
It will be really warm today, though.
It'll be really hot tomorrow!
Will it be really cool Tuesday?

According to the Distributional Hypothesis, the words cold, warm, hot, and cool must be related in some way (i.e., close in meaning) because they occur in a similar context: between the word "really" and a word indicating a particular day.

Likewise, the words yesterday, today, tomorrow, and Tuesday must be related, since they all occur in the context of a word indicating a temperature.

In other words, according to the Distributional Hypothesis, a word’s meaning is just a big list of all the contexts it occurs in. Two words are closer in meaning if they share contexts.

Word vectors by counting contexts

The important question to answer here is: how do we turn this insight from the Distributional Hypothesis into a system for creating general-purpose vectors that capture the meaning of words?

What if we made a really big spreadsheet with one column for every context and one row for every word in a given source text? Let's begin with a small source text, such as the following sentence:

It was the best of times, it was the worst of times.

Such a spreadsheet might look something like this:

Fig-5: context-counting

This spreadsheet has one column for every possible context and one row for every word. The value in each cell records how many times the word occurred in the given context. The numbers in a word's row constitute that word's vector, i.e., the vector for the word of is

[0, 0, 0, 0, 1, 0, 0, 0, 1, 0]

It is a vector of shape (1×10), because there are ten possible contexts, so the vectors live in a ten-dimensional space. A ten-dimensional space is very hard to visualize.
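Although ten dimensions are hard to picture, the counting itself is mechanical. Here is a minimal Python sketch of the procedure, assuming a context is simply the pair (previous word, next word) and using naive tokenisation; the number and ordering of the contexts it finds differ from Fig-5, which makes other tokenisation choices, but the idea is the same:

from collections import Counter, defaultdict

text = "It was the best of times, it was the worst of times."
# Naive tokenisation: lowercase and strip trailing punctuation.
tokens = [t.strip(",.").lower() for t in text.split()]
# A "context" here is the pair (previous word, next word) around a token.
counts = defaultdict(Counter)
for i in range(1, len(tokens) - 1):
    counts[tokens[i]][(tokens[i - 1], tokens[i + 1])] += 1
# Fix an ordering of the observed contexts so that every word gets a
# vector with one dimension per context.
contexts = sorted({c for counter in counts.values() for c in counter})
vectors = {w: [counter[c] for c in contexts] for w, counter in counts.items()}
print(len(contexts), "contexts")
print("best :", vectors["best"])
print("worst:", vectors["worst"])   # identical to "best" in this tiny corpus
print("of   :", vectors["of"])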

Performing vector arithmetic on ten-dimensional vectors, and drawing inferences from the results, is just as easy as doing it with vectors of two or three dimensions. We can use the same distance formula and get useful information (as in Fig-4) about which vectors in this space are similar to each other.

In particular, the vectors for best and worst are actually the same (a distance of zero), since they occur only in the same context (the ___ of):

[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
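As a quick check, the same Euclidean distance formula used for two or three dimensions (Fig-4) works unchanged on these ten-dimensional vectors; a minimal sketch using only Python's standard library:

import math
# Count vectors read off the spreadsheet in Fig-5.
best  = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
worst = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
of    = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
print(math.dist(best, worst))  # 0.0 -- identical contexts
print(math.dist(best, of))     # about 1.73 -- different contexts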

But the conventional way of thinking about “best” and “worst” is that they are antonyms, not synonyms.

Still, they are clearly two words of the same kind, with related meanings (through opposition), a fact that is captured by this distributional model.

Contexts and dimensionality

In a corpus (a large collection of documents) of any reasonable size, there will be many thousands, if not millions, of possible contexts. Working with a vector space of ten dimensions is awkward enough; a vector space of a million dimensions is very complex, and many of its dimensions end up being superfluous: they can be eliminated or combined with other dimensions without significantly affecting the predictive power of the resulting vectors.

The process of getting rid of superfluous dimensions in a vector space is called dimensionality reduction, and most implementations of count-based word vectors make use of it so that the resulting vector space has a reasonable number of dimensions (typically 100–300, depending on the corpus and application).
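As an illustration, here is a minimal sketch of dimensionality reduction on a toy count matrix, using truncated SVD from scikit-learn (one common technique, assumed to be installed; specific count-based systems use their own reduction schemes):

import numpy as np
from sklearn.decomposition import TruncatedSVD
# A toy word-by-context count matrix: 6 "words" x 1000 "contexts".
rng = np.random.default_rng(0)
counts = rng.poisson(0.05, size=(6, 1000)).astype(float)
# Project the counts down to 3 dimensions (real systems typically keep 100-300).
svd = TruncatedSVD(n_components=3, random_state=0)
dense_vectors = svd.fit_transform(counts)
print(dense_vectors.shape)  # (6, 3)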

The question of how to identify a "context" is itself difficult to answer: it is domain-specific and depends on the type of problem being solved.

In the example above, a "context" is simply the word that precedes and the word that follows. Depending on the implementation, though, we might want a bigger window (for example, two words before and after) or a non-contiguous window (skipping a word before and after the given word), as sketched below.
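A minimal sketch of such a configurable window (the function name and defaults are purely illustrative):

def window_context(tokens, i, size=2):
    # Return the words within `size` positions on either side of
    # tokens[i], excluding the word itself.
    left = tokens[max(0, i - size):i]
    right = tokens[i + 1:i + 1 + size]
    return left + right

tokens = "it was the best of times".split()
print(window_context(tokens, 3, size=1))  # ['the', 'of']
print(window_context(tokens, 3, size=2))  # ['was', 'the', 'of', 'times']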

Often, certain function words like "the" and "of" are excluded when determining a word's context, or words are lemmatized before the analysis begins, so that occurrences of different inflected forms of the same word count as the same context.

These are open research questions, and different implementations of procedures for creating count-based word vectors make different decisions about them.

The best part, though, is that we don't need to create our own word vectors from scratch.

Many researchers have made databases of pre-trained vectors available for download. One such project is Stanford's Global Vectors for Word Representation (GloVe). These 300-dimensional vectors are included with spaCy, an open-source library for advanced natural language processing written in Python and Cython.

Let's see an example of what a word vector created using GloVe looks like.

Fig-6: Word vector for the word “testing”
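A minimal sketch of how such a vector can be looked up with spaCy, assuming the large English model (which bundles the 300-dimensional vectors) has already been downloaded; the exact numbers depend on the model version:

import spacy
# Assumes: python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")
token = nlp("testing")[0]
print(token.vector.shape)  # (300,)
print(token.vector[:10])   # the first ten components, as in Fig-6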

Conclusion

Humans can understand language, decode meaning, and analyse the information they are given, but computers only understand numbers. Creating word vectors therefore makes it much easier to draw inferences and conclusions from a huge corpus, something that would be time-consuming and difficult for humans to do by hand.

To understand this better and analyse it in a more intuitive way, refer here.