Source: Deep Learning on Medium
TLDR: You can head to my repository where you will find instructions on how to train your own relation vector model using numpy, gensim and keras. Each step is explained in this blog post.
A major area of research in Natural Language Processing is word-level (or lexical) semantics. That is, building systems capable of understanding the meaning of a unit of analysis. This unit of analysis can be a (single token) word, but it can also be a phrase, a proper noun (named entity) or a subword (an affix, for instance). Now, how do we pin down the meaning of one of these units? Probably the most immediate source for word meaning that we can possibly think of is the dictionary.
Another source of knowledge when it comes to lexical semantics is to assign meaning to a word in terms of how it relates to other concepts. For example, we may not know what indigo is, but if we knew it is a type of dye, which in turn is a coloring material, then we wouldn’t really need a definition to have a pretty clear idea of what it is.
In NLP, a powerful way of inducing word meaning is to represent words as vectors, such that they account for the distribution of a word with respect to others with which it co-occurs. The already classic word embeddings turn this intuition into a vector space where similar words are grouped together because they tend to occur in similar contexts. For example, indigo will be represented with a vector close to dye, violet or magenta.
Relations in word embeddings
In fact, it has been shown that it is possible to model relations (or analogies) in word embeddings with simple vector arithmetic, the following equation being a well-known case: a − b + c ≈ d, or the famous king − man + woman ≈ queen. This can also be read as “the offset between king and man is similar to the offset between queen and woman”. This property is surprisingly well preserved in word embeddings, and holds for many morphosyntactic relations as well (e.g., play:played::run:ran).
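The analogy arithmetic above can be illustrated with a toy example. The 4-dimensional vectors below are handcrafted for illustration (real pretrained embeddings are typically 300-dimensional and learned from corpora), but the mechanics of the offset computation are the same:

```python
import numpy as np

# Toy embeddings, chosen by hand so that the gender offset
# (king - man ≈ queen - woman) holds; real word2vec vectors
# are learned from large corpora.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "man":   np.array([0.1, 0.8, 0.1, 0.0]),
    "woman": np.array([0.1, 0.8, 0.9, 0.0]),
    "queen": np.array([0.9, 0.8, 0.9, 0.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# king - man + woman should land near queen
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # queen
```

With gensim you would do the equivalent with `model.most_similar(positive=["king", "woman"], negative=["man"])` on a pretrained model.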
However, as you may expect, not all relations can be modeled based on this vector difference approach. It has been shown, in fact, that apparently well-defined relations like part-of, a.k.a. meronymy (e.g., a wheel is a meronym of car), do not hold in classic embedding models. The below example from a Coling paper by Bouraoui et al. (2018) shows the difference between the superlative degree relation (e.g., close-closest) (left) and meronymy (right).
Because relations are important for NLP and distributional semantics tells us that context is useful for modeling meaning, a possible approach is to learn an embedding for all pairs of words which are strongly associated in a corpus. In this post, we use the term relation embedding to refer to a vector encoding the relation between a pair of words, such that similar relations are grouped together in a vector space. For example, we would like to see the relation vectors for (jumper, wool) and (margarita, tequila) to be near in the space because both pairs of words are involved in an is-made-of relation.
This is the intuition behind SeVeN (Semantic Vector Networks). Let us explore how to build such a model.
SeVeN: Building relation embeddings
1. Build a word graph. Our input can be just a text corpus. Then, simply by computing PMI (pointwise mutual information) between center and context words, we can generate a word graph.
To do this, we first need to build a co-occurrence matrix (a table where we store how often each word co-occurs with others within a certain window).
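The counting step can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the actual `_get_coocs.py` script, which additionally handles vocabulary cutoffs, stopword removal and serialization:

```python
from collections import Counter

def cooc_counts(tokens, window=10):
    """Count how often each word co-occurs with another within
    `window` tokens to the left; counts are stored symmetrically."""
    counts = Counter()
    for i, center in enumerate(tokens):
        for ctx in tokens[max(0, i - window):i]:
            counts[(center, ctx)] += 1
            counts[(ctx, center)] += 1
    return counts

tokens = "indigo is a dye used as a coloring material".split()
counts = cooc_counts(tokens, window=3)
print(counts[("indigo", "dye")])  # 1
```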
python3 src/preprocess/_get_coocs.py -c corpus_file -b build_folder -v 10000 -win 10 -sw english_stopwords.txt
In this example, we are creating a vocabulary of 10k words, and counting co-occurrences between them within a window of 10 words. A bunch of files are saved in the build_folder/ directory. Then, to actually compute PMI between them and obtain the word graph, run:
python3 src/preprocess/_cooc2pmi.py -d build_folder/weighted_cooc_matrix.pkl -rd build_folder/raw_cooc_matrix.pkl -n build_folder/N_vals.txt -b build_folder -t 100 -wid build_folder/words2ids.txt -mc 100
Without going into the details, this call creates a tab-separated file with, for each word a in the vocabulary, its top 100 context words b by PMI, restricted to pairs that co-occur in the corpus at least 100 times. In this way we ensure we will have enough sentences containing a and b (which is important for learning good vectors).
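The PMI computation itself is simple. Here is a sketch over a toy symmetric co-occurrence matrix (the real script also applies the frequency threshold and keeps only the top-k context words per word):

```python
import numpy as np

def pmi_matrix(C):
    """PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ), estimated from
    a symmetric co-occurrence count matrix C of shape (V, V)."""
    C = np.asarray(C, dtype=float)
    total = C.sum()
    row = C.sum(axis=1, keepdims=True)   # marginal counts for a
    col = C.sum(axis=0, keepdims=True)   # marginal counts for b
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((C * total) / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0         # zero counts -> PMI of 0
    return pmi

# Toy counts: word 0 co-occurs much more often with word 1 than with word 2
C = np.array([[0, 10, 1],
              [10, 0, 2],
              [1, 2, 0]])
P = pmi_matrix(C)
print(P[0, 1] > P[0, 2])  # True: stronger association, higher PMI
```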
2. Get contexts. For each sentence containing a word pair (a, b), we get the embeddings for the left, middle and right contexts, and do the same for sentences containing (b, a).
We achieve this by calling:
python3 src/preprocess/_get_contexts.py -p build_dir/ppmi_pairs_topk=100.tsv_filtered.txt -b build_dir -mw 5 -sw 5
Here, we take the word pairs generated in the previous step and acquire their 6 possible contexts: those occurring within a middle window (the green contexts below) of 5 tokens, and a side window (the blue and pink contexts below) also of 5 tokens.
Once we have extracted the contexts for a and b (note that this is the slowest step), we vectorize them, that is, we convert those contexts into vectors by taking their average (Figure 4), and then average over all sentences (Figure 5).
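The averaging step can be sketched as follows. The `word_vecs` dictionary stands in for a pretrained embedding model (here filled with random 300-d vectors for illustration); the real `_vectorize.py` script additionally concatenates the 900-d vector for (a, b) with the one for (b, a), yielding the 1800-d relation vectors used later:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a pretrained 300-d embedding model (random for the sketch).
word_vecs = {w: rng.normal(size=300) for w in
             "the jumper was made of soft wool from local sheep".split()}

def relation_vector(sentences, a, b):
    """Average the left / middle / right context embeddings over all
    sentences containing the pair (a, b); returns a 3 * 300 = 900-d vector."""
    parts = []
    for tokens in sentences:
        i, j = tokens.index(a), tokens.index(b)
        left, middle, right = tokens[:i], tokens[i + 1:j], tokens[j + 1:]
        avg = lambda ws: (np.mean([word_vecs[w] for w in ws], axis=0)
                          if ws else np.zeros(300))
        parts.append(np.concatenate([avg(left), avg(middle), avg(right)]))
    return np.mean(parts, axis=0)

sent = "the jumper was made of soft wool from local sheep".split()
rv = relation_vector([sent], "jumper", "wool")
print(rv.shape)  # (900,)
```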
Vectorization happens when calling:
python3 src/preprocess/_vectorize.py -wv word_vectors -p build_dir/ppmi_pairs_topk=100.tsv_filtered.txt -b build_dir
The script takes as input the contexts stored in the build_folder/ directory, a word embedding model, the pairs, and creates a relational vector space.
3. Run relation vectors through an autoencoder. Why? If you look at the examples below (sect. Differences in the spaces), you will see that the embeddings obtained by simply averaging context words do encode a relation, but the meaning of the individual words (a and b) “governs” it: the nearest neighbors in this space are relations between pairs of words similar to the target pair. Ideally, we would like a way to remove the features specific to a and b from the relation, leaving us with a purified relation vector.
In the above figure, you can see that we feed the autoencoder the original 1800d relation vector (if you use 300d pretrained embeddings). And, when the autoencoder attempts to reconstruct it, the decoder has access to the words a and b. We are in a way telling the network to forget all it knows about the words themselves when learning to reconstruct the relation, because this information is being given to it explicitly. We achieve this by calling the following script:
python3 src/preprocess/_autoencoder.py -rv relation_vectors -wv word_vectors -b build_dir
A set of autoencoded relation vectors (with different sizes) will be saved to your build_folder/ directory. In our experiments we needed to produce vectors of 50 or fewer dimensions to actually remove word-specific features.
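The conditioning trick can be sketched with a minimal linear autoencoder in numpy (the repository uses keras). Dimensions are scaled down for the sketch, and the data is random: the point is only the architecture, in which the decoder receives the word vectors of a and b alongside the bottleneck code, so the code itself need not store word-specific information:

```python
import numpy as np

rng = np.random.default_rng(0)
# Scaled-down dimensions (the paper uses 1800-d relation vectors,
# 300-d word vectors and bottlenecks of 50 dimensions or fewer).
n, rel_d, word_d, bottleneck = 200, 60, 10, 5

R = rng.normal(size=(n, rel_d))     # relation vectors, one per word pair
A = rng.normal(size=(n, word_d))    # word vector of a, for each pair
B = rng.normal(size=(n, word_d))    # word vector of b, for each pair

W_enc = rng.normal(size=(rel_d, bottleneck)) * 0.1
W_dec = rng.normal(size=(bottleneck + 2 * word_d, rel_d)) * 0.1

losses, lr = [], 0.01
for _ in range(200):
    z = R @ W_enc                        # "purified" relation code
    dec_in = np.hstack([z, A, B])        # decoder also sees a and b
    err = dec_in @ W_dec - R             # reconstruction error
    losses.append(float((err ** 2).mean()))
    # Gradient descent on the mean squared reconstruction error
    W_dec -= lr * (dec_in.T @ err) / n
    W_enc -= lr * (R.T @ (err @ W_dec[:bottleneck].T)) / n

print(losses[-1] < losses[0])  # True: reconstruction improves
```

The keras version adds nonlinearities and trains with a proper optimizer, but the key design choice is the same: `A` and `B` bypass the bottleneck and enter the decoder directly.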
Differences in the spaces
Consider a relation loosely defined as a → product developed by company → b. If we take directX and Microsoft, we can see that in the original 1800d space the meaning of directX and, especially, Microsoft forces (windows, microsoft) to be the most similar relation. When using the purified space, we move outside the Microsoft domain, but still with a similar relation. For the future, it would be interesting to go as low as 5- or 3-dimensional relation vectors.
Another interesting example is a → is a landmark of → b. If we take mii and nintendo, we can see that the purified model sticks to the tech domain, but now the relation is not about Nintendo anymore, but rather about another named entity (iphone) which has a landmark not related to videogames.
Do it yourself
I hope this tutorial was useful for learning what relation vectors are. In later posts we will discuss alternatives to our SeVeN approach and how these relation embeddings can be used in NLP. Remember that in my repository you will find more details about the steps covered here (from corpus preprocessing to vector purification), as well as pretrained relation vectors.
Espinosa-Anke, L. & Schockaert, S. (2018). SeVeN: Augmenting Word Embeddings with Unsupervised Relation Vectors. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 2653–2665).