Introduction To Text Representations For Language Processing — Part 2


In the previous article, we discussed discrete text representations for feeding text to machine learning or artificial intelligence algorithms. We covered a few techniques, how they work, and the advantages & disadvantages of each. We also discussed the drawbacks of discrete representations: they ignore the positioning of words & make no inherent attempt to capture word similarity or meaning.

In this article, we will look at distributed text representations & how they solve some of the drawbacks of discrete representations.

Distributed Text Representation:

In a distributed text representation, the representation of a word is not independent or mutually exclusive of other words: the information about a word is spread across the dimensions of the vector that represents it, and individual dimensions (or combinations of them) often end up encoding concepts present in the data. For example, instead of a one-hot vector where only a single position carries information, a word is represented by a dense vector such as [0.21, -0.43, 0.77, ...], where every dimension contributes a little to the word’s meaning. This is different from discrete representations, where each word is considered unique & independent of every other.

Some of the commonly used distributed text representations are:

  • Co-Occurrence matrix
  • Word2Vec
  • GloVe

Co-Occurrence Matrix:

A co-occurrence matrix, as the name suggests, counts how often entities occur near each other. The entity could be a single word, a bi-gram or even a phrase; predominantly, single words are used for computing the matrix. It helps us understand the association between different words in a corpus.

Let’s look at an example using the CountVectorizer discussed in the previous article & convert its output into a continuous representation:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = ['product_x is awesome',
        'product_x is better than product_y',
        'product_x is dissapointing',
        'product_y beats product_x by miles',
        'ill definitely recommend product_x over others']

# Using the in-built english stop words to remove noise
count_vectorizer = CountVectorizer(stop_words='english')
vectorized_matrix = count_vectorizer.fit_transform(docs)

# A matrix multiplication with the transpose of the same matrix gives the co-occurrence counts
co_occurrence_matrix = (vectorized_matrix.T * vectorized_matrix)

print(pd.DataFrame(co_occurrence_matrix.A,
                   columns=count_vectorizer.get_feature_names(),
                   index=count_vectorizer.get_feature_names()))

Output:

               awesome  beats  better  definitely  dissapointing  ill  miles  \
awesome              1      0       0           0              0    0      0
beats                0      1       0           0              0    0      1
better               0      0       1           0              0    0      0
definitely           0      0       0           1              0    1      0
dissapointing        0      0       0           0              1    0      0
ill                  0      0       0           1              0    1      0
miles                0      1       0           0              0    0      1
product_x            1      1       1           1              1    1      1
product_y            0      1       1           0              0    0      1
recommend            0      0       0           1              0    1      0

               product_x  product_y  recommend
awesome                1          0          0
beats                  1          1          0
better                 1          1          0
definitely             1          0          1
dissapointing          1          0          0
ill                    1          0          1
miles                  1          1          0
product_x              5          2          1
product_y              2          2          0
recommend              1          0          1

The representation of each word is its corresponding row (or column) in the co-occurrence matrix

If we want to understand the word associations of product_x, we can filter for its column & observe that product_x is being compared with product_y & that there are more positive adjectives associated with it than negative ones. A minimal way to do this with pandas is sketched below.
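This sketch reuses the co_occurrence_matrix & count_vectorizer objects created in the block above:

# Rebuild the co-occurrence DataFrame from the objects created above
import pandas as pd
co_occurrence_df = pd.DataFrame(co_occurrence_matrix.A,
                                columns=count_vectorizer.get_feature_names(),
                                index=count_vectorizer.get_feature_names())
# Words most strongly associated with product_x, strongest first
print(co_occurrence_df['product_x'].sort_values(ascending=False))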

Advantages:

  • Simple representation for finding word associations
  • Unlike the discrete techniques, it takes into account which words appear near each other (the surrounding context)
  • The representation coming out of this method is a global representation, i.e. it uses the entire corpus for generating the representation

Disadvantages:

  • Similar to the CountVectorizer & TF-IDF matrices, this too is a sparse matrix, which means it is not storage efficient & running computations on top of it is inefficient
  • The larger the vocabulary, the larger the matrix (it does not scale well to large vocabularies)
  • Not all word associations can be understood using this technique. In the above example, if you look at the product_x column, there is a row named beats; it is unclear, simply by looking at the matrix, in what sense beats is being used here

Word2Vec

Word2Vec is a famous algorithm for representing word embeddings. It was developed by Tomas Mikolov et al. in 2013 in the research paper Efficient Estimation of Word Representations in Vector Space

It’s a prediction-based method for representing words, rather than a count-based technique like the co-occurrence matrix

Word embeddings are vector representations of words: each word is represented by a fixed-size vector that captures its semantic & syntactic relations with other words

The architecture of word2vec is a shallow, single-hidden-layer network. The weights of the hidden layer are the embeddings of the words & we adjust them via a loss function (normal backpropagation)

This architecture is similar to that of an autoencoder, where you have an encoder layer & a decoder layer, & the middle portion is a compressed representation of the input that can be used for dimensionality-reduction or anomaly-detection use cases.

word2vec constructs the vector representation via 2 methods/techniques:

  • CBOW — tries to predict the middle word given its surrounding context words. In simple terms, it tries to fill in the blank with the word that best fits the given context/surrounding words. More efficient on smaller datasets & faster to train than Skip-Gram
  • Skip-Gram — tries to predict the surrounding context words from a target word (the opposite of CBOW). Tends to perform better on larger datasets, at the cost of longer training time

Word2vec is capable of capturing multiple degrees of similarity between words using simple vector arithmetic. Patterns like “man is to woman as king is to queen” can be obtained through arithmetic operations like “king” - “man” + “woman” = “queen”, where “queen” will be the closest vector to the result. It also captures syntactic relationships, like present & past tense, & semantic relationships, like country-capital pairs
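The snippet below sketches how such an analogy query looks in gensim. It assumes a large pre-trained model downloaded via gensim.downloader (the toy model trained below is far too small for meaningful analogies):

import gensim.downloader as api
# ~1.6GB download of the pre-trained Google News word2vec vectors
vectors = api.load("word2vec-google-news-300")
# "king" - "man" + "woman" ~ "queen"
print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))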

Let’s look at a word2vec implementation using gensim:

# pip install --upgrade gensim  (or: conda install -c conda-forge gensim)
# Word2Vec expects a list-of-lists representation: the outer list represents the
# sentences, while each inner list contains the individual words of a sentence
# Ex: ["I love NLP", "NLP is awesome"] -> [["I", "love", "NLP"], ["NLP", "is", "awesome"]]
import gensim

sentences = ["ML is awesome", "ML is a branch of AI", "ML and AI are used interchangably nowadays",
             "nlp is a branch and AI", "AI has fastforwarded nlp",
             "RL is also a branch of AI", "word2vec is a high dimensional vector space embedding",
             "word2vec falls under text representation for nlp"]

# Preprocessing the sentences to convert them to the format expected by word2vec
sentence_list = [sentence.split(" ") for sentence in sentences]
print(sentence_list)

# Training Word2Vec with Skip-Gram (sg=1; sg=0 would train CBOW instead),
# a 100-dimensional vector representation, min word count of 1 so no terms are dropped,
# 4 parallel workers, a window of 4 for computing the neighbours & 100 iterations to converge
# (gensim >= 4.0 renames size -> vector_size and iter -> epochs)
model = gensim.models.Word2Vec(sentence_list, min_count=1,
                               workers=4, size=100, iter=100, sg=1, window=4)

# 100-dimensional vector for a word & its most similar words in the trained model
print(model.wv['word2vec'])
print(model.wv.most_similar(positive=['word2vec']))

Output

# Sentence List
[['ML', 'is', 'awesome'],
 ['ML', 'is', 'a', 'branch', 'of', 'AI'],
 ['ML', 'and', 'AI', 'are', 'used', 'interchangably', 'nowadays'],
 ['nlp', 'is', 'a', 'branch', 'and', 'AI'],
 ['AI', 'has', 'fastforwarded', 'nlp'],
 ['RL', 'is', 'also', 'a', 'branch', 'of', 'AI'],
 ['word2vec', 'is', 'a', 'high', 'dimensional', 'vector', 'space', 'embedding'],
 ['word2vec', 'falls', 'under', 'text', 'representation', 'for', 'nlp']]
# 100-dimensional vector representation of the word - "word2vec"
array([-2.3901083e-03, -1.9926417e-03, 1.9080448e-03, -3.1678095e-03,
-4.9522246e-04, -4.5374390e-03, 3.4716981e-03, 3.8659102e-03,
9.2548935e-04, 5.1823643e-04, 3.4266592e-03, 3.7806653e-04,
-2.6678396e-03, -3.2777642e-04, 1.3322923e-03, -3.0630219e-03,
3.1524736e-03, -8.5508014e-04, 2.0837481e-03, 5.2613947e-03,
3.7915679e-03, 5.4354439e-03, 1.6099468e-03, -4.0912461e-03,
4.8913858e-03, 1.7630701e-03, 3.1557647e-03, 3.5352646e-03,
1.8157288e-03, -4.0848055e-03, 6.5594626e-04, -2.7539986e-03,
1.5574660e-03, -5.1965546e-03, -8.8450959e-04, 1.6077182e-03,
1.5791818e-03, -6.2289328e-04, 4.5868102e-03, 2.6237629e-03,
-2.6883748e-03, 2.6881986e-03, 4.0420778e-03, 2.3544163e-03,
4.8873704e-03, 2.4868934e-03, 4.0510278e-03, -4.2424505e-03,
-3.7380056e-03, 2.5551897e-03, -5.0872993e-03, -3.3367933e-03,
1.9790635e-03, 5.7303126e-04, 3.9246562e-03, -2.4457059e-03,
4.2443913e-03, -4.9923239e-03, -2.8107907e-03, -3.8890676e-03,
1.5237951e-03, -1.4327581e-03, -8.9179957e-04, 3.8922462e-03,
3.5140023e-03, 8.2534424e-04, -3.7862784e-03, -2.2930673e-03,
-2.1645970e-05, 2.9765235e-04, -1.4117253e-03, 3.0826295e-03,
8.1492326e-04, 2.5406217e-03, 3.3184432e-03, -3.5381948e-03,
-3.1870278e-03, -2.7319558e-03, 3.0047926e-03, -3.9584241e-03,
1.6430502e-03, -3.2808927e-03, -2.8428673e-03, -3.1900958e-03,
-3.9418009e-03, -3.3188087e-03, -9.5077307e-04, -1.1602251e-03,
3.4587954e-03, 2.6288461e-03, 3.1395135e-03, 4.0585222e-03,
-3.5573558e-03, -1.9402980e-03, -8.6417084e-04, -4.5995312e-03,
4.7944607e-03, 1.1922724e-03, 6.6742860e-04, -1.1188064e-04],
dtype=float32)
# Most similar terms according to the trained model to the word - "Word2Vec"
[('AI', 0.3094254434108734),
('fastforwarded', 0.17564082145690918),
('dimensional', 0.1452922821044922),
('under', 0.13094305992126465),
('for', 0.11973076313734055),
('of', 0.1085459440946579),
('embedding', 0.06551346182823181),
('are', 0.06285746395587921),
('also', 0.05645104497671127),
('nowadays', 0.0527990460395813)]

Within a few lines of code, we are not only able to train & represent words as vectors, but also to use some of the built-in functions that rely on vector arithmetic to find the most similar words, the most dissimilar words & so on. A few of them are shown below.
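Continuing with the toy model trained above, a few of these built-in helpers look like this (on such a tiny corpus the numbers are illustrative only):

# Cosine similarity between two in-vocabulary words
print(model.wv.similarity('ML', 'AI'))
# The word that fits least with the others
print(model.wv.doesnt_match(['ML', 'AI', 'awesome']))
# Most dissimilar words to a given word
print(model.wv.most_similar(negative=['word2vec'], topn=3))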

There are 2 ways to find the similarity between two vectors, depending on whether they are normalised or not:

  • If normalised: we can compute a simple dot product between the vectors to find how similar they are
  • If not normalised: we can compute the cosine similarity between the vectors using the below formula

cosine similarity(A, B) = (A · B) / (‖A‖ ‖B‖),   with   cosine distance = 1 − cosine similarity
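As a quick sanity check, here is a minimal NumPy sketch of both cases, using two made-up vectors a & b:

import numpy as np

a = np.array([0.1, 0.3, 0.5])
b = np.array([0.2, 0.1, 0.4])

# Not normalised: use the cosine similarity formula directly
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalised: a plain dot product gives the same value
a_unit, b_unit = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(cos_sim, np.dot(a_unit, b_unit))      # the two values match
print('cosine distance:', 1 - cos_sim)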

For all the possible parameters & functionalities, you can refer to the gensim documentation below:

For more detail on cosine similarity refer to the wiki article below

The exact workings of the architecture, the training algorithms & how the relationships between words are found are beyond the scope of this article & deserve a separate article of their own

The original paper can be found below:

Advantages:

  • Capable of capturing relationships between different words, including their syntactic & semantic relationships
  • The size of the embedding vector is small & flexible, unlike all the previously discussed algorithms, where the size of the embedding is proportional to the vocabulary size
  • Since it is unsupervised, little human effort is needed for tagging the data

Disadvantages:

  • Word2Vec cannot handle out-of-vocabulary (OOV) words well: a word not seen during training has no learned vector, & falling back to a random representation can be suboptimal (a quick vocabulary check is sketched after this list)
  • It relies on local information only: the semantic representation of a word depends solely on its neighbouring words, which can prove suboptimal
  • Parameters trained on one language cannot be shared with another. If you want to train word2vec on a new language, you have to start from scratch
  • It requires a comparatively large corpus for the network to converge (especially if using Skip-Gram)
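With the gensim model trained above, a quick vocabulary check before querying avoids the KeyError that gensim raises for unseen words:

# Guarding against out-of-vocabulary words with the toy model trained above
for word in ['word2vec', 'transformers']:
    if word in model.wv:
        print(word, '-> vector available')
    else:
        print(word, '-> out of vocabulary')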

GloVe

Global Vectors for word representation (GloVe) is another famous embedding technique used quite often in NLP. It was the result of a 2014 paper by Jeffrey Pennington, Richard Socher & Christopher D. Manning from Stanford.

It tries to overcome the 2nd disadvantage of word2vec mentioned above by learning both local & global statistics of a word to represent it, i.e. it tries to combine the best of the count-based technique (co-occurrence matrix) & the prediction-based technique (Word2Vec), and hence is also referred to as a hybrid technique for continuous word representation

In GloVe, we try to enforce the below relationship between the word vectors V_i, V_j & the co-occurrence probability P(i|j):

V_i · V_j = log P(i|j)

Since P(i|j) = X_ij / X_j, where X_ij is the co-occurrence count of words i & j and X_j is the total count of word j, this can be re-written (absorbing the word-specific log X_j term into bias terms) as,

V_i · V_j + b_i + b_j = log X_ij

So, essentially, we are constructing word vectors V_i and V_j to be faithful to P(i|j), which is a globally computed statistic from the co-occurrence matrix

The tricky part of GloVe is the derivation of the objective function, which is out of scope for this article. But I’d encourage you to read the paper, which contains the full derivation, to understand how this relationship is converted into an optimisation problem
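For reference, the weighted least-squares objective that the derivation arrives at (stated here without proof, in the same notation as above) is:

J = Σ_{i,j} f(X_ij) · (V_i · V_j + b_i + b_j − log X_ij)²

where f is a weighting function that prevents very frequent co-occurrences from dominating the loss & gives zero weight to pairs that never co-occur.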

For a change, instead of building GloVe vectors from scratch, let’s understand how we can utilise awesome pre-trained models trained on billions of records

import gensim.downloader as api

# Let's download a 25-dimensional GloVe representation trained on 2 billion tweets
# Info on this & other embeddings: https://nlp.stanford.edu/projects/glove/
# Gensim provides an awesome interface to easily download pre-trained embeddings
# (> 100MB will be downloaded)
twitter_glove = api.load("glove-twitter-25")

# To find the most similar words
# Note: the vocabulary is lowercased. Querying upper-case words throws an out-of-vocabulary error
twitter_glove.most_similar("modi", topn=10)

# To get the 25-dimensional vector of a word
twitter_glove['modi']

# Similarity between two in-vocabulary words
twitter_glove.similarity("modi", "india")

# This will throw an error (upper-case "India" is not in the vocabulary)
twitter_glove.similarity("modi", "India")

Output:

# twitter_glove.most_similar("modi",topn=10)
[('kejriwal', 0.9501368999481201),
('bjp', 0.9385530948638916),
('arvind', 0.9274109601974487),
('narendra', 0.9249324798583984),
('nawaz', 0.9142388105392456),
('pmln', 0.9120966792106628),
('rahul', 0.9069461226463318),
('congress', 0.904523491859436),
('zardari', 0.8963413238525391),
('gujarat', 0.8910366892814636)]
# twitter_glove['modi']
array([-0.56174 , 0.69419 , 0.16733 , 0.055867, -0.26266 , -0.6303 ,
-0.28311 , -0.88244 , 0.57317 , -0.82376 , 0.46728 , 0.48607 ,
-2.1942 , -0.41972 , 0.31795 , -0.70063 , 0.060693, 0.45279 ,
0.6564 , 0.20738 , 0.84496 , -0.087537, -0.38856 , -0.97028 ,
-0.40427 ], dtype=float32)
# twitter_glove.similarity("modi", "india")
0.73462856
# twitter_glove.similarity("modi", "India")
KeyError: "word 'India' not in vocabulary"

Advantages

  • It tends to perform better than word2vec on analogy tasks
  • It considers word-pair-to-word-pair relationships while constructing the vectors & hence tends to encode more meaning into the vectors than those constructed purely from word-word relationships
  • GloVe is easier to parallelise than Word2Vec, hence shorter training times

Disadvantages

  • Because it uses the co-occurrence matrix & global information, GloVe has a higher memory cost than word2vec
  • Similar to word2vec, it does not solve the problem of polysemous words, since each word & its vector have a one-to-one relationship

Honourable Mentions:

Below are some of the advanced language models that are worth exploring after mastering the above representations

ELMO

Embeddings from Language Models (ELMo) was introduced by Matthew E. Peters et al. in the paper Deep contextualized word representations (March 2018).

It tries to address the disadvantages of word2vec & GloVe by allowing a many-to-one relationship between vector representations & the word they represent, i.e. it incorporates the context & changes the vector representation of the word accordingly

It uses character-level CNNs to convert words into raw word vectors. These are then fed into a bi-directional LSTM during training. The forward & backward passes create intermediate word vectors that represent the context information before & after the word respectively

The weighted sum of the raw word vector & the two intermediate word vectors gives us the final representation, as sketched below
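A minimal NumPy sketch of that combination step, with made-up layer outputs & weights purely for illustration:

import numpy as np

dim = 512                                      # hypothetical embedding size
raw_word_vector = np.random.rand(dim)          # from the character-level CNN
bilstm_layer_1 = np.random.rand(dim)           # first bi-LSTM layer (context around the word)
bilstm_layer_2 = np.random.rand(dim)           # second bi-LSTM layer

s = np.array([0.2, 0.5, 0.3])                  # learned (here: made-up) layer weights
gamma = 1.0                                    # learned global scaling factor

elmo_vector = gamma * (s[0] * raw_word_vector +
                       s[1] * bilstm_layer_1 +
                       s[2] * bilstm_layer_2)
print(elmo_vector.shape)                       # (512,)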

Original ELMO paper

BERT

BERT is a paper from the Google AI team, titled BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, which first came out in October 2018

It introduces a self-supervised pre-training task for transformers, so that they can then be fine-tuned for downstream tasks

BERT uses bidirectional context in its language modelling, i.e. it masks tokens & uses both the left-to-right & right-to-left context around them when predicting the masked words, hence the term bidirectional

The input representation to the BERT model is the sum of token embeddings, segment embeddings & position embeddings, & training follows a masking strategy in which the model has to predict the correct word for the masked positions from their context
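A toy NumPy sketch of that input construction, with hypothetical embedding tables & token ids just to show the element-wise sum:

import numpy as np

hidden_size, vocab_size, max_positions = 768, 30522, 512

# Hypothetical lookup tables (in BERT these are learned parameters)
token_embeddings = np.random.rand(vocab_size, hidden_size)
segment_embeddings = np.random.rand(2, hidden_size)              # sentence A / sentence B
position_embeddings = np.random.rand(max_positions, hidden_size)

token_ids = [101, 7592, 2088, 102]      # illustrative ids, e.g. [CLS] ... [SEP]
segment_ids = [0, 0, 0, 0]              # all tokens belong to sentence A here

input_representation = np.stack([
    token_embeddings[t] + segment_embeddings[s] + position_embeddings[pos]
    for pos, (t, s) in enumerate(zip(token_ids, segment_ids))
])
print(input_representation.shape)       # (4, 768)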

It uses a transformer network & attention mechanism that learn contextual relationships between words, & can be fine-tuned to take up downstream tasks like NER, question answering & so on
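If you want to try BERT directly, the Hugging Face transformers library (assumed to be installed via pip install transformers, together with PyTorch) exposes pre-trained checkpoints; a minimal sketch:

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer("word2vec falls under text representation for nlp",
                   return_tensors="pt")
outputs = model(**inputs)

# Contextual embeddings: one 768-dimensional vector per token, unlike the static vectors above
print(outputs.last_hidden_state.shape)   # torch.Size([1, sequence_length, 768])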

Original paper can be found below:

Summary

Distributed text representations are powerful algorithms capable of handling complex problem statements in NLP.

Alone, they can be used for understanding & exploring a corpus, for example exploring the words within it & how they associate with each other. But their strength & importance really come out when combined with supervised learning models for solving problems like question answering, document classification, chatbots & named entity recognition, to name a few.

Nowadays, they are quite frequently used in conjunction with CNNs & LSTMs & are part of many state-of-the-art results

Hope you enjoyed this series !

Repo Link: