How to Compute Sentence Similarity Using BERT and Word2Vec

Originally published by Pedram Ataee, PhD, in Artificial Intelligence on Medium


— How to Use the “Sent2Vec” Python package

How to Install

Since sent2vec is a high-level library, it depends on spaCy (for text cleaning), Gensim (for word2vec models), and Transformers (for various forms of the BERT model). Make sure to install these libraries before installing sent2vec using the command below.

pip3 install sent2vec
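If those dependencies are missing, you can install them the same way; assuming the standard PyPI package names, something like this should work:

pip3 install spacy gensim transformers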

How to Use the BERT Method

If you want to use the BERT language model (more specifically, distilbert-base-uncased) to encode sentences for downstream applications, you can use the code below. Currently, the sent2vec library supports only the DistilBERT model; more models will be supported in the future. Since this is an open-source project, you can also dig into the source code for implementation details.

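A minimal sketch of this step, assuming the Vectorizer class exposes a bert method that stores the results in a vectors attribute (the exact method names may vary between releases, and the example sentences are illustrative):

from sent2vec.vectorizer import Vectorizer

sentences = [
    "This is an awesome book to learn NLP.",
    "DistilBERT is an amazing NLP model.",
    "We can interchangeably use embedding, encoding, or vectorizing.",
]

# Encode each sentence with distilbert-base-uncased (the library's default).
vectorizer = Vectorizer()
vectorizer.bert(sentences)
vectors = vectorizer.vectors  # one embedding vector per sentence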

You can compute distances among sentences using their vectors. In the example, as expected, the distance between vectors[0] and vectors[1] is less than the distance between vectors[0] and vectors[2], as shown below.
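For instance, continuing from the snippet above, you can compare the embeddings with cosine distance (here via SciPy; any distance measure works):

from scipy.spatial import distance

# `vectors` comes from the previous snippet; a smaller distance means more similar.
dist_01 = distance.cosine(vectors[0], vectors[1])
dist_02 = distance.cosine(vectors[0], vectors[2])
assert dist_01 < dist_02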

How to Use the Word2Vec Method

If you want to use a word2vec approach instead, you must first split the sentences into lists of words using the sent2words method of the Splitter class in this library. You can customize the stop-word list by adding words to, or removing words from, the default list. It is crucial to inspect the stop-word list before any computation: a small change in this step can easily skew the final results.
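A sketch of the splitting step; the sent2vec.splitter module path and the keyword arguments for customizing the stop-word list are assumptions here:

from sent2vec.splitter import Splitter

# `sentences` is the list defined earlier. Keep "not" out of the stop-word
# list so negations survive the cleaning step (argument names are assumed).
splitter = Splitter()
splitter.sent2words(sentences, remove_stop_words=["not"], add_stop_words=[])
words = splitter.words  # a list of word lists, one per sentence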

Once you have extracted the most important words from each sentence, you can compute the sentence embeddings using the word2vec method of the Vectorizer class. This method averages the vectors of the remaining (i.e., most important) words, as in the code below.

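A sketch of this step, assuming the word2vec method takes the split words plus a path to a pre-trained model (the keyword name and the file name are placeholders):

from sent2vec.vectorizer import Vectorizer

# Path to a pre-trained word2vec model on disk (e.g., downloaded via gensim).
PRETRAINED_VECTORS_PATH = "glove-wiki-gigaword-300.gz"

vectorizer = Vectorizer()
vectorizer.word2vec(words, pretrained_vectors_path=PRETRAINED_VECTORS_PATH)
vectors = vectorizer.vectors  # the average of each sentence's word vectors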

As seen above, you can use different word2vec models by passing the model's path (i.e., PRETRAINED_VECTORS_PATH) to the word2vec method. You can use a pre-trained model or a customized one. This configuration is crucial for meaningful outcomes: you need a vectorization that suits your context, and the choice of word2vec model takes care of that.

— What Is the Best Sentence Encoder?

The final outcomes of sentence encoding or embedding techniques depend on various factors, such as a relevant stop-word list or a contextualized pre-trained model. You can find more explanation below.

  • Text Cleaning — Let’s say you use spaCy for the text cleaning step, as I did in the sent2vec library. If you mistakenly forget to remove “not” from the default stop-word list, the sentence embedding results can be totally misleading: a simple “not” can flip the sentiment of a sentence. The default stop-word list differs across environments, so you must curate this list to your needs before any computation.
  • Contextualized Models — You must use a contextualized model. For example, if the target data is in finance, you must use a model trained on a finance corpus; otherwise, the sentence embeddings can be inaccurate. In particular, if you use the word2vec method with a general English model on domain-specific text, the results may be misleading.
  • Aggregation Strategy — When you compute sentence embeddings with the word2vec method, you may need a more advanced technique to aggregate the word vectors than simply averaging them. Currently, the sent2vec library supports only the “average” technique. Using a weighted average is a simple enhancement that can improve the final outcomes, as sketched after this list; more advanced techniques will be supported in future releases.
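The library does not implement weighted aggregation yet, but as a sketch, a weighted average (with weights derived, say, from TF-IDF scores) could look like this:

import numpy as np

def weighted_sentence_vector(word_vectors, weights):
    # Weighted average of word vectors; weights could be TF-IDF scores.
    return np.average(np.asarray(word_vectors), axis=0, weights=weights)

# Toy usage with made-up 3-dimensional vectors and weights.
word_vectors = [[0.1, 0.3, 0.5], [0.4, 0.2, 0.0]]
sentence_vector = weighted_sentence_vector(word_vectors, weights=[0.7, 0.3])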

To emphasize the significance of the word2vec model, I encoded a sentence using two different word2vec models (i.e., glove-wiki-gigaword-300 and fasttext-wiki-news-subwords-300) and then computed the cosine similarity between the two resulting vectors: 0.005, which could be interpreted as “these two sentences are very different.” With this example, I want to demonstrate that the vector representations of the same sentence can be nearly perpendicular when two different word2vec models are used. In other words, if you blindly compute sentence embeddings with an arbitrary word2vec model, you may be surprised by the results.
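A sketch of that experiment using gensim’s downloader API (the word list is illustrative; the 0.005 figure comes from the author’s own run):

import gensim.downloader as api
import numpy as np

words = ["we", "can", "use", "embedding", "encoding", "or", "vectorizing"]

# Both models are available through gensim's downloader.
glove = api.load("glove-wiki-gigaword-300")
fasttext = api.load("fasttext-wiki-news-subwords-300")

def sentence_vector(model, words):
    # Average the vectors of the in-vocabulary words.
    return np.mean([model[w] for w in words if w in model], axis=0)

v1 = sentence_vector(glove, words)
v2 = sentence_vector(fasttext, words)

cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos_sim)  # near zero: the two embedding spaces are essentially unrelated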

— Sent2Vec Is an Open-Source Library, So …

Sent2vec is an open-source library. The main goal of this project is to expedite building proofs of concept in NLP projects, since a large number of NLP tasks, including summarization and sentiment analysis, need sentence vectorization. So, please consider contributing and pushing this project forward. I also hope you can use this library in your exciting NLP projects.