Combining LDA and Word Embeddings for topic modeling




Latent Dirichlet Allocation (LDA) is a classical way to do topic modeling. Topic modeling is an unsupervised learning task whose goal is to group similar documents under the same “topic”.

A typical example is clustering news articles into categories such as “Finance”, “Travel”, and “Sport”. Before word embeddings, Bag-of-Words was the most common representation. However, the field changed after Mikolov et al. introduced word2vec (one example of word embeddings) in 2013. Moody later announced lda2vec, which combines LDA and word embeddings to tackle the topic modeling problem.

After reading this article, you will understand:

  • Latent Dirichlet Allocation (LDA)
  • Word Embeddings
  • lda2vec

Latent Dirichlet Allocation (LDA)


LDA is a famous method in the topic modeling area. It clusters documents based on word usage. To keep things simple, LDA uses Bag-of-Words features for clustering. For details, you may check out this blog.
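As a quick illustration, here is a minimal sketch of classical LDA with gensim (the toy corpus and parameters are my own, not from the article):

    # A minimal LDA sketch with gensim on a hypothetical toy corpus.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    docs = [
        ["stock", "market", "bank", "finance"],
        ["flight", "hotel", "travel", "beach"],
        ["match", "goal", "team", "sport"],
    ]

    dictionary = Dictionary(docs)                       # map each token to an integer id
    corpus = [dictionary.doc2bow(doc) for doc in docs]  # Bag-of-Words counts per document

    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=10)
    print(lda.print_topics())  # each topic is a weighted mixture of words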

Word Embeddings


The goal of word embeddings is to resolve the sparse, high-dimensional features common in NLP problems. With word embeddings (or word vectors), we can represent every word in a low-dimensional space (typically 50 to 300 dimensions). For details, you may check out this blog.
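For example, here is a minimal sketch of training 50-dimensional word vectors with gensim’s Word2Vec (gensim 4.x API; the toy sentences are illustrative only, and a real model needs a large corpus):

    # A minimal skip-gram word2vec sketch with gensim.
    from gensim.models import Word2Vec

    sentences = [
        ["finance", "bank", "stock", "market"],
        ["travel", "flight", "hotel", "beach"],
    ]

    # vector_size=50 gives a 50-dimensional dense vector per word; sg=1 selects skip-gram
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

    print(model.wv["finance"].shape)         # (50,)
    print(model.wv.most_similar("finance"))  # nearest words in the embedding space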

lda2vec

lda2vec includes two parts, a word vector and a document vector, which jointly predict words so that all vectors are trained simultaneously. It builds word vectors with a skip-gram model; in short, it uses a target word to predict its surrounding words in order to learn the vectors. The second part is the document vector, which combines:

  • document weight vector: the weight of each topic for the document, transformed into percentages with a softmax.
  • topic matrix: the topic vectors. One column refers to one topic, and the words nearest that topic vector in the embedding space describe the topic.
Interactive visualization of lda2vec topics: https://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/#topic=38&lambda=1&term=

The formula of the document vector (Moody, Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec, 2016) is:

dj = pj0·t0 + pj1·t1 + … + pjn·tn
  • dj: the vector of document j
  • pj0: the weight of document j on topic 0
  • pjn: the weight of document j on topic n
  • t0: the vector of topic 0
  • tn: the vector of topic n
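To make the formula concrete, here is a small numpy sketch (with toy random numbers in place of trained parameters) that applies the softmax to a document weight vector and mixes the shared topic vectors into a document vector:

    # Sketch of the document vector formula: dj = pj0*t0 + ... + pjn*tn.
    import numpy as np

    n_topics, embed_dim = 4, 50

    raw_weights = np.random.randn(n_topics)              # unnormalized document weight vector
    topic_matrix = np.random.randn(embed_dim, n_topics)  # one column per topic vector t0..tn

    p_j = np.exp(raw_weights) / np.exp(raw_weights).sum()  # softmax -> topic percentages
    d_j = topic_matrix @ p_j                               # weighted sum of shared topic vectors

    print(p_j.sum())   # 1.0: the weights behave like topic proportions
    print(d_j.shape)   # (50,): same dimensionality as the word vectors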

The weights differ across documents, while the topic vectors are shared among them. For more detail, you may check out Moody’s original blog.

Take Away

For the source code, you may check out this notebook.

  • As the author suggests, you should use LDA if you want human-readable topics. You may try lda2vec if you want to rework the topic model in other ways or predict topics over users.

About Me

I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. You can reach me via my Medium Blog, LinkedIn, or GitHub.

Reference

Moody, Christopher. “Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec”. 2016. https://arxiv.org/pdf/1605.02019.pdf
