The Vocabulary of Reddit

Reddit has become a major force for getting news and connecting people all over the world. If you’re unfamiliar with Reddit’s structure, the site is composed of pages called “subreddits”, each dedicated to a certain topic. For instance, “r/nba”, “r/gaming”, and “r/funny” are restricted to posts about the NBA, video games, and humorous content like images or comics, respectively.

Each subreddit develops its own community, culture, and language. With the power of NLP and machine learning, I will attempt to understand how each of these subreddits conceptualizes different topics. To do this, I’ll have to explain a little bit about data processing, basic NLP, and the concept of word embeddings. This post uses data from the historical Reddit comment corpus, which can be found on Google BigQuery.

How can I get my text data into the model?

Machine learning algorithms can’t deal with text, at least not directly. The text has to be converted to a numerical format before the model can begin generating results. There are many ways to do this conversion, and one of the easiest ways is to assign each word in the document to a number. Let’s use the example text, “I’m eating this pizza.”

The first thing I need to do is split this text into an array. This process is called tokenization. There’s not necessarily one correct way to tokenize, but for this example, let’s say that I will be removing punctuation and I’ll be using the space character as my delimiter. I would end up with the array ["I’m", "eating", "this", "pizza"]. Now that I have my tokenized array, I can assign each word to an index.

Word to index mapping.

Using this mapping, I apply it to my array to get [0, 1, 2, 3]. Great!
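Here’s a minimal sketch of that tokenization and mapping step in Python, assuming we split on the space character and strip sentence punctuation while keeping apostrophes (as described above):

```python
import re

text = "I'm eating this pizza."

# Strip sentence punctuation (keeping apostrophes so "I'm" stays intact),
# then split on the space character to tokenize.
tokens = re.sub(r"[.,!?]", "", text).split(" ")
print(tokens)  # ["I'm", 'eating', 'this', 'pizza']

# Assign each token an integer index (every token is unique in this example).
word_to_index = {word: i for i, word in enumerate(tokens)}
print([word_to_index[word] for word in tokens])  # [0, 1, 2, 3]
```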

But there are some issues with our approach. Converting my text directly into increasing integers implies that there is a comparative relationship between my words. Remember, the model has no idea where these integers came from. They could be ages, grades, anything! So the model will begin to interpret “this” as being twice the value of “eating”, when in reality the numbers were arbitrarily chosen. Another issue arises with capitalization. What if I see the word “i’m” in another piece of text? Should I assign it a new index, or should it map to index 0? This is more of a design question than an implementation issue, because sometimes we want to preserve proper nouns. In this example, however, we’ll simply make everything lowercase.

To fix our mapping issue, we’ll introduce a concept called one-hot encoding. One-hot encoding simply means that instead of representing each of my words as an integer, I will represent each word as a vector. These vectors are special because every element has the value ‘0’ except for one (which we choose) that has the value ‘1’. Conveniently, we don’t have to change our word-to-index mapping; we can reuse the indices we chose earlier as the positions of our ‘1’ values. To be clear, let’s see an actual example with our lowercased, one-hot encoded vectors.

i’m -> [1 0 0 0]

eating -> [0 1 0 0]

this -> [0 0 1 0]

pizza -> [0 0 0 1]

Which would result in a now 2D document array of:

[[1 0 0 0], [0 1 0 0], [0 0 1 0], [0 0 0 1]]
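As a quick sketch, those one-hot vectors can be built directly from the word-to-index mapping (this is just one way to do it by hand; libraries like scikit-learn also provide encoders):

```python
import numpy as np

word_to_index = {"i'm": 0, "eating": 1, "this": 2, "pizza": 3}
vocab_size = len(word_to_index)

def one_hot(word):
    # All zeros, with a single 1 at the word's index.
    vector = np.zeros(vocab_size, dtype=int)
    vector[word_to_index[word]] = 1
    return vector

document = ["i'm", "eating", "this", "pizza"]
encoded = np.array([one_hot(word) for word in document])
print(encoded)
# [[1 0 0 0]
#  [0 1 0 0]
#  [0 0 1 0]
#  [0 0 0 1]]
```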

Simple enough, right? Why did we do this? Now each of our numerical word representations is on equal footing. There’s no sense of ordering or comparison between any pair of vectors. This seems like the right approach, but we can actually do better before we get started, and that opens up the awesome world of word embeddings.

What is a word embedding and how do I use it?

We have our one-hot encoded vectors, but we lost a lot of information about our document in the process. Language is such an interesting problem, because words have more inherent meaning than just the characters that make them up. We have synonyms, antonyms, figures of speech, and numerous other areas of complication to consider. Word embeddings begin to solve some of these issues that arise when discussing NLP problems.

Currently, we’re only storing the fact that words are distinct, nothing more. You may also notice a problem that arises with a large vocabulary. Right now, we only have 4 words, but what if I had 40,000? Or 100,000? I’d be building a lot of very sparse vectors (each with 99,999 zeros and a single one). What if I also wanted to know whether two words are similar or different? How could I improve the way I’m generating my vectors?

Let’s say we have a 2D coordinate plane and our x-axis measures “food-relatedness” and the y-axis measures “pronoun-ness”. I want to plot all four of my words (i’m, eating, this, pizza) on this plane in some way. Maybe it will result in something like this:

2D Representation of Word Embeddings.

“Pizza” has a high food value and a low pronoun value, while “I’m” has a high pronoun value and a low food value. By plotting these points, I can see that “eating” and “pizza” are more closely related than “pizza” and “this” in terms of their context. I also only need 2 values to represent each of these words (one for each dimension, x and y) instead of the 4 I needed with one-hot encoding. This is the basic concept of word embeddings. By plotting each word, we can capture more complex relationships between words and shrink the arrays that store the information.
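To make that concrete, here’s an illustrative sketch in Python. The coordinates below are invented for the example (they are not learned values); the point is that similarity between words can now be measured directly from their vectors:

```python
import numpy as np

# Hypothetical 2D embeddings: (food-relatedness, pronoun-ness).
embeddings = {
    "i'm":    np.array([0.1, 0.9]),
    "eating": np.array([0.7, 0.2]),
    "this":   np.array([0.2, 0.3]),
    "pizza":  np.array([0.9, 0.1]),
}

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction, 0.0 means unrelated.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["eating"], embeddings["pizza"]))  # high (~0.99)
print(cosine_similarity(embeddings["pizza"], embeddings["this"]))    # lower (~0.64)
```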

In a real word embedding, it is more common to see hundreds of dimensions instead of just two. Obviously, this can be difficult to visualize, but the concept from the 2D example still holds: each of our words is represented by, say, a 300-element vector, and we can plot each of those points. The major question that arises is: how are these values calculated? This is where the machine learning comes in. There are multiple ways to calculate a word embedding, and many different libraries to use. Word2Vec and GloVe are popular options that offer pre-trained embeddings (meaning the vectors have already been calculated for you by Google and Stanford, respectively). These pre-trained embeddings are usually derived from a web crawl of text, Wikipedia, Twitter, or other large text corpora. They are a good option if your project requires a general understanding of language, because they pull from such diverse sources. In this project, though, we are directly concerned with how certain subreddits understand concepts, so we won’t be able to use a pre-trained embedding.
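For reference, if a general-purpose embedding did fit your use case, loading one takes only a couple of lines. A sketch using Gensim’s downloader API (the model name below is one of the available GloVe variants; the first call downloads the vectors):

```python
import gensim.downloader as api

# Load pre-trained GloVe vectors trained on Wikipedia + Gigaword.
glove = api.load("glove-wiki-gigaword-100")

# Query the most similar words to an input term.
print(glove.most_similar("pizza", topn=5))
```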

I won’t go too far into the details of how Word2Vec operates, as it’s a topic for another blog post, but at a high level, it trains a shallow neural network to scan each of the documents passed in and learn the overall context for each word. Words used in similar contexts will be grouped more closely together. For instance, if I saw the text “I am eating pizza” and “I am eating pancakes”, the words “pizza” and “pancakes” are being used in the exact same context, which would cause the network to relate the two concepts. The underlying architectures (skip-gram and continuous bag-of-words) are very powerful for text, but they deserve their own post to cover them fully.

Methodology

We can download the comment history for any subreddit and tokenize it, giving us a 20-million-element array, with each element representing one tokenized comment. We’ll be using Word2Vec, and more specifically, the Python implementation from Gensim. The hyperparameters of the model will vary depending on the corpus being used, but I used a 300-dimensional embedding. Usually, this will require more data than a lower-dimensional embedding. Once the model has been trained, we’re able to query it for different information. We can start slow by checking the NBA subreddit data and seeing the closest embeddings to a few input words.
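Here’s a sketch of that training step with Gensim. `tokenized_comments` stands in for the array of tokenized comments described above (loading it is left out), and the hyperparameters shown are illustrative rather than the exact values used:

```python
from gensim.models import Word2Vec

# tokenized_comments: a list where each element is one comment, already
# tokenized into lowercase strings, e.g.
# [["pop", "rested", "the", "starters"], ["lebron", "is", "the", "goat"], ...]
tokenized_comments = load_tokenized_comments()  # hypothetical helper

model = Word2Vec(
    sentences=tokenized_comments,
    vector_size=300,  # 300-dimensional embedding ("size" in older Gensim versions)
    window=5,         # context words considered on each side of the target word
    min_count=10,     # ignore words that appear fewer than 10 times
    workers=4,        # training threads
)

# Query the trained embedding for the closest vectors to a few input words.
print(model.wv.most_similar("pop", topn=6))
print(model.wv.most_similar("goat", topn=6))
```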

Sample output from the Reddit r/nba word embeddings.

There are actually a few amazing findings in these results. Remember that this model looks at every word ever written in r/nba’s history. Take an input word like “pop”. It has a very common meaning in English, usually used as a verb, but in the NBA, Gregg Popovich is one of the most respected coaches in the league. Each of the top 6 closest vectors is related to either him or the San Antonio Spurs (the team he coaches). “Goat” is a very similar situation. Usually we would be referring to the animal, but here it stands for “Greatest of All Time”, as you can see from the embeddings. Sam Hinkie, the ex-GM of the Philadelphia 76ers, famously dubbed his team’s rebuilding strategy “The Process”. Even terms that aren’t actually words (“LeDecline”) map to similar ideas; each of these is a play on LeBron James’ name, an inside joke that the community has created over the years.

This has amazing implications. For any text corpus, we can immediately gain a greater understanding of concepts and what people actually mean when they communicate. We can also generate lists based on these concepts. Let’s say I wanted to know who r/nba thought was the GOAT. I can take a list of players and compare them to the embedding for “GOAT”. The closest players are the most relevant to that term (and Brian Scalabrine).
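As a sketch, that ranking can be computed straight from the trained model’s vectors; the player tokens below are just examples and only work if those exact tokens appear in the model’s vocabulary:

```python
# "model" is the Word2Vec model trained in the sketch above.
players = ["jordan", "lebron", "kobe", "kareem", "scalabrine"]

# Rank the players by cosine similarity to the "goat" embedding.
ranked = sorted(
    players,
    key=lambda player: model.wv.similarity("goat", player),
    reverse=True,
)
print(ranked)
```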

I can also try words like “MVP”:

Or I can train using the r/NFL corpus:

There are a few amazing things happening here. First, in the “Eagles” column, the model was not only able to recognize that the Eagles are a team, but it was also able to identify each of the other NFC East teams. It can identify nicknames: “OBJ” refers to Odell Beckham Jr., the Giants’ star wide receiver. It can detect typos and abbreviations, as we see from the “BB” column, where we are presented with all of the common misspellings of Bill Belichick’s last name (with the correct spelling first). The last column refers to Carolina QB Cam Newton’s nickname, “Superman”. He also dabs.

A lot.

I hope this analysis has demonstrated the power that word embeddings can bring to text and topic analysis. Understanding customers, comments, reviews, and any other kind of text data can be enhanced with this approach.
