The Stanford Sentiment Treebank (SST): Studying sentiment analysis using NLP

Original article was published by Jerry Wei on Artificial Intelligence on Medium

A quick guide to the Stanford Sentiment Treebank (SST), one of the most well-known datasets for sentiment analysis.

Published in 2013, “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank” introduced the Stanford Sentiment Treebank (SST). SST is widely regarded as an important dataset because it tests how well an NLP model handles sentiment analysis. Let’s go over this fascinating dataset.

Predicting levels of sentiment from very negative to very positive (- -, -, 0, +, ++) on the Stanford Sentiment Treebank. Image credits to Socher et al., the original authors of the paper.

The task. SST targets the task of sentiment analysis, in which models must determine the sentiment expressed by a piece of text. For example, this could mean deciding whether restaurant reviews are positive or negative. Here are some made-up examples that display a range of positivity and negativity in their sentiment:

This was the worst restaurant I have ever had the misfortune of eating at.

The restaurant was a bit slow in delivering their food, and they didn’t seem to be using the best ingredients.

This restaurant is pretty decent; its food is acceptable considering the low prices.

This is the best restaurant in the Western Hemisphere, and I will definitely be returning for another meal!

Based on these examples, sentiment analysis may seem like an easy task. However, many nuances make it difficult to accurately analyze a phrase’s sentiment. Linguistic anomalies such as negation, sarcasm, and using negative terms in a positive way are especially difficult for NLP models to handle. Take the following examples:

I do not hate this restaurant. (Negation)

I just love being served cold food! (Sarcasm)

The food is unnervingly unique. (Negative words being positive)

As you can see from these examples, it’s not as easy as just looking for words such as “hate” and “love.” Instead, models have to take into account the context in order to identify these edge cases with nuanced language usage. With all the complexity necessary for a model to perform well, sentiment analysis is a difficult (and therefore proper) task in NLP.
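To make the point concrete, here is a minimal sketch of the naive keyword approach. The word lists and the `keyword_sentiment` function are hypothetical, made up for this illustration; they are not part of SST or of any real sentiment lexicon.

```python
# Toy word lists chosen for illustration only; not a real sentiment lexicon.
POSITIVE_WORDS = {"love", "best"}
NEGATIVE_WORDS = {"hate", "worst", "unnervingly"}


def keyword_sentiment(text: str) -> str:
    """Label text by counting positive and negative keywords, ignoring context."""
    # Split on whitespace and strip basic punctuation from each token.
    tokens = [token.strip(".,!?").lower() for token in text.split()]
    positive_hits = sum(token in POSITIVE_WORDS for token in tokens)
    negative_hits = sum(token in NEGATIVE_WORDS for token in tokens)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return "neutral"


examples = [
    "I do not hate this restaurant.",       # negation
    "I just love being served cold food!",  # sarcasm
    "The food is unnervingly unique.",      # negative word used positively
]

for sentence in examples:
    print(f"{keyword_sentiment(sentence):>8}  {sentence}")
```

Under these toy word lists, the counter labels the negated sentence as negative, the sarcastic one as positive, and the third as negative. In each case the label follows the keyword and misses the intended meaning.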

Compiling the dataset. For SST, the authors decided to focus on movie reviews from Rotten Tomatoes. By scraping movie reviews, they ended up with a total of 10,662 sentences, half of which were negative and the other half positive. After converting all of the text to lowercase and removing non-English sentences, they used the Stanford Parser to split the sentences into phrases, ending up with a total of 215,154 unique phrases.
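To see how one sentence can yield several phrases, here is a small sketch that lists the text under every node of a parse tree. The nested-tuple tree format and the `collect_phrases` helper are assumptions made for this illustration. The tree is written by hand rather than produced by the Stanford Parser.

```python
from typing import List, Union

# A parse tree is either a single word (str) or a pair of subtrees (left, right).
Tree = Union[str, tuple]


def collect_phrases(tree: Tree, phrases: List[str]) -> str:
    """Return the text spanned by `tree`, appending every subtree's text to `phrases`."""
    if isinstance(tree, str):
        text = tree
    else:
        left, right = tree
        text = collect_phrases(left, phrases) + " " + collect_phrases(right, phrases)
    phrases.append(text)
    return text


# Hand-written binary tree for "not very good"; a real parser would produce this.
tree = ("not", ("very", "good"))
phrases: List[str] = []
collect_phrases(tree, phrases)
print(phrases)
# ['not', 'very', 'good', 'very good', 'not very good']
```

A binary tree over n words has 2n - 1 nodes, so even this three-word snippet contributes five phrases. That is why the number of phrases in the dataset is so much larger than the number of sentences.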

Labeling interface used to annotate SST — annotators used a slider to select the degree to which a phrase was positive or negative. Image credits to Socher et al., the original authors of the paper.

And how was SST annotated? The authors turned to Amazon Mechanical Turk: they presented the phrases to workers in random order and asked them to indicate the sentiment, and the degree of that sentiment, for each phrase using a slider. The slider allows for up to 25 different levels of sentiment (see the figure below for details on this), and the authors used the annotations to define fine-grained and binary versions of the task. In the fine-grained version of SST, there are 5 classes (very negative, negative, neutral, positive, very positive), and the model presented in the paper achieves 45.7% accuracy. In the binary version of SST, there are just 2 classes (positive vs. negative), and the presented model achieves 85.4% accuracy.
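The step from slider annotations to class labels can be sketched as follows. This assumes the slider value has been normalized to a score in [0, 1] and that the five classes are equal-width bins over that range. Both choices are assumptions for illustration, and the function names are hypothetical; the exact cut-offs used by the authors are not stated here.

```python
from typing import Optional

FINE_LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]
# Assumed upper bounds of five equal-width bins over a score in [0, 1].
UPPER_BOUNDS = [0.2, 0.4, 0.6, 0.8, 1.0]


def to_fine_grained(score: float) -> str:
    """Map a normalized sentiment score to one of the 5 fine-grained classes."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    for label, upper_bound in zip(FINE_LABELS, UPPER_BOUNDS):
        if score <= upper_bound:
            return label
    raise AssertionError("unreachable for scores in [0, 1]")


def to_binary(score: float) -> Optional[str]:
    """Map a score to positive/negative; neutral items are dropped (None)."""
    label = to_fine_grained(score)
    if label == "neutral":
        return None
    return "positive" if "positive" in label else "negative"


for score in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(score, to_fine_grained(score), to_binary(score))
```

In this sketch the binary task simply drops neutral items and merges the two negative classes and the two positive classes.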

Histogram of sentiment annotations for phrases of a given length (n-gram length). Shorter phrases are more likely to be neutral, while annotations for longer phrases are spread more evenly across sentiment levels. Image credits to Socher et al., the original authors of the paper.

The impact of SST. As one of the leading datasets for sentiment analysis, SST is often among the benchmarks used to evaluate new language models such as BERT and ELMo, typically as part of demonstrating strong performance across a variety of linguistic tasks.

SST will continue to be the go-to dataset for sentiment analysis for many years to come, and it is certainly one of the most influential NLP datasets to be published.

Further reading: