Discover the Sentiment of a Reddit Subgroup Using a RoBERTa Model

Original article was published on Deep Learning on Medium


How do you feel when you log in to your social media accounts and read the first post? Does it put a smile on your face, or does it make you sad or angry? My experience is mixed, but most of the time social media posts make me happy. How? Well, we can’t control what other people post, but we can control what we see on our social media accounts.

If you join a group with a lot of negative comments, you will read those comments more often, and that can make you angry and sad. Leave such toxic groups before they take a toll on your mental health.

So if I asked you to find the toxic groups among your social media accounts, could you do it?

Well, this article will help you build a model that summarizes the sentiment of all the posts or comments in a group, so you can leave toxic groups before they make you feel like quitting social media.

We will use Reddit as the example for this article. I will analyze the Reddit subgroups I follow and check whether they have a high count of negative comments.

Why Reddit?

Social media platforms like Reddit and Twitter let you access users’ posts and comments via an API, so you can test and apply the sentiment analysis model on Reddit data.

This article is divided into two parts. In the first part, I build a RoBERTa model. In the second part, we analyze the sentiment of Reddit subgroups.

Building of RoBERTa Model

We will fine-tune the pre-trained RoBERTa model on a Twitter dataset. You can find the data here. The dataset contains tweets labeled with positive or negative sentiment. I chose binary sentiment data to increase accuracy. Binary predictions are also easy to interpret, which makes the decision process easier.

The Transformers library from the Hugging Face team gives us access to the pre-trained RoBERTa model. RoBERTa performs exceptionally well on the General Language Understanding Evaluation (GLUE) NLP benchmark, where its scores are comparable to human-level performance. Learn more about RoBERTa here. Learn more about the Transformers library here.

Now, let’s go over different parts of the code in sequence.

Part 1. Configuration and Tokenization

The pre-trained model comes with a configuration file that contains information such as the number of layers and the number of attention heads. The RoBERTa configuration file is shown below.

{
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 50265
}

Tokenization means converting Python strings or sentences into arrays or tensors of integers, which are indices into the model vocabulary. Each model has its own tokenizer, which prepares the data for that model.

from transformers import RobertaTokenizer

roberta_tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
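To make the idea of "strings in, integer indices out" concrete, here is a toy sketch in plain Python. It is only an illustration: the vocabulary and the `toy_encode` helper are made up for this example, and the real RoBERTa tokenizer splits text into byte-level subword units instead of whole words. Only the ids 0, 1, and 2 for the start, padding, and end tokens mirror the configuration file above.

```python
# Toy vocabulary (hypothetical). RoBERTa's real vocabulary has 50265 entries.
toy_vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3,
             "i": 4, "love": 5, "this": 6, "group": 7}


def toy_encode(sentence):
    """Lowercase, split on whitespace, and look up each word's index."""
    words = sentence.lower().split()
    # Words missing from the vocabulary map to the unknown token.
    ids = [toy_vocab.get(word, toy_vocab["<unk>"]) for word in words]
    # Wrap the sequence in the start and end tokens.
    return [toy_vocab["<s>"]] + ids + [toy_vocab["</s>"]]


encoded = toy_encode("I love this subreddit")
print(encoded)  # [0, 4, 5, 6, 3, 2]
```

The word "subreddit" is not in the toy vocabulary, so it becomes the unknown index 3. A subword tokenizer avoids most unknowns by breaking rare words into smaller known pieces.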

Note: The final version of the code is available at the end of this article.

Part 2. Data Pre-Processing

In this section, we use the tokenizer to tokenize the sentences, or input data. The model requires special tokens at the beginning and end of each sequence: <s> and </s> for RoBERTa (the counterparts of BERT’s [CLS] and [SEP]).

import tensorflow as tf
import tensorflow_datasets as tfds

max_length = 128  # example value; must not exceed 512

def convert_example_to_feature(review):
    # Newer transformers versions replace pad_to_max_length=True
    # with padding="max_length" (plus truncation=True).
    return roberta_tokenizer.encode_plus(review,
                                         add_special_tokens=True,
                                         max_length=max_length,
                                         pad_to_max_length=True,
                                         return_attention_mask=True,
                                         )

def encode_examples(ds, limit=-1):
    # Prepare lists so that we can build the final TensorFlow dataset from slices.
    input_ids_list = []
    attention_mask_list = []
    label_list = []
    if limit > 0:
        ds = ds.take(limit)
    for review, label in tfds.as_numpy(ds):
        bert_input = convert_example_to_feature(review.decode())
        input_ids_list.append(bert_input['input_ids'])
        attention_mask_list.append(bert_input['attention_mask'])
        label_list.append([label])
    # map_example_to_dict is not shown here; see the full code at the end of the article.
    return tf.data.Dataset.from_tensor_slices((input_ids_list,
                                               attention_mask_list,
                                               label_list)).map(map_example_to_dict)

max_length: The maximum sentence length allowed, counted in tokens. Its value should not exceed 512.

pad_to_max_length: If True, the tokenizer appends padding tokens (<pad> for RoBERTa) to the end of the sentence until it reaches max_length.

Fine-tuning the RoBERTa model needs three inputs.

1. input_ids: The indices of the sentence’s tokens in the model vocabulary.

2. attention_mask: It distinguishes real tokens (including the special tokens) from padding tokens.

3. label: The sentiment label of each sentence.
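The sketch below shows, in plain Python, roughly what the padding and the attention mask look like for one sentence. It is a simplified illustration and not the tokenizer’s actual implementation: `build_inputs` is a hypothetical helper, the token ids 100, 200, and 300 are arbitrary placeholders, and only the special token ids (0, 1, 2) come from the configuration file shown earlier.

```python
# Special token ids taken from the RoBERTa configuration file.
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2


def build_inputs(token_ids, max_length):
    """Add <s> and </s>, truncate, pad to max_length, and build the mask."""
    # Leave room for the two special tokens.
    body = token_ids[: max_length - 2]
    input_ids = [BOS_ID] + body + [EOS_ID]
    # 1 marks a real token (special tokens included), 0 marks padding.
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


features = build_inputs([100, 200, 300], max_length=8)
print(features["input_ids"])       # [0, 100, 200, 300, 2, 1, 1, 1]
print(features["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Every sentence ends up with the same length, which lets us stack them into one tensor, and the mask tells the model to ignore the padded positions.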

Part 3. Model Training and Fine-Tuning

The Transformers library loads the pre-trained RoBERTa model in one line of code. The weights are downloaded and cached on your local machine. We then fine-tune the model for the NLP task at hand.

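import tensorflow as tf
from transformers import TFRobertaForSequenceClassification

# Example values; see the fine-tuning pointers below for suggested ranges.
learning_rate = 1e-5
number_of_epochs = 3

model = TFRobertaForSequenceClassification.from_pretrained("roberta-base")
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)
# The model outputs raw logits, so the loss applies the softmax itself.
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy("accuracy")
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
# `metrics` is the ModelMetrics callback instance defined in Part 4.
model.fit(ds_train_encoded,
          epochs=number_of_epochs,
          validation_data=ds_test_encoded,
          callbacks=[metrics])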

Use the pointers below to fine-tune the model.

1. A learning_rate between 1e-06 and 1e-05 gives a good accuracy score.

2. Increasing the batch size can improve accuracy, but it also increases training time.

3. The pre-trained model does not need many epochs of training. Between 3 and 10 epochs works fine.

Part 4. Accuracy, F1 Score, and Saving the Model

The accuracy score helps you detect bias and variance in a model, and most model improvements are judged by it. Use the accuracy score when the data is balanced and the F1 score when it is unbalanced. The F1 score tells us whether the model has learned all classes equally well. We will use a Keras callback to calculate the F1 score of the model.

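import os

import tensorflow as tf
from sklearn.metrics import f1_score

class ModelMetrics(tf.keras.callbacks.Callback):
    def on_train_begin(self, logs=None):
        # Counts epochs so each saved model gets its own folder.
        self.count_n = 1

    def on_epoch_end(self, epoch, logs=None):
        # Save a copy of the model after every epoch.
        os.mkdir('/create/folder/' + str(self.count_n))
        self.model.save_pretrained('/folder/to/save/model/' + str(self.count_n))
        # Depending on the transformers version, predict() may return an
        # object with a `.logits` field rather than the raw logits array.
        y_val_pred = tf.nn.softmax(self.model.predict(ds_test_encoded))
        y_pred_argmax = tf.math.argmax(y_val_pred, axis=1)
        # testing_sentences is the test DataFrame with a 'label' column.
        testing_copy = testing_sentences.copy()
        testing_copy['predicted'] = y_pred_argmax.numpy()
        f1_s = f1_score(testing_sentences['label'],
                        testing_copy['predicted'])
        print('\n f1 score is :', f1_s)
        self.count_n += 1

metrics = ModelMetrics()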

We use the save_pretrained method to save the model. You can save the model after each epoch, then keep the one with the highest accuracy and delete the rest.
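To see why the F1 score matters on unbalanced data, here is a small plain-Python sketch. The labels and predictions are made up for illustration (they are not results from the model in this article), and `accuracy` and `f1_binary` are hypothetical helpers that follow the standard definitions rather than the scikit-learn implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up unbalanced data: 10 examples of class 1 and 90 of class 0.
y_true = [1] * 10 + [0] * 90
# A model that finds only 2 of the 10 class-1 examples.
y_pred = [1] * 2 + [0] * 8 + [0] * 90

print(round(accuracy(y_true, y_pred), 2))   # 0.92
print(round(f1_binary(y_true, y_pred), 2))  # 0.33
```

The accuracy looks great because the majority class dominates, while the F1 score exposes that the model misses most of the minority class.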

Analyze sentiment of Reddit subgroup

Once the RoBERTa model is built, we can detect the sentiment of a Reddit subgroup. These are the steps to follow.

1. Fetch the comments of the Reddit subgroup. Learn more about how to fetch comments from Reddit here.

2. Check the sentiment of each comment using your RoBERTa model.

3. Count the positive and negative comments of the Reddit subgroup.

4. Repeat the process for different Reddit subgroups.

You can find a detailed explanation of steps 1 and 2 here. I selected my five favorite subreddits for analysis. We analyze the comments of the top 10 weekly posts. I limited the number of comments because of the Reddit API request limits.

The count of positive and negative comments will give you the overall sentiment of the Reddit subgroup. I have implemented these steps in the code. You can find this code at the end of this article.
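The counting step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code linked at the end of the article: the logits are made-up numbers standing in for the model’s output on five comments, `softmax` and `summarize` are hypothetical helpers, and the mapping of class index 0 to negative and 1 to positive is an assumption that depends on how your training data is labeled.

```python
import math
from collections import Counter

# Assumption: class index 0 means negative and 1 means positive.
LABELS = {0: "negative", 1: "positive"}


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def summarize(all_logits):
    """Count how many comments fall into each sentiment class."""
    counts = Counter()
    for logits in all_logits:
        probs = softmax(logits)
        predicted = probs.index(max(probs))  # argmax
        counts[LABELS[predicted]] += 1
    return counts


# Made-up logits for five comments, one [negative, positive] pair each.
sample_logits = [[2.1, -1.3], [-0.4, 1.7], [0.3, 0.2], [-2.0, 2.5], [1.1, -0.9]]
counts = summarize(sample_logits)
print(counts["positive"], counts["negative"])  # 2 3
```

With more negative than positive comments, this made-up subgroup would lean negative. Running the same count per subgroup gives the numbers for a comparison chart.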

Graph of sentiment analysis of my favorite 5 Reddit subgroups.