Sentiment Analysis on Tweets with NLP Achieving 96% accuracy

Original article was published on Artificial Intelligence on Medium

Natural Language Processing

Sentiment Analysis on Tweets with NLP Achieving 96% accuracy

Photo by Sara Kurfeß on Unsplash

One of the most complicated uses of AI is NLP (Natural Language Processing). The reason why, is that language differs from all the other kinds of data. While numerical and image data, for example, has the advantage of being objective, written text is relative. Its interpretation varies cross-culturally: the same sentence can signify completely different things for two different persons.

Of course, data analysts have found functioning solutions that allow a machine to have a ‘basic understanding’ on the content of a text.

Sentiment Analysis

The first step in learning NLP models is creating a sentiment analysis. Given some text, the AI should be trained to recognize if its meaning is positive or negative. Practically, you can use this tool to understand the overall customer perception of products or news, especially when there are no numerical measures (such as ratings), but only text.

Machine Learning vs. Deep Learning

In my experience, we have to use different tools depending on the complexity of the problem. In movie reviews, given the complexity of every single element, we would need to use neural networks, but for tweets, we can use machine learning with very promising results.

nltk

I will be using a machine learning library specialized for NLP, called nltk. I prefer using scikit-learn for creating machine learning models, but it is a library specialized for tabular data rather than natural language processing.

Steps

In this article, I will follow the following steps:

  1. Importing Modules
  2. Creating Features and Labels (encoding)
  3. Creating train and test (splitting)
  4. Using the model: Naive Bayes Classifier
  5. Performance Estimation

As usual, AI is not standardized. There are several ways of reaching the same results. In a regular NLP we would need to preprocess the data in this way:

  • Tokenization: splitting sentences into individual words
  • Encoding: converting these individual words to numbers
  • Creating the NLP Model

I will be using dictionaries, therefore I won’t be encoding text into numbers, rather into Boolean values.

1. Importing Modules

!pip install nltk
import nltk
#per risolvere un bug, altrimenti da errore
nltk.download('punkt')
#tokenizer
def format_sentence(sent):
return({word: True for word in nltk.word_tokenize(sent)})

2. Creating Features and Labels

In this section I will separately import the datasets containing both positive and negative tweets, preprocessing them separately.

The dataset in question contains a sample of 617 positive tweets and 1387 negative tweets, for a total of 2000 tweets.

# X + y
total = open('/content/drive/My Drive/Colab Notebooks/Projects/20200602_Twitter_Sentiment_Analysis/pos_tweets.txt')
Xy_pos = list()
#word tokenization
for sentence in total:
#print(sentence)
Xy_pos.append([format_sentence(sentence), 'pos'])
#saves the sentence in format: [{tokenized sentence}, 'pos]
#Xy_pos
# X + y
total = open('/content/drive/My Drive/Colab Notebooks/Projects/20200602_Twitter_Sentiment_Analysis/neg_tweets.txt')
Xy_neg = list()
#word tokenization
for sentence in total:
#print(sentence)
Xy_neg.append([format_sentence(sentence), 'neg'])
#saves the sentence in format: [{tokenized sentence}, 'pos]
#Xy_neg

As a result, I will have a dictionary nested in a list:

[dictionary, sentiment]

If we have a look at the first element of the positive_tweets, we can see how our data has been encoded:

Xy_pos[0]
[{"''": True,
"'m": True,
',': True,
'.': True,
':': True,
'Ballads': True,
'Cellos': True,
'Genius': True,
'I': True,
'``': True,
'and': True,
'by': True,
'called': True,
'cheer': True,
'down': True,
'iPod': True,
'listening': True,
'love': True,
'music': True,
'my': True,
'myself': True,
'of': True,
'playlist': True,
'taste': True,
'to': True,
'up': True,
'when': True},
'pos']

3. Creating train and test

We are working in the domain of Supervised Learning. Unfortunately, compared with the analysis of tabular data, at least for the nltk tool, preprocessing acts a bit differently.

If I had to analyze tabular data I would create my portions divided into:

X_train, y_train, X_test, y_test

In this particular case I need to merge labels and features, remaining with:

Xy_train, Xy_test

The reason for this change is that the nltk model only accepts one parameter, X_train and y_train merged: in our case Xy_train

Splitting

To create my train and test portion I will have to merge pos and neg as they contain both the train and the test portion. Then I will split the combined datasets.

def split(pos, neg, ratio):
train = pos[:int((1-ratio)*len(pos))] + neg[:int((1-ratio)*len(neg))]
test = pos[int((ratio)*len(pos)):] + neg[int((ratio)*len(neg)):]
return train, test
Xy_train, Xy_test = split(Xy_pos, Xy_neg, 0.1)

4. Using the model: Naive Bayes Classifier

It is now time to create the Machine Learning model.

from nltk.classify import NaiveBayesClassifier#encoded thrugh dictionaries
classifier = NaiveBayesClassifier.train(Xy_train)
classifier.show_most_informative_features()
Most Informative Features no = True neg : pos = 20.6 : 1.0 awesome = True pos : neg = 18.7 : 1.0 headache = True neg : pos = 18.0 : 1.0 beautiful = True pos : neg = 14.2 : 1.0 love = True pos : neg = 14.2 : 1.0 Hi = True pos : neg = 12.7 : 1.0 fan = True pos : neg = 9.7 : 1.0 Thank = True pos : neg = 9.7 : 1.0 glad = True pos : neg = 9.7 : 1.0 lost = True neg : pos = 9.3 : 1.0

The model has associated one value to each word in the dataset. It will perform a calculation on all the words contained in every tweet it has to analyze, and then make its estimation: positive or negative.

5. Performance Estimation

As we can see, we obtained (by excess approximation) a 96% accuracy on all the tweets! Amazing result.

from nltk.classify.util import accuracy
print(accuracy(classifier, Xy_test))
0.9562326869806094