Original article was published by Rashida Nasrin Sucky on Artificial Intelligence on Medium
A Complete Sentiment Analysis Algorithm in Python with Amazon Product Review Data: Step by Step
An NLP Project Using Python’s scikit-learn Library
In today’s world, sentiment analysis can play a vital role in any industry. Classifying tweets, Facebook comments, or product reviews with an automated system can save a lot of time and money, and a well-trained system can also lower the probability of error. In this article, I will explain a sentiment analysis task using the Amazon product review dataset.
I am going to use Python and a few of its libraries. Even if you haven’t used these libraries before, you should be able to follow along. If this is new to you, please copy each step of the code into your notebook and look at the output for a better understanding. These are the tools I used:
- Pandas library
- scikit-learn library
- Jupyter Notebook as an IDE.
Dataset and task Overview
I am going to use a product review dataset as I mentioned earlier. The dataset contains Amazon baby product reviews.
Please download the dataset for yourself from this link if you want to practice with it.
It has three columns: name, review, and rating. The reviews are text data, and the ratings are numbers from 1 to 5, where 1 is the worst and 5 is the best.
Our job is to classify the reviews as positive or negative. Let’s have a look at the dataset. Here we read the file and show the first five entries to examine the data.
import pandas as pd
products = pd.read_csv('amazon_baby.csv')
products.head()
In real life, data scientists rarely get data that is clean and already prepared for machine learning models. For almost every project, you have to spend time cleaning and processing the data. So, let’s clean the dataset first.
One important data cleaning step is to get rid of null values. Let’s check how many rows with null values we have in the dataset.
In this dataset, we have to work with all three columns, and all three of them are crucial. If a row is missing a value in any column, that row is of no use to us.
len(products) - len(products.dropna())
We have null values in 1147 rows. Now, check how many rows we have in total.
len(products)
We have a total of 183531 rows. So, if we delete all the rows with null values, we will still have a sizable dataset to train an algorithm. So, let’s drop the null values.
products = products.dropna()
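If you want to see which columns the null values come from before dropping them, a small sketch like the one below may help. The DataFrame `toy` and its contents are made up for illustration; only the column names match the dataset.

```python
import pandas as pd

# Hypothetical toy data with the same three columns as the review dataset
toy = pd.DataFrame({
    "name": ["Crib sheet", None, "Bottle warmer", "Stroller"],
    "review": ["Soft and fits well", "Broke in a week", None, "Easy to fold"],
    "rating": [5, 1, 4, 5],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Number of rows that contain at least one missing value
rows_with_missing = int(toy.isnull().any(axis=1).sum())
print(rows_with_missing)

# Drop those rows and confirm how many remain
toy_clean = toy.dropna()
print(len(toy_clean))
```

Running the same three checks on `products` should show whether the missing values sit mostly in the name column or in the review column.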
Every value in the review column needs to be a string. If any value has another type, it will cause trouble in later steps.
Now, we will make sure the review in every row is a string. Converting the whole column with astype(str) leaves the values that are already strings untouched and converts anything else, without looping over the rows one by one.
products['review'] = products['review'].astype(str)
As we are doing sentiment analysis, it is important to tell our model what counts as a positive sentiment and what counts as a negative one.
In our rating column, we have ratings from 1 to 5. We can define 1 and 2 as bad reviews and 4 and 5 as good reviews.
What about 3?
3 is in the middle. It’s neither good nor bad, just average. But we want to classify reviews as good or bad. So, I decided to get rid of all the 3’s.
This choice depends on your employer’s or your own idea of good and bad. If you think a 3 belongs in the good review slot, just put it there. But I am getting rid of them.
products = products[products['rating'] != 3]
We will denote positive sentiments as 1 and negative sentiments as 0. Let’s write a function ‘sentiment’ that returns 1 if the rating is 4 or more and 0 otherwise. Then, apply the function and create a new column that represents the positive and negative sentiment as 1 or 0.
def sentiment(n):
    return 1 if n >= 4 else 0
products['sentiment'] = products['rating'].apply(sentiment)
Look, we have the ‘sentiment’ column added at the end now!
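The same labeling can also be written without a helper function. Here is a minimal sketch on made-up ratings (the DataFrame `toy` is hypothetical) that also prints the share of each label, which is worth checking because review data is often dominated by positive ratings.

```python
import pandas as pd

# Hypothetical ratings, including a neutral 3
toy = pd.DataFrame({"rating": [1, 2, 3, 4, 5]})

# Drop the neutral reviews, as in the article
toy = toy[toy["rating"] != 3].copy()

# Vectorized equivalent of the sentiment function: True -> 1, False -> 0
toy["sentiment"] = (toy["rating"] >= 4).astype(int)
print(toy["sentiment"].tolist())

# Share of each label, useful for spotting class imbalance
print(toy["sentiment"].value_counts(normalize=True))
```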
First, we need to prepare the training features by combining the ‘name’ and ‘review’ columns into one single column. Write a function ‘combined_features’ that joins the two columns. Then, apply the function and create a new column ‘all_features’ that contains the strings from both the name and review columns.
def combined_features(row):
    return row['name'] + ' ' + row['review']
products['all_features'] = products.apply(combined_features, axis=1)
You can see the ‘all_features’ column at the end. Now, we are ready to develop the sentiment classifier!
Develop the sentiment classifier
Here is the process step by step.
We need to define the input variable X and the output variable y.
X should be the ‘all_features’ column and y should be our ‘sentiment’ column
X = products['all_features']
y = products['sentiment']
We need to split the dataset so that there is a training set and a test set.
The ‘train_test_split’ function from the scikit-learn library is helpful. The model will be trained using the training dataset and the performance of the model can be tested using the test dataset.
By default, ‘train_test_split’ splits the data in a 75/25 proportion: 75% for training and 25% for testing. If you want a different proportion, you need to set the test_size (or train_size) parameter.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
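As an illustration of changing the proportion, here is a small sketch on made-up data. The lists `texts` and `labels` are hypothetical, and the stratify argument is an optional extra that is not used in this article; it keeps the ratio of positive to negative labels the same in both parts.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 texts, 15 positive and 5 negative
texts = [f"review {i}" for i in range(20)]
labels = [1] * 15 + [0] * 5

# test_size=0.2 overrides the default 25% test share;
# stratify=labels keeps the label ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

print(len(X_tr), len(X_te))    # sizes of the two parts
print(sum(y_tr), sum(y_te))    # positives in each part
```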
I am going to use ‘CountVectorizer’ from the scikit-learn library. CountVectorizer builds a vocabulary from the words in the training text and turns each string into a vector of word counts. Import CountVectorizer, fit it on the training data, and then use the fitted vectorizer to transform the test data.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)
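To see what the vectorizer produces, it can help to run it on a couple of short sentences first. This is only a sketch; the sentences and the name `cv_demo` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini training and test sets
train_docs = ["great product love it", "bad product broke fast"]
test_docs = ["love this great stroller"]

cv_demo = CountVectorizer()
# fit_transform learns the vocabulary from the training text only
train_matrix = cv_demo.fit_transform(train_docs)
# transform reuses that vocabulary; words not seen in training are ignored
test_matrix = cv_demo.transform(test_docs)

# One column per vocabulary word, in alphabetical order
print(sorted(cv_demo.vocabulary_))
# One row per document, holding the count of each vocabulary word
print(train_matrix.toarray())
print(test_matrix.toarray())
```

The test sentence contains ‘this’ and ‘stroller’, which never appear in the training sentences, so they get no column. That is why the vectorizer is fitted on the training data only and then reused for the test data.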
Let’s dive into the model itself. This is the most fun part. We will use logistic regression, as this is a binary classification problem. Let’s do the necessary imports and fit the model to our training data.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
model = LogisticRegression()
model.fit(ctmTr, y_train)
The logistic regression model is now trained with the training data.
Use the trained model above to predict the sentiments for the test data. If we pass the test features, it will predict the output y that is the sentiment data.
y_pred_class = model.predict(X_test_dtm)
array([1, 1, 1, ..., 1, 1, 0], dtype=int64)
This is the output for the test data. As you remember, we used 1 for good reviews and 0 for bad reviews.
Use the accuracy_score function to get the accuracy on the test data. It compares the predicted ‘sentiment’ with the original ‘sentiment’ data and returns the fraction of predictions that are correct.
accuracy_score(y_test, y_pred_class)
The accuracy score I got for this data on the test set is 84%, which is very good.
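An accuracy figure is easier to judge next to a simple baseline. The sketch below uses made-up labels and predictions (`y_true` and `y_hat` are hypothetical, not results from this dataset) to compare a classifier with the strategy of always predicting the most common label, and prints a confusion matrix to show where the mistakes fall.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for ten test reviews
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_hat = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1]

# Fraction of predictions that match the true labels
acc = accuracy_score(y_true, y_hat)

# Baseline: always predict the most common label (1 in this toy data)
baseline = accuracy_score(y_true, [1] * len(y_true))

# Rows are true labels (0, 1); columns are predicted labels (0, 1)
cm = confusion_matrix(y_true, y_hat)

print(acc, baseline)
print(cm)
```

In this toy case the classifier and the baseline reach the same accuracy, so when most reviews are positive it is worth running the same comparison on y_test and y_pred_class.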
This simple sentiment analysis classifier can be useful for many other types of datasets. It can be used in real-world projects and businesses as well. The dataset we used here resembles a real business dataset. Please try this technique with some other dataset.