Sentiment Analysis using the Amazon Fine Food Reviews dataset !!!

Original article was published on Artificial Intelligence on Medium


As we all know, Amazon is one of the biggest e-commerce websites. You can buy many varieties of products there, such as media, baby products, gourmet food, groceries, and health and personal-care items, and for every product we buy we can write a review as well as rate the product. If we like a product, we usually give a positive review and a rating of 4 or 5.

The better the product, the better the rating, the better the review !!!

In this tutorial we are going to look into the Amazon fine food review dataset, which consists of reviews from customers describing their experience and satisfaction with the products.

  • Objective: Determine whether a given review is positive or negative.
  • Dataset: For this tutorial, we are going to use the Amazon fine food review dataset provided by “Kaggle”, one of the largest machine learning platforms, where you can get standardised datasets.

→ This dataset contains the reviews written by people who purchased different food products from Amazon, along with their ratings.

Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 — Oct 2012
Number of Attributes/Columns in data: 10

Libraries that we are going to use:

1. Pandas: a Python library that provides tools for data storage and data manipulation.

2. NumPy: used to perform high-level mathematical computation.

3. NLTK: used for natural language processing.

4. Sklearn (scikit-learn): a very useful toolkit featuring various algorithms. Here we are going to use it to extract text and convert it into an n-dimensional vector.

5. CountVectorizer: an elegant class from scikit-learn that allows us to convert text documents into a matrix of token counts.
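As a quick toy illustration (not from the original article) of the token-count matrix CountVectorizer produces:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this product is great', 'this product is awful']
cv = CountVectorizer()
matrix = cv.fit_transform(docs)   # sparse matrix of token counts

print(sorted(cv.vocabulary_))     # ['awful', 'great', 'is', 'product', 'this']
print(matrix.toarray())           # one row per document, one column per token
```

Each row counts how often each vocabulary word appears in the corresponding document; this is the representation we will later feed to a classifier.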


Connecting our code to the SQLite database allows us to fetch the data and perform our operations on it.
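The connection code itself did not survive in this copy of the article. A self-contained sketch, using an in-memory SQLite database with a toy Reviews table standing in for the Kaggle download (which is a file, typically opened with sqlite3.connect('database.sqlite')):

```python
import sqlite3
import pandas as pd

# In-memory database standing in for the Kaggle SQLite file
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER, Text TEXT)")
con.executemany("INSERT INTO Reviews VALUES (?, ?, ?)",
                [(1, 5, 'great product'), (2, 3, 'it was okay'), (3, 1, 'bad product')])

# Fetch everything except the neutral (Score == 3) rows
data = pd.read_sql_query("""SELECT * FROM Reviews WHERE Score != 3""", con)
print(data.shape)
```

On the real dataset the same read_sql_query call returns the full review table minus the neutral rows.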

  • Now let’s talk about the approach we are going to follow. Amazon has a rating scale ranging from 1 to 5. We will treat 1 and 2 as negative ratings, meaning the consumer is not satisfied with the product, and 4 and 5 as positive ratings, indicating that the product is good.

Now let’s jump into the filtering process:

Here, we are going to eliminate the rows with “3” as their rating.

data = pd.read_sql_query("""SELECT * FROM Reviews WHERE Score != 3""", con)

In the above code we are extracting both positive and negative reviews while excluding the rows with rating 3, with the help of a simple SQL query.

We are going to convert our score into positive and negative labels:
Score 1–2 : Negative
Score 4–5 : Positive
Score 3 : Already removed

def convert(x):
    if x > 3:
        return 1
    return 0

# Updating our Score column with 0 & 1 (Negative or Positive)
data['Score'] = data['Score'].map(convert)

[2] Exploratory Data Analysis

→ Data Cleaning (Deduplication)
After having a look at the data we discovered many duplicates, for example:
* The same person reviewed many products with the same timestamp and gave similar reviews and ratings.
* It is very important to remove this kind of duplicate data in order to get unbiased results from the analysis.

Removing the duplicate data.

sorted_data = data.sort_values(by=['ProductId'], axis=0, ascending=True)
clean = sorted_data.drop_duplicates(subset={'ProductId', 'UserId', 'ProfileName', 'Score', 'Time', 'Summary'}, keep='first', inplace=False)

→Removing null values from our dataset:-
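The check itself is a one-liner with isnull().sum(). A toy sketch, with a small hypothetical frame standing in for the deduplicated data (on the real data you would call it on the cleaned DataFrame):

```python
import pandas as pd

# Hypothetical miniature of the deduplicated reviews frame
clean = pd.DataFrame({'ProductId': ['B001', 'B002'],
                      'Score': [1, 0],
                      'Text': ['good product', 'bad product']})

print(clean.isnull().sum())   # count of missing values per column
```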



Id                        0
ProductId                 0
UserId                    0
ProfileName               0
HelpfulnessNumerator      0
HelpfulnessDenominator    0
Score                     0
Time                      0
Summary                   0
Text                      0
dtype: int64

→ There are no null values.

[3] Text Pre-processing

→ Pre-processing the review text
Now that we have finally removed all the duplicates and unwanted data, we need to do some pre-processing before creating our model.

→ We might have some HTML tags and URLs in our text, so first we are going to remove those.

import re

def removeUrl(text):
    pr = re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    return pr

Now we are going to expand contracted words, for example:

won’t=will not

there’re=there are

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

1. Stemming → It’s the process of reducing inflected words to their word stem, base or root form, which is generally a written word form.
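A quick toy illustration of stemming with NLTK’s SnowballStemmer, the stemmer used later in this tutorial:

```python
from nltk.stem import SnowballStemmer

snow = SnowballStemmer('english')

# Inflected forms of the same word collapse to one stem
print([snow.stem(w) for w in ['tasted', 'tastes', 'tasty']])
print(snow.stem('running'))
```

Note that a stem (e.g. “tast”) need not be a dictionary word; what matters is that related forms map to the same token.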

2. Stop Words Removal →
A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
Why remove stopwords?
Removing stopwords can potentially improve performance, as fewer and only meaningful tokens are left.

import nltk
from nltk.corpus import stopwords as nltk_stopwords

snow = nltk.stem.SnowballStemmer('english')
stopwords = set(nltk_stopwords.words('english'))  # requires nltk.download('stopwords')

def removeStopWord(word):
    token = word.split(" ")  # converting the string to tokens (a list of words), like ["this", "is", "token"]
    removestop = [snow.stem(x) for x in token if x not in stopwords]  # removing stopwords and also doing stemming
    removed = " ".join(removestop)  # joining the list back into a sentence
    return removed

→ Now let’s create a list that contains all the cleaned text after applying all the filters we have discussed.

from tqdm import tqdm
from bs4 import BeautifulSoup

preprocessed_reviews = []
for line in tqdm(final.Text.values):
    line = BeautifulSoup(line, 'lxml').get_text()  # remove HTML tags
    line = removeUrl(line)                         # removing URLs
    line = decontracted(line)                      # converting words like { aren't -> are not }
    line = re.sub(r'[0-9]+', '', line)             # remove numbers from the string
    line = line.lower()                            # converting every word to lower case
    line = re.sub(r'[^a-z0-9\s]', '', line)        # clean all special characters
    line = removeStopWord(line)                    # removing stop words and doing stemming
    preprocessed_reviews.append(line.strip())      # adding the cleaned text to the list after stripping spaces
  • Now we have our clean data in preprocessed_reviews. Let’s move on to modelling.
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from sklearn import model_selection
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
  • Splitting the data into train and test sets:
X = preprocessed_reviews
y = np.array(final['Score'])
X_1, X_test, y_1, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(X_1, y_1, test_size=0.3)

# Model Building:-

Logistic Regression:-

from sklearn.linear_model import LogisticRegression
c= 10**-4

→ Using Bi-Grams:-

count_vect = CountVectorizer(min_df=5, ngram_range=(1, 2))
X_train = count_vect.fit_transform(X_train)
X_cv = count_vect.transform(X_cv)
X_test = count_vect.transform(X_test)

scalar = StandardScaler(with_mean=False)
X_train = scalar.fit_transform(X_train)
X_test = scalar.transform(X_test)
X_cv = scalar.transform(X_cv)

lr = LogisticRegression()
lr.fit(X_train, y_train)
predictions = lr.predict(X_test)
print('AUC: ', roc_auc_score(y_test, predictions))

output: AUC: 0.8851395124283175
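Note that the constant c = 10**-4 defined earlier is never actually passed to LogisticRegression. If we wanted to tune the regularization strength C on the validation split, it could look like the sketch below, where make_classification stands in for the vectorized review data (an assumption for illustration only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic data standing in for the CountVectorizer features
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3, random_state=0)

best_c, best_auc = None, -1.0
for c in [10**-4, 10**-2, 1, 10**2]:
    lr = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
    auc = roc_auc_score(y_cv, lr.predict_proba(X_cv)[:, 1])  # score each C on the CV split
    if auc > best_auc:
        best_c, best_auc = c, auc

print('best C:', best_c, 'CV AUC:', round(best_auc, 3))
```

The C giving the best validation AUC would then be used to refit the final model on the full training split.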
Now we have a good AUC of about 0.885 with logistic regression.

Let’s check it manually:

text=['I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most','Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".']
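The prediction step itself is a single call: the new raw texts must go through the same fitted vectorizer and scaler before lr.predict. A self-contained toy sketch of that pipeline (the tiny training sentences here are hypothetical stand-ins for the trained model above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy labelled data standing in for the preprocessed reviews (1 = positive, 0 = negative)
train_text = ['good quality product love it', 'great taste will buy again',
              'terrible product arrived broken', 'awful taste not as described']
train_y = [1, 1, 0, 0]

count_vect = CountVectorizer()
scalar = StandardScaler(with_mean=False)
X = scalar.fit_transform(count_vect.fit_transform(train_text))
lr = LogisticRegression().fit(X, train_y)

# New raw reviews go through the SAME fitted transformers before predict()
text = ['love the great quality', 'broken and awful']
pred = lr.predict(scalar.transform(count_vect.transform(text)))
print(pred)
```

With the real fitted objects, `lr.predict(scalar.transform(count_vect.transform(text)))` is exactly the call that produces the [1 0] output shown below.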

output:-[1 0]

To view the entire code, visit this link:


No machine learning algorithm is perfect, but we managed to bring the AUC up to about 0.885 using logistic regression. We can further improve it using other algorithms.