Data Preparation and Text-Preprocessing on Amazon Fine Food Reviews

Here, I am going to show you data preparation and text pre-processing on the Amazon Fine Food Reviews dataset.

But before that, you need to know why text pre-processing is necessary.

Amazon Fine Food (Pic courtesy: Kaggle)

Why do we need to do text pre-processing?

Machine Learning models don't work with text data directly, so the text needs to be cleaned and converted into numerical vectors. This process is called text pre-processing.

Below are the basic steps that I am going to show you in this article.

Understanding the data: First of all, you need to see what the data is all about and what elements (stopwords, punctuation, HTML tags, etc.) are present in it.

Data Cleaning: In this step, I will remove all the unnecessary elements.

Techniques for encoding text data: There are a lot of techniques for encoding text data, but below are the ones I have mostly used while solving real-world problems.

  1. Bag of Words
  2. Bi-gram,n-gram
  3. TF-IDF
  4. Avg-Word2Vec

Now, let’s get started:

First, import all the necessary libraries:

import pandas as pd
import numpy as np
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

Now, let's load the Amazon Fine Food Reviews dataset:

data=pd.read_csv('./amazon-fine-food-reviews/Reviews.csv')

Checking the data

data.head() 
Amazon Fine Food Reviews Dataset

Checking the shape

data.shape

Output: (568454, 10)

Objective: Given a text review, predict whether the review is positive or negative.

But here I am doing only the data preparation and text pre-processing part.

So let’s go to the data preparation part.

Data Preparation:

Let’s first see the ‘Score’ column

data['Score'].value_counts()

Output:
5    363122
4     80655
1     52268
3     42640
2     29769

If you look at the 'Score' column, it has the values 1, 2, 3, 4 and 5. Our main objective is to predict whether a given review is positive or negative. Here, if we consider 1 and 2 as negative reviews and 4 and 5 as positive reviews, then logically 3 does not add any value to our objective. So, let's discard the rows where 'Score' = 3.

data=data[data['Score']!=3]

Now, data contains only the rows where 'Score' is 1, 2, 4 or 5.

Let's convert the score values into a class label, either 'positive' or 'negative'.

def xyz(x):
    if x > 3:
        return 'positive'
    else:
        return 'negative'

s = data['Score']
d = list(map(xyz, s))
data['Score'] = d
data

Now, I am going to show you how to remove duplicates and unwanted records. This stage requires some domain knowledge, because deciding what counts as a duplicate or an invalid record is one of the trickier parts of data preparation.

First, I checked for duplicates based on UserId, ProfileName, Time, Summary and Text (a user cannot post multiple reviews with the same text at exactly the same time, so such rows are duplicate entries). Any duplicates found this way are removed.

Also, HelpfulnessNumerator must be less than or equal to HelpfulnessDenominator, so I check for records that violate this and remove them.

data_f=data.sort_values('ProductId').drop_duplicates(subset=['UserId','ProfileName','Time','Summary','Text'],keep='first',inplace=False)
Final_Values=data_f[data_f['HelpfulnessDenominator']>=data_f['HelpfulnessNumerator']]
Final_Values

Now, let's move on to the text pre-processing stage.

Text-Preprocessing:

I am applying the text pre-processing to the 'Text' column shown above.

Before starting the text pre-processing, I want to explain a couple of concepts: stemming and stopwords.

Stemming: It is a technique that reduces words to their base or stem form (e.g. 'tasty' is reduced to the stem 'tasti', and 'tasteful'/'tastefully' are reduced to similar truncated stems).

Stopwords: These are common, low-information words; even if you remove them from a sentence, the semantic meaning of the text doesn't change much.

Example: 'This restaurant is good' (here 'This' and 'is' are stopwords).
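As a quick illustration (a minimal sketch using NLTK, separate from the main pipeline), here is what stemming and stopword removal look like on a toy example:

from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# run nltk.download('stopwords') once if the stopword list is not yet available
snow = SnowballStemmer('english')
stop = set(stopwords.words('english'))

# 'tasty' stems to 'tasti'; related words are reduced to similar truncated stems
print(snow.stem('tasty'), snow.stem('tastefully'))

sentence = 'This restaurant is good'
print([w for w in sentence.split() if w.lower() not in stop])  # ['restaurant', 'good']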

Checking for all stopwords

stop=set(stopwords.words('english'))
print(stop)

First, I checked a few samples of the text data to see what (HTML tags, punctuation, special characters, etc.) needs to be removed to get clean text. After checking a few samples, I found that some of them contain HTML tags, punctuation and special characters, so these need to be removed.

Below are the steps that I have done for pre-processing:

a. Remove HTML tags.

b. Remove punctuation and special characters.

c. Convert the words to lowercase.

d. Remove stopwords.

e. Finally, stem the words using the SnowballStemmer.

Below is the code for text pre-processing:

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import re
from bs4 import BeautifulSoup

stop = set(stopwords.words('english'))
snow = SnowballStemmer('english')

def decontracted(phrase):
    # expand common English contractions
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can't", "can not", phrase)
    phrase = re.sub(r"n't", " not", phrase)
    phrase = re.sub(r"'re", " are", phrase)
    phrase = re.sub(r"'s", " is", phrase)
    phrase = re.sub(r"'d", " would", phrase)
    phrase = re.sub(r"'ll", " will", phrase)
    phrase = re.sub(r"'ve", " have", phrase)
    phrase = re.sub(r"'m", " am", phrase)
    return phrase

preprocessed_reviews = []
for sentence in Final_Values['Text'].values:
    sentence = re.sub(r"http\S+", " ", sentence)            # remove URLs
    sentence = BeautifulSoup(sentence, 'lxml').get_text()   # remove HTML tags
    sentence = decontracted(sentence)
    sentence = re.sub(r"\S*\d\S*", " ", sentence)           # remove words containing digits
    sentence = re.sub(r"[^A-Za-z]+", " ", sentence)         # keep letters only (drops punctuation and special characters)
    # lowercase, remove stopwords and stem each remaining word
    sentence = ' '.join(snow.stem(e.lower()) for e in sentence.split() if e.lower() not in stop)
    preprocessed_reviews.append(sentence.strip())

Let's see how the text looks after pre-processing:

preprocessed_reviews[1700]

As you can see from the above, we get clean text after pre-processing.

Now, we are going to apply some techniques for text encoding.

Techniques for Text Encoding:

Bag of Words(BOW):

In BOW, we construct a dictionary containing the set of all unique words in our review corpus, and the frequency of each word in a review is counted. If there are d unique words in the corpus, then every review is represented by a vector of length d. These vectors are very sparse.

This is the basic concept of BOW.
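For example (a tiny illustrative sketch on made-up reviews, not the actual pipeline), this is what the dictionary and the count vectors look like:

from sklearn.feature_extraction.text import CountVectorizer

toy_reviews = ['pasta is tasty', 'pizza is not tasty at all']
count = CountVectorizer()
toy_bow = count.fit_transform(toy_reviews)

print(count.get_feature_names_out())  # the dictionary of unique words (get_feature_names() on older scikit-learn)
print(toy_bow.toarray())              # one count vector of length d per review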

For more info, I have provided the link at the very end of this article. Please go through it.

Now, let's apply BOW to my pre-processed text data.

from sklearn.feature_extraction.text import CountVectorizer
count=CountVectorizer()
Reviews_BOW=count.fit_transform(preprocessed_reviews)
print(Reviews_BOW[1])

Drawbacks of using BOW:

As you can see from the above output, it is a sparse matrix representation. Our main objective is for reviews with similar meaning to end up close to each other, but BOW doesn't capture the semantic meaning of the sentences.

Let me give you an example.

Suppose there are two reviews:

r1: The pasta is tasty.

r2: The pasta is not tasty.

As you can see, there is an obvious difference in the semantic meaning of these two reviews. Since the reviews are not similar, their corresponding vectors should not be close to each other. But in BOW, after stopword removal both sentences are reduced to 'pasta tasty', so they end up with exactly the same representation and their vectors are identical, which is not correct, as the small sketch below shows.
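Here is a quick sketch of that effect (hypothetical toy code, not part of the original pipeline):

from sklearn.feature_extraction.text import CountVectorizer

r1 = 'the pasta is tasty'
r2 = 'the pasta is not tasty'

stop_words = {'the', 'is', 'not'}   # 'not' is in NLTK's English stopword list
cleaned = [' '.join(w for w in r.split() if w not in stop_words) for r in (r1, r2)]
print(cleaned)                      # ['pasta tasty', 'pasta tasty']

bow = CountVectorizer().fit_transform(cleaned)
print(bow.toarray())                # identical vectors for two opposite reviews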

Let's go to our second text encoding technique, which is bi-grams and n-grams.

Bi-gram, n-gram:

A bi-gram is a pair of two consecutive words used when building the dictionary, a tri-gram is a sequence of three consecutive words, and in general an n-gram is a sequence of n consecutive words.

Scikit-learn's CountVectorizer has a parameter ngram_range; if it is set to (1,2), the dictionary contains both uni-grams and bi-grams.
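As a small illustration (a hypothetical toy corpus, not the review data), you can see how the dictionary grows when bi-grams are included:

from sklearn.feature_extraction.text import CountVectorizer

toy = ['pasta not tasty', 'pasta very tasty']
bigram_count = CountVectorizer(ngram_range=(1, 2))
bigram_count.fit(toy)

# uni-grams plus bi-grams such as 'not tasty' and 'very tasty'
print(bigram_count.get_feature_names_out())  # get_feature_names() on older scikit-learn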

Now, let's apply bi-grams to my pre-processed text data.

count=CountVectorizer(ngram_range=(1,2))
Bigram_Counts=count.fit_transform(preprocessed_reviews)
print(Bigram_Counts[1])

Drawbacks of Bi-gram,n-gram:

It has the same drawback as BOW: it still doesn't capture the semantic meaning of the text, and it also increases the dictionary size a lot.

Let's move on to our next text encoding technique, which is TF-IDF.

TF-IDF:

Term Frequency - Inverse Document Frequency (TF-IDF) gives less importance to words that occur in many reviews and more importance to words that are rare across reviews.

Term Frequency (TF) is the number of times a particular word (W) occurs in a review divided by the total number of words (Wr) in that review. The term frequency value ranges from 0 to 1.

Inverse Document Frequency (IDF) is calculated as log(N/n), where N is the total number of documents and n is the number of documents that contain the particular word. Here, a document is a review.

TF-IDF = TF * IDF = (W/Wr) * log(N/n)
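To make the formula concrete, here is a tiny worked example with made-up numbers: a word that occurs 2 times in a 10-word review and appears in 1,000 of 100,000 reviews.

import numpy as np

tf = 2 / 10                      # W / Wr
idf = np.log(100000 / 1000)      # log(N / n)
print(tf * idf)                  # roughly 0.92

# note: scikit-learn's TfidfVectorizer uses a smoothed variant of this IDF formula

Now, applying TF-IDF to the pre-processed reviews: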

counts=TfidfVectorizer()
cnt=counts.fit_transform(preprocessed_reviews)
print(cnt[1])

Drawbacks of TF-IDF:

Here, we just get a TF-IDF value for every word. It has the same drawback as BOW and bi-grams/n-grams: it still doesn't capture the semantic meaning of the text.

So, to actually capture semantic meaning, I will use Word2Vec.

Word2Vec:

Before going to Avg-Word2Vec, I am going to explain briefly how Word2Vec works.

To know more about Word2Vec and its mathematical intuition, I have provided some links at the end of this article.

Word2Vec captures the semantic meaning of words and the relationships between them: it learns these relationships from the corpus and represents each word as a dense vector.

I am using the gensim library's Word2Vec, which takes parameters such as min_count=5 (ignore words that appear fewer than 5 times), size=50 (produce vectors of length 50) and workers (the number of CPU cores used for training). A small sketch of training and querying a model is shown below.
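As a quick sketch (assuming gensim 3.x, where the dimensionality parameter is called size; in gensim 4.x it is vector_size and the vocabulary API differs), this is how you could train a small model and inspect the relationships it learns:

from gensim.models import Word2Vec

# tokenised sentences: one list of words per pre-processed review
sentences = [review.split() for review in preprocessed_reviews]

w2v_sketch = Word2Vec(sentences, min_count=5, size=50, workers=4)

# a 50-dimensional dense vector and the nearest words, assuming 'tasti' made it into the vocabulary
print(w2v_sketch.wv['tasti'])
print(w2v_sketch.wv.most_similar('tasti'))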

Average Word2Vec:

To compute Average Word2Vec, below are the steps to follow.

  1. Compute the Word2Vec vector for each word in the sentence
  2. Add up the vectors of all the words in the sentence
  3. Divide the sum by the number of words in the sentence

It's simply the average of the Word2Vec vectors of all the words in the review.

Below is the code to compute Average Word2Vec:

from gensim.models import Word2Vec

list_of_sentences = []
for sentence in preprocessed_reviews:
    list_of_sentences.append(sentence.split())

w2v_model = Word2Vec(list_of_sentences, min_count=5, size=50, workers=4)
w2v_words = list(w2v_model.wv.vocab)

sent_vectors = []
for sent in list_of_sentences:
    sent_vec = np.zeros(50)
    cnt_words = 0
    for word in sent:
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec = sent_vec + vec
            cnt_words = cnt_words + 1
    if cnt_words != 0:
        sent_vec = sent_vec / cnt_words
    sent_vectors.append(sent_vec)

print(len(sent_vectors))
print(sent_vectors[0])
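The averaged vectors can then be stacked into a feature matrix and used with any standard classifier; for example (a hypothetical next step, not covered in this article):

import numpy as np

# one 50-dimensional vector per review
X = np.array(sent_vectors)
print(X.shape)

# the 'positive'/'negative' labels from the prepared data can serve as the target
y = Final_Values['Score'].values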

Conclusion:

In this article, I have shown you several techniques for encoding text data into numerical vectors. Which technique is most appropriate for your model depends entirely on the structure of the data, the model you choose, the objective of the model and, most importantly, your business requirements.

I hope you've got a basic understanding of data preparation and text pre-processing techniques from this article.