
Original article was published by Janibasha Shaik on Deep Learning on Medium


Hotel Reviews Sentiment Analysis From Scratch To Deployment With Both Machine Learning And Deep Learning Algorithms

All you need to know about the impact of machine learning on unstructured data

Table Of Contents :

(i) Problem Statement

(ii) Motivation

(iii) Data Processing

(iv) Data Cleaning

(v) Machine Learning Model Building

(vi) Deep Learning Model building

(vii) Deployment

(viii) Conclusions

(ix) Future Scope

(x) References

Photo by Rhema Kallianpur on Unsplash

(i) Problem Statement :

Our objective is to classify whether a given hotel review is positive or negative.

So our problem statement is a binary classification problem.

(ii) Motivation :

If any business wants to survive in the market for a long period, its customers’ reviews are key indicators for that business. In the future NLP will play a very crucial role in business, because 9 out of 10 people use mobile phones for their daily shopping, social media posts, movie ticket booking, hospital appointments, hotel booking, etc. Every day petabytes of unstructured data are generated on the web; if we can extract useful information from that unstructured data, it can be useful for business growth.

So I chose the hotel domain to extract information from unstructured data (text reviews) and contribute my knowledge towards the development of society.

While solving the hotel review analysis problem I learned a lot and gained more knowledge.

(iii) Data Processing :

I took the dataset from Kaggle: https://www.kaggle.com/anu0012/hotel-review?select=train.csv

The dataset is divided into two CSV files.

The first CSV file, train.csv, contains 38932 data points and 5 features.

The second CSV file, test.csv, contains 29404 data points and 5 features.

I use train.csv for training the model.

I split train.csv into a training dataset and a validation dataset.

import pandas as pd

# Reading the csv file

df = pd.read_csv('/content/drive/My Drive/archive/train.csv')

df.head()

Our data frame contains 5 features.

Of these 5 features, only two are useful for sentiment analysis: Description and Is_Response.

So we can drop the remaining features from the data frame.

df.drop(['User_ID', 'Browser_Used', 'Device_Used'], axis=1, inplace=True)

# After dropping the features that are not useful

df.head()

Now we have two features in the data frame.

(iv) Data Cleaning :

Both features contain unstructured data, so we need to convert the unstructured data into structured data using text-to-vector conversion techniques.

Before the text-to-vector conversion we need to clean the Description feature, because it contains special characters, stop words, and numbers, which are not useful for the sentiment classification.

For text cleaning I used the 're' module.

Is_Response contains two classes, happy and not happy. We need to convert these two classes into numerical form because a machine learning model only understands numerical values.

Before converting, we need to check whether there are any null values present in the data frame.

df.isnull().sum()

Description 0
Is_Response 0

We don’t have any null values in either feature.

Now we want to convert the Is_Response feature into numerical form.

We could use a label encoder to convert the Is_Response feature, but instead I create a dictionary and map it onto the Is_Response feature.
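For comparison, the LabelEncoder route mentioned above would look like the short sketch below. This is illustrative only (the variable name encoded is my own); the dictionary mapping that follows is what is actually used.

# Illustrative alternative, not used here: sklearn's LabelEncoder would
# also map 'happy' -> 0 and 'not happy' -> 1 (labels are sorted alphabetically).
from sklearn.preprocessing import LabelEncoder
encoded = LabelEncoder().fit_transform(df['Is_Response'])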

# Creating a dictionary of class labels

label_map = {'happy': 0, 'not happy': 1}

# Mapping the dictionary onto the Is_Response feature

df['class'] = df['Is_Response'].map(label_map)

df.head()

Now we have a class feature:

happy : 0

not happy : 1

Now we can drop the Is_Response feature.

df.drop(['Is_Response'], axis=1, inplace=True)

df.head()

We successfully converted one feature into numerical form.

Now we need to convert the Description feature into vectors. Before the vector conversion we need to clean out the special characters, tags, numbers, etc.

For text preprocessing I used NLTK together with the re module.

# Text Preprocessing

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')

ps = PorterStemmer()
corpus = []
for i in range(0, len(df)):
    review = re.sub('[^a-zA-Z]', ' ', df['Description'][i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
    review = ' '.join(review)
    corpus.append(review)

After the text preprocessing we have a corpus that is completely cleaned.

corpus[15]

stay elan th th octob like much return day trip vega anoth night unassum appear hotel score heavili great locat spotlessli clean classic design comfort bedroom friendli manag staff jorg colleagu front desk revel untir enthusiast help recommend great restaur place visit etc manag particularli help let us complimentari room post check freshen even flight home long day enjoy southern cal sunshin

If you observe the above corpus, there are no special characters, tags, numbers, or upper-case letters, and the words are completely stemmed.

Now that the Description feature is completely cleaned, we can apply Bag of Words, a TfidfVectorizer, or word2vec to convert the text into vectors.

TfidfVectorizer: to convert text into vectors

from sklearn.feature_extraction.text import TfidfVectorizer

# I take ngram_range = (1, 3) [combinations of 1 to 3 words]

# I take the top 10000 most frequent words

tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 3))

tfidf_word = tfidf.fit_transform(corpus).toarray()

tfidf_class = df['class']

Now we have both features in numerical form:

tfidf_word is the TF-IDF vector representation

tfidf_class is the 0/1 class label
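As a quick sanity check (an illustrative snippet, not in the original notebook), we can peek at a few of the fitted n-grams; note that on scikit-learn versions older than 1.0 the method is get_feature_names() instead of get_feature_names_out().

# Sanity check (illustrative, not in the original notebook)
print(tfidf_word.shape)                    # (38932, 10000)
print(tfidf.get_feature_names_out()[:10])  # a few of the 1-3 word n-grams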

df['class'].value_counts()

0    26521

1    12411

We have an imbalanced dataset.

To balance the dataset we could do upsampling, downsampling, or create synthetic data; an illustrative upsampling sketch is shown below.

But I want to experiment, so I train the models on the imbalanced dataset.
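For reference, a minimal upsampling sketch using sklearn.utils.resample. It is NOT applied anywhere in this article, and the variable names (majority, minority, balanced_df) are my own.

# Illustrative only: upsample the minority class to match the majority class.
from sklearn.utils import resample

majority = df[df['class'] == 0]
minority = df[df['class'] == 1]
minority_upsampled = resample(minority,
                              replace=True,             # sample with replacement
                              n_samples=len(majority),  # match the majority class size
                              random_state=42)
balanced_df = pd.concat([majority, minority_upsampled])
print(balanced_df['class'].value_counts())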

(v) Building Machine Learning Model :

tfidf_word.shape,tfidf_class.shape

((38932, 10000), (38932,))

# splitting data into train and test data set

from sklearn.model_selection import train_test_split

x2_train,x2_test,y2_train,y2_test=train_test_split(tfidf_word,tfidf_class,test_size=0.33)

x2_train.shape,y2_train.shape

((26084, 10000), (26084,))

x2_test.shape,y2_test.shape

((12848, 10000), (12848,))

Applying Multinomial Naive Bayes

from sklearn.naive_bayes import MultinomialNB

classifier_2=MultinomialNB()

classifier_2.fit(x2_train,y2_train)

Predicting on the test data with the trained model

y2_pred=classifier_2.predict(x2_test)

array([0, 0, 0, …, 1, 0, 0])

Model Score on training and test data set

# On training

classifier_2.score(x2_train,y2_train)

0.8793896641619383

#On test

classifier_2.score(x2_test,y2_test)

0.8709526774595268

Calculating the confusion matrix of the trained model

from sklearn import metrics

metrics.confusion_matrix(y2_test,y2_pred)

([[8278, 480],

[1178, 2912]])
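Because the dataset is imbalanced, accuracy alone can hide a weak minority class. A short illustrative addition (not in the original notebook) that reports per-class precision and recall from the same predictions:

# Illustrative addition: per-class precision/recall is more informative
# than accuracy on an imbalanced dataset (0 = happy, 1 = not happy).
print(metrics.classification_report(y2_test, y2_pred, target_names=['happy', 'not happy']))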

Hyperparameter Optimization

import numpy as np

# alpha is the smoothing hyperparameter in Multinomial Naive Bayes
previous_score = 0
hyper_classifier = MultinomialNB(alpha=0.1)
for alpha in np.arange(0, 1, 0.1):
    sub_classifier = MultinomialNB(alpha=alpha)
    sub_classifier.fit(x2_train, y2_train)
    y_pred = sub_classifier.predict(x2_test)
    score = metrics.accuracy_score(y2_test, y_pred)
    if score > previous_score:
        previous_score = score
        hyper_classifier = sub_classifier
    print("Alpha: {}, Score : {}".format(alpha, score))

Hyperparameter result

Alpha: 0.0, Score : 0.8709526774595268

Alpha: 0.1, Score : 0.8709526774595268

Alpha: 0.2, Score : 0.8709526774595268

Alpha: 0.30000000000000004, Score : 0.8709526774595268

Alpha: 0.4, Score : 0.8709526774595268

Alpha: 0.5, Score : 0.8709526774595268

Alpha: 0.6000000000000001, Score : 0.8709526774595268

Alpha: 0.7000000000000001, Score : 0.8709526774595268

Alpha: 0.8, Score : 0.8709526774595268

Alpha: 0.9, Score : 0.8709526774595268

My Observations:

On training data set we got 0.8793896641619383

On test data set we got 0.8709526774595268

We have an imbalanced dataset. According to theory, when one class dominates the other, the model becomes biased towards the majority class, yet we got nearly the same results on both the train and test datasets!

(vi) Building Deep Learning Model :

Above, we already converted the Is_Response feature into the class feature.

Now I want to convert the text into vector form with the help of the TensorFlow documentation.

# Text preprocessing

Function for removing special characters

def remove_special_chars(text):
    for remove in map(lambda r: re.compile(re.escape(r)),
                      [',', ':', '=', '&', ';', '%', '$', '@', '^', '*', '(', ')',
                       '{', '}', '[', ']', '|', '/', '\\', '>', '<', '-', '!', '?',
                       '.', "'", '--', '#', '"']):
        text.replace(remove, '', inplace=True)
    return text

Function for removing tags

def remove_tags(text):
    return re.compile(r" <[^>]+> ").sub(" ", text)

Function for removing numbers

def remove_num(text):
    return re.sub(r'[0-9]+', ' ', text)

I created three individual functions to clean the text.

# copy the df into new variable

final_df=df.copy( )

Calling remove_tags function

final_df.Description=final_df.Description.apply(lambda x : remove_tags(x))

Calling remove_num function

final_df.Description=final_df.Description.apply(lambda x : remove_num(x))

Calling remove_special_chars function

remove_special_chars(final_df.Description)

After calling all three functions we have text cleaned of special characters, numbers, and tags.

Now we want to tokenize the sentences; after tokenization we get words.

We want to lower-case the sentences, so lower=True.

We use ' ' (space) as the separator for word splitting.

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=5000, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')

# tensorflow documentation

tokenizer.fit_on_texts() :

Updates internal vocabulary based on a list of texts.

In the case where texts contains lists, we assume each entry of the lists to be a token.

tokenizer.fit_on_texts(final_df["Description"])

# tensorflow documentation

tokenizer.texts_to_sequences() :

Transforms each text in texts to a sequence of integers.

X = tokenizer.texts_to_sequences(final_df["Description"])

Padding makes all sequences the same length. If the sequences have the same length, we can train on batches of data points, which reduces the training time.

from tensorflow.keras.preprocessing.sequence import pad_sequences

X = pad_sequences(X, maxlen=100)

Now we have the converted vectors.

Vector of a text:

X[500]

array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 26, 298, 2, 6, 47, 7, 3, 425, 14, 21, 3, 566, 71, 1483, 3, 22, 1022, 12, 21, 133, 2107, 2568, 4, 1, 179, 63, 6, 5, 3, 501, 7, 28, 14, 2, 6, 749, 1, 12, 33, 3, 528, 6, 56, 469, 371, 3, 109, 88, 11, 3, 12, 21, 381, 3, 301, 1194, 1483, 3, 109, 102, 32, 1, 123, 118, 651, 52, 94, 193, 70, 1295, 7, 1, 1248, 496, 6, 289, 1, 12, 2533, 120, 88], dtype=int32)

If we observe the above vector carefully, the sentence at index 500 has been converted into a vector; each word in the sentence is represented by its word index in the corpus vocabulary.
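To double-check what a padded sequence encodes, we can map the non-zero indices back to words with the tokenizer's index_word dictionary. This is an illustrative snippet, not part of the original notebook.

# Illustrative check: decode the padded sequence at index 500 back into words.
decoded = [tokenizer.index_word[idx] for idx in X[500] if idx != 0]
print(' '.join(decoded))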

y = final_df['class']

Vocabulary size :

vocab_size = len(tokenizer.word_index) + 1

vocab_size

70925

We need to save the tokenizer into a pickle file for deployment.

import pickle

# Saving the fitted tokenizer
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
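At deployment time the same tokenizer has to be loaded back from this file; a minimal sketch (the variable name loaded_tokenizer is my own):

# Loading the saved tokenizer back from disk
with open('tokenizer.pickle', 'rb') as handle:
    loaded_tokenizer = pickle.load(handle)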

Splitting the data into train and test datasets

X = the converted vectors

y = the class of each review

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3, random_state = 24)

Building a single-layer LSTM for training

The LSTM model needs an Embedding layer and a Dense layer for training.

The Embedding layer turns positive integers (indexes) into dense vectors of fixed size.

Our problem statement belongs to binary classification, so we use a sigmoid activation in the Dense layer.

The number of units in the LSTM cell is a hyperparameter; I take 100 units.

We need to specify the embedding output size; I take an embedding vector size of 40, which is also a hyperparameter.

Optimizer=adam, metric= accuracy

Model building

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

embedding_vector_features = 40
model = Sequential()
model.add(Embedding(vocab_size, embedding_vector_features, input_length=100))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 100, 40) 2837000
_________________________________________________________________
lstm (LSTM) (None, 100) 56400
_________________________________________________________________
dense (Dense) (None, 1) 101
=================================================================
Total params: 2,893,501
Trainable params: 2,893,501
Non-trainable params: 0
_________________________________________________________________

Model compilation

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Fitting the model

batch_size and epochs are hyperparameters; I take batch_size = 32 and epochs = 20 manually.

history=model.fit(X_train, Y_train, batch_size=32, epochs=20, validation_data=(X_test, Y_test), )

Epoch 16/20
852/852 [==============================] - 44s 51ms/step - loss: 0.0529 - accuracy: 0.9835 - val_loss: 0.7259 - val_accuracy: 0.8307
Epoch 17/20
852/852 [==============================] - 43s 51ms/step - loss: 0.0400 - accuracy: 0.9880 - val_loss: 0.8985 - val_accuracy: 0.8256
Epoch 18/20
852/852 [==============================] - 43s 51ms/step - loss: 0.0387 - accuracy: 0.9882 - val_loss: 0.8848 - val_accuracy: 0.8247
Epoch 19/20
852/852 [==============================] - 44s 51ms/step - loss: 0.0268 - accuracy: 0.9917 - val_loss: 1.0238 - val_accuracy: 0.8229
Epoch 20/20
852/852 [==============================] - 44s 51ms/step - loss: 0.0329 - accuracy: 0.9905 - val_loss: 0.8495 - val_accuracy: 0.8091

If you observe the above training process, at the end of the 20th epoch:

training accuracy = 0.99 , val_accuracy = 0.80

training loss = 0.032 , val_loss = 0.84

As the training loss decreases, the validation loss keeps increasing, which means our model is overfitting. We know that our dataset is imbalanced, and if we train a neural network on an imbalanced dataset it becomes biased towards the majority class.

We didn’t see much of an accuracy difference with the machine learning model, but with the deep learning model we can clearly see the impact of the imbalanced dataset.

To reduce the overfitting we can use dropout while training; dropout is a regularization technique for neural networks.

Evaluating the trained model with new data

# New review

string11 = '''This hotel is awesome I love the service Anthony is really a great guy you see at the front desk! It is close to everything and is wonderful for kids I love it. The best hotel ever but wonderful cleanliness and quality great hotel for couples and singles.'''

# Evaluating the trained model on the new review

x_1 = tokenizer.texts_to_sequences([string11])
x_1 = pad_sequences(x_1, maxlen=100)
model.predict(x_1)

Output

array([[0.00015311]], dtype=float32)

We know that 0 is the positive (happy) class and 1 is the negative (not happy) class.

The output value is close to 0 and the given review is positive, so the model’s prediction is also positive and the trained model works well on this new data. However, we cannot conclude that it is a generalized model; we would need to do a lot of hyperparameter optimization before drawing that conclusion.
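For readability, the sigmoid output can be turned into a label with a threshold; the helper below is my own illustrative addition (the 0.5 threshold and label strings are my choices, not from the original code).

# Illustrative: convert the sigmoid probability into a class label.
# 0 = happy (positive), 1 = not happy (negative); 0.5 is an assumed threshold.
prob = float(model.predict(x_1)[0][0])
label = 'not happy' if prob >= 0.5 else 'happy'
print(prob, label)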

Building an LSTM with a dropout rate of 0.5

From the above training we know that our LSTM model is overfitting. To avoid this we can do hyperparameter optimization and add Dropout and BatchNormalization layers.

Dropout regularizes the neural network, so to address the overfitting problem I add only Dropout layers with a 50% dropout rate; the rest of the network is the same as the LSTM network above. We could also add BatchNormalization, but I want to experiment with the data, so I use dropout only.

# Model building

from tensorflow.keras.layers import Dropout

embedding_vector_features = 40
model_2 = Sequential()
model_2.add(Embedding(vocab_size, embedding_vector_features, input_length=100))
model_2.add(Dropout(0.5))
model_2.add(LSTM(100))
model_2.add(Dropout(0.5))
model_2.add(Dense(1, activation='sigmoid'))

# Model compilation
model_2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model_2.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 100, 40) 2837000
_________________________________________________________________
dropout (Dropout) (None, 100, 40) 0
_________________________________________________________________
lstm_1 (LSTM) (None, 100) 56400
_________________________________________________________________
dropout_1 (Dropout) (None, 100) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 101
=================================================================
Total params: 2,893,501
Trainable params: 2,893,501
Non-trainable params: 0

# Fitting the model

history_2=model_2.fit(X_train, Y_train,
batch_size=32,
epochs=20,
validation_data=(X_test, Y_test),
)

Epoch 16/20
852/852 [==============================] - 30s 35ms/step - loss: 0.1535 - accuracy: 0.9424 - val_loss: 0.4867 - val_accuracy: 0.8491
Epoch 17/20
852/852 [==============================] - 29s 34ms/step - loss: 0.1494 - accuracy: 0.9442 - val_loss: 0.5066 - val_accuracy: 0.8455
Epoch 18/20
852/852 [==============================] - 30s 35ms/step - loss: 0.1459 - accuracy: 0.9459 - val_loss: 0.4815 - val_accuracy: 0.8439
Epoch 19/20
852/852 [==============================] - 30s 35ms/step - loss: 0.1397 - accuracy: 0.9482 - val_loss: 0.4840 - val_accuracy: 0.8450
Epoch 20/20
852/852 [==============================] - 30s 35ms/step - loss: 0.1414 - accuracy: 0.9481 - val_loss: 0.4699 - val_accuracy: 0.8447

If you observe the above training process, at the end of the 20th epoch:

training accuracy = 0.9481 and val_accuracy = 0.84

training loss = 0.1414 and val_loss = 0.4699

[Plots of training vs. validation loss and accuracy]

By adding dropout, the validation loss decreases along with the training loss, compared to the LSTM network without dropout. So with dropout our model is less likely to overfit.

I get nearly 94% training accuracy by just training the model for 20 epochs, without hyperparameter optimization.

With hyperparameter optimization and additional Dropout and BatchNormalization layers, I could get a more generalized model that performs better on new data.

In this article my objective is to understand how an imbalanced dataset impacts neural networks.

My Observations

Without dropout, our model is prone to overfitting on the imbalanced dataset.

By adding dropout, our model is less prone to overfitting.

By doing hyperparameter optimization and adding BatchNormalization, we can increase the performance of the model on new data.

Bidirectional LSTM :

A bidirectional LSTM runs the input in two directions, one from past to future and one from future to past. What differentiates this approach from a unidirectional LSTM is that the LSTM running backwards preserves information from the future, so by combining the two hidden states the model can, at any point in time, preserve information from both past and future.

Building a single-layer Bidirectional LSTM for training

The Bi-LSTM model needs an Embedding layer and a Dense layer for training.

The Embedding layer turns positive integers (indexes) into dense vectors of fixed size.

Our problem statement belongs to binary classification, so we use a sigmoid activation in the Dense layer.

The number of units in the Bi-LSTM cell is a hyperparameter; I take 100 units.

We need to specify the embedding output size; I take an embedding vector size of 40, which is also a hyperparameter.

I add Dropout layers with a 0.65 dropout rate.

Optimizer=adam, metric= accuracy

# Model building

from tensorflow.keras.layers import Bidirectional

embedding_vector_features = 40
model_3 = Sequential()
model_3.add(Embedding(vocab_size, embedding_vector_features, input_length=100))
model_3.add(Dropout(0.65))
model_3.add(Bidirectional(LSTM(100)))
model_3.add(Dropout(0.65))
model_3.add(Dense(1, activation='sigmoid'))

# Model Compilation
model_3.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model_3.summary())

Model: "sequential_3"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_3 (Embedding) (None, 100, 40) 2837000
_________________________________________________________________
dropout_4 (Dropout) (None, 100, 40) 0
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200) 112800
_________________________________________________________________
dropout_5 (Dropout) (None, 200) 0
_________________________________________________________________
dense_3 (Dense) (None, 1) 201
=================================================================
Total params: 2,950,001
Trainable params: 2,950,001
Non-trainable params: 0

# Fitting the model

history_2=model_3.fit(X_train, Y_train,
batch_size=32,
epochs=20,
validation_data=(X_test, Y_test),
)

Epoch 16/20
852/852 [==============================] - 36s 42ms/step - loss: 0.1928 - accuracy: 0.9258 - val_loss: 0.4412 - val_accuracy: 0.8521
Epoch 17/20
852/852 [==============================] - 36s 42ms/step - loss: 0.1915 - accuracy: 0.9270 - val_loss: 0.4410 - val_accuracy: 0.8501
Epoch 18/20
852/852 [==============================] - 36s 42ms/step - loss: 0.1840 - accuracy: 0.9300 - val_loss: 0.4129 - val_accuracy: 0.8537
Epoch 19/20
852/852 [==============================] - 36s 42ms/step - loss: 0.1836 - accuracy: 0.9290 - val_loss: 0.4079 - val_accuracy: 0.8492
Epoch 20/20
852/852 [==============================] - 35s 42ms/step - loss: 0.1787 - accuracy: 0.9329 - val_loss: 0.4438 - val_accuracy: 0.8545

If you observe the above training process, at the end of the 20th epoch:

training accuracy = 0.9329 and val_accuracy= 0.85

training loss = 0.17 and val_loss = 0.44

[Plots of training vs. validation loss and accuracy]

If you observe the above results, the Bidirectional LSTM is somewhat better than the plain LSTM.

By doing hyperparameter optimization and adding BatchNormalization, we can improve the performance of the model further.

Saving the Bidirectional model for deployment

# Saving the trained model

model_3.save('b_lstm.h5')

Evaluating Bi-LSTM with new data

# New data for evaluating the trained model

string11 = '''Looking for a motel in close proximity to TV taping of a Dr. Phil show, we chose the Dunes on Sunset Blvd in West Hollywood. Although the property displayed the AAA emblem, it certainly left a lot to be desired. There were chips & scrapes on the bottom of the door frame in the bathroom and the lotion containers were half full, apparently not replaced by housekeeping. We needed an early wakeup call, but couldn't use the clock radio alarm as there wasn't a radio in the room. There was no TV channel listing on the remote, or on the TV menu making viewing a chore. The TV remote had to be returned when checking-out. This place served its purpose, but not a place to revisit.'''

# Converting the above text into vectors using the tokenizer
x_1 = tokenizer.texts_to_sequences([string11])

# Padding the converted vectors
x_1 = pad_sequences(x_1, maxlen=100)

# Evaluating the trained model on new data
model_3.predict(x_1)

Output: array([[0.98076105]], dtype=float32)

The given new data is a negative review.

Our model’s prediction is also negative (the output is close to 1).

So our trained model works well on this new data.

We can improve the model’s performance further by doing hyperparameter optimization.

(vii) Deployment Of Trained Bidirectional LSTM With Streamlit Frame Work :

For deployment we need the trained model (b_lstm.h5) and the saved tokenizer pickle file.

I have already saved both files.

For building the web app I am using the Streamlit framework; a minimal sketch is shown below.

We can deploy this web app on any cloud platform. However, since this is a deep learning model, we can’t deploy it on Heroku, which allows apps of only up to 500 MB; this app will be larger than 500 MB, so try AWS or Azure instead.

For deployment we also need requirements.txt, setup.sh, and Procfile files.
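A minimal sketch of what such a Streamlit app could look like, assuming the b_lstm.h5 and tokenizer.pickle files saved above sit next to an app.py; the original app's exact layout may differ, and the 0.5 threshold is my own choice.

# app.py - illustrative Streamlit sketch (assumed layout, not the original code)
import pickle
import streamlit as st
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

model = load_model('b_lstm.h5')                 # trained Bi-LSTM saved earlier
with open('tokenizer.pickle', 'rb') as handle:  # tokenizer saved earlier
    tokenizer = pickle.load(handle)

st.title('Hotel Review Sentiment Analysis')
review = st.text_area('Enter a hotel review')

if st.button('Predict'):
    seq = pad_sequences(tokenizer.texts_to_sequences([review]), maxlen=100)
    prob = float(model.predict(seq)[0][0])
    # 0 = happy (positive), 1 = not happy (negative)
    st.write('Negative review' if prob >= 0.5 else 'Positive review')

The app would then be started locally with streamlit run app.py.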

(viii) Conclusions :

LSTM without Dropout :

Total params: 2,893,501

At the end of the 20th epoch:

training accuracy = 0.99 , val_accuracy = 0.80

training loss = 0.032 , val_loss = 0.84

As the training loss decreases, the validation loss keeps increasing, which means our model is overfitting.

LSTM with Dropout :

Total params: 2,893,501

At the end of the 20th epoch:

training accuracy = 0.9481 and val_accuracy = 0.84

training loss = 0.1414 and val_loss = 0.4699

By adding dropout, the validation loss decreases along with the training loss, compared to the LSTM network without dropout. So with dropout our model is less likely to overfit.

Bidirectional LSTM with Dropout

Total params: 2,950,001

At the end of the 20th epoch:

training accuracy = 0.9329 and val_accuracy = 0.85

training loss = 0.17 and val_loss = 0.44

If you observe the above results, the Bidirectional LSTM is somewhat better than the plain LSTM.

The main objective of this article is to show how an imbalanced dataset impacts machine learning models.

With Naive Bayes we don’t see much impact, but with the deep learning models we clearly observe the impact of the imbalanced dataset.

The LSTM is affected by the imbalanced dataset; to reduce this impact I used dropout in the LSTM model, after which the model is less affected. We can do hyperparameter optimization to increase performance further.

(ix) Future Scope :

  • Applying more algorithms and checking the impact of the imbalanced dataset
  • Hyperparameter optimization
  • Adding synthetic data to balance the dataset (a small sketch follows below)
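A hedged sketch of the synthetic-data idea using SMOTE from the imbalanced-learn package (assuming it is installed; this is not used anywhere in this article, and the variable names are my own):

# Illustrative only: generate synthetic minority-class samples with SMOTE.
# Requires the imbalanced-learn package (pip install imbalanced-learn).
from imblearn.over_sampling import SMOTE

X_balanced, y_balanced = SMOTE(random_state=42).fit_resample(tfidf_word, tfidf_class)
print(pd.Series(y_balanced).value_counts())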

(x) References :