Sentiment Analysis using Deep Learning with Tensorflow

Original article was published on Deep Learning on Medium

Photo Credit: Lionbridge AI

Sentiment Analysis
Sentiment analysis is the contextual study of text that aims to determine the opinions, feelings, outlooks, moods, and emotions of people towards entities and their aspects. Its two primitive tasks are emotion recognition, which focuses on extracting a set of emotion labels, and polarity detection, which classifies the writer’s attitude as positive, negative, or neutral.

DATASET

Women’s E-Commerce Clothing Reviews
This is a Women’s Clothing E-Commerce dataset revolving around the reviews written by customers. Its nine supportive features offer a great environment to parse out the text through its multiple dimensions. Because this is real commercial data, it has been anonymized, and references to the company in the review text and body have been replaced with “retailer”.

Data Source: https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews

Content

This dataset includes 23486 rows and 10 feature variables. Each row corresponds to a customer review, and includes the variables:
· Clothing ID: Integer categorical variable that refers to the specific piece being reviewed.
· Age: Positive integer variable for the reviewer’s age.
· Title: String variable for the title of the review.
· Review Text: String variable for the review body.
· Rating: Positive ordinal integer variable for the product score granted by the customer, from 1 (worst) to 5 (best).
· Recommended IND: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
· Positive Feedback Count: Positive integer documenting the number of other customers who found this review positive.
· Division Name: Categorical name of the product’s high-level division.
· Department Name: Categorical name of the product department.
· Class Name: Categorical name of the product class.
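
As a quick sanity check, the ten columns above can be mirrored in a tiny synthetic frame (the two rows below are invented for illustration; the real file has 23,486 rows):

```python
import pandas as pd

# Two invented rows that mirror the Kaggle schema described above.
sample = pd.DataFrame({
    "Clothing ID": [1077, 1049],
    "Age": [60, 50],
    "Title": ["Some major design flaws", None],
    "Review Text": ["Runs small and the zipper snags.", "Love this dress!"],
    "Rating": [3, 5],
    "Recommended IND": [0, 1],
    "Positive Feedback Count": [4, 0],
    "Division Name": ["General", "General"],
    "Department Name": ["Dresses", "Dresses"],
    "Class Name": ["Dresses", "Dresses"],
})
print(sample.shape)  # (2, 10): ten feature variables per review
```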

Import the Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Load the Dataset

df = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")
df.head()
df.info()
df = df.drop(['Unnamed: 0', 'Title', 'Positive Feedback Count'], axis=1)
df.dropna(inplace=True)
df['Rating_Polarity'] = df['Rating'].apply(lambda x: 'Positive' if x > 3 else ('Neutral' if x == 3 else 'Negative'))
df.head()
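
The nested lambda above is terse; written out as a plain function (the name `rating_to_polarity` is mine), the mapping is: ratings 4 and 5 are Positive, 3 is Neutral, and 1 and 2 are Negative.

```python
def rating_to_polarity(rating):
    # Same logic as the lambda applied to df['Rating'] above.
    if rating > 3:
        return 'Positive'
    if rating == 3:
        return 'Neutral'
    return 'Negative'

print([rating_to_polarity(r) for r in [1, 2, 3, 4, 5]])
# → ['Negative', 'Negative', 'Neutral', 'Positive', 'Positive']
```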

Exploratory Data Analysis

sns.set_style('whitegrid')
sns.countplot(x='Rating', data=df, palette='YlGnBu_r')
sns.set_style('whitegrid')
sns.countplot(x='Rating_Polarity', data=df, palette='summer')
sns.set_style('whitegrid')
sns.countplot(x='Rating', hue='Division Name', data=df, palette='CMRmap')
plt.figure(figsize=(8,4))
sns.set_style('whitegrid')
sns.countplot(x='Rating', hue='Department Name', data=df, palette='viridis')
plt.figure(figsize=(8,4))
sns.set_style('whitegrid')
sns.countplot(x='Recommended IND', hue='Department Name', data=df, palette='YlGnBu_r')
plt.figure(figsize=(8,4))
sns.distplot(df['Age'], color='darkred', bins=30)

Dealing with Imbalanced Dataset

df_class_Positive = df[df['Rating_Polarity'] == 'Positive'][0:8000]
df_class_Neutral = df[df['Rating_Polarity'] == 'Neutral']
df_class_Negative = df[df['Rating_Polarity'] == 'Negative']
df_class_Neutral_over = df_class_Neutral.sample(8000, replace=True)
df_class_Negative_over = df_class_Negative.sample(8000, replace=True)
df = pd.concat([df_class_Positive, df_class_Neutral_over, df_class_Negative_over], axis=0)
print('Random over-sampling:')
print(df['Rating_Polarity'].value_counts())
df['Rating_Polarity'].value_counts().plot(kind='bar', title='Count (Rating_Polarity)');
df.shape
(24000, 9)
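
The three per-class blocks above can be generalized into a small helper: classes larger than the target are truncated, smaller ones are sampled with replacement. A sketch (the function name and the toy frame are mine, not from the notebook):

```python
import pandas as pd

def random_oversample(frame, label_col, target, random_state=42):
    # Truncate majority classes and resample minority classes (with
    # replacement) so every class ends up with `target` rows.
    parts = []
    for _, group in frame.groupby(label_col):
        if len(group) >= target:
            parts.append(group.iloc[:target])
        else:
            parts.append(group.sample(target, replace=True, random_state=random_state))
    return pd.concat(parts, axis=0)

toy = pd.DataFrame({'Rating_Polarity': ['Positive'] * 6 + ['Neutral'] * 2 + ['Negative']})
balanced = random_oversample(toy, 'Rating_Polarity', 4)
print(balanced['Rating_Polarity'].value_counts().to_dict())
# every class now has exactly 4 rows
```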

Text Preprocessing

import re
import string
from string import punctuation
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
def text_processing(text):
    Stopwords = stopwords.words('english')
    # Check characters to see if they are in punctuation
    no_punctuation = [char for char in text if char not in string.punctuation]
    # Join the characters again to form the string
    no_punctuation = ''.join(no_punctuation)
    # Now just remove any stopwords
    return ' '.join([word for word in no_punctuation.split() if word.lower() not in Stopwords])
df['review'] = df['Review Text'].apply(text_processing)
df.head()
df=df[['review', 'Rating_Polarity']]
df.head()
# one hot encoding
one_hot = pd.get_dummies(df["Rating_Polarity"])
df.drop(['Rating_Polarity'],axis=1,inplace=True)
df = pd.concat([df,one_hot],axis=1)
df.head()
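
To see what `text_processing` does to a single review, here is a self-contained variant; a small hardcoded stopword set stands in for nltk's English list, so no download is needed:

```python
import string

# Stand-in for stopwords.words('english'); the real nltk list is longer.
STOPWORDS = {'i', 'the', 'a', 'an', 'is', 'it', 'and', 'this', 'so'}

def clean_text(text):
    # Strip punctuation, then drop stopwords, as in text_processing above.
    no_punct = ''.join(ch for ch in text if ch not in string.punctuation)
    return ' '.join(w for w in no_punct.split() if w.lower() not in STOPWORDS)

print(clean_text("I love this dress, it's so comfortable!"))
# → love dress its comfortable
```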

Train Test Split

from sklearn.model_selection import train_test_split
X=df['review'].values
y=df.drop('review', axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
#Vectorization
bow = CountVectorizer()
X_train = bow.fit_transform(X_train)
X_test = bow.transform(X_test)
#Term Frequency, Inverse Document Frequency
tfidf = TfidfTransformer()
X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)
X_train=X_train.toarray()
X_test=X_test.toarray()
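
As an aside, the CountVectorizer plus TfidfTransformer pair can be collapsed into scikit-learn's TfidfVectorizer, which performs both steps in a single fit. A minimal sketch on invented reviews:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["love this dress", "terrible fit runs small", "love the fit"]
vec = TfidfVectorizer()
X_sparse = vec.fit_transform(docs)  # sparse matrix: 3 documents, 8 unique terms
print(X_sparse.shape)
```

Note that the `.toarray()` calls above densify a roughly 16,800 by 12,673 sparse matrix; that fits in memory here, but it is expensive for larger vocabularies.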

Build the Model

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation,Dropout
X_train.shape
(16800, 12673)
model = Sequential()
model.add(Dense(units=12673, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(units=4000, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(units=500, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(units=3, activation='softmax'))
opt = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=2)
model.fit(x=X_train, y=y_train, batch_size=256, epochs=100, validation_data=(X_test, y_test), verbose=1, callbacks=[early_stop])

Evaluation

df_m = pd.DataFrame(model.history.history)
df_m['Epoch'] = range(1, len(df_m) + 1)
df_m.index = df_m['Epoch']
df_m
score = model.evaluate(X_test, y_test, batch_size=64, verbose=1)
print('Test accuracy:', score[1])
#Loss Graph(Training and Validation)
plt.plot(df_m['loss'])
plt.plot(df_m['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train_loss', 'val_loss'])
plt.show()
# Accuracy Graph(Training and Validation) 
plt.plot(df_m['accuracy'])
plt.plot(df_m['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train_acc', 'val_acc'])
plt.show()
preds = model.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(np.argmax(y_test,axis=1),np.argmax(preds,axis=1)))
print(classification_report(np.argmax(y_test,axis=1),np.argmax(preds,axis=1)))
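
One detail worth making explicit: `pd.get_dummies` orders the one-hot columns alphabetically, so the argmax indices above map back to labels as 0 = Negative, 1 = Neutral, 2 = Positive. A toy decoding sketch (the probability rows are invented, not real model outputs):

```python
import numpy as np

LABELS = ['Negative', 'Neutral', 'Positive']  # alphabetical, matching get_dummies

probs = np.array([[0.1, 0.2, 0.7],
                  [0.6, 0.3, 0.1]])
print([LABELS[i] for i in probs.argmax(axis=1)])
# → ['Positive', 'Negative']
```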

Note: The entire Python code and dataset can be downloaded from https://github.com/Harshita9511/Sentiment-Analysis-using-Deep-Learning-with-Tensorflow