Text classification approaches with code snippets

Source: Deep Learning on Medium


The data used in this blog post is from a Kaggle competition (https://www.kaggle.com/crowdflower/twitter-airline-sentiment#Tweets.csv); you can either download it or load it through a Kaggle kernel. So let's start with some basic tasks and explore the data.

What we will do

  • We will do some data exploration and sentiment analysis, form a hypothesis, and test it using two different approaches
  • We will test different vectorization techniques on different classifiers
  • We will justify our models' decisions using LIME
  • We will use GloVe embeddings with a fully connected NN, LSTM, Bidirectional LSTM, and GRU

Data Exploration

import pandas as pd

df = pd.read_csv('/kaggle/input/twitter-airline-sentiment/Tweets.csv')

The data has more columns, but these are the ones we are interested in. The first column contains the actual tweets by the customers; this is the column we will be working with throughout the next steps and the one we will analyze. In the second table we can see the sentiment of each tweet in the column "airline_sentiment", while "airline" identifies the different airline service providers. Now let's see which mood dominates among the passengers, which comes next, and so on.

import seaborn as sns
import matplotlib.pyplot as plt

mood_count = df['airline_sentiment'].value_counts()
sns.barplot(mood_count.index, mood_count.values, alpha=0.8)
plt.title('Count of Moods')
plt.ylabel('Mood Count', fontsize=12)
plt.xlabel('Mood', fontsize=12)

As you can see, negativity dominates people's impression of the different airline service providers! Now let's see the top reasons for these negative tweets.

neg_reasons = df['negativereason'][df['airline_sentiment']=='negative'].value_counts()

It seems that customer service comes before all!


As we can see, most negative responses come from United Airlines, and most positive responses come from Southwest Airlines.

Now let's explore a new aspect: whether the length of a tweet affects its sentiment.

df['text_length'] = df['text'].str.len()
target_0 = df.loc[df['airline_sentiment'] == 'neutral']
target_1 = df.loc[df['airline_sentiment'] == 'positive']
target_2 = df.loc[df['airline_sentiment'] == 'negative']
sns.distplot(target_0[['text_length']], hist=False, rug=False, color='red', label='neutral')
sns.distplot(target_1[['text_length']], hist=False, rug=True, color='yellow', label='positive')
sns.distplot(target_2[['text_length']], hist=False, rug=True, color='black', label='negative')

As you can see from the plot above, a tweet is more likely to be negative as its length increases.
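The same comparison can also be made numerically instead of visually. Here is a minimal sketch with a hypothetical mini-dataframe (not the Kaggle data) showing the mean tweet length per sentiment class:

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the tweets dataframe
toy = pd.DataFrame({
    'text': ['great flight', 'ok',
             'my flight was delayed three hours and nobody helped'],
    'airline_sentiment': ['positive', 'neutral', 'negative'],
})
toy['text_length'] = toy['text'].str.len()

# Mean tweet length per sentiment class
mean_len = toy.groupby('airline_sentiment')['text_length'].mean()
print(mean_len)
```

On the real dataframe the same `groupby` gives a one-line numeric check of the "longer tweets skew negative" pattern seen in the distribution plot.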

Semantic Analysis

import numpy as np

# Some initial features in text
qmarks = np.mean(df['text'].apply(lambda x: '?' in x))
exclamation = np.mean(df['text'].apply(lambda x: '!' in x))
at = np.mean(df['text'].apply(lambda x: '@' in x))
fullstop = np.mean(df['text'].apply(lambda x: '.' in x))
capital_first = np.mean(df['text'].apply(lambda x: x[0].isupper()))
capitals = np.mean(df['text'].apply(lambda x: max([y.isupper() for y in x])))
numbers = np.mean(df['text'].apply(lambda x: max([y.isdigit() for y in x])))
hashtags = np.mean(df['text'].apply(lambda x: '#' in x))
print('Tweets with question marks: {:.2f}%'.format(qmarks * 100))
print('Tweets with hashtags: {:.2f}%'.format(hashtags * 100))
print('Tweets with exclamation marks: {:.2f}%'.format(exclamation * 100))
print('Tweets with full stops: {:.2f}%'.format(fullstop * 100))
print('Tweets with capitalised first letters: {:.2f}%'.format(capital_first * 100))
print('Tweets with capital letters: {:.2f}%'.format(capitals * 100))
print('Tweets with @: {:.2f}%'.format(at * 100))
print('Tweets with numbers: {:.2f}%'.format(numbers * 100))

The first hypothesis is that tweets containing question marks should be angrier.

Let's see how many of the tweets that contain '?' are negative ones.

df['has_question'] = df['text'].apply(lambda x: '?' in x)
df_has_question = df[df['has_question']]
df_has_question.airline_sentiment.value_counts()

negative    2377
neutral     1195
positive     103
Name: airline_sentiment, dtype: int64

So our benchmark on the unfiltered dataset was:

Negative = 62.69 %

Neutral = 21.17 %

Positive = 16.14 %

As you can see, the percentage of positive tweets decreased tremendously and the neutral percentage increased, so we can conclude that adding '?' to a tweet increases the probability of it being neutral or negative.

Let’s validate this assumption

Thanks to my friend Mohamed Donia who guided me through validating the assumption I made above, you can follow him on https://www.linkedin.com/in/mohamed-donia-b0a76a27/

We will do an unpaired t-test and then deduce the p-value; if it is small enough, we can go with the assumption we made above. If you don't know what a t-test or a p-value is, don't worry, I will explain as we go. The unpaired t-test measures whether the means of two independent samples differ by more than chance alone would explain, and the p-value quantifies that difference: the smaller the p-value, the stronger the evidence that the two samples really do differ. Applied to our problem, we compare two samples of tweets, one containing question marks and one without. If the difference between these two samples is large enough, then the feature is as important as we thought it is, and its p-value should be small. Let's get through the process.
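As a standalone sketch of the test itself (with made-up numbers, not the tweet data), scipy performs the unpaired t-test directly:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Two hypothetical samples, e.g. some per-tweet statistic with and
# without question marks; the means genuinely differ here
with_q = rng.normal(loc=110, scale=15, size=200)
without_q = rng.normal(loc=95, scale=15, size=200)

# Welch's unpaired t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(with_q, without_q, equal_var=False)
print(t_stat, p_value)
```

A p-value far below the usual 0.05 threshold means the difference in sample means is very unlikely to be a chance artifact.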

What we will do is simply fit a logistic regression model with the binary feature x (has_question) against y (sentiment) and then calculate the p-value for the beta of the has_question feature.

df_hasquestion = df[['has_question', 'airline_sentiment']]
df_hasquestion.head()
df_hasquestion['has_question'] = df_hasquestion['has_question'].astype(int)
df_hasquestion['airline_sentiment'] = (df_hasquestion['airline_sentiment'] == 'positive').astype(int)
x = np.array(df_hasquestion['has_question'])
y = np.array(df_hasquestion['airline_sentiment'])

Then fit a model,

clf = LogisticRegression(solver='liblinear', random_state=0).fit(x.reshape(-1, 1), y)

the last thing we will do is to calculate the p-value,

params = np.append(clf.intercept_, clf.coef_)
predictions = clf.predict(x.reshape(-1, 1))
newX = pd.DataFrame({"Constant": np.ones(len(x))}).join(pd.DataFrame(x))
MSE = (sum((y - predictions) ** 2)) / (len(newX) - len(newX.columns))
var_b = MSE * (np.linalg.inv(np.dot(newX.T, newX)).diagonal())
sd_b = np.sqrt(var_b)
ts_b = params / sd_b
p_values = [2 * (1 - stats.t.cdf(np.abs(i), (len(newX) - 1))) for i in ts_b]
sd_b = np.round(sd_b, 3)
ts_b = np.round(ts_b, 3)
p_values = np.round(p_values, 3)
params = np.round(params, 4)
myDF3 = pd.DataFrame()
myDF3["Coefficients"], myDF3["Standard Errors"], myDF3["t values"], myDF3["Probabilities"] = [params, sd_b, ts_b, p_values]

So as you can see, the p-value for the has_question coefficient is 0.000, far below any usual significance threshold, so there is very little room for error and we can now accept our hypothesis.

Word Cloud

One very handy visualization tool for a data scientist doing any sort of natural language processing is the "word cloud". A word cloud (as the name suggests) is an image made up of the distinct words of a text or book, where the size of each word is proportional to its frequency in that text (the number of times the word appears).

Wordcloud for negative sentiment

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

words = ' '.join(df_x['text'].values)
cleaned_word = " ".join([word for word in words.split()
                         if 'http' not in word
                         and not word.startswith('@')
                         and word != 'RT'])
wordcloud = WordCloud(stopwords=STOPWORDS,
                      background_color='black',  # styling settings assumed; the original call was truncated
                      width=1600, height=800).generate(cleaned_word)
plt.figure(1, figsize=(12, 20))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

As we might presume, the most frequent words are people complaining about flight cancellations, customer service, and bag issues, as shown in the word cloud for negative responses. The positive and neutral responses did not have enough clean text to generate word clouds of their own. We can go deeper and compute the actual TFIDF weight for each word; let's see how we can do that.

ML Pipeline

Let's begin preparing our data for the ML pipeline. The first step is to vectorize the text of the tweets. We have already seen one type of vectorization, TFIDF, when we gave weights to each word, remember? SKLearn has us covered here with lots of vectorizing techniques, and we will explore them one by one!

CountVectorizer creates a matrix with frequency counts of each word in the text corpus.

TF-IDF Vectorizer combines TF (Term Frequency), the count of the words (terms) in the text corpus (same as CountVectorizer), with IDF (Inverse Document Frequency), which penalizes words that are too frequent. We can think of IDF as regularization.

HashingVectorizer creates a hashmap (a word-to-number mapping based on a hashing technique) instead of a dictionary for the vocabulary. This makes it more scalable and faster for larger text corpora, and it can be parallelized across multiple threads.
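The three vectorizers can be compared side by side on a toy corpus (made-up sentences, not the tweet data); note how the hashed feature space has a fixed size independent of the vocabulary:

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfVectorizer, HashingVectorizer)

corpus = ['the flight was late', 'the crew was great', 'late late late']

# CountVectorizer: raw term counts, one column per vocabulary word
counts = CountVectorizer().fit_transform(corpus)
# TfidfVectorizer: same columns, but counts reweighted by inverse document frequency
tfidf = TfidfVectorizer().fit_transform(corpus)
# HashingVectorizer: no stored vocabulary, just a fixed-size hashed feature space
hashed = HashingVectorizer(n_features=16).fit_transform(corpus)

print(counts.shape, tfidf.shape, hashed.shape)
```

The corpus has 6 distinct tokens, so the first two matrices are 3x6, while the hashed one is 3x16 regardless of how many new words later tweets introduce.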

Now let's get the text in shape for our text vectorization technique.

import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split

def tweet_to_words(raw_tweet):
    letters_only = re.sub("[^a-zA-Z]", " ", raw_tweet)
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stops]
    return " ".join(meaningful_words)

df['clean_text'] = df['text'].apply(tweet_to_words)
train, test = train_test_split(df, test_size=0.2, random_state=42)
x_train = train['clean_text']
y_train = train['airline_sentiment']
x_test = test['clean_text']
y_test = test['airline_sentiment']

Now that our text is ready for vectorization, I will be exploring only two approaches that I found appealing and the other approaches you can find here: https://github.com/omar178/Text-classification/blob/master/airline_reviews_analysis/different-approaches-for-text-classification.ipynb

from sklearn.feature_extraction.text import TfidfVectorizer

tfv = TfidfVectorizer(min_df=3, max_features=None,
                      strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                      ngram_range=(1, 3), use_idf=True, smooth_idf=True, sublinear_tf=True,
                      stop_words='english')
train_features = tfv.fit_transform(x_train)
test_features = tfv.transform(x_test)

We will be using TFIDF and then decomposing the vectors using SVD.

Using truncated SVD to reduce the dimensionality

Truncated Singular Value Decomposition (SVD) is a matrix factorization technique that factors a matrix M into three matrices U, Σ, and V. This is very similar to PCA, except that for SVD the factorization is done on the data matrix, whereas for PCA the factorization is done on the covariance matrix. Typically, SVD is used under the hood to find the principal components of a matrix.
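A minimal sketch of the dimensionality reduction step on a hypothetical sparse "document-term" matrix (random data standing in for the TFIDF features):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse matrix: 100 documents x 500 terms
X = sparse_random(100, 500, density=0.05, random_state=42)

# Keep only the 20 strongest singular directions
svd = TruncatedSVD(n_components=20, random_state=42)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)
```

Unlike PCA, TruncatedSVD works directly on sparse matrices without densifying or centering them, which is why it is the standard choice after TFIDF.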

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=120)
xtrain_svd = svd.fit_transform(train_features)
xvalid_svd = svd.transform(test_features)
# Scale the data obtained from SVD. Renaming variable to reuse without scaling.
scl = StandardScaler()
xtrain_svd_scl = scl.fit_transform(xtrain_svd)
xvalid_svd_scl = scl.transform(xvalid_svd)

Then using these classic classifiers,

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
import xgboost as xgb

Classifiers = [
    SVC(kernel="rbf", C=0.025, probability=True),
    xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8,
                      subsample=0.8, nthread=10, learning_rate=0.1)]
dense_features = xtrain_svd_scl
dense_test = xvalid_svd_scl
for classifier in Classifiers:
    try:
        fit = classifier.fit(xtrain_svd_scl, train['airline_sentiment'])
        pred = fit.predict(xvalid_svd_scl)
    except Exception:
        fit = classifier.fit(dense_features, train['airline_sentiment'])
        pred = fit.predict(dense_test)
    accuracy = accuracy_score(test['airline_sentiment'], pred)
    classification_rep = classification_report(test['airline_sentiment'], pred)
    print('Accuracy of ' + classifier.__class__.__name__ + ' is ' + str(accuracy))
    print('classification report', classification_rep)

Keep in mind that we used precision and recall as our main metrics because the dataset is imbalanced, so any model would be biased towards the negative class. These are the results the models produced.
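A toy illustration of why accuracy alone misleads on imbalanced labels (hypothetical labels, not the tweet data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 80% negative, 20% positive
y_true = np.array(['neg'] * 80 + ['pos'] * 20)
# A degenerate "model" that always predicts the majority class
y_pred = np.array(['neg'] * 100)

print(accuracy_score(y_true, y_pred))             # looks good despite learning nothing
print(f1_score(y_true, y_pred, pos_label='pos'))  # exposes the problem
```

The majority-class predictor scores 0.8 accuracy while its F1 on the minority class is 0, which is exactly why the per-class precision/recall in the classification report is the metric to watch here.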

Now let’s take a step further in our analysis, let’s justify the decision taken by our model using LIME explainer.

from sklearn.pipeline import make_pipeline
from lime.lime_text import LimeTextExplainer

class_names = ['negative', 'neutral', 'positive']
c = make_pipeline(tfv, svd, scl, Classifiers[7])  # index 7 refers to the full classifier list in the linked notebook
explainer = LimeTextExplainer(class_names=class_names)
idx = 4794
exp = explainer.explain_instance(x_test[idx], c.predict_proba, num_features=6)

As you can see, this tweet is correctly classified as positive, and the model gives the highest weight to the word "best".

This one as well is correctly classified as negative; the model highlights the word "hard" and gives it a bigger weight, so the tweet tends to be negative.

Deep learning

Glove Embeddings

So what are embeddings? Imagine you found a text corpus and decided to train a model to predict the next word from every two consecutive words. For example, for the sentence "He likes dogs" you would use "He" and "likes" to predict "dogs", and so on for the rest of your corpus. This is sometimes called fake training because you don't actually want the predictions; you want a middle layer that captures semantics. For example, given the two sentences "He likes dogs" and "He likes cats", we use the same two features to predict the words "dogs" and "cats", so the vectors of these two words end up somewhat similar. That is, briefly, the idea of embeddings. You can download the embedding data from here: https://www.kaggle.com/terenceliu4444/glove6b100dtxt
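The "similar contexts give similar vectors" intuition can be sketched with hand-made toy vectors (NOT real GloVe values) and cosine similarity:

```python
import numpy as np

# Hand-made toy vectors, chosen so that words appearing in similar
# contexts ("dogs"/"cats") point in similar directions
vectors = {
    'dogs':  np.array([0.9, 0.8, 0.1]),
    'cats':  np.array([0.8, 0.9, 0.2]),
    'plane': np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, ~0 for unrelated ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors['dogs'], vectors['cats']))   # high
print(cosine(vectors['dogs'], vectors['plane']))  # much lower
```

Real GloVe vectors behave the same way, just in 100 dimensions: words that occur in similar contexts end up close together.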

from tqdm import tqdm
from nltk import word_tokenize
from nltk.corpus import stopwords

embeddings_index = {}
with open('/kaggle/input/glove6b100dtxt/glove.6B.100d.txt') as f:
    for line in tqdm(f):
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Found %s word vectors.' % len(embeddings_index))

stop_words = stopwords.words('english')

def sent2vec(s):
    words = str(s).lower()
    words = word_tokenize(words)
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        if w in embeddings_index:
            M.append(embeddings_index[w])
    M = np.array(M)
    v = M.sum(axis=0)
    if type(v) != np.ndarray:
        return np.zeros(100)
    return v / np.sqrt((v ** 2).sum())

The function above returns a single normalized vector for each tweet by summing the GloVe vectors of its words, serving the same role as the TFIDF features did earlier.

xtrain_glove = [sent2vec(x) for x in tqdm(train_clean_tweet)]
xvalid_glove = [sent2vec(x) for x in tqdm(test_clean_tweet)]

Bidirectional LSTM

from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, Bidirectional
from keras.callbacks import EarlyStopping

model = Sequential()
# embedding_matrix and max_len are built from the GloVe index (see the full notebook)
model.add(Embedding(len(word_index) + 1, 100,
                    weights=[embedding_matrix],
                    input_length=max_len,
                    trainable=False))
model.add(Bidirectional(LSTM(300, dropout=0.3, recurrent_dropout=0.3)))
model.add(Dense(1024, activation='relu'))
model.add(Dense(1024, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
# Fit the model with early stopping callback
earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=50,
          verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])

You have reached the end of this tutorial. For the full code implementation, see GitHub https://github.com/omar178/Text-classification and Kaggle https://www.kaggle.com/omarayman/different-approaches-for-text-classification#Deep-learning

Thanks !!