Introduction to Text Classification with Python

The original article was published by Irfan Alghani Khalid in Artificial Intelligence on Medium


The Process

Cleaning the text

The first step is to prepare and clean the dataset. Cleaning is an essential step for removing meaningless or useless tokens such as hashtags, mentions, punctuation, and so on.

To clean the text, we can use libraries like re to remove terms that match certain patterns and NLTK to remove words such as stop words. I have also explained how to clean text step by step using Python, which you can read here,

Here is what some of the texts look like before preprocessing,

Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
Forest fire near La Ronge Sask. Canada
All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
13,000 people receive #wildfires evacuation orders in California
Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school
#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires
#flood #disaster Heavy rain causes flash flooding of streets in Manitou, Colorado Springs areas
I'm on top of the hill and I can see a fire in the woods...
There's an emergency evacuation happening now in the building across the street
I'm afraid that the tornado is coming to our area...

The code for doing the task will look like this,

# # In case of import errors
# ! pip install nltk
# ! pip install textblob
import re
import string

import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob  # imported in the original article but not used below

# # In case any corpus is missing:
# # nltk.download() with no arguments opens the interactive NLTK downloader
nltk.download()

df = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

stop_words = stopwords.words("english")

def text_preproc(x):
    x = x.lower()
    # Optional lemmatization step (kept commented out, as in the original)
    # x = ' '.join(WordNetLemmatizer().lemmatize(word, 'v') for word in x.split())
    x = ' '.join([word for word in x.split(' ') if word not in stop_words])
    x = x.encode('ascii', 'ignore').decode()   # drop non-ASCII characters
    x = re.sub(r'https*\S+', ' ', x)           # URLs
    x = re.sub(r'@\S+', ' ', x)                # mentions
    x = re.sub(r'#\S+', ' ', x)                # hashtags
    x = re.sub(r'\'\w+', '', x)                # contraction endings like 's, 'm
    x = re.sub('[%s]' % re.escape(string.punctuation), ' ', x)  # punctuation
    x = re.sub(r'\w*\d+\w*', '', x)            # tokens containing digits
    x = re.sub(r'\s{2,}', ' ', x)              # extra whitespace
    return x

df['clean_text'] = df.text.apply(text_preproc)
test['clean_text'] = test.text.apply(text_preproc)

Here is the result after the cleaning step,

deeds reason may allah forgive us
forest fire near la ronge sask canada
residents asked place notified officers evacuation shelter place orders expected
people receive evacuation orders california
got sent photo ruby smoke pours school
update california hwy closed directions due lake county fire
heavy rain causes flash flooding streets manitou colorado springs areas
i top hill see fire woods
there emergency evacuation happening building across street
i afraid tornado coming area

Also, take note! Make sure that you download the required packages and corpora (a corpus is basically a collection of texts) from NLTK.
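If you would rather fetch only the specific corpora this tutorial needs, instead of opening the interactive downloader, a minimal sketch looks like this,

import nltk

# Stop word list used by text_preproc
nltk.download('stopwords')
# WordNet data, only needed if you enable the commented-out lemmatization line
nltk.download('wordnet')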

Build a Term-Document Matrix with TF-IDF weighting

Right after we clean the data, we can build a text representation so that the computer can process it easily. We will use a Term-Document Matrix to represent the texts.

A Term-Document Matrix (TDM) is a matrix in which the rows represent documents, the columns represent terms (words), and each cell is filled with a number.

Each cell contains a value based on the word count in that document. One method we can use to fill it is called Term Frequency-Inverse Document Frequency (TF-IDF).
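As a toy illustration (the two sentences below are invented, not taken from the dataset), here is a minimal sketch of a raw-count Term-Document Matrix built with sklearn's CountVectorizer,

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Two invented example documents
docs = ["forest fire near la ronge", "heavy rain causes flash flooding"]

cv = CountVectorizer()
counts = cv.fit_transform(docs)

# Rows = documents, columns = terms, cells = raw word counts
# (get_feature_names_out() requires a recent scikit-learn; older versions use get_feature_names())
print(pd.DataFrame(counts.toarray(), columns=cv.get_feature_names_out()))

TF-IDF replaces these raw counts with the weighted values described next.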

Term Frequency-Inverse Document Frequency (TF-IDF) is the product of the frequency of a term in a document (Term Frequency) and the inverse of the term's frequency across all documents (Inverse Document Frequency).

Term Frequency (TF) measures how often a term appears in a document. Because raw counts vary widely between words, we apply a base-10 logarithm to rescale them. It looks like this,
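In standard notation, with count(t, d) being the number of times the term t appears in the document d,

\mathrm{tf}_{t,d} = 1 + \log_{10}\mathrm{count}(t,d) \quad \text{if } \mathrm{count}(t,d) > 0, \text{ and } 0 \text{ otherwise}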

Inverse Document Frequency (IDF) measures how rare a word is across all documents. If the value is small, the word is frequent; if it is large, the word is rare. This value is used as a weight for the TF, and it looks like this,
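With N the total number of documents and df_t the number of documents that contain the term t,

\mathrm{idf}_t = \log_{10}\frac{N}{\mathrm{df}_t}

and the final TF-IDF weight of a term in a document is the product w_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t. Note that scikit-learn's TfidfVectorizer, used below, applies a slightly different default variant (natural logarithm, smoothed document frequencies, and L2-normalized rows), but the idea is the same.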

To create a Term-Document Matrix (TDM), we can use the TfidfVectorizer class from the sklearn library. The code looks like this,

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['clean_text']).toarray()
# On scikit-learn >= 1.2, use vectorizer.get_feature_names_out() instead of get_feature_names()
df_new = pd.DataFrame(X, columns=vectorizer.get_feature_names())
X_test = vectorizer.transform(test['clean_text']).toarray()
test_new = pd.DataFrame(X_test, columns=vectorizer.get_feature_names())

When you write the code, be careful about which method you use for each dataset. On the train data, make sure that you use the fit_transform method: it learns the vocabulary from the train data and transforms the text into a matrix.

Meanwhile, on the test data, make sure that you use the transform method: it transforms the text into a matrix with the same columns as the train matrix. If we used fit_transform there as well, it would build a vocabulary from the test data alone, and the resulting matrix would not have the same column dimension. So double-check which method you use.

If we did it correctly, both matrices will have the same number of columns, one per term in the training vocabulary.
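A quick sanity check, sketched with the variables defined above, is to compare the shapes of the two matrices,

# Both matrices should have one column per term in the training vocabulary
print(X.shape, X_test.shape)
assert X.shape[1] == X_test.shape[1]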

The concept of Naive Bayes

Once we have the matrix, we can feed it to the model. The model that we will use is Naive Bayes.

Naive Bayes is a machine learning model for supervised learning tasks that calculates the probability that a data point belongs to a class.

It is based on Bayes' theorem and assumes that the terms inside a document are independent of each other. The formula for calculating this looks like this,
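With t_1, ..., t_{n_d} being the terms that appear in the document d,

P(c \mid d) \propto P(c) \prod_{1 \le i \le n_d} P(t_i \mid c)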

Let me explain each part of it,

  • P(c|d) stands for the probability that a document d belongs to a class c,
  • The ∝ symbol (which looks like an alpha) means that the two sides are proportional to each other,
  • P(c) is the prior probability of the class, calculated as the proportion of documents of that class out of the total number of documents. The formula looks like this,

Where N_c is the number of documents of the corresponding class in the dataset and N is the total number of documents in the dataset.
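In symbols,

\hat{P}(c) = \frac{N_c}{N}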

  • The product over P(t_i | c) multiplies, for every term t_i inside the document d, the probability of that term given the class c. The formula looks like this,

Where T_ct is the number of occurrences of the term t in documents of class c, the sum over T_ct' is the total count of all terms in that class, B is the number of distinct words (the vocabulary size) in the training dataset, and the added 1 is smoothing to avoid zero probabilities.
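Putting those pieces together,

\hat{P}(t \mid c) = \frac{T_{ct} + 1}{\sum_{t'} T_{ct'} + B}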

The P(t_i | c) formula differs depending on how we formulate the problem. The formulation above is multinomial: we count how many times each exact term occurs within a class. This model is usually called Multinomial Naive Bayes.

There is also a model called Bernoulli Naive Bayes, where P(t_i | c) is calculated differently: it is the proportion of documents of the class that contain the term. The formula looks like this,

Where N_ct is the number of documents of class c that contain the term t, and N_c is the total number of documents of class c.
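Without smoothing this is simply N_ct / N_c; a commonly used smoothed version, following the same add-one idea as before, is

\hat{P}(t \mid c) = \frac{N_{ct} + 1}{N_c + 2}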

After we calculate the probability for each class, we choose the class with the highest probability.
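In other words, the predicted class is

\hat{c} = \arg\max_{c}\; P(c) \prod_{1 \le i \le n_d} P(t_i \mid c)

In practice, implementations sum the logarithms of these probabilities instead of multiplying them, to avoid numerical underflow.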

The implementation using Python

Now that I have explained the concepts, let's move on to the implementation. For this step, I will use the scikit-learn library.

When we build a model, an important question is whether it gives good results, especially on unseen data, so that we are confident in using it. We can check this with a technique called cross-validation. The code looks like this,

import numpy as np

from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score

X = df_new.values
y = df.target.values

kfold = KFold(n_splits=10)

# Define the models
nb_multinomial = MultinomialNB()
nb_bernoulli = BernoulliNB()

def calculate_f1(model):
    # Storage for the model's performance on each fold
    metrics = []

    for train_idx, test_idx in kfold.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        metrics.append(f1_score(y_test, y_pred))

    # Report the mean of the fold scores
    print("%.3f" % np.array(metrics).mean())

calculate_f1(nb_multinomial)
>>> 0.681
calculate_f1(nb_bernoulli)
>>> 0.704

What is going on inside the calculate_f1 function?

  • First, it takes the model as an input.
  • Then, it conducts k-fold cross-validation: on each loop, it splits the dataset into a train set and a test set, fits the model on the train set, and predicts the labels of the test set.
  • Finally, it calculates the mean of the cross-validation scores.
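For reference, the same kind of evaluation can be written more compactly with scikit-learn's built-in cross_val_score helper; a minimal sketch,

from sklearn.model_selection import cross_val_score

# 10-fold cross-validation with the F1 score, similar to calculate_f1
scores = cross_val_score(MultinomialNB(), X, y, cv=10, scoring='f1')
print("%.3f" % scores.mean())

Note that with an integer cv and a classifier, cross_val_score uses stratified folds by default, so its numbers can differ slightly from the plain KFold loop above.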

Based on that, the Bernoulli Naive Bayes model scores better (0.704) than the Multinomial Naive Bayes (0.681).

Therefore, we will use Bernoulli Naive Bayes as our model to predict the labels of the real test set. The code looks like this,

from sklearn.naive_bayes import BernoulliNB

def predict_to_csv(model, X, y):
    model.fit(X, y)
    X_test = test_new.values
    y_pred = model.predict(X_test)

    # Prepare the submission file
    submission = pd.DataFrame()
    submission['id'] = test['id']
    submission['target'] = y_pred
    submission.to_csv('file_name.csv', index=False)

    # Validate by reading the file back
    submission = pd.read_csv('file_name.csv')
    print(submission.head())

nb_bernoulli = BernoulliNB()
X = df_new.values
y = df.target.values
predict_to_csv(nb_bernoulli, X, y)
>>> id target
0 0 1
1 2 0
2 3 1
3 9 0
4 11 1

As we can see above, we fit the model on the full train data and predict the labels of the test data. After that, we create a data frame and save the result in CSV format. Finally, you can submit it to Kaggle to find out how good the result is.