Cleaning Text Data with Python


The Process

Lowercase the text

Before we get into processing our texts, it’s better to lowercase all of the characters first. We do this to avoid problems with any case-sensitive processing later on.

Suppose we want to remove stop words from our string by keeping only the non-stop words and joining them back into a sentence. If we don’t lowercase the text first, a capitalized stop word such as “The” won’t match the lowercase entry “the” in the stop-word list, and the string will come back unchanged. That’s why lowercasing the text is essential.

Doing this in Python is easy. The code looks like this,

# Example
x = "Watch This Airport Get Swallowed Up By A Sandstorm In Under A Minute http://t.co/TvYQczGJdy"
# Lowercase the text
x = x.lower()
print(x)
>>> watch this airport get swallowed up by a sandstorm in under a minute http://t.co/tvyqczgjdy
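
To see concretely why this matters for the stop-word step later, here is a small illustration (the tiny stop-word list here is my own stand-in, not NLTK’s full list),

# A capitalized stop word slips through an uncased filter
stop_words = ["the", "a", "in"]  # stand-in for the real stop-word list
x = "The sandstorm hit The airport"
print(' '.join(w for w in x.split() if w not in stop_words))
>>> The sandstorm hit The airport
# After lowercasing, the stop words are caught
print(' '.join(w for w in x.lower().split() if w not in stop_words))
>>> sandstorm hit airport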

Remove Unicode characters

Some tweets can contain Unicode characters that become unreadable when displayed in ASCII format. These are mostly emojis and other non-ASCII symbols. To remove them, we can use code like this one,

# Example
x = "Reddit Will Now Quarantine‰Û_ http://t.co/pkUAMXw6pm #onlinecommunities #reddit #amageddon #freespeech #Business http://t.co/PAWvNJ4sAP"
# Remove unicode characters
x = x.encode('ascii', 'ignore').decode()
print(x)
>>> Reddit Will Now Quarantine_ http://t.co/pkUAMXw6pm #onlinecommunities #reddit #amageddon #freespeech #Business http://t.co/PAWvNJ4sAP
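
The same encode-then-decode trick also strips emojis, since they are not ASCII characters. A quick sketch with a made-up tweet,

# Hypothetical example: the emoji cannot be encoded as ASCII, so it is dropped
x = "Wildfire spreading fast 🔥 stay safe everyone"
x = x.encode('ascii', 'ignore').decode()
print(x)
>>> Wildfire spreading fast  stay safe everyone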

Remove stop words

After that, we can remove the stop words. A stop word is a word that makes no significant contribution to the meaning of a text, so we can safely remove it. To retrieve the stop words, we can download a corpus from the NLTK library. Here is the code on how to do this,

import nltk
from nltk.corpus import stopwords

# Download the stop-word corpus (or run nltk.download() to fetch everything)
nltk.download('stopwords')
stop_words = stopwords.words("english")

# Example
x = "America like South Africa is a traumatised sick country - in different ways of course - but still messed up."
# Remove stop words
x = ' '.join([word for word in x.split(' ') if word not in stop_words])
print(x)
>>> America like South Africa traumatised sick country - different ways course - still messed up.
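
One small design note: stopwords.words("english") returns a plain list, so each word not in stop_words check scans the whole list. If you plan to clean many texts, converting the list to a set makes every lookup constant-time; a minimal sketch,

from nltk.corpus import stopwords

# Set membership tests are O(1), unlike list scans
stop_words = set(stopwords.words("english"))
x = "America like South Africa is a traumatised sick country"
x = ' '.join([word for word in x.split(' ') if word not in stop_words])
print(x)
>>> America like South Africa traumatised sick country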

Remove terms like mentions, hashtags, links, and more

Besides removing the Unicode characters and stop words, there are several other terms we should remove, including mentions, hashtags, links, and punctuation.

Removing those is challenging if we rely only on exactly defined characters. Instead, we need patterns that can match the terms we want, which we can build with something called a Regular Expression (Regex).

A regex is a special string that encodes a pattern, and any text matching that pattern can be searched for or removed. In Python, we can do this with the built-in re library, like this,

import re
import string

# Remove mentions
x = "@DDNewsLive @NitishKumar and @ArvindKejriwal can't survive without referring @@narendramodi . Without Mr Modi they are BIG ZEROS"
x = re.sub(r"@\S+", " ", x)
print(x)
>>> and can't survive without referring . Without Mr Modi they are BIG ZEROS

# Remove URLs
x = "Severe Thunderstorm pictures from across the Mid-South http://t.co/UZWLgJQzNS"
x = re.sub(r"https*\S+", " ", x)
print(x)
>>> Severe Thunderstorm pictures from across the Mid-South

# Remove hashtags
x = "Are people not concerned that after #SLAB's obliteration in Scotland #Labour UK is ripping itself apart over #Labourleadership contest?"
x = re.sub(r"#\S+", " ", x)
print(x)
>>> Are people not concerned that after obliteration in Scotland UK is ripping itself apart over contest?

# Remove ticks and the character that follows them
x = "Notley's tactful yet very direct response to Harper's attack on Alberta's gov't. Hell YEAH Premier! http://t.co/rzSUlzMOkX #ableg #cdnpoli"
x = re.sub(r"\'\w+", '', x)
print(x)
>>> Notley tactful yet very direct response to Harper attack on Alberta gov. Hell YEAH Premier! http://t.co/rzSUlzMOkX #ableg #cdnpoli

# Remove punctuation
x = "In 2014 I will only smoke crqck if I becyme a mayor. This includes Foursquare."
x = re.sub('[%s]' % re.escape(string.punctuation), ' ', x)
print(x)
>>> In 2014 I will only smoke crqck if I becyme a mayor  This includes Foursquare

# Remove numbers (and any word containing a digit)
x = "C-130 specially modified to land in a stadium and rescue hostages in Iran in 1980... http://t.co/tNI92fea3u http://t.co/czBaMzq3gL"
x = re.sub(r'\w*\d+\w*', '', x)
print(x)
>>> C- specially modified to land in a stadium and rescue hostages in Iran in ... http://t.co/ http://t.co/

# Collapse runs of two or more spaces into one
x = " and can't survive without referring . Without Mr Modi they are BIG ZEROS"
x = re.sub(r'\s{2,}', " ", x)
print(x)
>>> and can't survive without referring . Without Mr Modi they are BIG ZEROS
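
If you will run these substitutions over thousands of tweets, it is idiomatic to compile each pattern once and reuse it. Here is a minimal sketch of that idea (the pattern names and the strip_noise function are my own, not from the original code),

import re

# Compile the patterns once, reuse them for every tweet
MENTION = re.compile(r"@\S+")
URL = re.compile(r"https*\S+")
HASHTAG = re.compile(r"#\S+")
SPACES = re.compile(r"\s{2,}")

def strip_noise(text):
    for pattern in (URL, MENTION, HASHTAG):
        text = pattern.sub(" ", text)
    return SPACES.sub(" ", text).strip()

print(strip_noise("@user check #wildfires http://t.co/abc now"))
>>> check now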

Combine them

Now that you know each preprocessing step, let’s apply them to a whole list of texts. If you look at the steps closely, you will see that they build on each other, so it’s best to wrap them in a single function that runs them all sequentially. Before we apply the preprocessing steps, here is a preview of the sampled texts,

Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
Forest fire near La Ronge Sask. Canada
All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
13,000 people receive #wildfires evacuation orders in California
Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school

There are several steps we should take to preprocess a list of texts. They are,

  1. Create a function that contains all of the preprocessing steps and returns a preprocessed string
  2. Apply the function to every text using the pandas apply method, as shown below

The code will look like this,

# # In case of import errors
# ! pip install nltk
# ! pip install textblob
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import nltk
import string
from nltk.corpus import stopwords

# In case the corpus is missing (or run nltk.download() to fetch everything)
nltk.download('stopwords')

df = pd.read_csv('train.csv')
stop_words = stopwords.words("english")

def text_preproc(x):
    x = x.lower()
    x = ' '.join([word for word in x.split(' ') if word not in stop_words])
    x = x.encode('ascii', 'ignore').decode()
    x = re.sub(r'https*\S+', ' ', x)
    x = re.sub(r'@\S+', ' ', x)
    x = re.sub(r'#\S+', ' ', x)
    x = re.sub(r'\'\w+', '', x)
    x = re.sub('[%s]' % re.escape(string.punctuation), ' ', x)
    x = re.sub(r'\w*\d+\w*', '', x)
    x = re.sub(r'\s{2,}', ' ', x)
    return x

df['clean_text'] = df.text.apply(text_preproc)

And here is the result,

deeds reason may allah forgive us
forest fire near la ronge sask canada
residents asked place notified officers evacuation shelter place orders expected
people receive evacuation orders california
got sent photo ruby smoke pours school
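
To spot-check the cleaning on your own run, you can apply the function to a single string or view the raw and cleaned columns side by side (this assumes the df and text_preproc defined above),

# The function also works on a single string
print(text_preproc("Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school"))
>>> got sent photo ruby smoke pours school

# Compare a few raw tweets with their cleaned versions
print(df[['text', 'clean_text']].head())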