Cleaning up the text data using text preprocessing techniques

Original article was published on Artificial Intelligence on Medium

Photo by Dmitry Ratushny on Unsplash

Text preprocessing is an approach for cleaning and preparing text data for use in a specific context. Developers use it in practically all natural language processing (NLP) pipelines, including speech recognition software, search engine queries, and machine learning model training. It is a fundamental step because text data can vary widely. From its format (website, text message, voice transcription) to the people who create the text (language, dialect), there are plenty of things that can introduce noise into your data.

The ultimate objective of cleaning and preparing text data is to reduce the text to only the words that you need for your NLP goal.

Noise Removal

Text cleaning is a technique that developers use across a wide range of domains. Depending on the goal of your task and where your data comes from, you may want to remove unwanted information, such as:

  • punctuation and accents
  • special characters
  • numeric digits
  • leading, ending, and vertical whitespace
  • HTML formatting

The kind of noise that you have to remove from text generally depends on its source. For instance, you could access data via the Twitter API, by scraping a web page, or through speech recognition software. Fortunately, you can use the .sub() method in Python’s regular expression (re) library for the vast majority of your noise removal needs.

The .sub() method has three required arguments:

  1. pattern – a regular expression that is searched for in the input string. It is good practice to prefix the string with r to mark it as a raw string, which treats backslashes as literal characters.
  2. replacement_text – text that replaces all matches in the input string
  3. input – the input string that will be edited by the .sub() method

“Who was partying?” -> Original string

“who was partying” -> Lower case and Noise removal
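As a minimal sketch of this step (the exact pattern depends on which noise you want to remove), lowercasing and punctuation stripping for the string above could look like:

```python
import re

text = "Who was partying?"

# Lowercase first, then use re.sub() to strip punctuation.
# Arguments: pattern (a raw string), replacement_text, input.
lowercased = text.lower()
cleaned = re.sub(r"[?.!,]", "", lowercased)

print(cleaned)  # who was partying
```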


Tokenization

For many natural language processing tasks, we need access to each word in a string. To access each word, we first need to break the text into smaller components. The technique of breaking text into smaller components is called tokenization, and the individual components are called tokens.

A couple of tasks that require tokenization include:

  • Finding how many words or sentences appear in text
  • Determining how many times a specific word or phrase exists
  • Accounting for which terms are likely to co-occur

While tokens are typically individual words or terms, they can also be sentences or other-sized chunks of text.

To tokenize individual words, we can use nltk’s word_tokenize() function. The function accepts a string and returns a list of words.

“Who was partying?” -> Original string

“who was partying” -> Lower case and Noise removal

[“who”, “was”, “partying”] -> Tokenization


Text Normalization

Tokenization and noise removal are staples of almost all text preprocessing pipelines. However, some data may require further processing through text normalization. Text normalization is a catch-all term for various text preprocessing tasks. A few of the normalization tasks include:

  • Upper or lowercasing
  • Stopword removal
  • Stemming — bluntly removing prefixes and suffixes from a word
  • Lemmatization — replacing a single-word token with its root

The simplest of these approaches is to change the case of a string. We can use Python’s built-in String methods to make a string all uppercase or lowercase.
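For example:

```python
text = "Who Was Partying?"

# str.lower() and str.upper() return new strings; the original is unchanged
print(text.lower())  # who was partying?
print(text.upper())  # WHO WAS PARTYING?
```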

Stopword Removal

Stopwords are words that we remove during preprocessing when we don’t care about sentence structure. They are generally the most common words in a language and don’t provide any information about the tone of the text. They include words such as “a”, “an”, and “the”. NLTK provides a built-in list of these words.


Stemming

In natural language processing, stemming is the text preprocessing normalization task concerned with bluntly removing word affixes (prefixes and suffixes). For instance, stemming would cast “going” to “go”. This is a common technique used by search engines to improve matching between user queries and site results. NLTK has a built-in stemmer called PorterStemmer, which you can use with a list comprehension to stem each word in a tokenized list of words.
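A short sketch with PorterStemmer (note how blunt the suffix removal can be; stems are not always dictionary words):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
tokens = ["going", "cooking", "jumps", "was"]

# Stem each token with a list comprehension
stemmed = [stemmer.stem(token) for token in tokens]
print(stemmed)  # "was" bluntly loses its final "s" and becomes "wa"
```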


Lemmatization

Lemmatization is a method for casting words to their root forms, or lemmas. This is a more involved process than stemming, because it requires the method to know the part of speech of each word. Since lemmatization requires the part of speech, it is a less efficient approach than stemming.

Part-of-Speech Tagging

To improve the performance of lemmatization, we need to find the part of speech for each word in our string. We can create a part-of-speech tagging function that accepts a word and returns its most common part of speech. Let’s break down the steps:

  1. Import wordnet and Counter
  2. Get synonyms
  3. Use synonyms to determine the most likely part of speech
  4. Return the most common part of speech

“Who was partying?” -> Original string

“who was partying” -> Lower case and Noise removal

[“who”, “was”, “partying”] -> Tokenization

[“who”, “be”, “party”] -> Lemmatization