Micro-Tutorial: Quick Text Preprocessing with NLTK

Original article was published by Federico Bianchi on Artificial Intelligence on Medium



I often forget the exact methods needed to run a quick preprocessing pipeline. And most of the time I just need the bare minimum: remove punctuation and remove stopwords.

First, install NLTK, the toolkit we are going to use for the preprocessing.

pip install nltk

Give me the code

I will just write this quick function here, so you can copy and paste it wherever you want.
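The original code listing is not reproduced in this text, so what follows is a minimal sketch of such a function, reconstructed from the steps described further down. The use of `RegexpTokenizer` with the pattern `\w+`, the helper `load_english_stopwords`, and the optional `stop_words` parameter are assumptions made for illustration, not necessarily what the author wrote.

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer


def load_english_stopwords():
    """Return NLTK's English stopwords as a set, downloading the corpus if missing."""
    try:
        return set(stopwords.words("english"))
    except LookupError:
        # The stopword corpus is not bundled with the pip package; fetch it once.
        nltk.download("stopwords", quiet=True)
        return set(stopwords.words("english"))


def preprocess(sentence, stop_words=None):
    """Lowercase a sentence, then strip punctuation and stopwords."""
    if stop_words is None:
        stop_words = load_english_stopwords()
    sentence = sentence.lower()
    # \w+ matches runs of letters, digits, and underscores, so punctuation is dropped.
    tokenizer = RegexpTokenizer(r"\w+")
    tokens = tokenizer.tokenize(sentence)
    filtered = [token for token in tokens if token not in stop_words]
    return " ".join(filtered)
```

Calling `preprocess("some text")` uses NLTK's English stopword list. Passing your own set, as in `preprocess("some text", stop_words={"some"})`, skips the corpus lookup entirely, which can be handy when you want a domain-specific list.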

A few examples:
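The original examples are not included in this text, so here are a few illustrative ones. To keep the snippet runnable on its own and the outputs predictable, it uses a condensed version of the function and a tiny, hypothetical stopword list (`STOP_WORDS`) instead of NLTK's full English list; both are assumptions for the sake of the demo.

```python
from nltk.tokenize import RegexpTokenizer

# Hypothetical mini stopword list, so the results do not depend on
# which version of NLTK's stopword corpus is installed.
STOP_WORDS = {"the", "is", "a", "and", "on", "this"}


def preprocess(sentence, stop_words=STOP_WORDS):
    """Condensed variant: lowercase, keep only word tokens, drop stopwords."""
    tokens = RegexpTokenizer(r"\w+").tokenize(sentence.lower())
    return " ".join(token for token in tokens if token not in stop_words)


print(preprocess("The cat is on the table."))       # cat table
print(preprocess("This is a test, and it works!"))  # test it works
print(preprocess("Hello, World!!!"))                # hello world
```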

The function we just defined removes both punctuation and stopwords

How does it work?

It’s super easy:

  • first, we lowercase the sentence
  • then we instantiate an NLTK tokenizer that keeps only words
  • we tokenize the sentence (which removes the punctuation) and get a list of tokens
  • we remove the stopwords from that list
  • finally, we join the remaining words to make a new sentence without punctuation and stopwords
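The steps above can be traced one at a time to see what each intermediate value looks like. This is only a sketch: the `RegexpTokenizer(r"\w+")` choice and the small `stop_words` set are assumptions, picked so the trace runs without downloading any NLTK data.

```python
from nltk.tokenize import RegexpTokenizer

sentence = "NLTK is great, isn't it?"
stop_words = {"is", "it", "isn", "t"}  # hypothetical mini list for the demo

# Step 1: lowercase the sentence.
lowered = sentence.lower()

# Step 2: build a tokenizer that keeps only runs of word characters.
tokenizer = RegexpTokenizer(r"\w+")

# Step 3: tokenize; punctuation never matches \w+, so it disappears here.
tokens = tokenizer.tokenize(lowered)
print(tokens)  # ['nltk', 'is', 'great', 'isn', 't', 'it']

# Step 4: drop the stopwords.
kept = [token for token in tokens if token not in stop_words]
print(kept)    # ['nltk', 'great']

# Step 5: join what is left back into a sentence.
result = " ".join(kept)
print(result)  # nltk great
```

Note how the apostrophe splits "isn't" into "isn" and "t". With this kind of tokenizer, contractions only vanish if their fragments are also in the stopword list.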