Original article was published by Federico Bianchi on Artificial Intelligence on Medium
Micro-Tutorial: Quick Text Preprocessing with NLTK
I often forget the exact methods I need to run a quick preprocessing pipeline. And most of the time I just need the bare minimum: remove punctuation and remove stopwords.
First thing, install NLTK, the toolkit we are going to use to handle the preprocessing.
pip install nltk
Give me the code
I will just put this quick function here, so you can copy and paste it wherever you want.
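As a rough sketch of what such a function could look like (not necessarily the exact original code), the version below assumes NLTK's `RegexpTokenizer` with the pattern `\w+` and the English stopword list from `nltk.corpus.stopwords`. The optional `stop_words` parameter is an addition for illustration, so you can pass your own list. The body is laid out so that lines 11 to 15 line up with the walkthrough in "How does it work?".

```python
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer


def preprocess(sentence, stop_words=None):
    """Lowercase a sentence, strip punctuation, and drop stopwords."""
    if stop_words is None:
        # Assumed default: NLTK's English list (run nltk.download("stopwords") once).
        stop_words = set(stopwords.words("english"))

    sentence = sentence.lower()
    tokenizer = RegexpTokenizer(r"\w+")  # keep runs of word characters only
    tokens = tokenizer.tokenize(sentence)  # punctuation is dropped here
    filtered_words = [w for w in tokens if w not in stop_words]
    return " ".join(filtered_words)
```

Turning the stopword list into a `set` once keeps the membership check cheap, which matters if you call the function on many sentences.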
A few examples:
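To make the behaviour concrete, here is a small self-contained demo. It uses a compact version of the pipeline described in this article and a tiny, hypothetical stopword set (`demo_stop_words`) instead of NLTK's full English list, so it runs without downloading any corpus. The sentences are made up for illustration.

```python
from nltk.tokenize import RegexpTokenizer


def preprocess(sentence, stop_words):
    # Compact version of the pipeline described in this article.
    tokens = RegexpTokenizer(r"\w+").tokenize(sentence.lower())
    return " ".join(w for w in tokens if w not in stop_words)


# Hypothetical mini stopword list, just for this demo.
demo_stop_words = {"the", "is", "on", "a", "and", "of", "to", "this"}

print(preprocess("The cat is on the table!", demo_stop_words))
# cat table
print(preprocess("Hello, World: this is a test.", demo_stop_words))
# hello world test
```

With NLTK's own English list the output may differ slightly, since that list contains many more words than the demo set.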
How does it work?
It’s super easy:
- line 11) we lowercase the sentence
- line 12) we instantiate an NLTK tokenizer that keeps only words
- line 13) we actually tokenize the sentence (removing punctuation) and get a list of tokens
- line 14) we remove stopwords
- line 15) we join the words to make a new sentence without punctuation and stopwords
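The steps above can also be run one at a time to see what each one produces. This is only a sketch: it assumes the tokenizer is NLTK's `RegexpTokenizer` with the pattern `\w+`, and it uses a small hypothetical stopword set rather than the full English list.

```python
from nltk.tokenize import RegexpTokenizer

# Hypothetical stopword set, kept tiny so the result is easy to follow.
stop_words = {"i", "the", "it", "s"}

sentence = "I like the pizza; it's great!"

lowered = sentence.lower()
# "i like the pizza; it's great!"

tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize(lowered)
# ['i', 'like', 'the', 'pizza', 'it', 's', 'great']

kept = [w for w in tokens if w not in stop_words]
# ['like', 'pizza', 'great']

result = " ".join(kept)
# 'like pizza great'
print(result)
```

Note that `\w+` splits contractions at the apostrophe ("it's" becomes "it" and "s") and keeps digits and underscores, so this quick pipeline is best suited to cases where that loss of detail does not matter.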