Why you should avoid removing STOPWORDS

Original article was published by Gagandeep Singh on Artificial Intelligence on Medium

It is common practice to remove stop words from text during preprocessing. I agree with that, but you should be careful about which stopwords you are removing.

The most common way to remove stop words is to use NLTK’s stopwords list.

Let’s look at the list of stop words from nltk.

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Now, look at the negation words in this list, such as 'no', 'nor', 'not', "don't" and "didn't".

So, what is wrong with them?

Let’s imagine you are asked to create a model that does sentiment analysis of product reviews. The dataset is small enough that you label it yourself. Consider a few reviews from the dataset.

1. The product is really very good. — POSITIVE

2. The products seems to be good. — POSITIVE

3. Good product. I really liked it. — POSITIVE

4. I didn’t like the product. — NEGATIVE

5. The product is not good. — NEGATIVE

You preprocess the data and remove all stopwords.

Now, let us look at what happens to the samples we selected above.

1. product really good. — POSITIVE

2. products seems good. — POSITIVE

3. Good product. really liked. — POSITIVE

4. like product. — NEGATIVE

5. product good. — NEGATIVE
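The result above can be reproduced with a few lines of Python. This is only a minimal sketch: `NLTK_SUBSET` is a small, hardcoded subset of the NLTK list shown earlier (so nothing has to be downloaded), and `remove_stopwords` is a hypothetical helper name, not part of NLTK. The tokenizer also drops punctuation.

```python
import re

# Small subset of NLTK's English stop word list, copied from the list above
# and hardcoded so this sketch runs without downloading the NLTK corpus.
NLTK_SUBSET = {"i", "it", "the", "is", "to", "be", "very", "not", "didn't"}


def remove_stopwords(text, stopwords=NLTK_SUBSET):
    """Drop every token whose lowercase form is in `stopwords`."""
    # Normalise curly apostrophes so "didn’t" matches the entry "didn't".
    text = text.replace("\u2019", "'")
    # Keep letters and apostrophes together so "didn't" stays one token;
    # punctuation is discarded as a side effect.
    tokens = re.findall(r"[A-Za-z']+", text)
    return " ".join(t for t in tokens if t.lower() not in stopwords)


reviews = [
    "The product is really very good.",
    "The products seems to be good.",
    "Good product. I really liked it.",
    "I didn't like the product.",
    "The product is not good.",
]

for review in reviews:
    print(remove_stopwords(review))
```

Because "not" and "didn't" are both in the list, the last two reviews lose the only words that made them negative.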

Look at the negative reviews.

Scary, right?

The positive reviews don’t seem to be affected, but look at the negative ones: their whole meaning has changed. If we train our model on this data, it is very likely to underperform.

This happens very often: after removing stopwords, the whole meaning of a sentence changes.
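To see why this hurts training, here is a small sketch using only the standard library. The positive review "The product is good." is a made-up example that is not in the dataset, and `bag_of_words` is a hypothetical helper. Once "not" is removed, the negative review and the positive one produce exactly the same features, so a model would see identical inputs with opposite labels.

```python
import re
from collections import Counter

# Three entries from the NLTK stop word list, enough for this illustration.
STOPWORDS = {"the", "is", "not"}


def bag_of_words(text, stopwords=STOPWORDS):
    """Count the lowercase tokens of `text` that are not stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)


negative = bag_of_words("The product is not good.")  # review 5, NEGATIVE
positive = bag_of_words("The product is good.")      # assumed POSITIVE review

# Both reviews collapse to the same counts: one "product" and one "good".
print(negative == positive)
```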

If you are working with basic NLP techniques like bag-of-words (BOW), Count Vectorizer or TF-IDF (Term Frequency-Inverse Document Frequency), then removing stopwords is a good idea because stopwords act like noise for these methods. If you are working with LSTMs or other models that capture semantic meaning, where the meaning of a word depends on the context of the preceding text, then it becomes important not to remove stopwords.

Now, coming to my original question — Does removing stopwords really improve model performance?

Like I said earlier, it depends on what kind of stopwords you are removing. The problem is that if you do not remove any stop words, the noise in the dataset will increase because of words like I, my, me, etc.

So, what’s the solution? Creating a new list of appropriate stop words. The problem then is reusing that list across different projects.
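One way such a list could look is sketched below. Everything here is an assumption for illustration: `NLTK_SUBSET` is a hardcoded subset of the NLTK list, `CONTRACTIONS` is a tiny hand-written mapping, and `preprocess` is a hypothetical helper. The idea is simply to take the negation words out of the stop word list and to expand negative contractions so that their "not" survives.

```python
import re

# Hardcoded subset of NLTK's English stop word list (see the list above).
NLTK_SUBSET = {"i", "it", "the", "is", "to", "be", "very",
               "do", "did", "no", "nor", "not"}

# Words that carry the negation and therefore should NOT be removed.
NEGATIONS = {"no", "nor", "not"}

# The custom list: everything from the subset except the negations.
CUSTOM_STOPWORDS = NLTK_SUBSET - NEGATIONS

# Tiny illustrative mapping; a real one would cover many more contractions.
CONTRACTIONS = {"didn't": "did not", "don't": "do not", "isn't": "is not"}


def preprocess(text, stopwords=CUSTOM_STOPWORDS):
    """Expand known contractions, then drop the remaining stop words."""
    # Normalise curly apostrophes so "didn’t" matches the mapping key.
    text = text.replace("\u2019", "'")
    kept = []
    for token in re.findall(r"[A-Za-z']+", text):
        # "didn't" becomes "did" + "not"; other tokens pass through unchanged.
        for word in CONTRACTIONS.get(token.lower(), token).split():
            if word.lower() not in stopwords:
                kept.append(word)
    return " ".join(kept)


print(preprocess("I didn't like the product."))
print(preprocess("The product is not good."))
```

With this list the two negative reviews keep their "not", while filler words such as "I" and "the" are still removed.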

This is why I’ve created a Python package, nlppreprocess, which removes only the stop words that are not needed. It also has some additional functionality that can make cleaning text faster.

The best way to use it is together with pandas. You can check the complete documentation on the package’s page.

Now, if we use this package to preprocess the samples above, we’ll get something like this:

1. product really very good. — POSITIVE

2. products seems good. — POSITIVE

3. Good product. really liked. — POSITIVE

4. not like product. — NEGATIVE

5. product not good. — NEGATIVE

Now, it seems reasonable to use this package for the removal of stopwords and other preprocessing.

Let me know your opinion on this in the comment section.

Thank You!