Stemming vs Lemmatization?

What exactly is Stemming?

Stemming is the process of reducing a word to its stem or root form. Let us take an example: consider the three words "branched", "branching" and "branches". They can all be reduced to the same word, "branch". After all, all three convey the same idea of something separating into multiple paths or branches. Reducing them to one form cuts down complexity while retaining the essence of the meaning they carry.
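As a quick illustration, here is how NLTK's PorterStemmer (which we will look at properly below) handles these three words; for this particular example all three collapse to the same stem:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["branched", "branching", "branches"]:
    # Each variant reduces to the common stem "branch"
    print(word, "->", stemmer.stem(word))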

In contrast to lemmatization, stemming is meant to be a fast and somewhat crude operation, carried out by applying very simple search-and-replace style rules.

For example, the suffixes "ing" and "ed" can be dropped, and "ies" can be replaced by "y". Following this approach, we may end up with tokens that are not complete words, but that is okay: all forms of a word in the corpus get reduced to the same form, and that is enough to capture the common underlying idea.
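To make this concrete, here is a toy stemmer built from just those two rules; this is purely an illustrative sketch, not how NLTK's stemmers are actually implemented:

import re

def naive_stem(word):
    # Replace a trailing "ies" with "y" (e.g. "berries" -> "berry")
    word = re.sub(r"ies$", "y", word)
    # Drop a trailing "ing" or "ed" (e.g. "branching" -> "branch")
    word = re.sub(r"(ing|ed)$", "", word)
    return word

print([naive_stem(w) for w in ["branched", "branching", "berries", "agreed"]])
# ['branch', 'branch', 'berry', 'agre'] -- "agre" is not a real word, and that is fine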

Here is the word list we will stem, with the stopwords already removed:

words = ['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'ones', 'started', 'war', 'ai', 'bad', 'thing']

Further, NLTK, the Natural Language Toolkit, gives us a few different stemmers to choose from, such as the PorterStemmer that we are using here, the Snowball Stemmer, and other language-specific stemmers. Let's import the PorterStemmer for a simple stemming operation.

from nltk.stem.porter import PorterStemmer

For stemmers to work, one simply has to pass in one word at a time from the corpus. Note that the stopwords have already been removed from the list above.
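For reference, that stopword-removal step might look roughly like the sketch below, assuming NLTK's stopwords corpus has already been downloaded with nltk.download('stopwords'); the raw_tokens list here is just a made-up fragment for illustration:

from nltk.corpus import stopwords

# Hypothetical raw tokens before cleaning; in practice these come from tokenizing the original text
raw_tokens = ['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance']

stop_words = set(stopwords.words('english'))
filtered = [w for w in raw_tokens if w not in stop_words]
print(filtered)  # ['first', 'time', 'see', 'second', 'renaissance']

Back to our words list, stemming it takes only a couple of lines: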

# Stem each word in the (already stopword-free) list
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(w) for w in words]
print(stemmed_words)

By running this code, we get an output similar to:

['first', 'time', 'see', 'second', 'renaiss', 'may', 'look', 'bore', 'look', 'least', 'twice', 'definit', 'watch', 'part', '2', 'chang', 'view', 'matrix', 'human', 'peopl', 'one', 'start', 'war', 'ai', 'bad', 'thing']

Looking at the output, we can observe that some of the conversions are actually good: "started" is reduced to "start", "people" loses the "e" at the end, and "ones" becomes "one", all as a result of applying very simplistic rules.
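For comparison, the Snowball stemmer mentioned earlier can be swapped in with only a small change; here is a quick sketch (it agrees with the Porter stemmer on most of these words, though the two can differ on a few):

from nltk.stem.snowball import SnowballStemmer

# Snowball stemmers are language specific, so the language is passed explicitly
snowball = SnowballStemmer("english")
snowball_words = [snowball.stem(w) for w in words]
print(snowball_words)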