Original article was published by Michelangiolo Mazzeschi on Artificial Intelligence on Medium
Issues with NLP and how to solve them
With basic string processing alone, several features of raw text will make our results useless:
- Different words with the same root
For example, the sentence:
'I asked my mother if she could buy me cookies. She told me she already bought them.'
After removing punctuation, if I run the algorithm to extract words by frequency, this is the result:
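A minimal sketch of that naive count, using only the standard library. Lowercasing the text and the variable names are my own assumptions for illustration.

```python
import string
from collections import Counter

sentence = "I asked my mother if she could buy me cookies. She told me she already bought them."

# Strip punctuation, lowercase (an assumption), and split on whitespace.
cleaned = sentence.translate(str.maketrans("", "", string.punctuation)).lower()
word_freq = Counter(cleaned.split())

# The two forms of the same verb are counted separately.
print(word_freq["buy"], word_freq["bought"])  # 1 1
```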
As you can see, the computer treats buy and bought as two unrelated words. In reality, because they are forms of the same verb, they should be counted under the same category. How do I solve this problem? By using lemmatization.
Lemmatization iterates through every word in the text and replaces it with its lemma, the base (dictionary) form of the word.
This is the result I obtain after lemmatization. As you can see, now the word buy is counted twice.
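The effect can be reproduced with a toy lookup table. This is only a sketch of the idea: the table below is hypothetical and hand-written for this one sentence, whereas a real lemmatizer such as spaCy derives lemmas from its language model.

```python
from collections import Counter

# Hypothetical hand-written lemma table covering only this sentence.
TOY_LEMMAS = {"asked": "ask", "told": "tell", "bought": "buy", "cookies": "cookie"}

words = "i asked my mother if she could buy me cookies she told me she already bought them".split()

# Replace each word with its lemma, keeping the word itself when it is not in the table.
lemmas = [TOY_LEMMAS.get(word, word) for word in words]
lemma_freq = Counter(lemmas)

print(lemma_freq["buy"])     # 2
print(lemma_freq["bought"])  # 0
```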
I am ready to write the software and analyze the text I want:
!pip install spacy
!python -m spacy download en_core_web_sm
The spaCy library is one of the best tools for NLP analysis. It offers other very useful features as well, such as named entity recognition. Today I will limit myself to counting word frequency.
Creating a word counter
This is the function that holds the entire experiment. As input, it receives a text and the number of most frequent words we want to extract.
def top_frequent(text, num_words):
    # frequency of most common words
    import spacy
    from collections import Counter

    # spaCy 3 removed the "en" shortcut, so load the small English model by its full name
    # (install it with: python -m spacy download en_core_web_sm)
    nlp = spacy.load("en_core_web_sm")

    # lemmatization: replace every token with its lemma and rebuild the text
    doc = nlp(text)
    token_list = [token.lemma_ for token in doc]
    lemmatized = ' '.join(token_list)

    # remove stopwords and punctuation
    doc = nlp(lemmatized)
    words = [token.text for token in doc if not token.is_stop and not token.is_punct]

    # count the remaining words and keep the num_words most frequent
    word_freq = Counter(words)
    common_words = word_freq.most_common(num_words)
    return common_words
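To see what the filtering and counting steps do without loading a spaCy model, here is a small standard-library sketch. The stop word set is hypothetical and deliberately tiny (spaCy ships a much longer list), and the input is the already lemmatized example sentence.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word set for illustration only.
TOY_STOP_WORDS = {"i", "my", "if", "she", "could", "me", "already", "them"}

lemmas = "i ask my mother if she could buy me cookie she tell me she already buy them".split()

# Keep only the content words, then count them.
content_words = [word for word in lemmas if word not in TOY_STOP_WORDS]
word_freq = Counter(content_words)

# most_common(n) returns the n highest counts as (word, count) pairs.
print(word_freq.most_common(1))  # [('buy', 2)]
```

Without the stop word filter, she and me would outrank buy, which is why the function removes them before counting.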
Steve Jobs commencement speech
To use the commencement speech as input, I stored it in a variable. Remember to wrap it in triple quotes (''') so that the text can span several lines and contain apostrophes, commas, or any other punctuation without being interpreted as code.
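A short sketch of how a triple-quoted string behaves, using two lines of the speech. The variable name `sample` is only for illustration.

```python
# Triple quotes let the string span several lines and contain apostrophes.
sample = '''Today I want to tell you three stories from my life.
That's it. No big deal.'''

print(len(sample.splitlines()))  # 2
print("That's" in sample)        # True
```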
text = '''
I am honored to be with you today at your commencement from one of the finest universities in the world. I never graduated from college. Truth be told, this is the closest I’ve ever gotten to a college graduation. Today I want to tell you three stories from my life. That’s it. No big deal. Just three stories...
I will simply call the function on the text:
These are the most frequent words in the speech:
Do you find them inspiring?