# The language of a CEO: an NLP analysis of Steve Jobs' commencement speech

Original article was published by Michelangiolo Mazzeschi on Artificial Intelligence on Medium

# Issues with NLP and how to solve them

Using basic programming, a text stored as a plain string contains too many elements that will make our results useless:

• Punctuation
• Stopwords (very common words such as I, my, or them)
• Different words with the same root

For example, the sentence:

`'I asked my mother if she could buy me cookies. She told me she already bought them.'`

After removing punctuation, if I run the algorithm to extract words by frequency, this is the result:

`('asked', 1),('mother', 1),('buy', 1),('cookies', 1),('told', 1),('bought', 1)`

As you can see, the computer makes a big distinction between the verbs buy and bought. In reality, because they derive from the same verb, they should be counted under the same category. How do I solve this problem? By using lemmatization.
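The count above can be reproduced with plain Python. This is just a sketch: the stopword list here is a toy set hand-picked for this one sentence, whereas a real pipeline would use a library-provided list (such as spaCy's `token.is_stop`).

```python
import string
from collections import Counter

text = ('I asked my mother if she could buy me cookies. '
        'She told me she already bought them.')

# Toy stopword list for this sketch only; a real pipeline
# would use a library-provided one.
stopwords = {'i', 'my', 'if', 'she', 'could', 'me', 'them', 'already'}

# Strip punctuation, lowercase, split into words, drop stopwords
cleaned = text.translate(str.maketrans('', '', string.punctuation)).lower()
words = [w for w in cleaned.split() if w not in stopwords]

print(Counter(words).most_common())
# buy and bought are still counted as two separate words
```

Note that `Counter.most_common()` keeps insertion order for ties, which is why the output lists the words in the order they appear in the sentence.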

## Lemmatization

What lemmatization does is iterate through every single word to find its root (its lemma).

This is the result I obtain after lemmatization. As you can see, now the word buy is counted twice.

`('buy', 2),('ask', 1),('mother', 1),('cookie', 1),('tell', 1)`
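The same idea can be sketched with a hand-written lemma table. The table below is purely illustrative; a real lemmatizer (such as spaCy's `token.lemma_`, used later in this article) covers the whole vocabulary.

```python
from collections import Counter

# Toy lemma table for this sketch only
lemmas = {'asked': 'ask', 'bought': 'buy',
          'cookies': 'cookie', 'told': 'tell'}

words = ['asked', 'mother', 'buy', 'cookies', 'told', 'bought']

# Map each word to its root before counting
roots = [lemmas.get(w, w) for w in words]
print(Counter(roots).most_common())
# 'buy' and 'bought' now fall under the same root, so buy counts twice
```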

I am ready to write the software and analyze the text I want:

# Installing Libraries

`!pip install spacy`

`!python -m spacy download en_core_web_sm`

The second command downloads the small English model that `spacy.load` will need.

The spacy library is one of the best tools for NLP analysis. It offers other very useful features, such as entity recognition, but today I will limit myself to counting word frequency.

# Creating a words counter

This is the function that holds the entire experiment. As input, it receives a text and the number of most frequent words we want to extract.

```python
def top_frequent(text, num_words):
    # frequency of most common words
    import spacy
    from collections import Counter

    nlp = spacy.load("en_core_web_sm")

    # lemmatization: replace every token with its root
    doc = nlp(text)
    token_list = [token.lemma_ for token in doc]
    lemmatized = ' '.join(token_list)

    # remove stopwords and punctuation, then count
    doc = nlp(lemmatized)
    words = [token.text for token in doc
             if not token.is_stop and not token.is_punct]
    word_freq = Counter(words)
    return word_freq.most_common(num_words)
```

# Steve Jobs' commencement speech

To use the commencement speech as input, I stored it in a variable. Remember to use the triple quotes, so that any commas, apostrophes, or other punctuation within the text are treated as part of the string rather than as code.

`text = '''I am honored to be with you today at your commencement from one of the finest universities in the world. I never graduated from college. Truth be told, this is the closest I’ve ever gotten to a college graduation. Today I want to tell you three stories from my life. That’s it. No big deal. Just three stories...'''`

I will simply call the function on the text:

`top_frequent(text, 10)`

These are the most frequent words in the speech:

`('life', 16),('college', 12),('year', 12),('drop', 11),('want', 9),('look', 9),('love', 9),('Apple', 9)`

Do you find them inspiring?