The language of a CEO: NLP analysis of Steve Jobs' commencement speech

The original article was published by Michelangiolo Mazzeschi in Artificial Intelligence on Medium.


Issues with NLP and how to solve them

With basic string handling alone, a text contains several elements that will make our results useless:

  • Punctuation
  • Different words with the same root

For example, the sentence:

'I asked my mother if she could buy me cookies. She told me she already bought them.'

After removing punctuation and common stopwords, running a simple word-frequency count gives this result:

('asked', 1),
('mother', 1),
('buy', 1),
('cookies', 1),
('told', 1),
('bought', 1)
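
A count like this can be obtained with a few lines of plain Python, stripping the punctuation and a small hand-picked list of filler words (a minimal sketch; the stopword list below is just for illustration):

import string
from collections import Counter

text = 'I asked my mother if she could buy me cookies. She told me she already bought them.'

#strip punctuation, then drop a small hand-made list of filler words
stopwords = {'i', 'my', 'if', 'she', 'could', 'me', 'them', 'already'}
cleaned = text.translate(str.maketrans('', '', string.punctuation))
words = [w for w in cleaned.split() if w.lower() not in stopwords]

print(Counter(words).most_common())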

As you can see, the computer treats buy and bought as two completely different words. In reality, because they derive from the same verb, they should be counted under the same category. How do I solve this problem? By using lemmatization.

Lemmatization

Lemmatization iterates through every single word and reduces it to its root form, its lemma.

Lemmatization of the word ‘change’
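
The original post shows this step as an image; the same idea can be sketched in a few lines (a minimal example, assuming the en_core_web_sm model installed below; the sample sentence is just for illustration):

import spacy

nlp = spacy.load("en_core_web_sm")

#print each token next to its lemma (the dictionary form spaCy maps it to)
for token in nlp("I changed my plans because everything changes and keeps changing."):
    print(token.text, '->', token.lemma_)

Each inflected form of change (changed, changes, changing) should come back as the lemma change.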

This is the result I obtain after lemmatizing the cookie sentence and counting again. As you can see, the word buy is now counted twice.

('buy', 2),
('ask', 1),
('mother', 1),
('cookie', 1),
('tell', 1)

I am ready to write the software and analyze the text I want:

Installing Libraries

!pip install spacy
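
Depending on the spaCy version, the English model must also be downloaded before spacy.load can find it; the examples in this post use the small en_core_web_sm model:

!python -m spacy download en_core_web_sm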

The spacy library is one of the best tools for performing an NLP analysis. It offers other very useful features as well, such as entity recognition, but today I will limit myself to counting word frequency.
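
Just to give an idea, entity recognition takes only a couple of lines (a minimal sketch, not used in the rest of this post; the sample sentence is illustrative):

import spacy

nlp = spacy.load("en_core_web_sm")

#named entities detected in the text are exposed on doc.ents
doc = nlp("Steve Jobs gave the commencement speech at Stanford in 2005.")
for ent in doc.ents:
    print(ent.text, ent.label_)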

Creating a word counter

This is the function that holds the entire experiment. As an input, it receives a text and the number of most frequent words we want to extract.

def top_frequent(text, num_words):
    #frequency of the most common words
    import spacy
    from collections import Counter

    #load the small English model (the old "en" shortcut no longer works in spaCy 3)
    nlp = spacy.load("en_core_web_sm")

    #lemmatization: replace every token with its lemma
    doc = nlp(text)
    token_list = []
    for token in doc:
        token_list.append(token.lemma_)
    lemmatized = ' '.join(token_list)

    #remove stopwords and punctuation, then count word frequencies
    doc = nlp(lemmatized)
    words = [token.text for token in doc if not token.is_stop and not token.is_punct]
    word_freq = Counter(words)
    common_words = word_freq.most_common(num_words)
    return common_words

Steve Jobs' commencement speech

What I did to use the commencement speech as an input was to place it in a variable. Remember to use the triple quotes, so that line breaks, apostrophes, and any other punctuation inside the text are treated as part of the string rather than as code.

text = '''
I am honored to be with you today at your commencement from one of the finest universities in the world. I never graduated from college. Truth be told, this is the closest I’ve ever gotten to a college graduation. Today I want to tell you three stories from my life. That’s it. No big deal. Just three stories...
'''

I will simply call the function on the text:

top_frequent(text, 10)

These are the most frequent words in the speech:

('life', 16),
('college', 12),
('year', 12),
('drop', 11),
('want', 9),
('look', 9),
('love', 9),
('Apple', 9)

Do you find them inspiring?