Natural Language Processing (NLP) for Beginners

Original article was published by Behic Guven on Artificial Intelligence on Medium


Step 5 — Data Visualization

In this step, we will first count the frequency of the tokens and then keep only the high-frequency ones. After filtering, it’s time to visualize the most frequently used words on the natural language processing Wikipedia page. Visualization will help us see them ordered by frequency.

Let’s calculate the word frequencies using NLTK’s FreqDist function.

freq = nltk.FreqDist(clean_tokens)
for key, val in freq.items():
    print('Word: ' + str(key) + ', Quantity: ' + str(val))
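If you don’t have NLTK at hand, the same token-to-count mapping can be sketched with the standard library’s collections.Counter, which behaves much like FreqDist for this purpose. The clean_tokens list below is a small hypothetical sample; in the article it comes from the tokenized Wikipedia page:

```python
from collections import Counter

# Counter is a stdlib stand-in for nltk.FreqDist: it maps token -> count.
# clean_tokens here is a hypothetical sample standing in for the real
# tokenized page content.
clean_tokens = ["language", "processing", "language", "nlp", "language"]
freq = Counter(clean_tokens)

for key, val in freq.items():
    print('Word: ' + str(key) + ', Quantity: ' + str(val))
```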

Now, we will define a new dictionary and collect the tokens that have been used more than 10 times on the page. These keywords are more valuable than the others:

high_freq = dict()
for key, val in freq.items():
    if val > 10:
        high_freq[key] = val
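The filtering loop above can also be written more compactly as a dict comprehension. A minimal sketch, using a hypothetical plain dict in place of the FreqDist from the previous step:

```python
# Hypothetical frequency counts standing in for the FreqDist built earlier;
# the comprehension keeps only tokens that appear more than 10 times.
freq = {"language": 42, "processing": 31, "the": 120, "rare": 3}
high_freq = {key: val for key, val in freq.items() if val > 10}
print(high_freq)
```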

Perfect! Now we have a new dictionary called high_freq. Let’s move to the final step and create a bar chart. I think a bar chart works better for representing quantitative data. I’ve also sorted in descending order so that the word with the highest frequency comes first. Here is the visualization code:

# Note: to pass the keys and values of the high_freq dictionary,
# I had to convert them to lists when passing them
import plotly.io as pio

fig = dict({
    "data": [{"type": "bar",
              "x": list(high_freq.keys()),
              "y": list(high_freq.values())}],
    "layout": {"title": {"text": "Most frequently used words in the page"},
               "xaxis": {"categoryorder": "total descending"}}
})
pio.show(fig)
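If you prefer to order the bars yourself rather than rely on Plotly’s categoryorder setting, you can sort the dictionary items by count in plain Python before building the figure. A sketch with hypothetical counts:

```python
# Hypothetical high-frequency counts; sort the (word, count) pairs by count,
# descending, so the tallest bar comes first in any plotting library.
high_freq = {"language": 42, "the": 120, "processing": 31}
ordered = sorted(high_freq.items(), key=lambda item: item[1], reverse=True)

words = [word for word, count in ordered]
counts = [count for word, count in ordered]
print(words)
```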