Original article was published on Artificial Intelligence on Medium
Part-2: Tokenization (NLP)
Text is an important form of data. When we try to convert it into an informational model, our first step in NLP is tokenization. Let's understand how to deal with it :D.
What is Tokenization (NLP)?
Tokenization is the process of splitting a string or text into a list of tokens. One can think of a token as a part of a whole: a word is a token in a sentence, and a sentence is a token in a paragraph.
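As a minimal sketch of the idea, the snippet below splits a sample string into sentence tokens and word tokens using only Python's standard library (the regular expressions here are simplified for illustration):

```python
# A minimal tokenization sketch using only the standard library.
import re

text = "Tokenization splits text. Each word becomes a token."

# Sentence tokenization: split after sentence-ending punctuation (naive).
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word tokenization: pull out runs of word characters.
words = re.findall(r"\w+", text)

print(sentences)  # ['Tokenization splits text.', 'Each word becomes a token.']
print(words)      # ['Tokenization', 'splits', 'text', 'Each', 'word', 'becomes', 'a', 'token']
```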
As you can see from the above example, the string is now tokenized. The next question is: why do we do tokenization at all?
Why Tokenization (NLP)?
As we now know, the process of splitting text into sentences or words is called tokenization. We tokenize because once the words and sentences are separated, we can reach each individual word or sentence and draw insights from it. That is why tokenization is so important for us.
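To make "insights from each word" concrete, here is a small sketch: once the text is tokenized, counting how often each token appears is one of the simplest insights we can extract (the sample sentence is made up for illustration):

```python
# Counting word frequencies from tokens -- a common first insight.
from collections import Counter

text = "the cat sat on the mat and the dog sat too"
tokens = text.split()        # simple whitespace tokenization
freq = Counter(tokens)

print(freq.most_common(2))   # [('the', 3), ('sat', 2)]
```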
How to Implement Tokenization?
Please go through the notebook above for the different types of tokenization.
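The notebook itself is not reproduced here, so as a stand-in, this sketch illustrates three common styles of tokenization using only the standard library; libraries such as NLTK or spaCy provide more robust versions (e.g. NLTK's `word_tokenize` and `sent_tokenize`):

```python
# Three tokenization styles, sketched with the standard library only.
import re

text = "Don't stop learning. NLP is fun!"

# 1. Whitespace tokenization: split on spaces only.
whitespace_tokens = text.split()

# 2. Word tokenization: keep contractions, treat punctuation as tokens.
word_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

# 3. Sentence tokenization: split after ., ! or ? (naive).
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

print(whitespace_tokens)  # ["Don't", 'stop', 'learning.', 'NLP', 'is', 'fun!']
print(word_tokens)        # ["Don't", 'stop', 'learning', '.', 'NLP', 'is', 'fun', '!']
print(sentence_tokens)    # ["Don't stop learning.", 'NLP is fun!']
```

Notice how whitespace splitting leaves punctuation glued to words, while the word tokenizer separates it — one reason real tokenizers are more than a simple `split()`.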
Practice Assignment Time 🙂
I hope the explanation above helped you understand tokenization. Please use this practice assignment to build a better understanding of it.
Feel free to use our GitHub repo associated with this NLP series: https://github.com/wakeupcoders/Natural-Language-Processing-
Link : Part -1 : Introduction to NLP