Source: Deep Learning on Medium
If you are working on NLP tasks involving Chinese, Japanese, or Korean, you may notice that the workflow differs from an English NLP task. Unlike English, these languages have no spaces to separate words naturally, so word segmentation is an essential first step. I have done some research on different word segmentation methods, and in this post I will give simple advice on choosing the best approach.
Use the Unigram Language Model. You can find an implementation in SentencePiece, a language-independent subword tokenizer. No matter what language you work with, this is a good starting point.
The sentence below in English is separated by spaces, but the corresponding sentence in Japanese has no spaces.
The corresponding tokens in the two languages look like this:
I -> 私
will be -> になる
the pirate king -> 海賊王
So extracting “海賊王” (pirate king) as a single token is the word segmentation problem we need to deal with. There are usually three levels we can work at: word level, character level, and subword level. In recent years, the subword-level approach has shown its superiority over the others, so in this post I will focus on the subword level.
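To make the problem concrete, here is a toy sketch of segmentation by greedy longest-match against a fixed subword vocabulary. Both the example sentence and the vocabulary are assumptions for illustration; a real tokenizer learns its vocabulary from data.

```python
# Toy illustration of subword segmentation: greedy longest-match
# against a hypothetical, hand-picked vocabulary (an assumption for
# this example, not the output of a trained model).
def greedy_segment(text, vocab):
    pieces, i = [], 0
    while i < len(text):
        # try the longest substring starting at i that is in the vocabulary
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # fall back to a single character for unknown spans
            pieces.append(text[i])
            i += 1
    return pieces

vocab = {"私", "は", "海賊王", "になる"}
print(greedy_segment("私は海賊王になる", vocab))
# → ['私', 'は', '海賊王', 'になる']
```

With “海賊王” in the vocabulary, the whole word comes out as one token instead of three separate characters.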
Subword level segmentation
This post gives a great introduction to three subword algorithms:
- Byte Pair Encoding (BPE)
- WordPiece
- Unigram Language Model
The author of the Unigram Language Model also released a library, SentencePiece, which implements two subword algorithms: BPE and the Unigram Language Model.
Among recent powerful language models, BERT uses the WordPiece model and XLNet uses the Unigram Language Model.
The greatest advantage of the Unigram Language Model is that it is language-independent: no matter what language you work with, it is a good starting point.
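At inference time, a unigram model picks the segmentation that maximizes the product of piece probabilities, which can be found with Viterbi dynamic programming. Here is a minimal sketch; the hand-picked piece probabilities are assumptions for illustration (a real model learns them with EM during training).

```python
import math

def viterbi_segment(text, piece_probs):
    # best[i] holds (log-probability, segmentation) of the best split of text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in piece_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(piece_probs[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]

# hand-picked unigram probabilities, purely for illustration
piece_probs = {"私": 0.1, "は": 0.2, "海": 0.05, "賊": 0.01, "王": 0.05,
               "海賊王": 0.08, "に": 0.1, "なる": 0.1, "になる": 0.15}
print(viterbi_segment("私は海賊王になる", piece_probs))
# → ['私', 'は', '海賊王', 'になる']
```

Because the single piece “海賊王” is far more probable than the product of its three characters, the model keeps it whole; nothing in the algorithm depends on which language the text is in.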