English to Katakana with Sequence to Sequence in TensorFlow



This is a repost of an article I wrote a few years ago, with some updates (e.g. Python 3, TensorFlow 1.13, etc.).

All data and code are available on GitHub.

If you are a non-specialist deep-learning enthusiast like me, you probably find it difficult to apply deep NLP techniques, e.g. Sequence-to-Sequence, to real-world problems. I’ve found that most datasets or corpora outside of class tutorials, e.g. chat dialogs, are very big (e.g. WMT’15 in TensorFlow’s Neural Machine Translation tutorial is 20GB) and sometimes difficult to access (e.g. Annotated English Gigaword). Training on those large datasets also takes several hours to several days.

As it happens, I’ve found writing Japanese Katakana a perfect practice problem for machine learning. It could be viewed as a smaller version of machine translation. As you will see, with just around 4MB of training data and a few hours of training on a non-GPU machine, we can build a reasonably good model to translate English words to Katakana characters.

At the end of this tutorial, you will be able to build a machine learning model to write your name in Japanese.

Basic Katakana

(Feel free to skip this part if you know basic Japanese)

Katakana is a subset of Japanese characters. Each Katakana character represents a single Japanese syllable and has a specific pronunciation. To write a word in Katakana, you combine the syllables of its characters. For example:

  • カ is pronounced “ka”
  • ナ is pronounced “na”
  • タ is pronounced “ta”. Its variant, ダ, is pronounced “da”

Putting those together, you can write カナダ (Ka-na-da) = Canada.

This Japanese writing system is used for writing words borrowed from other languages:

  • バナナ (ba-na-na) : Banana
  • バーベキュー (baa-be-ki-yuu) : Barbecue
  • グーグル マップ (guu-gu-ru map-pu) : Google Maps

It’s also used for writing non-Japanese names:

  • ジョン・ドウ (ji-yo-n do-u): John Doe
  • ドナルド・ダック (do-na-ru-do da-k-ku): Donald Duck
  • ドナルド・トランプ (do-na-ru-do to-ra-n-pu): Donald Trump

You can look up the character-to-English-syllable mappings on this website. Or start learning Japanese today :)

Dataset

Our training data consists of English/Japanese pairs of Wikipedia titles, which are usually the names of people, places, or companies. We use only the articles whose Japanese title consists solely of Katakana characters.

The raw data can be downloaded from the DBpedia website. The website provides Labels datasets, in which each line contains a resource ID and its title in the local language.

<http://wikidata.dbpedia.org/resource/Q1000013> <http://www.w3.org/2000/01/rdf-schema#label> "ジャガー・XK140"@ja .
<http://wikidata.dbpedia.org/resource/Q1000032> <http://www.w3.org/2000/01/rdf-schema#label> "アンスクーリング"@ja .
...

We can parse the Japanese and English titles, join them by resource ID, and keep only the articles whose Japanese title is written entirely in Katakana. Using the approach described in this note, I created ~100k joined title pairs.

The built dataset can be downloaded directly here.
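For reference, here is a minimal sketch of that parsing-and-joining step. It assumes the Japanese and English label files have already been downloaded and decompressed as labels_ja.ttl and labels_en.ttl (placeholder file names, not the official DBpedia ones):

import re

def parse_labels(path):
    """Parse a DBpedia Labels file into a {resource_id: title} dictionary."""
    labels = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            m = re.match(r'<([^>]+)> <[^>]+> "(.*)"@\w+ \.', line)
            if m:
                resource_id, title = m.groups()
                labels[resource_id] = title
    return labels

# Katakana block, plus the long-vowel mark, middle dot, and spaces.
katakana_only = re.compile(r'^[ァ-ヶー・\s]+$')

ja_labels = parse_labels('labels_ja.ttl')  # placeholder file names
en_labels = parse_labels('labels_en.ttl')

pairs = [(en_labels[rid], title)
         for rid, title in ja_labels.items()
         if rid in en_labels and katakana_only.match(title)]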

Sequence-to-Sequence in TensorFlow (Keras)

Sequence-to-Sequence (aka Seq2Seq) is a technique for training a model that predicts an output sequence from an input sequence. There are many documents and tutorials that explain the model in more detail.

In this article, I assume you have some theoretical knowledge of Recurrent Neural Networks (RNN and LSTM) and the Sequence-to-Sequence model. I also assume that you have some experience with Keras or TensorFlow.

We will build a simple Sequence-to-Sequence model (without attention), as shown in the diagram, using tf.keras.

The model has two parts, Encoder and Decoder.

  • The encoder reads a sequence of English characters and produces an output vector
  • The decoder reads the encoder output and a sequence of (shifted) Katakana characters, then predicts the next Katakana character at each position in the sequence.

Data Transformation

As shown in the diagram, our model’s input is a sequence of characters. But Keras doesn’t actually take characters as input, so we need to transform the English and Katakana characters into IDs. To do that, we also need to build an encoding dictionary, and its reverse for decoding the results later (sketched after the list below).

We also need to:

  • Pad the decoder input with the START character (as shown in the diagram)
  • Transform the decoder output into one-hot encoded form
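Here is a minimal sketch of these transformations, assuming english_titles and katakana_titles are lists of strings built from the pairs above (the dictionary-building code, reserved IDs, and sequence lengths are illustrative choices, not necessarily the exact ones in the original code):

import numpy as np

PADDING_CHAR_INDEX = 0
START_CHAR_INDEX = 1
INPUT_LENGTH = 20
OUTPUT_LENGTH = 20

def build_dicts(titles):
    """Build character-to-ID encoding and ID-to-character decoding dictionaries."""
    chars = sorted(set(''.join(titles)))
    encoding = {c: i + 2 for i, c in enumerate(chars)}  # IDs 0 and 1 are reserved
    decoding = {i + 2: c for i, c in enumerate(chars)}
    return encoding, decoding

def transform(encoding, titles, max_length):
    """Turn a list of titles into a zero-padded matrix of character IDs."""
    data = np.zeros((len(titles), max_length), dtype='int32')
    for i, title in enumerate(titles):
        for j, c in enumerate(title[:max_length]):
            data[i, j] = encoding[c]
    return data

english_encoding, english_decoding = build_dicts(english_titles)
katakana_encoding, katakana_decoding = build_dicts(katakana_titles)
katakana_vocab_size = len(katakana_encoding) + 2

encoder_input = transform(english_encoding, english_titles, INPUT_LENGTH)

# Decoder input: the Katakana sequence shifted right, starting with the START character.
decoder_input = np.zeros((len(katakana_titles), OUTPUT_LENGTH), dtype='int32')
decoder_input[:, 0] = START_CHAR_INDEX
decoder_input[:, 1:] = transform(katakana_encoding, katakana_titles, OUTPUT_LENGTH - 1)

# Decoder output: the (unshifted) Katakana sequence, one-hot encoded.
decoder_output = np.eye(katakana_vocab_size)[transform(katakana_encoding, katakana_titles, OUTPUT_LENGTH)]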

The Encoder

Our input is a sequence of characters. The first step is to embed each input character into a dense vector with an Embedding layer.

We use a recurrent layer (LSTM) to encode the input vectors. The output of this encoding step is the LSTM’s output at the final time step. This can be done in Keras by setting return_sequences=False.
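A sketch of the encoder in tf.keras, reusing the dictionaries and constants from the transformation sketch above (the embedding and LSTM dimensions are illustrative):

from tensorflow.keras.layers import Input, Embedding, LSTM

english_vocab_size = len(english_encoding) + 2

encoder_input_layer = Input(shape=(INPUT_LENGTH,))
encoder_embedding = Embedding(english_vocab_size, 64)(encoder_input_layer)
# return_sequences=False keeps only the LSTM output at the final time step.
encoder_output = LSTM(64, return_sequences=False)(encoder_embedding)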

The Decoder

The decoder is more complicated, but not by much.

Similar to the encoding step, the decoder input is a sequence of characters. We also pass the decoder input through an Embedding layer to transform each character into a vector.

We, again, pass the embedded input into an LSTM. However, this time we want the LSTM to produce an output sequence (return_sequences=True). We also use the output from the encoder as the initial_state of the LSTM.

Finally, we use a (TimeDistributed) Dense layer with softmax activation to transform each LSTM output into the final output.
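A matching sketch of the decoder, continuing from the encoder above. Feeding the encoder output as both the initial hidden state and the initial cell state is one simple choice (the original code may wire this differently); the dimensions remain illustrative:

from tensorflow.keras.layers import Dense, TimeDistributed

decoder_input_layer = Input(shape=(OUTPUT_LENGTH,))
decoder_embedding = Embedding(katakana_vocab_size, 64)(decoder_input_layer)
# return_sequences=True produces an output at every time step.
# The encoder output seeds both the initial hidden state and the initial cell state.
decoder_lstm = LSTM(64, return_sequences=True)(
    decoder_embedding, initial_state=[encoder_output, encoder_output])
decoder_output_layer = TimeDistributed(
    Dense(katakana_vocab_size, activation='softmax'))(decoder_lstm)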

Building and Training the Model

The final step is to create a Keras Model object. Our model has two inputs (the encoder input and the decoder input) and one output (the decoder output).

We compile the model with the Adam optimizer and the categorical cross-entropy loss function, then fit the model on the transformed data.
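Putting it together (a sketch; the batch size, epoch count, and validation split are illustrative):

from tensorflow.keras.models import Model

model = Model(inputs=[encoder_input_layer, decoder_input_layer],
              outputs=[decoder_output_layer])
model.compile(optimizer='adam', loss='categorical_crossentropy')

model.fit([encoder_input, decoder_input], decoder_output,
          batch_size=64, epochs=10, validation_split=0.1)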

The training usually takes around an hour.

Write Katakana from English words

We can apply the trained model to write Katakana for any given English input. In this article, we will use a greedy decoding technique (a beam search would be more accurate), as follows:

  1. Create encoder_input from the English input.
  2. Create decoder_input as an empty sequence with only the START character.
  3. Use the model to predict decoder_output from encoder_input and decoder_input.

  4. Take the most likely first character of the decoder_output sequence as the first Katakana output, and copy it into the next position of decoder_input.
  5. Keep repeating steps 3–4. Each time, we generate one more Katakana output character and assign it as the next decoder_input character.

We also need character decoding and a wrapper function, sketched below.
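Here is one way to put those steps together, reusing the dictionaries and constants from the transformation sketch above (the helper name and the handling of unknown characters are illustrative):

import numpy as np

def to_katakana(english_word, model, english_encoding, katakana_decoding):
    """Greedily decode the Katakana spelling of an English word."""
    # Step 1: encode the English input as a padded sequence of character IDs.
    encoder_input = np.zeros((1, INPUT_LENGTH), dtype='int32')
    for i, c in enumerate(english_word[:INPUT_LENGTH]):
        encoder_input[0, i] = english_encoding.get(c, PADDING_CHAR_INDEX)

    # Step 2: start the decoder input with only the START character.
    decoder_input = np.zeros((1, OUTPUT_LENGTH), dtype='int32')
    decoder_input[0, 0] = START_CHAR_INDEX

    # Steps 3-5: predict, take the most likely next character, and feed it back.
    output = ''
    for i in range(1, OUTPUT_LENGTH):
        prediction = model.predict([encoder_input, decoder_input])
        next_char_index = int(np.argmax(prediction[0, i - 1]))
        if next_char_index == PADDING_CHAR_INDEX:
            break
        output += katakana_decoding.get(next_char_index, '')
        decoder_input[0, i] = next_char_index
    return output

print(to_katakana('James', model, english_encoding, katakana_decoding))  # e.g. ジェームズ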

The Results

If you have a common (Western-style) name, the model should be able to write your name correctly.

  • James : ジェームズ
  • John : ジョン
  • Robert : ロベルト
  • Mary : マリー
  • Patricia : パトリシア
  • Linda : リンダ

Of course, our simple model is not perfect. Because we train the model mostly on place and people names, some English words may not be written quite correctly (though they come close).

  • Computer : コンプーター (correctly, コンピューター)
  • Taxi : タクシ (correctly, タクシー).

Also, our simple model doesn’t have an attention mechanism. It has to encode, compress, and remember the whole input sequence in a single vector (the encoder_output). So writing Katakana for a long English word or phrase is challenging.

My name, for example, “Wanasit Tanakitrungruang”, is written as “ワナシート・タナキトリングラウン” (correctly ワナシット・タナキットルンアン). The model actually doesn’t do all that badly, because most Japanese people I know can’t write my name correctly either :). However, making a model that writes long Thai names in Katakana correctly would be an interesting next step.