Transcribr

DATA

Much of the training for this task was based on the popular IAM Handwriting Database. This dataset is composed of 1,539 pages of scanned, handwritten text from 657 different writers. The IAM database is built on the Lancaster-Oslo/Bergen (LOB) corpus of British English texts, first compiled in the 1970s. The content of these texts spans genres including press reporting, scientific writing, essays, and popular lore.

Deep learning performance is bounded by the quality of the training data, and this dataset is not ideal for a modern, American English application. However, as the largest publicly available compilation of annotated handwriting images, it is the most popular dataset for research in this space and offers a good baseline for comparison with previous architectures.

In addition to the 1,539 full pages, the dataset is further segmented into ~13k lines and ~115k isolated words. These additional segmentations were critical for expanding the training dataset.

Before expansion, a thorough examination of the data and manual correction of annotation/segmentation errors was necessary. A test set of 15 pages, approximately 1% of the total data, was held out. The small test size is not ideal but is necessitated by the small overall size of the dataset.

Word Combinations

A second dataset was created by combining randomly chosen images from the word-segmentation list. 50k new images were generated, each in a random configuration ranging from a single word up to 4 lines of 4 words each.
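A minimal sketch of how such word-image mosaics might be assembled with Pillow is below. The function name, padding, and layout parameters are illustrative assumptions, not the original implementation.

```python
import random
from PIL import Image

def combine_words(word_images, max_lines=4, max_words_per_line=4, pad=8):
    """Paste randomly chosen word crops into a 1-4 line, 1-4 words-per-line layout."""
    rows = [
        random.sample(word_images, random.randint(1, max_words_per_line))
        for _ in range(random.randint(1, max_lines))
    ]
    row_widths = [sum(im.width for im in row) + pad * (len(row) + 1) for row in rows]
    row_heights = [max(im.height for im in row) + pad for row in rows]
    canvas = Image.new("L", (max(row_widths), sum(row_heights) + pad), color=255)

    y = pad
    for row, h in zip(rows, row_heights):
        x = pad
        for im in row:
            canvas.paste(im, (x, y))  # place each word crop left to right
            x += im.width + pad
        y += h
    return canvas
```

The corresponding ground-truth strings would be joined in the same order (with spaces and line breaks) to form the label for each generated image.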

Line Concatenations

Another dataset was created by concatenating randomly chosen line images (normalized to a common height) into samples 3 to 13 lines long. 20k of these images were created.
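A sketch of the height-normalize-and-stack step, again assuming Pillow; the target line height of 64 px and the function name are placeholders.

```python
import random
from PIL import Image

def concat_lines(line_images, min_lines=3, max_lines=13, line_height=64):
    """Resize each line crop to a common height, then stack the lines vertically."""
    chosen = random.sample(line_images, random.randint(min_lines, max_lines))
    resized = [
        im.resize((max(1, round(im.width * line_height / im.height)), line_height))
        for im in chosen
    ]
    canvas = Image.new("L", (max(im.width for im in resized),
                             line_height * len(resized)), color=255)
    for i, im in enumerate(resized):
        canvas.paste(im, (0, i * line_height))  # one line per row
    return canvas
```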

Synthetic Fonts

Another strategy was to use Google Fonts to create images of handwriting-like text. Text from Wikipedia, IMDB movie reviews, and open-source books was rendered in 95 different handwriting fonts at variable sizes to create ~129k images of varying lengths. Background noise, blurring, random padding/cropping, pixel skewing, and similar distortions were added to reflect the organic irregularities of the primary dataset as well as of handwritten text in general.
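A minimal sketch of rendering one line of text with a handwriting-style TTF and adding simple degradations, assuming Pillow and NumPy. The font path, padding, blur radius, and noise level are illustrative, not the project's actual settings.

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFilter, ImageFont

def render_synthetic(line, font_path, size=48, pad=20, noise_std=12.0):
    """Render one line of text in a handwriting-style font, then blur and add noise."""
    font = ImageFont.truetype(font_path, size)
    probe = ImageDraw.Draw(Image.new("L", (1, 1)))
    left, top, right, bottom = probe.textbbox((0, 0), line, font=font)
    img = Image.new("L", (right - left + 2 * pad, bottom - top + 2 * pad), color=255)
    ImageDraw.Draw(img).text((pad - left, pad - top), line, font=font, fill=0)
    img = img.filter(ImageFilter.GaussianBlur(radius=0.8))        # soften strokes
    noisy = np.array(img, dtype=np.float32)
    noisy += np.random.normal(0.0, noise_std, noisy.shape)        # background noise
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))
```

Random padding/cropping and skewing would be layered on in the same spirit, with parameters drawn per image so no two renderings look identical.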

Downloaded Handwriting Samples

To improve the generalization of the algorithm beyond the IAM dataset, another dataset was constructed from 161 manually annotated handwriting images found publicly on the internet. These images were processed with 11 different combinations of image modulation, resulting in a final dataset of 1,771 images.
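A sketch of expanding each downloaded sample through a fixed list of modulations, assuming Pillow; the specific transforms shown are illustrative stand-ins for the original eleven combinations.

```python
from PIL import Image, ImageEnhance, ImageFilter

# Illustrative modulations; the original pipeline used 11 fixed combinations.
MODULATIONS = [
    lambda im: im,                                                 # keep original
    lambda im: ImageEnhance.Contrast(im).enhance(1.5),
    lambda im: ImageEnhance.Brightness(im).enhance(0.8),
    lambda im: im.filter(ImageFilter.GaussianBlur(radius=1.0)),
    lambda im: im.rotate(2, expand=True, fillcolor=255),
    lambda im: ImageEnhance.Sharpness(im).enhance(2.0),
]

def expand(sample: Image.Image):
    """Return one copy of the sample per modulation, grayscale."""
    gray = sample.convert("L")
    return [mod(gray) for mod in MODULATIONS]
```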

Tokenization

For tasks involving text sequences, tokenization is critical. Character tokenization has a number of benefits: character sizes are relatively standard, the vocabulary is small, and there are few out-of-vocabulary tokens. However, character-level inference is slow (auto-regressive sequential decoding of 1,000+ characters takes time…), and, as mentioned above, the heterogeneity of handwriting means characters are often overlapping, illegible, or even omitted.

Word tokenization is intuitive, as words seem to have primacy over characters in human reading*. Inference is much faster than with characters. However, the vocabulary must be very large to limit out-of-vocabulary tokens.

A fixed-vocabulary subword tokenizer, SentencePiece, was used as a compromise. Using a learned unigram language-model encoding [Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates], I created a vocabulary of 10k subword units. Critically, additional special tokens were added to represent spaces, line breaks, letter and word capitalization indicators, and punctuation marks. (Note: 10k, 30k, and 50k vocabularies were tested, with 10k being the most performant while also keeping the model footprint small. Win-win!)
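For reference, training and using such a model with the SentencePiece Python package looks roughly like the sketch below. The corpus path, model prefix, and the exact special-token names are assumptions, not the project's configuration.

```python
import sentencepiece as spm

# Train a 10k-unit unigram model; corpus.txt and the special-token names
# below are placeholders for the real training text and token set.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="transcribr_10k",
    vocab_size=10000,
    model_type="unigram",
    user_defined_symbols=["<sp>", "<nl>", "<cap>", "<allcaps>"],
)

sp = spm.SentencePieceProcessor(model_file="transcribr_10k.model")
print(sp.encode("The quick brown fox", out_type=str))  # list of subword pieces
```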

Subword tokenization offers good inference speed, modest vocabulary size, and few out-of-vocabulary tokens. However, text images are not obviously divisible into subword units. I found that training with both character and subword tokens together helped encode a more robust feature representation and improved accuracy for both.
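A minimal sketch of producing both target sequences for a single transcript, so that subword and character losses can be trained jointly; it assumes the SentencePiece model from the previous snippet and a hypothetical character vocabulary.

```python
def dual_targets(transcript, sp, char_vocab):
    """Return (subword_ids, char_ids) for one ground-truth transcript."""
    subword_ids = sp.encode(transcript, out_type=int)
    char_ids = [char_vocab[c] for c in transcript if c in char_vocab]
    return subword_ids, char_ids

# Example usage with the model trained above and a toy character vocabulary.
char_vocab = {c: i for i, c in enumerate(" abcdefghijklmnopqrstuvwxyz"
                                         "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,'-")}
subwords, chars = dual_targets("The quick brown fox", sp, char_vocab)
```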

[*A fanscinaitg dgsiesrion itno pchsyo-liigntusics via a ppoular ietnernt mmee. Tl;dr: While words are recognized as chunks; human reading (especially in difficult conditions) involves both word and character processes.]