
Source: Deep Learning on Medium

[Review] TreyNet: A Neural Model for Text Localization, Transcription and Named Entity Recognition in Full Pages


As a humanities major, I’ve pored over more centuries-old pages of unreadable chicken scratch than I care to admit. Just look in any online archive of American colonial-era letters and you’ll know exactly what I mean.

Naturally, as a computer scientist — and aspiring machine learning aficionado — I asked the most obvious question possible: “Can I write a program to do this?”

So I got to work. This summer, I experimented with an exciting dataset of Japanese calligraphy pages. The task? Identifying which characters were on the page, and where they were. In computational terms, this was a bounding-box regression and classification problem. To extract the actual text sequence from the page, I tried an innovative (in my mind, at least) method: a CNN encoder for extracting features, and an LSTM decoder for recovering the text sequences.

Spoiler: this worked poorly. So after that failure, I searched for other methods, from Faster R-CNN to Dense Captioning Networks.

Faster R-CNN architecture. From Ren et al. (2016), “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”

But while I was waiting for that magical paper on Fastest R-CNN to come out, I came across an interesting paper directly in this problem domain.

Enter “TreyNet: A Neural Model for Text Localization, Transcription and Named Entity Recognition in Full Pages.” It’s a triple-threat network that finds where text is, transcribes it, and recognizes named entities, all from document images. The applications of such work are astounding given the amount of written material sitting in historical archives around the world. Currently, this task is relatively easy for humans (some cases may require varying levels of academic training), but very time-consuming. Automating it would make primary source research a lot faster and more accessible for the historians and history students of the world, so I dove in and took a look at how it works.

An example of TreyNet output. One can observe color-coded boxes around each word, where the color represents the type of object a word represents (occupation, location, name, etc.). Above each box (though it is hard to read here) there is a small textual annotation recording the transcribed text.

Previous Work

Let’s briefly overview what they’re building off of. We’ll assume some familiarity with Convolutional Neural Networks (CNNs); a guide for beginners can be found here (without too much math!). For technical or interested readers, feel free to follow along with the italicized “Notes.”

1. Selective Search: first proposed by Uijlings et al. (2013) in “Selective Search for Object Recognition”

Selective search (SS) tackles the question of finding where an object is within an image by over-segmenting the image and treating those segments as proposal boxes. It then merges segments based on how similar they are and how cohesive the combination is, treats the merged regions as new proposal boxes, and so on.

Note: this is a hand-wavey treatment of how SS combines regions, since the focus of this article is on TreyNet. For a more formal explanation refer to Section 3.1 of the original paper. You may also notice a bunch of overlapping boxes below — which one is the right box? This technique can help filter out extraneous boxes.
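One ubiquitous way to filter those extraneous boxes (not part of the SS paper itself, but standard in these pipelines) is non-maximum suppression: keep the most confident box, throw away anything that overlaps it too much, and repeat. A minimal sketch in Python, with my own function names and threshold:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop every remaining box that overlaps it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```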

Demonstration of selective search algorithm for object detection. From Stanford CS231B lecture slides, Spring 2014–2015. Original can be found here.

2. Fast R-CNN: first proposed by Ross Girshick (2015) in “Fast R-CNN”

Building off of SS, a Fast R-CNN network 1) extracts features from an image through a CNN, 2) takes region proposals from SS and pools them into a fixed-size box, and then 3) predicts a) what’s in that box and b) where the object in the box actually is (x, y, width, height). For training, the loss function is a weighted sum of categorical cross-entropy (negative log probability of the true class), a.k.a. classification loss, and the smoothed L1 loss, a.k.a. regression loss. TL;DR: loss = how unconfident we were that the correct object is in the box + how bad our box prediction was.
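That weighted sum can be sketched in a few lines. This is an illustrative re-implementation, not the paper’s code: `lam` stands in for the weighting hyperparameter, and `class_probs` is assumed to already be softmax-ed.

```python
import math

def smooth_l1(x):
    """Smoothed L1: quadratic near zero, linear elsewhere (robust to outliers)."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def fast_rcnn_loss(class_probs, true_class, box_pred, box_target,
                   lam=1.0, background=0):
    """Multi-task loss: classification term + (for non-background) regression term."""
    cls_loss = -math.log(class_probs[true_class])   # categorical cross-entropy
    reg_loss = 0.0
    if true_class != background:                    # indicator: skip background boxes
        reg_loss = sum(smooth_l1(p - t) for p, t in zip(box_pred, box_target))
    return cls_loss + lam * reg_loss
```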

Note: for regression loss the (x, y, w, h) coordinates are parameterized such that they are translation invariant. See Appendix C of “Rich feature hierarchies…”, Girshick et al. (2014). Regression loss is multiplied by an indicator that is non-zero only if a non-background object is predicted: we don’t care how badly a “background” object’s bounding box is predicted because it has no bearing on our problem.
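A sketch of that parameterization, roughly following the R-CNN appendix (variable names are mine). Because the targets are expressed relative to the anchor’s position and size, shifting both boxes by the same amount leaves the targets unchanged:

```python
import math

def regression_targets(anchor, gt):
    """Parameterize a ground-truth box relative to an anchor/proposal box
    so the targets are translation- and scale-invariant.
    Boxes are (center_x, center_y, width, height)."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return ((gx - ax) / aw,     # horizontal shift, in units of anchor width
            (gy - ay) / ah,     # vertical shift, in units of anchor height
            math.log(gw / aw),  # log scale change in width
            math.log(gh / ah))  # log scale change in height
```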

3. Faster R-CNN: first proposed by Ren et al. (2016)

You know what, SS is slow. Let’s get rid of that and have a Region Proposal Network (RPN) separately learn regions of interest. We’ll give the network “hints” by overlaying anchors where objects could possibly be on the page, and if they overlap with a real object, that anchor is an actual region proposal that we’ll pass into a Fast R-CNN style network. This network actually brought object detection down to near real-time (0.2s) speeds.

“Anchors” visualized. From Hao Gao’s overview of Faster R-CNN. These boxes can be labeled 0 if they’re uninteresting, or 1 if they may contain an object. These regions are then passed into a network for training.

Note: the actual training of the RPN and the four-stage training process of the entire network is actually really cool and worth reading, but outside the scope of this article. I’d encourage you to check it out!
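To make the “hints” concrete, here is a minimal sketch of how an anchor grid might be generated. The stride, scales, and aspect ratios are placeholder values, not the paper’s:

```python
def make_anchors(feat_h, feat_w, stride, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Tile anchor boxes (center_x, center_y, w, h) over a feature map:
    one anchor per (position, scale, aspect-ratio) combination."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # map the feature-map cell back to image coordinates
            cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s * r ** 0.5   # widen for ratio > 1
                    h = s / r ** 0.5   # shrink height correspondingly
                    anchors.append((cx, cy, w, h))
    return anchors
```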

4. YOLO (You Only Look Once): proposed by Redmon et al. (2016)

The previous algorithms were known as two-stage object detection algorithms: first you find interesting regions, then you detect objects. YOLO tries to do that all at once. It’s much faster than the above methods, but less accurate.

How YOLO works, kind of. From Redmon et al. (2016), “You Only Look Once: Unified, Real-Time Object Detection”

YOLO works by partitioning an image into a grid. Then, for each cell of the grid, it predicts B bounding boxes (x, y, w, h, confidence score) and the conditional class probability of each object (the probability of class X, given that there is an object in the cell). The confidence score is the probability of an object being in the box multiplied by the overlap (IoU) between the predicted box and the ground truth.
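Concretely, here is what the output tensor’s shape and the confidence target look like, using the paper’s defaults (S = 7, B = 2, 20 classes). The function names are mine:

```python
def yolo_output_shape(S=7, B=2, num_classes=20):
    """The network's final prediction tensor: for each of the S*S grid cells,
    B boxes of (x, y, w, h, confidence) plus one set of class probabilities."""
    return (S, S, B * 5 + num_classes)

def confidence(p_object, iou_with_truth):
    """Target confidence score: Pr(object) * IoU(predicted box, ground truth)."""
    return p_object * iou_with_truth
```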

Note: For training, YOLO uses one of the coolest loss functions of all time. Since there’s a bunch of terms, I’ve tried to annotate it to save space:

YOLO Loss Function, annotated. It’s Not As Scary As It Looks ™. From Redmon et al. (2016)
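For readers who want it in symbols rather than a screenshot, the loss from Redmon et al. (2016) can be written out as follows, where the indicator terms select the grid cells (and box predictors) responsible for an object, and the two lambdas are weighting constants:

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
    + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2
 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
    \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2
\end{aligned}
```

Reading top to bottom: box-center error, box-size error (square roots so big boxes aren’t over-penalized), confidence error for boxes with and without objects, and class-probability error.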

5. End-to-End Text Transcription: Proposed by Carbonell et al. (2019)

TreyNet builds heavily on this work, which combines previous object localization + classification methodologies and applies them to text, where text is the object and the region proposals are text bounding boxes. They extract features via ResNet18 and a Feature Pyramid Network (FPN) — the specifics of which aren’t important to understand this article, but I’ve linked more in-depth treatments of ResNets (more technical/slightly less technical) and FPNs here (more technical/less technical). Don’t worry if these terms don’t make sense — just know that this network basically takes images as input and outputs abstract representations (called feature maps) that encode information useful to our task.

Note: TL;DR ResNets use “skip” connections to tackle vanishing/exploding gradient problems, therefore allowing deeper networks. FPNs just take a bunch of convolutional layer outputs and combine them.

Feature extraction scheme of End-to-End Text Transcription. 5 ResNet blocks with FPN, which is then fed into a prediction module (classification and regression). From Carbonell et al. (2019)

Similarly to the earlier two-stage object detectors, there is a two-part classification and regression problem here. Just like Fast R-CNN, regression comes down to drawing a box around text; classification comes down to whether or not there is text at all within that box. The new part for us is the recognition module:

“This module takes as inputs the features extracted with the FPN network and the corresponding regressed boxes and returns a probability tensor of variable width per alphabet length corresponding to the possible transcriptions of the…”

Let’s break this down. Basically, we take the predicted boxes from the regression phase, extract the corresponding areas from the features extracted from the page, standardize the size of those areas (via RoI pooling), and then pull out text from the module output (the “probability tensor of variable width”). To recover the text, the module takes a bunch of sequential “slices” of the image, learns which character is most likely to appear in each slice, and then pieces those together to recover a word.

Note: Formally, the module is a CRNN (Convolutional Recurrent Neural Network). The output tensor has shape (time_steps, alphabet_size). To see this, assume the input has shape (height, width). After applying a variety of conv + max-pool layers, we can reach a tensor of shape (1, n, n_filters). Note that variable “n” corresponds to time_steps. Now that we have a sequence of features, apply an RNN; take the sequential output and send it (optionally) to FC layer(s). Apply Connectionist Temporal Classification (CTC) loss at training, then CTC decoding at test time to obtain the maximum likelihood sequence.
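The “piece the slices together” step at test time is CTC decoding. Exact maximum-likelihood decoding is usually approximated; the simplest approximation is greedy decoding: take the best character at each time step, collapse repeats, and drop blanks. A self-contained sketch (the blank symbol is at index 0 by convention here):

```python
def ctc_greedy_decode(probs, alphabet, blank=0):
    """Greedy CTC decoding. `probs` is a (time_steps, alphabet_size) list of
    per-slice character distributions. Take the argmax per slice, collapse
    consecutive repeats, then remove blank symbols."""
    best = [max(range(len(p)), key=p.__getitem__) for p in probs]  # argmax per slice
    out, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)
```

Note that a blank between two identical argmaxes is what lets CTC emit doubled letters.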

End-to-End Text Transcription architecture. Note how all three branches can be — and are — jointly trained. This scheme yielded better results than the separate-training experiments. From Carbonell et al. (2019).

6. Joint Recognition of Handwritten Text and Named Entities: Proposed by Carbonell et al. (2018)

Named entity recognition (NER) is essentially the classification of words: finding out what type of noun a word is, whether it is a place name, a human name, a company name, etc. This paper combines handwritten text recognition (HTR) with NER. Carbonell et al. (2018) fine-tune a pre-trained HTR network on images of words and lines. Recognition works similarly to the above: in the training data, tags like “location” or “name” are added to the text, and then the tags themselves are treated like “characters” — a single textual category to learn. This means that the internal representation of a text line might look something like:

habitat en <location> <husband> Bara ab <name> <wife> Elisabeth…

where bracketed tags like “&lt;location&gt;” are treated just like a “letter” of the alphabet in the CTC loss paradigm.
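To make “tags as characters” concrete, here is a toy tokenizer that treats each whole tag as a single symbol of the CTC alphabet. The function and tag names are my own illustration, not the paper’s:

```python
def tag_aware_tokens(transcript, tags):
    """Split a tagged transcript into CTC 'symbols': ordinary characters,
    plus each entity tag (e.g. '<name>') treated as one extra symbol."""
    tokens, i = [], 0
    while i < len(transcript):
        for tag in tags:
            if transcript.startswith(tag, i):
                tokens.append(tag)        # whole tag = one "letter"
                i += len(tag)
                break
        else:
            tokens.append(transcript[i])  # ordinary character
            i += 1
    return tokens
```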

Top: image of document line. Middle: image of document line segmented into words. Bottom: text transcription + named entity recognition demonstration.

Note: Carbonell et al. experiment with four different entity-encoding methods, finding that prepending the label as a “single combined label” yielded the best results. In this labeling scheme, “name-wife” would be a singular label for “Juana” in the above example as opposed to two (“name” plus “wife”) labels. See Section IV, Methodology.

TreyNet: Methodology

The methodology of TreyNet largely builds on the above papers, especially (5). For feature extraction, TreyNet uses the same scheme as (5): 5 ResNet blocks plus an FPN. If you recall the concept of a “feature map” — at a very high level, this configuration takes an image as input and outputs a fixed-size, abstract representation of the image with information relevant to classification encoded.

A Brief Aside: Sayre’s Paradox

To paraphrase: text transcription and other modes of information extraction from image data lead to a chicken-and-egg scenario. To transcribe a word, one must first segment and recognize its characters; but to segment and recognize the characters correctly, one must already know what the word is. A similar deadlock arises in other information extraction tasks.

This is their motivation for joint extraction: by incrementally optimizing performance on all of these tasks simultaneously, the problem of establishing a “good enough” segmentation or a “good enough” transcription in order to subsequently learn other tasks goes away.


High-level architecture for TreyNet. From Carbonell et al. (2019).

For classification (text or not-text), they use a Faster R-CNN style loss scheme where each anchor contains a probability distribution over possible classes. This is then used to determine object presence and object class.

For bounding-box regression, similarly to the works outlined above, Carbonell et al. use anchor-based regression where the targets are the offsets (dx, dy, dw, dh) from the anchor to the ground-truth box. The objective is simply mean-squared error. This process is analogous to finding region proposals. Once bounding boxes are produced, text transcription proceeds in the same manner as (5), so we won’t repeat it here.

For named entity recognition, there are two main approaches outlined by Carbonell et al. The first is expanding the classification module to predict named entity classes directly. However, bounding boxes over single words could fail to capture the context of sentences/paragraphs, so Carbonell et al. also test an additional “Semantic Annotation” branch. This involves taking the bounding boxes on the page, extracting the corresponding outputs from the FPN, and ordering them in sequence (left to right, then top to bottom for simpler Western text layouts) — creating a “paragraph” of processed images.

This “paragraph” is then padded and fed through just a couple of conv layers, then a fully connected layer, which yields a probability distribution over entity/tag classes for the entire sequence. Then they simply minimize cross-entropy loss.
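The “ordering them in sequence” step can be sketched as a simple sort: bucket boxes into lines by their vertical centers, then order each line left to right. This is my own simplification for Western layouts, not the paper’s exact procedure:

```python
def reading_order(boxes, line_tolerance=10):
    """Order word boxes (x, y, w, h) top-to-bottom, left-to-right.
    Boxes whose vertical centers fall in the same `line_tolerance`-sized
    bucket are treated as one line (a crude line-grouping heuristic)."""
    def line_key(box):
        x, y, w, h = box
        cy = y + h / 2
        return (int(cy // line_tolerance), x)  # (line bucket, horizontal position)
    return sorted(boxes, key=line_key)
```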

Experiments & Analysis

For their experiments, Carbonell et al. (2019) use mAP (mean average precision) to evaluate object localization and named entity recognition, and character error rate (CER) to evaluate text transcription. They tested five different setups: A) a “Triple Task Model” where all three tasks were jointly trained, with the convolutional approach described above used for named entity recognition based on the bounding-box results; B) a “Triple Task Sequential Model,” all things equal except that features were pooled in reading order (think “paragraph of feature maps”); C) “Detection + named entity recognition,” where named entity recognition is trained as a two-stage detect-and-classify task; D) “Detection + transcription,” which is just word detection and transcription; and E) a baseline CNN that simply classifies context-removed cropped words.
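For reference, character error rate is just the edit (Levenshtein) distance between the predicted and ground-truth transcriptions, normalized by the reference length. A self-contained sketch:

```python
def cer(reference, hypothesis):
    """Character error rate: Levenshtein distance / reference length.
    Uses the classic single-row dynamic-programming formulation."""
    m, n = len(reference), len(hypothesis)
    dist = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dist[0] = dist[0], i          # prev holds the diagonal cell
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            prev, dist[j] = dist[j], min(dist[j] + 1,      # deletion
                                         dist[j - 1] + 1,  # insertion
                                         prev + cost)      # substitution/match
    return dist[n] / m
```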

The results demonstrated that jointly training TreyNet to perform these tasks is viable. Setup D) obtained the best results on text localization, though the metrics were quite close for all setups. The A) Triple Task setup performed the best on character recognition (transcription), doing no worse than any of the other setups, though it faltered when more out-of-vocabulary words appeared.

Consonant with Sayre’s paradox, this suggests that the interdependencies between information extraction tasks on image-based data can be resolved by joint models.

Closing Thoughts

This was a refreshing read, and an exciting approach to a very difficult task for computers. It is still difficult to figure out a way to meaningfully encode all the complex information and dependencies represented on something as seemingly simple as a historical text — in a way that a computer can understand and learn from — but this is a promising move.

The implications for such a technology aren’t just the niche fantasies of humanities-tech crossover students like myself, but also have the potential to unlock treasure-troves of historical knowledge inaccessible to the public. Few of us today have the expertise to read hieroglyphics, medieval Japanese calligraphy, and other lesser-known writing systems. But once something can be transcribed and analyzed, it can be understood, translated, and distributed to all of us. So to me, this is also about making knowledge accessible to as many people as possible.

Overall, I attempted to write a concepts-focused review such that both technical and non-technical audiences could enjoy the material. As such, some low-level mathematical details are abstracted away. I hope that this was an enjoyable and informative read, and am always happy to chat about this!

The author can be reached at tchang97 ‘at’ stanford ‘dot’ edu.