Original article was published by Kartik Chaudhary on Artificial Intelligence on Medium
EASTER Model Architecture
The Easter model architecture is quite simple: it uses only 1-D convolutional layers for the tasks of OCR and HTR.
The Easter encoder consists of multiple stacked 1-D convolutional layers, where the kernel size increases with the depth of the model. The effectiveness of stacked 1-D convolution based networks for sequence-to-sequence tasks has already been demonstrated in the area of ASR (Automatic Speech Recognition).
The basic structure of an EASTER block is shown in the figure below. Each block has multiple repeating sub-blocks, and each sub-block is made up of 4 ordered components:
- 1-D Convolutional layer
- Batch-Normalization layer
- Activation layer (ReLU)
- A Dropout layer
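The four components above can be sketched as a single inference-time forward pass. This is a minimal NumPy illustration, not the paper's implementation: the convolution uses no padding, the batch normalization uses stored statistics, and dropout is the identity at inference time.

```python
import numpy as np

def conv1d(x, kernels, stride=1, dilation=1):
    # x: (T, C_in); kernels: (K, C_in, C_out). No padding, for brevity.
    K, _, c_out = kernels.shape
    span = (K - 1) * dilation + 1          # receptive field of the kernel
    out = []
    for t in range(0, x.shape[0] - span + 1, stride):
        window = x[t : t + span : dilation]            # (K, C_in)
        out.append(np.einsum("kc,kco->o", window, kernels))
    return np.array(out)                               # (T_out, C_out)

def sub_block(x, kernels, gamma, beta, mean, var, eps=1e-5):
    y = conv1d(x, kernels)                 # 1) 1-D convolution
    y = gamma * (y - mean) / np.sqrt(var + eps) + beta # 2) batch norm (inference)
    y = np.maximum(y, 0.0)                 # 3) ReLU activation
    return y                               # 4) dropout is identity at inference

x = np.random.randn(32, 8)                 # width 32, 8 input channels
k = np.random.randn(3, 8, 16)              # kernel size 3, 16 output channels
y = sub_block(x, k, 1.0, 0.0, 0.0, 1.0)    # shape (30, 16), all values >= 0
```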
The overall encoder is a stack of multiple repeating EASTER blocks (described above). Apart from the repeating blocks, four additional 1-D convolutional blocks are present in the overall architecture, as shown in the figure below.
Preprocessing Block (Downsampling block)
This is the first block of the model; it contains two 1-D convolutional layers, each with a stride of 2. Together they downsample the original image width to width/4. Apart from the stride, all other components of these sub-blocks are the same as those discussed above.
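The width arithmetic of the downsampling block is easy to verify: two stride-2 layers each halve the width, giving width/4 overall. A small sketch, assuming 'same'-style padding so each stride-2 layer produces ceil(width / 2) outputs:

```python
import math

def downsampled_width(width, n_stride2_layers=2):
    # Each stride-2 convolution (with 'same' padding) halves the width,
    # rounding up. Two such layers take width -> ~width/4.
    for _ in range(n_stride2_layers):
        width = math.ceil(width / 2)
    return width

print(downsampled_width(128))  # 32, i.e. 128 / 4
print(downsampled_width(100))  # 25
```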
Post-processing Blocks
There are three post-processing blocks at the end of the encoder:
- the first is a dilated 1-D convolutional block with a dilation rate of 2,
- the second is a standard 1-D convolutional block, and
- the third is a 1-D convolutional block whose number of filters equals the number of possible outcomes (the model's vocabulary length), followed by a softmax activation layer.
The output of this final layer is passed to the CTC decoder.
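The role of the final softmax layer can be shown in isolation. In this sketch the vocabulary size and sequence length are placeholder values, not the paper's: the last 1-D convolution emits one logit per vocabulary entry per time step, and softmax turns each time step into a probability distribution for the CTC decoder.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the channel (vocabulary) axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

VOCAB = 80   # hypothetical vocabulary length (including the CTC blank)
T = 25       # hypothetical encoded sequence length after downsampling

logits = np.random.randn(T, VOCAB)   # output of the final 1-D conv block
probs = softmax(logits)              # one distribution per time step
# Each row of `probs` sums to 1 and is handed to the CTC decoder.
```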
The EASTER encoder passes the output probability distribution of the encoded sequence to a CTC decoder for decoding.
To map the predicted per-timestep characters into the final output sequence, the EASTER model uses a weighted CTC decoder. This weighted CTC decoder leads to faster convergence and gives better results than vanilla CTC when training data is limited.
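The weighting in the paper concerns the CTC training objective; the decoding step itself can be illustrated with standard best-path (greedy) CTC decoding, sketched below: take the argmax at each time step, collapse consecutive repeats, and drop blanks.

```python
import numpy as np

def greedy_ctc_decode(probs, blank=0):
    """Best-path CTC decoding (a standard procedure, not the paper's
    weighted variant): argmax per step, collapse repeats, drop blanks."""
    path = probs.argmax(axis=-1)
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(int(p))
        prev = p
    return out

# Toy example: 3-symbol vocabulary where index 0 is the CTC blank.
# Per-step argmax path is [blank, 1, 1, blank, 2], which decodes to [1, 2].
probs = np.eye(3)[[0, 1, 1, 0, 2]]
print(greedy_ctc_decode(probs))  # [1, 2]
```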
The configuration of this weighted CTC decoder is described in detail in the original paper.
3×3 Architecture Variant
EASTER 3×3: a 14-layer variant can be constructed using the table shown below. This is a very shallow, simple architecture with just about 1M parameters, yet it is very effective for OCR/HTR tasks.
This model can easily be scaled up to increase capacity and performance. In the experiments reported in the paper, a 5×3 variant achieves state-of-the-art performance on HTR and OCR tasks.