Original article was published by Shaun Phua on Artificial Intelligence on Medium
Improving the OCR system
The original (and current) design of AMWA uses an external paid OCR service. Although it attains good OCR accuracy, we cannot modify or fine-tune the system because its source code is proprietary. The next stage of development was therefore to experiment with building our own OCR engine from the range of open-source OCR software available.
Tesseract OCR was shortlisted as a good starting candidate. Development began at HP in the mid-1980s, it was open-sourced in 2005, and it was subsequently maintained under Google's sponsorship. As mature open-source software with a very active community, it has extensive documentation and help available online. The base OCR engine is also highly capable, having been trained on millions of images. Tesseract 4.0, the latest major version, added a long short-term memory (LSTM) based OCR engine. LSTM is a recurrent neural network architecture used in deep learning that can process entire sequences of data. Put simply, we can train Tesseract 4.0 to perform handwriting recognition, which is perfect for our use case.
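The LSTM engine is selected at run time with Tesseract's `--oem` flag (OCR Engine Mode 1). As a minimal sketch, here is one way it could be invoked from Python via `subprocess`, assuming the `tesseract` binary is installed and on `PATH`; the image file name is hypothetical:

```python
import subprocess

def build_tesseract_cmd(image_path: str, oem: int = 1, psm: int = 7) -> list:
    """Build a tesseract CLI command.

    --oem 1 selects the LSTM-based engine added in Tesseract 4.0;
    --psm 7 treats the image as a single line of text, which suits
    cropped handwriting snippets.
    """
    return [
        "tesseract", image_path, "stdout",
        "--oem", str(oem),
        "--psm", str(psm),
    ]

cmd = build_tesseract_cmd("worksheet_line.png")  # hypothetical image
# Uncomment to actually run OCR (requires tesseract to be installed):
# text = subprocess.run(cmd, capture_output=True, text=True).stdout
print(cmd)
```

The command only composes the flags; the actual OCR call is left commented out since it depends on a local Tesseract installation.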
Training our own custom OCR engine
Having no prior knowledge of neural networks or how optical character recognition systems work, I had to perform extensive independent research and reading before I could proceed with training Tesseract. Fortunately, I was under the tutelage of our Chief AI Officer David Low, who has a wealth of experience in Artificial Intelligence and Data Science.
I was given articles and resources to begin my journey into the world of AI and OCR systems. David also handed over what he had been developing so I could dissect the source code, gain a deeper understanding of the system, and engineer a way to integrate Tesseract with AMWA.
The training data required copies of handwritten assignments by students. Fortunately, our academic team was able to provide a sizable repository of worksheets, and I proceeded to manually extract, crop, and resize the handwriting from each worksheet. After this step, I had to provide the ground truth value for each image created. The ground truth here refers to the expected output for each image, i.e. the handwritten words. From all of the worksheets, I was able to create a small training set of 486 images.
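Tesseract's training tooling (tesstrain) pairs each line image with a ground-truth transcription in a `.gt.txt` file sharing the image's base name. A minimal sketch of that labelling step, using only the standard library; the file names and labels below are illustrative, not our actual data:

```python
from pathlib import Path
import tempfile

def write_ground_truth(data_dir: Path, labels: dict) -> list:
    """Write one .gt.txt file per cropped line image.

    `labels` maps an image file name (e.g. 'line_001.png') to its
    expected transcription -- the ground truth for that image.
    Returns the paths of the files written.
    """
    written = []
    for image_name, text in labels.items():
        gt_path = data_dir / (Path(image_name).stem + ".gt.txt")
        gt_path.write_text(text + "\n", encoding="utf-8")
        written.append(gt_path)
    return written

# Illustrative usage with a made-up label:
tmp = Path(tempfile.mkdtemp())
paths = write_ground_truth(tmp, {"line_001.png": "the quick brown fox"})
print(paths[0].name)
```

The cropping and resizing itself would typically be done with an image library such as Pillow; only the ground-truth bookkeeping is shown here.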
Before any training, the base accuracy of Tesseract on the test images was 57.3%. After each training run, I tweaked the training parameters to search for the best configuration. In the end, I was able to raise the accuracy to 76.21% before my internship came to an end and I passed the mantle to the Research & Development team.
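The article does not specify how accuracy was scored; one common choice for OCR is a character-level similarity between the predicted text and the ground truth, averaged over the test set. A minimal stdlib sketch using `difflib` (whose ratio is related to, but not identical to, Levenshtein-based accuracy); the sample pairs are made up, not the internship's test set:

```python
from difflib import SequenceMatcher

def char_accuracy(predicted: str, truth: str) -> float:
    """Character-level similarity between OCR output and ground truth.

    SequenceMatcher.ratio() returns 2*M/T, where M is the number of
    matched characters and T is the combined length of both strings.
    """
    if not predicted and not truth:
        return 1.0
    return SequenceMatcher(None, predicted, truth).ratio()

def dataset_accuracy(pairs) -> float:
    """Average per-image accuracy over (predicted, truth) pairs."""
    scores = [char_accuracy(p, t) for p, t in pairs]
    return sum(scores) / len(scores)

# Illustrative pairs: one imperfect prediction, one exact match.
pairs = [("hello", "hallo"), ("42", "42")]
print(round(dataset_accuracy(pairs), 3))  # → 0.9
```

Measuring a fixed baseline this way before training, then re-measuring after each run, is what makes the 57.3% → 76.21% comparison meaningful.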
This internship has certainly been a rewarding and enriching experience. Due to the COVID-19 pandemic, the entire team worked from home and our interactions were limited to Slack and video calls. Working alone at home also meant that I needed to take the initiative to do my own research before consulting other members of the team. Nonetheless, the team's passion and generous knowledge-sharing culture were not dampened by the lack of face-to-face interaction. I am also extremely grateful for all of the help and support that my CEO and co-founder Glenn Low provided throughout my internship. Without his trust and support, this internship would certainly have been far less exciting.