Original article was published by Manoj Kukreja on Artificial Intelligence on Medium
DevOps — Serverless OCR-NLP Pipeline using Amazon EKS, ECR and Docker
How we auto-scaled an Optical Character Recognition pipeline to convert thousands of PDF documents into text per day, using an event-driven microservices architecture built on Docker and Kubernetes
On a recent project we were called in to build a pipeline that converts PDF documents to text. The incoming documents were typically around 100 pages and could contain both typewritten and handwritten text. Users uploaded these PDF documents to an SFTP server. On average there were 30–40 documents per hour, rising to as many as 100 during peak periods. Since their business was growing, the client needed to OCR up to a thousand documents per day. The extracted text was then fed into an NLP pipeline for further analysis.
Let's do a Proof of Concept — Our Findings
Time to convert a 100-page document — 10 minutes
The Python process performing the OCR consumed around 6 GB of RAM and 4 CPUs.
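For reference, the PoC ran a plain Tesseract pass over each rendered page. A minimal sketch of that per-page step in Python (file names hypothetical; this assumes the pdf2image and pytesseract packages plus the poppler and tesseract binaries are installed):

```python
from __future__ import annotations


def ocr_pdf(path: str) -> list[str]:
    """Render each PDF page to an image and OCR it with Tesseract."""
    # Heavy imports kept local so the module loads without the OCR stack.
    from pdf2image import convert_from_path  # renders PDF pages to PIL images
    import pytesseract                       # wrapper around the tesseract binary

    pages = convert_from_path(path, dpi=300)
    return [pytesseract.image_to_string(page) for page in pages]


def to_document(source_key: str, page_texts: list[str]) -> dict:
    """Assemble per-page text into the JSON document shape stored in MongoDB."""
    return {
        "source": source_key,
        "page_count": len(page_texts),
        "pages": [{"page": i + 1, "text": t} for i, t in enumerate(page_texts)],
    }
```

At 10 minutes per 100-page document, roughly 6 seconds per page, the per-page call is the natural unit to parallelize across containers.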
We needed a pipeline that not only keeps up with regular demand but can also auto-scale during peak periods.
We decided to architect a serverless pipeline using event-driven microservices. The entire process was broken down as follows:
- Document uploaded in PDF format — handled by AWS Transfer for SFTP
- S3 fires an event notification when a new PDF document is uploaded — triggering a Lambda function
- The Lambda function adds an OCR event to Kinesis Streams
- The OCR microservice is triggered — it converts the PDF to text using the Tesseract library (one Tesseract call per page) and saves the text output as a JSON document in MongoDB
- The OCR microservice then adds an NLP event to Kinesis Streams
- The NLP microservice reads the JSON from MongoDB and saves the final NLP results back to MongoDB
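The S3-to-Kinesis hand-off in the steps above can be sketched as a small Lambda handler; the stream name and event payload shape here are assumptions for illustration, not the project's actual values:

```python
import json

STREAM_NAME = "ocr-events"  # assumed Kinesis stream name


def build_ocr_event(bucket: str, key: str) -> dict:
    """Kinesis record for one uploaded PDF; partitioning by object key keeps
    events for the same document on a single shard."""
    return {
        "Data": json.dumps({"type": "ocr", "s3_bucket": bucket, "s3_key": key}),
        "PartitionKey": key,
    }


def lambda_handler(event, context):
    """S3 'ObjectCreated' notification -> one OCR event per new PDF."""
    import boto3  # available in the Lambda runtime
    kinesis = boto3.client("kinesis")
    for rec in event.get("Records", []):
        bucket = rec["s3"]["bucket"]["name"]
        key = rec["s3"]["object"]["key"]
        kinesis.put_record(StreamName=STREAM_NAME, **build_ocr_event(bucket, key))
```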
Data Ingestion — AWS SFTP Service
Microservices — Docker Images stored in Amazon Elastic Container Registry (ECR)
Container Orchestration — Amazon Elastic Kubernetes Service (Amazon EKS) over EC2 Nodes
Serverless Compute Engine for Containers — AWS Fargate
Infrastructure Provisioning — Terraform
Messaging — Amazon Kinesis Streams
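On the consuming side, each OCR microservice container polls Kinesis for these events. A simplified single-shard worker loop, assuming the same hypothetical event shape, with `process` standing in for the download/OCR/MongoDB-write wiring:

```python
from __future__ import annotations

import json
import time


def parse_event(raw: bytes) -> dict | None:
    """Decode one Kinesis record payload; skip anything that isn't an OCR event."""
    evt = json.loads(raw)
    return evt if evt.get("type") == "ocr" else None


def run_worker(process, stream: str = "ocr-events") -> None:
    """Poll a single Kinesis shard and hand each OCR event to `process`,
    a callable that downloads the PDF from S3, runs Tesseract, and writes
    the JSON result to MongoDB (that wiring is not shown here)."""
    import boto3  # assumed available inside the container image
    kinesis = boto3.client("kinesis")
    shard_id = kinesis.describe_stream(
        StreamName=stream
    )["StreamDescription"]["Shards"][0]["ShardId"]
    it = kinesis.get_shard_iterator(
        StreamName=stream, ShardId=shard_id, ShardIteratorType="LATEST"
    )["ShardIterator"]
    while True:
        out = kinesis.get_records(ShardIterator=it, Limit=10)
        for rec in out["Records"]:
            evt = parse_event(rec["Data"])
            if evt:
                process(evt)
        it = out["NextShardIterator"]
        time.sleep(1)
```

A production consumer would track multiple shards and checkpoint progress (e.g. via the Kinesis Client Library); the loop above only illustrates the event flow.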
Cluster autoscaling was achieved using a combination of pod-level and node-level scaling, as below:
Scaling Containers — Horizontal Pod Autoscaler
Based on our calculations, a given EC2 node could support 25 running containers. We started the OCR microservice with 3 replica containers (minReplicas=3) and set the maximum to 25 (maxReplicas=25). We also set targetAverageUtilization=15, meaning that if a container's CPU utilization goes above 15% (i.e. the container is busy processing a document), the autoscaler spins up a new container, up to the maximum of 25 on a given EKS node.
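The replica settings described above map directly onto a Kubernetes HorizontalPodAutoscaler manifest; a minimal sketch using the autoscaling/v2beta1 API, with the deployment name assumed for illustration:

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: ocr-microservice        # assumed deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ocr-microservice
  minReplicas: 3                # always keep 3 OCR workers warm
  maxReplicas: 25               # cap matching the per-node capacity estimate
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 15   # scale out once workers are busy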
Scaling EKS Nodes — Cluster Autoscaler
If the current EKS node is full (i.e. 25 containers are already running), a new EKS node is automatically provisioned. Pod scaling then takes over on the new node and spins up additional containers.
In this way the infrastructure can support hundreds of concurrent OCR and NLP processes. Once peak demand has passed, a cool-down period kicks in; after it expires, the newly provisioned EKS nodes and containers are scaled back down to the optimal resource allocation.
I hope this article was helpful in kick-starting your DevOps knowledge. Topics like these are covered as part of the DevOps course offered by Datafence Cloud Academy.