Serverless Machine Learning Inference with Tika and TensorFlow

Source: Deep Learning on Medium

Browse jobs on seek.com.au whilst signed in and, if you have a default resume, you may encounter the little purple “You may be a strong applicant” badge below.

The “strong applicant” prediction is provided by the DeepMatch API which uses features from your resume and the job.

Figure: the “You may be a strong applicant” badge

These features are computationally expensive to generate at scale and so need to be computed in advance, written to a serving store, and then combined for prediction when you visit. A high-level overview of the serving architecture looks something like this:

Figure: DeepMatch high-level architecture

This post will guide you through how we built an event-driven serverless version of this architecture. It’s aimed at data scientists, engineers, or anyone building ML products in production.

To make the service work well we had to:

  • decide what tools and language to use across training and serving
  • understand the eventing infrastructure
  • select a compute platform with the right cost and performance
  • build in reliability
  • design a replay process.

Let’s now talk more about these pieces in detail.

Common code across training and serving

Machine learning models are typically trained in a research environment that differs from the production serving environment. Features are usually generated from historical data in a data warehouse, rather than real-time events. And models are often trained using ML toolkits that favour an experimental workflow rather than the production needs of serving requests quickly and at scale.

Bridging the training-serving divide can be tricky and require engineering effort.

One way to approach this divide is model conversion. For example, suppose a tree model has been built in the research environment using Python but the production environment is a Java API. To take the model to production, the feature generation code could be rewritten and the model ported to Java. If this requires different skill sets, this may mean handing the work over to someone else to productionise. However, model conversion may introduce training-serving skew when slight differences between the two codebases cause predictions between the environments to diverge. Prediction test cases generated in the research environment can be added to the production codebase to detect this.
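As a rough illustration of such prediction test cases, here is a minimal Python sketch. The feature names, scores, tolerance, and the `ported_predict` stand-in are all hypothetical; the idea is only that cases exported from the research environment are replayed against the production implementation and any drift is reported.

```python
import math

def find_skew(predict, golden_cases, tolerance=1e-6):
    """Return the golden cases whose production prediction drifts from research."""
    mismatches = []
    for features, expected in golden_cases:
        actual = predict(features)
        if not math.isclose(actual, expected, abs_tol=tolerance):
            mismatches.append((features, expected, actual))
    return mismatches

# Hypothetical cases exported from the research environment: (features, score).
golden_cases = [
    ({"years_experience": 5, "skill_overlap": 0.8}, 0.65),
    ({"years_experience": 0, "skill_overlap": 0.1}, 0.05),
]

def ported_predict(features):
    # Stand-in for the model after it has been ported to production code.
    return 0.05 * features["years_experience"] + 0.5 * features["skill_overlap"]

skewed = find_skew(ported_predict, golden_cases)
```

An empty `skewed` list means the ported model agrees with research on every golden case. Running a check like this in the production test suite surfaces skew at build time rather than in front of users.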

The other approach is to use common libraries, code, and models shared across both training and serving. Common code provides a guarantee that features and predictions generated in the research environment are the same as the production environment. However, meeting the constraints of both environments and having a common codebase can be a challenge. Let’s take a look at what this means in the context of DeepMatch.

DeepMatch generates features from resumes, which are typically in either Microsoft Word or PDF format. To do anything meaningful with them, we need to extract their text content. We do this using Apache Tika, a Java-based content extraction and analysis library. It’s open-source and hence easy to adopt, and also mature. To generate enough training data we run Tika at scale over millions of resumes using Apache Spark. Spark is a popular (although complex) general-purpose data processing engine which runs distributed workloads across many cores on multiple machines. Both Spark and Tika run on the Java Virtual Machine (JVM) so it’s easy to parallelise a Tika workload as a Spark job. The early choices of Tika and Spark for training data generation anchored us to using a JVM language. The team was already familiar with building Scala APIs and running them in production, and so we settled on using Scala.

Whilst Scala and Spark are a good choice for text extraction in the data warehouse, most popular ML toolkits are native libraries with first-class support for Python and limited support for other languages. DeepMatch uses an ensemble of models, including a Convolutional Neural Network (CNN) that produces embeddings, and a LightGBM tree model that uses a bag of phrases as features. Using best-of-breed toolkits for these models requires running native libraries with predominantly Python support.

Complex machine learning pipelines need to wire together feature extraction, feature generation, and prediction across different libraries and ML toolkits that run in both the data warehouse and production.

Combining feature extraction, generation and prediction in DeepMatch means running a JVM process for text extraction, and a native process for feature generation and prediction. A containerised microservices architecture is one solution, although it introduces additional deployment complexity. Iterating over versions of a multi-container pipeline during training isn’t trivial and has a slow feedback cycle. Kubeflow looks like it may solve some of these problems, although it wasn’t available when we started and requires knowledge of, and deep investment in, the Kubernetes stack.

Instead, we built a single end-to-end pipeline library in Scala for feature extraction, generation and prediction. This allows us to run the pipeline at scale in Spark during research for training data preparation in the data warehouse, and in multi-threaded applications in production. We chose TensorFlow because of its excellent Java bindings and used Keras (prior to its incorporation into TensorFlow) for its ease of use during research. However, LightGBM’s native libraries have less mature JVM bindings and aren’t tuned for production workloads. It required some additional work to make serving production-ready, which we have open-sourced. Speaking to others in the field, it appears it isn’t uncommon to find ML libraries without robust support for production use-cases.
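The shape of that idea, one pipeline function called from both the batch job and the event handler, can be sketched in a few lines. This is an illustrative Python outline, not our Scala library: the function names are hypothetical, the text extraction is a trivial stand-in for Tika, and the features are toy word counts.

```python
import re

def extract_text(raw_document: bytes) -> str:
    # Stand-in for real text extraction (Tika, in the pipeline described here).
    return raw_document.decode("utf-8", errors="ignore")

def generate_features(text: str) -> dict:
    # Toy bag-of-words counts standing in for real feature generation.
    counts = {}
    for token in re.findall(r"[a-z]+", text.lower()):
        counts[token] = counts.get(token, 0) + 1
    return counts

def run_pipeline(raw_document: bytes) -> dict:
    """The single entry point that both environments call."""
    return generate_features(extract_text(raw_document))

def build_training_rows(documents):
    # Batch path: in research this map would run as a distributed job.
    return [run_pipeline(doc) for doc in documents]

def handle_event(event: dict) -> dict:
    # Serving path: one document per incoming event.
    return run_pipeline(event["body"])

batch_row = build_training_rows([b"Scala and Spark"])[0]
served_row = handle_event({"body": b"Scala and Spark"})
```

Because both paths go through `run_pipeline`, `batch_row` and `served_row` are identical by construction, which is the property that rules out skew.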

With appropriate investment where needed, we’ve been able to build a common library for feature extraction, generation and prediction. The same code runs in our training data pipelines and in our real-time production pipelines, which avoids any potential training-serving skew.

SEEK’s eventing infra

SEEK is a heavy user of SNS for real-time asynchronous integration between teams. When a resume is uploaded, an application made, or a job created, an SNS event is generated. SNS provides fire-and-forget fan-out of events from a single producer to many consumers. It has reliable at-least-once delivery, which means consumers need to be able to handle duplicates, typically with idempotent processing.
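One common way to get idempotent processing is to make each write an upsert keyed by entity ID, so a duplicate delivery simply rewrites the same value. A minimal Python sketch, assuming a hypothetical event shape (`entity_id`, `text`) and using a plain dict in place of the real store:

```python
def compute_features(text: str) -> dict:
    # Placeholder for real feature generation.
    return {"word_count": len(text.split())}

def upsert_features(store: dict, event: dict) -> None:
    """Write features keyed by entity ID; replaying the same event changes nothing."""
    store[event["entity_id"]] = compute_features(event["text"])

store = {}
event = {"entity_id": "resume-123", "text": "Scala developer with Spark experience"}

upsert_features(store, event)
after_first_delivery = dict(store)
upsert_features(store, event)  # duplicate delivery of the same event
```

After the duplicate, `store` is unchanged from `after_first_delivery`, so the consumer never needs to track which message IDs it has already seen.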

SNS is loosely ordered: events generally arrive in time order, but there are no ordering guarantees. If ordering guarantees are required, then Kinesis (or Kafka) is a better choice for eventing. Ordered systems have a fixed partition count. In ordered systems, when there’s a failure or latency, the partition can’t progress immediately and events back up. At SEEK, events for an entity (e.g. an application or a candidate) contain the full entity state rather than minimal changes. With a low rate of change to each entity, the lack of event order guarantees isn’t a concern in this context. Using SNS avoids the complexity and failure modes of an ordered system, and can be used to build a reliable system (more on this later).

Going serverless

DeepMatch keeps its serving store fresh by listening to updates to resumes and jobs via SNS. When a change event is received, DeepMatch regenerates features (e.g. embeddings, bag-of-phrases encodings) for the resume or job, and writes them to the serving store. This workload:

  • is a short, computationally intensive task that doesn’t have tight latency requirements (seconds is OK)
  • needs to be performed reliably and securely
  • occurs all the time, normally at low volumes but sometimes bursting to high volumes (during replay; more on that later).

It turns out this is a good fit for AWS Lambda, which also couples seamlessly with SNS.

AWS Lambda provides an operational model that delegates infrastructure capacity, reliability and scaling responsibilities to someone else, in this case, AWS. This allows you to focus on the business problem, rather than managing infrastructure.

Lambda provides an elastic scheduler with short-lived containers for compute, plus load balancers and queues for traffic and retries. The control plane and compute are highly available and spread over three Availability Zones (AZs). Lambda scales workloads from/to zero based on throughput. Warm invocation times are in milliseconds, and our experience has been that cold starts occur roughly 1 in 1000 times when individual Lambda instances are scaled up or repaved. And all of this is fully managed and mostly invisible to you. Finally, with the right volumes and compute requirements, Lambda is very cost-effective.

Lambda not only handles many operational concerns but also provides a standard library (the AWS SDKs) with a simple single-request programming model. Combined with CloudWatch logging, alerts, metrics, and dashboards, and a simple deployment story, it provides an easy-to-use, rapid development experience. This AWS Lambda Scala example is similar in outline to what we use in production and showcases key aspects of Lambda development and deployment.

Because our workload could fit into the cold start, running time, and cost constraints of Lambda, we were able to take advantage of its operational benefits, ease of use, and speed to market.

Finally, we needed to decide on a serving store. DynamoDB fits together neatly with Lambda and has the big advantage that it doesn’t require a VPC. It has single-digit millisecond response times, is elastic, highly available, low maintenance and, at our volumes, low cost. DynamoDB’s one weakness is in analytical workloads that require full table scans. For use as a serving store that only occasionally requires a full table scan (typically during replay), it has worked well.

One of the major Lambda constraints we had to overcome was the 250MB limit on deployment package size. When our models and code were combined, they exceeded this limit. Lambda containers provide 512MB of local disk storage in /tmp. The solution was to download the models from Amazon S3 on the first invocation and save them to local disk. This increases the cold start time by an order of magnitude but is still within acceptable bounds.
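The download-on-first-invocation pattern looks roughly like the Python sketch below. It is an outline under assumptions: `fetch` is a hypothetical callable standing in for the S3 download, the model name is made up, and a temporary directory stands in for /tmp.

```python
import tempfile
from pathlib import Path

def ensure_model(cache_dir: Path, name: str, fetch) -> Path:
    """Return a local path for `name`, calling `fetch` only if it is not cached."""
    target = cache_dir / name
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        partial = target.with_name(target.name + ".part")
        partial.write_bytes(fetch(name))
        # Rename into place so a reader never sees a half-written file.
        partial.replace(target)
    return target

calls = []

def fake_fetch(name: str) -> bytes:
    # Hypothetical stand-in for downloading the object from S3.
    calls.append(name)
    return b"model-bytes"

cache_dir = Path(tempfile.mkdtemp())
first = ensure_model(cache_dir, "cnn.pb", fake_fetch)   # cold: downloads
second = ensure_model(cache_dir, "cnn.pb", fake_fetch)  # warm: served from local disk
```

Only the first call pays the download cost; every later invocation in the same container reads the model from local disk.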

Let it crash

Resumes are unstructured user-generated input, of unknown provenance, in varying formats and generated on a wide range of devices and software. Combine that with a processing pipeline that includes a number of regex patterns and Tika, a complex third party text processing library with a large surface area. It’s nearly inevitable that we will encounter some unexpected input that we can’t handle. One way would be to try and prevent all failures, but that’s very hard given the huge input space.

The alternative is to expect failure and handle it gracefully. Building a reliable system by expecting failure is a core idea in the Erlang community. To make this work you need:

  • Isolation
  • A limited blast radius
  • A supervisory process that detects failure
  • A clean way for the supervisor to restart a process that fails.

Unfortunately, the JVM doesn’t have these primitives. It’s not possible to cleanly kill a long-running thread when, for example, a regex pattern gets stuck on an unusual input and spins forever.

But we can implement a reliable system using Lambda.

First, events are isolated from each other when using SNS. Because events are unordered and fan out, and assuming you have enough capacity, failures won’t cascade and affect other events. Second, we process one event per Lambda. If the Lambda fails, only that event fails. Finally, we put a timeout on the Lambda, so when it fails to complete it times out, and the Lambda control plane restarts the JVM, leaving no dangling threads or inconsistent heap.
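To make the supervision idea concrete outside of Lambda, here is a small Python sketch that processes each event in a fresh child process with a timeout. It is an analogy only, not how Lambda is implemented; the worker code and the "poison" input are hypothetical.

```python
import subprocess
import sys

def run_isolated(worker_code: str, payload: str, timeout_seconds: float) -> str:
    """Process one event in a fresh child process and report the outcome."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", worker_code, payload],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing keeps spinning.
        return "timed out"
    return "ok" if result.returncode == 0 else "crashed"

# Hypothetical worker: spins forever on one bad input, succeeds on anything else.
WORKER = """
import sys
if sys.argv[1] == "poison":
    while True:
        pass
"""

outcomes = [
    run_isolated(WORKER, "resume-1", 30),
    run_isolated(WORKER, "poison", 1),
    run_isolated(WORKER, "resume-2", 30),
]
```

The stuck event is killed at its timeout, and the events before and after it are unaffected. That gives the same isolation, blast radius, and restart properties listed above.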

There’s just one catch. It turns out the life cycle of the container doesn’t match the life cycle of the JVM. The JVM can die and restart, whilst the container and its local disk storage persist. If you recall, on the first invocation we download the models to disk. If we are not careful, and the JVM restarts a few times, the Lambda container’s disk will quickly fill up. So to prevent running out of disk space we must proactively remove any temporary files from previous invocations.
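A clean-up step along these lines could run at process start, before anything new is downloaded. This Python sketch assumes the pipeline writes its temporary files with a known prefix; the `deepmatch-` prefix and the file names are hypothetical.

```python
import tempfile
from pathlib import Path

def clear_stale_files(scratch_dir: Path, prefix: str) -> int:
    """Delete files left by earlier process lifetimes; return how many were removed."""
    removed = 0
    for path in scratch_dir.glob(prefix + "*"):
        if path.is_file():
            path.unlink()
            removed += 1
    return removed

# Simulate a container whose previous process left two model files behind.
scratch_dir = Path(tempfile.mkdtemp())
(scratch_dir / "deepmatch-cnn-1.bin").write_bytes(b"old")
(scratch_dir / "deepmatch-gbm-1.bin").write_bytes(b"old")
(scratch_dir / "unrelated.txt").write_bytes(b"keep")

removed = clear_stale_files(scratch_dir, "deepmatch-")
remaining = sorted(p.name for p in scratch_dir.iterdir())
```

Matching on a prefix keeps the clean-up from touching files that belong to anything else sharing the same directory.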

Replay

For any system with a serving store, there needs to be a way to populate the store when a new model is released. This can typically be done by reprocessing historical events. We call this process remining or replay. SNS doesn’t provide a replay capability out of the box, and so there needs to be some warehouse store of prior events. Fortunately, we aren’t the only ones listening to these events: SEEK’s Data Platform also consumes them, batches them up and stores them as Parquet files in S3 for long-term storage.

When we need to replay, we stand up a new stack in parallel to the live one and immediately start receiving and processing current events into a new DynamoDB table. We then kick off a process which replays all prior events (based on timestamp) from the Data Platform (or in some cases, the live DynamoDB tables) through our Lambdas to generate features using the new models and store them in the new table.
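Since live events and replayed events land in the same new table, one thing to watch for is an older replayed event overwriting a newer live one. The Python sketch below shows one possible guard, a timestamp comparison on write. The guard, the event shape, and the cut-off at the new stack's start time are assumptions made for illustration, not a description of the production code.

```python
def write_if_newer(table: dict, event: dict) -> bool:
    """Store the event unless the table already holds a newer one for that entity."""
    current = table.get(event["entity_id"])
    if current is not None and current["timestamp"] >= event["timestamp"]:
        return False
    table[event["entity_id"]] = {"timestamp": event["timestamp"], "state": event["state"]}
    return True

def replay(table: dict, archived_events, stack_started_at: int) -> int:
    """Replay archived events from before the new stack started; return writes made."""
    written = 0
    for event in archived_events:
        if event["timestamp"] < stack_started_at and write_if_newer(table, event):
            written += 1
    return written

# The new stack started at t=100 and has already processed one live event.
new_table = {"resume-1": {"timestamp": 105, "state": "live"}}
archive = [
    {"entity_id": "resume-1", "timestamp": 50, "state": "old"},
    {"entity_id": "resume-2", "timestamp": 60, "state": "v2"},
    {"entity_id": "resume-2", "timestamp": 40, "state": "v1"},   # arrives out of order
    {"entity_id": "resume-3", "timestamp": 101, "state": "new"}, # the live stack's job
]

written = replay(new_table, archive, stack_started_at=100)
```

Here only `resume-2` at t=60 is written. The stale `resume-1` and out-of-order `resume-2` events are skipped, and `resume-3` is left to the live stack.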

The volume of events during replay is much higher than normal production workloads, but here the elasticity of a serverless stack can shine. With Lambda + DynamoDB we can scale up the number of instances and write capacity so the whole process completes in about 4 hours.