Quick ML model serving with Java

Source: Deep Learning on Medium

The microservices architecture lets you do something that was previously impractical, if not impossible: build a single application in several languages at once. That’s simply amazing! But what if you are restricted to a certain tech stack and language? Or what if all you need is an MVP (minimum viable product)? Or you don’t want to build or use heavyweight ML systems? Or you just need to put that MVP together quickly in your given stack and language?

Let’s consider the case of an MVP for a distributed Java application that uses Apache Kafka for message passing. At one or more points we need to apply machine learning to the incoming data, get predictions, process them, and send the updated data further along the pipeline. As with my earlier posts, I’ll use the DeepLearning4J framework for the code samples here.

First of all we’ll need to set up something to interact with Apache Kafka. Something like this:

Abstract class for consuming data from Apache Kafka

The idea is pretty simple. We set up a Kafka consumer that reads data in one thread, runs the transform() method to process it, and writes the result back to Kafka from another thread. transform() is abstract, so we can provide in-place implementations for better readability.
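To make the two-thread idea concrete, here is a minimal sketch rather than the actual class from the post. It assumes plain BlockingQueue instances as stand-ins for the input and output Kafka topics, so it runs without a broker; the class name AbstractTransformer and the internal pending queue are hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: the two queues stand in for the consumed and produced Kafka topics.
abstract class AbstractTransformer<IN, OUT> {
    private final BlockingQueue<IN> input;    // stand-in for the topic we consume from
    private final BlockingQueue<OUT> output;  // stand-in for the topic we produce to
    private final BlockingQueue<OUT> pending = new LinkedBlockingQueue<>(); // hand-off between threads
    private Thread consumerThread;
    private Thread producerThread;

    AbstractTransformer(BlockingQueue<IN> input, BlockingQueue<OUT> output) {
        this.input = input;
        this.output = output;
    }

    // Implemented in place by each pipeline step.
    protected abstract OUT transform(IN value);

    void start() {
        // Thread 1: read an entry, transform it, hand the result over.
        consumerThread = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    pending.put(transform(input.take()));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop() was called
            }
        }, "consumer");
        // Thread 2: write transformed entries to the output.
        producerThread = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    output.put(pending.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop() was called
            }
        }, "producer");
        consumerThread.setDaemon(true);
        producerThread.setDaemon(true);
        consumerThread.start();
        producerThread.start();
    }

    void stop() {
        consumerThread.interrupt();
        producerThread.interrupt();
    }
}
```

With a real Kafka client, the first loop would poll a consumer and the second would call a producer's send method, but the split between reading plus transforming and writing stays the same.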

Now, with Kafka set up, we can get to actual machine learning for our MVP. I see at least two approaches here:

Approach 1. ML as part of the pipeline:

Basic schema. Consumer gets House objects, and produces ProcessedHouse objects

If, for a single data entity coming in, you have exactly one ML application to use, you might get away with a simple Kafka consumer/producer combination or a Kafka Streams node. Apart from obvious edge cases, like EmbeddingLayer, this approach scales pretty easily to meet your throughput requirements.

Kafka consumer/producer example

The implementation is fairly obvious: we subscribe to a certain Kafka topic, and as soon as a message comes in, the transform() method is invoked with the House object as its argument. We can then pass the featurized data to the next Kafka topic, where another part of our application will make use of it.

There’s a performance tweak we can do — batching: instead of processing data entries one by one, we process them in batches, using the hardware more effectively with vectorized data. This is especially important if you’re using GPUs. They typically have somewhat higher latency when executing computations, but significantly higher memory bandwidth. Therefore with bigger batches the overall system throughput will increase.
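A rough sketch of the batching step could look like the following. It is an illustration under assumptions, not the post's code: incoming entries are assumed to be already featurized as float[] vectors sitting in a BlockingQueue, and the Batcher class with its maxBatchSize and timeoutMs parameters is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: collects queued entries into batches for a single model call.
final class Batcher {
    // Waits up to timeoutMs for the first entry, then takes whatever else is
    // already queued, up to maxBatchSize entries in total. May return an empty list.
    static <T> List<T> nextBatch(BlockingQueue<T> queue, int maxBatchSize, long timeoutMs) {
        List<T> batch = new ArrayList<>(maxBatchSize);
        T first;
        try {
            first = queue.poll(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return batch;
        }
        if (first == null) {
            return batch; // nothing arrived within the timeout
        }
        batch.add(first);
        queue.drainTo(batch, maxBatchSize - 1); // non-blocking: only what is already there
        return batch;
    }

    // Stacks per-entry feature vectors into one [batchSize, numFeatures] matrix,
    // so the model can be invoked once per batch instead of once per entry.
    static float[][] toMatrix(List<float[]> batch) {
        return batch.toArray(new float[0][]);
    }
}
```

The resulting matrix would then be wrapped into a single input array and passed to the model in one call. The timeout bounds how long a lonely entry waits, which is the usual trade-off between latency and batch size.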

But what if more than one entity comes out of this step? Or what if more than one ML model has to be invoked for a given data entry, especially if the models differ widely in computational cost? In that case the “ML as part of the pipeline” approach won’t be efficient, and another approach is more suitable.

Approach 2. ML as standalone apps:

Basic schema. ML models run in separate pods and are accessed asynchronously from a thin consumer.

In this approach, we use something like the sidecar pattern. There’s an application that consumes data from Kafka, and sends requests to the ML models using JSON/gRPC/etc. This allows us to scale ML pods individually, as required for maximizing throughput.

First, we set up a model server that accepts JSON, deserializes it into a POJO, and runs the actual inference.

JSON model server setup example
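As a rough illustration of the JSON-in, prediction-out flow, here is a minimal sketch built on the HTTP server that ships with the JDK (com.sun.net.httpserver). Almost everything in it is an assumption made for the example: the House fields area and rooms, the /predict path, the naive regex-based field extraction (a real server would use a JSON library), and the linear formula that stands in for the actual model call.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical POJO; the real House fields are not shown in this post.
final class House {
    final double area;
    final int rooms;
    House(double area, int rooms) { this.area = area; this.rooms = rooms; }
}

final class ModelServer {
    // Naive extraction of one numeric field from flat JSON (placeholder for a JSON library).
    static double field(String json, String name) {
        Matcher m = Pattern.compile("\"" + name + "\"\\s*:\\s*(-?[0-9.]+)").matcher(json);
        if (!m.find()) throw new IllegalArgumentException("missing field: " + name);
        return Double.parseDouble(m.group(1));
    }

    static House parse(String json) {
        return new House(field(json, "area"), (int) field(json, "rooms"));
    }

    // Placeholder for real inference, e.g. a call into a loaded network.
    static double predictPrice(House h) {
        return 1000.0 * h.area + 5000.0 * h.rooms;
    }

    // Starts the server; pass port 0 to let the OS pick a free port.
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/predict", exchange -> {
            String body = new String(exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
            int status;
            byte[] response;
            try {
                response = ("{\"price\":" + predictPrice(parse(body)) + "}").getBytes(StandardCharsets.UTF_8);
                status = 200;
            } catch (IllegalArgumentException e) { // malformed or incomplete input
                response = "{\"error\":\"bad request\"}".getBytes(StandardCharsets.UTF_8);
                status = 400;
            }
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(status, response.length);
            exchange.getResponseBody().write(response);
            exchange.close();
        });
        server.start();
        return server;
    }
}
```

Calling ModelServer.start(8080) and posting {"area": 50.0, "rooms": 2} to /predict would return a JSON object with a single price field. Swapping predictPrice for a real model keeps the rest of the server unchanged.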

Next, we set up a remote client that will be used to communicate with the remote model.

If more than one ML model has to be invoked here, we’d better use an asynchronous client and invoke the most computationally expensive model first. The cheaper requests then run while the expensive one is still in flight, so at least part of the computation is hidden behind its latency.

Remote clients setup example
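The ordering trick can be sketched with CompletableFuture from the standard library. This is an illustration under assumptions rather than the actual remote clients: each Function below stands in for a remote call to a model server, and the names AsyncScoring, Scores, priceModel and groupModel are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

final class AsyncScoring {
    // Combined result of both model calls.
    static final class Scores {
        final double price;
        final String customerGroup;
        Scores(double price, String customerGroup) {
            this.price = price;
            this.customerGroup = customerGroup;
        }
    }

    // priceModel is assumed to be the more expensive call, so it is submitted first.
    static <H> CompletableFuture<Scores> score(H house,
                                               Function<H, Double> priceModel,
                                               Function<H, String> groupModel) {
        // The expensive request is already running while the cheap one is issued.
        CompletableFuture<Double> price =
                CompletableFuture.supplyAsync(() -> priceModel.apply(house));
        CompletableFuture<String> group =
                CompletableFuture.supplyAsync(() -> groupModel.apply(house));
        // Completes once both responses are in; total time is close to the slower call.
        return price.thenCombine(group, Scores::new);
    }
}
```

Inside transform(), the consumer would call score(...) and either join() on the result or attach a callback that builds the ProcessedHouse entry. With real clients, the two functions would wrap the HTTP or gRPC requests.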

From Kafka’s point of view, nothing has changed compared to our original application. There is still one consumer group that reads the House topic and produces ProcessedHouse entries. But internally we use a couple of sidecar applications: one for house price prediction, and one for predicting the potential target customer group. This way we can deploy and scale our models independently (as long as the House POJO stays the same), since all ML-related code is confined to the model serving containers and doesn’t bloat the generic Java application.

Obviously, neither approach shown here can be considered the “ultima ratio” for ML model serving; they are more like ad-hoc solutions. There are lots of nuances to be covered by a proper serving tool: ETL consistency, versioning, scalability tuning for various edge cases, and A/B testing, to name a few. Some of these problems are serious enough to be worthy of dedicated posts. Stay tuned 🙂