Machine Learning System Design: Real-time processing

Lambda and Kappa architecture in ML system design

Nathan Marz described the Lambda architecture in 2011. Later, in his 2015 book Big Data, he writes about the “lambda architecture”:

Computing arbitrary functions on an arbitrary dataset in real time is a daunting problem. There is no single tool that provides a complete solution. Instead, you have to use a variety of tools and techniques to build a complete Big Data system.

The lambda architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer.

Since 2011 and even till today, Lambda architectures are deployed in all major technology companies in different ways. The concept of Lambda architecture has surely evolved over time with new tools and frameworks but the fundamentals of processing are still the same.

A good real-time data processing architecture needs to be fault-tolerant and scalable; it needs to support batch and incremental updates, and must be extensible.

Lambda Architecture

A Lambda architecture consists of three main layers: batch, real-time, and serving.

Batch

The batch layer comprises two main tasks: managing historical data and recomputing results such as machine learning models. The batch layer receives arriving data, combines it with the historical data, and recomputes results by iterating over the entire combined dataset. Because this layer operates on the full data, it allows the system to produce the most accurate results. That accuracy is paid for in CPU utilization and time: the results come at the cost of high latency due to long computation.
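As an illustration, the sketch below shows what a batch-layer recompute might look like in PySpark. The paths, column names (user_id, amount), and the aggregation are hypothetical placeholders, not a prescribed implementation; the point is that the job reads the full dataset and rebuilds its view from scratch.

```python
# Minimal batch-layer sketch, assuming PySpark is available and events are
# stored as Parquet under a hypothetical "/data/events" path with
# "user_id" and "amount" columns.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-layer-recompute").getOrCreate()

# Read the full historical dataset, including newly arrived files.
events = spark.read.parquet("/data/events")

# Recompute the batch view from scratch by iterating over all events.
batch_view = events.groupBy("user_id").agg(F.sum("amount").alias("total_amount"))

# Overwrite the previous batch view; the serving layer reads from this location.
batch_view.write.mode("overwrite").parquet("/views/batch/user_totals")
```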

Real-time

The real-time (speed) layer is used to provide results in a low-latency, near real-time fashion. It receives the arriving data and performs incremental updates to the batch layer's results. Incremental algorithms implemented at this layer help achieve a significant reduction in computation cost.
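A minimal sketch of such an incremental update, assuming the kafka-python client, a broker at localhost:9092, and a hypothetical "events" topic carrying JSON messages with user_id and amount fields (all placeholders):

```python
# Minimal speed-layer sketch: each new event updates an in-memory view
# incrementally, rather than recomputing over the full history.
import json
from collections import defaultdict
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Real-time view covering only data that arrived since the last batch run.
realtime_view = defaultdict(float)

for message in consumer:
    event = message.value
    realtime_view[event["user_id"]] += event["amount"]
```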

Serving

The serving layer enables queries over the results produced by the batch and real-time layers.
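Conceptually, a query merges the precomputed batch result with the incremental real-time delta. The sketch below assumes both views have been loaded into plain dictionaries keyed by user_id (in practice they would live in a store such as HBase); the function and data are hypothetical.

```python
def query_user_total(user_id, batch_view, realtime_view):
    """Merge the precomputed batch result with the incremental real-time delta."""
    return batch_view.get(user_id, 0.0) + realtime_view.get(user_id, 0.0)

# Example: batch view computed overnight, real-time view covering today's events.
batch_view = {"u1": 120.0, "u2": 40.0}
realtime_view = {"u1": 5.5}
print(query_user_total("u1", batch_view, realtime_view))  # 125.5
```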

In the summer of 2014, Jay Kreps from LinkedIn posted an article describing what he called the Kappa architecture, which addresses some of the pitfalls associated with the Lambda architecture. Kappa is not a replacement for Lambda, though, as some use cases deployed using the Lambda architecture cannot be migrated.

Kappa Architecture

The fundamental motivation behind the Kappa architecture is to avoid maintaining two separate code bases for the batch and real-time layers.

Like Lambda architecture, the serving layer is used to query the results here too. For the other two layers, the key idea is to handle both real-time data processing and continuous data reprocessing using a single stream processing engine. Data reprocessing is an important requirement for making visible the effects of code changes on the final results. As a result, the Kappa architecture is composed of only two layers: real-time processing and serving. The real-time processing layer runs the stream processing jobs.
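The sketch below illustrates the Kappa idea under the same hypothetical Kafka setup as before (kafka-python, a broker at localhost:9092, an "events" topic): the same streaming code handles both live processing and reprocessing, and reprocessing after a code change simply means starting a new consumer group that replays the retained log from the earliest offset into a fresh output view.

```python
# Minimal Kappa-style sketch: one stream-processing code path for both live
# processing and reprocessing; the consumer loop runs indefinitely.
import json
from collections import defaultdict
from kafka import KafkaConsumer

def run_stream_job(group_id, output_view):
    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        group_id=group_id,
        auto_offset_reset="earliest",  # replay from the start of the retained log
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        event = message.value
        output_view[event["user_id"]] += event["amount"]

# Live job and reprocessing job share the same code; only the consumer group
# id and the target output view differ.
run_stream_job("live-v1", defaultdict(float))
```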

Tools

The two architectures can be implemented by combining various technologies, such as Apache Kafka/Kinesis, Apache HBase, Apache Hadoop (HDFS, MapReduce), Apache Spark, Apache Drill, Spark Streaming, Apache Storm, and Apache Samza.