Source: Deep Learning on Medium
Real-Time Machine Learning
Imagine this scenario: You have an app that uses machine learning, and you want the app to learn from your users’ data in real time. That means as new user data is generated, your app is able to make predictions and perform training on the incoming data stream to improve itself automatically. How would you go about building this? Take some time to study this chart; it’s an example of such a pipeline.
So there’s the text data first. That text data is streamed in real time to a model using a software product called “Apache Kafka”. That model is built and trained using a library called “Spark”. The results are saved to a database, and then we can do whatever we want with those predictions (analysis, visualizations, UI updates, etc.). Now, none of these specific technologies are required to make this happen, but each of them is popular for this use case. We could easily replace “Spark” with “TensorFlow” or “PyTorch”. The same goes for “Kafka”: we could replace it with “Redis” or other data streaming tools. But Kafka is pretty powerful, so let’s talk about it.
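To make the pipeline concrete, here is a minimal, pure-Python sketch of the predict-then-train loop at its core. Everything here is illustrative: `incoming_stream` stands in for a Kafka topic, `OnlineModel` stands in for a Spark (or TensorFlow/PyTorch) model, and `database` stands in for the results store.

```python
def incoming_stream():
    """Stand-in for a real-time data source (e.g. a Kafka topic)."""
    for x, label in [(1.0, 2.1), (2.0, 4.2), (3.0, 5.9), (4.0, 8.1)]:
        yield x, label

class OnlineModel:
    """A tiny online linear model y = w * x, updated by one SGD step per record."""
    def __init__(self, lr=0.05):
        self.w = 0.0
        self.lr = lr

    def predict(self, x):
        return self.w * x

    def train(self, x, label):
        error = self.predict(x) - label
        self.w -= self.lr * error * x  # one gradient step on the new record

model = OnlineModel()
database = []  # stand-in for the results database

for x, label in incoming_stream():
    pred = model.predict(x)      # 1. predict on the incoming record
    model.train(x, label)        # 2. immediately learn from it
    database.append((x, pred))   # 3. persist the prediction downstream
```

The key property is that prediction and training happen per record as data arrives, rather than in periodic batch retrains.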
What is Apache Kafka?
Kafka is open source software that provides a framework for storing, reading, and analyzing streaming data.
Being open source means it is essentially free to use and has a large network of users and developers who contribute updates, new features, and support for new users.
Kafka is designed to run in a “distributed” environment, meaning that it operates across many servers instead of sitting on one user’s computer, exploiting the additional computing power and storage capacity this provides.
Originally created at LinkedIn, Kafka played a role in analyzing the interactions among their millions of professional users in order to build networks among individuals. It was given open source status and transferred in 2011 to the Apache Foundation, which coordinates and oversees the development of open source software.
What is the use of Kafka?
Today, companies are increasingly relying on real-time data analytics to allow them to obtain better insights and faster response times in order to remain competitive. Real-time insights allow businesses or organizations to predict what they should store, promote, or pull from the shelves on the basis of the most up-to-date information.
Data has historically been stored and transmitted across networks in “batches.” This comes down to pipeline limitations: the rate at which CPUs can handle the calculations involved in reading and storing information, or at which sensors can capture data. As this interview points out, such “bottlenecks” in our ability to process data have existed ever since humans first started recording and exchanging information in written records.
Kafka is able to operate very quickly thanks to its distributed nature and the streamlined way in which it handles incoming data: large clusters can monitor and respond to millions of changes to a dataset every second. This makes it possible to work with, and react to, streaming data in real time.
Kafka was originally designed to track the actions of visitors to big, busy websites (such as LinkedIn). By analyzing each session’s clickstream data (how the user navigates the site and what features they use), it is possible to gain a better understanding of user behavior. This makes it possible to predict which news articles, or products for sale, a visitor might be interested in.
Since then, Kafka has become widely used, and it is an integral part of the stacks at Spotify, Airbnb, Uber, Goldman Sachs, PayPal, and Cloudflare, all of which use it to process streaming data and understand customer or system behavior. In fact, according to the project’s website, one in five Fortune 500 companies uses Kafka to some degree.
One particular area where Kafka has become dominant is the travel industry, where its streaming capability makes it ideal for monitoring millions of flights, package vacations, and hotel vacancies all over the world.
How does Kafka work?
Kafka takes information, which can be read from a huge number of data sources, and organizes it into “topics.” As a very simple example, one of those data sources could be a transactional log where a grocery store records every sale.
Kafka would process this stream of information and make “topics,” which could be “number of apples sold” or “number of sales between 1 pm and 2 pm,” that could then be analyzed by anyone wanting insights into the data.
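To see what those example topics mean in terms of raw data, here is a plain-Python illustration (not Kafka itself) of deriving both aggregates from a transaction log. The record fields are assumptions made up for the example.

```python
# A few raw sale records, as a grocery store's transactional log might emit them.
transactions = [
    {"item": "apple",  "qty": 3, "hour": 13},
    {"item": "banana", "qty": 2, "hour": 13},
    {"item": "apple",  "qty": 1, "hour": 15},
]

# "number of apples sold"
apples_sold = sum(t["qty"] for t in transactions if t["item"] == "apple")

# "number of sales between 1 pm and 2 pm"
sales_1pm_2pm = sum(t["qty"] for t in transactions if t["hour"] == 13)
```

In Kafka, each of these derived streams would live in its own topic that downstream consumers could subscribe to independently.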
This may sound similar to how a traditional database lets you store or sort information, but in Kafka’s case it would be suitable for a national chain of grocery stores recording thousands of apple sales every minute.
This is accomplished through a component known as a Producer, which is an interface between applications (e.g., the code that tracks the grocery store’s ordered but unanalyzed transaction records) and the topics: Kafka’s own ordered, segmented store of records, known as the Kafka topic log.
This data flow is often used to fill data lakes, such as Hadoop’s distributed databases, or to feed real-time processing pipelines such as Spark or Storm.
Another component, known as the Consumer, reads the topic logs and passes the information stored in them on to other applications that might need it (for example, the grocery store’s system for replenishing depleted inventory or discarding outdated products).
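The producer/topic-log/consumer relationship can be sketched as a toy, in-memory model. This is not Kafka’s actual API; it just shows the core abstraction: producers append records to an ordered log, and each consumer reads from that log at its own offset. Real Kafka additionally partitions and replicates the log across servers.

```python
class TopicLog:
    """An append-only, ordered record log (a toy stand-in for a Kafka topic)."""
    def __init__(self):
        self.records = []

    def produce(self, record):
        self.records.append(record)

class Consumer:
    """Reads a topic log, tracking its own read position (offset)."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        """Return all records published since the last poll."""
        new = self.log.records[self.offset:]
        self.offset = len(self.log.records)
        return new

sales = TopicLog()
restocker = Consumer(sales)  # e.g., the inventory-replenishment application

sales.produce({"item": "apple", "qty": 3})
sales.produce({"item": "milk", "qty": 1})
first_batch = restocker.poll()   # sees both records

sales.produce({"item": "apple", "qty": 2})
second_batch = restocker.poll()  # sees only the record added since
```

Because each consumer keeps its own offset, many independent applications can read the same topic at their own pace without interfering with one another.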
When you bring its components together with the other common elements of a Big Data analytics framework, Kafka works by forming the “central nervous system” through which data moves between input and capture applications, data processing engines, and storage lakes.
OK So what’s Spark?
Apache Spark is a powerful open-source processing engine with APIs in Java, Scala, Python, R, and SQL, designed around speed, ease of use, and sophisticated analytics. Spark runs programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk. It can be used as a library for developing software applications and for interactively conducting ad-hoc data analysis. Spark powers a stack of libraries including SQL, DataFrames and Datasets, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. You can easily combine these libraries in the same program. Spark runs standalone, on Hadoop, on Apache Mesos, or in the cloud, and it can access diverse data sources such as HDFS, Apache Cassandra, Apache HBase, and S3.
It was first developed in 2009 at UC Berkeley. (Note that Spark’s creator, Matei Zaharia, has since become CTO at Databricks and a faculty member at MIT.) Since its release, Spark has seen rapid adoption by businesses across a wide range of industries. Internet powerhouses like Netflix, Twitter, and Tencent have eagerly deployed Spark at scale, collectively processing multiple petabytes of data on clusters of over 8,000 nodes. With over 1,000 project contributors and over 187,000 members in 420 Apache Spark Meetup groups, it has quickly become the largest open source community in big data.
That’s a lot to take in all at once! I’m slightly confused.
That makes total sense! These two technologies could each have their own course. I wanted to mention them because they help with scalability, an issue every business has to deal with as it grows its software product. But luckily, I’ve found some great blog posts that tie everything together and give you a high-level, implementation-focused understanding of how they work. It’s OK if you don’t understand every detail, as long as you know what these tools generally do and when to use them. Intuition is the key here!
Spark vs Tensorflow: https://analyticsindiamag.com/tensorflow-vs-spark-differ-work-tandem/
ML in Production Example, Kafka + Python: https://towardsdatascience.com/putting-ml-in-production-i-using-apache-kafka-in-python-ce06b3a395c8
Sentiment Analysis with Spark and Kafka: https://mapr.com/blog/streaming-machine-learning-pipeline-for-sentiment-analysis-using-apache-apis-kafka-spark-and-drill-part-1/