Easy Machine Learning Tricks for Linux Developers in 2018


How deep learning becomes easy to integrate with your application

Learning ML in 5 minutes

I love this Matrix scene where you can learn Kung fu in the blink of an eye! It would be cool if learning and embedding machine learning in your application was that easy… Well, now it is!

That application can really be yours, so let’s do it “Kung fu” style!

Loud ML is your best Kung fu trick

This is 2018, and machine learning APIs are widely available, both with and without cloud service connectivity. Today we will use an API named Loud ML (www.loudml.io) that can be installed on your favorite Linux host.

It facilitates deep learning usage in many ways:

  • It is data-source agnostic (it connects to all major NoSQL databases). This matters because learning requires a lot of historical data: pulling that data and formatting it for machine learning is no longer a pain.
  • The API is well documented and comes with a CLI and REST endpoints, so it can be controlled remotely.
  • The developer edition is free!

Getting started: First five minutes with Loud ML

The package is currently available in RPM and DEB formats, so you can install it on most Linux distributions using standard tools. For example, on EL7:

$ yum install loudml

After installation, find the configuration file located in /etc/loudml/config.yml

You must declare the data sources, i.e. where to read the data from and how to connect to the NoSQL databases.

The Loud ML 1.1 beta release already supports the popular databases InfluxDB and Elasticsearch.

To configure InfluxDB, define the name of the data source and the address used to connect to the database:

- name: my-influx-datasource
  type: influxdb
  addr: <host>:<port>
  database: <your database name, e.g. telegraf>

To configure Elasticsearch, define the address of your node and the name of the index (or index pattern) to pull the data from:

- name: my-elastic-datasource
  type: elasticsearch
  addr: <host>:<port>
  index: <your index name or index pattern>

Your first predictive model

Say your InfluxDB database (or your Elasticsearch indexes) contains CPU measurements for the server hosting your web application. Let’s name them cpu_load, and assume you have 30 days of history at 1-minute resolution.

Your first model will predict a single feature avg_cpu_load, and will:

  • average data over five-minute intervals (the bucket_interval); and
  • use the last three bucket intervals (span=3, so 15 minutes in total) to predict the next cpu_load value.
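To make the two settings concrete, here is a small Python sketch of what bucketing and spanning mean. It only illustrates the idea and is not Loud ML's actual implementation; the helper names bucket_averages and training_windows and the sample values are made up for this example.

```python
from statistics import mean


def bucket_averages(samples, bucket_size):
    """Average each complete group of `bucket_size` consecutive samples.

    With 1-minute samples and bucket_size=5, each result is one 5-minute bucket.
    """
    return [
        mean(samples[i:i + bucket_size])
        for i in range(0, len(samples) - bucket_size + 1, bucket_size)
    ]


def training_windows(buckets, span):
    """Pair every run of `span` buckets with the bucket that follows it."""
    return [
        (buckets[i:i + span], buckets[i + span])
        for i in range(len(buckets) - span)
    ]


# 20 made-up one-minute cpu_load samples: 0.0, 1.0, ..., 19.0
samples = [float(v) for v in range(20)]

buckets = bucket_averages(samples, bucket_size=5)
windows = training_windows(buckets, span=3)

print(buckets)   # four 5-minute averages
print(windows)   # (three previous buckets, next bucket) pairs
```

Each pair in windows is what the model sees during training: 15 minutes of averaged history as input, and the following 5-minute average as the value to predict.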

Let’s create this model using the CLI.

First, you must write the file that describes your model. This file can be either JSON or YAML.

In the model.yml file, we will define a single ‘feature’ to learn the shape of the average CPU load using 5-minute bucket intervals:

name: my-timeseries-model
type: timeseries
default_datasource: my-influx-datasource
# Size of buckets for data aggregation and prediction
bucket_interval: 5m
# Number of preceding buckets required to predict the next bucket
span: 3
interval: 60s
offset: 30
max_evals: 5
# List of features the model learns
features:
  - name: avg_cpu_load
    measurement: system
    metric: avg
    field: cpu_load
    default: 0
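Since the model file can also be JSON, here is a minimal Python sketch that produces the same settings as a JSON document. It assumes the feature list sits under a features key and that the JSON layout simply mirrors the YAML one; check the Loud ML guide before relying on it.

```python
import json

# Same settings as model.yml; the "features" key is an assumption (see above).
model = {
    "name": "my-timeseries-model",
    "type": "timeseries",
    "default_datasource": "my-influx-datasource",
    "bucket_interval": "5m",   # size of buckets for aggregation and prediction
    "span": 3,                 # preceding buckets used to predict the next one
    "interval": "60s",
    "offset": 30,
    "max_evals": 5,
    "features": [
        {
            "name": "avg_cpu_load",
            "measurement": "system",
            "metric": "avg",
            "field": "cpu_load",
            "default": 0,
        }
    ],
}

model_json = json.dumps(model, indent=2)
print(model_json)

# To save it next to model.yml, uncomment:
# with open("model.json", "w") as handle:
#     handle.write(model_json)
```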

And now, let’s create this first model:

$ loudml create-model model.yml

The Kung fu lesson is about to begin. Training teaches the model how the data evolves over time. You can think of deep learning as a way to approximate *any* function.

The model training is launched with the following command:

$ loudml train <model_name> --from <from_date> --to <to_date>

Accepted date formats are:

  • UNIX timestamp in seconds
  • ISO 8601 format, example: 2018-01-26T16:47:25Z
  • Relative date, example: now-20s, now-45m, now-3h, now-1d, now-3w…
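If you want to see what these three formats resolve to, the following Python sketch converts each of them to a UTC datetime. It is only an illustration of the formats listed above, not the parser Loud ML uses; parse_date is a hypothetical helper, and the unit letters (s, m, h, d, w) are taken from the examples in the list.

```python
import re
from datetime import datetime, timedelta, timezone

# Unit letters from the relative-date examples, mapped to timedelta keywords.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def parse_date(value, now=None):
    """Resolve a UNIX timestamp, ISO 8601 string, or relative date to UTC."""
    now = now or datetime.now(timezone.utc)
    if value == "now":
        return now
    relative = re.fullmatch(r"now-(\d+)([smhdw])", value)
    if relative:
        amount, unit = int(relative.group(1)), relative.group(2)
        return now - timedelta(**{_UNITS[unit]: amount})
    if re.fullmatch(r"\d+(\.\d+)?", value):
        # UNIX timestamp in seconds
        return datetime.fromtimestamp(float(value), tz=timezone.utc)
    # ISO 8601; rewrite the trailing "Z" for Python versions before 3.11
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


reference = datetime(2018, 1, 31, tzinfo=timezone.utc)
print(parse_date("now-30d", now=reference))
print(parse_date("2018-01-26T16:47:25Z"))
print(parse_date("0"))
```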

We will train the above model using 30 days of history:

$ loudml train my-timeseries-model --from now-30d --to now

When it’s done, the command will report the accuracy percentage: the higher the better!

To show additional model information, you can run:

$ loudml show-model <model_name>

It is show time! Predictive capabilities

You know Kung fu, so let’s practice. You’ve trained a model. Now, you can make this model output predictions for avg_cpu_load on a regular interval. This output may be written to another data source, your application, or merely stdout, and compared to the actual values for anomaly detection.
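As a rough idea of what comparing predictions to actual values can look like in your own application, here is a minimal Python sketch. The fixed max_error threshold, the find_anomalies helper, and the sample numbers are all assumptions made for illustration; this is not how Loud ML itself scores anomalies.

```python
def find_anomalies(actual, predicted, max_error):
    """Return the indexes of buckets where actual and predicted differ too much."""
    return [
        index
        for index, (a, p) in enumerate(zip(actual, predicted))
        if abs(a - p) > max_error
    ]


# Made-up avg_cpu_load values, one per 5-minute bucket.
actual = [0.8, 0.9, 3.5, 1.0]
predicted = [0.9, 1.0, 1.1, 1.0]

anomalies = find_anomalies(actual, predicted, max_error=1.0)
print(anomalies)  # indexes of the buckets to investigate
```

A fixed threshold is the simplest possible choice; in practice you would probably tune it to the normal prediction error of your model.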

You can enable the loudmld service to run in the background. If you are running EL7, the systemd commands are:

$ systemctl enable loudmld.service
$ systemctl start loudmld.service

The loudmld process exposes an HTTP API that you can control using curl. To start predicting future avg_cpu_load values, you can issue this curl command:

curl http://localhost:8077/models/my-timeseries-model/_start

It will tell the loudmld process to wake up at periodic intervals (the interval setting in model.yml, not to be confused with bucket_interval), pull the data from your data source, and finally output the predicted values to the time series database.
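If you would rather trigger this from your own application than from a shell, the same request can be sketched in Python with the standard library. This assumes the endpoint accepts the plain GET request that the curl command above sends, and that loudmld listens on localhost:8077 as shown; start_url and start_predictions are hypothetical helper names.

```python
from urllib.parse import quote
from urllib.request import urlopen


def start_url(model_name, host="localhost", port=8077):
    """Build the _start URL used in the curl example."""
    return f"http://{host}:{port}/models/{quote(model_name, safe='')}/_start"


def start_predictions(model_name, timeout=5.0):
    """Send the request and return the HTTP status code."""
    with urlopen(start_url(model_name), timeout=timeout) as response:
        return response.status


print(start_url("my-timeseries-model"))

# This needs a running loudmld, so the call is left commented out:
# start_predictions("my-timeseries-model")
```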

The original data in purple; the predictions in green

If you’re using InfluxDB, you can visualize the result using Chronograf. The prediction is stored in the same database under the measurement called prediction_<model_name> and must be displayed using a GROUP BY time(bucket_interval) query.
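For reference, a query of that shape can be assembled as follows. This Python sketch only builds the InfluxQL string; it assumes the predicted values are stored in a field named after the feature (avg_cpu_load here) and uses an arbitrary one-day time range, so adjust both to what you actually see in your database.

```python
def prediction_query(model_name, field, bucket_interval, time_range="1d"):
    """Build an InfluxQL query over the prediction_<model_name> measurement."""
    return (
        f'SELECT mean("{field}") FROM "prediction_{model_name}" '
        f"WHERE time > now() - {time_range} "
        f"GROUP BY time({bucket_interval})"
    )


query = prediction_query("my-timeseries-model", "avg_cpu_load", "5m")
print(query)
```

You can paste the printed query into the Chronograf data explorer, or run it from the influx command-line client.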

Hopefully, this first Kung fu lesson has been a success!

Loud ML documentation is available on GitHub (https://github.com/regel/loudml) and online at www.loudml.io/guide