Outlier Detection with RNN Autoencoders

Original article was published by David Woroniuk on Deep Learning on Medium


TL;DR: Historic-Crypto Package, Code.

What are Anomalies?

Anomalies, often referred to as outliers, are data points, data sequences or patterns in data which do not conform to the overarching behaviour of the data series. As such, anomaly detection is the task of detecting data points or sequences which don’t conform to patterns present in the broader data.

The effective detection and removal of anomalous data can provide useful insights across a number of business functions, for example by revealing broken links embedded within a website, spikes in internet traffic, or dramatic changes in stock prices. Flagging these phenomena as outliers, or enacting a pre-planned response, can save businesses both time and money.

Types of Anomalies

Anomalous data can typically be separated into three distinct categories: Additive Outliers, Temporal Changes and Level Shifts.

Additive Outliers are characterised by sudden large increases or decreases in value, which can be driven by exogenous or endogenous factors. Examples of additive outliers could be a large increase in website traffic due to an appearance on television (exogenous), or a short-term increase in stock trading volume due to strong quarterly performance (endogenous).

Temporal Changes are characterised by a short sequence which doesn’t conform to the broader trend in the data. For example, if a website server crashes, the volume of website traffic will drop to zero for a sequence of datapoints, until the server is rebooted, at which point normal traffic will return.

Level Shifts are characterised by a persistent change in the level of a series, and are a common phenomenon in commodity markets. For example, high demand for electricity is closely linked to inclement weather conditions. As such, a ‘level shift’ can be observed between the price of electricity in summer and winter, owing to weather-driven changes in demand profiles and renewable energy generation profiles.
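To make the three categories concrete, the sketch below builds a synthetic series with NumPy and injects one anomaly of each type. The series, the anomaly positions and the magnitudes are illustrative assumptions, not data from this article.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300

# Illustrative baseline: noise around a constant level of 100.
base = 100.0 + rng.normal(scale=1.0, size=n)

# Additive outlier: a single sudden spike at one point.
additive = base.copy()
additive[150] += 25.0

# Temporal change: a short run that departs from the norm, then recovers
# (e.g. traffic dropping to zero while a server is down).
temporal = base.copy()
temporal[150:160] = 0.0

# Level shift: the level of the series changes and stays changed.
level = base.copy()
level[150:] += 20.0
```

Plotting the three arrays side by side shows a lone spike, a brief trough and a step change respectively.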

What is an Autoencoder?

Autoencoders are neural networks designed to learn a low-dimensional representation of a given input. Autoencoders typically consist of two components: an encoder, which learns to map input data to a lower-dimensional representation, and a decoder, which learns to map the representation back to the input data.

Due to this architecture, the encoder network iteratively learns an efficient data compression function, which maps the data to a lower-dimensional representation. Following training, the decoder is able to reconstruct the original input data with little loss, as the reconstruction error (the difference between the input and the reconstructed output produced by the decoder) is the objective function minimised throughout the training process.
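The encode, decode and reconstruction-error steps can be illustrated without a neural network. The sketch below is not the RNN autoencoder this article builds. It swaps in a linear autoencoder fitted with an SVD (equivalent to PCA) on toy data, purely to show how reconstruction error separates typical points from unusual ones. All names and data here are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 5 dimensions that lie close to a 2-D subspace.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

# Fit a linear "autoencoder" via SVD on the centred data (i.e. PCA).
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
W = Vt[:2].T  # shape (5, 2): the top-2 principal directions

def encode(x):
    """Map inputs to the 2-D representation."""
    return (x - mean) @ W

def decode(z):
    """Map the 2-D representation back to input space."""
    return z @ W.T + mean

def reconstruction_error(x):
    """Mean squared difference between each input and its reconstruction."""
    return np.mean((x - decode(encode(x))) ** 2, axis=1)

errors = reconstruction_error(X)

# A point pushed away from the learned subspace reconstructs poorly.
outlier = mean + 5.0 * Vt[-1]
outlier_error = reconstruction_error(outlier[None, :])[0]
```

Points resembling the training data receive a small error, while the off-subspace point receives a much larger one. That gap is the property which lets reconstruction error serve as an anomaly score.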

Implementation

Now that we understand the underlying architecture of an Autoencoder model, we can begin to implement the model.

The first step is to install the libraries, packages and modules which we shall use:

Secondly, we need to obtain some data to analyse. This article uses the Historic-Crypto package to obtain historical Bitcoin (‘BTC’) data from ‘2013-06-06’ to the present day. The code below also generates the daily Bitcoin returns and intraday price volatility, prior to removing any rows of missing data and returning the first 5 rows of the DataFrame.
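As a rough sketch of the feature-engineering step described above, the snippet below uses a small hand-made DataFrame in place of the Historic-Crypto download, so that it runs offline. The prices are invented, and the formulas for Return (day-on-day change in Close) and Volatility (the High to Low range scaled by Open) are assumptions. The article's own definitions may differ.

```python
import pandas as pd

# Hypothetical daily OHLCV data standing in for the downloaded Bitcoin history.
dates = pd.date_range("2020-01-01", periods=6, freq="D")
data = pd.DataFrame(
    {
        "Open": [99.0, 100.0, 102.0, 101.0, 105.0, 107.0],
        "High": [101.0, 103.0, 103.0, 106.0, 108.0, 108.0],
        "Low": [98.0, 99.0, 100.0, 100.0, 104.0, 105.0],
        "Close": [100.0, 102.0, 101.0, 105.0, 107.0, 106.0],
        "Volume": [1200.0, 1500.0, 900.0, 2100.0, 1800.0, 1300.0],
    },
    index=dates,
)

# Assumed definition: daily return as the relative change in closing price.
data["Return"] = data["Close"] / data["Close"].shift(1) - 1

# Assumed definition: intraday volatility as the high-low range over the open.
data["Volatility"] = (data["High"] - data["Low"]) / data["Open"]

# The first row has no previous close, so its Return is missing and is dropped.
data = data.dropna()

print(data.head())
```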

Now that we have obtained some data, we should visually scan each series for potential outliers. The plot_dates_values function below enables the iterative plotting of each series contained within the DataFrame.

We can now iteratively call the above function, generating Plotly charts for the Volume, Close, Open, Volatility and Return profiles of Bitcoin.

Image generated by Author.

Notably, a number of spikes in trading volume occur in 2020; it may be useful to investigate whether these spikes are anomalous or representative of the broader series.

Image generated by Author.

A pronounced spike exists within the closing price in 2018, followed by a crash to a technical support level. However, a positive trend broadly exists throughout the data.

Image generated by Author.

The daily opening price follows a similar pattern to that of the closing price above.

Image generated by Author.

Price volatility displays a number of pronounced spikes in both 2018 and 2020. As such, we could investigate whether these volatility spikes are considered anomalous by an Autoencoder model.

Image generated by Author.

Due to the stochastic nature of the Returns series, we have elected to test for outliers within the daily traded volume of Bitcoin, as characterised by Volume.