Deep Learning for Time Series Classification (InceptionTime)

Source: Deep Learning on Medium

Deep Learning for Time Series Classification (InceptionTime)

Fig. 1: The Inception module of InceptionTime.


  1. Motivation
  2. Machine Learning for Time Series Classification
  3. Best Deep Learning practices for Time Series Classification: InceptionTime
  4. Understanding InceptionTime
  5. Conclusion

1. Motivation

Time series data have always been of major interest to financial services, and now with the rise of real-time applications, other areas such as retail and programmatic advertising are turning their attention to time-series data driven applications. In the last couple of years, several key players in cloud services, such as Apache Kafka and Apache Spark, have released new products for processing time series data. It is therefore of great interest to understand the role and potentials of Machine Learning (ML) in this rising field.

In this article I discuss the (very) recent discoveries on Time Series Classification (TSC) with Deep Learning, by following a series of publications from the authors of [2].

2. Machine Learning for Time Series Classification

Defining the problem:

TSC is the area of ML interested in learning how to assign labels to time series. To be more concrete, we are interested in training an ML model which when fed with a series of data points indexed in time order (e.g. the historical data of a financial asset), it outputs labels (e.g. the industry sector of the asset).

More formally, let (X, y) be a training instance with T observations (X¹, … ,Xᵀ)≡ X (the time series) and a discrete class variable y which takes k possible values (the labels). A dataset S is a set of n such training instances: S= { (X₍₁₎, y₍₁₎), … , (X₍ n₎, y₍ n₎)) }. The task of classifying time series data consists of learning a classifier on S in order to map from the space of possible inputs {X} to a probability distribution over the labels {y¹, … , yᵏ}.

Do we really need DL?

It is always important to remind ourselves that DL is nothing but a set of tools for solving problems, and although DL can be very powerful, that doesn’t mean that we should blindly apply DL techniques to every single problem out there. After all, training and tuning a neural network can be very time-consuming so it is always a good practice to test the performance of other ML models and then seek for any potential shortcomings.

Oftentimes the nature of a problem is determined by the data itself; in our case, the way one chooses to process and classify a time series depends highly on the length and statistics of the data. That being said, let us run a quick dimensional analysis to estimate the complexity of our problem.

Suppose that we wish to learn a one-nearest neighbor classifier for our TSC problem (which is pretty common in the literature). Now given a dataset of n time series of length T, we must compute some sort of a distance measure for ⁿC₂=n(n-1)/2 unique pairs. Moreover, in order to find the “optimal distance” between two time series X₍₁₎ and X₍₂₎, we must compute the T×T point-wise distance matrix Mʲᵏ=(X₍₁₎ ʲ -X₍₂₎ᵏ)² for every unique pair of training instances and then seek for the path which optimizes our objective function. As explained in [3], there are several optimization algorithms in the literature for this setup, and they all have complexity O(n²⋅ Tᶜ) with c=3 or 4. Evidently, the length of the time series can really hurt the computational speed. However for certain types of data, this problem can be alleviated without digging into sophisticated machine learning models such as deep neural networks.

In signal processing, a complex signal is analyzed by decomposing the signal into a series of “elementary” signals, called Fourier modes. For instance, the square wave below can be approximated by three sinusoidal signals of distinct frequencies (f₁, f₂, f₃)=(ω, 3ω, 5ω), for some constant angular frequency ω.

Fig. 2: The Fourier series expansion of a square wave (red line). Here I present only the first three modes (blue dashed lines) and their addition (green line). Hopefully, it is not hard to see that by adding the next modes the series quickly converges to a square wave.

By taking the linear sum of these signals, we can reconstruct our original signal:

Square Wave(t) = W¹ ⋅ sin(f₁ t) + W² ⋅ sin(f₂ t) + W³ ⋅ sin(f₃ t)+ …

where the coefficients (W¹, W², W³) = (1, 1/3, 1/5) specify the weight that each mode contributes to the square wave.

Now consider a dataset within which any time series, originally represented by a series of T time-ordered data points, can also be represented by a weight vector in the space spanned by the three elementary frequency modes:

X=(X¹, … , Xᵀ )W=(W¹, W², W³, …) .

Going from the “time” representation to the “frequency” representation of our time series data is called Fourier transformation, and though the Fourier space in theory is infinite-dimensional (rather than 3-dimensional), we can apply various approximation techniques to truncate the Fourier series down to finite dimensions. Most importantly, we can reduce the T-dimensional representation of our time series data, to a number of dimensions (in Fourier space) that makes our classification problem computationally trackable. Overall, we can apply Fourier transformation during the data pre-processing phase to convert the input time series into weight vectors, and thereafter proceed by building our classification model (e.g. a 1-nearest neighbors classifier). Working with such “well-behaved” time series we can achieve high performance without the use of DL.

Now the aforementioned processing method assumed that any input signal can be approximated by a Fourier series of elementary (harmonic) functions. However a lot of real-world time-series data are so noisy (e.g. financial data) that do not admit such an elegant decomposition (or any sort of mathematical pre-processing). It is precisely for this type of data that DL comes to the rescue: letting the model learn how to process time series data on its own is a more promising solution when dealing with unstructured noisy data.

3. Best DL practices for TSC: InceptionTime

As of today, there are two state-of-the-art DL models for TSC. The oldest model, called HIVE-COTE [4], is based on the nearest neighbor algorithm coupled with the Dynamic Time Warping similarity measure. Although this algorithm has achieved an outstanding performance on the benchmark datasets [5], it suffers from O(n ²⋅ T ⁴)time complexity. Recently the authors of [6], introduced a deep Convolutional Neural Network (CNN), called InceptionTime, which not only outperforms the accuracy of HIVE-COTE but it is also substantially faster. InceptionTime’s high accuracy together with its scalability renders it the perfect candidate for product development!

To this end, let us present the most important components of InceptionTime and how these are implemented in Keras.

3.1 The Input Layer

In general, each data observation Xʲ (j=1, … ,T) of a time series X can be a list of one or more data measurements, i.e. Xʲ = ( X₁ʲ, … ,X_m ʲ ) for m data measurements, all taken at the jth time instance. For example, the velocity of a particle moving in 3D space consists of three spatial components: V=(V₁, V₂, V₃). Keeping track of the velocity of the particle for T seconds, with one observation per second, amounts to collecting the series of data: (V¹, … , Vᵀ ).

Definition 1: An m-dimensional Multivariate Time Series (MTS) X= (X¹, … , Xᵀ) consists of T ordered elements Xʲ ∈ ℝᵐ.

Definition 2: A Univariate time series X of length T is simply an MTS with m=1, i.e. Xʲ ∈ ℝ and X= (X¹, … , Xᵀ ).

Like in image classification problems, we can think of the input MTS as an array of shape (1, T, m), with m denoting the number of channels (the depth). In fact, it is convenient to suppress the width of the inputs and work directly with an input_shape = (T, m).

3.2 The Inception Module

The major building block of InceptionTime is the inception module, shown in the figure below:

Fig. 3: The inception module of InceptionTime. The first number in the boxes indicates the kernel size while the second indicates the size of the stride. “(S)” specifies the type of padding, i.e. “same”.

This consists of the following layers:

  • A bottleneck layer to reduce the dimensionality (i.e. the depth) of the inputs. This cuts the computational cost and the number of parameters, speeding up training and improving generalization.
  • The output of the bottleneck is fed to three one-dimensional convolutional layers of kernel size 10, 20 and 40.
  • The input of the inception module is also passed through a Max Pooling layer of size 3 and in turn, through a bottleneck layer.
  • The last layer is a depth concatenation layer where the outputs of the four convolutional layers at step 2 are concatenated along the depth dimension.

All layers (excluding the concatenation layer) have stride 1 and “same” padding . In addition, all convolutional layers come with 32 filters.

Keras Implementation