Original article was published on Deep Learning on Medium
Exploring the solution space
We then moved on to the models proposed in Time Series Classification from Scratch with Deep Neural Networks: A Strong Baseline by Zhiguang Wang et al. This article is more popular (as its roughly 500 citations show) and it comes with code for the three models it introduces. The models are also very simple to train, which meant convergence would be easy.
We started training with their ResNet model. Unfortunately, the first run didn’t succeed, most probably because of the length of our time series: they were longer than most of those explored in the paper.
Convolutional models, like ResNet, have a “pattern size” called a receptive field that is the length of the longest pattern they can possibly detect. For example, if the receptive field is 24 hours then the model can only detect daily patterns and will not be able to detect a weekly regularity.
The receptive field of the default ResNet came out to a few dozen hours on our data, which we felt was insufficient. This turned out to be the key observation. We finally got our first working model by enlarging the receptive field to about a week (using strided convolutions).
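To see why striding helps, the receptive field of a stack of 1-D convolutions can be computed with the standard recursion below. The kernel sizes and strides are illustrative, not the ones from the actual model:

```python
def receptive_field(layers):
    """Receptive field of a stack of 1-D convolutions.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    The numbers used below are hypothetical examples.
    """
    rf, jump = 1, 1  # jump = distance between adjacent outputs, in input steps
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Three kernel-8 convolutions without striding: the pattern size stays small.
print(receptive_field([(8, 1), (8, 1), (8, 1)]))   # 22 time steps
# The same kernels with stride 4 cover far longer patterns.
print(receptive_field([(8, 4), (8, 4), (8, 4)]))   # 148 time steps
```

With hourly data, the strided stack already sees patterns spanning almost a week instead of a day.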
This model was about 4 times more specific than a dummy baseline.
The next period, from November to January, was focused on trying as many different architectures as we could.
ConvRNN is a deep learning architecture consisting of a few convolutional layers followed by a recurrent layer. All the layers are arranged sequentially, as shown in the diagram above. It is an end-to-end model: the input is the sensor data and the output of the last layer is the default probability.
Convolutions take a time series as input and also produce a time series as output. The output length is divided by the stride; for example, a stride of 2 will halve the length.
Since several strided convolutions are applied in sequence, we can achieve a very large reduction in the length of the time series. In some sense, the convolutions act as smart “sub-samplers”.
This reduction in size is important when the sensor time series are long. Recurrent neural networks can only be trained effectively on short time series, on the order of hundreds of points.
The recurrent layer would diverge if trained on the original data, so the convolutions are needed to shorten the time dimension.
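The shrinking effect can be sketched with the textbook output-length formula for a strided convolution. The series length, kernel sizes, strides, and padding below are hypothetical:

```python
import math

def conv_out_len(n, kernel, stride, padding=0):
    """Output length of a 1-D convolution (standard formula)."""
    return math.floor((n + 2 * padding - kernel) / stride) + 1

# A hypothetical hourly series of 10,000 points passed through
# three convolutions with kernel 7, stride 4, padding 3.
n = 10_000
for kernel, stride in [(7, 4), (7, 4), (7, 4)]:
    n = conv_out_len(n, kernel, stride, padding=3)
print(n)  # 157: short enough for a recurrent layer
```

Three stride-4 layers cut the length by a factor of about 64, turning an untrainable sequence into a comfortably short one.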
Finally, we noted that the receptive field of this model is theoretically infinite. This is the advantage of the recurrent layer, which is able to combine and accumulate information over long ranges. As noted in the ResNet model discussion, a large receptive field lets the model capture a farmer’s behavior at time scales ranging from days to seasons.
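As a rough illustration, here is a minimal sketch of such a ConvRNN in PyTorch. The channel counts, kernel widths, and strides are hypothetical, not the configuration used in the project:

```python
import torch
import torch.nn as nn

class ConvRNN(nn.Module):
    """Sketch: strided convolutions shorten the series, then a
    recurrent layer summarizes it into a default probability."""

    def __init__(self, n_channels=4, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):       # x: (batch, channels, time)
        z = self.conv(x)        # time axis reduced ~16x by the two stride-4 convs
        z = z.transpose(1, 2)   # (batch, time, features) for the GRU
        _, h = self.rnn(z)      # h: (1, batch, hidden), last hidden state
        return torch.sigmoid(self.head(h[-1])).squeeze(-1)

model = ConvRNN()
x = torch.randn(8, 4, 2048)     # 8 series, 4 sensor channels, 2048 time steps
p = model(x)                    # one probability per series, shape (8,)
```

The convolutions hand the GRU a sequence of 128 feature vectors instead of 2048 raw time steps, which keeps recurrent training stable.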
This model had the best performance among our deep learning architectures. The original idea came from Charles Nichols; we have not yet seen this architecture used in the literature.
The lesson learned for choosing an architecture: do not hesitate to try something new. Even if something has not been done before, it can still work well. That said, you might want to start your project with an existing model for simplicity.
For each of the 14 architectures we had different runs and configurations. At the end of the project, the total number of individual runs was easily in the thousands. A big part of a data science project is to manage them efficiently.
Each experiment, whether it brings an improvement in the metrics or not, needs to be documented so that we can use it to guide our future decisions. Some of these experiments were quite long, up to a few days, hence we saved all the key information to avoid having to rerun them later.
At first I used an online wiki to document the experiments. At one point we realized that a significant portion of my time was spent documenting experiments rather than running them. We then started using a service called Weights & Biases, which made the whole process much more productive.
There were also other issues at this point: processing new data that kept arriving, versioning datasets, and maintaining high data quality. The main difficulty was balancing two needs: moving quickly, which requires fixed processes, and exploring the solution space, which requires flexibility.
We’ll go over three main aspects of data management: data versioning, data processing and data debugging.
Data Version Control (DVC) is a very useful tool for managing datasets. It provides revision control for data, just as git does for code.
Every update corresponds to a version, and all previous versions remain accessible. If a dataset update proves to be a mistake, we can recover a previous version and cancel the changes, so it acts as a safety net.
Working in a distributed environment makes data management more difficult. In our case, we were using up to 5 different machines at the same time, including laptops and cloud instances.
This is common in machine learning where the bulk of work is done on a local machine, but training is done on powerful cloud instances. The challenge is to synchronize the data.
We set up Azure Blob Storage which, through DVC, centralized all the data. It is like an online hard drive that can be accessed from anywhere, provided you have the credentials. It stored not just the current data but all the revisions. This blob storage was essentially our data library.
This allowed machines to synchronize and modify the global library by using a system of push and pull. Typically, once a worker machine has finished processing a dataset it pushes its updates to the central storage. Then, a training machine can pull the latest data from the central storage. DVC makes these operations very easy to do.
We use scripts for data processing because they are easier to reuse and maintain than notebooks. Scripts can be called directly from the command line, like any UNIX utility. They operate “file-to-file”: both the input dataset and the resulting processed data are written to disk, which simplifies debugging.
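As a sketch of this file-to-file style (the file names and the cleaning rule are made up for illustration), such a script might look like:

```python
"""Hypothetical file-to-file processing step: read a raw CSV of sensor
readings, drop rows with missing fields, write the cleaned CSV."""
import csv
import sys

def clean(in_path, out_path):
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        for row in reader:
            # Keep only rows where every field is non-empty.
            if row and all(field.strip() for field in row):
                writer.writerow(row)

# Command-line entry point, e.g.:  python clean.py raw.csv cleaned.csv
if len(sys.argv) == 3:
    clean(sys.argv[1], sys.argv[2])
```

Because both the input and the output live on disk, a failed step can be inspected and rerun in isolation, like any UNIX pipeline stage.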
For data debugging, we realized that unit testing alone is not sufficient: it is easy to miss data errors if we do not actually plot the data. At the same time, data testing is useful for checking basic sanity, such as outliers and formats. So we decided to use both formal testing and visualizations.
The natural tool for this is the Jupyter notebook. Each notebook would contain a few plots, such as distribution histograms, and display a few days of data for a randomly chosen sensor. This is very useful because many errors can be detected visually. It would also show basic statistics. In a sense, a notebook served as a quick “identity card” for a particular asset.
Notebooks would also contain tests. Typically, we test for outliers and possible issues in the data. If we expect the data to have a particular property, for example the average is 0, then we write a test for it.
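A notebook-style check of that kind might look like the following sketch; the thresholds and the zero-mean expectation are hypothetical examples, not the project's actual rules:

```python
import random
import statistics

def check_series(values, max_abs_mean=0.1, max_z=6.0):
    """Hypothetical sanity tests: the mean should sit near 0,
    and no point should be an extreme outlier."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    assert abs(mean) <= max_abs_mean, f"mean {mean:.3f} drifted away from 0"
    outliers = [v for v in values if sd and abs(v - mean) / sd > max_z]
    assert not outliers, f"{len(outliers)} points beyond {max_z} sigma"

random.seed(0)
check_series([random.gauss(0, 1) for _ in range(10_000)])  # passes silently
```

Running such assertions on every dataset revision catches regressions that a spot-check plot of a single sensor could miss.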
Tests are interesting in that they operate on the whole dataset, whereas visualization typically can only display detailed information about a few samples.
Finally, notebooks would be automatically converted to Markdown and saved to the wiki for documentation. This process can be made very efficient by using Papermill to execute the notebooks on the fly.