Original article was published by Theodoros Ntakouris on Deep Learning on Medium
Windowing Labelled Data
With the tf.data Dataset API this is simple to do. Assume the following configuration: each row holds an input feature a and a label. Each row can be described by a tensor of shape (2,), so our dataset has shape (6, 2).
Now, with a window size of 2, produce training data. This would look like:
1, 2 -> 0 (label of row 2)
2, 3 -> 1 (label of row 3)
...
4, 5 -> 0 (label of row 5)
The .window() function actually produces a dataset of datasets, one small dataset per window. This is why we need a .flat_map() with a batch operation to end up with a series of tensors we can treat uniformly.
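The steps above can be sketched as follows. This is a minimal illustration, not the article's original code: the label values are hypothetical (chosen only to agree with the pairs listed above), and the helper name split_features_and_label is made up for this example.

```python
import tensorflow as tf

WINDOW = 2

# Hypothetical toy data: 6 rows of (feature a, label). The label values are
# made up for illustration.
rows = tf.constant([[1, 1], [2, 0], [3, 1], [4, 1], [5, 0], [6, 1]])

ds = tf.data.Dataset.from_tensor_slices(rows)

# window() yields a dataset of small datasets; flat_map + batch turns each of
# them into a single tensor of shape (WINDOW, 2).
windows = ds.window(WINDOW, shift=1, drop_remainder=True)
windows = windows.flat_map(lambda w: w.batch(WINDOW, drop_remainder=True))


def split_features_and_label(window):
    features = window[:, 0]  # feature column of every row in the window
    label = window[-1, 1]    # label of the last row in the window
    return features, label


pairs_ds = windows.map(split_features_and_label)
pairs = [(f.tolist(), int(l)) for f, l in pairs_ds.as_numpy_iterator()]

for features, label in pairs:
    print(features, "->", label)
```

Passing drop_remainder=True to window() discards the shorter windows at the end of the data, so every sample has exactly WINDOW rows.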
Windowing Unlabelled Data by Looking Ahead
Sometimes you just want to predict the next tick of a sequence. This can be done without it being labelled. For an input dataset holding the sequence 1 to 6, the training pairs would be (again, with a window of size 2):
1, 2 -> 3
2, 3 -> 4
3, 4 -> 5
4, 5 -> 6
Let’s try the previous approach with a window of size 3, keeping the last element of each window as the label.
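A minimal sketch of this window-of-3 approach, assuming the input is simply the sequence 1 to 6 (the values are inferred from the pairs listed above, not taken from the original code):

```python
import tensorflow as tf

WINDOW = 2  # inputs per sample; one extra element is read as the label

ds = tf.data.Dataset.range(1, 7)  # the sequence 1..6

# Read WINDOW + 1 elements at a time, sliding by one element.
windows = ds.window(WINDOW + 1, shift=1, drop_remainder=True)
windows = windows.flat_map(lambda w: w.batch(WINDOW + 1, drop_remainder=True))

# Everything but the last element is the input; the last element is the label.
pairs_ds = windows.map(lambda w: (w[:-1], w[-1]))
pairs = [(x.tolist(), int(y)) for x, y in pairs_ds.as_numpy_iterator()]

for x, y in pairs:
    print(x, "->", y)
```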
The better-practice way to do this is to create 2 similar dataset pipelines with a window size of 2, with one of them running 2 elements ahead of the other.
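One possible reading of the two-pipeline idea is sketched below. As an assumption of this example, the look-ahead pipeline is just the source skipped by 2 elements, so that it supplies a single next value per window; the original code may have been structured differently.

```python
import tensorflow as tf

WINDOW = 2

source = tf.data.Dataset.range(1, 7)  # the sequence 1..6

# Pipeline 1: sliding windows of size WINDOW used as model inputs.
features = source.window(WINDOW, shift=1, drop_remainder=True)
features = features.flat_map(lambda w: w.batch(WINDOW, drop_remainder=True))

# Pipeline 2: the same source, running WINDOW elements ahead.
labels = source.skip(WINDOW)

# zip() stops at the shorter dataset, so the last window (5, 6), which has
# no next element, is dropped automatically.
pairs_ds = tf.data.Dataset.zip((features, labels))
pairs = [(x.tolist(), int(y)) for x, y in pairs_ds.as_numpy_iterator()]
```

Keeping inputs and labels in separate pipelines means each side can be transformed independently before they are zipped together.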
Sharding TF Record Files: Tips for Efficiency and No Data Loss
The quintessential way to build large-scale deep learning input pipelines is to shard your input data into files in the 1-100 MB range that can be read sequentially and in parallel. This drastically reduces storage server cost, since you can use rotational drives instead of SSDs while keeping up the performance.
A very common case of this practice is to store TF Records in a Hadoop File System or on bucket-based public cloud solutions like Google Cloud Storage.
Splitting X-Y pair datasets (like images) into multiple files is trivial. With a time series windowing task at hand, it becomes tricky to maintain sequential integrity and avoid data loss.
Let’s walk through a simple example.
File 1
...
1, 2
1, 3
File 2
1, 4
1, 5
...
Sure, you can glob the files, load them up and push them straight to training. But that raises several issues.
Firstly, for maximum performance, the read order between 2+ shards is non-deterministic. This means that your data will arrive out of order, much like in position-invariant parallel algorithms such as map-reduce.
You will end up windowing together line 1 from file 1 and line 2 from file 2, while another worker windows together line 2 from file 1 and line 1 from file 2.
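One way to keep rows from different files out of the same window is to window each shard on its own before the shards are mixed. The sketch below assumes this approach and uses a small in-memory tensor as a hypothetical stand-in for two shard files; with real files, each inner dataset would start from something like tf.data.TFRecordDataset.

```python
import tensorflow as tf

WINDOW = 2

# Hypothetical stand-in for two shard files: each row of this tensor plays
# the role of one file's contents, in sequence order.
shards = tf.constant([[1, 2, 3], [4, 5, 6]])


def windows_of_one_shard(values):
    # Window a single shard, so a window never mixes rows from two shards.
    ds = tf.data.Dataset.from_tensor_slices(values)
    ds = ds.window(WINDOW, shift=1, drop_remainder=True)
    return ds.flat_map(lambda w: w.batch(WINDOW, drop_remainder=True))


ds = tf.data.Dataset.from_tensor_slices(shards).interleave(
    windows_of_one_shard,
    cycle_length=2,
    num_parallel_calls=2,
    deterministic=False,  # shards may be consumed in any order
)

# The order of windows across shards is not guaranteed, so sort for display.
windows = sorted(w.tolist() for w in ds.as_numpy_iterator())
print(windows)
```

Every window is internally in order, but no window spans the boundary between the two shards: (3, 4) never appears.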
Once each file is windowed on its own, the out-of-order part is fixed. But there’s another problem that needs careful inspection: missing cross-file data. If file 2 is the sequence continuation of file 1, you need to make sure that your look-back window includes a link between those 2 files. Otherwise, you lose data points.
This means that while tf.data.Dataset is parsing the sharded files, the final dataset generated for your model to train on needs to include the following pairs (let’s demonstrate this for a window of 2, with (a) and (b) marking the file each row comes from):
(a) 1, 2 and (a) 1, 3 -> (b) 1, 4
(a) 1, 3 and (b) 1, 4 -> (b) 1, 5
Sure, you can drop those few data points. But keep in mind that the bigger your window size is, the more data points you’re going to lose: every file boundary costs as many training pairs as your window size. Depending on the nature and the size of your data, make the correct decision.
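To get a feel for how much is lost, the count can be worked out with plain arithmetic. The helper names and file sizes below are hypothetical, and the sketch assumes each training pair is a look-back window followed by the next element as the label.

```python
def num_pairs(length, window):
    """Number of (window -> next element) pairs in a sequence of `length` rows."""
    return max(length - window, 0)


def lost_pairs(file_lengths, window):
    """Pairs lost when every file is windowed on its own."""
    total = num_pairs(sum(file_lengths), window)
    per_file = sum(num_pairs(n, window) for n in file_lengths)
    return total - per_file


# Two files of 3 rows each, window of 2: the two cross-file pairs are lost.
small = lost_pairs([3, 3], window=2)

# Ten files of 1000 rows each, window of 50: 9 boundaries * 50 pairs each.
large = lost_pairs([1000] * 10, window=50)

print(small, large)
```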
If you enforce a purely sequential read, like the one we demonstrated in the Windowing Unlabelled Data by Looking Ahead section above, you’re going to enjoy no benefit from the parallel read speedup.
The simple and effective solution is to make multiple datasets and sequentially concatenate them two at a time, so that each file is joined with the one that continues it.
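A possible shape for this is sketched below; it is an interpretation, not the original code. As assumptions of this example, shards are in-memory stand-ins for files, each shard borrows only the first WINDOW rows of its successor (so boundary windows are produced exactly once), and every shard has at least WINDOW rows. The names windowed_pairs and joined_pairs are hypothetical.

```python
import tensorflow as tf

WINDOW = 2  # look-back size; one extra element is read as the label

# Hypothetical stand-in for three shard files, in sequence order. With real
# files, each from_tensor_slices call below would be a TFRecordDataset.
shards = tf.constant([[1, 2, 3], [4, 5, 6], [7, 8, 9]])


def windowed_pairs(ds):
    size = WINDOW + 1
    ds = ds.window(size, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda w: w.batch(size, drop_remainder=True))
    return ds.map(lambda w: (w[:-1], w[-1]))


def joined_pairs(current, following):
    # Concatenate a shard with the first WINDOW rows of the shard that
    # continues it, then window the result.
    head = tf.data.Dataset.from_tensor_slices(following).take(WINDOW)
    joined = tf.data.Dataset.from_tensor_slices(current).concatenate(head)
    return windowed_pairs(joined)


# Consecutive shard pairs (0, 1), (1, 2) can be processed in parallel.
linked = tf.data.Dataset.from_tensor_slices((shards[:-1], shards[1:])).interleave(
    joined_pairs,
    cycle_length=2,
    num_parallel_calls=2,
    deterministic=False,
)

# The last shard has no successor, so it is windowed on its own.
tail = windowed_pairs(tf.data.Dataset.from_tensor_slices(shards[-1]))

all_pairs = linked.concatenate(tail)
pairs = sorted((x.tolist(), int(y)) for x, y in all_pairs.as_numpy_iterator())
print(pairs)
```

The pairs that cross a shard boundary, such as (2, 3) -> 4 and (3, 4) -> 5, are now present, while the shards are still read in parallel.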
There we go! Parallel, sharded file reading for superb performance and elegant model training.