Source: Deep Learning on Medium
Replication woes: Hurdles for reproducibility in deep learning
- Ask a question
- Do some background research
- Construct a hypothesis
- Test hypothesis with an experiment
- Analyze data, draw results
- Compare and communicate results with respect to hypothesis
These are the primary steps of the scientific method. There is, however, an additional step in this process that is just as important to the whole as any other, if not more so: replication. This is particularly important for fields in periods of rapid growth, like deep learning. In this post I will relate my recent experience attempting to replicate results from a recent arXiv paper describing an end-to-end speech-to-text network, Jasper. But first, an introduction.
I am currently a PhD student at the University of Connecticut in the Speech, Language, and Hearing Sciences department, where I study how listeners accommodate variation in the speech signal. I approach this from a Bayesian perspective and have always been keen on studying how deep learning networks tackle the rampant variation in speech. My research assumes that, while there is a lot of variation in the speech signal, this variation is structured in a manner that listeners can leverage to achieve fluent perception. I believe this same assumption holds for deep learning networks.
I had recently come across an arXiv paper out of NYU and NVIDIA that introduces a fully convolutional end-to-end speech-to-text network named Jasper. In the paper, the authors introduced a network based on the fully convolutional architecture of Wav2letter, which they extended and made deeper through the inclusion of residual connections. They tested several network configurations, varying residual block depth, activation function, and optimizer. These networks were trained on an NVIDIA DGX-1 (this detail will matter later) across the LibriSpeech, WSJ, and Hub5 Year 2000 corpora. Total network depth varied from 19 to 54 layers. The best-performing networks achieved a word error rate (WER) of 3.64% on the LibriSpeech dev-clean set.
Of particular interest to me, the Jasper architecture uses a connectionist temporal classification (CTC) loss function. The original paper implementing this loss function did so in a network composed of recurrent layers. CTC allows for the labeling of unsegmented sequence data by assigning, at each time step, probabilities to the possible output characters (plus a blank symbol). When working with speech, these characters typically represent the alphabet of the language in question. This allows one to provide the network with sequential power spectrum data as input and get character-level text as output. For speech scientists, this loss function is particularly interesting because one can visualize which aspects of the power spectrum the network associates with each character.
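As a toy illustration of how CTC's per-frame predictions collapse into a character string, here is the greedy decoding rule (merge repeated labels, then drop blanks) in plain Python. The label IDs and alphabet are hypothetical, just for the sketch:

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse a per-frame argmax sequence into a label sequence:
    merge consecutive repeats, then drop blanks (CTC's decoding rule)."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

# Frames predicting "c", "c", blank, "a", "a", "t" collapse to "cat".
alphabet = {1: "c", 2: "a", 3: "t"}
decoded = ctc_greedy_decode([1, 1, 0, 2, 2, 3])
print("".join(alphabet[i] for i in decoded))  # → "cat"
```

The blank symbol is what lets the network emit doubled letters (as in "hello"): a blank frame between two identical labels keeps them from being merged.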
The initial goal of my project was to replicate the results of the Jasper arXiv paper by re-implementing a subset of the networks in Intel-optimized TensorFlow, using the included Keras API, on an Intel NUC. After the replication, my plan was to extend this work by changing the input and output of the network. Specifically, I would switch the input from the power spectrum of a narrow-band spectrogram to that of a wide-band spectrogram, and the output from characters of the English alphabet to those of the International Phonetic Alphabet (IPA).
The change to wide-band spectrogram input would provide the network with information about the resonant harmonics of the vocal tract, or formants, which are more informative as to the speech sound being produced than the natural harmonics shown by narrow-band spectrograms. Below is an image showing both a wide-band and a narrow-band spectrogram of the same sentence.
That said, there are still many other adjustable parameters in spectrogram creation that would be worth exploring, in particular bringing Praat's dynamic range setting into the typical Python pipeline.
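The wide-band versus narrow-band distinction comes down to the length of the analysis window: a short window trades frequency resolution for time resolution, which is what makes formants visible. A minimal sketch with SciPy (the window lengths are illustrative, and a pure tone stands in for speech):

```python
import numpy as np
from scipy import signal

fs = 16000
t = np.arange(fs) / fs               # 1 second of audio
x = np.sin(2 * np.pi * 440 * t)      # stand-in for a speech signal

# Window length sets the trade-off: a short window gives a wide-band
# spectrogram (coarse frequency, fine time), a long window gives a
# narrow-band one (fine frequency, coarse time).
wide_win = int(0.005 * fs)    # ~5 ms
narrow_win = int(0.030 * fs)  # ~30 ms

f_wide, _, _ = signal.spectrogram(x, fs, nperseg=wide_win)
f_narrow, _, _ = signal.spectrogram(x, fs, nperseg=narrow_win)

# Narrow-band analysis yields many more frequency bins.
print(len(f_wide), len(f_narrow))  # → 41 241
```

With real speech, the 5 ms window smears individual harmonics together into formant bands, while the 30 ms window resolves each harmonic separately.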
The change in the output of the network, from the English alphabet to IPA, would significantly increase the specificity of the model because each IPA character corresponds to a specific speech sound. There are even modifiers for each character that account for slight variations in production, increasing specificity even further.
However, before I could even get to extending the Jasper network, I needed to first re-implement it. Unfortunately, this has proven rather difficult…
Rapidly changing fields need a way to disseminate information in a rapid manner. This rapidity is not available in the current model for academic research, which can often take years to get published. Thus, pre-print publications, like those on arXiv, have become the preferred way to disseminate information on breakthroughs and marginal advances in many tech-oriented fields. However, because these publications are often not peer-reviewed, it is even more critical that enough information is provided to allow for replication. This includes information that researchers may not feel to be particularly important, like training time.
Recall that the first goal of my project was to replicate the findings of the Jasper arXiv paper by re-implementing a subset of the presented networks in the Keras API of an Intel-optimized version of TensorFlow. To start, I went about building the smallest presented network and training it on the smallest LibriSpeech training corpus, consisting of 100 hours of clean speech. The architecture consisted of 19 total layers, with 5 residual blocks that were each 3 layers deep. When the NUC itself was unable to begin training due to memory errors, I turned to Intel DevCloud. This service provides TensorFlow users with Intel Xeon Gold 6128 scalable processors and comprises compute and interactive nodes. The interactive nodes can be useful for debugging, but have limited compute compared to the compute nodes.
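To give a sense of the structure being re-implemented, here is a rough sketch of one residual block in the spirit of Jasper, written against the Keras functional API. This is my reading of the paper's description, not the authors' code; the filter count, kernel size, and dropout rate are placeholders:

```python
import tensorflow as tf
from tensorflow.keras import layers

def jasper_block(x, filters, kernel_size, n_sub=3, dropout=0.2):
    """One residual block: a stack of 1-D conv -> batch norm -> ReLU ->
    dropout sub-blocks, with a 1x1-projected skip connection added in
    before the block's final activation."""
    skip = layers.Conv1D(filters, 1, padding="same")(x)
    skip = layers.BatchNormalization()(skip)
    for i in range(n_sub):
        x = layers.Conv1D(filters, kernel_size, padding="same")(x)
        x = layers.BatchNormalization()(x)
        if i < n_sub - 1:  # last sub-block activates after the add
            x = layers.ReLU()(x)
            x = layers.Dropout(dropout)(x)
    x = layers.Add()([x, skip])
    x = layers.ReLU()(x)
    return layers.Dropout(dropout)(x)

# Toy input: (batch, time, spectrogram bins); time length left variable.
inp = layers.Input(shape=(None, 64))
out = jasper_block(inp, filters=256, kernel_size=11)
model = tf.keras.Model(inp, out)
```

Stacking five such blocks, plus the prologue and epilogue convolutions, gives the 19-layer configuration described above.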
The code that I was able to run on my initial foray used Intel-optimized TensorFlow 1.14 through Intel-optimized Python 3.6 with Intel MKL on an interactive node. On the 100-hour LibriSpeech training corpus, my training time was approximately 30 hours per epoch. With a target of 30 epochs, this would take approximately 38 days of continuous training. And that excludes the other LibriSpeech corpora, which sum to about 1,000 hours of data. Needless to say, some further optimization needs to occur before I make any progress with the replication.
In a field where one has to train, debug, and test a model, it is important to provide an approximate duration of the time taken up by training alone. This is especially important when it can be taken in context with the specifications of the machine on which the network was trained. The arXiv paper that I am attempting to replicate and eventually extend does not indicate how long each epoch of training took, let alone how long training took in total. This crucial bit of information is left out of most papers on these pre-print forums within the field.
Training time and machine specifications relate to replication in a major way. Not many people have access to state-of-the-art GPUs and CPUs, which severely limits what findings can be replicated and by whom. This all culminates in advances that may be narrower in applicability than originally thought, particularly because people generally don't want to read about things that did not work.
Anyway, this training time of 30 hours per epoch is far from optimized, and I am currently working on incorporating the Horovod package, which should allow the network to take advantage of multiple CPU cores across multiple compute nodes. This should yield a much more manageable training time. Once I accomplish this, I will publish a much more in-depth article on the replication and any extensions I am able to achieve.
In the meantime, you can follow the GitHub repo for development progress.
See you all next time.