Unsupervised Pre-training for Speech Recognition (wav2vec)


Deep learning models have broken many state-of-the-art records across fields including computer vision (CV), natural language processing (NLP), and automatic speech recognition (ASR). In CV, we can apply pre-trained models such as R-CNN and YOLO to a target-domain problem. In NLP, we can likewise leverage pre-trained models such as BERT and XLNet. In ASR, we now have a pre-trained model that converts audio input to vectors: wav2vec.

Learning Feature Representation

In the previous stories, we went through classic methods and Speech2vec for learning vector representations of audio input. Those approaches learn vectors from scratch on target-domain data, and the research demonstrated good results on the target domain. The limitation, however, is that they cannot be applied when the target-domain dataset is small. Therefore, Schneider et al. proposed wav2vec to convert audio to features.

Similar to word2vec in NLP, wav2vec is pre-trained on a large general corpus (Librispeech), with the training objective of predicting samples k steps into the future from the current context via a contrastive loss. This is conceptually similar to the CBOW (continuous bag of words) objective of predicting a target from its context. One advantage is that you can simply load the pre-trained model and extract features.

Training objective (Schneider et al., 2019)
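To make the objective concrete, below is a minimal PyTorch sketch of a step-k contrastive loss in the spirit of the paper: each context vector is mapped by a step-specific affine transformation h_k and scored against the true latent k steps ahead (positive) as well as latents drawn from other time steps (distractors). The function name, the uniform negative sampling, and the equal weighting of the distractor term are illustrative assumptions, not the fairseq implementation.

import torch
import torch.nn.functional as F

def contrastive_loss_at_step(z, c, step_proj, k, n_negatives=10):
    # z, c: (batch, dim, time); step_proj: step-specific affine map h_k
    batch, dim, time = z.shape
    steps = time - k
    # Predict the latent k steps ahead from each context vector
    preds = step_proj(c[..., :steps].transpose(1, 2))    # (batch, steps, dim)
    targets = z[..., k:].transpose(1, 2)                 # (batch, steps, dim)
    pos = F.logsigmoid((preds * targets).sum(-1))        # true-future term
    # Distractors: latents sampled uniformly over time steps
    neg_idx = torch.randint(0, time, (batch, steps, n_negatives))
    negs = z.transpose(1, 2)[torch.arange(batch)[:, None, None], neg_idx]
    neg = F.logsigmoid(-(preds.unsqueeze(2) * negs).sum(-1)).sum(-1)
    return -(pos + neg).mean()

# Toy usage with random tensors standing in for model outputs
z = torch.randn(2, 512, 100)   # encoder latents (batch, dim, time)
c = torch.randn(2, 512, 100)   # context vectors (batch, dim, time)
h_k = torch.nn.Linear(512, 512)
loss = contrastive_loss_at_step(z, c, h_k, k=3)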


Downstream Task

One possible downstream task for the pre-trained model is automatic speech recognition. Schneider et al. decoded with a 4-gram language model (LM), a word-level convolutional neural network (CNN) LM, and a character-level CNN LM, and compared performance against baseline models.

As the following figure shows, wav2vec achieves a lower letter error rate (LER) and word error rate (WER) than the baselines.

Evaluation wav2vec performance (Schneider et al., 2019)
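For reference, both metrics are edit-distance based: WER is the minimum number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words, and LER is the same computation over characters. A minimal sketch (word_error_rate is an illustrative name):

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("see the cat", "see that cat"))  # 1 substitution / 3 words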

Code Sample

Facebook has already open-sourced code (as part of fairseq) demonstrating how to use the pre-trained model and how to train a wav2vec model on your own data.

Once you have downloaded the model file, you only need to feed in the audio input (as a tensor).

import torch
import librosa

from fairseq.models.wav2vec import Wav2VecModel

# Load the checkpoint and rebuild the pre-trained model
cp = torch.load('/path/to/wav2vec.pt')
model = Wav2VecModel.build_model(cp['args'], task=None)
model.load_state_dict(cp['model'])
model.eval()

# librosa.load returns (waveform, sample_rate); wav2vec expects 16 kHz audio
wav_input, sample_rate = librosa.load('/path/to/audio.wav', sr=16000)

# Add a batch dimension: the model expects input of shape (batch, time)
tensors = torch.from_numpy(wav_input).unsqueeze(0)

# z: encoder latents, c: context representations
with torch.no_grad():
    z = model.feature_extractor(tensors)
    c = model.feature_aggregator(z)
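In this snippet, z is the output of the convolutional feature encoder and c is the output of the context (aggregation) network. In the paper, c is the representation fed to the downstream acoustic model in place of log-mel filterbank features, so it is usually the one you want for ASR.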

Like to learn?

I am a data scientist in the Bay Area, focusing on the state of the art in data science, especially NLP, data augmentation, and ML platforms. Feel free to connect with me on LinkedIn or GitHub.


Reference

Schneider, S., Baevski, A., Collobert, R., & Auli, M. (2019). wav2vec: Unsupervised Pre-training for Speech Recognition. arXiv:1904.05862.
