Source: Deep Learning on Medium
Deep learning models have broken many state-of-the-art records in fields including computer vision (CV), natural language processing (NLP) and automatic speech recognition (ASR). In CV, we can apply a pre-trained R-CNN or YOLO model to a target-domain problem. In NLP, we can likewise leverage pre-trained models such as BERT and XLNet. In ASR, we now have a pre-trained model that converts audio input to vectors.
Learning Feature Representation
In the previous stories, we went through classic methods and Speech2Vec for learning vector representations of audio inputs. Those approaches learn vectors from scratch on target-domain data, and the research demonstrated good results on the target domain. However, the limitation is that they cannot be applied when the target-domain dataset is small. Therefore, Schneider et al. proposed wav2vec to convert audio to features.
Similar to word2vec in NLP, wav2vec is pre-trained on a general corpus (Librispeech), with the training objective of predicting k steps into the future from the current context. It is quite similar to the CBOW (continuous bag of words) architecture. One of the advantages is that you can simply load the pre-trained model and retrieve features right away.
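To make the objective concrete, here is a toy sketch of future-step prediction with a contrastive loss: for each time step, a context vector scores the true latent k steps ahead against randomly sampled distractors. The function name, the sampling scheme, and the plain sigmoid scoring are illustrative assumptions, not the fairseq implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def contrastive_future_loss(c, z, k=1, n_neg=5):
    """Toy future-prediction objective (hypothetical helper): score the
    true future latent z[t + k] against n_neg random distractors."""
    T, _ = c.shape
    losses = []
    for t in range(T - k):
        pos = 1.0 / (1.0 + np.exp(-(c[t] @ z[t + k])))  # sigmoid score of the true future
        neg_idx = rng.integers(0, T, size=n_neg)         # distractor latents from other steps
        neg = 1.0 / (1.0 + np.exp(-(z[neg_idx] @ c[t])))
        # push the positive score up, the distractor scores down
        losses.append(-np.log(pos + 1e-9) - np.log(1 - neg + 1e-9).sum())
    return float(np.mean(losses))

# random stand-ins for latent (z) and context (c) sequences: 20 steps, dim 8
z = rng.normal(size=(20, 8))
c = rng.normal(size=(20, 8))
print(contrastive_future_loss(c, z, k=2))
```

Minimizing this loss pushes the context network to encode whatever in the past is predictive of the near future, which is what makes the learned features useful downstream.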
One possible downstream task for the pre-trained model is automatic speech recognition. Schneider et al. evaluated with a 4-gram language model (LM), a word-level convolutional neural network (CNN) LM and a character-level CNN LM, comparing against baseline models.
You can notice from the following figure that wav2vec achieves a lower letter error rate (LER) and word error rate (WER).
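Both metrics are edit distances normalized by the reference length: WER counts word-level substitutions, insertions and deletions, while LER does the same at the character level. A minimal sketch of WER (the function name is my own):

```python
def word_error_rate(reference, hypothesis):
    """WER: Levenshtein distance between the reference and hypothesis
    word sequences, divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat",
                      "the cat sat on a mat"))  # 1 edit / 6 words ≈ 0.167
```

Running the same computation over characters instead of words gives the LER reported in the paper.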
Facebook has already open-sourced code demonstrating how to use the pre-trained model and how to train a wav2vec model on custom data.
Once you have downloaded the model file, you only need to feed in the audio input (in tensor format).
import torch, librosa
from fairseq.models.wav2vec import Wav2VecModel

# load the checkpoint, rebuild the model and restore the trained weights
cp = torch.load('/path/to/wav2vec.pt')
model = Wav2VecModel.build_model(cp['args'], task=None)
model.load_state_dict(cp['model'])
model.eval()

# librosa.load returns (samples, sample_rate); wav2vec expects 16 kHz audio
wav_input, sr = librosa.load(wave_file_path, sr=16000)
tensor = torch.from_numpy(wav_input).unsqueeze(0)  # add a batch dimension

with torch.no_grad():
    z = model.feature_extractor(tensor)   # local latent representations
    c = model.feature_aggregator(z)       # context representations
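The aggregator output `c` has one feature vector per frame, so its length varies with the audio duration. One simple way to feed it to a fixed-size downstream model is mean pooling over time; this is an illustrative choice of mine, not a step from the paper, shown here on a random stand-in array:

```python
import numpy as np

def utterance_embedding(c):
    """Mean-pool context features of shape (channels, frames) into one
    fixed-size utterance vector (illustrative pooling choice)."""
    return c.mean(axis=-1)

# pretend this is the aggregator output: 512 channels over 98 frames
c = np.random.rand(512, 98)
emb = utterance_embedding(c)
print(emb.shape)  # (512,)
```

Any downstream classifier can then consume `emb` regardless of how long the original recording was.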