Wakeup Word Classification


DATA SET DESCRIPTION

WAKE WORD AUDIO DATA SET SOURCE: Here.

A huge shout out to Alireza Kenarsari for providing the dataset of different Wake Words on GitHub. The dataset consists of 6 directories and each directory has audio files containing that particular wake word.

WAKE WORD CLASSES:

The dataset consists of the following 6 classes:

0: Alexa
1: Computer
2: Jarvis
3: Smart Mirror
4: Snowboy
5: View Glass

The entire dataset comprises 2293 audio files (.wav and .flac).

Phase 1: Data Reading and Preprocessing

IMPORTING ALL THE LIBRARIES

LibROSA is the Python package I used for music and audio analysis. It provides the building blocks necessary to create music information retrieval systems.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import librosa
import librosa.display
import os
import warnings
import IPython.display as ipd
warnings.filterwarnings('ignore')

READING THE DATA

folder = os.listdir('wake-word-benchmark/audio')
y = []      # class label (directory name) for each file
temp = []   # (signal, sample_rate) tuples
data = []   # raw audio signals
for i in folder:
    file = os.listdir('wake-word-benchmark/audio/' + i)
    for j in file:
        y.append(i)
        x, sr = librosa.load('wake-word-benchmark/audio/' + i + '/' + j)
        data.append(x)
        temp.append((x, sr))
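The IPython.display import above is handy for listening to the loaded clips inside a notebook. A minimal sketch, assuming librosa's default 22050 Hz resampling rate:

# Play back the first loaded clip in a Jupyter notebook
# (22050 Hz is librosa's default rate, assumed here).
ipd.Audio(data=data[0], rate=22050)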

SAMPLE WAVEFORM OF EACH WAKE WORD CLASS
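Waveform plots like these can be generated with librosa's display utilities. The snippet below is a minimal sketch, assuming the data and y lists built above, librosa's default 22050 Hz sample rate, and the waveplot function of that era's librosa (newer releases renamed it waveshow):

# Sketch: plot the first waveform found for each wake word class.
classes = sorted(set(y))
plt.figure(figsize=(10, 12))
for i, cls in enumerate(classes):
    idx = y.index(cls)                  # first file belonging to this class
    plt.subplot(len(classes), 1, i + 1)
    librosa.display.waveplot(data[idx], sr=22050)
    plt.title(cls)
plt.tight_layout()
plt.show()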

In order to feed the data to any learning algorithm, we first need to extract meaningful features from the given audio signals. Short-Term Fourier Transform (STFT) is one such important feature that can be extracted from audio signals.

So, what is the Short-Time Fourier Transform (STFT)?

The short-time Fourier transform (STFT) is a method of linear time-frequency analysis that provides a time-localized spectrum by applying the Fourier transform inside a short sliding window. In other words, the STFT represents a signal in the time-frequency domain by computing discrete Fourier transforms (DFT) over short, overlapping windows.

Converting input signals to STFT features

Xstft = []
for i in range(len(data)):
    # average the STFT over its time frames and take the magnitude, so every
    # clip, whatever its duration, becomes a fixed-length 1025-dim vector
    Xstft.append(np.abs(librosa.stft(data[i]).mean(axis=1).T))
Xstft = np.array(Xstft)
Xstft.shape
## OUTPUT: (2293, 1025)

SAMPLE STFT PLOT OF EACH WAKE WORD CLASS
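These spectrogram plots can likewise be sketched with librosa.display.specshow, reusing the classes list from the waveform sketch above:

# Sketch: log-magnitude spectrogram (STFT) of one file per class.
plt.figure(figsize=(10, 12))
for i, cls in enumerate(classes):
    idx = y.index(cls)
    D = librosa.amplitude_to_db(np.abs(librosa.stft(data[idx])), ref=np.max)
    plt.subplot(len(classes), 1, i + 1)
    librosa.display.specshow(D, sr=22050, x_axis='time', y_axis='hz')
    plt.title(cls)
    plt.colorbar(format='%+2.0f dB')
plt.tight_layout()
plt.show()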

Now, we have to convert our target variable, i.e. the class name, to a one-hot encoding.

What is One Hot Encoding?

One-hot encoding is a process by which categorical variables are converted into a binary vector form that can be fed to ML algorithms, helping them do a better job in prediction.

One Hot Encoding the target variable

Y = pd.get_dummies(y)
Y.head()
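The train/validation split itself is not shown in the post, but the models below rely on X_train, X_test, y_train and y_test. A minimal sketch with scikit-learn's train_test_split follows; test_size=0.4 and random_state=0 are assumptions (0.4 roughly matches the 1375 training / 918 validation samples in the Keras log further down). The Keras model is trained on the one-hot targets Y, while the scikit-learn classifiers later in the post need plain class labels, so a parallel split on y would be used for them.

from sklearn.model_selection import train_test_split

# Assumed split (not shown in the original listing); test_size=0.4 roughly
# matches the 1375/918 train/validation counts in the Keras log below.
X_train, X_test, y_train, y_test = train_test_split(
    Xstft, Y, test_size=0.4, random_state=0)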

Once the preprocessing is done, it's time to get to the exciting part: applying different machine/deep learning algorithms and observing their performance on the preprocessed dataset.

Phase 2: Trying out different learning algorithms

My Neural Network Architecture

1.Neural Network

A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates. Neural networks can adapt to changing input; so the network generates the best possible result without needing to redesign the output criteria.

I used the Keras library in Python to create the artificial neural network. Keras is a high-level API capable of running on top of lower-level backends such as TensorFlow, CNTK, or Theano.

import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD
from keras import regularizers


model = Sequential()
model.add(Dense(256, activation='relu', input_dim=1025))  # input: 1025-dim STFT feature vector
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dense(6, activation='softmax'))                 # output: 6 wake word classes

# Note: categorical_crossentropy is the conventional loss for one-hot
# multi-class targets; binary_crossentropy, used here, treats each of the
# 6 outputs independently, so the reported 'acc' is binary accuracy
# averaged over the outputs.
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])


history = model.fit(X_train, y_train, epochs=25, batch_size=128, verbose=1,
                    validation_data=(X_test, y_test), shuffle=True)


score = model.evaluate(X_test, y_test, batch_size=128)
## OUTPUT
Train on 1375 samples, validate on 918 samples
Epoch 1/25
1375/1375 [==============================] - 1s 616us/step - loss: 0.4468 - acc: 0.8333 - val_loss: 0.4372 - val_acc: 0.8333
Epoch 2/25
1375/1375 [==============================] - 0s 49us/step - loss: 0.4314 - acc: 0.8333 - val_loss: 0.4102 - val_acc: 0.8333
Epoch 3/25
1375/1375 [==============================] - 0s 53us/step - loss: 0.4079 - acc: 0.8353 - val_loss: 0.3739 - val_acc: 0.8404
Epoch 4/25
1375/1375 [==============================] - 0s 49us/step - loss: 0.3812 - acc: 0.8407 - val_loss: 0.3429 - val_acc: 0.8551
Epoch 5/25
1375/1375 [==============================] - 0s 54us/step - loss: 0.3632 - acc: 0.8487 - val_loss: 0.3216 - val_acc: 0.8676
Epoch 6/25
1375/1375 [==============================] - 0s 54us/step - loss: 0.3414 - acc: 0.8520 - val_loss: 0.3149 - val_acc: 0.8680
Epoch 7/25
1375/1375 [==============================] - 0s 52us/step - loss: 0.3253 - acc: 0.8581 - val_loss: 0.3003 - val_acc: 0.8787
Epoch 8/25
1375/1375 [==============================] - 0s 54us/step - loss: 0.3158 - acc: 0.8657 - val_loss: 0.2862 - val_acc: 0.8834
Epoch 9/25
1375/1375 [==============================] - 0s 52us/step - loss: 0.3033 - acc: 0.8715 - val_loss: 0.2796 - val_acc: 0.8865
Epoch 10/25
1375/1375 [==============================] - 0s 51us/step - loss: 0.2887 - acc: 0.8768 - val_loss: 0.2647 - val_acc: 0.8932
Epoch 11/25
1375/1375 [==============================] - 0s 54us/step - loss: 0.2871 - acc: 0.8789 - val_loss: 0.2614 - val_acc: 0.8960
Epoch 12/25
1375/1375 [==============================] - 0s 52us/step - loss: 0.2690 - acc: 0.8874 - val_loss: 0.2512 - val_acc: 0.8985
Epoch 13/25
1375/1375 [==============================] - 0s 52us/step - loss: 0.2572 - acc: 0.8910 - val_loss: 0.2446 - val_acc: 0.9018
Epoch 14/25
1375/1375 [==============================] - 0s 53us/step - loss: 0.2485 - acc: 0.8947 - val_loss: 0.2369 - val_acc: 0.9074
Epoch 15/25
1375/1375 [==============================] - 0s 52us/step - loss: 0.2364 - acc: 0.8985 - val_loss: 0.2314 - val_acc: 0.9098
Epoch 16/25
1375/1375 [==============================] - 0s 50us/step - loss: 0.2229 - acc: 0.9072 - val_loss: 0.2242 - val_acc: 0.9101
Epoch 17/25
1375/1375 [==============================] - 0s 54us/step - loss: 0.2208 - acc: 0.9073 - val_loss: 0.2229 - val_acc: 0.9138
Epoch 18/25
1375/1375 [==============================] - 0s 57us/step - loss: 0.2135 - acc: 0.9131 - val_loss: 0.2182 - val_acc: 0.9138
Epoch 19/25
1375/1375 [==============================] - 0s 50us/step - loss: 0.2049 - acc: 0.9170 - val_loss: 0.2160 - val_acc: 0.9172
Epoch 20/25
1375/1375 [==============================] - 0s 50us/step - loss: 0.1982 - acc: 0.9185 - val_loss: 0.2131 - val_acc: 0.9167
Epoch 21/25
1375/1375 [==============================] - 0s 51us/step - loss: 0.1923 - acc: 0.9196 - val_loss: 0.2077 - val_acc: 0.9185
Epoch 22/25
1375/1375 [==============================] - 0s 51us/step - loss: 0.1781 - acc: 0.9273 - val_loss: 0.2015 - val_acc: 0.9216
Epoch 23/25
1375/1375 [==============================] - 0s 49us/step - loss: 0.1756 - acc: 0.9304 - val_loss: 0.2027 - val_acc: 0.9201
Epoch 24/25
1375/1375 [==============================] - 0s 50us/step - loss: 0.1672 - acc: 0.9331 - val_loss: 0.2055 - val_acc: 0.9216
Epoch 25/25
1375/1375 [==============================] - 0s 50us/step - loss: 0.1619 - acc: 0.9354 - val_loss: 0.2068 - val_acc: 0.9203
918/918 [==============================] - 0s 15us/step

After training for 25 epochs, the model achieved a validation accuracy of 92.03%. The change in model accuracy and loss over the epochs can be observed in the visualization below:
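Curves like these can be drawn from the History object returned by model.fit(). A minimal sketch follows; the 'acc'/'val_acc' key names match the training log above, though newer Keras versions use 'accuracy'/'val_accuracy' instead.

# Sketch: training curves from the Keras History object.
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['acc'], label='train')
plt.plot(history.history['val_acc'], label='validation')
plt.title('Model accuracy')
plt.xlabel('epoch')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.title('Model loss')
plt.xlabel('epoch')
plt.legend()
plt.show()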

Now experimenting and comparing with other classification models…

2.Gradient Boosting

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. (Wikipedia definition)

from sklearn.ensemble import GradientBoostingClassifier

gb_clf = GradientBoostingClassifier(n_estimators=50, learning_rate=0.5,
                                    max_features=3, max_depth=2, random_state=0)
gb_clf.fit(X_train, y_train)
print("Learning rate: ", 0.5)
print("Accuracy score (training): {0:.3f}".format(gb_clf.score(X_train, y_train)))
print("Accuracy score (validation): {0:.3f}".format(gb_clf.score(X_test, y_test)))
## OUTPUT
Learning rate:  0.5
Accuracy score (training): 0.981
Accuracy score (validation): 0.715

With the learning rate set to 0.5, I got 98.1% accuracy on the training set but only 71.5% on the validation set.

It is clear that this gradient boosting classifier is overfitting. Overfitting refers to a model that fits the training data too closely but does not generalize well to new, unseen data.

Overfitting can be addressed but that’s a discussion for another time.

3.Logistic Regression

Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary (or, in the multinomial case, categorical) variable and one or more nominal, ordinal, interval or ratio-level independent variables. In essence, logistic regression is linear regression passed through a non-linear sigmoid (or softmax, for multiple classes) to turn the output into class probabilities.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial',
                         max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_pred, y_test)
## OUTPUT
0.673202614379085

Using Logistic Regression I was able to get a classification accuracy of around 67.32% on the validation set.

4.CART

Classification and Regression Trees, or CART for short, is a term introduced by Leo Breiman to refer to decision tree algorithms that can be used for classification or regression predictive modeling problems. Classically, the algorithm is referred to as "decision trees", but on some platforms, like R, it goes by the more modern term CART. The CART algorithm provides the foundation for important algorithms like bagged decision trees, random forests and boosted decision trees.

from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier(random_state=0)
decision_tree = decision_tree.fit(X_train, y_train)
y_pred = decision_tree.predict(X_test)
accuracy_score(y_pred, y_test)
## OUTPUT
0.5915032679738562

Using CART, I was able to get a validation accuracy of 59.15%.

5.SVM

SVM, or Support Vector Machine, is a linear model for classification and regression problems. It can solve linear and non-linear problems and works well for many practical tasks. The idea of SVM is simple: the algorithm creates a line or a hyperplane that separates the data into classes. It is a large-margin classifier, i.e. it tries to separate the classes with as large a margin as possible.

from sklearn import svm

clf = svm.LinearSVC()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_pred, y_test)
## OUTPUT
0.6470588235294118

And finally, using Support Vector Machines (SVM), I was able to get a validation accuracy of 64.70%.

Conclusion

As can be observed, the neural network performs much better than the classical machine learning approaches such as Logistic Regression, CART and Gradient Boosting. This is because neural networks are capable of extracting complex relationships in the data and modelling non-linear decision boundaries through their non-linear activation functions.

The entire code can be accessed through my GitHub Repository:

This is one of my first applications of machine learning. Thank you for reading my article, and I hope you've enjoyed it.