Spoken Digit Classification

Classification of isolated spoken digits is at the core of a large number of telephone-based services that rely on speech alone, such as voice dialing, airline reservations, bank transactions, and price quotations. It is a challenging task: the signals last only a short time, and several digits are acoustically very similar to each other.

The objective of this article is to investigate the use of machine learning algorithms for digit classification. The most important step in classifying spoken digits successfully is feature extraction. Raw audio consists of a huge number of individually weak features, and most machine learning algorithms cannot build accurate classifiers from it directly. The choice of feature extraction method therefore matters more than the specific classification paradigm, and the right combination of classifier and features can come close to perfect accuracy.

Dataset

You can download the dataset from here.

Our data consists of recordings of spoken digits, stored as files sampled at 8 kHz. The recordings are trimmed so that they have near-minimal silence at the beginning and end.
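If you have cloned the dataset repository locally, a quick sanity check of the recordings folder looks like the sketch below (the directory name matches the path used later in this article; the printed filename is illustrative):

import os

recordings_dir = 'free-spoken-digit-dataset/recordings'  # path used throughout this article
files = os.listdir(recordings_dir)
print(len(files))        # 2000 recordings in the version used here
print(sorted(files)[0])  # filenames follow the pattern <digit>_<speaker>_<index>.wav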

Importing Packages

import numpy as np
import matplotlib.pyplot as plt
import librosa
import os
import librosa.display
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import GradientBoostingClassifier

Reading Dataset

We will read the dataset using the librosa library and then extract STFT features from the audio. The STFT (Short-Time Fourier Transform) is a time-frequency representation of an audio signal.

Note: it is worth reading more about STFT features; it will help you understand why and how we use the STFT here.
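As a quick, minimal sketch (using a synthetic 440 Hz tone at 8 kHz rather than one of the actual recordings), this is the kind of array the STFT produces, and how collapsing its time axis gives a fixed-length feature vector:

import numpy as np
import librosa

# Hypothetical toy signal: one second of a 440 Hz tone sampled at 8 kHz,
# standing in for a spoken-digit recording.
sr = 8000
t = np.linspace(0, 1, sr, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)

# librosa.stft returns a complex matrix of shape (1 + n_fft/2, n_frames);
# with the default n_fft=2048 that is 1025 frequency bins.
S = librosa.stft(tone)
print(S.shape)                    # e.g. (1025, 16)

# Averaging over the time axis and taking the magnitude (as done later in
# this article) collapses each clip to a single 1025-dimensional vector.
print(abs(S.mean(axis=1)).shape)  # (1025,)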

Code :

# List all recordings and load each one with librosa
file = os.listdir('free-spoken-digit-dataset/recordings')
data = []
for i in file:
    # librosa.load returns the waveform (resampled to librosa's default 22050 Hz) and the sampling rate
    x, sr = librosa.load('free-spoken-digit-dataset/recordings/' + i)
    data.append(x)

This is what the output looks like :

print(data[10])

A plot of frequency vs. time (a spectrogram):

Code :

# for any random audio clip (here data[10])
%matplotlib inline
X = librosa.stft(data[10])
Xdb = librosa.amplitude_to_db(abs(X))
plt.figure(figsize=(14, 5))
librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
plt.colorbar()

Output :

Converting the input signals to STFT features:

Code:

# For each clip, compute the STFT and collapse the time axis, giving one
# 1025-dimensional feature vector per recording
# (note: the complex STFT is averaged over time before taking its magnitude)
X = []
for i in range(len(data)):
    X.append(abs(librosa.stft(data[i]).mean(axis=1).T))
X = np.array(X)

Let’s see the output :

print(X)

One Hot Encoding :

So, what is it? Let's see.

One hot encoding transforms categorical features into a format that works better with classification and regression algorithms.
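For instance (a toy example with made-up labels, not the actual dataset), pd.get_dummies turns a list of class labels into one indicator column per class:

import pandas as pd

# Hypothetical digit labels
labels = ['0', '1', '2', '1']
print(pd.get_dummies(labels))
#    0  1  2
# 0  1  0  0
# 1  0  1  0
# 2  0  0  1
# 3  0  1  0
# (recent pandas versions print True/False instead of 1/0)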

One hot Encoding the target:

# The first character of each filename is the spoken digit, so it serves as the label
y = [i[0] for i in file]
Y = pd.get_dummies(y)  # one column per digit class (0-9)

Output :

print(Y)
Only 20 rows are shown here. There are a total of 2,000 rows and 10 columns.

NOTE: X contains the STFT features of each audio sample, while Y contains the one-hot encoded class labels.

We are now done preprocessing the data. It's time for the next, more interesting step: applying machine learning models and seeing how they perform.

Splitting Dataset :

Split the dataset into training and testing sets (75% training, 25% testing), with the STFT audio features as input and the audio class as the target label.

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)

1. Neural Network :

Code :

model = Sequential()
model.add(Dense(256, activation='tanh', input_dim=1025))  # 1025 STFT bins as input
model.add(Dense(128, activation='tanh'))
model.add(Dense(128, activation='tanh'))
model.add(Dense(10, activation='softmax'))  # one output per digit
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
# Note: with one-hot targets and a softmax output, 'categorical_crossentropy'
# would be the more standard choice; 'binary_crossentropy' is kept here so the
# code matches the training log reported below.
model.compile(loss='binary_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])
history = model.fit(X_train, y_train,
                    epochs=20,
                    batch_size=128,
                    verbose=1,
                    validation_data=(X_test, y_test),
                    shuffle=True)
score = model.evaluate(X_test, y_test, batch_size=128)

Output :

Train on 1200 samples, validate on 800 samples
Epoch 1/20
1200/1200 [==============================] - 1s 818us/step - loss: 0.3214 - acc: 0.9000 - val_loss: 0.3166 - val_acc: 0.9000
Epoch 2/20
1200/1200 [==============================] - 0s 87us/step - loss: 0.3129 - acc: 0.9000 - val_loss: 0.3077 - val_acc: 0.9000
Epoch 3/20
1200/1200 [==============================] - 0s 88us/step - loss: 0.3033 - acc: 0.9000 - val_loss: 0.2993 - val_acc: 0.9000
Epoch 4/20
1200/1200 [==============================] - 0s 95us/step - loss: 0.2944 - acc: 0.9000 - val_loss: 0.2916 - val_acc: 0.9000
Epoch 5/20
1200/1200 [==============================] - 0s 85us/step - loss: 0.2863 - acc: 0.9000 - val_loss: 0.2844 - val_acc: 0.9000
Epoch 6/20
1200/1200 [==============================] - 0s 86us/step - loss: 0.2789 - acc: 0.9006 - val_loss: 0.2779 - val_acc: 0.9002
Epoch 7/20
1200/1200 [==============================] - 0s 84us/step - loss: 0.2722 - acc: 0.9008 - val_loss: 0.2720 - val_acc: 0.9007
Epoch 8/20
1200/1200 [==============================] - 0s 92us/step - loss: 0.2663 - acc: 0.9012 - val_loss: 0.2666 - val_acc: 0.9016
Epoch 9/20
1200/1200 [==============================] - 0s 88us/step - loss: 0.2608 - acc: 0.9022 - val_loss: 0.2618 - val_acc: 0.9026
Epoch 10/20
1200/1200 [==============================] - 0s 86us/step - loss: 0.2559 - acc: 0.9045 - val_loss: 0.2570 - val_acc: 0.9046
Epoch 11/20
1200/1200 [==============================] - 0s 88us/step - loss: 0.2511 - acc: 0.9062 - val_loss: 0.2525 - val_acc: 0.9059
Epoch 12/20
1200/1200 [==============================] - 0s 94us/step - loss: 0.2468 - acc: 0.9071 - val_loss: 0.2485 - val_acc: 0.9061
Epoch 13/20
1200/1200 [==============================] - 0s 88us/step - loss: 0.2427 - acc: 0.9082 - val_loss: 0.2446 - val_acc: 0.9074
Epoch 14/20
1200/1200 [==============================] - 0s 87us/step - loss: 0.2391 - acc: 0.9095 - val_loss: 0.2410 - val_acc: 0.9077
Epoch 15/20
1200/1200 [==============================] - 0s 95us/step - loss: 0.2354 - acc: 0.9106 - val_loss: 0.2377 - val_acc: 0.9085
Epoch 16/20
1200/1200 [==============================] - 0s 89us/step - loss: 0.2320 - acc: 0.9119 - val_loss: 0.2346 - val_acc: 0.9101
Epoch 17/20
1200/1200 [==============================] - 0s 85us/step - loss: 0.2288 - acc: 0.9132 - val_loss: 0.2315 - val_acc: 0.9111
Epoch 18/20
1200/1200 [==============================] - 0s 88us/step - loss: 0.2257 - acc: 0.9139 - val_loss: 0.2285 - val_acc: 0.9115
Epoch 19/20
1200/1200 [==============================] - 0s 92us/step - loss: 0.2228 - acc: 0.9146 - val_loss: 0.2257 - val_acc: 0.9122
Epoch 20/20
1200/1200 [==============================] - 0s 90us/step - loss: 0.2200 - acc: 0.9156 - val_loss: 0.2231 - val_acc: 0.9124
800/800 [==============================] - 0s 30us/step

After 20 epochs we reach a reported validation accuracy of 91.24%. (Because the model is compiled with binary_crossentropy, Keras reports per-output binary accuracy here, which is why the metric already starts around 90% at epoch 1.)

The change in model loss and accuracy over the epochs can be seen in the plots below:

Loss :

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.xlabel('Epoch')
plt.ylabel('loss')

Accuracy :

# 'acc'/'val_acc' are the metric names used by the Keras version in this
# article; newer versions use 'accuracy'/'val_accuracy'
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.xlabel('Epoch')
plt.ylabel('Accuracy')

2. Random Forest

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
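To make the "mode of the classes" idea concrete, here is a small self-contained sketch on synthetic data (generated with make_classification, not the spoken-digit features): each tree in a fitted forest votes for a class, and the forest's prediction is, in effect, the most common vote.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy dataset standing in for the STFT features
X_toy, y_toy = make_classification(n_samples=300, n_features=20, n_classes=3,
                                   n_informative=5, random_state=0)

forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
forest.fit(X_toy, y_toy)

# Collect each individual tree's prediction for the first five samples...
tree_votes = np.array([tree.predict(X_toy[:5]) for tree in forest.estimators_])

# ...and take the most common class per sample (majority vote).
majority = np.array([np.bincount(votes.astype(int)).argmax()
                     for votes in tree_votes.T])
print(majority)
print(forest.predict(X_toy[:5]))  # almost always matches the majority vote

(Strictly speaking, scikit-learn averages the trees' predicted class probabilities rather than counting hard votes, so the two can differ in rare, close cases.)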

Code:

# Convert the string labels back to integers for scikit-learn
y = np.array(list(map(int, y)))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

clf = RandomForestClassifier(n_estimators=100, max_depth=5)
clf = clf.fit(X_train, y_train)
Y_predict = clf.predict(X_test)
accuracy = accuracy_score(y_test, Y_predict)
print(accuracy)

Output :

0.722

The validation accuracy is 72.2%.

3. Gradient Boosting

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion as other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
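The "stage-wise" construction can be observed directly with scikit-learn's staged_predict, which replays the ensemble's predictions after each boosting iteration. Below is a minimal sketch on synthetic data (not the article's features):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy dataset standing in for the STFT features
X_toy, y_toy = make_classification(n_samples=500, n_features=20, n_classes=3,
                                   n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25,
                                          random_state=0)

gb = GradientBoostingClassifier(n_estimators=20, learning_rate=0.5, max_depth=2,
                                random_state=0)
gb.fit(X_tr, y_tr)

# Accuracy after each added stage (tree): the model is built incrementally.
for stage, y_pred in enumerate(gb.staged_predict(X_te), start=1):
    print("after {:2d} stages: test accuracy = {:.3f}".format(
        stage, accuracy_score(y_te, y_pred)))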

Code :

# Gradient boosting on the same train/test split as the random forest
gb_clf = GradientBoostingClassifier(n_estimators=20, learning_rate=0.5, max_features=3,
                                    max_depth=2, random_state=0)
gb_clf.fit(X_train, y_train)
print("Learning rate: ", 0.5)
print("Accuracy score (training): {0:.3f}".format(gb_clf.score(X_train, y_train)))
print("Accuracy score (validation): {0:.3f}".format(gb_clf.score(X_test, y_test)))

Output :

Learning rate: 0.5 
Accuracy score (training): 0.835
Accuracy score (validation): 0.482

With a learning rate of 0.5, we get an accuracy of 83.5% on the training set but only 48.2% on the validation set.

Therefore we can conclude that the gradient boosting classifier is overfitting.

You can read more about overfitting and why it occurs.

In the end, we can conclude that the neural network produces the most accurate results of the three models!

This is my first article on machine learning. Thank you for reading it, and I hope you’ve enjoyed it so far!

You can access the full code from my github repository.