How We Built an Arabic Speech Recognition System Using Kaldi

Kaldi is a speech recognition toolkit, freely available under the Apache License


This was our graduation project, a collaboration between a team from Zewail City (Mohamed Maher, Mohamed ElHefnawy, Omar Hagrass, and Omar Merghany) and RDI.


Arabic is considered one of the more challenging languages for speech recognition because of its large lexical variety and complex morphology. The language can be divided into three types: Classical Arabic, Modern Standard Arabic (MSA), and Colloquial Arabic. Classical Arabic is the standard form of the language found in the Holy Quran. MSA is widely used in newspapers, broadcasts, and formal communication. Colloquial Arabic is the language spoken naturally in everyday life; it has no common standard, since each Arabic country has its own dialects, with further colloquial variation within each country. For this reason, we focused on MSA in our study.

A speech recognizer is a system that automatically transcribes speech into text, drawing on a finite vocabulary that restricts which words can be output. The recognizer needs to segment the speech signal into successive phones and identify the phone corresponding to each segment before converting the phone strings to text.


KALDI Installation

First, make sure that CUDA is installed on your Ubuntu machine.
Then download Kaldi and follow the installation instructions in the tools and src folders.


We used 100 hours of broadcast news in Modern Standard Arabic (MSA), covering different dialects and both male and female speakers, collected between 2005 and 2015.
The dataset was divided into 90 hours for training and validation, split in a 9:1 ratio (81 hours and 9 hours, respectively), and 10 hours for testing.


MFCC Feature Extraction

The features are the standard 13-dimensional Mel-Frequency Cepstral Coefficients (MFCCs) with cepstral mean-variance normalization (CMVN). MFCCs are derived from the fast Fourier transform of the windowed short-time signal and describe its real cepstrum. These features are used because they approximate the behavior of the human auditory system. A monophone model is then built, which models each phone without using the context of either the preceding or the following phone, and serves as a building block for the subsequent triphone models.
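To make the normalization step concrete, here is a minimal sketch of cepstral mean-variance normalization in plain Python. The function name `apply_cmvn` and the toy 2-dimensional frames are illustrative assumptions; in the recipe itself Kaldi computes and applies these statistics on the 13-dimensional MFCCs, and its implementation may differ in detail.

```python
import math

def apply_cmvn(frames, eps=1e-10):
    """Normalize each feature dimension to zero mean and unit variance.

    frames: list of equal-length feature vectors, one per time frame.
    eps: small constant that avoids division by zero for constant dimensions.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    variances = [
        sum((f[d] - means[d]) ** 2 for f in frames) / num_frames
        for d in range(dim)
    ]
    stds = [math.sqrt(v) + eps for v in variances]
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]

# Toy input (hypothetical values): three frames, two coefficients each.
frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = apply_cmvn(frames)
```

After normalization both columns have the same scale, which is the point of CMVN: it reduces differences caused by recording channel and speaker loudness.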

Higher degree features with Acoustic Training and Alignment

  • The acoustic parameters are estimated in the training step. Cycling through training and alignment phases optimizes the process further, so we align the audio to the reference transcript with the current acoustic model before the next training pass.
  • The triphone training algorithms used were delta + delta-delta training and Linear Discriminant Analysis (LDA) + Maximum Likelihood Linear Transform (MLLT).
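The delta + delta-delta step can be sketched as follows. This is a minimal illustration that assumes the common regression formula with a window of 2 frames and edge frames repeated at the boundaries; the function names are hypothetical and Kaldi's own implementation may differ in details. Appending both sets of coefficients turns 13 static MFCCs into 39-dimensional features.

```python
def compute_deltas(frames, window=2):
    """Regression deltas: d[t] = sum_n n*(c[t+n] - c[t-n]) / (2 * sum_n n^2)."""
    num_frames = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(n * n for n in range(1, window + 1))
    deltas = []
    for t in range(num_frames):
        row = []
        for d in range(dim):
            acc = 0.0
            for n in range(1, window + 1):
                # Clamp indices so edge frames reuse the first/last frame.
                nxt = frames[min(t + n, num_frames - 1)][d]
                prv = frames[max(t - n, 0)][d]
                acc += n * (nxt - prv)
            row.append(acc / denom)
        deltas.append(row)
    return deltas

def add_deltas(frames, window=2):
    """Append delta and delta-delta coefficients to each static frame."""
    d1 = compute_deltas(frames, window)
    d2 = compute_deltas(d1, window)
    return [s + a + b for s, a, b in zip(frames, d1, d2)]

# Toy 1-dimensional ramp: away from the edges the slope is 1 per frame.
static = [[float(t)] for t in range(7)]
expanded = add_deltas(static)
```

For the middle frame of the ramp, the delta is 1 (constant slope) and the delta-delta is 0 (no acceleration), which is what the two extra feature sets are meant to capture.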

Language Model and Lexicon

  • N-Gram Language Model and Corpus Used

A trigram language model (LM) was built from a training corpus of MSA broadcast news transcripts totalling 10M words. This corpus contains around 500K unique grapheme words. All words that occurred more than once in the dataset were used to build the ASR grapheme lexicon, which has more than 478K entries with one unique grapheme sequence per word. All text in the corpus and the generated grapheme lexicon was mapped to English letters, as the Kaldi toolkit doesn’t support UTF-8 characters. This was a simple one-to-one mapping.
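As a rough illustration of the two ideas in this paragraph, the sketch below counts trigrams to get an unsmoothed maximum-likelihood estimate and selects the words that occur more than once. The toy corpus and all function names are assumptions made for the example; a real language model toolkit would also apply smoothing and back-off, which are left out here.

```python
from collections import Counter

def train_trigram_counts(sentences):
    """Count trigrams and their two-word histories, with boundary tokens."""
    trigrams, histories = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(tokens)):
            history = (tokens[i - 2], tokens[i - 1])
            trigrams[history + (tokens[i],)] += 1
            histories[history] += 1
    return trigrams, histories

def trigram_prob(trigrams, histories, w1, w2, w3):
    """Unsmoothed maximum-likelihood estimate of P(w3 | w1, w2)."""
    seen = histories[(w1, w2)]
    return trigrams[(w1, w2, w3)] / seen if seen else 0.0

def frequent_words(sentences, min_count=2):
    """Keep words seen at least min_count times (more than once by default)."""
    counts = Counter(w for s in sentences for w in s.split())
    return {w for w, c in counts.items() if c >= min_count}

# Hypothetical toy corpus, already mapped to Latin letters.
corpus = ["qal alrais alyawm", "qal alrais ams", "qal alwazir alyawm"]
trigrams, histories = train_trigram_counts(corpus)
vocab = frequent_words(corpus)
p = trigram_prob(trigrams, histories, "qal", "alrais", "alyawm")
```

Here the history "qal alrais" is seen twice and is followed by "alyawm" once, so the estimate is 0.5; the words "ams" and "alwazir" occur only once and are dropped from the vocabulary.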

  • Lexicon Preparation

Arabic words consist of two categories of characters: the normally written letters such as ا, ب, ق, etc., and the diacritics, which usually aren’t written, such as ُ, ُ, ُ, etc. A grapheme-based lexicon is therefore easier to use, since each entry corresponds to one unique grapheme sequence per word. A phone-based lexicon, on the other hand, can be made by generating the top vowelized candidates for each grapheme representation of a word. It has been shown that each grapheme word has an average of 3 to 4 vowelized representations. However, words can’t be vowelized automatically without knowing the context, so we preferred the grapheme-based lexicon in our study.
We used a set of 45 phonemes in our lexicon: 44 phonemes correspond to speech and one to silence.
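A grapheme lexicon entry is simply the word followed by its letters, so building one after the one-to-one mapping to Latin letters is mechanical. The sketch below assumes a small, hypothetical mapping table covering five letters (the table actually used in the project is not given here) and the usual Kaldi lexicon line layout of a word followed by its space-separated units.

```python
# Hypothetical one-to-one mapping for a handful of letters; the real
# table used in the project is not given in the article.
ARABIC_TO_LATIN = {
    "\u0627": "A",  # alif
    "\u0628": "b",  # ba
    "\u0642": "q",  # qaf
    "\u0644": "l",  # lam
    "\u0645": "m",  # meem
}

def transliterate(word, table=ARABIC_TO_LATIN):
    """Map every character through the table; raises KeyError on unknown ones."""
    return "".join(table[ch] for ch in word)

def lexicon_entry(word):
    """One lexicon line: the word followed by its grapheme sequence."""
    return word + " " + " ".join(word)

# The word qaf-lam-meem, written with escapes to avoid any encoding issues.
latin = transliterate("\u0642\u0644\u0645")
entry = lexicon_entry(latin)
```

Because the mapping is one-to-one, it can be inverted after decoding to recover the Arabic script from the recognizer's Latin-letter output.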

Neural Nets to Estimate Phone State Probabilities

After acoustic training and alignment, the resulting features are passed through different neural network architectures: a Deep Neural Network (DNN), a Time Delay Neural Network (TDNN), and hybrid architectures combining TDNN with Long Short-Term Memory (LSTM) recurrent neural networks (RNNs).
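Whatever the architecture, the network's final layer is typically turned into a probability distribution over phone states for each frame. The sketch below shows that last step only, using a softmax over made-up output scores; the function names and the three-state example are assumptions for illustration, not the networks trained in this project.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def best_state(logits):
    """Return (index, posterior) of the most probable state for one frame."""
    posteriors = softmax(logits)
    index = max(range(len(posteriors)), key=posteriors.__getitem__)
    return index, posteriors[index]

# Hypothetical network outputs for one frame over three phone states.
frame_logits = [2.0, 0.5, -1.0]
posteriors = softmax(frame_logits)
state, prob = best_state(frame_logits)
```

The decoder then combines these per-frame state probabilities with the lexicon and language model to find the most likely word sequence.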

KALDI Recipe

Data Preparation

utils/ data_folder

Your data_folder should contain the files “utt2spk”, “wav.scp”, and “text”. You can find a description of each file and of the data preparation step in the official Kaldi documentation.
This step creates a new file, “spk2utt”, which is needed when computing the MFCC features.
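The relationship between the two files is a simple inversion: “utt2spk” lists one utterance and its speaker per line, while “spk2utt” lists each speaker followed by all of that speaker's utterances. The Python sketch below illustrates the conversion under that assumption; it is not the Kaldi utility itself, and the IDs are hypothetical.

```python
from collections import defaultdict

def utt2spk_to_spk2utt(utt2spk_lines):
    """Invert 'utterance speaker' lines into 'speaker utt1 utt2 ...' lines."""
    speakers = defaultdict(list)
    for line in utt2spk_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        utt_id, spk_id = line.split()
        speakers[spk_id].append(utt_id)
    # Sort speakers and utterances so the output order is deterministic.
    return [
        spk + " " + " ".join(sorted(utts))
        for spk, utts in sorted(speakers.items())
    ]

# Hypothetical utterance and speaker IDs.
utt2spk = [
    "spk1_utt1 spk1",
    "spk1_utt2 spk1",
    "spk2_utt1 spk2",
]
spk2utt = utt2spk_to_spk2utt(utt2spk)
```

Prefixing each utterance ID with its speaker ID, as in the toy data, keeps both files consistently sorted.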

LEXICON Preparation

local/ $LEX_DIR

This lexicon was provided by RDI.

Language Model Preparation

# L compilation
utils/ local/dict "<UNK>" local/lang data/lang
# G compilation

Compute MFCC Features

steps/ --nj $nj --cmd "$train_cmd" $path/testSample exp2/make_mfcc_sampleFinal/train/log $mfccdir

steps/ $path/testSample exp2/make_mfcc_sampleFinal/train/log $mfccdir


Project Repository

Source: Deep Learning on Medium