The convolutional neural networks for Amazigh speech recognition system

ABSTRACT


INTRODUCTION
Deep learning is a branch of machine learning. It consists of learning high-level representations of data using deep neural networks. With technological and scientific advances, deep learning has found a place in many areas, especially in the field of automatic speech recognition. Automatic speech recognition is a computer technique intended to transcribe a speech signal into text [1]. For a long time, hidden Markov models [2,3] were the preferred solution to speech recognition problems. In 2012, however, deep learning [4] underwent a revolution with the appearance of the convolutional neural network (CNN) [5]. The CNN is arguably the most popular architecture: it has applications in image and video recognition, recommender systems [6], and medical image and audio analysis [7], and it has been successfully applied in speech recognition. In this work, we built an Amazigh speech recognition system based on CNN and GPU computation using TensorFlow, an open source library written in Python and C++ with a robust architecture that can be run on multiple CPUs and GPUs. This paper is organized as follows: section 2 presents the related work, section 3 describes the principle and theory of speech recognition, section 4 describes the CNN, and section 5 presents TensorFlow. Finally, section 6 illustrates the experimental results, followed by the conclusion.

RELATED WORK
In our previous work [3], we developed an Amazigh speech recognition system based on hidden Markov models (HMMs) using the open source CMU Sphinx-4. The corpus consists of 11220 audio files. The best accuracy obtained was 90%, when the model was trained using 128 Gaussian mixture models and 5 HMM states. Palo H. K., et al. [8] determined the age of a speaker based on emotional speech prosody, clustering the speakers with the fuzzy c-means algorithm. Recognizing speech emotion from suitable features provides age information that can help society in different ways. They used many feature extraction techniques; the extracted features include F0, energy or amplitude, and speech rate.
Zhang H., et al. [9] studied a series of neural-network-based acoustic models, namely the time delay neural network (TDNN), CNNs, and long short-term memory (LSTM), applied them to Mongolian speech recognition systems, and compared their performance. The results show that the LSTM is the most accurate model, with 8.12% WER. Satori H., et al. [10] developed an Amazigh ASR system based on CMU Sphinx. The system reached 92.89% accuracy. Training was performed using 16 Gaussian mixture models. Kumar K. and Aggarwal R. [11] built a Hindi recognition system using HTK, based on hidden Markov models (HMMs). The training corpus consists of 102 words. The system produced 87.01% accuracy.

AUTOMATIC SPEECH RECOGNITION SYSTEM
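Speech recognition starts by converting the speech signal into a sequence of observations X, in a process called feature extraction. The decoder then looks for the sequence of words W* maximizing the following equation:

$$W^{*} = \arg\max_{W} P(W \mid X) \tag{1}$$

After applying Bayes' theorem, this equation becomes:

$$W^{*} = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)} \tag{2}$$

P(X) is considered constant with respect to W and is removed from (2), so the decoder maximizes the product P(X | W) P(W).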
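The decision rule above can be illustrated with a minimal sketch. The candidate words, acoustic log-likelihoods, and priors below are hypothetical values chosen for illustration only; they are not taken from the system described in this paper. The function name `decode` is likewise an assumption.

```python
import math

def decode(acoustic_loglik, prior):
    """Return the word W maximizing log P(X|W) + log P(W).

    Working in the log domain turns the product P(X|W) P(W) into a sum;
    P(X) is the same for every candidate and is therefore ignored.
    """
    best_word, best_score = None, -math.inf
    for word, loglik in acoustic_loglik.items():
        score = loglik + math.log(prior[word])
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score

# Hypothetical scores for a single utterance X (illustrative values only).
acoustic_loglik = {"ya": -12.0, "yab": -11.5, "yad": -14.0}
prior = {"ya": 0.5, "yab": 0.25, "yad": 0.25}

best_word, best_score = decode(acoustic_loglik, prior)
print(best_word, round(best_score, 3))
```

In this toy case the prior changes the decision: "yab" has the highest acoustic score, but "ya" wins once the prior is taken into account.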
Pre-processing

Audio to spectrum
Speech, whatever its language, is constituted of a finite number of distinctive sound elements. These elements form elementary linguistic units and have the property of changing the meaning of a word. These elementary units are called phonemes [3]. Phonemes can be seen as the basic elements for coding linguistic information. The Amazigh alphabet contains 33 phonemes [10], as shown in Figure 1.
Figure 1. Official table of the Tifinaghe alphabet as recommended by IRCAM [11]
Tifinaghe has officially been the only writing system for transcribing the Amazigh language in Morocco since 2003.
TELKOMNIKA Telecommun Comput El Control
− 2 semi-consonants: ⵢ and ⵡ
− vowels: the full ones (ⴰ, ⵉ, ⵓ) and the neutral one (ⴻ).
A CNN takes an image as input, so to recognize phonemes the audio must first be transformed into an image through a spectral representation. This pre-processing phase is the longest and most important phase in building an ASR system. In speech recognition systems, the most common feature extraction techniques are based on the spectrum: perceptual linear prediction (PLP), the spectrogram, the mel spectrogram [12], and mel frequency cepstral coefficients (MFCC). In this work we have used the MFCC technique.

The spectrogram
The spectrogram is a representation of an audio file in the frequency domain. To convert the raw data into a spectrogram, we apply the short-time Fourier transform [13]. The resulting matrix is then fed into a multi-layer CNN followed by a fully-connected layer with softmax activation, which generates the classification vector. Figure 2 shows the spectrograms of the letters ya, yab, and yad. The MFCCs are the amplitudes of the resulting spectrum. The image produced by these pre-processing steps is then fed into a multi-layer convolutional neural network, with a fully-connected layer followed by a softmax at the end.
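To make the short-time Fourier transform step concrete, the following is a minimal, dependency-free sketch of a magnitude spectrogram. The frame length (64 samples), hop size (32 samples), Hann window, and the synthetic 2 kHz test tone are assumptions made for illustration; the paper does not state the parameters actually used, and a real pipeline would rely on an optimized FFT rather than the direct DFT shown here.

```python
import cmath
import math

def hann(n):
    """Hann window of length n, used to reduce spectral leakage."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def dft_magnitudes(frame):
    """Magnitudes of the non-negative frequency bins of a real frame."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def spectrogram(signal, frame_len=64, hop=32):
    """Slice the signal into overlapping windowed frames and transform each one.

    Returns a list of frames, each a list of frame_len // 2 + 1 magnitudes.
    """
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        frames.append(dft_magnitudes(frame))
    return frames

# Synthetic 2 kHz tone sampled at 16 kHz (the sample rate of the corpus).
sample_rate = 16000
tone_hz = 2000
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(512)]

spec = spectrogram(signal)
# Each bin spans sample_rate / frame_len = 250 Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]), peak_bin * sample_rate // 64)
```

Each row of `spec` corresponds to one time frame and each column to one frequency bin, which is the matrix that can be rendered as an image and passed to the network.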

CONVOLUTIONAL NEURAL NETWORKS

The perceptron
The perceptron is a very simple machine learning algorithm based on a model of biological neurons. It takes an input vector, a weight matrix, and an activation function to produce the desired output [15,16]. The weights are a property of the connections and represent their strength: each connection has a different weight value, while the bias is a property of the neuron, as shown in Figure 3.
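As a small illustration of this computation, the sketch below implements a single perceptron with a step activation. The weights and bias are hand-picked, hypothetical values that make the neuron behave like a logical AND gate; they are not learned and are not part of the system described in this paper.

```python
def step(z):
    """Step activation: fire (1) when the weighted sum is non-negative."""
    return 1 if z >= 0 else 0

def perceptron(inputs, weights, bias, activation=step):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Hypothetical parameters: with these values the neuron computes logical AND.
weights, bias = [1.0, 1.0], -1.5
outputs = [perceptron([a, b], weights, bias) for a in (0, 1) for b in (0, 1)]
print(outputs)
```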

The multilayer perceptron (MLP)
When we combine many perceptrons, we form a multilayer perceptron or more precisely an artificial neural network [15].The first layer is the input layer, corresponding to the data features.The last layer is the output layer, which provides the output probabilities of classes or labels as shown in Figure 4.
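The forward pass of such a network can be sketched as follows. This is a minimal example with one hidden layer, ReLU activation, and a softmax output; the layer sizes and all weight values are arbitrary assumptions chosen for illustration.

```python
import math

def relu(values):
    """Replace negative activations with zero."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """Fully connected layer: one row of weights per output neuron."""
    return [
        sum(x * w for x, w in zip(inputs, row)) + b
        for row, b in zip(weights, biases)
    ]

def mlp_forward(features, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> hidden layer (ReLU) -> output layer (softmax)."""
    hidden = relu(dense(features, hidden_w, hidden_b))
    return softmax(dense(hidden, out_w, out_b))

# Arbitrary illustrative parameters: 3 features, 2 hidden neurons, 3 classes.
features = [0.5, -1.0, 2.0]
hidden_w = [[0.2, -0.4, 0.1], [-0.3, 0.1, 0.5]]
hidden_b = [0.0, 0.1]
out_w = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]
out_b = [0.0, 0.0, 0.0]

probs = mlp_forward(features, hidden_w, hidden_b, out_w, out_b)
predicted_class = max(range(len(probs)), key=lambda i: probs[i])
print([round(p, 3) for p in probs], predicted_class)
```

The index of the largest probability is taken as the predicted class.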

CNN
CNNs, or ConvNets, are a particular form of neural network [17,18] that takes an image as input, inspired by the work of Hubel and Wiesel on the primary visual cortex of the cat [19], as shown in Figure 5. The CNN architecture has two components. The first is the convolutional part, or feature extraction part; we use the spectrogram technique to extract the features. The second is the classification part: the feature vector extracted by the convolutional part is fed to the fully connected layers leading into the output layer, which represents the classifier. The convolutional part consists of the following layers [5-20].
Convolutional layer: convolution is one of the main building blocks of a CNN. Based on the mathematical principle of convolution [21], it consists of a set of learnable filters, or kernels. Each filter is applied by independently striding over the entire input, creating an output feature map for each filter. At every location, the filter and the underlying input patch are multiplied element-wise and the result is summed into the feature map, as shown in Figure 6.
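The sliding-filter operation can be sketched in a few lines. The example below assumes no padding and a single input channel, and, as is common in deep learning libraries, it applies the kernel without flipping it (strictly speaking a cross-correlation). The input values and the kernel are arbitrary illustrative numbers.

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image and sum the element-wise products.

    No padding is applied, so the feature map is smaller than the input.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map

# Arbitrary 4x4 input and a 2x2 kernel that compares each pixel
# with its lower-right neighbour.
image = [
    [1, 2, 0, 1],
    [0, 1, 3, 2],
    [2, 1, 0, 1],
    [1, 0, 2, 3],
]
kernel = [
    [1, 0],
    [0, -1],
]

feature_map = conv2d(image, kernel)
strided_map = conv2d(image, kernel, stride=2)
print(feature_map)
print(strided_map)
```

A larger stride visits fewer locations and therefore produces a smaller feature map.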
Figure 5. Illustration of the architecture used for the CNN with many layers
Figure 6. Convolutional mathematical principle [20]
Pooling layer: reduces the size of the image. It is an essential layer, often placed between two convolutional layers.
ReLU correction layer: replaces all negative values received with zeros.
The classification part consists of fully connected layers. Fully-connected layers are used after the final convolutional layer in order to match the output size of the neural network to the desired output size. The output is then passed through a softmax function in order to create a probability representation of the predictions for each class in the supervised learning setting. To use the fully-connected layers, the output from the final convolutional layer is commonly flattened out, or the feature maps are subsampled to a size of 1.
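The ReLU, pooling, and flattening steps can be illustrated with the following sketch. It assumes 2x2 max pooling with a stride equal to the window size, which is a common choice but is not stated in the paper; the feature map values are arbitrary.

```python
def relu(feature_map):
    """Replace every negative value in the feature map with zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Keep the largest value of each non-overlapping size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled

def flatten(feature_map):
    """Turn a 2-D feature map into the 1-D vector a dense layer expects."""
    return [v for row in feature_map for v in row]

# Arbitrary 4x4 feature map.
feature_map = [
    [-1.0, 2.0, 0.5, -3.0],
    [4.0, -2.0, 1.5, 0.0],
    [0.0, -1.0, -2.0, -0.5],
    [3.0, 1.0, -4.0, 2.5],
]

vector = flatten(max_pool(relu(feature_map)))
print(vector)
```

Pooling halves each spatial dimension here, so the 16 input values are reduced to a vector of 4 values before the fully connected layers.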



TENSORFLOW
TensorFlow is an open source library developed by Google's AI organization as a middleware library that can be used to build deep learning neural networks. TensorFlow is written in Python and C++ with a robust architecture that can be run on multiple CPUs and GPUs [22], as shown in Figure 7. With TensorFlow, machine learning algorithms are based on the concept of the data flow graph, or computational graph [23]. A session creates a runtime in which operations are executed and tensors are evaluated. We opted for TensorFlow for the following reasons: TensorFlow comes with a complete set of visualization tools that make it easy to understand, debug, and optimize applications. TensorFlow also has a large community of users and extensive documentation.

Environment
We chose to install TensorFlow with GPU support, using a virtualenv installation on an HP Z640 workstation with a 4-core Intel processor running Linux Ubuntu 16.04, to avoid the problems reported in [23]. The following software needs to be installed properly [24]:
− Prior to installing TensorFlow with GPU support, ensure that the system supports all NVIDIA software requirements.

Corpus
To train our model we use the dataset collected by Satori H., et al. [10]. The signals were recorded in a quiet space with the same microphone; the recordings are in MS WAV format with a sample rate of 16 kHz, 16 bit, mono. Each speaker was invited to pronounce the 33 Amazigh letters 10 times. During training, the corpus is separated into:
− Training data: 80% of the data;
− Validation data: 10% of the data, reserved for evaluating the accuracy during training;
− Test data: 10% of the data, used to evaluate accuracy once training is complete.
Table 1 defines how many audio files are used in the training, validation, and test data for the 3 separate experiments described in this paper. The model is then trained and evaluated on this data: we feed our CNN with the spectrogram results from the pre-processing phase to train and predict the labels. The labels used in this paper are "silent", "ya", "yab", "yad", ... At the end of training, a final confusion matrix is generated. The columns of this matrix represent the predicted labels and the rows represent the actual labels. Each column represents the set of samples that were predicted to be a given label: the first column represents all the clips that were predicted to be silence, the second column all those that were predicted to be the word "ya", and the third all those predicted to be "yab" [14]. The matrix thus gives a good summary of the training errors.
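The layout of the confusion matrix described above (rows for actual labels, columns for predicted labels) can be sketched as follows. The eight actual and predicted labels are invented for illustration and do not come from the experiments; the helper names are assumptions as well.

```python
def confusion_matrix(actual, predicted, labels):
    """Count clips per (actual label, predicted label) pair.

    Rows are indexed by the actual label, columns by the predicted label.
    """
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Share of clips on the diagonal, i.e. predicted correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

labels = ["silent", "ya", "yab", "yad"]
# Invented example labels for eight clips (illustration only).
actual = ["silent", "ya", "ya", "yab", "yab", "yad", "yad", "yad"]
predicted = ["silent", "ya", "yab", "yab", "yab", "yad", "ya", "yad"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
print(accuracy(matrix))
```

Off-diagonal entries show which letters are confused with each other: in this toy data, one "ya" clip is predicted as "yab" and one "yad" clip as "ya".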

Experiment 1
The corpus consists of 9240 audio files: 28 speakers were invited to pronounce the 33 Amazigh letters 10 times. The corpus is divided into 7392 training, 924 test, and 924 validation audio files. The results show that the system reaches 89.8% accuracy, as shown in Figure 8.
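The corpus size and the 80/10/10 split can be checked with a short sketch. The file naming scheme and the seeded random shuffle are assumptions made for illustration; the paper does not describe how individual files were assigned to each subset.

```python
import random

def corpus_size(speakers, letters=33, repetitions=10):
    """Each speaker pronounces every letter the same number of times."""
    return speakers * letters * repetitions

def split_corpus(files, train_pct=80, val_pct=10, seed=0):
    """Shuffle the files and cut them into training, validation, and test sets.

    Integer percentages avoid floating-point rounding in the subset sizes.
    """
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Hypothetical file names: 28 speakers x 33 letters x 10 repetitions.
files = [
    f"speaker{s:02d}_letter{l:02d}_take{t:02d}.wav"
    for s in range(28)
    for l in range(33)
    for t in range(10)
]

train, val, test = split_corpus(files)
print(corpus_size(28), len(train), len(val), len(test))
```

The same helper gives 13860 files for the 42 speakers of the third experiment.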


The convolutional neural networks for Amazigh speech recognition system (Meryam Telmem) 521

Experiment 2
In order to test the effect of gender on the quality of the acoustic model, this experiment uses a corpus of 9240 audio files: 14 male and 14 female speakers were invited to pronounce the 33 Amazigh letters 10 times. Table 2 gives the number of audio files in the training, validation, and test data, together with the results. The best results were recorded for males, with 93.8% accuracy.

Experiment 3
In this experiment, we evaluate the performance of a system trained and tested on different age groups. The corpus consists of 13860 audio files: 42 speakers were invited to pronounce the 33 Amazigh letters 10 times. We classified the speakers' ages into three categories: age 9-15, age 16-30, and age +30. Table 3 gives the number of audio files in the training, validation, and test data, together with the results. The best results were recorded for the +30 age category. To test the effect of gender or age variation on the quality of the system, it was trained and tested with different corpora. Our results are encouraging: the best configuration reaches 93.9% accuracy.

Comparative analysis
The presented work has been compared with existing work on similar recognition tasks, in particular speech emotion recognition (SER) and sound event recognition. Table 4 lists results from our previous work [3], Zheng W. Q., et al. [25], Zhang H., et al. [26], and our proposed system. In our previous work [3], we developed an Amazigh speech recognition system based on hidden Markov models (HMMs) using the open source CMU Sphinx-4. The corpus consists of 11220 audio files. The system obtained its best performance of 90% when trained using 128 Gaussian mixture models and 5 HMM states. Zheng W. Q., et al. [24] developed an emotion recognition system based on convolutional neural networks with 2 convolution + 2 pooling layers, using labelled training audio data, the log-spectrogram to extract features, and principal component analysis (PCA) to reduce the dimensionality. The system achieved about 40% accuracy. Zhang H., et al. [25] proposed sound event detection based on convolutional neural networks with 2 convolution + 2 pooling layers, using the spectrogram to extract features. The system achieved about 94.07% accuracy. In our proposed work, the system obtained its best performance of 93.9% accuracy when trained using the +30 age category. These results are very satisfactory compared with existing similar works.


Figure 3. A perceptron
Figure 4. The multilayer perceptron
The nodes of a TensorFlow data flow graph represent mathematical operations. The edges are tensors. In terms of TensorFlow, a tensor is just a multi-dimensional array. Each data flow graph computation runs within a session on one or more CPUs or one or more GPUs. A computational graph in TensorFlow consists of several parts:
− Tensor: a multi-dimensional array.
− Graph: a central hub that connects all the variables, placeholders, and constants to operations.
− Constants: fixed-value tensors, not trainable.
− Variables: tensors initialized in a session, trainable.
− Placeholders: tensors whose values are unknown during graph construction but are passed as input during a session.
− Operations: functions on tensors.
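The separation between building a graph and running it in a session can be illustrated with a toy model. The sketch below is written in plain Python and deliberately does not use the TensorFlow API: the `Node` and `Session` classes and the helper functions are hypothetical names introduced only to mirror the concepts listed above, and scalars stand in for tensors.

```python
class Node:
    """One node of a toy data flow graph (not a TensorFlow object)."""

    def __init__(self, kind, value=None, op=None, inputs=()):
        self.kind = kind      # "constant", "variable", "placeholder", "operation"
        self.value = value    # stored value for constants and variables
        self.op = op          # Python function for operation nodes
        self.inputs = inputs  # upstream nodes feeding this operation

def constant(value):
    return Node("constant", value=value)

def variable(initial):
    return Node("variable", value=initial)

def placeholder():
    return Node("placeholder")

def operation(fn, *inputs):
    return Node("operation", op=fn, inputs=inputs)

class Session:
    """Evaluates a node by recursively evaluating the nodes it depends on."""

    def run(self, node, feed=None):
        feed = feed or {}
        if node.kind == "placeholder":
            return feed[node]  # value supplied only at run time
        if node.kind in ("constant", "variable"):
            return node.value
        args = [self.run(upstream, feed) for upstream in node.inputs]
        return node.op(*args)

# Graph construction: nothing is computed yet.
x = placeholder()
w = variable(2.0)
b = constant(1.0)
y = operation(lambda a, c: a + c, operation(lambda a, c: a * c, w, x), b)

# Execution: the placeholder receives its value inside the session.
session = Session()
result = session.run(y, feed={x: 3.0})
print(result)
```

The graph describing y = w * x + b is built once and can then be evaluated for any value fed to the placeholder.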

Figure 8. A graph showing the training progress of the CNN models

Table 1. Training, validation, and test data for the 3 experiments

Train CNN with TensorFlow
Basically, there are 3 steps to build a CNN model in TensorFlow:
− Preprocessing the data;
− Building the model: build the nodes and operations and how they are connected to each other;
− Training and estimating the model on some data.

Table 2. Recognition accuracy for experiment 2: corpus consists of 9240 audio files

Table 3. Recognition accuracy for experiment 3: corpus consists of 13860 audio files

Table 4. Tabular comparison of the recognition accuracy of the proposed systems