Wavelet Based Feature Extraction for the Indonesian CV Syllables Sound

This paper proposes combining the Wavelet Transform (WT) and the Euclidean Distance (ED) to estimate the expected value of the possible feature vectors of Indonesian syllables. The research aims to identify the most effective and efficient way to extract features from each syllable sound for use in speech recognition systems. The proposed approach, which builds on the previous study, consists of three main phases. In the first phase, the speech signal is segmented and normalized. In the second phase, the signal is transformed into the frequency domain using the WT. In the third phase, the ED algorithm is used to estimate the expected feature vector. The results are a list of features for each syllable that can be used in further research, and recommendations on the most effective and efficient WT for syllable sound recognition.


Introduction
Realizing an intelligent pattern recognition system requires ancillary systems that are effective, reliable, and efficient, so that they can be integrated well into an intuitive interaction system [1]. One of the most extensively developed pattern recognition support systems is the speech recognition system [1]-[4]. One challenge in building a good speech recognition system is feature extraction [3]-[4]: finding the unique features of a speech signal that distinguish it from other speech signals, so that a collection of these features can form a reference database for identifying each speech signal given as an input command by the user. One feature extraction method is the Wavelet Transform (WT), which is well suited to exploring the frequency components of speech signals [3]. Computing the Euclidean Distance (ED) is a key part of many machine learning and template matching methods, where it is used to find the closest members of the training set and to estimate the possible feature vector. Because many feature extraction methods exist, the most suitable one must be identified for each specific type of sound signal [3].
Several previous studies have combined the WT and the ED to analyze both one-dimensional (1-D) and two-dimensional (2-D) signals [5]-[12]. In [5], a new approach combining the WT, Phase Space Reconstruction (PSR), and the ED was proposed to classify normal and epileptic seizure EEG signals. The Daubechies 4 wavelet was used at decomposition levels 1 through 5. In [6], the ED and the WT were used to analyze the blood flow signal. The decomposition was done at one and three levels using the Daubechies 2, Morlet, Symlet 2, and Symlet 4 wavelets. In [7], the ED was used to estimate the expected value of possibly incomplete feature vectors. In [11], Indonesian phonemes were extracted using the Discrete Wavelet Transform (DWT) and the Wavelet Packet Transform (WPT) at the 2nd through 4th levels of decomposition, with Haar as the mother wavelet. The results showed that the DWT is more efficient and effective than the WPT in extracting Indonesian phonemes, with an effectiveness ratio of 60% versus 40% and an efficiency ratio of 57% versus 43% [11]. In the context of speech classification, several previous studies have combined feature extraction methods based on wavelets [13]-[17], MFCC [16]-[18], LPC [16]-[17], and LPCC [16] with classifiers such as MLP [13,18,19], HMM [20], GMM [19], and LDA [14,17].
However, these similar studies focused only on the sounds of Indonesian vowels [12] and phonemes [11]. This paper develops the previous studies that used the WT and the ED algorithm [11]-[12]. The frequency components of the vowel (V) and the phoneme were combined to form and estimate the frequency components of the CV syllable pattern. This paper aims to explore the WT as a tool for extracting features of Indonesian consonant-vowel syllables and to compare different wavelet coefficients for analyzing Indonesian syllable sound signals. The mother wavelets were selected on the basis of previous studies. The scope of this work is restricted to the combination of the phonemes /m, n, r, s/ with the following vowels /a, i, u, e/, and of the velar consonant /g/ with the following vowels. Although the selection does not cover all phonemes, it is expected to represent a large portion of the existing ones. For example, the phoneme /g/ represents a velar sound, the phonemes /m, n/ represent nasal sounds, /r/ represents an alveolar trill, /s/ represents an alveolar fricative, and /t/ represents an alveolar stop. The effectiveness and efficiency experiment for the velar /g/ with the following vowels was done separately.

Research Method
This study uses three main steps. The first step is pre-processing, which selects the part of the signal to be processed further and restores the speech signal level. Next, the signal is transformed into the frequency domain using the WT algorithm. The results of this step are the frequency components and the magnitude of each possible feature vector found. Finally, a selection and testing process that uses the ED algorithm is conducted to obtain the most reliable features for use in the speech recognition process.

Pre-processing
The speech sound signal was recorded with a laptop and a microphone from two male and two female speakers in an open area with minimal noise. The signal was then segmented to a certain length. After segmentation, the next step was peak normalization. The purpose of this process is to ensure matching volume and optimal use of the media distributed in the recording stage.
In this study, we used peak normalization, which differs slightly from loudness normalization. Peak normalization changes the gain to bring the highest value, or peak, of the Pulse-Code Modulation (PCM) samples of the analogue signal to a desired level. Loudness normalization, in contrast, adjusts a signal's gain so that the signal's loudness level equals a desired level. The peak normalization equation is given as Equation (1), with X' = output data of peak normalization, X_max = maximum value of the input data, X_i = input data to be normalized, and X_min = minimum value of the input data.
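As a rough illustration of the normalization step, the sketch below shows two candidate forms in Python. The function names `peak_normalize` and `min_max_normalize` and the `peak` parameter are assumptions introduced here. The min-max form is only one reading consistent with the symbols X_i, X_min, and X_max, and it may differ from the form used in Equation (1).

```python
def peak_normalize(samples, peak=1.0):
    """Scale samples so that the largest absolute value equals `peak`."""
    largest = max(abs(s) for s in samples)
    if largest == 0:
        return list(samples)  # silent signal: nothing to scale
    gain = peak / largest
    return [s * gain for s in samples]


def min_max_normalize(samples):
    """Assumed min-max form: (X_i - X_min) / (X_max - X_min)."""
    x_min, x_max = min(samples), max(samples)
    if x_max == x_min:
        return [0.0 for _ in samples]  # constant signal: avoid division by zero
    return [(x - x_min) / (x_max - x_min) for x in samples]


# Hypothetical PCM-like values, for illustration only.
segment = [0.5, -0.25, 0.1]
scaled = peak_normalize(segment)
```

The first function applies a single gain to the whole segment, which matches the verbal description of peak normalization; the second maps the segment onto the interval from 0 to 1.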

Feature Extraction
Feature extraction finds the specific characteristics of a sound signal by converting the speech signal into a set of parameters called feature vectors. This process plays a very important role in voice recognition; it is the key stage of an overall scheme for pattern recognition and classification [21], because better features improve the recognition rate. We applied the WT to decompose the signal. Because speech is a non-stationary signal, it is not well suited to analysis with the Fourier Transform (FT): the FT provides the frequency content of a signal but not the time at which each frequency is present. The WT is better at describing signal anomalies, pulses, and other short-duration events in a signal such as speech.
ISSN: 1693-6930. Wavelet Based Feature Extraction for the Indonesian CV Syllables Sound (Domy Kristomo)
There are several types of WT methods, including the DWT and the WPT. In DWT decomposition, only the approximation (lower-frequency) branch is decomposed further at each level, whereas the WPT is a generalization of the DWT that offers a wider range of signal analysis. The WPT produces a balanced binary tree by decomposing both the lower-frequency (approximation) and higher-frequency (detail) bands, which provides more features with better frequency resolution for speech signal analysis. The basic WT function can be written as:
$$\psi_{s,\tau}(t) = \frac{1}{\sqrt{s}}\,\psi\left(\frac{t-\tau}{s}\right)$$
where ψ(t) is known as the wavelet or prototype function, and s and τ are the scaling and translation parameters, respectively. The term 1/√s provides energy normalization across scales. In wavelet research, the selection of the most suitable mother wavelet is still an open question among researchers [13]. Figure 1 shows the structure of the WT. In feature extraction using the WT, choosing the right mother wavelet is crucial for an optimal classification result [13,22]. The mother wavelet filters used in this study are Haar, Daubechies, and Coiflet.
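The structural difference between the DWT and the WPT can be sketched with the Haar wavelet in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names `haar_step`, `dwt`, and `wpt` are introduced here, and the signal length is assumed to be divisible by 2 raised to the number of levels.

```python
import math


def haar_step(signal):
    """One Haar analysis step: returns (approximation, detail) halves."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    root2 = math.sqrt(2.0)
    pairs = range(0, len(signal), 2)
    approx = [(signal[i] + signal[i + 1]) / root2 for i in pairs]
    detail = [(signal[i] - signal[i + 1]) / root2 for i in pairs]
    return approx, detail


def dwt(signal, levels):
    """DWT: only the approximation branch is split again at each level."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.insert(0, detail)  # deepest detail band first
    return [approx] + details  # [A_L, D_L, ..., D_1]


def wpt(signal, levels):
    """WPT: both approximation and detail branches are split at each level."""
    nodes = [list(signal)]
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            approx, detail = haar_step(node)
            next_nodes.extend([approx, detail])
        nodes = next_nodes
    return nodes  # leaves in tree order, not sorted by frequency


# Hypothetical 8-sample segment, for illustration only.
x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
print(len(dwt(x, 3)), len(wpt(x, 3)))  # 4 sub-bands versus 8 sub-bands
```

At level L the DWT yields L + 1 sub-bands while the WPT yields 2^L, which is why the WPT offers finer frequency resolution at the cost of more coefficients to examine.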

Selection of Features
Selection and testing are conducted simultaneously on the extracted features to obtain the most reliable features for use in the speech recognition process. This process minimizes the Euclidean distance between the matching test data and the features obtained for the respective syllable sound, so that the features can be relied on to distinguish a given syllable sound from the others accurately. The Euclidean distance (d) between two points p and q is given by:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$
where p_i and q_i are the i-th components of p and q, and n is the number of components. After the candidate features are found, they are tested by calculating the Euclidean distance between the syllable features and the features of other phonemes. A feature is categorized as effective and reliable when, for example, a feature of the /ga/ syllable yields a very small ED when tested against the /ga/ syllable itself, but a fairly large ED when tested against the specific features of other syllables.
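The distance test described above can be sketched as follows. The function names `euclidean_distance` and `nearest_syllable`, and all feature values, are hypothetical and chosen only for illustration; the study's actual feature vectors are not reproduced here.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


def nearest_syllable(test_vector, references):
    """Return the label whose reference vector is closest to the test vector."""
    return min(
        references,
        key=lambda label: euclidean_distance(test_vector, references[label]),
    )


# Hypothetical feature magnitudes for two syllables.
references = {"ga": [0.9, 0.2, 0.1], "gi": [0.1, 0.8, 0.3]}
test_vector = [0.85, 0.25, 0.1]

same = euclidean_distance(test_vector, references["ga"])   # small distance
other = euclidean_distance(test_vector, references["gi"])  # larger distance
label = nearest_syllable(test_vector, references)
```

A feature set behaves as the text requires when `same` is much smaller than `other` for every pair of syllables tested.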

Results and Analysis
The first result is the list of features of each phoneme of the Indonesian syllables, obtained by using three kinds of wavelet transform with the ED as the classifier. The second result is a recommendation of the best wavelet types for an Indonesian syllable recognition system.

List of Features
The vowel and phoneme frequency components obtained with the Haar mother wavelet are shown in Figure 2. The vowel and phoneme results were then combined to estimate the frequency components of each syllable. Figure 2 shows boxplots of the frequency component distribution for each vowel and phoneme sound signal using the WT; a boxplot is an exploratory chart showing the median, the central spread of the data, and the position of relative extremes [14]. The lower and upper whiskers show the lowest and highest frequency components that form a given vowel or phoneme, whereas the lower edge, middle line, and upper edge of the box show the first quartile, the median, and the third quartile, respectively. The boxplots show that the vowels and phonemes have different frequency component ranges.
Figure 3 compares the DWT (highlighted in purple) and the WPT (highlighted in yellow) at the 2nd through 4th levels of decomposition in terms of effectiveness for the phoneme sound signals. The graph clearly shows that the DWT at the 2nd level of decomposition has the highest effectiveness score, especially for /m/ and /n/, compared with the WPT and with the other decomposition levels. Overall, the DWT is more effective than the WPT, with an effectiveness ratio of 9 versus 6.
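The boxplot quantities described for Figure 2 (whiskers at the lowest and highest values, box edges at the first and third quartiles, and the median) can be computed with a short sketch like the one below. The function name `five_number_summary` and the sample values are assumptions for illustration, and the inclusive quartile method is one of several conventions; the plotting tool used for the figures may apply a different one.

```python
import statistics


def five_number_summary(values):
    """Return the five quantities a boxplot of `values` would display."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "low": min(values),    # lower whisker
        "q1": q1,              # lower edge of the box
        "median": median,      # line inside the box
        "q3": q3,              # upper edge of the box
        "high": max(values),   # upper whisker
    }


# Hypothetical frequency components (Hz) for one sound.
components = [250, 500, 750, 1000, 1250]
summary = five_number_summary(components)
```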
The frequency component distribution of each syllable sound signal, obtained by combining the phoneme and vowel frequency components, is shown in Figure 4. Four types of phoneme are each followed by four types of vowel, giving sixteen types of CV syllable. The boxplot shows that /n-i/ has a wider frequency component range than the other syllables.

Effectiveness Analysis
Effectiveness analysis determines the wavelet type that distinguishes a given syllable sound most accurately and is least likely to cause recognition errors, owing to the good Euclidean distance signature of its features. The analysis is based on the Euclidean distance data in Table 1.

Efficiency Analysis
In addition to being effective, a speech recognition method needs to be efficient, so that the specific features of a syllable can be found quickly and precisely. The efficiency of a method is determined by the number of features it uses at each decomposition level to find the specific features of the syllable. Table 6 shows the number of features used by both feature extraction methods for every syllable and every decomposition level. From the efficiency analysis, a ranking table of the most efficient wavelet types for distinguishing each syllable was compiled, as shown in Table 7.

Choosing the Best Wavelet
A cross ranking, or average ranking, process is then performed on Table 8 and Table 9 to find the best mother wavelet (the most effective and efficient) for recognizing the sound of each syllable. Cross ranking, as shown in Table 8, is done by summing the ranking values in Table 5 and Table 7 for each wavelet type and each syllable. The mother wavelet with the smallest sum is the best (the most effective and efficient) for recognizing the syllable sound. For example, if the Coif wavelet has an effectiveness rank of 1 and an efficiency rank of 2, its cross ranking result is 1 + 2 = 3. If another wavelet type has a value less than 3, that wavelet can be considered better than the Coif wavelet.
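The cross ranking rule can be sketched in a few lines. The function name `cross_rank` is an assumption, and only the Coif ranks (1 and 2) come from the example in the text; the ranks given for the other two wavelets are hypothetical placeholders, not results from the tables.

```python
def cross_rank(effectiveness_rank, efficiency_rank):
    """Sum the two ranks per wavelet; the smallest total is the best."""
    totals = {
        wavelet: effectiveness_rank[wavelet] + efficiency_rank[wavelet]
        for wavelet in effectiveness_rank
    }
    best = min(totals, key=totals.get)  # on a tie, the first entry wins
    return totals, best


# Coif ranks follow the worked example; the others are hypothetical.
effectiveness = {"Coif": 1, "Haar": 2, "Db2": 3}
efficiency = {"Coif": 2, "Haar": 3, "Db2": 1}

totals, best = cross_rank(effectiveness, efficiency)
```

Because ties are possible when ranks are summed, a real comparison would need an explicit tie-breaking rule; this sketch simply keeps the first wavelet encountered.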

Conclusion
This paper proposed combining the Wavelet Transform (WT) and the Euclidean Distance (ED) to estimate the expected value of the possible feature vectors of Indonesian syllables. Based on the experimental results presented here, the combination of WT and ED is promising for estimating the expected frequency component values of the possible feature vectors of Indonesian syllables. The effectiveness and efficiency ranking of the three mother wavelets for feature extraction of the velar consonant /g/ with the following vowels is Haar, Coiflet 2, and Daubechies 2, respectively. Recommended future work is to use a bigger syllable dataset, to apply the method to Indonesian stop consonants or other places of articulation (such as labial and dental), and to use the same level of decomposition for estimating the frequency components of the vowel and the phoneme.
Hidayat et al. used the Discrete Wavelet Transform (DWT) at the 7th level of decomposition to extract Indonesian vowels. The Daubechies 2, Coiflet 2, Symlet 5, Haar, Bioorthogonal 2.2, and discrete Meyer wavelets were used as mother wavelets. The results of the study show that the Haar wavelet is the best wavelet type for the speech recognition process for all Indonesian vowel sounds.
ISSN: 1693-6930. TELKOMNIKA Vol. 16, No. 3, June 2018: 925-933

Figure 1. (a) The structure of the wavelet tree for the phoneme signal; (b) the structure of the wavelet tree for the vowel signal

Figure 2. The boxplot of frequency components: (a) vowel, and (b) phoneme
Figure 2(a) shows that /i/ has a wider range of frequency components than the other vowels.

Table 1. Euclidean Distance of the Haar Wavelet
One of the feature lists for a syllable sound (/ga/), obtained by using the Daubechies 2 wavelet, consists of frequency components that have been tested and can accurately represent each syllable. The average feature magnitude (MEAN X) minus the average value for the other types of syllables (ELSE) is listed in the tables as DIFF; the wavelet type with the best performance (the biggest Euclidean distance value) is marked in yellow, as shown in Table 2, Table 3, and Table 4.

Table 6. The Number of Features

Table 8. The Best Wavelet Ranking