Speech classification using combination virtual center of gravity and k-means clustering based on audio feature extraction

Abstract
Voice recognition can be done in a variety of ways. Sound patterns can be recognized by performing sound feature extraction. The training sound data is built by selecting the best sound data using a correlation coefficient based on the level of similarity between sound samples, which yields optimal sound features. Feature extraction in this research uses the Virtual Center of Gravity method. This method calculates the distance between the sound data and the center point of gravity, visualized in 3-dimensional form as white, black, and grey pattern spaces. The preprocessing stage generates complex-valued data consisting of real and imaginary numbers; the distances from these numbers to the Virtual Center of Gravity's pattern spaces are calculated using the Euclidean distance. The sound features are tested using K-Means clustering to classify speech data by spoken word. The results showed an accuracy of 92.5%.

I. Introduction
The human voice is one of the biometric forms that can be used to recognize a person's character. The process of automatically recognizing the spoken words of a speaker based on information in the speech signal is called speech recognition [1]. Speech recognition is a machine's or program's ability to identify words and phrases in spoken language and convert them into a machine-readable format [2]. Feature extraction is the most important part of a speech recognition system, since it is what distinguishes one speech signal from another [2]. The sound feature extraction method used here is the Virtual Center of Gravity (VCG) method.
The Virtual Center of Gravity method borrows the concept of the center of gravity from physics. The center of gravity is the center of an object's weight distribution; when gravity is regarded as a force, it is the point at which the object remains in perfect balance no matter how the object is rotated or flipped about that point [3]. This concept is applied here in order to find a distinctive feature of an object.
The sound captured by a recording device passes through several processing stages to obtain its sound features. Sound feature extraction is the process of converting voice signals into a set of parameters, in which sound data considered useless (noise) is discarded without removing the true meaning of the sound signal [4]. The processes used in this study to eliminate insignificant sound data (noise) are truncation, normalization, frame blocking, windowing, and the Fast Fourier Transform (FFT). These steps ensure that the sound data is good enough to be used in sound feature extraction. The system then classifies the sound feature data into clusters of identical spoken words using the K-Means clustering algorithm [5]. The resulting system recognizes a person's speech based on the cluster that has been formed. The sound identification process is indispensable for measuring the accuracy of speech recognition based on these features. This research aims to construct a prototype sound feature using the Virtual Center of Gravity in 3-dimensional form and to test that feature by accurately recognizing speech through K-Means clustering.

II. Method
Voice sampling is done by recording sound using a voice recorder app on a phone or computer. The recording process is repeated several times.
The system divides audio data collection into two stages: training and testing. The training stage includes preprocessing I, best sampling, preprocessing II, and audio feature extraction using the Virtual Center of Gravity. The testing stage includes preprocessing and audio feature extraction using the Virtual Center of Gravity. The system block diagram is shown in Fig. 1.

1) Read Data
Voice data is read with the audioread function in Matlab. The audio data format used in this study is M4A.

2) Truncation
Truncation of the sound data is performed to discard the portions of the signal that are not needed. The cuts are made at the beginning and end of the sound signal, because a recording usually starts and ends with a pause whose amplitude is close to 0. To retain only the spoken portion, samples with an amplitude below 0.01 are cut away.
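As an illustration, the truncation step described above can be sketched in Python (the paper works in Matlab; this is a minimal sketch using the 0.01 amplitude threshold from the text):

```python
import numpy as np

def truncate(signal, threshold=0.01):
    """Drop leading/trailing samples whose absolute amplitude is below threshold."""
    idx = np.where(np.abs(signal) >= threshold)[0]
    if idx.size == 0:          # signal is all "silence"
        return signal[:0]
    return signal[idx[0]:idx[-1] + 1]
```

Samples below the threshold inside the spoken region are kept; only the silent head and tail are removed.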

3) Data Sampling
Sound data sampling takes a segment of fixed length from the signal vector. Sampling is done to make the length of each voice signal uniform with the others.

4) Normalization
Feature normalization techniques represent a vital part of every biometric recognition system [6]. Normalization is used to keep the amplitude range of a sound signal from differing considerably from that of the other voice signals:

y(n) = x(n) / max|x|   (1)

where y(n) denotes the n-th sample of the normalized sound signal and x(n) the n-th sample of the audio vector.
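A minimal Python sketch of this step, assuming peak normalization (dividing every sample by the maximum absolute amplitude):

```python
import numpy as np

def normalize(signal):
    """Scale the signal so its peak absolute amplitude is 1."""
    signal = np.asarray(signal, dtype=float)
    peak = np.max(np.abs(signal))
    return signal if peak == 0 else signal / peak
```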

B. Best Sampling
The best data selection stage aims to get the best feature by optimizing the training data selection. The training data is selected based on its similarity to the other sound samples. The best data selection flow is shown in Fig. 3. This selection uses the value of the correlation coefficient, a value that indicates the level of similarity between 2 variables. A correlation coefficient approaching 1 indicates a very high level of resemblance; conversely, a correlation coefficient approaching 0 means that the level of similarity between the 2 variables is very low [7].
The correlation coefficient is calculated for every pair of distinct sound data variables. The average of all correlation coefficient values is then computed, and the sound data whose correlation coefficient is closest to this average is selected. The level of resemblance between 2 sound data variables, e.g. x and y, is expressed by the following equation:

r = sum_i (x_i - mean(x)) (y_i - mean(y)) / sqrt( sum_i (x_i - mean(x))^2 * sum_i (y_i - mean(y))^2 )   (2)
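The selection described above can be sketched as follows, assuming equal-length signals and Pearson's correlation coefficient; the exact tie-breaking rule in the paper may differ:

```python
import numpy as np

def best_samples(samples, k=3):
    """Pick the k signals whose average pairwise correlation with the others
    is closest to the overall mean correlation (sketch of the selection)."""
    X = np.stack(samples)                       # (n_signals, n_samples)
    n = len(X)
    r = np.corrcoef(X)                          # pairwise Pearson correlations
    mean_r = r[np.triu_indices(n, k=1)].mean()  # mean over distinct pairs
    avg_per = (r.sum(axis=1) - 1.0) / (n - 1)   # each signal's mean correlation
    return np.argsort(np.abs(avg_per - mean_r))[:k]
```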

1) Frame Blocking
The sound signal is divided into several frames, each consisting of the same number of samples. Frame blocking is generally done with overlap between adjacent frames [8]. Overlapping avoids the loss of signal characteristics at the border of each frame. The length of the overlapping region is generally around 30% to 50% of the frame.
Suppose M is the number of samples between adjacent frames (the hop size) and N is the number of samples per frame, so that M < N. An illustration of frame blocking is shown in Fig. 5. The overlap is usually expressed as a percentage:

overlap = (N - M) / N * 100%   (3)

If L is the length of the normalized sound data and W the number of frames, then:

W = (L - N) / M + 1   (4)
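With M as the hop size and N the frame length, frame blocking can be sketched as:

```python
import numpy as np

def frame_blocking(signal, N=256, M=128):
    """Split the signal into frames of N samples, advancing M samples per
    frame, so adjacent frames overlap by N - M samples."""
    L = len(signal)
    W = (L - N) // M + 1                      # number of full frames, Eq. (4)
    return np.stack([signal[i * M: i * M + N] for i in range(W)])
```

With N = 256 and M = 128 the overlap is (256 - 128) / 256 = 50%, within the 30-50% range mentioned above.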

2) Windowing
The next process is windowing. The windowing technique is an important part of data processing, used to find the window length that is optimal for the feature extraction process [9]. There are many types of window, e.g., rectangular, Bartlett, Welch, Hanning, and Hamming [10]. The type of window used in this research is the Hamming window. The windowing process reduces signal discontinuities at the beginning and at the end of each frame. The signal generated by the windowing process is expressed by the following equation:

y(n) = x(n) * w(n),  0 <= n <= N - 1   (5)

where w(n) is the Hamming window function, so that the equation becomes:

y(n) = x(n) * (0.54 - 0.46 cos(2 pi n / (N - 1)))   (6)
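A sketch of applying the Hamming window to the blocked frames:

```python
import numpy as np

def apply_hamming(frames):
    """Multiply each frame by the Hamming window
    w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    N = frames.shape[1]
    n = np.arange(N)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))
    return frames * w
```

NumPy's built-in np.hamming(N) computes the same window.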

3) Fast Fourier Transform
The Fast Fourier Transform is used to transform a discrete-time signal from the time domain into the frequency domain [11] using the following formula:

X(k) = sum_{n=0}^{N-1} x(n) e^{-j 2 pi k n / N},  k = 0, 1, ..., N - 1   (7)

This essentially represents a decomposition of the signal into sinusoidal components, i.e., sinusoids of the same frequency but with different amplitudes and phases [12]. The FFT is an algorithm developed by Cooley and Tukey for efficiently transforming a signal from the time domain into the frequency domain.
The result of this stage is a complex number consisting of a real part and an imaginary part. These numbers will be used to characterize the pattern space in sound feature extraction.
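The FFT stage, returning the real and imaginary parts used later in feature extraction, can be sketched as:

```python
import numpy as np

def fft_features(frames):
    """FFT each windowed frame; return the real and imaginary parts."""
    spectrum = np.fft.fft(frames, axis=1)
    return spectrum.real, spectrum.imag
```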

D. Audio Feature Extraction
The preprocessed sound data undergoes feature extraction using the Virtual Center of Gravity (VCG). This method determines the sound characteristics by finding the center point of gravity of a pattern space visualized in 3-dimensional form with black, white, and grey pattern spaces. Black represents the maximum value of an object, white the minimum value, and grey the value between the maximum and minimum. The audio feature extraction process flow is shown in Fig. 6. The VCG is a feature representation of the centers of gravity of the feature space (FS/pattern space) and the background (pattern background) [13]; the VCG concept used in this study is explained through the representation shown in Fig. 7.
The representations of the real and imaginary numbers formed against the virtual pattern spaces are visualized in 3-dimensional form. The Virtual Center of Gravity is derived by calculating the distance of the real numbers to the white pattern space, the distance of the imaginary numbers to the black pattern space, and the distance of both the real and imaginary numbers to the grey pattern space. These distances are calculated using the Euclidean distance.
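A speculative sketch of this distance computation, assuming the white, black, and grey pattern spaces are represented by the constant values 0, 1, and 0.5 stated in the Results section; the paper's actual pattern-space construction may differ:

```python
import numpy as np

def vcg_features(real, imag, white=0.0, black=1.0, grey=0.5):
    """Sketch: Euclidean distance of the real part to the white space,
    of the imaginary part to the black space, and of both to the grey space."""
    d_white = np.linalg.norm(real - white)
    d_black = np.linalg.norm(imag - black)
    both = np.concatenate([np.ravel(real), np.ravel(imag)])
    d_grey = np.linalg.norm(both - grey)
    return np.array([d_white, d_black, d_grey])
```

This yields three distance features per sound, matching the three pattern features per data sample described in the Results.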

E. Classifier
The method used for classifying the sound features is K-Means clustering. K-Means is one of the data mining algorithms that can be used to group/cluster data [14]. It is a distance-based clustering algorithm that divides the data into a number of clusters and works only on numeric attributes [15]. The K-Means algorithm is often used in clustering because it produces efficient estimates and does not require many parameters. The clustering flow for the sound features is shown in Fig. 8.
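For illustration, a minimal NumPy K-Means (in practice a library implementation, such as Matlab's kmeans, would normally be used):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-Means: distance-based clustering on numeric features."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean distance)
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # recompute centers; keep a center fixed if its cluster is empty
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```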

F. Performance
Sound data that has formed clusters is tested using the extracted sound features of the test data. In the test data, P denotes the number of valid sound samples and N the number of forged sound samples. The success rate is measured with 2 error models, namely the False Acceptance Rate (FAR) and the False Rejection Rate (FRR) [16]. It is therefore necessary to find the True Positive Rate (TPR), False Positive Rate (FPR), and True Negative Rate (TNR), described as follows:

 TPR, also called sensitivity or the accuracy ratio, is the number of correctly matched valid audio samples, called True Positives (TP), divided by the number of valid audio samples (P):

TPR = TP / P   (8)

 FPR, also called the false alarm or imprecision ratio, is the number of forged audio samples incorrectly matched as valid, called False Positives (FP), divided by the number of forged audio samples (N):

FPR = FP / N   (9)

 TNR, also called specificity, is the number of correctly rejected forged audio samples, called True Negatives (TN), divided by the number of forged audio samples (N):

TNR = TN / N   (10)

 The False Acceptance Rate is the value of the False Positive Rate, expressed by the following equation:

FAR = FPR   (11)

 The False Rejection Rate is the value of the False Negative Rate:

FRR = FNR = FN / P   (12)

 Accuracy (ACC) is the percentage of correct decisions made by the prototype over all submissions:

ACC = (TP + TN) / (P + N) * 100%   (13)
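The error rates and accuracy can be computed from the confusion counts as a short sketch following Eqs. (11)-(13):

```python
def rates(TP, FP, TN, FN):
    """FAR, FRR, and accuracy (in percent) from the confusion counts."""
    P = TP + FN               # valid sound samples
    N = FP + TN               # forged sound samples
    FAR = FP / N              # False Acceptance Rate = FPR, Eq. (11)
    FRR = FN / P              # False Rejection Rate = FNR, Eq. (12)
    ACC = 100.0 * (TP + TN) / (P + N)   # Eq. (13)
    return FAR, FRR, ACC
```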

III. Results and Discussion
The training sound data is obtained from 1 person who speaks the words "Morning", "Certification of Appreciation", and "Information Technology", each consisting of 10 sound samples. The test sound data is obtained from 3 respondents who say the words "Morning", "Certification of Appreciation", and "Information Technology", each with 3 sound samples. The received sound data is placed in one folder with consecutive file names as shown in Table 1. The next process, preprocessing level 1, consists of truncation, data sampling, and data normalization. A sample result of preprocessing level 1 is shown in Fig. 9, and the best sampling for each spoken word is shown in Table 2. After obtaining the best training data, preprocessing level 2 is performed, namely frame blocking, windowing, and the Fast Fourier Transform. A sample result of preprocessing level 2 is shown in Fig. 10. Sound feature extraction with the Virtual Center of Gravity builds a virtual background/pattern space in 3-dimensional shape, visualized in white, grey, and black. The white pattern space is assigned a value of 0, the black pattern space a value of 1, and the grey pattern space a value of 0.5. Sound feature extraction generates three sound pattern features for each data sample. In the prototype visualization, the white pattern space sound feature is depicted in blue, the grey pattern space in green, and the black pattern space in yellow, as shown in Fig. 11. K-Means clustering is required for the classification of the sound features. The sound features are classified based on the type of spoken word, so 3 clusters are formed in this research: the morning cluster representing the word morning, the certi cluster representing the words certification of appreciation, and the techno cluster representing the words information technology. Voice classification is used to test the stability of each speech's sound characteristics. The sound feature visualization after clustering is shown in Fig. 12. After classification, there are differences in the cluster membership for each spoken word, which cause differences in the voice recognition of the training data. The voice recognition results are shown in Table 3.
Vol. 14, No. 2, May 2020, pp. 85-94. Kumalasari et al. (Speech classification using combination virtual center of gravity and k-means clustering…)
Table 3. Voice recognition of the training data

File Name | Morning | Certification of Appreciation | Information Technology

Testing uses the sound features of the test data. The test sound data is obtained from 3 respondents, each of whom pronounces the word morning, the words certification of appreciation, and the words information technology. There are 9 sample test sounds, whose features are tested for membership against all the clusters that have been formed. The test result is the voice recognition by the system based on the number of members in a cluster, shown in Table 4. After obtaining the test results, the FAR, FRR, and accuracy are computed with reference to equations (11)-(13); the results are shown in Table 5.

IV. Conclusion
The results showed that the voice features were built from the best training data, selected using correlation coefficients to obtain the 3 best sound samples for each category. Voice feature extraction is done using the imaginary and real numbers formed in the Fast Fourier Transform stage. The sound features are visualized in 3-dimensional shapes with white, grey, and black pattern spaces. Sound feature testing is performed using the K-Means method, which forms 3 clusters based on the spoken words, namely the morning cluster, the certi cluster, and the techno cluster. The accuracy identified from the 9 test data with different people's voices was 92.59%.