The influence of sampling frequency on tone recognition of musical instruments



Introduction
Technological development is increasingly driven by digital technology, which requires converting data from analog to discrete form. This conversion requires a sampler, whose major parameter is the sampling frequency. In the field of digital signal processing, the sampling frequency is generally chosen according to the Shannon sampling theorem. This theorem must be followed so that an analog signal can be perfectly recovered from its sampled version [1]. The Shannon sampling theorem is also generally followed in the field of pattern recognition, for example in research on tone recognition using time domain approaches [2][3][4][5], transformation domain approaches based on fundamental frequencies [6][7][8][9][10][11], and transformation domain approaches not based on fundamental frequencies [12][13][14][15][16][17][18].
On closer analysis, research on musical instrument tone recognition does not carry signal processing through to the stages that produce an output signal. In this case, signal processing stops before the feature extraction process. No previous research on musical instrument tone recognition has used sampling frequencies that do not follow the Shannon sampling theorem, so this question is still wide open. As a note, the advantage of using sampling frequencies that do not follow the Shannon sampling theorem is a smaller amount of data to process. This paper discusses the influence of sampling frequencies that do not follow the Shannon sampling theorem on musical instrument tone recognition. The tone recognition system uses a transformation domain approach that does not use fundamental frequencies [18]. The musical instruments used are the bellyra, flute, and pianica. They were chosen to represent tones that have one, a few, and many significant local peaks in the Discrete Fourier Transform (DFT) domain, as shown in Figure 1. More specifically, this paper explores whether there is a lowest sampling frequency, not following the Shannon sampling theorem, that can still be used in a tone recognition system.
Figure 1. Representation of tone C in the normalized DFT domain X(k) for bellyra, flute, and pianica, using a sampling frequency of 5000 Hz and a 128-point DFT. As a note, only the left half of the normalized DFT domain is shown.

Research Method

Overall system development
A tone recognition system, shown in Figure 2, was developed in order to explore the influence of sampling frequency. The input is an isolated tone signal in wav format. The output is text that indicates the recognized tone. The input and the function of each block in

Input
The input of the developed tone recognition system was an isolated tone signal from a musical instrument (bellyra, flute, or pianica) in wav format. There were eight tones (C, D, E, F, G, A, B, and C') for each musical instrument. The tones were recorded using various sampling frequencies: 5000 Hz, 2500 Hz, 1250 Hz, 625 Hz, 312 Hz, and 156 Hz. The highest sampling frequency of 5000 Hz was chosen because it satisfies the Shannon sampling theorem [1]:
$$ f_s \geq 2 f_{max} \quad (1) $$
where $f_{max}$ is the highest frequency component of the analog signal to be sampled, and $f_s$ is the sampling frequency. Based on the evaluation of the signal spectrum, the highest significant frequency components for C' on bellyra, flute, and pianica were 2109 Hz, 1602 Hz, and 2100 Hz, respectively. A recording duration of 2 seconds was chosen because, based on the evaluation of signal amplitude, it was sufficient for more than half of the recorded signal to be in the steady state condition.
The three musical instruments used to generate the tone signals above were an Isuzu ZBL-27 bellyra, a Yamaha YFL-221 flute, and a Yamaha P-37D pianica, as shown in Figure 3. So that the generated tone signals could be processed by the computer, they were captured using a Samson Meteor USB microphone.
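The check against the Shannon sampling theorem can be sketched in a few lines. This is a minimal illustration only: the function and variable names are hypothetical, and the condition is taken to be that the sampling frequency is at least twice the highest significant frequency component reported above.

```python
# Highest significant frequency components reported for tone C' (Hz).
F_MAX = {"bellyra": 2109, "flute": 1602, "pianica": 2100}

# Sampling frequencies evaluated in this research (Hz).
SAMPLING_FREQUENCIES = [5000, 2500, 1250, 625, 312, 156]


def satisfies_shannon(fs, f_max):
    """Return True when the sampling frequency is at least twice f_max."""
    return fs >= 2 * f_max


def compliant_rates(f_max_by_instrument, rates):
    """Keep the rates that satisfy the condition for every instrument."""
    worst_case = max(f_max_by_instrument.values())
    return [fs for fs in rates if satisfies_shannon(fs, worst_case)]


print(compliant_rates(F_MAX, SAMPLING_FREQUENCIES))  # [5000]
```

Under this reading, only 5000 Hz satisfies the condition for all three instruments, since twice the largest component is 4218 Hz.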

Frame Blocking
Frame blocking is the process of taking a short signal from a long signal [19]. This process is needed in order to reduce the amount of signal data from the input, which in turn reduces computing time. In this research, the short signal was taken from the beginning of the steady state region of the tone signal. Based on the evaluation of the tone signals, the steady state condition was reached 200 milliseconds after the silent part of the signal. In this research, the frame blocking length was evaluated with values of 16, 32, 64, and 128 points.
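A minimal sketch of frame blocking follows. It assumes the end of the silent part is already known as a sample index (`onset_index`, a hypothetical parameter), since the text does not say how the silent part is detected; the frame then starts 200 milliseconds later.

```python
def frame_block(signal, fs, frame_length, onset_index=0, skip_seconds=0.2):
    """Take frame_length samples starting skip_seconds after the onset.

    onset_index is an assumed input: the index where the silent part ends.
    """
    start = onset_index + int(round(skip_seconds * fs))
    frame = signal[start:start + frame_length]
    if len(frame) < frame_length:
        raise ValueError("signal too short for the requested frame")
    return frame


# Example with a dummy ramp signal sampled at 5000 Hz:
# 200 ms corresponds to 1000 samples, so the frame starts at index 1000.
dummy = list(range(2000))
print(frame_block(dummy, fs=5000, frame_length=16)[:3])  # [1000, 1001, 1002]
```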

Initial Normalization
Initial normalization is a process for setting the maximum absolute value to one. Initial normalization is required since the signals from the frame blocking process vary in maximum absolute value. Initial normalization is carried out using the following:
$$ x_{out} = \frac{x_{in}}{\max\left(|x_{in}|\right)} \quad (2) $$
where the signal vectors $x_{in}$ and $x_{out}$ are the input and output of the normalization process, respectively.
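Normalization to a maximum absolute value of one can be sketched as below. The function name is hypothetical, and the handling of an all-zero frame is an added assumption, since the text does not cover that case.

```python
def normalize(x):
    """Scale x so that its maximum absolute value becomes one."""
    peak = max(abs(v) for v in x)
    if peak == 0:
        # Assumption: an all-zero frame is returned unchanged.
        return list(x)
    return [v / peak for v in x]


print(normalize([2, -4, 1]))  # [0.5, -1.0, 0.25]
```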

Windowing
Windowing is the process of reducing discontinuities at the edges of the signal. These discontinuities appear as a result of cutting the signal in the frame blocking process. Reducing them reduces the appearance of harmonic signals after the normalized signal is transformed using the FFT (Fast Fourier Transform). The Hamming window [20] is a type of window that can be used for this purpose and is commonly used in digital signal processing [21]. In this research, the window width was the same as the frame blocking length.
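As an illustration, a Hamming window can be generated and applied as follows. This sketch assumes the common symmetric definition w(n) = 0.54 - 0.46 cos(2πn/(N-1)); the exact variant used in [20] is not stated here, and the function names are hypothetical.

```python
import math


def hamming_window(n_points):
    """Symmetric Hamming window of n_points samples (assumed definition)."""
    if n_points < 2:
        raise ValueError("window needs at least two points")
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def apply_window(frame):
    """Multiply a frame sample by sample with a window of the same length."""
    window = hamming_window(len(frame))
    return [s * w for s, w in zip(frame, window)]


# The window tapers both edges of a constant frame toward 0.08.
windowed = apply_window([1.0] * 16)
print(round(windowed[0], 2), round(windowed[-1], 2))  # 0.08 0.08
```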

FFT
The FFT transforms the discrete signal from the windowing process from the time domain to the DFT domain. Basically, the FFT is an efficient method for performing the DFT calculation. In this research, the DFT calculation was performed using a radix-2 FFT, which is widely used in the field of digital signal processing [21]. As with the windowing above, the FFT length in this research was the same as the frame blocking length.
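For illustration, a textbook recursive radix-2 decimation-in-time FFT is sketched below. It is not necessarily the implementation used in this research; it only shows how the DFT of a power-of-two-length frame can be computed by splitting it into even and odd samples.

```python
import cmath
import math


def fft_radix2(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor applied to the odd-indexed half.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


# A cosine with exactly 2 cycles in 16 points peaks at bins 2 and 14.
frame = [math.cos(2 * math.pi * 2 * n / 16) for n in range(16)]
spectrum = fft_radix2(frame)
print(round(abs(spectrum[2]), 6))  # 8.0
```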

Symmetry Cutting
Symmetry cutting is the process of discarding half of the FFT result. This is possible because the left and right halves of the FFT result are symmetric, so using only one of the two halves is sufficient. In this research, only the left half of the FFT result was used.
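A minimal sketch of symmetry cutting is given below. It assumes that the magnitude of each DFT coefficient is what is kept, which the text does not state explicitly; the function name is hypothetical.

```python
def symmetry_cut(spectrum):
    """Return the magnitudes of the left half of a DFT result.

    Assumption: magnitudes are used; for a real-valued input frame the
    magnitude at bin k equals the magnitude at bin N - k.
    """
    half = len(spectrum) // 2
    return [abs(v) for v in spectrum[:half]]


# Hand-made 4-point spectrum of a real signal (bins 1 and 3 are conjugates).
example = [4 + 0j, 1 + 1j, 0j, 1 - 1j]
print(symmetry_cut(example))
```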

Segment Averaging
Segment averaging is a process that shortens the signal obtained from symmetry cutting, such that the shortened signal still resembles the basic shape of the longer signal. The segment averaging [18] used in this research follows principles inspired by Setiawan [22]. In this research, the segment length in segment averaging was evaluated with values 1, 2, 4, ..., and $N/2$, where $N$ is the frame blocking length.
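One plausible reading of segment averaging is sketched below: the half spectrum is split into consecutive, non-overlapping segments of equal length, and each segment is replaced by its mean. The non-overlapping split is an assumption; the details are given in [18] and [22], not here.

```python
def segment_average(x, segment_length):
    """Replace each non-overlapping segment of x by its mean value."""
    if segment_length < 1 or len(x) % segment_length:
        raise ValueError("segment_length must divide the signal length")
    return [sum(x[i:i + segment_length]) / segment_length
            for i in range(0, len(x), segment_length)]


half_spectrum = [1, 3, 5, 7, 2, 4, 6, 8]
print(segment_average(half_spectrum, 2))  # [2.0, 6.0, 3.0, 7.0]
print(segment_average(half_spectrum, 4))  # [4.0, 5.0]
```

A segment length of 1 leaves the signal unchanged, while longer segments give shorter feature vectors.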

Final Normalization
As with initial normalization, final normalization is the process of setting the maximum absolute value to one. Final normalization is necessary because the results of segment averaging vary in maximum absolute value. As with initial normalization, final normalization is also carried out using (2). As a note, the final normalization result is called the feature extraction of the input signal.

Distance calculation
Distance calculation is a comparison process between the feature extraction of an input signal and a number of tone feature extractions stored in the tone database. Distance calculation is characteristic of pattern recognition using a template matching method [23,24]. The Euclidean distance function can be used for this kind of distance calculation. This distance function is commonly used in the field of pattern recognition [25].

Tone decision
Tone decision is the process of deciding the output tone that corresponds to the input signal. The first step of tone decision is finding the minimum value among the distance calculation results. These are the distances between the feature extraction of the input signal (after the final normalization process) and each of the tone feature extractions in the tone database. The next step is to decide the output tone: the tone associated with the feature extraction in the tone database that has the minimum distance is chosen as the output tone.
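Distance calculation and tone decision together amount to nearest-template matching, which can be sketched as follows. The function names and the toy database values are hypothetical and purely illustrative; they are not measured features.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def decide_tone(feature, tone_database):
    """Return the tone whose stored feature is closest to the input."""
    return min(tone_database,
               key=lambda tone: euclidean_distance(feature, tone_database[tone]))


# Toy database with made-up 4-point features (illustrative values only).
toy_database = {
    "C": [1.0, 0.2, 0.0, 0.0],
    "D": [0.1, 1.0, 0.3, 0.0],
    "E": [0.0, 0.2, 1.0, 0.1],
}
print(decide_tone([0.2, 0.9, 0.4, 0.0], toy_database))  # D
```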

Tone database
The tone database shown in Figure 2 is generated using the tone feature extraction shown in Figure 4. In this research, for each musical instrument, we took 10 samples for each tone (C, D, E, F, G, A, B, and C'). It was assumed that by taking 10 samples, all variations of each tone signal were captured. These 10 samples gave 10 feature extraction results for each tone, which were then averaged as follows:
$$ R_T = \frac{1}{10} \sum_{i=1}^{10} Z_i \quad (3) $$
where the vectors $\{Z_i \mid 1 \leq i \leq 10\}$ are the 10 results of feature extraction, and the vectors $\{R_T \mid T = \text{C, D, E, F, G, A, B, and C'}\}$ are the vectors of the eight tones stored in the tone database. As a note, a separate tone database is generated for each combination of sampling frequency, frame blocking length, and segment length.
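The averaging step that produces one database entry can be sketched as an element-wise mean over the feature vectors of one tone. The function name is hypothetical, and the two short vectors in the example stand in for the 10 feature extraction results.

```python
def average_features(features):
    """Element-wise mean of equally long feature vectors."""
    count = len(features)
    if count == 0:
        raise ValueError("at least one feature vector is required")
    return [sum(column) / count for column in zip(*features)]


# Two made-up feature vectors standing in for the 10 samples of one tone.
print(average_features([[1.0, 0.0, 0.5], [0.5, 0.5, 0.5]]))  # [0.75, 0.25, 0.5]
```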

Test Tones and Recognition Rate
Test tones were used to examine the developed tone recognition system. The system was tested using 160 tones for each musical instrument, each sampling frequency, each frame blocking length, and each segment length. The 160 tones came from eight tones (C, D, E, F, G, A, B, and C'), each of which was recorded 20 times. In order to measure the performance of the developed tone recognition system, we used the recognition rate formula as follows:
$$ \text{Recognition rate} = \frac{\text{number of correctly recognized tones}}{\text{number of test tones}} \times 100\% \quad (4) $$
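The recognition rate can be computed as the percentage of test tones whose recognized label matches the true label. The sketch below assumes this standard definition; the label lists are synthetic and only demonstrate the arithmetic for 160 test tones.

```python
def recognition_rate(predicted, actual):
    """Percentage of test tones that were recognized correctly."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and equally long")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


# Synthetic example: 8 tones x 20 recordings = 160 test tones.
tones = ["C", "D", "E", "F", "G", "A", "B", "C'"]
actual = [t for t in tones for _ in range(20)]
predicted = list(actual)
predicted[:8] = ["?"] * 8  # pretend 8 tones were misrecognized
print(recognition_rate(predicted, actual))  # 95.0
```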

Results and Analysis
The developed tone recognition system, shown in Figure 2, was used to test the effect of sampling frequency on tone recognition. The results of these tests are shown in Tables 1, 2 and 3, which give the results for bellyra, flute, and pianica, respectively. As a note, based on Figure 1, bellyra, flute, and pianica tones have one, a few, and many significant local peaks in the DFT domain, respectively.
Tables 1 and 2 indicate that lowering the sampling rate from 5000 Hz down to 312 Hz has only a small influence on the recognition rate, which is reduced within a range of 5%. However, Table 3 indicates that over the same range there is no influence on the recognition rate. As a note, the highest frequency components of bellyra, flute, and pianica are 2109 Hz, 1602 Hz, and 2100 Hz, respectively. Therefore, based on these highest frequency components, (1), and also Tables 1, 2 and 3, the sampling frequency of 5000 Hz follows the Shannon sampling theorem, whereas sampling frequencies of 2500 Hz and below deviate from it. From the point of view of signal reconstruction, in general, when the sampling frequency decreases, the aliasing area (which is a high frequency region) increases. Therefore, the effect of decreasing the sampling frequency is that of low pass filtering with a decreasing cutoff frequency. By the same reasoning, from the point of view of signal sampling, a decrease in the sampling frequency also acts as low pass filtering with a decreasing cutoff frequency. The effect is that fewer significant frequency components can be obtained. Therefore, if the signal has few significant frequency components (represented here by the number of significant local peaks in the DFT domain), the recognition process becomes more sensitive to a decrease in the sampling frequency, because fewer significant frequency components remain to distinguish one tone from the others. This decreases the recognition rate.
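As a side illustration of the aliasing mentioned above, the standard folding relation gives the apparent frequency of a sinusoidal component after sampling. This sketch is not part of the recognition system; the function name is hypothetical, and it only shows where the reported highest components would appear at lower sampling frequencies.

```python
def apparent_frequency(f, fs):
    """Frequency (Hz) at which a component f appears after sampling at fs.

    Uses the standard folding relation |f - fs * round(f / fs)|, which
    always yields a value between 0 and fs / 2.
    """
    return abs(f - fs * round(f / fs))


for fs in (5000, 2500, 312):
    print(fs, apparent_frequency(2109, fs), apparent_frequency(1602, fs))
```

At 5000 Hz both components keep their original frequency, while at 2500 Hz the 2109 Hz component appears at 391 Hz.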
From the point of view of data size, based on Tables 1 and 2, a decrease in the sampling frequency of 93.76% (from 5000 Hz down to 312 Hz) will reduce data size by 93.76%. However, a decrease in this data size will only cause a decrease in the recognition rate of less than 5%. This indicates that the tone recognition system is not sensitive to sampling frequencies that do not follow the Shannon sampling theorem.
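The data size figures can be verified with simple arithmetic. The sketch below uses hypothetical function names and assumes that data size is proportional to the number of samples in a 2 second recording.

```python
def reduction_percent(fs_high, fs_low):
    """Percentage decrease when going from fs_high to fs_low."""
    return 100.0 * (fs_high - fs_low) / fs_high


def samples_in_recording(fs, duration_seconds=2):
    """Number of samples in a recording of the given duration."""
    return fs * duration_seconds


print(round(reduction_percent(5000, 312), 2))  # 93.76
print(samples_in_recording(5000), samples_in_recording(312))  # 10000 624
```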

Conclusion
This research aimed to explore whether there is a lowest sampling frequency, not following the Shannon sampling theorem, that can be used in a tone recognition system. For this purpose, we used a tone recognition system with segment averaging for feature extraction and template matching for classification.
Based on our experiments, for sampling frequencies as low as 312 Hz, when the tone recognition system is used to recognize tones that have one or a few significant local peaks in the DFT domain (such as bellyra and flute tones), the sampling frequency has a small influence on the recognition rate, reducing it within a range of 5%. However, when the system is used to recognize tones that have many significant local peaks in the DFT domain (such as pianica tones), the sampling frequency has no influence on the recognition rate. If a reduction in recognition rate within a range of 5% is acceptable, a sampling frequency as low as 312 Hz can be used for musical instrument tone recognition. In other words, if that reduction is acceptable, the sampling frequency does not need to follow the Shannon sampling theorem when recording musical instrument tones.
For further research, the exploration of sampling rates that do not meet the Shannon sampling theorem can be conducted for other tone recognition systems. In this case, the tone recognition systems could use a different approach of feature extraction (not segment averaging), and a different approach of classification (not template matching).