Machine Learning-Based Music Genre Classification with Pre-Processed Feature Analysis

Md Shofiqul Islam, Md Munirul Hasan, Md Abdur Rahim, Ali Muttaleb Hasan, Mohammad Mynuddin, Imran Khandokar, Md Jabbarul Islam 4
1 Faculty of Computing, Universiti Malaysia Pahang, 26600, Kuantan, Pahang, Malaysia.
2 Department of Mechanical Engineering, College of Engineering, Universiti Malaysia Pahang, 26300 Gambang, Kuantan, Pahang, Malaysia.
3 Department of Civil, Transportation Engineering, Environmental and Construction Engineering, University of Central Florida, Orlando, FL 32816, USA.
4 Department of Mathematics, National University, Gazipur-1704, Dhaka, India.


INTRODUCTION
Genre is one of the most important and widely recognized attributes of any song. Many music fans build playlists by genre, and music streaming apps such as Apple Music, Spotify, and Wynk suggest new songs to their customers based on their favorite genres. Because the music industry is evolving rapidly and categorizing new songs by hand is challenging, automated classification can help. A song can be represented as an audio signal [1] with characteristics such as bandwidth, frequency, and spectral roll-off. The spectral roll-off is the frequency below which a certain percentage (the cutoff) of the total spectral energy is contained. Bandwidth is the gap between the highest and lowest frequencies in a continuous range of frequencies. The roll-off frequency makes it possible to distinguish harmonic content (below the roll-off) from noisy sounds (above the roll-off). These characteristics can distinguish two audio signals. Using these properties, we applied data filtering to classify songs into their genres. The main issue in dealing with datasets is that the data may be incomplete or incorrect, or may contain outliers. Pre-processing therefore requires data filtering. Data filtering refers to a variety of approaches for refining datasets to meet the needs of users. It removes noisy, undesirable, and duplicate values, resulting in clean, ready-to-process data. To obtain the best result and to show how data filtering improves accuracy, we carry out a comparison study.
Mining public sentiment helps consumers better understand social networking sites and feedback. Sentiment analysis is a type of computer-based mining that saves time and money by applying various voice analysis technologies. Just as the human brain learns from examples, a machine learning system learns from sample data and then quickly predicts labels for new, unlabeled data. Machine learning algorithms outperform alternative approaches based on handcrafted features. There has been a lot of research and development in voice classification using machine learning, and the results have been rather good. Some earlier systems have extracted qualitative and meaningful knowledge; nevertheless, they are limited in coherence, and information extraction remains a significant challenge [2]. In terms of accuracy and performance, deep learning approaches outperform other comparable techniques. Deep learning has performed best in the classification of images [3][4], video [5][6], speech [7], and text [2][8][9][10], as well as in IoT-based analysis [11][12]. Several neural network models have been shown to be superior to others. For example, the Convolutional Neural Network (CNN) [13], Long Short-Term Memory (LSTM) [14], Bidirectional LSTM [15], and Gated Recurrent Unit (GRU) [16] have improved text classification significantly.
The remainder of the paper is laid out as follows: Section 2 formulates the problem and examines the context. Section 3 describes the data and its preprocessing. Section 4 details the experimental technique and development. Section 5 presents the results, and Section 6 provides the conclusion.

RELATED STUDY
The music industry has grown rapidly during the previous decade, with new songs produced every day and new artists, bands, and songwriters joining the sector. This has made it a fascinating subject of study for researchers, who have employed deep learning solutions for music genre recognition and music recommender systems. Using the GTZAN dataset, the author of [17] trained a residual neural network on audio snippets of 3 seconds duration. Overlapping characteristics of various genres were taken into account, and the author obtained a 94% accuracy rate. Many authors have built convolutional neural networks (CNN) for classification. Spectrograms, chromagrams, and Mel-frequency cepstral coefficients (MFCC) are three alternative ways to represent music. In [18], the mel-spectrograms of the audio clips were used as input to the author's models; the researcher used a double convolution layer followed by multiple further layers as well as data analysis. In [19], two classes of models were compared, with spectrograms from the sound and harmonic domains passed through a CNN architecture (VGG-16). The author achieved a precision of 65%. The author of [20] compared the benefits and drawbacks of various filter algorithms on six different datasets and also created a new hybrid algorithm to improve feature selection. For prediction, the author of [21] applied various data filtering approaches to neural networks, with local regression filtering yielding the best results. The author of [22] suggests an ensemble-based technique for wrapping the features. The dataset had a highly imbalanced class distribution, and most of the work consisted of developing, evaluating, and applying the technique on several datasets. The proposed method outperformed traditional approaches.
The author of [23] constructed a wrapper approach by extending an existing filter method and tested it on a variety of datasets. Another successful approach combined machine learning and data filtering techniques; using a Multi-Layer Perceptron (MLP), it achieved 88.10% accuracy [24]. Another method, based on a Single-Layer Feedforward Neural Network (SFNN), achieved 84.83% accuracy. A more successful method based on a Deep Attention Neural Network (DANN) achieved 90.00% accuracy.
The preceding review shows that existing music analysis algorithms have issues with context and features. A new multi-label music analysis system with improved performance and adaptability is therefore needed. To design an up-to-date and highly accurate machine learning system for music analysis, we need to address the limitations of the existing methods mentioned above. Accordingly, three objectives have been established to fulfill the research goal:
1. To create a flexible, adaptive, and high-performance hybrid deep learning approach with multi-label music analysis knowledge.
2. To propose a new technique to analyze music using the hybrid machine learning algorithm.
3. To assess the performance of the proposed algorithm on the GTZAN dataset against current methodologies.

DATA
We use the GTZAN music dataset [25] in our experiments; it is the dataset most often used for music genre classification. It contains ten genres, each with 100 audio files of 30 seconds duration. The class frequency of the dataset is presented in Table 1.

Preprocessing
When visualized, each audio clip can be represented as a sound signal [24], and this signal allows us to discriminate between two clips. We were therefore able to extract several properties from these audio waves [24]; extracting relevant audio features helps to tackle the classification problem [26]. Audio features fall into two subcategories: time-domain features and frequency-domain features.

Feature Extraction

a) Time Domain Features
We used two time-domain features in our method: (a) Root Mean Square Energy and (b) Zero-Crossing Rate.

a. Root Mean Square Energy:
The Root Mean Square Energy (RMSE) measures how loud the audio is. Because RMSE is computed frame by frame, its mean and standard deviation are calculated across all frames. The root mean square is given in equation (1), where $x_1, x_2, \ldots, x_n$ is a set of $n$ values:

$$x_{\mathrm{RMS}} = \sqrt{\frac{1}{n}\left(x_1^2 + x_2^2 + \cdots + x_n^2\right)} \tag{1}$$
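The frame-by-frame computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `frame_rms`, the frame length of 2048 samples, the hop of 512 samples, and the synthetic 440 Hz test tone are all assumptions introduced here.

```python
import numpy as np

def frame_rms(signal, frame_length=2048, hop_length=512):
    """Return the RMS energy of each frame of a 1-D signal.

    frame_length and hop_length are assumed values, not taken from the paper.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + max(0, (len(signal) - frame_length) // hop_length)
    rms = np.empty(n_frames)
    for i in range(n_frames):
        start = i * hop_length
        frame = signal[start:start + frame_length]
        # Square root of the mean of the squared samples in this frame.
        rms[i] = np.sqrt(np.mean(frame ** 2))
    return rms

# Hypothetical test tone: 1 second of a 440 Hz sine sampled at 22050 Hz.
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)

rms = frame_rms(tone)
# Summary statistics across all frames, as described in the text.
rms_mean, rms_std = rms.mean(), rms.std()
```

For a unit-amplitude sine, each frame's RMS should be close to 1/sqrt(2), so the mean is near 0.707 and the standard deviation is small.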

b. Zero-Crossing Rate:
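The zero-crossing rate is the rate at which the signal changes sign, i.e., the number of times it goes from positive to negative or from negative to positive, relative to the signal length. The Zero-Crossing Rate (ZCR) is computed using equation (2). For a signal $s$ of length $T$, the ZCR is:

$$\mathrm{ZCR} = \frac{1}{T-1}\sum_{t=1}^{T-1} \mathbb{1}_{\mathbb{R}_{<0}}\left(s_t\, s_{t-1}\right) \tag{2}$$

where the indicator function $\mathbb{1}_{\mathbb{R}_{<0}}$ equals 1 when its argument is negative and 0 otherwise.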
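A short sketch of the zero-crossing rate, assuming the standard definition in which a crossing is counted whenever the product of two consecutive samples is negative. The function name and the two toy signals are hypothetical.

```python
import numpy as np

def zero_crossing_rate(signal):
    """Fraction of consecutive sample pairs whose product is negative."""
    s = np.asarray(signal, dtype=float)
    if len(s) < 2:
        return 0.0
    # s[1:] * s[:-1] is s_t * s_{t-1} for t = 1 .. T-1.
    crossings = (s[1:] * s[:-1]) < 0
    return float(np.mean(crossings))

alternating = [1.0, -1.0, 1.0, -1.0, 1.0]  # changes sign at every step
steady = [0.5, 0.7, 0.9, 0.4]              # never changes sign

zcr_alt = zero_crossing_rate(alternating)
zcr_steady = zero_crossing_rate(steady)
```

Noisy or percussive audio tends to produce a higher value than smooth, tonal audio, which is why this feature is useful for separating genres.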

b) Frequency Domain Features
We use several frequency-domain features in our method. These are listed below:
a. Mel-Frequency Cepstral Coefficients (MFCC): A small set of features (usually about 10 to 20) that describe the contour of an audio stream.
b. Chroma Features: A feature vector of size 12, with one entry for each pitch class, i.e., the 12 unique semitones of the musical octave. It can be used to measure the degree of similarity between musical works.
c. Spectral Centroid: It denotes the signal's 'center of mass,' i.e., the frequency around which the spectrum's energy is centered.
d. Spectral Roll-off: It describes the signal's shape. It indicates the frequency at which the high-frequency components decline toward 0.
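The spectral centroid and spectral roll-off can be illustrated for a single frame with a plain FFT. This is a sketch under stated assumptions: the function name is hypothetical, the magnitude spectrum (rather than the power spectrum) is used as the weighting, and the 85% roll-off threshold is an assumed default.

```python
import numpy as np

def spectral_centroid_and_rolloff(frame, sr, roll_percent=0.85):
    """Return (centroid_hz, rolloff_hz) for one frame of audio."""
    frame = np.asarray(frame, dtype=float)
    magnitude = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    total = magnitude.sum()
    if total == 0:
        return 0.0, 0.0
    # Centroid: magnitude-weighted mean frequency ("center of mass").
    centroid = float((freqs * magnitude).sum() / total)
    # Roll-off: lowest frequency bin at which the cumulative magnitude
    # reaches roll_percent of the total.
    cumulative = np.cumsum(magnitude)
    idx = int(np.searchsorted(cumulative, roll_percent * total))
    return centroid, float(freqs[idx])

# Hypothetical frame: a 1000 Hz sine that falls exactly on an FFT bin.
sr = 8000
n = np.arange(1024)
frame = np.sin(2 * np.pi * 1000 * n / sr)
centroid, rolloff = spectral_centroid_and_rolloff(frame, sr)
```

Because all of the energy of this test frame sits in one bin, both values land on that bin's frequency; real music frames spread energy over many bins, and the two features then diverge.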

Data Filtering
The basic objective of data filtering (feature selection) is to exclude predictors that are uninformative or redundant from the model. Data filtering is critical in machine learning for the following reasons [24]:

a) Parsimony (or Simplicity)
Complex models are harder to interpret than simple ones, especially when drawing inferences.

b) Time is Money
With fewer features, the computation time falls, and thus the training time is reduced.

c) Avoiding the Curse of Dimensionality
A highly accurate model trained on a dataset with many features can be deceiving, since the accuracy may indicate overfitting and the model may not generalize to new samples.

Data Filtering Methods
Several data filtering methods are available for audio preprocessing. Those we used in our method are listed below:

a) Filter Methods
Filter approaches use statistical measures to assess the importance of predictors independently of any prediction model and keep only those that meet a set of criteria. The types of data involved, numeric or categorical, in both predictors and outcomes are factors to take into account when selecting a filter technique [24].
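As one possible illustration of a filter method, the sketch below scores each numeric feature against the class labels with a one-way ANOVA F statistic and keeps the top-scoring features. The choice of the F statistic, the function names, and the toy data are assumptions; the paper does not state which statistic was used.

```python
from statistics import mean

def anova_f_score(values, labels):
    """One-way ANOVA F statistic of a single feature against class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = mean(values)
    n, k = len(values), len(groups)
    # Between-class and within-class mean squares.
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values()) / (k - 1)
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g) / (n - k)
    return float("inf") if within == 0 else between / within

def select_top_k(rows, labels, k):
    """Return the indices of the k highest-scoring features, and all scores."""
    n_features = len(rows[0])
    scores = [anova_f_score([r[j] for r in rows], labels) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores

# Toy data: feature 0 separates the two classes, feature 1 does not.
rows = [[1.0, 5.0], [1.2, 3.0], [0.8, 4.0], [3.0, 4.0], [3.2, 5.0], [2.8, 3.0]]
labels = ["rock", "rock", "rock", "jazz", "jazz", "jazz"]
selected, scores = select_top_k(rows, labels, k=1)
```

No classifier is trained at any point, which is what distinguishes a filter method from the wrapper and embedded methods.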

b) Wrapper Methods
Wrapper approaches evaluate subsets of features by training a particular machine learning technique on them to determine the best features. For large datasets, wrapper methods take a long time to compute. Because models are trained on many different feature combinations, there is a considerable risk of overfitting. The search can proceed in three directions: Forward Selection (starts with one predictor and adds more iteratively), Backward Selection (starts with all predictors and removes them one by one iteratively), and Step-Wise Selection (bi-directional: predictors can be both added and removed at each step) [24].

c) Embedded Methods
Embedded methods combine the characteristics of both filter and wrapper feature selection methods. In these models, feature selection happens automatically during the model fitting procedure. The most common embedded methods use tree-based techniques, such as a decision tree or a random forest, in which the choice of feature at each split node is based on information gain. Other embedded approaches include the LASSO (L1) penalty and the Ridge (L2) penalty for constructing a linear model [24]. Fig. 4 shows the spectral roll-off, which is a measure of the signal's shape. It denotes the frequency below which a certain proportion of the overall spectral energy, such as 85%, is contained.

METHOD
The proposed model consists of multiple parts. First, it collects data from a source and pre-processes it in the ways described in the previous section. The pre-processed audio signal is then fed to the proposed methods. After the algorithms are executed, the final prediction and analysis are completed. Certain genres were easily classified by all of the models, while others, such as hip-hop and disco, were more difficult to classify because their sound shares characteristics with other genres, i.e., the audio was a mix of multiple genres rather than a single genre. Several machine learning algorithms were evaluated twice: first with input passed in without any filtering techniques applied, and second with filtering applied. There was a considerable difference in accuracy between the runs with and without data filtering. The models were developed and evaluated with ten traditional machine learning algorithms. Finally, we compared their results to recommend the best algorithm for music classification.
After preprocessing, the data is partitioned into two parts: 80% for training and 20% for validation. The proposed model is then implemented, and its results are analyzed to predict the music class. The main operational flow chart of our model is shown in Fig. 6.

Fig. 6. Proposed machine learning model to classify music.
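The 80/20 partition described above can be sketched with a seeded shuffle. The function name, the fixed seed, and the placeholder samples are assumptions; the paper does not say whether the split was stratified by genre, so this sketch uses a plain random split.

```python
import random

def train_validation_split(samples, labels, validation_fraction=0.2, seed=0):
    """Shuffle indices with a fixed seed and split them into two parts."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(indices) * validation_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    train_x = [samples[i] for i in train_idx]
    train_y = [labels[i] for i in train_idx]
    val_x = [samples[i] for i in val_idx]
    val_y = [labels[i] for i in val_idx]
    return train_x, train_y, val_x, val_y

# Placeholder data shaped like GTZAN: 10 genres with 100 clips each.
samples = list(range(1000))
labels = [i // 100 for i in samples]
train_x, train_y, val_x, val_y = train_validation_split(samples, labels)
```

With 1000 clips this yields 800 training and 200 validation items, and no clip appears in both parts.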

a) Naive Bayes
A Naive Bayes classifier is built on Bayes' theorem. It estimates class membership probabilities, i.e., the probability that a particular record or data point belongs to each class. The predicted class is the one with the greatest probability.
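The decision rule above can be sketched with a Gaussian Naive Bayes classifier. The Gaussian likelihood is an assumption made here for numeric audio features (the paper does not name the variant), and the function names and toy data are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate per-class prior, feature means, and feature variances."""
    by_class = defaultdict(list)
    for row, y in zip(rows, labels):
        by_class[y].append(row)
    model = {}
    for y, members in by_class.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # The small constant keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[y] = (n / len(rows), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(y):
        prior, means, variances = model[y]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances))
        return math.log(prior) + log_likelihood
    return max(model, key=log_posterior)

# Toy two-feature data for two hypothetical genres.
rows = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1], [4.0, 5.0], [4.2, 4.8], [3.9, 5.1]]
labels = ["blues", "blues", "blues", "metal", "metal", "metal"]
model = fit_gaussian_nb(rows, labels)
pred_low = predict_gaussian_nb(model, [1.1, 2.0])
pred_high = predict_gaussian_nb(model, [4.1, 5.0])
```

Working in log space avoids numerical underflow when many per-feature likelihoods are multiplied together.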

b) Stochastic Gradient Descent
Stochastic gradient descent (commonly shortened to SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g., differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, because it replaces the actual gradient (calculated from the entire dataset) with an estimate (calculated from a randomly selected subset of the data). This reduces the computational cost, especially in high-dimensional optimization problems, allowing for faster iterations in exchange for a lower convergence rate.
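To make the idea of replacing the full gradient with a mini-batch estimate concrete, the sketch below fits a one-dimensional linear model by mini-batch SGD on a squared-error loss. It is not the classifier used in the paper; the function name, learning rate, batch size, epoch count, and toy data are all assumptions.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.05, epochs=200, batch_size=4, seed=0):
    """Fit y = w * x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient estimated from this random subset only,
            # not from the entire dataset.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Noise-free toy data generated from y = 2x + 1.
xs = [i / 10 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
```

Each update touches only four samples, so an iteration is cheap; the price is a noisier path toward the minimum than full-batch gradient descent would take.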

c) K-Nearest Neighbors (KNN)
The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised learning technique that can be used for both classification and regression problems. KNN assumes that similar objects lie close together. A related technique, k-means clustering, separates observations into k groups. Because the number of clusters can be specified, it can be applied to classification when the data is divided into a number of clusters equal to or greater than the number of classes. The k-means technique tries to divide an unlabeled dataset (one with no information about class identity) into a collection of k clusters. A total of k centroids are picked at the start; a centroid is an imaginary or actual data point at the center of a cluster.
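The KNN assumption that similar objects lie close together leads to a very short classifier: find the k closest training points and take a majority vote. The sketch below uses Euclidean distance and k = 3, both of which are assumptions, as are the function name and the toy data.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    # Sort training points by Euclidean distance to the query.
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels))
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy two-feature data for two hypothetical genres.
train_rows = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
train_labels = ["pop", "pop", "pop", "reggae", "reggae", "reggae"]

pred_a = knn_predict(train_rows, train_labels, [1.1, 1.0])
pred_b = knn_predict(train_rows, train_labels, [5.0, 5.1])
```

Unlike k-means, this procedure needs the training labels, which is what makes KNN a supervised method.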

d) Decision trees
The model is represented as a binary tree, with the first node being the root node. A path through the tree is built from a series of questions about the data (features and their associated values); at each question the path splits, and the outcome is predicted at the end of the path. The leaves represent the dataset classes. Decision trees are mostly employed in multiclass classification.

e) Random Forest
Because of the large number of decision trees involved, random forests are regarded as a highly robust and accurate technique. They are less prone to overfitting than a single decision tree. The fundamental reason is that a random forest averages the predictions of all its trees, which cancels out much of the error of the individual trees.

f) Support Vector Machine
In the SVM algorithm, every data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyperplane that best separates the two classes. Although SVM is natively a binary classifier, the result for all ten classes was calculated using the one-vs-rest approach: each class is assessed independently against the rest, and the class with the highest score is chosen.
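The one-vs-rest decision rule can be sketched independently of how each binary classifier is trained. In the sketch below, each class has a hypothetical linear scoring function standing in for a trained binary SVM; the weights, biases, and function names are illustrative assumptions, not fitted values.

```python
def linear_scorer(weights, bias):
    """Return a function computing w . x + b, a stand-in for a binary SVM."""
    return lambda x: sum(w * v for w, v in zip(weights, x)) + bias

def one_vs_rest_predict(class_scorers, features):
    """Score the sample with every per-class scorer and pick the highest."""
    scores = {label: scorer(features) for label, scorer in class_scorers.items()}
    return max(scores, key=scores.get), scores

# Hypothetical hand-picked weights for three classes and two features.
scorers = {
    "classical": linear_scorer([-1.0, 0.5], 0.0),
    "metal": linear_scorer([1.0, -0.5], 0.0),
    "jazz": linear_scorer([0.1, 0.1], -0.2),
}
label, scores = one_vs_rest_predict(scorers, [2.0, 1.0])
```

With ten genres, this scheme trains ten binary classifiers, one per genre, and compares their scores for each clip.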

g) Logistic Regression
Logistic regression is a linear classification model. It is a statistical method that predicts an outcome based on past observations in a dataset. In its basic form it uses a logistic function to model a binary dependent variable, although more complex extensions exist.

h) Neural Nets
A neural network is a machine learning model inspired by the activity of the brain. Each layer of the network contains neurons, analogous to those of the central nervous system. As information passes through the connections between these neurons, the model learns from it and produces the desired result. Neural networks are often used as a non-linear framework for modeling complex relationships between inputs and outputs.

i) XGBoost
XGBoost is a popular and efficient open-source implementation of the gradient boosted trees algorithm. The weak learners in gradient boosting are regression trees, and each regression tree maps an input data point to one of its leaves, which contains a continuous score.

Evaluation Metrics
For model training, validation, and testing, we report accuracy as the main evaluation measure. The formulae for these measures are given below. We also use the area under the receiver operating characteristic (ROC) curve, mean accuracy, mean recall, and mean F1 score to assess the classification performance of the model outputs. The accuracy of the proposed model is evaluated with equation (3).
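The measures named above can be computed directly from true and predicted labels. The sketch below assumes the standard definitions (accuracy as the fraction of correct predictions, and precision, recall, and F1 averaged over classes with equal weight); the function name and the toy labels are hypothetical.

```python
def classification_report(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Toy labels for three hypothetical genres.
y_true = ["rock", "rock", "jazz", "jazz", "pop", "pop"]
y_pred = ["rock", "jazz", "jazz", "jazz", "pop", "rock"]
report = classification_report(y_true, y_pred)
```

On a balanced dataset such as GTZAN, macro-averaged recall equals accuracy, as it does in this toy example.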

Main algorithm to classify Music
This section explains how the core algorithm works on the GTZAN data. The algorithm accepts raw data from the dataset and predicts the target genre categories. The operating steps of our model at each phase are detailed in this core algorithm. After preprocessing the raw audio, it takes the result as input and forwards it through multiple stages.

Experimental setup
The model was developed on a laptop with a good Internet connection. We used Google Colab, a free cloud-based service for machine learning research and education. Its interface is similar to that of Jupyter Notebook. It comes with a fully configured environment for deep learning and free access to a powerful GPU. Hardware acceleration was enabled, and the remaining options were left at their default settings. The classification was carried out on a computer with five Intel(R) 3.60 GHz processors, 16 GB of RAM, and Windows 10 installed. The best outcomes from all experiments are shown in Table 2 and Table 3 of the results.

RESULT
Certain genres were easily classified by all of the models, while others, such as hip-hop and disco, were more difficult to classify because their sound shares characteristics with other genres, i.e., the audio was a mix of multiple genres rather than a single genre. Several machine learning algorithms were evaluated twice: first with input passed in without any filtering techniques applied, and second with filtering applied. After the signal was pre-processed in the ways described in the previous sections, there was a considerable difference in accuracy between the runs with and without data filtering. The models developed and evaluated are Naive Bayes, Stochastic Gradient Descent, KNN, Decision trees, Random Forest, Support Vector Machine, Logistic Regression, Neural Nets, Cross Gradient Booster, Cross Gradient Booster (Random Forest), and XGBoost. The accuracies obtained are: Naive Bayes 51.95%, Stochastic Gradient Descent 65.53%, KNN 80.58%, Decision trees 63.997%, Random Forest 81.41%, Support Vector Machine 75.41%, Logistic Regression 69.77%, Neural Nets 67.73%, Cross Gradient Booster 90.22%, and Cross Gradient Booster (Random Forest) 74.87%. XGBoost is the best-performing algorithm, with an accuracy of 90.22%. Table 2 compares the current model with previously constructed models. Some authors used a different dataset or built one from scratch, and some classified only 5-6 classes. Our work is therefore put up for comparison with all ten classes of the GTZAN dataset factored into the model architecture.

Authors     Model Name                                          Accuracy
[18]        Convolutional Neural Network (CNN)                  65.00%
[27]        Single-Layer Feedforward Neural Network (SFNN)      84.83%
[25]        Recurrent Neural Network (RNN)                      64.00%
[28]        Deep Belief Neural Network (DBNN)                   84.30%
[29]        Support Vector Machine (SVM)                        84.40%
[24]        Multi-Layer Perceptron (MLP)                        88.10%
[30]        Deep Attention Neural Network (DANN)                90.00%
Proposed    XGBoost                                             90.22%

We implemented and evaluated ten machine learning models, and the results are presented here. Table 3 shows the comparative accuracy analysis of the ten models. Based on the evaluation metrics, the XGBoost algorithm clearly performs better than the other algorithms. Fig. 7 shows the correlation matrix of the XGBoost results against the ten target classes, which illustrates the relation between the target music classes. The figure was produced from the XGBoost predictions on the test data, for which we held out 20% of the data.

CONCLUSION
This research uses the GTZAN dataset to build multiple machine learning models for music genre classification. The dataset included the audio features of all the audio recordings. Several machine learning methods were evaluated twice: first with input passed in without any filtering techniques applied, and then with filtering techniques applied. Applying data filtering produced a difference of 2% in accuracy compared with no filtering. The models developed and evaluated are Naive Bayes, Stochastic Gradient Descent, KNN, Decision trees, Random Forest, Support Vector Machine, Logistic Regression, Neural Nets, Cross Gradient Booster, Cross Gradient Booster (Random Forest), and XGBoost. The accuracies obtained are: Naive Bayes 51.95%, Stochastic Gradient Descent 65.53%, KNN 80.58%, Decision trees 63.997%, Random Forest 81.41%, Support Vector Machine 75.41%, Logistic Regression 69.77%, Neural Nets 67.73%, Cross Gradient Booster 90.22%, and Cross Gradient Booster (Random Forest) 74.87%. XGBoost is the best-performing algorithm, with an accuracy of 90.22%. In the future, we will improve our model to handle more classes of music on diverse datasets. We also plan to handle real-time or live music with lower computational complexity.