Classifying Music by Genre Using Machine Learning Techniques

Joos Akkerman (2690849), Jacob Cytron (2689488), Riley Hager (2689507),
Jennifer Lu (2689523) and Guido Vaessen (2597917).

March 29th 2020

Abstract
With digital streaming at the forefront of music consumption today, classification of songs is
immensely important to make suggestions to listeners. We chose to work with arguably one of the
most important music classifiers: genre. We modeled this feature using k-Nearest Neighbors, a
Decision Tree, and a Neural Network to see which would most accurately classify our set of songs.
We found that the k-Nearest Neighbors and the Decision Tree models did not reach the desired level
of accuracy when predicting genre. Our Neural Network, however, achieved a higher accuracy,
with certain genres proving more easily predictable than others.

1 Introduction
As the world becomes more digital, audio streaming services have overtaken the majority of the music
industry. Platforms such as Spotify, Apple Music, and SoundCloud have curated catalogs of over 50
million songs each, containing gigabytes of files and track information. With the quantity of new music
produced increasing, the need for accurate database management and classification grows in proportion.
Being able to identify and label tracks by genre is a key function for these applications, allowing users to
search for and play songs with ease.
Music genre classification using machine learning techniques has been attempted through multiple
approaches. Oftentimes the approaches rely on spectrograms, which are graphical representations of
audio data. While this yields promising results, processing music files into spectrograms and performing
image processing is computationally intensive. Therefore, in our research, we looked at a variety of
machine learning algorithms in search of a less intensive way to classify music.
Our paper explores the performances of our models, given various numerical values of technical features
extracted from different segments of audio files in our dataset. Our data is a subset of the Million Song
Dataset provided by the Free Music Association and will be further discussed in section 2. Section 3
discusses the models we regarded promising for analysing the type of data we use. Section 4 discusses
the results each model has given, and in section 5 a final conclusion is drawn.

2 Data Specifications
The data used for training the models is metadata which describes technical features of the given song.
Each song is labelled with one genre, which is the dependent variable. In the table below the amounts
for each parameter are shown:

Parameters           Amount
Classes (genres)     8
Total Samples        8000
Samples per class    ∼1000
Dimensionality       518

Figure 1
As shown in Figure 1, the dataset is roughly evenly distributed over eight genres: Instrumental, Folk,
Electronic, Pop, International, Hip-Hop, Rock, and Experimental.
The features in the dataset (described below) we examined are: (1) Duration – the length of the
song; (2) Chroma cens – The chroma variant feature that is robust to parameters such as timbre,
dynamics, articulation, and tempo; (3) Chroma cqt – The Constant-Q chromagram; (4) Chroma stft –
The normalized energy for each chroma bin; (5) Mfcc – The mel-frequency cepstral coefficients sequence;
(6) Spectral bandwidth – The frequency bandwidth; (7) Spectral centroid – The centroid frequency
extracted from a magnitude spectrogram that is normalized and treated as a distribution over frequency
bins; (8) Spectral contrast – The energy contrast value corresponding to a given octave-based frequency;
(9) Spectral roll-off – The roll-off frequency, which is the center frequency for a spectrogram bin such
that at least 85% of the energy of the spectrum in this frame is contained in this bin and the bins below;
(10) Tonnetz – The tonal centroid features; and (11) Zero-crossing Rate – The fraction of zero-crossings
in each frame. The features were extracted from each audio track and its segments using LibROSA, a
Python package for music and audio analysis.

3 Specification of Used Methods


We apply three ML methods to the extracted features: k-Nearest Neighbors, a Decision Tree, and a
Neural Network. This section describes how these methods work and their potential limitations.
Additionally, we transformed our data using Principal Component Analysis, as described below.

3.1 Principal Component Analysis


Given the high dimensionality of the dataset, we performed Principal Component Analysis (PCA) on our
dataset. The performance of our models on the pre-processed data will be compared to the performance
on the raw data in section 4. PCA extracts the most relevant information from a dataset, which it
converts to new variables, called principal components. This allows for a quicker classification, as the less
relevant information is filtered out.
PCA is typically conducted on inter-correlated data, because if multiple features are linearly correlated
with each other, they can be summarized by a linear combination. PCA finds the orthogonal directions along
which the data varies most and expresses them as linear combinations of the original features, called
principal components. It does this for a specified number of principal components, which is a
hyperparameter [1]. Given that audio data is sequential, it is likely intercorrelated and thus well suited for PCA.
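For reference, the standard formulation of PCA can be sketched as follows. The symbols X, C, w_j, λ_j, W_k and Z are notation introduced here for illustration and do not appear elsewhere in this paper; X is assumed to be the n × d feature matrix with each column centred to zero mean.

```latex
% Sample covariance of the centred data and its eigendecomposition
C = \frac{1}{n-1} X^{\top} X, \qquad
C\, w_j = \lambda_j w_j, \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0

% Projection onto the first k principal components (k is the hyperparameter)
W_k = [\, w_1 \;\; w_2 \;\; \cdots \;\; w_k \,] \in \mathbb{R}^{d \times k}, \qquad
Z = X W_k \in \mathbb{R}^{n \times k}

% Fraction of the total variance retained by the first k components
\frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{d} \lambda_j}
```

The last ratio is one common way to judge how many components to keep, although in this paper the number of components is chosen by validation accuracy instead.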

3.2 k-Nearest Neighbors


The first ML method is k-nearest neighbors (kNN). The kNN algorithm looks at the k neighbors
closest to the point of interest in the feature space. Then, based on the class to which most of these
neighbors belong, it estimates the class of the point of interest. This method is simple and often
effective, but is sensitive to the choice of k [2]. Choosing k too small makes kNN sensitive to local noise
in the sample data, while choosing k too large blurs the boundaries between classes,
because points from other clusters are also taken into account. There is no single way to determine the
optimal value of k. Hence, k is often determined by trial and error [2].
kNN has been implemented by Kostrzewa, Brzeski and Kubanski [3] to classify music genres. The
highest accuracy they achieved was 0.735. However, the data they used is divided into two classes, whereas
the data we are using is divided into eight classes and has significantly more instances and dimensions.
Nevertheless, their results do indicate kNN could be applicable. In case the size of the dataset poses a
problem, we will also train the kNN algorithm on data augmented by PCA and compare the results. We
will also test whether weighting neighbors based on their proximity improves performance [4].
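The voting rule described above, with and without distance weighting, can be sketched in a few lines of plain Python. This is a minimal illustration and not the implementation used for our experiments; the function name `knn_predict`, the toy points and the small epsilon guarding against division by zero are all assumptions made for the example.

```python
import math
from collections import defaultdict


def knn_predict(train_X, train_y, query, k=3, weighted=False):
    """Predict the class of `query` from its k nearest training points."""
    # Euclidean distance from the query to every training point, nearest first.
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = defaultdict(float)
    for d, label in dists[:k]:
        # Uniform vote, or inverse-distance vote (epsilon avoids division by zero).
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)


# Hypothetical 2-D toy data: two "rock" points near the query, three "folk" points far away.
train_X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
train_y = ["rock", "rock", "folk", "folk", "folk"]
query = (0.2, 0.1)

plain = knn_predict(train_X, train_y, query, k=5)
weighted = knn_predict(train_X, train_y, query, k=5, weighted=True)
print(plain, weighted)
```

With k equal to the full sample size, the unweighted vote follows the majority class regardless of where the query lies, whereas the inverse-distance vote lets the two nearby points dominate. This is the effect of an overly large k mentioned above, and the motivation for testing weighted neighbors.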

3.3 Decision Tree


Another relatively simple and widely used ML algorithm is the decision tree. A decision tree is a tree-like
structure in which each node represents a split on one of the features of the data. Each split
is chosen such that the Gini impurity after the split is as low
as possible [5]. The Gini impurity is the probability that a randomly chosen point is misclassified
when it is labelled at random according to the class distribution in its node.
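To make the splitting criterion concrete, the sketch below computes the Gini impurity as 1 minus the sum of squared class proportions and searches one feature for the threshold with the lowest weighted impurity. It is an illustrative, simplified version of what a decision tree library does internally; the helper names and the toy `tempo` feature are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum over classes of (class proportion)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(values, labels, threshold):
    """Size-weighted Gini impurity of the two children of `value <= threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)


def best_threshold(values, labels):
    """Try the midpoints between consecutive distinct values and keep the best one."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return min(candidates, key=lambda t: split_impurity(values, labels, t))


# Hypothetical single feature and labels, for illustration only.
tempo = [60, 70, 80, 120, 130, 140]
genre = ["folk", "folk", "folk", "rock", "rock", "rock"]

print(gini_impurity(genre))                # impurity before the split
print(best_threshold(tempo, genre))        # threshold that separates the classes
```

For eight perfectly balanced classes the impurity at the root is 1 - 8 · (1/8)² = 0.875, which is the starting point for a tree on a dataset shaped like ours.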
One of the dangers of using a decision tree is that it tends to overfit to the training data. The tree can
become so big that every instance has its own unique path. This gives a high accuracy on the training
set but a low accuracy on the validation set. In these cases the decision tree
captures not only the trend in the training set but also the noise [6].
Numerous techniques exist for pruning a decision tree to prevent overfitting; those we used are discussed in
section 4.2.
Decision trees have been successfully used for music genre classification, for example by Downie et al.
[7]. In that example the feature space was 40-dimensional with 14 classes, which suggests that a decision tree
should be applicable to our dataset as well. We will also train our tree model on PCA-augmented data
and compare the results.

3.4 Neural Network


The third ML method is a Deep Feedforward Neural Network, which is a type of deep learning neural
network that uses two or more hidden layers. Neural networks are built up of multiple layers, consisting
of a set of processors, called neurons. These neurons receive information from one or multiple neurons in
a previous layer, process the information and send it to one or multiple neurons in the next layer. We use
a Feedforward Network, which means input layers only send information to output without back loops
while keeping all nodes connected [8].
The connections between neurons are weighted according to their contribution to the assigned task. This is
done through a process called backpropagation. The connections between nodes are first assigned initial
weights, after which a loss function determines how well the network predicts. This loss function measures
the difference between the prediction and the true outcome, given the input values. Then, the
backpropagation algorithm iteratively changes the weights and recalculates the loss, improving the predictive
ability of the model. How the weights should be changed is determined through gradient descent, which
adjusts each weight in the direction that decreases the loss [9].
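The update loop described above can be illustrated on the smallest possible network: a single sigmoid neuron with one weight and one bias. The sketch uses plain gradient descent with a hypothetical learning rate of 0.5 and made-up data; it is not our Keras model, which uses the ADAM optimizer and several layers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, xs, ys):
    """Mean binary cross-entropy of a single sigmoid neuron."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)


def gradient_step(w, b, xs, ys, lr):
    """One gradient-descent update; for this loss, dL/dz = p - y per sample."""
    n = len(xs)
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
    # Move against the gradient, i.e. in the direction that lowers the loss.
    return w - lr * grad_w, b - lr * grad_b


# Hypothetical 1-D data: negative inputs belong to class 0, positive inputs to class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b = 0.0, 0.0                      # initial weights
initial_loss = loss(w, b, xs, ys)
for _ in range(100):                 # iteratively update and re-evaluate
    w, b = gradient_step(w, b, xs, ys, lr=0.5)
final_loss = loss(w, b, xs, ys)
print(initial_loss, final_loss)
```

In a multi-layer network, backpropagation applies the chain rule to obtain the same kind of gradient for every weight in every layer.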
Neurons can process the received information in different ways, of which Sigmoid, Tanh and Relu
activation are commonly chosen options. Sigmoid and Tanh activation are very similar methods that
map input values to the intervals (0, 1) and (-1, 1), respectively. Both are S-shaped (sigmoidal) functions,
which means extreme values are curbed to the respective edges of the interval [10]. The functions
for Sigmoid and Tanh activation are defined below. A disadvantage of these methods is the vanishing
gradient problem. When the NN has multiple layers using these activations, the gradient gets pushed
towards zero: the functions are nearly flat for large inputs, so their derivatives are small, and these
small derivatives are multiplied together across layers. An often used
alternative that does not have this problem is Relu activation. This is a very simple function that sets
all negative inputs to zero. This activation usually has better performance than Sigmoid and Tanh.
A disadvantage is that negative values are lost [10]; different approaches exist to address this.
However, we will only implement Relu in its basic form as defined below:

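$$
\text{Sigmoid: } \sigma(x) = \frac{1}{1 + e^{-x}} \qquad
\text{Tanh: } \tanh(x) = \frac{2}{1 + e^{-2x}} - 1 \qquad
\text{Relu: } f(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \le 0 \end{cases}
$$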
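The three activation functions defined above translate directly into code. The sketch below also gives a rough, weight-free illustration of the vanishing gradient problem: the derivative of the Sigmoid is at most 0.25, so a chain of such derivatives shrinks quickly with depth, while the Relu derivative is 1 for positive inputs. The depth of 5 is an arbitrary value chosen for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    # Same form as in the text: 2 / (1 + e^(-2x)) - 1
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0


def relu(x):
    return x if x > 0 else 0.0


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # maximal at x = 0, where it equals 0.25


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


depth = 5                                    # hypothetical number of stacked layers
sigmoid_chain = sigmoid_grad(0.0) ** depth   # best case for Sigmoid: 0.25 ** 5
relu_chain = relu_grad(1.0) ** depth         # stays at 1 for positive inputs
print(sigmoid_chain, relu_chain)
```

Real gradients also include the layer weights, so this is only an upper-bound style intuition, not an exact computation for any particular network.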

4 Results
4.1 k-Nearest Neighbors
To determine the optimal value of k, the algorithm was trained with multiple values of k and the resulting
accuracy scores were compared. This was repeated 20 times, and the average outcomes were computed. In
the graph below, the results of the iterations are shown.

Figure 2: Average accuracy scores for different values of k, performed 20 times for each value of k.
Accuracy scores are based on the validation dataset, not the test dataset.

The graph also shows the accuracy scores of the kNN algorithm after PCA has been performed on the
data. The average of multiple iterations is taken because the performance of kNN differs depending on
the iteration. This is likely due to the differing splits in validation and training set, which was conducted
for each iteration. Since the dataset is not very big, this has an impact on the classifying power of the
model. Therefore, to assess the performance of different values of k, we look at the average performance.
After a k of 10, average performance of kNN stabilizes at around 0.39 for non-PCA data and 0.4 for PCA
data. Given the small variation in average performance, we picked an optimal value of k of 25.
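The averaging procedure described above can be sketched generically as follows. The function `repeated_holdout`, the 20% validation fraction and the majority-class stand-in model are assumptions made for this illustration; in our experiments the scored model was kNN, and the function would be called once per candidate value of k.

```python
import random
import statistics


def repeated_holdout(evaluate, n_samples, n_repeats=20, val_fraction=0.2, seed=0):
    """Average a validation score over several random train/validation splits.

    `evaluate(train_idx, val_idx)` is a user-supplied callable returning an accuracy.
    """
    rng = random.Random(seed)
    n_val = int(n_samples * val_fraction)
    scores = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)                     # a fresh random split each repeat
        val_idx, train_idx = indices[:n_val], indices[n_val:]
        scores.append(evaluate(train_idx, val_idx))
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical toy labels and a trivial stand-in model (predict the majority class).
labels = [0] * 60 + [1] * 40


def majority_baseline(train_idx, val_idx):
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in val_idx) / len(val_idx)


mean_acc, std_acc = repeated_holdout(majority_baseline, len(labels))
print(mean_acc, std_acc)
```

Reporting the standard deviation alongside the mean makes the split-to-split volatility visible, which is the reason for averaging in the first place.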
To determine how many components PCA should reduce the dataset to, we iterated over different
numbers of components. For each number, PCA was applied to the data, and the kNN algorithm was run
on the transformed data for values of k between 1 and 40.
This yielded the accuracy scores shown in the graph below. Due to the volatile performance of kNN, it
is not clear which number of components performs best. Therefore, we picked 100 components since this
lies in the middle of the range considered.

Figure 3: Relation between number of components (amount of features PCA reduces the data to, originally
516) and the achieved accuracy scores in kNN. Accuracy scores are based on the validation dataset, not
the test dataset.

After the hyperparameters were set at k is 25 and number of components is 100, the kNN algorithm
was run on the test data. This was done with both the standard kNN algorithm and kNN with weighted
distances, based on inverse Euclidean distance. This yielded the following confusion matrices:

Class labels in the confusion matrices: 0 = Hip-Hop, 1 = Pop, 2 = Folk, 3 = Experimental, 4 = Rock,
5 = International, 6 = Electronic, 7 = Instrumental

Figure 4: Confusion matrix for k=25, with PCA and/or weighted distances applied

The highest accuracy on the test set with a value for k of 25, weighted distances and PCA applied is
0.436. From this we conclude that the relatively simple method kNN is not very successful in classifying
songs based on music genre. PCA and weighted distances do improve performance, albeit slightly.

4.2 Decision Tree


The Decision Tree model was run with and without first applying PCA to the data and for different
training/validation splits to get a good indication of the ideal parameters. Doing multiple train/validation
splits helps in identifying ideal parameters for the whole dataset and not the noise in a particular training
set. To prevent overfitting, the techniques as suggested by Bramer [6] were applied. Firstly, we started
by pre-pruning the tree by running the model for different depths, the results of which are shown in figure
5.

Figure 5: Accuracies for a decision tree with different depths with and without PCA applied for 10
different training validation splits

As seen in figure 5, the accuracies on the validation set are highest for a depth of 8 with an accuracy
of 0.408. It also becomes clear from Figure 5 that unlike with kNN, the accuracies of the decision tree
are lower after applying PCA. This indicates that for a decision tree it is better to have more features
because there are more options to choose from for a split.
Another way to improve the accuracy is to set a maximum on the number of leaves
of which the tree consists. This method, like the maximum depth, prevents the tree from growing too
big and overfitting to the training data. To determine the optimum number of leaves, accuracies on the
validation set were calculated for maximum numbers of leaves between 10 and 10^5 on a logarithmic scale.
This gives an indication of what the ideal size of the tree would be. In Figure 6, it can be seen that the
best accuracies are achieved for tree sizes between 80 and 110. To find the ideal tree size, the accuracies
for all values between 80 and 110 were calculated and are shown in Figure 6 as well. For these runs there was
no maximum depth limit.
The accuracy on the validation set was highest for a tree consisting of 94 leaves, with an accuracy of 0.428.

Figure 6: Accuracies for decision trees of different sizes

For the test set the maximum depth was set to 7 and the maximum number of leaves to 94, which gives
an accuracy of 0.398. Ideally we would test all combinations of parameter values and see their mutual influence on
the accuracy, but this is too computationally expensive and such methods of parameter fitting go beyond
the scope of this course, so we chose to tune the parameters independently of each other.

4.3 Neural Network


Due to the dataset’s large number of features, it was decided to minimize the number of hidden layers
in the network to reduce computation time. The network therefore features two hidden layers and was
built using the Python packages keras and sklearn.
First the data was normalized using the sklearn package in Python. The eight genre labels in the
dataset were then coded into eight numbers. While this allows the network to numerically classify the
genres, it introduces new problems into the dataset. Hip-Hop is not numerically greater than Pop, and
Electronic is not less than Folk. To solve this, a One Hot Encoder is used to split the genres into eight
binary variables that the network uses as output for training - one for each genre.
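The encoding step can be illustrated without any library. The sketch below maps each genre to a vector of eight binary values and decodes a vector of network outputs back to a genre by taking the largest entry. The ordering of the genres is an assumption for the example, and the helper names are hypothetical; in our pipeline this step was handled by the One Hot Encoder from sklearn.

```python
# Genre order is assumed here for illustration; any fixed order works.
genres = ["Hip-Hop", "Pop", "Folk", "Experimental",
          "Rock", "International", "Electronic", "Instrumental"]
index_of = {genre: i for i, genre in enumerate(genres)}


def one_hot(genre):
    """Return a list of eight 0/1 values with a single 1 at the genre's index."""
    vec = [0] * len(genres)
    vec[index_of[genre]] = 1
    return vec


def decode(scores):
    """Map a vector of eight output scores back to the genre with the highest score."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return genres[best]


print(one_hot("Pop"))
print(decode([0.05, 0.10, 0.60, 0.05, 0.05, 0.05, 0.05, 0.05]))
```

Because every genre gets its own output, no artificial ordering such as "Hip-Hop is greater than Pop" is imposed on the labels.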
The neural network model is trained using a 90/10 train/test split. Cross-validation is not used,
so as to keep the test set the same as for the other models. For training the network, the ADAM optimizer was
used with the recommended default learning rate α = 0.001 [11]. To determine the hyperparameters of
the network, the numbers of neurons in the first and second hidden layers and the activation functions
were tested. The numbers of neurons considered for the first hidden layer are 10, 50, 100, 200,
and 500. The numbers of neurons considered for the second hidden layer are 5, 15, 25, 35, 45, and
55. The Tanh activation and no dropout were used. This yielded the following accuracies:

Layer 1 Neurons \ Layer 2 Neurons       5      15      25      35      45      55
10                                  .5292   .5268   .5227   .5257   .5209   .5140
50                                  .4944   .4825   .4823   .4815   .4819   .4821
100                                 .5219   .5213   .5227   .5219   .5244   .5229
200                                 .5225   .5269   .5298   .5332   .5297   .5298
500                                 .4668   .4900   .4935   .4970   .4984   .5000
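The grid in the table above can be summarised with a few lines of Python. The accuracies are copied from the table; the variable names are hypothetical. The snippet finds the single best cell and the first-layer size with the highest mean accuracy across the second-layer sizes, which is one simple way to compare rows.

```python
layer2_sizes = [5, 15, 25, 35, 45, 55]

# Rows keyed by the number of neurons in the first hidden layer (values from the table).
accuracy = {
    10:  [0.5292, 0.5268, 0.5227, 0.5257, 0.5209, 0.5140],
    50:  [0.4944, 0.4825, 0.4823, 0.4815, 0.4819, 0.4821],
    100: [0.5219, 0.5213, 0.5227, 0.5219, 0.5244, 0.5229],
    200: [0.5225, 0.5269, 0.5298, 0.5332, 0.5297, 0.5298],
    500: [0.4668, 0.4900, 0.4935, 0.4970, 0.4984, 0.5000],
}

# Best single (layer 1, layer 2) combination.
best_acc, best_l1, best_l2 = max(
    (acc, l1, l2)
    for l1, row in accuracy.items()
    for l2, acc in zip(layer2_sizes, row)
)

# Mean accuracy per first-layer size, as a simple summary of each row.
row_mean = {l1: sum(row) / len(row) for l1, row in accuracy.items()}
best_row = max(row_mean, key=row_mean.get)

print(best_l1, best_l2, best_acc)
print(best_row, round(row_mean[best_row], 4))
```

Both criteria point to 200 neurons in the first hidden layer, with the best cell at 35 neurons in the second.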

For Layer 1 the accuracies were all similar, but 10 neurons and 200 neurons had the best overall
accuracies. 200 neurons was chosen because its accuracies were more consistent. For Layer 2, the
accuracy trended upward until 35 neurons, so that was chosen as well. The next hyperparameter was
the activation function. Three neural networks were trained using the three most common activation
functions: Relu, Sigmoid, and Tanh.

Activation Function    Relu     Sigmoid   Tanh
Accuracy               .5957    .5457     .5332

The accuracies of the Tanh and Sigmoid functions were similar, which makes sense, as these functions
are fundamentally similar. The Relu function performed better, by roughly 0.05 to 0.06. Due
to its accuracy advantage, the Relu function was chosen. Below is a summary of the hyperparameter
selection:

Neurons in 1st Hidden Layer    200
Neurons in 2nd Hidden Layer    35
Activation Function            Relu

The resulting neural network had an accuracy of .5927. The confusion matrix, shown below in Figure
7, gives a lot of insight into the problems a genre classifier runs into. Generally, the confusion matrix shows
a diagonal line, indicating that the classifier performs well. Out of the eight genres, Hip Hop performed
the best. Rock and International also performed well. Instrumental and Folk were often misclassified as
Experimental, implying that those genres share some similarities. Pop was the most misclassified genre,
notably being misclassified as Hip Hop and Rock.

Figure 7: Confusion Matrix for DFF Neural Network
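For readers unfamiliar with how such a matrix is read, the sketch below builds a confusion matrix from true and predicted class indices and derives the overall accuracy and the per-class recall from it. The three-class toy labels are made up for illustration and are unrelated to our results.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Share of samples on the diagonal, i.e. correctly classified."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def per_class_recall(matrix):
    """For each true class, the share of its samples that were predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Hypothetical labels for three classes.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 0, 1, 2]

cm = confusion_matrix(y_true, y_pred, 3)
for row in cm:
    print(row)
print(accuracy(cm), per_class_recall(cm))
```

Off-diagonal entries show which classes are confused with which; a large entry in row i, column j means that class i is often predicted as class j.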

To further test whether the classifier’s implications about the genres were valid, the same neural network was
trained on the output of each genre’s binary variable. This shows how the classifier performs when
classifying whether a song is or isn’t of a single genre. The precision, recall, and false positive rates of
each genre were then compared.

(a) ROC Space (b) Precision vs. Recall

Figure 8: Analysis of Binary Neural Networks

When switching to a binary classifier, class imbalance becomes a problem as 7/8 of the output is
zero. This has an impact on the metrics above, notably the false positive rate, which is deceptively low.
The above figures further confirm that in relation to each other, Hip-Hop, Rock, and International are
among the best-classified genres, while Pop and Experimental are among the worst. Folk
music, however, performed far better in a binary classifier, which shows an inconsistency between the two
classifiers.
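The effect of the 1-in-8 class imbalance on these metrics can be shown with a deliberately useless classifier. The sketch below computes precision, recall, false positive rate and accuracy for a hypothetical model that always answers "not this genre"; the sample counts are illustrative and the function name is an assumption.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, false positive rate and accuracy for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
        "accuracy": (tp + tn) / len(pairs),
    }


# One genre against the rest: 1/8 positives, 7/8 negatives (illustrative counts).
y_true = [1] * 100 + [0] * 700
always_negative = [0] * 800          # hypothetical classifier that never predicts the genre

metrics = binary_metrics(y_true, always_negative)
print(metrics)
```

Such a classifier reaches an accuracy of 0.875 and a false positive rate of 0 while finding none of the songs of the genre, which is why the false positive rate should be read together with recall and precision.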

5 Conclusion
We implemented three models to determine the best way to classify songs by genre: k-Nearest Neighbors,
a Decision Tree, and a Neural Network. Of those three, kNN and the Decision Tree are the easier models
to understand. These two models, however, did not capture the level of accuracy for which we had
hoped when categorizing the songs. Their performance was noisy for different train/validation splits, and
on average did not reach above 43.6% for kNN (with PCA and weights) and 39.8% for Decision Trees.
The Neural Network, on the other hand, performed better than the other two models with an accuracy
approaching 60%. This result indicates that music classification using technical features is promising.

Method Accuracy
kNN with k=25 0.436
Decision Tree 0.398
Neural Network (with ReLU) 0.5927

A possible improvement could be to pre-select the features that are most indicative. This requires
technical analysis of the features, which was beyond the scope of our research. Although this was
approximated with PCA, it would be good to test the model with a reduced and more selective dataset.
Another improvement could be made in the classification of pop songs, which were often confused. While
our dataset was balanced, in most cases this category will make up the majority of songs. Finding a way
to better distinguish this genre will thus not only improve the results of our models, but also results on
other datasets.
Overall, the performance of our most complicated model clearly surpassed that of the simpler models.
Thus, these results show that our research did not directly find a computationally simpler method to
classify music. However, classification based on audio features appears to be a promising alternative to the
intensive spectrogram approach, and thus should be explored further.

References
[1] H. Abdi and L. J. Williams, “Principal component analysis,” Wiley Interdisciplinary Reviews:
Computational Statistics, vol. 2, no. 4, pp. 433–459, 2010.
[2] G. Guo, H. Wang, D. Bell, and Y. Bi, “KNN model-based approach in classification,” Aug. 2004.
[3] D. Kostrzewa, R. Brzeski, and M. Kubanski, “The classification of music by the genre using the
KNN classifier,” in Beyond Databases, Architectures and Structures. Facing the Challenges of Data
Proliferation and Growing Variety, (Cham), pp. 233–242, Springer International Publishing, 2018.
[4] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, “Neighbourhood components analysis,”
Advances in Neural Information Processing Systems, vol. 17, pp. 513–520, 2005.
[5] S. R. Safavian and D. Landgrebe, “A survey of decision tree classifier methodology,” IEEE
Transactions on Systems, Man, and Cybernetics, vol. 21, no. 3, pp. 660–674, 1991.
[6] M. Bramer, Avoiding Overfitting of Decision Trees, pp. 119–134. London: Springer London, 2007.
[7] J. Downie, A. Ehmann, and D. Tcheng, “Real-time genre classification for music digital libraries,”
in Proceedings of the 5th ACM/IEEE Joint Conference on Digital Libraries (Digital Libraries:
Cyberinfrastructure for Research and Education), 2005.
[8] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61,
pp. 85–117, 2015.
[9] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating
errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986.
[10] C. Nwankpa, W. Ijomah, A. Gachagan, and S. Marshall, “Activation functions: Comparison of
trends in practice and research for deep learning,” 2018.
[11] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980,
2014.
