
MUSIC GENRE CLASSIFICATION

2021-22

SAURABH KUMAR (19106082)


LIKITHA ESLAVATH (19106049)
SHIWALI (19106088)

MUSIC GENRE CLASSIFICATION
MINOR PROJECT REPORT

Minor Project Report Submitted to


Dr. B.R. Ambedkar National Institute of Technology Jalandhar,
for the award of the degree

of

Bachelor of Technology in
Instrumentation and Control Engineering

by

SAURABH KUMAR
LIKITHA ESLAVATH
SHIWALI

Supervised
by

DR. ROOP PAHUJA


Associate Professor

DEPARTMENT OF INSTRUMENTATION AND CONTROL ENGINEERING
DR. B. R. AMBEDKAR NATIONAL INSTITUTE OF TECHNOLOGY
JALANDHAR 144011, PUNJAB (INDIA)
May 2022

DECLARATION

We certify that

a. The work contained in this report is original and has been done by us under the
guidance of our supervisor(s).

b. The work has not been submitted to any other Institute for any degree
or diploma.

c. We have followed the guidelines provided by the Institute in preparing
the report.

d. We have conformed to the norms and guidelines given in the Ethical Code of
Conduct of the Institute.

e. Whenever we have used materials (data, theoretical analysis, figures, and text) from
other sources, we have given due credit to them by citing them in the text of the
report and giving their details in the references. Further, we have taken permission
from the copyright owners of the sources, whenever necessary.

Signature of the students

Likitha Eslavath (19106049)
Shiwali (19106088)
Saurabh Kumar (19106082)

CERTIFICATE

This is to certify that the Minor Project Report entitled "Music Genre Classification", submitted
in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in
Instrumentation and Control Engineering of Dr. B. R. Ambedkar National Institute of
Technology Jalandhar, Punjab, India, is an authentic record of our own work carried out during
the period from August 2021 to May 2022 under the guidance of Dr. Roop Pahuja at Dr. B. R.
Ambedkar National Institute of Technology, Jalandhar.

Students Name Signatures


Likitha Eslavath (19106049)
Shiwali (19106088)
Saurabh Kumar (19106082)

This is to certify that the above statement made by the candidates is correct to the best of my
knowledge.

Date:

Signature of supervisor(s)

Name and Designation

ACKNOWLEDGEMENT

We take this opportunity to express our profound gratitude and deep regards to our project guide,
Dr. Roop Pahuja, for her exemplary guidance, monitoring, and constant encouragement
throughout the course of this semester. The help and guidance given by her from time to time
shall carry us a long way in the journey of life on which we are about to embark.
We are grateful to Dr. Rajesh Singla (HoD, I.C.E. Dept.), once again to Dr. Roop Pahuja
(Associate Professor, I.C.E. Dept.), and to the entire management of the Instrumentation and Control
Engineering Department for reaching out to the students in these challenging times and making it
possible for us to work on this project.
We perceive this opportunity as a milestone in our career development. We will strive to use the
skills and knowledge gained in the best possible way, and we will continue to work on improving
them in order to attain our desired objectives.

Saurabh Kumar (19106082)


Likitha Eslavath (19106049)
Shiwali (19106088)

ABSTRACT

Our goal in this project is to implement an algorithm that reduces the number of data points we
need to work with and to build classifiers (KNN and ANN) using a set of 1000 music samples
(blues (100), classical (100), country (100), disco (100), hiphop (100), jazz (100), metal (100),
pop (100), reggae (100), rock (100)). We developed an automated music genre classification
system. The initial stage was to identify good features that clearly define the boundaries between
genres. The GTZAN music dataset was preprocessed, and multiple time-frequency domain
parameters such as the MFCC vector, chroma features, spectral rolloff, spectral centroid, and
zero crossing rate were obtained to form the feature vector for each music sample. These
music sample features were then used to train supervised learning algorithms, namely KNN and
ANN, with two separate datasets: GTZAN and MFCC. GTZAN contains labeled audio files
of different music types and genres, whereas the MFCC dataset contains labeled numeric feature
vectors for different music types. Finally, the trained models were tested using the test data sets
from both categories, and the performance of each was evaluated. KNN provided a maximum
accuracy of 55% using the GTZAN dataset and 87% using the MFCC dataset, while the
classification accuracy of ANN using the GTZAN dataset remained low (56%).

INDEX

TITLE PAGE
CERTIFICATE BY THE STUDENT(S)
CERTIFICATE BY THE SUPERVISOR
ACKNOWLEDGEMENT
ABSTRACT

CHAPTER 1
INTRODUCTION
1.1 Music and Classes
1.2 Related Work
1.3 Dataset
1.4 Preprocessing and Features

CHAPTER 2
CLASSIFIERS
2.1 K-Nearest Neighbors
2.1.1 KNN Algorithm
2.2 Artificial Neural Network

CHAPTER 3
MODELING
3.1 Pre-Modeling

CHAPTER 4
METHODOLOGY
4.2 Libraries
4.3 Normalization
4.4 Zero Crossing Rate
4.5 Extraction of Features
4.5.1 Chroma Frequencies
4.5.2 Spectral Centroid
4.5.3 Spectral Roll-Off
4.6 Classification
4.6.1 Confusion Matrix

CHAPTER 5
RESULTS AND DISCUSSION
5.1 Accuracy with pre-processed features (MFCC dataset)
5.2 Accuracy with GTZAN dataset
5.3 Accuracy with ANN and GTZAN dataset

CHAPTER 6
CONTRIBUTIONS

REFERENCES

8
FIGURE INDEX

Fig. 1.1 Examples of log-scaled mel-spectrograms for three different genres


Fig. 2.1 KNN Plotting
Fig. 2.2 Euclidean Distance
Fig. 2.3 KNN Algorithm Plot
Fig. 2.4 ANN Plot
Fig. 4.1 Work Flow
Fig. 4.2 Libraries
Fig 4.3 Normalization
Fig 4.4 Zero Crossing Rate
Fig 4.5 Spectrograms
Fig 4.6 MFCC Flow Graph
Fig 4.7 Extracted Features
Fig 4.8 Confusion Matrix
Fig 4.9 Result (Confusion Matrix)
Fig 5.1 Result (MFCC Data Set)
Fig 5.2 Result (GTZAN Data Set)

CHAPTER 1
INTRODUCTION

"Music gives a soul to the universe, wings to the mind, flight to the imagination and life to everything." - Plato
Music is like a mirror: it tells people a lot about who you are. Companies currently use music
classification either to place recommendations in front of their customers (as Spotify and SoundCloud
do) or simply as a product (as Shazam does). It is therefore important for machines to be able to
answer, without asking, the question: what kind of music do you like? Determining the genre of a
piece of music is the first step in this process. Machine learning strategies have proven effective
at extracting trends and patterns from large pools of data, and the same principles apply to
music analysis. New types of music are created every day by combining similar songs, which is why
many new styles such as Space Trap, Country Dance Music and IDM Orchestral have appeared.
Music companies like Spotify and Apple Music serve music to their customers, so if we want to listen
to something, we just go to a site such as Spotify, look for the genre that suits our mood, and start
listening to it. However, because there are millions of songs in the world, it is impossible for a
person to listen to each song, divide the songs into categories, and organize them; doing such hard
work by hand is practically impossible. A music genre is a classification that separates music into
several categories. The intricate interplay of instrumental and vocal tones is what gives each piece of
music its unique character. As a result, these platforms use automatic music genre classification.
Companies increasingly use music classification to offer suggestions to their customers or as a
product in its own right. The first step in the right direction is to find out what kind of music the
user wants to listen to [16].

Wikipedia states that a music genre is "a conventional category that identifies some pieces of music
as belonging to a shared tradition or set of conventions." The word "genre" is a matter of
interpretation, and it often happens that genres are quite vague in their definition. In addition,
genres do not always have sound music-theoretical foundations, e.g. Indian genres are described
geographically, while Baroque is a period of classical music. Despite the lack of a rigorous way to
describe genres, genre-based classification remains one of the most comprehensive and widely used
schemes, and genre often carries heavy weight in music recommendation systems. The
classification, so far, has been done by hand, by adding genre tags to the audio file metadata or
attaching them to album information. This project, however, aims at content-based classification,
focusing on the information within the audio rather than information attached from outside.
The traditional machine learning classification method is used: find relevant features of the data,
train classifiers on the feature data, and make predictions. As a novel element, we tried the use of
an ensemble of classifiers over the different categories to achieve our final goal [15].

1.1 Music and Classes

The art of organising sounds across time utilising the components of music, such as harmony,
rhythm, and timbre, is known as music. It is a common cultural characteristic of all human
communities. Pitch (which governs melody and harmony), rhythm (and its associated concepts of
tempo, metre, and articulation), dynamics (loudness and softness), and the sonic qualities of timbre
and texture (sometimes called the "color" of a musical sound) are all prevalent features in
definitions of music. Different musical styles or genres can emphasise, enhance, or downplay some
of these characteristics. There are purely instrumental pieces, purely vocal pieces (such as songs
without instrumental accompaniment), and pieces that combine singing and instruments.

A music genre or class is a term used to classify certain pieces of music as adhering to a common
tradition or set of rules. It is to be differentiated from musical form and musical style, which are
occasionally used interchangeably with it in practice.

Blues: According to Wikipedia, the blues is a music genre and musical form that was developed by
African-Americans in the Deep South of the United States around the 1860s, with roots in African-
American work songs and spirituals. Blues incorporated spirituals, work songs, field hollers,
shouts, chants, and rhymed simple narrative ballads. The blues form, found everywhere in
jazz, rhythm and blues, and rock and roll, is characterized by a call-and-response pattern, the blues
scale, and specific chord progressions, of which the twelve-bar blues is the most common. Blue
notes (or "worried notes"), usually thirds, fifths, or sevenths, are also an essential part of the sound.
Blues shuffles or walking bass reinforce the trance-like rhythm and create a repetitive
effect known as the groove.

Classical: According to Wikipedia, Classical music refers to the Western world's formal musical
heritage, which is distinguished from Western folk or popular music traditions. It is frequently
referred to as Western classical music, despite the fact that the word "classical music" can also
apply to non-Western traditions with similar formal characteristics. In addition to formality,
classical music is typically distinguished by its melodic structure and harmonic arrangement,
particularly when polyphony is used. It has been largely a written tradition since at least the 9th
century, with a complex notational system and associated literature in analytical, critical,
historiographical, musicological, and philosophical techniques.

Country: According to Wikipedia, the blues, church music such as Southern gospel and spirituals,
old-time music, and types of traditional American music such as Appalachian, Cajun, Creole, and
cowboy Western music all contributed to the development of country (also known as
country and western). Styles such as New Mexico music, Red Dirt, Tejano, and Texas country are
also associated with it. Its legendary roots may be traced back to the early 1920s in the United
States' South and Southwest.
Ballads and dance tunes with generally simple forms, folk lyrics, and harmonies are often
accompanied by string instruments such as electric and acoustic guitars, steel guitars (such as
pedal steels and dobros), banjos, fiddles, and harmonicas. Throughout its history, techniques from
the blues have been employed frequently.

Disco: According to Wikipedia, disco is a genre of dance music and a subculture that first
appeared in the 1970s in the nightlife scene of the United States. Four-on-the-floor beats,
syncopated basslines, string sections, horns, electric piano, synthesisers, and electric rhythm
guitars characterise its sound.

Hip-hop: According to Wikipedia, hip hop music, often known as rap music, is a popular style
of music created in the United States in the 1970s by African Americans and Caribbean
Americans in the Bronx borough of New York City. It consists of stylised rhythmic music that
commonly accompanies rapping, a rhythmic and rhyming speech that is chanted.

Jazz: According to Wikipedia, jazz is a musical genre that evolved in the late 19th and early
20th centuries in the African-American communities of New Orleans, Louisiana, with origins in
blues and ragtime. Since the 1920s, when the Jazz Age began, it has been acknowledged as a key
form of musical expression in traditional and popular music, linked by the common bonds of
African-American and European-American musical parentage. Swing and blue notes, complex
chords, call and response, polyrhythms, and improvisation are all hallmarks of jazz. European
harmony and African rhythms are the foundations of jazz.

Metal: According to Wikipedia, Heavy metal (or just metal) is a rock music genre that
originated in the United Kingdom and the United States in the late 1960s and early 1970s. Heavy
metal bands created a thick, massive sound typified by distorted guitars, lengthy guitar solos,
forceful rhythms, and loudness, with origins in blues rock, psychedelic rock, and acid rock.

Pop: According to Wikipedia, Pop is a popular music genre that emerged in its contemporary
form in the United States and the United Kingdom in the mid-1950s. Although the phrases
popular music and pop music are sometimes interchanged, the former refers to all popular music,
which comprises a wide range of forms. Pop music covered rock & roll and the youth-oriented
forms it influenced during the 1950s and 1960s. Until the late 1960s, rock and pop music were
practically synonymous, with pop being associated with music that was more commercial,
transitory, and accessible.

Reggae: Reggae is a music genre that began in Jamaica in the late 1960s. The term also refers
more broadly to the modern popular music of Jamaica. "Do the Reggay," a 1968 song by Toots and
the Maytals, was the first popular song to use the word "reggae," thereby
naming the genre and introducing it to a global audience. Although the term reggae is sometimes
used in a broad sense to refer to a variety of popular Jamaican dance music styles, it denotes a
specific style of music that was heavily influenced by traditional mento and American jazz with
rhythm and blues, and emerged from the earlier ska and rocksteady styles. Reggae has its roots in
ska and rocksteady, with the latter passing down to reggae the use of the bass as a percussion
instrument.

Rock: Rock music is a genre of popular music that began as "rock and roll" in the late 1940s and
early 1950s in the United States, evolving into a variety of styles from the mid-1960s onward,
especially in the United States and the United Kingdom. It has its roots in
rock and roll from the 1940s and 1950s, a style that drew directly from the African-American
genres of blues and rhythm and blues, as well as from country music. Rock music has also widely
borrowed from other genres, including electric blues and folk, as well as jazz and other musical
traditions.

1.2 Related Work

1. A major influence on genre classification using machine learning techniques
was initiated by Tzanetakis and Cook [1]. The GTZAN database was created by them and
to this day is regarded as a standard benchmark for genre classification. Scaringella et al. [2]
provide a comprehensive survey of both the features and the classification strategies used in
genre classification. Changsheng Xu et al. [3] showed how to use support vector machines for
this task. For audio recognition and musical knowledge, features
such as zero crossing rate, spectral centroid, spectral rolloff, Mel-frequency cepstral
coefficients, and chroma frequencies are used.

2. In the last 5-10 years, convolutional neural networks have proven to be extremely accurate
music genre classifiers [1] [4] [5], with excellent results stemming both from the capacity
provided by having multiple layers and from the ability of convolutional layers to effectively
identify patterns within images (i.e. mel-spectrograms and similar MFCC representations).
These results far exceed human accuracy, and our survey found current
high-performing models reaching approximately 91% accuracy [5] when using the full track
length of 30 s. Many papers have used CNNs to compare their models with other ML
strategies, including k-NN, Gaussian mixture models, and SVMs, and the CNN performed
better in all cases. We therefore decided to focus our efforts on the most accurate model,
the CNN, with the other models used as baselines.

3. In Hareesh Bahuleyan (2018), Music Genre Classification using Machine Learning
Techniques, a convolutional neural network is trained using spectrogram features of the audio
signal. A second method uses various machine learning algorithms such as logistic
regression and random forests, with hand-crafted features drawn from the time domain and
frequency domain of the signal.

4. In Tzanetakis G. et al. (2002), Musical genre classification of audio signals, the
features used are typically related to the instruments used, the rhythmic structures, and
especially the harmonic content of the music. Genre categories are often used to organize
the large music collections available on the web. Using the proposed feature sets, this
model can classify approximately 61% of the ten music genres correctly.

5. In Lu L. et al. (2002), Content analysis for audio classification and segmentation, the
authors presented their analysis of audio content for classification and segmentation. Here audio
streams are categorized according to the type of sound or the identity of the speaker.
Their approach builds a robust model that is able to segment and classify a
given audio signal into speech, music, environmental sound, and silence. This classification is
processed in two major steps, which makes it suitable for various other applications as well. The
first step discriminates speech from non-speech. The second step divides the non-speech
category into music, environmental sounds, and silence in a rule-based manner.

6. In Tom LH Li et al. (2010), Automatic musical pattern feature extraction using
convolutional neural network, the CNN shows a strong ability to capture informative features
from a variety of musical patterns. Features extracted from audio clips, such as statistical
spectral features, rhythm, and pitch, are less reliable and produce less accurate
models. The extracted musical pattern features were tested using the WEKA tool, and several
classification models were considered; the classification accuracy reached 84%, which was
high. Compared to MFCC, chroma, and temporal features, the features extracted by the
CNN provided good results and were more reliable. Accuracy could be further enhanced by
parallel computation over different combinations of feature types.

1.3 Dataset

We obtained all of our musical data from the public GTZAN dataset, downloaded from the
Music Genre Classification | Kaggle website. This is the dataset used in [1]. It
provides 1000 thirty-second audio clips, each labeled as one of 10 possible
genres (jazz, classical, rap, rock, etc.) and presented as .wav files. The dataset incorporates samples
from a variety of sources such as CDs, radio, and microphone recordings. Genres such as hip-
hop, rock, pop, and electronic music have louder low frequencies (up to 150 Hz) than genres such
as jazz and folk music. The same relationship holds for the high frequencies (5 kHz
and above), with hip-hop, rock, pop, and electronic music being the loudest.
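
As an illustration, a clip from the dataset can be loaded as follows. This is a minimal sketch; the directory layout "genres/blues/blues.00000.wav" is an assumed example rather than the exact layout used in this project.

    import librosa

    # librosa resamples to 22050 Hz by default, which matches the GTZAN clips
    signal, sr = librosa.load("genres/blues/blues.00000.wav")
    print(signal.shape, sr)   # ~30 s of audio at 22050 samples per second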

1.4 Preprocessing and Features

First, we normalized the audio data of each song to remove the difference in base volume at which
different pieces were recorded, which does not affect their genre. Raw audio is hard to work with,
as it contains too many data points (22,050 per second) to be computationally tractable for
neural networks. Training can take a very long time, and the level of detail in the data makes
pattern recognition difficult without a prohibitively large model. Therefore, we examined two more
compact data representations: Fourier-transform coefficients and Mel-frequency cepstral
coefficients (MFCC).
Performing the Fourier transform involves breaking the audio sample into small segments (~0.1
seconds) and taking a Fast Fourier Transform (FFT) of each segment. The resulting Fourier
coefficient vectors were stacked along the time axis to form a time-series matrix of Fourier
coefficients, which can be treated like an "image" during training. The FFT was performed using
a NumPy function [6].
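
A minimal sketch of this segmentation-and-FFT step; the 0.1 s segment length follows the text above, while the use of NumPy's rfft (which keeps only the non-redundant half of each spectrum) is our assumption, since the report does not state which NumPy FFT routine was used.

    import numpy as np

    def fft_matrix(signal, sr, seg_seconds=0.1):
        """Stack per-segment FFT magnitudes into a (time, frequency) matrix."""
        seg_len = int(sr * seg_seconds)              # samples per ~0.1 s segment
        n_segs = len(signal) // seg_len              # drop the ragged tail
        segments = signal[:n_segs * seg_len].reshape(n_segs, seg_len)
        # rfft keeps only the non-redundant half of each spectrum
        return np.abs(np.fft.rfft(segments, axis=1))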

Computing the MFCC involves the following steps [7]:

1. Take the Fourier transform of each audio segment.
2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
3. Take the logarithm of the power at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers.
5. The MFCCs are the amplitudes of the resulting spectrum.

This process is done using the Librosa library (a tool designed for audio processing) [8]. We tried
the MFCC because, according to [7], it is the industry standard for sound processing, and the
author noted that it leads to significant gains in accuracy.
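
As a hedged sketch, Librosa wraps the five steps above in a single call. The 13-coefficient count follows the report; averaging over time to obtain one vector per clip, and the file path, are our assumptions about how the MFCC feature vectors were formed.

    import librosa

    signal, sr = librosa.load("genres/jazz/jazz.00000.wav")   # hypothetical path
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
    feature_vector = mfcc.mean(axis=1)                        # one 13-d vector per clip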

16
We also tried to pre-process our data by converting the raw audio into mel-spectrograms. In doing
so, we experienced a significant increase in performance across all models. Mel-spectrograms
are a commonly used input representation for sound because they closely mirror the way
humans perceive sound (i.e. on a log frequency scale). To convert raw audio into a mel-spectrogram,
one must apply the Fourier transform to sliding windows of the sound, usually around 20 ms wide. With
signal x[n], window w[n], frequency ω, and shift m, this is computed as

STFT{x[n]}(m, ω) = Σ_n x[n] w[n − m] e^(−jωn)

In practice, these are computed more quickly using sliding DFT algorithms. The resulting
frequencies f are then mapped to the mel scale by the transformation

M(f) = 1125 ln(1 + f/700)

Then we applied a discrete cosine transform (common in signal processing) to obtain our
final mel-spectrogram. In our case, we used the Librosa library and chose 64 mel
bins and a 512-sample window with 50% overlap between windows. Based on previous
academic results with such features, we then log-scaled the values using the formula
log(X²) [17].
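
A sketch of this computation with Librosa, under the settings stated above (64 mel bins, 512-sample windows, 50% overlap). Librosa's melspectrogram returns squared magnitudes by default, so the final log corresponds to the log(X²) scaling above; the small additive constant is our assumption to guard against log(0).

    import numpy as np
    import librosa

    def log_mel_spectrogram(signal, sr):
        mel = librosa.feature.melspectrogram(
            y=signal, sr=sr,
            n_fft=512,        # 512-sample analysis window
            hop_length=256,   # 50% overlap between windows
            n_mels=64)        # 64 mel bins
        # mel holds power (squared magnitudes), so this log is log(X^2)
        return np.log(mel + 1e-10)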

The resulting data can be visualized below:

Fig. 1.1
Examples of log-scaled mel-spectrograms for three different genres [17]

CHAPTER 2
CLASSIFIERS

We decided to first use two simple models as baseline performance measures, and then move on to
more complex models to increase accuracy. We implemented k-nearest neighbors, a
fully connected neural network, and a convolutional neural network. In addition to the mel-
spectrogram features, we used principal component analysis to reduce the dimensionality of the
inputs to k-NN and ANN.

2.1 K-Nearest Neighbors

The k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method first
developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is
used for classification and regression. In both cases, the input consists of the k closest training
examples in a data set. The k-NN algorithm is based on the idea that similar objects live close
together; in other words, similar data points lie near one another. The output depends on whether
k-NN is used for classification or regression. k-NN is a type of instance-based learning in which
the function is approximated only locally and all computation is deferred until the function is
evaluated. As this algorithm depends on distance for classification, if the features represent
different physical units or come in very different scales, normalizing the training data can improve
its accuracy dramatically [20].

In both classification and regression, a useful technique is to assign weights to the neighbors'
contributions, so that nearer neighbors contribute more to the result than those farther away.
For example, a common weighting scheme consists of giving each neighbor a weight of 1/d,
where d is the distance to the neighbor. The neighbors are taken from a set of objects for which
the class (for k-NN classification) or the object property value (for k-NN regression) is known.
This can be thought of as the training set for the algorithm, though no explicit training step
is required.

Fig. 2.1
KNN Plotting [18]

To find comparable nearby points, distance measures including the Euclidean distance,
Hamming distance, Manhattan distance, and Minkowski distance are used. The Euclidean distance
is defined as the length of the line segment connecting two points in Euclidean space.

Fig 2.2
Euclidean distance[19]

2.1.1 KNN Algorithm


The training examples are vectors in a multidimensional feature space, each with a class label.
The training phase of the algorithm consists only of storing the feature vectors and class labels of
the training samples. In the classification phase, k is a user-defined constant, and an unlabeled
vector (a query or test point) is classified by assigning it the label most common among the k
training samples nearest to that query point.
A commonly used distance metric for continuous variables is the Euclidean distance. For
discrete variables, such as in text classification, other metrics can be used, such as the overlap
metric (or Hamming distance). In the context of gene-expression microarray data, for example,
k-NN has been employed with correlation coefficients, such as Pearson and Spearman, as metrics.
In general, the accuracy of k-NN classification can be greatly improved if the distance metric is
learned with specialized algorithms such as Large Margin Nearest Neighbor or Neighbourhood
Components Analysis.
A drawback of the basic "majority voting" classification occurs when the class distribution is
skewed. That is, examples of a more frequent class tend to dominate the prediction of the
new example, because they tend to be common among the k nearest neighbors due to their large
numbers. One way to overcome this problem is to weight the classification, taking into account the
distance from the test point to each of its k nearest neighbors. The class (or value, in regression
problems) of each of the k nearest points is multiplied by a weight proportional to the inverse of
the distance from that point to the test point. Another way to overcome skew is abstraction in the
data representation. For example, in a self-organizing map (SOM), each node is a representative
(a center) of a cluster of similar points, regardless of their density in the original training data.
k-NN can then be applied to the SOM [9].
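
A minimal k-NN sketch over per-clip feature vectors using scikit-learn (which is not among the libraries listed in Chapter 4, so its use here is an assumption). The placeholder data, the choice k = 5, and the 1/d distance weighting discussed above are illustrative; the report does not state which settings were actually used.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(700, 13))       # placeholder 13-d MFCC-style vectors
    y_train = rng.integers(0, 10, size=700)    # ten genre labels, 0..9
    X_test = rng.normal(size=(300, 13))

    knn = KNeighborsClassifier(n_neighbors=5,       # k is a user-defined constant
                               weights="distance",  # 1/d weighting: nearer counts more
                               metric="euclidean")
    knn.fit(X_train, y_train)                       # "training" just stores the vectors
    predictions = knn.predict(X_test)               # weighted vote among the 5 nearest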

Fig 2.3
KNN Algorithm plot[20]

2.2 Artificial neural networks

The artificial neural network, also known simply as a neural network, is composed of a set of
connected nodes or units called artificial neurons, which loosely model the neurons in a
biological brain. These models are built from multilayer perceptrons. Neural
networks consist of:
1. Input layer
2. Hidden layers
3. Output layer.

Our ANN architecture consists of one input layer with 10 nodes, two hidden layers with 20 nodes
and 10 nodes respectively, and an output layer containing only one node. Input layer: The input
layer is the layer through which we feed all inputs/information from the external world
to the model, so that it can learn and extract useful information from it. The information is
then transferred to the next layer, i.e. the hidden layer.

Hidden layer: The hidden layer is the layer in which a set of neurons performs all the
computations on the input data. We can select any number of hidden layers depending on the
complexity of the problem; the simplest network contains only one hidden layer.

Output layer: The output layer is the layer that provides the final output of the model based on all
the computations made on the data from the previous layers. There may be one or more nodes in
the output layer depending on the type of output. For example, in a binary classification
problem the output node may be a single one, but in the case of multi-class classification the
output nodes may be more than one.
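
The report does not name the framework used for the ANN; the following is a hedged Keras sketch of the stated layer sizes (10-20-10), with a 10-node softmax output swapped in for the ten genres, per the note above that multi-class problems need more than one output node.

    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.Input(shape=(10,)),               # 10 input features (assumed)
        keras.layers.Dense(20, activation="relu"),     # hidden layer 1: 20 nodes
        keras.layers.Dense(10, activation="relu"),     # hidden layer 2: 10 nodes
        keras.layers.Dense(10, activation="softmax"),  # one node per genre (assumption)
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()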

Fig 2.4
ANN Plot[21]

CHAPTER 3
Modeling

Machine learning is the study of algorithms that can improve automatically through experience
and historical data, and thereby create a model. A machine learning model is a piece of
software that recognizes patterns or behaviors based on prior experience or data. The learning
algorithm finds patterns in the training data and produces a machine learning model that captures
these patterns and makes predictions on new data.

The predicted data is regularly fed back into the learning model to help it learn better,
increasing the model's accuracy.

In this project we are using the following:

1. Machine Learning Models


● KNN
● ANN(Artificial Neural Network)

3.1 Pre-Modeling:

After processing the data, we split the data into two categories, namely:

● Training dataset (70% of the total dataset):

The training set contains the data that will be fed into the model. In simple terms,
our model learns from this data.

● Testing dataset (30% of the total dataset):

The test set contains the data on which we test the trained model. It tells us
how effective our overall model is and how well it is likely to predict unseen data.
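
As a hedged sketch of this 70/30 split using scikit-learn's train_test_split; the placeholder feature matrix, the random seed, and the use of stratification (equal genre proportions in both splits) are assumptions not stated in the report.

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.rand(1000, 13)           # placeholder feature matrix (1000 clips)
    y = np.repeat(np.arange(10), 100)      # 10 genres x 100 clips each
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30,              # 70% train / 30% test, as above
        random_state=42, stratify=y)       # seed and stratification are assumptions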

CHAPTER 4
Methodology

The methodology for the modeling process follows the mechanism below:

Fig 4.1
Workflow

4.2 Libraries:

Here we have used several libraries, including librosa, python_speech_features, numpy, random,
pickle, etc., as shown below.

Fig 4.2
Libraries

Librosa: It is a python package for music and audio analysis. It provides the building blocks
necessary to create music information retrieval systems[8].

Os: This module provides a portable way of using operating system dependent functionality. If
you just want to read or write a file, see open(); if you want to manipulate paths, see the os.path
module; and if you want to read all the lines in all the files on the command line, see the fileinput
module. For creating temporary files and directories see the tempfile module, and for high-level
file and directory handling see the shutil module [10].

Matplotlib: Matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python[11].

Python_speech_features: This library provides common speech features for ASR including
MFCCs and filterbank energies[12].

Scipy: SciPy provides algorithms for optimization, integration, interpolation, eigenvalue
problems, algebraic equations, differential equations, statistics, and many other classes of
problems [13].

Pandas: It is a fast, powerful, flexible and easy to use open source data analysis and
manipulation tool, built on top of the Python programming language[14].

Pickle: The Python pickle module is used for serializing and de-serializing a Python object
structure.

4.3 Normalization:

Fig 4.3
Normalization
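
The figure above shows the normalization code used in this project. As a hedged stand-in, the sketch below peak-normalizes each clip with Librosa so that differences in base recording volume do not influence the features; the file path is a hypothetical example.

    import librosa

    signal, sr = librosa.load("genres/disco/disco.00000.wav")  # hypothetical path
    normalized = librosa.util.normalize(signal)  # scale so that max |sample| == 1.0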

4.4 Zero Crossing Rate:
It represents the number of times the waveform crosses 0. It usually has higher values for highly
percussive sounds like those in metal and rock.

Fig 4.4
Zero Crossing Rate
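
A short sketch of computing the zero crossing rate with Librosa; the frame and hop sizes are the library defaults rather than values stated in this report, and the file path is a hypothetical example.

    import librosa

    signal, sr = librosa.load("genres/metal/metal.00000.wav")  # hypothetical path
    zcr = librosa.feature.zero_crossing_rate(signal)           # shape: (1, n_frames)
    print(zcr.mean())  # percussive genres such as metal tend to score higher
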
4.5 Extraction of features:

The Mel-frequency cepstral coefficients (MFCC) represent short-term spectral features of a
sound and are used in modern speech recognition and sound classification methods. They capture
characteristics of the human voice. These features form a large part of the final feature vector
(13 coefficients). The steps to compute these features are given below:

1. Divide the signal into short frames. The purpose of this step is to obtain segments over
which the sound signal is approximately stationary.

2. For each frame, calculate the periodogram estimate of the power spectrum. This identifies
which frequencies are present in the short frame.

Fig 4.5
Spectrogram

3. Apply the mel filterbank to the power spectra and sum the energy in each filter. This tells
us how much energy is present in each frequency region.

4. Take the logarithm of the filterbank energies. This brings our features closer to what
humans actually hear. The mel scale used by the filterbank is given by

M(f) = 1125 ln(1 + f/700)

5. Compute the Discrete Cosine Transform (DCT) of the log filterbank energies. This
decorrelates the filterbank energies. We discard the higher DCT coefficients, since they
represent fast changes in the filterbank energies that tend to degrade performance.

Fig 4.6
MFCC flow graph[22]

4.5.1 Chroma Frequencies: The chroma frequency vector projects the spectrum onto the 12
chromatic keys and represents the presence of each key. We take the histogram of the notes
present, on a 12-note scale, as a 12-element feature vector. Chroma usually has a music-theoretic
meaning: a histogram over the 12-note scale is often enough to characterize the music played in
that window. It provides a robust way to measure the similarity between pieces of music.

4.5.2 Spectral Centroid: Indicates where the "center of mass" of the sound is located. It is
effectively the weighted mean of the frequencies present in the sound. Consider two songs, one
blues and one metal. The blues song is usually consistent throughout its length, while the metal
song typically has many frequencies concentrated toward the end. So the spectral centroid of the
blues song will lie somewhere near the middle of its spectrum, while that of the metal song will be
shifted toward its end.

4.5.3 Spectral Roll-off: It is a measure of the shape of the signal. It represents the frequency
below which a specified fraction of the total spectral energy lies. To find it, we calculate the
frequency bin in the energy spectrum below which 85% of the energy is contained, i.e. 85% of
the energy lies in the lower frequencies.
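
A sketch of extracting the three features above with Librosa. The 85% roll-off threshold matches the report (and is also Librosa's default), and the file path is a hypothetical example.

    import librosa

    signal, sr = librosa.load("genres/blues/blues.00000.wav")     # hypothetical path
    chroma = librosa.feature.chroma_stft(y=signal, sr=sr)         # (12, n_frames)
    centroid = librosa.feature.spectral_centroid(y=signal, sr=sr) # (1, n_frames)
    rolloff = librosa.feature.spectral_rolloff(y=signal, sr=sr,
                                               roll_percent=0.85) # (1, n_frames)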

Fig 4.7
Extracted Features

4.6 Classification

Once the feature vectors are obtained, we train different classifiers on the training set of feature
vectors. Following are the different classifiers that were used -
– K Nearest Neighbours
– Artificial Neural Network

4.6.1 Confusion matrix

A confusion matrix is a table that is often used to describe the performance of a classification
model (or "classifier") on a set of test data for which the true values are known.

Fig 4.8
Confusion matrix[23]

Fig 4.9
Result (confusion matrix)
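
Assuming y_test and predictions as produced by the classifier sketches earlier, the accuracy and confusion matrix can be computed with scikit-learn as follows; this is a sketch, not the report's exact code, and the tiny placeholder arrays only keep the snippet self-contained.

    import numpy as np
    from sklearn.metrics import accuracy_score, confusion_matrix

    # y_test and predictions would come from the trained classifiers above;
    # small placeholders stand in for them here.
    y_test = np.array([0, 1, 2, 2, 1])
    predictions = np.array([0, 2, 2, 2, 1])

    print(accuracy_score(y_test, predictions))    # fraction classified correctly
    print(confusion_matrix(y_test, predictions))  # rows: actual, cols: predicted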

CHAPTER 5
Results and Discussions

After collecting the data set, we randomly shuffled the data and split it into training and test
sets. To avoid data leakage, we fitted the various data preparation techniques on the training set
first and then applied them to both the training and test sets. For feature selection, we used a
repeated selection procedure that assigns each feature a score indicating how useful that feature
is for predicting the class. We organized the features and used them to train our machine
learning models. We faced problems regarding accuracy and data import, so we used different
algorithms with different data sets and tried to optimize them. Finally, we obtained our accuracy,
along with the confusion matrices shown below. In each matrix, the rows correspond to the actual
genres and the columns to the predicted genres.

Classifier    Dataset    Result (Accuracy)

KNN           GTZAN      55%
KNN           MFCC       87%
ANN           GTZAN      56%

5.1 Accuracy with pre-processed features (MFCC dataset):

Fig 5.1
Result (MFCC Data Set)

5.2 Accuracy with GTZAN dataset:

Fig 5.2
Result (GTZAN dataset)

5.3 Accuracy with ANN and GTZAN dataset:

We got the maximum accuracy of 56% with ANN.

CHAPTER 6
Contributions

Our group has collaborated closely on the majority of this project. We worked together to find a
project idea, lay out the direction of the project, and determine matters of data processing as well
as implementation details. The writing of the project proposal, presentation, and this final report
have all been a team effort. However, when it came to coding, we realized that our team worked
better when we could mostly code by ourselves.

References
[1]. G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Transactions
on Speech and Audio Processing, 10(5):293–302, 2002.

[2]. N. Scaringella, G. Zoia, and D. Mlynek. Automatic genre classification of music content: a
survey. IEEE Signal Process. Mag., 23(2):133–141, 2006.

[3]. Changsheng Xu, MC Maddage, Xi Shao, Fang Cao, and Qi Tian. Musical genre
classification using support vector machines. In Acoustics, Speech, and Signal Processing, 2003.
Proceedings.(ICASSP’03). 2003 IEEE International Conference on, volume 5, pages V–429.
IEEE, 2003.

[4]. Hareesh Bahuleyan. Music genre classification using machine learning techniques. CoRR,
abs/1804.01149, 2018

[5]. Y. Panagakis, C. Kotropoulos, and G. R. Arce. Music genre classification via sparse
representations of auditory temporal modulations. In 2009 17th European Signal Processing
Conference, pages 1–5, Aug 2009.

[6]. "Discrete Fourier Transform (numpy.fft) — NumPy v1.15 Manual", Docs.scipy.org, 2018.
[Online]. Available: https://docs.scipy.org/doc/numpy-1.15.0/reference/routines.fft.html.

[7]. "mlachmish/MusicGenreClassification", GitHub, 2018. [Online]. Available:


https://github.com/mlachmish/MusicGenreClassification

[8]. https://librosa.org/doc/latest/index.html

[9]. https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

[10]. https://docs.python.org/3/library/os.html

[11]. https://matplotlib.org/stable/index.html

[12]. https://python-speech-features.readthedocs.io/en/latest/

[13]. https://scipy.org/

[14]. https://pandas.pydata.org/

[15]. Archit Rathore, Margaux Dorido (Music Genre Classification ) - Indian Institute of
Technology, Kanpur Department of Computer Science and Engineering

[16]. Rudra Pratap Singh, Siddhant Singh, Soukumarya Datta, D. Prabhu - Automated Music
Genre Detection and Classification using Machine Learning.

[17]. Derek A. Huang, Arianna A. Serafini, Eli J. Pugh - Music Genre Classification.

[18]. https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning

[19]. https://en.wikipedia.org/wiki/Euclidean_distance

[20]. https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

[21]. https://www.analyticsvidhya.com/blog/2021/07/understanding-the-basics-of-artificial-
neural-network-ann/

[22]. https://jonathan-hui.medium.com/speech-recognition-feature-extraction-mfcc-plp-
5455f5a69dd9

[23]. https://manisha-sirsat.blogspot.com/2019/04/confusion-matrix.html
