
AUTOMATIC MUSIC GENERATION USING LSTM

A PROJECT REPORT

Submitted by

VISWA S (721419104054)
HARIHARAN S (721419104701)

In partial fulfillment for the award of the degree

of
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING

NEHRU INSTITUTE OF ENGINEERING AND TECHNOLOGY


COIMBATORE – 641 105
ANNA UNIVERSITY: CHENNAI – 600 025

MAY-2023
ANNA UNIVERSITY: CHENNAI 600 025

BONAFIDE CERTIFICATE

Certified that this project report “AUTOMATIC MUSIC GENERATION


USING LSTM” is the bonafide work of VISWA S (721419104054) and
HARIHARAN S (721419104701) who carried out the project work under my
supervision.

SIGNATURE

Dr. D. SATHISH KUMAR
ASST. PROFESSOR and SUPERVISOR
Department of Computer Science & Engineering
Nehru Institute of Engineering and Technology
Coimbatore – 641105

SIGNATURE

Dr. S. SUBASREE
HEAD OF THE DEPARTMENT
Department of Computer Science & Engineering
Nehru Institute of Engineering and Technology
Coimbatore – 641105

Submitted for the Anna University Project Viva-Voce Examination

held on ............................ at Nehru Institute of Engineering and Technology.

INTERNAL EXAMINER EXTERNAL EXAMINER


ACKNOWLEDGEMENT

At this pleasing moment of the successful completion of our project, we thank God
Almighty for His blessings.

We convey our sincere thanks to our respected Founder Chairman, Late
Shri. P. K. DAS, Chairman and Managing Trustee Adv. Dr. P. KRISHNA DAS, and
CEO & Secretary Dr. P. KRISHNA KUMAR, who provided the facilities for us.

We express our deep gratitude to our beloved and honorable Principal,
Dr. P. MANIIARASAN, for his valuable support.

We owe grateful and significant thanks to our Dean (Computing)
Dr. N. K. SAKTHIVEL, Head of the Department of Computer Science and Engineering
Dr. S. SUBASREE, and our internal guide Dr. D. SATHISH KUMAR, Assistant
Professor (SG), CSE, who have given constant support and encouragement throughout
the completion of our course and our project work.

We thank our Project Coordinator Mr. S. VENKATESH, Assistant
Professor (SG), CSE, who rendered us all possible help along with his valuable
suggestions and constructive guidance for the successful completion of our project.

We are thankful for the support and guidance from all the faculty of the Computer
Science and Engineering Department, and we extend sincere thanks to the supporting
staff for their timely assistance. Finally, we would like to express our gratitude
towards our parents for their motivation and support, which helped us in the
completion of this project.
ABSTRACT

Traditionally, music was treated as an analogue signal and was composed
manually. In recent years, technology has made it possible to generate a suite
of music automatically without any human intervention. Our main objective is to
propose an algorithm which can be used to generate musical notes using Recurrent
Neural Networks (RNN), principally Long Short-Term Memory (LSTM) networks. A
model is designed to execute this algorithm, where data is represented in the
Musical Instrument Digital Interface (MIDI) file format for easier access and
better understanding. The model learns sequences of polyphonic musical notes
over a single-layered LSTM network. The system is built using technologies such
as the TensorFlow, Keras, and music21 libraries in the Python programming
language. The workflow consists of several steps: data preprocessing, model
training, and music generation. The MIDI format is used to represent musical
data, which is preprocessed using the music21 library to extract features such
as pitch, duration, and velocity. The preprocessed data is then fed into the
LSTM network for model training using the TensorFlow and Keras libraries. The
trained model is then used to generate new melodies. The system allows users to
input a sequence of musical notes and generate new melodies based on the input.
The output can be saved in MIDI format, which can be played back using various
MIDI players or even music production software.
TABLE OF CONTENTS
CHAPTER NO. TITLE PAGE NO.

ABSTRACT iv

LIST OF FIGURES vii

LIST OF ABBREVIATIONS ix

1 INTRODUCTION 1

2 RELATED WORKS 4
2.1 OBJECTIVES 4
2.2 IDENTIFIED PROBLEMS 4

3 EXISTING SYSTEM 5

4 PROPOSED SYSTEM 7

4.1 LIST OF MODULES 9

4.2 MODULE DESCRIPTION 9

4.3 ARCHITECTURE 13

4.4 SOFTWARE AND HARDWARE USED 14

4.5 SOFTWARE DESCRIPTION 15


4.5.1 OVERVIEW OF ARTIFICIAL INTELLIGENCE 15
4.5.2 OVERVIEW OF MACHINE LEARNING 17
4.5.3 OVERVIEW OF DEEP LEARNING 18
4.5.4 RELATION BETWEEN AI, ML AND DEEP LEARNING 20

4.5.5 OVERVIEW OF HTML 21

4.5.6 OVERVIEW OF FLASK 23


4.6 SOFTWARE KEY TERMINOLOGIES 24
4.6.1 RNN 24
4.6.2 LSTM 25
4.6.3 MIDI 26
4.6.4 MUSIC21 27
4.6.5 TENSORFLOW 27
4.6.6 KERAS 28

4.6.7 MIDI PLAYER 29

4.6.8 MIDI VISUALIZER 30

4.7 MUSICAL CONCEPTS 32


4.7.1 MELODY 32
4.7.2 HARMONY 33
4.7.3 RHYTHM 35
4.7.4 NOTES AND RESTS 36
4.7.5 PITCH 38
4.7.6 SCALE 39
4.7.7 CHORD 40
4.7.8 KEY 42
4.7.9 TEMPO 43
4.7.10 DYNAMICS 44
4.7.11 TIMBRE 46
4.7.12 FORM 47
5 RESULTS AND DISCUSSION 49

6 CONCLUSION 50

APPENDIX I - SCREENSHOTS 51

APPENDIX II - CODING 56

REFERENCES 69
LIST OF FIGURES

FIGURE NO. TITLE PAGE NO.
1 ARCHITECTURE 13
2 ARTIFICIAL INTELLIGENCE 16
3 MACHINE LEARNING 18
4 DEEP LEARNING 19
5 RELATION BETWEEN AI, ML AND DEEP
LEARNING 21
6 HTML 22
7 FLASK 24
8 HARMONY 34
9 NOTES AND RESTS 38
10 PITCH 39
11 SCALE 40
12 CHORD 41
13 KEY 42
14 TEMPO 44
15 DYNAMICS 45
16 TIMBRE 47
17 FORM 48
18 DATA PREPROCESSING 51
19 PREPROCESSED DATA 52
20 MAPPED DATA 52
21 MODEL DEVELOPMENT AND TRAINING 53
22 LOSS AND ACCURACY MAPPED DURING
TRAINING 53
23 GENERATED MUSIC AS SEQUENCE 54
24 GENERATED MUSIC AS MUSICAL NOTES 54
25 RUNNING SERVER 55
26 GENERATED MUSIC IN WEB PAGE 55
LIST OF ABBREVIATIONS

SERIAL NO ABBREVIATION EXPANSION

1 RNN Recurrent Neural Network

2 LSTM Long Short Term Memory

3 MIDI Musical Instrument Digital Interface

4 DAW Digital Audio Workstation

5 RANN Recursive Artificial Neural Network

6 GAN Generative Adversarial Network

7 SGD Stochastic Gradient Descent

8 HMM Hidden Markov Model

9 ANN Artificial Neural Network

10 XML Extensible Markup Language

11 AI Artificial Intelligence

12 ML Machine Learning

13 USB Universal Serial Bus

14 HTML HyperText Markup Language

15 VLC VideoLAN Client

16 SD Secure Digital

17 BPM Beats Per Minute


CHAPTER 1

INTRODUCTION

Music composition has been one of humanity's most popular pursuits for a long time.
Music generation is a topic that has been studied in much detail by researchers in
the past, and several of them have tried to generate music with different approaches.
These approaches, each intended to generate a suite of music, can be combined to
design a new and efficient model. We divide these approaches into two wide
categories: traditional algorithms operating on predefined functions to produce
music, and autonomous models which learn from the past structure of musical notes
and then produce new music. Garvit et al. also presented a paper comparing these
algorithms on two different hardware platforms. Drewes et al. proposed a method
showing how algebra can be used to generate music in a grammatical manner with the
help of a tree-based approach. Markov chains and hidden Markov models can be used to
design a mathematical model to generate music. As recorded in the book "Ancient
Greek Music: A Survey," the oldest surviving complete musical composition is the
Seikilos Epitaph from the 1st or 2nd century AD. In algorithmic composition, that
is, composing by means of formalizable methods, Guido of Arezzo was a pioneer:
around 1025 he developed the system of music theory now known as "solfeggio" and
invented a method to automate the conversion of text into melodic phrases. Music
generation using artificial intelligence with artistic style imitation has been
broadly explored within the past three decades. Nowadays, electronic instruments
communicate with computers through a protocol named Musical Instrument Digital
Interface (MIDI), which carries information such as pitch and duration. MIDI allows
a composer to write for an orchestra without having the orchestra itself; every part
of the orchestra can be created from MIDI with a synthesizer. The digital audio
workstation (DAW) is the standard in contemporary audio production and is capable of
recording,

1
editing, mastering, and mixing. A DAW can play each MIDI channel separately through
connected electronic devices or synthesizers. There exist many open-source systems
for generating music, but an in-depth discussion of all such research works and
methods is beyond the scope of this document. Algorithmic composition has various
approaches, such as the Markov model, generative grammars, transition networks, and
artificial neural networks.

After the breakthrough in AI, many new models and methods were proposed in the
field of music generation. Various AI-enabled techniques have been described in the
literature, including a probabilistic model using RNNs, the Anticipation-RNN, and
recursive artificial neural networks (RANN), an evolved version of artificial neural
networks, for generating the subsequent note, the subsequent note's duration, and
rhythm. Generative adversarial networks (GANs) are also actively used for generating
musical notes; a GAN contains two neural networks, a generator network that generates
some random data and a discriminator network that evaluates the generated data for
authenticity against the original data (dataset). MuseGAN is a generative adversarial
network that generates symbolic multi-track music.

A critical step in this task, whether in the audio domain or the symbolic domain, is

that of converting music into machine-perceivable matrices of numbers. In the
symbolic domain, which is the focus area here, the MIDI format provides a powerful
tool to facilitate such a conversion. Every standard musical pitch in the MIDI format
is assigned a unique number, which can be extracted using parsing tools and written
to matrices. The numbers must be further processed before being ready for use as
inputs to AI models. This preprocessing of data has a significant impact on the
performance of the model and therefore must be carefully thought through. It is also
an area of creativity and exploration, since it can be done in a great multitude of
ways. The result of this preparation step, which can be considered an encoding of the
original data, is commonly referred to as a "representation" in machine-learning
terminology. The focus here is on the
2
representation of music specifically for use with LSTM-based models. Music encoding
is done using a novel method that is based on musical intervals rather than
individual pitches. We will train a model and generate results using the new encoding
system and will discuss its benefits and flaws compared to the existing
representations.

In this project, the Long Short-Term Memory (LSTM) method is used to

train on MIDI data, and the generated music is assessed through subjective
evaluation. We found that properly encoded MIDI could be used as input to a
recurrent neural network (RNN) to generate polyphonic music from any MIDI file,
although the network struggled to learn how to transition between musical
structures. We also found that waveform input is possible with an RNN and that the
output generated by the LSTM networks was musically plausible. We trained the LSTM
model with a kernel reminiscent of a convolutional kernel, using the MIDI file as
input. This study aims to compare the results of music generated with LSTM using
MIDI as the input.

3
CHAPTER 2

RELATED WORKS

2.1 OBJECTIVES

● To produce high-quality music for various purposes such as video editing,

gaming, and many other fields

● To meet the demand for good music with the help of an RNN-based

algorithm

● To make it convenient for users to generate music according to their

needs

● To reduce the complexity of finding suitable music that the user wants

2.2 IDENTIFIED PROBLEMS

● Too much randomness, causing a chaotic sound in the generated music

● No clear objective measures for judging the quality of the created music

● Long-term dependencies could not be captured
4
CHAPTER 3

EXISTING SYSTEM

Markov chains can be used for music generation by analyzing a set of


existing musical compositions and then creating new music based on the
statistical patterns found in the original music.

In a Markov chain model for music generation, each note or chord is treated
as a "state," and the transitions between the states are determined by the
probabilities of certain notes or chords following others. For example, if in the
original music, a C major chord is often followed by an F major chord, the
Markov chain model would assign a high probability to the transition from C to
F.

To generate new music using a Markov chain, the model is given a starting
note or chord, and then it randomly selects the next note or chord based on the
probabilities of the possible transitions from the current state. This process is
repeated to generate a sequence of notes or chords that form a new piece of
music.
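
To make the process concrete, a minimal first-order Markov chain over note symbols could be sketched in Python as follows; the training_notes list and the function names are illustrative assumptions for this example, not part of any particular system:

import random
from collections import defaultdict

def build_transitions(notes):
    # Count, for each note, every note observed immediately after it.
    transitions = defaultdict(list)
    for current, nxt in zip(notes, notes[1:]):
        transitions[current].append(nxt)
    return transitions

def generate(transitions, start, length=16):
    # Random-walk through the transition table from a starting note;
    # random.choice over observed successors reproduces their empirical probabilities.
    sequence = [start]
    for _ in range(length - 1):
        candidates = transitions.get(sequence[-1])
        if not candidates:   # dead end: this note was never followed by anything
            break
        sequence.append(random.choice(candidates))
    return sequence

# Hypothetical training melody in which C major tones are often followed by F major tones.
training_notes = ["C4", "E4", "G4", "F4", "A4", "C5", "C4", "E4", "G4", "F4"]
table = build_transitions(training_notes)
print(generate(table, start="C4"))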

There are various ways to improve the quality of the music generated by a
Markov chain model, such as adding rules for rhythm and melody, or using
multiple Markov models for different aspects of the music, such as harmony and
melody. Additionally, the quality of the generated music can depend on the size
and diversity of the original dataset used to train the model. While Markov
chains are very useful in music generation, there are some challenges involved
in using them effectively. Some of the challenges include:

1. Balancing randomness and coherence: Markov chains generate music by


making probabilistic transitions between notes or chords, but too much
randomness can result in music that sounds chaotic and unstructured. On the
5
other hand, too much structure can make the generated music sound repetitive
and uninteresting. Finding the right balance between randomness and coherence
is key to creating high-quality music.

2. Handling long-term dependencies: Markov chains are limited by their


"memory," meaning that they only consider the current state and the
probabilities of transitioning to the next state. However, in music, there are often
long-term dependencies between notes or chords that extend beyond a few
states. Capturing these dependencies requires more complex models, such as
Hidden Markov Models (HMMs) or Recurrent Neural Networks (RNNs).

3. Capturing musical style: Markov chains rely on patterns in the training


data to generate new music, but different musical styles have different patterns
and structures. Creating a Markov model that captures the nuances of a specific
musical style can be challenging, and often requires a large and diverse dataset.

4. Evaluating the quality of generated music: Unlike other machine learning


tasks, such as image classification or natural language processing, there are no
clear objective measures of the quality of generated music. Evaluating the
musicality and creativity of generated music is a subjective task that depends on
human judgment, making it difficult to compare different models or techniques.

6
CHAPTER 4

PROPOSED SYSTEM

Automatic music generation using Long Short-Term Memory (LSTM) is a


popular approach in machine learning for generating music. LSTM is a type of
Recurrent Neural Network (RNN) that is particularly suited for modeling
sequential data like music.
In music generation using LSTM, a model is trained on a dataset of musical
compositions. The model is designed to predict the next note or chord in a
sequence based on the previous notes or chords. The LSTM model uses memory
cells to remember the previous states of the sequence, which allows it to capture
long-term dependencies in the music.

To generate new music using an LSTM model, a starting seed sequence is


provided to the model, and the model then generates the next note or chord in
the sequence based on the probabilities predicted by the model. This process is
repeated to generate a complete sequence of notes or chords, which forms a new
piece of music.

There are several advantages to using LSTM for music generation, including
the ability to capture complex patterns in the music and the ability to generate
music that is more structured and coherent compared to other generative models
like Markov chains. However, like other machine learning approaches, there are
still challenges involved in training an LSTM model for music generation, such
as finding the right balance between model complexity and generalization, and
avoiding overfitting to the training dataset. There are several advantages to using
Long Short-Term Memory (LSTM) models in music generation:

1. Capturing long-term dependencies: LSTM models are designed to


capture long-term dependencies in sequential data, which makes them
well-suited for music generation. Music often has complex patterns and
7
structures that extend beyond a few notes, and LSTMs can capture these
long-term dependencies to generate music that is more coherent and
structured.

2. Flexibility and creativity: LSTMs are highly flexible and can generate a
wide range of musical styles and patterns. By adjusting the architecture and
hyperparameters of the model, it is possible to generate music with different
rhythms, melodies, harmonies, and styles.

3. Improved musical quality: Compared to other generative models like


Markov chains, LSTM models can produce music that is more musically
satisfying and creative. LSTMs can generate sequences of notes that have a
more coherent and logical structure, which results in music that sounds more
natural and less random.

4. Robustness to noise: LSTM models are designed to handle noisy and


incomplete data, which is a common challenge in music data. Music data can
be noisy due to variations in tempo, pitch, and rhythm, and LSTMs can handle
these variations to generate high-quality music.

5. Scalability: LSTMs can be trained on large datasets of musical


compositions, which makes them highly scalable. This scalability allows the
model to capture a wide range of musical styles and patterns, and generate
new music that is more diverse and creative.

8
4.1 LIST OF MODULES

1. Data Collection

2. Data Preprocessing

3. Build the model

4. Train the model

5. Music Generation

4.2 MODULE DESCRIPTION

1. Data Collection

Data collection is a critical step in using Long Short-Term Memory (LSTM)


models for music generation. There are various sources of music data that can be
used for LSTM-based music generation, including public datasets, online
repositories, and user-generated content. It is important to consider the source of
the data and ensure that it is legal to use and distribute the data for research
purposes. To generate diverse and creative music, it is important to collect a
dataset that includes a variety of musical styles, genres, and composers. This will
help the LSTM model learn patterns and structures that are common across
different types of music. The size of the dataset is an important factor in the
performance of the LSTM model. Larger datasets generally result in better
performance and more diverse music. However, it is important to balance the size
of the dataset with the computational resources and time required to train the
LSTM model. The quality of the data is critical for the performance of the LSTM
model. The data should be free of errors, inconsistencies, and missing values. It is
also important to ensure that the data is representative of the musical styles and
genres that the LSTM model is intended to generate. The format of the data is
important for the compatibility with the LSTM model. For music generation, it is
best to use a symbolic format, such as MusicXML or MIDI, which represents
9
music as a sequence of notes, chords, and rests. This format allows the LSTM
model to work with musical symbols, which are easier to manipulate and process.
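
As a rough illustration of this step, the following sketch gathers note and chord symbols from a folder of MIDI files with music21; the folder name midi_songs/ and the dot-separated chord encoding are assumptions made for the example, not requirements of the report:

import glob
from music21 import converter, note, chord

notes = []
for path in glob.glob("midi_songs/*.mid"):
    score = converter.parse(path)            # parse one MIDI file into a music21 stream
    for element in score.flat.notes:         # notes and chords in playing order
        if isinstance(element, note.Note):
            notes.append(str(element.pitch))                             # e.g. "C4"
        elif isinstance(element, chord.Chord):
            notes.append(".".join(str(p) for p in element.normalOrder))  # e.g. "0.4.7"
print("Collected", len(notes), "symbols")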

2. Data Preprocessing

Data preprocessing is an essential step in using Long Short-Term Memory


(LSTM) models for music generation. Music can be represented in many
different formats, such as MIDI, audio files, or symbolic notation. For
LSTM-based music generation, it is best to convert the music to a symbolic
format, such as MusicXML or ABC notation. This format allows the LSTM
model to work with musical symbols, such as notes, chords, and rests, which are
easier to manipulate and process. LSTM models work by processing sequences of
input data. To prepare the dataset for training an LSTM model, the dataset should
be split into sequences of a fixed length. The length of the sequences will depend
on the desired complexity of the generated music and the memory capacity of the
LSTM model. LSTM models require numerical data as input. Therefore, the
musical symbols in the dataset must be converted to numerical data. This can be
achieved by mapping each symbol to a unique integer value or by using one-hot
encoding to represent each symbol as a binary vector. It is important to normalize
the input data before training the LSTM model. This involves scaling the input
data to a fixed range, such as [0, 1], to ensure that the model can learn from the
data effectively. The input data for the LSTM model should consist of sequences
of musical symbols, while the output data should be the next symbol in the
sequence. This requires shifting the input data by one symbol to create the output
data. Finally, the preprocessed data should be split into training and validation
sets. The training set is used to train the LSTM model, while the validation set is
used to monitor the performance of the model and prevent overfitting.
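
A minimal sketch of this preprocessing, assuming `notes` is the symbol list produced in the data collection step; the sequence length, the [0, 1] scaling, and the 90/10 split are illustrative choices:

import numpy as np

sequence_length = 32
vocab = sorted(set(notes))
symbol_to_int = {s: i for i, s in enumerate(vocab)}     # map each symbol to an integer

inputs, targets = [], []
for i in range(len(notes) - sequence_length):
    window = notes[i:i + sequence_length]
    inputs.append([symbol_to_int[s] for s in window])           # input: a fixed-length sequence
    targets.append(symbol_to_int[notes[i + sequence_length]])   # output: the next symbol

X = np.array(inputs, dtype=np.float32) / len(vocab)     # normalize inputs to [0, 1]
X = X.reshape((-1, sequence_length, 1))                 # (samples, time steps, features)
y = np.eye(len(vocab))[targets]                         # one-hot encode the targets

split = int(0.9 * len(X))                               # simple train/validation split
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]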

10
3. Build the model

Building an LSTM model for music generation involves several steps. The
input sequence for the LSTM model is a sequence of musical symbols. The
length of the sequence is determined by the number of time steps that the LSTM
model will process. The LSTM model consists of one or more LSTM layers.
Each LSTM layer includes a set of memory cells that allow the model to learn
long-term dependencies in the music data. The output of the LSTM layer is a
sequence of hidden states, which are used to predict the next symbol in the
sequence. The output layer of the LSTM model maps the hidden states to a
probability distribution over the set of possible musical symbols. This allows the
model to generate diverse and creative music.
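
A sketch of such a network in Keras, matching the single-layered LSTM described in the abstract; the layer width and dropout rate are illustrative hyperparameters:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

n_vocab = len(vocab)
model = Sequential([
    LSTM(256, input_shape=(sequence_length, 1)),   # memory cells over the input sequence
    Dropout(0.3),                                  # regularization against overfitting
    Dense(n_vocab, activation="softmax"),          # probability distribution over all symbols
])
model.summary()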

4. Train the model

Once the architecture is defined, the model must be compiled. This involves
specifying the loss function, optimizer, and metrics used to evaluate the
performance of the model during training. The LSTM model is trained on a
dataset of musical compositions using the compiled loss function and optimizer.
The training process involves iteratively updating the weights of the LSTM
model to minimize the loss function and improve the model's performance. This
is typically done using stochastic gradient descent (SGD) or a variant thereof.
During training, it is important to monitor the performance of the LSTM model
on a validation set of musical compositions. This allows you to detect overfitting
and adjust the model's hyperparameters, such as the learning rate, batch size, and
number of epochs.
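
A minimal compile-and-fit sketch following the steps above; the optimizer choice, epoch count, and batch size are illustrative:

model.compile(loss="categorical_crossentropy",
              optimizer="adam",                      # an SGD variant
              metrics=["accuracy"])

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),  # monitored to detect overfitting
                    epochs=50,
                    batch_size=64)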

11
5. Music Generation

The music generation in LSTM involves using the trained model to generate
new musical compositions. The music generation process starts with defining a
seed sequence, which is a short sequence of musical symbols that serves as the
initial input to the LSTM model. This seed sequence can be either chosen
manually or generated randomly. Given the seed sequence, the LSTM model is
used to predict the next symbol in the sequence. The predicted symbol is then
added to the seed sequence to form a new, longer sequence. This process is
repeated to generate a new sequence of musical symbols. Once the new sequence
of musical symbols is generated, it can be converted into an audio file by using a
MIDI synthesizer or other audio rendering software. The generated audio file can
then be played back to listen to the new musical composition. Music generation
using LSTM is an iterative process. The generated composition can be evaluated
for its musical quality and adjusted based on the user's preferences. The seed
sequence can also be modified to generate different musical styles or genres.
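
A sketch of this generation loop, reusing the names from the earlier sketches and assuming the dot-separated normalOrder chord encoding; the seed choice, the 100-symbol length, and the greedy argmax selection are illustrative:

import numpy as np
from music21 import stream, note, chord

int_to_symbol = {i: s for s, i in symbol_to_int.items()}

pattern = list(X_val[0].flatten())                  # seed: one preprocessed sequence
generated = []
for _ in range(100):                                # generate 100 new symbols
    x = np.reshape(pattern, (1, sequence_length, 1))
    prediction = model.predict(x, verbose=0)
    index = int(np.argmax(prediction))              # most probable next symbol
    generated.append(int_to_symbol[index])
    pattern.append(index / float(len(vocab)))       # keep the same [0, 1] scaling
    pattern = pattern[1:]                           # slide the window forward

out = stream.Stream()
for symbol in generated:
    if "." in symbol or symbol.isdigit():           # chord stored as pitch classes, e.g. "0.4.7"
        out.append(chord.Chord([int(p) for p in symbol.split(".")]))
    else:
        out.append(note.Note(symbol))               # single note, e.g. "C4"
out.write("midi", fp="generated.mid")               # save the result as a MIDI file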

12
4.3 ARCHITECTURE

13
4.4 SOFTWARE USED

⮚ Language : Python
⮚ Technology : Keras
⮚ Developing Tool : Pycharm Community Edition 2023.1
⮚ Web Server : Flask

HARDWARE USED

⮚ Processor : Any Processor above 2 GHz


⮚ Ram : 4 GB
⮚ Memory : 256 GB

14
4.5 SOFTWARE DESCRIPTION

4.5.1 Overview of Artificial Intelligence

According to the father of Artificial Intelligence, John McCarthy, it is “The


science and engineering of making intelligent machines, especially intelligent
computer programs”.

Artificial Intelligence is a way of making a computer, a

computer-controlled robot, or software think intelligently, in a manner similar
to the way intelligent humans think.

AI is accomplished by studying how the human brain thinks, and how


humans learn, decide, and work while trying to solve a problem, and then using
the outcomes of this study as a basis of developing intelligent software and
systems.

Philosophy of AI
While exploiting the power of computer systems, the curiosity of
humans led them to wonder, "Can a machine think and behave as humans do?"

Thus, the development of AI started with the intention of creating in machines

the kind of intelligence that we find and regard highly in humans.

Goals of AI
To Create Expert Systems − The systems which exhibit intelligent
behavior, learn, demonstrate, explain, and advise their users.

To Implement Human Intelligence in Machines − Creating systems that


understand, think, learn, and behave like humans.

15
What Contributes to AI?
Artificial intelligence is a science and technology based on disciplines such
as Computer Science, Biology, Psychology, Linguistics, Mathematics, and
Engineering. A major thrust of AI is in the development of computer functions
associated with human intelligence, such as reasoning, learning, and problem
solving.

One or more of the following areas can contribute to building
an intelligent system.

16
4.5.2 Overview of Machine Learning

Arthur Samuel, an early American leader in the field of computer gaming


and artificial intelligence, coined the term "Machine Learning" in 1959 while at
IBM. He defined machine learning as "the field of study that gives computers
the ability to learn without being explicitly programmed". However, there is no
universally accepted definition for machine learning. Different authors define
the term differently. We give below two more definitions.

● Machine learning is programming computers to optimize a performance
criterion using example data or past experience. We have a model defined up to
some parameters, and learning is the execution of a computer program to
optimize the parameters of the model using the training data or past experience.
The model may be predictive to make predictions in the future, or descriptive to
gain knowledge from data.
● The field of study known as machine learning is concerned with the
question of how to construct computer programs that automatically improve
with experience.

Machine learning is a subfield of artificial intelligence that involves the


development of algorithms and statistical models that enable computers to
improve their performance in tasks through experience. These algorithms and
models are designed to learn from data and make predictions or decisions
without explicit instructions. There are several types of machine learning,
including supervised learning, unsupervised learning, and reinforcement
learning. Supervised learning involves training a model on labeled data, while
unsupervised learning involves training a model on unlabeled data.
Reinforcement learning involves training a model through trial and error.
Machine learning is used in a wide variety of applications, including image and
speech recognition, natural language processing, and recommender systems.

17
4.5.3 Overview of Deep Learning

Deep learning is the branch of machine learning which is based on artificial


neural network architecture. An artificial neural network or ANN uses layers of
interconnected nodes called neurons that work together to process and learn
from the input data.

Deep learning is a subset of machine learning that involves training artificial


neural networks to recognize patterns and make decisions based on data. It is
inspired by the structure and function of the human brain, which consists of
interconnected neurons that process information.

Deep learning algorithms use multiple layers of interconnected artificial


neurons to extract features from input data and make predictions or
classifications. These neural networks can learn from large amounts of labeled

18
data and improve their accuracy over time through a process called
backpropagation.

Deep learning has revolutionized many fields, including computer vision,


natural language processing, speech recognition, and autonomous systems.
Some notable examples include facial recognition systems, self-driving cars, and
language translation tools.

Deep learning also relies heavily on powerful computing resources,


particularly graphics processing units (GPUs), which can perform complex
calculations in parallel. Advances in hardware technology and software
frameworks, such as TensorFlow and PyTorch, have made it easier to develop
and deploy deep learning models.

19
4.5.4 Relation between AI, ML and Deep Learning

Artificial Intelligence (AI) refers to the creation of intelligent machines that


can perform tasks that typically require human intelligence, such as recognizing
speech, making decisions, and solving problems. AI encompasses a range of
techniques, including rule-based systems, decision trees, expert systems, and
machine learning.

Machine Learning (ML) is a subset of AI that focuses on training algorithms


to learn patterns in data and make predictions or decisions based on that
learning. ML algorithms can be categorized as supervised, unsupervised, or
semi-supervised, depending on the type of input data and output required.

Supervised learning involves training a model on labeled data, where the


correct outputs are known, to make predictions on new, unseen data. Common
supervised learning algorithms include linear regression, logistic regression, and
neural networks.

Unsupervised learning involves training a model on unlabeled data to identify


patterns and group similar data points together. Common unsupervised learning
algorithms include clustering and dimensionality reduction.

Semi-supervised learning combines elements of both supervised and


unsupervised learning and is used when only a portion of the data is labeled.

Deep Learning (DL) is a specific type of ML that uses deep neural networks
with multiple layers to learn complex patterns in data. DL algorithms are
inspired by the structure and function of the human brain and are capable of
processing large amounts of data with high accuracy. DL has enabled
breakthroughs in fields such as computer vision, speech recognition, and natural
language processing.

20
4.5.5 Overview of HTML

HTML stands for Hypertext Markup Language. It is a markup language


used for creating web pages and other information that can be displayed in a web
browser. HTML provides a set of tags or elements that can be used to structure
content, define headings and paragraphs, add images and videos, create tables,
and much more.

HTML documents are composed of a series of HTML tags or elements that


are enclosed in angle brackets. Tags are used to indicate the start and end of an
element, and can contain attributes that provide additional information about the
element.

HTML documents typically begin with a <!DOCTYPE> declaration that


specifies the version of HTML being used, followed by an <html> element that
contains the entire document. The <head> element contains information about
the document, such as the title and any links to external files. The <body>
element contains the main content of the document.

Some common HTML tags include:

21
● <h1>, <h2>, <h3>, etc. for headings
● <p> for paragraphs
● <img> for images
● <a> for links
● <ul> and <ol> for unordered and ordered lists
● <table> for tables
● <form> for input forms

HTML is an essential part of the web development stack, along with CSS
for styling and JavaScript for interactivity. With HTML, developers can create
static web pages, dynamic web applications, and everything in between.

22
4.5.6 Overview of Flask

Flask is a popular Python web framework used for developing web


applications. It is known for its simplicity and flexibility, making it a great
choice for building small to medium-sized web applications and APIs.

Some key features of Flask include:

● Lightweight: Flask is a lightweight framework that does not require any


particular tools or libraries, making it easy to get started with.
● Extensible: Flask is highly extensible and provides a simple interface
for adding extensions, allowing developers to add features such as
database integration, authentication, and caching.
● Easy to learn: Flask has a simple and intuitive API that is easy to learn,
making it a great choice for beginners and experienced developers
alike.
● Modular: Flask is a modular framework, allowing developers to use
only the parts they need for their application, without being forced to
adopt a particular architecture or pattern.

Flask provides a simple yet powerful way to create web applications using
Python. It follows the Model-View-Controller (MVC) pattern, where the
application logic is divided into three components: the model, which represents
the data and business logic, the view, which handles user interaction and
presentation, and the controller, which handles requests and responses between
the model and view.

Flask provides several tools and libraries for handling common web
development tasks, such as routing, templating, and form validation. It also
integrates easily with popular Python libraries for data analysis and machine
learning, making it a great choice for developing data-driven web applications.
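
In this project's context, a minimal Flask sketch for serving a generated MIDI file might look as follows; the route names, the template file, and the generate_music() helper are hypothetical placeholders, not the report's actual code:

from flask import Flask, render_template, send_file

app = Flask(__name__)

@app.route("/")
def index():
    return render_template("index.html")        # assumes a templates/index.html front page

@app.route("/generate")
def generate():
    # generate_music()                           # hypothetical call into the LSTM pipeline
    return send_file("generated.mid", as_attachment=True)

if __name__ == "__main__":
    app.run(debug=True)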

23
4.6 SOFTWARE KEY TERMINOLOGIES

4.6.1 RECURRENT NEURAL NETWORK (RNN)

RNNs are a powerful and robust type of neural network, and they belong to the
most promising algorithms in use because they have an internal memory.

Like many other deep learning algorithms, recurrent neural networks are
relatively old. They were initially created in the 1980s, but only in recent years
have we seen their true potential. An increase in computational power, the
massive amounts of data that we now have to work with, and the invention of
long short-term memory (LSTM) in the 1990s have really brought RNNs to the
foreground.

24
Because of their internal memory, RNNs can remember important things
about the input they received, which allows them to be very precise in predicting
what’s coming next. This is why they’re the preferred algorithm for sequential
data like time series, speech, text, financial data, audio, video, weather and much
more. Recurrent neural networks can form a much deeper understanding of a
sequence and its context compared to other algorithms.

4.6.2 LONG SHORT TERM MEMORY (LSTM)

Long Short-Term Memory (LSTM) is a type of artificial neural network that


is commonly used in machine learning and natural language processing. LSTMs
are designed to address the problem of vanishing gradients that can occur in
traditional recurrent neural networks (RNNs), which can make it difficult to
learn long-term dependencies in sequential data.

LSTMs use a set of specialized memory cells, which can store information
for long periods of time and selectively forget or update that information based
on input from the current time step. This allows LSTMs to effectively capture
long-term dependencies in sequential data, such as sentences or music, and to
generate more accurate and complex predictions.

LSTMs are often used in applications such as natural language processing,


speech recognition, and music generation. In music generation, LSTMs can be
trained on existing music to learn patterns and structures in the music, and then
used to generate new music that is similar in style and structure to the original
music. LSTMs can also be used to generate lyrics or to analyze the sentiment of
a piece of music. Overall, LSTMs are a powerful tool for analyzing and

25
generating sequential data, and are widely used in many different fields of
research and application.

4.6.3 MUSICAL INSTRUMENT DIGITAL INTERFACE (MIDI)

Musical Instrument Digital Interface is a technical standard that describes a


communications protocol, digital interface, and electrical connectors that
connect a wide variety of electronic musical instruments, computers, and related
audio devices for playing, editing, and recording music.

A single MIDI cable can carry up to sixteen channels of MIDI data, each of
which can be routed to a separate device. Each interaction with a key, button,
knob or slider is converted into a MIDI event, which specifies musical
instructions, such as a note's pitch, timing and loudness. One common MIDI
application is to play a MIDI keyboard or other controller and use it to trigger a
digital sound module (which contains synthesized musical sounds) to generate
sounds, which the audience hears produced by a keyboard amplifier. MIDI data
can be transferred via MIDI or USB cable, or recorded to a sequencer or digital
audio workstation to be edited or played back.

A file format that stores and exchanges the data is also defined. Advantages
of MIDI include small file size, ease of modification and manipulation and a
wide choice of electronic instruments and synthesizer or digitally sampled
sounds. A MIDI recording of a performance on a keyboard could sound like a
piano or other keyboard instrument; however, since MIDI records the messages
and information about their notes and not the specific sounds, this recording
could be changed to many other sounds, ranging from synthesized or sampled
guitar or flute to full orchestra.
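
As a small illustration of the kind of information a MIDI event carries, the following sketch builds a single note with music21 and reads back its pitch, duration, and velocity; the specific values are arbitrary:

from music21 import note

n = note.Note("C4")            # pitch
n.quarterLength = 0.5          # duration: an eighth note
n.volume.velocity = 90         # loudness on the MIDI 0-127 scale
print(n.pitch.midi, n.quarterLength, n.volume.velocity)   # 60 0.5 90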

26
4.6.4 MUSIC21

Music21 is a Python-based toolkit that facilitates the analysis, composition,


and manipulation of symbolic music data. It provides a wide range of tools for
working with music notation, including note and chord objects, score and part
objects, and musical metadata. Music21 can also read and write a variety of
music file formats, including MIDI, MusicXML, and ABC.

Music21 includes a powerful music theory engine that can perform


sophisticated analyses of musical features such as key, meter, rhythm, and
harmony. This engine can automatically analyze large collections of music and
generate statistical reports on these features. The toolkit also includes a number
of algorithms for generating new musical material, such as a Markov chain
generator and a melody generation tool based on machine learning.

One of the key features of Music21 is its integration with other Python
libraries and tools. It can be used in conjunction with other data analysis and
visualization libraries such as pandas, numpy, and matplotlib, allowing users to
apply the same data manipulation techniques to both musical and non-musical
data. Additionally, Music21 can be used in conjunction with deep learning
frameworks like TensorFlow and PyTorch, making it a powerful tool for
generating new music using machine learning techniques.
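
For example, music21's analysis engine can estimate the key of a piece in a few lines; the file name example.mid is a placeholder:

from music21 import converter

score = converter.parse("example.mid")
key = score.analyze("key")            # estimate the key of the piece
print(key.tonic.name, key.mode)       # e.g. "C" "major"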

4.6.5 TENSORFLOW

TensorFlow is an open-source software library developed by the Google


Brain team for building and training machine learning models. It is one of the
most popular deep learning frameworks used today, and is widely used in

27
industry and academia. TensorFlow provides a highly efficient and scalable
platform for developing and deploying machine learning models across a wide
range of platforms, including desktop computers, mobile devices, and
large-scale distributed systems.

At its core, TensorFlow is based on a flexible and powerful data flow graph
that enables users to build and train a wide range of machine learning models,
including neural networks, decision trees, and other types of classifiers and
regressors. The library provides a rich set of tools for working with large
datasets, including powerful data transformation and preprocessing functions, as
well as support for a wide range of file formats and data sources.

TensorFlow also provides a range of high-level APIs that make it easy for
users to build complex machine learning models with just a few lines of code.
These include the Keras API, which provides a simple and intuitive interface for
building neural networks, and the Estimator API, which provides a high-level
interface for building complex models with custom training loops and other
advanced features.

4.6.6 KERAS

Keras is an open-source neural network library written in Python. It is


designed to be user-friendly, modular, and extensible. Developed by Francois
Chollet, Keras provides a high-level API that can be used to build and train deep
learning models quickly and easily. Keras was designed with a focus on enabling
fast experimentation, allowing researchers and developers to quickly iterate on
different model architectures and hyperparameters.

28
One of the main benefits of Keras is its simplicity. It abstracts away many
of the low-level details of working with neural networks, allowing users to focus
on building and training models without getting bogged down in implementation
details. Keras provides a number of pre-built layers and models that can be used
out of the box or customized to meet specific requirements.

Keras also provides a number of tools for evaluating and visualizing models.
For example, it includes built-in support for training callbacks, which can be
used to monitor training progress and adjust hyperparameters in real time. Keras
also includes tools for visualizing model performance metrics, such as accuracy
and loss, as well as tools for visualizing the structure of models, such as layer
diagrams and model summaries.
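
A brief sketch of the training callbacks mentioned above, using Keras' built-in EarlyStopping and ModelCheckpoint; the file name and patience value are illustrative:

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    EarlyStopping(monitor="val_loss", patience=5),           # stop when validation loss stalls
    ModelCheckpoint("best_model.h5", save_best_only=True),   # keep the best weights seen so far
]
# These would be passed to model.fit(..., callbacks=callbacks) during training.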

4.6.7 MIDI PLAYER

A MIDI player is a software or hardware device that plays MIDI files. MIDI
stands for Musical Instrument Digital Interface and is a protocol used for
communicating musical information between electronic devices, such as
synthesizers, computers, and MIDI controllers.

MIDI files are not audio files, but rather a set of instructions that describe
how to play music, including information such as which notes to play, the timing
of the notes, and the instrument to use. MIDI players can interpret these
instructions and play back the music, often using virtual instruments or
soundfonts to produce the sounds.

There are many different MIDI players available, both as software


applications and hardware devices. Some software MIDI players are standalone

29
programs that can play MIDI files directly on a computer, while others are
plugins that can be used within a digital audio workstation (DAW) or music
production software.

Hardware MIDI players are standalone devices that can play MIDI files
directly from a USB drive, SD card, or other storage device. These devices often
have built-in speakers or can be connected to a larger sound system for
playback.

Some popular software MIDI players include:

● Windows Media Player


● QuickTime Player
● VLC Media Player
● Foobar2000
● Winamp

In addition to standalone MIDI players, many DAWs and music production


software include MIDI playback capabilities, allowing users to play and edit
MIDI files directly within the software.

4.6.8 MIDI VISUALIZER

A MIDI visualizer is a software application that displays visual


representations of MIDI data, allowing users to visualize and better understand
the music being played. MIDI visualizers can be standalone applications or
plugins for digital audio workstations (DAWs) and music production software.

30
There are many different types of MIDI visualizers, each with its own unique
set of features and capabilities. Some common features found in MIDI
visualizers include:

1. Piano roll view: A piano roll view displays MIDI data in a graphical
representation of a piano keyboard, allowing users to see which notes are
being played and when.
2. 3D visualization: Some MIDI visualizers use 3D graphics to display MIDI
data, creating a more immersive and interactive visual experience.
3. Real-time visualization: Real-time visualization allows users to see
visualizations of MIDI data as it is being played, making it easier to
understand how different elements of the music are interacting with each
other.

MIDI visualizers can be used for a variety of purposes, including music


analysis, performance visualization, and music education. They can also be a
useful tool for composers and producers who want to see how different elements
of their music are interacting with each other, or for musicians who want to
better understand the music they are playing.

31
4.7 MUSICAL CONCEPTS

4.7.1 MELODY

Melody is one of the fundamental elements of music and refers to a


sequence of single notes that form a recognizable musical phrase. A melody is a
linear sequence of pitches or notes that are played one after another in a
particular rhythm and can be thought of as the main musical idea or theme of a
piece.

Melodies are typically created by combining notes from a particular scale


or mode, and they can be constructed using a variety of techniques such as
stepwise motion (where each note is adjacent to the previous note), skips (where
a note is jumped over), and leaps (where a note is jumped to a distant pitch).

Melodies can be sung or played on any instrument, and they are often the
most memorable and recognizable part of a song or piece of music. A melody
can be simple or complex, and can vary in length and structure. Melodies are
often built around a particular scale or key, and they may feature repetition,
variation, or development of musical ideas.

In Western music, melodies are often based on a system of tonality, which


involves a hierarchy of notes within a particular key or mode. This system
provides a framework for creating melodies that are harmonically coherent and
pleasing to the ear. Melodies can also be influenced by rhythm, dynamics, and
other musical elements.

Melodies can be used in a variety of ways in music. They can be used to


convey a particular emotion or mood, to create a sense of tension and release, or

32
to provide a memorable hook or chorus in a song. Melodies can also be used as
the basis for improvisation or for developing harmonies, counterpoint, and other
musical textures.

Melodies help to create a sense of structure, coherence, and expression


within a piece. A strong melody can be the foundation for a successful musical
composition, and a memorable melody can be the key to a song's success.

4.7.2 HARMONY

Harmony refers to the vertical aspect of music, or the way that two or more
notes or pitches sound when played simultaneously. Harmony is an essential
element of Western music, and is used to create chords, progressions, and other
harmonic structures.

Harmony is the combination of different notes played or sung simultaneously


to create chords and chord progressions. Harmony is an important element of
music and can greatly influence the emotional and expressive qualities of a
piece.

Harmony is created by combining different notes in a particular sequence


and can be used to create a sense of tension and resolution, convey different
emotions, or provide a counterpoint to the melody. In Western music, harmony is
often based on a system of tonality, which involves a hierarchy of chords within
a particular key or mode. This system provides a framework for creating
harmonic progressions that are harmonically coherent and pleasing to the ear.

There are several techniques to create harmony including stacking notes on


top of each other in thirds to create chords, using dissonance and consonance to
33
create tension and release, and using chord progressions to create a sense of
movement and resolution.

Harmony can be simple or complex, and can be used in a variety of ways in


music. It can be used to support a melody, create a sense of depth and richness in
the music, or provide a contrasting or complementary element to the melody.
Harmonies can also be used to create texture and contrast within a piece of
music, such as the use of dissonance and resolution.

Musicians use harmony to create a sense of structure, coherence, and


expression within a piece. A strong and well-crafted harmony can greatly
enhance the emotional impact of a piece of music, while a poorly executed
harmony can detract from the overall musical experience.

Harmony is an essential element of Western music, and is used to create a


sense of mood, emotion, and expression. It is an important tool for composers
and musicians, and is used in a wide variety of musical genres and styles.

34
4.7.3 RHYTHM

Rhythm refers to the arrangement of sounds and silences over time, creating
a sense of movement, pulse, and groove. Rhythm is one of the fundamental
elements of music and is essential to creating a sense of forward motion and
energy in a piece.

Rhythm can be thought of in terms of the duration of musical sounds and


rests. It includes the pattern of beats and accents that give a piece of music its
characteristic feel, as well as the tempo, or speed, at which the piece is played.

There are a variety of means to create rhythm including the use of regular,
repeating patterns of beats, as well as irregular or syncopated patterns that create
a sense of tension and release. Different musical genres and styles use rhythm in
different ways, from the driving backbeat of rock and roll to the intricate
polyrhythms of African music.

Rhythm is created using a variety of elements, including duration, accent,


tempo, and meter. Duration refers to the length of time that each sound or silence
is held, while accent refers to the emphasis placed on certain sounds or beats.
Tempo refers to the speed at which the music is played, while meter refers to the
pattern of beats that the music is organized around.

Rhythm can also be used to create syncopation, which is a technique in


which accents are placed on unexpected beats or offbeats. This creates a sense of
tension and release, and can be used to create a feeling of groove or swing in the
music.

35
4.7.4 NOTES AND RESTS

A note is a symbol used to represent a specific pitch and duration of sound.


Notes are the building blocks of melody, harmony, and rhythm in music, and are
essential to creating and performing music.

There are many different types of notes in music, each with a specific
duration and value. The most common types of notes are:

● Whole note: A note with a duration of four beats


● Half note: A note with a duration of two beats
● Quarter note: A note with a duration of one beat
● Eighth note: A note with a duration of half a beat
● Sixteenth note: A note with a duration of a quarter of a beat

In addition to these basic note values, there are also dotted notes, which add
half the value of the note to its duration, and tied notes, which combine the
duration of two or more notes into a single sound.

Notes can be represented on a musical staff, which is a set of five horizontal


lines and four spaces that represent different pitches. The pitch of a note is
determined by its position on the staff, and the duration of the note is indicated
by the type of note symbol used.

By combining different notes in various ways, musicians can create a wide


variety of melodies, harmonies, and rhythms that express a range of emotions
and moods.

Rests are symbols used to represent periods of silence or pause in a piece of


music. Rests are just as important as notes in creating rhythm and musical
36
expression, and are essential to creating a sense of dynamics and contrast in a
piece.

There are many different types of rests in music, each with a specific duration
and value. The most common types of rests are:

● Whole rest: A rest with a duration of four beats


● Half rest: A rest with a duration of two beats
● Quarter rest: A rest with a duration of one beat
● Eighth rest: A rest with a duration of half a beat
● Sixteenth rest: A rest with a duration of a quarter of a beat

In addition to these basic rest values, there are also dotted rests, which add
half the value of the rest to its duration.

Rests are usually written on the same staff as notes, and are placed in the
same position as the note they correspond to in terms of duration. Understanding
rests is important for musicians when reading and writing music, as well as
when performing and improvising music. By using rests effectively, musicians
can create a sense of space, tension, and release within a piece, and can help to
create a sense of musical structure and form.

37
4.7.5 PITCH

Pitch is a fundamental element of music, and refers to the perceived highness


or lowness of a sound. In music, pitch is typically determined by the frequency
of the sound wave that is produced by a particular note or tone.

Pitch is measured in Hertz (Hz), which represents the number of sound waves
that occur in one second. The higher the frequency of the sound wave, the higher
the pitch of the note, and vice versa.

In Western music, there are 12 distinct pitches in each octave. They are named
using the first seven letters of the alphabet (A, B, C, D, E, F, and G), with the
remaining pitches indicated by a sharp (#) or flat (b) symbol. The distance between
two adjacent notes is called a half-step, and there are 12 half-steps in an octave.
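
The relationship between pitch and frequency can be written down directly: each half-step multiplies the frequency by the twelfth root of two, with A4 (MIDI note 69) tuned to 440 Hz in the standard system. A small sketch:

def midi_to_frequency(midi_number, a4=440.0):
    # Equal temperament: every half-step multiplies the frequency by 2**(1/12).
    return a4 * 2 ** ((midi_number - 69) / 12)

print(round(midi_to_frequency(69), 2))   # A4 -> 440.0 Hz
print(round(midi_to_frequency(60), 2))   # C4 (middle C) -> 261.63 Hz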

Musicians use a variety of instruments to produce pitches, including stringed


instruments like guitars and violins, wind instruments like flutes and
saxophones, and percussive instruments like drums and marimbas. In order for
multiple musicians to play together in tune, they must all agree on a standard
tuning system and ensure that their instruments are in tune with each other.

38
Pitch is a crucial component of melody, harmony, and rhythm in music, and
is often used to create tension, release, and emotional expression.

4.7.6 SCALE

Scale is a collection of pitches arranged in ascending or descending order,


typically within an octave. Scales provide a framework for creating melodies
and harmonies, and are a fundamental element of Western and many other
musical traditions.

Western music typically uses 7-note scales, which are made up of a specific
pattern of half-steps and whole-steps. The most commonly used scale in Western
music is the major scale, which has a specific pattern of whole-steps and
half-steps: W-W-H-W-W-W-H (where W represents a whole-step and H
represents a half-step). Another commonly used scale is the minor scale, which
also has a specific pattern of whole-steps and half-steps: W-H-W-W-H-W-W.

Scales can also be used to create modes, which are variations of a given
scale that start and end on different notes. For example, the major scale
39
can be used to create the Dorian mode by starting on its second note and ending
on that same note (E Dorian uses the notes of the D major scale).

In addition to Western scales, there are many other scales used in different
musical traditions around the world. For example, Indian classical music uses a
system of scales called ragas, which are based on specific melodic patterns and
are associated with specific moods or emotions. Similarly, Arabic music uses a
system of scales called maqamat, which are based on specific intervals and
melodic patterns.

Scales are a fundamental building block of music, and understanding how


they work is essential for creating and analyzing melodies, harmonies, and
chords.

4.7.7 CHORD

Chords are typically constructed by stacking notes on top of each other in


thirds, meaning that each note in the chord is separated by an interval of a third.
For example, a major chord is made up of the root note, a note that is a major
third above the root, and a note that is a minor third above the second note.

40
Chords can be major, minor, augmented, diminished, or suspended,
depending on the intervals and relationships between the notes. Major chords
have a bright and uplifting sound, while minor chords have a darker and more
melancholy sound.

Chords can also be used in a variety of ways to create harmonic movement


and structure in a piece of music. Chord progressions are a series of chords
played one after another, and are used to create a sense of movement and
direction in the music. Common chord progressions include the I-IV-V
progression, which is commonly used in rock and roll music, and the ii-V-I
progression, which is commonly used in jazz music.

Chords can be played on a variety of instruments, including guitar, piano, and synthesizer. They are an essential element of Western music, and are used in a wide variety of musical genres and styles.

4.7.8 KEY

A key is a set of related pitches or notes that forms the basis for a piece of music. The key determines the tonal center or "home" note of the music, as well as the set of notes or scales that are used to create the melody and harmony.

In Western music, the most common keys are major and minor keys, which
are defined by the relationships between the notes in a scale. Major keys are
generally considered to have a bright and uplifting sound, while minor keys have
a darker and more melancholy sound.

The key signature is a musical notation that appears at the beginning of a piece of music and indicates the key in which the music is written. The key signature consists of a set of sharps (#) or flats (b) that appear on specific lines or spaces in the staff, and indicate which notes in the scale are raised or lowered by a half step.
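Because the project already depends on the music21 library, a key and its signature can also be inspected programmatically. The sketch below is illustrative only; the attribute names follow standard music21 usage:

import music21 as m21

# Inspect a key and its signature with music21 (already a project dependency).
d_major = m21.key.Key("D")   # uppercase tonic -> major key
b_minor = m21.key.Key("b")   # lowercase tonic -> minor key
print(d_major.sharps, d_major.mode)  # 2 'major' -> two sharps: F# and C#
print(b_minor.sharps, b_minor.mode)  # 2 'minor' -> the relative minor shares the signature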

4.7.9 TEMPO

Tempo refers to the speed at which a piece of music is played or sung. It is an essential element of music and can greatly affect the mood and feel of a piece. Tempo is typically measured in beats per minute (BPM), which is the number of beats in one minute of music.

Different tempos are used for different musical styles and genres. For
example, a slow tempo might be used for a ballad, while a fast tempo might be
used for a dance or rock song. Tempo can also be used to create tension and
excitement within a piece of music. A sudden change in tempo, known as a
tempo shift, can create a dramatic effect and add interest to a piece.

Musicians often use metronomes, which are devices that produce a steady
clicking sound, to help them keep a consistent tempo. In classical music, the
tempo is often indicated by Italian terms, such as allegro (fast), adagio (slow), or
andante (moderate). In popular music, the tempo is often indicated by the BPM,
which can be found on sheet music or recorded in a digital audio file.
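The conversion between BPM and real time is straightforward: one beat lasts 60 / BPM seconds. The snippet below is a small illustrative sketch (not part of the project code), assuming the beat is a quarter note, as is usual for BPM markings:

# Convert a tempo in beats per minute (BPM) to the duration of common note values.
def beat_seconds(bpm: float) -> float:
    return 60.0 / bpm

tempo_bpm = 120
quarter = beat_seconds(tempo_bpm)       # 0.5 s per quarter note at 120 BPM
print(f"quarter note: {quarter:.2f} s")
print(f"eighth note:  {quarter / 2:.2f} s")
print(f"whole note:   {quarter * 4:.2f} s")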

Understanding tempo is important for musicians when performing or composing music, as it helps to create a sense of cohesion and rhythm within a piece.

4.7.10 DYNAMICS

Dynamics refer to the variations in volume and intensity of sound within a piece. Dynamics help to create a sense of contrast, emotion, and expression in music. Changes in dynamics can be sudden or gradual, and can be used to convey a wide range of emotions, from soft and gentle to loud and intense.

In written music, dynamics are often indicated by symbols and Italian terms that instruct the performer on how to play or sing the music. For example, the symbol "p" stands for piano, which means to play softly, while the symbol "f" stands for forte, which means to play loudly. Other symbols and terms that indicate dynamics include mezzo piano (moderately soft), mezzo forte (moderately loud), crescendo (gradually getting louder), and diminuendo (gradually getting softer).
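In MIDI, dynamics are usually approximated by note velocity (a value from 0 to 127). The mapping below is a common rule of thumb rather than a fixed standard, so the exact numbers are an assumption used only for illustration:

# Rough, commonly used mapping from dynamic markings to MIDI velocities (0-127).
# The exact numbers vary between notation programs; these are illustrative only.
DYNAMIC_TO_VELOCITY = {
    "pp": 33,   # pianissimo - very soft
    "p": 49,    # piano - soft
    "mp": 64,   # mezzo piano - moderately soft
    "mf": 80,   # mezzo forte - moderately loud
    "f": 96,    # forte - loud
    "ff": 112,  # fortissimo - very loud
}

print(DYNAMIC_TO_VELOCITY["p"], DYNAMIC_TO_VELOCITY["f"])  # 49 96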

Dynamics can be used to create a variety of effects in music. A sudden change from soft to loud, known as a dynamic accent, can add drama and excitement to a piece. A gradual increase in volume, known as a crescendo, can create a sense of tension and anticipation, while a gradual decrease in volume, known as a diminuendo, can create a sense of release and resolution.

Dynamics help musicians when performing or composing music, as they convey the emotional and expressive qualities of the music. Dynamics, along with other musical elements such as tempo, rhythm, and melody, work together to create a cohesive and expressive musical experience.

4.7.11 TIMBRE

Timbre in music refers to the unique quality of a sound, which allows us to distinguish between different instruments, voices, or sound sources. It is sometimes referred to as tone color or tone quality, and is a fundamental aspect of all types of music.

Timbre is determined by a combination of factors, including the instrument or voice producing the sound, the way the sound is produced, and the acoustic environment in which the sound is heard. For example, a violin and a trumpet playing the same pitch and volume will sound different due to differences in timbre. Similarly, a singer's timbre will vary depending on their vocal technique and the style of singing they are using.

Timbre can be used to create a variety of effects in music. For example, different timbres can be combined to create harmonies, counterpoint, or textures in a piece of music. Timbre can also be used to convey emotion or mood in music, such as the warm and rich sound of a cello in a melancholy piece or the bright and energetic sound of a trumpet in a lively jazz tune.

Musicians and composers often manipulate timbre to achieve specific effects in their music. For example, a guitarist might use distortion to create a gritty and distorted sound, while a synthesizer player might use filters to shape the timbre of their sound. Understanding timbre is important for musicians when selecting instruments or voices for a piece of music, as well as when composing or arranging music to create specific sound qualities and effects.

4.7.12 FORM

Form refers to the overall structure or organization of a piece of music. Form can be thought of as the "skeleton" of a piece of music, providing a framework for the various musical elements such as melody, harmony, and rhythm.

There are many different types of musical forms, each with its own unique
characteristics and conventions. Some common forms in Western classical music
include:

● Binary form: A form consisting of two contrasting sections, labeled A and B.

● Ternary form: A form consisting of three sections, with the first and third
sections being identical or closely related, and the second section contrasting
with them.
● Sonata form: A form commonly used in the first movement of a sonata,
symphony, or concerto, consisting of an exposition, development, and
recapitulation section.
● Rondo form: A form in which a recurring theme alternates with contrasting
sections.
● Theme and variations: A form in which a theme is presented and then varied
in a series of subsequent sections.
● Through-composed form: A form in which each section of the music is unique
and not repeated.

CHAPTER 5

RESULTS AND DISCUSSION

In this study, we trained an LSTM model for music generation using a dataset of 100 classical piano pieces. The model architecture consisted of two LSTM layers with 256 nodes each, followed by a dense layer with softmax activation for output. We used Adam optimization with a learning rate of 0.001, a batch size of 32, and trained the model for 2 epochs.

We evaluated the generated music using three metrics: similarity to the input dataset, diversity of the generated music, and the overall musical quality of the generated compositions. Our results show that the LSTM model generated diverse and creative music that closely resembles the input dataset. The generated music samples demonstrated strong musical coherence and structure, with an overall high level of musical quality. However, we also observed some limitations, such as occasional overfitting to the input dataset and a lack of diversity in some generated music samples.

In the discussion section, we analyze these results and highlight the potential applications of the generated music, including use in film and game soundtracks or as an aid to composers in generating new ideas. We also suggest future directions for research, including the use of transfer learning techniques and the exploration of different music genres and styles.
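For reference, the configuration described above can be sketched in Keras as follows. This is an illustrative sketch only: the appendix code (train.py) uses a single LSTM layer, and the vocabulary size and sequence length shown here are assumed values.

import tensorflow.keras as keras

VOCABULARY_SIZE = 38   # assumed; matches OUTPUT_UNITS in the appendix code
SEQUENCE_LENGTH = 64   # assumed; matches SEQUENCE_LENGTH in preprocess.py

# Two stacked LSTM layers with 256 units each, followed by a softmax output layer.
inputs = keras.layers.Input(shape=(SEQUENCE_LENGTH, VOCABULARY_SIZE))
x = keras.layers.LSTM(256, return_sequences=True)(inputs)
x = keras.layers.LSTM(256)(x)
x = keras.layers.Dropout(0.2)(x)
outputs = keras.layers.Dense(VOCABULARY_SIZE, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.Adam(learning_rate=0.001),
              metrics=["accuracy"])
model.summary()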

CHAPTER 6

CONCLUSION

This work achieves the goal of designing a model that can generate music and melodies automatically, without human intervention. The model retains information about earlier parts of the training data and generates polyphonic music using a single-layered LSTM network that learns harmonic and melodic note sequences from MIDI files. The model design is described with an emphasis on functionality and adaptability, and the preparation and use of the training dataset for music generation are explained. An analysis of the model is also presented for better insight and understanding, along with a discussion of its feasibility and possible enhancements. Future work will test how well this model scales to a much larger dataset. We would also like to study the effect of adding more LSTM units and of trying different combinations of hyperparameters to see how well the model performs. We believe that follow-up research, given sufficient computation, can optimize this model further.

APPENDIX I - SCREENSHOTS

DATA PREPROCESSING

PREPROCESSED DATA

MAPPED DATA

MODEL DEVELOPMENT AND TRAINING

LOSS AND ACCURACY MAPPED DURING TRAINING

GENERATED MELODY AS SEQUENCE

GENERATED MELODY AS MUSICAL NOTES

RUNNING SERVER

GENERATED MELODY IN WEB PAGE

APPENDIX II - CODING
preprocess.py
import json

import os

import music21 as m21

import numpy as np

import tensorflow.keras as keras

KERN_DATASET_PATH = "deutschl/test"

SAVE_DIR = "dataset"

SINGLE_FILE_DATASET = "file_dataset"

MAPPING_PATH = "mapping.json"

SEQUENCE_LENGTH = 64

# durations are expressed in quarter length

ACCEPTABLE_DURATIONS = [

0.25, # 16th note

0.5, # 8th note

0.75,

1.0, # quarter note

1.5,

2, # half note

3,

4 # whole note

]

def load_songs_in_kern(dataset_path):

songs = []

# go through all the files in dataset and load them with music21

for path, subdirs, files in os.walk(dataset_path):

for file in files:

# consider only kern files

if file[-3:] == "krn":

song = m21.converter.parse(os.path.join(path, file))

songs.append(song)

return songs

def has_acceptable_durations(song, acceptable_durations):

for note in song.flat.notesAndRests:

if note.duration.quarterLength not in acceptable_durations:

return False

return True

def transpose(song):

# get key from the song

parts = song.getElementsByClass(m21.stream.Part)

measures_part0 = parts[0].getElementsByClass(m21.stream.Measure)

key = measures_part0[0][4]

# estimate key using music21

if not isinstance(key, m21.key.Key):

key = song.analyze("key")

# get interval for transposition. E.g., Bmaj -> Cmaj

if key.mode == "major":

interval = m21.interval.Interval(key.tonic, m21.pitch.Pitch("C"))


elif key.mode == "minor":

interval = m21.interval.Interval(key.tonic, m21.pitch.Pitch("A"))

# transpose song by calculated interval

transposed_song = song.transpose(interval)

return transposed_song

def encode_song(song, time_step=0.25):
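# Each event is sampled every `time_step` quarter lengths (0.25 = a sixteenth note).
# A note is written as its MIDI pitch and then held with "_" symbols, so a quarter-note
# C4 becomes "60 _ _ _" and a half-note rest becomes "r _ _ _ _ _ _ _".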

encoded_song = []

for event in song.flat.notesAndRests:

# handle notes

if isinstance(event, m21.note.Note):

symbol = event.pitch.midi # 60

# handle rests

elif isinstance(event, m21.note.Rest):

symbol = "r"

# convert the note/rest into time series notation

steps = int(event.duration.quarterLength / time_step)

for step in range(steps):

# if it's the first time we see a note/rest, let's encode it. Otherwise, it means we're
# carrying the same symbol in a new time step

if step == 0:

encoded_song.append(symbol)

else:

encoded_song.append("_")

# cast encoded song to str

encoded_song = " ".join(map(str, encoded_song))

return encoded_song

def preprocess(dataset_path):

# load folk songs

print("Loading songs...")

songs = load_songs_in_kern(dataset_path)

print(f"Loaded {len(songs)} songs.")

for i, song in enumerate(songs):

# filter out songs that have non-acceptable durations

if not has_acceptable_durations(song, ACCEPTABLE_DURATIONS):

continue

# transpose songs to Cmaj/Amin

song = transpose(song)

# encode songs with music time series representation

encoded_song = encode_song(song)

# save songs to text file

save_path = os.path.join(SAVE_DIR, str(i))

with open(save_path, "w") as fp:

fp.write(encoded_song)

def load(file_path):

with open(file_path, "r") as fp:

song = fp.read()

return song

def create_single_file_dataset(dataset_path, file_dataset_path, sequence_length):

new_song_delimiter = "/ " * sequence_length

songs = ""
# load encoded songs and add delimiters

for path, _, files in os.walk(dataset_path):

for file in files:

file_path = os.path.join(path, file)

song = load(file_path)

songs = songs + song + " " + new_song_delimiter

# remove empty space from last character of string

songs = songs[:-1]

# save string that contains all the dataset

with open(file_dataset_path, "w") as fp:

fp.write(songs)

return songs

def create_mapping(songs, mapping_path):

mappings = {}

# identify the vocabulary

songs = songs.split()

vocabulary = list(set(songs))

# create mappings

for i, symbol in enumerate(vocabulary):

mappings[symbol] = i

# save vocabulary to a json file

with open(mapping_path, "w") as fp:

json.dump(mappings, fp, indent=4)

def convert_songs_to_int(songs):

int_songs = []

# load mappings
with open(MAPPING_PATH, "r") as fp:

mappings = json.load(fp)

# transform songs string to list

songs = songs.split()

# map songs to int

for symbol in songs:

int_songs.append(mappings[symbol])

return int_songs

def generate_training_sequences(sequence_length):

# load songs and map them to int

songs = load(SINGLE_FILE_DATASET)

int_songs = convert_songs_to_int(songs)

inputs = []

targets = []

# generate the training sequences

num_sequences = len(int_songs) - sequence_length

for i in range(num_sequences):

inputs.append(int_songs[i:i+sequence_length])

targets.append(int_songs[i+sequence_length])

# one-hot encode the sequences

vocabulary_size = len(set(int_songs))

# inputs size: (# of sequences, sequence length, vocabulary size)

inputs = keras.utils.to_categorical(inputs, num_classes=vocabulary_size)

targets = np.array(targets)

return inputs, targets

def main():
preprocess(KERN_DATASET_PATH)

songs = create_single_file_dataset(SAVE_DIR, SINGLE_FILE_DATASET, SEQUENCE_LENGTH)

create_mapping(songs, MAPPING_PATH)

inputs, targets = generate_training_sequences(SEQUENCE_LENGTH)

if __name__ == "__main__":

main()

train.py
import tensorflow.keras as keras

from preprocess import generate_training_sequences, SEQUENCE_LENGTH

OUTPUT_UNITS = 38

NUM_UNITS = [256]

LOSS = "sparse_categorical_crossentropy"

LEARNING_RATE = 0.001

EPOCHS = 90

BATCH_SIZE = 64

SAVE_MODEL_PATH = "model.h5"

def build_model(output_units, num_units, loss, learning_rate):

# create the model architecture

input = keras.layers.Input(shape=(None, output_units))

x = keras.layers.LSTM(num_units[0])(input)

x = keras.layers.Dropout(0.2)(x)

output = keras.layers.Dense(output_units, activation="softmax")(x)

model = keras.Model(input, output)


# compile model

model.compile(loss=loss,

optimizer=keras.optimizers.Adam(learning_rate=learning_rate),

metrics=["accuracy"])

model.summary()

return model

def train(output_units=OUTPUT_UNITS, num_units=NUM_UNITS, loss=LOSS,


learning_rate=LEARNING_RATE):

# generate the training sequences

inputs, targets = generate_training_sequences(SEQUENCE_LENGTH)

# build the network

model = build_model(output_units, num_units, loss, learning_rate)

# train the model

model.fit(inputs, targets, epochs=EPOCHS, batch_size=BATCH_SIZE)

# save the model

model.save(SAVE_MODEL_PATH)

if __name__ == "__main__":

train()

melodygenerator.py
import json

import numpy as np

import tensorflow.keras as keras

import music21 as m21

from preprocess import SEQUENCE_LENGTH, MAPPING_PATH

class MelodyGenerator:

def __init__(self, model_path="model.h5"):

self.model_path = model_path

self.model = keras.models.load_model(model_path)

with open(MAPPING_PATH, "r") as fp:

self._mappings = json.load(fp)

self._start_symbols = ["/"] * SEQUENCE_LENGTH

def generate_melody(self, seed, num_steps, max_sequence_length, temperature):

# create seed with start symbols

seed = seed.split()

melody = seed

seed = self._start_symbols + seed

# map seed to int

seed = [self._mappings[symbol] for symbol in seed]

for _ in range(num_steps):

# limit the seed to max_sequence_length

seed = seed[-max_sequence_length:]

# one-hot encode the seed

onehot_seed = keras.utils.to_categorical(seed, num_classes=len(self._mappings))

# (1, max_sequence_length, num of symbols in the vocabulary)

onehot_seed = onehot_seed[np.newaxis, ...]

# make a prediction

probabilities = self.model.predict(onehot_seed)[0]

# [0.1, 0.2, 0.1, 0.6] -> 1

output_int = self._sample_with_temperature(probabilities, temperature)

# update seed
seed.append(output_int)

# map int to our encoding

output_symbol = [k for k, v in self._mappings.items() if v == output_int][0]

# check whether we're at the end of a melody

if output_symbol == "/":

break

# update melody

melody.append(output_symbol)

return melody

def _sample_with_temperature(self, probabilites, temperature):
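# Temperature rescales the predicted distribution before sampling: values below 1
# sharpen it (more conservative, repetitive output), values above 1 flatten it
# (more random output), and a temperature of 1 leaves the probabilities unchanged.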

predictions = np.log(probabilites) / temperature

probabilites = np.exp(predictions) / np.sum(np.exp(predictions))

choices = range(len(probabilites)) # [0, 1, 2, 3]

index = np.random.choice(choices, p=probabilites)

return index

def save_melody(self, melody, step_duration=0.25, format="midi", file_name="mel.mid"):

# create a music21 stream

stream = m21.stream.Stream()

start_symbol = None

step_counter = 1

# parse all the symbols in the melody and create note/rest objects

for i, symbol in enumerate(melody):

# handle case in which we have a note/rest

if symbol != "_" or i + 1 == len(melody):

# ensure we're dealing with note/rest beyond the first one

if start_symbol is not None:


quarter_length_duration = step_duration * step_counter # 0.25 * 4 = 1

# handle rest

if start_symbol == "r":

m21_event = m21.note.Rest(quarterLength=quarter_length_duration)

# handle note

else:

m21_event = m21.note.Note(int(start_symbol),
quarterLength=quarter_length_duration)

stream.append(m21_event)

# reset the step counter

step_counter = 1

start_symbol = symbol

# handle case in which we have a prolongation sign "_"

else:

step_counter += 1

# write the m21 stream to a midi file

stream.write(format, file_name)

if __name__ == "__main__":

mg = MelodyGenerator()

seed = "67 _ 67 _ 67 _ _ 65 64 _ 64 _ 64 _ _"

melody = mg.generate_melody(seed, 500, SEQUENCE_LENGTH, 0.3)

print(melody)

mg.save_melody(melody)

app.py
from flask import Flask, render_template, send_file

app = Flask(__name__)

@app.route('/')
def index():
return render_template('index.html')

@app.route('/play_music')
def play_music():
return send_file('mel.mid', mimetype='audio/midi')

if __name__ == '__main__':
app.run(debug=True)
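
# To try it locally (assuming Flask is installed and mel.mid has already been generated
# by melodygenerator.py): run `python app.py` and open http://127.0.0.1:5000/ in a browser.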

index.html
<html>

<head>

<title>AUTOMATIC MUSIC GENERATION USING LSTM</title>

</head>

<body>

<h1>GENERATED MUSIC</h1>

<midi-player

src="/play_music"

sound-font visualizer="#myPianoRollVisualizer">

</midi-player>

<midi-visualizer type="piano-roll" id="myPianoRollVisualizer"

src="/play_music">

</midi-visualizer>
<midi-player

src="/play_music"

sound-font visualizer="#myStaffVisualizer">

</midi-player>

<midi-visualizer type="staff" id="myStaffVisualizer"

src="/play_music">

</midi-visualizer>

<script
src="https://cdn.jsdelivr.net/combine/npm/tone@14.7.58,npm/@magenta/music@1.23.1/es6/
core.js,npm/focus-visible@5,npm/html-midi-player@1.5.0"></script>

</body>

</html>

REFERENCES

1. Kaitong Zheng, Ruijie Meng, Chengshi Zheng, Xiaodong Li, Jinqiu


Sang, Juanjuan Cai, and Jie Wang. 2021. EmotionBox: a
music-element-driven emotional music generation system using
Recurrent Neural Network. arXiv preprint arXiv:2112.08561,2021.
2. Serkan Sulun, Matthew EP Davies, and Paula Viana. 2022. Symbolic
music generation conditioned on continuous-valued emotions,2022.
3. Filippo Carnovalini and Antonio Rodà. 2020. Computational creativity
and music generation systems: An introduction to the state of the art.
Frontiers in Artificial Intelligence 3,2020.
4. Gwenaelle Cunha Sergio and Minho Lee. 2021. Scene2Wav: a deep
convolutional sequence-to-conditional Sample RNN for emotional scene
musicalization. Multimedia Tools and Applications 80, 2021.
5. F. Furukawa, "Fundamentals of Music & Properties of Sound – Music
Theory Lesson," 3 August 2015. [Online]. Available:
https://popliteral.com/fundamentals-of-music-properties-of-soundmusi
c-theory-lesson/. [Accessed 7 November 2018].
6. M. C. A. O. Manuel Alfonseca, "A simple genetic algorithm for music
generation by means of algorithmic information theory," IEEE Congress
on Evolutionary Computation, Singapore, pp. 3035-3042, 2007
7. Z. C. a. B. J. a. E. C. Lipton, "A Critical Review of Recurrent Neural
Networks," arXiv preprint arXiv:1506.00019, 2015.
8. G. a. N. V. a. S. J. a. K. A. Joshi, "A Comparative Analysis of
Algorithmic Music Generation on GPUs and FPGAs," 2018 Second
International Conference on Inventive Communication and
Computational Technologies (ICICCT), pp. 229-232, 2018.
9. F. a. H. J. Drewes, "An algebra for tree-based music generation,"
Springer, pp. 172-188, 2007.
10. A. a. S. W. Van Der Merwe, "Music generation with Markov models,"
IEEE MultiMedia, vol. 18, no. 3, pp. 78-85, 2011.
11. N. a. B. Y. a. V. P. Boulanger-Lewandowski, "Modeling temporal
dependencies in high-dimensional sequences: Application to polyphonic
music generation and transcription," arXiv preprint arXiv:1206.6392,
2012.
12. G. a. F. N. Hadjeres, "Interactive Music Generation with Positional
Constraints using Anticipation-RNNs," arXiv preprint
arXiv:1709.06404, 2017.
13. C. B. Browne, "System and method for automatic music generation
using a neural network architecture," Google Patents, 2001.
14. A. Abraham, "Artificial neural networks," handbook of measuring
system design, 2005.
15. I. a. P.-A. J. a. M. M. a. X. B. a. W.-F. D. a. O. S. a. C. A. a. B. Y.
Goodfellow, "Generative adversarial nets," in Advances in neural
information processing systems, 2014, pp. 2672--2680.
16. H.-W. a. H. W.-Y. a. Y. L.-C. a. Y. Y.-H. Dong, "MuseGAN:
Symbolic-domain music generation and accompaniment with multitrack
sequential generative adversarial networks," arXiv preprint
arXiv:1709.06298, 2017.
17. "Magenta," Google, [Online]. Available:
https://magenta.tensorflow.org/. [Accessed 29 October 2018].
18. P. Y. Nikhil Kotecha, "Generating Music using an LSTM Network,"
arXiv.org, vol. arXiv:1804.07300, 2018.

19. G. H. A. K. I. S. R. S. Nitish Srivastava, "Dropout: A Simple Way to
Prevent Neural Networks from Overfitting," Journal of Machine
Learning Research, vol. 15, pp. 1929-1958, 2014.
20. G. L. Z. V. D. M. L. &. W. K. Q. Huang, " Densely connected
convolutional networks.," CVPR, vol. 1, p. 3, 2017.
21. O. M. Bjørndalen, "Mido," 21 August 2011. [Online]. Available:
https://github.com/olemb/mido. [Accessed 3 November 2018].
22. C. H. L. H. S. I. Moon T, "Rnndrop: A novel dropout for rnns in asr,"
Automatic Speech Recognition and Understanding (ASRU), pp. 65-70,
2015.
23. T. a. H. G. Tieleman, "Lecture 6.5-rmsprop: Divide the gradient by a
running average of its recent magnitude," COURSERA: Neural networks
for machine learning, vol. 4, no. 2, pp. 26-31, 2018.
24. Z. T. H. a. W. X. Z. Zeng, "Multistability of recurrent neural networks
with time-varying delays and the piecewise linear activation function.,"
IEEE Transactions on Neural Networks, vol. 21, no. 8, pp. 1371-1377,
2017.
25. GoogleDevelopers, "Descending into ML: Training and Loss,"
[Online].Available:
https://developers.google.com/machinelearning/crash-course/descendin
g-into-ml/training-and-loss.
26. S. Mangal, "Music Research," 13 November 2018. [Online].
Available: https://gitlab.com/sanidhyamangal/music_research.
27. "Musical_Matrices," 10 July 2016. [Online]. Available:
https://github.com/dshieble/Musical_Matrices/tree/master/Pop_Music
_Midi. [Accessed 2 November 2018].
