Mohini Dey - Capstone


LIST OF FIGURES

Fig. No.  Title

1.1   Workflow of the project
2.1   Audio fingerprint matching
2.2   MFCC extraction
3.1   Branches of machine learning
3.2   Generalised working of a machine learning model
3.3   Structure of a DNN
3.4   Implementation of CNN using an example
3.5   Layers of CNN used in noise classification
3.6   KNN example
3.7   Audio slicing representation of FFT
4.1   Workflow of the project
4.2   Representation of pre-processing
4.3   Audio fingerprinting and matching
4.4   Deep neural network workflow
4.5   Speaker identification representation
6.1   Spectrogram of input audio signal
6.2   Hanning window of input signal
6.3   Audio fingerprint
6.4   Test case 1
6.5   Test case 2
6.6   Training of deep neural network
6.7   Noise classification result
6.8   Downloading of dataset and reducing the same
6.9   Audio wave of the sample audio of a speaker uttering "two"
6.10  Pitch of the speaker
6.11  Confusion matrix
6.12  Validation accuracy
LIST OF TABLES

Table No.  Title

6.1   Confidence percentage of the KNN classifier
6.2   Number of speakers vs accuracy
LIST OF ABBREVIATIONS

1. DNN Deep Neural Network


2. MFCC Mel frequency cepstral coefficients
3. KNN K-Nearest Neighbour
4. MIDI Musical Instrument Digital Interface
5. WASAPI Windows Audio Session API
6. ALSA Advanced Linux Sound Architecture
7. LSTM Long short-term memory
8. GAN Generative Adversarial Network
9. GPU Graphics processing unit
10. CNN Convolutional Neural Network
11. SVM support vector machines
12. PCA Principal Component analysis
13. DSP Digital Signal Processing
14. ONNX Open Neural Network Exchange
15. FPGA Field-programmable gate array
16. SoC System on a chip
17. ANOVA Analysis of Variance
18. DOE Design of Experiments
19. HMM Hidden Markov Model
20. FLAC Free lossless audio codec
SYMBOLS AND NOTATIONS

μ Mean

π Pi
θ Theta
∫ Integration

α Alpha
∑ Summation
X Eigenvector
λ Eigenvalue

σ² Variance

ω Omega
R Resistance
C Capacitance
ε Epsilon
j Imaginary unit (iota)

wₙ Window function

σ Standard deviation
1.1 Motivation

Advances in audio fingerprinting allow sound samples to be collected, extracted, and searched for music and sound information. Fingerprinting is simply a highly specific content-based audio retrieval technique. In this work, we approach audio recognition by means of audio fingerprinting and its classification through deep learning. We propose a better and more efficient way to create a unique database of fingerprints of recorded sound samples, testing the dataset with random brief snippets even when the sound signals are slightly or heavily distorted. We also constructed a framework in which we utilize a deep neural network (DNN) and compare it with other classification techniques. We propose a novel improvement to the pre-processing stage of the network which is helpful when training with various sorts of sound. It is shown that the DNN enjoys some advantages over other classification strategies. People have certain characteristic features in their voices which distinguish them from one another; people also have the ability to isolate the sound they wish to concentrate on from the noise and other sounds present in their surroundings. We wish to imitate this ability in a neural network and test its accuracy. This thesis also proposes recognizing a speaker's identity from their voice using concepts of deep learning.

1.2 Background

Previous works on audio fingerprinting and matching have mostly used hash functions to store the fingerprints, but they tended to focus on exact mathematical equality rather than perceptual similarity. This project therefore sets aside the idea of storing fingerprints as hashes and instead stores the energy peaks of audio signals, using Parseval's theorem. Unlike previous works, our work also classifies the noise in the audio into white, brown, and pink, which gives an idea of what kind of music the listener is interested in. This is followed by recognizing the speaker from the audio. The experiments are performed to achieve better accuracy, and all of them are carried out in MATLAB, a convenient platform for experiments related to signal processing.
CHAPTER 2
PROJECT DESCRIPTION AND GOALS

An audio fingerprint is a piece of information that stores many features of a sound sample which can later be used for comparison or other purposes. It is a condensed digital summary of a tune that helps distinguish it from other tunes. In digital signal processing, we convert analog signals into digital signals through a digitization process. The analog-to-digital conversion consists of two steps, sampling and quantization, which together form the basis of the pre-processing of any given audio signal. Quantization restricts the amplitude to a specific, limited set of values. Throughout this thesis, the terms audio signal and audio recording refer to a discrete-time (DT) signal.

Figure 1.1 Workflow of the project: Pre-processing → Audio fingerprinting → Audio matching → Noise classification → Speaker classification

The workflow of the project is divided into four phases, namely audio fingerprinting, audio matching, noise classification, and speaker identification, as shown in figure 1.1. An audio fingerprint is a condensed digital summary, deterministically generated from a sound signal, that can be used to recognize a sound sample or quickly locate practically identical samples in an audio database. A unique audio fingerprint is created from the energy peaks of the audio input. Once the audio fingerprints are created, they are accumulated in a database. Following this, a new sound sample is recorded using an external microphone and is then matched against the database. Recognizing complex and high-dimensional sound signals is one of the most difficult tasks today. The deep learning algorithm attempts to learn simple features in the lower layers and more complex features in the higher layers. Then comes the classification of noise. We train the network in a layer-wise fashion, in which the hidden layers are trained one at a time. After training and testing on the divided samples, we are finally able to classify the noises into pink, brown, and white.

The final part of this project is identifying the speaker from the audio. Speaker identification based on sound signals is treated as one of the most important technologies for recognizing individuals. Audio signal features can be characterized either in the perceptual mode or in the physical mode. In this thesis, we propose extracting feature vectors and then training on the dataset. As mentioned before, this framework comprises two parts: speaker identification and speaker verification. In the first part, we determine which enrolled speaker provides the speech input, while verification is the task of automatically deciding whether a person is who they claim to be. Speaker identification can be categorized into text-dependent and text-independent techniques. In the text-dependent technique, the speaker has to say words or key phrases with the same text for both training and recognition trials. In the text-independent technique, the framework can identify the speaker irrespective of what is being spoken. In this thesis, we propose a text-independent speaker recognition framework based on Mel-frequency cepstral coefficients for feature extraction together with vector quantization, which also helps to limit the amount of data needed for processing. The extracted features of the speaker are quantized to a number of centroids, and the K-nearest neighbour algorithm is incorporated into the proposed speaker identification framework. Speaker identification aims at recognizing speakers from their voices, as every individual has a unique set of speech characteristics and ways of talking. We also produce a confusion matrix at the end to show the probability percentages for each speaker, which helps in identifying the exact speaker.

The objective of this thesis is to examine and study a text-independent speaker identification framework in which we compare a speech signal from an unknown speaker to a database of known speakers. Moreover, we extract features from unlabelled audio data, an approach that has shown excellent performance for numerous audio classification tasks.
2.1 Algorithms:

1. Pre-processing

Step 1: Give an audio sample as input to MATLAB

Step 2: Sample the input audio at 44.1 kHz

Step 3: Use a Hanning window to split the input audio into overlapping frames, each with a frame duration of 370 ms

Step 4: Apply the Fast Fourier Transform to each windowed frame

Step 5: Determine the energy peaks and then calculate the mean of all the energy peaks

Step 6: Use the mean of the energy peaks as the calculated cut-off frequency and pass the data through a Butterworth high-pass filter

Step 7: Obtain the energy peaks of the audio both as discrete data and in the form of a spectrogram
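
The steps above can be sketched in MATLAB roughly as follows. This is a minimal illustration rather than the project's exact code: the file name, the 50% frame overlap, the filter order, and the reading of the "mean of the energy peaks" as the mean frequency of the per-frame spectral peaks are all assumptions.

% Minimal pre-processing sketch (file name, overlap, and filter order assumed)
[x, fs] = audioread('sample.wav');            % fs assumed to be 44100 Hz
x = mean(x, 2);                               % mix down to mono
frameLen = round(0.370 * fs);                 % 370 ms frames
hop = round(frameLen / 2);                    % 50% overlap (assumption)
win = hann(frameLen);
numFrames = floor((length(x) - frameLen) / hop) + 1;
peakFreq = zeros(numFrames, 1);
for k = 1:numFrames
    seg = x((k-1)*hop + (1:frameLen)) .* win; % Hanning-windowed frame
    S = abs(fft(seg));
    [~, idx] = max(S(1:floor(frameLen/2)));   % energy peak of this frame
    peakFreq(k) = (idx - 1) * fs / frameLen;  % peak position in Hz
end
fc = mean(peakFreq);                          % mean peak as cut-off frequency
[b, a] = butter(4, fc/(fs/2), 'high');        % Butterworth high-pass filter
y = filter(b, a, x);
spectrogram(y, win, frameLen - hop, frameLen, fs, 'yaxis');  % peaks viewed as a spectrogram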

2. Audio Fingerprinting

Step 1: Run the pre-processing steps

Step 2: Store all the obtained discrete energy peaks in an Excel sheet

3. Audio matching

Step 1: Give a random audio signal as input to MATLAB

Step 2: Run the pre-processing steps

Step 3: Correlate the energy peaks obtained from the random audio signal with the sets present in the database

Step 4: The correlation coefficient is returned
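
A hedged sketch of this matching step is given below. It correlates a query's peak vector against each stored fingerprint with MATLAB's corrcoef; the folder layout, the CSV storage format, and the 0.8 decision threshold are illustrative assumptions.

% Matching sketch: correlate query peaks against each stored fingerprint
query = peakFreq(:);                             % peaks from the pre-processing stage
files = dir(fullfile('fingerprints', '*.csv'));  % assumed database layout
bestName = ''; bestR = -Inf;
for k = 1:numel(files)
    ref = csvread(fullfile('fingerprints', files(k).name));
    n = min(numel(query), numel(ref));           % align lengths before correlating
    R = corrcoef(query(1:n), ref(1:n));
    if R(1, 2) > bestR
        bestR = R(1, 2);
        bestName = files(k).name;
    end
end
if bestR > 0.8                                   % assumed match threshold
    fprintf('the data is matched: %s (r = %.2f)\n', bestName, bestR);
else
    disp('the data is not matched');
end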


Figure 2.1 Audio Fingerprint Matching

4. Speaker identification

Step 1: Download the speaker dataset from the AN4 database

Step 2: Use the helper function to convert the audio signals into FLAC format

Step 3: Reduce the dataset from the AN4 database to 5 females and 5 males

Step 4: Create a datastore to hold the data used for further processing

Step 5: Split the datastore into two datastores, with 80% of the data used for training and 20% used for testing

Step 6: Extract pitch and MFCC from the training data

Step 7: Train the KNN classifier on the extracted features of the data

Step 8: Validate the classifier by examining its accuracy
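
Steps 4 to 8 can be sketched with Audio Toolbox and Statistics and Machine Learning Toolbox functions as below. The folder name, the 80/20 split, the averaging of per-frame features into one vector per file, and K = 5 are assumptions, not the project's exact settings.

% Speaker-identification sketch (folder layout and parameters assumed)
ads = audioDatastore('an4_reduced', 'IncludeSubfolders', true, ...
    'FileExtensions', '.flac', 'LabelSource', 'foldernames');
[adsTrain, adsTest] = splitEachLabel(ads, 0.8);   % 80% training / 20% testing
features = []; labels = [];
while hasdata(adsTrain)
    [x, info] = read(adsTrain);
    c  = mfcc(x, info.SampleRate);                % per-frame MFCCs
    f0 = pitch(x, info.SampleRate);               % per-frame pitch estimates
    features(end+1, :) = [mean(c, 1), mean(f0)];  %#ok<AGROW> one vector per file
    labels = [labels; info.Label];                %#ok<AGROW>
end
mdl = fitcknn(features, labels, 'NumNeighbors', 5);   % K = 5 is an assumption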


5. MFCC extraction.

Figure 2.2 MFCC extraction

The first 10 triangular filters of the mel filterbank are linearly spaced, and the remaining filters are logarithmically spaced. The individual bands are weighted for even energy. The figure represents a typical mel filterbank.
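
For reference, Audio Toolbox exposes this extraction as a single call (a sketch; here x is an audio vector and fs its sample rate):

coeffs = mfcc(x, fs);   % one row of mel-frequency cepstral coefficients per frame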

6. KNN algorithm.

Step 1 – The training dataset is loaded

Step 2 – The value of k is chosen, i.e., the number of nearest data points (k belongs to Z)

Step 3 – For each point in the dataset, the procedure mentioned below is followed:

● 3.1 – The distance between the test data and each row of training data is calculated using any of the known methods, namely Euclidean, Manhattan, or Hamming distance. The Euclidean method is the most commonly used method to calculate distance.
● 3.2 – Now, sort the data in ascending order based on the distance value.
● 3.3 – From the already sorted array, the KNN classifier will choose the top K rows.
● 3.4 – Based on the most frequent class of these rows, a class is assigned to the test point.

Step 4 − End
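
A from-scratch MATLAB sketch of steps 1 to 4, assuming the Euclidean metric and categorical training labels, might look like this:

% KNN classification sketch following steps 3.1-3.4 (Euclidean distance assumed)
function label = knnClassify(Xtrain, ytrain, xquery, k)
    d = sqrt(sum((Xtrain - xquery).^2, 2));  % 3.1: distance to each training row
    [~, order] = sort(d, 'ascend');          % 3.2: sort by distance
    topK = ytrain(order(1:k));               % 3.3: take the top K rows
    label = mode(topK);                      % 3.4: majority vote among the K labels
end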
CHAPTER 3
TECHNICAL SPECIFICATIONS

This project has been performed in MATLAB. The various MATLAB toolboxes used to achieve the results are listed below:

● MATLAB Software (R2019b)


● Audio Toolbox
● Deep Learning Toolbox
● Statistics and Machine Learning Toolbox

SOFTWARE DESCRIPTION:

3.1 MATLAB

MATLAB (Matrix Laboratory) provides one of the most convenient environments for conducting experiments related to digital signal processing. It includes the MATLAB language, a top programming language dedicated to numerical and technical computing. MATLAB offers a multi-paradigm numerical computing environment; it is an interactive system whose basic data element is an array that does not require dimensioning. MATLAB therefore allows us to solve problems much more easily than languages like C or Fortran. This matrix-based language lets you express mathematics directly: linear algebra in MATLAB is intuitive and concise. The same is true for data analytics, signal and image processing, control design, and other applications, as it integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation.

Hence, whether for engineers, researchers, or scientists, MATLAB has become an inseparable part of their work. The richness of the MATLAB computational environment, combined with an integrated development environment (IDE), an easy interface, toolkits, and simulation and modelling capabilities, creates a research and development tool that has no equal.
Its uses include
• Math and computation
• Algorithm development
• Scientific and engineering graphics
• Modeling, simulation, and prototyping
• Data analysis, exploration, and visualization

3.2 SIMULINK
Simulink is a graphical extension to MATLAB for integrated simulation and model-based design. We generally use Simulink for modelling, simulating, and analysing multidomain dynamical systems. It is essentially a diagrammatic framework with customizable block library packages; with product-style version control, traceability criteria, and application coverage analysis, Simulink can routinely verify models. The main use of Simulink is its ability to model a nonlinear system, which a transfer function is unable to do. Another main advantage of Simulink is its ability to take on initial conditions.

In Simulink, systems are drawn on screen as block diagrams. Many elements of block diagrams are available, such as transfer functions and summing junctions, as well as virtual input and output devices such as function generators and oscilloscopes. Simulink is integrated with MATLAB, and data can be moved between the two. Accordingly, Simulink is productive for precise verification and testing of systems through design-style checking, requirements tracing, and analysis of model completeness. Mistakes can also be identified easily through the Simulink Design Verifier, which can likewise be used in system checking by creating test cases.
MATLAB and Signal Processing

Digital signal processing becomes much easier with MATLAB and Simulink; they provide a systematic workflow for the development of embedded systems. One can apply signal processing tools to:

● Acquire, measure, and analyse signals from numerous sources without expert knowledge of signal processing.

● Configure streaming algorithms for audio, smart sensor, instrumentation, and IoT devices.

● Prototype, test, and then deploy DSP algorithms on PCs, embedded processors, SoCs, and FPGAs.
● Pre-process and filter signals before analysis.
● Explore and extract features for data analysis and applications based on Artificial Intelligence.
● Analyse trends and discover patterns in signals.
● Visualize and measure the time and frequency characteristics of signals.

3.3 Audio Toolbox (MATLAB R2019b)

MATLAB's Audio Toolbox provides tools for audio processing, speech analysis, and acoustic measurement. It includes algorithms for audio signal processing (for example, equalization and dynamic range control) and acoustic measurement (for example, impulse response estimation, octave filtering, and perceptual weighting). It also provides algorithms for audio and speech feature extraction (such as MFCC and pitch) and audio signal transformation (for example, gammatone filter banks and mel-spaced spectrograms).

Toolbox apps support live algorithm testing, impulse response measurement, and audio signal labelling. The toolbox provides streaming interfaces to ASIO, WASAPI, ALSA, and CoreAudio sound cards and MIDI devices, and supports creating and hosting standard audio plugins such as VST and Audio Units. With Audio Toolbox one can import, label, and augment audio data sets, as well as extract features and transform signals for machine learning and deep learning. One can prototype real-time audio algorithms with low-latency audio streaming while tuning parameters and visualizing signals. One can also validate an algorithm by converting it into an audio plugin that runs in external host applications such as Digital Audio Workstations. Plugin hosting permits one to use external audio plugins like regular objects to process MATLAB arrays. Sound card interfaces permit custom measurements of real-time signals and acoustic systems.

Audio Toolbox provides functionality to develop audio, speech, and acoustic applications using machine learning and deep learning. Use the audio datastore to manage and load large data sets. Use Audio Labeler to interactively define and visualize ground truth. Use Audio Data Augmenter to expand data sets using audio-specific augmentation techniques. Use Audio Feature Extractor to create efficient and modular feature extraction pipelines.

Audio Toolbox is optimized for real-time audio stream processing. Use these features individually or as part of a larger algorithm to create effects, analyse signals, and process audio.

Capabilities and Features

● Audio Toolbox enables real-time processing and testing of audio signals in MATLAB and Simulink. It provides a framework which can operate on a very large set of data in minimal time to stream sound in accordance with standard driver norms.
● Smooth multi-channel audio streaming in real time is possible in MATLAB. Channel mapping permits routing of signals to arbitrary channel choices when using multi-channel sound devices.
● C code generation is supported from MATLAB and Simulink, including for the creation of libraries. For example, one can generate libraries or standalone applications that process audio in real time on the desktop.
● Audio Toolbox also enables one to tune algorithm parameters interactively during simulations using external MIDI controls. Audio Toolbox is streamlined for real-time audio stream handling; use these features separately or as part of a larger algorithm to create effects, analyse signals, and interact with audio signals.
3.4 Deep Learning Toolbox:

Deep Learning Toolbox provides a framework for designing and implementing deep neural networks with algorithms, pretrained models, and apps. One can use convolutional neural networks (ConvNets, CNNs) and long short-term memory (LSTM) networks to perform classification and regression on image, time-series, and text data. One can construct network architectures such as generative adversarial networks (GANs) and Siamese networks using automatic differentiation, custom training loops, and shared weights. With the Deep Network Designer app, one can design, analyse, and train networks graphically. The Experiment Manager app helps one manage multiple deep learning experiments, keep track of training parameters, analyse results, and compare code from different experiments. One can visualize layer activations and graphically monitor training progress.

One can exchange models with TensorFlow and PyTorch through the ONNX format and import models from TensorFlow-Keras and Caffe. The toolbox supports transfer learning with DarkNet-53, ResNet-50, NASNet, SqueezeNet, and many other pretrained models. One can run training on a single- or multi-GPU workstation (with Parallel Computing Toolbox), or scale up to clusters and clouds, including NVIDIA GPU Cloud and Amazon EC2 GPU instances.

3.5 Statistics and Machine Learning Toolbox

Statistics and Machine Learning Toolbox provides functions and apps to describe, analyse, and model data. One can use descriptive statistics and plots for exploratory data analysis, fit probability distributions to data, generate random numbers for Monte Carlo simulations, and perform hypothesis tests. Regression and classification algorithms permit one to draw inferences from data and build predictive models.

For multidimensional data analysis and data mining, Statistics and Machine Learning Toolbox provides tools such as feature selection, stepwise regression, principal component analysis (PCA), regularization, and other dimensionality reduction methods that allow one to identify the variables or features that impact their model.

The toolbox provides supervised and unsupervised machine learning algorithms, including support vector machines (SVMs), boosted and bagged decision trees, k-nearest neighbours, k-means, k-medoids, hierarchical clustering, Gaussian mixture models, and hidden Markov models. Many of the statistics and machine learning algorithms can be used for computations on data sets that are too large to be stored in memory.

Figure 3.1 Branches of Machine Learning

Product capabilities of the Statistics and Machine Learning Toolbox:

● Regression techniques, including linear, generalized linear, nonlinear, robust, regularized, ANOVA, repeated measures, and mixed-effects models
● Big data algorithms for dimension reduction, descriptive statistics, k-means clustering, linear regression, logistic regression, and discriminant analysis
● Univariate and multivariate probability distributions, random and quasi-random number generators, and Markov chain samplers
● Hypothesis tests for distributions, dispersion, and location; Design of Experiments (DOE) techniques for optimal, factorial, and response surface designs
● Classification Learner app and algorithms for supervised machine learning, including boosted and bagged decision trees, KNN, Naive Bayes, discriminant analysis, and Gaussian process regression
● Unsupervised machine learning algorithms, including k-means, k-medoids, hierarchical clustering, Gaussian mixtures, and HMMs
● Bayesian optimization for tuning machine learning algorithms by searching for optimal hyperparameters

Figure 3.2 Generalised Working of a Machine Learning model

3.6 Machine learning algorithms

3.6.1 Deep neural networks:

Deep neural networks comprise multiple levels of nonlinear operations, like neural nets with many hidden layers. The idea of deep learning is to learn feature hierarchies, where features at higher levels of the hierarchy are formed using the features at lower levels. Much better results can be achieved in deeper architectures when each layer corresponds to an unsupervised learning algorithm; the network is then switched to a supervised mode and trained using the backpropagation algorithm to adjust the weights. Additionally, DNNs outperform GMMs (Gaussian Mixture Models) and HMMs (Hidden Markov Models) on a variety of speech processing tasks by a large margin.

Deep learning networks are distinguished from the more ordinary single-hidden-layer neural networks by their depth, that is, the number of node layers through which data must pass in a multistep process of pattern recognition. This deep structured learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning, where the learning can be supervised or unsupervised. Deep neural networks, deep belief networks, recurrent neural networks, and convolutional neural networks have been applied to numerous fields such as speech recognition, natural language processing, audio signal recognition, social network filtering, machine translation, bioinformatics, and drug design, producing results comparable to, and in some cases surpassing, human expert performance. A deep neural network is simply an artificial neural network with multiple layers between the input and output layers. Whether the relationship is linear or nonlinear, the DNN finds the correct mathematical manipulation to transform the input into the output. The network moves through each layer calculating the probability of each output. Such a network has a certain degree of complexity and uses sophisticated mathematical modelling to process data in intricate ways.

A deep neural network has one input layer, one output layer, and at least one hidden layer sandwiched in between, where each layer performs specific kinds of sorting and ordering. Each layer comprises neurons (nodes) that perform mathematical calculations and other tasks. Every node in a layer is interconnected with the nodes in the adjacent layers. Each interconnection is assigned a weight, and a bias is allotted to each layer. These weights and biases are the parameters of the given network.
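
As an illustration of such a layered structure, a small fully connected network can be declared with Deep Learning Toolbox as below. This is only a sketch: the 40-element feature vector and the layer sizes are assumptions, not the network trained in this project.

% Sketch of a small DNN: input, hidden layer with weights and biases, output
layers = [
    imageInputLayer([40 1 1])        % assumed 40-element feature vector per sample
    fullyConnectedLayer(64)          % hidden layer (weights and biases are learned)
    reluLayer
    fullyConnectedLayer(3)           % output layer, e.g., three noise classes
    softmaxLayer
    classificationLayer];
opts = trainingOptions('adam', 'MaxEpochs', 30, 'Plots', 'training-progress');
% net = trainNetwork(XTrain, YTrain, layers, opts);   % training call once data is ready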

Figure 3.3 Structure of a DNN


3.6.2 Convolutional neural networks:

In deep learning, CNNs are most commonly applied to analyse visual imagery. Because of their shared-weight architecture and translation invariance characteristics, they are also known as shift invariant or space invariant artificial neural networks (SIANN). CNNs are widely used in image and video recognition, recommender systems, image classification, medical image analysis, natural language processing, and financial time series. They are regularized versions of the multi-layer perceptron. Multi-layer perceptrons typically mean fully connected networks, in which every neuron in one layer is connected to all neurons in the following layer. The full connectedness of these networks makes them prone to overfitting. Typical regularization methods include adding some form of magnitude measurement of the weights to the loss function. CNNs take a different approach to regularization: they exploit the hierarchical structure in data and assemble more complex patterns from smaller and simpler ones. Accordingly, on the scale of connectedness and complexity, CNNs sit at the lower extreme.

In convolutional networks, the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field. Compared to image classification algorithms, CNNs use relatively little pre-processing; the network learns the filters that in traditional algorithms were hand-engineered.

Figure 3.4 Implementation of CNN using an example


Layers used to build CNNs:

A convolutional neural network is a sequence of layers, and every layer transforms one volume of activations into another through a differentiable function.

1. Input layer: This layer holds the raw input given by the user.
2. Convolutional layer: In this layer, the output volume is computed as dot products between the filters and image patches.
3. Activation function: In this layer, we apply an element-wise activation function to the output of the convolutional layer. Some common activation functions are ReLU: max(0, x), Sigmoid: 1/(1 + e^(-x)), Tanh, and Leaky ReLU.
4. Pooling layer: This layer is inserted periodically in the ConvNet, and its main function is to reduce the size of the volume, which makes computation faster, reduces memory use, and also helps prevent overfitting. Two common types of pooling layers are average pooling and max pooling.
5. Fully connected layer: This is an ordinary neural network layer. It takes input from the previous layer and computes the class scores, outputting a 1-D array whose size equals the number of classes.

Figure 3.5 Layers of CNN used in noise classification
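
The five layer types listed above map directly onto Deep Learning Toolbox layer objects; a minimal sketch follows, in which the input size and filter counts are illustrative assumptions.

% Sketch of a small CNN stacking the layer types described above
layers = [
    imageInputLayer([28 28 1])                   % 1. input layer (assumed size)
    convolution2dLayer(3, 8, 'Padding', 'same')  % 2. convolutional layer: eight 3x3 filters
    reluLayer                                    % 3. element-wise activation (ReLU)
    maxPooling2dLayer(2, 'Stride', 2)            % 4. pooling layer halves the volume
    fullyConnectedLayer(10)                      % 5. fully connected layer: 10 class scores
    softmaxLayer
    classificationLayer];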


3.6.3 K-nearest neighbour algorithm
The k-nearest neighbour algorithm is used here as a non-parametric method for classification and regression. The input consists of the k closest training examples in feature space; in both cases, the output depends on whether k-NN is used for classification or for regression.

k-NN in classification: the output is a class membership. An object is classified by a majority vote of its neighbours, being assigned to the class most common among its k nearest neighbours, where k is a positive integer, typically small. If k = 1, the object is simply assigned to the class of its single nearest neighbour. k-NN in regression: the output is the value estimated for the object, calculated as the mean of the values of its k nearest neighbours.

This method predicts an object's value or class membership based on the k nearest training examples in feature space. k-NN is a type of instance-based learning; it is also lazy learning, where the function is only approximated locally and all computation is deferred until classification. This algorithm is the simplest among all machine learning algorithms. Many different techniques use k-NN for speaker identification.

The strength of this algorithm is that it is sensitive to the local structure of the data. The neighbours are taken from a set of objects for which the class (or the object property value) is known.

Figure 3.6 KNN example


Algorithm:

k-NN stores all available cases and classifies new cases based on a similarity measure. A case is classified by a majority vote of its neighbours, being assigned to the class most common among its k nearest neighbours as measured by a distance function. If k = 1, the case is simply assigned to the class of its nearest neighbour.

Distance functions:

Euclidean: $d(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$

Manhattan: $d(x, y) = \sum_{i=1}^{k} |x_i - y_i|$

Minkowski: $d(x, y) = \left( \sum_{i=1}^{k} |x_i - y_i|^q \right)^{1/q}$

All these distance measures are valid only for continuous variables. The Hamming distance should be used in the case of categorical variables. This also raises the issue of normalizing the numerical variables to the range 0 to 1 when there is a mixture of numerical and categorical variables in the dataset.

Hamming distance:

$D_H = \sum_{i=1}^{k} |x_i - y_i|$, with $x = y \Rightarrow D = 0$ and $x \neq y \Rightarrow D = 1$
x≠y⇒D=1
An optimal value of k should be chosen by first examining the data. A large value of k is generally more accurate, as it reduces the overall noise, but there is no guarantee. Another method of determining k is cross-validation, using an independent dataset to validate the value of k. Historically, the optimal k for most datasets has been between 3 and 10, which produces much better results than 1-NN.

SVM classifier: The essential idea of SVM is to find the optimal linear decision surface based on the concept of structural risk minimization. SVM is a binary classification method. In machine learning, support vector networks are supervised learning models with associated learning algorithms that analyse data and recognize patterns, used for classification and regression analysis. A basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two classes, an SVM training algorithm builds a model that assigns new examples to one class or the other. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate classes are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a class depending on which side of the gap they fall on. SVMs can perform both linear and nonlinear classification.

In conclusion, as with the ANN classification scheme, both k-NN and SVM need a large training set of feature data. These datasets are extracted by a feature extraction module from a diverse collection of recorded speech signals. Classification is conducted on various groups of data sets to test different combinations of features extracted from the speech signals using MATLAB.
3.7 Spectrogram

The frequency-domain spectrum provides information for describing the boundary between two classes. However, it does not consider the time dimension, which is also essential for acoustic signals. In this situation the spectrum is used as a spectrogram, which is simply a visual representation of the spectrum of an acoustic signal as it varies with time.

A spectrogram has a horizontal axis indicating time and a vertical axis indicating frequency, with the amplitude at each point indicated by the colour of the dots in the figure. A spectrogram represents how the spectrum of a sound wave changes over time. For a digital audio signal it is normally determined using the Short-Time Fourier Transform. The digital time-domain samples are separated into overlapping frames, which is known as the windowing process. Popular window functions include the rectangular window, the Hamming window, and the Hanning window.
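
In MATLAB this computation is available directly (a sketch; the 1024-sample Hanning window and 50% overlap are assumptions):

% STFT-based spectrogram: windowed, overlapping frames passed through the FFT
spectrogram(x, hann(1024), 512, 1024, fs, 'yaxis');   % time on x, frequency on y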

Rectangular window: $w_n = \begin{cases} 1 & 0 \le n < W \\ 0 & \text{otherwise} \end{cases}$

Hamming window: $w_n = \begin{cases} 0.54 - 0.46 \cos\left(\frac{2\pi n}{W}\right) & 0 \le n < W \\ 0 & \text{otherwise} \end{cases}$

Hanning window: $w_n = \begin{cases} 0.5 - 0.5 \cos\left(\frac{2\pi n}{W}\right) & 0 \le n < W \\ 0 & \text{otherwise} \end{cases}$

A rectangular window is seldom used because it causes discontinuities between frames when the spectrum is calculated. Following the windowing, each frame is passed through the FFT to obtain the corresponding spectrum. Finally, each spectrum is treated as a column, and the columns are concatenated along time. Figure 3.7 gives a diagram of the short-time transform.
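
The Hamming and Hanning formulas above can be generated either from the expressions directly or with the built-in functions (a sketch; W is the frame length):

% Window functions: explicit formulas versus built-ins
W = 1024;
n = (0:W-1)';
wHamming = 0.54 - 0.46 * cos(2*pi*n / W);   % matches hamming(W, 'periodic')
wHanning = 0.5  - 0.5  * cos(2*pi*n / W);   % matches hann(W, 'periodic')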
Figure 3.7 Audio slicing representation of FFT

3.8 Window function

A window function is a mathematical function that is zero-valued outside of some chosen interval, normally symmetric around the middle of the interval, usually maximal in the middle, and typically tapering away from the middle; such functions are widely used in signal statistics. Mathematically, when another function or waveform/data sequence is multiplied by a window function, the product is also zero-valued outside the interval: all that remains is the section where they overlap. In actual practice, the section of data within the window is first isolated, and then only that data is multiplied by the window function values. Consequently, tapering is the principal purpose of the window function.

Gaussian window: The Fourier transform of a Gaussian is also a Gaussian; it is an eigenfunction of the Fourier transform. Since the support of the Gaussian function extends to infinity, it must either be truncated at the ends of the window or itself be windowed with another zero-ended window.

We get a parabola if we apply the logarithm to the Gaussian function.

$w[n] = \exp\left(-\frac{1}{2}\left(\frac{n - N/2}{\sigma N/2}\right)^2\right), \quad 0 \le n \le N, \quad \sigma \le 0.5$

The standard deviation of the Gaussian function is $\sigma \cdot N/2$ sampling periods.
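
This corresponds to MATLAB's gausswin, whose width parameter alpha is the reciprocal of the σ above (a sketch; N = 255 and σ = 0.4 are arbitrary choices):

% Gaussian window via the built-in: gausswin(L, alpha) with alpha = 1/sigma
N = 255; sigma = 0.4;
w = gausswin(N + 1, 1/sigma);   % equals exp(-0.5*(((0:N)' - N/2)/(sigma*N/2)).^2)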


CHAPTER 4

DESIGN APPROACH AND DETAILS

4.1 Design Approach:

Figure 4.1 Project workflow: Pre-processing → Audio fingerprinting → Audio matching → Noise classification → Speaker classification

Figure 4.2 Representation of Pre-processing


Figure 4.3 Audio fingerprinting and matching

Figure 4.4 Deep Neural Network workflow for Noise Classification: database creation → exploring the database → classifying the database as training and testing datasets → feature extraction → defining and training the network → testing of the network


Figure 4.5 Speaker Identification Representation

4.2 Codes and Standards

Speaker recognition and noise classification utilized the two algorithms, the Deep Neural Network and the K-nearest neighbour algorithm. While these algorithms were being coded in MATLAB to obtain results, the following codes and standards were considered and kept in mind:

● Audio Toolkit from MATLAB R2019b


● Deep learning toolkit from MATLAB R2019b
● Statistics and machine learning toolkit from MATLAB R2019b
4.3 Constraints, Alternatives and Trade-offs

Constraints:

o Every time a new sound signal is fed as input to the software, the database has to be rebuilt from scratch.
o The sound source and the receiver should be within close range.
o The source should not be in motion.
o More than a single source can cause overlapping and thus fail to produce the required results.

Alternatives:

o Despite slight or even severe distortion in the input audio signal, the code works efficiently to search for a particular fingerprint in the database.
o Deep learning techniques enjoy numerous advantages over other strategies, as they extract the features and train the model on their own. They are simpler and quicker, and there are fewer chances of miscommunication.

Tradeoffs:

o The audio fingerprint mechanism must be resistant to sound disturbances while also remaining dense.
o In the presence of disturbances, most peaks should survive, as they have high energy, and the fingerprints can still be recognized.
CHAPTER 5
SCHEDULE, TASKS AND MILESTONES

5.1 Schedule (2021)

February 1st- Researching on the objective of the project

February 7th- Studying the basic concepts of signal and audio processing.

February 11th- Studying different Audio Fingerprinting Algorithms

February 13th- Downloading the necessary packages of MATLAB required

February 15th- Simulation of creating and storing of Audio Fingerprints on MATLAB

February 22nd- Simulation of matching of Audio Fingerprints on MATLAB.

February 25th- Studying and researching about deep learning techniques and its coding.

March 2nd- Studying about Noise and its types.

March 10th- Simulation of noise segregation using deep learning on MATLAB

March 16th- Studying about feature extraction, MFCC and pitch.

April 3rd- Simulation of Speaker Recognition using DNN on MATLAB

April 7th- Studying the comparison of DNN with other classification methods.

April 11th- Drafting of paper and poster

May 12th- Submission of paper draft and poster to the guide.

May 23rd- Submission of final report and poster to the guide.

May 26th- Final thesis submission on VTOP.

June 5th- Final review and project demonstration.


5.2 Tasks and Milestones

The major tasks and milestones achieved while completing this project:

● Creating a whole new algorithm for audio fingerprinting and matching, different from the ones discussed in previous papers.
● Creating an algorithm which is resistant to environmental disturbances.
● The model proposed for deep learning had to be trained properly so that it could precisely classify the types of noise, which was achieved by simulation in MATLAB.
● This was followed by the training of another model, the KNN classifier, which was used to detect the speaker from the audio after extracting features such as pitch and MFCC.
CHAPTER 6

PROJECT DEMONSTRATION

After running the code with the given input and performing several further simulations, the following results were obtained:

6.1 Audio Fingerprinting and Matching

Figure 6.1 Spectrogram of input audio signal

Figure 6.1 shows the spectrogram of the unlabelled music given as input to the software.

The 3-second audio clip recorded and given as input to the software is then tested to see whether the applied algorithm recognises it and outputs its name.

Fig 6.2 Hanning Window of input audio


Fig 6.2 displays the input audio signal after passing it through a Hanning window to smooth the ends and prevent distortion while computing its Fast Fourier Transform.

Figure 6.3 Audio fingerprint

Figure 6.3 displays the fingerprint of the audio as created and stored, which is also how the fingerprints of all the audios in the database are stored.

Audio Matching

Test Case-1 (for existing audio in database)

Figure 6.4 Test Case 1


Test Case-2 (for non-existing audio in database)

Figure 6.5 Test Case 2

Figures 6.4 and 6.5 show the results produced by the software for the two test cases and can be used to compare them.

Test case 1 shows that the unlabelled audio recorded or provided as input to the software already exists in the database and hence is instantly recognised by the software. The software therefore produces the output "the data is matched", along with the name under which the audio is saved in the database.

Test case 2, by contrast, shows that the unlabelled audio recorded by the software does not exist in the database; the system fails to recognise it, and the output "the data is not matched" is displayed.
6.2 Noise Classification

Figure 6.6 Training of Deep Neural Network

From figure 6.6, we see that the model is trained for a total of 30 epochs to obtain the best results. The duration of training decreases each time it is restarted: the first time the model was trained it took a total of 2 min 43 s, which reduced to 1 min 34 s when it was trained for the second time.
Figure 6.7 Noise classification result

Fig 6.7 displays the results of noise classification; the model simultaneously produces precise results for the other types of noise as well.

A total of 1000 signals were used to perform the experiment. These were divided into sets of 800 and 200, the first set being used to train the model and the rest to validate it.

6.3 Speaker identification

Figure 6.8 Downloading of dataset and reducing the same


A dataset of 1000 speakers is downloaded from the AN4 library and is reduced to a dataset of 10 speakers consisting of the voices of 5 males and 5 females.

Figure 6.9 Audio wave of the sample audio of a speaker uttering “two”

Figure 6.10 Pitch of the speaker

Figure 6.10 shows how the irrelevant portions of the speech signal are discarded in the pitch-frequency plot, and only the relevant part is retained.
Table 6.1 Confidence percentage of the KNN classifier

Filename ActualSpeaker PredictedSpeaker ConfidencePercentage


{'cen6-fejs-b.flac'} Fejs Fejs 94.41
{'cen6-fmjd-b.flac'} Fmjd Fmjd 77.019
{'cen6-fsrb-b.flac'} Fsrb Fsrb 55.721
{'cen6-ftmj-b.flac'} Ftmj Ftmj 59.615
{'cen6-fwxs-b.flac'} Fwxs Fwxs 71.946
{'cen6-mcen-b.flac'} Mcen Mcen 74.359
{'cen6-mrcb-b.flac'} Mrcb Mrcb 77.519
{'cen6-msjm-b.flac'} Msjm Msjm 58.571
{'cen6-msjr-b.flac'} Msjr Msjr 64.828
{'cen7-fejs-b.flac'} Fejs Fejs 78.102
{'cen7-fmjd-b.flac'} Fmjd Fmjd 74.654
{'cen7-fsrb-b.flac'} Fsrb Fsrb 72.585
{'cen7-ftmj-b.flac'} Ftmj Ftmj 37.441
{'cen7-fwxs-b.flac'} Fwxs Fwxs 69.345
{'cen7-mcen-b.flac'} Mcen Mcen 77.5
{'cen7-mrcb-b.flac'} Mrcb Mrcb 82.828
{'cen7-msjm-b.flac'} Msjm Msjm 57.653
{'cen7-msjr-b.flac'} Msjr Msjr 79.845
{'cen7-msmn-b.flac'} Msmn Msmn 75.191
{'cen8-fejs-b.flac'} Fejs Fejs 84.259
{'cen8-fmjd-b.flac'} Fmjd Fmjd 57.317
{'cen8-fsrb-b.flac'} Fsrb Fsrb 50.427
{'cen8-ftmj-b.flac'} Ftmj Ftmj 63.054
{'cen8-mcen-b.flac'} Mcen Mcen 77.536
{'cen8-mrcb-b.flac'} Mrcb Mrcb 73.171
{'cen8-msjm-b.flac'} Msjm Msjm 47.482
{'cen8-msjr-b.flac'} Msjr Msjr 63.636
{'cen8-msmn-b.flac'} Msmn Msmn 87.773
Figure 6.11 Confusion Matrix

We can read the confidence percentage of each speaker from the confusion matrix, as it relates the true class to the predicted class of the classifier. The confusion matrix displays how many times the classifier predicts each speaker against the actual speaker.

Figure 6.12 Validation accuracy

Table 6.2 Number of speakers vs accuracy

Number of speakers    Accuracy
10                    92.93%
20                    89.5%

With a total of 10 speakers, the classifier achieves an accuracy of 92.93%. On increasing the number of speakers to 20, the accuracy drops to 89.5%.
CHAPTER 7
RESULT & DISCUSSION

We observed the results obtained after complete simulation of each part of the project in the previous section. The algorithm for creating audio fingerprints and matching them with the sound clip given as input is robust against noise and gives accurate results, detecting the song within 2 seconds if the audio is found; if the audio or song is not found in the database, the output "The data is not matched" is displayed. The next phase of the project classifies the noise sample as pink, brown, or white noise. After training the deep neural network with 800 samples and using 200 samples for testing, the network is able to produce precise output for the input audio provided by the user. The final phase of the project is speaker identification, for which we used a KNN classifier. After the AN4 dataset is downloaded, consisting of a total of 1000 labelled male and female voices, a random voice is selected and the classifier produces the identified speaker as output, with the help of a confusion matrix that displays the confidence percentage of every speaker.
CHAPTER 8
SUMMARY

In this thesis, we studied the parameters and components of a system for audio fingerprint generation and audio matching of various random audio signals. Experiments were conducted by adding effects to a data set on a larger scale, leading to the conclusions presented below.

1. The audio fingerprinting algorithm presented in this thesis is robust against additive noise and various noise distortions.
2. By training a deep neural network, we were successful in classifying different types of noise.
3. In speaker recognition, feature extraction is performed in such a way that the features can reliably distinguish speakers by themselves or in combination with other feature sets.
4. The speakers were correctly identified, and the validation accuracy for ten speakers turned out to be 92.93%.
5. As the process continued, when the number of speakers is expanded to 20, the accuracy drops to 89.5%.

Future Scope
The goal in the future is to achieve the integration of speaker verification technology into a
hybrid and multi-level authentication method, where the results of different biometric
technologies such as fingerprint, iris, speaker, facial recognition etc., can be combined to
achieve better reliability in authentication.
The most important advantage of using audio biometrics is the ability to perform
authentication in situations where one does not have access to direct eye contact or direct
subject contact to authenticate.
CHAPTER 9
REFERENCES

[1] G. Deepsheka, R. Kheerthana, M. Mourina and B. Bharathi, "Recurrent neural network


based Music Recognition using Audio Fingerprinting," 2020 Third International
Conference on Smart Systems and Inventive Technology (ICSSIT), 2020, pp. 1-6, doi:
10.1109/ICSSIT48917.2020.9214302.

[2] X. Chen and X. Kong, "An Efficient Music Search Method in Large Audio Database,"
2018 3rd International Conference on Mechanical, Control and Computer
Engineering (ICMCCE), 2018, pp. 484-487, doi: 10.1109/ICMCCE.2018.00108.

[3] H. Cho, S. Ko and H. Kim, "A Robust Audio Identification for Enhancing Audio-Based Indoor Localization," 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 2016, pp. 1-6, doi: 10.1109/ICMEW.2016.7574701.

[4] T. Tsai, T. Prätzlich and M. Müller, "Known-Artist Live Song Identification Using Audio
Hashprints," in IEEE Transactions on Multimedia, vol. 19, no. 7, pp. 1569-1582, July
2017, doi: 10.1109/TMM.2017.2669864.

[5] P. Seetharaman and Z. Rafii, "Cover song identification with 2D Fourier Transform
sequences," 2017 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2017, pp. 616-620, doi: 10.1109/ICASSP.2017.7952229.

[6] R. Sonnleitner and G. Widmer, "Robust Quad-Based Audio Fingerprinting," in


IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3,
pp. 409-421, March 2016, doi: 10.1109/TASLP.2015.2509248.

[7] S. Yao, B. Niu and J. Liu, "Audio Identification by Sampling Sub-fingerprints and
Counting Matches," in IEEE Transactions on Multimedia, vol. 19, no. 9, pp. 1984-
1995, Sept. 2017, doi: 10.1109/TMM.2017.2723846.

[8] W. Xiong, X. Yu and J. Shi, "An improved audio fingerprinting algorithm with robust and
efficient," IET International Conference on Smart and Sustainable City 2013 (ICSSC
2013), 2013, pp. 377-380, doi: 10.1049/cp.2013.1960.

[9] F. Saki and N. Kehtarnavaz, "Real-Time Unsupervised Classification of Environmental


Noise Signals," in IEEE/ACM Transactions on Audio, Speech, and Language
Processing, vol. 25, no. 8, pp. 1657-1667, Aug. 2017, doi:
10.1109/TASLP.2017.2711059.

[10] F. Saki and N. Kehtarnavaz, "Background noise classification using random forest tree
classifier for cochlear implant applications," 2014 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 3591-3595, doi:
10.1109/ICASSP.2014.6854270.
[11] B. D. Barkana and I. Saricicek, "Environmental Noise Source Classification Using
Neural Networks," 2010 Seventh International Conference on Information
Technology: New Generations, 2010, pp. 259-263, doi: 10.1109/ITNG.2010.118.

[12] Z. Alavi and B. Azimi, "Application of Environment Noise Classification towards Sound Recognition for Cochlear Implant Users," 2019 6th International Conference on Electrical and Electronics Engineering (ICEEE), 2019, pp. 144-148, doi: 10.1109/ICEEE2019.2019.00035.

[13] E. C. K. M. M. S. Kalaivani, “A Study on Speaker Recognition System and Pattern


classification Techniques,” vol. 2, no. 2, pp. 963–967, 2014.

[14] V. Sharma and P. K. Bansal, “A review on speaker recognition approaches and


challenges,” vol. 2, no. 5, pp. 1581–1588, 2013.

[15] M. A. Nasr, M. Abd-Elnaby, A. S. El-Fishawy, S. El-Rabaie, and F. E. Abd El-Samie,


“Speaker identification based on normalized pitch frequency and Mel Frequency
Cepstral Coefficients,” Int. J. Speech Technol., vol. 21, no. 4, pp. 941–951, 2018, doi:
10.1007/s10772-018-9524-7.
