Mohini Dey - Capstone
6.10 Pitch of the Speaker
6.11 Confusion Matrix
6.12 Validation Accuracy
LIST OF TABLES
LIST OF SYMBOLS
μ	Mean
π	Pi
θ	Theta
∫	Integration
α	Alpha
∑	Summation
X	Eigenvector
Λ	Eigenvalue
σ²	Variance
ω	Omega
R	Resistance
C	Capacitance
ε	Epsilon
j	Imaginary unit (iota)
w(n)	Window function
σ	Standard deviation
1.1 Motivation
1.2 Background
Previous works on audio fingerprinting and matching have mostly used hash functions to store the fingerprints, but they focused on exact mathematical equality rather than perceptual similarity. This project therefore sets aside the idea of using hash functions to store the fingerprints and instead stores the energy peaks of the audio signals, computed using Parseval's theorem. Unlike previous works, our work also classifies the noise in the audio as white, brown, or pink, giving an idea of what kind of music the listener is interested in. This is followed by recognition of the speaker from the audio. The experiments are tuned to achieve better accuracy, and all of them are performed in MATLAB, a convenient platform for experiments related to signal processing.
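Parseval's theorem states that a signal's energy computed in the time domain equals its energy computed from its DFT. The thesis works in MATLAB; purely as an illustration, the identity can be checked with a small NumPy sketch:

```python
import numpy as np

def signal_energy_time(x):
    """Energy of a discrete signal computed in the time domain."""
    return np.sum(np.abs(x) ** 2)

def signal_energy_freq(x):
    """Same energy computed from the DFT via Parseval's theorem:
    sum |x[n]|^2 = (1/N) * sum |X[k]|^2."""
    X = np.fft.fft(x)
    return np.sum(np.abs(X) ** 2) / len(x)

# A toy "audio" signal made of two tones.
n = np.arange(1024)
x = np.sin(2 * np.pi * 0.05 * n) + 0.5 * np.sin(2 * np.pi * 0.12 * n)

e_time = signal_energy_time(x)
e_freq = signal_energy_freq(x)
print(abs(e_time - e_freq) < 1e-6 * e_time)  # True
```

This is why energy peaks can equivalently be located in the frequency domain before storing them as a fingerprint.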
CHAPTER 2
PROJECT DESCRIPTION AND GOALS
An audio fingerprint is a compact piece of information that captures many features of an audio sample and can subsequently be used for comparison or other purposes. It is a condensed digital summary of a tune that helps distinguish it from other tunes. In digital signal processing, we convert analog signals into digital signals through a digitization process consisting of two steps, sampling and quantization, which together form the basis of the pre-processing of any given audio signal. Quantization restricts the amplitude to a specific, limited set of values. Throughout this thesis, the terms audio signal and audio recording refer to a discrete-time (DT) signal.
Figure 1.1 Workflow of the project: Pre-processing → Audio fingerprinting → Audio matching → Noise classification → Speaker classification
The workflow of the project is divided into four phases, namely audio fingerprinting, audio matching, noise classification, and speaker identification, as shown in figure 1.1. An audio fingerprint is a condensed digital summary, deterministically generated from an audio signal, that can be used to identify an audio sample or quickly locate practically identical samples in an audio database. A unique audio fingerprint is created from the energy peaks of the audio input. Once the audio fingerprints are created, they are stored in a database. A new sound sample is then recorded with an external microphone and matched against the database. Learning to recognize complex, high-dimensional audio signals is among the most difficult tasks today. The deep learning algorithm attempts to learn simple features in the lower layers and more complex features in the higher layers. Then comes the classification of noise. We train the network in a layer-wise fashion, in which the hidden layers are trained one at a time. After training and testing on the split samples, we are finally able to classify the noises as pink, brown, or white.
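White, pink, and brown noise differ in the slope of their power spectral density (roughly 1/f⁰, 1/f¹, and 1/f² respectively). The thesis classifies them with a deep network in MATLAB; purely for intuition, the following NumPy sketch (not the author's method) separates the three by estimating that spectral slope:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1 << 14

def make_noise(alpha):
    """Noise with power spectral density ~ 1/f^alpha:
    alpha = 0 -> white, 1 -> pink, 2 -> brown."""
    freqs = np.fft.rfftfreq(N, d=1.0)
    spectrum = rng.normal(size=freqs.size) + 1j * rng.normal(size=freqs.size)
    scale = np.ones_like(freqs)
    scale[1:] = freqs[1:] ** (-alpha / 2)   # amplitude ~ f^(-alpha/2)
    spectrum = spectrum * scale
    spectrum[0] = 0.0                        # drop the DC component
    return np.fft.irfft(spectrum, n=N)

def spectral_slope(x):
    """Slope of the power spectrum on a log-log scale."""
    psd = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(N, d=1.0)
    mask = freqs > 0
    return np.polyfit(np.log(freqs[mask]), np.log(psd[mask]), 1)[0]

def classify(x):
    # nearest of the three nominal slopes: 0 (white), -1 (pink), -2 (brown)
    labels = {0.0: "white", -1.0: "pink", -2.0: "brown"}
    slope = spectral_slope(x)
    return labels[min(labels, key=lambda s: abs(s - slope))]

print(classify(make_noise(0)))  # white
print(classify(make_noise(1)))  # pink
print(classify(make_noise(2)))  # brown
```

A trained network learns a far richer decision boundary, but the spectral slope is the underlying property that distinguishes the three noise colours.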
The final part of this project is identifying the speaker from the audio. Speaker identification based on audio signals is treated as one of the most important technologies for human recognition. Audio signal features can be characterized either in the perceptual mode or in the physical mode. In this thesis, we propose extracting feature vectors and then training on the dataset. As mentioned before, this framework comprises two parts: speaker identification and speaker verification. In the first, we determine which enrolled speaker produced the speech input, while verification is the task of automatically deciding whether a person is who they claim to be. Speaker identification can be classified into text-dependent and text-independent techniques. In the text-dependent technique, the speaker must say words or key phrases with the same text for both the training and recognition trials. In the text-independent technique, the framework can identify the speaker irrespective of what is being spoken. In this thesis, we propose a text-independent speaker recognition framework based on Mel-frequency cepstral coefficients for feature extraction together with vector quantization, which also helps limit the amount of data needed for processing. The extracted speaker features are quantized into several centroids, and the K-nearest neighbor algorithm is incorporated into the proposed speaker identification framework. Speaker identification aims to recognize speakers from their voices, as every individual has a distinctive set of speech qualities and way of talking. We also produce a confusion matrix at the end to show the probability percentages for each speaker, which helps identify the exact speaker.
The objective of this thesis is to examine and study a text-independent speaker identification framework, in which a speech signal from an unknown speaker is compared against a database of known speakers. Moreover, we extract features from unlabeled audio data, which has shown excellent performance on numerous audio classification tasks.
2.1 Algorithms:
1. Pre-processing
Step 3: Use a Hanning window to split the input audio into overlapping frames, each with a frame duration of 370 ms.
Step 5: Determine the energy peaks, then calculate the mean of all the energy peaks.
Step 6: Use the mean of the energy peaks as the calculated cut-off frequency and pass the data through a Butterworth high-pass filter.
Step 7: The energy peaks of the audio can be obtained both as discrete data and in the form of a spectrogram.
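The pre-processing steps above can be sketched in NumPy. This is an illustration only: the sample rate and overlap are assumptions not stated in the text, and a first-order high-pass filter stands in for the Butterworth filter of Step 6 (in practice one would use scipy.signal.butter or MATLAB's butter):

```python
import numpy as np

FS = 16_000          # assumed sample rate (not given in the text)
FRAME_MS = 370       # frame duration from Step 3
HOP_MS = 185         # 50% overlap (an assumption)

def frames_hann(x, fs=FS):
    """Step 3: split the signal into overlapping Hann-windowed frames."""
    flen = int(fs * FRAME_MS / 1000)
    hop = int(fs * HOP_MS / 1000)
    win = np.hanning(flen)
    n = 1 + max(0, (len(x) - flen) // hop)
    return np.stack([x[i * hop: i * hop + flen] * win for i in range(n)])

def energy_peaks(frames):
    """Step 5: energy of each frame."""
    return np.array([np.sum(f ** 2) for f in frames])

def high_pass(x, cutoff_hz, fs=FS):
    """Step 6 stand-in: first-order high-pass filter."""
    rc = 1.0 / (2 * np.pi * cutoff_hz)
    dt = 1.0 / fs
    alpha = rc / (rc + dt)
    y = np.zeros_like(x)
    for i in range(1, len(x)):
        y[i] = alpha * (y[i - 1] + x[i] - x[i - 1])
    return y

# Toy input: a low-frequency hum plus a higher-frequency tone.
t = np.arange(FS) / FS
x = np.sin(2 * np.pi * 50 * t) + 0.3 * np.sin(2 * np.pi * 2000 * t)
e = energy_peaks(frames_hann(x))
cutoff = float(np.mean(e))   # Step 6 derives the cut-off from this mean
y = high_pass(x, 500, FS)    # demo: the 50 Hz hum is strongly attenuated
```

After the high-pass stage, the surviving energy peaks are what the fingerprinting phase stores.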
2. Audio Fingerprinting
3. Audio matching
Step 1: Give a random audio signal as the input in MATLAB.
Step 3: Correlate the energy peaks obtained from the random audio signal with the sets present in the database.
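As an illustration of Step 3, the following toy NumPy sketch matches a noisy query against stored energy-peak fingerprints by correlation. The fingerprint here (normalized per-frame peak energies) is a simplified stand-in for the thesis's Parseval-based peaks:

```python
import numpy as np

def fingerprint(x, n_frames=32):
    """Toy fingerprint: normalized per-frame peak energies."""
    frames = np.array_split(x, n_frames)
    e = np.array([np.max(f ** 2) for f in frames])
    return (e - e.mean()) / (e.std() + 1e-12)

def best_match(query, database):
    """Correlate the query's energy peaks with each stored set and
    return the name of the best-scoring entry."""
    q = fingerprint(query)
    scores = {name: float(np.dot(q, fp) / len(fp))
              for name, fp in database.items()}
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(2)
song_a = rng.normal(size=8000) * np.linspace(0, 1, 8000)  # ramps up
song_b = rng.normal(size=8000) * np.linspace(1, 0, 8000)  # ramps down
db = {"song_a": fingerprint(song_a), "song_b": fingerprint(song_b)}

noisy_query = song_a + 0.1 * rng.normal(size=8000)        # distorted copy
name, _ = best_match(noisy_query, db)
print(name)  # song_a
```

Because the correlation compares the shape of the energy-peak sequence rather than exact sample values, a moderately distorted recording still matches its stored fingerprint.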
4. Speaker identification
Step 2: A helper function is employed to convert the audio signals into FLAC format.
Step 3: The dataset from the AN4 database is reduced to 5 female and 5 male speakers.
Step 4: A datastore is created to store the data used for further processing.
Step 5: The datastore is split into two datastores, with 80% of the data used for training and 20% used for testing.
Step 7: The KNN classifier is trained based on the extracted features of the data.
The first 10 triangular filters of the mel filterbank are linearly spaced and the remaining filters are logarithmically spaced. The individual bands are weighted for even energy. The figure represents a typical mel filterbank.
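A standard triangular mel filterbank of the kind described above can be sketched in NumPy. The filter count, FFT size, and sample rate below are assumed values for illustration (in practice MATLAB's feature-extraction functions build this internally):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=20, n_fft=512, fs=16000):
    """Triangular filters spaced evenly on the mel scale, which is
    roughly linear below ~1 kHz and logarithmic above it."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts)
    bins = np.floor((n_fft + 1) * hz_pts / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                  # rising edge of triangle
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                  # falling edge of triangle
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

fb = mel_filterbank()
print(fb.shape)  # (20, 257)
```

Multiplying a frame's power spectrum by this matrix yields the mel-band energies from which the MFCCs are computed.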
6. KNN algorithm.
This project was carried out in MATLAB. The various MATLAB toolboxes used to achieve the results are listed below:
SOFTWARE DESCRIPTION:
3.1 MATLAB
MATLAB (Matrix Laboratory) provides one of the most convenient environments for conducting experiments related to digital signal processing. It includes the MATLAB language, a top programming language dedicated to numerical and technical computing. MATLAB offers a multi-paradigm numerical computing environment; it is an interactive system whose basic data element is an array that does not require dimensioning. MATLAB therefore lets us solve problems far more easily than languages like C or Fortran. This matrix-based language lets you express mathematics directly: linear algebra in MATLAB is intuitive and concise. The same is true for data analytics, signal and image processing, control design, and other applications, as it integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation.
Hence, be it engineers, researchers, or scientists, MATLAB has become an inseparable part of their work. The richness of the MATLAB computational environment, combined with an integrated development environment (IDE), an easy interface, toolkits, and simulation and modelling capabilities, creates a research and development tool that has no equal.
Its uses include
• Math and computation
• Algorithm development
• Scientific and engineering graphics
• Modeling, simulation, and prototyping
• Data analysis, exploration, and visualization
3.2 SIMULINK
Simulink is a graphical extension to MATLAB for integrated simulation and model-based design. We generally use Simulink for modelling, simulating, and analysing model-based design systems. It is essentially a diagrammatic framework with customizable block libraries; with product-style control, traceability criteria, and application coverage analysis, Simulink can automatically verify models. The main advantage of Simulink is its ability to model a nonlinear system, which a transfer function cannot do. Another major advantage of Simulink is its ability to take on initial conditions.
In Simulink, systems are drawn on screen as block diagrams. Many elements of block diagrams are available, such as transfer functions, summing junctions, etc., as well as virtual input and output devices such as function generators and oscilloscopes. Simulink is integrated with MATLAB, and data can be moved between the two. Accordingly, Simulink is productive for precise verification and testing of systems through design-style checking, requirements tracing, and analysis of model coverage. We can also easily identify mistakes through the Simulink Design Verifier, which can also be used in system checking by creating test cases.
MATLAB and Signal Processing
Digital signal processing becomes much easier with MATLAB and Simulink: they provide a systematic workflow for the development of embedded systems. One can apply signal processing tools to:
● Acquire, measure, and analyze signals from numerous sources without prior expertise in signal processing.
● Prototype, test, and then deploy DSP algorithms on PCs, various kinds of processors, SoCs, and FPGAs.
● Pre-process and filter signals before analysis.
● Explore and extract features for data analysis and applications based on artificial intelligence.
● Analyse trends and discover patterns in signals.
● Visualize and measure the time and frequency characteristics of signals.
MATLAB's Audio Toolbox provides tools for audio processing, speech analysis, and acoustic measurement. It includes algorithms for audio signal processing (for example, equalization and dynamic range control) and acoustic measurement (for example, impulse response estimation, octave filtering, and perceptual weighting). It also provides algorithms for audio and speech feature extraction (such as MFCC and pitch) and audio signal transformation (for example, gammatone filter banks and mel-filtered spectrograms).
Toolbox apps support live algorithm testing, impulse response measurement, and audio signal labeling. The toolbox provides streaming interfaces to ASIO, WASAPI, ALSA, and CoreAudio sound cards and to MIDI devices, as well as hosts for creating and working with standard audio plugins such as VST and Audio Units. With Audio Toolbox one can import, label, and augment audio data sets, as well as extract features and transform signals for machine learning and deep learning. One can synchronize real-time audio algorithms with low-latency audio streaming while tuning parameters and visualizing them. One can also validate an algorithm by converting it into an audio plugin that runs in external host applications such as digital audio workstations. Plugin hosting allows one to use external audio plugins like standard functions to process MATLAB arrays, and sound card interfaces allow custom measurements on real audio signals and acoustic systems.
Audio Toolbox provides functionality to develop audio, speech, and acoustic applications using machine learning and deep learning. Use the audio datastore to manage and load large data sets. Use the Audio Labeler to interactively define and visualize ground truth. Use the Audio Data Augmenter to expand data sets using audio-specific augmentation methods. Use the audio feature extractor to create efficient and modular feature extraction pipelines.
Audio Toolbox is optimized for real-time audio stream processing. These features can be used individually or as part of a larger algorithm to create effects, analyze signals, and process audio.
● Audio Toolbox enables real-time processing and testing of audio signals in MATLAB and Simulink. It provides a framework that can operate on very large data sets in minimal time, streaming audio through standard drivers.
● Smooth multi-channel audio streaming in real time is possible in MATLAB. Channel mapping allows routing of signals to arbitrary channels when multi-channel audio devices are used.
● C code generation is supported by MATLAB and Simulink, including for the creation of libraries. For example, one can generate libraries or standalone applications that process audio in real time on the desktop.
● Audio Toolbox also enables one to tune algorithm parameters interactively during simulations using external MIDI controllers.
3.4 Deep Learning Toolbox:
Deep Learning Toolbox provides a framework for designing and implementing deep neural networks with algorithms, pre-trained models, and apps. One can use convolutional neural networks (ConvNets, CNNs) and long short-term memory (LSTM) networks to perform classification and regression on image, time-series, and text data. One can construct network architectures such as generative adversarial networks (GANs) and Siamese networks using automatic differentiation, custom training loops, and shared weights. With the Deep Network Designer app, one can design, analyze, and train networks graphically. The Experiment Manager app helps one manage numerous deep learning experiments, monitor training parameters, analyze results, and compare code across experiments. One can visualize layer activations and graphically monitor the training progress.
One can exchange models with TensorFlow and PyTorch through the ONNX format and import models from TensorFlow-Keras and Caffe. The toolbox supports transfer learning with DarkNet-53, ResNet-50, NASNet, SqueezeNet, and many other pre-trained models. Training can run on a single- or multi-GPU workstation (with Parallel Computing Toolbox), or scale up to clusters and clouds, including NVIDIA GPU Cloud and Amazon EC2 GPU instances.
Statistics and Machine Learning Toolbox provides functions and apps to describe, analyze, and model data. One can use descriptive statistics and plots for exploratory data analysis, fit probability distributions to data, generate random numbers for Monte Carlo simulations, and perform hypothesis tests. Regression and classification algorithms allow one to draw inferences from data and build predictive models.
For multidimensional data analysis and data mining, Statistics and Machine Learning Toolbox provides tools such as feature selection, stepwise regression, principal component analysis (PCA), regularization, and other dimensionality reduction methods that allow one to identify the variables or features that influence a model.
The toolbox provides supervised and unsupervised machine learning algorithms, including support vector machines (SVMs), boosted and bagged decision trees, k-nearest neighbors, k-means, k-medoids, hierarchical clustering, Gaussian mixture models, and hidden Markov models. Many of the statistics and machine learning algorithms can be used for computation on data sets that are too large to fit in memory.
Deep neural networks consist of multiple levels of nonlinear operations, like neural nets with many hidden layers. The idea of deep learning rests on learning feature hierarchies, where features at higher levels of the hierarchy are formed using the features at lower levels. Much better results can be achieved in deeper architectures when each layer corresponds to an unsupervised learning algorithm; the network is then switched to a supervised mode and trained with the backpropagation algorithm to adjust the weights. Additionally, DNNs outperform GMMs (Gaussian Mixture Models) and HMMs (Hidden Markov Models) on a variety of speech processing tasks by a large margin. Deep learning networks are distinguished from the more conventional single-hidden-layer neural networks by their depth, that is, the number of node layers through which data must pass in a multistep process of pattern recognition. This deep structured learning belongs to a broader family of machine learning methods based on artificial neural networks with representation learning, where the learning can be supervised or unsupervised. Deep neural networks, deep belief networks, recurrent neural networks, and convolutional neural networks have been applied to numerous fields such as speech recognition, natural language processing, audio signal recognition, social network filtering, machine translation, bioinformatics, and drug design, producing results comparable to, and in some cases surpassing, human expert performance. A deep neural network is simply an artificial neural network with multiple layers between the input and output layers. Whether the relationship is linear or non-linear, the DNN finds the correct mathematical manipulation to transform the input into the output, moving through each layer and calculating the probability of each output. Such a network has a certain degree of complexity, a network with multiple layers, and it uses sophisticated mathematical modeling to process data in intricate ways.
A deep neural network has one input layer, one output layer, and at least one hidden layer sandwiched in between, with each layer performing specific kinds of sorting and ordering. Each layer consists of neurons (nodes) that perform mathematical calculations and other tasks. Every node in a layer is connected to the nodes in the adjacent layers. A weight is assigned to every interconnection, and a bias is allotted to each layer; these weights and biases are the parameters of the given network.
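The layer structure just described, weighted connections plus a per-layer bias with nonlinearities in between, can be sketched as a forward pass in NumPy (an illustrative sketch only; the layer sizes are invented):

```python
import numpy as np

rng = np.random.default_rng(3)

def relu(z):
    """Element-wise nonlinearity between layers."""
    return np.maximum(0.0, z)

def forward(x, params):
    """One pass through a small fully connected network: each layer
    computes relu(W @ a + b); the final layer is left linear."""
    a = x
    for i, (W, b) in enumerate(params):
        z = W @ a + b
        a = z if i == len(params) - 1 else relu(z)
    return a

# A hypothetical 4-16-8-3 network; the (W, b) pairs are the
# parameters (weights and biases) of the network, one per layer.
sizes = [4, 16, 8, 3]
params = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

out = forward(rng.normal(size=4), params)
print(out.shape)  # (3,)
```

Training then amounts to adjusting every W and b by backpropagation so that the output matches the desired labels.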
In deep learning, CNNs are most commonly applied to analyzing visual imagery. Because of their shared-weight architecture and translation invariance characteristics, they are also called shift-invariant or space-invariant artificial neural networks (SIANN). CNNs are widely used in image and video recognition, recommender systems, image classification, medical image analysis, natural language processing, and financial time series. They are regularized variants of the multi-layer perceptron. Multi-layer perceptron typically means a fully connected network, one in which every neuron in one layer is connected to all neurons in the following layer. The full connectedness of these networks makes them prone to overfitting. Typical regularization methods include adding some form of magnitude measurement of the weights to the loss function. CNNs take a different approach to regularization: they exploit the hierarchical structure in data and assemble more complex patterns using smaller and simpler patterns. Accordingly, on the scale of connectedness and complexity, CNNs sit at the lower extreme.
In convolutional networks, the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field. Compared with classical image classification algorithms, CNNs use relatively little pre-processing: the network learns the filters that in traditional algorithms are hand-engineered.
A convolutional neural network is a sequence of many layers, and every layer transforms one volume into another through a differentiable function.
1. Input layer: This layer holds the raw input provided by the user.
2. Convolutional layer: In this layer, the output volume is calculated by computing the dot product between each filter and the image patches.
3. Activation function: In this layer, an element-wise activation function is applied to the output of the convolution layer. A common activation function is ReLU: max(0, x).
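The convolutional and activation layers above can be sketched in NumPy; the image and the edge-detecting filter below are invented for illustration:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Convolutional layer as described above: a dot product between
    the filter and every image patch ('valid' padding, stride 1)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(patch * kernel)
    return out

def relu(z):
    """Element-wise activation applied to the convolution output."""
    return np.maximum(0.0, z)

image = np.arange(25, dtype=float).reshape(5, 5)
edge = np.array([[-1.0, 1.0]])   # toy horizontal-gradient filter
feature_map = relu(conv2d_valid(image, edge))
print(feature_map.shape)  # (5, 4)
```

In a real CNN the filter weights are learned during training rather than hand-chosen, which is exactly the point made above about hand-engineered filters.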
The strength of this algorithm is that it is sensitive to the local structure of the data. The neighbors are taken from a set of objects for which the class (or the object property value) is known. K-NN stores all available cases and classifies new cases based on a similarity measure. A case is classified by a majority vote of its neighbors, the case being assigned to the class most common among its k nearest neighbors as measured by a distance function. If k = 1, the object is simply assigned to the class of its nearest neighbor.
Distance functions:

Euclidean:  D(x, y) = sqrt( Σᵢ₌₁ᵏ (xᵢ − yᵢ)² )

Manhattan:  D(x, y) = Σᵢ₌₁ᵏ |xᵢ − yᵢ|

Minkowski:  D(x, y) = ( Σᵢ₌₁ᵏ |xᵢ − yᵢ|^q )^(1/q)
All these distance measures are valid only for continuous variables. The Hamming distance should be used in the case of categorical variables. This also raises the issue of normalizing the numerical variables to the range 0 to 1 when the dataset contains a mixture of numerical and categorical variables.
Hamming distance:

D_H = Σᵢ₌₁ᵏ |xᵢ − yᵢ|,  where xᵢ = yᵢ ⇒ D = 0 and xᵢ ≠ yᵢ ⇒ D = 1
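The four distance functions can be written directly in NumPy (a small sketch for illustration):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def minkowski(x, y, q):
    """Generalizes both: q=1 gives Manhattan, q=2 gives Euclidean."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

def hamming(x, y):
    """For categorical variables: the count of positions that differ."""
    return int(np.sum(x != y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])
print(euclidean(x, y))     # 5.0
print(manhattan(x, y))     # 7.0
print(minkowski(x, y, 2))  # 5.0
print(hamming(np.array(list("karolin")), np.array(list("kathrin"))))  # 3
```

Note how Minkowski with q = 2 reproduces the Euclidean result, confirming the relationship between the three continuous metrics.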
An ideal value of k should be chosen by first examining the data. A larger k is generally more accurate, as it reduces the overall noise, but there is no guarantee. Another method of determining k is cross-validation, using an independent dataset to validate the k value. Historically, the optimal k for most datasets has been between 3 and 10, which produces much better results than 1-NN.
The k-NN classifier is a non-parametric technique used for classification and regression. It predicts an object's value or class membership based on the k nearest training examples in feature space. K-NN is a type of supervised learning. It is also a lazy learner, in that the function is only approximated locally and all computation is deferred until classification. It is among the simplest of all machine learning algorithms. With a k-NN classifier, an object is classified by a majority vote of its neighbors and assigned to the class most common among its k nearest neighbors, where k is a small positive integer. If k = 1, the object is simply assigned to the class of its single nearest neighbor. A wide range of methods use k-NN for speaker identification.
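The majority-vote rule described above can be sketched in NumPy on invented 2-D "speaker features" (the thesis extracts pitch and MFCC features in MATLAB; the points here are purely illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by a majority vote among its k nearest training
    examples, using Euclidean distance."""
    d = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    nearest = np.argsort(d)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy feature vectors: two well-separated speaker clusters.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [3.0, 3.0], [3.1, 2.9], [2.9, 3.2]])
y = np.array(["speaker_a"] * 3 + ["speaker_b"] * 3)

print(knn_predict(X, y, np.array([0.15, 0.15])))  # speaker_a
print(knn_predict(X, y, np.array([3.05, 3.00])))  # speaker_b
```

Because all computation happens at query time, this also illustrates why k-NN is called a lazy learner.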
SVM Classifier: The essential idea of SVM is to find the optimal linear decision surface based on the principle of structural risk minimization. SVM is a binary classification technique. In machine learning, support vector networks are supervised models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. A basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier. An SVM training algorithm builds, from a set of training examples each marked as belonging to one of two classes, a model that assigns new examples to one class or the other. An SVM model is a representation of the examples as points in space, mapped so that the examples of the different classes are separated by a gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a class based on which side of the gap they fall on. SVMs can perform both linear and non-linear classification.
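A minimal sketch of the idea, a linear SVM trained by subgradient descent on the hinge loss, is shown below. This is an illustration only; production SVM solvers (including MATLAB's) are considerably more sophisticated:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimal linear SVM via subgradient descent on the hinge loss;
    labels y must be in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in range(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:                       # point inside the margin
                w += lr * (y[i] * X[i] - lam * w)
                b += lr * y[i]
            else:                                # only shrink (regularize)
                w -= lr * lam * w
    return w, b

# Two linearly separable toy classes.
X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.0],
              [-1.0, -1.0], [-1.5, -0.5], [-2.0, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])
w, b = train_linear_svm(X, y)
preds = np.sign(X @ w + b)
print(np.all(preds == y))  # True
```

The hinge loss is what pushes the decision surface toward the widest possible gap between the two classes, which is the structural-risk-minimization idea mentioned above.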
In conclusion, as with the ANN classification scheme, both K-NN and SVM need a large training set of feature data. These datasets are produced by a feature extraction module from a diverse collection of recorded speech signals. Classification is conducted on various groups of data sets to test different combinations of features extracted from the speech signals using MATLAB.
3.7 Spectrogram
The frequency-domain spectrum gives information that describes the boundary between two classes. However, it does not consider the time dimension, which is also essential for acoustic signals. In this situation the spectrum is used as a spectrogram, which is simply a visual representation of the spectrum of an acoustic signal as it varies with time.
A spectrogram has the horizontal axis indicating time and the vertical axis indicating frequency, with the amplitude indicated by the colour of the dots in the figure. A spectrogram shows how the spectrum of a sound wave changes over time. For a digital audio signal, it is normally computed using the Short-Time Fourier Transform. The digital time-domain samples are divided into overlapping frames, a step known as windowing. Popular window functions include the rectangular window, the Hamming window, the Hanning window, and so on.
Hamming window:  w(n) = 0.54 − 0.46·cos(2πn/W) for 0 ≤ n < W, and 0 otherwise.

Hanning window:  w(n) = 0.5 − 0.5·cos(2πn/W) for 0 ≤ n < W, and 0 otherwise.
A rectangular window is rarely used, because it causes discontinuities between frames when the spectrum is calculated. Following windowing, each frame is passed through an FFT to obtain the corresponding spectrum. Finally, each spectrum is treated as a column and the columns are concatenated along time. A diagram of the short-time transform:
Figure 3.7 Audio slicing representation of FFT
Gaussian window:  w[n] = exp( −(1/2)·((n − N/2) / (σN/2))² ),  0 ≤ n ≤ N,  σ ≤ 0.5.
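The Hamming, Hanning, and Gaussian windows used in this section can be sketched in NumPy (window lengths are arbitrary illustration values):

```python
import numpy as np

def hamming_win(W):
    """w(n) = 0.54 - 0.46*cos(2*pi*n/W), 0 <= n < W."""
    n = np.arange(W)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / W)

def hanning_win(W):
    """w(n) = 0.5 - 0.5*cos(2*pi*n/W), 0 <= n < W."""
    n = np.arange(W)
    return 0.5 - 0.5 * np.cos(2 * np.pi * n / W)

def gaussian_win(N, sigma=0.4):
    """w[n] = exp(-0.5*((n - N/2)/(sigma*N/2))^2), 0 <= n <= N, sigma <= 0.5."""
    n = np.arange(N + 1)
    return np.exp(-0.5 * ((n - N / 2) / (sigma * N / 2)) ** 2)

W = 256
for w in (hamming_win(W), hanning_win(W), gaussian_win(W)):
    print(round(float(w.max()), 2))  # each window peaks at 1.0
```

Multiplying each frame by one of these tapers before the FFT is what removes the frame-boundary discontinuities mentioned above.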
Figure: Project workflow (Pre-processing → Audio fingerprinting → Audio matching → Noise classification → Speaker classification) and the classification pipeline (Feature extraction → Define and train the network → Testing of the network)
Speaker recognition and noise classification both utilize two algorithms: the deep neural network and the K-nearest neighbour algorithm. While these algorithms were being coded in MATLAB to obtain results, the following constraints and standards were kept in mind: -
Constraints:
o Every time a new sound signal is fed as input to the software, the database has to be rebuilt from scratch.
o The sound source and the receiver should be within close range.
o The source should not be in motion.
o More than a single source can cause overlapping and thus fail to produce the required results.
Alternatives:
o Even with slight or greater distortion in the input audio signal, the code efficiently searches for a particular fingerprint in the database.
o Deep learning techniques have numerous advantages over other strategies, as they extract the features and train the model on their own. They are simpler and faster, and there is less chance of error.
Tradeoffs:
February 7th- Studying the basic concepts of signal and audio processing.
February 25th- Studying and researching deep learning techniques and their coding.
March 10th- Simulation of noise segregation using deep learning in MATLAB.
April 7th- Studying the comparison of DNN with other classification methods.
The major tasks and milestones achieved while completing this project -
● Creating a whole new algorithm for audio fingerprinting and matching, different from the ones discussed in previous papers.
● Creating an algorithm that is resistant to environmental disturbances.
● Training the proposed deep learning model so that it could precisely classify the types of noise, which was achieved by simulation in MATLAB.
● Training another model, the KNN classifier, which was used to identify the speaker from the audio after extracting features such as pitch and MFCCs.
CHAPTER 6
PROJECT DEMONSTRATION
After running the code and giving the input, followed by performing several other
simulations, the following results were obtained: -
Figure 6.1 shows the spectrogram of the unlabelled music given as input to the software.
The 3-second audio clip recorded and given as input to the software is then tested to see whether the applied algorithm recognises it and outputs its name.
Figure 6.3 displays the fingerprint of the audio as it is created and stored, and likewise how the fingerprints of all the audios in the database are stored.
Audio Matching
Figures 6.4 and 6.5 are two test cases showing the results produced by the software on running the code, and can be used to compare the two outcomes.
Test case 1 shows that the unlabelled audio recorded or provided as input to the software already exists in the database and hence was instantly recognised. The software therefore produces the output "the data is matched", along with the name under which the audio is saved in the database.
Test case 2, in contrast, shows that the unlabelled audio recorded by the software does not exist in the database; the system fails to recognise it, and the output "the data is not matched" is displayed.
6.2 Noise Classification
From figure 6.6, we see that the model is trained for a total of 30 epochs to obtain the best results. The duration of training decreases each time it is re-run: the first time the model was trained it took 2 min 43 s, which fell to 1 min 34 s when it was trained a second time.
Figure 6.7 Noise classification result
Figure 6.7 displays the results of noise classification; the model likewise produces precise results for the other types of noise. A total of 1000 signals were used in the experiment, divided into sets of 800 and 200, the first set being used to train the model and the rest to validate it.
Figure 6.9 Audio wave of the sample audio of a speaker uttering “two”
Figure 6.10 displays how the irrelevant parts of the speech signal are discarded in the pitch-frequency figure, and only the relevant part is retained.
Table 6.1 Confidence percentage of the KNN classifier
The confidence percentage of each speaker can be read from the confusion matrix, since it
relates the true class to the class predicted by the classifier: each cell counts how many
times the classifier predicted a given speaker against the actual speaker.
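The per-speaker confidence just described can be derived by normalising each true-class row of the confusion matrix and reading off the diagonal. A minimal sketch (Python here for illustration; the project works in MATLAB, and `per_speaker_confidence` is a hypothetical helper name):

```python
import numpy as np

def per_speaker_confidence(confusion):
    """Given confusion[i, j] = number of clips of true speaker i that
    were predicted as speaker j, return the diagonal of the
    row-normalised matrix: the fraction of each speaker's clips that
    the classifier labelled correctly."""
    confusion = np.asarray(confusion, dtype=float)
    row_totals = confusion.sum(axis=1, keepdims=True)
    return np.diag(confusion / row_totals)
```

Multiplying the returned fractions by 100 gives the confidence percentages tabulated in Table 6.1.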
Number of speakers    Accuracy
10                    92.93%
20                    89.5%
We have observed the results obtained after complete simulation of each part of the project in
the previous section. The algorithm for creating audio fingerprints and matching them with
the input sound clip is robust against noise and gives accurate results, detecting the song
within 2 seconds when it is present in the database; otherwise, the output “The data is not
matched” is displayed. The next phase of the project classifies a noise sample as pink,
brown or white noise. After training the deep neural network with 800 samples and testing
it with the remaining 200, the network produces accurate output for the input audio
provided by the user. The final phase of the project is speaker identification, for which we
use a KNN (k-nearest-neighbour) classifier operating on the extracted features. The AN4
dataset, which consists of a total of 1000 labelled male and female voices, is downloaded;
a random voice is then selected, and the classifier identifies the speaker, with the
confusion matrix displaying the confidence percentage for every speaker.
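The KNN decision rule used in the speaker-identification phase can be sketched as follows. This is an illustrative Python version of the generic k-nearest-neighbour vote, not the project’s MATLAB code, and it assumes each voice clip has already been reduced to a fixed-length feature vector (e.g. pitch-based features).

```python
import numpy as np

def knn_predict(features, train_X, train_y, k=5):
    """Classic k-nearest-neighbour vote: label a feature vector with
    the majority label among its k closest training vectors."""
    dists = np.linalg.norm(train_X - features, axis=1)  # distance to each training clip
    nearest = np.argsort(dists)[:k]                     # indices of the k closest
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]                    # majority vote
```

Running this over every validation clip and tallying predicted versus true speakers yields exactly the confusion matrix discussed above.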
CHAPTER 8
SUMMARY
In the aforementioned thesis, we learnt about the parameters and components of an audio
fingerprint generation and audio matching system of various random audio signals.
Experiments were conducted with adding effects to a data set on a larger scale that have led
to exciting conclusions that are presented below.
1. The audio fingerprinting algorithm that has been presented in this thesis is robust
against any additive unnecessary noise and various noise distortions.
2. By training a deep neural network, we were successful in classification of
different types of noise.
3. In speaker recognition, feature extraction is performed in such a way that they
can reliably distinguish speakers by themselves or in combination with other
feature sets.
4. The speakers were correctly identified, and the validation accuracy of ten
speakers turned out to be 92.93%.
5. As the process continued, when the number of speakers was expanded to
20, the accuracy dropped to 89.5%.
Future Scope
The goal in the future is to achieve the integration of speaker verification technology into a
hybrid and multi-level authentication method, where the results of different biometric
technologies such as fingerprint, iris, speaker, facial recognition etc., can be combined to
achieve better reliability in authentication.
The most important advantage of using audio biometrics is the ability to perform
authentication in situations where one does not have access to direct eye contact or direct
subject contact to authenticate.
CHAPTER 9
REFERENCES
[2] X. Chen and X. Kong, "An Efficient Music Search Method in Large Audio Database,"
2018 3rd International Conference on Mechanical, Control and Computer
Engineering (ICMCCE), 2018, pp. 484-487, doi: 10.1109/ICMCCE.2018.00108.
[3] H. Cho, S. Ko and H. Kim, "A Robust Audio Identification for Enhancing Audio-Based
Indoor Localization," 2016 IEEE International Conference on Multimedia & Expo
Workshops (ICMEW), 2016, pp. 1-6, doi: 10.1109/ICMEW.2016.7574701.
[4] T. Tsai, T. Prätzlich and M. Müller, "Known-Artist Live Song Identification Using Audio
Hashprints," in IEEE Transactions on Multimedia, vol. 19, no. 7, pp. 1569-1582, July
2017, doi: 10.1109/TMM.2017.2669864.
[5] P. Seetharaman and Z. Rafii, "Cover song identification with 2D Fourier Transform
sequences," 2017 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2017, pp. 616-620, doi: 10.1109/ICASSP.2017.7952229.
[7] S. Yao, B. Niu and J. Liu, "Audio Identification by Sampling Sub-fingerprints and
Counting Matches," in IEEE Transactions on Multimedia, vol. 19, no. 9, pp. 1984-
1995, Sept. 2017, doi: 10.1109/TMM.2017.2723846.
[8] W. Xiong, X. Yu and J. Shi, "An improved audio fingerprinting algorithm with robust and
efficient," IET International Conference on Smart and Sustainable City 2013 (ICSSC
2013), 2013, pp. 377-380, doi: 10.1049/cp.2013.1960.
[10] F. Saki and N. Kehtarnavaz, "Background noise classification using random forest tree
classifier for cochlear implant applications," 2014 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 3591-3595, doi:
10.1109/ICASSP.2014.6854270.
[11] B. D. Barkana and I. Saricicek, "Environmental Noise Source Classification Using
Neural Networks," 2010 Seventh International Conference on Information
Technology: New Generations, 2010, pp. 259-263, doi: 10.1109/ITNG.2010.118.
[12] Z. Alavi and B. Azimi, "Application of Environment Noise Classification towards Sound
Recognition for Cochlear Implant Users," 2019 6th International Conference on
Electrical and Electronics Engineering (ICEEE), 2019, pp. 144-148, doi:
10.1109/ICEEE2019.2019.00035.