MUSIC RECOMMENDATION BASED ON MOOD USING FACIAL RECOGNITION
A project dissertation submitted for the degree of
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING
Submitted by
AYMAAN GILLANI
MOHAMMAD ABDUL MANSOOR KHAN
MOHAMMED HASSAN
Department of Computer Science and Engineering
Muffakham Jah College of Engineering and Technology, Osmania University
2021
Date: 01/05/2021
CERTIFICATE
This is to certify that the project dissertation titled “MUSIC RECOMMENDATION
BASED ON MOOD USING FACIAL RECOGNITION” being submitted by
AYMAAN GILLANI, MOHAMMAD ABDUL MANSOOR KHAN and MOHAMMED HASSAN
is a record of bonafide work carried out by them.
Signatures:
External Examiner
DECLARATION
This is to certify that work reported in the major project entitled “MUSIC
RECOMMENDATION BASED ON MOOD USING FACIAL
RECOGNITION” is a record of the bonafide work done by us in the Department
of Computer Science and Engineering, Muffakham Jah College of Engineering
and Technology, Osmania University. The results embodied in this report are
based on the project work done entirely by us and not copied from any other
source.
ACKNOWLEDGEMENT
Our hearts are filled with gratitude to the Almighty for empowering us with
courage, wisdom and strength to complete this project successfully. We give him
all the glory, honor and praise.
We thank our Parents for having sacrificed a lot in their lives to impart the
best education to us and make us promising professionals for tomorrow.
We would like to express our sincere gratitude and indebtness to our project
guide Mr. Syed Md Akbar Hashmi, Assistant Professor CSED, for his valuable
suggestions and guidance throughout the course of this project.
AYMAAN GILLANI
MOHAMMAD ABDUL MANSOOR KHAN
MOHAMMED HASSAN
ABSTRACT
Human emotions play a vital role in daily life. An emotion arises from human
feelings, which may or may not be outwardly expressed, and it shapes an
individual's behaviour in different ways. The objective of this project is to
extract features from the human face, detect the emotion they convey, and play
music according to the emotion detected.
The required input is captured directly from the human face using a camera. From
this input the system deduces the mood of the individual, and that information is
used to obtain a list of songs that match the detected mood. This eliminates the
time-consuming and tedious task of manually segregating or grouping songs into
different lists and helps in generating an appropriate playlist based on an
individual's emotional state.
Thus, our proposed system focuses on detecting human emotions for an
emotion-based music player. A brief overview of the system's working, playlist
generation and emotion classification is given.
CONTENTS
CERTIFICATE
DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
1.1 Problem Statement
1.2 Objectives
2. LITERATURE SURVEY
2.1 Related Work
3. PROPOSED SYSTEM
3.1 System Requirements
3.2 System Architecture
3.3 Methodology
3.3.1 Face Capturing
3.3.2 Face Detection
3.3.3 Emotion Classification
3.3.4 Music Recommendation
3.4 Calculating Eigen Faces
3.5 Flow Chart Overview
4. ALGORITHMS
4.1 Haar Cascade Algorithm
4.1.1 Definition
4.1.2 Functionality
4.2 DeepFace Algorithm
4.2.1 Definition
4.2.2 Functionality
4.3 Convolutional Neural Network (CNN)
4.3.1 Definition
4.3.2 Architecture
4.3.3 Convolutional Layer
4.3.4 Fully Connected Layer
4.3.5 Weights
4.3.6 History
4.3.6.1 Neocognitron, Origin of CNN Architecture
4.3.7 Time Delay Neural Network (TDNN)
4.3.7.1 Max Pooling
4.3.8 Shift-invariant Neural Network
4.3.9 GPU Implementations
4.3.10 Intel Xeon Phi Implementations
4.4 CNN Layers arranged in 3-Dimensions
4.5 CNN Architecture
4.5.1 ReLU Layer
4.5.2 Fully Connected Layer
4.5.3 Loss Layer
4.5.4 Hyper Parameters
4.5.5 Number of Filters
4.6 Applications of CNN
4.6.1 Image Recognition
4.6.2 Video Analysis
4.6.3 Natural Language Processing
4.6.4 Drug Discovery
4.6.5 Go
4.6.6 Cultural Heritage and 3D-Datasets
4.6.7 Fine Tuning
4.7 Related Networks
4.7.1 Deep Belief Networks
4.7.2 Deep Q-Networks
5. MODULES
5.1 Module – 1
5.2 Module – 2
5.3 Module – 3
5.4 Module – 4
LIST OF FIGURES
1.2 Different Types of Human Emotions
3.2 System Architecture
3.5 Flowchart of System Architecture
4.1.2 Types of Haar Features
4.2.2 Working of DeepFace
4.3.2 Architecture of CNN
4.5 Layers in CNN
4.5.1 ReLU Activation Function
4.5.2 Fully Connected Layer
4.5.3 Loss Layer in CNN
4.5.5 Working of Filter in CNN
4.6.7 Fine Tuning
4.7.1 Architecture of DBN
4.7.2 Architecture of DQN
5.1 Module – 1
5.2 Module – 2
5.3 Module – 3
5.4 Module – 4
6.1 Output of 1st Module
6.2 Output of 2nd Module
6.3 Output of 3rd Module
6.4 Output of 4th Module
LIST OF TABLES
3.1 System Requirements
1. INTRODUCTION
In this system, music is recommended to the user by detecting the user's
emotions in real time. Existing approaches use collaborative filtering, which
relies on previous user data to recommend music and requires a lot of manual
work. We therefore propose a system that arranges music into categories such as
happy, sad or angry. The emotion-based music player uses Chrome as its front end
and detects the user's emotion from the face with a machine learning algorithm
written in Python. Based on the detected mood, a song list is displayed and
recommended to the user.
In this application, an image of the person is captured in real time using a
camera available to the local machine. The captured image is compared with the
data sets already saved on the local device, and through this processing the
present mood of the user is expressed in numerical form; music is then played
based on this value. In addition, the player has some common features: a queue
mode, so the user can maintain an individual playlist, and a random mode, which
uses the Python Pandas library to pick a song without any particular order.
Libraries such as OpenCV, Pandas, Matplotlib and NumPy are used. The system is
proposed mainly because music plays a vital role in recent times, in particular
in reducing stress.
To detect the emotion, we use the face as the main source of data, because facial
expression normally reflects emotion; according to the detected mood, we play
music that can improve the user's mood.
1.1 Problem Statement
Music listeners have a tough time creating and segregating playlists manually
when they have hundreds of songs. It is also difficult to keep track of all the
songs. The sequence of songs in a playlist might not be the same every time, and
songs that a user wants to listen to frequently might not be given priority or
might be left out of the list. Currently there are no systems that recommend
music instantly without manual work, and facial recognition has never been used
as a criterion for suggesting music to users.
1.2 Objectives
2. LITERATURE SURVEY
The emotion of a user is extracted by capturing an image of the user through a
webcam. The captured image is processed by dimensionality reduction, in which the
primary data is reduced to a smaller number of classes that can be sorted and
organized. The data is converted into a binary image format and the face is
detected using the Haar cascade method.
The initial, or primary, data taken from the human face is reduced to a number of
classes, which are sorted and organized using the methods above. Emotion is
detected by extracting features from the human face. The main aim of the feature
extraction module is to diminish the number of resources required to describe a
large set of data. The features in an image consist of three parts.
1. Boundaries/edges
2. Corners/projection points
3. Field points
In the first phase, face detection is performed using an RGB colour model,
lighting compensation to localize the face, and morphological operations to
retain the required facial regions, i.e., the eyes and mouth. This system also
uses the Active Appearance Model (AAM) method for facial feature extraction:
points on the face such as the eyes, eyebrows and mouth are located, and a data
file is created that records the detected model points. When a face with an
expression is given as input, the AAM model adapts according to the expression.
Human-computer interaction using emotion recognition from facial expression:
F. Abdat, C. Maaoui and A. Pruski proposed a fully automatic facial expression
recognition system based on three steps: face detection, facial characteristic
extraction and facial expression classification. The system uses an
anthropometric model to detect facial feature points, combined with the Shi and
Tomasi method. In this method, the variation of 21 distances that describe the
facial features relative to the neutral face is measured, and classification is
based on a Support Vector Machine (SVM).
3. PROPOSED SYSTEM
Humans tend to show their emotions unknowingly, and these emotions are mainly
reflected in the face. The proposed system provides an interaction between the
user and the music system: it focuses on recommending music the user prefers
based on emotional awareness. In the initial stage of the proposed system, three
modes are offered, each with its own functionality. The system is given a list of
songs, and emotions are detected through facial recognition. Once the application
starts, it captures images with a webcam or another imaging device. Our main aim
is to build a sophisticated music player that can improve the user's mood, and
music is one of the best aids for changing mood.
The images captured by the system are compared with the data sets. Only four
emotions are considered, because humans have many emotions that differ from
person to person and are hard to predict; therefore four common and easily
identifiable moods are used. Alongside the main emotion-based mode there are two
alternative modes: a random mode, which picks songs at random and can also help
brighten the mood, and a queue mode, in which the user builds a playlist
manually. In all modes the system uses only the individual user's data rather
than previous users' data.
3.2 System Architecture
The face is captured using the webcam and processed with the Haar Cascade and
DeepFace algorithms. The captured images are stored and analysed, using the Haar
cascade XML file, to detect the face and its emotion. The results are then passed
to the Python libraries, which recommend music accordingly.
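As a rough end-to-end illustration of this pipeline (a minimal sketch only; the
capture, the cascade file and the handling of DeepFace's return value follow
standard OpenCV/deepface usage, and the song selection shown here is a
placeholder rather than the project's actual mapping):

    import cv2
    import pandas as pd
    from deepface import DeepFace

    # 1. Capture one frame from the webcam.
    cam = cv2.VideoCapture(0)
    ret, frame = cam.read()
    cam.release()

    # 2. Detect the face with the Haar cascade bundled with OpenCV.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)

    # 3. Classify the emotion with DeepFace.
    result = DeepFace.analyze(frame, actions=["emotion"])
    if isinstance(result, list):      # newer deepface versions return a list
        result = result[0]
    emotion = result["dominant_emotion"]

    # 4. Recommend a song for that emotion (placeholder selection).
    songs = pd.read_csv("Top 2018.csv")
    print(emotion, songs.sample(1)[["name", "artist"]].to_string(index=False))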
3.3 Methodology
Compared to the algorithms used in previous systems, the proposed algorithm is
proficient enough to handle large pose variations, which tend to disrupt the
efficiency of pre-existing algorithms. To reduce this effect, a standard image
input format is used. A few systems detect faces first and then locate them;
other algorithms, more rarely, detect and locate faces at the same time. Most
face detection algorithms share common steps: first achieving an acceptable
response time, then reducing the data dimension. For dimension reduction, some
algorithms extract facial measurements while others act on specific relevant
facial regions. An advantage of the proposed approach is that using a static
image largely offsets the problem of pose variation. The three most common
problems faced are the presence of unidentified elements such as glasses or a
beard, the quality of the static images, and unidentifiable facial gestures.
The emotion is concluded by the model: the value evaluated during processing
helps deduce the mood of the user. Each emotion is compared against tens of
stored images in the data sets, and the resulting score gives the exact emotion,
so that music can be played based on the recommendation made by the system using
the steps and methods described below. Unlike other existing software, the system
does not depend on other personal details. A linear classification step is used
in the face detection process; it is simpler than an SVM and decreases the
computational time taken by classification while giving better detection.
When a face is detected successfully, a box is overlaid on the image to extract
the face for further analysis. In the next step, the previously extracted images
are processed: the code extracts the spatial positions of facial features from
the face image, based on the pixel intensity values indexed at each point, and it
uses a boosting algorithm. It compares the input data with the stored data so
that it can predict the class containing the emotion, i.e., whether the face
shows one of the four emotions: angry, sad, neutral or happy.
The input images are acquired from the web camera, which captures real-time
images. Only four main emotions are considered, because it is very hard to define
all emotions; using a limited set of options reduces the computation time and
makes the outcome more reliable. The detected values are compared against
thresholds defined in the code, and the result is passed on to the music-playing
service. A song is then played according to the detected emotion: emotions are
numbered and assigned to every song, so when an emotion is detected the
corresponding song is selected. Many kinds of models could be used for
recommendation; here the Haar Cascade and DeepFace algorithms are used because
they give better accuracy than the alternatives. For the sound mechanism we use
winsound, a commonly used Python library for basic sound playing in which a
frequency and duration are specified. Besides the emotion-based mode, there are
two alternative options: a queue mode and a random mode. In queue mode, the user
can build a playlist as in other usual music software, while random mode picks
songs in no particular order, which is itself a simple form of therapy that can
brighten the mood. When a song is played on the basis of the user's emotion, the
emotion is also represented as one of four emoticons; each emotion is assigned a
number that maps both to the music and to the emoticon displayed.
3.4 Calculating Eigen Faces
This section outlines how the Eigenfaces face recognizer works.
Face feature extraction: images are represented as weighted eigenvectors, which
are combined and known as "Eigenfaces". One of the advantages of Eigenfaces is
that the similarity between pixels across images is captured by means of their
covariance matrix.
The mean face, the covariance matrix and the projection of a face image x onto
the eigenface space are computed as

\mu = \frac{1}{n} \sum_{i=1}^{n} x_i

S = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^T

y = W^T (x - \mu)

where x_1, ..., x_n are the n training face images (flattened into vectors) and
the columns of W are the leading eigenvectors of S, i.e., the Eigenfaces.
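A minimal NumPy sketch of these steps (the training matrix X and the number of
kept Eigenfaces are illustrative placeholders, not values used by the project):

    import numpy as np

    # X: n flattened face images, one per row (n x d); random data for illustration
    X = np.random.rand(20, 32 * 32)

    mu = X.mean(axis=0)                      # mean face
    A = X - mu                               # centred data
    S = (A.T @ A) / X.shape[0]               # covariance matrix S (d x d)

    # The leading eigenvectors of S are the Eigenfaces; keep the top 10.
    eigvals, eigvecs = np.linalg.eigh(S)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:10]]

    # Project a face x onto the eigenface space: y = W^T (x - mu)
    x = X[0]
    y = W.T @ (x - mu)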
3.5 Flow Diagram Overview
4. ALGORITHMS
Two different algorithms have been used to develop the system, each serving a
different purpose according to the system's needs.
4.1 Haar Cascade Algorithm
4.1.2 Functionality
Initially, the algorithm needs a lot of positive images (images of faces) and
negative images (images without faces) to train the classifier. Then we need to
extract features from it. For this, the Haar features shown in the image below
are used.
They are just like our convolutional kernel. Each feature is a single value obtained
by subtracting the sum of pixels under the white rectangle from the sum of pixels
under the black rectangle.
Now all possible sizes and locations of each kernel are used to calculate plenty
of features. Each feature calculation requires finding the sum of the pixels
under the white and black rectangles. To solve this, the integral image was
introduced: it reduces the calculation of the sum of the pixels, however large
the region may be, to an operation involving just four pixels.
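A small NumPy illustration of the integral-image idea (a sketch only; OpenCV's
own cv2.integral performs the same computation):

    import numpy as np

    img = np.random.randint(0, 256, size=(6, 8)).astype(np.int64)

    # Integral image: ii[y, x] = sum of img[:y, :x], with a zero first row/column.
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)

    def rect_sum(y0, x0, y1, x1):
        """Sum of img[y0:y1, x0:x1] using only four look-ups into ii."""
        return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

    assert rect_sum(1, 2, 4, 6) == img[1:4, 2:6].sum()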
But among all these features we calculated, most of them are irrelevant. For
example, consider the image below. Top row shows two good features. The first
feature selected seems to focus on the property that the region of the eyes is often
darker than the region of the nose and cheeks. The second feature selected relies
on the property that the eyes are darker than the bridge of the nose.
We apply each and every feature to all the training images. For each feature, it
finds the best threshold which will classify the faces as positive or negative.
Obviously, there will be errors or misclassifications. We select the features
with the minimum error rate, which means they are the features that best classify
the face and non-face images. (The process is not as simple as this. Each image
is given an equal weight in the beginning. After each classification, the weights
of misclassified images are increased, and the same process is repeated: new
error rates and new weights are calculated. The process continues until the
required accuracy or error rate is achieved, or the required number of features
is found.)
OpenCV already contains many pre-trained classifiers for face, eyes, smile etc.
Those XML files are stored in opencv/data/haarcascades/ folder.
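As a brief illustration of loading one of these bundled classifiers (a sketch;
the image file name is the one produced later by Module 2, used here only as an
example):

    import cv2

    # Load the pre-trained frontal-face classifier shipped with OpenCV.
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    img = cv2.imread("opencv_frame_0.png")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # detectMultiScale returns one (x, y, w, h) rectangle per detected face.
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)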
4.2.1 Definition
4.2.2 Functionality
DeepFace first corrects the angles of an image so that the face in the photo is
looking forward. To accomplish this, it uses a 3-D model of a face. Then deep
learning produces a numerical description of the face. If DeepFace comes up with
a similar enough description for two images, it assumes that the two images share
a face.
2D Alignment
The DeepFace process begins by detecting 6 fiducial points on the detected face
— the center of the eyes, tip of the nose and mouth location. These points are
translated onto a warped image to help detect the face. However, 2D
transformation fails to compensate for rotations that are out of plane.
3D Alignment
In order to align faces, DeepFace uses a generic 3D model wherein 2D images are
cropped as 3D versions. The 3D image has 67 fiducial points. After the image has
been warped, there are 67 anchor points manually placed on the image to match
the 67 fiducial points. A 3D-to-2D camera is then fitted that minimizes losses.
Because 3D detected points on the contour of the face can be inaccurate, this step
is important.
Frontalization
Because full perspective projections are not modeled, the fitted camera is only an
approximation of the individual’s actual face. To reduce errors, DeepFace aims to
warp the 2D images with smaller distortions. Also, the camera P is capable of
replacing parts of the image and blending them with their symmetrical
counterparts.
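The deepface Python package used in this project exposes this pipeline through
simple calls; a rough sketch (the image file names are placeholders, and the
exact return format varies between versions of the library):

    from deepface import DeepFace

    # Verification: do two photos show the same person?
    same = DeepFace.verify("face_a.png", "face_b.png")
    print(same["verified"])

    # Analysis: estimate the dominant emotion of a single photo.
    result = DeepFace.analyze("opencv_frame_0.png", actions=["emotion"])
    if isinstance(result, list):          # newer versions return a list of dicts
        result = result[0]
    print(result["dominant_emotion"])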
4.3.1 Definition
CNNs are regularized versions of multilayer perceptrons. Multilayer perceptrons
usually mean fully connected networks, that is, each neuron in one layer is
connected to all neurons in the next layer. The "full connectivity" of these
networks makes them prone to overfitting data. Typical ways of regularization,
or preventing overfitting, include penalizing parameters during training (such as
weight decay) or trimming connectivity (skipped connections, dropout, etc.).
CNNs take a different approach towards regularization: they take advantage of
the hierarchical pattern in data and assemble patterns of increasing complexity
using smaller and simpler patterns embossed in their filters. Therefore, on a scale
of connectivity and complexity, CNNs are on the lower extreme.
Convolutional networks were inspired by biological processes in that the
connectivity pattern between neurons resembles the organization of the animal
visual cortex. Individual cortical neurons respond to stimuli only in a restricted
region of the visual field known as the receptive field. The receptive fields of
different neurons partially overlap such that they cover the entire visual field.
4.3.2 Architecture
The name “convolutional neural network” indicates that the network employs a
mathematical operation called convolution. Convolutional networks are a
specialized type of neural networks that use convolution in place of general
matrix multiplication in at least one of their layers.
A convolutional neural network consists of an input layer, hidden layers and an
output layer. In any feed-forward neural network, any middle layers are called
hidden because their inputs and outputs are masked by the activation function and
final convolution. In a convolutional neural network, the hidden layers include
layers that perform convolutions. Typically, this includes a layer that performs a
dot product of the convolution kernel with the layer's input matrix. This product
is usually the Frobenius inner product, and its activation function is commonly
ReLU. As the convolution kernel slides along the input matrix for the layer, the
convolution operation generates a feature map, which in turn contributes to the
input of the next layer. This is followed by other layers such as pooling layers,
fully connected layers, and normalization layers.
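As a concrete, purely illustrative sketch of such a stack, here is a tiny Keras
model with convolution, ReLU activation, pooling and fully connected layers; the
layer sizes and the 7-class output are assumptions for illustration, not the
architecture used by DeepFace:

    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(48, 48, 1)),               # greyscale face crop
        layers.Conv2D(32, (3, 3), activation="relu"),  # convolution + ReLU
        layers.MaxPooling2D((2, 2)),                   # pooling layer
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),          # fully connected layer
        layers.Dense(7, activation="softmax"),         # e.g. 7 emotion classes
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()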
• Convolutional filters/kernels defined by a width and height (hyperparameters).
Local and global pooling layers are commonly used. Global pooling acts on all the
neurons of the feature map. There are two common types of pooling in popular use:
max and average. Max pooling uses the maximum value of each local cluster of
neurons in the feature map, while average pooling takes the average value.
Receptive field
In neural networks, each neuron receives input from some number of locations in
the previous layer. In a convolutional layer, each neuron receives input from only
a restricted area of the previous layer called the neuron's receptive field.
Typically, the area is a square (e.g., 5 by 5 neurons). Whereas, in a fully
connected layer, the receptive field is the entire previous layer. Thus, in each
convolutional layer, each neuron takes input from a larger area in the input than
previous layers. This is due to applying the convolution over and over, which
takes into account the value of a pixel, as well as its surrounding pixels. When
using dilated layers, the number of pixels in the receptive field remains constant,
but the field is more sparsely populated as its dimensions grow when combining
the effect of several layers.
4.3.5 Weights
Each neuron computes an output value by applying a specific function, determined
by a vector of weights and a bias (typically real numbers), to the input values
received from its receptive field in the previous layer. Learning consists of
iteratively adjusting these biases and weights.
The vector of weights and the bias are called filters and represent particular
features of the input (e.g., a particular shape). A distinguishing feature of CNNs
is that many neurons can share the same filter. This reduces the memory
footprint because a single bias and a single vector of weights are used across all
receptive fields that share that filter, as opposed to each receptive field having
its own bias and vector weighting.
4.3.6 History
CNN design follows vision processing in living organisms.
Receptive fields in the visual cortex
Work by Hubel and Wiesel in the 1950s and 1960s showed that cat and monkey
visual cortexes contain neurons that individually respond to small regions of the
visual field. Provided the eyes are not moving, the region of visual space within
which visual stimuli affect the firing of a single neuron is known as its receptive
field. Neighboring cells have similar and overlapping receptive fields.
Receptive field size and location varies systematically across the cortex to form
a complete map of visual space. The cortex in each hemisphere represents the
contralateral visual field.
Their 1968 paper identified two basic visual cell types in the brain: simple
cells, whose output is maximized by straight edges having particular orientations
within their receptive field, and complex cells, which have larger receptive
fields and whose output is insensitive to the exact position of the edges in the
field.
4.3.6.1 Neocognitron, Origin of CNN Architecture
4.3.7 Time Delay Neural Network (TDNN)
The time delay neural network (TDNN) was introduced in 1987 by Alex Waibel
et al. and was the first convolutional network, as it achieved shift invariance.
TDNNs are convolutional networks that share weights along the temporal
dimension. They allow speech signals to be processed time-invariantly. In 1990
Hampshire and Waibel introduced a variant which performs a two-dimensional
convolution. Since these TDNNs operated on spectrograms, the resulting
phoneme recognition system was invariant to both shifts in time and in
frequency. This inspired translation invariance in image processing with CNNs.
The tiling of neuron outputs can cover timed stages.
TDNNs now achieve the best performance in far distance speech recognition.
Yann LeCun et al. (1989) used back-propagation to learn the convolution kernel
coefficients directly from images of hand-written numbers. Learning was thus
fully automatic, performed better than manual coefficient design, and was suited
to a broader range of image recognition problems and image types.
Although CNNs were invented in the 1980s, their breakthrough in the 2000s
required fast implementations on graphics processing units (GPUs).
In 2004, it was shown by K. S. Oh and K. Jung that standard neural networks
can be greatly accelerated on GPUs. Their implementation was 20 times faster
than an equivalent implementation on CPU. In 2005, another paper also
emphasized the value of GPGPU for machine learning.
The first GPU-implementation of a CNN was described in 2006 by K.
Chellapilla et al. Their implementation was 4 times faster than an equivalent
implementation on CPU. Subsequent work also used GPUs, initially for other
types of neural networks (different from CNNs), especially unsupervised
neural networks.
In 2010, Dan Ciresan et al. at IDSIA showed that even deep standard neural
networks with many layers can be quickly trained on GPU by supervised learning
through the old method known as backpropagation. Their network outperformed
previous machine learning methods on the MNIST handwritten digits benchmark.
In 2011, they extended this GPU approach to CNNs, achieving an acceleration
factor of 60, with impressive results. In 2011, they used such CNNs on GPU to
win an image recognition contest where they achieved superhuman performance
for the first time. Between May 15, 2011 and September 30, 2012, their CNNs
won no less than four image competitions. In 2012, they also significantly
improved on the best performance in the literature for multiple image databases,
including the MNIST database, the NORB database, the HWDB1.0 dataset
(Chinese characters) and the CIFAR10 dataset (dataset of 60000 32x32 labeled
RGB images).
Compared to the training of CNNs using GPUs, not much attention was given to
the Intel Xeon Phi coprocessor. A notable development is a parallelization
method for training convolutional neural networks on the Intel Xeon Phi, named
Controlled Hogwild with Arbitrary Order of Synchronization (CHAOS).
CHAOS exploits both the thread- and SIMD-level parallelism that is available
on the Intel Xeon Phi.
Distinguishing features
In the past, traditional multilayer perceptron (MLP) models were used for image
recognition. However, the full connectivity between nodes caused the curse of
dimensionality, and was computationally intractable with higher resolution
images. A 1000×1000-pixel image with RGB color channels has 3 million
weights, which is too high to feasibly process efficiently at scale with full
connectivity.
• Shared weights: In CNNs, each filter is replicated across the entire visual
field. These replicated units share the same parameterization (weight
vector and bias) and form a feature map. This means that all the neurons in
a given convolutional layer respond to the same feature within their
specific response field. Replicating units in this way allows for the
resulting activation map to be equivariant under shifts of the locations of
input features in the visual field, i.e., they grant translational equivariance
- given that the layer has a stride of one.
• Pooling: In a CNN's pooling layers, feature maps are divided into
rectangular sub-regions, and the features in each rectangle are
independently down-sampled to a single value, commonly by taking their
average or maximum value. In addition to reducing the sizes of feature
maps, the pooling operation grants a degree of local translational
invariance to the features contained therein, allowing the CNN to be more
robust to variations in their positions.
Together, these properties allow CNNs to achieve better generalization on
vision problems. Weight sharing dramatically reduces the number of free
parameters learned, thus lowering the memory requirements for running the
network and allowing the training of larger, more powerful networks.
The extent of this connectivity is a hyperparameter called the receptive field of
the neuron. The connections are local in space (along width and height), but
always extend along the entire depth of the input volume. Such an architecture
ensures that the learnt filters produce the strongest response to a spatially local
input pattern.
Spatial arrangement
Three hyperparameters control the size of the output volume of the convolutional
layer: the depth, the stride and the padding size.
• The depth of the output volume controls the number of neurons in a layer
that connect to the same region of the input volume. These neurons learn to
activate for different features in the input. For example, if the first
convolutional layer takes the raw image as input, then different neurons along
the depth dimension may activate in the presence of various oriented edges,
or blobs of color.
• Stride controls how depth columns around the width and height are
allocated. If the stride is 1, then we move the filters one pixel at a time. This
leads to heavily overlapping receptive fields between the columns, and to
large output volumes. For any integer S > 0, a stride of S means that the filter
is translated S units at a time per output. In practice, strides of 3 or more are
rare. A greater stride means smaller overlap of receptive fields and smaller
spatial dimensions of the output volume.
• Sometimes, it is convenient to pad the input with zeros (or other values,
such as the average of the region) on the border of the input volume. The
size of this padding is a third hyperparameter. Padding provides control of
the output volume's spatial size. In particular, sometimes it is desirable to
exactly preserve the spatial size of the input volume, this is commonly
referred to as "same" padding.
The spatial size of the output volume is a function of the input volume size W,
the kernel field size K of the convolutional layer neurons, the stride S, and the
amount of zero padding P on the border. The number of neurons that "fit" in a
given volume is then (W - K + 2P)/S + 1. If this number is not an integer, then
the stride is set incorrectly and the neurons cannot be tiled to fit across the
input volume in a symmetric way. In general, setting the zero padding to
P = (K - 1)/2 when the stride is S = 1 ensures that the input volume and output
volume have the same spatial size. However, it is not always completely necessary
to use all of the neurons of the previous layer. For example, a neural network
designer may decide to use just a portion of the padding.
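For example (an illustrative calculation with assumed values, not a setting taken
from this project): a 48 x 48 input processed with a 5 x 5 kernel, stride S = 1
and zero padding P = 2 yields

(48 - 5 + 2 \cdot 2)/1 + 1 = 48

neurons along each spatial dimension, so the spatial size is preserved, as
expected from P = (K - 1)/2 with S = 1.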
Parameter sharing
A parameter sharing scheme is used in convolutional layers to control the number
of free parameters. It relies on the assumption that if a patch feature is useful to
compute at some spatial position, then it should also be useful to compute at other
positions.
Denoting a single 2-dimensional slice of depth as a depth slice, the neurons
in each depth slice are constrained to use the same weights and bias.
Since all neurons in a single depth slice share the same parameters, the forward
pass in each depth slice of the convolutional layer can be computed as a
convolution of the neuron's weights with the input volume. Therefore, it is
common to refer to the sets of weights as a filter (or a kernel), which is
convolved with the input. The result of this convolution is an activation map, and
the set of activation maps for each different filter are stacked together along the
depth dimension to produce the output volume.
Parameter sharing contributes to the translation invariance of the CNN
architecture. Sometimes, the parameter sharing assumption may not make sense.
This is especially the case when the input images to a CNN have some specific
centered structure; for which we expect completely different features to be
learned on different spatial locations. One practical example is when the inputs
are faces that have been centered in the image: we might expect different
eye-specific or hair-specific features to be learned in different parts of the
image. In
that case it is common to relax the parameter sharing scheme, and instead simply
call the layer a "locally connected layer".
Pooling layer
Max pooling with a 2x2 filter and stride = 2
Another important concept of CNNs is pooling, which is a form of non-linear
down-sampling. There are several non-linear functions to implement pooling,
where max pooling is the most common. It partitions the input image into a set of
rectangles and, for each such sub-region, outputs the maximum.
Intuitively, the exact location of a feature is less important than its rough location
relative to other features. This is the idea behind the use of pooling in
convolutional neural networks. The pooling layer serves to progressively reduce
the spatial size of the representation, to reduce the number of parameters,
memory footprint and amount of computation in the network, and hence to also
control overfitting. This is known as down-sampling. It is common to periodically
insert a pooling layer between successive convolutional layers (each one
typically followed by an activation function, such as a ReLU layer) in a CNN
architecture. While pooling layers contribute to local translation invariance, they
do not provide global translation invariance in a CNN, unless a form of global
pooling is used. The pooling layer commonly operates independently on every
depth, or slice, of the input and resizes it spatially. A very common form of max
pooling is a layer with filters of size 2×2, applied with a stride of 2, which
subsamples every depth slice in the input by 2 along both width and height,
discarding 75% of the activations.
In this case, every max operation is over 4 numbers. The depth dimension
remains unchanged (this is true for other forms of pooling as well).
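A tiny NumPy illustration of 2x2 max pooling with stride 2 (a sketch only):

    import numpy as np

    x = np.arange(16, dtype=float).reshape(4, 4)   # one 4x4 depth slice

    # Reshape into 2x2 blocks and take the maximum of each block.
    pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))

    print(x)
    print(pooled)    # 2x2 output; each value is the max of one 2x2 region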
In addition to max pooling, pooling units can use other functions, such as
average pooling or ℓ2-norm pooling. Average pooling was often used
historically but has recently fallen out of favor compared to max pooling, which
generally performs better in practice.
Due to the effects of fast spatial reduction of the size of the representation,
there is a recent trend towards using smaller filters or discarding pooling
layers altogether.
"Region of Interest" pooling (also known as RoI pooling) is a variant of max
pooling in which the output size is fixed and the input rectangle is a parameter;
for example, a 7x5 region proposal (an input parameter) can be pooled down to a
fixed 2x2 output. Pooling is an important component of convolutional neural
networks for object detection based on the Fast R-CNN architecture.
Other functions can also be used to increase nonlinearity, for example the
saturating hyperbolic tangent, and the sigmoid function. ReLU is often preferred
to other functions because it trains the neural network several times faster without
a significant penalty to generalization accuracy.
Figure 4.5.2: Fully Connected Layers
4.5.4 Hyper Parameters
CNNs use more hyperparameters than a standard multilayer perceptron (MLP).
While the usual rules for learning rates and regularization constants still apply,
the following should be kept in mind when optimizing.
leads to aliasing of the input signal, which breaks the equivariance (also referred to
as covariance) property. Furthermore, if a CNN makes use of fully connected
layers, translation equivariance does not imply translation invariance, as the fully
connected layers are not invariant to shifts of the input. One solution for complete
translation invariance is avoiding any down-sampling throughout the network and
applying global average pooling at the last layer. Additionally, several other partial
solutions have been proposed, such as anti-aliasing, spatial transformer networks,
data augmentation, subsampling combined with pooling, and capsule neural
networks.
CNNs are often used in image recognition systems. In 2012 an error rate of
0.23% on the MNIST database was reported. Another paper on using CNN for
image classification reported that the learning process was "surprisingly fast"; in
the same paper, the best published results as of 2011 were achieved on the MNIST
database and the NORB database. Subsequently, a similar CNN called AlexNet won
the ImageNet Large Scale Visual Recognition Challenge 2012.
When applied to facial recognition, CNNs achieved a large decrease in error
rate. Another paper reported a 97.6% recognition rate on "5,600 still images
of more than 10 subjects". CNNs were used to assess video quality in an objective
way after manual training; the resulting system had a very low root mean square
error. The ImageNet Large Scale Visual Recognition Challenge is a benchmark
in object classification and detection, with millions of images and hundreds of
object classes. In the ILSVRC 2014, a large-scale visual recognition
challenge, almost every highly ranked team used a CNN as its basic framework.
The winner GoogLeNet (the foundation of DeepDream) increased the mean
average precision of object detection to 0.439329, and reduced classification
error to 0.06656, the best result to date. Its network applied more than 30 layers.
The performance of convolutional neural networks on the ImageNet tests was
close to that of humans. The best algorithms still struggle with objects that are
small or thin, such as a small ant on a stem of a flower or a person holding a quill
in their hand. They also have trouble with images that have been distorted with
filters, an increasingly common phenomenon with modern digital cameras. By
contrast, those kinds of images rarely trouble humans. Humans, however, tend to
have trouble with other issues. For example, they are not good at classifying
objects into fine-grained categories such as the particular breed of dog or species
of bird, whereas convolutional neural networks handle this. In 2015 a
many-layered CNN demonstrated the ability to spot faces from a wide range of
angles,
including upside down, even when partially occluded, with competitive
performance. The network was trained on a database of 200,000 images that
included faces at various angles and orientations and a further 20 million images
without faces. They used batches of 128 images over 50,000 iterations.
4.6.2 Video Analysis
Compared to image data domains, there is relatively little work on applying
CNNs to video classification. Video is more complex than images since it has
another (temporal) dimension. However, some extensions of CNNs into the video
domain have been explored. One approach is to treat space and time as equivalent
dimensions of the input and perform convolutions in both time and space. Another
way is to fuse the features of two convolutional neural networks, one for the
spatial and one for the temporal stream. Long short-term memory (LSTM)
recurrent units are typically incorporated after the CNN to account for inter-frame
or inter-clip dependencies. Unsupervised learning schemes for training
spatio-temporal features have been introduced, based on Convolutional Gated
Restricted
Boltzmann Machines and Independent Subspace Analysis.
4.6.3 Natural Language Processing
CNNs have also been explored for natural language processing. CNN models are
effective for various NLP problems and achieved excellent results in semantic
parsing, search query retrieval, sentence modeling, classification, prediction and
other traditional NLP tasks.
Anomaly Detection
A CNN with 1-D convolutions was used on time series in the frequency domain
(spectral residual) by an unsupervised model to detect anomalies in the time
domain.
4.6.4 Drug Discovery
Just as image recognition networks learn to compose smaller, spatially proximate
features into larger, complex structures, AtomNet discovers chemical features
such as aromaticity, sp3 carbons and hydrogen bonding. Subsequently, AtomNet was
used to predict novel candidate biomolecules for multiple disease targets, most
notably treatments for the Ebola virus and multiple sclerosis.
4.6.5 Go
CNNs have been used in computer Go. In December 2014, Clark and Storkey
published a paper showing that a CNN trained by supervised learning from a
database of human professional games could outperform GNU Go and win some
games against Monte Carlo tree search Fuego 1.1 in a fraction of the time it took
Fuego to play. Later it was announced that a large 12-layer convolutional neural
network had correctly predicted the professional move in 55% of positions,
equaling the accuracy of a 6 dan human player. When the trained convolutional
network was used directly to play games of Go, without any search, it beat the
traditional search program GNU Go in 97% of games, and matched the
performance of the Monte Carlo tree search program Fuego simulating ten
thousand playouts (about a million positions) per move.
A couple of CNNs for choosing moves to try ("policy network") and evaluating
positions ("value network") driving MCTS were used by AlphaGo, the first to
beat the best human player at the time.
Time series forecasting
Recurrent neural networks are generally considered the best neural network
architectures for time series forecasting (and sequence modeling in general), but
recent studies show that convolutional networks can perform comparably or even
better. Dilated convolutions might enable one-dimensional convolutional neural
networks to effectively learn time series dependences. Convolutions can be
implemented more efficiently than RNN-based solutions, and they do not suffer
from vanishing (or exploding) gradients. Convolutional networks can provide an
improved forecasting performance when there are multiple similar time series to
learn from. CNNs can also be applied to further tasks in time series analysis (e.g.,
time series classification or quantile forecasting).
4.6.7 Fine Tuning
For many applications, only limited training data is available. Convolutional
neural networks usually require a large amount of training data in order to avoid
overfitting. A common technique is to train the network on a larger data set from
a related domain and then fine-tune it on the in-domain data.
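A hedged sketch of this idea in Keras (transfer learning from a generic
pre-trained backbone; the base model, input size and layer sizes are illustrative
assumptions, not the configuration used in this project):

    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import MobileNetV2

    # Start from a network pre-trained on a large related data set (ImageNet).
    base = MobileNetV2(weights="imagenet", include_top=False,
                       input_shape=(96, 96, 3), pooling="avg")
    base.trainable = False                      # freeze the pre-trained weights

    # Add a small task-specific head and train only that on the smaller data set.
    model = models.Sequential([
        base,
        layers.Dense(64, activation="relu"),
        layers.Dense(7, activation="softmax"),  # e.g. 7 emotion classes
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")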
Human interpretable explanations
End-to-end training and prediction are common practice in computer vision.
However, human interpretable explanations are required for critical systems such
as a self-driving car. With recent advances in visual salience, spatial and temporal
attention, the most critical spatial regions/temporal instants could be visualized to
justify the CNN predictions.
Figure 4.6.7: Fine Tuning
4.7 Related Networks
4.7.1 Deep Belief Networks
4.7.2 Deep Q-Networks
A deep Q-network (DQN) is a type of deep learning model that combines a deep
neural network with Q-learning, a form of reinforcement learning. Unlike earlier
reinforcement learning agents, DQNs that utilize CNNs can learn directly from
high-dimensional sensory inputs via reinforcement learning.
5. MODULES
MODULE-1
All the libraries and packages required for smooth execution are included and
defined in this module.
MODULE-2
Capturing of images and setting them in frames is done in this module.
MODULE-3
Analysis of the image captured in previous module and detecting the emotion as
happy, sad, angry etc...
MODULE-4
Recommending a song based on the mood detected in previous module.
5.1 Module – 1
All the libraries and packages required for smooth execution are included and
defined in this module.
The first module provides the framework and acts as the foundation for the
other three modules.
CV2 Library
1. The first library imported is cv2 library.
2. CV stands for “Computer Vision.”
3. It is an OpenCV tool, where all the image processing takes place.
4. It is used for performing various operations on an image, such as:
a. Capturing b. Deleting c. Importing d. RGB to Greyscale conversion
NumPy Library
1. The next library imported is NumPy.
2. It is an open-source Numerical Python library.
3. It is used for dealing with the audio files (songs), that are basically
stored as numbers in the computer system.
4. It contains multi-dimensional array and matrix data structure.
5. It performs various mathematical operations on arrays such as:
statistical, trigonometric and algebraic routines.
DeepFace Model
1. It is a pre-trained model which is used for analyzing the image and
identifying the emotion of the person from the image captured.
2. It is a facial recognition system used by Facebook for tagging images.
3. It is included in our system using the line
“from deepface import DeepFace”
4. It uses the concept of CNN for identifying the key features on detected
face and then deducing various conclusions such as the race, age, skin
color, emotions of the person.
5. It was proposed by researchers at Facebook AI Research (FAIR) at
2014 IEEE Computer Vision and Pattern Recognition Conference
(CVPR).
matplotlib.pyplot
1. It is a collection of functions that make matplotlib work like MATLAB.
2. Each pyplot function makes some change to a figure, such as creating a
plotting area or plotting a line in the plotting area.
3. We have used it for plotting the image inside a specified frame.
Face Cascade
1. It is used for detecting the frontal face in a frame.
2. On the image captured using CV2, the face of the user is detected by
applying the “Haar Cascade” algorithm.
3. A haar feature traverses the entire image pixel by pixel and detects the
face of the user in the frame.
Pandas
1. It is an open-source Python package.
2. It is used for data analysis and Machine Learning tasks.
3. We are using it for accessing the .csv (Excel) file that contains the list
of songs from which the user will get recommendations.
4. Pandas objects rely heavily on NumPy objects.
5. Essentially, Pandas extends NumPy.
“df=pd.read_csv(“Top 2018.csv”)”
1. It accesses the csv file named “Top 2018”, which contains the list of
top 100 songs released in the year 2018.
2. We can make use of any list of songs by mentioning the file’s name in
place of the above-mentioned file name.
3. The songs in the file are segregated based on different types of moods.
4. So, when the mood of user is identified, a song from the list is
recommended that matches the genre of the user’s mood.
“df.drop(df.columns.difference([‘name’,‘artist’,‘tempo’]),1,inplace=True)”
1. It is used to remove all the irrelevant columns from the file, in order to
reduce unnecessary noise in data.
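Putting the pieces of this module together, a plausible sketch of the import
block is shown below (the project's own Module 1 listing is not reproduced in
this report, so this is a reconstruction from the description above; the cascade
file path is simply the standard one bundled with OpenCV):

    import cv2                        # OpenCV: image capture and processing
    import numpy as np                # numerical arrays
    import pandas as pd               # reading and filtering the songs list
    import matplotlib.pyplot as plt   # displaying captured frames
    from deepface import DeepFace     # pre-trained emotion analysis model

    # Haar cascade classifier for frontal-face detection
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    # Songs list: keep only the columns needed for recommendation
    df = pd.read_csv("Top 2018.csv")
    df.drop(df.columns.difference(["name", "artist", "tempo"]),
            axis=1, inplace=True)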
5.2 Module – 2
Capturing of images and setting them in frames is done in this module.
“cam=cv2.VideoCapture(0)”
1. CV2 has already been imported.
2. We are using a variable ‘cam’ to capture an image using the
VideoCapture(0) function.
3. This will return video from the webcam on to the computer.
4. The resulting capture object is stored in ‘cam’.
“cv2.namedWindow(“test”)”
1. The window in which the image is captured is named as “test”.
2. img_counter=0 is to count the image(s).
While Loop
1. The while loop is used to resize the window frame and test whether the
frame is captured or not.
2. ret, frame = cam.read() reads the image and stores it in a variable called
frame.
“cv2.imshow(“test”, frame)”
1. It is the function used for displaying an image in a window.
“k=cv2.waitKey(1)”
1. It is a keyboard binding function.
2. Its argument is the time in milliseconds.
“img_name= “opencv_frame_{}.png”.format(img_counter)”
1. The format() method formats the specified value and inserts it inside the
string’s placeholder.
2. “{}” → placeholder.
“cv2.imwrite()”
1. This function is to save an image.
2. First argument is filename.
3. Second argument is the image we wish to save.
cam.release()→ It closes the capturing device.
5.3 Module – 3
Analysis of the image captured in previous module and detecting the emotion as
happy, sad, angry etc...
“predictions = DeepFace.analyze(img)”
1. predictions is the variable in which the result returned by the model is stored.
2. DeepFace is a pre-trained facial recognition model.
3. “analyze(img)” → Its function is to analyse the image captured and stored
in img.
4. It analyses the image and predicts the age, race, skin colour and emotions
of the user.
5. The output is displayed as “Dominant Emotion:”.
5.4 Module – 4
Recommending a song based on the mood detected in previous module.
6. PROGRAM & EXECUTION
The following is the block of code for Module – 2:
cam = cv2.VideoCapture(0)
cv2.namedWindow("test")
img_counter = 0
while True:
    ret, frame = cam.read()
    if not ret:
        print("failed to grab frame")
        break
    cv2.imshow("test", frame)
    k = cv2.waitKey(1)
    if k % 256 == 27:
        # ESC pressed
        print("Escape hit, closing...")
        break
    elif k % 256 == 32:
        # SPACE pressed
        img_name = "opencv_frame_{}.png".format(img_counter)
        cv2.imwrite(img_name, frame)
        print("{} written!".format(img_name))
        img_counter += 1
cam.release()
cv2.destroyAllWindows()
img = cv2.imread("opencv_frame_0.png")
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
OUTPUT
The following is the block of code for Module – 3:
predictions = DeepFace.analyze(img)
predictions
OUTPUT
The following is the block of code for Module – 4:
import random
if (emotion == "happy"):
    n = random.randint(0, 12)
    print(df.loc[[n]])
elif (emotion == "angry"):
    n = random.randint(13, 23)
    print(df.loc[[n]])
elif (emotion == "sad"):
    n = random.randint(24, 40)
    print(df.loc[[n]])
elif (emotion == "neutral"):
    n = random.randint(40, 54)
    print(df.loc[[n]])
elif (emotion == "surprise"):
    n = random.randint(55, 70)
    print(df.loc[[n]])
elif (emotion == "disgust"):
    n = random.randint(71, 91)
    print(df.loc[[n]])
else:
    n = random.randint(92, 101)
    print(df.loc[[n]])
OUTPUT
7. FUTURE SCOPE
The music player described here runs locally, but nowadays everything is becoming
portable and efficient to carry. The emotion of a person could instead be
captured with different wearable sensors, which would be easier to use than the
whole manual workflow; this would be possible using GSR (galvanic skin response)
and PPG (photoplethysmography) physiological sensors, which would give enough
data to predict the mood of the user accurately. With such enhancements the
system would offer more advanced features, but it would also need to be upgraded
constantly. The methodology that enables automatic playing of songs is driven by
emotion detection: facial expressions are detected with the help of a programming
interface present on the local machine. An alternative direction is to extend the
system to the additional emotions that are currently excluded.
8. CONCLUSION
This project, “Music Recommendation based on Mood using Facial Recognition”, is
based on the emotions captured from real-time images of the user. It is designed
to enable better interaction between the music system and the user, because music
is helpful in changing the mood of the user and, for some people, is a stress
reliever. Recent developments show wide prospects for emotion-based music
recommendation systems. The present system therefore provides facial-expression-
based recognition, so that emotions can be detected and music recommended
accordingly.
9. REFERENCES