
MUSIC RECOMMENDATION BASED ON MOOD

USING FACIAL RECOGNITION

Project Report Submitted

In Partial Fulfillment of the Requirements

For the Degree Of

BACHELOR OF ENGINEERING

IN

COMPUTER SCIENCE AND ENGINEERING

Submitted by

Aymaan Gillani 1604-17-733-078


Mohammad Abdul Mansoor Khan 1604-17-733-081
Mohammed Hassan 1604-17-733-093

COMPUTER SCIENCE AND ENGINEERING DEPARTMENT


MUFFAKHAM JAH COLLEGE OF ENGINEERING & TECHNOLOGY
(Affiliated to Osmania University)
Mount Pleasant, 8-2-24, Road No. 3, Banjara Hills, Hyderabad-34

2021
Date: 01/05/2021

CERTIFICATE
This is to certify that the project dissertation titled “MUSIC RECOMMENDATION
BASED ON MOOD USING FACIAL RECOGNITION” being submitted by

1. AYMAAN GILLANI (1604-17-733-078)

2. MOHAMMED ABDUL MANSOOR KHAN (1604-17-733-081)

3. MOHAMMED HASSAN (1604-17-733-093)

in Partial Fulfillment of the requirements for the award of the degree of


BACHELOR OF ENGINEERING IN COMPUTER SCIENCE AND
ENGINEERING in MUFFAKHAM JAH COLLEGE OF ENGINEERING AND
TECHNOLOGY, Hyderabad for the academic year 2020-2021 is the bonafide
work carried out by them. The results embodied in the report have not been
submitted to any other University or Institute for the award of any degree or
diploma.

Signatures:

Internal Project Guide Head CSED


Mr. SYED MD AKBAR HASHMI Dr. A.A. MOIZ QYSER
(Assistant Professor)

External Examiner

DECLARATION

This is to certify that work reported in the major project entitled “MUSIC
RECOMMENDATION BASED ON MOOD USING FACIAL
RECOGNITION” is a record of the bonafide work done by us in the Department
of Computer Science and Engineering, Muffakham Jah College of Engineering
and Technology, Osmania University. The results embodied in this report are
based on the project work done entirely by us and not copied from any other
source.

1. AYMAAN GILLANI (1604-17-733-078)


2. MOHAMMAD ABDUL MANSOOR KHAN (1604-17-733-081)
3. MOHAMMED HASSAN (1604-17-733-093)

ACKNOWLEDGEMENT

Our hearts are filled with gratitude to the Almighty for empowering us with
courage, wisdom and strength to complete this project successfully. We give him
all the glory, honor and praise.

We thank our Parents for having sacrificed a lot in their lives to impart the
best education to us and make us promising professionals for tomorrow.

We would like to express our sincere gratitude and indebtedness to our project
guide, Mr. Syed Md Akbar Hashmi, Assistant Professor, CSED, for his valuable
suggestions and guidance throughout the course of this project.

We are happy to express our profound sense of gratitude and indebtedness
to Prof. Dr. Ahmed Abdul Moiz Qyser, Head of the Computer Science and
Engineering Department, for his valuable and intellectual suggestions, able
guidance and constant encouragement throughout our work, which helped make it a
success.

With a great sense of pleasure and privilege, we extend our gratitude to
Prof. Dr. Umar Farooq, Associate Professor and Program Coordinator (CSED) and
project in-charge, who offered valuable suggestions and support at every step
of this project.

We are pleased to acknowledge our indebtedness to all those who devoted
themselves directly or indirectly to make this project work a total success.

AYMAAN GILLANI
MOHAMMAD ABDUL MANSOOR KHAN
MOHAMMED HASSAN

ABSTRACT

Human emotions play a vital role in our daily lives. An emotion arises from a
person's feelings and may or may not be outwardly expressed; when it is expressed,
it reveals the individual's behavioural state in different forms. The objective of
this project is to extract features from a human face, detect the emotion, and
play music according to the emotion detected.

The required input is extracted directly from the human face using a camera. One
application of this input is to deduce the mood of an individual. This data can
then be used to obtain a list of songs that comply with the "mood" derived from
the input. This eliminates the time-consuming and tedious task of manually
segregating or grouping songs into different lists and helps in generating an
appropriate playlist based on an individual's emotional features.

Thus, our proposed system focuses on detecting human emotions in order to develop
an emotion-based music player. A brief overview of the system's working, playlist
generation and emotion classification is given.

Keywords: Facial Recognition, Emotions, Music Recommendation, Emotional Features, Emotion Detection.

CONTENTS
TITLE
CERTIFICATE i
DECLARATION ii
ACKNOWLEDGEMENT iii
ABSTRACT iv
LIST OF FIGURES viii
LIST OF TABLES viii

1. INTRODUCTION 1
1.1 Problem Statement 2
1.2 Objectives 2

2. LITERATURE SURVEY 3
2.1 Related Work 3

3. PROPOSED SYSTEM 6
3.1 System Requirements 6
3.2 System Architecture 7
3.3 Methodology 7
3.3.1 Face Capturing 8
3.3.2 Face Detection 8
3.3.3 Emotion Classification 9
3.3.4 Music Recommendation 9
3.4 Calculating Eigen Faces 10
3.5 Flow Chart Overview 12

4. ALGORITHMS 13
4.1 Haar Cascade Algorithm 13
4.1.1 Definition 13
4.1.2 Functionality 13
4.2 DeepFace Algorithm 15
4.2.1 Definition 15
4.2.2 Functionality 15
4.3 Convolutional Neural Network (CNN) 17
4.3.1 Definition 17
4.3.2 Architecture 18
4.3.3 Convolutional Layer 19
4.3.4 Fully Connected Layer 21
4.3.5 Weights 21
4.3.6 History 22
4.3.6.1 Neocognitron, Origin of CNN Architecture 23
4.3.7 Time Delay Neural Network (TDNN) 24
4.3.7.1 Max Pooling 24
4.3.8 Shift-invariant Neural Network 25
4.3.9 GPU Implementations 25
4.3.10 Intel Xeon Phi Implementations 26
4.4 CNN Layers arranged in 3-Dimensions 27
4.5 CNN Architecture 28
4.5.1 ReLU Layer 32
4.5.2 Fully Connected Layer 33
4.5.3 Loss Layer 34
4.5.4 Hyper Parameters 35
4.5.5 Number of Filters 35

4.6 Applications of CNN 36
4.6.1 Image Recognition 36
4.6.2 Video Analysis 38
4.6.3 Natural Language Processing 38
4.6.4 Drug Discovery 38
4.6.5 Go 39
4.6.6 Cultural Heritage and 3D-Datasets 40
4.6.7 Fine Tuning 40
4.7 Related Networks 42
4.7.1 Deep Belief Networks 42
4.7.2 Deep Q-Networks 43

5. MODULES 44
5.1 Module – 1 45
5.2 Module – 2 48
5.3 Module – 3 51
5.4 Module – 4 52

6. PROGRAM & EXECUTION 53


7. FUTURE SCOPE 57
8. CONCLUSION 58
9. REFERENCES 59

LIST OF FIGURES
1.2 Different Types of Human Emotions 2
3.2 System Architecture 7
3.5 Flowchart of System Architecture 12
4.1.2 Types of Haar Features 14
4.2.2 Working of DeepFace 16
4.3.2 Architecture of CNN 19
4.5 Layers in CNN 32
4.5.1 ReLU Activation Function 33
4.5.2 Fully Connected Layer 34
4.5.3 Loss Layer in CNN 34
4.5.5 Working of Filter in CNN 36
4.6.7 Fine Tuning 41
4.7.1 Architecture of DBN 42
4.7.2 Architecture of DQN 43
5.1 Module – 1 45
5.2 Module – 2 48
5.3 Module – 3 51
5.4 Module – 4 52
6.1 Output of 1st Module 53
6.2 Output of 2nd Module 54
6.3 Output of 3rd Module 55
6.4 Output of 4th Module 56

LIST OF TABLES
3.1 System Requirements 6

1. INTRODUCTION

In this concept, music is recommended to the user by capturing and detecting the
user's emotions in real time. Existing techniques use collaborative filtering,
which recommends music from previous users' data and requires a lot of manual
work. We therefore propose a system that arranges music into different categories
such as happy, sad or angry. The emotion-based music player is a music player with
Chrome as the front end, capable of detecting emotions from the user's face with
the help of machine learning algorithms written in Python. Based on the detected
mood, a song list is displayed and recommended to the user.

In this application, an image of the person is captured in real time by a camera
connected to the local machine. The captured image is compared with the data sets
already saved on the device, and after processing, the present mood of the user is
expressed in numerical form; music is then played on this basis. Apart from this,
the application has some common features: a queue mode, with which an individual
playlist can be built, and a random mode, which uses the Python Pandas library to
pick a random song without any fixed order. Libraries such as OpenCV, Pandas,
Matplotlib and NumPy are used. This system is mainly proposed because music plays
a vital role in reducing stress in recent times.

So, in order to detect the emotion, we use the face as the main source of data,
because facial expressions normally reflect the emotion; according to the detected
mood, we then play music that can improve the user's mood.

1.1 Problem Statement
Music listeners have a tough time creating and segregating playlists manually
when they have hundreds of songs. It is also difficult to keep track of all the
songs. The sequence of songs in a playlist might not be the same every time, and
songs that a user wants to listen to frequently might not be given priority or
might be left out of the list. Currently, there are no systems that recommend
music instantly without manual work, and facial recognition has never been used as
a criterion for suggesting music to users.

1.2 Objectives

1. To provide an interface between the user and the music system.
2. To provide good entertainment for the users.
3. To implement the ideas of machine learning.
4. To provide a new-age platform for music lovers.
5. To bridge the gap between growing technologies and music techniques.

Figure 1.2: Different types of human emotions

2. LITERATURE SURVEY
Dimensionality reduction is the process of taking the primary data and reducing it
to a smaller number of classes for sorting and organizing. The emotion of a user
is extracted by capturing the image of the user through a webcam. The captured
image is enhanced through this dimensional reduction of the primary data. The data
is converted into a binary image format and the face is detected using the Haar
cascade method.

The initial or primary data taken from the human face is thus reduced to a smaller
number of classes, which are sorted and organized using the above methods. Emotion
is detected by extracting features from the human face. The main aim of the
feature extraction module is to diminish the number of resources required to
describe the large set of data. Features in an image consist of three parts:

1. Boundaries/edges

2. Corners/projection points

3. Field points

2.1 Related Work


1. Face Detection and Facial Expression Recognition System
Anagha S. Dhavalikar et al. proposed an automatic facial expression recognition
system. The system has three phases:
1. Face detection
2. Feature extraction
3. Expression recognition

In the first phase, face detection is performed using an RGB colour model,
lighting compensation for locating the face, and morphological operations for
retaining the required facial regions, i.e., the eyes and mouth. The system also
uses the Active Appearance Model (AAM) method for facial feature extraction:
points on the face such as the eyes, eyebrows and mouth are located, and a data
file is created that records the detected model points. When a face with an
expression is given as input, the AAM model changes according to the expression.

2. Emotional Recognition from Facial Expression Analysis using Bezier Curve Fitting

Yong-Hwan Lee, Woori Han and Youngseop Kim proposed a system based on Bezier curve
fitting. The system uses two steps for facial expression and emotion recognition:
the first is detection and analysis of the facial area from the original input
image, and the next is verification of the facial emotion from characteristic
features in the region of interest. In the first phase, face detection uses a
colour still image based on skin-colour pixels obtained by initial spatial
filtering and lighting compensation; a feature map is then used to estimate the
face position and the locations of the eyes and mouth. After extracting the region
of interest, the system extracts points from the feature map and applies Bezier
curves to the eyes and mouth. To understand the emotion, the system is trained by
measuring the Hausdorff distance between the Bezier curves of the entered face
image and those of the images in the database.

3. Using Animated Mood Pictures in Music Recommendation


Arto Lehtiniemi and Jukka Holm proposed a system based on animated mood pictures
for music recommendation. In this system, the user interacts with a collection of
images to receive music recommendations corresponding to the genre of the picture.
The system was developed by the Nokia Research Center and uses textual meta tags
and audio signal processing to describe the genre.

4. Human-Computer Interaction using Emotion Recognition from Facial Expression

F. Abdat, C. Maaoui and A. Pruski proposed a fully automatic facial expression
recognition system based on three steps: face detection, facial characteristic
extraction and facial expression classification. The system uses an anthropometric
model to detect the facial feature points, combined with the Shi and Tomasi
method. The variation of 21 distances that describe the facial features relative
to the neutral face is measured, and the classification is based on an SVM
(Support Vector Machine).

5. Emotion-based Music Recommendation by Association Discovery from Film Music

Fang-Fei Kuo and Suh-Yin Lee et al. note that with the growth of digital music,
the development of music recommendation is helpful for users. Existing
recommendation approaches are based on the user's preference for music; however,
sometimes recommending music according to the emotion is needed. In this paper,
the authors propose a novel model for emotion-based music recommendation based on
association discovery from film music. They investigated music feature extraction
and modified the affinity graph for association discovery between emotions and
music features. Experimental results show that the proposed approach achieves 85%
accuracy on average.

3. PROPOSED SYSTEM
Humans tend to show their emotions unknowingly, and these emotions are mainly
reflected on the face. The proposed system provides an interaction between the
user and the music system. This project mainly focuses on recommending the user's
preferred music through emotional awareness. In the initial stage of the proposed
system, three options are given, each with its own functionality. Along with
these, a list of songs and of emotions based on facial recognition is provided.
Once the application starts working, it captures images with the webcam or any
other physiological device. Our main aim in this system is to build a
sophisticated music player that can improve the user's mood, music being one of
the best aids for changing mood.

In this system, the captured images are compared with the data sets. Only four
emotions are considered, because humans have many emotions that differ from person
to person and are hard to predict, so four common and easily identifiable moods
are used. An alternative mode can also be used alongside the main concept, i.e.,
random picking of songs, which can help brighten the mood, and another mode is the
queue mode, with which users can make a playlist of their own. In all the modes we
do not use previous users' data; we use only the individual user's data.

3.1 System Requirements


S.No.   Hardware                    Software
1.      Processor – 2 GHz           Anaconda
2.      RAM – 2 GB                  Jupyter Notebook
3.      Webcam – 2 Megapixel        Windows or Mac OS

Table 3.1: System Requirements

3.2 System Architecture
The face is captured using the webcam and processed using the Haar Cascade and
DeepFace algorithms. The captured images are stored and analysed, with the face
detected using the Haar cascade XML file and the emotion then classified. The
results are fed to the Python libraries, and music is recommended on this basis.

Figure 3.2: System Architecture

3.3 Methodology
Compared to the algorithms used in previous systems, the proposed algorithm is
proficient enough to handle large pose variations, which tend to disrupt the
efficiency of pre-existing algorithms. To reduce this effect, a standard image
input format is used. A few systems detect the faces first and then locate them;
on the other hand, some algorithms detect and locate the faces at the same time.
Every face detection algorithm usually follows common steps: first, achieving an
acceptable response time, and then performing data dimension reduction. Focusing
on data dimension reduction, a few algorithms extract facial measurements, while
others act on certain relevant facial regions. An advantage of the proposed
algorithm is that using a static image largely removes the defects caused by pose
variations. The three most commonly faced problems are the presence of
unidentified elements such as glasses or a beard, the quality of the static
images, and unidentifiable facial gestures.

3.3.1 Face Capturing


The main objective of this stage is to capture images. For this we use a common
device, i.e., a webcam, although any other physiological device can be used. For
this purpose we use the computer vision (cv2) library, OpenCV, which integrates
easily with other libraries such as NumPy and is mainly used for real-time
computer vision. In the initial phase of execution, the application accesses the
camera stream and captures about 10 images for further processing and emotion
detection. To capture the images and detect faces, we use an algorithm that
classifies authentic images: training the classifier requires a lot of positive
images, which actually contain faces, and negative images, which contain no faces.
The classified images are then taken as part of the model.
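
As an illustration of this step, the short sketch below captures a handful of
frames from the default webcam using OpenCV. The frame count, function name and
camera index are assumptions for the example, not the project's actual code.

import cv2

def capture_frames(num_frames=10):
    """Grab a few frames from the webcam for later emotion detection."""
    cap = cv2.VideoCapture(0)          # open the default camera stream
    frames = []
    try:
        while len(frames) < num_frames:
            ok, frame = cap.read()     # read one BGR frame
            if not ok:
                break                  # camera unavailable or stream ended
            frames.append(frame)
    finally:
        cap.release()                  # always free the camera
    return frames

if __name__ == "__main__":
    captured = capture_frames()
    print(f"Captured {len(captured)} frames")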

3.3.2 Face Detection


Face recognition is considered one of the best ways to determine a person's mood.
This image-processing stage reduces the dimensions of the face space using the
Haar Cascade and DeepFace methods to obtain the characteristic features of the
image; we use these methods in particular because they maximize the separation
between classes during training. Image recognition and face detection are carried
out with the Haar cascade, while for matching faces and classifying the expression
that implies the user's emotion we use DeepFace. The Haar cascade with OpenCV
mainly emphasizes a class-specific transformation matrix, so it does not take
illustrative images of the subject. The emotion is concluded by the model: the
value evaluated by the process helps to deduce the mood of the user by comparison
with the data sets, in which each emotion is compared against tens of stored
images, and the resulting scale gives the exact emotion, so that music can be
played based on the recommendation made by the system. Unlike other existing
software, it does not depend on other personal details. A linear classification
step is used in the face detection process; it is simpler than an SVM and
decreases the computational time the classification process takes, giving better
detection.
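
A minimal sketch of the Haar-cascade detection step is shown below, assuming the
opencv-python package; the classifier file is one of those shipped with OpenCV,
and the parameter values are illustrative only.

import cv2

# Load one of the pre-trained cascades bundled with opencv-python.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame):
    """Return the cropped face regions found in a single BGR frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # cascades work on grayscale
    boxes = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [frame[y:y + h, x:x + w] for (x, y, w, h) in boxes]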

3.3.3 Emotion Classification

When the face is detected successfully, a bounding box appears and is overlaid on
the image to extract the face for further analysis. In the next step, the
previously extracted images are processed by the classification function. The code
extracts the facial spatial positions from the face image based on the pixel
intensity values indexed at each point, and it uses a boosting algorithm. It
performs a comparison between the input data and the stored data so it can predict
the class that contains the emotion, i.e., one of the four emotions: anger,
sadness, neutral or happiness.

3.3.4 Music Recommendation

The input images acquired from the web camera are captured in real time. Here we
use four main emotions, because it is very hard to define all the emotions;
limiting the options reduces the processing time and makes the outcome more
reliable. The system compares the evaluated values with thresholds present in the
code, and the resulting values are transferred to the playback service. Songs are
played according to the detected emotion: an emotion is assigned to every song,
and when an emotion is detected, the songs numbered and assigned to that emotion
are arranged and played. Many kinds of models could be used for recommendation,
but we use the Haar Cascade and DeepFace algorithms because they give better
accuracy than the alternatives. For the sound mechanism we use winsound, a
commonly used Python library for basic sound playing, with a frequency and
duration; the values obtained are again compared with the thresholds present in
the code. Besides the emotion-based mode, there are two alternative options: queue
mode and random mode. In queue mode, we can make a playlist as in other usual
music software, and random mode picks songs at random rather than in order, which
is also a kind of therapy that can brighten up our mood. When a song is played on
the basis of the user's emotion, the emotion is also represented as one of four
emoticons; each emotion is assigned a number that maps to both the music and the
emoticon displayed.
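
The hedged sketch below illustrates the recommendation step described above: a
Pandas table of songs tagged by emotion, random selection within the detected
emotion, and a winsound tone as a stand-in for playback. The column names, song
titles and numeric values are assumptions, not the project's data.

import pandas as pd

try:
    import winsound                    # available on Windows only
except ImportError:
    winsound = None

songs = pd.DataFrame({
    "title":   ["Song A", "Song B", "Song C", "Song D"],
    "emotion": ["happy",  "sad",    "angry",  "neutral"],
})

def recommend(emotion):
    """Pick one song at random from the subset tagged with the given emotion."""
    subset = songs[songs["emotion"] == emotion]
    if subset.empty:
        return None
    return subset.sample(1).iloc[0]["title"]

detected = "happy"                     # would come from the emotion classifier
print("Recommended:", recommend(detected))
if winsound is not None:
    winsound.Beep(440, 500)            # 440 Hz tone for 500 ms as a placeholder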

3.4 Calculating Eigen Faces


Eigenfaces: Not all parts of the face are equally important for emotion
recognition. This key fact is considered important and useful. Face recognition
techniques focus on the eyes, nose, cheeks and forehead and on how they change
with respect to each other. Overall, the areas with maximum change,
mathematically the areas with high variance, are targeted. When multiple faces
are compared, they are compared through these parts of the face, because these
parts capture the maximum change among faces, specifically the change that helps
to differentiate one face from another. This is how the Eigenfaces face
recognizer works.

Face feature extraction: pictures are represented as weighted eigenvectors, which
are combined and known as "eigenfaces". One of the advantages of eigenfaces is
that they capture the similarity between the pixels of images by means of their
covariance matrix.

The following steps are required to recognize facial expressions using the
Eigenfaces approach:

Let X = {x_1, x_2, ..., x_n}, x_i ∊ R^d, be a random vector with n observations.

1. Calculate the mean µ:

   µ = (1/n) ∑_{i=1}^{n} x_i

2. Calculate the covariance matrix S:

   S = (1/n) ∑_{i=1}^{n} (x_i − µ)(x_i − µ)^T

3. Compute the eigenvectors v_i and eigenvalues λ_i of S:

   S v_i = λ_i v_i,   i = 1, 2, ..., n

4. Arrange the eigenvectors by their eigenvalues in descending order and keep the
   k leading eigenvectors as the columns of W = (v_1, ..., v_k). The projection of
   an observation x onto this eigenspace is then:

   y = W^T (x − µ)
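
Assuming the training faces are flattened into the rows of an array X, the NumPy
sketch below mirrors the four steps above. For realistic image sizes one would
normally work with the smaller n x n Gram matrix instead, but the direct form
stays closest to the formulas.

import numpy as np

def eigenfaces(X, k):
    """Return the mean face and the k leading eigenvectors of the data X (n x d)."""
    mu = X.mean(axis=0)                      # step 1: mean face
    centered = X - mu
    S = centered.T @ centered / X.shape[0]   # step 2: covariance matrix (d x d)
    eigvals, eigvecs = np.linalg.eigh(S)     # step 3: eigen-decomposition
    order = np.argsort(eigvals)[::-1]        # step 4: sort eigenvalues, descending
    W = eigvecs[:, order[:k]]                # keep the k leading eigenvectors
    return mu, W

def project(x, mu, W):
    """Project a face x onto the eigenface space: y = W^T (x - mu)."""
    return W.T @ (x - mu)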

3.5 Flow Chart Overview

Figure 3.5: Flowchart of System Architecture

4. ALGORITHMS

Two different algorithms have been used for developing the system, that serve
different purposes according to the system’s needs.

The algorithms used are:

1. Haar Cascade (for Face Detection)

2. DeepFace (for Emotion/Mood Detection)

4.1 Haar Cascade Algorithm


4.1.1 Definition
Image Detection using Haar feature-based cascade classifiers is an effective
method proposed by Paul Viola and Michael Jones in the 2001 paper, "Rapid
Object Detection using a Boosted Cascade of Simple Features". It is a machine
learning based approach in which a cascade function is trained from a lot of
positive and negative images. It is then used to detect objects in other images.

4.1.2 Functionality
Initially, the algorithm needs a lot of positive images (images of faces) and
negative images (images without faces) to train the classifier. Then we need to
extract features from it. For this, Haar features shown in below image are used.
They are just like our convolutional kernel. Each feature is a single value obtained
by subtracting the sum of pixels under the white rectangle from the sum of pixels
under the black rectangle.

Now all possible sizes and locations of each kernel are used to calculate plenty
of features. For each feature calculation, we need to find the sum of the pixels
under the white and black rectangles. To solve this, the authors introduced
integral images, which simplify the calculation of the sum of the pixels, however
large the number of pixels may be, to an operation involving just four pixels.

Figure 4.1.2: Types of Haar Features
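
A small NumPy sketch of the integral-image trick is given below: once the
cumulative sums are built, the sum of any rectangle can be read from just four
entries of the table, regardless of the rectangle's size. The array values are
arbitrary.

import numpy as np

def integral_image(img):
    """2-D cumulative sum, padded with a leading row and column of zeros."""
    ii = img.cumsum(axis=0).cumsum(axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, x1, y1, x2, y2):
    """Sum of img[y1:y2, x1:x2] using only four lookups in the integral image."""
    return ii[y2, x2] - ii[y1, x2] - ii[y2, x1] + ii[y1, x1]

img = np.arange(16, dtype=np.int64).reshape(4, 4)
ii = integral_image(img)
assert rect_sum(ii, 1, 1, 3, 3) == img[1:3, 1:3].sum()   # 5 + 6 + 9 + 10 = 30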

But among all these features we calculated, most of them are irrelevant. For
example, consider the image below. Top row shows two good features. The first
feature selected seems to focus on the property that the region of the eyes is often
darker than the region of the nose and cheeks. The second feature selected relies
on the property that the eyes are darker than the bridge of the nose.

We apply each and every feature on all the training images. For each feature, the
best threshold that classifies the faces as positive or negative is found.
Obviously, there will be errors or misclassifications; we select the features with
the minimum error rate, which means the features that best classify the face and
non-face images. (The process is not as simple as this. Each image is given an
equal weight in the beginning. After each classification, the weights of
misclassified images are increased, the same process is repeated, and new error
rates and new weights are calculated. The process is continued until the required
accuracy or error rate is achieved, or the required number of features is found.)

The final classifier is a weighted sum of these weak classifiers. They are called
weak because individually they cannot classify the image, but together they form a
strong classifier. The paper reports that even 200 features provide detection with
95% accuracy.

OpenCV already contains many pre-trained classifiers for faces, eyes, smiles etc.
Those XML files are stored in the opencv/data/haarcascades/ folder.

4.2 DeepFace Algorithm

4.2.1 Definition

DeepFace is a deep learning facial recognition system created by a research group


at Facebook. It identifies human faces in digital images. The program employs a
nine-layer neural network with over 120 million connection weights and
was trained on four million images uploaded by Facebook users. The Facebook
Research team has stated that the DeepFace method reaches an accuracy of
97.35% ± 0.25% on the Labelled Faces in the Wild (LFW) data set, where human
beings score 97.53%. This means that DeepFace can sometimes be more successful
than human beings.

4.2.2 Functionality

DeepFace begins by using aligned versions of several existing databases to


improve the algorithms and produce a normalized output. However, these models
are insufficient to produce effective facial recognition in all instances. DeepFace
uses fiducial point detectors based on existing databases to direct the alignment of
faces. The facial alignment begins with a 2D alignment, and then continues with
3D alignment and frontalization. That is, DeepFace’s process is two steps. First, it

corrects the angles of an image so that the face in the photo is looking forward. To
accomplish this, it uses a 3-D model of a face. Then the deep learning produces a
numerical description of the face. If DeepFace comes up with a similar enough
description for two images, it assumes that these two images share a face.

2D Alignment

The DeepFace process begins by detecting 6 fiducial points on the detected face
— the center of the eyes, tip of the nose and mouth location. These points are
translated onto a warped image to help detect the face. However, 2D
transformation fails to compensate for rotations that are out of place.

Figure 4.2.2: Working of DeepFace

3D Alignment

In order to align faces, DeepFace uses a generic 3D model wherein 2D images are
cropped as 3D versions. The 3D image has 67 fiducial points. After the image has
been warped, there are 67 anchor points manually placed on the image to match

the 67 fiducial points. A 3D-to-2D camera is then fitted that minimizes losses.
Because 3D detected points on the contour of the face can be inaccurate, this step
is important.

Frontalization

Because full perspective projections are not modeled, the fitted camera is only an
approximation of the individual’s actual face. To reduce errors, DeepFace aims to
warp the 2D images with smaller distortions. Also, the camera P is capable of
replacing parts of the image and blending them with their symmetrical
counterparts.

A lightweight open-source face recognition and facial attribute analysis (age,
gender, emotion and race) framework for Python is also available under the name
DeepFace.
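
As a brief illustration of how such a framework can be used for the
emotion-detection step, the sketch below uses the open-source deepface package
for Python. The image path is a placeholder, and the exact shape of the returned
result can differ between package versions.

from deepface import DeepFace          # pip install deepface

result = DeepFace.analyze(img_path="captured_face.jpg", actions=["emotion"])

# Recent versions return a list with one dictionary per detected face.
first_face = result[0] if isinstance(result, list) else result
print(first_face["dominant_emotion"])  # e.g. "happy", "sad", "angry" or "neutral"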

4.3 Convolutional Neural Network (CNN)

4.3.1 Definition

In deep learning, a convolutional neural network (CNN, or ConvNet) is a class


of deep neural network, most commonly applied to analyze visual imagery.[1]
They are also known as shift invariant or space invariant artificial neural
networks (SIANN), based on the shared-weight architecture of the convolution
kernels or filters that slide along input features and provide translation
equivariant responses known as feature maps.
Counter-intuitively, most convolutional neural networks are only equivariant, as
opposed to invariant, to translation. They have applications in image and video
recognition, recommender systems, image classification, image segmentation,
medical image analysis, natural language processing, brain-computer interfaces,
and financial time series.

CNNs are regularized versions of multilayer perceptrons. Multilayer
perceptrons usually mean fully connected networks, that is, each neuron in one
layer is connected to all neurons in the next layer. The "full connectivity" of these
networks makes them prone to overfitting data. Typical ways of regularization,
or preventing overfitting, include: penalizing parameters during training (such as
weight decay) or trimming connectivity (skipped connections, dropout, etc.)
CNNs take a different approach towards regularization: they take advantage of
the hierarchical pattern in data and assemble patterns of increasing complexity
using smaller and simpler patterns embossed in their filters. Therefore, on a scale
of connectivity and complexity, CNNs are on the lower extreme.
Convolutional networks were inspired by biological processes in that the
connectivity pattern between neurons resembles the organization of the animal
visual cortex. Individual cortical neurons respond to stimuli only in a restricted
region of the visual field known as the receptive field. The receptive fields of
different neurons partially overlap such that they cover the entire visual field.

CNNs use relatively little pre-processing compared to other image classification


algorithms. This means that the network learns to optimize the filters (or kernels)
through automated learning, whereas in traditional algorithms these filters are
hand- engineered. This independence from prior knowledge and human
intervention in feature extraction is a major advantage.

4.3.2 Architecture
The name “convolutional neural network” indicates that the network employs a
mathematical operation called convolution. Convolutional networks are a
specialized type of neural networks that use convolution in place of general
matrix multiplication in at least one of their layers.
A convolutional neural network consists of an input layer, hidden layers and an
output layer. In any feed-forward neural network, any middle layers are called

hidden because their inputs and outputs are masked by the activation function and
final convolution. In a convolutional neural network, the hidden layers include
layers that perform convolutions. Typically, this includes a layer that performs a
dot product of the convolution kernel with the layer's input matrix. This product
is usually the Frobenius inner product, and its activation function is commonly
ReLU. As the convolution kernel slides along the input matrix for the layer, the
convolution operation generates a feature map, which in turn contributes to the
input of the next layer. This is followed by other layers such as pooling layers,
fully connected layers, and normalization layers.

Figure 4.3.2: Architecture of CNN
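
To make these layer types concrete, the minimal Keras sketch below stacks
convolutional, pooling and fully connected layers. The 48x48 grayscale input and
the four output classes are assumptions chosen for illustration; they are not
taken from the project.

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(48, 48, 1)),                 # grayscale face image
    layers.Conv2D(32, (3, 3), activation="relu"),    # convolutional layer
    layers.MaxPooling2D((2, 2)),                     # pooling layer
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),            # fully connected layer
    layers.Dense(4, activation="softmax"),           # one unit per emotion class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()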

4.3.3 Convolutional Layer


In a CNN, the input is a tensor with a shape: (number of inputs) x (input height) x
(input width) x (input channels). After passing through a convolutional layer, the
image becomes abstracted to a feature map, also called an activation map, with
shape: (number of inputs) x (feature map height) x (feature map width) x (feature
map channels). A convolutional layer within a CNN generally has the following
attributes:

• Convolutional filters/kernels defined by a width and height (hyper-
parameters).

• The number of input channels and output channels (hyper-parameters).
A layer's number of input channels must equal the number of output channels
(also called depth) of its input.
• Additional hyperparameters of the convolution operation, such as:
padding, stride, and dilation.
Convolutional layers convolve the input and pass its result to the next layer. This
is similar to the response of a neuron in the visual cortex to a specific stimulus.
Each convolutional neuron processes data only for its receptive field. Although
fully connected feedforward neural networks can be used to learn features and
classify data, this architecture is generally impractical for larger inputs such as
high-resolution images. It would require a very high number of neurons, even in a
shallow architecture, due to the large input size of images, where each pixel is a
relevant input feature. For instance, a fully connected layer for a (small) image of
size 100 x 100 has 10,000 weights for each neuron in the second layer. Instead,
convolution reduces the number of free parameters, allowing the network to be
deeper. For example, regardless of image size, using a 5 x 5 tiling region, each
with the same shared weights, requires only 25 learnable parameters. Using
regularized weights over fewer parameters avoids the vanishing gradients and
exploding gradients problems seen during backpropagation in traditional neural
networks. Furthermore, convolutional neural networks are ideal for data with a
grid-like topology (such as images) as spatial relations between separate features
are taken into account during convolution and/or pooling.
Pooling layers
Convolutional networks may include local and/or global pooling layers along with
traditional convolutional layers. Pooling layers reduce the dimensions of data by
combining the outputs of neuron clusters at one layer into a single neuron in the
next layer. Local pooling combines small clusters, tiling sizes such as 2 x 2 are

commonly used. Global pooling acts on all the neurons of the feature map. There
are two common types of pooling in popular use: max and average. Max pooling
uses the maximum value of each local cluster of neurons in the feature map, while
average pooling takes the average value.
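
As a small illustration of local max pooling, the NumPy sketch below downsamples
a single-channel feature map with 2 x 2 windows and a stride of 2; the values are
a toy example.

import numpy as np

def max_pool_2x2(feature_map):
    """Downsample an (H, W) array by taking the maximum of each 2x2 block."""
    h, w = feature_map.shape
    blocks = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 0],
               [7, 2, 9, 8],
               [0, 1, 3, 4]])
print(max_pool_2x2(fm))   # [[6 4]
                          #  [7 9]]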

4.3.4 Fully Connected Layer


Fully connected layers connect every neuron in one layer to every neuron in
another layer. It is the same as a traditional multi-layer perceptron neural
network (MLP). The flattened matrix goes through a fully connected layer to
classify the images.

Receptive field
In neural networks, each neuron receives input from some number of locations in
the previous layer. In a convolutional layer, each neuron receives input from only
a restricted area of the previous layer called the neuron's receptive field.
Typically, the area is a square (e.g., 5 by 5 neurons). Whereas, in a fully
connected layer, the receptive field is the entire previous layer. Thus, in each
convolutional layer, each neuron takes input from a larger area in the input than
previous layers. This is due to applying the convolution over and over, which
takes into account the value of a pixel, as well as its surrounding pixels. When
using dilated layers, the number of pixels in the receptive field remains constant,
but the field is more sparsely populated as its dimensions grow when combining
the effect of several layers.

4.3.5 Weights

Each neuron in a neural network computes an output value by applying a specific


function to the input values received from the receptive field in the previous
layer. The function that is applied to the input values is determined by a vector of

weights and a bias (typically real numbers). Learning consists of iteratively
adjusting these biases and weights.

The vector of weights and the bias are called filters and represent particular
features of the input (e.g., a particular shape). A distinguishing feature of CNNs
is that many neurons can share the same filter. This reduces the memory
footprint because a single bias and a single vector of weights are used across all
receptive fields that share that filter, as opposed to each receptive field having
its own bias and vector weighting.

4.3.6 History
CNN design follows vision processing in living organisms.
Receptive fields in the visual cortex

Work by Hubel and Wiesel in the 1950s and 1960s showed that cat and monkey
visual cortexes contain neurons that individually respond to small regions of the
visual field. Provided the eyes are not moving, the region of visual space within
which visual stimuli affect the firing of a single neuron is known as its receptive
field. Neighboring cells have similar and overlapping receptive fields.
Receptive field size and location varies systematically across the cortex to form
a complete map of visual space. The cortex in each hemisphere represents the
contralateral visual field.
Their 1968 paper identified two basic visual cell types in the brain:

• Simple cells, whose output is maximized by straight edges


having particular orientations within their receptive field
• Complex cells, which have larger receptive fields, whose output is
insensitive to the exact position of the edges in the field.
Hubel and Wiesel also proposed a cascading model of these two types of cells
for use in pattern recognition tasks.

4.3.6.1 Neocognitron, Origin of CNN Architecture

The "neocognitron" was introduced by Kunihiko Fukushima in 1980. It was


inspired by the above-mentioned work of Hubel and Wiesel. The neocognitron
introduced the two basic types of layers in CNNs: convolutional layers, and down
sampling layers. A convolutional layer contains units whose receptive fields
cover a patch of the previous layer. The weight vector (the set of adaptive
parameters) of such a unit is often called a filter. Units can share filters. Down
sampling layers contain units whose receptive fields cover patches of previous
convolutional layers. Such a unit typically computes the average of the
activations of the units in its patch. This downsampling helps to correctly classify
objects in visual scenes even when the objects are shifted.

In a variant of the neocognitron called the cresceptron, instead of using


Fukushima's spatial averaging, J. Weng et al. introduced a method called max-
pooling where a downsampling unit computes the maximum of the activations
of the units in its patch. Max-pooling is often used in modern CNNs.
Several supervised and unsupervised learning algorithms have been proposed over
the decades to train the weights of a neocognitron. Today, however, the CNN
architecture is usually trained through backpropagation.
The neocognitron is the first CNN which requires units located at multiple
network positions to have shared weights.

Convolutional neural networks were presented at the Neural Information


Processing Workshop in 1987, automatically analyzing time-varying signals by
replacing learned multiplication with convolution in time, and demonstrated for
speech recognition.

4.3.7 Time Delay Neural Network (TDNN)
The time delay neural network (TDNN) was introduced in 1987 by Alex Waibel
et al. and was the first convolutional network, as it achieved shift invariance.

It did so by utilizing weight sharing in combination with Backpropagation


training. Thus, while also using a pyramidal structure as in the neocognitron, it
performed a global optimization of the weights instead of a local one.

TDNNs are convolutional networks that share weights along the temporal
dimension. They allow speech signals to be processed time-invariantly. In 1990
Hampshire and Waibel introduced a variant which performs a two-dimensional
convolution. Since these TDNNs operated on spectrograms, the resulting
phoneme recognition system was invariant to both shifts in time and in
frequency. This inspired translation invariance in image processing with CNNs.
The tiling of neuron outputs can cover timed stages.
TDNNs now achieve the best performance in far-distance speech recognition.

4.3.7.1 Max Pooling


In 1990 Yamaguchi et al. introduced the concept of max pooling, which is a
fixed filtering operation that calculates and propagates the maximum value of a
given region. They did so by combining TDNNs with max pooling in order to
realize a speaker independent isolated word recognition system. In their system
they used several TDNNs per word, one for each syllable. The results of each
TDNN over the input signal were combined using max pooling and the outputs
of the pooling layers were then passed on to networks performing the actual
word classification.
A system to recognize hand-written ZIP Code numbers involved convolutions
in which the kernel coefficients had been laboriously hand designed.

Yann LeCun et al. (1989) used back-propagation to learn the convolution kernel
coefficients directly from images of hand-written numbers. Learning was thus
fully automatic, performed better than manual coefficient design, and was suited
to a broader range of image recognition problems and image types.

4.3.8 Shift-invariant Neural Network


Similarly, a shift invariant neural network was proposed by W. Zhang et al. for
image character recognition in 1988. The architecture and training algorithm
were modified in 1991 and applied for medical image processing and automatic
detection of breast cancer in mammograms.
A different convolution-based design was proposed in 1988 for application to
decomposition of one-dimensional electromyography convolved signals via de-
convolution. This design was modified in 1989 to other de-convolution-based
designs.

Neural abstraction pyramid

The feed-forward architecture of convolutional neural networks was extended


in the neural abstraction pyramid by lateral and feedback connections. The
resulting recurrent convolutional network allows for the flexible incorporation
of contextual information to iteratively resolve local ambiguities. In contrast
to previous models, image-like outputs at the highest resolution were
generated, e.g., for semantic segmentation, image reconstruction, and object
localization tasks.

4.3.9 GPU Implementations

Although CNNs were invented in the 1980s, their breakthrough in the 2000s
required fast implementations on graphics processing units (GPUs).

In 2004, it was shown by K. S. Oh and K. Jung that standard neural networks
can be greatly accelerated on GPUs. Their implementation was 20 times faster
than an equivalent implementation on CPU. In 2005, another paper also
emphasized the value of GPGPU for machine learning.
The first GPU-implementation of a CNN was described in 2006 by K.
Chellapilla et al. Their implementation was 4 times faster than an equivalent
implementation on CPU. Subsequent work also used GPUs, initially for other
types of neural networks (different from CNNs), especially unsupervised
neural networks.
In 2010, Dan Ciresan et al. at IDSIA showed that even deep standard neural
networks with many layers can be quickly trained on GPU by supervised learning
through the old method known as backpropagation. Their network outperformed
previous machine learning methods on the MNIST handwritten digits benchmark.
In 2011, they extended this GPU approach to CNNs, achieving an acceleration
factor of 60, with impressive results. In 2011, they used such CNNs on GPU to
win an image recognition contest where they achieved superhuman performance
for the first time. Between May 15, 2011 and September 30, 2012, their CNNs
won no less than four image competitions. In 2012, they also significantly
improved on the best performance in the literature for multiple image databases,
including the MNIST database, the NORB database, the HWDB1.0 dataset
(Chinese characters) and the CIFAR10 dataset (dataset of 60000 32x32 labeled
RGB images).

4.3.10 Intel Xeon Phi Implementations

Compared to the training of CNNs using GPUs, not much attention was given to
the Intel Xeon Phi coprocessor. A notable development is a parallelization
method for training convolutional neural networks on the Intel Xeon Phi, named
Controlled Hogwild with Arbitrary Order of Synchronization (CHAOS).
CHAOS exploits both the thread- and SIMD-level parallelism that is available

on the Intel Xeon Phi.
Distinguishing features
In the past, traditional multilayer perceptron (MLP) models were used for image
recognition. However, the full connectivity between nodes caused the curse of
dimensionality, and was computationally intractable with higher resolution
images. A 1000×1000-pixel image with RGB color channels has 3 million weights per
fully connected neuron, which is too high to feasibly process efficiently at scale
with full connectivity.

4.4 CNN Layers arranged in 3-Dimensions


Convolutional neural networks are variants of multilayer perceptrons, designed to
emulate the behavior of a visual cortex. These models mitigate the challenges
posed by the MLP architecture by exploiting the strong spatially local correlation
present in natural images. As opposed to MLPs, CNNs have the following
distinguishing features:

• 3D volumes of neurons: The layers of a CNN have neurons arranged in 3


dimensions: width, height and depth. Where each neuron inside a
convolutional layer is connected to only a small region of the layer before
it, called a receptive field. Distinct types of layers, both locally and
completely connected, are stacked to form a CNN architecture.
• Local Connectivity: following the concept of receptive fields, CNNs
exploit spatial locality by enforcing a local connectivity pattern between
neurons of adjacent layers. The architecture thus ensures that the learned
"filters" produce the strongest response to a spatially local input pattern.
Stacking many such layers leads to non-linear filters that become
increasingly global (i.e., responsive to a larger region of pixel space) so
that the network first creates representations of small parts of the input,
then from them assembles representations of larger areas.

• Shared weights: In CNNs, each filter is replicated across the entire visual
field. These replicated units share the same parameterization (weight
vector and bias) and form a feature map. This means that all the neurons in
a given convolutional layer respond to the same feature within their
specific response field. Replicating units in this way allows for the
resulting activation map to be equivariant under shifts of the locations of
input features in the visual field, i.e., they grant translational equivariance
- given that the layer has a stride of one.
• Pooling: In a CNN's pooling layers, feature maps are divided into
rectangular sub- regions, and the features in each rectangle are
independently down-sampled to a single value, commonly by taking their
average or maximum value. In addition to reducing the sizes of feature
maps, the pooling operation grants a degree of local translational
invariance to the features contained therein, allowing the CNN to be more
robust to variations in their positions.
Together, these properties allow CNNs to achieve better generalization on
vision problems. Weight sharing dramatically reduces the number of free
parameters learned, thus lowering the memory requirements for running the
network and allowing the training of larger, more powerful networks.

4.5 CNN Architecture


When dealing with high-dimensional inputs such as images, it is impractical to
connect neurons to all neurons in the previous volume because such a network
architecture does not take the spatial structure of the data into account.
Convolutional networks exploit spatially local correlation by enforcing a sparse
local connectivity pattern between neurons of adjacent layers: each neuron is
connected to only a small region of the input volume.

The extent of this connectivity is a hyperparameter called the receptive field of
the neuron. The connections are local in space (along width and height), but
always extend along the entire depth of the input volume. Such an architecture
ensures that the learnt filters produce the strongest response to a spatially local
input pattern.

Spatial arrangement
Three hyper parameters control the size of the output volume of the convolutional
layer: the depth, stride and padding size.

• The depth of the output volume controls the number of neurons in a layer
that connect to the same region of the input volume. These neurons learn to
activate for different features in the input. For example, if the first
convolutional layer takes the raw image as input, then different neurons along
the depth dimension may activate in the presence of various oriented edges,
or blobs of color.
• Stride controls how depth columns around the width and height are
allocated. If the stride is 1, then we move the filters one pixel at a time. This
leads to heavily overlapping receptive fields between the columns, and to
large output volumes. For any integer S > 0, a stride S means that the filter is
translated S units at a time per output. In practice, stride values of three or
more are rare. A greater stride
means smaller overlap of receptive fields and smaller spatial dimensions of
the output volume.
• Sometimes, it is convenient to pad the input with zeros (or other values,
such as the average of the region) on the border of the input volume. The
size of this padding is a third hyperparameter. Padding provides control of
the output volume's spatial size. In particular, sometimes it is desirable to
exactly preserve the spatial size of the input volume, this is commonly
referred to as "same" padding.

The spatial size of the output volume is a function of the input volume size W,
the kernel field size K of the convolutional layer neurons, the stride S, and the
amount of zero padding P on the border. The number of neurons that "fit" in a
given volume is then:

(W − K + 2P)/S + 1

If this number is not an integer, then the strides are incorrect and the neurons
cannot be tiled to fit across the input volume in a symmetric way. In general,
setting the zero padding to P = (K − 1)/2 when the stride is S = 1 ensures that
the input volume and output volume will have the same size spatially. However, it
is not always completely necessary to use all of the neurons of the previous
layer. For example, a neural network designer may decide to use just a portion of
padding.
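
As a quick check of this formula, the small helper below (illustrative only)
computes the output size along one spatial dimension and rejects combinations
that do not tile the input symmetrically.

def conv_output_size(w, k, p, s):
    """Spatial output size of a convolution along one dimension."""
    size = (w - k + 2 * p) / s + 1
    if not float(size).is_integer():
        raise ValueError("stride and padding do not tile the input symmetrically")
    return int(size)

# A 7-wide input with a 3-wide kernel, padding 1 and stride 2 gives a 4-wide output.
print(conv_output_size(7, 3, 1, 2))   # -> 4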
Parameter sharing
A parameter sharing scheme is used in convolutional layers to control the number
of free parameters. It relies on the assumption that if a patch feature is useful to
compute at some spatial position, then it should also be useful to compute at other
positions.
Denoting a single 2-dimensional slice of depth as a depth slice, the neurons
in each depth slice are constrained to use the same weights and bias.
Since all neurons in a single depth slice share the same parameters, the forward
pass in each depth slice of the convolutional layer can be computed as a
convolution of the neuron's weights with the input volume. Therefore, it is
common to refer to the sets of weights as a filter (or a kernel), which is
convolved with the input. The result of this convolution is an activation map, and
the set of activation maps for each different filter are stacked together along the
depth dimension to produce the output volume.
Parameter sharing contributes to the translation invariance of the CNN
architecture. Sometimes, the parameter sharing assumption may not make sense.
This is especially the case when the input images to a CNN have some specific
centered structure; for which we expect completely different features to be
learned on different spatial locations. One practical example is when the inputs

are faces that have been centered in the image: we might expect different eye-
specific or hair-specific features to be learned in different parts of the image. In
that case it is common to relax the parameter sharing scheme, and instead simply
call the layer a "locally connected layer".
Pooling layer
Another important concept of CNNs is pooling, which is a form of non-linear
down- sampling. There are several non-linear functions to implement pooling,
where max pooling is the most common. It partitions the input image into a set of
rectangles and, for each such sub-region, outputs the maximum.
Intuitively, the exact location of a feature is less important than its rough location
relative to other features. This is the idea behind the use of pooling in
convolutional neural networks. The pooling layer serves to progressively reduce
the spatial size of the representation, to reduce the number of parameters,
memory footprint and amount of computation in the network, and hence to also
control overfitting. This is known as down-sampling. It is common to periodically
insert a pooling layer between successive convolutional layers (each one
typically followed by an activation function, such as a ReLU layer) in a CNN
architecture. While pooling layers contribute to local translation invariance, they
do not provide global translation invariance in a CNN, unless a form of global
pooling is used. The pooling layer commonly operates independently on every
depth, or slice, of the input and resizes it spatially. A very common form of max
pooling is a layer with filters of size 2×2, applied with a stride of 2, which
subsamples every depth slice in the input by 2 along both width and height,
discarding 75% of the activations.

In this case, every max operation is over 4 numbers. The depth dimension
remains unchanged (this is true for other forms of pooling as well).
In addition to max pooling, pooling units can use other functions, such as
average pooling or ℓ2-norm pooling. Average pooling was often used

historically but has recently fallen out of favor compared to max pooling, which
generally performs better in practice.

Due to the effects of fast spatial reduction of the size of the representation,
there is a recent trend towards using smaller filters or discarding pooling
layers altogether.

"Region of Interest" pooling (also known as RoI pooling) is a variant of max
pooling, in which the output size is fixed and the input rectangle is a
parameter.[68] For example, a region proposal of size 7x5 may be pooled to a fixed
2x2 output. Pooling is an important component of convolutional neural networks for
object detection based on the Fast R-CNN architecture.

Figure 4.5: Layers in CNN

4.5.1 ReLU Layer


ReLU is the abbreviation of rectified linear unit, which applies the non-saturating
activation function. It effectively removes negative values from an activation
map by setting them to zero. It introduces nonlinearities to the decision function
and in the overall network without affecting the receptive fields of the
convolution layers.

Other functions can also be used to increase nonlinearity, for example the
saturating hyperbolic tangent, and the sigmoid function. ReLU is often preferred
to other functions because it trains the neural network several times faster without
a significant penalty to generalization accuracy.

Figure 4.5.1: ReLU Activation Function
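
A tiny NumPy illustration of the ReLU non-linearity next to the saturating
alternatives mentioned above is given below; the input values are arbitrary.

import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])

relu = np.maximum(0.0, x)           # negative activations are clipped to zero
tanh = np.tanh(x)                   # saturating alternative
sigmoid = 1.0 / (1.0 + np.exp(-x))  # saturating alternative

print(relu)                         # the negative entries become 0.0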

4.5.2 Fully Connected Layer


After several convolutional and max pooling layers, the final classification is
done via fully connected layers. Neurons in a fully connected layer have
connections to all activations in the previous layer, as seen in regular (non-
convolutional) artificial neural networks. Their activations can thus be
computed as an affine transformation, with matrix multiplication followed by
a bias offset (vector addition of a learned or fixed bias term).

Figure 4.5.2: Fully Connected Layers

4.5.3 Loss Layer


The "loss layer", or "loss function", specifies how training penalizes the
deviation between the predicted output of the network, and the true data labels
(during supervised learning). Various loss functions can be used, depending on
the specific task. The SoftMax loss function is used for predicting a single class
of K mutually exclusive classes. Sigmoid cross-entropy loss is used for predicting
K independent probability values in [0, 1]. Euclidean loss is used for regressing
to real-valued labels.
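For concreteness, the following is a minimal NumPy sketch (not part of the project code) of the softmax cross-entropy loss for one example with K mutually exclusive classes:

import numpy as np

def softmax_cross_entropy(logits, true_class):
    """Softmax (cross-entropy) loss for a single example with K mutually exclusive classes."""
    shifted = logits - logits.max()                   # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()   # softmax turns scores into probabilities
    return -np.log(probs[true_class])                 # penalize low probability on the true label

logits = np.array([2.0, 1.0, 0.1])
print(softmax_cross_entropy(logits, true_class=0))    # ≈ 0.42, small because class 0 already scores highest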

Figure 4.5.3: Loss Layer in CNN

4.5.4 Hyperparameters
CNNs use more hyperparameters than a standard multilayer perceptron (MLP).
While the usual rules for learning rates and regularization constants still apply,
the following should be kept in mind when optimizing.

4.5.5 Number of Filters


Since feature map size decreases with depth, layers near the input layer tend to
have fewer filters while higher layers can have more. To equalize computation at
each layer, the product of feature values v_a with pixel position is kept roughly
constant across layers. Preserving more information about the input would
require keeping the total number of activations (number of feature maps times
number of pixel positions) non-decreasing from one layer to the next.
The number of feature maps directly controls the capacity and depends on the
number of available examples and task complexity.
Filter size
Common filter sizes found in the literature vary greatly, and are usually chosen
based on the data set.
In modern CNNs, max pooling is typically used, and often of size 2×2, with a
stride of 2. This implies that the input is drastically down sampled, further
improving the computational efficiency.
Very large input volumes may warrant 4×4 pooling in the lower layers. However,
choosing larger shapes will dramatically reduce the dimension of the signal, and
may result in excess information loss. Often, non-overlapping pooling windows
perform best.
Translation Equivariance
It is commonly assumed that CNNs are invariant to shifts of the input. However,
convolution or pooling layers within a CNN that do not have a stride greater than
one are equivariant, as opposed to invariant, to translations of the input. Layers
with a stride greater than one ignore the Nyquist-Shannon sampling theorem and
lead to aliasing of the input signal, which breaks the equivariance (also referred to
as covariance) property. Furthermore, if a CNN makes use of fully connected
layers, translation equivariance does not imply translation invariance, as the fully
connected layers are not invariant to shifts of the input. One solution for complete
translation invariance is avoiding any down-sampling throughout the network and
applying global average pooling at the last layer. Additionally, several other partial
solutions have been proposed, such as anti-aliasing, spatial transformer networks,
data augmentation, subsampling combined with pooling, and capsule neural
networks.
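A minimal sketch of the global average pooling mentioned above: each feature map is reduced to a single average value, so the output no longer depends on where in the map a feature was activated.

import numpy as np

def global_average_pool(feature_maps):
    # feature_maps: (depth, height, width) -> one value per depth slice
    return feature_maps.mean(axis=(1, 2))

fmaps = np.random.rand(32, 12, 12)          # 32 feature maps of size 12x12
print(global_average_pool(fmaps).shape)     # (32,)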

Figure 4.5.5: Working of Filter in CNN

4.6 Applications of CNN


4.6.1 Image Recognition

CNNs are often used in image recognition systems. In 2012 an error rate of
0.23% on the MNIST database was reported. Another paper on using CNN for
image classification reported that the learning process was "surprisingly fast"; in
the same paper, the best published results as of 2011 were achieved in the MNIST

database and the NORB database. Subsequently, a similar CNN called AlexNet
won the ImageNet Large Scale Visual Recognition Challenge 2012.
When applied to facial recognition, CNNs achieved a large decrease in error
rate. Another paper reported a 97.6% recognition rate on "5,600 still images
of more than 10 subjects". CNNs were used to assess video quality in an objective
way after manual training; the resulting system had a very low root mean square
error. The ImageNet Large Scale Visual Recognition Challenge is a benchmark
in object classification and detection, with millions of images and hundreds of
object classes. In the ILSVRC 2014, a large-scale visual recognition
challenge, almost every highly ranked team used a CNN as its basic framework.
The winner, GoogLeNet (the foundation of DeepDream), increased the mean
average precision of object detection to 0.439329, and reduced classification
error to 0.06656, the best result to date. Its network applied more than 30 layers.
That performance of convolutional neural networks on the ImageNet tests was
close to that of humans. The best algorithms still struggle with objects that are
small or thin, such as a small ant on a stem of a flower or a person holding a quill
in their hand. They also have trouble with images that have been distorted with
filters, an increasingly common phenomenon with modern digital cameras. By
contrast, those kinds of images rarely trouble humans. Humans, however, tend to
have trouble with other issues. For example, they are not good at classifying
objects into fine-grained categories such as the particular breed of dog or species
of bird, whereas convolutional neural networks handle this. In 2015 a many-
layered CNN demonstrated the ability to spot faces from a wide range of angles,
including upside down, even when partially occluded, with competitive
performance. The network was trained on a database of 200,000 images that
included faces at various angles and orientations and a further 20 million images
without faces. They used batches of 128 images over 50,000 iterations.

4.6.2 Video Analysis
Compared to image data domains, there is relatively little work on applying
CNNs to video classification. Video is more complex than images since it has
another (temporal) dimension. However, some extensions of CNNs into the video
domain have been explored. One approach is to treat space and time as equivalent
dimensions of the input and perform convolutions in both time and space. Another
way is to fuse the features of two convolutional neural networks, one for the
spatial and one for the temporal stream. Long short-term memory (LSTM)
recurrent units are typically incorporated after the CNN to account for inter-frame
or inter-clip dependencies. Unsupervised learning schemes for training spatio-
temporal features have been introduced, based on Convolutional Gated Restricted
Boltzmann Machines and Independent Subspace Analysis.
4.6.3 Natural Language Processing
CNNs have also been explored for natural language processing. CNN models are
effective for various NLP problems and achieved excellent results in semantic
parsing, search query retrieval, sentence modeling, classification, prediction and
other traditional NLP tasks.
Anomaly Detection

A CNN with 1-D convolutions was used on time series in the frequency domain
(spectral residual) by an unsupervised model to detect anomalies in the time
domain.

4.6.4 Drug Discovery


CNNs have been used in drug discovery. Predicting the interaction between
molecules and biological proteins can identify potential treatments. In 2015,
Atomwise introduced AtomNet, the first deep learning neural network for
structure-based rational drug design. The system trains directly on 3-dimensional
representations of chemical interactions. Similar to how
image recognition networks learn to compose smaller, spatially proximate
features into larger, complex structures, AtomNet discovers chemical features,
such as aromaticity, sp3 carbons and hydrogen bonding.
Subsequently, AtomNet was used to predict novel candidate biomolecules for
multiple disease targets, most notably treatments for the Ebola virus and multiple
sclerosis.

4.6.5 Go
CNNs have been used in computer Go. In December 2014, Clark and Storkey
published a paper showing that a CNN trained by supervised learning from a
database of human professional games could outperform GNU Go and win some
games against Monte Carlo tree search Fuego 1.1 in a fraction of the time it took
Fuego to play. Later it was announced that a large 12-layer convolutional neural
network had correctly predicted the professional move in 55% of positions,
equaling the accuracy of a 6 dan human player. When the trained convolutional
network was used directly to play games of Go, without any search, it beat the
traditional search program GNU Go in 97% of games, and matched the
performance of the Monte Carlo tree search program Fuego simulating ten
thousand playouts (about a million positions) per move.
AlphaGo, the first program to beat the best human player at the time, used a pair
of CNNs to drive Monte Carlo tree search: a "policy network" for choosing moves
to try and a "value network" for evaluating positions.
Time series forecasting
Recurrent neural networks are generally considered the best neural network
architectures for time series forecasting (and sequence modeling in general), but
recent studies show that convolutional networks can perform comparably or even
better. Dilated convolutions might enable one-dimensional convolutional neural
networks to effectively learn time series dependences. Convolutions can be
implemented more efficiently than RNN-based solutions, and they do not suffer

from vanishing (or exploding) gradients. Convolutional networks can provide an
improved forecasting performance when there are multiple similar time series to
learn from. CNNs can also be applied to further tasks in time series analysis (e.g.,
time series classification or quantile forecasting).

4.6.6 Cultural Heritage and 3D-Datasets

As archaeological findings like clay tablets with cuneiform writing are
increasingly acquired using 3D scanners, the first benchmark datasets are becoming
available, such as HeiCuBeDa, which provides almost 2,000 normalized 2D and
3D datasets prepared with the GigaMesh Software Framework. Curvature-based
measures are used in conjunction with Geometric Neural Networks (GNNs), e.g.,
for period classification of those clay tablets, which are among the oldest documents
of human history.

4.6.7 Fine Tuning

For many applications, only limited training data is available. Convolutional neural
networks usually require a large amount of training data in order to avoid
overfitting. A common technique is to train the network on a larger data set from
a related domain.
Human interpretable explanations
End-to-end training and prediction are common practice in computer vision.
However, human interpretable explanations are required for critical systems such
as a self-driving car. With recent advances in visual salience, spatial and temporal
attention, the most critical spatial regions/temporal instants could be visualized to
justify the CNN predictions.

Figure 4.6.7: Fine Tuning

4.7 Related Networks
4.7.1 Deep Belief Networks

Convolutional deep belief networks (CDBN) have structure very similar to


convolutional neural networks and are trained similarly to deep belief networks.
Therefore, they exploit the 2D structure of images, like CNNs do, and make use
of pre-training like deep belief networks. They provide a generic structure that
can be used in many image and signal processing tasks. Benchmark results on
standard image datasets like CIFAR have been obtained using CDBNs.

Figure 4.7.1: Architecture of DBN

4.7.2 Deep Q-Networks

A deep Q-network (DQN) is a type of deep learning model that combines a deep
neural network with Q-learning, a form of reinforcement learning. Unlike earlier
reinforcement learning agents, DQNs that utilize CNNs can learn directly from
high-dimensional sensory inputs via reinforcement learning.

Preliminary results were presented in 2014, with an accompanying paper in


February 2015. The research described an application to Atari 2600 gaming.
Other deep reinforcement learning models preceded it.

Figure 4.7.2: Architecture of DQN

5. MODULES

The system has been developed in four different Modules.


Each module has its own unique functionality.

MODULE-1
All the libraries and packages required for smooth execution are included and
defined in this module.

MODULE-2
Capturing of images and setting them in frames is done in this module.

MODULE-3
Analysis of the image captured in the previous module and detecting the emotion
as happy, sad, angry, etc.

MODULE-4
Recommending a song based on the mood detected in the previous module.

5.1 Module – 1
All the libraries and packages required for smooth execution are included and
defined in this module.

Figure 5.1: Module - 1

The first module provides the framework and acts as the foundation for the
other three modules.

CV2 Library
1. The first library imported is cv2 library.
2. CV stands for “Computer Vision.”
3. It is an OpenCV tool, where all the image processing takes place.
4. It is used for performing various operations on an image, such as:
a. Capturing
b. Deleting
c. Importing
d. RGB to Greyscale conversion

NumPy Library
1. The next library imported is NumPy.
2. It is an open-source Numerical Python library.
3. It is used for dealing with the audio files (songs), which are basically
stored as numbers in the computer system.
4. It contains multi-dimensional array and matrix data structures.
5. It performs various mathematical operations on arrays, such as statistical,
trigonometric and algebraic routines (a short illustrative sketch follows).
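A short sketch of such routines; the tempo values below are invented for illustration and are not taken from the project's dataset:

import numpy as np

tempos = np.array([120.0, 95.5, 140.0, 128.0])   # illustrative tempo values only
print(np.mean(tempos), np.std(tempos))           # statistical routines
print(np.sin(np.radians(tempos)))                # a trigonometric routine applied element-wise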

DeepFace Model
1. It is a pre-trained model which is used for analyzing the image and
identifying the emotion of the person from the image captured.
2. It is a facial recognition system used by Facebook for tagging images.
3. It is included in our system using the line
“from deepface import DeepFace”
4. It uses the concept of CNNs to identify the key features on the detected
face and then deduce various conclusions such as the race, age, skin
color and emotions of the person.
5. It was proposed by researchers at Facebook AI Research (FAIR) at the
2014 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR).

matplotlib.pyplot
1. It is a collection of functions that make matplotlib work like MATLAB.
2. Each pyplot function makes some change to a figure, such as creating a
plotting area or plotting a line in the plotting area.
3. We have used it for plotting the image inside a specified frame.

Face Cascade
1. It is used for detecting the frontal face in a frame.
2. The “Haar Cascade” algorithm is applied to the image captured using cv2
to detect the face of the user.
3. Haar features traverse the entire image and detect the face of the user in
the frame (an illustrative sketch of how the cascade is applied is shown
below).
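The following sketch shows how the loaded cascade is typically applied in OpenCV; the detectMultiScale parameters shown are common defaults chosen for illustration and are not values taken from the project code:

import cv2

faceCascade = cv2.CascadeClassifier(cv2.data.haarcascades +
    'haarcascade_frontalface_default.xml')

img = cv2.imread('opencv_frame_0.png')           # frame saved in Module 2
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)     # Haar cascades work on greyscale images
faces = faceCascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    # draw a rectangle around each detected face
    cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)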

Pandas
1. It is an open-source Python package.
2. It is used for data analysis and Machine Learning tasks.
3. We are using it for accessing the .csv file that contains the list
of songs from which the user will get recommendations.
4. Pandas objects rely heavily on NumPy objects.
5. Essentially, Pandas extends NumPy.

“df=pd.read_csv(“Top 2018.csv”)”
1. It accesses the csv file named “Top 2018”, which contains the list of
top 100 songs released in the year 2018.
2. We can make use of any list of songs by mentioning the file’s name in
place of the above-mentioned file name.
3. The songs in the file are segregated based on different types of moods.
4. So, when the mood of the user is identified, a song from the list is
recommended that matches the genre of the user’s mood.

“df.drop(df.columns.difference([‘name’,‘artist’,‘tempo’]),1,inplace=True)”
1. It is used to remove all the irrelevant columns from the file, in order to
reduce unnecessary noise in the data (an equivalent call written for newer
pandas releases is sketched below).
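As a sketch, the same preprocessing can be written with the axis passed by keyword, which newer pandas releases require; the file name follows the program listing in Section 6:

import pandas as pd

df = pd.read_csv('top2018.csv')
# keep only the columns relevant for the recommendation step
df.drop(df.columns.difference(['name', 'artist', 'tempo']), axis=1, inplace=True)
print(df.head())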

5.2 Module – 2
Capturing of images and setting them in frames is done in this module.

Figure 5.2: Module – 2

“cam=cv2.VideoCapture(0)”
1. CV2 has already been imported.
2. We are using a variable ‘cam’ to capture an image using the
VideoCapture(0) function.
3. This returns the video stream from the webcam to the computer.
4. The capture object is stored in ‘cam’.

“cv2.namedWindow(“test”)”
1. The window in which the image is captured is named as “test”.
2. img_counter=0 is to count the image(s).

While Loop
1. The while loop is used to resize the window frame and test whether the
frame is captured or not.
2. ret, frame = cam.read() reads the image and stores it in a variable
called frame.

“cv2.imshow(“test”, frame)”
1. It is the function used for displaying an image in a window.

“k=cv2.waitKey(1)”
1. It is a keyboard binding function.
2. Its argument is the time in milliseconds.

“img_name= “opencv_frame_{}.png”.format(img_counter)”
1. The format() method formats the specified value and inserts it inside the
string’s placeholder.
2. “{}” → Placeholder.

“cv2.imwrite()”
1. This function is to save an image.
2. First argument is filename.
3. Second argument is the image we wish to save.

cam.release()→ It closes the capturing device.

cv2.destroyAllWindows()→ It destroys and deletes all the windows created.

img=cv2.imread(‘opencv_frame_0.png’)→ To read the image.

plt.imshow(…)→ It displays the image data as a 2D figure.

cv2.cvtColor(…)→ It converts colour from BGR to RGB.

5.3 Module – 3
Analysis of the image captured in the previous module and detecting the emotion
as happy, sad, angry, etc.

Figure 5.3: Module – 3

“predictions = DeepFace.analyze(img)”
1. predictions is simply a variable that stores the result of the analysis; it is
not a Python keyword.
2. DeepFace is a pre-trained facial recognition model.
3. “analyze(img)” → Its function is to analyse the image captured and stored
in img.
4. It analyses the image and predicts the age, race, skin colour and emotions
of the user.
5. The output is displayed as “Dominant Emotion:” (a short sketch of reading
the dominant emotion from the result is shown below).
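The following sketch shows one way of reading the dominant emotion out of the result; note that older versions of the deepface library return a single dictionary while newer ones return a list of dictionaries, so the exact access pattern is an assumption that depends on the installed version:

import cv2
from deepface import DeepFace

img = cv2.imread('opencv_frame_0.png')           # frame saved in Module 2
predictions = DeepFace.analyze(img)
# older deepface versions return a single dict, newer ones a list of dicts
result = predictions[0] if isinstance(predictions, list) else predictions
emotion = result['dominant_emotion']             # e.g. 'happy', 'sad', 'angry', ...
print("Dominant Emotion:", emotion)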

5.4 Module – 4
Recommending a song based on the mood detected in the previous module.

Figure 5.4: Module – 4

1. Songs corresponding to various moods like: happy, angry, sad, neutral,


surprise and disgust have been segregated and numbered.
2. When any one of the above six moods is identified, a song is recommended
randomly from the list of numbers that particular mood has been assigned.
3. For example, if the emotion is detected as “surprise”, a random integer
from 55 to 70 is picked by the random.randint function, and the song
corresponding to that number is suggested to the user.
4. Similarly, if other moods are detected different songs are recommended
related to the emotions of the user.

6. PROGRAM & EXECUTION

The following is the block of code for Module – 1:


Packages needed:
import cv2
import numpy as np
from deepface import DeepFace
import matplotlib.pyplot as plt
import pandas as pd

# load the Haar cascade used for frontal-face detection
faceCascade = cv2.CascadeClassifier(cv2.data.haarcascades +
    'haarcascade_frontalface_default.xml')

# read the song list and keep only the columns used for recommendation
df = pd.read_csv('top2018.csv')
df.drop(df.columns.difference(['id', 'name', 'artist', 'tempo']), axis=1, inplace=True)
OUTPUT

Figure 6.1: Output of 1st Module

The following is the block of code for Module – 2:
cam = cv2.VideoCapture(0)
cv2.namedWindow("test")
img_counter = 0
while True:
    ret, frame = cam.read()
    if not ret:
        print("failed to grab frame")
        break
    cv2.imshow("test", frame)
    k = cv2.waitKey(1)
    if k % 256 == 27:
        # ESC pressed
        print("Escape hit, closing...")
        break
    elif k % 256 == 32:
        # SPACE pressed
        img_name = "opencv_frame_{}.png".format(img_counter)
        cv2.imwrite(img_name, frame)
        print("{} written!".format(img_name))
        img_counter += 1
cam.release()
cv2.destroyAllWindows()
img = cv2.imread("opencv_frame_0.png")
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))

OUTPUT

Figure 6.2: Output of 2nd Module

The following is the block of code for Module – 3:

predictions = DeepFace.analyze(img)
# while analyze() runs, a progress bar is printed for each action,
# e.g. "Action: race: 100%| ... | 4/4"
predictions

OUTPUT

Figure 6.3: Output of 3rd Module

The following is the block of code for Module – 4:

import random

# 'emotion' is assumed to hold the dominant emotion detected in Module 3
if emotion == "happy":
    n = random.randint(0, 12)
    print(df.loc[[n]])
elif emotion == "angry":
    n = random.randint(13, 23)
    print(df.loc[[n]])
elif emotion == "sad":
    n = random.randint(24, 40)
    print(df.loc[[n]])
elif emotion == "neutral":
    n = random.randint(40, 54)
    print(df.loc[[n]])
elif emotion == "surprise":
    n = random.randint(55, 70)
    print(df.loc[[n]])
elif emotion == "disgust":
    n = random.randint(71, 91)
    print(df.loc[[n]])
else:
    n = random.randint(92, 101)
    print(df.loc[[n]])

OUTPUT

Figure 6.4: Output of 4th Module

7. FUTURE SCOPE
The music player we are using runs locally. As devices become more portable, the
emotion of a person could instead be captured by wearable sensors, which would be
easier to use than the current manual workflow. This could be achieved using GSR
(galvanic skin response) and PPG (photoplethysmography) physiological sensors,
which would give enough data to predict the mood of the user accurately. Such
enhancements would make the system more beneficial, and the system needs to be
constantly upgraded with advanced features. The methodology that enables
automatic playing of songs is driven by detection: facial expressions are detected
with the help of a programming interface present on the local machine. An
alternative extension would be to support the additional emotions that are
currently excluded from our system.

8. CONCLUSION
This project, “Music Recommendation based on Mood using Facial Recognition”,
is based on the emotions captured in real-time images of the user. It is designed
to enable better interaction between the music system and the user, because music
is helpful in changing the mood of the user and, for some people, acts as a stress
reliever. Recent developments show wide prospects for emotion-based music
recommendation systems. Thus, the present system provides face (expression)
based recognition so that it can detect the user’s emotion and recommend music
accordingly.
