Personality Based Music Recommendation System
All content following this page was uploaded by Abhishek Paudel on 05 December 2017.
By:
Abhishek Paudel (070/BCT/502)
Brihat Ratna Bajracharya (070/BCT/513)
Miran Ghimire (070/BCT/521)
Nabin Bhattarai (070/BCT/522)
AUGUST 1, 2017
TRIBHUVAN UNIVERSITY
INSTITUTE OF ENGINEERING
PULCHOWK CAMPUS
DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING
The undersigned certify that they have read, and recommended to the Institute of Engineering for acceptance, a project report entitled "Personality Based Music Recommendation System" submitted by Abhishek Paudel, Brihat Ratna Bajracharya, Miran Ghimire and Nabin Bhattarai in partial fulfillment of the requirements for the Bachelor's Degree in Computer Engineering.
DATE OF APPROVAL:
COPYRIGHT
The authors have agreed that the Library, Department of Electronics and Computer Engineering, Institute of Engineering, Pulchowk Campus may make this report freely available for inspection. Moreover, the authors have agreed that permission for extensive copying of this project report for scholarly purposes may be granted by the supervisors who supervised the project work recorded herein or, in their absence, by the Head of the Department wherein the project report was done. It is understood that recognition will be given to the authors of this project and to the Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering in any use of the material of this report. Copying, publication or any other use of this report for financial gain without the approval of the Department of Electronics and Computer Engineering, Institute of Engineering, Pulchowk Campus and the authors' written permission is strictly prohibited.
Request for permission to copy or to make any other use of the material in this report in whole or in part should be addressed to:
Head
Department of Electronics and Computer Engineering,
Institute of Engineering, Pulchowk Campus,
Lalitpur, Nepal
ACKNOWLEDGMENT
We would like to express our sincere gratitude to the Department of Electronics and Computer Engineering at the Institute of Engineering, Pulchowk Campus for providing us the opportunity to apply the knowledge gained over these years in a major project in our fourth year. We would also like to express our deepest sense of gratitude and thanks to our supervisor Mr. Daya Sagar Baral for providing invaluable insight and guidance for this project.
We would also like to thank all of our friends who have directly and indirectly helped us in doing this project. Last but not least, we express deep appreciation to our family members, who have been a constant source of inspiration for us.
Authors:
Abhishek Paudel
Brihat Ratna Bajracharya
Miran Ghimire
Nabin Bhattarai
ABSTRACT
Music is an integral part of our lives. We listen to music every day according to our taste and mood. With the advancement and increase in the volume of digital content, the choice for people to listen to diverse types of music has also increased significantly. Thus, delivering the most suitable music to listeners has become an interesting field of research in computer science. One important measure for delivering the best music to a listener could be his/her personality traits. In this project, we aim to discover the impact of personality traits on user-to-user collaborative filtering, one of the most popular recommendation techniques used today.
To determine the personality of a person, social media such as Facebook can be a useful platform, where people express their views on different matters and share their opinions and thoughts. Such expressions of thoughts and opinions can be leveraged to study the personality traits of a person, and this information can be used to try to enhance existing user-to-user collaborative filtering techniques for music recommendation. Personality traits of users can be studied in terms of the standard Big Five Personality Traits, defined as Openness to experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism [2]. With this project, we were able to determine that the personality of the user can be one of the crucial factors in music recommendation.
TABLE OF CONTENTS
TITLE PAGE i
LETTER OF APPROVAL ii
COPYRIGHT iii
ACKNOWLEDGMENT iv
ABSTRACT v
LIST OF FIGURES x
LIST OF TABLES xi
1 INTRODUCTION 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Scope of the Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.6 Understanding Of Requirement . . . . . . . . . . . . . . . . . . . . . . . . 4
1.7 Organization of the Report . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 LITERATURE REVIEW 6
3 THEORETICAL BACKGROUND 7
3.1 General . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Document Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2.1 Obtaining a dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2.4 Model Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2.5 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2.6 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.7 Multinomial Naive Bayes . . . . . . . . . . . . . . . . . . . . . . 15
4 METHODOLOGY 32
4.1 Requirement Specification . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1.1 Functional Requirement . . . . . . . . . . . . . . . . . . . . . . . 32
4.1.2 Non-functional Requirement . . . . . . . . . . . . . . . . . . . . . 32
4.2 Feasibility Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2.1 Operational Feasibility . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2.2 Technical Feasibility . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2.3 Economic Feasibility . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2.4 Legal Feasibility . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.5 Scheduling Feasibility . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Software Development Approach . . . . . . . . . . . . . . . . . . . . . . . 34
4.4 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5 SYSTEM DESIGN 37
5.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.3 Use Case Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.4 ER Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.5 Activity Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.6 Context Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.7 Data Flow Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.8 Front End of the System (User Interface) . . . . . . . . . . . . . . . . . . 47
5.9 Back End of the System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
7 RESULT 52
7.1 Big Five Personality Frequency Distribution . . . . . . . . . . . . . . . . . 52
7.2 Logistic Regression Model . . . . . . . . . . . . . . . . . . . . . . . . . . 52
7.3 Naive Bayes Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
7.4 Evaluation of Recommendation System . . . . . . . . . . . . . . . . . . . 55
7.4.1 Latent Factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
9 CONCLUSION 63
REFERENCES 64
APPENDIX A 67
APPENDIX B 69
List of Figures
7.7 RMSE of Collaborative Filtering combined with Global Baseline with User Rating Matrix . . . . . . 57
7.8 RMSE of Collaborative Filtering combined with Global Baseline with User Personality Matrix . . . . . . 58
7.9 RMSE of Collaborative Filtering with User Rating and Personality Matrix . . . 58
7.10 RMSE of Collaborative Filtering with User Rating and Personality Matrix combined with Global Baseline . . . . . . 59
List of Tables
LIST OF ABBREVIATIONS
1. INTRODUCTION
On the Internet, where the number of choices is overwhelming, there is a need to filter, prioritize and efficiently deliver relevant information in order to alleviate the problem of information overload, which has become a potential problem for many Internet users. Recommender systems solve this problem by searching through large volumes of dynamically generated information to provide users with personalized content and services.
Besides, these days social networks have become widely used and popular mediums for information dissemination as well as facilitators of social interaction. User contributions and activities provide valuable insight into individual behavior, experiences, opinions and interests. Considering that personality, which uniquely identifies each one of us, affects many aspects of human behavior, mental processes and affective reactions, there is an enormous opportunity to add new personality-based qualities to enhance current collaborative filtering recommendation engines.
Previous work has shown that the information in a user's social media account is reflective of their actual personality, not an idealized version of themselves, which makes Facebook, a social networking site with a broad user base, an ideal platform for studying the personality traits of a user. Several well-studied personality models have been proposed, among which the "Big Five Model", also known as the "Five Factor Model" (FFM), is the most popular [2].
1.1. Background
The Big Five Model of personality dimensions has emerged as one of the most well-researched and well-regarded measures of personality structure in recent years [2]. The model's five domains of personality (Openness, Conscientiousness, Extroversion, Agreeableness and Neuroticism) were conceived by Tupes and Christal [3] as the fundamental traits that emerged from analyses of previous personality tests. McCrae, Costa and John [4] continued five-factor model research and consistently found generality across age, gender and cultural lines.
The Big Five Model traits are characterized by questionnaire statements such as the following (sample items for Conscientiousness):
• I follow a schedule.
• I am exact in my work.
• I am always prepared.
1.2. Motivation
The growth in the amount of digital information and the number of users on the Internet has created a potential challenge of information overload, which hinders timely access to items of interest on the Internet. Thus the demand for better recommender systems has increased more than ever before. And music is essential to many of our lives. We listen to it when waking up, while in transit, at work, and with our friends. For many, music is like a constant companion. It can bring us joy and motivate us, accompany us through difficult times and alleviate our worries. Hence music is much more than mere entertainment; but, as stated earlier, the growth in the amount of digital information has created a potential challenge of information overload, where a recommendation engine plays a very crucial role in filtering the vital fragment out of a large amount of dynamically generated information according to a user's preferences, interests or observed behavior towards items. Hence, with this project, we attempt to devise a method to improve the collaborative filtering engine via the use of personality to compute similar users for music recommendation, as it is believed that people with similar personalities have similar tastes in music.
1.3. Objectives
The objectives of the project can be summarized in the points below:
1. To find out if the personality of the user can be a crucial factor in a music recommendation system.
2. To find out if a collaborative recommendation engine can be enhanced via the use of personality for similar-user computation.
Music is an essential part of human life. Music is a pleasant sound that leads us to experience harmony and greater happiness. With the advancement of technology, music has significantly progressed and increased in both quality and volume. The type of music people create and listen to differs by place and culture. Musical taste even differs from person to person, and across moods of the same person. So, it would be very useful if we could determine some method to find what kind of music a person might be interested in listening to, and use this finding to recommend music to him/her. Collaborative filtering is one of the most popular filtering techniques today, and with this project we aim to enhance it. For this, we have assumed that the personality of the user might be one of the key factors in his/her music listening habits. Hence, via this project, we aim to see if the personality of the user has any impact on collaborative filtering enhancement, assuming a correlation between personality and music listening habits exists.
The most important scope of the project is to discover whether the personality traits of an individual can be used to enhance the recommendation engine in order to provide more personalized content to the user as a recommendation.
Nowadays, digital data on the Internet is more massive than ever, which has created a potential challenge of information overload, hindering timely access to items of interest on the Internet. So, there is a requirement for a better recommendation system than ever before. Thus, with this project, we will try to find out if the recommendation engine can perform better if the personality of the individual is used as one of the metrics for recommendation.
Nowadays, social networking sites have become vastly popular among people of different religions, castes, ethnic groups and locations around the world, which shows how culturally diverse the people of the world are. And in this diversity, we can also see the variety in the personalities of people living in different parts of the world. The main purpose of social networking sites is to connect different people from different parts of the world, which makes them the most suitable platform for studying the personality traits of a user. The personality traits thus studied might be used in a recommendation engine in order to improve its efficiency.
1. Chapter 1: It includes the introduction to the problem and the method we are trying to employ to solve it.
2. Chapter 2: It includes the Literature Review, covering works related to the project and notable works prevailing prior to this project's development, with their results.
3. Chapter 3: It includes the theoretical background for the development of the project.
5. Chapter 5: It includes system design techniques along with the use case and activity diagrams used for the development of the system.
6. Chapter 6: It includes tools and technologies used for the development of the system.
7. Chapter 7: It includes the analysis and the result of the experiments we tried in the project.
2. LITERATURE REVIEW
Recommender systems are a rich research area with abundant practical applications. They are also defined as systems which promote the recommendation of people (normally seen as service providers) as well as the recommendation of products/services. In computing, recommender systems began to appear in the 1990s as applications that provide personalized advice for users about products or services they might be interested in [9].
In 2005, Gonzalez [8] proposed a first model based on psychological aspects; he used Emotional Intelligence to improve online course recommendations.
In 2008, a recommender system based on personality traits [10] was published, experimenting with a recommender system using personality. It basically tried to recommend a person in a voting scenario. Here the recommendation was based on the psychological aspects of candidates and of an imaginary person whom voters imagined as the ideal candidate. The system used the 30 facets of the Big Five personality traits, and only the Big Five personality traits, as the psychological measures of the users.
In 2014, "Improving Music Recommender Systems: What can we learn from research on music tastes?" [5] was published, which discusses music tastes from a psychological point of view and uses the psychology of music to identify the correlates of music tastes and to understand how music tastes are formed and evolve through time. It reveals the importance of social influences on music tastes and provides basic suggestions for the design of music recommender systems.
Also in 2014, "Enhancing Music Recommender Systems with Personality Information and Emotional States" [6] was published, which researches improving music recommendation by including personality and emotional states. The proposal offers great insight into how a recommendation engine can be improved with personality via a series of steps.
3. THEORETICAL BACKGROUND
3.1. General
The project tries to study the impact of personality on a collaborative recommendation engine. Thus, the personality of the user first has to be predicted, which can be done by studying the user on social media, i.e., Facebook, where status updates can be a good metric for predicting personality using "document classification" techniques; the predicted traits of the person are then used as the similarity metric for similar-user computation in "collaborative filtering".
3.2. Document Classification
Supervised document classification comprises a series of steps which are briefly described below.
3.2.1. Obtaining a dataset
The quality of the tagged dataset is by far the most important component of a statistical NLP classifier. The dataset needs to be large enough to have an adequate number of documents in each class. The dataset also needs to be of high enough quality in terms of how distinct the documents in the different categories are from each other, to allow a clear delineation between categories [11].
3.2.2. Preprocessing
• Data Cleaning: Data is cleansed through processes such as filling in missing values, smoothing noisy data or resolving inconsistencies in the data.
• Data Integration: Data with different representations are put together and conflicts within the data are resolved.
• Data Reduction: This step aims to present a reduced representation of the data in a data warehouse.
1. Removal of Stop Words: In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). There is no single definitive list of stop words which all tools use, and such a filter is not always used. Any group of words can be chosen as the stop words for a given purpose. By removing stop words during data preprocessing we reduce the computational complexity of the program, and hence the project can run more efficiently [12].
2. Convert all characters to lowercase: This step is carried out in order to remove the distinction between the same word written in upper and lower case, so that the model doesn't treat them differently.
5. PoS tagging: In corpus linguistics, part-of-speech tagging, also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph [14]. This aids in the removal of unwanted parts of speech in the sentences and helps to build a better model. The parts of speech used in our model are verb, adverb and adjective.
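The preprocessing steps above (lowercasing, stop-word removal, and keeping only verbs, adverbs and adjectives) can be sketched as below. This is an illustrative toy, not the project's actual code: the stop-word list and the mini part-of-speech lexicon are made-up stand-ins for a full stop-word list and a trained PoS tagger.

```python
import re

# Tiny illustrative stop-word list (a real pipeline would use a full list).
STOP_WORDS = {"i", "am", "the", "a", "an", "to", "and", "of", "in", "is"}

# Hypothetical mini PoS lexicon; keep only verbs (V), adverbs (R), adjectives (J).
POS_LEXICON = {"love": "V", "really": "R", "happy": "J", "music": "N", "loud": "J"}
KEEP_TAGS = {"V", "R", "J"}

def preprocess(status):
    # 1. Lowercase so "Happy" and "happy" are not treated differently.
    tokens = re.findall(r"[a-z']+", status.lower())
    # 2. Remove stop words to reduce computational complexity.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3. Keep only the parts of speech used by the model.
    return [t for t in tokens if POS_LEXICON.get(t) in KEEP_TAGS]

print(preprocess("I really LOVE loud music and I am happy"))
```

In a real pipeline the lexicon lookup would be replaced by a context-aware tagger (e.g. NLTK's), since a word's part of speech depends on its neighbors.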
1. Bag of Words: The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as a bag of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model is commonly used in methods of document classification where the frequency of occurrence of each word is used as a feature for training a classifier. It is comparable to the unigram language model [15].
2. Feature Vector Creation: Feature vector creation is the process of converting the bag-of-words model into vector form, whereby each word is represented by its frequency. For feature vector creation, a vocabulary is first built using all of the corpus available in the dataset, which helps to create a vector space model for words; a feature vector is then derived from each document accordingly [15].
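The two feature-extraction steps above can be sketched in a few lines: build a vocabulary over the whole corpus, then map each document to the count of every vocabulary word (order discarded, multiplicity kept). The two-document corpus is illustrative only.

```python
from collections import Counter

def build_vocabulary(corpus):
    # The vocabulary is the sorted set of all words appearing in the corpus.
    return sorted({word for doc in corpus for word in doc.split()})

def feature_vector(doc, vocab):
    # Bag of words: each position holds the frequency of one vocabulary word.
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

corpus = ["happy happy music", "sad quiet music"]
vocab = build_vocabulary(corpus)
print(vocab)                             # ['happy', 'music', 'quiet', 'sad']
print(feature_vector(corpus[0], vocab))  # [2, 1, 0, 0]
```

Note that "happy happy music" and "music happy happy" produce the same vector, which is exactly the order-insensitivity the bag-of-words model trades away for simplicity.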
3.2.4. Model Creation
A classification algorithm is used for the classification of personality in the project. Here we have implemented Naive Bayes and Logistic Regression as the classification algorithms.
3.2.5. Logistic Regression
In general, when we build machine-learning-based software, we are basically trying to come up with a function to predict the output for future inputs based on the experience gained through past inputs and their outputs. The past data is referred to as the training set.
Logistic regression [31] (also known as logit regression or the logit model) is one of the most popular machine learning algorithms used for classification problems. Given a training set having one or more independent (input) variables, where each input set belongs to one of several predefined classes (categories), what a logistic regression model tries to do is come up with a probability function that gives the probability of a given input set belonging to one of those classes. The basic logistic regression model is a binary classifier (having only 2 classes), i.e., it gives the probability of an input set belonging to one class instead of the other. If the probability is less than 0.5, we predict the input set to belong to the latter class. But logistic regression can be adapted to multi-class classification problems as well by using concepts like "one vs. rest": we create a classifier for each class that predicts the probability of an input set belonging to that particular class instead of all the other classes. Logistic regression is popular because it is a relatively simple algorithm that performs very well on a wide range of problem domains.
Logistic regression is in fact one of the techniques borrowed by machine learning from the field of statistics; it was developed by statistician David Cox in 1958. The binary logistic model is used to estimate the probability of a binary response based on one or more predictor (or independent) variables (called features).
The name "logistic" comes from the probability function used by this algorithm. The logistic function (also known as the sigmoid function) is defined as:

\mathrm{logistic}(x) = \mathrm{sigmoid}(x) = \frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}}    (3.1)
The logistic regression classifier uses the logistic function of the weighted (and biased) sum of the input variables to predict the probability of the input set belonging to a class (or category). The probability function itself is fixed. The only thing that can change while learning from different training sets is the set of weight parameters (θ) assigned to each feature.
Let the training set for this machine learning algorithm be the set of m training samples (examples):

\text{training set} = \left\{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)}) \right\}    (3.2)
Then the hypothesis function used to predict the output y for input feature set x with parameter θ is given by:

h_\theta(x) = \mathrm{sigmoid}\left( \sum_{i=0}^{n} \theta_i x_i \right) = \mathrm{sigmoid}(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}    (3.3)
Now the aim of this machine learning algorithm is to adjust the parameters θ to fit the hypothesis h_θ(x) to the real output y of the training set with minimum cost (error). For that, we need to define a cost function, preferably a convex one. There are different types of cost functions. Linear regression, for instance, uses the sum of the squares of the errors as the cost function. But in logistic regression, since the output is not linear (even though the input is), this cost function turns out to be non-convex, and there are no efficient algorithms that can minimize a non-convex function. Therefore, we define a logarithmic cost function J(θ) for logistic regression as follows:
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\left( h_\theta(x^{(i)}), y^{(i)} \right)    (3.4)
where,

\mathrm{Cost}(h, y) = \begin{cases} -\log(h) & \text{for } y = 1 \\ -\log(1 - h) & \text{for } y = 0 \end{cases} = -y \log(h) - (1 - y) \log(1 - h)    (3.5)
After we define an appropriate convex cost function J(θ), the machine learning algorithm basically boils down to finding the parameter θ that minimizes J(θ).
This can be achieved using various optimization algorithms; some notable ones are gradient descent, BFGS, Quasi-Newton, L-BFGS, etc. Gradient descent is the simplest: it is a hill-climbing optimization algorithm that tries to find a local optimum from the starting point. But since our cost function J(θ) is convex, there is only one minimum, and it is the global minimum.
To find the parameter θ that minimizes the cost function J(θ), we initialize θ with small random values and repeatedly apply the gradient descent update rule:

\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}; \quad j \in \{0, 1, 2, \dots, n\}; \quad \alpha = \text{learning rate}

where,

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}    (3.7)
Note that all θ_j's must be updated simultaneously. This is called batch learning, in contrast to online learning, where the parameters are updated separately for every training example.
The resulting parameter θ that minimizes the cost function J(θ) is the parameter of the learned model. We can then use the hypothesis h_θ(x) to predict the output y for any input feature set x. The output y will be a value in the range (0, 1), which can be interpreted as the probability of the given input set belonging to class 1 (the primary class).
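The training procedure described above can be sketched as the following toy batch-gradient-descent implementation. The single-feature dataset, learning rate and iteration count are illustrative choices, not the project's actual values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    # Hypothesis h_theta(x) = sigmoid(theta^T x), with a bias input x_0 = 1.
    return sigmoid(sum(t * xi for t, xi in zip(theta, [1.0] + x)))

def train(samples, alpha=0.5, iterations=2000):
    n = len(samples[0][0]) + 1          # parameters including the bias theta_0
    theta = [0.0] * n
    m = len(samples)
    for _ in range(iterations):
        grad = [0.0] * n
        for x, y in samples:
            err = predict(theta, x) - y  # h_theta(x) - y, as in the gradient
            for j, xj in enumerate([1.0] + x):
                grad[j] += err * xj
        # Batch learning: all theta_j updated simultaneously.
        theta = [t - alpha * g / m for t, g in zip(theta, grad)]
    return theta

# Toy training set: class 1 when the single feature is large.
data = [([0.0], 0), ([0.2], 0), ([0.8], 1), ([1.0], 1)]
theta = train(data)
print(predict(theta, [0.1]), predict(theta, [0.9]))
```

After training, inputs below the learned decision boundary give probabilities under 0.5 and inputs above it give probabilities over 0.5.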
Other advanced optimization algorithms such as BFGS, L-BFGS, Quasi-Newton, etc. are more efficient than basic gradient descent and also have the advantage that we don't have to manually select the learning rate (α); they automatically select an appropriate value of α to maximize efficiency.
3.2.5.2.1 Feature Scaling
Feature scaling is the process of scaling (or normalizing) all the features to the range [−1, 1] or [0, 1]. This is required because unscaled features cause some features to implicitly get higher priority, which reduces the accuracy of the learning algorithm. Feature scaling can be done in the following ways:

x_j^{(i)} = \frac{x_j^{(i)} - \min(x_j)}{\max(x_j) - \min(x_j)} \in [0, 1]

or

x_j^{(i)} = \frac{x_j^{(i)} - \bar{x}_j}{\max(x_j) - \min(x_j)} \in [-1, 1]

or

x_j^{(i)} = \frac{x_j^{(i)} - \bar{x}_j}{\sigma_{x_j}} \quad \text{(for normally distributed features)}
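The scaling formulas above can be sketched as follows, applied to one feature column with illustrative values:

```python
def min_max_scale(column):
    # (x - min) / (max - min): maps the column into [0, 1].
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def standardize(column):
    # (x - mean) / std: z-score, suited to roughly normal features.
    mean = sum(column) / len(column)
    std = (sum((v - mean) ** 2 for v in column) / len(column)) ** 0.5
    return [(v - mean) / std for v in column]

ages = [20.0, 30.0, 40.0]
print(min_max_scale(ages))   # [0.0, 0.5, 1.0]
print(standardize(ages))
```

Either way, every feature ends up on a comparable scale, so no feature implicitly dominates the weighted sum fed to the sigmoid.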
3.2.5.2.2 Regularization
Regularization [33] is the process of scaling down the values of the parameters θ to reduce the problem of over-fitting. Over-fitting is the condition when the learning algorithm fits the training set very precisely but does not fit test data (not included in the training set). Regularization is done by introducing a regularization term with parameter λ into the overall cost function J(θ):

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\left( h_\theta(x^{(i)}), y^{(i)} \right) + \frac{\lambda}{2m} \sum_{j=0}^{n} \theta_j^2    (3.8)
Regularization helps to solve the over-fitting problem in machine learning. A model that is too simple will be a very poor generalization of the data, while an overly complex model may not perform well on test data due to over-fitting. It is necessary to choose the right model between the simple and the complex. Regularization helps to choose the preferred model complexity, so that the model is better at predicting. Regularization is nothing but adding a penalty term to the objective function and controlling the model complexity through that penalty term.
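The penalty term of equation (3.8) can be sketched numerically as below; the unregularized cost passed in is a placeholder value, not a real computation.

```python
def regularized_cost(unregularized_cost, theta, lam, m):
    # L2 penalty (lambda / 2m) * sum(theta_j^2) added to the base cost:
    # larger parameter values are penalized, discouraging complex models.
    penalty = (lam / (2 * m)) * sum(t ** 2 for t in theta)
    return unregularized_cost + penalty

# With theta = [1, 2], lambda = 1 and m = 10: penalty = 5 / 20 = 0.25.
print(regularized_cost(0.5, [1.0, 2.0], lam=1.0, m=10))   # 0.75
```

Increasing λ pushes the optimizer toward smaller θ values (a simpler model); λ = 0 recovers the unregularized cost (3.4).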
3.2.6. Naive Bayes
Naive Bayes is one of the models used for classification under the "Bayesian classifier" family. In machine learning, Naive Bayes classifiers are a family of simple probabilistic classifiers based on Bayes' theorem with the assumption of "independence" between the features. (If dependence between the features exists, a Bayesian network is used for classification instead.) The major advantages of Naive Bayes are its simplicity, high scalability and ability to work on huge datasets. Despite the oversimplified assumption, Naive Bayes has worked quite well in many complex real-world situations. Naive Bayes classifiers are probabilistic, which means that they calculate the probability of each category for a given sample and output the category with the highest probability. A Naive Bayes text classifier is comparable to a unigram language model built for each class. It is widely used for text classification in various fields such as email sorting, language detection, etc.
There are various variations of Naive Bayes:
1. Multi-variate Bernoulli Naive Bayes: It is used whenever the feature vectors are binary, i.e., the occurrence of a feature is important rather than its count.
2. Multinomial Naive Bayes: It is typically used for discrete counts, i.e., whenever the frequency of occurrence of features is important.
3. Gaussian Naive Bayes: In this model, it is assumed that the features follow a normal distribution. Instead of discrete counts, there are continuous features.
In this project, multinomial Naive Bayes is used as the classifier, since for personality prediction the frequency of occurrence of each feature in the feature vector is important and the distribution of the features is discrete.
3.2.7. Multinomial Naive Bayes
In order to understand how Naive Bayes classifiers [16] work, it is important to briefly understand Bayes' rule. The probability model was formulated by Thomas Bayes.
Given the set of features x = (x_1, x_2, x_3, \dots, x_n), Bayes' theorem can be written mathematically as:

P(C_k|x) = \frac{P(C_k) \, P(x|C_k)}{P(x)}    (3.9)

where,
P(C_k|x) is the posterior probability of class C_k given the attributes x,
P(C_k) is the prior probability of the class,
P(x|C_k) is the likelihood, i.e., the conditional probability of the attributes given the class C_k,
P(x) is called the evidence, and
k is used to denote the class label.
Naive Bayes makes the independence assumption, so that (3.9) can be written as:

P(C_k|x) \propto P(C_k) \prod_{i=1}^{n} P(x_i|C_k)    (3.10)

which is the required equation of Naive Bayes used for the classification of documents.
In statistics, additive smoothing [17], also called Laplace smoothing, is a technique used to smooth categorical data. Given an observation x = (x_1, x_2, \dots, x_d) from a multinomial distribution with N trials and parameter vector θ = (θ_1, θ_2, \dots, θ_d), a smoothed version of the data gives the estimator:

\hat{\theta}_i = \frac{x_i + \alpha}{N + \alpha d}    (3.11)

When α = 1 in (3.11), it is called add-one Laplace smoothing, which has been used as the smoothing technique in this project in order to cancel out the effect of zero-count terms by assigning them a small probability.
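Equation (3.11) with α = 1 reduces to a one-line function; the counts in the example are illustrative, not real model statistics.

```python
def smoothed_probability(count, total, vocab_size, alpha=1):
    # Add-one Laplace smoothing: (x_i + alpha) / (N + alpha * d).
    return (count + alpha) / (total + alpha * vocab_size)

# A word seen 0 times in a class with 8 total word occurrences and a
# vocabulary of 12 words gets probability 1/20 instead of 0.
print(smoothed_probability(0, 8, 12))   # 0.05
```

Without smoothing, a single unseen word would zero out the whole product in (3.10); with it, every word keeps a small non-zero probability.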
3.2.7.3 Underfitting
Underfitting [16] in the Naive Bayes classifier can occur if the probabilities resulting from the conditional and prior terms are very small. In this case, in order to prevent numerical underflow resulting from the multiplication of very small terms, logarithms can be applied to (3.10), after which the final equation becomes:

\log P(C_k|x) \propto \log P(C_k) + \sum_{i=1}^{n} \log P(x_i|C_k)    (3.12)

which is the final equation used in the project for the classification of a user's Facebook statuses into personality classes.
3.2.7.4 Overfitting

In order to reduce overfitting and find the best model for the classifier, the k-fold cross-validation technique has been used. The major advantage of this method is that all observations are used for both training and testing, and each observation is used for testing exactly once [18].
In the project, 5-fold cross-validation has been applied, in which the data set is divided into 5 train/test splits and the classifier is trained and evaluated on each of them.
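The splitting step can be sketched without any library, as below; the data-set size of 10 is illustrative:

```python
# Sketch of k-fold cross-validation index splitting: each observation
# appears in exactly one test fold, and in the training set of every
# other fold.

def k_fold_indices(n, k=5):
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

splits = list(k_fold_indices(10, k=5))
```

In practice the classifier is trained on each `train` index set and scored on the corresponding `test` set, and the scores are averaged.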
3.2.7.5 Optimization

The Naive Bayes classifier, as seen in 3.10, classifies a feature set into a class via the multiplication of the prior and conditional probabilities, which would otherwise require their computation every time the classifier tries to classify a feature set.
In order to solve this problem, the conditional and prior probabilities are precomputed and stored in a hash table [16], from which the conditional probability of each feature can easily be retrieved and used for classification. Here the hash table has been implemented as a dictionary object in Python, since at a low level Python dictionaries are stored as hash tables in memory.
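The precomputation idea can be sketched as below; the word counts and class names are illustrative, not the project's actual vocabulary:

```python
import math

# Sketch of precomputing smoothed log conditional probabilities into a
# dict (Python's built-in hash table) so that classification becomes a
# cheap lookup-and-sum instead of a recomputation.

counts = {("great", "extrovert"): 30, ("party", "extrovert"): 20,
          ("great", "introvert"): 10, ("book", "introvert"): 40}
class_totals = {"extrovert": 50, "introvert": 50}
vocab_size = 3

log_cond = {
    (word, cls): math.log((n + 1) / (class_totals[cls] + vocab_size))
    for (word, cls), n in counts.items()
}

# Classification of a status now just sums cached values:
score = log_cond[("great", "extrovert")] + log_cond[("party", "extrovert")]
```

The cost of the probability arithmetic is paid once at training time; each classification afterwards is a handful of dictionary lookups.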
After the detection of the personality, this information is used as a metric for the computation of similar users in the recommendation engine (collaborative filtering) in order to observe its effect on the recommendation.
Data analysis [19] is a primary component of data mining and business intelligence and is key to gaining the insight that drives business decisions. Data analysis is a proven way for organizations and enterprises to gain the information they need to make better decisions, serve their customers and increase productivity and revenue. Besides, with the growth of the Internet, there is a vast amount of digital data and information available, and data analysis has become more necessary than ever. Some of the data analysis techniques are:
• Descriptive: An analysis technique that uses aggregation and data mining to provide insight into the past and answer "What has happened?". It involves the calculation of simple measures of the composition and distribution of variables, and is often used to describe relationships in data, such as total stock in inventory, average money spent per customer, etc.
• Predictive: The process of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends. It encompasses a variety of statistical techniques from predictive modeling, machine learning and data mining, such as predicting what items customers will purchase together, or how sales might close at the end of the year.
Recommender systems are information filtering systems that deal with the problem of information overload by filtering vital information fragments out of a large amount of dynamically generated information according to a user's preferences, interests, or observed behavior about an item. A recommender system has the ability to predict whether a particular user would prefer an item or not based on the user's profile [23].
Recommender systems typically produce a list of recommendations in one of two ways: through collaborative or content-based filtering. Collaborative filtering approaches build a model from a user's past behavior (items previously purchased or selected, and numerical ratings given to those items) as well as similar decisions made by other users. This model is then used to predict items (or ratings for items) that the user may have an interest in. Content-based filtering approaches utilize a series of discrete characteristics of an item (an item profile) together with the purchase history of the user (a user profile) in order to recommend items. A hybrid recommender system is one in which two or more recommender systems are combined for the recommendation. Besides these, there are several categorizations of recommendation systems, which are listed below:
A knowledge-based recommendation system [22] is based on explicit knowledge about item classification, user interests and recommendation criteria (which item should be recommended for which features). It is an alternative approach to collaborative filtering.
3.4.1.2 Pros
3.4.1.3 Cons
A utility-based recommender system makes recommendations based on the calculation of the utility of each item for the user. Utility-based recommender techniques use a multi-attribute utility function, based on the item ratings that users offer, to describe user preferences, and apply the utility function to calculate the item utility for each user.
• Recommendation: Compute the utility of each object for the user and recommend
accordingly.
3.4.2.2 Pros
• No ramp-up required.
3.4.2.3 Cons
The demographic recommendation technique [21] uses information about the user only. The demographic attributes of users include gender, age, knowledge of language, disabilities, ethnicity, mobility, employment status, home ownership and even location. The system recommends items according to demographic similarities of the users.
1. User profile creation: A user profile is created based on the user's demographic information.
2. User-item matrix construction: The user-item rating matrix is constructed based on the ratings of items by the users.
3. Recommendation: In order to recommend an item to a user, similar users are computed with the help of cosine similarity; then the rating for that item by that user is computed with the help of the ratings of the neighborhood of similar users (average or weighted average).
3.4.3.2 Pros
3.4.3.3 Cons
1. Item profile creation: Initially, an item profile is created with the help of the item's features. In the case of movies or music, the available metadata can be used for item profile creation.
2. User profile creation: A user profile is created based on the user's interaction with the items, i.e. with the help of their ratings on the items. Hence the user profile is created with the help of the item profiles, either by taking the average of the item profiles or a weighted average of the item profiles.
3.4.4.2 Pros
3.4.4.3 Cons
Content-based filtering can outperform collaborative filtering whenever the ratio of items to users is very high.
• Memory Based: This approach uses user rating data to compute the similarity between users or items, which is then used for making recommendations. This was an early approach used in many commercial systems; it is effective and easy to implement. Typical examples of this approach are neighborhood-based CF and item-based/user-based top-N recommendations. The user-based top-N recommendation algorithm uses a similarity-based vector model to identify the k most similar users to an active user. After the k most similar users are found, their corresponding user-item matrices are aggregated to identify the set of items to be recommended. The advantages of this approach include: the explainability of the results, which is an important aspect of recommendation systems; easy creation and use; easy facilitation of new data; content-independence of the items being recommended; and good scaling with co-rated items. There are also several disadvantages. Its performance decreases when data gets sparse, which occurs frequently with web-related items. This can hinder the scalability of the approach and create problems with large datasets.
• Model Based: In this approach, models are developed using different data mining and machine learning algorithms to predict a user's rating of unrated items. There are many model-based CF algorithms: Bayesian networks, clustering models, latent semantic models such as singular value decomposition, probabilistic latent semantic analysis, multiple multiplicative factor, etc.
In this model, methods like singular value decomposition and principal component analysis, known as latent factor models, compress the user-item matrix into a low-dimensional representation in terms of latent factors. One advantage of this approach is that instead of a high-dimensional matrix containing an abundant number of missing values, we deal with a much smaller matrix in a lower-dimensional space. The reduced representation can be utilized for either user-based or item-based neighborhood algorithms, and it handles the sparsity of the original matrix better than memory-based approaches.
3.4.5.1 Pros
3.4.5.2 Cons
All of the known recommendation techniques have strengths and weaknesses, and many researchers choose to combine the techniques in different ways. The different approaches used for the modeling of a hybrid recommendation system are:
• Feature combination: Features from different recommendation data sources are thrown together into a single recommendation algorithm.
• Feature augmentation: The output from one technique is used as an input feature to another.
The issues [22] that can arise in a recommendation system can be described as follows:
1. Data Collection: The data used by recommendation engines can be categorized into explicit and implicit data. Explicit data is all data the users themselves feed into the system; its collection must not be intrusive or time consuming. An implicit data source in e-commerce is the transaction data; implicit data needs to be analyzed before it can be used to describe user features or user-item ratings.
2. Cold Start/Ramp-Up: The cold start problem occurs when too little or no rating data is available in the initial state. The recommendation system then lacks the data to produce appropriate recommendations. This mostly occurs in learning models. The two cold start problems are the new user problem and the new item problem.
3. Stability vs. Plasticity: The converse of the cold start problem is stability vs. plasticity. When consumers have rated many items, the preferences in their established user profiles are difficult to change.
4. Sparsity: In most use cases for recommendation systems, due to the catalog sizes of e-business vendors, the count of ratings already obtained is very small relative to the count of ratings that need to be predicted. But collaborative filtering techniques rely on an overlap in ratings and have difficulties when the space of ratings is sparse (few users have rated the same items). Sparsity in the user-item rating matrix degrades the quality of the recommendations.
5. Performance and Scalability: Performance and scalability are important issues for recommendation systems, as e-commerce websites must be able to determine recommendations in real time and often deal with huge data sets of millions of users and items. The big growth rates of e-business are making these sets even larger in the user dimension.
6. User Input Consistency: Recommendation techniques that work with user-to-user correlations, like collaborative filtering or demographic techniques, depend on the correlation coefficients between the users in a data set. Users can be categorized into three classes based on their correlation coefficients with other users. The majority of users fall into the class of "white sheep", who have a high rating correlation with other users; recommendation engines can easily find recommendations for them. The opposite type is the "black sheep", for whom there are only few or no correlating users, which makes it quite difficult to find recommendations. The bigger problem is the "gray sheep": users who have differing opinions or an unusual taste that results in low correlation coefficients with many users. They fall on a border between user tastes. Recommendations for them are very difficult to find, and they also cause different recommendations for their correlated users.
The purpose of this study is to understand how personality impacts the collaborative filtering model and to compare it with some other popular models (global baseline, latent factor). Hence the recommendation models used in the project are:
• Matrix Factorization
Here, altogether 8 different recommendation models are created, among which 4 are created by combinations of the global baseline algorithm and user-to-user collaborative filtering (with and without personality).
The global baseline algorithm provides a mechanism to compute an unknown rating from baseline (i.e. "global effects") estimates of the corresponding users and items. Mathematically, suppose μ is the system-wide average rating, bx is the overall deviation of user x's ratings from the system average and bi is the deviation in ratings for item i; then the global baseline algorithm rates an item i for a user x as:

GlobalBaselineEstimate[Rx,i] = μ + bx + bi    (3.13)
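Equation 3.13 can be sketched on a tiny made-up rating table as below; the users, songs and ratings are illustrative, not the project's survey data:

```python
# Sketch of the global baseline estimate (equation 3.13):
# estimate = system average + user deviation + item deviation.

ratings = {  # user -> {item: rating}; absent items are unrated
    "alice": {"song1": 5, "song2": 3},
    "bob":   {"song1": 3, "song3": 4},
}

all_r = [r for user in ratings.values() for r in user.values()]
mu = sum(all_r) / len(all_r)  # system-wide average rating

def user_dev(u):
    vals = list(ratings[u].values())
    return sum(vals) / len(vals) - mu  # b_x

def item_dev(i):
    vals = [u[i] for u in ratings.values() if i in u]
    return sum(vals) / len(vals) - mu  # b_i

def baseline(u, i):
    return mu + user_dev(u) + item_dev(i)

# alice never rated song3, but the baseline still produces an estimate:
est = baseline("alice", "song3")
```

Here mu = 3.75, alice rates 0.25 above average, and song3 is rated 0.25 above average, so the estimate is 4.25.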
• User-to-rating matrix computation: The user-rating matrix is computed with the rating data of different users available from the database or dataset.
• Normalization of the ratings: This is done in order to make the average rating of the system zero, so that the unknown values can be padded with zeros. Mathematically, suppose μx is the average rating of user x and Rx,i represents the rating of user x on item i; then the normalized rating for user x on item i can be computed as:

R'x,i = Rx,i − μx    (3.14)
• Computing similar users: In order to compute similar users, two metrics have been used in the project: similarity based on the rating matrix of the users, and similarity based on personality. In both cases the similar users are computed with the help of cosine similarity after the normalization of the ratings. Mathematically, suppose ra = [ra1, ra2, ..., ran] is the rating vector of user a and rb = [rb1, rb2, ..., rbn] is the rating vector of user b; then the cosine similarity between users a and b can be obtained as:

similarity(a,b) = (ra1*rb1 + ra2*rb2 + ... + ran*rbn) / (√(ra1² + ra2² + ... + ran²) * √(rb1² + rb2² + ... + rbn²))    (3.15)

Similarly, people with similar personality are computed with the help of the personality vector.
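The cosine similarity of equation 3.15 can be sketched as below; the rating vectors are illustrative mean-centered values:

```python
import math

# Sketch of cosine similarity between two users' rating (or personality)
# vectors, as in equation 3.15.

def cosine(ra, rb):
    dot = sum(a * b for a, b in zip(ra, rb))
    norm_a = math.sqrt(sum(a * a for a in ra))
    norm_b = math.sqrt(sum(b * b for b in rb))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a user with no ratings matches nobody
    return dot / (norm_a * norm_b)

sim_identical = cosine([1, -1, 2], [1, -1, 2])  # same taste
sim_opposite = cosine([1, -1], [-1, 1])         # opposite taste
```

Identical vectors score 1.0 and exactly opposite vectors score -1.0, which is why mean-centering the ratings first (equation 3.14) matters: it lets disagreement show up as a negative similarity.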
• Rating prediction: A rating for user x on item i with the help of N neighbors is computed by taking the similarity-weighted average rating of the neighbors:

rx,i = (∑(y=1 to N) sx,y * ry,i) / (∑(y=1 to N) sx,y)    (3.16)

rx,i = baselinex,i + (∑(y=1 to N) sx,y * (ry,i − baseliney,i)) / (∑(y=1 to N) sx,y)    (3.17)

where,
rx,i is the rating on item i by user x
baselinex,i is the baseline estimate on item i by user x
baseliney,i is the baseline estimate on item i by user y
sx,y is the similarity between users x and y
N is the total number of neighbors used for the recommendation
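The weighted average of equation 3.16 can be sketched as below; the similarities and ratings are illustrative numbers:

```python
# Sketch of neighborhood rating prediction (equation 3.16): a
# similarity-weighted average of the neighbors' ratings of item i.

neighbors = [  # (similarity s_{x,y}, neighbor y's rating r_{y,i})
    (0.9, 5.0),
    (0.5, 3.0),
    (0.1, 1.0),
]

numerator = sum(s * r for s, r in neighbors)
denominator = sum(s for s, _ in neighbors)
predicted = numerator / denominator
```

The most similar neighbor (0.9) dominates, so the prediction lands near that neighbor's rating of 5. The baseline-adjusted variant of equation 3.17 applies the same weighting to the deviations (ry,i − baseliney,i) instead of the raw ratings.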
Matrix factorization [24] involves factorizing a matrix to find two or more matrices such that when the factors are multiplied together, the original matrix is obtained. In recommender systems, matrix factorization is employed to predict the missing ratings such that the predicted values are consistent with the existing ratings in the matrix. The intuition behind using matrix factorization is the assumption that there should be some latent features that determine how a user rates an item. For example, two users would give a high rating to a certain piece of music if they both like the singer, or if the music is of the same genre. Hence, if these latent features can be discovered, we should be able to predict a rating with respect to a certain user and a certain item, because the features associated with the user should match the features associated with the item.
In trying to discover the different features, we also make the assumption that the number of features is smaller than the number of users and the number of items. Suppose we have a set U of users and a set D of items. Let R of size |U| × |D| be the matrix that contains all the ratings that the users have assigned to the items. We also assume that we would like to discover K latent features. So our task here is to find two matrices, P of size |U| × K and Q of size |D| × K, such that their product approximates R:
R ≈ P × Qᵀ = R̂    (3.18)
In this way, each row of P represents the strength of the associations between a user and the features. Similarly, each row of Q represents the strength of the associations between an item and the features. To get the prediction of a rating of an item dj by user ui, we can calculate the dot product of the two vectors corresponding to ui and dj:

r̂ij = piᵀ qj = ∑(k=1 to K) pik qjk    (3.19)
Now we need a way to obtain P and Q. One approach is to first initialize the two matrices with some values, calculate how "different" their product is from R, and then try to minimize this difference iteratively. Such a method is called gradient descent, and it aims at finding a local minimum of the difference. The difference, usually called the error between the estimated rating and the real rating, can be calculated by the following equation for each user-item pair:

e²ij = (rij − r̂ij)² = (rij − ∑(k=1 to K) pik qjk)²    (3.20)
Here we consider the squared error because the estimated rating can be either higher or lower than the real rating. To minimize the error, we need to know in which direction to modify the values of pik and qkj. In other words, we need to know the gradient at the current values, and hence we differentiate the above equation with respect to these two variables separately:

∂e²ij/∂pik = −2(rij − r̂ij)(qkj) = −2eij qkj    (3.21)
And,

∂e²ij/∂qkj = −2(rij − r̂ij)(pik) = −2eij pik    (3.22)
Now, the update rules can be formulated for both pik and qkj by stepping against the gradient:

pik = pik − α ∂e²ij/∂pik = pik + 2α eij qkj    (3.23)

And,

qkj = qkj − α ∂e²ij/∂qkj = qkj + 2α eij pik    (3.24)
Here, α is a constant called the learning rate, whose value determines the rate of approaching the minimum. Usually, α is chosen to be between 0.001 and 0.1, because if we take too large a step towards the minimum, we run the risk of overshooting it and ending up oscillating around the minimum. Using the above rules, we can iteratively perform the update operation until the error converges to its minimum, or run the process for a finite number of iterations.
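The iterative procedure of equations 3.18-3.24 can be sketched in pure Python as below. The rating matrix, K, α and iteration count are illustrative choices for the sketch, not the project's actual configuration:

```python
import random

# Sketch of matrix factorization by gradient descent (equations 3.18-3.24);
# 0 marks an unknown rating, which is skipped during training.

R = [[5, 3, 0],
     [4, 0, 1],
     [1, 1, 5]]
n_users, n_items, K = len(R), len(R[0]), 2

random.seed(0)
P = [[random.random() for _ in range(K)] for _ in range(n_users)]
Q = [[random.random() for _ in range(K)] for _ in range(n_items)]

alpha = 0.01  # learning rate
for _ in range(2000):
    for i in range(n_users):
        for j in range(n_items):
            if R[i][j] == 0:
                continue  # unknown rating: no error signal
            e = R[i][j] - sum(P[i][k] * Q[j][k] for k in range(K))
            for k in range(K):
                p_ik = P[i][k]
                P[i][k] += alpha * 2 * e * Q[j][k]  # equation 3.23
                Q[j][k] += alpha * 2 * e * p_ik     # equation 3.24

# After training, the squared error over the known entries should be small,
# and P * Q^T also fills in a prediction for every unknown entry.
sse = sum((R[i][j] - sum(P[i][k] * Q[j][k] for k in range(K))) ** 2
          for i in range(n_users) for j in range(n_items) if R[i][j])
```

The same loop extends to the regularized updates of the next subsection by subtracting a β-scaled penalty term inside each update.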
3.6.4.1 Regularization

To avoid overfitting, a regularization term is added to the squared error:

e²ij = (rij − ∑(k=1 to K) pik qkj)² + (β/2) ∑(k=1 to K) (pik² + qkj²)    (3.25)

Here, the new parameter β is used to control the magnitudes of the user-feature and item-feature vectors, such that P and Q give a good approximation of R without having to contain large numbers. In practice, β is set in the range of 0.02. The new update rules for this squared error can be obtained similarly to the above, and they become:
pik = pik + α (2 eij qkj − β pik)    (3.26)

And,

qkj = qkj + α (2 eij pik − β qkj)    (3.27)
Thus, in this way matrix factorization can be implemented as a recommender system.
Evaluation measures for recommender systems are separated into three categories [28]:
• Predictive Accuracy Measures: These measures evaluate how close the recommender system came to predicting the actual rating/utility values.
• Classification Accuracy Measures: These measures evaluate the frequency with which
a recommender system makes correct/incorrect decisions regarding items.
• Rank Accuracy Measures: These measures evaluate the correctness of the ordering of items performed by the recommendation system.
Since the project is about the impact of personality on user-to-user collaborative filtering, we are concerned only with predictive accuracy measures. There are many variants of predictive accuracy measures, such as: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and Normalized Mean Absolute Error (NMAE).
Among them, root mean squared error is the most popular one and has been used in the project. Let ux,i be the actual rating of user x on item i and ûx,i be the predicted rating of user x on item i; then the root mean squared error can be computed as:

RMSE = √( ∑(n=1 to N) (ux,i − ûx,i)² / N )    (3.28)
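Equation 3.28 can be sketched in a few lines; the actual/predicted pairs below are illustrative, not the project's evaluation data:

```python
import math

# Sketch of root mean squared error (equation 3.28) over pairs of
# actual and predicted ratings.

actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 2.5]

rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                 / len(actual))
```

Squaring before averaging penalizes large misses more heavily than MAE does, which is one reason RMSE is the usual headline metric for rating prediction.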
4. METHODOLOGY

The functional requirement specifications of the project are mainly categorized as user requirements, security requirements, and device requirements, each of which is explained in detail below:
• User Requirement: The user should have an account on Facebook and must have at least one post, which is needed to analyze the personality for the music recommendation.
• Security Requirement: The user cannot have direct access to the Facebook API. Users must provide their own login credentials.
• Performance: The system shall provide quick, accurate and reliable results.
• Capacity and Scalability: The system shall be able to store the personality computed by the system in the database.
• Availability: The system shall be available to the user anytime whenever there is an Internet connection.
• Flexibility and Portability: The system shall be accessible anytime from any location.
A feasibility assessment is done to analyze the viability of an idea. In the case of software development, it examines the practicality of the project or system. The result of the feasibility assessment determines whether the project should go ahead, be redesigned, or be dropped. There are five areas of feasibility: technical, economic, legal, operational and scheduling.
The operational feasibility analysis describes how the system operates and what resources the system requires to perform its designated tasks. Being closely related to data analysis and integration, the system needs to be easily operable for different uses and operations: new data entry and concurrent use of the system should be fluent, consuming minimal cost and resources. The system to be designed is operationally feasible, as it can be operated with the resources of a personal computer, i.e. a browser. The project is developed as a website, which allows easy access for multiple users. Besides, the analysis task performed by the subsystem is also operationally feasible.
Technical feasibility assessment examines whether the proposed system can actually be designed to solve the desired problems and requirements using available technologies in the given problem domain. The system is said to be technically feasible if it can be deployable, operable and manageable under the current technological context of our country. Since the system aims to enhance existing collaborative engines via the use of personality, and since the data set for personality is available and building a classifier is also feasible, the project can be considered technically feasible.
Economic feasibility checks whether the cost required for the complete system development is feasible using the available resources at hand. It should be noted that the cost of resources and the overall cost of deployment of the system should be kept minimal, while the operational and maintenance costs for the system should be within the capacity of the organization. Since the system is hosted on the Heroku cloud hosting service, which is free of cost for limited use, the system can be considered economically feasible for development.
Legal feasibility assessment checks the system for any conflicts with legal requirements and regulations that are to be followed as per the standards maintained by the governing body. As such, the system being developed must comply with all legal boundaries, such as copyright and the authorized use of licenses. This prevents any future conflicts for the system and also provides a legal basis for the system if anyone else tries to use part of or the full system without the necessary permissions and documents. Since the data obtained from the user's social media is consented to by the user and does not violate any other obligation of law or privacy, the project can be considered legally feasible.
Any project is considered a failure if it is not completed on time. So, scheduling feasibility estimates the time required for the system to be fully developed and whether that time is feasible according to the current trends in the market. If the project takes a long time to complete, it may become outdated, or someone else may launch a similar system before our system is complete. So, it is required to fix a deadline for any project, and the system should be released and operative before the specified deadline. As the scheduling of the project is consistent with the available time, the project can be considered feasible in terms of scheduling.
The developed system, being huge and dynamic in nature, might not be developed efficiently and delivered on time with traditional development approaches like waterfall. Thus, to meet the requirements of the system while ensuring timely delivery and adaptability to changing requirements, the Scrum methodology under the Agile development method was chosen for the development of the system.
Scrum is an agile way to manage a project. Agile software development with Scrum is often perceived as a methodology, but rather than viewing Scrum as a methodology, it is better thought of as a framework for managing a process. In the agile Scrum world, instead of providing complete, detailed descriptions of how everything is to be done on the project, much of it is left up to the development team.
This model suggests that projects progress via a series of sprints. In keeping with an agile methodology, sprints are time-boxed to no more than a month long, most commonly two weeks. It advocates for a planning meeting at the start of the sprint, where team members figure out how many items they can commit to, and then create a sprint backlog: a list of tasks to perform during the sprint. During an agile Scrum sprint, the Scrum team takes a small set of features from idea to coded and tested functionality. At the end, these features are coded, tested and integrated into the evolving product or system.
On each day of the sprint, all team members attend a Scrum meeting. During that time, team members share what they worked on the prior day and what they will work on that day, and identify any impediments to progress. The model sees routine scrums as a way to synchronize the work of team members as they discuss the work of the sprint. At the end of a sprint, a sprint review is conducted, during which the team demonstrates the new functionality to any stakeholders who wish to provide feedback that could influence the next sprint. The feedback loop within Scrum software development may result in changes to the freshly delivered functionality, but it may just as likely result in revising or adding items to the product backlog.
The primary artifact in Scrum development is the product itself. The Scrum model expects the team to bring the product or system to a potentially shippable state at the end of each Scrum sprint. The product backlog is another artifact of Scrum; this is the complete list of the functionality that remains to be added to the product. The most popular and successful way to create a product backlog using the Scrum methodology is to populate it with user stories, which are short descriptions of functionality described from the perspective of a user or customer. In Scrum project management, on the first day of a sprint and during the planning meeting, team members create the sprint backlog. The sprint backlog can be thought of as the team's to-do list for the sprint, whereas the product backlog is a list of features to be built. The sprint backlog is the list of tasks the team needs to perform in order to deliver the functionality it committed to deliver during the sprint. An additional artifact in the Scrum methodology is the sprint burn-down chart, which shows the amount of work remaining in a sprint and is used to determine whether the sprint is on schedule to have all planned work finished by the desired date.
Hence, Scrum is adaptive, has small repeating cycles, and involves short-term planning with constant feedback, inspection and adaptation, and was therefore chosen as the software development methodology. Here, the project team members can be thought of as the software development team and the project supervisor as the Scrum master. Scrum meetings were conducted regularly at intervals of 3 weeks, keeping both a sprint backlog and a product backlog, and hence progress on the project was made in the form of sprints, whereby the product backlog helped to identify and prioritize the features to implement in each sprint and the burn-down chart helped to keep the project on schedule. Whenever a bug was found relating to a feature, it was dealt with immediately before marking the feature complete, i.e. 1-2 sprints were focused only on defect backlogs. Each Scrum meeting lasted about 15 minutes, in which every team member answered three questions: What have I done since the last meeting? What will I do until the next meeting? What problems do I have? In this way the cycle continued until the product was completely developed.
In order to predict the personality, the dataset for training the classification model was obtained from the myPersonality website [1]. It consisted of collections of status updates of Facebook users along with their personality classification scores in terms of the Big Five personality traits. For the recommendation system, a survey was conducted among colleagues, who gave ratings to a predefined set of music in the database. Their personality traits were determined with our personality classification model before they rated the music.
5. SYSTEM DESIGN

The system initially predicts the personality of the user with the help of their social media account (Facebook). Thus, a classifier is first trained to classify the personality of the user on the basis of their status updates. Afterwards, the predicted personality of the user is used as one of the metrics for the similar user computation in collaborative filtering, and the effect of personality on the collaborative filtering engine is observed.
The figure below is the architectural diagram of the project, showing the processes involved during the development of the project.
5. Stemming
6. Conversion of textual data to a vector representation (numerical form)
The following figure depicts the tasks performed within the preprocessor unit of the system:
• Classifier: After the vector representation of the status update, this subsystem is responsible for personality prediction. Classifiers are trained by the admin in the system using the dataset [1] in order to predict a personality. In the project there are two classifier models used for personality classification. They are:
The following figure depicts the tasks performed by the classifier unit within the system:
• Music Recommender System: The system comprises eight models for the recommendation of music to the user. They are:
The following figure summarizes the tasks performed by the recommender unit within the system:
The following figure depicts various recommendation models used within the system:
• Storage Unit/Database: It is responsible for storing the user data, music data, user-music rating data and user-music recommendation data made by the recommender system, as well as for providing recommendations based on user feedback. An SQLite database is used as the storage unit for the project.
A use case diagram represents the user's interaction with the system, showing the relationships between the user and the different use cases in which the user is involved. A use case diagram can identify the different types of users of a system as well as the different use cases, and provides a higher-level view of the system. Use case diagrams are blueprints for the system. The use cases are shown as ovals and the actors as stick figures (even if they are machines), with lines (known as associations) connecting use cases to the actors involved with them. A box around the use cases emphasizes the boundary between the system (defined by the use cases) and the actors, who are outside of the system. In our case, the actor 'user' can log in to the system, allow the system to access his or her profile information, and view the output or results from the system.
The use case diagram of the system, depicting the actors and their interactions with the system, is given in the figure below:
From the above diagram, it is clear that the system involves two actors. They are:
• User: Users are the ones who will use the system directly. They will be able to perform actions like logging in, viewing recommendations and listening to music.
• Admin: The admin is directly responsible for training the classifier and recommender subsystems, creating the models for the storage engine and verifying all of these subsystems.
The system is composed of a UI, a classifier, a music recommender and a storage unit. The classifier within the system is responsible for the classification of the user's personality and for updating the database. The recommender is responsible for the recommendation of music to the user and also for updating the database. The storage unit is responsible for the creation of the database model and the storage of system data, and the UI is responsible for providing the user with login access and profile access, and for displaying the user's personality and music recommendations.
5.4. ER Diagram
1. Session: It consists of the attributes session id and user id and has a one-to-one
relationship with User.
2. Music: It consists of the attributes artist and song title and participates in many-to-many
relationships with User-Music and Recommendation.
3. User: It consists of the attributes user id, name and the personality trait attributes; it has
a many-to-many relationship with User-Music and one-to-one relationships with Session
and Recommendation.
4. Recommendation: It consists of the attributes user and music and has a one-to-one
relationship with User and a many-to-many relationship with Music.
5. User-Music: It consists of the attributes user, music and rating and has many-to-many
relationships with User and Music.
The system is implemented within the Django framework, which provides an abstraction over
the relationships in the database, so we can directly express relationships such as one-to-one,
one-to-many and many-to-many.
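As a concrete illustration of these entities and relationships, the sketch below creates an equivalent SQLite schema directly. Table and column names here are hypothetical, chosen only to mirror the ER entities above; in practice Django's ORM generates the actual schema from model classes.

```python
import sqlite3

# Illustrative schema mirroring the ER diagram; names are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE user (
    user_id INTEGER PRIMARY KEY,
    name    TEXT,
    -- Big Five trait scores
    openness REAL, conscientiousness REAL, extraversion REAL,
    agreeableness REAL, neuroticism REAL
);
CREATE TABLE session (                 -- one-to-one with user
    session_id TEXT PRIMARY KEY,
    user_id    INTEGER UNIQUE REFERENCES user(user_id)
);
CREATE TABLE music (
    music_id   INTEGER PRIMARY KEY,
    artist     TEXT,
    song_title TEXT
);
CREATE TABLE user_music (              -- many-to-many link with a rating
    user_id  INTEGER REFERENCES user(user_id),
    music_id INTEGER REFERENCES music(music_id),
    rating   REAL,
    PRIMARY KEY (user_id, music_id)
);
CREATE TABLE recommendation (          -- recommended tracks per user
    user_id  INTEGER REFERENCES user(user_id),
    music_id INTEGER REFERENCES music(music_id),
    PRIMARY KEY (user_id, music_id)
);
""")
cur.execute("INSERT INTO user (user_id, name) VALUES (1, 'alice')")
cur.execute("INSERT INTO music (music_id, artist, song_title) VALUES (1, 'X', 'Y')")
cur.execute("INSERT INTO user_music VALUES (1, 1, 4.5)")
rating = cur.execute("SELECT rating FROM user_music WHERE user_id = 1").fetchone()[0]
print(rating)  # 4.5
```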
An activity diagram describes the dynamic aspects of the system. It shows a user-oriented
view of system operation. We have drawn the activity diagram using swim-lanes: a swim
lane is a visual element that separates jobs and responsibilities for sub-processes. In our
system's activity diagram there are three swim-lanes, with jobs and responsibilities separated
accordingly. Each step is a continuation of the previous step; decisions are taken wherever
necessary, and forks and joins are used to split or merge the work flow. The objective of an
activity diagram is similar to that of the other UML diagrams; the only difference is that it
is used to show the flow between activities.
The diagram above shows the activity diagram of the system. It depicts how the user, the
admin and the system interact with each other. Initially, the user logs in to the system,
providing basic profile information. Afterwards, depending on whether the user is new or
returning, the classifier predicts the personality. Then music is recommended to the user.
Besides this, the user can also view his or her personality.
A system context diagram (SCD) in engineering is a diagram that defines the boundary
between the system, or part of a system, and its environment, showing the entities that
interact with it. This diagram is a high-level view of a system, similar to a block diagram. In
our system context diagram there are two entities, user and sysadmin, and a process (the
system we developed as our project). The diagram shows the input and output of each entity
as well as of the process.
A data flow diagram (DFD) is a graphical representation of the "flow" of data through an
information system, modelling its process aspects. A DFD shows what kind of information
will be input to and output from the system, how the data will advance through the system,
and where the data will be stored. However, it does not show information about process
timing or whether processes will operate in sequence or in parallel, unlike a UML activity
diagram, which presents both control and data flows as a unified model. The given diagram
is the level-0 DFD showing the distinct internal processes of our system. There are four
processes and a datastore which stores all data, intermediate outcomes and results. Two
entities, user and sysadmin, take part in the flow of data to and from these processes. Each
arrowhead in the data flow diagram shows the direction of the data/information flow, and
the label gives the type of data/information that flows through. The figure given below is
the data flow diagram of the project, showing the flow of data within the system.
In this level-0 DFD there are two entities, User and Admin. The user is responsible for
logging in, viewing the personality and viewing the recommended music, all of which take
data from the user and recommendation store to provide data to the user. The admin is
responsible for creating and altering the models (classifier, recommender, database), all of
which are reflected in the user and recommendation store.
The user interface is one of the major parts of the system. It is where a user logs in with
their Facebook ID in order to experience personality-based music listening. The user is able
to view his/her recommended music as well as personality via the website, along with a
detailed description of the personality traits. Personality classification and music
recommendation are all performed in the backend of the system.
After the user logs in through Facebook, the user's posts are extracted through the Graph API.
The data obtained goes through the preprocessor, which performs various NLP techniques
such as tokenization and POS tagging and outputs a feature vector. The vector thus created
is passed to the classifier, which classifies the personality of the user; the result is stored in
the database and also fed into the user-to-user collaborative filtering engine to determine
similar users and recommend music to the user. Besides this, there are other recommendation
models, and the one with the least RMSE value is used for recommending the music. The
result thus obtained is sent back to the front end and displayed to the user, who can then
view the recommended music and his/her personality traits.
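As an illustration of the preprocessing step, the sketch below tokenizes a status update, removes stopwords and builds a bag-of-words feature vector over a tiny hypothetical vocabulary. The real system uses NLTK's tokenizer and POS tagger rather than this simple regular expression, and the vocabulary here is made up.

```python
import re
from collections import Counter

# Small illustrative stopword set; the full list used by the project is
# given in Appendix B.
STOPWORDS = {"i", "am", "the", "to", "a", "and", "is", "my", "was"}

def tokenize(status):
    """Lowercase a status update and split it into word tokens."""
    return re.findall(r"[a-z']+", status.lower())

def feature_vector(status, vocabulary):
    """Bag-of-words counts over the vocabulary, stopwords removed."""
    counts = Counter(t for t in tokenize(status) if t not in STOPWORDS)
    return [counts[w] for w in vocabulary]

vocab = ["music", "happy", "party", "alone"]
vec = feature_vector("I am happy, the music at the party was great", vocab)
print(vec)  # [1, 1, 1, 0]
```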
6.1. Python
6.2. Django
Django is a free and open-source web framework, written in Python, which follows the
model-view-template (MVT) architectural pattern. Django’s primary goal is to ease the cre-
ation of complex, database-driven websites. Django emphasizes reusability and ”pluggabil-
ity” of components, rapid development, and the principle of don’t repeat yourself. Python is
used throughout, even for settings files and data models. Django also provides an optional
administrative create, read, update and delete interface that is generated dynamically through
introspection and configured via admin models.
6.3. NumPy
NumPy is a library for the Python programming language, adding support for large, multi-
dimensional arrays and matrices, along with a large collection of high-level mathematical
functions to operate on these arrays.
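For instance, a user-item rating matrix can be held in a NumPy array and mean-centered per user in a single broadcast operation; the numbers below are made up for illustration.

```python
import numpy as np

# Rows are users, columns are songs; 0 stands in for any value here.
ratings = np.array([[5.0, 3.0, 0.0],
                    [4.0, 0.0, 2.0]])

row_means = ratings.mean(axis=1)          # per-user mean rating
centered = ratings - row_means[:, None]   # broadcasting subtracts each row's mean
print(row_means)
```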
6.4. Pandas
Pandas is a software library written for the Python programming language for data manip-
ulation and analysis. In particular, it offers data structures and operations for manipulating
numerical tables and time series. It offers a wide range of features, including a DataFrame
object for data manipulation with integrated indexing, tools for reading and writing data
between in-memory data structures and different file formats, data alignment and integrated
handling of missing data, and more.
6.5. NLTK
NLTK is a leading platform for building Python programs to work with human language data.
It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet,
along with a suite of text processing libraries for classification, tokenization, stemming, tag-
ging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and
an active discussion forum. Natural Language Processing with Python provides a practical
introduction to programming for language processing.
6.6. Facebook Graph API
Facebook Platform is an umbrella term used to describe the set of services, tools, and
products provided by the social networking service Facebook for third-party developers to
create their own applications and services that access data in Facebook. The Graph API is
the core of Facebook Platform, enabling developers to read data from and write data to
Facebook.
6.7. HTML/CSS
Hypertext Markup Language (HTML) is the standard markup language for creating web
pages and web applications. Web browsers receive HTML documents from a web server
or from local storage and render them into multimedia web pages. HTML describes the
structure of a web page semantically and originally included cues for the appearance of the
document. Cascading Style Sheets (CSS) is a style sheet language used for describing the
presentation of a document written in a markup language. It is most often used to set the
visual style of web pages and user interfaces written in HTML.
6.8. JavaScript
6.9. PostgreSQL
6.10. Git
Git is a version control system (VCS) for tracking changes in computer files and coordinating
work on those files among multiple people. It is primarily used for source code management
in software development, but it can be used to keep track of changes in any set of files.
As a distributed revision control system it is aimed at speed, data integrity, and support for
distributed, non-linear workflows.
7. RESULT
The myPersonality dataset contains status updates of 223 users, each labelled with Big Five
personality traits. We analyzed this data to see how users are distributed across the different
personality traits. The frequency distribution of each class of personality traits is given below:
We analyzed the effect of the number of iterations on the f-measure of the logistic regression
model, which is given below:
The following tables show the confusion matrices of Naive Bayes for the Big Five Personality
classes:
The following figure shows the f-measure of the Naive Bayes model for the Big Five
Personality classes:
The following figures show the effect of changing the number of nearest neighbors in the
different collaborative filtering models.
Figure 7.6: RMSE of Collaborative Filtering with similarity in terms of the Personality Matrix
Figure 7.7: RMSE of Collaborative Filtering combined with Global Baseline with User Rat-
ing Matrix
Figure 7.8: RMSE of Collaborative Filtering combined with Global Baseline with User Per-
sonality Matrix
Figure 7.9: RMSE of Collaborative Filtering with User Rating and Personality Matrix
Figure 7.10: RMSE of Collaborative Filtering with User Rating and Personality Matrix com-
bined with Global Baseline
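The user-to-user collaborative filtering variants compared in these figures share the same core computation. The sketch below predicts a rating from the k most similar users, here using cosine similarity over Big Five trait vectors; all user names, trait values and ratings are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def predict_rating(target, item, personalities, ratings, k=2):
    """Similarity-weighted average of the k most personality-similar
    users' ratings for the given item."""
    sims = sorted(
        ((cosine(personalities[target], personalities[u]), u)
         for u in ratings if u != target and item in ratings[u]),
        reverse=True)[:k]
    num = sum(s * ratings[u][item] for s, u in sims)
    den = sum(s for s, _ in sims)
    return num / den if den else 0.0

# Hypothetical Big Five trait vectors and observed ratings.
personalities = {"a": [0.8, 0.2, 0.5, 0.6, 0.1],
                 "b": [0.7, 0.3, 0.4, 0.5, 0.2],
                 "c": [0.1, 0.9, 0.2, 0.1, 0.8]}
ratings = {"b": {"song1": 5.0}, "c": {"song1": 1.0}}

# "b" is far more similar to "a" than "c" is, so the prediction leans
# toward b's rating of 5.0.
pred = predict_rating("a", "song1", personalities, ratings)
print(round(pred, 2))
```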
The following figure shows the RMSE of matrix factorization as the number of iterations is
varied:
The following figure shows the RMSE of matrix factorization as k is varied with the number
of iterations fixed at 1000:
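A minimal sketch of matrix factorization trained by stochastic gradient descent is given below. The rating matrix, latent dimension k, learning rate and iteration count are illustrative toy values, not the project's actual data or hyperparameters; the point is that the RMSE over observed entries falls as iterations increase.

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)        # 0 marks an unrated entry
mask = R > 0
k, lr, reg = 2, 0.01, 0.02                       # latent factors, step size, L2
P = rng.normal(scale=0.1, size=(R.shape[0], k))  # user factor matrix
Q = rng.normal(scale=0.1, size=(R.shape[1], k))  # item factor matrix

def rmse():
    """RMSE over the observed (rated) entries only."""
    err = (R - P @ Q.T)[mask]
    return float(np.sqrt(np.mean(err ** 2)))

before = rmse()
for _ in range(1000):                            # epochs over observed entries
    for i, j in zip(*np.nonzero(mask)):
        e = R[i, j] - P[i] @ Q[j]                # prediction error
        P[i] += lr * (e * Q[j] - reg * P[i])     # gradient steps with
        Q[j] += lr * (e * P[i] - reg * Q[j])     # L2 regularization
after = rmse()
print(before, after)  # RMSE drops as iterations increase
```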
Comparing the above models, we can conclude that user-to-user collaborative filtering with
personality gives slightly better results than user-to-user collaborative filtering with the user
rating matrix, but matrix factorization outperforms them all. Besides this, the weighted
average of the user similarity matrix computed with rating and personality also performs
better than the rating matrix alone, but is comparable to user-to-user collaborative filtering
that uses personality to compute similarity. Currently the system uses a switching hybrid
methodology to choose among the different models used in the system, i.e., the one with
the least RMSE.
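The switching hybrid can be sketched as choosing, on held-out ratings, the candidate model with the least RMSE; the model names and numbers below are hypothetical stand-ins for the system's actual models and predictions.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between prediction and ground truth."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))

# Held-out true ratings and each candidate model's predictions (made up).
actual = [4.0, 3.0, 5.0, 2.0]
candidates = {
    "cf_rating":      [3.5, 3.2, 4.1, 2.8],
    "cf_personality": [3.8, 3.1, 4.6, 2.4],
    "matrix_fact":    [3.9, 3.0, 4.9, 2.1],
}

# Switching hybrid: recommend with whichever model scores lowest RMSE.
scores = {name: rmse(preds, actual) for name, preds in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # the model with the smallest error on this toy data
```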
In building the classification model, preprocessing the status updates was a major challenge.
Users' status updates can contain various emojis which could be significant but were ignored.
The current recommendation system as a whole suffers from cold start on the item side, i.e.,
the item ramp-up problem. User-to-user collaborative filtering with the rating matrix suffers
from the stability-versus-plasticity issue. Besides this, in the collaborative engine the gray
sheep problem still prevails even with the use of personality, because of sparsity in the user
rating matrix.
The current system could be enhanced by considering emojis and some demographic
information about the users for personality classification. The ramp-up problem in the
recommendation engine could be solved with content-based filtering to build a profile for
each item. The stability-versus-plasticity issue could be addressed by giving lower weights
to users' old ratings in user-to-user collaborative filtering with the rating matrix.
9. CONCLUSION
In this project, we have developed classification models that take a Facebook user's status
updates as input and classify their personality based on the Big Five personality traits. This
information is used by user-to-user collaborative filtering to find similar users and
recommend music to them. This recommendation model performs better than user-to-user
collaborative filtering with the rating matrix but not as well as matrix factorization. Besides
this, the recommendation model developed with personality gives results comparable to the
weighted average of similarity using the rating matrix and the personality matrix. Hence,
with reference to the current state of our project, we can conclude that personality, i.e., the
Big Five traits of the user, can be used to enhance existing user-to-user collaborative
filtering that computes similarity with the user rating matrix.
References
[1] D. Stillwell and M. Kosinski. (2017, January). myPersonality DataSet. Retrieved from
http://mypersonality.org/wiki/doku.php
[2] L. R. Goldberg, et al. (2006). The international personality item pool and the future of
public-domain personality measures. J Res Pers, 40(1): 84-96.
[3] E. Tupes and R. Christal. (1992). Recurrent personality factors based on trait ratings.
Journal of Personality, 60(2): 225-251.
[4] R. McCrae and O. John. (1992). An introduction to the five-factor model and its
applications. Journal of Personality, 60(2): 175-215.
[5] L. Audery. (2014). Improving music recommender systems: What can we learn from
research on music tastes? ISMIR.
[6] B. Ferwerda and S. Markus. (2014). Enhancing Music Recommender Systems with
Personality Information and Emotional States: A Proposal. UMAP Workshops.
[13] M. F. Porter. (2001). Snowball: A language for stemming algorithms.
[15] Textual data and vector space model. (2017, July). Retrieved from http://www.
calpoly.edu/~dsun09/lessons/textprocessing
[21] L. Safoury and A. Salah. (2013). Exploiting user demographic attributes for solving
cold-start problem in recommender system. Lecture Notes on Software Engineering, 1(3), 303.
[25] C. Manning. (2017, June). Stanford NLP - Stanford NLP Group. Retrieved from
https://nlp.stanford.edu/manning
[27] Studying the big five personality traits-UK Essays. (2017. Jan-
uary). Retrieved from https://ukessays.com/essays/psychology/
studying-the-big-five-personality-traits.php
[31] Andrew Ng. (2017, March). Machine Learning [Video lectures]. Retrieved from
https://www.coursera.org/learn/machine-learning.
[33] Jason Brownlee. (2017, August). Logistic Regression for Machine Learn-
ing [Online tutorial]. Retrieved from http://machinelearningmastery.com/
logistic-regression-for-machine-learning.
APPENDIX A
Output Screenshots
APPENDIX B
Stopwords used
’i’, ’me’, ’my’, ’myself’, ’we’, ’our’, ’ours’, ’ourselves’, ’you’, ’your’, ’yours’, ’yourself’,
’yourselves’, ’he’, ’him’, ’his’, ’himself’, ’she’, ’her’, ’hers’, ’herself’, ’it’, ’its’, ’itself’,
’they’, ’them’, ’their’, ’theirs’, ’themselves’, ’what’, ’which’, ’who’, ’whom’, ’this’, ’that’,
’these’, ’those’, ’am’, ’is’, ’are’, ’was’, ’were’, ’be’, ’been’, ’being’, ’have’, ’has’, ’had’,
’having’, ’do’, ’does’, ’did’, ’doing’, ’a’, ’an’, ’the’, ’and’, ’but’, ’if’, ’or’, ’because’, ’as’,
’until’, ’while’, ’of’, ’at’, ’by’, ’for’, ’with’, ’about’, ’against’, ’between’, ’into’, ’through’,
’during’, ’before’, ’after’, ’above’, ’below’, ’to’, ’from’, ’up’, ’down’, ’in’, ’out’, ’on’, ’off’,
’over’, ’under’, ’again’, ’further’, ’then’, ’once’, ’here’, ’there’, ’when’, ’where’, ’why’,
’how’, ’all’, ’any’, ’both’, ’each’, ’few’, ’more’, ’most’, ’other’, ’some’, ’such’, ’no’, ’nor’,
’not’, ’only’, ’own’, ’same’, ’so’, ’than’, ’too’, ’very’, ’s’, ’t’, ’can’, ’will’, ’just’, ’don’,
’should’, ’now’, ’d’, ’ll’, ’m’, ’o’, ’re’, ’ve’, ’y’, ’ain’, ’aren’, ’couldn’, ’didn’, ’doesn’,
’hadn’, ’hasn’, ’haven’, ’isn’, ’ma’, ’mightn’, ’mustn’, ’needn’, ’shan’, ’shouldn’, ’wasn’,
’weren’, ’won’, ’wouldn’
RB: adverb
example: occasionally unabatingly maddeningly adventurously professedly stirringly promi-