Multilevel Predictive Model For Detecting Depression in Social Media Users

A Project Report
submitted in partial fulfilment of the requirements for the award of the degree of
BACHELOR OF ENGINEERING
in
Computer Science and Engineering

by
T. Sai Sreeshma (160115733140)
M. Sai Tejaswi (160115733142)

Under the guidance of
Mr. G. Vivek
Assistant Professor

Chaitanya Bharathi Institute of Technology, Hyderabad
May, 2019
CERTIFICATE
This is to certify that the project work entitled “Multilevel Predictive Model for Detecting Depression in Social Media Users” is the bonafide work carried out by T. Sai Sreeshma (160115733140) and M. Sai Tejaswi (160115733142), students of B.E. (CSE) of Chaitanya Bharathi Institute of Technology, Hyderabad, affiliated to Osmania University, Hyderabad, Telangana (India), during the academic year 2018-19, submitted in partial fulfilment of the requirements for the award of the degree of Bachelor of Engineering in Computer Science and Engineering, and that the project has not previously formed the basis for the award of any other degree, diploma, fellowship or any other similar title.
Place: Hyderabad
Date:
External Examiner
DECLARATION
We hereby declare that the project entitled “Multilevel Predictive Model for Detecting Depression in Social Media Users”, submitted for the B.E. (CSE) degree, is our original work and has not formed the basis for the award of any other degree, diploma, fellowship or any other similar title.
T. Sai Sreeshma
(160115733140)
M. Sai Tejaswi
(160115733142)
Place: Hyderabad
Date:
ACKNOWLEDGEMENT
The satisfaction that accompanies the successful completion of any task would be
incomplete without introducing the people who made it possible and whose constant
guidance and encouragement crowns all efforts with success. They have been a
guiding light and source of inspiration towards the completion of the project.
We would like to express our sincere gratitude and indebtedness to our project guide, Mr. G. Vivek, who has supported us throughout this project with patience and knowledge.
We are also thankful to Head of the Department Dr. M. Swami Das for providing
excellent infrastructure and a conducive atmosphere for completing this project
successfully.
We are also extremely thankful to our Project Coordinators Dr. K. Sagar, Professor,
Dept. of CSE, CBIT and Dr. M. Venu Gopalachari, Associate Professor, Dept. of
CSE, CBIT for their valuable suggestions and interest throughout the course of this
project.
We convey our heartfelt thanks to the lab staff for allowing us to use the required
equipment whenever needed.
Finally, we would like to take this opportunity to thank our families for their support throughout this work. We sincerely acknowledge and thank all those who directly or indirectly supported the completion of this work.
T. Sai Sreeshma
(160115733140)
M. Sai Tejaswi
(160115733142)
ABSTRACT
Depression is a widespread disease, with about 300 million people suffering from it globally. It is more than just a negative mood, and it requires treatment. People who lead a life with high satisfaction tend to suffer fewer mental health issues. The large volume of data generated by users on social network platforms can help detect hidden patterns and obtain new insights. This project aims to develop a multilevel predictive model that detects users with depression by first predicting life satisfaction. The model learns the relationship between life satisfaction and depression among users of a social network, Twitter for example. Detecting depression in this way would allow users to get help and health recommendations. It would also allow doctors and psychologists to get frequent information from a patient.
TABLE OF CONTENTS
Title Page
Certificate
Declaration
Acknowledgement
Abstract
Table of Contents
1. INTRODUCTION
1.2 Methodology
2. LITERATURE SURVEY
2.2 Background
4. IMPLEMENTATION OF THE PROPOSED SYSTEM
4.3 Design
6.1 Conclusion
6.1.1 Limitations
REFERENCES
APPENDIX 1: CODE
APPENDIX 2: MORAL FOUNDATIONS DICTIONARY
CHAPTER-1
INTRODUCTION
1.1 Problem Definition
The large volume of data generated on social network platforms helps detect patterns in the data and obtain new insights. Using these patterns, this project detects life satisfaction levels among different social network users.
1.2 Methodology
This section describes the datasets, the methods used to extract information
from the datasets, the measures of life satisfaction and depressive statuses, and the
construction of the predictive model.
A. Dataset:
Our dataset is a synthesised dataset which is generated using IBM Watson API
by giving a folder which contains text files of tweets from each individual user.
B. Measures:
1. Satisfaction with Life Scale: SWLS is a valid and reliable questionnaire to measure life satisfaction [3]. This questionnaire contains 5 statements ("In most ways my life is close to my ideal", "The conditions of my life are excellent", "I am satisfied with my life", "So far I have gotten the important things I want in life", and "If I could live my life over, I would change almost nothing"), requesting respondents to indicate their agreement on a scale from 1 (strongly disagree) to 7 (strongly agree). The total score ranges from 5 (extremely dissatisfied) to 35 (extremely satisfied). The self-reported score can also be used to reflect the emotional well-being of respondents.
2. Centre for Epidemiological Study Depression Scale: CES-D is one of the most
common and popular questionnaires to measure depressive tendencies [20]. It has
been used to screen respondents with depression both in the general population and in
primary care settings [21]. This questionnaire consists of 20 items. Each of them asks
how often the respondent has felt in different states or moods over the last week. The
answers are rarely or none of the time (less than 1 day), some or a little of the time (1-
2 days), occasionally or a moderate amount of the time (3-4 days), and most or all of
the time (5-7 days). Its total score ranges from 0 to 60 and can be used to classify a respondent as depressed or non-depressed.
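To make the scoring concrete, here is a minimal sketch of how the two questionnaire scores translate into labels, using the cut-offs described later in Section 4.3 (SWLS above 25 for life-satisfied, a CES-D cut-off of 20 for depressed); the function names are illustrative:

def swls_score(answers):
    # Sum the five SWLS answers (each 1-7); the total ranges from 5 to 35.
    assert len(answers) == 5 and all(1 <= a <= 7 for a in answers)
    return sum(answers)

def label_satisfaction(score, cutoff=25):
    # Section 4.3: users scoring over the cut-off are labelled life-satisfied.
    return "satisfied" if score > cutoff else "dissatisfied"

def label_depression(cesd_total, cutoff=20):
    # CES-D totals range 0-60; a cut-off of 20 separates depressed from non-depressed.
    return "depressed" if cesd_total >= cutoff else "non-depressed"

print(label_satisfaction(swls_score([6, 5, 6, 4, 5])))  # satisfied
print(label_depression(23))                             # depressed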
C. Predictive models
Two predictive models are developed and evaluated: one for classifying users according to life satisfaction and another for detecting users with depression. All models rely on supervised learning techniques, which need inputs and outputs for training. Supervised learning techniques learn patterns from a set of given inputs and then try to classify or categorise the inputs into classes following a given set of known outputs. After the training stage, a well-trained model can classify a new set of inputs into a suitable class.
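As a minimal sketch of this train-then-classify workflow with scikit-learn (the toy data here is a stand-in, not the project's dataset):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy stand-in data: 100 users x 5 features, with known binary labels.
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = (X[:, 0] > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = SVC(kernel="rbf")           # the project's main classifier (Section 3.3)
model.fit(X_train, y_train)         # training stage: learn from known outputs
print(model.score(X_test, y_test))  # classify unseen inputs into classes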
1.3 Outline of the results
The output of the project is the level of depression on the CES-D scale, obtained by analysing the user's Twitter profile.
CHAPTER-2
LITERATURE SURVEY
2.1. Introduction
In the last decade, people have started to openly express their activities,
feelings, and thoughts on social network platforms. Thanks to these sites, people can
also establish new virtual relationships, communicate information, and interact with
other users. These networks attract large numbers of users, with Facebook alone
having over 2.2 billion monthly active users in March 2018. Twitter reported 336
million monthly active accounts in March 2018. The users of these networks are
generating a large volume of data, which is attracting researchers and companies in
hope of understanding hidden patterns in data and obtaining new insights.
The growing amount of data and the advances in high-performance computing have led to the popularisation of data mining and machine learning techniques, as well as data science, which enable researchers to interpret and visualise information from complex datasets. Using these methods, predictive models have been developed to detect, classify, and categorise data in a wide variety of research domains, including health, energy, marketing, and genetics.
Given the relationship between mental health problems and life satisfaction, it is possible to detect people with mental health problems based on the messages or emotions they express that provide information about their life satisfaction. Previous studies have linked well-being and mental illness in online users.
So far, a small, but notable, number of studies have investigated mental health
problems using social network data. Some studies have developed predictive models
to classify users according to their levels of life satisfaction and depression. To the
best of our knowledge, no study has developed a multilevel predictive model to
classify users with depression based on satisfaction with life.
2.2. Background
Traditional assessments of mental health are based on surveys, relying on retrospective self-reports about mood and
observations about health: a method that limits temporal granularity. That is, such
assessments are designed to collect high-level summaries about experiences over long
periods of time. Collecting finer-grained longitudinal data has been difficult, given the
resources and invasiveness required to observe individuals’ behaviour over months
and years. The continuing stream of evidence from social media posting activity often reflects people's psyches and social milieus. This data about people's social and
psychological behaviour can be used to predict their vulnerabilities to depression in an
unobtrusive and fine-grained manner. Moving to research on social media, over the
last few years, there has been growing interest in using social media as a tool for
public health, ranging from identifying the spread of flu symptoms (Sadilek et al.,
2012), to building insights about diseases based on postings on Twitter (Paul &
Dredze, 2011). However, research on harnessing social media for understanding
behavioural health disorders is still in its infancy. Kotikalapudi et al., (2012) analysed
patterns of web activity of college students that could signal depression. Similarly,
Moreno et al., (2011) demonstrated that status updates on Facebook could reveal
symptoms of major depressive episodes.
2.4 Tools used
Python:
Python is a widely used general-purpose, high-level programming language. It was initially designed by Guido van Rossum and first released in 1991, and it is now developed by the Python Software Foundation. It emphasises code readability, and its syntax allows programmers to express concepts in fewer lines of code. Python lets you work quickly and integrate systems efficiently.
JSON:
JSON, or JavaScript Object Notation, is a format for structuring data. Like XML, it is one of the ways of formatting data, and such formats are used by web applications to communicate with each other. It is human-readable and writable. It is a lightweight, text-based data-interchange format, which means it is simpler to read and write than XML. Though derived from a subset of JavaScript, it is language independent; thus, code for generating and parsing JSON data can be written in any programming language.
CHAPTER-3
This chapter describes the block diagram, the different modules, and the various algorithms used in the project.
3.2.1 get_tweets.py:
This module takes a username as input and returns the latest 100 tweets posted by that user. Four keys are provided through the Twitter developer account; these keys are used to authenticate the developer details via tweepy. The libraries used in this code are:
Sys:
This module provides system-specific parameters and functions. It gives access
to some variables used or maintained by the interpreter and to functions that
interact strongly with the interpreter. It is always available. The sys module
provides functions and variables used to manipulate different parts of the
Python runtime environment. sys.argv returns a list of command line
arguments passed to a Python script. The item at index 0 in this list is always
the name of the script. The rest of the arguments are stored at the subsequent
indices. sys.exit causes the script to exit back to either the Python console or
the command prompt. This is generally used to safely exit from the program in
case of generation of an exception. sys.maxsize returns the largest integer a
variable can take. sys.path is an environment variable that is a search path for
all Python modules. The sys.version attribute is a string containing the version number of the current Python interpreter.
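A small usage sketch of these sys facilities:

import sys

print(sys.argv[0])   # index 0 is always the script name
print(sys.maxsize)   # largest value a Py_ssize_t variable can take
print(sys.version)   # version string of the running interpreter
if len(sys.argv) < 2:
    sys.exit("usage: script.py <username>")  # exit safely with a message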
Csv:
The so-called CSV (Comma Separated Values) format is the most common
import and export format for spreadsheets and databases. CSV format was
used for many years prior to attempts to describe the format in a standardized
way. The lack of a well-defined standard means that subtle differences often
exist in the data produced and consumed by different applications. These
differences can make it annoying to process CSV files from multiple sources.
Still, while the delimiters and quoting characters vary, the overall format is
similar enough that it is possible to write a single module which can efficiently
manipulate such data, hiding the details of reading and writing the data from
the programmer. The csv module implements classes to read and write tabular
data in CSV format. It allows programmers to say, “write this data in the
format preferred by Excel,” or “read data from this file which was generated
by Excel,” without knowing the precise details of the CSV format used by
Excel. Programmers can also describe the CSV formats understood by other
applications or define their own special-purpose CSV formats. The csv
module’s reader and writer objects read and write sequences. Programmers
can also read and write data in dictionary form using the DictReader and
DictWriter classes.
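A brief sketch of the reader, writer, and DictReader facilities described above:

import csv

# Write rows with a writer object.
with open("tweets.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([["user", "tweet"], ["adele", "hello"]])

# Read the same data back in dictionary form.
with open("tweets.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["user"], row["tweet"])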
Tweepy:
Tweepy is open-source, hosted on GitHub, and enables Python to communicate with the Twitter platform and use its API. Tweepy provides access to the well-documented Twitter API. With tweepy, it is possible to get any object and use any method that the official Twitter API offers. Tweepy supports accessing Twitter via Basic Authentication and the newer method, OAuth. Twitter has stopped accepting Basic Authentication, so OAuth is now the only way to use the Twitter API. The main difference between Basic and OAuth authentication is the consumer and access keys. With Basic Authentication, it was possible to provide a username and password and access the API, but since 2010, when Twitter started requiring OAuth, the process is a bit more complicated. One of the main use cases of tweepy is monitoring for tweets and performing actions when some event happens. A key component of that is the StreamListener object, which monitors tweets in real time and catches them. To sum up, tweepy is a great open-source library which provides access to the Twitter API for Python. Although the documentation for tweepy is a bit scarce and doesn't have many examples, the fact that it heavily relies on the Twitter API, which has excellent documentation, makes it probably the best Twitter library for Python, especially when considering the Streaming API support, which is where tweepy excels. Other libraries like python-twitter provide many functions too, but tweepy has the most active community and the most commits to the code.
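A minimal sketch of authenticating and fetching a user's recent tweets with the tweepy 3.x API used in this project (the keys are placeholders):

import tweepy

# Placeholder credentials from a Twitter developer account.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_KEY", "ACCESS_SECRET")
api = tweepy.API(auth)

# Fetch the latest tweets from a user's timeline.
for tweet in api.user_timeline(screen_name="adele", count=5):
    print(tweet.text)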
3.2.2 preprocessing.py:
After the text is obtained, the first step is text normalization. Text normalization includes:
removing numbers, punctuation and new lines
removing white spaces
removing stop words, sparse terms, and particular words
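A condensed sketch of these normalization steps, mirroring the approach in preprocessing.py (requires the NLTK stopwords and punkt data):

import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def normalise(text):
    text = re.sub(r'\d+', '', text)                                    # remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    text = " ".join(text.split())                                     # collapse new lines and white space
    stop_words = set(stopwords.words('english'))
    return " ".join(w for w in word_tokenize(text) if w not in stop_words)

print(normalise("So far I have gotten 2 of the important things I want!"))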
IO:
The io module provides Python’s main facilities for dealing with various types
of I/O. There are three main types of I/O: text I/O, binary I/O and raw I/O.
These are generic categories, and various backing stores can be used for each
of them. A concrete object belonging to any of these categories is called a
file_object. Other common terms are stream and file-like object. Independent
of its category, each concrete stream object will also have various capabilities:
it can be read-only, write-only, or read-write. It can also allow arbitrary
random access (seeking forwards or backwards to any location), or only
sequential access (for example in the case of a socket or pipe). Text I/O expects and produces str objects. This means that whenever the backing store is natively made of bytes (such as in the case of a file), encoding and decoding of data is done transparently, as is optional translation of platform-specific newline characters.
RE:
The re module provides regular-expression matching operations. Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python's usage of the same character in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string
literal. The solution is to use Python’s raw string notation for regular
expression patterns; backslashes are not handled in any special way in a string
literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n',
while "\n" is a one-character string containing a newline. Usually patterns will
be expressed in Python code using this raw string notation.
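A quick illustration of the raw-string notation just described:

import re

# r"\n" is two characters (a backslash and 'n'); "\n" is a single newline.
print(len(r"\n"), len("\n"))          # 2 1

# Raw strings keep patterns readable, e.g. matching a literal backslash:
print(re.findall(r"\\", "C:\\temp"))  # ['\\']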
REGEX:
Regular expression patterns are compiled into a series of bytecodes which are
then executed by a matching engine written in C. For advanced use, it may be
necessary to pay careful attention to how the engine will execute a given RE,
and write the RE in a certain way in order to produce bytecode that runs faster.
Optimization isn’t covered in this document, because it requires that you have
a good understanding of the matching engine’s internals.
The regular expression language is relatively small and restricted, so not all
possible string processing tasks can be done using regular expressions. There
are also tasks that can be done with regular expressions, but the expressions
turn out to be very complicated. In these cases, you may be better off writing
Python code to do the processing; while Python code will be slower than an
elaborate regular expression, it will also probably be more understandable.
NLTK.CORPUS:
The modules in this package provide functions that can be used to read corpus
files in a variety of formats. These functions can be used to read both the
corpus files that are distributed in the NLTK corpus package, and corpus files
that are part of external corpora.
Each corpus module defines one or more "corpus reader functions", which can be used to read documents from that corpus. These functions take an argument, item, which is used to indicate which document should be read from the corpus: if item is one of the unique identifiers listed in the corpus module's items variable, then the corresponding document will be loaded from the NLTK corpus package.
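For example, reading the stopwords corpus used during pre-processing:

from nltk.corpus import stopwords

# nltk.download("stopwords")  # one-time download of the corpus data
english_stops = stopwords.words("english")
print(len(english_stops), english_stops[:5])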
3.2.3 analyse_text.py:
IBM Watson Personality Insights is used to analyse texts and generate the Big Five personality characteristics, i.e.:
openness
conscientiousness
extraversion
agreeableness
neuroticism
PersonalityInsightsV3:
The Personality Insights service uses linguistic analytics to infer individuals' intrinsic personality characteristics, including Big Five, Needs, and Values, from digital communications such as email, text messages, tweets, and
forum posts. The service can automatically infer, from potentially noisy social
media, portraits of individuals that reflect their personality characteristics. The
service can report consumption preferences based on the results of its analysis,
and for JSON content that is timestamped, it can report temporal behaviour. The
service offers a single /v3/profile method that accepts up to 20 MB of input data
and produces results in JSON or CSV format. The service accepts input in Arabic,
English, Japanese, Korean, or Spanish and can produce output in a variety of
languages. You authenticate to the service by using your service credentials. You
can use your credentials to authenticate via a proxy server that resides in Bluemix,
or you can use your credentials to obtain a token and contact the service directly.
By default, all Watson services log requests and their results. Data is collected
only to improve the Watson services. If you do not want to share your data, set the
header parameter X-Watson-Learning-Opt-Out to true for each request. Data is
collected for any request that omits this header.
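A hedged sketch of calling the service with the watson_developer_cloud SDK used in this project (the version date and credentials are placeholders):

from watson_developer_cloud import PersonalityInsightsV3

service = PersonalityInsightsV3(
    version="2017-10-13",
    url="https://gateway-lon.watsonplatform.net/personality-insights/api",
    iam_apikey="YOUR_API_KEY")

# Opt out of request logging, as described above.
service.set_default_headers({"X-Watson-Learning-Opt-Out": "true"})

profile = service.profile(open("filteredtext.txt").read(),
                          content_type="text/plain").get_result()
for trait in profile["personality"]:
    print(trait["trait_id"], trait["percentile"])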
Json:
The json module provides an API for converting in-memory Python objects to a serialized representation known as JavaScript Object Notation (JSON) and vice versa. JSON is a popular data format used for representing structured data, and it is common to transmit and receive data between a server and a web application in JSON format. The json module makes it easy to parse JSON strings and files containing JSON objects. You can parse a JSON string using the json.loads() method, which returns a dictionary. You can use the json.load() method to read a file containing a JSON object. You can convert a dictionary to a JSON string using the json.dumps() method; to write JSON to a file, the json.dump() method can be used. To analyse and debug JSON data, it may need to be printed in a more readable format, which can be done by passing the additional parameters indent and sort_keys to json.dumps() and json.dump().
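A compact example of this round-trip:

import json

data = {"user": "adele", "percentiles": [0.7, 0.4]}
text = json.dumps(data, indent=2, sort_keys=True)  # dict -> JSON string
print(text)
parsed = json.loads(text)                          # JSON string -> dict
print(parsed["user"])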
Numpy:
NumPy is a module for Python; the name is an acronym for "Numeric Python". It is an extension module for Python, mostly written in C, which ensures that its precompiled mathematical and numerical functions execute at great speed.
Csv:
The csv module is described under get_tweets.py in Section 3.2.1 above; the same reader, writer, DictReader and DictWriter facilities are used here.
Pandas:
Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labelled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open-source data analysis and manipulation tool available in any language, and it is already well on its way toward this goal.
Sys:
The sys module is described under get_tweets.py in Section 3.2.1 above.
3.2.4 model_predict.py:
This code models and predicts the behaviour of a user. The dataset is given as the input, and the columns of the dataset to be used for training and as the target are specified. A data frame was created from the given dataset, with the CES-D scale output set as the target column. The model was then trained using a Support Vector Machine with a Radial Basis Function kernel. The following libraries are used:
Sklearn:
It is a free software machine learning library for the Python programming language, designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python. It is licensed under a permissive simplified BSD license and is distributed in many Linux distributions, encouraging academic and commercial use. Extensions or modules for SciPy are conventionally named SciKits; as such, the module provides learning algorithms and is named scikit-learn. The vision for the library is a level of robustness and support required for use in production systems, which means a deep focus on concerns such as ease of use, code quality, collaboration, documentation and performance.
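A minimal sketch of the training step described above (the file and column names follow the code in Appendix 1):

import pandas
from sklearn.svm import SVC

names = ['Openness', 'conscientiousness', 'extraversion',
         'agreeableness', 'neuroticism', 'satisfaction', 'depression']
data = pandas.read_csv('foo.csv', names=names)

X = data[names[:6]]      # Big Five percentiles plus satisfaction as inputs
y = data['depression']   # CES-D label as the target column

model = SVC(kernel='rbf', C=1.0)   # RBF-kernel SVM, as in Section 3.3
model.fit(X, y)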
Urllib:
urllib is a package that collects several modules for working with URLs:
o urllib.request for opening and reading URLs
o urllib.error containing the exceptions raised by urllib.request
o urllib.parse for parsing URLs
o urllib.robotparser for parsing robots.txt files.
The urllib.parse module defines functions to manipulate URLs and their component parts, to build or break them. It usually focuses on splitting a URL into small components, or joining different URL components into a URL string.
urllib.error defines the exception classes raised by urllib.request; whenever there is an error in fetching a URL, this module helps in raising exceptions such as URLError and HTTPError.
MatPlotLib:
Matplotlib is a Python 2D plotting library which produces publication quality
figures in a variety of hardcopy formats and interactive environments across
platforms. Matplotlib can be used in Python scripts, the Python
and IPython shells, the Jupyter notebook, web application servers, and four
graphical user interface toolkits. Matplotlib tries to make easy things easy and
hard things possible. You can generate plots, histograms, power spectra, bar
charts, errorcharts, scatterplots, etc., with just a few lines of code. For
examples, see the sample plots and thumbnail gallery.
3.3 Theoretical Algorithms
3.3.1 Support Vector Machine
Support Vector Machines are based on the concept of decision planes that define decision boundaries. A decision plane is one that separates a set of objects having different class memberships. A Support Vector Machine (SVM) is primarily a classifier method that performs classification tasks by constructing hyperplanes in a multidimensional space that separate cases of different class labels.
SVM supports both regression and classification tasks and can handle multiple
continuous and categorical variables. For categorical variables a dummy variable is
created with case values as either 0 or 1. Thus, a categorical dependent variable
consisting of three levels, say (A, B, C), is represented by a set of three dummy
variables:
A: {1 0 0}, B: {0 1 0}, C: {0 0 1}
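For instance, the dummy encoding above can be produced directly with pandas:

import pandas

labels = pandas.Series(["A", "B", "C", "A"])
print(pandas.get_dummies(labels))  # one 0/1 column per level: A, B, C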
3.3.2 Radial Basis Function Kernel
The Radial basis function kernel, also called the RBF kernel, or Gaussian
kernel, is a kernel that is in the form of a radial basis function (more specifically, a Gaussian function). The RBF kernel is defined as:
K(x, x') = exp(-γ ||x − x'||²)
where γ > 0 is a free parameter.
Intuitively, the gamma parameter defines how far the influence of a single training
example reaches, with low values meaning ‘far’ and high values meaning ‘close’.
The gamma parameters can be seen as the inverse of the radius of influence of
samples selected by the model as support vectors.
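A tiny numeric illustration of the γ parameter's effect (values chosen arbitrarily):

import numpy as np

def rbf(x, y, gamma):
    # K(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

a, b = np.array([0.0, 0.0]), np.array([1.0, 1.0])
for gamma in (0.1, 1.0, 10.0):
    # Larger gamma -> narrower bell -> influence reaches less far.
    print(gamma, rbf(a, b, gamma))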
3.4 Advantages of the proposed model
CHAPTER-4
IMPLEMENTATION OF THE PROPOSED SYSTEM
This chapter includes the flow of activity of the project, the design steps from the beginning to the end of the project, code snippets and their descriptions, a description of the datasets used, and the testing process.
Fig. 4.1 Activity Diagram
big5_openness:
Appreciation for art, emotion, adventure, unusual ideas, curiosity, and variety
of experience. Openness reflects the degree of intellectual curiosity, creativity
and a preference for novelty and variety a person has. It is also described as
the extent to which a person is imaginative or independent and depicts a
personal preference for a variety of activities over a strict routine.
big5_conscientiousness:
Tendency to be organized and dependable, show self-discipline, act dutifully,
aim for achievement, and prefer planned rather than spontaneous behaviour.
big5_extraversion:
Energy, surgency, assertiveness, sociability, the tendency to seek stimulation in the company of others, and talkativeness. High extraversion is often perceived as attention-seeking and domineering.
big5_agreeableness:
Tendency to be compassionate and cooperative rather than suspicious and
antagonistic towards others. It is also a measure of one's trusting and helpful
nature, and whether a person is generally well-tempered or not.
big5_neuroticism:
Tendency to be prone to psychological stress. The tendency to experience
unpleasant emotions easily, such as anger, anxiety, depression,
and vulnerability. Neuroticism also refers to the degree of emotional stability
and impulse control and is sometimes referred to by its low pole, "emotional
stability".
SWLS:
The total score ranges from 5 (extremely dissatisfied) to 35 (extremely satisfied). The self-reported score can also be used to reflect the emotional well-being of respondents.
CES-D:
Its total score ranges from 0 to 60 and can be used to classify a respondent as depressed or non-depressed.
4.3 Design
The dataset was cleaned, removing unsuitable data: numbers, punctuation, spaces, new lines and stop words. Users with fewer than 50 English posts were then removed from the dataset; this ensured that there were enough patterns to classify users. Finally, outputs were labelled from the SWLS and CES-D scores. Users with any unanswered question were removed from the dataset. Users with SWLS scores over 25 were labelled as life-satisfied and users with scores below 25 as life-dissatisfied. A cut-off score of 20 was used to label users according to their CES-D score.
The following is a brief description of the three kinds of personality insights that are provided by this service:
Personality characteristics: The service can build a portrait of an individual's personality characteristics and how they engage with the world across five primary dimensions: Openness, Conscientiousness, Extroversion, Agreeableness, and Neuroticism (also known as Emotional Range).
Needs: The service can infer certain aspects of a product that will resonate with an individual across twelve needs: Excitement, Harmony, Curiosity, Ideal, Closeness, Self-expression, Liberty, Love, Practicality, Stability, Challenge, and Structure.
Values: The service can identify values that describe motivating factors which influence a person's decision-making across five dimensions: Self-transcendence / Helping others, Conservation / Tradition, Hedonism / Taking pleasure in life, Self-enhancement / Achieving success, and Open to change / Excitement.
1) Care/harm: This foundation is related to our long evolution as mammals with attachment systems and an ability to feel (and dislike) the pain of others. It underlies virtues of kindness, gentleness, and nurturance.
2) Fairness/cheating: This foundation is related to the evolutionary process of reciprocal altruism. It generates ideas of justice, rights, and autonomy.
3) Loyalty/betrayal: This foundation is related to our long history as tribal creatures able to form shifting coalitions. It underlies virtues of patriotism and self-sacrifice for the group. It is active anytime people feel that it's "one for all, and all for one."
4) Authority/subversion: This foundation was shaped by the long primate history of
hierarchical social interactions. It underlies virtues of leadership and followership,
including deference to legitimate authority and respect for traditions.
5) Sanctity/degradation: This foundation was shaped by the psychology of disgust
and contamination. It underlies religious notions of striving to live in an elevated, less
carnal, more noble way. It underlies the widespread idea that the body is a temple
which can be desecrated by immoral activities and contaminants (an idea not unique to religious traditions).
This dictionary was used as the basis for text analysis of the tweets in the IBM Watson plugin. It is a benchmark dictionary that categorises each type of word (as authoritative, fair, etc.); the percentage of every category is given as output into the dataset.
That these underlying factors can be found is consistent with the lexical hypothesis: personality characteristics that are most important in people's lives will eventually become a part of their language and, secondly, more important personality characteristics are more likely to be encoded into language as a single word. The five factors are:
Agreeableness (friendly/compassionate vs. challenging/detached). Tendency to
be compassionate and cooperative rather than suspicious and antagonistic towards
others. It is also a measure of one's trusting and helpful nature, and whether a
person is generally well-tempered or not. High agreeableness is often seen as
naive or submissive. Low agreeableness personalities are often competitive or
challenging people, which can be seen as argumentative or untrustworthy.
Neuroticism (sensitive/nervous vs. secure/confident). Tendency to be prone to
psychological stress. The tendency to experience unpleasant emotions easily, such
as anger, anxiety, depression, and vulnerability. Neuroticism also refers to the
degree of emotional stability and impulse control and is sometimes referred to by
its low pole, "emotional stability". High stability manifests itself as a stable and
calm personality, but can be seen as uninspiring and unconcerned. Low stability
manifests as the reactive and excitable personality often found in dynamic
individuals, but can be perceived as unstable or insecure. Also, individuals with
higher levels of neuroticism tend to have worse psychological well-being.
After processing these features, two predictive models were developed and
evaluated: one for classifying users according to life satisfaction and another one for
detecting users with depression.
The life satisfaction predictive model (LF model) is a classifier for labelling users as life-satisfied or life-dissatisfied. A set of predictive models was developed using the features described above as inputs and the life-satisfaction labels as outputs. Several techniques were considered, including a support vector machine (SVM) with a radial basis function (RBF) kernel, logistic regression, decision tree and naive Bayes. The best model is used to supply the life-satisfaction input to the CES-D (depression) classifier.
The main aim of this study is to develop a depression predictive model. The
depression classifier used the same features as the LF model. However, this model relies on an additional input from the best LF model. Outputs to train the depression model were calculated from the CES-D scores of each user. The depression prediction model was trained with supervised learning approaches similar to those above.
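A hedged sketch of this two-level prediction flow (the model file names follow Appendix 1; the feature values are illustrative):

import numpy as np
import pickle

# Level 1: predict life satisfaction from the Big Five percentiles.
swls_model = pickle.load(open("swls_model.sav", "rb"))
big5 = np.array([[0.62, 0.41, 0.55, 0.70, 0.33]])  # illustrative percentiles
satisfaction = swls_model.predict(big5)

# Level 2: append that prediction as a sixth feature for the depression model.
cesd_model = pickle.load(open("cesd_model.sav", "rb"))
features = np.append(big5, satisfaction).reshape(1, 6)
print(cesd_model.predict(features))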
4.4 System Requirements
4.4.1 Software Requirements:
The following software tools are required to execute the different classifiers used in evaluating the project.
Python:
Python is the scripting language which is used to implement the algorithms. The latest
version of Python 3 was used. Python scripts can be edited using notepad or any
coding IDE.
Operating System:
The most widely used operating system, providing a graphical user interface with a large number of applications available.
4.4.2 Hardware Requirements:
Processor:
Computers with more CPU cores can outperform those with a lower core count, but results will vary with the application. A quad-core CPU enables faster execution of the algorithms compared to older processors.
Hard Disk:
The hard disk speed is a significant factor in program start-up time; once the program is running, disk speed matters mainly for disk-intensive applications. To improve the execution time of such programs, you can take advantage of technologies such as solid-state drives.
CHAPTER-5
The purpose of this study is to develop a multilevel predictive model to detect users with depression. This project tests the correlation between SWLS and CES-D scores.
5.1 Collecting Tweets
The get_tweets module takes the Twitter username as input. Using the Twitter API and tweepy, the tweets are collected in text format.
5.2 Pre-Processing
The pre-processing module takes as input the name of the file in which the tweets are stored. It removes numbers, white spaces, new lines, punctuation and stop words. The output is another text file, "filteredtext.txt", where all the pre-processed tweets are stored.
Fig 5.3 Screenshot of the output of pre-processing module for username “Adele”
5.3 Text Analysis
Text analysis is done using the IBM Watson API. Using the IBM credentials, the text is analysed and the percentiles of the Big Five personality traits are given as output in the form of a JSON object. To create the dataset, this object is parsed and the values are stored in a .csv file. For prediction, the JSON object is converted to a list object, which is saved for the next step.
5.4 Comparing Algorithms for modelling
Linear Discriminant Analysis:
If you have more than two classes, then Linear Discriminant Analysis is the preferred linear classification technique. LDA is a simple model in both preparation and application. There are some interesting statistics behind how the model is set up and how the prediction equation is derived. The representation of LDA is straightforward. It consists of statistical properties of your data, calculated for each class. For a single input variable (x), this is the mean and the variance of the variable for each class. For multiple variables, the same properties are calculated over the multivariate Gaussian, namely the means and the covariance matrix. These statistical properties are estimated from your data and plugged into the LDA equation to make predictions. These are the model values that you would save to file for your model. LDA makes some simplifying assumptions about your data: that your data is Gaussian, i.e. each variable is shaped like a bell curve when plotted, and that each attribute has the same variance, i.e. values of each variable vary around the mean by the same amount on average.
K-Nearest Neighbour:
KNN can be used for both classification and regression predictive problems; however, it is more widely used in classification problems in industry. To evaluate any technique, three important aspects are considered: ease of interpreting the output, calculation time, and predictive power. KNN fares well across all of these considerations and is commonly used for its ease of interpretation and low calculation time. First, consider what exactly K influences in the algorithm. Given a fixed set of training observations, the boundaries of each class can be drawn for a given value of K; these boundaries segregate the classes. The error rate at K=1 is always zero for the training sample, because the closest point to any training data point is itself, so the prediction is always accurate; with K=1, the model overfits the boundaries. The validation error rate therefore initially decreases with increasing K and reaches a minimum; after the minimum point, it increases with increasing K. To get the optimal value of K, segregate the training and validation sets from the initial dataset and plot the validation error curve; the value of K at the minimum should be used for all predictions.
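A short sketch of choosing K by cross-validated accuracy (illustrative stand-in data):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Pick the K with the best cross-validated accuracy.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 20, 2)}
print(max(scores, key=scores.get))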
Support Vector Machine:
SVM can be thought of as dividing a plane into two parts, wherein each class lies on either side. The
objective of the support vector machine algorithm is to find a hyperplane in an
N-dimensional space(N — the number of features) that distinctly classifies the
data points. To separate the two classes of data points, there are many possible
hyperplanes that could be chosen. The objective is to find a plane that has the
maximum margin, i.e the maximum distance between data points of both
classes. Maximizing the margin distance provides some reinforcement so that
future data points can be classified with more confidence. Hyperplanes are
decision boundaries that help classify the data points. Data points falling on
either side of the hyperplane can be attributed to different classes. Also, the
dimension of the hyperplane depends upon the number of features. If the
number of input features is 2, then the hyperplane is just a line. If the number
of input features is 3, then the hyperplane becomes a two-dimensional plane. It
becomes difficult to imagine when the number of features exceeds 3. Support
vectors are data points that are closer to the hyperplane and influence the
position and orientation of the hyperplane. Using these support vectors, the
margin of the classifier is maximized. Deleting the support vectors will change
the position of the hyperplane. These are the points that help build the SVM. In
logistic regression, the output of the linear function is squashed into the range [0,1] using the sigmoid function; if the squashed value is greater than a threshold value (0.5), label 1 is assigned, else label 0. In SVM, the output of the linear function is taken directly: if that output is greater than 1, the point is identified with one class, and if the output is less than -1, with the other class. Since the threshold values are changed to 1 and -1 in SVM, a reinforcement range of values ([-1,1]) is obtained, which acts as the margin. The SVM algorithm maximises the margin between the data points and the hyperplane. The loss function that helps maximise the margin is the hinge loss. A regularisation parameter is added to the cost function; its objective is to balance margin maximisation and loss.
Radial Basis Function Kernel:
The main advantage of the kernel methods is the possibility of using linear
models in a nonlinear subspace by an implicit transformation of patterns to a
high-dimensional feature space without computing their images directly. An
appropriately constructed kernel results in a model that fits well to the
structure underlying data and doesn’t over-fit to the sample. Recent state-of-
the-art kernel evaluation measures are examined and their application in
kernel optimization is verified. Optimization leveraging these measures
results in parameters corresponding to the classifiers that achieve minimal
error rate for the RBF kernel. The Radial basis function kernel, also called the RBF kernel or Gaussian kernel, is a kernel in the form of a radial basis function (more specifically, a Gaussian function). The RBF kernel is defined as:
K(x, x') = exp(-γ ||x − x'||²)
The RBF kernel is formed by taking an infinite sum over polynomial kernels.
The RBF kernel represents this similarity as a decaying function of the
distance between the vectors (i.e. the squared norm of their difference). That is, if the two vectors are close together, the difference will be small; thus, closer vectors have a larger RBF kernel value than farther vectors. This
function is of the form of a bell-shaped curve. The γ parameter sets the width
of the bell-shaped curve. The larger the value of γ the narrower will be the
bell. Small values of γ yield wide bells.
Fig 4.6 Algorithms Comparison
5.5 Training the models
The dataset is given as the input to train the two models using the Support Vector Machine algorithm with a radial basis function kernel. The satisfaction model's training attributes are the Big Five personality attributes, and the target attribute is the satisfaction column. The prediction model's training attributes are the Big Five personality attributes and the satisfaction attribute; the target attribute is the depression column.
Fig. 5.8 Output of Training Models
Fig 5.10 Screenshot of the predicted output of “urstrulymahesh”
CHAPTER-6
CONCLUSIONS/ RECOMMENDATIONS
6.1 Conclusions
The purpose of this study was to build a multilevel predictive model to detect
users with depression. The model predicts life satisfaction in users and then uses this
prediction to refine the detection of users with depression. The results show that the
multilevel model can outperform a depression predictive model alone. The results also reflect a negative correlation between depression and life satisfaction.
6.1.1 Limitations
Online behaviour usually cannot fully depict the actual mental state of a person. This project rests on the premise that every person posts their feelings on social media platforms without any bias.
Every person must truthfully answer the questionnaire.
As the next step, interaction features, e.g., comments, likes, and replies, shall be added to the model to improve the accuracy of detecting depression. An important impact of
this research is to enable online personalised interventions, providing suitable help,
health recommendations, or health information links to individual users, in a similar
way that social networks currently advertise products to users. Another application is
for decision support system data capture to help doctors and psychologists get better
and more frequent information from a patient. In future work, more observation
factors shall be extracted to improve the performance of the model and understand
other behaviours which can distinguish users with and without depression, e.g., topic
analysis, graph analysis, and image analysis.
REFERENCES
APPENDIX 1: CODE
get_tweets.py:
import sys
import csv
import tweepy

consumer_key = "IgM5R8xVMGNJBZuqQ42RbEHI5"
consumer_secret = "Myu6ernLGk2ODUrsBNWr6deZu01MzlnVMTkvz93JARFJVHuoqU"
access_key = "339802190-clJpgJofm8n6tXUPVwrmWDILUZFdEGvecKqynjbY"
access_secret = "qqOr9Fx5Sh3uIxS0nf8Y0ic1rnoK2HJhQg0uAIpEGL5fh"

def get_tweets(username):
    # Authenticate the developer credentials with OAuth.
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    # Collect the latest 100 tweets from the user's timeline.
    number_of_tweets = 100
    tweets_for_csv = []
    for tweet in api.user_timeline(screen_name=username, count=number_of_tweets):
        tweets_for_csv.append([tweet.text.encode("utf-8")])
    # print(tweets_for_csv)

    # Write one tweet per row to <username>.csv.
    outfile = username + ".csv"
    print("writing to " + outfile)
    with open(outfile, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(tweets_for_csv)

if __name__ == '__main__':
    if len(sys.argv) == 2:
        get_tweets(sys.argv[1])
    else:
        print("usage: python get_tweets.py <username>")
Pre-Processing:
import io
import re
import string
import regex
import sys
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

if __name__ == '__main__':
    if len(sys.argv) == 2:
        filename = sys.argv[1]
    else:
        sys.exit("usage: python preprocessing.py <filename>")

    file = "" + filename + ".txt"
    file1 = open(file)
    line = file1.read()

    # Join the lines into a single string and strip numbers.
    words = line.split('\n')
    mysen = ""
    remnum = mysen.join(words)
    resultnum = re.sub(r'\d+', '', remnum)

    # Strip punctuation.
    resultpunc = resultnum.translate(str.maketrans('', '', string.punctuation))
    # print(resultpunc)

    # Remove English stop words and write out the filtered tokens.
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(resultpunc)
    appendFile = open('filteredtext.txt', 'w')
    for r in word_tokens:
        if not r in stop_words:
            appendFile.write(" " + r)
    appendFile.close()
Text Analysis:
import json
import numpy
import csv
import pandas
import sys
from watson_developer_cloud import PersonalityInsightsV3

url = 'https://gateway-lon.watsonplatform.net/personality-insights/api'
apikey = 'X7GVHZpsWB3FBCM7bwyboAMFwMHCK-RBFQSAyJ37f7Q-'

# Authenticate to the Personality Insights service.
service = PersonalityInsightsV3(version='2017-10-13', url=url, iam_apikey=apikey)

file1 = open("filteredtext.txt")
line = file1.read()
text = ""

# Analyse the pre-processed tweets and collect the Big Five percentiles.
profile = service.profile(line, content_type='text/plain').get_result()
# print(profile)
trait_id = []
percentile = []
for b in profile["personality"]:
    # print(b)
    trait_id.append(b["trait_id"])
    percentile.append(b["percentile"])
    # print(b["trait_id"])
# print(percentile)

# Append the percentiles as one row of the dataset file.
myfile = open('foo.csv', 'a')
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
wr.writerow(percentile)
myfile.close()
Training of Models:
import urllib
import numpy as np
import pandas
import warnings
import pickle
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report

# ---- Depression (CES-D) model ----
filename1 = 'foo.csv'
names1 = ['Openness','conscientiousness','extraversion','agreeableness','neuroticism','satisfaction','depression']
data_set1 = pandas.read_csv(filename1, names=names1)
X1 = pandas.DataFrame(data_set1)
d = pandas.Series(data_set1['depression'])
y1 = pandas.DataFrame(d)
del X1[data_set1.columns[6]]   # drop the target column from the inputs
X1.columns = [data_set1.columns[0], data_set1.columns[1], data_set1.columns[2],
              data_set1.columns[3], data_set1.columns[4], data_set1.columns[5]]
# print("Printing X1:")
# print(X1)

C = 1.0
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2)
depression = SVC(kernel='rbf', C=C).fit(X1_train, y1_train)
y1_pred = depression.predict(X1_test)
warnings.filterwarnings('ignore')
print("Depression Model:\n")
print(confusion_matrix(y1_test, y1_pred))
print(classification_report(y1_test, y1_pred))

# ---- Satisfaction (SWLS) model ----
filename = 'foo.csv'
names = ['Openness','conscientiousness','extraversion','agreeableness','neuroticism','satisfaction','depression']
data_set = pandas.read_csv(filename, names=names)
X = pandas.DataFrame(data_set)
s = pandas.Series(data_set['satisfaction'])
y = pandas.DataFrame(s)
del X[data_set.columns[6]]   # drop depression
del X[data_set.columns[5]]   # drop satisfaction (the target)
# print("Printing X:")
# print(X)
X.columns = [data_set.columns[0], data_set.columns[1], data_set.columns[2],
             data_set.columns[3], data_set.columns[4]]

C = 1.0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
satisfaction = SVC(kernel='rbf', C=C).fit(X_train, y_train)
y_pred = satisfaction.predict(X_test)
print("Satisfaction Model:\n")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Persist both trained models for the prediction step.
filename = 'swls_model.sav'
pickle.dump(satisfaction, open(filename, 'wb'))
filename1 = 'cesd_model.sav'
pickle.dump(depression, open(filename1, 'wb'))
import os
import sys

# Driver script: runs the pipeline stage by stage for a given username.
if __name__ == '__main__':
    if len(sys.argv) == 2:
        username = sys.argv[1]
    else:
        sys.exit("Please pass the Twitter username as the only argument.")

    st1 = "python get_tweets.py " + username
    st2 = "python preprocessing.py " + username
    os.system(st1)
    os.system(st2)
    os.system("python text_analysis.py")
    os.system("python svm.py")
import urllib
import numpy as np
import pandas
import warnings
import pickle
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

filename = 'foo.csv'
names = ['Openness','conscientiousness','extraversion','agreeableness','neuroticism','satisfaction','depression']
dataframe = pandas.read_csv(filename, names=names)
X = pandas.DataFrame(dataframe)
d = pandas.Series(dataframe['depression'])
Y = pandas.DataFrame(d)
warnings.filterwarnings('ignore')
del X[dataframe.columns[6]]   # drop the target column from the inputs
X.columns = [dataframe.columns[0], dataframe.columns[1], dataframe.columns[2],
             dataframe.columns[3], dataframe.columns[4], dataframe.columns[5]]

seed = 7
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('SVM', SVC()))

# Evaluate each model with 10-fold cross-validation.
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

# Box plot comparing the algorithms.
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
import pickle
from watson_developer_cloud import PersonalityInsightsV3
import json
import numpy
import csv
import pandas
import sys
import os

if __name__ == '__main__':
    if len(sys.argv) == 2:
        username = sys.argv[1]
    else:
        sys.exit("Please pass the Twitter username as the only argument.")

    # Collect and pre-process the user's tweets.
    st1 = "python get_tweets.py " + username
    st2 = "python preprocessing.py " + username
    os.system(st1)
    os.system(st2)
    print("Preprocessing done!")
    # os.system("python text_analysis_main.py")

    url = 'https://gateway-lon.watsonplatform.net/personality-insights/api'
    apikey = 'X7GVHZpsWB3FBCM7bwyboAMFwMHCK-RBFQSAyJ37f7Q-'
    service = PersonalityInsightsV3(version='2017-10-13', url=url, iam_apikey=apikey)

    file1 = open("filteredtext.txt")
    line = file1.read()
    text = ""

    # Get the Big Five percentiles for the pre-processed tweets.
    profile = service.profile(line, content_type='text/plain').get_result()
    # print(profile)
    trait_id = []
    percentile = []
    for b in profile["personality"]:
        # print(b)
        trait_id.append(b["trait_id"])
        percentile.append(b["percentile"])
        # print(b["trait_id"])

    new_percentile = numpy.reshape(percentile, (-1, 5))
    print(new_percentile)
    # os.system("python svm.py")

    # Level 1: predict life satisfaction from the Big Five percentiles.
    filename = 'swls_model.sav'
    swls_model = pickle.load(open(filename, 'rb'))
    result = swls_model.predict(new_percentile)
    print(result)
    percentile.append(result[0])
    # print(percentile)
    new_percentile = numpy.reshape(percentile, (-1, 6))

    # Level 2: predict depression from the Big Five plus satisfaction.
    filename1 = 'cesd_model.sav'
    cesd_model = pickle.load(open(filename1, 'rb'))
    result = cesd_model.predict(new_percentile)
    print("Depression: ")
    print(result)
APPENDIX 2: MORAL FOUNDATIONS DICTIONARY
%
01 HarmVirtue
02 HarmVice
03 FairnessVirtue
04 FairnessVice
05 IngroupVirtue
06 IngroupVice
07 AuthorityVirtue
08 AuthorityVice
09 PurityVirtue
10 PurityVice
11 MoralityGeneral
%
safe* 01
peace* 01
compassion* 01
empath* 01
sympath* 01
care 01
caring 01
protect* 01
shield 01
shelter 01
amity 01
secur* 01
benefit* 01
defen* 01
guard* 01
preserve 01 07 09
harm* 02
suffer* 02
war 02
wars 02
warl* 02
warring 02
fight* 02
violen* 02
hurt* 02
kill 02
kills 02
killer* 02
killed 02
killing 02
endanger* 02
cruel* 02
brutal* 02
abuse* 02
damag* 02
ruin* 02 10
ravage 02
detriment* 02
crush* 02
attack* 02
annihilate* 02
destroy 02
stomp 02
abandon* 02 06
spurn 02
impair 02
exploit 02 10
exploits 02 10
exploited 02 10
exploiting 02 10
wound* 02
fair 03
fairly 03
fairness 03
fair-* 03
fairmind* 03
fairplay 03
equal* 03
justice 03
justness 03
justifi* 03
reciproc* 03
impartial* 03
egalitar* 03
rights 03
equity 03
evenness 03
equivalent 03
unbias* 03
tolerant 03
equable 03
balance* 03
homologous 03
unprejudice* 03
reasonable 03
constant 03
honest* 03 11
unfair* 04
unequal* 04
bias* 04
unjust* 04
injust* 04
bigot* 04
discriminat* 04
disproportion* 04
inequitable 04
prejud* 04
dishonest 04
unscrupulous 04
dissociate 04
preference 04
favoritism 04
segregat* 04 05
exclusion 04
exclud* 04
together 05
nation* 05
homeland* 05
family 05
families 05
familial 05
group 05
loyal* 05 07
patriot* 05
communal 05
commune* 05
communit* 05
communis* 05
comrad* 05
cadre 05
collectiv* 05
joint 05
unison 05
unite* 05
fellow* 05
guild 05
solidarity 05
devot* 05
member 05
cliqu* 05
cohort 05
ally 05
insider 05
foreign* 06
enem* 06
betray* 06 08
treason* 06 08
traitor* 06 08
treacher* 06 08
disloyal* 06 08
individual* 06
apostasy 06 08 10
apostate 06 08 10
deserted 06 08
deserter* 06 08
deserting 06 08
deceiv* 06
jilt* 06
imposter 06
miscreant 06
spy 06
sequester 06
renegade 06
terroris* 06
immigra* 06
obey* 07
obedien* 07
duty 07
law 07
lawful* 07 11
legal* 07 11
duti* 07
honor* 07
respect 07
respectful* 07
respected 07
respects 07
order* 07
father* 07
mother 07
motherl* 07
mothering 07
mothers 07
tradition* 07
hierarch* 07
authorit* 07
permit 07
permission 07
status* 07
rank* 07
leader* 07
class 07
bourgeoisie 07
caste* 07
position 07
complian* 07
command 07
supremacy 07
control 07
submi* 07
allegian* 07
serve 07
abide 07
defere* 07
defer 07
revere* 07
venerat* 07
comply 07
defian* 08
rebel* 08
dissent* 08
subver* 08
disrespect* 08
disobe* 08
sediti* 08
agitat* 08
insubordinat* 08
illegal* 08
lawless* 08
insurgent 08
mutinous 08
defy* 08
dissident 08
unfaithful 08
alienate 08
defector 08
heretic* 08 10
nonconformist 08
oppose 08
protest 08
refuse 08
denounce 08
remonstrate 08
riot* 08
obstruct 08
piety 09 11
pious 09 11
purity 09
pure* 09
clean* 09
steril* 09
sacred* 09
chast* 09
holy 09
holiness 09
saint* 09
wholesome* 09 11
celiba* 09
abstention 09
virgin 09
virgins 09
virginity 09
virginal 09
austerity 09
integrity 09 11
modesty 09
abstinen* 09
abstemiousness 09
upright 09 11
limpid 09
unadulterated 09
maiden 09
virtuous 09
refined 09
decen* 09 11
immaculate 09
innocent 09
pristine 09
church* 09
disgust* 10
deprav* 10
disease* 10
unclean* 10
contagio* 10
indecen* 10 11
sin 10
sinful* 10
sinner* 10
sins 10
sinned 10
sinning 10
slut* 10
whore 10
dirt* 10
impiety 10
impious 10
profan* 10
gross 10
repuls* 10
sick* 10
promiscu* 10
lewd* 10
adulter* 10
debauche* 10
defile* 10
tramp 10
prostitut* 10
unchaste 10
intemperate 10
wanton 10
profligate 10
filth* 10
trashy 10
obscen* 10
lax 10
taint* 10
stain* 10
tarnish* 10
debase* 10
desecrat* 10
wicked* 10 11
blemish 10
exploitat* 10
pervert 10
wretched* 10 11
righteous* 11
moral* 11
ethic* 11
value* 11
upstanding 11
good 11
goodness 11
principle* 11
blameless 11
exemplary 11
lesson 11
canon 11
doctrine 11
noble 11
worth* 11
ideal* 11
praiseworthy 11
commendable 11
character 11
proper 11
laudable 11
correct 11
wrong* 11
evil 11
immoral* 11
bad 11
offend* 11
offensive* 11
transgress* 11