
A Project Report

MULTILEVEL PREDICTIVE MODEL FOR


DETECTING DEPRESSION IN SOCIAL
MEDIA USERS
Submitted in partial fulfilment of the requirements for the award of the degree of

BACHELOR OF ENGINEERING
in

COMPUTER SCIENCE AND ENGINEERING


by

SAI SREESHMA THUPAKULA (160115733140)

SAI TEJASWI MUTTAVARAPU (160115733142)

Under the guidance of

Mr. G. Vivek

Assistant Professor

Department of Computer Science and Engineering

Chaitanya Bharathi Institute of Technology (Autonomous),


(Affiliated to Osmania University, Hyderabad)

Hyderabad, TELANGANA (INDIA) – 500 075

May, 2019
A Project Report

MULTILEVEL PREDICTIVE MODEL FOR


DETECTING DEPRESSION IN SOCIAL
MEDIA USERS
Submitted in partial fulfilment of the requirements for the award of the degree of

BACHELOR OF ENGINEERING

in

COMPUTER SCIENCE AND ENGINEERING

by

SAI SREESHMA THUPAKULA (160115733140)


SAI TEJASWI MUTTAVARAPU (160115733142)

Department of Computer Science and Engineering

Chaitanya Bharathi Institute of Technology (Autonomous),

(Affiliated to Osmania University, Hyderabad)

Hyderabad, TELANGANA (INDIA) – 500 075

May, 2019
CERTIFICATE

This is to certify that the project work entitled “Multilevel Predictive Model for
Detecting Depression in Social Media Users” is the bona fide work carried out by T.
Sai Sreeshma (160115733140) and M. Sai Tejaswi (160115733142), students of
B.E. (CSE) of Chaitanya Bharathi Institute of Technology, Hyderabad, affiliated to
Osmania University, Hyderabad, Telangana (India), during the academic year 2018-19,
submitted in partial fulfilment of the requirements for the award of the degree of
Bachelor of Engineering in Computer Science and Engineering, and that the
project has not previously formed the basis for the award of any other degree,
diploma, fellowship or any other similar title.

Supervisor: Mr. G. Vivek, Assistant Professor, Department of CSE, CBIT, Hyderabad

Head of the Department: Dr. M. Swamy Das, Professor and Head, Department of CSE, CBIT, Hyderabad

Place: Hyderabad

Date:

External Examiner
DECLARATION
We hereby declare that the project entitled “Multilevel Predictive Model for
Detecting Depression in Social Media Users” submitted for the B.E. (CSE) degree is
our original work and that the project has not formed the basis for the award of any other
degree, diploma, fellowship or any other similar title.

T. Sai Sreeshma
(160115733140)

M. Sai Tejaswi
(160115733142)

Place: Hyderabad

Date:

ACKNOWLEDGEMENT

The satisfaction that accompanies the successful completion of any task would be
incomplete without introducing the people who made it possible and whose constant
guidance and encouragement crowns all efforts with success. They have been a
guiding light and source of inspiration towards the completion of the project.

We would like to express our sincere gratitude and indebtedness to our project guide,
Mr. G. Vivek, who has supported us throughout our project with patience and
knowledge.

We are also thankful to the Head of the Department, Dr. M. Swami Das, for providing
excellent infrastructure and a conducive atmosphere for completing this project
successfully.

We are also extremely thankful to our Project Coordinators Dr. K. Sagar, Professor,
Dept. of CSE, CBIT and Dr. M. Venu Gopalachari, Associate Professor, Dept. of
CSE, CBIT for their valuable suggestions and interest throughout the course of this
project.

We convey our heartfelt thanks to the lab staff for allowing us to use the required
equipment whenever needed.

Finally, we would like to take this opportunity to thank our families for their support
throughout this work. We sincerely acknowledge and thank all those who directly or
indirectly supported the completion of this work.

T. Sai Sreeshma
(160115733140)

M. Sai Tejaswi
(160115733142)

ABSTRACT
Depression is a widespread illness, with about 300 million people
suffering from it globally. It is more than just a negative mood; it is a condition that requires treatment.
People who lead a life with high satisfaction tend to suffer fewer mental health issues.

The large volume of data that users generate on social network
platforms can help detect hidden patterns in the data and obtain new insights. This
project aims to develop a multilevel predictive model to detect users with depression
by predicting life satisfaction in users. The model learns about the relationship
between life satisfaction and depression in users of a social network, Twitter for
example. Detecting depression in this way would allow the user to get help and
health recommendations. It would also allow doctors and
psychologists to get frequent information from the patient.

Keywords: Predictive model, Machine learning, Mental health, Depression, Life satisfaction, Social network

LIST OF FIGURES

S.No Figure Name Page No.

3.1 Block Diagram of project 8
4.1 Activity Diagram 22
4.2 Screenshot of Dataset 24
4.3 Screenshot of list of twitter files 24
3.4 Screenshot of Moral Foundations Dictionary 31
5.1 Screenshot of the output of get_tweets with username “Adele” 31
5.2 Screenshot of the output of get_tweets with username “ArianaGrande” 32
5.3 Screenshot of the output of pre-processing module for username “Adele” 32
5.4 Screenshot of the output of pre-processing module for username “ArianaGrande” 33
5.5 Screenshot of output of Text Analysis for a number of usernames 38
5.6 Algorithm Comparison 38
5.7 ROC Curve Comparison 40
5.8 Output of Training Models 40
5.9 Screenshot of the predicted output for “Theellenshow” 40
5.10 Screenshot of the predicted output for “arianagrande” 41
TABLE OF CONTENTS

Title Page i
Certificate ii
Declaration iii
Acknowledgement iv
Abstract v
List of Figures vi

1. INTRODUCTION 1
1.1 Problem Definition including significance and objective 1
1.2 Methodologies 1
1.3 Outline of Results 2
1.4 Scope of Project 3
1.5 Organisation of the Report 3

2. LITERATURE SURVEY 4
2.1 Introduction to the problem domain terminology 4
2.2 Background 5
2.3 Existing Systems and Major problems 6
2.4 Tools/ Technologies used 7

3. DESIGN OF PROPOSED SYSTEM 8
3.1 Block Diagram 8
3.2 Module description 8
3.3 Theoretical Foundations/ Algorithms 19
3.4 Advantages of Proposed System 21

4. IMPLEMENTATION OF THE PROPOSED SYSTEM 22
4.1 Activity Diagram 22
4.2 Data Set Description 23
4.3 Design 24
4.4 System Requirements 30

5. RESULTS AND DISCUSSIONS 31

6. CONCLUSION AND FUTURE WORK 42
6.1 Conclusion 42
6.1.1 Limitations 42
6.2 Future Work 42

REFERENCES 43
APPENDIX 1: CODE 44
APPENDIX 2: DICTIONARY 59
CHAPTER-1
INTRODUCTION

Society is currently witnessing an unprecedented growth in the incidence
of mental disorders, with an estimated 300 million people suffering from depression
globally. People with high life satisfaction tend to suffer fewer mental health issues.

1.1. Problem Definition including significance and objective

Depression is considered one of the main causes of suicidal tendencies.
Social media allows people to post their feelings, indirectly creating a large amount of
information to process. This model uses social media data, particularly Twitter
data, to find different patterns of depression across different profiles.

Problem Definition:

The large volume of data generated on social network platforms helps detect
patterns in the data and obtain new insights. Using these patterns, this project detects
life satisfaction levels among different social network users.

Objective of the Project:

This project aims to

 Explore the relationship between life satisfaction and depression in


social network users.
 Develop a multilevel predictive model to detect users with depression.

1.2 Methodology

This section describes the datasets, the methods used to extract information
from the datasets, the measures of life satisfaction and depressive statuses, and the
construction of the predictive model.

A. Dataset:

Our dataset is a synthesised dataset generated using the IBM Watson API,
by providing a folder that contains a text file of tweets for each individual user.

B. Ground Truth Measures:

1. Satisfaction with Life Scale: SWLS is a valid and reliable questionnaire to measure
life satisfaction [3]. This questionnaire contains 5 questions, e.g., in most ways my
life is close to my ideal, the conditions of my life are excellent, I am satisfied with my
life, so far I have gotten the important things I want in life, and If I could live my life
over, I would change almost nothing, requesting respondents to indicate their feelings
on a scale from 1 (strongly disagree) to 7 (strongly agree). The total score ranges from
5 (extremely dissatisfied) to 35 (extremely satisfied). The self-reported score can
also be used to reflect the emotional well-being of respondents.

2. Centre for Epidemiological Study Depression Scale: CES-D is one of the most
common and popular questionnaires to measure depressive tendencies [20]. It has
been used to screen respondents with depression both in the general population and in
primary care settings [21]. This questionnaire consists of 20 items. Each of them asks
how often the respondent has felt in different states or moods over the last week. The
answers are rarely or none of the time (less than 1 day), some or a little of the time (1-
2 days), occasionally or a moderate amount of the time (3-4 days), and most or all of
the time (5-7 days). Its total score ranges from 0 to 60 and can be used to classify a
respondent as depressed or non-depressed.

C. Predictive models

Two predictive models are developed and evaluated: one for classifying users
according to life satisfaction and another for detecting users with depression. Both
models rely on supervised learning techniques, which need inputs and outputs for
training. Supervised learning techniques learn patterns from a set of given inputs and
then try to classify or categorise the inputs into classes following a given set of known
outputs. After the training stage, a well-trained model can classify a new set of inputs
into a suitable class.

1.3 Outline of the results

The output of the project is the level of depression on the CES-D scale. The CES-D
scale denotes the level of depression in the user, obtained by analysing the user’s Twitter profile.

1.4 Scope of the project


This project aims to explore correlations between life satisfaction and depression in
social network users, especially on Twitter. Then, a multilevel predictive model is
developed that first labels users by life satisfaction and then uses that label as an
additional input to detect users with depression.

1.5 Organisation of the Report

This report is organized as follows:

 Chapter 1 introduces the project, its objective, problem definition, existing


system, proposed system.
 Chapter 2 gives a detailed literature survey of the existing work related to this
project.
 Chapter 3 discusses the system design.
 Chapter 4 discusses the implementation of the proposed system and system
requirements.
 Chapter 5 analyses the results.
 Chapter 6 concludes the project and addresses the future scope.
 References are attached, followed by an appendix consisting of code
segments.

CHAPTER-2

LITERATURE SURVEY

Mental illness is becoming a serious health problem worldwide, with a
growing number of patients suffering from depression, anxiety and other disorders.
New solutions are needed to tackle this issue.

2.1. Introduction

The main goal of this research project is to develop prediction models to


classify users with poor mental health from social network data and then implement
an intervention model to help these users.

In the last decade, people have started to openly express their activities,
feelings, and thoughts on social network platforms. Thanks to these sites, people can
also establish new virtual relationships, communicate information, and interact with
other users. These networks attract large numbers of users, with Facebook alone
having over 2.2 billion monthly active users in March 2018. Twitter reported 336
million monthly active accounts in March 2018. The users of these networks are
generating a large volume of data, which is attracting researchers and companies in
hope of understanding hidden patterns in data and obtaining new insights.

The growing amount of data and advances in high-performance computing
have led to the popularisation of data mining and machine learning techniques, as well as data
science, which enable researchers to interpret and visualise information from
complex datasets. Using these methods, predictive models have been developed to
detect, classify, and categorise data in a wide variety of research domains, including
health, energy, marketing, and genetics.

Given the relationship between mental health problems and life satisfaction, it
is possible to detect people with mental health problems on the basis of messages or
emotions they express that provide information about their life
satisfaction. Previous studies have linked well-being and mental illness in online
users.

So far, a small, but notable, number of studies have investigated mental health
problems using social network data. Some studies have developed predictive models
to classify users according to their levels of life satisfaction and depression. To the
best of our knowledge, no study has developed a multilevel predictive model to
classify users with depression based on satisfaction with life.

This project aims to explore correlations between life satisfaction and
depression in social network users, especially on Twitter. Then, a multilevel predictive
model was developed to label users by life satisfaction, and that label was then used as an
additional input to detect users with depression.

2.2. Background

Rich bodies of work on depression in psychiatry, psychology, medicine, and


sociolinguistics describe efforts to identify and understand correlates of major depressive disorder (MDD) in
individuals. Cloninger et al., (2006) examined the role of personality traits in the
vulnerability of individuals to a future episode of depression, through a longitudinal
study. On the other hand, Rude et al., (2003) found support for the claim that negative
processing biases, particularly (cognitive) biases in resolving ambiguous verbal
information can predict subsequent depression. Robinson and Alloy, (2003) similarly
observed that negative cognitive styles and stress reactive rumination were predictive
of the onset, number and duration of depressive episodes. Finally, Brown et al.,
(1986) found that lack of social support and lowered self-esteem are important factors
linked to higher incidences of depression. Among a variety of somatic factors,
reduced energy, disturbed sleep, eating disorders, and stress and tension have also
been found to be correlates of depressive disorders (Abdel-Khalek, 2004). In the field
of sociolinguistics, Oxman et al., (1982) showed that linguistic analysis of speech
could classify patients into groups suffering from depression and paranoia.
Computerized analysis of written text through the LIWC program has also been found
to reveal predictive cues about neurotic tendencies and psychiatric disorders (Rude,
Gortner & Pennebaker, 2004).
Although studies to date have improved our understanding of factors that are
linked to mental disorders, a notable limitation of prior research is that it relies
heavily on small, often homogeneous samples of individuals, who may not
necessarily be representative of the larger population. Further, these studies typically

are based on surveys, relying on retrospective self-reports about mood and
observations about health: a method that limits temporal granularity. That is, such
assessments are designed to collect high-level summaries about experiences over long
periods of time. Collecting finer-grained longitudinal data has been difficult, given the
resources and invasiveness required to observe individuals’ behaviour over months
and years. The continuing streams of evidence from social media on posting activity
often reflect people’s psyches and social milieus. This data about people’s social and
psychological behaviour can be used to predict their vulnerabilities to depression in an
unobtrusive and fine-grained manner. Moving to research on social media, over the
last few years, there has been growing interest in using social media as a tool for
public health, ranging from identifying the spread of flu symptoms (Sadilek et al.,
2012), to building insights about diseases based on postings on Twitter (Paul &
Dredze, 2011). However, research on harnessing social media for understanding
behavioural health disorders is still in its infancy. Kotikalapudi et al., (2012) analysed
patterns of web activity of college students that could signal depression. Similarly,
Moreno et al., (2011) demonstrated that status updates on Facebook could reveal
symptoms of major depressive episodes.

2.3 Existing Systems and Major Problems in Existing Systems


The existing systems consider only one kind of scale to measure
depression. A Twitter profile provides only limited data about a person,
so a single scale to measure depression is not sufficient. A combined scale of measure
must be used to achieve more accuracy in the results.
Existing works have demonstrated that leveraging social media for healthcare, and in
particular stress detection, is feasible. However, there are limitations to Twitter
content-based stress detection. Users do not always express their stressful states
directly in a Twitter post. Even when no stress is revealed by the post itself, the
follow-up interactive comments made by the user and their friends may support the
conclusion that the user is actually stressed, for example by work. Thus, simply relying on a
user’s Twitter post content for stress detection is insufficient. Moreover, users with high
psychological stress may exhibit low activeness on social networks. As a result, stress
detection performance is low.

2.4 Tools used
 Python:
Python is a widely used general-purpose, high-level programming language. It
was designed by Guido van Rossum, first released in 1991, and is developed by the
Python Software Foundation. It was mainly designed with an emphasis on code
readability, and its syntax allows programmers to express concepts in fewer
lines of code. Python is a programming language that lets you work quickly
and integrate systems more efficiently.
 JSON:
JSON, or JavaScript Object Notation, is a format for structuring data. Like
XML, it is one of the ways of formatting data. This format is used
by web applications to communicate with each other. It is human-readable
and writable. It is a lightweight, text-based data-interchange format, which
means it is simpler to read and write when compared to XML. Although it is
derived from a subset of JavaScript, it is language independent. Thus, the
code for generating and parsing JSON data can be written in any
programming language.

CHAPTER-3

DESIGN OF THE PROPOSED SYSTEM

This chapter describes the block diagram, description of different modules and
various algorithms used in the project.

3.1 Block Diagram


The block diagram describes the different parts of the project. The Twitter
API extracts the tweets using Tweepy. Once the posts are available, a classifier is
used to classify the tweets based on the CES-D and SWLS questionnaires. Then, a
report is generated after classifying the user’s tweets and predicting the scores.

Fig. 3.1 Block Diagram of the project

3.2 Module Description

3.2.1 get_tweets.py:
This piece of code takes a username as input and returns the latest 100
tweets posted by that individual user. Four keys are provided through the
Twitter developer account. These keys are used to authenticate the developer details
using Tweepy. The libraries used in this code are:

 Sys:
This module has System-specific parameters and functions. It provides access
to some variables used or maintained by the interpreter and to functions that
interact strongly with the interpreter. It is always available. The sys module
provides functions and variables used to manipulate different parts of the
Python runtime environment. sys.argv returns a list of command line
arguments passed to a Python script. The item at index 0 in this list is always
the name of the script. The rest of the arguments are stored at the subsequent
indices. sys.exit causes the script to exit back to either the Python console or
the command prompt. This is generally used to safely exit from the program in
case of generation of an exception. sys.maxsize returns the largest integer a
variable can take. sys.path is an environment variable that is a search path for
all Python modules. sys.version attribute displays a string containing the
version number of the current Python interpreter.

 Csv:
The so-called CSV (Comma Separated Values) format is the most common
import and export format for spreadsheets and databases. CSV format was
used for many years prior to attempts to describe the format in a standardized
way. The lack of a well-defined standard means that subtle differences often
exist in the data produced and consumed by different applications. These
differences can make it annoying to process CSV files from multiple sources.
Still, while the delimiters and quoting characters vary, the overall format is
similar enough that it is possible to write a single module which can efficiently
manipulate such data, hiding the details of reading and writing the data from
the programmer. The csv module implements classes to read and write tabular
data in CSV format. It allows programmers to say, “write this data in the
format preferred by Excel,” or “read data from this file which was generated
by Excel,” without knowing the precise details of the CSV format used by
Excel. Programmers can also describe the CSV formats understood by other
applications or define their own special-purpose CSV formats. The csv
module’s reader and writer objects read and write sequences. Programmers

can also read and write data in dictionary form using the DictReader and
DictWriter classes.
 Tweepy:
Tweepy is open-sourced, hosted on GitHub, and enables Python to
communicate with the Twitter platform and use its API. Tweepy provides access
to the well-documented Twitter API. With Tweepy, it is possible to get any
object and use any method that the official Twitter API offers. Tweepy
supports accessing Twitter via Basic Authentication and the newer
method, OAuth. Twitter has stopped accepting Basic Authentication,
so OAuth is now the only way to use the Twitter API. The main
difference between Basic and OAuth authentication is the consumer
and access keys. With Basic Authentication, it was possible to
provide a username and password and access the API, but since 2010,
when Twitter started requiring OAuth, the process is a bit more
complicated. One of the main use cases of Tweepy is monitoring for
tweets and performing actions when some event happens. A key component
of that is the StreamListener object, which monitors tweets in real
time and catches them. To sum up, Tweepy is a great open-source
library which provides access to the Twitter API for Python.
Although the documentation for Tweepy is a bit scarce and doesn't
have many examples, the fact that it heavily relies on the Twitter API,
which has excellent documentation, makes it probably the best
Twitter library for Python, especially when considering
the Streaming API support, which is where Tweepy excels. Other
libraries like python-twitter provide many functions too, but Tweepy
has the most active community and the most commits to its code.

3.2.2 preprocessing.py:

After a text is obtained, the first step is text normalization. Text normalization
includes:

 converting all letters to lower or upper case
 removing numbers
 removing punctuation, accent marks and other diacritics
 removing white spaces
 removing stop words, sparse terms, and particular words
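A minimal sketch of these normalisation steps, assuming the NLTK stop-word list has already been downloaded with nltk.download('stopwords'):

import re
import string
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

def normalise(text):
    # Convert to lower case and remove numbers.
    text = text.lower()
    text = re.sub(r"\d+", "", text)
    # Remove punctuation and collapse extra white space.
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = " ".join(text.split())
    # Remove common English stop words.
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

print(normalise("In most ways, my life is close to my ideal!! 123"))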

The libraries used are:

 IO:

The io module provides Python’s main facilities for dealing with various types
of I/O. There are three main types of I/O: text I/O, binary I/O and raw I/O.
These are generic categories, and various backing stores can be used for each
of them. A concrete object belonging to any of these categories is called a
file_object. Other common terms are stream and file-like object. Independent
of its category, each concrete stream object will also have various capabilities:
it can be read-only, write-only, or read-write. It can also allow arbitrary
random access (seeking forwards or backwards to any location), or only
sequential access (for example in the case of a socket or pipe). Text I/O
expects and produces str objects. This means that whenever the backing store is
natively made of bytes (such as in the case of a file), encoding and decoding of
data is made transparently as well as optional translation of platform-specific
newline characters.

 RE:

This module provides regular expression matching operations similar to those


found in Perl. Both patterns and strings to be searched can be Unicode strings
(str) as well as 8-bit strings (bytes). However, Unicode strings and 8-bit strings
cannot be mixed: that is, you cannot match a Unicode string with a byte
pattern or vice-versa; similarly, when asking for a substitution, the
replacement string must be of the same type as both the pattern and the search
string. Regular expressions use the backslash character ('\') to indicate special
forms or to allow special characters to be used without invoking their special
meaning. This collides with Python’s usage of the same character for the same
purpose in string literals; for example, to match a literal backslash, one might
have to write '\\\\' as the pattern string, because the regular expression must be

\\, and each backslash must be expressed as \\ inside a regular Python string
literal. The solution is to use Python’s raw string notation for regular
expression patterns; backslashes are not handled in any special way in a string
literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n',
while "\n" is a one-character string containing a newline. Usually patterns will
be expressed in Python code using this raw string notation.

 REGEX:

Regular expressions (called REs, or regexes, or regex patterns) are essentially


a tiny, highly specialized programming language embedded inside Python and
made available through the re module. Using this little language, you specify
the rules for the set of possible strings that you want to match; this set might
contain English sentences, or e-mail addresses, or TeX commands, or anything
you like. You can then ask questions such as “Does this string match the
pattern?”, or “Is there a match for the pattern anywhere in this string?”. You
can also use REs to modify a string or to split it apart in various ways.

Regular expression patterns are compiled into a series of bytecodes which are
then executed by a matching engine written in C. For advanced use, it may be
necessary to pay careful attention to how the engine will execute a given RE,
and write the RE in a certain way in order to produce bytecode that runs faster.
Optimization isn’t covered in this document, because it requires that you have
a good understanding of the matching engine’s internals.

The regular expression language is relatively small and restricted, so not all
possible string processing tasks can be done using regular expressions. There
are also tasks that can be done with regular expressions, but the expressions
turn out to be very complicated. In these cases, you may be better off writing
Python code to do the processing; while Python code will be slower than an
elaborate regular expression, it will also probably be more understandable.

 NLTK.CORPUS:

The modules in this package provide functions that can be used to read corpus
files in a variety of formats. These functions can be used to read both the
corpus files that are distributed in the NLTK corpus package, and corpus files
that are part of external corpora.

Each corpus module defines one or more "corpus reader functions", which can
be used to read documents from that corpus. These functions take an
argument, ``item``, which is used to indicate which document should be read
from the corpus:

-If “item” is one of the unique identifiers listed in the corpus module's “items”
variable, then the corresponding document will be loaded from the NLTK
corpus package.

-If “item” is a filename, then that file will be read.

3.2.3 analyse_text.py:
IBM Watson Personality Insights is used to analyse texts and generate the Big Five
characteristics, i.e.,

 openness
 conscientiousness
 extraversion
 agreeableness
 neuroticism

The following modules are used:

 PersonalityInsightsV3:

The IBM Watson Personality Insights service provides a


Representational State Transfer (REST) Application Programming Interface (API)
that enables applications to derive insights from social media, enterprise data, or
other digital communications. The service uses linguistic analytics to infer
individuals’ intrinsic personality characteristics, including Big Five, Needs, and

Values, from digital communications such as email, text messages, tweets, and
forum posts. The service can automatically infer, from potentially noisy social
media, portraits of individuals that reflect their personality characteristics. The
service can report consumption preferences based on the results of its analysis,
and for JSON content that is timestamped, it can report temporal behaviour. The
service offers a single /v3/profile method that accepts up to 20 MB of input data
and produces results in JSON or CSV format. The service accepts input in Arabic,
English, Japanese, Korean, or Spanish and can produce output in a variety of
languages. You authenticate to the service by using your service credentials. You
can use your credentials to authenticate via a proxy server that resides in Bluemix,
or you can use your credentials to obtain a token and contact the service directly.
By default, all Watson services log requests and their results. Data is collected
only to improve the Watson services. If you do not want to share your data, set the
header parameter X-Watson-Learning-Opt-Out to true for each request. Data is
collected for any request that omits this header.
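As an illustration only, a profile request with the 2019-era watson_developer_cloud SDK might look roughly like the sketch below. The credentials, URL and file name are placeholders, and the exact constructor and method arguments depend on the SDK version in use.

import json
from watson_developer_cloud import PersonalityInsightsV3

# Hypothetical credentials; real values come from the IBM Cloud service instance.
personality_insights = PersonalityInsightsV3(
    version="2017-10-13",
    iam_apikey="YOUR_API_KEY",
    url="https://gateway.watsonplatform.net/personality-insights/api",
)

# Read the pre-processed tweets produced by preprocessing.py.
with open("filtered.txt", encoding="utf-8") as f:
    text = f.read()

# Request a personality profile for the text.
profile = personality_insights.profile(
    text, content_type="text/plain", raw_scores=True
).get_result()

# The Big Five traits are reported under the "personality" key as percentile scores.
for trait in profile["personality"]:
    print(trait["name"], trait["percentile"])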

 Json:

The json module provides an API for converting in-memory Python objects to
a serialized representation known as JavaScript Object Notation (JSON), and
vice versa. JSON is a popular data format used for
representing structured data. It is common to transmit and receive data between a
server and web application in JSON format. The json module makes it easy to
parse JSON strings and files containing JSON objects. You can parse a JSON
string using the json.loads() method. The method returns a dictionary. You can use
the json.load() method to read a file containing a JSON object. You can convert a
dictionary to a JSON string using the json.dumps() method. To write JSON to a file in
Python, the json.dump() method can be used. To analyse and debug JSON data, it
may need to be printed in a more readable format. This can be done by passing the
additional parameters indent and sort_keys to the json.dumps() and json.dump()
methods.
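A short round trip with these functions (the dictionary below is just an example):

import json

profile = {"username": "user_a", "big5_openness": 0.72, "big5_neuroticism": 0.41}

# Serialise the dictionary to a JSON string and parse it back.
encoded = json.dumps(profile, indent=2, sort_keys=True)
decoded = json.loads(encoded)
print(decoded["big5_openness"])

# dump() and load() do the same thing with file objects.
with open("profile.json", "w") as f:
    json.dump(profile, f, indent=2)
with open("profile.json") as f:
    print(json.load(f)["username"])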

 Numpy:

NumPy is a module for Python. The name is an acronym for "Numeric Python". It
is an extension module for Python, mostly written in C. This makes sure that the
precompiled mathematical and numerical functions and functionalities of Numpy
guarantee great execution speed.

Furthermore, NumPy enriches the programming language Python with powerful


data structures, implementing multi-dimensional arrays and matrices. These data
structures guarantee efficient calculations with matrices and arrays. The
implementation is even aiming at huge matrices and arrays, better known under
the heading of "big data". Besides that, the module supplies a large library of
high-level mathematical functions to operate on these matrices and arrays.

 Csv:

The csv module is described in Section 3.2.1 above; here it is used to write the
analysed personality scores for each user into the dataset file.

 Pandas:

Pandas is a python package providing fast, flexible, and expressive data structures
designed to make working with “relational” or “labelled” data both easy and
intuitive. It aims to be the fundamental high-level building block for doing
practical, real world data analysis in Python. Additionally, it has the broader goal
of becoming the most powerful and flexible open source data analysis /
manipulation tool available in any language. It is already well on its way toward
this goal.

Pandas is well suited for many different kinds of data:

 Tabular data with heterogeneously-typed columns, as in an SQL table


or Excel spreadsheet
 Ordered and unordered (not necessarily fixed-frequency) time series
data.
 Arbitrary matrix data (homogeneously typed or heterogeneous) with
row and column labels
 Any other form of observational / statistical data sets. The data actually
need not be labelled at all to be placed into a pandas data structure

The two primary data structures of pandas, Series (1-dimensional) and DataFrame
(2-dimensional), handle the vast majority of typical use cases in finance, statistics,
social science, and many areas of engineering. For R users, DataFrame provides
everything that R’s data.frame provides and much more. pandas is built on top of
NumPy and is intended to integrate well within a scientific computing environment
with many other 3rd party libraries.
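For instance, the synthesised dataset can be loaded and summarised in a couple of lines; the file name and column selection below are assumptions that mirror the fields described in Section 4.2.

import pandas as pd

# Load the synthesised dataset and summarise a few of its columns.
df = pd.read_csv("dataset.csv")
print(df[["big5_openness", "big5_neuroticism", "SWLS", "CES-D"]].describe())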

 Sys:

The sys module is described in Section 3.2.1 above; here it is used to read the
command-line arguments passed to the script.

3.2.4 model_predict.py:
This code models and predicts the behaviour of a user. The dataset is given as
the input, and the columns of the dataset to be used for training and as the target
are specified. A data frame was created for the given dataset and the target column
was set to the CES-D scale output. The model was then trained using a Support
Vector Machine with a Radial Basis Function kernel. The following are the libraries
used:

 Sklearn:
It is a free software machine learning library for the Python programming
language. It is designed to interoperate with the Python numerical and
scientific libraries NumPy and SciPy. Scikit-learn provides a range of
supervised and unsupervised learning algorithms via a consistent interface in
Python. It is licensed under a permissive simplified BSD license and is
distributed under many Linux distributions, encouraging academic and
commercial use. Extensions or modules for SciPy are conventionally
named SciKits. As such, the module provides learning algorithms and is
named scikit-learn. The vision for the library is a level of robustness and
support required for use in production systems. This means a deep focus on
concerns such as ease of use, code quality, collaboration, documentation and
performance.

 Urllib:
urllib is a package that collects several modules for working with URLs:
o urllib.request for opening and reading URLs
o urllib.error containing the exceptions raised by urllib.request
o urllib.parse for parsing URLs
o urllib.robotparser for parsing robots.txt files.
The urllib.parse module defines functions to manipulate URLs and their
component parts, to build or break them. It usually focuses on splitting a
URL into small components, or joining different URL components into a URL
string.

urllib.error defines the exception classes raised by urllib.request. Whenever
there is an error in fetching a URL, this module helps in raising exceptions.
The following are the exceptions raised:

o URLError – It is raised for errors in URLs, or errors while fetching
the URL due to connectivity, and has a ‘reason’ property that tells a
user the reason for the error.
o HTTPError – It is raised for exotic HTTP errors, such as
authentication request errors. It is a subclass of URLError. Typical
errors include ‘404’ (page not found), ‘403’ (request forbidden), and
‘401’ (authentication required).

urllib.robotparser contains a single class, RobotFileParser. This class answers
questions about whether or not a particular user agent can fetch a URL on a site
that publishes a robots.txt file. Robots.txt is a text file webmasters create to instruct
web robots how to crawl pages on their website. The robots.txt file tells the web
scraper which parts of the server should not be accessed.
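A small, self-contained example of these pieces working together (the URL is a placeholder):

from urllib import error, parse, request

url = "https://example.com/api?" + parse.urlencode({"q": "depression"})
try:
    with request.urlopen(url, timeout=10) as response:
        print(response.status)
except error.HTTPError as e:
    # Raised for HTTP error codes such as 404 or 403.
    print("HTTP error:", e.code)
except error.URLError as e:
    # Raised for connectivity problems; 'reason' explains the failure.
    print("URL error:", e.reason)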

 MatPlotLib:
Matplotlib is a Python 2D plotting library which produces publication quality
figures in a variety of hardcopy formats and interactive environments across
platforms. Matplotlib can be used in Python scripts, the Python
and IPython shells, the Jupyter notebook, web application servers, and four
graphical user interface toolkits. Matplotlib tries to make easy things easy and
hard things possible. You can generate plots, histograms, power spectra, bar
charts, errorcharts, scatterplots, etc., with just a few lines of code. For
examples, see the sample plots and thumbnail gallery.
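Putting these libraries together, a simplified version of what model_predict.py does might look like the sketch below. The file name, feature columns and train/test split are illustrative assumptions, not the project's exact code.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the synthesised dataset and derive the CES-D target label (cut-off of 20 assumed).
df = pd.read_csv("dataset.csv")
features = ["big5_openness", "big5_conscientiousness", "big5_extraversion",
            "big5_agreeableness", "big5_neuroticism", "SWLS"]
X = df[features]
y = (df["CES-D"] >= 20).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an SVM with a Radial Basis Function kernel, as described in Section 3.3.
model = SVC(kernel="rbf", C=1.0, gamma="scale")
model.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))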

3.3 Theoretical Algorithms

3.3.1 Support Vector Machines

Support Vector Machines are based on the concept of decision planes that
define decision boundaries. A decision plane is one that separates a set of
objects having different class memberships. A Support Vector Machine (SVM) is
primarily a classifier method that performs classification tasks by constructing
hyperplanes in a multidimensional space that separate cases of different class labels.
SVM supports both regression and classification tasks and can handle multiple
continuous and categorical variables. For categorical variables a dummy variable is
created with case values as either 0 or 1. Thus, a categorical dependent variable
consisting of three levels, say (A, B, C), is represented by a set of three dummy
variables:

A: {1 0 0}, B: {0 1 0}, C: {0 0 1}
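For example, such dummy variables for a three-level categorical variable can be produced with pandas (a small illustrative snippet, not part of the project code):

import pandas as pd

labels = pd.Series(["A", "B", "C", "A"])
# One 0/1 column per level, matching A: {1 0 0}, B: {0 1 0}, C: {0 0 1}.
print(pd.get_dummies(labels))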

To construct an optimal hyperplane, SVM employs an iterative training algorithm,


which is used to minimize an error function. According to the form of the error
function, SVM models can be classified into four distinct groups:

 Classification SVM Type 1 (also known as C-SVM classification)


 Classification SVM Type 2 (also known as nu-SVM classification)
 Regression SVM Type 1 (also known as epsilon-SVM regression)
 Regression SVM Type 2 (also known as nu-SVM regression)

3.3.2 Radial Basis Function Kernel
The Radial basis function kernel, also called the RBF kernel, or Gaussian
kernel, is a kernel that is in the form of a radial basis function (more specifically, a
Gaussian function). The RBF kernel is defined as

K(x, x′) = exp(−γ ‖x − x′‖²),

where γ is a parameter that sets the “spread” of the kernel.

Intuitively, the gamma parameter defines how far the influence of a single training
example reaches, with low values meaning ‘far’ and high values meaning ‘close’.
The gamma parameter can be seen as the inverse of the radius of influence of
samples selected by the model as support vectors.

The C parameter trades off correct classification of training examples against


maximization of the decision function’s margin. For larger values of C, a smaller
margin will be accepted if the decision function is better at classifying all training
points correctly. A lower C will encourage a larger margin, therefore a simpler
decision function, at the cost of training accuracy. In other words “C” behaves as a
regularization parameter in the SVM.
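The combined effect of C and gamma can be explored with a small grid search; the sketch below uses synthetic data as a stand-in for the project's feature matrix.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy data standing in for the real features (Big Five scores, SWLS, etc.).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Search a coarse grid of C and gamma values with 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)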

3.4 Advantages of the proposed model

 The purpose of this project is to build a multilevel predictive model to detect


users with depression. The model predicts life satisfaction in users and then
uses this prediction to refine the detection of users with depression.
 The results show that the multilevel model can outperform a depression
predictive model alone.
 Our model can also predict a negative correlation between depression and life
satisfaction. As the next step, interaction features, e.g., comments, likes, and
replies will be added to the model to improve the accuracy of detecting
depression.
 An important impact of this project is to enable online personalised
interventions, providing suitable help, health recommendations, or health
information links to individual users, in a similar way that social networks
currently advertise products to users.
 Another application is for decision support system data capture to help doctors
and psychologists get better and more frequent information from a patient.

CHAPTER-4

IMPLEMENTATION OF THE PROPOSED SYSTEM

This chapter includes the flow of activity of the project, the design
steps from the beginning to the end of the project, code snippets and description of
the project, a description of the dataset used, and the testing process.

4.1 Activity Diagram


The project starts with conducting the Satisfaction with Life Scale and CES-D
tests and calculating scores for both. Then, pre-processing is done, and the tweets
are extracted and classified using the two different scales. The tweets are then given as
input and the classifiers are used to predict the depression scale.

Fig. 4.1 Activity Diagram

4.2 Data Set Description


Our dataset is a synthesised dataset generated using the IBM Watson API,
by providing a folder that contains a text file of tweets for each individual user. The
data set has the following fields:

 big5_openness:
Appreciation for art, emotion, adventure, unusual ideas, curiosity, and variety
of experience. Openness reflects the degree of intellectual curiosity, creativity
and a preference for novelty and variety a person has. It is also described as
the extent to which a person is imaginative or independent and depicts a
personal preference for a variety of activities over a strict routine.
 big5_conscientiousness:
Tendency to be organized and dependable, show self-discipline, act dutifully,
aim for achievement, and prefer planned rather than spontaneous behaviour.
 big5_extraversion:
Energetic, surgency, assertiveness, sociability and the tendency to
seek stimulation in the company of others, and talkativeness. High
extraversion is often perceived as attention-seeking and domineering
 big5_agreeableness:
Tendency to be compassionate and cooperative rather than suspicious and
antagonistic towards others. It is also a measure of one's trusting and helpful
nature, and whether a person is generally well-tempered or not.
 big5_neuroticism:
Tendency to be prone to psychological stress. The tendency to experience
unpleasant emotions easily, such as anger, anxiety, depression,
and vulnerability. Neuroticism also refers to the degree of emotional stability
and impulse control and is sometimes referred to by its low pole, "emotional
stability".
 SWLS:
The total score ranges from 5 (extremely dissatisfied) to 35 (extremely
satisfied). The self-reported score can be also used to reflect emotional well-
being of respondents.

 CES-D:
Its total score ranges from 0 to 60 and can be used to classify a respondent as
depressed or non-depressed.

Fig 4.2 Screenshot of Dataset

Fig 4.3 Screenshot of list of twitter files

4.3 Design

4.3.1 Data Pre-Processing

The dataset was cleaned by removing unsuitable data: numbers,
punctuation, spaces, new lines and stop words. Users with fewer than 50 English posts
were then removed from the dataset. This ensured that there were
enough changing patterns to classify users. Finally, outputs were labelled from SWLS
and CES-D scores. Users with any unanswered question were removed from the
dataset. Users with SWLS scores over 25 were identified as life satisfied and users
with scores below 25 were labelled as life dissatisfied. A cut-off score of 20 was
used to label users according to their CES-D score.
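A minimal sketch of this labelling step, assuming the survey scores are already in a pandas DataFrame (whether the CES-D cut-off of 20 is inclusive is an assumption here):

import pandas as pd

# Hypothetical survey scores for three users.
df = pd.DataFrame({
    "username": ["user_a", "user_b", "user_c"],
    "SWLS": [30, 18, 25],
    "CES-D": [8, 27, 20],
})

# SWLS above 25 -> life satisfied; CES-D at or above 20 -> depressed.
df["life_satisfied"] = (df["SWLS"] > 25).astype(int)
df["depressed"] = (df["CES-D"] >= 20).astype(int)
print(df)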

4.3.2 IBM Watson Personality Insights


IBM Watson Personality Insights is a Watson Developer Cloud service that lets
you understand an individual’s personality characteristics based solely on text
analysis. This new demo adds features to maximize understanding of personality
characteristics and to improve the overall user experience, focusing on the relation
between personality traits and behaviours that a person is more or less likely to
express. In addition to the options of writing your own text to be analysed or copying
from a different source, the text samples available have been extended. Since Twitter
is one of the most popular media sources that users have been using to analyse and
draw conclusions from audiences, three celebrities who are frequent Twitter users
were added as samples, on which you can perform the personality analysis. Just select
a sample and hit Analyse! Together with the personality summary, which provides an
agile way to understand the outcome of the analysis, you will also visualize feedback
about how strong the results are, a list of the behaviours that the personality is likely
or unlikely to manifest, and the lists of traits and facets of the analysed personality
separated by category. The analysis strength is based on the number of words needed
to get statistically significant results. While invoking a service is allowed with as little
as 100 words, more words are needed to get reliable results. There are four levels of
strength: Weak (100~1500 words), Decent (1500~3500 words), Strong (3500~6000
words) and Very Strong (6000+ words). The behaviours shown are based on studies
by researchers which reveal different correlations between personality traits and
behaviours in certain industries or domains. Their findings include facts such as:
Spending habits are related to the emotional range trait, risk profiles to extraversion or
introversion, and healthy decisions are related to conscientiousness, to name a few.
The full personality scoreboard features all the trait values produced by the service
analysis, allowing a granular view of the personality that will let you explore each
individual aspect of the personality, with the possibility of a more complex analysis.
These traits are grouped into the three kinds of personality insights provided by the
service: Personality, Consumer Needs and Values. Each trait value is a percentile
score which compares the individual to a broader population.

The following is a brief description of the three kinds of personality insights that are
provided by this service:
 Personality characteristics: The service can build a portrait of an
individual’s personality characteristics and how they engage with the
world across five primary
dimensions: Openness, Conscientiousness, Extroversion, Agreeableness,
and Neuroticism (also known as Emotional Range).
 Needs: The service can infer certain aspects of a product that will resonate
with an individual across twelve
needs: Excitement, Harmony, Curiosity, Ideal, Closeness, Self-expression,
Liberty, Love, Practicality, Stability, Challenge, and Structure.
 Values: The service can identify values that describe motivating factors
which influence a person’s decision-making across five dimensions: Self-
transcendence / Helping others, Conservation / Tradition, Hedonism / Taking
pleasure in life, Self-enhancement / Achieving success, and Open to change /
Excitement.

4.3.3 The Moral Foundations Dictionary:


Moral Foundations Theory was created by a group of social and cultural
psychologists to understand why morality varies so much across cultures yet still
shows so many similarities and recurrent themes. The five foundations for which the
evidence is best are:

1) Care/harm: This foundation is related to the long evolution as mammals with


attachment systems and an ability to feel (and dislike) the pain of others. It underlies
virtues of kindness, gentleness, and nurturance.
2) Fairness/cheating: This foundation is related to the evolutionary process of
reciprocal altruism. It generates ideas of justice, rights, and autonomy. [Note: In the
original conception, Fairness included concerns about equality, which are more
strongly endorsed by political liberals. However, as the theory was reformulated in
2011 based on new data, proportionality is emphasized, which is endorsed by
everyone, but is more strongly endorsed by conservatives]
3) Loyalty/betrayal: This foundation is related to the long history as tribal creatures
able to form shifting coalitions. It underlies virtues of patriotism and self-sacrifice for

the group. It is active anytime people feel that it's "one for all, and all for one."
4) Authority/subversion: This foundation was shaped by the long primate history of
hierarchical social interactions. It underlies virtues of leadership and followership,
including deference to legitimate authority and respect for traditions.
5) Sanctity/degradation: This foundation was shaped by the psychology of disgust
and contamination. It underlies religious notions of striving to live in an elevated, less
carnal, more noble way. It underlies the widespread idea that the body is a temple
which can be desecrated by immoral activities and contaminants (an idea not unique
to religious traditions)

This dictionary was used as the basis for text analysis of the tweets in the IBM
Watson plugin. It is a benchmark dictionary that categorises every type of word
as authoritative, fair, etc.; the percentage of every category is given as output into
the dataset.

4.3.4 Big Five Personality Traits:


The Big Five personality traits, also known as the five-factor model (FFM) and
the OCEAN model, is a taxonomy for personality traits. It is based on common
language descriptors. When factor analysis (a statistical technique) is applied
to personality survey data, some words used to describe aspects of personality are
often applied to the same person. For example, someone described as conscientious is
more likely to be described as "always prepared" rather than "messy". This theory is
based therefore on the association between words but not
on neuropsychological experiments. This theory uses descriptors of common language
and therefore suggests five broad dimensions commonly used to describe the
human personality and psyche.

The five factors have been defined as openness to experience, conscientiousness,


extraversion, agreeableness, and neuroticism, represented by the acronym OCEAN.
Beneath each proposed global factor, there are a number of correlated and more
specific primary factors. For example, extraversion is said to include such related
qualities as gregariousness, assertiveness, excitement seeking, warmth, activity, and
positive emotions.

That these underlying factors can be found is consistent with the lexical hypothesis:
personality characteristics that are most important in people's lives will eventually

become a part of their language and, secondly, that more important personality
characteristics are more likely to be encoded into language as a single word. The five
factors are:

 Openness to experience (inventive/curious vs. consistent/cautious). Appreciation


for art, emotion, adventure, unusual ideas, curiosity, and variety of experience.
Openness reflects the degree of intellectual curiosity, creativity and a preference
for novelty and variety a person has. It is also described as the extent to which a
person is imaginative or independent and depicts a personal preference for a
variety of activities over a strict routine. High openness can be perceived as
unpredictability or lack of focus, and more likely to engage in risky behaviour or
drug taking. Also, individuals that have high openness tend to lean, in occupation
and hobby, towards the arts, being, typically, creative and appreciative of the
significance of intellectual and artistic pursuits. Moreover, individuals with high
openness are said to pursue self-actualization specifically by seeking out intense,
euphoric experiences. Conversely, those with low openness seek to gain
fulfilment through perseverance and are characterized as pragmatic and data-
driven—sometimes even perceived to be dogmatic and closed-minded. Some
disagreement remains about how to interpret and contextualize the openness
factor.
 Conscientiousness (efficient/organized vs. easy-going/careless). Tendency to be
organized and dependable, show self-discipline, act dutifully, aim for
achievement, and prefer planned rather than spontaneous behaviour. High
conscientiousness is often perceived as being stubborn and focused. Low
conscientiousness is associated with flexibility and spontaneity, but can also
appear as sloppiness and lack of reliability.
 Extraversion (outgoing/energetic vs. solitary/reserved). Energetic, surgency,
assertiveness, sociability and the tendency to seek stimulation in the company of
others, and talkativeness. High extraversion is often perceived as attention-
seeking and domineering. Low extraversion causes a reserved, reflective
personality, which can be perceived as aloof or self-absorbed. Extroverted people
may appear more dominant in social settings, as opposed to introverted people in
this setting.

 Agreeableness (friendly/compassionate vs. challenging/detached). Tendency to
be compassionate and cooperative rather than suspicious and antagonistic towards
others. It is also a measure of one's trusting and helpful nature, and whether a
person is generally well-tempered or not. High agreeableness is often seen as
naive or submissive. Low agreeableness personalities are often competitive or
challenging people, which can be seen as argumentative or untrustworthy.
 Neuroticism (sensitive/nervous vs. secure/confident). Tendency to be prone to
psychological stress. The tendency to experience unpleasant emotions easily, such
as anger, anxiety, depression, and vulnerability. Neuroticism also refers to the
degree of emotional stability and impulse control and is sometimes referred to by
its low pole, "emotional stability". High stability manifests itself as a stable and
calm personality, but can be seen as uninspiring and unconcerned. Low stability
manifests as the reactive and excitable personality often found in dynamic
individuals, but can be perceived as unstable or insecure. Also, individuals with
higher levels of neuroticism tend to have worse psychological well-being.

4.3.5 Predictive Models

After processing these features, two predictive models were developed and
evaluated: one for classifying users according to life satisfaction and another one for
detecting users with depression.

 Life satisfaction predictive model

The life satisfaction predictive model (LF model) is a classifier for labelling users
with life satisfaction and dissatisfaction. A set of predictive models was developed using
the features described above as inputs and the SWLS-derived labels as outputs. Several
techniques will be used, including a support vector machine (SVM) with a radial basis
function (RBF) kernel, logistic regression, a decision tree and naive Bayes. The best model
will be used to classify users who took the CES-D according to life satisfaction.

 Depression predictive model

The main aim of this study is to develop a depression predictive model. The
depression classifier used the same features as the LF model. However, this model relies on an additional input: the label predicted by the best LF model. Outputs used to train the depression model were calculated from the CES-D scores of each user. The depression prediction model was trained with the same supervised learning approaches as above.
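A minimal sketch of this two-level structure is given below; it is illustrative only, using hypothetical arrays (big_five, swls_labels, cesd_labels) in place of the real feature matrix and questionnaire-derived labels, and scikit-learn's SVC as the classifier.

import numpy as np
from sklearn.svm import SVC

# Hypothetical data: Big Five percentiles for 20 users and binary labels
# derived from the SWLS (satisfaction) and CES-D (depression) questionnaires
big_five = np.random.rand(20, 5)
swls_labels = np.random.randint(0, 2, 20)
cesd_labels = np.random.randint(0, 2, 20)

# Level 1: life satisfaction (LF) classifier trained on the Big Five features
lf_model = SVC(kernel='rbf').fit(big_five, swls_labels)

# Level 2: the depression classifier receives the Big Five features plus the
# label predicted by the LF model as a sixth input column
multilevel_features = np.column_stack([big_five, lf_model.predict(big_five)])
depression_model = SVC(kernel='rbf').fit(multilevel_features, cesd_labels)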

4.4 System Requirements
4.4.1 Software Requirements:
The following software tools are required to execute the different classifiers used for evaluating the project.

 Python:

Python is the scripting language which is used to implement the algorithms. The latest
version of Python 3 was used. Python scripts can be edited using notepad or any
coding IDE.

 Windows 7 and Higher:

Windows is the most widely used operating system, providing a graphical user interface and a wide range of applications.

4.4.2 Hardware Requirements:


The following hardware is essential for successful execution of the project.

 Central Processing Unit:

Computers with more CPU cores can outperform those with a lower core count, but results will vary with the application. A quad-core CPU enables faster execution of the algorithms compared with older, lower-core processors.

 Hard Disk:
The hard disk speed is a significant factor in program load time. Once the program is running, disk speed matters mainly for disk-intensive work. For disk-intensive applications, or to improve the execution time of a program, technologies such as solid-state drives can be used.

CHAPTER-5

RESULTS AND DISCUSSIONS

The purpose of this study is to develop a multilevel predictive model to detect users
with depression. This project tests the correlation between SWLS and CES-D scores.

5.1 Getting Tweets

The get_tweets module takes the Twitter username as input. Using the Twitter API and Tweepy, the tweets are collected in text format.

Fig 5.1 Screenshot of the output of get_tweets with username “Adele”

Fig 5.2 Screenshot of the output of get_tweets with username “ArianaGrande”

5.2 Pre-Processing

The pre-processing module takes as input the name of the file in which the tweets are stored. It removes numbers, white spaces, new lines, punctuation and stop words. The output is another text file named “filteredtext.txt” where all the pre-processed tweets are stored.

Fig 5.3 Screenshot of the output of pre-processing module for username “Adele”

Fig 5.4 Screenshot of the output of pre-processing module for username


“ArianaGrande”

5.3 Text Analysis

Text analysis is done using the IBM Watson Personality Insights API. Using the IBM credentials, the text is analysed and the percentile scores of the Big Five personality traits are returned as a JSON object. To create the data set, this object is parsed and the values are stored in a .csv file. For prediction, the JSON object is converted to a list and saved for the next step.

Fig 5.5 Screenshot of output of Text Analysis for a number of usernames

5.4 Comparing Algorithms for modelling

 Logistic Regression:

Logistic regression is a common and useful method for solving the binary classification problem. It can be used for various classification tasks such as spam detection, diabetes prediction, predicting whether a given customer will purchase a particular product or churn to a competitor, whether a user will click on a given advertisement link, and many more. Logistic regression is one of the simplest and most commonly used machine learning algorithms for two-class classification. It is easy to implement, can be used as the baseline for any binary classification problem, and its fundamental concepts also carry over to deep learning. Logistic regression describes and estimates the relationship between one dependent binary variable and the independent variables. It is a statistical method for predicting binary classes: the outcome or target variable is dichotomous, meaning there are only two possible classes. For example, it can be used for cancer detection problems. It computes the probability of an event occurring. It can be viewed as an extension of linear regression to the case where the target variable is categorical, using the log of odds as the dependent variable. Logistic regression predicts the probability of occurrence of a binary event using the logit (sigmoid) function.
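As an illustration only (not part of the project code), a minimal scikit-learn example of logistic regression on hypothetical one-dimensional data could look like this:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one feature per sample, two classes (0 and 1)
X = np.array([[0.1], [0.3], [0.4], [0.6], [0.8], [0.9]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.5]]))        # predicted class label
print(clf.predict_proba([[0.5]]))  # class probabilities from the logistic (sigmoid) function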

 Linear Discriminant Analysis:

If there are more than two classes then Linear Discriminant Analysis is the preferred linear classification technique. LDA is a simple model in both preparation and application, and there are some interesting statistics behind how the model is set up and how the prediction equation is derived. The representation of LDA is straightforward: it consists of statistical properties of the data, calculated for each class. For a single input variable (x) these are the mean and the variance of the variable for each class. For multiple variables, the same properties are calculated over the multivariate Gaussian, namely the means and the covariance matrix. These statistical properties are estimated from the data and plugged into the LDA equation to make predictions; they are the model values that would be saved to file for the model. LDA makes some simplifying assumptions about the data: that the data is Gaussian, i.e. each variable is shaped like a bell curve when plotted, and that each attribute has the same variance, i.e. the values of each variable vary around the mean by the same amount on average.
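A minimal sketch, with made-up two-dimensional data (not the project dataset), of how scikit-learn's LinearDiscriminantAnalysis estimates these per-class statistics and uses them for prediction:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical data: two classes of 2-D points
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.1],
              [4.0, 5.0], [4.2, 4.8], [3.9, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.means_)                 # per-class means estimated from the data
print(lda.predict([[2.5, 3.5]]))  # class assigned to a new point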

 K-Nearest Neighbour:

KNN can be used for both classification and regression predictive problems. However, it is more widely used for classification problems in industry. To evaluate any technique, three important aspects are considered: ease of interpreting the output, calculation time, and predictive power. The KNN algorithm fares well across all of these parameters, and it is commonly used for its ease of interpretation and low calculation time. It is important to understand what exactly K influences in the algorithm. For a fixed set of training observations, decision boundaries between the classes can be drawn for any given value of K, and the effect of K on these boundaries can be examined. The error rate at K=1 is always zero for the training sample, because the closest point to any training data point is itself, so the prediction is always accurate; in other words, K=1 overfits the boundaries. If the validation error curve behaved the same way, the choice of K would be 1. In practice, the validation error initially decreases, reaches a minimum, and then increases with increasing K. To obtain the optimal value of K, the initial dataset is therefore split into training and validation sets and the validation error curve is plotted; the value of K that minimises the validation error is then used for all predictions.
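This selection of K from a validation error curve can be sketched as follows (illustrative only, using random hypothetical data rather than the project dataset):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples with 5 features and a binary label
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3)

# Validation error for a range of K values; the K with the lowest
# validation error would then be used for all further predictions
for k in range(1, 16, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, round(1 - knn.score(X_val, y_val), 3))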

 Support Vector Machine:


A Support Vector Machine (SVM) is a discriminative classifier formally
defined by a separating hyperplane. In other words, given labeled training data
(supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two-dimensional space this hyperplane is a line dividing the plane into two parts, with each class lying on either side. The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points. To separate the two classes of data points, there are many possible hyperplanes that could be chosen. The objective is to find the plane that has the maximum margin, i.e. the maximum distance between data points of both classes. Maximizing the margin distance provides some reinforcement so that
future data points can be classified with more confidence. Hyperplanes are
decision boundaries that help classify the data points. Data points falling on
either side of the hyperplane can be attributed to different classes. Also, the
dimension of the hyperplane depends upon the number of features. If the
number of input features is 2, then the hyperplane is just a line. If the number
of input features is 3, then the hyperplane becomes a two-dimensional plane. It
becomes difficult to imagine when the number of features exceeds 3. Support
vectors are data points that are closer to the hyperplane and influence the
position and orientation of the hyperplane. Using these support vectors, the
margin of the classifier is maximized. Deleting the support vectors will change
the position of the hyperplane. These are the points that help build the SVM. In logistic regression, the output of the linear function is squashed into the range [0,1] using the sigmoid function; if the squashed value is greater than a threshold value (0.5), the label 1 is assigned, else the label 0 is assigned. In SVM, the output of the linear function is used directly: if it is greater than 1, the point is identified with one class, and if it is less than -1, with the other class. Since the threshold values are changed to 1 and -1 in SVM, a reinforcement range of values ([-1,1]) is obtained which acts as the margin. In the
SVM algorithm, the margin between the data points and the hyperplane is
maximized. The loss function that helps maximize the margin is hinge loss. A
regularization parameter is added to the cost function. The objective of the
regularization parameter is to balance the margin maximization and loss.
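A minimal illustrative example (with made-up, linearly separable data) showing how scikit-learn's SVC finds a maximum-margin hyperplane and exposes its support vectors:

import numpy as np
from sklearn.svm import SVC

# Hypothetical data: two linearly separable classes in 2-D
X = np.array([[1, 1], [1, 2], [2, 1],
              [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear', C=1.0).fit(X, y)
print(clf.support_vectors_)   # points that define the separating hyperplane
print(clf.predict([[3, 3]]))  # class assigned to a new point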

 Radial Basis Function Kernel:

The main advantage of the kernel methods is the possibility of using linear
models in a nonlinear subspace by an implicit transformation of patterns to a
high-dimensional feature space without computing their images directly. An
appropriately constructed kernel results in a model that fits well to the
structure underlying data and doesn’t over-fit to the sample. Recent state-of-
the-art kernel evaluation measures are examined and their application in
kernel optimization is verified. Optimization leveraging these measures
results in parameters corresponding to the classifiers that achieve minimal
error rate for RBF kernel. The Radial basis function kernel, also called the
RBF kernel, or Gaussian kernel, is a kernel that is in the form of a radial basis
function (more specifically, a Gaussian function). For two feature vectors x and x′, the RBF kernel is defined as:

K(x, x′) = exp(−γ ‖x − x′‖²)

The RBF kernel can also be viewed as an infinite sum over polynomial kernels. It represents the similarity between two vectors as a decaying function of the distance between them (i.e. the squared norm of their difference): if the two vectors are close together, this distance is small and the kernel value is close to 1. Thus, closer vectors have a larger RBF kernel value than farther vectors. This
function is of the form of a bell-shaped curve. The γ parameter sets the width
of the bell-shaped curve. The larger the value of γ the narrower will be the
bell. Small values of γ yield wide bells.
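The effect of γ can be illustrated with scikit-learn's rbf_kernel function on hypothetical vectors (not project data):

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[0.0, 0.0]])
near = np.array([[0.5, 0.0]])   # vector close to x
far = np.array([[3.0, 0.0]])    # vector far from x

# Closer vectors always receive a larger kernel value; a larger gamma
# narrows the bell, so the value decays faster with distance
for gamma in (0.1, 1.0, 10.0):
    print(gamma, rbf_kernel(x, near, gamma=gamma)[0][0], rbf_kernel(x, far, gamma=gamma)[0][0])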

Fig 5.6 Algorithms Comparison

Fig 5.7 ROC Curve Comparison

5.5 Training the models

The dataset is given as input to train the two models using the support vector machine algorithm with a radial basis function kernel. The satisfaction model's training attributes are the Big Five personality attributes, and its target attribute is the satisfaction column. The depression prediction model's training attributes are the Big Five personality attributes plus the satisfaction attribute, and its target attribute is the depression column.

 Life satisfaction predictive model:


The life satisfaction predictive model (LF model) is a classifier for labelling
users with life satisfaction and dissatisfaction. A set of predictive models were
trained on the above described features as inputs and outputs. Several
techniques were used including support vector machine (SVM) with a radial
basis function (RBF) kernel, logistic regression, decision tree and naive Bayes.
The best model was used to classify users taking CES-D into life satisfaction.
Among the trained LF models, the best was the SVM with the RBF kernel, which yields a maximum accuracy of 69%.

 Depression predictive model:


The main aim of this study is to develop a depression predictive model. The
depression classifier used the same features as the LF model. However, this
model relies on an additional input from the best LF model. Outputs to train
the depression model were calculated from CES-D scores of each user. The
depression prediction model was trained with supervised learning approaches
similar to above. Classifiers are built for predicting users with depression.
Two models were trained, one without predicted life satisfaction labels called
“depression model” and another one with additional inputs from the best LF
model called “multilevel model”. Both models were constructed with SVM
with RBF kernel. Accuracy of the model to detect depression was 77%. After
adding life-satisfaction labels as inputs, the multilevel model yielded an
accuracy of 78%. A comparison of both classifiers shows that the multilevel model with predicted life satisfaction inputs outperforms the model without this information.

Fig. 5.8 Output of Training Models

5.6 Predicting Depression


All the previous modules are imported here and executed as console commands. First, tweets are obtained using the get_tweets module. Then the pre-processing of those tweets is done, followed by the text analysis, and the Big Five personality trait scores are printed on the screen. Finally, the satisfaction and depression factors are given as the output.

Fig 5.9 Screenshot of the predicted output of “theellenshow”

Fig 5.10 Screenshot of the predicted output of “urstrulymahesh”

Fig 5.11 Screenshot of the predicted output of “kanyewest”

CHAPTER-6

CONCLUSIONS/ RECOMMENDATIONS

6.1 Conclusions

The purpose of this study was to build a multilevel predictive model to detect
users with depression. The model predicts life satisfaction in users and then uses this
prediction to refine the detection of users with depression. The results show that the
multilevel model can outperform a depression predictive model alone. The model can
also predict a negative correlation between depression and life satisfaction.

6.1.1 Limitations
 Online behaviour usually cannot fully depict the actual mental state of a person. This project relies on the premise that every person posts all their feelings on social media platforms without any bias.
 Every person must truthfully answer the questionnaire.

6.2 Future Scope

As the next step, interaction features shall be added, e.g., comments, likes, and replies
to the model to improve the accuracy of detecting depression. An important impact of
this research is to enable online personalised interventions, providing suitable help,
health recommendations, or health information links to individual users, in a similar
way that social networks currently advertise products to users. Another application is
for decision support system data capture to help doctors and psychologists get better
and more frequent information from a patient. In future work, more observation
factors shall be extracted to improve the performance of the model and understand
other behaviours which can distinguish users with and without depression, e.g., topic
analysis, graph analysis, and image analysis.

REFERENCES

[1] Akkapon Wongkoblap, Miguel A. Vadillo, Vasa Curcin, “A Multilevel Predictive Model for Detecting Social Network Users with Depression”, 2018 IEEE International Conference on Healthcare Informatics, 2018.

[2] World Health Organization, “Depression and Other Common Mental Disorders: Global Health Estimates,” 2017. [Online].

[3] Namrata Sonawane, Mayuri Padmane, Vishwja Suralkar, Snehal Wable, Prakash Date, “Predicting Depression Level Using Social Media Posts”, IJRISET Journal, Vol. 7, Issue 5, May 2018.

[4] A. Wongkoblap, M. A. Vadillo, and V. Curcin, “Researching Mental Health Disorders in the Era of Social Media: Systematic Review,” J. Med. Internet Res., vol. 19, no. 6, p. e228, Jun. 2017.

[5] Akkapon Wongkoblap, Miguel A. Vadillo, Vasa Curcin, “Detecting and Treating Mental Illness on Social Networks”, 2018 IEEE International Conference on Healthcare Informatics, 2018.

APPENDIX 1: CODE

get_tweets.py:

import sys

import csv

import tweepy

consumer_key = "IgM5R8xVMGNJBZuqQ42RbEHI5"

consumer_secret =
"Myu6ernLGk2ODUrsBNWr6deZu01MzlnVMTkvz93JARFJVHuoqU"

access_key = "339802190-clJpgJofm8n6tXUPVwrmWDILUZFdEGvecKqynjbY"

access_secret = "qqOr9Fx5Sh3uIxS0nf8Y0ic1rnoK2HJhQg0uAIpEGL5fh"

def get_tweets(username):

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

auth.set_access_token(access_key, access_secret)

api = tweepy.API(auth)

number_of_tweets = 100

tweets_for_csv = []

for tweet in tweepy.Cursor(api.user_timeline, screen_name


=username).items(number_of_tweets):

tweets_for_csv.append([tweet.text.encode("utf-8")])

#print(tweets_for_csv)

outfile = username + ".txt"

print("writing to " + outfile)

with open(outfile, 'w+') as file:

writer = csv.writer(file, delimiter=',')

writer.writerows(tweets_for_csv)

if __name__ == '__main__':

if len(sys.argv) == 2:

get_tweets(sys.argv[1])

else:

print("Error: enter one username")

Pre-Processing:

import io

import re

import string

import regex

import sys

#from string import maketrans

from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize

if __name__ == '__main__':

if len(sys.argv) == 2:

filename = sys.argv[1]

else:

print("Error: enter one filename")

file=""+filename+".txt"

file1 = open(file)

line = file1.read()# Use this to read file content as a stream:

words = line.split('\n')

mysen=""

remnum = mysen.join(words)

resultnum=re.sub(r'\d+','',remnum)

#removing numbers,new lines is done

resultpunc = resultnum.translate(str.maketrans("","", string.punctuation))

#print(resultpunc)

#removing punctuation marks is done

stop_words = set(stopwords.words('english'))

word_tokens=word_tokenize(resultpunc)

appendFile = open('filteredtext.txt','w')

for r in word_tokens:

if not r in stop_words:

appendFile.write(" "+r)

appendFile.close()

Text Analysis:

from watson_developer_cloud import PersonalityInsightsV3

import json

import numpy

import csv

import pandas

import sys

url='https://gateway-lon.watsonplatform.net/personality-insights/api'

apikey='X7GVHZpsWB3FBCM7bwyboAMFwMHCK-RBFQSAyJ37f7Q-'

file1=open("filteredtext.txt")

line=file1.read()

text=""

service = PersonalityInsightsV3(url=url , iam_apikey=apikey, version='2017-10-13')

profile=service.profile(line,content_type='text/plain').get_result()

#print(profile)

trait_id=[]

percentile=[]

for b in profile["personality"]:

#print(b)

trait_id.append(b["trait_id"])

percentile.append(b["percentile"])

#print(b["trait_id"])

print("Text Analysis Complete")

#print(percentile)

with open("foo.csv","a", newline='') as myfile:

wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)

wr.writerow(percentile)

Training of Models:

from sklearn import svm

import urllib

import numpy as np

import matplotlib.pyplot as plt

import urllib

import pandas

import warnings

from sklearn import model_selection

import pickle

from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report,confusion_matrix

filename1='foo.csv'

names1 =
['Openness','conscientiousness','extraversion','agreeableness','neuroticism','satisfact
ion','depression']

data_set1=pandas.read_csv(filename1,names=names1)

X1=pandas.DataFrame(data_set1)

d=pandas.Series(data_set1['depression'])

y1=pandas.DataFrame(d)

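# Drop the depression column (the target); the remaining six columns (the five Big Five percentiles plus the satisfaction label) are the inputs of the depression model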
del X1[data_set1.columns[6]]

X1.columns=[data_set1.columns[0],data_set1.columns[1],data_set1.columns[2],data
_set1.columns[3],data_set1.columns[4],data_set1.columns[5]]

#print("Printing X1:")

#print(X1)

C=1.0

X1_train, X1_test, y1_train, y1_test = train_test_split(X1,y1,test_size=0.2)

depression = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X1_train,


y1_train.values.ravel())

y1_pred = depression.predict(X1_test)

warnings.filterwarnings('ignore')

print("Depression Model:\n")

print(confusion_matrix(y1_test, y1_pred))

print(classification_report(y1_test,y1_pred))

filename='foo.csv'

names =
['Openness','conscientiousness','extraversion','agreeableness','neuroticism','satisfact
ion','depression']

data_set=pandas.read_csv(filename,names=names)

X=pandas.DataFrame(data_set)

s=pandas.Series(data_set['satisfaction'])

y=pandas.DataFrame(s)

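# Drop both the depression and satisfaction columns; the satisfaction (LF) model is trained on the five Big Five percentiles only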
del X[data_set.columns[6]]

del X[data_set.columns[5]]

#print("Printing X:")

#print(X)

X.columns=[data_set.columns[0],data_set.columns[1],data_set.columns[2],data_set.
columns[3],data_set.columns[4]]

C=1.0

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

satisfaction = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X_train,


y_train.values.ravel())

y_pred = satisfaction.predict(X_test)

print("Satisfaction Model:\n"

print(confusion_matrix(y_test, y_pred))

print(classification_report(y_test,y_pred))

filename = 'swls_model.sav'

pickle.dump(satisfaction,open(filename,'wb'))

filename1 = 'cesd_model.sav'

pickle.dump(depression,open(filename1,'wb'))

Creating the data-set:

import os

import sys

if __name__ == '__main__':

if len(sys.argv) == 2:

username = sys.argv[1]

else:

print("Error: enter one username")

st1="python get_tweets.py "+username

os.system(st1)

st2="python preprocessing.py "+username

os.system(st2)

os.system("python text_analysis.py")

os.system("python svm.py")

Comparing Algorithms for Modeling:

from sklearn import svm

import urllib

import numpy as np

import matplotlib.pyplot as plt

import pandas

import warnings

import pickle

import matplotlib.pyplot as plt

from sklearn import model_selection

from sklearn.linear_model import LogisticRegression

from sklearn.tree import DecisionTreeClassifier

from sklearn.naive_bayes import GaussianNB

from sklearn.svm import SVC

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report,confusion_matrix

filename='foo.csv'

names =
['Openness','conscientiousness','extraversion','agreeableness','neuroticism','satisfaction'
,'depression']

dataframe=pandas.read_csv(filename,names=names)

X=pandas.DataFrame(dataframe)

d=pandas.Series(dataframe['depression'])

Y=pandas.DataFrame(d)

warnings.filterwarnings('ignore')

del X[dataframe.columns[6]]

X.columns=[dataframe.columns[0],dataframe.columns[1],dataframe.columns[2],dataf
rame.columns[3],dataframe.columns[4],dataframe.columns[5]]

seed = 7

models = []

models.append(('LR', LogisticRegression()))

models.append(('LDA', LinearDiscriminantAnalysis()))

models.append(('KNN', KNeighborsClassifier()))

models.append(('SVM', SVC()))

results = []

names = []

scoring = 'accuracy'

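# Evaluate each classifier with 10-fold cross-validation and report the mean accuracy and its standard deviation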
for name, model in models:

kfold = model_selection.KFold(n_splits=10, random_state=seed)

cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold,


scoring=scoring)

results.append(cv_results)

names.append(name)

msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())

print(msg)

# boxplot algorithm comparison

fig = plt.figure()

fig.suptitle('Algorithm Comparison')

ax = fig.add_subplot(111)

plt.boxplot(results)

ax.set_xticklabels(names)

plt.show()

Predicting Depression and Satisfaction:

import pickle

from watson_developer_cloud import PersonalityInsightsV3

import json

import numpy

import csv

import pandas

import sys

import os

if __name__ == '__main__':

if len(sys.argv) == 2:

username = sys.argv[1]

else:

print("Error: enter one username")

st1="python get_tweets.py "+username

os.system(st1)

print("Tweets obtained from "+username)

st2="python preprocessing.py "+username

os.system(st2)

print("Preprocessing done!")

#os.system("python text_analysis_main.py")

url='https://gateway-lon.watsonplatform.net/personality-insights/api'

apikey='X7GVHZpsWB3FBCM7bwyboAMFwMHCK-RBFQSAyJ37f7Q-'

file1=open("filteredtext.txt")

line=file1.read()

text=""

service = PersonalityInsightsV3(url=url , iam_apikey=apikey, version='2017-10-13')

profile=service.profile(line,content_type='text/plain').get_result()

#print(profile)

trait_id=[]

percentile=[]

for b in profile["personality"]:

#print(b)

trait_id.append(b["trait_id"])

percentile.append(b["percentile"])

#print(b["trait_id"])

print("Text Analysis Complete")

new_percentile = numpy.reshape(percentile,(-1,5))

print(new_percentile)

#os.system("python svm.py")

filename = 'swls_model.sav'

swls_model = pickle.load(open(filename, 'rb'))

result = swls_model.predict(new_percentile)

print("Satisfaction Factor: ")

print(result)

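# Append the predicted satisfaction label as a sixth feature before running the depression (CES-D) model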
percentile.append(result[0])

#print(percentile)

new_percentile = numpy.reshape(percentile,(-1,6))

filename1 = 'cesd_model.sav'

cesd_model = pickle.load(open(filename1, 'rb'))

result = cesd_model.predict(new_percentile)

print("Depression: ")

print(result)

APPENDIX 2: MORAL FOUNDATIONS DICTIONARY
%
01 HarmVirtue
02 HarmVice
03 FairnessVirtue
04 FairnessVice
05 IngroupVirtue
06 IngroupVice
07 AuthorityVirtue
08 AuthorityVice
09 PurityVirtue
10 PurityVice
11 MoralityGeneral
%

safe* 01
peace* 01
compassion* 01
empath* 01
sympath* 01
care 01
caring 01
protect* 01
shield 01
shelter 01
amity 01
secur* 01
benefit* 01
defen* 01
guard* 01
preserve 01 07 09

harm* 02
suffer* 02
war 02
wars 02
warl* 02
warring 02
fight* 02
violen* 02
hurt* 02
kill 02
kills 02
killer* 02
killed 02
killing 02
endanger* 02
cruel* 02
brutal* 02
abuse* 02
damag* 02
ruin* 02 10
ravage 02
detriment* 02
crush* 02
attack* 02
annihilate* 02
destroy 02
stomp 02
abandon* 02 06
spurn 02
impair 02
exploit 02 10
exploits 02 10
exploited 02 10
exploiting 02 10
wound* 02

fair 03
fairly 03
fairness 03
fair-* 03
fairmind* 03
fairplay 03
equal* 03
justice 03
justness 03
justifi* 03
reciproc* 03
impartial* 03
egalitar* 03
rights 03
equity 03
evenness 03
equivalent 03
unbias* 03
tolerant 03
equable 03
balance* 03
homologous 03
unprejudice* 03
reasonable 03
constant 03
honest* 03 11

unfair* 04
unequal* 04
bias* 04
unjust* 04
injust* 04
bigot* 04
discriminat* 04
disproportion* 04
inequitable 04
prejud* 04
dishonest 04
unscrupulous 04
dissociate 04
preference 04
favoritism 04
segregat* 04 05
exclusion 04
exclud* 04

together 05
nation* 05
homeland* 05
family 05
families 05
familial 05
group 05
loyal* 05 07
patriot* 05
communal 05
commune* 05
communit* 05
communis* 05
comrad* 05
cadre 05
collectiv* 05
joint 05
unison 05
unite* 05
fellow* 05
guild 05
solidarity 05
devot* 05
member 05
cliqu* 05
cohort 05
ally 05
insider 05

foreign* 06
enem* 06
betray* 06 08
treason* 06 08
traitor* 06 08
treacher* 06 08
disloyal* 06 08
individual* 06
apostasy 06 08 10
apostate 06 08 10
deserted 06 08
deserter* 06 08
deserting 06 08
deceiv* 06
jilt* 06
imposter 06
miscreant 06
spy 06
sequester 06
renegade 06
terroris* 06
immigra* 06

obey* 07
obedien* 07
duty 07
law 07
lawful* 07 11
legal* 07 11
duti* 07
honor* 07
respect 07
respectful* 07
respected 07
respects 07
order* 07
father* 07
mother 07
motherl* 07
mothering 07
mothers 07
tradition* 07
hierarch* 07
authorit* 07
permit 07
permission 07
status* 07
rank* 07
leader* 07
class 07
bourgeoisie 07
caste* 07
position 07
complian* 07
command 07
supremacy 07
control 07
submi* 07
allegian* 07
serve 07
abide 07
defere* 07
defer 07
revere* 07
venerat* 07
comply 07

defian* 08
rebel* 08
dissent* 08
subver* 08
disrespect* 08
disobe* 08
sediti* 08
agitat* 08
insubordinat* 08
illegal* 08
lawless* 08
insurgent 08
mutinous 08
defy* 08
dissident 08
unfaithful 08
alienate 08
defector 08
heretic* 08 10
nonconformist 08
oppose 08
protest 08
refuse 08
denounce 08
remonstrate 08
riot* 08
obstruct 08

piety 09 11
pious 09 11
purity 09
pure* 09
clean* 09
steril* 09
sacred* 09
chast* 09
holy 09
holiness 09
saint* 09
wholesome* 09 11
celiba* 09
abstention 09
virgin 09
virgins 09
virginity 09
virginal 09
austerity 09
integrity 09 11
modesty 09
abstinen* 09
abstemiousness 09
upright 09 11
limpid 09
unadulterated 09
maiden 09
virtuous 09
refined 09
decen* 09 11
immaculate 09
innocent 09
pristine 09
church* 09

disgust* 10
deprav* 10
disease* 10
unclean* 10
contagio* 10
indecen* 10 11
sin 10
sinful* 10
sinner* 10
sins 10
sinned 10
sinning 10
slut* 10
whore 10
dirt* 10
impiety 10
impious 10
profan* 10
gross 10
repuls* 10
sick* 10
promiscu* 10
lewd* 10
adulter* 10
debauche* 10
defile* 10
tramp 10
prostitut* 10
unchaste 10
intemperate 10
wanton 10
profligate 10
filth* 10
trashy 10
obscen* 10
lax 10
taint* 10
stain* 10
tarnish* 10
debase* 10
desecrat* 10
wicked* 10 11
blemish 10
exploitat* 10
pervert 10
wretched* 10 11

righteous* 11
moral* 11
ethic* 11
value* 11
upstanding 11
good 11
goodness 11
principle* 11
blameless 11
exemplary 11
lesson 11
canon 11
doctrine 11
noble 11
worth* 11
ideal* 11
praiseworthy 11
commendable 11
character 11
proper 11
laudable 11
correct 11
wrong* 11
evil 11
immoral* 11
bad 11
offend* 11
offensive* 11
transgress* 11
