
Suicidal Ideation Detection using ColBERT

PROJECT REPORT

Subitsha T L K (2018115114)
Prathiksha P (2018115074)
Kaushic Aravind B (2018115046)

I. Introduction

Suicide is regarded as one of the most alarming issues in modern society, and
the figures are startling. "More than 700 000 people die due to suicide every
year," states the World Health Organisation. Suicide is a tragedy that adversely
affects families, communities and countries, and has a lasting, harsh effect on
the people who belong to the close circles of the victims. It is the fourth leading
cause of death among teenagers and young people, especially 15-29 year-olds.
Suicide is a desperate attempt to escape unbearable suffering. People are
blinded by feelings of self-hatred, isolation, fear about the future and
depression. A suicidal person cannot see any way of finding relief from their
problems and comes to see death as the only end. Impulsive thinking makes
them believe that suicide is the only way out, and they are unable to find any
alternative solution to their seemingly never-ending depressive problems.

The below figure is a visualisation of the suicide rates per 100,000 population
across the world. It is notable that India is among the nations with a high
suicide rate, and it is worth exploring approaches that can mitigate this
adverse situation.
The major factors leading a person to suicide include serious financial issues,
unmanageable social pressure, depression arising from relationships, inability
to attain long-term goals and dreams, chronic health issues, and a drastic loss
of self-esteem and self-confidence.
The below bar graph represents the number of people who die by suicide in
high-income and low-income countries with respect to age in years.

As per WHO reports, nearly 40% of countries have more than 15 suicide deaths
per 100,000 men, whereas only 1.5% of countries show a comparably high rate
for women. This difference is often attributed to a long-term lack of
communication.

Among the people who die by suicide, 25-35% tend to leave suicide notes
before the attempt. These notes play a pivotal role in analysing the behaviour
of an individual who died by suicide and help in understanding the main
reason behind such thoughts. The notes are generally a means of conveying
long-held resentment and send-off messages to parents, friends and close kin.

Since social media has emerged as one of the popular means through which
people express such resentment, we aim to apply deep learning techniques to
identify the risk of suicide and detect early signs of suicidal thoughts. Social
network platforms, especially Reddit and Twitter, have recently shown a
gradual but significant rise in forum posts related to suicidality, depression
and other mental health problems. Twitter in particular is a communication
platform that allows users to share and broadcast their personal details,
activities and opinions through posts, which are short text-based alerts or
updates. As the rate of people dying by suicide keeps increasing, the speed at
which information spreads through Twitter or Reddit serves as an advantage
for detecting early signs of suicide quickly and efficiently. Understanding the
intensity of such information from posts is of paramount importance in the
study of suicidal ideation detection.

Our main objective is to conceptualize and develop a suicidal ideation
detection model using ColBERT, Contextualised Late Interaction over BERT, to
identify suicidal risk factors, and to compare it with existing models to verify
that it outperforms them.

The traditional methods for suicide detection involved data collected by health
professionals and observation of the intended person's psychological, health
and linguistic features. Nowadays, however, a major source of data is the
suicide notes and last thoughts penned by people who attempt suicide, most
commonly on social media platforms such as Twitter and Reddit.

Several academic institutions have attempted to predict people's state of mind
and the genuineness of such notes. Genuineness is used to decide whether a
text is genuine enough to be classified as a suicide note, which in turn helps
in analysis and in applying efficient machine learning algorithms. The reasons
for suicide are difficult to understand and are attributed to a complex
interaction of many unique factors. For this reason, many researchers analyse
a wide, eclectic range of psychological and health-related content, as well as
responses to questionnaires and surveys. Based on the texts that people post
on social media, artificial intelligence and machine learning techniques can
predict a person's likelihood of suicide and help understand their intentions.
These advancing automation techniques enable better prediction of suicidal
thoughts than the cumbersome traditional methods.

Existing Machine learning techniques :

Several machine learning algorithms have used mental health records and
outperformed the earlier in-person identification by health professionals. The
scope and need for improved prediction rates gave rise to sentiment analysis
and its application in several new machine learning models. The popular
models that have been trained and tested for suicidal ideation detection are
Logistic Regression, Naive Bayes, Decision Tree, Random Forest, Support
Vector Machine (SVM), Recurrent Neural Networks (RNN), Convolutional
Neural Networks (CNN) and Long Short-Term Memory (LSTM). Recent studies
have shown that BERT, Bidirectional Encoder Representations from
Transformers, a transformer-based model, is capable of bidirectional context
modelling. Pre-trained with masked language modelling and next sentence
prediction, it has been used for suicide risk detection.

In an attempt to outperform the existing traditional as well as transformer-based
models, we propose using a novel model, ColBERT, for detecting suicidal
ideation in tweets.

ColBERT introduces a late interaction architecture that encodes the query and
the document independently using BERT and then employs a cheap interaction
step that models their fine-grained similarity, thereby delaying yet retaining this
fine-grained interaction. Late interaction is a paradigm for estimating relevance
between a query q and a document d. Under late interaction, the query and the
document are separately encoded into two sets of contextual embeddings, and
relevance is evaluated using efficient computations between both sets, that is,
fast computations that enable ranking without evaluating every possible
combination in the input.

BERT is one of the best examples of the all-to-all interaction paradigm, in which
interactions between words are modelled within and across the query and the
document at the same time. Late interaction, in contrast, isolates the
computations between the query and the document, which makes it possible to
pre-compute document representations offline, drastically reducing the
computational load per query. In the proposed method, every query is
embedded separately from the document embeddings, and the interaction
between Q and D happens later through a MaxSim operator, which computes
maximum similarities and outputs scalar scores. This design allows ColBERT to
exploit deep LM-based representations while shifting the cost of encoding
documents offline and amortising the cost of encoding the query once across
all ranked documents.
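As an illustration, the following is a minimal sketch of the MaxSim late-interaction
scoring step, assuming the query and document have already been encoded into
L2-normalised contextual embedding matrices (the function and variable names are
illustrative, not taken from the actual implementation):

import numpy as np

def maxsim_score(query_emb, doc_emb):
    # Late-interaction relevance: for every query token embedding, take its
    # maximum cosine similarity over all document token embeddings, then
    # sum these maxima over the query tokens.
    # query_emb: (num_query_tokens, dim), L2-normalised rows
    # doc_emb:   (num_doc_tokens, dim),   L2-normalised rows
    sim = query_emb @ doc_emb.T        # cosine similarity of every token pair
    per_token_max = sim.max(axis=1)    # best-matching document token per query token
    return float(per_token_max.sum())

Because the document embeddings can be computed once and cached offline, scoring
a new query only requires the cheap matrix product above.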

Results reported for ColBERT show that its effectiveness is competitive with
existing BERT-based models (and outperforms every non-BERT baseline), while
executing two orders of magnitude faster and requiring four orders of magnitude
fewer FLOPs per query. To our knowledge, however, ColBERT has not previously
been applied to this particular problem statement (suicidal ideation detection).

We summarize our contributions as follows:

● We used a dataset from GitHub that contains labelled tweets from several
users.
● The dataset is 3541 KB with 9120 rows and 2 columns, namely 'tweet'
and 'intention'.
● The 'tweet' column contains texts posted by Twitter users, including
suicide notes, and the 'intention' column indicates whether the user has
suicidal thoughts or not.
● We performed tokenization on the text strings to convert them into
tokens.
● We also performed stratified splitting of the data into train, test and
validation sets in the ratio 80:10:10.
● The processed input was fed into the embedding layer and further into
the hidden layers to learn the context of the tweets and identify the
features.
● Further, we trained, tested and evaluated the model on our dataset.
● After evaluating the model, we generated a classification report based
on the results.
● We finally compared and analysed the classification results of the
ColBERT model with previous models such as SVM, GloVe and BERT, and
observed the accuracy of each model in comparison with ColBERT.

The structure of the report is as follows :

Section 1 gives a brief introduction to suicide rates, thoughts, contributing
factors and detection methods.
Section 2 reviews past work on suicidal ideation detection, including a detailed
analysis of various research papers by notable researchers.
Section 3 presents the detailed architecture diagram of the ColBERT model with
a depiction of the modules it consists of.
Section 4 lists the modules involved in training and testing the model.
Section 5 indicates the progress of work by the individual contributors to the
project and the further work to be carried out.
Section 6 tabulates the results obtained from the various models that were
observed and evaluated.
Section 7 elaborates on the analysis of the trends in the results obtained for
the several models.
Section 8 concludes the report of the entire intended project.

II. Literature Survey

1. Hierarchical Multiscale Recurrent Neural Networks for Detecting Suicide Notes (2021)
   Authors: Annika M Schoene, Alexander P Turner, Geeth De Mel and Nina Dethlefs.
   Problem statement: To present a learning model for modelling long sequences in Reddit blogs for suicide detection.
   Methodology/Algorithm: Dilated LSTM (Long Short-Term Memory).
   Pros and advantages: Introduced the Dilated LSTM and showed that it outperforms the other models.
   Cons and future work: The number of dilated layers increases with the sequence length of the document; a more efficient model that can yield better performance has to be the focus.

2. Learning Models for Suicide Prediction from Social Media Posts (2021)
   Authors: Ning Wang, Fan Luo, Yuvraj Shivtare.
   Problem statement: Propose a deep learning architecture and test three other machine learning models to detect individuals that will attempt suicide within 30 days and within six months.
   Methodology/Algorithm: C-Attention model, KNN and SVM models.
   Pros and advantages: The C-Attention model performs better on longer-range suicide predictions; KNN and SVM work better on shorter-range suicide predictions.
   Cons and future work: More work is needed to decipher why certain features and models best predict suicidality in large samples.

3. Suicidal Ideation Detection: A Review of Machine Learning Methods and Applications (2021)
   Authors: Shaoxiong Ji, Shirui Pan, Xue Li.
   Problem statement: A survey to comprehensively introduce methods from machine learning techniques with feature engineering.
   Methodology/Algorithm: Attention and RNNs.
   Pros and advantages: Understanding the context of the text within a specified window in an accurate manner.
   Cons and future work: The RNN's window size is its own disadvantage, as increased text size combined with a smaller window restricts the model from remembering more than the window size.

4. Sentiment Analysis in Tweets: An Assessment Study from Classical to Modern Text Representation Models (2021)
   Authors: Sérgio Barreto, Ricardo Moura, Jonnathan Carvalho, Aline Paes, Alexandre Plastino.
   Problem statement: An assessment of existing language models in distinguishing the sentiment expressed in tweets, together with five classification algorithms.
   Methodology/Algorithm: BERT (Bidirectional Encoder Representations from Transformers), RoBERTa and BERTweet.
   Pros and advantages: The main focus of this study was on identifying the most appropriate word representations for the sentiment analysis of English tweets.
   Cons and future work: More computational resources are needed to train/fine-tune the models and make inferences.

5. Early Risk Detection of Pathological Gambling, Self-Harm and Depression Using BERT (2021)
   Authors: Ana-Maria Bucur, Adrian Cosma and Liviu P. Dinu.
   Problem statement: For early risk detection of problem gambling, datasets from popular subreddits that address gambling addiction were crawled to construct and train a BERT classifier on user posts.
   Methodology/Algorithm: A BERT encoder that encodes both the user text and all the answers to the questionnaire; the BERT encoder is then fine-tuned through contrastive learning.
   Pros and advantages: Transformer encoders are stacked and the input text is encoded to later match it with the document; the major advantage is the bidirectional traversal of the words during the encoding process.
   Cons and future work: It failed at grasping, to a high degree, the context provided in the previous sentence (common-sense based contextualization).

III. Architecture Diagram

IV. List of Modules

MODULE 1 : Data Collection and Preprocessing :

Data Collection : ColBERT works with tab-separated (TSV) files.

The dataset used for the analysis and for the entire project is from the GitHub
resource:
https://github.com/laxmimerit/twitter-suicidal-intention-dataset/blob/master/twitter-suicidal_data.csv

The dataset is a collection of tweets together with the suicide intention label
(indicated with 1s and 0s).
The collected data was imbalanced, and stratified sampling was implemented
to balance it.

import pandas as pd

df = pd.read_csv('/content/twitter-suicidal_data.csv')

Data Cleaning : A get_clean function is implemented to clean the data in
various ways, such as removing emails, URLs, HTML tags, accented characters
and special characters.
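A possible sketch of such a cleaning function is shown below; the exact
implementation in the project may differ, and the regular expressions are only
indicative:

import re
import unicodedata

def get_clean(text):
    # Lower-case and strip emails, URLs, HTML tags, accents and special characters.
    text = str(text).lower()
    text = re.sub(r'\S+@\S+', ' ', text)             # emails
    text = re.sub(r'http\S+|www\.\S+', ' ', text)    # URLs
    text = re.sub(r'<[^>]+>', ' ', text)             # HTML tags
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')  # accented characters
    text = re.sub(r'[^a-z0-9\s]', ' ', text)         # special characters
    return re.sub(r'\s+', ' ', text).strip()

df['tweet'] = df['tweet'].apply(get_clean)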

Splitting data into train, test and validation sets : We made use of stratified
sampling to split the dataset into train, test and validation sets in the ratio
80:10:10. The splitting was performed by using train_test_split() twice, as
shown below.

df_train, df_val, df_test = split_stratified_into_train_val_test(
    df, stratify_colname='intention',
    frac_train=0.80, frac_val=0.10, frac_test=0.10)
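The split_stratified_into_train_val_test helper is not listed in the report; a
minimal sketch of how it can be built from two stratified calls to
train_test_split, matching the arguments above, is given below (the body is an
assumption):

from sklearn.model_selection import train_test_split

def split_stratified_into_train_val_test(df_input, stratify_colname='intention',
                                          frac_train=0.80, frac_val=0.10,
                                          frac_test=0.10, random_state=42):
    # First split off the training portion, stratifying on the label column.
    df_train, df_temp = train_test_split(
        df_input, test_size=(1.0 - frac_train),
        stratify=df_input[stratify_colname], random_state=random_state)
    # Split the remainder evenly into validation and test sets.
    rel_test = frac_test / (frac_val + frac_test)
    df_val, df_test = train_test_split(
        df_temp, test_size=rel_test,
        stratify=df_temp[stratify_colname], random_state=random_state)
    return df_train, df_val, df_test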

MODULE 2 : Function Preprocessing :

BERT Embedding Layer : The input text is fed into the BERT embedding
layer. This layer helps the model learn the context of the entire text and
classify the sentences by understanding the true meaning and context behind
them rather than just considering them as vectors.

Three types of embeddings happen in this layer, namely:

1. Token Embedding: assigning vocabulary IDs to the tokens.
2. Sentence Embedding: embeddings that differentiate between the sentences.
3. Positional Encoding: encoding the position of each word in a sentence to
understand the context behind the usage of that particular word.

BERT uses special tokens during training to understand the context of the
sentences to be embedded, namely:

1. [CLS] : BERT's pre-training involves a masked language modelling step,
where random words are masked and later predicted by the model. The CLS
token marks the start of every input sequence. Token ID: 101

2. [SEP] : The SEP special token is used in the pre-training task known as
next sentence prediction. For the model to identify the next sentence, the SEP
token is attached at the end of every sentence. When a single sequence is
used, it is simply appended at the end. Token ID: 102

3. [MASK] : Token used for masked words; it is only used during pre-training.
To evaluate the training better, BERT masks specific words in a sentence and
predicts the masked words.

The BERT pretrained model that is used here is bert-base-uncased.

Details of bert-base-uncased:

● Number of transformer blocks (L): 12
● Hidden layer size: 768
● Attention heads: 12

Tokenization : The split data is tokenized using the BERT tokenizer, making it
suitable input for the BERT layers. Each sentence is tokenized in this module.

from transformers import BertTokenizer

MODEL_TYPE = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(MODEL_TYPE)
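For illustration, a minimal example of encoding one tweet is shown below. The
maximum length of 20 mirrors the input shape in the model summary of Module 3;
the remaining arguments and the sample text are assumptions, not the exact call
used in the project.

# Hypothetical usage: encode one tweet to fixed-length input IDs for the model.
encoded = tokenizer.encode_plus(
    "i feel like i cannot go on anymore",
    max_length=20, padding='max_length', truncation=True,
    return_attention_mask=True, return_tensors='tf')

print(encoded['input_ids'])       # 101 ([CLS]) at the start, 102 ([SEP]) before the padding
print(encoded['attention_mask'])  # 1 for real tokens, 0 for padding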

MODULE 3 : Hidden Layer Module Description :

The sentence embeddings acquired using the BERT model are used for
classification by feeding the embeddings into a neural network that classifies
the text as suicidal or not. The proposed model uses 8 layers with about 110
million parameters.

MODEL SUMMARY:

Layer (type) / number          | Output Shape | Parameters | Connected to
Input / 18                     | (None, 20)   | 0          | -
Bert (custom>TFBertMainLayer)  | multiple     | 109482240  | Input [18]
global_average_pooling1d / 6   | (None, 768)  | 0          | Bert [0][0], [1][0], [2][0], [3][0], [4][0], [5][0]
Dense_layer / 5                | (None, 32)   | 24608      | global_average_pooling1d
Dense_layer / 6                | (None, 256)  | 19864      | global_average_pooling1d
Dropout / 6                    | (None, 32)   | 0          | Dense_layers
Dense / 6                      | (None, 8)    | 264        | Dropout
Concatenate_1                  | (None, 104)  | 16448      | Dense_layers
Dense_27                       | (None, 512)  | 53760      | concatenate_1[0][0]
Dropout_50                     | (None, 512)  | 0          | Dense_27
Dense_28_29 / 2                | (None, 1)    | 257        | Dropout_50

Summary inference about the model:

Total params: 110,005,257
Trainable params: 110,005,257
Non-trainable params: 0

The stacked transformer encoders in the above model encode the query and the
document, and the similarities are computed using a MaxSim operator. The
document that has the maximum similarity for a given query is classified under
it. In our case the document is the tweet and the query is the label.

The MaxSim scores are passed through a softmax to find the result with the
highest probability, and that is the resulting 'label'.
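A simplified sketch of this scoring step is given below. It assumes each label
(query) and the tweet (document) have already been encoded into contextual
embedding matrices and reuses the maxsim_score function sketched in the
introduction; it is illustrative rather than the exact implementation.

import numpy as np

def classify_tweet(label_embs, tweet_emb):
    # label_embs: list of (num_label_tokens, dim) arrays, one per class
    # tweet_emb:  (num_tweet_tokens, dim) array
    scores = np.array([maxsim_score(q, tweet_emb) for q in label_embs])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                      # softmax over the label scores
    return int(np.argmax(probs)), probs       # predicted label and probabilities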

MODULE 4 : Training, Testing, Validation

The model is trained on the training data using the pretrained weights and tested
for its prediction score.

print_evaluation_metrics(np.array(testing_outputs), np.array(preds), '')

The confusion matrix is obtained by using :

sklearn.metrics.confusion_matrix(testing_outputs_array, testing_preds.round())
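The print_evaluation_metrics helper is not listed in the report; a plausible sketch
based on the metrics reported in Section VI (accuracy, precision and recall, plus a
classification report) is shown below:

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, classification_report)

def print_evaluation_metrics(y_true, y_pred, tag=''):
    # Threshold the sigmoid outputs at 0.5 and report standard metrics.
    y_pred = np.round(y_pred)
    print(tag, 'Accuracy :', accuracy_score(y_true, y_pred))
    print(tag, 'Precision:', precision_score(y_true, y_pred))
    print(tag, 'Recall   :', recall_score(y_true, y_pred))
    print(classification_report(y_true, y_pred))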

V. A. Work Completed

● Kaushic Aravind B
  - ColBERT - Preprocessing, training, testing and evaluation
  - BERT - Preprocessing, training, testing and evaluation

● Prathiksha P
  - ColBERT - Preprocessing, training, testing and evaluation
  - SVM - Preprocessing, training, testing and evaluation

● Subitsha T L K
  - ColBERT - Preprocessing, training, testing and evaluation
  - GloVe - Preprocessing, training, testing and evaluation

B. Work to be done

● ColBERT - To increase accuracy of the existing model

VI. Results

(i) Performance of the models:


Models  | Accuracy % (train) | Accuracy % (test) | Accuracy % (valid) | Precision | Recall
ColBERT | 47.7               | 48.8              | 47.3               | 47.6      | 97.25
BERT    | 86.48              | 87.89             | -                  | 88.27     | 87.39
SVM     | 100                | 92                | 92                 | 92        | 92
GloVe   | 92                 | 92                | 94                 | 92        | 92

(ii) Training Sequential models:

Model   | Optimizer | Loss function        | Activation function (Hidden-Output)
ColBERT | Adam      | Binary Cross-Entropy | ReLU-Softmax
BERT    | Adam      | Binary Cross-Entropy | ReLU-Softmax
GloVe   | Adam      | Binary Cross-Entropy | ReLU-Sigmoid

VII. Analysis

From the implementations that have been carried out using models such as BERT,
SVM and GloVe, we have analysed the accuracy score of each model on the
suicidal data. SVM and GloVe show the highest accuracy on the suicidal data as
of now, while ColBERT in its current configuration shows considerably lower
accuracy. The way in which a suicidal sentence is classified has to be analysed
further for a better result. The hyperparameters of ColBERT will be tuned and the
results noted.

VIII. Conclusion
In this document, we proposed a suicide detection model using ColBERT, a
transformer model that applies a late interaction technique between the query and
the document over BERT (Bidirectional Encoder Representations from
Transformers).
The major task was to understand the psychology behind how a text is classified as
suicidal or not from a human perspective. The dataset has 2 main columns, namely
"tweet", which is the text posted by a user, and "intention", the label which classifies
a text as suicidal or not suicidal. The dataset was cleaned and preprocessed by
truncating and removing special symbols, emails, special characters, etc. The data
was balanced using stratified sampling and was then ready for the next step, which
was to tokenize the input text. The input text is sent into the BERT model for
embedding, and contextualized word vectors are obtained. Finally, the proposed
8-layer, 110M-parameter transformer model was employed on these word vectors to
map the query (label) to the input document. The model achieves an accuracy of
47.7% on training data, 48.8% on testing data and 47.3% on validation data. The
proposed model had not been pre-trained for this problem statement elsewhere, and
therefore the model was pre-trained and fine-tuned by the team members. Future
work involves studying more models and increasing the accuracy of the proposed
ColBERT model for Suicidal Ideation Detection.
