Suicidal Ideation Detection Using Colbert Project Report
PROJECT REPORT
Subitsha T L K (2018115114)
Prathiksha P (2018115074)
Kaushic Aravind B (2018115046)
I. Introduction
The figure below visualises suicide rates per 100,000 population across the
world. India is among the nations with a high suicide rate, which makes it
important to find approaches that mitigate this situation.
The major factors that drive a person towards suicide include serious
financial issues, unmanageable social pressure, depression arising from
relationships, inability to attain long-term goals and dreams, chronic health
issues, and a drastic loss of self-esteem and self-confidence.
The bar graph below shows the number of people who die by suicide in
high-income and low-income countries, by age in years.
According to WHO reports, nearly 40% of countries have more than 15 suicide
deaths per 100,000 men, whereas only 1.5% of countries show a rate that high
for women. The difference is attributed to a lack of communication over the
long run.
Among people who die by suicide, 25-35% leave suicide notes before the
attempt. These notes play a pivotal role in analysing the behaviour of the
individual and help in understanding the main reason behind such thoughts.
The notes are generally a means of conveying long-held resentment and
farewell messages to parents, friends and close kin.
Social media has emerged as one of the most popular means by which people
express their resentment, so we aim to apply deep learning techniques to
identify the risk of suicide and detect early signs of suicidal thoughts.
Social network platforms, especially Reddit and Twitter, have recently shown
a gradual but significant rise in forum posts related to suicidality,
depression and other problems. Twitter in particular is a communication
platform that allows users to share and broadcast their personal details,
activities and opinions through posts, which are short text-based alerts or
updates. As the rate of people dying from suicide keeps increasing, the speed
at which information spreads through Twitter or Reddit is an advantage for
detecting the early signs of suicide quickly and efficiently. Understanding
the intensity of such information in posts is therefore of paramount
importance in the study of suicidal ideation detection.
Several academic institutions have attempted to predict people’s state of
mind and the genuineness of the notes. Genuineness is used to decide whether
a text is genuine enough to be classified as a suicide note, which in turn
helps further analysis and the application of efficient machine learning
algorithms. The reasons for suicide are difficult to understand and are
attributed to a complex interaction of many unique factors. For this reason,
many researchers analyse a wide range of psychological and health-related
content as well as responses to questionnaires and surveys. Based on the
texts that people post on social media, artificial intelligence and machine
learning techniques can predict a person’s risk of suicide and, in turn, help
understand their intentions. These newer automated techniques facilitate
better prediction of suicidal thoughts than the cumbersome traditional
methods.
ColBERT introduces a late interaction architecture that encodes the query and
the document independently using BERT and then employs a cheap interaction
step that captures their fine-grained similarity. By delaying and yet
retaining this fine-grained interaction, ColBERT can use the expressiveness
of deep language models while keeping the interaction step fast. Late
interaction is a paradigm for estimating relevance between a query q and a
document d. Under late interaction, the query and the document are separately
encoded into two sets of contextual embeddings, and relevance is evaluated
using cheap computations between both sets, that is, fast computations that
enable ranking without evaluating every possible combination available in the
input.
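A minimal sketch of the late interaction scoring described above, written in
Python with NumPy. The function name maxsim_score and the tiny
two-dimensional embeddings are illustrative assumptions, not the project
code; in ColBERT the embeddings would come from the BERT encoders.

```python
import numpy as np

def maxsim_score(query_emb, doc_emb):
    """Late-interaction relevance: for each query token take its best
    cosine match among the document tokens, then sum over query tokens."""
    # L2-normalise rows so that dot products equal cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sim = q @ d.T                      # shape: (query_tokens, doc_tokens)
    return float(sim.max(axis=1).sum())

# Hypothetical token embeddings, chosen so the result is easy to check by hand.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])  # matches both query tokens
doc_b = np.array([[1.0, 0.0], [-1.0, 0.0]])             # matches only the first

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

Because query and document are encoded separately, the only joint work is
the small similarity matrix, which is what keeps the interaction step cheap.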
● We used a dataset from GitHub that contains labelled tweets from several
users.
● The dataset is 3541 KB in size, with 9120 rows and 2 columns, namely
‘tweet’ and ‘intention’.
● The ‘tweet’ column holds the texts posted by Twitter users, which include
suicidal notes, and the ‘intention’ column indicates whether the user has
suicidal thoughts or not.
● We tokenized the text strings.
● We performed a stratified split of the data into train, test and
validation sets in the ratio 80:10:10.
● The processed input was fed into the embedding layer and then into the
hidden layers to learn the context of the tweets and identify the features.
● We then trained, tested and evaluated the model on our dataset.
● After evaluating the model, we generated a classification report from the
results.
● Finally, we compared the classification results of the ColBERT model with
those of earlier models such as SVM, GloVe and BERT, and observed the
accuracy of each model relative to ColBERT.
S.No: 2
Title: Learning Models For Suicide Prediction From Social Media Posts. (2021)
Author, journal / conference name and year: Ning Wang, Fan Luo, Yuvraj
Shivtare
Problem statement: Propose a deep learning architecture and test three other
machine learning models to detect individuals that will attempt suicide
within 30 days and six months.
Methodology / algorithm used: C-Attention model, KNN and SVM models.
Pros and advantage: The C-Att model performs better on longer range suicide
predictions. KNN and SVM work better on shorter range suicide predictions.
Cons and future work: More work is needed to decipher why certain features
and models best predict suicidality in large samples.

S.No: 3
Title: Suicidal Ideation Detection: A Review of Machine Learning Methods and
Applications. (2021)
Author, journal / conference name and year: Shaoxiong Ji, Shirui Pan, Xue Li
Problem statement: A survey to comprehensively introduce the methods from
machine learning techniques with feature engineering.
Methodology / algorithm used: Attention and RNNs.
Pros and advantage: Understanding the context of the text with a specified
window in an accurate manner.
Cons and future work: The window size of RNNs is their own disadvantage, as
an increased text size combined with a smaller window size prevents the model
from remembering more than the window size.

S.No: 5
Title: Early Risk Detection of Pathological Gambling, Self-Harm and
Depression Using BERT. (2021)
Author, journal / conference name and year: Ana-Maria Bucur, Adrian Cosma and
Liviu P. Dinu
Problem statement: For early risk detection of problem gambling, data sets
from popular subreddits that address gambling addiction were opted for
crawling and constructing and training a BERT classifier on user posts.
Methodology / algorithm used: It uses a BERT encoder that encodes both the
user text and all the answers to the questionnaire, then fine-tunes the BERT
encoder through contrastive learning.
Pros and advantage: Transformer encoders are stacked and the input text is
encoded to later match it with the document. The major advantage is the
bidirectional traversing of the words during the encoding process.
Cons and future work: It failed to grasp, to a high degree, the context
provided in the previous sentence (common sense based contextualization).
The data set is a collection of tweets together with the intention of suicide
(indicated with 1’s and 0’s).
The collected data was imbalanced, so stratified sampling was used to keep
the class proportions the same across the splits.
import pandas as pd
df = pd.read_csv('/content/twitter-suicidal_data.csv')
Splitting data into train, test and validation sets : We used stratified
sampling to split the dataset into train, test and validation sets in the
ratio 80:10:10 (train:test:validation). The split was performed by calling
train_test_split() twice.
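The two-call split can be sketched as follows, assuming scikit-learn's
train_test_split with its stratify argument. The toy lists stand in for the
‘tweet’ and ‘intention’ columns, and the random_state value is an arbitrary
assumption, not a setting taken from the project.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the 'tweet' and 'intention' columns (hypothetical data).
tweets = [f"tweet {i}" for i in range(100)]
labels = [1] * 30 + [0] * 70

# First call: hold out 20% of the data while keeping the class ratio.
train_x, rest_x, train_y, rest_y = train_test_split(
    tweets, labels, test_size=0.2, stratify=labels, random_state=42)

# Second call: split the held-out 20% in half into test and validation sets.
test_x, val_x, test_y, val_y = train_test_split(
    rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=42)

split_sizes = (len(train_x), len(test_x), len(val_x))
positives = (sum(train_y), sum(test_y), sum(val_y))
```

Stratifying both calls keeps the share of positive labels at 30% in each of
the three sets, mirroring the imbalance of the toy data rather than removing
it.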
BERT Embedding Layer : The input text is fed into the BERT embedding
layer. This layer helps the model learn the context of the entire text and
classify sentences by their meaning and context rather than treating the
words as isolated vectors.
3. [MASK] : Token used for masked words. It is only used during
pre-training, where BERT masks specific words in a sentence and learns to
predict the masked words.
Details of BERT-Base-uncased :
Tokenization : The split data is tokenized sentence by sentence using the
BERT tokenizer, which makes it a suitable input for the BERT layers.
tokenizer = BertTokenizer.from_pretrained(MODEL_TYPE)
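To illustrate the shape of the input that the BERT layers expect, here is a
toy sketch of the encoding step. It is a simplification under stated
assumptions: the real BertTokenizer splits text into WordPiece subwords and
uses its own vocabulary, whereas the hypothetical encode helper below splits
on whitespace and uses a made-up seven-entry vocabulary. The length of 20 is
taken from the input shape listed in the model summary.

```python
def encode(text, vocab, max_len=20):
    """Toy encoder: whitespace tokens wrapped in [CLS] ... [SEP], then padded."""
    # Truncate so that the two special tokens still fit within max_len.
    tokens = ["[CLS]"] + text.lower().split()[: max_len - 2] + ["[SEP]"]
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    mask = [1] * len(ids)              # 1 marks real tokens, 0 marks padding
    pad = max_len - len(ids)
    return ids + [vocab["[PAD]"]] * pad, mask + [0] * pad

# Hypothetical miniature vocabulary; the real tokenizer ships its own.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "i": 4, "feel": 5, "fine": 6}

input_ids, attention_mask = encode("I feel fine today", vocab)
```

The word "today" is not in the toy vocabulary, so it maps to the [UNK] id;
every sequence comes out with the same fixed length regardless of the tweet.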
The sentence embeddings acquired using the BERT model are used for
classification by feeding them into a neural network that classifies the
text as suicidal or not. The proposed model uses 8 layers with 110 million
parameters.
MODEL SUMMARY:
Layer (type)/number    Output Shape    Parameter    Connected to
Input / 18             (None, 20)      0            -
The stacked transformer encoders in the above model encode the query and the
document, and their similarity is computed using the MaxSim operator. The
document with the maximum similarity to a given query is assigned to that
query. In our case the document is the tweet and the query is the label.
The MaxSim scores are passed through a softmax to find the outcome with the
highest probability, which becomes the resulting 'label'.
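The final step can be sketched in plain Python as below. The score values
and label names are hypothetical placeholders for the MaxSim scores of one
tweet against the two label queries; they are not outputs of the project
model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)                 # subtract the max to avoid overflow
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical MaxSim scores of one tweet against the two label queries.
labels = ["not suicidal", "suicidal"]
maxsim_scores = [1.2, 2.9]

probs = softmax(maxsim_scores)
predicted = labels[probs.index(max(probs))]
```

Softmax preserves the ordering of the scores, so the predicted label is
always the one with the largest MaxSim score; the probabilities only add a
normalised confidence on top.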
The pretrained model is trained on the data and tested to obtain its
prediction score.
print_evaluation_metrics(np.array(testing_outputs), np.array(preds), '')
sklearn.metrics.confusion_matrix(testing_outputs_array, testing_preds.round())
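The body of print_evaluation_metrics is not shown in the report. As a rough
sketch of what such an evaluation typically computes, the hypothetical
binary_metrics helper below derives the confusion matrix, accuracy,
precision, recall and F1 from rounded predictions. The labels and
probabilities are made-up values, and the matrix layout follows the
scikit-learn convention (rows are true classes, columns are predicted
classes).

```python
def binary_metrics(y_true, y_pred):
    """Confusion matrix and standard scores for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical true labels and predicted probabilities for eight tweets.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.4, 0.8, 0.1, 0.7, 0.2, 0.3, 0.6]
y_pred = [round(p) for p in y_prob]    # threshold at 0.5, as .round() does

metrics = binary_metrics(y_true, y_pred)
```

Reading precision and recall alongside accuracy matters for imbalanced data,
where a model can reach high accuracy by favouring the majority class.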
V. A. Work Completed
● Kaushic Aravind B
● Prathiksha P
● Subitsha T L K
B. Work to be done
VI. Results
SVM 100 92 92 92 92
GloVe 92 92 94 92 92
(Hidden-Output)
VII. Analysis
From the implementations built with models such as BERT, SVM and GloVe, we
analysed the accuracy of each model on the suicidal data. SVM and GloVe
currently show the highest accuracy on this data. ColBERT has shown good
results. How a suicidal sentence gets classified has to be analysed further
to improve the result. The hyperparameters of ColBERT are to be tuned and the
results noted.
VIII. Conclusion
In this document, we proposed a suicide detection model using ColBERT, a
transformer model that applies a late interaction technique between the query
and the document on top of BERT (Bidirectional Encoder Representations from
Transformers).
The major task was to understand, from a human perspective, the psychology
behind how a text is classified as suicidal or not. The data set has two
columns, namely “tweet”, which is the text posted by a user, and “intention”,
which is the label that classifies a text as suicidal or not suicidal. The
dataset was cleaned and preprocessed by truncating and by removing special
symbols, emails, special characters etc. The data was then split using
stratified sampling, so that every split keeps the original class
proportions, and was ready for the next step. The next step was to tokenize
the input text. The input text is sent into the BERT model for embedding, and
contextualized word vectors are obtained. Finally, the proposed 8-layer,
110 M parameter transformer model was employed on these word vectors to map
the query (label) to the input document. The model achieves an accuracy of
47.7% on the training data, 48.8% on the testing data and 47.3% on the
validation data. The proposed model has not been pre-trained for this problem
statement elsewhere, and therefore the model was pre-trained and fine-tuned
by the team members. Future work involves studying more models and increasing
the accuracy of the proposed ColBERT model for Suicidal Ideation Detection.