
STOCHASTIC GRADIENT BOOSTING MODEL BASED CONTENT ANALYSIS IN ONLINE SOCIAL MEDIA.

DONE BY:
Batch No: 15
B. Navya (21RA5A0509)
G. Bhavani (21RA5A0502)
A. Nagaraj (21RA5A0508)

UNDER THE GUIDANCE OF:
Mr. Naveen Kumar

COMPUTER SCIENCE AND ENGINEERING


CONTENTS
 Abstract
 Introduction
 Problem Statement
 Literature Survey
 Research gaps
 Existing System
 Drawbacks of existing system
 Proposed system
 Advantages
 Applications
 Software and packages requirement
 Dataset description
 Conclusion
 Future scope
 References
ABSTRACT

 According to a World Health Organization estimate, approximately 700,000
people die by suicide each year.

 People who are depressed or suicidal are increasingly using social media to
express themselves.

 A well-labeled dataset of suicidal thoughts was created from Reddit and Twitter,
and six feature groups were identified that cover not only clinical suicidal
symptoms but also online behaviors on social media.

 The main aim of this research is to provide early detection of suicidal ideation
by analyzing online social media content.
INTRODUCTION

 Suicide is one of the most serious social health issues confronting modern
society. Suicidal activities can be influenced by a range of personal and social
factors, such as negative experiences, physical or mental illness, hopelessness,
anxiety, and so on.

 Since the introduction of social media, people have been increasingly using
online forums, tweets, and blogs to communicate their suicidal tendencies.

 Some people, particularly teenagers, use social media to express suicidal
intentions, seek advice on how to commit suicide in online forums, and even
participate in suicide pacts.

 This work addresses the problem by detecting suicidal ideation in user posts
using machine learning techniques and natural language processing (NLP).
PROBLEM STATEMENT
 Online social media platforms generate vast amounts of diverse and dynamic
content, presenting a challenge in effectively analyzing and categorizing this
content for various purposes such as sentiment analysis, topic modeling, and
user behavior prediction.

 Traditional machine learning approaches often struggle to handle the
inherent complexities, noise, and non-linearity of social media data.
LITERATURE SURVEY
 "Automatic detection of suicidal ideation in social media posts" by Karami et
al. (2021) - this study proposed a deep learning model to detect suicidal
ideation in social media posts. The authors used a dataset of 5,000 tweets
labeled as suicidal or non-suicidal and achieved an F1-score of 0.78.
 "Detecting suicidal ideation on social media using neural language models"
by Zhang et al. (2020) - the authors used a neural language model to detect
suicidal ideation on Twitter. They used a dataset of 7,000 tweets labeled as
suicidal or non-suicidal and achieved an F1-score of 0.82.
 "Using machine learning to detect suicidal ideation in online user content" by
O'Dea et al. (2020) - this study used a machine learning model to detect
suicidal ideation in online user content. The authors used a dataset of 9,000
Reddit posts labeled as suicidal or non-suicidal and achieved an F1-score of 0.88.
 "Detecting suicidal ideation in online user content using NLP" by Cheng et al.
(2019) - the authors used an NLP-based approach to detect suicidal ideation in
online user content. They used a dataset of 10,000 Reddit posts labeled as
suicidal or non-suicidal and achieved an F1-score of 0.87.
 "Suicide risk assessment in online forums using text classification and user
engagement" by Nguyen et al. (2019) - this study proposed a text classification
model to assess suicide risk in online forums. The authors used a dataset of
2,500 forum posts labeled as high or low suicide risk and achieved an F1-score
of 0.81.
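The studies above are all compared by F1-score, the harmonic mean of precision and recall. The following is a minimal sketch of how that metric is computed from true and predicted labels. The label lists are made up for illustration and are not taken from any of the cited datasets.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative labels only: 1 = suicidal, 0 = non-suicidal.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Unlike plain accuracy, F1 penalizes both missed positive posts and false alarms, which matters when the positive class is rare.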
RESEARCH GAPS

 There are several research gaps in suicidal content detection:


Limited Data Availability:
• Little data is available for training and testing a suicidal content
detection system. More effort is needed to collect accurate data from
different sources.
Model Used:
• A stochastic gradient boosting model is used because it gives more accurate
detection than the methods used in other systems.
EXISTING SYSTEM

Manual content gathering


Rule-based filtering on keywords or tags
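As an illustration of a keyword rule of this kind, here is a minimal sketch. The keyword set and the example posts are hypothetical and chosen only to show how such a rule behaves. They are not the rules of any real system.

```python
import re

# Hypothetical keyword list, for illustration only.
KEYWORDS = {"hopeless", "worthless"}


def keyword_flag(post):
    """Flag a post if it contains any keyword as a whole word."""
    tokens = set(re.findall(r"[a-z']+", post.lower()))
    return bool(tokens & KEYWORDS)


false_alarm = keyword_flag("This homework is hopeless")   # flagged, yet harmless
missed_post = keyword_flag("I can't see a way forward")   # not flagged
```

The first post is flagged although it is harmless, and the second is missed because it contains none of the listed words. Both cases are consistent with the accuracy drawbacks listed for the existing system.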
DRAWBACKS OF EXISTING SYSTEM

Time consuming
Low accuracy
High error
Low performance in prediction
PROPOSED SYSTEM
Workflow of the proposed system:
Dataset → Preprocessing → Vectorization → Data splitting →
Model training (existing model and proposed model) →
Model testing (Ypred) → Accuracy
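The data splitting and accuracy steps of this workflow can be sketched in plain Python as follows. The 80/20 split ratio, the fixed seed, and the function names are assumptions made for illustration. The slides do not state which split was actually used.

```python
import random


def train_test_split(rows, labels, test_fraction=0.2, seed=42):
    """Shuffle the row indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


def accuracy(y_true, y_pred):
    """Fraction of predictions (Ypred) that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Placeholder data: ten row ids with alternating labels.
rows = list(range(10))
labels = [i % 2 for i in rows]
x_train, x_test, y_train, y_test = train_test_split(rows, labels)
```

Holding out the test rows before training keeps the reported accuracy from being measured on posts the model has already seen.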
DATA PREPROCESSING

Before building a stochastic gradient boosting model for content analysis
in online social media, we need to preprocess the data to ensure that it is
clean and usable. This involves several steps:

1. Data cleaning: remove irrelevant or duplicate data, correct typos and
spelling errors, and remove special characters.
2. Tokenization: split the text into individual words or tokens.
3. Stopword removal: remove common words that do not add meaning to
the text, such as 'the', 'and', and 'of'.
4. Stemming: reduce words to their root form, such as 'running' to 'run'.
5. Vectorization: convert the text into numerical vectors that can be used
as input for the model.
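The five steps above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the stopword set is a tiny hand-written sample, `simple_stem` is a toy suffix stripper standing in for a real stemmer such as the one in nltk, and the vectorizer is a plain bag-of-words count. The example posts are invented.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; a real pipeline would use a fuller list.
STOPWORDS = {"the", "and", "of", "a", "an", "to", "is", "in", "i", "am"}


def clean(text):
    """Step 1: lowercase, drop URLs, digits, and special characters."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def simple_stem(word):
    """Step 4 (toy version): strip 'ing'/'ed' and undo consonant doubling."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]  # "runn" -> "run"
            return stem
    return word


def preprocess(text):
    tokens = clean(text).split()                          # step 2: tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]    # step 3: stopwords
    return [simple_stem(t) for t in tokens]               # step 4: stem


def vectorize(docs):
    """Step 5: bag-of-words counts over a sorted vocabulary."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[tok] for tok in vocab])
    return vocab, vectors


posts = ["I am running out of hope!!! http://example.com",
         "The hope and the running"]
docs = [preprocess(p) for p in posts]
vocab, vectors = vectorize(docs)
```

Each post becomes one row of counts over a shared vocabulary, which is the numerical form a model can take as input.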
MODEL TRAINING

The stochastic gradient boosting model was trained using a dataset of online social
media posts. The training process involved several steps:

1. Data cleaning and preprocessing, including tokenizing, removing stop words,
and stemming.
2. Feature extraction, including the use of bag-of-words and TF-IDF.
3. Hyperparameter tuning using cross-validation.
4. Training the model using the xgboost library.
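To make the term "stochastic gradient boosting" concrete, here is a from-scratch toy sketch. It is not the xgboost implementation and not the authors' code. Each round fits a one-split regression tree (a stump) to the log-loss residuals of a random subsample of the rows, which is what makes the boosting stochastic. In xgboost the comparable setting is the `subsample` parameter. The feature matrix is hypothetical: column 0 stands for the count of an informative term and column 1 for an uninformative one.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals, rows):
    """Fit a one-split regression tree to the residuals of the sampled rows."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({X[i][j] for i in rows}):
            left = [residuals[i] for i in rows if X[i][j] <= thr]
            right = [residuals[i] for i in rows if X[i][j] > thr]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, j, thr, left_mean, right_mean)
    if best is None:  # no valid split: fall back to a constant prediction
        mean = sum(residuals[i] for i in rows) / len(rows)
        return (0, float("inf"), mean, mean)
    return best[1:]


def fit(X, y, n_rounds=30, learning_rate=0.3, subsample=0.7, seed=0):
    rng = random.Random(seed)
    n = len(X)
    scores = [0.0] * n
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of the log-loss with respect to the score: y - p.
        residuals = [y[i] - sigmoid(scores[i]) for i in range(n)]
        # The stochastic part: each round sees only a random subset of rows.
        rows = rng.sample(range(n), max(2, int(subsample * n)))
        j, thr, left_value, right_value = fit_stump(X, residuals, rows)
        stumps.append((j, thr, left_value, right_value))
        for i in range(n):
            step = left_value if X[i][j] <= thr else right_value
            scores[i] += learning_rate * step
    return {"learning_rate": learning_rate, "stumps": stumps}


def predict_proba(model, x):
    score = sum(model["learning_rate"] * (lv if x[j] <= thr else rv)
                for j, thr, lv, rv in model["stumps"])
    return sigmoid(score)


# Hypothetical word-count features; label 1 = suicidal, 0 = non-suicidal.
X = [[1, 0], [2, 1], [1, 0], [2, 1], [0, 0], [0, 1], [0, 0], [0, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
model = fit(X, y)
predictions = [int(predict_proba(model, x) >= 0.5) for x in X]
```

On this toy data every stump splits on column 0, so the uninformative column is ignored. With a library such as xgboost, the number of rounds, the learning rate, and the subsample fraction are the kind of hyperparameters that step 3 would tune by cross-validation.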
ADVANTAGES

High accuracy
Handles complex data
High performance in prediction
Low error
Less time-consuming
Efficiency
APPLICATIONS

 Sentiment Analysis
 Topic Modeling
 Content Recommendation Systems
 Fake News Detection
 User Profiling and Personalization
SOFTWARE AND PACKAGES REQUIREMENTS

 Operating system: Windows 11
 Coding language: Python
 Tool: Anaconda, Jupyter Notebook
 Packages:
 pickle
 tqdm
 Regular expressions (re)
 pandas
 Natural Language Toolkit (nltk)
 numpy
DATASET DESCRIPTION

 The dataset comprises more than 9,000 collected instances.
 The instances are collected from different tweets.
 This data is used for the detection of suicidal content.
CONCLUSION

 This work presented a machine learning method for detecting dangerous web
pages containing suicidal content.
 Based on the experimental results obtained, we conclude that the method
described here is able to detect dangerous social media content.
FUTURE SCOPE

In the future, we plan to improve the system for checking web pages for suicidal
content, namely:
 Add image verification, as images may be suicidal or contain symbols of death
groups. The latter may indicate that the website containing them belongs to such
a group.
 Add link verification on the page, as links may lead to related websites.
 Add checking for suicidal instructions.
We also plan to improve this method by analyzing more machine learning algorithms
and text processing libraries. Research in this area is worth continuing to
improve the accuracy of the detection of dangerous web pages.
REFERENCES
 "Automatic detection of suicidal ideation in social media posts" by Karami et
al. (2021).
 "Detecting suicidal ideation on social media using neural language models"
by Zhang et al. (2020).
 "Using machine learning to detect suicidal ideation in online user content" by
O'Dea et al. (2020).
 "Detecting suicidal ideation in online user content using NLP" by Cheng et al.
(2019).
 "Suicide risk assessment in online forums using text classification and user
engagement" by Nguyen et al. (2019).
* THANK YOU *
