|| JAI SRI GURUDEV ||
SRI ADICHUNCHANAGIRI SHIKSHANA TRUST ®

SJB INSTITUTE OF TECHNOLOGY
BGS HEALTH & EDUCATION CITY, KENGERI, BENGALURU-560060
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
Project Work Phase – I

Synopsis Presentation On
“Cyber Bullying Detection using Machine Learning”
Presented By:
Neha Nagendrayya [1JB20CS069]
Nithin G [1JB20CS074]
Riya Nagori [1JB20CS096]
S. Tejas [1JB20CS099]
Under the Guidance of:
Dr. Roopa MJ
Associate Professor
Dept. of CSE
AGENDA CONTENTS
1. Abstract
2. Introduction
3. Objective
4. Literature Survey
5. Existing System
6. Proposed System
7. Methodology
8. Requirement Specification
9. Conclusion
10. References
ABSTRACT
Social media is a platform where many young people are bullied, and as social networking
sites grow, cyber bullying increases day by day. By identifying word similarities in the
tweets posted by bullies, a machine learning (ML) model can be developed to detect
bullying on social media automatically. Many social media bullying detection techniques
have been implemented, but most of them are text-based only. The objective of our
project work is to demonstrate an implementation of NLP and CNN that detects bullying in
tweets, posts, and similar content. A machine learning model is proposed to detect and
prevent bullying on Twitter. Two techniques are used: NLP (Natural Language Processing)
to analyze complete sentences in comments, and CNN (Convolutional Neural Network) for
image identification. Both NLP and CNN were able to detect true positives with higher
accuracy. In addition, the Twitter API is used to fetch tweets, which are passed to the
model to determine whether they are bullying or not.
INTRODUCTION
With the advancement of technology, the internet has largely been a safe and secure sphere of communication,
but social media has been prone to cybercrime. Cyber bullying is the use of online communication to bully an
individual, typically by sending messages of an intimidating or threatening nature.
Around 87 percent of today’s youth have witnessed some form of cyber bullying. Cyber bullying
can take different forms, such as sexual harassment, hostile environment, revenge, and retaliation. Since the
offender is hidden from the victim, the problem becomes more complex. With the spread of online life
and internet access, cyber bullying has increased, and it is difficult to detect. It is therefore necessary
to detect cyber bullying in order to protect adolescents. The focus is on identifying textual cyber bullying.
Automatic surveillance of cyber bullying has gained considerable interest in the field of computer science.
In this research, this vital data, in the form of text, is used to improve existing cyber bullying
detection performance.
A Convolutional Neural Network (CNN), popularly known as a ConvNet, is a specific type of artificial
neural network that uses perceptrons, a basic machine learning unit, to analyze data. CNNs are applied to
image processing, natural language processing, and other cognitive tasks.
OBJECTIVE
• In the proposed system, tweets are classified as either a threat tweet or not a threat tweet. This is
done using sets of keywords belonging to different categories. For each category, the probability is
computed, then the contingency is computed, and finally sorting is performed to assign the tweet to
one of the cyber threat categories.
• The framework aids the e-crime department in identifying suspicious words in cyber messages and
tracing suspected culprits. Existing instant messengers and social networking sites lack the ability
to capture significant suspicious patterns of threat activity from dynamic messages and to find
relationships among people, places, and things during online chat, even though criminals have
adapted to these platforms.
• We will apply different algorithms to identify and prevent bullying text or posts in comments.
Our model will then help protect people from attacks by social media bullies.
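The keyword-based classification in the first objective can be sketched roughly as follows. This is only an illustration under stated assumptions: the category names and keyword sets are hypothetical, the per-category "probability" is approximated as the share of tweet tokens that match a category's keywords, and the contingency step is omitted because the slides do not define it.

```python
import re

# Hypothetical keyword sets per threat category (not taken from the slides).
CATEGORY_KEYWORDS = {
    "threat": {"kill", "hurt", "beat"},
    "insult": {"stupid", "idiot", "loser"},
    "harassment": {"stalk", "follow", "watch"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def category_scores(tweet):
    """Return (category, score) pairs sorted by descending score.

    The score is the share of tokens that match a category keyword,
    a simple stand-in for the per-category probability.
    """
    tokens = tokenize(tweet)
    if not tokens:
        return [(name, 0.0) for name in sorted(CATEGORY_KEYWORDS)]
    scores = {
        name: sum(tok in words for tok in tokens) / len(tokens)
        for name, words in CATEGORY_KEYWORDS.items()
    }
    # Sort by score (highest first), breaking ties alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def classify(tweet):
    """Label a tweet with its top category, or 'not a threat' if nothing matches."""
    top_name, top_score = category_scores(tweet)[0]
    return top_name if top_score > 0 else "not a threat"
```

Calling `classify` on a tweet returns the best-matching hypothetical category; a real system would replace the keyword match with a trained model.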
PROBLEM STATEMENT
Cyberbullying has become a pervasive and harmful issue in today's digital age, causing emotional
distress and psychological harm to individuals, especially among adolescents and young adults. To
address this problem, there is a critical need for an automated system that can effectively detect and
mitigate instances of cyberbullying on social media platforms, chat applications, and other online
communication channels.
The goal of this project is to develop a robust and accurate Cyber Bullying Detection system using deep
learning techniques, specifically Long Short-Term Memory (LSTM) and Convolutional Neural Networks
(CNN). This system should be capable of analyzing text and multimedia content (such as images and
videos) to identify and classify instances of cyberbullying, hate speech, or offensive content in real-
time.
KEY FEATURES
1. Data Collection: Gather a diverse and comprehensive dataset containing text, images, and videos from various online
sources, including social media platforms, forums, and messaging apps. This dataset should include labeled examples of
cyberbullying and non-cyberbullying content.
2. Data Preprocessing: Clean, preprocess, and annotate the dataset to ensure consistency and prepare it for training the
deep learning models. This includes text tokenization, image resizing, and video frame extraction.
3. Model Architecture: Design and implement a hybrid deep learning model that combines LSTM and CNN layers. The
LSTM layers will process textual data, while the CNN layers will handle image and video content. Fine-tune the
architecture to optimize performance.
4. Feature Extraction: Extract relevant features from text, images, and videos, which capture the linguistic, visual, and
contextual cues associated with cyberbullying.
5. Training and Validation: Train the model using the preprocessed dataset and implement cross-validation techniques to
ensure robustness and minimize overfitting.
6. Real-time Detection: Develop an interface or API that allows users to input text, images, or video content for real-time
cyberbullying detection. The system should provide immediate feedback on the likelihood of cyberbullying and the
severity of the content.
7. Model Evaluation: Evaluate the performance of the LSTM-CNN model using appropriate metrics such as precision,
recall, F1-score, and accuracy. Compare the results with existing cyberbullying detection methods.
8. User Alerts and Reporting: Implement a mechanism to notify users or moderators when cyberbullying content is
detected. Provide reporting and logging capabilities for further analysis and action.
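The cross-validation mentioned under Training and Validation can be illustrated with a minimal k-fold index splitter. This is a sketch, not the project's implementation: the function name `k_fold_indices` is hypothetical, and it assumes the samples were already shuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Example: 10 samples split into 3 folds of sizes 4, 3 and 3.
folds = list(k_fold_indices(10, 3))
```

Each sample lands in exactly one validation fold, so every labeled example is used for both training and validation across the k rounds.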
CHALLENGES
LITERATURE SURVEY
• Cyber bullying Identification Using Participant-Vocabulary Consistency
• Authors: Elaheh Raisi, Bert Huang
• Published: 26 Jan 2016
• This paper proposes the participant-vocabulary consistency method to simultaneously discover
victims, instigators, and the vocabulary of words that indicate bullying. Starting with a seed dictionary
of high-precision bullying indicators, the authors optimize an objective function that seeks consistency
between the scores of the participants in each interaction and the scores of the language used. For
evaluation, they run experiments on data from Twitter and Ask.fm, services known to contain high
frequencies of bullying. Their experiments indicate that the method can successfully detect new
bullying vocabulary. They are currently working on a more formal probabilistic model of
bullying to robustly incorporate noise and uncertainty.
• Evaluation of Machine Learning Algorithms for Anomaly Detection
• Authors: Nebrase Elmrabit, Feixiang Zhou, Fengyin Li, Huiyu Zhou
• Published: 2018
• In this paper, the authors comprehensively evaluate the performance of twelve ML
algorithms for detecting anomalous behaviours that may be indicative of cyber attacks. To
recommend the best-fit algorithms, three datasets (UNSW-NB15, CICIDS-2017, and ICS
cyberattack) were applied to the selected methods. However, deep learning classification requires a
very large amount of training data, which was not available in the study, and Naive Bayes
classification had the lowest performance in terms of accuracy, precision, recall, and AUC.
• The Role of Artificial Intelligence and Cyber Security for Social Media
• Authors: Bhavani Thuraisingham
• This paper discusses the benefits of social media and the application of machine learning
techniques to social media. For example, machine learning techniques are being used to detect the
sentiment of users, to provide information on the spread of deadly diseases, and to prevent
child trafficking. It also discusses the use of machine learning for detecting fake news and malicious
software. Next, the paper discusses security and privacy issues for social media systems, including
access control models and privacy-aware social media systems. Finally, the paper discusses the
integration of AI and cyber security for social media systems, such as adversarial machine learning and
the inference and privacy problems.
• A Framework to Predict Social Crime through Twitter Tweets By Using Machine Learning
• Authors: Zaheer Abbass, Zain Ali, Mubashir Ali, Bilal Akbar, Ahsan Saleem
• The aim of this research study is to predict social media crimes using Twitter data. The authors use
three ML classifiers with a bag-of-words model. The study reports better results than the existing
state of the art. The proposed model currently works offline; in future work it can be extended to
real-time Twitter data streaming to predict further crimes. More crime classes can be added to make
the system more efficient and robust, but the approach does not cover image detection.
• Detecting A Twitter Cyber bullying Using Machine Learning
• Authors: Rahul Ramesh Dalvi, Sudhanshu Baliram Chavan, Aparna Halbe
• An approach is proposed for detecting and preventing Twitter cyber bullying using
supervised binary classification machine learning algorithms. The model is evaluated with both
Support Vector Machine (SVM) and Naive Bayes, using the TF-IDF vectorizer for feature
extraction. The results show that the accuracy in detecting cyber bullying content is higher for
SVM than for Naive Bayes. However, this technique still does not identify bullying text with
high accuracy.
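To make the TF-IDF feature extraction mentioned in the last survey entry concrete, here is a minimal pure-Python sketch. It assumes the textbook weighting tf = count / document length and idf = log(N / document frequency); library vectorizers such as scikit-learn's `TfidfVectorizer` use a smoothed variant by default, so their numbers will differ. The toy tweets are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Uses tf = count / document length and idf = log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Count in how many documents each term appears at least once.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy tweets, for illustration only.
tweets = ["you are a loser", "you are a friend", "nobody likes you"]
vectors = tf_idf(tweets)
```

A word that appears in every tweet (here "you") gets weight zero, while a word unique to one tweet (here "loser") gets the largest weight, which is why TF-IDF highlights distinctive vocabulary.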
EXISTING SYSTEM
There are different ways to track cyber crimes, but most papers work on
two-category classification, i.e. whether an action is a crime or not a crime,
and do not address the type of crime. There is also work that classifies mail
and social media data as spam or not spam. However, finer classification
within the spam class is not available. Hence an approach is needed that can
classify the data into various categories. Once the data is classified into
categories, higher accuracy is obtained and the necessary action can be taken
against users in each category.
- The existing system used SVM and Naive Bayes, but the SVM algorithm is not
well suited to large datasets. SVM does not perform well when the dataset
is noisy, i.e. when the target classes overlap. In cases where the
number of features for each data point exceeds the number of training
samples, the SVM will underperform.
PROPOSED SYSTEM
• In this project, a solution is proposed to detect Twitter cyberbullying. The main difference
from previous research is that we not only develop a machine learning model to detect
cyberbullying content but also apply it to real-time tweets from particular locations
using the Twitter API.
• In data pre-processing, it is important to ensure that the dataset is good enough for
analysis. This is where data cleaning becomes vital. Data cleaning deals with detecting
and correcting data records, ensuring that the data is complete and accurate, and
deleting or modifying irrelevant components as needed.
• The feature extraction step concerns selecting features from the set of possible
features that the dataset could have. An informed decision has to be made about which
type of feature to use in the machine learning model.
• In the train/test split, the dataset is divided into a training set for creating the model
and a test set for prediction. The algorithm is then applied to create the model for
sentiment classification.
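The train/test split described above can be sketched in a few lines. This is an illustrative version under assumptions: the 80/20 ratio, the fixed seed, and the toy texts and labels are hypothetical choices, not values given in the slides.

```python
import random


def train_test_split(samples, labels, test_ratio=0.2, seed=42):
    """Shuffle paired samples and labels, then split them into train and test parts."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    order = list(range(len(samples)))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(order)
    n_test = int(round(len(samples) * test_ratio))
    test_idx, train_idx = order[:n_test], order[n_test:]
    x_train = [samples[i] for i in train_idx]
    x_test = [samples[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Hypothetical toy data: 1 = bullying, 0 = non-bullying.
texts = [f"tweet {i}" for i in range(10)]
targets = [i % 2 for i in range(10)]
x_train, x_test, y_train, y_test = train_test_split(texts, targets)
```

Shuffling indices rather than the data itself keeps each text paired with its label on both sides of the split.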
• The proposed system uses an NLP technique and the CNN algorithm. CNN depends little
on pre-processing, which reduces the human effort needed to develop its
functionality. It is easy to understand and fast to implement, and it is reported
to be among the most accurate algorithms for image classification.
[Figure: system flow diagram. Recoverable labels: Dataset; Data Preprocessing (data
processing and data labeling techniques); Feature Extraction (NLP technique: noun,
pronoun, frequency extraction); Train Test split; training CNN algo; Classification
model; Input (text/image); CNN algo applied; Classification result; Checks accuracy]
• DATA COLLECTION: Data collection is the process of gathering and
measuring information on targeted variables in an established system.
• PRE-PROCESSING: Preliminary processing of the data. Data preprocessing
includes cleaning, instance selection, normalization, one-hot encoding,
transformation, feature extraction and selection, etc. The product of data
preprocessing is the final training set.
• FEATURE EXTRACTION: Feature extraction aims to reduce the number of
features in a dataset by creating new features from the existing ones (and then
discarding the original features). This reduced set of features should then be
able to summarize most of the information contained in the original set of features.
• MODEL BUILDING: A machine learning model is built by learning and
generalizing from training data, then applying that acquired knowledge to new
data it has never seen before to make predictions and fulfill its purpose.
• CLASSIFICATION OF BULLYING/NON-BULLYING
• COMPUTE ACCURACY AND PRECISION: Precision is a metric that quantifies the
proportion of positive predictions that are correct. Precision therefore measures
accuracy on the positive class, which here is the minority class.
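The evaluation step can be illustrated with a small function that derives accuracy, precision, recall, and F1-score from true and predicted labels. The function name and the sample label lists are hypothetical and serve only to show the arithmetic.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary classifier."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # bullying correctly flagged
        elif pred == positive:
            fp += 1  # non-bullying wrongly flagged
        elif truth == positive:
            fn += 1  # bullying missed
        else:
            tn += 1  # non-bullying correctly passed
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = bullying, 0 = non-bullying.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

In this toy example there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so all four metrics come out to 0.75.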
ALGORITHM
CNN MODEL LAYERS
• The crux of the entire process is the set of CNN layers used for processing. The main
component of the model is the Sequential model.
• The basic building block of Keras is a model, and the simplest model is the Sequential model,
which consists of a stack of neural network layers.
• The network is dense, which means every node in a layer is connected to every node in the
next layer.
• The perceptron is a simple algorithm that takes an input vector x of m values and
outputs either 1 (yes) or 0 (no).
• There are many types of activation functions, such as sigmoid and ReLU.
• The sigmoid function is defined as 1 / (1 + e^(-x)) and produces continuous values between 0 and 1.
• ReLU (Rectified Linear Unit) is another activation function. It is piecewise linear and
introduces nonlinearity into the network.
• ReLU is defined as f(x) = max(0, x). The function is 0 for negative values and grows
linearly for positive values.
• In this network, the input text is converted to a sequence of word indices.
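The activation functions and the perceptron described above can be written out directly. This is a plain-Python sketch for illustration, independent of Keras; the weights and bias in the example are hypothetical and chosen so the perceptron behaves like a logical AND.

```python
import math


def sigmoid(x):
    """Sigmoid 1 / (1 + e^(-x)), computed in a numerically stable way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # avoids overflow of exp(-x) for very negative x
    return z / (1.0 + z)


def relu(x):
    """Rectified linear unit: 0 for negative inputs, x otherwise."""
    return max(0.0, x)


def perceptron(x, weights, bias):
    """Output 1 if the weighted sum of inputs plus bias is positive, else 0."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


# Hypothetical weights that make the perceptron act as a logical AND.
and_weights, and_bias = [1.0, 1.0], -1.5
outputs = [perceptron([a, b], and_weights, and_bias) for a in (0, 1) for b in (0, 1)]
```

The sigmoid squashes any real input into the range 0 to 1, ReLU passes positive values unchanged, and the perceptron turns a weighted sum into a yes/no decision.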
NATURAL LANGUAGE PROCESSING
• Natural language processing (NLP) is considered a subset of machine learning (ML), and
both NLP and ML fall under the larger category of artificial intelligence.
• NLP combines artificial intelligence (AI) and computational linguistics so that
computers and humans can communicate seamlessly.
• NLP endeavours to bridge the divide between machines and people by enabling a
computer to analyse what a user said (input speech recognition) and process what
the user meant. This task has proven quite complex.
• NLP is a tool for computers to analyse, comprehend, and derive meaning from
natural language in an intelligent and useful way.
• NLP can be divided into two basic components:
Natural Language Understanding and Natural Language Generation.
METHODOLOGY
• Data Collection using Tweets
The tweets are collected from a set of 20 accounts. Data retrieval is done
through the Twitter API, with OAuth used to authenticate the open source
framework with the Twitter application.
• Sentiment Storage based on Tweets
Sentiment storage is the process of storing tweet data in relational storage
as (TwitterId, TwitterDesc, UserId). TwitterId is the unique ID associated with the tweet,
TwitterDesc is the actual tweet text, and UserId is the ID associated with the user.
• Stopwords
Stop words are words that carry little meaning on their own and are filtered out
before or after the processing of natural language data (text). The data mining community has
defined sets of such keywords, but there is no single definitive list of stop words that all
tools use, and such a filter is not always applied.
• Data Cleaning
Data cleaning removes the stop words from each tweet. After the data cleaning
process is completed, the clean data can be represented as a set (CleanId, CleanData, UserId).
CleanId is the unique ID associated with the tweet, CleanData is the tweet text after
removal of stop words, and UserId is the unique ID associated with the user.
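The storage and cleaning steps above can be sketched with Python's built-in `sqlite3` module. This is an assumption-laden illustration: the table names `tweets` and `clean_tweets`, the small stop-word list, and the sample rows are hypothetical, and an in-memory database stands in for whatever relational store the project uses. The column names follow the tuples given in the slides.

```python
import sqlite3

# Hypothetical stop-word list; real projects usually take one from an NLP library.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "you"}


def remove_stop_words(text):
    """Lowercase the text and drop every word found in STOP_WORDS."""
    return " ".join(w for w in text.lower().split() if w not in STOP_WORDS)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tweets (TwitterId INTEGER PRIMARY KEY, TwitterDesc TEXT, UserId INTEGER)"
)
conn.execute(
    "CREATE TABLE clean_tweets (CleanId INTEGER PRIMARY KEY, CleanData TEXT, UserId INTEGER)"
)

# Store raw tweets as (TwitterId, TwitterDesc, UserId).
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?)",
    [(1, "You are the worst", 101), (2, "Have a great day", 102)],
)

# Clean each tweet and store it as (CleanId, CleanData, UserId).
rows = conn.execute("SELECT TwitterId, TwitterDesc, UserId FROM tweets").fetchall()
conn.executemany(
    "INSERT INTO clean_tweets VALUES (?, ?, ?)",
    [(tid, remove_stop_words(desc), uid) for tid, desc, uid in rows],
)
clean_rows = conn.execute(
    "SELECT CleanId, CleanData, UserId FROM clean_tweets ORDER BY CleanId"
).fetchall()
```

Reusing the tweet ID as CleanId keeps each cleaned record traceable to its original tweet and user.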
REQUIREMENT SPECIFICATION
Software Requirements:
• Python, PyCharm, Anaconda.
• Windows
Hardware Requirements:
• Processor – Intel core i7
• Memory – 2GB RAM
• 0.5TB Hard Disk Drive
• Mouse, Keyboard, Display device
Programming:
• The project will be implemented in Python.
CONCLUSION
Internet crimes have become very dangerous because victims are continuously hunted,
with little possibility of escape. Cyber bullying is one of the most critical
internet crimes, and research has demonstrated its serious impact on victims.
The system uses a CNN implementation built with Keras and helps in
achieving precise results. This can help users by preventing them from becoming
victims of the harsh consequences of cyber bullying.
Hence, compared to the existing model, our technique is expected to identify
cyber bullying more accurately.
REFERENCES
• Elaheh Raisi, Bert Huang, “Cyber bullying Identification Using Participant-Vocabulary
Consistency,” Virginia Tech, Blacksburg, VA, 2016.
• Nebrase Elmrabit, Feixiang Zhou, Fengyin Li, Huiyu Zhou, “Evaluation of Machine Learning
Algorithms for Anomaly Detection,” 2018.
• Bhavani Thuraisingham, “The Role of Artificial Intelligence and Cyber Security for Social
Media,” Computer Science Dept., The University of Texas at Dallas, Richardson, USA, 2020.
• Zaheer Abbass, Zain Ali, Mubashir Ali, Bilal Akbar, Ahsan Saleem, “A Framework to Predict
Social Crime through Twitter Tweets By Using Machine Learning,” Department of Computer Science,
University of Lahore, Gujrat Campus, Pakistan, 2020.
• Rahul Ramesh Dalvi, Sudhanshu Baliram Chavan, Aparna Halbe, “Detecting A Twitter Cyber
bullying Using Machine Learning,” Department of Information Technology, Sardar Patel Institute of
Technology, Mumbai, India, 2020.
THANK YOU
