Cyberbullying Detection and Classification Using Information Retrieval Algorithm
Sheeba Immanuvelrajkumar
Pondicherry Engineering College
All content following this page was uploaded by Sheeba Immanuvelrajkumar on 13 June 2018.
ABSTRACT
The use of social networking sites has increased rapidly in recent years, providing a platform to connect people all over the world and share their interests. However, social networking sites also provide opportunities for cyberbullying. Cyberbullying is harassing or insulting a person by sending messages of a hurting or threatening nature using electronic communication. Cyberbullying poses a significant threat to the physical and mental health of victims. Detecting cyberbullying and providing subsequent preventive measures are the main courses of action to combat it. A cyberbully detection system is proposed to identify and classify cyberbullying activities such as Flaming, Harassment, Racism and Terrorism in social networks. Cyberbully detection is done using the Levenshtein algorithm, and classification of cyberbully activity is carried out using a Naïve Bayes classifier.

Categories and Subject Descriptors
Social networking sites - Security, Information Retrieval algorithm.

General Terms
Bullying, Detection, Classification.

Keywords
Social Network, Cyberbullying, Levenshtein algorithm, Naïve Bayes classifier, Network security.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
ICARCSET '15, March 06-07, 2015, Unnao, India
Copyright 2015 ACM 978-1-4503-3441-9/15/03…$15.00
http://dx.doi.org/10.1145/2743065.2743085

1. INTRODUCTION
With the proliferation of the Internet, security is becoming an important concern. While Web 2.0 provides easy, interactive, anytime and anywhere access to online communities, it also provides a platform for cybercrimes like cyberbullying. Distressing cyberbullying experiences among young people have been reported internationally, drawing attention to its negative impact. In the USA, cyberbullying is increasing sharply and has officially been identified as a social threat. There is an urgent need to study cyberbullying in terms of its detection, prevention and mitigation.

Bullying as a form of social turmoil has occurred in various forms over the years, with the WWW and communication technologies being used to support deliberate, repeated and hostile behavior by an individual or group, in order to harm others[1]. Cyberbullying is defined as an intentional aggressive act carried out by a group or individual, using electronic forms of contact, repeatedly and over time, against a victim who cannot easily defend him or herself[2].

Recent research has shown that most teenagers experience cyberbullying during their online activities, including mobile phone usage[3], online gaming and social networking sites. As highlighted by the National Crime Prevention Council, approximately 50% of the youth in America are victimized by cyberbullying[4]. The implications of cyberbullying[5] become serious (suicidal attempts) when victims fail to cope with the emotional strain of abusive, threatening, humiliating and aggressive messages. The impact of cyberbullying is exacerbated by the fact that children are reluctant to share their predicament with adults, driven by the fear of losing their mobile phone and/or Internet access privileges[6]. The challenges in fighting cyberbullying include detecting online bullying when it occurs; reporting it to law enforcement agencies, Internet service providers and others; and identifying predators and their victims.

The Levenshtein distance is a metric for measuring the amount of difference between two sequences (i.e., an edit distance). The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, the allowable operations being adding, removing, or substituting a single character.

Naïve Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set.

Cyberbullying activity is growing as fast as social networks themselves, and it poses a significant threat to the mental and physical health of victims. Studies on the effects of bullying exist, but implementations that monitor social networks to detect cyberbullying activities are scarce.
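The edit-distance definition given above can be made concrete with a minimal dynamic-programming sketch; this is an illustrative implementation, not the one used in the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the edit distance between the current prefix of a
    # and b[:j]; only two rows of the DP table are kept in memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

# A bully may obfuscate a word ("idiot" -> "id1ot"); a small edit
# distance still flags it as a likely match against the word list.
print(levenshtein("idiot", "id1ot"))  # 1
```

A distance threshold (for example, at most 1 or 2 edits) would then decide whether a post token matches a known bullying word despite deliberate misspelling.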
Hence, the proposed system focuses on detecting the presence of cyberbullying activity in social networks using the Levenshtein algorithm, which helps authorities take action before many users become victims of cyberbullying.

2. RELATED WORK
In a recent study on cyberbullying detection, gender-specific features were used and users were categorized into male and female groups. This approach is limited to the gender feature alone.

In another study[7], NUM and NORM features were devised by assigning a severity level to a bad-words list (nosewaring.com); NUM is a count and NORM a normalization of the bad words, respectively. The dataset consisted of 3,915 posted messages crawled from the website Formspring.me. It achieved only 58.5% accuracy, which is very low.

In [8], a system was proposed that allows OSN users direct control over the messages posted on their walls. This is done using a flexible rule-based system that lets users customize the filtering criteria applied to their walls, while a machine-learning-based classifier automatically labels messages using content-based filtering. This approach is incapable of capturing more complex relationships at a deeper semantic level.

In [9], a research work by the Massachusetts Institute of Technology, a system was developed to detect cyberbullying through textual context in YouTube video comments. The system classifies each comment into a range of sensitive topics, such as sexuality, culture, intelligence, and physical attributes, and determines which topic applies. The system shows less precise classification outcomes and increased false positives.

In [10], a baseline text-mining system using a bag-of-words approach was examined and improved by including sentiment and contextual features. Even with those models, support vector machine learners produce a recall level of only 61.9%.

In [11], bullying traces are identified using a variety of natural language processing techniques, and both online and offline instances of bullying are traced. To identify bullying, a sentiment analysis system is used, and Latent Dirichlet Allocation identifies topics. In this method, the instances of bullying are not accurately detected.

Other interesting work[12] in this area performed harassment detection on comment and chat datasets provided by a Content Analysis Workshop (CAW). Various features were generated, including TF-IDF as local features; a sentiment feature covering second-person and other pronouns like "you", "yourself", "him", "himself" and foul words; and contextual features[13]. Increased false positives are its limitation. There are also some software products available for fighting cyberbullying, such as Bsecure, Cyber Patrol, eBlaster, IamBigBrother and Kidswatch[14][15][16][17][18].

Generally, most existing systems focus on the effects after a cyberbullying incident, and there is no system for online cyberbullying detection. The proposed system detects cyberbullying activities and classifies them as Flaming, Harassment, Racism and Terrorism, which helps prevent victims from facing the effects of cyberbullying and enables necessary actions such as blocking, law enforcement involvement or corresponding legal action.

3. PROPOSED FRAMEWORK
In the proposed framework shown in Figure 1, the process of detecting cyberbully activities begins with an input dataset from a social network. The input is text conversation collected from Formspring.me. The input is given to data pre-processing, which is applied to improve the quality of the research data and subsequent analytical steps; this includes removing stop words, extra characters and hyperlinks. After pre-processing, the data is given to feature extraction.

Feature extraction is done to obtain features like noun, adjective and pronoun from the text, along with statistics on word occurrence (frequency) in the text.

The cyberbullying words are given as a training dataset. Against this training dataset, the preprocessed online social network conversation is tested for the presence of bullying words. The Levenshtein distance algorithm detects the cyberbully words present in the conversation and displays them. For cyberbully classification, a Naïve Bayes classifier is used. A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions.

Figure 1. Proposed Framework of Cyberbully Detection and Classification system.

In the proposed framework for detecting cyberbully activities, the following steps are included:
• Input
• Data Preprocessing
• Feature Extraction
• Cyberbully Detection
• Naïve Bayes Classifier

3.1 Input
Textual conversation collected from social networks (Formspring.me, Myspace.com) is given as input.

3.2 Data Preprocessing
Data preprocessing is an important phase in representing data in feature space for the classifiers. Social network data are noisy, so preprocessing is applied to improve the quality of the research data and subsequent analytical steps; this includes removing stop words.
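A minimal sketch of this pre-processing step follows. The stop-word list and cleaning rules here are illustrative assumptions; the paper does not publish its exact list.

```python
import re

# Illustrative stop-word list; the paper's actual list is not published.
STOP_WORDS = {"to", "i", "has", "the", "be", "or", "a", "is", "and"}

def preprocess(post: str) -> list[str]:
    """Clean one social-network post: drop hyperlinks, strip extra
    (non-alphabetic) characters, lowercase, and remove stop words."""
    post = re.sub(r"https?://\S+", " ", post)   # remove hyperlinks
    post = re.sub(r"[^a-zA-Z\s]", " ", post)    # remove extra characters
    tokens = post.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("Check this http://example.com ... the POST!!"))
# ['check', 'this', 'post']
```

The surviving tokens are what the feature-extraction stage would see.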
Stop words add little semantic value to a sentence. Stop words are usually words like "to", "I", "has", "the", "be", "or", etc. They bloat memory space and processing time without providing any extra value.

3.3 Feature Extraction

3.3.1 Noun, Adjective and Pronoun Extraction
Parsing nouns, adjectives and pronouns involves two steps: part-of-speech tagging, and extracting the nouns, adjectives and pronouns from the tagged output. Part-of-speech tagging (POS tagging or POST), also called grammatical tagging, is the process of marking up a word in a text as corresponding to a particular part of speech.

Part-of-speech tagging is carried out using the package provided by Stanford Natural Language Processing.

3.3.2 Frequency Extraction
The frequency extraction module involves extracting the occurrence count of the words parsed in the parser module. Using the frequency of each word, the class probability is calculated using the Naïve Bayes classifier:

classify(f1, …, fn) = argmaxc P(C = c) ∏i P(Fi = fi | C = c)

Where,
C (upper case) represents the document class and c (lower case) is one of the possible classes. In this case the possible classes are Flaming, Insult, Harassment, and Terrorism.
F1, …, Fn are called features and f1, …, fn are the values of the corresponding features. Features are the words occurring in the document, and their value is the number of occurrences.
argmaxc is the argmax function, meaning "the class c with the highest value of the following function."
P(C = c) is the probability that a given bullying word belongs to class c. Assume this is the same for all classes and hence ignore it, as it is just a constant.
P(Fi = fi | C = c) is the probability that the word occurs in a document of class c.

3.4 Cyberbully Detection
Cyberbully detection is carried out using the Levenshtein distance algorithm.

Levenshtein distance algorithm: given two strings a and b on an alphabet Σ, the edit distance d(a, b) is the minimum-weight series of edit operations that transforms a into b. The operations involved in the edit distance algorithm are:
• Insertion of a single symbol. If a = uv, then inserting the symbol x produces uxv. This can also be denoted ε→x, using ε to denote the empty string.
• Deletion of a single symbol changes uxv to uv (x→ε).
• Substitution of a single symbol x by a symbol y ≠ x changes uxv to uyv (x→y).

4. EXPERIMENTAL RESULTS

4.1 Dataset
For this work, the datasets described below are considered for the experiment on cyberbullying detection; they are available from the Workshop on Content Analysis for the Web 2.0 [10]. The dataset contains data collected from two different social networks: Formspring.me and MySpace. Formspring.me is a discussion-based site on which users broadcast their messages; about 500 posts were selected from it at random. MySpace is a popular social networking website, and 600 posts were randomly selected from it. Datasets were provided in the form of a text document for each conversation set between two users.

4.2 Evaluation Parameters
Precision: the number of correctly identified true bullying posts out of all retrieved bullying posts.
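The results below are reported as F-measure, the harmonic mean of precision (as defined above) and recall. A small sketch with hypothetical counts, not taken from the paper's data:

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for illustration only:
# 50 posts flagged as bullying, of which 40 truly are -> precision 0.80;
# 48 actual bullying posts in total, 40 of them flagged -> recall ~0.83.
precision = 40 / 50
recall = 40 / 48
print(round(100 * f_measure(precision, recall), 1))  # 81.6
```

Because it is a harmonic mean, the F-measure is high only when precision and recall are both high, which is why it is the single figure reported per experiment.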
No. of cyberbully posts | F-measure (Formspring.me) | F-measure (Myspace.com)
100 | 79 | 77
200 | 83 | 84
300 | 71 | 74
400 | 94 | 92
500 | 87 | 89
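To make the classification rule of Section 3.3.2 concrete, the following toy sketch applies the argmax rule with Laplace-smoothed word likelihoods. The class names and word counts are invented for illustration, and the uniform prior P(C = c) is dropped as a constant, as described in the paper.

```python
import math
from collections import Counter

# Toy per-class word counts; real counts would come from the labelled
# Formspring.me/MySpace training data described in Section 4.1.
class_word_counts = {
    "flaming":    Counter({"stupid": 5, "shut": 3, "loser": 2}),
    "harassment": Counter({"watching": 4, "follow": 3, "stupid": 1}),
}

def classify(words):
    """argmaxc of prod_i P(Fi = fi | C = c), using Laplace smoothing and
    log-probabilities for numerical stability (uniform prior omitted)."""
    vocab = {w for counts in class_word_counts.values() for w in counts}
    best_class, best_score = None, float("-inf")
    for cls, counts in class_word_counts.items():
        total = sum(counts.values())
        # Sum of log-likelihoods over the words of the post.
        score = sum(
            math.log((counts[w] + 1) / (total + len(vocab)))
            for w in words
        )
        if score > best_score:
            best_class, best_score = cls, score
    return best_class

print(classify(["stupid", "loser"]))  # flaming
```

Each word of a pre-processed post contributes one smoothed likelihood term, and the class with the largest product (sum of logs) is returned.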