Spam Detection for YouTube Comments using ML

Abstract: Online social media sites such as YouTube, Facebook and Twitter are very popular today. People turn to social
media to interact with other people, gain knowledge, share ideas, be entertained and stay informed about events
happening in the rest of the world. Among these sites, YouTube has emerged as the most popular website for sharing
and viewing video content. Because of this popularity, it has become a platform for spammers to distribute spam through
its comments. This is a concern because spam can lead to phishing attacks, whose target can be any user who clicks
a malicious link. YouTube itself has tackled the issue with very limited methods, which revolve around blocking
comments that contain links. Such methods have proven extremely ineffective, as spammers have found ways to bypass
these heuristics. Automatic filtering of comment spam on YouTube is a challenge because the messages are very short and
full of slang, symbols and abbreviations. We attempt to detect such comments by applying a conventional machine learning
algorithm, Naïve Bayes. The data is taken from the YouTube Spam dataset and classified using Naïve Bayes.

Index Terms— Classification Model, Machine Learning, Spam Filtering, YouTube Comments

I. INTRODUCTION

YouTube is an online video platform owned by Google. It is a video-sharing website that was launched in 2005 and is now the second most visited website in the world. About 300 hours of video are uploaded to YouTube every minute, and almost 5 billion videos are watched on YouTube every day. YouTube accounts for 37% of all mobile internet traffic. In 2014, the user “PewDiePie”, owner of the most subscribed channel on YouTube (nearly 40 million subscribers), disabled comments on his videos, claiming that most of the comments were spam and that there was no tool to deal with them [1].

The users of YouTube are known as channels. YouTube allows channels to upload, rate, share, add to favorites, report, and comment on videos, and to subscribe to other users. One of these features is the commenting system, which allows users to comment on videos. The comments represent the opinions of the users, who may be praising the owner of the video or expressing displeasure towards the video or its contributor. Malicious users exploit this commenting system as an opportunity to share promotional content, which is known as spam.

Spam comments consist of information that is irrelevant to the context of the uploaded video. Automated bots disguised as users often contribute to spam. These irrelevant or unsolicited messages aim to attack users by luring them into clicking links to malicious sites containing malware, phishing and scams. One intention of the spammer is to spread malware through the comment section, which exploits vulnerabilities in users’ machines. Another is to intercept money transactions and hijack credit card and banking information. Spammers also tend to ruin the content of web pages.

Many techniques for automatic spam filtering show degraded performance when dealing with YouTube comments. This is because such messages are usually very short and rife with idioms, slang, symbols, emoticons, and abbreviations, which make even tokenization a challenging task.

I.1 Types of

[1] “Improving Email Spam Detection Using Content Based Feature Engineering Approach,” 2016.
[2] Y. Yusof and O. H. Sadoon, “Detecting Video Spammers In Youtube Social Media,” no. 082, pp. 228–234, 2017.
[3] T. Stone, “Parameterization of Naïve Bayes for Spam Filtering,” 2003.
[4] R. Chowdury, N. M. Adnan, G. A. N. Mahmud, and R. M. Rahman, “A Data Mining Based Spam Detection System for YouTube,” pp. 373–378, 2013.
[5] S. Raschka, “Introduction and Theory,” pp. 1–20, 2014.
[6] P. Langley, W. Iba, and K. Thompson, “An analysis of Bayesian classifiers,” AAAI, vol. 90, pp. 223–228, 1992.
[7] A. McCallum and K. Nigam, “A comparison of event models for Naive Bayes text classification,” AAAI-98 Workshop on Learning for Text Categorization, pp. 41–48, 1998.
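
As a rough illustration of the tokenization difficulty noted in the Introduction and of the Naïve Bayes approach named in the Abstract, the following Python sketch tokenizes short, noisy comments and trains a small multinomial Naïve Bayes classifier with Laplace smoothing. It is not the authors' implementation: the names `TOKEN_RE`, `tokenize` and `NaiveBayes`, the `<url>` placeholder, the deliberately crude emoticon pattern, and the four toy comments are all assumptions made for illustration, and the toy data does not come from the YouTube Spam dataset.

```python
import math
import re
from collections import Counter

# Hypothetical token pattern: links first, then a few ASCII emoticons, then words.
TOKEN_RE = re.compile(
    r"https?://\S+|www\.\S+"       # links, matched whole
    r"|[:;=][-']?[()dp]"           # a handful of emoticons such as :) ;-) :D
    r"|[a-z0-9]+(?:'[a-z]+)?",     # words and numbers, with simple contractions
    re.IGNORECASE,
)

def tokenize(comment):
    """Split a comment into lowercase tokens; every link becomes '<url>'."""
    tokens = []
    for tok in TOKEN_RE.findall(comment):
        tok = tok.lower()
        if tok.startswith(("http://", "https://", "www.")):
            tok = "<url>"
        tokens.append(tok)
    return tokens

class NaiveBayes:
    """Multinomial Naive Bayes over token counts with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha          # smoothing constant
        self.word_counts = {}       # label -> Counter of token counts
        self.doc_counts = Counter() # label -> number of training comments
        self.vocab = set()

    def fit(self, comments, labels):
        for text, label in zip(comments, labels):
            toks = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(toks)
            self.vocab.update(toks)
        return self

    def predict(self, comment):
        # Tokens never seen in training are ignored.
        toks = [t for t in tokenize(comment) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            for t in toks:
                score += math.log((counts[t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data, invented for this sketch only.
comments = [
    "Check out my channel http://spam.example :D",
    "Subscribe to my channel for free gift cards www.spam.example",
    "Love this song, great video",
    "This video made my day :)",
]
labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(comments, labels)
```

Calling `model.predict("free gift cards, check out http://x.example")` then scores the comment under each class and returns the label with the higher log-probability. Mapping every link to a single `<url>` token is one possible design choice: it lets the classifier learn that links in general are indicative of spam instead of memorizing individual addresses.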
