
Mobile SMS Spam Detection using Machine Learning Techniques


Shah Jainam
B.Tech IT
Indus University
Ahmedabad, India
jainamshah.21.it@iite.indusuni.ac.in

Rhythm Patel
B.Tech IT
Indus University
Ahmedabad, India
patelrhythm.21.it@iite.indusuni.ac.in

Malav Raval
B.Tech IT
Indus University
Ahmedabad, India
malavraval.21.it@iite.indusuni.ac.in

Sumitsinh Rahevar
B.Tech IT
Indus University
Ahmedabad, India
sumitrahevar.21.it@iite.indusuni.ac.in

Author: Sejal Thakkar
Indus University
Ahmedabad, India
sejalthakkar.ce@indusuni.ac.in

Abstract— Unwanted texts, or spam SMS, can be annoying and occasionally harmful to users. This paper surveys studies on methods for detecting SMS spam. We examine and evaluate the methods, strategies, and algorithms they employed, as well as their benefits and drawbacks, evaluation metrics, and datasets, and we conclude with an assessment of the research findings. Unfortunately, none of the current research addresses the issues posed by shortened words and local content in SMS spam detection, even though SMS spam detection methods are more demanding than those used in other types of spam detection. A great deal of research in this area remains open, and this survey can serve as a guide for future study directions.

Keywords: Mobile SMS spam Detection

Introduction.
Short Message Service (SMS) is the most popular and widely used type of messaging. In many regions of the world, the term "SMS" refers to both the user activity and all forms of short text messaging. It has evolved into a platform for product announcements, product endorsements, banking updates, agricultural data, flight updates, and online offers. SMS is also used in direct marketing, sometimes known as SMS marketing. Users may occasionally find SMS marketing to be a nuisance. We refer to these kinds of messages as spam. Spam is defined as one or more unsolicited messages that people do not wish to receive, sent or posted as part of a larger collection of messages, all of which have substantially similar content.

Background and related work
The most popular and widely used messaging format is short message service (SMS). Many places throughout the world use the term "SMS" to refer to both the user activity and all forms of short text messaging. It has evolved into a platform for product announcements and endorsements, as well as for financial and agricultural updates, airline information, and online discounts. SMS is
also used in SMS marketing, which is a type of direct marketing. Users occasionally find SMS marketing troublesome. Spam SMS is the term for these types of messages. Spam is defined as one or more unsolicited messages that are posted or transmitted as part of a larger collection of messages that are all remarkably similar.

Naïve Bayes.
The Bayesian approach is a probabilistic one that begins with an initial belief, takes in some evidence, and then updates that belief. By using the Bayesian technique to analyze a word's frequency in spam and ham messages, one may determine how likely that word is to indicate spam [12].

Support Vector Machine (SVM).
Support vector machines are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each labeled as spam or legitimate SMS, the SVM training algorithm builds a model that assigns new examples to the spam group or the legitimate group. An SVM model represents the examples as points in space, mapped so that the examples of the separate categories are as far apart as possible [9].

Decision Trees.
A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including the likelihood of chance event outcomes. To decide whether a new message is spam or ham, one can use a decision tree [11].

Logistic Regression.
Logistic regression is a type of regression used in predictive analysis. The goal of logistic regression is to describe the data and to explain the relationship between a binary dependent variable and one or several independent variables, which may be ordinal, interval, or ratio-level.

Random Forest.
Random Forest is a trademarked term for an ensemble of decision trees. A random forest holds a collection of decision trees. Each tree casts a vote for the class that best fits the new object based on its attributes, and the forest chooses the class that receives the most votes over all of its trees.

Spam filtering process
Spam and ham messages that have been classified manually can be used as training input for spam filtering algorithms. The algorithm consists of the following steps.

Preprocessing.
As part of the data preprocessing, stop words and other unnecessary components are removed.

Tokenization.
Dividing the message into units based on words, characters, or other tokens. Tokenization can be achieved by various methods, including word tokenization, phrase tokenization, orthogonal sparse bigrams, and word or character N-grams.

Representation.
Transformation to attribute-value pairs.
Selection.
Rather than using every attribute-value pair, select the attributes that have the greatest effect on classification.

Training.
Train the algorithm using the selected attribute values.

Testing.
Test the trained algorithm on newly arrived messages.

Study Selection Procedure:
To find relevant studies, our primary search engine was Google Scholar. We gathered several articles from it, and many more came from the conferences and journals that we browsed through it, such as IEEE Xplore, IJCSI, ITJ, and ACM. Every journal and conference paper we selected contains numerous references. We also looked over the referenced studies and used several of them in our work. We used the "Related articles" and "Cited by" features of Google Scholar as part of our search process.

Result Analysis:
To get an overview of the spam detection field, we first manually searched Google using the term "spam detection." This led us to numerous papers related to SMS spam detection. Afterward, we narrowed our search to mobile SMS spam detection only. Thirteen papers published in various conferences and journals, pertaining solely to the topic of mobile SMS spam detection, were selected through our study selection process.
Table 1. SMS Spam Detection Dataset description

Dataset Description:
An initial dataset is required for various machine learning classification algorithms, and the dataset affects the results of those algorithms, so spam detection algorithms cannot function without one. We collected the distinct publicly available datasets that are used in various research projects. Table 1 [13] displays the dataset's link as well as a number of statistics, including the total numbers of spam and ham SMS messages.
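With a labeled dataset of this kind, the filtering steps described earlier (preprocessing, tokenization, representation, training, and testing) can be sketched end to end. The following is a minimal illustration, not the method of any surveyed paper. It assumes each record is a line of the form label, tab, message, which is the layout used by the SMS Spam Collection file [13]. The sample messages, the stop-word list, and the function names (parse, tokenize, train, classify) are invented for illustration, and a multinomial Naïve Bayes classifier with add-one smoothing stands in for whichever classifier is chosen. The attribute selection step is reduced here to stop-word removal.

```python
import math
import re
from collections import Counter

# Hypothetical sample in "label<TAB>message" form (invented for illustration).
SAMPLE = """\
ham\tAre we still meeting for lunch today
spam\tWIN a FREE prize now, call 0800 to claim
ham\tI will call you when I get home
spam\tFree entry! Claim your cash prize now
ham\tCan you send me the notes from class
spam\tUrgent! You have won a free holiday, call now
"""

# Illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "the", "to", "for", "you", "i", "we", "me", "your",
              "when", "from", "are", "have", "will", "can"}


def parse(raw):
    """Split raw text into (label, message) pairs, one per non-empty line."""
    rows = []
    for line in raw.splitlines():
        if line.strip():
            label, text = line.split("\t", 1)
            rows.append((label, text))
    return rows


def tokenize(text):
    """Preprocessing and word tokenization: lowercase, split, drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def train(rows):
    """Representation and training: per-class word counts (bag of words)."""
    doc_counts = Counter()
    word_counts = {"spam": Counter(), "ham": Counter()}
    for label, text in rows:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return doc_counts, word_counts, vocab


def classify(model, text):
    """Testing: pick the class with the higher log posterior score."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        score = math.log(doc_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


rows = parse(SAMPLE)
label_totals = Counter(label for label, _ in rows)  # spam/ham statistics
model = train(rows)

print(label_totals)
print(classify(model, "Claim your free prize"))
print(classify(model, "See you at lunch after class"))
```

On a real dataset, the rows would be read from the downloaded file instead of the inline sample, and a held-out portion would be kept aside for the testing step so that accuracy is measured on messages the classifier has not seen.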
Conclusion.
This paper presents the results of a thorough study of the literature on SMS spam detection. We selected thirteen research papers in this area and examined the methods they proposed, their benefits and drawbacks, and the difficulties they dealt with. We looked into their evaluation practices as well. We also presented information on the publicly accessible datasets that spam filtering methods require.

REFERENCES

[1] K. Yadav, S. K. Saha, P. Kumaraguru, and R. Kumra, “Take control of your SMSes: designing an usable spam SMS filtering system,” in 2012 IEEE 13th International Conference on Mobile Data Management. IEEE, 2012, pp. 352–355.

[2] S. J. Warade, P. A. Tijare, and S. N. Sawalkar, “An approach for SMS spam detection.”

[3] A. Narayan and P. Saxena, “The curse of 140 characters: evaluating the efficacy of SMS spam detection on Android,” in Proceedings of the Third ACM Workshop on Security and Privacy in Smartphones & Mobile Devices. ACM, 2013, pp. 33–42.

[4] A. S. Onashoga, O. O. Abayomi-Alli, A. S. Sodiya, and D. A. Ojo, “An adaptive and collaborative server side SMS spam filtering scheme using artificial immune system,” Information Security Journal: A Global Perspective, vol. 24, no. 46, pp. 133–145, 2015.

[5] J. W. Yoon, H. Kim, and J. H. Huh, “Hybrid spam filtering for mobile communication,” Computers & Security, vol. 29, no. 4, pp. 446–459, 2010.

[6] Q. Xu, E. W. Xiang, Q. Yang, J. Du, and J. Zhong, “SMS spam detection using noncontent features,” IEEE Intelligent Systems, vol. 27, no. 6, pp. 44–51, 2012.

[7] I. Ahmed, D. Guan, and T. C. Chung, “SMS classification based on naïve Bayes classifier and Apriori algorithm frequent itemset,” International Journal of Machine Learning and Computing, vol. 4, no. 2, p. 183, 2014.

[8] J. M. Gómez Hidalgo, G. C. Bringas, E. P. Sánz, and F. C. García, “Content based SMS spam filtering,” in Proceedings of the 2006 ACM Symposium on Document Engineering. ACM, 2006, pp. 107–114.

[9] https://en.wikipedia.org/wiki/Support_vector_machine [Last Accessed: 05-11-2016]

[10] https://en.wikipedia.org/wiki/Decision_tree [Last Accessed: 05-11-2016]

[11] http://en.wikipedia.org/wiki/K-

[12] http://fastml.com/bayesian-machine-learning/ [Last Accessed: 05-11-2016]

[13] SMS Spam Collection data set from UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
