
MAJOR PROJECT

ON

SPAM MESSAGE DETECTION

BY

SHAIK SOUKAT ALI



CONTENTS

Chapter Name

1. Abstract
2. Introduction
3. Requirements
4. Flow Chart
5. Introduction to the Machine Learning Algorithms
   K-Nearest Neighbours
   Support Vector Machines (SVM)
   Random Forest
   Naïve Bayes
      Step 1: About Bayes Theorem
      Step 2: Understand data
      Step 3: Bag of Words
      Step 4: Training and testing
      Step 5: Implementing NB algorithm
      Step 6: Evaluate model
   Adaboost
   NLTK
6. Python Code ScreenShot
7. Result ScreenShot
8. Conclusion

ABSTRACT

Short Message Service (SMS), which allows users to send and receive
messages, has become a multi-billion dollar industry as mobile phone
usage has soared. The cost of messaging services has also decreased,
which has led to an increase in the amount of spam that is delivered to
mobile devices. Up to 40% of SMS messages in some regions of Asia were
spam in 2012. Because of the short length of messages and their limited
features, the lack of reliable databases of SMS spam, and the informal
language of text messages, existing email filtering algorithms may not
perform well on SMS. In this project, a real SMS spam database from the
ML repository is used. Following preprocessing and feature extraction,
several machine learning methods are applied to the database. The results
are then compared, and the best algorithm for text message spam filtering
is presented. The results show that the best model in this study has a
lower total error rate than the best model in the original research that
used this dataset. The following algorithms are used to categorise spam
in mobile device communication: decision trees, K-Nearest Neighbour, and
logistic regression. The SMS spam collection dataset is used to test the
approach.

INTRODUCTION

Chat technology is simply one aspect of SMS. SMS technology was made
possible by an accepted international standard. Spam is the term for the
abuse of electronic messaging services to send large numbers of unwanted
messages indiscriminately. Even though email spam is the most well-known
example, similar abuses in other media are frequently referred to as "spam."
SMS spam, in this sense, is unsolicited bulk communication that usually
carries some commercial interest and is quite similar to email spam.
Phishing URLs and business promotions are spread via SMS spam. Commercial
spammers use malware to transmit SMS spam since most countries outlaw the
practice. Since it is challenging to pinpoint the origin of spam when it is
sent from a hacked computer, spammers take less of a risk by doing so.
Only letters, numbers, and a few symbols are permitted in SMS messages. A
brief glance at the messages reveals a clear pattern: almost all spam
messages direct users to call a phone number or go to a website. A simple
SQL query on the spam yields results that reveal this trend. Because of the
low cost and large bandwidth of the SMS network, SMS spam is widely used.
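
The pattern described above can also be checked with a few lines of Python
rather than SQL. This is only a minimal sketch, not the query used in the
project: the two regular expressions and the sample messages are illustrative
assumptions, and real spam uses many more phone and URL formats.

```python
import re

# Assumed, deliberately simple patterns (not from the project).
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"\b\d{5,}\b")  # any run of 5+ digits counts as a phone number


def has_call_to_action(message: str) -> bool:
    """Return True if the message contains a URL or a phone-like number."""
    return bool(URL_RE.search(message) or PHONE_RE.search(message))


def share_with_call_to_action(messages):
    """Fraction of messages that point to a website or a phone number."""
    if not messages:
        return 0.0
    hits = sum(has_call_to_action(m) for m in messages)
    return hits / len(messages)


# Hypothetical examples, not taken from the dataset.
spam_samples = [
    "WINNER! Claim your prize now, call 09061701461",
    "Free ringtones at www.example.com",
    "You have won cash, reply YES",
]
print(share_with_call_to_action(spam_samples))
```

Running the same function over the spam and ham subsets separately would show
how strongly this pattern separates the two classes.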

Every time a user receives an SMS spam message, their mobile phone notifies
them of the message's arrival. The consumer will be unhappy when they
realise the message is unwanted, and SMS spam uses up some of the storage
space on their mobile device.

There are several notable differences between emails and text messages. In
contrast to email, for which a range of sizable datasets is available, real
databases of SMS spam are quite scarce. The number of features that can be
used to classify text messages is also considerably smaller than for emails
because text messages are shorter. There is also no header in this case. In
addition, text messages use significantly less formal language than emails do
and are full of abbreviations. All of these factors could lead to a
significant decline in the effectiveness of the major email spam filtering
algorithms when they are applied to short text messages.

The goal of this project is to apply ML algorithms to the problem of
classifying SMS spam, compare their results to gain insight and further
explore the problem, and create a programme based on one of these
approaches that can accurately filter SMS spam. Feature extraction and
basic analysis of the data are first performed in MATLAB, and then several
machine learning algorithms are applied using the scikit-learn module in
Python.

In the third installment of a three-part series, we examine the spam-or-ham
classifier from the standpoint of AI concepts and experiment with several
classification algorithms, comparing them on performance criteria in a
web-based Python environment.

Spam detection is one application of machine learning in modern internet
technology. Service providers have integrated spam detection algorithms that
label such content as "Junk Mail" when it is received.
In this project, the Naive Bayes approach is used to create a model that can
classify a dataset based on the training data we provide. The words "free,"
"win," "winner," "cash," "prize," and similar expressions are frequently used
in these messages because they are meant to catch your attention and, in a
sense, persuade you to open them. Exclamation marks and writing in all
capitals are other characteristics of spam messages. Since spam texts are
often pretty evident to the receiver, we want to train a model to identify
them for us.
Finding spam messages is a binary classification problem, since a message is
categorised as either "spam" or "not spam" and nothing else. It is also a
supervised learning problem, because we will be giving the model a labelled
dataset from which it can learn.

REQUIREMENTS

For Hardware

Processor: 1.5 GHz or more
RAM: 4 GB or more
HDD: at least 100 GB

For Software

Python 3 IDLE or the Anaconda Jupyter Notebook

Flow Chart

INTRO TO ALGORITHM

KNN
K-Nearest Neighbor is a straightforward instance-based learning
technique that can be used to solve classification problems. According to
this method, a test sample's label is predicted by a majority vote of its
k nearest neighbours.

k     Overall error (%)   Spam caught (%)   Blocked ham (%)
3     3.11                82.5              0.35
12    3.25                86.2              0.42
22    2.91                79.4              0.41
52    3.25                78.6              0.34
90    4.12                69.5              0.17
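
The voting rule behind KNN can be sketched in plain Python. The example below
is an illustration only: the two features (message length and number of
digits), the sample points, and the helper name knn_predict are assumptions
made for this sketch, not the features or code used to produce the table above.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest neighbours."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D features: (message length, number of digits).
train_points = [(150, 11), (140, 9), (160, 12), (30, 0), (45, 1), (25, 0)]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

print(knn_predict(train_points, train_labels, (145, 10), k=3))  # spam
print(knn_predict(train_points, train_labels, (35, 0), k=3))    # ham
```

A larger k smooths the decision over more neighbours, which is the parameter
varied in the first column of the table.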

Support Vector Machine


A support vector machine is applied to the dataset. The results with various
kernels are shown in Table II for a 10-fold cross validation. The table
demonstrates that the linear kernel outperforms the alternative mappings. The
error rate decreases as the degree of the polynomial kernel is raised from two
to three, but it does not decrease further as the degree is raised higher.
Another kernel applied to the dataset is the radial basis function (RBF). The
RBF kernel for two samples x and x' is given by the following equation, where
σ is the kernel width:

K(x, x') = exp(-||x - x'||^2 / (2σ^2))

Table II: SVM with various kernels (10-fold cross validation)

Kernel function         Overall error (%)   Spam caught (%)   Blocked ham (%)
Linear                  1.19                94.1              0.45
Degree-2 polynomial     2.04                86.3              0.23
Degree-3 polynomial     1.67                90.4              0.47
Degree-4 polynomial     2.01                92.45             0.65
Radial basis function   23.16               79.6              0.35
Sigmoid                 22.4                0                 0
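
A kernel comparison of this kind can be sketched with scikit-learn. The code
below is a hedged illustration, not the project's script: it uses synthetic
data from make_classification in place of the SMS features and leaves every
SVC hyperparameter at its scikit-learn default, so its numbers will not match
the table above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 200 samples, 20 numeric features.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Kernels named in the table; hyperparameters are scikit-learn defaults.
kernels = {
    "linear": SVC(kernel="linear"),
    "degree-2 polynomial": SVC(kernel="poly", degree=2),
    "degree-3 polynomial": SVC(kernel="poly", degree=3),
    "RBF": SVC(kernel="rbf"),
    "sigmoid": SVC(kernel="sigmoid"),
}

errors = {}
for name, model in kernels.items():
    # 10-fold cross-validated accuracy, converted to an error rate in percent.
    accuracy = cross_val_score(model, X, y, cv=10).mean()
    errors[name] = 100.0 * (1.0 - accuracy)
    print(f"{name:22s} overall error: {errors[name]:.2f}%")
```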

We observe that the text message's character count is a very helpful feature
for categorising spam. When features are ordered based on the mutual
information criterion, this feature has the highest mutual information with
the target labels. Also, although text messages with lengths below a specific
threshold are normally hams, they could be mistakenly labelled as spam because
they contain tokens that correlate with spam. This is shown when looking at
the samples that were improperly classified.
The results show that SVM with different kernels offers no accuracy advantage
over the naive Bayes algorithm, despite the model being more complex and
taking longer to train on the data.
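
The mutual information criterion mentioned above can be illustrated for a
single discrete feature. This is a minimal sketch under stated assumptions:
the message lengths, the labels, and the 100-character threshold are
hypothetical, and the length is reduced to a binary "long message" feature to
keep the calculation simple.

```python
import math
from collections import Counter


def mutual_information(feature_values, labels):
    """Mutual information (in nats) between a discrete feature and the labels."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    feature_counts = Counter(feature_values)
    label_counts = Counter(labels)
    mi = 0.0
    for (f, y), count in joint.items():
        p_joint = count / n
        p_f = feature_counts[f] / n
        p_y = label_counts[y] / n
        mi += p_joint * math.log(p_joint / (p_f * p_y))
    return mi


# Hypothetical messages and an assumed length threshold of 100 characters.
lengths = [155, 148, 160, 32, 41, 27, 120, 38]
labels = ["spam", "spam", "spam", "ham", "ham", "ham", "spam", "ham"]
is_long = [length > 100 for length in lengths]

print(round(mutual_information(is_long, labels), 3))  # 0.693
```

In this toy data the feature separates the classes perfectly, so the score
reaches its maximum of ln 2 for a balanced binary label; a feature that is
independent of the label would score 0.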

5.1 Random Forest


Random forest is an ensemble averaging technique for
classification. The forest is a group of decision trees, each
built from a bootstrap sample of the training set. When a node is
split during the construction of a decision tree, the split that
is chosen is the best split among a random subset of the
features. A single model's bias will increase as a result;
however, averaging lowers the variance, which can compensate
for the increase in bias. As a result, a better model is created.
The scikit-learn Python library's random forest implementation,
which averages the probabilistic predictions, is used in this
study. For this method, two different numbers of estimators are
simulated. With 12 estimators, the overall error is 1.91%, the
spam caught (SC) is 86.6%, and the blocked ham (BH) is 0.71%.
With 90 estimators, the overall error is 1.41%, the SC is 92.2%,
and the BH is 0.52%. We notice that, when compared to the naïve
Bayes algorithm, performance does not improve despite the
model's increased complexity.
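
The two forest sizes can be tried with scikit-learn's RandomForestClassifier.
The sketch below is illustrative only: it runs on synthetic data from
make_classification rather than the SMS features, and the random_state values
are arbitrary choices made so the run is repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), not the SMS dataset.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for n_estimators in (12, 90):  # the two forest sizes mentioned in the text
    forest = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    forest.fit(X_train, y_train)
    # predict_proba returns the class probabilities averaged over all trees.
    probabilities = forest.predict_proba(X_test)
    error = 100.0 * (1.0 - forest.score(X_test, y_test))
    results[n_estimators] = (probabilities, error)
    print(f"{n_estimators} estimators: overall error {error:.2f}%")
```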

Naïve Bayes Algorithm

Step 1: About Bayes Theorem


Bayes' theorem is one of the earliest probabilistic inference
algorithms. It was developed by Reverend Bayes (who used it, no
less, to try to infer the existence of God) and still works
incredibly well in some situations. To understand this theorem,
an example is recommended. Consider yourself a Secret Service
agent tasked with protecting the Democratic presidential
candidate as he or she delivers a campaign speech. Your task is
challenging, and you must always be on guard for threats because
it is a public event that is open to everyone. Consequently, a
reasonable place to start is by giving each person a distinct
threat level. Based on a person's characteristics, such as their
age, sex, and other minor details like whether or not they are
carrying a bag or seem tense, you can judge whether they pose a
threat.
If a person ticks all the boxes to the point where they cross
your threshold of doubt, you can take action and have them
removed from the area. Bayes' theorem works in the same way: we
compute the probability of an event (a person being a threat)
based on the probabilities of certain related events (age, sex,
carrying a bag, nervousness, and so on).
The independence of these features from one another is
something to take into account. For instance, if a child exhibits
signs of anxiety throughout the event, the likelihood that they
pose a threat is lower than, say, if it were a grown man. To
clarify, age AND anxiousness are the two characteristics we are
taking into account here. If we examine each of these
characteristics separately, we might create a model that marks
EVERYONE who exhibits anxiety as a possible threat. But given
the likelihood that any children present at the event will be
anxious, it is possible that we will get a lot of false positives.

Thus, by taking a person's age into account in addition to the "nervousness"
aspect, we would undoubtedly reach a more accurate conclusion regarding who
poses a threat and who does not.
The "naive" part of the algorithm is that it assumes each feature is
independent of the others, which may not always be the case and may therefore
influence the verdict.
In essence, Bayes' theorem determines the likelihood that an event will occur
based on the probability distributions of a number of other events; in this
case, the likelihood that a message is spam. Later in the project, we will go
into how Bayes' theorem operates, but first, let's examine the data we will be
using.
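
For a single word, Bayes' theorem reads
P(spam | word) = P(word | spam) P(spam) / P(word). The sketch below works
through that formula and then applies the "naive" independence assumption to
several words at once. All probabilities in it are hypothetical values chosen
for illustration, and the function names are not from the project code.

```python
def posterior_spam(p_spam, p_word_given_spam, p_word_given_ham):
    """P(spam | word) from Bayes' theorem."""
    p_ham = 1.0 - p_spam
    # Total probability of seeing the word in any message.
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word


def naive_posterior_spam(p_spam, likelihoods):
    """P(spam | words), treating the words as independent given the class.

    likelihoods: list of (P(word | spam), P(word | ham)) pairs.
    """
    spam_score = p_spam
    ham_score = 1.0 - p_spam
    for p_w_spam, p_w_ham in likelihoods:
        spam_score *= p_w_spam
        ham_score *= p_w_ham
    return spam_score / (spam_score + ham_score)


# Hypothetical numbers: 20% of messages are spam; "free" appears in 30% of
# spam and 1% of ham; "prize" appears in 20% of spam and 2% of ham.
print(round(posterior_spam(0.2, 0.3, 0.01), 3))  # 0.882
print(round(naive_posterior_spam(0.2, [(0.3, 0.01), (0.2, 0.02)]), 3))  # 0.987
```

Seeing both words together pushes the posterior higher than either word alone,
which is the effect the independence assumption produces by multiplying the
per-word likelihoods.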

Step 2: Understand data

Step 3: Bag of Words

We have a substantial set of text data (5580 rows of data). Email and SMS
messages are usually text-heavy, yet the majority of machine learning
algorithms require numerical data as input.

In this part, we'd like to introduce the Bag of Words (BoW) concept, which is
a term for problems that involve processing a "bag of words", that is, a
collection of text data. BoW's fundamental idea is to count the occurrences of
each word in a given body of text. The order in which the words appear is
irrelevant under the BoW concept, which treats each word separately.
We can turn a collection of documents into a matrix using a technique we'll
cover later, where each document is a row, each word or token is a column, and
each cell holds the frequency with which that word or token appears in that
document.
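
The document-by-word matrix described above can be built by hand in a few
lines. This is a from-scratch sketch for illustration: the four sample
documents and the simple tokenizer are assumptions, and scikit-learn's
CountVectorizer applies its own tokenization rules that differ in detail.

```python
from collections import Counter

# Hypothetical documents, not taken from the dataset.
documents = [
    "Hello, how are you!",
    "Win money, win from home.",
    "Call me now.",
    "Hello, Call hello you tomorrow?",
]


def tokenize(text):
    """Lowercase the text and keep only alphanumeric word tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()


# One Counter per document, then a sorted vocabulary shared by all documents.
counts = [Counter(tokenize(doc)) for doc in documents]
vocabulary = sorted(set().union(*counts))

# Rows are documents, columns are vocabulary words, cells are frequencies.
matrix = [[doc_counts[word] for word in vocabulary] for doc_counts in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Word order is lost in this representation: the second document contributes a
count of 2 in the "win" column regardless of where the word appears.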

Step 4: Training and testing

We can return to our dataset and continue our analysis now that we know how to
handle the Bag of Words problem. So that we can test our model later, we first
divide our dataset into a training set and a testing set.

After dividing the data, our next goal is to carry out Step 3's procedure and
convert our data into the desired Bag of Words matrix format. We will use
CountVectorizer() to accomplish this. Here, there are two steps to think
about:

First, fit CountVectorizer() on the training data (X_train) and return the
matrix. Second, transform the testing data (X_test) with the fitted
vectorizer to return the matrix.

X_test is our testing data for the "sms message" column. Once it has been
transformed into a matrix, we will use it to make predictions. Then, in a
subsequent step, we will compare those predictions with the true labels.
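
The split and the vectorization can be sketched as follows. The eight messages
and their labels are a hypothetical mini-dataset, and the test_size and
random_state values are arbitrary; only train_test_split and CountVectorizer
themselves are the scikit-learn tools named in the text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical mini-dataset; label 1 = spam, 0 = ham.
messages = [
    "Free entry to win a cash prize",
    "Are we still meeting for lunch today",
    "Winner! Claim your free prize now",
    "Can you call me when you get home",
    "Urgent! You have won cash, call now",
    "I will be late, see you tonight",
    "Win a free phone, text WIN now",
    "Thanks for the help yesterday",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=1
)

count_vector = CountVectorizer()
# Learn the vocabulary from the training messages only, then count words.
training_data = count_vector.fit_transform(X_train)
# Reuse that vocabulary for the test messages; unseen words are ignored.
testing_data = count_vector.transform(X_test)

print(training_data.shape, testing_data.shape)
```

Fitting on the training set alone keeps information from the test messages out
of the vocabulary, so the later evaluation is not biased by it.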

Step 5: Implementing NB algorithm

We will use scikit-learn's naive Bayes implementation to produce
predictions on our dataset for SMS spam detection.

In particular, we will apply the multinomial Naive Bayes
implementation. This classifier is appropriate for classification
with discrete features, and it accepts integer word counts as
input.
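
A minimal sketch of this step is shown below. The training and test messages
are hypothetical examples written for this illustration, and alpha=1.0 is
simply scikit-learn's default smoothing value; the project's own code and data
may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training messages; label 1 = spam, 0 = ham.
train_messages = [
    "Free entry to win a cash prize",
    "Winner! Claim your free prize now",
    "Urgent! You have won cash, call now",
    "Are we still meeting for lunch today",
    "Can you call me when you get home",
    "I will be late, see you tonight",
]
train_labels = [1, 1, 1, 0, 0, 0]

count_vector = CountVectorizer()
training_data = count_vector.fit_transform(train_messages)

# alpha=1.0 is Laplace (add-one) smoothing, the scikit-learn default.
naive_bayes = MultinomialNB(alpha=1.0)
naive_bayes.fit(training_data, train_labels)

test_messages = ["Claim your free cash prize", "See you at lunch"]
predictions = naive_bayes.predict(count_vector.transform(test_messages))
print(predictions)  # [1 0]
```

Smoothing matters here because words such as "lunch" never occur in the spam
examples; without it, a single unseen word would force a class probability to
zero.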

Step 6: Evaluate model


Now that we have made predictions on our test set, our next goal is to
evaluate how well our model is performing. There are several ways to do
this, so let's first quickly go over them.

Accuracy measures how frequently the classifier makes the correct
prediction. It is the ratio of the number of correct predictions to the
total number of predictions.

Precision tells us what proportion of the messages we classified as spam
actually were spam. It is the ratio of true positives (messages classified
as spam that actually are spam) to all positives (all messages classified as
spam, regardless of whether that classification was correct), in other
words TP / (TP + FP).

Recall (sensitivity) tells us what proportion of the messages that actually
were spam we classified as spam. It is the ratio of true positives to all
messages that actually were spam, in other words TP / (TP + FN).

For classification problems with skewed class distributions, such as ours,
accuracy by itself is not a very good metric. For example, if we had 100 text
messages and only two of them were spam, a classifier that labelled every
message as not spam would still be 98% accurate while catching no spam at all.
In such cases, precision and recall are more informative.
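
The three metrics can be computed directly from the confusion counts. The
sketch below uses the skewed example from the text (100 messages, two of them
spam); the two prediction lists are hypothetical classifiers invented for the
illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) with `positive` as the spam label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def evaluate(y_true, y_pred):
    """Return (accuracy, precision, recall); undefined ratios are reported as 0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# 100 messages: the first two are spam (1), the other 98 are ham (0).
y_true = [1, 1] + [0] * 98

all_ham = [0] * 100                   # never predicts spam
one_hit = [1, 0] + [1] + [0] * 97     # catches one spam, flags one ham

print(evaluate(y_true, all_ham))  # (0.98, 0.0, 0.0)
print(evaluate(y_true, one_hit))  # (0.98, 0.5, 0.5)
```

Both classifiers reach the same accuracy, but precision and recall separate
the one that catches no spam from the one that catches half of it.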

Adaboost
AdaBoost (adaptive boosting) is an ensemble strategy that fits classifiers one
at a time, adjusting each one to account for examples that were misclassified
by prior classifiers. Even if the classifiers employed are only slightly
better than random guessing, the final combined model will be improved.
AdaBoost can be combined with other learning algorithms to form the ensemble.

Weights are assigned to the training samples at each AdaBoost iteration.
These weights are distributed uniformly prior to the initial iteration.
Following that, the weights of samples that the current model classified
incorrectly are increased, and the weights of samples that it classified
correctly are decreased. This means that the new predictor concentrates on
the weaknesses of the previous classifiers.
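
One round of this reweighting can be written out explicitly. The sketch below
follows the classic discrete AdaBoost update for binary labels, which is an
assumption about the variant being described; scikit-learn's AdaBoostClassifier
differs in detail, and the ten samples with two mistakes are hypothetical.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One AdaBoost round: return (alpha, new_weights).

    weights: current sample weights, summing to 1.
    misclassified: booleans, True where the current weak classifier was wrong.
    """
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    if not 0.0 < error < 0.5:
        raise ValueError("weak classifier must have weighted error in (0, 0.5)")
    # Vote of this classifier in the final ensemble.
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Raise the weights of misclassified samples, lower the rest.
    updated = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


n = 10
weights = [1.0 / n] * n                      # uniform before the first iteration
misclassified = [True, True] + [False] * 8   # hypothetical: 2 of 10 samples wrong

alpha, new_weights = adaboost_reweight(weights, misclassified)
print(round(alpha, 3))                       # 0.693
print([round(w, 4) for w in new_weights])    # 0.25 for the two errors, 0.0625 otherwise
```

After the update, the two misclassified samples carry half of the total
weight, so the next weak classifier is pushed to get them right.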

Model                         SC (%)   BH (%)   Accuracy (%)
NB                            95.35    0.52     97.73
SVM                           93.36    0.63     89.63
KNN                           83.87    0.32     97.56
Random forest                 91.62    0.63     97.52
Adaboost with decision tree   93.21    0.45     98.56

We implemented AdaBoost with decision trees using the scikit-learn module. In
the simulation with 12 estimators, the total error rate is 3.1%, the SC is
86.7%, and the BH is 0.64%. When the number of estimators is increased to 90,
these figures change to 3.41%, 93.6%, and 0.71%. Similarly, the naïve Bayes
algorithm performs better than AdaBoost with decision trees, even though the
latter is a significantly more complex model.

5.2 NLTK

NLTK is a leading platform for building Python programs that work with human
language data. It offers straightforward interfaces to more than 50 corpora
and lexical resources, including WordNet, as well as a collection of text
processing libraries for classification, tokenization, stemming, tagging,
parsing, and semantic reasoning, wrappers for powerful NLP libraries, and an
active discussion forum.

Thanks to a hands-on approach covering programming foundations along with
topics in computational linguistics, plus full API documentation, NLTK is
suitable for linguists, engineers, students, educators, academics, and
industry users alike. Windows, Mac OS X, and Linux all support NLTK. Best of
all, the project is free, open source, and community-driven.

Python code ScreenShot:



Result ScreenShot:

Conclusion

The outcomes of the various classification models run on the SMS Spam dataset
have been presented. The simulation results show that the best classifiers for
SMS spam detection are multinomial naive Bayes with Laplace smoothing and SVM
with a linear kernel. In the original research that used this dataset, the
SVM-based classifier had the highest overall accuracy (92.64%), making it the
best one. With an overall accuracy of 92.60%, enhanced naive Bayes is the next
best classifier in that research. Compared with the outcome of that earlier
research, our classifier cuts the overall error in half. The factors that led
to this improvement include the addition of significant features such as the
number of characters in messages, the addition of specific thresholds for the
length, and the analysis of learning curves and misclassified data.
The capability of Naive Bayes to handle an exceptionally high number of
features is one of its key benefits over other classification methods. In our
situation there are hundreds of distinct words, and each of them is treated as
a feature. Additionally, it functions effectively even when irrelevant
features are present and is mostly unaffected by them. Its relative simplicity
is another key benefit: tuning its parameters is rarely necessary, except in
cases where the data distribution is known. It also rarely overfits the data.

Another key benefit is how quickly the model trains and predicts given the
volume of data it can manage. Overall, Naive Bayes' algorithm really is a
treasure.