Major Project by Ali(Intrainz)
MAJOR PROJECT
CONTENTS

1. Abstract
2. Introduction
3. Requirements
4. Flow Chart
5. Introduction to the Machine Learning Algorithms
   AdaBoost
   NLTK
ABSTRACT
INTRODUCTION
Chat technology is simply one aspect of SMS. SMS technology was made
possible by an accepted international standard. Spam is the term for the
abuse of electronic messaging services to send large numbers of unwanted
messages indiscriminately. Even though email spam is the best-known
example, similar abuse in other media is frequently referred to as
"spam" as well.
SMS spam, in this sense, is unsolicited bulk messaging that carries some
commercial interest and is quite similar to email spam. Phishing URLs and
business promotions are spread via SMS spam. Commercial spammers use
malware to transmit SMS because most countries outlaw the practice. Since
it is challenging to pinpoint the origin of spam when it is sent from a
hacked computer, spammers take on less risk by doing so.
Only letters, numbers, and a few symbols are permitted in SMS messages. A
brief glance at the messages reveals a clear pattern: almost all spam
messages direct users to call a phone number or visit a website. A simple
SQL query on the spam samples confirms this trend. Because of the low cost
and large bandwidth of the SMS network, SMS spam is widespread.
Every time a user receives an SMS spam message, their mobile phone
notifies them of its arrival. The user is annoyed when they realise the
message is unwanted, and the spam also uses up some of the storage space
on their mobile device.
There are several notable differences between emails and text messages.
Contrary to email spam filtering, which can draw on a range of sizable
datasets, real databases of SMS spam are quite scarce. The number of
features that can be used to classify text messages is also considerably
smaller than for emails because of the shorter length of text messages.
There is also no header in this case. In addition, text messages use
significantly less formal language than emails and are full of acronyms.
All of these factors can cause a significant decline in the effectiveness
of the most common email spam filtering algorithms when they are applied
to short text messages.
In the third installment of a three-part series, we examine the
spam-or-ham classifier from the standpoint of AI concepts and experiment
with several classification algorithms, comparing them on performance
criteria, in a web-based Python environment.
REQUIREMENTS

For Hardware:
RAM
HDD: at least 100 GB

For Software:
Jupyter Notebook
Flow Chart
INTRODUCTION TO THE ALGORITHMS
KNN
K-Nearest Neighbour (KNN) is a straightforward instance-based learning
technique that can be used to solve classification problems. In this
method, a test sample's label is predicted by a majority vote among its
k closest neighbours in the training set.
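The voting idea above can be sketched with scikit-learn's KNeighborsClassifier. The feature matrix here is a toy stand-in (e.g. character count and digit count per message), not the project's real dataset:

```python
# Sketch: k-nearest-neighbour classification of message feature vectors.
# The numbers below are made up for illustration; 1 = spam, 0 = ham.
from sklearn.neighbors import KNeighborsClassifier

# Each row: [character count, digit count] of a message.
X_train = [[160, 12], [150, 9], [40, 0], [35, 1], [155, 10], [30, 0]]
y_train = [1, 1, 0, 0, 1, 0]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# A long, digit-heavy message lands among the spam neighbours;
# a short one lands among the hams.
print(knn.predict([[158, 11], [33, 0]]))
```

With k=3, each prediction is simply the majority label among the three nearest training points.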
We observe that the text message's character count is a very helpful
feature for identifying spam: when features are ranked by the mutual
information criterion, it has the highest mutual information with the
target labels. Also, although text messages with lengths below a specific
threshold are normally ham, they can be mistakenly labelled as spam
because of the tokens they contain. This is apparent when inspecting the
misclassified samples.
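The ranking step can be sketched with scikit-learn's mutual_info_classif. The synthetic data below is illustrative only: one feature is made strongly label-dependent (mimicking character count) and one is pure noise:

```python
# Sketch: ranking features by mutual information with the labels.
# Synthetic toy data, not the project's dataset.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, n)                     # 1 = spam, 0 = ham
char_count = labels * 120 + rng.normal(40, 5, n)   # strongly label-dependent
noise = rng.normal(0, 1, n)                        # unrelated feature

X = np.column_stack([char_count, noise])
mi = mutual_info_classif(X, labels, random_state=0)
# The character-count-like feature should dominate the ranking.
print(dict(zip(["char_count", "noise"], mi.round(3))))
```

A real pipeline would compute this over all candidate features and sort by the returned scores.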
The results show no accuracy advantage for SVM with different kernels
over the simpler algorithms, despite the SVM model being more complex and
taking longer to train on the data.
Step 3: Bag of Words
We have a substantial set of text data (5,580 rows). Email and other
messages usually consist of free-form language, yet the majority of
machine learning algorithms require numerical data as input.
In this part, we introduce the Bag of Words (BOW) notion, a term for
approaches that process a single text or a collection of texts. BOW's
fundamental idea is to count the occurrences of each word within a given
body of text. The order in which the words appear is irrelevant under the
BOW model, which treats each word separately.
We can turn a group of documents into a matrix using a technique we'll
cover later, where each document is a row, each word or token is a
column, and each cell holds the frequency with which that word or token
appears in that document.
Step 4: Training and Testing
We can return to our dataset and continue our analysis now that we know
how to build the Bag of Words representation. To be able to test our
model later, we first divide our dataset into a training set and a
testing set.
After dividing the data, our next goal is to carry out the Bag of Words
procedure described above and convert our data into the desired matrix
format. As before, we will use CountVectorizer() to accomplish this.
Here, there are two steps to consider: we first fit the vectoriser to
the training data, and then, in a subsequent step, we transform the test
data with the same fitted vocabulary, using the resulting matrix to make
predictions about the "sms message" column.
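The split-then-vectorise-then-predict sequence can be sketched end to end as below. The tiny corpus and variable names are illustrative; the real project uses the SMS Spam dataset:

```python
# Sketch of the split / vectorise / predict steps, on a made-up corpus.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "win a free prize now", "urgent prize claim now", "free cash win",
    "see you at lunch", "call me when you are home", "meeting moved to noon",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=2, random_state=1, stratify=labels
)

vec = CountVectorizer()
train_counts = vec.fit_transform(X_train)  # fit vocabulary on training data only
test_counts = vec.transform(X_test)        # reuse that vocabulary for the test set

clf = MultinomialNB()                      # Laplace smoothing (alpha=1.0) by default
clf.fit(train_counts, y_train)
print(clf.predict(test_counts))
```

Fitting the vectoriser on the training data alone, and only transforming the test data, is what keeps the test set honest: its words are mapped into the training vocabulary rather than defining a new one.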
For example, if we had 100 text messages and only two of them were spam
while the other 98 were not, we would have a classification problem with
a skewed class distribution.
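On such skewed data, plain accuracy is misleading, which is why metrics such as recall matter. Using the 2-spam / 98-ham split above, a classifier that always predicts "ham" looks excellent by accuracy while catching no spam at all:

```python
# Sketch: accuracy vs recall on the skewed 2-spam / 98-ham example.
from sklearn.metrics import accuracy_score, recall_score

y_true = [1] * 2 + [0] * 98   # 2 spam, 98 ham
y_pred = [0] * 100            # degenerate classifier: always predict ham

print(accuracy_score(y_true, y_pred))   # high, despite being useless
print(recall_score(y_true, y_pred))     # fraction of spam actually caught
```

Accuracy comes out at 98% here, yet recall on the spam class is zero, which is exactly the failure mode skewed distributions hide.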
AdaBoost
AdaBoost is an ensemble strategy that combines weak learners into a
stronger one. It fits classifiers one at a time, refining each one to
account for examples that were misclassified by prior classifiers. Even
if the individual classifiers are only moderately better than random
guessing, the final combined model is improved.
Certain weights are attached to the training samples at each AdaBoost
iteration. These weights are distributed uniformly before the first
iteration. After each iteration, the algorithm increases the weights of
samples that were incorrectly classified by the current model and
decreases the weights of samples that were correctly classified. This
means each new predictor concentrates on the examples the ensemble still
gets wrong.
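The boosting loop above can be sketched with scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 decision stump. The synthetic dataset is a stand-in for illustration:

```python
# Sketch: AdaBoost over decision stumps on a synthetic 2-class problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Default base estimator is a depth-1 tree: a weak learner only
# moderately better than random guessing on its own.
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X, y)
print(ada.score(X, y))  # training accuracy of the boosted ensemble
```

Each of the 50 stumps is trained on the reweighted samples, so later stumps focus on the points earlier ones got wrong; the ensemble's weighted vote ends up far stronger than any single stump.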
5.2 NLTK
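NLTK (the Natural Language Toolkit) is typically used in such a pipeline for text normalisation before vectorising. A minimal sketch of that kind of preprocessing follows; the inline stopword list is a hypothetical shortcut to avoid the nltk.download('stopwords') step a real pipeline would use via nltk.corpus.stopwords:

```python
# Sketch: NLTK-style normalisation for spam filtering:
# lower-case, drop stopwords, stem. The stopword set here is a tiny
# illustrative stand-in for NLTK's downloadable corpus.
from nltk.stem import PorterStemmer

STOPWORDS = {"a", "the", "to", "you", "your", "is", "for"}
stemmer = PorterStemmer()

def normalise(message):
    tokens = message.lower().split()
    return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]

print(normalise("Claim your FREE winning prizes now"))
```

Stemming maps inflected forms ("winning", "prizes") onto common stems, shrinking the Bag-of-Words vocabulary before counting.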
Result Screenshot:
Conclusion
The simulation results display the outcomes of the various classification
models run on the SMS Spam dataset. The best classifiers for SMS spam
detection include SVM with a linear kernel and multinomial naive Bayes
with Laplace smoothing. The SVM-based classifier in the original research
that used this dataset had the highest overall accuracy (92.64%), making
it the best one there. With an overall accuracy of 92.60%, enhanced naive
Bayes was the next best classifier in that research. Compared to those
earlier results, our classifier cuts the overall error roughly in half.
The factors behind this improvement include the addition of informative
features such as the number of characters in a message, the use of
specific length thresholds, and the analysis of learning curves and
misclassified data.
The capability of naive Bayes to handle an exceptionally high number of
features is one of its key benefits over other classification methods.
In our case, every one of the hundreds of distinct words is treated as a
feature. Additionally, it functions effectively even when irrelevant
features are present and is mostly unaffected by them. Its relative
simplicity is another key benefit, and it performs well even when little
is known about the data distribution. The model rarely overfits the data.
Another key benefit is how quickly the model trains and predicts given
the volume of data it can handle. Overall, the naive Bayes algorithm
really is a treasure.