
Detecting spam mail with Naive Bayes

Introduction
Nowadays, online communication has become essential in our lives, and email is one of the most widely used formal communication media. As email became more and more popular, the amount of spam rose as well. Everyone has encountered the problem of email spam: the practice of sending undesirable messages to large numbers of email users. Spam emails can also carry harmful content, such as viruses, which companies try to protect their machines from.

Spam detection can help with this problem. Many algorithms can be used for spam filtering; this study presents the Naive Bayes classifier. We use the Bag-of-Words model, which allows us to extract features from textual data and thus obtain a numeric representation of the words. This representation can then be used by the Naive Bayes algorithm for further analysis. We have a dataset of emails, each labeled as spam or ham (not spam). The method can be trained on a per-user basis, which may be its main advantage. This paper offers a step-by-step explanation of the method and can help in understanding Naïve Bayes spam filtering.

The rest of the paper is organized as follows: Section 2 presents the Bayes theorem and the Naive Bayes algorithm; Section 3 describes the data and the bag-of-words model; Section 4 discusses the advantages and disadvantages of the method; and Section 5 concludes this work and outlines further approaches to spam detection for future work.

Bayes Theorem and Naïve Bayes


The Naive Bayes algorithm is a simple statistical method based on the Bayes theorem. It can calculate the probability that a particular email is spam, based on the words that appear in the mail. The algorithm is called naive because the model assumes that the email's characteristics, the words, are completely independent of each other. This is a strong simplification, since in many cases neighbouring words are related and the order of the words also matters. Nevertheless, the Naive Bayes spam filter is a well-established method that can be tailored to the individual user and has a correspondingly low false-positive rate.

Preliminaries

As mentioned at the beginning of the paper, Naïve Bayes is a machine learning algorithm. In machine learning, a common procedure is to split the available data into training and test datasets. The selected model is trained on the training dataset, and its performance is then checked on the test dataset (sometimes it is also evaluated on the training set). By splitting the data into training and test sets, we try to prevent overfitting and underfitting, which affect all known statistical models, including Naive Bayes.

When our model does not fit the data correctly, meaning it does not recognize the relevant features and characteristics, we speak of underfitting. In this case the model has failed to identify the characteristic properties of the training data, so new, unseen inputs cannot be identified and classified correctly. Underfitting can have several causes: poor quality or too little training data, or the use of an inappropriate model. The opposite of underfitting is overfitting: this term indicates that the model corresponds too closely, or even exactly, to the training data. A typical source of error is that the model adapts to the specific peculiarities of the training dataset and therefore cannot generalize properly later. In this case the model gets lost in the details, often treating noise or outliers as features, and hence cannot draw sufficiently general conclusions from the dataset. Therefore, high performance on the training data does not imply that the algorithm will perform well on the test data.

When splitting the input dataset, we have to make sure that the class distribution of the training and test datasets matches that of the original dataset. We do not want the training dataset to contain almost only SPAM while SPAM hardly appears in the test dataset. For this reason, each input is randomly assigned either to the test set or to the training set.
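
A minimal Python sketch of such a stratified random split is given below; the dataset format (a list of (document, label) pairs), the function name and the default values are assumptions made for illustration, not part of the original study.

import random

def stratified_split(dataset, test_ratio=0.2, seed=42):
    """Split (document, label) pairs so the SPAM/HAM proportions are preserved.

    `dataset` is assumed to be a list of (document, label) tuples with labels
    "SPAM" or "HAM"; the name and signature are illustrative.
    """
    rng = random.Random(seed)
    by_label = {}
    for doc, label in dataset:
        by_label.setdefault(label, []).append((doc, label))

    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)
        cut = int(len(items) * test_ratio)
        test.extend(items[:cut])    # held-out portion of this class
        train.extend(items[cut:])   # remainder goes to the training set
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test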

In light of the above, in order to check the correctness of our model, we can calculate two types of error: the training error and the test error (a small sketch for computing both follows this list).
• For the training error, the model is trained on the training data, the training data is then relabeled with the obtained model, and by comparing these labels with the real labels we measure how well the algorithm performs. If the training error is large, we are dealing with underfitting.
• For the test error, the model is trained on the training data and the correctness of the algorithm is checked on the test data. If, for example, the training error is small but the test error is large, we can assume overfitting.
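
As a rough illustration, both errors can be measured with the same helper; `predict` stands for any trained classifier mapping a document to "SPAM" or "HAM", and the names are hypothetical.

def error_rate(predict, labeled_data):
    """Fraction of documents whose predicted label differs from the true label."""
    wrong = sum(1 for doc, label in labeled_data if predict(doc) != label)
    return wrong / len(labeled_data)

# train_error = error_rate(predict, train)  # a large value suggests underfitting
# test_error  = error_rate(predict, test)   # small train error + large test error suggests overfitting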

Naïve Bayes

The basic idea of the Naïve Bayes method is that certain words (e.g. FREE, MILLION, CLICK) in a given email allow us to conclude that the email is more likely to be spam. For further explanation, we introduce the following notation:
• di is the i-th document and yi is the label of this document (SPAM/HAM)
• D = {(di, yi) | i = 1, ..., ℓ} is the labeled dataset
• wk denotes the k-th word occurring in the documents
We treat each document as a collection of independent words (bag of words), so each message can be characterized as a histogram:
di = {(wk, card(wk, di)) | k = 1, ..., m}
where card(wk, di) is the number of times word wk occurs in document di.
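
As an illustration, such a word histogram can be built with a few lines of Python; the tokenization shown here (lower-casing and whitespace splitting) is an assumption, since the paper does not specify the preprocessing.

from collections import Counter

def bag_of_words(document):
    """Represent a document d_i as its word histogram {(w_k, card(w_k, d_i))}."""
    return Counter(document.lower().split())

# bag_of_words("Click here for FREE FREE money")
# -> Counter({'free': 2, 'click': 1, 'here': 1, 'for': 1, 'money': 1})
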
We want to determine the probability that a given mail is HAM or SPAM.
For example, we consider the probability P(SPAM|d). According to the Bayes theorem, this can be written as:

P(SPAM|d) = P(d|SPAM) · P(SPAM) / P(d)
where:
• P(SPAM) is the probability that an arbitrarily selected mail from the dataset is unwanted mail
• P(d) is the probability that an arbitrary document has the form d
• P(d|SPAM) is the conditional probability that a SPAM mail looks like the document d in question; document d is defined by its features, for which we use the bag-of-words principle
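
For illustration, with purely made-up numbers: if P(SPAM) = 0.4, P(d|SPAM) = 0.01 and P(d) = 0.008, then the Bayes theorem gives P(SPAM|d) = 0.01 · 0.4 / 0.008 = 0.5, i.e. a mail of the form d would be spam with probability 50%.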

For a given word wk, let P(wk|SPAM) denote the probability that this word appears in a SPAM email. Since we use naive Bayes, we can assume that all the words appearing in the document are completely independent of each other.

This means that we only have to take into account the probabilities with which the individual words of the document typically appear in SPAM-type mail.

Thus, the probability that the given document d is SPAM:

P(SPAM|d) = P(SPAM) · ∏k P(wk|SPAM)^card(wk, d) / P(d)

Content-based spam filtering is a binary classification problem, so in this case it is sufficient to compute the

P(SPAM|d) / P(HAM|d)

ratio. Note that P(d) cancels in this ratio, so it does not need to be estimated. If the ratio is greater than 1, the probability of the document being SPAM is greater than 50%; otherwise the mail is more likely to be HAM. We can estimate every remaining factor in the formulas above from the data, hence in principle we can detect spam with the Naive Bayes algorithm.
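
The sketch below illustrates one way to carry this out in Python: it estimates the priors and per-class word frequencies from the labeled training set and classifies a document by the sign of the logarithm of the above ratio. All names are illustrative, it reuses the `bag_of_words` sketch from Section 3, and words unseen in either class are simply skipped here; the additive smoothing discussed in the next section handles them more cleanly.

import math
from collections import Counter

def train_naive_bayes(train_set):
    """Estimate the class priors P(SPAM), P(HAM) and the per-class word counts."""
    class_counts = Counter(label for _, label in train_set)
    priors = {c: class_counts[c] / len(train_set) for c in class_counts}

    word_counts = {c: Counter() for c in class_counts}
    for doc, label in train_set:
        word_counts[label].update(bag_of_words(doc))
    totals = {c: sum(counts.values()) for c, counts in word_counts.items()}
    return priors, word_counts, totals

def spam_ham_log_ratio(doc, priors, word_counts, totals):
    """log(P(SPAM|d) / P(HAM|d)); a positive value means 'more likely SPAM'."""
    score = math.log(priors["SPAM"]) - math.log(priors["HAM"])
    for word, count in bag_of_words(doc).items():
        c_spam = word_counts["SPAM"][word]
        c_ham = word_counts["HAM"][word]
        if c_spam == 0 or c_ham == 0:      # avoid log(0) without smoothing
            continue
        p_spam = c_spam / totals["SPAM"]   # estimate of P(w_k | SPAM)
        p_ham = c_ham / totals["HAM"]      # estimate of P(w_k | HAM)
        score += count * (math.log(p_spam) - math.log(p_ham))
    return score

# predict = lambda d: "SPAM" if spam_ham_log_ratio(d, priors, word_counts, totals) > 0 else "HAM"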

Improvements
In the second section we presented the simple Naïve Bayes algorithm, and the results from the previous section are based on that model. However, Naïve Bayes admits several refinements that can yield better estimates. Additive smoothing is one of them: its essence is to assign a small non-zero probability, instead of zero, to words not seen in the training set, which leads to a better estimate. Additive smoothing requires a hyperparameter, which can be determined with cross-validation.
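
A minimal sketch of such a smoothed estimate, using the same notation as before (the names are illustrative), could look as follows; with it, the "skip unseen words" guard in the earlier sketch is no longer needed.

def smoothed_word_prob(word, class_word_counts, class_total, vocab_size, alpha=1.0):
    """Additively smoothed estimate of P(word | class).

    An unseen word gets the small non-zero probability alpha / (N + alpha * V)
    instead of zero; `alpha` is the hyperparameter mentioned in the text
    (alpha = 1 is Laplace smoothing) and `vocab_size` (V) is the number of
    distinct words in the training set.
    """
    return (class_word_counts[word] + alpha) / (class_total + alpha * vocab_size)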

K-fold cross-validation is a resampling procedure used in machine learning to estimate the optimal hyperparameters of a model, or to evaluate the method when the amount of available data is limited. The essence of the method is to divide the available dataset into K disjoint subsets of approximately the same size and then select each subset in turn as the validation set (test dataset), while the remaining K−1 subsets are merged to form the training dataset. We train our model in turn on each of these training sets and evaluate its performance on the corresponding validation sets. Finally, the evaluation results are aggregated (typically averaged).

Cross-validation is also advantageous because it samples the original dataset evenly and consequently offers a more robust evaluation. When cross-validation is used to determine the hyperparameters of the model, the cross-validated error is computed separately for each candidate parameter value, and we then select the value for which the obtained error is minimal.
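
The sketch below illustrates this procedure; `train_fn` and `error_fn` are placeholders for whatever training routine and error measure are used (for example the Naive Bayes and error-rate sketches above), and all names are assumptions for illustration.

import random

def k_fold_cv(dataset, k, train_fn, error_fn, seed=0):
    """Average validation error over K disjoint, roughly equal folds."""
    items = list(dataset)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]            # K disjoint subsets

    errors = []
    for i in range(k):
        validation = folds[i]                          # the i-th fold is held out
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        predictor = train_fn(training)
        errors.append(error_fn(predictor, validation))
    return sum(errors) / k

# hyperparameter selection: pick the alpha with the smallest cross-validated error
# best_alpha = min(candidate_alphas, key=lambda a: k_fold_cv(data, 5, make_trainer(a), error_rate))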

So far we have talked about supervised learning, but Naïve Bayes can also be used in a semi-supervised setting. The basic idea of semi-supervised learning is that if we have unlabeled data, we use it as well, thereby improving the predictions. In the case of Naïve Bayes, one applicable method is to train on the labeled training data and then determine the classes of the unlabeled training data. If the decision is sufficiently confident, we add the document together with its predicted label to the training dataset, and then we train the Naïve Bayes classifier again on the expanded training set. In this way, the set of labeled training data grows and our predictions become more accurate.
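
A minimal self-training sketch under these assumptions might look as follows; `train_fn` fits a classifier on (document, label) pairs, `score_fn` returns its SPAM/HAM log-ratio for a document, and the confidence threshold and round count are arbitrary illustrative values.

def self_train(labeled, unlabeled, train_fn, score_fn, threshold=3.0, rounds=5):
    """Semi-supervised self-training: label confident documents, then retrain."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        model = train_fn(labeled)
        confident, remaining = [], []
        for doc in unlabeled:
            score = score_fn(model, doc)
            if abs(score) >= threshold:        # the decision is confident enough
                confident.append((doc, "SPAM" if score > 0 else "HAM"))
            else:
                remaining.append(doc)
        if not confident:                      # nothing more to add; stop early
            break
        labeled.extend(confident)              # grow the labeled training set
        unlabeled = remaining
    return train_fn(labeled)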

Discussion
This study presented the Naïve Bayes algorithm, which can be used for detecting unwanted, harmful emails. Bayesian filtering is not only simple, it is efficient as well. The method takes the whole email into account and examines all aspects of the message. Moreover, it is self-adapting: we can always retrain the algorithm on new datasets, so it can learn new spam techniques. The method is also user-specific, as it can learn an individual user's email habits. Furthermore, the Bayesian method is completely international and multilingual, hence it can catch more harmful emails.
