
A Study of Supervised Spam Detection

using Artificial Intelligence

Presented by
Mohit Magare
Class: BE-B-10
PRN No: 71921639H

1
What is Spam?
• Typical legal definition
– Unsolicited commercial email from someone without a pre-existing business relationship

• Definition mostly used
– Whatever the users think

2
Spam Detection

Ham

Spam

Is this just text categorization?


What are the special challenges?
3
Text classification alone is not enough

• Spammers now often try to obscure text.

• Special features are necessary.
– E.g. subject line vs. body text
– E.g. mail in the middle of the night is more likely to be spam than mail in the middle of the day.

4
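The non-textual features the slide mentions (subject line vs. body, time of day) could be extracted roughly as below; the feature names and the night-time window are illustrative assumptions, not from the slides:

```python
from email.message import EmailMessage
from email.utils import parsedate_to_datetime

def extra_features(msg):
    """Non-textual features of the kind the slide mentions.
    Names and the night window are illustrative choices."""
    sent = parsedate_to_datetime(msg["Date"])
    return {
        "subject_has_exclaim": "!" in (msg["Subject"] or ""),
        # mail sent in the middle of the night is more often spam
        "sent_at_night": sent.hour < 6 or sent.hour >= 23,
    }

msg = EmailMessage()
msg["Subject"] = "You WON!!!"
msg["Date"] = "Tue, 02 Jan 2024 03:15:00 +0000"
feats = extra_features(msg)
```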
Weather Report Guy

• Content in Image

Weather, Sunny, High 82, Low 81, Favorite…

5
Secret Decoder Ring Dude
• Character Encoding
• HTML word breaking
Pharmacy
Prod&#117;c<!LZJ>t<!LG>s

6
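Both tricks on this slide can be undone with the standard library; a minimal sketch, assuming only numeric character references and `<!…>`-style tag junk:

```python
import html
import re

def normalize(text):
    """Strip <!...> junk tags that split words apart, then
    decode numeric character references like &#117;."""
    no_tags = re.sub(r"<![^>]*>", "", text)
    return html.unescape(no_tags)

print(normalize("Prod&#117;c<!LZJ>t<!LG>s"))  # -> Products
```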
Diploma Guy
• Word Obscuring

Dlpmoia Pragorm
Caerte a mroe prosoeprus

7
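Word obscuring shuffles a word's letters but keeps them all, so an obscured token can still be matched against a blocklist by comparing sorted characters; the blocklist here is a made-up example:

```python
def unscramble_match(token, blocklist):
    """Return a blocked word whose letters are an anagram of token,
    since scrambling preserves the multiset of characters."""
    key = sorted(token.lower())
    for word in blocklist:
        if sorted(word) == key:
            return word
    return None

blocked = {"diploma", "program", "prosperous"}
print(unscramble_match("Dlpmoia", blocked))  # -> diploma
print(unscramble_match("Pragorm", blocked))  # -> program
```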
One Solution to Spam Detection
• Machine Learning
– Learn spam versus good

8
Naïve Bayes
• Want P(spam | words)
• Use Bayes' rule:
  P(spam | words) = P(words | spam) P(spam) / P(words)
  P(words) = P(words | spam) P(spam) + P(words | good) P(good)
• Assume independence: the probability of each word is independent of the others
  P(words | spam) = P(word1 | spam) × P(word2 | spam) × … × P(wordn | spam)

9
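The Naïve Bayes computation on this slide can be sketched end to end; the toy training data and Laplace smoothing are illustrative choices, not from the slides:

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (word_list, label), label 'spam' or 'good'."""
    counts = {"spam": Counter(), "good": Counter()}
    n = Counter()
    for words, label in docs:
        counts[label].update(words)
        n[label] += 1
    return counts, n

def p_spam(words, counts, n):
    """P(spam | words) via Bayes' rule, assuming words are independent."""
    vocab = set(counts["spam"]) | set(counts["good"])
    logp = {}
    for c in ("spam", "good"):
        total = sum(counts[c].values())
        logp[c] = math.log(n[c] / (n["spam"] + n["good"]))  # prior P(c)
        for w in words:
            # Laplace smoothing so unseen words don't zero the product
            logp[c] += math.log((counts[c][w] + 1) / (total + len(vocab)))
    m = max(logp.values())
    num = math.exp(logp["spam"] - m)
    return num / (num + math.exp(logp["good"] - m))

counts, n = train([
    (["free", "money", "now"], "spam"),
    (["free", "pills"], "spam"),
    (["meeting", "tomorrow"], "good"),
    (["lunch", "tomorrow"], "good"),
])
score = p_spam(["free", "money"], counts, n)  # well above 0.5
```

Working in log space avoids floating-point underflow when the product runs over many words.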
A Bayesian Approach to Filtering Junk E-Mail
1998 - Sahami, Dumais, Heckerman, Horvitz

• One of the first papers on using machine learning to combat spam
• Used Naïve Bayes
• Feature Space: Words, Phrases, Domain-Specific Features
• Evaluation Data: ~1700 Messages, ~88% Spam, from
volunteer’s private e-mail

10
A Bayesian Approach to Filtering Junk E-Mail
1998 - Sahami, Dumais, Heckerman, Horvitz

• Hand-Crafted Features
– 35 Phrases
• ‘Free Money’
• ‘Only $’
• ‘be over 21’
– 20 Domain Specific Features
• Domain type of sender (.edu, .com, etc)
• Sender name resolutions (internal mail)
• Has attachments
• Time received
• Percent of non-alphanumeric characters in subject
• Best collection of heuristics discussed in literature
– Without them: Spam precision 97.1% Spam recall 94.3%
– With them: Spam precision 100% Spam recall 98.3%
11
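A few of the paper's hand-crafted features could be implemented along these lines; the exact feature definitions below are guesses for illustration, not taken from the paper:

```python
PHRASES = ["free money", "only $", "be over 21"]  # 3 of the paper's 35 phrases

def handcrafted_features(sender, subject, body, has_attachment):
    """Illustrative versions of phrase and domain-specific features."""
    text = (subject + " " + body).lower()
    nonalnum = sum(not ch.isalnum() and not ch.isspace() for ch in subject)
    features = {f"phrase:{p}": p in text for p in PHRASES}
    features.update({
        "sender_domain": sender.rsplit(".", 1)[-1],  # .edu, .com, ...
        "has_attachment": has_attachment,
        "subject_pct_nonalnum": nonalnum / max(len(subject), 1),
    })
    return features

f = handcrafted_features("promo@deals.com", "Free money!!!", "You must be over 21", False)
```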
Algorithms Used in Spam Detection

[Bar chart: algorithm usage counts; y-axis 0–12]
• Naïve Bayes reported to do very well
• More complex algorithms have some gain

12
Which Algorithm is Best?

• Very difficult to tell
– No consistently-used good data set
– No standard evaluation measures

13
Objectives

• Present several evaluation measures for spam detection
• Compare methods in six open-source spam filters
• Analyze the experimental results

14
Filters
• Some available open-source spam filters
– SpamAssassin
– Bogofilter
– CRM-114
– DSPAM
– SpamBayes
– SpamProbe

15
Evaluation Measures (1)

                Judgment
                Ham    Spam
  Result  Ham    a      b
          Spam   c      d

a: ham correctly classified [true negative]
b: spam misclassified as ham [false negative]
c: ham misclassified as spam [false positive]
d: spam correctly classified [true positive]

• Accuracy: (a+d)/(a+b+c+d)
• Ham misclassification rate: c/(a+c)
• Spam misclassification rate: b/(b+d)
• Spam recall: d/(b+d)
• Spam precision: d/(d+c)

16
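The five measures follow directly from the four confusion-matrix counts; the sample counts below are made up:

```python
def measures(a, b, c, d):
    """a: ham->ham, b: spam->ham, c: ham->spam, d: spam->spam."""
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "ham_misclassification": c / (a + c),
        "spam_misclassification": b / (b + d),
        "spam_recall": d / (b + d),
        "spam_precision": d / (d + c),
    }

m = measures(a=90, b=5, c=2, d=95)  # made-up counts
```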
Conclusion

We are able to classify e-mails as spam or non-spam using artificial intelligence, with almost 99.9% accuracy for the best-performing algorithms.

17
Thank you!

18
