
A Study of Supervised Spam Detection

using Artificial Intelligence

Presented by
Mohit Magare
Class: BE-B-10
PRN No: 71921639H

1
What is Spam?
• Typical legal definition
– Unsolicited commercial email from someone without a pre-existing business relationship

• Definition mostly used
– Whatever the users think

2
Spam Detection

Ham

Spam

Is this just text categorization?


What are the special challenges?
3
Text classification alone is not enough

• Spammers now often try to obscure text.

• Special features are necessary.
– E.g. subject line vs. body text
– E.g. mail in the middle of the night is more likely to be spam than mail in the middle of the day.

4
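The non-textual features the slide mentions (subject line vs. body, time of day) could be extracted roughly as below; the feature names and the night-time window are illustrative assumptions, not from the slides:

```python
from email.message import EmailMessage
from email.utils import parsedate_to_datetime

def extra_features(msg):
    """Non-textual features of the kind the slide mentions.
    Names and the night window are illustrative choices."""
    sent = parsedate_to_datetime(msg["Date"])
    return {
        "subject_has_exclaim": "!" in (msg["Subject"] or ""),
        # mail sent in the middle of the night is more often spam
        "sent_at_night": sent.hour < 6 or sent.hour >= 23,
    }

msg = EmailMessage()
msg["Subject"] = "You WON!!!"
msg["Date"] = "Tue, 02 Jan 2024 03:15:00 +0000"
feats = extra_features(msg)
```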
Weather Report Guy

• Content in Image

Weather, Sunny, High 82, Low 81, Favorite…

5
Secret Decoder Ring Dude
• Character Encoding
• HTML word breaking
Pharmacy
Prod&#117;c<!LZJ>t<!LG>s

6
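Both tricks on this slide can be undone with the standard library; a minimal sketch, assuming only numeric character references and `<!…>`-style tag junk:

```python
import html
import re

def normalize(text):
    """Strip <!...> junk tags that split words apart, then
    decode numeric character references like &#117;."""
    no_tags = re.sub(r"<![^>]*>", "", text)
    return html.unescape(no_tags)

print(normalize("Prod&#117;c<!LZJ>t<!LG>s"))  # -> Products
```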
Diploma Guy
• Word Obscuring

Dlpmoia Pragorm
Caerte a mroe prosoeprus

7
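Word obscuring shuffles a word's letters but keeps them all, so an obscured token can still be matched against a blocklist by comparing sorted characters; the blocklist here is a made-up example:

```python
def unscramble_match(token, blocklist):
    """Return a blocked word whose letters are an anagram of token,
    since scrambling preserves the multiset of characters."""
    key = sorted(token.lower())
    for word in blocklist:
        if sorted(word) == key:
            return word
    return None

blocked = {"diploma", "program", "prosperous"}
print(unscramble_match("Dlpmoia", blocked))  # -> diploma
print(unscramble_match("Pragorm", blocked))  # -> program
```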
One Solution to Spam Detection
• Machine Learning
– Learn spam versus good

8
Naïve Bayes
• Want P(spam | words)
• Use Bayes' rule:
  P(spam | words) = P(words | spam) P(spam) / P(words)
  P(words) = P(words | spam) P(spam) + P(words | good) P(good)
• Assume independence: the probability of each word is independent of the others
  P(words | spam) = P(word1 | spam) × P(word2 | spam) × … × P(wordn | spam)

9
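The Naïve Bayes computation on this slide can be sketched end to end; the toy training data and Laplace smoothing are illustrative choices, not from the slides:

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (word_list, label), label 'spam' or 'good'."""
    counts = {"spam": Counter(), "good": Counter()}
    n = Counter()
    for words, label in docs:
        counts[label].update(words)
        n[label] += 1
    return counts, n

def p_spam(words, counts, n):
    """P(spam | words) via Bayes' rule, assuming words are independent."""
    vocab = set(counts["spam"]) | set(counts["good"])
    logp = {}
    for c in ("spam", "good"):
        total = sum(counts[c].values())
        logp[c] = math.log(n[c] / (n["spam"] + n["good"]))  # prior P(c)
        for w in words:
            # Laplace smoothing so unseen words don't zero the product
            logp[c] += math.log((counts[c][w] + 1) / (total + len(vocab)))
    m = max(logp.values())
    num = math.exp(logp["spam"] - m)
    return num / (num + math.exp(logp["good"] - m))

counts, n = train([
    (["free", "money", "now"], "spam"),
    (["free", "pills"], "spam"),
    (["meeting", "tomorrow"], "good"),
    (["lunch", "tomorrow"], "good"),
])
score = p_spam(["free", "money"], counts, n)  # well above 0.5
```

Working in log space avoids floating-point underflow when the product runs over many words.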
A Bayesian Approach to Filtering Junk E-Mail
1998 - Sahami, Dumais, Heckerman, Horvitz

• One of the first papers on using machine learning to combat spam
• Used Naïve Bayes
• Feature Space: Words, Phrases, Domain-Specific Features
• Evaluation Data: ~1700 Messages, ~88% Spam, from
volunteer’s private e-mail

10
A Bayesian Approach to Filtering Junk E-Mail
1998 - Sahami, Dumais, Heckerman, Horvitz

• Hand-Crafted Features
– 35 Phrases
• ‘Free Money’
• ‘Only $’
• ‘be over 21’
– 20 Domain Specific Features
• Domain type of sender (.edu, .com, etc)
• Sender name resolutions (internal mail)
• Has attachments
• Time received
• Percent of non-alphanumeric characters in subject
• Best collection of heuristics discussed in literature
– Without them: Spam precision 97.1% Spam recall 94.3%
– With them: Spam precision 100% Spam recall 98.3%
11
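A few of the paper's hand-crafted features could be implemented along these lines; the exact feature definitions below are guesses for illustration, not taken from the paper:

```python
PHRASES = ["free money", "only $", "be over 21"]  # 3 of the paper's 35 phrases

def handcrafted_features(sender, subject, body, has_attachment):
    """Illustrative versions of phrase and domain-specific features."""
    text = (subject + " " + body).lower()
    nonalnum = sum(not ch.isalnum() and not ch.isspace() for ch in subject)
    features = {f"phrase:{p}": p in text for p in PHRASES}
    features.update({
        "sender_domain": sender.rsplit(".", 1)[-1],  # .edu, .com, ...
        "has_attachment": has_attachment,
        "subject_pct_nonalnum": nonalnum / max(len(subject), 1),
    })
    return features

f = handcrafted_features("promo@deals.com", "Free money!!!", "You must be over 21", False)
```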
Algorithms Used in Spam Detection

[Bar chart: algorithm usage counts; y-axis 0–12]
• Naïve Bayes reported to do very well
• More complex algorithms have some gain

12
Which Algorithm is Best?

• Very difficult to tell
– No consistently-used good data set
– No standard evaluation measures

13
Objectives

• Present several evaluation measures for spam detection
• Compare methods in six open-source spam filters
• Analyze the experimental results

14
Filters
• Some available open-source spam filters
– SpamAssassin
– Bogofilter
– CRM-114
– DSPAM
– SpamBayes
– SpamProbe

15
Evaluation Measures (1)

                Judgment
                Ham    Spam
  Result  Ham    a      b
          Spam   c      d

a: ham correctly classified [true negative]
b: spam misclassified as ham [false negative]
c: ham misclassified as spam [false positive]
d: spam correctly classified [true positive]

• Accuracy: (a+d)/(a+b+c+d)
• Ham misclassification rate: c/(a+c)
• Spam misclassification rate: b/(b+d)
• Spam recall: d/(b+d)
• Spam precision: d/(d+c)

16
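The five measures follow directly from the four confusion-matrix counts; the sample counts below are made up:

```python
def measures(a, b, c, d):
    """a: ham->ham, b: spam->ham, c: ham->spam, d: spam->spam."""
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "ham_misclassification": c / (a + c),
        "spam_misclassification": b / (b + d),
        "spam_recall": d / (b + d),
        "spam_precision": d / (d + c),
    }

m = measures(a=90, b=5, c=2, d=95)  # made-up counts
```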
Conclusion

We are able to classify e-mails as spam or non-spam using artificial intelligence, with almost 99.9% accuracy for the best-performing algorithms.

17
Thank you!

18
