Anti-Spam Methods - State-of-the-Art: W. Gansterer, M. Ilger, P. Lechner, R. Neumayer, J. Strauß
March 2005
This report summarizes the results of Phase 1 of the project FA 384018 “Spamabwehr”
of the Institute of Distributed and Multimedia Systems at the University of Vienna,
funded by Mobilkom Austria, UPC Telekabel and Internet Service Providers Austria
(ISPA).
We would like to thank Mobilkom Austria, UPC Telekabel and Internet Service
Providers Austria (ISPA) for their support which made this research project possible.
We also would like to express our gratitude to all those commercial vendors of anti-spam
tools who provided us with their products for experimental investigations, as well as to
the volunteers who provided us with private e-mail messages for testing purposes.
Copyright:
© 2005 by University of Vienna. All rights reserved. No part of this publication may be reproduced or distributed in
any form or by any means without the prior permission of the authors. The Institute of Distributed and Multimedia
Systems at the University of Vienna does not guarantee the accuracy, adequacy or completeness of any information
and is not responsible for any errors or omissions or the result obtained from the use of such information.
About the Authors
Project “Spamabwehr” was launched in summer 2004 at the Department of Computer
Science (Distributed Systems group) which, due to internal restructuring at the
University of Vienna, became the new Institute of Distributed and Multimedia Systems
at the Faculty of Computer Science.
The team:
Dr. Wilfried Gansterer (project leader),
Michael Ilger, Peter Lechner, Robert Neumayer and Jürgen Strauß.
The institution:
The Faculty of Computer Science (Fakultät für Informatik) is currently led by Dean
Prof. Dr. Günter Haring. The Institute of Distributed and Multimedia Systems, headed
by Prof. DDr. Gerald Quirchmayr, is one of the institutes within this faculty.
Table of Contents
4.2.1. Our Test Sample
4.2.2. SpamAssassin Test Sample
4.3. EXPERIMENTAL SETUP
4.3.1. Windows Test Process
4.3.2. Linux Test Process
5. EXPERIMENTAL RESULTS
5.1. OUR TEST SAMPLE
5.1.1. Commercial Products
5.1.2. Open Source Tools
5.1.3. Conclusion
5.2. SPAMASSASSIN TEST SAMPLE
5.2.1. Commercial Products
5.2.2. Open Source Tools
5.2.3. Conclusion
6. CONCLUSION
6.1. METHODS
6.2. EXPERIMENTS
7. LIST OF FIGURES
8. LIST OF TABLES
9. INDEX
10. BIBLIOGRAPHY
Executive Summary
This report summarizes the findings and results of the first phase of the project “FA
384108 Spam-Abwehr” (“Spam-Defense”) which was launched in July 2004 at the
Department of Computer Science and Business Informatics at the University of Vienna
and is supported by Mobilkom Austria, UPC Telekabel and Internet Service Providers
Austria (ISPA).
(4) Section 4 summarizes the setup we used for experimenting with the products
and tools mentioned above. In particular, it describes the two test sets
containing spam and ham messages (one of them we collected ourselves from
various sources, the other one is publicly available) and the hardware we
used.
(5) Section 5 summarizes our experimental results in detail. Detection rates and
false positive rates are given for each of the products and tools used.
Since the goal of this report was an analysis of existing methodology, and not
a comprehensive and detailed evaluation or comparison of anti-spam
products/tools available, the results of our experimental evaluation must not
be interpreted as a “ranking”. In order to produce a rigorous ranking, we
would have to use a wider variety of test sets, we would have to spend much
more effort on tuning the products and tools (which can be an enormously
time consuming task), we would have to monitor their performance over a
longer period of time, and we also would have to take into account other
properties beyond detection rates (such as user-friendliness, administrative
overhead, etc.). Thus, the results quoted should be considered approximations
of the performance achievable with the respective tools.
In many cases, our results are reasonably good indications of the performance
to be expected: experience shows that even with more tuning effort, detection
rates usually cannot be expected to increase much.
Table 1 provides a compact overview of the products and tools we experimented with
and of some of the most common anti-spam methods, indicating which product/tool
uses which method.
SurfControl E-Mail
Mail
Mail
Anti-
Anti-
SpamAssassin
mySpamWall
Brightmail
Kaspersky
Bogofilter
Symantec
Symantec
MXtreme
CRM 114
Firewall
Ikarus
Spam
Spam
Filter
Commercial/
C C C C C Service O O O
Open Source
Operating System W L W W Prop Appliance L L L
Prop Methods * * * *
Whitelist * * * * * * * *
Blacklist * * * * * * * *
SPF *
Challenge/Response
Token based
Challenge/Response
*
Greylist
DCC
Fingerprint Prop DCC Prop Pyzor
Razor
Bayes * * * * *
Neural Networks * * *
URL Whitelist
URL Blacklist * *
Static techniques
* * * * * * *
(Keywords, ..)
Digital Signature
Hashcash * *
SVM
Table 1: Products/tools considered, methods used by these, further remarks see page 2
1. Introduction
Among the strengths of electronic communications media such as electronic mail
(e-mail) are relatively low transmission costs, high reliability and generally fast
delivery. Electronic messaging is not only cheap and fast; it is also easy to automate.
These properties also make it very attractive for commercial advertising, and in
recent years electronic messaging has increasingly been abused to flood users’
mailboxes with unsolicited messages.
The most common purpose for spamming is advertising. Offered goods range from
pornography, computer software and medical products to credit card accounts,
investments and university diplomas. Many of these products are of ill repute or of
questionable legality. The main motivation for spamming is commercial profit. As
mentioned above, the costs of sending millions of spam messages are very low. To
make a good profit, it suffices if only a very small fraction (0.1% or even less) of the
spam messages sent out are answered and lead to business transactions.
Spam has severe negative effects on e-mail users. Obviously, it consumes computing,
storage and network resources as well as the human time and attention needed to
dismiss unwanted messages. Moreover, it has various indirect effects that are very
difficult to account for. The spectrum ranges from measurable costs, such as spam
filter software and administration, to costs that are hard to measure, such as a lost
e-mail (expensive for a business, less so for a private person).
We can distinguish five different types of spam: Beyond e-mail spam, there is
messaging spam (often called spim – spam using instant messaging), newsgroup spam
(excessive multiple postings in newsgroups), mobile phone spam (text messages), and
Internet telephony spam (via voice over IP).
centralized solutions suitable for ISPs. In such a context, user feedback – if feasible –
can be one way to control and improve quality, but it should not be an integral part of
anti-spam methods.
Unsolicited Bulk E-mail (UBE): “E-Mail with substantially identical content sent
to many recipients who did not ask to receive it. Almost all UBE is also UCE.” [2], or:
“Unsolicited Bulk E-Mail, or UBE, is Internet mail (‘e-mail’) that is sent to a group of
recipients who have not requested it. A mail recipient may have at one time asked a
sender for bulk e-mail, but then later asked that sender not to send any more e-mail or
otherwise not have indicated a desire for such additional mail; hence any bulk e-mail
sent after that request was received is also UBE.” [4]
In our opinion, no e-mail that is solicited can be considered spam. However, there
may be spam which is not sent out in bulk or which does not involve (direct)
commercial interest. Ultimately, classification of an e-mail message as spam often
becomes a highly subjective decision, and it is very difficult, if not impossible, to
establish common criteria covering a wide range of affected users. Nevertheless, based
on the statements mentioned above, we identify three central features which we
consider defining properties of spam (not all three of them always have to apply):
Two more relevant technical terms have been established for the special context of
newsgroup postings:
There are many sources of information about statistics on the development of spam,
such as [5] or [6].
that trend, some pessimists even announce the end of the mail infrastructure for 2007
[9].
Until July 2004, the anti-spam software developer Brightmail published monthly
statistics about spam, as shown in Figure 1.
Figure 2: Interaction of legislative measures, law enforcement and percentage of spam [11]
It is clearly visible that the percentage of spam in all e-mail messages sent still has
an increasing trend, but it also tends to react significantly to the introduction of new
legislative measures and to legal actions taken against spammers. This interpretation
also has to be seen in the light of the conjecture that 80% of the spam sent worldwide
comes from very few (roughly 200) distinct spammers [12].
Others 11.76%
Table 2: The top twelve sources of spam, geographically [13]
To collect the data summarized in this table, researchers used honeypots¹. It is
interesting to note that, compared to the data from February 2004, Canada reduced
its rate from 6.8% to 2.9%, whereas South Korea tripled its rate. In general, about
40% of the world’s spam is sent out from “zombie computers”² [8].
¹ The term “honeypot” (spam trap) refers to an e-mail address never published to humans. Any e-mail
sent to such addresses has to be spam.
² The term “zombie computer” refers to a computer infected with viruses of all kinds and misused for
sending out spam.
Figure 3: The top ten sources of spam (domains) [14]
Postini [15] provides some interesting statistics investigating the sources of spam
and of directory harvest attacks³. In their graphical illustration, Austria seems to be
among the hotspots of spammers’ activity. Upon closer examination, it turns out that
the visual impression is due to the following three entries [15]:
48.22 16.37 AT VIENNA WIEN (state) RIPE ev_dictatk 20881
48.22 16.37 AT VIENNA WIEN (state) RIPE ev_spamatk 26
48.22 16.37 AT VIENNA WIEN (state) RIPE ev_dictatk 211
Whereas the first three fields of each entry specify the location (latitude, longitude,
additional location information), the last two fields describe the event type and the
intensity of the attack. That means there were 26 spam attacks (=ev_spamatk) and more
than 20,000 directory harvest attacks (=ev_dictatk) during the last six months that originated
from Austria.
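The field layout described above can be made concrete with a short Python sketch. It assumes that the fields are separated by whitespace, that latitude and longitude come first, and that event type and count come last; the helper names `parse_entry` and `totals_by_event` are hypothetical and not part of the Postini data.

```python
from collections import Counter

# The three entries quoted above, copied verbatim.
ENTRIES = """\
48.22 16.37 AT VIENNA WIEN (state) RIPE ev_dictatk 20881
48.22 16.37 AT VIENNA WIEN (state) RIPE ev_spamatk 26
48.22 16.37 AT VIENNA WIEN (state) RIPE ev_dictatk 211
"""


def parse_entry(line):
    """Split one entry into (latitude, longitude, location, event, count).

    Assumption: everything between the coordinates and the last two
    fields is treated as additional location information.
    """
    fields = line.split()
    latitude, longitude = float(fields[0]), float(fields[1])
    location = " ".join(fields[2:-2])
    event, count = fields[-2], int(fields[-1])
    return latitude, longitude, location, event, count


def totals_by_event(text):
    """Sum the attack counts per event type over all non-empty lines."""
    totals = Counter()
    for line in text.splitlines():
        if line.strip():
            *_, event, count = parse_entry(line)
            totals[event] += count
    return totals


totals = totals_by_event(ENTRIES)
print(dict(totals))
```

For the three entries, the two ev_dictatk counts add up to 21,092 directory harvest attacks, next to 26 spam attacks, which matches the figures quoted in the text.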
³ A directory harvest attack is the theft of confidential e-mail directory information, for example of
lists of e-mail addresses of all employees of an organization.
Spam categorized in terms of content (data from July 2004):
Products 25%, Financial 18%, Adult 15%, Scams 9%, Health 8%, Other 6%,
Internet 5%, Fraud 5%, Leisure 4%, Political 3%, Spiritual 2%
Spam is sent out by companies and by individuals, but primarily for a single reason:
to make a profit using a new form of direct marketing. Classical direct marketing, using
methods such as brochures, TV and radio spots, telephone calls and doorstep sales, has
been used for a long time. For these marketing methods, the costs associated with every
step in the process are significant. More importantly, the costs of direct marketing
increase proportionally with the number of potential customers reached, and revenue is
only created by selling real products or services. In this classical approach, fraud is
almost ruled out because an initial investment in advertising is necessary in order to
make money down the line.
With the availability of e-mail communication, new direct marketers were able to
reduce the costs for direct marketing to a negligible amount in proportion to the number
of potential customers reached. This increases the margin of profit considerably. In the
following, we describe this business model (spammers’ costs and revenues) in more
detail.
1.3.1.1. Cost Factors
The following list summarizes a few of the cost factors characteristic for spamming
businesses.
Product: Most of the spammers do not sell anything to the recipients of spam – they
are just acting as marketers (thus, spammers do not have any investments for actually
purchasing products).
Marketing Material: The creation of an e-mail does not need any highly specialized
software or knowledge. Thus, producing the marketing material is very cheap, and –
one of the most important differences to classical marketing – the costs for sending out
marketing material do not increase proportionally with the number of potential
customers reached.
Spam Tools: Tools for generating and sending out millions of personalized e-mail
are available, very inexpensive (often even for free) and easy to use.
The Spam Campaign: Set up an Internet connection (for example, a free trial
account), send out millions of messages from this account in a short period, and move
to the next ISP for getting a new (free) account.
Other Costs: These include hardware and maintenance costs, but may also include
costs for responding to interested buyers (automated, in order to avoid personal
interaction, for example, via a Web interface).
In the following, we list some of the most important sources of income and profit for
spammers.
Direct Income: The most common form of income for spammers is that they act as
marketing companies and are paid for marketing campaigns.
Web Banner Revenues: In many cases, spammers get revenue for every visit to a
Web site that is advertised in a spam e-mail.
Sell Spam Business Models: The above is a special case of a more general concept
where spammers sell the information collected from responses to spam messages to
others.
Scams: In many cases, spam messages are disguised attempts to obtain personal or
access information (“phishing”), such as credit card information, bank account
information, etc., which can then be used for criminal activities (theft, illegal
investment, etc.). Other kinds of scam include dubious job offers, Ponzi schemes⁴,
Internet gambling, auctions, sexual offers and pre-paid purchase orders for which the
ordered goods are never supplied [20].
Product Selling: Only a minority of companies who send out spam are also selling
the advertised products themselves.
1.3.1.3. An Example
The following example gives an impression of how spammers’ businesses operate. The
description is based on the interview [21] with an anonymous spammer who runs a
rather small-scale operation.
The spammer used an account at Send-safe [22], which allowed him to send out
400,000 e-mail messages via open proxies for US$ 50. On average, he sent out
approximately 61,000 e-mail messages per day. The recipients were taken from a CD
containing 4,000,000 e-mail addresses, which he bought for 300 Euros from [23]. It
turned out that only 56% of the addresses on the CD were syntactically correct, and
25% of these bounced due to full or disused mailboxes. For the Web site referred to in
his spam, he used bulletproof hosting⁵ in China via Worldsoftwarehouse [24], which
charges Euro 125 per month. He used a link counter [25] to get an idea of how many
persons viewed his e-mail (by counting how often it is opened in an e-mail client). On
average, per day about 30 persons ordered 2.5 units of the product offered.
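Some of the monthly quantities implied by these figures can be re-derived with a few lines of Python. This is only a rough sketch: it uses the numbers from the paragraph above and the conversion rate 1 US$ = 0.75913 Euro given for this data, and it assumes that sending capacity is billed pro rata rather than per whole block of 400,000 messages. It does not attempt to reproduce Table 3, and the variable names are ours.

```python
from fractions import Fraction

USD_TO_EUR = Fraction(75913, 100000)  # conversion rate used for this data

messages_per_day = 61_000
days_per_month = 30
batch_size = 400_000                  # messages per Send-safe purchase
batch_price_usd = 50

cd_addresses = 4_000_000
share_syntactically_correct = Fraction(56, 100)
share_bounced = Fraction(25, 100)

messages_per_month = messages_per_day * days_per_month

# Assumption: the price scales linearly with the number of messages sent.
sending_cost_usd = Fraction(messages_per_month, batch_size) * batch_price_usd
sending_cost_eur = sending_cost_usd * USD_TO_EUR

# Addresses on the CD that are well formed and do not bounce.
valid_addresses = cd_addresses * share_syntactically_correct
deliverable_addresses = valid_addresses * (1 - share_bounced)

print(f"messages per month:    {messages_per_month:,}")
print(f"sending cost:          {float(sending_cost_eur):.2f} Euro")
print(f"deliverable addresses: {int(deliverable_addresses):,}")
```

Under these assumptions, roughly 1.83 million messages are sent per month for about 174 Euro, and about 1.68 million of the 4 million addresses on the CD are usable.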
Table 3 summarizes this operation for a typical month (one month = 30 days, prices
are given in Euro⁶).
Quantity of E-Mail:
⁴ An investment swindle in which some early investors are paid off with money put up by later ones
in order to encourage more and bigger risks.
⁵ The term bulletproof is used to indicate that nothing can shut down the hosting service. Such
services can enable the sending of spam without the threat of Web site account cancellation.
⁶ For this data, the conversion rate 1 US$=0.75913 Euro was used.
Fixed costs:
Variable costs:
Revenue:
Although some costs, such as computer hardware and software, Internet access and
taxes, are not included here, this simple example shows that spamming is highly
profitable.
To get an impression of really large operations where millions of spam messages are
sent per day, see [26].
1.3.3. Conclusion
As illustrated above, e-mail communication currently provides an excellent means for
spammers to make high profits from sending out spam and from related activities. All
other Internet users, including individual users, businesses and ISPs, suffer from the
damaging side effects of spamming. These range from mailboxes filled with junk
mail, which threatens the usefulness of e-mail as a means of communication, to all sorts
of other costs in terms of bandwidth usage, storage requirements and, last but not least,
the manpower required to fight spam.
It seems obvious that advanced approaches for fighting the spam problem must
include strategies to make spamming less attractive, not only by increasing the risks
for spammers through stricter legal regulations, but also by harming spammers’
business model, that is, by decreasing the potential margins of profit. In Section 2.2 we
will discuss some approaches in this direction in detail. For best results, parts of the
current Internet infrastructure would have to be changed, and a lot of further work
remains to be done (see also [31]).
The main objective of SMTP is to support reliable and efficient mail transfer. The
Internet Message Format defines the structure of e-mail messages.
An e-mail message usually consists of three different parts – the SMTP envelope, the
header and the body. SMTP specifies a set of commands to transmit an e-mail message
between an SMTP client and an SMTP server. The exchange of these commands
between the client and the server forms the SMTP envelope and is known as the so-
called SMTP dialogue. A minimum SMTP implementation consists of nine commands.
There is also a service extension model that permits the client and server to agree to
utilize shared functionality beyond the original SMTP requirements. Table 4 shows a
typical communication scenario.
10 Client: data
11 Server: 354 Start mail input; end with <CRLF>.<CRLF>
12 Client: mail text…
13 Client: .
14 Server: 250 2.6.0 eylFCMHJ7jHp200000001@sending.server.com Queued mail for delivery
15 Client: quit
16 Server: 221 receiving.server.com Service closing transmission channel
Table 4: Typical SMTP dialogue
The first step in the SMTP dialogue (lines 1-3) establishes the connection initiated
by the client. The standard SMTP port is 25. After the connection has been established,
the server replies with code 220 (service ready, line 3). Only the reply code is relevant
for the communication; the text following it can vary. Next, the client sends a “helo”
command (line 4). In line 5, the server replies with code 250 (requested mail action
okay), finishing the SMTP handshake. After specifying sender and recipient (lines 6-9),
the client uses the “data” command (line 10) to tell the server that the message itself
will now be transferred. The server acknowledges (line 11), whereupon the client
transmits the content of the message (line 12). The end of the message is indicated by a
line containing only “.” (line 13). After receipt, an acknowledgement including an
internal message number assigned by the server is sent to the client (line 14). At the end
of the communication, the client sends a “quit” command (line 15) to close the
transmission channel. The server confirms with code 221 (service closing transmission
channel, line 16). With the
exception of the IP address of the client, any information provided by the client within
the SMTP dialogue can be forged (cf. [36]).
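The sequence of commands and reply codes described above can be traced with a toy model in Python. This is a minimal in-memory sketch, not a real SMTP implementation: the class `ToySmtpServer`, the function `run_dialogue` and the two mailbox addresses are hypothetical, and the model neither checks command order nor validates anything the client claims (which also illustrates how easily the dialogue can be forged).

```python
class ToySmtpServer:
    """Minimal in-memory stand-in for an SMTP server (illustration only)."""

    def __init__(self, hostname="receiving.server.com"):
        self.hostname = hostname
        self.in_data = False      # True while the message text is transferred
        self.message_lines = []

    def greeting(self):
        return f"220 {self.hostname} service ready"

    def handle(self, line):
        """Return the reply to one client line, or None while reading data."""
        if self.in_data:
            if line == ".":       # a line containing only "." ends the message
                self.in_data = False
                return "250 queued mail for delivery"
            self.message_lines.append(line)
            return None
        command = line.split(" ", 1)[0].lower()
        if command in ("helo", "mail", "rcpt"):
            return "250 OK"       # accepted without any verification
        if command == "data":
            self.in_data = True
            return "354 Start mail input; end with <CRLF>.<CRLF>"
        if command == "quit":
            return f"221 {self.hostname} Service closing transmission channel"
        return "500 command not recognized"


def run_dialogue(server, client_lines):
    """Feed client lines to the server and collect the numeric reply codes."""
    codes = [int(server.greeting()[:3])]
    for line in client_lines:
        reply = server.handle(line)
        if reply is not None:
            codes.append(int(reply[:3]))
    return codes


client_lines = [
    "helo sending.server.com",
    "mail from:<alice@sending.server.com>",  # hypothetical sender
    "rcpt to:<bob@receiving.server.com>",    # hypothetical recipient
    "data",
    "mail text...",
    ".",
    "quit",
]
server = ToySmtpServer()
codes = run_dialogue(server, client_lines)
print(codes)
```

The collected codes follow the order described in the text: 220 on connection, 250 after “helo” and after the sender and recipient commands, 354 after “data”, 250 once the message has been accepted, and 221 after “quit”.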
Message Header. All header fields have the same general syntactic structure: A field
name, followed by a colon, followed by the field body. The header fields can be
grouped into “originator fields”, “destination address fields”, “identification fields”,
“information fields”, “resent fields”, “trace fields” and “optional fields”. The “trace
fields” are also discussed in [32].
Table 5 summarizes the most important header fields in the Internet Message
Format.
Originator fields:
  From:         Specifies the author of the message
  Sender:       Sender of the message
  Reply-To:     Reply address
Destination address fields:
  To:           Primary recipient(s)
  CC:           Other recipients
  BCC:          Blind carbon copy (addresses are not submitted to the other recipients)
Identification fields:
  Message-ID:   Unique message identifier
Information fields:
  Subject:      Subject of the message
Trace fields:
  Return-Path:  The address to which messages indicating non-delivery or other mail
                system failures are sent
  Received:     When an SMTP server accepts a message, either for relaying or for
                delivery, it inserts a trace record including the sending and the
                receiving host and the arrival date and time of the message
Table 5: The most important header fields in the Internet Message Format [33]
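As an illustration of this structure, the following Python sketch parses a small message with the standard library `email` package and sorts its header fields into the groups of Table 5. The sample message, its addresses and the helper `group_headers` are made up for this example.

```python
from email import message_from_string

# Hypothetical sample message: header fields, an empty line, then the body.
RAW_MESSAGE = """\
Return-Path: <alice@sending.server.com>
Received: from sending.server.com by receiving.server.com; Tue, 1 Mar 2005 10:00:00 +0100
From: Alice <alice@sending.server.com>
To: Bob <bob@receiving.server.com>
Subject: Test message
Message-ID: <1234@sending.server.com>

mail text
"""

# Field groups as listed in Table 5.
FIELD_GROUPS = {
    "originator": ["From", "Sender", "Reply-To"],
    "destination": ["To", "CC", "BCC"],
    "identification": ["Message-ID"],
    "information": ["Subject"],
    "trace": ["Return-Path", "Received"],
}


def group_headers(message):
    """Return {group: {field name: [values]}} for the fields that are present."""
    grouped = {}
    for group, names in FIELD_GROUPS.items():
        found = {}
        for name in names:
            values = message.get_all(name)  # None if the field is absent
            if values:
                found[name] = values
        grouped[group] = found
    return grouped


message = message_from_string(RAW_MESSAGE)
grouped = group_headers(message)
for group, fields in grouped.items():
    print(group, fields)
print("body:", message.get_payload().strip())
```

`get_all` is used instead of simple indexing because some fields, in particular “Received:”, can occur several times, once for every server that handled the message.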
Message Body. The second part of a message, called the message body, contains the
information itself and, if structured, is defined according to the MIME standard
(Multipurpose Internet Mail Extensions).
The message transfer between the original sender and the final recipient can occur in
a single connection or in a series of hops through intermediary systems. The relaying of
messages from unknown sources to unknown destinations causes one of the biggest
problems of today’s mail traffic, because spammers often use open relays⁷ for
transmitting their mail.
In Section 2.3.1 we will discuss existing methods for identifying spam based on
header information in detail. Additional detailed information is also given in the
diploma thesis [36].
There are many other techniques spammers use to mislead or bypass filters. Again,
we only give a very brief survey here; many of those techniques will be mentioned
again in the context of the respective anti-spam method in Chapter 2. One important
technique, although it consumes a lot of processing power, is the personalization of
every message for its receiver (no BCC, every receiver gets his “own” e-mail) in
order to obscure bulk mailing. Less processor-intensive is the randomization of the
subject field and the “From:” address line. Other techniques commonly used are forging
the Message-ID, omitting the “To:” header, or adding random words and strings to a
message in order to mislead Bayes filters.
⁷ An open relay is an SMTP or ESMTP server that provides unrestricted relaying services to everyone.
Table 6 shows some of the main techniques used by spammers and how their
approach has shifted in the last two years in response to the development of anti-spam
methods.
2. Anti-Spam Methods
In recent years, a vast number of methods and techniques for coping with the spam
problem have been proposed and developed, ranging from legal countermeasures to
very technical approaches. This is also reflected in the large number of publications on
the topic. In order to bring some structure into this enormous amount of information,
we introduce a categorization of anti-spam methods, shown in Figure 5.
Our basic distinction is between methods “acting” before an e-mail is sent out (“pre-
send”), methods “acting” after the message has been sent out (“post-send”), and new
regulations “acting” during the transfer of an e-mail (new protocols for mail transfer).
This comprises virtually all existing approaches, ranging from attempts to decrease the
amount of spam sent out to approaches based on text analysis and classification
methods applied to a received e-mail.
In this chapter, we will discuss all these methods in detail. We will also point out
relations between relevant techniques, and evaluate them from the perspective taken in
this study.
In the context of anti-spam methods (and in all subsequent parts of this report) we
follow the widespread convention of using the term “positives” to denote spam
messages and the term “negatives” to denote ham messages. Consequently, any
message will be classified as “positive” (spam) or “negative” (ham) by the anti-spam
method. If a message actually is spam but was (wrongly) classified as negative, it
is called a “false negative”. If it actually is ham but was (wrongly) classified as
positive, it is called a “false positive”.
Table 7 summarizes this concept. Each row corresponds to the known type of a
message, and each column denotes the class assigned by a binary classifier. According
to the table, a positive can be either a true positive, a spam message classified as spam,
or a false positive, a ham message classified as spam. On the other hand, a ham
message assigned to the ham group is a true negative, whereas a spam message that is
classified as ham is a false negative.
⁸ Which of the two given classes is denoted as positives or negatives is up to the beholder.
Based on these quantities, relative quality criteria of a binary classifier can be
defined:

\[
\text{sensitivity} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}
\]

and

\[
\text{specificity} = \frac{\text{true negatives}}{\text{true negatives} + \text{false positives}}.
\]
Both of these quality metrics lie between zero and one (often quoted as a
percentage), and each of them measures the correctness per class. The sensitivity of a
spam classifier is the proportion of all spam messages that are classified as spam.
The closer the sensitivity is to one, the more spam is classified correctly.
Specificity denotes the correctness for the negatives or ham, respectively.
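The two metrics can be computed directly from the four outcome counts. The following Python sketch uses a small, made-up set of nine messages; the function names and the sample labels are assumptions for illustration only.

```python
def confusion_counts(actual, predicted):
    """Count outcomes, with 'spam' as the positive and 'ham' as the negative class."""
    counts = {"tp": 0, "fn": 0, "tn": 0, "fp": 0}
    for truth, guess in zip(actual, predicted):
        if truth == "spam":
            counts["tp" if guess == "spam" else "fn"] += 1
        else:
            counts["tn" if guess == "ham" else "fp"] += 1
    return counts


def sensitivity(counts):
    """Proportion of all spam messages that were classified as spam."""
    return counts["tp"] / (counts["tp"] + counts["fn"])


def specificity(counts):
    """Proportion of all ham messages that were classified as ham."""
    return counts["tn"] / (counts["tn"] + counts["fp"])


# Made-up example: 4 spam and 5 ham messages with one error in each class.
actual = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
predicted = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "spam"]

counts = confusion_counts(actual, predicted)
print(counts)
print("sensitivity:", sensitivity(counts))
print("specificity:", specificity(counts))
```

In this example one of four spam messages is missed (sensitivity 0.75) and one of five ham messages is wrongly flagged (specificity 0.8).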
2.2.1.1. Technical Solutions
Most technical solutions are based on CPU time: The sender of an e-mail is required to
compute a moderately expensive function, a so-called pricing function, before the
e-mail is actually sent. Since e-mail is generally not expected to be a medium for
real-time communication, such a moderate delay per e-mail is not expected to matter to
the average regular e-mail user, who in most cases does not send much more than
20-50 e-mail messages a day. For a spammer, however, it is very disruptive,
because it reduces the number of potential customers reached per unit of time (for
details, see [31]).
Since there is no need to change the SMTP protocol, it is easy to install such a
system with a pricing function. This approach has a major drawback, though: the
lack of fairness of most pricing functions found so far. Ideally, a pricing function
should be “fair” in the sense that the delay it causes is independent of the
hardware of the computer system. Many solutions have been proposed, for example,
CPU-bound functions, memory-bound functions, or Turing tests [38]. Different ways
of using memory-bound functions in particular currently receive a lot of attention
[39][40]. An example of a Turing-type test based on human interaction is mentioned in
Section 2.3.1 (SFM). However, finding a pricing function that leads to at least
comparable delays on old, slow computers and on the latest hardware remains an
open problem.
In the following, we take a closer look at a relatively widespread and well-known
representative of technical solutions for increasing sender costs: Hashcash, which is
based on a CPU-bound function.
Hashcash [41] is a software plug-in for mail clients that adds Hashcash stamps to
sent e-mail. Adding a Hashcash stamp means inserting a line starting with
“X-Hashcash:” into the header of a message, as shown in Table 8:
Date: xx.xx.200x
X-Hashcash: 0:030626:adam@cypherspace.org:6470e06d773e05a8
In order to create a Hashcash stamp, CPU time needs to be “spent” (a few seconds
on an average desktop computer). One stamp is required for each individual
recipient (even if the message is sent as BCC), and it indicates the degree of difficulty
of the task performed in order to “spend” CPU time. It is expected that the more
difficult this task is (and thus, the more CPU time is spent) for an e-mail, the less likely
this e-mail is to be spam. Thus, Hashcash stamps can be used as (part of) a criterion for
deciding whether or not to accept an e-mail message.
Technically, the tasks used by Hashcash are based on hash functions, more
specifically on so-called partial hash-collisions. A hash function H is a cryptographic
function for which it is supposedly hard to find two inputs that produce the same
output. A collision occurs, if two inputs do produce the same output: H(x) == H(y)
although x != y.
Common hash functions, such as MD5 or SHA1, are designed to be collision
resistant (it is very hard to find x != y with SHA1(x) == SHA1(y)). For
common hash functions, computing a full collision is almost impossible, but partial
collisions can be found more easily. In contrast to a full collision, where all bits of, for
example, SHA1(x) must match those of SHA1(y), a k-bit partial collision only requires
the k most significant bits of SHA1(x) and SHA1(y) to match. On a 400 MHz PII, a
16-bit partial collision for SHA1 can be computed in about one third of a second,
whereas computing a 32-bit collision would take seven hours.
Hashcash uses the recipient’s mail address and the current date as inputs for the
hash-collision.
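The search for a partial collision can be sketched in a few lines of Python. This is an illustration of the principle, not the Hashcash tool itself: the stamp layout is modeled on the example in Table 8, the all-zero digest is assumed as the collision target, a simple hexadecimal counter serves as the variable part, and the names `mint_stamp` and `check_stamp` are hypothetical.

```python
import hashlib
from itertools import count

DIGEST_BITS = 160          # length of a SHA-1 digest in bits
ZERO_TARGET = bytes(20)    # assumed collision target: the all-zero digest


def top_bits(digest, k):
    """Return the k most significant bits of a 20-byte digest as an integer."""
    return int.from_bytes(digest, "big") >> (DIGEST_BITS - k)


def mint_stamp(resource, date, k, target=ZERO_TARGET):
    """Try counters until SHA1(stamp) matches `target` in its top k bits."""
    for counter in count():
        stamp = f"0:{date}:{resource}:{counter:x}"
        digest = hashlib.sha1(stamp.encode("ascii")).digest()
        if top_bits(digest, k) == top_bits(target, k):
            return stamp, counter + 1   # stamp and number of attempts needed


def check_stamp(stamp, resource, k, target=ZERO_TARGET):
    """Verify a stamp with a single hash computation."""
    fields = stamp.split(":")
    if len(fields) != 4 or fields[2] != resource:
        return False
    digest = hashlib.sha1(stamp.encode("ascii")).digest()
    return top_bits(digest, k) == top_bits(target, k)


# A small k keeps the demonstration fast.
stamp, attempts = mint_stamp("adam@cypherspace.org", "030626", 12)
print(stamp, "found after", attempts, "attempts")
print("valid:", check_stamp(stamp, "adam@cypherspace.org", 12))
```

Minting requires on average about 2^k attempts, whereas checking takes a single hash computation. Each additional bit doubles the expected work for the sender, so going from a 16-bit to a 32-bit collision increases the expected effort by a factor of 2^16.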
The basic idea behind money-based solutions is to “pay” some amount of (possibly
symbolic) currency (micro payment) for each e-mail to be sent. The idea is that an e-
mail is more likely to be ham the higher the amount paid for its delivery. In the
following, we describe a concrete proposal for implementing this idea.
Based on this protocol, Turner et al. [42] also propose a payment mechanism where
servers require payment for accepting incoming messages. The mail transfer agent is
responsible for organizing the payment, so the client is not involved. Currently, delivery
costs per e-mail message are estimated to be about 0.01 US cent (which corresponds to
US$ 100.- for 1,000,000 e-mail messages). Even if the price was raised to 1 US cent per
e-mail (which corresponds to US$ 10,000 for 1,000,000 messages), sending e-mail
would still be very cheap compared to sending snail mail (which costs more than 20 US
cents per letter).
recipient verifies the payment and the issuer responds with an account activity
statement.
Each user can earn some quota of a currency, and spammers would be forced to
make investments to purchase credits using real-world money, which narrows their
margin of profit. Since the costs for sending e-mail messages increase linearly with the
number of messages sent in this model, it is expected that spammers are forced to
increase their rate of return and thus they need to focus their efforts on recipients where
they have a high probability of revenue (which contradicts the current business model
of spammers illustrated in Section 1.3).
Generally speaking, methods for increasing sender costs and thus harming the business
model of spammers are a very interesting and promising approach to address the spam
problem. In contrast to many other approaches which tend to focus on the “symptoms”
only, they try to fight the problem at its “root” and consequently avoid the demand of
resources common to all approaches acting later in the spamming process. Moreover,
they are not user-specific, technically very accessible to ISPs and e-mail providers and
thus they fit very naturally into concepts suitable from the perspective of an ISP.
However, there are still a few important shortcomings, which lead us to believe that
those methods alone will not suffice, but rather will have to be integrated and combined
with other approaches in “multibarrier concepts”.
In the area of technical solutions, one of the main open questions is how to adapt
pricing functions to different hardware. Whereas CPU-bound pricing functions (such as
Hashcash) suffer from a possible unfairness due to differences in processing speeds
among different types of computer systems, some experts expect memory-bound
pricing functions [39] to be less sensitive to this problem.
A simple example illustrates the promise of this approach as well as its potential
shortcomings: if due to a pricing function it takes ten seconds to process an outgoing
e-mail, one computer can send at most 8,640 e-mail messages in 24 hours. Without a
pricing function, an estimated 2-3 billion spam messages can be sent per day. In order
to achieve this output with the pricing function of this example, the spammer would
need roughly 230,000 to 350,000 computers. However, if pricing functions are needed
such that even an average user can send out only two e-mail messages per hour and if
users with old hardware experience even bigger delays, then this approach renders itself
useless because it limits the effectiveness of e-mail as a communication medium in general.
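The arithmetic behind this example can be checked with a short Python sketch. Exact division gives figures of the same order of magnitude as those quoted above. The function names are illustrative, and the only inputs are the delay per message and the daily spam volume assumed in the text.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def mails_per_day(seconds_per_mail: float) -> int:
    """Upper bound on messages one computer can send in 24 hours."""
    return int(SECONDS_PER_DAY // seconds_per_mail)


def machines_needed(target_per_day: int, seconds_per_mail: float) -> int:
    """Number of computers required to reach a daily message volume."""
    return math.ceil(target_per_day / mails_per_day(seconds_per_mail))


if __name__ == "__main__":
    print(mails_per_day(10))                    # ten-second pricing function
    print(machines_needed(2_000_000_000, 10))   # lower spam estimate
    print(machines_needed(3_000_000_000, 10))   # upper spam estimate
    print(mails_per_day(1800))                  # two messages per hour
```

The last call shows the other side of the trade-off: a delay of 30 minutes per message leaves a regular user with 48 messages per day.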
Careful discussion of central questions (optimum type of pricing function, optimum
delay, etc.) at a scientific level is beyond the scope of this report, but is included in a
diploma thesis currently under preparation [31].
The main problems of money-based solutions are the relatively high administration
overhead and the fact that the very popular free e-mail accounts do not fit into this
strategy.
The last point leads to a potential general weakness of all current approaches to
increasing sender costs: their success would require some degree of coordination
among providers of e-mail services and the commitment of at least a significant part of
those providers worldwide. If only a minority of e-mail services worldwide adopts
policies for increasing sender costs, spammers will simply evade the obstacle and pick
providers who do not implement such policies.
Similar to the careful analysis of the pricing functions still required for technical
solutions to increasing sender costs, the optimal fee structure in money-based solutions
still needs to be investigated carefully. The concept will be considered for practical
application only if it can be shown how to set it up in practice such that the amount of
spam is reduced significantly without burdening regular e-mail users too much.
The European Union and the United States of America have both decided to enact a
legal basis for criminal prosecution of senders of UBE. Detailed information on the
different legal systems of the United States, the European Union and other countries is
available at [43] and in the summary given by Sabadello [44]. Since law is not
our area of expertise, we will only give a very short overview in this section.
Opt-In vs. Opt-Out. Generally speaking, one can distinguish between an opt-in and an
opt-out system for anti-spam regulations. Opt-in means that nobody is allowed to send
UBE unless the receiver has explicitly agreed to receive such messages. In an opt-out
system, anybody is allowed to send UBE to anybody else as long as the receiver has the
possibility to opt out at any time he wants, that is, to declare that he does not want to
receive such messages any more.
USA. The United States of America implemented the CAN-SPAM Act [45] on January
1, 2004. This is an opt-out system. It is highly contested, because (like any opt-out
system) the act of opting out gives a spammer the possibility to verify that a mail
address is valid. Consequently, if the receiver tries to opt out via an automated
mechanism offered in the message, he may receive even more spam afterwards because
his e-mail address has been “verified”.
Despite this potential weakness, the CAN-SPAM Act also provides a basis to deal
with some major problems of unsolicited bulk e-mail: it is the basis for criminal
prosecution for header forging and for relaying commercial mail through open proxies or
through other infrastructure that is used for concealing the sender's identity, and it also
prohibits address harvesting and dictionary attacks.
European Union. The European Union decided to implement an opt-in system. In June
2000, the European Parliament passed the directive on electronic commerce [46] and in
June 2002 the directive on privacy and electronic communication [47], which form the
basis for legal action. Consequences for sending UBE are not covered in these
directives; it is incumbent upon the individual member states of the European Union to
define them.
Austria. The current legal situation in Austria distinguishes between private individuals
and companies. According to § 107 of the bill on Telecommunication [48], sending
UBE to private individuals is not allowed and requires the previous agreement of the
individual (opt-in). This also covers commercial e-mail as well as other e-mail with
more than 50 recipients. The situation for companies is completely different. In general,
it is allowed to send UBE to companies as long as the recipient has a possibility to opt
out (similar to the CAN-SPAM act). Moreover, the bill on Electronic Commerce [49]
introduced the maintenance of a so-called “Robinson List” containing all individuals
and companies that in no case want to receive UBE. This list has to be taken into
consideration even when the delivery of mail would be allowed by the bill on
Telecommunication.
Practical experience shows that, although there is some deterrent effect (cf. Figure 2),
neither the legal framework of the United States nor that of the European Union will
be able to completely solve the spam problem. Spammers do not care too much about
any legal consequences because they can easily hide their identity or even move their
operations to other countries where no legal basis for prosecution exists. Similar to
approaches for increasing sender costs, legal action against spammers requires a much
higher degree of coordination among countries worldwide than currently exists.
source is legitimated to use a claimed identity. The third method is to verify that the
sender is willing to invest some additional effort to contact the receiver. In detail we
discuss:
• Blacklists, whitelists (good/bad sender)
• Sender Policy Framework, Caller ID, Sender ID, Domain Keys (legitimate/non
legitimate sender)
• Greylists, ChoiceMail and SFM (challenge-response systems)
[Figure: Blacklist lookup via DNS. 1. Connecting: the SMTP client (IP: 101.105.32.23) connects to the SMTP server (mysmtpserver.com). 2. DNS-Lookup: the server queries rbl.org for 23.32.105.101.rbl.org. 3. Answer: 127.0.0.2. 4. Disconnect.]
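The lookup shown in the figure can be sketched in Python as follows. The sketch assumes the convention visible in the figure: the octets of the client address are reversed and prepended to the blacklist zone, and any successful answer (such as 127.0.0.2) means that the address is listed. The zone name `rbl.org` is the placeholder from the figure, and the function names are illustrative.

```python
import ipaddress
import socket


def dnsbl_query_name(client_ip: str, zone: str) -> str:
    """Build the DNS name to look up: reversed IPv4 octets plus the blacklist zone."""
    octets = str(ipaddress.IPv4Address(client_ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(client_ip: str, zone: str) -> bool:
    """Return True if the blacklist zone resolves the query name (network access needed)."""
    try:
        socket.gethostbyname(dnsbl_query_name(client_ip, zone))
    except socket.gaierror:
        # No record found (or lookup failure): treated as "not listed" in this sketch.
        return False
    return True
```

For the client in the figure, `dnsbl_query_name("101.105.32.23", "rbl.org")` yields the name `23.32.105.101.rbl.org`. A production setup would have to distinguish lookup failures from negative answers, which this sketch does not do.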
2.3.1.2. Is the Sender Legitimate?
Important efforts have also focused on developing methods and techniques for
determining whether the sender of an e-mail can be authenticated or whether he is
legitimate. This includes various kinds of policy frameworks [53] or digital signatures
[54].
The underlying idea is on the one hand that spammers do not want to be
authenticated in order to avoid criminal prosecution and on the other hand that, for the
same reason, spammers tend to fake header information in their e-mail (cf. Section
1.4), which may lead to inconsistent information (for example, the claimed sender is
not authorized to use the claimed sending mail server).
In the following, we will briefly summarize the most important techniques in this
area. They were originally submitted as proposals to the Internet Engineering Task
Force (IETF, www.ietf.org).
Proposals for Anti-Spam Standards for coping with the spam problem, submitted to
the IETF in 2004.
SPF, Sender-ID and DomainKeys are concepts to eliminate the possibility of domain
spoofing. The protocols try to leave the common transmission process unaffected and
interoperate with SMTP in order to ease their distribution and acceptance. All
described protocols have in common that they use DNS for verifying e-mail. This
causes additional network traffic, because every received e-mail must be checked
against the domain specified in the sender's e-mail address.
The Sender Policy Framework (SPF) [55], developed by Meng Wong and Mark
Lentczner, uses the “MAIL FROM:” identity of the SMTP dialogue to verify the
sender's domain. This allows rejecting mail already within the SMTP dialogue. The
protocol is a hybrid of the Designated Mailers Protocol [56] and the Reverse MX
Protocol [57]. An SPF-Record designates the outbound SMTP servers of the sender's
domain. When an SMTP client connects to a mail exchanger, the server looks for an
SPF-Record in the DNS-tree of the claimed sender domain. If the result received from
the DNS-Query contains the IP-Address of the client, the sender is authorized to use the
domain in the “MAIL FROM:” argument. If not, the domain was spoofed.
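The check described above can be sketched in Python under strong simplifying assumptions: the SPF record is passed in as a string instead of being fetched from DNS, and only unqualified `ip4:` entries of a record starting with `v=spf1` are evaluated. All other mechanisms and qualifiers of the SPF syntax are ignored here, and the function name is illustrative.

```python
import ipaddress


def spf_authorizes(record: str, client_ip: str) -> bool:
    """Return True if client_ip falls into one of the record's ip4 networks."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return False
    address = ipaddress.ip_address(client_ip)
    for term in terms[1:]:
        if term.lower().startswith("ip4:"):
            # strict=False tolerates host bits in the network notation
            network = ipaddress.ip_network(term[4:], strict=False)
            if address in network:
                return True
    return False
```

A receiving server would call such a function with the record found for the domain of the “MAIL FROM:” argument and the IP-Address of the connecting client, and could reject the message within the SMTP dialogue if the result is negative.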
The Caller-Id [58] concept, developed by Microsoft, realizes the same idea but
uses the so-called “purported responsible address” (PRA) for verification. The purported
responsible address refers to the mailbox that has directly initiated the transmission
process. It is determined by inspecting the header of the message. For example, if the
header contains a “From:” field and a “Sender:” field, the PRA is extracted from the
“Sender:” field [59]. Both Caller-Id and SPF suffer from the fact that, when mail
forwarding systems or mailing lists are involved in the transmission process, the
IP-Address of the client often cannot be mapped to the domain of the sender. Therefore,
additional concepts like SRS [60] (Sender Rewriting Scheme) must be implemented.
The Sender-Id [61] framework is the result of a merger between Caller-Id and SPF.
DomainKeys [62] also use DNS but the verification process works via digital
signature instead of IP-Addresses. The sending side of this variant consists of two steps.
• Set up: In this first step, the domain owner generates a public/private key
pair. This key pair is used for signing all outgoing mail. The DNS holds the
public key, and the private key is located at the outbound mail server.
For verifying an e-mail on the receiver side, three steps are necessary:
• Preparing: The DomainKey enabled system on the receiver side extracts the
signature and the claimed “From:” domain from the e-mail headers and
fetches the public key from the DNS for the claimed “From:” domain.
• Verifying: With the public key obtained from the DNS, the receiving e-mail
system verifies whether the signature was generated by the matching private key.
2.3.1.3. Challenge-Response
Challenge-response systems initially block or hold e-mail from unknown senders. The
senders are notified of the blocking, then required to prove they are human by taking a
“quasi-Turing test”. If they pass, the e-mail is delivered [63].
There are many implementations of challenge-response systems – we take a closer
look at three different types – greylists, a human interaction system called ChoiceMail
and a subscription mail server called SFM (Spam Free Mail).
Greylisting [64] is an aggressive method for blocking spam. It uses the fact that
sending spam is not failure tolerant. Because spammers often do not know if their
recipient addresses actually exist, they do not try to resend messages if an error occurs
during the transmission process.
When a client connects to a SMTP server using a Greylist, the server records the
following information:
The server then compares this triplet to a local database. If no record matches, the
message is refused with a “temporary failure” response and the triplet is stored.
RFC compliant MTAs usually try to resend such a message within a certain period of
time. When the message is received a second time within a specified time slot
(normally after an initial blocking period has elapsed and before the triplet expires),
the message is delivered.
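The greylisting decision can be sketched in Python as follows. The sketch assumes that the triplet consists of the client IP address, the envelope sender and the envelope recipient, as in common descriptions of greylisting. The class name, the in-memory dictionary and the default time limits are illustrative assumptions.

```python
class Greylist:
    """In-memory greylist keyed by (client IP, envelope sender, envelope recipient)."""

    def __init__(self, min_delay: float = 300.0, max_age: float = 4 * 3600.0):
        self.min_delay = min_delay  # blocking period before a retry is accepted
        self.max_age = max_age      # lifetime of a stored triplet
        self.first_seen = {}

    def check(self, client_ip: str, sender: str, recipient: str, now: float) -> str:
        triplet = (client_ip, sender, recipient)
        first = self.first_seen.get(triplet)
        if first is None or now - first > self.max_age:
            # Unknown or expired triplet: store it and answer "temporary failure".
            self.first_seen[triplet] = now
            return "defer"
        if now - first < self.min_delay:
            # Retry arrived before the blocking period elapsed.
            return "defer"
        return "accept"
```

A sender that never retries, as the text assumes for most spam software, never gets past the first `"defer"`.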
ChoiceMail [65] is available in different editions (free for home use, Server edition
and Enterprise edition) and uses a challenge-response system.
This short e-mail directs the sender to a Web page where he will be asked for his or
her name, e-mail address and reason for contacting you. The sender also will be asked
to fill in a code that appears on the screen as a graphic, something a person can do
easily but a computer cannot do at all.
This simple process eliminates almost all junk e-mail for two reasons. First,
spammers usually use invalid reply addresses and therefore never receive the
registration request. Second, spammers depend on automation, and the registration
response cannot be automated. The registration feature can be turned off.
The principles of operation are very easy. There are two types of dynamic addresses:
publishable (=master) and personal (aliases). An alias is intentionally restricted to a
single contact or a group. If someone is trying to contact you for the first time, he sends
an e-mail to your master address. This message never reaches its destination; the sender
instead gets a challenge like this.
For more information, and also if you cannot see the image
that has arrived with this message, please follow THIS
LINK.
A new alias remains open for a predetermined amount of time and during this time
anyone can use it to send you messages. Whenever this happens, the sender’s address is
added to the alias personalization. After it becomes closed, it will only accept e-mail
from senders on the personalization list.
When a message is sent through the server, it locates the proper alias personalized to
the recipient, or if no such alias is available, generates a new alias personalized to the
recipient on the fly and forwards the message to the recipient substituting the alias for
your sender address.
2.3.1.4. Strengths and Weaknesses
Low resource requirements and ease of maintenance are the two main benefits of
blacklists. Any spam message can be rejected before it is downloaded. Another big
advantage is that some spammers remove e-mail addresses automatically if an e-mail is
rejected. Only a few configuration changes are necessary inside the server
software.
A big disadvantage is the lack of granularity: either all of the e-mail from a given
host is accepted, or all of it is rejected. Some spammers try to hide behind big ISPs
and use Hotmail or AOL accounts for spamming (see Figure 3). One big problem of
blacklists is the possible refusal of legitimate mail because blacklists are often poorly
maintained and not up to date.
Whitelists have limitations similar to those of blacklists. If a spammer
spoofs a whitelisted address, he will get through a whitelist. They must be updated
regularly, which takes some time, but black- and whitelists typically stop around 10% of
spam [67].
The header of an e-mail includes various information about the sender and the mail
infrastructure involved during the transmission process. Generally, any information
given in the SMTP dialog and the header can be forged because there are no integrity
checks and authentication mechanisms defined in standard SMTP. The only reliable
information is the IP-Address of the client. Spammers often forge header entries of an
e-mail. They try to prevent the backtracking of the messages to keep their identity secret.
There is no other reason to forge a header entry but to conceal one’s identity. The
following analysis gives an overview of what can be forged and focuses just on
entries that give information about the sender or the mail infrastructure involved.
“Return-Path:”:
The Return-Path is just a record of the argument specified in the “MAIL FROM:”
command during the SMTP dialog. If that argument is forged, the Return-Path cannot
be trusted either.
“Received:” Lines:
“Received:” lines are the most important header entries for backtracking messages
and for fixing bugs in a mail environment. RFC compliant MTAs must prepend a
“Received:” line for messages that are not routed in a private area. Like any other
information, it can easily be forged in such a way that it is not possible to distinguish
whether it is manipulated or unaltered. If a spammer uses an open proxy, there is no
reason to forge any “Received:” line because the IP-Address of the spammer does not
appear in the message. As a receiver, you can only trust the lines that were added
by your own MTAs.
“Date:” Inconsistent “Date:” fields can be ascribed to various scenarios. They can
result from a forgery as well as from different time zones of sender and receiver. In
addition, a bad configuration of the processing MTAs is possible, so you cannot classify
a message as spam merely because an inconsistency is detected.
“From:”, “Sender:”, “Reply-to:”, “To:”: These fields can be forged easily. Only a few
syntactical checks can be performed to verify that the entry in such a field may
represent a valid mailbox.
The following example shall illustrate that it is impossible to identify a forged header
in most cases.
Return-Path: <merrileekosmala@pakistan.com>
Received: from mx6.univie.ac.at (mx6.univie.ac.at [131.130.1.49])
by atat.at (8.12.10/8.12.10) with SMTP id i7I5LSaT007420
for <thornaper.maraja@xyz.at>; Wed, 18 Aug 2004 06:21:29 GMT
Received: from pakistan.com (pakistan.com [222.65.113.88])
by mx6.univie.ac.at (8.12.10/8.12.10) with SMTP id i7I5D126028130
for <thornaper.maraja@xyz.at>; Wed, 18 Aug 2004 08:13:28 +0200
Message-ID: <C14061DC.51DE2D3@pakistan.com>
Date: Wed, 18 Aug 2004 15:14:18 +0900
From: "jamison tevlin" <merrileekosmala@pakistan.com>
To: "Thornaper Maraja" <thornaper.maraja@xyz.at>
A consistent header is not a sign of legitimate mail. Because of the various
scenarios that occur within the mail distribution process, an inconsistent header is
not a sign of spam either. Plausibility checks can only detect very simple forgeries and
cannot be used for efficient spam detection.
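A simple plausibility check of this kind can be sketched in Python: it compares the “Date:” field of the example header with the timestamp of the first “Received:” line, after normalizing both to a common time zone. The function name is illustrative; the timestamps are those of the example header.

```python
from email.utils import parsedate_to_datetime


def skew_seconds(date_header: str, received_stamp: str) -> float:
    """Seconds by which the Date: field is later than a Received: timestamp."""
    sent = parsedate_to_datetime(date_header)
    received = parsedate_to_datetime(received_stamp)
    return (sent - received).total_seconds()


if __name__ == "__main__":
    skew = skew_seconds("Wed, 18 Aug 2004 15:14:18 +0900",
                        "Wed, 18 Aug 2004 08:13:28 +0200")
    print(skew)
```

For the example, the “Date:” field lies 50 seconds after the time at which mx6.univie.ac.at claims to have received the message. Such a small discrepancy is equally well explained by unsynchronized clocks, which illustrates why the check alone does not separate forged from genuine headers.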
Keyword based approaches involve simple searches of the body and/or the subject line
of a message for specific keywords and phrases like “Viagra”, “Cialis” or “get this for
free”. If these words or phrases appear, this fact is used as an indicator for spam. The
three main types of keyword based matching are described below.
Keyword Based: Search for words or phrases that match exactly. For example,
“Viagra” only matches “Viagra”.
Pattern Matching: Covers simple variations by mixing constant text and flexible
components like wildcards, case (in)sensitivity and number of occurrences. This kind of
pattern matching is based on regular expressions [68]. For example, “V*i*a*g*r*a”
matches “Viagra”, “V.i.a.g.r.a”, “Vviiaaggrraa”, ...
Rule Based: Rules are more complex constructs a message can be checked against.
For instance, the rule “Mentions Generic Viagra” detects if generic Viagra is a main
topic in a given message (via several regular expressions). It is a common practice to
assign a certain value to each rule and to sum up those values to compute an overall
spam rating (see Section 3.3.1).
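The three types of matching can be combined in a small Python sketch. The wildcard pattern from the text is written here in the syntax of Python's `re` module, which is one possible rendering; the rule names, weights and the way scores are summed are illustrative assumptions and not taken from any particular filter.

```python
import re

# (rule name, compiled pattern, score added when the pattern is found)
RULES = [
    ("exact keyword", re.compile(r"\bViagra\b"), 1.0),
    ("obfuscated keyword",
     re.compile(r"v+\W*i+\W*a+\W*g+\W*r+\W*a+", re.IGNORECASE), 2.0),
    ("free offer phrase", re.compile(r"get this for free", re.IGNORECASE), 1.5),
]


def spam_score(text: str) -> float:
    """Sum the scores of all rules whose pattern occurs in the text."""
    return sum(score for _, pattern, score in RULES if pattern.search(text))


def matched_rules(text: str) -> list:
    """Names of the rules that fired, useful for explaining a rating."""
    return [name for name, pattern, _ in RULES if pattern.search(text)]
```

A filter would compare the resulting score with a threshold; choosing the weights and the threshold is the actual difficulty and is not addressed by this sketch.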
URL analysis in its simplest form means white- or blacklisting of URLs (compare
Section 2.3.1). However, more sophisticated approaches have been developed,
such as the one explained in this section, which combines several techniques.
Filtering Spam Using Search Engines [69]. An approach for filtering spam using
search engines like Google and Yahoo has been developed at the Georgia Institute of
Technology. The key idea is to filter spam according to the URLs (and their content)
that occur in an e-mail message (for example, whether they link to Web sites a user
might be interested in or not). This is done by categorizing URLs via search engines as
well as using Bayesian classifiers on Web site content to define a user’s interest (in
terms of keywords resulting from the Bayesian analysis). The approach distinguishes
categorized URLs, which have already been indexed by a search engine, and
uncategorized URLs, which are not listed in any Web directory.
Such a system has to be trained. The first training step is to make a list of acceptable
categories (to define the categories a user is interested in). For this purpose, URLs are
extracted from legitimate mail messages in the user’s mailbox, which are then classified
through search engines. The content of the Web sites is also retrieved from the search
engines' caches and used to train a Bayesian classifier. Legitimate URLs that
occur in the user’s messages but cannot be found in a Web directory are whitelisted,
that is, a regular expression is created for each URL, resulting in a set of regular
expressions A_regex that represent legitimate URLs. At the end of the training process the
user is able to edit and verify the training results. After the training phase there should
be a set of legitimate categories, called A_categories, and a set of regular expressions,
called A_regex, that map the user's preferences (or a list of URLs to be accepted).
After training, the system is ready to classify mail. Figure 10 depicts the
classification process in detail. A message that does not contain any URLs is classified
as ham. If a message contains URLs, every URL is processed. If a URL matches a
regular expression in A_regex, if the URL’s category is in the A_categories set, or if the
URL was previously classified as legitimate, it is not considered any more. The
remaining set of URLs, called U_r, includes only categorized URLs with categories not
in A_categories and uncategorized URLs never seen before in legitimate messages that do
not match any of the regular expressions in A_regex. If a URL has a category not in
A_categories, the message is classified as spam. For each uncategorized URL remaining
in U_r, the content referred to is evaluated through the output of the Bayesian classifier.
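The decision procedure described above can be sketched in Python. Several parts are assumptions made for illustration: the category lookup and the Bayesian classifier are passed in as plain functions (`category_of` returns `None` for uncategorized URLs), the record of URLs previously classified as legitimate is omitted, and a message is tagged as spam as soon as the classifier flags the content behind one uncategorized URL, which the text does not state explicitly.

```python
import re


def classify_message(urls, a_regex, a_categories, category_of, bayes_says_spam):
    """Classify a message as "ham" or "spam" from the URLs it contains."""
    if not urls:
        return "ham"  # messages without URLs are classified as ham
    for url in urls:
        category = category_of(url)  # None means "uncategorized"
        if any(p.search(url) for p in a_regex) or category in a_categories:
            continue  # accepted URL, not considered any more
        if category is not None:
            return "spam"  # categorized, but category not in A_categories
        if bayes_says_spam(url):
            return "spam"  # uncategorized URL, content judged by the classifier
    return "ham"
```

In a real system `category_of` and `bayes_says_spam` would issue search-engine queries and fetch Web content, which is where the run-time cost of the approach arises.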
2.3.2.3. Authentication
The problem with spam messages is that it is hard to tell if a message is spam or not.
The obvious answer to this problem is that there has to be a way to recognize non-spam
messages. Whitelists have been a method of choice for a long time, but they cannot
solve one important problem: The e-mail protocol currently used does not provide any
security features (cf. Section 1.4). This means that anybody may use more or less
anything as a sender’s address.
A working public key infrastructure would solve this problem. If every message was
signed with a private key, there would be no problem to authenticate all senders.
Unfortunately, there is one rather big problem with this solution. Currently only a very
small number of e-mail users have a valid certificate. Creating an infrastructure, which
allows every e-mail user to use digital signatures, would be a big challenge.
Generally there are two different options for a public key infrastructure: Either there
is one single root certification authority, which means that there is a heavy burden on
this central authority, or there are many different certification authorities, which means
that each of them has to be trusted. A crafty spammer might start his
own certification authority, sign all his messages with it and thus
get by this security measure with relative ease.
In many cases the sender needs to fetch a public key for each recipient. Moreover, it
would be necessary to integrate all common encryption systems (PGP, X509, …) into all
mail clients (which, for example, is currently not accepted by Microsoft for its Internet
Explorer).
Even if this sounds rather ineffective, it still adds a certain amount of work to the
spammer’s plan of sending large amounts of e-mail messages. Whether this adds
enough work to harm spammers’ business model is a question that is still open.
Static techniques are useful to some extent at the individual or even corporate level.
However, the word “Viagra” may be of interest to a physician or pharmacist, thus
keyword based filtering cannot be used as a general solution. Performance may be the
main advantage of those primitive approaches, but another drawback is the need to
update the keywords.
At first glance, URL analysis seems to be promising. Taking a closer look reveals a
couple of drawbacks, though. Doing multiple queries in a search engine, or even
running a Bayes classifier may require a lot of time. This can lead to a point where
denial-of-service attacks based on messages containing vast amounts of URLs paralyze
a complete e-mail service.
2.3.3. Using Source and Content
In many cases information from the body or the header of an e-mail alone is not enough
for a classification. Especially mass mailer detection needs as much information as
possible to be able to compare messages as thoroughly as possible. In this chapter we
take a look at different technologies using this approach.
Digital fingerprint: a value calculated from the content of other data that changes if the
data upon which it is based changes.
Cyclic Redundancy Checks are more reliable than checksums. They normally reflect
even minor changes to the input data, but it is relatively easy to generate a completely
different file that produces the same CRC value.
Hash algorithms and message digest: “one way hash algorithms” produce a “hash”
value, which means that it is easy to compute b from a, but very difficult (or impossible)
to compute a if you only have b (compare Section 2.2.1). Two well-known hash
algorithms are MD5 and SHA.
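The difference between these kinds of fingerprints can be illustrated with Python's standard library. The additive checksum below is a deliberately naive example chosen for this sketch (it is not a checksum used by any of the tools discussed); `zlib.crc32` and `hashlib` provide CRC-32, MD5 and SHA-1.

```python
import hashlib
import zlib


def additive_checksum(data: bytes) -> int:
    """Naive checksum: sum of all byte values modulo 2**16."""
    return sum(data) % 65536


def fingerprints(data: bytes) -> dict:
    """Compute several fingerprints of the same data for comparison."""
    return {
        "checksum": additive_checksum(data),
        "crc32": zlib.crc32(data),
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }
```

Swapping two bytes, for example `b"ab"` versus `b"ba"`, leaves the additive checksum unchanged but changes the CRC-32 value and both hash values.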
Checksum based spam filtering is a method for detecting spam by simply auditing how
often a received message has been sent (to other users). It is a client-server architecture
where the client calculates a checksum of an incoming message and sends it to a server,
which looks for exact matches in its database and returns an indicator (for example the
number of times that the message has already been reported). Based on a user-defined
policy, the client then decides whether the message is spam or ham.
The most popular implementations of this concept are the Distributed Checksum
Clearinghouse (DCC [70]) from Rhyolite Software and Vipul’s Razor [71]. DCC and
Vipul’s Razor differ in the way messages are reported to the server.
A DCC client reports the checksums of every incoming message to the DCC server.
DCC basically enables mass mailer detection. It does not decide whether a message is
spam or not. DCC just reports how many copies of a message have already been
received. For this reason, clients have to maintain a whitelist including senders of
solicited bulk mail.
Table 9 shows the parts of a message DCC computes checksums for.
Checksum Description
The most remarkable types are the fuzzy checksums, which prevent spammers from
evading registration by DCC by including random characters in their spam messages.
It is not entirely clear how the fuzzy checksums are computed (the details are
kept secret from spammers). A typical response from a DCC server for a given (spam)
message is described by Table 10.
The response lists the computed checksums for each part of the message (Fuzzy 2
has no value here because this checksum requires a certain message length which was
not given in this example) and the number of registered occurrences (if any). The client
(for instance, a spam filter using DCC, such as SpamAssassin) can then handle the
message according to the registration count.
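The client-server interplay of checksum based filtering can be sketched in Python. This is not the DCC protocol: the server is a local object, the checksum is an exact SHA-1 over a whitespace- and case-normalized body (no fuzzy checksums), and the class name, function names and threshold policy are illustrative assumptions.

```python
import hashlib
from collections import Counter


class ChecksumServer:
    """Counts how often each checksum has been reported."""

    def __init__(self):
        self.counts = Counter()

    def report(self, checksum: str) -> int:
        self.counts[checksum] += 1
        return self.counts[checksum]


def body_checksum(body: str) -> str:
    """Exact checksum over a normalized message body."""
    normalized = " ".join(body.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def is_bulk(server, body: str, sender: str, whitelist: set, threshold: int) -> bool:
    """Report the message, then apply the client's policy to the returned count."""
    count = server.report(body_checksum(body))
    if sender in whitelist:
        return False  # solicited bulk mail, for example a subscribed newsletter
    return count >= threshold
```

The whitelist is needed because the count only says that a message is bulk mail, not that it is unsolicited.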
Vipul’s Razor follows a different approach. Within this system the user himself
reports a message to the server, so the database of the server should only contain
checksums of confirmed spam. Therefore, in contrast to DCC, Razor is a tool for spam
detection. One problem appearing here is the trustworthiness of the report’s sender.
With the new version Razor 2.0 this problem has been eliminated, because every user
needs a generated key for signing. The server looks in the database for reports agreeing
with the vote received. The higher the agreement with other reports, the higher the
reliability assigned to the sender [70].
The main idea behind their usage in spam filtering is to find a suitable (computer-
readable) representation for mail messages and to classify them as spam or ham. This
representation is compared to training data and assigned to a class based on various
techniques that are briefly described in the following. Some of the relevant technologies
originate from the areas of text information retrieval and text analysis.
Representation of Texts
Text analysis is mainly based on words or tokens that occur in the documents of the
used text collection. The task is not to fully understand a text's meaning but rather to
extract relevant tokens. Tokens can be entire words, phrases, or n-grams (overlapping
tokens consisting of n characters). Although this approach might miss some of the
information content, it has clear performance advantages and is independent of the
text's language. Several models exist for text representation based on those
tokens/terms, the most common ones are listed below.
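Token extraction with character n-grams and the resulting term-frequency representation can be sketched in Python as follows. The function names are illustrative, and lower-casing the text is an assumption of this sketch rather than a requirement of the approach.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> list:
    """Overlapping tokens of n characters, in order of occurrence."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def term_frequencies(text: str, n: int) -> Counter:
    """Map each n-gram of the lower-cased text to its number of occurrences."""
    return Counter(char_ngrams(text.lower(), n))
```

For example, `char_ngrams("spam", 2)` yields `["sp", "pa", "am"]`. No dictionary or word segmentation is involved, which is why the representation is independent of the language of the text.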
Training Models
Training can denote a simple storing of examples or involve more sophisticated and
time consuming methods, which are particularly important when token frequencies are
to be kept up to date. According to [74], there are three major training methods: TEFT,
TOE and TuM.
TEFT (Training on Everything): every message is used to update the database.
TOE (Train-on Error): only messages that were incorrectly classified are used for
training (usually after a corpus train). An advantage is the dynamic handling of errors;
the downside is the amount of human interaction needed (to find false classifications).
TuM (Train until Mature) [75]: Provides a hybrid between TEFT and TOE. TuM
will train the individual tokens in a message only up until a point where they have
reached maturity (for instance 25 hits per token). New types of training data are still
trained as well as immature tokens. TuM trains all tokens whenever an error is being
retrained. Therefore, it has both advantages – a balance between volatility and static
data, and the ability to adapt to new types of e-mail.
Distance Measures
The similarity between a query message and the messages in the training sets is
measured via distance functions. The query vector (consisting of term frequencies) is
compared to the examples in a training set so that one or more most similar vectors can
be found (for example, the similarity between an incoming mail message and a ham or
spam training set). Examples for distance measures are:
Euclidean Distance: $d(x, y) = \sqrt{\sum_i |x_i - y_i|^2}$
Mahalanobis Distance: $d(x, y) = \sqrt{(x - y)^t \, C^{-1} \, (x - y)}$
Cosine: $d(x, y) = \cos(x, y)$
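The measures listed above can be sketched in plain Python as follows. This is a minimal illustration using the usual square-root forms of the Euclidean and Mahalanobis distances; the function names are chosen for this example, and the inverse covariance matrix `c_inv` is assumed to be supplied by the caller as a nested list.

```python
import math


def euclidean(x, y):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mahalanobis(x, y, c_inv):
    """sqrt((x - y)^t C^-1 (x - y)) for a given inverse covariance matrix."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    q = sum(d[i] * c_inv[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(q)


def cosine(x, y):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


identity = [[1.0, 0.0], [0.0, 1.0]]
print(euclidean([0, 0], [3, 4]))
print(mahalanobis([0, 0], [3, 4], identity))
print(cosine([1, 0], [0, 1]))
```

With the identity matrix as inverse covariance, the Mahalanobis distance reduces to the Euclidean distance. Note that the cosine grows with similarity, whereas the other two measures grow with dissimilarity.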
Classification Decision
After the computation of distance measures the query vector has to be assigned to a
class according to the training set, that is, it tags a given message as either ham or spam
according to a spam and a ham training corpus. It is not always the best choice to base
the decision whether a query vector belongs to one class or not on the one most similar
vector in the training set only.
Many different algorithms and models for classification tasks have been developed,
most of them following the procedure just presented and differing only slightly in one
detail or another. The methods presented in the following give an overall idea of
existing technologies, but the list does not claim completeness.
Although the application of Bayesian analysis to spam is rather new, Bayesian logic
was actually first published by the Royal Society in 1763 and is based on the work of Thomas
Bayes (born 1702 in London).
In basic terms, Bayes’ Formula allows us to determine the probability of an event
occurring based on the probabilities of two or more independent events. The general
formula is written as:
$$P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{\sum_{j=1}^{k} P(B \mid A_j)\, P(A_j)}$$
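A small worked example may make the general formula more concrete. The sketch below evaluates the posterior probability for two classes (spam and ham) given that a certain token occurs in a message; the function name and all numbers are assumptions chosen purely for illustration.

```python
def posterior(likelihoods, priors, i):
    """P(A_i | B), given P(B | A_j) and P(A_j) for all classes j."""
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    return likelihoods[i] * priors[i] / evidence


# Assumed values: the token occurs in 80% of spam and in 10% of ham,
# and both classes are equally likely a priori.
p_spam_given_token = posterior([0.8, 0.1], [0.5, 0.5], 0)
print(p_spam_given_token)
```

With these assumed numbers the posterior is 0.4 / 0.45, that is, 8/9 or roughly 0.89.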
Any incoming e-mail is now represented by the most important tokens from these
lists, either “most positive” or “most negative”. The overall spam probability is defined
as the joint probability of independent events (the tokens).
Assuming that the variables a, b and c represent spam probabilities for three different
tokens, the total spam probability of a message is equal to:
$$\frac{abc}{abc + (1 - a)(1 - b)(1 - c)}$$
The decision whether a message is treated as spam or ham is based on this overall
spam probability (via a simple threshold function).
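The combination rule and the threshold decision can be sketched in a few lines of Python. The function names and the threshold value of 0.9 are assumptions made for this example and are not taken from any particular filter.

```python
from math import prod


def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities as joint independent events."""
    spam = prod(token_probs)
    ham = prod(1 - p for p in token_probs)
    return spam / (spam + ham)


def classify(token_probs, threshold=0.9):
    """Simple threshold function; the threshold is an assumed value."""
    return "spam" if combined_spam_probability(token_probs) > threshold else "ham"


print(combined_spam_probability([0.9, 0.8, 0.2]))
print(classify([0.99, 0.95, 0.9]))
```

For a = 0.9, b = 0.8 and c = 0.2 the numerator is 0.144 and the denominator is 0.144 + 0.016 = 0.16, so the overall spam probability is 0.9 (up to floating-point rounding).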
Bayesian filters use a variety of different tokens; a few are listed below [74]:
Standard Bayes: Each word is a token. This method is used in most spam filter
programs. Text is not preprocessed at all (everything is used as a token, including header
information, JavaScript, etc.).
Token Grab Bag: A sliding window of five words is moved across the input text. All
combinations of those five words are taken in an order-sensitive way – every
combination is a feature.
Token Sequence Sensitive: A sliding window of five words is moved across the input
text. All combinations of word deletions are applied (except that the first word in the
window is never deleted) and the resulting sequence-sensitive set of words is used as a
feature.
Markovian matching: This is similar to Sparse Binary Polynomial Hashing, but the
individual features are given variable weights. The weights increase quadratically
with the length of the token, so that a feature that contains more words than
any of its sub-features can outweigh all of its sub-features combined.
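The "Token Sequence Sensitive" scheme described above can be sketched as follows. For each window of five words, every combination of deletions of the last four words is generated, while the first word is always kept. Marking deleted positions with a `<skip>` placeholder (to keep the features sequence-sensitive) and the function names are assumptions of this example, not details taken from [74].

```python
from itertools import product


def window_features(window, skip="<skip>"):
    """All deletion variants of one window; the first word is never deleted."""
    first, rest = window[0], window[1:]
    features = []
    for keep in product([True, False], repeat=len(rest)):
        words = [first] + [w if k else skip for w, k in zip(rest, keep)]
        features.append(" ".join(words))
    return features


def sequence_sensitive_features(text, size=5):
    """Slide a window of `size` words across the text and collect features."""
    words = text.split()
    features = []
    for i in range(len(words) - size + 1):
        features.extend(window_features(words[i:i + size]))
    return features


print(len(window_features(["click", "here", "to", "buy", "now"])))
```

A window of five words yields 2^4 = 16 features, so the feature set grows quickly with the length of the message.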
Support Vector Machines (SVMs), introduced by Vapnik [76][77], have proven to be
a powerful classification algorithm and are used in many categorization tasks, including
text categorization. The main idea is to map the input data into a high-dimensional
feature space and to separate the data by the hyperplane that provides the largest
margin between the two classes. If the classes are not linearly separable, SVMs make use of
so-called kernels (convolution functions) to transform the initial feature space into another
one in which a separating hyperplane exists.
A query is compared to all samples in the training set (according to a distance function;
Euclidean distance is very common). The query is assigned to the class to which most of its
k nearest neighbors (the k most similar vectors) belong. For instance, if a message's
five nearest neighbors consist of two spam messages and three hams, the message is
classified as ham. K-Nearest Neighbor is an example of the decision part of a
classification system.
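The decision rule just described can be sketched in Python as a majority vote among the k nearest training vectors. The function name, the data layout (a list of vector/label pairs) and the toy data are assumptions of this example; Euclidean distance is computed with `math.dist` from the standard library (Python 3.8 or later).

```python
import math
from collections import Counter


def knn_classify(query, training, k=5):
    """Assign the query to the majority class among its k nearest neighbors.

    training is a list of (vector, label) pairs.
    """
    by_distance = sorted(training, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


training = [
    ([0.0], "spam"), ([0.1], "spam"),
    ([0.2], "ham"), ([0.3], "ham"), ([0.4], "ham"),
    ([9.0], "spam"),
]
print(knn_classify([0.0], training, k=5))
```

In this toy data set the five nearest neighbors of the query consist of two spam messages and three hams, so the query is classified as ham, as in the example in the text.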
Feed-forward neural networks are another technique widely used for classification and
pattern recognition tasks. Neural networks differ from other approaches because of their
extensive training phase and their heuristic way of initialization. Far more resources are
needed for training than for actual classification, in contrast to the K-NN algorithm,
where training only means storing vectors and classification includes the costly
comparison to all training examples [78].
2.3.3.7. Strengths and Weaknesses
Bayesian filters offer a good method to detect spam messages. They represent a
content based solution that is easy to implement. Disadvantages are the need for
permanent filter training, limited applicability for ISPs, potential counterattacks from
spammers (insertion of random words – see [79]) and possible performance problems.
One big advantage of checksum based systems is the low rate of false positives.
False positives can only occur when a message has been sent many times or the
checksums of different messages are accidentally the same (a very small chance).
Approaches that are based on fuzzy checksums can be used to detect messages that
contain random words (often used by spammers to bypass keyword based filtering).
The neural network based approaches and support vector machines need an
extensive training phase and, because of their heuristic initialization, do not allow
conclusions to be drawn about individual decisions (they are a black box; the user does not
know why a specific message is classified as ham or spam). On the other hand, classification
itself is faster, which may be an important advantage, although the main performance problem
of spam detection is the message analysis itself.
One of the first papers about the use of SVMs to classify spam messages was
published in 1999 [81]. Another paper [82], proposing a similar approach, was
published in 2001. Since then, the application of SVMs as well as K-NN and its variations to
the spam problem has been discussed numerous times [83]. However, due to their quite
recent development and rather complex implementation, SVMs are rarely used in
commercial anti-spam systems at the moment. It is very important to take into account
the data pre-processing and training phase, as they are a crucial part of the classification
process. All methods discussed here tend to obtain good results only if they are trained
on a regular basis. Regular training is essential for keeping the performance of a content
filter at a satisfactory level and for avoiding a significant performance decrease over time.
receiver. This comprises suggestions for new e-mail transfer protocols, two of which
are mentioned here. It cannot be expected that they will be implemented and used in the
near future.
2.4.1. IM 2000
IM 2000 has been developed by D.J. Bernstein, the creator of qmail. Today’s Internet
mail infrastructure implements a push system: the sender’s cost of sending a message to
thousands of recipients is nearly zero. IM 2000 proposes a pull mechanism in which the
messages are stored at the sender’s side. This concept has some ramifications for a new
infrastructure [84][85]:
• Each message is stored under the sender's disk quota at the sender's ISP. ISPs
accept messages only from authorized local users.
• The sender's ISP, rather than the receiver's ISP, is the always-online post office
from which the receiver picks up the message.
• The message is not copied to a separate outgoing mail queue. The sender's
archive is the outgoing mail queue.
• The message is not copied to the receiver's ISP. All the receiver needs is a brief
notification that a message is available.
• After downloading a message from the sender's ISP, the receiver can efficiently
confirm success. The sender's ISP can periodically retransmit notifications until
it receives a confirmation. The sender can check for confirmation. There is no
need for bounces.
• Recipients can check on occasion for new messages in archives that interest
them. There is no need for mailing-list subscriptions.
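The pull model outlined in the list above can be illustrated with a small in-memory simulation in Python. It is only a toy model of the idea (storage at the sender's ISP, brief notifications, pick-up and confirmation by the receiver); the class and method names are hypothetical and do not correspond to any IM 2000 specification or implementation.

```python
class SenderISP:
    """Toy model of the sender-side message store in a pull-based mail system."""

    def __init__(self):
        self.archive = {}       # msg_id -> (recipient, body); also the outgoing queue
        self.confirmed = set()  # msg_ids whose pick-up has been confirmed

    def submit(self, msg_id, recipient, body):
        """Store the message at the sender's ISP and return a brief notification."""
        self.archive[msg_id] = (recipient, body)
        return {"msg_id": msg_id, "recipient": recipient}

    def pending_notifications(self):
        """Notifications to (re)transmit until the receiver confirms pick-up."""
        return [
            {"msg_id": msg_id, "recipient": recipient}
            for msg_id, (recipient, _) in self.archive.items()
            if msg_id not in self.confirmed
        ]

    def fetch(self, msg_id, recipient):
        """The receiver pulls the message from the sender's ISP."""
        owner, body = self.archive[msg_id]
        if owner != recipient:
            raise PermissionError("message is addressed to someone else")
        return body

    def confirm(self, msg_id):
        """The receiver confirms success; no bounce messages are needed."""
        self.confirmed.add(msg_id)


isp = SenderISP()
note = isp.submit("m1", "bob@example.org", "Hello Bob")
body = isp.fetch(note["msg_id"], "bob@example.org")
isp.confirm("m1")
print(body, isp.pending_notifications())
```

The point of the sketch is that storage costs stay with the sender: a bulk mailer would have to keep every message available under its own quota until the recipients fetch it.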
2.4.2. AMTP
The Authenticated Mail Transfer Protocol (AMTP [87]) is currently specified in an
Internet-Draft. The last version was submitted on April 26, 2004 to the Internet
Engineering Task Force. AMTP enables a trusted relationship between entities operating
Mail Transfer Agents. It works over TLS, much like SSL for Web servers. Both client and
server must present valid X.509 certificates, each signed by a trusted Certificate
Authority (CA), in order to begin a transaction. AMTP also provides a mechanism to
publish concisely-defined policies. This allows the parties in the trusted relationship to
hold each other responsible for operating their servers within the constraints of agreed-
upon rules. AMTP inherits the specification of SMTP and builds upon it. By operating
on a different TCP port AMTP can run in parallel with SMTP. It is hoped that this
supports an easy and smooth adoption [87].
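The requirement that both sides present valid X.509 certificates corresponds to what is commonly called mutual TLS. The following Python sketch shows how a server-side TLS context that insists on a client certificate could be configured with the standard `ssl` module. It does not implement AMTP itself; the function name is hypothetical, and the certificate file names in the comments are placeholders.

```python
import ssl


def mutual_tls_server_context():
    """Create a server-side TLS context that requires a client certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # Reject any client that does not present a certificate signed by a
    # trusted Certificate Authority.
    ctx.verify_mode = ssl.CERT_REQUIRED
    # In a real deployment the server's own certificate and key would be
    # loaded as well (placeholder file names):
    # ctx.load_cert_chain(certfile="server.pem", keyfile="server.key")
    return ctx


ctx = mutual_tls_server_context()
print(ctx.verify_mode)
```

A sending MTA would configure its client context analogously and load its own certificate, so that each party can verify the other before a transaction begins.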
3. Products and Tools
This chapter describes the products that were used in our experiments. In the first
section we give some general information about anti-spam software. Afterwards we
discuss in detail the characteristics of the commercial and open source spam filters
tested.
3.1. Overview
The following section gives an overview of anti-spam software. First, some quality
criteria for product reviews are mentioned. Then we suggest online resources that
help in choosing among the wide variety of available solutions.
Processing Speed: The processing speed needed depends on the mail volume
received. If the number of messages received exceeds the capabilities of the spam filter,
mail could be lost due to congestion. The processing speed depends mainly on the
methods used for analyzing the incoming messages.
Detection Rate: The detection rate is definitely the most important criterion. It must
be pointed out that this refers to the detection rates of spam as well as ham. A good
spam filter must have a very low rate of false positives and on the other hand detect as
many spam messages as possible.
One is the so-called Compare-O-Matic, available at NetworkWorldFusion [88]. After a
free registration, you can search through the anti-spam
buyer’s guide [88]. You can choose between server based and client based products and
anti-spam services. The Compare-O-Matic itself is a very good feature that allows two or
more products to be compared (the features used for comparison can be chosen).
The second tool can be found at Spamotomy [89]. You can choose between several kinds
of solutions: desktop software, server software, hosted services and disposable
addresses. For every tool there is a short summary and a short description of the
methods used.
Figure 11 shows the typical processing path of Symantec Brightmail Anti-Spam.
Available Methods
• Open Proxy List: list of IP addresses that are open proxies (often used by
spammers).
• Safe List: IP addresses from which virtually no outgoing e-mail is
spam.
• Suspect List: IP addresses from which virtually all of the outgoing e-
mail is spam.
Other content based methods: When evaluating whether messages are spam or not,
Brightmail Anti-Spam calculates a spam score from 1 to 100 for each message, based
on techniques such as pattern matching and heuristic analysis. If the score of an e-mail is in
the range from 90 to 100, it is considered spam; if the score is below 25, it is considered
ham. E-mail with a score between 25 and 90 is suspected to be spam. For more
aggressive filtering, the thresholds can be varied, and it is possible to specify different
actions for messages identified as suspected spam or spam based on different filtering
policies. Brightmail Anti-Spam allows creating custom filters based on keywords and
phrases found in specific areas of a message.
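The score ranges described above amount to a simple three-way threshold function, sketched below in Python. The function name is hypothetical, and treating a score of exactly 25 as "suspected spam" and exactly 90 as "spam" is an interpretation of the ranges given in the text, not documented product behavior.

```python
def verdict_from_score(score, spam_threshold=90, ham_threshold=25):
    """Map a spam score (1 to 100) to one of three classes."""
    if score >= spam_threshold:
        return "spam"
    if score < ham_threshold:
        return "ham"
    return "suspected spam"


for score in (10, 25, 60, 90, 100):
    print(score, verdict_from_score(score))
```

Lowering `spam_threshold` corresponds to the more aggressive filtering mentioned in the text.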
User’s View: Symantec Brightmail Anti-Spam provides a set of basic features that
cannot be disabled to ensure protection against spam. These features are the Spam
Scoring and the Suspect List within the Reputation Service. The only way to take the
whole product offline is to disable the services in the Microsoft Services Console.
The product is easy to handle and provides a comfortable Web interface for
administration and Spam Scoring configuration. The processing speed is at a very high
level. Anti-spam and anti-virus definitions are updated regularly, so maintaining the
product requires no effort. It offers many good features and is a good combination of
anti-spam and anti-virus protection, but there are disadvantages too. The information
provided in the logging section (statistics about status information and classification
results) is only updated every hour. Real-time monitoring of the processing status is
therefore impossible.
Available Methods
Methods based on source of e-mail: Kaspersky Anti-Spam 2.0 supports filtering mail
by officially blacklisted addresses as well as using local black- and whitelists created by
administrators. Furthermore a heuristic analysis checks some of the formal attributes of
an e-mail, such as sender’s address, recipient’s address, sender's IP address, size of
message, and format of message.
Other content based methods: The content of each message can be categorized by
Kaspersky Anti-Spam. In this context, “content” refers to the body of an e-mail,
excluding subject and header. Moreover, conditions can refer to non-formal attributes
of a message, the results of the content filtering. Therefore, the classical rule based
approach is combined with content analysis. Kaspersky uses two basic methods to
detect messages with "suspicious" content:
Additionally, every message is processed via filtering rules. Every rule includes one
or more conditions that involve an analysis of the message; the action of a rule is applied
only if all of its conditions are met. Such conditions include tests for the
sender’s IP address, the sender’s e-mail address and the message size.
Administrators can switch between several standard rule sets (called common
profiles, valid for all users) or create new ones. In addition certain rules can be added on
a per user basis. The rule based approach is quite powerful, that is, it allows many
settings (the standard profile includes 34 rules – most of them handle header
substitution and modification).
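The rule semantics described above (an action is applied only if all conditions of a rule hold) can be sketched in Python as follows. The representation of messages as dictionaries, the field names, the address range and the action label are assumptions of this example and do not reflect the actual rule syntax of the product.

```python
def make_rule(conditions, action):
    """Build a rule that yields its action only if every condition is met."""
    def apply(message):
        if all(condition(message) for condition in conditions):
            return action
        return None
    return apply


# Hypothetical rule: reject large messages from a given address range.
rule = make_rule(
    [
        lambda m: m["sender_ip"].startswith("192.0.2."),
        lambda m: m["size"] > 100_000,
    ],
    "reject",
)

print(rule({"sender_ip": "192.0.2.7", "size": 200_000}))
print(rule({"sender_ip": "192.0.2.7", "size": 10}))
```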
The core SurfControl E-Mail Filter solution consists of the following software
components:
• Message Administrator: Allows the user to review and act on delayed and
isolated messages, together with querying the various system logs.
• E-Mail Filter Administrator: Enables remote control of the E-Mail Filter
via a Web browser.
• E-Mail Monitor: Provides a window onto the progress of individual messages
through the E-Mail Filter.
• Rules Administrator: Enables rules to be set up to monitor and/or block
messages.
• Scheduler: This is the interface to automate repetitive tasks, such as receiving
updates from SurfControl's anti-spam database.
Available Methods
Methods based on source of e-mail: The SurfControl E-Mail Filter allows creating a
whitelist database by entering information about known individuals. Like the other
products, the SurfControl E-Mail Filter supports custom and real time blacklists.
Other content based methods: SurfControl uses its Anti-Spam Agent that
automatically detects and deals with common non-business or high-risk e-mail, such as
humorous graphics, chain letters, hoaxes and jokes. It is continuously updated by
SurfControl to maintain accuracy and quality. The filter also enables Boolean searches
to check for words, combinations of words or pairs of words within a message. There is
also a library of dictionaries to detect e-mail content that an organization may want to
avoid. These dictionaries contain words associated with different aspects of unwanted
content, for example adult material, hate speech and gambling.
Other Features: It is possible to remove active HTML content from the body of e-
mail messages. Active content is code that automatically installs and runs on your
computer, such as scripts or ActiveX Controls [93]. SurfControl E-Mail Filter can
detect various routing relay techniques and deny e-mail that has been forwarded or
routed. File attachments and messages can be blocked if they do not comply with the
MIME standard or exceed a specific size. Looping messages between two or more e-
mail servers and messages that exceed a specific number of recipients can be detected
and removed. SurfControl E-Mail Filter has also an image recognition tool that scans
graphics files for explicit adult content.
For anti-virus protection an agent is available that helps to protect the system by
deleting viruses and cleaning infected files when they occur. It uses the McAfee
Olympus Anti-Virus engine to detect files that could damage a system.
User’s View: The entire configuration of the product is up to the user. There are no
preconfigured settings available. It is possible to turn all features off so that the
SurfControl E-Mail Filter just acts as a simple SMTP gateway. The settings available
to the user are very extensive, and some time is needed to get an overview of the
functionality of the product.
The product provides a good user interface to handle the functionality and the
components are clearly arranged. The SurfControl E-Mail Monitor allows a real time
supervision of the processing state. It is possible to integrate external code by using the
External Program Plug-In.
The processing speed of the SurfControl E-Mail Filter is far behind that of the other tested
products. The Virtual Image Agent does not seem to work adequately,
because it blocks harmless pictures even at the lowest sensitivity level. The Virtual
Learning Agent only supports pure text files, so training the agent is very time-consuming.
The Loop Detection also does not seem to work correctly, because it also
blocks messages that are already tagged with an “X-Spam” flag, which is not a
significant sign of a looping message. Some of the dictionaries provided do not seem to
be useful because they include words appearing in nearly every message (for example
“html”).
Figure 12: Typical processing path of Symantec Mail Security for SMTP
Available Methods
Methods based on source of e-mail: To limit potential spam, Symantec Mail Security
can support up to three real time blacklists. There is also the ability to block e-mail by a
custom blacklist (which contains the sender’s address or domain). Domains and e-mail
addresses that shall bypass the heuristic and blacklist detection can be added to a
whitelist. There is also an auto-generating whitelist feature that, if enabled, adds all
domains of outgoing messages that are not in the local routing list.
Other content based methods: Symantec Mail Security for SMTP allows
defining spam rules to be used for processing the message body. Each rule consists of
one or more terms that can be combined using AND, OR, and NOT operators. For
example, the rule "top secret" OR "confidential" triggers if at least one of these terms appears
in the message body.
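Boolean term rules of this kind can be sketched in Python with a few small combinators. The helper names and the case-insensitive matching are assumptions of this example and are not taken from the product's rule syntax.

```python
def term(text):
    """Rule that triggers if the term occurs in the message body."""
    return lambda body: text.lower() in body.lower()


def OR(*rules):
    return lambda body: any(rule(body) for rule in rules)


def AND(*rules):
    return lambda body: all(rule(body) for rule in rules)


def NOT(rule):
    return lambda body: not rule(body)


# The example rule from the text: "top secret" OR "confidential"
example_rule = OR(term("top secret"), term("confidential"))

print(example_rule("This document is confidential."))
print(example_rule("Lunch at noon?"))
```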
Other Features: Symantec Mail Security for SMTP allows blocking messages by
message size, by subject line or by file name. It is also possible to drop messages that exceed
various container limits, such as the file size, the cumulative size or the number of nested
containers. Functionality for handling encrypted container files is included.
Relay restrictions can be configured within Symantec Mail Security for SMTP so that it
refuses to deliver e-mail that has a source outside of the organization.
The anti-virus scanning feature tries to detect virus infected e-mail. New or unknown
viruses can be detected through a heuristic method. The sensitivity of this feature is
variable. Another component of the anti-virus policy is the Mass Mailer Cleanup that
deletes mass mail or worm infected messages. The services search for a match between
virus name patterns and the signature returned by the anti-virus scan. If a match is
detected, then the message is dropped.
User’s View: The complete configuration of the product is up to the user. There are
no preconfigured settings available. It is possible to turn off all features so that
Symantec Mail Security for SMTP just acts as a simple SMTP-gateway. The delivery
of messages can be fully stopped and all messages can be rejected to set the Symantec
Mail Security for SMTP offline.
The product is easy to handle and provides a comfortable Web interface for
administration. The auto-generated whitelist is a useful feature saving time for editing
the list. The reporting function allows a good supervision of the processing state.
Reports are always up-to-date and include most of the relevant information. The
processing speed of the Symantec Mail Security for SMTP is at a very high level.
The most important spam detecting feature, the heuristic spam detection, does not
provide a sufficient detection rate. The effort for creating spam and content rules is too
high in relation to the expected increase in the spam detection rate. There is no ability to
manage probabilities for words or combinations of words appearing in an e-mail
message. The latest online update of the spam patterns file dates back to 2004-04-18;
since then, the Live Update functionality seems to have had no effect at all.
3.2.5. Borderware MXtreme Mail Firewall
The tested version, MX-400, combines three functionalities: MTA, e-mail gateway and
firewall. It has its own operating system, S-Core OS, which is a Unix system based on
FreeBSD. In contrast to the other solutions, MXtreme is hardware based.
Available Methods
Methods based on source of e-mail: The MXtreme Mail Firewall supports blacklists
(custom as well as real time) and whitelists. These lists can be specified on the user or
system level.
Methods based on fingerprints: The MXtreme Mail Firewall uses DCC for spam
detection.
Statistical Token Analysis (STA) uses three sources of data to build its database: the initial
tables supplied by BorderWare based on analysis of known spam, tables derived from an
analysis of local legitimate mail (“local learning” or “training”), and mail identified as
“bulk” by DCC, which is analyzed to provide an example of local spam.
Other content based methods: The MXtreme Mail Firewall supports pattern based
filtering. Filters can be specified using simple English terms such as “contains” and
“matches” or using regular expressions. These filters are processed in the order of their
priority.
The product is easy to handle and provides a comfortable Web interface for
administration. Its main advantages are that it combines firewall and anti-spam
functionality and that it is rather easy to maintain. DCC and Statistical Token Analysis
performed quite well, although the classification performance may differ in production
use due to the chosen training policy.
Available methods
Methods based on fingerprints: No global fingerprint services like DCC are used in
this product. All incoming messages are classified and the ones classified as spam are
used to create tokens for future message classification.
User’s View: Generally speaking Ikarus mySpamWall is very easy to use. It has a
Web interface that allows a simple setup of threshold values for possible spam and
spam. Optionally, an advanced interface allows the addition of simple rules to blacklist
or whitelist certain senders, receivers, subject lines or content using simple regular
expressions.
From our point of view it seems generally a good idea to offer spam protection for
companies as a service. This is a good solution for rather small companies that cannot
afford an IT department to deal with the spam problem.
Especially the fact that this product offers a complete integration of a mail transfer
agent and an anti-spam solution seems to be a big advantage compared to most of the
others. This advantage is used in a very potent greylist that is able to detect many spam
messages without the risk of creating “real” false positives, as messages can always be
sent again and then be delivered.
3.2.7. Spamkiss
Spamkiss [96], like some other projects, pursues a goal completely different from
that of regular spam filters. While regular spam filters accept the fact that spam exists
and try to eliminate it after it is sent, these approaches try to make sending out spam
mail as expensive as possible (compare Section 2.2.1). These expenses should sooner or
later make spamming less attractive commercially.
The technology involved in this approach consists of a combination of two different
methods: A whitelist and a challenge-response protocol.
The challenge-response protocol is used for the initial contact. The user’s mail
address is modified with a random token, which is only valid for a certain period. This
modified e-mail address has to be used for the initial contact only. After that, the
sender’s address is added to the user’s whitelist. At this point e-mail communication
continues just as it does now. Both partners can communicate as they want, without
adding any new random tokens to mail addresses.
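The interplay of token and whitelist described above can be sketched in Python. This is a toy model only: the class and method names, the `user+token@domain` address format and the token lifetime of one week are assumptions of this example and are not taken from Spamkiss.

```python
import secrets


class TokenProtectedInbox:
    """Toy model: first contact needs a valid token, later mail uses the whitelist."""

    def __init__(self, user, domain, lifetime=7 * 24 * 3600):
        self.user = user
        self.domain = domain
        self.lifetime = lifetime  # validity period of a token in seconds (assumed)
        self.tokens = {}          # token -> expiry time
        self.whitelist = set()    # senders that have completed the first contact

    def issue_address(self, now):
        """Return the user's address modified with a fresh random token."""
        token = secrets.token_hex(4)
        self.tokens[token] = now + self.lifetime
        return f"{self.user}+{token}@{self.domain}"

    def accept(self, sender, recipient, now):
        """Accept mail from whitelisted senders or to a currently valid token address."""
        if sender in self.whitelist:
            return True
        local_part = recipient.split("@", 1)[0]
        _, _, token = local_part.partition("+")
        if token and self.tokens.get(token, float("-inf")) >= now:
            self.whitelist.add(sender)
            return True
        return False


inbox = TokenProtectedInbox("alice", "example.org")
address = inbox.issue_address(now=0)
print(inbox.accept("bob@example.net", "alice@example.org", now=1))  # no token yet
print(inbox.accept("bob@example.net", address, now=1))              # first contact
print(inbox.accept("bob@example.net", "alice@example.org", now=2))  # whitelisted
```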
The way in which the tokens are distributed is the major difference between systems
using this approach and is similar to key distribution in a public key infrastructure. As
soon as the key is exchanged (or in this case the token), communication is no problem
at all. Spamkiss offers a simple exchange, either by talking to each other personally (and
telling your partner how to modify your address for the first contact) or by getting the
currently used token from a server. A system very similar to Spamkiss is also used at
the Computing Science Department at the University of Alberta, Canada. It is called SFM
[66] (Spam Free E-Mail service, see Section 2.3.1).
To make sure that the current token cannot be harvested by a computer, a distorted
image is used. This technology is often also used to access free services like stock
quotes, which should not be machine-readable. The general idea is rather simple:
the text in distorted images cannot be read by a computer (at the moment), as text
recognition software fails on it. The human brain, on the other hand, is easily able to
recognize the letters, even though they are distorted and have a fancy background pattern.
Available Methods
This kind of technology promises that no more legitimate messages will be lost, as
all messages generated by a human being (who reads replies to his address) are
delivered, or bounced with a request to send it again to a different, modified address.
This modified address simply consists of the original address and the current token.
The approach of stopping spam messages this early seems like a good idea at
first glance. Messages are not delivered at the first attempt. Later on, messages that are
legitimate are delivered, and those that are not are not accepted by the mail server,
telling the sender to fetch the current token first. If a token is compromised, which
means that the combination of token and e-mail address ends up on a spammer’s list, it
simply gets changed, without affecting communication with those already added to the
whitelist.
Several major questions seem to be unanswered so far. On the one hand it might still
pose a problem to handle automatically generated messages, and on the other hand
bouncing back a lot of messages may increase network traffic considerably.
Taking a closer look reveals a lot of additional work. Users should know their
tokens, so they can give them to future communication partners. After this initial
contact the sender’s address gets stored in the whitelist. This solution may work in
many cases, but there are many situations that may cause problems in this kind of
environment. Many users have several e-mail addresses, which means that a user
who uses different sender mail addresses has to go through the initial process several
times. In addition to that it is often hard to tell who will be the sender of messages that
are delivered by a mailing list, or the exact mailing address of automatically generated
messages by an online store. Checks are basically performed on the sender’s address
only. This means that someone who knows the addresses of your trusted partners may
easily send you any kind of message.
3.3.1. SpamAssassin
SpamAssassin [97] is written in Perl and is a project of the Apache Software Foundation. The
primary target platforms are Unix operating systems. Some available Windows products
use SpamAssassin, though they are not open source. SpamAssassin extracts
different features from incoming e-mail messages. This analysis is done through so-called
tests, which examine the header, body or full text of an e-mail (a full listing can be
found at [98]). SpamAssassin can be configured to include RBL checks (see Section
2.3.1) and a Bayesian classifier. The overall rating of a message is computed from the
values for the different results of the text analysis plus the results of the Bayesian
filtering plus the results from the distributed hash databases. Hence SpamAssassin
never relies on one single technique. After the testing a header containing the overall
score and a spam mark is added to all processed e-mail messages. The messages can be
classified according to this mark or score (probably-spam if the mark is present,
certainly-spam above a certain threshold).
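The scoring scheme described above can be sketched in Python as a sum of per-test scores followed by a two-level threshold decision. The test names, the score values and both thresholds are hypothetical values chosen for this example; they are not SpamAssassin's actual rules or configuration.

```python
# Hypothetical tests and scores; real rule names and weights differ.
EXAMPLE_SCORES = {
    "HEADER_TEST": 1.5,
    "BODY_TEST": 2.0,
    "BAYES_TEST": 3.5,
    "HASH_DB_TEST": 4.0,
}


def overall_score(hits, scores):
    """Sum the scores of all tests that matched a message."""
    return sum(scores[name] for name in hits)


def verdict(score, probable=5.0, certain=10.0):
    """Two thresholds: spam mark (probably-spam) and certainly-spam."""
    if score >= certain:
        return "certainly-spam"
    if score >= probable:
        return "probably-spam"
    return "ham"


score = overall_score(["HEADER_TEST", "BAYES_TEST"], EXAMPLE_SCORES)
print(score, verdict(score))
```

Because the verdict depends on the sum, no single test decides on its own, which reflects the statement that SpamAssassin never relies on one single technique.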
Available Methods
Methods based on fingerprints: Vipul's Razor, Pyzor and DCC are supported to
block spam and bulk mail.
Classification Methods: SpamAssassin uses a Bayesian-like kind of probability-
analysis classification, so that a user can train it to recognize mail messages similar to a
training set [99]. Many of SpamAssassin’s tests aim at static patterns in the text of mail
messages.
User’s View: The user can specify which rules to use and whether Bayesian
methods should be used or not. Furthermore, one can specify which features should be
computed and which online resources should be consulted (DCC, Razor, Pyzor). The
user can also specify which RBL should be used (if any); SpamAssassin is therefore
highly adjustable to the user's needs. SpamAssassin also offers an auto-learn function,
which uses all messages below and above certain scores (mail that is very clearly
classified as spam or ham) as learning input for the Bayesian classifier. The auto-learn
function is not considered in our tests; all training is done prior to testing.
Available Methods
Methods based on source of e-mail: CRM114 supports personal white- and
blacklists.
Incoming mail is piped to CRM114 by the local MDA; CRM114 adds a new header
containing the spam score to each message.
User’s View: The most important and resource intensive aspect of CRM114 is
training. The configuration is not very user friendly and gives a somewhat “not yet
finished” impression. Bulk training takes a very long time even for a small number of
messages. CRM114 is best used on a per user basis to personalize the training sets (like
other statistical approaches).
Installation and training are a bit tricky but the results are rather good. The
recommended training method is to only use false classifications as training input,
whereas we used bulk training (train on errors is much easier on the individual level).
3.3.3. Bogofilter
Bogofilter [102] is a "Paul Graham based" Bayesian spam filter. The application is
written in C and available as open source (several Unix operating systems are
supported).
Available Methods
Bogofilter is used by the MDA (mail delivery agent) and computes tokens and
finally spam or ham probabilities for every incoming message. It adds an X-Bogosity
header to all e-mail containing the ham- and spam scores of this message. The messages
can be moved to their designated folders according to this header.
Users have to train Bogofilter with both a ham and a spam corpus. After that initial
training Bogofilter is ready to classify mail. It is recommended to retrain on a regular
basis to adjust to changes in mail messages.
User’s View: The user can choose between two (spam, ham) or three classes (spam,
probably spam, ham) and is responsible for the training. Moreover the threshold values
can be specified.
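The choice between two and three classes amounts to using one or two cutoff values on the computed score. A minimal sketch, in which the parameter names and the default values are assumptions chosen for illustration:

```python
def classify(score, spam_cutoff=0.99, ham_cutoff=None):
    """Map a spam score in [0, 1] to a class label.

    With ham_cutoff=None the filter works in two-class mode (spam, ham).
    With a ham_cutoff below spam_cutoff, scores between the two cutoffs
    fall into the intermediate class "probably spam".
    """
    if score >= spam_cutoff:
        return "spam"
    if ham_cutoff is not None and score > ham_cutoff:
        return "probably spam"
    return "ham"
```

Lowering `spam_cutoff` catches more spam at the price of more false positives; the intermediate class lets the user review borderline messages instead.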
The results of Bogofilter are of particular interest because it is the only Bayes-only
application included in our test series. In our experience Bogofilter works without
problems (installation and usage).
4. Performance Evaluation
The previous chapters described available tools and the most relevant methods used for
spam detection. To experiment with these tools, it is very important to have a
comparable test set. Existing evaluations in the scientific literature use either
publicly available spam and ham samples (like the Ling spam corpus and PU1,
both proposed in [103], or the SpamAssassin sample [104]) or self-collected samples,
which are usually more up-to-date but in some cases not publicly available. We
decided to test the tools with our own collected sample, as it is more up-to-date, and
also with the SpamAssassin sample because of its public availability. This chapter
describes the source and the composition of our own test sample and the hardware and
software configuration used for the tests.
The collection of ham messages is much harder due to legal preconditions. It is not
allowed to use ham messages without consent of both the receiver and the sender, so we
depended on volunteers and on our own e-mail.
The following chapter describes the situation of our partners Mobilkom Austria,
UPC Telekabel and the University of Vienna.
Ham collection was difficult, but some volunteers made their inboxes available.
Therefore we took our own messages and those of the volunteers – special thanks to
Mr. Hatz, Ms. Marosi, Ms. Khan, Ms. Thanheiser and Mr. Bobrowski.
The repository currently holds about 100 false positives and roughly 3,600 spam
messages. Several employees had the possibility to move messages to the respective
folders, but unfortunately only very few of them did so. The largest part of the
messages was forwarded to these folders. This means that their original headers and
envelopes were lost. Moreover, most messages were forwarded as inline messages. This
implies that their message body was changed too.
Some of the messages were forwarded as attachments leaving all the important
information (header, body) unchanged. However, as all messages provided by
Mobilkom Austria are in a folder located behind a corporate firewall, accessing them is
only possible through a Web based service. This makes it quite difficult to retrieve
messages.
Spam sources:
• 1,382 (ZID, collected via spam traps, 18.08.2004)
• 22 (Spanish messages from W. Strauss, 16.07.2004 – 19.07.2004)
• 44 (Department, Thanheiser, Khan, Hatz, 18.08.2004)
• 52 (W. Gansterer, 09.03.2004 – 05.08.2004)
Ham sources:
• 593 (Hatz, 01.01.2004 – 19.08.2004)
• 245 (Strauss, 23.09.2002 – 15.08.2004)
• 181 (Department, 20.06.2004 – 20.08.2004)
• 12 (Thanheiser, 04.08.2004 – 20.08.2004)
• 44 (Khan, 06.08.2004 – 20.08.2004)
• 36 (Marosi, 04.08.2004 – 13.08.2004)
• 244 (Ilger, 01.07.2004 – 22.08.2004)
• 145 (Newsletter account Chello, 16.08.2004 – 20.08.2004)
Sample size for 1,500 messages: ham 57.745 MB, spam 6.441 MB
Sample size for 1,000 messages: ham 4.17 MB, spam 6.26 MB
Table 11: Hardware and software configuration (surviving labels: Message Store, Windows Configuration, Linux Configuration)
Configuration
Each test set is stored in its own IMAP folder and can be accessed remotely. Due to
the different architectural characteristics of the Linux and Windows spam filters we
decided to implement different ways of testing them⁹:
⁹ The configuration of the MXtreme Mail Firewall is not included because it is a hardware filter and
runs its own operating system. The test process is similar to the Windows test process. We also use our
Java application to deliver the messages to the MXtreme.
4.3.1. Windows Test Process
We use a small Java application that fetches the messages from the IMAP store and
sends them via SMTP directly to the spam filters. The spam filters listen on the standard
SMTP port (port 25). For each test run only one filter is active. All tested products are
configured as gateways¹⁰ and forward the messages to the Microsoft SMTP Service.
The Microsoft SMTP Service finally delivers the e-mail to the appropriate mailboxes.
¹⁰ Symantec Brightmail Anti-Spam works directly in conjunction with the Microsoft SMTP Service
and is not a separate gateway.
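The fetch-and-resend step of the Windows test process can be sketched as follows. The tool used in the project was written in Java and is not reproduced here; this Python sketch only mirrors the described flow (IMAP folder in, SMTP out) using the standard library. The host names, credentials, folder name and addresses in `main` are hypothetical placeholders.

```python
import imaplib
import smtplib


def replay_folder(imap, smtp, folder, sender, recipient):
    """Fetch every message in `folder` and submit it unchanged via SMTP.

    `imap` and `smtp` are already connected client objects. Returns the
    number of messages handed to the SMTP server.
    """
    status, _ = imap.select(folder, readonly=True)
    if status != "OK":
        raise RuntimeError(f"cannot open IMAP folder {folder!r}")
    status, data = imap.search(None, "ALL")
    if status != "OK":
        raise RuntimeError("IMAP search failed")
    sent = 0
    for num in data[0].split():
        status, parts = imap.fetch(num, "(RFC822)")
        if status != "OK":
            continue  # skip messages that cannot be fetched
        raw_message = parts[0][1]  # raw RFC 822 bytes, headers and body intact
        smtp.sendmail(sender, [recipient], raw_message)
        sent += 1
    return sent


def main():
    # Hypothetical hosts, credentials and folder; replace before use.
    imap = imaplib.IMAP4_SSL("imap.example.org")
    imap.login("tester", "secret")
    smtp = smtplib.SMTP("filter-under-test.example.org", 25)
    try:
        count = replay_folder(imap, smtp, "testset-ham",
                              "replay@example.org", "inbox@example.org")
        print(f"{count} messages sent")
    finally:
        smtp.quit()
        imap.logout()
```

The message is passed on as raw bytes so that headers and body reach the filter unmodified; only the SMTP envelope is set by the replay tool.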
5. Experimental Results
The following chapter summarizes our experiments with various anti-spam tools. This
includes open source tools as well as a small selection of commercial products.
At this point we need to emphasize again that the goal of our experiments was not to
thoroughly evaluate or compare various commercial products. Instead, we focused on
analyzing individual methods. As a consequence, our experimental results cannot be
used as the basis of an evaluation or comparison (“ranking”) of commercial products.
Chapter 5.1 outlines the results achieved with our own test sample; Chapter 5.2
describes the results for the SpamAssassin sample. In many cases, there is a vast
number of configuration options. With our limited resources it was not possible to
determine the optimal configuration in terms of performance for each tool.
Consequently, we normally used the standard (default) setup. If simple choices had to
be made, we tried to minimize the number of false positives while keeping the spam
detection rate as high as possible.
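The selection rule stated above (minimize false positives first, then keep the detection rate as high as possible) can be written down in a few lines. The function names are hypothetical, and any counts passed in are up to the caller; no figures from our experiments are used here.

```python
def rates(ham_total, ham_flagged, spam_total, spam_flagged):
    """Compute the two quality measures from raw message counts."""
    return {
        "false_positive_rate": ham_flagged / ham_total,   # ham wrongly marked as spam
        "detection_rate": spam_flagged / spam_total,      # spam correctly marked as spam
    }


def pick_configuration(results):
    """Prefer the lowest false positive rate; break ties by the highest detection rate."""
    return min(
        results,
        key=lambda name: (results[name]["false_positive_rate"],
                          -results[name]["detection_rate"]),
    )
```

Ordering the criteria this way reflects the view that a lost legitimate message is more costly than a spam message that slips through.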
Figure 14: Results for tested commercial products – our test sample.
The experiments summarized in this figure must not be used as the basis for a ranking of different (commercial or
non-commercial) antispam tools, because they were not designed for this purpose. Although the test data used is identical, due to
limited resources no intense efforts could be made to fine-tune any of the tools (cf. page 2).
5.1.1.1. Product 1
total mail received 1,500 1,500
5.1.1.2. Product 2
5.1.1.3. Product 3
5.1.1.4. Product 4
5.1.1.5. Product 5
classified as spam 0 (0%) 915 (91.5%)
5.1.1.6. Product 6
¹¹ Error-Code: 533 Malformed Sender Address
Figure 15: Results for tested open source tools – our test sample
5.1.2.1. SpamAssassin
Tested Version: SpamAssassin, Version 2.64 and Version 3.0
Operating System: SuSE Linux 9.1
Category: Open source
Training Necessity: Yes
Parameter Settings 2: SpamAssassin low (2.64),
Spam threshold = 4,
Bayes disabled,
Network tests disabled
SpamAssassin ham sample spam sample
5.1.2.2. Bogofilter
Category: Open source
Training Necessity: Yes
Parameter Settings 1: self-trained – trained with parts of our own e-mail messages
5.1.3. Conclusion
A look at the results of our experiments with our own test sample (Figure 16) shows
that most products have quite comparable detection rates. In particular, the false
positive rate can be kept low with most products, while the spam recognition rate is
usually around 90 percent. It is remarkable that there is no big difference between the
detection rates of the open source tools and those of the commercial products.
Furthermore, the products that were tested in multiple configurations show that
activating additional features or using a training feature can have a significant
influence on performance. The best example of this is SpamAssassin, which was
tested in four different configurations with an increasing number of features activated.
Figure 16: Results for all tested products (our test sample)
Figure 17: Results for tested commercial products – SpamAssassin sample
5.2.1.1. Product 1
Tested Version: Version 4.7
ham sample spam sample
Product 1
¹² Due to invalid message format
Product 1 (alternative ham sample spam sample
configuration)
5.2.1.2. Product 2
¹³ Due to invalid message format
5.2.1.3. Product 3
5.2.1.4. Product 4
¹⁴ Due to invalid message format
5.2.1.5. Product 5
5.2.1.6. Product 6
total mail sent 1,000 991
java exception¹⁵
Figure 18: Results for tested open source tools – SpamAssassin sample
¹⁵ Due to invalid message format
5.2.2.1. SpamAssassin
Spam threshold = 4,
Bayes enabled,
Network tests enabled
5.2.2.2. Bogofilter
Training Necessity: Yes
Parameter Settings 1: self-trained – trained with parts of our own e-mail messages
Parameter Settings 2: pre-trained – pre-trained configuration files were used
5.2.3. Conclusion
A comparison of the results with the SpamAssassin test sample (Figure 19) shows a
bigger difference between the products than with our own test sample. One reason
might be that the messages in this sample are rather old and may therefore no longer
be included in current signature databases. Another explanation could be that the
general properties of spam messages have changed, so that modern filters cannot
recognize old spam. We see that the best implementations can achieve a spam
recognition rate of 80 percent or more with virtually no false positives.
Figure 19: Results for all tested products – SpamAssassin sample
6. Conclusion
In summary, we make the following observations about anti-spam methods and the
available tools.
6.1. Methods
Most tools and products currently available focus on what we call “post-send” methods
in our categorization of methods against spam e-mail presented in this report (Figure 5).
Their focus is on detecting spam or filtering spam out. Although they are able to
achieve acceptable detection and false positive rates, as our experiments show, many of
those methods (especially classical filters) have some serious drawbacks: They often
require a lot of effort for reacting to changes in the types of spam sent (new rules,
training, etc.), and their performance tends to decrease relatively fast if they are not
“maintained” well; they are often most useful on an individualized, personal basis,
which is an undesirable feature from the point of view of an ISP; they are usually
unable to reduce the waste of resources caused by spam e-mail (network bandwidth,
storage capacity, etc.); and they are often “one step behind” spammers’ tricks.
Nevertheless, there are some interesting newer approaches in this category that we
included under “classification methods”. They are motivated by more general
techniques from the areas of text classification or data mining and are sometimes
algorithmically quite sophisticated. In our opinion they have the potential to overcome
some of these drawbacks. However, so far they are mostly discussed at an academic
level and are often not mature enough for practical use. A major focus of the next
project phase will be to investigate approaches of this type in greater detail.
The situation is quite different with “pre-send” methods. Theoretically, they seem to
have great potential for progress on the spam problem, mostly because they target the
source of the problem (commercial motivation) rather than (only) fighting the
symptoms: the idea is to prevent spam rather than to detect and filter it out.
Unfortunately, these approaches also have some important disadvantages. They tend to
require a large administrative overhead and, more importantly, their success strongly
depends on a worldwide agreement to deploy them. This holds for proposals to increase
the costs of sending e-mail as well as for legal regulations prohibiting the sending of
UBE and UCE. It is unrealistic to expect that e-mail providers worldwide will commit
to common policies in the next few years. Since national or regional boundaries do not
exist on the Internet, we conclude that in the current situation the pre-send approaches
will not “solve” the problem either.
Beyond those two big categories of methods there are also some more “radical”
approaches, such as new protocols for e-mail transfer (instead of SMTP), or the opinion
that we need to shift to a paradigm where we filter ham in instead of filtering spam out.
Although each of these ideas has some merit, their widespread application in practice
is not expected in the near future, least of all for an ISP or in any commercial context.
Our careful analysis of the situation leads us to the conclusion that there is some
potential for significant improvements of existing methods. Moreover, in order to
achieve best results, a multi-layered approach with several “defense lines” seems to be
required. Details will be investigated in the next phase of our project.
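A multi-layered approach of this kind can be pictured as a chain of independent checks, each of which may either decide or pass the message on to the next line of defense. The following sketch is purely illustrative: the three layers, their order, the example addresses and the 0.9 cutoff are assumptions and do not describe a design evaluated in this report.

```python
from typing import Callable, Optional

# A layer returns "spam", "ham", or None when it has no opinion.
Layer = Callable[[dict], Optional[str]]

# Hypothetical example data (documentation-only addresses and domains).
BLACKLISTED_IPS = {"192.0.2.10"}
WHITELISTED_SENDERS = {"colleague@example.org"}


def whitelist_layer(msg: dict) -> Optional[str]:
    return "ham" if msg.get("sender") in WHITELISTED_SENDERS else None


def blacklist_layer(msg: dict) -> Optional[str]:
    return "spam" if msg.get("source_ip") in BLACKLISTED_IPS else None


def content_layer(msg: dict) -> Optional[str]:
    # Stand-in for a statistical filter; the 0.9 cutoff is an assumption.
    return "spam" if msg.get("content_score", 0.0) >= 0.9 else None


def run_defense_lines(msg: dict, layers: list, default: str = "ham"):
    """Return (verdict, name of the layer that decided)."""
    for layer in layers:
        verdict = layer(msg)
        if verdict is not None:
            return verdict, layer.__name__
    return default, "default"


LAYERS = [whitelist_layer, blacklist_layer, content_layer]
```

Cheap source-based checks come first in this ordering, so that the more expensive content analysis only runs on messages the earlier layers could not decide.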
6.2. Experiments
As indicated in the beginning, our goal was to evaluate anti-spam methods and not to
compare products or tools. It would have been beyond the scope and resources of this
project to tune the tools we experimented with in order to achieve the best possible
performance for each of them. In most cases we used more or less a standard
configuration, and if some simple choices had to be made, we tried to maximize the
rate of true positives for the lowest possible rate of false positives. For this reason
the experimental performance achieved has to be interpreted as an approximation;
nevertheless, no major improvements or new insights are to be expected from tuning
parameter settings.
These experimental results again indicate that there is substantial room for
improvement, which we will investigate actively in the next phase of this project.
7. List of Figures
Figure 1: Percentage of e-mail identified as spam, June 2004 [ (no newer data available)
Figure 2: Interaction of legislative measures, law enforcement and percentage of spam [11]
Figure 3: The top ten sources of spam (domains) [14]
Figure 4: Spam categorized in terms of content (data from [10])
Figure 5: Categorization of anti-spam methods
Figure 6: Typical scenario for a blacklist
Figure 7: Sender registration ChoiceMail
Figure 8: Challenge of SFM
Figure 9: Example of a forged header
Figure 10: URL analysis based on [69]
Figure 11: Typical processing path of Symantec Brightmail Anti-Spam
Figure 12: Typical processing path of Symantec Mail Security for SMTP
Figure 13: Windows test process vs. Linux test process
Figure 14: Results for tested commercial products – our test sample
Figure 15: Results for tested open source tools – our test sample
Figure 16: Results for all tested products (our test sample)
Figure 17: Results for tested commercial products – SpamAssassin sample
Figure 18: Results for tested open source tools – SpamAssassin sample
Figure 19: Results for all tested products – SpamAssassin sample
8. List of Tables
Table 1: Products/tools considered, methods used by these, further remarks see page 2
Table 2: The top twelve sources of spam, geographically [13]
Table 3: Cost-profit equation of a spammer (simplified, monthly basis)
Table 4: Typical SMTP dialogue
Table 5: The most important header fields in the Internet Message Format [33]
Table 6: Adaptation of spammers’ techniques to development of filtering techniques [
Table 7: Quality metrics of binary classifiers for the spam problem
Table 8: A typical X-Hashcash header
Table 9: DCC checksums
Table 10: Example of a DCC record
Table 11: Hardware and software configuration
Table 12: Product 1 – results our test sample
Table 13: Product 1 (alternative configuration) – results our test sample
Table 14: Product 2 – results our test sample
Table 15: Product 3 – results our test sample
Table 16: Product 4 – results our test sample
Table 17: Product 5 – results our test sample
Table 18: Product 5 (alternative configuration) – results our test sample
Table 19: Product 6 – results our test sample
Table 20: SpamAssassin standard (2.64) – results our test sample
Table 21: SpamAssassin low (2.64) – results our test sample
Table 22: SpamAssassin Bayes (2.64) – results our test sample
Table 23: SpamAssassin (3.0) – results our test sample
Table 24: Bogofilter – results our test sample
Table 25: CRM 114 self-trained – results our test sample
Table 26: CRM 114 pre-trained – results our test sample
Table 27: Product 1 – results SpamAssassin test sample
Table 28: Product 1 (alternative configuration) – results SpamAssassin test sample
Table 29: Product 2 – results SpamAssassin test sample
Table 30: Product 3 – results SpamAssassin test sample
Table 31: Product 4 – results SpamAssassin test sample
Table 32: Product 5 – results SpamAssassin test sample
Table 33: Product 5 (alternative configuration) – results SpamAssassin test sample
Table 34: Product 6 – results SpamAssassin test sample
Table 35: SpamAssassin standard – results SpamAssassin test sample
Table 36: SpamAssassin low – results SpamAssassin test sample
Table 37: SpamAssassin Bayes – results SpamAssassin test sample
Table 38: SpamAssassin 3.0 – results SpamAssassin test sample
Table 39: Bogofilter – results SpamAssassin test sample
Table 40: CRM 114 self-trained – results SpamAssassin test sample
Table 41: CRM 114 pre-trained – results SpamAssassin test sample
10. Bibliography
[4] P. Hofmann: “Unsolicited Bulk E-mail: Definitions and Problems“, October 5, 1997.
http://www.imc.org/ube-def.html
[5] David Madigan: “Statistics and the War on Spam (A Guide to the Unknown)”, 2004.
http://www.stat.rutgers.edu/~madigan/PAPERS/sagtu.pdf
[7] Mitteilung der Kommission an das europäische Parlament über unerbetene Werbenachrichten,
22.01.2004.
http://europa.eu.int/information_society/topics/ecomm/doc/useful_information/library/communic_reports/spam/spam_com_2004_28_de.pdf
[12] The Spamhaus Project, List of the 200 biggest spammers called ROKSO.
http://www.spamhaus.org/rokso/
[13] Anti-Spam and Virus Software Vendor Sophos, Dirty Dozen, the 12 most spamming countries.
http://www.sophos.com
[16] Center for Democracy and Technology: “Why am I getting all this spam? Unsolicited
commercial e-mail six month report”, March 2003.
http://www.cdt.org/speech/spam/030319spamreport.shtml
[17] Mailutilities, Advanced E-Mail Extractor.
http://www.mailutilities.com/aee/
[19] E-Mail Marketing Software, Mail utilities for Internet business and e-commerce.
http://www.massmailsoftware.com/extractweb/purchase-email-addresses.htm
[35] J. Postel: “Internet Protocol”, September 1981.
ftp://ftp.rfc-editor.org/in-notes/rfc791.txt
[36] Peter Lechner: “Das Simple Mail Transfer Protokoll und die Spamproblematik“, Diplomarbeit
am Institut für Verteilte und Multimediale Systeme, Fakultät für Informatik, Universität Wien,
2005 (in preparation).
[37] G. Hulten et al.: “Trends in Spam Products and Methods”, Microsoft Research, 2004.
www.ceas.cc/papers-2004/165.pdf
[39] C. Dwork, A. Goldberg, and M. Naor: "On Memory-Bound Functions for Fighting Spam",
Proceedings of the 23rd Annual International Cryptology Conference (CRYPTO 2003), August
2003.
[42] D. Turner, D. Havey: “Controlling spam through Lightweight Currency”, November 4, 2003.
http://ftp.csci.csusb.edu/turner/papers/turner_spam.pdf
[43] D. Sorkin: “Overview over the most important anti-spam laws”, December 2003.
www.spamlaws.com
[51] Habeas, Sender Warranted E-Mail, 2004.
http://www.habeas.com
[53] John Ioannidis: “Fighting Spam by Encapsulating Policy in E-Mail Addresses”, Proceedings of
Network and Distributed Systems Security Conference (NDSS), 2003.
[54] T. Tompkins, D. Handley: "Giving e-mail back to the users: Using digital signatures to solve the
spam problem”, First Monday, 8(9), September 2003.
http://firstmonday.org/issues/issue8_9/tompkins/index.html
[58] Microsoft: “Caller ID for E-Mail Technical Specification: The Next Step to Deterring Spam”,
February 12, 2004.
http://www.microsoft.com/downloads/details.aspx?FamilyID=9a9e8a28-3e85-4d07-9d0f-6daeabd3b71b&displaylang=en
[59] J. Lyon: “Purported Responsible Address in E-Mail Messages Specification”, October 2004.
http://www.microsoft.com/downloads/details.aspx?familyid=f8e9cb40-cc7c-46d6-8cd1-3a86a46546d5&displaylang=en
[63] T. Loder, M.V. Alstyne, R. Wash: “An economic answer to unsolicited communication”, ACM
2004.
[64] E. Harris: “The Next Step in the Spam Control War: Greylisting”, August 28, 2003.
http://projects.puremagic.com/greylisting/whitepaper.html
[65] DigiPortal Software Inc.: “Choice Mail, A Spam Blocker – Not just a spam filter”.
http://www.digiportal.com
[68] Jeffrey E.F. Friedl: “Mastering Regular Expressions, Powerful Techniques for Perl and Other
Tools”, ISBN: 1-56592-257-3, O'Reilly, January, 1997.
[69] Oleg Kolesnikov, Wenke Lee, Richard Lipton: “Filtering Spam Using Search Engines”, 2003.
http://www.cc.gatech.edu/~ok/
[70] Karl A. Krueger: “The Spam Battle 2002: A Tactical Update”, SANS GSEC Practical, v1.4,
September 2002.
http://www.sans.org/rr/whitepapers/email/589.php and http://www.rhyolite.com/anti-spam/dcc/
[71] Vipul’s Razor: “A distributed, collaborative, spam detection and filtering network”, December 3,
2004.
http://razor.sourceforge.net/
[72] Pyzor.
http://pyzor.sourceforge.net/
[73] G. Salton, C. Buckley: “Term Weighting Approaches in Automatic Text Retrieval”, Information
Processing and Management, 24:513–523, 1988.
[74] W. Yerazunis: “The Spam Filtering Accuracy Plateau at 99.9% Accuracy and How to Get Past
It”, MIT Spam Conference 2004.
[76] Corinna Cortes, Vladimir Vapnik: “Support-vector networks”, Machine Learning, 20(3):273-
297, November 1995.
[77] Vladimir Vapnik: “The Nature of Statistical Learning Theory”, Springer-Verlag, Heidelberg,
Germany, 1995.
[78] Christopher M. Bishop: “Neural Networks for Pattern Recognition”, Oxford University Press,
1995.
[79] G. Wittel, S. Wu, U. Davis: “On Attacking Statistical Spam Filters”, CEAS 2004.
http://www.ceas.cc/papers-2004/slides/170.pdf
[80] K. Eide: “Winning the war on spam: Comparison of bayesian spam filters”, August 2003.
http://home.dataparty.no/kristian/reviews/bayesian/
[81] A. Kolcz, J. Alspector: “SVM based filtering of e-mail spam with content-specific
misclassification costs”, In Proceedings of the TextDM'01 Workshop on Text Mining – held at
the 2001 IEEE International Conference on Data Mining, 2001.
[82] H. Drucker, Donghui Wu, and V.N. Vapnik: “Support vector machine for spam categorization”
IEEE Transactions on Neural Networks, 10(5):1048–1054, September 1999.
[85] Jonathan de Boyne Pollard: “Fleshing out IM2000”.
http://homepages.tesco.net./%7EJ.deBoynePollard/Proposals/IM2000/
[90] Installation Guide and Administration Guide for Brightmail Anti-Spam, Version 6.0 (Document
Version 1.0).
[92] SurfControl E-Mail Filter for SMTP: Administrator’s Guide (Version 4.7 created September
2003).
[94] Symantec Mail Security for SMTP: Administration Guide (Documentation Version 4.0).
[101] Paolo Frasconi and Giovanni Soda and Alessandro Vullo, “Hidden Markov Models for Text
Categorization in Multi-Page Documents”, J. Intell. Inf. Syst. 18(2-3):195–217, 2002.
[103] Ling spam and PU1 spam corpus.
http://www.iit.demokritos.gr/skel/i-config/downloads/