Internet DRAFT - draft-spamfilt-inoculation
INTERNET-DRAFT Bill Yerazunis
draft-spamfilt-inoculation-01.txt Jonathan Zdziarski
spamfilt group
October 2003
Expires April 2004
A MIME Encoding for Spam Inoculation Messages
Status of this Memo
This document is an Internet-Draft and is subject to all provisions
of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-
Drafts as reference material or to cite them other than as
"work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
Distribution of this memo is unlimited.
Abstract
This document describes in detail a method for encapsulating an
email message or text sample for the purpose of training (or
"inoculating") a mail filter. The sample messages or text (the
"payload") provide the contextual information necessary for the
filter to reject ("spam") or accept ("non-spam") the message being
inoculated, or messages similar in design.
RFC 1521 defines the MIME format. This document expands on this by
adding an "inoculation" MIME subtype, and also adds additional
header fields necessary to the functionality being provided.
This message format is designed to enable different mail filters of
different design to communicate inoculations with one another using
the MIME subtype introduced.
1. Introduction
Analytical anti-spam tools are all subject to the same inherent
problem, which is that spam is dynamic; it evolves. This constant
spamfilt group Inoculation Message Format [Page 1]
INTERNET-DRAFT Inoculation Message Format
mutation guarantees a marginal error rate in all such anti-spam
tools, making it difficult to achieve perfect accuracy.
The premise behind inoculation is to distribute these new mutations
to other users so that an entire group may benefit from one user's
misfortune. In light of the fact that there are many different
anti-spam tools available today, a standard for sharing an
inoculation of either spam or nonspam must be created to both
encourage and enable the widespread acceptance of this practice.
This memo describes several components that combine to create the
message format for sharing an inoculation payload. In particular,
it describes:
1. The inoculation subtype, which identifies that the message
being received is an inoculation and should be treated
accordingly.
2. An Inoculation-Sender field, which identifies the sender of the
inoculation, and provides an identity the recipient can query
locally for authentication information (such as a shared secret,
public key, etcetera).
3. An Inoculation-Type field, which specifies the type of
inoculation payload being sent (spam or nonspam), to instruct
the filter how to proceed with importing the inoculation payload.
4. An Inoculation-Authentication field which specifies the method
of authentication provided (if any) to verify that the
inoculation is from a trusted user.
5. Extended authentication message components, such as a public key
signature, may be present depending on the authentication
mechanism used.
6. The inoculation payload, which is the actual information
provided to seed the filter tool.
This memo expands on RFCs 822 and 1521 which outline the relevant
standards for the Internet Mail format (email), and depends upon
the standards outlined in RFC 2015 for PGP signed data.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC 2119].
2. Notations, Conventions, and Generic Grammar
Many of the mechanisms specified in this memo are described formally
in RFC 822 and RFC 1521. Implementors will need to be familiar with
this notation in order to understand this specification, and are
referred to RFC 822 and RFC 1521 for a complete explanation.
The term "message", when not further qualified, means either the
(complete or "top-level") message being transferred on a network,
or a message encapsulated in a body of type "message".
3. The Inoculation Subtype
The inoculation subtype specifies the nature of the message body
to be a complete message, spam or nonspam, presented for inoculation
to the recipient's filter agent. The media type identifies the
payload being sent.
In the augmented BNF notation of RFC 822, the message/inoculation
MIME type is represented in the Content-Type header field defined
as follows:
content := "Content-Type" ":" type "/" subtype *(";"
auth-parameter)
; case-insensitive matching of type and subtype
type := "message" / "text" / "multipart"
; All values case-insensitive
subtype := token ; case-insensitive
auth-parameter := auth-attribute "=" value
auth-attribute := token ; case-insensitive
value := token / quoted-string
token := 1*<any (ASCII) CHAR except SPACE, CTLs,
or tspecials>
tspecials := "(" / ")" / "<" / ">" / "@"
/ "," / ";" / ":" / "\" / <">
/ "/" / "[" / "]" / "?" / "="
; Must be in quoted-string,
; to use within parameter values
The three initial pre-defined media types are detailed in the bulk
of this memo. They are:
message -- complete message. Defines the inoculation as a
complete message (spam or nonspam) with its own
message structure in compliance with RFC 822.
text -- miscellaneous text. Defines the inoculation as a
string of related text without any specific structure.
multipart -- an inoculation consisting of multiple parts of
independent data types.
RATIONALE: A filter may process the analysis of the inoculation
payload differently depending on the type of information
being sent. In order to ensure the most effective use
of the inoculation payload, each inoculation must
provide this basic information about itself to avoid
ambiguity.
It should be noted that the list of Content-Type values given here
may be augmented in time, via the mechanisms described above, and
that the set of types is expected to grow substantially.
When a mail reader encounters mail with a subtype of 'inoculation'
and an unknown type value, it should generally treat it as
equivalent to "text/inoculation", as described in this memo.
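The fallback rule above can be illustrated with a minimal Python sketch. The function name inoculation_media_type is hypothetical and is not defined by this memo; parameter handling is deliberately simplified.

```python
KNOWN_TYPES = {"message", "text", "multipart"}

def inoculation_media_type(content_type):
    """Return the inoculation media type, or None if this is not an inoculation.

    Unknown types that carry the 'inoculation' subtype fall back to "text".
    """
    # Drop any parameters (";" attribute "=" value) before examining type/subtype.
    media = content_type.split(";", 1)[0].strip().lower()
    if "/" not in media:
        return None
    main_type, subtype = media.split("/", 1)
    if subtype != "inoculation":
        return None
    return main_type if main_type in KNOWN_TYPES else "text"
```

Under these assumptions, a value such as "image/inoculation" would be handled as "text", while "text/plain" would not be treated as an inoculation at all.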
4. The Inoculation-Sender Field
The Inoculation-Sender field identifies the sender to the recipient
using a common identity shared between the two (for example, an email
address or user name). The sender identity is necessary for
authenticating the inoculation, since it provides a reference to the
correct secret, public key, or other authentication information.
This field has not been defined by any previous standard. The
field's value is a single token specifying the sender's identity, as
shown below. Formally:
sender := "Inoculation-Sender" ":" token
token := 1*<any (ASCII) CHAR except SPACE, CTLs,
or tspecials>
tspecials := "(" / ")" / "<" / ">" / ","
/ ";" / "\" / <"> / "/" / "["
/ "]"
; Must be in quoted-string,
; to use within parameter values
The values used are case insensitive. That is, BOB and bob are both
the same sender. Identities should be specific enough to avoid any
potential collisions with other users. A single user should have a
single identity among the other users with whom they share
inoculations. For this reason, the sender field can support an
email address or fingerprint identity.
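As an illustration only, a recipient might validate and normalize the sender token along the following lines. The name normalize_sender is hypothetical; the character set is taken from the tspecials list given for this field, which leaves "@" and ":" available for addresses and fingerprints.

```python
# tspecials as listed for the Inoculation-Sender field.
SENDER_TSPECIALS = set('()<>,;\\"/[]')

def normalize_sender(value):
    """Return the lower-cased sender identity, or None if it is not a valid token."""
    value = value.strip()
    if not value:
        return None
    for ch in value:
        code = ord(ch)
        # Reject SPACE, control characters (including DEL), non-ASCII, and tspecials.
        if code <= 32 or code >= 127 or ch in SENDER_TSPECIALS:
            return None
    # Identities are case insensitive, so compare and store them in one case.
    return value.lower()
```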
The Inoculation-Sender field is a required field and must be present
in all inoculation messages. If the message is a
multipart/inoculation media type, the Inoculation-Sender field
should follow the rules below:
1. If all parts of the message are being sent by the same sender, the
Inoculation-Sender field may appear only once in the message's
top-level headers, or individually in each part of the message.
2. If the message consists of parts being sent by different senders,
the Inoculation-Sender field must not appear in the message's
top-level headers, but must appear individually in each part of the
message.
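One possible reading of the two rules above is sketched below in Python. The function resolve_part_senders and its arguments are hypothetical; the sketch assumes the header values have already been extracted, and it treats a part that disagrees with a top-level sender as an error.

```python
def resolve_part_senders(top_level_sender, part_senders):
    """Return the effective sender of each part, or raise ValueError.

    top_level_sender: sender from the top-level headers, or None if absent.
    part_senders: one entry per part; None where the part names no sender.
    """
    top = top_level_sender.lower() if top_level_sender else None
    resolved = []
    for index, sender in enumerate(part_senders):
        sender = sender.lower() if sender else None
        if top is None:
            # Without a top-level field, every part must name its own sender.
            if sender is None:
                raise ValueError(f"part {index} has no Inoculation-Sender")
            resolved.append(sender)
        elif sender is not None and sender != top:
            # Rule 2: parts from different senders rule out a top-level field.
            raise ValueError(f"part {index} conflicts with the top-level sender")
        else:
            resolved.append(top)
    return resolved
```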
RATIONALE: The sender's identity may not always match the "From"
field of the message. It is necessary to use a field
specific to the sender's identification to provide the
flexibility to the sender to change their email address,
name, or other such data they may use in the "From" field
to identify themselves casually.
5. The Inoculation-Type Field
The Inoculation-Type field identifies the type of inoculation being
sent. The two inoculation types presently supported are "spam" and
"nonspam". It is necessary to specify the type of inoculation in
order to direct the appropriate method of learning chosen by the
filter.
This field has not been defined by any previous standard. The
field's value is a single token specifying the type of inoculation,
as shown below. Formally:
type := "Inoculation-Type" ":" attribute
attribute := "spam" / "nonspam" / x-token
; all values case insensitive
x-token := <The two characters "X-" or "x-" followed, with
no intervening white space, by any token>
token := 1*<any (ASCII) CHAR except SPACE, CTLs,
or tspecials>
tspecials := "(" / ")" / "<" / ">" / "@"
/ "," / ";" / ":" / "\" / <">
/ "/" / "[" / "]" / "?" / "="
; Must be in quoted-string,
; to use within parameter values
These values are not case sensitive. That is, SPAM and spam and SpAm
are all equivalent.
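The grammar above might be checked as in the following Python sketch. The names is_token and parse_inoculation_type are hypothetical, and rejecting unknown values with an exception is an assumption; a recipient could equally choose to ignore such messages.

```python
TSPECIALS = set('()<>@,;:\\"/[]?=')

def is_token(text):
    """True if text is one or more ASCII characters other than SPACE, CTLs, tspecials."""
    return bool(text) and all(32 < ord(ch) < 127 and ch not in TSPECIALS for ch in text)

def parse_inoculation_type(value):
    """Return "spam", "nonspam", or a lower-cased x-token; raise ValueError otherwise."""
    attribute = value.strip().lower()
    if attribute in ("spam", "nonspam"):
        return attribute
    # An x-token is "X-" or "x-" followed directly by any token.
    if attribute.startswith("x-") and is_token(attribute[2:]):
        return attribute
    raise ValueError(f"unsupported Inoculation-Type: {value!r}")
```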
A "spam" inoculation type must be accompanied by a message that is
deemed to be spam by the sender and a "nonspam" inoculation type must
be accompanied by a message that is deemed to be innocent by the
sender. The Inoculation-Type field is a required field and must be
present in all inoculation messages. If the message is a
multipart/inoculation media type, an Inoculation-Type field should
be present in the headers of each part of the message.
Implementors may, if necessary, define new Inoculation-Type values,
but must use an X-token, which is a name prefixed by "X-" to
indicate its non-standard status, e.g.:
Inoculation-Type: x-my-new-type
However the creation of new Inoculation-Type values is strongly
discouraged, as it seems likely to hinder interoperability with
little potential benefit.
6. The Inoculation-Authentication Field
The Inoculation-Authentication field specifies the type of
authentication being used to authenticate the sender's identity.
Authentication is necessary to ensure that the sender is not a
malicious party attempting to reprogram the recipient's filter
(something a spammer, for example, may attempt to do with mass
inoculation mailings).
The defined authentication methods provide a means of authenticating
both the sender and the message, to ensure that the message has not
been modified in transit.
This field has not been defined by any previous standard. The field's
value is a single token specifying the type of authentication
mechanism used, as shown below. Formally:
authentication := "Inoculation-Authentication" ":" auth-type
auth-type := "none" / "md5" / "signed" / x-token
; all values case insensitive
x-token := <The two characters "X-" or "x-" followed, with
no intervening white space, by any token>
token := 1*<any (ASCII) CHAR except SPACE, CTLs,
or tspecials>
tspecials := "(" / ")" / "<" / ">" / "@"
/ "," / ";" / ":" / "\" / <">
/ "/" / "[" / "]" / "?" / "="
; Must be in quoted-string,
; to use within parameter values
These values are not case sensitive. That is, md5, MD5, and Md5 are
all equivalent.
The Inoculation-Authentication field is a required field and must be
present in all inoculation messages. If the message uses a
multipart/inoculation media type, an Inoculation-Authentication field
must be provided for each part of the message.
Implementors may, if necessary, define new Inoculation-Authentication
values, but must use an X-token, which is a name prefixed by "X-" to
indicate its non-standard status, e.g.,
Inoculation-Authentication: x-my-new-mechanism
However the creation of new Inoculation-Authentication values is
strongly discouraged, as it seems likely to hinder interoperability
with little potential benefit.
6.1 The "none" Authentication Mechanism
The "none" authentication mechanism identifies the message as having
no means of authentication. The decision is left up to the recipient
as to whether to accept or reject an inoculation with no means of
authentication.
6.2 The "md5" Authentication Mechanism
The "md5" authentication mechanism identifies the message as using an
MD5 checksum in conjunction with a shared secret to authenticate
the sender and the inoculation payload. The formal grammar for the
Inoculation-Authentication field for md5 is as follows:
auth-type := "md5" ";" "checksum" "=" checksum
checksum := 1*<any (ALNUM) CHAR>
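A rough Python sketch of parsing this field follows. The function parse_authentication is hypothetical. It accepts the checksum either as a bare token or as a quoted-string, since the examples in section 8 quote the value, and it splits naively on ";", which would not be adequate for quoted values that themselves contain a semicolon.

```python
def parse_authentication(value):
    """Split an Inoculation-Authentication value into (mechanism, parameters)."""
    pieces = [piece.strip() for piece in value.split(";")]
    mechanism = pieces[0].lower()
    if mechanism not in ("none", "md5", "signed") and not mechanism.startswith("x-"):
        raise ValueError(f"unknown authentication mechanism: {mechanism!r}")
    params = {}
    for piece in pieces[1:]:
        if not piece:
            continue
        name, sep, raw = piece.partition("=")
        if not sep:
            raise ValueError(f"malformed parameter: {piece!r}")
        # Strip optional surrounding quotes from the parameter value.
        params[name.strip().lower()] = raw.strip().strip('"')
    if mechanism == "md5":
        checksum = params.get("checksum", "")
        # The grammar requires one or more alphanumeric characters.
        if not (checksum.isascii() and checksum.isalnum()):
            raise ValueError("md5 requires an alphanumeric checksum parameter")
    return mechanism, params
```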
The checksum provided should be both generated by the sender and
authenticated by the recipient using the MD5 algorithm in the
following manner:
1. The sender and recipient have agreed on a shared secret, or
verification code, to authenticate using this mechanism.
2. The recipient will, based on the sender identified in the
Inoculation-Sender field, look up the sender's shared secret.
3. An MD5 checksum is generated by combining the shared secret +
a newline character + the inoculation payload.
4. If the checksum generated by the recipient matches the checksum
provided by the sender, the inoculation is authenticated.
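The steps above can be sketched in Python with the standard hashlib module. The function names are hypothetical. The sketch assumes that the newline is a single LF byte, that the secret and payload are handled as raw bytes, and that the checksum is exchanged as hexadecimal text (as the examples in section 8 suggest); none of these details is fixed by the steps themselves.

```python
import hashlib
import hmac

def inoculation_checksum(shared_secret, payload):
    """MD5 over shared secret + newline + payload, as lowercase hexadecimal."""
    return hashlib.md5(shared_secret + b"\n" + payload).hexdigest()

def checksum_matches(shared_secret, payload, claimed):
    """Compare the recipient's own checksum with the one claimed by the sender."""
    expected = inoculation_checksum(shared_secret, payload).encode("ascii")
    supplied = claimed.strip().lower().encode("ascii", "replace")
    # compare_digest avoids revealing where the first mismatch occurs.
    return hmac.compare_digest(expected, supplied)
```

The sender would call inoculation_checksum to fill in the checksum parameter, and the recipient would call checksum_matches with the secret looked up for the claimed sender.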
If the inoculation message has a media type of multipart/inoculation,
a separate authentication checksum must be provided for every part
of the message using the inoculation media subtype.
6.3 The "signed" Authentication Mechanism
The "signed" authentication mechanism identifies the message as using
a public-key signature to authenticate the sender and the inoculation
payload. In order to use the signed authentication mechanism, the
media type must be set to multipart/inoculation; however, signed
authentication limits each inoculation message to a single
inoculation payload. This is necessary because the public-key signature
itself occupies a separate part of the message.
A separate part of the message containing the public-key signature for
the inoculation payload must be provided. Authentication of the
inoculation payload should be performed using the standard outlined
in RFC 2015.
7. The Inoculation Payload
The inoculation payload is the only component provided in the body of
an inoculation message or message component, and represents all of
the data specific to the payload itself.
Depending on the media type of the inoculation, the payload may
contain different information covered below.
When processing the inoculation payload, special care should be
taken to compare the 'Content-Length' as specified in the message
with the actual content's length to ensure that the entire message
has been received.
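A minimal Python sketch of this length check is given below. The name payload_is_complete is hypothetical, and treating a missing Content-Length as "nothing to check" is an assumption of the sketch rather than a requirement of this memo.

```python
def payload_is_complete(content_length_header, payload):
    """Return True if the payload length matches the declared Content-Length.

    content_length_header: the raw header value, or None if the field is absent.
    payload: the received payload as bytes.
    """
    if content_length_header is None:
        return True
    try:
        declared = int(content_length_header.strip())
    except ValueError:
        # A non-numeric Content-Length cannot confirm anything.
        return False
    return declared >= 0 and declared == len(payload)
```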
7.1. The 'message' Payload
If the media type specified for the payload is message, the
inoculation payload must consist of a complete message including
message headers as outlined in RFC 822. An RFC 1521-compliant
message incorporating MIME may also be provided, granted that the
boundaries specified in the message do not conflict with the
boundaries used in the top-level message.
7.2. The 'text' Payload
Inoculation payloads with a media type of 'text' should be treated as
plain text. This media type should be used when the headers for the
inoculation payload are unavailable or nonexistent, or if the
payload does not conform to the Internet Message standard outlined
in RFC 822.
7.3. The 'multipart' Payload
No payload is provided or assumed when the media type for the
top-level message is multipart. Instead, the individual components
of the message must be examined as to the standards set in RFC 1521.
Each message component must provide its own specific media type,
which must be either 'message' or 'text' when specifying a media
subtype of inoculation.
8.0 Message Examples
This section provides some examples of the message format. Depending
on how this draft has been formatted, the whitespace and structure of
the example messages may have changed, causing the checksums in the
examples to fail.
Assume the recipient (spamsucks@myhouse.com) is willing to accept
inoculations of antispam from the sender
(jonathan@nuclearelephant.com). They have previously agreed on the
shared authentication secret of 'beware the jabberwock'.
The steps involved in receiving and processing this inoculation are
as follows.
0. The recipient's inoculation-aware spam tool notes that this is an
inoculation-type message.
1. The recipient spam tool parses the headers to find the claimed
sender is jonathan@nuclearelephant.com, and the claimed
inoculation type is spam.
2. The recipient spam tool checks the local set of authorized
inoculators, and finds that jonathan@nuclearelephant.com is
permitted to inoculate spam.
3. The recipient spam tool looks up jonathan@nuclearelephant.com,
and finds that the corresponding authentication shared secret is
the string of 'beware the jabberwock'.
4. The recipient spam tool tests to confirm that this is not a
multipart inoculation, and that the payload is the entire data
text area.
5. The recipient spam tool forms the authentication text by
concatenating the authentication shared secret, a newline, and
the full data text area (omitting the obligatory newline-newline
after the last header line) and continuing to end-of-file on the
email text or the length of the content, specified in the
'Content-Length' field, if present.
6. The recipient spam tool calculates the md5 checksum of this
authentication text.
7. The recipient spam tool compares the calculated checksum (from
step 6) with the claimed checksum found in the message header.
If the checksum does not match, no automatic inoculation is done
and the MTA may either notify the user of an attempted
inoculation failure, or may simply drop the message and exit with
nonerror status. It is recommended that this behavior be
user-configurable.
8. Having validated the authenticity of the sender / checksum /
payload tuple, the payload (and only the payload) is forwarded to
the proper user-configured spam filtering program's learning
interface, including the information that the payload was "spam".
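Steps 0 through 8 might be combined as in the following Python sketch for a non-multipart inoculation using md5 authentication. Everything named here (AUTHORIZED, SECRETS, split_message, process_inoculation, and the learn callback) is hypothetical. The sketch assumes LF line endings and does its own simple header parsing; it silently returns False on any failure, where a real tool would make that behavior user-configurable.

```python
import hashlib
import hmac

# Hypothetical local tables: who may inoculate what, and each sender's shared secret.
AUTHORIZED = {"jonathan@nuclearelephant.com": {"spam"}}
SECRETS = {"jonathan@nuclearelephant.com": b"beware the jabberwock"}

def split_message(raw):
    """Split raw message bytes into (headers dict, body bytes) at the first blank line."""
    head, _, body = raw.partition(b"\n\n")
    headers = {}
    name = None
    for line in head.decode("ascii", "replace").split("\n"):
        if line[:1] in (" ", "\t") and name:
            headers[name] += " " + line.strip()   # folded continuation line
        elif ":" in line:
            name, _, value = line.partition(":")
            name = name.strip().lower()
            headers[name] = value.strip()
    return headers, body

def process_inoculation(raw, learn):
    """Follow steps 0-8; call learn(payload, kind) on success and return True."""
    headers, payload = split_message(raw)
    media = headers.get("content-type", "").split(";")[0].strip().lower()
    if media not in ("message/inoculation", "text/inoculation"):    # steps 0 and 4
        return False
    sender = headers.get("inoculation-sender", "").lower()           # step 1
    kind = headers.get("inoculation-type", "").lower()
    if kind not in AUTHORIZED.get(sender, set()):                    # step 2
        return False
    secret = SECRETS.get(sender)                                      # step 3
    mechanism, _, param = headers.get("inoculation-authentication", "").partition(";")
    if secret is None or mechanism.strip().lower() != "md5":
        return False
    claimed = param.partition("=")[2].strip().strip('"').lower()
    length = headers.get("content-length", "")
    if length.isdecimal():                                            # step 5
        payload = payload[: int(length)]
    expected = hashlib.md5(secret + b"\n" + payload).hexdigest()     # steps 5 and 6
    if not hmac.compare_digest(expected.encode(), claimed.encode()):  # step 7
        return False
    learn(payload, kind)                                              # step 8
    return True
```

Only the payload and its inoculation type reach the learn callback; the outer headers are used for routing and authentication and are then discarded.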
Please note also that, should the message contain a 'From ' header, a
space must precede the line in order to comply with RFC 821. This space
should be part of the inoculation payload, and should be stripped out by
the recipient's spam tool. No 'From ' lines are used in the examples
below.
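As a sketch of the stripping step, the recipient's tool might restore such lines as shown below. The function name is hypothetical. The sketch assumes LF line endings and assumes that the checksum has already been verified over the payload with the protective space still in place, which is one reading of the requirement that the space be part of the payload.

```python
def unescape_from_lines(payload):
    """Remove the single protective space from payload lines that begin with ' From '."""
    lines = payload.split(b"\n")
    restored = [line[1:] if line.startswith(b" From ") else line for line in lines]
    return b"\n".join(restored)
```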
8.1 message/inoculation example
To: Everyone on my list <spamsucks@myhouse.com>
From: Jonathan A. Zdziarski <jonathan@nuclearelephant.com>
Subject: This is a test inoculation
Inoculation-Authentication: md5;
    checksum="dcdac94fab6ded79f33b0134d665d02f"
Inoculation-Type: spam
Inoculation-Sender: jonathan@nuclearelephant.com
Content-Type: message/inoculation
Content-Length: 169

From: Bob Denver <bob@dead.com>
Subject: This is a spam
To: You <you@youremail.com>

This is a test innoculation. The checksum is correct, however.
-Bill Yerazunis
8.2 text/inoculation example
From: Jonathan A. Zdziarski <jonathan@nuclearelephant.com>
To: Everyone on my list <spamsucks@myhouse.com>
Subject: This is a test inoculation
Inoculation-Authentication: md5;
    checksum="d5c883bce00de5391fbd8f7d17fb56a4"
Inoculation-Type: spam
Inoculation-Sender: jonathan@nuclearelephant.com
Content-Type: text/inoculation
Content-Length: 84

This is a test innoculation. The checksum is correct, however.
-Bill Yerazunis
8.3 multipart/inoculation example
To: Everyone on my list <spamsucks@myhouse.com>
From: Jonathan A. Zdziarski <jonathan@nuclearelephant.com>
Subject: This is a test inoculation
Inoculation-Sender: jonathan@nuclearelephant.com
Content-Type: multipart/inoculation; boundary="--NextPart-010203"

----NextPart-010203
Inoculation-Authentication: md5;
    checksum="c3a47b29744062288cbd5c305897eaa9"
Inoculation-Type: spam
Content-Type: message/inoculation
Content-Length: 169

From: Bob Denver <bob@dead.com>
Subject: This is a spam
To: You <you@youremail.com>

This is a test innoculation. The checksum is correct, however.
-Bill Yerazunis
----NextPart-010203
Inoculation-Authentication: md5;
    checksum="d5c883bce00de5391fbd8f7d17fb56a4"
Inoculation-Type: spam
Content-Type: text/inoculation
Content-Length: 84

This is a test innoculation. The checksum is correct, however.
-Bill Yerazunis
----NextPart-010203--
Acknowledgements
Many thanks to Brian Burton for his input and comments to this
document.
References
[RFC822] - Standard for the format of ARPA Internet text messages
[RFC1521] - MIME (Multipurpose Internet Mail Extensions) Part One:
Mechanisms for Specifying and Describing the Format of
Internet Message Bodies
Author's Address
Please send all comments to one of the authors listed below.
Bill Yerazunis
Mitsubishi Electric
201 Broadway
Cambridge, MA 02139
USA
Phone: +1 617 621 7530
Email: wsy@merl.com
Jonathan A. Zdziarski
3069 Heritage Rd.
Milledgeville, GA 31061
USA
Phone: +1 478 452 8187
Email: jonathan@nuclearelephant.com
Full Copyright Statement
Copyright (C) The Regents of the Anti-Spam Community (2003).
All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published and
distributed, in whole or in part, without restriction of any kind,
provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the copyright holder or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
"AS IS" basis and THE AUTHORS DISCLAIM ALL WARRANTIES, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.