INTERNET-DRAFT                                     Bill Yerazunis
draft-spamfilt-inoculation-01.txt                  Jonathan Zdziarski
                                                     spamfilt group
                                                     October 2003
                                                     Expires April 2004

              A MIME Encoding for Spam Inoculation Messages

Status of this Memo

   This document is an Internet-Draft and is subject to all provisions 
   of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-
   Drafts as reference material or to cite them other than as
   "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   Distribution of this memo is unlimited. 

Abstract

   This document describes in detail a method for encapsulating an 
   email message or text sample for the purpose of training (or 
   "inoculating") a mail filter.  The sample messages or text (the 
   "payload") provide the contextual information necessary for the 
   filter to reject ("spam") or accept ("non-spam") the message being 
   inoculated, or messages similar in design.

   RFC 1521 defines the MIME format.  This document expands on this by
   adding an "inoculation" MIME subtype, and also adds additional 
   header fields necessary to the functionality being provided.

   This message format is designed to enable mail filters of differing
   design to communicate inoculations to one another using the MIME 
   subtype introduced here.

1. Introduction

   Analytical anti-spam tools are all subject to the same inherent 
   problem, which is that spam is dynamic; it evolves.  This constant 

spamfilt group          Inoculation Message Format             [Page 1]

INTERNET-DRAFT          Inoculation Message Format

   mutation guarantees a marginal error rate in all such anti-spam 
   tools, making it difficult to achieve perfect accuracy.  

   The premise behind inoculation is to distribute these new mutations 
   to other users so that an entire group may benefit from one user's 
   misfortune.  In light of the fact that there are many different 
   anti-spam tools available today, a standard for sharing an 
   inoculation of either spam or nonspam must be created to both 
   encourage and enable the widespread acceptance of this practice. 
   
   This memo describes several components that combine to create the 
   message format for sharing an inoculation payload.  In particular, 
   it describes:


   1.  The inoculation subtype, which identifies that the message 
       being received is an inoculation and should be treated 
       accordingly.

   2.  An Inoculation-Sender field, which identifies the sender of the
       inoculation, and provides an identity the recipient can query 
       locally for authentication information (such as a shared secret,
       public key, etcetera).

   3.  An Inoculation-Type field, which specifies the type of 
       inoculation payload being sent (spam or nonspam), to instruct 
       the filter how to proceed with importing the inoculation payload.

   4.  An Inoculation-Authentication field which specifies the method 
       of authentication provided (if any) to verify that the 
       inoculation is from a trusted user.

   5.  Extended authentication message components, such as a public key
       signature, may be present depending on the authentication 
       mechanism used.

   6.  The inoculation payload, which is the actual information 
       provided to seed the filter tool.

   This memo expands on RFCs 822 and 1521, which outline the relevant 
   standards for the Internet Mail format (email), and depends upon 
   the standards outlined in RFC 2015 for PGP signed data.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC 2119].

2. Notations, Conventions, and Generic Grammar

   Many of the mechanisms specified in this memo are described formally
   in RFC 822 and RFC 1521.  Implementors will need to be familiar with
   this notation in order to understand this specification, and are 
   referred to RFC 822 and RFC 1521 for a complete explanation.

   The term "message", when not further qualified, means either the 
   (complete or "top-level") message being transferred on a network, 
   or a message encapsulated in a body of type "message".

3. The Inoculation Subtype

   The inoculation subtype specifies the nature of the message body
   to be a complete message, spam or nonspam, presented for inoculation
   to the recipient's filter agent.  The media type identifies the 
   payload being sent.

   In the augmented BNF notation of RFC 822, the message/inoculation 
   MIME type is represented in the Content-Type header field defined 
   as follows:

     content  :=   "Content-Type"  ":"  type  "/"  subtype  *(";"
     auth-parameter)
               ; case-insensitive matching of type and subtype

     type :=          "message"   /    "text"   /   "multipart" 
               ; All values case-insensitive

     subtype := token ; case-insensitive

     auth-parameter := auth-attribute "=" value

     auth-attribute := token   ; case-insensitive

     value := token / quoted-string

     token  :=  1*<any (ASCII) CHAR except SPACE, CTLs,
                      or tspecials>
                                                                        
     tspecials :=  "(" / ")" / "<" / ">" / "@"
                /  "," / ";" / ":" / "\" / <">
                /  "/" / "[" / "]" / "?" / "="
               ; Must be in quoted-string,
               ; to use within parameter values

   The three initial pre-defined media types are detailed in the bulk 
   of this memo.  They are:

   message   -- complete message.  Defines the inoculation as a 
                complete message (spam or nonspam) with its own 
                message structure in compliance with RFC 822.

   text      -- miscellaneous text.  Defines the inoculation as a 
                string of related text without any specific structure.

   multipart -- an inoculation consisting of multiple parts of 
                independent data types.  

   RATIONALE: A filter may process the analysis of the inoculation 
              payload differently depending on the type of information 
              being sent.  In order to ensure the most effective use 
              of the inoculation payload, each inoculation must 
              provide this basic information about itself to avoid 
              ambiguity.

   It should be noted that the list of Content-Type values given here
   may be augmented in time, via the mechanisms described above, and
   that the set of types is expected to grow substantially.

   When a mail reader encounters mail with a subtype of 'inoculation' 
   and an unknown type value, it should generally treat it as 
   equivalent to "text/inoculation", as described in this memo.
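
   The fallback rule above can be sketched in Python.  This is an 
   illustrative sketch only, not part of the format: the names 
   is_token and classify_content_type are hypothetical, and any 
   parameters after the first ";" are simply ignored.

```python
# Characters that the grammar above excludes from a token.
TSPECIALS = set('()<>@,;:\\"/[]?=')
KNOWN_TYPES = {"message", "text", "multipart"}


def is_token(text):
    """True if text is 1*<any ASCII CHAR except SPACE, CTLs, or tspecials>."""
    return bool(text) and all(
        33 <= ord(ch) <= 126 and ch not in TSPECIALS for ch in text
    )


def classify_content_type(header_value):
    """Return the effective inoculation media type, or None when the
    value does not name the 'inoculation' subtype."""
    media_type = header_value.split(";", 1)[0].strip()
    main, sep, sub = media_type.partition("/")
    main, sub = main.strip().lower(), sub.strip().lower()
    if not sep or not is_token(main) or not is_token(sub):
        return None
    if sub != "inoculation":
        return None
    if main not in KNOWN_TYPES:
        main = "text"  # unknown type value: treat as text/inoculation
    return main + "/inoculation"
```

   Matching is done on lowercased values because type and subtype are 
   case-insensitive.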

4. The Inoculation-Sender Field

  The Inoculation-Sender field identifies the sender to the recipient 
  using a common identity shared between the two (for example, an email
  address or user name).  The sender identity is necessary for 
  authentication of the inoculation because it provides a reference to 
  the correct secret, public key, or other authentication information.

  This field has not been defined by any previous standard.  The 
  field's value is a single token specifying the sender's identity, as 
  shown below.  Formally:

     sender := "Inoculation-Sender" ":" token

     token  :=  1*<any (ASCII) CHAR except SPACE, CTLs,
                   or tspecials>

     tspecials :=  "(" / ")" / "<" / ">" /  "," 
                 / ";" / "\" / <"> /  "/" / "[" 
                 / "]" 
               ; Must be in quoted-string,
               ; to use within parameter values

  The values used are case insensitive.  That is, BOB and bob are both 
  the same sender.  Identities should be specific enough to avoid any 
  potential collisions with other users.  A single user should have a 
  single identity among the other users with whom they are sharing 
  inoculations.  For this reason, the sender field can support an
  email address or fingerprint identity.

  The Inoculation-Sender field is a required field and must be present 
  in all inoculation messages.  If the message is a 
  multipart/inoculation media type, the Inoculation-Sender field 
  should follow the rules below:

  1. If all parts of the message are being sent by the same sender, the
  Inoculation-Sender field may appear only once in the message's 
  top-level headers, or individually in each part of the message.

  2. If the message consists of parts being sent by different senders,
  the Inoculation-Sender field must not appear in the message's 
  top-level headers, but must appear individually in each part of the 
  message.
  
  RATIONALE: The sender's identity may not always match the "From" 
             field of the message.  It is necessary to use a field 
             specific to the sender's identification to provide the 
             flexibility to the sender to change their email address, 
             name, or other such data they may use in the "From" field 
             to identify themselves casually.
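
  The two placement rules for multipart messages might be applied as 
  in the following Python sketch.  The function names and the 
  representation of header values (a string, or None when the field 
  is absent) are assumptions made for illustration only.

```python
def normalize_sender(value):
    """Identities compare case-insensitively, so BOB and bob are equal."""
    return value.strip().lower()


def resolve_senders(top_level_sender, part_senders):
    """Return one normalized sender per part, or raise ValueError.

    top_level_sender -- Inoculation-Sender from the top-level headers, or None
    part_senders     -- each part's own Inoculation-Sender value, or None
    """
    resolved = []
    for index, own in enumerate(part_senders):
        # A part's own field wins; otherwise fall back to the top level.
        sender = own if own is not None else top_level_sender
        if sender is None:
            raise ValueError("part %d has no Inoculation-Sender" % index)
        resolved.append(normalize_sender(sender))
    # Rule 2: parts from different senders must not share a top-level field.
    if top_level_sender is not None and len(set(resolved)) > 1:
        raise ValueError("mixed senders must use per-part fields only")
    return resolved
```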

5. The Inoculation-Type Field

  The Inoculation-Type field identifies the type of inoculation being 
  sent.  The two inoculation types presently supported are "spam" and 
  "nonspam".  It is necessary to specify the type of inoculation in 
  order to direct the appropriate method of learning chosen by the 
  filter.  

  This field has not been defined by any previous standard.  The 
  field's value is a single token specifying the type of inoculation, 
  as shown below.  Formally:

     type := "Inoculation-Type" ":" attribute

     attribute := "spam" / "nonspam" / x-token
                         ; all values case insensitive

     x-token := <The two characters "X-" or "x-" followed, with
                    no intervening white space, by any token>

     token  :=  1*<any (ASCII) CHAR except SPACE, CTLs,
                      or tspecials>

     tspecials :=  "(" / ")" / "<" / ">" / "@"
                /  "," / ";" / ":" / "\" / <">
                /  "/" / "[" / "]" / "?" / "="
               ; Must be in quoted-string,
               ; to use within parameter values

  These values are not case sensitive.  That is, SPAM, spam, and SpAm
  are all equivalent.

  A "spam" inoculation type must be accompanied by a message that is 
  deemed to be spam by the sender and a "nonspam" inoculation type must
  be accompanied by a message that is deemed to be innocent by the 
  sender.  The Inoculation-Type field is a required field and must be 
  present in all inoculation messages.  If the message is a 
  multipart/inoculation media type, an Inoculation-Type field should 
  be present in the headers of each part of the message.

  Implementors may, if necessary, define new Inoculation-Type values, 
  but must use an X-token, which is a name prefixed by "X-" to 
  indicate its non-standard status, e.g.:

    Inoculation-Type: x-my-new-type

  However, the creation of new Inoculation-Type values is strongly 
  discouraged, as it seems likely to hinder interoperability with 
  little potential benefit. 
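
  A recipient might validate the field value as in the Python sketch 
  below.  The function names are hypothetical, and how an accepted 
  x-token is subsequently handled is left to the implementation.

```python
# Characters that the grammar above excludes from a token.
TSPECIALS = set('()<>@,;:\\"/[]?=')


def is_token(text):
    """True if text is 1*<any ASCII CHAR except SPACE, CTLs, or tspecials>."""
    return bool(text) and all(
        33 <= ord(ch) <= 126 and ch not in TSPECIALS for ch in text
    )


def parse_inoculation_type(value):
    """Return 'spam', 'nonspam', or a lowercased x-token.

    Raises ValueError for any other value.
    """
    kind = value.strip().lower()  # all values are case insensitive
    if kind in ("spam", "nonspam"):
        return kind
    if kind.startswith("x-") and is_token(kind[2:]):
        return kind
    raise ValueError("unsupported Inoculation-Type: %r" % value)
```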

6. The Inoculation-Authentication Field

  The Inoculation-Authentication field specifies the type of 
  authentication being used to authenticate the sender's identity.  
  Authentication is necessary to ensure that the sender is not a 
  malicious party attempting to reprogram the recipient's filter
  (something a spammer, for example, may attempt to do with mass 
  inoculation mailings).

  The defined authentication methods provide a means of authenticating 
  both the sender and the message, to ensure that the message has not 
  been modified in transit.

  This field has not been defined by any previous standard.  The field's
  value is a single token specifying the type of authentication 
  mechanism used, as shown below.  Formally:

     type := "Inoculation-Authentication" ":" auth-type

     auth-type := "none" / "md5" / "signed" / x-token
                         ; all values case insensitive
                                                                                
     x-token := <The two characters "X-" or "x-" followed, with
                    no intervening white space, by any token>
                                                                                
     token  :=  1*<any (ASCII) CHAR except SPACE, CTLs,
                      or tspecials>
                                                                                
     tspecials :=  "(" / ")" / "<" / ">" / "@"
                /  "," / ";" / ":" / "\" / <">
                /  "/" / "[" / "]" / "?" / "="
               ; Must be in quoted-string,
               ; to use within parameter values

  These values are not case sensitive.  That is, md5, MD5, and Md5 are 
  all equivalent.
                                                                                
  The Inoculation-Authentication field is a required field and must be 
  present in all inoculation messages.  If the message uses a 
  multipart/inoculation media type, an Inoculation-Authentication field
  must be provided for each part of the message.

  Implementors may, if necessary, define new Inoculation-Authentication
  values, but must use an X-token, which is a name prefixed by "X-" to 
  indicate its non-standard status, e.g., 
 
    Inoculation-Authentication: x-my-new-mechanism

  However, the creation of new Inoculation-Authentication values is 
  strongly discouraged, as it seems likely to hinder interoperability 
  with little potential benefit.
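
  The field value, together with any parameters such as the checksum 
  parameter used by the md5 mechanism, could be parsed as in the 
  Python sketch below.  The function name is hypothetical.  As a 
  simplification, the sketch splits on every ";" and therefore does 
  not support a ";" inside a quoted-string.

```python
def parse_authentication(value):
    """Split e.g. 'md5; checksum="abc"' into ('md5', {'checksum': 'abc'})."""
    # strip() also removes the line break left by a folded header line
    pieces = [piece.strip() for piece in value.split(";")]
    mechanism = pieces[0].lower()
    known = mechanism in ("none", "md5", "signed")
    if not known and not mechanism.startswith("x-"):
        raise ValueError("unknown authentication mechanism: %r" % pieces[0])
    params = {}
    for piece in pieces[1:]:
        if not piece:
            continue  # tolerate a trailing ";"
        name, sep, raw = piece.partition("=")
        if not sep:
            raise ValueError("malformed parameter: %r" % piece)
        params[name.strip().lower()] = raw.strip().strip('"')
    return mechanism, params
```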

6.1 The "none" Authentication Mechanism

  The "none" authentication mechanism identifies the message as having 
  no means of authentication.  The decision is left up to the recipient
  as to whether to accept or reject an inoculation with no means of 
  authentication.

6.2 The "md5" Authentication Mechanism

  The "md5" authentication mechanism identifies the message as using an
  MD5 checksum in conjunction with a shared secret to authenticate 
  the sender and the inoculation payload.  The formal grammar for the 
  Inoculation-Authentication field for md5 is as follows:

     auth-type := "md5" ";" "checksum" "=" checksum

     checksum :=  1*<any (ALNUM) CHAR>

  The checksum provided should be both generated by the sender and 
  authenticated by the recipient using the MD5 algorithm in the 
  following manner:

   1. The sender and recipient have agreed on a shared secret, or 
      verification code, to authenticate using this mechanism.

   2. The recipient will, based on the sender identified in the 
      Inoculation-Sender field, look up the sender's shared secret.


   3. The recipient generates an MD5 checksum over the concatenation 
      of the shared secret, a newline character, and the inoculation 
      payload.

   4. If the checksum generated by the recipient matches the checksum 
      provided by the sender, the inoculation is authenticated.

  If the inoculation message has a media type of multipart/inoculation,
  a separate authentication checksum must be provided for every part 
  of the message using the inoculation media subtype.
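
  The checksum computation described above can be sketched in Python 
  using the standard hashlib module.  Two details are assumptions of 
  this sketch rather than requirements of this memo: the newline is 
  taken to be a single LF octet, and the secret and payload are 
  handled as raw bytes.  The function names are hypothetical.

```python
import hashlib
import hmac


def inoculation_checksum(shared_secret, payload):
    """MD5 hex digest of shared secret + newline + payload (all bytes)."""
    return hashlib.md5(shared_secret + b"\n" + payload).hexdigest()


def checksum_matches(shared_secret, payload, claimed):
    """Compare the recomputed checksum with the sender's claimed value."""
    expected = inoculation_checksum(shared_secret, payload)
    # Hex digits are compared case-insensitively.  compare_digest avoids
    # revealing the position of the first mismatch through timing.
    return hmac.compare_digest(
        expected.encode("ascii"), claimed.strip().lower().encode("utf-8")
    )
```

  Because the digest covers the secret as well as the payload, a 
  change to either one produces a different checksum.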

6.3 The "signed" Authentication Mechanism

  The "signed" authentication mechanism identifies the message as using
  a public-key signature to authenticate the sender and the inoculation
  payload.  In order to use the signed authentication mechanism, the 
  media type must be set to multipart/inoculation; however, signed
  authentication limits each inoculation message to only a single 
  inoculation payload.  This is necessary as the public-key signature 
  itself will use a separate part of the message.  

  A separate part of the message containing the public-key signature for
  the inoculation payload must be provided.  Authentication of the 
  inoculation payload should be performed using the standard outlined 
  in RFC 2015.

7. The Inoculation Payload 

  The inoculation payload is the only component provided in the body of
  an inoculation message or message component, and represents all of 
  the data specific to the payload itself.  

  Depending on the media type of the inoculation, the payload may 
  contain different information, as covered below.

  When processing the inoculation payload, special care should be 
  taken to compare the 'Content-Length' as specified in the message 
  with the actual content's length to ensure that the entire message 
  has been received.
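
  The length comparison might look like the following Python sketch.  
  It assumes that the length is counted in octets and that a missing 
  'Content-Length' field leaves nothing to compare.  The function 
  name and the dictionary of lowercased header names are hypothetical.

```python
def payload_is_complete(headers, body):
    """Return True when Content-Length is absent or equals len(body).

    headers -- dict mapping lowercased header names to their values
    body    -- the received payload as bytes
    """
    declared = headers.get("content-length")
    if declared is None:
        return True  # no declared length to compare against
    try:
        return int(declared.strip()) == len(body)
    except ValueError:
        return False  # unparseable length: treat as a mismatch
```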

7.1. The 'message' Payload

  If the media type specified for the payload is message, the 
  inoculation payload must consist of a complete message including 
  message headers as outlined in RFC 822.  An RFC 1521-compliant 
  message incorporating MIME may also be provided, granted that the 
  boundaries specified in the message do not conflict with the 
  boundaries used in the top-level message.

7.2. The 'text' Payload

  Inoculation payloads with a media type of 'text' should be treated as
  plain text.  This media type should be used when the headers for the 
  inoculation payload are unavailable or nonexistent, or if the 
  payload does not conform to the Internet Message standard outlined
  in RFC 822.
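
  A sender has to decide which of the two media types to use for a 
  given sample.  One possible heuristic, offered here only as a 
  sketch and not as a full RFC 822 validator, is to check whether the 
  payload opens with header-like lines followed by a blank line.  The 
  function name and the regular expression are illustrative 
  assumptions.

```python
import re

# A field name: printable ASCII other than SPACE and ":", then a colon.
HEADER_LINE = re.compile(r"^[!-9;-~]+:")


def choose_media_type(payload):
    """Pick 'message/inoculation' when the payload starts with a header
    block ended by a blank line; otherwise 'text/inoculation'."""
    saw_header = False
    for line in payload.split("\n"):
        if line == "":
            break  # blank line: end of the candidate header block
        if line[0] in " \t" and saw_header:
            continue  # folded continuation of the previous header
        if not HEADER_LINE.match(line):
            return "text/inoculation"
        saw_header = True
    else:
        return "text/inoculation"  # no header/body separator found
    return "message/inoculation" if saw_header else "text/inoculation"
```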

7.3. The 'multipart' Payload

  No payload is provided or assumed when the media type for the 
  top-level message is multipart.  Instead, the individual components 
  of the message must be examined as to the standards set in RFC 1521. 
  Each message component must provide its own specific media type, 
  which must be either 'message' or 'text' when specifying a media 
  subtype of inoculation.
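
  The examination of the individual components can be sketched in 
  Python as follows.  The sketch splits the body by hand on the 
  RFC 1521 delimiter lines, assumes LF line endings, and ignores any 
  preamble or epilogue.  All function names are hypothetical.

```python
def split_parts(body, boundary):
    """Split a multipart body into raw parts using RFC 1521 delimiters."""
    delimiter = "--" + boundary
    parts, current = [], None
    for line in body.splitlines():
        stripped = line.rstrip()
        if stripped == delimiter + "--":  # closing delimiter
            if current is not None:
                parts.append("\n".join(current))
            return parts
        if stripped == delimiter:  # start of the next part
            if current is not None:
                parts.append("\n".join(current))
            current = []
        elif current is not None:
            current.append(line)
    raise ValueError("missing closing delimiter")


def part_media_type(part):
    """Return the lowercased Content-Type of one raw part, without parameters."""
    for line in part.split("\n"):
        if not line.strip():
            break  # end of this part's headers
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-type":
            return value.split(";", 1)[0].strip().lower()
    return None


def check_multipart(body, boundary):
    """Return each part's media type; raise ValueError if one is not allowed."""
    types = [part_media_type(part) for part in split_parts(body, boundary)]
    for media_type in types:
        if media_type not in ("message/inoculation", "text/inoculation"):
            raise ValueError("unexpected part type: %r" % media_type)
    return types
```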

8. Message Examples

  This section provides some examples of the message format.  Depending 
  on how this draft has been formatted, the whitespace and structure 
  of the example messages may have been changed, causing the checksums 
  in the examples to fail.

  Assume the recipient (spamsucks@myhouse.com) is willing to accept 
  anti-spam inoculations from the sender 
  (jonathan@nuclearelephant.com).  They have previously agreed on the 
  shared authentication secret 'beware the jabberwock'.  

  The steps involved in receiving and processing this inoculation are 
  as follows.  

  0. The recipient's inoculation-aware spam tool notes that this is an 
     inoculation-type message.

  1. The recipient spam tool parses the headers to find the claimed 
     sender is jonathan@nuclearelephant.com, and the claimed 
     inoculation type is spam.

  2. The recipient spam tool checks the local set of authorized 
     inoculators, and finds that jonathan@nuclearelephant.com is 
     permitted to inoculate spam.

  3. The recipient spam tool looks up jonathan@nuclearelephant.com, 
     and finds that the corresponding authentication shared secret is 
     the string 'beware the jabberwock'.

  4. The recipient spam tool tests to confirm that this is not a 
     multipart inoculation, and that the payload is the entire data 
     text area.

  5. The recipient spam tool forms the authentication text by 
     concatenating the authentication shared secret, a newline, and 
     the full data text area.  The data text area begins after the 
     obligatory newline-newline that follows the last header line 
     (the newline-newline itself is omitted) and continues to 
     end-of-file on the email text, or for the length specified in 
     the 'Content-Length' field, if present.

  6. The recipient spam tool calculates the md5 checksum of this 
     authentication text.

  7. The recipient spam tool compares the calculated checksum (from 
     step 6) with the claimed checksum found in the message header.  
     If the checksum does not match, no automatic inoculation is done 
     and the MTA may either notify the user of an attempted 
     inoculation failure, or may simply drop the message and exit with 
     nonerror status.  It is recommended that this behavior be 
     user-configurable.

  8. Having validated the authenticity of the sender / checksum / 
     payload tuple, the payload (and only the payload) is forwarded to 
     the proper user-configured spam filtering program's learning 
     interface, including the information that the payload was "spam".  
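
  The steps above can be sketched end to end in Python for a 
  non-multipart inoculation using md5 authentication.  This is a 
  minimal sketch under stated assumptions: LF line endings, a 
  hypothetical in-memory table named AUTHORIZED in place of the 
  recipient's local configuration, and a return value of None for 
  every rejection.  What is done with a rejected message (step 7) is 
  left to the caller.

```python
import hashlib

# Hypothetical local table of authorized inoculators:
# sender identity -> (shared secret, inoculation types the sender may send)
AUTHORIZED = {
    "jonathan@nuclearelephant.com": (b"beware the jabberwock", {"spam"}),
}


def parse_message(raw):
    """Split raw bytes into (headers, payload) at the first blank line."""
    head, _, payload = raw.partition(b"\n\n")
    headers, name = {}, None
    for line in head.decode("ascii", "replace").split("\n"):
        if line[:1] in (" ", "\t") and name:  # folded continuation line
            headers[name] += " " + line.strip()
        elif ":" in line:
            name, _, value = line.partition(":")
            name = name.strip().lower()
            headers[name] = value.strip()
    return headers, payload


def process_inoculation(raw):
    """Return (inoculation_type, payload) if accepted, else None."""
    headers, payload = parse_message(raw)
    # Steps 0 and 4: an inoculation, and not a multipart one.
    media_type = headers.get("content-type", "").split(";")[0].strip().lower()
    if media_type not in ("message/inoculation", "text/inoculation"):
        return None
    # Step 1: claimed sender and claimed inoculation type.
    sender = headers.get("inoculation-sender", "").lower()
    kind = headers.get("inoculation-type", "").lower()
    # Steps 2 and 3: authorization check and shared secret lookup.
    if sender not in AUTHORIZED:
        return None
    secret, allowed = AUTHORIZED[sender]
    if kind not in allowed:
        return None
    # Step 5: honor Content-Length when present.
    length = headers.get("content-length", "")
    if length.isdigit():
        payload = payload[:int(length)]
    # Steps 6 and 7: recompute the checksum and compare it.
    auth = headers.get("inoculation-authentication", "")
    mechanism, _, params = auth.partition(";")
    if mechanism.strip().lower() != "md5":
        return None
    claimed = params.partition("=")[2].strip().strip('"').lower()
    computed = hashlib.md5(secret + b"\n" + payload).hexdigest()
    if computed != claimed:
        return None
    # Step 8: only the payload and its type go on to the learner.
    return kind, payload
```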

  Please note also that, should the message contain a 'From ' line, a 
  space must precede that line in order to comply with RFC 821.  This 
  space should be part of the inoculation payload, and stripped out by 
  the recipient's spam tool.  No 'From ' lines are used in the 
  examples below.

8.1 message/inoculation example

  To: Everyone on my list <spamsucks@myhouse.com>
  From: Jonathan A. Zdziarski <jonathan@nuclearelephant.com>
  Subject: This is a test inoculation
  Inoculation-Authentication: md5;
      checksum="dcdac94fab6ded79f33b0134d665d02f"
  Inoculation-Type: spam
  Inoculation-Sender: jonathan@nuclearelephant.com
  Content-Type: message/inoculation
  Content-Length: 169

  From: Bob Denver <bob@dead.com>
  Subject: This is a spam
  To: You <you@youremail.com>

  This is a test innoculation.  The checksum is correct, however.

     -Bill Yerazunis

8.2 text/inoculation example

  From: Jonathan A. Zdziarski <jonathan@nuclearelephant.com>
  To: Everyone on my list <spamsucks@myhouse.com>
  Subject: This is a test inoculation
  Inoculation-Authentication: md5;
      checksum="d5c883bce00de5391fbd8f7d17fb56a4"
  Inoculation-Type: spam
  Inoculation-Sender: jonathan@nuclearelephant.com
  Content-Type: text/inoculation
  Content-Length: 84

  This is a test innoculation.  The checksum is correct, however.

     -Bill Yerazunis

8.3 multipart/inoculation example

  To: Everyone on my list <spamsucks@myhouse.com>
  From: Jonathan A. Zdziarski <jonathan@nuclearelephant.com>
  Subject: This is a test inoculation
  Inoculation-Sender: jonathan@nuclearelephant.com
  Content-Type: multipart/inoculation; boundary="--NextPart-010203"

  ----NextPart-010203
  Inoculation-Authentication: md5;
      checksum="c3a47b29744062288cbd5c305897eaa9"
  Inoculation-Type: spam
  Content-Type: message/inoculation
  Content-Length: 169

  From: Bob Denver <bob@dead.com>
  Subject: This is a spam
  To: You <you@youremail.com>

  This is a test innoculation.  The checksum is correct, however.

     -Bill Yerazunis
  ----NextPart-010203
  Inoculation-Authentication: md5;
      checksum="d5c883bce00de5391fbd8f7d17fb56a4"
  Inoculation-Type: spam
  Content-Type: text/inoculation
  Content-Length: 84

  This is a test innoculation.  The checksum is correct, however.

     -Bill Yerazunis
  ----NextPart-010203--

Acknowledgements

   Many thanks to Brian Burton for his input and comments to this 
   document.

References

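   [RFC822]  - Standard for the Format of ARPA Internet Text Messages

   [RFC1521] - MIME (Multipurpose Internet Mail Extensions) Part One: 
               Mechanisms for Specifying and Describing the Format of 
               Internet Message Bodies

   [RFC2015] - MIME Security with Pretty Good Privacy (PGP)

   [RFC2119] - Key words for use in RFCs to Indicate Requirement Levels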

Authors' Addresses

   Please send all comments to one of the authors listed below.

   Bill Yerazunis
   Mitsubishi Electric
   201 Broadway
   Cambridge, MA 02139
   USA

   Phone: +1 617 621 7530
   Email: wsy@merl.com

   Jonathan A. Zdziarski
   3069 Heritage Rd.
   Milledgeville, GA 31061
   USA

   Phone: +1 478 452 8187
   Email: jonathan@nuclearelephant.com

Full Copyright Statement

   Copyright (C) The Regents of the Anti-Spam Community (2003).  
   All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published and
   distributed, in whole or in part, without restriction of any kind,
   provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the copyright holder or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE AUTHORS DISCLAIM ALL WARRANTIES, EXPRESS OR 
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF 
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
