This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as “work in progress.”
The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.
This Internet-Draft will expire on July 13, 2009.
Copyright (c) 2009 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents in effect on the date of publication of this document (http://trustee.ietf.org/license-info). Please review these documents carefully, as they describe your rights and restrictions with respect to this document.
Many Web servers supply incorrect Content-Type headers with their HTTP responses. In order to be compatible with these Web servers, Web browsers must consider the content of HTTP responses as well as the Content-Type header when determining the effective mime type of the response. This document describes an algorithm for determining the effective mime type of HTTP responses that balances security and compatibility considerations.
1. Introduction
2. Metadata
3. Web Pages
4. Text or Binary
5. Unknown Type
6. Image
7. Feed or HTML
§ Authors' Addresses
1. Introduction
The HTTP Content-Type header indicates the mime type of an HTTP response. However, many HTTP servers supply a Content-Type that does not match the actual contents of the response. Historically, Web browsers have tolerated these servers by examining the content of HTTP responses in addition to the Content-Type header to determine the effective mime type of the response.
Without a clear specification of how to "sniff" the mime type, each browser vendor was forced to reverse engineer the behavior of the other browsers and to develop its own algorithm. These divergent algorithms have led to a lack of interoperability between browsers and to security issues when a site intends an HTTP response to be interpreted as one mime type but the browser interprets the response as another mime type.
These security issues are most severe when a Web site lets users upload files and then serves the contents of those files with a low-privilege mime type (such as text/plain or image/jpeg). In the absence of mime sniffing, this user-generated content cannot run JavaScript, but if the browser treats the response as text/html, then the user can mount a cross-site scripting attack by including JavaScript code in the uploaded file.
This document describes a mime sniffing algorithm that carefully balances the compatibility needs of browser vendors with the security constraints. The algorithm has been constructed with reference to mime sniffing algorithms present in popular Web browsers, an extensive database of Web content, and metrics collected from implementations deployed to a sizable number of Web users.
Warning! It is imperative that the algorithm in this document be followed exactly. When a user agent uses different heuristics for content type detection than the server expects, security problems can occur. For example, if a server believes that the client will treat a contributed file as an image (and thus treat it as benign), but a Web browser believes the content to be HTML (and thus execute any scripts contained therein), the end user can be exposed to malicious content, making the user vulnerable to cookie theft attacks and other cross-site scripting attacks.
2. Metadata
What explicit Content-Type metadata is associated with the resource (the resource's type information) depends on the protocol that was used to fetch the resource.
For HTTP resources, only the first Content-Type HTTP header, if any, contributes any type information; the explicit type of the resource is then the value of that header, interpreted as described by the HTTP specifications. If the Content-Type HTTP header is present but the value of the first such header cannot be interpreted as described by the HTTP specifications (e.g. because its value doesn't contain a U+002F SOLIDUS ('/') character), then the resource has no type information (even if there are multiple Content-Type HTTP headers and one of the other ones is syntactically correct). [HTTP]
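The HTTP rule above can be illustrated with a minimal sketch. The function name and the list-of-header-values input are hypothetical, and the validity check is a simplified stand-in for full HTTP media-type parsing: it only tests for a non-empty type and subtype around the U+002F SOLIDUS character.

```python
from typing import Optional, Sequence


def http_type_information(content_type_headers: Sequence[str]) -> Optional[str]:
    """Return the resource's type information, or None if it has none.

    `content_type_headers` holds the values of every Content-Type header
    in the response, in the order received (hypothetical input shape).
    """
    if not content_type_headers:
        return None  # no Content-Type header at all

    # Only the first header counts, even if a later one is well formed.
    first = content_type_headers[0]

    # ASSUMPTION: simplified validity check, not a full HTTP parser.
    # The part before any ';' must look like type "/" subtype.
    media_type = first.split(";", 1)[0].strip()
    type_part, slash, subtype_part = media_type.partition("/")
    if not slash or not type_part or not subtype_part:
        return None

    # The header value is returned unchanged so that later steps can
    # still compare it byte for byte.
    return first
```

Note that a malformed first header yields no type information even when a second, well-formed header follows it.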
For resources fetched from the file system, user agents should use platform-specific conventions, e.g. operating system extension/type mappings.
Extensions must not be used for determining resource types for resources fetched over HTTP.
For resources fetched over most other protocols, e.g. FTP, there is no type information.
The algorithm for extracting an encoding from a Content-Type, given a string s, is as follows. It either returns an encoding or nothing.
Return the string between this character and the next earliest occurrence of this character.
Return nothing.
Return the string from this character to the first U+0009, U+000A, U+000C, U+000D, U+0020, or U+003B character or the end of s, whichever comes first.
Note: The above algorithm is a willful violation of the HTTP specification. [RFC2616]
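Only the three "Return" outcomes of the encoding-extraction algorithm are reproduced above, so the following is a sketch under stated assumptions rather than a transcription. It assumes the missing steps locate the word "charset" case-insensitively, skip whitespace, require a U+003D EQUALS SIGN, skip whitespace again, and then branch on whether the next character is a quote. The function name is hypothetical.

```python
import re
from typing import Optional

WHITESPACE = "\t\n\x0c\r "       # U+0009, U+000A, U+000C, U+000D, U+0020
TERMINATORS = WHITESPACE + ";"   # the same set plus U+003B


def extract_encoding(s: str) -> Optional[str]:
    # ASSUMPTION: the algorithm starts at the first ASCII
    # case-insensitive occurrence of "charset".
    match = re.search("charset", s, re.IGNORECASE | re.ASCII)
    if match is None:
        return None
    i = match.end()

    # ASSUMPTION: optional whitespace, "=", optional whitespace.
    while i < len(s) and s[i] in WHITESPACE:
        i += 1
    if i >= len(s) or s[i] != "=":
        return None
    i += 1
    while i < len(s) and s[i] in WHITESPACE:
        i += 1
    if i >= len(s):
        return None  # ASSUMPTION: nothing after "=" means no encoding

    ch = s[i]
    if ch in "\"'":
        # Return the string between this character and the next
        # earliest occurrence of this character.
        end = s.find(ch, i + 1)
        if end == -1:
            return None  # ASSUMPTION: an unmatched quote returns nothing
        return s[i + 1:end]

    # Return the string from this character to the first whitespace or
    # U+003B character, or the end of s, whichever comes first.
    j = i
    while j < len(s) and s[j] not in TERMINATORS:
        j += 1
    return s[i:j]
```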
3. Web Pages
The sniffed type of a resource must be found as follows:
+-------------------------------+--------------------------------+
| Bytes in Hexadecimal          | Textual representation         |
+-------------------------------+--------------------------------+
| 74 65 78 74 2f 70 6c 61 69 6e | text/plain                     |
+-------------------------------+--------------------------------+
| 74 65 78 74 2f 70 6c 61 69 6e | text/plain; charset=ISO-8859-1 |
| 3b 20 63 68 61 72 73 65 74 3d |                                |
| 49 53 4f 2d 38 38 35 39 2d 31 |                                |
+-------------------------------+--------------------------------+
| 74 65 78 74 2f 70 6c 61 69 6e | text/plain; charset=iso-8859-1 |
| 3b 20 63 68 61 72 73 65 74 3d |                                |
| 69 73 6f 2d 38 38 35 39 2d 31 |                                |
+-------------------------------+--------------------------------+
| 74 65 78 74 2f 70 6c 61 69 6e | text/plain; charset=UTF-8      |
| 3b 20 63 68 61 72 73 65 74 3d |                                |
| 55 54 46 2d 38                |                                |
+-------------------------------+--------------------------------+
...then jump to the "text or binary" section below.
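Because the table lists exact byte sequences, the check is presumably a byte-for-byte comparison against the Content-Type value rather than a parsed media-type comparison. A minimal sketch under that assumption, with hypothetical names:

```python
# The four Content-Type values from the table, as raw bytes.
TEXT_OR_BINARY_TRIGGERS = frozenset({
    b"text/plain",
    b"text/plain; charset=ISO-8859-1",
    b"text/plain; charset=iso-8859-1",
    b"text/plain; charset=UTF-8",
})


def needs_text_or_binary_check(content_type_value: bytes) -> bool:
    """True if the value matches one of the table rows exactly.

    ASSUMPTION: the comparison is byte-exact, so differences in case,
    spacing, or parameters (e.g. "charset=utf-8") do not match.
    """
    return content_type_value in TEXT_OR_BINARY_TRIGGERS
```

Under this reading, a value such as "text/plain; charset=utf-8" does not trigger the text-or-binary check, since its bytes differ from every row of the table.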
4. Text or Binary
+----------------------+--------------+
| Bytes in Hexadecimal | Description  |
+----------------------+--------------+
| FE FF                | UTF-16BE BOM |
| FF FE                | UTF-16LE BOM |
| EF BB BF             | UTF-8 BOM    |
+----------------------+--------------+
...then the sniffed type of the resource is "text/plain". Abort these steps.
+-------------------------+
| Binary data byte ranges |
+-------------------------+
| 0x00 -- 0x08            |
| 0x0B                    |
| 0x0E -- 0x1A            |
| 0x1C -- 0x1F            |
+-------------------------+
Warning! It is critical that this step never return a scriptable type (e.g. text/html), as doing so would allow a privilege escalation attack.
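The two tables in this section translate directly into two predicates. The sketch below covers only what the tables state: a leading BOM yields "text/plain", and a separate helper reports whether any byte falls in the binary data ranges. How the remaining steps combine these results is not reproduced here, so the helper returns a boolean and leaves that decision to the caller. All names are hypothetical.

```python
from typing import Optional

# Byte order marks from the first table in this section.
BOMS = (b"\xfe\xff", b"\xff\xfe", b"\xef\xbb\xbf")

# Binary data byte ranges from the second table in this section.
BINARY_DATA_BYTES = frozenset(
    [*range(0x00, 0x09), 0x0B, *range(0x0E, 0x1B), *range(0x1C, 0x20)]
)


def sniff_bom(data: bytes) -> Optional[str]:
    """Return "text/plain" if data starts with a listed BOM, else None."""
    return "text/plain" if data.startswith(BOMS) else None


def has_binary_data_byte(data: bytes) -> bool:
    """True if any byte of data lies in the binary data byte ranges."""
    return any(byte in BINARY_DATA_BYTES for byte in data)
```

Neither helper can produce a scriptable type, which is consistent with the warning above.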
5. Unknown Type
If the "and" operator, applied to the index_streamth byte of the stream and the index_patternth byte of the mask, yields a value different from the index_patternth byte of the pattern, then skip this row.
Otherwise, increment index_pattern to the next byte in the mask and pattern and index_stream to the next byte in the byte stream.
"WS" means "whitespace", and allows insignificant whitespace to be skipped when sniffing for a type signature.
If the index_streamth byte of the stream is one of 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space), then increment only the index_stream to the next byte in the byte stream.
Otherwise, increment only the index_pattern to the next byte in the mask and pattern.
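The per-row matching steps above can be sketched as follows. The function name and the representation of a row (a list of mask bytes and a list of pattern entries, with the string "WS" as the whitespace marker) are assumptions for illustration. The sketch also assumes that running out of stream bytes before the pattern is exhausted means the row does not match.

```python
from typing import Sequence, Union

WS = "WS"  # marker for skippable whitespace in a pattern
WHITESPACE = frozenset({0x09, 0x0A, 0x0C, 0x0D, 0x20})


def matches_row(stream: bytes,
                mask: Sequence[int],
                pattern: Sequence[Union[int, str]]) -> bool:
    """Return True if stream matches one (mask, pattern) table row."""
    index_stream = 0
    index_pattern = 0
    while index_pattern < len(pattern):
        if index_stream >= len(stream):
            return False  # ASSUMPTION: too little data means no match
        expected = pattern[index_pattern]
        if expected == WS:
            if stream[index_stream] in WHITESPACE:
                index_stream += 1   # skip one whitespace byte
            else:
                index_pattern += 1  # whitespace run is over
        elif (stream[index_stream] & mask[index_pattern]) != expected:
            return False            # skip this row
        else:
            index_stream += 1
            index_pattern += 1
    return True


# Usage: a row with leading WS and a case-insensitive "<HTML".
HTML_MASK = [0xFF, 0xFF, 0xDF, 0xDF, 0xDF, 0xDF]
HTML_PATTERN = [WS, 0x3C, 0x48, 0x54, 0x4D, 0x4C]
print(matches_row(b"  \n<html>", HTML_MASK, HTML_PATTERN))
```

The 0xDF mask clears bit 5 of each byte, which is what makes the comparison against the uppercase pattern bytes case-insensitive for ASCII letters.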
The table used by the above algorithm is:
+-------------------+-------------------+-----------------+------------+
| Mask in Hex       | Pattern in Hex    | Sniffed type    | Security   |
+-------------------+-------------------+-----------------+------------+
| FF FF DF DF DF DF | 3C 21 44 4F 43 54 | text/html       | Scriptable |
| DF DF DF FF DF DF | 59 50 45 20 48 54 |                 |            |
| DF DF             | 4D 4C             |                 |            |
|                                                                      |
| Comment: The string "<!DOCTYPE HTML" in US-ASCII or compatible       |
| encodings, case-insensitively.                                       |
+-------------------+-------------------+-----------------+------------+
| FF FF DF DF DF DF | WS 3C 48 54 4D 4C | text/html       | Scriptable |
|                                                                      |
| Comment: The string "<HTML" in US-ASCII or compatible encodings,     |
| case-insensitively, possibly with leading spaces.                    |
+-------------------+-------------------+-----------------+------------+
| FF FF DF DF DF DF | WS 3C 48 45 41 44 | text/html       | Scriptable |
|                                                                      |
| Comment: The string "<HEAD" in US-ASCII or compatible encodings,     |
| case-insensitively, possibly with leading spaces.                    |
+-------------------+-------------------+-----------------+------------+
| FF FF DF DF DF DF | WS 3C 53 43 52 49 | text/html       | Scriptable |
| DF DF             | 50 54             |                 |            |
|                                                                      |
| Comment: The string "<SCRIPT" in US-ASCII or compatible              |
| encodings, case-insensitively, possibly with leading                 |
| spaces.                                                              |
+-------------------+-------------------+-----------------+------------+
| FF FF FF FF FF    | 25 50 44 46 2D    | application/pdf | Scriptable |
|                                                                      |
| Comment: The string "%PDF-", the PDF signature.                      |
+-------------------+-------------------+-----------------+------------+
| FF FF FF FF FF FF | 25 21 50 53 2D 41 | application/    | Safe       |
| FF FF FF FF FF    | 64 6F 62 65 2D    | postscript      |            |
|                                                                      |
| Comment: The string "%!PS-Adobe-", the PostScript signature.         |
+-------------------+-------------------+-----------------+------------+
| FF FF 00 00       | FE FF 00 00       | text/plain      | n/a        |
|                                                                      |
| Comment: UTF-16BE BOM                                                |
+-------------------+-------------------+-----------------+------------+
| FF FF 00 00       | FF FE 00 00       | text/plain      | n/a        |
|                                                                      |
| Comment: UTF-16LE BOM                                                |
+-------------------+-------------------+-----------------+------------+
| FF FF FF 00       | EF BB BF 00       | text/plain      | n/a        |
|                                                                      |
| Comment: UTF-8 BOM                                                   |
+-------------------+-------------------+-----------------+------------+
| FF FF FF FF FF FF | 47 49 46 38 37 61 | image/gif       | Safe       |
|                                                                      |
| Comment: The string "GIF87a", a GIF signature.                       |
+-------------------+-------------------+-----------------+------------+
| FF FF FF FF FF FF | 47 49 46 38 39 61 | image/gif       | Safe       |
|                                                                      |
| Comment: The string "GIF89a", a GIF signature.                       |
+-------------------+-------------------+-----------------+------------+
| FF FF FF FF FF FF | 89 50 4E 47 0D 0A | image/png       | Safe       |
| FF FF             | 1A 0A             |                 |            |
|                                                                      |
| Comment: The PNG signature.                                          |
+-------------------+-------------------+-----------------+------------+
| FF FF FF          | FF D8 FF          | image/jpeg      | Safe       |
|                                                                      |
| Comment: A JPEG SOI marker followed by a byte of another marker.     |
+-------------------+-------------------+-----------------+------------+
| FF FF             | 42 4D             | image/bmp       | Safe       |
|                                                                      |
| Comment: The string "BM", a BMP signature.                           |
+-------------------+-------------------+-----------------+------------+
| FF FF FF FF       | 00 00 01 00       | image/vnd.      | Safe       |
|                   |                   | microsoft.icon  |            |
|                                                                      |
| Comment: A 0 word followed by a 1 word, a Windows Icon signature.    |
+-------------------+-------------------+-----------------+------------+
Note: I'd like to add types like MPEG, AVI, Flash, Java, etc, to the above table.
User agents may support further types if desired, by implicitly adding to the above table. However, user agents should not use any other patterns for types already mentioned in the table above, as this could then be used for privilege escalation (where, e.g., a server uses the above table to determine that content is not HTML and thus safe from XSS attacks, but then a user agent detects it as HTML anyway and allows script to execute).
The column marked "security" is used by the algorithm in the "text or binary" section, to avoid sniffing text/plain content as a type that can be used for a privilege escalation attack.
6. Image
If the resource's official type is "image/svg+xml", then the sniffed type of the resource is its official type (an XML type).
Otherwise, if the first bytes of the resource match one of the byte sequences in the first column of the following table, then the sniffed type of the resource is the type given in the corresponding cell in the second column on the same row:
+-------------------------+--------------------------+----------+
| Bytes in Hexadecimal    | Sniffed type             | Comment  |
+-------------------------+--------------------------+----------+
| 47 49 46 38 37 61       | image/gif                | "GIF87a" |
| 47 49 46 38 39 61       | image/gif                | "GIF89a" |
| 89 50 4E 47 0D 0A 1A 0A | image/png                |          |
| FF D8 FF                | image/jpeg               |          |
| 42 4D                   | image/bmp                | "BM"     |
| 00 00 01 00             | image/vnd.microsoft.icon |          |
+-------------------------+--------------------------+----------+
Otherwise, the sniffed type of the resource is the same as its official type.
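The three rules of this section map onto a short function. The following is a minimal sketch with hypothetical names. It assumes the official type is available as a plain string and that "the first bytes of the resource match" means a simple prefix comparison.

```python
# (signature bytes, sniffed type) pairs from the table above.
IMAGE_SIGNATURES = (
    (bytes.fromhex("47 49 46 38 37 61"), "image/gif"),        # "GIF87a"
    (bytes.fromhex("47 49 46 38 39 61"), "image/gif"),        # "GIF89a"
    (bytes.fromhex("89 50 4E 47 0D 0A 1A 0A"), "image/png"),
    (bytes.fromhex("FF D8 FF"), "image/jpeg"),
    (bytes.fromhex("42 4D"), "image/bmp"),                    # "BM"
    (bytes.fromhex("00 00 01 00"), "image/vnd.microsoft.icon"),
)


def sniff_image(official_type: str, data: bytes) -> str:
    """Return the sniffed type for a resource loaded as an image."""
    # SVG is an XML type and is never overridden by byte sniffing.
    if official_type == "image/svg+xml":
        return official_type
    for signature, sniffed_type in IMAGE_SIGNATURES:
        if data.startswith(signature):
            return sniffed_type
    # No signature matched: keep the official type.
    return official_type
```

For example, a response labelled image/png whose body begins with "GIF89a" is sniffed as image/gif, while an image/svg+xml response keeps its type whatever its bytes are.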
7. Feed or HTML
Note: User agents are allowed, by the first step of this algorithm, to wait until the first 512 bytes of the resource are available.
Increase pos by 1 and repeat this step.
Increase pos by 1 and go to the next step.
The sniffed type of the resource is "text/html". Abort these steps.
+----------------------+-----------------------------------+-----------+
| Bytes in Hexadecimal | Requirement                       | Comment   |
+----------------------+-----------------------------------+-----------+
| 72 73 73             | The sniffed type of the resource  | "rss"     |
|                      | is "application/rss+xml"; abort   |           |
|                      | these steps.                      |           |
+----------------------+-----------------------------------+-----------+
| 66 65 65 64          | The sniffed type of the resource  | "feed"    |
|                      | is "application/atom+xml"; abort  |           |
|                      | these steps.                      |           |
+----------------------+-----------------------------------+-----------+
| 72 64 66 3A 52 44 46 | Continue to the next step in this | "rdf:RDF" |
|                      | algorithm.                        |           |
+----------------------+-----------------------------------+-----------+
If none of the byte sequences above match the bytes in s starting at pos, then the sniffed type of the resource is "text/html". Abort these steps.
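The table lookup at this step can be sketched in isolation. The function name is hypothetical, and the sketch assumes that s is the buffered start of the resource and that pos is whatever index the earlier steps have reached. The earlier steps and the step that follows an "rdf:RDF" match are not reproduced here, so that case is reported to the caller as None instead of being resolved.

```python
from typing import Optional


def classify_at(s: bytes, pos: int) -> Optional[str]:
    """Apply the byte-sequence table to s starting at index pos.

    Returns the sniffed type, or None when the bytes are "rdf:RDF",
    meaning the algorithm continues with its next step (not shown).
    """
    if s.startswith(b"rss", pos):
        return "application/rss+xml"
    if s.startswith(b"feed", pos):
        return "application/atom+xml"
    if s.startswith(b"rdf:RDF", pos):
        return None  # continue to the next step in the algorithm
    # None of the byte sequences match.
    return "text/html"
```

The comparisons are case-sensitive because the table lists exact bytes, so "RSS" at pos falls through to "text/html" in this sketch.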
For efficiency reasons, implementations may wish to implement this algorithm and the algorithm for detecting the character encoding of HTML documents in parallel.
Authors' Addresses
Adam Barth
University of California, Berkeley
Email: abarth@eecs.berkeley.edu
URI: http://www.adambarth.com/

Ian Hickson
Google, Inc.
Email: ian@hixie.ch
URI: http://ln.hixie.ch/