Internet DRAFT - draft-lee-sdch-spec
draft-lee-sdch-spec
Network Working Group J. Butler
Internet-Draft
Intended status: Informational W. Lee
Expires: May 1, 2017
B. McQuade
K. Mixter
October 28, 2016
A Proposal for Shared Dictionary Compression over HTTP
draft-lee-sdch-spec-00
Abstract
This paper proposes an HTTP/1.1-compatible extension that supports
inter-response data compression by means of a reference dictionary
shared between user agent and server.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on May 1, 2017.
Copyright Notice
Copyright (c) 2016 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
Butler, et al. Expires May 1, 2017 [Page 1]
Internet-Draft sdch-spec October 2016
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
1. Introduction
In order to reduce payload size, HTTP/1.1 supports response
compression via the Accept-Encoding and Content-Encoding headers.
The most commonly used HTTP response compression encoding is gzip,
which compresses data that is repeated within a given response.
However, HTTP/1.1 does not provide a mechanism for compressing data
that is repeated between responses. A different class of encoding
technique, known as delta encoding, has proven effective at
compressing inter-response data.
Previous efforts to extend HTTP/1.1 to support delta compression have
focused on encoding an HTTP response as a delta of a previous version
of that response. One such approach is discussed in RFC 3229 "Delta
encoding in HTTP" [RFC3229]. While RFC 3229 is effective at reducing
payload size for many types of resources, it may not be suitable for
certain classes of responses.
Specifically, under RFC 3229, deltas can only be applied to responses
originating from the same URL, and the means of identifying the
instance to delta "from" is by a Last-Modified timestamp or entity-
tag. This makes RFC 3229 unsuitable for compressing dynamically
generated responses to a given URL with varying query parameters
(e.g. a search results page), since these types of responses are
difficult to identify uniquely using entity tags or last modified
timestamps. Content hashes can be used, but false positives are
possible. Also, storing all previous responses on the server may not
be practical.
2. Proposal: Shared Dictionary Compression over HTTP
Existing techniques compress each response in isolation, and so
cannot take advantage of cross-payload redundancy. For example,
retrieving a set of HTML pages with the same header, footer, inlined
JavaScript and CSS requires the retransmission of the same data
multiple times. This paper proposes a compression technique that
leverages this cross-payload redundancy.
In this proposal, a dictionary is a file downloaded by the user agent
from the server that contains strings which are likely to appear in
subsequent HTTP responses. In the case described above, if the
header, footer, JavaScript and CSS are stored in a dictionary
possessed by both user agent and server, the server can substitute
these elements with references to the dictionary, and the user agent
can reconstruct the original page from these references. By
Butler, et al. Expires May 1, 2017 [Page 2]
Internet-Draft sdch-spec October 2016
substituting dictionary references for repeated elements in HTTP
responses, the payload size can be reduced.
If either the user agent or the server does not support the
extension, then ordinary HTTP responses are served.
If both the user agent and the server support the extension but the
user agent does not have an applicable dictionary (as described in
detail below), the server responds with an ordinary HTTP response
that includes a header advertising the location of a relevant
dictionary. This dictionary can be retrieved out-of-band by the user
agent.
If both the user agent and the server support the extension and the
user agent has an applicable dictionary, then each HTTP response
includes references to strings in the dictionary, rather than
repeating those strings in the response. The references require
fewer bytes to encode than the strings themselves, reducing the
payload size.
The HTTP header-based protocol for negotiating the presence of
dictionaries on user agent and server is referred to in this proposal
as the SDCH protocol. The compression scheme based on a particular
dictionary shared between user agent and server is referred to as the
SDCH encoding, and is built upon the VCDIFF compression data format
[RFC3284].
3. Syntax
The grammar descriptions in the sections that follow depend on the
following syntax: DIGIT (decimal digit), BASE64URLDIGIT (alphanumeric
digit or "-" or "_"), PAYLOADBYTE (a byte), token (informally, a
sequence of non-special, non-white space characters), rest-of-line
(informally, a sequence of characters not including carriage return
or line-feed). In the grammar below, HTTP_url, abs_path, and query
are defined in RFC 7230 [RFC7230].
header = attr ":" value "\n"
attr = token
value = rest-of-line
dictionary-client-id = 1*BASE64URLDIGIT
dictionary-server-id = 1*BASE64URLDIGIT
payload = 1*PAYLOADBYTE
vcdiff-payload = 1*PAYLOADBYTE
partial-url = HTTP_url | abs_path [ "?" query ]
The attribute names (attr) are case-insensitive. White space is
permitted between tokens.
Butler, et al. Expires May 1, 2017 [Page 3]
Internet-Draft sdch-spec October 2016
4. Dictionary Description
4.1. General
In the proposed protocol, a dictionary can only be used with a
limited set of URLs and for a limited duration of time, referred to
as its scope and lifetime, respectively. A dictionary is composed of
the data used by the compression algorithm, known as the payload, as
well as metadata describing its scope and lifetime. The scope is
specified by a domain attribute and path attribute that are patterned
after the same named attributes from the HTTP State Management
Specification [RFC2965].
4.2. Syntax of Dictionary Metadata
The syntax of dictionary metadata is as follows:
dictionary-metadata = 1*dictionary-header "\n"
dictionary-header = "domain" ":" value "\n"
| "path" ":" value "\n"
| "path-equals" ":" value "\n"
| "format-version" ":" value "\n"
| "max-age" ":" value "\n"
| "port" ":" <"> portlist <"> "\n"
portlist = 1#portnum
portnum = 1*DIGIT
A complete dictionary definition then has this format: n dictionary-
definition = dictionary-metadata payload
Informally, the metadata for a dictionary is a series of headers,
similar in form to HTTP headers, terminated by an empty line. The
dictionary payload begins immediately after this blank line.
The valid dictionary header identifiers are described below:
o Domain: domain.
Required. Indicates the domain to which the dictionary applies. The
domain specification must explicitly start with a dot. For example,
a dictionary with the domain specification ".google.com" may be used
to compress a response served from the host name www.google.com, but
not used to compress a response served from the host name
www.gmail.com. Only printable ASCII characters are permitted in the
domain value. International Domain Names must be specified using
IDNA.
o Path: path.
Butler, et al. Expires May 1, 2017 [Page 4]
Internet-Draft sdch-spec October 2016
Optional. Indicates the set of URL paths for which this dictionary
is valid. If unspecified, the dictionary applies to all paths within
the given domain.
o Path-equals: path.
Optional. Indicates the exact URL path for which this dictionary is
valid. If both "path" and "path-equals" are specified, the
dictionary applies only to those URLs which satisfy both criteria.
o Format-version: version.
Optional. Indicates the version of the dictionary payload. If
unspecified, the format version defaults to "1.0". Currently, the
only acceptable value is "1.0".
o Max-age: delta-seconds.
Optional. Indicates the amount of time that a dictionary can be
advertised to the server by the user agent, relative to the time it
was downloaded. If unspecified, the default is 30 days from the time
the dictionary was downloaded by the user agent. * Port: port list.
Optional. Indicates the comma-separated list of ports to which this
dictionary applies. If unspecified, the dictionary applies to all
ports.
Like HTTP headers, dictionary header identifiers are case-
insensitive. Unknown headers will be ignored by the user agent,
allowing other headers to be added in the future.
4.3. Dictionary Scope
The specific rules of when a dictionary can be applied to a URL, i.e.
that define its scope, are modeled after the rules for cookie
scoping. The term "domain-match" is defined in RFC 2965. We define
path-matching as follows For two strings that represent paths, P1 and
P2, P1 path-matches P2 if either:
1. P2 is equal to P1
2. P2 is a prefix of P1 and either the final character in P2 is "/"
or the character following P2 in P1 is "/".
For example, "/tec/waldo" path-matches "/tec", "/tec/", and "/tec/
waldo", but does not path-match "/tec/wal".
Butler, et al. Expires May 1, 2017 [Page 5]
Internet-Draft sdch-spec October 2016
Given these definitions of domain-match and path-match, a request URL
falls within a dictionary's scope exactly when all of the following
are true:
1. The request URL's host name domain-matches the Domain attribute
of the dictionary.
2. If the dictionary has a Port attribute, the request port is one
of the ports listed in the Port attribute.
3. The request URL path-matches the path attribute of the
dictionary.
4. The request URL's scheme matches the scheme of the dictionary.
If a URL falls within a dictionary's scope, the dictionary is said to
"apply" to the URL.
4.4. Dictionary Identifier
In communications between user agent and server, a dictionary is
identified by the first 96 bits of the SHA-256 digest [RFC6234] of a
dictionary's metadata and payload (see dictionary-definition above)
exactly as it is received by the user agent from the server. Both
user agent and server compute this identifier independently, based on
the metadata and the payload of the dictionary. This digest should
be unique within a dictionary's scope (domain and path) in order to
prevent dictionary identifier collisions.
The digest serves not only as an identifier but also as a safeguard
against attempts to maliciously intercept or otherwise modify
dictionary contents, since a compromised dictionary will hash to a
different identifier and the server will not recognize it. The user
agent identifier for a dictionary is defined as the URL-safe base64
encoding (as described in RFC 3548, section 4 [RFC3548] of the first
48 bits (bits 0..47) of the dictionary's SHA-256 digest. The server
identifier for a dictionary is the URL-safe base64 encoding of the
second 48 bits (bits 48..95). When identifying a dictionary to the
server, the user agent uses the user agent identifier, and similarly,
when identifying a dictionary to the user agent, the server uses the
server identifier. Note that both user agent and server have the
entire dictionary and can thus compute both identifiers for the
dictionary.
As a consequence of this scheme, dictionaries do not need to be
explicitly named by site maintainers, as the protocol avoids
identifying them in any way other than the above digest-generated
identifiers.
Butler, et al. Expires May 1, 2017 [Page 6]
Internet-Draft sdch-spec October 2016
4.5. Differences between Dictionaries and Cookies
Dictionaries are similar to cookies in that they allow sharing of
state over HTTP. Thus, we have modeled dictionaries after cookies,
as described in RFC 2965. However, because dictionaries are
typically larger than cookies, embedding a dictionary in the response
would increase latency of the response. Thus a dictionary is always
sent as a separate HTTP response (unlike a cookie which is included
in a Set-Cookie header of any HTTP response). The Get-Dictionary
HTTP response header is used to tell the user agent that it should
fetch a dictionary separately for use in future requests.
Likewise, rather than including the dictionary contents in the HTTP
request headers (like a cookie in the Cookie header), dictionary
identifiers (described above) are used to advertise available
dictionaries in HTTP requests from the user agent to the server.
5. User Agent / Server Interaction Description
5.1. User Agent Role in HTTP Request Generation
The user agent:
1. Advertises support for the proposed protocol by adding the "sdch"
token to the Accept-Encoding header of HTTP requests.
2. Advertises any dictionaries it possesses that apply to the URL
being requested (per the scoping rules above) in the Avail-
Dictionary request header.
The Avail-Dictionary header syntax is as follows: avail-dictionary-
header = "Avail-Dictionary" ":" 1#dictionary-client where dictionary-
client-id is the user agent identifier part for the dictionary based
on the SHA-256 digest as described above. The value of this header
is informally a comma separated list of user agent dictionary
identifiers.
The user agent must advertise every dictionary it has cached that
applies to the requested URL. It is only the presence of the
dictionary identifier in this header that indicates to the server
that the user agent possesses and therefore does not need to download
the dictionary. Since the user agent must advertise every dictionary
it has, it is the site maintainer's responsibility to avoid making
too many dictionaries available at a given time. Advertising many
dictionaries in this header can counteract the benefits of
compression.
Butler, et al. Expires May 1, 2017 [Page 7]
Internet-Draft sdch-spec October 2016
Note that for each individual request the user agent has discretion
over whether or not to add "sdch" Accept-Encoding token and the
Avail-Dictionary header. Since some responses, such as image data,
are unlikely to benefit from dictionary compression, the user agent
can reduce the size of its requests by not sending this token and
header. The user agent may decide whether or not to add these
headers based on file extensions in URLs or the context of the
request. For instance, the user agent may choose to not advertise
SDCH for URLs referenced in IMG elements.
5.2. Server Role in HTTP Response Generation
When a server that supports the extension receives a request that
indicates that the user agent supports the protocol (e.g. the "sdch"
token is present in the Accept-Encoding request header), two
independent decisions must be made. The server must decide: 1. if it
wants to send an encoded response. 2. if it wants to inform the user
agent about additional dictionaries it can download and use in the
future.
The server may return an encoded response only if all of the
following are true: 1. The Accept-Encoding request header contains
the "sdch" token. 2. The server can send a response compressed with
a dictionary whose dictionary-client-id is in the Avail-Dictionary
request header.
A server may return a response that is not encoded even if it
recognizes a dictionary advertised by the user agent. If the server
decides to not use SDCH encoding when a Avail-Dictionary header is
present, it must include a specific HTTP header X-SDCH-Encoding with
value "0" in the response. The syntax of the X-SDCH-Encoding header
is:
sdch-not-used-header = "X-SDCH-Encoding" ":" "0"
The server indicates that an HTTP response is encoded by inserting
the token "sdch" into the Content-Encoding header of the HTTP
response.
A compatible server may instruct a compatible user agent to download
one or more new dictionaries by including the Get-Dictionary header
in the HTTP response. The server may advertise a Get-Dictionary
header even if the response is not encoded. The syntax of the Get-
Dictionary header is: get-dictionary-header = "Get-Dictionary" ":"
1#partial-url where partial-url is either a complete URL, or just the
absolute URL path (in which case the scheme, host, and port of the
originating server would be used when requesting the dictionary). If
a complete URL is provided, it must have the same scheme, host, and
Butler, et al. Expires May 1, 2017 [Page 8]
Internet-Draft sdch-spec October 2016
port as the originating server. The Content-Type header of
dictionary responses must be application/x-sdch-dictionary. The
value in the get dictionary header is a comma-separated list of
partial-url elements.
The server must not advertise a dictionary with a dictionary-client-
id that the user agent has listed in the Avail-Dictionary header.
The server may use SDCH compression with a dictionary that the user
agent has advertised and also include a Get-Dictionary header for a
different dictionary that the user agent has not advertised.
The server must prevent SDCH-encoded responses from being cached by
intermediate proxies. See the section below on proxy caching for
additional details.
The server should limit the number of active dictionaries at any one
time, by using well-scoped dictionaries. A server that has many
active dictionaries with overlapping scope will cause user agents to
generate a very long Avail-Dictionary header, the overhead of which
can counteract the benefits of SDCH compression.
The server may decide to precompute and cache SDCH-encoded responses
if a given SDCH-encoded response will be served multiple times (e.g.
for static content).
The server may apply multiple Content-Encodings to the response,
(e.g. sdch and gzip) in which case subsequent encoding tokens are
appended to the Content-Encoding header, per the HTTP/1.1 RFC section
14.11.
5.3. User Agent Role in HTTP Response Handling
An SDCH-compatible user agent must inspect the Content-Encoding HTTP
response header to determine if the response is SDCH-encoded. If the
Content-Encoding includes the "sdch" token, the user agent must
perform SDCH decompression on the response.
If the HTTP response includes a Get-Dictionary header, the user agent
must verify that the partial-url specified refers to the same server
that generated the response. If so, the user agent may download the
dictionary at the given URL.
There are two different URLs to consider when downloading and storing
a dictionary. The referer URL is the URL of the request that
resulted in the server responding with a Get-Dictionary header.
The dictionary URL is defined as follows:
Butler, et al. Expires May 1, 2017 [Page 9]
Internet-Draft sdch-spec October 2016
1. If the partial-url is a complete URL, the dictionary URL is the
partial-url.
2. If the partial-url is just a path URL, the dictionary URL is
generated from the scheme and host name of the referrer URL and
the path in the partial-url.
The user agent may retrieve a dictionary if the origin of the
dictionary matches the origin of the referrer. HTTP redirects may
only be followed if the origin matches as well.
Upon retrieving the dictionary, the user agent must validate the
dictionary. Here again, the validation rules are modeled after the
rules for when a user agent can accept an HTTP cookie. A dictionary
is invalid and must not be stored if any of the following are true:
1. The dictionary has no Domain attribute.
2. The effective host name that derives from the referrer URL host
name does not domain-match the Domain attribute.
3. The Domain attribute is a top level domain.
4. The referrer URL host is a host domain name (not IP address) and
has the form HD, where D is the value of the Domain attribute,
and H is a string that contains one or more dots.
5. If the dictionary has a Port attribute and the referrer URL's
port was not in the list.
If the dictionary is valid and user agent decides to store the
dictionary, the scheme of the dictionary URL should also be stored
along with dictionary.
5.4. SDCH-Encoded Response Body
An SDCH-encoded response starts with the dictionary-server-id used to
compress the response. The syntax of the SDCH-encoded response is:
dictionary-compression-response = dictionary-server-id "\0" vcdiff-
payload
6. Examples
For the purpose of these examples, assume the following dictionaries
exist on the server and can be downloaded from the following URLs:
"Search results" dictionary
Butler, et al. Expires May 1, 2017 [Page 10]
Internet-Draft sdch-spec October 2016
o domain: .google.com
o path: /search
o user agent ID: TWFuIGlz
o server ID: JOWk0d2N
o download location: /dictionaries/search_dict
"Help pages" dictionary
o domain: .google.com
o path: /
o user agent ID: GVhc3V48
o server ID: O9d2_m3-
o download location: /dictionaries/help_dict
Note that the dictionary identifier consists of two parts: user agent
ID and the server ID. Most of the detail of the request and response
headers has been omitted.
6.1. Example 1: Initial Interaction, User Agent has No Dictionaries
1. user agent's request
GET /search?q=sprouts HTTP/1.1
Host: www.google.com
Accept-Encoding: sdch, gzip
1. server's response
HTTP/1.1 200 OK
Content-type: text/html
Content-Encoding: gzip
Get-Dictionary: /dictionaries/search_dict, /dictionaries/help_dict
Cache-Control: private
Note that the response returned by the server does NOT use SDCH
encoding, since the user agent does not have a dictionary. The
server simply provides the locations of the dictionaries for future
use. The user agent may choose to retrieve one or both dictionaries
separately.
Butler, et al. Expires May 1, 2017 [Page 11]
Internet-Draft sdch-spec October 2016
6.2. Example 2: User Agent Requests the Dictionary
1. user agent's request
GET /dictionaries/search_dict HTTP/1.1
Host: www.google.com
Accept-Encoding: sdch, gzip
1. server's response
HTTP/1.1 200 OK
Content-type: application/x-sdch-dictionary
Content-Encoding: gzip
Domain: .google.com
Path: /search
Format-version: 1.0
...dictionary contents...
Upon receiving this response, the user agent computes the digest of
the dictionary and determines the user agent ID is TWFuIGlz and the
server ID is JOWk0d2N.
6.3. Example 3: User Requests Page AND User Agent Has Already
Downloaded
the Dictionary
1. user agent's request
GET /search&q=brussel+sprouts HTTP/1.1
Host: www.google.com
Accept-Encoding: sdch, gzip
Avail-Dictionary: TWFuIGlz
1. server's response
HTTP/1.1 200 OK
Content-type: text/html
Content-Encoding: sdch, gzip
Get-Dictionary: /dictionaries/help_dict
Cache-Control: private
JOWk0d2N<NUL>...VCDIFFed response...
(note that the response shown to the left the result of gzip
decompression)
Butler, et al. Expires May 1, 2017 [Page 12]
Internet-Draft sdch-spec October 2016
The server has properly identified the dictionary using its server ID
and the user agent can confirm that the second 48 bits of the SHA-256
digest of the dictionary match its computation. It can then
decompress the VCDIFF response using this dictionary. Even though
the "search results" dictionary was used to decompress the response,
the server has chosen to indicate another dictionary could be
requested by the user agent from http://www.google.com/dictionaries/
help_dict. This dictionary must be different than the "search
results" dictionary as the server must never request the user agent
download a dictionary it knows the user agent already has. Let's
assume the user agent decides to download this dictionary.
6.4. Example 4: User Requests with Multiple Dictionaries
1. user agent's request
GET /search&q=brussels HTTP/1.1
Host: www.google.com
Accept-Encoding: sdch, gzip
Avail-Dictionary: GVhc3V48,TWFuIGlz
1. server's response
HTTP/1.1 200 OK
Content-type: text/html
Content-Encoding: sdch, gzip
Cache-Control: private
JOWk0d2N<NUL>...VCDIFFed response... (note that the response shown
to the left the result of gzip decompression)
The user agent advertises that it has already downloaded two
dictionaries that apply. The server may compress the response with
either dictionary. As the server has no other dictionaries that
apply to the request, it does not advertise any dictionaries in its
response.
7. Implementation Considerations
7.1. Implementation Limits
There are practical limitations to the number and size of the
dictionaries a user agent can store. It is suggested that general
use, non-mobile user agents should have the following minimum
capabilities:
o At least 300 dictionaries stored
Butler, et al. Expires May 1, 2017 [Page 13]
Internet-Draft sdch-spec October 2016
o At least 100KB of payload per dictionary
o At least 10MB of total dictionary contents
o At least 20 dictionaries stored per domain
7.2. Dictionary Downloading
The user agent always has the choice of whether or not to download a
dictionary. It is recommended that the user agent be implemented
with sufficient state to avoid downloading too many dictionaries from
the same server. A malfunctioning server may also request the user
agent continually download the same dictionary. One simple method to
avoid both of these possibilities is for the user agent to rate-limit
downloading dictionaries from the same domain.
When the user agent receives a response with a Get-Dictionary header
with dictionary download URLs that it may fetch, it should perform
the dictionary downloads in the background. This is possible as the
dictionary to be downloaded is guaranteed to not be needed to
decompress the response with the Get-Dictionary header. The user
agent should be careful to abort background dictionary downloads that
do not complete in a reasonable amount of time.
7.3. Data Integrity
If the dictionaries are tied to individual users or specific user
actions, HTTP may leak this information to passive attacker by
allowing the Get-Dictionary info to be seen. When using HTTPS, the
same risk is prevented in the design document since Get-Dictionary
URLs are required to be same-origin as the response.
However, Downloading dictionaries over HTTPS or advertising
dictionaries over HTTPS might introduce new security risks.
TODO: add some examples. For example, SDCH-over-HTTPS subject to
compression oracle attacks similar to CRIME/BREACH with the
difference that the compression context is not supplied by the
attacker. If an attacker had the contents of a dictionary, there is
a theoretical possibility where a server sends a static response
XOR'ed with user-provided data. The Attacker can provide data which
reduced the size of the response when XOR'ed with the static
response, the attacker may then be able to determine the contents of
the static response.
The protocol needs to ensure that the content as decompressed by the
user agent with a given dictionary is identical to the server's
Butler, et al. Expires May 1, 2017 [Page 14]
Internet-Draft sdch-spec October 2016
originally intended content. The three areas that can cause a data
integrity problem are discussed below.
7.3.1. Data tampered by Proxy
We have found incorrectly implemented proxies which tamper with an
SDCH response and make the response unable to be decompressed to the
server's originally intended content. The tampering may not be
detected in the SDCH encoding itself if the proxy makes SDCH content
look like non-SDCH content, for instance, by stripping the 'sdch'
token from the content-encoding header of the response or by adding
additional encodings (like gzip) on top of the SDCH and gzipped
response without making the Content-Encoding header match. In order
to detect when this occurs, the HTTP header X-SDCH-Encoding must be
added to the response by the server to inform the client that the
response was originally not SDCH encoded by the server. Should the
user agent advertise SDCH capability in the request but receive a
non-SDCH encoded response without the X-SDCH-Encoding header, it
suggests that the response was tampered by a proxy. The user agent
may then take action to avoid using SDCH in the future.
7.3.2. Dictionary mismatch
When a dictionary information is exchanged between user agent and
server, it is necessary to ensure that the dictionary identifiers are
completely unambiguous, or the decompressed result may differ from
the original content. To address this issue, SDCH uses the first 96
bits of the SHA-256 digest of a dictionary's metadata and payload to
create the dictionary identifiers used by the user agent and server
to avoid ambiguity. (Please refer to the section "Dictionaries
description" above for details.)
7.3.3. Data corruption / malicious attacks
While this issue is not specific to SDCH, it can be exacerbated due
to the nature of the stateful compression. For example, if the
dictionary is corrupted or maliciously modified in a persistent on-
disk cache, all subsequent responses decoded by using this dictionary
will be corrupt. For this reason, the user agent and server should
revalidate the dictionaries' integrity when they are loaded from non-
volatile storage.
Other issues like data corruption during transmission in the encoded
payload could have much bigger adverse effect than that in the plain
text. TCP provides a checksum, but it cannot detect some errors like
swapped bytes. To address this issue, SDCH includes an Adler32
checksum [RFC1950] in the encoded data shards. (Please refer to
appendix "VCDIFF Encoding Format and SDCH" for details.)
Butler, et al. Expires May 1, 2017 [Page 15]
Internet-Draft sdch-spec October 2016
8. Response Caching
8.1. User Agent Cache
The user agent should honor HTTP caching directives (Cache-Control,
Expires,...) for caching responses, whether or not the responses are
SDCH-encoded. When caching the SDCH-encoded responses, the SDCH-
encoded responses should be decoded before being written to the
cache. If this is not possible, the user agent may cache SDCH-
encoded responses, unless the HTTP response headers indicate that the
response is not cacheable. In this case, an SDCH-encoded cache entry
should be invalidated when (1) the dictionary used to encode that
response is deleted from the dictionary store, (2) the SDCH
decompression user agent is uninstalled (if it is implemented as a
browser add-on), or (3) the SDCH capable user agent is disabled.
Intermediate Caches
The server should use HTTP cache headers that prevent non-SDCH-aware
intermediate cache servers from storing the encoded contents. The
cache directive "Cache-Control: private" can be used for this
purpose.
If the compressed response can be cached by proxy caches, the server
must include the HTTP header "Vary: Accept-Encoding, Avail-
Dictionary" to alert proxies about sending the cached content only to
the user agents who can decode it. Note that some proxies may not
respect the Vary header, in which case non-SDCH-capable user agents
would end up downloading SDCH-encoded responses. Thus, we recommend
that SDCH-encoded responses not be cacheable by intermediate proxies
unless there is a very compelling reason. Further, "Vary: Accept-
Encoding, Avail-Dictionary" will not match requests unless these
headers match exactly.
A proxy cache may provide one of three levels of support for caching
SDCH-encoded objects.
1. No support - Never cache any response if the header Vary is
present.
2. Basic support - The proxy cache only serves cached SDCH-encoded
content if all cache serving conditions are satisfied and the
values of the HTTP headers specified in the Vary header of the
cached content exactly match the corresponding headers in the
HTTP request.
3. Full support - The proxy should understand the SDCH protocol,
should know what dictionary is used to encode/decode the
Butler, et al. Expires May 1, 2017 [Page 16]
Internet-Draft sdch-spec October 2016
response, and should be able to download advertised dictionaries.
The cache needs to have both SDCH user agent and server logic in
it. The server should store the SDCH decoded responses in its
cache.
Dictionary Caching User Agent Cache
As dictionary payloads may be large compared to the size of
individual HTTP responses, in order to maximize latency improvements
and minimize the bandwidth overhead of downloading dictionaries, it
is recommended that the user agent persistently store dictionaries in
a dictionary cache (e.g. on disk). It is suggested that the user
agent implement a maximum limit on number of dictionaries stored per
domain in order to avoid allowing one domain to force dictionaries
for other domains out of the user agent's dictionary cache. To
implement a fixed maximum size cache it is recommended that the cache
manager first evict the dictionaries that were least recently used
for decoding.
Ideally dictionaries will be stored in the same cache as HTTP
responses and may be inspected and cleared by the user using existing
user interfaces. However, new support may be created to fulfill the
need for the user agent to be able to quickly determine which
dictionaries should be advertised for a given request.
The user agent should be careful to validate that a dictionary
matches its original identifier before being used for decompression
to prevent malicious attacks on the dictionary cache. The user agent
may implicitly handle this by always recomputing the hash before
advertising the dictionary. However, to improve efficiency, the user
agent may cache the original digest of the dictionary, advertise the
dictionary with that digest, and then only for the dictionary
selected by the server to encode the response, verify that the cached
dictionary digest still matches the digest computed from the cached
dictionary.
The user agent must not evict dictionaries from its dictionary store
that have been advertised in the Avail-Dictionary header of a HTTP
request for which a response has not yet been returned.
If a user agent downloads a dictionary which has the same identifier
as another previously downloaded dictionary which are applicable to
the same hosts, the user agent must be careful to either ignore the
new dictionary or evict the old dictionary. If the two dictionaries
with the same identifier have exactly the same contents the choice is
not important, however this indicates a server error as a server must
never instruct the user agent to download a dictionary that was
advertised by the user agent. The user agent may want to avoid
Butler, et al. Expires May 1, 2017 [Page 17]
Internet-Draft sdch-spec October 2016
downloading dictionaries from this server in the future as they may
not be new and downloading unnecessary dictionaries can increase
latency.
Intermediate Caches
The dictionary should be treated as a regular HTTP response by
intermediate proxies. Thus, the normal HTTP caching consideration
for intermediate proxies should apply to the dictionary as well.
9. Future Directions
=====================
As currently proposed, SDCH is not applicable to another case where
differential compression would be beneficial: large files that change
infrequently and in small ways, such as JavaScript and CSS files
referenced by other HTML documents.
TODO: Re-evaluate dictionary scoping rules, current approach that
patterned after the same named attributes from the HTTP State
Management Specification [RFC2965] may not be the best choice.
10. Current Status and Updates
For current information about the status of this proposal:
https://groups.google.com/group/SDCH
11. IANA Considerations
This document makes no requests of IANA.
12. Security Considerations
Some security considerations are discussed in the data integrity
section above, but the author anticipates further work to describe
these.
13. Acknowledgements
The authors would like to acknowledge the support of Google, Inc. for
the development of this work. Technical editor: Harriett Hardman.
Feedback and comments: Greg Badros, Chandra Chereddi, Darren Fisher,
Ted Hardie, Ashu Jain, Ian Hickson, Othman Laraki, Jim Roskind, Ryan
Sleevi, Lincoln Smith, Randy Smith, and Linus Upson.
Butler, et al. Expires May 1, 2017 [Page 18]
Internet-Draft sdch-spec October 2016
14. Appendix: VCDIFF Encoding Format and SDCH
Although the SDCH protocol is proposed so that it could be adapted
for use with any differential-encoding format, it currently uses the
VCDIFF encoding format. This format was chosen because its
definition is publicly available as the RFC 3284 draft standard. The
VCDIFF format is independent of the method used for finding the
longest possible matches between the dictionary (source) data and the
payload (target) data.
An encoder and decoder for the VCDIFF format, intended for use with
SDCH, has been released as open-source under the Apache license.
This package is called "open-vcdiff". It uses the Bentley/McIlroy
technique for finding matches between the dictionary and target data.
It conforms to the VCDIFF draft standard, with the following
exceptions:
Interleaved format
The VCDIFF draft standard format divides each encoded delta window
into three sections (data, instructions, and addresses), with the aim
of improving compressibility of the encoded file using a secondary
compressor such as gzip. The drawback to this approach is that none
of the target data can be reconstructed unless the entire delta
window is available. The delta window is received in packets over
the network and it is desirable to be able to process its contents as
they arrive. In order to facilitate decoding a stream of packets
from the network, we have modified the VCDIFF format so that it
interleaves the data, instructions, and addresses instead of placing
them in three separate sections. Each instruction is followed by its
size and then by an address or literal data.
Adler32 checksum
The format can be modified to include an Adler32 checksum [RFC1950]
of the target window data. If the checksum format is used, then bit
2 (0x04, defined as VCD_CHECKSUM) of the Win_Indicator byte will be
set, and the checksum will appear just after the "Length of addresses
for COPYs" field and before the "Data section for ADDs and RUNs"
section in the encoding.
Version header byte (Header4)
If either of the two enhancements described above is used, then the
resulting format will not conform to the VCDIFF draft standard as
described in RFC 3284. In order to indicate this deviation from the
standard, the fourth byte in the encoding (Header4, reserved for the
VCDIFF version code) will be set to 0x53 (a capital "S" character in
Butler, et al. Expires May 1, 2017 [Page 19]
Internet-Draft sdch-spec October 2016
ASCII.) If neither enhancement is used, the fourth byte may be 0x00
(a null character), the default value described in the standard.
VCD_TARGET flag and target COPY instructions not allowed for SDCH
The SDCH protocol is intended to produce a delta between static
dictionary data and target data. Secondary compression with gzip
will be used to eliminate redundancy within the target data. For
this reason, when using VCDIFF for SDCH, the Win_Indicator flag
should always include the VCD_SOURCE flag, never the VCD_TARGET flag.
COPY instructions should only reference addresses within the source
data, never within the previously decoded target.
The Xdelta package (http://xdelta.org) produces a format based on
VCDIFF, though not 100% compatible with the RFC draft standard. That
package has been released under the GNU General Public License.
15. References
15.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<http://www.rfc-editor.org/info/rfc2119>.
[RFC7230] Fielding, R., Ed. and J. Reschke, Ed., "Hypertext Transfer
Protocol (HTTP/1.1): Message Syntax and Routing",
RFC 7230, DOI 10.17487/RFC7230, June 2014,
<http://www.rfc-editor.org/info/rfc7230>.
15.2. Informative References
[RFC3284] Korn, D., MacDonald, J., Mogul, J., and K. Vo, "The VCDIFF
Generic Differencing and Compression Data Format",
RFC 3284, DOI 10.17487/RFC3284, June 2002,
<http://www.rfc-editor.org/info/rfc3284>.
[RFC3229] Mogul, J., Krishnamurthy, B., Douglis, F., Feldmann, A.,
Goland, Y., van Hoff, A., and D. Hellerstein, "Delta
encoding in HTTP", RFC 3229, DOI 10.17487/RFC3229, January
2002, <http://www.rfc-editor.org/info/rfc3229>.
[RFC3929] Hardie, T., "Alternative Decision Making Processes for
Consensus-Blocked Decisions in the IETF", RFC 3929,
DOI 10.17487/RFC3929, October 2004,
<http://www.rfc-editor.org/info/rfc3929>.
Butler, et al. Expires May 1, 2017 [Page 20]
Internet-Draft sdch-spec October 2016
[RFC3548] Josefsson, S., Ed., "The Base16, Base32, and Base64 Data
Encodings", RFC 3548, DOI 10.17487/RFC3548, July 2003,
<http://www.rfc-editor.org/info/rfc3548>.
[RFC2965] Kristol, D. and L. Montulli, "HTTP State Management
Mechanism", RFC 2965, DOI 10.17487/RFC2965, October 2000,
<http://www.rfc-editor.org/info/rfc2965>.
[RFC1950] Deutsch, P. and J-L. Gailly, "ZLIB Compressed Data Format
Specification version 3.3", RFC 1950,
DOI 10.17487/RFC1950, May 1996,
<http://www.rfc-editor.org/info/rfc1950>.
[RFC6234] Eastlake 3rd, D. and T. Hansen, "US Secure Hash Algorithms
(SHA and SHA-based HMAC and HKDF)", RFC 6234,
DOI 10.17487/RFC6234, May 2011,
<http://www.rfc-editor.org/info/rfc6234>.
Authors' Addresses
Jon Butler
Email: jkbutler@google.com
Wei-Hsin Lee
Email: weihsinl@google.com
Bryan McQuade
Email: mcquade@google.com
Kenneth Mixter
Email: kmixter@google.com
Butler, et al. Expires May 1, 2017 [Page 21]