Internet DRAFT - draft-williams-i18n-boundary-analysis
draft-williams-i18n-boundary-analysis
Network Working Group N. Williams
Internet-Draft Cryptonector
Intended status: Informational September 20, 2013
Expires: March 24, 2014
Boundary Analysis for Internationalization and Localization
draft-williams-i18n-boundary-analysis-00
Abstract
Internationalization of protocols and programs often requires
determining where to use one or another character repertoire,
codeset, encoding, where to perform localization, and so on. This
document aims to serve as a guide to Internet protocol designers in
determining what they may or should recommend or require of protocol
implementors. Of particular interest in this document are filesystem
protocols.
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on March 24, 2014.
Copyright Notice
Copyright (c) 2013 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
Williams Expires March 24, 2014 [Page 1]
Internet-Draft I18N Boundary Analysis September 2013
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction and Motivation . . . . . . . . . . . . . . . 3
1.1. Conventions used in this document . . . . . . . . . . . . 3
2. Internationalization . . . . . . . . . . . . . . . . . . . 4
2.1. Terminology . . . . . . . . . . . . . . . . . . . . . . . 4
3. Filesystems and Remote/Distributed Filesystem Protocols . 6
3.1. On Filesystem Client and Server Implementation
Architectures and their Relevance . . . . . . . . . . . . 6
3.2. Obvious Boundaries . . . . . . . . . . . . . . . . . . . . 6
3.3. Legacy . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.3.1. Legacy Problem #1: Loss of Metadata at the System Call
Boundary . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.3.2. Legacy Problem #2: Unknown Character String Metadata
in Existing Filesystem Content . . . . . . . . . . . . . . 7
3.3.3. Legacy Problem #3: Poor Handling of Unicode
Equivalence (Normalization) . . . . . . . . . . . . . . . 8
3.3.4. Legacy Problem #4: Ignored Requirements . . . . . . . . . 8
3.3.5. Legacy Problem #5: Constraints Imposed by Non-Internet
Standards . . . . . . . . . . . . . . . . . . . . . . . . 8
3.4. A World Without Legacy . . . . . . . . . . . . . . . . . . 8
3.5. Coping with / Accepting Legacy . . . . . . . . . . . . . . 9
3.5.1. Implications . . . . . . . . . . . . . . . . . . . . . . . 9
3.6. Recommendations for Filesystem Protocols, Filesystems,
and Operating Systems . . . . . . . . . . . . . . . . . . 10
3.7. Interoperability Considerations for Filesystem
Protocols . . . . . . . . . . . . . . . . . . . . . . . . 11
4. Security Considerations . . . . . . . . . . . . . . . . . 12
5. IANA Considerations . . . . . . . . . . . . . . . . . . . 13
6. References . . . . . . . . . . . . . . . . . . . . . . . . 14
6.1. Normative References . . . . . . . . . . . . . . . . . . . 14
6.2. Informative References . . . . . . . . . . . . . . . . . . 14
Author's Address . . . . . . . . . . . . . . . . . . . . . 15
Williams Expires March 24, 2014 [Page 2]
Internet-Draft I18N Boundary Analysis September 2013
1. Introduction and Motivation
As the IETF has attempted to internationalize Internet protocols we
have learned some valuable lessons. It is time to write these down.
This document focuses on internationalization problems in the
filesystem and remote / distributed filesystem protocols space.
This document is INFORMATIVE.
1.1. Conventions used in this document
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
Where RFC2119 key words are used herein for stating requirements or
recommendations, they are used to as part of suggested normative
language to be used by normative Internet protocol specifications
that accept the internationalization advice given in this document.
Williams Expires March 24, 2014 [Page 3]
Internet-Draft I18N Boundary Analysis September 2013
2. Internationalization
Internationalizing a protocol roughly requires the following tasks:
1. decide where to use Unicode [XXX add reference] and what encoding
of Unicode
2. decide where any conversions to other codesets should be done, if
any
3. decide what Unicode characters (and non-characters) to permit or
forbid
4. decide what Unicode character mappings are appropriate
5. decide how to handle string equality, including case-sensitive
and case-insensitive behavior, and whether and how to handle
Unicode equivalence (normalization)
In practice, because historically most protocols and data formats do
not tag strings with any language nor codeset information, and
because codesets and their encodings often overlap, and other legacy
problems, there's no simple way to decide where to perform any
conversions, mappings, or checks.
We describe here our experience with NFSv4 in particular and
filesystems in general.
2.1. Terminology
[...]
Some terms used in this document:
just-use-8 Where a program or protocol component accepts character
strings, treating them as arbitrary octet strings, often assuming
that byte values less than 0x80 are US-ASCII, or that specific
byte values are specific US-ASCII characters (e.g., filesystem
path component separators).
just-send-8 Where a program or protocol component sends character
strings without regard as to whether the string's codeset/encoding
are the expected on-the-wire codeset/encoding.
just-use-UTF-8 Where a program or protocol component accepts
character strings that are valid UTF-8 strings withour regard to
normalization.
Williams Expires March 24, 2014 [Page 4]
Internet-Draft I18N Boundary Analysis September 2013
just-send-UTF-8 Where a program or protocol component sends UTF-8
character strings without attempting to normalize or perform any
similar steps (e.g., applying character mappings and/or
prohibitions).
... Define lots more and reference other RFCs...
Williams Expires March 24, 2014 [Page 5]
Internet-Draft I18N Boundary Analysis September 2013
3. Filesystems and Remote/Distributed Filesystem Protocols
Filesystems and filesystem protocols may be the most difficult
application to internationalize that we in the IETF have seen to
date. Initially, for NFSv4 [RFC3530] we believed that we could
simply mandate the use of UTF-8 [RFC3629], forbid some characters,
require a choice of normalization forms, and we'd be done. In
practice it was not so simple.
3.1. On Filesystem Client and Server Implementation Architectures and
their Relevance
To understand the difficulties faced in internationalizing NFSv4 we
need to understand the typical architecture of NFSv4 clients and
servers. We say "typical", but it really is typical: the vast
majority, if not all of the major general-purpose operating systems
in use at this time and over the entire history of NFSv4 share the
architecture that we describe here, differing only in minor details.
Normally the architecture and design of clients and servers would be
of no interest to the IETF: we want neiter to dictate nor be unduly
constrained by such things. In this case architecture and legacy
combine to create unusual problems for filesystem protocols. In this
case we must take implementation architecture into account!
Both, clients and servers, typically have a "kernel" that executes
privileged mode object code and which has a pluggable "virtual
filesystem switch" (VFS) -- an interface that abstracts filesystems
so as to permit support for many different types of filesystems.
Clients, and usually servers, also run user-mode object code -less
privileged than kernel-mode object code- that interfaces with
filesystems by invoking privileged kernel-mode code through well-
defined interfaces ("system calls") that allow the kernel to maintain
privilege separation and isolation. These system calls too present a
common, standard, abstract interface to all filesystems that can be
plugged into the kernel's VFS. Some servers run no user-mode object
code to speak of, running all fileserver protocol implementations in
kernel mode, nonetheless, the architecture is roughly the same for
servers as for clients.
3.2. Obvious Boundaries
Some boundaries are immediately evident:
o the system call layer, between user-mode and kernel mode
o the VFS boundary, between generic kernel object code and specific
filesystem implementations
Williams Expires March 24, 2014 [Page 6]
Internet-Draft I18N Boundary Analysis September 2013
o the network, between the client implementation and the server
implementation
o the VFS again, between the server and the filesystems beneath it
o persistent storage network, between specific filesystem
implementations and persistent storage
Their relevance to I18N will be discussed further below.
3.3. Legacy
Many, perhaps all commonly used general-purpose operating systems,
predate modern internationalization efforts. (Some operating
systems, such as those for mobile devices, are new enough that they
might well pose no legacy I18N issues for filesystems.)
Most such operating systems simply treated character strings as
mostly opaque at many if not all of the boundaries described in
Section 3.2, at most interpreting path component separator
characters, in the process assuming US-ASCII [XXX add reference] as
the lowest common denominator for the purpose of finding path
component separators.
Because these operating systems, filesystem on-disk formats, and
actual on-disk filesystems, predate modern internationalization
efforts, there exist many filesystems with object name strings of
unknown or mixed codesets. Strings, such as object names, in
filesystems are never tagged with codeset information because the
codeset information was and still is usually lost at the system call
boundary. The actual codesets (and encodings) used typically varies
along with user (and system administrator) locale preferences.
3.3.1. Legacy Problem #1: Loss of Metadata at the System Call Boundary
The first and foremost problem, then, is the loss of locale metadata
at the system call boundary. Without fixing this we cannot move to
an all-Unicode world in filesystems protocols.all
3.3.2. Legacy Problem #2: Unknown Character String Metadata in Existing
Filesystem Content
The second most important problem in filesystem internationalization
is the lack of locale (codeset, encoding) metadata for existing
(legacy) filesystem content, specifically file and directory names.
Williams Expires March 24, 2014 [Page 7]
Internet-Draft I18N Boundary Analysis September 2013
3.3.3. Legacy Problem #3: Poor Handling of Unicode Equivalence
(Normalization)
Historically Unicode input methods tend to produce pre-composed
codepoints -- something close to Normalization Form Composed (NFC).
But this is not always so.
Historically most filesystems treat file (and directory) names as
opaque, but at least one filesystem (Apple's HFS+ [XXX add
reference]) assumes UTF-8 and normalizes to Normalization Form
Decomposed (NFD) at object-create and object-lookup time.
This can result in subtle interoperability problems, as two objects
with equivalent names may exist in namespaces (directories) where
names are expected to be unique, or users may fail to input names
that match those that exist in a filesystem.
3.3.4. Legacy Problem #4: Ignored Requirements
The original NFSv4 specification [RFC3530] requires some character
mappings and prohibitions. Most implementations have ignored this
requirement.
3.3.5. Legacy Problem #5: Constraints Imposed by Non-Internet Standards
POSIX [XXX add reference] is one common standard for system call
interfaces to filesystems. Arguably it requires that:
1. applications observe the same file/directory names -when listing
a directory- as they created;
2. no aliases may exist for files/directories that are not
"symlinks" or "hardlinks".
This makes it very difficult to deploy Unicode normalization anywhere
other than the application. But it is not possible to fix every
POSIX application to normalize on create or lookup either!
3.4. A World Without Legacy
If we didn't have the legacy problems described above we could simply
mandate the use of Unicode in one specific encoding (e.g., UTF-8) "in
the middle", with the middle being: from the system call boundary, to
the VFS boundary, as well "on the wire". Any codeset conversions and
Unicode normalization would be performed at the system call boundary
(i.e., on the client) and at the VFS boundary (if, for example, a
filesystem on-disk format requires different codeset/encoding than
the protocol does on the wire).
Williams Expires March 24, 2014 [Page 8]
Internet-Draft I18N Boundary Analysis September 2013
Or perhaps in an ideal world all user applications may run only in
Unicode locales, and must perform explicit codeset conversions when
handling legacy (non-Unicode) data. This ideal is one we will likely
obtain in time, as legacy non-Unicode locales are abandoned, legacy
filesystems cleaned up, and new operating systems (or new versions of
them) take over older ones.
In an ideal world there would be no Unicode normalization problems
because either there would be just one normal form for Unicode or
because all implementations of filesystem clients, servers,
filesystems, and filesystem-using applications, would use a single,
common normal form. In practice this is almost certainly an
impossible ideal.
3.5. Coping with / Accepting Legacy
Legacy abounds. We must cope with it.
First, the IETF can't cause the system call boundary metadata loss
problem to be fixed. The architectures of the relevant operating
systems is such that the simplest fix for that problem is to convert
between the user-mode locale's codeset/encoding and the codeset/
encoding expected by the kernel. But getting such a fix to be
implemented and deployed is difficult for a number of reasons, not
the least of which is its impact on performance (for users using
locales that require conversions), but also complexity: the user-mode
side of system calls can sometimes be in a bootstrapping state during
which I18N object code may not have been loaded yet. The simplest
fix for this problem is to recommend that users use only locales that
use Unicode as the charater repertoire and codeset, preferably with
the encoding expected on the kernel-side of the system call boundary.
The second legacy problem -legacy filesystem content- can be
addressed by requiring manual inspection and repair of legacy
content, but there exist such vast amounts of legacy contents that
this is not a realistic option. There is no fix for the legacy
filesystem content problem.
3.5.1. Implications
Some implications of accepting legacy:
o we may want Unicode in the middle, but sometimes we'll have non-
Unicode content
o we can stop the creation of new non-Unicode content on disk, but
we can't really preclude access to it
Williams Expires March 24, 2014 [Page 9]
Internet-Draft I18N Boundary Analysis September 2013
o normalization-on-create is problematic
o normalization-on-lookup is problematic
o normalization-insensitive lookups are problematic
o ignoring normalization is problematic
With respect to normalization there's no one solution appropriate to
all use cases.
3.6. Recommendations for Filesystem Protocols, Filesystems, and
Operating Systems
o Filesystems SHOULD be configurable to reject object names which
are not valid in the filesystem's chosen Unicode encoding.
This allows filesystems (and their servers) to put a stop to the
rot, except, of course, for non-Unicode strings that happen to
appear as valid Unicode strings due to codeset/encoding aliasing.
o Remote / distributed filesystem protocols _should_ specify the use
of Unicode on the wire, but they should also allow the use of non-
Unicode names, leaving it to the filesystem to decide whether to
accept or reject such names.
* For example, this means that NFSv4 _servers_ SHOULD accept
object names -from clients- which are not valid UTF-8, contrary
to the original NFSv4 specification [RFC3530].
o Remote / distributed filesystem protocols should permit servers to
return non-Unicode object names to clients. This allows servers
to serve legacy non-Unicode content.
* For example, this means that NFSv4 clients SHOULD be prepared
to accept non-UTF-8 names from NFSv4 servers, contrary to the
original NFSv4 specification [RFC3530].
o Filesystem servers should accept object names -from filesystems-
which are not valid in the host operating system's chosen codeset
and encoding for use above the VFS.
o Filesystems SHOULD be configurable as to Unicode normalization,
allowing at least the following two options:
* Normalization-insensitive lookups.
Williams Expires March 24, 2014 [Page 10]
Internet-Draft I18N Boundary Analysis September 2013
* No normalization at all.
o Filesystems MAY be configurable as to Unicode normalization,
allowing these additional options:
* Normalize on create and lookup
o Operating systems SHOULD be configurable as to codeset/encoding
conversions at the system call boundary, allowing these options:
* convert to/from non-Unicode locales' codesets
* no conversion
o Operating systems that do not support codeset/encoding conversions
at the system call boundary SHOULD at least encourage users to use
or switch to using Unicode locales.
3.7. Interoperability Considerations for Filesystem Protocols
[[anchor1: Intent: describe interoperability problems that arise
given current NFSv4 deployments and legacy filesystem contents.]]
Williams Expires March 24, 2014 [Page 11]
Internet-Draft I18N Boundary Analysis September 2013
4. Security Considerations
[[anchor2: Lots to talk about here. For example, aliasing issues
w.r.t. multiple equivalent Unicode forms, and the resulting potential
for confusion.]]
Williams Expires March 24, 2014 [Page 12]
Internet-Draft I18N Boundary Analysis September 2013
5. IANA Considerations
There are no IANA considerations in this document.
Williams Expires March 24, 2014 [Page 13]
Internet-Draft I18N Boundary Analysis September 2013
6. References
6.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
6.2. Informative References
[RFC3530] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R.,
Beame, C., Eisler, M., and D. Noveck, "Network File System
(NFS) version 4 Protocol", RFC 3530, April 2003.
[RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
10646", STD 63, RFC 3629, November 2003.
Williams Expires March 24, 2014 [Page 14]
Internet-Draft I18N Boundary Analysis September 2013
Author's Address
Nicolas Williams
Cryptonector, LLC
Email: nico@cryptonector.com
Williams Expires March 24, 2014 [Page 15]