Internet Engineering Task Force | N. Williams, Ed. |
Internet-Draft | Cryptonector, LLC |
Intended status: Best Current Practice | July 6, 2020 |
Expires: January 7, 2021 |
Internationalization Considerations for Filesystems and Filesystem Protocols
draft-williams-filesystem-18n-00
This document describes requirements for internationalization (I18N) of filesystems specifically in the context of Internet protocols, the architecture for filesystems in most currently popular general purpose operating systems, and their implications for filesystem I18N. From the I18N requirements for filesystems and the architecture of running code we derive requirements and recommendations for implementors of operating systems and/or filesystems, as well as for Internet remote filesystem protocols.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on January 7, 2021.
Copyright (c) 2020 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
[TBD: Add references galore. How to reference Unicode? How to reference US-ASCII? How best to reference HFS+? How best to reference ZFS? May have to find useful references for POSIX and WIN32. Various blog entries may be of interest -- can they be referenced?]
We, the Internet community, have long concluded that we must internationalize all our protocols. This is generally not an easy task, as often we are constrained by the realities of what can be achieved while maintaining backwards compatibility.
In this document we focus on filesystem internationalization (I18N), specifically only for file names and file paths. Here we address the two main issues that arise in filesystem I18N:
These two issues are different flavors of the same generic issue: that there can be more than one way to write text with the same rendering and/or semantics.
Only I18N issues relating to file names and paths are addressed here. In particular, I18N issues related to representations of user identities and groups, for use in access control lists (ACLs) or other authorization systems, are out of scope for this document. Also out of scope here are I18N issues related to Uniform Resource Identifiers (URIs) [RFC3986] or Internationalized Resource Identifiers (IRIs) [RFC3987].
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
We must address two issues:
Unicode can represent certain character strings in multiple visually- and semantically-equivalent ways. For example, there are two ways to express LATIN SMALL LETTER A WITH ACUTE (á):
For some glyphs there is a single way to write them. For others there are two. And for yet others there can be many more than two.
To deal with the equivalence problem, Unicode defines Normal Forms (NFs), of which there are two basic ones: Normal Form Composed (NFC), and Normal Form Decomposed (NFD). There are also NFs that use "compatibility" Foldings, NFKC and NFKD. Unicode-aware applications can normalize text to avoid ambiguities, or they can use form-insensitive string comparisons, or both.
Some filesystems support case-insensitivity, which is trivial to define and implement for US-ASCII, but non-trivial for Unicode, requiring not only larger case-folding tables, but also localized case-folding tables as case-folding rules might differ from locale to locale.
For case-sensitive filesystems, only Unicode equivalence issues arise as to file names and file paths. These can be addressed in one of two ways:
The first option yields normalized file names on-disk and on the wire (e.g., when listing directories). We shall term this "normalize-on-CREATE", or sometimes "normalize-on-CREATE-and-LOOKUP", or even just "NoCL".
The second option preserves form as originally produced by the user or on their behalf by their system's text input modes, but otherwise is form-insensitive. That is, this option permits either encoding of, e.g., LATIN SMALL LETTER A WITH ACUTE on-disk and on the wire, but permits only one form of any string, whether normal or not. We shall term this option "form-insensitive", or sometimes "form-insensitive and form-preserving", or just "FIP".
Unicode compatibility equivalence allows equivalence between different representations of the same abstract character that may nonetheless have different visual appearance of behavior. There are two canonical forms that support compatibility equivalence: NFKC and NFKD. Using NoCL with NFKC or NFKD may be surprising to users in a visual way. While form-insensitivity with NFKC or NFKD may surprise users who might consider two file names distinct even when Unicode considers them equivalent under compatibility equivalence. The latter seems less likely and less surprising, though that is an entirely subjective judgement.
We do not recommend either of NoCL or FIP over the other.
Case-insensitivity implies folding characters of one case to another for comparison purposes, typically to lower-case. These case foldings are defined by Unicode. Generally, case-insensitive filesystems preserve original case just form-insensitive filesystems preserve original form.
It is possible that some case foldings may have to vary by locale. A commonly used example of character where case foldings that varies by locale is LATIN SMALL LETTER DOTLESS I (U+0131).
In some cases it may be possible to construct case-folding tailorings that are locale-neutral. For example, all of the following conuld be considered equivalent:
which might satisfy a mix of users including those familiar with Turkish and those not, using the same filesystem.
Remote filesystem protocols often involve caching on clients, which caching may require knowledge of filesystem I18N settings in order to permit local operations to be performed using cached directory listings that work the same way as on the server. We do not specify any case foldings here. Instead we will either create a registry of case folding tailorings, or use the Common Locale Data Repository (CLDR), then require that filesystems and servers be able to identify what case foldings are in effect for case-insensitive filesystems.
Surprisingly, almost all if not all general purpose operating systems in common use today have a "virtual filesystem switch" (VFS) [McKusick86] [wikipedia] interface that permits the use of multiple different filesystem types on one system, all accessed through the same filesystems application programming interfaces (APIs). The VFS is essentially a pluggable layer that includes functionality for routing calls from user processes to the appropriate filesystems. The VFS has even been generalized and extended to support isolation, thus we have the Filesystem in Userspace (FUSE), which is akin to a remote filesystem protocol, but for use over local inter-process communications (IPC) facilities.
The VFS architecture was developed in the 1980s, before Unicode adoption. It is not surprising then that in general -if not simply always today- the code path from the interface between a user application and the operating system all the way to the filesystem implements no I18N functionality whatsoever, and does the absolute minimum of character data interpretation:
For example, the 4.4BSD operating system and derivatives have a VFS [BSD4.4], as do Solaris and derivatives [SolarisInternals], Windows <https://docs.microsoft.com/en-us/windows-hardware/drivers/ifs/>, OS X, and Linux. A VFS of a sort, including FUSE, may well be the only reasonable way to support more than one kind of filesystem while retaining compatibility with previously-existing filesystem APIs. This explains why so many modern operating systems have a VFS.
Thus in most if not all general purpose operating systems today, the code path from the boundary between the application and the operating system, and the boundary between the VFS and the filesystem, is "just-use-8" or "just-use-16" (as in UTF-16 [UNICODE]), with no attempt at normalization or case folding done anywhere in between.
There are filesystem servers that access raw storage directly and implement the filesystem and the remote filesystem protocol server in one monolythic stack without a VFS in the way, but it is very common to have remote filesystem protocol servers implemented on top of the VFS or on top of the system calls. Even monolythic servers tend to support a notion of multiple filesystems in a server or volume, and may have different I18N settings for each filesystem. Thus it's common to leave I18N handling to code layers close to the filesystem even in monolythic server implementations.
In practice all of foregoing has led to I18N functionality residing strictly in the filesystem. Two filesystems have defined the best current practices in this regard:
Altogether, these circumstances make it very difficult to reliably and always locate I18N functionality above the VFS, or to not use a VFS at all: there are too many places to alter, and all must agree exactly on I18N choices. Moreover, implementing case-insensitive but case-preserving behavior above the VFS requires fully reading each directory, and so does implementing form-insensitive and form-preserving behavior at the VFS layer itself. The only behaviors that can be reliably implemented at or above the VFS are normalize- and case-fold-on-CREATE (and LOOKUP).
Consider the set of already-running code that must all be modified in order to reliably implement I18N above the filesystem on general purpose operating systems:
Regarding system calls and system call stubs in user process system libraries, the continued use of statically-linked executables means that these cannot reliably be modified. Indeed, on some systems the Application Binary Interface (ABI) between user-space applications and the operating system kernel is well-defined and long-term stable. The system call handlers cannot reliably inspect the calling process to determine any attributes of its locale. Adding new system calls is possible, but existing running code wouldn't use them. For similar reasons, the VFS layer is generally (always) completely unaware of any attributes of the locale of applications calling it, whether via system calls or any other path.
Unix-like operating systems are generally (always) "just-use-8", assuming only that file names and paths are C strings (i.e., terminated by zero-valued bytes) and sufficiently compatible with US-ASCII that the file path component separator character, US-ASCII '/', is meaningful. As a result, it is possible to find I18N-unaware filesystems with one or more non-Unicode, non-ASCII codesets in use for file names! We leave non-ASCII and non-Unicode file names out of scope here.
For these reasons it is simply not practical to implement I18N at any layer above the VFS.
Even in the VFS, form- and case-insensitive and -preserving behaviors would be difficult to implement as performantly as in the filesystem. The VFS would have to list a directory completely before being able to apply those behaviors. It is reasonable to expect caching clients of remote filesystems to cache directory listings (especially for offline operation), but it isn't reasonable to expect the same of the VFS. Compare to the filesystem itself, which can maintain a fast index (e.g., hash table or b-tree) where the keys are normalized and possibly case-folded file names and thus may not need to read directories in order to perform fast lookups that are form- and even case-insensitive.
The only way to implement I18N behaviors in the VFS layer rather than at the filesystem is to abandon form- and case-preserving behaviors. For case-insensitivity this would require using sentence-case, or all lower-case, perhaps, and all such choices would surely be surprising to users. At any rate, that approach would also render much running code "non-compliant" with any Internet filesystem protocol I18N specification.
Therefore, generally speaking, only the filesystem can reliably, interoperably, and performantly implement I18N behaviors in general purpose operating systems.
Note that variations in I18N behaviors can happen even on the same server with multiple filesystems of the same type. This can happen because of
Locale variations are only relevant to case-folding for case-insensitivity. Running code mostly uses default case-folding rules, but there is no reason to assume that locale-specific case-folding rules won't be supported by running code in the future.
It may not be possible or easy for a filesystem to adopt new Unicode versions, or adopt backwards-incompatible case foldings, after content has been created in it that would be ambiguous under new rules. This implies that where a client for a remote filesystem must know what I18N functionality to implement for use with cached directory listings, the client must know specifically what profile of I18N functionality each cached filesystem implements.
We begin be recognizing and accepting that much running code implements I18N functionality at the filesystem. Given this, we catalogue the range of acceptable behaviors. Filesystems adhering to this specification MUST implement only acceptable I18N behaviors as specified here. Acceptable variations may be registered in a to-be-determined (IANA?) registry of filesystem I18N behaviors.
Each filesystem MUST identify a Unicode version for their I18N behaviors. Filesystem implementations SHOULD adopt new Unicode versions as they are produced, though it is understood that it may be difficult to migrate non-empty filesystems to new Unicode versions.
Remote filesystem protocols that allow clients to perform lookups against cached directory listings MUST allow clients to discover all relevant I18N behaviors of the filesystem whence any given directory listing:
Foldings are identified via a folding set name as registered in a to-be-determined (IANA?) registry.
Because some filesystems might allow for different I18N settings on a per-directory basis, remote filesystem protocols MUST allow those settings to be discoverable on a per-directory basis.
Internet filesystem servers MUST reject attempts to create new non-Unicode file names. (Note that this requirement is weaker ("SHOULD") for the actual filesystems, since those might have to allow non-Unicode content for legacy reasons via interfaces other than Internet filesystem protocols.)
Caching clients of remote filesystems either MUST NOT perform lookups against cached directory listings, or MUST query the directories' filesystems' I18N profiles and apply the same I18N equivalent form policis and case-insensitivity case foldings.
This document deals in internationalization throughout.
[ALTERNATIVELY use locale names and CLDR? Need to determine the stability of CLDR locales... Basically, we need stable locale names, and stable case-folding mappings.]
We hereby request the creation of a new IANA registry with Expert Review registration rules with the following fields:
The case foldings registered here will be used by case-insensitive filesystems and filesystem protocols to identify tailored case foldings so that caching clients can implement the same case-insensitive behavior using cached directory listings.
Security considerations of Unicode and filesystem protocols apply. No new security considerations are added or need be noted here.
The methods of handling equivalent Unicode strings cause aliasing. This is not expected to be a security problem.
Case-insensitivity causes aliasing. This is not expected to be a security problem.
No effort is made here to handle confusables. This is not expected to be a serious security problem in the context of file servers.
[RFC2119] | Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997. |
[RFC3629] | Yergeau, F., "UTF-8, a transformation format of ISO 10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November 2003. |
[UNICODE] | The Unicode Consortium, "The Unicode Standard, Version 12.1.0", May 2019. |