Internet DRAFT - draft-williams-filesystem-18n

draft-williams-filesystem-18n







Internet Engineering Task Force                         N. Williams, Ed.
Internet-Draft                                         Cryptonector, LLC
Intended status: Best Current Practice                      July 6, 2020
Expires: January 7, 2021


   Internationalization Considerations for Filesystems and Filesystem
                               Protocols
                    draft-williams-filesystem-18n-00

Abstract

   This document describes requirements for internationalization (I18N)
   of filesystems specifically in the context of Internet protocols, the
   architecture for filesystems in most currently popular general
   purpose operating systems, and their implications for filesystem
   I18N.  From the I18N requirements for filesystems and the
   architecture of running code we derive requirements and
   recommendations for implementors of operating systems and/or
   filesystems, as well as for Internet remote filesystem protocols.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 7, 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect



Williams                 Expires January 7, 2021                [Page 1]

Internet-Draft           Accept-Auth & Redirect                July 2020


   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   1.1.  Requirements Language . . . . . . . . . . . . . . . . . . .   3
   1.2.  Filesystem Internationalization . . . . . . . . . . . . . .   3
   1.2.1.  Canonical Equivalence (Normalization) . . . . . . . . . .   4
   1.2.2.  Case Foldings for Case-Insensitivity  . . . . . . . . . .   4
   1.2.3.  Caching Clients . . . . . . . . . . . . . . . . . . . . .   5
   1.3.  Running Code Architecture Notes . . . . . . . . . . . . . .   5
   2.  Filesystem I18N Guidelines  . . . . . . . . . . . . . . . . .   9
   2.1.  Filesystem I18N Guidelines: Non-Unicode File names  . . . .   9
   2.2.  Filesystem I18N Guidelines: Case-Insensitivity  . . . . . .   9
   2.3.  I18N Versioning . . . . . . . . . . . . . . . . . . . . . .   9
   3.  Filesystem Protocol I18N Guidelines . . . . . . . . . . . . .  10
   3.1.  I18N and Caching in Filesystem Protocol Clients . . . . . .  10
   4.  Internationalization Considerations . . . . . . . . . . . . .  10
   5.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  10
   6.  Security Considerations . . . . . . . . . . . . . . . . . . .  11
   7.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  11
   7.1.  Normative References  . . . . . . . . . . . . . . . . . . .  11
   7.2.  Informative References  . . . . . . . . . . . . . . . . . .  12
   7.3.  URIs  . . . . . . . . . . . . . . . . . . . . . . . . . . .  12
   Author's Address  . . . . . . . . . . . . . . . . . . . . . . . .  12

1.  Introduction

   [TBD: Add references galore.  How to reference Unicode?  How to
   reference US-ASCII?  How best to reference HFS+?  How best to
   reference ZFS?  May have to find useful references for POSIX and
   WIN32.  Various blog entries may be of interest -- can they be
   referenced?]

   We, the Internet community, have long concluded that we must
   internationalize all our protocols.  This is generally not an easy
   task, as often we are constrained by the realities of what can be
   achieved while maintaining backwards compatibility.

   In this document we focus on filesystem internationalization (I18N),
   specifically only for file names and file paths.  Here we address the
   two main issues that arise in filesystem I18N:

   o  Unicode equivalence




Williams                 Expires January 7, 2021                [Page 2]

Internet-Draft           Accept-Auth & Redirect                July 2020


   o  Case foldings for case-insensitivity

   These two issues are different flavors of the same generic issue:
   that there can be more than one way to write text with the same
   rendering and/or semantics.

   Only I18N issues relating to file names and paths are addressed here.
   In particular, I18N issues related to representations of user
   identities and groups, for use in access control lists (ACLs) or
   other authorization systems, are out of scope for this document.
   Also out of scope here are I18N issues related to Uniform Resource
   Identifiers (URIs) [RFC3986] or Internationalized Resource
   Identifiers (IRIs) [RFC3987].

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

1.2.  Filesystem Internationalization

   We must address two issues:

   o  Unicode equivalence

   o  Case foldings for case-insensitivity

   Unicode can represent certain character strings in multiple visually-
   and semantically-equivalent ways.  For example, there are two ways to
   express LATIN SMALL LETTER A WITH ACUTE (รก):

   o  U+00E1

   o  U+0061 U+0301

   For some glyphs there is a single way to write them.  For others
   there are two.  And for yet others there can be many more than two.

   To deal with the equivalence problem, Unicode defines Normal Forms
   (NFs), of which there are two basic ones: Normal Form Composed (NFC),
   and Normal Form Decomposed (NFD).  There are also NFs that use
   "compatibility" Foldings, NFKC and NFKD.  Unicode-aware applications
   can normalize text to avoid ambiguities, or they can use form-
   insensitive string comparisons, or both.

   Some filesystems support case-insensitivity, which is trivial to
   define and implement for US-ASCII, but non-trivial for Unicode,



Williams                 Expires January 7, 2021                [Page 3]

Internet-Draft           Accept-Auth & Redirect                July 2020


   requiring not only larger case-folding tables, but also localized
   case-folding tables as case-folding rules might differ from locale to
   locale.

1.2.1.  Canonical Equivalence (Normalization)

   For case-sensitive filesystems, only Unicode equivalence issues arise
   as to file names and file paths.  These can be addressed in one of
   two ways:

   o  normalize file names when created and when looked up,

   o  perform form-insensitive string comparisons on lookup.

   The first option yields normalized file names on-disk and on the wire
   (e.g., when listing directories).  We shall term this "normalize-on-
   CREATE", or sometimes "normalize-on-CREATE-and-LOOKUP", or even just
   "NoCL".

   The second option preserves form as originally produced by the user
   or on their behalf by their system's text input modes, but otherwise
   is form-insensitive.  That is, this option permits either encoding
   of, e.g., LATIN SMALL LETTER A WITH ACUTE on-disk and on the wire,
   but permits only one form of any string, whether normal or not.  We
   shall term this option "form-insensitive", or sometimes "form-
   insensitive and form-preserving", or just "FIP".

   Unicode compatibility equivalence allows equivalence between
   different representations of the same abstract character that may
   nonetheless have different visual appearance of behavior.  There are
   two canonical forms that support compatibility equivalence: NFKC and
   NFKD.  Using NoCL with NFKC or NFKD may be surprising to users in a
   visual way.  While form-insensitivity with NFKC or NFKD may surprise
   users who might consider two file names distinct even when Unicode
   considers them equivalent under compatibility equivalence.  The
   latter seems less likely and less surprising, though that is an
   entirely subjective judgement.

   We do not recommend either of NoCL or FIP over the other.

1.2.2.  Case Foldings for Case-Insensitivity

   Case-insensitivity implies folding characters of one case to another
   for comparison purposes, typically to lower-case.  These case
   foldings are defined by Unicode.  Generally, case-insensitive
   filesystems preserve original case just form-insensitive filesystems
   preserve original form.




Williams                 Expires January 7, 2021                [Page 4]

Internet-Draft           Accept-Auth & Redirect                July 2020


   It is possible that some case foldings may have to vary by locale.  A
   commonly used example of character where case foldings that varies by
   locale is LATIN SMALL LETTER DOTLESS I (U+0131).

   In some cases it may be possible to construct case-folding tailorings
   that are locale-neutral.  For example, all of the following conuld be
   considered equivalent:

   o  LATIN CAPITAL LETTER I (U+0049)

   o  LATIN SMALL LETTER I (U+0069)

   o  LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130)

   o  LATIN SMALL LETTER DOTLESS I (U+0131)

   which might satisfy a mix of users including those familiar with
   Turkish and those not, using the same filesystem.

1.2.3.  Caching Clients

   Remote filesystem protocols often involve caching on clients, which
   caching may require knowledge of filesystem I18N settings in order to
   permit local operations to be performed using cached directory
   listings that work the same way as on the server.  We do not specify
   any case foldings here.  Instead we will either create a registry of
   case folding tailorings, or use the Common Locale Data Repository
   (CLDR), then require that filesystems and servers be able to identify
   what case foldings are in effect for case-insensitive filesystems.

1.3.  Running Code Architecture Notes

   Surprisingly, almost all if not all general purpose operating systems
   in common use today have a "virtual filesystem switch" (VFS)
   [McKusick86] [wikipedia] [1] interface that permits the use of
   multiple different filesystem types on one system, all accessed
   through the same filesystems application programming interfaces
   (APIs).  The VFS is essentially a pluggable layer that includes
   functionality for routing calls from user processes to the
   appropriate filesystems.  The VFS has even been generalized and
   extended to support isolation, thus we have the Filesystem in
   Userspace (FUSE), which is akin to a remote filesystem protocol, but
   for use over local inter-process communications (IPC) facilities.

   The VFS architecture was developed in the 1980s, before Unicode
   adoption.  It is not surprising then that in general -if not simply
   always today- the code path from the interface between a user
   application and the operating system all the way to the filesystem



Williams                 Expires January 7, 2021                [Page 5]

Internet-Draft           Accept-Auth & Redirect                July 2020


   implements no I18N functionality whatsoever, and does the absolute
   minimum of character data interpretation:

   o  use of US-ASCII NUL (for "C string" termination),

   o  use of US-ASCII '/' and/or '\' (for file path component
      delimiting).

   For example, the 4.4BSD operating system and derivatives have a VFS
   [BSD4.4], as do Solaris and derivatives [SolarisInternals], Windows
   <https://docs.microsoft.com/en-us/windows-hardware/drivers/ifs/>, OS
   X, and Linux.  A VFS of a sort, including FUSE, may well be the only
   reasonable way to support more than one kind of filesystem while
   retaining compatibility with previously-existing filesystem APIs.
   This explains why so many modern operating systems have a VFS.

   Thus in most if not all general purpose operating systems today, the
   code path from the boundary between the application and the operating
   system, and the boundary between the VFS and the filesystem, is
   "just-use-8" or "just-use-16" (as in UTF-16 [UNICODE]), with no
   attempt at normalization or case folding done anywhere in between.

   There are filesystem servers that access raw storage directly and
   implement the filesystem and the remote filesystem protocol server in
   one monolythic stack without a VFS in the way, but it is very common
   to have remote filesystem protocol servers implemented on top of the
   VFS or on top of the system calls.  Even monolythic servers tend to
   support a notion of multiple filesystems in a server or volume, and
   may have different I18N settings for each filesystem.  Thus it's
   common to leave I18N handling to code layers close to the filesystem
   even in monolythic server implementations.

   In practice all of foregoing has led to I18N functionality residing
   strictly in the filesystem.  Two filesystems have defined the best
   current practices in this regard:

   o  HFS+, which does normalize-on-CREATE (and LOOKUP), normalizing to
      a form that is very close to NFD and is case-sensitive;

   o  ZFS, which implements form-insensitive, form-preserving behavior
      and optionally implements case-insensitive, case-preserving
      behavior on a per-filesystem basis.

   Altogether, these circumstances make it very difficult to reliably
   and always locate I18N functionality above the VFS, or to not use a
   VFS at all: there are too many places to alter, and all must agree
   exactly on I18N choices.  Moreover, implementing case-insensitive but
   case-preserving behavior above the VFS requires fully reading each



Williams                 Expires January 7, 2021                [Page 6]

Internet-Draft           Accept-Auth & Redirect                July 2020


   directory, and so does implementing form-insensitive and form-
   preserving behavior at the VFS layer itself.  The only behaviors that
   can be reliably implemented at or above the VFS are normalize- and
   case-fold-on-CREATE (and LOOKUP).

   Consider the set of already-running code that must all be modified in
   order to reliably implement I18N above the filesystem on general
   purpose operating systems:

   o  filesystem protocol servers, including but not limited to:

      *  Network File System (NFSv4) [RFC7530];

      *  Hypertext Transfer Protocol (HTTP) servers serving resources
         hosted on filesystems[RFC7230];

      *  SSH File Transfer Protocol (SFTP) [I-D.ietf-secsh-filexfer];

      *  various remote filesystem protocols that are not Internet
         Protocols (i.e., not standards-track Internet RFCs);

   o  POSIX system call layers or user process system call stub
      libraries;

   o  WIN32 system call layers or user process system call stub
      libraries.

   Regarding system calls and system call stubs in user process system
   libraries, the continued use of statically-linked executables means
   that these cannot reliably be modified.  Indeed, on some systems the
   Application Binary Interface (ABI) between user-space applications
   and the operating system kernel is well-defined and long-term stable.
   The system call handlers cannot reliably inspect the calling process
   to determine any attributes of its locale.  Adding new system calls
   is possible, but existing running code wouldn't use them.  For
   similar reasons, the VFS layer is generally (always) completely
   unaware of any attributes of the locale of applications calling it,
   whether via system calls or any other path.

   Unix-like operating systems are generally (always) "just-use-8",
   assuming only that file names and paths are C strings (i.e.,
   terminated by zero-valued bytes) and sufficiently compatible with US-
   ASCII that the file path component separator character, US-ASCII '/',
   is meaningful.  As a result, it is possible to find I18N-unaware
   filesystems with one or more non-Unicode, non-ASCII codesets in use
   for file names!  We leave non-ASCII and non-Unicode file names out of
   scope here.




Williams                 Expires January 7, 2021                [Page 7]

Internet-Draft           Accept-Auth & Redirect                July 2020


   For these reasons it is simply not practical to implement I18N at any
   layer above the VFS.

   Even in the VFS, form- and case-insensitive and -preserving behaviors
   would be difficult to implement as performantly as in the filesystem.
   The VFS would have to list a directory completely before being able
   to apply those behaviors.  It is reasonable to expect caching clients
   of remote filesystems to cache directory listings (especially for
   offline operation), but it isn't reasonable to expect the same of the
   VFS.  Compare to the filesystem itself, which can maintain a fast
   index (e.g., hash table or b-tree) where the keys are normalized and
   possibly case-folded file names and thus may not need to read
   directories in order to perform fast lookups that are form- and even
   case-insensitive.

   The only way to implement I18N behaviors in the VFS layer rather than
   at the filesystem is to abandon form- and case-preserving behaviors.
   For case-insensitivity this would require using sentence-case, or all
   lower-case, perhaps, and all such choices would surely be surprising
   to users.  At any rate, that approach would also render much running
   code "non-compliant" with any Internet filesystem protocol I18N
   specification.

   Therefore, generally speaking, only the filesystem can reliably,
   interoperably, and performantly implement I18N behaviors in general
   purpose operating systems.

   Note that variations in I18N behaviors can happen even on the same
   server with multiple filesystems of the same type.  This can happen
   because of

      different Unicode versions being used at the times of creation of
      various filesystems, and

      different locale settings on various filesystems.

   Locale variations are only relevant to case-folding for case-
   insensitivity.  Running code mostly uses default case-folding rules,
   but there is no reason to assume that locale-specific case-folding
   rules won't be supported by running code in the future.

   It may not be possible or easy for a filesystem to adopt new Unicode
   versions, or adopt backwards-incompatible case foldings, after
   content has been created in it that would be ambiguous under new
   rules.  This implies that where a client for a remote filesystem must
   know what I18N functionality to implement for use with cached
   directory listings, the client must know specifically what profile of
   I18N functionality each cached filesystem implements.



Williams                 Expires January 7, 2021                [Page 8]

Internet-Draft           Accept-Auth & Redirect                July 2020


2.  Filesystem I18N Guidelines

   We begin be recognizing and accepting that much running code
   implements I18N functionality at the filesystem.  Given this, we
   catalogue the range of acceptable behaviors.  Filesystems adhering to
   this specification MUST implement only acceptable I18N behaviors as
   specified here.  Acceptable variations may be registered in a to-be-
   determined (IANA?) registry of filesystem I18N behaviors.

2.1.  Filesystem I18N Guidelines: Non-Unicode File names

   o  Filesystems SHOULD reject attempts to create new non-Unicode file
      names.

   o  Filesystems either MUST normalize on CREATE (and LOOKUP), or MUST
      be form-insensitive and form-preserving.

   o  Filesystems MUST specify a Unicode version for their equivalence
      behaviors.

2.2.  Filesystem I18N Guidelines: Case-Insensitivity

   o  Filesystems MAY support case-insensitivity, in which case they
      SHOULD be case-preserving.  Filesystems that are case-insensitive
      but not case-preserving either MUST specify a case form, such as
      title case or sentence case.

   o  Case foldings for case-insensitive filesystems MUST be identified.
      The Unicode default case foldings SHOULD be the default case
      algorithms for the identified Unicode version without additional
      tailorings.  Filesystems that use case algorithms tailored to
      specific locales SHOULD use case foldings registered in a to-be-
      determined (IANA?) registry.

   o  Case-insensitive filesystems MUST specify a Unicode version for
      their case-insensitive behavior.

2.3.  I18N Versioning

   Each filesystem MUST identify a Unicode version for their I18N
   behaviors.  Filesystem implementations SHOULD adopt new Unicode
   versions as they are produced, though it is understood that it may be
   difficult to migrate non-empty filesystems to new Unicode versions.








Williams                 Expires January 7, 2021                [Page 9]

Internet-Draft           Accept-Auth & Redirect                July 2020


3.  Filesystem Protocol I18N Guidelines

   Remote filesystem protocols that allow clients to perform lookups
   against cached directory listings MUST allow clients to discover all
   relevant I18N behaviors of the filesystem whence any given directory
   listing:

   o  whether the filesystem normalizes on CREATE (and LOOKUP), and if
      so, to what NF in what Unicode version;

   o  whether the filesystem is form-insensitive and form-preserving,
      and if so, in what Unicode version;

   o  whether the filesystem is case-insensitive and case-preserving,
      and if so, with what foldings (default or tailured, and if
      tailored provide an identifier for the set of foldings), and a
      Unicode version.

   Foldings are identified via a folding set name as registered in a to-
   be-determined (IANA?) registry.

   Because some filesystems might allow for different I18N settings on a
   per-directory basis, remote filesystem protocols MUST allow those
   settings to be discoverable on a per-directory basis.

   Internet filesystem servers MUST reject attempts to create new non-
   Unicode file names.  (Note that this requirement is weaker ("SHOULD")
   for the actual filesystems, since those might have to allow non-
   Unicode content for legacy reasons via interfaces other than Internet
   filesystem protocols.)

3.1.  I18N and Caching in Filesystem Protocol Clients

   Caching clients of remote filesystems either MUST NOT perform lookups
   against cached directory listings, or MUST query the directories'
   filesystems' I18N profiles and apply the same I18N equivalent form
   policis and case-insensitivity case foldings.

4.  Internationalization Considerations

   This document deals in internationalization throughout.

5.  IANA Considerations

   [ALTERNATIVELY use locale names and CLDR?  Need to determine the
   stability of CLDR locales...  Basically, we need stable locale names,
   and stable case-folding mappings.]




Williams                 Expires January 7, 2021               [Page 10]

Internet-Draft           Accept-Auth & Redirect                July 2020


   We hereby request the creation of a new IANA registry with Expert
   Review registration rules with the following fields:

   o  name, an identifier-like name

   o  Unicode version number

   o  listing of case folding tailorings and/or references to external
      case folding tailoring specifications

   The case foldings registered here will be used by case-insensitive
   filesystems and filesystem protocols to identify tailored case
   foldings so that caching clients can implement the same case-
   insensitive behavior using cached directory listings.

6.  Security Considerations

   Security considerations of Unicode and filesystem protocols apply.
   No new security considerations are added or need be noted here.

   The methods of handling equivalent Unicode strings cause aliasing.
   This is not expected to be a security problem.

   Case-insensitivity causes aliasing.  This is not expected to be a
   security problem.

   No effort is made here to handle confusables.  This is not expected
   to be a serious security problem in the context of file servers.

7.  References

7.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
              2003, <https://www.rfc-editor.org/info/rfc3629>.

   [UNICODE]  The Unicode Consortium, "The Unicode Standard, Version
              12.1.0", May 2019,
              <https://www.unicode.org/versions/Unicode12.1.0/>.






Williams                 Expires January 7, 2021               [Page 11]

Internet-Draft           Accept-Auth & Redirect                July 2020


7.2.  Informative References

   [BSD4.4]   McKusik, M., Bostic, K., Karels, M., and J. Quarterman,
              "The Design and Implementation of the 4.4BSD Operating
              System", DOI 10.5555/231070, 1996.

   [I-D.ietf-secsh-filexfer]
              Galbraith, J. and O. Saarenmaa, "SSH File Transfer
              Protocol", draft-ietf-secsh-filexfer-13 (work in
              progress), July 2006.

   [McKusick86]
              McKusik, M. and M. Karels, "Towards a Compatible File
              System Interface", Jun 1986.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, DOI 10.17487/RFC3986, January 2005,
              <https://www.rfc-editor.org/info/rfc3986>.

   [RFC3987]  Duerst, M. and M. Suignard, "Internationalized Resource
              Identifiers (IRIs)", RFC 3987, DOI 10.17487/RFC3987,
              January 2005, <https://www.rfc-editor.org/info/rfc3987>.

   [RFC7230]  Fielding, R., Ed. and J. Reschke, Ed., "Hypertext Transfer
              Protocol (HTTP/1.1): Message Syntax and Routing",
              RFC 7230, DOI 10.17487/RFC7230, June 2014,
              <https://www.rfc-editor.org/info/rfc7230>.

   [RFC7530]  Haynes, T., Ed. and D. Noveck, Ed., "Network File System
              (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530,
              March 2015, <https://www.rfc-editor.org/info/rfc7530>.

   [SolarisInternals]
              McDougal, R. and J. Mauro, "Solaris Internals -- Solaris
              10 and OpenSolaris Kernel Architecture", 2007.

7.3.  URIs

   [1] https://en.wikipedia.org/wiki/Virtual_file_system

Author's Address









Williams                 Expires January 7, 2021               [Page 12]

Internet-Draft           Accept-Auth & Redirect                July 2020


   Nico Williams (editor)
   Cryptonector, LLC
   Austin, TX
   USA

   Email: nico@cryptonector.com













































Williams                 Expires January 7, 2021               [Page 13]