Network Working Group T. Bray
Internet-Draft Textuality Services
Intended status: Standards Track P. Hoffman
Expires: 14 April 2024 ICANN
12 October 2023
Unicode Character Repertoire Subsets
draft-bray-unichars-07
Abstract
This document discusses specifying subsets of the Unicode character
repertoire for use in protocols and data formats.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 14 April 2024.
Copyright Notice
Copyright (c) 2023 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Bray & Hoffman Expires 14 April 2024 [Page 1]
Internet-Draft Unicode Subsets October 2023
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1. Notation . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Characters and Code Points . . . . . . . . . . . . . . . . . 3
2.1. Transformation Formats . . . . . . . . . . . . . . . . . 3
2.2. Problematic Code Points . . . . . . . . . . . . . . . . . 4
2.2.1. Surrogates . . . . . . . . . . . . . . . . . . . . . 4
2.2.2. Control Codes . . . . . . . . . . . . . . . . . . . . 4
2.2.3. Noncharacters . . . . . . . . . . . . . . . . . . . . 5
3. Dealing With Problematic Code Points . . . . . . . . . . . . 5
4. Subset Character Repertoires . . . . . . . . . . . . . . . . 6
4.1. Unicode Scalars . . . . . . . . . . . . . . . . . . . . . 6
4.2. XML Characters . . . . . . . . . . . . . . . . . . . . . 7
4.3. Unicode Assignables . . . . . . . . . . . . . . . . . . . 7
5. Restricting Character Repertoires . . . . . . . . . . . . . . 8
6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 8
7. Security Considerations . . . . . . . . . . . . . . . . . . . 8
8. Normative References . . . . . . . . . . . . . . . . . . . . 8
9. Informative References . . . . . . . . . . . . . . . . . . . 9
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . 9
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 10
1. Introduction
When a protocol or data format has text fields, that text is normally
composed of Unicode [UNICODE] characters, to support use by speakers
of many languages. The "set of all Unicode characters" is generally
not a good choice for use in text fields. Instead, subsets such as
those discussed in this document are typically used.
The term "character repertoire" is a well-understood concept when
applied to an encoding standard. In this document, "character
repertoire" describes subsets of the Unicode character repertoire
that exclude some or all of the entities that are "problematic" as
defined in Section 2.2. Authors should have a way to concisely and
exactly reference a stable specification that identifies a protocol
or data format's character repertoire.
This document discusses issues that apply in choosing subsets, names
two subsets that have been popular choices in specifying character
repertoires, and suggests one new subset. The goal is to provide a
convenient target for cross-reference from other specifications which
discuss character repertoires.
1.1. Notation
In this document, the numeric values assigned to Unicode characters
are provided in hexadecimal. In the text, Unicode's standard "U+"
notation, zero-padded to at least four places, is used. For example,
"A", decimal 65, would be expressed as U+0041, and "😉" (Winking
Face), decimal 128,521, would be U+1F609.
Groups of numeric values described in Section 4 are given in ABNF
[RFC5234]. In ABNF, hexadecimal values are preceded by "%x" rather
than "U+".
All the numeric ranges in this document are inclusive.
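The notation described above can be reproduced mechanically. The following is a minimal sketch in Python 3; the function name u_plus is illustrative and is not defined by this document or by [UNICODE].

```python
def u_plus(ch: str) -> str:
    """Format one character in U+ notation, zero-padded to at least four hex digits."""
    return f"U+{ord(ch):04X}"


def abnf_value(ch: str) -> str:
    """Format the same numeric value the way ABNF writes it, with a %x prefix."""
    return f"%x{ord(ch):X}"


for ch in ("A", "\U0001F609"):
    print(ch, ord(ch), u_plus(ch), abnf_value(ch))
```

Values above U+FFFF simply use more than four hexadecimal digits; no padding is added in that case.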
2. Characters and Code Points
Definition D9 in section 3.4 of [UNICODE] defines "Unicode codespace"
as "a range of integers from 0 to 10FFFF_16". Definition D10 defines
"code point" as "Any value in the Unicode codespace".
The Unicode Standard's definition of "Unicode character" is
conceptual. However, each Unicode character is assigned a code
point, used to represent the characters in computer memory and
storage systems and, in specifications, to specify the allowed
repertoires of Unicode characters.
There are 1,114,112 code points; as of Unicode 15.1 (2023), fewer
than 150,000 have been assigned to characters. It is difficult to
specify that unassigned code points should be avoided, because they
regularly become assigned when new characters are added to Unicode.
2.1. Transformation Formats
Unicode describes a variety of "transformation formats", ways to
marshal code points into byte sequences. A survey of transformation
formats is beyond the scope of this document. However, it is useful
to note that the "UTF-16" format represents each code point with one
or two 16-bit code units, and the "UTF-8" format uses variable-length
sequences of one to four bytes.
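As an illustration of the two formats mentioned above, the following sketch uses Python 3's built-in codecs to marshal two code points. The helper name code_units is hypothetical, and the big-endian UTF-16 variant is chosen here only to avoid a byte order mark in the output.

```python
def code_units(ch: str) -> tuple:
    """Return (UTF-8 bytes, UTF-16 big-endian bytes) for a single character."""
    return ch.encode("utf-8"), ch.encode("utf-16-be")


for ch in ("A", "\U0001F609"):
    utf8, utf16 = code_units(ch)
    print(f"U+{ord(ch):04X}: {len(utf8)} UTF-8 byte(s), "
          f"{len(utf16) // 2} UTF-16 code unit(s)")
```

U+0041 occupies one byte in UTF-8 and one 16-bit unit in UTF-16, while U+1F609 occupies four bytes in UTF-8 and two 16-bit units in UTF-16.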
The "IETF Policy on Character Sets and Languages", BCP 18 [RFC2277],
says "Protocols MUST be able to use the UTF-8 charset", which becomes
a mandate to use UTF-8 for any protocol or data format that specifies
a single transformation format. UTF-8 is widely used for
interoperable data formats such as JSON, YAML, and XML.
2.2. Problematic Code Points
Definition D10a in section 3.4 of [UNICODE] defines seven code point
types. Three types of code points are assigned to entities which are
not actually characters or whose value as Unicode characters in text
fields is questionable: "Surrogate", "Control", and "Noncharacter".
In this document, "problematic" refers to code points whose type is
"Surrogate" or "Noncharacter", and to "legacy controls" as defined in
Section 2.2.2.2.
2.2.1. Surrogates
A total of 2,048 code points, in the range U+D800-U+DFFF, are divided
into two blocks called "high surrogates" and "low surrogates";
collectively the 2,048 code points are referred to as "surrogates".
Surrogates may only be used in Unicode texts encoded in UTF-16, where
a high-surrogate/low-surrogate pair represents a code point greater
than U+FFFF.
A surrogate which occurs in text encoded in any transformation format
other than UTF-16 has no meaning and may cause malfunction in
software that encounters it. In particular, it is impossible to
represent a surrogate in well-formed UTF-8.
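The relationship between surrogate pairs and code points greater than U+FFFF is simple arithmetic. The sketch below, in Python 3, shows the UTF-16 calculation in both directions and checks that Python's UTF-8 encoder refuses a lone surrogate; the function names are illustrative only.

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a code point above U+FFFF into (high surrogate, low surrogate)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = cp - 0x10000                     # 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate into the code point they represent."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf8_accepts(cp: int) -> bool:
    """Report whether Python's UTF-8 encoder will encode this code point."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print([hex(u) for u in to_surrogate_pair(0x1F609)])
print(utf8_accepts(0xD800), utf8_accepts(0xE000))
```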
2.2.2. Control Codes
Section 23.1 of [UNICODE] introduces the "Control Codes", for
compatibility with legacy pre-Unicode standards. They comprise 65
code points in the ranges U+0000-U+001F ("C0 Controls") and
U+0080-U+009F ("C1 Controls"), plus U+007F, "DEL".
2.2.2.1. Useful Controls
The C0 Controls include newline (U+000A), carriage return (U+000D),
and tab (U+0009); this document refers to these three characters as
the "useful controls".
2.2.2.2. Legacy Controls
Aside from the useful controls, the control codes are mostly obsolete
and generally lack interoperable semantics. This document uses the
phrase "legacy controls" to describe control codes that are not
useful controls.
Since the C0 Controls occupy the 32 smallest code points, starting at
zero, they are likely to occur in data as a result of programming
errors.
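The classification of control codes in this section can be sketched as follows in Python 3. The names USEFUL_CONTROLS, is_control, and is_legacy_control are illustrative and simply mirror this document's terminology.

```python
USEFUL_CONTROLS = {0x09, 0x0A, 0x0D}   # tab, newline, carriage return


def is_control(cp: int) -> bool:
    """True for the C0 Controls, DEL, and the C1 Controls."""
    return 0x00 <= cp <= 0x1F or cp == 0x7F or 0x80 <= cp <= 0x9F


def is_legacy_control(cp: int) -> bool:
    """True for control codes other than the three useful controls."""
    return is_control(cp) and cp not in USEFUL_CONTROLS


controls = [cp for cp in range(0x110000) if is_control(cp)]
legacy = [cp for cp in controls if is_legacy_control(cp)]
print(len(controls), len(legacy))
```

The counts follow from the ranges: 32 C0 Controls, 32 C1 Controls, and DEL give 65 control codes, of which 62 are legacy controls.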
2.2.3. Noncharacters
Certain code points are classified as "noncharacters", and [UNICODE]
asserts repeatedly that they are not designed or used for open
interchange.
Code points are organized into 17 "planes", each containing 2^16 code
points. The last two code points in each plane are noncharacters:
U+00FFFE, U+00FFFF, U+01FFFE, U+01FFFF, U+02FFFE, U+02FFFF, and so
on, up to U+10FFFE, U+10FFFF.
The code points in the range U+FDD0-U+FDEF are noncharacters.
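The two rules above are enough to test any code point. The following minimal sketch in Python 3 uses a bit mask to detect the last two code points of each plane; the function name is_noncharacter is illustrative.

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0-U+FDEF and for the last two code points of each plane."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    # The low 16 bits are FFFE or FFFF exactly when bits 1-15 are all set.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


PLANES = 17
CODESPACE_SIZE = PLANES * 2**16
noncharacters = [cp for cp in range(CODESPACE_SIZE) if is_noncharacter(cp)]
print(CODESPACE_SIZE, len(noncharacters))
```

This yields 32 noncharacters in the U+FDD0-U+FDEF range plus two in each of the 17 planes, for 66 in total.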
3. Dealing With Problematic Code Points
[RFC9413], "Maintaining Robust Protocols", provides a thorough
discussion of strategies for dealing with issues in input data, for
example problematic code points.
Different types of problematic code points cause different issues.
Noncharacters and legacy controls are unlikely to cause software
failures, but they cannot usefully be displayed to humans, and they
can be used in attacks that mislead human readers when software
attempts to display them [TR36].
Surrogate code points have been observed to cause software failures.
The behavior of software which encounters them is unpredictable and
differs among programming-language implementations, and even between
different API calls in the same language.
Section 3.9 of [UNICODE] makes it clear that a UTF-8 byte sequence
which would map to a surrogate is ill-formed. Thus, in theory, if a
specification requires that input data be encoded with UTF-8,
implementors should never have to concern themselves with surrogates.
Unfortunately, industry experience teaches that problematic code
points, including surrogates, can and do occur in program input where
the source of input data is not controlled by the implementor. In
particular, the specification of JSON allows any code point to appear
in object member names and string values [RFC8259]; the following is
a conforming JSON text:
{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}
The value of the "example" field contains the C0 Control NUL, the C1
Control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired
surrogate, and the noncharacter U+7FFFF encoded per JSON rules as two
escaped UTF-16 surrogate code points. It is unlikely to be useful as
the value of a text field. That value cannot be serialized into
well-formed UTF-8, but the behavior of libraries asked to parse the
sample is unpredictable; some will silently parse this and generate
an ill-formed UTF-8 string.
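As one data point only, the sketch below shows what the json module in CPython 3 does with the sample. Other libraries and languages may behave differently, which is the concern raised above.

```python
import json

text = r'{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}'

# CPython's json module accepts the text without complaint.
value = json.loads(text)["example"]
code_points = [ord(ch) for ch in value]
print([f"U+{cp:04X}" for cp in code_points])

# The parsed string holds a lone surrogate, so it cannot be written as UTF-8.
try:
    value.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print("encodable as UTF-8:", encodable)
```

Here the parser combines the escaped surrogate pair into U+7FFFF, keeps the unpaired surrogate U+DEAD as is, and leaves it to later code to discover that the result cannot be encoded.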
Reasonable options for dealing with problematic input include, first,
rejecting text containing problematic code points, and second,
replacing them with placeholders. (As an exception, [UNICODE] notes
that it may in some cases be appropriate, specifically for
noncharacters, to treat them as non-problematic unassigned code
points.)
Silently deleting an ill-formed part of a string is a known security
risk. Responding to that risk, [UNICODE] section 3.2 recommends
dealing with ill-formed byte sequences by signaling an error, or
replacing problematic code points, ideally with "�" (U+FFFD,
REPLACEMENT CHARACTER), although some popular software platforms,
notably Java, use "?".
[RFC9413] emphasizes that when encountering problematic input, software
should consider the field as a whole, not individual code points or
bytes.
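The two options discussed in this section, rejecting and replacing, can be sketched as follows in Python 3. All names here are illustrative, and the choice between the two strategies is left to the specification or application; rejection is applied to the field as a whole.

```python
REPLACEMENT_CHARACTER = "\uFFFD"


def is_problematic(cp: int) -> bool:
    """True for surrogates, legacy controls, and noncharacters."""
    if 0xD800 <= cp <= 0xDFFF:
        return True
    if cp in (0x09, 0x0A, 0x0D):               # useful controls
        return False
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:       # legacy controls
        return True
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def reject_problematic(field: str) -> str:
    """Return the field unchanged, or raise if any code point is problematic."""
    for index, ch in enumerate(field):
        if is_problematic(ord(ch)):
            raise ValueError(f"problematic code point U+{ord(ch):04X} at index {index}")
    return field


def replace_problematic(field: str) -> str:
    """Substitute U+FFFD for each problematic code point; never delete silently."""
    return "".join(
        REPLACEMENT_CHARACTER if is_problematic(ord(ch)) else ch for ch in field
    )


print(replace_problematic("a\x00b\udeadc").encode("utf-8"))
```

Replacement keeps the length of the field unchanged, which avoids the security risk associated with silently deleting part of a string.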
4. Subset Character Repertoires
This section describes subsets of the code points that can be used in
specifying character repertoires for text fields in protocols and
data formats. Specifications can refer to these subsets by the names
"Unicode Scalars", "XML Characters", and "Unicode Assignables".
4.1. Unicode Scalars
Definition D76 in section 3.9 of [UNICODE] defines the term "Unicode
scalar value" as "Any Unicode code point except high-surrogate and
low-surrogate code points."
The "Unicode Scalars" subset can be expressed as an ABNF production:
unicode-scalar =
%x0-D7FF / %xE000-10FFFF ; exclude surrogates
This subset is the default character repertoire for CBOR [RFC8949],
and has the advantage of excluding surrogates. However, it includes
legacy controls and noncharacters.
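The production above translates directly into a range test. This is a minimal sketch in Python 3; the function name is_unicode_scalar is illustrative.

```python
def is_unicode_scalar(cp: int) -> bool:
    """True for every code point except the surrogates U+D800-U+DFFF."""
    return 0x0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


scalar_count = sum(is_unicode_scalar(cp) for cp in range(0x110000))
print(scalar_count)
```

The count is the 1,114,112 code points minus the 2,048 surrogates, that is, 1,112,064.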
4.2. XML Characters
The XML 1.0 Specification [XML], in its grammar production labeled
"Char", specifies a subset of Unicode code points that excludes
surrogates, legacy C0 Controls, and the noncharacters U+FFFE and
U+FFFF.
The "XML Characters" subset can be expressed as an ABNF production:
xml-character =
%x9 / %xA / %xD / ; useful controls
%x20-D7FF / ; exclude surrogates
%xE000-FFFD / ; exclude FFFE and FFFF nonchars
%x10000-10FFFF
While this subset does not exclude all the problematic code points,
the C1 Controls are less likely than the C0 Controls to appear
erroneously in data, and have not been observed to be a frequent
source of problems. Also, the noncharacters greater in value than
U+FFFF are rarely encountered.
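For comparison, the "Char" production of XML 1.0 [XML], whose final range is #x10000-#x10FFFF, can be sketched as a range test in Python 3. The function name is_xml_character is illustrative.

```python
def is_xml_character(cp: int) -> bool:
    """True for code points matched by the XML 1.0 'Char' production."""
    return (
        cp in (0x9, 0xA, 0xD)            # useful controls
        or 0x20 <= cp <= 0xD7FF          # excludes other C0 Controls, surrogates
        or 0xE000 <= cp <= 0xFFFD        # excludes U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF     # all supplementary planes
    )


for cp in (0x00, 0x7F, 0x85, 0xFDD0, 0xFFFE, 0x1FFFE):
    print(f"U+{cp:04X}", is_xml_character(cp))
```

The loop makes the remaining gaps visible: DEL, the C1 Controls, the U+FDD0-U+FDEF noncharacters, and the noncharacters above U+FFFF are all accepted.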
4.3. Unicode Assignables
This document defines the "Unicode Assignables" subset as all the
Unicode code points that are not problematic. This subset comprises
all code points that are currently assigned, or might in future be
assigned, to characters that are not legacy control codes.
Unicode Assignables can be expressed as an ABNF production:
unicode-assignable =
%x9 / %xA / %xD / ; useful controls
%x20-7E / ; exclude C1 Controls and DEL
%xA0-D7FF / ; exclude surrogates
%xE000-FDCF / ; exclude FDD0 nonchars
%xFDF0-FFFD / ; exclude FFFE and FFFF nonchars
%x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane)
%x30000-3FFFD / %x40000-4FFFD /
%x50000-5FFFD / %x60000-6FFFD /
%x70000-7FFFD / %x80000-8FFFD /
%x90000-9FFFD / %xA0000-AFFFD /
%xB0000-BFFFD / %xC0000-CFFFD /
%xD0000-DFFFD / %xE0000-EFFFD /
%xF0000-FFFFD / %x100000-10FFFD
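The production above can be written as a table of inclusive ranges, with the per-plane ranges generated rather than listed. The following is a minimal sketch in Python 3; the names ASSIGNABLE_RANGES and is_unicode_assignable are illustrative.

```python
ASSIGNABLE_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),   # useful controls
    (0x20, 0x7E),                         # exclude DEL and C1 Controls
    (0xA0, 0xD7FF),                       # exclude surrogates
    (0xE000, 0xFDCF),                     # exclude FDD0 nonchars
    (0xFDF0, 0xFFFD),                     # exclude FFFE and FFFF nonchars
] + [
    # Planes 1 through 16: everything except the last two code points.
    (plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 17)
]


def is_unicode_assignable(cp: int) -> bool:
    """True if the code point falls in one of the inclusive ranges above."""
    return any(low <= cp <= high for low, high in ASSIGNABLE_RANGES)


assignable_count = sum(high - low + 1 for low, high in ASSIGNABLE_RANGES)
print(assignable_count)
```

The total agrees with the definitions in Section 2.2: 1,114,112 code points, less 2,048 surrogates, 62 legacy controls, and 66 noncharacters, leaves 1,111,936.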
5. Restricting Character Repertoires
Many IETF specifications rely on well-known data formats such as
JSON, I-JSON, CBOR, YAML, and XML. These formats have default
character repertoires. For example, JSON allows object member names
and string values to include any Unicode code point, including all
the problematic types.
A protocol based on JSON can be made more robust and implementor-
friendly by restricting the contents of object member names and
string values to one of the subsets described in Section 4.
Equivalent restrictions are possible for other packaging formats such
as I-JSON, XML, YAML, and CBOR.
Note that escaping techniques such as those in the JSON example in
Section 3 cannot be used to circumvent this sort of character-
repertoire restriction, which applies to data content, not textual
representation in packaging formats. If a specification restricted a
JSON field value to the Unicode Assignables, the example would remain
a conforming JSON text, but the data it represents would not consist
solely of Unicode Assignable code points.
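The point that the restriction applies to data content rather than to escaping can be illustrated by checking values after parsing. The following sketch in Python 3 assumes a hypothetical protocol that restricts all JSON member names and string values to the Unicode Assignables; the function names are illustrative.

```python
import json


def is_unicode_assignable(cp: int) -> bool:
    """True if the code point is not a surrogate, legacy control, or noncharacter."""
    if cp in (0x9, 0xA, 0xD):
        return True
    if cp < 0x20 or 0x7F <= cp <= 0x9F or 0xD800 <= cp <= 0xDFFF:
        return False
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return False
    return cp <= 0x10FFFF


def strings_in(node):
    """Yield every member name and string value in a parsed JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for name, member in node.items():
            yield name
            yield from strings_in(member)
    elif isinstance(node, list):
        for item in node:
            yield from strings_in(item)


def conforms(json_text: str) -> bool:
    """Check parsed content, so escaped and literal characters are treated alike."""
    parsed = json.loads(json_text)
    return all(is_unicode_assignable(ord(ch)) for s in strings_in(parsed) for ch in s)


print(conforms(r'{"example": "\u0041\n"}'))
print(conforms(r'{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}'))
```

Because the check runs on the parsed strings, an escaped "A" is accepted just as a literal one would be, while the escaped problematic code points are caught.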
6. IANA Considerations
This document makes no requests of IANA.
7. Security Considerations
Unicode Security Considerations [TR36] is a wide-ranging survey of
the issues implementors should consider while writing software to
process Unicode text. Many of the attacks it discusses are aimed at
deceiving human readers, but vulnerabilities involving issues such as
surrogates and noncharacters are also covered, and in fact can
contribute to human-deceiving exploits.
Note that the Unicode-character subsets specified in this document
include a successively-decreasing number of problematic code points,
and thus should be less and less susceptible to vulnerabilities. The
Section 4.3 subset, "Unicode Assignables", excludes all of them.
8. Normative References
[RFC5234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", STD 68, RFC 5234,
DOI 10.17487/RFC5234, January 2008,
<https://www.rfc-editor.org/info/rfc5234>.
[TR36] The Unicode Consortium, "Unicode Security Considerations",
<https://www.unicode.org/reports/tr36/>. Note that this
reference is to the latest version of this document,
rather than to a specific release. It is not expected
that future updates will affect the referenced
discussions.
[UNICODE] The Unicode Consortium, "The Unicode Standard",
<http://www.unicode.org/versions/latest/>. Note that this
reference is to the latest version of Unicode, rather than
to a specific release. It is not expected that future
changes in the Unicode Standard will affect the referenced
definitions.
9. Informative References
[RFC2277] Alvestrand, H., "IETF Policy on Character Sets and
Languages", BCP 18, RFC 2277, DOI 10.17487/RFC2277,
January 1998, <https://www.rfc-editor.org/info/rfc2277>.
[RFC8259] Bray, T., Ed., "The JavaScript Object Notation (JSON) Data
Interchange Format", STD 90, RFC 8259,
DOI 10.17487/RFC8259, December 2017,
<https://www.rfc-editor.org/info/rfc8259>.
[RFC8949] Bormann, C. and P. Hoffman, "Concise Binary Object
Representation (CBOR)", STD 94, RFC 8949,
DOI 10.17487/RFC8949, December 2020,
<https://www.rfc-editor.org/info/rfc8949>.
[RFC9413] Thomson, M. and D. Schinazi, "Maintaining Robust
Protocols", RFC 9413, DOI 10.17487/RFC9413, June 2023,
<https://www.rfc-editor.org/info/rfc9413>.
[XML] Bray, T., Paoli, J., McQueen, C.M., Maler, E., and F.
Yergeau, "Extensible Markup Language (XML) 1.0 (Fifth
Edition)", 26 November 2008,
<http://www.w3.org/TR/2008/REC-xml-20081126/>. Note that
this reference is to a specific release, based on a
history of previous "Edition" releases having changed this
production.
Acknowledgements
Thanks are due to Guillaume Fortin-Debigaré, who filed an Errata
Report against RFC 8259, The JavaScript Object Notation, noting
frequent references to "Unicode characters", when in fact the RFC
formally specifies the use of Unicode Code Points.
Thanks also to Asmus Freytag for careful review and many constructive
suggestions aimed at making the language more consistent with the
structure of the Unicode Standard.
Thanks also to James Manger for the correctness of the ABNF and JSON
samples.
Authors' Addresses
Tim Bray
Textuality Services
Email: tbray@textuality.com
Paul Hoffman
ICANN
Email: paul.hoffman@icann.org