Network Working Group T. Bray
Internet-Draft Textuality Services
Intended status: Standards Track P. Hoffman
Expires: 4 March 2024 ICANN
1 September 2023
Specifying Unicode Character Repertoires in RFCs
draft-bray-unichars-02
Abstract
This document describes how to specify the use of Unicode characters
in a helpful and unambiguous way.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 4 March 2024.
Copyright Notice
Copyright (c) 2023 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Bray & Hoffman Expires 4 March 2024 [Page 1]
Internet-Draft Specifying Unicode September 2023
Table of Contents
1. Introduction
1.1. Terminology
1.2. Notation
2. Character Concepts
2.1. Transformation Formats
2.2. Problematic Code Point Types
2.2.1. Surrogates
2.2.2. Control Codes
2.2.3. Noncharacters
3. Subsets Defined in the Unicode Standard
3.1. Unicode Code Points
3.2. Unicode Scalar Values
4. Other Definitions
4.1. XML Characters
4.2. Useful Assignables
5. IANA Considerations
6. Security Considerations
7. Normative References
8. Informative References
Acknowledgements
Authors' Addresses
1. Introduction
When a protocol or data format has text fields, that text is normally
composed of Unicode [UNICODE] characters, to support use by speakers
of many languages. IETF policy mandates this [RFC2277].
Unfortunately, the Unicode Standard does not define the term "Unicode
character" in a way that is useful for technical specifications.
Protocols and data formats SHOULD state exactly which subset of the
available Unicode characters is to be used. The term "character
repertoire" is normally applied to an encoding standard; in this
document it describes selected subsets of the Unicode characters.
Authors should have a way to concisely and exactly reference a stable
specification that identifies a protocol or data format's character
repertoire.
This document describes and names several subsets that have been
popular choices in specification character repertoires, and suggests
one new subset. The goal is to provide a convenient target for
cross-reference from other specifications which discuss character
repertoires.
1.1. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
1.2. Notation
In this document, the numeric values assigned to Unicode characters
are provided in hexadecimal. In the text, Unicode’s standard "U+"
notation [RFC5137] is used. For example, "A", decimal 65, would be
expressed as U+0041, and "😉" (Winking Face), decimal 128,521, would
be U+1F609.
Certain groups of numeric values described in Section 3 and Section 4
are given in ABNF [RFC5234]. In ABNF, the hexadecimal values for
characters are preceded by "%x" rather than "U+".
All the numeric ranges in this document are inclusive.
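As a non-normative illustration of the two notations, the Python sketch below formats an integer code point in "U+" form and in ABNF "%x" form. The helper names u_plus and abnf_hex are hypothetical and are introduced here only for illustration.

```python
def u_plus(code_point: int) -> str:
    """Format a code point in "U+" notation, padded to at least four hex digits."""
    return f"U+{code_point:04X}"


def abnf_hex(code_point: int) -> str:
    """Format a code point as an ABNF hexadecimal terminal (prefix "%x")."""
    return f"%x{code_point:X}"


print(u_plus(ord("A")))     # U+0041
print(u_plus(128521))       # U+1F609
print(abnf_hex(ord("A")))   # %x41
```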
2. Character Concepts
The Unicode Standard's definition of "Unicode character" is
conceptual. However, each Unicode character is assigned an integer
identifier in the range U+0000 through U+10FFFF. These numbers are
used to represent the characters in computer memory and storage
systems and, in specifications, to specify the allowed repertoires of
Unicode characters.
The numbers assigned to Unicode characters are called "code points";
there are potentially 1,114,112 of them. As of 2023, fewer than
150,000 characters have had code points assigned. While the
inclusion of unassigned code points in text data is undesirable, it
is difficult to specify that it should be avoided, because unassigned
code points regularly become assigned as new characters are added to
Unicode. Fortunately, the occurrence of unassigned code points in
texts is generally unlikely to cause software to malfunction.
2.1. Transformation Formats
Unicode describes a variety of "transformation formats", ways to
encode code points in bytes of computer memory. A survey of
transformation formats is beyond the scope of this document.
However, it is useful to note that the "UTF-16" transformation format
represents each code point with one or two 16-bit code units, and the
"UTF-8" transformation format represents each code point with one to
four bytes.
Use of the UTF-8 transformation format is mandated by the IETF
[RFC2277], and UTF-8 is widely used for interoperable data formats
such as JSON, YAML, and XML.
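The size difference between the two transformation formats can be observed with a short, non-normative Python sketch. The helper name encoded_sizes is hypothetical; the codec names "utf-16-be" and "utf-8" are those of the Python standard library.

```python
def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-16 code units, UTF-8 bytes) needed for a single character."""
    utf16_units = len(ch.encode("utf-16-be")) // 2  # two bytes per 16-bit unit
    utf8_bytes = len(ch.encode("utf-8"))
    return utf16_units, utf8_bytes


for ch in ("A", "\u00E9", "\u20AC", "\U0001F609"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
# U+0041 (1, 1)
# U+00E9 (1, 2)
# U+20AC (1, 3)
# U+1F609 (2, 4)
```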
2.2. Problematic Code Point Types
Definition D10a in section 3.4 of [UNICODE] defines seven code point
types. Three types of code points are assigned to constructs which
are not actually characters or whose value as Unicode characters is
questionable: "Control", "Surrogate", and "Noncharacter".
2.2.1. Surrogates
A total of 2,048 code points, in the range U+D800-U+DFFF, are divided
into two blocks called "high surrogates" and "low surrogates";
collectively the 2,048 code points are referred to as "surrogates".
Surrogates may only be used in Unicode texts encoded in UTF-16, where
a high-surrogate/low-surrogate pair represents a code point greater
than U+FFFF.
A surrogate which occurs as a singleton, or in an improperly-composed
pair, or in text encoded in any transformation format other than UTF-
16, has no meaning and may cause malfunction in software that
encounters it. In particular, it is impossible to represent a
surrogate in well-formed UTF-8.
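The following non-normative Python sketch shows how a code point greater than U+FFFF maps to a high-surrogate/low-surrogate pair, and that a lone surrogate is rejected by Python's strict UTF-8 encoder. The function names to_surrogate_pair and can_encode_utf8 are hypothetical.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def can_encode_utf8(text: str) -> bool:
    """Report whether the text can be serialized as well-formed UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


print([hex(u) for u in to_surrogate_pair(0x1F609)])  # ['0xd83d', '0xde09']
print(can_encode_utf8("\U0001F609"))                 # True
print(can_encode_utf8("\uDEAD"))                     # False
```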
2.2.2. Control Codes
Section 23.1 of [UNICODE] introduces the "Control Codes" for
compatibility with legacy pre-Unicode standards. They comprise 65
code points in the ranges U+0000-U+001F ("C0 Controls") and
U+0080-U+009F ("C1 Controls"), plus U+007F, "DEL".
2.2.2.1. Useful Controls
The C0 Controls include the newline (U+000A), carriage return
(U+000D), and Tab (U+0009); this document refers to these three
characters as the "useful controls".
2.2.2.2. Useless Controls
Aside from the useful controls, the control codes are mostly obsolete
and generally lack interoperable semantics. This document uses the
phrase "useless controls" to describe control codes that are not
useful controls.
Since the code points for C0 Controls include the 32 smallest
integers including zero, they are likely to occur in data as a result
of programming errors.
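The distinction between useful and useless controls can be sketched in Python as follows. This is a non-normative illustration; the names USEFUL_CONTROLS, is_control, and is_useless_control are hypothetical.

```python
# Tab, newline, and carriage return: the "useful controls".
USEFUL_CONTROLS = frozenset({0x09, 0x0A, 0x0D})


def is_control(cp: int) -> bool:
    """True for the C0 Controls, DEL, and the C1 Controls."""
    return 0x00 <= cp <= 0x1F or cp == 0x7F or 0x80 <= cp <= 0x9F


def is_useless_control(cp: int) -> bool:
    """True for control codes other than the three useful controls."""
    return is_control(cp) and cp not in USEFUL_CONTROLS


controls = [cp for cp in range(0x110000) if is_control(cp)]
useless = [cp for cp in controls if is_useless_control(cp)]
print(len(controls), len(useless))  # 65 62
```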
2.2.3. Noncharacters
Certain code points are classified as "noncharacters", and [UNICODE]
asserts in multiple chapters that they are not designed or used for
open interchange.
Code points are organized into 17 "planes", each containing 2^16 code
points. The last two code points in each plane are noncharacters:
U+00FFFE, U+00FFFF, U+01FFFE, U+01FFFF, U+02FFFE, U+02FFFF, and so on,
up to U+10FFFE, U+10FFFF.
The code points in the range U+FDD0 to U+FDEF are noncharacters.
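Both groups of noncharacters can be tested with a small predicate. The Python sketch below is non-normative, and the name is_noncharacter is hypothetical. It relies on the observation that the last two code points of a plane are exactly those whose low 16 bits are FFFE or FFFF.

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0-U+FDEF and for the last two code points of each plane."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # Masking with FFFE ignores the lowest bit, so FFFE and FFFF both match.
    return (cp & 0xFFFE) == 0xFFFE


noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(noncharacter_count)  # 66, that is 32 + (17 planes * 2)
```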
3. Subsets Defined in the Unicode Standard
This section describes popular subsets of the code points that are
defined in [UNICODE]. Specifications can refer to these repertoires
by the names "Unicode Code Points" and "Unicode Scalar Values".
3.1. Unicode Code Points
Definition D9 in section 3.4 of [UNICODE] defines the term "Unicode
codespace" as "a range of integers from 0 to 10FFFF_16". Definition
D10 defines the term "Code point" as "Any value in the Unicode
codespace".
The "Unicode Code Points" subset can be expressed as an ABNF
production:
unicode-code-points =
%x0-10FFFF
This subset is notable for including all possible code points. It
has been adopted by JSON [RFC8259]. However, it includes all of the
code points with problematic types listed above. For example, the
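sample below is a legal JSON text.
{"example": "\u0000\uDEAD\uD9BF\uDFFF"}
The value of the "example" field contains the C0 Control NUL, an
unpaired surrogate, and the noncharacter U+7FFFF (written as the
escaped surrogate pair "\uD9BF\uDFFF", because a JSON "\u" escape
takes exactly four hexadecimal digits). It cannot be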
serialized into legal UTF-8, but many libraries will silently parse
this and generate an ill-formed UTF-8 string. Implementors must be
prepared to deal with these sorts of problematic code points.
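The behavior described above can be observed with the json module of the Python standard library. This is a non-normative sketch of one library's behavior, not a statement about JSON parsers in general. The JSON text used here escapes NUL, an unpaired surrogate, and the noncharacter U+7FFFF (as the surrogate pair \uD9BF\uDFFF); the variable names are hypothetical.

```python
import json

# A raw string keeps the backslashes so that the JSON parser sees the escapes.
json_text = r'{"example": "\u0000\uDEAD\uD9BF\uDFFF"}'

value = json.loads(json_text)["example"]       # parsed without complaint
code_points = [f"U+{ord(ch):04X}" for ch in value]
print(code_points)  # ['U+0000', 'U+DEAD', 'U+7FFFF']

# The unpaired surrogate makes strict UTF-8 serialization fail.
try:
    value.encode("utf-8")
    utf8_ok = True
except UnicodeEncodeError:
    utf8_ok = False
print(utf8_ok)  # False
```

Python refuses to emit ill-formed UTF-8 here; other libraries may behave differently, which is why a specification benefits from naming its repertoire explicitly.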
3.2. Unicode Scalar Values
Definition D76 in section 3.9 of [UNICODE] defines the term "Unicode
scalar value" as "Any Unicode code point except high-surrogate and
low-surrogate code points."
The "Unicode Scalar Values" subset can be expressed as an ABNF
production:
unicode-scalar-values =
%x0-D7FF / %xE000-10FFFF ; exclude surrogates
This subset has the advantage of excluding surrogates, which can
never add any value and have the potential to cause problems. This
subset has been adopted by I-JSON [RFC7493]. However, it includes
useless controls and noncharacters. For example, the sample below is
a legal I-JSON text.
{"example": "\u0000\uD9BF\uDFFF"}
The value of the "example" field, the C0 Control NUL followed by the
noncharacter U+7FFFF (escaped as a surrogate pair), can be serialized
into legal UTF-8, but is unlikely ever to be useful in practice.
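A check against the "Unicode Scalar Values" subset amounts to excluding the surrogate range. The Python sketch below is non-normative, and the names is_scalar_value and all_scalar_values are hypothetical.

```python
def is_scalar_value(cp: int) -> bool:
    """True for any code point that is not a surrogate."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def all_scalar_values(text: str) -> bool:
    """True if every code point in the text is a Unicode scalar value."""
    return all(is_scalar_value(ord(ch)) for ch in text)


print(all_scalar_values("\u0000\U0007FFFF"))  # True: NUL and a noncharacter pass
print(all_scalar_values("\uDEAD"))            # False: unpaired surrogate
```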
4. Other Definitions
This section lists other ways to specify subsets of the code points
beyond those provided by the Unicode Standard itself. These subsets
may serve as more appropriate character repertoires for some
protocols and data formats than those in Section 3, depending on
their needs. Specifications can refer to these repertoires by the
names "XML Characters" and "Useful Assignables".
4.1. XML Characters
The XML 1.0 Specification [XML], in its grammar production labeled
"Char", specifies a range of Unicode code points that excludes
surrogates, useless C0 Controls, and the noncharacters U+FFFE and
U+FFFF.
XML characters exclude surrogates and some, but not all, useless
controls. For example, the sample below is a well-formed XML
document.
<example>&#x7F;&#x89;&#x7FFFF;</example>
The "example" element contains the useless DEL control, the useless
"CHARACTER TABULATION WITH JUSTIFICATION" control, and the
noncharacter U+7FFFF. It is unlikely ever to be useful in practice.
The "XML Characters" subset can be expressed as an ABNF production:
xml-chars =
%x9 / %xA / %xD / ; useful controls
%x20-D7FF / ; exclude surrogates
%xE000-FFFD / ; exclude FFFE and FFFF nonchars
%x10000-10FFFF
While this subset does not exclude all the problematic code points,
the C1 Controls are less likely than the C0 Controls to appear
erroneously in data, and have not been observed to be a frequent
source of problems. Also, the noncharacters greater in value than
U+FFFF are rarely encountered.
This subset may be especially appropriate for data formats which may
be represented in either JSON or XML.
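The "XML Characters" subset can be checked with a predicate that follows the ranges of the "Char" production in [XML]. The Python sketch below is non-normative, and the name is_xml_char is hypothetical.

```python
def is_xml_char(cp: int) -> bool:
    """True if the code point is matched by the XML 1.0 "Char" production."""
    return (
        cp in (0x9, 0xA, 0xD)            # useful controls
        or 0x20 <= cp <= 0xD7FF          # up to, but excluding, the surrogates
        or 0xE000 <= cp <= 0xFFFD        # excludes U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF     # all supplementary planes
    )


# DEL, a C1 Control, and a supplementary-plane noncharacter are all accepted.
print([is_xml_char(cp) for cp in (0x7F, 0x89, 0x7FFFF)])  # [True, True, True]
# NUL, a surrogate, and U+FFFE are rejected.
print([is_xml_char(cp) for cp in (0x00, 0xD800, 0xFFFE)])  # [False, False, False]
```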
4.2. Useful Assignables
For convenience, this document defines the "Useful Assignables"
subset as the Unicode code points, excluding the useless controls,
surrogates, and noncharacters. This comprises all code points that
are currently assigned, or might in future be assigned, to characters
that are not legacy control codes, plus the useful controls.
Useful Assignables can be expressed as an ABNF production:
useful-assignables =
%x9 / %xA / %xD / ; useful controls
%x20-7E / ; exclude C1 Controls and DEL
%xA0-D7FF / ; exclude surrogates
%xE000-FDCF / ; exclude FDD0 nonchars
%xFDF0-FFFD / ; exclude FFFE and FFFF nonchars
%x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane)
%x30000-3FFFD / %x40000-4FFFD /
%x50000-5FFFD / %x60000-6FFFD /
%x70000-7FFFD / %x80000-8FFFD /
%x90000-9FFFD / %xA0000-AFFFD /
%xB0000-BFFFD / %xC0000-CFFFD /
%xD0000-DFFFD / %xE0000-EFFFD /
%xF0000-FFFFD / %x100000-10FFFD
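The prose definition (all code points except useless controls, surrogates, and noncharacters) and a range-based formulation can be cross-checked against each other. The Python sketch below is non-normative; the names is_useful_assignable and useful_assignable_ranges are hypothetical, and the ranges are generated per plane, each supplementary plane running from its first code point through the one ending in FFFD.

```python
def is_useful_assignable(cp: int) -> bool:
    """Apply the prose definition: exclude the three problematic types."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    if cp in (0x09, 0x0A, 0x0D):
        return True                       # useful controls
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:
        return False                      # useless C0 Controls, DEL, C1 Controls
    if 0xD800 <= cp <= 0xDFFF:
        return False                      # surrogates
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return False                      # noncharacters
    return True


def useful_assignable_ranges() -> list[tuple[int, int]]:
    """Build the inclusive ranges: the plane 0 pieces, then one per plane 1-16."""
    ranges = [(0x09, 0x0A), (0x0D, 0x0D), (0x20, 0x7E), (0xA0, 0xD7FF),
              (0xE000, 0xFDCF), (0xFDF0, 0xFFFD)]
    for plane in range(1, 17):
        ranges.append((plane << 16, (plane << 16) | 0xFFFD))
    return ranges


range_total = sum(hi - lo + 1 for lo, hi in useful_assignable_ranges())
predicate_total = sum(is_useful_assignable(cp) for cp in range(0x110000))
print(range_total, predicate_total)  # 1111936 1111936
```

The two totals agree: 1,114,112 code points minus 2,048 surrogates, 62 useless controls, and 66 noncharacters.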
5. IANA Considerations
This document makes no requests of IANA.
6. Security Considerations
Unicode Security Considerations [TR36] is a wide-ranging survey of
the issues implementors should consider while writing software to
process Unicode text. Many of the exploits it discusses are aimed at
deceiving human readers, but vulnerabilities involving issues such as
surrogates and noncharacters are also covered, and in fact can
contribute to human-deceiving exploits.
Note that the subsets specified in this document, in the order
presented, include successively fewer problematic code points
(surrogates, useless controls, and noncharacters), and thus should be
less and less susceptible to vulnerabilities. The Section 4.2
subset, "Useful Assignables", excludes all of them.
7. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017.
[TR36] The Unicode Consortium, "Unicode Security Considerations".
Note that this reference is to the latest version of this
document, rather than to a specific release. It is not expected
that future updates will affect the referenced
discussions.
[UNICODE] The Unicode Consortium, "The Unicode Standard".
Note that this reference is to the latest version of Unicode,
rather than to a specific release. It is not expected that future
changes in the Unicode Standard will affect the referenced
definitions.
[XML] Bray, T., Paoli, J., Sperberg-McQueen, C.M., Maler, E., and F.
Yergeau, "Extensible Markup Language (XML) 1.0 (Fifth
Edition)", 26 November 2008. Note that
this reference is to a specific release, based on a
history of previous "Edition" releases having changed this
production.
8. Informative References
[RFC2277] Alvestrand, H., "IETF Policy on Character Sets and
Languages", BCP 18, RFC 2277, DOI 10.17487/RFC2277,
January 1998.
[RFC5137] Klensin, J., "ASCII Escaping of Unicode Characters",
BCP 137, RFC 5137, DOI 10.17487/RFC5137, February 2008.
[RFC5234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", STD 68, RFC 5234,
DOI 10.17487/RFC5234, January 2008.
[RFC7493] Bray, T., Ed., "The I-JSON Message Format", RFC 7493,
DOI 10.17487/RFC7493, March 2015.
[RFC8259] Bray, T., Ed., "The JavaScript Object Notation (JSON) Data
Interchange Format", STD 90, RFC 8259,
DOI 10.17487/RFC8259, December 2017.
Acknowledgements
Thanks are due to Guillaume Fortin-Debigaré, who filed an Errata
Report against RFC 8259, "The JavaScript Object Notation (JSON) Data
Interchange Format", noting frequent references to "Unicode
characters" when in fact the RFC formally specifies the use of
Unicode code points.
Thanks are due to Asmus Freytag for careful review and many
constructive suggestions aimed at making the language more consistent
with the structure of the Unicode Standard.
Authors' Addresses
Tim Bray
Textuality Services
Email: tbray@textuality.com
Paul Hoffman
ICANN
Email: paul.hoffman@icann.org