Internet DRAFT - draft-freytag-troublesome-characters
draft-freytag-troublesome-characters
IETF A. Freytag
Internet-Draft ASMUS, Inc.
Intended status: Standards Track J. Klensin
Expires: December 31, 2018
A. Sullivan
Oracle Corp.
June 29, 2018
Those Troublesome Characters: A Registry of Unicode Code Points Needing
Special Consideration When Used in Network Identifiers
draft-freytag-troublesome-characters-02
Abstract
Unicode's design goal is to be the universal character set for all
applications. The goal entails the inclusion of very large numbers
of characters. It is also focused on written language in general;
special provisions have always been needed for identifiers. The
sheer size of the repertoire increases the possibility of accidental
or intentional use of characters that can cause confusion among
users, particularly where linguistic context is ambiguous,
unavailable, or impossible to determine. A registry of code points
that can be sometimes especially problematic may be useful to guide
system administrators in setting parameters for allowable code points
or combinations in an identifier system, and to aid applications in
creating security aids for users.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on December 31, 2018.
Freytag, et al. Expires December 31, 2018 [Page 1]
Internet-Draft Troublesome Characters June 2018
Copyright Notice
Copyright (c) 2018 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Unicode code points and identifiers . . . . . . . . . . . . . 3
2. Background and Conventions . . . . . . . . . . . . . . . . . 5
3. Techniques already in place . . . . . . . . . . . . . . . . . 5
4. A registry of code points requiring special attention . . . . 7
4.1. Description . . . . . . . . . . . . . . . . . . . . . . . 7
4.2. Maintenance . . . . . . . . . . . . . . . . . . . . . . . 10
4.3. Scope . . . . . . . . . . . . . . . . . . . . . . . . . . 10
5. Registry initial contents . . . . . . . . . . . . . . . . . . 11
5.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . 11
5.2. Interchangeable Code Points . . . . . . . . . . . . . . . 12
5.3. Excludable Code Points . . . . . . . . . . . . . . . . . 13
5.4. Combining Marks . . . . . . . . . . . . . . . . . . . . . 14
5.5. Mitigation . . . . . . . . . . . . . . . . . . . . . . . 15
5.5.1. Mitigation Strategies . . . . . . . . . . . . . . . . 16
5.5.2. Limits of Mitigation . . . . . . . . . . . . . . . . 18
5.6. Notes . . . . . . . . . . . . . . . . . . . . . . . . . . 19
6. Table of Code Points . . . . . . . . . . . . . . . . . . . . 19
6.1. References for Registry . . . . . . . . . . . . . . . . . 27
7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 28
8. Security Considerations . . . . . . . . . . . . . . . . . . . 29
9. References . . . . . . . . . . . . . . . . . . . . . . . . . 29
9.1. Normative References . . . . . . . . . . . . . . . . . . 29
9.2. Informative References . . . . . . . . . . . . . . . . . 30
Appendix A. Additional Background . . . . . . . . . . . . . . . 31
A.1. The Theory of Inclusion . . . . . . 31
A.2. The Difference Between Theory and Practice . . . . . . . 33
A.2.1. Confusability . . . . . . . . . . . . . . . . . . . . 33
Appendix B. Examples . . . . . . . . . . . . . . . . . . . . . . 34
Appendix C. Discussion Venue . . . . . . . . . . . . . . . . . . 37
Appendix D. Change History . . . . . . . . . . . . . . . . . . . 37
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 38
Freytag, et al. Expires December 31, 2018 [Page 2]
Internet-Draft Troublesome Characters June 2018
1. Unicode code points and identifiers
Unicode [Unicode] is a coded character set that aims to support every
writing system. Writing systems evolve over time and are sometimes
influenced by one another. As a result, Unicode encodes many
characters that, to a reader, appear to be the same thing; but that
are encoded differently from one another. This sort of difference is
usually not important in written texts, because competent readers and
writers of a language are able to compensate for the selection of the
"wrong" character when reading or writing. Finally, the goal of
supporting every writing system also implies that Unicode is designed
to properly represent written language; special provisions are needed
for identifiers.
Identifiers that are used in a network or, especially, an Internet
context present several special problems because of the above feature
of Unicode:
[[CREF1: AF: This whole business of language context seems
unconnected from the data we have in the registry: that data is about
code points and sequences that look the same, and many examples are
in the same language. For example the duplicated shapes for digit /
letter pairs. In very few cases would knowing the language context
make a difference. In some cases, if you knew the script (not for
the label, but the code point) you might be able to distinguish two
labels, but that is it. I think we should further rewrite this
summary so it matches better with the what the proposed registry
contains.]]
1. In many (perhaps most) uses of identifiers, they are neither
constrained to words in a particular language, nor would it be
possible to ascertain reliably the language context in which the
identifier is being or will be used. In the case of an
internationalized domain name, for instance, each label could in
principle represent a new locus of control, because there could
be a delegation there. A new locus of control means that the
administrator of the resulting zone could speak, read, or intend
a different language context than the one from the parent.
Moreover, at least some domains (such as the root) have an
Internet-wide context and therefore do not really have a language
context as such. In any case, the language context is simply not
available as part of a DNS lookup, so there is no way to make the
DNS sensitive to this sort of issue. Even in the case of email
local-parts, where a sender is likely to know at least one of the
languages of the receiver, the language context that was in use
at the time the identifier was created is often unknown.
Freytag, et al. Expires December 31, 2018 [Page 3]
Internet-Draft Troublesome Characters June 2018
2. Identifiers on the network are in general exact-match systems,
because an ambiguous identifier is problematic. Sometimes, but
not always, there are facilities for aliasing such that multiple
identifiers can be put together as a single identity; the DNS,
for example, does not have such an aliasing capability, because
in the DNS all aliases are one-way pointers. Aliasing techniques
are in any case just an extension of the exact-match approach,
and do not work the way a competent human reader does when
interpolating the "right" character upon seeing the "wrong" one.
3. Because there are many characters that may appear to be the same
(or even, that are defined in such a way that they are all but
guaranteed to be rendered by the same glyphs), it is fairly easy
to create an identifier either by accident or on purpose that is
likely to be confused with some other identifier even by
competent readers and writers of a language. In some cases
knowing the language context would be of no help to recognition,
for example, in cases where a language uses the same shape for a
letter as for one of the digits.
4. For some scripts their repertoire of shapes overlaps with one or
more other scripts, so that there are cases where two strings
look identical to each other, even though all the code points in
the first string are of one script, and all the code points in
the second string are of another script. In these cases, the
strings cannot be distinguished by a reader, and the whole
strings are confusable.
5. For some scripts, both users and rendering systems do not expect
to encounter code points in arbitrary sequence. Most code points
normally occur only in specific locations within a syllable. If
random labels were permitted, some would not display as expected
(including having some features misplaced or not displayed) while
others would present recognition problems to users experienced
with the script. Some devices may also not support arbitrary
input.
Beyond these issues, human perception is easily tricked, so that
entirely unrelated character sequences can become confusable -- for
example "rn" being confused with "m". Humans read strings, not
characters, and they will mostly see what they expect to see. Some
additional discussion of the background can be found in Appendix A.
The remainder of this document discusses techniques that can be used
to design the label generation rules for a particular zone so they
ameliorate or avoid entirely some of the issues caused by the
interaction between the Unicode Standard and identifiers. The
Freytag, et al. Expires December 31, 2018 [Page 4]
Internet-Draft Troublesome Characters June 2018
registry is intended to highlight code points that require such
techniques.
2. Background and Conventions
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
A reader needs to be familiar with Unicode [Unicode], IDNA2008
[RFC5890] [RFC5891] [RFC5892] [RFC5893] [RFC5894], PRECIS (at least
the framework, [RFC7564]), and conventions for discussion of
internationalization in the IETF (see [RFC6365]).
3. Techniques already in place
In the IDNA mechanism for including Unicode code points [RFC5892], a
code point is only included when it meets the needs of
internationalizing domain names as explained in the IDNA framework
[RFC5894]. For identifiers other than those specified by IDNA, the
PRECIS framework [RFC7564] generalizes the same basic technique. In
both cases, the overall approach is to assume that all characters are
excluded, and then to include characters according to properties
derived from the Unicode character properties. This general strategy
cuts the enormous size of the Unicode database somewhat, avoiding
including some characters that are necessarily unsuited for use as
identifiers.
The mechanism of inclusion by derived property, while helpful, is
insufficient to guarantee every included character is safe for use in
identifiers. Some characters' properties lead them to be included
even though they are not obviously good candidates. In other cases,
individual characters are good for inclusion, but are problematic in
combination. Finally, there are cases where characters (or sequences
of characters) are not problematic by themselves, or if used in a
mutually exclusive manner in the same identifier, but become
problematic when their choice represents the only difference between
otherwise identical identifiers. For some examples, see Appendix B.
Operators of systems that create identifiers (whether through a
registry or through a peer-to-peer identifier negotiation system)
need to make policies for characters they will permit. Operators of
registries, for instance, can help by adopting good registration
policies: "Users will benefit if registries only permit characters
from scripts that are well-understood by the registry or its
advisers."[RFC5894]
Freytag, et al. Expires December 31, 2018 [Page 5]
Internet-Draft Troublesome Characters June 2018
The difficulty for many operators, however, is that they do not have
the writing system expertise to claim any character is "well-
understood", and they do not really have the time to develop that
expertise. Such operators should in fact not use or register such
characters. Unfortunately, in many cases the operators are stewards
of systems where the user population demands identifiers useful to
them in their local languages. In other cases, operators may proceed
without a proper understanding owing to financial or market share
incentives. The risk for Internet identifiers in such cases is
obviously that ill-understood and potentially exploitable gaps in
registration policies will open.
To help mitigate such issues, this document proposes a registry of
Unicode code points that are known to present special issues for
network identifiers with the aim to guide protocol and operating
decisions about whether to permit a given code point or sequence of
code points. By necessity, any list or guidance can only reflect
issues that are known and understood at the time of writing. By
limiting itself largely to characters that are widely used to write
languages in contemporary use, the registry will address the more
critical needs, while simultanesously focusing on characters that are
well understood and for which there may already be some
implementation experience in IDNs.
By itself, such a registry will not completely protect against poor
registration or use, but it may provide operational guidance
necessary for people who are responsible for creating policies. It
also obviates the need for everyone to repeat basic investigation
into the behavior of Unicode characters. Instead, scarce expertise
can be focused on ways to mitigate issues, perhaps caused by user
requirements for a specific character.
Note that the registry defined herein does not address any of the
issues created by whole-string confusables where each of the
identifiers is of a different script. A common workaround, limiting
a registry to identifiers of only a single script, would mitigate
this issue. [[CREF2: AF: we should evaluate that; cross-script
variants that are homoglyphs have now been collected across modern
scripts as part of the root zone LGR and are easily captured in a
registry.]]
For some of the code points (or code point sequences) listed as
presenting issues for identifiers, it may be most expeditious to
simply not include them, even though they are valid according to the
protocol. Sometimes, one of a pair of identical code points (or code
point sequences) may be deemed preferable over the other for
practical reasons.
Freytag, et al. Expires December 31, 2018 [Page 6]
Internet-Draft Troublesome Characters June 2018
However, simply leaving out any code point listed in this registry
would render a registry of doubtful value for many scripts. It is
not always necessary or desirable to exclude characters. Sometimes,
it is merely necessary to ensure that for two otherwise identical
identifiers, only one of a set of mutually exclusive code points (or
sequences of code points) is used, while preventing the later
registration of the label containing the other one in order to avoid
ambiguity. This way the operator does not need to impose a choice.
In cases where two or more variants of such an identifier mean the
same thing to the native reader, an operator may decide to allow all
of the variant labels to be registered simultaneously, but only to
the same entity (and with proper safeguards that limit the
multiplicity of such allocatable variant labels).
The implementation of this strategy would be via the variant
mechanism described in [RFC7940] and [RFC8228] which allows
mechanical processing of mutual exclusion and /or bundling of
identifiers respectively.
This specification defines a registry of code points and sequences
that have been identified as requiring special attention when they
are to be used in identifiers. An administrator who does not have
the time or inclination to develop the requisite policies might
contemplate simply not to permit these code points at all.
However, for some scripts the remaining subset might not be usable in
a meaningful way. Identifiers in these scripts cannot be safely
implemented without understanding the issues involved. Further note
that many code points listed here are problematic only in their
relationship to other code points and that as long as these issues
are adequately addressed, for example using the variant mechanism,
they do not need to be excluded. [[CREF3: AF: the above needs more
editing, it's a bit repetitive.]]
4. A registry of code points requiring special attention
4.1. Description
The registry contains four fields. [[CREF4: AF If we are limited to
the "texttable" format, we are limited to three columns, there's no
way we can fit more than that into the RFC plain text format and
remain legible. If we want more columns, then we need to use some
other data format, including PDF ( which would allow us to show the
images for the code points).]]
1. The first field, called "Code Point(s)", is a code point or
sequence of code points. Sequences in this and other fields are
Freytag, et al. Expires December 31, 2018 [Page 7]
Internet-Draft Troublesome Characters June 2018
listed as space separated code point values. For completeness,
full code point sequences are listed, even if some of their
constituents are "Not recommended". A code point value is a
series of 4-6 uppercase hexadecimal digits, as defined in
[Unicode].
2. The second field, "Related CP", contains zero or more cross
references to related code points or sequences. Cross references
consist of single code points or sequences. Multiple cross
references are separated by a comma.
3. The third field, called "References", contains one or more
references to documents describing the code point and the reason
why it presents an issue. References are cited by numeric
values, each in square brackets; multiple references are
separated by space.
4. The last field, "Comment", is a free form text field that briefly
describes the issue; it also The comment field starts with a
category, separated by a colon, to allow quick identification of
similar cases
The following are the defined category values:
Not Recommended While the code point (or sequence) is not
DISALLOWED, there is emerging consensus in the community that it
is not recommended for identifiers, or it is considered as such in
the Unicode Standard. This includes, but is not limited to code
points that are formally deprecated in the Unicode standard, as
well as code points or sequences listed in the standard as "Do not
use" or not preferred or similar. Code points not in active use,
obsolete code points, or those intended for specialist use may
also be listed under this category. Details are given in the
explanation and references.
Identical The code point (or sequence) is normally identical in
appearance to another code point (or sequence); or may be
identical in some contexts. If the related CP is listed as
"PREFERRED", it is recommended that this code point (or sequence)
be excluded; in the case of a sequence, it may be appropriate to
exclude, the constituent combining marks (after first consulting
the details given in the listing for the marks). Otherwise, it is
recommended to make the two identical code points or sequences
mutually exclusive by treating them as variants. Details are
given in the explanation and references.
Restricted Context The code point is problematic in relation to some
other code points in the same label. For example, it should be
Freytag, et al. Expires December 31, 2018 [Page 8]
Internet-Draft Troublesome Characters June 2018
used only after some code points or not adjacent to certain other
code points. Further details are given in the explanation and
references. This is a common case for certain combining marks or
other code points in so-called "complex" scripts. These scripts
generally require a coordinated set of context rules; in those
cases the registry would not list any specific context rules, but
to point to documentation of existing Label Generation Rulesets
implementing a coherent set of rules as examples. Code points
with IDNA2008 property of CONTEXTJ or CONTEXTO are not listed, as
long as the given context rules mitigate any concerns.
Preferred The code point is preferred to some other code point given
in the cross reference (with the other code point normally
"IDENTICAL" or "NOT RECOMMENDED"). In some cases this represents
a preference for a code point (or sequence) that is a basic
constituent in some alphabet over a code point (or sequence) that
is rare or has specialized use. In some cases the preference may
be formally specified or otherwise represent established community
consensus. Details are given in the explanation and references.
Other All cases that do not fit one of the other categories.
Details are given in the explanation and references.
If a character appears in the registry, that does not automatically
mean that it is a bad candidate for use in identifiers generally.
Absent a well-defined and verifiable policy, however, such a code
point or sequence might well be treated with suspicion by users and
by tools.
For code points tagged as being "identical" to or "indistinguishable"
from other code points, it may be that one is preferred over the
other, but it may also be that implementing a scheme for mutual
exclusion of any resulting identical labels is the best solution,
such as assigning them "blocked" variants according to [RFC7940] and
[RFC8228].
Where characters are confusable with a combining sequence, only the
combining sequence is listed; suggested mitigation may consist of
disallowing either the specific combining sequence or disallowing the
combining marks involved. It is usually inappropriate to exclude any
of the basic letters involved, as they are generally members of the
standard alphabet for one or more languages.
The registry and this document are to be understood as guidance for
the purpose of developing operational policies that are used for
protocols under normal administrative scope. For instance, zone
operators that support IDNA are expected to create policies governing
the code points that they will permit (see [RFC5894] and
Freytag, et al. Expires December 31, 2018 [Page 9]
Internet-Draft Troublesome Characters June 2018
[I-D.rfc5891bis]). The registry herein defined is intended to
highlight particularly troublesome code points or code point
sequences for the benefit of administrators creating such policies.
It is also intended to highlight characters that may create
identifier ambiguities and thereby create security vulnerabilities.
However, by itself it is no substitute for such policies.
The registry is by necessity limited to code points for which
adequate information is available; by and large this means code
points used in connection with modern languages or writing systems,
except that specialized extensions to modern scripts may be
indicated, if their use would fall into any of the categories
defined. Historic scripts, and any modern scripts not represented in
the registry can be assumed to not be well-understood; operators are
cautioned to locate other sources of information and to develop the
necessary policies before deploying such scripts.
4.2. Maintenance
The registry is updated by Expert Review using an open process. From
time to time, additional code points may be added to the Unicode
standard, or further information may be discovered related to code
points, to existing code points or those already listed here. The
Unicode Standard may recommend against using a code point for all or
some purposes. Or a script community may have gained more experience
in deploying IDNs for that script and may create or update
recommendations as to best policy.
4.3. Scope
Code points that are DISALLOWED in IDNA 2008 are not eligible to be
listed. Code points that are CONTEXTJ or CONTEXTO are not included
here unless there are documented concerns that are not mitigated by
the existing IDNA context rules. The focus is on scripts that are
significant for identifiers; code points from scripts that are
historic or otherwise of limited use have generally not been
considered - however exceptions may exist where authoritative
information is readily available. Code points and code point
sequences included are those that need special policies (including,
but not limited to policies of exclusion).
New code points of sequences are listed whenever information becomes
available that identifies a specific issue that requires attention in
crafting a policy for the use of that code point or sequence in
network identifiers. Likewise cross references, categories,
explanations and references cited may be updated.
Freytag, et al. Expires December 31, 2018 [Page 10]
Internet-Draft Troublesome Characters June 2018
The contents of the registry generally does not represent original
research but a collection of issues documented elsewhere, with
appropriate references cited. An exception might be cases that are
in clear analogy to existing entries, but not explicitly covered by
existing references, for example, because the code point in question
was recently added to Unicode.
If a particular language or script community reaches an apparent
consensus that some code point is problematic, or that of two
identical code points or sequences one should be preferred over the
other, such recommendations, if known, should be documented in this
registry.
In addition, if the Unicode Standard designates a code point as
formally "deprecated" or less formally as "do not use", or identifies
code points that are "intentionally identical", this is also
something that should be reflected in the registry. Another source
of potential information might be existing registry policies or
recommended policies, particularly where it is apparent that they
represent a careful analysis of the issue or a wider consensus, or
both.
Proposed additions to the registry are to be shared on a mailing list
to allow for broader comment and vetting.
If there is a disagreement about the existence of an issue or its
severity, it is preferable to document both the issue and the
different evaluations of it. In all cases, the information and
documentation presented must allow a user to fully evaluate the
status of any entry in the registry.
There is no requirement for the registry to form a stable body of
data to which any future document would have to be backward
compatible in any way. If new information emerges, additional code
points may be considered problematic, or they may need to be
reclassified. In case of significant changes, the explanation should
note the nature of the change and cite a reference to document the
basis for it.
5. Registry initial contents
5.1. Overview
IDNA 2008 uses an inclusion process based on Unicode properties to
define which code points are PVALID, but also recognizes that some
code points require a context rule (CONTEXTJ, CONTEXTO).
Freytag, et al. Expires December 31, 2018 [Page 11]
Internet-Draft Troublesome Characters June 2018
A number of code points which are PVALID in [RFC5892] may require
additional attention in the design of label generations rules. In
some cases, the issue is not necessarily with an individual code
point, but with a code point sequence. In the following, "code
point" and "code point sequence" are used synonymously unless
explicitly called out. The fact that a code point require such
attention does not affect its status under IDNA 2008.
The following describes a number of conditions that pose problems for
network identifiers and common strategies for mitigating them.
5.2. Interchangeable Code Points
At times two code points or code point sequences are considered by
all users (or a significant fraction) as equivalent to a degree that
they accept one of them as substitute for another. This has obvious
implications for the unambiguous recognition of identifiers. This
document lists the code points and sequences affected (except for
certain generic classes too numerous to list here). Note that one of
the two may be preferred over the other, in which case the non-
preferred one may be excluded or folded away. But in many cases
either one is equally preferred. Mitigation techniques for such
cases are discussed below.
Homoglyphs Homoglyphs are code points that have identical
appearance, or are so close in appearance that they are
indistinguishable if not presented side-by-side. Whenever two
labels differ only by code points that are homoglyphs of each
other and occur in the same position, users cannot distinguish the
labels from each other or tell which label is intended, even
though the underlying code points are different. Users will
substitute one label for another.
Code points that are merely similar in appearance, including
strongly similar code points, or code points that are difficult to
distinguish (such as certain diacritical marks) are not considered
here; handling such similarities often requires case by case
judgment.
Instead, this document considers these types of code points that
can be fully substituted for one another:
1. code points that, by design or derivation, are identical to
each other;
2. code points that assume the same shape in some context, e.g.
at the end of a label;
Freytag, et al. Expires December 31, 2018 [Page 12]
Internet-Draft Troublesome Characters June 2018
3. code points of a striking similarity based on derivation or
common origin;
4. and code points that are otherwise indistinguishable from one
another unless placed side by side.
Cross-script Homoglyphs A number of code points are homoglyphs of
code points in another script (cross-script homoglyphs). Cross-
script homoglyphs are a concern for any zone that supports labels
from more than one script, even if each label is required to be in
a single script. Note that some writing systems ordinarily use a
combination of scripts (such as the use of Han, Hiragana and
Katakana for Japanese). For many writing systems, an admixture of
Latin letters is not uncommon, for example in brand or product
names. If not handled carefully, this can prove problematic for
identifiers.
Homophones As discussed in [202], the Amharic language treats many
code points from the Ethiopic script as sound-alikes (homophones).
In writing, these are freely substituted, users do not recognize
some spelling as more correct. A conservative approach would
treat these as mutually exclusive; the alternative, to make all
variants available to the same applicant is appears not feasible
due to the high number of such variants per label.
Semantic Variants The Chinese writing system, shared among several
geographically distributed user communities, has many instances of
code points that represent the same semantic. Even though they
are visually distinct, they can be substituted for one another;
typically these correspond to the simplified and traditional forms
of Chinese characters. See [RFC4713] for details.
5.3. Excludable Code Points
Code points that are not substitutable but troublesome for other
reasons are candidates for exclusion from a zone's repertoire. For
each such code point, the comment field briefly describes why it
should be excluded or considered troublesome. There is no identified
mitigation strategy that can be recommended for general usage: unless
careful study indicates that a code point with this status is
exceptionally acceptable for a particular zone, after all, it should
normally be excluded from the repertoire. These reasons are varied.
Deprecated Code Points Deprecated code points are those that
[Unicode] recommends not to use for any purpose. They should be
excluded from identifiers; there is no mitigation. In addition,
Unicode recommends against the use of some sequences and code
Freytag, et al. Expires December 31, 2018 [Page 13]
Internet-Draft Troublesome Characters June 2018
points for any purpose, but without formal deprecation. These
should likewise be excluded from identifiers.
Non-preferred or other Troublesome Code Point This category includes
all code points that are troublesome for other reasons; they
include code points that represent non-preferred variations; or
code points that not meant to be used in a combining sequence for
letter; or code points that may be indistinguishable from a
punctuation mark or other DISALLOWED code point. For each such
code point, the comment field briefly describes why it should be
excluded or considered troublesome.
Obsolete or not in Active Use Many code points across scripts that
are otherwise in modern use represent additions for use in
obsolete orthographies and writing systems, that is for writing
languages that are extinct or not longer written in that script.
Some have been researched and no evidence of active use could be
found. These code points are not recommended for use in
identifiers and should be excluded. Except for specialists, users
are unlikely to recognize them, or find them of use in
constructing mnemonic strings for identifiers. In addition, they
often have not been sufficiently analyzed as to whether they
represent other issues for identifiers. That makes their use
risky. Obsolete, rare and code points otherwise not in active are
generally not listed here. The reader can find a list of code
points with high probability of being in active use in [MSR].
5.4. Combining Marks
Non Normalizable Sequences Certain combining marks are part of non-
normalizable sequences. Normally, when a combining sequence is an
alternate encoding to a composite code point, normalization can be
used to select a preferred representation. For IDNA 2008, which
uses NFC to normalize, this means the composite code point.
However, some combining marks are not considered identical to the
same mark when graphically part of a composite character.
Sequences with these marks may look more or less like some
composite code point, but they are considered different, and
therefore not normalized. For identifiers, the best
recommendation is to exclude those combining marks.
Combining marks that are also part of precomposed letters
Many combining marks are part of canonical decompositions. For
identifiers that are normalized to the composed forms using NFC
(as required by IDNA 2008), these combining marks usually are not
needed on their own, that is as separate element of a combining
sequence after normalization. (The vast majority of letters using
these marks have been encoded as precomposed characters). It is
Freytag, et al. Expires December 31, 2018 [Page 14]
Internet-Draft Troublesome Characters June 2018
strongly recommended to exclude these combining marks on their
own, but, as needed for a specific language, to enumerate the
needed sequences. (One notable example is Vietnamese which, after
normalization to NFC uses a mixture of precomposed code points and
combining marks). [TBD]The most common generic combining marks
affected have been entered in the registry as excluded.
Non-spacing combining marks These marks are typically accents,
diacritics and the like. They pose an additional problem: if they
are allowed to occur twice in a row, some rendering systems will
"overprint" them, in effect making them indistinguishable from
single marks. This problem can be avoided by allowing only
enumerated sequences, or alternatively by a context rule.
Ambiguous Rendering There are other ways in which certain code
points and sequences representing particular combinations of code
points may suffer from unreliable rendering, because rendering
engines normally do not expect to encounter them. While Unicode
allows the use of combining marks, in principle, in combination
with any base character, in practice this can lead to
unrecognizable labels, or labels that are not reliably distinct.
This situation mostly affects the so-called complex scripts.
Combining marks in complex scripts In some scripts, there are no
precomposed sequences. Usually, these scripts are "complex"
scripts, that require context rules for many classes of code
points. For these scripts, context rules (see [RFC7940]) should
be used to limit non-spacing marks to acceptable contexts. For an
example of such rules see [204], [206].
Soft Dotted and Dotless Letters Unicode code points with the
Soft_Dotted property encode letter that lose their dot if followed
by a diacritical mark above. (See [UCD]) If the following mark is
a COMBINING DOT ABOVE, the combination is indistinguishable from
the letter by itself. This can be mitigated by limiting or
excluding the code point for DOT ABOVE. A soft dotted code point
followed by any other diacritical mark above will look identical
to the corresponding dotless letter with diacritical mark above.
All combinations of dotless letters followed by diacritical marks
should be excluded. (This can be done with a context rule, see
[RFC7940]).
5.5. Mitigation
Thiere are several techniques that can be used to help to mitigate
confusion. The focus in the following is on issues addressable by
protocol or registry policy. However, user agents might implement
Freytag, et al. Expires December 31, 2018 [Page 15]
Internet-Draft Troublesome Characters June 2018
additional mitigation approaches, such as always using a font
designed to distinguish among different characters.
5.5.1. Mitigation Strategies
Exclusion The primary mitigation technique is to reduce the problem
space: operators should only ever use the smallest repertoire of
code points possible for their environment. So, for example, if
there is a code point that is sometimes used but is perhaps a
little obscure, it is better to leave it out. Users are unlikely
to be familiar with many code points added to Unicode for the
representation of historical forms of writing a script, or for
highly specialized purposes. That unfamiliarity may present
challenges to correct identification or keyboard entry, making the
code point less usable. In addition, their use may present other
problems not appreciated by anyone not familiar with them.
For these reasons, code points used only in a language with which
the administrator is not familiar should probably be excluded.
The same applies to code points used in specialized contexts, such
as those only found in historic or sacred documents, or only used
for phonetic transcription or poetry.
By reducing the repertoire to a well-understood essential subset
it is often possible to eliminate some possible instances of
confusion. For example, in the Arabic script, combining marks are
generally used for optional or specialized aspects of the writing
system. At the same time, many combining sequences are confusable
with basic letters of the script. Because of this, excluding all
Arabic combining mark would greatly reduce confusability without
significantly affecting usability of the script for identifiers.
Preferred code points Sometimes, each of these code points will be
used by a different user community; or one of the code points is
not in wide use, for example because it is intended for special
purposes like phonetic annotation or transliteration. In such
cases, the one not needed for a given zone could be excluded.
In other cases, zones may be shared by a wider community, making
it unattractive or impossible to institute a preference. A common
method of mitigating issues from such homoglyphs is to make two
labels that differ only by using a different homoglyph mutually
exclusive. This can be done by making the homoglyphs code point
variants, usually of type "blocked". See [RFC8228].
In some cases, while two code points may be homoglyphs, one of
them can be identified as the preferred alternative to encode the
intended character. In these cases, one of the code points has
Freytag, et al. Expires December 31, 2018 [Page 16]
Internet-Draft Troublesome Characters June 2018
been identified as "preferred", while the other has been
identified as "troublesome"; or "excluded". In all other cases,
no such preference exists in the general usage; a conservative
mitigation might be to define the alternatives as blocked
variants. However, the users of a given zone might have a
specific preference, in which case one of the alternatives could
be excluded instead.
For convenience in presentation, this document presents pairs or
sets of homoglyphs as mutually exclusive variants of type
"homoglyph". Other ways of handling these code points are
possible. While one might implement such a variant relation in
many cases as one label blocking another, in some cases allowing
both to be registered to the same applicant may be appropriate.
Finally, in some case eliminating one or both code points from the
repertoire may be a feasible alternative to establishing a variant
relation.
Script limitation For homoglyphs, a large number of cases (but not
all of them) turn out to be in different scripts. As a result, it
is usually a good idea to adopt the operational convention that
identifiers for a protocol should always be in a single script.
This mitigation strategy has limits. First, even if any given
identifier is only in a single script, it may co-exist with
identifiers from other scripts. Sometimes the repertoire used in
operation allows multiple scripts that create whole string
confusables -- strings made up entirely of homoglyphs of another
string in a different script (such as can be found between
Cyrillic and Latin, for example). In such cases, mitigation must
turn to other means of preventing the registration of mutually
confusable string, for example by In that case, a robust mechanism
for mutual exclusion of confusable identifiers must exist,
ensuring that the registration of one of them (whichever comes
first) blocks the later registration of the other.
Second, some writing systems use a combination of scripts and for
commercial names in many scripts, admixture of Latin letters is
common. Allowing limited script mixing may be an essential
requirement in some cases.
Lastly, identifiers are not always under the operational control
of a single authority (such as in the case of DNS, where the
system is under distributed control so that different parts of the
hierarchy can have different operational rules).
Freytag, et al. Expires December 31, 2018 [Page 17]
Internet-Draft Troublesome Characters June 2018
In the case of IDNA, some client programs restrict display of
U-labels to top-level domains known to have policies about single-
script labels.
Exact homoglyphs No policy or convention, other than ensuring mutual
exclusion, will do anything to help mititgate confusion for strict
homoglyphs of each other in the same script (see Appendix B for
some example cases.)
Beyond the issue of mutual confusability, some combining sequences
in particular can give rise to other difficulties in recognition -
usually because client systems will not reliably and correctly
display them. One particular case concerns sequences of more than
one instance of the same non-spacing combining mark such as the
repetition of an accent or diacritic. These are often rendered
indistinguishably from single instances of the same mark.
Operators should prohibit such repetition, particularly, as there
are no known cases where they would be required in ordinary
writing. Note that this prohibition would also apply to a non-
spacing mark following a pre-composed code point containing the
same diacritic. A more general mitigation technique would be to
limit nonspacing marks to known combinations which can be
enumerated. Where that is not possible for some scripts, some
other context restrictions can usually be applied.
There are some writing systems where characters do not normally
occur in arbitrary locations in the context of each syllable.
Neither users nor rendering systems for such scripts are adept at
handling arbitrary sequences of such characters. While some
latitude beyond strict spelling rules may be accommodated,
policies that enforce a minimal set of structural rules are
required to ensure that users can identify the identifier and
systems can render them predictably.
5.5.2. Limits of Mitigation
As noted in Section 1, it is not possible to solve all the problems
with identifier systems, particularly when human factors are taken
into account. In addition, each of the mitigation approaches has its
own limits of the type of problems that can be addressed, whether it
is by exclusion of specific code points; requiring or prohibiting
contexts for certain code points; restriction to a single script per
label; or mutual exclusion of labels differing only by code points
identical or otherwise confusably equivalent to other code points.
Additional policies may be needed to prevent registration of labels
that are problematic or confusable for other reasons.
Freytag, et al. Expires December 31, 2018 [Page 18]
Internet-Draft Troublesome Characters June 2018
There are a number of issues in implementing and presenting
identifiers to the user which are not specific to individually
identifiable code points (or sequences). For example, fonts can vary
widely in whether they make or do not make a distinction in
appearance of characters; relying on the native reader to get the
intended meaning from context. It is up to user agents to make sure
to select fonts that render each code point as distinct as possible.
When new code points are assigned in Unicode, systems, keyboards,
fonts and rendering engines may all be updated unevenly, with
considerable delays. During a possibly lengthy transition period,
this will lead to inconsistent user experience or inability to
distinguish certain labels. Even if unsupported labels are presented
as A-labels, users may not reliably identify them, because they
appear as essentially random sequences of letters and digits.
5.6. Notes
In the explanation the character names have been abbreviated. The
following list shows sample entries for the proposed registry. It is
non-normative, and only included for illustrative purposes. Also see
the examples below (Appendix B).
6. Table of Code Points
------------------------------------------------------------------
Code Point: 01C0
Related CP:
References: [120] [155]
Comment: Not Recommended: Indistinguishable from a
punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 01C1
Related CP:
References: [120] [155]
Comment: Not Recommended: Indistinguishable from a
punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 01C2
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from a
punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 01C3
Related CP:
References: [120] [150]
Comment: Not Recommended: Indistinguishable from a
Freytag, et al. Expires December 31, 2018 [Page 19]
Internet-Draft Troublesome Characters June 2018
punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 01DD
Related CP: 0259
References: [150]
Comment: Identical: Identical in appearance to U+0259
------------------------------------------------------------------
Code Point: 0259
Related CP: 01DD
References: [150]
Comment: Identical: Identical in appearance to U+01DD
------------------------------------------------------------------
Code Point: 0131
Related CP:
References: [100]
Comment: Restricted Context: If followed by any combining
mark above, renders the same way as U+0069 in any
good font. Should be restricted to where it is not
followed by a combining mark above
------------------------------------------------------------------
Code Point: 0237
Related CP:
References: [115]
Comment: Not Recommended: If followed by any combining mark
above, renders the same way as U+006A in any good
font. As its use is limited, it is best excluded.
------------------------------------------------------------------
Code Point: 025F
Related CP:
References: [115]
Comment: Not Recommended: If followed by any combining mark
above, renders the same way as U+0249 in any good
font. As its use is limited, it is best excluded.
------------------------------------------------------------------
Code Point: 02A3
Related CP: 0064 007A
References: [115]
Comment: Not Recommended: Looks like small LETTER D plus
LETTER Z, except for slight kerning; in limited
use.
------------------------------------------------------------------
Code Point: 02A6
Related CP: 0074 0073
References: [115]
Comment: Not Recommended: Looks like small LETTER T plus
LETTER S, except for slight kerning; in limited
use.
------------------------------------------------------------------
Freytag, et al. Expires December 31, 2018 [Page 20]
Internet-Draft Troublesome Characters June 2018
Code Point: 02A7
Related CP: 0074 0283
References: [115]
Comment: Not Recommended: Looks like small LETTER T plus
LETTER ESH, except for slight kerning; in limited
use.
------------------------------------------------------------------
Code Point: 02AA
Related CP: 006C 0073
References: [115]
Comment: Not Recommended: Looks like small LETTER L plus
LETTER S, except for slight kerning; in limited
use.
------------------------------------------------------------------
Code Point: 02AB
Related CP: 006C 007A
References: [115]
Comment: Not Recommended: Looks like small LETTER L plus
LETTER Z, except for slight kerning; in limited
use.
------------------------------------------------------------------
Code Point: 02B9
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from a
punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02BA
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from a
punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02BB
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from a
punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02BC
Related CP:
References: [6912]
Comment: Not Recommended: Indistinguishable from a
punctuation character (U+2019), which is not
PVALID
------------------------------------------------------------------
Code Point: 02BD
Related CP:
Freytag, et al. Expires December 31, 2018 [Page 21]
Internet-Draft Troublesome Characters June 2018
References: [120]
Comment: Not Recommended: Indistinguishable from
punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02BE
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from
punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02BF
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from
punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02C0
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from
punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02C1
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from
punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02C6
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from
punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02C7
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from
punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02C8
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from
punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02C9
Related CP:
Freytag, et al. Expires December 31, 2018 [Page 22]
Internet-Draft Troublesome Characters June 2018
References: [120]
Comment: Not Recommended: Indistinguishable from
punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02CA
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from
punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02CB
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from
punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 0300
Related CP:
References: [100]
Comment: Not Recommended: Not recommended other than as
part of enumerated sequences
------------------------------------------------------------------
Code Point: 0301
Related CP:
References: [100]
Comment: Not Recommended: Not recommended other than as
part of enumerated sequences
------------------------------------------------------------------
Code Point: 0302
Related CP:
References: [100]
Comment: Not Recommended: Not recommended other than as
part of enumerated sequences
------------------------------------------------------------------
Code Point: 0303
Related CP:
References: [100]
Comment: Not Recommended: Not recommended other than as
part of enumerated sequences
------------------------------------------------------------------
Code Point: 0304
Related CP:
References: [100]
Comment: Not Recommended: Not recommended other than as
part of enumerated sequences
------------------------------------------------------------------
Code Point: 0306
Related CP:
Freytag, et al. Expires December 31, 2018 [Page 23]
Internet-Draft Troublesome Characters June 2018
References: [100]
Comment: Not Recommended: Not recommended other than as
part of enumerated sequences
------------------------------------------------------------------
Code Point: 0307
Related CP:
References: [115]
Comment: Restricted Context: By definition, LATIN SMALL
LETTER I plus combining DOT ABOVE renders exactly
the same as LATIN SMALL LETTER I by itself and
does so in practice for any good font. The same is
true for all Unicode characters with the
soft_dotted property; they lose their dot if
followed by a combining mark. DOT ABOVE should be
excluded, or restricted to contexts where it does
not follow a soft_dotted letter.
------------------------------------------------------------------
Code Point: 0308
Related CP:
References: [100]
Comment: Not Recommended: Not recommended other than as
part of enumerated sequences
------------------------------------------------------------------
Code Point: 0624
Related CP: 0648
References: [201]
Comment: Identical: Identical in appearance in some
positional form and/or not reliably distinguished
because of small size of distinguishing features
------------------------------------------------------------------
Code Point: 0625
Related CP: 0622, 0623, 0627, 0672
References: [201]
Comment: Identical: Identical in appearance in some
positional form and/or not reliably distinguished
because of small size of distinguishing features
------------------------------------------------------------------
Code Point: 0626
Related CP: 0649, 064A, 067B, 06CC, 06CD, 06D0, 06D2
References: [201]
Comment: Identical: Identical in appearance in some
positional form and/or not reliably distinguished
because of small size of distinguishing features
------------------------------------------------------------------
Code Point: 0627
Related CP: 0622, 0623, 0625, 0672
References: [201]
Comment: Identical: Identical in appearance in some
Freytag, et al. Expires December 31, 2018 [Page 24]
Internet-Draft Troublesome Characters June 2018
positional form and/or not reliably distinguished
because of small size of distinguishing features
------------------------------------------------------------------
Code Point: 064B
Related CP:
References: [5564]
Comment: Not Recommended: Not to be used in zone files for
the Arabic language, per RFC 5564
------------------------------------------------------------------
Code Point: 064C
Related CP:
References: [5564]
Comment: Not Recommended: Not to be used in zone files for
the Arabic language, per RFC 5564
------------------------------------------------------------------
Code Point: 065C
Related CP:
References: [300]
Comment: Not Recommended: Part of homoglyph sequence(s)
not covered by normalization.
------------------------------------------------------------------
Code Point: 0660
Related CP: 06F0
References: [110]
Comment: Identical: Identical in appearance and meaning to
EXTENDED ARABIC-INDIC DIGIT ZERO
------------------------------------------------------------------
Code Point: 0661
Related CP: 06F1
References: [110]
Comment: Identical: Identical in appearance and meaning to
EXTENDED ARABIC-INDIC DIGIT ONE
------------------------------------------------------------------
Code Point: 077F
Related CP:
References: [115]
Comment: Not Recommended: Obsolote (archaic)
------------------------------------------------------------------
Code Point: 08AA
Related CP:
References: [201]
Comment: Not Recommended: No evidence of active use found;
not recommended
------------------------------------------------------------------
Code Point: 0A72 0A3F
Related CP: 0A07
References: [401]
Comment: Not Recommended: Do not use for U+0A07
Freytag, et al. Expires December 31, 2018 [Page 25]
Internet-Draft Troublesome Characters June 2018
------------------------------------------------------------------
Code Point: 0A72 0A40
Related CP: 0A08
References: [401]
Comment: Not Recommended: Do not use for U+0A08
------------------------------------------------------------------
Code Point: 0E3A
Related CP:
References: [206]
Comment: Other issue: Renders unreliably, or not at all, if
adjacent to any Thai vowel below. This may be
prevented by a context rule
------------------------------------------------------------------
Code Point: 0E41
Related CP:
References: [206]
Comment: Restricted Context: Digraph of U+0E40 SARA E
U+0E40 SARA E. Normally handled by disallowing the
sequence via a context rule
------------------------------------------------------------------
Code Point: 0E45
Related CP:
References: [206]
Comment: Restricted Context: Only occurs after two special
Thai vowels,U+0E24 RU and U+0E26 LU. Is also
potentially confused with U+0E32 SARA I. Both
issues can be addressed by defining a context
rule. Alternatively the context may be spelled out
by enumerating the two sequences and excluding
U+0E45 if occurring by itself.
------------------------------------------------------------------
Code Point: 0E4E
Related CP:
References: [206]
Comment: Not Recommended: Rarely used in modern Thai; it is
more commonly replaced with U+0E3A (PHINTHU).
Excluding it avoids issues with confusing it with
another diacritic U+0E4C (THANTHAKHAT). Both are
rendered atop a syllable and hard to distinguish
at small sizes.
------------------------------------------------------------------
Code Point: 12A5
Related CP: 12D5
References: [100] [202]
Comment: Interchangeable: U+12A5 and U+12D5 are used
interchangeably in Amharic
------------------------------------------------------------------
Code Point: 12A6
Freytag, et al. Expires December 31, 2018 [Page 26]
Internet-Draft Troublesome Characters June 2018
Related CP: 12D6
References: [100] [202]
Comment: Interchangeable: U+12A6 and U+12D6 are used
interchangeably in Amharic
------------------------------------------------------------------
Code Point: 17D2 178A
Related CP: 17D2 178F
References: [204]
Comment: Identical: When preceded by U+17D2, U+178A and
U+178F are indistinguishable
------------------------------------------------------------------
Code Point: 17D2 178F
Related CP: 17D2 178A
References: [204]
Comment: Identical: When preceded by U+17D2, U+178A and
U+178F are indistinguishable
------------------------------------------------------------------
6.1. References for Registry
[99] The Unicode Consortium, "The Unicode Standard", (latest
version) http:www.unicode.org/versions/latest (Multiple, or latest
version)
[100] Integration Panel, "Maximal Starting Repertoire (MSR-2)",
April 2015, https://www.icann.org/en/system/files/files/msr-2-
overview-14apr15-en.pdf (Code points included in MSR-2 as
potentially appropriate for the root zone)
[115] Integration Panel, "Maximal Starting Repertoire (MSR-2)",
April 2015, https://www.icann.org/en/system/files/files/msr-2-
overview-14apr15-en.pdf (Code points excluded from MSR-2 as
inappropriate for the root zone)
[120] Integration Panel, "Maximal Starting Repertoire (MSR-2)",
April 2015, https://www.icann.org/en/system/files/files/msr-2-
overview-14apr15-en.pdf (Code points considered problematic by
MSR-2)
[150] The Unicode Consortium, "Intentional.txt", Version 10.0.0,
http://www.unicode.org/Public/security/10.0.0/intentional.txt
(Code points considered identical by intention)
[155] "Proposal to Update Identical.txt", L2 17/301 (and revisions)
http://www.unicode.org/L2/L2017/17301-update-intentional.pdf (Code
points considered identical by intention)
Freytag, et al. Expires December 31, 2018 [Page 27]
Internet-Draft Troublesome Characters June 2018
[201] TF-AIDN, "Proposal for Arabic Script Root Zone LGR", 18
November 2015 https://www.icann.org/en/system/files/files/arabic-
lgr-proposal-18nov15-en.pdf (In-script variants and code points
excluded)
[202] Ethiopic Generation Panel, "Proposal for Ethiopic Script Root
Zone LGR", May 17, 2017,
https://www.icann.org/en/system/files/files/proposal-ethiopic-lgr-
17may17-en.pdf ()
[204] Khmer Generation Panel, "Proposal for Khmer Script Root Zone
Label Generation Rules (LGR)", August 15, 2016,
https://www.icann.org/en/system/files/files/proposal-khmer-lgr-
15aug16-en.pdf ()
[206] Thai Generation Panel, "Proposal for the Thai Script Root Zone
LGR", May 25, 2017 https://www.icann.org/en/system/files/files/
proposal-thai-lgr-25may17-en.pdf ()
[300] Internationalized Domain Names Variant Issues Project: Arabic
Case Study Team Issues Report, ICANN, October 7, 2011
https://archive.icann.org/en/topics/new-gtlds/arabic-vip-issues-
report-07oct11-en.pdf (In-script variants and code points
excluded)
[401] Table 12-14 in Chapter 12 "South and Central Asia-I", ,"The
Unicode Standard", Version 10.0,
https://www.unicode.org/versions/Unicode10.0.0/ch12.pdf (Vowel
sequences not to be used in Gurmukhi)
[5564] RFC 5564 (Code points to be excluded from repertoires for the
Arabic language)
[6912] RFC 6912 (Code points considered problematic)
7. IANA Considerations
The IANA Services Operator is hereby requested to create the Registry
of Unicode Code Points for Special Consideration in Network
Identifiers, and to populate it with the values in section Section 5.
The registry is to be updated by Expert Review.
This registry has no formal protocol status with respect to IDNA or
PRECIS. It is a registry intended to be used by those creating
registration or lookup policies, in order to inform the development
of such policies.
Freytag, et al. Expires December 31, 2018 [Page 28]
Internet-Draft Troublesome Characters June 2018
8. Security Considerations
The registry established by this document is intended to help
operators of identifier systems in deciding what to permit in
identifiers. It may also be useful for user agents that attempt to
provide warnings to users about suspicious or inadvisable
identifiers. Operators that fail to make policies addressing the
contents of the registry may permit the creation of identifiers that
are misleading or that may be used in attacks on the network or
users.
The registry is not a magic solution to all identifier ambiguity, and
even refusing to permit registration of, or lookup of, every code
point in the registry cannot ensure that misleading or confusing
identifiers will never be created.
9. References
9.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
[RFC4713] Lee, X., Mao, W., Chen, E., Hsu, N., and J. Klensin,
"Registration and Administration Recommendations for
Chinese Domain Names", RFC 4713, DOI 10.17487/RFC4713,
October 2006, <https://www.rfc-editor.org/info/rfc4713>.
[RFC5890] Klensin, J., "Internationalized Domain Names for
Applications (IDNA): Definitions and Document Framework",
RFC 5890, DOI 10.17487/RFC5890, August 2010,
<https://www.rfc-editor.org/info/rfc5890>.
[RFC5891] Klensin, J., "Internationalized Domain Names in
Applications (IDNA): Protocol", RFC 5891,
DOI 10.17487/RFC5891, August 2010,
<https://www.rfc-editor.org/info/rfc5891>.
[RFC5892] Faltstrom, P., Ed., "The Unicode Code Points and
Internationalized Domain Names for Applications (IDNA)",
RFC 5892, DOI 10.17487/RFC5892, August 2010,
<https://www.rfc-editor.org/info/rfc5892>.
Freytag, et al. Expires December 31, 2018 [Page 29]
Internet-Draft Troublesome Characters June 2018
[RFC5893] Alvestrand, H., Ed. and C. Karp, "Right-to-Left Scripts
for Internationalized Domain Names for Applications
(IDNA)", RFC 5893, DOI 10.17487/RFC5893, August 2010,
<https://www.rfc-editor.org/info/rfc5893>.
[RFC5894] Klensin, J., "Internationalized Domain Names for
Applications (IDNA): Background, Explanation, and
Rationale", RFC 5894, DOI 10.17487/RFC5894, August 2010,
<https://www.rfc-editor.org/info/rfc5894>.
[RFC7564] Saint-Andre, P. and M. Blanchet, "PRECIS Framework:
Preparation, Enforcement, and Comparison of
Internationalized Strings in Application Protocols",
RFC 7564, DOI 10.17487/RFC7564, May 2015,
<https://www.rfc-editor.org/info/rfc7564>.
[RFC7940] Davies, K. and A. Freytag, "Representing Label Generation
Rulesets Using XML", RFC 7940, DOI 10.17487/RFC7940,
August 2016, <https://www.rfc-editor.org/info/rfc7940>.
[UAX44] The Unicode Consortium, "Unicode Standard Annex #44,
Unicode Character Database",
<http://www.unicode.org/reports/tr44/>.
This references the most currently published version of
the description of the Unicode Character Database.
[UCD] The Unicode Consortium, "Unicode Character Database",
<http://www.unicode.org/Public/UCD/latest/ucd/>.
This references the most currently published version of
the data files for the Unicode Character Database
[Unicode] The Unicode Consortium, "The Unicode Standard, Latest
Version", <http://www.unicode.org/versions/latest/>.
This references the most currently published version
9.2. Informative References
[I-D.klensin-idna-5892upd-unicode70]
Klensin, J. and P. Faltstrom, "IDNA Update for Unicode 7.0
and Later Versions", draft-klensin-idna-5892upd-
unicode70-05 (work in progress), October 2017.
Freytag, et al. Expires December 31, 2018 [Page 30]
Internet-Draft Troublesome Characters June 2018
[I-D.rfc5891bis]
Klensin, J., "Internationalized Domain Names in
Applications (IDNA): Registry Restrictions and
Recommendations", March 2017,
<https://datatracker.ietf.org/doc/
draft-klensin-idna-rfc5891bis/>.
[MSR] Integration Panel, ""Maximal Starting Repertoire (MSR-
3)"", March 2018, <
https://www.icann.org/en/system/files/files/
msr-3-overview-28mar18-en.pdf>.
[RFC5564] El-Sherbiny, A., Farah, M., Oueichek, I., and A. Al-Zoman,
"Linguistic Guidelines for the Use of the Arabic Language
in Internet Domains", RFC 5564, DOI 10.17487/RFC5564,
February 2010, <https://www.rfc-editor.org/info/rfc5564>.
[RFC6365] Hoffman, P. and J. Klensin, "Terminology Used in
Internationalization in the IETF", BCP 166, RFC 6365,
DOI 10.17487/RFC6365, September 2011,
<https://www.rfc-editor.org/info/rfc6365>.
[RFC8228] Freytag, A., "Guidance on Designing Label Generation
Rulesets (LGRs) Supporting Variant Labels", RFC 8228,
DOI 10.17487/RFC8228, August 2017,
<https://www.rfc-editor.org/info/rfc8228>.
[RZ-LGR] Integration Panel, ""Root Zone Label Generation Rules
(LGR-2) - Overview and Summary"", July 2017, <
https://www.icann.org/sites/default/files/lgr/
lgr-2-overview-26jul17-en.pdf>.
Appendix A. Additional Background
A.1. The Theory of Inclusion
The mechanism that the IETF has come to prefer for
internationalization of identifiers may be called "inclusion-based
identifier internationalization", or "inclusion" for short. Under
inclusion, the characters that are permissible in identifiers for a
protocol are selected from the set of all Unicode characters. One
starts with an empty set of characters, and then gradually adds
characters to the set, usually based on Unicode properties (see
below, and also Section 3).
Inclusion depends in part on assumptions the IETF made when the
strategy was adopted and developed; some of those assumptions were
about the relationships between different characters and the
Freytag, et al. Expires December 31, 2018 [Page 31]
Internet-Draft Troublesome Characters June 2018
likelihood that similar such relationships would get added to future
versions of Unicode. Those assumptions turn out not to have been
true in every case. Code points at issue are among those to be
listed in the registry defined here. (See Section 5.)
The intent of Unicode is to encode all known writing systems into a
single coded character set. One consequence of that goal is that
Unicode encodes an enormous number of characters. Another is that
the work of Unicode does not end until every writing system is
encoded; even after that, it needs to continue to track any changes
in those writing systems.
Unicode encodes abstract characters, not glyphs. Because of the way
Unicode was built up over time, there are sometimes multiple ways to
encode the same abstract character. For example, an e with an acute
accent may be written by combining U+0065 LATIN SMALL LETTER E and
U+0031 COMBINING ACUTE ACCENT, or it may be written U+00E9 LATIN
SMALL LETTER E WITH ACUTE. If Unicode encodes an abstract character
in more than one way, then for most purposes the different encodings
should all be treated as though they're the same character. This
"canonical equivalence" between encodings of the same abstract
characters is explicitly called out by Unicode. A lack of a defined
canonical equivalence is tantamount to an assertion by Unicode that
the two encodings do not represent the same abstract character, even
if both happen to result in the same appearance.
Every encoded character in Unicode (more precisely, every code point)
is associated with a set of properties. The properties define what
script a code point is in, whether it is a letter or a number or
punctuation and so forth, its direction when written, to what other
code point or code point sequence it is canonically equivalent, and
many other properties. These properties are important to the
inclusion mechanism. They are defined in the Unicode Character
Database [UCD] [UAX44].
Inclusion depends on the assumption that such strings as will be used
in identifiers will not have any ambiguous matching to other strings.
In practice, this means that input strings to the protocol are
expected to be in Normalization Form C. This way, any alternative
sequences of code points for the same characters will be normalized
to a single form. If all the characters in the string are also
included for the protocol's candidate identifiers, then the string is
eligible to be an identifier under the protocol.
Freytag, et al. Expires December 31, 2018 [Page 32]
Internet-Draft Troublesome Characters June 2018
A.2. The Difference Between Theory and Practice
In principle, under inclusion identifiers should be unambiguous. It
has always been recognized, however, that for humans some ambiguity
is inevitable, because of the vagaries of writing systems and of
human perception.
Normalization Form C ("NFC") removes the ambiguities based on dual or
multiple encoding for the same abstract character. However,
characters are not the same as their glyphs. This means that it is
possible for certain abstract characters to share a glyph. We can
call such abstract characters "homoglyphs". While this looks at
first like something that should be handled (or should have been
handled) by normalization (NFC or something else), there are
important differences; the situation is in some sense an extreme case
of a spectrum of ambiguity.
A.2.1. Confusability
While Unicode deals in abstract characters and inclusion works on
Unicode code points, users interact with strings as actually
rendered: sequences of glyphs. There are characters that, depending
on font, sometimes look quite similar to one another (such as "l" and
"1"); any character that is like this is often called "visually
similar". More difficult are characters that, in any normal
rendering, always look the same as one another. The shared history
of Cyrillic, Greek, and Latin scripts, for example, means that there
are characters in each script that function similarly and that are
usually indistinguishable from one another, though they are not the
same abstract character. These are examples of "homoglyphs." Any
character that can be confused for another one can be called
confusable, and confusability can be thought of as a spectrum with
"visually similar" at one end, and "homoglyphs" at the other. (We
use the term "homoglyph" strictly: code points that normally use the
same glyph when rendered.)
Note that homoglyphs are not restricted to cross-script scenarios -
there are a number of homoglyphs where both code points or sequences
are part of the same script.
A further issue is introduced by the fact that Unicode caters not
only to living and dead languages alike, but also to scholarly and
scientific notation, as well as specialized modes of written text,
such as for poetry, religious works, or texts to be sung or chanted.
Where these notations use symbols, they are excluded under inclusion,
but where they use varieties of letter forms or marks used with
letters, they are included by default. Some of these letters or
marks, have been incorporated over time into orthographies for living
Freytag, et al. Expires December 31, 2018 [Page 33]
Internet-Draft Troublesome Characters June 2018
languages, which is one reason they were not rigorously excluded from
the start. However, in some cases, they may (alone or in combination
with ordinary letters appear the same (or very similar to) existing
letters. This makes some of these characters, and especially the
marks in question "troublesome".
Finally, IDNA 2008 has a limited appreciation for the fact that
characters in complex scripts, unlike ASCII letters, cannot simply
occur in random sequences. Neither software (for display or data
entering) nor readers are prepared to process some of these code
points "out of order". For such scripts, without a policy that
describes permissible contexts, labels could be registered that
cannot be rendered or typed reliably and which most users would not
know how to read or recognize. In some cases, combining sequences
typed in the "wrong" order may display identically to to those typed
in the "correct" ordering; again something that needs to be sorted
out by defining permissible contexts, for example by using the
context rule mechanism in [RFC7940].
Appendix B. Examples
There are a number of cases that illustrate the combining sequence or
digraph issue:
U+08A1 vs \u'0628'\u'0654' This case is ARABIC LETTER BEH WITH HAMZA
ABOVE, which is the one that was detected during expert review
that caused the IETF to first notice the issue, even though the
issue existed before this. For detailed discussion of this case
and some of the following ones, see
[I-D.klensin-idna-5892upd-unicode70].
U+0681 vs \u'062D'\u'0654' This case is ARABIC LETTER HAH WITH HAMZA
ABOVE, which (like U+08A1) does not have a canonical equivalent.
In both cases, the places where hamza above and similar Arabic
combining marks are used are specialized enough that the combining
marks are generally excluded. See [RFC5564] and [RZ-LGR].
Unicode has a policy of encoding as composite any letter needed in
an Arabic orthography, even if it appears superficially that the
same shape could be achieved by a combining sequence. (In actual
typography there's often a small but noticeable difference in
placement of the mark between a composite character and a
combining sequence.)
U+0623 vs \u'0627'\u'0654' This case is ARABIC LETTER ALEF WITH
HAMZA ABOVE. Unlike the previous two cases, it does have a
canonical equivalence with the combining sequence. Therefore,
only the composite is used in IDNs.
Freytag, et al. Expires December 31, 2018 [Page 34]
Internet-Draft Troublesome Characters June 2018
U+09E1 vs u\'098C'u\'09E2' This case is BENGALI LETTER VOCALIC LL.
This is an example in the Bengali script of a case without a
canonical equivalence to the combining sequence. Per Unicode, the
single code point should be used to represent vowel signs in text,
and the sequence of code points should not be used. There are
similar cases in many Indic scripts. It is not a simple matter of
disallowing the combining vowel mark in cases like this, because
it is commonly used as vowel sign. The recommendation would be to
add a context rule, restricting the vowel signs from appearing
directly after an independent vowel like U+098C..
U+019A vs \u'006C'\u'0335' This case is LATIN SMALL LETTER L WITH
BAR. In at least some fonts, there is a detectable difference
between the composite code point and the combining sequence, but
only if one compares them side-by-side. Unlike a separable
diacritic, there are no fast rules for placement of overlays. A
bar may cross at different heights for different glyph shape or
may cross different parts of the glyph. For this reason, there is
no canonical equivalence defined between the sequence and the
composite. Unicode has a principle of encoding barred letters of
specific shape as single code point composites when needed for any
writing system. The code point U+0335 COMBINING SHORT STROKE
OVERLAY and similar overlay diacritics are therefore never needed
as part of any orthography and are recommended to be excluded from
identifiers.
U+00F8 vs \u'006F'\u'0337' This is LATIN SMALL LETTER O WITH STROKE.
The effect is similar to the previous case. Unicode has a
principle of encoding stroked letters as composites when needed
for any writing system.
U+02A6 vs \u'0074'\u'0073' This is LATIN SMALL LETTER TS DIGRAPH,
which is not canonically equivalent to the letters t and s. The
intent appears to be that the digraph shows the two shapes as
kerned, but the difference may be slight if viewed out of context.
The use of the digraph is for specialized purposes; it can be
excluded from identifiers.
U+01C9 vs \u'006C'\u'006A' Unlike the TS digraph, the LJ digraph has
a relevant compatibility decomposition, so it fails the relevant
stability rules under inclusion and is therefore DISALLOWED in
IDNA2008. This illustrates the way that consistencies that might
be natural to some users of a script are not necessarily found in
it, possibly because of uses by another writing system.
U+06C8 vs u\'0648'u\'0670' ARABIC LETTER YU is an example where the
normally-rendered character looks just like a combining sequence,
but are named differently. This an example that shows that the
Freytag, et al. Expires December 31, 2018 [Page 35]
Internet-Draft Troublesome Characters June 2018
Unicode name is not a reliable indicator of the intended
appearance. Like other cases in Arabig, the recommendation is to
exclude the combining mark (and therefore the sequence) in favor
of the composite.
U+069 vs \u'0069'\u'0307' LATIN SMALL LETTER I followed by COMBINING
DOT ABOVE by definition, renders exactly the same as LATIN SMALL
LETTER I by itself and does so in practice for any good font. The
same would be true if "i" was replaced with any of the other
Soft_Dotted characters defined in Unicode. The character sequence
\u'0069'\u'0307' (followed by no other combining mark) is
reportedly rather common on the Internet. Because base character
and stand-alone code point are the same in this case, and the code
points affected have the Soft_Dotted property already, this could
be mitigated separately via a context rule affecting U+0307.
Other cases that demonstrate that the issue does not lie exclusively
or primarily with combining sequences:
U+0B95 vs U+0BE7 The TAMIL LETTER KA and TAMIL DIGIT ONE are always
indistinguishable, but needed to be encoded separately because one
is a letter and the other is a digit.
Arabic-Indic Digits vs. Extended Arabic-Indic Digits Seven digits of
these two sequences have entirely identical shapes. This case is
an example of something dealt with in inclusion that nevertheless
can lead to confusions that are not fully mitigated. IDNA, for
example, contains context rules restricting the digits to one set
or another; but such rules apply only to a single label, not to an
entire name. Moreover, it provides no way of distinguishing
between two labels that both conform to the context rule, but
where each contains a different member one of the seven identical
shape pairs.
U+53E3 vs U+56D7 These are two Han characters (roughly rectangular)
that are different when laid side by side; but they may be
difficult to distinguish out of context or in very small print.
U+01DD vs U+0259 The two Latin script code points share the have the
identical appearance of a lower-case upside down "e". They are
encoded differently due to different uppercase forms. The fact
that they uppercase differently is taken as evidence that they are
not the same abstract character, despite the superficial evidence
of their shared shape. The more common cases, where the uppercase
forms are identical may be of less concern, given that IDNA 2008
is limited to lower case.
Freytag, et al. Expires December 31, 2018 [Page 36]
Internet-Draft Troublesome Characters June 2018
Cross script homoglyphs usually do not involve combining sequences,
but can be mitigated by rules requiring strings to be in a single
script. For zones that support multiple scripts, it may be necessary
to have policies to prevent whole-script homographs: labels entirely
in one script that look the same as another label in the other
script. One method would be to define "blocked" variants (See
[RFC7940] and [RFC8228]).
LATIN SMALL LETTER OPEN E is one of a handful of examples of
characters borrowed from another script, in this case GREEK SMALL
LETTER EPSILON.
LATIN SMALL LETTER E and CYRILLIC SMALL LETTER IE are historically
related, both derive from uppercase forms of the GREEK CAPTIAL
LETTER EPSILON. There are a number of such pairs -- enough to
make many whole strings that look the same in both scripts (but
usually spell nonsense in one of them). An example would be
"pax".
Appendix C. Discussion Venue
Note to RFC Editor: this section should be removed prior to
publication as an RFC.
This Internet-Draft may be discussed on the IAB Internationalization
public list: i18n-discuss@iab.org.
Appendix D. Change History
Note to RFC Editor: this section should be removed prior to
publication as an RFC.
00:
* Initial version
01:
* Add background and examples from the LUCID Problem Statement
* Add a paragraph about motivation to explain the difference
between this registry and administrative policy more generally
* Expand and clarify a number of earlier points of discussion
* Attempt to make clear that this registry does not update any
protocols
Freytag, et al. Expires December 31, 2018 [Page 37]
Internet-Draft Troublesome Characters June 2018
* Move some formerly-appendix material to the body
* Expand the initial registry.
02:
* Expanded the discussion of possible mitigation approaches and
made its own section.
* Added more detail to the categories of troublesome characters
* Minor updates to "Existing techniques" section.
* Some extension to the description of the contents of the
registry and discussion of how to handle additional
information.
Authors' Addresses
Asmus Freytag
ASMUS, Inc.
Email: asmus@unicode.org
John C Klensin
1770 Massachusetts Ave, Ste 322
Cambridge, MA 02140
U.S.A.
Email: john-ietf@jck.com
Andrew Sullivan
Oracle Corp.
100 Milverton Drive
Missisauga, ON L5R 4H1
Canada
Email: andrew.s.sullivan@oracle.com
Freytag, et al. Expires December 31, 2018 [Page 38]