IETF | A. Freytag |
Internet-Draft | ASMUS, Inc. |
Intended status: Standards Track | J. Klensin |
Expires: January 1, 2019 | |
 | A. Sullivan |
 | Oracle Corp. |
 | June 30, 2018 |
Those Troublesome Characters: A Registry of Unicode Code Points Needing Special Consideration When Used in Network Identifiers
draft-freytag-troublesome-characters-02
Unicode's design goal is to be the universal character set for all applications. That goal entails the inclusion of a very large number of characters. Unicode is also focused on written language in general; special provisions have always been needed for identifiers. The sheer size of the repertoire increases the possibility of accidental or intentional use of characters that can cause confusion among users, particularly where linguistic context is ambiguous, unavailable, or impossible to determine. A registry of code points that can sometimes be especially problematic may be useful to guide system administrators in setting parameters for allowable code points or combinations in an identifier system, and to aid applications in creating security aids for users.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on January 1, 2019.
Copyright (c) 2018 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
Unicode [Unicode] is a coded character set that aims to support every writing system. Writing systems evolve over time and are sometimes influenced by one another. As a result, Unicode encodes many characters that appear to a reader to be the same thing but are encoded differently from one another. This sort of difference is usually not important in written texts, because competent readers and writers of a language are able to compensate for the selection of the "wrong" character when reading or writing. Finally, the goal of supporting every writing system also implies that Unicode is designed to properly represent written language; special provisions are needed for identifiers.
Identifiers that are used in a network or, especially, an Internet context present several special problems because of the above feature of Unicode:
Beyond these issues, human perception is easily tricked, so that entirely unrelated character sequences can become confusable -- for example "rn" being confused with "m". Humans read strings, not characters, and they will mostly see what they expect to see. Some additional discussion of the background can be found in Appendix A.
The remainder of this document discusses techniques that can be used to design the label generation rules for a particular zone so they ameliorate or avoid entirely some of the issues caused by the interaction between the Unicode Standard and identifiers. The registry is intended to highlight code points that require such techniques.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
A reader needs to be familiar with Unicode [Unicode], IDNA2008 [RFC5890] [RFC5891] [RFC5892] [RFC5893] [RFC5894], PRECIS (at least the framework, [RFC7564]), and conventions for discussion of internationalization in the IETF (see [RFC6365]).
In the IDNA mechanism for including Unicode code points [RFC5892], a code point is only included when it meets the needs of internationalizing domain names as explained in the IDNA framework [RFC5894]. For identifiers other than those specified by IDNA, the PRECIS framework [RFC7564] generalizes the same basic technique. In both cases, the overall approach is to assume that all characters are excluded, and then to include characters according to properties derived from the Unicode character properties. This general strategy cuts the enormous size of the Unicode database somewhat, avoiding including some characters that are necessarily unsuited for use as identifiers.
The mechanism of inclusion by derived property, while helpful, is insufficient to guarantee every included character is safe for use in identifiers. Some characters' properties lead them to be included even though they are not obviously good candidates. In other cases, individual characters are good for inclusion, but are problematic in combination. Finally, there are cases where characters (or sequences of characters) are not problematic by themselves, or if used in a mutually exclusive manner in the same identifier, but become problematic when their choice represents the only difference between otherwise identical identifiers. For some examples, see Appendix B.
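The inclusion-by-property approach, and its limits, can be illustrated with a minimal Python sketch. The category set below is a hypothetical simplification and is not the derivation in [RFC5892], which uses additional properties, exceptions, and context rules; the function names are likewise illustrative only.

```python
import unicodedata

# Hypothetical, greatly simplified inclusion rule: start from "nothing is
# allowed" and admit only code points whose General_Category is in this set.
# This is NOT the RFC 5892 algorithm; it only shows the general idea.
INCLUDED_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}


def is_included(ch: str) -> bool:
    """Return True if the code point passes the simplified inclusion test."""
    return unicodedata.category(ch) in INCLUDED_CATEGORIES


def excluded_code_points(label: str) -> list[str]:
    """List the code points of a label that the simplified rule rejects."""
    return [f"U+{ord(ch):04X}" for ch in label if not is_included(ch)]
```

Under this simplified rule U+01C0 (LATIN LETTER DENTAL CLICK) is admitted because its General_Category is Lo, while the visually similar U+007C (VERTICAL LINE) is rejected. A property test alone therefore cannot catch this kind of case.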
Operators of systems that create identifiers (whether through a registry or through a peer-to-peer identifier negotiation system) need to make policies for characters they will permit. Operators of registries, for instance, can help by adopting good registration policies: "Users will benefit if registries only permit characters from scripts that are well-understood by the registry or its advisers."[RFC5894]
The difficulty for many operators, however, is that they do not have the writing system expertise to claim any character is "well-understood", and they do not really have the time to develop that expertise. Such operators should in fact not use or register such characters. Unfortunately, in many cases the operators are stewards of systems where the user population demands identifiers useful to them in their local languages. In other cases, operators may proceed without a proper understanding owing to financial or market share incentives. The risk for Internet identifiers in such cases is obviously that ill-understood and potentially exploitable gaps in registration policies will open.
To help mitigate such issues, this document proposes a registry of Unicode code points that are known to present special issues for network identifiers, with the aim of guiding protocol and operating decisions about whether to permit a given code point or sequence of code points. By necessity, any list or guidance can only reflect issues that are known and understood at the time of writing. By limiting itself largely to characters that are widely used to write languages in contemporary use, the registry will address the more critical needs, while simultaneously focusing on characters that are well understood and for which there may already be some implementation experience in IDNs.
By itself, such a registry will not completely protect against poor registration or use, but it may provide operational guidance necessary for people who are responsible for creating policies. It also obviates the need for everyone to repeat basic investigation into the behavior of Unicode characters. Instead, scarce expertise can be focused on ways to mitigate issues, perhaps caused by user requirements for a specific character.
Note that the registry defined herein does not address any of the issues created by whole-string confusables where each of the identifiers is of a different script. A common workaround, limiting a registry to identifiers of only a single script, would mitigate this issue. [CREF2]AF: we should evaluate that; cross-script variants that are homoglyphs have now been collected across modern scripts as part of the root zone LGR and are easily captured in a registry.
For some of the code points (or code point sequences) listed as presenting issues for identifiers, it may be most expeditious to simply not include them, even though they are valid according to the protocol. Sometimes, one of a pair of identical code points (or code point sequences) may be deemed preferable over the other for practical reasons.
However, simply leaving out any code point listed in this registry would render a registry of doubtful value for many scripts. It is not always necessary or desirable to exclude characters. Sometimes, it is merely necessary to ensure that for two otherwise identical identifiers, only one of a set of mutually exclusive code points (or sequences of code points) is used, while preventing the later registration of the label containing the other one in order to avoid ambiguity. This way the operator does not need to impose a choice.
In cases where two or more variants of such an identifier mean the same thing to the native reader, an operator may decide to allow all of the variant labels to be registered simultaneously, but only to the same entity (and with proper safeguards that limit the multiplicity of such allocatable variant labels).
This strategy would be implemented via the variant mechanism described in [RFC7940] and [RFC8228], which allows mechanical processing of mutual exclusion and/or bundling of identifiers.
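The mutual-exclusion idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the XML-based format of [RFC7940]: the variant sets are taken from sample entries in this document, and `variant_key` and `Zone` are hypothetical names introduced only for this example.

```python
# Hypothetical variant sets built from sample entries in this document.
VARIANT_SETS = [
    {"\u01dd", "\u0259"},  # U+01DD and U+0259, identical in appearance
    {"\u0660", "\u06f0"},  # Arabic-Indic and Extended Arabic-Indic digit zero
    {"\u12a5", "\u12d5"},  # used interchangeably in Amharic
]

# Map every member of a set to one arbitrary representative (the lowest
# code point), so that labels differing only by variants share one key.
REPRESENTATIVE = {cp: min(group) for group in VARIANT_SETS for cp in group}


def variant_key(label: str) -> str:
    """Replace each code point by the representative of its variant set."""
    return "".join(REPRESENTATIVE.get(ch, ch) for ch in label)


class Zone:
    """Toy registry that blocks any label colliding with an allocated one."""

    def __init__(self) -> None:
        self._allocated = {}  # variant key -> label actually registered

    def register(self, label: str) -> str:
        key = variant_key(label)
        if key in self._allocated:
            # The same label, or a variant of it, is already registered.
            return "blocked"
        self._allocated[key] = label
        return "allocated"
```

With this policy the operator does not choose between U+01DD and U+0259: whichever label is registered first is allocated, and the other becomes unavailable. A bundling policy would additionally record the registrant and allow the variant label only for that same entity.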
This specification defines a registry of code points and sequences that have been identified as requiring special attention when they are to be used in identifiers. An administrator who does not have the time or inclination to develop the requisite policies might consider simply not permitting these code points at all.
However, for some scripts the remaining subset might not be usable in a meaningful way. Identifiers in these scripts cannot be safely implemented without understanding the issues involved. Further note that many code points listed here are problematic only in their relationship to other code points and that as long as these issues are adequately addressed, for example using the variant mechanism, they do not need to be excluded. [CREF3]AF: the above needs more editing, it’s a bit repetitive.
The following are the defined category values:
If a character appears in the registry, that does not automatically mean that it is a bad candidate for use in identifiers generally. Absent a well-defined and verifiable policy, however, such a code point or sequence might well be treated with suspicion by users and by tools.
For code points tagged as being "identical" to or "indistinguishable" from other code points, it may be that one is preferred over the other, but it may also be that implementing a scheme for mutual exclusion of any resulting identical labels is the best solution, such as assigning them "blocked" variants according to [RFC7940] and [RFC8228].
Where characters are confusable with a combining sequence, only the combining sequence is listed; suggested mitigation may consist of disallowing either the specific combining sequence or disallowing the combining marks involved. It is usually inappropriate to exclude any of the basic letters involved, as they are generally members of the standard alphabet for one or more languages.
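Disallowing specific listed sequences can be sketched as follows. This is a minimal Python illustration, assuming the two Gurmukhi sequences from the sample entries as the listed set; `LISTED_SEQUENCES` and `listed_sequences_in` are hypothetical names, not part of any specification.

```python
import unicodedata

# Listed sequences and the single code points they imitate, taken from the
# sample entries in this document (illustrative subset only).
LISTED_SEQUENCES = {
    "\u0a72\u0a3f": "\u0a07",
    "\u0a72\u0a40": "\u0a08",
}


def listed_sequences_in(label: str) -> list[str]:
    """Return the listed sequences that occur in the NFC form of a label."""
    normalized = unicodedata.normalize("NFC", label)
    return [seq for seq in LISTED_SEQUENCES if seq in normalized]
```

Normalization alone does not help here: NFC leaves U+0A72 U+0A3F as two code points rather than mapping the sequence to U+0A07, so an explicit check of this kind is needed if the sequence is to be kept out of a zone.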
The registry and this document are to be understood as guidance for the purpose of developing operational policies that are used for protocols under normal administrative scope. For instance, zone operators that support IDNA are expected to create policies governing the code points that they will permit (see [RFC5894] and [I-D.rfc5891bis]). The registry herein defined is intended to highlight particularly troublesome code points or code point sequences for the benefit of administrators creating such policies. It is also intended to highlight characters that may create identifier ambiguities and thereby create security vulnerabilities. However, by itself it is no substitute for such policies.
The registry is by necessity limited to code points for which adequate information is available; by and large this means code points used in connection with modern languages or writing systems, except that specialized extensions to modern scripts may be indicated, if their use would fall into any of the categories defined. Historic scripts, and any modern scripts not represented in the registry can be assumed to not be well-understood; operators are cautioned to locate other sources of information and to develop the necessary policies before deploying such scripts.
The registry is updated by Expert Review using an open process. From time to time, additional code points may be added to the Unicode Standard, or further information may be discovered about existing code points, including those already listed here. The Unicode Standard may recommend against using a code point for all or some purposes, or a script community may have gained more experience in deploying IDNs for that script and may create or update recommendations as to best policy.
Code points that are DISALLOWED in IDNA 2008 are not eligible to be listed. Code points that are CONTEXTJ or CONTEXTO are not included here unless there are documented concerns that are not mitigated by the existing IDNA context rules. The focus is on scripts that are significant for identifiers; code points from scripts that are historic or otherwise of limited use have generally not been considered, although exceptions may exist where authoritative information is readily available. Code points and code point sequences included are those that need special policies (including, but not limited to, policies of exclusion).
New code points or sequences are listed whenever information becomes available that identifies a specific issue that requires attention in crafting a policy for the use of that code point or sequence in network identifiers. Likewise, cross references, categories, explanations, and references cited may be updated.
The contents of the registry generally do not represent original research but a collection of issues documented elsewhere, with appropriate references cited. An exception might be cases that are clearly analogous to existing entries but not explicitly covered by existing references, for example, because the code point in question was recently added to Unicode.
If a particular language or script community reaches an apparent consensus that some code point is problematic, or that of two identical code points or sequences one should be preferred over the other, such recommendations, if known, should be documented in this registry.
In addition, if the Unicode Standard designates a code point as formally "deprecated" or less formally as “do not use”, or identifies code points that are "intentionally identical", this is also something that should be reflected in the registry. Another source of potential information might be existing registry policies or recommended policies, particularly where it is apparent that they represent a careful analysis of the issue or a wider consensus, or both.
Proposed additions to the registry are to be shared on a mailing list to allow for broader comment and vetting.
If there is a disagreement about the existence of an issue or its severity, it is preferable to document both the issue and the different evaluations of it. In all cases, the information and documentation presented must allow a user to fully evaluate the status of any entry in the registry.
There is no requirement for the registry to form a stable body of data to which any future document would have to be backward compatible in any way. If new information emerges, additional code points may be considered problematic, or they may need to be reclassified. In case of significant changes, the explanation should note the nature of the change and cite a reference to document the basis for it.
IDNA 2008 uses an inclusion process based on Unicode properties to define which code points are PVALID, but also recognizes that some code points require a context rule (CONTEXTJ, CONTEXTO).
A number of code points that are PVALID in [RFC5892] may require additional attention in the design of label generation rules. In some cases, the issue is not necessarily with an individual code point, but with a code point sequence. In the following, "code point" and "code point sequence" are used synonymously unless explicitly called out. The fact that a code point requires such attention does not affect its status under IDNA 2008.
The following describes a number of conditions that pose problems for network identifiers and common strategies for mitigating them.
At times, two code points or code point sequences are considered by all users (or a significant fraction of them) to be equivalent to such a degree that they accept one as a substitute for the other. This has obvious implications for the unambiguous recognition of identifiers. This document lists the code points and sequences affected (except for certain generic classes too numerous to list here). Note that one of the two may be preferred over the other, in which case the non-preferred one may be excluded or folded away. In many cases, however, neither is preferred over the other. Mitigation techniques for such cases are discussed below.
Code points that are not substitutable but are troublesome for other reasons are candidates for exclusion from a zone's repertoire. The reasons are varied; for each such code point, the comment field briefly describes why it should be excluded or considered troublesome. No mitigation strategy can be recommended for general use: unless careful study shows that a code point with this status is acceptable for a particular zone, it should normally be excluded from the repertoire.
There are several techniques that can be used to help mitigate confusion. The focus in the following is on issues addressable by protocol or registry policy. However, user agents might implement additional mitigation approaches, such as always using a font designed to distinguish among different characters.
As noted in Section 1, it is not possible to solve all the problems with identifier systems, particularly when human factors are taken into account. In addition, each of the mitigation approaches has its own limits on the types of problems it can address, whether it is exclusion of specific code points; requiring or prohibiting contexts for certain code points; restriction to a single script per label; or mutual exclusion of labels differing only by code points that are identical or otherwise confusably equivalent to other code points. Additional policies may be needed to prevent registration of labels that are problematic or confusable for other reasons.
There are a number of issues in implementing and presenting identifiers to the user that are not specific to individually identifiable code points (or sequences). For example, fonts vary widely in whether they distinguish the appearance of certain characters, with some relying on the native reader to get the intended meaning from context. It is up to user agents to select fonts that render each code point as distinctly as possible.
When new code points are assigned in Unicode, systems, keyboards, fonts and rendering engines may all be updated unevenly, with considerable delays. During a possibly lengthy transition period, this will lead to inconsistent user experience or inability to distinguish certain labels. Even if unsupported labels are presented as A-labels, users may not reliably identify them, because they appear as essentially random sequences of letters and digits.
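The opacity of A-labels can be seen with a minimal Python sketch. It uses only the standard library's `punycode` codec and performs no IDNA validity checks; `to_a_label` is a hypothetical helper name introduced for this illustration.

```python
def to_a_label(u_label: str) -> str:
    """Punycode-encode a label and add the ACE prefix (no IDNA validation)."""
    if u_label.isascii():
        return u_label  # already an ASCII label, nothing to encode
    return "xn--" + u_label.encode("punycode").decode("ascii")


# Two labels that differ only by U+0259 versus U+01DD, which the sample
# entries describe as identical in appearance.
label_with_schwa = to_a_label("b\u0259r")
label_with_turned_e = to_a_label("b\u01ddr")
```

The two A-labels are different strings, so the underlying labels are distinguishable in principle, but nothing in their appearance tells a user which one corresponds to the intended character.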
In the explanation the character names have been abbreviated. The following list shows sample entries for the proposed registry. It is non-normative, and only included for illustrative purposes. Also see the examples below (Appendix B).
------------------------------------------------------------------
Code Point: 01C0
Related CP:
References: [120] [155]
Comment: Not Recommended: Indistinguishable from a punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 01C1
Related CP:
References: [120] [155]
Comment: Not Recommended: Indistinguishable from a punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 01C2
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from a punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 01C3
Related CP:
References: [120] [150]
Comment: Not Recommended: Indistinguishable from a punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 01DD
Related CP: 0259
References: [150]
Comment: Identical: Identical in appearance to U+0259
------------------------------------------------------------------
Code Point: 0259
Related CP: 01DD
References: [150]
Comment: Identical: Identical in appearance to U+01DD
------------------------------------------------------------------
Code Point: 0131
Related CP:
References: [100]
Comment: Restricted Context: If followed by any combining mark above, renders the same way as U+0069 in any good font. Should be restricted to where it is not followed by a combining mark above
------------------------------------------------------------------
Code Point: 0237
Related CP:
References: [115]
Comment: Not Recommended: If followed by any combining mark above, renders the same way as U+006A in any good font. As its use is limited, it is best excluded.
------------------------------------------------------------------
Code Point: 025F
Related CP:
References: [115]
Comment: Not Recommended: If followed by any combining mark above, renders the same way as U+0249 in any good font. As its use is limited, it is best excluded.
------------------------------------------------------------------
Code Point: 02A3
Related CP: 0064 007A
References: [115]
Comment: Not Recommended: Looks like small LETTER D plus LETTER Z, except for slight kerning; in limited use.
------------------------------------------------------------------
Code Point: 02A6
Related CP: 0074 0073
References: [115]
Comment: Not Recommended: Looks like small LETTER T plus LETTER S, except for slight kerning; in limited use.
------------------------------------------------------------------
Code Point: 02A7
Related CP: 0074 0283
References: [115]
Comment: Not Recommended: Looks like small LETTER T plus LETTER ESH, except for slight kerning; in limited use.
------------------------------------------------------------------
Code Point: 02AA
Related CP: 006C 0073
References: [115]
Comment: Not Recommended: Looks like small LETTER L plus LETTER S, except for slight kerning; in limited use.
------------------------------------------------------------------
Code Point: 02AB
Related CP: 006C 007A
References: [115]
Comment: Not Recommended: Looks like small LETTER L plus LETTER Z, except for slight kerning; in limited use.
------------------------------------------------------------------
Code Point: 02B9
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from a punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02BA
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from a punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02BB
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from a punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02BC
Related CP:
References: [6912]
Comment: Not Recommended: Indistinguishable from a punctuation character (U+2019), which is not PVALID
------------------------------------------------------------------
Code Point: 02BD
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from a punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02BE
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from a punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02BF
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from a punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02C0
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from a punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02C1
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from a punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02C6
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from a punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02C7
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from a punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02C8
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from a punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02C9
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from a punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02CA
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from a punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 02CB
Related CP:
References: [120]
Comment: Not Recommended: Indistinguishable from a punctuation character that is not PVALID
------------------------------------------------------------------
Code Point: 0300
Related CP:
References: [100]
Comment: Not Recommended: Not recommended other than as part of enumerated sequences
------------------------------------------------------------------
Code Point: 0301
Related CP:
References: [100]
Comment: Not Recommended: Not recommended other than as part of enumerated sequences
------------------------------------------------------------------
Code Point: 0302
Related CP:
References: [100]
Comment: Not Recommended: Not recommended other than as part of enumerated sequences
------------------------------------------------------------------
Code Point: 0303
Related CP:
References: [100]
Comment: Not Recommended: Not recommended other than as part of enumerated sequences
------------------------------------------------------------------
Code Point: 0304
Related CP:
References: [100]
Comment: Not Recommended: Not recommended other than as part of enumerated sequences
------------------------------------------------------------------
Code Point: 0306
Related CP:
References: [100]
Comment: Not Recommended: Not recommended other than as part of enumerated sequences
------------------------------------------------------------------
Code Point: 0307
Related CP:
References: [115]
Comment: Restricted Context: By definition, LATIN SMALL LETTER I plus combining DOT ABOVE renders exactly the same as LATIN SMALL LETTER I by itself and does so in practice for any good font. The same is true for all Unicode characters with the soft_dotted property; they lose their dot if followed by a combining mark. DOT ABOVE should be excluded, or restricted to contexts where it does not follow a soft_dotted letter.
------------------------------------------------------------------
Code Point: 0308
Related CP:
References: [100]
Comment: Not Recommended: Not recommended other than as part of enumerated sequences
------------------------------------------------------------------
Code Point: 0624
Related CP: 0648
References: [201]
Comment: Identical: Identical in appearance in some positional form and/or not reliably distinguished because of small size of distinguishing features
------------------------------------------------------------------
Code Point: 0625
Related CP: 0622, 0623, 0627, 0672
References: [201]
Comment: Identical: Identical in appearance in some positional form and/or not reliably distinguished because of small size of distinguishing features
------------------------------------------------------------------
Code Point: 0626
Related CP: 0649, 064A, 067B, 06CC, 06CD, 06D0, 06D2
References: [201]
Comment: Identical: Identical in appearance in some positional form and/or not reliably distinguished because of small size of distinguishing features
------------------------------------------------------------------
Code Point: 0627
Related CP: 0622, 0623, 0625, 0672
References: [201]
Comment: Identical: Identical in appearance in some positional form and/or not reliably distinguished because of small size of distinguishing features
------------------------------------------------------------------
Code Point: 064B
Related CP:
References: [5564]
Comment: Not Recommended: Not to be used in zone files for the Arabic language, per RFC 5564
------------------------------------------------------------------
Code Point: 064C
Related CP:
References: [5564]
Comment: Not Recommended: Not to be used in zone files for the Arabic language, per RFC 5564
------------------------------------------------------------------
Code Point: 065C
Related CP:
References: [300]
Comment: Not Recommended: Part of homoglyph sequence(s) not covered by normalization.
------------------------------------------------------------------
Code Point: 0660
Related CP: 06F0
References: [110]
Comment: Identical: Identical in appearance and meaning to EXTENDED ARABIC-INDIC DIGIT ZERO
------------------------------------------------------------------
Code Point: 0661
Related CP: 06F1
References: [110]
Comment: Identical: Identical in appearance and meaning to EXTENDED ARABIC-INDIC DIGIT ONE
------------------------------------------------------------------
Code Point: 077F
Related CP:
References: [115]
Comment: Not Recommended: Obsolete (archaic)
------------------------------------------------------------------
Code Point: 08AA
Related CP:
References: [201]
Comment: Not Recommended: No evidence of active use found; not recommended
------------------------------------------------------------------
Code Point: 0A72 0A3F
Related CP: 0A07
References: [401]
Comment: Not Recommended: Do not use for U+0A07
------------------------------------------------------------------
Code Point: 0A72 0A40
Related CP: 0A08
References: [401]
Comment: Not Recommended: Do not use for U+0A08
------------------------------------------------------------------
Code Point: 0E3A
Related CP:
References: [206]
Comment: Other issue: Renders unreliably, or not at all, if adjacent to any Thai vowel below. This may be prevented by a context rule
------------------------------------------------------------------
Code Point: 0E41
Related CP:
References: [206]
Comment: Restricted Context: Digraph of U+0E40 SARA E U+0E40 SARA E. Normally handled by disallowing the sequence via a context rule
------------------------------------------------------------------
Code Point: 0E45
Related CP:
References: [206]
Comment: Restricted Context: Only occurs after two special Thai vowels, U+0E24 RU and U+0E26 LU. Is also potentially confused with U+0E32 SARA AA. Both issues can be addressed by defining a context rule. Alternatively the context may be spelled out by enumerating the two sequences and excluding U+0E45 if occurring by itself.
------------------------------------------------------------------
Code Point: 0E4E
Related CP:
References: [206]
Comment: Not Recommended: Rarely used in modern Thai; it is more commonly replaced with U+0E3A (PHINTHU). Excluding it avoids issues with confusing it with another diacritic U+0E4C (THANTHAKHAT). Both are rendered atop a syllable and hard to distinguish at small sizes.
------------------------------------------------------------------
Code Point: 12A5
Related CP: 12D5
References: [100] [202]
Comment: Interchangeable: U+12A5 and U+12D5 are used interchangeably in Amharic
------------------------------------------------------------------
Code Point: 12A6
Related CP: 12D6
References: [100] [202]
Comment: Interchangeable: U+12A6 and U+12D6 are used interchangeably in Amharic
------------------------------------------------------------------
Code Point: 17D2 178A
Related CP: 17D2 178F
References: [204]
Comment: Identical: When preceded by U+17D2, U+178A and U+178F are indistinguishable
------------------------------------------------------------------
Code Point: 17D2 178F
Related CP: 17D2 178A
References: [204]
Comment: Identical: When preceded by U+17D2, U+178A and U+178F are indistinguishable
------------------------------------------------------------------
The IANA Services Operator is hereby requested to create the Registry of Unicode Code Points for Special Consideration in Network Identifiers, and to populate it with the values in Section 5. The registry is to be updated by Expert Review.
This registry has no formal protocol status with respect to IDNA or PRECIS. It is a registry intended to be used by those creating registration or lookup policies, in order to inform the development of such policies.
The registry established by this document is intended to help operators of identifier systems in deciding what to permit in identifiers. It may also be useful for user agents that attempt to provide warnings to users about suspicious or inadvisable identifiers. Operators that fail to make policies addressing the contents of the registry may permit the creation of identifiers that are misleading or that may be used in attacks on the network or users.
The registry is not a magic solution to all identifier ambiguity, and even refusing to permit registration of, or lookup of, every code point in the registry cannot ensure that misleading or confusing identifiers will never be created.
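As an illustration of how a user agent or registration system might consult the registry, the following is a minimal sketch in Python. The three entries are copied from the registry values in this document; the in-memory layout (`Entry`, `REGISTRY_EXCERPT`) and the function name `find_registry_hits` are assumptions made for this example only and are not defined by the registry.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    related: tuple      # related code point(s); empty if none listed
    references: tuple   # reference tags as they appear in the registry
    comment: str        # category and explanation


# Hypothetical in-memory excerpt. Keys are code point sequences, because
# some registry entries cover a sequence rather than a single code point.
REGISTRY_EXCERPT = {
    (0x064B,): Entry((), ("5564",),
                     "Not Recommended: Not to be used in zone files for "
                     "the Arabic language, per RFC 5564"),
    (0x0660,): Entry((0x06F0,), ("110",),
                     "Identical: Identical in appearance and meaning to "
                     "EXTENDED ARABIC-INDIC DIGIT ZERO"),
    (0x0A72, 0x0A3F): Entry((0x0A07,), ("401",),
                            "Not Recommended: Do not use for U+0A07"),
}


def find_registry_hits(identifier: str) -> list:
    """Return (offset, key, entry) for each registry entry found."""
    cps = tuple(ord(ch) for ch in identifier)
    hits = []
    for start in range(len(cps)):
        for key, entry in REGISTRY_EXCERPT.items():
            if cps[start:start + len(key)] == key:
                hits.append((start, key, entry))
    return hits


for offset, key, entry in find_registry_hits("\u0a72\u0a3f"):
    print(offset, ["U+%04X" % cp for cp in key], entry.comment)
```

A hit is only a prompt for a policy decision or a warning; an empty result does not show that an identifier is safe.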
The mechanism that the IETF has come to prefer for internationalization of identifiers may be called "inclusion-based identifier internationalization", or "inclusion" for short. Under inclusion, the characters that are permissible in identifiers for a protocol are selected from the set of all Unicode characters. One starts with an empty set of characters, and then gradually adds characters to the set, usually based on Unicode properties (see below, and also Section 3).
Inclusion depends in part on assumptions the IETF made when the strategy was adopted and developed; some of those assumptions were about the relationships between different characters and the likelihood that similar such relationships would get added to future versions of Unicode. Those assumptions turn out not to have been true in every case. Code points at issue are among those to be listed in the registry defined here. (See Section 5.)
The intent of Unicode is to encode all known writing systems into a single coded character set. One consequence of that goal is that Unicode encodes an enormous number of characters. Another is that the work of Unicode does not end until every writing system is encoded; even after that, it needs to continue to track any changes in those writing systems.
Unicode encodes abstract characters, not glyphs. Because of the way Unicode was built up over time, there are sometimes multiple ways to encode the same abstract character. For example, an e with an acute accent may be written by combining U+0065 LATIN SMALL LETTER E and U+0301 COMBINING ACUTE ACCENT, or it may be written U+00E9 LATIN SMALL LETTER E WITH ACUTE. If Unicode encodes an abstract character in more than one way, then for most purposes the different encodings should all be treated as though they're the same character. This "canonical equivalence" between encodings of the same abstract characters is explicitly called out by Unicode. A lack of a defined canonical equivalence is tantamount to an assertion by Unicode that the two encodings do not represent the same abstract character, even if both happen to result in the same appearance.
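The e-with-acute example can be checked with a short Python sketch using the standard `unicodedata` module. The helper name `canonically_equivalent` is an assumption of this example; it compares the canonical decompositions (NFD) of the two strings, which is one common way to test canonical equivalence.

```python
import unicodedata

decomposed = "\u0065\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
precomposed = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The raw code point sequences differ ...
print(decomposed == precomposed)                          # False
# ... but Unicode defines them as canonically equivalent.
print(canonically_equivalent(decomposed, precomposed))    # True
```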
Every encoded character in Unicode (more precisely, every code point) is associated with a set of properties. The properties define what script a code point is in, whether it is a letter or a number or punctuation and so forth, its direction when written, to what other code point or code point sequence it is canonically equivalent, and many other properties. These properties are important to the inclusion mechanism. They are defined in the Unicode Character Database [UCD] [UAX44].
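A few of these properties can be inspected from Python's standard `unicodedata` module, as in the sketch below. The function name `describe` and the dictionary keys are assumptions of this example. Note that the standard module does not expose every property in the Unicode Character Database (the Script property, for instance, is not available), so this is only a partial view.

```python
import unicodedata


def describe(cp: int) -> dict:
    """Collect a handful of UCD properties for one code point."""
    ch = chr(cp)
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "combining_class": unicodedata.combining(ch),
        "decomposition": unicodedata.decomposition(ch),
    }


for cp in (0x00E9, 0x0301, 0x0660):
    print("U+%04X" % cp, describe(cp))
```

For U+00E9 the decomposition value is "0065 0301", which is the canonical equivalence that normalization relies on.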
Inclusion depends on the assumption that such strings as will be used in identifiers will not have any ambiguous matching to other strings. In practice, this means that input strings to the protocol are expected to be in Normalization Form C. This way, any alternative sequences of code points for the same characters will be normalized to a single form. If all the characters in the string are also included for the protocol's candidate identifiers, then the string is eligible to be an identifier under the protocol.
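The eligibility test described above might be sketched as follows. The set `ALLOWED` is a small hypothetical stand-in for a protocol's included repertoire and does not reflect the actual IDNA or PRECIS derived properties; the function name `is_eligible` is likewise an assumption of this example.

```python
import unicodedata

# Hypothetical inclusion set, for illustration only.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789-") | {"\u00e9"}


def is_eligible(label: str) -> bool:
    """A label is eligible if it is non-empty, already in NFC, and every
    code point in it belongs to the included repertoire."""
    if not label:
        return False
    if unicodedata.normalize("NFC", label) != label:
        return False
    return all(ch in ALLOWED for ch in label)


print(is_eligible("caf\u00e9"))    # True: NFC and all code points included
print(is_eligible("cafe\u0301"))   # False: not in NFC
print(is_eligible("caf\u00e8"))    # False: U+00E8 is not in ALLOWED
```

The sketch rejects non-NFC input rather than silently normalizing it; whether to reject or normalize is a policy choice for the protocol.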
In principle, under inclusion identifiers should be unambiguous. It has always been recognized, however, that for humans some ambiguity is inevitable, because of the vagaries of writing systems and of human perception.
Normalization Form C ("NFC") removes the ambiguities based on dual or multiple encoding for the same abstract character. However, characters are not the same as their glyphs. This means that it is possible for certain abstract characters to share a glyph. We can call such abstract characters "homoglyphs". While this looks at first like something that should be handled (or should have been handled) by normalization (NFC or something else), there are important differences; the situation is in some sense an extreme case of a spectrum of ambiguity.
While Unicode deals in abstract characters and inclusion works on Unicode code points, users interact with strings as actually rendered: sequences of glyphs. There are characters that, depending on font, sometimes look quite similar to one another (such as "l" and "1"); any character that is like this is often called "visually similar". More difficult are characters that, in any normal rendering, always look the same as one another. The shared history of Cyrillic, Greek, and Latin scripts, for example, means that there are characters in each script that function similarly and that are usually indistinguishable from one another, though they are not the same abstract character. These are examples of "homoglyphs." Any character that can be confused for another one can be called confusable, and confusability can be thought of as a spectrum with "visually similar" at one end, and "homoglyphs" at the other. (We use the term "homoglyph" strictly: code points that normally use the same glyph when rendered.)
Note that homoglyphs are not restricted to cross-script scenarios - there are a number of homoglyphs where both code points or sequences are part of the same script.
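The following Python sketch illustrates that normalization does not unify homoglyphs. It uses one cross-script pair (Latin "a" and Cyrillic "а") and one same-script pair taken from the registry values (U+0660 and U+06F0). The names `HOMOGLYPH_PAIRS` and `unified_by_normalization` are assumptions of this example.

```python
import unicodedata

HOMOGLYPH_PAIRS = [
    ("\u0061", "\u0430"),  # LATIN SMALL LETTER A / CYRILLIC SMALL LETTER A
    ("\u0660", "\u06f0"),  # ARABIC-INDIC / EXTENDED ARABIC-INDIC DIGIT ZERO
]

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def unified_by_normalization(a: str, b: str) -> bool:
    """True if any standard normalization form maps a and b to one string."""
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in FORMS
    )


for a, b in HOMOGLYPH_PAIRS:
    print("U+%04X" % ord(a), "U+%04X" % ord(b), unified_by_normalization(a, b))
```

Both pairs remain distinct under every normalization form, which is why such cases have to be handled by policy rather than by normalization.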
A further issue is introduced by the fact that Unicode caters not only to living and dead languages, but also to scholarly and scientific notation, as well as specialized modes of written text, such as for poetry, religious works, or texts to be sung or chanted. Where these notations use symbols, they are excluded under inclusion, but where they use varieties of letter forms or marks used with letters, they are included by default. Some of these letters or marks have been incorporated over time into orthographies for living languages, which is one reason they were not rigorously excluded from the start. However, in some cases they may (alone or in combination with ordinary letters) appear the same as (or very similar to) existing letters. This makes some of these characters, and especially the marks in question, "troublesome".
Finally, IDNA 2008 has a limited appreciation for the fact that characters in complex scripts, unlike ASCII letters, cannot simply occur in random sequences. Neither software (for display or data entry) nor readers are prepared to process some of these code points "out of order". For such scripts, without a policy that describes permissible contexts, labels could be registered that cannot be rendered or typed reliably and that most users would not know how to read or recognize. In some cases, combining sequences typed in the "wrong" order may display identically to those typed in the "correct" order; this again needs to be sorted out by defining permissible contexts, for example by using the context rule mechanism in [RFC7940].
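A permissible-context policy can be pictured with a small Python sketch. It encodes the restriction given in the registry values for U+0E45, which is stated there to occur only after U+0E24 or U+0E26. This is a plain predicate written for illustration; it is not the context rule syntax of [RFC7940], and the names used are assumptions of this example.

```python
LAKKHANGYAO = "\u0e45"
# Per the registry comment for U+0E45: only after U+0E24 RU or U+0E26 LU.
ALLOWED_PRECEDING = {"\u0e24", "\u0e26"}


def lakkhangyao_context_ok(label: str) -> bool:
    """Reject any U+0E45 that is not immediately preceded by U+0E24/U+0E26."""
    for i, ch in enumerate(label):
        if ch == LAKKHANGYAO:
            if i == 0 or label[i - 1] not in ALLOWED_PRECEDING:
                return False
    return True


print(lakkhangyao_context_ok("\u0e24\u0e45"))  # True: permitted context
print(lakkhangyao_context_ok("\u0e01\u0e45"))  # False: wrong preceding letter
print(lakkhangyao_context_ok("\u0e45"))        # False: no preceding letter
```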
There are a number of cases that illustrate the combining sequence or digraph issue:
Other cases that demonstrate that the issue does not lie exclusively or primarily with combining sequences:
Cross-script homoglyphs usually do not involve combining sequences, but can be mitigated by rules requiring strings to be in a single script. For zones that support multiple scripts, it may be necessary to have policies to prevent whole-script homographs: labels entirely in one script that look the same as another label in another script. One method would be to define "blocked" variants (see [RFC7940] and [RFC8228]).
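A single-script rule, and its limits, can be sketched in Python as follows. Python's standard library does not expose the Unicode Script property, so this example uses the first word of a letter's character name as a crude stand-in for its script. That heuristic is an assumption of this sketch and is not reliable in general; a real implementation would use the Script property data from the Unicode Character Database. The function names are also assumptions.

```python
import unicodedata


def letter_scripts(label: str) -> set:
    """Approximate the set of scripts used by the letters in a label.

    Heuristic only: takes the first word of each letter's character name
    (e.g. LATIN, CYRILLIC, GREEK). Digits and punctuation are skipped.
    """
    scripts = set()
    for ch in label:
        if unicodedata.category(ch).startswith("L"):
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_single_script(label: str) -> bool:
    return len(letter_scripts(label)) <= 1


# Mixed Latin and Cyrillic: caught by the single-script rule.
print(is_single_script("p\u0430ypal"))   # False
# Entirely Cyrillic, yet it may render like Latin "pa": not caught.
print(is_single_script("\u0440\u0430"))  # True
```

The second case is a whole-script homograph: it passes the single-script rule, which is why an additional mechanism such as blocked variants may be needed.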
Note to RFC Editor: this section should be removed prior to publication as an RFC.
This Internet-Draft may be discussed on the IAB Internationalization public list: i18n-discuss@iab.org.
Note to RFC Editor: this section should be removed prior to publication as an RFC.