Those Troublesome Characters: A Registry of Unicode Code Points Needing Special Consideration When Used in Network Identifiers
draft-freytag-troublesome-characters-01

Abstract

Unicode's design goal is to be the universal character set for all applications. The goal entails the inclusion of very large numbers of characters. It is also focused on written language; special provisions have always been needed for identifiers. The sheer size of the repertoire increases the possibility of accidental or intentional use of characters that can cause confusion among users, particularly where linguistic context is ambiguous, unavailable, or impossible to determine. A registry of code points that can sometimes be especially problematic may be useful to guide system administrators in setting parameters for allowable code points in an identifier system, and to aid applications in creating security aids for users.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on January 1, 2018.

Copyright Notice

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

1. Unicode code points and identifiers
2. Background and Conventions
3. Techniques already in place
4. A registry of code points

4.1. Discussion
4.2. Registry initial contents

4.2.1. Code Point Table
4.2.2. References for Registry

5. IANA Considerations
6. Security Considerations
7. References

7.1. Normative References
7.2. Informative References

Appendix A. Additional Background

A.1. The Theory of Inclusion
A.2. The Difference Between Theory and Practice

A.2.1. Confusability
A.2.2. Not everything can be solved

Appendix B. Examples
Appendix C. Discussion Venue
Appendix D. Change History
Authors' Addresses

1. Unicode code points and identifiers

Unicode [Unicode] is a coded character set that aims to support every writing system. Writing systems evolve over time and are sometimes influenced by one another. As a result, Unicode encodes many characters that, to a reader, appear to be the same thing, but that are encoded differently from one another. This sort of difference is usually not important in written texts, because competent readers and writers of a language are able to compensate for the selection of the "wrong" character when reading or writing. Finally, the goal of supporting every writing system also implies that Unicode is designed to properly represent text in written languages, so special provisions are needed for identifiers.

Identifiers that are used in a network or, especially, an Internet context present several special problems because of the above feature of Unicode:

In many (perhaps most) uses of identifiers, it is either practically difficult or impossible to ascertain the correct language context in which the identifier is being or will be used. In the case of an internationalized domain name, for instance, each label could in principle represent a new locus of control, because there could be a delegation there. A new locus of control means that the administrator of the resulting zone could speak, read, or intend a different language context than the one from the parent. Moreover, at least some domains (such as the root) have an Internet-wide context and therefore do not really have a language context as such. In any case, the language context is simply not available as part of a DNS lookup, so there is no way to make the DNS sensitive to this sort of issue. Even in the case of email local-parts, where a sender is likely to know at least one of the languages of the receiver, the language context that was in use at the time the identifier was created is often unknown.
Identifiers on the network are in general exact-match systems, because an ambiguous identifier is problematic. Sometimes, but not always, there are facilities for aliasing such that multiple identifiers can be put together as a single identity; the DNS, for example, does not have such an aliasing capability, because in the DNS all aliases are one-way pointers. Aliasing techniques are in any case just an extension of the exact-match approach, and do not work the way a competent human reader does when interpolating the "right" character upon seeing the "wrong" one.
Because there are many characters that may appear to be the same (or even, that are defined in such a way that they are all but guaranteed to be rendered by the same glyphs), it is fairly easy to create an identifier either by accident or on purpose that is likely to be confused with some other identifier even by competent readers and writers of a language.
For some scripts the repertoire of shapes is shared, so that there are cases of two strings in which all the code points in one script in the first string, and all the code points in another script in the second string, are respectively confusable with one another. In that case, the strings cannot be distinguished by a reader, and the whole string is confusable.
For some scripts, neither users nor rendering systems expect to encounter code points in arbitrary sequence. Most code points normally occur only in specific locations within a syllable. If random labels were permitted, some would not display as expected (including having some features misplaced or not displayed) while others would present recognition problems to users experienced with the script. Some devices may also not support arbitrary input.
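The exact-match problem described in the list above can be illustrated with a minimal Python sketch. It assumes only the standard unicodedata module and uses a pair that is also listed in the registry table (U+00F8 versus U+006F U+0337). The label strings and the function name exact_match are made up for illustration.

```python
import unicodedata

# Two hypothetical labels that usually render alike:
# precomposed U+00F8 versus U+006F followed by combining U+0337.
precomposed = "s\u00f8ren"
sequence = "so\u0337ren"

def exact_match(a: str, b: str) -> bool:
    """Compare two identifiers as an exact-match system might, after NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(exact_match(precomposed, precomposed))  # True
print(exact_match(precomposed, sequence))     # False
```

Normalization does not help here because U+00F8 has no canonical decomposition, so the two strings remain distinct identifiers even though a reader is unlikely to tell them apart.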

Beyond these issues, human perception is easily tricked, so that entirely unrelated character sequences can become confusable -- for example "rn" being confused with "m". Humans read strings, not characters, and they will mostly see what they expect to see. Some additional discussion of the background can be found in Appendix A.

2. Background and Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

A reader needs to be familiar with Unicode [Unicode], IDNA2008 [RFC5890] [RFC5891] [RFC5892] [RFC5893] [RFC5894], PRECIS (at least the framework, [RFC7564]), and conventions for discussion of internationalization in the IETF (see [RFC6365]).

3. Techniques already in place

In the IDNA mechanism for including Unicode code points [RFC5892], a code point is only included when it meets the needs of internationalizing domain names as explained in the IDNA framework [RFC5894]. For identifiers other than those specified by IDNA, the PRECIS framework [RFC7564] generalizes the same basic technique. In both cases, the overall approach is to assume that all characters are excluded, and then to include characters according to properties derived from the Unicode character properties. This general strategy cuts the enormous size of the Unicode database somewhat, avoiding including some characters that are necessarily unsuited for use as identifiers.
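The inclusion approach can be sketched roughly as follows. This is a simplification under stated assumptions, not the derivation in [RFC5892]: it looks only at the Unicode General_Category, and the set INCLUDED_CATEGORIES and the function name included_by_property are illustrative choices. The actual rules add exceptions, stability checks, and contextual rules that are omitted here.

```python
import unicodedata

# Assumption: a rough stand-in for a letters-and-digits test based on
# General_Category alone. Exceptions, normalization and case-folding
# stability, and contextual rules are deliberately left out.
INCLUDED_CATEGORIES = {"Ll", "Lu", "Lo", "Lm", "Mn", "Mc", "Nd"}

def included_by_property(ch: str) -> bool:
    """Start from 'everything excluded'; include only by derived property."""
    return unicodedata.category(ch) in INCLUDED_CATEGORIES

for cp in (0x0061, 0x02BC, 0x2019, 0x0021):
    ch = chr(cp)
    status = "included" if included_by_property(ch) else "excluded"
    print(f"U+{cp:04X} {unicodedata.category(ch)} {status}")
```

In this sketch U+02BC (a modifier letter) is included while U+2019 (punctuation) is excluded, even though the registry table lists U+02BC as indistinguishable from U+2019.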

The mechanism of inclusion by derived property, while helpful, is insufficient to guarantee every included character is safe for use in identifiers. Some characters' properties lead them to be included even though they are not obviously good candidates. In other cases, individual characters are good for inclusion, but are problematic in combination. Finally, there are cases where characters (or sequences of characters) are not problematic by themselves, or if used in a mutually exclusive manner in the same identifier, but become problematic when their choice represents the only difference between otherwise identical identifiers. For some examples, see Appendix B.

Operators of systems that create identifiers (whether through a registry or through a peer-to-peer identifier negotiation system) need to make policies for characters they will permit. Operators of registries, for instance, can help by adopting good registration policies: "Users will benefit if registries only permit characters from scripts that are well-understood by the registry or its advisers." [RFC5894] The difficulty for many operators, however, is that they do not have the writing system expertise to claim any character is "well-understood", and they do not really have the time to develop that expertise. Such operators should in fact not use or register such characters. Unfortunately, in many cases the operators are stewards of systems where the user population demands identifiers useful to them in their local languages. In other cases, operators may proceed without a proper understanding owing to financial or market share incentives. The risk for Internet identifiers in such cases is obviously that ill-understood and potentially exploitable gaps in registration policies will open. To mitigate such issues, a registry of Unicode code points that present special issues for network identifiers can help guide protocol and operating decisions about whether to permit a given code point or sequence of code points. This will not completely protect against poor registration or use, but it may provide operational guidance necessary for people who are responsible for creating policies.

Note that the registry defined herein does not address any of the issues created by whole-string confusables where each of the identifiers is of a different script. A common workaround, limiting a registry to identifiers of only a single script, would mitigate this issue.
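The single-script workaround can be sketched as below. The Python standard library does not expose the Unicode Script property, so this sketch guesses the script from the first word of the character name. That heuristic, and the names script_guess and permitted_in_zone, are assumptions made for illustration only; a real implementation would use the Script property from the Unicode Character Database.

```python
import unicodedata

def script_guess(ch: str) -> str:
    """Heuristic only: use the first word of the character name as the script."""
    if ch.isascii() and (ch.isdigit() or ch == "-"):
        return "COMMON"
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def permitted_in_zone(label: str, zone_script: str) -> bool:
    """Accept a label only if every character is common or in the zone's script."""
    return all(script_guess(ch) in (zone_script, "COMMON") for ch in label)

latin = "coco"
cyrillic = "\u0441\u043e\u0441\u043e"  # four Cyrillic letters that resemble "coco"

print(permitted_in_zone(latin, "LATIN"))     # True
print(permitted_in_zone(cyrillic, "LATIN"))  # False
```

A zone restricted in this way never holds both strings at once, which is the sense in which the workaround mitigates whole-string confusables.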

For some of the listed code points (or code point sequences) that present issues for identifiers, it may be most expeditious to simply not include them, even though they are valid according to the protocol. Sometimes, one of a pair of identical code points (or code point sequences) may be deemed preferable over the other for practical reasons.

In the case of registries, it is not always necessary or desirable to exclude characters. Sometimes, it is merely necessary to ensure that for two otherwise identical identifiers, only one of a set of mutually exclusive characters (or sequences of characters) is used, while preventing the later registration of the label containing the other one in order to avoid ambiguity. This way the operator does not need to make a choice. In certain cases, where both of these identifiers mean the same thing, an operator may decide to allow both labels to be registered simultaneously, but only to the same entity.
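One way such a policy might be implemented is sketched below. The VARIANTS mapping is a small hand-picked subset of pairs from the code point table, and the class name Registry, the function variant_key, and the flag allow_same_entity are hypothetical names, not part of any existing registration system.

```python
# Assumption: a tiny subset of variant pairs taken from the code point
# table; each listed sequence maps to one representative form.
VARIANTS = {
    "\u006f\u0337": "\u00f8",
    "\u006c\u0335": "\u019a",
    "\u01dd": "\u0259",
}

def variant_key(label: str) -> str:
    """Replace every listed variant with its representative form."""
    for seq, rep in VARIANTS.items():
        label = label.replace(seq, rep)
    return label

class Registry:
    def __init__(self, allow_same_entity: bool = False):
        self.allow_same_entity = allow_same_entity
        self._holder = {}     # variant key -> registrant holding it
        self._labels = set()  # labels actually registered

    def register(self, label: str, registrant: str) -> bool:
        key = variant_key(label)
        holder = self._holder.get(key)
        if holder is None:
            # First come, first served: either variant may be registered first.
            self._holder[key] = registrant
            self._labels.add(label)
            return True
        if label in self._labels:
            return False  # this exact label is already taken
        if self.allow_same_entity and holder == registrant:
            self._labels.add(label)
            return True
        return False      # the other variant is blocked

reg = Registry()
print(reg.register("s\u00f8ren", "alice"))  # True
print(reg.register("so\u0337ren", "bob"))   # False
```

The variant key is used only for the collision check. The label is stored exactly as requested, so the operator never has to decide which of the two forms is the preferred one.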

In every case, the registry here defined includes code points that require special attention when they are to be used in identifiers. An administrator who does not have the time or inclination to develop the requisite understanding would be well-advised simply not to permit these code points at all.

4. A registry of code points

4.1. Discussion

The registry contains three fields. The first field, called "Code Point(s)", is a code point or sequence of code points. The second, "Cross Reference", contains zero or more cross references to related code points. The third, called "Explanation", is a free form text field that briefly describes the issue. The explanation field also contains one or more references to documents defining the code point and the reason why it presents an issue. These references may be to documents external to the registry, so long as the reference is stable.
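As an illustration of the three fields, the following sketch parses one registry entry into a small record. It assumes a serialization with one entry per line and tab-separated fields, which this document does not define; the names Entry, parse_sequence, and parse_entry are likewise illustrative.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

CodeSeq = Tuple[int, ...]

@dataclass
class Entry:
    code_points: CodeSeq       # "Code Point(s)" field
    cross_refs: List[CodeSeq]  # "Cross Reference" field, possibly empty
    explanation: str           # "Explanation" field, references included
    references: List[str]      # bracketed references found in the explanation

def parse_sequence(text: str) -> CodeSeq:
    """Turn '0649 065A' into (0x0649, 0x065A)."""
    return tuple(int(h, 16) for h in text.split())

def parse_entry(line: str) -> Entry:
    """Assumes the three fields are separated by tab characters."""
    cps, xrefs, explanation = line.rstrip("\n").split("\t", 2)
    return Entry(
        code_points=parse_sequence(cps),
        cross_refs=[parse_sequence(x) for x in xrefs.split(",") if x.strip()],
        explanation=explanation.strip(),
        references=re.findall(r"\[([^\]]+)\]", explanation),
    )

entry = parse_entry("01DD\t0259\tIdentical: Identical in appearance to U+0259 [150]")
print(entry.code_points, entry.cross_refs, entry.references)
```

Keeping the cross references as sequences rather than single code points matters because many entries refer to a base letter followed by one or more combining marks.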

The registry is not intended as an alternative to normal operational policies that are used for protocols under normal administrative scope. For instance, zone operators that support IDNA are expected to create policies governing the code points that they will permit (see [RFC5894] and [I-D.rfc5891bis]). The registry herein defined is intended to highlight particularly troublesome code points or code point sequences for the benefit of administrators creating such policies. It is also intended to highlight characters that may create identifier ambiguities and thereby create security vulnerabilities.

If a character appears in the registry, that does not automatically mean that it is a bad candidate for use in identifiers generally. Absent a well-defined and verifiable policy, however, such a code point or sequence might well be treated with suspicion by users and by tools.
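A tool that treats registered code points with suspicion might start from something like the following sketch. The FLAGGED mapping is a hand-copied excerpt of the code point table, and the function name flag_label is hypothetical. The sketch only reports matches and leaves the policy decision to the caller.

```python
# Assumption: a small excerpt of the code point table, keyed by the code
# point sequence; a real tool would load the complete registry.
FLAGGED = {
    "\u0307": "Restricted Context",
    "\u02bc": "Not Recommended",
    "\u006f\u0337": "Identical",
    "\u00f8": "Identical",
    "\u0337": "Not Recommended",
}

def flag_label(label: str):
    """Return (offset, sequence as U+XXXX text, category) for each match."""
    hits = []
    for i in range(len(label)):
        for seq, category in FLAGGED.items():
            if label.startswith(seq, i):
                shown = " ".join(f"U+{ord(c):04X}" for c in seq)
                hits.append((i, shown, category))
    return hits

print(flag_label("example"))
# []
print(flag_label("so\u0337ren"))
# [(1, 'U+006F U+0337', 'Identical'), (2, 'U+0337', 'Not Recommended')]
```

A label can match more than one entry, as in the second call, where the sequence and its combining mark are both listed. Reporting every match lets a policy distinguish between entries that call for exclusion and entries that only call for a collision check.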

The registry is updated by Expert Review. It ought to contain only code points that are significant in identifiers and that need special policies (including policies of exclusion). Only code points that are eligible for use in identifiers (i.e. that are not DISALLOWED) ought to be included. Code points that are CONTEXTJ or CONTEXTO ought to only be included if concerns are identified that are not mitigated by the existing IDNA context rules.

4.2. Registry initial contents

4.2.1. Code Point Table

Registry of Unicode Code Points Requiring Special Consideration in Network Identifiers
Code Point or Sequence	Cross Reference	Explanation
0307		Restricted Context: By definition, LATIN SMALL LETTER I plus combining DOT ABOVE renders exactly the same as LATIN SMALL LETTER I by itself and does so in practice for any good font. The same is true for all Unicode characters with the soft_dotted property; they lose their dot if followed by a combining mark. DOT ABOVE should be excluded, or restricted to contexts where it does not follow a soft_dotted letter. [115]
006C 0335	019A	Identical: Usually indistinguishable from LETTER L WITH BAR
006F 0337	00F8	Identical: Usually indistinguishable from LETTER O WITH STROKE
00F8	006F 0337	Identical: Usually indistinguishable in appearance from LETTER O plus combining SHORT SOLIDUS OVERLAY
02A6	0074 0073	Identical: Looks like LETTER T plus LETTER S, except for slight kerning
0074 0073	02A6	Identical: Looks like TS DIGRAPH, except for lack of kerning
019A	006C 0335	Identical: Usually indistinguishable from LETTER L plus combining SHORT STROKE OVERLAY
01C0		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
01C1		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
01C2		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
01C3		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
01DD	0259	Identical: Identical in appearance to U+0259 [150]
0259	01DD	Identical: Identical in appearance to U+01DD [150]
02B9		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
02BA		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
02BC		Not Recommended: Indistinguishable from a punctuation character (U+2019), which is not PVALID [6912]
02BD		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
02BE		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
02BF		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
02C0		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
02C1		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
02C6		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
02C7		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
02C8		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
02C9		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
02CA		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
02CB		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
02CC		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
02CD		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
02CE		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
02CF		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
02D0		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
02D1		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
02EC		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
02EE		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
0321		Not Recommended: Not intended for forming combined orthographic letters
0322		Not Recommended: Not intended for forming combined orthographic letters
0334		Not Recommended: Not intended for forming combined orthographic letters
0335		Not Recommended: Not intended for forming combined orthographic letters
0336		Not Recommended: Not intended for forming combined orthographic letters
0337		Not Recommended: Not intended for forming combined orthographic letters
0338		Not Recommended: Not intended for forming combined orthographic letters
0633 065C 06EC	069A	Identical: Identical in appearance to U+069A [300]
06A1 065C 06EC	0641 065C, 06A3	Identical: Identical in appearance to U+06A3 and to U+0641 U+065C [300]
0633 06DB 065C	06FA	Identical: Identical in appearance to U+06FA [300]
0635 065C 06EC	0636 065C, 06FB	Identical: Identical in appearance to U+06FB and to U+0636 U+065C [300]
0639 065C 06EC	06FC, 063A 065C	Identical: Identical in appearance to U+06FC and U+063A U+065C [300]
06BA 065C 06EC	0646 065C, 06B9	Identical: Identical in appearance to U+06B9 and to U+0646 U+065C [300]
06CF	0648 06EC	Identical: Identical in appearance to U+0648 U+06EC [300]
063A	0639 06EC	Identical: Identical in appearance to U+0639 U+06EC [300]
0636	0635 06EC	Identical: Identical in appearance to U+0635 U+06EC [300]
062E	062D 06EC	Identical: Identical in appearance to U+062D U+06EC [300]
06BF	0686 06EC	Identical: Identical in appearance to U+0686 U+06EC [300]
0630	062F 06EC	Identical: Identical in appearance to U+062F U+06EC [300]
0632	0631 06EC	Identical: Identical in appearance to U+0631 U+06EC [300]
06B6	0644 06EC	Identical: Identical in appearance to U+0644 U+06EC [300]
06AC	0643 06EC	Identical: Identical in appearance to U+0643 U+06EC [300]
06BB	066E 0615, 06BA 0615, 0679	Identical: Identical in appearance to U+06BA U+0615 and to U+0679 or U+066E U+0615 when assuming initial or medial form [300]
0679	06BB, 066E 0615, 06BA 0615	Identical: Identical in appearance to U+066E U+0615 and to U+06BB or U+06BA U+0615 when assuming initial or medial form [300]
06FF	06BE 065B, 0647 065B	Identical: Identical in appearance to U+06BE U+065B and to U+0647 U+065B [300]
06C7	0648 064F, 0648 0619	Identical: Identical in appearance to U+0648 U+064F and to U+0648 U+0619 [300]
063D	06CC 065B	Identical: Identical in appearance to U+06CC U+065B [300]
0648 06EC	06CF	Identical: Identical in appearance to U+06CF [300]
0639 06EC	063A	Identical: Identical in appearance to U+063A [300]
0635 06EC	0636	Identical: Identical in appearance to U+0636 [300]
062D 06EC	062E	Identical: Identical in appearance to U+062E [300]
0686 06EC	06BF	Identical: Identical in appearance to U+06BF [300]
062F 06EC	0630	Identical: Identical in appearance to U+0630 [300]
0631 06EC	0632	Identical: Identical in appearance to U+0632 [300]
0644 06EC	06B6	Identical: Identical in appearance to U+06B6 [300]
066F 06EC	0641, 06A1 06EC, 06A7	Identical: Identical in appearance to U+06A7 and to U+0641 or U+06A1 U+06EC when assuming initial or medial form [300]
06A1 06EC	0641, 06A7, 066F 06EC	Identical: Identical in appearance to U+0641 and to U+06A7 or U+066F U+06EC when assuming initial or medial form [300]
06BA 06EC	0646	Identical: Identical in appearance to U+0646 [300]
0643 06EC	06AC	Identical: Identical in appearance to U+06AC [300]
06BA 0615	0679, 06BB, 066E 0615	Identical: Identical in appearance to U+06BB and to U+0679 or U+066E U+0615 when assuming initial or medial form [300]
066E 0615	0679, 06BA 0615, 06BB	Identical: Identical in appearance to U+0679 and to U+06BB or U+06BA U+0615 when assuming initial or medial form [300]
06CC 065B	063D	Identical: Identical in appearance to U+063D [300]
0648 064F	0648 0619, 06C7	Identical: Identical in appearance to U+0648 U+0619 and to U+06C7 [300]
0648 0619	0648 064F, 06C7	Identical: Identical in appearance to U+0648 U+064F and to U+06C7 [300]
0615		Not Recommended: Part of homoglyph sequence(s) not covered by normalization. [300]
0626	0649 0654, 064A 0654, 06CC 0654	Identical: Identical in appearance to YEH plus combining HAMZA ABOVE and U+06CC or U+064A plus combining HAMZA ABOVE [300]
0628 0654	08A1	Identical: Identical in appearance to U+08A1 [IAB]
0629	06C3	Identical: Identical in appearance to U+06C3 when assuming final form [300]
062D 0615	0772	Identical: Identical in appearance to HAH with SMALL TAH ABOVE [300]
062D 0654	0681	Identical: Identical in appearance to U+0681 [300]
062F 0615	0688	Identical: Identical in appearance to U+0688 [300]
062F 065B	06EE	Identical: Identical in appearance to U+06EE [300]
0631 0615	0691	Identical: Identical in appearance to U+0691 [300]
0631 0654	076C	Identical: Identical in appearance to U+076C [300]
0631 065B	06EF	Identical: Identical in appearance to U+06EF [300]
0641	066F 06EC, 06A1 06EC, 06A7	Identical: Identical in appearance to U+06A1 U+06EC and to U+06A7 or U+066F U+06EC when assuming initial or medial form [300]
0643	06A9	Identical: Identical in appearance to U+06A9 KEHEH when assuming initial form [300]
0644 065A	06B5	Identical: Identical in appearance to U+06B5 [300]
0646	06BA 06EC, 06BA	Identical: Identical in appearance to U+06BA U+06EC and to U+06BA when assuming initial or medial form [300]
0646 0615	0768	Identical: Identical in appearance to U+0768 [300]
0646 065A	0769	Identical: Identical in appearance to U+0769 [300]
0647	06BE, 06C1, 06D5	Identical: Identical in appearance to AE when assuming final or isolated form; Identical in appearance to U+XXX when assuming initial or medial form; identical in appearance to U+XXX when assuming isolated form [300]
0647 0654	06C0, 06C2	Identical: Identical in appearance to U+06C2 and U+06C0 [300]
0647 065B	06BE 065B, 06FF	Identical: Identical in appearance to U+06FF and to U+06BE plus combining INVERTED SMALL V ABOVE [300]
0648 065A	06C6	Identical: Identical in appearance to U+06C6
0648 065B	06C9	Identical: Identical in appearance to U+06C9
0648 0670	06C8	Identical: Identical in appearance to YU U+06C8
0649	06CC, 064A	Restricted Context: Not intended to be used with HAMZA ABOVE, use U+0626 instead, identical in appearance to U+064A when assuming initial or medial form [99] [115] [300]
0649 0654	0626, 06CC 0654	Not Recommended: This sequence not to be used; Identical in appearance in initial position to HIGH HAMZA YEH $$$, as it would be identical in appearance to U+0626 [99] [115] [300]
06CC 0654	0649 0654, 0626	Identical: identical in appearance in one or more positions to U+0626 [99] [300]
0649 065A	06CC 065A, 06CE	Identical: Identical in appearance to U+06CE and to U+06CC plus combining SMALL V ABOVE [300]
064A	06CC, 0649	Identical: Identical in appearance to U+06CC when assuming final or isolated form [300]
064A 0654	0626, 08A8	Identical: U+064A is supposed to lose its dots when combined with HAMZA ABOVE, which would make the sequence U+064A U+0654 identical in appearance to U+0626. In some fonts, the dots are retained, and the sequence is then identical in appearance to U+08A8 [99] [300]
064B		Not Recommended: Not to be used in zone files for the Arabic language, per RFC 5564 [5564]
064C		Not Recommended: Not to be used in zone files for the Arabic language, per RFC 5564 [5564]
064D		Not Recommended: Not to be used in zone files for the Arabic language, per RFC 5564 [5564]
064E		Not Recommended: Not to be used in zone files for the Arabic language, per RFC 5564 [5564]
064F		Not Recommended: Not to be used in zone files for the Arabic language, per RFC 5564. Also: Part of homoglyph sequence(s) not covered by normalization. [300] [5564]
0650		Not Recommended: Not to be used in zone files for the Arabic language, per RFC 5564 [5564]
0651		Not Recommended: Not to be used in zone files for the Arabic language, per RFC 5564 [5564]
0652		Not Recommended: Not to be used in zone files for the Arabic language, per RFC 5564 [5564]
0654		Not Recommended: Part of homoglyph sequence(s) not covered by normalization. [300]
065A		Not Recommended: Part of homoglyph sequence(s) not covered by normalization. [300]
065B		Not Recommended: Part of homoglyph sequence(s) not covered by normalization. [300]
065C		Not Recommended: Part of homoglyph sequence(s) not covered by normalization. [300]
0660	06F0	Identical: Identical in appearance and meaning to EXTENDED ARABIC-INDIC DIGIT ZERO [110]
0661	06F1	Identical: Identical in appearance and meaning to EXTENDED ARABIC-INDIC DIGIT ONE [110]
0662	06F2	Identical: Identical in appearance and meaning to EXTENDED ARABIC-INDIC DIGIT TWO [110]
0663	06F3	Identical: Identical in appearance and meaning to EXTENDED ARABIC-INDIC DIGIT THREE [110]
0667	06F7	Identical: Usually identical in appearance and meaning to EXTENDED ARABIC-INDIC DIGIT SEVEN [110]
0668	06F8	Identical: Identical in appearance and meaning to EXTENDED ARABIC-INDIC DIGIT EIGHT [110]
0669	06F9	Identical: Identical in appearance and meaning to EXTENDED ARABIC-INDIC DIGIT NINE [110]
0673		Other issue: Deprecated; If required, use sequence U+0627 U+065F instead
0681	062D 0654	Identical: Identical in appearance to HAH plus combining HAMZA ABOVE [300]
0688	062F 0615	Identical: Identical in appearance to DAL plus combining SMALL HIGH TAH
068A 0615	068B, 0688 065C	Identical: Identical in appearance to U+068B and U+0688 U+065C [300]
068B	0688 065C, 068A 0615	Identical: Identical in appearance to DAL WITH DOT BELOW plus combining SMALL HIGH TAH [300]
0691	0631 0615	Identical: Identical in appearance to REH plus combining SMALL HIGH TAH
069A	0633 065C 06EC	Identical: Identical in appearance to combining sequence with two combining marks [300]
06B9	0646 065C, 06BA 065C 06EC	Identical: Identical in appearance to U+0646 U+065C and combining sequence with two combining marks [300]
06A9	0643	Identical: Identical in appearance to U+0643 KAF when assuming initial form [300]
06BE	0647, 06C1, 06D5	Identical: Identical in appearance to U+0647 when assuming initial or medial form and to U+06D5 when assuming final form [300]
06CC	064A, 0649	Identical: Identical in appearance to U+064A when assuming initial or medial form and to U+0649 when assuming final or isolated form [300]
067B	06D0	Identical: Identical in appearance to U+06D0 when assuming initial form [300]
0670		Not Recommended: Part of homoglyph sequence(s) not covered by normalization. [300]
067E	06BD, 06BA 06DB	Identical: Identical in appearance to U+06BD or U+06BA U+06DB when assuming initial or medial form [300]
06A4	06A8, 06A1 06DB, 066F 06DB	Identical: Identical in appearance to U+06A8 and U+066F U+06DB when assuming initial or medial form and to U+06A1 U+06DB [300]
06A7	0641, 066F 06EC, 06A1 06EC	Identical: Identical in appearance to U+066F U+06EC and to U+0641 or U+06A1 U+06EC when assuming initial or medial form [300]
06A8	06A4, 06A1 06DB, 066F 06DB	Identical: Identical in appearance to U+06A4 and to U+06A1 U+06DB when assuming initial or medial form and to U+066F U+06DB [300]
06BA	0646	Identical: Identical in appearance to U+0646 when assuming initial or medial form [300]
06B5	0644 065A	Identical: Identical in appearance to U+0644 with SMALL V ABOVE [300]
06C0	0647 0654, 06C2	Identical: Identical in appearance to U+06C2 when assuming final form and to U+0647 with HAMZA ABOVE [300]
06C1	0647, 06BE, 06D5	Identical: Identical in appearance to U+0647 and U+06D5 when assuming isolated form [300]
06C2	0647 0654, 06C0	Identical: Identical in appearance to U+06C0 when assuming final form and to U+0647 with HAMZA ABOVE [300]
06C3	0629	Identical: Identical in appearance to U+0629 when assuming final form [300]
06C6	0648 065A	Identical: Identical in appearance to WAW plus combining SMALL V ABOVE [300]
06C8	0648 0670	Identical: Identical in appearance to WAW plus combining SUPERSCRIPT ALEF U+0648 U+0670
066E 065A	0756	Identical: Identical in appearance to BEH WITH SMALL V
0697 0615	0771	Identical: Identical in appearance to REH with SMALL TAH AND TWO DOTS
06C9	0648 065B	Identical: Identical in appearance to WAW plus combining INVERTED SMALL V ABOVE
06CE	0649 065A, 06CC 065A	Identical: Identical in appearance to YEH and ALEF MAKSURA, each plus combining SMALL V ABOVE [300]
06CC 065A	06CE, 0649 065A	Identical: Identical in appearance to U+06CE, and to ALEF MAKSURA plus combining SMALL V ABOVE [300]
06D0	067B	Identical: Identical in appearance to U+067B when assuming initial form [300]
06D5	0647, 06C1, 06BE	Identical: Identical in appearance to U+0647 HEH when assuming final or isolated form, and to U+06C1 when assuming isolated form [300]
06D6		Not Recommended: Specialized use; Quranic marks not used in writing contemporary Arabic script based languages; hard to distinguish at small sizes. Not suitable for identifiers. [115] [300]
06D7		Not Recommended: Specialized use; Quranic marks not used in writing contemporary Arabic script based languages; hard to distinguish at small sizes. Not suitable for identifiers. [115] [300]
06D8		Not Recommended: Specialized use; Quranic marks not used in writing contemporary Arabic script based languages; hard to distinguish at small sizes. Not suitable for identifiers. [115] [300]
06D9		Not Recommended: Specialized use; Quranic marks not used in writing contemporary Arabic script based languages; hard to distinguish at small sizes. Not suitable for identifiers. [115] [300]
06DA		Not Recommended: Specialized use; Quranic marks not used in writing contemporary Arabic script based languages; hard to distinguish at small sizes. Not suitable for identifiers. [115] [300]
06DB		Not Recommended: Specialized use; Quranic marks not used in writing contemporary Arabic script based languages; hard to distinguish at small sizes. Not suitable for identifiers. Part of homoglyph sequence(s) not covered by normalization. [115] [300]
06DC		Not Recommended: Specialized use; Quranic marks not used in writing contemporary Arabic script based languages; hard to distinguish at small sizes. Not suitable for identifiers. [115] [300]
06DF		Not Recommended: Specialized use; Quranic marks not used in writing contemporary Arabic script based languages; hard to distinguish at small sizes. Not suitable for identifiers. [115] [300]
06E0		Not Recommended: Specialized use; Quranic marks not used in writing contemporary Arabic script based languages; hard to distinguish at small sizes. Not suitable for identifiers. [115] [300]
06E1		Not Recommended: Specialized use; Quranic marks not used in writing contemporary Arabic script based languages; hard to distinguish at small sizes. Not suitable for identifiers. [115] [300]
06E2		Not Recommended: Specialized use; Quranic marks not used in writing contemporary Arabic script based languages; hard to distinguish at small sizes. Not suitable for identifiers. [115] [300]
06E3		Not Recommended: Specialized use; Quranic marks not used in writing contemporary Arabic script based languages; hard to distinguish at small sizes. Not suitable for identifiers. [115] [300]
06E4		Not Recommended: Specialized use; Quranic marks not used in writing contemporary Arabic script based languages; hard to distinguish at small sizes. Not suitable for identifiers. [115] [300]
06E5		Not Recommended: Specialized use; Quranic marks not used in writing contemporary Arabic script based languages; hard to distinguish at small sizes. Not suitable for identifiers. [115] [300]
06E6		Not Recommended: Specialized use; Quranic marks not used in writing contemporary Arabic script based languages; hard to distinguish at small sizes. Not suitable for identifiers. [115] [300]
06E7		Not Recommended: Specialized use; Quranic marks not used in writing contemporary Arabic script based languages; hard to distinguish at small sizes. Not suitable for identifiers. [115] [300]
06E8		Not Recommended: Specialized use; Quranic marks not used in writing contemporary Arabic script based languages; hard to distinguish at small sizes. Not suitable for identifiers. [115] [300]
06EA		Not Recommended: Specialized use; Quranic marks not used in writing contemporary Arabic script based languages; hard to distinguish at small sizes. Not suitable for identifiers. [115] [300]
06EB		Not Recommended: Specialized use; Quranic marks not used in writing contemporary Arabic script based languages; hard to distinguish at small sizes. Not suitable for identifiers. [115] [300]
06EC		Not Recommended: Specialized use; Quranic marks not used in writing contemporary Arabic script based languages; hard to distinguish at small sizes. Not suitable for identifiers. Part of homoglyph sequence(s) not covered by normalization. [115] [300]
06ED		Not Recommended: Specialized use; Quranic marks not used in writing contemporary Arabic script based languages; hard to distinguish at small sizes. Not suitable for identifiers. [115] [300]
06EE	062F 065B	Identical: Identical in appearance to DAL plus combining INVERTED SMALL V ABOVE
06EF	0631 065B	Identical: Identical in appearance to REH plus combining INVERTED SMALL V ABOVE
06F0	0660	Identical: Identical in appearance and meaning to ARABIC-INDIC DIGIT ZERO [110]
06F1	0661	Identical: Identical in appearance and meaning to ARABIC-INDIC DIGIT ONE [110]
06F2	0662	Identical: Identical in appearance and meaning to ARABIC-INDIC DIGIT TWO [110]
06F3	0663	Identical: Identical in appearance and meaning to ARABIC-INDIC DIGIT THREE [110]
06F7	0667	Identical: Usually identical in appearance and meaning to ARABIC-INDIC DIGIT SEVEN [110]
06F8	0668	Identical: Identical in appearance and meaning to ARABIC-INDIC DIGIT EIGHT [110]
06F9	0669	Identical: Identical in appearance and meaning to ARABIC-INDIC DIGIT NINE [110]
06FA	0633 06DB 065C	Identical: Identical in appearance to combining sequence with two combining marks [300]
06FD		Not Recommended: Does not have the XID_CONTINUE property; not considered suitable for identifiers by Unicode [120]
06FE		Not Recommended: Does not have the XID_CONTINUE property; not considered suitable for identifiers by Unicode [120]
06BE 065B	06FF, 0647 065B	Identical: Identical in appearance to U+06FF and U+0647 U+065B [300]
0756	066E 065A	Identical: Identical in appearance to DOTLESS BEH plus SMALL V ABOVE [300]
0762	06A9 06EC	Identical: Identical in appearance to U+06A9 with DOT ABOVE [300]
06A9 06EC	0762	Identical: Identical in appearance to U+0762 [300]
0765	0645 06EC	Identical: Identical in appearance to U+0645 with DOT ABOVE [300]
0645 06EC	0765	Identical: Identical in appearance to U+0765 [300]
0768	0646 0615	Identical: Identical in appearance to U+0646 plus SMALL TAH ABOVE [300]
0769	0646 065A	Identical: Identical in appearance to U+0646 with SMALL V ABOVE [300]
0771	0697 0615	Identical: Identical in appearance to REH WITH TWO DOTS ABOVE plus SMALL TAH ABOVE [300]
0772	062D 0615	Identical: Identical in appearance to HAH plus SMALL TAH ABOVE [300]
076C	0631 0654	Identical: Identical in appearance to REH plus combining HAMZA ABOVE
08A1	0628 0654	Identical: Used for Fulfulde; identical in appearance to BEH plus combining HAMZA ABOVE
063F	06CC 06DB, 0649 06DB	Identical: Identical in appearance to U+06CC U+06DB [300]
0634	0633 06DB	Identical: Identical in appearance to U+0633 U+06DB [300]
069C	069B 06DB	Identical: Identical in appearance to U+069B U+06DB [300]
062B	066E 06DB	Identical: Identical in appearance to U+066E U+06DB [300]
0685	062D 06DB	Identical: Identical in appearance to U+062D U+06DB [300]
0698	0631 06DB	Identical: Identical in appearance to U+0631 U+06DB [300]
068E	062F 06DB	Identical: Identical in appearance to U+062F U+06DB [300]
06A0	0639 06DB	Identical: Identical in appearance to U+0639 U+06DB [300]
06AD	0643 06DB	Identical: Identical in appearance to U+0643 U+06DB [300]
06B4	06AF 06DB	Identical: Identical in appearance to U+06AF U+06DB [300]
06B7	0644 06DB	Identical: Identical in appearance to U+0644 U+06DB [300]
06BD	067E, 06BA 06DB	Identical: Identical in appearance to U+06BA U+06DB and to U+067E when assuming initial or medial form [300]
0763	06A9 06DB	Identical: Identical in appearance to U+06A9 U+06DB [300]
0628	066E 065C	Identical: Identical in appearance to U+066E U+065C [300]
068A	062F 065C	Identical: Identical in appearance to U+062F U+065C [300]
0694	0631 065C	Identical: Identical in appearance to U+0631 U+065C [300]
06A3	0641 065C, 06A1 065C 06EC	Identical: Identical in appearance to U+0641 U+065C and to U+06A1 U+065C U+06EC [300]
06FC	0639 065C 06EC, 063A 065C	Identical: Identical in appearance to U+063A U+065C and to U+0639 U+065C U+06EC [300]
06FB	0635 065C 06EC, 0636 065C	Identical: Identical in appearance to U+0636 U+065C and to U+0635 U+065C U+06EC [300]
0751	062B 065C	Identical: Identical in appearance to U+062B U+065C [300]
0766	0645 065C	Identical: Identical in appearance to U+0645 U+065C [300]
0649 06DB	063F, 06CC 06DB	Identical: Identical in appearance to U+063F [300]
06CC 06DB	063F, 0649 06DB	Identical: Identical in appearance to U+063F [300]
0633 06DB	0634	Identical: Identical in appearance to U+0634 [300]
069B 06DB	069C	Identical: Identical in appearance to U+069C [300]
066E 06DB	062B	Identical: Identical in appearance to U+062B [300]
062D 06DB	0685	Identical: Identical in appearance to U+0685 [300]
0631 06DB	0698	Identical: Identical in appearance to U+0698 [300]
062F 06DB	068E	Identical: Identical in appearance to U+068E [300]
0639 06DB	06A0	Identical: Identical in appearance to U+06A0 [300]
06A1 06DB	06A4, 06A8, 066F 06DB	Identical: Identical in appearance to U+06A4 and U+06A8 and U+066F U+06DB when assuming... [300]
066F 06DB	06A8, 06A4, 06A1 06DB	Identical: Identical in appearance to U+06A8 and to ... [300]
0643 06DB	06AD	Identical: Identical in appearance to U+06AD [300]
06AF 06DB	06B4	Identical: Identical in appearance to U+06B4 [300]
0644 06DB	06B7	Identical: Identical in appearance to U+06B7 [300]
06BA 06DB	067E, 06BD	Identical: Identical in appearance to U+06BD and to U+067E when assuming initial or medial form [300]
06A9 06DB	0763	Identical: Identical in appearance to U+0763 [300]
066E 065C	0628	Identical: Identical in appearance to U+0628 [300]
062F 065C	068A	Identical: Identical in appearance to U+068A [300]
0688 065C	068A 0615, 068B	Identical: Identical in appearance to U+068B [300]
0631 065C	0694	Identical: Identical in appearance to U+0694 [300]
0641 065C	06A1 065C 06EC, 06A3	Identical: Identical in appearance to U+06A3 and to U+06A1 U+065C U+06EC [300]
0646 065C	06BA 065C 06EC, 06B9	Identical: Identical in appearance to U+06B9 and to a sequence with two combining marks [300]
063A 065C	0639 065C 06EC, 06FC	Identical: Identical in appearance to U+06FC and to U+0639 U+065C U+06EC [300]
0636 065C	0635 065C 06EC, 06FB	Identical: Identical in appearance to U+06FB and to U+0635 U+065C U+06EC [300]
062B 065C	0751	Identical: Identical in appearance to U+0751 [300]
0645 065C	0766	Identical: Identical in appearance to U+0766 [300]
08A8	064A 0654	Identical: Identical in appearance to U+064A U+0654 [99]
08A9	064A 06EC	Identical: Identical in appearance to U+064A U+06EC [99]
064A 06EC	08A9	Identical: Identical in appearance to U+08A9 [99]
098C 09E2	09E1	Identical: Identical in appearance to VOCALIC LL
09E1	098C 09E2	Identical: Used for Sanskrit; identical in appearance to LETTER VOCALIC L plus SIGN VOCALIC L
0B95	0BE7	Identical: Identical in appearance to TAMIL DIGIT ONE
0BE7	0B95	Identical: Identical in appearance to TAMIL KA [110]
0D4C	0D57	Not Recommended: Obsolete, preferred alternative is U+0D57 [120] [115]
0D57	0D4C	Identical: This code point preferred over U+0D4C, which is obsolete [120]
0E3A		Other issue: Renders unreliably, or not at all, if adjacent to any Thai vowel below. This may be prevented by a context rule
0E41		Other issue: Digraph of U+0E40 SARA E U+0E40 SARA E. Normally handled by disallowing the sequence via a context rule
0E40		Restricted Context: Restrict more than one SARA E from occurring together, as pairs are indistinguishable from U+0E41 SARA AE. This restriction is normally implemented more generally, disallowing any pair of leading vowels
0E45		Restricted Context: Only occurs after two special Thai vowels, U+0E24 RU and U+0E26 LU. Is also potentially confused with U+0E32 SARA AA. Both issues can be addressed by defining a context rule. Alternatively, the context may be spelled out by enumerating the two sequences and excluding U+0E45 if occurring by itself.
0E4E		Not Recommended: Rarely used in modern Thai; it is more commonly replaced with U+0E3A (PHINTHU). Excluding it avoids issues with confusing it with another diacritic U+0E4C (THANTHAKHAT). Both are rendered atop a syllable and hard to distinguish at small sizes.
0F18		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
0F19		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
0F35		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
0F37		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
0F3E		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
0F3F		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
0F7A 0F7A	0F7B	Identical: Identical in appearance to VOWEL SIGN EE [120] [115]
0F7B	0F7A 0F7A	Identical: Identical in appearance to a sequence of two VOWEL SIGN E [120] [115]
0F7C 0F7C	0F7D	Identical: Identical in appearance to VOWEL SIGN OO [120] [115]
0F7D	0F7C 0F7C	Identical: Identical in appearance to a sequence of two VOWEL SIGN O [120] [115]
0FC6		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [115] [120]
101D	1040	Identical: Letter U+101D is identical to digit U+1040 [100] [150]
1040	101D	Identical: Digit U+1040 is identical to letter U+101D [110] [150]
1200	1210, 1280	Interchangeable: U+1200, U+1210 and U+1280 are used interchangeably in Amharic [100] [202]
1201	1211, 1281	Interchangeable: U+1201, U+1211 and U+1281 are used interchangeably in Amharic [100] [202]
1202	1212, 1282	Interchangeable: U+1202, U+1212 and U+1282 are used interchangeably in Amharic [100] [202]
1203	1213, 1283	Interchangeable: U+1203, U+1213 and U+1283 are used interchangeably in Amharic [100] [202]
1204	1214, 1284	Interchangeable: U+1204, U+1214 and U+1284 are used interchangeably in Amharic [100] [202]
1205	1215, 1285	Interchangeable: U+1205, U+1215 and U+1285 are used interchangeably in Amharic [100] [202]
1206	1216, 1286	Interchangeable: U+1206, U+1216 and U+1286 are used interchangeably in Amharic [100] [202]
1210	1200, 1280	Interchangeable: U+1200, U+1210 and U+1280 are used interchangeably in Amharic [100] [202]
1211	1201, 1281	Interchangeable: U+1201, U+1211 and U+1281 are used interchangeably in Amharic [100] [202]
1212	1202, 1282	Interchangeable: U+1202, U+1212 and U+1282 are used interchangeably in Amharic [100] [202]
1213	1203, 1283	Interchangeable: U+1203, U+1213 and U+1283 are used interchangeably in Amharic [100] [202]
1214	1204, 1284	Interchangeable: U+1204, U+1214 and U+1284 are used interchangeably in Amharic [100] [202]
1215	1205, 1285	Interchangeable: U+1205, U+1215 and U+1285 are used interchangeably in Amharic [100] [202]
1216	1206, 1286	Interchangeable: U+1206, U+1216 and U+1286 are used interchangeably in Amharic [100] [202]
1217	1288	Interchangeable: U+1217 and U+1288 are used interchangeably in Amharic [100] [202]
1220	1230	Interchangeable: U+1220 and U+1230 are used interchangeably in Amharic [100] [202]
1221	1231	Interchangeable: U+1221 and U+1231 are used interchangeably in Amharic [100] [202]
1222	1232	Interchangeable: U+1222 and U+1232 are used interchangeably in Amharic [100] [202]
1223	1233	Interchangeable: U+1223 and U+1233 are used interchangeably in Amharic [100] [202]
1224	1234	Interchangeable: U+1224 and U+1234 are used interchangeably in Amharic [100] [202]
1225	1235	Interchangeable: U+1225 and U+1235 are used interchangeably in Amharic [100] [202]
1226	1236	Interchangeable: U+1226 and U+1236 are used interchangeably in Amharic [100] [202]
1227	1237	Interchangeable: U+1227 and U+1237 are used interchangeably in Amharic [100] [202]
1230	1220	Interchangeable: U+1230 and U+1220 are used interchangeably in Amharic [100] [202]
1231	1221	Interchangeable: U+1231 and U+1221 are used interchangeably in Amharic [100] [202]
1232	1222	Interchangeable: U+1232 and U+1222 are used interchangeably in Amharic [100] [202]
1233	1223	Interchangeable: U+1233 and U+1223 are used interchangeably in Amharic [100] [202]
1234	1224	Interchangeable: U+1234 and U+1224 are used interchangeably in Amharic [100] [202]
1235	1225	Interchangeable: U+1235 and U+1225 are used interchangeably in Amharic [100] [202]
1236	1226	Interchangeable: U+1236 and U+1226 are used interchangeably in Amharic [100] [202]
1237	1227	Interchangeable: U+1237 and U+1227 are used interchangeably in Amharic [100] [202]
1280	1200, 1210	Interchangeable: U+1200, U+1210 and U+1280 are used interchangeably in Amharic [100] [202]
1281	1201, 1211	Interchangeable: U+1201, U+1211 and U+1281 are used interchangeably in Amharic [100] [202]
1282	1202, 1212	Interchangeable: U+1202, U+1212 and U+1282 are used interchangeably in Amharic [100] [202]
1283	1203, 1213	Interchangeable: U+1203, U+1213 and U+1283 are used interchangeably in Amharic [100] [202]
1284	1204, 1214	Interchangeable: U+1204, U+1214 and U+1284 are used interchangeably in Amharic [100] [202]
1285	1205, 1215	Interchangeable: U+1205, U+1215 and U+1285 are used interchangeably in Amharic [100] [202]
1286	1206, 1216	Interchangeable: U+1206, U+1216 and U+1286 are used interchangeably in Amharic [100] [202]
1288	1217	Interchangeable: U+1288 and U+1217 are used interchangeably in Amharic [100] [202]
12A0	12A3, 12D0, 12D3	Interchangeable: U+12A0, U+12A3, U+12D0 and U+12D3 are used interchangeably in Amharic [100] [202]
12A1	12D1	Interchangeable: U+12A1 and U+12D1 are used interchangeably in Amharic [100] [202]
12A2	12D2	Interchangeable: U+12A2 and U+12D2 are used interchangeably in Amharic [100] [202]
12A3	12A0, 12D0, 12D3	Interchangeable: U+12A0, U+12A3, U+12D0 and U+12D3 are used interchangeably in Amharic [100] [202]
12A4	12D4	Interchangeable: U+12A4 and U+12D4 are used interchangeably in Amharic [100] [202]
12A5	12D5	Interchangeable: U+12A5 and U+12D5 are used interchangeably in Amharic [100] [202]
12A6	12D6	Interchangeable: U+12A6 and U+12D6 are used interchangeably in Amharic [100] [202]
12AE	12B0	Interchangeable: U+12AE and U+12B0 are used interchangeably in Amharic [100] [202]
12B0	12AE	Interchangeable: U+12B0 and U+12AE are used interchangeably in Amharic [100] [202]
12D0	12A0, 12A3, 12D3	Interchangeable: U+12A0, U+12A3, U+12D0 and U+12D3 are used interchangeably in Amharic [100] [202]
12D1	12A1	Interchangeable: U+12D1 and U+12A1 are used interchangeably in Amharic [100] [202]
12D2	12A2	Interchangeable: U+12D2 and U+12A2 are used interchangeably in Amharic [100] [202]
12D3	12A0, 12A3, 12D0	Interchangeable: U+12A0, U+12A3, U+12D0 and U+12D3 are used interchangeably in Amharic [100] [202]
12D4	12A4	Interchangeable: U+12D4 and U+12A4 are used interchangeably in Amharic [100] [202]
12D5	12A5	Interchangeable: U+12D5 and U+12A5 are used interchangeably in Amharic [100] [202]
12D6	12A6	Interchangeable: U+12D6 and U+12A6 are used interchangeably in Amharic [100] [202]
1338	1340	Interchangeable: U+1338 and U+1340 are used interchangeably in Amharic [100] [202]
1339	1341	Interchangeable: U+1339 and U+1341 are used interchangeably in Amharic [100] [202]
133A	1342	Interchangeable: U+133A and U+1342 are used interchangeably in Amharic [100] [202]
133B	1343	Interchangeable: U+133B and U+1343 are used interchangeably in Amharic [100] [202]
133C	1344	Interchangeable: U+133C and U+1344 are used interchangeably in Amharic [100] [202]
133D	1345	Interchangeable: U+133D and U+1345 are used interchangeably in Amharic [100] [202]
133E	1346	Interchangeable: U+133E and U+1346 are used interchangeably in Amharic [100] [202]
1340	1338	Interchangeable: U+1340 and U+1338 are used interchangeably in Amharic [100] [202]
1341	1339	Interchangeable: U+1341 and U+1339 are used interchangeably in Amharic [100] [202]
1342	133A	Interchangeable: U+1342 and U+133A are used interchangeably in Amharic [100] [202]
1343	133B	Interchangeable: U+1343 and U+133B are used interchangeably in Amharic [100] [202]
1344	133C	Interchangeable: U+1344 and U+133C are used interchangeably in Amharic [100] [202]
1345	133D	Interchangeable: U+1345 and U+133D are used interchangeably in Amharic [100] [202]
1346	133E	Interchangeable: U+1346 and U+133E are used interchangeably in Amharic [100] [202]
17A2	17A3	Other issue: Preferred over U+17A3, which is deprecated [120] [150]
17A3	17A2	Not Recommended: Deprecated in Unicode, preferred is U+17A2 [120] [115] [150]
17A4		Not Recommended: Deprecated in Unicode [120] [115]
17A7 17CA	17A8	Other issue: This sequence preferred over U+17A8, which is obsolete [120]
17A8	17A7 17CA	Not Recommended: Obsolete, sequence U+17A7 U+17CA preferred [120] [115]
17D2 178A	17D2 178F	Identical: When preceded by U+17D2, U+178A and U+178F are indistinguishable [204]
17D2 178F	17D2 178A	Identical: When preceded by U+17D2, U+178A and U+178F are indistinguishable [204]
1835	1855	Identical: U+1835 is identical to U+1855 [115] [150]
1855	1835	Identical: U+1855 is identical to U+1835 [115] [150]
199E	19D0	Identical: Letter U+199E is identical to digit U+19D0 [115] [150]
19D0	199E	Identical: Digit U+19D0 is identical to Letter U+199E [115] [150]
19B1	19D1	Identical: Letter U+19B1 is identical to digit U+19D1 [150]
19D1	19B1	Identical: Digit U+19D1 is identical to letter U+19B1 [115] [150]
1B0D	1B52	Identical: Letter U+1B0D is identical to digit U+1B52 [115] [150]
1B11	1B53	Identical: Letter U+1B11 is identical to digit U+1B53 [115] [150]
1B28	1B58	Identical: Letter U+1B28 is identical to digit U+1B58 [115] [150]
1B52	1B0D	Identical: Digit U+1B52 is identical to letter U+1B0D [115] [150]
1B53	1B11	Identical: Digit U+1B53 is identical to letter U+1B11 [115] [150]
1B58	1B28	Identical: Digit U+1B58 is identical to letter U+1B28 [115] [150]
1C82		Not Recommended: Cyrillic NARROW O is a code point for specialist use, and common users do not expect to encounter it. It resembles digit ZERO and can be used to create an apparent contrast to the letter O in a label [115]
214E		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
2184		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
2E2F		Not Recommended: Does not have the XID_CONTINUE property; not considered suitable for identifiers by Unicode [120]
3006		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
302A		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
302B		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
302C		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
302D		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
303C		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
3078	30D8	Identical: Indistinguishable from U+30D8
3079	30D9	Identical: Indistinguishable from U+30D9
307A	30DA	Identical: Indistinguishable from U+30DA
30AB	529B	Identical: Not always distinct from U+529B
30AA	624D	Identical: Not always distinct from U+624D
30ED	53E3	Identical: Not always distinct from U+53E3
30CF	516B	Identical: Not always distinct from U+516B
30C8	535C	Identical: Not always distinct from U+535C
30CB	4E8C	Identical: Not always distinct from U+4E8C
30A8	5DE5	Identical: Not always distinct from U+5DE5
30D8	3078	Identical: Indistinguishable from U+3078
30D9	3079	Identical: Indistinguishable from U+3079
30DA	307A	Identical: Indistinguishable from U+307A
529B	30AB	Identical: Not always distinct from U+30AB
624D	30AA	Identical: Not always distinct from U+30AA
53E3	30ED	Identical: Not always distinct from U+30ED
516B	30CF	Identical: Not always distinct from U+30CF
535C	30C8	Identical: Not always distinct from U+30C8
4E8C	30CB	Identical: Not always distinct from U+30CB
5DE5	30A8	Identical: Not always distinct from U+30A8
30FC	4E00	Identical: Indistinguishable from U+4E00
4CA4		Not Recommended: Incorrectly unified ideograph; Encoding is unstable [120]
4E00	30FC	Identical: Indistinguishable from U+30FC
30FD	4E36	Identical: A single stroke shape; Indistinguishable from U+4E36
4E36	30FD	Identical: A single stroke shape; Indistinguishable from U+30FD
A717		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
A718		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
A719		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
A71A		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
A71B		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
A71C		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
A71D		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
A71E		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
A71F		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
A78C		Not Recommended: Indistinguishable from a punctuation character that is not PVALID [120]
A9CF		Not Recommended: Formally has the letter property, but functions more like a symbol or punctuation [120]
FE20		Not Recommended: Specialized combining mark, problematic for identifiers [120]
FE21		Not Recommended: Specialized combining mark, problematic for identifiers [120]
FE22		Not Recommended: Specialized combining mark, problematic for identifiers [120]
FE23		Not Recommended: Specialized combining mark, problematic for identifiers [120]
FE24		Not Recommended: Specialized combining mark, problematic for identifiers [120]
FE25		Not Recommended: Specialized combining mark, problematic for identifiers [120]
FE26		Not Recommended: Specialized combining mark, problematic for identifiers [120]
101FD		Not Recommended: Specialized combining mark, problematic for identifiers [120]
10486	104A0	Identical: Identical in appearance to U+104A0 OSMANYA DIGIT ZERO [115] [150]
104A0	10486	Identical: Identical in appearance to U+10486 OSMANYA LETTER DEEL [115] [150]
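
The rows above follow a regular shape that lends itself to mechanical processing. The following is a minimal sketch in Python, assuming that each row consists of three tab-separated fields (the subject code point or sequence, a comma-separated list of related code points or sequences that may be empty, and a "Category: explanation [refs]" field). The RegistryEntry type, its field names, and the parse_row function are hypothetical and introduced only for illustration; they are not part of the registry definition.

```python
import re
from dataclasses import dataclass
from typing import Tuple


# Hypothetical record type; field names are illustrative, not from the registry.
@dataclass(frozen=True)
class RegistryEntry:
    subject: Tuple[int, ...]               # code point or sequence the row is about
    related: Tuple[Tuple[int, ...], ...]   # related code points or sequences
    category: str                          # e.g. "Identical", "Not Recommended"
    note: str                              # free-text explanation
    refs: Tuple[str, ...]                  # reference tags such as "300"


_REF = re.compile(r"\[([^\]]+)\]")


def _sequence(text: str) -> Tuple[int, ...]:
    """Convert space-separated hex values such as '0633 06DB' to integers."""
    return tuple(int(cp, 16) for cp in text.split())


def parse_row(line: str) -> RegistryEntry:
    # Assumption: exactly three tab-separated fields; the second may be empty.
    subject, related, rest = line.rstrip("\n").split("\t", 2)
    category, _, note = rest.partition(":")
    refs = tuple(_REF.findall(note))
    note = _REF.sub("", note).strip()
    related_seqs = tuple(_sequence(p) for p in related.split(",") if p.strip())
    return RegistryEntry(_sequence(subject), related_seqs,
                         category.strip(), note, refs)


row = "0649 06DB\t063F, 06CC 06DB\tIdentical: Identical in appearance to U+063F [300]"
entry = parse_row(row)
print(entry.subject, entry.related, entry.category, entry.refs)
```

Keeping sequences as tuples of integers, rather than as strings, makes it straightforward to match them against the code points of a candidate label.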

4.2.2. References for Registry

[99]: The Unicode Consortium, "The Unicode Standard", (latest version) http://www.unicode.org/versions/latest (Multiple, or latest version)
[100]: Integration Panel, "Maximal Starting Repertoire (MSR-2)", April 2015, https://www.icann.org/en/system/files/files/msr-2-overview-14apr15-en.pdf (Code points included in MSR-2 as potentially appropriate for the root zone)
[110]: The Unicode Consortium, "Derived Numeric Type", (latest version) http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedNumericType.txt (Code points from modern use scripts, excluded from MSR-2 solely because they are defined as digits in the Unicode Character Database)
[115]: Integration Panel, "Maximal Starting Repertoire (MSR-2)", April 2015, https://www.icann.org/en/system/files/files/msr-2-overview-14apr15-en.pdf (Code points excluded from MSR-2 as inappropriate for the root zone)
[120]: Integration Panel, "Maximal Starting Repertoire (MSR-2)", April 2015, https://www.icann.org/en/system/files/files/msr-2-overview-14apr15-en.pdf (Code points considered problematic by MSR-2)
[150]: The Unicode Consortium, "Intentional.txt", Version 10.0.0, http://www.unicode.org/Public/security/10.0.0/intentional.txt (Code points considered identical by intention)
[201]: TF-AIDN, "Proposal for Arabic Script Root Zone LGR", 18 November 2015, https://www.icann.org/en/system/files/files/arabic-lgr-proposal-18nov15-en.pdf
[202]: Ethiopic Generation Panel, "Proposal for Ethiopic Script Root Zone LGR", May 17, 2017, https://www.icann.org/en/system/files/files/proposal-ethiopic-lgr-17may17-en.pdf
[204]: Khmer Generation Panel, "Proposal for Khmer Script Root Zone Label Generation Rules (LGR)", August 15, 2016, https://www.icann.org/en/system/files/files/proposal-khmer-lgr-15aug16-en.pdf
[206]: Thai Generation Panel, "Proposal for the Thai Script Root Zone LGR", May 25, 2017, https://www.icann.org/en/system/files/files/proposal-thai-lgr-25may17-en.pdf
[300]: Internationalized Domain Names Variant Issues Project: Arabic Case Study Team Issues Report, ICANN, October 7, 2011, https://archive.icann.org/en/topics/new-gtlds/arabic-vip-issues-report-07oct11-en.pdf (In-script variants)
[5564]: RFC 5564 (Code points to be excluded from repertoires for the Arabic language)
[6912]: RFC 6912 (Code points considered problematic)
[IAB]: IAB, "IAB Statement on Identifiers and Unicode 7.0.0", February 2015, https://www.iab.org/documents/correspondence-reports-documents/2015-2/iab-statement-on-identifiers-and-unicode-7-0-0/

5. IANA Considerations

The IANA Services Operator is hereby requested to create the Registry of Unicode Code Points for Special Consideration in Network Identifiers, and to populate it with the values in Section 4.2. The registry is to be updated by Expert Review.

This registry has no formal protocol status with respect to IDNA or PRECIS. It is a registry intended to be used by those creating registration or lookup policies, in order to inform the development of such policies.

6. Security Considerations

The registry established by this document is intended to help operators of identifier systems in deciding what to permit in identifiers. It may also be useful for user agents that attempt to provide warnings to users about suspicious or inadvisable identifiers. Operators that fail to make policies addressing the contents of the registry may permit the creation of identifiers that are misleading or that may be used in attacks on the network or users.
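
As an illustration of how a user agent or registration system might consult the registry, the following Python sketch flags labels that contain listed code points or sequences. It is a sketch under stated assumptions, not a recommended policy: the two tables hold only a handful of entries copied from Section 4.2, and the names NOT_RECOMMENDED, IDENTICAL, and registry_warnings are hypothetical.

```python
from typing import Dict, List, Tuple

CodePoints = Tuple[int, ...]

# Small excerpt of the registry, hard-coded for illustration only.
NOT_RECOMMENDED = {0x06DB, 0x0E4E, 0x1C82, 0xA78C}
IDENTICAL: Dict[CodePoints, List[CodePoints]] = {
    (0x0633, 0x06DB): [(0x0634,)],
    (0x0634,): [(0x0633, 0x06DB)],
    (0x101D,): [(0x1040,)],
    (0x1040,): [(0x101D,)],
}


def _fmt(seq: CodePoints) -> str:
    """Render a sequence in the usual U+XXXX notation."""
    return " ".join(f"U+{cp:04X}" for cp in seq)


def registry_warnings(label: str) -> List[str]:
    """Return human-readable warnings for listed code points in a label."""
    cps = tuple(ord(ch) for ch in label)
    found = []
    for cp in sorted(set(cps) & NOT_RECOMMENDED):
        found.append(f"{_fmt((cp,))} is listed as Not Recommended")
    for seq, others in IDENTICAL.items():
        n = len(seq)
        # Look for the listed sequence anywhere inside the label.
        if any(cps[i:i + n] == seq for i in range(len(cps) - n + 1)):
            alternatives = ", ".join(_fmt(o) for o in others)
            found.append(f"{_fmt(seq)} is identical in appearance to {alternatives}")
    return found


for warning in registry_warnings("\u0633\u06DB"):
    print(warning)
```

Whether a warning leads to rejection, to blocking of a variant, or merely to a notice shown to the user is a policy decision that the registry deliberately leaves to the operator.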

The registry is not a magic solution to all identifier ambiguity, and even refusing to permit registration of, or lookup of, every code point in the registry cannot ensure that misleading or confusing identifiers will never be created.

7. References

7.1. Normative References

[RFC2119]	Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.
[RFC5890]	Klensin, J., "Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework", RFC 5890, DOI 10.17487/RFC5890, August 2010.
[RFC5891]	Klensin, J., "Internationalized Domain Names in Applications (IDNA): Protocol", RFC 5891, DOI 10.17487/RFC5891, August 2010.
[RFC5892]	Faltstrom, P., "The Unicode Code Points and Internationalized Domain Names for Applications (IDNA)", RFC 5892, DOI 10.17487/RFC5892, August 2010.
[RFC5893]	Alvestrand, H. and C. Karp, "Right-to-Left Scripts for Internationalized Domain Names for Applications (IDNA)", RFC 5893, DOI 10.17487/RFC5893, August 2010.
[RFC5894]	Klensin, J., "Internationalized Domain Names for Applications (IDNA): Background, Explanation, and Rationale", RFC 5894, DOI 10.17487/RFC5894, August 2010.
[RFC7564]	Saint-Andre, P. and M. Blanchet, "PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols", RFC 7564, DOI 10.17487/RFC7564, May 2015.
[UAX44]	The Unicode Consortium, "Unicode Standard Annex #44, Unicode Character Database". This references the most recently published version of the description of the Unicode Character Database.
[UCD]	The Unicode Consortium, "Unicode Character Database". This references the most recently published version of the data files for the Unicode Character Database.
[Unicode]	The Unicode Consortium, "The Unicode Standard, Latest Version". This references the most recently published version.

7.2. Informative References

[I-D.klensin-idna-5892upd-unicode70]	Klensin, J. and P. Fältström, "IDNA Update for Unicode 7.0.0", Internet-Draft draft-klensin-idna-5892upd-unicode70-04, March 2015.
[I-D.rfc5891bis]	Klensin, J., "Internationalized Domain Names in Applications (IDNA): Registry Restrictions and Recommendations", March 2017.
[RFC6365]	Hoffman, P. and J. Klensin, "Terminology Used in Internationalization in the IETF", BCP 166, RFC 6365, DOI 10.17487/RFC6365, September 2011.

Appendix A. Additional Background

A.1. The Theory of Inclusion

The mechanism that the IETF has come to prefer for internationalization of identifiers may be called "inclusion-based identifier internationalization", or "inclusion" for short. Under inclusion, the characters that are permissible in identifiers for a protocol are selected from the set of all Unicode characters. One starts with an empty set of characters, and then gradually adds characters to the set, usually based on Unicode properties (see below, and also Section 3).

Inclusion depends in part on assumptions the IETF made when the strategy was adopted and developed; some of those assumptions were about the relationships between different characters and the likelihood that similar such relationships would get added to future versions of Unicode. Those assumptions turn out not to have been true in every case. Code points at issue are among those to be listed in the registry defined here. (See Section 4.2.)

The intent of Unicode is to encode all known writing systems into a single coded character set. One consequence of that goal is that Unicode encodes an enormous number of characters. Another is that the work of Unicode does not end until every writing system is encoded; even after that, it needs to continue to track any changes in those writing systems.

Unicode encodes abstract characters, not glyphs. Because of the way Unicode was built up over time, there are sometimes multiple ways to encode the same abstract character. For example, an e with an acute accent may be written by combining U+0065 LATIN SMALL LETTER E and U+0301 COMBINING ACUTE ACCENT, or it may be written U+00E9 LATIN SMALL LETTER E WITH ACUTE. If Unicode encodes an abstract character in more than one way, then for most purposes the different encodings should all be treated as though they're the same character. This "canonical equivalence" between encodings of the same abstract characters is explicitly called out by Unicode. A lack of a defined canonical equivalence is tantamount to an assertion by Unicode that the two encodings do not represent the same abstract character, even if both happen to result in the same appearance.
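
Canonical equivalence can be observed directly with a normalization library. The sketch below uses Python's standard unicodedata module; the helper name canonically_equivalent is hypothetical. It compares the two encodings of e with acute accent (using U+0301 COMBINING ACUTE ACCENT), and then a pair from the registry that looks the same but has no canonical equivalence defined.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


combining = "\u0065\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
precomposed = "\u00E9"       # LATIN SMALL LETTER E WITH ACUTE

print(combining == precomposed)                        # raw code points differ
print(canonically_equivalent(combining, precomposed))  # equivalent under normalization

# U+0634 and the sequence U+0633 U+06DB are listed in the registry as
# identical in appearance, yet Unicode defines no equivalence between them.
print(canonically_equivalent("\u0634", "\u0633\u06DB"))
```

The last comparison is the kind of case the registry exists to record: normalization alone does not bring the two encodings together.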

Every encoded character in Unicode (more precisely, every code point) is associated with a set of properties. The properties define what script a code point is in, whether it is a letter or a number or punctuation and so forth, its direction when written, to what other code point or code point sequence it is canonically equivalent, and many other properties. These properties are important to the inclusion mechanism. They are defined in the Unicode Character Database [UCD] [UAX44].
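
Several of these properties can be inspected programmatically. The following sketch uses Python's standard unicodedata module, which exposes a subset of the Unicode Character Database; it does not expose the Script property, so that one is omitted here. The describe function and its dictionary keys are hypothetical names chosen for illustration, and the values returned depend on the Unicode version bundled with the Python runtime.

```python
import unicodedata


def describe(cp: int) -> dict:
    """Collect a few UCD properties for one code point."""
    ch = chr(cp)
    return {
        "code_point": f"U+{cp:04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),   # letter, mark, digit, ...
        "bidi_class": unicodedata.bidirectional(ch),    # direction when written
        "decomposition": unicodedata.decomposition(ch), # canonical/compat mapping
        "combining_class": unicodedata.combining(ch),   # 0 for non-combining
    }


for cp in (0x00E9, 0x0634, 0x06DB):
    print(describe(cp))
```

An empty decomposition field, as for U+0634, indicates that the code point is not canonically equivalent to any sequence, whatever its appearance may suggest.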

Inclusion depends on the assumption that such strings as will be used in identifiers will not have any ambiguous matching to other strings. In practice, this means that input strings to the protocol are expected to be in Normalization Form C. This way, any alternative sequences of code points for the same characters will be normalized to a single form. If all the characters in the string are also included for the protocol's candidate identifiers, then the string is eligible to be an identifier under the protocol.
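
The eligibility test described above can be sketched as follows. The repertoire ALLOWED and the function is_eligible are assumptions chosen purely for illustration; a real protocol derives its repertoire from Unicode properties rather than from a hand-written set.

```python
import unicodedata

# Hypothetical repertoire for illustration only: ASCII lowercase letters,
# digits, hyphen, and U+00E9 LATIN SMALL LETTER E WITH ACUTE.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789-") | {"\u00e9"}


def is_eligible(candidate: str) -> bool:
    """A candidate is eligible if it is non-empty, already in NFC,
    and every code point is in the included repertoire."""
    if not candidate:
        return False
    if unicodedata.normalize("NFC", candidate) != candidate:
        return False  # an alternative encoding of the same characters
    return all(ch in ALLOWED for ch in candidate)


print(is_eligible("caf\u00e9"))    # True: NFC, all code points included
print(is_eligible("cafe\u0301"))   # False: not in NFC
print(is_eligible("caf\u00e8"))    # False: U+00E8 is not in this repertoire
```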

A.2. The Difference Between Theory and Practice

In principle, under inclusion, identifiers should be unambiguous. It has always been recognized, however, that for humans some ambiguity is inevitable, because of the vagaries of writing systems and of human perception.

Normalization Form C ("NFC") removes the ambiguities based on dual or multiple encoding for the same abstract character. However, characters are not the same as their glyphs. This means that it is possible for certain abstract characters to share a glyph. We can call such abstract characters "homoglyphs". While this looks at first like something that should be handled (or should have been handled) by normalization (NFC or something else), there are important differences; the situation is in some sense an extreme case of the spectrum of ambiguity discussed in the next section.

A.2.1. Confusability

While Unicode deals in abstract characters and inclusion works on Unicode code points, users interact with strings as actually rendered: sequences of glyphs. There are characters that, depending on font, sometimes look quite similar to one another (such as "l" and "1"); any character that is like this is often called "visually similar". More difficult are characters that, in any normal rendering, always look the same as one another. The shared history of Cyrillic, Greek, and Latin scripts, for example, means that there are characters in each script that function similarly and that are usually indistinguishable from one another, though they are not the same abstract character. These are examples of "homoglyphs." Any character that can be confused for another one can be called confusable, and confusability can be thought of as a spectrum with "visually similar" at one end, and "homoglyphs" at the other. (We use the term "homoglyph" strictly: code points that normally use the same glyph when rendered.)

Most of the time, there is some characteristic that can help to mitigate confusion. Mitigation may be as simple as using a font designed to distinguish among different characters. For homoglyphs, a large number of cases (but not all of them) turn out to be in different scripts. As a result, it is usually a good idea to adopt the operational convention that identifiers for a protocol should always be in a single script. This strategy has limits. First, identifiers are not always under the operational control of a single authority (such as in the case of DNS, where the system is under distributed control so that different parts of the hierarchy can have different operational rules). Moreover, sometimes the repertoire used in operation allows multiple scripts that create whole-string confusables -- strings made up entirely of homoglyphs of another string in a different script (such as can be found between Cyrillic and Latin, for example). In such cases, mitigation must turn to other means of preventing the registration of mutually confusable strings, for example by ensuring that the registration of one of them (whichever comes first) blocks the later registration of the other.
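
The single-script convention and its limits can be illustrated with the rough sketch below. Python's standard library does not expose the Unicode Script property, so this sketch guesses the script from the first word of the character name. That heuristic is an assumption made for illustration only and is not a substitute for the real property data; the function names are likewise hypothetical.

```python
import unicodedata


def script_guess(ch: str) -> str:
    """Rough stand-in for the Script property: the first word of the
    Unicode character name (e.g. LATIN, CYRILLIC, GREEK)."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def scripts_in(label: str) -> set:
    # ASCII digits and hyphen are shared across scripts, so skip them.
    return {script_guess(ch) for ch in label if ch not in "0123456789-"}


def is_single_script(label: str) -> bool:
    return len(scripts_in(label)) <= 1


mixed = "p\u0430ypal"            # Latin letters plus CYRILLIC SMALL LETTER A
whole = "\u0440\u0430\u0445"     # entirely Cyrillic, resembles Latin "pax"

print(is_single_script("paypal"))  # True
print(is_single_script(mixed))     # False: the mixture is caught
print(is_single_script(whole))     # True: whole-string confusable slips through
```

The last result shows why a single-script rule alone does not address whole-string confusables.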

Also, operators should only ever use the smallest repertoire of code points possible for their environment. So, for example, if there is a code point that is sometimes used but is perhaps a little obscure, it is better to leave it out and gain some experience with other cases first. In particular, code points used only in a language with which the administrator is not familiar should probably be excluded. The same applies to code points used in specialized contexts, such as those only found in historic or sacred documents, or only used for phonetic transcription or poetry. In the case of IDNA, some client programs restrict display of U-labels to top-level domains known to have policies about single-script labels.

None of these policies or conventions, other than ensuring mutual exclusion, will do anything to help with characters that are strict homoglyphs of each other in the same script. (See Appendix B for some example cases.)

Finally, there are some writing systems where characters do not normally occur in arbitrary locations in the context of each syllable. Neither users nor rendering systems for such scripts are adept at handling arbitrary sequences of such characters. While some latitude beyond strict spelling rules may be accommodated, policies that enforce a minimal set of structural rules are required to ensure that users can recognize identifiers and that systems can render them predictably.

A.2.2. Not everything can be solved

As noted in Section 1, it is not possible to solve all the problems with identifier systems, particularly when human factors are taken into account.

Appendix B. Examples

There are a number of cases that illustrate the combining sequence or digraph issue:

U+08A1 vs \u'0628'\u'0654': This case is ARABIC LETTER BEH WITH HAMZA ABOVE, which is the one that was detected during expert review and that caused the IETF to notice the issue. The issue existed before this, but we did not know it. For detailed discussion of this case and some of the following ones, see [I-D.klensin-idna-5892upd-unicode70].
U+0681 vs \u'062D'\u'0654': This case is ARABIC LETTER HAH WITH HAMZA ABOVE, which (like U+08A1) does not have a canonical equivalent. In both cases, the places where hamza above are used are specialized enough that the combining marks can be excluded in some cases (for example, the root zone under IDNA).
U+0623 vs \u'0627'\u'0654': This case is ARABIC LETTER ALEF WITH HAMZA ABOVE. Unlike the previous two cases, it does have a canonical equivalence with the combining sequence. In the past, the IETF misunderstood the reasons for the difference between this pair and the previous two cases.
U+09E1 vs \u'098C'\u'09E2': This case is BENGALI LETTER VOCALIC LL. This is an example in Bengali script of a case without a canonical equivalence to the combining sequence. Per Unicode, the single code point should be used to represent vowel letters in text, and the sequence of code points should not be used. But it is not a simple matter of disallowing the combining vowel mark in cases like this; where the combination does not exist and the use of the sequence is already established, Unicode is unlikely to encode the combination.
U+019A vs \u'006C'\u'0335': This case is LATIN SMALL LETTER L WITH BAR. In at least some fonts, there is a detectable difference with the combining sequence, but only if one compares them side-by-side. Unlike with a separable diacritic, there are no hard and fast rules for the placement of overlays. A bar may cross at different heights for different glyph shapes or may cross different parts of the glyph. For this reason, there is no canonical equivalence defined between the sequence and the composite. Unicode has a principle of encoding barred letters of specific shape as single code point composites when needed for any writing system.
U+00F8 vs \u'006F'\u'0337': This is LATIN SMALL LETTER O WITH STROKE. The effect is similar to the previous case. Unicode has a principle of encoding stroked letters as composites when needed for any writing system.
U+02A6 vs \u'0074'\u'0073': This is LATIN SMALL LETTER TS DIGRAPH, which is not canonically equivalent to the letters t and s. The intent appears to be that the digraph shows the two shapes as kerned, but the difference may be slight out of context.
U+01C9 vs \u'006C'\u'006A': Unlike the TS digraph, the LJ digraph has a relevant compatibility decomposition, so it fails the relevant stability rules under inclusion and is therefore DISALLOWED in IDNA2008. This illustrates the way that consistencies that might be natural to some users of a script are not necessarily found in it, possibly because of uses by another writing system.
U+06C8 vs \u'0648'\u'0670': ARABIC LETTER YU is an example where the normally-rendered character looks just like a combining sequence, but is named differently. In other words, this is an example where the simple fact of the Unicode name would have concealed the apparent relationship from the casual observer.
U+0069 vs \u'0069'\u'0307': LATIN SMALL LETTER I followed by COMBINING DOT ABOVE, by definition, renders exactly the same as LATIN SMALL LETTER I by itself, and does so in practice for any good font. The same would be true if "i" were replaced with any of the other Soft_Dotted characters defined in Unicode. The character sequence \u'0069'\u'0307' (followed by no other combining mark) is reportedly rather common on the Internet. Because base character and stand-alone code point are the same in this case, and the code points affected have the Soft_Dotted property already, this could be mitigated separately via a context rule affecting U+0307.
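
The relationships in the list above can be checked mechanically. The following sketch, using Python's standard unicodedata module, classifies each pair as canonically equivalent, compatibility equivalent, or unrelated under normalization. The helper name relation and the PAIRS table are assumptions for illustration, and results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# (single code point, sequence) pairs taken from the list above.
PAIRS = [
    ("\u08a1", "\u0628\u0654"),  # ARABIC LETTER BEH WITH HAMZA ABOVE
    ("\u0681", "\u062d\u0654"),  # ARABIC LETTER HAH WITH HAMZA ABOVE
    ("\u0623", "\u0627\u0654"),  # ARABIC LETTER ALEF WITH HAMZA ABOVE
    ("\u09e1", "\u098c\u09e2"),  # BENGALI LETTER VOCALIC LL
    ("\u019a", "\u006c\u0335"),  # LATIN SMALL LETTER L WITH BAR
    ("\u00f8", "\u006f\u0337"),  # LATIN SMALL LETTER O WITH STROKE
    ("\u02a6", "\u0074\u0073"),  # LATIN SMALL LETTER TS DIGRAPH
    ("\u01c9", "\u006c\u006a"),  # LATIN SMALL LETTER LJ
    ("\u06c8", "\u0648\u0670"),  # ARABIC LETTER YU
    ("\u0069", "\u0069\u0307"),  # LATIN SMALL LETTER I
]


def relation(single: str, sequence: str) -> str:
    """Classify how normalization relates a code point to a sequence."""
    nfc = unicodedata.normalize
    if nfc("NFC", single) == nfc("NFC", sequence):
        return "canonical"
    if nfc("NFKC", single) == nfc("NFKC", sequence):
        return "compatibility"
    return "none"


for single, sequence in PAIRS:
    print(f"U+{ord(single):04X}: {relation(single, sequence)}")
```

Pairs reported as "none" are the ones that normalization cannot reconcile, so any mitigation has to come from repertoire restrictions, context rules, or registration policy.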

Other cases demonstrate that the issue does not lie exclusively or primarily with combining sequences:

U+0B95 vs U+0BE7: The TAMIL LETTER KA and TAMIL DIGIT ONE are always indistinguishable, but needed to be encoded separately because one is a letter and the other is a digit.
Arabic-Indic Digits vs. Extended Arabic-Indic Digits: Seven digits of these two sequences have entirely identical shapes. This case is an example of something dealt with in inclusion that nevertheless can lead to confusions that are not fully mitigated. IDNA, for example, contains context rules restricting the digits to one set or another; but such rules apply only to a single label, not to an entire name. Moreover, it provides no way of distinguishing between two labels that both conform to the context rule, but where each contains a different member of one of the seven identical shape pairs.
U+53E3 vs U+56D7: These are two Han characters (roughly rectangular) that are different when laid side by side; but they may be difficult to distinguish out of context or in very small print.
U+01DD vs U+0259: The two code points share the same lowercase form, but are encoded separately because their uppercase forms differ. The fact that they uppercase differently is taken as evidence that they are not the same abstract character, despite the superficial evidence of their shared shape. The more common cases, where the uppercase forms are identical, may be of less concern, given that IDNA2008 is limited to lowercase.
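
Several of these cases can be told apart only by their properties, not by their appearance. The sketch below, using Python's standard library, shows the property differences and a simplified per-label digit check. The function mixes_digit_sets is a hypothetical simplification written for illustration; it is not the IDNA context rule itself.

```python
import unicodedata

# Letter versus digit: same shape, different General_Category.
print(unicodedata.category("\u0b95"))  # Lo (TAMIL LETTER KA)
print(unicodedata.category("\u0be7"))  # Nd (TAMIL DIGIT ONE)

# Same lowercase shape, different uppercase mappings.
print(f"U+{ord(chr(0x01DD).upper()):04X}")  # U+018E
print(f"U+{ord(chr(0x0259).upper()):04X}")  # U+018F

ARABIC_INDIC = {chr(c) for c in range(0x0660, 0x066A)}
EXTENDED_ARABIC_INDIC = {chr(c) for c in range(0x06F0, 0x06FA)}


def mixes_digit_sets(label: str) -> bool:
    """Simplified per-label check: True if both digit sets appear."""
    chars = set(label)
    return bool(chars & ARABIC_INDIC) and bool(chars & EXTENDED_ARABIC_INDIC)


label_a = "\u0661\u0662\u0663"  # Arabic-Indic digits 1, 2, 3
label_b = "\u06f1\u06f2\u06f3"  # Extended Arabic-Indic digits 1, 2, 3

# Each label passes the per-label check on its own ...
print(mixes_digit_sets(label_a), mixes_digit_sets(label_b))  # False False
# ... yet the two labels are distinct strings with identical digit values.
print(label_a == label_b)                                    # False
print([unicodedata.digit(c) for c in label_a]
      == [unicodedata.digit(c) for c in label_b])            # True
```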

Cross-script homoglyphs usually do not involve combining sequences, but can be mitigated by rules requiring strings to be in a single script.

LATIN SMALL LETTER OPEN E is one of a handful of examples of characters borrowed from another script, in this case GREEK SMALL LETTER EPSILON.
LATIN SMALL LETTER E and CYRILLIC SMALL LETTER IE are historically related: both derive from uppercase forms of the GREEK CAPITAL LETTER EPSILON. There are a number of such pairs -- enough to make many whole strings that look the same in both scripts (but usually spell nonsense in one of them). An example would be "pax".
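
For whole strings such as "pax", where a single-script rule does not help, one possible mitigation is to let the first registration block its look-alikes. The sketch below assumes a deliberately tiny, hypothetical homoglyph table (HOMOGLYPHS) and a hypothetical Registry class; a real deployment would need a vetted and far larger table.

```python
# Hypothetical, deliberately incomplete table mapping a few Cyrillic
# letters to the Latin letters with which they normally share a glyph.
HOMOGLYPHS = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
}


def skeleton(label: str) -> str:
    """Map each code point to a representative of its homoglyph class."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in label)


class Registry:
    """First-come registration in which one label blocks its look-alikes."""

    def __init__(self) -> None:
        self._by_skeleton = {}

    def register(self, label: str) -> bool:
        key = skeleton(label)
        if key in self._by_skeleton:
            return False  # blocked by an earlier, mutually confusable label
        self._by_skeleton[key] = label
        return True


registry = Registry()
print(registry.register("pax"))                 # True: first come
print(registry.register("\u0440\u0430\u0445"))  # False: Cyrillic look-alike
```

Keying the registry on the skeleton rather than on the label itself is what makes the exclusion mutual: whichever of the two strings arrives first blocks the other.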

Appendix C. Discussion Venue

Note to RFC Editor: this section should be removed prior to publication as an RFC.

This Internet-Draft may be discussed on the IAB Internationalization public list: i18n-discuss@iab.org.

Appendix D. Change History

Note to RFC Editor: this section should be removed prior to publication as an RFC.

00:

Initial version

01:

Add background and examples from the LUCID Problem Statement
Add a paragraph about motivation to explain the difference between this registry and administrative policy more generally
Expand and clarify a number of earlier points of discussion
Attempt to make clear that this registry does not update any protocols
Move some formerly-appendix material to the body
Expand the initial registry.

Authors' Addresses

Asmus Freytag ASMUS, Inc. EMail: asmus@unicode.org

John C Klensin 1770 Massachusetts Ave, Ste 322 Cambridge, MA 02140 U.S.A. EMail: john-ietf@jck.com

Andrew Sullivan Oracle Corp. 100 Milverton Drive Mississauga, ON L5R 4H1 Canada EMail: andrew.s.sullivan@oracle.com