Prior to Internationalised Domain Names in Applications (IDNA), there was no standard method for domains, names, addresses and similar identifiers to use characters outside the ASCII repertoire. This is still the case for many identifiers that are not domain names, such as email addresses (local-part), newsgroup names, etc.
This document extends the mechanism defined in IDNA to other protocols and their identifiers. As with IDNA, these identifiers may be drawn from a large repertoire (Unicode) and are mapped to backward-compatible identifiers using only ASCII characters.
For valid domain names, X-IDNA produces the same encoding as IDNA, even when these domain names are embedded in other addresses.
This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as “work in progress.”
The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.
This Internet-Draft will expire on September 24, 2010.
Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the BSD License.
Table of Contents
1. Introduction
  1.1. Overview
  1.2. Rationale
  1.3. Requirements Language
  1.4. IDNA 2008
2. Definitions
  2.1. Addresses, Normalised Addresses and Labels
  2.2. ACE Prefix
  2.3. Address slots
  2.4. Characters
3. Requirements and Applicability
  3.1. Requirements
  3.2. Applicability and X-IDNA Profiles
4. Address Conversion
  4.1. Address Input
  4.2. Conversion To Unicode
  4.3. Address Normalisation
  4.4. Unicode Normalisation
  4.5. Extraction of Labels and Delimiters
  4.6. A-Label Input
  4.7. Validation and Character List Testing
  4.8. Punycode Conversion
  4.9. Re-Assembly
5. Address Validation
  5.1. Label Validation
    5.1.1. Hyphen Restrictions
    5.1.2. Leading Combining Marks
    5.1.3. Contextual Rules
    5.1.4. Labels Containing Characters Written Right to Left
    5.1.5. Successful Punycode Encoding
  5.2. Other Syntax Restrictions
  5.3. Local Restrictions
6. Acknowledgements
7. IANA Considerations
8. Security Considerations
9. References
  9.1. Normative References
  9.2. Informative References
Author's Address
1. Introduction
1.1. Overview
X-IDNA works by extracting anything from the address that fits the syntax of
a valid domain name "label", i.e. strings that roughly match the "LDH" syntax
for "A-labels" and "U-labels".
These extracted, putative labels are then put through a conversion whose
normative part is identical to the normative part of IDNA2008.
The characters that do not form labels, the separators, are solely drawn from the ASCII repertoire (potentially mapped from Unicode lookalikes) and thus need no internationalisation.
Special processing, called address normalisation, ensures that addresses considered equivalent in a protocol that allows arbitrary "quoting" or "escaping" produce the same "labels".
X-IDNA Profiles state to which (part of) address specifications X-IDNA is applied and what steps have to be taken for address normalisation.
1.2. Rationale
Unlike other methods for address internationalisation (such as allowing UTF-8), X-IDNA, like IDNA, allows the graceful introduction of internationalised addresses. It avoids upgrades to existing infrastructure (such as DNS servers and mail transport agents), and it allows some limited use of internationalised addresses in applications through the ASCII-encoded representation of the labels containing non-ASCII characters. While such names are user-unfriendly to read and type, and hence not optimal for user input, they can be used as a last resort to allow rudimentary usage of internationalised addresses. For example, they might be the best choice for display if it were known that relevant fonts were not available on the user's computer.
For protocols that have been extended to allow Unicode addresses to be used directly, X-IDNA also provides a way to "downgrade" the addresses that does not require lookups in a database or transmission of alternative ASCII addresses.
When strings covered by one profile of X-IDNA end up in places that are covered by a different profile, it does not matter whether the labels are converted to A-Labels first and then put into the other string or vice versa, provided that due care has been taken in defining these profiles.
The same is true for IDNA and X-IDNA as long as the domain name is valid, i.e. as long as it consists entirely of LDH-Labels. For example, if a valid domain name is put into the local-part of an email address, the conversion of the domain's U-Labels to A-Labels will be identical in IDNA and X-IDNA.
This property is best demonstrated by the examples in [I‑D.teint‑xidna‑zonefile] (Teint, N., “An X-IDNA Profile for DNS Zone Master Files,” March 2010.).
1.3. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119] (Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” March 1997.).
1.4. IDNA 2008
As X-IDNA essentially piggy-backs on IDNA 2008, the reader ought to be familiar with the following specifications: [I‑D.ietf‑idnabis‑defs] (Klensin, J., “Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework,” January 2010.), [I‑D.ietf‑idnabis‑protocol] (Klensin, J., “Internationalized Domain Names in Applications (IDNA): Protocol,” January 2010.), [I‑D.ietf‑idnabis‑mappings] (Resnick, P. and P. Hoffman, “Mapping Characters in IDNA,” October 2009.), [I‑D.ietf‑idnabis‑tables] (Faltstrom, P., “The Unicode code points and IDNA,” January 2010.), [I‑D.ietf‑idnabis‑bidi] (Alvestrand, H. and C. Karp, “Right-to-left scripts for IDNA,” January 2010.) and [I‑D.ietf‑idnabis‑rationale] (Klensin, J., “Internationalized Domain Names for Applications (IDNA): Background, Explanation, and Rationale,” January 2010.).
2. Definitions
2.1. Addresses, Normalised Addresses and Labels
An "address" is defined in this document to be a protocol element
used in addressing network resources, such as domain names, names,
addresses and similar identifiers. An address is in a
protocol-dependent format. An X-IDNA profile is required to specify
which protocol elements constitute an address and how to map
addresses to normalised addresses.
For the purposes of this specification, the definition of
an address need not contain the complete "address field" defined in
the other protocol.
For example, a protocol that has strings consisting of
locally-assigned names, domain names and numeric addresses may
specify that the locally-assigned names are covered by X-IDNA
whereas the domain names are covered by IDNA and the numeric
addresses are not internationalised. A different protocol that
allows address specifications containing both strings used for
resource identification and free-form text intended for human
consumption may specify that X-IDNA applies to the resource
identification part of the address specification whereas the
human-readable text is covered by a different internationalisation
protocol.
A "normalised address" is defined in this document to be an address that is
in a format suitable as input to the generic X-IDNA protocol defined in this
document. A normalised address consists of one or more labels, each separated
by one or more separators. It is produced in the normalisation steps (see
Section 4.3 (Address Normalisation) to Section 4.4 (Unicode Normalisation)) of the
X-IDNA protocol defined in this document.
For the purposes of this document, a normalised address
need not be suitable for equivalence comparison.
A "label" is defined in this document to be a part of a normalised address
that does not contain a separator. It is produced in the label extraction step
(see Section 4.5 (Extraction of Labels and Delimiters)) of the X-IDNA protocol defined in this
document.
This definition of "label" is a generalisation of that
found in [I‑D.ietf‑idnabis‑defs] (Klensin, J., “Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework,” January 2010.): the term is applied to strings that are not
part of a domain name.
The definition of "A-Label", "fake A-Label", "U-Label", "LDH-Label", "R-LDH-Label" and "NR-LDH-Label" is taken from [I‑D.ietf‑idnabis‑defs] (Klensin, J., “Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework,” January 2010.) but applied to strings that are not part of a domain name.
A "separator" is a (base) character that appears between labels and is passed through the X-IDNA process as-is. Separators are also extracted in the label extraction step (see Section 4.5 (Extraction of Labels and Delimiters)) of the X-IDNA protocol defined in this document.
An "ASCII address" is an address that consists entirely of base characters. It can contain A-Labels, fake A-Labels, R-LDH-Labels and NR-LDH-Labels and separators.
A "Unicode address" may consist of both base and extended characters. It can contain A-Labels, fake A-Labels, R-LDH-Labels, NR-LDH-Labels, U-Labels and separators.
An "internationalised address" is either an ASCII address or a Unicode address.
2.2. ACE Prefix
The "ACE prefix" is defined in this document to be a string of ASCII characters "xn--" that appears at the beginning of every A-Label. "ACE" stands for "ASCII-Compatible Encoding".
2.3. Address slots
An "address slot" is defined in this document to be a protocol element or a function argument or a return value (and so on) explicitly designated for carrying an "address". Examples of address slots include the email address given as the parameter to the SMTP MAIL or RCPT commands or in the "From:" field of an email message header; the newsgroup name appearing in Netnews; and the domain name in the QNAME field of a DNS query. A string that has the syntax of an address but that appears in general text is not in an address slot. For example, an email address appearing in the plain text body of an email message is not occupying an address slot.
An "X-IDNA-aware address slot" is defined to be an address slot explicitly designated for carrying an internationalised address as defined in profiles based on this document. The designation may be static (for example, in the specification of the protocol or interface) or dynamic (for example, as a result of negotiation in an interactive session).
An "X-IDNA-unaware address slot" is defined for this set of documents to be any address slot that is not an X-IDNA-aware address slot. Obviously, this includes any address slot whose specification predates X-IDNA or a profile defined for these types of addresses. For domain names, it includes any domain name slot (as defined in [I‑D.ietf‑idnabis‑defs] (Klensin, J., “Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework,” January 2010.), Section 2.3.2.6) that predates IDNA.
These definitions are generalisations of the "domain name slots" defined in [I‑D.ietf‑idnabis‑defs] (Klensin, J., “Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework,” January 2010.), Section 2.3.2.6.
2.4. Characters
A "base character" is a Unicode character in the range U+0000..U+007F. (These Unicode characters correspond to the "ASCII" character set and are also known as "ASCII characters".)
An "extended character" is a Unicode character that is not a base character, i.e. a character in the range U+0080 up to the maximum Unicode codepoint (U+10FFFF as of [Unicode] (Unicode Consortium, “Unicode Standard, Version 5.2,” December 2009.)).
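As an illustration only, the two character classes, and the notion of an ASCII address built on them in Section 2.1, can be written as simple predicates. The function names below are assumptions made for this sketch and are not defined by this document.

```python
def is_base_character(ch: str) -> bool:
    """True for code points U+0000..U+007F (the ASCII repertoire)."""
    return ord(ch) <= 0x7F


def is_extended_character(ch: str) -> bool:
    """True for code points U+0080 up to the Unicode maximum U+10FFFF."""
    return 0x80 <= ord(ch) <= 0x10FFFF


def is_ascii_address(address: str) -> bool:
    """An ASCII address consists entirely of base characters."""
    return all(is_base_character(ch) for ch in address)
```

Under these assumptions, an address such as "info@example.org" is an ASCII address, whereas any address containing a character like U+00FC is a Unicode address.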
3. Requirements and Applicability
3.1. Requirements
X-IDNA makes the following requirements:
3.2. Applicability and X-IDNA Profiles
The application of X-IDNA to any type of address depends on an additional specification that provides the details. These other specifications are referred to as X-IDNA Profiles.
Each definition of an X-IDNA Profile MUST include all of the following:
A specification MAY also include:
4. Address Conversion
Before an Internationalised Address is put into an X-IDNA-unaware slot, it MUST be converted to an ASCII Address using the following procedure.
Although some validity checks are necessary to avoid serious problems with the protocol, the tests are permissive and rely on the assumption that names that can be successfully used are valid. That assumption is, however, a weak one because the presence of wild cards in the receiving system might cause a string that has not been explicitly defined and validated to be successfully used as an address.
This procedure is a generalisation of the Domain Name Lookup Protocol defined in [I‑D.ietf‑idnabis‑protocol] (Klensin, J., “Internationalized Domain Names in Applications (IDNA): Protocol,” January 2010.), Section 5.
4.1. Address Input
The user supplies a string in the local character set, for example by typing it or clicking on, or copying and pasting, a resource identifier, e.g., a URI ([RFC3986] (Berners-Lee, T., Fielding, R., and L. Masinter, “Uniform Resource Identifier (URI): Generic Syntax,” January 2005.)) or IRI ([RFC3987] (Duerst, M. and M. Suignard, “Internationalized Resource Identifiers (IRIs),” January 2005.)) from which the address is extracted. Alternately, some process not directly involving the user may read the string from a file or obtain it in some other way.
Processing in this step and that specified in Section 4.2 (Conversion To Unicode), Section 4.3 (Address Normalisation) and Section 4.4 (Unicode Normalisation) are local matters, to be accomplished prior to actual invocation of X-IDNA.
4.2. Conversion To Unicode
The string is converted from the local character set into Unicode, if it is not already in Unicode.
Depending on local needs, this conversion may involve mapping some characters into other characters as well as coding conversions. This section defines a general algorithm that applications ought to implement in order to produce Unicode code points that will be valid under the IDNA protocol.
An application might implement the full mapping as described below, or can choose a different mapping. In fact, an application might want to implement a full mapping that is substantially compatible with the original IDNA protocol instead of the algorithm given here.
The general algorithm that an application (or the input method provided by an operating system) ought to use is relatively straightforward:
Unicode Normalisation ought to be deferred until after the Address Normalisation defined in the following section.
4.3. Address Normalisation
The string is then normalised as specified in the X-IDNA Profile applicable for the type of address slot into which the string is intended to be put. This step maps a syntactically valid address to a normalised address, from which labels can be extracted. It is defined by the X-IDNA Profile.
The rules for normalisations defined by X-IDNA Profiles are:
An application MAY also choose to map invalid addresses to syntactically valid addresses, for example by "escaping" or "quoting" problematic base characters; or it MAY reject invalid addresses in this step.
4.4. Unicode Normalisation
Depending on local needs, the string can then be mapped using Unicode Normalization Form C (NFC).
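A minimal sketch of this step, assuming Python and its standard `unicodedata` module; the helper name `to_nfc` is hypothetical.

```python
import unicodedata


def to_nfc(address: str) -> str:
    """Map a string to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", address)


# 'u' followed by U+0308 COMBINING DIAERESIS composes to U+00FC.
decomposed = "bu\u0308cher"
composed = to_nfc(decomposed)
```

Here `decomposed` holds seven code points and `composed` holds six, so both spellings yield the same label in the later steps.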
4.5. Extraction of Labels and Delimiters
The normalised address is then split into putative labels and separators. The labels extracted in this step have not been checked for conformance and are therefore referred to as "putative".
The putative labels will start with a letter, digit or extended character,
which may be followed by a string that consists of zero or more letters,
digits, extended characters or the HYPHEN-MINUS, and ends in a letter, digit or
extended character.
The delimiters will be base characters that are not digits or letters.
The label extraction differs significantly from the procedure specified in [I‑D.ietf‑idnabis‑mappings] (Resnick, P. and P. Hoffman, “Mapping Characters in IDNA,” October 2009.) by recognising a number of separators in addition to U+002E (FULL STOP).
The purpose of this procedure is to ensure that all labels are IDNA-valid (as defined in [I‑D.ietf‑idnabis‑defs] (Klensin, J., “Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework,” January 2010.)) regardless of the range of non-LDH characters allowed in the address.
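The splitting rule above can be sketched with a regular expression. This is an illustration under stated assumptions, not a normative algorithm: the names `split_address`, `_WORD` and `_TOKEN` are hypothetical, and every character that cannot be part of a putative label is reported as a one-character separator.

```python
import re

# Letters, digits and extended characters (U+0080..U+10FFFF).
_WORD = "[0-9A-Za-z\u0080-\U0010ffff]"

# A putative label starts and ends with a _WORD character and may contain
# HYPHEN-MINUS in between; anything else is a single-character separator.
_TOKEN = re.compile(
    f"(?P<label>{_WORD}(?:(?:{_WORD}|-)*{_WORD})?)|(?P<sep>.)",
    re.DOTALL,
)


def split_address(normalised: str) -> list[tuple[str, str]]:
    """Split a normalised address into ("label", text) and ("sep", text) pairs."""
    return [(m.lastgroup, m.group()) for m in _TOKEN.finditer(normalised)]
```

With this sketch, a HYPHEN-MINUS inside a label stays part of the label, while a leading or trailing HYPHEN-MINUS is treated as a separator, so extracted labels never start or end with a hyphen.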
4.6. A-Label Input
If a putative label extracted in the previous step appears to be an A-Label (i.e., it starts with "xn--", interpreted case-insensitively, and does not contain extended characters), the application MAY attempt to convert it to a U-Label, first ensuring that the A-Label is entirely in lower case (converting it to lower case if necessary), and apply the tests of Section 4.7 (Validation and Character List Testing) and the conversion of Section 4.8 (Punycode Conversion) to that form.
If the label is converted to Unicode (i.e., to U-Label form) using the Punycode decoding algorithm, then the processing specified in those two sections MUST be performed.
If any of these steps fails or rejects the label (i.e., it's a fake A-Label), the putative label MUST be used as-is.
If a label consists entirely of base characters, the following two steps (Section 4.7 (Validation and Character List Testing) and Section 4.8 (Punycode Conversion)) are skipped for this label.
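The optional decoding attempt might look as follows. This is a simplified sketch that assumes Python's built-in `punycode` codec; the function name is hypothetical, and the character tests of Section 4.7 are replaced here by a plain round-trip comparison, which is weaker than what this document requires.

```python
ACE_PREFIX = "xn--"


def try_a_label_to_unicode(label: str) -> str:
    """Return the decoded U-Label form, or the label as-is if decoding fails."""
    lowered = label.lower()
    if not label.isascii() or not lowered.startswith(ACE_PREFIX):
        return label  # does not look like an A-Label
    encoded = lowered[len(ACE_PREFIX):]
    try:
        decoded = encoded.encode("ascii").decode("punycode")
    except ValueError:  # UnicodeError is a subclass of ValueError
        return label  # fake A-Label: use as-is
    if decoded.isascii():
        return label  # nothing was actually encoded: treat as fake
    if decoded.encode("punycode").decode("ascii") != encoded:
        return label  # does not round-trip: treat as fake
    return decoded
```

A label that cannot be decoded is passed through unchanged, which mirrors the rule that fake A-Labels are used as-is.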
4.7. Validation and Character List Testing
The putative labels extracted in the previous step are checked to verify that all characters that appear in them are valid as input to X-IDNA processing. As discussed above, this check is liberal in order to allow for compatibility with future extensions.
Putative labels with any of the following characteristics MUST be rejected in this step:
In addition, the application SHOULD apply the following test:
This test may be omitted in special circumstances, such as when the lookup application knows that the conditions are enforced elsewhere, because an attempt to use such strings as an address will almost certainly lead to an error except when wild cards are present on a receiving system. However, applying the test is likely to give an earlier detection and much better information about the reason for a failure -- information that may be usefully passed to the user when that is feasible -- than later failure alone.
For all other strings, the lookup application MUST rely on the protocol using the address to determine the validity of the address and the characters it contains. If such strings can successfully be used, they are presumed to be valid; if they cannot, their possible validity is not relevant. While an application may reasonably issue warnings about strings it believes may be problematic, applications that decline to process a string that conforms to the rules above (i.e., do not allow putting it into an address slot) are not in conformance with this protocol.
4.8. Punycode Conversion
The string that has now been validated is converted to ACE form by applying the Punycode algorithm to the string and then adding the ACE prefix.
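As a minimal sketch, assuming Python's built-in `punycode` codec and a hypothetical helper name, the conversion of a single validated label can be written as:

```python
ACE_PREFIX = "xn--"


def u_label_to_a_label(label: str) -> str:
    """Convert one validated label to ACE form.

    Labels consisting entirely of base characters skip this step and are
    returned unchanged.
    """
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")
```

For example, the label "bücher" becomes "xn--bcher-kva", while "example" is left untouched.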
4.9. Re-Assembly
The A-Labels resulting from the conversion in Section 4.8 (Punycode Conversion) or supplied directly (see Section 4.6 (A-Label Input)) are combined with the delimiters (see Section 4.5 (Extraction of Labels and Delimiters)), in the original order, to form an A-Address.
The ASCII Address can then be put into an X-IDNA-unaware address slot and be used as a normal address. The use of this address can obviously either succeed or fail (resulting in a lookup failure, bounce message, etc.).
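Putting the steps of this section together, the conversion of an already normalised address might be sketched as follows. The sketch assumes Python's `re` and `unicodedata` modules and the built-in `punycode` codec; the name `to_ascii_address` is hypothetical, and the validation of Section 4.7 and the profile-specific normalisation of Section 4.3 are deliberately omitted.

```python
import re
import unicodedata

ACE_PREFIX = "xn--"
_WORD = "[0-9A-Za-z\u0080-\U0010ffff]"
_TOKEN = re.compile(
    f"(?P<label>{_WORD}(?:(?:{_WORD}|-)*{_WORD})?)|(?P<sep>.)",
    re.DOTALL,
)


def to_ascii_address(normalised: str) -> str:
    """Convert a normalised Unicode address to an ASCII address."""
    parts = []
    for m in _TOKEN.finditer(unicodedata.normalize("NFC", normalised)):
        text = m.group()
        # Only labels containing extended characters are converted;
        # separators and all-ASCII labels pass through unchanged.
        if m.lastgroup == "label" and not text.isascii():
            text = ACE_PREFIX + text.encode("punycode").decode("ascii")
        parts.append(text)
    return "".join(parts)
```

Because separators are kept in their original order, a label such as "bücher" is encoded to "xn--bcher-kva" whether it appears on the right of an "@" or inside the part on its left.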
5. Address Validation
Whenever an X-IDNA Profile mandates that the addresses be validated, the following procedure MUST be followed.
Addresses ought to be validated whenever an address is defined, registered, etc. An X-IDNA Profile defines when and by whom addresses are validated.
5.1. Label Validation
In order to validate individual labels embedded in the address, it is normalised as specified in Section 4.3 (Address Normalisation) and then the labels are extracted as specified in Section 4.5 (Extraction of Labels and Delimiters).
The following validation steps apply to the extracted labels. The labels (in the form of a Unicode string, i.e., a string that at least superficially appears to be a U-label) are then examined, performing tests that require examination of more than one character. Character order is considered to be the on-the-wire order, not the display order.
5.1.1. Hyphen Restrictions
If the Unicode string contains non-base characters, it MUST NOT contain "--" (two consecutive HYPHEN-MINUS characters) in the third and fourth character positions.
Note: The Unicode string will not start or end with a "-" (HYPHEN-MINUS) at this point.
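A minimal sketch of this test, using a hypothetical function name:

```python
def violates_hyphen_restriction(label: str) -> bool:
    """True if a label with non-base characters has "--" in positions 3 and 4."""
    return (not label.isascii()) and label[2:4] == "--"
```

Labels made up only of base characters, such as A-Labels starting with "xn--", are not affected by this test.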
5.1.2. Leading Combining Marks
The Unicode string MUST NOT begin with a combining mark or combining character (see Section 2.11 of [Unicode] (Unicode Consortium, “Unicode Standard, Version 5.2,” December 2009.) for an exact definition).
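One way to approximate this test, assuming Python's `unicodedata` module and treating every code point whose General Category starts with "M" (Mn, Mc, Me) as a combining mark, is sketched below. The function name is hypothetical, and the result depends on the Unicode version shipped with the Python interpreter.

```python
import unicodedata


def starts_with_combining_mark(label: str) -> bool:
    """True if the first code point has General Category Mn, Mc or Me."""
    return bool(label) and unicodedata.category(label[0]).startswith("M")
```

A label for which this function returns true fails validation.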
5.1.3. Contextual Rules
The Unicode string MUST NOT contain any characters whose validity is context-dependent, unless the validity is positively confirmed by a contextual rule. To check this, each code-point marked as CONTEXTJ or CONTEXTO in [I‑D.ietf‑idnabis‑tables] (Faltstrom, P., “The Unicode code points and IDNA,” January 2010.) MUST have a non-null rule. If such a code-point is missing a rule, it is invalid. If the rule exists but the result of applying the rule is negative or inconclusive, the proposed label is invalid.
5.1.4. Labels Containing Characters Written Right to Left
If the proposed label contains any characters that are written from right to left, it MUST meet the BIDI criteria in [I‑D.ietf‑idnabis‑bidi] (Alvestrand, H. and C. Karp, “Right-to-left scripts for IDNA,” January 2010.).
5.1.5. Successful Punycode Encoding
The Unicode string MUST be convertible to ACE form using the Punycode algorithm ([RFC3492] (Costello, A., “Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA),” March 2003.)), i.e., it MUST NOT cause an overflow.
5.2. Other Syntax Restrictions
In A-Address form, the address MUST conform to the syntax defined by the X-IDNA-unaware specification for the address type. This MAY include length restrictions, syntax restrictions regarding separators, etc.
5.3. Local Restrictions
In addition to the rules and tests above, there are many reasons why a site, registry, or administrators could reject an address.
The responsible entity is expected to establish or follow policies about addresses they wish to define or register. Policies are likely to be informed by the local languages and the scripts that are used to write them and may depend on many factors including what characters are in the label (for example, an address may be rejected based on other addresses already registered).
The same considerations as for IDNA registrations apply; see Section 3.2 of [I‑D.ietf‑idnabis‑rationale] (Klensin, J., “Internationalized Domain Names for Applications (IDNA): Background, Explanation, and Rationale,” January 2010.) for a discussion and recommendations about registry policies.
While X-IDNA, unlike IDNA 2008, allows addresses that contain fake A-Labels and R-LDH-Labels, responsible entities ought to avoid such addresses except when backwards-compatibility requires them.
6. Acknowledgements
The larger part of this specification was directly lifted from IDNA 2008, and it would not have been possible without the excellent work that went into those specifications.
7. IANA Considerations
This memo includes no request to IANA.
Instead, the definition of an X-IDNA Profile ought to be coordinated with the entity that controls the specification of the address slot type to which X-IDNA is applied.
8. Security Considerations
X-IDNA shares the Security Considerations for IDNA, which are described in [I‑D.ietf‑idnabis‑defs] (Klensin, J., “Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework,” January 2010.), except for the special issues associated with right to left scripts and characters. The latter are discussed in [I‑D.ietf‑idnabis‑bidi] (Alvestrand, H. and C. Karp, “Right-to-left scripts for IDNA,” January 2010.).
In addition, each X-IDNA Profile will require additional Security Considerations, which MUST be discussed in the document defining the Profile.
9. References
9.1. Normative References
9.2. Informative References
[I-D.teint-xidna-zonefile] Teint, N., “An X-IDNA Profile for DNS Zone Master Files,” draft-teint-xidna-zonefile-00 (work in progress), March 2010.
[RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, “Uniform Resource Identifier (URI): Generic Syntax,” STD 66, RFC 3986, January 2005.
[RFC3987] Duerst, M. and M. Suignard, “Internationalized Resource Identifiers (IRIs),” RFC 3987, January 2005.
Author's Address
Nick Teint
Email: nick.teint@googlemail.com