Network Working Group | Y. YONEYA |
Internet-Draft | JPRS |
Intended status: Informational | T. Nemoto |
Expires: January 1, 2015 | Keio University |
June 30, 2014 |
Mapping characters for PRECIS classes
draft-ietf-precis-mappings-08
The framework for preparation and comparison of internationalized strings ("PRECIS") defines several classes of strings for preparation and comparison. Case mapping is defined because many protocols perform case-sensitive or case-insensitive string comparison and so preparation of the string is mandatory. The Internationalized Domain Names in Applications (IDNA) specification and the PRECIS problem statement describe mappings for internationalized strings that are not limited to case, but include width mapping and mapping of delimiters and other specials that can be taken into consideration. This document provides guidelines for authors of protocol profiles of the PRECIS framework and describes several mappings that can be applied between receiving user input and passing permitted code points to internationalized protocols. The mappings described here are expected to be applied as additional mappings and as a locale- or context-dependent case mapping.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on January 1, 2015.
Copyright (c) 2014 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
In many cases, user input of internationalized strings is generated through the use of an input method editor ("IME") or through copy-and-paste from free text. Users generally do not care about the case and/or width of input characters because they consider those characters to be functionally equivalent or visually identical. Furthermore, users rarely switch the IME state to input special characters such as protocol elements. For Internationalized Domain Names ("IDNs"), the IDNA Mapping specification [RFC5895] describes methods for handling these issues. For PRECIS strings, case mapping and width mapping are defined in the PRECIS framework specification [I-D.ietf-precis-framework]. Further, the handling of mappings other than case and width, such as delimiter, special, and local case mappings, is also important in order to increase the probability that strings match as users expect. This document provides guidelines for authors of protocol profiles of the PRECIS framework and describes several mappings that can be applied between receiving user input and passing permitted code points to internationalized protocols. The delimiter mapping and special mapping rules described here are applied as "additional mappings" beyond those defined in the PRECIS framework, whereas the "local case mapping" rule provides an alternative to the case mapping rule specified in the PRECIS framework, since it handles some locale-dependent and context-dependent mappings.
The PRECIS framework defines several protocol-independent mappings. The additional mappings and local case mapping defined in this document are protocol-dependent, i.e., they depend on the rules for a particular application protocol.
Some application protocols define delimiters for their own use, so the delimiters differ from protocol to protocol. The delimiter mapping table should therefore be based on a well-defined mapping table for each protocol.
Delimiter mapping is used to map characters that are similar to protocol delimiters into the canonical delimiter characters. For example, there are width-compatible characters that correspond to the '@' in email addresses and the ':' and '/' in URIs. The '+', '-', '<' and '>' characters are other common delimiters that might require such mapping. For the FULL STOP character (U+002E), a delimiter in the visual presentation of domain names, some IMEs produce a character such as IDEOGRAPHIC FULL STOP (U+3002) when a user types FULL STOP on the keyboard. In all these cases, the visually similar characters that can come from user input need to be mapped to the correct protocol delimiter characters before the string is passed to the protocol.
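As a rough illustration of the delimiter mapping described above, the following Python sketch replaces a few width-compatible or visually similar characters with their canonical delimiters. The table `DELIMITER_MAP` and the function `map_delimiters` are hypothetical names, and the table contents are an assumption chosen for an email-like or domain-like protocol; a real profile would need its own well-defined table.

```python
# Hypothetical delimiter table; the entries are illustrative assumptions,
# not a table defined by this document or by any protocol.
DELIMITER_MAP = {
    0xFF20: "@",  # FULLWIDTH COMMERCIAL AT
    0xFF0E: ".",  # FULLWIDTH FULL STOP
    0x3002: ".",  # IDEOGRAPHIC FULL STOP
    0xFF61: ".",  # HALFWIDTH IDEOGRAPHIC FULL STOP
}


def map_delimiters(text, table=DELIMITER_MAP):
    """Replace delimiter look-alikes with the canonical delimiter."""
    # str.translate accepts a dict keyed by code point.
    return text.translate(table)


print(map_delimiters("user\uFF20example\u3002com"))
```

Characters that are not in the table pass through unchanged, so the mapping can be applied to input that already uses the canonical delimiters.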
Aside from delimiter characters, certain protocols have characters that need to be mapped in ways that differ from the rules specified in the PRECIS framework (e.g., mapping non-ASCII space characters to ASCII space). In this document, these mappings are called "special mappings". They are different for each protocol. Therefore, the special mapping table should be based on a well-defined mapping table for each protocol. As examples, EAP [RFC3748], SASLprep [RFC4013], IMAP4 ACL [RFC4314] and LDAPprep [RFC4518] define the rule that some codepoints for the non-ASCII space are mapped to SPACE (U+0020).
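The space mapping mentioned above could be sketched in Python as follows. This is only an approximation under a stated assumption: it treats every character in the Unicode general category Zs (space separator) as a "non-ASCII space", whereas each of the cited specifications defines its own list of code points. The function name `map_non_ascii_spaces` is hypothetical.

```python
import unicodedata


def map_non_ascii_spaces(text):
    """Map every space separator (category Zs) to SPACE (U+0020).

    Assumption: category Zs stands in for the protocol-specific list of
    non-ASCII space characters. U+0020 is itself Zs and maps to itself.
    """
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch
        for ch in text
    )


# NO-BREAK SPACE (U+00A0) and IDEOGRAPHIC SPACE (U+3000) both become U+0020.
print(map_non_ascii_spaces("a\u00A0b\u3000c"))
```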
The purpose of local case mapping is to increase the probability of a matching result from the comparison between uppercase and lowercase characters, targeting characters whose mapping depends on locale, or on locale and context.
As an example of locale- and context-dependent mapping, LATIN CAPITAL LETTER I ("I", U+0049) is normally mapped to LATIN SMALL LETTER I ("i", U+0069); however, if the language is Turkish (or one of several other languages), unless an I is before a dot_above, the character should be mapped to LATIN SMALL LETTER DOTLESS I (U+0131).
Case mapping using Unicode Default Case Folding in the PRECIS framework does not consider such locale or context because it is a common framework for internationalization. The local case mapping defined in this document responds to demands from applications that support users' locale and/or context. The target characters of local case mapping are the characters defined in the SpecialCasing.txt [Specialcasing] file in section 3.13 of the Unicode Standard [Unicode].
The case folding method for a target character is to map it to lowercase as defined in SpecialCasing.txt. The case folding method for all other, non-target characters is as specified in Section 4.1.3 of the PRECIS framework (i.e., it is RECOMMENDED to use Unicode Default Case Folding for all non-target characters). If an application supports users' locale and/or context, local case mapping can increase the probability of getting matching results from the comparison between strings.
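The rule above can be sketched in Python for the Turkish "I" example. The function `local_case_map` and its `lang` parameter are hypothetical, only the Turkish and Azerbaijani entries of SpecialCasing.txt are handled, and the "before dot_above" condition is simplified to look only at the immediately following character. All other characters fall back to `str.casefold()`, which applies Unicode full case folding.

```python
DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE


def local_case_map(text, lang=""):
    """Simplified local case mapping sketch (Turkic I rules only)."""
    turkic = lang in ("tr", "az")
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        prev = text[i - 1] if i > 0 else ""
        if turkic and ch == "I":
            # Dotless i unless the I is directly before a dot above.
            out.append("i" if nxt == DOT_ABOVE else "\u0131")
        elif turkic and ch == "\u0130":
            # LATIN CAPITAL LETTER I WITH DOT ABOVE maps to plain "i".
            out.append("i")
        elif turkic and ch == DOT_ABOVE and prev == "I":
            # The dot above after I is absorbed into the lowercase "i".
            continue
        else:
            # Non-target characters: Unicode Default Case Folding.
            out.append(ch.casefold())
    return "".join(out)


print(local_case_map("ISPARTA", "tr"))
print(local_case_map("ISPARTA"))
```

With `lang="tr"` the first call yields a leading dotless i (U+0131), while the second call, with no locale, yields an ordinary "i".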
If Unicode Default Case Folding is selected as the "Case Mapping" in the PRECIS profiles registry, PRECIS profile designers may consider whether local case mapping can be applied. If it can, it is better to add the note "local case mapping is applicable alternatively" after "Unicode Default Case Folding" for the benefit of application developers. The reason why local case mapping is an alternative to Unicode Default Case Folding is given in Appendix B.
Delimiter mapping and special mapping described in this document are expected to be applied as additional mappings in the PRECIS framework. The mappings described in this document could be applied in any order. This section specifies a particular order to minimize the effect of codepoint changes introduced by the mappings. This mapping order is very general and has been designed to be acceptable to the widest user community.
As with Mapping Characters for IDNA2008 [RFC5895], this document suggests creating mappings that might cause confusion for some users while alleviating confusion for other users. Such confusion is not covered in any depth in this document.
This document has no actions for the IANA.
Martin Dürst pointed out the need to address case folding in the mappings (map final sigma to sigma, German sz to ss, etc.).
Alexey Melnikov, Andrew Sullivan, Barry Leiba, Heather Flanagan, Joe Hildebrand, John Klensin, Marc Blanchet, Pete Resnick, Peter Saint-Andre, and others gave important suggestions for this document during WG meetings and WG Last Call.
[I-D.ietf-precis-framework] | Saint-Andre, P. and M. Blanchet, "PRECIS Framework: Preparation and Comparison of Internationalized Strings in Application Protocols", Internet-Draft draft-ietf-precis-framework-14, February 2014. |
[Unicode] | The Unicode Consortium, "The Unicode Standard, Version 6.3.0", <http://www.unicode.org/versions/Unicode6.3.0/>, 2012. |
[Casefolding] | The Unicode Consortium, "CaseFolding-6.3.0.txt", Unicode Character Database, July 2011, <http://www.unicode.org/Public/6.3.0/ucd/CaseFolding.txt>. |
[Specialcasing] | The Unicode Consortium, "SpecialCasing-6.3.0.txt", Unicode Character Database, July 2011, <http://www.unicode.org/Public/6.3.0/ucd/SpecialCasing.txt>. |
This table lists the mapping types used by each protocol. Values marked "o" indicate that the protocol uses the type of mapping. Values marked "-" indicate that the protocol does not use the type of mapping.
+----------------------+-------------+-----------+------+---------+
| Protocol and         | Width       | Delimiter | Case | Special |
| mapping RFC          | (NFKC)      |           |      |         |
+----------------------+-------------+-----------+------+---------+
| IDNA (RFC 3490)      | -           | o         | -    | -       |
| IDNA (RFC 3491)      | o           | -         | o    | -       |
| iSCSI (RFC 3722)     | o           | -         | o    | -       |
| EAP (RFC 3748)       | o           | -         | -    | o       |
| SASL (RFC 4013)      | o           | -         | -    | o       |
| IMAP (RFC 4314)      | o           | -         | -    | o       |
| LDAP (RFC 4518)      | o           | -         | o    | o       |
| XMPP (RFC 6120)      | -           | -         | o    | -       |
+----------------------+-------------+-----------+------+---------+
One outstanding issue regarding full case folding is that the character "LATIN SMALL LETTER SHARP S" (U+00DF) (hereinafter referred to as "eszett") becomes two "LATIN SMALL LETTER S" characters (U+0073 U+0073) when case mapping is performed using Unicode Default Case Folding in the PRECIS framework. If local case mapping in this document were not an alternative to the case mapping in the PRECIS framework, PRECIS profile designers could select both mappings; the locale-specific form of the German eszett would then be lost whenever the case mapping in the PRECIS framework was performed after the local case mapping.
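The effect can be observed with Python's built-in string methods. In this sketch, `str.casefold()` stands for Unicode Default Case Folding, and `str.lower()` is used as an assumed stand-in for a lowercase mapping that keeps the eszett; it is not the local case mapping of this document itself.

```python
word = "Stra\u00DFe"  # "Straße"

# Unicode Default Case Folding expands the eszett to "ss".
folded = word.casefold()
print(folded)

# A plain lowercase mapping keeps the eszett.
lowered = word.lower()
print(lowered)

# Applying default case folding afterwards discards the eszett anyway,
# which is why the two mappings are alternatives rather than stages.
print(lowered.casefold())
```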
As described in Section 2.3, the target characters of local case mapping are the characters defined in SpecialCasing.txt. The Unicode Standard (at least, up to version 6.3.0) does not define context-dependent mappings between "GREEK SMALL LETTER SIGMA" (U+03C3) (hereinafter referred to as "small sigma") and "GREEK SMALL LETTER FINAL SIGMA" (U+03C2) (hereinafter referred to as "final sigma"). Thus, final sigma is always mapped to small sigma by local case mapping. (Cf. the comments in SpecialCasing.txt.)
Local case mapping follows the Unicode definition, so the mapping between small sigma and final sigma is determined by that definition.
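As a small illustration, Python's `str.casefold()`, which applies Unicode full case folding, maps final sigma to small sigma regardless of position, so two spellings that differ only in the form of sigma compare equal after folding. The helper `same_after_folding` is a hypothetical name used only for this sketch.

```python
SMALL_SIGMA = "\u03C3"  # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03C2"  # GREEK SMALL LETTER FINAL SIGMA


def same_after_folding(a, b):
    """Compare two strings after Unicode Default Case Folding."""
    return a.casefold() == b.casefold()


with_final = "\u03BF\u03B4\u03CC" + FINAL_SIGMA  # ends in final sigma
with_small = "\u03BF\u03B4\u03CC" + SMALL_SIGMA  # ends in small sigma

print(with_final == with_small)                    # raw comparison
print(same_after_folding(with_final, with_small))  # folded comparison
```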