Internet Engineering Task Force J. Henderson, Ed.
Internet-Draft 1 January 2024
Intended status: Experimental
Expires: 4 July 2024
Unicode Separated Values (USV)
draft-unicode-separated-values-00
Abstract
Unicode Separated Values (USV) is a data format that uses Unicode
symbols for separator characters (U+241C through U+241F) to delimit
units, records, groups, and files.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 4 July 2024.
Copyright Notice
Copyright (c) 2024 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Henderson Expires 4 July 2024 [Page 1]
Internet-Draft Unicode Separated Values (USV) January 2024
Table of Contents
1. Introduction
1.1. Requirements Language
2. Unicode symbols in use
3. Definition of the USV Format
3.1. Data
3.2. Unit
3.3. Record
3.4. Group
3.5. File
3.6. Header
4. ABNF grammar
4.1. Semantics
4.2. Syntax
4.3. Character classes
4.4. Unicode symbols
5. Examples
5.1. Hello World
5.2. Hello World Goodnight Moon
5.3. Units, Records, Groups, Files
5.4. Articles
6. Source Code Examples
7. MIME media type registration for text/usv
7.1. Optional parameters: charset, header
7.2. Encoding considerations
7.3. Security considerations
7.4. Interoperability considerations
7.5. Published specification
7.6. Applications that use this media type
7.7. Additional information
8. IANA Considerations
9. Security Considerations
10. References
10.1. Normative References
10.2. Informative References
Appendix A. Appendix 1
Acknowledgements
Contributors
Author's Address
1. Introduction
Unicode Separated Values (USV) is a data format useful for exchanging
and converting data between various spreadsheet programs, databases,
and streaming data services. This document describes the USV format.
Additionally, we propose a new media type "text/usv", to be
registered with IANA.
We provide informative references for a USV git repository
[usv-git-repository] and a programming implementation as a USV Rust
crate [usv-rust-crate].
1.1. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in BCP
14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
2. Unicode symbols in use
Separators:
* ␟ U+241F Symbol for Unit Separator (US)
* ␞ U+241E Symbol for Record Separator (RS)
* ␝ U+241D Symbol for Group Separator (GS)
* ␜ U+241C Symbol for File Separator (FS)
Modifiers:
* ␛ U+241B Symbol for Escape (ESC)
* ␗ U+2417 Symbol for End of Transmission Block (ETB)
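The six symbols above can be written down as constants. The following
is a minimal sketch in Rust using only the standard library; the
constant names and the helper "is_special" are illustrative choices
for this example and are not taken from the USV crate.

```rust
// Code points as listed in this section.
pub const US: char = '\u{241F}'; // Symbol for Unit Separator
pub const RS: char = '\u{241E}'; // Symbol for Record Separator
pub const GS: char = '\u{241D}'; // Symbol for Group Separator
pub const FS: char = '\u{241C}'; // Symbol for File Separator
pub const ESC: char = '\u{241B}'; // Symbol for Escape
pub const ETB: char = '\u{2417}'; // Symbol for End of Transmission Block

/// Hypothetical helper: true when `c` is one of the six USV symbols.
pub fn is_special(c: char) -> bool {
    matches!(c, US | RS | GS | FS | ESC | ETB)
}
```

Each of these symbols occupies three bytes in UTF-8, which matters
when computing byte offsets into USV text.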
3. Definition of the USV Format
3.1. Data
Data consists of units, records, groups, and files.
3.2. Unit
A unit consists of content characters. It runs until a unit
separator (US).
Example unit and unit separator:
<CODE BEGINS> file "unit-and-unit-separator.usv"
aaa␟
<CODE ENDS>
3.3. Record
A record consists of units. It runs until a record separator (RS).
Example record and record separator:
<CODE BEGINS> file "record-and-record-separator.usv"
aaa␟bbb␟␞
<CODE ENDS>
3.4. Group
A group consists of records. It runs until a group separator (GS).
Example group and group separator:
<CODE BEGINS> file "group-and-group-separator.usv"
aaa␟bbb␟␞ccc␟ddd␟␞␝
<CODE ENDS>
3.5. File
A file consists of groups. It runs until a file separator (FS).
Example file and file separator:
<CODE BEGINS> file "file-and-file-separator.usv"
aaa␟bbb␟␞ccc␟ddd␟␞␝eee␟fff␟␞ggg␟hhh␟␞␝␜
<CODE ENDS>
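The nesting of units, records, groups, and files described in this
section can be sketched as a small reader. This is an illustration
only, not the USV crate's implementation. It assumes that a separator
at the end of a level acts as a terminator, so it does not create an
extra empty item, and it ignores ESC and ETB handling. The type
aliases and function names are hypothetical.

```rust
const US: char = '\u{241F}';
const RS: char = '\u{241E}';
const GS: char = '\u{241D}';
const FS: char = '\u{241C}';

// Hypothetical aliases used only in this sketch.
type Unit = String;
type Record = Vec<Unit>;
type Group = Vec<Record>;
type File = Vec<Group>;

/// Split `text` on `sep`, dropping the empty piece that follows a
/// trailing separator (assumption: trailing separators terminate).
fn split_level(text: &str, sep: char) -> Vec<&str> {
    let mut parts: Vec<&str> = text.split(sep).collect();
    if parts.last() == Some(&"") {
        parts.pop();
    }
    parts
}

fn parse_record(text: &str) -> Record {
    split_level(text, US).into_iter().map(String::from).collect()
}

fn parse_group(text: &str) -> Group {
    split_level(text, RS).into_iter().map(parse_record).collect()
}

fn parse_file(text: &str) -> File {
    split_level(text, GS).into_iter().map(parse_group).collect()
}

/// Parse USV text into files, groups, records, and units.
pub fn parse_usv(text: &str) -> Vec<File> {
    split_level(text, FS).into_iter().map(parse_file).collect()
}
```

Applied to the contents of "file-and-file-separator.usv" above, this
sketch yields one file containing two groups, each with two records of
two units.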
3.6. Header
An optional header may appear as the first item, in the same format
as normal items. The header contains names corresponding to the
fields in the data, and should contain the same number of fields as
the rest of the data. The presence or absence of the header should
be indicated via the optional "header" parameter of this media type
(see Section 7.1).
For example:
<CODE BEGINS> file "header.usv"
name␟name␟␞aaa␟bbb␟␞
<CODE ENDS>
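A reader that is told whether a header is present can separate the
first record from the rest. The sketch below assumes the data is a
flat run of records, as in "header.usv", and that the caller already
knows the value of the "header" parameter. The function names are
hypothetical.

```rust
const US: char = '\u{241F}';
const RS: char = '\u{241E}';

/// Split on `sep`, treating a trailing separator as a terminator.
fn split_level(text: &str, sep: char) -> Vec<&str> {
    let mut parts: Vec<&str> = text.split(sep).collect();
    if parts.last() == Some(&"") {
        parts.pop();
    }
    parts
}

/// Parse a flat run of records into vectors of units.
fn parse_records(text: &str) -> Vec<Vec<String>> {
    split_level(text, RS)
        .into_iter()
        .map(|record| {
            split_level(record, US)
                .into_iter()
                .map(String::from)
                .collect::<Vec<String>>()
        })
        .collect()
}

/// Return (header, data records). The header is `None` when the
/// caller says it is absent or when there are no records at all.
pub fn split_header(
    text: &str,
    header_present: bool,
) -> (Option<Vec<String>>, Vec<Vec<String>>) {
    let mut records = parse_records(text);
    if header_present && !records.is_empty() {
        let header = records.remove(0);
        (Some(header), records)
    } else {
        (None, records)
    }
}
```

With the example above and a header declared present, the header is
the record "name", "name" and one data record remains.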
4. ABNF grammar
4.1. Semantics
usv = *files
file = *groups
group = *records
record = *units
unit = *content-characters
4.2. Syntax
usv = 0*1( header ) body 0*1( ETB ) ; anything after is chaff
header = 1( unit_run / record_run / group_run / file_run )
body = *( unit_run / record_run / group_run / file_run )
file_run = *( file FS ) file ; next MUST be ( FS / ETB )
group_run = *( group GS ) group ; next MUST be ( GS / FS / ETB )
record_run = *( record RS ) record ; next MUST be ( RS / GS / FS / ETB )
unit_run = *( unit US ) unit ; next MUST be ( US / RS / GS / FS / ETB )
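The "usv" rule says that anything after ETB is chaff. One way a
reader might apply this is to cut the input at the first ETB before
parsing. The following sketch does that. It assumes ETB is never
escaped in the input, and the function name "strip_chaff" is
hypothetical.

```rust
const ETB: char = '\u{2417}';

/// Return the part of `text` before the first ETB. Text without an
/// ETB is returned unchanged. Escaped ETB is not considered here.
pub fn strip_chaff(text: &str) -> &str {
    match text.find(ETB) {
        // `find` returns a byte offset on a character boundary.
        Some(index) => &text[..index],
        None => text,
    }
}
```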
4.3. Character classes
content-character = *( typical-character / ESC '*' )
typical-character = '*' - special-character
special-character = US / RS / GS / FS / ESC / ETB
escape-character = ESC ( special-character / typical-character )
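The character classes can be illustrated with a small tokenizer. This
document does not spell out what an escaped character means, so the
sketch below makes an assumption: the character after ESC is treated
as literal content, whatever it is. A reader that wants the one item
per line layouts of Section 5 to parse back to the same units might
instead discard an escaped line break. The "Token" type and the
"tokenize" function are hypothetical.

```rust
const US: char = '\u{241F}';
const RS: char = '\u{241E}';
const GS: char = '\u{241D}';
const FS: char = '\u{241C}';
const ESC: char = '\u{241B}';
const ETB: char = '\u{2417}';

#[derive(Debug, PartialEq)]
pub enum Token {
    /// A typical character, or any character that followed ESC.
    Content(char),
    /// An unescaped US, RS, GS, or FS.
    Separator(char),
}

/// Classify each character of `text`, stopping at an unescaped ETB.
pub fn tokenize(text: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = text.chars();
    while let Some(c) = chars.next() {
        match c {
            ESC => {
                // Assumption: the escaped character is literal content.
                if let Some(escaped) = chars.next() {
                    tokens.push(Token::Content(escaped));
                }
            }
            // Everything after an unescaped ETB is chaff.
            ETB => break,
            US | RS | GS | FS => tokens.push(Token::Separator(c)),
            _ => tokens.push(Token::Content(c)),
        }
    }
    tokens
}
```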
4.4. Unicode symbols
US = U+241F Symbol for Unit Separator (US)
RS = U+241E Symbol for Record Separator (RS)
GS = U+241D Symbol for Group Separator (GS)
FS = U+241C Symbol for File Separator (FS)
ESC = U+241B Symbol for Escape (ESC)
ETB = U+2417 Symbol for End of Transmission Block (ETB)
5. Examples
5.1. Hello World
This kind of data …
<CODE BEGINS> file "hello-world.txt"
hello, world
<CODE ENDS>
… is represented in USV as two units:
<CODE BEGINS> file "hello-world.usv"
hello␟world␟
<CODE ENDS>
Optional: if you prefer to see one unit per line, then end each line
with a USV escape:
<CODE BEGINS> file "hello-world-with-lines.usv"
hello␟␛
world␟␛
<CODE ENDS>
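The one unit per line layout can be produced mechanically: write each
unit, then US, then ESC, then a line feed. The sketch below generates
the contents of "hello-world-with-lines.usv" from a list of units. It
assumes the units themselves contain no USV symbols, and the function
name is hypothetical.

```rust
const US: char = '\u{241F}';
const ESC: char = '\u{241B}';

/// Write each unit followed by US, ESC, and a line feed.
pub fn units_one_per_line(units: &[&str]) -> String {
    let mut out = String::new();
    for unit in units {
        out.push_str(unit);
        out.push(US);
        out.push(ESC);
        out.push('\n');
    }
    out
}
```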
5.2. Hello World Goodnight Moon
This kind of data …
<CODE BEGINS> file "hello-world-goodnight-moon.txt"
[ hello, world ], [ goodnight, moon ]
<CODE ENDS>
… is represented in USV as two records, each with two units:
<CODE BEGINS> file "hello-world-goodnight-moon.usv"
hello␟world␞goodnight␟moon␞
<CODE ENDS>
Optional: if you prefer to see one record per line, then end each
line with a USV escape:
<CODE BEGINS> file "hello-world-goodnight-moon-with-lines.usv"
hello␟world␞␛
goodnight␟moon␞␛
<CODE ENDS>
5.3. Units, Records, Groups, Files
USV with 2 units by 2 records by 2 groups by 2 files:
<CODE BEGINS> file "units-records-groups-files.usv"
a␟b␞c␟d␝e␟f␞g␟h␜i␟j␞k␟l␝m␟n␞o␟p␜
<CODE ENDS>
This is what the USV can look like when you display it with a simple
display tool:
<CODE BEGINS> file "units-records-groups-files-with-lines.usv"
a,b
c,d
-
e,f
g,h
=
i,j
k,l
-
m,n
o,p
<CODE ENDS>
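A simple display tool of this kind can be sketched as follows. The
code joins units with a comma, records with a line feed, groups with a
line containing "-", and files with a line containing "=", which
reproduces the display above. It assumes that a trailing separator
does not create an extra empty item and it ignores ESC and ETB. The
function names are hypothetical.

```rust
const US: char = '\u{241F}';
const RS: char = '\u{241E}';
const GS: char = '\u{241D}';
const FS: char = '\u{241C}';

/// Split on `sep`, treating a trailing separator as a terminator.
fn split_level(text: &str, sep: char) -> Vec<&str> {
    let mut parts: Vec<&str> = text.split(sep).collect();
    if parts.last() == Some(&"") {
        parts.pop();
    }
    parts
}

/// Render USV text in the simple line-oriented display form.
pub fn display(text: &str) -> String {
    split_level(text, FS)
        .into_iter()
        .map(|file| {
            split_level(file, GS)
                .into_iter()
                .map(|group| {
                    split_level(group, RS)
                        .into_iter()
                        .map(|record| split_level(record, US).join(","))
                        .collect::<Vec<String>>()
                        .join("\n")
                })
                .collect::<Vec<String>>()
                .join("\n-\n")
        })
        .collect::<Vec<String>>()
        .join("\n=\n")
}
```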
5.4. Articles
USV can format paragraphs, such as in this example data stream of
articles:
<CODE BEGINS> file "articles.usv"
Title One
␟
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim
veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip.
␞
Title Two
␟
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore
eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident,
sunt in culpa qui officia deserunt mollit anim id est laborum.
␞
Title Three
␟
Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium
doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore
veritatis et quasi architecto beatae vitae dicta sunt explicabo.
␞
<CODE ENDS>
6. Source Code Examples
Hello World using Rust and the USV crate
<CODE BEGINS> file "usv-rust-crate-units.rs"
use usv::*;
fn main() {
    // Collect the units of the input; the item type comes from the crate.
    let input = "hello␟world␟";
    let units: Vec<_> = input.units().collect();
}
<CODE ENDS>
Hello World Goodnight Moon using Rust and the USV crate
<CODE BEGINS> file "usv-rust-crate-records.rs"
use usv::*;
fn main() {
    // Collect the records of the input; the item type comes from the crate.
    let input = "hello␟world␞goodnight␟moon␞";
    let records: Vec<_> = input.records().collect();
}
<CODE ENDS>
7. MIME media type registration for text/usv
This section provides the MIME media type registration application
information.
To: ietf-types@iana.org
Subject: Registration of MIME media type text/usv
MIME media type name: text
MIME subtype name: usv
Required parameters: none
7.1. Optional parameters: charset, header
Common usage of USV is UTF-8, but other character sets defined by
IANA for the "text" tree may be used in conjunction with the
"charset" parameter.
The "header" parameter indicates the presence or absence of the
header line. Valid values are "present" or "absent". Implementors
choosing not to use this parameter must make their own decisions as
to whether the header line is present or absent.
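A receiver might read the "header" parameter from a media type string
as sketched below. This is a simplified illustration: it splits on
";" and "=" and does not handle quoted values that themselves contain
semicolons. The function name "header_param" is hypothetical.

```rust
/// Read the "header" parameter from a media type string such as
/// `text/usv; charset=utf-8; header=present`.
/// Returns Some(true) for "present", Some(false) for "absent", and
/// None when the parameter is missing or has another value.
pub fn header_param(media_type: &str) -> Option<bool> {
    media_type
        .split(';')
        .skip(1) // skip the type/subtype part
        .find_map(|param| {
            let (name, value) = param.split_once('=')?;
            if !name.trim().eq_ignore_ascii_case("header") {
                return None;
            }
            match value.trim().trim_matches('"') {
                "present" => Some(true),
                "absent" => Some(false),
                _ => None,
            }
        })
}
```

A result of None corresponds to the case described above, where the
implementor must decide whether the header line is present.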
7.2. Encoding considerations
This media type uses LF to denote line breaks. However, implementors
should be aware that some implementations may not conform, i.e., they
may incorrectly use other line-break sequences.
7.3. Security considerations
USV files contain passive text data that should not pose any risks.
However, it is possible in theory that malicious binary data may be
included in order to exploit potential buffer overruns in the program
processing USV data. Additionally, private data may be shared via
this format (which of course applies to any text data).
7.4. Interoperability considerations
Implementors should "be conservative in what you do, be liberal in
what you accept from others" (RFC 793) when processing USV data.
Implementations deciding not to use the optional "header" parameter
must make their own decision as to whether the header is absent or
present.
7.5. Published specification
https://github.com/sixarm/usv
7.6. Applications that use this media type
Spreadsheet programs, such as for import and export. Database
programs, such as for loading and saving text. Data conversion
utilities.
7.7. Additional information
Magic number(s): none
File extension(s): usv
Apple macOS File Type Code(s): TEXT
Intended usage: COMMON
Author/Change controller: IESG
Contact: Joel Parker Henderson <joel@joelparkerhenderson.com>
8. IANA Considerations
We are requesting that IANA create a standard MIME media type
"text/usv".
We have filed an IANA request for this, with the same contact
information.
9. Security Considerations
This document should not affect the security of the Internet.
10. References
10.1. Normative References
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017, <https://www.rfc-editor.org/info/rfc8174>.
[RFC2234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", RFC 2234, DOI 10.17487/RFC2234,
November 1997, <https://www.rfc-editor.org/info/rfc2234>.
[RFC2048] Freed, N., Klensin, J., and J. Postel, "Multipurpose
Internet Mail Extensions (MIME) Part Four: Registration
Procedures", RFC 2048, DOI 10.17487/RFC2048, November
1996, <https://www.rfc-editor.org/info/rfc2048>.
[RFC2046] Freed, N. and N. Borenstein, "Multipurpose Internet Mail
Extensions (MIME) Part Two: Media Types", RFC 2046,
DOI 10.17487/RFC2046, November 1996,
<https://www.rfc-editor.org/info/rfc2046>.
10.2. Informative References
[usv-git-repository]
Henderson, J., "USV repository at
https://github.com/sixarm/usv", 2022.
[usv-rust-crate]
Henderson, J., "USV rust crate at
https://crates.io/crates/usv", 2024.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
Appendix A. Appendix 1
This becomes an Appendix
Acknowledgements
The author would like to thank Y. Shafranovich, author of the CSV
RFC, which provided guidance for this USV document.
A special thank you goes to P.X.V.
Contributors
Thanks to all of the contributors.
Joel Parker Henderson
Email: joel@joelparkerhenderson.com
Author's Address
Joel Parker Henderson (editor)
601 Van Ness Ave #E3-359
San Francisco, CA 94102
United States of America
Phone: 1-415-317-2700
Email: joel@joelparkerhenderson.com
URI: https://linkedin.com/in/joelparkerhenderson