Internet DRAFT - draft-thomy-ntv-tab
draft-thomy-ntv-tab
Internet Engineering Task Force P. THOMY
Internet-Draft Loco-labs
Intended status: Informational 19 December 2023
Expires: 21 June 2024
NTV tabular format (NTV-TAB)
draft-thomy-ntv-tab-00
Abstract
This document describes a set of simple rules for unambiguously and
concisely encoding semantic tabular and multidimensional data (NTV-
TAB format). These rules are based on the NTV structure and its JSON
representation (JSON-NTV format).
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 21 June 2024.
Copyright Notice
Copyright (c) 2023 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
THOMY Expires 21 June 2024 [Page 1]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1. Presentation . . . . . . . . . . . . . . . . . . . . . . 3
1.2. Key design features . . . . . . . . . . . . . . . . . . . 3
1.3. Conventions Used in This Document . . . . . . . . . . . . 4
2. Tabular data . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1. Principles . . . . . . . . . . . . . . . . . . . . . . . 4
2.2. Tabular structure . . . . . . . . . . . . . . . . . . . . 5
2.3. Field structure . . . . . . . . . . . . . . . . . . . . . 8
2.4. Representation . . . . . . . . . . . . . . . . . . . . . 8
3. NTV-TAB format . . . . . . . . . . . . . . . . . . . . . . . 8
3.1. NTV structure . . . . . . . . . . . . . . . . . . . . . . 8
3.2. simple NTVfield formats . . . . . . . . . . . . . . . . . 9
3.3. default NTVfield formats . . . . . . . . . . . . . . . . 9
3.4. Optimized NTVfield formats . . . . . . . . . . . . . . . 12
3.5. Synthesis . . . . . . . . . . . . . . . . . . . . . . . . 13
4. Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.1. Field examples . . . . . . . . . . . . . . . . . . . . . 15
4.2. Dataset examples . . . . . . . . . . . . . . . . . . . . 17
5. Properties . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.1. JSON representation . . . . . . . . . . . . . . . . . . . 19
5.2. Dataset size . . . . . . . . . . . . . . . . . . . . . . 19
5.3. Nested NTV-TAB structure . . . . . . . . . . . . . . . . 20
6. Parsing a JSON-value . . . . . . . . . . . . . . . . . . . . 20
7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 21
8. Security Considerations . . . . . . . . . . . . . . . . . . . 22
9. References . . . . . . . . . . . . . . . . . . . . . . . . . 22
9.1. Normative References . . . . . . . . . . . . . . . . . . 22
9.2. Informative References . . . . . . . . . . . . . . . . . 22
Appendix A. Dataset sizing . . . . . . . . . . . . . . . . . . . 23
A.1. Methodology . . . . . . . . . . . . . . . . . . . . . . . 23
A.2. Formats . . . . . . . . . . . . . . . . . . . . . . . . . 25
Appendix B. Table schema compatibility . . . . . . . . . . . . . 27
B.1. Table schema . . . . . . . . . . . . . . . . . . . . . . 27
B.2. Compatibility . . . . . . . . . . . . . . . . . . . . . . 28
B.3. Example . . . . . . . . . . . . . . . . . . . . . . . . . 29
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . 30
Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 30
1. Introduction
THOMY Expires 21 June 2024 [Page 2]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
1.1. Presentation
The main operational standard used to exchange textual tabular data
is CSV format [RFC4180]. Unfortunately CSV format is obsolete (last
revision in 2005) and current CSV tools do not comply with the
standard.
It is therefore important to define an alternative format that meets
the expectations of tabular and multidimensional data exchanges. The
NTV-TAB format proposed here is a response to this need.
1.2. Key design features
The format's focus is on simplicity, lightness and web usage.
The key features of this format are the following:
* JSON as the base format
JSON is simple and readable as simple text
JSON supports rich structure including nesting and basic types
JSON is web-native and very widely used and supported
JSON format has binary representation (i.e. CBOR format)
* optimized representations
from the simplest to the most optimized, are available
avoid data duplication,
reduce the size of data,
allows strict and unambiguous reversibility (lossless round-
trip)
* high semantic level of data (JSON-NTV as a grammar)
Take into account most common data formats used in Internet
standards
wide variety of data typing
meta-data (header or schema) can be integrate
common format between tabular and multidimensional data
THOMY Expires 21 June 2024 [Page 3]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
* simple, compact, extensible and self-describing
1.3. Conventions Used in This Document
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in BCP
14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
This document also uses the following terms:
*JsonText, JsonValue, JsonObject, JsonMember, JsonElement,
JsonArray, JsonNumber, JsonString, JsonFalse, JsonNull, JsonTrue
:*
These terms are defined in [JSON-NTV].
*NTV, NTVlist, NVlist, Vlist, TVlist, NTVsingle, NVsingle,
TVsingle, Vsingle, NTVname, NTVtype, NTVvalue, JsonNTVtype,
JsonNTVname, JsonPrimitive, JsonUnnamed, JsonNamed:*
These terms are defined in [JSON-NTV].
*Row, Column, Table, Cell:*
These terms are defined in [W3C_TAB].
*Dataset*
A Dataset is equivalent to a Table
*Field*
A Field is equivalent to a Column
2. Tabular data
2.1. Principles
_*Tabular data* is data that is structured into rows, each of which
contains information about some things. Each row contains the same
number of cells (although some of these cells may be empty), which
provide values of properties of the thing described by the row. In
tabular data, cells within the same column provide values for the
same property of the things described by each row. This is what
differentiates tabular data from other line-oriented
formats._[W3C_TAB]
Two main uses are identified for tabular data:
THOMY Expires 21 June 2024 [Page 4]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
* a flow-oriented use for which each row is independent of the
others. The dataset is then seen as a list of rows whose number
can be variable. This is for example the case of a list of
measurements from a sensor.
* a structure-oriented use for which the rows are not independent
and contribute to describing the same object. This is for example
the case of a grade table for a class which integrates the
students, courses, periods, etc.
This document deals with this second use.
2.2. Tabular structure
In structure-oriented use, columns and rows are not equivalent, the
columns (or Fields) represent the 'semantics' of the data and the
rows represent a specific combination of Field's values according to
the structure defined by the tabular data (Dataset). The nature of
the rows is often implicit.
Two basic patterns are present in Datasets:
* *Tree pattern*: A tree is represented in tabular form by a list of
paths between each leaf and the node. The columns then represent
the levels of the tree.
* *Matrix pattern*: A matrix (or multidimensional data) is
represented in tabular form by a column of the values of the
matrix and additional columns represent the coordinates of each of
the values.
Table 1 and Table 2 present an example of such patterns
+======+=========+=========+
| Root | level 1 | level 2 |
+======+=========+=========+
| A | B | D |
+------+---------+---------+
| A | B | E |
+------+---------+---------+
| A | C | F |
+------+---------+---------+
| A | C | G |
+------+---------+---------+
Table 1: Tree pattern
THOMY Expires 21 June 2024 [Page 5]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
+=======+=====+=====+
| Value | row | col |
+=======+=====+=====+
| 1 | A | C |
+-------+-----+-----+
| 2 | A | D |
+-------+-----+-----+
| 3 | B | C |
+-------+-----+-----+
| 4 | B | D |
+-------+-----+-----+
Table 2: Matrix pattern
Taking these structures into account leads to significant duplication
of data. In the general case, Datasets mix these different
structures.
If we now observe the relationships between Fields [TAB-ANA], we can
identify four main uses:
* *association*: this consists of coupling each value of a Field to
a single value of another Field ("coupled" relationship between
two fields),
* *classification*: This involves grouping the data by category in
order - for example - to be able to make a statistical use of it,
("derived" relationship between two fields),
* *crossing*: This consists of representing all the combinations
between the two Fields, such as in matrix representations
("crossed" relationship between two fields),
* *characterization*: It corresponds to the documentation of defined
properties (no specific relationship).
_Example: Price list of different foods based on packaging for the
year 2022._Table 3
THOMY Expires 21 June 2024 [Page 6]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
+==+=======+=========+=========+======+=====+========+==============+
|Id|Product|Food |Packaging|Weight|Price| Period | Availability |
+==+=======+=========+=========+======+=====+========+==============+
|11|apple |fruit |bag |1 kg |1 | 2nd | Yes |
| | | | | | | half | |
| | | | | | | 2022 | |
+--+-------+---------+---------+------+-----+--------+--------------+
|12|apple |fruit |cardboard|10 kg |9 | 2nd | Yes |
| | | | | | | half | |
| | | | | | | 2022 | |
+--+-------+---------+---------+------+-----+--------+--------------+
|13|orange |fruit |bag |1 kg |2 | 2nd | end of 2022 |
| | | | | | | half | |
| | | | | | | 2022 | |
+--+-------+---------+---------+------+-----+--------+--------------+
|14|orange |fruit |cardboard|10 kg |18 | 2nd | end of 2022 |
| | | | | | | half | |
| | | | | | | 2022 | |
+--+-------+---------+---------+------+-----+--------+--------------+
|15|pepper |vegetable|bag |1 kg |1.5 | 2nd | end of 2022 |
| | | | | | | half | |
| | | | | | | 2022 | |
+--+-------+---------+---------+------+-----+--------+--------------+
|16|pepper |vegetable|cardboard|10 kg |13 | 2nd | end of 2022 |
| | | | | | | half | |
| | | | | | | 2022 | |
+--+-------+---------+---------+------+-----+--------+--------------+
|17|banana |fruit |bag |1 kg |0.5 | 2nd | Yes |
| | | | | | | half | |
| | | | | | | 2022 | |
+--+-------+---------+---------+------+-----+--------+--------------+
|18|banana |fruit |cardboard|10 kg |4 | 2nd | Yes |
| | | | | | | half | |
| | | | | | | 2022 | |
+--+-------+---------+---------+------+-----+--------+--------------+
Table 3: Price list
_We find here:_
* _association: between "Packaging" and "Weight",_
* _classification: between "Product" and "Food",_
* _crossing: between "Product" and "Weight",_
* _characterization: between "Product" and "Availability"_
THOMY Expires 21 June 2024 [Page 7]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
2.3. Field structure
A Field is an ordered set of Cells.
To represent this structure, several representations are possible
depending on the nature of the data:
* the simplest format is to represent a Field by the list of Cells
with the same order for all Fields. This format is interesting
when the data is little duplicated,
* when the data is repetitive, a second option is to represent on
the one hand the list of different data and on the other hand
their position in the list (i.e. categorical data),
* another special case also concerns repetitive data for which one
value is highly predominant (sparse data). In this case, it is
sufficient to provide only the position in the list of data except
for the one that is predominant,
* a last option consists in representing the Field according to its
dependence with another Field (coupled, derived or crossed
relationship). This leads to an optimized data volume.
2.4. Representation
Three representations are available for a tabular object : row-
oriented (list of Rows), cells-oriented (list of Cells), field-
oriented (list of Fields).
The field-oriented representation is retained because it takes into
account the semantics carried by the Fields as well as the inter-
Field analysis presented above.
A Dataset is then seen as a set of Fields representing the properties
of the entire Dataset.
The order of Fields or Rows is not relevant.
3. NTV-TAB format
3.1. NTV structure
A Dataset is represented by the following NTV entities:
* NTVcell represents a Cell. NTVcell is a NTVsingle.
THOMY Expires 21 June 2024 [Page 8]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
* NTVfield represents a Field. NTVfield is a NTV entity depending
on the format chosen to represent the Field (simple format,
default format, optimized format). A NTVfield contains a NTVlist
of part of the NTVcells (Codec) and optionnaly coding data.
* NTVdataset represents the Dataset. NTVdataset is a NVlist where
the NTVname is the name of the Dataset and the NTVvalue is the
list of NTVfields.
The JSON format of a NTVdataset is his JSON-NTV format.
3.2. simple NTVfield formats
This category is the usual representation of a Field with different
values (Full format) or with several identical values (Unique
format).
*Full format* :
The Full format is the format that does not use any coding. Codec
and NTVfield are identical. The NTVfield is therefore a NTVlist
where the NTVname is the name of the Field, the NTVtype is the
default type of the NTVcells and the NTVvalue is the list of
NTVcells.
_Example JsonNTVvalue ( "price" Field)_ :
_[ 1, 9, 2, 18, 1.5, 13, 0.5, 4 ]_
*Unique format* :
The Unique format is used when all NTVcells are identical. The Codec
is the NTVcell.Codec and NTVfield are identical (coding is implict).
The NTVfield is therefore the NTVcell.
_Example JsonNTVvalue ( "period" Field)_ :
_"2nd half 2022"_
Note :
The Unique format also makes it possible to represent tabular
metadata
3.3. default NTVfield formats
This category completes the simple formats with the other most common
representations of a Field :
THOMY Expires 21 June 2024 [Page 9]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
* Categorical Field (Complete format)
* Periodic Field (Primary format)
* Sparse Field (Sparse format)
In those formats, Codec is explicit and is the TVlist of different
Field NTVcells (Codec). The NTVfield is a NVlist where the NTVname
is the name of the Field.
*Complete format* :
The "complete format" is equivalent to the format used to store
categorical variables.
The NTVfield is a NVlist composed with two NTV entities :
* Codec: TVlist of different Field NTVcells (Codec),
* Coding: Vlist of indexes of NTVcells in Component (Keys)
The list of NTVcells is reconstituted by replacing the integers in
the coding Vlist with the NTVcell at the coding index in the Codec
(e.g. pandas categories and codes).
_Example JsonNTVvalue ( "product" Field)_ :
_[ [ "orange" , "pepper" , "apple" , "banana" ], [ 2, 2, 0, 0,
1, 1, 3, 3 ] ]_
*Sparse format* :
A specific format (one dimensional sparse LIL format) is used for
sparse data. It is defined by:
* 'fill_value': it should be most common value
* 'sp_value': it is a list storing only values distinct from the
'fill_value'
* 'sp_index': list of index of 'sp-value' in the sparse data list
The NTVfield is a NVlist composed with three NTV entities :
* Codec: TVlist of different Field NTVcells (Codec),
* Ref: Vlist of indexes of Codec value in 'sp_value' ,
THOMY Expires 21 June 2024 [Page 10]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
* Coding: Vlist of 'sp_index'
The list of NTVcells is reconstituted by replacing in a list of
'fill_value', the values with index in the Coding Vlist by the
corresponding value defined by the Ref index in the Codec TVlist.
_Example JsonNTVvalue ( "food" Field)_ :
_[ [ "vegetable" , "fruit"], [0, 0], [ 4, 5 ] ] 'fruit' is the
'fill_value' - 0 is the index of "vegetable"_
*Primary format* :
This format is equivalent to the Complete format where the Keys Vlist
is calculated with the "repetition coefficient".
The NTVfield is a NVlist composed with two NTV entities :
* Codec: TVlist of different Field NTVcells (Codec),
* Coding: Vlist with a single integer (Repetition coefficient)
The Keys Vlist is generated with the formula:
keys[ikey] = ( ikey % ( coef * period ) ) // coef
where:
keys: is the Keys Vlist
ikey: is the index of a key value
coef: is the Repetition coefficient
period: is the length of Codec
_Example: coef = 2, period = 3, Keys length = 12_
_keys = [0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2]_
The Repetition coefficient is the number of adjacent identical values
in the Keys list.
_Example "packaging"_ :
_Codec: [ "bag" , "cardboard" ]_
_Coefficient: 1_
THOMY Expires 21 June 2024 [Page 11]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
_(implicit Keys : [ 0, 1, 0, 1, 0, 1, 0, 1 ] )_
_Example "product"_ :
_Codec: [ "apple" , "orange" , "pepper" , "banana" ]_
_Coefficient: 2_
_(implicit Keys : [ 0, 0, 1, 1, 2, 2, 3, 3 ] )_
3.4. Optimized NTVfield formats
This category of formats reduces the size of Complete format with
optimized Keys. The length of Keys is reduced with using of derived
(Relative format) or coupled (Implicit format) relationships between
two Fields.
In those formats, Codec is explicit and is the TVlist of different
Field NTVcells (Codec). The NTVfield is a NVlist where the NTVname
is the name of the Field.
*Implicit format* :
This representation is associated with "coupled" Fields. These
Fields have a one-to-one correspondence.
The NTVfield is a NVlist composed with two NTV entities :
* Codec: TVlist of different field NTVcells (Codec),
* Ref: Vsingle entity index or name of the coupled Field.
This format is equivalent to the Complete format where Keys is the
Keys of the Field (with Complete format) defined by Ref.
_Example JsonNTVvalue ( "weight" Field is associated with
"packaging" Field )_ :
_[ [ "1 kg" , "10 kg" ], "packaging"]_
_( implicit Keys : [ 0, 1, 0, 1, 0, 1, 0, 1 ] )_
*Relative format* :
This representation is associated with "derived" Fields. These
Fields have a one-to-many correspondence.
THOMY Expires 21 June 2024 [Page 12]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
The values of a "derived" Field are inferred from the values of the
parent Field.
The Field is a NVlist composed with three NTV entities :
* Codec: TVlist of different field NTVcells (Codec),
* Ref: Vsingle entity index or name of the parent Field,
* Coding : Vlist of relative indexes of NTVcells in Codec (Relative
Keys).
This format is equivalent to the Complete format where the Keys Vlist
is obtained by replacing the values of the Keys Vlist of the parent
Field with the corresponding values in the Relative Keys (the length
of the Relative Keys is the length of the Codec of the parent Field).
_Example JsonNTVvalue ( "food" Field - "product" Field is the
parent Field of "food" Field)_ :
_[ [ "fruit" , "vegetable" ], "product", [ 0, 1, 0, 0 ] ]_
_(the Vlist Keys is obtained by replacing the values 0, 1, 2, 3
of the Vlist Keys of the "product" Field by the values 0, 1, 0,
0 of the Relative Keys i.e.: [ 0, 0, 0, 0, 1 , 1, 0, 0] )_
3.5. Synthesis
The NTVfield structure corresponding to the format defined above are
in Table 4:
THOMY Expires 21 June 2024 [Page 13]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
+====================+========+===============+==================+
| Structure | Codec | Ref | Coding |
+==========+=========+========+===============+==================+
| format | NTV | TVlist | Vsingle | Vlist |
+==========+=========+========+===============+==================+
| Relative | NTVlist | x | index | Relative Keys |
| | | | | |
| | len = 3 | | or name | len < len(Field) |
+----------+---------+--------+---------------+------------------+
| Complete | NTVlist | x | | Keys |
| | | | | |
| | len = 2 | | | len = len(Field) |
+----------+---------+--------+---------------+------------------+
| Sparse | NTVlist | x | list of index | sp_index |
| | | | | |
| | len = 3 | | sp_value | 1<len<len(Field) |
+----------+---------+--------+---------------+------------------+
| Implicit | NTVlist | x | index | |
| | | | | |
| | len = 2 | | or name | |
+----------+---------+--------+---------------+------------------+
| Primary | NTVlist | x | | coef |
| | | | | |
| | len = 2 | | | len = 1 |
+----------+---------+--------+---------------+------------------+
| Unique | NTVsingle |
+----------+-----------------------------------------------------+
| Full | NTVlist len = len(Field) |
+----------+-----------------------------------------------------+
Table 4: NTVfield formats
Three levels are available to convert tabular data in JSON structure
Table 5.
* *Level 0: "simple"* is the usual representation of tabular data.
Fields are converted with the Simple or Unique format.
* *Level 1: "default"* avoids duplication of information by adding
simple encoding.
Fields are converted according to their own structure (simple,
unique, categorical, sparse, periodic).
* *Level 2: "optimize"* avoids duplication of information and
minimizes encoding. It is the usual representation of
multidimensional data.
THOMY Expires 21 June 2024 [Page 14]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
This level requires an analysis of the relationships between
Fields ("partition")
+===============+=========================+
| Level | Structure |
+====+==========+==============+==========+
| | mode | Type Field | format |
+====+==========+==============+==========+
| 0 | simple | Unique | Unique |
| | +--------------+----------+
| | | Simple | Full |
+----+----------+--------------+----------+
| 1 | default | Unique | Unique |
| | +--------------+----------+
| | | Simple | Full |
| | +--------------+----------+
| | | Sparse | Sparse |
| | +--------------+----------+
| | | Categorical | Complete |
| | +--------------+----------+
| | | Periodic | Primary |
+----+----------+--------------+----------+
| 2 | optimize | Unique | Unique |
| | +--------------+----------+
| | | Root coupled | Full |
| | +--------------+----------+
| | | Root derived | Complete |
| | +--------------+----------+
| | | Primary | Primary |
| | +--------------+----------+
| | | Derived | Relative |
| | +--------------+----------+
| | | Coupled | Implicit |
+----+----------+--------------+----------+
Table 5: NTVfield levels
4. Examples
4.1. Field examples
The example in Section 2.2 has the following JSON representation
Table 6:
THOMY Expires 21 June 2024 [Page 15]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
+========+========================================================+
| Format | JsonNTV Representations |
+========+========================================================+
|Full |{ "price::float": [ 1, 9, 2, 18, 1.5, 13, 0.5, 4 ] } |
| | |
| |{ "price": [ 1, 9, 2, 18, 1.5, 13, 0.5, 4 ] } |
| | |
| |[ 1, 9, 2, 18, 1.5, 13, 0.5, 4 ] |
+--------+--------------------------------------------------------+
|Complete|{"product":[["orange","pepper","apple","banana"], |
| | |
| |[2,2,0,0,1,1,3,3]]} |
| | |
| |{"product": [ ["orange","pepper","apple","banana"], |
| | |
| |[2, 2, 0, 0, 1, 1, 3, 3] ]} |
| | |
| |[ ["orange","pepper","apple","banana"], |
| | |
| |[2, 2, 0, 0, 1, 1, 3, 3] ] |
+--------+--------------------------------------------------------+
|Unique |{ "period": "2nd half 2022" } |
| | |
| |"2nd half 2022" |
+--------+--------------------------------------------------------+
|Implicit|{"weight":[{"::string":["1 kg","10 kg"]},"packaging"]} |
| | |
| |[["1 kg","10 kg"],3] |
+--------+--------------------------------------------------------+
|Relative|{"food": [ {"::string": [ "fruit" , "vegetable" ]}, |
| | |
| |"product", [ 0, 1, 0, 0 ]] } |
| | |
| |[ [ "fruit" , "vegetable" ], 1, [ 0, 1, 0, 0 ] ] |
+--------+--------------------------------------------------------+
|Sparse |{"food":[{"::string":["vegetable","vegetable","fruit"]},|
| | |
| |[4,5,-1]]} |
| | |
| |[["vegetable","vegetable","fruit"],[4,5, 1]] |
+--------+--------------------------------------------------------+
|Primary |{"packaging":[{"::string":["cardboard","bag"]},[1]]} |
| | |
| |[["cardboard","bag"],[1]] |
| | |
| |{"product":[["apple","orange","peppers","banana"],[2]]} |
| | |
| |[["apple","orange","peppers","banana"],[2]] |
THOMY Expires 21 June 2024 [Page 16]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
+--------+--------------------------------------------------------+
Table 6: NTVfield examples
4.2. Dataset examples
The examples in Table 7 below illustrate the optimize level:
THOMY Expires 21 June 2024 [Page 17]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
+====================================+==============================+
| Data | Optimize level |
+========+===========================+==============================+
| type | Full format | JsonNTV |
+========+===========================+==============================+
|matrix |[['a','a','b','b','c','c'],|[[['a','b','c'],[2]], |
| | | |
| |[10,20,10,20,10,20], |[[10,20],[1]], |
| | | |
| |[1,2,3,4,5,6]] |[1,2,3,4,5,6]] |
+--------+---------------------------+------------------------------+
|single |[[1,2,3,4,5,6], |[[1,2,3,4,5,6], |
| | | |
| |['a','a','a','a','a','a']] |'a'] |
+--------+---------------------------+------------------------------+
|complete|[[1,2,3,3,5,5]] |[[[1,2,3,5],[0,1,2,2,3,3]]] |
+--------+---------------------------+------------------------------+
|coupled |[[1,2,3,3,5,5], |[[[1,2,3,5],[0,1,2,2,3,3]], |
| | | |
| |['a','b','c','c','e','e']] |[['a','b','c','e'],0]] |
+--------+---------------------------+------------------------------+
|derived |[[1,2,3,4,5,6], |[[1,2,3,4,5,6], |
| | | |
| |['a','a','b','b','c','c'], |[['a','b','c'],[0,0,1,1,2,2]],|
| | | |
| |[10,10,10,10,20,20]] |[[10,20],1,[0,0,1]]] |
+--------+---------------------------+------------------------------+
|matrix |[[6,6,7,7,8,8,9,9], |[[[6,7,8,9],[2]], |
| | | |
|+ |[10,20,10,20,10,20,10,20], |[[10,20],[1]], |
| | | |
|coupled |[1,1,2,2,3,3,4,4], |[[1,2,3,4],0], |
| | | |
| |[1,2,3,4,5,6,7,8]] |[1,2,3,4,5,6,7,8]] |
+--------+---------------------------+------------------------------+
|matrix |[[6,6,7,7,8,8,9,9], |[[[6,7,8,9],[2]], |
| | | |
|+ |[10,20,10,20,10,20,10,20], |[[10,20],[1]], |
| | | |
|coupled |[1,1,2,2,3,3,4,4], |[[1,2,3,4],0], |
| | | |
|+ |[11,11,22,22,22,22,22,22], |[[11,22],0,[0,1,1,1]], |
| | | |
|derived |[1,2,3,4,5,6,7,8]] |[1 2,3,4,5,6,7,8]] |
+--------+---------------------------+------------------------------+
Table 7: optimize level examples
THOMY Expires 21 June 2024 [Page 18]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
The examples in Table 8 below illustre NTVdataset with a length equal
to 0, 1 or 2:
+------------------+-------------------------------------------+
| [ ] or { } | _Empty NTVdataset_ |
+------------------+-------------------------------------------+
| [25] or [[25]] | _NTVdataset with 1 NTVfield and length 1_ |
+------------------+-------------------------------------------+
| [2, 1] or [[2], | _NTVdataset with 2 NTVfield and length 1_ |
| [1]] or [2, [1]] | |
+------------------+-------------------------------------------+
| [[2, 1]] | _NTVdataset with 1 NTVfield and length 2_ |
+------------------+-------------------------------------------+
| [[2, 1], [4, 3]] | _NTVdataset with 2 NTVfield and length 2_ |
+------------------+-------------------------------------------+
Table 8: NTVdataset with length 0, 1 or 2
5. Properties
5.1. JSON representation
NTV-TAB format defines the representation of a Dataset into the NTV
format. This conversion is reversible (lossless).
Furthermore, the NTV format defines the conversion into JSON format.
This conversion is also reversible.
The exchange format (JsonText) of a Dataset is therefore obtained by
a representation in NTV-TAB format then a conversion to JSON format
and finally a conversion to text format (or binary format with CBOR
conversion). The data is reconstituted identically by reverse
conversions.
5.2. Dataset size
As explain in Section 2.2 cells are often duplicated in a Field. The
principle of NTV-TAB format is to replace duplicated data with
encoding based on integers.
This optimization considerably reduces the size of a representation
of a Dataset. Appendix A details the methodology to optimize this
size.
THOMY Expires 21 June 2024 [Page 19]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
5.3. Nested NTV-TAB structure
NTVcells in a NTVdataset are any NTVsingle. We can therefore include
in a NTVdataset the data associated with the types defined in the NTV
format.
The 'tab' and the 'field' NTVtypes are associated to NTVdataset and
NTVfield. A NTVcell can also include a NTVdataset or a NTVfield.
Figure 1 is an example of nested Dataset. The 'nested' JsonNTV is
the representation of a Dataset with length equal 2 and composed with
two Fields 'field1' and 'field2'.
* 'field1' is a Field with two Cells 'dataset1' ans 'dataset2' which
are Dataset. 'field1' is represented with a Full format NTVfield.
* 'field2' is a Field with two Cells 'field2_1' ans 'field2_2' which
are Field. 'field2' is represented with a Full format NTVfield.
nested = {
"field1": {
"dataset1:tab":{
"dts1_field1": [1,2,3],
"dts1_field2": [4,5,6]
},
"dataset2:tab":{
"dts2_field1": [10,20,30],
"dts2_field2": [40,50,60],
"dts2_field3": [70,80,90]
},
},
"field2":{
"field2_1:field": [1,2,3],
"field2_2:field": [4,5,6],
}
}
Figure 1: Nested Dataset
6. Parsing a JSON-value
A NTV parser generates a NTV entity from a JSON-value.
The decoding NTV entity is directly converted into the NTVdataset and
a list of NTVfields.
For each NTVfield the format is deduced following the structure
defined in the table xxx.
THOMY Expires 21 June 2024 [Page 20]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
For each format, a decoder converts the NTVvalue of the NTVfield into
the chosen object.
_Note :_
_Several NTVvalue are ambiguous to deduce the Field format :_
- _[ list-data, integer, list-integer ] : Full or Relative format
?_
- _[ list-data, string, list-integer ] : Full or Relative format
?_
- _[ list-data, integer ] : Full or Implicit format ?_
- _[ list-data, string ] : Full or Implicit format ?_
- _[ list-data, list-integer ] : Full or Complete/Primary format
?_
- _[ list-data, list-integer, list-integer ] : Full or Sparse
format ?_
_The full format is not retained for those NTVvalue._
_To avoid this ambiguity, precautions can be taken for Dataset
with length = 2 or 3 and with a Full format:_
- _a name can be added to the list-data (e.g. { "data": list-
data}),_
- _the order of data can be changed (e.g. [ integer, list-data
])_
- _a type can be added (e.g. { "::json": [ list-data, list-
integer ] } )_
- _an additional field can be added (e.g. [ list-data, integer,
list-integer, {"format": "full"} ] )._
7. IANA Considerations
Any JsonValue is a JsonNTVValue and conversely, any JsonNTVvalue is a
JsonValue.
Thus, any JSON data may or may not be treated as JsonNTV data, so
there is no need to create a specific MIME media type for JsonNTV.
THOMY Expires 21 June 2024 [Page 21]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
All properties of the MIME media type "application/json" are
applicable.
8. Security Considerations
The format used for NTV data exchanges is the JSON format. So, all
the security considerations of [RFC8259] apply.
The NTV structure provides no cryptographic integrity protection of
any kind.
9. References
9.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
[RFC4180] Shafranovich, Y., "Common Format and MIME Type for Comma-
Separated Values (CSV) Files", RFC 4180,
DOI 10.17487/RFC4180, October 2005,
<https://www.rfc-editor.org/info/rfc4180>.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017, <https://www.rfc-editor.org/info/rfc8174>.
[RFC8259] Bray, T., Ed., "The JavaScript Object Notation (JSON) Data
Interchange Format", STD 90, RFC 8259,
DOI 10.17487/RFC8259, December 2017,
<https://www.rfc-editor.org/info/rfc8259>.
9.2. Informative References
[TABLE] "FrictionLess", "Table Schema", 2021,
<https://specs.frictionlessdata.io/table-
schema/#language>.
[JSON-NTV] Thomy, P., "JSON semantic format (JSON-NTV)", 2023,
<https://datatracker.ietf.org/doc/draft-thomy-json-ntv/>.
[TAB-ANA] Thomy, P., "Tabular dataset analysis", 2022,
<https://github.com/loco-philippe/tab-
analysis/blob/main/docs/tabular_analysis.pdf>.
THOMY Expires 21 June 2024 [Page 22]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
[W3C_TAB] "W3C", "Recommendation : Model for Tabular Data and
Metadata on the Web", 17 December 2015,
<https://www.w3.org/TR/2015/REC-tabular-data-model-
20151217/>.
Appendix A. Dataset sizing
This appendix presents an analysis of NTVdataset size optimization
with the defined formats.
A.1. Methodology
The principle of defined formats is to replace duplicated data with
encoding based on integers.
We define the size of a Dataset representation (SZ) as the sum of the
encoding size and the size of unencoded unduplicated values. The
coding is modeled as being the product of the values remaining to be
represented (nv - nc) with an average coding size (sc):
SZ = nc * sv + (nv - nc) * sc
where :
nv : number of values
sv : mean value size
nc : number of different values
sc : mean coding size
example :
Full format : {"product":
["orange","apple","apple","apple","orange","orange"]}
Complete format : {"product": [ ["orange","apple"], [1, 0, 0,
0, 1, 1] ]}
SZ = 9 + 8 + 7 + 6 * 1 = 30 (including double quotes),
nv = 9 (including the Field name),
sv = (9 + 8 * 3 + 7 * 3) / 7 = 7.71
nc = 3 (including the Field name)
THOMY Expires 21 June 2024 [Page 23]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
sc = (30 - 3 * 7.71) / (9 - 3) = 1.15 (sc = (SZ - nc * sv) /
(nv - nc))
In this example the JSON overhead (coma, space, curly bracket,
square bracket) is not included.
SZ is maximal when there is no coding (sc = sv) and minimal when the
coding is perfect (sc = 0):
SZmax = nv * sv
SZmin = nc * sv
We then define the following indicators:
* unicity level UL = nc / nv
UL = SZmin / SZmax
1 - UL = (SZmax - SZmin) / SZmax
UL characterizes the nature of the data independently of the
coding and represents the maximum achievable gain (1-UL).
maximum UL = 1 (unduplicated data)
minimum UL = 0 (full duplicated data = empty data)
* object lightness OL = sc / sv
OL = (SZ - SZmin) / (SZmax - SZmin)
1 - OL = (SZmax - SZ) / (SZmax - SZmin)
OL characterizes coding efficiency
maximum OL = 1 (no coding)
minimum OL = 0 (perfect coding)
The optimization of the size of the representation is then evaluated
by comparing the size obtained without coding and that obtained with
coding:
G = (SZmax - SZ) / SZmax = (1 - UL) * (1 - OL)
R = 1 - G = SZ / SZmax = UL + OL - UL * OL
THOMY Expires 21 June 2024 [Page 24]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
The maximum G gain is 1 - UL, the minimum G gain is 0.
If the data is empty, UL = OL = 0 and the gain is equal to 1.
The indicators are deduced from the following four measurable values:
* number of cells in the dataset (nv)
* number of different cells in the dataset (nc)
* size of the dataset with the format to study (SZ)
* size of the dataset with Full format (SZmax)
We then deduce sv = SZmax / nv as well as sc = (SZ - nc * sv) / (nv -
nc)
In the example above, the indicators are:
UL = 3 / 9 = 0.33
OL = 1.15 / 7.71 = 0.15
G = 0.67 * 0.85 = 0.57
SZmax = 9 * 7.71 = 69.4
SZmin = 3 * 7.71 = 23.1
The Complete format is close to the minimum size (SZ = 30) and its
size is less than half the size of the Full format (43 %).
A.2. Formats
The formats used to represent an NTVfield are in general form:
* list of part of NTVcells
* list of integers used to encode other NTVcells
The size of this format can then be written (without taking into
account the overhead linked to the format):
SZ = nc * sv + k * nv * si
where :
nv : number of values
THOMY Expires 21 June 2024 [Page 25]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
sv : mean value size
nc : number of different Field values
si : integer size
k: specific coefficient of the coding used
Comparison with the structure defined in the previous chapter allows
us to deduce the parameters:
UL = nc/nv
sc = si * k * nv/(nv-nc)
OL = si/sv * k * nv/(nv-nc)
G = 1- nc/nv - si/sv * k
R = nc/nv + si/sv * k
The gain G is therefore equal to the maximum gain 1-UL reduced by the
weight of the coding corresponding to the parameter k weighted by the
average size of the values compared to an integer.
Table 9 below specifies the values of k for the different formats:
+==========+===============+===========================+
| Format | k coefficient | comments |
+==========+===============+===========================+
| Full | 0 | R = 1 |
+----------+---------------+---------------------------+
| Unique | 0 | R = 1/nv (nc = 1) |
+----------+---------------+---------------------------+
| Complete | 1 | R = nc/nv + si/sv |
+----------+---------------+---------------------------+
| Primary | 1 / nv | R = nc/nv + si/sv/nv |
+----------+---------------+---------------------------+
| Coupled | 1 / nv | R = nc/nv + si/sv/nv |
+----------+---------------+---------------------------+
| Sparse | 2 * ns / nv | R = nc/nv + 2*si/sv*ns/nv |
+----------+---------------+---------------------------+
| Derived | nd / nv | R = nc/nv + si/sv*nd/nv |
+----------+---------------+---------------------------+
Table 9: coding coefficient
_ns: number of values distinct from the 'fill_value'_
THOMY Expires 21 June 2024 [Page 26]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
_nd: number of different values in the parent Field_
Appendix B. Table schema compatibility
This appendix presents the compatibility between Tableschema [TABLE]
and the NTV-TAB format.
B.1. Table schema
Table Schema is a simple language- and implementation-agnostic way to
declare a schema for tabular data. A Table Schema is represented by
a descriptor. The descriptor MUST be a JSON object with defined
properties (JsonMember).
Table Schema define following descriptors and properties:
* Fields property
Fields property MUST be an array where each entry in the array
is a field descriptor (as defined below).
* Field descriptor
A field descriptor MUST be a JSON object that describes a
single field.
* Field properties
The field descriptor object MAY contain any number of other
properties. Some specific properties are defined below. Of
these, only the name property is REQUIRED.
Defined Properties:
o name
o title
o description
o example
o type / format
o constraints
The constraints property on Table Schema Fields can be used by
consumers to list constraints for validating field values.
THOMY Expires 21 June 2024 [Page 27]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
* Table properties
In additional to field descriptors, there are the following
"table level" properties:
o missingValues
o primaryKey
o foreignKeys
B.2. Compatibility
Three levels of compatibility are addressed :
* Concepts
The concepts are equivalent between Table Schema and NTV-TAB :
o Table is equivalent to Dataset
o Field in Table Schema is equivalent to the NTVfield
o Name in Table Schema is equivalent to the NTVname of the
NTVfield
o Type / Format in Table Schema is equivalent to the NTVtype
of the NTVfield
* Type / Format
NTVtype combines the concepts of type and format. The
correspondence table in the NTV specification [JSON-NTV] gives
the link between an NTVtype and the corresponding type/format.
* Constraints
Constraints are applicable to each value in a Table Field.
Validating constraints for all values in a Table Field is
equivalent to validating a constraint for all values in the
Codec list.
These compatibility levels are reached, which makes it possible to
validate an NTVdataset with a schema defined according to the
Table Schema format.
The following principles should then be considered to validate an
NTVdataset:
THOMY Expires 21 June 2024 [Page 28]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
* The NTVfields names must be identical to the Schema names,
* If the NTVtype of Codec data is not 'json', it must match the
Type/Format defined in the Schema,
* If the constrainst are valid with the Codec data, they are valid
with the Field.
B.3. Example
Figure 2 is an example of Dataset with Full format ('tab_data1'
without NTVtypes) and with other formats ('tab_data2').
tab_data1 = {
"index": [100, 200, 300, 400, 500, 600],
"dates": ["1964-01-01", "1985-02-05", "2022-01-21", "1964-01-01",
"1985-02-05", "2022-01-21"],
"value": [10, 10, 20, 20, 30, 30],
"coord": [[1,2], [3,4], [5,6], [7,8], [3,4], [5,6]],
"names": ["john", "eric", "judith", "mila", "hector", "maria"],
"unique": ["true, "true", "true", "true", "true", "true"]
}
tab_data2 = {
"index": [100, 200, 300, 400, 500, 600],
"dates": {"::date":[["1964-01-01","1985-02-05","2022-01-21"],[1]},
"value": [[10, 20, 30], [2]],
"coord::point": [[1,2], [3,4], [5,6], [7,8], [3,4], [5,6]],
"names::string":["john", "eric", "judith", "mila", "hector",
"maria"],
"unique": True
}
Figure 2: Dataset example
The schema in Figure 3 is valid with 'tab_data1' and 'tab_data2'
formats
tab_schema = {
"fields": [
{"name":"index", "type":"integer", "constraint":{"minimum":50}},
{"name":"dates", "type":"date"},
{"name":"value", "type":"integer"},
{"name":"coord", "type":"geopoint", "format":"array"},
{"name":"names"},
{"name":"unique", "type":"boolean"}
]
}
THOMY Expires 21 June 2024 [Page 29]
Internet-Draft NTV tabular format (NTV-TAB) December 2023
Figure 3: Schema example
Acknowledgements
TBD
Contributors
TBD
Author's Address
Philippe THOMY
Loco-labs
476 chemin du gaf de Famian
84 500 BOLLENE
France
Email: philippe@loco-labs.io
URI: https://github.com/loco-philippe/NTV/blob/main/README.md
THOMY Expires 21 June 2024 [Page 30]