Network Working Group | P. Thierry |
Internet-Draft | Thierry Technologies |
Intended status: Experimental | August 06, 2013 |
Expires: February 07, 2014 |
BULK ARchive Format
draft-thierry-bulk-barf-00
This specification describes a BULK format to pack together independent pieces of data and metadata about them.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on February 07, 2014.
Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
There are plenty of archive formats currently in use, from widely used and repurposed formats like ZIP (used for generic file archives as well as Java deployment, ebooks and office documents), through moderately used formats enjoying a stable niche, like tar, RAR or StuffIt, to legacy formats like ARC or Z.
Some archive formats reuse existing ones. Many archive formats developed nowadays reuse ZIP without modification and merely dictate the tree structure inside the ZIP file. The Unix world has a long tradition of separation of concerns, using different formats for archiving (ar or tar) and compression (gzip, bzip2, lzma or now xz), with compressed archives named after the combination (foo.ar.gz, bar.tar.bz2, etc.). Debian packages are ar files containing a little uncompressed metadata and a couple of compressed tar files.
But the problem remains that these binary formats all define completely ad hoc syntaxes, sometimes incredibly optimized but narrowly tailored to their specific requirements. Many leave little room for future extension, or allow it only in a contrived way (many formats are extended by abusing an unused metadata field and cramming a new ad hoc format into it).
Some of these formats have a few fixed- or limited-length fields that became or will become obsolete in time. The ar format, for example, suffers from the Year 2038 problem and cannot store long file names. Various implementations have used different incompatible extensions to store long file names.
We therefore propose yet another archive format, one that uses an efficient but extensible syntax, so that the format can always be extended or modified for new use cases or constraints.
A BARF file is basically a set of metadata fields followed by data entries. Each entry consists of a set of metadata fields followed by its content. The interesting property of using BULK is that any portion of that structure is dynamic: there are no fixed metadata fields, and an entry without metadata is serialized as its bare content, because in BULK an entry and its content cannot be confused with each other. Anything can also be enclosed in a BULK structure to add features.
Metadata fields are just a BULK expression, which means that any ad hoc or standard BULK vocabulary can be used in an efficient way as metadata. Mutually incompatible metadata vocabularies could even be stored alongside each other for legacy support, if need be.
The archive file can be compressed or encrypted by an outside tool (producing a foo.barf.gz or bar.pgp file, for example), but so can any individual BULK expression. Inside the file, the entire archive can be a BULK compression or encryption form, and so can any metadata set, metadata field or entry. Almost any extension or optimization can be retrofitted into this structure in a backward-compatible way, such as checksums, digital signatures or access offsets for random access.
This extends the use case of BARF archives outside of archives for multiple files. An extensible image format could be based on a BARF structure, allowing seamless transition from a simple format to a full-featured one, whereas existing formats usually add complex extensions that fail to be widely adopted (to add support for layers, transparency, different compression or metadata). Although BARF would probably be ill-suited for playable audio and video, it would still provide a perfect fit for the storage of raw audio and video for editing programs.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].
Literal numerical values are provided in decimal or hexadecimal as appropriate. Hexadecimal literals are prefixed with 0x to distinguish them from decimal literals.
BULK byte sequences and expressions are described with the same conventions as those used in the BULK 1.0 specification [BULK1].
This specification defines the notion of Guaranteed Backward Compatibility (GBC). It applies to forms that carry a main payload with additional metadata. A form that obeys the rules of GBC has the type GBCForm.
A GBCForm has the shape ( Ref {arguments} {next}:Expr ). If the payload of a GBCForm is readable without knowledge of that form, then {next} MUST be that payload. Otherwise, {next} MUST be nil.
For example, a GBC-compliant checksum form could have the shape ( crc32c {crc}:Word32 {payload} ), where {crc} is the checksum of the byte sequence {payload}. On the other hand, a GBC-compliant encryption form, where obviously the payload is unreadable without proper knowledge of the form, could have the shape ( encrypt {payload} nil ).
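The fallback behaviour that GBC is meant to guarantee can be sketched in Python. This is only an illustrative model, not a BULK parser: forms are represented as plain Python tuples, nil as None, and the names read, check_crc32c and make_crc32c_form are hypothetical helpers that do not come from this specification. A reader that does not know a form simply takes its last element ({next}), while a reader that knows crc32c also verifies the checksum.

```python
from typing import Any, Callable, Dict

NIL = None  # stand-in for BULK nil in this model (assumption)


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def make_crc32c_form(payload: bytes) -> tuple:
    """Model of ( crc32c {crc}:Word32 {payload} ) as a Python tuple."""
    return ("crc32c", crc32c(payload), payload)


def check_crc32c(form: tuple) -> Any:
    """Handler for readers that know crc32c: verify, then yield the payload."""
    _, crc, payload = form
    if crc32c(payload) != crc:
        raise ValueError("crc32c mismatch")
    return payload


def read(expr: Any, known: Dict[str, Callable[[tuple], Any]]) -> Any:
    """Unwrap nested GBC forms until a non-form value is reached.

    Known forms are passed to their handler. Unknown forms fall back to
    their last element, which GBC requires to be the payload or nil.
    """
    while isinstance(expr, tuple):
        handler = known.get(expr[0])
        expr = handler(expr) if handler is not None else expr[-1]
    return expr


form = make_crc32c_form(b"hello")
legacy_view = read(form, {})                           # reader unaware of crc32c
checked_view = read(form, {"crc32c": check_crc32c})    # reader aware of crc32c
opaque_view = read(("encrypt", b"\x8f\x02", NIL), {})  # unreadable payload
```

In this model both readers recover the same payload from the checksum form, and the unaware reader obtains nil from the encryption form instead of mistaking the ciphertext for readable content.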
The archive namespace (mnemonic: barf) is an official namespace identified by the UUID urn:uuid:8beba7c6-c65d-5256-a2da-3763513953f3 (BULK, "Stack 'em. Pack 'em. And rack 'em."). It provides a standard way to pack one or more data elements together with metadata.
This packs archive entries together as a form. {metadata} holds metadata about the whole pack. In the context of {metadata}, rdf:this-resource designates the whole pack.
This stacks archive entries together as a sequence, for the cases where it is not appropriate for entries to belong to a single expression. {metadata} holds metadata about the whole stack. In the context of {metadata}, rdf:this-resource designates the whole stack. {entries-metadata} MUST be a sequence of expressions whose length is less than or equal to the number of expressions in {entries}. Each expression in {entries-metadata} holds metadata about a single entry of the stack. In the context of such a metadata expression, rdf:this-resource designates the described stack entry. By default, expression number N in {entries-metadata} describes expression number N in {entries}.
When the stack form is in the abstract yield, this has the property that if the last entry is an Array, the actual payload constitutes the end of the BULK stream. This can make it possible for BULK-unaware programs to read and/or write that payload easily.
Stacking also makes the addition of a metadata-carrying entry or a metadata-less entry an append-only operation.
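The default pairing between {entries-metadata} and {entries} can be sketched as follows. This is a minimal illustration in Python that assumes both sequences have already been parsed into Python lists; the function name pair_stack_entries is hypothetical and not part of this specification.

```python
from typing import Any, List, Optional, Sequence, Tuple


def pair_stack_entries(
    entries: Sequence[Any], entries_metadata: Sequence[Any]
) -> List[Tuple[Any, Optional[Any]]]:
    """Pair each stack entry with its metadata using the default rule.

    Metadata expression number N describes entry number N. Entries beyond
    the end of entries_metadata carry no metadata (None in this model).
    """
    if len(entries_metadata) > len(entries):
        raise ValueError("{entries-metadata} is longer than {entries}")
    return [
        (entry, entries_metadata[i] if i < len(entries_metadata) else None)
        for i, entry in enumerate(entries)
    ]


pairs = pair_stack_entries([b"a", b"b", b"c"], ["meta-a", "meta-b"])
```

Under this model, appending a metadata-less entry only extends the list of entries, and the pairing of earlier entries is unchanged.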
This form associates arbitrary metadata with an arbitrary payload. It is intended to constitute most entries in BARF archives. In the context of {metadata}, rdf:this-resource designates the payload.
Type: GBCForm
This form makes it possible to include a complete BULK stream without modification, as {payload}.
Type: GBCForm
This form encapsulates a compressed payload. This specification doesn't define names to express a compression method.
Type: GBCForm
This form encapsulates an encrypted payload. This specification doesn't define names to express an encryption method.
Type: GBCForm
[RFC2119] | Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. |
[BULK1] | Thierry, P., "Binary Uniform Language Kit 1.0", Internet-Draft draft-thierry-bulk-02, August 2013. |
[ISO8601] | "ISO 8601:2004 Data elements and interchange formats -- Information interchange -- Representation of dates and times", 2004. |