Network Working Group                                      T.E.K. Keiser
Internet-Draft                                               Sine Nomine
Intended status: Informational                            April 23, 2012
Expires: October 23, 2012
AFS-3 Directory Object Type Definition
draft-keiser-afs3-directory-object-00
Directory lookups in the AFS-3 distributed file system are supported by defining a canonical encoding for a directory object, and transmitting all--or part--of that object from a file server to its clients so that clients may resolve paths into AFS file IDs (FIDs). This memo describes the AFS-3 directory object wire encoding.
Comments regarding this draft are solicited. Please include the AFS-3 protocol standardization mailing list (afs3-standardization@openafs.org) as a recipient of any comments.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on October 23, 2012.
Copyright (c) 2012 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
AFS-3 [AFS1] [AFS2] is a distributed file system that has its origins in the VICE project [CMU-ITC-84-020] [VICE1] at the Carnegie Mellon University Information Technology Center [CMU-ITC-83-025], a joint venture between CMU and IBM. VICE later became AFS when CMU moved development to a new commercial venture called Transarc Corporation, which later became IBM Pittsburgh Labs. AFS-3 is a suite of un-standardized network protocols based on a remote procedure call (RPC) suite known as Rx [AFS3-RX]. While no de jure standards for AFS-3 exist, the various AFS-3 implementations have agreed upon certain de facto standards, largely helped by the existence of an open source fork called OpenAFS that has served the role of reference implementation. In addition to the OpenAFS reference implementation, there is developer documentation, written and donated by IBM, that contains somewhat outdated specifications for the Rx protocol and all AFS-3 remote procedure calls, as well as a detailed description of the AFS-3 system architecture.
Unlike most classical network file systems, AFS-3 explicitly eschews remote procedure calls to facilitate file server-assisted directory lookup operations. This was a conscious decision meant to limit server load by placing lookup operations on the clients. In the common cases, where there is significant locality of reference to directory entries, this results in a substantial reduction in server load (especially given the AFS-3 cache coherence model). It should be noted that 23 years of empirical evidence have borne out this decision as useful for many general-purpose workloads, while disadvantageous for certain very specific workloads (e.g., large directory objects with extremely non-uniform directory entry reference distributions--where the server overhead of a lookup RPC would be inconsequential compared to the directory file transfer overhead of the existing model).
Due to the distributed nature of AFS-3 directory objects, a canonical directory wire-format is an intrinsic part of the AFS-3 protocol. This memo documents the directory object wire format; a future document will document the lookup and modification algorithms, which, owing to the decentralized nature of AFS-3 directories, must be implemented in the to-be-specified manner.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].
An AFS-3 directory object consists of between 1 and BIGMAXPAGES (1023) "pages" of length AFS_PAGESIZE (2048 octets).
(MSB)                                                       (LSB)
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          page #0000                           |
~                         (2048 octets)                         ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
~                              ...                              ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          page #1022                           |
~                         (2048 octets)                         ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Directory Structure
All pages in a directory object are AFS_PAGESIZE (2048 octets) in length. All pages are subdivided into EPP (64) records, each RECSIZE (32 octets) in length.
(MSB)                                                       (LSB)
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          record #00                           |
~                          (32 octets)                          ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
~                              ...                              ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          record #63                           |
~                          (32 octets)                          ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Page Structure
Each record within a directory object is referenced by an index number. This number represents the offset from the start of the file, in units of records, i.e., in multiples of RECSIZE. A record index of 0 would thus point to record 0 in page 0, and a record index of 67 would point to record 3 in page 1. Computing the file offset from an index is simply a matter of left logical shifting the index value by LRECSIZE (5) bits. Conversely, computing the index from a file offset merely involves a right logical shift by LRECSIZE (5) bits.
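The index arithmetic described above can be sketched in C as follows. This is a minimal illustration rather than a normative definition: the function names and the LEPP constant (log2 of EPP) are assumptions introduced for the example, while the other constants carry the values given in this memo.

```c
#include <assert.h>
#include <stdint.h>

/* Constants with the values given in this memo. */
#define RECSIZE      32u    /* octets per record */
#define LRECSIZE     5u     /* log2(RECSIZE) */
#define EPP          64u    /* records per page */
#define BIGMAXPAGES  1023u  /* maximum pages per directory object */

/* ASSUMPTION: LEPP is a hypothetical name for log2(EPP). */
#define LEPP         6u

/* File offset, in octets, of the record with the given index. */
uint32_t rec_index_to_offset(uint32_t index)
{
    assert(index < BIGMAXPAGES * EPP);
    return index << LRECSIZE;
}

/* Index of the record that contains the given file offset. */
uint32_t rec_offset_to_index(uint32_t offset)
{
    return offset >> LRECSIZE;
}

/* Page number that holds the record with the given index. */
uint32_t rec_index_to_page(uint32_t index)
{
    return index >> LEPP;
}

/* Position of the record within its page (0..EPP-1). */
uint32_t rec_index_in_page(uint32_t index)
{
    return index & (EPP - 1u);
}
```

For the record index 67 used in the text, these helpers yield page 1, record 3, and a file offset of 2144 octets.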
Each 2048-octet page within an AFS-3 directory object contains a 32-octet (RECSIZE) header at offset 0 in the following form:
(MSB)                                                       (LSB)
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            pgcount            |              tag              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   reserved    |                                               |
+-+-+-+-+-+-+-+-+                                               +
|                       allocation bitmap                       |
+               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               |                                               |
+-+-+-+-+-+-+-+-+                                               +
|                           reserved                            |
~                          (19 octets)                          ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Directory Page Header
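As an illustration of the page header layout, the following C sketch decodes the 32-octet header into a structure. The structure and function names are hypothetical. The sketch also assumes that the 16-bit pgcount and tag fields are encoded most-significant octet first (network byte order), as the (MSB)/(LSB) labels in the figure suggest; that byte order is an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RECSIZE 32u   /* the page header occupies one record */

/* Decoded page header (hypothetical structure name). */
struct dir_page_header {
    uint16_t pgcount;
    uint16_t tag;
    uint8_t  alloc_bitmap[8];   /* one bit per record in the page */
};

/* ASSUMPTION: multi-octet fields are big-endian on the wire. */
static uint16_t get_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Decode the header stored in the first RECSIZE octets of a page. */
void dir_page_header_decode(const uint8_t page[RECSIZE],
                            struct dir_page_header *out)
{
    assert(page != NULL && out != NULL);
    out->pgcount = get_be16(page + 0);      /* octets 0-1 */
    out->tag     = get_be16(page + 2);      /* octets 2-3 */
    /* octet 4 is reserved */
    memcpy(out->alloc_bitmap, page + 5,     /* octets 5-12 */
           sizeof out->alloc_bitmap);
    /* octets 13-31 are reserved (19 octets) */
}
```

The field offsets follow directly from the figure: 2 + 2 + 1 + 8 + 19 = 32 octets.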
The allocation bitmap contains one bit per record. The least significant bit of the first octet of the bitmap refers to the page header record (which is stored at page offset 0). The second least significant bit of the first octet refers to record 1 (at page offset 32), and so on, until the most significant bit of the eighth octet, which refers to the 64th and final record of the page (record 63 at page offset 2016).
The following invariants hold:
   page_entry_offset    = page_entry_index << LRECSIZE
   alloc_bitmap_index   = page_entry_index >> 3
   alloc_bitmap_bit_num = page_entry_index & 0x7
   alloc_bitmap_bit     = 1 << alloc_bitmap_bit_num
The variable page_entry_index, which is an unsigned integer between 0 and 63, can be derived from the record index (Section 5.1) by bit-wise ANDing it with EPP-1.
The equations above are written in C pseudocode: all variables are assumed to be unsigned integers, and the operators are assumed to behave as defined by the ANSI C standard.
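The invariants above translate almost directly into C. The following sketch tests whether a record is marked as allocated in a page's bitmap. The function names are hypothetical, and the bitmap is assumed to have already been extracted from the page header as an array of eight octets.

```c
#include <assert.h>
#include <stdint.h>

#define EPP      64u   /* records per page */
#define LRECSIZE 5u    /* log2(RECSIZE) */

/* In-page index (0..63) derived from a directory-wide record index. */
unsigned dir_page_entry_index(uint32_t record_index)
{
    return (unsigned)(record_index & (EPP - 1u));
}

/* Offset, in octets from the start of the page, of an in-page record. */
unsigned dir_page_entry_offset(unsigned page_entry_index)
{
    assert(page_entry_index < EPP);
    return page_entry_index << LRECSIZE;
}

/* Returns 1 if the record's bit is set in the allocation bitmap, else 0. */
int dir_record_is_allocated(const uint8_t alloc_bitmap[8],
                            unsigned page_entry_index)
{
    assert(page_entry_index < EPP);
    unsigned alloc_bitmap_index   = page_entry_index >> 3;
    unsigned alloc_bitmap_bit_num = page_entry_index & 0x7u;
    unsigned alloc_bitmap_bit     = 1u << alloc_bitmap_bit_num;
    return (alloc_bitmap[alloc_bitmap_index] & alloc_bitmap_bit) != 0;
}
```

For example, a bitmap whose first two octets are 0xFF and 0x1F marks records 0 through 12 as allocated and record 13 as free.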
Page 0 is special due to the directory header. It is structured as follows:
(MSB)                                                       (LSB)
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          page header                          |
~                          (32 octets)                          ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       directory header                        |
~                         (384 octets)                          ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        data record #13                        |
~                          (32 octets)                          ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
~                              ...                              ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        data record #63                        |
~                          (32 octets)                          ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Directory Page N=0 Structure
The directory header structure is a 384-octet structure that is stored at an offset of 32 octets from the beginning of the directory object (i.e., directly following the page 0 page header--see Section 5.2). The directory header layout is as follows:
(MSB)                                                       (LSB)
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  page 0 map   |  page 1 map   |  page 2 map   |  page 3 map   |
~                                                               ~
| page 124 map  | page 125 map  | page 126 map  | page 127 map  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         hash chain 0          |         hash chain 1          |
~                                                               ~
|        hash chain 126         |        hash chain 127         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Directory Header
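The directory header figure can be read as 128 one-octet page map entries followed by 128 two-octet hash chain entries (128 + 256 = 384 octets). The C sketch below decodes it under that reading. The structure name, function name, and the MAXPAGES constant are hypothetical, and the hash chain entries are assumed to be stored most-significant octet first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NHASHENT        128u   /* number of hash chains */
#define DIR_HEADER_SIZE 384u   /* octets, per this memo */

/* ASSUMPTION: hypothetical name for the number of page map octets. */
#define MAXPAGES        128u

/* Decoded directory header (hypothetical structure name). */
struct dir_header {
    uint8_t  page_map[MAXPAGES];
    uint16_t hash_chain[NHASHENT];
};

/*
 * Decode the 384-octet directory header. 'raw' points at offset 32 of
 * the directory object, i.e. just past the page 0 page header.
 */
void dir_header_decode(const uint8_t raw[DIR_HEADER_SIZE],
                       struct dir_header *out)
{
    size_t i;

    assert(raw != NULL && out != NULL);
    assert(MAXPAGES + 2u * NHASHENT == DIR_HEADER_SIZE);

    memcpy(out->page_map, raw, MAXPAGES);
    for (i = 0; i < NHASHENT; i++) {
        const uint8_t *p = raw + MAXPAGES + 2u * i;
        /* ASSUMPTION: big-endian 16-bit value. */
        out->hash_chain[i] = (uint16_t)((p[0] << 8) | p[1]);
    }
}
```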
Each page N>0 is structured as follows:
(MSB)                                                       (LSB)
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            header                             |
~                          (32 octets)                          ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        data record #01                        |
~                          (32 octets)                          ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
~                              ...                              ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        data record #63                        |
~                          (32 octets)                          ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Directory Page N>0 Structure
Since entry names are of variable length, each directory entry is stored as one directory entry record, followed by zero or more extension records that hold the remainder of the name.
This sequence of records MUST be contiguous, and MUST NOT cross a directory page boundary.
A directory entry record contains the dirent entry metadata (i.e., the vnode number and uniquifier, the name hash table next pointer, and flag bits) and the first twenty octets of the entry name string. Its layout is as follows:
(MSB)                                                       (LSB)
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     flags     |   reserved    |             next              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             vnode                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          uniquifier                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             name                              |
~                          (20 octets)                          ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Directory Entry Record
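To make the record layout concrete, the following C sketch decodes a single 32-octet directory entry record. The structure and function names are hypothetical. The next, vnode, and uniquifier fields are assumed to be stored most-significant octet first, which is an assumption of this example. Only the 20 name octets held in the base record are copied; any extension records are not handled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RECSIZE        32u
#define ENTRY_NAME_LEN 20u   /* name octets held in the base record */

/* Decoded directory entry record (hypothetical structure name). */
struct dir_entry_record {
    uint8_t  flags;
    uint16_t next;                  /* name hash table next pointer */
    uint32_t vnode;
    uint32_t uniquifier;
    uint8_t  name[ENTRY_NAME_LEN];  /* may continue in extension records */
};

/* ASSUMPTION: multi-octet fields are big-endian on the wire. */
static uint16_t rd_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t rd_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decode one base entry record from its RECSIZE-octet encoding. */
void dir_entry_record_decode(const uint8_t rec[RECSIZE],
                             struct dir_entry_record *out)
{
    assert(rec != NULL && out != NULL);
    out->flags      = rec[0];            /* octet 0 */
    /* octet 1 is reserved */
    out->next       = rd_be16(rec + 2);  /* octets 2-3 */
    out->vnode      = rd_be32(rec + 4);  /* octets 4-7 */
    out->uniquifier = rd_be32(rec + 8);  /* octets 8-11 */
    memcpy(out->name, rec + 12, ENTRY_NAME_LEN);  /* octets 12-31 */
}
```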
When a file name string exceeds the 20 octets set aside in an entry record, one or more extension records MUST be allocated contiguously following the base entry record in order to contain the rest of the name string. The layout of an extension record is as follows:
(MSB)                                                       (LSB)
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             name                              |
~                          (32 octets)                          ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Directory Entry Extension Record
The hash function is a loop over each of the octets within the name string. The hash is computed using integer arithmetic on an unsigned 32-bit integer. The hash MUST be initialized to zero before commencing iteration over the characters in the name string. For each character, the hash value is multiplied by the constant 173, and then the value of the current character is added to the hash. When the null terminator is encountered, the loop is terminated before the hash is multiplied by 173.
For reasons unknown to the author, the resultant unsigned hash value is then compared against the value 2^31. If the hash value is less than 2^31 (i.e., what would be the sign bit, if the hash value were signed, is not asserted), then the final hash value is the value computed in the above loop bitwise ANDed with the constant NHASHENT-1 (127). However, if the hash value is greater than or equal to 2^31, then the value is first bitwise ANDed with the constant NHASHENT-1 (127), and the result is then subtracted from the constant NHASHENT (128) to yield the final hash value.
This memo includes no request to IANA.
This memo includes no request to the AFS Assigned Numbers Registrar.
Directory metadata can contain sensitive information. This memo merely specifies the wire format encoding. Any implementation which may be utilized to store and retrieve directories containing entries whose name strings might reveal sensitive information should take precautions to ensure that they are never transmitted in the clear, and should take steps to ensure that those entries are not cached on machines lacking appropriate physical and network security.
[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
(MSB)                                                       (LSB)
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           pgcount=1           |           tag=1234            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   reserved    |1|         0xfff         |1 1|                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                 +
|                             0...                              |
+               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               |                                               |
+-+-+-+-+-+-+-+-+                                               +
|                           reserved                            |
~                          (19 octets)                          ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   map[0]=49   |                                               |
+-+-+-+-+-+-+-+-+                                               +
|                map[1...127]=0x40 0x40 0x40 ...                |
~                         (127 octets)                          ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                hash[0...127]=0x0000 0x0000 ...                |
~                         (256 octets)                          ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   flags=0x1   |   reserved    |          next=0x0000          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             vnode                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          uniquifier                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   name="iamexactly018chars"                   |
~                          (18 octets)                          ~
+                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |      nul      |       ?       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            allocated extent record full of garbage            |
~                          (32 octets)                          ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      unallocated garbage                      |
~                         (1568 octets)                         ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
18-Character Name String
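   #include <stdint.h>

   #define NHASHENT 128

   /* Hash a NUL-terminated name string to a hash chain number. */
   unsigned int
   dir_name_hash(const char * name)
   {
       /* Name octets are read as unsigned values. */
       const unsigned char * p = (const unsigned char *) name;
       uint32_t hash = 0;

       /* Multiply by 173, then add the current octet; the loop
        * ends at the null terminator, before any further multiply. */
       while (*p != '\0') {
           hash *= 173;
           hash += *p++;
       }

       /* Fold the result when what would be the sign bit is set. */
       if (hash & 0x80000000u) {
           return NHASHENT - (hash & (NHASHENT-1));
       } else {
           return hash & (NHASHENT-1);
       }
   }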