Internet DRAFT - draft-mathis-ippm-data-curation
draft-mathis-ippm-data-curation
IP Performance Working Group M. Mathis
Internet-Draft Google, Inc
Intended status: Experimental Oct 15, 2012
Expires: April 18, 2013
Curating Internet Measurement Data
draft-mathis-ippm-data-curation-00.txt
Abstract
This document describes the nomenclature, requirements and framework
for handling per-sample metadata for a live archive of Internet
measurements. By archive, we mean that once captured, the data MUST
NOT be altered or deleted. By live, we mean that new data from
ongoing measurements are being continuously appended to the archive.
The purpose of the metadata is to support the use of the Scientific
Method in the study the archived data. Under the principle of full
disclosure, the Scientific Method requires that later researchers be
able to repeat earlier studies using the original data, but refined
by new insights (such as improved calibration) gained from other
intervening studies.
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on April 18, 2013.
Copyright Notice
Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
Mathis Expires April 18, 2013 [Page 1]
Internet-Draft IPPM Meta Data Oct 2012
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Background . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3. Definitions and Assumptions . . . . . . . . . . . . . . . . . . 4
4. Requirements . . . . . . . . . . . . . . . . . . . . . . . . . 6
5. References . . . . . . . . . . . . . . . . . . . . . . . . . . 7
5.1. Normative References . . . . . . . . . . . . . . . . . . . 7
5.2. Informative References . . . . . . . . . . . . . . . . . . 7
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 7
Mathis Expires April 18, 2013 [Page 2]
Internet-Draft IPPM Meta Data Oct 2012
1. Introduction
This [very preliminary Internet-Draft will eventually] describe the
nomenclature, requirements and framework for handling per-sample
metadata for a live archive of Internet measurements. By archive, we
mean that once captured, the data MUST NOT be altered or deleted. By
live, we mean that new data from ongoing measurement are being
continuously appended to the archive. Without loss of generality, we
model such an archive as a large sparse table where each row is a
sample (likely but not required to be a singleton as defined in
[RFC2330]) and each column is a measurement parameter or result.
The purpose of the metadata is to support the use of the Scientific
Method in the study the live data archive. Under the principle of
full disclosure, the Scientific Method requires that later
researchers have access to the same tools and raw data set used by
previous researchers. The difficulty being that the archive itself
has been extended in the time between current researchers' access,
and previous researchers' conclusions. A capability must exist to
exclude all new data from these investigations. Furthermore, a great
deal of knowledge is often gained by detecting systematic error in
extant data and retroactively recalibrating well after it was
collected. It is extremely important to be able to repeat earlier
inquiry using legacy data refined by later calibration studies.
The metadata is to be placed in additional columns that are subject
to the same rules as the data itself: columns can be added, but once
added they MUST NOT be altered or deleted. This note describe the
nomenclature, framework and requirements for handling the metadata.
[At this time our goal is to recruit additional authors.]
2. Background
The advent of large scale online storage and high performance cloud
computing has created the opportunity for data mining Internet
measurements. Data mining can be described as detecting patterns in
large data sets to infer knowledge beyond the scope of the original
data or measurement.
In order for inferred results to meet the formal criterion as a
scientific endeavour, the data mining analysis MUST be repeatable by
later researchers who MUST be able to re-examine the same raw data
with the same or similar tools. The additional researcher can
confirm the conclusions or consider the possibility of alternative
explanations for the patterns detected in the data.
Mathis Expires April 18, 2013 [Page 3]
Internet-Draft IPPM Meta Data Oct 2012
To support data mining as a scientific endeavour the data set must be
archival: the only permitted alteration to the data set is to append
more data: additional rows as more measurements are performed;
additional columns as new annotations are added. Once captured, the
data MUST NOT be deleted or altered under any conditions.
All data is subject to measurement error. One common use of of data
mining is detecting and retroactively compensating for measurement
error. In the case of Internet measurement there is also the
opportunity for operational events such as topology changes or
routing updates to introduce subtle errors or changes in the
calibration of the measurements. As a consequence, it is desirable
to be able to tag rows with metadata indicating noteworthy
operational events. Although most such annotations are entirely
benign, occasionally it might be discovered that some of the data is
"tainted" when something goes wrong such that the data can not be
trusted without additional processing.
An important use case for an archived data set is supporting the
following type of scenario: researcher A makes a claim about a
pattern observed while mining the data. At a later time researcher B
discovers some systematic error present in some of the data and
develops a method to recalibrate the data to compensate for the
systematic errors. Researcher C wants to reconfirm A's results by
running the following series of experiments:
A's original analysis of the same rows as A used, to confirm the
procedure.
A's original analysis of the same rows as A used, using data
recalibrated by B.
Differential version of A's algorithm to extrapolate how
recalibration might affect other results.
A's original analysis on the data to date, including new rows
since A's conclusion.
A's original analysis on the data to date, using data recalibrated
by B.
This process of continuous re-evaluation is the foundation of the
Scientific Method.
3. Definitions and Assumptions
Data is assumed to be named (indexed) by N-tuples. Without loss of
generality, we further assume a 2 dimensional sparse table where each
measurement is a single row, and columns that are parameters or
results associated with each measurement, as might be implemented in
an SQL like database. These assumptions give us some linguistic
shorthand: each row is a singletons[RFC2330], and each column the
Mathis Expires April 18, 2013 [Page 4]
Internet-Draft IPPM Meta Data Oct 2012
parameters, result and metadata data associated with each singleton.
In a live archive, the number of rows continues to grow as more data
is collected. This note describes conventions and practices for
adding additional columns to support robust scientific method.
Other ways to organize the data are completely acceptable. Using
some other data organization would change the terminology but not the
principles outlined in this document.
Metadata can be added to each row, by adding additional columns to
the table. Metadata MUST be subject to the same rules as the
measurement data: once added it can not be deleted or altered,
otherwise an experiment that used the metadata can not be repeated.
To minimize the long term cost of the metadata, it should be designed
carefully such that as much as possible each metadata column will
remain useful for the life of the data set.
It is explicitly permitted for metadata to be added to existing rows
after the data itself has been archived.
Each metadata column must be properly defined: at a minimum a textual
description and date the column was first added. Other useful
attributes include an indication of who or what authority or process
is responsible for adding metadata to new rows, as the archive is
extended by live measurement.
It is tempting to try to replace the textual description of the
metadata by some formal language specification. However our
intuition is that such an effort is likely to be unbound and
potentially non-convergent. Therefore, we only provide a couple of
examples.
A chronology is a metadata tag for which there are well defined
"before" and "after" primitives. Examples include measurement time
and archive time, as well as several platform properties: OS version
and versions of all component software. The important property of a
chronology is that it is useful to specify rows as ranges.
While it is tempting to think that measurement time might be
sufficient metadata, consider the difficulty in representing a
scenario where a software version change affected some detail of the
measurement parameters. If this tool is deployed across a very large
fleet of nodes, then making sense of the data requires being able to
join the per node update logs with the archived data. In general it
would be far easier to label the data with the measurement tool
version used for measurement. The researcher then just has to filter
the data by comparing tool version.
Mathis Expires April 18, 2013 [Page 5]
Internet-Draft IPPM Meta Data Oct 2012
Another example of a chronology might be to tag measurements with
pointers to entries in an operations log that might be tracking
topology and configuration changes. This would permit fast
exploration of questions such as, "did some network change affect
performance in a detectable way?"
A another class of meta data is data recalibration, for instance
using an independent means to compute systematic errors present in
measurement parameters. Either updated values or error offsets could
be stored as metadata. This gives all researchers access to both the
raw, uncorrected values and the adjusted or recalibrated values.
Similar to recalibration is addressing tainted data, for example if
some event made the data unreliable or inaccurate as pertains to
determining some specific metric in a way that can not be calibrated.
While it is tempting to entirely discard such data, doing so would
invalidate the archive for other sorts of studies which might not be
affected by inaccuracy, for example investigating questions of user
self selection.
4. Requirements
This section defines a metadata formalism that would permit a data
archive to implement computer science grade lock semantics on the
data. This is likely to be overkill for nearly all applications,
however it does permit an implementer clearly understand where there
are assumption that might create potential for race conditions, for
example by not having a strong way to assure that empty cells are not
changed after they have been accessed.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
Data in an archive MUST NOT be altered or deleted.
The Archive MUST make a distinction between an empty cell (never
written) and a cell containing a null data. e.g. a null string or a
null pointer. Each cell is permitted to have one transition from
empty to non-empty, and no other transitions.
And many many more....
5. References
Mathis Expires April 18, 2013 [Page 6]
Internet-Draft IPPM Meta Data Oct 2012
5.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
5.2. Informative References
[RFC2330] Paxson, V., Almes, G., Mahdavi, J., and M. Mathis,
"Framework for IP Performance Metrics", RFC 2330,
May 1998.
Author's Address
Matt Mathis
Google, Inc
1600 Amphitheater Parkway
Mountain View, California 93117
USA
Email: mattmathis@google.com
Mathis Expires April 18, 2013 [Page 7]