mathis-ippm-data-curation-00.txt

Internet DRAFT - draft-mathis-ippm-data-curation
draft-mathis-ippm-data-curation

Last Version:	draft-mathis-ippm-data-curation-00.txt	Tracker Entry
Date:	`16-Oct-2012`
Disposition:	expired




IP Performance Working Group                                   M. Mathis
Internet-Draft                                               Google, Inc
Intended status: Experimental                               Oct 15, 2012
Expires: April 18, 2013


                   Curating Internet Measurement Data
                 draft-mathis-ippm-data-curation-00.txt

Abstract

   This document describes the nomenclature, requirements and framework
   for handling per-sample metadata for a live archive of Internet
   measurements.  By archive, we mean that once captured, the data MUST
   NOT be altered or deleted.  By live, we mean that new data from
   ongoing measurements are being continuously appended to the archive.

   The purpose of the metadata is to support the use of the Scientific
   Method in the study the archived data.  Under the principle of full
   disclosure, the Scientific Method requires that later researchers be
   able to repeat earlier studies using the original data, but refined
   by new insights (such as improved calibration) gained from other
   intervening studies.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 18, 2013.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents



Mathis                   Expires April 18, 2013                 [Page 1]

Internet-Draft               IPPM Meta Data                     Oct 2012


   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.


Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . . . 3
   2.  Background  . . . . . . . . . . . . . . . . . . . . . . . . . . 3
   3.  Definitions and Assumptions . . . . . . . . . . . . . . . . . . 4
   4.  Requirements  . . . . . . . . . . . . . . . . . . . . . . . . . 6
   5.  References  . . . . . . . . . . . . . . . . . . . . . . . . . . 7
     5.1.  Normative References  . . . . . . . . . . . . . . . . . . . 7
     5.2.  Informative References  . . . . . . . . . . . . . . . . . . 7
   Author's Address  . . . . . . . . . . . . . . . . . . . . . . . . . 7
































Mathis                   Expires April 18, 2013                 [Page 2]

Internet-Draft               IPPM Meta Data                     Oct 2012


1.  Introduction

   This [very preliminary Internet-Draft will eventually] describe the
   nomenclature, requirements and framework for handling per-sample
   metadata for a live archive of Internet measurements.  By archive, we
   mean that once captured, the data MUST NOT be altered or deleted.  By
   live, we mean that new data from ongoing measurement are being
   continuously appended to the archive.  Without loss of generality, we
   model such an archive as a large sparse table where each row is a
   sample (likely but not required to be a singleton as defined in
   [RFC2330]) and each column is a measurement parameter or result.

   The purpose of the metadata is to support the use of the Scientific
   Method in the study the live data archive.  Under the principle of
   full disclosure, the Scientific Method requires that later
   researchers have access to the same tools and raw data set used by
   previous researchers.  The difficulty being that the archive itself
   has been extended in the time between current researchers' access,
   and previous researchers' conclusions.  A capability must exist to
   exclude all new data from these investigations.  Furthermore, a great
   deal of knowledge is often gained by detecting systematic error in
   extant data and retroactively recalibrating well after it was
   collected.  It is extremely important to be able to repeat earlier
   inquiry using legacy data refined by later calibration studies.

   The metadata is to be placed in additional columns that are subject
   to the same rules as the data itself: columns can be added, but once
   added they MUST NOT be altered or deleted.  This note describe the
   nomenclature, framework and requirements for handling the metadata.

   [At this time our goal is to recruit additional authors.]


2.  Background

   The advent of large scale online storage and high performance cloud
   computing has created the opportunity for data mining Internet
   measurements.  Data mining can be described as detecting patterns in
   large data sets to infer knowledge beyond the scope of the original
   data or measurement.

   In order for inferred results to meet the formal criterion as a
   scientific endeavour, the data mining analysis MUST be repeatable by
   later researchers who MUST be able to re-examine the same raw data
   with the same or similar tools.  The additional researcher can
   confirm the conclusions or consider the possibility of alternative
   explanations for the patterns detected in the data.




Mathis                   Expires April 18, 2013                 [Page 3]

Internet-Draft               IPPM Meta Data                     Oct 2012


   To support data mining as a scientific endeavour the data set must be
   archival: the only permitted alteration to the data set is to append
   more data: additional rows as more measurements are performed;
   additional columns as new annotations are added.  Once captured, the
   data MUST NOT be deleted or altered under any conditions.

   All data is subject to measurement error.  One common use of of data
   mining is detecting and retroactively compensating for measurement
   error.  In the case of Internet measurement there is also the
   opportunity for operational events such as topology changes or
   routing updates to introduce subtle errors or changes in the
   calibration of the measurements.  As a consequence, it is desirable
   to be able to tag rows with metadata indicating noteworthy
   operational events.  Although most such annotations are entirely
   benign, occasionally it might be discovered that some of the data is
   "tainted" when something goes wrong such that the data can not be
   trusted without additional processing.

   An important use case for an archived data set is supporting the
   following type of scenario: researcher A makes a claim about a
   pattern observed while mining the data.  At a later time researcher B
   discovers some systematic error present in some of the data and
   develops a method to recalibrate the data to compensate for the
   systematic errors.  Researcher C wants to reconfirm A's results by
   running the following series of experiments:
      A's original analysis of the same rows as A used, to confirm the
      procedure.
      A's original analysis of the same rows as A used, using data
      recalibrated by B.
      Differential version of A's algorithm to extrapolate how
      recalibration might affect other results.
      A's original analysis on the data to date, including new rows
      since A's conclusion.
      A's original analysis on the data to date, using data recalibrated
      by B.

   This process of continuous re-evaluation is the foundation of the
   Scientific Method.


3.  Definitions and Assumptions

   Data is assumed to be named (indexed) by N-tuples.  Without loss of
   generality, we further assume a 2 dimensional sparse table where each
   measurement is a single row, and columns that are parameters or
   results associated with each measurement, as might be implemented in
   an SQL like database.  These assumptions give us some linguistic
   shorthand: each row is a singletons[RFC2330], and each column the



Mathis                   Expires April 18, 2013                 [Page 4]

Internet-Draft               IPPM Meta Data                     Oct 2012


   parameters, result and metadata data associated with each singleton.
   In a live archive, the number of rows continues to grow as more data
   is collected.  This note describes conventions and practices for
   adding additional columns to support robust scientific method.

   Other ways to organize the data are completely acceptable.  Using
   some other data organization would change the terminology but not the
   principles outlined in this document.

   Metadata can be added to each row, by adding additional columns to
   the table.  Metadata MUST be subject to the same rules as the
   measurement data: once added it can not be deleted or altered,
   otherwise an experiment that used the metadata can not be repeated.
   To minimize the long term cost of the metadata, it should be designed
   carefully such that as much as possible each metadata column will
   remain useful for the life of the data set.

   It is explicitly permitted for metadata to be added to existing rows
   after the data itself has been archived.

   Each metadata column must be properly defined: at a minimum a textual
   description and date the column was first added.  Other useful
   attributes include an indication of who or what authority or process
   is responsible for adding metadata to new rows, as the archive is
   extended by live measurement.

   It is tempting to try to replace the textual description of the
   metadata by some formal language specification.  However our
   intuition is that such an effort is likely to be unbound and
   potentially non-convergent.  Therefore, we only provide a couple of
   examples.

   A chronology is a metadata tag for which there are well defined
   "before" and "after" primitives.  Examples include measurement time
   and archive time, as well as several platform properties: OS version
   and versions of all component software.  The important property of a
   chronology is that it is useful to specify rows as ranges.

   While it is tempting to think that measurement time might be
   sufficient metadata, consider the difficulty in representing a
   scenario where a software version change affected some detail of the
   measurement parameters.  If this tool is deployed across a very large
   fleet of nodes, then making sense of the data requires being able to
   join the per node update logs with the archived data.  In general it
   would be far easier to label the data with the measurement tool
   version used for measurement.  The researcher then just has to filter
   the data by comparing tool version.




Mathis                   Expires April 18, 2013                 [Page 5]

Internet-Draft               IPPM Meta Data                     Oct 2012


   Another example of a chronology might be to tag measurements with
   pointers to entries in an operations log that might be tracking
   topology and configuration changes.  This would permit fast
   exploration of questions such as, "did some network change affect
   performance in a detectable way?"

   A another class of meta data is data recalibration, for instance
   using an independent means to compute systematic errors present in
   measurement parameters.  Either updated values or error offsets could
   be stored as metadata.  This gives all researchers access to both the
   raw, uncorrected values and the adjusted or recalibrated values.

   Similar to recalibration is addressing tainted data, for example if
   some event made the data unreliable or inaccurate as pertains to
   determining some specific metric in a way that can not be calibrated.
   While it is tempting to entirely discard such data, doing so would
   invalidate the archive for other sorts of studies which might not be
   affected by inaccuracy, for example investigating questions of user
   self selection.


4.  Requirements

   This section defines a metadata formalism that would permit a data
   archive to implement computer science grade lock semantics on the
   data.  This is likely to be overkill for nearly all applications,
   however it does permit an implementer clearly understand where there
   are assumption that might create potential for race conditions, for
   example by not having a strong way to assure that empty cells are not
   changed after they have been accessed.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   Data in an archive MUST NOT be altered or deleted.

   The Archive MUST make a distinction between an empty cell (never
   written) and a cell containing a null data. e.g. a null string or a
   null pointer.  Each cell is permitted to have one transition from
   empty to non-empty, and no other transitions.

   And many many more....


5.  References





Mathis                   Expires April 18, 2013                 [Page 6]

Internet-Draft               IPPM Meta Data                     Oct 2012


5.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

5.2.  Informative References

   [RFC2330]  Paxson, V., Almes, G., Mahdavi, J., and M. Mathis,
              "Framework for IP Performance Metrics", RFC 2330,
              May 1998.


Author's Address

   Matt Mathis
   Google, Inc
   1600 Amphitheater Parkway
   Mountain View, California  93117
   USA

   Email: mattmathis@google.com






























Mathis                   Expires April 18, 2013                 [Page 7]