Network Working Group M. McBride
Internet-Draft Futurewei
Intended status: Standards Track D. Kutscher
Expires: January 11, 2021 Emden University
E. Schooler
Intel
CJ. Bernardos
UC3M
D. Lopez
Telefonica I+D
July 10, 2020
Data Discovery Problem Statement
draft-mcbride-data-discovery-problem-statement-00
Abstract
If data is the new oil of the 21st century, then we need a
standardized way of locating, capturing, classifying, and
transforming this raw data to generate insights and recommendations.
Data, like oil, needs to be discovered and captured before it can be
refined and made valuable. While the topic of data discovery can be
far-reaching, this document focuses on the problem of actually
locating data, throughout a network of data servers, in a
standardized way.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on January 11, 2021.
Copyright Notice
Copyright (c) 2020 IETF Trust and the persons identified as the
document authors. All rights reserved.
McBride, et al. Expires January 11, 2021 [Page 1]
Internet-Draft Data Discovery Problem Statement July 2020
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction
1.1. Requirements Language
2. Problem Scope
3. Existing Solutions
3.1. Proprietary
3.2. Opensource
4. Use Cases
5. IANA Considerations
6. Security Considerations
7. Acknowledgement
8. Normative References
Authors' Addresses
1. Introduction
There are myriad proprietary and standardized ways of discovering
networking devices and hosts, and many solutions for discovering data
within a single database. Discovering the data that may be stored
throughout an environment of networking devices, however, is possible
today only in proprietary, non-standardized ways. We can discover
information about the devices but cannot locate and capture the data
stored on them in a standard way. As more networking devices store
collected data, there needs to be a standard way of discovering the
specific data needed among a potentially huge lake of databases.
1.1. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
2. Problem Scope
Data may be cached, copied, and/or stored at multiple locations in the
network en route to its final destination. With an increasing
percentage of devices connecting to the Internet being mobile,
support for in-network caching and replication is critical for
continuous data availability. There are data repositories throughout
a modern network, and there needs to be a standardized way of locating
the repositories and discovering the desired data within them.
There are many types of relational (SQL) and non-relational (NoSQL)
data classification solutions. Existing database classification
engines allow for scanning of a single database. The problem we are
defining, however, is the lack of a standards-based solution to
discover first where the databases exist throughout a network and
then where specific data objects are located within them.
Data discovery is likely to look different depending on whether we
are seeking global or local discovery. Data discovery may be
location-driven: a standard for finding data may want to search
proximally, i.e., find the data that matches the search and is
nearest to a given location.
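As an illustration only, location-driven discovery could be sketched
as follows. The record fields, the store names, and the use of
straight-line distance in an abstract 2-D plane are assumptions made
for this example; this document does not define how proximity is
measured.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRecord:
    name: str    # hypothetical label for the data object
    store: str   # hypothetical identifier of the hosting data store
    x: float     # position of the store in an abstract 2-D plane
    y: float


def nearest_match(records, keyword, origin):
    """Return the record whose name contains keyword and whose store is
    closest to origin, or None if nothing matches."""
    matches = [r for r in records if keyword in r.name]
    if not matches:
        return None
    ox, oy = origin
    return min(matches, key=lambda r: math.hypot(r.x - ox, r.y - oy))


records = [
    DataRecord("temperature/room1", "edge-a", 0.0, 1.0),
    DataRecord("temperature/room2", "edge-b", 5.0, 5.0),
    DataRecord("humidity/room1", "edge-a", 0.0, 1.0),
]

# A requester located at (4.0, 4.0) asks for temperature data.
local = nearest_match(records, "temperature", origin=(4.0, 4.0))
```

A global search would skip the distance step and return every match;
a local search, as here, prefers the closest copy.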
So much data is being created, processed, and migrated that only some
of it may be stored permanently in a database. Other, less permanent
data will reside for a time in memory so that it can be discovered
and accessed quickly; such data may be more dynamic and short-lived.
Although we refer to the data store as a database, it may reside
entirely in memory, and/or it may be stored using some other non-SQL
indexing technology.
Each database essentially provides a directory service for the data
within it, and that directory service can be viewed as metadata.
There is a need to understand where the databases, data lakes, and
pockets of data reside. Locating each data store is the first-level
discovery problem, and learning the details of each database's
directory is the second-level discovery problem.
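The two levels can be pictured with a minimal sketch. The class
names, the registry, and the metadata fields below are hypothetical
and exist only for illustration; how stores would actually announce
themselves or expose their directories is precisely the open problem.

```python
class DataStore:
    """A data store with its own directory (metadata) of objects."""

    def __init__(self, location):
        self.location = location
        self._directory = {}  # object name -> descriptive metadata

    def add(self, name, **metadata):
        self._directory[name] = metadata

    def lookup(self, name):
        """Second-level discovery: consult this store's directory."""
        return self._directory.get(name)


class StoreRegistry:
    """Hypothetical registry answering 'where are the data stores?'."""

    def __init__(self):
        self._stores = []

    def register(self, store):
        self._stores.append(store)

    def stores(self):
        """First-level discovery: enumerate the known data stores."""
        return list(self._stores)

    def discover(self, name):
        """Run both levels: return (location, metadata) for each hit."""
        hits = []
        for store in self.stores():
            metadata = store.lookup(name)
            if metadata is not None:
                hits.append((store.location, metadata))
        return hits


registry = StoreRegistry()
edge = DataStore("edge-node-1")
edge.add("sensor/42", kind="temperature")
core = DataStore("core-db-1")
core.add("sensor/42", kind="temperature")
core.add("billing/2020", kind="report")
registry.register(edge)
registry.register(core)

hits = registry.discover("sensor/42")  # found in both stores
```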
Publish and subscribe approaches allow nodes to express their
interest in specific pieces of data without knowing the location of
the data. Some discoverable sources may not produce the specific
data desired by the subscribers, or may not produce it with the
required format or frequency. A subscriber will therefore want to
find the publishers whose data has the desired characteristics.
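A minimal sketch of this matching step is shown below. It assumes
that publishers advertise a topic, a format, and an update rate; the
field names and example values are illustrative and are not taken
from any existing publish and subscribe protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advertisement:
    publisher: str
    topic: str
    fmt: str         # advertised data format (illustrative field)
    rate_hz: float   # how often the publisher emits data


@dataclass(frozen=True)
class Interest:
    topic: str
    fmt: str
    min_rate_hz: float


def matching_publishers(ads, interest):
    """Return the publishers whose advertised characteristics satisfy
    the subscriber's interest."""
    return [
        ad.publisher
        for ad in ads
        if ad.topic == interest.topic
        and ad.fmt == interest.fmt
        and ad.rate_hz >= interest.min_rate_hz
    ]


ads = [
    Advertisement("cam-1", "video/lobby", "h264", 30.0),
    Advertisement("cam-2", "video/lobby", "mjpeg", 30.0),
    Advertisement("cam-3", "video/lobby", "h264", 5.0),
]
wanted = Interest("video/lobby", "h264", min_rate_hz=15.0)
selected = matching_publishers(ads, wanted)
```

Here only cam-1 qualifies: cam-2 offers the wrong format and cam-3
publishes too infrequently.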
3. Existing Solutions
3.1. Proprietary
There are many existing proprietary database discovery solutions that
we can evaluate in order to understand which aspects we need to
standardize. Examples include IBM Cognos, Wipro Data Discovery
Platform (DDP), and Amazon Macie among many others. Macie, for
instance, is a data security and data privacy service that uses
machine learning and pattern matching to discover and protect data in
AWS. The service allows you to define data types in order to
discover and protect the data that may be unique to a use case.
3.2. Opensource
There are open source data solutions such as ScienceBase
(https://sciencebase.usgs.gov/). The U.S. Geological Survey (USGS)
is developing ScienceBase, an open source, collaborative, scientific
data and information management platform. It provides current
documentation about its structure, information model, services,
directory, and repository. sbtools, an R package, provides a
command-line-driven interface for finding data within ScienceBase.
Another solution is the InterPlanetary File System (ipfs.io). IPFS
is a peer-to-peer (p2p) storage network for storing and accessing
files, websites, applications, and data. Content is accessible
through peers, located anywhere in the world, that might relay
information, store it, or do both. IPFS finds requested content by
its content address rather than by its location. There are three
fundamental principles to understanding IPFS:
o Unique identification via content addressing
o Content linking via directed acyclic graphs (DAGs)
o Content discovery via distributed hash tables (DHTs)
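The first and third principles can be illustrated with a toy sketch.
It is not the IPFS implementation: a plain SHA-256 hex digest stands
in for a real IPFS content identifier, an in-memory dictionary stands
in for a distributed hash table, and DAG-based linking is omitted.
All names below are hypothetical.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Derive an identifier from the content itself (SHA-256 hex digest).
    Real IPFS content identifiers use a different, richer encoding."""
    return hashlib.sha256(data).hexdigest()


class ToyIndex:
    """Stand-in for a DHT: maps a content address to peers holding it."""

    def __init__(self):
        self._providers = {}

    def announce(self, peer, data):
        """Record that peer holds data; return the content address."""
        address = content_address(data)
        self._providers.setdefault(address, set()).add(peer)
        return address

    def find_providers(self, address):
        """Return the peers known to hold the content, sorted by name."""
        return sorted(self._providers.get(address, set()))


index = ToyIndex()
addr_a = index.announce("peer-a", b"hello data")
addr_b = index.announce("peer-b", b"hello data")

# Identical content yields the same address, whichever peer holds it.
providers = index.find_providers(addr_a)
```

Because the address depends only on the bytes, a requester needs to
know what it wants, not where it is stored.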
4. Use Cases
The following use cases will benefit from standards-based data
discovery solutions:
o We need a standards-based solution to discover the increasing
amount of data being stored in various locations throughout a
network, including at the edge. We need a standard set of
protocols for performing this data discovery, on the device or
infrastructure edge, in order to meet the requirements of many
use cases. There will be terabytes of data at the edge, and we
need a way to identify its existence and find the desired data.
[I-D.mcbride-edge-data-discovery-overview] focuses on this
aspect of data discovery.
o We need a secure standards-based solution for data discovery.
Several of the proprietary secure data discovery solutions use
machine learning and pattern matching to discover and protect the
data. We need to incorporate existing, or new, IETF security
solutions when discovering data.
o We need a standards-based way of using name-based solutions for
data discovery. An Information Centric Networking (ICN) enabled
network routes data by name (vs address), caches content
natively in the network, and employs data-centric security. Data
discovery may require that data be associated with a name or
names, a series of descriptive attributes, and/or a unique
identifier. Named Data Networking (NDN) can be applied to edge
data discovery to make it much easier to extract data and
metadata by naming it. If data were named, we would be able to
discover the appropriate data simply by its name.
o We need a standards-based way of discovering data in mobile
wireless networks. Data could reside on the eNodeB or other
wireless access infrastructure equipment in addition to residing
on servers in the packet core.
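The name-based use case above can be illustrated with a small sketch
of discovery by hierarchical name prefix. The '/'-separated naming
style follows common NDN practice, but the catalog, its entries, and
the function names are hypothetical, and no NDN forwarding or caching
behavior is modeled.

```python
def split_name(name):
    """Split a '/'-separated name into its non-empty components."""
    return [component for component in name.split("/") if component]


def discover_by_prefix(catalog, prefix):
    """Return the catalog entries whose name falls under prefix.
    Matching is done on whole components, so '/edge/cam' does not
    match '/edge/camera/1/frame'."""
    want = split_name(prefix)
    return {
        name: metadata
        for name, metadata in catalog.items()
        if split_name(name)[:len(want)] == want
    }


catalog = {
    "/edge/camera/1/frame": {"type": "image"},
    "/edge/camera/2/frame": {"type": "image"},
    "/edge/cam-logs/today": {"type": "text"},
    "/core/billing/2020": {"type": "report"},
}

# Discover everything named under the camera subtree.
found = discover_by_prefix(catalog, "/edge/camera")
```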
5. IANA Considerations
N/A
6. Security Considerations
Data and metadata discovery are both a function of who asks for the
data and in what context. The policies attached to the database and
the metadata will dictate which view into the data the system
returns to the requester.
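As a rough sketch of such a policy-dependent view, a directory could
filter its answers by the requester's roles. The role names, the
directory layout, and the function below are assumptions made for
this example; a real solution would rely on IETF security mechanisms
for authentication and authorization.

```python
def visible_metadata(directory, requester_roles):
    """Return only the entries that at least one of the requester's
    roles is allowed to see."""
    return {
        name: entry["metadata"]
        for name, entry in directory.items()
        if entry["allowed_roles"] & requester_roles
    }


directory = {
    "sensor/42": {
        "metadata": {"kind": "temperature"},
        "allowed_roles": {"operator", "analyst"},
    },
    "billing/2020": {
        "metadata": {"kind": "report"},
        "allowed_roles": {"finance"},
    },
}

# The same query yields different views for different requesters.
analyst_view = visible_metadata(directory, {"analyst"})
finance_view = visible_metadata(directory, {"finance"})
```

A requester with no matching role receives an empty view, so the
existence of the data itself is not disclosed.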
7. Acknowledgement
8. Normative References
[I-D.mcbride-edge-data-discovery-overview]
McBride, M., Kutscher, D., Schooler, E., and C. Bernardos,
"Edge Data Discovery for COIN", draft-mcbride-edge-data-
discovery-overview-03 (work in progress), January 2020.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
Authors' Addresses
Mike McBride
Futurewei
Email: michael.mcbride@futurewei.com
Dirk Kutscher
Emden University
Email: ietf@dkutscher.net
Eve Schooler
Intel
Email: eve.m.schooler@intel.com
URI: http://www.eveschooler.com
Carlos J. Bernardos
Universidad Carlos III de Madrid
Av. Universidad, 30
Leganes, Madrid 28911
Spain
Phone: +34 91624 6236
Email: cjbc@it.uc3m.es
URI: http://www.it.uc3m.es/cjbc/
Diego R. Lopez
Telefonica I+D
Don Ramon de la Cruz, 82
Madrid 28006
Spain
Email: diego.r.lopez@telefonica.com