Internet DRAFT - draft-xiang-interdomain-science-network
draft-xiang-interdomain-science-network
ALTO WG Q. Xiang
Internet-Draft Y. Yang
Intended status: Standards Track Tongji/Yale University
Expires: June 20, 2018 December 17, 2017
Unicorn: Resource Orchestration for Large-Scale, Multi-Domain Data
Analytics
draft-xiang-interdomain-science-network-00.txt
Abstract
This document presents the design of Unicorn, a multi-domain,
geographically-distributed, data-intensive analytics system. The
setting of such a system includes edge science networks, which
provide storage and computation resources for collecting, sharing and
analyzing extremely large amounts of data, and transit networks,
which provide networking resources to connects edge science networks
for transmitting large science datasets.
The key design challenge is to accurately discover and represent
resource information from different domains. Unicorn leverages
multiple ALTO services, including ALTO-Path Vector, ALTO-Routing
State Abstraction, ALTO-Server-Side Event and ALTO-Flow Cost Service
to address this challenge. In particular, Unicorn decomposes the
resource discovery into three phases. The first phase is to identify
endpoint resource, e.g., dataset storage location, computation
resource location and output storage resource location. The second
phase is to identify the reachability information between the
locations of storage and computation resources. The third phase is
to identify the available networking resource connecting different
storage and computation resources. All information collected through
these three phases can be used by a logically centralized scheduling
system to orchestrate the resources usage.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
Xiang & Yang Expires June 20, 2018 [Page 1]
Internet-Draft Unicorn Design December 2017
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on June 20, 2018.
Copyright Notice
Copyright (c) 2017 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1. Settings . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Requirements Language . . . . . . . . . . . . . . . . . . . . 4
3. Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4. Storage and Computation Resource Discovery . . . . . . . . . 6
5. Path Discovery . . . . . . . . . . . . . . . . . . . . . . . 6
5.1. Using SDN to get flow-based site-path . . . . . . . . . . 7
5.2. Path Discovery Example . . . . . . . . . . . . . . . . . 7
6. Networking Resource Discovery . . . . . . . . . . . . . . . . 8
6.1. Networking Resource Discovery Example . . . . . . . . . . 8
6.2. A Secure Multiparty Computation Protocol to Compute
Minimal, Cross-Domain RSA . . . . . . . . . . . . . . . . 8
7. References . . . . . . . . . . . . . . . . . . . . . . . . . 9
7.1. Normative References . . . . . . . . . . . . . . . . . . 9
7.2. Informative References . . . . . . . . . . . . . . . . . 9
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 10
1. Introduction
As the data volume increases exponentially over time, data intensive
analytics is transiting from single-domain computing to multi-
organizational, geographically-distributed, collaborative computing,
where different organizations contribute various resources, e.g.,
computation, storage and networking resources, to collaboratively
collect, share and analyze extremely large amounts of data. One
leading example is the Large Hadron Collider (LHC) high energy
Xiang & Yang Expires June 20, 2018 [Page 2]
Internet-Draft Unicorn Design December 2017
physics (HEP) program, which aims to find new particles and
interactions in a previously inaccessible range of energies. The
scientific collaborations that have built and operate large HEP
experimental facilities at the LHC, such as the Compact Muon Solenoid
(CMS) and A Toroidal LHC ApparatuS (ATLAS), currently have more than
300 petabytes of data under management at hundreds of sites around
the world, and this volume is expected to grow to one exabyte by
approximately 2018.
This document presents Unicorn, a generic design for resource
orchestration for large-scale, multi-domain data analytics. The key
design challenge for such a resource orchestration system is to
accurately discover and represent the resource information from
different domains. Our design resorts to the Application-Layer
Traffic Optimization Protocol (ALTO) [RFC7285] to address this
challenge. In particular, several ALTO extension services, including
ALTO-Path Vector, ALTO-Routing State Abstraction, ALTO-Server-Side
Event and ALTO-Flow Cost Service, are integrated in the proposed
design.
This document focuses on the design details of Unicorn. We present
the implementation and deployment experience of Unicorn in another
document [DRAFT-UNICORN-INFO].
1.1. Settings
The targeting scenario is as follows. There are two types of
networks in the whole system. The first type is the edge science
network. An edge science networks is usually a cluster residing in a
campus network. It provides storage resources to store large
scientific datasets and computation resources to analyze these
datasets. The second type is the transit network. A transit network
does not provide any storage or computation resources. It only
provides networking resources to inter-connect different edge science
networks so that datasets can be moved and shared between different
edge science networks. Edge science networks do not directly connect
to each other, but are connected through transit networks.
Without loss of generality, a data analytics task is defined as a
3-tuple: (input dataset, program, output site). A task can be
further decomposed into a set of jobs, who have a precedence relation
defined by a directed acyclic graph (DAG). And each job can also be
defined as a 3-tuple: (input dataset, program, output site).
Xiang & Yang Expires June 20, 2018 [Page 3]
Internet-Draft Unicorn Design December 2017
2. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
3. Overview
The key design challenge for multi-domain data analytics system is to
accurately discover the resource information from different sites
while preserving the autonomy and privacy of each site. In order to
address this challenge, the design needs to strike a balance between
the information accuracy, the efficiency of resource discovery and
the privacy of each site. In particular, we propose the following
architecture in the Figure 1.
.---------.
| Users |
'---------'
| Tasks
.- - - - - - - - - - - - - - -|- - - - - - - - - - - - - - - - - - .
| | |
| .-----------------------. 1 .------------------------.|
| | Resource Orchestrator | -----|Storage/Computation Pool||
| '-----------------------' \ '------------------------'|
| / | | 4 \ \ |
| 2 / 3 | | 3\ \ 2 |
| .-------------. .-----------. .-------------. |
| | ALTO Server | | Execution | | ALTO Server | |
| '-------------' | Agents | '-------------' |
| | '-----------' | |
| | / \ | |
| .----------------./ \ .----------------. |
| | Site 1 | . . . | Site N | |
| '----------------' '----------------' |
'- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - '
Figure 1: Resource Orchestration for Large-Scale, Multi-Domain Data
Analytics: Architecture.
In the proposed design, each site deploys an ALTO server. Within the
site, the ALTO server collects resource information, such as dataset
locations, storage resources, computation resources, networking
resources and so on, and announces its capability, i.e., the type of
information it is willing to share with other ALTO servers or the
data analytics resource orchestrator.
Xiang & Yang Expires June 20, 2018 [Page 4]
Internet-Draft Unicorn Design December 2017
In this system, users submit analytics tasks to the resource
orchestrator. For a data analytics task, a user needs to at least
provide the analytics program. If the input dataset is not specified
by the user, it means this task does not require a dataset as input.
Similarly, if the user does not specify the computation resource or
the output dataset storage site, the system will try to allocate
default computation resources to this task, and will return the
output dataset directly to the user.
After getting a (set of) task(s) from the user(s), the orchestrator
discover the available resources for executing the submitted task(s)
in three steps, labeled in the figure. The first step is called
storage/computation resource discovery. In this step, the
orchestrator sends requests to a centralized storage and computation
resource pool to find the location of candidate input storage
resources, the computation resources and the output storage
resources.
The second step is called path discovery, in which the resource
orchestrator sends endpoint or flow cost service queries to the ALTO
servers at the site holding the candidate input storage resources and
the site holding the candidate computation resources to ask about the
connectivity from input dataset site to the computation site and that
from the computation site to the output site. The cost type of such
queries is path vector defined in [DRAFT-PV]. The response sent back
from the ALTO server to the orchestrator is a vector. Each element
in this vector is the IP address of the ingress gateway switch/router
that the candidate flow will pass along the AS-path. This vector is
called the site-path in this document.
After collecting the site-path of all the candidate (storage,
computation) flows, for each site X, the orchestrator derives F_X,
the set of candidate flows that will consume networking resources in
site X. Then the orchestrator will send endpoint/flow cost service
queries to the ALTO server at each site X to ask about the networking
resource sharing of the flow set F_X in site X. The returned
response is a set of linear inequalities called resource state
abstraction.
Using the resource information collected from the three-phase
resource discovery process, the resource orchestrator can run an
scheduling algorithm to make the resource allocation decisions to
execute the submitted tasks. The decisions include job decomposition
(DAG construction), task concatenation, job placement, network
resource allocation for input dataset movement and output movement.
These decisions will be sent to the corresponding execution agents at
different sites, which will practice these decisions and send
feedback to the orchestrator.
Xiang & Yang Expires June 20, 2018 [Page 5]
Internet-Draft Unicorn Design December 2017
When resource state changes, e.g., a network link is broken, the ALTO
server at the scene will check whether the results of existing path
discovery and networking resource discovery are affected by this
event, and sends updated resource information using the ALTO-SSE
service.
In the next few sections, we present the detailed design of the
three-phase resource discovery.
4. Storage and Computation Resource Discovery
In order to allocation resources for a (set of) data analytic tasks,
the scheduling system must first know the availability of the
resources explicitly specified in the task, i.e., the storage
resource storing the input dataset, the computation resources to run
the analytics program and the storage resource that will be used to
store the output dataset. Such resources are only provided by the
edge science networks. Therefore, a strawman design is for the
scheduling system to send requests to the resource information
servers of all the edge science networks and to get such information.
However, this solution is inefficient in that the scheduling system
needs to query all the edge science networks to get the complete
information.
This document adopts an alternative design, in which all the resource
information servers proactively send all their information about the
storage and computation resources to a centralized resource pool.
This resource pool can be a DNS server or a traditional database.
Different techniques are under investigation to improve the
scalability of this design, including sharding and distributed
hashing table (DHT).
5. Path Discovery
Having identified the locations of input dataset storage nodes, the
locations of candidate computation nodes and the locations of
candidate output dataset storage nodes, the scheduling system next
needs to find out the connectivity information between storage nodes
and computation nodes. The first connectivity information is the
reachability between storage nodes and the computation nodes. A
input storage node, a computation node and a output storage node can
be allocated to execute a job only if data movement is allowed
between the input storage node and the computation node, and between
the computation node and the output storage node.
Because edge science networks are connected through transit networks,
the data movement between candidate storage nodes and computation
nodes need to consume networking resources of multiple networks if
Xiang & Yang Expires June 20, 2018 [Page 6]
Internet-Draft Unicorn Design December 2017
these nodes are located at different edge science networks. In order
to find the networking resource sharing between different (storage,
computation) pair, the scheduling system also needs to know which
networks are involved in the data movement of each (storage,
computation) node pair.
To retrieve both the types of information, the scheduling system
issues endpoint cost service queries to the ALTO servers at edge
science networks. For the ALTO server at an edge science network X,
the scheduling system issues endpoint cost service defined in
[RFC7285] or the extension flow cost service defined in [DRAFT-FCS]
queries for all the (input storage node, computation node) pairs
where the input storage node is located in X, and all the
(computation node, output storage node) pairs where the computation
node is located in X. The cost type of such queries is the new path
vector cost type introduced in [DRAFT-PV].
For each (storage, computation) pair, the response sent by the ALTO
servers at edge science networks is a path vector providing the
information about the AS-level path for the data movement of this
pair. Different from the traditional path vector where each element
is an AS name/number, each element in the path vector sent by the
ALTO servers also includes the ingress IP address of the gateway
switch/router of the corresponding network. We call this path vector
the "site-path", to differentiate it from the traditional AS-path.
5.1. Using SDN to get flow-based site-path
ALTO servers can compute the site-path for a given (storage,
computation) pair using the information provided by BGP and
traceroute. However, BGP only supports destination-IP based routing
and limits each network's ability to make fine-grained flow-based
routing decisions. We are investigating the usage of SDN technique
to allow different networks in the multi-domain data analytics system
to exchange and make fine-grained flow-based inter-domain routing
decisions. To avoid the route advertisement explosion brought by
flow-based routing, we design use a sub/pub system that allows an
ALTO server to send routing information queries of a set of flows,
instead of the whole flow space, to other ALTO servers at other
domains.
5.2. Path Discovery Example
The following is an example of path discovery query made by the
orchestrator.
Xiang & Yang Expires June 20, 2018 [Page 7]
Internet-Draft Unicorn Design December 2017
{ "cost-type":
{ "cost-mode": "array",
"cost-metric": "ane-path" },
"endpoint-flows":
{ "srcs": [ "ipv4:172.0.0.1", "ipv4:172.0.1.1"],
"dsts": [ "ipv4:172.0.2.1", "ipv4:172.0.3.1"]}
}
And the following is the response sent from the ALTO server.
{"endpoint-cost-map":
"ipv4: 172.0.0.1 ": {
"ipv4: 172.0.2.1 ": ["ane:172.1.0.1", "ane:172.0.2.0"],
"ipv4: 172.0.3.1 ": ["ane:172.1.0.1", "ane:172.0.3.0"]},
"ipv4: 172.0.1.1 ": {
"ipv4: 172.0.2.1 ": ["ane:172.1.0.1", "ane:172.0.2.0"],
"ipv4: 172.0.3.1 ": ["ane:172.1.0.1", "ane:172.0.3.0"]}
}
6. Networking Resource Discovery
The responses from ALTO servers during the path discovery provides
the connectivity information for every pair of candidate input
dataset storage node and computation node, and that of every pair of
candidate computation node to output storage node in the form of
site-path. With such information, the scheduling system can further
discover the networking resource sharing between candidate (storage,
computation) data movement flows. In particular, for each network X,
both edge science networks and transit network, we can easily derive
the whole set of candidate data movement flows F_X that will enter
network X from the site-path information of all candidate (storage,
computation) data movement flows. After deriving F_X for each
network X, the scheduling system will send endpoint cost services or
flow cost services to retrieve the resource state abstraction
[DRAFT-RSA] for the flow set F_X.
6.1. Networking Resource Discovery Example
TBA.
6.2. A Secure Multiparty Computation Protocol to Compute Minimal,
Cross-Domain RSA
The current design of ALTO-RSA can only compute the minimal resource
state abstraction for a single network. In Unicorn, we design a
secure multiparty computation protocol to support the computation of
minimal, cross-domain routing state abstraction. This protocol
contains each network's exposure of its redundant linear inequalities
Xiang & Yang Expires June 20, 2018 [Page 8]
Internet-Draft Unicorn Design December 2017
to a small number of other networks, and ensures that the
orchestrator only gets the minimal, cross-domain resource state
abstraction. The overhead of this SMPC process is reasonable due to
the adoption of state-of-the-art secure scalar product protocol.
7. References
7.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
7.2. Informative References
[DRAFT-CC]
Randriamasy, S., Yang, R., Wu, Q., Deng, L., and N.
Schwan, "ALTO Cost Calendar", 2017,
<https://datatracker.ietf.org/doc/
draft-ietf-alto-cost-calendar>.
[DRAFT-DC]
Lee, Y., Bernstein, G., Dhody, D., and T. Choi, "ALTO
Extensions for Collecting Data Center Resource
Information", 2014, <https://datatracker.ietf.org/doc/
draft-lee-alto-ext-dc-resource/>.
[DRAFT-FCS]
Zhang, J., Gao, K., Wang, J., Xiang, Q., and Y. Yang,
"ALTO Extension: Flow-based Cost Query", 2017,
<https://datatracker.ietf.org/doc/draft-gao-alto-fcs/>.
[DRAFT-MC]
Randriamasy, S., Roome, W., and N. Schwan, "Multi-Cost
ALTO", 2017, <https://datatracker.ietf.org/doc/
draft-ietf-alto-multi-cost/>.
[DRAFT-NETGRAPH]
Bernstein, G., Lee, Y., Roome, W., Scharf, M., and Y.
Yang, "ALTO Topology Extensions: Node-Link Graphs", 2015,
<https://tools.ietf.org/html/draft-yang-alto-topology-06>.
[DRAFT-PM]
Roome, W. and Y. Yang, "Extensible Property Maps for the
ALTO Protocol", 2015, <https://datatracker.ietf.org/doc/
draft-roome-alto-unified-props-new/>.
Xiang & Yang Expires June 20, 2018 [Page 9]
Internet-Draft Unicorn Design December 2017
[DRAFT-PV]
Bernstein, G., Lee, Y., Roome, W., Scharf, M., and Y.
Yang, "ALTO Extension: Abstract Path Vector as a Cost
Mode", 2015, <https://tools.ietf.org/html/
draft-yang-alto-path-vector-01>.
[DRAFT-RSA]
Gao, K., Wang, X., Xiang, Q., Gu, C., Yang, Y., and G.
Chen, "A Recommendation for Compressing ALTO Path
Vectors", 2017, <https://datatracker.ietf.org/doc/
draft-gao-alto-routing-state-abstraction/>.
[DRAFT-SSE]
Roome, W. and Y. Yang, "ALTO Incremental Updates Using
Server-Sent Events (SSE)", 2015,
<https://datatracker.ietf.org/doc/
draft-ietf-alto-incr-update-sse/>.
[DRAFT-UNICORN-INFO]
Xiang, Q., Newman, H., Bernstein, G., Du, H., Gao, K.,
Mughal, A., Balcas, J., Zhang, J., and Y. Yang,
"Implementation and Deployment of A Resource Orchestration
System for Multi-Domain Data Analytics", 2017,
<https://datatracker.ietf.org/doc/
draft-xiang-alto-exascale-network-optimization/>.
[HTCondor]
Thain, D., Tannenbaum, T., and M. Livny, "Distributed
computing in practice: the Condor experience", 2005,
<http://dl.acm.org/citation.cfm?id=1064336>.
[RFC7285] Alimi, R., Ed., Penno, R., Ed., Yang, Y., Ed., Kiesel, S.,
Previdi, S., Roome, W., Shalunov, S., and R. Woundy,
"Application-Layer Traffic Optimization (ALTO) Protocol",
RFC 7285, DOI 10.17487/RFC7285, September 2014,
<https://www.rfc-editor.org/info/rfc7285>.
Authors' Addresses
Qiao Xiang
Tongji/Yale University
51 Prospect Street
New Haven, CT
USA
Email: qiao.xiang@cs.yale.edu
Xiang & Yang Expires June 20, 2018 [Page 10]
Internet-Draft Unicorn Design December 2017
Y. Richard Yang
Tongji/Yale University
51 Prospect Street
New Haven, CT
USA
Email: yry@cs.yale.edu
Xiang & Yang Expires June 20, 2018 [Page 11]