COIN J. He
Internet-Draft R. Chen
Intended status: Informational Huawei
Expires: April 14, 2019 M. Montpetit, Ed.
Triangle Video
October 11, 2018
In-Network Data-Center Computing
draft-he-coin-datacenter-00
Abstract
This draft reviews existing research and open issues related to
adding data plane programmability in the data center. While some of
the research hypotheses at the center of in-network computing have
been investigated since the days of active networking, recent
developments in software-defined networking, virtualization,
programmable switches, and new network programming languages like P4
have generated renewed enthusiasm in the research community and a
flurry of new projects in systems and applications alike. This is
what this draft addresses.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on April 14, 2019.
Copyright Notice
Copyright (c) 2018 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction
   1.1. Requirements Language
2. In-Network Computing and Data Centers
3. State of the Art in DC Programmability
   3.1. In-Network Computing
   3.2. In-Network Caching
   3.3. In-Network Consensus
   3.4. Research Topics in Next Generation Data Centers
   3.5. Conclusion
4. References
   4.1. Normative References
   4.2. Informative References
Authors' Addresses
1. Introduction
It is now a given in the computing and networking world that
traditional approaches to cloud and client-server architectures lead
to complexity and scalability issues. New solutions are necessary to
address the growth of next-generation network operation (in data
centers and edge devices alike), including automation, self-
management, orchestration across components, and federation across
network nodes to enable emerging services and applications.
Mobility, social networking, big data, and AI/ML, as well as
emerging content applications in XR (virtual, augmented, and mixed
reality), require more scalable, available, and reliable solutions
that operate in real time, anywhere, and over a wide variety of end
devices. While these solutions involve edge resources for computing,
rendering, and distribution, this draft focuses on the data center
and on the current research approaches to creating more flexible
solutions. We must first define what we understand by data centers.
In this draft, we do not limit them to single-location cloud
resources; they span multiple locations and interwork with edge
resources to enable the network programmability that is central to
next-generation DCs in terms of supported services and dynamic
resilience. This leads to innovative research opportunities,
including but not limited to:
- Software defined networking (SDN) in distributed environments.
- Security and trust models.
- Data plane programmability for consensus and key-value
operations.
- High-level abstractions: in-network computing should focus on
primitives that can be widely reused across classes of applications
and workloads, and identify the high-level abstractions that promote
deployment.
- Machine Learning (ML) and Artificial Intelligence (AI) to detect
faults and failures and allow rapid responses as well as implement
network control and analytics.
- New services for mixed reality (XR) deployment with in-network
optimization and advanced data structures and rendering for
interactivity, security and resiliency.
1.1. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
2. In-Network Computing and Data Centers
As DC hardware components become interchangeable, the advent of
software-defined technologies suggests that a change is underway. In
the next-generation data center, an increasing percentage of critical
business and management functions will be activated in the software
layer rather than in the underlying hardware. This will allow
organizations to move away from current manual configurations toward
more dynamic, rules-based configurations. Hence, virtualization and
cloud computing have redefined the data center (DC) boundaries beyond
the traditional hardware-centric view [SAPIO]. Servers, storage,
monitoring, and connectivity are becoming one. The network is, more
and more, the computer.
Hence, there are now a number of distributed networking and computing
systems that form the basis of big-data and AI-related applications,
in DCs in particular. They include distributed file systems (e.g.,
the Hadoop Distributed File System, or HDFS [HADOOP]), distributed
in-memory databases (e.g., Memcached [MEM]), distributed computing
systems (e.g., MapReduce from Hadoop [HADOOP], TensorFlow [TENSOR],
and Spark GraphX [SPARK]), as well as distributed trust systems on
the blockchain, such as hyperledgers and smart contracts.
In parallel, the emergence of the P4 language [P4] and programmable
switches facilitates innovation and triggers new research. For
example, the latest programmable switches bring the concept of a
totally programmable and dynamically reconfigurable network closer to
reality. And, as distributed systems are increasingly based on
memory instead of hard disks, the performance of distributed-system-
based applications is increasingly constrained by network resources
rather than by computing.
However, there are some challenges when introducing in-network
computing and caching:
- Limited memory size: for example, the SRAM size [TOFINO] can be
as small as tens of MBs.
- Limited instruction sets: the operations are mainly simple
arithmetic, data (packet) manipulation, and hash operations. Some
switches can provide limited floating-point operations. This
enables network performance tools like forward error correction
but limits more advanced applications such as, for example,
machine learning for congestion control.
- Limited speed/CPU processing capabilities: only a few operations
can be performed on each packet to ensure line speed (tens of
nanoseconds on fast hardware). Looping could allow a processed
packet to re-enter the ingress queue, but at the cost of increased
latency and reduced forwarding capability.
- Performance: for devices located on the links of the distributed
network, it remains to be evaluated how on-path processing can
reduce the Flow Completion Time (FCT) in data center networks and
hence reduce network traffic/congestion and increase throughput.
The next sections of this draft review how some of these questions
are currently being addressed in the research community.
3. State of the Art in DC Programmability
Recent research has shown that in-network computing can greatly
improve DC network performance in three typical scenarios: on-path
aggregate computing, key-value (K-V) caching, and strong
consistency. Some of these research results are summarized below.
3.1. In-Network Computing
The goals of on-path computing in the DC are (1) to reduce delay
and/or increase throughput for improved performance by allowing
advanced packet processing, and (2) to help reduce network traffic
and alleviate congestion by implementing better traffic and
congestion management [REXFORD][SOULE][SAPIO].
However, in terms of research and implementation, there are still
open issues that need to be addressed in order to fulfill these
promises, beyond what was mentioned in the previous section. In
particular, the end-to-end principle, which has driven most of the
networking paradigms of the last 20 years, is challenged when in-
network computing devices are inserted on the ingress-egress path.
This is still an open discussion topic.
The type of computing that can be performed to improve DC
performance is another open topic. Computing should improve
performance, but not at the expense of degrading existing
applications. Computing should also enable new applications to be
developed. At the time of this writing, those include data-intensive
applications in workload mode with partition and aggregation
functionality.
Data-intensive applications include big data analysis (e.g., data
reduction, deduplication, and machine learning), graph processing,
and stream processing. They support scalability by distributing data
and computing across many worker servers. Each worker performs
computing on a part of the data, and there is a communication phase
to update the shared state or complete the final calculation. This
process can be executed iteratively. Communication cost and the
availability of bottleneck resources will be among the main
challenges for such applications to perform well, as a large amount
of data needs to be transmitted frequently in many-to-many mode. But
already, there are several distributed frameworks with user-defined
aggregation functions, such as MapReduce from Hadoop [HADOOP], Pregel
from Google [PREGEL], and DryadLinq from Microsoft [DRYAD]. These
functions enable application developers to reduce the network load
used for messaging by aggregating single messages together and
consequently reduce task execution time. Currently, these
aggregation functions are used only at the worker level. If they
were also used at the network level, a higher traffic reduction
ratio could be reached.
The aggregation functions needed by data-intensive applications have
features that make them suitable for execution, at least partially,
on a programmable switch. They usually reduce the total amount of
data through arithmetic (add) or logical functions (minima/maxima
detection) that can be parallelized. Performing these functions in
the DC at the ingress of the network can be beneficial to reduce
total network traffic and lead to reduced congestion. The challenge,
of course, is not to lose important data in the process, especially
when the functions are applied to different parts of the input data
without considering their order, which can affect the accuracy of
the final result.
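
To make this property concrete, the following minimal Python sketch
(an illustration by the editors, not code from the cited systems; the
two-level worker/switch topology is hypothetical) shows that, for a
commutative and associative function such as element-wise addition,
partial aggregation at intermediate hops yields the same result as a
single aggregation at the end host, while halving the number of
messages the reducer receives:

   # Minimal sketch: in-network partial aggregation for a
   # commutative/associative function (element-wise add).
   # Hypothetical topology: 4 workers -> 2 ToR switches -> 1 reducer.
   from functools import reduce

   def vec_add(a, b):
       # Commutative and associative, so it can be applied hop by hop.
       return [x + y for x, y in zip(a, b)]

   workers = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [0, 1, 0]]

   # Baseline: the reducer receives all 4 messages and aggregates.
   end_host_result = reduce(vec_add, workers)

   # In-network: each ToR switch aggregates its 2 attached workers,
   # so the reducer receives only 2 messages (a 50% reduction here).
   tor_partials = [reduce(vec_add, workers[0:2]),
                   reduce(vec_add, workers[2:4])]
   in_network_result = reduce(vec_add, tor_partials)

   assert in_network_result == end_host_result  # order does not matter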
In-network computing can also improve the performance of multipath
routing by aggregating path capacity for individual flows and
providing dynamic path selection, improving scalability and
multitenancy.
Other data-intensive applications whose network load can be improved
by in-network computing include machine learning, graph analysis,
data analytics, and MapReduce. For all of those, aggregation
functions in the computing hardware provide a reduction of potential
network congestion; in addition, because of the reduced load, overall
application performance is improved. The traffic reduction was shown
to range from 48% up to 93% [SAPIO].
Machine learning is a very active research area for in-network
computing because of the large datasets it both requires and
generates. For example, in TensorFlow [TENSOR], parameter updates
are small deltas that change only a subset of the overall tensor and
can be aggregated by a vector addition operation. The overlap of
the tensor updates, i.e., the portion of tensor elements that are
updated by multiple workers at the same time, is representative of
the possible data reduction achievable when the updates are
aggregated inside the network.
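
As a hedged illustration of this effect (the editors' own toy
example, not the TensorFlow API; indices and deltas are arbitrary),
sparse parameter updates from several workers can be represented as
index-to-delta maps and merged by addition, with the overlapping
indices measuring the achievable reduction:

   # Sketch: aggregating sparse tensor updates by vector addition.
   # Each worker sends {tensor_index: delta}; indices updated by
   # several workers collapse into one entry when aggregated.
   from collections import Counter

   updates = [
       Counter({3: 0.1, 7: -0.2}),   # worker 0
       Counter({3: 0.4, 9: 0.3}),    # worker 1
       Counter({7: 0.2, 9: -0.1}),   # worker 2
   ]

   aggregated = Counter()
   for u in updates:
       aggregated.update(u)          # sparse element-wise add

   sent = sum(len(u) for u in updates)   # 6 entries on the wire
   delivered = len(aggregated)           # 3 entries after aggregation
   print(f"traffic reduction: {1 - delivered / sent:.0%}")   # 50%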
In graph analysis, three algorithms with various characteristics
have been considered in [SAPIO]: PageRank, Single Source Shortest
Path (SSSP), and Weakly Connected Components (WCC), all with a
commutative and associative aggregation function. Experiments show
that the potential traffic reduction ratio in the three applications
is significant.
Finally, in MapReduce, experiments in the same paper show that
after aggregation computing, the number of packets received by the
reducer decreases by 88-90% compared to using UDP, and by 40%
compared to using TCP. There is thus great promise for MapReduce-
like frameworks to take advantage of computing and storage
optimization.
3.2. In-Network Caching
Key-value stores are ubiquitous, and one of their major challenges
is to process their associated data-skewed workloads in a dynamic
fashion. As in any cache, popular items receive more queries, and
the set of popular items can change rapidly with the occurrence of
well-liked posts, limited-time offers, and trending events. The skew
generated by the dynamic nature of the K-V workload can lead to
severe load imbalance and significant performance deterioration.
Servers are either overused in one area or underused in another,
throughput can decrease rapidly, and response time latency degrades
significantly. When the storage server uses per-core sharding/
partitioning to process high concurrency, this degradation is
further amplified. The problem of unbalanced load is especially
acute for high-performance in-memory K-V stores.
Selective replication, the copying of popular items, is often used
to keep performance high. However, in addition to consuming more
hardware resources, selective replication requires a complex
mechanism to implement data mobility, data consistency, and query
routing. As a result, system design becomes complex and overhead is
increased.
This is where in-network caching can help. Recent research
experiments show that K-V cache throughput can be improved by 3-10x
by introducing an in-network cache. Analytical results in [FAN] show
that a small front-end cache can provide load balancing for N back-
end nodes by caching only O(N log N) entries, even under worst-case
request patterns. Hence, caching O(N log N) items is sufficient to
balance the load for N storage servers (or CPU cores).
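
The following small simulation (an illustration by the editors, not
the analysis in [FAN]; the skew, key space, and hash partitioning
are arbitrary assumptions) shows the effect: Zipf-skewed requests
are hash-partitioned across N servers, with the hottest O(N log N)
keys absorbed by a front-end cache:

   # Sketch: a small front-end cache absorbing skewed load (cf. [FAN]).
   import math, random

   random.seed(0)
   N, KEYS, QUERIES = 16, 10000, 100000
   cache_size = int(N * math.log(N))     # O(N log N): ~44 entries

   # Zipf(1)-distributed requests: the key of rank r has weight 1/r.
   ranks = range(1, KEYS + 1)
   requests = random.choices(ranks,
                             weights=[1 / r for r in ranks],
                             k=QUERIES)

   cached = set(range(1, cache_size + 1))  # cache holds the top keys
   load = [0] * N
   for k in requests:
       if k not in cached:                 # cache miss -> back-end server
           load[hash(k) % N] += 1

   print("imbalance (max/mean load):", max(load) * N / sum(load))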
In the NetCache system [JIN], a new rack-scale key-value store design
guarantees billions of queries per second (QPS) with bounded
latencies, even under highly skewed and rapidly changing workloads.
A programmable switch is used to detect, sort, and cache hot K-V
pairs and to serve them directly, balancing the load across the
storage nodes.
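
A much-simplified sketch of the switch-side logic (the editors'
illustration, not NetCache's actual P4 pipeline; thresholds, sizes,
and names are arbitrary): a count-min sketch tracks key popularity,
and keys crossing a threshold are promoted into the small on-switch
cache, which then answers reads directly:

   # Sketch: simplified switch-side hot-key caching (cf. [JIN]).
   W, D = 512, 3                      # count-min sketch width and depth
   HOT, CAP = 50, 64                  # promotion threshold, cache slots
   cms = [[0] * W for _ in range(D)]  # count-min counters
   cache = {}                         # tiny on-switch K-V cache

   def estimate_and_count(key):
       counts = []
       for d in range(D):
           idx = hash((d, key)) % W   # one hash per sketch row
           cms[d][idx] += 1
           counts.append(cms[d][idx])
       return min(counts)             # count-min estimate (overcounts only)

   def read(key, backing_store):
       if key in cache:
           return cache[key]          # served directly by the switch
       value = backing_store[key]     # miss: forwarded to a storage node
       if estimate_and_count(key) >= HOT and len(cache) < CAP:
           cache[key] = value         # promote a detected hot key
       return value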
3.3. In-Network Consensus
Strong consistency and consensus in distributed networks are
important, and significant efforts in the in-network computing
community have been directed towards them. Coordination is needed to
maintain system consistency, and it requires a large amount of
communication between network nodes and instances, taking processing
capabilities away from other, more essential tasks. To avoid this
performance overhead and extra resource usage, consistency
guarantees are often weakened, and the resulting potential
inconsistencies then need to be addressed.
Maintaining consistency requires multiple communication rounds to
reach agreement, hence the danger of creating messaging bottlenecks
in large systems. Even without congestion, failures, or lost
messages, a decision can only be reached as fast as the network
round-trip time (RTT) permits. Thus, it is essential to find
efficient mechanisms for the agreement protocols. One idea is to use
the network devices themselves.
Hence, consensus mechanisms for ensuring consistency are some of the
most expensive operations in managing large amounts of data [ZSOLT].
Often, there is a tradeoff that involves reducing the coordination
overhead at the price of accepting possible data loss or
inconsistencies. As the demand for more efficient data centers
increases, it is important to provide better ways of ensuring
consistency without affecting performance. In [ZSOLT], consensus
(atomic broadcast) is removed from the critical path by moving it to
hardware. A proof of concept of the ZooKeeper atomic broadcast (also
used in Hadoop) is implemented at the network level on an FPGA,
using both TCP and an application-specific network protocol. This
design can be used to push more value into the network, e.g., by
extending the functionality of middleboxes or adding inexpensive
consensus to in-network processing nodes.
A widely used protocol for consensus is Paxos. Paxos is a
fundamental protocol used by fault-tolerant systems and is widely
used by data center applications. In summary, Paxos serializes
transaction requests from different clients at the leader, ensuring
that each learner (message replicator) in the distributed system
applies them in the same order. Each proposal can be an atomic
operation (an inseparable operation set); Paxos does not care about
the specific content of the proposal. Recently, research evaluations
have suggested that moving Paxos logic into the network would yield
significant performance benefits for distributed applications
[DANG15]. In this scheme, network switches play the role of
coordinators (request managers) and acceptors (managed storage
nodes). Messages travel fewer hops in the network, reducing the
latency for the replicated system to reach consensus; this matters
because coordinators and acceptors typically act as bottlenecks in
Paxos implementations, since they must aggregate or multiplex
multiple messages. Experiments suggest that moving consensus logic
into network devices could dramatically improve the performance of
replicated systems. In [DANG15], NetPaxos achieves a maximum
throughput of 57,457 messages/s, while basic Paxos, whose
coordinator is CPU-bound, is only able to send 6,369 messages/s. In
[DANG16], a P4 implementation of Paxos is presented, showing how the
protocol can be expressed with a programmable data plane.
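
To make the acceptor role concrete, the following is a minimal
Python sketch of single-instance Paxos acceptor logic, the kind of
per-message state machine that [DANG16] expresses as a switch
pipeline (the message formats and names here are the editors' own
simplification):

   # Sketch: single-instance Paxos acceptor state machine.
   class Acceptor:
       def __init__(self):
           self.promised = -1    # highest ballot promised so far
           self.accepted = -1    # ballot of the accepted value, if any
           self.value = None

       def on_prepare(self, ballot):        # phase 1a -> 1b
           if ballot > self.promised:
               self.promised = ballot
               return ("promise", ballot, self.accepted, self.value)
           return ("nack", self.promised)

       def on_accept(self, ballot, value):  # phase 2a -> 2b
           if ballot >= self.promised:
               self.promised = self.accepted = ballot
               self.value = value
               return ("accepted", ballot, value)
           return ("nack", self.promised)

   a = Acceptor()
   assert a.on_prepare(1)[0] == "promise"
   assert a.on_accept(1, "tx42")[0] == "accepted"
   assert a.on_prepare(0)[0] == "nack"      # stale ballot is rejected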
Other papers have shown the use of in-network processing and SDN for
Paxos performance improvements using mostly-ordered multicast and
network-level sequencing [LIJ] [PORTS].
3.4. Research Topics in Next Generation Data Centers
While the previous section introduced the state of the art in data
center in-network computing, there are still some open issues that
need to be addressed. In this section, some of these questions are
listed, as well as the impacts that adding in-network computing will
have on existing systems.
Adding computing and caching to the network violates the end-to-end
principle central to the Internet, and interaction with encrypted
systems can limit what in-network computing can do to individual
packets. In addition, even when programmable, every switch is still
designed for (line-speed) forwarding, with the resulting
limitations, such as lack of floating-point support for advanced
algorithms and buffer size limitations. Especially in high-
performance data centers, for in-network computing to be successful,
a balance between functionality, performance, and cost must be
found.
Hence the research areas include but are not limited to:
Protocol design and network architecture: a lot of the current in-
network work has targeted network layer optimization; however,
transport protocols will influence the performance of any in-
network solution. How can in-network optimization interact (or
not) with transport layer optimizations?
Can the end-to-end assumptions of existing transports like TCP
still be applicable in the in-network compute era? There is prior
experience from middlebox interactions with existing flows.
Fixed-point calculation for current applications vs. floating-
point calculation for more complex operations and services:
network switches typically do not support floating-point
calculation. Is it necessary to introduce this capability and
scale it to the demand? For example, AI and ML algorithms
currently mainly use floating-point calculation. If an AI
algorithm is changed to fixed-point calculation, will the training
terminate prematurely and will the training result deteriorate
(this needs to be done as joint work with the AI community)? The
sketch after this list illustrates the quantization step involved.
What are the gains brought by aggregation in distributed and
decentralized networks? The built-in buffers of network devices
are often limited, while, for example, AI and XR application
parameters and caches can reach hundreds of megabytes. There is a
trade-off, in terms of performance, between aggregating the data
on a single network device and distributing it across multiple
nodes.
What is the relationship between the depth of packet inspection
and not only performance but also security and privacy? There is
a need to determine what application layer cryptography is ready
to expose to other layers and even to collaborating nodes; this is
also related to trust in distributed networks.
The relationship between the speed of creating and updating tables
on the data plane and overall performance.
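
As a toy illustration of the fixed-point question above (the
editors' example; the 8-bit Q4.4 format and the sample values are
arbitrary assumptions), quantizing gradient-like updates to a fixed-
point format shows how updates below the format's resolution simply
vanish:

   # Sketch: effect of fixed-point quantization on small ML updates.
   # Q4.4-style format: 8 bits, 4 fractional -> resolution of 1/16.
   FRAC_BITS = 4
   SCALE = 1 << FRAC_BITS

   def to_fixed(x):
       # Round to nearest representable value; saturate to 8-bit range.
       q = max(-128, min(127, round(x * SCALE)))
       return q / SCALE

   grads = [0.013, -0.021, 0.004, 0.055]  # typical small updates
   print([to_fixed(g) for g in grads])
   # -> [0.0, 0.0, 0.0, 0.0625]: the first three updates fall below
   #    the resolution and are lost, one way training could stall.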
3.5. Conclusion
In-network computing as it applies to data centers is a very active
and promising research area. The proposed Research Group thus
creates an opportunity to bring the community together to establish
common goals, identify hurdles and difficulties, and provide paths
to new research, especially in applications and in linkage to other
new networking research areas at the edge. More information is
available in [COIN].
4. References
4.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
4.2. Informative References
[CHAN]    Chan, M., et al., "Network Support for DNN Training", 2018.

[COIN]    He, J., Chen, R., and M. Montpetit, "Computing in the
          Network, COIN, proposed IRTF group", 2018.

[DANG15]  Dang, T., et al., "NetPaxos: Consensus at Network Speed",
          2015.

[DANG16]  Dang, T., et al., "Paxos Made Switch-y", 2016.

[DRYAD]   Microsoft, "DryadLinq", 2018.

[FAN]     Fan, B., et al., "Small Cache, Big Effect: Provable Load
          Balancing for Randomly Partitioned Cluster Services", 2011.

[FORSTER] Forster, N., "To be included", 2018.

[GRAHAM]  Graham, R., "Scalable Hierarchical Aggregation Protocol
          (SHArP): A Hardware Architecture for Efficient Data
          Reduction", 2016.

[HADOOP]  Hadoop, "Hadoop Distributed Filesystem", 2016.

[JIN]     Jin, X., et al., "NetCache: Balancing Key-Value Stores with
          Fast In-Network Caching", 2017.

[LIJ]     Li, J., et al., "Just Say NO to Paxos Overhead: Replacing
          Consensus with Network Ordering", 2016.

[LIX]     Li, X., et al., "Be Fast, Cheap and in Control with
          SwitchKV", 2016.

[MEM]     memcached.org, "Memcached", 2018.

[P4]      p4.org, "P4 Language", 2018.

[PORTS]   Ports, D., et al., "Designing Distributed Systems Using
          Approximate Synchrony in Data Center Networks", 2015.

[PREGEL]  github.com/igrigorik/pregel, "Pregel", 2018.

[REXFORD] Rexford, J., "SIGCOMM 2018 Keynote Address", 2018.

[SAPIO]   Sapio, A., et al., "In-Network Computation is a Dumb Idea
          Whose Time Has Come", 2017.

[SOULE]   Soule, R., "SIGCOMM 2018 NetCompute Workshop Keynote
          Address", 2018.

[SPARK]   Apache, "Spark GraphX", 2018.

[SUBEDI]  Subedi, T., et al., "OpenFlow-based In-Network Layer-2
          Adaptive Multipath Aggregation in Data Centers", 2015.

[TENSOR]  tensorflow.org, "TensorFlow", 2018.

[TOFINO]  Barefoot Networks, "Tofino", 2018.

[ZSOLT]   Istvan, Z., et al., "Consensus in a Box", 2016.
Authors' Addresses
Jeffrey He
Huawei
Email: jeffrey.he@huawei.com
Rachel Chen
Huawei
Email: chenlijuan5@huawei.com
Marie-Jose Montpetit (editor)
Triangle Video
Boston, MA
US
Email: marie@mjmontpetit.com