TSVWG | Y. Zhuang |
Internet-Draft | B. Zhang |
Intended status: Informational | H. Pan |
Expires: April 20, 2020 | Huawei Technologies Co., Ltd. |
October 18, 2019 |
Artificial Intelligence (AI) based ECN adaptive reconfiguration for datacenter networks
draft-zhuang-tsvwg-ai-ecn-for-dcn-00
This document describes an artificial intelligence (AI) based approach to adaptive reconfiguration of Explicit Congestion Notification (ECN) parameters in datacenter networks.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on April 20, 2020.
Copyright (c) 2019 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
As defined in [RFC3168], Explicit Congestion Notification (ECN) allows IP to signal congestion before packets are dropped. Because fewer dropped packets need to be retransmitted, application latency is reduced. MPLS also supports ECN, as defined in [RFC5129]. For tunneling, [RFC6040] defines how the ECN field should be constructed for IP-in-IP tunnels.
Meanwhile, upper-layer transport protocols have been specified to be ECN-capable, including TCP in [RFC3168], DCCP in [RFC4341][RFC4342][RFC5622], and RTP over UDP in [RFC6679].
With ECN marking, active queue management (AQM) can indicate congestion on a device by marking packets rather than dropping them; dropped packets may have to be retransmitted, which increases latency. An AQM in a network device signals congestion to congestion-controlled transports so that they limit the queue length in the buffer and thereby reduce traffic latency. Random Early Detection (RED), described in [RFC2309], is one of the AQM algorithms recommended for implementation in routers.
As stated in [RFC7567], RED can be an effective algorithm when its parameters are chosen properly. However, dynamically predicting the right set of parameters (minimum threshold and maximum threshold) is difficult, and as a result its present use in the Internet is limited. Other AQM algorithms have also been developed, but finding proper algorithm parameters for a given mix of application traffic remains difficult and affects network performance.
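To make the role of the two thresholds concrete, the following is a minimal sketch of the classic RED marking ramp applied to ECN marking. The function name, the parameter names (min_th, max_th, max_p) and the sample values are illustrative assumptions; they are not taken from this document or from any particular device.

```python
def red_mark_probability(avg_queue, min_th, max_th, max_p):
    """Return the ECN marking probability for an average queue length.

    Classic RED ramp: no marking below min_th, every packet marked at or
    above max_th, and a linear ramp from 0 to max_p in between.
    """
    if not 0 <= min_th < max_th:
        raise ValueError("thresholds must satisfy 0 <= min_th < max_th")
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)


# The same queue depth is marked very differently under two settings
# (illustrative values, in arbitrary queue-length units).
shallow = red_mark_probability(50, min_th=20, max_th=100, max_p=0.1)
deep = red_mark_probability(50, min_th=80, max_th=400, max_p=0.1)
```

With these example values, a queue length of 50 is already on the marking ramp under the first setting but is not marked at all under the second, which is why a single static setting rarely fits every traffic pattern.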
In data center networks, traffic patterns change as applications such as storage and high-performance computing are deployed and as their traffic varies, which makes the network more dynamic. At the same time, these applications have stricter requirements for high throughput and ultra-low latency. In this environment, finding a single set of static ECN configurations that suits all traffic at all times is challenging.
This document therefore describes a way to achieve ECN adaptive reconfiguration by using AI technologies in a running data center network environment.
Our intent is to find proper ECN parameters by using artificial intelligence technologies, so that a running data center network can tune itself, accommodate changes in network resources, and improve network performance.
We also offer this as a starting point for finding adaptive parameters for other algorithms and network reconfigurations by using AI technologies. This document does not change the way ECN works as defined in [RFC3168].
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.
The following figure shows a simple 2-layer data center network architecture with an analyzer that performs AI-based ECN adaptive reconfiguration as network traffic changes.
[Figure: An Analyzer at the top is connected by information collecting paths (dotted lines) to two Spine devices and four Leaf devices. Each Spine is connected to every Leaf by data paths (solid lines), and each Leaf is connected by data paths to a group of servers (S).]

Figure 1. The architecture of a 2-layer data center network
The analyzer can be integrated with a spine device or deployed as an independent device; the choice is left to the implementation. In this design, the analyzer is responsible for collecting device information and for periodically inferring proper parameters for ECN adaptive reconfiguration.
The idea of AI ECN in this document is to identify the “scene” of the current network at a given time, based on information collected over a period. The identified scene (which can also be considered a network traffic pattern) is one of the scenes collected and learned during the training process from datacenter networks running the traffic of various applications. The ECN settings of these scenes are decided based on human experience. The ECN parameters of the current network can therefore be tuned to the settings of the identified scene. This adaptive reconfiguration process runs periodically to accommodate changes in the running network environment caused by traffic changes.
Scene training is the first process in the procedure. It consists of two steps. First, construct typical scenes and generate a learning model that identifies these scenes from a set of network performance indicators. Second, provide proper ECN settings for these typical scenes based on human experience.
In the first step, the network operator may need to select, based on experience, some typical applications and traffic combinations to be used as the typical training scenes. For these typical scenes, a learning algorithm (for example, a neural network) is run to learn the characteristics of the scenes from periodically collected network performance indicators.
The selected network performance indicators can be a device’s port bandwidth, queue size, etc., which may be related to the applications and traffic in the network.
In the second step, the experience of network administrators can be used to provide proper ECN configurations for these typical scenes. AI technologies can also be used to enrich the scene set based on this human experience, which is left to the implementation.
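The two training steps can be sketched as follows. This is only an illustration under stated assumptions: a nearest-centroid model stands in for the learning algorithm (the text gives a neural network as one example), and the scene names, indicator values and ECN settings are all hypothetical.

```python
from statistics import mean


def train_centroids(samples):
    """Step 1: learn one centroid per scene from labelled indicator vectors.

    samples maps a scene name to a list of indicator vectors, here
    (port utilization in 0..1, mean queue depth in KB).
    """
    return {
        scene: tuple(mean(column) for column in zip(*vectors))
        for scene, vectors in samples.items()
    }


# Hypothetical indicator vectors collected while running typical scenes.
TRAINING = {
    "storage": [(0.80, 400.0), (0.90, 500.0), (0.85, 450.0)],
    "hpc": [(0.30, 20.0), (0.40, 30.0), (0.35, 25.0)],
}

# Step 2: ECN settings supplied by an administrator for each scene.
# The values are placeholders, not recommendations.
SCENE_ECN = {
    "storage": {"min_th_kb": 400, "max_th_kb": 1600, "max_p": 0.2},
    "hpc": {"min_th_kb": 30, "max_th_kb": 200, "max_p": 0.1},
}

centroids = train_centroids(TRAINING)
```

The result of training is the pair (centroids, SCENE_ECN): a model that recognizes a scene from indicators, and a table that maps each recognized scene to its ECN settings.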
In a production network, the analyzer periodically collects the selected network performance indicators from network nodes. This information is used as input to the pre-learnt model, which outputs the identified scene. The ECN settings of the network devices are then reconfigured to the parameters of the identified scene. This process repeats periodically.
The length of the adaptive cycle can be decided according to experience, or it can be a result of the training process defined in section 3.1.
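One adaptive cycle (collect, identify, reconfigure) might look like the sketch below. It assumes a nearest-centroid model over normalized indicators, and it pushes settings only when the identified scene changes; both are design assumptions of this example rather than requirements of this document. All names and values are hypothetical.

```python
import math

# Hypothetical pre-learnt model: scene name -> centroid of normalized
# indicators (port utilization, queue occupancy), both in 0..1.
CENTROIDS = {"storage": (0.85, 0.45), "hpc": (0.35, 0.03)}

# Hypothetical per-scene ECN settings decided in the training process.
SCENE_ECN = {
    "storage": {"min_th_kb": 400, "max_th_kb": 1600, "max_p": 0.2},
    "hpc": {"min_th_kb": 30, "max_th_kb": 200, "max_p": 0.1},
}


def identify_scene(indicators, centroids=CENTROIDS):
    """Return the scene whose centroid is closest to the indicator vector."""
    return min(centroids, key=lambda s: math.dist(indicators, centroids[s]))


def reconfigure_once(collect, apply, current_scene=None):
    """Run one cycle and return the identified scene.

    collect() returns the latest indicator vector; apply(scene, settings)
    pushes ECN settings to the devices. Settings are pushed only when the
    scene differs from current_scene.
    """
    scene = identify_scene(collect())
    if scene != current_scene:
        apply(scene, SCENE_ECN[scene])
    return scene


# Three consecutive cycles fed with canned samples instead of live data.
applied = []
scene = None
for sample in [(0.82, 0.40), (0.80, 0.42), (0.30, 0.05)]:
    scene = reconfigure_once(
        lambda s=sample: s,
        lambda name, settings: applied.append(name),
        scene,
    )
```

In a deployment, the loop would wait for the adaptive cycle between iterations, and collect and apply would be backed by the data collection and configuration protocols used in the network.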
In both the training and the adaptive reconfiguration processes, the analyzer needs to collect information about the network, i.e., a set of network performance indicators.
The data collection can be achieved by gRPC, YANG-Push, or other protocols.
The adaptive reconfiguration of ECN in a running network environment can be achieved by control-plane protocols such as NETCONF.
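As a rough illustration of what a NETCONF-based reconfiguration could carry, the sketch below builds a configuration payload for one queue. This document does not define a data model for ECN settings, so the module namespace "urn:example:ecn-config" and all element names under it are hypothetical; only the NETCONF base namespace is standard.

```python
import xml.etree.ElementTree as ET

# Standard NETCONF base namespace.
NETCONF_NS = "urn:ietf:params:xml:ns:netconf:base:1.0"
# Hypothetical namespace standing in for a device's ECN data model.
ECN_NS = "urn:example:ecn-config"


def build_ecn_config(queue_id, min_th_kb, max_th_kb, max_mark_percent):
    """Build a <config> payload holding ECN thresholds for one queue."""
    config = ET.Element(f"{{{NETCONF_NS}}}config")
    ecn = ET.SubElement(config, f"{{{ECN_NS}}}ecn")
    queue = ET.SubElement(ecn, f"{{{ECN_NS}}}queue")
    leaves = (
        ("id", queue_id),
        ("min-threshold-kb", min_th_kb),
        ("max-threshold-kb", max_th_kb),
        ("max-mark-percent", max_mark_percent),
    )
    for tag, value in leaves:
        ET.SubElement(queue, f"{{{ECN_NS}}}{tag}").text = str(value)
    return ET.tostring(config, encoding="unicode")


payload = build_ecn_config(3, 400, 1600, 20)
```

The analyzer would hand such a payload to a NETCONF client as the content of an edit-config operation, with the element names replaced by those of the data model the devices actually support.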
TBD
TBD
No IANA action
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017.
We would like to thank the following persons for their great efforts and contributions to the work: Huafeng Wen, Binghui Wu, Weiqin Kong, Ke Meng, Xitong Jia, Liang Shan, Siyu Yan, Weishan Deng, Boding Wang, Jungan Yan, Haonan Ye and Liang Zhang.