Internet DRAFT - draft-miao-ccwg-hpcc-info

draft-miao-ccwg-hpcc-info







Network Working Group                                            R. Miao
Internet-Draft                                                      Meta
Intended status: Informational                                S. Anubolu
Expires: 1 September 2024                                   Broadcom Inc
                                                                  R. Pan
                                                                     AMD
                                                                  J. Lee
                                                                  Google
                                                                B. Gafni
                                                             J. Tantsura
                                                                  NVIDIA
                                                             A. Alemania
                                                                   Intel
                                                           Y. Shpigelman
                                                                  NVIDIA
                                                        29 February 2024


                      Inband Telemetry for HPCC++
                      draft-miao-ccwg-hpcc-info-02

Abstract

   Congestion control (CC) is the key to achieving ultra-low latency,
   high bandwidth and network stability in high-speed networks.
   However, the existing high-speed CC schemes have inherent limitations
   for reaching these goals.

   In this document, we describe HPCC++ (High Precision Congestion
   Control), a new high-speed CC mechanism which achieves the three
   goals simultaneously.  HPCC++ leverages inband telemetry to obtain
   precise link load information and controls traffic precisely.  By
   addressing challenges such as delayed signaling during congestion and
   overreaction to the congestion signaling using inband and granular
   telemetry, HPCC++ can quickly converge to utilize all the available
   bandwidth while avoiding congestion, and can maintain near-zero in-
   network queues for ultra-low latency.  HPCC++ is also fair and easy
   to deploy in hardware, implementable with commodity NICs and
   switches.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.







Miao, et al.            Expires 1 September 2024                [Page 1]

Internet-Draft                   HPCC++                    February 2024


   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 1 September 2024.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Inband telemetry padding at the network switches  . . . . . .   3
     2.1.  Inband telemetry on IFA2.0  . . . . . . . . . . . . . . .   4
     2.2.  Inband telemetry on IOAM  . . . . . . . . . . . . . . . .   5
     2.3.  Inband telemetry on P4.org INT  . . . . . . . . . . . . .   6
   3.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   7
   4.  Acknowledgments . . . . . . . . . . . . . . . . . . . . . . .   7
   5.  Contributors  . . . . . . . . . . . . . . . . . . . . . . . .   7
   6.  Security Considerations . . . . . . . . . . . . . . . . . . .   7
   7.  Normative References  . . . . . . . . . . . . . . . . . . . .   7
   8.  Informative References  . . . . . . . . . . . . . . . . . . .   7
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .   8

1.  Introduction

   The link speed in data center networks has grown from 1Gbps to
   100Gbps in the past decade, and this growth is continuing.  Ultralow
   latency and high bandwidth, which are demanded by more and more
   applications, are two critical requirements in today's and future
   high-speed networks.



Miao, et al.            Expires 1 September 2024                [Page 2]

Internet-Draft                   HPCC++                    February 2024


   Given that traditional software-based network stacks in hosts can no
   longer sustain the critical latency and bandwidth requirements as
   described in [Zhu-SIGCOMM2015], offloading network stacks into
   hardware is an inevitable direction in high-speed networks.  As an
   example, large-scale networks with RDMA (remote direct memory access)
   often uses hardware-offloading solutions.  In some cases, the RDMA
   networks still face fundamental challenges to reconcile low latency,
   high bandwidth utilization, and high stability.

   This document describes a new congestion control mechanism, HPCC++
   (Enhanced High Precision Congestion Control), for large-scale, high-
   speed networks.  The key idea behind HPCC++ is to leverage the
   precise link load information from signaled through inband telemetry
   to compute accurate flow rate updates.  Unlike existing approaches
   that often require a large number of iterations to find the proper
   flow rates, HPCC++ requires only one rate update step in most cases.
   Using precise information from inband telemetry enables HPCC++ to
   address the limitations in current congestion control schemes.
   First, HPCC++ senders can quickly ramp up flow rates for high
   utilization and ramp down flow rates for congestion avoidance.
   Second, HPCC++ senders can quickly adjust the flow rates to keep each
   link's output rate slightly lower than the link's capacity,
   preventing queues from being built-up as well as preserving high link
   utilization.  Finally, since sending rates are computed precisely
   based on direct measurements at switches, HPCC++ requires merely
   three independent parameters that are used to tune fairness and
   efficiency.

   HPCC++ is an enhanced version of [SIGCOMM-HPCC].  HPCC++ takes into
   account system constraints and aims to reduce the design overhead and
   further improves the performance.  Detailed specification about
   HPCC++ can be found at [draft-miao-ccwg-hpcc].

   This document describes the architecture changes in switches and end-
   hosts to support the needed tranmission of inband telemetry and its
   consumption, that imporves the efficiency in handling network
   congestion.

2.  Inband telemetry padding at the network switches

   HPCC++ only relies on packets to share information across senders,
   receivers, and switches.  The switch should capture inband telemetry
   information that includes link load (txBytes, qlen, ts) and link spec
   (switch_ID, port_ID, B) at the egress port.  Note, each switch should
   record all those information at the single snapshot to achieve a
   precise link load estimate.  Inside a data center, the path length is
   often no more than 5 hops.  The overhead of the inband telemetry
   padding for HPCC++ is considered to be low.



Miao, et al.            Expires 1 September 2024                [Page 3]

Internet-Draft                   HPCC++                    February 2024


   As long the above algorithm is met, HPCC++ is open to a variety of
   inband telemetry format standards, which are orthogonal to the HPCC++
   algorithm.  Although this document does not mandate a particular
   inband telemetry header format or encapsulation, we provide concrete
   implementation specifications using strandard inband telemetry
   protocols, including IFA [I-D.ietf-kumar-ippm-ifa], IETF IOAM
   [RFC9179], and P4.org INT [P4-INT].  In fact, the emerging inband
   telemetry protocols inform the evolution for a broader range of
   protocols and network functions, where this document leverages the
   trend to propose the architecture change to support in-network
   functions like congestion control with high efficiency.

2.1.  Inband telemetry on IFA2.0

   For more details, please refer to IFA [I-D.ietf-kumar-ippm-ifa]

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  lns  |  deviceID                             |     rsvd      |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | Speed |     rsvd      |          rxTimestampSec               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |       egressPort              |         ingressPort           |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                        rxTimeStampNs                          |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                         residenceTime                         |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                            txBytes                            |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |       rsvd                    |      Queue Length             |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                             rsvd                              |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


                        Figure 1: Example IFA header

   Figure 1 shows the packet format of the INT metadata after UDP and
   IFA metadata header.  The field lns is the local name space and
   defines the format of the metadata.  The field deviceID is a 20-bit
   field that uniquely identifies the device in the network.  The Speed
   field is an encode field with the following encoding for port speed:
   0 - 10G, 1 - 25G, 2 - 40G, 3- 50G, 4 - 100G, 5 - 200G, 6 - 400G.  The
   field cn is the congestion field and denotes if the packet
   experienced congestion.




Miao, et al.            Expires 1 September 2024                [Page 4]

Internet-Draft                   HPCC++                    February 2024


2.2.  Inband telemetry on IOAM

   IOAM is the technology adopted by IETF to be used for in-situ
   telemetry.  For the use of HPCC++ we would discuss the IOAM trace
   option as part of the IOAM architecture.  IOAM trace supports both
   Pre-allocated and Incremental trace Options, meaning that a node in
   the network may either write data into an already-allocated space in
   the packet, or may it add the data as an extenation to the IOAM
   header, respectively.  An IOAM data header has a modular design,
   where the data types written by a node are determined based on the
   IOAM trace header instruction list.  For the full description of the
   IOAM header design please refer to IETF IOAM [RFC9179] specification.
   In order to fulfill the requirements set by the HPCC++ architecture
   we would suggest to use the below trace types:

   *  Hop_Lim and node_id Short

   *  Ingress_if_id and egress_if_id Short

   *  Queue Depth

   *  Timestamp Fraction: To be used as egress timestamp rather than an
      ingress timestamp

   *  Transmitted Bytes

   Note that Transmitted Bytes trace type is defined in
   [I-D.draft-gafni-ippm-ioam-additional-data-fields] as a suggested
   extension to [RFC9179].

   When using the above trace types, the IOAM data header would be
   constructed as follows:



















Miao, et al.            Expires 1 September 2024                [Page 5]

Internet-Draft                   HPCC++                    February 2024


       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |   Hop_Lim     |              node_id                          |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |     ingress_if_id             |         egress_if_id          |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                       queue depth                             |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                   timestamp fraction                          |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                           tx_bytes                            |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+



                  Figure 2: Example of an IOAM data header

2.3.  Inband telemetry on P4.org INT

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                        Node ID (Nth hop)                      |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |      Ingress Interface ID     |      Egress Interface ID      |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |    Queue ID   |               Queue occupnacy                 |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                        Egress timestamp                       |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                    Egress timestamp (cont'd)                  |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                 Egress interface Tx utilization               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                        Node ID (N-1th hop)                    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                              ...                              |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                        Node ID (1st hop)                      |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                              ...                              |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+



         Figure 3: Example P4.org INT v2.1 per-hop metadata header




Miao, et al.            Expires 1 September 2024                [Page 6]

Internet-Draft                   HPCC++                    February 2024


   Figure 3 shows the per-hop metadata format of the P4.org INT-MD mode
   (following INT v2.1 spec).  Each hop switch along the path adds its
   Node ID for the sender to be able to track the path and detect a path
   change event.  If so, it throws away the existing status records of
   the flow and builds up new records.  Queue occupancy (24 bits) is the
   current buffer occupancy of the egress port and queue that the flow
   is going through.  Egress timestamp (8 bytes) is used by HPCC++
   algorithm to eventually compute interface utilization.  Since P4.org
   INT reports Egress TX utilization in-band, the Egress timestsamp is
   not mandatory but optional.  HPCC++ algorithm today doesn't require
   Ingress Interface ID.  P4.org INT defines Ingress and Egress
   Interface IDs as one metadata instruction.  We keep the Ingress ID
   for a future use.

3.  IANA Considerations

   This document makes no request of IANA.

4.  Acknowledgments

   The authors would like to thank RTGWG members for their valuable
   review comments and helpful input to this specification.

5.  Contributors

   The following individuals have contributed to the implementation and
   evaluation of the proposed scheme, and therefore have helped to
   validate and substantially improve this specification: Pedro Y.
   Segura, Roberto P.  Cebrian, Robert Southworth and Md Ashiqur Rahman.

6.  Security Considerations

   TBD

7.  Normative References

8.  Informative References

   [Zhu-SIGCOMM2015]
              Zhu, Y., Eran, H., Firestone, D., Guo, C., Lipshteyn, M.,
              Liron, Y., Padhye, J., Raindel, S., Yahia, M. H., and M.
              Zhang, "Congestion Control for Large-Scale RDMA
              Deployments", ACM SIGCOMM London, United Kingdom, August
              2015.

   [P4-INT]   "In-band Network Telemetry (INT) Dataplane Specification,
              v2.0", February 2020, <https://github.com/p4lang/p4-
              applications/blob/master/docs/INT_v2_0.pdf>.



Miao, et al.            Expires 1 September 2024                [Page 7]

Internet-Draft                   HPCC++                    February 2024


   [RFC9179]  "Data Fields for In Situ Operations, Administration, and
              Maintenance (IOAM)", May 2022,
              <https://datatracker.ietf.org/doc/html/rfc9197>.

   [I-D.draft-gafni-ippm-ioam-additional-data-fields]
              "Additional data fields for IOAM Trace Option Types", May
              2021, <https://datatracker.ietf.org/doc/html/draft-gafni-
              ippm-ioam-additional-data-fields-00>.

   [I-D.ietf-kumar-ippm-ifa]
              "Inband Flow Analyzer", February 2019,
              <https://tools.ietf.org/html/draft-kumar-ippm-ifa-07>.

   [draft-miao-ccwg-hpcc]
              Miao, R., "HPCC++: Enhanced High Precision Congestion
              Control", 2024.

   [SIGCOMM-HPCC]
              Li, Y., Miao, R., Liu, H., Zhuang, Y., Fei Feng, F., Tang,
              L., Cao, Z., Zhang, M., Kelly, F., Alizadeh, M., and M.
              Yu, "HPCC: High Precision Congestion Control", ACM
              SIGCOMM Beijing, China, August 2019.

Authors' Addresses

   Rui Miao
   Meta
   1 Hacker Way
   Menlo Park, CA 94025
   United States of America
   Email: rmiao@meta.com


   Surendra Anubolu
   Broadcom, Inc.
   1320 Ridder Park
   San Jose, CA 95131
   United States of America
   Email: surendra.anubolu@broadcom.com


   Rong Pan
   AMD
   2485 Augustine Dr.
   Santa Clara, CA 95054
   United States of America
   Email: Rong.Pan@amd.com




Miao, et al.            Expires 1 September 2024                [Page 8]

Internet-Draft                   HPCC++                    February 2024


   Jeongkeun Lee
   Google
   Headquarters 1600 Amphitheatre Parkway
   Mountain View, CA 95043
   United States of America
   Email: leejk@google.com


   Barak Gafni
   NVIDIA
   350 Oakmead Parkway, Suite 100
   Sunnyvale, CA 94085
   United States of America
   Email: gbarak@NVIDIA.com


   Jeff Tantsura
   NVIDIA
   United States of America
   Email: jefftant.ietf@gmail.com


   Allister Alemania
   Intel
   2200 Mission College Blvd
   Santa Clara,  95952
   United States of America
   Email: allister.alemania@intel.com


   Yuval Shpigelman
   NVIDIA
   Haim Hazaz 3A
   Netanya 4247417
   Israel
   Email: yuvals@nvidia.com















Miao, et al.            Expires 1 September 2024                [Page 9]