Networking Working Group A. Ravi
Internet-Draft N. Dukkipati
Intended status: Experimental N. Mehta
Expires: 5 August 2024 Google LLC
J. Kumar
Broadcom Inc.
2 February 2024
Congestion Signaling (CSIG)
draft-ravi-ippm-csig-01
Abstract
This document presents Congestion Signaling (CSIG), an in-band
network telemetry protocol that allows end-hosts to obtain visibility
into fine-grained network signals for congestion control, traffic
management, and network debuggability. CSIG provides
a simple, low-overhead, and extensible packet header mechanism to
obtain fixed-length summaries from bottleneck devices along a packet
path. This summarized information is collected over L2 CSIG-tags in
a compare-and-replace manner across network devices along the path.
Receivers can reflect this information back to senders via L4+ CSIG
reflection headers.
CSIG builds upon the successful aspects of prior work such as switch
in-band network telemetry (INT) that incorporates multibit signals in
live data packets. At the same time, CSIG's end-to-end mechanism for
carrying the signals via a fixed-size header is simple, practical, and
deployable, akin to Explicit Congestion Notification (ECN).
In addition to a detailed description of the end-to-end protocol,
this document also motivates the use cases for CSIG and the rationale
for design choices made in CSIG. It describes a set of signals of
interest to applications (minimum available bandwidth, maximum link
utilization, and maximum hop delay), methods to compute these signals
in network devices, and how these signals can be leveraged in
applications. Additionally, it describes how attributes about the
bottleneck's location can be carried and made useful to applications.
It also provides the framework to incorporate future signals.
Finally, this document addresses incremental deployment, backward
compatibility and nuances of CSIG's applicability in a range of
scenarios.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Ravi, et al. Expires 5 August 2024 [Page 1]
Internet-Draft CSIG February 2024
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 5 August 2024.
Copyright Notice
Copyright (c) 2024 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction
1.1. Terminology
2. Design Principles
3. Conventions
4. Congestion Signaling Protocol
4.1. CSIG-tag Header Format
4.1.1. Compact Format
4.1.2. Expanded Format
4.1.3. CSIG-tag Data fields Description
4.2. CSIG Reflection Header Format
4.2.1. Reflection in TCP
4.2.2. Reflection in non-TCP Transports
4.3. CSIG Operation - Life of a packet
4.3.1. Forward Path
4.3.2. Reverse Path
4.3.3. Multiple signals
4.4. Device Roles
4.4.1. Sender host
4.4.2. Transit device
4.4.3. Receiver host
4.4.4. Host roles for bidirectional flows
5. Signals in CSIG
5.1. Minimum Available Bandwidth - min(ABW)
5.1.1. ABW Computation
5.2. Maximum link utilization - max(U/C) or min(ABW/C)
5.2.1. ABW/C Computation
5.2.2. min(ABW) vs min(ABW/C) bottlenecks
5.3. Shared requirements for min(ABW) and min(ABW/C)
5.3.1. Algorithm Requirements
5.3.2. Timescale and Accuracy Requirements
5.3.3. Bucketing / Quantization Requirements
5.3.4. QoS requirements
5.4. Maximum Per-hop Delay - max(PD)
5.4.1. Per-hop Delay Computation
5.4.2. Requirements
5.5. Locator Metadata Implementation
5.5.1. Requirements
5.5.2. Attributes
6. Incremental Deployment of CSIG
6.1. CSIG Stripping: A per egress-port primitive
6.2. Levels of CSIG Support
6.2.1. Discard
6.2.2. Pass-through
6.2.3. Complete
6.3. Interoperability in Brownfield Deployments
6.3.1. Requirements for interoperability
6.3.2. Forwarding
6.3.3. Negotiation
6.4. Backward Compatibility via Software-assisted CSIG
6.5. Greenfield deployments
7. Design Rationale
7.1. Choice of Layer 2
7.2. Separation of headers for CSIG-tag and reflection
7.3. Fixed-size headers
7.4. Signal Design
8. Use Cases defined by Bottleneck Signals
8.1. Congestion Control
8.1.1. Using maximum per-hop delay in E2E CC
8.1.2. Using maximum link utilization in E2E CC
8.1.3. Using minimum available bandwidth in E2E CC
8.2. Traffic Management
8.2.1. Load Balancing and Multipathing
8.2.2. Traffic Engineering
8.3. Application Performance Debugging
9. Security Considerations
10. IANA Considerations
11. Conclusions
12. Acknowledgements
13. Normative References
Appendix A. Example encodings of CSIG signals
Contributors
Authors' Addresses
1. Introduction
Many network control loops, including Congestion Control, Traffic
Engineering and Network Operations, make decisions based on the
congestion experienced by application flows. The signals used to
determine congestion are often implicitly derived from end-to-end
signals, approximated over larger timescales than desired, or
obtained out-of-band from the network. This can lead to suboptimal
performance for applications or inefficiency in network usage. CSIG
(Congestion Signaling) provides direct, real-time, inband signals
that network control loops can incorporate for performance and
efficiency.
A number of congestion control algorithms (CCA) are deployed in
datacenters, including Swift [SWIFT], BBR [BBR], DCTCP [RFC8257],
DCQCN [DCQCN] and HPCC++ [I-D.miao-tsv-hpcc]. These CCA vary in the
congestion signals they use and in how they increase/decrease flow
rates in response to the signals. Swift uses precise measurements of
round-trip time (RTT) to modulate its congestion window. BBR uses a
combination of a flow's delivery rate and RTT measurements. DCTCP and
DCQCN rely on Explicit Congestion Notification (ECN [RFC3168]) marks
from switches, which indicate whether queue buildup exceeds a threshold.
HPCC++ leverages per-hop queue depth and transmit bytes along the
flow's path, obtained via inband telemetry probes, to update flow
rates.
Despite the advances in sophisticated signals on when to slow down
transfers, there continue to be blind-spots for CCA when it comes to
increasing flow rates, e.g., What is the appropriate starting rate
for a flow? How quickly should a flow ramp up in the absence of
congestion? Without explicit information from the network, end-to-
end CCA have come to rely on heuristics that can either undershoot or
overshoot the bottleneck bandwidth, which can lead to slower Flow
Completion Times (FCT) or increased round-trip times or packet
losses. At the same time, applications' appetite for fast network
performance is rising: AI/ML applications are pushing for fast
network transfers to avoid idling expensive Tensor Processing Units
(TPUs) and Graphics Processing Units (GPUs). Similarly, storage
disaggregation needs fast transfers to make a remote storage device
appear as a local device at the host.
In this document we introduce Congestion Signaling (CSIG) to
explicitly notify the hosts of the bottleneck link metrics. There
are several important use cases for CSIG, including:
* Congestion Control Algorithms for making decisions on sending
rate: CCA at senders can use CSIG for quickly and safely ramping
up to the maximum feasible rate as determined by the bottleneck
link, and react with precision to the bottleneck hop both in the
presence and absence of congestion. The motivation for quick
ramp-up stems from making maximal use of datacenter bandwidth, and
decreasing latency even for large transfers. There are several
ways in which CSIG can help complete transfers quickly, e.g.,
transfers belonging to an ML collective communication can ramp up
quickly to maximally use all network bandwidth and complete close
to the ideal transfer completion time.
* Traffic Management systems, including Traffic Engineering (TE),
Load Balancing and Multipathing, also benefit from CSIG. TE systems
infer congested flows through an offline multi-minute process via
superimposition of network traffic stats, topology and routing
information. With CSIG, TE has more up-to-date information on the
congested points and the application flows experiencing
congestion. Using such finer-grained information can lead to more
efficient and timely provisioning for bursty traffic. Similarly,
CSIG-enabled multipathed transport flows can choose paths in real
time with the most available bandwidth.
* Troubleshooting and Performance Optimization. We also envision
CSIG to assist with debugging the network-level performance of
datacenter applications. Large-scale applications, including ML
training workloads, open thousands of connections at the transport
layer. When the network is slow for an application, it is almost
impossible to identify the bottleneck hops without joining many
data sources across switches and hosts. Because CSIG conveys the
path bottleneck characteristics, it is valuable in pinpointing
choke points in the network. Knowledge of these choke points can
lead to better bandwidth provisioning, timely repair processes,
and real-time control, such as better load balancing.
CSIG provides simple, fixed-length summaries of bottleneck links
along a path, such as maximum hop delay, minimum available bandwidth,
and maximum link utilization. Information is collected at L2 from
network devices along a packet path. Each data receiver then returns
the collected information to the data sender via L4 transport options
or payloads. CSIG uses a simple compare-and-replace operation at
network devices, which allows it to scale with network topology, link
speeds, and packet rates.
CSIG builds on the successful aspects of prior explicit feedback
schemes, but is more capable. CSIG carries rich multi-bit switch
telemetry in live data packets, drawing from the advancements in in-
band network telemetry, also generally known as INT. At the same
time, CSIG retains the fixed-size headers and reflection in L4
transports akin to Explicit Congestion Notification (ECN). The
industry has three key variants of INT: the one first specified in
P4.org [P4-INT], the IOAM (In Situ Operations, Administration, and
Maintenance) standard [RFC9378] in IETF and the Inband Flow Analyzer
(IFA) spec [I-D.kumar-ippm-ifa] that is used in HPCC deployment
[HPCCPLUS]. While they differ in the header definitions and
encapsulation mechanisms, they all commonly stack up multiple per-
switch telemetry data per-hop in the path of a packet. The packet
size grows proportional to the metrics per switch and the number of
forwarding devices along its path. Depending on the use case and
header definition, the per-packet overhead ranges from 20B to above
100B. The large, variable-size header overhead makes it challenging
to conform to end-to-end MTU limits and to parse the packet header
data in the forwarding or receiving devices.
There exist several efforts to address the challenges incurred in INT
variants, including: 1) carrying INT data in synthetically generated
non-data packets also known as probe packets, and 2) carrying only
the fixed-size INT instructions (e.g., specifying which data to
collect per hop) in data packets, while hop devices generate separate
report packets that deliver the requested per-hop data. While these
techniques reduced the per-data-packet overhead, they did not
fundamentally reduce the total number of bytes or the PPS overhead on
the network devices or the data collector. TCP-INT [TCP-INT] was
developed in parallel to carry fixed-size min/max/sum aggregate
metrics over the hops together with a hop locator in live data
packets. However, it is limited to TCP Options and hence not applicable
to various modern transports for AI/HPC; furthermore, it offers no
flexible way to introduce a new metric. CSIG's type-value format
keeps the overhead constant in size while remaining extensible. The
guaranteed constant size is small enough to fit into the 4B or 8B
tag, enabling the unique placement of CSIG in L2, which frees the
operators from the concerns around tunneling and encryption in
deploying CSIG.
In the rest of the document, we describe the design of end-to-end
CSIG at hosts and network devices.
1.1. Terminology
ABW: Available Bandwidth
AQM: Active Queue Management
CCA: Congestion Control Algorithms
Connection / Flow: A 5-tuple transport connection, e.g., a TCP
connection
CSIG: Congestion Signaling
CSIG data fields: Fields in the CSIG tag excluding the TPID.
CSIG packets: Packets that contain the CSIG-tag and optionally the
CSIG reflection header
CSIG-capable path: Path is termed CSIG-capable if all transit
devices along the path support the CSIG protocol and end hosts
have at least pass-through support for CSIG packets
CSIG-tagged packets: Packets that contain the CSIG-tag in the packet
header
CSIG-domain: Secure network deployment domain where all devices in
the domain have complete CSIG support or pass-through CSIG support
PD: Per-hop delay
E2E: End-to-End
IPSec: Internet Protocol Security
MTU: Maximum Transmission Unit
MSS: Maximum Segment Size
NIC: Network Interface Card
Packet Path: The port-by-port network path taken by a given packet
specified as a sequence of device interfaces
PSP: PSP Security Protocol
TPID: Tag Protocol ID
TE: Traffic Engineering
Transit device: Any switch, router or middlebox in the path of a
CSIG packet
WRR: Weighted Round Robin
2. Design Principles
CSIG was conceived to address problems in congestion control, traffic
management and network debuggability in production networks. We
describe below the design principles that shaped CSIG, with
simplicity and ease of deployment being at the forefront. Section 7
discusses the rationale behind the specific design choices made in
CSIG.
* Simple Signals driven by Use Cases: Simple device port or queue
metrics that solve concrete use cases are at the heart of CSIG's
design principles. This simplicity is not only important to
applications, but also keeps the area, power and cost of
implementation low on network devices. Signals in CSIG are
designed to be implementable in ASICs at line rate. Signals that
track per-flow state at the switch, for example, are harder to
implement and deploy, and are hence avoided in CSIG. CSIG is also
flexible enough to accommodate new signals and use cases beyond
those described in this document.
* End-to-End Perspective: CSIG's design stems from an end-to-end
perspective of requirements and trade-offs for both applications
and the network. This document covers the necessary end-to-end
aspects and the resulting design choices that make CSIG both
useful to applications and practical to deploy.
* Small and Fixed Packet Overhead: It is important that the packet
size does not increase as it traverses the network, which means
that the MTU does not need to be changed. Any overhead that is
introduced should be fixed and small, minimizing the cost of
implementation in switch / NIC pipelines. Low protocol overhead
also means low bandwidth overhead for small packets, minimizing
impact to packet-per-second (PPS) load and bandwidth efficiency.
We make very few assumptions about which packets and devices CSIG
is enabled on. Device implementations must be able to process
CSIG on packets at line rate with minimal CPU involvement.
Keeping the overhead small and fixed allows for CSIG to be enabled
on every single packet at line rate. This is important because
deployments may choose to enable CSIG on every packet rather than
on a small sample of packets.
* Works easily under Tunneling and Encryption: Tunnels are broadly
used in modern deployments; e.g., traffic engineering systems and
cloud traffic frequently use tunnels. CSIG is designed to easily
support end-to-end signaling on devices even in the presence of
complex tunneling deployments. This is in contrast to other in-
band telemetry schemes that put more pressure on the ASICs to
relocate metadata across inner and outer headers to work in the
presence of tunnels. In addition, CSIG also works with encrypted
packets, including PSP, IPSec and 802.1AE MAC Security.
* Incremental Deployability: CSIG allows incremental deployment,
where the mechanism can be deployed gradually into domains where
some devices may support the new protocol and others may not.
This document addresses interoperability in heterogeneous
networks, and addresses backward compatibility with legacy
devices. We envision CSIG to be broadly valuable across wired
networks, although our target domain for initial usage is
datacenter networks. We make minimal assumptions about the
network architecture around tunneling, number of hops (diameter),
routing, topology, etc. Configuring CSIG for end-to-end
consistency in a private network and deploying CSIG over the Internet
are not in scope for this document.
3. Conventions
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
In this document, these words will appear with that interpretation
only when in ALL CAPS. Lower case uses of these words are not to be
interpreted as carrying significance described in RFC 2119.
4. Congestion Signaling Protocol
The CSIG protocol defines two components in the packet header to
achieve end-to-end congestion signaling in a production network:
* CSIG-tag: An L2 protocol that end hosts and transit devices
participate in.
* CSIG Reflection: A flexible L4+ protocol that only end hosts
participate in.
CSIG-tag is the core component of the CSIG specification. It enables
end hosts to request network signals of interest and transit
devices to provide these signals to end hosts over the specified
packet header bits.
However, to achieve end-to-end CSIG, CSIG-tag MAY be combined with
the CSIG reflection protocol to expose the signals of interest to the
relevant endpoints or consumers where the signals are needed.
This section first describes the header formats for CSIG-tag and CSIG
reflection. Then it describes the life of a CSIG packet, outlining
the different roles of network devices in the context of CSIG, and
how these two packet header mechanisms work together to achieve end-
to-end signaling.
4.1. CSIG-tag Header Format
The CSIG-tag is a fixed-size tag in the layer 2 header.
CSIG-tag placement in various packet encapsulations is shown below
for completeness. It is always the last tag in the layer 2 header.
ARPA: dstmac / srcmac / csig-tag / ethertype / payload
802.1q: dstmac / srcmac / vlan-tag / csig-tag / ethertype / payload
802.1ad: dstmac / srcmac / vlan-tag / vlan-tag / csig-tag / ethertype
/ payload
802.1ad tunnel: dstmac / srcmac / vlan-tag / vlan-tag / vlan-tag /
vlan-tag / csig-tag / ethertype / payload
802.1ae: dstmac / srcmac / security-tag / vlan-tag / csig-tag /
ethertype / payload
Consequently, the placement / offset of the CSIG tag is not affected
by the headers and payload at layers 3 and above. Layer 2.5 headers,
such as MPLS, are also placed after the CSIG tag and do not impact
its offset.
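As an illustration of the placement rule above, the following is a
minimal sketch of how a parser might locate the CSIG-tag by walking the
layer 2 tags. It is not a normative procedure. The two TPID constants
are stand-ins (the IEEE local experimental EtherTypes), since this
document does not give the allocated codepoints, and the function name
find_csig_tag is hypothetical. Only the ARPA, 802.1q and 802.1ad
layouts are handled; the 802.1ae security tag is not parsed.

```python
import struct

# Stand-in TPIDs (IEEE local experimental EtherTypes); the real CSIG
# codepoints are not given in this document.
CSIG_TPID_COMPACT = 0x88B5   # assumed: 4-byte CSIG-tag
CSIG_TPID_EXPANDED = 0x88B6  # assumed: 8-byte CSIG-tag
VLAN_TPIDS = {0x8100, 0x88A8}  # 802.1Q C-tag and 802.1ad S-tag


def find_csig_tag(frame: bytes):
    """Return (offset, tag_length) of the CSIG-tag, or None if absent."""
    offset = 12  # skip dstmac (6 bytes) and srcmac (6 bytes)
    while offset + 2 <= len(frame):
        (tpid,) = struct.unpack_from("!H", frame, offset)
        if tpid in VLAN_TPIDS:
            offset += 4  # each VLAN tag is TPID (2B) + TCI (2B)
        elif tpid == CSIG_TPID_COMPACT:
            return offset, 4
        elif tpid == CSIG_TPID_EXPANDED:
            return offset, 8
        else:
            return None  # reached the ethertype: no CSIG-tag present
    return None
```

Because the CSIG-tag is the last layer 2 tag, the walk never needs to
inspect layer 2.5 or layer 3 headers.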
CSIG-tag is defined in two variants - Compact and Expanded. Each
variant has a dedicated TPID codepoint to allow devices to infer
which variant is in use. Each variant supports a distinct set of
requirements with respect to production deployment and identifies
contrasting trade-off points in the solution space. Deployment
considerations are discussed in Section 6.
Structurally, the compact CSIG-tag variant resembles a single VLAN
tag and the expanded CSIG-tag variant resembles a double VLAN tag.
This structural similarity is intentional and the reasons are
elaborated in Section 6.4.
4.1.1. Compact Format
CSIG-tag compact format is as shown, with 2B allocated for the CSIG
Tag Protocol ID (TPID) and 2B allocated for the data fields.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             TPID              |  T  |R|    S    |     LM      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0-15| TPID : IEEE allocated Tag Protocol ID for 4 Byte CSIG tag
|16-18| T : Signal Type (0:min(ABW), 1: min(ABW/C), 2:max(PD))
|19| R : Reserved
|20-24| S : Signal Value: Bucketed (32 configurable buckets)
|25-31| LM : Locator Metadata of bottleneck device / port
Figure 1: CSIG-tag Compact version
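The bit layout in Figure 1 can be sketched in code as follows. This is
an illustrative encoding only: the function names are hypothetical,
network byte order is assumed, and the reserved bit R is assumed to be
transmitted as zero and ignored on receipt.

```python
import struct


def pack_compact(tpid: int, t: int, s: int, lm: int) -> bytes:
    """Pack a 4-byte compact CSIG-tag following Figure 1 (R sent as 0)."""
    if not (0 <= t < 8 and 0 <= s < 32 and 0 <= lm < 128):
        raise ValueError("field out of range for the compact format")
    # 16-bit data word: T in bits 15..13, R in bit 12, S in bits 11..7,
    # LM in bits 6..0 (bit 15 is the first bit on the wire).
    data = (t << 13) | (s << 7) | lm
    return struct.pack("!HH", tpid, data)


def unpack_compact(tag: bytes):
    """Return (tpid, t, s, lm) from a 4-byte compact CSIG-tag."""
    tpid, data = struct.unpack("!HH", tag)
    return tpid, (data >> 13) & 0x7, (data >> 7) & 0x1F, data & 0x7F
```

For example, pack_compact(0x88B5, 2, 31, 5) produces the bytes
88 b5 4f 85, where 0x88B5 is only a placeholder TPID.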
4.1.2. Expanded Format
CSIG-tag expanded format is as shown, with 2B allocated for the Tag
Protocol ID (TPID) and 6B allocated for the data fields.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             TPID              |              LM               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   T   |                   S                   |       R       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0-15| TPID : IEEE allocated Tag Protocol ID for 8 Byte CSIG tag
|16-31| LM : Locator Metadata of bottleneck device / port
|0-3| T : Signal Type (0:min(ABW), 1: min(ABW/C), 2:max(PD))
|4-23| S : Signal Value: Uniformly quantized
|24-31| R : Reserved for future use
Figure 2: CSIG-tag Expanded version
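A corresponding sketch for the expanded layout in Figure 2 is given
below, under the same assumptions as before: hypothetical function
names, network byte order, and the 8 reserved bits sent as zero.

```python
import struct


def pack_expanded(tpid: int, lm: int, t: int, s: int) -> bytes:
    """Pack an 8-byte expanded CSIG-tag following Figure 2 (R sent as 0)."""
    if not (0 <= lm < 1 << 16 and 0 <= t < 16 and 0 <= s < 1 << 20):
        raise ValueError("field out of range for the expanded format")
    # Second 32-bit word: T in bits 31..28, S in bits 27..8, R in bits 7..0.
    word2 = (t << 28) | (s << 8)
    return struct.pack("!HHI", tpid, lm, word2)


def unpack_expanded(tag: bytes):
    """Return (tpid, lm, t, s) from an 8-byte expanded CSIG-tag."""
    tpid, lm, word2 = struct.unpack("!HHI", tag)
    return tpid, lm, word2 >> 28, (word2 >> 8) & 0xFFFFF
```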
4.1.3. CSIG-tag Data fields Description
This section describes the format and usage of data fields within the
CSIG-tag.
4.1.3.1. Signal Type
The Signal Type field T is three (four) bits long in the compact
(expanded) format and indicates the type of signal being carried in
the CSIG-tag. End hosts set the signal type T and request it on each
packet of interest. Up to 8 signal types are supported in the
compact format, and up to 16 signal types are supported in the
expanded format. This draft concretely defines three signals:
min(ABW), min(ABW/C) and max(PD), elaborated in Section 5 and
Section 8. The remaining codepoints are reserved for future signals,
and may be defined and used in future versions of CSIG.
A single packet can carry at most one Congestion Signal. However,
end hosts MAY obtain multiple signals for a single 5-tuple flow by
requesting different signal types on alternating packets of a flow or
in a round-robin fashion across packets. Therefore, end hosts need
not tie a single flow to a specific signal type, and MAY obtain all
supported CSIG signals for a single flow.
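One way a sender might rotate Signal Types across the packets of a flow
is sketched below. The scheduler function is hypothetical; only the
three codepoints come from Figures 1 and 2, and a real transport could
weight the signals differently.

```python
from itertools import cycle

# Signal Type codepoints defined in this document.
MIN_ABW, MIN_ABW_OVER_C, MAX_PD = 0, 1, 2


def signal_type_scheduler(wanted=(MIN_ABW, MIN_ABW_OVER_C, MAX_PD)):
    """Return an iterator giving the Signal Type T for each next packet."""
    return cycle(wanted)


sched = signal_type_scheduler()
first_six = [next(sched) for _ in range(6)]  # [0, 1, 2, 0, 1, 2]
```

With three signal types in rotation, each signal is refreshed once
every three CSIG-tagged packets of the flow.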
4.1.3.2. Signal Value
The Signal Value field S is 5 bits (20 bits) long in the compact
(expanded) format and captures the value of the signal specified by
Signal Type T. End hosts set the initial Signal Value S alongside
the requested Signal Type T, and each transit device along the packet
path in the network MAY modify S in accordance with the e2e signal
being computed. For example, for signals that are min() aggregations,
end hosts set the initial value of S to the maximum allowable value of
the signal (or its encoding), and transit devices perform
compare-and-replace to compute the min() across the signals of the
individual devices on the packet path.
In the compact format, the 5-bit Signal Value is bucketed with 32
fully configurable buckets. Each bucket is configured with a (low,
high) value range. This configuration is specific to each Signal
Type and MAY vary across Signal Types. This allows the Signal Value
representation to be tailored to the specific needs of each Signal
Type. For example, in typical use cases of available bandwidth, it
is more useful to have higher granularity at lower values of the
signal (i.e., when ABW is close to 0) than at higher values of the
signal. This is because lower values of ABW have a greater impact on
application control decisions; e.g., knowing whether 0 Gbps or 1 Gbps
is available on a path makes a larger difference than knowing
whether 399 Gbps or 400 Gbps is available. Appendix A shows how
the buckets could be defined in order to provide such a non-linear
encoding of value ranges to buckets. Such configurable encodings
capture useful information about the signal with fewer bits
and are a core feature of the compact CSIG format.
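The non-linear bucketing idea can be sketched as a table lookup. The
bucket edges below are invented for illustration and are not the
Appendix A encodings: they are 1 Gbps wide near zero and grow wider
towards the top of the range. The function name abw_to_bucket is also
hypothetical.

```python
from bisect import bisect_right

# Hypothetical lower edges (Gbps) of the 32 min(ABW) buckets.
ABW_BUCKET_LOW_GBPS = list(range(16)) + [
    16, 24, 32, 48, 64, 80, 96, 128,
    160, 192, 224, 256, 288, 320, 352, 384,
]
assert len(ABW_BUCKET_LOW_GBPS) == 32  # must fit the 5-bit field S


def abw_to_bucket(abw_gbps: float) -> int:
    """Map an available-bandwidth value to its 5-bit bucket index."""
    idx = bisect_right(ABW_BUCKET_LOW_GBPS, abw_gbps) - 1
    return max(idx, 0)  # clamp negative inputs into bucket 0
```

With these edges, 0 Gbps and 1 Gbps fall into different buckets (0 and
1), while 399 Gbps and 400 Gbps share bucket 31. Because the bucket
index grows with the value, a min() over bucket indices matches a min()
over the underlying values at bucket granularity.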
In the expanded format, Signal Value is uniformly quantized into a 20
bit value. The unit of quantization is configurable on a per Signal
Type basis, depending on the minimum and maximum value that needs to
be represented with the given bits. The higher bit length allows for
enhanced signal granularity and fewer configuration knobs in domains
where the expanded CSIG format is viable to deploy (Section 6.5).
20 bits are sufficient to represent a wide range of values with high
granularity. As an example, with an 8 Mbps quantum for min(ABW), the
signal value field can represent up to a maximum of about 8 Tbps. With
a 128 ns quantum for max(PD), the signal value field can represent up
to a maximum of about 128 ms. These figures approximate 2^20 as 10^6;
the exact limits are roughly 8.4 Tbps and 134 ms. More discussion on
signal-specific quanta is in Appendix A.
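The representable range of the 20-bit field can be checked with a short
calculation. The sketch below assumes truncating (floor) quantization
that saturates at the largest code; this document does not specify the
rounding rule, and the function names are hypothetical.

```python
S_BITS = 20
S_MAX = (1 << S_BITS) - 1  # largest encodable Signal Value: 1,048,575


def quantize(value: float, quantum: float) -> int:
    """Uniformly quantize value into a 20-bit code, saturating at S_MAX."""
    return min(max(int(value // quantum), 0), S_MAX)


def dequantize(s: int, quantum: float) -> float:
    """Lower edge of the quantization step encoded by s."""
    return s * quantum


max_abw_mbps = dequantize(S_MAX, 8)  # 8 Mbps quantum
max_pd_ns = dequantize(S_MAX, 128)   # 128 ns quantum
```

Here max_abw_mbps is 8,388,600 Mbps (about 8.4 Tbps) and max_pd_ns is
134,217,600 ns (about 134 ms). As a further example, 100 Gbps of
available bandwidth, expressed as 100,000 Mbps, encodes to 12,500.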
Signal quantization / bucketing parameters are configured directly at
the transit devices where the signal is computed. End hosts do not
explicitly request or negotiate these parameters. As described in
Section 5, all devices MUST be configured with the same quantization
/ bucketing parameters for each signal type, in order to correctly
compute the requested signal along packet paths.
4.1.3.3. Locator Metadata
The Locator Metadata field LM is an optional field that is 7 bits (16
bits) long in the compact (expanded) format. It captures relevant
metadata about the bottleneck port or device, where the notion of
bottleneck is specific to individual signal types. Locator Metadata MAY
include compressed attributes of the bottleneck that are relevant for
the use case, e.g., capacity of the bottleneck port, stage of the
bottleneck device in the data center topology, or orientation of the
bottleneck port (uplink / downlink). LM MAY also include expanded
attributes of the bottleneck (e.g., port ID, TTL). This document provides
recommendations for the type of information that locator metadata MAY
carry, but it does not require any specific set of metadata to be
supported. Metadata that is useful and viable to support will depend
on the production setting, which is out of scope for this document.
Instances of CSIG deployment MAY include locator metadata with
custom-defined metadata beyond those described in this document.
Section 5.5 discusses requirements for supporting LM in devices.
End hosts initialize LM to a default value. Transit devices that do
not update the Signal Value S on a given packet MUST NOT alter LM on
the packet. Transit devices that update S on a packet MUST update LM
on the same packet.
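The coupling between S and LM in the compare-and-replace step can be
sketched as follows. This is an illustration under stated assumptions,
not the normative transit behavior: the class and function names are
hypothetical, the encoded values are assumed to be order-preserving, a
tie leaves the tag unchanged, and a device that does not recognize the
Signal Type is assumed to leave the tag untouched.

```python
from dataclasses import dataclass

# Signal Type codepoints defined in this document.
MIN_ABW, MIN_ABW_OVER_C, MAX_PD = 0, 1, 2


@dataclass
class CsigFields:
    t: int   # Signal Type
    s: int   # Signal Value (encoded)
    lm: int  # Locator Metadata


def transit_update(tag: CsigFields, local_s: int, local_lm: int) -> CsigFields:
    """Compare-and-replace at one transit device.

    local_s is this device's encoded value for signal tag.t and local_lm
    its locator metadata; both are assumed to be computed elsewhere.
    """
    if tag.t in (MIN_ABW, MIN_ABW_OVER_C):
        replace = local_s < tag.s   # min() aggregation
    elif tag.t == MAX_PD:
        replace = local_s > tag.s   # max() aggregation
    else:
        replace = False             # unknown type: leave the tag untouched
    if replace:
        tag.s, tag.lm = local_s, local_lm  # S and LM change together
    return tag


# A compact-format packet crossing three hops; S starts at the maximum.
tag = CsigFields(t=MIN_ABW, s=31, lm=0)
for hop_s, hop_lm in [(20, 1), (7, 2), (12, 3)]:
    transit_update(tag, hop_s, hop_lm)
```

After the three hops the tag carries S = 7 and LM = 2, i.e., the value
and locator of the second hop, which is the bottleneck for this signal.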
4.2. CSIG Reflection Header Format
CSIG reflection enables consumption of tag data fields at the point
where the signals are needed for telemetry or control. This
mechanism is particularly relevant for sender-driven / source-based
telemetry and control. For receiver-driven transports and
controllers, CSIG reflection may not be necessary as the signals on
the CSIG tag are available at the receiver without reflection (See
Section 4.3).
This document provides recommendations on how CSIG reflection SHOULD
be implemented, and provides the framework to make the implementation
deployment-specific.
The CSIG reflection header is a separate header from the CSIG tag,
implemented at layer 4 or above. The location of the header and the
choice of which packets carry the header are transport-specific. As
an example, the header can be carried on TCP ACK packets from the
receiver back to the sender. Note that the presence of ACK
coalescing, piggybacked ACKs, Selective Acknowledgements (SACK) etc.
can impact the behavior of CSIG reflection. More generally, there
may not be a 1:1 mapping between forward and reverse path packets.
In a scenario where the transport implements ACK coalescing, the CSIG
reflection header SHOULD reflect the latest CSIG-tag data fields
received across the packets being acknowledged or a more advanced
summary of the CSIG-tag data fields across the packets being
acknowledged. Note that because Signal Type is chosen at per-packet
granularity, a coalesced ACK may acknowledge multiple packets that
carry different signal types in their CSIG-tags. In such a scenario,
the reflection header MAY reflect only one of the signals. The sender
transport should choose Signal Type for packets in a way that ensures
that it continues to receive all signals of interest.
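The two summarization options for coalesced ACKs can be sketched as follows. This is a minimal illustration only: the TagFields record and both function names are hypothetical, and the list of acknowledged packets is assumed to be in arrival order.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class TagFields:
    """Hypothetical record of the data fields read from one CSIG-tag."""
    signal_type: int   # which signal the sender requested on this packet
    signal_value: int  # aggregated value written along the path
    locator: int       # locator metadata of the bottleneck


def latest_fields(acked: Iterable[TagFields]) -> Optional[TagFields]:
    """Simplest option: reflect the most recently received tag."""
    last = None
    for fields in acked:  # assumed to be in arrival order
        last = fields
    return last


def latest_per_signal_type(acked: Iterable[TagFields]) -> Dict[int, TagFields]:
    """A richer summary: the newest tag seen for each Signal Type."""
    newest: Dict[int, TagFields] = {}
    for fields in acked:
        newest[fields.signal_type] = fields
    return newest
```

With the second summary, the receiver can rotate which Signal Type it reflects on successive ACKs, so that a coalesced ACK does not starve any one signal.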
The CSIG reflection header MAY include all of the CSIG data fields,
i.e., 2B for the compact version and 6B for the expanded version.
However, an implementation can save header space by including only a
subset of the data fields if the consumer is interested in only a
subset of signals or locator metadata.
CSIG reflection is an end-host-only protocol; transit devices do not
participate in it. Therefore, the CSIG reflection header can be
carried in portions of the packet that are end-to-end encrypted via
PSP or IPsec.
The following subsections discuss locations in the packet header
where CSIG reflection could be implemented for different transports.
4.2.1. Reflection in TCP
Reflection in TCP is typically achieved via TCP options. CSIG
Reflection can be implemented via a new TCP Option, identified by a
unique Kind.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Kind      |    Length     |       CSIG data fields        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Kind : Unique codepoint to recognize TCP CSIG option
Length : Length in bytes of the CSIG data fields
carried in the options payload
CSIG data fields : Values reflected from receiver to sender
Figure 3: CSIG Reflection TCP Option
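As an illustration of the option layout in Figure 3, the sketch below serializes and parses the option. It rests on assumptions: no Kind has been assigned for CSIG, so the value 253 (a TCP option kind reserved for experiments) is only a placeholder, and the function names are hypothetical. Length follows the description above and counts only the CSIG data field bytes; standard TCP option encoding also counts the Kind and Length octets, so a real implementation may need to add 2.

```python
import struct

CSIG_OPTION_KIND = 253  # placeholder only: no Kind is assigned for CSIG


def build_csig_option(data_fields: bytes) -> bytes:
    """Serialize Kind, Length and the reflected CSIG data fields."""
    if len(data_fields) not in (2, 6):
        raise ValueError("expected 2 (compact) or 6 (expanded) bytes")
    # Length as described for Figure 3: bytes of CSIG data fields only.
    header = struct.pack("!BB", CSIG_OPTION_KIND, len(data_fields))
    return header + data_fields


def parse_csig_option(option: bytes) -> bytes:
    """Return the reflected CSIG data fields carried in the option."""
    kind, length = struct.unpack_from("!BB", option)
    if kind != CSIG_OPTION_KIND:
        raise ValueError("not a CSIG reflection option")
    data = option[2:2 + length]
    if len(data) != length:
        raise ValueError("truncated option")
    return data
```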
4.2.2. Reflection in non-TCP Transports
Several transports such as QUIC [RFC9000] and PonyExpress
[PONYEXPRESS] are built atop UDP. Reflection in UDP can be achieved
by including CSIG data fields in the UDP payload from receiver to
sender. For unidirectional UDP traffic, an out-of-band reverse
connection from the receiver to the sender may be necessary for CSIG
reflection.
As an example, PonyExpress [PONYEXPRESS] is a custom transport
implemented within a userspace host networking stack. It supports a
flexible L4 wire protocol that periodically changes as new features
are added (Section 3.1 of the Snap paper [PONYEXPRESS]). CSIG
reflection can be implemented as additional bytes within this wire
format.
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Flags | CSIG data fields|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 4: PonyExpress CSIG Reflection header
For simplicity and to avoid the need for negotiation, the CSIG
reflection header can be carried on all packets independent of
whether CSIG is enabled on them. The Valid bit in the Flags field
can be set to 1 for packets that carry valid data fields in the
reflection header. In certain deployments, negotiation is
unavoidable for a variety of reasons. Section 6.3.3 provides details
regarding options for negotiation.
4.3. CSIG Operation - Life of a packet
This section describes the end-to-end operation of CSIG by walking
through the life of a packet. It assumes that all nodes in the path
are CSIG-capable and omits the negotiation phase. Details of
negotiation are covered in Section 6.3.3.
                            Forward Path
     --------------------------------------------------------->
     <---------------------------------------------------------
                            Reverse Path

+------+   +-----+   +------+  +------+  +------+  +-----+   +------+
| Host +---+ ToR +---+ Aggr +--+ Core +--+ Aggr +--+ ToR +---+ Host |
+------+   +-----+   +------+  +------+  +------+  +-----+   +------+
C:         800G      100G      100G      100G      40G
ABW:       100G      95G       70G       90G       20G
                                                   ---
ABW/C:     12.5%     95%       70%       90%       50%
           -----
D:         10us      3us       18us      5us       8us
                               ----
Figure 5: Life of a CSIG packet. Underlined values show the
forward path bottlenecks for the corresponding signal types
4.3.1. Forward Path
The sender end-host first constructs a CSIG-tagged packet for a flow
of interest and sends out the packet with the tag data fields
initialized. The transport determines these initial values for the
packet, including Signal Type to request and default values for the
other data fields. Each transit device performs a compare-and-
replace on the CSIG-tag to optionally update the Signal Value and
Locator Metadata fields on the tag. As the packet traverses through
the network, the CSIG-tag data fields accumulate the desired
aggregation of the requested signal.
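The forward-path walk can be sketched with the per-hop values of Figure 5. This is an illustration under assumptions: each column of the figure is treated as one hop, the signal type labels and helper names are hypothetical, the locator is simply the hop index, and the sender is assumed to initialize the value so that the first hop always replaces it.

```python
from dataclasses import dataclass

# Per-hop values copied from Figure 5 (one entry per transit device).
HOPS = [
    {"c_gbps": 800, "abw_gbps": 100, "delay_us": 10},
    {"c_gbps": 100, "abw_gbps": 95, "delay_us": 3},
    {"c_gbps": 100, "abw_gbps": 70, "delay_us": 18},
    {"c_gbps": 100, "abw_gbps": 90, "delay_us": 5},
    {"c_gbps": 40, "abw_gbps": 20, "delay_us": 8},
]


@dataclass
class Tag:
    signal_type: str  # "min_abw", "min_abw_c" or "max_delay" (illustrative)
    value: float      # Signal Value S
    locator: int      # Locator Metadata LM, here the bottleneck hop index


def local_signal(hop, signal_type):
    if signal_type == "min_abw":
        return hop["abw_gbps"]
    if signal_type == "min_abw_c":
        return hop["abw_gbps"] / hop["c_gbps"]
    if signal_type == "max_delay":
        return hop["delay_us"]
    raise ValueError(f"unknown signal type: {signal_type}")


def compare_and_replace(tag, hop, hop_index):
    local = local_signal(hop, tag.signal_type)
    if tag.signal_type == "max_delay":
        is_bottleneck = local > tag.value
    else:
        is_bottleneck = local < tag.value
    if is_bottleneck:
        tag.value = local
        tag.locator = hop_index  # LM changes only when S changes


def traverse(signal_type):
    # Assumed sender defaults: +inf for min signals, -inf for max signals.
    initial = float("-inf") if signal_type == "max_delay" else float("inf")
    tag = Tag(signal_type, initial, locator=-1)
    for index, hop in enumerate(HOPS):
        compare_and_replace(tag, hop, index)
    return tag
```

Under these assumptions the three signal types resolve to different hops: the last ToR for min(ABW), the first ToR for min(ABW/C), and the Core for the maximum delay, matching the underlined values in Figure 5.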
4.3.2. Reverse Path
When the CSIG-tagged packet reaches the receiver end-host, the data
fields in the CSIG tag are extracted and delivered to the transport
layer at the receiver. The transport stores the data fields of the
packet to be reflected, or a summary of these fields across packets.
It reflects these data fields in the layer-4 CSIG reflection header
on packets traversing the reverse path from receiver to sender. The
CSIG reflection header is unmodified as the packet travels from
receiver to sender. The sender extracts the CSIG data fields from
the CSIG reflection header of the incoming packet and hands them to
the transport layer for use in applications at the sender. As a
result, the sender transport learns the desired signal for a flow
within approximately one round-trip time.
4.3.3. Multiple signals
The transport layer has a significant role to play in making CSIG
usable. Although the CSIG data fields are carried on packets, the
measurements are ultimately relevant at the flow / connection level
for specific paths. If the sender transport desires to obtain
multiple signals for the same flow, it MAY choose Signal Type on a
per-packet basis (e.g., in a round robin fashion across the flow's
packets), and internally keep track of all of the requested signals
as part of the flow's state variables. This approach allows the
sender transport to use all supported CSIG signals for use cases such
as congestion control, load balancing and multipathing.
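One way a sender transport might rotate Signal Type across a flow's packets and keep the results in per-flow state is sketched below. The class and method names are hypothetical, and the signal types are represented as small integers purely for illustration.

```python
from itertools import cycle


class FlowCsigState:
    """Per-flow CSIG state kept by a hypothetical sender transport."""

    def __init__(self, signal_types):
        self._next_type = cycle(signal_types)  # round-robin iterator
        # Latest (signal value, locator metadata) learned per Signal Type.
        self.latest = {t: None for t in signal_types}

    def signal_type_for_next_packet(self):
        """Pick the Signal Type to request on the next outgoing packet."""
        return next(self._next_type)

    def on_reflection(self, signal_type, signal_value, locator):
        """Record data fields extracted from a CSIG reflection header."""
        self.latest[signal_type] = (signal_value, locator)
```

Round robin is only one policy; a transport could weight the rotation toward the signal that its control loop consumes most often.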
4.4. Device Roles
CSIG has three participating entities, each with their own roles and
responsibilities for achieving end-to-end congestion signaling.
4.4.1. Sender host
The sender host is responsible for
(i) Constructing CSIG-tagged packets for flows of interest and
initializing the CSIG-tag data fields on each packet as specified by
the transport, and
(ii) Parsing the CSIG reflection header received in incoming packets
and extracting CSIG data fields for use in the sender transport /
applications.
Only the sender is allowed to insert CSIG-tags into packets.
4.4.2. Transit device
Transit devices are responsible for
(i) Computing and tracking congestion signals such as ABW and ABW/C
of each port and hop delay per packet,
(ii) Parsing the CSIG-tag based on the TPID code point on incoming
packets to identify the Signal Type being requested, and
(iii) Performing compare-and-replace on the Signal Value and Locator
Metadata fields on the CSIG-tag based on the aggregation
corresponding to the requested Signal Type (min / max).
Transit devices MUST NOT add CSIG tags to incoming packets that are
not already CSIG-tagged. Transit devices MAY delete the CSIG tag
before forwarding the packet. This functionality can be exercised
when downstream devices are not CSIG-capable. Further discussion on
this topic is in Section 6 on Incremental Deployment of CSIG.
4.4.3. Receiver host
The receiver host is responsible for
(i) Extracting the CSIG-tag on incoming packets and exposing the data
fields to the transport layer and/or receiver-driven applications, and
(ii) Inserting and populating the CSIG Reflection header at the
transport layer for packets traversing the reverse path to the
sender.
4.4.4. Host roles for bidirectional flows
Note that for bi-directional flows, the Sender and Receiver are
specific to each direction within the flow. For a bi-directional
flow between hosts A and B,
(i) A plays the Sender host role and B plays the Receiver host role
for data packets traveling from A to B, and similarly
(ii) B plays the Sender host role and A plays the Receiver host role
for data packets traveling from B to A.
In this scenario, packets traversing from A to B contain both a CSIG-
tag that captures the congestion signals on the forward A-->B path,
and a CSIG reflection header that captures the CSIG data fields of
the reverse B-->A path. Equivalently, packets traversing from B to A
contain both a CSIG-tag that captures the congestion signals on the
forward B-->A path, and a CSIG reflection header that captures the
CSIG data fields of the reverse A-->B path.
5. Signals in CSIG
As described in the previous section, Signal Type indicates the type
of congestion signal that CSIG-tag carries on each packet. Up to 8
signal types are supported by the compact format and up to 16 signal
types are supported by the expanded format.
In this section, we concretely define three signals driven by use
cases described in Section 8. While Section 8 covers how these three
signals are useful to applications, this section focuses on precise
definitions of these signals and how they may be implemented on
transit devices.
Note for future extensions: Signals in CSIG are intended to be
aggregation functions of individual per-hop or per-port signals
across the path of a packet. The typical definition of such signals
with max / min aggregations captures the notion of a path bottleneck
for different definitions of bottleneck. However, structurally, the
format supports arbitrary read-modify-write operations, including
aggregations such as max, min, count and sum, allowing future use
cases to leverage this structure for new signals.
5.1. Minimum Available Bandwidth - min(ABW)
min(ABW) captures the minimum absolute available bandwidth (in bps)
across all the ports in the packet path. Available bandwidth is
defined per egress port on each device.
5.1.1. ABW Computation
ABW can be computed using one of many algorithm variants, each having
implications on HW or SW implementation complexity, timescales of
computation and accuracy of the signal. In its rudimentary form, the
raw ABW for a given egress port p over a time interval delta_t can be
computed as follows:
// delta_txbit is the number of bits that exited on the wire
utilization_bps[p] = (delta_txbit[p]) / delta_t;
// capacity_bps[p] captures the link speed of port p
abw_bps[p] = capacity_bps[p] - utilization_bps[p];
Implementation of these computations relies on at least one of the
following capabilities in the devices:
* Timer-based computations: Most networking ASICs maintain hardware
counters that track the number of bits that exit on each egress
port. To compute available bandwidth, a periodic-timer thread in
SW or HW triggers the computation and update of available
bandwidth every delta_t time interval, where delta_t is a
configurable parameter.
* Per-packet computations: In this alternative, available bandwidth
is computed and updated on every packet that is processed via the
egress pipeline, typically in HW, e.g., via Exponentially Weighted
Moving Average (EWMA) estimation where the weights are
configurable. delta_t is not an explicit parameter in this
approach; it is implicitly determined by the EWMA weights.
Variants such as Discounted Rate Estimator (DRE) [CONGA] use a
combination of per-packet updates and timer-based approaches.
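The timer-based and per-packet approaches can be contrasted with the following sketch. It is not a description of any particular ASIC: the class names are hypothetical, the per-packet rate sample (packet bits divided by the gap since the previous packet) is one simple choice among many, and DRE is not modeled.

```python
class TimerAbwEstimator:
    """Raw ABW recomputed once per delta_t from a transmit bit counter."""

    def __init__(self, capacity_bps, delta_t_s):
        self.capacity_bps = capacity_bps
        self.delta_t_s = delta_t_s
        self._last_tx_bits = 0
        self.abw_bps = float(capacity_bps)  # idle port: all capacity free

    def on_timer(self, tx_bits_counter):
        """Called every delta_t with the cumulative transmitted bit count."""
        delta_txbit = tx_bits_counter - self._last_tx_bits
        self._last_tx_bits = tx_bits_counter
        utilization_bps = delta_txbit / self.delta_t_s
        self.abw_bps = max(0.0, self.capacity_bps - utilization_bps)
        return self.abw_bps


class EwmaAbwEstimator:
    """Per-packet estimate: an EWMA of instantaneous transmit rate."""

    def __init__(self, capacity_bps, weight):
        self.capacity_bps = capacity_bps
        self.weight = weight  # weight of the newest sample, 0 < weight <= 1
        self.rate_bps = 0.0
        self._last_tx_time_s = None

    def on_packet(self, pkt_bits, now_s):
        """Update the rate estimate and return the resulting ABW."""
        if self._last_tx_time_s is not None and now_s > self._last_tx_time_s:
            sample = pkt_bits / (now_s - self._last_tx_time_s)
            self.rate_bps += self.weight * (sample - self.rate_bps)
        self._last_tx_time_s = now_s
        return max(0.0, self.capacity_bps - self.rate_bps)
```

In the first class the timescale is the explicit delta_t; in the second it is set implicitly by the weight, which is the trade-off the two bullets above describe.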
5.2. Maximum link utilization - max(U/C) or min(ABW/C)
ABW/C captures the fraction or percentage of available bandwidth on a
given link relative to the link's capacity. min(ABW/C) captures the
link utilization bottleneck along the path of the packet. This
signal is most relevant in paths with heterogeneous link speeds,
where it distinguishes itself from min(ABW). min(ABW/C) is equivalent
to max(U/C), where
U = utilization of a given egress port in bps
C = capacity of a given egress port in bps
ABW = available bandwidth of a given egress port in bps
Therefore, since U = C - ABW, max(U/C) = max(1 - ABW/C) = 1 - min(ABW/C)
5.2.1. ABW/C Computation
ABW/C can be computed from ABW as follows:
// Represents fraction of available bandwidth on port p
// relative to the port's capacity.
abwc_frac[p] = abw_bps[p] / capacity_bps[p];
Algorithms for ABW computation described in Section 5.1.1 also apply
to ABW/C computation, except that the resulting value is normalized
by the port capacity. Quantization / bucketing is performed after
normalization.
5.2.2. min(ABW) vs min(ABW/C) bottlenecks
On paths with heterogeneous link speeds, the min(ABW) and min(ABW/C)
bottlenecks are not necessarily the same port. Figure 5 shows an
example where these two bottlenecks differ. Each type of bottleneck
is useful in its own right, as demonstrated in Section 8.
5.3. Shared requirements for min(ABW) and min(ABW/C)
5.3.1. Algorithm Requirements
To support min(ABW) or min(ABW/C) in CSIG, the device SHOULD support
raw ABW computation with a configurable delta_t, and MAY support
additional algorithms such as EWMA or DRE. This requirement enables
the consistent interpretation of timescale over which available
bandwidth is computed. This consistent interpretation allows end-
hosts to tune their control decisions based on this timescale, e.g.,
in relation to the flow's RTT.
5.3.2. Timescale and Accuracy Requirements
CSIG does not set strict requirements on the delta_t values to be
supported by the implementation, except that it SHOULD be
configurable to cover the range of RTTs in the network, e.g., {10us,
100us, 1ms, 10ms, 100ms, 1s, etc.}. Although one would expect all
devices on a packet path to compute ABW at similar timescales to
provide a consistent path-wide view, CSIG does NOT set strict
requirements on the consistency of delta_t parameters chosen across
the devices of a packet path. Choices of signal accuracy and
timescales are a function of the use case and are not enforced by
CSIG. End hosts MAY use EWMA across packets of a flow to calculate
ABW or ABW/C over a longer timescale when CSIG on each packet carries
ABW or ABW/C over shorter timescales. This technique is useful when
flows traversing a given egress port span a wide range of RTTs while
ABW computation over the egress port is fixed to a chosen timescale
at each transit device.
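The end-host smoothing described above can be sketched as a small per-flow filter. The class name and the choice to seed the average with the first sample are assumptions made for illustration.

```python
class FlowSignalSmoother:
    """EWMA over the ABW or ABW/C values reflected for one flow."""

    def __init__(self, weight):
        if not 0.0 < weight <= 1.0:
            raise ValueError("weight must be in (0, 1]")
        self.weight = weight  # smaller weight means a longer timescale
        self.smoothed = None

    def update(self, sample):
        """Fold one per-packet signal value into the running average."""
        if self.smoothed is None:
            self.smoothed = float(sample)  # seed with the first sample
        else:
            self.smoothed += self.weight * (sample - self.smoothed)
        return self.smoothed
```

A flow with a long RTT could pick a smaller weight than a short-RTT flow sharing the same egress port, so each sees the signal at a timescale that suits its own control loop.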
5.3.3. Bucketing / Quantization Requirements
The computed ABW or ABW/C values MUST be compressed to fit in the
available Signal value bits on the CSIG-tag. The device MUST support
32 fully configurable ABW buckets and ABW/C buckets for compact CSIG,
and configurable quanta for uniform quantization in expanded CSIG.
All devices along the packet path MUST be configured with the same
buckets / quanta per signal type in order to correctly compute
min(ABW) or min(ABW/C) along the path. Appendix A provides examples
of these configurations.
Each transit device performs a compare-and-replace, i.e., updates the
signal value on the CSIG tag if the incoming ABW or ABW/C signal
value on the packet is higher than the device's locally computed ABW
or ABW/C value for the packet's egress port, post bucketization /
quantization. E.g.,
// Update the signal value on packet if current hop is the bottleneck
pkt->csig_tag->abw = min(pkt->csig_tag->abw, egr_port->abw)
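Bucketing followed by compare-and-replace for compact CSIG can be sketched as below. The 32 bucket edges are invented for illustration (uniform 25 Gbps steps); a deployment would configure its own edges, identically on every device. The function names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical lower edges, in Gbps, of 32 configurable ABW buckets.
# The last bucket is open-ended and holds every value from 775 Gbps up.
BUCKET_LOWER_EDGES_GBPS = [i * 25 for i in range(32)]


def to_bucket(abw_gbps):
    """Index (0..31) of the bucket that contains abw_gbps."""
    return max(0, bisect_right(BUCKET_LOWER_EDGES_GBPS, abw_gbps) - 1)


def update_min_abw(tag_bucket, egress_abw_gbps):
    """Compare-and-replace on bucket indices; a lower index is less ABW."""
    return min(tag_bucket, to_bucket(egress_abw_gbps))
```

Because the comparison is performed on bucket indices, it is only meaningful if the bucket edges increase monotonically and are the same on all devices along the path.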
5.3.4. QoS requirements
min(ABW) and min(ABW/C) are unambiguous signals with low
implementation complexity on network devices. For simplicity, these
definitions intentionally do NOT distinguish across QoS classes that
may share the egress port. Available bandwidth per QoS class on an
egress port is complex to define and meaningfully interpret since it
depends on the scheduling policy (Strict Priority / WRR / Deficit
WRR), buffer carving configuration and other policies (e.g., AQM)
associated with QoS. Section 8 describes the applications of
min(ABW) and min(ABW/C) as defined. We leave QoS-based variations of
these signals and their potential use cases as future work.
5.4. Maximum Per-hop Delay - max(PD)
max(PD) captures the maximum per-hop delay experienced by a packet
among all the hops in the packet path. Per-hop delay PD is the time
spent by the packet in the device pipeline. It MAY include link
layer delays or it MAY only include the delays observed in the
forwarding pipeline.
5.4.1. Per-hop Delay Computation
Unlike ABW and ABW/C which are per-port signals, PD is a per-packet
signal. It consists of PHY, MAC and switch pipeline delay
experienced by the packet. Pipeline delay is the most relevant
component as it captures congestion related queueing delay. Device
implementations MAY track ingress and egress timestamps explicitly
for each packet and perform a diff in the final stages of the
pipeline. Precise definitions of these stages depend on the
architecture of the device. For example, some devices could leverage
existing timestamping support from tail timestamping capabilities for
this purpose.
5.4.2. Requirements
5.4.2.1. Algorithm Requirements
To support max(PD) in CSIG, the device SHOULD support per-packet
tracking of delay experienced through the device.
5.4.2.2. Accuracy Requirements
It is desirable to have minimal gaps in the components of packet
delays captured by the device. However, CSIG does NOT set strict
requirements on the accuracy of PD to be supported by the
implementation.
5.4.2.3. Bucketing / Quantization Requirements
The computed delay values MUST be compressed to fit in the available
Signal value bits on the CSIG-tag. The device MUST support 32 fully
configurable delay buckets for compact CSIG, and configurable quanta
for uniform quantization in expanded CSIG. All devices along the
packet path MUST be configured with the same buckets / quanta to
correctly compute max(PD) along the path.
Each transit device performs a compare-and-replace, i.e., updates the
signal value on the CSIG tag if the incoming delay signal value on
the packet is lower than the device's locally computed delay for the
packet, post bucketization / quantization. E.g.,
// Update the signal value on packet if current hop is the bottleneck
pkt->csig_tag->pd = max(pkt->csig_tag->pd, device->pkt->pd)
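For expanded CSIG, uniform quantization of the per-hop delay followed by the max update might look as follows. The quantum, the nanosecond timestamps, and the function names are assumptions; the width of the Signal Value field is left as a parameter rather than fixed here.

```python
def quantize_delay(delay_ns, quantum_ns, value_bits):
    """Uniform quantization, saturating at the largest encodable code."""
    max_code = (1 << value_bits) - 1
    return min(max_code, delay_ns // quantum_ns)


def update_max_pd(tag_value, ingress_ts_ns, egress_ts_ns,
                  quantum_ns, value_bits):
    """Compare-and-replace for max(PD) using a timestamp difference."""
    hop_delay_ns = egress_ts_ns - ingress_ts_ns
    local = quantize_delay(hop_delay_ns, quantum_ns, value_bits)
    return max(tag_value, local)
```

Saturating at the maximum code keeps a very large delay from wrapping around and being mistaken for a small one.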
5.4.2.4. QoS requirements
Delay experienced by the packet on a device, as defined, is
implicitly a QoS-specific signal. This is because the packet is
subject to QoS policies as it traverses through the device pipeline,
including prioritization, scheduling and buffering. For example, a
high priority packet may see smaller delays than low priority
packets. Therefore, the delay measured for the packet SHOULD include
components in the pipeline where QoS policies are applied.
5.5. Locator Metadata Implementation
Locator metadata (LM) captures information about the bottleneck
device or port, as described in Section 4.1.3.3. In this section, we
discuss requirements for supporting LM in CSIG, and provide
recommendations for commonly useful attributes to carry in LM.
5.5.1. Requirements
A single deployment MAY choose a subset of the attributes in
Section 5.5.2 and/or newly defined attributes beyond those listed in
Section 5.5.2 to include in LM. However, the total size of the
individual attributes MUST be within 7 bits for Compact CSIG and
within 16 bits for Expanded CSIG.
CSIG does not set strict requirements on the LM internal format i.e.,
how the individual attributes are organized among the available LM
bits. However, this LM internal format MUST be consistent across
devices in the deployment domain so that the end hosts can
consistently interpret these bits. The LM internal format MAY be
specific to each signal type.
Devices SHOULD support configuring per-port values for LM to be
written on the CSIG-tag. Devices MAY provide more granular
configurability of LM based on Signal type as well. CSIG packets
egressing on a given port that have their Signal Value updated by the
device MUST be updated with the LM corresponding to the port and
Signal Type.
5.5.2. Attributes
Attributes can be designed to capture the level of resolution desired
by use cases for pinpointing the bottleneck. Attributes may be
encoded to fit within the limited number of LM bits available in
CSIG.
We separate the list of attributes into compact attributes and
expanded attributes. Compact attributes are motivated by the limited
number of LM bits available in Compact CSIG, and therefore capture
only the essential information about the bottleneck that is necessary
for the use cases i.e., to inform control decisions or telemetry.
Expanded attributes provide higher resolution information about the
bottleneck, and can aid in directly pinpointing bottleneck devices or
ports. Expanded attributes typically require more bits and are hence
more suited for Expanded CSIG.
Examples of attributes are listed below.
5.5.2.1. Compact Attributes
* Link capacity: Encodes the capacity of the bottleneck link. In
typical deployments, the set of deployed link speeds is small and
can be encoded using <= 5 bits.
* Stage of the bottleneck: Encodes the stage of the topology where
the bottleneck device / port is located. For example, in a
5-stage Clos topology, the stage of the device can be encoded with
3 bits.
* Link orientation: Encodes the direction of a port in the context
of the network topology. For example, with three categories -
uplinks, downlinks and side-links - link orientation can be
encoded using 2 bits.
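The bit budgets above add up to more than the 7 LM bits of Compact CSIG (up to 5 + 3 + 2 = 10), so a deployment has to choose or shrink attributes. The sketch below packs one hypothetical internal format: a 3-bit stage, a 2-bit link orientation, and a 2-bit link speed class. The layout, field order, and function names are illustrative assumptions, not a format defined by CSIG.

```python
STAGE_BITS = 3        # stage of the bottleneck in the topology
ORIENTATION_BITS = 2  # e.g., uplink / downlink / side-link
SPEED_BITS = 2        # assumed: only four link speed classes deployed


def pack_lm(stage, orientation, speed_class):
    """Pack three attributes into a 7-bit locator metadata value."""
    fields = ((stage, STAGE_BITS), (orientation, ORIENTATION_BITS),
              (speed_class, SPEED_BITS))
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError("attribute does not fit in its field")
    return ((stage << (ORIENTATION_BITS + SPEED_BITS))
            | (orientation << SPEED_BITS)
            | speed_class)


def unpack_lm(lm):
    """Recover (stage, orientation, speed_class) at the end host."""
    stage = (lm >> (ORIENTATION_BITS + SPEED_BITS)) & ((1 << STAGE_BITS) - 1)
    orientation = (lm >> SPEED_BITS) & ((1 << ORIENTATION_BITS) - 1)
    speed_class = lm & ((1 << SPEED_BITS) - 1)
    return stage, orientation, speed_class
```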
5.5.2.2. Expanded Attributes
* Port ID: Encodes a unique identifier for each port within a
deployment domain.
* Device ID: Encodes a unique identifier for each device within a
deployment domain.
* TTL (Time-to-live): Captures the TTL value of the packet at the
bottleneck device, represented using 8 bits. End hosts can use
this attribute to infer the hop number at which the packet was
bottlenecked.
LM attributes and encoding schemes are ultimately deployment specific
and use-case specific. CSIG supports a flexible specification of LM
to accommodate a variety of requirements and future applications.
6. Incremental Deployment of CSIG
Most production networks are heterogeneous, with a mix of network
devices across generations. This document addresses the brownfield
deployment of CSIG in a heterogeneous network, where there may be a
mix of devices that offer varying degrees of support for CSIG packet
construction and processing.
6.1. CSIG Stripping: A per egress-port primitive
Before describing incremental deployment, we introduce the idea of
CSIG stripping, an action primitive which is foundational to
deploying CSIG in a heterogeneous environment.
Devices that support CSIG MUST be capable of removing the CSIG tag
before forwarding the packet. Devices MUST allow configuring CSIG-
stripping on a per egress-port basis. If a port is configured to
strip CSIG, then all CSIG-tagged packets that egress on this port
must have the tag removed before being forwarded.
In the following sections, we describe how this capability can enable
incremental deployment.
6.2. Levels of CSIG Support
We first classify devices into three simplified categories based on
their level of CSIG support. In the subsequent sections we describe
how CSIG can interoperate with each category of device. Note that
the level of support is a function of the tag placement and whether
the compact or expanded CSIG tag format is used as shown in
Section 4.1.
6.2.1. Discard
Devices in this category are not capable of recognizing or parsing
CSIG tagged packets. If such packets are received, they will simply
be dropped.
6.2.2. Pass-through
Devices in this category are able to recognize and parse CSIG tagged
packets, and transparently forward the packet with the tag intact or
with the tag stripped to neighboring devices (in the case of transit
devices) or to the end host transport layer (in the case of end
hosts). However, they do not support updating the CSIG data fields
on the tag.
Some devices that do not natively support CSIG may be configured to
support pass-through mode for CSIG if they support VLAN tags with
configurable TPIDs. This is discussed in more detail in Section 6.4.
6.2.3. Complete
Devices in this category support the complete CSIG protocol,
including recognition, parsing, forwarding, tag-stripping, signal
computation, and signal updates on the tag. However, only a subset
of signal types may be supported.
6.2.3.1. Software-assisted support
It is noteworthy that in some devices that do not natively support
CSIG, resources available for VLAN tag processing can be repurposed
to support CSIG for certain signal types using a combination of
software and hardware capabilities. We refer to this level of
support as software-assisted support. This capability is discussed
in more detail in Section 6.4.
6.2.3.2. Native support
Devices that natively support CSIG are explicitly equipped with the
hardware capabilities required to implement the CSIG protocol.
A CSIG domain is a deployment domain where all network devices have
complete support or pass-through support for CSIG.
6.3. Interoperability in Brownfield Deployments
In this section, we first define the requirements for CSIG
Interoperability in brownfield deployments. Then, we consider
devices with all levels of support described in Section 6.2 and
describe how these devices MAY be configured to achieve
interoperability. Note that the following descriptions apply
separately to both Compact and Expanded CSIG-tags.
+==============+=======================================+
| Device | Interop support |
| category | |
+==============+=======================================+
| Discard | Upstream devices must strip CSIG tags |
| | before packets reach this device |
+--------------+---------------------------------------+
| Pass-through | Device may strip tag or transparently |
| support only | forward with tag unmodified depending |
| | on e2e signal accuracy requirements |
+--------------+---------------------------------------+
| Native CSIG | Device updates CSIG-tag as per |
| support | protocol |
+--------------+---------------------------------------+
| SW-assisted | Device updates CSIG-tag using VLAN |
| CSIG support | match/action with approximate signals |
| | computed in S/W agent |
+--------------+---------------------------------------+
Table 1: Interoperability with devices having
different levels of CSIG support
6.3.1. Requirements for interoperability
Forwarding: The fundamental requirement is that no CSIG-tagged packet
should be dropped in the network due to a lack of CSIG support on a
device. This requirement means that packets with CSIG-tags either
MUST NOT reach devices in the Discard category or MUST have their
CSIG-tag stripped before reaching such devices.
Negotiation: End hosts / flows SHOULD ensure that the path (including
end hosts and transit devices) is CSIG-capable before enabling CSIG-
tagging on packets. Devices in the Discard category should not
require any changes in order to achieve negotiation. This
requirement is to ensure correctness of data fields in end-to-end
CSIG operation, and to interoperate with legacy devices or software
stacks.
6.3.2. Forwarding
To achieve forwarding interoperability requirements for CSIG, CSIG
stripping may be exercised as follows:
* When a neighboring device connected to a given egress port is a
Discard device and cannot parse CSIG packets, this egress port
MUST be configured to strip the tag on outgoing packets to ensure
that the packet does not get dropped downstream.
* When a device supports Pass-through only or does not support the
requested signal type on a CSIG packet, egress ports on this
device MAY be configured to strip the tag on outgoing packets to
ensure that CSIG does not carry inaccurate information. In some
use cases where it is acceptable for CSIG to miss capturing
signals on certain hops, pass-through devices MAY transparently
forward the packet with the CSIG tag intact.
* At the boundary of a CSIG domain, device ports that are connected
to devices outside of the CSIG domain MUST strip the tag to ensure
that packets exiting the domain do not contain CSIG-tags. Only
egress ports connected to devices within the CSIG domain SHOULD
retain CSIG-tags on outgoing packets.
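The three stripping rules above can be condensed into a per-egress-port decision, sketched here. The Support enum, the function name, and the strict_accuracy knob (which stands in for the deployment's choice in the second rule) are hypothetical.

```python
from enum import Enum


class Support(Enum):
    """Level of CSIG support of the neighbor on an egress port."""
    DISCARD = "discard"
    PASS_THROUGH = "pass-through"
    COMPLETE = "complete"


def strip_on_egress(neighbor, neighbor_in_domain,
                    local_can_update, strict_accuracy):
    """Decide whether an egress port should be configured to strip."""
    if not neighbor_in_domain:
        return True  # CSIG domain boundary: tag MUST be stripped
    if neighbor is Support.DISCARD:
        return True  # neighbor would drop tagged packets: MUST strip
    if not local_can_update and strict_accuracy:
        return True  # optional: avoid a signal that skipped this hop
    return False     # retain the tag
```

The first two checks are mandatory behavior; only the third reflects a deployment choice between signal accuracy and signal availability.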
CSIG packets and non-CSIG packets coexist in a brownfield setting.
End hosts therefore MUST be capable of transmitting and receiving
both CSIG packets and non-CSIG packets, including for the same flow.
A packet marked with a CSIG-tag at the sender host may arrive at the
receiver host without the tag.
In addition, Compact CSIG and Expanded CSIG packets may be used
together on the same network.
6.3.3. Negotiation
Support for sending and receiving CSIG-tagged packets may require
software and/or hardware changes on transit devices and end hosts.
In many deployments, particularly those requiring hardware upgrades
to support CSIG (such as Switch or NIC support), version stragglers
continue to exist for long time horizons for a variety of reasons,
and interoperability with such stragglers is a critical requirement.
Without negotiation for CSIG capability, devices that are not CSIG-
compliant may drop CSIG packets and thus blackhole traffic.
Negotiating for CSIG-capability of a path is critical to ensure that
CSIG protocol operates safely end-to-end in a brownfield deployment.
A path is considered CSIG-capable if end-hosts have at least Pass-
through CSIG support and transit devices have Complete CSIG support
(native or software-assisted). Before sending CSIG-tagged packets on
a network flow, end-hosts must negotiate for path CSIG-capability.
We discuss one approach to negotiation for path CSIG-capability,
which involves two parts: negotiation for transit device support and
negotiation for end host support.
6.3.3.1. Negotiation for transit device support
In this section, we describe one simple approach to negotiate CSIG
support on transit devices with CSIG stripping.
CSIG stripping can be used to implicitly achieve negotiation by
removing the CSIG-tag from the packet header at or before devices on
the packet path that do not have the desired level of CSIG support.
If the receiver end host receives a CSIG-tagged packet, it serves as
an explicit indication that all devices on the packet path, including
transit devices and end-hosts, have the desired CSIG support. If the
receiver end host receives a packet without a CSIG-tag, it is an
indication that one or more devices do not have the desired CSIG
support, or that the packet was not tagged at the sender to begin
with. This indication can be implicitly reported to the sender via
an empty / invalid CSIG reflection header and the sender can
determine whether the packet path was CSIG-capable.
This approach assumes that each device has knowledge about the level
of CSIG support in its immediate neighboring devices, which is viable
through configuration in typical private SDN networks. In the
absence of centralization, mechanisms such as a new LLDP TLV may be
defined to advertise aspects of CSIG support on the device, including
compact vs expanded CSIG-tag support, signal types that are
supported, pass-through vs complete support, etc. We leave the
details of such an LLDP extension to future extensions of the
protocol.
6.3.3.2. Negotiation for end host support
A sender end host may need to explicitly negotiate with the remote
end-host to ensure that the host networking stack at the remote host
has the desired level of CSIG support. Ideally such explicit CSIG
negotiation should be performed during or before the initial
connection handshake, after which CSIG is enabled / disabled on
packets post connection establishment. It may also be necessary to
explicitly negotiate the use of CSIG Reflection in transports,
separately from the negotiation for path CSIG-capability. For
example, in TCP, negotiation is required to use the CSIG Reflection
TCP Option. We leave the details of such negotiation schemes for
future extensions of the protocol.
6.4. Backward Compatibility via Software-assisted CSIG
Transit devices without native CSIG support MAY participate in the
CSIG protocol via a software-assisted approach. This allows brownfield
deployments to reap incremental benefits of CSIG without having to
upgrade a significant fraction of the device hardware on their
networks.
Since compact and expanded CSIG tags are structurally similar to
single VLAN-tags and double VLAN-tags respectively, VLAN resources in
a transit device can be repurposed to support CSIG updates. More
specifically, configurable TPIDs for VLAN tags can be used to treat
CSIG tags as VLAN tags, and VLAN match/action resources for tag
updates in the device can be leveraged to support updating CSIG data
fields on the tag.
For signals such as ABW and ABW/C, a software agent running on the
CPU of a transit device can periodically compute these signals based
on hardware byte counters, and program VLAN match/action rules in the
dataplane to update CSIG data fields based on the computed signals.
Since the match/action rules are in the dataplane, CSIG packets can
be processed at line rate without CPU involvement. However, the
match/action rules themselves are refreshed only at the slower cadence
of the software agent, so the values written into packets reflect the
agent's most recent computation.
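The division of labor between the software agent and the dataplane can be sketched in Python as follows. This is a minimal illustration, not a device API: `PortCounters`, `available_bandwidth_bps`, and `min_signal_rules` are hypothetical names, ABW is assumed to be link capacity minus the utilized bandwidth over the polling interval, and the rule format is invented for the example. A smaller encoded value is assumed to represent a smaller ABW.

```python
from dataclasses import dataclass


@dataclass
class PortCounters:
    tx_bytes: int       # cumulative hardware byte counter
    timestamp_s: float  # time at which the counter was read


def available_bandwidth_bps(prev: PortCounters, curr: PortCounters,
                            capacity_bps: float) -> float:
    """Estimate ABW over the last polling interval (capacity minus usage)."""
    interval = curr.timestamp_s - prev.timestamp_s
    if interval <= 0:
        raise ValueError("counter samples must be strictly ordered in time")
    used_bps = (curr.tx_bytes - prev.tx_bytes) * 8 / interval
    return max(0.0, capacity_bps - used_bps)


def min_signal_rules(local_value: int, value_bits: int = 5) -> list:
    """Build one exact-match rewrite rule per incoming value that is
    larger than the locally computed (already encoded) value, so that
    the tag ends up carrying the minimum along the path."""
    return [{"match_value": v, "set_value": local_value}
            for v in range(local_value + 1, 1 << value_bits)]


prev = PortCounters(tx_bytes=0, timestamp_s=0.0)
curr = PortCounters(tx_bytes=1_250_000_000, timestamp_s=1.0)
abw = available_bandwidth_bps(prev, curr, capacity_bps=100e9)
rules = min_signal_rules(local_value=29)
```

The agent would rerun this computation at each polling interval and reprogram the rules, while packets are rewritten by the installed rules without involving the CPU.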
Compact CSIG is designed to enable software-assisted backward
compatibility while operating within the constraints of commonly
available VLAN resources on transit devices. Backward compatibility
via software is a fundamental feature in the design of Compact CSIG.
Note that it may not be possible to track signal types such as hop
delay per packet in a software agent. However, approximations of the
signal based on available hardware counters and registers (such as
latency histograms) can be implemented in the agent if software-
assisted support is desired for such signal types.
6.5. Greenfield deployments
In greenfield deployments of CSIG domains, all devices in the domain
natively support the CSIG protocol.
Expanded CSIG is designed to leverage greenfield deployments where
backward compatibility, negotiation and interoperability are not
requirements. It provides enhanced signal resolution via higher bit
width for signal values and locator metadata in comparison to Compact
CSIG. Expanded CSIG can also support up to 16 signal types.
Devices in greenfield CSIG domains MUST support CSIG stripping at the
domain boundary to ensure that CSIG packets do not exit the domain.
7. Design Rationale
CSIG's design choices are shaped by an end-to-end perspective of what
matters to applications and where tradeoffs can be made towards
simplicity and practicality. In this section, we discuss the
rationale behind CSIG's design and the advantages it provides over
the existing state of the art.
7.1. Choice of Layer 2
CSIG-tag offsets at layer 2 are independent of headers and payload at
layer 3 and above, which means that only a small set of tag placement
offsets need to be supported for reading and updating the header.
This makes device implementations of CSIG simpler. In contrast, in-
band network telemetry schemes implemented at layer 3 or higher
require support for a large set of packet formats as this set grows
by the cross-product of formats / encapsulations at each layer. This
complexity forces device implementations to restrict support to only
a fraction of packet formats / encapsulations, hindering the adoption
and deployment of such schemes. CSIG-tagging, on the other hand, is
simpler to support and deploy since it is at layer 2 and has a fixed
offset regardless of the formats / encapsulations at layer 3 and above.
The choice of layer 2 also makes compatibility with in-network
tunneling and encryption simpler, which are common features in data
center deployments.
* CSIG-tags are, by design, compatible with PSP encrypted packets
and IPsec encrypted packets, where Layer 4 headers and payloads
may be encrypted.
* CSIG tags are carried through Layer 3 tunnels, e.g., IP-in-IP,
VXLAN, Geneve, at a fixed offset in the packet header. This
avoids the need to copy and relocate CSIG tags across inner /
outer headers during encapsulation and decapsulation of packets,
which would be necessary if implemented instead at layers 3 or
higher.
* CSIG tags are placed as the last header in the Layer 2 header
stack to ensure compatibility with layer 2 and layer 2.5 tunneled
domains as well. The placement of CSIG tags in MACsec and other
Layer 2 encapsulations is shown in the table in Section 4.1.
Most in-band network telemetry schemes are not backward compatible.
However, the CSIG tag's structural similarity to VLAN tags enables
backward compatibility with many devices that don't have native CSIG
support, as described in Section 6.4. This allows deployments to reap
the benefits of CSIG without having to upgrade a significant portion
of their network hardware.
In addition, since expanded CSIG is limited to 8B, i.e., the size of
double VLAN tags, the packet parsing depth required on devices to
read and process headers at layer 3 and above is not affected.
In summary, the choice of Layer 2 for CSIG-tag is a key part of
CSIG's simplicity and efficiency, since it keeps device
implementations simple while supporting multiple encapsulations and
backward compatibility.
7.2. Separation of headers for CSIG-tag and reflection
CSIG's design separates the CSIG-tag and CSIG reflection headers into
distinct layers. This decoupling enables end hosts to develop
different transport-specific implementations of CSIG reflection while
sharing the underlying CSIG-tag mechanism. This means that transit
device behaviors are not impacted by innovations in CSIG reflection.
In addition, this decoupling enables the separate tracking of forward
and reverse path bottlenecks. This is important since CCAs typically
prefer to react to congestion on the forward path only and not react
to congestion on the reverse path. In contrast, in-band schemes that
mix signaling and reflection into the same header do not provide
distinctions between forward and reverse path.
7.3. Fixed-size headers
CSIG's fixed-size headers constitute less than 0.2% bandwidth
overhead in packets with 4k or 9k MTU. This means that there is no
need for fragmentation or for increasing the MTU size in order to
support multiple congestion signals. Furthermore, the
packets-per-second (PPS) performance of network devices is minimally
impacted by the inclusion of the CSIG tag and reflection headers.
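The overhead figure can be checked with simple arithmetic. The Python sketch below counts the CSIG-tag only, using the single and double VLAN tag sizes (4 and 8 bytes) that the compact and expanded tags are modeled on. The MTU values of 4096 and 9000 bytes are assumptions standing in for "4k" and "9k", and the reflection header is left out because its size depends on the transport.

```python
COMPACT_TAG_BYTES = 4   # same size as a single VLAN tag
EXPANDED_TAG_BYTES = 8  # same size as a double VLAN tag


def tag_overhead_percent(tag_bytes: int, mtu_bytes: int) -> float:
    """Tag size as a percentage of a full-size packet."""
    return 100.0 * tag_bytes / mtu_bytes


for mtu in (4096, 9000):
    for name, size in (("compact", COMPACT_TAG_BYTES),
                       ("expanded", EXPANDED_TAG_BYTES)):
        pct = tag_overhead_percent(size, mtu)
        print(f"{name:8s} tag, {mtu}-byte MTU: {pct:.3f}%")
```

Under these assumptions the largest case, an expanded tag on a 4096-byte packet, comes to 8 / 4096, or roughly 0.195%.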
The low overhead allows CSIG to be enabled on all live data packets
or explicit probe packets or sampled packets. This is an important
capability because it allows for the direct quantification of the
bottlenecks experienced by the data packets themselves instead of
having to rely on probes. However, leveraging CSIG on probes or
sampled packets is still an option for deployments that require such
visibility.
CSIG is designed to perform compare-and-replace (or, more generally,
read-modify-write for future extensions) with a fixed-size header.
Therefore, CSIG is not limited by the number of hops in a network
path (i.e., diameter of the network) unlike schemes that append
information at each hop.
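A minimal Python sketch of compare-and-replace for a min-type signal such as min(ABW) shows why the header size does not depend on path length. The `Tag` structure, its `locator` field, and the function names are hypothetical and chosen only for illustration.

```python
from typing import Iterable, NamedTuple


class Tag(NamedTuple):
    value: float  # signal value carried in the tag
    locator: int  # hypothetical identifier of the hop that set it


def compare_and_replace_min(tag: Tag, hop_value: float, hop_id: int) -> Tag:
    """Keep the smaller value, as for a min-type signal."""
    return Tag(hop_value, hop_id) if hop_value < tag.value else tag


def traverse(initial: Tag, hop_values: Iterable[float]) -> Tag:
    """Apply compare-and-replace at every hop along the path."""
    tag = initial
    for hop_id, hop_value in enumerate(hop_values):
        tag = compare_and_replace_min(tag, hop_value, hop_id)
    return tag


# Three hops with 40, 10 and 25 Gbps of available bandwidth.
result = traverse(Tag(float("inf"), -1), [40.0, 10.0, 25.0])
print(result)  # Tag(value=10.0, locator=1)
```

Whether the path has three hops or thirty, the receiver sees one value and one locator, whereas a scheme that appends per-hop records would grow with every hop.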
7.4. Signal Design
CSIG's signal design focuses on simple, aggregate signals that are
driven by use cases, as demonstrated in Section 5 and Section 8.
CSIG allows a single packet to carry only one congestion signal. To
obtain multiple signals at the end hosts, it takes advantage of the
fact that the end host can request different signal types across
multiple packets of a flow. In contrast, other schemes tend to
overload each packet with a lot of information, including metadata
about multiple signals, which can be limiting. Moreover, the CSIG-tag
format is extensible, which means that it can be adapted to
support additional signal types and locator metadata in the future
without compromising the advantages of CSIG's design.
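One way an end host could obtain several signals from single-signal packets is to rotate the requested signal type across the packets of a flow. The Python sketch below assumes a round-robin policy, which is an illustrative choice and not something the protocol mandates; `signal_schedule` is a hypothetical helper.

```python
from itertools import cycle, islice

# Signal names as used in this document.
SIGNAL_TYPES = ("min(ABW)", "min(ABW/C)", "max(PD)")


def signal_schedule(num_packets: int, types=SIGNAL_TYPES) -> list:
    """Assign one requested signal type to each packet in round-robin order."""
    return list(islice(cycle(types), num_packets))


print(signal_schedule(4))
# ['min(ABW)', 'min(ABW/C)', 'max(PD)', 'min(ABW)']
```

A sender could equally weight the rotation toward whichever signal its congestion control algorithm consumes most often.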
A unique feature of Compact CSIG's design is the ability to fully
configure signal value buckets, which allows for efficient signal
representations with a limited number of bits. For example, the
encodings can be adjusted to provide greater granularity at value
ranges that are more important to the application, and lower
granularity at ranges that are less important. Similarly, locator
metadata can be efficiently represented by carrying fewer bits of
relevant compressed attributes of the bottleneck that are important
to applications. Expanded CSIG, on the other hand, uses uniform
signal quantization for more accuracy and provides even more
flexibility in defining signals and locator metadata with a larger
bit width.
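To give a sense of what a larger bit width buys under uniform quantization, the following Python sketch maps a value onto equally spaced levels and reports the step size. The mapping itself is an assumption made for illustration, since the exact quantization rule is not specified here; the 5-bit and 20-bit widths are the compact and expanded signal value sizes.

```python
def quantize_uniform(x: float, lo: float, hi: float, bits: int) -> int:
    """Map x in [lo, hi] onto 2**bits equally spaced levels."""
    max_level = (1 << bits) - 1
    clamped = min(max(x, lo), hi)
    return round((clamped - lo) / (hi - lo) * max_level)


def step_size(lo: float, hi: float, bits: int) -> float:
    """Distance between adjacent levels."""
    return (hi - lo) / ((1 << bits) - 1)


# Utilization expressed as a fraction between 0 and 1.
print(step_size(0.0, 1.0, 5))   # about 3.2% per level
print(step_size(0.0, 1.0, 20))  # about 0.0001% per level
```

With only 32 levels, a uniform grid is coarse everywhere, which is the motivation for configurable buckets in Compact CSIG.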
8. Use Cases defined by Bottleneck Signals
The use cases for CSIG are motivated by congestion control, traffic
management and network debuggability. These use cases existed in
production before CSIG, often relying on signals that are measured
end-to-end (such as packet loss and delay) or on out-of-band signals
from network devices (such as port utilization). CSIG improves
performance, efficiency and debuggability by augmenting these
existing use cases with explicit in-band measurements.
In this document, we present the use cases for the three signals
defined in Section 5. At the crux of each signal is the definition of
a bottleneck. Over time, we envision use cases for other signals that
define a bottleneck differently, e.g., by the maximum number of
co-sharing flows on a link. For each of these new signals, locator
metadata can continue to provide attributes about the bottleneck port,
such as port capacity.
8.1. Congestion Control
CCAs can make use of CSIG signals in at least two different ways.
First, existing CCAs can use CSIG values to address blind spots in
end-to-end signals such as packet loss, delay, and delivery rates.
This use case is immediately relevant, as most production networks
deploy some form of end-to-end congestion control, including Swift
[SWIFT] and BBR [BBR]. A second way to use CSIG is to design entirely
new congestion control algorithms that use CSIG as their primary
signal. We focus below on the former category.
E2E CCAs come in various forms; for simplicity, we describe the use
cases taking Swift [SWIFT] as the baseline. Swift is a delay-based
congestion control algorithm that uses accurate round-trip time (RTT)
measurements based on NIC hardware timestamps. The CSIG signals
described here can be applied to other CCAs and are not limited to
Swift.
The interpretation and application of CSIG for congestion control in
lossless networks and in networks that use packet spraying are topics
for future research.
8.1.1. Using maximum per-hop delay in E2E CC
E2E RTT measurements used in Swift include the queueing delays on all
hops along the flow's path, including the forward and reverse paths.
A consequence of using a lumped delay signal is that a flow reduces
its sending rate in response to delays that it may not be able to
directly control. Furthermore, in deployments where there can be
multiple congested links along the path of a flow, it is desirable to
modulate the sending rate of a flow in response to just the maximum
of the per-hop delays, max(PD), along the flow's path. Substituting
the bottleneck delay for the end-to-end measured delay in Swift's
decrease equation yields the following:
// Reduce the congestion window when bottleneck hop delay
// exceeds a chosen target hop delay
if (max(PD) > target_delay) then
    md = beta * (max(PD) - target_delay) / max(PD)
    cwnd = (1 - md) * cwnd
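The decrease rule above translates directly into a small Python function. This is a sketch of the pseudocode only: the name `decrease_cwnd` and the parameter values in the usage line are illustrative, and any additional safeguards a production CCA applies (such as limits on the decrease factor or a minimum window) are deliberately left out.

```python
def decrease_cwnd(cwnd: float, max_pd: float,
                  target_delay: float, beta: float) -> float:
    """Multiplicative decrease driven by the bottleneck hop delay.

    max_pd and target_delay must use the same time unit.
    """
    if max_pd <= target_delay:
        return cwnd  # bottleneck delay is within target: no decrease
    md = beta * (max_pd - target_delay) / max_pd
    return (1.0 - md) * cwnd


# Bottleneck hop delay of 40us against a 20us target, beta = 0.5:
# md = 0.5 * (40 - 20) / 40 = 0.25, so the window shrinks by a quarter.
print(decrease_cwnd(cwnd=100.0, max_pd=40.0, target_delay=20.0, beta=0.5))  # 75.0
```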
Poseidon [POSEIDON] is a CCA proposed in the literature that
exemplifies the use of maximum per-hop delay in reducing its
congestion window. By incorporating bottleneck information in its
congestion control response, Poseidon flows achieve higher throughput
in the presence of reverse path congestion and of congestion across
multiple network hops. Algorithm 1 in [POSEIDON] details the use of
maximum per-hop delay in both the increase and the decrease of the
congestion window.
8.1.2. Using maximum link utilization in E2E CC
E2E CC uses heuristics to determine by how much to increase the
congestion window, e.g., in the case of Swift, when the measured
round-trip time is lower than the target delay, Swift increments the
congestion window by one per round-trip time. BBR [BBR] increases
the rate as a function of the flow's measured delivery rate.
The problem with these heuristics is that they rarely get the rate or
window adjustments exactly right and either undershoot or overshoot.
Undershooting the rate means that transfers take longer to
complete even when the bottleneck link has low utilization, while
overshooting can cause an unnecessary increase in queueing delay and
packet losses.
In the following example, we integrate the maximum utilization signal
into Swift's congestion window update equation to ramp up adaptively
faster when the bottleneck link has low utilization. The congestion
window evolution is represented below:
// Increase congestion window in proportion
// to the utilization headroom
if (rtt < target_rtt) then
    fcwnd <-- fcwnd + additive_increment
              + kLambda * fcwnd * (1 - max(U/C))
As an example, Swift's fixed additive increase, rate <-- rate +
Additive Increment, means that it takes 200 RTTs to reach 80 Gbps of
bandwidth with an Additive Increment of 400 Mbps (80 Gbps / 400 Mbps
= 200). The fast ramp-up with CSIG using the bottleneck link
utilization takes <10 RTTs to safely ramp to 80 Gbps.
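The two ramp-up figures can be reproduced with a short Python simulation. It is a sketch under simplifying assumptions that are not taken from the text: a single flow starts from zero on an otherwise idle 100 Gbps bottleneck, its own rate is the only contribution to utilization, the utilization feedback is applied without delay, and kLambda is set to 1.0. The function name `rtts_to_reach` is hypothetical.

```python
def rtts_to_reach(target_mbps: int, additive_mbps: int,
                  capacity_mbps: int, k_lambda: float) -> int:
    """Count RTTs until the rate reaches target_mbps.

    With k_lambda = 0 this is plain additive increase; otherwise the
    rate also grows in proportion to the utilization headroom.
    """
    rate = 0.0
    rtts = 0
    while rate < target_mbps:
        headroom = max(0.0, 1.0 - rate / capacity_mbps)  # 1 - max(U/C)
        rate = rate + additive_mbps + k_lambda * rate * headroom
        rtts += 1
    return rtts


additive_only = rtts_to_reach(80_000, 400, 100_000, k_lambda=0.0)
with_headroom = rtts_to_reach(80_000, 400, 100_000, k_lambda=1.0)
print(additive_only, with_headroom)
```

Under these assumptions the additive-only case takes 200 RTTs, while the headroom-scaled case roughly doubles the rate each RTT while the link is lightly used and finishes in fewer than 10 RTTs. The growth slows on its own as utilization approaches capacity.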
8.1.3. Using minimum available bandwidth in E2E CC
E2E CC uses heuristics to determine the initial transfer rate for
newly established connections. Starting too slowly would cause the
transfer to take longer than necessary while wasting available
bandwidth, whereas starting too quickly would cause queueing delays
and packet drops. The same dilemma exists for transfers that are
starting on a connection that has been idle for multiple round-trip
times.
In networks where we know ahead of time that the degree of
multiplexing is low, i.e., just a handful of flows co-exist on the
link at any point in time, transfers complete quickly when they
"jump-start" to use up all of the bottleneck bandwidth. This is
especially helpful when transports employ robust loss recovery
mechanisms such that even if the queue overflows, any lost packets
can be quickly recovered.
As an example, on an otherwise idle 200 Gbps network, a single
transfer can use the entire 200 Gbps in the second RTT, after the
CSIG feedback in the first RTT indicates the availability of 200 Gbps
at the bottleneck link.
CSIG's min(ABW) signal thus allows transfers to start safely at line
rate.
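As a worked example, the window needed to fill the bottleneck after the first round of feedback is the reported min(ABW) multiplied by the round-trip time. The Python sketch below uses integer arithmetic; the function name and the 10 microsecond RTT are assumptions chosen for illustration.

```python
def jump_start_window_bytes(min_abw_bps: int, rtt_us: int) -> int:
    """Bytes that fit in one RTT at the reported available bandwidth."""
    return min_abw_bps * rtt_us // (8 * 1_000_000)


# 200 Gbps of available bandwidth and an assumed RTT of 10 microseconds.
print(jump_start_window_bytes(200_000_000_000, 10))  # 250000
```

A sender that has learned min(ABW) in the first RTT could size its second-RTT window from this product instead of ramping up over many round trips.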
8.2. Traffic Management
CSIG encodes the most notable information about the path for each
flow by carrying bottleneck link signals and bottleneck locator
metadata. This path-level information, which is obtained directly
from application data packets rather than synthetic probes, is
directly attributable to the flow and is valuable for traffic
engineering and application performance debugging.
8.2.1. Load Balancing and Multipathing
Datacenter topologies employ a diverse set of paths between any
source-destination pair. Transports employ techniques such as
Protective Load Balancing [PLB] and Multipathing [RFC8684] to spread
traffic across the multitude of paths. Load balancing and
multipathing in transports use a combination of end-to-end signals
and heuristics to select which paths to use and how much traffic to
channel in each of the paths.
Using CSIG signals from bottleneck links along the diverse set of
paths, load balancing and multipathing schemes can select high
quality paths with lower congestion, and spread traffic across them
in a congestion-aware manner.
Locator metadata can also be used to distinguish between incast
congestion and core network congestion, which can then be used to
adjust load balancing / multipathing actions. For instance, the
stage of the bottleneck and link orientation attributes are enough to
determine whether the last hop is the bottleneck or not. When the
last hop is the bottleneck, flow-level load balancing / multipathing
actions may not be effective and may, in fact, worsen incasts. Such
cases may require application-level load balancing or job scheduling
techniques to distribute traffic. However, when congestion is
instead known to be in the core network, flow-level load balancing /
multipathing actions can route around congested areas and improve
performance.
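The decision described above can be sketched in Python as follows. The encoding is entirely hypothetical: stage 1 is assumed to be the switch tier closest to the hosts, and `Orientation.DOWN` is assumed to mean a link pointing toward the hosts. The actual locator metadata encoding is deployment-specific.

```python
from enum import Enum


class Orientation(Enum):
    UP = "up"      # link points toward the core (assumed encoding)
    DOWN = "down"  # link points toward the hosts (assumed encoding)


def is_last_hop_bottleneck(stage: int, orientation: Orientation) -> bool:
    """True when the bottleneck is the final link to the receiver."""
    return stage == 1 and orientation is Orientation.DOWN


def choose_action(stage: int, orientation: Orientation) -> str:
    """Pick a reaction based on where the bottleneck is located."""
    if is_last_hop_bottleneck(stage, orientation):
        # Incast: repathing cannot avoid the shared last hop.
        return "defer to application-level load balancing"
    # Core congestion: another path may avoid the bottleneck.
    return "repath flow"


print(choose_action(1, Orientation.DOWN))
print(choose_action(2, Orientation.UP))
```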
8.2.2. Traffic Engineering
Traffic Engineering (TE) provisions paths with appropriate bandwidth
across aggregate source-destination pairs. Examples within a
datacenter include the Datacenter Network Interconnection Layer (DCNI)
[JUPITEREVOL]. CSIG can be used to provide fine-grained path-level
information, including short-timescale microburst congestion, to TE
systems. By using summarized CSIG signals aggregated both spatially
and temporally across flows, TE can select paths and balance traffic
at the datacenter level to accommodate bursty traffic, e.g., from ML
workloads.
8.3. Application Performance Debugging
Applications often complain that the network is slow, but it can be
challenging to identify the specific segment of the network that is
causing the problem. This is especially true with the scale of
datacenters, where flows can traverse up to nine hops [JUPITEREVOL].
Figuring out where the bottleneck is and the timescales at which the
path poses a bottleneck is like searching for a needle in a haystack
for an application with thousands of flows across various
source-destination pairs.
When enabled on application network flows, CSIG information, together
with its bottleneck locator, can quickly and precisely indicate why
the flows are slow and where the network / path bottlenecks are.
CSIG can also be enabled on mesh prober systems similar to [PINGMESH]
to augment end-to-end probe measurements between any two servers with
bottleneck information to aid troubleshooting.
9. Security Considerations
The network MUST allow only trusted sender hosts to construct,
initialize and insert a CSIG tag into packets, and only for authorized
flows. Depending on the deployment, this authorization can be
enforced at the NICs or at the switches, akin to firewall rules. CSIG
stripping may also be employed as a fencing rule at domain boundaries
to ensure that unauthorized CSIG-tags do not cross these boundaries.
A rogue or broken network device in a private network might write
arbitrary CSIG values, or insert a CSIG tag into packets at a transit
node. We expect there to be checks and balances to identify
non-functioning or rogue network devices and take them out of a
private network, as such devices can cause greater harm than
distributing misleading CSIG values.
10. IANA Considerations
This document has no IANA actions. The CSIG Tag Protocol Identifier
(TPID) is requested from the IEEE.
11. Conclusions
With the increased deployment of applications that are sensitive to
delay and bandwidth usage in data centers, e.g., AI/ML/HPC workloads
and RDMA based applications, relying solely on end-to-end signals is
insufficient under dynamically changing traffic patterns. Simple and
timely signals from network devices to end-hosts can augment
end-host transports so that they make better use of datacenter
bandwidth. CSIG is a simple, practical and deployable protocol for
distributing congestion information in networks that builds on the
successful aspects of prior work and is grounded in use-cases of
congestion control, traffic management and network debuggability.
12. Acknowledgements
This work would not have been possible without the following individuals
whose various engineering and design contributions shaped CSIG and
its use cases:
Christopher Alfeld, Neelesh Bansod, Jis Ben, Neal Cardwell, Yongzhou
Chen, Yuchung Cheng, Dal Chand Choudhary, Mick Fingleton, Mahmudul
Hasan, Jeffrey Ji, Marc De Kruijf, Praveen Kumar, Rich Lane, Chang
Liu, Morley Mao, Carl Mauer, Sachin Menezes, Nipen Mody, Masoud
Moshref, Alex Rumyantsev, Gerald Schmidt, Arjun Singh, Arjun Singhvi,
Babru Thatikunta, Jeff Tikkanen, Frank Uyeda, Brian Vasquez, Rui
Wang, Hassan Wassel, Yong Xia, Zhengxu Xia, Kevin Yang, Liangcheng
Yu.
We would like to thank Arjun Singh, David Wetherall, Neal Cardwell,
Akash Deshpande and Arvind Krishnamurthy for their feedback on
several portions of this document.
13. Normative References
[BBR] Cardwell, N., Cheng, Y., Gunn, C., Yeganeh, S., and V.
Jacobson, "BBR: congestion-based congestion control",
Communications of the ACM vol. 60, no. 2, pp. 58-66,
DOI 10.1145/3009824, January 2017,
<https://doi.org/10.1145/3009824>.
[CONGA] Alizadeh, M., Edsall, T., Dharmapurikar, S., Vaidyanathan,
R., Chu, K., Fingerhut, A., Lam, V., Matus, F., Pan, R.,
Yadav, N., and G. Varghese, "CONGA: distributed
congestion-aware load balancing for datacenters", ACM
SIGCOMM Computer Communication Review vol. 44, no. 4, pp.
503-514, DOI 10.1145/2740070.2626316, August 2014,
<https://doi.org/10.1145/2740070.2626316>.
[DCQCN] Zhu, Y., Eran, H., Firestone, D., Guo, C., Lipshteyn, M.,
Liron, Y., Padhye, J., Raindel, S., Yahia, M., and M.
Zhang, "Congestion Control for Large-Scale RDMA
Deployments", ACM SIGCOMM Computer Communication
Review vol. 45, no. 4, pp. 523-536,
DOI 10.1145/2829988.2787484, August 2015,
<https://doi.org/10.1145/2829988.2787484>.
[HPCCPLUS] "High-precision congestion control (HPCC++) deployment at
Alibaba leveraging In-band Flow Analyzer (IFA)", n.d.,
<https://www.broadcom.com/blog/high-precision-congestion-
control>.
[I-D.kumar-ippm-ifa]
Kumar, J., Anubolu, S., Lemon, J., Manur, R., Holbrook,
H., Ghanwani, A., Cai, D., Ou, H., Li, Y., and X. Wang,
"Inband Flow Analyzer", Work in Progress, Internet-Draft,
draft-kumar-ippm-ifa-07, 7 September 2023,
<https://datatracker.ietf.org/doc/html/draft-kumar-ippm-
ifa-07>.
[I-D.miao-tsv-hpcc]
Miao, R., Anubolu, S., Pan, R., Lee, J., Gafni, B.,
Shpigelman, Y., Tantsura, J., and G. Caspary, "HPCC++:
Enhanced High Precision Congestion Control", Work in
Progress, Internet-Draft, draft-miao-tsv-hpcc-02, 17 May
2023, <https://datatracker.ietf.org/doc/html/draft-miao-
tsv-hpcc-02>.
[JUPITEREVOL]
Poutievski, L., Mashayekhi, O., Ong, J., Singh, A., Tariq,
M., Wang, R., Zhang, J., Beauregard, V., Conner, P.,
Gribble, S., Kapoor, R., Kratzer, S., Li, N., Liu, H.,
Nagaraj, K., Ornstein, J., Sawhney, S., Urata, R.,
Vicisano, L., Yasumura, K., Zhang, S., Zhou, J., and A.
Vahdat, "Jupiter evolving: transforming google's
datacenter network via optical circuit switches and
software-defined networking", Proceedings of the ACM
SIGCOMM 2022 Conference, DOI 10.1145/3544216.3544265,
August 2022, <https://doi.org/10.1145/3544216.3544265>.
[P4-INT] "In-band Network Telemetry (INT) Dataplane Specification",
n.d., <https://p4.org/p4-spec/docs/INT_v2_1.pdf>.
[PINGMESH] Guo, C., Yuan, L., Xiang, D., Dang, Y., Huang, R., Maltz,
D., Liu, Z., Wang, V., Pang, B., Chen, H., Lin, Z., and V.
Kurien, "Pingmesh: A Large-Scale System for Data Center
Network Latency Measurement and Analysis", ACM SIGCOMM
Computer Communication Review vol. 45, no. 4, pp. 139-152,
DOI 10.1145/2829988.2787496, August 2015,
<https://doi.org/10.1145/2829988.2787496>.
[PLB] Qureshi, M., Cheng, Y., Yin, Q., Fu, Q., Kumar, G.,
Moshref, M., Yan, J., Jacobson, V., Wetherall, D., and A.
Kabbani, "PLB: congestion signals are simple and effective
for network load balancing", Proceedings of the ACM
SIGCOMM 2022 Conference, DOI 10.1145/3544216.3544226,
August 2022, <https://doi.org/10.1145/3544216.3544226>.
[PONYEXPRESS]
Marty, M., de Kruijf, M., Adriaens, J., Alfeld, C., Bauer,
S., Contavalli, C., Dalton, M., Dukkipati, N., Evans, W.,
Gribble, S., Kidd, N., Kononov, R., Kumar, G., Mauer, C.,
Musick, E., Olson, L., Rubow, E., Ryan, M., Springborn,
K., Turner, P., Valancius, V., Wang, X., and A. Vahdat,
"Snap: a microkernel approach to host networking",
Proceedings of the 27th ACM Symposium on Operating
Systems Principles, DOI 10.1145/3341301.3359657, October
2019, <https://doi.org/10.1145/3341301.3359657>.
[POSEIDON] Wang, W., Moshref, M., Li, Y., Kumar, G., Ng, E.,
Cardwell, N., and N. Dukkipati, "Poseidon: Efficient,
Robust, and Practical Datacenter CC via Deployable INT",
2023,
<https://www.usenix.org/conference/nsdi23/presentation/
wang-weitao>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/rfc/rfc2119>.
[RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
of Explicit Congestion Notification (ECN) to IP",
RFC 3168, DOI 10.17487/RFC3168, September 2001,
<https://www.rfc-editor.org/rfc/rfc3168>.
[RFC8257] Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257,
October 2017, <https://www.rfc-editor.org/rfc/rfc8257>.
[RFC8684] Ford, A., Raiciu, C., Handley, M., Bonaventure, O., and C.
Paasch, "TCP Extensions for Multipath Operation with
Multiple Addresses", RFC 8684, DOI 10.17487/RFC8684, March
2020, <https://www.rfc-editor.org/rfc/rfc8684>.
[RFC9000] Iyengar, J., Ed. and M. Thomson, Ed., "QUIC: A UDP-Based
Multiplexed and Secure Transport", RFC 9000,
DOI 10.17487/RFC9000, May 2021,
<https://www.rfc-editor.org/rfc/rfc9000>.
[RFC9378] Brockners, F., Ed., Bhandari, S., Ed., Bernier, D., and T.
Mizrahi, Ed., "In Situ Operations, Administration, and
Maintenance (IOAM) Deployment", RFC 9378,
DOI 10.17487/RFC9378, April 2023,
<https://www.rfc-editor.org/rfc/rfc9378>.
[SWIFT] Kumar, G., Dukkipati, N., Jang, K., Wassel, H., Wu, X.,
Montazeri, B., Wang, Y., Springborn, K., Alfeld, C., Ryan,
M., Wetherall, D., and A. Vahdat, "Swift: Delay is Simple
and Effective for Congestion Control in the Datacenter",
Proceedings of the Annual conference of the ACM Special
Interest Group on Data Communication on the applications,
technologies, architectures, and protocols for
computer communication, DOI 10.1145/3387514.3406591, July
2020, <https://doi.org/10.1145/3387514.3406591>.
[TCP-INT] Jereczek, G., Jepsen, T., Wass, S., Pujari, B., Zhen, J.,
and J. Lee, "TCP-INT: lightweight network telemetry with
TCP transport", Proceedings of the SIGCOMM '22 Poster and
Demo Sessions, DOI 10.1145/3546037.3546064, August 2022,
<https://doi.org/10.1145/3546037.3546064>.
Appendix A. Example encodings of CSIG signals
The following table demonstrates an example encoding of a 3-bit
signal value. Note that this is an example ONLY. The encoding that
is meaningful to a given deployment depends on the use cases under
consideration.
Note that the CSIG tag supports a 5-bit (20-bit) signal value for the
compact (expanded) format.
+=======+============+===========+============+
| Value | min(ABW/C) | min(ABW)  | max(PD)    |
+=======+============+===========+============+
| 0x0   | 0%-1%      | 0-1Gbps   | 0-10us     |
+-------+------------+-----------+------------+
| 0x1   | 1%-5%      | 1-5Gbps   | 10-50us    |
+-------+------------+-----------+------------+
| 0x2   | 5%-10%     | 5-10Gbps  | 50-100us   |
+-------+------------+-----------+------------+
| 0x3   | 10%-20%    | 10-20Gbps | 100-200us  |
+-------+------------+-----------+------------+
| 0x4   | 20%-50%    | 20-50Gbps | 200-400us  |
+-------+------------+-----------+------------+
| 0x5   | 50%-75%    | 50-75Gbps | 400-800us  |
+-------+------------+-----------+------------+
| 0x6   | 75%-90%    | 75-90Gbps | 800-2000us |
+-------+------------+-----------+------------+
| 0x7   | 90%-100%   | >90Gbps   | >2000us    |
+-------+------------+-----------+------------+
Table 2
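A device or end host could implement the max(PD) column of this example table as a lookup over bucket edges. The Python sketch below is illustrative only: `PD_EDGES_US` and `encode_pd` are hypothetical names, and treating each upper edge as inclusive is an assumption, since the table does not say which bucket a boundary value falls into.

```python
from bisect import bisect_left

# Upper edges, in microseconds, of buckets 0x0 to 0x6 in the max(PD)
# column of the example table; bucket 0x7 is open-ended.
PD_EDGES_US = [10, 50, 100, 200, 400, 800, 2000]


def encode_pd(delay_us: float) -> int:
    """Return the 3-bit bucket value for a per-hop delay."""
    return bisect_left(PD_EDGES_US, delay_us)


print(hex(encode_pd(75)))    # 0x2
print(hex(encode_pd(5000)))  # 0x7
```

Changing the bucket layout for a different deployment only requires replacing the list of edges, which is what makes fully configurable buckets practical.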
Contributors
Weida Huang
Google LLC
Tyler Griggs
UC Berkeley
Mohammad Jafar Akhbarizadeh
Google LLC
Jeongkeun Lee
Google LLC
Surendra Anubolu
Broadcom Inc.
Kok-Kiong Yap
Google LLC
Neal Cardwell
Google LLC
Authors' Addresses
Abhiram Ravi
Google LLC
Email: abhiramr@google.com
Nandita Dukkipati
Google LLC
Email: nanditad@google.com
Naoshad Mehta
Google LLC
Email: naoshad@google.com
Jai Kumar
Broadcom Inc.
Email: jai.kumar@broadcom.com