<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>

<rfc xmlns:xi="http://www.w3.org/2001/XInclude"
     category="std"
     docName="draft-li-spring-rdma-multicast-over-srv6-00"
     ipr="trust200902"
     submissionType="IETF"
     consensus="true"
     version="3">

  <front>
    <title abbrev="RDMA Multicast over SRv6">
      SRv6 Extensions for RDMA Multicast Delivery
    </title>

    <seriesInfo name="Internet-Draft"
                value="draft-li-spring-rdma-multicast-over-srv6-00"/>

    <author fullname="Zhiqiang Li" initials="Z." surname="Li">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <street>32 Xuanwumen West Street</street>
          <city>Beijing</city>
          <code>100053</code>
          <country>China</country>
        </postal>
        <email>lizhiqiangyjy@chinamobile.com</email>
      </address>
    </author>
    <author fullname="Zongpeng Du" initials="Z." surname="Du">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <street>32 Xuanwumen West Street</street>
          <city>Beijing</city>
          <code>100053</code>
          <country>China</country>
        </postal>
        <email>duzongpeng@chinamobile.com</email>
      </address>
    </author>
    <author fullname="Wei Cheng" initials="W." surname="Cheng">
      <organization>Centec Networks</organization>
      <address>
        <postal>
          <city>Suzhou</city>
          <code>215000</code>
          <country>China</country>
        </postal>
        <email>chengw@centec.com</email>
      </address>
    </author>
    <author fullname="Junjie Wang" initials="J." surname="Wang">
      <organization>Centec Networks</organization>
      <address>
        <postal>
          <city>Suzhou</city>
          <code>215000</code>
          <country>China</country>
        </postal>
        <email>wangjj@centec.com</email>
      </address>
    </author>
    <author fullname="Guoying Zhang" initials="G." surname="Zhang">
      <organization>Centec Networks</organization>
      <address>
        <postal>
          <city>Suzhou</city>
          <code>215000</code>
          <country>China</country>
        </postal>
        <email>zhanggy@centec.com</email>
      </address>
    </author>

    <date year="2026" month="February" day="28"/>

    <area>Routing</area>
    <workgroup>SPRING</workgroup>

    <keyword>RDMA</keyword>
    <keyword>multicast</keyword>
    <keyword>SRv6</keyword>
    <keyword>RoCEv2</keyword>

    <abstract>
      <t>This document specifies SRv6 (Segment Routing over IPv6)
      extensions for multicast delivery of RDMA (Remote Direct Memory
      Access) Reliable Connection (RC) traffic.  It defines a new SRv6
      endpoint behavior, End.MT, that performs per-receiver RDMA Base
      Transport Header (BTH) modifications at edge nodes of the
      multicast tree.  It also specifies procedures for hop-by-hop
      aggregation of RDMA ACK, NACK, and CNP response messages along
      the reverse path.  Together, these extensions allow RDMA RC
      endpoints to communicate using standard point-to-point Queue
      Pair (QP) semantics while the network distributes data packets
      over an IP multicast tree.  Target deployment scenarios include
      multi-replica distributed storage writes, HPC collective
      communications, AI training parameter distribution, and
      large-scale inference KV cache distribution.</t>
    </abstract>
  </front>

  <middle>
    <!-- ======== Section 1: Introduction ======== -->
    <section anchor="introduction">
      <name>Introduction</name>

      <t>Large-scale distributed computing deployments, including
      data center interconnection, distributed AI training and
      inference, and national-scale computing networks, rely on
      high-throughput data transport.  RDMA (Remote Direct Memory
      Access) provides kernel-bypass data transfer with low CPU
      overhead on both sending and receiving hosts.  RDMA one-sided
      operations, where receive buffers are pre-registered at the
      receiver's Network Interface Card (NIC), further reduce CPU
      involvement at the receiving end.</t>

      <t>Many distributed applications exhibit one-to-many traffic
      patterns, including multi-replica storage writes, HPC collective
      communications (broadcast, scatter), AI training parameter
      distribution, and KV cache distribution in inference pipelines.
      IP multicast delivery of such traffic can reduce total
      network bandwidth consumption compared to per-receiver unicast
      replication at the source.</t>

      <t>The RDMA Reliable Connection (RC) transport mode is widely
      deployed because it supports the full set of RDMA operations:
      Send, RDMA Read, RDMA Write, and Atomic.  However, each RC
      Queue Pair (QP) is a point-to-point association between exactly
      one sending QP and one receiving QP.  RC packets carry
      per-connection identifiers in the Base Transport Header (BTH),
      specifically the Destination Queue Pair Number (QPN) and Packet
      Sequence Number (PSN).  These per-connection fields prevent
      direct application of IP multicast replication to RC traffic
      because each receiver requires its own QPN and independently
      tracks PSN state.</t>

      <t>Existing application-layer approaches in distributed
      frameworks (MPI, NCCL, Spark) address this limitation in two
      ways: by opening separate RC QP connections to each receiver,
      which results in source bandwidth consumption proportional to
      the number of receivers, or by constructing application-layer
      relay topologies (trees or rings), which introduce per-hop
      host-stack traversal latency and additional memory copy
      overhead at relay nodes.</t>

      <t>This document specifies SRv6 extensions that bridge the gap
      between RDMA RC point-to-point semantics and IP multicast
      one-to-many delivery.  Edge nodes of the multicast tree execute
      a new SRv6 endpoint behavior (End.MT) that rewrites per-receiver
      RDMA BTH fields in replicated packet copies.  Intermediate and
      edge nodes aggregate reverse-path RDMA response messages
      (ACK, NACK, CNP) before they reach the source.  RDMA RC
      endpoints are not required to implement any multicast-specific
      extensions.</t>

      <section anchor="related-work">
        <name>Relationship to Other Work</name>

        <t>The Segment Routing Replication segment defined in
        <xref target="RFC9524"/> provides a general-purpose
        SRv6 packet replication behavior (End.Replicate).  The
        End.MT behavior specified in this document is complementary:
        it performs RDMA-specific BTH header modifications in
        addition to packet replication at edge nodes.  Transit nodes
        in the multicast tree MAY use End.Replicate or any other
        IP multicast forwarding mechanism for tree-interior
        replication.</t>

        <t>Fast Congestion Notification Packet (Fast CNP) mechanisms
        for RoCEv2 networks, such as those described in
        <xref target="I-D.xiao-rtgwg-rocev2-fast-cnp"/>, define
        switch-originated CNPs sent directly to the sender on a
        point-to-point basis.  The reverse-path CNP aggregation
        specified in this document operates on the multicast tree
        topology and is independent of, and compatible with, Fast
        CNP on individual links.</t>

        <t>RoCEv2-based collective communication offloading, as
        described in <xref target="I-D.liu-nfsv4-rocev2"/>,
        implements in-network aggregation functions for collective
        operations.  This document differs in scope: it addresses
        one-to-many data distribution (multicast) rather than
        many-to-one aggregation (reduce), and it does not require
        RDMA connections between hosts and switches.</t>
      </section>

      <section anchor="requirements-language">
        <name>Requirements Language</name>
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
        "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED",
        "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document
        are to be interpreted as described in BCP 14
        <xref target="RFC2119"/> <xref target="RFC8174"/> when,
        and only when, they appear in all capitals, as shown here.</t>
      </section>
    </section>

    <!-- ======== Section 2: Terminology ======== -->
    <section anchor="terminology">
      <name>Terminology</name>

      <t>This document uses the following terms.  Familiarity with
      SRv6 terminology from <xref target="RFC8402"/>,
      <xref target="RFC8754"/>, and <xref target="RFC8986"/> is
      assumed.</t>

      <dl>
        <dt>RDMA:</dt>
        <dd>Remote Direct Memory Access, as specified in the
        InfiniBand Architecture <xref target="ROCEV2"/>.</dd>

        <dt>RoCEv2:</dt>
        <dd>RDMA over Converged Ethernet version 2, an RDMA
        transport encapsulated in UDP/IPv6 (or UDP/IPv4).</dd>

        <dt>RC:</dt>
        <dd>Reliable Connection, an RDMA transport mode providing
        connection-oriented reliable delivery.</dd>

        <dt>QP:</dt>
        <dd>Queue Pair, the RDMA communication endpoint consisting
        of a Send Queue (SQ) and a Receive Queue (RQ).</dd>

        <dt>QPN:</dt>
        <dd>Queue Pair Number, a 24-bit identifier for a QP.</dd>

        <dt>BTH:</dt>
        <dd>Base Transport Header, the RDMA transport header
        containing the opcode, Destination QPN, PSN, and other
        fields.</dd>

        <dt>AETH:</dt>
        <dd>ACK Extended Transport Header, an RDMA header carrying
        acknowledgment information.</dd>

        <dt>PSN:</dt>
        <dd>Packet Sequence Number, a 24-bit sequence number used
        for ordering and acknowledgment in RDMA reliable
        transport.</dd>

        <dt>ACK:</dt>
        <dd>Acknowledgment, an RDMA response confirming successful
        reception.</dd>

        <dt>NACK:</dt>
        <dd>Negative Acknowledgment, an RDMA response requesting
        retransmission.</dd>

        <dt>CNP:</dt>
        <dd>Congestion Notification Packet, used in the RoCEv2
        ECN-based congestion control mechanism.</dd>

        <dt>SRv6:</dt>
        <dd>Segment Routing over IPv6
        <xref target="RFC8402"/>.</dd>

        <dt>SRH:</dt>
        <dd>Segment Routing Header
        <xref target="RFC8754"/>.</dd>

        <dt>SID:</dt>
        <dd>Segment Identifier, a 128-bit IPv6 address in SRv6.</dd>

        <dt>End.MT:</dt>
        <dd>A new SRv6 endpoint behavior defined in this document
        for RDMA multicast header transformation at edge
        nodes.</dd>

        <dt>Designated QPN:</dt>
        <dd>A QPN value agreed upon by all multicast group
        participants during group setup, used as the Destination
        QPN in data packets traversing the multicast tree.</dd>

        <dt>Proxy Address:</dt>
        <dd>An IPv6 address used as the common destination by
        the source and all receivers to represent the multicast
        group at the RDMA layer.</dd>
      </dl>
    </section>

    <!-- ======== Section 3: Applicability ======== -->
    <section anchor="applicability">
      <name>Applicability</name>

      <t>The extensions specified in this document apply to networks
      that meet all of the following conditions:</t>

      <ol>
        <li>RDMA transport uses RoCEv2 in Reliable Connection (RC)
        mode over IPv6.</li>
        <li>The network underlay supports SRv6 as defined in
        <xref target="RFC8754"/> and <xref target="RFC8986"/>.</li>
        <li>IP multicast forwarding is available in the underlay
        between the source and the edge nodes, using any combination
        of PIM, static multicast routing, or SRv6 Replication
        segments <xref target="RFC9524"/>.</li>
        <li>Edge nodes are capable of maintaining per-receiver
        RDMA connection state and performing BTH field
        modification at line rate.</li>
      </ol>

      <t>These extensions do not modify RDMA endpoint behavior.
      Hosts run unmodified RoCEv2 protocol stacks and establish
      standard RC QP connections.  All multicast-related packet
      transformations occur within the network.</t>
    </section>

    <!-- ======== Section 4: Architecture ======== -->
    <section anchor="architecture">
      <name>Architecture</name>

      <section anchor="network-roles">
        <name>Network Roles</name>

        <t>This specification defines the following network roles:</t>

        <dl>
          <dt>Multicast Source (S1):</dt>
          <dd>The RDMA sending host.  S1 establishes a standard RDMA
          RC QP to the Proxy Address using the Designated QPN.  S1
          obtains the unicast IPv6 addresses and QPNs of all
          receivers via a control plane (the control plane protocol
          is out of scope of this document).</dd>

          <dt>Multicast Receivers (R1..Rn):</dt>
          <dd>RDMA receiving hosts.  Each Ri establishes a standard
          RDMA RC QP to the Proxy Address using its own locally
          assigned QPN.</dd>

          <dt>Edge Nodes:</dt>
          <dd>SRv6-capable network nodes adjacent to receivers
          (e.g., N1-N3 in <xref target="fig-topology"/>).  Edge
          nodes instantiate the End.MT SID and perform: (a) RDMA
          BTH Destination QPN replacement, (b) IPv6 Destination
          Address replacement, (c) ICRC recomputation, and (d)
          reverse-path ACK/NACK/CNP aggregation.</dd>

          <dt>Transit Nodes:</dt>
          <dd>Intermediate forwarding nodes in the multicast tree
          (e.g., N4-N6 in <xref target="fig-topology"/>).  Transit
          nodes replicate and forward packets using IP multicast
          procedures and participate in reverse-path response
          aggregation.</dd>
        </dl>
      </section>

      <section anchor="ref-topology">
        <name>Reference Topology</name>

        <figure anchor="fig-topology">
          <name>Reference Network Topology</name>
          <artwork type="ascii-art"><![CDATA[
                            S1
                             |
                            N6 (transit)
                          /    \
               (transit) N4     N5 (transit)
                        /  \    /  \
              (edge)  N1   N2    N3 (edge)
                     / \    |   / \
                   R1   R2  R3 R4  R5
]]></artwork>
        </figure>

        <t>In <xref target="fig-topology"/>, S1 is the multicast
        source.  R1 through R5 are multicast receivers.  N1, N2, and
        N3 are edge nodes executing the End.MT behavior.  N4, N5,
        and N6 are transit nodes performing IP multicast
        replication.</t>
      </section>
    </section>

    <!-- ======== Section 5: Protocol Specification ======== -->
    <section anchor="protocol-spec">
      <name>Data Plane Specification</name>

      <section anchor="group-setup">
        <name>Multicast Group Setup</name>

        <t>Prior to data transmission, a multicast group MUST be
        established as follows:</t>

        <ol>
          <li>All participants (source S1 and receivers R1..Rn) MUST
          each create an RDMA RC QP directed at the Proxy Address.
          All QPs MUST use the Designated QPN as the remote
          (destination) QPN.</li>

          <li>The source S1 MUST obtain the unicast IPv6 address and
          the actual local QPN of each receiver Ri via the control
          plane.  The control plane protocol and its signaling
          procedures are outside the scope of this document.</li>

          <li>Each edge node MUST be configured to receive IP
          multicast traffic addressed to the Proxy Address.</li>

          <li>S1 MUST encode each edge node's associated receiver
          information (unicast IPv6 addresses and QPNs) into End.MT
          TLVs within the SRH of data packets, as specified in
          <xref target="end-mt-tlv"/>.</li>
        </ol>
      </section>

      <section anchor="downstream-forwarding">
        <name>Downstream Data Forwarding</name>

        <t>The source-to-receiver data path operates as follows:</t>

        <ol>
          <li>S1 constructs RDMA RC data packets with the IPv6
          Destination Address set to the Proxy Address and the BTH
          Destination QPN set to the Designated QPN.  S1
          encapsulates the packets in an outer IPv6 header with an
          SRH containing the End.MT SID(s) and associated
          TLVs.</li>

          <li>Transit nodes forward the traffic according to their
          IP multicast forwarding tables, performing tree
          replication as needed.</li>

          <li><t>When a packet arrives at an edge node whose local
          End.MT SID matches the IPv6 Destination Address, the
          edge node MUST execute the End.MT behavior
          (<xref target="end-mt-behavior"/>):</t>
            <ol type="a">
              <li>Parse the End.MT TLV from the SRH to obtain the
              list of downstream receivers (unicast addresses and
              QPNs).</li>
              <li>Create one copy of the inner packet for each
              downstream receiver.</li>
              <li>In each copy, replace the IPv6 Destination
              Address with the receiver's unicast address.</li>
              <li>In each copy, replace the BTH Destination QPN
              with the receiver's actual QPN.</li>
              <li>Recompute the Invariant CRC (ICRC) and any other
              affected checksums, such as the UDP checksum, whose
              pseudo-header includes the IPv6 Destination
              Address.</li>
              <li>Forward each modified copy toward its
              destination.</li>
            </ol>
          </li>

          <li>Each receiver Ri receives a standard RDMA RC unicast
          packet addressed to its own IPv6 address and QPN.  No
          multicast-specific behavior is required at Ri.</li>
        </ol>
      </section>

      <section anchor="upstream-response">
        <name>Reverse-Path Response Processing</name>

        <t>Receivers generate three types of response messages toward
        the source: ACK (acknowledgment of successful reception),
        NACK (request for retransmission), and CNP (congestion
        notification).  These responses MUST be aggregated hop-by-hop
        at intermediate nodes before reaching the source, so that
        the source's retransmission and rate control logic operates
        correctly without multicast-specific modifications.</t>

        <section anchor="ack-aggregation">
          <name>ACK Aggregation</name>

          <t>The source MUST receive an AckPSN value satisfying the
          following invariant: for every receiver Ri and every
          packet with PSN less than or equal to AckPSN, Ri has
          confirmed successful reception.</t>

          <t>Each intermediate node (edge or transit) MUST maintain
          a record of the most recent AckPSN reported by each
          downstream branch.  When an ACK is received from a
          downstream branch, the node MUST update the stored
          AckPSN for that branch.  The node MUST forward an ACK
          upstream carrying an AckPSN equal to the minimum of all
          downstream branches' stored AckPSN values.  The node
          adjacent to the source (N6 in <xref target="fig-topology"/>)
          MUST write this minimum value into the BTH PSN field of
          the ACK forwarded to the source; in RoCEv2 the
          acknowledged PSN is carried in the BTH, while the AETH
          carries the syndrome and MSN.</t>
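
          <t>As a non-normative illustration, the per-branch minimum
          described above can be sketched in Python.  The class
          AckAggregator, its method names, and the branch labels are
          hypothetical and not part of this specification.  Because
          the PSN is a 24-bit value that wraps, the sketch assumes
          that PSNs are ordered by their modulo-2^24 distance from a
          base PSN and that all outstanding PSNs lie within half the
          sequence space of that base; this document does not define
          a comparison rule.  The sketch also assumes that no ACK is
          forwarded upstream until every branch has reported at
          least once.</t>

          <sourcecode type="python"><![CDATA[
PSN_MOD = 1 << 24  # the PSN is a 24-bit field


def psn_offset(psn, base):
    """Distance of psn ahead of base, modulo 2^24."""
    return (psn - base) % PSN_MOD


class AckAggregator:
    """Hypothetical per-group ACK state at one intermediate node."""

    def __init__(self, branches, base_psn):
        self.base = base_psn
        # No AckPSN is known for a branch until it reports one.
        self.ack_psn = {b: None for b in branches}

    def on_ack(self, branch, ack_psn):
        """Store the branch's AckPSN and return the AckPSN to
        forward upstream, or None if a branch is still silent."""
        self.ack_psn[branch] = ack_psn
        known = list(self.ack_psn.values())
        if any(v is None for v in known):
            return None
        return min(known, key=lambda p: psn_offset(p, self.base))
]]></sourcecode>

          <t>For example, with a base PSN of 0xFFFFF0 and stored
          AckPSN values of 0xFFFFFE and 0x000003 on two branches,
          the sketch forwards 0xFFFFFE, because 0x000003 lies after
          the wrap and is therefore the more advanced value.</t>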
        </section>

        <section anchor="nack-aggregation">
          <name>NACK Aggregation</name>

          <t>When a receiver detects missing packets, it sends a
          NACK containing an expected PSN (ePSN) indicating the
          start of the retransmission range.  The source MUST
          receive an ePSN satisfying the following invariant: for
          every receiver Ri, all packets with PSN less than ePSN
          have been successfully received by Ri.</t>

          <t>Each intermediate node MUST maintain a per-branch
          record of ePSN values.  For branches that have sent only
          ACKs (no NACK), the effective ePSN SHOULD be treated as
          AckPSN + 1.  The NACK forwarded upstream MUST carry the
          minimum ePSN across all downstream branches.</t>
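
          <t>The rule above can be illustrated with the following
          non-normative Python sketch.  The function names and the
          shape of the per-branch state (a mapping from a branch
          label to an AckPSN and an optional NACK ePSN) are
          hypothetical.  As an assumption not stated in this
          document, AckPSN + 1 is computed modulo 2^24 and the
          minimum is taken by modulo-2^24 distance from a base
          PSN.</t>

          <sourcecode type="python"><![CDATA[
PSN_MOD = 1 << 24  # the PSN is a 24-bit field


def effective_epsn(ack_psn, nack_epsn):
    """ePSN from the latest NACK, else AckPSN + 1 (mod 2^24)."""
    if nack_epsn is not None:
        return nack_epsn
    return (ack_psn + 1) % PSN_MOD


def upstream_epsn(branches, base_psn):
    """Minimum effective ePSN over all downstream branches.

    branches maps a branch label to (ack_psn, nack_epsn), where
    nack_epsn is None for a branch that has sent only ACKs.
    """
    values = [effective_epsn(a, n) for a, n in branches.values()]
    return min(values, key=lambda p: (p - base_psn) % PSN_MOD)
]]></sourcecode>

          <t>With one branch that has acknowledged PSN 10 and sent
          no NACK, and another that has sent a NACK with ePSN 8, the
          sketch returns 8 as the ePSN to carry upstream.</t>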
        </section>

        <section anchor="cnp-aggregation">
          <name>CNP Aggregation</name>

          <t>Each intermediate node MUST maintain a per-branch
          counter (CCount) that records the number of CNP messages
          received from each downstream branch within a
          configurable time window T.</t>

          <t>At the expiration of each time window T, the node MUST
          select the branch with the highest CCount value and
          forward a single CNP upstream representing the most
          congested downstream path.  If no CNP was received during
          the window (all CCount values are zero), the node does not
          forward a CNP.  All CCount values MUST be reset to zero at
          the start of each new time window.</t>

          <t>The time window T MAY be adjusted dynamically based on
          observed network conditions.  The node adjacent to the
          source MUST rewrite CNP packet headers so that the source
          processes the CNP as a standard RoCEv2 congestion
          notification.</t>
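
          <t>The windowed counting can be illustrated with the
          following non-normative Python sketch.  The class
          CnpAggregator and its method names are hypothetical; the
          timer that ends each window T is left to the caller.  The
          sketch assumes that nothing is forwarded for a window in
          which no CNP arrived and that ties are resolved in favor
          of the branch listed first; this document does not specify
          a tie-breaking rule.</t>

          <sourcecode type="python"><![CDATA[
class CnpAggregator:
    """Hypothetical per-group CNP counters at one node."""

    def __init__(self, branches):
        self.ccount = {b: 0 for b in branches}

    def on_cnp(self, branch):
        """Count one CNP received from a downstream branch."""
        self.ccount[branch] += 1

    def on_window_expiry(self):
        """Close window T: return the most congested branch, or
        None if no CNP arrived, then reset all counters."""
        branch = max(self.ccount, key=self.ccount.get)
        worst = branch if self.ccount[branch] > 0 else None
        for b in self.ccount:
            self.ccount[b] = 0
        return worst
]]></sourcecode>

          <t>A caller would invoke on_window_expiry() once per
          window and, when it returns a branch, send one CNP
          upstream on behalf of that branch.</t>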
        </section>
      </section>
    </section>

    <!-- ======== Section 6: End.MT Behavior ======== -->
    <section anchor="end-mt-behavior">
      <name>SRv6 End.MT Behavior</name>

      <section anchor="end-mt-definition">
        <name>Definition</name>

        <t>End.MT is an SRv6 endpoint behavior instantiated at edge
        nodes of the RDMA multicast tree.  When a node N receives a
        packet whose IPv6 Destination Address matches a locally
        instantiated End.MT SID, N performs the processing described
        in <xref target="end-mt-pseudocode"/>.</t>

        <t>End.MT combines the following operations: SRH segment
        processing, End.MT TLV parsing, per-receiver packet
        replication, BTH Destination QPN replacement, IPv6
        Destination Address replacement, and ICRC recomputation.</t>
      </section>

      <section anchor="end-mt-pseudocode">
        <name>Pseudocode</name>

        <t>The following pseudocode follows the conventions of
        Section 4 of <xref target="RFC8986"/>.</t>

        <artwork type="pseudocode"><![CDATA[
When N receives a packet destined to S, where S is a local
End.MT SID, N does:

  S01. If NH=SRH and SL > 0 {
  S02.   Decrement SL
  S03.   Update the IPv6 DA with SRH[SL]         ;; Ref1
  S04.   Parse the End.MT TLV associated with S
  S05.   Let RecvList = list of (IPv6_Addr, QPN) from TLV
  S06.   For each entry (Addr_i, QPN_i) in RecvList {
  S07.     Copy the packet                        ;; Ref2
  S08.     In the copy, set IPv6 DA = Addr_i
  S09.     In the copy, set BTH.DestQPN = QPN_i
  S10.     Recompute the ICRC and UDP checksum of the copy
  S11.     Forward the copy based on Addr_i       ;; Ref3
  S12.   }
  S13. } Else {
  S14.   Drop the packet                          ;; Ref4
  S15. }

Ref1: Standard SRH processing per RFC 8754 Section 4.3.1.1.

Ref2: The copy contains the inner RDMA packet, that is, all
      content following the outer IPv6 header and the SRH.

Ref3: FIB lookup on Addr_i determines the outgoing
      interface.

Ref4: A packet arriving with SL=0 or without SRH is not
      valid for End.MT processing.
]]></artwork>
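
        <t>Steps S06 through S12 can also be illustrated with the
        following non-normative Python sketch.  The RdmaPacket model
        and the function names are hypothetical simplifications.  In
        particular, toy_icrc() is only a placeholder that shows
        where recomputation happens: it is not the RoCEv2 ICRC
        algorithm, which masks variant header fields before
        computing the CRC.</t>

        <sourcecode type="python"><![CDATA[
import zlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RdmaPacket:
    """Simplified model of the inner RoCEv2 packet."""
    ipv6_da: str
    dest_qpn: int
    psn: int
    payload: bytes
    icrc: int = 0


def toy_icrc(pkt):
    """Placeholder checksum, NOT the real RoCEv2 ICRC."""
    data = (pkt.ipv6_da.encode()
            + pkt.dest_qpn.to_bytes(3, "big")
            + pkt.psn.to_bytes(3, "big")
            + pkt.payload)
    return zlib.crc32(data)


def end_mt_replicate(pkt, recv_list):
    """Steps S06-S12: one rewritten copy per (Addr_i, QPN_i)."""
    copies = []
    for addr, qpn in recv_list:
        # S08 and S09: rewrite the destination address and QPN.
        copy = replace(pkt, ipv6_da=addr, dest_qpn=qpn)
        # S10: recompute the checksum over the rewritten copy.
        copies.append(replace(copy, icrc=toy_icrc(copy)))
    return copies
]]></sourcecode>

        <t>Each returned copy keeps the PSN and payload of the
        original packet and differs only in the destination address,
        the destination QPN, and the recomputed checksum.</t>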
      </section>
    </section>

    <!-- ======== Section 7: Packet Formats ======== -->
    <section anchor="packet-formats">
      <name>Packet Formats</name>

      <section anchor="srh-usage">
        <name>SRH Usage</name>

        <t>This specification uses the standard IPv6 Segment Routing
        Header (SRH) as defined in <xref target="RFC8754"/>.  The
        SRH carries the End.MT SID in its Segment List and the
        End.MT TLV in its Optional TLV field.</t>
      </section>

      <section anchor="end-mt-tlv">
        <name>End.MT TLV Format</name>

        <t>The End.MT TLV is carried in the Optional TLV field of
        the SRH and conveys per-edge-node receiver information.  Its
        format is as follows:</t>

        <figure anchor="fig-tlv">
          <name>End.MT TLV Format</name>
          <artwork type="ascii-art"><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Type     |     Length    |           Reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|          Edge Node Address (128 bits IPv6)                    |
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Num Receivers |                  Reserved                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|         Receiver 1 Address (128 bits IPv6)                    |
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Receiver 1 QPN (24 bits)            |   Reserved    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
~                            ...                                ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|         Receiver N Address (128 bits IPv6)                    |
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Receiver N QPN (24 bits)            |   Reserved    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
        </figure>

        <t>The fields are defined as follows:</t>

        <dl>
          <dt>Type (8 bits):</dt>
          <dd>SRH TLV type code, to be assigned by IANA
          (see <xref target="iana"/>).</dd>

          <dt>Length (8 bits):</dt>
          <dd>Length of the Value field in octets, not including the
          Type and Length fields.</dd>

          <dt>Edge Node Address (128 bits):</dt>
          <dd>The IPv6 unicast address of the edge node associated
          with this TLV entry.</dd>

          <dt>Num Receivers (8 bits):</dt>
          <dd>The number of receiver entries following this
          field.</dd>

          <dt>Receiver i Address (128 bits):</dt>
          <dd>The IPv6 unicast address of the i-th receiver.</dd>

          <dt>Receiver i QPN (24 bits):</dt>
          <dd>The RDMA Queue Pair Number of the i-th receiver.</dd>

          <dt>Reserved:</dt>
          <dd>MUST be set to zero on transmission and MUST be
          ignored on reception.</dd>
        </dl>

        <t>Multiple End.MT TLVs MAY be present in a single SRH, one
        per edge node in the multicast tree.  Each End.MT TLV is
        associated with the End.MT SID of the corresponding edge
        node.</t>
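
        <t>The layout in <xref target="fig-tlv"/> can be illustrated
        with the following non-normative Python sketch of an encoder
        and decoder.  The function names are hypothetical, and the
        Type value 124 is only a stand-in because the real value is
        to be assigned by IANA.  The sketch also makes one
        consequence of the format explicit: the Value field occupies
        22 octets plus 20 octets per receiver, so the 8-bit Length
        field limits a single TLV to 11 receiver entries.</t>

        <sourcecode type="python"><![CDATA[
import ipaddress
import struct

# Hypothetical stand-in; the real Type value is TBD (IANA).
END_MT_TLV_TYPE = 124
# Value = 22 fixed octets + 20 per receiver, at most 255 octets.
MAX_RECEIVERS = (255 - 22) // 20


def encode_end_mt_tlv(edge, receivers):
    """Build a TLV from an edge node address and a list of
    (receiver address, QPN) pairs."""
    if len(receivers) > MAX_RECEIVERS:
        raise ValueError("too many receivers for one TLV")
    value = bytes(2) + ipaddress.IPv6Address(edge).packed
    value += struct.pack("!B3x", len(receivers))
    for addr, qpn in receivers:
        value += ipaddress.IPv6Address(addr).packed
        value += qpn.to_bytes(3, "big") + b"\x00"
    return struct.pack("!BB", END_MT_TLV_TYPE, len(value)) + value


def decode_end_mt_tlv(tlv):
    """Return (edge address, [(receiver address, QPN), ...])."""
    length = tlv[1]
    value = tlv[2:2 + length]
    edge = str(ipaddress.IPv6Address(value[2:18]))
    count = value[18]
    receivers = []
    for i in range(count):
        off = 22 + 20 * i
        addr = str(ipaddress.IPv6Address(value[off:off + 16]))
        qpn = int.from_bytes(value[off + 16:off + 19], "big")
        receivers.append((addr, qpn))
    return edge, receivers
]]></sourcecode>

        <t>A TLV with two receivers occupies 64 octets in total.
        When the total size of the TLVs is not a multiple of 8
        octets, padding TLVs as defined in
        <xref target="RFC8754"/> would be needed to keep the SRH
        aligned.</t>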
      </section>
    </section>

    <!-- ======== Section 8: Intermediate Node State ======== -->
    <section anchor="node-state">
      <name>Intermediate Node State Requirements</name>

      <t>Each intermediate node (both edge and transit) participating
      in reverse-path response aggregation MUST maintain the
      following per-multicast-group state:</t>

      <dl>
        <dt>Per-branch AckPSN:</dt>
        <dd>The most recent AckPSN value received from each
        downstream branch.  Initial value: 0.</dd>

        <dt>Per-branch ePSN:</dt>
        <dd>The most recent expected PSN from any NACK received from
        each downstream branch.  For branches that have not sent a
        NACK, this value SHOULD be set to AckPSN + 1.</dd>

        <dt>Per-branch CCount:</dt>
        <dd>A counter of CNP messages received from each downstream
        branch within the current time window T.  Reset to zero at
        the start of each new time window.</dd>
      </dl>

      <t>The amount of state is proportional to the number of
      downstream branches at each node, not to the total number of
      receivers in the multicast group.  Edge nodes additionally
      maintain the receiver information (addresses and QPNs) learned
      from the End.MT TLV or from control-plane provisioning.</t>
    </section>

    <!-- ======== Section 9: Security ======== -->
    <section anchor="security">
      <name>Security Considerations</name>

      <t>The security considerations of <xref target="RFC8754"/> and
      <xref target="RFC8986"/> apply to all SRv6 aspects of this
      specification.  The following additional considerations are
      specific to RDMA multicast delivery.</t>

      <t>The End.MT TLV carries receiver IPv6 addresses and QPNs in
      the SRH.  An on-path attacker able to read SRH contents can
      obtain receiver topology and RDMA connection identifiers.
      Implementations operating outside a single administrative
      trust domain SHOULD protect SRH integrity using the HMAC TLV
      defined in Section 2.1.2 of <xref target="RFC8754"/>.  The
      HMAC TLV does not provide confidentiality; where receiver
      addresses and QPNs need to be hidden from on-path observers,
      the SRv6 packet needs additional encryption, for example
      IPsec Encapsulating Security Payload (ESP) tunnel-mode
      encapsulation of the entire packet.</t>

      <t>Intermediate nodes maintain per-branch ACK/NACK/CNP
      aggregation state.  An attacker injecting forged response
      messages could corrupt this state, causing the source to
      prematurely consider data as acknowledged (via inflated
      AckPSN) or to trigger unnecessary retransmissions (via forged
      NACKs).  Nodes SHOULD validate that reverse-path response
      messages originate from addresses within the expected
      downstream receiver set.  BCP 38 <xref target="RFC2827"/>
      ingress filtering SHOULD be applied at network boundaries.</t>

      <t>An attacker injecting a high volume of forged CNP messages
      could force the source into continuous rate reduction,
      creating a denial-of-service condition.  Intermediate nodes
      SHOULD implement per-branch CNP rate limiting.  The
      configurable time window T for CNP aggregation provides an
      inherent damping effect.</t>

      <t>If End.MT TLV contents are modified in transit, packets
      could be delivered to incorrect RDMA QPs, resulting in data
      corruption or information disclosure at unintended receivers.
      The SRH HMAC TLV <xref target="RFC8754"/> provides integrity
      protection for this purpose.  Edge nodes SHOULD verify HMAC
      before processing End.MT TLVs when operating across trust
      domain boundaries.</t>
    </section>

    <!-- ======== Section 10: IANA ======== -->
    <section anchor="iana">
      <name>IANA Considerations</name>

      <section anchor="iana-behavior">
        <name>SRv6 Endpoint Behavior</name>

        <t>This document requests IANA to allocate a new codepoint
        in the "SRv6 Endpoint Behaviors" registry within the
        "Segment Routing" registry group
        <xref target="RFC8986"/>:</t>

        <table anchor="tab-behavior">
          <name>SRv6 Endpoint Behavior Registration</name>
          <thead>
            <tr>
              <th>Value</th>
              <th>Hex</th>
              <th>Endpoint Behavior</th>
              <th>Reference</th>
              <th>Change Controller</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>TBD1</td>
              <td>TBD1</td>
              <td>End.MT</td>
              <td>[This document]</td>
              <td>IETF</td>
            </tr>
          </tbody>
        </table>
      </section>

      <section anchor="iana-tlv">
        <name>SRH TLV Type</name>

        <t>This document requests IANA to allocate a new Type value
        in the "Segment Routing Header TLVs" registry under the
        "Internet Protocol Version 6 (IPv6) Parameters" registry
        group <xref target="RFC8754"/>:</t>

        <table anchor="tab-tlv">
          <name>SRH TLV Type Registration</name>
          <thead>
            <tr>
              <th>Value</th>
              <th>Description</th>
              <th>Reference</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>TBD2</td>
              <td>End.MT TLV</td>
              <td>[This document]</td>
            </tr>
          </tbody>
        </table>
      </section>
    </section>
  </middle>

  <back>
    <references>
      <name>References</name>

      <references>
        <name>Normative References</name>

        <reference anchor="RFC2119"
           target="https://www.rfc-editor.org/info/rfc2119">
          <front>
            <title>Key words for use in RFCs to Indicate
            Requirement Levels</title>
            <author fullname="S. Bradner" initials="S."
                    surname="Bradner"/>
            <date month="March" year="1997"/>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="2119"/>
          <seriesInfo name="DOI" value="10.17487/RFC2119"/>
        </reference>

        <reference anchor="RFC8174"
           target="https://www.rfc-editor.org/info/rfc8174">
          <front>
            <title>Ambiguity of Uppercase vs Lowercase in
            RFC 2119 Key Words</title>
            <author fullname="B. Leiba" initials="B."
                    surname="Leiba"/>
            <date month="May" year="2017"/>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="8174"/>
          <seriesInfo name="DOI" value="10.17487/RFC8174"/>
        </reference>

        <reference anchor="RFC8402"
           target="https://www.rfc-editor.org/info/rfc8402">
          <front>
            <title>Segment Routing Architecture</title>
            <author fullname="C. Filsfils" initials="C."
                    surname="Filsfils" role="editor"/>
            <author fullname="S. Previdi" initials="S."
                    surname="Previdi" role="editor"/>
            <author fullname="L. Ginsberg" initials="L."
                    surname="Ginsberg"/>
            <author fullname="B. Decraene" initials="B."
                    surname="Decraene"/>
            <author fullname="S. Litkowski" initials="S."
                    surname="Litkowski"/>
            <author fullname="R. Shakir" initials="R."
                    surname="Shakir"/>
            <date month="July" year="2018"/>
          </front>
          <seriesInfo name="RFC" value="8402"/>
          <seriesInfo name="DOI" value="10.17487/RFC8402"/>
        </reference>

        <reference anchor="RFC8754"
           target="https://www.rfc-editor.org/info/rfc8754">
          <front>
            <title>IPv6 Segment Routing Header (SRH)</title>
            <author fullname="C. Filsfils" initials="C."
                    surname="Filsfils" role="editor"/>
            <author fullname="D. Dukes" initials="D."
                    surname="Dukes" role="editor"/>
            <author fullname="S. Previdi" initials="S."
                    surname="Previdi"/>
            <author fullname="J. Leddy" initials="J."
                    surname="Leddy"/>
            <author fullname="S. Matsushima" initials="S."
                    surname="Matsushima"/>
            <author fullname="D. Voyer" initials="D."
                    surname="Voyer"/>
            <date month="March" year="2020"/>
          </front>
          <seriesInfo name="RFC" value="8754"/>
          <seriesInfo name="DOI" value="10.17487/RFC8754"/>
        </reference>

        <reference anchor="RFC8986"
           target="https://www.rfc-editor.org/info/rfc8986">
          <front>
            <title>Segment Routing over IPv6 (SRv6) Network
            Programming</title>
            <author fullname="C. Filsfils" initials="C."
                    surname="Filsfils" role="editor"/>
            <author fullname="P. Camarillo" initials="P."
                    surname="Camarillo" role="editor"/>
            <author fullname="J. Leddy" initials="J."
                    surname="Leddy"/>
            <author fullname="D. Voyer" initials="D."
                    surname="Voyer"/>
            <author fullname="S. Matsushima" initials="S."
                    surname="Matsushima"/>
            <author fullname="Z. Li" initials="Z."
                    surname="Li"/>
            <date month="February" year="2021"/>
          </front>
          <seriesInfo name="RFC" value="8986"/>
          <seriesInfo name="DOI" value="10.17487/RFC8986"/>
        </reference>

        <reference anchor="RFC9524"
           target="https://www.rfc-editor.org/info/rfc9524">
          <front>
            <title>Segment Routing Replication for Multipoint
            Service Delivery</title>
            <author fullname="D. Voyer" initials="D."
                    surname="Voyer" role="editor"/>
            <author fullname="C. Filsfils" initials="C."
                    surname="Filsfils"/>
            <author fullname="R. Parekh" initials="R."
                    surname="Parekh"/>
            <author fullname="H. Bidgoli" initials="H."
                    surname="Bidgoli"/>
            <author fullname="Z. Zhang" initials="Z."
                    surname="Zhang"/>
            <date month="February" year="2024"/>
          </front>
          <seriesInfo name="RFC" value="9524"/>
          <seriesInfo name="DOI" value="10.17487/RFC9524"/>
        </reference>
      </references>

      <references>
        <name>Informative References</name>

        <reference anchor="RFC2827"
           target="https://www.rfc-editor.org/info/rfc2827">
          <front>
            <title>Network Ingress Filtering: Defeating Denial
            of Service Attacks which employ IP Source Address
            Spoofing</title>
            <author fullname="P. Ferguson" initials="P."
                    surname="Ferguson"/>
            <author fullname="D. Senie" initials="D."
                    surname="Senie"/>
            <date month="May" year="2000"/>
          </front>
          <seriesInfo name="BCP" value="38"/>
          <seriesInfo name="RFC" value="2827"/>
          <seriesInfo name="DOI" value="10.17487/RFC2827"/>
        </reference>

        <reference anchor="RFC3168"
           target="https://www.rfc-editor.org/info/rfc3168">
          <front>
            <title>The Addition of Explicit Congestion
            Notification (ECN) to IP</title>
            <author fullname="K. Ramakrishnan" initials="K."
                    surname="Ramakrishnan"/>
            <author fullname="S. Floyd" initials="S."
                    surname="Floyd"/>
            <author fullname="D. Black" initials="D."
                    surname="Black"/>
            <date month="September" year="2001"/>
          </front>
          <seriesInfo name="RFC" value="3168"/>
          <seriesInfo name="DOI" value="10.17487/RFC3168"/>
        </reference>

        <reference anchor="RFC8279"
           target="https://www.rfc-editor.org/info/rfc8279">
          <front>
            <title>Multicast Using Bit Index Explicit
            Replication (BIER)</title>
            <author fullname="IJ. Wijnands" initials="IJ."
                    surname="Wijnands" role="editor"/>
            <author fullname="E. Rosen" initials="E."
                    surname="Rosen" role="editor"/>
            <author fullname="A. Dolganow" initials="A."
                    surname="Dolganow"/>
            <author fullname="T. Przygienda" initials="T."
                    surname="Przygienda"/>
            <author fullname="S. Aldrin" initials="S."
                    surname="Aldrin"/>
            <date month="November" year="2017"/>
          </front>
          <seriesInfo name="RFC" value="8279"/>
          <seriesInfo name="DOI" value="10.17487/RFC8279"/>
        </reference>

        <reference anchor="RFC8296"
           target="https://www.rfc-editor.org/info/rfc8296">
          <front>
            <title>Encapsulation for Bit Index Explicit
            Replication (BIER) in MPLS and Non-MPLS
            Networks</title>
            <author fullname="IJ. Wijnands" initials="IJ."
                    surname="Wijnands" role="editor"/>
            <author fullname="E. Rosen" initials="E."
                    surname="Rosen" role="editor"/>
            <author fullname="A. Dolganow" initials="A."
                    surname="Dolganow"/>
            <author fullname="J. Tantsura" initials="J."
                    surname="Tantsura"/>
            <author fullname="S. Aldrin" initials="S."
                    surname="Aldrin"/>
            <author fullname="I. Meilik" initials="I."
                    surname="Meilik"/>
            <date month="January" year="2018"/>
          </front>
          <seriesInfo name="RFC" value="8296"/>
          <seriesInfo name="DOI" value="10.17487/RFC8296"/>
        </reference>

        <reference anchor="I-D.xiao-rtgwg-rocev2-fast-cnp">
          <front>
            <title>Fast Congestion Notification Packet (CNP) in
            RoCEv2 Networks</title>
            <author fullname="X. Min" initials="X."
                    surname="Min"/>
            <author fullname="H. Li" initials="H."
                    surname="Li"/>
            <date year="2026" month="February" day="28"/>
          </front>
          <seriesInfo name="Internet-Draft"
            value="draft-xiao-rtgwg-rocev2-fast-cnp-04"/>
        </reference>

        <reference anchor="I-D.liu-nfsv4-rocev2">
          <front>
            <title>RoCEv2-based Collective Communication
            Offloading</title>
            <author fullname="Y. Liu" initials="Y."
                    surname="Liu"/>
            <date year="2026" month="February" day="28"/>
          </front>
          <seriesInfo name="Internet-Draft"
            value="draft-liu-nfsv4-rocev2-00"/>
        </reference>

        <reference anchor="I-D.hu-rtgwg-rocev2-fcn">
          <front>
            <title>Fast Congestion Notification for Distributed
            RoCEv2 Network Based on SRv6</title>
            <author fullname="Z. Hu" initials="Z."
                    surname="Hu"/>
            <author fullname="Y. Zhu" initials="Y."
                    surname="Zhu"/>
            <date year="2026" month="February" day="28"/>
          </front>
          <seriesInfo name="Internet-Draft"
            value="draft-hu-rtgwg-rocev2-fcn-00"/>
        </reference>

        <reference anchor="ROCEV2">
          <front>
            <title>Supplement to InfiniBand Architecture
            Specification Volume 1 Release 1.2.1 -
            Annex A17: RoCEv2</title>
            <author>
              <organization>InfiniBand Trade
              Association</organization>
            </author>
            <date month="September" year="2014"/>
          </front>
        </reference>
      </references>
    </references>

    <section anchor="acknowledgments" numbered="false">
      <name>Acknowledgments</name>
      <t>The authors thank the members of the SPRING and RTGWG
      working groups for their review and feedback.</t>
    </section>
  </back>
</rfc>