<?xml version="1.0" encoding="utf-8"?>
  <?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
  <!-- generated by https://github.com/cabo/kramdown-rfc version 1.6.17 (Ruby 3.1.2) -->


<!DOCTYPE rfc  [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">

<!ENTITY I-D.kompella-teas-mpte SYSTEM "https://bib.ietf.org/public/rfc/bibxml3/reference.I-D.kompella-teas-mpte.xml">
<!ENTITY I-D.ietf-bess-bgp-multicast-controller SYSTEM "https://bib.ietf.org/public/rfc/bibxml3/reference.I-D.ietf-bess-bgp-multicast-controller.xml">
<!ENTITY RFC9815 SYSTEM "https://bib.ietf.org/public/rfc/bibxml/reference.RFC.9815.xml">
<!ENTITY RFC9012 SYSTEM "https://bib.ietf.org/public/rfc/bibxml/reference.RFC.9012.xml">
<!ENTITY RFC9830 SYSTEM "https://bib.ietf.org/public/rfc/bibxml/reference.RFC.9830.xml">
<!ENTITY RFC2119 SYSTEM "https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC8174 SYSTEM "https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml">
<!ENTITY RFC9552 SYSTEM "https://bib.ietf.org/public/rfc/bibxml/reference.RFC.9552.xml">
]>

<?rfc comments="yes"?>

<rfc ipr="trust200902" docName="draft-zzhang-idr-mpte-signaling-00" category="std" consensus="true" tocInclude="true" sortRefs="true" symRefs="true">
  <front>
    <title abbrev="BGP Signaling for MPTE">BGP Signaling for Multipath Traffic Engineering Junction States</title>

    <author initials="Z." surname="Zhang" fullname="Zhaohui Zhang">
      <organization>HPE</organization>
      <address>
        <email>zhaohui.zhang@hpe.com</email>
      </address>
    </author>
    <author initials="K." surname="Kompella" fullname="Kireeti Kompella">
      <organization>HPE</organization>
      <address>
        <email>kireeti.ietf@gmail.com</email>
      </address>
    </author>
    <author initials="A." surname="Mahale" fullname="Aditya Mahale">
      <organization>Meta</organization>
      <address>
        <email>aditya.ietf@gmail.com</email>
      </address>
    </author>
    <author initials="R." surname="Bhargava" fullname="Raghav Bhargava">
      <organization>Crusoe</organization>
      <address>
        <email>raghavbhargava12@gmail.com</email>
      </address>
    </author>
    <author initials="A." surname="Zhang" fullname="Aaron Zhang">
      <organization>Westford Academy</organization>
      <address>
        <email>aaronzhang194@gmail.com</email>
      </address>
    </author>

    <date year="2026" month="March" day="02"/>

    <area>Routing</area>
    <workgroup>idr</workgroup>
    <keyword>mpte</keyword>

    <abstract>


<t>Multipath Traffic Engineering (MPTE) combines Traffic Engineering (TE) with
multipath forwarding, offering a much-desired TE solution for both traditional
WANs and new AI/ML DC/DCI networks. MPTE tunnels are based on MPTE Directed
Acyclic Graphs (DAGs) and can be signaled with extensions to RSVP-TE, PCEP, or
BGP. This document specifies the BGP protocol extensions and procedures for
signaling MPTE DAGs.</t>



    </abstract>



  </front>

  <middle>


<section anchor="introduction"><name>Introduction</name>

<t><xref target="I-D.kompella-teas-mpte"/> describes the architecture and framework for
Multipath Traffic Engineering (MPTE), along with a signaling approach that
could be implemented via extensions to RSVP, PCEP, or BGP. This
document specifies how to signal MPTE Directed Acyclic Graphs (DAGs) in BGP,
in particular how to provision the junctions that make up an MPTE DAG, using a
new AFI/SAFI, the MPTE AFI/SAFI.</t>

<t><xref target="I-D.ietf-bess-bgp-multicast-controller"/> specifies the BGP extensions to
signal multicast replication states to multicast tree nodes. Many of those
concepts and extensions can be used to signal MPTE junction states.
This section describes how that is achieved and how multicast signaling
and MPTE signaling differ.</t>

<section anchor="mode-of-operation"><name>Mode of Operation</name>

<t>While the BGP signaling for MPTE is not limited to Data Centers (DCs), a
DC using EBGP signaling is used as an example.</t>

<t>Assume the EBGP sessions between switches in the DC support the MPTE
SAFI for the signaling of junction states. A future revision of this
document will describe the use of Route Reflectors to a) isolate MPTE
from the other functions of the EBGP mesh (basic routing), and b) scale
sessions.</t>

<t>For each DAG, its Signaling Source (SS), which could be a controller or
an ingress switch, originates a set of BGP routes of the MPTE SAFI, one for
each junction node. Such a route is referred to as a Junction State route,
and it MUST carry a Route Target (RT) that targets the route
at the corresponding junction node. Until it reaches the targeted node,
the route is propagated by the BGP infrastructure. Once it reaches the
targeted node, the matching RT causes the route to be imported by that node
and not propagated any further.</t>


<t>Before a junction node has at least one path set up to an egress, its upstream
nodes should not start sending traffic to it. This ordered control is
preferably done in a hop-by-hop fashion, as in the RSVP-TE case.
When a junction has its local state set up for a DAG (starting with an egress
node), it originates
a RESV route for each of its PHOPs for the DAG, including encapsulation
(e.g., an MPLS label) and BW information (e.g., the maximum traffic it
expects from the PHOP). The upstream node repeats the process, and eventually
the ingress node can start sending traffic. Note that a junction node may
originate RESV routes before it has received RESV routes from all its NHOPs.
When more or updated RESV routes are received from its downstream nodes, or
when some of its downstream nodes are removed or become unreachable, it sends
updated RESV routes to its PHOPs.</t>
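
<t>The hop-by-hop ordering can be illustrated with a small, non-normative
Python sketch. It models only the order in which junctions become ready,
assuming that a junction proceeds as soon as the first RESV route arrives
from any of its NHOPs. The function name resv_order and the dictionary
representation of the DAG are illustrative and are not defined by this
document.</t>

<figure><artwork><![CDATA[
```python
from collections import deque

def resv_order(nhops, egresses):
    """Return junctions in the order they may originate RESV routes.

    nhops maps each junction to its downstream junctions (NHOPs);
    egresses lists the junctions whose local state is set up first.
    """
    # Invert the DAG so each junction knows its upstream nodes (PHOPs).
    phops = {}
    for node, downstream in nhops.items():
        for nhop in downstream:
            phops.setdefault(nhop, []).append(node)

    ready = set(egresses)
    queue = deque(egresses)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)  # node originates RESV routes to its PHOPs
        for phop in phops.get(node, []):
            # One RESV route from any NHOP is enough to proceed.
            if phop not in ready:
                ready.add(phop)
                queue.append(phop)
    return order
```
]]></artwork></figure>

<t>For a diamond-shaped DAG with ingress A, junctions B and C, and egress D,
the sketch places D first and A last, which matches the requirement that
the ingress starts sending traffic only after a downstream path exists.</t>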

<t>As an option, the ordered control could be done by the Signaling Source (SS).
In this case, the encapsulation information (e.g., MPLS labels) can be assigned
by the SS and included in the Junction State routes (the label assignment options
are detailed in <xref target="I-D.ietf-bess-bgp-multicast-controller"/>).
After a junction node installs the forwarding state, it sends an
acknowledgment route to the SS,
which tallies the results and notifies the ingress when, and how much, traffic
can be put onto the DAG. This is as if the junction nodes were programmed with
static routes, and it shifts the burden and complexity to the SS.</t>

</section>
<section anchor="collecting-topologyte-information"><name>Collecting Topology/TE Information</name>

<t>Typically, Traffic Engineering uses the IGP (via TE extensions) to distribute
topology and TE information. That is not an option for a DC that uses BGP
signaling.</t>

<t>BGP-LS <xref target="RFC9552"/> is a mechanism using BGP extensions to collect link state
and TE information that has been signaled by IGP. Typically, the information
is distributed to a few collectors (e.g., controllers) from a few distributors
(e.g., IGP border routers).</t>


<t>This document suggests using BGP LS <xref target="RFC9815"/> to distribute TE information
for MPTE. Every switch is a distributor of its local information. If
distributed calculation is used, each switch is also a collector
of other switches' local information. More details will be provided in a
future revision.</t>

</section>
<section anchor="considerations-for-bgp-signaling"><name>Considerations for BGP Signaling</name>


<t><xref target="I-D.kompella-teas-mpte"/> outlines the information carried in the JUNCTION
message.
When implemented in BGP, the MC ID, MPTED ID, MPTED Version, and Tunnel Type
are encoded in the NLRI of a new SAFI.</t>

<t>For the tunnel information part, the ingress/egress node information and tunnel
bandwidth are not encoded for now.</t>

<t>The junction bandwidth is carried in the NLRI
but is not considered part of the NLRI key.</t>

<t>All the NHOP and PHOP information is encoded into a Tunnel Encapsulation
Attribute (TEA) <xref target="RFC9012"/>, with extensions specified in
<xref target="I-D.ietf-bess-bgp-multicast-controller"/> and this document.
A TEA encodes a list of "tunnels", each of which could be a real tunnel 
or just an interface or neighbor.</t>

<t>As explained in <xref target="I-D.ietf-bess-bgp-multicast-controller"/>, when a TEA
is attached to an NLRI of MCAST-TREE SAFI, corresponding traffic is
replicated across the downstream tunnels in the TEA. Otherwise (including
in the MPTE case), traffic is load-balanced across the downstream (NHOP
in the case of MPTE) tunnels. Other than that, most of the TEA extensions
defined in <xref target="I-D.ietf-bess-bgp-multicast-controller"/> are applicable
to MPTE, with the following notes:</t>

<t><list style="symbols">
  <t>All tunnel types and sub-TLVs mentioned in
<xref target="I-D.ietf-bess-bgp-multicast-controller"/> can be used.</t>
  <t>A tunnel with an RPF sub-TLV is for a PHOP.</t>
  <t>The NHOP load share is encoded in the Weight sub-TLV <xref target="RFC9830"/>.</t>
  <t>In the case of labeled MPTE tunnels:
  <list style="symbols">
      <t>The Tree Label Stack sub-TLV is used to signal the outgoing label (stack)
of an NHOP.</t>
      <t>The Receiving MPLS Label Stack sub-TLV is used to signal the incoming
label (stack) of a PHOP.</t>
    </list></t>
  <t>For the MCAST-TREE case, only one tunnel has an RPF sub-TLV, and either
there is only one tunnel with the Receiving
MPLS Label Stack sub-TLV in the case of P2MP tunnel, or each tunnel has a
Receiving MPLS Label Stack sub-TLV in the case of MP2MP tunnel.</t>
  <t>For the MPTE case with labeled MPTE tunnels, every PHOP tunnel, and only
a PHOP tunnel, has a Receiving MPLS Label Stack sub-TLV unless ordered control is used.</t>
</list></t>


<t><list style="symbols">
  <t>The indication of an egress point (on a pure egress or on a bud node) is an
Any-Encapsulation tunnel without either the RPF sub-TLV or any sub-TLV
that identifies a downstream interface/tunnel. In the bud node case,
this tunnel has the Weight sub-TLV, indicating the load share as the traffic
is load-balanced between local delivery and other NHOP tunnels.</t>
</list></t>

</section>
</section>
<section anchor="specification"><name>Specification</name>

<t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED",
"MAY", and "OPTIONAL" in this document are to be interpreted as
described in BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only when, they
appear in all capitals, as shown here.</t>

<section anchor="afisafi-and-nlri"><name>AFI/SAFI and NLRI</name>

<t>This document defines a new SAFI type MPTE with value TBD1 for signaling MPTE
junction states.
When it is used with AFI 1, the IP addresses in the NLRI are IPv4. When it is
used with AFI 2, the addresses in the NLRI are IPv6.</t>

<t>The NLRI is encoded as follows:</t>

<figure><artwork><![CDATA[
  +-----------------------------------+
  |    Route Type (1 octet)           |
  +-----------------------------------+
  |     Length (1 octet)              |
  +-----------------------------------+
  | Route Type specific (variable)    |
  +-----------------------------------+
]]></artwork></figure>

<t>This document defines the following Route Types:</t>

<figure><artwork><![CDATA[
  + 1 - Junction State route
  + 2 - Junction RESV route
]]></artwork></figure>

<t>The Route Type specific part of the NLRI has the following format
for both route types:</t>

<figure><artwork><![CDATA[
  +-----------------------------------+
  |    MC Address (4/16 octets)       |
  +-----------------------------------+
  |     MPTED ID (4 octets)           |
  +-----------------------------------+
  |     MPTED Version (4 octets)      |
  +-----------------------------------+
  |      Tunnel Type (2 octets)       |
  +-----------------------------------+
  |Junction Node Address (4/16 octets)|
  +-----------------------------------+
  |      Originating Node Address     |
  +-----------------------------------+
  |      Junction BW (4 octets)       |
  +-----------------------------------+
]]></artwork></figure>

<t>All the fields above except the Junction BW, along with the Route Type and
Length, are part of the NLRI key.</t>
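
<t>The layout above can be sketched in Python as follows. This is a
non-normative illustration under several assumptions that this document
does not state: the Originating Node Address is taken to have the same
length as the other addresses, the Junction BW is treated as an unsigned
32-bit integer, and the Length field is taken to count the Route Type
specific part only. The function name encode_mpte_nlri is hypothetical.</t>

<figure><artwork><![CDATA[
```python
import ipaddress
import struct

ROUTE_TYPE_JUNCTION_STATE = 1
ROUTE_TYPE_JUNCTION_RESV = 2

def encode_mpte_nlri(route_type, mc_addr, mpted_id, mpted_version,
                     tunnel_type, junction_addr, originator_addr,
                     junction_bw):
    """Return (nlri, key) as bytes for one MPTE route."""
    mc = ipaddress.ip_address(mc_addr).packed
    junction = ipaddress.ip_address(junction_addr).packed
    # Assumption: same address family as the other address fields.
    originator = ipaddress.ip_address(originator_addr).packed
    key_fields = (
        mc
        + struct.pack("!IIH", mpted_id, mpted_version, tunnel_type)
        + junction
        + originator
    )
    # Assumption: Junction BW is an unsigned 32-bit integer.
    body = key_fields + struct.pack("!I", junction_bw)
    # Assumption: Length counts the Route Type specific part only.
    header = struct.pack("!BB", route_type, len(body))
    # The key covers route type, length and all fields but Junction BW.
    return header + body, header + key_fields
```
]]></artwork></figure>

<t>Two routes that differ only in Junction BW produce different NLRI octets
but the same key, so the second is treated as an update of the first rather
than as a new route.</t>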


</section>
<section anchor="full-link-identifier-sub-tlv"><name>Full Link Identifier sub-TLV</name>

<t>The Full Link Identifier sub-TLV is used to identify an unnumbered
interface by the Peer Node Address, Peer Link Index and Local Link Index.
It is used for unnumbered PHOPs in the Junction State routes.</t>

<figure><artwork><![CDATA[
      +- - - - - - - - - - - - - - - - +
      | sub-TLV Type (1 Octet, TBD2)   |
      +- - - - - - - - - - - - - - - - +
      | sub-TLV Length (1 Octet)       |
      +- - - - - - - - - - - - - - - - +
      | Peer Node Address (4/16 Octets)|
      +- - - - - - - - - - - - - - - - +
      | Peer Link Index (4 Octets)     |
      +- - - - - - - - - - - - - - - - +
      | Local Link Index (4 Octets)    |
      +- - - - - - - - - - - - - - - - +
]]></artwork></figure>
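
<t>As a non-normative illustration, the sub-TLV could be encoded as shown
below. The type value is a placeholder because TBD2 has not been assigned,
and the sketch assumes that the sub-TLV Length counts only the value
octets, as for other sub-TLVs with a one-octet length in
<xref target="RFC9012"/>. The function name is hypothetical.</t>

<figure><artwork><![CDATA[
```python
import ipaddress
import struct

# Placeholder for TBD2; this is not an assigned code point.
FULL_LINK_ID_TYPE = 0x7D

def encode_full_link_identifier(peer_node, peer_link_index,
                                local_link_index):
    """Encode a Full Link Identifier sub-TLV as bytes."""
    value = (
        ipaddress.ip_address(peer_node).packed  # 4 or 16 octets
        + struct.pack("!II", peer_link_index, local_link_index)
    )
    # Assumption: Length counts the value octets only.
    return struct.pack("!BB", FULL_LINK_ID_TYPE, len(value)) + value
```
]]></artwork></figure>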

</section>
<section anchor="link-index-sub-tlv"><name>Link Index sub-TLV</name>

<t>The Link Index sub-TLV encodes the Link Index of a link on the node
receiving the route.
It is used for unnumbered PHOPs in the Junction RESV routes.</t>

<figure><artwork><![CDATA[
      +- - - - - - - - - - - - - - - - +
      | sub-TLV Type (1 Octet, TBD3)   |
      +- - - - - - - - - - - - - - - - +
      | sub-TLV Length (1 Octet)       |
      +- - - - - - - - - - - - - - - - +
      | Link Index (4 Octets)          |
      +- - - - - - - - - - - - - - - - +
]]></artwork></figure>

</section>
<section anchor="interface-and-node-address-sub-tlv"><name>Interface and Node Address sub-TLV</name>

<t>The Interface and Node Address sub-TLV encodes the local or neighbor
address on an interface, and the address of the node that the interface
connects to. The type of address (IPv4/IPv6) is inferred from the
sub-TLV Length.</t>

<figure><artwork><![CDATA[
      +- - - - - - - - - - - - - - - - +
      | sub-TLV Type (1 Octet, TBD4)   |
      +- - - - - - - - - - - - - - - - +
      | sub-TLV Length (1 Octet)       |
      +- - - - - - - - - - - - - - - - +
      | Peer Node address (4/16 Octets)|
      +- - - - - - - - - - - - - - - - +
      | Intf/Nbr address  (4/16 Octets)|
      +- - - - - - - - - - - - - - - - +
]]></artwork></figure>
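
<t>The length-based inference of the address type can be sketched as
follows. The sketch is non-normative and assumes that both addresses in
the sub-TLV belong to the same address family, so that the value is
either 8 or 32 octets long. The function name is hypothetical.</t>

<figure><artwork><![CDATA[
```python
import ipaddress

def decode_interface_and_node_address(value):
    """Split the sub-TLV value into (peer_node, intf_or_nbr)."""
    if len(value) == 8:
        size = 4    # two IPv4 addresses
    elif len(value) == 32:
        size = 16   # two IPv6 addresses
    else:
        raise ValueError(f"unexpected sub-TLV Length {len(value)}")
    peer_node = ipaddress.ip_address(value[:size])
    intf_or_nbr = ipaddress.ip_address(value[size:])
    return peer_node, intf_or_nbr
```
]]></artwork></figure>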

</section>
<section anchor="procedures"><name>Procedures</name>

<section anchor="originating-junction-state-routes"><name>Originating Junction State Routes</name>

<t>After the MC calculates an MPTED, the SS originates a Junction State route for
each junction node. The route carries an IP Address Specific RT,
with the Global Administrator field set to the junction node's address
and the Local Administrator field set to 0.</t>

<t>The route carries a Tunnel Encapsulation Attribute (TEA). Each tunnel in the
TEA corresponds to a PHOP or NHOP:</t>

<t><list style="symbols">
  <t>Each tunnel is an Any-Encapsulation tunnel, with a Full Link Identifier
sub-TLV or an Interface and Node Address sub-TLV
to identify an incoming/outgoing link (in addition to other
sub-TLVs).</t>
  <t>Each PHOP tunnel MUST also include the following sub-TLVs:
  <list style="symbols">
      <t>One RPF sub-TLV to indicate it is a PHOP</t>
      <t>One Receiving MPLS Label Stack sub-TLV to encode the incoming label
unless ordered control is used.</t>
    </list></t>
  <t>Each NHOP tunnel MUST include one Tree Label Stack sub-TLV to
encode the outgoing label unless ordered control is used.
It MAY include a Weight sub-TLV to encode
the NHOP share. If one NHOP tunnel includes a Weight sub-TLV, then
all NHOP tunnels MUST include a Weight sub-TLV.</t>
</list></t>
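
<t>The tunnel rules above can be checked mechanically before a route is
originated. The following non-normative Python sketch represents each
tunnel by the names of its decoded sub-TLVs; the class, function, and
dictionary key names are illustrative only. The sketch does not model the
egress-indication tunnel, and it treats any tunnel without an RPF sub-TLV
as an NHOP tunnel.</t>

<figure><artwork><![CDATA[
```python
from dataclasses import dataclass, field

@dataclass
class Tunnel:
    """One tunnel in the TEA, reduced to its decoded sub-TLVs."""
    sub_tlvs: dict = field(default_factory=dict)

    @property
    def is_phop(self):
        # A tunnel with an RPF sub-TLV is a PHOP; others are NHOPs.
        return "rpf" in self.sub_tlvs

def check_tunnels(tunnels, ordered_control=False):
    """Return a list of rule violations (empty if none are found)."""
    errors = []
    if not ordered_control:
        # Label sub-TLVs are required unless ordered control is used.
        for tunnel in tunnels:
            if tunnel.is_phop:
                required = "receiving_mpls_label_stack"
            else:
                required = "tree_label_stack"
            if required not in tunnel.sub_tlvs:
                errors.append(f"missing {required}")
    nhops = [t for t in tunnels if not t.is_phop]
    weighted = [t for t in nhops if "weight" in t.sub_tlvs]
    # Weight is all-or-nothing across the NHOP tunnels.
    if weighted and len(weighted) != len(nhops):
        errors.append("weight on some NHOP tunnels but not all")
    return errors
```
]]></artwork></figure>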

</section>
<section anchor="receiving-junction-state-routes"><name>Receiving Junction State Routes</name>

<t>Each node X that receives an MPTE route with an RT whose Global
Administrator field does not match its loopback address propagates
the route to all its neighbors (except the one from which it
received the route).</t>

<t>If the Global Administrator field of the RT matches its own loopback
address, X MUST import the route and MUST stop re-advertising it.</t>
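
<t>The import-or-propagate decision can be summarized in a few lines of
Python. This is a non-normative sketch: addresses are compared as plain
strings, and the function name and return convention are illustrative.</t>

<figure><artwork><![CDATA[
```python
def handle_mpte_route(local_loopback, rt_global_admin,
                      received_from, neighbors):
    """Return (imported, peers) for one received MPTE route.

    imported is True when the route is imported locally; peers lists
    the neighbors to which the route is propagated further.
    """
    if rt_global_admin == local_loopback:
        # The RT targets this node: import and stop re-advertising.
        return True, []
    # Otherwise pass the route on to every neighbor except the sender.
    return False, [n for n in neighbors if n != received_from]
```
]]></artwork></figure>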

<t>Once the route is imported, X installs forwarding state as described
in the following sections.
These sections cover the MPLS case when ordered control is not used;
forwarding-state installation for other tunnel types or with ordered
control will be specified in a future revision.</t>

<section anchor="building-forwarding-nexthop"><name>Building Forwarding Nexthop</name>

<t>When a data packet is received, an IP address or MPLS label lookup is done
to produce the forwarding information about how the packet should be forwarded.
The forwarding information is referred to as forwarding nexthop in this
document, or simply nexthop when it is not ambiguous.</t>

<t>The forwarding nexthop for a junction is built by examining each NHOP tunnel.
The Interface and Node Address sub-TLV or the Full Link Identifier
sub-TLV identifies the outgoing interface/neighbor, and the Tree Label Stack
sub-TLV identifies the outgoing label (stack).
The Weight sub-TLV provides the load-balancing share for the link, and
bandwidth reservation can be done based on the Junction BW in the NLRI
and the Weight sub-TLV of the NHOP tunnel.</t>
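
<t>One possible way to derive per-NHOP shares is sketched below in Python.
The sketch is non-normative and rests on two assumptions that this
document does not mandate: NHOP tunnels without a Weight sub-TLV are given
equal weight, and bandwidth is reserved on each NHOP in proportion to its
weight. The function and key names are illustrative.</t>

<figure><artwork><![CDATA[
```python
from fractions import Fraction

def build_forwarding_nexthop(nhop_tunnels, junction_bw):
    """Build per-NHOP forwarding entries from decoded NHOP tunnels.

    Each tunnel is a dict with 'link', 'label_stack' and an optional
    'weight'. Tunnels without a weight get weight 1 (assumption).
    """
    weights = [t.get("weight", 1) for t in nhop_tunnels]
    total = sum(weights)
    nexthop = []
    for tunnel, weight in zip(nhop_tunnels, weights):
        share = Fraction(weight, total)
        nexthop.append({
            "link": tunnel["link"],
            "label_stack": tunnel["label_stack"],
            "share": share,
            # Assumed policy: reserve in proportion to the weight.
            "reserved_bw": junction_bw * share,
        })
    return nexthop
```
]]></artwork></figure>

<t>With weights 1 and 3 and a Junction BW of 1000, the two NHOPs receive
shares of 1/4 and 3/4 and reservations of 250 and 750, respectively.</t>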

</section>
<section anchor="installing-routes"><name>Installing Routes</name>

<t>For each PHOP tunnel in the TEA, a label route is installed with the label
value in the Receiving MPLS Label Stack sub-TLV, pointing to the forwarding
nexthop built as specified above.</t>

</section>
<section anchor="sending-junction-state-route-acknowledgment"><name>Sending Junction State Route Acknowledgment</name>

<t>Each junction sends an acknowledgment back to the SS. Unless
ordered control is used, the SS makes sure that all junctions
are properly programmed before the tunnel is put into use.</t>

<t>The acknowledgment is simply the same Junction State route modified as follows:</t>

<t><list style="symbols">
  <t>The Originating Node Address is set to the junction node's address.</t>
  <t>The Route Target is set to target the SS.</t>
</list></t>
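
<t>The two modifications can be sketched in Python as follows. The sketch
is non-normative; the record type and its field names are illustrative,
and the Route Target is assumed to take the same form as for Junction
State routes (Global Administrator set to the target address, Local
Administrator set to 0), with the SS address supplied by the caller.</t>

<figure><artwork><![CDATA[
```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class JunctionStateRoute:
    mc_address: str
    mpted_id: int
    mpted_version: int
    junction_node: str
    originating_node: str
    route_target: tuple  # (Global Administrator, Local Administrator)

def make_acknowledgment(route, ss_address):
    """Derive the acknowledgment from a received Junction State route."""
    return replace(
        route,
        # The junction itself becomes the originator.
        originating_node=route.junction_node,
        # Assumed RT form; the route is now targeted at the SS.
        route_target=(ss_address, 0),
    )
```
]]></artwork></figure>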

</section>
</section>
<section anchor="ordered-control"><name>Ordered Control</name>

<t>When hop-by-hop ordered control is used, the Junction State route does not carry
encapsulation information (e.g., labels) in the PHOPs/NHOPs, and the
junction's forwarding state is not installed until at least one Junction
RESV route has been received from one of the NHOPs.
Each junction originates a Junction RESV route targeted at each of its
upstream junctions. The Route Type specific part of the NLRI is set
according to the Junction State route, with the Junction Node Address
set to that of the upstream junction, which is taken from either the Interface and
Node Address sub-TLV or the Full Link Identifier sub-TLV. The Originating
Node Address is set to that of this junction. A Route Target is used to
target the route at the upstream junction.</t>

<t>The Junction BW is set to the total BW to be
reserved on the upstream junction for this junction. A TEA is attached,
with only PHOP tunnels toward the upstream junctions.
The PHOP tunnel includes one of the following:</t>

<t><list style="symbols">
  <t>A Tunnel Egress Endpoint sub-TLV, in which the address is set to the
interface/neighbor address in the
Interface and Node Address sub-TLV in the corresponding PHOP in the
corresponding Junction State route.</t>
  <t>A Link Index sub-TLV, in which the Link Index is the Peer Link Index in the
Full Link Identifier sub-TLV in the corresponding PHOP in the corresponding
Junction State route.</t>
</list></t>
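
<t>The construction of a Junction RESV route from one PHOP tunnel of the
Junction State route can be sketched as follows. The sketch is
non-normative: routes and tunnels are modeled as Python dictionaries with
illustrative key names, and the Route Target is assumed to use the
upstream junction's address as Global Administrator with a Local
Administrator of 0.</t>

<figure><artwork><![CDATA[
```python
ROUTE_TYPE_JUNCTION_RESV = 2

def make_resv_route(state_key, this_node, phop, reserved_bw):
    """Build the Junction RESV route for one PHOP tunnel.

    state_key holds the MC address, MPTED ID, MPTED Version and Tunnel
    Type copied from the Junction State route. phop is the decoded
    PHOP tunnel with either 'intf_node_addr' or 'full_link_id'.
    """
    if "intf_node_addr" in phop:
        # (Peer Node address, Intf/Nbr address)
        upstream, intf_or_nbr = phop["intf_node_addr"]
        tunnel = {"tunnel_egress_endpoint": intf_or_nbr}
    else:
        # (Peer Node Address, Peer Link Index, Local Link Index)
        upstream, peer_link_index, _local = phop["full_link_id"]
        tunnel = {"link_index": peer_link_index}
    return {
        "route_type": ROUTE_TYPE_JUNCTION_RESV,
        **state_key,
        "junction_node": upstream,
        "originating_node": this_node,
        "junction_bw": reserved_bw,
        # Assumed RT form: (Global Admin, Local Admin).
        "route_target": (upstream, 0),
        "tea": [tunnel],
    }
```
]]></artwork></figure>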

</section>
<section anchor="route-update-and-withdrawal"><name>Route Update and Withdrawal</name>

<t>When a junction is updated (e.g., with added/removed/updated PHOPs/NHOPs),
the corresponding Junction State route is updated accordingly.
If a junction is deleted, the corresponding Junction State route is withdrawn.
Corresponding acknowledgement and reservation routes are updated, originated,
or withdrawn accordingly.</t>

</section>
<section anchor="routes-for-other-messages"><name>Routes For Other Messages</name>

<t>To be added.</t>


</section>
</section>
</section>
<section anchor="security-considerations"><name>Security Considerations</name>

<t>To be added.</t>

</section>
<section anchor="iana-considerations"><name>IANA Considerations</name>

<t>To be added.</t>

</section>
<section anchor="acknowledgments"><name>Acknowledgments</name>

<t>The authors thank Vishnu Pavan Beeram, Chandrasekar Ramachandran,
Sudharsana Venkataraman, and Jai Hari M K for their comments and suggestions.</t>

</section>


  </middle>

  <back>


    <references title='Normative References'>

&I-D.kompella-teas-mpte;
&I-D.ietf-bess-bgp-multicast-controller;
&RFC9815;
&RFC9012;
&RFC9830;
&RFC2119;
&RFC8174;


    </references>

    <references title='Informative References'>

&RFC9552;


    </references>



  </back>

</rfc>

