Network Working Group Y. Liu
Internet-Draft T. Jiang
Intended status: Informational China Mobile
Expires: 24 April 2023 T. Eckert
Futurewei
Z. Li
Huawei Technologies
G. Mishra
Verizon Inc.
Z. Qin
China Unicom
C. Lin
New H3C Technologies
X. Geng
Huawei
21 October 2022
Problem Statement of IPv6 Multicast Source Routing (MSR6)
draft-liu-msr6-problem-statement-01
Abstract
This document analyses the gaps in the existing IPv6 multicast
solutions under discussion in the IETF, based on the requirements of
the MSR6 use cases.
Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 24 April 2023.
Liu, et al. Expires 24 April 2023 [Page 1]
Internet-Draft Problem Statement of MSR6 October 2022
Copyright Notice
Copyright (c) 2022 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction
2. Problem Statement for Multicast of Large-scale Network
2.1. Typical Scenario in DCN
2.1.1. AI Training
2.1.2. HPC
2.1.3. Storage
3. Problem Statement for IPv6 Multicast with IPSec
4. Problem Statement for IPv6 Host-initiated Multicast
5. Summary
6. IANA Considerations
7. Security Considerations
8. Acknowledgements
9. Normative References
Authors' Addresses
1. Introduction
Multicast provides efficient point-to-multipoint (P2MP) service
without wasting bandwidth. The increasing amount of live video
traffic in the network brings new requirements for multicast
solutions. The existing multicast solutions require multicast tree
building in the control plane and maintain end-to-end tree state per
flow, which impacts router state capacity and network convergence
time. There has been a lot of work in the IETF to simplify service
deployment, in which source routing is a very important technology,
as used in SRv6, BIER, etc. Source routing reduces the state held in
intermediate nodes and lets the ingress node indicate multicast
forwarding, which could simplify multicast deployment. Source
routing requires sufficient flexibility in the forwarding plane, and
IPv6 has the advantage of good scalability. Therefore, it is
important to simplify multicast deployment and meet high-quality
service requirements with IPv6 source-routing-based multicast.
The MSR6 WG will focus on use cases identified in
[I-D.liu-msr6-use-cases] with the following set of characteristics:
- Large network scale with numerous multicast services
- IPv6 multicast flows transmitted through the Internet with a
requirement for encryption
- IPv6 host-initiated or overlay multicast transport
Based on these use cases, this document analyses the problems of the
existing IPv6 multicast solutions under discussion in the IETF. To
solve these problems, MSR6 can be used as a complementary multicast
solution.
2. Problem Statement for Multicast of Large-scale Network
In a large-scale network with numerous multicast services, the
existing multicast solutions have scalability issues.
Based on the use case document, 2 typical scenarios are considered as
examples:
* Multicast for 5G transport, e.g., with 1.5k egress nodes and 10k
multicast services;
* Multicast for DCN, e.g., with 3k switches, 60k links and 1k
multicast services.
If PIM/mLDP/P2MP RSVP-TE are used in these cases, per-flow state
protocols set up the multicast tree, which requires periodic state
refresh and the corresponding protocol messages. Multicast stream
state is maintained in the intermediate nodes. When there are
thousands of concurrent multicast services, per-flow state brings
scalability issues for network devices, especially when the
multicast tree is dynamic.
BIER/BIER-TE ([RFC8279]) was introduced in order to avoid explicit
multicast tree building and per-flow state in intermediate nodes.
But BIER faces challenges in a large-scale network. Bit position
allocation for BIER is related to the scale of the network topology.
The number of bit positions directly affects the BIFT size and the
bitstring length. When there are too many egress nodes/links in the
network, the encapsulation expense and the number of BIFT entries
could be unacceptable. If the network is divided into several
sub-domains (SDs) or set identifiers (SIs), too many copies are sent
and the traffic redundancy becomes excessive, degrading towards
head-end replication.
For example, if BIER is used for a P2MP tunnel in the network, a bit
position should be allocated for every egress node, i.e., 9k bit
positions for all possible leaves. In a sparse multicast example,
most of the bit positions are 0 and only a few of them are set. In
this case, the BIER header is inefficient and the encapsulation
expense is unacceptable. Considering that the number of bit
positions also determines the BIFT entry size, forwarding speed may
also be affected.
There are some possible methods to improve the situation in BIER.
For example, a "set" could be used to save bit positions, but
multiple packets have to be sent when the BFR-ids of the receivers
belong to different sets. And when the network is large, the benefit
of sets is limited. In the case shown above, even if 10 sets are
planned, about 900 bit positions are needed for each packet, and
different sets require different BIFTs in each node.
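The arithmetic behind the "set" trade-off can be sketched as follows.
This is only an illustration under stated assumptions: the function
name, the 1-based BFR-id numbering and the example receiver list are
hypothetical, and the bitstring length of 900 simply mirrors the
"9k leaves, 10 sets" figures above rather than a bitstring length
allowed by any BIER encapsulation.

```python
import math


def bier_set_cost(num_egress, bitstring_len, receivers):
    """Estimate BIER cost when BFR-ids are split into sets.

    num_egress:    number of egress nodes (one bit position each)
    bitstring_len: number of bits carried in one BIER header
    receivers:     BFR-ids (1-based, hypothetical) of one flow's receivers
    """
    # Number of sets needed to cover every egress node.
    num_sets = math.ceil(num_egress / bitstring_len)
    # Each BFR-id falls into exactly one set; one packet copy is
    # required for every distinct set that contains a receiver.
    sets_used = {(bfr_id - 1) // bitstring_len for bfr_id in receivers}
    copies = len(sets_used)
    # Fraction of the transmitted bitstring bits that identify a receiver.
    utilisation = len(receivers) / (copies * bitstring_len)
    return {"num_sets": num_sets, "copies": copies, "utilisation": utilisation}


# Three receivers spread over a 9k-leaf network planned with 10 sets.
result = bier_set_cost(num_egress=9000, bitstring_len=900,
                       receivers=[1, 950, 8999])
print(result)
```

In this sketch the three receivers fall into three different sets, so
the ingress sends three copies, each carrying a 900-bit bitstring in
which a single bit is set.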
In BIER-TE, the bitstring needs to carry bits to indicate not only
the receiving BFERs but also the intermediate hops/links across which
the packet must be sent. In the most common case, a bit position
should be allocated for every adjacency, so about 100k bit positions
are required. The bit positions representing the adjacencies that
the multicast tree goes through are set, and the rest of the bit
positions are set to 0. In the example above, 7 bit positions are
set in the bitstring. The BIER-TE header is less efficient and the
encapsulation expense is more significant, even compared to BIER.
Also, the controller has to allocate different BIFTs for 10k nodes.
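As a rough worked example of the BIER-TE figures above, the following
sketch computes the size and the density of a single flat bitstring.
It assumes, purely for illustration, that all 100k bit positions are
carried in one unpartitioned bitstring; the function name is
hypothetical.

```python
def bitstring_overhead(total_bit_positions, bits_set):
    """Return (bitstring size in bytes, fraction of bits that are set)."""
    # Round up to whole bytes.
    size_bytes = (total_bit_positions + 7) // 8
    density = bits_set / total_bit_positions
    return size_bytes, density


# About 100k adjacencies, 7 of which are on the multicast tree.
size_bytes, density = bitstring_overhead(100_000, 7)
print(size_bytes, density)
```

Under this assumption the bitstring alone would occupy 12500 bytes,
far more than a 1500-byte Ethernet MTU, while fewer than 0.01% of its
bits carry forwarding information.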
Some methods defined in BIER-TE are introduced to improve the
situation. "Set" could also be used, but it is not enough, as
analysed above. There are some other methods for reducing the number
of required bits, such as unicast (forward_routed()), ECMP() or flood
(DNC) over "uninteresting" sub-parts of the topology, which bring
different kinds of limitations for path planning.
Since the existing BIER/BIER-TE cannot satisfy the requirements of
multicast in large-scale networks, new source-routing-based
solutions need to be introduced for multicast and multicast TE.
Possible solutions can be found in the existing drafts. The basic
idea is a combination of an RH segment list and a bitstring to
specify the multicast path. The existing BIER header cannot satisfy
the requirement of encapsulating such information. Instead, the IPv6
Routing Header, combined with other IPv6 extension headers, can serve
the purpose well. A possible encapsulation is shown in the following
figure.
+--------------------------------+ ---
| IPv6 Header |
+--------------------------------+ IPv6 Multicast TE Tunnel Header
|IPv6 RH (Segment List/Bitstring)|
+--------------------------------+ ---
| Payload |
+--------------------------------+
2.1. Typical Scenario in DCN
In order to better show the requirements in the data center, we list
3 typical potential multicast scenarios with P2MP services: AI
training, HPC and storage.
The multicast requirements for large scale are expressed in 3
aspects:
- Network Scale: number of switches, number of links, number of hosts
- Multicast Tree Size: number of intermediate nodes, number of
receivers
- Multicast Service Number
2.1.1. AI Training
The following figure shows a typical RDMA AI training scenario.
PS(Parameter Server) Nodes
+-------+ +-------+
| CPU | | CPU |
| Server| | Server|
+-+-+-+-+ +-+-+-+-+
^ | | | | | | |
| +--|-|-|--------------+ | | |
| +----+ | +----------------------+ |
| | | +--------+ +-------+ | | V
Gradients | | | | | | Parameters
+---+-+-+ +---+-+-+ +-+---+-+
| GPU | | GPU | | GPU |
| Worker| | Worker| | Worker|
+-------+ +-------+ +-------+
Worker->PS: The gradient of each worker is pushed to the PS node.
PS->Worker: After aggregation, the parameters are distributed from
the PS back to all workers.
In this process, the second stage is information distribution, in
which the same data content is sent to every worker. With unicast, N
connections are used to transmit it separately. The bandwidth
efficiency is 1/N: the larger the scale, the lower the efficiency.
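The 1/N figure can be illustrated with a small calculation of the
load on the source's uplink. This is a minimal sketch; the function
names and the example payload size are hypothetical and not taken
from any measurement.

```python
def source_uplink_bytes(payload_bytes, receivers, multicast):
    """Bytes the source sends to deliver one payload to all receivers."""
    # With unicast the source sends one copy per receiver; with
    # multicast it sends one copy and the network replicates it.
    return payload_bytes if multicast else payload_bytes * receivers


def unicast_efficiency(receivers):
    """Useful share of the source's unicast traffic (1/N)."""
    return 1 / receivers


n = 1000                      # hypothetical number of workers
payload = 100 * 1024 * 1024   # hypothetical 100 MiB parameter update
print(source_uplink_bytes(payload, n, multicast=False))
print(source_uplink_bytes(payload, n, multicast=True))
print(unicast_efficiency(n))
```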
+---------------+
| Source |
| +---+ +---+ |
| |CPU| |GPU| |
| +-+-+ +-+-+ |
| | | |
| \ / |
| +-V---V-+ |
| | HCA | |
| +-------+ |
+--+-+-+-+-+-+--+
| | ... | |
+--V-V-----V-V--+
| Switch |
+-+-----------+-+
/ \
+-------------V-+ +-V------------+
| Destination | | Destination |
| +-------+ | | +-------+ |
| | HCA | | | | HCA | |
| +-V---V-+ | | +-V---V-+ |
| / \ | | / \ |
| | | | | | | |
| +-+-+ +-+-+ | | +-+-+ +-+-+ |
| |CPU| |GPU| | | |CPU| |GPU| |
| +---+ +---+ | | +---+ +---+ |
+---------------+ +---------------+
If the source sends only 1 copy to the network and the switches
replicate the packet to the different destinations, the use of
bandwidth is more efficient and the training is faster.
The large-scale multicast requirement in this scenario is as
follows:
- Network Scale: 10-10k GPUs
- Multicast Tree Size: 10-10k receivers
- Multicast Service Number: depends on the scenario
2.1.2. HPC
The following is an example of MPI in an HPC scenario.
+-------------------------------------------+
| Dispatcher |
| Master |
+---------------------+---------------------+
|
+-----------------+
|
+---+----+ +--------+ +--------+
|+--V---+| |+------+| |+------+|
||Dispa-|| ||Dispa-|| ||Dispa-||
||Agent || ||Agent || ||Agent ||
|+---+--+| |+---+--+| |+---+--+|
| | | | | | | | |
|+---V--+| |+---V--+| |+---V--+|
|| MPI || || MPI || ... || MPI ||
||Proces|| ||Proces|| ||Proces||
|+---^--+| |+---^--+| |+---^--+|
| | | | | | | | |
|+---V--+| |+---V--+| |+---V--+|
|| RoCE |<-->| RoCE |<------------->| RoCE ||
|+------+| |+------+| |+------+|
+--------+ +--------+ +--------+
Stage 1: The Dispatcher Master senses millions of cores and schedules
millions of Rank MPI jobs on demand. The Dispatcher Master sends the
scheduling results to a Dispatcher Agent.
Stage 2: The Dispatcher Agent starts Million Rank MPI on each node.
The Dispatcher Agent that receives the message broadcasts the message
to the other Dispatcher Agents and does the initialization before
starting the MPI application.
Stage 3: The Dispatcher Agent broadcasts the message to start the MPI
application. After the MPI application is started, MPI internal
initialization synchronizes the RoCE endpoints in an allgather
manner.
The last 2 stages could benefit from multicast, which reduces task
completion time.
The large-scale multicast requirement in this scenario is as
follows:
- Network Scale: 1000k CPU/GPU
- Multicast Tree Size: 10k~100k receivers
- Multicast Service Number: 1~100
2.1.3. Storage
Ceph is an open-source distributed software platform. It mainly
focuses on scale-out file systems, including storage distribution and
availability, and is widely used in storage.
Ceph Object Storage Daemons (OSDs) are responsible for storing
objects on a local file system on behalf of Ceph clients. Also, Ceph
OSDs use the CPU, memory, and networking of Ceph cluster nodes for
data replication, erasure coding, recovery, monitoring and reporting
functions.
The following process requires P2MP service:
- The application initiates a "write" operation from a client to a
server.
- The client finds the servers to write to, and 3 copies are sent to
3 servers.
+-------+ +-------+
|Client1| |Client2|
+---+---+ +---+---+
| |
+---------+--------+
|
+-------+-------+
| Switch |
+-------+-------+
|
+----------------+----------------+
| | |
+---+---+ +---+---+ +---+---+
| Server| | Server| | Server|
+-------+ +-------+ +-------+
The large-scale multicast requirement in this scenario is as
follows:
- Network Scale: 3k Servers (1 Pod)
- Multicast Tree Size: 3 receivers
- Multicast Service Number: 10k
3. Problem Statement for IPv6 Multicast with IPSec
In a typical scenario such as IPv6-based SD-WAN, the multicast
traffic may traverse the Internet through an IPv6-based multicast
tunnel. At the same time, the traffic must be encrypted for
security. IPSec can be adopted for encryption.
The independent layer design of BIER brings the following challenges:
Option 1: If the IPv6 IPSec extension header is used for security
(shown in the following figure), the BIER header will be encrypted
and the traffic steering information cannot be read by the BIER
nodes. That is, BIER cannot work with this option.
+--------------------------------+ ---
| IPv6 Header | ^
+--------------------------------+ |
| IPv6 IPSec Header (ESP & AH) | IPv6 Multicast Tunnel Header
+--------------------------------+ |
| BIER Header | |
+--------------------------------+ ---
| Payload |
+--------------------------------+
Option 2: In order for the BIER header to work while implementing the
security function, a new security header may have to be introduced
for the BIER layer (shown in the following figure). This means that:
1) the existing IPv6 IPSec extension header cannot be reused; 2)
there can be conflicting functions in the two layers: the IPv6 layer
and the BIER layer.
+--------------------------------+ ---
| IPv6 Header | ^
+--------------------------------+ |
| BIER Header | IPv6 Multicast Tunnel Header
+--------------------------------+ |
| New Security Header | |
+--------------------------------+ ---
| Payload |
+--------------------------------+
MSR6, which is designed based on native IPv6, can reuse the IPv6
Authentication Header and the Encapsulating Security Payload header.
If MSR6 is used in this case, the packet is encapsulated as follows
to implement end-to-end multicast security:
+--------------------------------+ ---
| IPv6 Header | ^
+--------------------------------+ |
| IPv6 EH (MSR6 EH or Options) | IPv6 Multicast Tunnel Header
+--------------------------------+ |
| IPv6 IPSec Header (ESP & AH) | |
+--------------------------------+ ---
| Payload |
+--------------------------------+
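The difference between the three header stacks in this section can be
sketched with a toy model of what a transit node is able to read.
This is a simplification under stated assumptions: only ESP is
modelled as hiding the headers that follow it (AH provides integrity
but not confidentiality), and the header labels and function names
are hypothetical.

```python
# Headers that encrypt everything that follows them in the packet.
ENCRYPTING = {"ESP"}


def visible_to_transit(stack):
    """Headers a transit node can read, from the outermost inwards."""
    visible = []
    for header in stack:
        visible.append(header)
        if header in ENCRYPTING:
            # Everything after the ESP header is ciphertext.
            break
    return visible


def steering_readable(stack, steering_header):
    """True if the steering header sits before any encrypting header."""
    return steering_header in visible_to_transit(stack)


bier_option_1 = ["IPv6", "ESP", "BIER", "Payload"]
bier_option_2 = ["IPv6", "BIER", "New Security Header", "Payload"]
msr6_stack = ["IPv6", "MSR6 EH", "ESP", "Payload"]

print(steering_readable(bier_option_1, "BIER"))
print(steering_readable(bier_option_2, "BIER"))
print(steering_readable(msr6_stack, "MSR6 EH"))
```

In this model the BIER header of Option 1 is hidden behind ESP, while
the steering information of Option 2 and of the MSR6 stack stays
readable. Only the MSR6 stack achieves this with the existing ESP
header.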
Just like IPsec, there are other existing functionalities that have
been defined in the IETF based on IPv6, for example fragmentation,
network slicing, IOAM, etc., which could all be reused in MSR6
because it is based on the IPv6 data plane. In comparison, these
functions/headers have to be defined again if they are to be used in
BIER, which brings redundancy.
4. Problem Statement for IPv6 Host-initiated Multicast
In the IPv6 host-initiated multicast scenarios, the host originates
the IPv6 packet to be replicated to the different leaf hosts. The
packet originated by the host may have the format shown in the
following figure. The packet has IP layer and transport layer
encapsulation.
+--------------------------------+ ---
| IPv6 Header | IP Layer
+--------------------------------+ ---
| UDP Header | Transport Layer
+--------------------------------+ ---
| Payload |
+--------------------------------+
If BIER is adopted for multicast traffic steering, the independent
layer design of BIER may make the packet originated by the host look
as follows. This violates the layered architecture of the Internet
because it introduces an extra layer (the BIER layer). This does not
work in the host.
+--------------------------------+ ---
| IPv6 Header | IP Layer
+--------------------------------+ ---
| BIER Header | BIER Layer
+--------------------------------+ ---
| UDP Header | Transport Layer
+--------------------------------+ ---
| Payload |
+--------------------------------+
For MSR6, multicast traffic steering information is encapsulated in
the IPv6 extension header, as shown in the following figure. This
maintains the layered architecture of the Internet.
+--------------------------------+ ---
| IPv6 Header |
+--------------------------------+ IP Layer
| IPv6 EH (MSR6 EH or Options) |
+--------------------------------+ ---
| UDP Header | Transport Layer
+--------------------------------+ ---
| Payload |
+--------------------------------+
Besides, multicast source routing requires no explicit multicast tree
setup protocol. The network devices replicate and forward the packet
based only on the MSR6 header encapsulated by the host.
5. Summary
In summary, the use cases have the following characteristics:
- Large network scale with numerous multicast services
- IPv6 multicast flows transmitted through the Internet with a
requirement for encryption
- IPv6 host-initiated or overlay multicast transport
According to the analysis of the problems of the existing multicast
solutions, an MSR6 solution should be introduced to satisfy these
requirements. It takes advantage of IPv6 extension headers to
encapsulate the extensible multicast traffic steering information and
reuses the existing IPv6 encapsulations such as IPSec. There can be
a unified encapsulation for the IPv6 tunneled packet and the IPv6
host-initiated packet. The abstract MSR6 header is shown in the
following figure:
+--------------------------------+
| IPv6 Header |
+--------------------------------+
|IPv6 RH (Segment List/Bitstring)|
+--------------------------------+
| IPv6 EH (MCAST Options) |
+--------------------------------+
| IPv6 IPSec Header (ESP & AH) |
+--------------------------------+
6. IANA Considerations
This document makes no request of IANA.
7. Security Considerations
TBD
8. Acknowledgements
TBD
9. Normative References
[I-D.cheng-spring-ipv6-msr-design-consideration]
Cheng, W., Mishra, G., Li, Z., Wang, A., Qin, Z., and C.
Fan, "Design Consideration of IPv6 Multicast Source
Routing (MSR6)", Work in Progress, Internet-Draft, draft-
cheng-spring-ipv6-msr-design-consideration-01, 25 October
2021, <https://www.ietf.org/archive/id/draft-cheng-spring-
ipv6-msr-design-consideration-01.txt>.
[I-D.liu-msr6-use-cases]
Liu, Y., Yang, F., Wang, A., Zhang, X., Geng, X., and Z.
Li, "MSR6(Multicast Source Routing over IPv6) Use Cases",
Work in Progress, Internet-Draft, draft-liu-msr6-use-
cases-01, 11 July 2022, <https://www.ietf.org/archive/id/
draft-liu-msr6-use-cases-01.txt>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
[RFC8279] Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A.,
Przygienda, T., and S. Aldrin, "Multicast Using Bit Index
Explicit Replication (BIER)", RFC 8279,
DOI 10.17487/RFC8279, November 2017,
<https://www.rfc-editor.org/info/rfc8279>.
[RFC8296] Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A.,
Tantsura, J., Aldrin, S., and I. Meilik, "Encapsulation
for Bit Index Explicit Replication (BIER) in MPLS and Non-
MPLS Networks", RFC 8296, DOI 10.17487/RFC8296, January
2018, <https://www.rfc-editor.org/info/rfc8296>.
[RFC8663] Xu, X., Bryant, S., Farrel, A., Hassan, S., Henderickx,
W., and Z. Li, "MPLS Segment Routing over IP", RFC 8663,
DOI 10.17487/RFC8663, December 2019,
<https://www.rfc-editor.org/info/rfc8663>.
Authors' Addresses
Yisong Liu
China Mobile
Email: liuyisong@chinamobile.com
Tianji Jiang
China Mobile
1525 McCathy Blvd.
Milpitas, CA 95035
United States of America
Email: tianjijiang@chinamobile.com
Toerless Eckert
Futurewei
Email: tte+ietf@cs.fau.de
Zhenbin Li
Huawei Technologies
Email: lizhenbin@huawei.com
Gyan Mishra
Verizon Inc.
Email: gyan.s.mishra@verizon.com
Zhuangzhuang Qin
China Unicom
Email: qinzhuangzhuang@chinaunicom.cn
Changwang Lin
New H3C Technologies
Email: linchangwang.04414@h3c.com
Xuesong Geng
Huawei
Email: gengxuesong@huawei.com