Network Working Group                                             Y. Liu
Internet-Draft                                                  T. Jiang
Intended status: Informational                              China Mobile
Expires: 24 April 2023                                         T. Eckert
                                                               Futurewei
                                                                   Z. Li
                                                     Huawei Technologies
                                                               G. Mishra
                                                            Verizon Inc.
                                                                  Z. Qin
                                                            China Unicom
                                                                  C. Lin
                                                    New H3C Technologies
                                                                 X. Geng
                                                                  Huawei
                                                         21 October 2022


        Problem Statement of IPv6 Multicast Source Routing (MSR6)
                  draft-liu-msr6-problem-statement-01

Abstract

   This document analyses the gaps in the existing IPv6 multicast
   solutions under discussion in the IETF, based on the requirements of
   the MSR6 use cases.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 24 April 2023.



Liu, et al.               Expires 24 April 2023                 [Page 1]

Internet-Draft          Problem Statement of MSR6           October 2022


Copyright Notice

   Copyright (c) 2022 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Problem Statement for Multicast of Large-scale Network  . . .   3
     2.1.  Typical Scenario in DCN . . . . . . . . . . . . . . . . .   5
       2.1.1.  AI Training . . . . . . . . . . . . . . . . . . . . .   5
       2.1.2.  HPC . . . . . . . . . . . . . . . . . . . . . . . . .   7
       2.1.3.  Storage . . . . . . . . . . . . . . . . . . . . . . .   8
   3.  Problem Statement for IPv6 Multicast with IPSec . . . . . . .   9
   4.  Problem Statement for IPv6 Host-initiated Multicast . . . . .  10
   5.  Summary . . . . . . . . . . . . . . . . . . . . . . . . . . .  11
   6.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  12
   7.  Security Considerations . . . . . . . . . . . . . . . . . . .  12
   8.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  12
   9.  Normative References  . . . . . . . . . . . . . . . . . . . .  12
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  13

1.  Introduction

   Multicast provides efficient point-to-multipoint (P2MP) service
   without wasting bandwidth.  The increasing amount of live video
   traffic in the network brings new requirements for multicast
   solutions.  The existing multicast solutions require multicast tree
   building in the control plane and maintain end-to-end tree state per
   flow, which impacts router state capacity and network convergence
   time.  There has been a lot of work in the IETF to simplify service
   deployment, in which source routing is a very important technology,
   including SRv6, BIER, etc.  Source routing reduces the state held in
   intermediate nodes and lets the ingress node indicate multicast
   forwarding, which could simplify multicast deployment.  Source
   routing requires sufficient flexibility in the forwarding plane, and
   IPv6 has the advantage of good scalability.  Therefore, it is
   important to simplify multicast deployment and meet high-quality
   service requirements with multicast based on IPv6 source routing.



   The MSR6 WG will focus on use cases identified in
   [I-D.liu-msr6-use-cases] with the following set of characteristics:

   - Large network scale with numerous multicast services

   - IPv6 multicast flows traversing the Internet with a requirement
   for encryption

   - IPv6 host-initiated or overlay multicast transport

   Based on these use cases, this document analyses the problems of the
   existing IPv6 multicast solutions under discussion in the IETF.  To
   solve these problems, MSR6 can be used as a complementary multicast
   solution.

2.  Problem Statement for Multicast of Large-scale Network

   In a large-scale network with numerous multicast services, the
   existing multicast solutions run into scalability issues.

   Based on the use case document, 2 typical scenarios are considered as
   examples:

   *  Multicast for 5G transport, e.g., with 1.5k egress nodes and 10k
      multicast services;

   *  Multicast for DCN, e.g., with 3k switches, 60k links, and 1k
      multicast services.

   If PIM/mLDP/P2MP RSVP-TE are used in these cases, per-flow state
   protocols set up the multicast tree, which requires periodic state
   refresh and the corresponding protocol messages.  Multicast stream
   state is maintained in the intermediate nodes.  When there are
   thousands of concurrent multicast services, per-flow state brings
   scalability issues for network devices, especially when the
   multicast tree is dynamic.

   BIER/BIER-TE ([RFC8279]) was introduced to avoid explicit multicast
   tree building and per-flow state in intermediate nodes.  But BIER
   faces a challenge in a large-scale network.  Bit position allocation
   in BIER depends on the scale of the network topology.  The number of
   bit positions directly affects the BIFT size and the bitstring
   length.  When there are too many egress nodes/links in the network,
   the encapsulation expense and the number of BIFT entries could be
   unacceptable.  If the network is divided into several SDs or SIs, a
   packet has to be sent once per SD/SI, and the resulting copies cause
   excessive traffic redundancy, similar to a degradation to head-end
   replication.
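
   The effect of splitting receivers across sets can be sketched as
   follows.  The mapping from BFR-id to set identifier and bit position
   follows [RFC8279]; the receiver list and the BitStringLength of 256
   are assumptions chosen only for illustration.

```python
from collections import defaultdict

def bier_copies(bfr_ids, bitstring_length=256):
    """Group egress BFR-ids by Set Identifier (SI), per RFC 8279.

    Returns {si: sorted bit positions}. The ingress has to send one
    packet copy per SI that contains at least one receiver.
    """
    per_set = defaultdict(list)
    for bfr_id in bfr_ids:
        si = (bfr_id - 1) // bitstring_length       # set identifier
        bit = (bfr_id - 1) % bitstring_length + 1   # 1-based bit position
        per_set[si].append(bit)
    return {si: sorted(bits) for si, bits in sorted(per_set.items())}

# Hypothetical sparse receiver set in a network with 1.5k egress nodes.
receivers = [3, 300, 700, 1100, 1499]
copies = bier_copies(receivers)
print(f"{len(copies)} copies for {len(receivers)} receivers")
```

   With these assumed values every receiver falls into a different set,
   so the ingress sends as many copies as there are receivers, which is
   the head-end replication behaviour described above.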




   For example, if BIER as defined in [RFC8279] is used for a P2MP
   tunnel in the network, a bit position should be allocated for every
   egress node, i.e., 9k bit positions for all possible leaves.  In a
   sparse multicast example, most of the bit positions are 0 and only a
   few of them are set.  In this case, the BIER header is inefficient
   and the encapsulation expense is unacceptable.  Considering that the
   number of bit positions also determines the BIFT size, forwarding
   speed may also be affected.

   There are some possible methods to improve the situation in BIER.
   For example, a "set" could be used to save bit positions, but
   multiple packets have to be sent when the BFR-ids of the receivers
   belong to different sets.  And when the network is large, the
   benefit of sets is limited.  In the case shown above, even if 10
   sets are planned, about 900 bit positions are still needed for each
   packet, and different sets require different BIFTs in each node.
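
   The arithmetic behind this example can be checked with a short
   sketch.  The 9k bit positions and the 10 sets come from the text
   above; the list of BitStringLength values is the set that the
   [RFC8296] encapsulation can encode, and the even split across sets
   is an assumption.

```python
import math

# BitStringLength values (in bits) that the RFC 8296 header can encode.
SUPPORTED_BSL = (64, 128, 256, 512, 1024, 2048, 4096)

def bits_per_set(total_bit_positions, num_sets):
    """Bit positions each set must cover if they are split evenly."""
    return math.ceil(total_bit_positions / num_sets)

def smallest_bsl(bits_needed):
    """Smallest encodable BitStringLength that fits, or None if none does."""
    for bsl in SUPPORTED_BSL:
        if bsl >= bits_needed:
            return bsl
    return None

for num_sets in (1, 10):
    needed = bits_per_set(9_000, num_sets)
    bsl = smallest_bsl(needed)
    size = f"{bsl // 8} bytes of bitstring" if bsl else "no single bitstring fits"
    print(f"{num_sets:>2} set(s): {needed} bits each -> {size}")
```

   Under these assumptions, 9k bit positions do not fit in any single
   bitstring, and with 10 sets each packet still carries a 1024-bit
   bitstring.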

   In BIER-TE, the bitstring needs to carry bits that indicate not only
   the receiving BFERs but also the intermediate hops/links across
   which the packet must be sent.  In the most common case, a bit
   position should be allocated for every adjacency, so about 100k bit
   positions are required.  The bit positions representing adjacencies
   that the multicast tree goes through are set, and the rest of the
   bit positions are set to 0.  In the example above, 7 bit positions
   are set in the bitstring.  The BIER-TE header is less efficient and
   the encapsulation expense is more significant, even compared to
   BIER.  Also, the controller is supposed to allocate different BIFTs
   for 10k nodes.
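
   A minimal sketch of this sparsity, assuming a flat bitstring with
   one bit per adjacency: the total of 100k bit positions and the 7 set
   bits come from the text above, while the individual bit positions
   are made up for illustration.

```python
TOTAL_ADJACENCY_BITS = 100_000  # one bit position per adjacency (from the text)

def encode_tree(adjacency_bit_positions):
    """Return an int whose set bits mark the adjacencies on the tree."""
    bitstring = 0
    for bp in adjacency_bit_positions:
        if not 1 <= bp <= TOTAL_ADJACENCY_BITS:
            raise ValueError(f"bit position {bp} out of range")
        bitstring |= 1 << (bp - 1)  # bit positions are 1-based
    return bitstring

# Hypothetical tree that crosses 7 adjacencies.
tree = [12, 480, 7_301, 20_050, 45_999, 71_234, 99_876]
bitstring = encode_tree(tree)

set_bits = bin(bitstring).count("1")
bitstring_bytes = (TOTAL_ADJACENCY_BITS + 7) // 8
print(f"{set_bits} of {TOTAL_ADJACENCY_BITS} bits set, "
      f"{bitstring_bytes} bytes of bitstring per packet")
```

   A flat bitstring of this size would occupy 12500 bytes to convey 7
   meaningful bits, which is why the text calls the encapsulation
   expense significant.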

   Some methods defined in BIER-TE are introduced to improve the
   situation.  "Sets" could also be used, but they are not sufficient,
   as analysed above.  There are some other methods for reducing the
   number of required bits, such as unicast (forward_routed()), ECMP(),
   or flood (DNC) over "uninteresting" sub-parts of the topology, each
   of which brings different kinds of limitations for path planning.

   Since the existing BIER/BIER-TE cannot satisfy the requirements of
   multicast in a large-scale network, new source-routing-based
   solutions need to be introduced for multicast, including multicast
   TE.  Possible solutions are defined in existing drafts.  The basic
   idea is to combine an RH segment list and a bitstring to specify the
   multicast path.  The existing BIER header cannot encapsulate such
   information.  Instead, the IPv6 Routing Header, combined with other
   IPv6 extension headers, can serve the purpose well.  A possible
   encapsulation is shown in the following figure.




      +--------------------------------+      ---
      |          IPv6 Header           |
      +--------------------------------+ IPv6 Multicast TE Tunnel Header
      |IPv6 RH (Segment List/Bitstring)|
      +--------------------------------+      ---
      |            Payload             |
      +--------------------------------+
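
   The combination of a segment list and a bitstring can be modelled
   with a small data structure.  This is only a sketch: the class name,
   its fields, and the interpretation of the bitstring as replication
   targets after the last segment are assumptions for illustration and
   are not an encoding defined by this document or by any draft.

```python
from dataclasses import dataclass, field
from ipaddress import IPv6Address
from typing import List

@dataclass
class MulticastSourceRoute:
    """Hypothetical model: segments to steer through, then a bitstring
    naming the replication targets."""
    segments: List[IPv6Address] = field(default_factory=list)
    bitstring: int = 0

    def add_target(self, bit_position: int) -> None:
        if bit_position < 1:
            raise ValueError("bit positions are 1-based")
        self.bitstring |= 1 << (bit_position - 1)

    def targets(self) -> List[int]:
        """Return the 1-based bit positions that are set."""
        return [i + 1 for i in range(self.bitstring.bit_length())
                if (self.bitstring >> i) & 1]

# Addresses from the 2001:db8::/32 documentation prefix.
route = MulticastSourceRoute(
    segments=[IPv6Address("2001:db8:0:1::1"), IPv6Address("2001:db8:0:2::1")]
)
for bp in (2, 5):
    route.add_target(bp)
print(route.segments, route.targets())
```

   In this model the segment list keeps the bitstring short, because
   the bits only need to be meaningful at the node reached through the
   segments rather than across the whole network.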

2.1.  Typical Scenario in DCN

   To better show the requirements in the data center, we list 3
   typical potential multicast scenarios with P2MP services: AI
   training, HPC, and storage.

   The large-scale multicast requirements are expressed in 3 aspects:

   - Network Scale: number of switches, number of links, number of hosts

   - Multicast Tree Size: number of intermediate nodes, number of
   receivers

   - Multicast Service Number

2.1.1.  AI Training

   The following figure shows a typical RDMA AI training scenario.

                    PS(Parameter Server) Nodes
                  +-------+          +-------+
                  |  CPU  |          |  CPU  |
                  | Server|          | Server|
                  +-+-+-+-+          +-+-+-+-+
       ^            | | |              | | |          |
       |         +--|-|-|--------------+ | |          |
       |       +----+ | +----------------------+      |
       |       | |    +--------+ +-------+ |   |      V
   Gradients   | |             | |         |   | Parameters
           +---+-+-+       +---+-+-+     +-+---+-+
           |  GPU  |       |  GPU  |     |  GPU  |
           | Worker|       | Worker|     | Worker|
           +-------+       +-------+     +-------+

   Worker->PS: The gradient of each worker is pushed to the PS node.

   PS->Worker: The PS sends the parameters back to all workers after
   aggregation.





   In this process, the second stage is information distribution, in
   which the same data content is sent to every worker.  N connections
   are used to transmit it separately by unicast.  The bandwidth
   efficiency is 1/N: the larger the scale, the lower the efficiency.

                         +---------------+
                         |     Source    |
                         | +---+   +---+ |
                         | |CPU|   |GPU| |
                         | +-+-+   +-+-+ |
                         |   |       |   |
                         |    \     /    |
                         |   +-V---V-+   |
                         |   |  HCA  |   |
                         |   +-------+   |
                         +--+-+-+-+-+-+--+
                            | | ... | |
                         +--V-V-----V-V--+
                         |     Switch    |
                         +-+-----------+-+
                          /             \
           +-------------V-+           +-V-------------+
           |  Destination  |           |  Destination  |
           |   +-------+   |           |   +-------+   |
           |   |  HCA  |   |           |   |  HCA  |   |
           |   +-V---V-+   |           |   +-V---V-+   |
           |    /     \    |           |    /     \    |
           |   |       |   |           |   |       |   |
           | +-+-+   +-+-+ |           | +-+-+   +-+-+ |
           | |CPU|   |GPU| |           | |CPU|   |GPU| |
           | +---+   +---+ |           | +---+   +---+ |
           +---------------+           +---------------+

   If the source sends only 1 copy to the network and the switches
   replicate the packet to the different destinations, the use of
   bandwidth is more efficient and the training is faster.
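
   The 1/N relation can be made concrete with a short sketch.  It
   assumes that efficiency is measured on the source's uplink as one
   useful copy divided by the number of copies the source sends; the
   function names are illustrative.

```python
def source_copies(receivers, multicast):
    """Copies of the same data that the source must put on its uplink."""
    return 1 if multicast else receivers

def bandwidth_efficiency(receivers, multicast):
    """One useful copy divided by the copies the source actually sends."""
    return 1 / source_copies(receivers, multicast)

for n in (10, 1_000, 10_000):
    unicast = bandwidth_efficiency(n, multicast=False)
    mcast = bandwidth_efficiency(n, multicast=True)
    print(f"N={n:>6}: unicast {unicast:.4%}, multicast {mcast:.0%}")
```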

   The large-scale multicast requirements in this scenario are as
   follows:

   - Network Scale: 10-10k GPUs

   - Multicast Tree Size: 10-10k receivers

   - Multicast Service Number: depends on the scenario






2.1.2.  HPC

   The following is an example of MPI in an HPC scenario.

         +-------------------------------------------+
         |                Dispatcher                 |
         |                  Master                   |
         +---------------------+---------------------+
                               |
             +-----------------+
             |
         +---+----+  +--------+             +--------+
         |+--V---+|  |+------+|             |+------+|
         ||Dispa-||  ||Dispa-||             ||Dispa-||
         ||Agent ||  ||Agent ||             ||Agent ||
         |+---+--+|  |+---+--+|             |+---+--+|
         |    |   |  |    |   |             |    |   |
         |+---V--+|  |+---V--+|             |+---V--+|
         ||  MPI ||  ||  MPI ||     ...     ||  MPI ||
         ||Proces||  ||Proces||             ||Proces||
         |+---^--+|  |+---^--+|             |+---^--+|
         |    |   |  |    |   |             |    |   |
         |+---V--+|  |+---V--+|             |+---V--+|
         || RoCE |<-->| RoCE |<------------->| RoCE ||
         |+------+|  |+------+|             |+------+|
         +--------+  +--------+             +--------+

   Stage 1: The Dispatcher Master senses millions of cores and
   schedules million-rank MPI jobs on demand.  The Dispatcher Master
   sends the scheduling results to a Dispatcher Agent.

   Stage 2: The Dispatcher Agents start the million-rank MPI job on
   each node.  The Dispatcher Agent that receives the message
   broadcasts it to the other Dispatcher Agents, which do the
   initialization before starting the MPI application.

   Stage 3: The Dispatcher Agent broadcasts the message to start the
   MPI application.  After the MPI application is started, MPI internal
   initialization synchronizes the RoCE endpoints in an allgather
   manner.

   The last 2 stages could benefit from multicast and reduce task
   completion time.


   The large-scale multicast requirements in this scenario are as
   follows:

   - Network Scale: 1000k CPUs/GPUs



   - Multicast Tree Size: 10k~100k receivers

   - Multicast Service Number: 1~100

2.1.3.  Storage

   Ceph is an open-source distributed storage platform.  It mainly
   focuses on a scale-out file system, including storage distribution
   and availability, and is widely used in storage.

   Ceph Object Storage Daemons (OSDs) are responsible for storing
   objects on a local file system on behalf of Ceph clients.  Also,
   Ceph OSDs use the CPU, memory, and networking of Ceph cluster nodes
   for data replication, erasure coding, recovery, monitoring, and
   reporting functions.

   The following process requests P2MP service:

   - An application initiates a "write" operation from a client to a
   server.

   - The client finds the servers to write to, and 3 copies are sent to
   3 servers.

                  +-------+          +-------+
                  |Client1|          |Client2|
                  +---+---+          +---+---+
                      |                  |
                      +---------+--------+
                                |
                        +-------+-------+
                        |     Switch    |
                        +-------+-------+
                                |
               +----------------+----------------+
               |                |                |
           +---+---+        +---+---+        +---+---+
           | Server|        | Server|        | Server|
           +-------+        +-------+        +-------+

   The large-scale multicast requirements in this scenario are as
   follows:

   - Network Scale: 3k Servers (1 Pod)

   - Multicast Tree Size: 3 receivers

   - Multicast Service Number: 10k




3.  Problem Statement for IPv6 Multicast with IPSec

   In typical scenarios such as IPv6-based SD-WAN, the multicast
   traffic may traverse the Internet through an IPv6-based multicast
   tunnel.  At the same time, the traffic must be encrypted for
   security.  IPSec can be adopted for encryption.

   The independent layer design of BIER brings the following challenges:

   Option 1: If the IPv6 IPSec extension header is used for security
   (shown in the following figure), the BIER header will be encrypted
   and the traffic steering information cannot be read by the BIER
   nodes.  That is, BIER cannot work with this option.

        +--------------------------------+      ---
        |          IPv6 Header           |       ^
        +--------------------------------+       |
        |  IPv6 IPSec Header (ESP & AH)  | IPv6 Multicast Tunnel Header
        +--------------------------------+       |
        |           BIER Header          |       |
        +--------------------------------+      ---
        |            Payload             |
        +--------------------------------+

   Option 2: In order for the BIER header to work while the security
   function is implemented, a new security header may have to be
   introduced for the BIER layer (shown in the following figure).  This
   means that: 1) the existing IPv6 IPSec extension header cannot be
   reused; 2) there can be conflicting functions in the two layers,
   i.e., the IPv6 layer and the BIER layer.

        +--------------------------------+      ---
        |          IPv6 Header           |       ^
        +--------------------------------+       |
        |          BIER Header           | IPv6 Multicast Tunnel Header
        +--------------------------------+       |
        |       New Security Header      |       |
        +--------------------------------+      ---
        |            Payload             |
        +--------------------------------+

   MSR6, which is designed based on native IPv6, can reuse the IPv6
   Authentication Header and Encapsulating Security Payload header.  If
   MSR6 is used in this case, the packet is encapsulated as follows to
   implement end-to-end multicast security:





        +--------------------------------+      ---
        |          IPv6 Header           |       ^
        +--------------------------------+       |
        |  IPv6 EH (MSR6 EH or Options)  | IPv6 Multicast Tunnel Header
        +--------------------------------+       |
        |  IPv6 IPSec Header (ESP & AH)  |       |
        +--------------------------------+      ---
        |            Payload             |
        +--------------------------------+
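
   The difference between Option 1 and the MSR6 encapsulation comes
   down to header order, which can be sketched as follows.  The model
   assumes that ESP encrypts everything that follows the ESP header and
   omits AH for brevity; the header names are labels only.

```python
def readable_by_transit(header_chain):
    """Headers a transit node can parse: everything up to and including
    the first ESP header. What follows ESP is ciphertext to that node."""
    readable = []
    for header in header_chain:
        readable.append(header)
        if header == "ESP":
            break
    return readable

option_1 = ["IPv6", "ESP", "BIER", "Payload"]      # BIER behind ESP
msr6 = ["IPv6", "MSR6 EH", "ESP", "Payload"]       # steering ahead of ESP

for name, chain in (("Option 1", option_1), ("MSR6", msr6)):
    print(f"{name}: transit nodes can read {readable_by_transit(chain)}")
```

   In the first chain the steering information sits behind ESP and is
   hidden from transit nodes; in the second it precedes ESP and remains
   readable while the payload stays encrypted.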

   Like IPsec, there are other existing functionalities that the IETF
   has defined based on IPv6, for example fragmentation, network
   slicing, IOAM, etc., which can all be reused in MSR6 because it is
   based on the IPv6 data plane.  By comparison, these functions and
   headers have to be defined again if they are to be used in BIER,
   which brings redundancy.

4.  Problem Statement for IPv6 Host-initiated Multicast

   In IPv6 host-initiated multicast scenarios, the host originates the
   IPv6 packet that is to be replicated to the different leaf hosts.
   The packet originated by the host may have the format shown in the
   following figure.  The packet is encapsulated with an IP layer and a
   transport layer.

        +--------------------------------+       ---
        |          IPv6 Header           |    IP Layer
        +--------------------------------+       ---
        |          UDP Header            | Transport Layer
        +--------------------------------+       ---
        |            Payload             |
        +--------------------------------+

   If BIER is adopted for multicast traffic steering, the independent
   layer design of BIER may make the packet originated by the host look
   as follows.  This violates the layered architecture of the Internet,
   because it introduces an extra layer (the BIER layer), which does
   not work in the host.

        +--------------------------------+       ---
        |          IPv6 Header           |    IP Layer
        +--------------------------------+       ---
        |          BIER Header           |   BIER Layer
        +--------------------------------+       ---
        |          UDP Header            | Transport Layer
        +--------------------------------+       ---
        |            Payload             |
        +--------------------------------+



   For MSR6, the multicast traffic steering information is encapsulated
   in the IPv6 extension header, as shown in the following figure.
   This maintains the layered architecture of the Internet.

        +--------------------------------+       ---
        |          IPv6 Header           |
        +--------------------------------+    IP Layer
        |   IPv6 EH (MSR6 EH or Options) |
        +--------------------------------+       ---
        |          UDP Header            | Transport Layer
        +--------------------------------+       ---
        |            Payload             |
        +--------------------------------+

   Besides, multicast source routing requires no explicit multicast
   tree setup protocol.  The network devices replicate and forward the
   packet based only on the MSR6 header encapsulated by the host.
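
   A host-built packet of this shape can be sketched with plain byte
   packing.  The Next Header values 43 (Routing Header) and 17 (UDP)
   are the registered protocol numbers; everything else is an
   assumption for illustration: the use of a Routing Header as the MSR6
   EH, the experimental Routing Type 253, the layout of the
   type-specific data, the addresses, and the ports.

```python
import struct
from ipaddress import IPv6Address

NH_UDP = 17             # protocol number for UDP
NH_ROUTING = 43         # protocol number for the IPv6 Routing Header
ROUTING_TYPE_EXP = 253  # experimental Routing Type, used as a placeholder

def ipv6_header(src, dst, next_header, payload_len, hop_limit=64):
    """40-byte IPv6 fixed header (traffic class and flow label left at 0)."""
    return (struct.pack("!IHBB", 6 << 28, payload_len, next_header, hop_limit)
            + IPv6Address(src).packed + IPv6Address(dst).packed)

def msr6_routing_header(next_header, bitstring):
    """Hypothetical layout: 4 fixed bytes, 4 reserved bytes, bitstring."""
    data = bytes(4) + bitstring
    data += bytes(-(4 + len(data)) % 8)     # pad header to a multiple of 8 octets
    hdr_ext_len = (4 + len(data)) // 8 - 1  # 8-octet units beyond the first 8
    return struct.pack("!BBBB", next_header, hdr_ext_len,
                       ROUTING_TYPE_EXP, 0) + data

def udp_header(sport, dport, payload):
    """UDP header; checksum left at 0 for brevity (a real sender computes it)."""
    return struct.pack("!HHHH", sport, dport, 8 + len(payload), 0)

payload = b"hello"
udp = udp_header(5000, 5001, payload) + payload
rh = msr6_routing_header(NH_UDP, bitstring=bytes([0b00010010]) + bytes(7))
packet = ipv6_header("2001:db8::1", "2001:db8::ff", NH_ROUTING,
                     len(rh) + len(udp)) + rh + udp
print(len(packet), "bytes; IPv6 Next Header =", packet[6],
      "; RH Next Header =", packet[40])
```

   The steering information stays inside the IPv6 extension header
   chain, and the transport header follows it directly, so the host
   does not need an additional layer between IP and UDP.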

5.  Summary

   In summary, the use cases considered in this document have the
   following characteristics:

   - Large network scale with numerous multicast services

   - IPv6 multicast flows traversing the Internet with a requirement
   for encryption

   - IPv6 host-initiated or overlay multicast transport

   According to the analysis of the problems of the existing multicast
   solutions, an MSR6 solution should be introduced to satisfy the
   requirements of these use cases.  It takes advantage of the IPv6
   extension header to encapsulate extensible multicast traffic
   steering information and reuses existing IPv6 encapsulations such as
   IPSec.  The IPv6 tunneled packet and the IPv6 host-initiated packet
   can share a unified encapsulation.  The abstract MSR6 header is
   shown in the following figure:

        +--------------------------------+
        |          IPv6 Header           |
        +--------------------------------+
        |IPv6 RH (Segment List/Bitstring)|
        +--------------------------------+
        |    IPv6 EH (MCAST Options)     |
        +--------------------------------+
        |  IPv6 IPSec Header (ESP & AH)  |
        +--------------------------------+




6.  IANA Considerations

   This document makes no request of IANA.

7.  Security Considerations

   TBD

8.  Acknowledgements

   TBD

9.  Normative References

   [I-D.cheng-spring-ipv6-msr-design-consideration]
              Cheng, W., Mishra, G., Li, Z., Wang, A., Qin, Z., and C.
              Fan, "Design Consideration of IPv6 Multicast Source
              Routing (MSR6)", Work in Progress, Internet-Draft, draft-
              cheng-spring-ipv6-msr-design-consideration-01, 25 October
              2021, <https://www.ietf.org/archive/id/draft-cheng-spring-
              ipv6-msr-design-consideration-01.txt>.

   [I-D.liu-msr6-use-cases]
              Liu, Y., Yang, F., Wang, A., Zhang, X., Geng, X., and Z.
              Li, "MSR6(Multicast Source Routing over IPv6) Use Cases",
              Work in Progress, Internet-Draft, draft-liu-msr6-use-
              cases-01, 11 July 2022, <https://www.ietf.org/archive/id/
              draft-liu-msr6-use-cases-01.txt>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8279]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A.,
              Przygienda, T., and S. Aldrin, "Multicast Using Bit Index
              Explicit Replication (BIER)", RFC 8279,
              DOI 10.17487/RFC8279, November 2017,
              <https://www.rfc-editor.org/info/rfc8279>.

   [RFC8296]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A.,
              Tantsura, J., Aldrin, S., and I. Meilik, "Encapsulation
              for Bit Index Explicit Replication (BIER) in MPLS and Non-
              MPLS Networks", RFC 8296, DOI 10.17487/RFC8296, January
              2018, <https://www.rfc-editor.org/info/rfc8296>.






   [RFC8663]  Xu, X., Bryant, S., Farrel, A., Hassan, S., Henderickx,
              W., and Z. Li, "MPLS Segment Routing over IP", RFC 8663,
              DOI 10.17487/RFC8663, December 2019,
              <https://www.rfc-editor.org/info/rfc8663>.

Authors' Addresses

   Yisong Liu
   China Mobile
   Email: liuyisong@chinamobile.com


   Tianji Jiang
   China Mobile
   1525 McCarthy Blvd.
   Milpitas, CA 95035
   United States of America
   Email: tianjijiang@chinamobile.com


   Toerless Eckert
   Futurewei
   Email: tte+ietf@cs.fau.de


   Zhenbin Li
   Huawei Technologies
   Email: lizhenbin@huawei.com


   Gyan Mishra
   Verizon Inc.
   Email: gyan.s.mishra@verizon.com


   Zhuangzhuang Qin
   China Unicom
   Email: qinzhuangzhuang@chinaunicom.cn


   Changwang Lin
   New H3C Technologies
   Email: linchangwang.04414@h3c.com


   Xuesong Geng
   Huawei
   Email: gengxuesong@huawei.com


