Internet DRAFT - draft-liu-multicast-for-computing-storage

draft-liu-multicast-for-computing-storage







Network Working Group                                             Y. Liu
Internet-Draft                                              China Mobile
Intended status: Informational                                   X. Geng
Expires: 11 January 2024                                          Huawei
                                                            10 July 2023


                  Multicast for Computing and Storage
              draft-liu-multicast-for-computing-storage-00

Abstract

   This document introduces the multicast use case for computing and
   storage.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 11 January 2024.

Copyright Notice

   Copyright (c) 2023 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components



Liu & Geng               Expires 11 January 2024                [Page 1]

Internet-Draft  draft-liu-multicast-for-computing-storag       July 2023


   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Use in Multi-tenant Cloud . . . . . . . . . . . . . . . . . .   2
   3.  Typical Multicast Scenario in Computing . . . . . . . . . . .   3
     3.1.  AI Training . . . . . . . . . . . . . . . . . . . . . . .   3
     3.2.  HPC . . . . . . . . . . . . . . . . . . . . . . . . . . .   5
   4.  Typical Multicast Scenario in Computing . . . . . . . . . . .   6
   5.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   7
   6.  Security Considerations . . . . . . . . . . . . . . . . . . .   7
   7.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .   7
   8.  Normative References  . . . . . . . . . . . . . . . . . . . .   7
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .   8

1.  Introduction

   There are applications in data center with point-to-multipoint
   communication patterns that would benefit from network multicast
   service, without which, these applications, when migrating to public
   clouds, will use server based packet replication techniques.  This
   leads to CPU load inflation and prevents tenants from sustaining high
   throughputs and low latencies for multicast workloads.

   In order to better show the requirements for computing and storage,
   we list 3 typical potential multicast scenarios with P2MP services:
   Multi-tenant Cloud, Computing and Storage.

   The multicast requirements could be described with the following 3
   aspects:

   - Network Scale: number of switches, number of links, number of hosts

   - Multicast Tree Size: number of intermeidate nodes; number of
   receivers

   - Multicast Service Number

2.  Use in Multi-tenant Cloud

   As illustrated in the following figure, a data-center may contain: a
   network fabric configured in unicast-only mode, hosts running as
   virtual machines (VMs) managed by tenants, central replicators (C-R)
   for providing MSR6 packet delivery service among the hosts of a
   tenant.



Liu & Geng               Expires 11 January 2024                [Page 2]

Internet-Draft  draft-liu-multicast-for-computing-storag       July 2023


          +----+   +----+   +----+
          |C-R1|   |C-R2|   |C-Rn| ====>Central-Replicator
          +-+--+   +-+--+   +-+--+
            |        |        |
      +-----+--------+--------+-----+
      | Spine +Leaf +vSwitch Fabric |
      |       (Unicast Only)        |
      +--+--+--+--+--+--+--+--+--+--+
         |  |  |  |  |  |  |  |
         h1 |  h3 h4 |  h6 |  h8   ====>Tenant-1
            |        |     |
            h2       h5    h7      ====>Tenant-2

   Take tenant-1 for example.  The host h1 can send multicast flow using
   MSR6 packets to C-R1, the MSR6 packets include one or more of the
   destination hosts h3/h4/h6/h8 encoding in the MSR6 header.  An MSR6
   packet may be sent to C-R1 where it is replicated and sends to the
   desired destination hosts.  An MSR6 packet may be sent to C-R1 where
   it is replicated and sends to the part of the destination hosts, and
   another copy to C-R2 for replication and delivery to the left
   destination hosts.

   A Tenant may have a dedicated set of C-Rs for its own use, or a
   Tenant may use a shared C-Rs for its replication requirement among
   VMs.

3.  Typical Multicast Scenario in Computing

3.1.  AI Training

   The following figure shows a typical RDMA AI training scenario.

                    PS(Parameter Server) Nodes
                  +-------+          +-------+
                  |  CPU  |          |  CPU  |
                  | Server|          | Server|
                  +-+-+-+-+          +-+-+-+-+
       ^            | | |              | | |          |
       |         +--|-|-|--------------+ | |          |
       |       +----+ | +----------------------+      |
       |       | |    +--------+ +-------+ |   |      V
   Gradients   | |             | |         |   | Parameters
           +---+-+-+       +---+-+-+     +-+---+-+
           |  GPU  |       |  GPU  |     |  GPU  |
           | Worker|       | Worker|     | Worker|
           +-------+       +-------+     +-------+

   Worker->PS: The gradient of each worker is pushed to PS node



Liu & Geng               Expires 11 January 2024                [Page 3]

Internet-Draft  draft-liu-multicast-for-computing-storag       July 2023


   PS->Worker: PS will pull the parameters back to all workers after
   aggregation

   In this process, the second stage is information distribution, with
   the same data content.  N connections are used to transmit unicast
   separately.  The bandwidth efficiency is 1/N, the larger the scale,
   the lower the efficiency.

                         +---------------+
                         |     Source    |
                         | +---+   +---+ |
                         | |CPU|   |GPU| |
                         | +-+-+   +-+-+ |
                         |   |       |   |
                         |    \     /    |
                         |   +-V---V-+   |
                         |   |  HCA  |   |
                         |   +-------+   |
                         +--+-+-+-+-+-+--+
                            | | ... | |
                         +--V-V-----V-V--+
                         |     Switch    |
                         +-+-----------+-+
                          /             \
           +-------------V-+           +-V------------+
           |  Destination  |           |  Destination  |
           |   +-------+   |           |   +-------+   |
           |   |  HCA  |   |           |   |  HCA  |   |
           |   +-V---V-+   |           |   +-V---V-+   |
           |    /     \    |           |    /     \    |
           |   |       |   |           |   |       |   |
           | +-+-+   +-+-+ |           | +-+-+   +-+-+ |
           | |CPU|   |GPU| |           | |CPU|   |GPU| |
           | +---+   +---+ |           | +---+   +---+ |
           +---------------+           +---------------+

   If the source only sends 1 copy to the network and the switches
   replicate the packet to different distinations.  The use of bandwidth
   is more efficient and the training is faster.

   The large-scale multicast requirement in this scenario is as the
   following:

   - Network Scale: 10-10k GPU

   - Multicast Tree Size: 10-10k receivers

   - Multicast Service Number: depends on the scenario



Liu & Geng               Expires 11 January 2024                [Page 4]

Internet-Draft  draft-liu-multicast-for-computing-storag       July 2023


3.2.  HPC

   The following is an example of MPI in HPC scenario.

         +-------------------------------------------+
         |                Dispatcher                 |
         |                  Master                   |
         +---------------------+---------------------+
                               |
             +-----------------+
             |
         +---+----+  +--------+             +--------+
         |+--V---+|  |+------+|             |+------+|
         ||Dispa-||  ||Dispa-||             ||Dispa-||
         ||Agent ||  ||Agent ||             ||Agent ||
         |+---+--+|  |+---+--+|             |+---+--+|
         |    |   |  |    |   |             |    |   |
         |+---V--+|  |+---V--+|             |+---V--+|
         ||  MPI ||  ||  MPI ||     ...     ||  MPI ||
         ||Proces||  ||Proces||             ||Proces||
         |+---^--+|  |+---^--+|             |+---^--+|
         |    |   |  |    |   |             |    |   |
         |+---V--+|  |+---V--+|             |+---V--+|
         || RoCE |<-->| RoCE |<------------->| RoCE ||
         |+------+|  |+------+|             |+------+|
         +--------+  +--------+             +--------+

   Stage 1: Dispatcher Master senses millions of cores and schedules
   millions of Rank MPI jobs on demand.  Dispatcher Master sends the
   scheduling results to Dispatcher Agent

   Stage 2: Dispatcher Agent starts Million Rank MPI on each node The
   Dispatcher Agent that receives the message broadcast the message to
   other Dispatcher Agents and do the initialization before starting the
   MPI application

   Stage 3: Dispatcher Agent broadcaast the message to start the MPI
   application.  MPI internal initialization Synchronize the RoCE
   endpoint in allgather way after the MPI application is started

   The last 2 stages could benefit from multicast and reduce task
   completion time.


   The large-scale multicast requirement in this scenario is as the
   following:

   - Network Scale: 1000 k CPU/GUP



Liu & Geng               Expires 11 January 2024                [Page 5]

Internet-Draft  draft-liu-multicast-for-computing-storag       July 2023


   - Multicast Tree Size: 10k~100k receivers

   - Multicast Service Number: 1~100

4.  Typical Multicast Scenario in Computing

   Ceph is an open-source distributed software platform.  It mainly
   focuses on scale-out file system including storage distribution and
   availability, which is widely used in storage.

   Ceph Object Storage Daemons (OSDs) are reponsible for storing objects
   on a local file system on behalf of Ceph clients.  Also, Ceph OSDs
   use the CPU, memory, and networking of Ceph cluster nodes for data
   replication, erasure coding, recovery, monitoring and reporting
   functions.

   The following process request P2MP service.

   - Application initiates "write" operation from a client to a server.

   - Client finds the server to write in, and 3 copies are sent to 3
   services.

                  +-------+          +-------+
                  |Client1|          |Client2|
                  +---+---+          +---+---+
                      |                  |
                      +---------+--------+
                                |
                        +-------+-------+
                        |     Switch    |
                        +-------+-------+
                                |
               +----------------+----------------+
               |                |                |
           +---+---+        +---+---+        +---+---+
           | Server|        | Server|        | Server|
           +-------+        +-------+        +-------+

   The large-scale multicast requirement in this scenario is as the
   following:

   - Network Scale: 3k Server (1 Pod)

   - Multicast Tree Size: 3 receivers

   - Multicast Service Number: 10k




Liu & Geng               Expires 11 January 2024                [Page 6]

Internet-Draft  draft-liu-multicast-for-computing-storag       July 2023


5.  IANA Considerations

   This document makes no request of IANA.

6.  Security Considerations

   TBD

7.  Acknowledgements

   TBD

8.  Normative References

   [I-D.cheng-spring-ipv6-msr-design-consideration]
              Cheng, W., Mishra, G. S., Li, Z., Wang, A., Qin, Z., and
              C. Fan, "Design Consideration of IPv6 Multicast Source
              Routing (MSR6)", Work in Progress, Internet-Draft, draft-
              cheng-spring-ipv6-msr-design-consideration-01, 25 October
              2021, <https://datatracker.ietf.org/doc/html/draft-cheng-
              spring-ipv6-msr-design-consideration-01>.

   [I-D.ietf-avtcore-rtp-topologies-update]
              Westerlund, M. and S. Wenger, "RTP Topologies", Work in
              Progress, Internet-Draft, draft-ietf-avtcore-rtp-
              topologies-update-10, 2 July 2015,
              <https://datatracker.ietf.org/doc/html/draft-ietf-avtcore-
              rtp-topologies-update-10>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC3493]  Gilligan, R., Thomson, S., Bound, J., McCann, J., and W.
              Stevens, "Basic Socket Interface Extensions for IPv6",
              RFC 3493, DOI 10.17487/RFC3493, February 2003,
              <https://www.rfc-editor.org/info/rfc3493>.

   [RFC3542]  Stevens, W., Thomas, M., Nordmark, E., and T. Jinmei,
              "Advanced Sockets Application Program Interface (API) for
              IPv6", RFC 3542, DOI 10.17487/RFC3542, May 2003,
              <https://www.rfc-editor.org/info/rfc3542>.

   [RFC8200]  Deering, S. and R. Hinden, "Internet Protocol, Version 6
              (IPv6) Specification", STD 86, RFC 8200,
              DOI 10.17487/RFC8200, July 2017,
              <https://www.rfc-editor.org/info/rfc8200>.



Liu & Geng               Expires 11 January 2024                [Page 7]

Internet-Draft  draft-liu-multicast-for-computing-storag       July 2023


Authors' Addresses

   Yisong Liu
   China Mobile
   Email: liuyisong@chinamobile.com


   Xuesong Geng
   Huawei
   Email: gengxuesong@huawei.com









































Liu & Geng               Expires 11 January 2024                [Page 8]