Network Working Group                                         Yantao Sun
Internet-Draft                               Beijing Jiaotong University
Intended status: Informational                               Xiaoli Song
Expires: January 10, 2013                                        Bin Liu
                                                                ZTE Inc.
                                                               Qiang Liu
                                                              Jing Cheng
                                             Beijing Jiaotong University
                                                            July 9, 2012


            MatrixDCN: A New Network Fabric for Data Centers
                        draft-sun-matrix-dcn-00

Abstract

   This document describes the requirements of today's data centers
   and introduces a new type of network topology called MatrixDCN
   (matrix data center network).  MatrixDCN is intended for deploying
   large-scale data center networks and can support more than 100
   thousand servers in one data center without a network bandwidth
   bottleneck.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 10, 2013.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents



Yantao Sun, et al.      Expires January 10, 2013                [Page 1]

Internet-Draft                  MatrixDCN                      July 2012


   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.


Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
     1.1.  Acronyms & Definitions . . . . . . . . . . . . . . . . . .  3
   2.  Conventions used in this document  . . . . . . . . . . . . . .  3
   3.  Network fabric . . . . . . . . . . . . . . . . . . . . . . . .  4
     3.1.  Components . . . . . . . . . . . . . . . . . . . . . . . .  4
     3.2.  Multiple Paths . . . . . . . . . . . . . . . . . . . . . .  5
     3.3.  Addressing . . . . . . . . . . . . . . . . . . . . . . . .  5
   4.  Routing  . . . . . . . . . . . . . . . . . . . . . . . . . . .  6
     4.1.  Routing of RS  . . . . . . . . . . . . . . . . . . . . . .  6
     4.2.  Routing of CS  . . . . . . . . . . . . . . . . . . . . . .  6
     4.3.  Routing of AS  . . . . . . . . . . . . . . . . . . . . . .  7
     4.4.  Construction of Routing Table  . . . . . . . . . . . . . .  7
       4.4.1.  Construction for CS  . . . . . . . . . . . . . . . . .  7
       4.4.2.  Construction for RS  . . . . . . . . . . . . . . . . .  8
       4.4.3.  Construction for AS  . . . . . . . . . . . . . . . . .  8
     4.5.  PDU Format . . . . . . . . . . . . . . . . . . . . . . . .  8
     4.6.  Fault Tolerance  . . . . . . . . . . . . . . . . . . . . .  9
   5.  VM Migration . . . . . . . . . . . . . . . . . . . . . . . . . 10
   6.  Multiple tenants . . . . . . . . . . . . . . . . . . . . . . . 10
   7.  Deployment Scenarios . . . . . . . . . . . . . . . . . . . . . 10
   8.  Security Considerations  . . . . . . . . . . . . . . . . . . . 10
   9.  Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 10
   10. Reference  . . . . . . . . . . . . . . . . . . . . . . . . . . 11
   11. Acknowledgments  . . . . . . . . . . . . . . . . . . . . . . . 11
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 11


1.  Introduction

   The traditional network topology is a tree-like fabric composed of
   routers and switches, in which the network is divided into three
   layers: the core layer, the aggregation layer and the access layer.
   All servers are connected to access switches in the access layer.

   This kind of topology has several problems when used in a data
   center network.  First, it constrains the scale of the network.  As
   the network expands, the routers in the core layer tend to become
   the bandwidth bottleneck of the whole network, since more packets
   need to be routed between different layer-2 domains through the core
   routers.  Second, the core routers divide the data center network
   into many small layer-2 domains, so VM migration is limited to a
   single layer-2 domain.  Finally, it is difficult to exploit the
   large number of redundant links between switches, because the
   Spanning Tree Protocol (STP) must be used to keep the layer-2
   network free of loops and so avoid broadcast storms.

   To solve these problems, new network architectures should be
   introduced into data centers.  A new architecture needs the
   following features: 1) support for multiple paths, to eliminate
   bandwidth bottlenecks; 2) a regular network topology with good
   extensibility and maintainability; 3) support for VM migration
   across the entire network; 4) a sufficient number of VLANs, with
   any set of endpoints permitted to form one VLAN.

1.1.  Acronyms & Definitions

   DCN - Data Center Network

   AS - Access Switch

   RS - Row Switch

   CS - Column Switch

   PDU - Protocol Data Unit

   OSPF - Open Shortest Path First Routing Protocol

   VLAN - Virtual Local Area Network


2.  Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC-2119 [RFC2119].

   In this document, these words will appear with that interpretation
   only when in ALL CAPS.  Lower case uses of these words are not to be
   interpreted as carrying RFC-2119 significance.


3.  Network fabric

   In research on data center networks, the fat-tree fabric has
   attracted a great deal of attention.  Fat-tree is one kind of
   multi-root tree, and a lot of important research and some practical
   work has been based on it [FAT-TREE].  In this proposal, we
   introduce another variety of multi-root tree, called MatrixDCN, for
   building large-scale data center networks.  Like fat-tree,
   MatrixDCN has a regular topology, multiple paths, and special
   addressing and routing matched to its topology.

   MatrixDCN can support more than 100 thousand physical servers in a
   single data center network.  Furthermore, with modest modification
   it can support virtual machine migration across the whole network
   and the isolation of a huge number of tenants.

3.1.  Components

   In MatrixDCN, there are three types of network devices: Row Switch
   (RS), Column Switch (CS) and Access Switch (AS).  AS switches are
   deployed as a matrix with multiple rows and columns.  For example,
   an 8 x 8 matrix has 8 rows, 8 columns and 64 AS switches.  An RS is
   deployed at the head of a row and links together all the AS
   switches in that row; a CS is deployed at the head of a column and
   links together all the AS switches in that column.  Every AS
   connects to all the RS switches at the head of its row and all the
   CS switches at the head of its column.  Figure 1 shows an example
   of a 2 x 2 MatrixDCN.

                        +----+           +----+
                        | CH |  ____     | CH |  ____
                        +----+ \    \    +----+ \    \
                       10.0.1.1|     |  10.0.2.1 |    |
                      _________| __  | _____     |    |
                     /         |     |      \    |    |
                 +----+   +----+     |       +----+   |
                 | RH |___| AS |     |       | AS |   |
                 +----+   +----+     |       +----+   |
                 10.1.0.1 10.1.1.1   |      10.1.2.1  |
                          /     \    |      /      \  |
                       +---+  +---+  |   +---+  +---+ |
                       | S |  | S |  |   | S |  | S | |
                       +---+  +---+  |   +---+  +---+ |
                   10.1.1.2 10.1.1.3 |                |
                      _____________  |  _______       |
                     /               |         \      |
                  +----+    +----+   |       +----+   |
                  | RH | __ | AS | __/       | AS | __/
                  +----+    +----+           +----+
                  10.2.0.1  10.2.1.1        10.2.2.1
                            /     \         /      \
                          +---+   +---+    +---+   +---+
                          | S |   | S |    | S |   | S |
                          +---+   +---+    +---+   +---+
                                                  10.2.2.3

                     Figure 1: Demonstration topology
        (CH and RH mark the CS and RS at the column and row heads)

3.2.  Multiple Paths

   To eliminate bandwidth bottlenecks in data center networks,
   MatrixDCN can deploy multiple RS switches in one row and multiple
   CS switches in one column.  This provides more links, and therefore
   more bandwidth, between an AS and its RS and CS switches.  If the
   bandwidth between AS and RS (called row bandwidth), the bandwidth
   between AS and CS (called column bandwidth) and the access
   bandwidth of the AS are equal, the network is approximately
   non-blocking.

3.3.  Addressing

   In MatrixDCN, every device, whether server or switch, is assigned
   an IP address according to its position in the network.  Suppose an
   AS is located at the mth row and nth column: its IP address is
   10.m.n.1/24 and the servers connected to it are set to 10.m.n.x/24.
   RS switches located at the head of the mth row are set to
   10.m.0.x/16.  CS switches located at the head of the nth column are
   set to 10.0.n.x with the mask 255.0.255.0.
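
   The addressing rules above can be sketched in Python as follows.
   This is an illustration only: the function names, the range checks,
   and the choice to start server host numbers at 2 (since .1 belongs
   to the AS) are assumptions, not part of this proposal.

```python
from typing import Tuple


def _check(value: int, low: int, high: int, name: str) -> int:
    """Reject values that do not fit the octet they are encoded in."""
    if not low <= value <= high:
        raise ValueError(f"{name} must be in {low}..{high}, got {value}")
    return value


def as_address(row: int, col: int) -> str:
    """AS at row m, column n: 10.m.n.1 (0 is reserved for row/column heads)."""
    return f"10.{_check(row, 1, 255, 'row')}.{_check(col, 1, 255, 'col')}.1"


def server_address(row: int, col: int, host: int) -> str:
    """Server under the AS at (m, n): 10.m.n.x (x >= 2 is an assumption)."""
    _check(row, 1, 255, "row")
    _check(col, 1, 255, "col")
    return f"10.{row}.{col}.{_check(host, 2, 254, 'host')}"


def rs_address(row: int, index: int) -> str:
    """The x-th RS at the head of row m: 10.m.0.x."""
    return f"10.{_check(row, 1, 255, 'row')}.0.{_check(index, 1, 254, 'index')}"


def cs_address(col: int, index: int) -> str:
    """The x-th CS at the head of column n: 10.0.n.x."""
    return f"10.0.{_check(col, 1, 255, 'col')}.{_check(index, 1, 254, 'index')}"


def classify(address: str) -> Tuple[str, int, int]:
    """Recover (role, row, column) from an address; 0 means 'head'."""
    first, row, col, last = (int(part) for part in address.split("."))
    if first != 10 or (row == 0 and col == 0):
        raise ValueError(f"not a MatrixDCN address: {address}")
    if row == 0:
        return ("CS", 0, col)
    if col == 0:
        return ("RS", row, 0)
    return ("AS" if last == 1 else "server", row, col)
```

   Because the position is recoverable from the address alone, a
   switch can tell the role and location of a neighbor without any
   extra lookup.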


4.  Routing

   Routing in MatrixDCN is very simple, because the topology is very
   regular and the position of every device is encoded in its address,
   so every switch knows the entire topology without exchanging link
   states.  All the switches in MatrixDCN have routing ability and
   store their routing entries in a standard routing table.  The
   structure of the routing table is illustrated in Table 1.

   Table 1: Standard IP routing table

      ------------------------------------------------------------------
      Subnet address | Subnet mask | Next hop | Cost | Create/update time
      ------------------------------------------------------------------
      10.1.0.0       | 255.255.0.0 | 10.1.1.1 |      |
      ------------------------------------------------------------------
      10.2.0.0       | 255.255.0.0 | 10.1.2.1 |      |
      ------------------------------------------------------------------


4.1.  Routing of RS

   When a packet arrives at an RS switch, the RS determines the next
   hop from the third number of the packet's destination IP address,
   using its IP routing table.  If this number is k, the packet is
   sent to the kth AS switch (the next hop) on the row of this RS.
   The routing table on an RS switch of the ith row looks like this:

   Destination/Subnet mask       Next hop
   10.X.1.X/255.0.255.0          10.i.1.1
   10.X.2.X/255.0.255.0          10.i.2.1
   10.X.3.X/255.0.255.0          10.i.3.1
   ......
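
   The RS table relies on the non-contiguous mask 255.0.255.0, which
   prefix-based lookups do not handle, so a match has to be done with a
   plain bitwise AND.  The sketch below shows one way to do that; the
   names rs_table and lookup and the tuple layout are hypothetical.

```python
import ipaddress
from typing import List, Optional, Tuple

# (masked destination, mask, next hop); integers allow bitwise matching.
Route = Tuple[int, int, str]


def to_int(dotted: str) -> int:
    """Convert a dotted-quad string to a 32-bit integer."""
    return int(ipaddress.IPv4Address(dotted))


def rs_table(row: int, columns: int) -> List[Route]:
    """One entry per column k: 10.X.k.X/255.0.255.0 -> 10.row.k.1."""
    mask = to_int("255.0.255.0")
    return [
        (to_int(f"10.0.{k}.0"), mask, f"10.{row}.{k}.1")
        for k in range(1, columns + 1)
    ]


def lookup(table: List[Route], destination: str) -> Optional[str]:
    """Return the next hop of the first entry whose masked bits match."""
    dest = to_int(destination)
    for network, mask, next_hop in table:
        if dest & mask == network:
            return next_hop
    return None  # no AS known for that column
```

   For example, on an RS of row 2 in a three-column matrix,
   lookup(rs_table(2, 3), "10.7.3.9") selects 10.2.3.1, the AS in
   column 3 of the RS's own row, whatever the destination row is.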

4.2.  Routing of CS

   When a packet arrives at a CS switch, the CS determines the next
   hop from the second number of the packet's destination IP address,
   using its IP routing table.  If this number is k, the packet is
   sent to the kth AS switch (the next hop) on the column of this CS.
   The routing table on a CS switch of column Col looks like this:

   Destination/Subnet mask                         Next hop
   10.1.X.X/255.255.0.0                            10.1.Col.1
   10.2.X.X/255.255.0.0                            10.2.Col.1
   10.3.X.X/255.255.0.0                            10.3.Col.1
   ......

4.3.  Routing of AS

   When a packet arrives at an AS switch, the AS must decide whether
   the next hop is an RS or a CS switch.  If the destination IP
   address is on the same row, the packet is sent to an RS switch; if
   it is on the same column, the packet is sent to a CS switch.  A
   packet whose destination is on a different row and a different
   column can be sent to either an RS or a CS.  A packet whose
   destination is on the same row and the same column, that is, under
   the same AS, is forwarded by layer-2 switching without routing.

   The routing table of the AS at the intersection of the ith row
   (Row) and the jth column (Col) looks like this:

   Destination/Subnet mask         Next hop
   10.Row.0.0/255.255.0.0          10.Row.0.X1
   10.Row.0.0/255.255.0.0          10.Row.0.X2
   ......
   10.0.Col.0/255.0.255.0          10.0.Col.X1
   10.0.Col.0/255.0.255.0          10.0.Col.X2
   ......
   10.0.0.0/255.0.0.0              10.Row.0.X1
   10.0.0.0/255.0.0.0              10.Row.0.X2
   ......
   10.0.0.0/255.0.0.0              10.0.Col.X1
   10.0.0.0/255.0.0.0              10.0.Col.X2
   ......
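
   The AS forwarding decision can be sketched as follows.  The
   function candidates applies the four cases described in this
   section.  The function choose is purely an assumption: this
   document does not say how an AS picks among several equally valid
   next hops, so a CRC32 hash of the destination is used here only to
   make the choice deterministic.

```python
import zlib
from typing import List, Optional, Sequence


def candidates(as_row: int, as_col: int, destination: str,
               rs_hops: Sequence[str], cs_hops: Sequence[str]) -> List[str]:
    """Next hops an AS at (as_row, as_col) may use for a destination."""
    first, row, col, _ = (int(part) for part in destination.split("."))
    if first != 10:
        raise ValueError(f"not a MatrixDCN address: {destination}")
    if (row, col) == (as_row, as_col):
        return []                      # same AS: layer-2 switching
    if row == as_row:
        return list(rs_hops)           # same row: go through an RS
    if col == as_col:
        return list(cs_hops)           # same column: go through a CS
    return list(rs_hops) + list(cs_hops)   # elsewhere: either works


def choose(hops: Sequence[str], destination: str) -> Optional[str]:
    """Pick one hop; hashing the destination is an illustrative policy."""
    if not hops:
        return None
    return hops[zlib.crc32(destination.encode("ascii")) % len(hops)]
```

   With the addresses of Figure 1, an AS at row 1, column 1 forwards
   traffic for 10.1.2.3 to its RS, traffic for 10.2.1.3 to its CS, and
   may use either for 10.2.2.3.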

4.4.  Construction of Routing Table

   To build the routing tables, each switch must learn automatically
   which switches it is connected to.  To do so, every switch
   periodically sends a Hello PDU on all of its active ports.  The
   Hello PDU is encapsulated in a UDP packet.  A well-known UDP port
   will be requested from IANA.

4.4.1.  Construction for CS

   A CS switch, such as 10.0.n.x, receives Hello PDUs from all the AS
   switches on the same column.  Its routing table is built according
   to the following rules:

   When the CS receives a Hello PDU from 10.m.n.1, a routing entry
   with destination "10.m.0.0/255.255.0.0" and next hop "10.m.n.1" is
   added to its routing table, or refreshed if it already exists.  If
   no Hello PDU is received from that AS within a set time, the
   corresponding routing entry is deleted.
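
   A minimal sketch of this add, refresh and delete cycle for a CS is
   given below.  The class name, the hold_time parameter and its
   default value are hypothetical, since this document only speaks of
   a "set time".  Timestamps are passed in explicitly so that the
   behavior is easy to follow.

```python
from typing import Dict, Tuple


class CsRoutingTable:
    """Routing entries of one CS, learned from Hello PDUs."""

    def __init__(self, column: int, hold_time: float = 3.0) -> None:
        self.column = column
        self.hold_time = hold_time  # assumed value; not set by the draft
        # (destination, mask) -> (next hop, time of last Hello)
        self.entries: Dict[Tuple[str, str], Tuple[str, float]] = {}

    def on_hello(self, sender: str, now: float) -> None:
        """Add or refresh the entry for the row of the sending AS."""
        first, row, col, last = (int(part) for part in sender.split("."))
        if (first, col, last) != (10, self.column, 1) or row == 0:
            return  # sender is not an AS on this CS's column
        self.entries[(f"10.{row}.0.0", "255.255.0.0")] = (sender, now)

    def expire(self, now: float) -> None:
        """Delete entries whose Hello has not been seen within hold_time."""
        stale = [
            key for key, (_, seen) in self.entries.items()
            if now - seen > self.hold_time
        ]
        for key in stale:
            del self.entries[key]
```

   The same structure applies to an RS, with the entry changed to
   destination "10.0.n.0/255.0.255.0" for a Hello PDU from 10.m.n.1.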

4.4.2.  Construction for RS

   An RS switch, such as 10.m.0.x, receives Hello PDUs from all the AS
   switches on the same row.  Its routing table is built according to
   the following rules:

   When the RS receives a Hello PDU from 10.m.n.1, a routing entry
   with destination "10.0.n.0/255.0.255.0" and next hop "10.m.n.1" is
   added to its routing table, or refreshed if it already exists.  If
   no Hello PDU is received from that AS within a set time, the
   corresponding routing entry is deleted.

4.4.3.  Construction for AS

   An AS switch, such as 10.m.n.1, receives Hello PDUs from all the RS
   switches of the mth row and all the CS switches of the nth column.
   Its routing table is built according to the following rules:

   If a Hello PDU is received from an RS 10.m.0.x, the two routing
   entries "10.m.0.0/255.255.0.0 10.m.0.x" and "10.0.0.0/255.0.0.0
   10.m.0.x" are added to its routing table or refreshed.

   If a Hello PDU is received from a CS 10.0.n.x, the two routing
   entries "10.0.n.0/255.0.255.0 10.0.n.x" and "10.0.0.0/255.0.0.0
   10.0.n.x" are added to its routing table or refreshed.

   If no Hello PDU is received from a switch within a set time, the
   corresponding routing entries are deleted.
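
   The mapping from a received Hello PDU to AS routing entries can be
   sketched as below.  The entries follow the pattern of the AS
   routing table in Section 4.3, where the sending RS or CS is itself
   the next hop; the function name and the tuple layout are
   hypothetical.

```python
from typing import List, Tuple

# (destination, mask, next hop)
Entry = Tuple[str, str, str]


def entries_for_hello(as_row: int, as_col: int, sender: str) -> List[Entry]:
    """Entries an AS at (as_row, as_col) installs for one Hello sender."""
    first, row, col, _ = (int(part) for part in sender.split("."))
    if first != 10:
        return []
    if row == as_row and col == 0:
        # RS of this row: same-row traffic plus a default route.
        return [
            (f"10.{row}.0.0", "255.255.0.0", sender),
            ("10.0.0.0", "255.0.0.0", sender),
        ]
    if row == 0 and col == as_col:
        # CS of this column: same-column traffic plus a default route.
        return [
            (f"10.0.{col}.0", "255.0.255.0", sender),
            ("10.0.0.0", "255.0.0.0", sender),
        ]
    return []  # not a head switch of this AS's row or column
```

   Each RS and CS contributes a default entry, which is what lets a
   packet for a different row and column leave through either kind of
   switch.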

4.5.  PDU Format

                        ----------------------------
                       |    0   |  8 |  16  |  24   |
                        ----------------------------
                       |Version |Type| Packet Length|
                        ----------------------------
                       |   Row No    |  Column No   |
                        ----------------------------
                       |  Check Sum  |    AuType    |
                        ----------------------------
                       |       Authentication       |
                        ----------------------------
                       |       Authentication       |
                        ----------------------------
                       |            Data            |
                        ----------------------------
                       |            ......          |
                        ----------------------------


   Version: the version number of the MatrixDCN routing protocol.

   Type: the PDU type.  If the value is 1, this is a Hello PDU.  If
   the value is 2, this is a Link State Advertisement PDU, used to
   announce link faults.

   Packet Length: the total length of the PDU, including the PDU
   header and data.

   Row No and Column No: the position of this switch.

   Check Sum: the checksum of the entire PDU.

   AuType: the authentication type.  0: no authentication; 1:
   plaintext authentication; 2: MD5 authentication.

   Authentication: the authentication information, which depends on
   AuType.  0: undefined; 1: the key; 2: the key ID, MD5 data length
   and packet number, with the MD5 data appended to the end of the
   packet.

   AuType and Authentication follow the definitions used for OSPF
   packets [RFC2328].
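
   The header layout can be sketched with Python's struct module as
   shown below.  The field widths are read off the figure above (8-bit
   Version and Type, 16-bit Packet Length, Row No, Column No, Check
   Sum and AuType, and a 64-bit Authentication field) and are therefore
   assumptions, as are the version value 1 and network byte order.
   The checksum is left at zero because this document does not specify
   the algorithm.

```python
import struct
from typing import Dict, Union

# Version, Type, Packet Length, Row No, Column No, Check Sum, AuType,
# Authentication (8 bytes); "!" selects network byte order.
HEADER = struct.Struct("!BBHHHHH8s")

TYPE_HELLO = 1


def pack_hello(row: int, col: int, version: int = 1,
               data: bytes = b"") -> bytes:
    """Build an unauthenticated Hello PDU (AuType 0, checksum unset)."""
    length = HEADER.size + len(data)
    header = HEADER.pack(version, TYPE_HELLO, length, row, col,
                         0, 0, bytes(8))
    return header + data


def unpack_header(pdu: bytes) -> Dict[str, Union[int, bytes]]:
    """Parse the fixed header and check the declared length."""
    (version, pdu_type, length, row, col,
     checksum, au_type, auth) = HEADER.unpack_from(pdu)
    if length != len(pdu):
        raise ValueError(f"length field {length} != actual {len(pdu)}")
    return {
        "version": version, "type": pdu_type, "length": length,
        "row": row, "col": col, "checksum": checksum,
        "au_type": au_type, "authentication": auth,
    }
```

   Under these assumptions the fixed header is 20 bytes long, and a
   Hello PDU from the AS at row 3, column 7 carries those two numbers
   in its Row No and Column No fields.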

4.6.  Fault Tolerance

   Any network node or link fault will interrupt communication, so
   fault tolerance must be considered in a usable routing protocol.
   To provide it, switches in MatrixDCN should learn the state of the
   whole network through Link State Advertisement PDUs.  More details
   will be given in the next version of this document.


5.  VM Migration

   Subnetting and location-related addressing make it possible to
   build large-scale data center networks, but they limit the
   migration of VMs.  To solve this problem, overlay or IP tunneling
   technology is introduced into data center networks.  With a small
   extension to the AS, MatrixDCN can support seamless VM migration
   across the entire data center without any modification to the
   routing procedure described above.  The solution takes the regular
   topology of MatrixDCN into account; more detail will be specified
   in another document.


6.  Multiple tenants

   The present VLAN protocol, 802.1Q, cannot satisfy the requirements
   of data center networks, as it supports only about 4000 VLANs.
   VXLAN and NVGRE are two similar but competing draft protocols for
   solving this problem, and either can be used in MatrixDCN.  Another
   solution, which takes the regular topology into consideration, will
   be discussed in another document.


7.  Deployment Scenarios

   MatrixDCN can be used to deploy large-scale data center networks.
   Suppose each AS switch has 40 down-link ports at 1 Gbit/s and 8
   up-link ports at 10 Gbit/s, and each RS and CS switch has 40 ports
   at 10 Gbit/s; such switches are mainstream in today's data centers.
   With them we can build a MatrixDCN of 40 rows x 40 columns, in
   which every row has 4 RS switches and every column has 4 CS
   switches.  On every AS, 4 up-link ports connect to RS switches, 4
   up-link ports connect to CS switches, and the 40 down-link ports
   connect to servers.  This MatrixDCN can therefore contain up to
   64,000 servers using 1600 AS switches, 160 RS switches and 160 CS
   switches.  The average available bandwidth for every server is
   about 1 Gbit/s.
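
   The figures in this scenario can be reproduced with the short
   calculation below.  The function and parameter names are
   hypothetical, and the per-server bandwidth is only a rough bound
   (the access speed, capped by the AS up-link capacity shared among
   its servers) that ignores the traffic pattern.

```python
from typing import Dict, Union


def capacity(rows: int = 40, cols: int = 40,
             rs_per_row: int = 4, cs_per_col: int = 4,
             down_ports: int = 40, down_gbps: float = 1.0,
             up_ports: int = 8, up_gbps: float = 10.0,
             head_ports: int = 40) -> Dict[str, Union[int, float]]:
    """Switch counts, server count and a rough per-server bandwidth."""
    if rs_per_row + cs_per_col > up_ports:
        raise ValueError("AS has too few up-link ports")
    if cols > head_ports or rows > head_ports:
        raise ValueError("RS/CS has too few ports for the matrix size")
    as_count = rows * cols
    uplink_gbps = up_ports * up_gbps          # per AS, towards RS and CS
    return {
        "as_switches": as_count,
        "rs_switches": rows * rs_per_row,
        "cs_switches": cols * cs_per_col,
        "servers": as_count * down_ports,
        "uplink_gbps_per_as": uplink_gbps,
        "downlink_gbps_per_as": down_ports * down_gbps,
        "gbps_per_server": min(down_gbps, uplink_gbps / down_ports),
    }
```

   With the default arguments, which are the values of this section,
   the result is 1600 AS switches, 160 RS switches, 160 CS switches
   and 64,000 servers, with 80 Gbit/s of up-link capacity against 40
   Gbit/s of access capacity on each AS.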


8.  Security Considerations

   Protection of routing information and isolation between the
   networks of different tenants (VLANs) should be considered in this
   protocol.


9.  Conclusion

   Today's data centers place new requirements on networks, such as
   non-blocking operation, seamless VM migration and multiple tenants.
   This document introduces MatrixDCN, a new network fabric for data
   centers.  MatrixDCN can support more than 100 thousand servers
   without a network bandwidth bottleneck.  The fabric is simple and
   extensible, and its routing is efficient.  Furthermore, the fabric
   has some advantages in supporting VM migration and one-to-many or
   many-to-many traffic in cloud computing.


10.  Reference

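   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2328]  Moy, J., "OSPF Version 2", RFC 2328, April 1998.

   [FAT-TREE] Al-Fares, M., Loukissas, A., and A. Vahdat, "A Scalable,
              Commodity Data Center Network Architecture", ACM
              SIGCOMM, 2008.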


11.  Acknowledgments

   Thanks to Yuhua Wei, Lizhong Jin, Jinghai Yu, and Zhui Guo for
   their valuable advice on this document.


Authors' Addresses

   Yantao Sun
   Beijing Jiaotong University
   No.3 Shang Yuan Cun, Hai Dian District
   Beijing  100044
   China

   Phone:
   Email: ytsun@bjtu.edu.cn


   Xiaoli Song
   ZTE Inc.
   ZTE Plaza, No.19 East Huayuan Road,Haidian District
   Beijing  100191
   China

   Email: song.xiaoli@zte.com.cn


   Bin Liu
   ZTE Inc.
   ZTE Plaza, No.19 East Huayuan Road,Haidian District
   Beijing  100191
   China

   Email: liu.bin21@zte.com.cn


   Qiang Liu
   Beijing Jiaotong University
   No.3 Shang Yuan Cun, Hai Dian District
   Beijing  100044
   China

   Email: liuq@bjtu.edu.cn


   Jing Cheng
   Beijing Jiaotong University
   No.3 Shang Yuan Cun, Hai Dian District
   Beijing  100044
   China

   Email: yourney.j@gmail.com