IPv6 maintenance Working Group (6man) M. Zhang
Internet-Draft S. Kapadia
Intended status: Standards Track L. Dong
Expires: September 10, 2015 Cisco Systems
March 9, 2015
Improving Scalability of Switching Systems in Large Data Centers
draft-zhang-6man-scale-large-datacenter-00
Abstract
Server virtualization has been overwhelmingly accepted, especially in
cloud-based data centers. Along with the expansion of services and
technology advancements, the size of a data center has increased
significantly. There can be hundreds or thousands of physical servers
installed in a single large data center, which implies that the number
of Virtual Machines (VMs) can be on the order of millions. Effectively
supporting millions of VMs with limited hardware resources becomes a
real challenge for networking vendors. This document describes a
method to scale a switching system with limited hardware resources
using IPv6 in large data center environments.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.
The list of current Internet-Drafts is at
http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on September 10, 2015.
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
Copyright Notice
Copyright (c) 2015 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
   Abstract
   1. Introduction
      1.1 Specification of Requirements
   2. Terminology
   3. Large Data Center Requirements
   4. Scaling Through Aggregation
   5. SSP Aggregation
   6. Programming in FIB CAM with Special Mask
   7. VM Mobility
   8. Scaling Neighbor Discovery
   9. DHCPv6
   10. BGP
   11. Scalability
   12. DC edge router/switch
       12.1 DC Cluster Interconnect
   13. Multiple VRFs and Multiple Tenancies
       13.1 Resource Allocation and Scalability with VRFs
   14. Security
   15. References
   Authors' Addresses
1. Introduction
Server virtualization is extremely common in large data centers,
realized with a large number of Virtual Machines (VMs) or containers.
Typically, multiple VMs share the resources of a physical server.
Along with the expansion of services and technology advancements, the
size of a data center has increased significantly. There can be
hundreds or thousands of physical servers in a single large data
center, which implies that the number of VMs can be on the order of
millions. Such a large number of VMs challenges network equipment
providers to effectively support millions of VMs with limited hardware
resources.

The Clos-based spine-leaf topology has become the de facto standard
for data center deployments. A typical data center topology consists
of two tiers of switches: the Aggregation or spine tier and the
Access/Edge or leaf tier.

Figure 1 shows a two-tier network topology in a data center cluster.
S1 to Sn are spine switches. L1 to Lm are leaf switches. Every leaf
switch has at least one direct connection to every spine switch. H1 to
Hz are hosts/VMs attached to leaf switches, either directly or
indirectly through L2 switches. E1 is an edge router/switch. Multiple
data center clusters are interconnected by edge routers/switches.
   +---+       +---+          +---+
   |S1 |       |S2 |    ...   |Sn |
   +-+-+       +-+-+          +-+-+
     |           |              |
   +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+
   |            Link Connections            |
   |   Every spine connects to every leaf   |
   +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+
     |        |          |            |
   +-+-+    +-+-+      +-+-+        +-+-+    connect
   |L1 |    |L2 |  ... |Lm |        |E1 +-->> to other
   +---+    +---+      +---+        +---+    cluster
    / \        \          |
   /   \        \         |
 +-+-+ +-+-+  +-+-+     +-+-+
 |H1 | |H2 |  |H3 | ... |Hz |
 +---+ +---+  +---+     +---+

   Figure 1: Typical two-tier network topology in a DC cluster
Switches at the aggregation tier are large, expensive entities with
many ports that interconnect multiple access switches and provide fast
switching between them. Switches at the access tier are low-cost,
low-latency, smaller switches that connect to physical servers and
switch traffic among local servers and servers attached to other
access switches through the aggregation switches. To maximize profit,
low-cost, low-latency ASICs, most commonly systems-on-chip (SoCs), are
selected when designing access switches. In these types of ASICs, the
Layer 3 hardware Forwarding Information Base (FIB) table is typically
split into two tables: 1) the Host Route Table (HRT) for host routes
(/32 for IPv4 and /128 for IPv6), typically implemented as a hash
table; and 2) the Longest Prefix Match (LPM) table for prefix routes.
Due to the high cost of implementing a large LPM table in an ASIC,
either with traditional Ternary Content Addressable Memory [TCAM] or
other alternatives, the LPM table size in hardware is restricted to a
few tens of thousands of entries (from 16k to 64k for IPv4) on access
switches. Note that with an IPv6 address being four times as long as
an IPv4 address, the effective number of FIB LPM entries available for
IPv6 is essentially one-fourth (or one-half, depending on the width of
the LPM entry). Note also that the same tables need to be shared by
all IPv4, IPv6, unicast, and multicast traffic.
For years, people have been looking for solutions for super-scale data
centers, but there have been no major breakthroughs. Overlay-based
[OVERLAYS] approaches using VXLAN, TRILL, FabricPath, LISP, etc. have
certainly allowed separation of the tenant end-host address space from
the topology address space, thereby providing a level of indirection
that aids scalability by reducing the requirements on the aggregation
switches. However, the scale requirements on the access switches still
remain high, since they need to be aware of all the tenant end-host
addresses to support the any-to-any communication requirement in large
data centers (both East-West and North-South traffic).
With Software-Defined Networking (SDN) controllers gaining a lot of
traction, there has been a push toward a God-box-like model in which
all the information about all the end hosts is known centrally. In
this model, if an access switch does not know how to reach the
destination of an incoming packet, it queries the God-box and locally
caches the information (the vanilla OpenFlow model). The inherent
latency of this approach, together with the single point of failure
presented by the centralized model, means that such systems will not
scale beyond a point. Alternatively, the access switch can forward
unknown traffic toward a set of Oracle boxes (typically one or more
aggregation switches with huge tables that know about all end-hosts),
which in turn take care of forwarding traffic to the destination. As
scale increases, throwing more silicon at the solution is never a good
idea: the costs of building such large systems will be prohibitively
high, making it impractical to deploy them in the field.
This document describes an innovative approach to improve the
scalability of switching systems for large data centers with
IPv6-based end-hosts or VMs. The major improvements are: 1) reduced
FIB table usage in hardware on access switches and almost no FIB
resource allocation on aggregation switches, so that a single cluster
can support millions of hosts/VMs; 2) elimination of L2 flooding and
L3 multicast for NDP packets between access switches; and 3) a
reduction in control plane processing on the access switches.
1.1 Specification of Requirements
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
2. Terminology
HRT:
Host Route Table in packet forwarding ASIC
LPM:
Longest Prefix Match table in the packet forwarding ASIC
Switch ID:
A unique ID for a switch in a DC cluster
Cluster ID:
A unique ID for a DC cluster in a data center
VRF:
Virtual Routing and Forwarding Instance
Switch Subnet (SS):
Subnet of a VLAN on an access switch in a data center cluster.
Switch Subnet Prefix (SSP):
An IPv6 prefix assigned to a switch subnet. It consists of Subnet
Prefix, Cluster ID, and Switch ID. In a VRF, there could be one SSP
per VLAN per access switch.
Aggregated Switch Subnet Prefix (ASSP):
An SSP with the Subnet ID bits excluded. For better scalability, all
SSPs in a VRF on an access switch can be aggregated into a single
ASSP, which is used for hardware programming and IPv6 forwarding.
Cluster Subnet Prefix (CSP):
A subnet prefix used for forwarding between DC clusters. It consists
of the Subnet Prefix and the Cluster ID.
DC Cluster Prefix:
A common IPv6 prefix used by all hosts/VMs in a DC Cluster.
Subnet ID:
The ID for a subnet in a data center. It equals the Subnet Prefix
excluding the DC Cluster Prefix.
3. Large Data Center Requirements
These are the major requirements for large data centers:
Any subnet, anywhere, any time
Multi-million hosts/VMs
Any to Any communication
VLANs (aka subnets) span across access switches
VM Mobility
Control plane scalability
Easy management, trouble-shooting, debug-ability
Scale-out model
4. Scaling Through Aggregation
The proposed architecture employs a distributed gateway approach at
the access layer. A distributed gateway allows localization of failure
domains as well as distributed processing of ARP, DHCP, and similar
messages, thereby allowing a scale-out model without any restriction
on host placement (any subnet, anywhere). Forwarding within the same
subnet adheres to bridging semantics, while forwarding across subnets
is achieved via routing. For communication between end-hosts in
different subnets below the same access switch, routing is performed
locally at that access switch. For communication between end-hosts in
different subnets on different access switches, routing lookups are
performed both on the ingress access switch and on the egress access
switch. With distributed subnets and a distributed gateway deployment,
host (/128) addresses need to be propagated between the access
switches using a routing protocol such as MP-BGP. As the number of
hosts in the data center grows, advertising every single host address
prefix becomes a huge burden on the control plane. The problem is
further exacerbated by the fact that a host can have multiple
addresses. Our proposal shows how this problem can be solved via
flexible address assignment and intelligent control plane and data
plane processing.
A Data Center Cluster (DCC) is a data center network that consists of
a cluster of aggregation switches and access switches for switching
traffic among all servers connected to the access switches in the
cluster. A data center can include multiple DCCs. One unique DC
Cluster Prefix (DCCP) MUST be assigned to each DCC. The DC Cluster
Prefix can be locally unique if the prefix is not exposed to the
external Internet, or globally unique otherwise.

A public IPv6 address block can be procured from a Regional Internet
Registry. With the assigned address block, a service provider or
enterprise can subdivide the block into multiple prefixes for their
networks and Data Center Clusters (DCCs). A DCCP length SHOULD be less
than 64 bits. With the bits left between the DCCP and the 64-bit
Subnet Prefix boundary, many subnet prefixes can be allocated. All
subnet prefixes in a DC cluster SHOULD share the common DCCP.
A new term is introduced in this document, the Switch Subnet Prefix
(SSP), which is defined as follows:

[RFC 4291] defines the 128-bit unicast IPv6 address format. It
consists of two portions: the Subnet Prefix and the Interface ID. A
64-bit Subnet Prefix is most common and highly recommended. For this
scaling method, we subdivide the Interface ID of the IPv6 address: N
bits for the Host ID, 16 bits for the Switch ID, and 8 bits for the
Cluster ID.
Interface ID format
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0 0 0 0 0 0 0 0|  Cluster ID   |           Switch ID           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   .                   Host ID (variable length)                   .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
An SSP is assigned to a VLAN on an access switch. The SSP includes the
Subnet Prefix assigned to the VLAN, the Switch ID of the access
switch, and the Cluster ID of the cluster.

Each access switch MUST have a unique Switch ID in a DC cluster. The
Switch ID is assigned by a user or by a management tool. Because the
Switch ID is part of the IPv6 address of every host attached to that
access switch, it is recommended to assign Switch IDs with meaningful
characteristics, for example encoding the switch location, which can
be helpful when troubleshooting traffic-loss issues in large data
centers where millions of VMs are hosted.

Each cluster MUST have a unique Cluster ID within a data center
campus. The Cluster ID is used for routing traffic across DC clusters.
Switch Subnet Prefix Example
   |      48       | 16  | 8 | 8 | 16  |    32   |
   +---------------+-----+---+---+-----+---------+
   |2001:0DB8:000A:|000A:|00:|C5:|0001:|0000:0001|
   +---------------+-----+---+---+-----+---------+

   Cluster ID:             C5
   Switch ID:              1
   VLAN:                   100
   DC Cluster Prefix:      2001:DB8:A::/48
   Subnet ID:              A
   Subnet Prefix:          2001:DB8:A:A::/64
   Cluster Subnet Prefix:  2001:DB8:A:A:C5::/80
   Switch Subnet Prefix:   2001:DB8:A:A:C5:1::/96
   Host Address:           2001:DB8:A:A:C5:1:0:1/128
In this example, the DC Cluster Prefix 2001:DB8:A::/48 is the common
prefix for the cluster. Within the Cluster Prefix block, there is
plenty of address space (a 16-bit Subnet ID) available for subnet
prefixes. 2001:DB8:A:A::/64 is a subnet prefix assigned to a subnet,
in this example the subnet assigned to VLAN 100. Note that for the
purpose of exposition, we assume a 1:1 correspondence between a VLAN
and a subnet. However, the proposal does not impose any restriction if
multiple subnets are assigned to the same VLAN or vice-versa. The
subnet prefix is for a logical L3 interface/VLAN, typically referred
to as an Integrated Routing and Bridging (IRB) interface. The subnet
or VLAN spans multiple access switches, thereby allowing placement of
any host anywhere within the cluster. On each access switch, there is
one Switch Subnet Prefix (SSP) per subnet or VLAN.
2001:DB8:A:A:C5:1::/96 is the SSP for VLAN 100 on switch 1. It is a
combination of the Subnet Prefix, the Cluster ID, and the Switch ID. A
host/VM address provisioned to a host/VM connected to this access
switch MUST include the SSP associated with the VLAN on the switch. In
this example, 2001:DB8:A:A:C5:1:0:1/128 is a host/VM address assigned
to a host/VM connected to the access switch.
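
The address construction in this example can be illustrated with a
short, non-normative sketch. The following Python fragment is an
illustration only (the function name is ours, not part of this
specification); it assembles the host address from the example values
using the field widths of the Interface ID format defined earlier:

   import ipaddress

   def build_host_address(dc_cluster_prefix, subnet_id, cluster_id,
                          switch_id, host_id):
       # DC Cluster Prefix: top 48 bits; Subnet ID: next 16 bits; then
       # 8 zero bits, 8 bits of Cluster ID, 16 bits of Switch ID, and
       # (in this example) a 32-bit Host ID in the remaining bits.
       base = int(ipaddress.IPv6Network(dc_cluster_prefix).network_address)
       value = (base
                | (subnet_id  << 64)
                | (cluster_id << 48)
                | (switch_id  << 32)
                | host_id)
       return ipaddress.IPv6Address(value)

   # Values from the Switch Subnet Prefix Example above.
   addr = build_host_address("2001:DB8:A::/48", 0xA, 0xC5, 0x1, 0x1)
   print(addr)   # 2001:db8:a:a:c5:1:0:1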
Host/VM addresses can be configured using stateful DHCPv6 or other
network management tools. In this model, DHCPv6 is chosen for
illustration, showing how IPv6 host addresses are assigned from a
DHCPv6 server; similar implementations can be done with other
protocols or tools. Section 9 describes how address pools are
configured on the DHCPv6 server and how information between switches
and the DHCP server is exchanged with DHCPv6 messages, allowing
seamless address assignment based on the proposed scheme. This makes
the scheme completely transparent to the end-user, thereby relieving
the network administrator of the burden of address management.
5. SSP Aggregation
Typically, a routing domain is identified by a Virtual Routing and
Forwarding (VRF) instance. Reachability within a VRF is achieved via
regular layer-3 forwarding or routing. By default, reachability from
within a VRF to outside as well as vice-versa is restricted. In that
sense, a VRF provides isolation for a routing domain. A tenant can be
associated with multiple VRFs and each VRF can be associated with
multiple subnets/VLANs. There can be overlapping IP addressing across
VRFs, allowing address reuse. To simplify implementation, reduce
software processing, and improve scalability, all SSPs in a VRF on an
access switch can be aggregated into a single Aggregated SSP (ASSP).
Only one ASSP is needed per switch for a VRF in a DC cluster. ASSPs are
employed to aid simplified processing both in the control plane as well
as the data plane.
Typically, for every subnet instantiated on an access switch, a
corresponding subnet prefix needs to be installed in the LPM that points
to the glean adjacency. With ASSP, only a single entry needs to be
installed in the LPM irrespective of the number of subnets that are
instantiated on the access switch. In addition, the same benefit is
leveraged at the remote access switches where there needs to be a single
ASSP installed for every other access switch independent of what subnets
are instantiated at the remote switches. More details of how this FIB
programming is achieved are presented in the next section.
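
As a non-normative illustration of this point (the example prefixes
reuse the values from Section 4; the variable names are ours), the
following Python sketch shows how the SSPs of several VLANs on the
same access switch collapse into a single ASSP once the Subnet ID bits
are excluded:

   import ipaddress

   # Hypothetical SSPs for three VLANs on one access switch
   # (Cluster ID 0xC5, Switch ID 0x1); only the Subnet ID differs.
   ssps = ["2001:db8:a:a:c5:1::/96",
           "2001:db8:a:b:c5:1::/96",
           "2001:db8:a:c:c5:1::/96"]

   # The 16 Subnet ID bits occupy address bits 48-63, i.e. bit 64
   # counted from the least significant end of the 128-bit integer.
   subnet_id_mask = ((1 << 128) - 1) ^ (0xFFFF << 64)

   assps = {ipaddress.IPv6Address(
                int(ipaddress.IPv6Network(p).network_address) & subnet_id_mask)
            for p in ssps}

   print(assps)   # a single ASSP remains: 2001:db8:a:0:c5:1::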
ASSP entries on an access switch MUST be distributed to all other
access switches in a cluster through a routing protocol such as BGP.
When an ASSP entry is learned through the routing protocol, an LPM
entry SHOULD be installed. Because of its better scalability in large
data center environments (a BGP Route Reflector can be used to reduce
the number of peers a BGP node communicates with), BGP is recommended
for this forwarding model. In this document, we describe how BGP can
be used for ASSP and CSP distribution. A new BGP Opaque Extended
Community is specified in Section 10 for this solution.
As mentioned earlier, in modern data centers, overlay networks are
typically used for forwarding data traffic between access switches. On
aggregation switches, only a very small number of FIB entries are
needed for underlay reachability, since the aggregation switches are
oblivious to the tenant host addresses. Aggregation switching
platforms can therefore be designed to be simple, low latency, high
port density, and low cost. ASSP entries programmed in the LPM table
are used for forwarding data traffic between access switches. The
rewrite information in the corresponding next-hop (or adjacency) entry
SHOULD include the information needed to forward packets to the egress
access switch corresponding to the Switch ID.
One local ASSP FIB CAM entry is also needed. The rewrite information
in the corresponding next-hop (or adjacency) entry SHOULD punt packets
to the local processor. This local FIB CAM entry is used for
triggering address resolution if a destination host/VM is not in the
IPv6 Neighbor Cache (equivalent to a glean entry). Host/VM addresses
(/128) discovered through the IPv6 ND protocol are installed in the
Host Route Table (HRT) on that access switch and only on that access
switch. Host routes learned through a routing protocol MUST NOT be
programmed into the hardware HRT. Note that an exception can occur
when a VM moves across an access switch boundary; such moves require
special handling, which will be discussed in a separate draft on VM
Mobility.
An IPv6 unicast data packet from a host/VM connected to an ingress
switch destined to another host on an egress switch is forwarded in
the following steps: 1) it arrives at the ingress switch; 2) an L3
lookup in the FIB (LPM) CAM table hits an entry because the packet's
destination address includes the Switch Subnet Prefix; 3) the packet
is forwarded to the egress switch based on the FIB CAM entry and the
corresponding adjacency entry; 4) the packet is forwarded to its
destination host by the egress switch.
For forwarding packets outside of the DC Cluster, a default route ::/0
SHOULD be installed in the FIB CAM that routes packets to one of the
DC edge routers/switches, which provide reachability both to other
data center sites and to the external world (Internet).
To summarize this forwarding model: only local host/VM routes are
installed in the HRT table, which greatly reduces the number of HRT
table entries required at an access switch, and ASSP routes are
installed in the LPM table for forwarding traffic between access
switches. Because ASSPs are independent of subnets/VLANs, the total
number of LPM entries required is greatly reduced. These reduced
requirements on the HRT and LPM of the access switches allow a very
large number of VMs to be supported with much smaller hardware FIB
tables.
A similar forwarding model SHOULD be implemented in software. For
example, if the special mask is used as discussed in Section 6, then
when forwarding an IPv6 packet in an SSP-enabled VRF, the Subnet ID
bits can be masked to 0s when doing the lookup in the software FIB. If
this results in a match with an ASSP entry, the packet is forwarded to
the egress access switch using the adjacency attached to the ASSP.
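
The forwarding summary and the software lookup described above can be
expressed as a small, non-normative sketch in Python (the table
contents and function name are hypothetical; a hardware implementation
performs these lookups in the ASIC pipeline):

   import ipaddress

   # Hypothetical tables on an access switch for one VRF.
   hrt = {ipaddress.IPv6Address("2001:db8:a:a:c5:1:0:1"): "local port 7"}
   lpm = {ipaddress.IPv6Network("2001:db8:a:0:c5:2::/96"): "forward to switch 2",
          ipaddress.IPv6Network("::/0"):                   "forward to DC edge"}

   def forward(dst):
       dst = ipaddress.IPv6Address(dst)
       # 1) Local hosts hit the HRT (/128 entries learned via ND).
       if dst in hrt:
           return hrt[dst]
       # 2) Otherwise clear the Subnet ID bits (the software equivalent
       #    of the special mask in Section 6) and do a longest-prefix
       #    match against the ASSP and default entries.
       subnet_id_mask = ((1 << 128) - 1) ^ (0xFFFF << 64)
       masked = ipaddress.IPv6Address(int(dst) & subnet_id_mask)
       matches = [net for net in lpm if masked in net]
       best = max(matches, key=lambda net: net.prefixlen)
       return lpm[best]

   print(forward("2001:db8:a:a:c5:1:0:1"))   # local port 7
   print(forward("2001:db8:a:b:c5:2:0:9"))   # forward to switch 2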
6. Programming in FIB CAM with Special Mask
Typically, a FIB lookup requires a longest prefix match (LPM), for
which a CAM is utilized. A CAM in an ASIC is implemented with value
bits and mask bits for each of its entries. The value bits are the
values (0 or 1) of the bits in the memory that are compared against a
lookup key; the lookup key typically consists of the vpn-id
(corresponding to the VRF) and the destination address of the data
packet to be forwarded. The mask bits are used to include or exclude
each bit of the value field of a CAM entry when deciding whether a
match has occurred. Mask bit = 1, or mask-in, means include the value
bit; mask bit = 0, or mask-out, means exclude the value bit, i.e., it
is a DON'T-CARE (the corresponding key bit can be 1 or 0).
When programming the FIB CAM for all Switch Subnet Prefixes from an
access switch, only one entry is installed in the FIB CAM per
destination access switch: all DC Cluster Prefix bits are masked in,
all bits after the DC Cluster Prefix and before the Cluster ID bits
are masked out, the Cluster ID bits and Switch ID bits are masked in,
and the remaining bits are masked out.
For example,

      DC Cluster Prefix:   2001:0DB8:000A::/48
      Cluster ID:          0xC5
      Switch ID (in hex):  0x1234

      FIB CAM programming
      Value: 2001:0DB8:000A:0000:00C5:1234:0000:0000
      Mask:  FFFF:FFFF:FFFF:0000:00FF:FFFF:0000:0000
With one such FIB CAM entry, the switch can match all Switch Subnet
Prefixes that include the DC Cluster Prefix 2001:0DB8:000A::/48,
Cluster ID 0xC5, and Switch ID 0x1234, no matter what values appear in
the bits between the DC Cluster Prefix and the Cluster ID. That means
only a single FIB CAM entry is needed for all packets destined to
hosts connected to that switch, no matter what subnet prefixes are
configured on its VLANs. On a given switch, one such FIB CAM entry is
required for each of the other access switches in the DC Cluster.
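
A non-normative sketch of the value/mask semantics, in Python (the
helper name is ours; an actual CAM performs this comparison in
hardware for all entries in parallel):

   import ipaddress

   def cam_match(key_addr, value_hex, mask_hex):
       # Compare only the bit positions where the mask is 1 (mask-in);
       # masked-out bits are DON'T-CARE.
       key   = int(ipaddress.IPv6Address(key_addr))
       value = int(ipaddress.IPv6Address(value_hex))
       mask  = int(ipaddress.IPv6Address(mask_hex))
       return (key & mask) == (value & mask)

   VALUE = "2001:0DB8:000A:0000:00C5:1234:0000:0000"
   MASK  = "FFFF:FFFF:FFFF:0000:00FF:FFFF:0000:0000"

   # Hosts behind Switch ID 0x1234 match regardless of their Subnet ID,
   print(cam_match("2001:db8:a:1:c5:1234:0:1", VALUE, MASK))    # True
   print(cam_match("2001:db8:a:7f:c5:1234:0:9", VALUE, MASK))   # True
   # while a host behind a different switch does not.
   print(cam_match("2001:db8:a:1:c5:5678:0:1", VALUE, MASK))    # False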
In case the LPM is not implemented as a true CAM but instead as an
algorithmic CAM, as is the case with some ASICs, an alternative
approach can be employed: set all Subnet ID bits to 0s when
programming an ASSP entry in the LPM table, and clear the Subnet ID
bits of the lookup key when doing the lookup. This approach requires
certain changes in the lookup logic of the ASIC.
Note that the above explanation applies on a per-VRF basis, since the
FIB lookup is always based on (VRF, destination IP). For example, in a
data center with 100 access switches, if a VRF spans 10 access
switches, then the number of LPM entries on each of those 10 access
switches for this VRF is equal to 10 (1 local and 9 remote, one for
each of the other switches). Section 11 provides additional details on
scalability in terms of the number of entries required to support a
large multi-tenant data center with millions of VMs.
7. VM Mobility
VM mobility will be discussed in a separate IETF draft.
8. Scaling Neighbor Discovery
Another major issue with the traditional forwarding model is the
scalability of processing Neighbor Discovery Protocol (NDP) messages.
In a data center cluster with a large number of VLANs, many of which
span multiple access switches, the volume of NDP messages handled by
software on an access switch can be huge and can easily overwhelm the
CPU. In addition, the large number of entries in the neighbor cache on
an access switch can cause HRT table overflow.
In our proposed forwarding model, Neighbor Discovery can be
distributed to the access switches as described below. Note that the
following descriptions in this section apply only to ND operation for
global unicast targets; no ND operation change is required for
link-local targets.
All NDP messages from hosts/VMs are restricted to the local access
switch.

Multicast NDP messages are flooded to all local switch ports on a VLAN
and also copied to the local CPU. They SHOULD NOT be sent on link(s)
connected to aggregation switches.
When a multicast NS message is received, if its target matches the
local ASSP, it can be ignored, because the target host/VM SHOULD reply
to the NS itself since it is locally attached to the access switch;
otherwise, a unicast NA message MUST be sent by the switch with the
link-layer address equal to the switch's MAC (aka Router MAC).
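
This NS handling rule can be summarized in a short, non-normative
Python sketch (the prefix, MAC, and function name are illustrative
assumptions; a real implementation lives in the switch's ND
subsystem):

   import ipaddress

   # Hypothetical local ASSP for this access switch (Cluster 0xC5, Switch 1).
   LOCAL_ASSP = ipaddress.IPv6Network("2001:db8:a:0:c5:1::/96")
   SWITCH_MAC = "00:11:22:33:44:55"   # the Router MAC

   def handle_multicast_ns(target):
       # Clear the Subnet ID bits of the target before comparing it
       # against the local ASSP (which excludes the Subnet ID).
       subnet_id_mask = ((1 << 128) - 1) ^ (0xFFFF << 64)
       masked = ipaddress.IPv6Address(int(ipaddress.IPv6Address(target))
                                      & subnet_id_mask)
       if masked in LOCAL_ASSP:
           # Target is locally attached; the host itself answers the NS.
           return "ignore"
       # Target is behind another access switch: proxy-reply with a
       # unicast NA carrying the switch's own MAC (Router MAC).
       return "send unicast NA with link-layer address " + SWITCH_MAC

   print(handle_multicast_ns("2001:db8:a:a:c5:1:0:1"))   # ignore
   print(handle_multicast_ns("2001:db8:a:a:c5:2:0:7"))   # send unicast NA ...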
When a unicast data packet is received, if the destination address
belongs to a remote switch, it will match the ASSP for the remote
switch in the FIB table and be forwarded to that switch. On the remote
switch, if the destination host/VM has not been discovered yet, the
data packet will be punted to the CPU and ND will be triggered for
host discovery in software.
The distributed ND model can reduce software processing on the CPU
substantially. It also takes much less space in the hardware HRT
table. Most importantly, there is no flooding at either L2 or L3.
Flooding is a major concern in large data centers, so it SHOULD be
avoided as much as possible.
A subnet prefix and a unique address are configured on an L3 logical
interface on an access switch. When the L3 logical interface has
member ports on multiple switches, the same subnet prefix and address
MUST be configured on the L3 logical interface on all those switches.
ND operation on hosts/VMs remains the same, without any change.
9. DHCPv6
This section describes the host address assignment model with the
DHCPv6 protocol. Similar implementations can be done with other
protocols and management tools.
The DHCPv6 Relay Agent [RFC 3315] SHOULD be supported on access
switches for this address assignment proposal.
[draft-ietf-dhc-topo-conf-04] gives recommendations for real-world
DHCPv6 Relay Agent deployments. For the forwarding model described in
this document, the method of using the link-address as described in
Section 3.2 of [draft-ietf-dhc-topo-conf-04] SHOULD be implemented as
follows:
The Switch Subnet Prefix (SSP) for the subnet on the switch SHOULD be
used as the link-address in the Relay-Forward message sent from the
switch. On the DHCPv6 server, the link-address is used to identify the
link. A prefix or address range should be configured on the DHCPv6
server for the link, and that prefix or address range MUST match the
SSP on the switch. This guarantees that addresses assigned by the
DHCPv6 server always include the SSP for the interface on the switch.
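
As a non-normative sketch of the server-side pool selection (Python;
the pool layout and names are illustrative assumptions, not a DHCPv6
server implementation):

   import ipaddress

   # Hypothetical per-link address pools on the DHCPv6 server, keyed by
   # the SSP that the relay agent places in the link-address field.
   pools = {
       ipaddress.IPv6Network("2001:db8:a:a:c5:1::/96"): "pool for VLAN 100, switch 1",
       ipaddress.IPv6Network("2001:db8:a:a:c5:2::/96"): "pool for VLAN 100, switch 2",
   }

   def select_pool(link_address):
       # Pick the pool whose SSP contains the relay's link-address.
       addr = ipaddress.IPv6Address(link_address)
       for ssp, pool in pools.items():
           if addr in ssp:
               return pool
       return None

   # A Relay-Forward from switch 1 carries that switch's SSP as the
   # link-address, so the matching pool is selected.
   print(select_pool("2001:db8:a:a:c5:1::"))   # pool for VLAN 100, switch 1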
The number of SSP address pools could be very large on the DHCP server.
This can be alleviated by employing a cluster of DHCP servers to ensure
effective load distribution of client DHCPv6 requests.
10. BGP
As mentioned earlier, ASSP entries are redistributed to all access
switches through BGP. ASSP entries learned from BGP are inserted into
the RIB. They are used for FIB CAM programming in hardware and for
IPv6 forwarding in software.
In this document, we define a BGP Opaque Extended Community that can
be attached to BGP UPDATE messages to indicate the type of routes
advertised in them. This is the IPv6 Route Type Community [RFC4360],
with the following encoding:
   +-----------------------------+
   | Type 0x3 or 0x43 (1 octet)  |
   +-----------------------------+
   | Sub-type 0xE (1 octet)      |
   +-----------------------------+
   | Route Type (1 octet)        |
   +-----------------------------+
   | Subnet ID Length (1 octet)  |
   +-----------------------------+
   | Reserved (4 octets)         |
   +-----------------------------+
Type Field:
The value of the high-order octet of this Opaque Extended Community is
0x03 or 0x43. The value of the low-order octet of the extended type
field for this community is 0x0E (or another value allocated by IANA).
Value Field:
The 6-octet Value field contains three distinct sub-fields, described
below:

The Route Type sub-field defines the type of IPv6 routes carried in
this BGP message. The following values are defined:
1: ASSP_Route indicates that the routes carried in this BGP Update
message are ASSP routes
2: CSP_Route indicates that the routes carried in this BGP Update
message are CSP routes
The Subnet ID Length specifies the number of Subnet ID bits in an ASSP
route. Those bits can be ignored in the FIB lookup, either with the
special mask when a FIB lookup CAM is used or with the alternative
approach described in Sections 5 and 6. This field is only used when
the route type is ASSP_Route.
The 4 octet reserved field is for future use.
The IPv6 Route Type Community does not need to be carried in the BGP
Withdraw messages.
All operations SHOULD follow [RFC4360]. There is no exception for this
forwarding model.
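
A non-normative encoding sketch in Python (field values follow the
layout above; the sub-type 0x0E is the value proposed in this
document, pending IANA allocation):

   import struct

   ASSP_ROUTE = 1
   CSP_ROUTE  = 2

   def encode_route_type_community(route_type, subnet_id_len=0,
                                   transitive=True):
       # 1-octet type, 1-octet sub-type, 1-octet Route Type,
       # 1-octet Subnet ID Length, 4 reserved octets.
       type_octet = 0x03 if transitive else 0x43
       sub_type   = 0x0E
       reserved   = 0
       return struct.pack("!BBBBI", type_octet, sub_type,
                          route_type, subnet_id_len, reserved)

   community = encode_route_type_community(ASSP_ROUTE, subnet_id_len=16)
   print(community.hex())   # 030e011000000000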
11. Scalability
With this innovative forwarding model, the scalability of a data
center switching system is improved significantly, while still
allowing any-to-any communication between all hosts and imposing no
restriction on host placement or host mobility.

FIB TCAM utilization on an access switch becomes independent of the
number of VLANs/subnets instantiated on that switch.
It is important to note that the number of host prefix routes (/128)
depends only on the number of VMs that are local to an access switch.
A network administrator can add as many access switches as needed with
the same network design and never worry about running out of FIB HRT
resources. This greatly simplifies network design for large data
centers.

The total number of VMs that can be supported in a data center cluster
can be calculated as follows (assuming a single VRF):
Number of LPM entries:
Only one LPM entry per access switch is required for local ASSP.
The total number of LPM entries on an access switch is equivalent
to the total number of access switches in a DC cluster plus 1
(for the default route).
Number of HRT entries:
There will be one HRT entry for each directly connected host/VM.
Scalability Calculation
H: max number of HRT entries
V: Number of VMs/port
P: number of ports/access switch
H = V x P
For example,
48 ports/access switch, 128 VMs/port
H = 48 x 128 = 6k HRT entries/access switch
T: total number of hosts/VMs
L: number of access switches
T = H x L
Example: 200 access switches
1.2 Million (6k x 200) VMs can be supported in a large data center
cluster.
12. DC edge router/switch
Multiple data center clusters can be interconnected with DC edge
routers/switches. The same subnet can span across multiple data center
clusters. While each subnet has a unique subnet prefix, each cluster
into which that subnet extends has a unique Cluster Subnet Prefix. The
CSP is advertised over BGP to the edge routers, which in turn attract
traffic for hosts that are part of that subnet in a given cluster.
Again, the procedure for handling host mobility across clusters will
be described in a separate draft.
12.1 DC Cluster Interconnect
This section describes a way to support a VLAN spanning DC clusters in
this forwarding model.
Subnet Prefixes SHOULD be advertised by the routing protocol within a
DC Cluster, but subnet prefixes SHOULD NOT be installed in the
hardware FIB table. On a DC edge router/switch, Cluster Subnet
Prefixes (CSPs) can be configured, or auto-generated if SSP is
enabled. A CSP is a special prefix used at the DC edge router/switch
to forward traffic between directly connected DC clusters. Please
refer to Section 2 for the CSP definition and Section 4 for an
example. There SHOULD be one CSP per subnet.
A CSP SHOULD be advertised through a routing protocol between the DC
edge routers/switches that connect the DC Clusters. In Section 10, a
special BGP extended community is defined for advertising CSP routes.
CSP routes SHOULD NOT be advertised into a DC cluster.
CSP route messages SHOULD be handled as follows: on the
CSP-originating DC edge router/switch, the CSP SHOULD NOT be installed
in the hardware FIB table. On the receiving DC edge router/switch, the
CSP SHOULD be installed in the hardware FIB table. All bits between
the DCCP and the Cluster ID MUST be masked out if the special mask
scheme can be implemented, or set to 0s if a FIB key mask is not
supported.

Because CSPs consume FIB CAM space, the user SHOULD determine whether
there is enough FIB CAM resource on the DC edge router/switch before
enabling this feature.
13. Multiple VRFs and Multiple Tenancies
For flexibility, an implementation can let the user enable or disable
this feature at the VRF level on one or more access switches. When it
is enabled in a VRF, all functionality described in this document
SHOULD be applied to that VRF on all those access switches. No
behavior changes SHOULD occur in other VRFs that do not have this
feature enabled.

Multi-tenancy can be supported by employing multiple VRFs. A tenant
can be allocated one or more VRFs.
13.1 Resource Allocation and Scalability with VRFs
To support more VRFs in a DC cluster, a DC network administrator can
enable this feature for a VRF on only a few access switches in the
cluster. The maximum number of VRFs can be calculated with this
formula:
Scalability Calculation

   L: number of LPM entries
   V: number of VRFs
   P: number of access switches per VRF (average)

   L = V x (P + 1), or
   V = L / (P + 1)

Example

   8k LPM entries are available per access switch, and on average 9
   access switches are allocated per VRF.

   Number of VRFs that can be supported: V = 8000 / (9 + 1) = 800
More VRFs can be supported if the number of access switches per VRF is
decreased.
To support a large number of VRFs or tenants, larger LPM tables MAY be
required. This SHOULD be considered during the ASIC design phase.
14. Security
No new security threat is expected to be imposed by this proposal.
15. References
[TCAM]     Kasnavi, S., Gaudet, V., Berube, P., and J. Amaral, "A
           Hardware-Based Longest Prefix Matching Scheme for TCAMs",
           IEEE, 2005.

[OVERLAYS] Hooda, S., Kapadia, S., and P. Krishnan, "Using TRILL,
           FabricPath, and VXLAN: Designing Massively Scalable Data
           Centers (MSDC) with Overlays", ISBN 978-1587143939, 2014.

[RFC2119]  "Key words for use in RFCs to Indicate Requirement Levels",
           RFC 2119.

[RFC 4291] "IP Version 6 Addressing Architecture", RFC 4291.

[RFC 4861] "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861.

[RFC 3315] "Dynamic Host Configuration Protocol for IPv6 (DHCPv6)",
           RFC 3315.

[draft-ietf-dhc-topo-conf-04]
           "Customizing DHCP Configuration on the Basis of Network
           Topology", Work in Progress.

[RFC 4271] "A Border Gateway Protocol 4 (BGP-4)", RFC 4271.

[RFC4360]  "BGP Extended Communities Attribute", RFC 4360.
Authors' Addresses
Ming Zhang
Cisco Systems
170 West Tasman Dr
San Jose, CA 95134
USA
Phone: +1 408 853 2419
EMail: mzhang@cisco.com
Shyam Kapadia
Cisco Systems
170 West Tasman Dr
San Jose, CA 95134
USA
Phone: +1 408 527 8228
EMail: shkapadi@cisco.com
Liqin Dong
Cisco Systems
170 West Tasman Dr
San Jose, CA 95134
USA
Phone: +1 408 527 1532
EMail: liqin@cisco.com