ARMD                                                          M. Karir
Internet Draft                                      Merit Network Inc.
Intended status: Informational                                 Ian Foo
Expires: January 2012                             Huawei Technologies
                                                      October 24, 2011

                  Data Center Reference Architectures
            draft-karir-armd-datacenter-reference-arch-00.txt
Status of this Memo
This Internet-Draft is submitted to IETF in full conformance with
the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
This Internet-Draft will expire on January 24, 2012.
Copyright Notice
Copyright (c) 2011 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with
respect to this document. Code Components extracted from this
document must include Simplified BSD License text as described in
Section 4.e of the Trust Legal Provisions and are provided without
warranty as described in the Simplified BSD License.
Abstract
The continued growth of large-scale data centers has resulted in a
wide range of architectures and designs. Each design is tuned to
address the challenges and requirements of the specific applications
and workloads that the data center is being built for. Each design
evolves as engineering solutions are developed to work around
limitations of existing protocols, hardware, and software
implementations. The goal of this document is to characterize this
problem space in detail in order to better understand whether there
are any gaps in making address resolution scale in various network
designs for data centers. In particular, our goal is to peel back the
various optimizations and engineering solutions in order to develop
generalized reference architectures for a data center. We also
discuss the various factors that influence design choices in
developing data center designs.
Conventions used in this document
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
Table of Contents
1. Introduction...................................................3
2. Terminology....................................................3
3. Generalized Data Center Design.................................4
3.1. Access Layer..............................................5
3.2. Aggregation Layer.........................................5
3.3. Core......................................................5
3.4. L3/L2 Topological Variations..............................5
3.4.1. Layer 3 to Access Switches...........................5
3.4.2. L3 to Aggregation Switches...........................5
3.4.3. L3 in the Core only..................................6
3.4.4. Overlays.............................................6
4. Factors that Affect Data Center Design.........................7
4.1. Traffic Patterns..........................................7
4.2. Virtualization............................................7
4.3. Impact of Data Center Design on L2/L3 protocols...........8
5. Conclusion and Recommendation..................................8
6. Manageability Considerations...................................9
7. Security Considerations........................................9
8. IANA Considerations............................................9
9. Acknowledgments................................................9
10. References....................................................9
Authors' Addresses...............................................10
Intellectual Property Statement..................................10
Disclaimer of Validity...........................................10
1. Introduction
Data centers are a key part of delivering Internet-scale
applications. The design and network architecture of a data center
are important aspects of the overall service delivery plan. This
includes not only determining the scale of physical and virtual
servers but also optimizing the entire data center stack, in
particular the Layer 3 and Layer 2 architectures. Depending on the
particular application requirements and scale, data centers can be
designed in a variety of ways. Each design is often a representation
of which aspects of the problem were and were not relevant to the
purpose of that data center. In this document we attempt to
generalize the various design optimizations into a common generic
architecture in order to facilitate the discussion of potential
issues under a common framework.
2. Terminology
ARP: Address Resolution Protocol
ND: Neighbor Discovery
Host: An application running on a physical server or a virtual
machine. A host usually has at least one IP address and at
least one MAC address.
Server: A physical computing machine.
ToR: Top of Rack Switch
EoR: End of Row Switch
VM: Virtual Machine. Each server can support multiple VMs.
3. Generalized Data Center Design
There are many different ways in which data centers might be
designed. The designs are usually engineered to suit the particular
application that is being deployed in the data center. For example,
a massive web server farm might be engineered in a very different
way than a general-purpose multi-tenant cloud hosting service.
However, in most cases the designs can be abstracted into a typical
three-layer model consisting of the Access Layer, the Aggregation
Layer, and the Core. The access layer generally refers to the Layer
2 switches that are closest to the physical or virtual servers; the
aggregation layer refers to the Layer 2 / Layer 3 boundary. The core
switches connect the aggregation switches to the larger network
core. Figure 1 shows a generalized data center design, which
captures the essential elements of the various alternatives.
         +-----------+               +-----------+
         |   Core0   |               |   Core1   |         Core
         +-----------+               +-----------+
             |    \                     /    |
             |     \                   /     |
             |      \--------\ /------/      |
             |                X              |
             |      /--------/ \------\      |
             |     /                   \     |
          +--------+                  +--------+
         +--------+|                 +--------+|
         | Aggr11 ||     . . .       | AggrN1 ||   Aggregation Layer
         +--------+/                 +--------+/
            /    \                      /    \
           /      \                    /      \
        +---+    +---+              +---+    +---+
        |T11| ...|T1x|              |T21| ...|T2y|     Access Layer
        +---+    +---+              +---+    +---+
          |        |                  |        |
        +---+    +---+              +---+    +---+
        |   | ...|   |              |   | ...|   |
        +---+    +---+              +---+    +---+
        |   | ...|   |              |   | ...|   |     Server racks
        +---+    +---+              +---+    +---+
        |   | ...|   |              |   | ...|   |
        +---+    +---+              +---+    +---+

           Figure 1: Typical Layered Architecture in DC
3.1. Access Layer
The access switches provide connectivity directly to and from
physical and virtual servers. The access switches may be deployed in
either a top-of-rack (ToR) or an end-of-row (EoR) physical
configuration. A server rack may have a single uplink to one access
switch, or dual uplinks to two different access switches.
3.2. Aggregation Layer
In a typical data center, aggregation switches interconnect many ToR
switches. Usually there are multiple parallel aggregation switches,
serving the same group of ToRs to achieve load sharing. It is no
longer uncommon to see aggregation switches interconnecting hundreds
of ToR switches in large data centers.
3.3. Core
Core switches connect multiple aggregation switches and act as the
data center's gateway to external networks, or interconnect
different PODs within one data center.
3.4. Layer 3 / Layer 2 Topological Variations
3.4.1. Layer 3 to Access Switches
In this scenario the Layer 3 domain is extended all the way to the
access switches. Each rack enclosure forms a single Layer 2 domain,
which is confined to the rack. In general there are no significant
ARP/ND scaling issues in this scenario, as the Layer 2 domain cannot
grow very large. This topology is well suited to scenarios where
servers (or VMs) under one access switch do not need to be re-loaded
with applications using different IP addresses, and where hosts do
not need to be moved to racks under different access switches. A
small server farm or a very static compute cluster might be best
served by this design.
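A non-normative sketch of the resulting addressing model is shown
below: with Layer 3 terminating at the access switch, each rack can
simply be assigned its own small IP subnet. The aggregate prefix,
per-rack prefix length, and rack count used here are assumed values
for illustration only.

   # Illustrative only: derive a per-rack addressing plan for a design
   # in which each rack is a separate L2 domain and IP subnet.  The
   # aggregate prefix, per-rack prefix, and rack count are assumptions.
   import ipaddress

   DC_AGGREGATE = ipaddress.ip_network("10.128.0.0/16")  # assumed block
   RACKS        = 40                                     # assumed count

   rack_subnets = list(DC_AGGREGATE.subnets(new_prefix=24))[:RACKS]

   for rack_id, subnet in enumerate(rack_subnets):
       gateway = next(subnet.hosts())      # first address as L3 gateway
       usable  = subnet.num_addresses - 2  # minus network/broadcast
       print(f"rack {rack_id:02d}: {subnet}  gw {gateway}  {usable} hosts")

Because the gateway for each subnet lives on the access switch, ARP/ND
state for a rack never has to be carried above that switch.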
3.4.2. L3 to Aggregation Switches
When the Layer 3 domain extends only to the aggregation switches,
hosts in any of the IP subnets configured on the aggregation
switches can be reached via Layer 2 through any access switch,
provided the access switches enable all of the VLANs. This topology
allows a great deal of flexibility: servers attached to one access
switch can be re-loaded with applications using a different IP
prefix, and VMs can migrate between racks without IP address
changes. The drawback of this design, however, is that multiple
VLANs have to be enabled on all access switches and on all ports of
the aggregation switches. Even though Layer 2 traffic is still
partitioned by VLANs, enabling all VLANs on all ports allows
broadcast traffic on every VLAN to traverse all links and ports,
which has the same effect as one large Layer 2 domain. In addition,
internal traffic itself might have to cross Layer 2 boundaries,
resulting in significant ARP/ND load at the aggregation switches.
This design provides the best trade-off between flexibility and
Layer 2 domain size. A moderately sized data center might use this
approach to provide high-availability services at a single location.
3.4.3. L3 in the Core only
In some cases where a wider range of VM mobility is desired (i.e., a
greater number of racks among which VMs can move without IP address
changes), the Layer 3 routed domain might be terminated at the core
routers themselves. In this case VLANs can span multiple groups of
aggregation switches, which allows hosts to be moved among a larger
number of server racks without IP address changes. This scenario
results in the largest ARP/ND performance impact, as explained
later. A data center with very rapidly shifting workloads might
consider this kind of design.
3.4.4. Overlays
There are several approaches by which overlay networks can make a
very large Layer 2 network scale and enable mobility. Overlay
networks using various Layer 2 or Layer 3 mechanisms allow interior
switches and routers to avoid seeing the hosts' addresses; the
overlay edge switches and routers that perform the network address
encapsulation/decapsulation, however, still see host addresses.
When a large data center has tens of thousands of applications that
communicate with peers in different subnets, all of those
applications send (and receive) data packets through their L2/L3
boundary nodes when the targets are in different subnets. The L2/L3
boundary nodes have to process ARP/ND requests sent from the
originating subnets and resolve physical (MAC) addresses in the
target subnets. In order to allow a great number of VMs to move
freely within a data center without re-configuring their IP
addresses, they need to be under common gateway routers, which means
those common gateways have to handle address resolution for all of
the hosts. The use of overlays in the data center network can
therefore be a useful design mechanism to help manage a potential
bottleneck at the Layer 2 / Layer 3 boundary by redefining where
that boundary exists.
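The following non-normative sketch illustrates the overlay principle
described above: an overlay edge device maps inner host MAC addresses
to the IP address of the remote edge, so interior devices forward only
on the outer header. The mapping table, edge addresses, and header
representation are assumptions for illustration, not any particular
overlay protocol.

   # Illustrative only: a toy overlay edge that hides host MAC
   # addresses from the interior network by encapsulating frames
   # toward the remote edge that owns the destination host.

   # inner host MAC -> IP address of the remote overlay edge (assumed
   # to be populated by a control plane or by data-plane learning)
   HOST_TO_EDGE = {
       "02:00:00:00:00:0a": "192.0.2.10",
       "02:00:00:00:00:0b": "192.0.2.11",
   }

   def encapsulate(inner_frame: bytes, dst_mac: str):
       """Wrap an Ethernet frame for delivery across the overlay.

       Interior switches and routers forward only on the outer header,
       so their tables scale with the number of edges, not the number
       of hosts; only the overlay edges see the inner host addresses."""
       remote_edge = HOST_TO_EDGE.get(dst_mac)
       if remote_edge is None:
           return None   # unknown host: flood or query the control plane
       return {"outer_dst_ip": remote_edge, "payload": inner_frame}

   # Example: a frame for 02:00:00:00:00:0a is tunneled to 192.0.2.10.
   print(encapsulate(b"example-frame", "02:00:00:00:00:0a"))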
4. Factors that Affect Data Center Design
4.1. Traffic Patterns
Expected traffic patterns play an important role in designing
appropriately sized Access, Aggregation, and Core networks. Traffic
patterns also vary based on the expected use of the data center.
Broadly speaking, it is desirable to keep as much traffic as
possible within the Access Layer in order to minimize bandwidth
usage at the Aggregation Layer. If the expected use of the data
center is to serve as a large web server farm, where thousands of
nodes are doing similar things and the traffic pattern is largely
in/out, a large access layer with EoR switches might be the most
useful, as it minimizes complexity, allows servers and databases to
be located in the same Layer 2 domain, and provides maximum density.
A data center that is expected to host a multi-tenant cloud hosting
service might have completely different requirements: in order to
isolate inter-customer traffic, smaller Layer 2 domains are
preferred, and even though the size of the overall data center might
be comparable to the previous example, the multi-tenant nature of
the cloud hosting application requires a smaller, more
compartmentalized Access Layer. A multi-tenant environment might
also require the use of Layer 3 all the way to the Access Layer ToR
switch.
Yet another example of an application with a unique traffic pattern
is a high-performance compute cluster, where most of the traffic is
expected to stay within the cluster, but at the same time there is a
high degree of crosstalk between the nodes. This again calls for a
large Access Layer in order to minimize the requirements placed on
the Aggregation Layer.
4.2. Virtualization
Using virtualization in the data center further increases the
possible densities that can be achieved. Virtualization also further
complicates the requirements on the Access Layer, as that layer
determines the scope of server migrations and of failover when
physical hardware fails.
Virtualization can also place additional requirements on the
aggregation switches in terms of address resolution table size and
the scalability of any address learning protocols that might be used
on those switches. The use of virtualization often also requires
additional VLANs for high-availability beaconing, which would need
to span the entire virtualized infrastructure. This in turn requires
the Access Layer to span as widely as the virtualized
infrastructure.
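As a rough, non-normative illustration of this effect, the following
back-of-the-envelope calculation shows how virtualization multiplies
the number of addresses the aggregation layer must learn and resolve.
All densities used below are assumed values, not measurements.

   # Illustrative only: back-of-the-envelope address table sizing.
   racks            = 100     # racks behind one aggregation layer
   servers_per_rack = 40      # physical servers per rack
   vms_per_server   = 20      # virtual machines per physical server

   physical_hosts = racks * servers_per_rack          # 4,000 addresses
   virtual_hosts  = physical_hosts * vms_per_server   # 80,000 addresses

   print(f"entries without virtualization: {physical_hosts:,}")
   print(f"entries with virtualization:    {virtual_hosts:,}")

Even modest per-server VM densities can therefore push MAC and ARP/ND
table requirements an order of magnitude or more beyond the physical
server count.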
4.3. Impact of Data Center Design on L2/L3 protocols
When an L2/L3 boundary router receives data packets via its L3
interfaces destined for hosts in its L2 domain, and the target
address is not present in the router's ARP/ND cache, it usually
holds the data packets and initiates ARP/ND requests toward its L2
domain to make sure the target actually exists before forwarding the
data packets to the target. If no response is received, the router
has to send the ARP/ND request multiple times. If no response is
received after X ARP/ND requests, the router needs to drop all of
those held data packets. This process can be very CPU intensive.
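The resolution behavior described above can be sketched,
non-normatively, as the simplified logic below. The retry count, the
timer, and the helper functions send_to_l2() and send_arp_or_ns() are
placeholders assumed for illustration; real implementations differ.

   # Illustrative only: simplified ARP/ND resolution at an L2/L3
   # boundary router for packets arriving on an L3 interface.
   import collections
   import time

   MAX_RETRIES    = 3      # assumed number of retransmissions ("X")
   RETRY_INTERVAL = 1.0    # assumed seconds between requests

   neighbor_cache = {}                        # target IP -> resolved MAC
   pending = collections.defaultdict(list)    # target IP -> held packets

   def send_to_l2(packet, mac):               # stand-in for L2 transmit
       print(f"forward {packet!r} to {mac}")

   def send_arp_or_ns(target_ip):             # stand-in for ARP/NS send
       print(f"who-has {target_ip}?")         # each request costs CPU

   def forward(packet, target_ip):
       mac = neighbor_cache.get(target_ip)
       if mac is not None:
           send_to_l2(packet, mac)            # fast path: cache hit
           return
       pending[target_ip].append(packet)      # hold packet, resolve first
       for _ in range(MAX_RETRIES):
           send_arp_or_ns(target_ip)
           time.sleep(RETRY_INTERVAL)         # wait for a reply
           if target_ip in neighbor_cache:    # reply filled in elsewhere
               for held in pending.pop(target_ip):
                   send_to_l2(held, neighbor_cache[target_ip])
               return
       pending.pop(target_ip, None)           # no reply: drop held packets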
When a local host in the L2/L3 router's L2 domain needs to send a
data frame to external peers, it usually sends ARP/ND requests to
obtain the physical (MAC) address of the L2/L3 router. Many hosts
repetitively send ARP/ND requests to their default L3 gateway
routers to refresh their ARP/ND caches. This requires the default
routers to process a great number of ARP/ND requests when the number
of hosts in their L2 domains is very large. For IPv4, this pain
point can be mitigated by having the gateway routers frequently send
out gratuitous ARP, so that all of the hosts in the L2 domain
refresh their ARP cache entries for the default gateway's MAC
address. For IPv6, however, hosts need to validate bi-directional
communication with the gateway router before sending data frames;
therefore, unsolicited neighbor advertisements from the gateway
router cannot prevent hosts from sending ND messages repetitively.
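As a non-normative sketch of the IPv4 mitigation described above, a
gateway might periodically announce its own address mapping via
gratuitous ARP. The example below uses the Scapy packet library; the
addresses, interface name, and interval are assumed values, and, as
noted, no equivalent shortcut exists for IPv6 because hosts still
require solicited confirmation of gateway reachability.

   # Illustrative only: a gateway periodically sends gratuitous ARP so
   # hosts refresh their cache entry for the gateway MAC without each
   # host issuing its own request.  Addresses, interface, and interval
   # are assumed values; sending requires raw-socket privileges.
   from scapy.all import ARP, Ether, sendp
   import time

   GW_IP    = "10.0.0.1"             # assumed gateway IPv4 address
   GW_MAC   = "02:00:00:00:00:01"    # assumed gateway MAC address
   IFACE    = "eth0"                 # assumed L2-facing interface
   INTERVAL = 30                     # assumed seconds between announcements

   # Gratuitous ARP: sender and target protocol addresses are both the
   # gateway's own address, broadcast to the whole L2 domain.
   garp = (Ether(src=GW_MAC, dst="ff:ff:ff:ff:ff:ff") /
           ARP(op=2, hwsrc=GW_MAC, psrc=GW_IP,
               hwdst="ff:ff:ff:ff:ff:ff", pdst=GW_IP))

   while True:
       sendp(garp, iface=IFACE, verbose=False)
       time.sleep(INTERVAL)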
When hosts in two different subnets under the same L2/L3 boundary
router need to communicate with each other, the L2/L3 router not
only has to initiate ARP/ND requests toward the target's subnet, it
also has to process the ARP/ND requests from the originating subnet.
This process is even more CPU intensive.
5. Conclusion and Recommendation
In this document we have described a generalized data center network
design. Our goal is to distill the essence of different designs into
a common framework, in an attempt to structure the discussion of the
various scaling issues that might appear in different scenarios.
Different application needs, such as traffic patterns and the role
for which the data center is being designed, determine various
design choices, which in turn result in various scaling issues with
regard to port density, ARP/ND, VM mobility, and performance. As
expected, engineering solutions serve to tune a given design to the
particular needs of the data center at the expense of other factors.
6. Manageability Considerations
This document does not add additional manageability considerations.
7. Security Considerations
This document has no additional requirement for security.
8. IANA Considerations
None.
9. Acknowledgments
We want to acknowledge the following people for their valuable
discussions related to this draft: Kyle Creyts, Alexander Welch and
Michael Milliken.
This document was prepared using 2-Word-v2.0.template.dot.
10. References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
          Requirement Levels", BCP 14, RFC 2119, March 1997.

[ARP]     Plummer, D., "An Ethernet Address Resolution Protocol",
          RFC 826, November 1982.

[ND]      Narten, T., Nordmark, E., Simpson, W., and H. Soliman,
          "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861,
          September 2007.

[STUDY]   Rees, J. and Karir, M., "ARP Traffic Study", NANOG 52,
          June 2011. URL:
          http://www.nanog.org/meetings/nanog52/presentations/Tuesday/Karir-4-ARP-Study-Merit Network.pdf

[DATA1]   Cisco Systems, "Data Center Design - IP Infrastructure",
          October 2009. URL:
          http://www.cisco.com/en/US/docs/solutions/Enterprise/Data_Center/DC_3_0/DC-3_0_IPInfra.html

[DATA2]   Juniper Networks, "Government Data Center Network Reference
          Architecture", 2010. URL:
          www.juniper.net/us/en/local/pdf/reference-architectures/8030004-en.pdf
Authors' Addresses
Manish Karir
Merit Network Inc.
1000 Oakbrook Dr, Suite 200
Ann Arbor, MI 48104, USA
Phone: 734-527-5750
Email: mkarir@merit.edu
Ian Foo
Huawei Technologies
2330 Central Expressway
Santa Clara, CA 95050, USA
Phone: 919-747-9324
Email: Ian.Foo@huawei.com
Intellectual Property Statement
The IETF Trust takes no position regarding the validity or scope of
any Intellectual Property Rights or other rights that might be
claimed to pertain to the implementation or use of the technology
described in any IETF Document or the extent to which any license
under such rights might or might not be available; nor does it
represent that it has made any independent effort to identify any
such rights.
Copies of Intellectual Property disclosures made to the IETF
Secretariat and any assurances of licenses to be made available, or
the result of an attempt made to obtain a general license or
permission for the use of such proprietary rights by implementers or
users of this specification can be obtained from the IETF on-line
IPR repository at http://www.ietf.org/ipr
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
any standard or specification contained in an IETF Document. Please
address the information to the IETF at ietf-ipr@ietf.org.
Disclaimer of Validity
All IETF Documents and the information contained therein are
provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION
HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY,
THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
WARRANTY THAT THE USE OF THE INFORMATION THEREIN WILL NOT INFRINGE
ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
FOR A PARTICULAR PURPOSE.
Acknowledgment
Funding for the RFC Editor function is currently provided by the
Internet Society.