Network Working Group B. Campbell
Internet-Draft Tekelec
Intended status: Informational June 06, 2013
Expires: December 08, 2013

Diameter Overload Control Solution Issues
draft-campbell-dime-overload-issues-00

Abstract

The Diameter Maintenance and Extensions (DIME) working group has undertaken an "overload control" work item, with the goal of standardizing a mechanism to allow Diameter nodes to report overload information among themselves. Requirements currently include, among others, the need to accurately report the scope of overload conditions, and the ability to report overload information between nodes that are not directly connected at the transport layer. These requirements introduce complex issues. This document describes those issues, in the hope that it will assist the working group's decision process.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on December 08, 2013.

Copyright Notice

Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.


1. Introduction

When a Diameter [RFC6733] server or agent becomes overloaded, it needs to be able to gracefully reduce its load, typically by requesting other nodes to reduce the number of Diameter requests for some period of time.

The Diameter Overload Control Requirements [I-D.ietf-dime-overload-reqs] describe requirements for overload control mechanisms. Requirement 31 states that Diameter nodes must be able to report overload with sufficient granularity to avoid forcing available capacity to go unused. Requirement 34 requires the ability to report overload across Diameter nodes that do not support the mechanism. These requirements introduce significant and interrelated complexities to potential solutions. This document describes the related issues. The author hopes that this document will assist the working group's decision process related to these requirements.

At the time of this writing, there have been two proposals for Diameter overload control mechanisms. "A Mechanism for Diameter Overload Control" (MDOC) [I-D.roach-dime-overload-ctrl] defines a mechanism that piggybacks overload and load state information over existing Diameter messages. "The Diameter Overload Control Application" (DOCA) [I-D.korhonen-dime-ovl] defines a mechanism that uses a new and distinct Diameter application to communicate similar information.

2. Documentation Conventions

This document uses terms defined in [RFC6733] and [I-D.ietf-dime-overload-reqs]. In particular, the terms "client", "server", "upstream", and "downstream" are used as defined in RFC 6733. In addition, this document uses the following terms:

Overload:
A condition where a Diameter node needs a reduction in the number of requests that it must handle.
Overload Report:
A request to reduce traffic that contributes to an overload condition.
Overload Scope:
The set of requests that may contribute to an overload condition.
Reporting Node:
The node that sends an overload report. Also known as an "overloaded node".
Reacting Node:
A node that consumes and possibly acts on an overload report.
Adjacent Overload Report:
An overload report sent between adjacent Diameter peers.
Non-Adjacent Overload Report:
An overload report sent between Diameter nodes separated by one or more intermediary nodes (i.e., agents or proxies).

3. Overload Scopes

Diameter overload may affect some requests and not others. The Diameter overload requirements [I-D.ietf-dime-overload-reqs] list several scenarios that illustrate overload that affects some requests but not others. We refer to the set of requests affected by a particular overload event as the "overload scope" (or "scope") of the event. The overload requirements require an extensible scope mechanism, with support for at least scopes of type "Diameter node", "Realm", and "Diameter Application".

A scope indication in an overload report is a set of classifiers that identify requests likely to contribute to the overload condition. In general, this could include any aspect of a Diameter message that a reacting node can observe. For example, requests could be classified by Attribute Value Pair (AVP) values or next-hop routing decisions.

The ability to express the scope of an overload condition is only useful when reacting nodes can act on the information. There are only a small number of actions a reacting node may take to mitigate load. Essentially these actions boil down to reducing the number of requests that match the scope, either by sending fewer requests in the first place, or by routing around the problem. The former is limited by the node's ability to distinguish requests that match the overload scope from requests that do not. The latter is limited by the node's ability to predict or influence how a request will be routed.
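As a minimal sketch of the first mitigation (sending fewer matching requests), the following Python fragment admits only a fraction of the requests that match a scope. The Abatement class, the in_scope predicate, the dictionary representation of a request, and the idea of a requested reduction fraction are all illustrative assumptions; none of them is defined by this document or by either proposal.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Request = Dict[str, str]  # hypothetical: AVP name -> value


@dataclass
class Abatement:
    """Sheds a fixed fraction of in-scope requests (illustrative only)."""
    in_scope: Callable[[Request], bool]  # assumed scope classifier
    reduction: float                     # assumed fraction to shed, 0.0..1.0
    _credit: float = 0.0

    def admit(self, request: Request) -> bool:
        # Requests outside the scope are never throttled.
        if not self.in_scope(request):
            return True
        # Accumulate "shed credit"; drop one request each time it reaches 1.
        self._credit += self.reduction
        if self._credit >= 1.0:
            self._credit -= 1.0
            return False
        return True
```

With a reduction of 0.5, every second in-scope request is dropped, while requests that do not match the scope pass untouched. A real reacting node would more likely reject or defer requests than silently drop them; that choice is outside the scope of this sketch.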

This section discusses the meanings of the required scope-types, and analyses their implications for the selected mechanism.

3.1. Types of Overload Scopes

There are several different kinds, or types, of overload scopes. The type of a scope defines how the reacting node interprets it.

3.1.1. Diameter Node Scope

The "Diameter Node" scope-type indicates that a particular Diameter node is overloaded. Other nodes should mitigate the overload by reducing the number of requests that will land on the overloaded node, either by sending fewer requests, or by attempting to route requests around the overloaded node.

In practice, the reporting node may have three distinct relationships with the reacting node. The reporting node may be a Diameter peer, meaning it has a direct transport layer connection with the reacting node. It may be an endpoint, that is, a Diameter server (or client, in the case of server-to-client requests). Finally, it may be a non-adjacent agent, that is, a node that is neither a peer nor an endpoint. Each of these cases is effectively a separate scope-type, since each requires different behaviors from reacting nodes.

3.1.1.1. Peer-Node Scope-Type

In the case of a peer, the reacting node simply sends fewer requests directly to the peer. If it has other peers that are candidates for the requests, it may reroute requests to them. We refer to this scope-type as "Peer-Node".

The "Peer-Node" scope-type can further be broken down by transport connection. Large-scale Diameter nodes are often implemented as clusters of IP hosts, which may or may not share their knowledge about upstream overload conditions. Certain IP hosts in a cluster could become overloaded when others do not. Therefore it may be useful to specify a "Peer-Connection" scope-type, to request reduction of traffic on a specific transport (i.e. TCP or SCTP) connection.
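The rerouting option for a Peer-Node scope could be sketched as follows. The select_peer function and the representation of peers as plain strings are hypothetical; how a node learns its candidate peers is outside the scope of this sketch.

```python
from typing import Iterable, Optional, Set


def select_peer(candidates: Iterable[str], overloaded: Set[str]) -> Optional[str]:
    """Return the first candidate peer not reported as overloaded."""
    for peer in candidates:
        if peer not in overloaded:
            return peer
    # Every candidate is overloaded: rerouting cannot help, so the
    # caller has to send fewer requests instead.
    return None
```

For a "Peer-Connection" scope, the same selection would presumably operate on individual transport connections rather than on peer identities.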

3.1.1.2. Destination-Host Scope-Type

If the overloaded node is an endpoint from the reacting node's perspective, the best the reacting node can do is reduce the number of requests that contain a Destination-Host AVP that matches the overloaded node. Rerouting will not help in general, since the requests will simply take different routes to arrive at the overloaded server. Unless the destination node is a direct peer, the reacting node cannot do much about requests that don't contain a Destination-Host AVP in the first place, since it cannot predict whether these requests will land on the overloaded endpoint. We refer to this scope-type as "Destination-Host". While Destination-Host scopes may offer less utility to reacting nodes than Peer-Node scopes, they are still useful for requests bound to a particular server, for example, mid-session requests for a session-stateful application.
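The three cases described above can be illustrated with a small classifier. This is only a sketch: the destination_host_action function, its return labels, and the dictionary representation of a request are hypothetical, and comparing host names case-insensitively is an assumption of the sketch.

```python
from typing import Dict, Optional

Request = Dict[str, str]  # hypothetical: AVP name -> value


def destination_host_action(request: Request, overloaded_host: str) -> str:
    """Classify a request against a hypothetical Destination-Host scope."""
    dest: Optional[str] = request.get("Destination-Host")
    if dest is None:
        # No Destination-Host AVP: the reacting node cannot predict
        # whether the request will land on the overloaded endpoint.
        return "unpredictable"
    if dest.lower() == overloaded_host.lower():
        # Bound to the overloaded server (e.g. a mid-session request).
        return "abate"
    # Addressed to some other endpoint; not part of this scope.
    return "forward"
```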

Diameter agents that implement certain topology-hiding schemes may modify Origin-Host AVPs inserted by servers, and use some local mechanism to bind sessions to specific servers. The "Destination-Host" type may not function correctly in this case. MDOC specifies a "session-group" scope-type, where a topology hiding agent can assign a common identifier to sessions that are fate-shared in some way, such as being bound to the same server. If that server becomes overloaded, the agent can send an overload report that matches requests in all sessions with the matching identifier. This scope-type may be useful under certain circumstances, but may also be complex to implement. Further discussion is needed to determine if the session-group type should be included in the base mechanism. Since the mechanism is required to allow extensible scope-types, session-groups could still be added in the future.

3.1.1.3. Non-Adjacent Nodes

The reacting node cannot in general predict which requests will impact a particular non-adjacent agent, other than by guessing that a certain percentage of requests for a particular realm or application might traverse it. Those examples would be better handled with scope types designed for that purpose, e.g. "Realm" or "Diameter Application".

3.1.2. Realm Scope

The "Realm" scope-type indicates overload for all servers that handle requests for the particular Diameter realm. That is, it impacts all requests with the particular realm in the Destination-Realm AVP.

The Realm scope-type is useful for declaring a global overload condition within a network serving a single realm. It is also useful for requesting third-parties to reduce Diameter traffic sent to a particular realm, for example, in roaming scenarios.

Since the Realm scope-type indicates overload for an entire realm, reacting nodes should reduce the number of messages sent for the realm. Rerouting traffic does not make sense for the Realm scope-type, since it would almost never be useful for Diameter nodes to reroute traffic destined for an overloaded realm to a different, non-overloaded realm. Client applications might, however, be able to choose to use services from a different operator if the Diameter realm of one operator reports an overload condition.

3.1.3. Diameter Application Scope

The "Diameter Application" scope-type indicates overload for a particular Diameter application. That is, it impacts all requests with the matching value in an Application-Id AVP.

The Diameter Application scope-type is useful for declaring an overload condition that affects a specific Diameter service, typically, but not necessarily, in a specific realm.

Since the Diameter Application scope-type indicates overload for an entire application, reacting nodes should reduce the number of requests sent for that application. Similarly to the Realm scope-type, it will rarely if ever make sense for a Diameter node to reroute traffic to a different Diameter application.

3.1.4. Origin-Host Scope

While most scope-types refer to where a request is likely to go, the "Origin-Host" scope-type refers to where the request originates. That is, any request with a matching Origin-Host AVP would match. The Origin-Host scope type is useful for situations where a specific client or set of clients sends an excessive number of requests. An overload report with an Origin-Host scope would tell matching clients to reduce traffic, or agents to throttle requests that came from matching clients.

3.2. Scope Values

Scope labels in an overload report will typically take the form of a scope-type and a value. For example, if the "example.com" realm is overloaded for all services, the overload report would indicate a scope-type of "Realm" and a scope-value of "example.com".

A possible exception is the "Peer-Connection" scope-type. Since an overload report with a Peer-Connection scope is only actionable by one of the peers connected via the specified connection, it makes sense to treat the Peer-Connection scope-type as always having a value of "this connection".

There has been discussion among working group participants about whether scope-values are really needed for a piggy-backed overload-control mechanism. The discussion boils down to a question about whether an overload-report indicates overload just for the realm, application, etc, of the Diameter message carrying the report, or whether it can indicate overload for other realms, applications, etc. MDOC allows values for most scope-types, even though it is a piggy-backed mechanism.

Implicit scope values would preclude the ability to signal just a realm, just an application, or just a connection, without signaling all three in combination. The overload control requirements explicitly require the ability to specify each of these. An implicit scope value approach would violate those requirements.

Scope-values are required for application-based mechanisms. For example, an overload report with a Diameter application scope will almost always need to talk about Diameter applications other than the "overload-control" application.

3.3. Combining Scopes

Diameter nodes will commonly need to construct overload reports that apply to a combination of scopes. For example, if a given realm is overloaded for a subset of the applications it supports, it might indicate both a realm scope and one or more Diameter application scopes.

Logically, combining multiple scopes of different types reduces the overall set of requests to which the overload report would apply. Combining multiple scopes of the same type increases the applicable set. A function that determines the requests affected by an overload report could model this as a logical "and" or "intersection" operator for combining scopes of different types, and a logical "or" or "union" operator for combining scopes of the same type.
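The combination rule can be sketched as a small matching function. The report_applies function, the SCOPE_ATTRIBUTE table, and the representation of scopes as (type, value) pairs are assumptions made for illustration; neither proposal defines them in this form.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Request = Dict[str, str]   # hypothetical: AVP name -> value
Scope = Tuple[str, str]    # (scope-type, scope-value)

# Assumed mapping from scope-type to the request attribute it tests.
SCOPE_ATTRIBUTE = {
    "Realm": "Destination-Realm",
    "Application": "Application-Id",
    "Destination-Host": "Destination-Host",
    "Origin-Host": "Origin-Host",
}


def report_applies(scopes: Iterable[Scope], request: Request) -> bool:
    """True if the request falls inside the combined scope of a report."""
    values_by_type: Dict[str, Set[str]] = defaultdict(set)
    for scope_type, value in scopes:
        values_by_type[scope_type].add(value)
    # Different types intersect ("and"); values of one type unite ("or").
    return all(
        request.get(SCOPE_ATTRIBUTE[scope_type]) in values
        for scope_type, values in values_by_type.items()
    )
```

For instance, a report carrying one Realm scope and two Application scopes matches a request only if the realm matches and the request uses either of the two applications. In this sketch a report with no scopes matches every request, and an unknown scope-type raises a KeyError; a real mechanism would need to define both behaviors.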

We need further discussion about whether all possible combinations should be allowed. For example, it may or may not make sense to combine a "Peer-Connection" scope with other scopes, or to allow more than one "Peer-Connection" scope-value for a single overload report.

3.4. Scope Extensibility

[I-D.ietf-dime-overload-reqs] requires scope-types to be extensible. This requirement implies that the chosen mechanism or mechanisms must discuss how new scope-types can be added, how support for specific scope-types should be declared or negotiated, and which scope-types might be mandatory to support.

3.5. Scope Recommendations

In the author's opinion, the selected solution or solutions should support, at a minimum, the "Peer-Connection", "Peer-Node", "Destination-Host", "Realm" and "Application-ID" scope-types. The working group should consider also adding the "Origin-Host" scope-type.

The working group should consider whether the advantages of the "session-group" concept and scope-type are worth the complexity.

4. Non-adjacent Overload Information

Requirement 34 of [I-D.ietf-dime-overload-reqs] says that the selected Diameter overload control mechanism "SHOULD" be able to communicate overload and load information across intermediaries that do not support the mechanism. This requirement complicates the solution effort in several ways: it affects how Diameter nodes negotiate support for overload control, how they address and route overload reports to the right places, and how they act on received overload reports.

While the requirement does not explicitly say it, we interpret "intermediaries" in this context to mean Diameter agents. The requirement is irrelevant for lower layer intermediaries (e.g. routers), and cannot be reasonably applied for non-Diameter entities, or hybrid entities such as gateways between Diameter and other protocols.

The requirement to traverse non-supporting intermediaries is not necessarily the same thing as a requirement for end-to-end communication of overload reports between Diameter clients and servers. Diameter agents can also originate and consume overload reports. Therefore, we refer to this requirement as "Non-adjacent Overload Control".

4.1. Use-Cases for Non-adjacent Overload Control

There are two primary use-cases for non-adjacent overload control.

4.1.1. Interconnect

The first significant non-adjacent use-case is the interconnect scenario described in section 2.3 of the overload control requirements [I-D.ietf-dime-overload-reqs]. Two or more Diameter network operators communicate with each other across a third-party interconnect provider that brokers Diameter traffic between the operators.

If the interconnect provider does not support Diameter overload control, each operator network becomes an island of overload control, similar to those in the non-supporting agent use-case (Section 4.1.2). Even if the interconnect provider does support overload control, the operators may not trust it to generate and act on overload reports on their behalf, and may prefer to exchange overload and load information directly with each other.

The interconnect use-case may introduce additional security concerns. While the non-supporting agent use case typically (but not necessarily) occurs inside a single administrative domain, the interconnect case will almost always involve sending overload reports across multiple administrative domains. Since a malicious or incorrect overload report can effectively shut down Diameter processing, the current lack of a viable solution for end-to-end integrity protection of Diameter messages may be a problem.

4.1.2. Non-Supporting Agents

[I-D.ietf-dime-overload-reqs] requires the solution to function in networks where not all Diameter elements support it. That is, the solution must allow gradual deployment, and must not require a flag-day cutover. If non-adjacent overload control is not supported, one or more non-supporting Diameter Agents can divide a network into overload control islands, where overload information is communicated inside each island, but not among separate islands.
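The island effect can be illustrated by treating the Diameter network as a graph of peer connections and grouping supporting nodes that can reach one another without crossing a non-supporting node. The overload_control_islands function and the adjacency-map input are hypothetical, and the sketch assumes the peer links are listed symmetrically.

```python
from typing import Dict, List, Set


def overload_control_islands(
    links: Dict[str, Set[str]], supporting: Set[str]
) -> List[Set[str]]:
    """Group supporting nodes that can reach each other without
    crossing a non-supporting node (connected components)."""
    islands: List[Set[str]] = []
    seen: Set[str] = set()
    for start in sorted(supporting):
        if start in seen:
            continue
        island, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in island:
                continue
            island.add(node)
            # Only follow peer links that lead to supporting nodes.
            stack.extend(n for n in links.get(node, set()) if n in supporting)
        seen |= island
        islands.append(island)
    return islands
```

In a chain such as client, agent, non-supporting relay, agent, server, the function returns two islands, one on each side of the relay.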

4.2. Issues with Non-Adjacent Overload Control

4.2.1. Topology Issues

Many of the issues with non-adjacent overload control derive from the fact that a Diameter node is unlikely to know the topology of the Diameter network past its immediate peers. In a trivial topology, that is, a Diameter network with only clients and servers, this is not a problem. But if the immediate peer is a Diameter agent, a node is unlikely to know what next hop the agent will select for a given Diameter message. This is particularly difficult if the agent hides topology in either direction, or uses dynamic peer discovery. While a node may be able to infer the path a given message will take in some specific cases (e.g. for mid-session messages), it cannot do this in general. And even those specific cases may fail if an agent on the message path performs topology hiding.

This lack of topology knowledge impacts the way that nodes can negotiate overload-control support, the ways they send overload reports, and the ways a reacting node can act to mitigate overload. A non-adjacent overload-control mechanism will need to solve the topology issues, either by offering ways to discover non-adjacent topologies, or offering ways to constrain overload-control relevant parts of such topologies in ways where a node could reasonably know them in advance.

4.2.2. Support Negotiation

Diameter nodes need to negotiate or otherwise indicate their support for overload control to other nodes. This includes indicating support for overload control in general, as well as potentially indicating support of certain parameters of the overload control solution. For example, a node may need to indicate which overload algorithms it supports. This becomes complex if two non-adjacent nodes need to negotiate support.

In a Diameter application-based solution, support for the overload control application would occur during the capabilities exchange between peers. Diameter capabilities exchange occurs strictly between peers; Diameter offers no mechanism for indicating support of a given Application-ID between non-adjacent nodes.

Diameter allows non-negotiated use of an arbitrary Application-Id between non-adjacent nodes across Diameter agents that implement the Diameter Relay application. In theory, this means that an application-based, non-adjacent overload control could only traverse Diameter relays, or Diameter proxies that explicitly support the overload-control Application-Id. In the latter case, we assume that a proxy will not indicate support for the overload-control Application-Id unless it supports the overload-control mechanism; such a proxy cannot be considered a non-supporting agent.

In practice, a Diameter agent can act as a proxy for some purposes and a relay for others. If a Diameter proxy indicates support for the Diameter relay application, we assume that it will relay any arbitrary application. This means it can be considered a relay for the purposes of overload control.

For both application-based and piggybacked solutions, a supporting node needs to know the other nodes with which it should negotiate. For overload-control between Diameter peers, this is easy; a node exchanges support information with its immediate peers. But for non-adjacent overload control, this is more difficult for reasons discussed in Section 4.2.1.

Therefore, for non-adjacent overload control negotiation, each supporting node either needs advance knowledge of all nodes with which it may negotiate overload-control support, or it needs a mechanism for discovering that knowledge dynamically.

4.2.3. Overload Report Delivery

With hop-by-hop overload control mechanisms, overload report addressing and delivery is relatively simple. A node sends overload reports directly to its peers. This becomes more complex for non-adjacent overload-control.

For application-based overload control, nodes could address overload reports to specific endpoint nodes using the Destination-Host AVP. Doing so would be subject to the same non-adjacent topology issues described in Section 4.2.1. That is, a node can only send overload reports to non-adjacent clients or servers that it knows about, either from prior knowledge (e.g. provisioning) or because it has previously observed Diameter messages from them.

An application-based mechanism could possibly address reports to non-adjacent Diameter agents using the Destination-Host AVP. This would effectively make the agent into an endpoint for the overload-control application.

A piggy-backed mechanism will have more difficulty addressing non-adjacent overload reports. A piggy-backed mechanism sends overload reports in already existing Diameter requests; that is, requests that have their own purposes and destinations independent of the overload report. Thus, nodes can only select the destination of an overload report by bundling it into a Diameter message that was already going to that destination. While a piggy-backed mechanism might be able to send overload reports across quiescent transport connections using watchdog (DWR/DWA) messages, these messages cannot be exchanged between non-adjacent nodes.
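The delivery constraint on a piggy-backed mechanism can be sketched as a queue of pending reports that are released only when an unrelated message happens to be heading to the right destination. The PiggybackQueue class, the "destination" and "overload-reports" keys, and the dictionary message representation are hypothetical and are not taken from MDOC.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List

Message = Dict[str, object]  # hypothetical message representation


class PiggybackQueue:
    """Holds reports until a message to the same destination appears."""

    def __init__(self) -> None:
        self._pending: DefaultDict[str, List[str]] = defaultdict(list)

    def queue_report(self, destination: str, report: str) -> None:
        self._pending[destination].append(report)

    def on_outgoing(self, message: Message) -> Message:
        # A report can only ride on a message already bound for its target.
        destination = str(message.get("destination"))
        reports = self._pending.pop(destination, [])
        if reports:
            message = {**message, "overload-reports": reports}
        return message

    def undelivered(self) -> Dict[str, List[str]]:
        # Reports still waiting because no suitable message has been sent.
        return dict(self._pending)
```

The undelivered() method makes the weakness visible: a report for a destination that receives no traffic simply waits, with no way for the reporting node to force delivery.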

For both piggy-backed and application-based solutions, non-adjacent overload control introduces a need to identify the sender of a report, or at least determine whether the report is from an adjacent or non-adjacent node. This is not required for purely hop-by-hop solutions, since the sender could always be assumed to be the peer.

For example, a non-adjacent report with a "Peer-Connection" scope does not make sense. If a node receives one, it should probably ignore it. But in order to make that decision, it must be able to distinguish a non-adjacent report from an adjacent one. For example, in an application-based mechanism,

4.2.4. Non-Adjacent Overload Scopes

A reacting node will typically attempt to mitigate an overload condition by either reducing the number of requests that contribute to the condition, or by rerouting part of that traffic to avoid the problem. In both cases, the reacting node is limited by its ability to determine which Diameter requests contribute to the overload condition in the first place. The overload scope concept (Section 3) offers a way for overloaded nodes to indicate what traffic is likely to contribute to the overload and should be abated.

Not all of the scope-types described in Section 3 make sense for non-adjacent overload control. The "Peer-Connection" scope-type is an obvious example, since the reacting node will never share a transport connection with a non-adjacent node; this is the very definition of non-adjacent nodes.

Since a Diameter node cannot control in general how requests are forwarded to non-adjacent nodes, the "host" scope-type also does not work well, especially when there are multiple possible destinations up or downstream from the adjacent peer. For example, in Figure 1, Node A sends Diameter requests to Nodes B and C across a non-supporting agent. If Node B becomes overloaded but Node C does not, Node A cannot reroute requests to Node C, since it has very little way to influence where the agent will forward any given request. If Node A tries to reduce traffic by 50%, the agent will likely still send half of the remaining traffic to Node B. If B and C are endpoints, Node A may in some cases be able to use the Destination-Host AVP for this purpose (in which case the "Destination-Host" scope-type would be more appropriate), but this does not help if B and C are agents rather than servers.

                      +--------+       +--------+
                      | Node B |       | Node C |
                      +----+---+       +---+----+
                           |               |
                           +-------+-------+
                                   |
                           +-------+--------+
                           | Non-Supporting | 
                           |  Agent         |
                           +-------+--------+
                                   |
                                   |
                              +----+----+
                              | Node  A |
                              +---------+
		

Figure 1: Non-Adjacent Routing
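The arithmetic behind the Figure 1 example can be worked through with a few lines of Python. The per_node_load function, the offered load of 100 requests per second, and the fixed even split at the agent are assumptions chosen for illustration.

```python
def per_node_load(offered: float, split_to_b: float = 0.5) -> dict:
    """Load reaching B and C when the agent splits traffic by a fixed
    ratio that Node A cannot influence (an assumption of this sketch)."""
    return {"B": offered * split_to_b, "C": offered * (1.0 - split_to_b)}


before = per_node_load(100.0)        # A sends 100 requests/s (assumed)
after = per_node_load(100.0 * 0.5)   # A cuts its traffic by 50%
```

Under these assumptions, B and C each receive 50 requests per second before the reduction and 25 afterwards. Node A has relieved B, but only by withholding the same amount of traffic from C, which was never overloaded.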

Scope-types that classify traffic by origin or final destinations, such as "Origin-Host", "Destination-Realm", "Application-ID", and "Destination-Host" can be used for non-adjacent overload control. In general, scope-types that may denote non-adjacent intermediary devices, such as "host", cannot, nor can scope-types that refer only to peers, e.g. "Peer-Connection".
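A reacting node could apply this rule with a check along the following lines. The report_is_actionable function and the sender_is_peer flag are hypothetical; in particular, the sketch assumes the node already has some way to tell whether a report came from an adjacent peer.

```python
from typing import Iterable, Tuple

Scope = Tuple[str, str]  # (scope-type, scope-value)

# Scope-types the text above describes as usable between non-adjacent nodes.
NON_ADJACENT_SCOPE_TYPES = frozenset(
    {"Origin-Host", "Destination-Realm", "Application-ID", "Destination-Host"}
)


def report_is_actionable(scopes: Iterable[Scope], sender_is_peer: bool) -> bool:
    """Decide whether a reacting node should act on a report at all."""
    if sender_is_peer:
        return True
    # From a non-adjacent sender, every scope must be origin- or
    # destination-oriented; otherwise the whole report is ignored.
    return all(
        scope_type in NON_ADJACENT_SCOPE_TYPES for scope_type, _ in scopes
    )
```

The sketch rejects the whole report rather than discarding only the offending scope, because removing one scope-type from a combined scope would widen the set of requests the report matches.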

Even for destination-oriented scope-types, the sender of an overload report must be authoritative for the indicated scope. That is, it must have full knowledge of the congestion state for the scope. For example, if Node B and C both serve the realm "example.com", and B becomes 50% overloaded while C does not, B cannot simply report 50% overload at realm scope. If it did, Node A would reduce its generated traffic by 50%. Since the overall realm is really only overloaded by 25% (assuming B and C normally carry equal shares of the realm's traffic), this would leave the realm operating beneath available capacity.

Therefore, a given node must only report overload for scopes for which it has full knowledge of the load and overload state. That is, it must be a "scope authority" for any scope it reports. In the example, nodes B and C (and any other nodes serving "example.com") would be required to share current load and overload state. The state-sharing requirement could be substantial for high-capacity nodes.
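As a worked illustration, a scope authority that knows the state of every server in the realm could derive a realm-wide figure as a traffic-weighted mean. The realm_reduction function and the equal traffic shares are assumptions of this sketch, not part of any proposal.

```python
from typing import Dict


def realm_reduction(needed: Dict[str, float], share: Dict[str, float]) -> float:
    """Realm-wide reduction as the traffic-weighted mean of the
    reduction each server needs (fractions between 0.0 and 1.0)."""
    total = sum(share.values())
    return sum(needed[node] * share[node] for node in needed) / total


# B needs 50% less traffic, C needs none; equal traffic shares assumed.
overall = realm_reduction({"B": 0.5, "C": 0.0}, {"B": 1.0, "C": 1.0})
```

Here overall evaluates to 0.25, which is the figure a node with full knowledge of both B and C could report at realm scope. The figure describes the realm as a whole; applied uniformly, it does not by itself relieve B.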

When a node reports overload for a certain scope, reacting nodes will treat the overload condition as uniform across the entire scope. For example, if a node reports overload for an entire realm, reacting nodes will reduce traffic equally for all servers that serve that realm. If the servers are unequally overloaded, they must use a more granular scope-type, for example, "Destination-Host".

4.3. Non-adjacent Overload Control Recommendations

A hop-by-hop mechanism allows for very flexible and fine grained overload control. It solves or simplifies a number of issues, such as negotiation of support and parameters, requirements for topology knowledge, end-to-end security, etc, by avoiding them in the first place. Adding non-adjacent support to such a mechanism would complicate it considerably.

Non-adjacent overload control mechanisms are better for connecting islands of overload control. Such a mechanism works well for larger scopes and relatively static topologies.

The author believes that we are unlikely to find a single solution that works well for both hop-by-hop and non-adjacent overload control. While a single solution is more desirable in general, a single solution that works well for both cases is likely to be extremely complicated. Therefore, the working group should consider a separate mechanism for the non-adjacent delivery of overload reports.

If the group chooses to accept two separate solutions, we should be able to specify a single data model and set of AVPs that work for both, with some restrictions. (For example, the non-adjacent solution would likely forbid the use of the "Peer-Connection" scope-type.)

5. IANA Considerations

This draft makes no requests of IANA.

6. Security Considerations

Overload reports induce Diameter nodes to reduce or reroute traffic. For large scopes, a single erroneous or malicious overload report could effectively shut down Diameter processing for an entire realm. A Diameter overload control solution needs mechanisms to ensure that overload reports are only accepted from trusted sources, and that nothing tampers with the reports en route.

For hop-by-hop approaches, the transport connection can be protected with TLS or IPsec. But this will not help for non-adjacent reporting, since no such transport connection exists.

While such work is in progress in the DIME working group, Diameter currently has no viable mechanism for end-to-end authentication and integrity protection. The working group should consider either making non-adjacent overload control contingent on a generic Diameter end-to-end protection mechanism, or adding a specialized protection mechanism to any resulting non-adjacent overload control solution.

7. References

7.1. Normative References

[RFC6733] Fajardo, V., Arkko, J., Loughney, J. and G. Zorn, "Diameter Base Protocol", RFC 6733, October 2012.
[I-D.ietf-dime-overload-reqs] McMurry, E. and B. Campbell, "Diameter Overload Control Requirements", Internet-Draft draft-ietf-dime-overload-reqs-05, February 2013.

7.2. Informative References

[I-D.roach-dime-overload-ctrl] Roach, A., "A Mechanism for Diameter Overload Control", Internet-Draft draft-roach-dime-overload-ctrl-01, October 2012.
[I-D.korhonen-dime-ovl] Korhonen, J. and H. Tschofenig, "The Diameter Overload Control Application (DOCA)", Internet-Draft draft-korhonen-dime-ovl-01, February 2013.
[Whac-a-Mole] "Whack-a-Mole Colloquial Usage".

Appendix A. Contributors

Eric McMurry and Robert Sparks made significant contributions to the concepts in this draft.

Author's Address

Ben Campbell
Tekelec
17210 Campbell Rd.
Suite 250
Dallas, TX 75252
US

EMail: ben@nostrum.com