Internet DRAFT - draft-lior-radius-reliable-transport
Network Working Group
INTERNET-DRAFT A. Lior
Category: Informational Bridgewater Systems
draft-lior-radius-reliable-accounting-00.txt
Expires: December 23rd 2003
Remote Authentication Dial-In User Service (RADIUS) Reliable
Transport
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of [RFC2026].
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Copyright Notice
Copyright (C) The Internet Society (2003). All Rights Reserved.
Abstract
Remote Authentication Dial-In User Service (RADIUS) Request For
Comments (RFCs) do not address RADIUS reliability with respect to
transport of RADIUS messages. This Informational Internet Draft
describes procedures for Retransmission, Failover and Failback.
Lior, et al. [Page 1]
RADIUS Reliable Transport February 2003
Table of Contents
1. Introduction
1.1 Reliable Transport of Authentication and Authorization messages
1.2 Reliable Transport of Accounting messages
1.3 Terminology
1.4 Requirements language
2. RADIUS Transport Today
2.1 RADIUS Transport of Authentication and Authorization
2.2 RADIUS Transport of Accounting
2.3 RADIUS Transportation of Dynamic Authorization messages
3. Model
4. General Requirements
5. General Algorithm
5.1 Retransmit Algorithm
5.2 Offline Algorithm
5.3 Online Algorithm
6. Special Consideration
6.1 Consideration for Accounting Messages
6.2 Consideration for Dynamic Authorization Messages
7. Security Considerations
8. Normative References
9. Informative References
10. Acknowledgments
11. Author's Addresses
12. Intellectual Property Statement
13. Full Copyright Statement
14. Expiration Date
1. Introduction
In the context of this document, transport reliability includes the
ability to detect failures in communication between two RADIUS
entities (client and server), retransmission of messages, failover
procedures, and failback procedures.
Transport reliability is not part of the RADIUS specification and
has been left up to implementers, resulting in inconsistent
approaches that make it difficult to engineer reliable RADIUS-based
deployments.
This document recommends an approach to provide a reliable transport
for RADIUS messages. There have been other discussions covering AAA
Reliable Transport [AAATransport], and implementations of these, for
example Diameter [Diameter]. However, these discussions covered AAA
protocols that use connection-oriented transports (TCP and SCTP),
whereas RADIUS uses a connectionless protocol (UDP) as the
transport mechanism. Where applicable, this document adopts some of
the principles covered by these other sources.
1.1 Reliable Transport of Authentication and Authorization messages
TODO: Motivation for reliable transport during Authentication and
Authorization is needed.
1.2 Reliable Transport of Accounting messages
Usage-based billing requires accuracy in billing presentment.
Customers should be presented with a bill that is consistent with
their usage and contains no errors.
When errors in accounting occur, operators often err on the side of
the customer. This results in loss of revenue for the operator.
Worse, if a customer gets an inconsistent bill, or an inaccurate
bill, they may call customer support. Support calls are expensive,
and customer dissatisfaction must be avoided as well.
The RADIUS protocol does not have a reliable mechanism for
delivering accounting messages. Historically, RADIUS has been
used to service dialup subscribers that are generally billed in a very
coarse-grained fashion, ranging from monthly or yearly contracts to
block-of-time contracts measured in hours. Precision from the
accounting records was not required. For example, if the subscriber
is using an hourly plan and the accounting records for a particular
session are lost, then the operator may lose a few hours of revenue
billed at pennies per hour. This coarse granularity meant that
accounting records did not have to be reliable.
In systems such as Voice over IP (VoIP) and WiFi LANs, where
usage-based billing is used, loss of accounting records could
represent a significant loss of revenue. For example, assume that
RADIUS Accounting Interim messages are generated every 60 to 600
seconds (60 is the minimum, 600 is the recommended minimum). If we
lose the RADIUS Accounting Stop record, this loss would represent a
loss of 60 to 600 seconds of revenue.
This draft describes a best practice for increasing the reliability
of RADIUS Accounting messages.
1.3 Terminology
1.4 Requirements language
In this document, several words are used to signify the requirements
of the specification. These words are often capitalized. The key
words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
this document are to be interpreted as described in [RFC2119].
2. RADIUS Transport Today
2.1 RADIUS Transport of Authentication and Authorization
RADIUS Authentication and Authorization procedures are described in
[RFC2865] and [RFC2869]. In the process of authenticating a user,
the RADIUS Client (e.g. a NAS) will send one or more Access-Request
messages to a Proxy server, which may forward the requests to a
Remote server. Under normal conditions, each Access-Request should
result in one of the following responses: Access-Accept,
Access-Reject or Access-Challenge. Under certain conditions, such as
network errors, the RADIUS Client may not get a response back.
If a response is not received after some unspecified time, the NAS
or Proxy RADIUS server will retry and eventually fail over to another
RADIUS server. As time progresses, servers that failed need to be
retried.
How long to wait, how many times to retry, how to fail over, and how
and when to fail back are not covered by the RADIUS specification.
2.2 RADIUS Transport of Accounting
RADIUS accounting messages are described in [RFC2866] and [RFC2869].
A RADIUS client sends Accounting-Request (start) messages at the
start of a session segment, Accounting-Request (stop) messages at
the end of a session segment and, optionally, Accounting-Request
(Interim) messages periodically during the session at a rate
controlled by the Accounting Interim Interval.
The RADIUS client receives an Accounting-Response message once a
RADIUS Accounting Server has received the Accounting packet and taken
responsibility for it.
The Accounting messages may traverse zero or more proxy
RADIUS Accounting Servers before reaching their destination. As
described in [RFC2866], a proxy RADIUS Accounting Server may pass
the Accounting messages immediately to the next RADIUS Accounting
Server in the proxy chain, or it may store the accounting messages
and send them at a later time. If the proxy RADIUS Accounting
Server stores the accounting message, it responds to the client
with an Accounting-Response message.
The RADIUS specification only provides for an Accounting-Response to
acknowledge the successful reception of Accounting Packets. There
isn’t an Accounting NAK message. A receiver of an accounting
message will silently discard a bad message. The sender of the
message may not know why an acknowledgement was not received. Was
the request message lost, was the response lost, or was there an
error? The sender has no option but to retry.
From RFC2866, "It is recommended that the client continue attempting
to send the Accounting-Request packet until it receives an
acknowledgement, using some form of backoff. If no response is
returned within a length of time, the request is re-sent a number of
times. The client can also forward requests to an alternate server
or servers in the event that the primary server is down or
unreachable. An alternate server can be used either after a number
of tries to the primary server fail, or in a round-robin fashion.
Retry and fallback algorithms are the topic of current research and
are not specified in detail in this document."
Furthermore, failure handling is made more complex by the presence
of proxy servers. Failures can occur at each proxy. The
specification is not clear about how failures should be handled at
the proxy. Should proxies silently discard and let the originator
retry? Or should they retry themselves? How long should they wait?
The specification is also not clear on the issue of whether or not
we treat the three types of accounting messages equally when
failures are detected.
In this Internet draft we recommend strategies for dealing with the
above shortcomings. We believe that application of these
recommendations will go a long way to make RADIUS accounting reliable
enough to be used in applications that demand stringent accounting,
such as usage-based billing.
2.3 RADIUS Transportation of Dynamic Authorization messages
Dynamic authorization messages described in [CHIBA] include
Disconnect-Messages and Change-of-Authorization messages. These
messages are sent by a RADIUS server to the NAS directly or via
intermediaries.
A sender of a Disconnect Message or a Change of Authorization
message expects a NAK or ACK response to its message. If the sender
does not receive a NAK or ACK, it should retry sending the message.
The retransmission and failover procedures are not specified.
In the case of [CHIBA], when the sender receives a NAK message, the
sender should examine the Error-Cause attribute and, based on its
value, may choose to retransmit the message. See further details
below.
3. Model
In this section we present a general set of recommendations to
address the above issues.
The general model used for these discussions is represented in
figure 1.
In figure 1, a RADIUS client can be a NAS, or another RADIUS server
(an Intermediary). The RADIUS server can be an intermediary or the
end RADIUS server. As well, in consideration of [CHIBA], where
message flow is reversed, that is, from the end RADIUS server to
the NAS, the RADIUS Client can be the end RADIUS server and the
RADIUS server can represent either an intermediary or the NAS.
In most robust deployments, as is assumed here, a RADIUS Client has
two or more RADIUS Servers that it can use to send a message destined
to a particular location. We call a collection of zero or more
RADIUS Servers that proxy to a given location a Proxy Group. A
RADIUS Client can have more than one Proxy Group. Specifically, a
RADIUS Client knows which Proxy Group to route a message to. The
routing decision can be based on the type of message (e.g. Access
Request, Accounting) and/or attributes contained in the message
(e.g. the NAI, a calling number). The client also keeps state about
each of the RADIUS Servers in the Proxy Group. It knows, for example,
which RADIUS Servers are available. It may use only one of the
available RADIUS Servers all the time, or all of the available RADIUS
Servers in the Proxy Group in a round-robin fashion. In figure 1,
Proxy Group A is used to send messages based on NAI = x, and the
RADIUS client is using only one RADIUS server. Proxy Group B is used
to route messages based on NAI = y, and the RADIUS client is using
both RADIUS servers in a round-robin fashion.
A RADIUS Server can exist in more than one Proxy Group. The RADIUS
Client keeps a separate state for that RADIUS Server in each group.
Therefore, from the client's point of view, a RADIUS Server that
appears in two Proxy Groups (services two realms) will appear as two
distinct RADIUS Servers. Unless otherwise specified, the term
RADIUS Server refers to the logical RADIUS Server in a particular
Proxy Group.
                              Proxy Group A
                              +------------+
                              |            |
                              |  +------+  |
                              |  |      |  |
               NAI x          |  |RADIUS|  |
      +-----------------------|->|Server|  |
      |                       |  |      |  |
      |                       |  +------+  |
      |                       |            |
      |                       |  +------+  |
      |                       |  |      |  |
      |                       |  |RADIUS|  |
      |                       |  |Server|  |
      |                       |  |      |  |
  +--------+                  |  +------+  |
  |        |                  |            |
  | RADIUS |                  +------------+
  |        |                  Proxy Group B
  | Client |                  +------------+
  |        |                  |            |
  +--------+                  |  +------+  |
      |                       |  |      |  |
      |                       |  |RADIUS|  |
      |               +-------|->|Server|  |
      |               |       |  |      |  |
      |               |       |  +------+  |
      +---------------+       |            |
          NAI y       |       |  +------+  |
                      |       |  |      |  |
                      |       |  |RADIUS|  |
                      +-------|->|Server|  |
                              |  |      |  |
                              |  +------+  |
                              |            |
                              +------------+
Figure 1: Basic Architecture.
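The model above can be illustrated with a minimal Python sketch. It
is not part of the RADIUS specification; the class names, the
"host:port" labels, and the choice to list one server ("s2") in both
groups are assumptions made only for illustration. It shows a routing
table keyed by NAI realm, a Proxy Group that is used either
first-available or round-robin, and per-group server state.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ServerState:
    """Per-Proxy-Group state for one logical RADIUS Server."""
    address: str          # hypothetical "host:port" label
    online: bool = True


@dataclass
class ProxyGroup:
    name: str
    servers: list
    round_robin: bool = False
    _turn: count = field(default_factory=count, repr=False)

    def pick(self):
        """Return the next server to use, or None if all are offline."""
        available = [s for s in self.servers if s.online]
        if not available:
            return None
        if not self.round_robin:
            return available[0]          # always use the first available server
        return available[next(self._turn) % len(available)]


def build_figure_1():
    # Assumption for illustration: "s2" is listed in both groups. Each
    # group holds its own ServerState, so the states are independent.
    group_a = ProxyGroup("A", [ServerState("s1:1812"), ServerState("s2:1812")])
    group_b = ProxyGroup("B", [ServerState("s2:1812"), ServerState("s3:1812")],
                         round_robin=True)
    return {"x": group_a, "y": group_b}   # routing table keyed by NAI realm


def route(table, realm):
    """Pick a server for a message whose NAI realm is `realm`."""
    group = table.get(realm)
    return group.pick() if group else None
```

Because each Proxy Group owns its ServerState objects, marking "s2"
offline in Proxy Group A leaves its state in Proxy Group B untouched,
which matches the "logical RADIUS Server" notion used in this section.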
Note that when discussing these failover procedures, we can choose
to perform them at the originator of the messages only and not at
the intermediaries; or we can perform the algorithms at the
originating server and at the intermediaries. To minimize traffic
in the network, and to minimize time delays, it is highly desirable
that we detect failure conditions and act on them at the
intermediaries.
4. General Requirements
Given the above model the following are the general capabilities
that form a reliable transport.
- A RADIUS Client MUST be able to determine when a RADIUS Server
is not available. A RADIUS Server is not available either
because it is not reachable due to a network failure, the machine
is not working, or the application is not responding. The
RADIUS Client will put a RADIUS Server that is not available in
the offline state. A RADIUS Server in the offline state will
not be used to send messages.
- A RADIUS Server that has previously been declared offline
SHOULD automatically be reinstated into the online state as
soon as possible. This is particularly important when
load-balancing is used. The algorithm used to bring a RADIUS server
online should be conservative. The cost of bringing a RADIUS
server online falsely is added traffic and added delays.
- A RADIUS Client that does not receive an acknowledgement will
attempt to retransmit. In order to keep the network traffic
down, to reduce the number of duplicate requests, and to
give a potentially overloaded RADIUS Server a chance to clear its
queues, the retransmission algorithm should be conservative.
5. General Algorithm
When a RADIUS Client sends a message it expects a response. If that
response is not received in a given amount of time, the RADIUS
Client will retry transmitting the message to the same RADIUS Server.
The algorithm for retransmission is described below. If the
retransmission algorithm fails, the RADIUS Client will attempt to
send the message to another RADIUS Server in the Proxy Group. Note
that the RADIUS Server is not necessarily brought offline. The Offline
Algorithm determines when a RADIUS Server is brought offline. When
the failure rate reaches a certain threshold, for a number of time
periods, the Offline Algorithm places the RADIUS Server in the Proxy
Group into the offline state for a period of time.
The Online Algorithm is used to automatically bring RADIUS Servers
that are in an Offline state back to the Online state. Normally an
offline RADIUS Server is brought out of the Offline state when the
Offline-Period time expires. However, under certain conditions an
offline RADIUS Server will be brought out of that state earlier.
These algorithms are explained in detail in the following sections.
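The failover step of the general algorithm can be sketched as
follows. This is a minimal illustration, not a definitive
implementation. The function name, the dictionary keys "name" and
"online", and the try_server callback are assumptions; try_server is
assumed to run the complete retransmit algorithm against one server
and return None if it fails.

```python
def send_with_failover(servers, try_server):
    """Try each online server in turn; return (name, response) or (None, None).

    `servers` is a list of dicts with hypothetical keys "name" and "online".
    `try_server(name)` is assumed to run the full retransmit algorithm against
    one server and return the response, or None if every retry timed out.
    """
    for server in servers:
        if not server["online"]:
            continue                      # offline servers are never used
        response = try_server(server["name"])
        if response is not None:
            return server["name"], response
        # Retransmission failed: fail over to the next server, but do NOT
        # mark this one offline here; that is the Offline Algorithm's job.
    return None, None
```

Keeping the offline decision out of this loop reflects the point made
above: a failed retransmission causes failover for that message only.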
5.1 Retransmit Algorithm
This algorithm describes how a message gets retransmitted if a
response is not received in a specified amount of time, T-retry.
1) Set T-retry to minimum value.
2) Send a message and start timer.
3) If a response is not received and T-retry is reached, double
T-retry up to a maximum value, and resend the message.
4) Repeat steps 2 and 3 N times, where N is configurable.
T-retry is associated with each message, not the RADIUS Server.
Once a message has been retried N times we fail the message to the
next available RADIUS Server in the Proxy Group. Note, we do not
place the RADIUS Server in the offline state. The RADIUS Server is
placed in an Offline state by using the Offline Algorithm described
below.
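The retransmit steps can be sketched in Python as shown here. The
sketch rests on stated assumptions: the function and parameter names
are hypothetical, the default values (1 second minimum, 16 seconds
maximum, N = 3) are placeholders that this document does not specify,
and "retried N times" is read as one initial transmission followed by
up to N retries.

```python
def retransmit(send, wait_for_response, t_min=1.0, t_max=16.0, n=3):
    """Send a message, retrying up to `n` times with a doubling timeout.

    `send()` transmits the message; `wait_for_response(timeout)` is assumed
    to block for up to `timeout` seconds and return the response or None.
    Returns the response, or None once the initial send and all `n` retries
    have timed out (the caller then fails over to another server).
    """
    t_retry = t_min                        # step 1: start at the minimum
    for _attempt in range(n + 1):          # initial send plus n retries
        send()                             # step 2: send; the wait is the timer
        response = wait_for_response(t_retry)
        if response is not None:
            return response
        t_retry = min(t_retry * 2, t_max)  # step 3: double, capped at maximum
    return None
```

T-retry is a local variable of each call, which mirrors the statement
that the timer is associated with each message and not with the RADIUS
Server.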
5.2 Offline Algorithm
A RADIUS Client places a RADIUS Server in an Offline state when the
RADIUS Client perceives that the RADIUS Server is not responsive.
There are many reasons why a RADIUS Client may not receive a
response:
- The network dropped the packet;
- The server has silently discarded the packet due to errors;
- The server is busy;
- The server is dead.
Note that in a proxy situation the failure may have occurred anywhere
in the proxy chain. As well, the RADIUS Client may time out while a
RADIUS Server down the proxy chain is performing a retry algorithm.
Therefore, using responses to determine whether the immediate RADIUS
Server is operational is difficult.
In the cases where the RADIUS Server exists in more than one Proxy
Group (it is servicing multiple realms), it may be possible to
determine whether that RADIUS Server is dead. However, generally,
the only way to determine whether the immediate server is alive is
to send it an out-of-band message. This approach is outside the scope
of the RADIUS protocol and will not be considered here. See
[AAATransport].
The process for determining non-responsiveness must also be very
carefully considered. A single failure of the retransmit algorithm
is not sufficient. A better approach is to use a number of such
failures to determine whether or not a RADIUS Server should be placed
in the Offline state. This will allow us to handle the case where
messages may be silently discarded, or lost for other reasons, as
may be the case for UDP packets.
The Offline algorithm is used to determine when a server in the
Proxy Group is placed in an Offline state. The algorithm takes into
account consecutive failures caused when the RADIUS server has
completely failed; and also intermittent failures that may occur
when a server is overloaded.
The algorithm uses a number of buckets. Each bucket represents a
uniform period of time (for example one minute). Each
bucket consists of two counters: number-of-requests, which counts the
total number of requests sent during the time period of the bucket;
and number-of-failures, which counts the total number of failures
(timeouts) experienced during the time period of the bucket. These
counters are used to determine whether there were significant
failures during the bucket period.
As messages are sent the number-of-requests is incremented. If the
message fails (times-out), then we increment the number-of-failures.
The algorithm requires four parameters:
a) minimum-request-threshold, which is used to make sure that we
have a sufficient number of messages in the bucket to make a sound
decision;
b) failure-rate-threshold, which determines the error rate at which
we declare that the bucket has failed;
c) N, which represents the number of consecutive buckets that need to
fail before we put the RADIUS server in the offline state;
d) Offline-Period, which is the length of time that the RADIUS Server
should be kept in the offline state.
A RADIUS server is placed in the offline state under either of the
following conditions:
1) For a given bucket, the number of requests processed is
greater than the minimum-request-threshold and during the bucket
period there were 100% failures; or
2) N consecutive buckets experienced significant intermittent
failures.
Note: a bucket that does not contain a sufficient number of requests
is simply skipped or ignored. It does not break the continuity of the
sample buckets.
We say that a bucket has experienced significant intermittent
failures if the number of requests processed is greater than the
minimum-request-threshold and the error rate exceeds the failure-
rate-threshold. That is:
number-of-failures/number-of-requests > failure-rate-threshold
Once a RADIUS server has been placed in the Offline state it will
remain in that state for the amount of time specified by the
Offline-Period parameter.
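The bucket logic of the Offline Algorithm can be sketched as follows.
This is one possible reading, under stated assumptions: the class and
method names are hypothetical, the default parameter values are
placeholders, time is passed in by the caller as seconds, buckets are
evaluated when they close, and a bucket with enough requests but an
acceptable failure rate resets the run of consecutive failed buckets.

```python
class OfflineDetector:
    """Tracks one logical RADIUS Server in one Proxy Group."""

    def __init__(self, min_requests=10, failure_rate=0.5, n_buckets=3,
                 offline_period=300.0):
        self.min_requests = min_requests      # minimum-request-threshold
        self.failure_rate = failure_rate      # failure-rate-threshold
        self.n_buckets = n_buckets            # N consecutive failed buckets
        self.offline_period = offline_period  # Offline-Period, in seconds
        self.requests = 0          # number-of-requests, current bucket
        self.failures = 0          # number-of-failures, current bucket
        self.failed_run = 0        # consecutive failed buckets so far
        self.offline_until = None  # time at which the Offline-Period expires

    def record(self, failed):
        """Count one sent message and whether it timed out."""
        self.requests += 1
        if failed:
            self.failures += 1

    def close_bucket(self, now):
        """Evaluate the bucket that just ended; True if it caused offline."""
        requests, failures = self.requests, self.failures
        self.requests = self.failures = 0
        if requests <= self.min_requests:
            return False                   # too few samples: skip, keep the run
        if failures == requests:           # condition 1: 100% failures
            return self._go_offline(now)
        if failures / requests > self.failure_rate:
            self.failed_run += 1
            if self.failed_run >= self.n_buckets:   # condition 2
                return self._go_offline(now)
        else:
            self.failed_run = 0            # a healthy bucket breaks the run
        return False

    def _go_offline(self, now):
        self.failed_run = 0
        self.offline_until = now + self.offline_period
        return True

    def is_offline(self, now):
        return self.offline_until is not None and now < self.offline_until
```

A caller would invoke record() for every message outcome and
close_bucket() once per bucket period, for example from a periodic
timer.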
5.3 Online Algorithm
The Online Algorithm is the procedure used to bring a RADIUS Server
in the Offline state back online.
Normally, a RADIUS Server will be placed in an Offline state for
a period of time known as the Offline-Period.
However, there could be situations where the number of available
RADIUS servers in a Proxy Group is too low. If all the RADIUS
Servers in a Proxy Group are offline, we would have a service outage
for the realm that that Proxy Group is servicing. Therefore we have
to make sure that we never run out of RADIUS Servers. Furthermore,
if we are using load balancing in a Proxy Group, it may be highly
desirable to try to maintain a certain number of RADIUS Servers even
if we have to bring some of them out of the Offline state earlier.
The Online Algorithm brings the RADIUS Servers out of the Offline
state when the Offline-Period has expired. As well, if the number
of available RADIUS Servers in a Proxy Group falls below a
threshold, the Online Algorithm will bring online the RADIUS
Server(s) that are closest to the end of their Offline-Period.
Alternatively, the Online Algorithm may also take into account the
RADIUS Server's state in other Proxy Groups.
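The selection performed by the Online Algorithm can be sketched as
shown here. The function name, its arguments, and the min_online
threshold are hypothetical. The sketch also assumes that early
reinstatement brings back just enough servers to reach the threshold;
this document does not say how many should be reinstated.

```python
def servers_to_bring_online(offline_until, online_count, min_online, now):
    """Return the names of servers to move back to the Online state.

    `offline_until` maps a server name to the time its Offline-Period ends.
    `online_count` is the number of servers currently online in the group.
    `min_online` is the (hypothetical) threshold the group should not drop
    below.
    """
    # Normal case: the Offline-Period has expired.
    expired = [s for s, t in offline_until.items() if t <= now]
    # Early reinstatement: servers closest to the end of their period first.
    pending = sorted((t, s) for s, t in offline_until.items() if t > now)
    shortfall = max(0, min_online - (online_count + len(expired)))
    early = [s for _t, s in pending[:shortfall]]
    return expired + early
```

Sorting by expiry time favours the servers that have been offline the
longest relative to their Offline-Period, which is the conservative
choice suggested above.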
6. Special Consideration
The following section describes special considerations for the
different types of messages.
6.1 Consideration for Accounting Messages
Accounting messages are critical, but with respect to retransmission
we recommend that the retransmission algorithm not be applied
to Accounting-Request (Interim) messages.
Note that failures should be reported when Accounting-Response
messages are not received for the Accounting-Request (Interim)
messages.
6.2 Consideration for Dynamic Authorization Messages
When the sender of a Disconnect message or a Change-of-Authorization
message receives a NAK, it should examine the Error-Cause attribute to
determine whether it should retransmit or not.
If the Error-Cause is set to "Request Not Routable" (502), the sender
should retry sending the message to another RADIUS Server in the
Proxy Group.
If the Error-Cause is set to "Other Proxy Processing Error" (505), the
sender should treat this as a non-response, wait as it normally
would, and retransmit the message.
If the RADIUS Server is sending the Disconnect Message or Change-of-
Authorization message directly to the NAS where the session resides
then failing over does not make any sense.
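The NAK handling described in this section can be summarised in a
small sketch. The function name and the returned action labels are
hypothetical. Giving up on any other error value, and on a 502 when
the message was sent directly to the NAS, are assumptions; this
document only covers the two values above.

```python
REQUEST_NOT_ROUTABLE = 502
OTHER_PROXY_PROCESSING_ERROR = 505


def action_for_nak(error_value, sent_directly_to_nas=False):
    """Map the error value in a NAK to "failover", "retry" or "give-up"."""
    if error_value == OTHER_PROXY_PROCESSING_ERROR:
        return "retry"        # treat as a non-response: wait, then retransmit
    if error_value == REQUEST_NOT_ROUTABLE:
        # Failing over makes no sense when the NAS itself was the target.
        return "give-up" if sent_directly_to_nas else "failover"
    return "give-up"          # assumption: other values are not retried
```

For example, action_for_nak(502) yields "failover", while
action_for_nak(505) yields "retry".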
7. Security Considerations
This document enhances the existing RADIUS specifications by
recommending strategies for failure detection, retransmission
procedures, failover procedures, and failback procedures.
The document does not modify the base protocols and therefore the
security considerations are the same as those discussed in the
appropriate documents.
However, this section addresses the security concerns that are
introduced by the procedures that are discussed in this document.
- Effects due to Denial of Service attacks
8. Normative References
[RFC2026]
[RFC2119]
[RFC2865] Rigney, C., Rubens, A., Simpson, W. and S. Willens,
"Remote Authentication Dial In User Service (RADIUS)",
RFC 2865, June 2000.
[RFC2866] Rigney, C., "RADIUS Accounting", RFC 2866, June
2000.
[RFC2869] Rigney, C., Willats, W., Calhoun, P., "RADIUS
Extensions", RFC 2869, June 2000.
[RFC2868]
[CHIBA] Chiba, M., Dommety, G., Eklund, M., Mitton, D.,
Aboba, B., "Dynamic Authorization Extensions to
Remote Authentication Dial In User Service (RADIUS)",
draft-chiba-radius-dynamic-authorization-20.txt,
Internet draft (work in progress), 15 May, 2003.
[RFC2988] Paxson, V., Allman, M., "Computing TCP's
Retransmission Timer", RFC 2988, November 2000.
9. Informative References
[RFC3127] Mitton, D., "Authentication, Authorization, and
Accounting: Protocol Evaluation", RFC 3127, June
2001.
[AAATransport] Aboba, B. and J. Wood, "Authentication,
Authorization and Accounting Transport Profile",
draft-ietf-aaa-transport-12.txt, Internet draft
(work in progress), January 2003.
10. Acknowledgments
Funding for the RFC Editor function is currently provided by the
Internet Society.
The author would like to thank the following people: Yong Li and
Helena Mancini from Bridgewater Systems.
11. Author's Addresses
Avi Lior
Bridgewater Systems
303 Terry Fox Drive
Suite 100
Ottawa Ontario
Canada
avi@bridgewatersystems.com
12. Intellectual Property Statement
The IETF takes no position regarding the validity or scope of any
intellectual property or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; neither does it represent that it
has made any effort to identify any such rights. Information on the
IETF's procedures with respect to rights in standards-track and
standards-related documentation can be found in BCP-11. Copies of
claims of rights made available for publication and any assurances
of licenses to be made available, or the result of an attempt made
to obtain a general license or permission for the use of such
proprietary rights by implementers or users of this specification
can be obtained from the IETF Secretariat.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights which may cover technology that may be required to practice
this standard. Please address the information to the IETF Executive
Director.
13. Full Copyright Statement
Copyright (C) The Internet Society (2003). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph
are included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English. The limited permissions granted above are perpetual and
will not be revoked by the Internet Society or its successors or
assigns. This document and the information contained herein is
provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE
INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
14. Expiration Date
This memo is filed as <draft-lior-radius-reliable-transport-00.txt>,
and expires December 23, 2003.