shen-soc-avalanche-restart-overload-07.txt

Internet DRAFT - draft-shen-soc-avalanche-restart-overload

Last Version:	draft-shen-soc-avalanche-restart-overload-07.txt
Date:	11-Feb-2014
Disposition:	expired
Previous Versions:	draft-shen-soc-avalanche-restart-overload-06.txt - 11-Aug-2013
	draft-shen-soc-avalanche-restart-overload-05.txt - 02-Jan-2013
	draft-shen-soc-avalanche-restart-overload-04.txt - 19-Jun-2012
	draft-shen-soc-avalanche-restart-overload-03.txt - 16-Dec-2011
	draft-shen-soc-avalanche-restart-overload-03.txt - 08-Dec-2011
	draft-shen-soc-avalanche-restart-overload-02.txt - 06-Jun-2011
	draft-shen-soc-avalanche-restart-overload-02.txt - 05-Jun-2011
	draft-shen-soc-avalanche-restart-overload-01.txt - 02-Dec-2010
	draft-shen-soc-avalanche-restart-overload-00.txt - 03-Jun-2010





IETF SOC Working Group                                           C. Shen
Internet-Draft                                            H. Schulzrinne
Intended status: Standards Track                             Columbia U.
Expires: August 15, 2014                                        A. Koike
                                                                     NTT
                                                       February 11, 2014


  A Mechanism for Session Initiation Protocol (SIP) Avalanche Restart
                            Overload Control
              draft-shen-soc-avalanche-restart-overload-07

Abstract

   When a large number of clients register with a SIP registrar server
   at approximately the same time, the server may become overloaded.
   Near-simultaneous floods of SIP SUBSCRIBE and PUBLISH requests may
   have similar effects.  Such request avalanches can occur, for
   example, after a power failure and recovery in a metropolitan area.
   This document describes how to avoid such overload situations.  Under
   this mechanism, a server estimates an avalanche restart backoff
   interval during its normal operation and conveys this interval to its
   clients through a new Restart-Timer header in normal response
   messages.  Once an avalanche restart actually occurs, the clients
   perform backoff based on the previously received Restart-Timer header
   value before sending out the first request attempt.  Thus, the
   mechanism spreads all the initial client requests and prevents them
   from overloading the server.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 15, 2014.






Shen, et al.             Expires August 15, 2014                [Page 1]

Internet-Draft   SIP Avalanche Restart Overload Control    February 2014


Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Restart-Timer Header for Registration Responses
     3.1.  Generating the Restart-Timer Header
     3.2.  Determining the Restart-Timer Header Value
     3.3.  Processing the Restart-Timer Header
     3.4.  Using the Restart-Timer Header
   4.  Syntax
   5.  Backward Compatibility
   6.  Security Considerations
   7.  IANA Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Introduction

   A Session Initiation Protocol (SIP) [RFC3261] server can be
   overloaded for a number of different reasons.  One of them is
   avalanche restart, which is described in [RFC5390] as follows:

      Avalanche Restart: One of the most troubling sources of overload
      is avalanche restart.  This happens when a large number of clients
      all simultaneously attempt to connect to the network with a SIP
      registration.  Avalanche restart can be caused by several events.
      One is the "Manhattan Reboots" scenario, where there is a power
      failure in a large metropolitan area, such as Manhattan.  When
      power is restored, all of the SIP phones, whether in PCs or
      standalone devices, simultaneously power on and begin booting.
      They will all then connect to the network and register, causing a
      flood of SIP registration messages.  Another cause of avalanche
      restart is failure of a large network connection, for example, the
      access router for an enterprise.  When it fails, SIP clients will
      detect the failure rapidly using the mechanisms in [RFC5626].
      When connectivity is restored, this is detected, and clients
      re-register, all within a short time period.  Another source of
      avalanche restart is failure of a proxy server.  If clients had
      all connected to the server with TCP, its failure will be
      detected, followed by re-connection and re-registration to another
      server.  Note that [RFC5626] does provide some remedies to this
      case.

   The SIP server avalanche restart overload problem is caused by the
   synchronized, simultaneous initial registration attempts after a
   failure recovery.  If the first round of registration attempts from
   all clients causes server overload, most of those registrations will
   fail.  Those clients will then by default all retry after the same
   amount of time, causing repeated server avalanche restart overload.
   [RFC5626] describes how to alleviate this situation: if the initial
   registration attempt after the boot fails, the clients wait for a
   randomized backoff time before retrying.  This mechanism reduces the
   possibility of repeated avalanche restart.  However, since all
   clients still send their first registration request immediately after
   boot, it does not prevent the initial avalanche restart overload.

   A key method to prevent avalanche restart server overload is to have
   clients back off before their first registration attempt.  The
   backoff interval of each client must be carefully selected: long
   enough that the registration attempts are spaced sufficiently far
   apart not to overload the server, but not so long that it causes
   unnecessary client registration delays.  An individual client,
   without knowing the state information of all other peer clients and
   the registrar server, is inherently incapable of choosing such an
   appropriate backoff interval.

   This document specifies a solution to the avalanche restart overload
   problem by allowing the registrar server to instruct the clients how
   long they should wait before the initial registration upon a restart
   event.  Under this mechanism, the server estimates an avalanche
   restart backoff interval during its normal operation.  This interval
   is the minimum period of time that the server needs to serve all the
   expected registration requests after an avalanche restart, assuming
   all the registration requests are properly spaced.  In order for the
   server to convey this interval to its clients, this document defines
   a new SIP Restart-Timer header.  The registrar server places the
   avalanche restart backoff interval into the Restart-Timer header and
   inserts it into regular responses to client registration requests.

   When an avalanche restart actually happens, each client waits a
   randomly-chosen period between 0 and the avalanche restart backoff
   interval.

   This document also defines an algorithm to determine the avalanche
   restart backoff interval based on the server's processing capability
   and the number of clients it is serving.  The effectiveness of this
   algorithm depends on the assumption that both the server processing
   capability and the number of clients the server serves remain similar
   before and after the avalanche restart.  This assumption holds true
   in most cases when the registrar server before and after the
   avalanche restart is the same one, e.g., in the "Manhattan Reboots"
   scenario and in the scenario of loss and recovery of a large network
   connection.

   The method defined in this document is intended to be used for real
   avalanche restart situations, rather than just any local reboot or
   connection recovery.  Therefore, the device employing this mechanism
   SHOULD try to estimate the nature of the restart incidents whenever
   possible.  Some devices, especially mobile terminals, may also have
   lower layer (e.g., Physical or Data Link layer) backoff or blocking
   mechanisms during avalanche restart or network congestion.  In those
   cases, operators may also disable this application layer avalanche
   restart protection method.  Such cross-layer optimizations, however,
   are out of scope of this document.

   Throughout this document our description assumes the typical scenario
   where the clients send out REGISTER messages as the first message
   type after a restart and cause avalanche restart overload on the
   registrar server.  It should be noted that similar procedures are
   applicable to scenarios where the first message after reboot is of a
   different type and causes overload.  For example, when a SIP-based
   configuration mechanism is used, the clients may first send out
   SUBSCRIBE messages to a SIP configuration server to get the registrar
   address after a reboot.  In that case, the server that determines the
   restart backoff interval needs to be the corresponding SIP
   configuration server, and the backoff mechanism could then be
   similarly applied to the sending of the initial SUBSCRIBE messages.

   This document complements other SIP server overload control
   specifications which address different aspects of the SIP server
   overload space, such as [I-D.ietf-soc-overload-control],
   [I-D.ietf-soc-load-control-event-package], and [RFC6357].

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3.  Restart-Timer Header for Registration Responses

   This document defines the SIP Restart-Timer header for registration
   responses.  The value of the Restart-Timer header, in seconds,
   denotes the avalanche restart backoff interval, which is the minimum
   time the server needs to successfully service all likely client
   registration requests under an avalanche restart situation, assuming
   all requests are spaced evenly in time.

3.1.  Generating the Restart-Timer Header

   A SIP registrar server inserts a Restart-Timer header containing its
   most up-to-date avalanche restart backoff interval value in the
   responses to registration requests.

   Example:

       SIP/2.0 200 OK
       Restart-Timer: 300

3.2.  Determining the Restart-Timer Header Value

   A registrar server computes and updates the Restart-Timer header
   values and conveys them to the clients during its normal operations.
   Once an avalanche restart actually happens, the most recent Restart-
   Timer header value that each client has received from the registrar
   server is used.  A registrar server MAY use the following algorithm
   for determining the appropriate Restart-Timer header value.

   During the normal operation period, the SIP registrar server
   maintains the current count of all its registrants; let R denote the
   number of registered clients.  The SIP registrar server also
   estimates its processing capacity; let C denote this capacity in
   requests per second.  The Restart-Timer value can be set to
   (R/C)*(1+k), where k is a small coefficient that provides a capacity
   redundancy.  A recommended value of k is 0.1.
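
   As an illustration only, the computation above can be sketched in
   Python as follows.  The function name restart_timer_seconds, the use
   of exact rational arithmetic, and rounding the result up to whole
   seconds (so that it fits an integer number of seconds) are
   assumptions of this sketch, not requirements of this document.

```python
import math
from fractions import Fraction


def restart_timer_seconds(registered_clients, capacity_rps, k="0.1"):
    """Return (R/C)*(1+k), rounded up to whole seconds.

    registered_clients is R, capacity_rps is C in requests per second,
    and k is the capacity redundancy coefficient (0.1 is recommended).
    """
    r = Fraction(registered_clients)
    # Converting via str() keeps decimal inputs such as 0.1 exact,
    # which avoids binary floating-point error before rounding up.
    c = Fraction(str(capacity_rps))
    redundancy = Fraction(str(k))
    if r < 0 or c <= 0 or redundancy < 0:
        raise ValueError("need R >= 0, C > 0 and k >= 0")
    # Assumption of this sketch: round up so the value is an integer.
    return math.ceil(r / c * (1 + redundancy))
```

   For example, with R = 100000 registered clients and C = 500 requests
   per second, the sketch yields (100000/500)*(1+0.1) = 220 seconds.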

   It should be noted that a change in either R or C changes the
   server-computed Restart-Timer value.  The value C is usually stable
   on the same server and with the same registration request pattern.
   The value R may change over time.  The server SHOULD recompute the
   Restart-Timer value whenever there is a change in either R or C,
   unless it is considered too expensive to do so, which is normally not
   the case.  Since the updated Restart-Timer value is conveyed to a
   client only when that client sends a registration request, there
   might be a short period during which the value updated on the server
   and the values stored by some clients are not synchronized.  However,
   considering that R changes slowly, and that the time between
   avalanche restarts (e.g., months) is usually much larger than the
   interval between typical registration renewals (e.g., hours), these
   short periods of discrepancy are not a concern.  Therefore, in
   general this approach provides a sufficiently accurate
   characterization of the system status.  More importantly, the values
   of R and C are expected to remain constant for the same server before
   and after typical avalanche restart events, e.g., a power failure and
   recovery.
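
   One possible way for a registrar to keep the value current is
   sketched below in Python.  The class RestartTimerEstimator and its
   method names are hypothetical, as is the choice to track registrants
   in a set and to round the result up to whole seconds.  The sketch
   recomputes the value on every change of R or C, which corresponds to
   the case where recomputation is not considered too expensive.

```python
import math
from fractions import Fraction


class RestartTimerEstimator:
    """Tracks R and C and keeps the derived Restart-Timer value current."""

    def __init__(self, capacity_rps, k="0.1"):
        self._capacity = self._positive(capacity_rps)  # C, requests/second
        self._k = Fraction(str(k))                     # redundancy coefficient
        self._registrants = set()                      # its size is R
        self.value = 0                                 # current Restart-Timer

    @staticmethod
    def _positive(capacity_rps):
        capacity = Fraction(str(capacity_rps))
        if capacity <= 0:
            raise ValueError("C must be positive")
        return capacity

    def _recompute(self):
        r = len(self._registrants)
        # (R/C)*(1+k); rounding up to whole seconds is an assumption.
        self.value = math.ceil(Fraction(r) / self._capacity * (1 + self._k))

    def registered(self, client_id):
        """Call when a registration is created or refreshed."""
        if client_id not in self._registrants:
            self._registrants.add(client_id)
            self._recompute()

    def unregistered(self, client_id):
        """Call when a registration expires or is removed."""
        if client_id in self._registrants:
            self._registrants.remove(client_id)
            self._recompute()

    def capacity_changed(self, capacity_rps):
        """Call when the estimate of C changes."""
        self._capacity = self._positive(capacity_rps)
        self._recompute()
```

   In such a design, the registrar would read the value attribute each
   time it builds a response to a registration request.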

3.3.  Processing the Restart-Timer Header

   Before receiving the very first registration response from a new
   registrar server, the client restart backoff value for that registrar
   server is zero, i.e., the restart backoff mechanism is disabled.

   Upon receiving a response to the registration request containing the
   Restart-Timer header, a SIP client that supports this specification
   MUST check if there is an existing Restart-Timer header value for
   this registrar stored in the system.  If not, it stores the newly
   received Restart-Timer header value.  Otherwise, it compares the new
   value with the existing one and updates it if they differ.  The value
   of the Restart-Timer header SHOULD be stored together with the
   corresponding identity of the server, e.g., the DNS name of the
   registrar server.  There is no separate validity period parameter for
   Restart-Timer.  The validity duration of the Restart-Timer header is
   the same as that of the corresponding registration operation.
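
   The client-side bookkeeping described above might look like the
   following Python sketch.  The class RestartTimerStore and its method
   names are hypothetical, and the sketch keys the values by a plain
   string such as the DNS name of the registrar server.  It does not
   model the validity duration tied to the registration.

```python
class RestartTimerStore:
    """Per-registrar Restart-Timer values held by a client."""

    def __init__(self):
        self._values = {}  # registrar identity -> seconds

    def get(self, registrar):
        # Before the first response from a registrar the value is zero,
        # i.e., the restart backoff mechanism is disabled.
        return self._values.get(registrar, 0)

    def on_response(self, registrar, header_value):
        """Record the value from a registration response.

        header_value is the parsed Restart-Timer value in seconds, or
        None if the response carried no such header.  Returns True if
        the stored value was added or updated.
        """
        if header_value is None:
            return False
        if self._values.get(registrar) == header_value:
            return False  # same as the existing value, nothing to do
        self._values[registrar] = header_value
        return True
```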

3.4.  Using the Restart-Timer Header

   At the client side, avalanche restart backoff is disabled by default,
   unless the client that supports this specification has received a
   positive Restart-Timer header value from the corresponding registrar
   server.

   A SIP client always keeps the most recent Restart-Timer header
   value.  When this value is positive and the client detects that it
   is about to perform the first registration with the same registrar
   server after a power-off reboot or a connection-loss recovery, the
   client SHOULD generate a uniformly distributed random interval
   between 0 and the current Restart-Timer value, and wait until the end
   of that interval to send the registration request.  However, the
   client-side backoff MAY be manually disabled by a human operator when
   necessary, e.g., when the operator is expecting an urgent call, or
   when the power-off or connection-loss event is known to be a local
   incident rather than a global event.
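
   A minimal Python sketch of the client-side delay selection follows.
   The function name restart_backoff_delay and the backoff_enabled flag
   (standing in for the manual override by a human operator) are
   assumptions of this sketch.

```python
import random


def restart_backoff_delay(restart_timer, backoff_enabled=True, rng=random):
    """Return the delay in seconds before the first registration.

    restart_timer is the most recent Restart-Timer value stored for the
    registrar.  A value of zero, or a manual override, disables backoff.
    """
    if not backoff_enabled or restart_timer <= 0:
        return 0.0
    # Uniformly distributed between 0 and the Restart-Timer value.
    return rng.uniform(0, restart_timer)
```

   A client would then wait for the returned number of seconds, for
   example with time.sleep(), before sending the registration request.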

   It should be noted that the power-off reboot case requires that the
   state information about the Restart-Timer value and the registrar
   server identity be stored in memory that survives a power restart.
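
   One way to meet this requirement on a device with a file system is
   sketched below in Python.  The JSON file format and the function
   names save_restart_timers and load_restart_timers are assumptions of
   this sketch; an embedded device may instead use flash or another
   non-volatile store.

```python
import json
import os
import tempfile


def save_restart_timers(path, values):
    """Write {registrar identity: Restart-Timer seconds} to a file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(values, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())  # push the data to disk before renaming
    # Swap the new file into place in one step, so that an interrupted
    # write does not leave a truncated file under the final name.
    os.replace(tmp_path, path)


def load_restart_timers(path):
    """Read the stored values; a missing or unreadable file yields {}."""
    try:
        with open(path) as stored:
            data = json.load(stored)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    if not isinstance(data, dict):
        return {}
    return {str(name): int(seconds) for name, seconds in data.items()}
```

   An empty result on load corresponds to the default state in which
   the restart backoff mechanism is disabled.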

4.  Syntax

   The new Restart-Timer header adds the following lines to the existing
   SIP header definition.


       message-header =/ Restart-Timer

       Restart-Timer = "Restart-Timer" HCOLON delta-seconds
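
   As an illustration, a header line following this syntax could be
   parsed and generated as in the Python sketch below.  The function
   names are hypothetical.  The sketch assumes that the header line has
   already been unfolded, treats delta-seconds as one or more decimal
   digits, matches the header name case-insensitively, and tolerates
   trailing whitespace.

```python
import re

# "Restart-Timer" HCOLON delta-seconds, on a single unfolded line.
_RESTART_TIMER = re.compile(
    r"Restart-Timer[ \t]*:[ \t]*([0-9]+)[ \t]*", re.IGNORECASE)


def parse_restart_timer(header_line):
    """Return the value in seconds, or None if the line does not match."""
    match = _RESTART_TIMER.fullmatch(header_line.rstrip("\r\n"))
    return int(match.group(1)) if match else None


def format_restart_timer(seconds):
    """Return the header line (without CRLF) for a value in seconds."""
    if seconds < 0:
        raise ValueError("delta-seconds cannot be negative")
    return "Restart-Timer: %d" % seconds
```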


5.  Backward Compatibility

   If a registrar server supports this specification, but not all of its
   clients are upgraded, then those non-compliant clients will ignore
   the Restart-Timer header and not perform backoff.  It may appear that
   this gives the non-compliant clients an unfair advantage over the
   clients that do perform backoff.  However, since the non-compliant
   clients send synchronized registration attempts to the registrar
   server and can cause server overload, they will be penalized by
   registration failures.  Depending on the proportion of non-compliant
   to compliant clients, if the registrar server can still process
   requests outside the registration storms caused by the non-compliant
   clients, the requests from the compliant clients, which are spread
   apart, are more likely to succeed.

6.  Security Considerations

   The Restart-Timer header could be used by an attacker to launch a
   denial-of-service attack on SIP clients if the attacker can insert an
   extremely large Restart-Timer value in the response sent to the
   clients.  In that situation, a client may generate a very large
   backoff time before it attempts to send a registration request, and
   the client is therefore denied service for that period.
   However, this kind of attack is only applicable after a power-cycle
   reboot or failure and recovery of a large network connection, which
   is rare.  Furthermore, if the attacker can modify the registration
   request or response, that attacker can very easily prevent
   registration in any number of ways, so the Restart-Timer header does
   not introduce new types of attacks.  One method to prevent the
   registration request and response from being altered by attackers is
   to use TLS between the client and the registrar server.

7.  IANA Considerations

   [TBD]

8.  Acknowledgements

   The authors would like to thank Janet P Gunn, Parthasarathi R and
   other members of the SIPPING and SOC working groups for useful
   comments.

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
              A., Peterson, J., Sparks, R., Handley, M., and E.
              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
              June 2002.

9.2.  Informative References

   [I-D.ietf-soc-load-control-event-package]
              Shen, C., Schulzrinne, H., and A. Koike, "A Session
              Initiation Protocol (SIP) Load Control Event Package",
              draft-ietf-soc-load-control-event-package-13 (work in
              progress), December 2013.

   [I-D.ietf-soc-overload-control]
              Gurbani, V., Hilt, V., and H. Schulzrinne, "Session
              Initiation Protocol (SIP) Overload Control", draft-ietf-
              soc-overload-control-14 (work in progress), December 2013.

   [RFC5390]  Rosenberg, J., "Requirements for Management of Overload in
              the Session Initiation Protocol", RFC 5390, December 2008.

   [RFC5626]  Jennings, C., Mahy, R., and F. Audet, "Managing Client-
              Initiated Connections in the Session Initiation Protocol
              (SIP)", RFC 5626, October 2009.

   [RFC6357]  Hilt, V., Noel, E., Shen, C., and A. Abdelal, "Design
              Considerations for Session Initiation Protocol (SIP)
              Overload Control", RFC 6357, August 2011.

Authors' Addresses

   Charles Shen
   Columbia University
   Department of Computer Science
   1214 Amsterdam Avenue, MC 0401
   New York, NY   10027
   USA

   Phone: +1 212 854 3109
   Email: charles@cs.columbia.edu


   Henning Schulzrinne
   Columbia University
   Department of Computer Science
   1214 Amsterdam Avenue, MC 0401
   New York, NY   10027
   USA

   Phone: +1 212 939 7004
   Email: hgs@cs.columbia.edu


   Arata Koike
   NTT Network Technology Labs &
   3-9-11 Midori-cho Musashino-shi
   Tokyo  180-8585
   Japan

   Phone: +81 422 59 6099
   Email: koike.arata@lab.ntt.co.jp