rosenberg-dispatch-cloudsip-00.txt

Internet DRAFT - draft-rosenberg-dispatch-cloudsip
draft-rosenberg-dispatch-cloudsip

Last Version:	draft-rosenberg-dispatch-cloudsip-00.txt
Date:	22-Feb-2021
Disposition:	expired





Network Working Group                                       J. Rosenberg
Internet-Draft                                                     Five9
Expires: August 25, 2021                                     C. Jennings
                                                                   Cisco
                                                              T. Asveren
                                                   Ribbon Communications
                                                       February 21, 2021


SIP Extensions for High Availability and Load Balancing for Public Cloud
                   draft-rosenberg-dispatch-cloudsip-00

Abstract

   Software making use of the Session Initiation Protocol (SIP) faces
   challenges in achieving high availability, especially for call
   stateful applications like softswitches, Session Border Controllers
   (SBCs), and IP-based call center applications.  The state maintained
   in the SIP, SDP and SRTP layers changes frequently, and is difficult
   to replicate.  For this reason, commercial systems have often relied
   on complex active-standby configurations making use of IP address
   takeover.  These solutions are also ill-suited for usage in modern
   public cloud environments.  This document defines a SIP extension
   facilitating HA, including keeping calls active, which is optimized
   for server-to-server communication where one or both sides are in
   public cloud.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 25, 2021.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.




Rosenberg, et al.        Expires August 25, 2021                [Page 1]

Internet-Draft                  Cloud SIP                  February 2021


   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Applicability
   3.  Requirements
   4.  Relationship to RIPT
   5.  Reference Architecture
   6.  Solution Applicability
   7.  Overview of Solution
   8.  Configuration
   9.  SIP Behavioral Requirements
     9.1.  Calling Server
       9.1.1.  Health Probing
       9.1.2.  Utilization Measurement
       9.1.3.  New Call Initiation
       9.1.4.  Instance Failure
       9.1.5.  Instance to Inactive
       9.1.6.  Receiving a REFER
     9.2.  Cluster Instances
       9.2.1.  Sending Utilization Values
       9.2.2.  Receiving INVITE w. Replaces
       9.2.3.  Graceful Shutdown with Migration
       9.2.4.  Graceful Shutdown without Migration
     9.3.  Moving a Dialog
   10. Cloud SIP Trunk Configuration File
   11. Webhook Registration Object
   12. Instance-Utilization Header Field
   13. Why not DNS
   14. TODO
   15. Informative References
   Authors' Addresses

1.  Introduction

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119, BCP 14
   [RFC2119] and indicate requirement levels for compliant
   implementations of this specification.

   Software making use of the Session Initiation Protocol (SIP)
   [RFC3261] faces challenges in achieving high availability, especially
   for call stateful applications like softswitches, Session Border
   Controllers (SBCs), and IP-based call center applications.  The
   state maintained in the SIP, Session Description Protocol (SDP)
   Offer/Answer [RFC3264] and Secure Real Time Transport Protocol (SRTP)
   [RFC3711] layers changes frequently, and is difficult to replicate.
   For this reason, commercial systems have often relied on complex
   active-standby configurations making use of IP address takeover.
   These solutions are also ill-suited for usage in modern public cloud
   environments.  SIP assumed server-side components would not maintain
   call state, and thus it never had built-in mechanisms to facilitate
   server side HA.  In practice, the vast majority of server deployments
   are B2BUAs and maintain call state.

   Besides the challenges in replicating call state, SIP also struggles
   in achieving HA in modern cloud deployments making use of elastic
   compute.  In these environments, the underlying cloud platform (such
   as Kubernetes) can automatically add instances to, and remove them
   from, a cluster based on usage.  Similarly, it will remove elements
   from the cluster which fail health checks.  This information needs to
   propagate quickly to upstream elements, in order to avoid sending
   calls to failed or overloaded instances.  SIP envisioned that a DNS-
   based solution using SRV records [RFC3263] would be sufficient.
   However, DNS changes are slow to propagate and unpredictable.
   Commercial implementations have made use of SIP OPTIONS probing to
   assess liveness, without standardized behavior.  There is also no
   standardized way to communicate or update the IP addresses used in a
   cluster of servers.

   This specification seeks to remedy these gaps.  It defines a simple
   SIP extension, which is largely a definition of mandatory behaviors
   for SIP elements that enable rapid detection of, and recovery from, a
   failed instance while ensuring that calls do not drop.  It also
   defines a small protocol for retrieving and pushing the set of
   instances in a cluster, to support elastic expansion and contraction
   of a cluster in a fully automated fashion.

2.  Applicability

   This extension is focused on server-to-server use cases, where one or
   both sides are a cluster of servers deployed in a public cloud
   environment.  Examples of these situations include SIP trunks between
   a PSTN carrier and an enterprise, a PSTN carrier and a VOIP provider
   (such as a cloud contact center), or between VOIP providers providing
   peering.  The extension also assumes usage in bilateral peering
   arrangements, and as such, provides no mechanism for discovery.
   Rather, it assumes both sides have agreed to use this extension as
   part of configuration provided through techniques outside the scope
   of this specification.

3.  Requirements

   o  The solution must enable a call to be recovered in less than 2
      seconds.  This time represents the amount of time before which a
      user would hang up because they cannot hear the other party.

   o  A recovered call means that media continues to flow, and that
      future signaling for features or call hangup can be performed

   o  The HA technique must not require servers in the cluster to
      replicate any SIP/SDP/RTP state beyond the dialog identifiers for
      calls

   o  The solution should minimize the changes required to the SIP and
      RTP protocols and their respective implementations

   o  The solution must support the case where the telco is using
      traditional SBCs and is not deploying Kubernetes or using public
      cloud

   o  The solution must enable fully automated elastic expansion and
      contraction of clusters

   o  The solution must support availability, so that when an instance
      in a cluster fails, new calls are distributed across the remaining
      N instances

   o  The solution must support availability, so that when an instance
      of a cluster fails, all of the active calls that were being
      handled by that instance are spread across the remaining nodes in
      the cluster, within 2 seconds

   o  The solution must support clusters wherein each instance of a
      cluster has a differing amount of capacity for call handling

   o  The solution must support the ability for instances of a cluster
      to gracefully shut down without dropping calls


4.  Relationship to RIPT

   This protocol is similar in goals to RIPT - enabling SIP servers to
   run in public cloud environments, and achieve HA through techniques
   employed by web applications.  RIPT attempted to solve this problem
   by utilizing HTTP/3 and fully redefining SIP, repairing many of its
   problems in the process.  This specification is less ambitious,
   focusing on the minimum changes to SIP required to facilitate HA.

   As such, this specification does not diminish the value of a full-
   fledged replacement for SIP.

5.  Reference Architecture

   Cloud SIP uses an asymmetric relationship between peers.  One side
   acts as the caller, and the other as the call recipient.  New SIP
   calls can only be placed by the caller, not by the call recipient.
   If a deployment requires calls to flow in both directions, each side
   acts as both caller and call recipient.

  Caller                    Call Recipient

                             SIP Server Cluster
                           ......................
                           .    +------------+  .
                    +------.----| Instance 1 |  .
                    |      .    +------------+  .
                    |      .                    .         +------------+
     +---------+    |      .    +------------+  .         | Downstream |
     | Calling |----+------.----| Instance 2 |--.---------|    SIP     |
     | Server  |    |      .    +------------+  .         |    UA      |
     +---------+    |      .                    .         +------------+
                    |      .    +------------+  .
                    +------.----| Instance 3 |  .
                           .    +------------+  .
                           ......................
                                    |          |
                                    |          |
     +---------+               +---------+   +---------+
     | Config  |<------------->| Config  |   | Shared  |
     | Sync    |               | Source  |   |   DB    |
     +---------+               +---------+   +---------+


   The calling server wishes to send calls to a cluster, which has a set
   of instances.  The calling server is a B2BUA, and is capable of
   initiating calls, typically in response to an upstream INVITE it
   receives.  The calling server itself may be a member of a cluster.


   When the calling server wishes to generate a new INVITE for a new
   call, it load balances such calls amongst the instances in the
   cluster.
   Consider a specific call that was sent to instance 2, and was then
   forwarded through zero or more SIP proxies (not shown) before landing
   at a UA, referred to here as the downstream SIP UA.  The downstream
   SIP UA may itself be another B2BUA that is a member of a cluster, or
   even be an end user client.  This specification requires the
   downstream UA to implement the SIP Replaces header field [RFC3891].

   When instance 2 fails, we wish to have the call taken over by one of
   the other instances in the cluster, which can then re-establish media
   with the downstream SIP UA using INVITE/Replaces.

   There is a logical function associated with the cluster, called the
   config source, which is aware of the configuration of the cluster.
   Specifically, it knows the IP/port of each instance, and whether that
   instance is healthy.  This config source learns this information
   through non-standardized means, unique to the cloud environment in
   which the cluster resides.  The config source communicates that
   information to a config sync associated with the upstream calling
   server.  This communication is bidirectional, using HTTP requests.
   The config sync distributes this information to the calling server
   (and any other calling servers should they themselves be a cluster).

   There is a shared database of some sort, accessible by all instances
   in the cluster.  This is used to store the dialog state needed for
   operation of this extension.

6.  Solution Applicability

   This specification is applicable in two scenarios:

   1.  The calling server (and other members in its cluster) and the
       config sync service are controlled by one entity, and the cluster
       and the downstream UA are controlled by a second.  A common
       example of this is where the calling server and config sync are
       part of a telecom carrier, and the cluster and downstream UA are
       part of an enterprise or SaaS provider that has purchased SIP
       trunking services from the carrier.

   2.  The calling server, cluster, and downstream UA are all controlled
       by a single administrative entity.

   In both scenarios, the instances which make up the cluster are
   assumed to be provided by the same vendor.  This allows for vendor-
   specific solutions to replicate state and messaging as required by
   this specification.


   This specification also requires that the calling server be a UA
   (including B2BUAs), and that the instances in the cluster are B2BUAs
   and the downstream UA is under the administrative control of the same
   entity that operates the cluster.

7.  Overview of Solution

   The solution is pretty straightforward.

   The calling server will maintain, through the HTTP-based protocol
   described below, a list of instances in the cluster.  These instances
   are identified by both IP and port.  The inclusion of a port allows
   the instances to share a common IP but vary by port.  Such a
   configuration is useful inside of public cloud environments which can
   be fronted by a network load balancer which allows each instance to
   actually have the same IP, but utilize different ports.

   The calling server continuously validates that each instance in the
   cluster is alive, every 250ms.  New calls are delivered only to
   instances which are healthy based on the algorithm defined here.  It
   can ascertain health via reverse RTP traffic, rapid RTCP receiver
   reports, or via SIP OPTIONS.  If SIP OPTIONS are used, these are
   performed at a rate of a new transaction every 250ms.  This is very
   fast, but it is critical for rapid detection of failures.

   If the calling server is itself a member of a cluster, the work of
   ascertaining the health of each instance can be distributed across
   the calling servers, in order to avoid a full mesh of OPTIONS
   probing, and then the resulting state distributed through means
   outside of this specification.

   When a call is initially established (to instance 2 in this case),
   instance 2 will place an entry into the database which contains three
   pieces of information: (1) the dialogID of the SIP leg from the
   caller to itself, (2) the dialogID of the downstream SIP leg from
   itself to the downstream SIP UA, and (3) the IP address and port of
   the downstream SIP UA.

   If an instance transitions from healthy to unhealthy (instance 2 in
   this example), the calling server 'moves' the existing instance 2
   calls uniformly across the remaining healthy instances in the
   cluster.  To avoid a flood of instant traffic, it moves these calls
   over a window of at least 200ms and at most one second.  To move the
   calls, the calling server sends an INVITE with a Replaces header
   field for each such call.  Because this is a fresh SIP dialog, a new
   SDP offer/answer exchange takes place and new SRTP state is
   established.  This is what avoids the need for replication of RTP
   state, SDP state and other lower-layer states across the instances in
   the cluster.  When the INVITE/Replaces arrives at one of the
   instances in the cluster (say, instance 3), the instance takes the
   dialogID in the Replaces header field, and looks it up in the shared
   DB.  It will find that there is a matching dialog, and it will
   retrieve the outbound dialogID and downstream SIP UA.  Instance 3
   sends an INVITE/Replaces to the downstream UA, using the dialogID it
   retrieved from the database.

   Establishment of a new SIP dialog between the calling server and
   instance 3 can take place in parallel with the establishment of the
   new dialog between instance 3 and the downstream UA.  Thus the time
   required to fail over the live call is equal to the time to detect
   instance failure, plus the time to establish a new SIP call.

8.  Configuration

   A cloud SIP "trunk" is configured in the config sync service through
   means outside of the scope of this specification.  Each such trunk is
   defined by an HTTPS URI - the trunk config URI - which points to the
   config source service representing that cluster.  This is the only
   configuration required to establish a cloud SIP trunk.

   Once configured with this URI, the config sync MUST perform a GET
   against this URI.  The config source MUST return a JSON document
   conformant to the schema defined in this specification.  This
   document MUST provide the config sync with a list of instances, each
   with IP address and port.  The JSON document MUST also contain a
   cluster name, formatted as a hostname, and a webhook registration
   URI.

   Once retrieved, the config sync MUST perform a POST against the
   webhook registration URI.  The POST MUST contain a JSON document
   conformant to the schema defined in this specification.  That
   document MUST contain an HTTPS webhook URI used by the config sync to
   receive webhook callbacks that push an updated cluster configuration
   from the config source.

   The config sync MUST refresh its webhook registration at least once a
   day to ensure that an up to date value for the webhook URI exists.

   The config source MUST perform a POST against the webhook URI
   whenever the cluster configuration changes, including when it
   detects, on its own, that an instance is unhealthy, removing it from
   the list.
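
   As a non-normative illustration, the config sync side of this
   exchange can be sketched in Python.  The sketch only parses the
   trunk configuration document and builds the webhook registration
   body; the HTTP GET and POST themselves are omitted.  The field names
   are taken from the "Something like" examples later in this document
   and are therefore tentative.  The names Instance, parse_trunk_config
   and webhook_registration_body, and the choice to default a missing
   "status" to "active", are assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    ip: str
    port: int
    status: str  # "active" or "inactive", as in the sample document


def parse_trunk_config(text):
    """Return (cluster_name, webhook_registration_uri, instances)."""
    doc = json.loads(text)
    name = doc["cloud-sip-trunk-name"]
    registration_uri = doc["webhook-registration"]
    instances = [
        # Assumption: an entry without "status" is treated as active.
        Instance(ip=i["IP"], port=int(i["port"]), status=i.get("status", "active"))
        for i in doc["instances"]
    ]
    return name, registration_uri, instances


def webhook_registration_body(webhook_uri):
    """Build the JSON body the config sync POSTs to the registration URI."""
    if not webhook_uri.startswith("https://"):
        raise ValueError("the webhook URI must be an HTTPS URI")
    return json.dumps({"webhook": webhook_uri})


SAMPLE = """
{
    "cloud-sip-trunk-name": "trunk32.acme.com",
    "uri": "https://configs.sip.acme.com/trunk32",
    "version": 23,
    "webhook-registration": "https://webhooks.sip.acme.com/trunk32",
    "instances": [
        {"IP": "1.2.3.4", "port": "5061", "status": "active"},
        {"IP": "1.2.3.7", "port": "5061", "status": "inactive"}
    ]
}
"""
```

   A config sync would call parse_trunk_config on the body of the GET
   response, POST the result of webhook_registration_body to the
   returned registration URI, and repeat that POST at least once a day.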










9.  SIP Behavioral Requirements

9.1.  Calling Server

9.1.1.  Health Probing

   The calling server MUST be capable of detecting the failure of any
   instance of the cluster within 1.5 + RTT seconds.  The specific means
   for doing this detection can vary by implementation.  It is also
   expected that some implementations may have failure detection
   computed from one instance of a calling server, and the resulting
   state shared with other instances of the calling server through some
   means outside the scope of this specification.

   One suggested technique for detecting failure is to utilize a SIP
   OPTIONS probe.  The OPTIONS request can be sent every 250ms, directed
   to each instance of the cluster.  To facilitate high scale and
   determination of RTT, a single OPTIONS request can be sent for each
   transaction (since retransmits are largely useless due to the short
   timeout defined for this use case).  With such an interval, the
   calling server can consider an instance unhealthy at time T if, at
   time T, zero OPTIONS responses have been received for a time equal to
   the RTT to the instance plus 6 * 250ms = 1.5s.  The calling server
   can maintain the RTT in any fashion it desires.  If the OPTIONS
   requests for a specific transaction are not retransmitted, the time
   between transmission of the request and receipt of the response can
   be used to measure RTT.

   A MUST strength for 1.5s + RTT is specified to ensure that the
   cluster can count on consistent and predictable behavior from the
   upstream calling server.  An instance is considered healthy if it is
   not unhealthy.
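
   The suggested OPTIONS-based rule can be sketched as follows.  This
   is a minimal illustration, not a required implementation: the class
   name ProbeState is hypothetical, times are plain floating point
   seconds supplied by the caller, and the RTT estimate is simply the
   most recent sample (the text above leaves the RTT estimator to the
   implementation).

```python
PROBE_INTERVAL_S = 0.250
MISSED_PROBES = 6  # 6 * 250 ms = 1.5 s


def unhealthy_after(rtt_s):
    """Silence, in seconds, after which an instance is deemed unhealthy."""
    return rtt_s + MISSED_PROBES * PROBE_INTERVAL_S


class ProbeState:
    """Per-instance probe bookkeeping kept by the calling server."""

    def __init__(self, now_s, rtt_s=0.0):
        self.last_response_s = now_s
        self.rtt_s = rtt_s

    def on_response(self, sent_s, received_s):
        # OPTIONS requests are not retransmitted, so request-to-response
        # time is a usable RTT sample. Assumption: keep only the latest one.
        self.rtt_s = received_s - sent_s
        self.last_response_s = received_s

    def is_unhealthy(self, now_s):
        return now_s - self.last_response_s >= unhealthy_after(self.rtt_s)
```

   For example, with an RTT of 40ms an instance whose last response
   arrived at time t is considered unhealthy from roughly t + 1.54s
   onward, and healthy before that.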

   OPEN ISSUE: Should we put a Require header field in the OPTIONS?
   Should we specify any other behaviors in the OPTIONS?

9.1.2.  Utilization Measurement

   The instances can place a SIP header field, Instance-Utilization,
   into all responses sent to the calling server.  Its value indicates
   the utilization of that instance, as an integer from 0 to 100.  The
   calling server uses these values to weight new traffic in proportion
   to the spare capacity (100 minus the utilization) of each instance.

   The calling server MUST remove this header field from a response
   before propagating that response upstream, as the value only has
   significance on the link between the calling server and cluster.


   If this header field is present in a response, the calling server
   MUST remember the most recent value received from that instance
   (ordered by the wall clock time at which the response is received).
   The calling server MUST NOT utilize the source IP of the response to
   identify the instance.  Instead, it MUST correlate the response to a
   request, and remember the instance to which the request was sent.

   If no value has been received for 5 seconds, or no value was ever
   received, the default value of 50 MUST be used as the utilization.
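
   The bookkeeping described above can be sketched as a small table
   keyed by the instance to which the request was sent (not by the
   source IP of the response).  The class name UtilizationTable and the
   use of an (IP, port) tuple as the key are assumptions for
   illustration only.

```python
DEFAULT_UTILIZATION = 50
STALE_AFTER_S = 5.0


class UtilizationTable:
    def __init__(self):
        # instance the request was sent to -> (received_at_s, value)
        self._latest = {}

    def record(self, instance, value, received_at_s):
        """Remember a value, ordered by the time the response was received."""
        previous = self._latest.get(instance)
        if previous is None or received_at_s >= previous[0]:
            self._latest[instance] = (received_at_s, value)

    def current(self, instance, now_s):
        """Return the value to use now, falling back to the default of 50."""
        entry = self._latest.get(instance)
        if entry is None or now_s - entry[0] > STALE_AFTER_S:
            return DEFAULT_UTILIZATION
        return entry[1]
```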

9.1.3.  New Call Initiation

   The calling server MUST NOT place a new SIP call to an instance in
   the cluster which is unhealthy at the time the call is to be placed.

   The calling server MUST select an instance for the call using a
   random function across the instances which are healthy.  The calling
   server MUST weight the probability of selecting that instance in
   proportion to (100 - the utilization of that instance).  It MUST then
   direct the call to this instance, by sending the SIP INVITE to the IP
   address and port of this instance.

   As an example, if a cluster has three instances with utilizations at
   50, 75 and 100, and all three instances are healthy, the weights are
   50, 25 and 0.  No INVITEs are sent to instance 3, two thirds are sent
   to instance 1, and one third are sent to instance 2.  Note that, in
   this case, since instance 3 is not handling new calls, further
   utilization values can only be learned via responses to OPTIONS
   pings, which the calling server MUST send for instances with over
   90% utilization.
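
   The weighted selection can be sketched with the Python standard
   library.  The function names are hypothetical, and the input is
   assumed to contain only instances already known to be healthy.
   This document does not say what to do when every healthy instance
   reports 100% utilization; raising an error in that case is an
   assumption of the sketch.

```python
import random


def selection_weights(utilization):
    """Map {instance: utilization} to weights of (100 - utilization)."""
    return {inst: 100 - u for inst, u in utilization.items()}


def selection_probabilities(utilization):
    weights = selection_weights(utilization)
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no healthy instance has spare capacity")
    return {inst: w / total for inst, w in weights.items()}


def pick_instance(utilization, rng=random):
    """Pick one healthy instance at random, weighted by spare capacity."""
    weights = selection_weights(utilization)
    if sum(weights.values()) == 0:
        raise ValueError("no healthy instance has spare capacity")
    instances = list(weights)
    return rng.choices(instances, weights=[weights[i] for i in instances], k=1)[0]
```

   Calling selection_probabilities with utilizations of 50, 75 and 100
   gives probabilities of 2/3, 1/3 and 0 respectively, so an instance
   at 100% utilization is never selected by pick_instance.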

9.1.4.  Instance Failure

   When the calling server detects the failure of an instance, it MUST
   identify all calls which are still active, which were sent to that
   instance.  For each such call, it MUST select a new instance for that
   call, by choosing one using a uniformly distributed random function
   amongst the healthy instances.  The calling server MUST generate a
   new INVITE (not a re-INVITE), establishing a new SIP dialog.  This
   INVITE MUST contain a Replaces header field.  The Replaces header
   field MUST contain the dialogID of the call which is being failed
   over.  The INVITE requests MUST be sent uniformly across a 500ms
   window of time.
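
   A possible way to plan the failover is sketched below.  The names
   Dialog and plan_failover are hypothetical.  "Uniformly across a
   500ms window" is interpreted here as evenly spaced send offsets,
   which is one reading of the requirement.  The Replaces value follows
   the call-id, to-tag and from-tag form of [RFC3891], with the to-tag
   set to the tag of the instance being replaced and the from-tag set
   to the calling server's own tag.

```python
import random
from dataclasses import dataclass

FAILOVER_WINDOW_S = 0.5


@dataclass(frozen=True)
class Dialog:
    call_id: str
    caller_tag: str    # tag chosen by the calling server
    instance_tag: str  # tag chosen by the (now failed) instance


def replaces_value(dialog):
    """Value of the Replaces header field identifying the old dialog."""
    return (f"{dialog.call_id};to-tag={dialog.instance_tag}"
            f";from-tag={dialog.caller_tag}")


def plan_failover(dialogs, healthy_instances, rng=random,
                  window_s=FAILOVER_WINDOW_S):
    """Return (send_offset_s, target_instance, replaces_value) per call."""
    if not healthy_instances:
        raise ValueError("no healthy instance to fail over to")
    count = len(dialogs)
    plan = []
    for k, dialog in enumerate(dialogs):
        offset_s = window_s * k / count          # evenly spaced in the window
        target = rng.choice(healthy_instances)   # uniform random choice
        plan.append((offset_s, target, replaces_value(dialog)))
    return plan
```

   Each tuple describes one new INVITE (not a re-INVITE): when to send
   it relative to failure detection, which instance to send it to, and
   the Replaces header field value to include.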

9.1.5.  Instance to Inactive

   If the config sync receives an updated configuration file, and one of
   the instances from the cluster has been marked as inactive, the
   calling server MUST NOT send new calls to that instance.  However, it
   MUST keep existing calls up, and MUST continue to send OPTIONS probes
   to that instance.

9.1.6.  Receiving a REFER

   If the calling server receives a REFER request, and the Refer-To URI
   has a domain portion equal to the IP address of a cluster instance or
   the FQDN of the cluster, and the Refer-To URI contains an embedded
   Replaces header field containing a dialogID of a call managed by the
   calling server, then this REFER is meant to trigger a movement of the
   call.

   The calling server MUST authenticate that this request came from an
   instance in the cluster.  The request is authorized if the domain
   portion of the Refer-To URI contains an IP address of a cluster
   instance, or the FQDN of the cluster, and the dialogID in the
   embedded Replaces header field matches a dialog that is in progress
   to that cluster.

   If the domain portion of the URI contains an IP address, the calling
   server MUST perform the requested INVITE/Replaces to that cluster
   instance.  If the domain portion contains the FQDN of the cluster,
   the calling server MUST send the INVITE/Replaces to one of the other
   cluster instances, besides the one to which the dialog is currently
   connected.  It MUST select amongst the other instances as if the
   currently connected instance were inactive, and then round robin
   using the utilization measures for the remaining instances.
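
   The checks on the Refer-To URI can be sketched as follows.  This is
   an illustration under several assumptions: the function names are
   hypothetical, the parser handles only IPv4 addresses and hostnames
   (not bracketed IPv6 references), and a dialog is matched on the
   Call-ID portion of the Replaces value alone, whereas a real
   implementation would also compare the tags.  Authenticating the
   sender of the REFER is a separate step not shown here.

```python
from urllib.parse import unquote


def parse_refer_to(uri):
    """Return (host, replaces_value or None) for a sip: or sips: URI."""
    scheme, _, rest = uri.partition(":")
    if scheme.lower() not in ("sip", "sips"):
        raise ValueError("not a SIP URI: " + uri)
    addr, _, header_part = rest.partition("?")
    # Drop any user part and URI parameters, then the port.
    hostport = addr.rpartition("@")[2].partition(";")[0]
    host = hostport.partition(":")[0].lower()
    headers = {}
    for pair in header_part.split("&"):
        name, _, value = pair.partition("=")
        if name:
            headers[unquote(name).lower()] = unquote(value)
    return host, headers.get("replaces")


def classify_refer(refer_to_uri, instance_ips, cluster_fqdn, active_call_ids):
    """Return ("instance", ip), ("cluster", None), or None if not a move."""
    host, replaces = parse_refer_to(refer_to_uri)
    if replaces is None:
        return None
    call_id = replaces.split(";", 1)[0]
    if call_id not in active_call_ids:
        return None
    if host in instance_ips:
        return ("instance", host)
    if host == cluster_fqdn.lower():
        return ("cluster", None)
    return None
```

   A result of ("instance", ip) means the INVITE/Replaces goes to that
   specific instance, while ("cluster", None) means the calling server
   chooses one of the other instances itself.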

   TODO: better explanation, more details

9.2.  Cluster Instances

9.2.1.  Sending Utilization Values

   It is RECOMMENDED that if any one instance of a cluster sends values
   for Instance-Utilization, all instances do.  If none send it, calls
   will be uniformly balanced across the cluster.  Thus, the usage of
   this header field is only meant for cases where uniform load
   balancing will not produce uniform utilization.

   If an instance is configured to send utilization, it MUST place an
   Instance-Utilization header field in all responses it sends to all
   transactions, and include its current measure of utilization.  The
   utilization measure MUST be an integer between 0 and 100 inclusive.
   Since absolute ordering of responses cannot be guaranteed, the
   measure SHOULD NOT change more frequently than once a second.
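
   One way for an instance to honor both the 0 to 100 range and the
   once-a-second limit on changes is sketched below.  The class name
   UtilizationReporter is hypothetical, and the header field syntax
   shown is the tentative one from this document, whose formal syntax
   is still to be defined.

```python
class UtilizationReporter:
    MIN_CHANGE_INTERVAL_S = 1.0

    def __init__(self):
        self._advertised = None   # value currently placed in responses
        self._changed_at_s = None

    def header(self, measured, now_s):
        """Return the header line to add to a response sent at now_s."""
        value = max(0, min(100, int(round(measured))))
        may_change = (
            self._advertised is None
            or now_s - self._changed_at_s >= self.MIN_CHANGE_INTERVAL_S
        )
        if may_change and value != self._advertised:
            self._advertised = value
            self._changed_at_s = now_s
        return f"Instance-Utilization: {self._advertised}"
```

   Between permitted changes the reporter keeps repeating the last
   advertised value, so responses that arrive out of order at the
   calling server still carry consistent values.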







9.2.2.  Receiving INVITE w.  Replaces

   If an instance in the cluster receives an INVITE, and that INVITE
   has a Replaces header field containing a dialogID for a call that
   the instance knows is in progress within the cluster, it will know
   that this is a failover call.  It may happen that the failover
   call is one being handled by the instance receiving the INVITE with
   Replaces.  This is a race condition, but in this case the instance
   MUST still follow the procedures defined here.

   If this is a failover call, the instance MUST authenticate that the
   INVITE came from the upstream calling server.

   There may be cases where the cluster instance receives an INVITE with
   Replaces header field, but the dialogID does not match a dialog known
   to the cluster.  In such a case, the INVITE MUST be treated as a
   normal INVITE with a Replaces header field as defined by [RFC3891].
   In many cases this may be propagated downstream, or challenged for
   credentials, neither of which is done if the dialogID is a match for
   a dialog known to the cluster.

   Any downstream SIP dialogs associated with the call MUST be sent an
   INVITE with Replaces, moving the call to this instance.  This will
   necessarily require the cluster to store the dialogIDs for all
   dialogs in and out of the cluster, along with any application state
   needed to reconstruct the dialogs at a new instance.
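
   The lookup performed by the receiving instance can be sketched with
   an in-memory stand-in for the shared database.  The names
   CallRecord, SharedCallStore and handle_invite_with_replaces are
   hypothetical, dialogIDs are treated as opaque strings, and the
   application state mentioned above is not modeled.  Authentication
   of the upstream calling server, and updating the record once the
   new dialogs are established, are also not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    upstream_dialog_id: str    # dialog from the calling server to the instance
    downstream_dialog_id: str  # dialog from the instance to the downstream UA
    downstream_addr: tuple     # (ip, port) of the downstream SIP UA


class SharedCallStore:
    """In-memory stand-in for the database shared by all instances."""

    def __init__(self):
        self._by_upstream = {}

    def put(self, record):
        self._by_upstream[record.upstream_dialog_id] = record

    def get(self, upstream_dialog_id):
        return self._by_upstream.get(upstream_dialog_id)


def handle_invite_with_replaces(store, replaces_dialog_id):
    """Decide how to treat an incoming INVITE that carries Replaces."""
    record = store.get(replaces_dialog_id)
    if record is None:
        # Unknown to the cluster: ordinary RFC 3891 processing applies.
        return ("rfc3891", None)
    # Failover call: send INVITE/Replaces to the downstream UA.
    return ("failover", {"target": record.downstream_addr,
                         "replaces": record.downstream_dialog_id})
```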

9.2.3.  Graceful Shutdown with Migration

   In cases where an instance in the cluster wishes to shut down quickly
   (perhaps to facilitate a rolling upgrade across the cluster), it can
   do so by ceasing to respond to OPTIONS requests targeted to itself.
   The upstream caller will see this as a failure, and move all of the
   calls off of the instance, onto the remaining instances in the
   cluster.  When the instance reboots, it will begin responding to the
   OPTIONS probes, enabling it to begin to receive new calls.

9.2.4.  Graceful Shutdown without Migration

   Another common use case for graceful restart is to cease accepting
   new calls, but to allow the calls in progress to complete.  Once all
   of the calls have completed, the instance can shut down and restart
   if desired.

   To accomplish this, the cluster config service will mark the instance
   as inactive in the config file, and pass the updated file to the
   config sync via webhook.  This will cause the calling server to stop
   sending new calls to the instance.  However, calls in progress will
   not be dropped.

9.3.  Moving a Dialog

   Another common case is that an instance is overloaded and wishes to
   shed a few calls.  To facilitate this, a cluster instance MAY send a
   REFER to the calling server, requesting it to send an INVITE with a
   Replaces header field.  The Replaces header field embedded in the
   Refer-To URI MUST contain the dialogID of the call from the calling
   server to that instance, which is to be moved.  To move the call to a
   specific other instance in the cluster, the domain portion of the URI
   is set to be equal to the IP address of that instance.  Note that the
   calling server will validate that this IP address is another member
   of the cluster before authorizing the REFER.  Alternatively, the
   REFER can request the calling server to send the call to any one of
   the other instances in the cluster, not including itself.  To do
   that, it sets the domain portion of the SIP URI equal to the cluster
   FQDN.
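
   A rough sketch of how the Refer-To URI might be built, and how the
   calling server might check its target, is given below.  It assumes
   that the dialog is identified by the Call-ID, to-tag and from-tag
   carried in the Replaces header field as in [RFC3891]; the function
   names and parameters are hypothetical and are not defined by this
   specification.

```python
from urllib.parse import quote


def build_refer_to_uri(target_host, call_id, to_tag, from_tag):
    """Build a Refer-To URI carrying an embedded Replaces header field.

    `target_host` is either the IP address of a specific instance or the
    cluster FQDN (meaning "any other instance in the cluster").
    """
    replaces = f"{call_id};to-tag={to_tag};from-tag={from_tag}"
    # Header field values embedded in a SIP URI must be percent-escaped.
    return f"sip:{target_host}?Replaces={quote(replaces, safe='')}"


def authorize_refer_target(target_host, cluster_fqdn, cluster_ips, sender_ip):
    """Check the REFER target as the calling server might (sketch only).

    Accepts the cluster FQDN, or the IP address of a cluster member other
    than the instance that sent the REFER.
    """
    if target_host == cluster_fqdn:
        return True
    return target_host in cluster_ips and target_host != sender_ip
```

   For example, build_refer_to_uri("1.2.3.7", "abc@1.2.3.4", "x1",
   "y2") yields
   sip:1.2.3.7?Replaces=abc%401.2.3.4%3Bto-tag%3Dx1%3Bfrom-tag%3Dy2.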

   TODO: Probably need examples and some more details on in or out of
   dialog REFER

10.  Cloud SIP Trunk Configuration File

   Something like:

   {
       "cloud-sip-trunk-name" : "trunk32.acme.com",
       "uri" : "https://configs.sip.acme.com/trunk32",
       "version" : 23,
       "webhook-registration" : "https://webhooks.sip.acme.com/trunk32",

       "instances" : [
           {
               "IP" : "1.2.3.4",
               "port" : "5061",
               "status" : "active"
           },
           {
               "IP" : "1.2.3.7",
               "port" : "5061",
               "status" : "inactive"
           }
       ]
   }
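
   A minimal sketch of how a calling server might read such a file is
   shown below.  The function names are hypothetical.  The rule that a
   file only replaces the cached copy when its "version" is higher is
   an assumption made for the example, not something this document
   specifies.

```python
import json


def parse_trunk_config(text):
    """Parse a trunk config document into (version, [(ip, port, status)])."""
    doc = json.loads(text)
    instances = [
        # The sample carries the port as a string, so convert it here.
        (inst["IP"], int(inst["port"]), inst["status"])
        for inst in doc["instances"]
    ]
    return doc["version"], instances


def should_replace(current_version, new_version):
    """Assumption: a config replaces the cached one only if it is newer."""
    return new_version > current_version
```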




11.  Webhook Registration Object

   Something like:

   {
       "webhook" : "https://webhook-receipt.sip.acme.com"
   }


12.  Instance-Utilization Header Field

   Something like:

   Instance-Utilization: 34


   IANA registration and formal syntax TBD.
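
   Since the formal syntax is still TBD, the following is only a
   sketch of how a receiver might parse the header field.  It assumes
   the value is an integer percentage between 0 and 100; that range,
   and the function name, are assumptions made for illustration.

```python
def parse_instance_utilization(header_line):
    """Parse 'Instance-Utilization: N' and return N as an int, or None.

    Assumption: the value is an integer percentage in 0..100. The draft
    leaves the formal syntax TBD, so this range is illustrative only.
    """
    name, sep, value = header_line.partition(":")
    # SIP header field names are case-insensitive.
    if not sep or name.strip().lower() != "instance-utilization":
        return None
    value = value.strip()
    if not (value.isascii() and value.isdigit()):
        return None
    number = int(value)
    return number if 0 <= number <= 100 else None
```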

13.  Why not DNS

   The usage of DNS - and specifically [RFC3263] - might appear to be an
   alternative to the mechanism in this specification for communicating
   the IP addresses for the instances of the cluster.  However, DNS does
   not meet the requirements outlined above.

   Firstly, DNS is not fast enough to be responsive to the need to add
   or remove an instance from the cluster.  Changes in DNS can take time
   to propagate.  At the time [RFC3263] was conceived, the notion of
   elastic (and automated) expansion and contraction of clusters did not
   exist.  Cluster instance IPs were extremely static and therefore DNS
   was sufficient.  This is no longer the case.

   Secondly, DNS cannot convey state - in particular, information about
   whether the cluster instances are active or inactive.  This is needed
   to facilitate graceful shutdown of instances.  [RFC3263] did not have
   to concern itself with this problem because, at the time, it was
   believed that SIP servers would not contain call state.

   In addition, because we need to fail over extremely quickly - in
   under two seconds - the calling server needs to perform rapid health
   probing against all instances in the cluster.  This requires the
   calling server to know the IP addresses of all the instances in the
   cluster.  Typically, DNS queries for an FQDN return one or perhaps a
   handful of A records, and not every single A record.  We expect this
   specification to be used with clusters that have instance counts in
   the hundreds, which is wholly inappropriate to convey via DNS.

14.  TODO

   Reconcile this with draft-kinamdar-dispatch-sip-auto-peer.

15.  Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
              A., Peterson, J., Sparks, R., Handley, M., and E.
              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
              DOI 10.17487/RFC3261, June 2002,
              <https://www.rfc-editor.org/info/rfc3261>.

   [RFC3263]  Rosenberg, J. and H. Schulzrinne, "Session Initiation
              Protocol (SIP): Locating SIP Servers", RFC 3263,
              DOI 10.17487/RFC3263, June 2002,
              <https://www.rfc-editor.org/info/rfc3263>.

   [RFC3264]  Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model
              with Session Description Protocol (SDP)", RFC 3264,
              DOI 10.17487/RFC3264, June 2002,
              <https://www.rfc-editor.org/info/rfc3264>.

   [RFC3711]  Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
              Norrman, "The Secure Real-time Transport Protocol (SRTP)",
              RFC 3711, DOI 10.17487/RFC3711, March 2004,
              <https://www.rfc-editor.org/info/rfc3711>.

   [RFC3891]  Mahy, R., Biggs, B., and R. Dean, "The Session Initiation
              Protocol (SIP) "Replaces" Header", RFC 3891,
              DOI 10.17487/RFC3891, September 2004,
              <https://www.rfc-editor.org/info/rfc3891>.

Authors' Addresses

   Jonathan Rosenberg
   Five9

   Email: jdrosen@jdrosen.net


   Cullen Jennings
   Cisco

   Email: fluffy@cisco.com


   Tolga Asveren
   Ribbon Communications

   Email: tasveren@rbbn.com