As of version 2.6.1, the SIS-SIP EasySIP Resource Adaptor supports session replication. This means that SIP dialogs can be failed over to other cluster nodes if the original node fails.
Overview of Operation
Session replication is disabled by default. It can be enabled for all dialogs, or alternatively an application can specify that a particular dialog must be replicated, using an API call.
Normal Operation
When replication is enabled, the SIS initially creates the dialog state in local memory. When the dialog reaches the "Confirmed" state, the SIS writes the dialog state to its replicated store. This is either Rhino’s in-memory database, or an external key-value store, such as Cassandra.
Early dialogs (dialogs where the initial request has not yet received a 2xx response) are not replicated.
When the creating node receives mid-dialog requests, it processes the requests normally using the dialog state in local memory. When the SIP transaction completes, the updated dialog state is written to the replicated store.
When a dialog-terminating request such as BYE is processed, the local and replicated dialog state is removed when the transaction completes.
Failover and Recovery
If a node fails, it is assumed that an external mechanism, such as DNS or a load balancer, will direct SIP traffic to the surviving nodes in a cluster.
When a node receives a mid-dialog request for a dialog that does not exist in local memory, it attempts to load the dialog from its replicated state.
- If the dialog is found, it is copied into local memory and the node can continue processing the mid-dialog request.
- If the dialog is not found, then the SIS node rejects the request with a 481 Call/Transaction Does Not Exist response.
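The lookup order above can be sketched as a small model. This is only an illustration under stated assumptions: the class, its maps and the Outcome enum are hypothetical stand-ins for the node's local memory and the replicated store, and are not part of the SIS or EasySIP API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the recovery lookup order; not the SIS implementation. */
final class DialogLookupSketch {
    enum Outcome { PROCESSED_LOCALLY, RECOVERED_FROM_REPLICA, REJECTED_481 }

    // Stand-in for dialog state held in this node's local memory.
    final Map<String, String> localDialogs = new HashMap<>();
    // Stand-in for the replicated store shared by all nodes.
    final Map<String, String> replicatedStore;

    DialogLookupSketch(Map<String, String> replicatedStore) {
        this.replicatedStore = replicatedStore;
    }

    Outcome onMidDialogRequest(String dialogId) {
        // 1. The dialog is already in local memory: process as normal.
        if (localDialogs.containsKey(dialogId)) {
            return Outcome.PROCESSED_LOCALLY;
        }
        // 2. Otherwise try to load the dialog from replicated state.
        String state = replicatedStore.get(dialogId);
        if (state == null) {
            // 3. Unknown dialog: reject with 481 Call/Transaction Does Not Exist.
            return Outcome.REJECTED_481;
        }
        // Copy the recovered state into local memory and continue processing.
        localDialogs.put(dialogId, state);
        return Outcome.RECOVERED_FROM_REPLICA;
    }
}
```

In this model a second request for the same dialog is served from local memory, which matches the description of the recovered dialog staying on the node that took it over.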
When the original node recovers, failed-over dialogs do not migrate back to that node. Rather, the dialog remains on the node that last took over ownership of the dialog, as long as that node is alive. This is managed by Session Ownership.
Session Ownership
The SIS uses Rhino’s Session Ownership facility to track which node is currently responsible for a dialog. This is to ensure that requests in a dialog are processed on the same node if possible, for consistency. Otherwise dialog state could be updated by several nodes, leading to errors. In other words, dialogs are "sticky" and will only migrate to another node in the event of a node failure.
When a SIS node receives a mid-dialog request for a dialog that it does not currently own, the request is automatically proxied to the owning node, as determined by the Session Ownership facility. If the owning node is down, then the current node may take over ownership of the dialog and resume processing the request. The Session Ownership facility ensures that subsequent requests for this dialog will be directed to the correct node.
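The routing decision described above can be summarised in a few lines. The following is a minimal sketch, assuming that an ownership record already exists for the dialog and that node liveness is known. The class, the Action enum and the URI strings are hypothetical and do not reflect the Session Ownership facility's actual API.

```java
import java.util.Set;

/** Hypothetical model of the ownership-based routing decision for a mid-dialog request. */
final class OwnershipRoutingSketch {
    enum Action { PROCESS_LOCALLY, PROXY_TO_OWNER, TAKE_OVER_OWNERSHIP }

    private OwnershipRoutingSketch() {}

    /**
     * @param selfUri      URI identifying the node that received the request
     * @param ownerUri     owner URI stored in the dialog's ownership record
     * @param liveNodeUris URIs of the nodes currently believed to be alive
     */
    static Action route(String selfUri, String ownerUri, Set<String> liveNodeUris) {
        if (selfUri.equals(ownerUri)) {
            return Action.PROCESS_LOCALLY;      // this node already owns the dialog
        }
        if (liveNodeUris.contains(ownerUri)) {
            return Action.PROXY_TO_OWNER;       // keep the dialog "sticky" to its owner
        }
        return Action.TAKE_OVER_OWNERSHIP;      // owner is down: this node may take over
    }
}
```

The case where no ownership record exists is deliberately not modelled here.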
Session Ownership records are not automatically created by the SIS; they must be created explicitly by the application processing the initial request. The Sentinel VoLTE session tracking features perform this function.
Configuration for session replication
Enabling Session Replication and Session Ownership requires some configuration changes in the Rhino platform, as well as the SIS instance.
Session Replication and Session Ownership are only available from SIS 2.6.1 onwards, which requires Rhino 2.6.1 or later.
Replication method
The replication method is determined by the Rhino namespace that the SIS instance is deployed into.
Rhino 2.6.1 supports two replication methods:
- savanna: state is replicated to cluster nodes using Rhino’s Savanna reliable multicast protocol.
- key-value-store: state is written to an external key-value store database, currently Cassandra.
The key-value-store method is preferred for large clusters.
Session ownership
Rhino’s Session Ownership facility is automatically available to the SIS instance if its namespace has session ownership enabled. This requires configuring a Session Ownership store in Rhino, and creating a namespace with session ownership support.
Session replication mode
The sessionReplicationMode configuration property on the SIS-SIP EasySIP RA determines whether session replication will be used.
This property has three values:
- disabled: no replication will be used.
- enabled: replication is enabled.
- automatic: replication is enabled if the replication method is key-value-store.
The default value is automatic.
Replicate by default
When replication is enabled in the SIS, there is also the option to selectively enable it on particular sessions.
This is controlled by the replicateByDefault boolean configuration property:
- If true, and session replication is enabled, all sessions will be replicated.
- If false, and session replication is enabled, then a session is only replicated on demand, when the application calls SipSession.startReplicating().
The default value is true.
When startReplicating() is called, the SIS replicates the session as soon as it is able to.
If the dialog is not yet confirmed, replication will begin when it reaches the confirmed state.
If the dialog is already confirmed, replication begins immediately, storing the current dialog state.
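The way sessionReplicationMode, the replication method and replicateByDefault combine can be written down as a truth function. This is a hedged sketch of the rules as described in this section, not the resource adaptor's code. The class, enum and parameter names are hypothetical.

```java
/** Hypothetical model of when a session's state is written to the replicated store. */
final class ReplicationPolicySketch {
    enum Mode { DISABLED, ENABLED, AUTOMATIC }
    enum Method { SAVANNA, KEY_VALUE_STORE }

    private ReplicationPolicySketch() {}

    /** Whether replication is switched on at all for the SIS instance. */
    static boolean replicationEnabled(Mode mode, Method method) {
        if (mode == Mode.ENABLED) {
            return true;
        }
        if (mode == Mode.AUTOMATIC) {
            // "automatic" only enables replication for the key-value-store method.
            return method == Method.KEY_VALUE_STORE;
        }
        return false; // DISABLED
    }

    /** Whether a given session is being replicated right now. */
    static boolean shouldReplicateNow(Mode mode, Method method,
                                      boolean replicateByDefault,
                                      boolean startReplicatingCalled,
                                      boolean dialogConfirmed) {
        if (!replicationEnabled(mode, method)) {
            return false;
        }
        // With replicateByDefault=false the application must opt in per session.
        if (!replicateByDefault && !startReplicatingCalled) {
            return false;
        }
        // Early dialogs are never replicated; replication starts once confirmed.
        return dialogConfirmed;
    }
}
```

For example, with the mode set to enabled and replicateByDefault set to false, a confirmed session is replicated only after the application has called startReplicating().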
Dynamic SRV Name Format
To make use of per-node SRV addresses, the DynamicSRVNameFormat network interface property must be configured.
The value of this property is a string that describes how a DNS SRV name is derived from the node’s IP address.
The special token ${IP} is replaced with a DNS-safe encoding of the node’s IP address, which may be used as a domain name component.
For example, if DynamicSRVNameFormat is tas-${IP}.site1.home1.net, then the node with IP address 192.168.10.1 will use the hostname tas-192-168-10-1.site1.home1.net in its Contact or Record-Route URIs.
For IPv6 addresses, the address is expanded in full and each 2-octet group is hex-encoded as four digits and delimited with a hyphen.
For example, the address 1080::8:800:200C:417A is encoded as 1080-0000-0000-0000-0008-0800-200c-417a.
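The encoding described above is easy to reproduce when generating the matching DNS records. The sketch below is an independent re-implementation based only on the two examples in this section, not the SIS's own code. The class and method names are hypothetical.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical helper that mirrors the ${IP} encoding described above. */
final class DynamicSrvNameSketch {
    private DynamicSrvNameSketch() {}

    /** Encodes an IPv4 or IPv6 address literal as a DNS-safe label. */
    static String encodeIp(String ipLiteral) {
        byte[] raw;
        try {
            // For an IP literal this only parses the text; a host name would trigger a DNS lookup.
            raw = InetAddress.getByName(ipLiteral).getAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Not a valid IP address: " + ipLiteral, e);
        }
        StringBuilder sb = new StringBuilder();
        if (raw.length == 4) {
            // IPv4: decimal octets joined with hyphens.
            for (int i = 0; i < raw.length; i++) {
                if (i > 0) sb.append('-');
                sb.append(raw[i] & 0xff);
            }
        } else {
            // IPv6: each 2-octet group as four lowercase hex digits, joined with hyphens.
            for (int i = 0; i < raw.length; i += 2) {
                if (i > 0) sb.append('-');
                sb.append(String.format("%02x%02x", raw[i] & 0xff, raw[i + 1] & 0xff));
            }
        }
        return sb.toString();
    }

    /** Substitutes the encoded address for the ${IP} token in a name format. */
    static String expand(String nameFormat, String ipLiteral) {
        return nameFormat.replace("${IP}", encodeIp(ipLiteral));
    }
}
```

Calling expand("tas-${IP}.site1.home1.net", "192.168.10.1") gives the hostname used in the example above.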
Per-node SRV addresses
For dialogs created by the SIS, mid-dialog requests are routed to the SIP URI provided in the Contact or Record-Route headers of the SIS’s dialog-creating request or response.
By default this URI will contain the IP address of the SIS node.
This means that if a node fails, mid-dialog requests cannot fail over to another IP address, and so will fail.
The SIS already has the capability to use virtual addresses in its URIs, using the VirtualAddresses network interface property.
The virtual address could be a DNS name that resolves to a load balancer address, or it might refer to a DNS NAPTR, SRV or A (address) record.
In the event of a node failure, some load balancers may be able to ensure that subsequent requests in a dialog are routed to the same surviving node. But for sites relying on DNS and SIP’s standard RFC 3263 DNS procedures for locating servers, there is no guarantee that subsequent requests in a dialog will be routed to the same node after a failure — the single virtual address in the SIS’s URI can resolve to any SIS node.
To support DNS failover in a more predictable fashion, the SIS may use per-node SRV addresses in its own SIP URIs. These are DNS SRV names generated from the SIS node’s IP address, so they are specific to that node. The operator can provision corresponding DNS SRV records to specify the primary and backup nodes. Requests will be routed to the given node when it is available, but will fail over to the given backup nodes by the rules of RFC 3263.
The use of this feature requires that the Session Ownership subsystem is available, and that each SIP dialog has a Session Ownership record.
Example
Say we have three SIS nodes, with hostnames sis-1.home1.net, sis-2.home1.net and sis-3.home1.net.
Their address records in DNS are:
sis-1.home1.net <ttl> IN A 192.168.10.1
sis-2.home1.net <ttl> IN A 192.168.10.2
sis-3.home1.net <ttl> IN A 192.168.10.3
By configuring the DynamicSRVNameFormat network interface property with value ip-${IP}.home1.net, the SIS nodes will use URIs of the form sip:ip-192-168-10-1.home1.net;lr;transport=tcp.
The operator provisions the corresponding DNS SRV records:
;; SRV address Priority Weight Port Target
_sip._tcp.ip-192-168-10-1.home1.net. <ttl> IN SRV 0 1 5060 sis-1.home1.net.
_sip._tcp.ip-192-168-10-1.home1.net. <ttl> IN SRV 10 1 5060 sis-2.home1.net.
_sip._tcp.ip-192-168-10-1.home1.net. <ttl> IN SRV 10 1 5060 sis-3.home1.net.
_sip._tcp.ip-192-168-10-2.home1.net. <ttl> IN SRV 0 1 5060 sis-2.home1.net.
_sip._tcp.ip-192-168-10-2.home1.net. <ttl> IN SRV 10 1 5060 sis-3.home1.net.
_sip._tcp.ip-192-168-10-2.home1.net. <ttl> IN SRV 10 1 5060 sis-1.home1.net.
_sip._tcp.ip-192-168-10-3.home1.net. <ttl> IN SRV 0 1 5060 sis-3.home1.net.
_sip._tcp.ip-192-168-10-3.home1.net. <ttl> IN SRV 10 1 5060 sis-1.home1.net.
_sip._tcp.ip-192-168-10-3.home1.net. <ttl> IN SRV 10 1 5060 sis-2.home1.net.
Corresponding records would be needed for UDP (_sip._udp) and TLS (_sips._tcp) transports, if used.
For each per-node SRV address, there are three records. The first points to the node’s own host name, and the other two point to the other nodes’ host names. Targets are always tried in priority order, lowest priority value first, by the rules of RFC 3263 and RFC 2782.
If node sis-1 fails, mid-dialog requests routed to sip:ip-192-168-10-1.home1.net will go to either sis-2 or sis-3, assuming that both are available.
This ensures that dialogs are "sticky" to the node that created them and are only sent to other nodes in the presence of a failure.
Mid-dialog requests are automatically routed to one of the other nodes, and through the use of the Session Ownership subsystem are proxied to a single node such that requests for the same dialog are always processed on one node.
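To illustrate how the priority values in the example records produce this behaviour, here is a simplified sketch of target selection. It assumes the client already knows which targets are reachable, whereas a real RFC 3263 client discovers this by trying targets in order. Weighted selection among equal-priority targets (RFC 2782) is left out. The classes are hypothetical and not part of any SIS or DNS library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical value class for one SRV record. */
final class SrvRecordSketch {
    final int priority;
    final int weight;
    final int port;
    final String target;

    SrvRecordSketch(int priority, int weight, int port, String target) {
        this.priority = priority;
        this.weight = weight;
        this.port = port;
        this.target = target;
    }
}

/** Simplified priority-based selection over a set of SRV records. */
final class SrvFailoverSketch {
    private SrvFailoverSketch() {}

    /** Returns the reachable targets that share the lowest priority value. */
    static List<String> candidates(List<SrvRecordSketch> records, Set<String> reachable) {
        // Find the lowest priority value that still has a reachable target.
        int best = Integer.MAX_VALUE;
        for (SrvRecordSketch r : records) {
            if (reachable.contains(r.target) && r.priority < best) {
                best = r.priority;
            }
        }
        // Collect every reachable target at that priority, in record order.
        List<String> result = new ArrayList<>();
        for (SrvRecordSketch r : records) {
            if (reachable.contains(r.target) && r.priority == best) {
                result.add(r.target);
            }
        }
        return result;
    }
}
```

With the three records for ip-192-168-10-1.home1.net, the only candidate is sis-1 while it is reachable. Once it is not, the candidates become sis-2 and sis-3, and the Session Ownership subsystem then settles which of them processes the dialog.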
Initial requests
Initial requests can be directed to a virtual address, DNS NAPTR or SRV address that distributes traffic over all nodes in the cluster. So for the example cluster above, using a DNS SRV address we can say that all nodes have equal priority and weight:
;; SRV address Priority Weight Port Target
_sip._tcp.sis.home1.net. <ttl> IN SRV 10 1 5060 sis-1.home1.net.
_sip._tcp.sis.home1.net. <ttl> IN SRV 10 1 5060 sis-2.home1.net.
_sip._tcp.sis.home1.net. <ttl> IN SRV 10 1 5060 sis-3.home1.net.
When a SIP client sends an initial request to the URI sip:sis.home1.net;transport=tcp, it will automatically select one of the three nodes at random as the destination for that request.
EasySIP API changes
Several enhancements were made to the EasySIP API to support SIP session replication.
Encoding and decoding of SIP messages
SipMessage objects may now be easily encoded to byte streams and back.
This makes it easy to store messages in SBB CMP fields, for example.
See SipFactory.encodeMessage() and SipFactory.decodeMessage().
Replicate sessions on demand
The method SipSession.startReplicating() requests that the session be replicated if possible.
It is only meaningful if the replicateByDefault boolean configuration property is false.
The method may be called at any point in the session’s lifetime. The session will not actually be replicated until it reaches the "Confirmed" state. If the session is already in the "Confirmed" state when the method is invoked, replication begins immediately. Invoking the method more than once has no additional effect.
Obtain dialog ID for session
The SipSession.getDialogID() method gets the session’s DialogID, consisting of Call-ID, local and remote tags.
The string form of this dialog ID may be used as the tracking key in the Session Ownership facility.
Internal Application Server URI
Each SIS instance has its own unique URI, referred to as the internal application server URI (AS URI). This is a SIP URI used by the SIS for its own communication between SIS nodes. Currently this is used when Session Ownership determines that a mid-dialog request must be handled by another node.
Applications must use the internal AS URI as the "owner URI" for a session ownership record.
The SipFactory methods SipFactory.getInternalASURI() and SipFactory.isInternalASURI() are used to obtain and check these URIs.