Diameter RAs (Diameter CCA, Ro, and Gx) using credit-control sessions (RFC 4006) support a fault-tolerant mode. In this mode, session state is replicated among nodes in the cluster or (optionally for Rhino 2.6.2 and higher) written to an external key/value store.
This RA operating mode, paired with a service designed and configured to be replicated, allows service instances and credit-control sessions to continue uninterrupted operation — even if a cluster node fails or network segmentation occurs.
Below are details of configuration and failure-related behaviour for both server and client credit-control sessions, including the credit-control functionality that affects outgoing request routing, request re-transmissions (AVPs controlling failure procedures), and session termination (Tx and Tcc timers).
Credit-control session replication and adoption
For credit-control sessions used by the Diameter CCA, Ro, and Gx resource adaptors, session replication is enabled using the ReplicateActivities RA configuration property.
Using Rhino Replicated Storage
(This is the only replication method available in Rhino versions earlier than 2.6.2.)
To use this method:
- The namespace in which the application is running must be configured to use the DomainedMemoryDatabase.
When session replication is enabled, credit-control session state is stored in the Rhino Replicated Storage facility. If one of the nodes within a cluster fails because of a hardware, software, or network error, credit-control sessions served by the failed node are "adopted" by another node within the cluster.
Using Key/Value Store
(This replication method is only available in Rhino versions 2.6.2 and higher)
AVPs controlling session adoption
Credit-control sessions use additional AVPs to control request re-transmissions and failover behaviour. Credit-control sessions store and expose current values of these AVPs to services. Initial values are based on RA configuration, and are later updated based on messages exchanged in the session.
Here are the RFC 4006 AVPs for controlling failure:
AVP | What it's for |
---|---|
CC-Session-Failover | Tells whether moving the credit-control message stream to a backup server during an ongoing credit-control session is supported. |
Credit-Control-Failure-Handling | Used during session-based credit-control. The credit-control client uses information in this AVP to decide what to do if sending credit-control messages to the credit-control server has been, for instance, temporarily prevented due to a network problem. |
Direct-Debiting-Failure-Handling | Used during one-time event credit-control. |
Session adoption of credit-control sessions is based on the CC-Session-Failover AVP value and session type (client or server):
- If CC-Session-Failover is equal to FAILOVER_SUPPORTED, the session is always adopted.
- If CC-Session-Failover is NOT equal to FAILOVER_SUPPORTED:
  - Server sessions are not adopted (that is, they are terminated).
  - Client sessions are only adopted if there is only a single peer where subsequent requests can be routed. That is, either the destination host for the session is defined, or realm addressing is used and there is only a single peer in the realm or default route configured.
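The adoption rules above can be summarised as a small decision function. The following is a minimal sketch only: the types SessionType and AdoptionRules and the method isAdopted are hypothetical names chosen for illustration and are not part of the Diameter RA API.

```java
// Illustrative sketch only: these types are hypothetical, not Diameter RA classes.
enum SessionType { CLIENT, SERVER }

final class AdoptionRules {
    private AdoptionRules() {}

    /**
     * @param failoverSupported  true when CC-Session-Failover equals FAILOVER_SUPPORTED
     * @param type               whether this is a client or a server session
     * @param singleRoutablePeer true when subsequent requests can be routed to exactly
     *                           one peer (destination host set, or a single peer in the
     *                           realm or default route)
     * @return true if the session is adopted, false if it is terminated
     */
    static boolean isAdopted(boolean failoverSupported, SessionType type, boolean singleRoutablePeer) {
        if (failoverSupported) {
            return true; // always adopted
        }
        if (type == SessionType.SERVER) {
            return false; // server sessions are terminated instead of adopted
        }
        return singleRoutablePeer; // client sessions need a single routable peer
    }
}
```

For example, a client session without failover support is adopted only when singleRoutablePeer is true, whereas a server session without failover support is never adopted.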
Configuring failure and failover AVPs
The initial values of CC-Session-Failover, Credit-Control-Failure-Handling, and Direct-Debiting-Failure-Handling for each credit-control session come from user-provided defaults. For server sessions, those values are configured using Diameter RA Configuration Properties. Client-side configuration of failure and failover related AVPs is done per realm. The Diameter realm DTD version 1.1 defines an additional element, extension-property, that can be specified for the realm. For example:
<realm>
<realm-name>opencloud</realm-name>
<extension-property>
<extension-property-name>CCSessionFailover</extension-property-name>
<extension-property-type>java.lang.String</extension-property-type>
<extension-property-value>FAILOVER_NOT_SUPPORTED</extension-property-value>
</extension-property>
<extension-property>
...
</extension-property>
<application-route>
...
</application-route>
</realm>
Diameter CCA, Ro, and Gx resource adaptors interrogate realm configuration in search of the following extension properties:
Extension property | Values | Default |
---|---|---|
|
Optional string parameter. If provided, must be one of:
|
(otherwise |
|
Optional string parameter. If provided, must be one of:
|
|
|
Optional string parameter. If provided, must be one of:
|
Whenever a session has a specific failure or failover AVP value defined (possibly based on its configuration or a previously received credit-control answer), the RA will add those AVPs to its outgoing credit-control answers (if those answers do not already contain such an AVP).
On the other hand, if a credit-control answer contains a value for a failure or failover AVP, that value is used to update the session-stored value.
Session failure and failover AVP values are initialized to configuration-defined values ONLY if the destination realm is known when creating the session. That is, the session must be created by a call to the RA provider createClientSessionActivity method, with a non-null destination realm parameter value.
Session adoption state changes
Session adoption is supported for sessions in specific states. Session state itself may be changed due to session adoption.
The following table describes credit-control state and session adoption interdependence:
Credit-control session state | Adoption allowed? | State after adoption |
---|---|---|
Idle | No | Terminated |
Open | Yes | Open |
PendingEvent | No | Terminated |
PendingInitial | No | Terminated |
PendingUpdate | Yes | Open |
PendingTerminate | Yes | Open |
Terminated | No | Terminated |
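The table can also be read as a lookup from the state before adoption to the state after it. The sketch below encodes exactly the rows above; the enum CcState and the class AdoptionStateTable are hypothetical names for illustration and do not correspond to RA types.

```java
// Illustrative sketch only: hypothetical types mirroring the table above.
enum CcState { IDLE, OPEN, PENDING_EVENT, PENDING_INITIAL, PENDING_UPDATE, PENDING_TERMINATE, TERMINATED }

final class AdoptionStateTable {
    private AdoptionStateTable() {}

    /** Adoption is allowed only for Open, PendingUpdate, and PendingTerminate. */
    static boolean adoptionAllowed(CcState state) {
        switch (state) {
            case OPEN:
            case PENDING_UPDATE:
            case PENDING_TERMINATE:
                return true;
            default:
                return false;
        }
    }

    /** Adopted sessions end up Open; all other sessions end up Terminated. */
    static CcState stateAfterAdoption(CcState state) {
        return adoptionAllowed(state) ? CcState.OPEN : CcState.TERMINATED;
    }
}
```

Note that a session adopted in PendingUpdate or PendingTerminate loses its pending status, which is why the service is expected to retry or wait for a re-transmission.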
Credit-control session-specific timers
The credit-control specification, RFC 4006, defines additional timers, Tx and Tcc, used for both client and server credit-control sessions.
Tx timer
The Tx timer controls how long a client waits in the Pending states. When the Tx timer elapses, the credit-control client takes action towards the end user based on the value of the Credit-Control-Failure-Handling or Direct-Debiting-Failure-Handling AVP.
The Tx timer is started when the session enters a PendingInitial, PendingUpdate, or PendingEvent state; that is, after sending any credit-control request other than a termination request.
Whenever a Tx timer expires before a credit-control answer is received, the RA fires a RequestTxTimeout event. A session might also be terminated based on the value of the Credit-Control-Failure-Handling or Direct-Debiting-Failure-Handling AVP.
Here are the states for Tx-timeout-related session termination:
Credit-control session state |
Tx timeout session terminated |
Note |
---|---|---|
PendingEvent |
when the request’s |
The |
PendingInitial, PendingUpdate |
when the session’s |
Other |
Tcc timer
The Tcc timer supervises an ongoing server credit-control session. The Validity-Time AVP is used as input to set the Tcc timer value.
The Tcc timer is set to 2 times the Validity-Time AVP value present in the credit-control answer and/or the Validity-Time AVP defined in the Multiple-Services-Credit-Control AVP.
The Tcc timer is started when a successful credit-control answer is sent in the PendingInitial or PendingUpdate state, and stopped when a subsequent request is received. The Tcc timer expiry is reported to the service by firing a TccTimeout event, and the session is terminated.
Whenever Multiple-Services-Credit-Control AVPs are used for credit-control of multiple services in a single credit-control session, multiple Tcc timers are tracked. An incoming request with Multiple-Services-Credit-Control AVPs for a subset of the services in use does not affect the expiry time for the rest of the currently used services. Regardless of how many services and Tcc timers are tracked, only a single TccTimeout event is fired.
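As a worked example of the Tcc arithmetic, the sketch below computes the timer duration as twice the Validity-Time and tracks one expiry per service. It assumes Validity-Time is expressed in seconds (as in RFC 4006) and uses a hypothetical string key to identify each service; the class TccTimers and its methods are illustrative names, not RA API, and the sketch does not model when the single TccTimeout event is fired.

```java
// Illustrative sketch only: hypothetical bookkeeping, not the RA's implementation.
final class TccTimers {
    // Expiry time (epoch milliseconds) per service; the String key is a made-up service identifier.
    private final java.util.Map<String, Long> expiryByService = new java.util.HashMap<>();

    /** Tcc duration is 2 times the Validity-Time (assumed to be in seconds). */
    static long tccMillis(long validityTimeSeconds) {
        return 2L * validityTimeSeconds * 1000L;
    }

    /** A successful answer was sent for this service: (re)start its Tcc timer. */
    void answerSent(String service, long validityTimeSeconds, long nowMillis) {
        expiryByService.put(service, nowMillis + tccMillis(validityTimeSeconds));
    }

    /** A subsequent request arrived for this service: stop only that service's timer. */
    void requestReceived(String service) {
        expiryByService.remove(service);
    }

    /** Returns the tracked expiry for the service, or null if no timer is running. */
    Long expiryOf(String service) {
        return expiryByService.get(service);
    }
}
```

With a Validity-Time of 30 seconds, the Tcc timer runs for 60 seconds. A request that covers only one service stops that service's timer and leaves the others untouched.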
Outgoing credit-control request routing
Credit-control sessions (as described in RFC 4006 section 5.7) impose additional routing rules for failure scenarios, on top of the algorithm and configuration described in Diameter RA Request Routing Configuration.
Credit-control routing behaviour is influenced by the Tx timer and by the failure and failover AVP values of the session. In addition, after the credit-control specific state has been evaluated, the actual re-transmission takes place only if the routing table configuration allows it; that is, if there is another peer available in the realm and the transport-failover configuration element value is appropriate.
One-time event routing
Routing of requests in the event of failures during one-time event credit-control is based on CC-Session-Failover. Re-transmission of a request to another peer is allowed when the session’s CC-Session-Failover value is FAILOVER_SUPPORTED.
Neither the Tx timeout nor the Direct-Debiting-Failure-Handling AVP is relevant to RA-requested re-transmission. The Tx timer and Direct-Debiting-Failure-Handling AVP do, however, affect expected service behaviour, as per RFC 4006 section 6.5.
Session-based credit control
Routing of requests in the event of failures during session-based credit control involves CC-Session-Failover, the Tx timer, and the session’s Credit-Control-Failure-Handling AVP value. Re-transmission to another peer in the realm is allowed when the session’s CC-Session-Failover value is FAILOVER_SUPPORTED.
Here are the re-transmission decisions in the event of transport failure, answers with result codes DIAMETER_TOO_BUSY or DIAMETER_UNABLE_TO_DELIVER, or request timeout:
Credit-Control-Failure-Handling AVP | Tx expired | Tx not expired |
---|---|---|
TERMINATE | report request timeout | attempt re-transmission |
CONTINUE, RETRY_AND_TERMINATE | attempt re-transmission | attempt re-transmission |
As the Tx timer is not started for TERMINATION_REQUEST, re-transmission is always attempted for termination requests.
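The decision table for session-based credit control can be expressed as a single function. This is a minimal sketch that assumes CC-Session-Failover is FAILOVER_SUPPORTED and that the routing table configuration permits re-transmission; the enums and the class SessionRetransmission are hypothetical names, not RA API.

```java
// Illustrative sketch only: hypothetical types mirroring the decision table above.
enum FailureHandling { TERMINATE, CONTINUE, RETRY_AND_TERMINATE }

enum RoutingAction { ATTEMPT_RETRANSMISSION, REPORT_REQUEST_TIMEOUT }

final class SessionRetransmission {
    private SessionRetransmission() {}

    /**
     * @param ccfh      the session's Credit-Control-Failure-Handling value
     * @param txExpired whether the Tx timer has expired; always false for a
     *                  TERMINATION_REQUEST because the Tx timer is never started for it
     */
    static RoutingAction decide(FailureHandling ccfh, boolean txExpired) {
        if (ccfh == FailureHandling.TERMINATE && txExpired) {
            return RoutingAction.REPORT_REQUEST_TIMEOUT;
        }
        return RoutingAction.ATTEMPT_RETRANSMISSION;
    }
}
```

The only combination that does not lead to a re-transmission attempt is TERMINATE with an expired Tx timer.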
Replicated credit-control service responsibilities
To provide a fault-tolerant service, both the RA and the deployed JAIN SLEE service must cooperate. This section describes what functionality is expected from a JAIN SLEE service to achieve a fault-tolerant solution when paired with a Diameter RA with session replication enabled.
At a high level, a Diameter RA operating with session replication enabled assures that: services that receive Diameter requests will continue to receive requests on the same session after a node failure (for the server); and services that send Diameter requests can continue to send requests on the same session after a node failure (for the client). This works through the RA replicating session state changes between cluster nodes, along with appropriate Diameter RA request-routing configuration.
The service is responsible for:
- duplicate detection: The Diameter RA can be configured to allow service handling of duplicate requests. (For example, if the service only supports mediation scenarios, it could forward the duplicates for handling by the upstream server.)
- handling failover for pending states: Such sessions are either terminated or their state is changed to OPEN during session adoption. Services are expected to:
  - use a guard timer and retry sending (for UPDATE_REQUEST or TERMINATION_REQUEST), or wait for re-transmission of the request from the client,
  - create a new session (for INITIAL_REQUEST).
- using realm-based request addressing.
Tips and notes
Below are general tips and notes about configuration and operation of the Rhino cluster, Diameter RA, and JAIN SLEE services with replication enabled.
Staging queue configuration
Rhino guards the state of replicated services and Diameter sessions using a distributed lock.
For the lock to be acquired on a node other than the node that previously owned the lock, one of two events must occur:
|
This is of little importance during normal operation, as Diameter sessions should be treated as sticky and all messages in the same session should be handled by the same node (no lock ownership change). It is critical, however, in a node-failure scenario: before a message for a session served by the failed node can be processed on another node in the same cluster, the distributed locks guarding the service and Diameter session state must be acquired.
As the failed node is no longer there to acknowledge the lock ownership change, all Diameter message processing for sessions served by the failed node must wait until the failed node is no longer part of the primary component; that is, until the node failure is detected and the node is considered dead.
The same is true when a node creates and tries to acquire a new lock, as the new lock must also be acknowledged by the other nodes in the primary component. Hence, in a failure scenario, acquisition of a new lock must wait for the failed node to leave the primary component.
During the time between the node failure and the node being considered dead by the remaining nodes in the cluster, the staging queue grows because staging threads must wait for the acquisition of distributed locks (when processing events on activities served by the failed node, or receiving requests for new sessions). If the staging queue is too small to hold all the events until the failed node is considered dead, some events will be dropped from the staging queue (resulting in possible re-transmissions or outright failures).
To ensure that the staging queue configuration can accommodate incoming Diameter traffic until a failed node is considered dead, the administrator should adjust:
- the maximumAge and maximumSize staging queue parameters
- the node_abort_interval parameter (time without response before considering another node dead) in the Savanna config.properties (RHINO_HOME/etc/defaults/config/savanna and/or RHINO_HOME/node-xxx/config/savanna).
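A rough way to reason about these settings is to estimate how many events arrive while the cluster is still waiting to declare the failed node dead. The sketch below is a back-of-envelope estimate under stated assumptions, not a sizing rule from the Rhino documentation: it assumes a steady event rate, assumes that every incoming event has to wait in the staging queue during the detection window, and takes the detection window in milliseconds. The class and method names are hypothetical.

```java
// Illustrative sketch only: a back-of-envelope estimate, not an official sizing formula.
final class StagingQueueEstimate {
    private StagingQueueEstimate() {}

    /**
     * Estimates the number of events that accumulate during failure detection.
     *
     * @param eventsPerSecond        assumed steady incoming event rate
     * @param failureDetectionMillis assumed time until the failed node is considered dead
     * @return the event rate multiplied by the detection window, rounded up
     */
    static long minimumQueueSize(long eventsPerSecond, long failureDetectionMillis) {
        return (eventsPerSecond * failureDetectionMillis + 999L) / 1000L;
    }
}
```

For instance, at an assumed 500 events per second and an 8-second detection window, about 4000 events would need to fit in the queue. Under the same assumptions, the maximum event age would also need to exceed the detection window so that queued events are not dropped for being too old.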
Adoption of sessions in PENDING states
The state of credit-control sessions in the PendingUpdate and PendingTerminate states is changed to OPEN during adoption.
The current version of the Diameter RA does not handle the case when a client re-transmits the request on a session handled by a failed node before that session is adopted on another node, resulting in possible failure for that particular session.
Duplicate detection
The default RA configuration prevents services from handling some credit-control message duplicates by strictly validating the credit-control session state against the allowed credit-control message types (CC-Request-Type AVP values EVENT_REQUEST, INITIAL_REQUEST, UPDATE_REQUEST, and TERMINATION_REQUEST). Services receive duplicate credit-control requests only when the CC-Request-Type AVP is equal to UPDATE_REQUEST.
The InitialTerminateDuplicateHandling and LingerOnTerminate RA configuration properties can be used to cede the responsibility for detection and handling of duplicate credit-control requests to the service.
For details, see Diameter CCA, Ro, and Gx Configuration Properties.
Default RA duplicate handling
If the RA configuration is not customized, the default RA duplicate request handling depends on the value of the CC-Request-Type AVP present in the credit-control request:
- EVENT_REQUEST or INITIAL_REQUEST: the Diameter RA detects that an EVENT_REQUEST or INITIAL_REQUEST request was already received and internally generates an error response. In the case of INITIAL_REQUEST, the session is also terminated.
- UPDATE_REQUEST: the Diameter RA applies no special handling to the duplicated request; the request is fired on the existing Diameter activity.
- TERMINATION_REQUEST: the Diameter RA detects that the session was already terminated (the old session does not exist, so a new session is created that receives an unexpected TERMINATION_REQUEST as its first request), and it generates an error condition.
Service duplicate handling
To enable service handling of duplicated session credit-control requests, the following RA configuration properties must be customized:
- InitialTerminateDuplicateHandling: must be set to the value SERVICE
- LingerOnTerminate: must be set to a value greater than or equal to 1000 (the unit is milliseconds).
Under such a configuration, the RA fires ALL session credit-control requests (including duplicates) on the Diameter session.
The handling of duplicated event credit-control requests is not affected. Duplicated credit-control requests with CC-Request-Type equal to EVENT_REQUEST are still handled automatically by the RA, which internally generates an error response.
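The two duplicate-handling modes can be compared side by side with a small lookup. The sketch below only restates the rules described in this section; the enums, the class DuplicateHandling, and its methods are hypothetical names for illustration, and the two parameters of serviceHandlingEnabled stand for the values of the InitialTerminateDuplicateHandling and LingerOnTerminate configuration properties.

```java
// Illustrative sketch only: hypothetical types summarising the rules in this section.
enum RequestType { EVENT_REQUEST, INITIAL_REQUEST, UPDATE_REQUEST, TERMINATION_REQUEST }

enum DuplicateOutcome { RA_GENERATES_ERROR, FIRED_TO_SERVICE }

final class DuplicateHandling {
    private DuplicateHandling() {}

    /** Service handling needs InitialTerminateDuplicateHandling=SERVICE and LingerOnTerminate >= 1000 ms. */
    static boolean serviceHandlingEnabled(String initialTerminateDuplicateHandling, long lingerOnTerminateMillis) {
        return "SERVICE".equals(initialTerminateDuplicateHandling) && lingerOnTerminateMillis >= 1000L;
    }

    /** Who deals with a duplicate request of the given type. */
    static DuplicateOutcome outcomeFor(RequestType type, boolean serviceHandling) {
        if (type == RequestType.EVENT_REQUEST) {
            return DuplicateOutcome.RA_GENERATES_ERROR; // never affected by the configuration
        }
        if (type == RequestType.UPDATE_REQUEST) {
            return DuplicateOutcome.FIRED_TO_SERVICE; // always fired on the existing activity
        }
        // INITIAL_REQUEST and TERMINATION_REQUEST depend on the configuration.
        return serviceHandling ? DuplicateOutcome.FIRED_TO_SERVICE : DuplicateOutcome.RA_GENERATES_ERROR;
    }
}
```

Only INITIAL_REQUEST and TERMINATION_REQUEST duplicates change hands when the configuration is customized.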