Diameter RAs (Diameter CCA, Ro, and Gx) using credit-control sessions (RFC 4006) support a fault-tolerant mode. In this mode, session state is replicated among nodes in the cluster.

This RA operating mode, paired with a service designed and configured to be replicated, allows service instances and credit-control sessions to continue uninterrupted operation — even if a cluster node fails or network segmentation occurs.

The sections below describe configuration and failure-related behaviour for both server and client credit-control sessions, including the credit-control functionality that affects outgoing request routing, request re-transmissions (AVPs controlling failure procedures), and session termination (Tx and Tcc timers).

Credit-control session replication and adoption

For credit-control sessions used by the Diameter CCA, Ro, and Gx resource adaptors, session replication is enabled using the ReplicateActivities RA configuration property.

When session replication is enabled, credit-control session state is stored in the Rhino Replicated Storage facility. If one of the nodes within a cluster fails after any hardware, software, or network error, credit-control sessions served by the failing node are "adopted" by another node within the cluster.

AVPs controlling session adoption

Credit-control sessions use additional AVPs to control request re-transmissions and failover behaviour. Credit-control sessions store and expose current values of these AVPs to services. Initial values are based on RA configuration, and are later updated based on messages exchanged in the session.

Here are the RFC 4006 AVPs for controlling failure:

AVP What it’s for

CC-Session-Failover

Tells whether moving the credit-control message stream to a backup server during an ongoing credit-control session is supported.

Credit-Control-Failure-Handling

Used during session-based credit-control. The credit-control client uses information in this AVP to decide what to do if sending credit-control messages to the credit-control server has been, for instance, temporarily prevented due to a network problem.

Direct-Debiting-Failure-Handling

Used during one-time event credit-control (Requested-Action AVP set to DIRECT_DEBITING). The credit-control client uses information in this AVP to decide what to do if sending credit-control messages to the credit-control server has been, for instance, temporarily prevented due to a network problem.

Session adoption of credit-control sessions is based on the CC-Session-Failover AVP value and session type (client or server):

  • If CC-Session-Failover is equal to FAILOVER_SUPPORTED, the session is always adopted.

  • If CC-Session-Failover is NOT equal to FAILOVER_SUPPORTED:

    • Server sessions are not adopted (that is, they are terminated).

    • Client sessions are adopted only if there is a single peer to which subsequent requests can be routed. That is, either the destination host for the session is defined, or realm addressing is used and there is only a single peer in the realm or default route configured.
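
The adoption rule above can be sketched as a small decision function. This is only an illustration of the rule as stated; the class, enum, and parameter names are hypothetical and are not part of the resource adaptor API.

```java
// Sketch of the session adoption rule. All names are hypothetical.
final class AdoptionRuleSketch {

    enum SessionType { CLIENT, SERVER }

    enum CcSessionFailover { FAILOVER_NOT_SUPPORTED, FAILOVER_SUPPORTED }

    /**
     * @param type              whether the session is a client or server session
     * @param failover          value stored on the session; null (undefined) is
     *                          treated like FAILOVER_NOT_SUPPORTED
     * @param singleRoutePeer   true if subsequent requests can only be routed to
     *                          one peer (destination host defined, or a single
     *                          peer in the realm or default route)
     * @return true if the session is adopted, false if it is terminated
     */
    static boolean isAdopted(SessionType type, CcSessionFailover failover,
                             boolean singleRoutePeer) {
        if (failover == CcSessionFailover.FAILOVER_SUPPORTED) {
            return true; // always adopted
        }
        if (type == SessionType.SERVER) {
            return false; // server sessions are terminated
        }
        return singleRoutePeer; // client sessions need an unambiguous peer
    }
}
```

For example, `isAdopted(SessionType.CLIENT, CcSessionFailover.FAILOVER_NOT_SUPPORTED, true)` returns true, while the same call for a server session returns false.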

Configuring failure and failover AVPs

The initial values of CC-Session-Failover, Credit-Control-Failure-Handling, and Direct-Debiting-Failure-Handling for each credit-control session come from user-provided defaults. For server sessions, those values are configured using Diameter RA Configuration Properties. Client-side configuration of failure and failover related AVPs is done per realm. The diameter realm DTD version 1.1 defines an additional element extension-property that can be specified for the realm. For example:

<realm>
    <realm-name>opencloud</realm-name>
    <extension-property>
        <extension-property-name>CCSessionFailover</extension-property-name>
        <extension-property-type>java.lang.String</extension-property-type>
        <extension-property-value>FAILOVER_NOT_SUPPORTED</extension-property-value>
    </extension-property>
    <extension-property>
        ...
    </extension-property>
    <application-route>
        ...
    </application-route>
</realm>

Diameter CCA, Ro, and Gx resource adaptors interrogate realm configuration in search of the following extension properties:

Extension property Values Default

CCSessionFailover

optional string parameter

If provided, must be one of:

  • FAILOVER_NOT_SUPPORTED

  • FAILOVER_SUPPORTED

Default: FAILOVER_SUPPORTED if the application-route with application id equal to 4 has a transport-failover element with value ALWAYS; otherwise FAILOVER_NOT_SUPPORTED (or undefined, which is equivalent).

CreditControlFailureHandling

optional string parameter

If provided, must be one of:

  • TERMINATE

  • CONTINUE

  • RETRY_AND_TERMINATE

DirectDebitingFailureHandling

optional string parameter

If provided, must be one of:

  • TERMINATE_OR_BUFFER

  • CONTINUE
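
As an illustration of how the CCSessionFailover default could be resolved for a realm, here is a minimal sketch. It assumes that an explicitly configured extension property takes precedence over the value derived from the transport-failover element; the class and method names are hypothetical, and rejecting unknown values is a choice made by the sketch rather than documented RA behaviour.

```java
// Hypothetical helper that resolves the initial CC-Session-Failover value.
final class FailoverDefaultSketch {

    static final String SUPPORTED = "FAILOVER_SUPPORTED";
    static final String NOT_SUPPORTED = "FAILOVER_NOT_SUPPORTED";

    /**
     * @param extensionValue        CCSessionFailover extension property value,
     *                              or null if the property is absent
     * @param app4TransportFailover transport-failover value of the
     *                              application-route with application id 4,
     *                              or null if there is none
     */
    static String resolve(String extensionValue, String app4TransportFailover) {
        if (extensionValue != null) {
            boolean known = SUPPORTED.equals(extensionValue)
                    || NOT_SUPPORTED.equals(extensionValue);
            if (!known) {
                // Assumption: the sketch rejects values outside the allowed set.
                throw new IllegalArgumentException("Unexpected value: " + extensionValue);
            }
            return extensionValue;
        }
        // No explicit property: derive the default from the route configuration.
        return "ALWAYS".equals(app4TransportFailover) ? SUPPORTED : NOT_SUPPORTED;
    }
}
```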

Whenever a session has a specific failure or failover AVP value defined (possibly based on its configuration or a previously received credit-control answer), the RA will add those AVPs to its outgoing credit-control answers (if those answers do not already contain such an AVP).

On the other hand, if a credit-control answer contains a value for a failure or failover AVP, that value is used to update the session-stored value.

Warning: Session failure and failover AVP values are initialized to configuration-defined values ONLY if the destination realm is known when creating the session. That is, the session must be created by a call to the RA provider createClientSessionActivity method, with a non-null destination realm parameter value.

Session adoption state changes

Session adoption is supported for sessions in specific states. Session state itself may be changed due to session adoption.

The following table describes credit-control state and session adoption interdependence:

Credit-control session state | Adoption allowed? | State after adoption
Idle | No | Terminated
Open | Yes | Open
PendingEvent | No | Terminated
PendingInitial | No | Terminated
PendingUpdate | Yes | Open
PendingTerminate | Yes | Open
Terminated | No | Terminated
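
The table can also be read as a simple mapping from the state before adoption to the state after adoption. The sketch below encodes it that way; the enum and method names are hypothetical and do not reflect the RA's own types.

```java
import java.util.EnumSet;
import java.util.Set;

// Sketch of the adoption state table. All names are hypothetical.
final class AdoptionStateSketch {

    enum State {
        IDLE, OPEN, PENDING_EVENT, PENDING_INITIAL,
        PENDING_UPDATE, PENDING_TERMINATE, TERMINATED
    }

    // States in which a session can be adopted by another node.
    private static final Set<State> ADOPTABLE =
            EnumSet.of(State.OPEN, State.PENDING_UPDATE, State.PENDING_TERMINATE);

    static boolean adoptionAllowed(State state) {
        return ADOPTABLE.contains(state);
    }

    // Adopted sessions end up Open; all other sessions end up Terminated.
    static State stateAfterAdoption(State state) {
        return adoptionAllowed(state) ? State.OPEN : State.TERMINATED;
    }
}
```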

Credit-control session-specific timers

The credit-control specification (RFC 4006) defines additional timers, Tx and Tcc, used for both client and server credit-control sessions.

Tx timer

The Tx timer is introduced to control the waiting time for clients in Pending states. When the Tx timer elapses, the credit-control client takes action towards the end user based on the value of the Credit-Control-Failure-Handling or Direct-Debiting-Failure-Handling AVP.

The Tx timer is started when the session enters a PendingInitial, PendingUpdate, or PendingEvent state; that is, after sending any credit-control request other than a termination request.

Whenever a Tx timer expires before a credit-control answer is received, the RA fires a RequestTxTimeout event. A session might also be terminated based on the value of the Credit-Control-Failure-Handling or Direct-Debiting-Failure-Handling AVP.

Here are the states for Tx-timeout-related session termination:

Credit-control session state | Session terminated on Tx timeout | Note
PendingEvent | when the request’s Requested-Action AVP is NOT DIRECT_DEBITING | The Direct-Debiting-Failure-Handling AVP value can change the expected behaviour of a service on a Tx timeout.
PendingInitial, PendingUpdate | when the session’s Credit-Control-Failure-Handling AVP value is TERMINATE | Other Credit-Control-Failure-Handling AVP values can change the expected behaviour of a service on a Tx timeout.
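
A minimal sketch of the termination rule described in the table follows. It models only whether the session is terminated on a Tx timeout, not what the service does afterwards; all names are hypothetical.

```java
// Sketch of Tx-timeout-related session termination. All names are hypothetical.
final class TxTimeoutSketch {

    enum PendingState { PENDING_EVENT, PENDING_INITIAL, PENDING_UPDATE }

    enum Ccfh { TERMINATE, CONTINUE, RETRY_AND_TERMINATE }

    /**
     * @param state          pending state the session was in when Tx expired
     * @param directDebiting true if the event request's Requested-Action AVP
     *                       is DIRECT_DEBITING (only used for PENDING_EVENT)
     * @param ccfh           the session's Credit-Control-Failure-Handling value
     *                       (only used for PENDING_INITIAL and PENDING_UPDATE)
     */
    static boolean terminatedOnTxTimeout(PendingState state, boolean directDebiting,
                                         Ccfh ccfh) {
        switch (state) {
            case PENDING_EVENT:
                return !directDebiting;
            case PENDING_INITIAL:
            case PENDING_UPDATE:
                return ccfh == Ccfh.TERMINATE;
            default:
                throw new IllegalArgumentException("Unexpected state: " + state);
        }
    }
}
```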

Tcc timer

The Tcc timer supervises an ongoing server credit-control session. The Validity-Time AVP is used as input to set the Tcc timer value.

The Tcc timer is set to 2 times the Validity-Time AVP value present in the credit-control answer and/or the Validity-Time AVP defined in the Multiple-Services-Credit-Control AVP.

The Tcc timer is started when a successful credit-control answer is sent in the PendingInitial or PendingUpdate state, and stopped when a subsequent request is received. Expiry of the Tcc timer is reported to the service by firing a TccTimeout event, and the session is terminated.

Note: Whenever Multiple-Services-Credit-Control AVPs are used for credit-control of multiple services in a single credit-control session, multiple Tcc timers are tracked. An incoming request with Multiple-Services-Credit-Control AVPs for a subset of services in use does not impact expiry time for the rest of the currently used services. Independently of how many services and Tcc timers are tracked, only a single TccTimeout event is fired.
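
The Tcc arithmetic and the per-service tracking can be illustrated with a small sketch. It assumes Validity-Time is expressed in seconds (as in RFC 4006) and uses a hypothetical numeric key to identify each service; how the RA identifies services internally, and exactly when the single TccTimeout event is fired, are not modelled here.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of per-service Tcc timer bookkeeping. All names are hypothetical.
final class TccTimerSketch {

    // Expiry time (epoch milliseconds) per hypothetical service key.
    private final Map<Long, Long> expiryMillisByService = new HashMap<>();

    // Tcc is twice the Validity-Time value, converted here to milliseconds.
    static long tccMillis(long validityTimeSeconds) {
        return 2L * validityTimeSeconds * 1000L;
    }

    // Called when a successful answer carrying Validity-Time is sent.
    void answerSent(long serviceKey, long validityTimeSeconds, long nowMillis) {
        expiryMillisByService.put(serviceKey, nowMillis + tccMillis(validityTimeSeconds));
    }

    // A request stops the timers of the services it covers only;
    // timers of the other services in use keep their expiry time.
    void requestReceived(long... serviceKeys) {
        for (long key : serviceKeys) {
            expiryMillisByService.remove(key);
        }
    }

    // Returns the expiry time for a service, or null if no timer is running.
    Long expiryMillis(long serviceKey) {
        return expiryMillisByService.get(serviceKey);
    }
}
```

With a Validity-Time of 30 seconds, `tccMillis(30)` yields 60000, that is, a 60-second Tcc timer.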

Outgoing credit-control request routing

Credit-control sessions (as described in RFC 4006 section 5.7) impose additional routing rules for failure scenarios, on top of the algorithm and configuration described in Diameter RA Request Routing Configuration.

Note

Credit-control routing behaviour — influenced by the Tx Timer, CC-Session-Failover, and the Credit-Control-Failure-Handling AVP — is only applicable when using realm-based addressing.

Also, after evaluating the credit-control specific state, the actual re-transmission takes place only if the routing table configuration allows it; that is, if there is another peer available in the realm and the transport-failover configuration element value is appropriate.

One-time event routing

Routing of requests in the event of failures during one-time event credit-control is based on CC-Session-Failover. Re-transmission of a request to another peer is allowed when the session’s CC-Session-Failover value is FAILOVER_SUPPORTED.

Neither Tx timeout nor the Direct-Debiting-Failure-Handling AVP are relevant to RA-requested re-transmission. The Tx Timer and Direct-Debiting-Failure-Handling AVP do however impact expected service behaviour, as per RFC 4006 section 6.5.

Session-based credit control

Routing of requests in the event of failures during session-based credit control involves CC-Session-Failover, Tx Timer, and the session’s Credit-Control-Failure-Handling AVP value. Re-transmission to another peer in the realm is allowed when the session’s CC-Session-Failover value is FAILOVER_SUPPORTED.

Here are the re-transmission decisions in the event of a transport failure, an answer with result code DIAMETER_TOO_BUSY or DIAMETER_UNABLE_TO_DELIVER, or a request timeout:

Credit-Control-Failure-Handling AVP | Tx expired | Tx not expired
TERMINATE | report request timeout (for INITIAL_REQUEST, UPDATE_REQUEST) | attempt re-transmission
CONTINUE, RETRY_AND_TERMINATE | attempt re-transmission | attempt re-transmission

Note: As the Tx timer is not started for TERMINATION_REQUEST, re-transmission is always attempted.
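
The session-based decision can be sketched as follows. The function covers only the credit-control-specific part of the decision; as noted earlier, an actual re-transmission also requires realm-based addressing and a routing table that allows it. All names are hypothetical.

```java
// Sketch of the session-based re-transmission decision. Names are hypothetical.
final class RetransmissionSketch {

    enum Ccfh { TERMINATE, CONTINUE, RETRY_AND_TERMINATE }

    enum RequestType { INITIAL_REQUEST, UPDATE_REQUEST, TERMINATION_REQUEST }

    /**
     * @param failoverSupported true if the session's CC-Session-Failover value
     *                          is FAILOVER_SUPPORTED
     * @param txExpired         true if the Tx timer has already expired
     * @return true to attempt re-transmission to another peer in the realm,
     *         false to report the failure (for example, a request timeout)
     */
    static boolean attemptRetransmission(boolean failoverSupported, Ccfh ccfh,
                                         RequestType type, boolean txExpired) {
        if (!failoverSupported) {
            return false; // moving to another peer is not allowed
        }
        if (type == RequestType.TERMINATION_REQUEST) {
            return true; // Tx is never started for termination requests
        }
        // Only TERMINATE combined with an expired Tx timer stops re-transmission.
        return !(ccfh == Ccfh.TERMINATE && txExpired);
    }
}
```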

Replicated credit-control service responsibilities

To provide a fault-tolerant service, both the RA and the deployed JAIN SLEE service must cooperate. This section describes what functionality is expected from a JAIN SLEE service to achieve a fault-tolerant solution when paired with a Diameter RA with session replication enabled.

At a high level, a Diameter RA operating with session replication enabled assures that: services that receive Diameter requests will continue to receive requests on the same session after a node failure (for the server); and services that send Diameter requests can continue to send requests on the same session after a node failure (for the client). This works through the RA replicating session state changes between cluster nodes, along with appropriate Diameter RA request-routing configuration.

The service is responsible for:

  • duplicate detection: The Diameter RA can be configured to allow service handling of duplicate requests. (For example, if the service only supports mediation scenarios, it could forward the duplicates for handling by an upstream server.)

  • handling failover for pending states: Such sessions are either terminated or their state is changed to OPEN during session adoption. Services are expected to:

    • use a guard timer and retry sending (for UPDATE_REQUEST or TERMINATION_REQUEST), or wait for re-transmission of the request from the client,

    • create a new session (for INITIAL_REQUEST).

  • using realm-based request addressing.

Tips and notes

Tip: Below are general tips and notes about configuration and operation of the Rhino cluster, Diameter RA, and JAIN SLEE services with replication enabled.

Staging queue configuration

Rhino guards the state of replicated services and Diameter sessions using a distributed lock.

Warning

For the lock to be acquired on a node other than the node that previously owned the lock, one of two events must occur:

  • The node previously owning the lock must be informed and acknowledge the lock ownership change.

  • The node previously owning the lock must leave the primary component.

This is of little importance during normal operation, as Diameter sessions should be treated as sticky: all messages in the same session should be handled by the same node, so lock ownership does not change. It is critical, however, in a node-failure scenario. In that case, before a message for a session served by the failed node can be processed on another node in the same cluster, the distributed locks guarding the service and Diameter session state must be acquired.

As the failed node is no longer there to acknowledge the lock ownership change, all Diameter message processing for sessions served by the failed node must wait until that node is no longer part of the primary component; that is, until the node failure is detected and the node is considered dead.

The same is true when a node creates and tries to acquire a new lock, as this fact also must be acknowledged by other nodes in the primary component. Hence, in a failure scenario, acquisition of a new lock must wait for the failed node to leave the primary component.

During the time between node failure and the node being considered dead by the remaining nodes in the cluster, the staging queue size grows because staging threads must wait for the acquisition of distributed locks (when processing events on activities served by the failed node, or receiving requests for new sessions). If the staging queue size is too small to hold all the events until the failed node is considered dead, some events will be dropped from the staging queue (resulting in possible re-transmissions or outright failures).

To counter that behaviour — to ensure that the staging queue configuration is appropriate to accommodate incoming Diameter traffic until a failed node is considered dead — the administrator should adjust:

  • the maximumAge and maximumSize staging queue parameters

  • the node_abort_interval parameter (time without response before considering another node dead) in the Savanna config.properties (RHINO_HOME/etc/defaults/config/savanna and/or RHINO_HOME/node-xxx/config/savanna).
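
As a rough, back-of-the-envelope aid for choosing these values, the sketch below estimates how many events the staging queue must hold while a failed node is still part of the primary component. It assumes the worst case in which every incoming event stalls for the whole failure-detection period, and it assumes that both the detection period and maximumAge are expressed in milliseconds; neither assumption comes from the Rhino documentation, so treat the result as a starting point for load testing rather than a definitive figure. All names are hypothetical.

```java
// Rough sizing sketch for the staging queue. All names and the
// worst-case model are assumptions made for illustration.
final class StagingQueueSizingSketch {

    /**
     * Events that can accumulate while the failed node is not yet considered
     * dead, rounded up.
     *
     * @param eventsPerSecond        expected incoming Diameter event rate on the node
     * @param failureDetectionMillis time until a failed node is considered dead
     */
    static long minimumQueueSize(long eventsPerSecond, long failureDetectionMillis) {
        return (eventsPerSecond * failureDetectionMillis + 999L) / 1000L;
    }

    // Queued events must not age out before the failed node is considered dead.
    static boolean maximumAgeSufficient(long maximumAgeMillis, long failureDetectionMillis) {
        return maximumAgeMillis >= failureDetectionMillis;
    }
}
```

For example, at 500 events per second and an 8-second detection period, `minimumQueueSize(500, 8000)` gives 4000 events.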

Adoption of sessions in PENDING states

The state of credit-control sessions in PendingUpdate and PendingTerminate states is changed to OPEN during adoption.

The current version of the Diameter RA does not handle the case when a client re-transmits the request on a session handled by a failed node before that session is adopted on another node, resulting in possible failure for that particular session.

Duplicate detection

By default, the RA prevents services from handling some duplicate credit-control messages by strictly validating the credit-control session state against the allowed credit-control message types (CC-Request-Type AVP values EVENT_REQUEST, INITIAL_REQUEST, UPDATE_REQUEST, and TERMINATION_REQUEST). Services receive duplicate credit-control requests only when the CC-Request-Type AVP is equal to UPDATE_REQUEST.

The InitialTerminateDuplicateHandling and LingerOnTerminate RA configuration properties can be used to cede the responsibility for detection and handling of duplicate credit-control requests to the service.

Default RA duplicate handling

If the RA configuration is not customized, the default handling of a duplicate request depends on the value of the CC-Request-Type AVP in the credit-control request:

  • EVENT_REQUEST or INITIAL_REQUEST: the Diameter RA detects that the request was already received and internally generates an error response. For INITIAL_REQUEST, the session is also terminated.

  • UPDATE_REQUEST: the Diameter RA applies no special handling; the duplicate request is fired on the existing Diameter activity.

  • TERMINATION_REQUEST: the Diameter RA detects that the session was already terminated (the old session no longer exists, so a new session is created that receives an unexpected TERMINATION_REQUEST as its first request) and generates an error condition.

Service duplicate handling

To enable service handling of duplicated session credit-control requests, the following RA configuration properties must be customized:

  • InitialTerminateDuplicateHandling: must be set to SERVICE

  • LingerOnTerminate: must be set to a value greater than or equal to 1000 (the unit is milliseconds).

With this configuration, the RA fires ALL session credit-control requests (including duplicates) on the Diameter session.

Note: The handling of duplicated event credit-control requests is not affected. Duplicated credit-control requests with CC-Request-Type equal to EVENT_REQUEST are still handled automatically by the RA, which internally generates an error response.
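
To summarize who handles a duplicate request under the default and the customized configuration, here is a minimal sketch. The property names and the SERVICE value come from the text above; the class, enums, and methods are hypothetical and do not represent the RA's API.

```java
// Sketch of duplicate credit-control request handling. Names are hypothetical.
final class DuplicateHandlingSketch {

    enum RequestType { EVENT_REQUEST, INITIAL_REQUEST, UPDATE_REQUEST, TERMINATION_REQUEST }

    enum Handler { RA, SERVICE }

    // Service handling requires both properties to be customized.
    static boolean serviceHandlingEnabled(String initialTerminateDuplicateHandling,
                                          long lingerOnTerminateMillis) {
        return "SERVICE".equals(initialTerminateDuplicateHandling)
                && lingerOnTerminateMillis >= 1000L;
    }

    static Handler duplicateHandler(RequestType type, boolean serviceHandlingEnabled) {
        switch (type) {
            case EVENT_REQUEST:
                return Handler.RA; // always answered by the RA with an error response
            case UPDATE_REQUEST:
                return Handler.SERVICE; // always fired on the existing activity
            default:
                // INITIAL_REQUEST and TERMINATION_REQUEST depend on configuration.
                return serviceHandlingEnabled ? Handler.SERVICE : Handler.RA;
        }
    }
}
```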