SIP Resource Adaptor 3.1 :: SIP Resource Adaptor Guide :: RFC 3263: Failover and Load Balancing

RFC 3263 — Locating SIP Servers specifies this feature, that lets a SIP client use DNS procedures to resolve a SIP URI into the address, port, and transport protocol of the next server to contact

In addition, these DNS procedures can result in multiple addresses that clients can try sequentially if earlier addresses failed.

The SIP RA now supports RFC 3263 to provide failover and load balancing of outgoing requests. Previously the SIP RA used RFC 3263 only for determining the first address to contact; it did not automatically try backup addresses if the first address failed. Now the SIP RA has been updated to automatically failover to backup addresses as well, with no intervention required from the application.

On this page...

About the RFC 3263 process
SIP RA usage of RFC 3263
Configuring the SIP RA for RFC 3263

About the RFC 3263 process

When a SIP client sends a request, it must select either the first Route header’s URI, if present, or the Request-URI. This URI determines where the request is sent to (the next-hop address). This URI might only contain a domain name, such as sip:bob@example.com. RFC 3263 DNS procedures are required to convert the URI into the address, port, and transport protocol of an actual SIP server (or servers).

At a high level there are three parts to the RFC 3263 DNS process. Some of these may be skipped depending on what information is already given in the URI, such as transport, IP address, or port numbers. Here is a simplified description of the process that applies to any SIP client using RFC 3263.

Determine the transport protocol. If not already specified in the URI, the client will do a DNS NAPTR lookup on the domain. This may return some NAPTR records which specify in order of preference the transport protocols that shall be used. The NAPTR records will contain SRV addresses for each supported transport protocol.
Next, determine the port. This can be found by looking up the SRV address in the NAPTR record. Or, if there were no NAPTR records, the SIP client will try the default SRV addresses for its preferred transport protocols, such as _sip._tcp.example.com. The SRV query may return one or more SRV records. Each record contains the hostname and port of a SIP server. Multiple SRV records are sorted according to their priority and weight, and ordered randomly as per RFC 2782. This means that the results will be ordered slightly differently every time, providing a form of load balancing.
Finally the hostnames provided in SRV records must be looked up to obtain their IP addresses. Sometimes this information is already included in the SRV response. If there were no SRV records, the SIP client will default to looking up the IP address of the hostname in the URI.

At the end of this process the SIP client has a list of (address, port, transport) tuples to try. If the list is empty then the request cannot be routed and will fail. Otherwise, the client picks the first address in the list and starts a client transaction. The next backup address will be tried if the transaction fails for any of these reasons:

the server responds with 503 (Service Unavailable)
the transaction times out
the transaction fails with a transport error.

When such a failure occurs, the client selects the next address and tries again with a new client transaction. If all addresses failed, then the client must fail the request.

SIP RA usage of RFC 3263

The SIP RA closely follows the RFC 3263 procedures but with some adaptations, described below.

Use of client transactions

RFC 3263 states that a new client transaction is used for each backup address. The SIP RA does this; however these additional client transactions are hidden from the application. The application just creates a single JAIN SIP ClientTransaction to send a request, as before. If a failure occurs and there are backup addresses available, the RA automatically creates a new client transaction and sends the request to the backup address. The RA takes care of routing the responses and other events up to the application’s ClientTransaction activity so that it looks like a single client transaction was used.

Each new client transaction created for contacting backup servers will have a new Via branch ID (as per the RFC), but these are derived from the original branch ID with a different suffix for each new transaction. For example if the original branch was z9hG4bK776asdhds, any subsequent transactions will have branches z9hG4bK776asdhds%1, z9hG4bK776asdhds%2 so on. In this way the branches for each transaction are still unique in the RA and the network, but can be correlated when looking at logs or packet captures. The application will only see the original branch ID in responses.

The first final response to be received will end the transaction, and no more backup addresses will be tried. If all of the available addresses failed, the application will see the last error response received from a server (a 503 or 408 response), or the RA will generate a 503 response and pass it up to the application.

Failover timer

Failover to a backup server is triggered by transaction timeouts, transport errors, or 503 responses. If a server has failed completely, the resulting transport error (such as ICMP Port Unreachable) may not be directly visible to the RA, especially when using UDP. Or there may be no ICMP rejects at all in the event of a network partition. This means failover will not occur until the transaction timeout (Timers B or F in RFC 3261) occurs, which will usually be 32 seconds!

For many applications it is desirable to try and detect failures a bit faster than this. The SIP RA defines an additional failover timer, which can expire before the default transaction timeout to trigger a failover faster. The default value of this timer is 10 seconds.

The failover timer is only started when there are multiple addresses to try. If there is just a single address, the normal transaction timeout behaviour applies. The timer is stopped as soon as any response (including 100 Trying) is received, as this indicates that the next-hop server is functioning. If this timer expires and no responses were received, the RA moves on to the next address.

Server failure and recovery

By itself, DNS does not tell you if a server is available. When using the DNS process above, the same set of servers will always be returned, regardless of whether they are currently available or not. To avoid contacting servers that are likely to be down, the SIP RA maintains a "block list".

Whenever a server failure is detected, either by a timeout or transport error, the RA adds the failed address to a block list so that the address is not tried again for some period of time. This avoids the RA needlessly contacting servers that are most likely going to be down. By default, failed servers are blocked for 5 minutes. After that time the RA will try using the server again.

503 response handling

Servers that respond with a 503 (Service Unavailable) response are not added to the block list. If a 503 response is received, the SIP RA will then try a number of backup servers. The number of servers to try is either 2, or 10% of the number of candidate servers, whichever is greater. If all retries fail, a 504 (Server Timeout) response is returned to the application. This is intentional, to stop upstream nodes performing additional RFC3263 retries of their own, which they might have done had they received a 503 response.

Last resort attempts

If all non-blocked servers have been tried unsuccessfully, then the SIP RA can be configured to optionally try the highest-priority server in a last-resort attempt to reach somewhere before giving up.

Configuring the SIP RA for RFC 3263

See the RFC3263 configuration properties in SIP RA Features.

Previous page Next page