Below are troubleshooting steps -- symptoms, diagnostic steps, and workarounds or resolutions -- for Rhino performance issues.
High latency can be observed in services running on the Rhino SLEE if the system is not configured appropriately for the workload or if a resource leak is present. Services should always be performance tested before deployment however the tested load profile may not match the production load so sometimes live diagnosis is required. Always attempt to reproduce performance problems in a test environment before trying solutions on a live system unless the fix is both simple and obvious e.g. a faster disk or more nodes.
Long Garbage Collection pauses in Garbage Collection/console logs
Longer than expected processing and/or response times
Failed or timed out attempts in your application
Firstly, determine whether or not the operating environment is the cause of the problem. For more information on the operating environment refer to Operating Environment Issues
Secondly, determine whether or not heap use is the cause of the problem. This should be considered the most likely cause when garbage collection pauses are long. For more information on heap use refer to sections Java Virtual Machine Heap Issues and Application or Resource Adaptor Heap Issues.
The following steps are to be followed if the previous two stages of diagnosis were unsuccessful.
All three steps use the output from
client/bin/rhino-stats monitoring the Staging Threads parameter set.
The output for the Staging Threads should typically be monitored over at least a 30 minute period in order to ensure that the system characteristics are noticed through different garbage collection cycles.
An example command which connects to a Rhino cluster member on the local host and monitors the Staging Threads parameter set is as follows.
./client/bin/rhino-stats -h localhost -m Staging\ Threads
The number of available threads is in the avail column of the rhino-stats output. Typically in a system which has too few threads configured the number of available threads is close to the maximum number of threads however it may drop to a near zero value for a brief period of time (half a second to several seconds).
The staging queue is used by Rhino to store incoming events when there are no free threads available to process the event. In a well sized system the staging queue is almost always empty. This can be observed by looking at the queueSize column. This should be a low number, less than the number of available threads. If it is not try doubling the number of threads available to Rhino.
Dropped staged items means that queued events have been discarded due to lack of queue size to hold the item. If this value is ever non-zero it indicates that events are not being processed fast enough. This can indicate that the system is configured inappropriately with respect to the number of threads or that a Service or Resource Adaptor in Rhino has an issue. Overload conditions can also cause cause dropped staging items.
Event processing time should be monitored for each event type that makes up the application flows, particularly those flows showing poor performance e.g.
./client/bin/rhino-stats -h localhost -m "Events.ocsipra.[javax.sip.message.Request.INVITE net.java.slee, 1.2]"
The most important statistic in event processing is the SBB Processing Time (SBBT). This tells how long the thread processing took in the service logic and is an indicator of the theoretical upper limit on how many events/s can be processed per thread. High SBB Processing Time statistics indicate either too many staging threads, high CPU load or, more frequently, a service that has slow dependencies e.g. a database query.
The Event Router Time (ERT) statistic shows the Rhino overhead of processing the event including transaction commit time. This is dependent on Rhino overheads, SBB initial-event-selectors and the SBB’s data storage activity. High Event Router Time statistics usually indicate either too many staging threads, high CPU load or a service that stores very large amounts of data (more than 1MB per transaction) in a the replicated MemDB. It may also be a symptom of a Rhino bug.
Event Processing Time (EPT) is the equal to the sum of these and indicates the actual limit on per-thread event processing rate with the currently deployed set of services.
If the diagnostic steps show that there are not enough threads available, try doubling the number of staging threads.
This can be achieved using the
rhino-console command at runtime, or by editing the thread-count attribute of the staging queues element in config/config.xml and then restarting that node.
Note that it is possible to set the number of staging threads too high, and in doing so cause other performance problems. We do not recommend setting the number of staging threads higher than 200 on most systems or approximately 10x the number of CPU cores. Setting it too high will cause scheduler thrashing which will increase latency and reduce throughput.
If the problem persists then this indicates there is a potential issue with the Services, and/or Resource Adaptors installed in Rhino.
In the case of performance problems with services or suspected bugs in Rhino please contact your solution provider for support. If possible the problem should be reproduced in a test environment so additional logging for diagnosis does not impact live traffic.
Services running on Rhino have real-time commitments and latency for those services needs to be at an absolute minimum. “Dropped calls” occur when an activity cannot be completed in an appropriate timeframe. This has many causes. The most common causes are described and diagnostic steps for each is provided.
Longer than expected processing and/or call setup times
Failed or timed out call setup attempts
Timed out or dropped stage items
Rhino logs and/or console output containing exceptions
Resource Adaptors throwing Exceptions
Lock timeout messages in Rhino logs and/or console
A dependant external system is not functioning properly
Check the Rhino log for exceptions. Exceptions in the logs often indicate a cause for dropped calls. If exceptions are found in the logs, follow the steps in Rhino logs and/or console output containing exceptions
Follow the diagnostic steps defined in sections Operating Environment Issues, Java Virtual Machine Heap Issues and Application or Resource Adaptor Heap Issues and follow through diagnostic steps to determine whether or not they are the cause.
Follow the diagnostic steps defined in section High Latency.
If the exception is of the type
AccessControlException, please refer to section Security Related Exceptions to solve the issue.
If the exceptions contain the text "Event router transaction failed to commit" it is likely that the in-memory database is not sized correctly for the load. Follow the diagnostic and resolution steps in Memory Database Full.
For other exceptions please contact your solution provider for support.
If the log key for the exception log matches the tracer for a resource adaptor entity e.g.
trace.sipra the cause of the problem is likely to be a faulty resource adaptor.
This is indicative of either a misconfiguration of the resource adaptors in question or a bug in their implementation.
If this is occurring frequently then please contact your solution provider support.
Contact your solution provider for support attaching the log files that contain the exception message.
If lock timeout messages are occuring please refer to section Lock Timeouts. Two examples of lock timeout messages are as follows.
... ... Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT ... ...
... ... ========= Lock timeout ========= ... ... Unable to acquire lock on [lockId] within 5000 ms ... ...
If lock timeout messages are occurring please refer to Lock Timeouts.
A rate limiter may be rejecting input if the incoming rate exceeds its specified rate.
The rate limiter raises an alarm and outputs to the Rhino logging system if it is actively rejecting input. An example of this is shown as follows.
2011-08-02 22:02:59.931 Major [rhino.facility.alarm.manager] <Timer-1> Alarm 103:212775261552:14 [SubsystemNotification[subsystem=ThresholdAlarms],LIMITING,queue-saturation-limiter-rejecting-work] was raised at 2011-08-02 22:02:59. 931 to level Major QueueSaturation limiter is rejecting work 2014-11-18 23:26:41.134 Major [rhino.facility.alarm.manager] <Timer-1> Alarm 101:316812519480:1 [SubsystemNotification[subsystem=ThresholdAlarms],LIMITING,system-input-limiter-rejecting-work] was raised at 2014-11-18 23:26:41.134 to level Major
A user rate limiter is designed to reject input above a certain threshold to avoid overloading Rhino. If this threshold is appropriately configured then the messages should occur very rarely. If they occur frequently then it is possible that the solution is not sized appropriately for the incoming load. In this case please contact your solution provider support.
This is difficult to diagnose in a general manner as it is specific to the solution running on Rhino. However the system administrator should be aware of any external systems which the solution communicates with. Examples may include rating/charging engines, HLRs, HSS, MSCs, IVRs, SMSCs, MMSCs, databases, etc.
Please refer to administrators of dependant systems and if necessary contact your solution provider support.
A new cluster or node has been brought into service and latencies are much greater than target
Traffic to a new node fails with timeouts
Java takes some time to settle during application startup due to Just in Time(JIT) compilation of code hotspots. For this reason, we recommend slowly introducing traffic to a newly booted node. Most of the hotspots will be optimised in the first minute, however it is better to ramp up load over the course of up to 15 minutes to allow JIT compilation and optimization to occur.
Cluster size should be set during provisioning to allow sufficient reserve capacity to handle peak-hour traffic with at least one failed node. This allows a lot of flexibility in bringing a failed node back into the cluster.
The recommended approach to bringing a node back into the cluster is to use a load balancing mechanism to limit traffic to the rebooted node. (Signalware has this mechanism built in) Usually a smooth increase from zero to a fair share over the course of 5 to 15 minutes (depending on total load) allows the JIT compiler to work without causing call failures or unacceptable latencies.
Another option, if sufficient reserve capacity exists, is to simply wait until cluster load drops to a level that a new node can cope with before rebooting the failed node. This works, as the reserve capacity ensures that the node failure is not visible to callers.
For Java 7 update 45 and newer enabling tiered compilation may result in some improvement at node boot time. This will require a much, much larger code cache as well due to the more aggressive compilation. This it does not avoid the initial very slow interpreted calls, it merely begins compilation more aggressively, which improves performance faster. This option is on by default in Java 8 where the default code cache size has been increased to handle the extra data.
OPTIONS="$OPTIONS -XX:+TieredCompilation -XX:InitialCodeCacheSize=512m"
Note that to avoid performance decreasing, the default code cache size must be set when using Java 7, as the default size of 48MB is not large enough. Java 8 defaults to 240MB and fixes some performance bugs in flushing old compilation data however you may still see benefit from increasing the code cache.
Examine the service usage statistics to rule out differences in applied load.
A small difference in CPU load is expected for nodes servicing management connections. These nodes will also consume slightly more heap and Java Permanent Generation memory. If the amount of PermGen memory used on a node is greater than 90% of the maximum a CMS GC cycle will be triggered to try to reclaim PermGen, When permgen remains above 90% CMS cycles will run almost continuously and cause a significant increase in CPU usage. Check the heap statistics for the node or the GC log entries in the console log to determine if this is the cause of the imbalance. Log entries will contain a section similar to the following with a high proportion of the Perm space used and only a small change after GC:
[CMS Perm : 186608K->186604K(196608K)]
If the problem is memory related increase the allocated heap or PermGen so that there is sufficient free space for normal operation. For more information on heap configuration see Java Virtual Machine Heap Issues
If the problem is not memory related investigate the balance of load across threads.
A small number of threads consuming a large percentage of CPU time may indicate a software bug or a load profile that is unusually serial in execution.
On Linux the command
top -ch will display the busiest threads and the process they belong to.
Run the command
$RHINO_HOME/dumpthreads.sh to trigger a Java thread dump.
The thread IDs in the dump can be matched to the PIDs of the threads displayed in
top by converting the PID to hexadecimal.
Frequently the stack traces for busy threads will be similar.
This can suggest the part of the code that contains the bug and may help identify the trigger conditions.
Contact your solution provider for assistance, attaching the thread dump and list of busy threads.