Operating environment issues

There are several common ways in which Rhino’s operating environment can be inappropriately configured, typically related to resource contention or undersizing for a given load. The most common causes are contention for CPU and heavy disk I/O.

Symptoms

  • Slow event response times

  • Low event throughput rate

  • Dropped calls

  • Periodic failures

Diagnostic steps

Operating system utilities such as vmstat, mpstat, and iostat are particularly helpful in determining whether a problem is caused by resource contention. Run such tools for long enough to observe the problem, then review the logged output from immediately before and after it occurred. Look in particular at CPU, I/O, swap, and network use. You may also want to record these continuously, for analysis of long-term system performance and transient faults.

If possible, collect system statistics at intervals of no more than 10 seconds, covering every major subsystem. The standard UNIX SAR monitor also collects useful data; other tools may offer higher resolution or better analysis support.
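
For example, on a Linux system with the sysstat package installed, the following commands record system-wide statistics at 10-second intervals (the output file path is illustrative):

sar -o /var/log/rhino-sar.data 10 > /dev/null 2>&1 &
vmstat 10

The first command records all activity to a binary file for later analysis with sar; the second gives a live summary of CPU, memory, swap, and I/O.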

Some of the specific things to look for are listed below:

More than 75% CPU utilisation on the machine

Please ensure that the system is sized such that no more than around 75% of CPU is used by Rhino under peak load with one failed node. If the machine exceeds this CPU utilisation, it has been incorrectly sized. If running in a virtualised environment, this sizing should allow for the failure of one physical host.

Other system processes

If other processes running on the same hardware compete for CPU for long enough, the responsiveness of a Rhino node may suffer. In our experience, operating system scheduling frequently does not give Rhino enough resources when there is competition for them. Rhino should be installed on dedicated hardware with no competing applications. Processes that can be especially demanding of CPU and I/O include databases and other Java servers.

Cron jobs

Cron jobs can trigger other system processes which may compete for system resources. Cron jobs which are not essential should be disabled. Take particular note of the locate service, which rebuilds its database as a daily or weekly task, an I/O-intensive procedure.
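
For example, on many Linux distributions the locate database rebuild is scheduled from /etc/cron.daily (the script name varies by distribution, commonly mlocate or updatedb):

ls /etc/cron.daily /etc/cron.weekly
chmod -x /etc/cron.daily/mlocate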

CPU Use

CPU use of the machine should be monitored to ensure that more than about 25% of CPU is available at all times, including with one node failed. Data should be collected at least every 10 seconds, preferably at 1-second intervals.
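
For example, mpstat from the sysstat package reports per-CPU utilisation at 1-second intervals:

mpstat -P ALL 1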

Disk I/O

High levels of disk I/O can result in increased CPU use, and can cause the Rhino JVM process to be scheduled infrequently. It can also cause writes to be scheduled synchronously, which may block threads during critical tasks such as GC. This is particularly likely on highly contended disks such as those found in virtualised environments and SAN storage arrays.
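
For example, iostat reports per-device statistics at 10-second intervals; sustained high values in the await and %util columns indicate contended disks:

iostat -x 10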

Swap use

Swap use is a common cause of excessive disk I/O. Rhino should be installed on a machine with sufficient physical RAM for the machine’s workload.

Swap use also significantly impairs GC performance. If even a small amount of the JVM heap is swapped out, GC pause times can increase by up to 300%, and potentially far more with greater swap usage. While the CMS collector will usually touch pages frequently enough to keep them in physical memory, the G1 collector can leave regions untouched for long enough that Linux swaps them out preemptively.
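
For example, swap activity can be checked with vmstat; non-zero values in the si and so columns under load indicate that pages are being swapped in and out, and free shows how much swap is in use:

vmstat 5
free -m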

Network issues

Rhino’s cluster membership services use multicast UDP. If the IP network is heavily loaded, packets may be dropped, potentially causing cluster segmentation or failure if individual cluster nodes cannot communicate reliably.

See Cluster Segmentation, Cluster Failing to Start and Cluster Starts but Stops After a few Minutes for more details of multicast related problems.

Large system NUMA binding

Large systems (that is, four or more CPU sockets per host) have significant asymmetry in memory access speed and latency between CPU sockets. This can cause performance to be significantly below optimum.

Workaround or resolution

If any of the machines Rhino is running on are running over 75% CPU use, and this is not due to other system processes, then consider:

  • configuring a rate limiter as a workaround

  • increasing the number of available threads

  • installing more CPUs into a node

  • installing more nodes into a cluster

  • using faster CPUs.

Processes that compete with Rhino for resources should not run on a machine hosting Rhino. Move these to another host.

To prevent preemptive swapping slowing GC, reduce the sysctl vm.swappiness from its default of 60 to something low enough to prevent it; 30 is a good starting point. If using G1, set vm.swappiness no higher than 30, and preferably 10 or lower.
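
For example, on Linux (the value shown follows the guidance above; adjust to suit your environment):

sysctl -w vm.swappiness=10
echo "vm.swappiness = 10" >> /etc/sysctl.conf

The first command applies the change immediately; the second makes it persistent across reboots.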

When running Rhino on NUMA hosts it is usually best to run multiple nodes on each host with each JVM bound to a separate NUMA domain. Configure this with numactl --cpunodebind=X --membind=X on Linux.
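
For example, on a two-socket host running nodes 101 and 102 (the node start command shown is illustrative; use the start script provided by your installation):

numactl --cpunodebind=0 --membind=0 ./node-101/start-rhino.sh
numactl --cpunodebind=1 --membind=1 ./node-102/start-rhino.sh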

Java Virtual Machine heap issues

Poor performance or failure of a Rhino node may result from a JVM configuration with too small a heap for the deployed applications and the incoming traffic they process.

This section describes how to diagnose incorrect heap sizing. See Application or resource adaptor heap issues to diagnose application or resource adaptor memory leaks.

Symptoms

  • Large garbage collection pauses in garbage collection logs

  • Longer than expected message processing and/or call setup times

  • Failed or timed-out call setup or message processing attempts

  • Out of memory error appearing in Rhino logs

  • Watchdog errors "Over memory use limit"

Diagnostic steps

There are two common causes of heap issues when running Rhino:

  • heap sizing may be incorrect

  • something may be leaking memory.

Incorrect heap sizing

The heap usage of Rhino should be monitored. This includes both the Java heap and the native heap. The Java heap can be measured with Rhino’s statistics tool rhino-stats, or with Java monitoring tools such as jstat or jvisualvm. The native heap size can be measured using system utilities such as ps or top.

Here’s an example of using the Rhino statistics tool to monitor the size of the Java heap:

./client/bin/rhino-stats -m JVM

With a well-sized heap, the free Java heap should gradually decrease from, say, 1.5 gigabytes free to around 800 megabytes free, and then climb back to its upper limit as garbage is collected. This long-term pattern will be overlaid by shorter peaks of around 128 megabytes. The actual figures depend on the services deployed and active; when using the default CMS collector, total heap usage should never exceed 60% of the allocated heap size for sustained periods.
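
The Java heap can also be sampled with standard JVM tools, and the native heap with ps (replace <pid> with the process ID of the Rhino node’s JVM):

jstat -gcutil <pid> 10000
ps -o pid,rss,vsz -p <pid>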

Component memory leak

If the JVM heap is sized appropriately for a given load, and the system runs stably at that load but gradually degrades over a period of at least several hours, then it is possible that the JVM, a resource adaptor, or Rhino has a memory leak.

Check the number of live activities and SBB entities using the Rhino statistics monitoring tool. If each is within expected values, then it is likely there is a memory leak. If there are too many, there may be a bug in the component. A strong indicator of an RA or service bug is activities that are more than ten times the average call duration in age, or older than one whole day.

Workaround or resolution

Incorrect heap sizing

First raise the maximum heap size available to the JVM by modifying the HEAP_SIZE variable in the config/config_variables file. Ensure that there is enough physical memory on the machine that the system will not swap if this extra memory is used by the JVM. Also set the NEW_SIZE based on the duration and memory usage of a typical session. The Rhino node must be restarted for this change to take effect. See the JVM documentation for more information on heap sizing and related overheads.
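
For example, the relevant entries in config/config_variables might look like the following (the values shown are purely illustrative, not recommendations):

HEAP_SIZE=3072m
NEW_SIZE=256m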

Warning

Tuning garbage collection settings may improve performance, but this is NOT RECOMMENDED without prolonged performance testing, as certain combinations of garbage collection settings may severely affect the stability of a cluster and cause JVM crashes. If you wish to explore this avenue:

  • contact your solution provider support or OpenCloud support for help

  • thoroughly load test the system for at least two weeks using the new garbage collection settings before deployment to production

  • consider adding more RAM to a machine, or more machines to the cluster.

Alternative garbage collectors such as G1 may improve worst-case performance at some cost to average-case performance, but have their own tuning requirements. G1 requires Java 8 or newer to function reliably.

With some allocation patterns it may not be possible to find a heap size or new size that provides an acceptable balance between pause duration and pause frequency for the target load. If this is the case, the only solutions are to use a different cluster size, or to contact your solution provider for an improved version of the service that uses memory more efficiently. This problem should only arise as a result of an error in capacity planning and testing. A workaround that is sometimes available, if the host server has sufficient free RAM, is to increase the heap size until pauses can be avoided by restarting nodes during idle periods (however this is rarely possible).

Component memory leak

Please contact your solution provider support to work through component memory leak issues.

Application or resource adaptor heap issues

A system with application or resource adaptor related heap issues displays several symptoms, which are similar to those of an inappropriately sized JVM heap.

Symptoms

  • Large garbage collection pauses in garbage collection logs

  • Longer than expected message processing and/or call setup times

  • Failed or timed-out call setup or message processing attempts

  • Out of memory errors in Rhino logs

Diagnostic steps

When these symptoms occur it is often necessary to check both for application or resource adaptor related heap issues and for JVM heap issues.

Application or resource adaptor caused memory leaks

Start the rhino-console, and query for activities that are older than expected for the system. For example, call activities typically last the duration of a phone call, so query for activities that are older than the majority of phone calls (say, one hour). To perform this query, use the findactivities command with the -created-before flag.

Here’s an example of finding activities created before 2:30pm:

findactivities -created-before 14:30:00

You can also query for activities older than a specified age, for example:

findactivities -created-before 1h

The resulting number of activities should be small. If it is larger than, say, 30 (depending on your application), there may be an issue with some activities not being cleaned up properly by the service or resource adaptor. Exclude ServiceActivities and ProfileActivities from the count.

If the number of activities is small, check the number of SBB entities in Rhino. SBB entities are removed when they are no longer in use, so if you find a large number of SBB entities that were created a long time ago, it is likely that the service has a problem. The command to query SBB entities is findsbbs.

Here’s an example of finding SBB entities for the SIP proxy service created before 2:30pm:

findsbbs -service SIP\ Proxy\ Service\ 1.5,\ Open\ Cloud -created-before 14:30:00

We also recommend that you use the statistics client to examine the memory usage and number of activities in the running SLEE:

$ client/bin/rhino-stats -m Activities
2006-02-27 11:56:29.610 INFO [rhinostat] Connecting to localhost:1199
2006-02-27 11:56:33.952 INFO [rhinostat] Monitoring
2006-02-27 11:56:34.957 INFO [rhinostat] Cluster has members [101]

                                            Activities

time                         active dropped finished started
-----------------------    -----------------------------------
2006-02-27 11:58:59.574        57      -       -        -
2006-02-27 11:59:00.612        57      0      19       19
2006-02-27 11:59:01.635        57      0      21       21
2006-02-27 11:59:02.657        57      0      19       19
2006-02-27 11:59:03.875        57      0      21       21
2006-02-27 11:59:05.033        57      0      20       20
2006-02-27 11:59:06.053        57      0      20       20

Workaround or resolution

Application or resource adaptor caused memory leaks

This indicates an unhandled case in the services or resource adaptors being used. Activities and SBB entities known to be invalid may be removed through client/bin/rhino-console using the removeactivity and removesbb commands respectively. These commands may be used to stop a system running out of memory until a patched application or resource adaptor is installed.
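
For example, from rhino-console (the identifiers shown are placeholders; use the IDs reported by findactivities and findsbbs, and check each command’s help for the exact argument format):

removeactivity <activity-id>
removesbb <sbb-id>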

Rhino start-up fails with 'java.io.IOException: Not Enough Space'

The most likely cause of this exception is that there is not enough free memory. When compiling deployable units, Rhino versions earlier than 2.3.0.7 fork another JVM to run javac. This may double memory requirements during deployment; the exact behaviour depends on the memory management of the operating system. Since version 2.3.0.7, Rhino no longer forks to run javac unless specifically configured to do so for diagnostic purposes, so the problem can still occur but is less likely.

Symptoms

INFO [rhino.management.deployment.builder] <main> Generating profile implementation
INFO [rhino.management.deployment.builder] <main> Compiling generated profile specification common classes
java.io.IOException: Not enough space
        at java.lang.UNIXProcess.forkAndExec(Native Method)
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:52)
        at java.lang.Runtime.execInternal(Native Method)
        at java.lang.Runtime.exec(Runtime.java:566)
        at java.lang.Runtime.exec(Runtime.java:491)
        at java.lang.Runtime.exec(Runtime.java:457)
WARN [rhino.management.deployment] <main> Installation of deployable unit failed:
com.opencloud.rhino.management.deployment.BuildException: javac failed - exit code -1
Failed to start rhino node
com.opencloud.rhino.management.deployment.BuildException: javac failed - exit code -1
Node 101 exited due to misconfiguration.

Diagnostic steps

Firstly, check to see how much free memory is available to the JVMs, and that swap space is available. Rhino should never run using swap space (doing so will cause unpredictable performance); but nonetheless, sufficient swap space should be present to allow the OS to manage memory more effectively.

Check the amount of physical memory available. Under Linux, use the free command.

Compare this amount to the size of the JVM specified in the HEAP_SIZE variable in node-???/config/config_variables (for Rhino-production). Make sure that the JVM can fit comfortably in the physical memory on the computer.
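
For example (the node directory shown is illustrative):

free -m
grep HEAP_SIZE node-101/config/config_variables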

Workaround or resolution

Rhino versions earlier than 2.3.0.7 require about twice the amount of physical memory as the size of HEAP_SIZE to ensure that a fork() of the JVM during compilation of services will always be able to run in physical memory and not incur delays by swapping memory to and from disk. Linux systems use copy-on-write memory cloning when forking processes so the likelihood of swapping is much lower than on Solaris.

You may need to reduce the value of HEAP_SIZE by editing the file $RHINO_HOME/node-???/config/config_variables to make sure Rhino can fit in the available physical memory.

Also make sure there are no other running processes on the computer which may be using excessive system resources, and that there is sufficient swap space.

Warning about UDP buffer sizes

Rhino cluster communication requires large UDP send and receive buffers.

The operating system limits for socket transmit and receive buffers must be large enough to allow the buffer size to be set.

Symptoms

Rhino prints warnings at startup:

WARN    [savanna.ocio.socket.sc]   <main> Asked for a UDP receive buffer of 261120 bytes, but only got 212992 bytes. Check system-wide limits (net.core.rmem_max)

Resolution

Ensure that the kernel parameters net.core.rmem_max and net.core.wmem_max are large enough.

See UDP buffer sizes in the Preparing your Network section of the Rhino Production Getting Started Guide for details.
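
For example, on Linux (the value shown is illustrative; use the sizes recommended in the Getting Started Guide):

sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_max=8388608

Add the same settings to /etc/sysctl.conf to make them persistent across reboots.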

Java Virtual Machine error

The Sun Java Virtual Machine is typically very stable and, theoretically, should not crash even under high load. On rare occasions, however, the JVM can crash for any of the following reasons:

  • Memory on the local computer has been corrupted.

  • The swap partition on the local disk has had data corruption.

  • Some other aspect of the hardware in the local computer has caused a JVM crash.

  • A bug or library versioning issue in the operating system has caused the JVM to crash.

  • A bug in one of the included Java Native Interface (JNI) libraries has caused the JVM to crash.

  • A bug in the JVM itself has caused it to crash.

Symptoms

When the JVM crashes, the Rhino node will fail and the cluster will reconfigure itself. On Unix systems, the node will die with either a SIGSEGV or a SIGBUS and perhaps dump core, depending on how your system is configured.

JVM error message in Rhino logs and console

A message similar to the following appearing in logs and/or the Rhino console indicates a JVM error.

#
# An unexpected error has been detected by HotSpot Virtual Machine:
#
# SIGSEGV (0xb) at pc=0xf968c5f9, pid=27897, tid=64
#
# Java VM: Java HotSpot(TM) Server VM (1.5.0_05-b05 mixed mode)
# Problematic frame:
# J  com.opencloud.slee.resources.signalware.OutgoingBackendConnection.writePacket(Lcom/opencloud/slee/resources/signalware/PacketType;Lcom/opencloud/slee/resources/psc/Encodeable;)V
#
# An error report file with more information is saved as hs_err_pid27897.log
# If you would like to submit a bug report, please visit: http://java.sun.com/webapps/bugreport/crash.jsp
#
27897 Abort - core dumped

Configuration change messages in Rhino logs and console

The following are examples of configuration change messages.

 ...
2016-04-27 12:16:24.562  INFO    [rhino.main]   <SavannaDelivery/domain-0-rhino-db-stripe-2> Group membership change: GrpCCM: domain-0-rhino-db-stripe-2 Reg [101,18108] - Memb [ 101, 102, 103, 104, 105, 106, 107 ] { BboneCCM: Reg [101,27068] - Memb [ 101, 102, 103, 104, 105, 106, 107, 200 ] }
2016-04-27 12:16:24.569  INFO    [rhino.main]   <SavannaDelivery/rhino-monitoring> Group membership change: GrpCCM: rhino-monitoring Reg [101,15192] - Memb [ 101, 102, 103, 104, 105, 106, 107 ] { BboneCCM: Reg [101,27068] - Memb [ 101, 102, 103, 104, 105, 106, 107, 200 ] }
2016-04-27 12:16:24.601  WARN    [rhino.membership.rhino-cluster]   <SavannaDelivery/rhino-cluster> Node(s) [108] left cluster operational member set
2016-04-27 12:16:24.601  INFO    [rhino.membership.rhino-cluster]   <SavannaDelivery/rhino-cluster> Current membership set: [101,102,103,104,105,106,107,200]
2016-04-27 12:16:24.619  INFO    [framework.scs]   <SavannaDelivery/rhino-management> [rhino-management]  Component change: Component(members=[101,102,103,104,105,106,107],transitioning=[101,102,103,104,105,106,107],nonPrimary=[],id=5222)
...
2016-04-27 12:22:39.041  WARN    [rhino.membership.rhino-cluster]   <SavannaDelivery/rhino-cluster> Node has left primary component
2016-04-27 12:22:39.041  INFO    [rhino.exithandler]   <SavannaDelivery/rhino-cluster> Exiting process (Node left primary component)

Alarm indicating a node has failed

An alarm appears in the console if a node has failed. Use this command to query active alarms:

./client/bin/rhino-console -listActiveAlarms

This command produces a block of output for each active alarm. A block in the list of active alarms similar to the following indicates that a node has left the cluster due to hardware, network, or software failure.

...
Alarm 101:193861858302463 [rhino.node-failure]
  Rhino node : 101
  Level      : Major
  InstanceID : 102
  Source     : (Subsystem) ClusterStateListener
  Timestamp  : 20161103 17:44:02 (active 0m 6s)
  Message    : Node 102 has left the cluster
...

Diagnostic steps

If any of these symptoms occur, first check the logs of all Rhino cluster members (file console.log). If a JVM error message appears in the logs of any node, then a Java Virtual Machine error has occurred. If an error message cannot be found in the logs, please refer to Cluster Segmentation, Cluster Failing to Start and Cluster Starts but Stops After a few Minutes.

After checking Rhino’s logs, determine whether or not the hardware or operating system is causing the problem. Look at the logs of the local machine and determine whether the machine has a history of restarts, kernel panics, process segmentation faults, and so forth. On Unix machines, system logs can be viewed using the dmesg command or by viewing logs in /var/log.

When a JVM crash occurs, a file will be left with a name like hs_err_pid*.log. View this file and try to determine which part of the JVM caused the crash. This may provide clues as to how to resolve the situation and will be needed if a bug in the JVM is to be reported.

Workaround or resolution

If the crash appears to be a one-off, the node can simply be restarted.

If the problem resides in faulty hardware, then that hardware will need to be fixed or replaced.

OpenCloud has experience with configuring the Sun Java Virtual Machine to be as stable as possible. Consider consulting OpenCloud support for information regarding known stable configurations of JVMs. Also ensure that thorough stability tests are performed on a new version of a Java Virtual Machine before it is deployed in a production system.

Also scan through the list of dynamic libraries in the hs_err_pid*.log file and see if they are all the latest version. Keep an eye out for libraries which should not be there, for example from other installations of older JVMs.

If it is a realistic option, upgrading the operating system to the latest version may also resolve this problem.

JVM errors that cannot be identified as being caused by system configuration or dynamic library problems should be reported to Oracle.

Multicast traffic is using the wrong network interface

Symptoms

The host machines for Rhino have multiple network interfaces, and a specific interface must be used for SLEE cluster traffic. Network monitoring shows that this traffic is currently not on the desired interface.

When the configuration differs between hosts, nodes may fail to go primary. This is indicated by repeated messages in the Rhino log indicating that a node is waiting for the cluster to go primary, and a lack of configuration change messages.

Messages like this show up in the Rhino console and logs:

...
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
...

Diagnostic steps

Use a network monitoring tool such as tcpdump or snoop to detect multicast traffic on the groups configured for Rhino clustering.

Use netstat -rn, route, or ip route to examine the current multicast routing configuration.
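
For example, on Linux (the interface name is illustrative):

tcpdump -i eth0 -n 'ip multicast'
ip route show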

Workaround or resolution

The choice of interface is determined solely by the machine’s multicast routing configuration; it is controlled at the OS level rather than by Rhino. Use route or ip route to modify the OS routing configuration. See Cluster Failing to Start for details of how to configure the interface used for multicast traffic.
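
For example, on Linux the multicast address range can be routed via a specific interface (the interface name is illustrative):

ip route add 224.0.0.0/4 dev eth1

or, using the older route command:

route add -net 224.0.0.0 netmask 240.0.0.0 dev eth1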
