Rhino 3.0.0 :: Rhino Troubleshooting Guide

Below are troubleshooting steps — symptoms, diagnostic steps, and workarounds or resolutions — for common general issues with Rhino.

On this page...

Rhino watchdog
Long-running transaction warnings
Lock timeouts
I/O exceptions
Inactive RA provider
Leaked OCIO buffers
License over capacity alarms
Rhino logs containing exceptions

Rhino watchdog

The watchdog is a thread in Rhino that monitors the system clock and threads in the SLEE for strange behaviour. There are two types of threads monitored:

DeliveryThreads — internal threads inside the SLEE
StageWorkers — threads responsible for executing deployed service logic.

The watchdog thread watches for the following strange behaviour:

threads becoming “stuck”
threads dying
the system clock suddenly jumping forwards from the point of view of the JVM
the system clock suddenly jumping backwards from the point of view of the JVM

Stuck Threads watchdog

The stuck threads watchdog behaviour handles stuck or dying threads. Under some circumstances as discussed below, the node will leave the cluster and exit. If configured for restarting, the node will restart, rejoin the cluster and continue operating.

Symptoms

If a thread becomes stuck, the following appears in the logs:

2016-05-10 12:34:54.487 ERROR [watchdog.threadactive] <Global watchdog thread> Thread has become stuck: StageWorker/0

If a thread dies, the following appears in the logs:

2016-05-10 12:34:54.487 ERROR [watchdog.threadactive] <Global watchdog thread> Thread has died: StageWorker/0

If the watchdog detects that a DeliveryThread has become stuck or has died, that node will exit with the following message and a stack trace:

2016-05-10 12:34:56.509 ERROR [watchdog] <Global watchdog thread> *** WATCHDOG TIMEOUT ***
2016-05-10 12:34:56.509 ERROR [watchdog] <Global watchdog thread> Failed watchdog condition: ThreadActive:DeliveryThread/0

The watchdog monitors the number of alive StageWorker threads to ensure that more than 50% of them are alive. If fewer that 50% of the StageWorker threads have remained alive without terminating or becoming stuck, then the watchdog will cause the node to exit with an error message similar to the above. This proportion of surviving threads is configurable; for information about configuring the watchdog, see the Rhino Production Getting Started Guide.

Conditions under which a StageWorker becomes stuck or dies are usually indicative that a deployed service is experiencing unexpected problems or has a faulty implementation. In this scenario, the best approach is to examine the reasons for the service causing threads to block or exit. Common culprits include excessive blocking on external resources such as JDBC connections, infinite loops or poor handling of unexpected circumstances.

The watchdog timeout values for StageWorker threads can be increased, However this is not usually recommended as a blocked StageWorker indicates a session with very long processing time. This may be a solution when a service consistently needs to perform a blocking action that is expected to take a long time. It is also possible to increase the number of StageWorker threads; this will improve reactiveness when a large proportion of StageWorker threads are blocking on an external service or long computation. Increasing the StageWorker thread count is strongly recommended when increasing the timeout.

Conditions under which a DeliveryThread becomes stuck or dies usually indicate that there is a fault in the hardware, operating system, libraries, or in the Rhino SLEE. If this is the case, first determine if hardware is an issue. It would be a good idea to refer to any available documentation about diagnoses of hardware problems for the platform in question. Look at the output of the dmesg command on Unix operating systems, and whether the machine has had a history of restarts, kernel panics, process segmentation faults, and so forth.

In the case of a DeliveryThread becoming stuck or dying, investigate and resolve any possible hardware issues. Contact your solution provider with the stack trace information from the Rhino logs and circumstances under which the watchdog timeout occurred for information about what further action could be taken to resolve this issue.

Timewarp watchdog

The timewarp watchdog behaviour handles the system clock suddenly changing from the JVM perspective. A warning is printed in the log if the system clock suddenly changes. A Critical alarm is raised whenever a timewarp is detected, and will not be automatically cleared. This alarm is considered critical as all timewarps indicate severe problems in the underlying system that threaten cluster stability and ability to process sessions in a timely manner.

A timewarp of 5s or greater is sufficient to break the clustering mechanism used in Rhino, and may cause the node to leave the cluster. This is done to prevent a split-brain situation, or data corruption. A single node cluster will not self shutdown due to timewarps, but this should still be considered a critical issue as the node is unavailable during the timewarps.

Rhino may fail if system time ever goes backwards on a single node while the cluster is running, or if it goes backwards to overlap with a time period when the cluster was previously running. This is because the unique IDs being generated may no longer be unique.

If the watchdog observes a sudden change in the system clock, the following will be printed on the console:

2016-08-29 17:00:37.879 ERROR [watchdog] 2016-08-29 17:00:37.879 ERROR [watchdog] 2016-08-29 17:00:37.879 ERROR [watchdog] 2016-08-29 17:00:37.879
ERROR [watchdog]
<Global watchdog thread> Forward timewarp detected! interval=8021 but should be at most 2000
<Global watchdog thread> old timestamp=1188363629857, new timestamp=1188363637878
<Global watchdog thread> Check for CPU overloading or stop-the-world GC.
<Global watchdog thread> Check for external processes that could step the system clock forwards ('date -s', NTP in step mode, etc)

Several causes of timewarps exist.

Long Garbage Collection(GC) pause

The most likely cause of forward timewarps is GC. GC cannot cause a reverse timewarp. To check GC pause times, look for entries in the console log reporting real times greater than 1s for ParNew, CMS-initial-mark, and CMS-remark. Also look for entries reporting Full GC or concurrent-mode-failure. concurrent-mode-failure indicates a severe problem.

concurrent mode failure GC log

 2016-04-14 14:07:22.387  2016-04-14T14:07:19.279+0200: [GC2016-04-14T14:07:19.279+0200: [ParNew (promotion failed): 127844K->127844K(130112K),
3.1073460 secs] 1766087K->1792457K(3144768K), 3.1078370 secs] [Times: user=4.05 sys=0.00, real=3.11 secs]
 2016-04-14 14:07:23.865  2016-04-14T14:07:22.387+0200: [Full GC2016-04-14T14:07:22.387+0200: [CMS: 1664612K->254343K(3014656K), 1.4772760 secs]
1792457K->254343K(3144768K), [CMS Perm : 77532K->77327K(196608K)], 1.4775730 secs] [Times: user=1.50 sys=0.00, real=1.47 secs]
 2015-09-24 22:09:13.668   (concurrent mode failure): 22907K->8498K(32768K), 0.0504460 secs] 49900K->8498K(65344K), [CMS Perm : 15238K->15237K
(196608K)], 0.0796580 secs] [Times: user=0.20 sys=0.00, real=0.08 secs]

Resolving long GC pauses almost always requires modifying the heap size/new size to eliminate the long pause. This must be done in tandem with performance testing to verify the effects of changes.

System clock changes

This is most commonly caused by a misconfigured time synchronization daemon (usually NTP). Rhino requires that NTP be run in slew mode, as step mode may make steps greater than the timewarp limit when enabled. NTP steps may be both forwards, and reverse.

Virtual Machine pauses

Virtual machines may pause processing for many reasons, and may produce both foward and reverse timewarps depending on hypervisor configuration. In order to minimize VM pauses, we strongly recommend avoiding overcommiting resources.

Excessive system IO waits

Excessive system IO waits may only cause foward timewarps. These occur when the host fails to handle disk IO in a timely manner, blocking the JVM. This may be diagnosed through OS level logging and IO monitoring tools. Other causes of forward timewarps should be investigated first.

Note: VM’s are particularly prone to this form of timewarp.

Long-running transaction warnings

Long-running transaction warning messages appear in Rhino logs if a transaction has been running for too long. Long-running transactions can be caused when:

a profile has been opened for editing but has not been committed or restored
a large deployable unit is being installed, done in a single transaction, and taking a long time to compile
a single node is restarting

Nodes load most of their state from the database in a single transaction, which can take some time.

a service makes blocking calls to external resources that are very slow to respond
the cluster is running under excessive load.

Symptoms

Warning messages like the following in the Rhino logs or console indicate long-running transactions.

...
WARN [transaction.manager] There are 1 long-running transactions currently active:
WARN [transaction.manager] 43190ms: [OCTransactionImpl:TransactionId:[101:16]:ACTIVE]
...

Diagnostic steps and correction

A long-running transaction may be reported during installation of components during deployment or when a node is starting up. In this case, the error message is benign; and the warning will be cleared when the components have finished installing.

If this message appears only during start-up or deployment of components, then it can be safely ignored.

If such messages persist or occur during normal operation, then they should be reported to your solution provider. Please gather the output of the dumpthreads.sh command from the node directory. This output is written to the Rhino console and is not gathered as part of Rhino’s logging system. It can usually be found in the file console.log. Please provide this output to your solution provider support.

Lock timeouts

Lock timeouts occur when a node cannot obtain a cluster-wide lock within a certain timeout. Lock timeouts are most likely to occur when nodes leave the cluster, when nodes are experiencing overload conditions, or when network problems are being experienced. Some possible causes are:

Overloaded nodes
Network congestion
Network failure
Nodes which are about to leave the cluster due to some failure (e.g. garbage collection pauses)
Too much contention for the lock in question,
that users of the lock are queued sufficiently long enough for their requests to time out.
Too much load in general on the system,
such that the lock request can’t be processed by the in-memory database before it times out.
Deadlock,
when the locking resources are accessed in a different order.
A transaction has somehow been "lost" and left open holding a lock
this could be a product bug.
A thread has got stuck, holding its transaction open and therefore holding its lock,
this has various possible causes including: service bugs, JVM bugs and/or product bugs.

Symptoms

Below are examples of warning messages in Rhino logs or console that indicate lock timeouts.

...
... Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT ... ...
...

...
... ========= Lock timeout ========= ...
...
... Unable to acquire lock on [lockId] within 5000 ms ...
...

Diagnostic steps and correction

The diagnostic process determines whether a failure has occurred due to environmental issues or hardware, network, or software related issues. A hardware, network, or software related issue is usually indicated by the presence of an alarm. In order to check for alarms, use the following rhino-console command.

./client/bin/rhino-console listactivealarms

This command produces one line of output for each active alarm. If the list of active alarms contains a line similar to the following, it indicates that a node has left the cluster due to hardware, network, or software failure.

...
Alarm ...(Node 101, 23-Nov-05 15:10:36.636): Major [rhino.cluster] Node 102 has left the cluster
...

If an alarm indicates node failure, follow the diagnostic and resolution procedures for Node Failure.

If the output of the command does not contain a line similar to above, or indicates no alarms are active (as below), then the symptoms are occurring because of environmental issues or software faults.

No active alarms

Overload conditions are likely to cause lock timeouts. CPU usage should never exceed 70% for an extended period of time. CPU usage can be measured on Linux using the top command.

Garbage collection, particularly under overload conditions can cause lock timeouts. The behaviour of the garbage collector can be analysed by examining the garbage collection logs. By default these are logged to the Rhino console or the console.log file in
node-???/work/log For further information on diagnosing GC related problems, please refer to Java Virtual Machine Heap Issues and Application or Resource Adaptor Heap Issues.

SLEE load can also be investigated with the use of the Statistics Client. For more information about usage of the Statistics Client, please see the rhino-stats section in the Rhino Administration and Deployment Guide. Of particular value are the time statistics for the lock manager, MemDB and Savanna subsystems. If any of these report times in the 10s of milliseconds or greater this is usually a symptom of overload.

To resolve problems caused by overload it is usually necessary to add more cluster nodes on new hardware, correct the configuration of the software or improve the efficiency of the service logic. Follow the diagnostic steps in the following sections Operating Environment Issues, Java Virtual Machine Heap Issues, and Application or Resource Adaptor Heap Issues. If this does not lead to a resolution contact your solution provider for assistance.

If a lock timeout is caused by overloading the cluster, then an interim resolution is to enable a user-specified rate limiter to limit the number of events processed by the cluster. Instructions for doing so are in the Rhino Administration and Deployment Guide.

Threshold rules should be set up to alert the SLEE administrator to adverse conditions within the SLEE.

Deadlock, lost transactions and stuck threads are typically caused by software bugs. These can be in Rhino, service logic, or RAs, and will often be unrelated to call rate. To assist in diagnosing these problems observe the statistics parameter sets StagingThreads and Events. Diagnosing the cause of these will usually require setting the log keys transaction.manager and transaction.instance to the debug level. The logs obtained should be sent to your solution provider.

This writes a lot of log data and can impact service performance.

I/O exceptions

There are several causes for I/O exceptions to occur in Rhino logs. Below are the diagnostic steps and workarounds or resolution for I/O exceptions from:

Savanna
Rhino’s logging infrastructure
Rhino’s core
resource adaptors or SBBs

For all diagnostic steps and workarounds or resolution first check:

There is sufficient disk space on the running system
Executing processes have write permissions to files in their subdirectories
The ulimit for the number of open file descriptors is not too low, depending on the protocols in use Rhino may require a higher open files ulimit than the OS default.

I/O exceptions from Savanna in Rhino logs

Symptoms

The log key for these I/O exceptions contain any of the following phrases:

savanna
framework
ocio

Diagnostic steps and correction

These messages typically occur if the network is unable to accept multicast traffic from Rhino. This can be caused by either inappropriate kernel IP routing configuration or problems with network elements such as routers, switches, ethernet cards, firewalls etc. Start diagnosis from Operating Environment Issues. If this does not produce a solution continue with bClustering. Finally contact your solution provider for support.

I/O exceptions from Rhino’s logging infrastructure in Rhino logs

Symptoms

The log key for these I/O exceptions contains the following phrases:

rhino.logging
log4j.*

Workaround or resolution

These messages typically occur if the Rhino installation does not have appropriate disk space and/or permissions available or there is a hardware error.

I/O exceptions from Rhino’s core in Rhino logs

Symptoms

The log key for these I/O exceptions contains the following phrases:

rhino.*
memdb
transaction.*

Workaround or resolution

These messages should be provided to the solution provider unless they clearly indicate the cause.

I/O exceptions from resource adaptors or SBBs in Rhino logs

If the log messages do not occur on any of the log keys mentioned in the cases, then they are likely to come from a resource adaptor or SBB. These messages should be provided to the solution provider and possibly the resource adaptor vendor.

Inactive RA provider

Symptoms

A service fails with an error that looks like:

IllegalStateException
java.lang.IllegalStateException: sendQuery called on inactive RA provider

Resolution

The above exception indicates that you need to activate the RA entity. Documentation on managing RA entities can be found in the Rhino Administration and Deployment Guide.

Leaked OCIO buffers

Symptoms

Rhino logs contain the message:

2015-04-15 22:54:30.794  INFO    [savanna.ocio.leaks]   <OCIO Cleanup Thread> Garbage-collected 12201 leaked OCIO buffers
2015-04-15 22:54:30.794  INFO    [savanna.ocio.leaks]   <OCIO Cleanup Thread> Garbage-collected 12301 leaked OCIO buffers

Diagnostic steps

Compare the times these warnings are logged with garbage collection cycles. Cleanup of leaked OCIO buffers happens during CMS GC cycles and all G1 collections, and it writes one log entry per 100 buffers cleaned up. The log entry reports the accumulated total of all OCIO buffers cleaned up since the node started. If this message is occurring on every GC cycle, that may indicate an environment-related performance problem.

Continue diagnosis of the performance problem, starting at Operating Environment Issues.

If there are no signs of overload, then the warning is harmless and indicates that a section of code in the Rhino clustering implementation has cleaned up internal objects that would have otherwise leaked.

Profile Management and/or provisioning failing

Symptoms

Creating or importing profiles fails.

Diagnostic steps

If profile management and/or provisioning commands are unsuccessful this is possibly due to the size restriction of the ProfileDatabase installed in Rhino. The following command can be used to monitor the size of the ProfileDatabase installed in Rhino.

./client/bin/rhino-stats -m MemDB-Replicated.ProfileDatabase

Check the Rhino log for messages with a log key profile.* similar to the following:

2016-12-01 13:41:57.081  WARN    [profile.mbean]   <RMI TCP Connection(2)-192.168.0.204> [foo:8] Error committing profile:
javax.slee.management.ManagementException: Cannot commit transaction
        at com.opencloud.rhino.management.TxSupport.commitTx(TxSupport.java:36)
        at com.opencloud.rhino.management.SleeSupport.commitTx(SleeSupport.java:28)
        at com.opencloud.rhino.impl.profile.GenericProfile.commitProfile(GenericProfile.java:127)
        at sun.reflect.GeneratedMethodAccessor97.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at com.opencloud.rhino.management.DynamicMBeanSupport.doInvoke(DynamicMBeanSupport.java:157)
        at com.opencloud.rhino.management.DynamicMBeanSupport$1.run(DynamicMBeanSupport.java:121)
        at java.security.AccessController.doPrivileged(Native Method)
        at com.opencloud.rhino.management.DynamicMBeanSupport.invoke(DynamicMBeanSupport.java:113)
        at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
        at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
        at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at com.opencloud.rhino.management.BaseMBeanInterceptor.intercept(BaseMBeanInterceptor.java:12)
        at com.opencloud.rhino.management.AuditingMBeanInterceptor.intercept(AuditingMBeanInterceptor.java:245)
        at com.opencloud.rhino.management.ObjectNameNamespaceQualifierMBeanInterceptor.intercept(ObjectNameNamespaceQualifierMBeanInterceptor.java:74)
        at com.opencloud.rhino.management.NamespaceAssociatorMBeanInterceptor.intercept(NamespaceAssociatorMBeanInterceptor.java:66)
        at com.opencloud.rhino.management.RhinoPermissionCheckInterceptor.intercept(RhinoPermissionCheckInterceptor.java:56)
        at com.opencloud.rhino.management.CompatibilityMBeanInterceptor.intercept(CompatibilityMBeanInterceptor.java:130)
        at com.opencloud.rhino.management.StartupManagementMBeanInterceptor.intercept(StartupManagementMBeanInterceptor.java:97)
        at com.opencloud.rhino.management.RemoteSafeExceptionMBeanInterceptor.intercept(RemoteSafeExceptionMBeanInterceptor.java:26)
        at com.opencloud.rhino.management.SleeMBeanServerBuilder$MBeanServerInvocationHandler.invoke(SleeMBeanServerBuilder.java:38)
        at com.sun.proxy.$Proxy9.invoke(Unknown Source)
        at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
        at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
        at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1427)
        at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
        at sun.reflect.GeneratedMethodAccessor80.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
        at sun.rmi.transport.Transport$2.run(Transport.java:202)
        at sun.rmi.transport.Transport$2.run(Transport.java:199)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:198)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:567)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:828)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.access$400(TCPTransport.java:619)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:684)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:681)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:681)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: com.opencloud.transaction.TransientRollbackException: Unable to prepare due to size limits of the DB
        at com.opencloud.transaction.local.OCTransactionImpl.commit(OCTransactionImpl.java:107)
        at com.opencloud.transaction.local.OCTransactionManagerImpl.commit(OCTransactionManagerImpl.java:190)
        at com.opencloud.rhino.management.TxSupport.commitTx(TxSupport.java:33)
        ... 48 more

Resolution

Increase the size of the profile database as described in Memory Database Full

License over capacity alarms

If this occurs please contact your solution provider to purchase additional capacity licenses.

Rhino logs containing exceptions

If the exception is of the type AccessControlException, please refer to section Security Related Exceptions to solve the issue.

If the exceptions contain the text "Event router transaction failed to commit" it is likely that the in-memory database is not sized correctly for the load. Follow the diagnostic and resolution steps in Memory Database Full.

For other exceptions please contact your solution provider for support.

Resource Adaptors throwing Exceptions

If the log key for the exception log matches the tracer for a resource adaptor entity e.g. trace.sipra the cause of the problem is likely to be a faulty resource adaptor. This is indicative of either a misconfiguration of the resource adaptors in question or a bug in their implementation. If this is occurring frequently then please contact your solution provider support. Contact your solution provider for support attaching the log files that contain the exception message.

Previous page Next page

Common Symptoms

Rhino watchdog

Stuck Threads watchdog

Symptoms

Timewarp watchdog

Long Garbage Collection(GC) pause

System clock changes

Virtual Machine pauses

Excessive system IO waits

Long-running transaction warnings

Symptoms

Diagnostic steps and correction

Lock timeouts

Symptoms

Diagnostic steps and correction

I/O exceptions

I/O exceptions from Savanna in Rhino logs

Symptoms

Diagnostic steps and correction

I/O exceptions from Rhino’s logging infrastructure in Rhino logs

Symptoms

Workaround or resolution

I/O exceptions from Rhino’s core in Rhino logs

Symptoms

Workaround or resolution

I/O exceptions from resource adaptors or SBBs in Rhino logs

Inactive RA provider

Symptoms

Resolution

Leaked OCIO buffers

Symptoms

Diagnostic steps

Profile Management and/or provisioning failing

Symptoms

Diagnostic steps

Resolution

License over capacity alarms

Rhino logs containing exceptions

Resource Adaptors throwing Exceptions