Below are troubleshooting steps — symptoms, diagnostic steps, and workarounds or resolutions — for common general issues with Rhino.
The watchdog is a thread in Rhino that monitors the system clock and threads in the SLEE for strange behaviour. There are two types of threads monitored:
DeliveryThreads — internal threads inside the SLEE
StageWorkers — threads responsible for executing deployed service logic.
The watchdog thread watches for the following strange behaviour:
The stuck threads watchdog behaviour handles stuck or dying threads. Under some circumstances as discussed below, the node will leave the cluster and exit. If configured for restarting, the node will restart, rejoin the cluster and continue operating.
If a thread becomes stuck, the following appears in the logs:
2016-05-10 12:34:54.487 ERROR [watchdog.threadactive] <Global watchdog thread> Thread has become stuck: StageWorker/0
If a thread dies, the following appears in the logs:
2016-05-10 12:34:54.487 ERROR [watchdog.threadactive] <Global watchdog thread> Thread has died: StageWorker/0
If the watchdog detects that a DeliveryThread has become stuck or has died, that node will exit with the following message and a stack trace:
2016-05-10 12:34:56.509 ERROR [watchdog] <Global watchdog thread> *** WATCHDOG TIMEOUT ***
2016-05-10 12:34:56.509 ERROR [watchdog] <Global watchdog thread> Failed watchdog condition: ThreadActive:DeliveryThread/0
The watchdog monitors the number of live StageWorker threads to ensure that more than 50% of them remain alive. If fewer than 50% of the StageWorker threads have remained alive without terminating or becoming stuck, the watchdog will cause the node to exit with an error message similar to the above.
This proportion of surviving threads is configurable; for information about configuring the watchdog, see the Rhino Production Getting Started Guide.
Conditions under which a StageWorker becomes stuck or dies usually indicate that a deployed service is experiencing unexpected problems or has a faulty implementation. In this scenario, the best approach is to examine the reasons for the service causing threads to block or exit. Common culprits include excessive blocking on external resources such as JDBC connections, infinite loops, and poor handling of unexpected errors.
The watchdog timeout values for StageWorker threads can be increased; however, this is not usually recommended, as a blocked StageWorker indicates a session with a very long processing time. Increasing the timeout may be a reasonable solution when a service consistently needs to perform a blocking action that is expected to take a long time.
It is also possible to increase the number of StageWorker threads; this will improve responsiveness when a large proportion of StageWorker threads are blocked on an external service or a long computation. Increasing the StageWorker thread count is strongly recommended when increasing the watchdog timeout.
Conditions under which a DeliveryThread becomes stuck or dies usually indicate a fault in the hardware, operating system, libraries, or the Rhino SLEE itself. If this is the case, first determine whether hardware is an issue. Refer to any available documentation on diagnosing hardware problems for the platform in question.
Look at the output of the dmesg command on Unix operating systems, and check whether the machine has a history of restarts, kernel panics, process segmentation faults, and so forth.
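A quick scan of the kernel log for trouble keywords can be sketched as follows. The `dmesg` and `last` commands are standard Unix tools; the keyword list is an assumption, a starting point rather than an exhaustive set, and the sample excerpt is illustrative only.

```shell
# Keywords that often indicate hardware or kernel trouble in dmesg output.
pattern='panic|oops|segfault|mce|i/o error'

# On a live host (dmesg may require root on some distributions):
#   dmesg -T | grep -iE "$pattern"
#   last reboot | head          # restart history

# Demonstrated here against a small sample excerpt:
sample='[Tue May 10 12:00:01 2016] usb 1-1: new high-speed USB device
[Tue May 10 12:00:09 2016] myproc[1234]: segfault at 0 ip 00007f25 sp 00007f24 error 4
[Tue May 10 12:00:12 2016] EXT4-fs error (device sda1): I/O error while writing superblock'
printf '%s\n' "$sample" | grep -icE "$pattern"    # counts matching lines
```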
If a DeliveryThread becomes stuck or dies, investigate and resolve any possible hardware issues. Contact your solution provider with the stack trace information from the Rhino logs and the circumstances under which the watchdog timeout occurred, for information about what further action can be taken to resolve the issue.
The timewarp watchdog behaviour handles sudden changes in the system clock as seen from the JVM. A warning is printed in the log if the system clock changes suddenly. A critical alarm is raised whenever a timewarp is detected, and it will not be cleared automatically. This alarm is considered critical because all timewarps indicate severe problems in the underlying system that threaten cluster stability and the ability to process sessions in a timely manner.
A timewarp of 5s or greater is sufficient to break the clustering mechanism used in Rhino, and may cause the node to leave the cluster. This is done to prevent a split-brain situation or data corruption. A single-node cluster will not shut itself down due to timewarps, but this should still be considered a critical issue, as the node is unavailable during the timewarps.
Rhino may fail if system time ever goes backwards on a single node while the cluster is running, or if it goes backwards to overlap with a time period when the cluster was previously running. This is because the unique IDs being generated may no longer be unique.
If the watchdog observes a sudden change in the system clock, the following will be printed on the console:
2016-08-29 17:00:37.879 ERROR [watchdog] <Global watchdog thread> Forward timewarp detected! interval=8021 but should be at most 2000
2016-08-29 17:00:37.879 ERROR [watchdog] <Global watchdog thread> old timestamp=1188363629857, new timestamp=1188363637878
2016-08-29 17:00:37.879 ERROR [watchdog] <Global watchdog thread> Check for CPU overloading or stop-the-world GC.
2016-08-29 17:00:37.879 ERROR [watchdog] <Global watchdog thread> Check for external processes that could step the system clock forwards ('date -s', NTP in step mode, etc)
Several causes of timewarps exist.
The most likely cause of forward timewarps is garbage collection (GC); GC cannot cause a reverse timewarp.
To check GC pause times, look for entries in the console log reporting real times greater than 1s. Also look for entries reporting Full GC or concurrent mode failure; a concurrent-mode-failure entry indicates a severe problem.
2016-04-14 14:07:22.387 2016-04-14T14:07:19.279+0200: [GC2016-04-14T14:07:19.279+0200: [ParNew (promotion failed): 127844K->127844K(130112K), 3.1073460 secs] 1766087K->1792457K(3144768K), 3.1078370 secs] [Times: user=4.05 sys=0.00, real=3.11 secs]
2016-04-14 14:07:23.865 2016-04-14T14:07:22.387+0200: [Full GC2016-04-14T14:07:22.387+0200: [CMS: 1664612K->254343K(3014656K), 1.4772760 secs] 1792457K->254343K(3144768K), [CMS Perm : 77532K->77327K(196608K)], 1.4775730 secs] [Times: user=1.50 sys=0.00, real=1.47 secs]
2015-09-24 22:09:13.668 (concurrent mode failure): 22907K->8498K(32768K), 0.0504460 secs] 49900K->8498K(65344K), [CMS Perm : 15238K->15237K (196608K)], 0.0796580 secs] [Times: user=0.20 sys=0.00, real=0.08 secs]
Resolving long GC pauses almost always requires adjusting the heap size and/or new-generation size to eliminate the long pause. This must be done in tandem with performance testing to verify the effects of the changes.
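Scanning a GC log for long pauses can be sketched with standard grep/awk. The sample entries below mimic the `[Times: … real=…]` format shown in the excerpt above; on a live node, point the pipeline at the console.log file instead of the sample.

```shell
# Extract 'real=' pause times from GC log entries and flag pauses over 1s.
cat > /tmp/gc-sample.log <<'EOF'
[Times: user=4.05 sys=0.00, real=3.11 secs]
[Times: user=1.50 sys=0.00, real=1.47 secs]
[Times: user=0.20 sys=0.00, real=0.08 secs]
EOF
grep -oE 'real=[0-9.]+' /tmp/gc-sample.log \
  | awk -F= '$2 > 1 { print "long pause: " $2 "s" }'
```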
A misconfigured time synchronization daemon (usually NTP) is another common cause of timewarps. Rhino requires that NTP be run in slew mode, as step mode may make steps greater than the timewarp limit. NTP steps may be either forward or backward.
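A quick way to check whether the local ntpd is slewing rather than stepping is to look for its `-x` option, which forces slewing for offsets under 600s. This is a sketch: the daemon name and option are standard ntpd behaviour, but chrony deployments would instead check `makestep` in /etc/chrony.conf, so verify against your local setup.

```shell
# Look for a running ntpd started with the -x (slew) option.
# The '[n]tpd' bracket trick stops grep from matching its own process.
ps -ef | grep '[n]tpd' | grep -- '-x' \
  || echo 'ntpd not found running with -x (slew) option'
```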
Virtual machines may pause processing for many reasons, and may produce both forward and reverse timewarps depending on hypervisor configuration. To minimize VM pauses, we strongly recommend avoiding overcommitting resources.
Excessive system I/O waits can cause only forward timewarps. These occur when the host fails to handle disk I/O in a timely manner, blocking the JVM. This may be diagnosed through OS-level logging and I/O monitoring tools. Other causes of forward timewarps should be investigated first.
Note: VMs are particularly prone to this form of timewarp.
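One way to watch for I/O wait is the `wa` column of `vmstat` (standard on Linux) or `%iowait` from `iostat -x` (sysstat package). The snapshot below is sample data for illustration; on a live host run `vmstat 5` and watch the column over time.

```shell
# Pull the 'wa' (I/O wait) column out of a vmstat snapshot.
sample='procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  3      0 102400  20480 512000    0    0   900  1200  500  800 10  5 40 45  0'
# Data row is line 3; 'wa' is the 16th field.
printf '%s\n' "$sample" | awk 'NR==3 { print "iowait: " $16 "%" }'
```

A sustained high `wa` value while the JVM is reported blocked is consistent with this class of forward timewarp.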
Long-running transaction warning messages appear in Rhino logs if a transaction has been running for too long. Long-running transactions can be caused when:
a profile has been opened for editing but has not been committed or restored
a large deployable unit is being installed; installation is performed in a single transaction and may take a long time to compile
a single node is restarting
|Nodes load most of their state from the database in a single transaction, which can take some time.|
a service makes blocking calls to external resources that are very slow to respond
the cluster is running under excessive load.
Warning messages like the following in the Rhino logs or console indicate long-running transactions.
... WARN [transaction.manager] There are 1 long-running transactions currently active:
    WARN [transaction.manager] 43190ms: [OCTransactionImpl:TransactionId:[101:16]:ACTIVE] ...
A long-running transaction may be reported during installation of components at deployment time or when a node is starting up. In this case the warning is benign, and it will be cleared when the components have finished installing.
If this message appears only during start-up or deployment of components, then it can be safely ignored.
If such messages persist or occur during normal operation, then they should be reported to your solution provider.
Please gather the output of the dumpthreads.sh command from the node directory. This output is written to the Rhino console and is not gathered as part of Rhino's logging system; it can usually be found in the file console.log.
Please provide this output to your solution provider support.
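Extracting the most recent thread dump from console.log can be sketched as below. The file name and 'Full thread dump' marker follow standard HotSpot JVM behaviour; the sample log content is illustrative, and on a live node you would substitute the real console.log path.

```shell
# Print everything from the last 'Full thread dump' marker to end of file.
cat > /tmp/console-sample.log <<'EOF'
2016-05-10 12:00:00.000 INFO [rhino.main] node started
Full thread dump Java HotSpot(TM) 64-Bit Server VM:
"StageWorker/0" prio=5 tid=0x01 nid=0x2a runnable
EOF
awk '/Full thread dump/ { found = NR } { lines[NR] = $0 }
     END { for (i = found; i <= NR; i++) print lines[i] }' /tmp/console-sample.log
```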
Lock timeouts occur when a node cannot obtain a cluster-wide lock within a certain timeout. Lock timeouts are most likely to occur when nodes leave the cluster, when nodes are experiencing overload conditions, or when network problems are being experienced. Some possible causes are:
Nodes which are about to leave the cluster due to some failure (e.g. garbage collection pauses)
Too much contention for the lock in question, such that users of the lock are queued long enough for their requests to time out.
Too much load in general on the system, such that the lock request can't be processed by the in-memory database before it times out.
Deadlock, when the locking resources are accessed in a different order.
A transaction has somehow been "lost" and left open holding a lock; this could be a product bug.
A thread has become stuck, holding its transaction open and therefore holding its lock; this has various possible causes, including service bugs, JVM bugs, and/or product bugs.
Below are examples of warning messages in Rhino logs or console that indicate lock timeouts.
... Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT ...
... ========= Lock timeout ========= ... Unable to acquire lock on [lockId] within 5000 ms ...
The diagnostic process determines whether a failure has occurred due to environmental issues or hardware, network, or software related issues.
A hardware, network, or software related issue is usually indicated by the presence of an alarm.
To check for alarms, list the active alarms using the command-line console.
This command produces one line of output for each active alarm. If the list of active alarms contains a line similar to the following, it indicates that a node has left the cluster due to hardware, network, or software failure.
... Alarm ...(Node 101, 23-Nov-05 15:10:36.636): Major [rhino.cluster] Node 102 has left the cluster ...
If an alarm indicates node failure, follow the diagnostic and resolution procedures for Node Failure.
If the output of the command does not contain a line similar to above, or indicates no alarms are active (as below), then the symptoms are occurring because of environmental issues or software faults.
No active alarms
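The alarm check above can be scripted. The exact console command name is an assumption here (on recent Rhino releases it is something like `./client/bin/rhino-console listactivealarms`; check `rhino-console` help output for your release); the filtering step below is demonstrated against the sample alarm line from this section.

```shell
# Filter alarm-listing output for cluster-membership alarms.
sample='Alarm ...(Node 101, 23-Nov-05 15:10:36.636): Major [rhino.cluster] Node 102 has left the cluster'
printf '%s\n' "$sample" | grep -c 'has left the cluster'
```

A non-zero count indicates that a node has left the cluster and the Node Failure procedures apply.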
Overload conditions are likely to cause lock timeouts.
CPU usage should never exceed 70% for an extended period of time.
CPU usage can be measured on Linux using standard tools such as top or vmstat.
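As a rough proxy for sustained saturation, the 1-minute load average can be compared against the core count; load average is not the same thing as CPU percentage, so treat this only as a first-pass check against the 70% guideline above. Assumes Linux /proc and coreutils `nproc`.

```shell
# Compare 1-minute load average against ~70% of available cores.
cores=$(nproc)
read load1 _ < /proc/loadavg
awk -v l="$load1" -v c="$cores" \
  'BEGIN { printf "load %.2f on %d cores: %s\n", l, c, (l > 0.7 * c) ? "investigate" : "OK" }'
```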
Garbage collection, particularly under overload conditions, can cause lock timeouts.
The behaviour of the garbage collector can be analysed by examining the garbage collection logs.
By default these are logged to the Rhino console and captured in the console.log file. For further information on diagnosing GC-related problems, please refer to Java Virtual Machine Heap Issues and Application or Resource Adaptor Heap Issues.
SLEE load can also be investigated with the use of the Statistics Client. For more information about usage of the Statistics Client, please see the rhino-stats section in the Rhino Administration and Deployment Guide. Of particular value are the time statistics for the lock manager, MemDB and Savanna subsystems. If any of these report times in the 10s of milliseconds or greater this is usually a symptom of overload.
To resolve problems caused by overload it is usually necessary to add more cluster nodes on new hardware, correct the configuration of the software or improve the efficiency of the service logic. Follow the diagnostic steps in the following sections Operating Environment Issues, Java Virtual Machine Heap Issues, and Application or Resource Adaptor Heap Issues. If this does not lead to a resolution contact your solution provider for assistance.
If a lock timeout is caused by overloading the cluster, then an interim resolution is to enable a user-specified rate limiter to limit the number of events processed by the cluster. Instructions for doing so are in the Rhino Administration and Deployment Guide.
|Threshold rules should be set up to alert the SLEE administrator to adverse conditions within the SLEE.|
Deadlock, lost transactions and stuck threads are typically caused by software bugs.
These can be in Rhino, service logic, or RAs, and will often be unrelated to call rate.
To assist in diagnosing these problems observe the statistics parameter sets StagingThreads and Events.
Diagnosing the cause of these will usually require setting the log key transaction.instance to the Debug level. The logs obtained should be sent to your solution provider.
|This writes a lot of log data and can impact service performance.|
There are several causes for I/O exceptions to occur in Rhino logs. Below are the diagnostic steps and workarounds or resolution for I/O exceptions from:
For all diagnostic steps and workarounds or resolution first check:
There is sufficient disk space on the running system
Executing processes have write permissions to files in their subdirectories
The ulimit for the number of open file descriptors is not too low; depending on the protocols in use, Rhino may require a higher open-files ulimit than the OS default.
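The checklist above can be partially verified from a shell. This sketch checks the open-file-descriptor soft limit for the current shell; a Rhino node inherits the limit in effect when it was started, so for a running node the authoritative value is in /proc/<pid>/limits. The 4096 threshold is an illustrative assumption, not a Rhino requirement.

```shell
# Check the soft limit on open file descriptors for this shell.
soft=$(ulimit -Sn)
echo "open files soft limit: $soft"
if [ "$soft" != "unlimited" ] && [ "$soft" -lt 4096 ]; then
  echo "NOTE: below 4096; busy nodes may need a higher limit"
fi
# For a running node (PID placeholder, fill in the real one):
#   grep 'open files' /proc/<rhino-pid>/limits
```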
The log key for these I/O exceptions contains any of the following phrases:
These messages typically occur if the network is unable to accept multicast traffic from Rhino. This can be caused by either inappropriate kernel IP routing configuration or problems with network elements such as routers, switches, ethernet cards, firewalls, etc. Start diagnosis from Operating Environment Issues. If this does not produce a solution, continue with Clustering. Finally, contact your solution provider for support.
These messages typically occur if the Rhino installation does not have appropriate disk space and/or permissions available or there is a hardware error.
The log key for these I/O exceptions contains the following phrases:
These messages should be provided to the solution provider unless they clearly indicate the cause.
If the log messages do not occur on any of the log keys mentioned in the cases, then they are likely to come from a resource adaptor or SBB. These messages should be provided to the solution provider and possibly the resource adaptor vendor.
A service fails with an error that looks like:
java.lang.IllegalStateException: sendQuery called on inactive RA provider
The above exception indicates that you need to activate the RA entity. Documentation on managing RA entities can be found in the Rhino Administration and Deployment Guide.
Rhino logs contain the message:
2015-04-15 22:54:30.794 INFO [savanna.ocio.leaks] <OCIO Cleanup Thread> Garbage-collected 12201 leaked OCIO buffers
2015-04-15 22:54:30.794 INFO [savanna.ocio.leaks] <OCIO Cleanup Thread> Garbage-collected 12301 leaked OCIO buffers
Compare the times these warnings are logged with garbage collection cycles. Cleanup of leaked OCIO buffers happens during CMS GC cycles and all G1 collections, and it writes one log entry per 100 buffers cleaned up. The log entry reports the accumulated total of all OCIO buffers cleaned up since the node started. If this message is occurring on every GC cycle, that may indicate an environment-related performance problem.
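Lining these entries up against GC cycles is easier once the timestamps and accumulated totals are extracted. The sample entries below are illustrative (timestamps invented for the example); on a live node point the pipeline at console.log.

```shell
# Extract timestamp and accumulated buffer total from leak-cleanup entries.
cat > /tmp/ocio-sample.log <<'EOF'
2015-04-15 22:54:30.794 INFO [savanna.ocio.leaks] <OCIO Cleanup Thread> Garbage-collected 12201 leaked OCIO buffers
2015-04-15 22:56:12.101 INFO [savanna.ocio.leaks] <OCIO Cleanup Thread> Garbage-collected 12301 leaked OCIO buffers
EOF
grep 'savanna.ocio.leaks' /tmp/ocio-sample.log \
  | awk '{ print $1, $2, "total=" $(NF-3) }'
```

The totals are cumulative since node start, so the difference between consecutive entries is what was cleaned up in each GC cycle.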
Continue diagnosis of the performance problem, starting at Operating Environment Issues.
If there are no signs of overload, then the warning is harmless and indicates that a section of code in the Rhino clustering implementation has cleaned up internal objects that would have otherwise leaked.
If profile management and/or provisioning commands are unsuccessful, this is possibly due to the size restriction of the ProfileDatabase installed in Rhino. The following command can be used to monitor its size:
./client/bin/rhino-stats -m MemDB-Replicated.ProfileDatabase
Check the Rhino log for messages with a log key
profile.* similar to the following:
2016-12-01 13:41:57.081 WARN [profile.mbean] <RMI TCP Connection(2)-192.168.0.204> [foo:8] Error committing profile:
javax.slee.management.ManagementException: Cannot commit transaction
    at com.opencloud.rhino.management.TxSupport.commitTx(TxSupport.java:36)
    at com.opencloud.rhino.management.SleeSupport.commitTx(SleeSupport.java:28)
    at com.opencloud.rhino.impl.profile.GenericProfile.commitProfile(GenericProfile.java:127)
    at sun.reflect.GeneratedMethodAccessor97.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.opencloud.rhino.management.DynamicMBeanSupport.doInvoke(DynamicMBeanSupport.java:157)
    at com.opencloud.rhino.management.DynamicMBeanSupport$1.run(DynamicMBeanSupport.java:121)
    at java.security.AccessController.doPrivileged(Native Method)
    at com.opencloud.rhino.management.DynamicMBeanSupport.invoke(DynamicMBeanSupport.java:113)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
    at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
    at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.opencloud.rhino.management.BaseMBeanInterceptor.intercept(BaseMBeanInterceptor.java:12)
    at com.opencloud.rhino.management.AuditingMBeanInterceptor.intercept(AuditingMBeanInterceptor.java:245)
    at com.opencloud.rhino.management.ObjectNameNamespaceQualifierMBeanInterceptor.intercept(ObjectNameNamespaceQualifierMBeanInterceptor.java:74)
    at com.opencloud.rhino.management.NamespaceAssociatorMBeanInterceptor.intercept(NamespaceAssociatorMBeanInterceptor.java:66)
    at com.opencloud.rhino.management.RhinoPermissionCheckInterceptor.intercept(RhinoPermissionCheckInterceptor.java:56)
    at com.opencloud.rhino.management.CompatibilityMBeanInterceptor.intercept(CompatibilityMBeanInterceptor.java:130)
    at com.opencloud.rhino.management.StartupManagementMBeanInterceptor.intercept(StartupManagementMBeanInterceptor.java:97)
    at com.opencloud.rhino.management.RemoteSafeExceptionMBeanInterceptor.intercept(RemoteSafeExceptionMBeanInterceptor.java:26)
    at com.opencloud.rhino.management.SleeMBeanServerBuilder$MBeanServerInvocationHandler.invoke(SleeMBeanServerBuilder.java:38)
    at com.sun.proxy.$Proxy9.invoke(Unknown Source)
    at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
    at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
    at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1427)
    at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
    at sun.reflect.GeneratedMethodAccessor80.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
    at sun.rmi.transport.Transport$2.run(Transport.java:202)
    at sun.rmi.transport.Transport$2.run(Transport.java:199)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Transport.java:198)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:567)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:828)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.access$400(TCPTransport.java:619)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:684)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:681)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:681)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: com.opencloud.transaction.TransientRollbackException: Unable to prepare due to size limits of the DB
    at com.opencloud.transaction.local.OCTransactionImpl.commit(OCTransactionImpl.java:107)
    at com.opencloud.transaction.local.OCTransactionManagerImpl.commit(OCTransactionManagerImpl.java:190)
    at com.opencloud.rhino.management.TxSupport.commitTx(TxSupport.java:33)
    ... 48 more
Increase the size of the profile database as described in Memory Database Full.
If this occurs please contact your solution provider to purchase additional capacity licenses.
If the exception is of type AccessControlException, please refer to the section Security Related Exceptions to resolve the issue.
If the exceptions contain the text "Event router transaction failed to commit" it is likely that the in-memory database is not sized correctly for the load. Follow the diagnostic and resolution steps in Memory Database Full.
For other exceptions please contact your solution provider for support.
If the log key for the exception log matches the tracer for a resource adaptor entity, e.g. trace.sipra, the cause of the problem is likely to be a faulty resource adaptor.
This is indicative of either a misconfiguration of the resource adaptors in question or a bug in their implementation.
If this is occurring frequently then please contact your solution provider support.
Contact your solution provider for support, attaching the log files that contain the exception message.