This document contains troubleshooting help for the Rhino TAS and other Metaswitch Rhino products.

Section Troubleshooting for …​

Environmental

environmental issues with Rhino.

Clustering

clustering issues with Rhino.

Performance

Rhino performance issues.

Configuration Problems

Rhino configuration.

Management

Rhino management tools and utilities.

Common Symptoms

common general issues with Rhino.

Database

the Rhino management database persistent store.

Signalware

using Signalware with Rhino’s CGIN Resource Adaptor.

Exit Codes

Rhino exit codes.

Notices

Copyright © 2024 Microsoft. All rights reserved

This manual is issued on a controlled basis to a specific person on the understanding that no part of the Metaswitch Networks product code or documentation (including this manual) will be copied or distributed without prior agreement in writing from Metaswitch Networks.

Metaswitch Networks reserves the right to, without notice, modify or revise all or part of this document and/or change product features or specifications and shall not be responsible for any loss, cost, or damage, including consequential damage, caused by reliance on these materials.

Metaswitch and the Metaswitch logo are trademarks of Metaswitch Networks. Other brands and products referenced herein are the trademarks or registered trademarks of their respective holders.

Common Symptoms

Below are troubleshooting steps — symptoms, diagnostic steps, and workarounds or resolutions — for common general issues with Rhino.

Rhino watchdog

The watchdog is a thread in Rhino that monitors the system clock and threads in the SLEE for strange behaviour. There are two types of threads monitored:

  • DeliveryThreads — internal threads inside the SLEE

  • StageWorkers — threads responsible for executing deployed service logic.

The watchdog thread watches for the following strange behaviour:

Stuck Threads watchdog

The stuck threads watchdog behaviour handles stuck or dying threads. Under some circumstances as discussed below, the node will leave the cluster and exit. If configured for restarting, the node will restart, rejoin the cluster and continue operating.

Symptoms

If a thread becomes stuck, the following appears in the logs:

2016-05-10 12:34:54.487 ERROR [watchdog.threadactive] <Global watchdog thread> Thread has become stuck: StageWorker/0

If a thread dies, the following appears in the logs:

2016-05-10 12:34:54.487 ERROR [watchdog.threadactive] <Global watchdog thread> Thread has died: StageWorker/0

If the watchdog detects that a DeliveryThread has become stuck or has died, that node will exit with the following message and a stack trace:

2016-05-10 12:34:56.509 ERROR [watchdog] <Global watchdog thread> *** WATCHDOG TIMEOUT ***
2016-05-10 12:34:56.509 ERROR [watchdog] <Global watchdog thread> Failed watchdog condition: ThreadActive:DeliveryThread/0

The watchdog monitors the number of alive StageWorker threads to ensure that more than 50% of them are alive. If fewer that 50% of the StageWorker threads have remained alive without terminating or becoming stuck, then the watchdog will cause the node to exit with an error message similar to the above. This proportion of surviving threads is configurable; for information about configuring the watchdog, see the Rhino Production Getting Started Guide.

Conditions under which a StageWorker becomes stuck or dies are usually indicative that a deployed service is experiencing unexpected problems or has a faulty implementation. In this scenario, the best approach is to examine the reasons for the service causing threads to block or exit. Common culprits include excessive blocking on external resources such as JDBC connections, infinite loops or poor handling of unexpected circumstances.

The watchdog timeout values for StageWorker threads can be increased, However this is not usually recommended as a blocked StageWorker indicates a session with very long processing time. This may be a solution when a service consistently needs to perform a blocking action that is expected to take a long time. It is also possible to increase the number of StageWorker threads; this will improve reactiveness when a large proportion of StageWorker threads are blocking on an external service or long computation. Increasing the StageWorker thread count is strongly recommended when increasing the timeout.

Conditions under which a DeliveryThread becomes stuck or dies usually indicate that there is a fault in the hardware, operating system, libraries, or in the Rhino SLEE. If this is the case, first determine if hardware is an issue. It would be a good idea to refer to any available documentation about diagnoses of hardware problems for the platform in question. Look at the output of the dmesg command on Unix operating systems, and whether the machine has had a history of restarts, kernel panics, process segmentation faults, and so forth.

In the case of a DeliveryThread becoming stuck or dying, investigate and resolve any possible hardware issues. Contact your solution provider with the stack trace information from the Rhino logs and circumstances under which the watchdog timeout occurred for information about what further action could be taken to resolve this issue.

Timewarp watchdog

The timewarp watchdog behaviour handles the system clock suddenly changing from the JVM perspective. A warning is printed in the log if the system clock suddenly changes. A Critical alarm is raised whenever a timewarp is detected, and will not be automatically cleared. This alarm is considered critical as all timewarps indicate severe problems in the underlying system that threaten cluster stability and ability to process sessions in a timely manner.

A timewarp of 5s or greater is sufficient to break the clustering mechanism used in Rhino, and may cause the node to leave the cluster. This is done to prevent a split-brain situation, or data corruption. A single node cluster will not self shutdown due to timewarps, but this should still be considered a critical issue as the node is unavailable during the timewarps.

Rhino may fail if system time ever goes backwards on a single node while the cluster is running, or if it goes backwards to overlap with a time period when the cluster was previously running. This is because the unique IDs being generated may no longer be unique.

If the watchdog observes a sudden change in the system clock, the following will be printed on the console:

2016-08-29 17:00:37.879 ERROR [watchdog] 2016-08-29 17:00:37.879 ERROR [watchdog] 2016-08-29 17:00:37.879 ERROR [watchdog] 2016-08-29 17:00:37.879
ERROR [watchdog]
<Global watchdog thread> Forward timewarp detected! interval=8021 but should be at most 2000
<Global watchdog thread> old timestamp=1188363629857, new timestamp=1188363637878
<Global watchdog thread> Check for CPU overloading or stop-the-world GC.
<Global watchdog thread> Check for external processes that could step the system clock forwards ('date -s', NTP in step mode, etc)

Several causes of timewarps exist.

Long Garbage Collection(GC) pause

The most likely cause of forward timewarps is GC. GC cannot cause a reverse timewarp. To check GC pause times, look for entries in the console log reporting real times greater than 1s for ParNew, CMS-initial-mark, and CMS-remark. Also look for entries reporting Full GC or concurrent-mode-failure. concurrent-mode-failure indicates a severe problem.

concurrent mode failure GC log
 2016-04-14 14:07:22.387  2016-04-14T14:07:19.279+0200: [GC2016-04-14T14:07:19.279+0200: [ParNew (promotion failed): 127844K->127844K(130112K),
3.1073460 secs] 1766087K->1792457K(3144768K), 3.1078370 secs] [Times: user=4.05 sys=0.00, real=3.11 secs]
 2016-04-14 14:07:23.865  2016-04-14T14:07:22.387+0200: [Full GC2016-04-14T14:07:22.387+0200: [CMS: 1664612K->254343K(3014656K), 1.4772760 secs]
1792457K->254343K(3144768K), [CMS Perm : 77532K->77327K(196608K)], 1.4775730 secs] [Times: user=1.50 sys=0.00, real=1.47 secs]
 2015-09-24 22:09:13.668   (concurrent mode failure): 22907K->8498K(32768K), 0.0504460 secs] 49900K->8498K(65344K), [CMS Perm : 15238K->15237K
(196608K)], 0.0796580 secs] [Times: user=0.20 sys=0.00, real=0.08 secs]

Resolving long GC pauses almost always requires modifying the heap size/new size to eliminate the long pause. This must be done in tandem with performance testing to verify the effects of changes.

System clock changes

This is most commonly caused by a misconfigured time synchronization daemon (usually NTP). Rhino requires that NTP be run in slew mode, as step mode may make steps greater than the timewarp limit when enabled. NTP steps may be both forwards, and reverse.

Virtual Machine pauses

Virtual machines may pause processing for many reasons, and may produce both foward and reverse timewarps depending on hypervisor configuration. In order to minimize VM pauses, we strongly recommend avoiding overcommiting resources.

Excessive system IO waits

Excessive system IO waits may only cause foward timewarps. These occur when the host fails to handle disk IO in a timely manner, blocking the JVM. This may be diagnosed through OS level logging and IO monitoring tools. Other causes of forward timewarps should be investigated first.

Note: VM’s are particularly prone to this form of timewarp.

Long-running transaction warnings

Long-running transaction warning messages appear in Rhino logs if a transaction has been running for too long. Long-running transactions can be caused when:

  • a profile has been opened for editing but has not been committed or restored

  • a large deployable unit is being installed, done in a single transaction, and taking a long time to compile

  • a single node is restarting

Note Nodes load most of their state from the database in a single transaction, which can take some time.
  • a service makes blocking calls to external resources that are very slow to respond

  • the cluster is running under excessive load.

Symptoms

Warning messages like the following in the Rhino logs or console indicate long-running transactions.
...
WARN [transaction.manager] There are 1 long-running transactions currently active:
WARN [transaction.manager] 43190ms: [OCTransactionImpl:TransactionId:[101:16]:ACTIVE]
...

Diagnostic steps and correction

A long-running transaction may be reported during installation of components during deployment or when a node is starting up. In this case, the error message is benign; and the warning will be cleared when the components have finished installing.

If this message appears only during start-up or deployment of components, then it can be safely ignored.

If such messages persist or occur during normal operation, then they should be reported to your solution provider. Please gather the output of the dumpthreads.sh command from the node directory. This output is written to the Rhino console and is not gathered as part of Rhino’s logging system. It can usually be found in the file console.log. Please provide this output to your solution provider support.

Lock timeouts

Lock timeouts occur when a node cannot obtain a cluster-wide lock within a certain timeout. Lock timeouts are most likely to occur when nodes leave the cluster, when nodes are experiencing overload conditions, or when network problems are being experienced. Some possible causes are:

  1. Overloaded nodes

  2. Network congestion

  3. Network failure

  4. Nodes which are about to leave the cluster due to some failure (e.g. garbage collection pauses)

  5. Too much contention for the lock in question,
    that users of the lock are queued sufficiently long enough for their requests to time out.

  6. Too much load in general on the system,
    such that the lock request can’t be processed by the in-memory database before it times out.

  7. Deadlock,
    when the locking resources are accessed in a different order.

  8. A transaction has somehow been "lost" and left open holding a lock
    this could be a product bug.

  9. A thread has got stuck, holding its transaction open and therefore holding its lock,
    this has various possible causes including: service bugs, JVM bugs and/or product bugs.

Symptoms

Below are examples of warning messages in Rhino logs or console that indicate lock timeouts.

...
... Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT ... ...
...
...
... ========= Lock timeout ========= ...
...
... Unable to acquire lock on [lockId] within 5000 ms ...
...

Diagnostic steps and correction

The diagnostic process determines whether a failure has occurred due to environmental issues or hardware, network, or software related issues. A hardware, network, or software related issue is usually indicated by the presence of an alarm. In order to check for alarms, use the following rhino-console command.

./client/bin/rhino-console listactivealarms

This command produces one line of output for each active alarm. If the list of active alarms contains a line similar to the following, it indicates that a node has left the cluster due to hardware, network, or software failure.

...
Alarm ...(Node 101, 23-Nov-05 15:10:36.636): Major [rhino.cluster] Node 102 has left the cluster
...

If an alarm indicates node failure, follow the diagnostic and resolution procedures for Node Failure.

If the output of the command does not contain a line similar to above, or indicates no alarms are active (as below), then the symptoms are occurring because of environmental issues or software faults.

No active alarms

Overload conditions are likely to cause lock timeouts. CPU usage should never exceed 70% for an extended period of time. CPU usage can be measured on Linux using the top command.

Garbage collection, particularly under overload conditions can cause lock timeouts. The behaviour of the garbage collector can be analysed by examining the garbage collection logs. By default these are logged to the Rhino console or the console.log file in
node-???/work/log For further information on diagnosing GC related problems, please refer to Java Virtual Machine Heap Issues and Application or Resource Adaptor Heap Issues.

SLEE load can also be investigated with the use of the Statistics Client. For more information about usage of the Statistics Client, please see the rhino-stats section in the Rhino Administration and Deployment Guide. Of particular value are the time statistics for the lock manager, MemDB and Savanna subsystems. If any of these report times in the 10s of milliseconds or greater this is usually a symptom of overload.

To resolve problems caused by overload it is usually necessary to add more cluster nodes on new hardware, correct the configuration of the software or improve the efficiency of the service logic. Follow the diagnostic steps in the following sections Operating Environment Issues, Java Virtual Machine Heap Issues, and Application or Resource Adaptor Heap Issues. If this does not lead to a resolution contact your solution provider for assistance.

If a lock timeout is caused by overloading the cluster, then an interim resolution is to enable a user-specified rate limiter to limit the number of events processed by the cluster. Instructions for doing so are in the Rhino Administration and Deployment Guide.

Tip Threshold rules should be set up to alert the SLEE administrator to adverse conditions within the SLEE.

Deadlock, lost transactions and stuck threads are typically caused by software bugs. These can be in Rhino, service logic, or RAs, and will often be unrelated to call rate. To assist in diagnosing these problems observe the statistics parameter sets StagingThreads and Events. Diagnosing the cause of these will usually require setting the log keys transaction.manager and transaction.instance to the debug level. The logs obtained should be sent to your solution provider.

Warning This writes a lot of log data and can impact service performance.

I/O exceptions

There are several causes for I/O exceptions to occur in Rhino logs. Below are the diagnostic steps and workarounds or resolution for I/O exceptions from:

For all diagnostic steps and workarounds or resolution first check:

  1. There is sufficient disk space on the running system

  2. Executing processes have write permissions to files in their subdirectories

  3. The ulimit for the number of open file descriptors is not too low, depending on the protocols in use Rhino may require a higher open files ulimit than the OS default.

I/O exceptions from Savanna in Rhino logs

Symptoms

The log key for these I/O exceptions contain any of the following phrases:

  • savanna

  • framework

  • ocio

Diagnostic steps and correction

These messages typically occur if the network is unable to accept multicast traffic from Rhino. This can be caused by either inappropriate kernel IP routing configuration or problems with network elements such as routers, switches, ethernet cards, firewalls etc. Start diagnosis from Operating Environment Issues. If this does not produce a solution continue with bClustering. Finally contact your solution provider for support.

I/O exceptions from Rhino’s logging infrastructure in Rhino logs

Symptoms

The log key for these I/O exceptions contains the following phrases:

  • rhino.logging

  • log4j.*

Workaround or resolution

These messages typically occur if the Rhino installation does not have appropriate disk space and/or permissions available or there is a hardware error.

I/O exceptions from Rhino’s core in Rhino logs

Symptoms

The log key for these I/O exceptions contains the following phrases:

  • rhino.*

  • memdb

  • transaction.*

Workaround or resolution

These messages should be provided to the solution provider unless they clearly indicate the cause.

I/O exceptions from resource adaptors or SBBs in Rhino logs

If the log messages do not occur on any of the log keys mentioned in the cases, then they are likely to come from a resource adaptor or SBB. These messages should be provided to the solution provider and possibly the resource adaptor vendor.

Inactive RA provider

Symptoms

A service fails with an error that looks like:

IllegalStateException
java.lang.IllegalStateException: sendQuery called on inactive RA provider

Resolution

The above exception indicates that you need to activate the RA entity. Documentation on managing RA entities can be found in the Rhino Administration and Deployment Guide.

Leaked OCIO buffers

Symptoms

Rhino logs contain the message:

2015-04-15 22:54:30.794  INFO    [savanna.ocio.leaks]   <OCIO Cleanup Thread> Garbage-collected 12201 leaked OCIO buffers
2015-04-15 22:54:30.794  INFO    [savanna.ocio.leaks]   <OCIO Cleanup Thread> Garbage-collected 12301 leaked OCIO buffers

Diagnostic steps

Compare the times these warnings are logged with garbage collection cycles. Cleanup of leaked OCIO buffers happens during CMS GC cycles and all G1 collections, and it writes one log entry per 100 buffers cleaned up. The log entry reports the accumulated total of all OCIO buffers cleaned up since the node started. If this message is occurring on every GC cycle, that may indicate an environment-related performance problem.

Continue diagnosis of the performance problem, starting at Operating Environment Issues.

If there are no signs of overload, then the warning is harmless and indicates that a section of code in the Rhino clustering implementation has cleaned up internal objects that would have otherwise leaked.

Profile Management and/or provisioning failing

Symptoms

Creating or importing profiles fails.

Diagnostic steps

If profile management and/or provisioning commands are unsuccessful this is possibly due to the size restriction of the ProfileDatabase installed in Rhino. The following command can be used to monitor the size of the ProfileDatabase installed in Rhino.

./client/bin/rhino-stats -m MemDB-Replicated.ProfileDatabase

Check the Rhino log for messages with a log key profile.* similar to the following:

2016-12-01 13:41:57.081  WARN    [profile.mbean]   <RMI TCP Connection(2)-192.168.0.204> [foo:8] Error committing profile:
javax.slee.management.ManagementException: Cannot commit transaction
        at com.opencloud.rhino.management.TxSupport.commitTx(TxSupport.java:36)
        at com.opencloud.rhino.management.SleeSupport.commitTx(SleeSupport.java:28)
        at com.opencloud.rhino.impl.profile.GenericProfile.commitProfile(GenericProfile.java:127)
        at sun.reflect.GeneratedMethodAccessor97.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at com.opencloud.rhino.management.DynamicMBeanSupport.doInvoke(DynamicMBeanSupport.java:157)
        at com.opencloud.rhino.management.DynamicMBeanSupport$1.run(DynamicMBeanSupport.java:121)
        at java.security.AccessController.doPrivileged(Native Method)
        at com.opencloud.rhino.management.DynamicMBeanSupport.invoke(DynamicMBeanSupport.java:113)
        at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
        at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
        at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at com.opencloud.rhino.management.BaseMBeanInterceptor.intercept(BaseMBeanInterceptor.java:12)
        at com.opencloud.rhino.management.AuditingMBeanInterceptor.intercept(AuditingMBeanInterceptor.java:245)
        at com.opencloud.rhino.management.ObjectNameNamespaceQualifierMBeanInterceptor.intercept(ObjectNameNamespaceQualifierMBeanInterceptor.java:74)
        at com.opencloud.rhino.management.NamespaceAssociatorMBeanInterceptor.intercept(NamespaceAssociatorMBeanInterceptor.java:66)
        at com.opencloud.rhino.management.RhinoPermissionCheckInterceptor.intercept(RhinoPermissionCheckInterceptor.java:56)
        at com.opencloud.rhino.management.CompatibilityMBeanInterceptor.intercept(CompatibilityMBeanInterceptor.java:130)
        at com.opencloud.rhino.management.StartupManagementMBeanInterceptor.intercept(StartupManagementMBeanInterceptor.java:97)
        at com.opencloud.rhino.management.RemoteSafeExceptionMBeanInterceptor.intercept(RemoteSafeExceptionMBeanInterceptor.java:26)
        at com.opencloud.rhino.management.SleeMBeanServerBuilder$MBeanServerInvocationHandler.invoke(SleeMBeanServerBuilder.java:38)
        at com.sun.proxy.$Proxy9.invoke(Unknown Source)
        at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
        at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
        at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1427)
        at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
        at sun.reflect.GeneratedMethodAccessor80.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
        at sun.rmi.transport.Transport$2.run(Transport.java:202)
        at sun.rmi.transport.Transport$2.run(Transport.java:199)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:198)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:567)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:828)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.access$400(TCPTransport.java:619)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:684)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:681)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:681)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: com.opencloud.transaction.TransientRollbackException: Unable to prepare due to size limits of the DB
        at com.opencloud.transaction.local.OCTransactionImpl.commit(OCTransactionImpl.java:107)
        at com.opencloud.transaction.local.OCTransactionManagerImpl.commit(OCTransactionManagerImpl.java:190)
        at com.opencloud.rhino.management.TxSupport.commitTx(TxSupport.java:33)
        ... 48 more
Resolution

Increase the size of the profile database as described in Memory Database Full

License over capacity alarms

If this occurs please contact your solution provider to purchase additional capacity licenses.

Rhino logs containing exceptions

If the exception is of the type AccessControlException, please refer to section Security Related Exceptions to solve the issue.

If the exceptions contain the text "Event router transaction failed to commit" it is likely that the in-memory database is not sized correctly for the load. Follow the diagnostic and resolution steps in Memory Database Full.

For other exceptions please contact your solution provider for support.

Resource Adaptors throwing Exceptions

If the log key for the exception log matches the tracer for a resource adaptor entity e.g. trace.sipra the cause of the problem is likely to be a faulty resource adaptor. This is indicative of either a misconfiguration of the resource adaptors in question or a bug in their implementation. If this is occurring frequently then please contact your solution provider support. Contact your solution provider for support attaching the log files that contain the exception message.

Environmental

Below are troubleshooting steps -- symptoms, diagnostic steps, and workarounds or resolutions -- for environmental issues with Rhino.

Operating environment issues

There are several common cases where Rhino’s operating environment is not appropriately configured. These are typically related to resource competition or poor sizing for a given load. The most common causes are resource competition for CPU or heavy disk I/O.

Symptoms

  • Slow event response times

  • Low event throughput rate

  • Dropped calls

  • Periodic failures

Diagnostic steps

Operating system utilities such as vmstat, mpstat, and iostat are particularly helpful in determining whether or not a problem is caused by resource contention. Run such tools for enough time to observe the problem, and then review the logged output immediately before and after. Also look for CPU, I/O, swap, and network use. You may also want to record these continuously, for analysis of long-term system performance and transient faults.

If possible, collect system statistics no less frequently than 10s intervals, and cover every major subsystem. The UNIX standard SAR monitor also collects a useful data; and other tools may offer higher resolution or better analysis support.

Some of the specific things to look for are listed below:

More than 75% CPU utilisation on the machine

Please ensure that the system is sized such that no more than around 75% of CPU is used by Rhino under peak load with one failed node. If the machine is above this utilisation of CPU, then the machine has been incorrectly sized. If running a virtualised environment, this should be scaled to allow for failure of one physical host.

Other system processes

If other processes running on the same hardware compete for CPU for long enough, the responsiveness of a Rhino node may suffer. In our experience, operating system scheduling frequently does not provide Rhino with enough resources if there is competition for those resources. Rhino should be installed on dedicated hardware with no competing applications. Processes that can be especially demanding of CPU and I/O include databases and other Java servers.

Cron jobs

Cron jobs can cause other system processes to run which may compete for system resources. Cron jobs which are not essential should be disabled. Particular note should be taken of the locate service which will rebuild its database as a daily or weekly task, an I/O intensive procedure.

CPU Use

CPU use of the machine should be monitored to ensure that there is more than about 25% of CPU available at all times, including with one node failed. Ideally data should be collected no less frequently than every 10s, preferably with 1s intervals.

Disk I/O

High levels of disk I/O can result in increased CPU use, and can cause the Rhino JVM process to be scheduled infrequently. This can also cause writes to be scheduled synchronously, which may block threads during critical tasks such as GC. This is particularly likely on highly contended disks such as found in virtualised environments and SAN storage arrays.

Swap use

Swap use is often the cause of disk I/O. Rhino should be installed on a machine with sufficient physical RAM for the workload of the machine.

Swap will also significantly impair GC performance. If even a small amount of the JVM heap is swapped out, GC pause times can increase by up to 300% and potentially much longer for greater swap usage. While the CMS collector will usually touch pages frequently enough to keep them in physical memory, the G1 collector can leave regions untouched for long enough to be swapped out preemptively on Linux.

Network issues

Rhino’s cluster membership services use multicast UDP. If the IP network is loaded heavily, this can cause packets to be dropped, potentially causing cluster segmentation or failure if individual cluster nodes cannot reliably communicate.

See Cluster Segmentation, Cluster Failing to Start and Cluster Starts but Stops After a few Minutes for more details of multicast related problems.

Large system NUMA binding

Large systems (that is, four or more sockets per host have significant asymmetry in memory access speed and latency between CPU sockets. This can cause performance to be significantly less than optimum.

Workaround or resolution

If any of the machines Rhino is running on are running over 75% CPU use, and this is not due to other system processes, then consider:

  • configuring a use rate limiter as a workaround

  • increasing the number of available threads

  • installing more CPUs into a node

  • installing more nodes into a cluster

  • using faster CPUs.

Processes causing a problem due to resource usage should not be used on a machine running Rhino. Move these to another host.

To prevent preemptive swapping slowing GC, reduce the value of the sysctl vm.swappiness from the default of 60 to something low enough to prevent this. 30 is a good starting point. If using G1 the vm.swappiness sysctl should be set no higher than 30 and preferably 10 or lower.

When running Rhino on NUMA hosts it is usually best to run multiple nodes on each host with each JVM bound to a separate NUMA domain. Configure this with numactl --cpunodebind=X --membind=X on Linux.

Java Virtual Machine heap issues

Poor performance or failure of a Rhino node may result from a JVM configuration which has too little heap to suit the applications deployed and incoming traffic processed by the application.

This section describes how to diagnose incorrect heap sizing. See Application or resource adaptor heap issues to diagnose application or resource adaptor memory leaks.

Symptoms

  • Large garbage collection pauses in garbage collection logs

  • Longer than expected message processing and/or call setup times

  • Failed or timed-out call setup or message processing attempts

  • Out of memory error appearing in Rhino logs

  • Watchdog errors "Over memory use limit"

Diagnostic steps

There are two common causes of heap issues when running Rhino:

  • heap sizing may be incorrect

  • something may be leaking memory.

Incorrect heap sizing

The heap usage of Rhino should be monitored. This includes both the Java heap and the native heap. The Java heap can be measured by Rhino’s statistics tool rhino-stats, or one of the Java monitoring tools — jstat or jvisualvm. The native heap size can be measured using system utilities such as ps or top.

Here’s an example of using the Rhino statistics tool to monitor the size of the Java heap:

./client/bin/rhino-stats -m JVM

The behaviour of the JVM process is that a well-sized heap should gradually decrease from, say, 1.5 gigabytes free to around, say, 800 megabytes free, and then increase up to its higher limit again. This will be overlaid by a shorter pattern of peaks of say 128 megabytes over the long pattern usage. The actual figures will depend on the services that are active and deployed; and the total heap used should never exceed 60% of the heap size allocated for sustained periods when using the default CMS collector.

Component memory leak

If the JVM heap is sized appropriately for a given load, and a system appears to run stably at a given load and gradually degrades over a period of at least several hours, then it is possible that the JVM, a resource adaptor or Rhino has a memory leak.

Check the number of live activities and SBBs using the Rhino statistics monitoring tool. If the number of each are within expected values, then it is likely there is a memory leak. If there are too many, it indicates there may be a bug in the component. A strong indicator of an RA or service bug is activities that are more than ten times the age of the average call duration, or older than one whole day.

Workaround or resolution

Incorrect heap sizing

First raise the maximum heap size available to the JVM by modifying the HEAP_SIZE variable in the config/config_variables file. Ensure that there is enough physical memory on the machine such that the system will not swap if this extra memory is used by the JVM. Also set the NEW_SIZE based on the duration and memory usage of a typical session. The Rhino node must be restarted for this change to take effect. See to JVM documentation for more information on heap sizing and related overheads.

Warning

Tuning garbage collection settings may improve performance, but this is NOT RECOMMENDED without prolonged performance testing, as certain combinations of garbage collection settings may severely affect the stability of a cluster and cause JVM crashes. If you wish to explore this avenue:

  • contact your solution provider support or Metaswitch support for help

  • thoroughly load test the system for at least two weeks using the new garbage collection settings before deployment to production

  • consider adding more RAM to a machine, or more machines to the cluster.

Alternative garbage collectors such as G1, which may improve worst-case performance at some cost to average-case but have their own tuning requirements. G1 requires the use of Java 8 or newer to function reliably.

With some allocation patterns it may not be possible to find a heap size or new size that provides an acceptable balance between pause times and durations for the target load. If this is the case the only solutions are to use a different cluster size or contact your solution provider for an improved version of the service that uses memory more efficiently. This problem should only arise as a result of an error in capacity planning and testing. A workaround that is sometimes available if the host server has sufficient free RAM is to increase the heap size until it is large enough that the pauses can be avoided by restarting nodes during idle periods (however this is rarely possible).

Component memory leak

Please contact your solution provider support to work through component memory leak issues.

Application or resource adaptor heap issues

A system with application or resource adaptor related heap issues displays several symptoms, which are similar to those of an inappropriately sized JVM heap.

Symptoms

  • Large garbage collection pauses in garbage collection logs

  • Longer than expected message processing and/or call setup times

  • Failed or timed-out call setup or message processing attempts

  • Out of memory errors in Rhino logs

Diagnostic steps

When these symptoms occur it is often necessary to check both for application or resource adaptor related heap issues and for JVM heap issues.

Application or resource adaptor caused memory leaks

Start the rhino-console, and query for activities that are older than expected for the system. For example, call activities typically last the duration of a phone call; so query for activities that are older than the majority of phone calls (say one hour). To perform this query use the findactivities command with -created-before as a flag.

Here’s an example of finding activities created before 2:30pm:

findactivities -created-before 14:30:00

You can also query for activities older than a specified age, for example:

findactivities -created-before 1h

The resulting number of activities should be a low number. If it is a number larger than, say, 30 (depending on your application) then there may be an issue with some activities not being cleaned up properly by the service or resource adaptor. You should exclude ServiceActivities and ProfileActivities from the set counted.

If the number of activities is small, test for the number of SBB entities in Rhino. SBB entities are removed when they are no longer in use; so if you find a large number of SBB entities that were created a long time ago, it is likely that the service has a problem. The command to query SBB entities is the findsbbs command.

Here’s an example of finding SBB entities for the SIP proxy service created before 2:30pm:

findsbbs -service SIP\ Proxy\ Service\ 1.5,\ Open\ Cloud -created-before 14:30:00

We also recommend that you use the statistics client to examine the memory usage and number of activities in the running SLEE:

$ client/bin/rhino-stats -m Activities
2006-02-27 11:56:29.610 INFO [rhinostat] Connecting to localhost:1199
2006-02-27 11:56:33.952 INFO [rhinostat] Monitoring
2006-02-27 11:56:34.957 INFO [rhinostat] Cluster has members [101]

                                            Activities

time                         active dropped finished started
-----------------------    -----------------------------------
2006-02-27 11:58:59.574        57      -       -        -
2006-02-27 11:59:00.612        57      0      19       19
2006-02-27 11:59:01.635        57      0      21       21
2006-02-27 11:59:02.657        57      0      19       19
2006-02-27 11:59:03.875        57      0      21       21
2006-02-27 11:59:05.033        57      0      20       20
2006-02-27 11:59:06.053        57      0      20       20

Workaround or resolution

Application or resource adaptor caused memory leaks

This indicates an unhandled case in the Services or resource adaptors being used. Activities and SBB entities known to be invalid may be removed through client/bin/rhino-console using the removeactivity and removesbb commands respectively. These commands may be used to stop a system from running out of memory before a patched application or resource adaptor is installed.

Rhino start-up fails with 'java.io.IOException: Not Enough Space'

The most likely cause for this exception is that there is not enough free memory. When compiling deployable units, Rhino versions earlier than 2.3.0.7 fork off another JVM for running javac. This may double memory requirements during deployment. The exact behaviour depends on the memory management of the operating system. Since version 2.3.0.7, Rhino no longer forks to run javac unless specifically configured to for diagnostic purposes. The problem can still occur but is less likely.

Symptoms

INFO [rhino.management.deployment.builder] INFO [rhino.management.deployment.builder]
common classes java.io.IOException: Not enough space
<main> Generating profile implementation
<main> Compiling generated profile specification
   at java.lang.UNIXProcess.forkAndExec(Native Method) at java.lang.UNIXProcess.<init>(UNIXProcess.java:52) at java.lang.Runtime.exec Internal(Native Method)
at java.lang.Runtime.exec(Runtime.java:566)
at java.lang.Runtime.exec(Runtime.java:491)
at java.lang.Runtime.exec(Runtime.java:457)
WARN [rhino.management.deployment] <main> Installation of deployable unit failed:
com.opencloud.rhino.management.deployment.BuildException: javac failed - exit code -1
Failed to start rhino node
com.opencloud.rhino.management.deployment.BuildException: javac failed - exit code -1
Node 101 exited due to misconfiguration.

Diagnostic steps

Firstly, check to see how much free memory is available to the JVMs, and that swap space is available. Rhino should never run using swap space (doing so will cause unpredictable performance); but nonetheless, sufficient swap space should be present to allow the OS to manage memory more effectively.

Check the amount of physical memory available. Under Linux, use the free command.

Compare this amount to the size of the JVM specified in the HEAP_SIZE variable in node-???/config/config_variables (for Rhino-production). Make sure that the JVM can fit comfortably in the physical memory on the computer.

Workaround or resolution

Rhino versions earlier than 2.3.0.7 require about twice the amount of physical memory as the size of HEAP_SIZE to ensure that a fork() of the JVM during compilation of services will always be able to run in physical memory and not incur delays by swapping memory to and from disk. Linux systems use copy-on-write memory cloning when forking processes so the likelihood of swapping is much lower than on Solaris.

You may need to reduce the value of HEAP_SIZE by editing the file $RHINO_HOME/node-???/config/config_variables to make sure Rhino can fit in the available physical memory.

Also make sure there are no other running processes on the computer which may be using excessive system resources, and that there is sufficient swap space.

Warning about UDP buffer sizes

Rhino cluster communication requires large UDP send and receive buffers.

The operating system limits for socket transmit and receive buffers must be large enough to allow the buffer size to be set.

Symptoms

Rhino prints warnings at startup:

WARN    [savanna.ocio.socket.sc]   <main> Asked for a UDP receive buffer of 261120 bytes, but only got 212992 bytes. Check system-wide limits (net.core.rmem_max)

Resolution

Ensure that the kernel parameters net.core.rmem_max and net.core.wmem_max are large enough.

See UDP buffer sizes in the Preparing your Network section of the Rhino Production Getting Started Guide for details.

Java Virtual Machine error

The Sun Java Virtual Machine is typically very stable and, theoretically, should not crash even under high load. On rare occasions, however, the JVM can crash for any of the following reasons:

  • Memory on the local computer has been corrupted.

  • The swap partition on the local disk has had data corruption.

  • Some other aspect of the hardware in the local computer has caused a JVM crash.

  • A bug or library versioning issue in the operating system has caused the JVM to crash.

  • A bug in one of the included Java Native Interface (JNI) libraries has caused the JVM to crash.

  • A bug in the JVM itself has caused it to crash.

Symptoms

When the JVM crashes, the Rhino node will fail and the cluster will reconfigure itself. On Unix systems, the node will die with either a SIGSEGV or a SIGBUS and perhaps dump core, depending on how your system is configured.

JVM error message in Rhino logs and console

A message similar to the following appearing in logs and/or the Rhino console indicates a JVM error.

#
# An unexpected error has been detected by HotSpot Virtual Machine: #
# SIGSEGV (0xb) at pc=0xf968c5f9, pid=27897, tid=64 #
# Java VM: Java HotSpot(TM) Server VM (1.5.0_05-b05 mixed mode)
# Problematic frame:
# J com.opencloud.slee.resources.signalware.OutgoingBackendConnection.writePacket (Lcom/opencloud/slee/resources/signalware/PacketType;Lcom/opencloud/slee/resources/psc/Encodeable;)V #
#
#
# An error report file with more information is saved as hs_err_pid27897.log
# If you would like to submit a bug report, please visit: http://java.sun.com/webapps/bugreport/crash.jsp
#
#
27897 Abort - core dumped
Configuration change messages in Rhino logs and console

The following are examples of configuration change messages.

 ...
2016-04-27 12:16:24.562  INFO    [rhino.main]   <SavannaDelivery/domain-0-rhino-db-stripe-2> Group membership change: GrpCCM: domain-0-rhino-db-stripe-2 Reg [101,18108] - Memb [ 101, 102, 103, 104, 105, 106, 107 ] { BboneCCM: Reg [101,27068] - Memb [ 101, 102, 103, 104, 105, 106, 107, 200 ] }
2016-04-27 12:16:24.569  INFO    [rhino.main]   <SavannaDelivery/rhino-monitoring> Group membership change: GrpCCM: rhino-monitoring Reg [101,15192] - Memb [ 101, 102, 103, 104, 105, 106, 107 ] { BboneCCM: Reg [101,27068] - Memb [ 101, 102, 103, 104, 105, 106, 107, 200 ] }
2016-04-27 12:16:24.601  WARN    [rhino.membership.rhino-cluster]   <SavannaDelivery/rhino-cluster> Node(s) [108] left cluster operational member set
2016-04-27 12:16:24.601  INFO    [rhino.membership.rhino-cluster]   <SavannaDelivery/rhino-cluster> Current membership set: [101,102,103,104,105,106,107,200]
2016-04-27 12:16:24.619  INFO    [framework.scs]   <SavannaDelivery/rhino-management> [rhino-management]  Component change: Component(members=[101,102,103,104,105,106,107],transitioning=[101,102,103,104,105,106,107],nonPrimary=[],id=5222)
...
2016-04-27 12:22:39.041  WARN    [rhino.membership.rhino-cluster]   <SavannaDelivery/rhino-cluster> Node has left primary component
2016-04-27 12:22:39.041  INFO    [rhino.exithandler]   <SavannaDelivery/rhino-cluster> Exiting process (Node left primary component)
Alarm indicating a node has failed

An alarm appears in the console if a node has failed. Use this command to query active alarms:

./client/bin/rhino-console -listActiveAlarms

This command produces a block of output for each active alarm. A block in the list of active alarms similar to the following indicates that a node has left the cluster due to hardware, network, or software failure.

...
Alarm 101:193861858302463 [rhino.node-failure]
  Rhino node : 101
  Level      : Major
  InstanceID : 102
  Source     : (Subsystem) ClusterStateListener
  Timestamp  : 20161103 17:44:02 (active 0m 6s)
  Message    : Node 102 has left the cluster
...

Diagnstic steps

If any of these symptoms occur, then first check the logs of all Rhino cluster members (file console.log). If a JVM error message is part of the logs of any node, then a Java Virtual Machine error has occurred. If an error message cannot be found in the logs please refer to Cluster Segmentation, Cluster Failing to Start and Cluster Starts but Stops After a few Minutes

After checking Rhino’s logs, determine whether or not the hardware or operating system is causing the problem. Look at the logs of the local machine and determine whether the machine has had a history of restarts, kernel panics, process segmentation faults, and so forth. On Unix machines, system logs can be viewed using the dmesg command or by viewing logs in /var/logs.

When a JVM crash occurs, a file will be left with a name like hs_err_pid*.log. View this file and try to determine which part of the JVM caused the crash. This may provide clues as to how to resolve the situation and will be needed if a bug in the JVM is to be reported.

Workaround or resolution

If the crash appears to be a one-off, the node can simply be restarted.

If the problem resides in faulty hardware, then that hardware will need to be fixed or replaced.

Metaswitch has experience with configuring the Sun Java Virtual Machine to be as stable as possible. Consider consulting Metaswitch support for information regarding known stable configurations of JVMs. Also ensure that thorough stability tests are performed on a new version of a Java Virtual Machine before it is deployed in a production system.

Also scan through the list of dynamic libraries in the hs_err_pid*.log file and see if they are all the latest version. Keep an eye out for libraries which should not be there, for example from other installations of older JVMs.

If it is a realistic option, upgrading the operating system to the latest version may also resolve this problem.

JVM errors that cannot be identified as being caused by system configuration or dynamic library problems should be reported to Oracle.

Multicast traffic is using the wrong network interface

Symptoms

The host machines for Rhino have multiple network interfaces, and we need to control which interface is used for SLEE cluster traffic. Network monitoring shows this is currently not on the desired interface.

When the configuration differs between hosts, nodes may fail to go primary. This is indicated by repeated messages in the Rhino log indicating that a node is waiting for the cluster to go primary and alack of configuration change messages.

Messages like this show up in the Rhino console and logs:

...
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
...

Diagnostic steps

Use a network monitoring tcpdump or snoop to detect multicast traffic on the groups configured for Rhino clustering.

Use netstat -rn, route, or ip route to examine the current multicast routing configuration.

Workaround or resolution

The choice of interface is controlled solely by the machines' multicast configuration; it’s controlled at the OS level rather than by Rhino. Use route or ip route to modify the OS routing configuration. See Cluster Failing to Start for how to configure the interface used for multicast traffic.

Clustering

Below are troubleshooting steps — symptoms, diagnostic steps, and workarounds or resolutions — for clustering issues with Rhino.

Node failure

Rhino is designed to survive hardware and software failure by using a redundant cluster of nodes. There are many possible causes for node failure; below are symptoms and diagnostic steps for the most common.

In most deployments Rhino is configured to automatically restart failed nodes if the failure was caused by network or software faults. In all cases not caused by hardware faults contact your solution provider support, providing all the rhino.log and console.log files from the failed Rhino node. Restart the failed node if it did not restart automatically.

Configuration change messages appear in Rhino log output.

Symptoms

A Rhino node may determine that another node has left the primary component (due to failure of hardware, network, or software). When this happens, the output looks like this:

...
WARN [rhino.membership] Node(s) [102] left cluster admin operational member set
INFO [rhino.membership] Current membership set: [101]
...
Diagnosis and Resolution

Please see Cluster Segmentation.

An alarm indicates a node has failed.

Symptoms

An alarm appears in the console if a node has failed. To query active alarms run this command:

  ./client/bin/rhino-console -listActiveAlarms

This produces one line of log output for each active alarm. If the list of active alarms contains a line similar to the following, it indicates that a node has left the cluster due to hardware, network, or software failure.

...
Alarm ...(Node 101, 23-Nov-05 15:10:36.636): Major [rhino.cluster] Node 102 has left the cluster ...
...
Diagnosis and Resolution

Please see Cluster Segmentation.

A Rhino node exits the JVM.

Symptoms

A Rhino node may determine it is not in the cluster majority after a cluster segmentation. If this is the case it terminates.

An example of the log output from the terminated node is as follows.

INFO [rhino.main] Cluster backbone membership change: BboneCCM: Trans [103,30] - Memb [ 103 ]
INFO [rhino.main] Cluster backbone membership change: BboneCCM: Reg [103,32] - Memb [ 103 ]
...
WARN [rhino.membership] Cluster admin group has left primary component - exiting process
WARN [rhino.starter.loader] Node is no longer ready - exiting process
Tip For more information please see Cluster Segmentation.

Out of memory errors

Symptom

Out of memory errors in the console log (console.log).

JVM errors.

A message in the Rhino console log like the following indicates a JVM error.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f9c742ffe25, pid=16224, tid=0x00007f9c720c6700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build 1.8.0_92-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x5c2e25]  G1ParScanThreadState::copy_to_survivor_space(InCSetState, oopDesc*, markOopDesc*)+0x45
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
...

If such a message is present, please see Java Virtual Machine Error.

Watchdog timeouts.

Symptoms

A message like the following in the Rhino logs indicates a watchdog timeout.

2005-09-06 10:18:13.484 ERROR [watchdog] Waiting for thread cleanup..
...
2005-09-06 10:18:14.494 ERROR [watchdog] ***** KILLING NODE *****
Diagnosis and Resolution

If such a message is present please see Rhino Watchdog.

Cluster segmentation

Cluster segmentation occurs when part of a Rhino cluster determines that a cluster node or a set of nodes is no longer reachable. Nodes that determine that they no longer make up a majority of the cluster (known as a quorum) automatically deactivate themselves.

Cluster segmentation can be caused by several things, the most common being a network failure such as a switch malfunctioning.

Symptoms

  • Configuration change messages appear in Rhino log output

  • A Rhino node exits the JVM

  • An alarm indicating a node has failed

  • Watchdog timeouts

Diagnostic steps

First determine whether or not the segmentation is environmental. This means that the operating environment (see Operating Environment Issues) and the JVM Heap (see Java Virtual Machine Heap Issues) must be suitably configured. If these are configured suitably then the issue is likely to something with the network which interconnects the Rhino nodes. Typical network components to check for failure are machines, network interface cards, routers, switches, and cables.

Tip See Node failure for other causes of node failure.

Workaround or resolution

If the cause of the symptoms are environmental, then refer to the workaround or resolution steps for these issues. If the cause is a network issue, then it needs to be repaired by replacing the failed components.

Cluster failing to start

Rhino nodes must be part of the “primary component” in order to perform any work. The primary component is the set of nodes sharing the same cluster state. When booting up, nodes will wait to join the primary component before being able to perform work. This section describes the likely issues which cause a node to wait on the primary component for an extended period of time (greater than several seconds).

If the following diagnostic steps do not resolve the problem, please ensure that the various cluster machines can reach each other. For example, the ping command should indicate whether or not nodes can reach each other. If nodes cannot ping each other, then the cause is likely to be network misconfiguration or hardware failure.

No Primary Component

Symptoms

Repeated messages in the Rhino log indicating that a node is waiting for the cluster to go primary.

Messages like this show up in the Rhino console and logs:

...
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
...
Resolution

If this is occuring on all nodes, run the make-primary.sh script on one node.

Tip See the Rhino Production Getting Started Guide for more information about primary components.

Multicast Configuration Error

Symptoms

No configuration change messages in the Rhino logs.

A lack of configuration change messages in a clustered setup indicates that the Rhino node is not able to communicate with other Rhino nodes in the cluster.

Diagnostic and corrective steps This is caused by an inability for nodes to communicate. Diagnosis and correction depends on whether the cluster communication mode is Multicast or Scattercast.

Multicast

Ensure that a route for the multicast IP addresses being used by Rhino is present in the operating system.

Tip Use the route and netstat system tools for more information.

Ensure that any routers are configured to allow multicast IP packets with the configured multicast addresses to be passed to the appropriate machines.

Ensure that the following variables in the config/config_variables files on all Rhino nodes which are part of a cluster match:

SAVANNA_CLUSTER_ID
SAVANNA_CLUSTER_ADDR
SAVANNA_MCAST_START
SAVANNA_MCAST_END

Check that the host firewall is not configured to block the address/port range Savanna uses on the interface used for multicast. A simple test is to remove all firewall rules with iptables -F. Rhino uses UDP multicast to a range of IP addresses and ports set in node-<NODE_ID>/config/config_variables and node-<NODE_ID>/config/savanna/cluster.properties The IGMP membership groups 224.0.0.1 and 224.0.0.2 must also be allowed on this interface.

Verify whether or not multicast IP messages are being passed from one machine to another correctly. To see the IP traffic between cluster members, use tcpdump and/or snoop.

For example, to record packet traces for the network interface eth1 limiting to a maximum of two 500MB files:

tcpdump -C 500 -i eth1 -W 2 -w mcast_dump.pcap

The collected packet capture can be copied to the administrator’s workstation and opened with Wireshark.

Scattercast

Ensure that all scattercast endpoint addresses are routable from every host in the cluster. Check that the scattercast endpoints set is up to date on nodes waiting to go primary. If the node is sufficiently out of date to not receive from any node, it cannot detect that it is out of date and shutdown.

Check that the host firewall is not configured to block the address/port pairs and ranges used by Savanna for scattercast. A simple test is to remove all firewall rules with iptables -F. Rhino uses UDP unicast to a range of IP addresses and ports set in:

node-<NODE_ID>/config/config_variables
node-<NODE_ID>/config/savanna/cluster.properties
node-<NODE_ID>/config/savanna/scattercast.endpoints

Verify whether or not Savanna UDP IP messages are being passed from one machine to another correctly. To see the IP traffic between cluster members, use tcpdump and/or snoop.

For example, to record packet traces for the network interface eth1 limiting to a maximum of two 500MB files:

tcpdump -C 500 -i eth1 -W 2 -w mcast_dump.pcap

The collected packet capture can be copied to the administrator’s workstation and opened with Wireshark.

Cluster starts but stops after a few minutes

Note This affects multicast only.

Symptoms

  • Configuration change messages appearing in Rhino log output

  • Many or all Rhino nodes exit the JVM
    For example:

    2016-04-27 12:29:24.352  WARN    [rhino.membership.rhino-cluster]   <SavannaDelivery/rhino-cluster> Node has left primary component
  • Multiple alarms indicating a node has failed

  • Watchdog failures for the condition "GroupHeartbeat"
    For example:

    2016-04-27 12:29:24.207  ERROR   [watchdog]   <Global watchdog thread>   Failed watchdog condition: GroupHeartbeat for group rhino-monitoring (sent=33 received=23)

Diagnostic steps

Multicast registration timeout

Check the current set of registered multicast groups with netstat -gn. These should include all the groups used by Savanna and the global IGMP membership reporting address 224.0.0.1. Compare the reported set between nodes and with the set present immediately after a node has started.

Check for "Martians" in syslog. Sometimes a peer will be a device on the multicast network that is in a different IP range from the hosts.

To enable logging of these on Linux run:

echo 1 > /proc/sys/net/ipv4/conf/all/log_martians

When multicast networking is in use a network device, commonly the switch or router, acts as an IGMP querier, sending regular messages to test for active multicast devices. The OS kernel receives and responds to these queries with membership reports. If no membership reports are received before the timeout configured on the querier it will stop forwarding multicast traffic.

To report IGMP traffic (Multicast group membership queries and reports) for all network interfaces on a host:

tcpdump -i any igmp

Use tcpdump or snoop to capture IGMP traffic on the interface used for multicast. Look for Query messages and Membership Reports.

Workaround or resolution

The IGMP membership addresses 224.0.0.1 and 224.0.0.2 must be routed on the interface used for Savanna clustering as well as the configured Savanna group multicast addresses.

Configure the local switch to act as an IGMP querier. Especially important for virtualised environments

IGMP version

Sometimes you need to force the kernel to fall back to IGMPv2. Usually this will be because the switch does not support IGMPv3. To configure this, run:

/sbin/sysctl -w net.ipv4.conf.eth0.force_igmp_version=2`

Make this change permanent by adding net.ipv4.conf.eth0.force_igmp_version=2 to /etc/sysctl.conf

Martians

While uncommon for hosts in the cluster, switches are frequently in different IP ranges. The Linux kernel will drop these messages from "impossible" addresses by default as a DOS prevention measure. Also some switches will send IGMP queries on the All-Hosts group instead of the specific group Rhino joined. If your querier is on a different subnet from the nodes, disable the Martian packet safety check.

Fix this by configuring the rp_filter sysctl to accept packets from IPs that are not reachable from the cluster interface, by setting the rp_filter sysctl to 0 or 2 if the source address is on a network visible from another interface.

/sbin/sysctl -w net.ipv4.conf.default.rp_filter=0
/sbin/sysctl -w net.ipv4.conf.all.rp_filter=0

Make this change permanent by adding these to /etc/sysctl.conf:

net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.all.rp_filter=0
Tip For more information on rp_filter and other kernel multicast configuration, see https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt.

You will also need to configure the firewall to accept IGMP packets sent to 224.0.0.1 and 224.0.0.2 if your rule does not include these implicitly.

Rhino SLEE fails to start cluster groups

Typically, the failure to start a cluster group indicates a network configuration problem.

Unless configured to use scattercast, Rhino distributes state between cluster members using UDP multicast. The particular multicast addresses used are defined during installation. Multicast addresses are, by definition, in the range 224.0.0.0 through to 239.255.255.255, otherwise notated as 224.0.0.0/4.

A failure will occur if Rhino tries to create a multicast socket on a non-multicast address. The Rhino install script checks to ensure that a valid multicast address is used, so this would only occur if the Rhino configuration has been edited manually after installation.

It is possible to change the addresses used by the cluster manually, by editing the configuration on every node that has been created. Care should be taken to ensure all nodes in the cluster are configured to use the same multicast addresses.

To change these addresses, edit the file $RHINO_HOME/node-101/config/config_variables.

Symptoms

Output is usually something like this:

ERROR [savanna.stack.membership.manager] Failed to start cluster groups
java.lang.RuntimeException: Failed to change addresses to /224.0.24.0:45601

Diagnostic steps

If the cluster is configured to use scattercast follow the diagnostic procedures for Cluster Failing to Start and Scattercast endpoints out of sync. For multicast clusters follow the steps below:

The install script checks to ensure that a valid multicast range is used but this may be changed manually afterward. If the address reported is in the multicast range described above, check that your OS kernel supports multicast:

cat /boot/config-<kernel version> \grep CONFIG_IP_MULTICAST

Workaround or resolution

If the configured addresses are not all within the multicast range change the configuration to use only multicast addresses.

If the configured UDP multicast range is not used by other applications, it should work fine for Rhino. Failing that, any address range can be used in 224.0.0.0/4.

To change the addresses without reinstalling, edit the file $RHINO_BASE/node-101/config/config_variables and change the following lines to use multicast addresses not taken by other programs:

SAVANNA_CLUSTER_ADDR=224.0.24.1
SAVANNA_MCAST_START=224.0.24.1
SAVANNA_MCAST_END=224.0.24.8

The master copy of this file (used when creating additional nodes) is $RHINO_BASE/etc/defaults/config/config_variables. If several nodes are to be deployed, this can be changed before nodes are created, to save extra editing at a later stage.

Group heartbeat timeout

Rhino uses a heartbeat mechanism to ensure Savanna groups are alive and functioning correctly. This can cause a node to shut down when it detects abnormal group behaviour.

Symptoms

Output is something like this:

ERROR   [watchdog]   <Global watchdog thread> Watchdog Failure has been notified to all listeners because condition 'GroupHeartbeat for group domain-0-rhino-db-stripe-0 (sent=20 received=10)' failed
ERROR   [watchdog]   <Global watchdog thread> *** WATCHDOG TIMEOUT ***
ERROR   [watchdog]   <Global watchdog thread>   Failed watchdog condition: GroupHeartbeat for group domain-0-rhino-db-stripe-0 (sent=20 received=10)
ERROR   [watchdog]   <Global watchdog thread> ***** KILLING NODE *****
 Diagnostic and corrective steps
This is almost always caused by a networking issue that does not affect the backbone Savanna group used for cluster membership.
Diagnosis and correction is therefore identical to bxref:#cluster-failing-to-start[Cluster Failing to Start].

Also see Rhino Watchdog.

Scattercast endpoints out of sync

This issue can only occur on a cluster using scattercast communications mode. Scattercast does not permit autodiscovery of nodes. Therefore each node must a priori know the addresses of all nodes in the cluster. Whenever the persistent set of known endpoints gets out of sync on a node, it will not be able to successfully join the cluster.

Symptoms

This problem can manifest in several ways, depending on how the scattercast endpoints are out of date. One example is shown below. Almost all examples of scattercast endpoints out of sync will result in the out of date nodes shutting down.

ERROR   [savanna.stack.membership.manager]   <SavannaIO/pool-thread-1> Halting due to scattercast endpoint mismatch detection:Network advertised version: 2 does not match local version: 1
ERROR   [savanna.stack.membership.manager]   <SavannaIO/pool-thread-1> An up to date scattercast.endpoints file can be found on any running node
INFO    [rhino.exithandler]   <SavannaIO/pool-thread-1> Exiting process (Exiting due to fatal Savanna misconfiguration or conflict)

It is possible to create a situation where a node will boot with an out-of-date persistent endpoints set, and fail to shut down. If booted after a scattercast management command removes it from the endpoints set, a node will not go primary nor shut down.

This will trigger frequent cluster membership changes in the remaining nodes. The cluster membership change messages will not change the set of members.

Node 101 log fragment when Node102 booted with out of date endpoints set.
INFO    [rhino.main]   <SavannaIO/pool-thread-1> Cluster backbone membership change: BboneCCM: Trans [101,10] - Memb [ 101 ]
INFO    [rhino.main]   <SavannaIO/pool-thread-1> Cluster backbone membership change: BboneCCM: Reg [101,12] - Memb [ 101 ]
INFO    [rhino.main]   <SavannaIO/pool-thread-1> Cluster backbone membership change: BboneCCM: Trans [101,14] - Memb [ 101 ]
INFO    [rhino.main]   <SavannaIO/pool-thread-1> Cluster backbone membership change: BboneCCM: Reg [101,16] - Memb [ 101 ]
INFO    [rhino.main]   <SavannaIO/pool-thread-1> Cluster backbone membership change: BboneCCM: Trans [101,18] - Memb [ 101 ]
INFO    [rhino.main]   <SavannaIO/pool-thread-1> Cluster backbone membership change: BboneCCM: Reg [101,20] - Memb [ 101 ]

Diagnostic and corrective steps If some nodes are operational, run the Rhino console command getscattercastendpoints to learn the current cluster configuration. This will print the live in-memory configuration; and if the scattercast.endpoints file on any running node contains a different configuration, it will print the memory state for each node and the differences from the disk state.


If all nodes have the same configuration, the output will be similar to the example below:

[Rhino@host (#0)] getscattercastendpoints
[Consensus] Disk Mapping   : Coherent
[Consensus] Memory Mapping :
      [101] Address  : 192.168.4.103:18000
      [102] Address  : 192.168.4.103:18001
      [103] Address  : 192.168.4.103:18002
      [104] Address  : 192.168.4.103:18003
      [105] Address  : 192.168.4.103:18004
      [106] Address  : 192.168.4.105:18005
      [107] Address  : 192.168.4.105:18006
      [108] Address  : 192.168.4.105:18007
      [109] Address  : 192.168.4.105:18008
      [110] Address  : 192.168.4.105:18009
      [111] Address  : 192.168.4.105:18010

If the node configuration is consistent but does not contain the expected set of nodes, then the missing nodes should be added by running the addscattercastendpoints command.


If the disk mapping differs from the in-memory configuration, then the nodes with the differing configuration are likely to fail when restarted. An example output where node 101 has a malformed file that will cause a node to fail to reboot is below:

[Rhino@host (#0)] getscattercastendpoints
[101] Disk Mapping   : 192.168.4.103:18001 present twice in endpoints file
[101] Memory Mapping :
      [101] Address  : 192.168.4.103:18000
      [102] Address  : 192.168.4.103:18001
      [103] Address  : 192.168.4.103:18002
      [104] Address  : 192.168.4.103:18003
      [105] Address  : 192.168.4.103:18004
      [106] Address  : 192.168.4.105:18005
      [107] Address  : 192.168.4.105:18006
      [108] Address  : 192.168.4.105:18007
      [109] Address  : 192.168.4.105:18008
      [110] Address  : 192.168.4.105:18009
      [111] Address  : 192.168.4.105:18010

[102] Disk Mapping   : Coherent
[102] Memory Mapping :
      [101] Address  : 192.168.4.103:18000
      [102] Address  : 192.168.4.103:18001
      [103] Address  : 192.168.4.103:18002
      [104] Address  : 192.168.4.103:18003
      [105] Address  : 192.168.4.103:18004
      [106] Address  : 192.168.4.105:18005
      [107] Address  : 192.168.4.105:18006
      [108] Address  : 192.168.4.105:18007
      [109] Address  : 192.168.4.105:18008
      [110] Address  : 192.168.4.105:18009
      [111] Address  : 192.168.4.105:18010
...

If some nodes have a different configuration on disk from the live memory configuration, then copy the scattercast.endpoints file from a node that has matching disk and memory configurations.


Note See also the Repair section of Scattercast Management in the Rhino Administration and Deployment Guide.

Performance

Below are troubleshooting steps -- symptoms, diagnostic steps, and workarounds or resolutions -- for Rhino performance issues.

High Latency

High latency can be observed in services running on the Rhino SLEE if the system is not configured appropriately for the workload or if a resource leak is present. Services should always be performance tested before deployment however the tested load profile may not match the production load so sometimes live diagnosis is required. Always attempt to reproduce performance problems in a test environment before trying solutions on a live system unless the fix is both simple and obvious e.g. a faster disk or more nodes.

Symptoms

  • Long Garbage Collection pauses in Garbage Collection/console logs

  • Longer than expected processing and/or response times

  • Failed or timed out attempts in your application

Diagnostic steps

  1. Firstly, determine whether or not the operating environment is the cause of the problem. For more information on the operating environment refer to Operating Environment Issues

  2. Secondly, determine whether or not heap use is the cause of the problem. This should be considered the most likely cause when garbage collection pauses are long. For more information on heap use refer to sections Java Virtual Machine Heap Issues and Application or Resource Adaptor Heap Issues.

The following steps are to be followed if the previous two stages of diagnosis were unsuccessful. All three steps use the output from client/bin/rhino-stats monitoring the Staging Threads parameter set.

The output for the Staging Threads should typically be monitored over at least a 30 minute period in order to ensure that the system characteristics are noticed through different garbage collection cycles.

An example command which connects to a Rhino cluster member on the local host and monitors the Staging Threads parameter set is as follows.

./client/bin/rhino-stats -h localhost -m Staging\ Threads
Low number of available threads in Rhino statistics output

The number of available threads is in the avail column of the rhino-stats output. Typically in a system which has too few threads configured the number of available threads is close to the maximum number of threads however it may drop to a near zero value for a brief period of time (half a second to several seconds).

High staging queue size in Rhino statistics output

The staging queue is used by Rhino to store incoming events when there are no free threads available to process the event. In a well sized system the staging queue is almost always empty. This can be observed by looking at the queueSize column. This should be a low number, less than the number of available threads. If it is not try doubling the number of threads available to Rhino.

Dropped staging items

Dropped staged items means that queued events have been discarded due to lack of queue size to hold the item. If this value is ever non-zero it indicates that events are not being processed fast enough. This can indicate that the system is configured inappropriately with respect to the number of threads or that a Service or Resource Adaptor in Rhino has an issue. Overload conditions can also cause cause dropped staging items.

High event processing time

Event processing time should be monitored for each event type that makes up the application flows, particularly those flows showing poor performance e.g.

./client/bin/rhino-stats -h localhost -m "Events.ocsipra.[javax.sip.message.Request.INVITE net.java.slee, 1.2]"

The most important statistic in event processing is the SBB Processing Time (SBBT). This tells how long the thread processing took in the service logic and is an indicator of the theoretical upper limit on how many events/s can be processed per thread. High SBB Processing Time statistics indicate either too many staging threads, high CPU load or, more frequently, a service that has slow dependencies e.g. a database query.

The Event Router Time (ERT) statistic shows the Rhino overhead of processing the event including transaction commit time. This is dependent on Rhino overheads, SBB initial-event-selectors and the SBB’s data storage activity. High Event Router Time statistics usually indicate either too many staging threads, high CPU load or a service that stores very large amounts of data (more than 1MB per transaction) in a the replicated MemDB. It may also be a symptom of a Rhino bug.

Event Processing Time (EPT) is the equal to the sum of these and indicates the actual limit on per-thread event processing rate with the currently deployed set of services.

Workaround or Resolution

If the diagnostic steps show that there are not enough threads available, try doubling the number of staging threads. This can be achieved using the rhino-console command at runtime, or by editing the thread-count attribute of the staging queues element in config/config.xml and then restarting that node.

Note that it is possible to set the number of staging threads too high, and in doing so cause other performance problems. We do not recommend setting the number of staging threads higher than 200 on most systems or approximately 10x the number of CPU cores. Setting it too high will cause scheduler thrashing which will increase latency and reduce throughput.

If the problem persists then this indicates there is a potential issue with the Services, and/or Resource Adaptors installed in Rhino.

In the case of performance problems with services or suspected bugs in Rhino please contact your solution provider for support. If possible the problem should be reproduced in a test environment so additional logging for diagnosis does not impact live traffic.

Dropped Calls

Services running on Rhino have real-time commitments and latency for those services needs to be at an absolute minimum. “Dropped calls” occur when an activity cannot be completed in an appropriate timeframe. This has many causes. The most common causes are described and diagnostic steps for each is provided.

Symptoms

  • Longer than expected processing and/or call setup times

  • Failed or timed out call setup attempts

  • Timed out or dropped stage items

  • Rhino logs and/or console output containing exceptions

    • Resource Adaptors throwing Exceptions

  • Lock timeout messages in Rhino logs and/or console

  • Rate limiting

  • A dependant external system is not functioning properly

Diagnostic steps

  1. Check the Rhino log for exceptions. Exceptions in the logs often indicate a cause for dropped calls. If exceptions are found in the logs, follow the steps in Rhino logs and/or console output containing exceptions

  2. Follow the diagnostic steps defined in sections Operating Environment Issues, Java Virtual Machine Heap Issues and Application or Resource Adaptor Heap Issues and follow through diagnostic steps to determine whether or not they are the cause.

  3. Follow the diagnostic steps defined in section High Latency.

Rhino logs containing exceptions

If the exception is of the type AccessControlException, please refer to section Security Related Exceptions to solve the issue.

If the exceptions contain the text "Event router transaction failed to commit" it is likely that the in-memory database is not sized correctly for the load. Follow the diagnostic and resolution steps in Memory Database Full.

For other exceptions please contact your solution provider for support.

Resource Adaptors throwing Exceptions

If the log key for the exception log matches the tracer for a resource adaptor entity e.g. trace.sipra the cause of the problem is likely to be a faulty resource adaptor. This is indicative of either a misconfiguration of the resource adaptors in question or a bug in their implementation. If this is occurring frequently then please contact your solution provider support. Contact your solution provider for support attaching the log files that contain the exception message.

Lock timeout messages in Rhino logs and/or console

If lock timeout messages are occuring please refer to section Lock Timeouts. Two examples of lock timeout messages are as follows.

...
... Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT ...
...
...
... ========= Lock timeout ========= ...
... Unable to acquire lock on [lockId] within 5000 ms ...
...

If lock timeout messages are occurring please refer to Lock Timeouts.

Rate limiting

A rate limiter may be rejecting input if the incoming rate exceeds its specified rate.

The rate limiter raises an alarm and outputs to the Rhino logging system if it is actively rejecting input. An example of this is shown as follows.

2011-08-02 22:02:59.931  Major   [rhino.facility.alarm.manager]   <Timer-1> Alarm 103:212775261552:14 [SubsystemNotification[subsystem=ThresholdAlarms],LIMITING,queue-saturation-limiter-rejecting-work] was raised at 2011-08-02 22:02:59.
931 to level Major
        QueueSaturation limiter is rejecting work
2014-11-18 23:26:41.134 Major [rhino.facility.alarm.manager] <Timer-1> Alarm 101:316812519480:1 [SubsystemNotification[subsystem=ThresholdAlarms],LIMITING,system-input-limiter-rejecting-work] was raised at 2014-11-18 23:26:41.134 to level Major

A user rate limiter is designed to reject input above a certain threshold to avoid overloading Rhino. If this threshold is appropriately configured then the messages should occur very rarely. If they occur frequently then it is possible that the solution is not sized appropriately for the incoming load. In this case please contact your solution provider support.

A dependant external system is not functioning properly

This is difficult to diagnose in a general manner as it is specific to the solution running on Rhino. However the system administrator should be aware of any external systems which the solution communicates with. Examples may include rating/charging engines, HLRs, HSS, MSCs, IVRs, SMSCs, MMSCs, databases, etc.

Please refer to administrators of dependant systems and if necessary contact your solution provider support.

A newly started node is unable to handle full traffic load

Symptoms

  • A new cluster or node has been brought into service and latencies are much greater than target

  • Traffic to a new node fails with timeouts

Workaround or Resolution

Java takes some time to settle during application startup due to Just in Time(JIT) compilation of code hotspots. For this reason, we recommend slowly introducing traffic to a newly booted node. Most of the hotspots will be optimised in the first minute, however it is better to ramp up load over the course of up to 15 minutes to allow JIT compilation and optimization to occur.

Cluster size should be set during provisioning to allow sufficient reserve capacity to handle peak-hour traffic with at least one failed node. This allows a lot of flexibility in bringing a failed node back into the cluster.

The recommended approach to bringing a node back into the cluster is to use a load balancing mechanism to limit traffic to the rebooted node. (Signalware has this mechanism built in) Usually a smooth increase from zero to a fair share over the course of 5 to 15 minutes (depending on total load) allows the JIT compiler to work without causing call failures or unacceptable latencies.

Another option, if sufficient reserve capacity exists, is to simply wait until cluster load drops to a level that a new node can cope with before rebooting the failed node. This works, as the reserve capacity ensures that the node failure is not visible to callers.

For Java 7 update 45 and newer enabling tiered compilation may result in some improvement at node boot time. This will require a much, much larger code cache as well due to the more aggressive compilation. This it does not avoid the initial very slow interpreted calls, it merely begins compilation more aggressively, which improves performance faster. This option is on by default in Java 8 where the default code cache size has been increased to handle the extra data.

OPTIONS="$OPTIONS -XX:+TieredCompilation -XX:InitialCodeCacheSize=512m"

Note that to avoid performance decreasing, the default code cache size must be set when using Java 7, as the default size of 48MB is not large enough. Java 8 defaults to 240MB and fixes some performance bugs in flushing old compilation data however you may still see benefit from increasing the code cache.

Uneven CPU load/memory usage across cluster nodes

Symptoms

CPU load or memory usage is inconsistent across the cluster.

Diagnostic steps and correction

Examine the service usage statistics to rule out differences in applied load.

A small difference in CPU load is expected for nodes servicing management connections. These nodes will also consume slightly more heap and Java Permanent Generation memory. If the amount of PermGen memory used on a node is greater than 90% of the maximum a CMS GC cycle will be triggered to try to reclaim PermGen, When permgen remains above 90% CMS cycles will run almost continuously and cause a significant increase in CPU usage. Check the heap statistics for the node or the GC log entries in the console log to determine if this is the cause of the imbalance. Log entries will contain a section similar to the following with a high proportion of the Perm space used and only a small change after GC:

[CMS Perm : 186608K->186604K(196608K)]

If the problem is memory related increase the allocated heap or PermGen so that there is sufficient free space for normal operation. For more information on heap configuration see Java Virtual Machine Heap Issues

If the problem is not memory related investigate the balance of load across threads. A small number of threads consuming a large percentage of CPU time may indicate a software bug or a load profile that is unusually serial in execution. On Linux the command top -ch will display the busiest threads and the process they belong to. Run the command $RHINO_HOME/dumpthreads.sh to trigger a Java thread dump. The thread IDs in the dump can be matched to the PIDs of the threads displayed in top by converting the PID to hexadecimal. Frequently the stack traces for busy threads will be similar. This can suggest the part of the code that contains the bug and may help identify the trigger conditions. Contact your solution provider for assistance, attaching the thread dump and list of busy threads.

Configuration Problems

Below are troubleshooting steps -- symptoms, diagnostic steps, and workarounds or resolutions -- for Rhino configuration.

Rhino provides a secured environment for its management infrastructure and for deployed components such as SBBs, Resource Adaptors, Profiles and Mlets. If some component of this management infrastructure attempts to perform an unauthorized operation, then `AccessControlException`s will be reported in the logging output of Rhino. The two most common forms are described in this section.

Symptoms

The default installation of Rhino is strict about which remote machines may connect to Rhino to perform management operations. The following exceptions are common when a machine attempts to connect to Rhino that has not been explicitly configured. Examples of the two most common messages are as follows.

A refused JMX-Remote connection produces the following message:

Exception in thread "RMI TCP Connection(6)-192.168.0.38" java.security.AccessControlException: access denied (java.net.SocketPermission 192.168.0.38:48070 accept,resolve)
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:264)
at java.security.AccessController.checkPermission(AccessController.java:427)
at java.lang.SecurityManager.checkPermission(SecurityManager.java:532)
at java.lang.SecurityManager.checkAccept(SecurityManager.java:1157)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.checkAcceptPermission(TCPTransport.java:560) at sun.rmi.transport.tcp.TCPTransport.checkAcceptPermission(TCPTransport.java:208)
at sun.rmi.transport.Transport$1.run(Transport.java:152)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:149)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:460)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:701)
at java.lang.Thread.run(Thread.java:595)

This can be caused by a client/bin/rhino-console, client/bin/rhino-stats, client/bin/rhino-console, and external clients such as the Rhino Element Manager or any other application which uses Rhino’s JMX-Remote connection support.

The following message is caused by the rhino-console connecting from a host which is not in the list of accepted IP addresses:

2016-12-01 12:18:38.444  ERROR   [rhino.main]   <RMI TCP Connection(idle)> Uncaught exception detected in thread Thread[RMI TCP Connection(idle),5,RMI Runtime]: java.security.AccessControlException: access denied ("java.net.SocketPermission" "192.168.2.33:34611" "accept,resolve")
java.security.AccessControlException: access denied ("java.net.SocketPermission" "192.168.2.33:34611" "accept,resolve")
        at java.security.AccessControlContext.checkPermission(AccessControlContext.java:372)
        at java.security.AccessController.checkPermission(AccessController.java:559)
        at java.lang.SecurityManager.checkPermission(SecurityManager.java:549)
        at java.lang.SecurityManager.checkAccept(SecurityManager.java:1170)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.checkAcceptPermission(TCPTransport.java:668)
        at sun.rmi.transport.tcp.TCPTransport.checkAcceptPermission(TCPTransport.java:305)
        at sun.rmi.transport.Transport$2.run(Transport.java:201)
        at sun.rmi.transport.Transport$2.run(Transport.java:199)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:198)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:567)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:828)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.access$400(TCPTransport.java:619)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:684)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:681)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:681)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Resolution

In this case, the components can be given permission to accept connections from certain IP addresses or hostnames by entering those addresses in the appropriate section in config/mlet-permachine.conf for the production system and in config/mlet.conf for the SDK. This is described in more detail in the Rhino Administrators Guide.

If the connection is local, it may be necessary to add that local interface to the LOCALIPS variable in the config/config_variables file. Both IPv4 and IPv6 addresses will need to be specified, and that node or SDK will need to be restarted.

Symptoms

Components inside Rhino run in a security context. These components include SBBs, Resource Adaptors, Profiles and Mlets. If a component attempts to perform some operation which requires a security permission that the component does not have, then an exception is generated. An example of this is when a Resource Adaptor attempts to write to the filesystem when it does not have permission to do so.

...
WARN [rhino.remotetx] <StageWorker/RTM/1> Prepare failed:
Node 101 failed: com.opencloud.slee.ext.resource.ResourceException: Could not create initial CDR file ...
at java.lang.Thread.run(Thread.java:534)
Caused by: java.io.IOException: Access denied trying to create a new CDR logfile
...
at java.security.AccessController.doPrivileged(Native Method)
... 14 more
Caused by: java.security.AccessControlException: access denied (java.io.FilePermission ... read)
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:269)
at java.security.AccessController.checkPermission(AccessController.java:401)
at java.lang.SecurityManager.checkPermission(SecurityManager.java:524)
at java.lang.SecurityManager.checkRead(SecurityManager.java:863)
at java.io.File.exists(File.java:678)
... 19 more
...

Diagnosis and Resolution

Typically this indicates that the component is not configured appropriately.

As a step in diagnosing security-related problems, the security manager can be disabled by commenting out the following line in the read-config-variables script:

OPTIONS="$OPTIONS -Djava.security.manager"

This is not recommended as a permanent solution and should not be used on a production system. Disabling the security manager should only be used temporarily as a means to diagnose the problem in a test environment. To determine which permissions are required uncomment this line in $RHINO_HOME/read-config-variables:

#OPTIONS="$OPTIONS \-Djava.security.debug=access:failure"

Rhino will then print the permissions that are failing security checks to the console log. The required permissions should be added to the <security-permissions> section of the component deployment descriptor. For more information regarding granting additional permissions to an RA or SBB, refer to the sections on RA and SBB Deployment Descriptors in the Rhino Administration Manual.

Memory Database Full

Rhino’s in-memory database (MemDB) has a specified fixed capacity. Transactions which execute against a MemDB will not commit if the transaction would take MemDB past its fixed capacity. This typically occurs due to inadequate sizing of a Rhino installation but may indicate faulty logic in a service or resource adaptor. Log messages containing "Unable to prepare due to size limits of the DB" or "Unable to prepare due to committed size limits of MemDB Local" are a clear indicator of this problem. The symptoms are varied and are described below.

Profile Management and/or provisioning failing

Symptoms

Creating or importing profiles fails. The Rhino log contains a message with a log key profile.* similar to the following:

2016-12-01 13:41:57.081  WARN    [profile.mbean]   <RMI TCP Connection(2)-192.168.0.204> [foo:8] Error committing profile:
javax.slee.management.ManagementException: Cannot commit transaction
        at com.opencloud.rhino.management.TxSupport.commitTx(TxSupport.java:36)
        at com.opencloud.rhino.management.SleeSupport.commitTx(SleeSupport.java:28)
        at com.opencloud.rhino.impl.profile.GenericProfile.commitProfile(GenericProfile.java:127)
        at sun.reflect.GeneratedMethodAccessor97.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at com.opencloud.rhino.management.DynamicMBeanSupport.doInvoke(DynamicMBeanSupport.java:157)
        at com.opencloud.rhino.management.DynamicMBeanSupport$1.run(DynamicMBeanSupport.java:121)
        at java.security.AccessController.doPrivileged(Native Method)
        at com.opencloud.rhino.management.DynamicMBeanSupport.invoke(DynamicMBeanSupport.java:113)
        at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
        at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
        at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at com.opencloud.rhino.management.BaseMBeanInterceptor.intercept(BaseMBeanInterceptor.java:12)
        at com.opencloud.rhino.management.AuditingMBeanInterceptor.intercept(AuditingMBeanInterceptor.java:245)
        at com.opencloud.rhino.management.ObjectNameNamespaceQualifierMBeanInterceptor.intercept(ObjectNameNamespaceQualifierMBeanInterceptor.java:74)
        at com.opencloud.rhino.management.NamespaceAssociatorMBeanInterceptor.intercept(NamespaceAssociatorMBeanInterceptor.java:66)
        at com.opencloud.rhino.management.RhinoPermissionCheckInterceptor.intercept(RhinoPermissionCheckInterceptor.java:56)
        at com.opencloud.rhino.management.CompatibilityMBeanInterceptor.intercept(CompatibilityMBeanInterceptor.java:130)
        at com.opencloud.rhino.management.StartupManagementMBeanInterceptor.intercept(StartupManagementMBeanInterceptor.java:97)
        at com.opencloud.rhino.management.RemoteSafeExceptionMBeanInterceptor.intercept(RemoteSafeExceptionMBeanInterceptor.java:26)
        at com.opencloud.rhino.management.SleeMBeanServerBuilder$MBeanServerInvocationHandler.invoke(SleeMBeanServerBuilder.java:38)
        at com.sun.proxy.$Proxy9.invoke(Unknown Source)
        at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
        at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
        at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1427)
        at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
        at sun.reflect.GeneratedMethodAccessor80.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
        at sun.rmi.transport.Transport$2.run(Transport.java:202)
        at sun.rmi.transport.Transport$2.run(Transport.java:199)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:198)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:567)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:828)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.access$400(TCPTransport.java:619)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:684)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:681)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:681)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: com.opencloud.transaction.TransientRollbackException: Unable to prepare due to size limits of the DB
        at com.opencloud.transaction.local.OCTransactionImpl.commit(OCTransactionImpl.java:107)
        at com.opencloud.transaction.local.OCTransactionManagerImpl.commit(OCTransactionManagerImpl.java:190)
        at com.opencloud.rhino.management.TxSupport.commitTx(TxSupport.java:33)
        ... 48 more
Resolution and monitoring

If profile management and/or provisioning commands are unsuccessful due to the size restriction of the profile database installed in Rhino the size of the database should be increased. To resize the profile database follow the instructions in Resizing MemDB Instances to alter the size of the ProfileDatabase.

The following command can be used to monitor the size of the ProfileDatabase installed in Rhino.

./client/bin/rhino-stats -m MemDB-Replicated.domain-0-ProfileDatabase

Deployment failing

Symptoms

If deployment is unsuccessful this is possibly due to the size restriction of the ManagementDatabase installed in Rhino.

The error message and Rhino log will look similar to:

javax.slee.management.ManagementException: File storage error
        at com.opencloud.rhino.node.FileManager.store(FileManager.java:119)
        at com.opencloud.rhino.management.deployment.Deployment.install(Deployment.java:400)
        at com.opencloud.rhino.management.deployment.Deployment.install(Deployment.java:267)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at com.opencloud.rhino.management.DynamicMBeanSupport.doInvoke(DynamicMBeanSupport.java:157)
        at com.opencloud.rhino.management.DynamicMBeanSupport$1.run(DynamicMBeanSupport.java:121)
        at java.security.AccessController.doPrivileged(Native Method)
        at com.opencloud.rhino.management.DynamicMBeanSupport.invoke(DynamicMBeanSupport.java:113)
        at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
        at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
        at sun.reflect.GeneratedMethodAccessor83.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at com.opencloud.rhino.management.BaseMBeanInterceptor.intercept(BaseMBeanInterceptor.java:12)
        at com.opencloud.rhino.management.AuditingMBeanInterceptor.intercept(AuditingMBeanInterceptor.java:245)
        at com.opencloud.rhino.management.ObjectNameNamespaceQualifierMBeanInterceptor.intercept(ObjectNameNamespaceQualifierMBeanInterceptor.java:74)
        at com.opencloud.rhino.management.NamespaceAssociatorMBeanInterceptor.intercept(NamespaceAssociatorMBeanInterceptor.java:66)
        at com.opencloud.rhino.management.RhinoPermissionCheckInterceptor.intercept(RhinoPermissionCheckInterceptor.java:56)
        at com.opencloud.rhino.management.CompatibilityMBeanInterceptor.intercept(CompatibilityMBeanInterceptor.java:130)
        at com.opencloud.rhino.management.StartupManagementMBeanInterceptor.intercept(StartupManagementMBeanInterceptor.java:97)
        at com.opencloud.rhino.management.RemoteSafeExceptionMBeanInterceptor.intercept(RemoteSafeExceptionMBeanInterceptor.java:26)
        at com.opencloud.rhino.management.SleeMBeanServerBuilder$MBeanServerInvocationHandler.invoke(SleeMBeanServerBuilder.java:38)
        at com.sun.proxy.$Proxy9.invoke(Unknown Source)
        at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
        at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
        at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1427)
        at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
        at sun.reflect.GeneratedMethodAccessor82.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
        at sun.rmi.transport.Transport$2.run(Transport.java:202)
        at sun.rmi.transport.Transport$2.run(Transport.java:199)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:198)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:567)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:828)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.access$400(TCPTransport.java:619)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:684)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:681)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:681)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: javax.slee.TransactionRolledbackLocalException: Cannot commit transaction
        at com.opencloud.rhino.facilities.FacilitySupport.commitTx(FacilitySupport.java:47)
        at com.opencloud.rhino.node.FileManager.store(FileManager.java:112)
        ... 49 more
Caused by: com.opencloud.transaction.TransientRollbackException$$_Safe: Unable to prepare due to size limits of the DB
        at com.opencloud.transaction.local.OCTransactionImpl.commit(OCTransactionImpl.java:107)
        at com.opencloud.transaction.local.OCTransactionManagerImpl.commit(OCTransactionManagerImpl.java:190)
        at com.opencloud.rhino.facilities.FacilitySupport.commitTx(FacilitySupport.java:44)
        ... 50 more
Resolution and monitoring

If deployment commands are unsuccessful due to the size restriction of the management database installed in Rhino the size of the database should be increased. To resize the management database follow the instructions in Resizing MemDB Instances to alter the size of the ManagementDatabase.

The following command can be used to monitor the size of the ManagementDatabase installed in Rhino.

./client/bin/rhino-stats -m MemDB-Replicated.ManagementDatabase

Calls not being setup successfully

If calls are not being setup successfully the failures may be caused by the size restriction of either or both of the LocalMemoryDatabase and ReplicatedMemoryDatabase installed in Rhino.

If the service does not use replicated transactions the Rhino log will contain messages similar to this:

2016-11-30 09:08:01.732  WARN    [rhino.er.stage.eh]   <jr-74> Event router transaction failed to commit
com.opencloud.transaction.TransientRollbackException: Unable to prepare due to committed size limits of MemDB Local
        at com.opencloud.ob.Rhino.bO.commit(2.3-1.20-85576:108)
        at com.opencloud.ob.Rhino.me.commit(2.3-1.20-85576:191)
        at com.opencloud.ob.Rhino.AB.run(2.3-1.20-85576:95)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at com.opencloud.ob.Rhino.bq$a$a$a.run(2.3-1.20-85576:440)

Replicated services will produce log entries similar to this:

2016-11-30 09:08:01.732  WARN    [rhino.er.stage.eh]   <jr-74> Event router transaction failed to commit
com.opencloud.transaction.TransientRollbackException: Unable to prepare due to size limits of the DB
        at com.opencloud.ob.Rhino.bO.commit(2.3-1.20-85576:108)
        at com.opencloud.ob.Rhino.me.commit(2.3-1.20-85576:191)
        at com.opencloud.ob.Rhino.AB.run(2.3-1.20-85576:95)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at com.opencloud.ob.Rhino.bq$a$a$a.run(2.3-1.20-85576:440)
Resolution and monitoring

If calls are failing due to the size restriction of the in-memory database installed in Rhino the size of the database should be increased. Data should be collected before resizing to determine which database is reaching its size limit.

The following commands can be used to monitor the size of these databases installed in Rhino.

./client/bin/rhino-stats -m MemDB-Local.LocalMemoryDatabase
./client/bin/rhino-stats -m MemDB-Replicated.domain-0-DomainedMemoryDatabase
./client/bin/rhino-stats -m MemDB-Replicated.domain-0-ReplicatedMemoryDatabase

Note that it may be necessary to additionally monitor the output of the following memory databases in Rhino: MemDB-Replicated.ManagementDatabase, and MemDB-Replicated.ProfileDatabase.

If Rhino is configured to use striping, the DomainedMemoryDatabase will be divided into stripes that are each limited to a fraction of the size allocated to the MemDB instance. When using a striped MemDB it is possible for individual stripes to become full without the whole database filling. This is typically a symptom of a highly asymmetric workload or a poorly designed service. To check for full stripes monitor all the stripe statistics in the DB displaying problems. For example the DomainedMemoryDatabase stripes in an 8-stripe configuration are:

 "MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-0"
 "MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-1"
 "MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-2"
 "MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-3"
 "MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-4"
 "MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-5"
 "MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-6"
 "MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-7"

If the committed size of the whole MemDB instance reported is close to the maximum, increase the configured size of this instance. If the committed size of only one stripe is close to the maximum then the service is a poor match for striping. Reduce the number of stripes.

To resize a MemDB instance follow the instructions in Resizing MemDB Instances to alter the size of the LocalMemoryDatabase, DomainedMemoryDatabase or ReplicatedMemoryDatabase.

When monitoring the MemDB, watch the output of the churn, committedSize and maxCommittedSize fields. The churn field is the total change in the content of the Memory Database in bytes. The committedSize field is the current committed size of the Memory Database in kilobytes. The maxCommittedSize field is the maximum allowed committed size of the Memory Database in kilobytes. If the the difference between committedSize and maxCommittedSize is close to the churn or less than 100kB it is possible that the Memory Database is not committing transactions as they grow the database past its specified limit.

If the committedSize increases over time or does not approximately track the system load this may indicate a fault in a resource adaptor or service. Increase the MemDB size to prevent call failures and contact your solution provider for assistance.

For other problems that cause dropped calls refer to sections Dropped Calls, Java Virtual Machine heap issues and Application or resource adaptor heap issues.

Resizing MemDB Instances

The general workaround for MemDB sizing problems is to make the appropriate Memory Database larger. To do this, edit the node-*/config/rhino-config.xml file and look for a <memdb> or <memdb-local> entry which has a <jndi-name> with the appropriate Database name (e.g. ProfileDatabase). Increase the <committed-size> of the database to increase its maximum committed size.

Increasing the size of a Memory Database also requires increasing the maximum heap size of the JVM to accomodate the larger database. To do this, edit the node-*/config/config_variables file and increase the HEAP_SIZE by the same amount.

These steps will need to be performed for all nodes, and the nodes will need to be restarted for the change to take effect.

Initially try doubling the size of a database and monitor usage to determine the relationship between load and usage. Assuming usage is proportional to call load you may then need to alter the configured size to accomodate the highest sustained load peak. If usage does not appear to be proportional to load or the problem is not solved after a second increase in size, please contact your solution provider for support.

Resource Adaptors refuse to connect using TCP/IP

Make sure that your SLEE is in the “Running” state. A SLEE that is not running will not activate resource adaptors so will neither listen nor initiate connections.

If you have IPv6 installed on your machine, then Java may be attempting to use IPv6 rather than IPv4. Depending on the configuration of the network services and host computer’s interfaces Java may be resolving hostnames differently from the other network components. If this is the case then the resource adaptor may attempt to connect to the remote system using IPv6 when the system is only listening on an IPv4 address. Less frequently the reverse may be true.

Diagnostic steps

Use the network diagnostic tools provided with your OS distribution to check the current connected and listening ports for running programs. One common tool for this purpose is netstat:

$  netstat --inet6 -anp
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0      0 :::1202                 :::*                    LISTEN      6429/java
tcp6       0      0 :::1203                 :::*                    LISTEN      6429/java
tcp6       0      0 :::53                   :::*                    LISTEN      -
tcp6       0      0 :::22                   :::*                    LISTEN      -
tcp6       0      0 ::1:631                 :::*                    LISTEN      -
tcp6       0      0 ::1:5432                :::*                    LISTEN      -
tcp6       0      0 :::51673                :::*                    LISTEN      -
tcp6       0      0 :::9474                 :::*                    LISTEN      6429/java
tcp6       0      0 :::1199                 :::*                    LISTEN      6429/java
tcp6       0      0 :::111                  :::*                    LISTEN      -
tcp6       0      0 :::22000                :::*                    LISTEN      6429/java
tcp6       0      0 127.0.0.1:55682         127.0.0.1:5432          ESTABLISHED 6429/java
tcp6       0      0 127.0.0.1:55684         127.0.0.1:5432          ESTABLISHED 6429/java
udp6       0      0 ::1:35296               ::1:35296               ESTABLISHED -
udp6       0      0 :::45751                :::*                                -
udp6       0      0 :::5353                 :::*                                7991/plugins --disa
udp6       0      0 192.168.0.204:16100     :::*                                6429/java
udp6       0      0 :::53                   :::*                                -
udp6       0      0 :::111                  :::*                                -
udp6       0      0 fe80::9eeb:e8ff:fe0:123 :::*                                -
udp6       0      0 ::1:123                 :::*                                -
udp6       0      0 :::123                  :::*                                -
udp6       0      0 :::704                  :::*                                -
raw6       0      0 :::58                   :::*                    7           -

Check the firewall configuration on both the host running Rhino and the host providing the network function the RA connects to. Also check any intermediate network components that provide filtering functions.

Workaround or Resolution

Try adding`-Djava.net.preferIPv4Stack=true` as a command-line argument to Rhino. This option can be set by adding a line EXTRA_OPTIONS=-Djava.net.preferIPv4Stack=true to the RHINO_HOME/config_variables file. Other Java programs that connect to Rhino may also need this argument added. For other programs an RA connects to consult the documentation for that program to learn how to configure the address family to use. If a program cannot be configured appropriately you may need to disable support for IPv6 in the OS configuration.

If a firewall is configured to block the addresses used by the resource adaptor it must be reconfigured to allow connections.

Local hostname not resolved properly

In some cases the local name of a host (e.g. test-vm.example.com) is resolved wrongly (to 127.0.0.1 instead of its external address e.g. 192.168…​). This results in remote connections being refused when trying to connect via management clients from a different host.

Symptoms

  • Connections via management clients are refused

Diagnostic steps

First check that proper permissions have been given in the appropriate sections of $RHINO_HOME/config/mlet.conf (SDK) respectively $RHINO_HOME/etc/default/config/mlet-permachine.conf (Production).

Check that no matter whether you are trying to connect through the Standalone Webconsole or the Commandline Console application you get an exception similar to the one below:

...
Connection failed to mycomputer:1199
Connection failed to mycomputer:1199 (unreachable at Thu Apr 26 11:32:10 NZST 2007) javax.security.auth.login.LoginException: Could not connect
...
Caused by: com.opencloud.slee.client.SleeClientException: Could not connect to a Rhino management host
    at com.opencloud.slee.mlet.shell.spi.jmx.RhinoJmxClient.connect(RhinoJmxClient.java:186)
    at com.opencloud.slee.mlet.shell.spi.jmx.RhinoJmxClient.connect(RhinoJmxClient.java:124)
    at com.opencloud.slee.mlet.shell.spi.jmx.RhinoJmxClient.login(RhinoJmxClient.java:242)

Workaround or Resolution

Edit the contents of the file /etc/hosts. It should contain entries as shown below.

...
127.0.0.1 localhost
192.168.123.123 mycomputer
...

Entries that resolve the machine’s name to its local address should be commented out:

...
#127.0.0.1 localhost.localdomain localhost mycomputer
...

Logging configuration errors

Symptoms

  • Reconfiguring the logging subsystem fails

Diagnostic steps

First check your command parameters for errors as prompted by the error message returned.

If the error is unclear or the parameters appear to be correct, read the logging status log node-???/work/logging-status.log. The messages in this file may provide sufficient detail to resolve the problem. Check for known issues

Workaround or resolution

Fix the configuration command parameters, avoiding configuration combinations known to be faulty.

If the configuration appears to be correct contact your solution support provider for assistance.

Management

Below are troubleshooting steps — symptoms, diagnostic steps, and workarounds or resolutions — for Rhino management tools and utilities.

Connections Refused for the Command Console, Deployment Script or Rhino Element Manager

The remote management clients can not connect to the Rhino SLEE

Symptoms

The management clients show the following error when attempting to connect to the SLEE:

user@host:~/rhino/client/bin$ ./rhino-console
Could not connect to Rhino:
  [localhost:1199] Connection refused
    -> This normally means Rhino is not running or the client is connecting to the wrong port.

Use -D switch to display connection debugging messages.
Could not connect to Rhino:
  [localhost:1199] No route to host

Use -D switch to display connection debugging messages.
Could not connect to Rhino:
  [localhost:1199] Could not retrieve RMI stub
    -> This often means the m-let configuration has not been modified to allow remote connections.

Use -D switch to display connection debugging messages.
BUILD FAILED
~/rhino/client/etc/common.xml:99: The following error occurred while executing this line:
~/rhino/client/etc/common.xml:77: error connecting to rhino: Login failed

Diagnostic steps and correction

Rhino is not listening for management connections

First, check that there is a running Rhino node on the host the client is trying to connect to. Use the ps command to check that the Rhino process is running, e.g. ps ax | grep Rhino. If Rhino is running, check the rhino.log to determine if the node has joined the primary component and started fully. If the Rhino node is failing to join the primary component or otherwise failing to fully start then consult the Clustering troubleshooting guide.

Make sure that the remote host is accessible using the ping command. Alternatively, make sure that you can log in to the remote host using ssh to make sure the network connection is working (some firewalls block ping).

Rhino refuses connections

By default, Rhino is set up to not allow remote connections by management clients. Permissions to do so need to be manually configured before starting the SLEE, as described in the next section.

The management clients connect to Rhino via SSL secured JMX connections. These require both a client certificate and permission to connect configured in the Java security configuration for Rhino.

To allow remote connections to the SLEE, the MLet configuration file will need to be edited. On the SDK version of Rhino, this is in $RHINO_HOME/config/mlet.conf and for the Production version of Rhino, this is in $RHINO_HOME/node-???/config/mlet-permachine.conf for each node.

Edit the MLet configuration file and add the following permission to the JMXRAdaptor MLet security-permission-spec. This should already be present but commented out in the file. You will need to replace “host_name” with either a host name or a wildcard (e.g. *).

grant {
    permission java.net.SocketPermission "{host_name}", "accept,resolve";
}

It is also possible that the Rhino SLEE host has multiple network interfaces and has bound the RMI server to a network interface other than the one that the management client is trying to connect to.

If this is the case then the following could be added to file:$RHINO_HOME/read-config-variables for the SDK or file:$RHINO_HOME/node-???/read-config-variables for the Production version of Rhino:

OPTIONS="$OPTIONS -Djava.rmi.server.hostname={public IP}"

Rhino will need to be restarted in order for any of these changes to have effect. For the SDK, this simply means restarting it. For the Production version, this means restarting the particular node that has had these permissions added.

Management client is not configured to connect to the Rhino host

Make sure that the settings for the management clients are correct. For rhino-console, these are stored in client/etc/client.properties.

You can also specify the remote host and port to connect to using the -h <hostname> and -p <port> command-line arguments. If the SLEE has been configured to use a different port than the standard one for management client connections (and this has not been configured in the client/etc/client.properties files), then the port will also need to be specified on the command-line arguments.

If connecting to localhost then the problem is likely to be a misconfigured /etc/hosts file causing the system to resolve localhost to an address other than 127.0.0.1.

For Ant deployment scripts run with ant -v and ant will tell you the underlying exception which will provide more detail.

To run the command console or run deployment scripts from a remote machine:

  1. Copy $RHINO_HOME/client to the host

  2. Edit the file client/etc/client.properties and change the remote.host property to the address of the Rhino host

  3. Make sure your Ant build script is using the correct client directory. The Ant property ${client.home} must be set to the location of your client directory

A Management Client Hangs

The management clients use SSL connections to connect securely to Rhino. To generate keys for secure connections, these read (and block doing so) from the /dev/random device. The /dev/random device gathers entropic data from the current system’s devices, but on an idle system it is possible that the system has no entropy to gather, meaning that a read on /dev/random will block.

Symptoms

A management client hangs for a long period of time on start-up as it tries to read from /dev/random.

Workaround or Resolution

The ideal problem resolution is to create more system entropy. This can be done by wiggling the mouse, or on a remote server by logging in and running top or other system utilities. Refer also to the operating system’s documentation, on Linux this is the random(4) man page: man 4 random.

Statistics client reports “Full thread sample containers”

If statistics gathering is done at a sampling rate which is set too high, the per-thread sample containers may fill before the statistics client can read the statistics out of those containers.

Symptoms

When gathering statistics, the following may appear in the logs:

2006-10-16 12:59:26.353 INFO [rhino.monitoring.stats.paramset.Events]
<StageWorker/Misc/1> [Events] Updating thread sample statistics
found 4 full thread sample containers

This is a benign problem and can be safely ignored. The reported sample statistics will be slightly inaccurate. To prevent it, reduce the sampling rate.

Statistics Client Out of Memory

When running in graphical mode, the statistics client will, by default, store 6 hours of statistical data. If there is a large amount of data or if the statistics client is set to gather statistics for an extended period of time, it is possible for the statistics client to fail with an Out of Memory Exception.

Symptoms

The statistics client will fail with an Out of Memory Exception.

Workaround or Resolution

The user is recommended to use the -k option of the statistics client when running in graphical mode to limit the number of hours of statistics kept. If it is required that statistics be kept for a longer period, it is recommended that the statistics client be run in command-line mode and the output be piped to a text file for later analysis.

For more information, run client/bin/rhino-stats without any parameters. This will print a detailed usage description of the statistics client.

Creating a SyslogAppender gives an AccessControlException

Creating a SyslogAppender using the following entry in logging.xml will not work, as the appender does not perform it’s operations using the proper security privileges:

<appender appender-class="org.apache.log4j.net.SyslogAppender" name="SyslogLog">
    <property name="SyslogHost" value="localhost"/>
    <property name="Facility" value="user"/>
</appender>

Symptoms

The following error would appear in Rhino’s logs:

2006-10-19 15:16:02.311 ERROR [simvanna.threadedcluster] <ThreadedClusterDeliveryThread> Exception thrown in delivery thread java.security.AccessControlException: access denied (java.net.SocketPermission 127.0.0.1:514 connect,resolve)
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:264)
at java.security.AccessController.checkPermission(AccessController.java: 427)
at java.lang.SecurityManager.checkPermission(SecurityManager.java:532)
at java.lang.SecurityManager.checkConnect(SecurityManager.java:1034)
at java.net.DatagramSocket.send(DatagramSocket.java:591)
at org.apache.log4j.helpers.SyslogWriter.write(SyslogWriter.java:69)
at org.apache.log4j.helpers.QuietWriter.write(QuietWriter.java:39)
at org.apache.log4j.helpers.SyslogQuietWriter.write(SyslogQuietWriter.java:45)
at org.apache.log4j.net.SyslogAppender.append(SyslogAppender.java:245)

Workaround or Resolution

For the case where a SyslogAppender is required, the createsyslogappender command of the rhino-console provides a much easier user interface to achieve this task.

Replacing the entry org.apache.log4j.net.SyslogAppender above with com.opencloud.rhino.logging.RhinoSyslogAppender will also fix this problem. The Open Cloud version of the SyslogAppender is a simple wrapper around the Log4J version which wraps the append(LoggingEvent event) method in a “doPrivileged” block. For custom appenders not provided with Rhino, the same method can be used:

public void append(final LoggingEvent event) {
    AccessController.doPrivileged(new PrivilegedAction() {
        public Object run() {
            RhinoSyslogAppender.super.append(event);
            return null;
        }
    });
}

Platform Alarms

Rhino raises alarms in various situations, some of which are discussed in this section for troubleshooting purposes. The full list of current Rhino core alarms is available using the alarmcatalog command in the rhino-console.

Symptoms

  • Alarm notification messages in Rhino logs

  • Alarms appearing in network management systems

  • Entries present in rhino-console command listactivealarms

Diagnostic steps

The presence of an alarm may be viewed by the output of the following command.

./client/bin/rhino-console listactivealarms
  1. Upon the loss of a node from the cluster an alarm with alarm type of rhino.node-failure and an alarm source of ClusterStateListener is raised. This alarm is cleared either by the administrator or when the node rejoins the cluster. This alarm is not raised for quorum nodes.

  2. If a user rate limiter capacity is exceeded an alarm with an alarm source of ThresholdAlarms is raised. This alarm is cleared when the event rate drops below the limiters configured capacity.

  3. If a JMX mlet cannot be started successfully an alarm with an alarm source of MLetStarter is raised. These alarms must be cleared manually.

  4. If the rule for any user-defined threshold-based alarm is met an alarm with a user defined alarm type and alarm source is raised. These alarms are cleared when the rule condition is no longer met or if the administrator clears them

  5. The licenses installed on the platform are insufficient for the deployed configuration: The alarm type is “rhino.license” and the alarm source is “LicenseManager”.

    1. A license has expired.

    2. A license is due to expire in the next seven days.

    3. A license units are being processed for a currently unlicensed function.

    4. The consumption rate for a particular license is greater than the consumption rate which the license allows.

The alarm type used in notifications that are reporting an alarm has cleared is the original alarm type plus .clear, for example rhino.node-failure.clear.

Workaround or Resolution

Alarms with an alarm source of ThresholdAlarms indicate that the system is receiving more input that it has been configured to receive.

Alarms with an alarm source of LicenseManager indicate that a Rhino installation is not licensed appropriately. Other alarms are either user defined licenses, or defined by an Application or Resource Adaptor.

Alarms with an alarm source of MLetStarter or some other non-obvious key usually indicate a software issue, such as a misconfiguration of the installed cluster.

In most of these cases, the remedy is to contact your solution support provider for a new license or for instructions on how to remedy the situation.

DeploymentException when trying to deploy a component

Symptoms

DeploymentException when trying to deploy a component.

Diagnostic steps

The error message or Rhino log contains an error similar to
Native Library XXXXLib.so already loaded in another classloader

Each time you deploy the RA, it happens in a new classloader (because the code may have changed). If no class GC has happened, or if something is holding a reference to the old classloader and keeping it alive, the old library will still be loaded as well. See http://java.sun.com/docs/books/jni/html/design.html#8628

The DU jar was modified from the original structure e.g.

modifying the deployable-unit.xml file within it and the output of jar -tvf looks similar to the below:

$ jar -tvf service.jar
1987 Wed Jun 13 09:34:02 NZST 2007 events.jar
76358 Wed Jun 13 09:34:02 NZST 2007 sbb.jar
331 Wed Jun 13 09:34:02 NZST 2007 META-INF\deployable-unit.xml
106 Wed Jun 13 09:34:02 NZST 2007 META-INF\MANIFEST.MF
693 Wed Jun 13 09:34:02 NZST 2007 service-jar.xml

Workaround or Resolution

Native Library XXXXLib.so already loaded in another classloader
  • Restart Rhino nodes before redeployment

  • Force a full GC manually before redeployment (Requires Rhino to be configured with -XX:-DisableExplicitGC)

  • Change the JNI library name whenever redeploying

  • Ensure the classes that use JNI are loaded by a higher-level classloader, e.g. the Rhino system classloader or a library. (of course, that also means you can’t deploy new versions of those classes at runtime)

Your jar contains files that contain backslashes ("\").

Jars always use forward slashes ("/") as a path separator. Repackage the DU jar with a different file archiver, preferrably the jar tool.

Deploying to multiple nodes in parallel fails

Symptoms

You are deploying Rhino using a script that creates and deploys components to multiple nodes asynchronously. The deployment fails with one of the following exceptions on each node. When deploying the nodes serially, one after the other, no exceptions are reported.

WARN [rhino.management.deployment]Installation of deployable unit failed:
javax.slee.management.AlreadyDeployedException: URL already installed: file:/opt/rhino/apps/sessionconductor/rhino/dist/is41-ra-type_1.2-du.jar
at com.opencloud.rhino.management.deployment.Deployment.install(4276)
[WARN, rhino.management.resource, RMI TCP Connection(4)-192.168.84.173] -->
Resource adaptor entity creation failed: java.lang.IllegalStateException: Not in primary component
at com.opencloud.ob.Rhino.runtime.agt.release(4276)
...

Diagnostic steps

Rhino provides single system image for management. You do not need to deploy a DU on each node in a cluster. Installing a deployable unit on any node in Rhino cluster propagates that DU to all nodes in the cluster, so if the DU had already deployed via node 102, it can’t also be deployed via node 101.

In addition, if a new node is created and joins a running cluster, it will be automatically synchronised with the active cluster members (i.e DUs installed, service states, log levels, trace levels, alarms etc).

A Rhino cluster will only allow one management operation that modifes internal state to be executed at any one time, so you can’t, for example, install a DU on node 101 and a DU on node 102 at the same time. One of the install operations will block until the other has finished. You can run multiple read-only operations simultaneously, though.

Workaround or Resolution

Create all nodes from the same base install, optionally starting the nodes in parallel. Wait for the nodes to start then run the component installation script against only one node.

Management of multiple Rhino instances

Symptoms

You are trying to use rhino-console to talk to multiple rhino instances however it will not connect to the second instance.

Workaround or Resolution

Unfortunately it is not possible to store keys for multiple Rhino instances in the client’s keystore, they are stored using fixed aliases. With the current implementation, there are two ways to connect to multiple Rhino instances from a single management client:

  1. Copy the rhino-private.keystore to all the Rhino home directories so that all instances have the same private key on the server. This may be adequate for test environments.

  2. Create a copy of client.properties that points to a different client keystore, and tweak the scripts to parameterise the client.properties Java system property. Example:

OPTIONS="$OPTIONS -Dclient.properties=file:$CLIENT_HOME/etc/${RMISSL_PROPERTIES:client.properties}"

If doing this you may also want to parameterise the keystore password to restrict access to authorised users.

Deployment problem on exceeding DB size

Symptoms

Deployment fails with Unable to prepare due to size limits of the DB

Diagnostic steps

See Memory Database Full for how to diagnose and resolve problems with the size of the Rhino in-memory databases, including the management database.

BUILD FAILED when installing an Metaswitch product

Symptoms

Installation fails with an error like:

$:/opt/RhinoSDK/cgin-connectivity-trial-1.5.2.19 # ant -f deploy.xml
Buildfile: deploy.xml

management-init:
     [echo] Open Cloud Rhino SLEE Management tasks defined

login:

BUILD FAILED
/opt/RhinoSDK/client/etc/common.xml:102: The following error occurred while executing this line:
/opt/RhinoSDK/client/etc/common.xml:74: No supported regular expression matcher found: java.lang.ClassNotFoundException: org.apache.tools.ant.util.regexp.Jdk14RegexpRegexp

Total time: 0 seconds

Diagnostic steps

Run Ant with debugging output to check which version of Ant is being used and the classpath

ant -d -f deploy.xml > output.txt

Workaround or Resolution

Add the missing libraries to your Ant lib directory

We recommend you use the Ant version shipped with Rhino to avoid this problem e.g.

/opt/RhinoSDK/client/bin/ant -f deploy.xml

REM connection failure during management operations

Symptoms

Performing a management operation, e.g. activating an RA entity, fails with the following error:

Could not acquire exclusive access to Rhino server

Diagnostic steps

The message is sometimes seen when Rhino is under load and JMX operations are slow to return. Check the CPU load on the Rhino servers.

REM Exceptions of this type can also occur when stopping or starting the whole cluster.

When REM auto-refresh interval is set to low value (default is 30 seconds) there is high likelihood of a lock collision happening. With higher auto-refresh intervals the likelihood drops down. With auto-refresh interval set to "Off" the exception may not occur at all.

If the rem.interceptor.connection log key is set to DEBUG in REM’s log4j.properties, then the logs will show which operations could not acquire the JMX connection lock.

Workaround or Resolution

If the CPU load on the Rhino server is high then follow the resolution advice in Operating environment issues.

If the auto-refresh interval is low then increase it until the problem stops.

For further diagnostic and resolution assistance contact Open Cloud or your solution provider, providing the REM logs.

Export error: Multiple Profile Snapshot for profiles residing in seperate memdb instances is unsupported

Symptoms

Trying to export Rhino configuration with rhino-export fails with an error like:

com.opencloud.ui.snapshot.SnapshotClientException: Multiple Profile Snapshot for profiles residing in seperate memdb instances is unsupported
at com.opencloud.ob.client.be.a(80947:202)
at com.opencloud.ui.snapshot.SnapshotClient.performProfileTableSnapshot(80947:294)
at com.opencloud.rhino.management.exporter.Exporter.b(80947:382)
at com.opencloud.rhino.management.exporter.Exporter.a(80947:350)
at com.opencloud.rhino.management.exporter.Exporter.run(80947:291)
at com.opencloud.rhino.management.exporter.Exporter.main(80947:201)
Press any key to continue...

Workaround or Solution

Run rhino-export with the -J option to use JMX for exporting the profile data. This is slightly less efficient but can handle multiple profile storage locations.

Unused log keys configured in Rhino

Symptoms

After installing multiple versions of a service on Rhino listlogkeys reports a number of obsolete keys. How can these be removed?

Workaround or Resolution

The unused log keys are marked for removal - they should disappear when the Rhino JVM restarts. Until then they are harmless.

Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT

Symptoms

Any management operation done on the cluster fails with:

2014-06-12 11:40:39.156  WARN    [rhino.management.trace]   <RMI TCP Connection(79)-10.240.83.131> Error setting trace level of root tracer for SbbNotification[service=ServiceID[name=TST,vendor=XXX,version=1.2.0],sbb=SbbID[na
me=SDMSbb,vendor=XXX,version=1.2.0]]: com.opencloud.savanna2.framework.lock2.LockUnavailableException: Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT, current owners=[TransactionId:[101:15393
0147780781]]
        at com.opencloud.ob.Rhino.aR.a(2.3-1.12-72630:662)
        at com.opencloud.ob.Rhino.aR.acquireExclusive(2.3-1.12-72630:65)
        at com.opencloud.rhino.management.trace.Trace.setTraceLevel(2.3-1.12-72630:256)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

Diagnostic steps

The management lock is acquired and held for the duration of all management operations to prevent concurrent modification of Rhino state and on-disk data. To find out more information about the transaction holding the lock use the gettransactioninfo console command. To find out what method is blocking release of the lock use jstack $(cat node-???/work/rhino.pid) or kill -QUIT $(cat node-???/work/rhino.pid) to dump the current thread state to the Rhino console log.

Workaround or resolution

Contact your solution provider with the Rhino logs showing the problem and a list of the management operations that were performed immediately prior to the one that timed out. If the management operation is permanently blocked, e.g. by an infinite loop in the raStopping() callback of an RA, the cluster will need to be restarted to interrupt the stuck operation. If it is not permanently blocked you must wait until the operation has finished.

Log level for trace appender not logging

Symptoms

Setting the log level for a logger trace.??? does not change the level of information logged under this key. The log level was set using the setloglevel command.

Workaround or Resolution

The log keys trace.??? are special. These are SLEE tracers which have their own level configuration that they feed into the logging subsystem. Use the settracerlevel command to set tracer levels for RA entities and SBBs.

Access to REM fails with Command CHECK_CONNECTION invoked without connection ID

Symptoms

After updating or reinstalling REM access fails with an error in the REM log similar to:

2012-02-24 16:16:17.204  ERROR   [rem.server.http.connection]   <btpool0-6> Command CHECK_CONNECTION invoked without connection ID

Workaround or Resolution

The most common cause for these errors is that the browser-hosted part of REM does not match the server code. Refresh the browser tab to reload the client code. You may need to clear the browser cache.

Database

Below are troubleshooting steps -- symptoms, diagnostic steps, and workarounds or resolutions -- for the Rhino management database persistent store.

Management Database Server Failure

Rhino uses either the PostgreSQL or Oracle Database Server as a non-volatile storage mechanism. It is possible that the database server can terminate due to administrative action or software failure. Failure to communicate with the database server may also occur due to a network fault. This is a problem related to the database only and is not indicative of a problem in Rhino.

A Rhino cluster gracefully handles the failure of a management database server, however if all nodes in the cluster fail and are restarted before the database server is restarted, then the cluster will pause in its booting process until a server is available. If all nodes in the cluster fail during the period the database server was unavailable, then any configuration or profile changes made after the database failure will be lost.

Symptoms

Warning messages in the system console from the log key memdb.storer

Messages similar to the following indicate that a PostgreSQL Database Server has terminated either due to administrative action or failure.

...
WARN [memdb.storer]
    <SQLStableStorage Storer thread for rhino_profiles>
        [ProfileDatabase/rhino_profiles] Connection failed
org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:201)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:388)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:329)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:239)
    ...
    at java.lang.Thread.run(Thread.java:595)
Caused by: java.io.EOFException
    at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:243)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1122)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:175)
    ... 8 more
WARN [memdb.storer]
    <SQLStableStorage Storer thread for rhino_management>
        [ManagementDatabase/rhino_management] Connection failed org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:201)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:388)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:329)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:239)
    ...
    at java.lang.Thread.run(Thread.java:595)
Caused by: java.io.EOFException
    at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:243)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1122)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:175)
... 8 more
...

Resolution and Mitigation

The management database server should be restarted as soon as possible. Once the database has been restarted and the Rhino cluster has reconnected, messages similar to the following should appear in the Rhino logs.

...
INFO [memdb.storer] <SQLStableStorage Storer thread for rhino_profiles>
    [ProfileDatabase/rhino_profiles] Connected to database
INFO [memdb.storer] <SQLStableStorage Storer thread for rhino_management>
    [ManagementDatabase/rhino_management] Connected to database
...

To reduce the risk of this causing data loss Rhino can be configured to replicate persisted data across multiple PostgreSQL instances. When using Oracle as the persistence database, replication is provided by the Oracle RAC clustering mechanism.

Signalware

Below are troubleshooting steps — symptoms, diagnostic steps, and workarounds or resolutions — for using Signalware with Rhino’s CGIN Resource Adaptor.

CGIN RA to Signalware Backend Connection Errors

Rhino’s CGIN resource adaptor for the SS7 protocol family requires a native backend process, known as a Signalware backend, to communicate with the Signalware SS7 stack. The Signalware backend process must run on the Signalware host systems, and the resource adaptor communicates with it via TCP. Generally, the cause is configuration related or a version mismatch between the resource adaptor and backend.

Symptoms

  • Resource Adaptor cannot connect to backend.

  • Resource Adaptor drops connection to backend.

Diagnostic steps

Resource Adaptor cannot connect to backend

If, when activating the resource adaptor (or starting the Rhino SLEE with the resource adaptor active) the following message immediately appears in the Rhino logs and an alarm is raised stating that a backend connection has been lost, then this indicates that the backend process is not reachable by the resource adaptor. Generally this would mean it is either not running or, due to resource adaptor or network misconfiguration, the resource adaptor is unable to establish a TCP connection to it.

2013-07-18 14:54:24.907  Major   [rhino.facility.alarm.manager]   <New I/O  worker #47> Alarm 101:140617544395779 [RAEntityNotification[entity=insis-ptc-external],noconnection,localhost:10102] was raised at 2013-07-18 14:54:24.907 to level Major
        Lost connection to backend localhost:10102
2013-07-18 14:54:24.907  INFO    [rhino.facility.alarm.csv]   <New I/O  worker #47> 2013-07-18 14:54:24.907,raised,2013-07-18 14:54:24.907,101:140617544395779,RAEntityNotification[entity=insis-ptc-external],noconnection,localhost:10102,Major,Lost connection to backend localhost:10102,
Resource Adaptor drops connection to backend

If the resource adaptor successfully established a connection to the backend but later one or both of the following messages appears in the logs this indicates that the connection to the backend process was setup successfully but lost due to backend or network failure.

2011-03-29 10:20:30.942  Warning [trace.cginra.backend.[sedition:24146]]   <RAContextTimer-cginra> sedition:24146#4: connection heartbeat lost. Last sent: 1301347220, last received: 1301347210
2016-11-17 16:20:26.070  Major   [rhino.facility.alarm.manager]   <cginra-thread-1> Alarm 101:194478654849023 [RAEntityNotification[entity=cginra],noconnection,autotests-signalware:10101] was raised at 2016-11-17 16:20:26.069 to level Major
        Lost connection to backend autotests-signalware:10101
2016-11-17 16:20:26.071  Warning [trace.cginra.backend.[autotests-signalware:10101]]   <cginra-thread-1> Connection autotests-signalware:10101#1 lost: Remote host closed connection

Workaround or Resolution

Resource Adaptor cannot connect to backend

First determine whether the backend process is running. If the backend process is not running then either it or Signalware may be incorrectly configured.

If the backend process is running check that the host and port it is listening on is correctly configured in the resource adaptor entity configuration properties. Also verify that the host running the backend is reachable from the Rhino host and that the port is not blocked by a firewall on either host.

Resource Adaptor drops connection to backend

If a connection is established but gets dropped this indicates either a software failure in the backend or a network level failure. Check that the backend process is running and if not, restart it. Any error messages from a failed backend should be sent to your solution provider to determine the cause of failure. If the backend process is still running check for any network connectivity problems between the Rhino host and the host running the backend processes.

CGIN RA Cannot Create Outgoing Dialogs

If one of Rhino’s Signalware Resource Adaptors has more than the configured maximum number of active dialogs with a single backend process it will fail to allocate new dialogs.

Symptoms

  • Applications fail to create outgoing dialogs using the CGIN Resource Adaptor

Diagnostic steps

Applications fail to create outgoing dialogs using the CGIN Resource Adaptor

If the backend process has reached its maximum number of active dialogs an warning similar to the following will be logged by the Resource Adaptor:

2016-11-17 14:49:42.410  Warning [trace.cginra.backend.[signalware:10100]]   <cginra-thread-2> Unable to allocate a dialog handle: Out of dialog ids

Additionally an error similar to the following will be visible in the output of the backend process:

2016-11-17 11:45:28.823003 rhino_252_1: Failed to allocate a new dialog ID.  errno=7

Workaround or Resolution

Applications fail to create outgoing dialogs using a Signalware Resource Adaptor

By default each backend process supports up to 32,000 simultaneous dialogs. If the number of simultaneous dialogs exceeds this limit the above errors will occur. There are two approaches to working around this issue:

  • Increase the number of dialogs per backend process using the -maxdialogs N parameter (where N is the number of dialogs. Up to 32,000 concurrent dialogs are supported).

  • If the backend process is already configured for the maximum 32,000 dialogs add additional backend processes, then, to support more active calls, spread the load across additional processes. See the Using the Ulticom Signalware TCAP Stack for details.

CGIN RA Cannot Receive Incoming Dialogs

Symptoms

  • Incoming dialogs do not arrive at CGIN RA.

Diagnostic steps

If the backend process has reached its maximum number of active dialogs the Signalware backend process will log something similar to the following:

2016-11-17 13:12:39.035658 rhino_252_1: cTCAPTakeMsg() failed: (ctcap_errno=12), Out of Dialog entries

Workaround or Resolution

The root cause and solution for this is identical to that for CGIN RA Cannot Create Outgoing Dialogs. Please see that section for details.

Problems with Signalware not involving the CGIN backends

Please refer to the Signalware Installation Manual for more information about installing and configuring Signalware. If you do not find a solution for your problem, contact your solution provider and provide the system report created by the dumpce command.

Exit Codes

Below is a list of the possible exit codes of the Rhino process.

Rhino Exit Codes

Exit Code Error Action

1

Unspecified error.

See the Rhino and console logs for details.

2

Misconfiguration found while reading a config file. See the Rhino and console logs for details.

Check the config files and fix any problems reported in the logs.

3

Watchdog shutdown

Check the Rhino log for the cause. Most frequently this is caused by system overload or a faulty service.

4

Timeout waiting for normal shutdown to complete

No action required

5

Restart of node requested by management operation

No action required

JVM Exit Codes

The JVM reports a number of exit codes that may be added to from version to version. At the time of writing this they include the following:

Internal JVM Exit Codes

No known exit codes other than 0 (Normal exit)

Signal Handler Exit Codes

Exit codes above 128 occur as a result of the JVM receiving a signal from the OS. These are always a result of either OS process management or a JVM bug that has triggered an OS level protection mechanism. In many cases the console log will have information on the cause. There may also be information in the system log. On Linux these are the following, other OSes differ:

Signal Name Signal Number Exit Code Common causes

HUP

1

129

  • User closed the terminal running Rhino.

  • Operator sent signal to the Rhino process.

INT

2

130

  • Ctrl-C in the terminal running Rhino.

  • Operator sent signal to the Rhino process.

QUIT

3

N/A JVM does not exit.

Operator triggered JVM thread stack dump.

ILL

4

134

JVM Bug

TRAP

5

133

Bug in a debugger attached to the Rhino JVM

ABRT/IOT

6

134

JVM bug

BUS

7(x86) 10(SPARC)

135

JVM bug

FPE

8

134

JVM bug

KILL

9

137

Operator or OS killed the Rhino process (Out of memory or system shutdown)

USR1

10(x86) 30(SPARC)

138

Operator sent a signal to the Rhino process

SEGV

11

134

JVM bug

USR2

12(x86) 31(SPARC)

134

Operator sent a signal to the Rhino process

ALRM

14

142

Native library set an OS alarm timer

TERM

15

143

Operator or OS killed the Rhino process (system shutdown)

STKFLT

16(x86 only)

144

JVM bug

XCPU

24(x86) 30(SPARC)

152

Ulimit on CPU time

VTALRM

26(x86) 28(SPARC)

154

Native library set an OS alarm timer

PROF

27(x86) 29(SPARC)

155

Bug in profiler attached to the Rhino JVM

IO

29(x86) 22(SPARC)

157

Bug in native library performing asynchronous IO

SYS

31(x86) 12(SPARC)

140

Bug in JVM, native library or kernel

Other exit codes

Some libraries may call System.exit() with a code not in this list. We are not aware of any libraries used by Rhino that do this and recommend avoiding any that do when developing services.