Below are troubleshooting steps — symptoms, diagnostic steps, and workarounds or resolutions — for Rhino management tools and utilities.
- Connections Refused for the Command Console, Deployment Script or Rhino Element Manager
- A Management Client Hangs
- Statistics client reports “Full thread sample containers”
- Statistics Client Out of Memory
- Creating a SyslogAppender gives an AccessControlException
- Platform Alarms
- DeploymentException when trying to deploy a component
- Deploying to multiple nodes in parallel fails
- Management of multiple Rhino instances
- Deployment problem on exceeding DB size
- Diagnostic steps
- BUILD FAILED when installing an OpenCloud product
- REM connection failure during management operations
- Export error: Multiple Profile Snapshot for profiles residing in seperate memdb instances is unsupported
- Unused log keys configured in Rhino
- Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT
- Log level for trace appender not logging
- Access to REM fails with Command CHECK_CONNECTION invoked without connection ID
Connections Refused for the Command Console, Deployment Script or Rhino Element Manager
The remote management clients cannot connect to the Rhino SLEE
Symptoms
The management clients show the following error when attempting to connect to the SLEE:
user@host:~/rhino/client/bin$ ./rhino-console
Could not connect to Rhino:
[localhost:1199] Connection refused
-> This normally means Rhino is not running or the client is connecting to the wrong port.
Use -D switch to display connection debugging messages.
Could not connect to Rhino:
[localhost:1199] No route to host
Use -D switch to display connection debugging messages.
Could not connect to Rhino:
[localhost:1199] Could not retrieve RMI stub
-> This often means the m-let configuration has not been modified to allow remote connections.
Use -D switch to display connection debugging messages.
BUILD FAILED
~/rhino/client/etc/common.xml:99: The following error occurred while executing this line:
~/rhino/client/etc/common.xml:77: error connecting to rhino: Login failed
Diagnostic steps and correction
Rhino is not listening for management connections
First, check that there is a running Rhino node on the host the client is trying to connect to.
Use the ps command to check that the Rhino process is running, e.g. ps ax | grep Rhino.
If Rhino is running, check the rhino.log to determine if the node has joined the primary component and started fully.
If the Rhino node is failing to join the primary component or otherwise failing to fully start then consult the Clustering troubleshooting guide.
Make sure that the remote host is accessible using the ping command, or check that you can log in to the remote host using ssh to confirm the network connection is working (some firewalls block ping).
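For example, assuming the Rhino host is reachable as rhino-host.example.com (the host name is illustrative), a quick set of checks might be:
# on the Rhino host: check that a Rhino JVM is running
ps ax | grep Rhino
# from the client machine: check basic network reachability
ping rhino-host.example.com
# some firewalls block ping, so also confirm that ssh works
ssh user@rhino-host.example.com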
Rhino refuses connections
By default, Rhino is set up to not allow remote connections by management clients. Permissions to do so need to be manually configured before starting the SLEE, as described in the next section.
The management clients connect to Rhino via SSL secured JMX connections. These require both a client certificate and permission to connect configured in the Java security configuration for Rhino.
To allow remote connections to the SLEE, the MLet configuration file will need to be edited.
On the SDK version of Rhino, this is $RHINO_HOME/config/mlet.conf; for the Production version of Rhino, it is $RHINO_HOME/node-???/config/mlet-permachine.conf for each node.
Edit the MLet configuration file and add the following permission to the JMXRAdaptor MLet security-permission-spec.
This should already be present but commented out in the file.
You will need to replace “host_name” with either a host name or a wildcard (e.g. *).
grant {
permission java.net.SocketPermission "{host_name}", "accept,resolve";
}
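For example, to accept management connections from any host, the entry (commented out by default) would become:
grant {
permission java.net.SocketPermission "*", "accept,resolve";
}
Note that a wildcard is less restrictive than naming the specific management hosts, so only use it where that is acceptable.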
It is also possible that the Rhino SLEE host has multiple network interfaces and has bound the RMI server to a network interface other than the one that the management client is trying to connect to.
If this is the case, then the following could be added to $RHINO_HOME/read-config-variables for the SDK, or $RHINO_HOME/node-???/read-config-variables for the Production version of Rhino:
OPTIONS="$OPTIONS -Djava.rmi.server.hostname={public IP}"
Rhino will need to be restarted in order for any of these changes to take effect. For the SDK, this simply means restarting it. For the Production version, this means restarting the particular node that has had these permissions added.
Management client is not configured to connect to the Rhino host
Make sure that the settings for the management clients are correct.
For rhino-console, these are stored in client/etc/client.properties.
You can also specify the remote host and port to connect to using the -h <hostname> and -p <port> command-line arguments.
If the SLEE has been configured to use a port other than the standard one for management client connections (and this has not been configured in the client/etc/client.properties file), then the port will also need to be specified on the command line.
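For example, to connect to a remote Rhino host on the default port (the host name is illustrative):
./client/bin/rhino-console -h rhino-host.example.com -p 1199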
If connecting to localhost, then the problem is likely to be a misconfigured /etc/hosts file causing the system to resolve localhost to an address other than 127.0.0.1.
For Ant deployment scripts, run with ant -v; Ant will then report the underlying exception, which provides more detail.
To run the command console or run deployment scripts from a remote machine:
- Copy $RHINO_HOME/client to the remote machine.
- Edit the file client/etc/client.properties and change the remote.host property to the address of the Rhino host.
- Make sure your Ant build script is using the correct client directory. The Ant property ${client.home} must be set to the location of your client directory.
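A minimal sketch of the client/etc/client.properties change, assuming the Rhino host is at 192.168.0.10 (the address is illustrative):
remote.host=192.168.0.10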
A Management Client Hangs
The management clients use SSL connections to connect securely to Rhino.
To generate keys for secure connections, they read (and block while doing so) from the /dev/random device.
The /dev/random device gathers entropy from the current system's devices, but on an idle system there may be no entropy to gather, meaning that a read on /dev/random will block.
Symptoms
A management client hangs for a long period of time on start-up as it tries to read from /dev/random.
Workaround or Resolution
The ideal resolution is to create more system entropy.
This can be done by wiggling the mouse, or on a remote server by logging in and running top or other system utilities.
Refer also to the operating system's documentation; on Linux this is the random(4) man page: man 4 random.
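On Linux, the entropy currently available to /dev/random can be checked before starting the client; a low value suggests that reads will block:
cat /proc/sys/kernel/random/entropy_avail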
Statistics client reports “Full thread sample containers”
If the statistics sampling rate is set too high, the per-thread sample containers may fill before the statistics client can read the statistics out of those containers.
Symptoms
When gathering statistics, the following may appear in the logs:
2006-10-16 12:59:26.353 INFO [rhino.monitoring.stats.paramset.Events]
<StageWorker/Misc/1> [Events] Updating thread sample statistics
found 4 full thread sample containers
This is a benign problem and can be safely ignored. The reported sample statistics will be slightly inaccurate. To prevent it, reduce the sampling rate.
Statistics Client Out of Memory
When running in graphical mode, the statistics client will, by default, store 6 hours of statistical data. If there is a large amount of data or if the statistics client is set to gather statistics for an extended period of time, it is possible for the statistics client to fail with an Out of Memory Exception.
Workaround or Resolution
It is recommended to use the -k option of the statistics client when running in graphical mode to limit the number of hours of statistics kept.
If statistics must be kept for a longer period, it is recommended that the statistics client be run in command-line mode and the output be piped to a text file for later analysis.
For more information, run client/bin/rhino-stats without any parameters. This will print a detailed usage description of the statistics client.
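For example, a hypothetical graphical-mode invocation that keeps only two hours of data (connection and parameter-set options are omitted here; see the usage description for the full syntax):
./client/bin/rhino-stats -k 2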
Creating a SyslogAppender gives an AccessControlException
Creating a SyslogAppender using the following entry in logging.xml will not work, as the appender does not perform its operations using the proper security privileges:
<appender appender-class="org.apache.log4j.net.SyslogAppender" name="SyslogLog">
<property name="SyslogHost" value="localhost"/>
<property name="Facility" value="user"/>
</appender>
Symptoms
The following error would appear in Rhino’s logs:
2006-10-19 15:16:02.311 ERROR [simvanna.threadedcluster] <ThreadedClusterDeliveryThread> Exception thrown in delivery thread java.security.AccessControlException: access denied (java.net.SocketPermission 127.0.0.1:514 connect,resolve)
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:264)
at java.security.AccessController.checkPermission(AccessController.java:427)
at java.lang.SecurityManager.checkPermission(SecurityManager.java:532)
at java.lang.SecurityManager.checkConnect(SecurityManager.java:1034)
at java.net.DatagramSocket.send(DatagramSocket.java:591)
at org.apache.log4j.helpers.SyslogWriter.write(SyslogWriter.java:69)
at org.apache.log4j.helpers.QuietWriter.write(QuietWriter.java:39)
at org.apache.log4j.helpers.SyslogQuietWriter.write(SyslogQuietWriter.java:45)
at org.apache.log4j.net.SyslogAppender.append(SyslogAppender.java:245)
Workaround or Resolution
For the case where a SyslogAppender is required, the createsyslogappender command of the rhino-console provides a much easier user interface to achieve this task.
Replacing the entry org.apache.log4j.net.SyslogAppender above with com.opencloud.rhino.logging.RhinoSyslogAppender will also fix this problem.
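For example, the corrected logging.xml entry would be:
<appender appender-class="com.opencloud.rhino.logging.RhinoSyslogAppender" name="SyslogLog">
<property name="SyslogHost" value="localhost"/>
<property name="Facility" value="user"/>
</appender>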
The Open Cloud version of the SyslogAppender is a simple wrapper around the Log4J version which wraps the append(LoggingEvent event) method in a “doPrivileged” block. For custom appenders not provided with Rhino, the same method can be used:
// The following method, placed in the appender subclass (shown here as it appears
// in RhinoSyslogAppender), wraps the superclass append() call in a doPrivileged
// block so the socket write runs with the appender's own security privileges.
public void append(final LoggingEvent event) {
    AccessController.doPrivileged(new PrivilegedAction() {
        public Object run() {
            RhinoSyslogAppender.super.append(event);
            return null;
        }
    });
}
Platform Alarms
Rhino raises alarms in various situations, some of which are discussed in this section for troubleshooting purposes.
The full list of current Rhino core alarms is available using the alarmcatalog command in the rhino-console.
Symptoms
- Alarm notification messages in Rhino logs
- Alarms appearing in network management systems
- Entries present in the output of the rhino-console command listactivealarms
Diagnostic steps
The presence of an alarm can be seen in the output of the following command:
./client/bin/rhino-console listactivealarms
- Upon the loss of a node from the cluster, an alarm with an alarm type of rhino.node-failure and an alarm source of ClusterStateListener is raised. This alarm is cleared either by the administrator or when the node rejoins the cluster. This alarm is not raised for quorum nodes.
- If a user rate limiter's capacity is exceeded, an alarm with an alarm source of ThresholdAlarms is raised. This alarm is cleared when the event rate drops below the limiter's configured capacity.
- If a JMX mlet cannot be started successfully, an alarm with an alarm source of MLetStarter is raised. These alarms must be cleared manually.
- If the rule for any user-defined threshold-based alarm is met, an alarm with a user-defined alarm type and alarm source is raised. These alarms are cleared when the rule condition is no longer met or if the administrator clears them.
- The licenses installed on the platform are insufficient for the deployed configuration: the alarm type is “rhino.license” and the alarm source is “LicenseManager”. This is raised when:
  - A license has expired.
  - A license is due to expire in the next seven days.
  - License units are being processed for a currently unlicensed function.
  - The consumption rate for a particular license is greater than the consumption rate which the license allows.
- The alarm type used in notifications reporting that an alarm has cleared is the original alarm type plus .clear, for example rhino.node-failure.clear.
Workaround or Resolution
Alarms with an alarm source of ThresholdAlarms indicate that the system is receiving more input than it has been configured to receive.
Alarms with an alarm source of LicenseManager indicate that a Rhino installation is not licensed appropriately.
Other alarms are either user-defined, or defined by an application or resource adaptor.
Alarms with an alarm source of MLetStarter or some other non-obvious key usually indicate a software issue, such as a misconfiguration of the installed cluster.
In most of these cases, the remedy is to contact your solution support provider for a new license or for instructions on how to remedy the situation.
DeploymentException when trying to deploy a component
Diagnostic steps
Native Library XXXXLib.so already loaded in another classloader
Each time you deploy the RA, it happens in a new classloader (because the code may have changed). If no class GC has happened, or if something is holding a reference to the old classloader and keeping it alive, the old library will still be loaded as well. See http://java.sun.com/docs/books/jni/html/design.html#8628
Deployable unit jar contains invalid path separators
This can occur if the deployable unit jar has been repackaged with a tool that uses backslashes as path separators, for example after modifying the deployable-unit.xml file within it. In this case the output of jar -tvf looks similar to the below:
$ jar -tvf service.jar
1987 Wed Jun 13 09:34:02 NZST 2007 events.jar
76358 Wed Jun 13 09:34:02 NZST 2007 sbb.jar
331 Wed Jun 13 09:34:02 NZST 2007 META-INF\deployable-unit.xml
106 Wed Jun 13 09:34:02 NZST 2007 META-INF\MANIFEST.MF
693 Wed Jun 13 09:34:02 NZST 2007 service-jar.xml
Workaround or Resolution
For the native library problem:
- Restart Rhino nodes before redeployment.
- Force a full GC manually before redeployment (requires Rhino to be configured with -XX:-DisableExplicitGC).
- Change the JNI library name whenever redeploying.
- Ensure the classes that use JNI are loaded by a higher-level classloader, e.g. the Rhino system classloader or a library. (Of course, that also means you cannot deploy new versions of those classes at runtime.)
For the path separator problem: jars always use forward slashes ("/") as the path separator. Repackage the DU jar with a different file archiver, preferably the jar tool.
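A minimal sketch of repackaging with the JDK jar tool, assuming the DU contents have been unpacked and edited in a working directory (directory and file names are illustrative, matching the listing above):
cd my-du-contents
jar -cvfm ../service.jar META-INF/MANIFEST.MF META-INF/deployable-unit.xml events.jar sbb.jar service-jar.xml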
Deploying to multiple nodes in parallel fails
Symptoms
You are deploying Rhino using a script that creates and deploys components to multiple nodes asynchronously. The deployment fails with one of the following exceptions on each node. When deploying the nodes serially, one after the other, no exceptions are reported.
WARN [rhino.management.deployment]Installation of deployable unit failed:
javax.slee.management.AlreadyDeployedException: URL already installed: file:/opt/rhino/apps/sessionconductor/rhino/dist/is41-ra-type_1.2-du.jar
at com.opencloud.rhino.management.deployment.Deployment.install(4276)
[WARN, rhino.management.resource, RMI TCP Connection(4)-192.168.84.173] -->
Resource adaptor entity creation failed: java.lang.IllegalStateException: Not in primary component
at com.opencloud.ob.Rhino.runtime.agt.release(4276)
...
Diagnostic steps
Rhino provides a single system image for management. You do not need to deploy a DU on each node in a cluster: installing a deployable unit on any node in a Rhino cluster propagates that DU to all nodes in the cluster, so if the DU has already been deployed via node 102, it cannot also be deployed via node 101.
In addition, if a new node is created and joins a running cluster, it will be automatically synchronised with the active cluster members (i.e. DUs installed, service states, log levels, trace levels, alarms, etc.).
A Rhino cluster will only allow one management operation that modifies internal state to be executed at any one time, so you cannot, for example, install a DU on node 101 and a DU on node 102 at the same time. One of the install operations will block until the other has finished. You can run multiple read-only operations simultaneously, though.
Management of multiple Rhino instances
Symptoms
You are trying to use rhino-console to talk to multiple Rhino instances, but it will not connect to the second instance.
Workaround or Resolution
Unfortunately it is not possible to store keys for multiple Rhino instances in the client's keystore, as they are stored using fixed aliases. With the current implementation, there are two ways to connect to multiple Rhino instances from a single management client:
- Copy the rhino-private.keystore to all the Rhino home directories so that all instances have the same private key on the server. This may be adequate for test environments.
- Create a copy of client.properties that points to a different client keystore, and tweak the scripts to parameterise the client.properties Java system property. Example:
OPTIONS="$OPTIONS -Dclient.properties=file:$CLIENT_HOME/etc/${RMISSL_PROPERTIES:client.properties}"
If doing this you may also want to parameterise the keystore password to restrict access to authorised users.
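With that change in place, a hypothetical invocation selecting a second instance's properties file might look like this (the file name is illustrative):
RMISSL_PROPERTIES=rhino2-client.properties ./client/bin/rhino-console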
Deployment problem on exceeding DB size
Diagnostic steps
See Memory Database Full for how to diagnose and resolve problems with the size of the Rhino in-memory databases, including the management database.
BUILD FAILED when installing an OpenCloud product
Symptoms
Installation fails with an error like:
$:/opt/RhinoSDK/cgin-connectivity-trial-1.5.2.19 # ant -f deploy.xml
Buildfile: deploy.xml
management-init:
[echo] Open Cloud Rhino SLEE Management tasks defined
login:
BUILD FAILED
/opt/RhinoSDK/client/etc/common.xml:102: The following error occurred while executing this line:
/opt/RhinoSDK/client/etc/common.xml:74: No supported regular expression matcher found: java.lang.ClassNotFoundException: org.apache.tools.ant.util.regexp.Jdk14RegexpRegexp
Total time: 0 seconds
REM connection failure during management operations
Symptoms
Performing a management operation, e.g. activating an RA entity, fails with the following error:
Could not acquire exclusive access to Rhino server
Diagnostic steps
The message is sometimes seen when Rhino is under load and JMX operations are slow to return. Check the CPU load on the Rhino servers.
Exceptions of this type can also occur in REM when stopping or starting the whole cluster.
When the REM auto-refresh interval is set to a low value (the default is 30 seconds) there is a high likelihood of a lock collision occurring. With higher auto-refresh intervals the likelihood decreases, and with the auto-refresh interval set to "Off" the exception may not occur at all.
If the rem.interceptor.connection log key is set to DEBUG in REM's log4j.properties, then the logs will show which operations could not acquire the JMX connection lock.
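A minimal sketch of the corresponding entry in REM's log4j.properties, assuming standard Log4J 1.x properties syntax for the log key named above:
log4j.logger.rem.interceptor.connection=DEBUG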
Workaround or Resolution
If the CPU load on the Rhino server is high then follow the resolution advice in Operating environment issues.
If the auto-refresh interval is low then increase it until the problem stops.
For further diagnostic and resolution assistance contact Open Cloud or your solution provider, providing the REM logs.
Export error: Multiple Profile Snapshot for profiles residing in seperate memdb instances is unsupported
Symptoms
Trying to export the Rhino configuration with rhino-export fails with an error like:
com.opencloud.ui.snapshot.SnapshotClientException: Multiple Profile Snapshot for profiles residing in seperate memdb instances is unsupported
at com.opencloud.ob.client.be.a(80947:202)
at com.opencloud.ui.snapshot.SnapshotClient.performProfileTableSnapshot(80947:294)
at com.opencloud.rhino.management.exporter.Exporter.b(80947:382)
at com.opencloud.rhino.management.exporter.Exporter.a(80947:350)
at com.opencloud.rhino.management.exporter.Exporter.run(80947:291)
at com.opencloud.rhino.management.exporter.Exporter.main(80947:201)
Press any key to continue...
Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT
Symptoms
Any management operation done on the cluster fails with:
2014-06-12 11:40:39.156 WARN [rhino.management.trace] <RMI TCP Connection(79)-10.240.83.131> Error setting trace level of root tracer for SbbNotification[service=ServiceID[name=TST,vendor=XXX,version=1.2.0],sbb=SbbID[na
me=SDMSbb,vendor=XXX,version=1.2.0]]: com.opencloud.savanna2.framework.lock2.LockUnavailableException: Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT, current owners=[TransactionId:[101:15393
0147780781]]
at com.opencloud.ob.Rhino.aR.a(2.3-1.12-72630:662)
at com.opencloud.ob.Rhino.aR.acquireExclusive(2.3-1.12-72630:65)
at com.opencloud.rhino.management.trace.Trace.setTraceLevel(2.3-1.12-72630:256)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Diagnostic steps
The management lock is acquired and held for the duration of all management operations to prevent concurrent modification of Rhino state and on-disk data.
To find out more information about the transaction holding the lock, use the gettransactioninfo console command.
To find out what method is blocking release of the lock, use jstack $(cat node-???/work/rhino.pid) or kill -QUIT $(cat node-???/work/rhino.pid) to dump the current thread state to the Rhino console log.
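For example, for a node whose node directory is node-101 (the node id is illustrative):
# dump thread state using jstack
jstack $(cat node-101/work/rhino.pid)
# or send SIGQUIT, which writes the dump to the Rhino console log
kill -QUIT $(cat node-101/work/rhino.pid)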
Workaround or resolution
Contact your solution provider with the Rhino logs showing the problem and a list of the management operations that were performed immediately prior to the one that timed out. If the management operation is permanently blocked, e.g. by an infinite loop in the raStopping() callback of an RA, the cluster will need to be restarted to interrupt the stuck operation. If it is not permanently blocked you must wait until the operation has finished.