Below are troubleshooting steps — symptoms, diagnostic steps, and workarounds or resolutions — for Rhino management tools and utilities.
- Connections Refused for the Command Console, Deployment Script or Rhino Element Manager
- A Management Client Hangs
- Statistics client reports “Full thread sample containers”
- Statistics Client Out of Memory
- Creating a SyslogAppender gives an AccessControlException
- Platform Alarms
- DeploymentException when trying to deploy a component
- Deploying to multiple nodes in parallel fails
- Management of multiple Rhino instances
- Deployment problem on exceeding DB size
- Diagnostic steps
- BUILD FAILED when installing an OpenCloud product
- REM connection failure during management operations
- Export error: Multiple Profile Snapshot for profiles residing in seperate memdb instances is unsupported
- Unused log keys configured in Rhino
- Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT
- Log level for trace appender not logging
- Access to REM fails with Command CHECK_CONNECTION invoked without connection ID
Connections Refused for the Command Console, Deployment Script or Rhino Element Manager
The remote management clients cannot connect to the Rhino SLEE
Symptoms
The management clients show the following error when attempting to connect to the SLEE:
user@host:~/rhino/client/bin$ ./rhino-console
Could not connect to Rhino:
[localhost:1199] Connection refused
-> This normally means Rhino is not running or the client is connecting to the wrong port.
Use -D switch to display connection debugging messages.
Could not connect to Rhino:
[localhost:1199] No route to host
Use -D switch to display connection debugging messages.
Could not connect to Rhino:
[localhost:1199] Could not retrieve RMI stub
-> This often means the m-let configuration has not been modified to allow remote connections.
Use -D switch to display connection debugging messages.
BUILD FAILED
~/rhino/client/etc/common.xml:99: The following error occurred while executing this line:
~/rhino/client/etc/common.xml:77: error connecting to rhino: Login failed
Diagnostic steps and correction
Rhino is not listening for management connections
First, check that there is a running Rhino node on the host the client is trying to connect to.
Use the ps command to check that the Rhino process is running, e.g. ps ax | grep Rhino.
If Rhino is running, check the rhino.log to determine if the node has joined the primary component and started fully.
If the Rhino node is failing to join the primary component or otherwise failing to fully start then consult the Clustering troubleshooting guide.
Make sure that the remote host is accessible using the ping command, or check that you can log in to the remote host using ssh to confirm the network connection is working (some firewalls block ping).
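For example, assuming the Rhino host is reachable as rhino-host.example.com (the host name is illustrative), a quick set of checks might be:
# on the Rhino host: check that a Rhino JVM is running
ps ax | grep Rhino
# from the client machine: check basic network reachability
ping rhino-host.example.com
# some firewalls block ping, so also confirm that ssh works
ssh user@rhino-host.example.com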
Rhino refuses connections
By default, Rhino is set up to not allow remote connections by management clients. Permissions to do so need to be manually configured before starting the SLEE, as described in the next section.
The management clients connect to Rhino via SSL secured JMX connections. These require both a client certificate and permission to connect configured in the Java security configuration for Rhino.
To allow remote connections to the SLEE, the MLet configuration file will need to be edited.
On the SDK version of Rhino, this is $RHINO_HOME/config/mlet.conf; for the Production version of Rhino, it is $RHINO_HOME/node-???/config/mlet-permachine.conf for each node.
Edit the MLet configuration file and add the following permission to the JMXRAdaptor MLet security-permission-spec.
This should already be present but commented out in the file.
You will need to replace “host_name” with either a host name or a wildcard (e.g. *).
grant {
permission java.net.SocketPermission "{host_name}", "accept,resolve";
}
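For example, to accept management connections from any host, the entry (commented out by default) would become:
grant {
permission java.net.SocketPermission "*", "accept,resolve";
}
Note that a wildcard is less restrictive than naming the specific management hosts, so only use it where that is acceptable.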
It is also possible that the Rhino SLEE host has multiple network interfaces and has bound the RMI server to a network interface other than the one that the management client is trying to connect to.
If this is the case, then the following could be added to $RHINO_HOME/read-config-variables for the SDK, or $RHINO_HOME/node-???/read-config-variables for the Production version of Rhino:
OPTIONS="$OPTIONS -Djava.rmi.server.hostname={public IP}"
Rhino will need to be restarted in order for any of these changes to take effect. For the SDK, this simply means restarting it. For the Production version, this means restarting the particular node that has had these permissions added.
Management client is not configured to connect to the Rhino host
Make sure that the settings for the management clients are correct.
For rhino-console, these are stored in client/etc/client.properties.
You can also specify the remote host and port to connect to using the -h <hostname> and -p <port> command-line arguments.
If the SLEE has been configured to use a port other than the standard one for management client connections (and this has not been configured in the client/etc/client.properties file), then the port will also need to be specified on the command line.
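For example, to connect to a remote Rhino host on the default port (the host name is illustrative):
./client/bin/rhino-console -h rhino-host.example.com -p 1199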
If connecting to localhost, then the problem is likely to be a misconfigured /etc/hosts file causing the system to resolve localhost to an address other than 127.0.0.1.
For Ant deployment scripts, run with ant -v; Ant will then report the underlying exception, which provides more detail.
To run the command console or run deployment scripts from a remote machine:
- Copy $RHINO_HOME/client to the remote machine.
- Edit the file client/etc/client.properties and change the remote.host property to the address of the Rhino host.
- Make sure your Ant build script is using the correct client directory. The Ant property ${client.home} must be set to the location of your client directory.
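A minimal sketch of the client/etc/client.properties change, assuming the Rhino host is at 192.168.0.10 (the address is illustrative):
remote.host=192.168.0.10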
A Management Client Hangs
The management clients use SSL connections to connect securely to Rhino.
To generate keys for secure connections, they read (and block while doing so) from the /dev/random device.
The /dev/random device gathers entropy from the current system's devices, but on an idle system there may be no entropy to gather, meaning that a read on /dev/random will block.
Symptoms
A management client hangs for a long period of time on start-up as it tries to read from /dev/random.
Workaround or Resolution
The ideal resolution is to create more system entropy.
This can be done by wiggling the mouse, or on a remote server by logging in and running top or other system utilities.
Refer also to the operating system's documentation; on Linux this is the random(4) man page: man 4 random.
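On Linux, the entropy currently available to /dev/random can be checked before starting the client; a low value suggests that reads will block:
cat /proc/sys/kernel/random/entropy_avail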
Statistics client reports “Full thread sample containers”
If the statistics sampling rate is set too high, the per-thread sample containers may fill before the statistics client can read the statistics out of those containers.
Symptoms
When gathering statistics, the following may appear in the logs:
2006-10-16 12:59:26.353 INFO [rhino.monitoring.stats.paramset.Events]
<StageWorker/Misc/1> [Events] Updating thread sample statistics
found 4 full thread sample containers
This is a benign problem and can be safely ignored. The reported sample statistics will be slightly inaccurate. To prevent it, reduce the sampling rate.
Statistics Client Out of Memory
When running in graphical mode, the statistics client will, by default, store 6 hours of statistical data. If there is a large amount of data or if the statistics client is set to gather statistics for an extended period of time, it is possible for the statistics client to fail with an Out of Memory Exception.
Workaround or Resolution
It is recommended to use the -k option of the statistics client when running in graphical mode to limit the number of hours of statistics kept.
If statistics must be kept for a longer period, it is recommended that the statistics client be run in command-line mode and the output be piped to a text file for later analysis.
For more information, run client/bin/rhino-stats without any parameters. This will print a detailed usage description of the statistics client.
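For example, a hypothetical graphical-mode invocation that keeps only two hours of data (connection and parameter-set options are omitted here; see the usage description for the full syntax):
./client/bin/rhino-stats -k 2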
Creating a SyslogAppender gives an AccessControlException
Creating a SyslogAppender using the following entry in logging.xml will not work, as the appender does not perform its operations using the proper security privileges:
<appender appender-class="org.apache.log4j.net.SyslogAppender" name="SyslogLog">
<property name="SyslogHost" value="localhost"/>
<property name="Facility" value="user"/>
</appender>
Symptoms
The following error would appear in Rhino’s logs:
2006-10-19 15:16:02.311 ERROR [simvanna.threadedcluster] <ThreadedClusterDeliveryThread> Exception thrown in delivery thread java.security.AccessControlException: access denied (java.net.SocketPermission 127.0.0.1:514 connect,resolve)
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:264)
at java.security.AccessController.checkPermission(AccessController.java:427)
at java.lang.SecurityManager.checkPermission(SecurityManager.java:532)
at java.lang.SecurityManager.checkConnect(SecurityManager.java:1034)
at java.net.DatagramSocket.send(DatagramSocket.java:591)
at org.apache.log4j.helpers.SyslogWriter.write(SyslogWriter.java:69)
at org.apache.log4j.helpers.QuietWriter.write(QuietWriter.java:39)
at org.apache.log4j.helpers.SyslogQuietWriter.write(SyslogQuietWriter.java:45)
at org.apache.log4j.net.SyslogAppender.append(SyslogAppender.java:245)
Workaround or Resolution
For the case where a SyslogAppender is required, the createsyslogappender command of the rhino-console provides a much easier user interface to achieve this task.
Replacing the entry org.apache.log4j.net.SyslogAppender above with com.opencloud.rhino.logging.RhinoSyslogAppender will also fix this problem.
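For example, the corrected logging.xml entry would be:
<appender appender-class="com.opencloud.rhino.logging.RhinoSyslogAppender" name="SyslogLog">
<property name="SyslogHost" value="localhost"/>
<property name="Facility" value="user"/>
</appender>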
The Open Cloud version of the SyslogAppender is a simple wrapper around the Log4J version which wraps the append(LoggingEvent event) method in a “doPrivileged” block. For custom appenders not provided with Rhino, the same method can be used:
// The following method, placed in the appender subclass (shown here as it appears
// in RhinoSyslogAppender), wraps the superclass append() call in a doPrivileged
// block so the socket write runs with the appender's own security privileges.
public void append(final LoggingEvent event) {
    AccessController.doPrivileged(new PrivilegedAction() {
        public Object run() {
            RhinoSyslogAppender.super.append(event);
            return null;
        }
    });
}
Platform Alarms
Rhino raises alarms in various situations, some of which are discussed in this section for troubleshooting purposes.
The full list of current Rhino core alarms is available using the alarmcatalog command in the rhino-console.
Symptoms
- Alarm notification messages in Rhino logs
- Alarms appearing in network management systems
- Entries present in the output of the rhino-console command listactivealarms
Diagnostic steps
The presence of an alarm can be seen in the output of the following command:
./client/bin/rhino-console listactivealarms
- Upon the loss of a node from the cluster, an alarm with an alarm type of rhino.node-failure and an alarm source of ClusterStateListener is raised. This alarm is cleared either by the administrator or when the node rejoins the cluster. This alarm is not raised for quorum nodes.
- If a user rate limiter's capacity is exceeded, an alarm with an alarm source of ThresholdAlarms is raised. This alarm is cleared when the event rate drops below the limiter's configured capacity.
- If a JMX mlet cannot be started successfully, an alarm with an alarm source of MLetStarter is raised. These alarms must be cleared manually.
- If the rule for any user-defined threshold-based alarm is met, an alarm with a user-defined alarm type and alarm source is raised. These alarms are cleared when the rule condition is no longer met or if the administrator clears them.
- The licenses installed on the platform are insufficient for the deployed configuration: the alarm type is “rhino.license” and the alarm source is “LicenseManager”. This is raised when:
  - A license has expired.
  - A license is due to expire in the next seven days.
  - License units are being processed for a currently unlicensed function.
  - The consumption rate for a particular license is greater than the consumption rate which the license allows.
- The alarm type used in notifications reporting that an alarm has cleared is the original alarm type plus .clear, for example rhino.node-failure.clear.
Workaround or Resolution
Alarms with an alarm source of ThresholdAlarms indicate that the system is receiving more input than it has been configured to receive.
Alarms with an alarm source of LicenseManager indicate that a Rhino installation is not licensed appropriately.
Other alarms are either user-defined, or defined by an application or resource adaptor.
Alarms with an alarm source of MLetStarter or some other non-obvious key usually indicate a software issue, such as a misconfiguration of the installed cluster.
In most of these cases, the remedy is to contact your solution support provider for a new license or for instructions on how to remedy the situation.
DeploymentException when trying to deploy a component
Diagnostic steps
Native Library XXXXLib.so already loaded in another classloader
Each time you deploy the RA, it happens in a new classloader (because the code may have changed). If no class GC has happened, or if something is holding a reference to the old classloader and keeping it alive, the old library will still be loaded as well. See http://java.sun.com/docs/books/jni/html/design.html#8628
Deployable unit jar contains invalid path separators
This can occur if the deployable unit jar has been repackaged with a tool that uses backslashes as path separators, for example after modifying the deployable-unit.xml file within it. In this case the output of jar -tvf looks similar to the below:
$ jar -tvf service.jar
1987 Wed Jun 13 09:34:02 NZST 2007 events.jar
76358 Wed Jun 13 09:34:02 NZST 2007 sbb.jar
331 Wed Jun 13 09:34:02 NZST 2007 META-INF\deployable-unit.xml
106 Wed Jun 13 09:34:02 NZST 2007 META-INF\MANIFEST.MF
693 Wed Jun 13 09:34:02 NZST 2007 service-jar.xml
Workaround or Resolution
For the native library problem:
- Restart Rhino nodes before redeployment.
- Force a full GC manually before redeployment (requires Rhino to be configured with -XX:-DisableExplicitGC).
- Change the JNI library name whenever redeploying.
- Ensure the classes that use JNI are loaded by a higher-level classloader, e.g. the Rhino system classloader or a library. (Of course, that also means you cannot deploy new versions of those classes at runtime.)
For the path separator problem: jars always use forward slashes ("/") as the path separator. Repackage the DU jar with a different file archiver, preferably the jar tool.
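A minimal sketch of repackaging with the JDK jar tool, assuming the DU contents have been unpacked and edited in a working directory (directory and file names are illustrative, matching the listing above):
cd my-du-contents
jar -cvfm ../service.jar META-INF/MANIFEST.MF META-INF/deployable-unit.xml events.jar sbb.jar service-jar.xml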
Deploying to multiple nodes in parallel fails
Symptoms
You are deploying Rhino using a script that creates and deploys components to multiple nodes asynchronously. The deployment fails with one of the following exceptions on each node. When deploying the nodes serially, one after the other, no exceptions are reported.
WARN [rhino.management.deployment]Installation of deployable unit failed:
javax.slee.management.AlreadyDeployedException: URL already installed: file:/opt/rhino/apps/sessionconductor/rhino/dist/is41-ra-type_1.2-du.jar
at com.opencloud.rhino.management.deployment.Deployment.install(4276)
[WARN, rhino.management.resource, RMI TCP Connection(4)-192.168.84.173] -->
Resource adaptor entity creation failed: java.lang.IllegalStateException: Not in primary component
at com.opencloud.ob.Rhino.runtime.agt.release(4276)
...
Diagnostic steps
Rhino provides a single system image for management. You do not need to deploy a DU on each node in a cluster: installing a deployable unit on any node in a Rhino cluster propagates that DU to all nodes in the cluster, so if the DU has already been deployed via node 102, it cannot also be deployed via node 101.
In addition, if a new node is created and joins a running cluster, it will be automatically synchronised with the active cluster members (i.e. DUs installed, service states, log levels, trace levels, alarms, etc.).
A Rhino cluster will only allow one management operation that modifies internal state to be executed at any one time, so you cannot, for example, install a DU on node 101 and a DU on node 102 at the same time. One of the install operations will block until the other has finished. You can run multiple read-only operations simultaneously, though.
Management of multiple Rhino instances
Symptoms
You are trying to use rhino-console to talk to multiple Rhino instances, but it will not connect to the second instance.
Workaround or Resolution
Unfortunately it is not possible to store keys for multiple Rhino instances in the client's keystore, as they are stored using fixed aliases. With the current implementation, there are two ways to connect to multiple Rhino instances from a single management client:
- Copy the rhino-private.keystore to all the Rhino home directories so that all instances have the same private key on the server. This may be adequate for test environments.
- Create a copy of client.properties that points to a different client keystore, and tweak the scripts to parameterise the client.properties Java system property. Example:
OPTIONS="$OPTIONS -Dclient.properties=file:$CLIENT_HOME/etc/${RMISSL_PROPERTIES:client.properties}"
If doing this you may also want to parameterise the keystore password to restrict access to authorised users.
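With that change in place, a hypothetical invocation selecting a second instance's properties file might look like this (the file name is illustrative):
RMISSL_PROPERTIES=rhino2-client.properties ./client/bin/rhino-console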
Deployment problem on exceeding DB size
Diagnostic steps
See Memory Database Full for how to diagnose and resolve problems with the size of the Rhino in-memory databases, including the management database.
BUILD FAILED when installing an OpenCloud product
Symptoms
Installation fails with an error like:
$:/opt/RhinoSDK/cgin-connectivity-trial-1.5.2.19 # ant -f deploy.xml
Buildfile: deploy.xml
management-init:
[echo] Open Cloud Rhino SLEE Management tasks defined
login:
BUILD FAILED
/opt/RhinoSDK/client/etc/common.xml:102: The following error occurred while executing this line:
/opt/RhinoSDK/client/etc/common.xml:74: No supported regular expression matcher found: java.lang.ClassNotFoundException: org.apache.tools.ant.util.regexp.Jdk14RegexpRegexp
Total time: 0 seconds
REM connection failure during management operations
Symptoms
Performing a management operation, e.g. activating an RA entity, fails with the following error:
Could not acquire exclusive access to Rhino server
Diagnostic steps
The message is sometimes seen when Rhino is under load and JMX operations are slow to return. Check the CPU load on the Rhino servers.
Exceptions of this type can also occur in REM when stopping or starting the whole cluster.
When the REM auto-refresh interval is set to a low value (the default is 30 seconds) there is a high likelihood of a lock collision occurring. With higher auto-refresh intervals the likelihood decreases, and with the auto-refresh interval set to "Off" the exception may not occur at all.
If the rem.interceptor.connection log key is set to DEBUG in REM's log4j.properties, then the logs will show which operations could not acquire the JMX connection lock.
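A minimal sketch of the corresponding entry in REM's log4j.properties, assuming standard Log4J 1.x properties syntax for the log key named above:
log4j.logger.rem.interceptor.connection=DEBUG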
Workaround or Resolution
If the CPU load on the Rhino server is high then follow the resolution advice in Operating environment issues.
If the auto-refresh interval is low then increase it until the problem stops.
For further diagnostic and resolution assistance contact Open Cloud or your solution provider, providing the REM logs.
Export error: Multiple Profile Snapshot for profiles residing in seperate memdb instances is unsupported
Symptoms
Trying to export the Rhino configuration with rhino-export fails with an error like:
com.opencloud.ui.snapshot.SnapshotClientException: Multiple Profile Snapshot for profiles residing in seperate memdb instances is unsupported
at com.opencloud.ob.client.be.a(80947:202)
at com.opencloud.ui.snapshot.SnapshotClient.performProfileTableSnapshot(80947:294)
at com.opencloud.rhino.management.exporter.Exporter.b(80947:382)
at com.opencloud.rhino.management.exporter.Exporter.a(80947:350)
at com.opencloud.rhino.management.exporter.Exporter.run(80947:291)
at com.opencloud.rhino.management.exporter.Exporter.main(80947:201)
Press any key to continue...
Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT
Symptoms
Any management operation done on the cluster fails with:
2014-06-12 11:40:39.156 WARN [rhino.management.trace] <RMI TCP Connection(79)-10.240.83.131> Error setting trace level of root tracer for SbbNotification[service=ServiceID[name=TST,vendor=XXX,version=1.2.0],sbb=SbbID[na
me=SDMSbb,vendor=XXX,version=1.2.0]]: com.opencloud.savanna2.framework.lock2.LockUnavailableException: Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT, current owners=[TransactionId:[101:15393
0147780781]]
at com.opencloud.ob.Rhino.aR.a(2.3-1.12-72630:662)
at com.opencloud.ob.Rhino.aR.acquireExclusive(2.3-1.12-72630:65)
at com.opencloud.rhino.management.trace.Trace.setTraceLevel(2.3-1.12-72630:256)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Diagnostic steps
The management lock is acquired and held for the duration of all management operations to prevent concurrent modification of Rhino state and on-disk data.
To find out more information about the transaction holding the lock, use the gettransactioninfo console command.
To find out what method is blocking release of the lock, use jstack $(cat node-???/work/rhino.pid) or kill -QUIT $(cat node-???/work/rhino.pid) to dump the current thread state to the Rhino console log.
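For example, for a node whose node directory is node-101 (the node id is illustrative):
# dump thread state using jstack
jstack $(cat node-101/work/rhino.pid)
# or send SIGQUIT, which writes the dump to the Rhino console log
kill -QUIT $(cat node-101/work/rhino.pid)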
Workaround or resolution
Contact your solution provider with the Rhino logs showing the problem and a list of the management operations that were performed immediately prior to the one that timed out. If the management operation is permanently blocked, e.g. by an infinite loop in the raStopping() callback of an RA, the cluster will need to be restarted to interrupt the stuck operation. If it is not permanently blocked you must wait until the operation has finished.