Plan recovery approach
Recover the leader first when the leader is malfunctioning
When recovering multiple nodes, check whether any of the nodes to be recovered are reported as being the leader, based on the output of the rvtconfig report-group-status command.
If any of the nodes to be recovered is the current leader, recover the leader node first. This speeds up the handover of group leadership, so that the recovery completes faster.
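For illustration, a minimal sketch of planning the recovery order is shown below. The node names are placeholders, and the leader must be read from a fresh rvtconfig report-group-status report (run with the connection arguments required for your deployment, which are not shown here).
# Sketch: put the leader (if it is among the nodes to recover) first in the
# recovery order. Names are hypothetical; this assumes the leader is one of
# the nodes being recovered.
LEADER="mynode-2"                                    # as reported by rvtconfig report-group-status
NODES_TO_RECOVER=("mynode-1" "mynode-2" "mynode-3")  # hypothetical nodes needing recovery
ORDERED=("$LEADER")
for node in "${NODES_TO_RECOVER[@]}"; do
    [ "$node" != "$LEADER" ] && ORDERED+=("$node")
done
echo "Recovery order: ${ORDERED[*]}"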
Choose between csar heal and csar redeploy
In general, use the csar heal operation where possible instead of csar redeploy.
The csar heal operation requires that the initconf process is active on the VM, and that the VM can reach both the CDS and MDM services, as reported by rvtconfig report-group-status. If any of these prerequisites are not met, use csar redeploy instead.
When report-group-status reports that a single node cannot connect to CDS or MDM, treat this as a VM-specific fault and use csar redeploy instead of csar heal. However, a widespread failure of all the VMs in the group to connect to CDS or MDM suggests a need to investigate the health of the CDS and MDM services themselves, or the connectivity to them.
When recovering multiple VMs, you do not have to use either csar redeploy or csar heal consistently for all nodes. Instead, choose the appropriate command for each VM according to the guidance on this page.
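The decision can be summarised per VM. The script below is a minimal sketch and not part of the product tooling: it assumes you have already read the report-group-status output and pass in, for the VM in question, whether initconf is active and whether CDS and MDM are reachable.
#!/bin/bash
# Hypothetical helper: choose csar heal or csar redeploy for one VM, based on
# the prerequisites noted from rvtconfig report-group-status.
# Usage: ./recover-vm.sh <vm-name> <path-to-sdf> <initconf-active:yes|no> <cds-mdm-reachable:yes|no>
VM_NAME="${1:?VM name required}"
SDF_PATH="${2:?path to the SDF for the version the VM is already on}"
INITCONF_ACTIVE="${3:-no}"
CDS_MDM_REACHABLE="${4:-no}"

if [ "$INITCONF_ACTIVE" = "yes" ] && [ "$CDS_MDM_REACHABLE" = "yes" ]; then
    # Both prerequisites for csar heal are met.
    csar heal --vm "$VM_NAME" --sdf "$SDF_PATH"
else
    # initconf is down or the VM cannot reach CDS/MDM: fall back to csar redeploy.
    csar redeploy --vm "$VM_NAME" --sdf "$SDF_PATH"
fi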
Recovering from malfunctions on multiple VMs
In the case where the VMs are being recovered proactively to mitigate an anticipated fault (such as running low on disk space), but are otherwise healthy and providing service, recover each VM one at a time, as a series of single-VM recovery operations.
For cases where multiple VMs are malfunctioning and need to be recovered as a group, consider the cases below.
When recovering all nodes in the group
Recovery requires at least one Rhino node to be functioning.
If the report-group-status output indicates that at least one Rhino node is healthy, the remaining nodes can be recovered one at a time.
In the case where none of the Rhino nodes are functioning and the group needs to be recovered as a whole, use the existing procedure for deleting and redeploying the entire group.
This consists of a csar delete and a rvtconfig delete-node-type-version, followed by a csar deploy and a re-upload of the configuration using rvtconfig upload-config.
See Backout procedure for the current platform within this guide for detailed steps related to backing out the group.
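For orientation only, the order of operations is sketched below. Each command takes additional arguments (SDF paths, CDS connection details, version identifiers and so on) that depend on your deployment and are covered in the Backout procedure; they are not shown here.
csar delete <arguments>                          # remove the existing VMs in the group
rvtconfig delete-node-type-version <arguments>   # clear the group's state for that version
csar deploy <arguments>                          # deploy the group again from the SDF
rvtconfig upload-config <arguments>              # re-upload the configuration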
When recovering all nodes on one side of an upgrade
When recovering multiple nodes, VM recovery requires at least one Rhino node to be functioning in the Rhino cluster.
If the report-group-status output indicates that at least one Rhino node is healthy, the remaining nodes can be recovered one at a time.
In the mid-upgrade or mid-rollback case where none of the Rhino nodes on one side of the upgrade are functioning, and all nodes on that side need to be recovered as a whole, use the following procedure (a consolidated sketch follows the list). The intent is to recover all VMs on version A and leave the VMs on version B unchanged:
- SSH into each node on version A, and run sudo systemctl poweroff.
- Run the rvtconfig delete-node-type-version command, passing in version A as the version argument.
- Re-upload the config for version A using the rvtconfig upload-config command.
- Run the csar redeploy command (see below) for each of the nodes that were on version A, to redeploy them back to version A.
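A consolidated sketch of the four steps is shown below. The node names and SDF path are placeholders, and the rvtconfig steps are shown as comments because their full argument lists (connection details, version identifiers and so on) depend on your deployment.
#!/bin/bash
# Sketch: recover all version A nodes, leaving version B untouched.
VERSION_A_NODES=("mynode-1" "mynode-2" "mynode-3")   # hypothetical version A VM names
SDF_A="/home/admin/version-a-config/sdf-rvt.yaml"    # hypothetical path to the version A SDF

# 1. Power off every version A node.
for node in "${VERSION_A_NODES[@]}"; do
    ssh "$node" sudo systemctl poweroff
done

# 2. Run rvtconfig delete-node-type-version, passing version A as the version
#    argument (remaining arguments depend on your deployment and are not shown).
# 3. Run rvtconfig upload-config to re-upload the version A configuration.

# 4. Redeploy each node back onto version A.
for node in "${VERSION_A_NODES[@]}"; do
    csar redeploy --vm "$node" --sdf "$SDF_A"
done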
Recovering one node
Healing one node
VMs should be healed one at a time, reassessing the group status using the rvtconfig report-group-status command after each heal operation, as detailed below.
See the 'Healing a VM' section of the SIMPL VM Documentation for details on the csar heal command.
The command should be run as follows:
csar heal --vm <VM name> --sdf <path to SDF>
Note: Make sure that you pass the SDF pertaining to the correct version, that is, the same version that the recovering VM is already on. This is especially important during an upgrade.
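For example, healing a single node might look like the following. The VM name and SDF path are placeholders, and rvtconfig report-group-status additionally needs the connection arguments for your deployment.
# Heal one node, using the SDF for the version that node is already on.
csar heal --vm mydeployment-node-1 --sdf /home/admin/current-config/sdf-rvt.yaml
# Then reassess the group before moving on to the next node.
rvtconfig report-group-status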
Redeploying one node
VMs should be redeployed one at a time, reassessing the group status using the rvtconfig report-group-status command after each redeploy operation, as detailed below. Exceptions to this rule are noted on this page.
See the section on redeploying a VM in the SIMPL VM Documentation for details on the csar redeploy command.
The command should be run as follows:
csar redeploy --vm <VM name> --sdf <path to SDF>
Note: Make sure that you pass the SDF pertaining to the correct version, that is, the same version that the recovering VM is already on. This is especially important during an upgrade.
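A redeploy of a single node follows the same shape; the name and path below are placeholders. Because csar redeploy returns before recovery is complete (see the note in the next section), re-run report-group-status until the node reports as healthy before recovering the next one.
# Redeploy one node, using the SDF for the version that node is already on.
csar redeploy --vm mydeployment-node-2 --sdf /home/admin/current-config/sdf-rvt.yaml
# csar redeploy returns before recovery finishes, so re-check until the node is healthy.
rvtconfig report-group-status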
Re-check status after recovering each node
To ensure a node has been successfully recovered, check the status of the VM in the report generated by rvtconfig report-group-status.
Note: The csar heal command waits until the heal is complete before indicating success, or times out in the awaiting_manual_intervention case (see below). The csar redeploy command does not wait until recovery is complete before returning.
On accidental heal or redeploy to the wrong version
If the output of report-group-status indicates an unintended recovery to the wrong version, follow the procedure in Troubleshooting accidental VM recovery to recover.
Recovery failures when recovering half or more nodes
A Rhino cluster expects at least half of the nodes to stay up during normal operation. An unplanned, sudden loss of half or more of the nodes can cause the remaining nodes to go into a state called "waiting for primary component", in which the Rhino process restarts and will not resume service until half or more of the nodes are present and available. This could be triggered, for example, by a hardware fault, or by redeploying too many nodes at the same time.
In any of the above cases, when the Rhino cluster enters this state, all available Rhino nodes will exhibit it, which is observable as the following line in the Rhino section of the rvtconfig report-group-status output:
[FAIL] Rhino is stuck waiting for the primary component to form or reform.
To get past this state and restore service, heal or redeploy the faulty Rhino nodes, based on the Rhino section of the report-group-status output.
Note that csar heal and initconf can be decoupled in this situation. The heal operation may time out if not enough Rhino nodes are available. For example, if 4 of 5 Rhino nodes were lost and the faulty nodes are being healed, the first csar heal operation will time out while Rhino is waiting for the primary component, because 2 of 5 Rhino nodes is not sufficient to reform the primary component.
In this situation, progress to healing the next node until sufficient nodes are restored. Once sufficient nodes are restored, the Rhino primary state will be restored and initconf will continue to completion on the affected nodes, even after csar heal has returned the timeout failure.
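A sketch of that sequence is shown below. The node names and SDF path are placeholders, and the report-group-status connection arguments are omitted; the point is that a timed-out heal is not treated as fatal while there are further nodes to restore.
#!/bin/bash
# Sketch: heal faulty Rhino nodes one at a time. Early heals may time out while
# Rhino is still waiting for the primary component; continue to the next node.
FAULTY_NODES=("mynode-2" "mynode-3" "mynode-4" "mynode-5")   # hypothetical faulty nodes
SDF="/home/admin/current-config/sdf-rvt.yaml"                # hypothetical SDF path

for node in "${FAULTY_NODES[@]}"; do
    if ! csar heal --vm "$node" --sdf "$SDF"; then
        echo "csar heal of $node timed out or failed; continuing with the next node"
    fi
done

# Once enough nodes are back, the primary component reforms and initconf completes
# on the affected nodes; confirm with a fresh report.
rvtconfig report-group-status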
Recovering 3 or more nodes when all persistence instances are lost
This rare case only applies when 3 or more nodes need to be recovered, all the persistence instances (see below) need recovering simultaneously, and no Rhino nodes stayed up during the whole recovery process. It does not apply when you are already using one of the two procedures listed above on this page for intentionally recovering all nodes in the cluster.
Rhino uses a management database that is stored both in memory and on disk on the persistence instances, each of which runs a PostgreSQL process. While Rhino is resilient to multiple failures, and can even withstand losing all persistence instances in some cases, there are still cases where the sudden loss of all persistence instances can result in data loss.
The situation to mitigate against is very rare but possible: a series of simultaneous node failures of all the persistence instances is followed by another event, such as a power loss, that causes the majority of the remaining Rhino processes to fail or be restarted before the persistence instances are restored. This can only happen when all 3 persistence instances were lost at the same time. In such a rare case, there is a possibility that the services would be automatically restarted, but without the desired user configuration.
The steps for this persistence loss MOP are below:
- Following a recovery event where 3 or more nodes were recovered via csar redeploy: use the rvtconfig report-group-status command to detect which nodes were listed as persistence instances. If all the persistence instances were lost and have been (or need to be) recovered using csar redeploy, then continue with this persistence loss MOP. Otherwise, end this persistence loss MOP.
- Use the output of report-group-status to inspect the service lifetime of the Rhino processes on each node. If any of the Rhino nodes has an uptime that predates the initial loss of the persistence instances, then exit this persistence loss MOP.
  - The Rhino service lifetime is reported in the Rhino section for a node in the output of report-group-status.
  - A sample reading: [ OK ] systemd service is active (running), active for 3 hours, 44 minutes, 33 seconds
- Detect whether the user configuration has been applied, as opposed to the default configuration. This step is specific to the deployment, being a recommended mix of:
  - Run any test call via the remaining nodes which relies on your non-default configuration.
  - Use service monitoring specific to your deployment to inspect service health.
- If the above indicate a complete loss of service, redeploy the whole VNF, using one of the recovery procedures below:
  - If midway through an upgrade: Recover all nodes on one side of an upgrade
  - If all nodes are on the same version: When recovering all nodes in the group