Limitations of TSN recovery
The recovery procedures on this page (csar heal and csar redeploy) can only recover from a limited number of malfunctioning TSN nodes, depending on the size of the TSN cluster.
A malfunctioning TSN node is one on which the disk-based Cassandra, the RAM-disk Cassandra, or both are non-operational.
TSN groups with 3 or 4 nodes, which have a replication factor of 3, can only tolerate a malfunction on a single TSN node. During a rolling upgrade, when one node is already down, no further failures can be tolerated.
TSN groups with 5 or more nodes, which have a replication factor of 5, can tolerate malfunctions on up to 2 TSN nodes. During a rolling upgrade, when one node is already down, at most one other malfunctioning VM can be tolerated.
In some situations, the command rvtconfig restore-cds can be used to recover an entire TSN cluster that has failed and cannot be recovered using either csar heal or csar redeploy. The requirements for this procedure to work are:
- A recent CDS backup was taken using rvtconfig backup-cds (see the sketch after this list).
- The TSN version to restore is the same version from which the backup was taken.
- The Cassandra version to restore is the same version from which the backup was taken.
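The exact arguments of rvtconfig backup-cds are not documented on this page, so the following is only a sketch. It assumes the command accepts arguments analogous to the rvtconfig restore-cds command shown later on this page; the --output-dir flag and the backup directory are assumptions, not confirmed syntax, so verify against the rvtconfig documentation before use.
./rvtconfig backup-cds --sdf /home/admin/uplevel-config/sdf-rvt.yaml --site-id <site ID> --output-dir /home/admin/cds-backup --ssh-key-secret-id <SSH key secret ID> -c <CDS Address> <CDS auth args>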
If more than the above number of TSN nodes are malfunctioning (that is, either the disk-based Cassandra or the RAM-disk Cassandra is not providing service), and the services cannot be recovered manually or using rvtconfig restore-cds, the full deployment will need to be deleted and deployed again, including all the RVT node types that depend on the TSN.
There is currently no other procedure in this guide for recovering the TSN group or deployment if more than the above number of TSN nodes are lost.
See the Backout procedure within this guide for detailed steps on backing out the deployment.
Plan recovery approach
Recover the leader first when the leader is malfunctioning
When recovering multiple nodes, check whether any of the nodes to be recovered are reported as being the leader, based on the output of the rvtconfig report-group-status command.
If any of the nodes to be recovered are the current leader, recover the leader node first.
This helps to speed up the handover of group leadership, so that the recovery will complete faster.
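As a sketch of how to check the group status before planning recovery: the exact arguments of rvtconfig report-group-status are not shown in this section, so the flags below are assumptions based on the other rvtconfig commands used on this page and should be verified against the rvtconfig documentation.
./rvtconfig report-group-status --sdf /home/admin/uplevel-config/sdf-rvt.yaml --site-id <site ID> --ssh-key-secret-id <SSH key secret ID> -c <CDS Address> <CDS auth args>
The resulting report indicates which node currently holds group leadership; if that node is among those to be recovered, recover it first.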
Choose between csar heal and csar redeploy
In general, use the csar heal operation where possible instead of csar redeploy.
The csar heal operation requires that the initconf process is active on the VM, and that the VM can reach both the CDS and MDM services, as reported by rvtconfig report-group-status.
If any of those prerequisites are not met, use csar redeploy instead.
When report-group-status reports that a single node cannot connect to CDS or MDM, treat it as a VM-specific fault and use csar redeploy instead of csar heal.
However, a widespread failure of all the VMs in the group to connect to CDS or MDM suggests a need to investigate the health of the CDS and MDM services themselves, or the connectivity to them.
When recovering multiple VMs, you don’t have to use the same command (csar redeploy or csar heal) for all nodes.
Instead, choose the appropriate command for each VM according to the guidance on this page.
Recovering multiple TSN nodes
Recovering multiple TSN nodes involves additional considerations, so take care to recover the TSN nodes in the correct order, as described below.
Take care when recovering TSN nodes to avoid having too many nodes down at once, per the TSN limitations section above; for example, at most 2 nodes can be down in TSN groups of 5 or more nodes.
In this context, a TSN node being down includes:
- VM powered off
- VM unreachable
- The disk-based Cassandra being DN (down) or unavailable
- The RAM-disk Cassandra being DN (down) or unavailable
- The disk on the TSN being full, which will result in the service being DN (down) after a few minutes (see the disk usage check after this list)
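The disk-full condition can be checked with standard Linux tools on the TSN node itself. A minimal sketch; the Cassandra data location mentioned below is an assumption (the common default) and may differ on your deployment.
df -h
Look for any filesystem at or near 100% usage, in particular the one holding the Cassandra data directory (commonly /var/lib/cassandra, but verify the path for your deployment).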
If multiple TSNs are malfunctioning, follow this order of precedence for recovery:
- TSN nodes with a faulty disk-based Cassandra, in ascending order by node number, i.e. tsn-1 before tsn-2
- TSN nodes with a faulty RAM-disk Cassandra, in ascending order by node number, i.e. tsn-1 before tsn-2
- All remaining unhealthy TSNs, in ascending order by node number, i.e. tsn-1 before tsn-2
The Cassandra processes on the non-recovering nodes must be UN (up and normal); they must not be in the DN state when the replacement TSN node is started.
If this situation is detected, the recovering VMs will wait for the offending node until it is either restored to UP status or removed from the cluster.
In the edge case where 2 TSNs need to be redeployed at the same time, this situation will automatically resolve once both replacement TSN nodes come up.
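The UN/DN state of the Cassandra cluster members can be confirmed by running nodetool on a TSN node. A minimal sketch; note that the RAM-disk Cassandra instance may listen on a separate JMX port and therefore require the -p <port> option, which is an assumption here.
nodetool status
In the output, each cluster member is listed with a two-letter status: UN means Up/Normal, DN means Down/Normal. All non-recovering nodes should show UN before a replacement TSN node is started.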
Recovering all nodes from a total TSN cluster failure
In some situations it may be possible to recover a whole TSN cluster without having to delete the full deployment and deploy it again. This is a delicate operation that must only be tried as a last resort and only when the following conditions are met.
- A failed TSN upgrade that cannot be recovered using csar redeploy or csar heal.
- A CDS backup was taken before the upgrade above commenced.
- The restore is to the same TSN/Cassandra version from which the backup was taken.
- No other VM upgrades or config changes took place after the CDS backup was taken.
The required command sequence is as follows.
- Delete all TSN nodes in your deployment.
- Deploy the entire TSN cluster again using csar deploy:
csar deploy --sdf /home/admin/rvt-rollback-sdf/sdf-rvt.yaml --vnf tsn --sites <site name>
- Disable Cassandra repairs / scheduled tasks.
- Ensure initconf is paused on the non-TSN nodes:
./rvtconfig set-desired-running-state --sdf /home/admin/uplevel-config/sdf-rvt.yaml --site-id <site ID> --state Stopped
- Restore CDS from a backup previously taken with rvtconfig backup-cds:
./rvtconfig restore-cds --sdf /home/admin/uplevel-config/sdf-rvt.yaml --site-id <site ID> --snapshot-file /path/to/tsn_cassandra_backup.tar --ssh-key-secret-id <SSH key secret ID> -c <CDS Address> <CDS auth args>
- Resume initconf on the non-TSN nodes:
./rvtconfig set-desired-running-state --sdf /home/admin/uplevel-config/sdf-rvt.yaml --site-id <site ID> --state Started
- Perform verification tests to ensure the deployment is functioning as expected (see the sketch after this list).
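As a minimal verification sketch (not an exhaustive test plan), re-run rvtconfig report-group-status (as sketched earlier on this page) and confirm that every node reports healthy, then check on each TSN node that both Cassandra instances show all cluster members as UN, for example:
nodetool status
Follow this with any service-level verification tests defined for your deployment.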
Recovering one node
Healing one node
VMs should be healed one at a time, reassessing the group status using the rvtconfig report-group-status command after each heal operation, as detailed below.
See the 'Healing a VM' section of the SIMPL VM Documentation for details on the csar heal command.
The command should be run as follows:
csar heal --vm <VM name> --sdf <path to SDF>
Make sure that you pass the SDF pertaining to the correct version, that is, the same version that the recovering VM is already on; this is especially important during an upgrade.
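For example, to heal a TSN node that is still on the downlevel version during a rolling upgrade, the command might look like the following; the VM name and SDF path are purely illustrative and must match your deployment and the version the VM is currently on.
csar heal --vm tsn-1 --sdf /home/admin/rvt-rollback-sdf/sdf-rvt.yaml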
Redeploying one node
VMs should be redeployed one at a time, reassessing the group status using the rvtconfig report-group-status command after each redeploy operation, as detailed below.
Exceptions to this rule are noted on this page.
See the 'Healing a VM' section of the SIMPL VM Documentation for details on the csar redeploy command.
The command should be run as follows:
csar redeploy --vm <VM name> --sdf <path to SDF>
Make sure that you pass the SDF pertaining to the correct version, that is, the same version that the recovering VM is already on; this is especially important during an upgrade.
Re-check status after recovering each node
To ensure a node has been successfully recovered, check the status of the VM in the report generated by rvtconfig report-group-status.
The csar heal command waits until heal is complete before indicating success, or times out in the awaiting_manual_intervention case (see below).
The csar redeploy command does not wait until recovery is complete before returning.
On accidental heal or redeploy to the wrong version
If the output of report-group-status indicates an unintended recovery to the wrong version, follow the procedure in Troubleshooting accidental VM recovery to recover.
Manual intervention after recovering multiple TSN nodes
In some cases after recovering 2 TSN nodes simultaneously, manual intervention may be required if they were not able to heal successfully. In this case, see the initconf.log on the affected VMs for guidance (an example of inspecting the log is sketched at the end of this section).
This case is detectable based on the following:
- The output of the rvtconfig report-group-status command, which is to be run after each recovery operation, includes a mention of the awaiting_manual_intervention status.
- This case also manifests as a timeout of the csar heal command, due to the VM being in the awaiting_manual_intervention state.
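To inspect the log on an affected VM, a minimal sketch is shown below; the log path is an assumption (a typical location) and may differ on your VM version, so adjust it as needed.
tail -n 200 /var/log/tas/initconf.log
The log entries for the awaiting_manual_intervention state should describe why the node is waiting and what action is expected.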