VM recovery overview
After the initial deployment of the VMs, some VMs might malfunction, for example because of a service fault or a system failure. Depending on the situation, Rhino VM automation allows you to recover malfunctioning VM nodes without affecting other nodes in the same VM group.
High level recovery options
The following table summarizes typical VM issues and the recovery operation you can use to resolve each issue.
VM issues | Recovery operation to resolve the issues |
---|---|
Transient VM issues. | Reboot the affected VMs, in sequence, checking for VM convergence before moving on to the next node. |
A VM malfunctions, but the | Use the During the healing process, the system performs decommission operations, such as notifying the MDM server of the VM status, before replacing the VM. |
A VM cannot be recovered with the | Use the During the replacement process, the system doesn’t perform any decommission operations. Instead, it deletes the VM directly and then replaces it with a new one. |
All VMs in a group don’t work. | Redeploy the VM group, by using the Backout procedure for the current platform. |
All VMs that have been deployed don’t work. | Perform a full redeployment of the VMs, by using the Backout procedure for each group of VMs, then deploying again. |
Recovery operations in the table are ordered from quickest and least impactful to slowest and most invasive. To minimize system impact, always use the quickest and least impactful operation that can resolve the issue.
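The ordering rule above can be sketched in a few lines of Python. This is only an illustration: the `RecoveryOperation` names and the `choose_operation` helper are hypothetical labels for the rows of the table, not part of any Rhino VM automation or SIMPL tooling, and deciding which operations are able to fix a given fault remains an engineer's judgment.

```python
from enum import IntEnum


class RecoveryOperation(IntEnum):
    """Hypothetical labels for the table rows, least to most invasive."""
    REBOOT = 1
    HEAL = 2
    REPLACE = 3
    REDEPLOY_GROUP = 4
    FULL_REDEPLOYMENT = 5


def choose_operation(candidates):
    """Return the least invasive of the operations judged able to fix the fault."""
    if not candidates:
        raise ValueError("no candidate recovery operation supplied")
    # IntEnum members compare by value, so min() picks the least invasive one.
    return min(candidates)


# Example: an engineer judges that either a heal or a replacement would work.
chosen = choose_operation({RecoveryOperation.REPLACE, RecoveryOperation.HEAL})
print(chosen.name)  # HEAL
```

Encoding the order as integer values keeps the selection rule to a single `min()` call, which mirrors the "least impactful first" guidance.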
The csar heal and csar recovery operations are the main focus of this section.
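For the simplest row of the table (rebooting affected VMs in sequence and checking for convergence before moving on), the control flow can be sketched as follows. This is a minimal sketch under stated assumptions: the `reboot` and `has_converged` callables are hypothetical placeholders for whatever your platform tooling and status checks provide, and the timeout and polling values are arbitrary, not documented figures.

```python
import time
from typing import Callable, Iterable, List


def reboot_in_sequence(
    vms: Iterable[str],
    reboot: Callable[[str], None],          # hypothetical: triggers a reboot of one VM
    has_converged: Callable[[str], bool],   # hypothetical: True once the VM has converged
    timeout_s: float = 600.0,               # arbitrary illustrative value
    poll_interval_s: float = 10.0,          # arbitrary illustrative value
) -> List[str]:
    """Reboot each VM in turn, waiting for convergence before the next one.

    Returns the VMs that were rebooted and converged. Raises TimeoutError
    if a VM does not converge in time, leaving later VMs untouched.
    """
    done: List[str] = []
    for vm in vms:
        reboot(vm)
        deadline = time.monotonic() + timeout_s
        while not has_converged(vm):
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{vm} did not converge within {timeout_s}s")
            time.sleep(poll_interval_s)
        done.append(vm)
    return done
```

Stopping at the first VM that fails to converge means the remaining nodes in the group are never touched, which is the point of working in sequence.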
Notes on scope of recovery
VM outages are unpredictable, and VM recovery requires one or more human engineers in the loop to:
- notice a fault
- diagnose which VMs need recovering
- choose which operation to use
- execute the right procedure.
These pages focus on how to diagnose which VMs need recovery and how to perform that recovery. Initial fault detection and alerting is a separate concern; nothing in this documentation about recovery replaces the need for service monitoring.
The rvtconfig report-group-status command can help you decide which VM to recover and which operation to use.
VMs are replaced rather than healed in place
Both the heal and redeploy recovery operations replace the VM, rather than recovering it "in place". As such, any state on the VM that needs to be retained (such as logs) must be collected before recovery.
No configuration during recovery
Don’t apply configuration changes until the recovery operations are completed.
No upgrades during recovery
Don’t upgrade VMs until the recovery operations are completed.
This includes recovering to another version, which is not supported, with the exception of the "upgrade before upload-config" case below. A VM can only be recovered back to the version it was already running. A recovery operation cannot be used to skip over upgrade steps, for example. Before upgrading or rolling back a VM, allow any recovery operations (heal or redeploy) to complete successfully.
The reverse does not apply: VMs that malfunction part way through an upgrade or rollback can indeed be recovered using heal or redeploy.
Recovering from mistaken upgrade before upload-config
There is one case in which it is permissible to heal a VM to a different version. It applies when all of the following have occurred:
- The VMs were already deployed on an earlier downlevel version, and
- An upgrade attempt was made through csar update before uploading the uplevel configuration, and
- The csar update command timed out due to lack of configuration, and
- A rollback is wanted.
In this case, you can use the csar heal command to roll the partially updated VM back to the downlevel version.
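The conditions for this exception can be written as a single predicate, which may be useful as a checklist before running the heal. This is only a sketch: the `UpgradeAttempt` class and its field names are hypothetical and simply restate the four conditions listed above; nothing here queries a real deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpgradeAttempt:
    """Hypothetical record of the facts an engineer has established."""
    deployed_on_downlevel: bool            # VMs were already on an earlier downlevel version
    update_ran_before_config_upload: bool  # csar update was run before uploading uplevel config
    update_timed_out: bool                 # csar update timed out due to lack of configuration
    rollback_wanted: bool                  # a rollback is wanted


def may_heal_to_downlevel(attempt: UpgradeAttempt) -> bool:
    """Return True only if every condition for the exception holds."""
    return (
        attempt.deployed_on_downlevel
        and attempt.update_ran_before_config_upload
        and attempt.update_timed_out
        and attempt.rollback_wanted
    )


attempt = UpgradeAttempt(True, True, True, True)
print(may_heal_to_downlevel(attempt))  # True
```

If any one condition is false, the predicate returns False and the general rule applies: a VM can only be recovered to the version it was already running.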
Planning for the procedure
Background knowledge
This procedure assumes that:
- you have access to the SIMPL VM that was used to deploy the VMs
- you have detected a fault on one or more VMs in the group, which need replacing.
Reserve maintenance period
Perform these procedures in a maintenance period where possible. You can perform them outside a maintenance period if the affected VMs are causing immediate or imminent loss of service.
VM recovery time varies by node type. As a general guide, it should take approximately 15 minutes.
Tools and access
You must have access to the SIMPL VM, and the SIMPL VM must have the right permissions for your VM platform.
This page references an external document: the SIMPL VM Documentation. Ensure you have a copy available before proceeding.