VM recovery overview

After the initial deployment, a VM might malfunction, for example because of a service fault or a system failure. Depending on the situation, Rhino VM automation allows you to recover malfunctioning VM nodes without affecting other nodes in the same VM group.

High level recovery options

The following table summarizes typical VM issues and the recovery operation you can use to resolve each issue.

| VM issue | Recovery operation to resolve the issue |
|---|---|
| Transient VM issues. | Reboot the affected VMs, in sequence, checking for VM convergence before moving on to the next node. |
| A VM malfunctions, but the initconf process still works, and the VM can communicate with the CDS and the MDM servers, and its disk is not full. | Use the csar heal command to heal the VM. See the recovery steps for more details. During the healing process, the system performs decommission operations, such as notifying the MDM server of the VM status, before replacing the VM. |
| A VM cannot be recovered with the csar heal command or has been deleted. | Use the csar redeploy command to replace the VM. See the recovery steps for more details. During the replacement process, the system doesn’t perform any decommission operations. Instead, it deletes the VM directly and then replaces it with a new one. |
| All VMs in a group don’t work. | Redeploy the VM group, by using the Backout procedure for the current platform. |
| All VMs that have been deployed don’t work. | Perform a full redeployment of the VMs, by using the Backout procedure for each group of VMs, then deploying again. |

Recovery operations in the table are ordered from quickest and least impactful to slowest and most invasive. To minimize system impact, always use the quickest, least impactful operation that resolves the issue.
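The table above can be read as a decision procedure. The following Python sketch is one illustrative way to encode it, not part of the product. The `Diagnosis` fields and the `RecoveryOperation` names are hypothetical, and the fallback to redeploy when the heal preconditions are not met is an assumption drawn from the table rather than documented behaviour.

```python
from dataclasses import dataclass
from enum import IntEnum


class RecoveryOperation(IntEnum):
    """Recovery operations, ordered from least to most invasive."""
    REBOOT = 1
    HEAL = 2               # csar heal
    REDEPLOY = 3           # csar redeploy
    REDEPLOY_GROUP = 4     # Backout procedure for one VM group
    FULL_REDEPLOYMENT = 5  # Backout procedure for every group, then deploy again


@dataclass
class Diagnosis:
    """Hypothetical summary of an engineer's findings (not a real API)."""
    transient: bool = False
    vm_deleted: bool = False
    initconf_working: bool = True
    can_reach_cds_and_mdm: bool = True
    disk_full: bool = False
    whole_group_down: bool = False
    all_vms_down: bool = False


def choose_recovery_operation(d: Diagnosis) -> RecoveryOperation:
    """Return the least invasive operation that matches the diagnosis."""
    # Widest outages first: they cannot be fixed one VM at a time.
    if d.all_vms_down:
        return RecoveryOperation.FULL_REDEPLOYMENT
    if d.whole_group_down:
        return RecoveryOperation.REDEPLOY_GROUP
    # A deleted VM can only be replaced.
    if d.vm_deleted:
        return RecoveryOperation.REDEPLOY
    if d.transient:
        return RecoveryOperation.REBOOT
    # Heal needs initconf, CDS/MDM connectivity and free disk space.
    if d.initconf_working and d.can_reach_cds_and_mdm and not d.disk_full:
        return RecoveryOperation.HEAL
    # Assumption: if the heal preconditions fail, fall back to redeploy.
    return RecoveryOperation.REDEPLOY
```

For example, `choose_recovery_operation(Diagnosis(disk_full=True))` selects `REDEPLOY`, because a full disk rules out healing. The result is only a suggestion; the engineer still confirms the choice against the recovery steps.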

The csar heal and csar redeploy operations are the main focus of this section.

Notes on scope of recovery

VM outages are unpredictable, and VM recovery requires one or more human engineers in the loop to:

  • notice a fault

  • diagnose which VMs need recovering

  • choose which operation to use

  • execute the right procedure.

Note

These pages focus on how to diagnose which VMs need recovery and how to perform that recovery. Initial fault detection and alerting is a separate concern; nothing in this documentation about recovery replaces the need for service monitoring.

The rvtconfig report-group-status command can help you decide which VM to recover and which operation to use.

VMs are replaced rather than healed in place

Both the heal and redeploy recovery operations replace the VM, rather than recovering it "in place". As such, any state on the VM that needs to be retained (such as logs) must be collected before recovery.
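As a minimal sketch of preserving state before a replacement, the Python function below packs a directory of collected files into a timestamped archive. It assumes the logs have already been copied from the VM to a local directory by some other means; the function name, the directory layout and the archive naming are all hypothetical and are not part of csar or rvtconfig.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def archive_directory(source_dir: Path, dest_dir: Path, vm_name: str) -> Path:
    """Pack source_dir into a timestamped .tar.gz file under dest_dir.

    Hypothetical helper: source_dir is assumed to hold files already
    copied off the VM that is about to be healed or redeployed.
    """
    if not source_dir.is_dir():
        raise FileNotFoundError(f"no such directory: {source_dir}")
    dest_dir.mkdir(parents=True, exist_ok=True)
    # A UTC timestamp keeps archives from repeated recoveries distinct.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive_path = dest_dir / f"{vm_name}-{stamp}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store entries under the VM name rather than the local path.
        tar.add(source_dir, arcname=vm_name)
    return archive_path
```

Run this before starting csar heal or csar redeploy, because files that exist only on the VM are lost once the VM is replaced.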

No configuration during recovery

Don’t apply configuration changes until the recovery operations are completed.

No upgrades during recovery

Don’t upgrade VMs until the recovery operations are completed.

This includes recovering to another version, which is not supported except in the "upgrade before upload-config" case below. A VM can only be recovered to the version it was already running; a recovery operation cannot be used to skip upgrade steps, for example. Before upgrading or rolling back a VM, allow any recovery operations (heal or redeploy) to complete successfully.

Note

The reverse does not apply: VMs that malfunction part way through an upgrade or rollback can indeed be recovered using heal or redeploy.

Recovering from mistaken upgrade before upload-config

There is one case in which it is permissible to heal a VM to a different version: when all of the following mistaken steps have occurred:

  1. The VMs were already deployed on an earlier downlevel version, and

  2. An upgrade attempt was made through csar update before uploading the uplevel configuration, and

  3. The csar update command timed out due to lack of configuration, and

  4. A roll back is wanted.

In this case, you can use the csar heal command to roll the partially updated VM back to the downlevel version.
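The version rule and its single exception can be summarised as a small check. The Python sketch below is illustrative only: `UpgradeHistory`, its fields and `heal_target_allowed` are hypothetical names, and the version strings in the usage note are made up. It covers csar heal only, since the exception above is described for heal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UpgradeHistory:
    """Hypothetical record of the four conditions listed above."""
    deployed_on_downlevel: bool
    update_run_before_upload_config: bool
    update_timed_out: bool
    rollback_wanted: bool


def is_mistaken_upgrade_case(history: UpgradeHistory) -> bool:
    """True only when all four conditions hold."""
    return (
        history.deployed_on_downlevel
        and history.update_run_before_upload_config
        and history.update_timed_out
        and history.rollback_wanted
    )


def heal_target_allowed(
    current_version: str,
    target_version: str,
    history: Optional[UpgradeHistory] = None,
) -> bool:
    """Decide whether healing to target_version is within the rules."""
    # Normal case: a VM is recovered to the version it is already running.
    if target_version == current_version:
        return True
    # Exception: a mistaken upgrade before upload-config may be rolled back.
    return history is not None and is_mistaken_upgrade_case(history)
```

With made-up versions, `heal_target_allowed("4.1", "4.0")` returns `False`, whereas passing an `UpgradeHistory` with all four fields set to `True` returns `True`.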

Planning for the procedure

Background knowledge

This procedure assumes that:

  • you have access to the SIMPL VM that was used to deploy the VMs

  • you have detected a fault on one or more VMs in the group, which need replacing

Reserve maintenance period

Perform these procedures in a maintenance period where possible. You can perform them outside a maintenance period if the affected VMs are causing an immediate or imminent loss of service.

VM recovery time varies by node type. As a general guide, it should take approximately 15 minutes.

People

You must be a system operator to perform the MOP (method of procedure) steps.

Tools and access

You must have access to the SIMPL VM, and the SIMPL VM must have the right permissions for your VM platform.

This page references an external document: the SIMPL VM Documentation. Ensure you have a copy available before proceeding.

VM Build Container Version 3.2