VMware officially shipped vSphere 6 in March after previewing the update in October at VMworld. One long-requested feature was finally added: multi-processor support for Fault Tolerance (FT). FT is a vSphere availability feature that creates a “shadow” secondary VM that is kept in sync with a primary VM every CPU cycle. Prior to vSphere 6, both the primary and secondary VMs used the same virtual disks on a shared storage array. If the primary VM fails, the secondary VM immediately takes over. Note that this is a separate mechanism from vSphere High Availability, which automatically restarts the VMs from a non-responsive ESXi host on another host in the cluster. It is also different from Distributed Resource Scheduler (DRS), which distributes VMs across hosts based on load.
For many years, FT was restricted to single-vCPU VMs. This restriction significantly curtailed the use cases in which it could be employed. With vSphere 6, the limit has been raised to four vCPUs per VM, which makes the majority of VMs in a typical environment eligible.
In previous versions, FT used a replay mechanism known as vLockstep to echo all CPU instructions from a primary FT-enabled VM onto a secondary shadow VM. FT logging occurred over the network with an average round-trip latency of less than 1ms over 1Gbit interfaces. Network limitations, specifically latency and throughput, were the primary reasons for the lack of multi-processor support before vSphere 6. This design constraint has been addressed by replacing vLockstep with a new “fast check-pointing” mechanism.
Fast check-pointing enables multi-processor support with several new functions and requirement changes. The first is the capability to slow down a primary FT-enabled VM in order to enable a secondary shadow VM to maintain consistency. The second change is a requirement for 10Gbit interfaces on the ESXi host.
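The first change, slowing the primary so the secondary can keep up, is a form of backpressure. The sketch below is only a toy model of that idea and is not how VMware implements fast check-pointing. The function name, the parameters, and the notion of a "work unit" are all hypothetical, chosen for illustration.

```python
def simulate_backpressure(ticks, primary_rate, link_capacity, max_lag):
    """Toy model: throttle a primary so a secondary never lags too far.

    All names and units here are hypothetical. Each tick, the primary
    produces up to `primary_rate` units of changed state, and the link
    ships up to `link_capacity` units to the secondary. The primary is
    slowed whenever producing more would push the unshipped backlog
    above `max_lag`.

    Returns (work_done, peak_backlog).
    """
    backlog = 0        # changes not yet received by the secondary
    work_done = 0      # total units of work the primary completed
    peak_backlog = 0   # largest backlog observed

    for _ in range(ticks):
        # Throttle the primary so the backlog never exceeds max_lag.
        allowed = min(primary_rate, max_lag - backlog)
        backlog += allowed
        work_done += allowed
        peak_backlog = max(peak_backlog, backlog)

        # The secondary catches up as far as the link allows this tick.
        backlog -= min(backlog, link_capacity)

    return work_done, peak_backlog


if __name__ == "__main__":
    # Fast link: the primary is never slowed down.
    print(simulate_backpressure(10, primary_rate=5, link_capacity=10, max_lag=20))
    # Slow link: the primary is throttled to roughly the link's pace.
    print(simulate_backpressure(10, primary_rate=5, link_capacity=2, max_lag=8))
```

In this model a faster link means the primary is throttled less often, which is one way to read the 10Gbit requirement: more bandwidth leaves less reason to slow the primary.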
The third and most significant change is that the secondary shadow VM now has its own dedicated virtual disks. The primary and secondary VMs no longer share VMDK (Virtual Machine Disk) files on a storage array. Maintaining two separate copies of the disks permits more “slack” between the primary and secondary VMs, as they no longer depend on the same VMDKs. This also enables the use of local disks for FT-enabled VMs for the first time. In addition, it removes the thick-provisioning requirement for virtual disks and enables the use of snapshots again. VMware’s “shared-local” storage solution, vSAN, remains unsupported with FT on vSphere 6.
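The constraints mentioned so far can be collected into a quick pre-check. The following is a minimal sketch that covers only the limits stated in this post (four vCPUs, 10Gbit interfaces, no vSAN) and is not a complete list of FT requirements. The `VmCandidate` type and its fields are hypothetical and do not correspond to any VMware API, so the inventory data would have to come from your own tooling.

```python
from dataclasses import dataclass
from typing import List

MAX_FT_VCPUS = 4       # vCPU limit for FT in vSphere 6, per this post
MIN_FT_NIC_GBIT = 10   # required host interface speed, per this post


@dataclass
class VmCandidate:
    """Hypothetical inventory record; not a VMware API object."""
    name: str
    vcpus: int
    host_nic_gbit: int
    on_vsan: bool


def ft_blockers(vm: VmCandidate) -> List[str]:
    """Return the reasons a VM fails the checks described in this post."""
    reasons = []
    if vm.vcpus > MAX_FT_VCPUS:
        reasons.append(f"{vm.vcpus} vCPUs exceeds the limit of {MAX_FT_VCPUS}")
    if vm.host_nic_gbit < MIN_FT_NIC_GBIT:
        reasons.append(f"host interface is {vm.host_nic_gbit}Gbit, needs {MIN_FT_NIC_GBIT}Gbit")
    if vm.on_vsan:
        reasons.append("VM disks are on vSAN, which is unsupported with FT")
    return reasons


if __name__ == "__main__":
    candidates = [
        VmCandidate("web01", vcpus=2, host_nic_gbit=10, on_vsan=False),
        VmCandidate("db01", vcpus=8, host_nic_gbit=1, on_vsan=True),
    ]
    for vm in candidates:
        blockers = ft_blockers(vm)
        status = "eligible" if not blockers else "; ".join(blockers)
        print(f"{vm.name}: {status}")
```

An empty list means the VM passes these three checks only. Anything not mentioned in this post, such as licensing or host CPU compatibility, is outside what the sketch covers.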
The ability to employ fault tolerance on the majority of your VMs is an exciting prospect for any systems administrator who has had to support a critical application. Before you start planning an upgrade, however, it’s important to recognize that fault tolerance is not a long-term application uptime strategy. It is a tactic at best and a Band-Aid at worst. When prioritizing IT optimization spending, application owners and systems administrators should first address the fundamental recommended practices that maximize an application’s availability and reliability.
We’ll review these recommended practices further in part two of this blog topic.