In part one of this blog series, we reviewed VMware’s new multi-CPU fault tolerance capability in vSphere 6. This change makes fault tolerance available for nearly all VMs. It’s tempting to begin enabling this feature for all of your critical VMs immediately, but it’s important to step back and make certain you are first addressing the fundamentals of a reliable data center.
Any plans for IT optimization in the data center should begin with assessing the numerous recommended practices for availability and reliability before you consider employing tools such as fault tolerance.
Many IT organizations have put these practices in place, but it is not uncommon to find that the environment has changed or that there has been a lapse in some way. The first step should be to create and maintain an application inventory for all of your key applications, including core infrastructure. If you are starting from scratch, first look at your most important business applications and mission critical infrastructure such as Active Directory and Exchange.
An application inventory should maintain the following data points:
- Application owner(s) who support end users
- Application owner(s) who manage the vendor relationship
- All major versions in use
- Vendor support contract information, including contacts, account manager, etc.
- Licensing and support
- Application data attributes with flags for confidential data, including customer addresses, account numbers, healthcare data and PCI data
- Hosting locations and environment (example: production in primary Seattle data center, warm DR in Oregon data center)
- List of all servers running the application
- Application recovery point objective (RPO) prescribing data backup and replication interval
- Application recovery time objective (RTO) prescribing secondary production or DR strategy
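An inventory like this can live in a CMDB or even a spreadsheet; as a rough illustration, a single record might look like the following Python sketch. The `ApplicationRecord` class and its field names are hypothetical, not a standard schema.

```python
from dataclasses import dataclass, field

# Hypothetical inventory record; class and field names are illustrative,
# not a standard schema.
@dataclass
class ApplicationRecord:
    name: str
    end_user_owner: str                  # owner who supports end users
    vendor_owner: str                    # owner who manages the vendor relationship
    versions: list = field(default_factory=list)         # all major versions in use
    support_contract: str = ""           # vendor contacts, account manager, etc.
    licensing: str = ""
    confidential_data: set = field(default_factory=set)  # e.g. {"PCI", "healthcare"}
    hosting: str = ""                    # e.g. "production: Seattle DC; warm DR: Oregon DC"
    servers: list = field(default_factory=list)          # all servers running the app
    rpo_hours: float = 24.0              # recovery point objective
    rto_hours: float = 24.0              # recovery time objective

# Example entry for a hypothetical CRM application.
crm = ApplicationRecord(
    name="CRM",
    end_user_owner="appsupport-team",
    vendor_owner="vendor-mgmt-team",
    versions=["9.2", "10.1"],
    confidential_data={"customer addresses", "PCI"},
    servers=["crm-web-01", "crm-db-01"],
    rpo_hours=4,
    rto_hours=8,
)
```

However you store it, the point is that every data point in the list above has a dedicated place, so gaps are visible at a glance.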
Once an application inventory has been created, use the documented RPO and RTO data points to drive your backup and recovery strategy. Backups can be performed in numerous ways, the most traditional being enterprise backup software that manages agents on each server and writes to tapes stored offsite. Many organizations find that backups to disk-based storage arrays or to a third-party service provide better value than tape. For virtualized environments, traditional agent-based backup software is best replaced by agentless products such as Veeam, which better balance hypervisor load and offer item-level integrations with many key applications.
Regardless of the method you choose, be certain it provides for these attributes:
- Backups, regardless of media, must be stored offsite from the server being backed up
- A regular schedule of full and incremental backups must be maintained
- Applications that run dual production environments out of two geo-diverse data centers must still be periodically backed up to a volume that is not actively used by the application, in case corrupted data is replicated to both data centers
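To make the RPO-driven part of this concrete, here is a minimal, hypothetical sketch of deriving a backup cadence from a documented RPO. The thresholds and return format are assumptions for illustration, not a universal policy.

```python
# Illustrative sketch only: derive a backup cadence from a documented RPO.
# The thresholds and return format are assumptions, not a universal policy.
def backup_plan(rpo_hours: float) -> dict:
    """Return a simple full/incremental cadence that satisfies the given RPO."""
    if rpo_hours < 1:
        # Sub-hour RPOs generally call for continuous replication rather than
        # scheduled backups alone.
        return {"full": "weekly", "incremental": "continuous replication"}
    # Incrementals must run at least as often as the RPO permits data loss,
    # capped here at a daily cadence.
    interval = min(rpo_hours, 24)
    return {"full": "weekly", "incremental": f"every {interval:g} hours"}
```

For example, an application whose inventory records a 4-hour RPO would get weekly fulls with incrementals every 4 hours, while a sub-hour RPO signals that scheduled backups alone are not enough.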
Within a vSphere environment, consider enabling High Availability and Distributed Resource Scheduler. High Availability (HA) automatically restarts VMs on another ESXi host within the cluster when their host fails or becomes unresponsive. If you choose to enable HA, be sure to carefully analyze the impacts of the various host isolation configurations specific to the application workload.
Distributed Resource Scheduler (DRS) automatically migrates VMs via vMotion from one ESXi host to another within a cluster to prevent application impacts from an overloaded ESXi host. As with HA, carefully review and test your DRS configuration to ensure that it does not move VMs across your environment too aggressively.
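Returning to the HA host isolation settings mentioned above: the sketch below models, in deliberately simplified form, how the standard vSphere isolation response options determine what happens to a VM when its host loses management-network connectivity. The response names match vSphere’s standard options, but the decision logic here is an approximation for illustration, not VMware’s implementation.

```python
# Simplified model of vSphere HA's host isolation response options and their
# effect on a VM when its host is network-isolated. An approximation for
# illustration, not VMware's actual implementation.
ISOLATION_RESPONSES = {"leave_powered_on", "power_off", "shut_down"}

def vm_action_on_isolation(response: str, guest_tools_running: bool) -> str:
    if response not in ISOLATION_RESPONSES:
        raise ValueError(f"unknown isolation response: {response}")
    if response == "leave_powered_on":
        return "VM keeps running on the isolated host"
    if response == "shut_down" and guest_tools_running:
        return "VM is shut down gracefully, then restarted on another host"
    # "Power off", or "Shut down" without working VMware Tools, results in a
    # hard power-off before the VM is restarted elsewhere.
    return "VM is powered off, then restarted on another host"
```

Walking your own workloads through a decision table like this is a quick way to spot applications that cannot tolerate a hard power-off and therefore need the graceful shutdown option and working VMware Tools.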
If you have a key application that is only operating in a single data center with offsite backups, consider building out a basic DR environment as a first step towards a full-fledged dual production “hot/hot” design. This is sometimes referred to as a “pilot light” DR strategy. In this design pattern, core critical components are hosted in a second location. At a minimum, this nearly always includes actively replicating databases from production into the DR environment, as well as maintaining virtual machine images at the DR location that are pre-configured for the application but powered off.
The next level of DR maturity is a “hot/warm” DR design where all application components are powered on but largely idle, except for database replication. In the event of a failover to DR, the warm environment would be activated and begin receiving traffic. For warm DR environments built on managed hosting or on public cloud providers such as Amazon Web Services or Microsoft Azure, there are opportunities to reduce costs by scaling down the size and quantity of VMs while in “warm” mode. When a warm DR environment is activated, resources can be increased or added to meet the demands of production.
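A scale-up step of this kind can be sketched as a diff between the warm-mode sizing and the production sizing. The VM roles and sizes below are purely illustrative.

```python
# Hypothetical warm-DR and production sizing profiles; roles and numbers are
# illustrative only.
WARM_PROFILE = {"app-web": {"count": 1, "vcpu": 2, "ram_gb": 4},
                "app-db":  {"count": 1, "vcpu": 4, "ram_gb": 16}}

PROD_PROFILE = {"app-web": {"count": 4, "vcpu": 4, "ram_gb": 8},
                "app-db":  {"count": 2, "vcpu": 8, "ram_gb": 32}}

def activate_dr(current: dict, target: dict) -> list:
    """Return the scaling operations needed to bring warm DR to production size."""
    ops = []
    for role, size in target.items():
        cur = current[role]
        if size["count"] > cur["count"]:
            ops.append(f"add {size['count'] - cur['count']} x {role}")
        if (size["vcpu"], size["ram_gb"]) != (cur["vcpu"], cur["ram_gb"]):
            ops.append(f"resize {role} to {size['vcpu']} vCPU / {size['ram_gb']} GB")
    return ops
```

Keeping both profiles in version control makes the cost saving explicit and turns DR activation into a reviewable, repeatable runbook rather than an improvised scramble.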
The highest level of maturity would be a “hot/hot” design where production traffic is load balanced across two production data centers in diverse regions. This design requires significant investment in application development and does not always accommodate legacy technologies. The costs of a “hot/hot” design can be significant and must be carefully weighed against a true analysis of RPO and RTO. Also keep in mind that applications with secondary production or DR environments should be tested on a regular schedule to validate recovery mechanisms, at minimum annually.
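Conceptually, traffic distribution in a hot/hot design reduces to routing each request to a healthy region, as in this deliberately minimal sketch. The region names and health-check callables are hypothetical; real designs use global load balancers or DNS-based traffic management.

```python
import itertools

# Minimal sketch of hot/hot traffic distribution: round-robin across regions,
# skipping any region whose health check fails. Region names and health
# checks are hypothetical.
def make_router(regions: dict):
    """regions maps region name -> zero-argument health-check callable."""
    ring = itertools.cycle(regions)
    def route() -> str:
        for _ in range(len(regions)):
            region = next(ring)
            if regions[region]():            # region passes its health check
                return region
        raise RuntimeError("no healthy region available")
    return route

route = make_router({"us-west": lambda: True, "us-east": lambda: True})
```

Even this toy version shows why hot/hot demands application-level work: every request must be servable from either region, which is exactly where legacy, state-heavy technologies tend to fall short.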
Now that we’ve addressed some core recommended practices, in the third and final entry of this blog series we’ll move on to more advanced strategies you can employ to make your applications more resilient.