The mark of IT excellence for an enterprise has historically been the delivery of services at a “five-nines” level of availability, or 99.999%. At a high level, this equates to delivering services with only about 5 minutes of downtime in a given year. Those who do not operate their IT services at five-nines availability are typically aspiring toward that level of service. It may sound like a small change, but transitioning from three-nines (roughly 9 hours of downtime per year) to four-nines (roughly 53 minutes per year) can be a significant effort, and it’s even more difficult to make the transition from four-nines to the coveted five-nines.
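For anyone who wants to check the arithmetic behind those downtime figures, here is a minimal sketch. It assumes a 365-day year and treats “N nines” as an availability of 1 − 10^(−N); the function name is my own, chosen for illustration.

```python
# Minutes in a non-leap year (assumption: leap years are ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year, in minutes, for an availability of N nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001 (99.9% available)
    return MINUTES_PER_YEAR * unavailability

if __name__ == "__main__":
    for n in (3, 4, 5):
        minutes = downtime_minutes_per_year(n)
        print(f"{n} nines: {minutes:.2f} minutes/year ({minutes / 60:.2f} hours/year)")
```

Each additional nine cuts the downtime budget by a factor of ten, which is why every step up is so much harder than the last.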
In a less mature IT shop that has typically run at three-nines and is looking to improve, it’s common to “go with what you know” and pursue technologies that promise to increase availability. It’s very common to think that technologies such as clustering, load balancing, and standby systems are the easiest way to vault a platform from three-nines to four-nines. Though this may seem like the obvious solution, it is actually a bit misleading. Adding these types of high-availability technologies to a platform significantly increases the complexity of the system. A less mature IT shop that was delivering three-nines of availability is generally not well suited to manage even more complex systems, regardless of how much those systems promise to increase uptime. There’s a real risk that the additional complexity could cause availability to decrease if it is not managed properly.
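To make that risk concrete, here is a rough sketch using the standard series/parallel availability formulas. Everything in it is an assumption for illustration: the failures are treated as independent, the two nodes are identical, and the “failover layer” figure is a hypothetical number standing in for the cluster software, its configuration, and the people operating it.

```python
def series(*availabilities: float) -> float:
    """Every component must be up, so the availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(*availabilities: float) -> float:
    """At least one redundant component must be up (independent failures assumed)."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down

# A single three-nines server.
single_node = 0.999

# Two such servers clustered, behind a failover layer.
# Both failover-layer figures below are hypothetical.
well_managed = series(parallel(single_node, single_node), 0.9999)
poorly_managed = series(parallel(single_node, single_node), 0.998)

if __name__ == "__main__":
    print(f"single node:            {single_node:.6f}")
    print(f"cluster, well managed:  {well_managed:.6f}")
    print(f"cluster, poorly managed:{poorly_managed:.6f}")
```

Under these assumptions, the cluster only beats the single server when the layer that ties it together is itself more reliable than the server it protects. If that layer is mismanaged, the redundant hardware cannot make up the difference.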
The most effective means of transitioning from three-nines to four-nines is actually not a technological solution, but rather an approach that addresses the most common reason for outages in IT environments: human error. Human error can come in a number of forms. It can be something as simple as accidentally knocking the power cord out of a critical device, or something more complex, like siloed IT staff not being aware of all the systems that will be impacted by their proposed change. Human error is typically managed through well-practiced change controls and communication within the IT organization, usually as part of the overall IT service management effort. If and when an organization has reached a comfort level with these change control and service management concepts, more complex technologies can be introduced into the environment to strive for the ideal 99.999% availability metric.
There continues to be an emphasis on transitioning IT services to the cloud, and an important point of consideration during the evaluation should be: “How will the cloud fit into my aspirations for five-nines of availability?” As I discussed in a previous post, though the cloud can be a great platform for delivering commodity IT services, it does not release IT from its responsibility and accountability for the overall IT platform.
A great way to start planning for the cloud and ensuring it meets your requirements is to thoroughly understand your chosen cloud platforms’ service level agreements (SLAs), supported systems management procedures, scheduled outage windows, and other operational considerations. More importantly, you should consider how the cloud platforms can fit within your existing IT service management program. Each cloud provider’s management approach is a little different, but in general, they tend to be relatively simple and inflexible when compared to on-premises systems. The benefit of this rigidity is the relatively low cost at which providers can offer their services. On the other hand, it may impede your ability to adopt the platform if it can’t fit into your service management framework. As with all things related to the cloud, it’s important to thoroughly understand the platform before committing to it, and it’s not always going to be a perfect fit. At the same time, many of the cloud platforms offer enough benefit (cost, scalability, stability, etc.) that it can be worth finding ways to adapt your existing IT service management methodology to make it work for the cloud and to account for any inherent inflexibility it may have.
At the end of the day, evaluating the cloud should be very similar to your approach for looking at complex on-premises clustering and HA solutions: it’s important to have your own IT house in order before looking at the cloud. If your current IT environment is in shambles, rife with performance issues, and frequently experiencing outages, the cloud is not a fix for that. Slamming in a cloud solution on top of a mess will only increase complexity and diminish adoption. While I certainly don’t expect perfection, if things are relatively stable and your organization has a decent handle on change management, then yes, absolutely keep the cloud options on the table and keep striving for five-nines.