When it comes to the IT challenges of the new Healthcare.gov health insurance exchange, the hits keep on coming. Over the weekend, the Healthcare.gov hosting vendor, Terremark (a subsidiary of Verizon), “experienced a failure in a networking component,” and the attempted fix crashed the system. The failure is in the “Data Services Hub,” which connects the exchange to other federal systems to determine eligibility. This hub supports not only the federal exchange but also many of the state exchanges that use it to check eligibility for subsidies.
This particular outage is embarrassing for the administration because on Saturday, Kathleen Sebelius, Secretary of Health and Human Services, published a blog post, What’s Working in the Marketplace: The Data Services Hub, describing it as one of the (few) success stories of the Healthcare.gov website. Twenty-four hours later, the service is unavailable, preventing nearly everyone from shopping for or signing up for a healthcare plan.
For the IT leadership responsible for the health insurance exchange deployment, this outage is even more embarrassing – especially as it enters its second day with the Data Services Hub still unavailable. For a system launch this large, it is surprising that the failure of a networking component would take down the entire system – and inconceivable that redundant systems in a second facility are not available to handle this kind of disaster. More than 24 hours into the outage, the fact that HHS has been unable to fail over to a secondary site is concerning, and likely one more indication of how rushed the launch of the exchange was.
When it comes to high availability (keeping small outages from becoming large outages) and disaster recovery (failing over to a second site during a large outage), all major system deployments (and changes) should consider the following practices:
- Test Component Failure: Regardless of vendor claims, high availability is not a function of having two of every component in the system. It is a function of architecture, testing, and discipline. With redundancy comes complexity, and in my experience, the more redundancy there is, the more likely an individual component failure becomes. Highly available systems must have their redundant capabilities tested under load to confirm that everything works as architected. Test plans must be developed (and ideally automated) so high availability functionality can be re-verified with each new software release, operating system patch, and even component driver. I have seen a change as simple as a network card driver update introduce an unknown incompatibility that impacted high availability.
- Changes must be automated and tested: Staging systems with identical redundant capabilities – including integration with business partners – must be deployed so updates and changes can be tested on an identical system, preferably one that is also under simulated load. All changes must then be scripted – because if a human is making a change by hand, the likelihood of failure is high.
- The emotion of disaster recovery must be removed: IT disasters are more likely to be caused by a failed component or a change gone poorly than by a hurricane, tornado, earthquake, or fire. When an IT-caused failure occurs, the attitude of the engineers working on the problem tends to be, “I’m close and I can get this fixed soon.” Unfortunately, that attitude can lead to longer outages because no one wants to “give up” and fail over to the disaster recovery site – either because they lack confidence that the disaster recovery site works or because they recognize the effort it will take to fail back to production once the production issues are resolved. Timelines must be developed with the business so that the decision to fail over to a disaster recovery site is published and well known. Agreeing before a change or incident that “after the second hour of failure, primary focus shall be on failing over to the disaster recovery site” removes the emotion from the decision and helps ensure recovery of the system on a known schedule.
- Disaster Recovery must be tested: Replicated systems and data are worthless without process and automation. Too many organizations rely on tribal knowledge to fail over to a disaster recovery site – requiring an “all hands on deck” effort. This lack of automation and documentation leads to a general lack of confidence in the ability to use the disaster recovery site – and therefore a hesitation to actually use it when a disaster is encountered. It is also unrealistic to assume that you can have all hands on deck even during IT-caused disasters – let alone during a natural disaster, when your staff is more likely to be focused on their families and homes than on keeping systems available.
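To make the “remove the emotion” point concrete: the pre-agreed failover timeline can be reduced to a few lines of code that any monitoring system could run. This is a minimal sketch, not HHS’s actual tooling – the two-hour threshold comes from the example above, and the function name is hypothetical.

```python
from datetime import datetime, timedelta

# Assumed threshold, agreed with the business *before* any incident:
# after two hours of failure, primary focus shifts to the DR site.
FAILOVER_THRESHOLD = timedelta(hours=2)

def should_begin_failover(outage_start: datetime, now: datetime) -> bool:
    """Return True once the outage exceeds the pre-agreed threshold.

    The decision is mechanical: no engineer has to "give up" on the
    production fix, because the published timeline makes the call.
    """
    return now - outage_start >= FAILOVER_THRESHOLD

# Ninety minutes in: keep working the production issue.
outage_start = datetime(2013, 10, 27, 9, 0)
print(should_begin_failover(outage_start, datetime(2013, 10, 27, 10, 30)))  # False

# Two hours in: focus shifts to failing over.
print(should_begin_failover(outage_start, datetime(2013, 10, 27, 11, 0)))   # True
```

The value is not in the trivial arithmetic but in the fact that the threshold is written down and published before the incident, so the fail-over decision is never a judgment call made under stress.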
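The “tested, not tribal” principle for disaster recovery can likewise be sketched as a scripted drill. The check functions and their names here are hypothetical placeholders – a real drill would probe actual services (database replication, the eligibility hub integration, synthetic user transactions) – but the shape is the point: the drill is code, so it can run on a schedule without an all-hands effort.

```python
def check_replication_lag() -> bool:
    # Placeholder: a real check would query the DR database and
    # confirm replication lag is under an agreed threshold.
    return True

def check_dr_web_tier() -> bool:
    # Placeholder: a real check would run synthetic transactions
    # against the DR site, ideally under simulated load.
    return True

# Each named check is one piece of what would otherwise be tribal knowledge.
DR_CHECKS = {
    "replication lag within threshold": check_replication_lag,
    "DR web tier serving synthetic transactions": check_dr_web_tier,
}

def run_dr_drill() -> bool:
    """Run every check, report each result, and return overall pass/fail."""
    passed = True
    for name, check in DR_CHECKS.items():
        ok = check()
        print(f"{'PASS' if ok else 'FAIL'}: {name}")
        passed = passed and ok
    return passed

run_dr_drill()
```

Confidence in a disaster recovery site comes from repetition: a drill that runs automatically every week produces the documentation and the track record that make engineers willing to actually use the site when the clock runs out.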
The failures of the Healthcare.gov website may be more visible than your systems, but your systems are no less critical to the success of your organization. As you deploy your business-critical systems, high availability and disaster recovery are often top of mind. But while vendors are happy to sell you a hardware and software foundation, those systems are worthless without the process discipline to ensure they actually work.