On August 8, 2016, Delta Air Lines endured a day of hundreds of cancelled flights, thousands of delayed flights, and tens of thousands of frustrated customers due to “a power outage in Atlanta [which] impacted Delta computer systems and operations worldwide.” Customers recounted that Delta staff told them a routine scheduled switch to the backup generator had caused a fire that destroyed both the backup and the primary generator, and that system recovery could not begin until firefighters had extinguished the fire. During the outage, systems were unavailable, which kept flights grounded and limited the information flowing to Delta’s customers. Ultimately, Delta’s critical incident procedures kicked in and its employees did their best to update customers in a timely manner. Once the power came back on, Delta recovered its systems, planes started taking off again, and the CEO released a video apologizing to customers for the inconvenience.
In my 20 years of working with clients on IT operational needs and budget priorities, I have often seen disaster recovery investments fail to make the final cut. IT leaders understand why the investments are required, but often struggle to make the business case to justify them. The defense I most frequently hear is that, given the low probability of a disaster, the budget would be better spent elsewhere. While no post-mortem on the Delta event has been published, IT leaders can walk away with the following key lessons:
- Complete disaster events that take out a site permanently are rare, low-probability events – but smaller outages (power loss, fiber cuts, flooding, etc.) can still take out a single location. Unexpected issues and human mistakes that lead to full-site system unavailability can and do occur.
- As no site can offer full redundancy, IT leadership must work with business leadership to identify which Tier 1 systems are required to maintain a base level of operations – including accurate communications with key stakeholders.
- Investment justification must include not only the hard costs of an outage – but also the soft costs of brand and reputational loss. Trending on Twitter because you stranded thousands of travelers is not how any company wants to see its social media traffic increase.
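One way to frame that justification is as an expected annual loss: how often an outage is likely to happen, multiplied by what it costs when it does, with hard and soft costs counted together. The sketch below is only an illustration of that arithmetic. The function name and every figure in it are hypothetical placeholders, not Delta's numbers or any client's, and a real business case would need estimates agreed with finance and the business.

```python
def annualized_loss(outages_per_year, hard_cost, soft_cost):
    """Expected yearly loss: frequency x (hard + soft cost per outage)."""
    return outages_per_year * (hard_cost + soft_cost)


# All figures are hypothetical placeholders chosen for illustration only.
# A rare full-site loss: assumed once every 50 years.
site_loss = annualized_loss(
    outages_per_year=0.02, hard_cost=50_000_000, soft_cost=25_000_000
)

# A smaller outage (power loss, fiber cut): assumed once every 2 years.
minor_loss = annualized_loss(
    outages_per_year=0.5, hard_cost=2_000_000, soft_cost=1_000_000
)

total_expected_loss = site_loss + minor_loss

# Assumed yearly spend on disaster recovery, for comparison.
dr_annual_cost = 1_000_000

print(f"Expected annual loss without DR: {total_expected_loss:,.0f}")
print(f"Assumed annual DR investment:    {dr_annual_cost:,.0f}")
```

With these assumed inputs, the frequent small outages contribute as much to the expected loss as the rare catastrophic one, which is the point of the first lesson above: a low probability of total disaster does not by itself make the case for spending elsewhere.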
Disasters – large and small – do occur and can have a significant impact on business operations and on the reputation of the business. While these types of disasters are insurable under Electronic Data Loss policies, expense reimbursement can never compensate for the reputational loss. IT leaders must present thoughtful risk management plans and budget requests to their business leaders so that their CEO isn’t the next one filming an apology video.