Over the weekend a company many of us are familiar with landed on the front page for the wrong reasons again, this time likely due to insufficient internal IT policies and procedures. Last Friday, Dropbox’s service went down, affecting some users for up to two days. In an excellent post-mortem write-up, Dropbox announced that the outage was due to a bug in a routine server upgrade that was inadvertently applied to its active server farm. Before Dropbox made the announcement, the Internet rumor mill blamed the outage on a successful hack and claimed users’ files had been compromised. These rumors were false, but the reputational damage was done.
A company that touts, and relies on, ease of use and the security of its users’ files needs to do a better job with Change Management. Two key components of Change Management, “Test the Change” and “Have a Back-Out Plan,” did not go far enough in this instance. Dropbox indicated that a failed server upgrade caused the outage; specifically: “During this process, the upgrade script checks to make sure there is no active data on the machine before installing the new OS. A subtle bug in the script caused the command to reinstall a small number of active machines. Unfortunately, some master-replica pairs were impacted which resulted in the site going down.” This admission underscores how critical it is to test all changes prior to production implementation, especially those that involve automation. The testing should occur on systems that are not only identical to production but also share production’s integration points and carry mock production data and attributes. A test at this level would have unearthed the bug that caused the problem.
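Dropbox has not published the script itself, so the following is only a minimal sketch of the failure mode it describes: a guard check that is supposed to block reinstallation of machines holding active data, where one wrong condition silently passes live machines. Every name here is hypothetical.

```python
# Hypothetical sketch of a pre-reinstall guard check.
# None of these names or fields come from Dropbox's write-up.

def has_active_data(machine: dict) -> bool:
    """Return True if the machine is currently serving live data."""
    return machine.get("role") in {"master", "replica"} and machine.get("active", False)

def safe_to_reinstall(machine: dict) -> bool:
    # The "subtle bug" class of error: a version that consulted a stale
    # inventory flag instead of live state, e.g.
    #     return not machine.get("inventory_says_active", False)
    # would happily select active master-replica pairs for reinstall.
    return not has_active_data(machine)

machines = [
    {"host": "db1",    "role": "master", "active": True},
    {"host": "spare7", "role": "spare",  "active": False},
]

# Only machines that pass the guard are queued for OS reinstall.
to_reinstall = [m["host"] for m in machines if safe_to_reinstall(m)]
```

The point of the sketch is that the guard is one boolean expression; a staging run against hosts with production-like roles and data is what catches the case where it evaluates wrongly.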
In terms of “Have a Back-Out Plan,” Dropbox appears to have had one, but the time it took to recover was too long for a service used by millions of people. Again, Dropbox admitted as much in its post: “When running infrastructure at large scale, the standard practice of running multiple replicas provides redundancy. However, should those replicas fail, the only option is to restore from backup. The standard tool used to recover MySQL data from backups is slow when dealing with large data sets.” I wonder whether this back-out procedure was ever tested to confirm that the downtime it entailed was acceptable. Instead of inventing a solution after the outage occurred, could Dropbox have found a faster recovery tool beforehand? Dropbox appears to have improved its recovery time as a result of last weekend’s mishap, but it took a reputational and brand hit in doing so.
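Testing a back-out plan means more than confirming the restore completes; it means timing a full rehearsal against an agreed recovery-time objective (RTO). A minimal sketch of that drill check, with illustrative names and numbers of my own, not Dropbox's:

```python
# Hypothetical back-out drill: time a restore rehearsal against an
# agreed recovery-time objective (RTO). Numbers are illustrative.
import time

RTO_SECONDS = 4 * 3600  # e.g. a four-hour objective agreed with the business

def timed_restore(restore_fn) -> float:
    """Run a restore procedure and return elapsed wall-clock seconds."""
    start = time.monotonic()
    restore_fn()
    return time.monotonic() - start

def simulated_restore():
    # Stand-in for restoring a large MySQL backup to a test replica;
    # in a real drill this would invoke the actual recovery tooling.
    time.sleep(0.01)

elapsed = timed_restore(simulated_restore)
meets_rto = elapsed <= RTO_SECONDS
```

If `meets_rto` comes back false in rehearsal, that is the moment to go shopping for faster recovery tooling, before an outage forces the question.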
Do your Change Management policies and procedures include comprehensive testing and an acceptable-downtime threshold for back-out plans?