Spring cleaning may be an annual ritual for your home, but it is also a best practice in business. And in my area of business, IT operations, one of my favorite areas for “cleaning up” is making sure that our systems meet the availability and recovery expectations of our business. Our clients, and any enterprise for that matter, should consider the same approach.
Here are the five steps to take:
1. Start by reviewing service level agreements with your business partners around the availability of key systems, as well as the expected recovery point and recovery time objectives in the event of a disaster. If these are already well understood and reviewed periodically with your business partners, start instead by reviewing the expectations with your own teams so you can be sure they are architecting and implementing with the business needs (and budget) in mind.
2. From an availability standpoint, use your next scheduled downtime window to test the failover and high-availability features supporting your most important business applications. Too often, systems are deployed in a highly available configuration, and then when components fail, they do not fail over as expected. Issues as small as cabling mistakes or hardware driver updates can keep a highly available system from failing over as designed, so regular testing should be implemented for the most critical systems.
3. No one likes to think about disaster recovery, both because it forces us to think about significant events and because we often know we cannot recover as well as we would like. But while a large disaster may be unlikely, small ones happen regularly. And if you do not test, you do not know where you stand. Simulate a few small disasters, and if you haven’t executed a full disaster recovery test in the past year, execute a full end-to-end test. Take special care to check newly deployed or upgraded systems, which may not yet have experienced a full disaster recovery test. Compare your results against your stated recovery time and recovery point objectives. And don’t forget to test the methods you use to communicate with your IT team and the rest of your employees.
4. While you are thinking about recovery, it is also a good time to review your backup systems to ensure they are working correctly and backing up all your critical data. Select a few backup sets and test recovery to confirm they restore as expected. We recently observed a client who experienced a CryptoLocker attack that required the recovery of 125,000 files making up almost 150 GB of data. Since backups had been recently tested, we knew we could recover the data in less than two hours.
5. Spring cleaning is also a good time to evaluate whether you’re maintaining good IT hygiene. The key practices of good hygiene are:
- Review your latest cloud invoices and see how they have been trending over the past few months. Consumption-based billing often goes up every month—sometimes unexpectedly, as test and development systems are not spun down when they are no longer necessary.
- Review the cloud portfolio and make sure the spending is aligned with the value being received. If you haven’t already, put a governance process in place to make sure cloud environments aren’t forgotten and are being used cost-effectively. Cloud technologies are great because you can instantly scale, but they can also instantly scale your spend without the budget or justification to do so.
- Review your hardware and software asset inventory, and develop plans to replace or retire any components no longer supported by the respective vendors. As many industry compliance regulations require the use of vendor-supported platforms, it is important that non-supported platforms are eliminated. In cases where you have legacy platforms that must be maintained, consider how you can lower your risk by segmenting the application and limiting access to it.
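To make step 3 more concrete, here is a minimal Python sketch of comparing disaster recovery test results against your stated recovery time and recovery point objectives. It is an illustration only: the class names, field names, and sample systems are hypothetical, and the objective values are made up for the example rather than taken from any real agreement.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjective:
    """Agreed targets for one system (hypothetical structure)."""
    system: str
    rto: timedelta  # recovery time objective: maximum tolerated downtime
    rpo: timedelta  # recovery point objective: maximum tolerated data loss


@dataclass
class DrTestResult:
    """What a disaster recovery test actually measured (hypothetical structure)."""
    system: str
    recovery_time: timedelta  # time from outage start until service was restored
    data_loss: timedelta      # window of data that could not be recovered


def find_gaps(objectives, results):
    """Return human-readable misses; an empty list means every target was met."""
    by_system = {r.system: r for r in results}
    gaps = []
    for obj in objectives:
        result = by_system.get(obj.system)
        if result is None:
            # Untested systems are reported too, since their status is unknown.
            gaps.append(f"{obj.system}: not tested")
            continue
        if result.recovery_time > obj.rto:
            gaps.append(
                f"{obj.system}: recovery took {result.recovery_time}, RTO is {obj.rto}"
            )
        if result.data_loss > obj.rpo:
            gaps.append(
                f"{obj.system}: lost {result.data_loss} of data, RPO is {obj.rpo}"
            )
    return gaps


if __name__ == "__main__":
    # Illustrative values only.
    objectives = [
        RecoveryObjective("billing", rto=timedelta(hours=4), rpo=timedelta(hours=1)),
        RecoveryObjective("email", rto=timedelta(hours=8), rpo=timedelta(hours=24)),
    ]
    results = [
        DrTestResult("billing", recovery_time=timedelta(hours=6),
                     data_loss=timedelta(minutes=30)),
    ]
    for gap in find_gaps(objectives, results):
        print(gap)
```

Listing untested systems alongside missed targets is a deliberate choice here: a newly deployed system that has never been through a recovery test is a gap in its own right, even though no objective has been measurably missed.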