“At lower volumes, it would work fine. At higher volumes, it has problems.”
This quote illustrates many of the issues IT managers and teams face as they work to scale their systems and platforms. Often, a system that works properly during the normal course of business cannot handle a sudden influx of new users or potential customers. The quote comes from Todd Park, the U.S. Chief Technology Officer, speaking about the new healthcare.gov site, which handles new registrants in the healthcare exchanges created under the Affordable Care Act.
You may not be in a situation as extreme as launching a major site for tens of millions of people under a major new law, but how would you cope if your traffic suddenly grew many times over? More customers are not only a good thing but the ultimate goal of any business. Still, could your systems handle a huge influx of potential users at your (virtual) door? In many cases, key IT systems or platforms run close to capacity or have too many single points of failure to scale quickly and seamlessly. As a result, the first impression potential customers have of your company may be long delays, broken functionality, or errors that turn them away before they decide to become paying customers.
Scaling is a difficult problem for organizations of all sizes and in all industries. A system that handles the existing load may lack the tools, technologies, and plans needed to react to large growth. How can an organization minimize or mitigate these risks? There are a variety of ways, and they involve planning and implementation throughout the entire lifecycle of product development, not just a temporary bandage applied when volume spikes.
Choose the right platform
There is no perfect platform for all applications. You may want complete control of the IT environment for performance or regulatory purposes and choose to buy your own servers. Or, you may want to outsource the management of the infrastructure to a cloud provider to gain the flexibility to expand quickly, but that comes with the additional costs and risks of relying on an external vendor. You must plan for the option of adding new systems to the mix quickly to serve users, while avoiding a massively complex platform that nearly always sits idle and costs time and money to maintain. Whether you choose your own servers, a cloud solution, or some mix of the two, you need to identify single points of failure and work to eliminate or mitigate the risk of one component bringing down the entire platform. Some options are adding capacity, building an “on demand” infrastructure that can scale up or down as needed, and ensuring all data is properly backed up and can be easily restored in the event of a disaster.
Incorporate the necessary tools
The most important aspect of scaling seamlessly is to build a system that allows for scale from the beginning. How to handle growth should be one of the foremost design decisions when planning a new platform or upgrading an existing one. Measures such as redundant servers, load balancing, and caching need to be considered early on to build a system that is flexible enough to grow as easily as your customer base does. Long-running operations, such as complex reports, and non-critical tasks should be moved to a separate part of the infrastructure so that interactive users have priority on system resources. Ideally, a system should be able to scale up or down based on load without manual intervention. This allows it to continue serving users even if something unexpected occurs, like a surge of new visitors following a new product launch, major changes by a competitor, or a news story about your company.
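As a rough illustration of scaling up or down based on load without manual intervention, the sketch below computes how many servers a platform might want for a given request rate. The function name, the 70% target utilization, and the minimum and maximum server counts are all hypothetical values chosen for the example, not recommendations. A real platform would feed a decision like this into whatever provisioning mechanism it uses.

```python
import math


def desired_servers(requests_per_sec, capacity_per_server,
                    target_utilization=0.7, min_servers=2, max_servers=20):
    """Return a server count that keeps each server near the target utilization.

    All parameters are illustrative assumptions:
    - capacity_per_server: requests per second one server can handle at 100%.
    - target_utilization: fraction of that capacity we want to use (headroom).
    - min_servers: floor, kept at 2 or more to avoid a single point of failure.
    - max_servers: ceiling, to cap cost during extreme spikes.
    """
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    if min_servers > max_servers:
        raise ValueError("min_servers cannot exceed max_servers")

    usable_capacity = capacity_per_server * target_utilization
    needed = math.ceil(requests_per_sec / usable_capacity)
    # Clamp the result between the configured floor and ceiling.
    return max(min_servers, min(max_servers, needed))


if __name__ == "__main__":
    for load in (50, 1000, 10000):
        print(load, "req/s ->", desired_servers(load, capacity_per_server=100), "servers")
```

Keeping the floor at two servers reflects the redundancy point above: even at low volume, one machine failing should not take the platform down.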
Monitor and react proactively
The ability to track system usage and response time is also critical for identifying hotspots or failures before a user notifies you. System administrators need to understand the health of the systems quickly and easily so they can mitigate potential issues before those issues cause a system-wide outage. Such monitoring and alerting is also useful because you can track usage and plan for staffing, future upgrades, and system improvements based on actual usage trends instead of guessing at your capacity needs.
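To make the idea of tracking response time and alerting concrete, here is a minimal sketch that checks a batch of recent measurements against two thresholds. The function names, the 500 ms limit on 95th-percentile response time, and the 1% error-rate limit are assumptions made for the example. In practice the samples would come from your own logs or monitoring tool.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def check_health(response_times_ms, error_count, request_count,
                 p95_limit_ms=500, error_rate_limit=0.01):
    """Return a list of alert messages; an empty list means healthy.

    The default limits are hypothetical and should be tuned per system.
    """
    alerts = []
    p95 = percentile(response_times_ms, 95)
    if p95 > p95_limit_ms:
        alerts.append(f"slow responses: p95 is {p95} ms (limit {p95_limit_ms} ms)")
    error_rate = error_count / request_count if request_count else 0.0
    if error_rate > error_rate_limit:
        alerts.append(f"high error rate: {error_rate:.1%} (limit {error_rate_limit:.1%})")
    return alerts


if __name__ == "__main__":
    recent = [120] * 18 + [900, 950]  # two slow requests out of twenty
    for message in check_health(recent, error_count=1, request_count=20):
        print("ALERT:", message)
```

A percentile is used rather than an average because a handful of very slow requests can hide behind a healthy-looking mean.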
Test and analyze capacity
Understanding system capacity is also important when planning for expected growth as well as unexpected surges. Using a load testing tool is paramount, because you don’t want to discover points of failure the first time real users are on the system. Such a tool also lets you see how the system handles multiples of your current user base. This can help identify implementation issues in the application itself or structural issues in the infrastructure, allowing you to fix them before they cause major problems. You also need an environment where you can run such tests without impacting users, both to maintain business continuity and to find new issues before they are deployed to your actual customers.
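The idea of testing at multiples of the current user base can be sketched with a very small harness. This is not a replacement for a dedicated load testing tool. The `handler` argument stands in for one request against a test environment, and `fake_handler` and the baseline of 5 users are hypothetical placeholders for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(handler, concurrent_users, requests_per_user):
    """Call handler() from several threads at once and summarize the results."""

    def one_user(_user_index):
        latencies, errors = [], 0
        for _ in range(requests_per_user):
            start = time.perf_counter()
            try:
                handler()
            except Exception:
                errors += 1  # count failures instead of stopping the test
            latencies.append(time.perf_counter() - start)
        return latencies, errors

    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        results = list(pool.map(one_user, range(concurrent_users)))
    elapsed = time.perf_counter() - started

    all_latencies = [value for latencies, _ in results for value in latencies]
    return {
        "users": concurrent_users,
        "requests": len(all_latencies),
        "errors": sum(errors for _, errors in results),
        "max_latency_s": max(all_latencies),
        "throughput_rps": len(all_latencies) / elapsed if elapsed > 0 else 0.0,
    }


def fake_handler():
    """Hypothetical stand-in for a request to a test environment."""
    time.sleep(0.001)


if __name__ == "__main__":
    baseline_users = 5  # assumed current peak concurrency
    for multiple in (1, 2, 5, 10):
        print(run_load_test(fake_handler, baseline_users * multiple, requests_per_user=5))
```

Comparing throughput, maximum latency, and error counts across the multiples gives a first hint of where the system stops keeping up.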
Scaling is not a quick fix, nor is there a one-size-fits-all solution. Rather, planning for, designing for, and addressing scaling issues has to be part of the culture across the whole system lifecycle. These issues should be treated as part of the initial business requirements, which provides enough time to think through the various options and build flexibility into the system. If scaling has never been a foremost consideration, you can begin to phase in measures that eliminate specific risks throughout the platform. With proper design and testing, you can be secure in the knowledge that when a sudden burst of new customers appears, your systems will continue to perform.