Design for Failure

Rule of thumb: be a pessimist when designing architecture in any cloud, and assume things will fail. In other words, always design, implement, and deploy for automated recovery from failure.

The cloud platforms you use provide services and functions designed to mitigate, or automatically respond to, failure scenarios. If you don't use them, you are paying for capabilities that you have not taken advantage of.

In 2018, when the South East US Region for AWS went down, many AWS clients were affected, but not Netflix.

Netflix had implemented and tested the cloud recovery functions that moved all of its content and customers to other regions seamlessly. Netflix had planned for exactly this kind of failure.

In particular, assume that your hardware will fail, the OS will fail, the network will fail, and even the region will fail.

  • Assume that outages occur

  • Assume that some disaster will strike your application or the technology it uses

  • Assume that you will be slammed with more than the expected number of requests per second some day

  • Assume that with time your application software will fail too
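As a rough illustration of designing for these assumptions, the sketch below retries a failing call with exponential backoff and then fails over to the next region. It is a minimal sketch only: the function name `call_with_failover`, the `AllRegionsFailed` exception, and the idea of passing region names as plain strings are all hypothetical and not taken from any cloud provider's SDK.

```python
import random
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class AllRegionsFailed(Exception):
    """Raised when every region has used up its retries (hypothetical name)."""


def call_with_failover(
    regions: Sequence[str],
    operation: Callable[[str], T],
    retries_per_region: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `operation(region)` for each region in order.

    Each region gets `retries_per_region` attempts. After a failed attempt
    the caller waits with exponential backoff plus random jitter, then
    retries or moves on to the next region.
    """
    last_error: Optional[Exception] = None
    for region in regions:
        for attempt in range(retries_per_region):
            try:
                return operation(region)
            except Exception as exc:  # a real system would catch narrower errors
                last_error = exc
                # Backoff doubles on every attempt; jitter spreads out retries
                # so that many clients do not hammer the service in lockstep.
                delay = base_delay * (2 ** attempt)
                sleep(delay + random.uniform(0, delay))
    raise AllRegionsFailed(f"all regions failed: {list(regions)}") from last_error
```

A caller might pass a list such as ["primary-region", "secondary-region"] (placeholder names) together with a function that performs the actual request against the given region. The `sleep` parameter exists only so that the waiting can be swapped out in tests.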

By being a pessimist, you end up thinking about recovery strategies at design time, which helps you design a better overall system. The same strategy works for security: assume breaches will happen and plan for them during design.

“I plan for failure and I plan for a security breach. I do this not by over-baking solutions but instead by developing the solution with an eye towards failure scenarios and security vulnerabilities at each level.

Then I spend a small amount of time evaluating the risk of identified events. Each has both an impact and a likelihood and a cost to mitigate it. Then I include each mitigation as an enterprise option within the solution that allows a cost to benefit review of each option.

In this way I make sure that the executives that own both the risk, and the cost to deliver, have a realistic way of judging when to pay for mitigations up front and when to live with the cost of the risk.”

- Taken from a presentation from #ACloud.Guru
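The review described in the quote can be sketched as a small calculation. The example below is only an illustration under simple assumptions: likelihood is an estimated yearly probability, impact and mitigation cost are in the same currency, and the expected loss is likelihood times impact. The `Risk` class, its field names, and `rank_mitigations` are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Risk:
    name: str
    likelihood: float                  # estimated probability per year, 0..1
    impact: float                      # estimated cost if the event occurs
    mitigation_cost: float             # estimated yearly cost of the mitigation
    residual_likelihood: float = 0.0   # probability that remains after mitigation

    def expected_loss(self) -> float:
        """Expected yearly loss if nothing is done."""
        return self.likelihood * self.impact

    def mitigation_benefit(self) -> float:
        """Reduction in expected yearly loss that the mitigation buys."""
        return (self.likelihood - self.residual_likelihood) * self.impact

    def net_benefit(self) -> float:
        """Positive means the mitigation pays for itself under these estimates."""
        return self.mitigation_benefit() - self.mitigation_cost


def rank_mitigations(risks: Iterable[Risk]) -> List[Risk]:
    """Order risks so the most worthwhile mitigation comes first."""
    return sorted(risks, key=lambda r: r.net_benefit(), reverse=True)
```

With made-up numbers, a risk with a 10% yearly likelihood and an impact of 1,000,000 carries an expected loss of about 100,000 per year, so a mitigation that costs 20,000 and removes most of that likelihood is easy to justify. A mitigation whose net benefit is negative is a candidate for living with the risk. The estimates are the weak point of any such calculation, which is why the decision stays with the people who own both the risk and the cost.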

This is basic architectural thinking. It results in solutions that are secure and designed to withstand failures, without becoming so expensive that they are no longer cost-effective.

We need to be more realistic and stop treating every failure and vulnerability as world-ending. Claiming that the sky is falling every time is why we, as enterprise architects, struggle to get executives and accountants to take the real risks seriously and to fund mitigations when they are absolutely necessary.

In addition, we need the executives and accountants who own delivery and its cost to work more closely with enterprise architects. Together they can identify where to spend on risk mitigation and where to live with the risk, so that the business gets the best cost-to-benefit outcome.