The antifragile organization
Authors: A. Tseitlin
Publication Year: 2013
Journal: ACM Transactions on Multimedia Computing, Communications and Applications
Embracing failure to improve resilience and maximize availability. Failure is inevitable. Disks fail. Software bugs lie dormant waiting for just the right conditions to bite. People make mistakes. Data centers are built on farms of unreliable commodity hardware. If you are running in a cloud environment, then many of these factors are outside of your control. To compound the problem, failure is not predictable and does not occur with uniform probability and frequency. The lack of a uniform frequency increases uncertainty and risk in the system. In the face of such inevitable and unpredictable failure, how can you build a reliable service that provides the high level of availability your users can depend on? A naive approach could attempt to prove the correctness of a system through rigorous analysis. It could model all different types of failures and deduce the proper workings of the system through a simulation or another theoretical framework that emulates or analyzes the real operating environment.
The main goal of this article is to inform the reader about methods to evaluate and increase resilience, and ultimately “antifragility”, in organizations. Antifragility is the exact opposite of fragility and goes further than robustness and resilience: where fragile systems crumble under pressure and robust systems stay the same, antifragile systems become stronger when faced with problems. The article poses two ways of increasing the resilience of a system: (1) build the application with room for errors and (2) reduce uncertainty by regularly inducing failure. The focus lies on the second option. The goal is to induce failures on a regular basis, so that real failures have no impact on the system.
Some options are mentioned, but only one is discussed thoroughly: the system used at Netflix. Netflix uses a group of autonomous agents called Monkeys, known collectively as the Simian Army. Chaos Monkey randomly terminates virtual instances in a production environment, that is, instances that are serving live customer traffic. Chaos Gorilla causes entire data centers to fail. Chaos Kong is not yet in operation, but it is aimed at taking down an entire region, that is, several data centers serving customers. Latency Monkey slows down communications or stops them altogether, which simulates downtime of a service. The rest of the army consists of monkeys that take care of upkeep and other miscellaneous tasks not directly related to availability. Netflix aims to increase the number of monkeys and uses a framework to decide how and when to incorporate them. Inducing failure in this way increases robustness, so that real-world failures have no impact on the system. Robustness alone is not antifragility, however: an antifragile system does not merely withstand failures but improves because of them.
The last part of the article discusses the “antifragile organization”, mostly in terms of what Netflix has done to achieve this goal. (1) Developers also operate their services. This gives them the opportunity to learn from failures that occur during operation and to respond quickly. (2) Each failure is an opportunity to learn. (3) Nobody is blamed for failures. Failures are nobody’s fault; everyone makes mistakes. The article concludes that the more failures are handled by the people responsible for the system, the more adept they become at handling them. The goal of inducing failures is to maximize availability, insulating users of a service from failure and delivering a consistent and available user experience.
Reflection: This is an interesting article on resilience in systems. It focuses entirely on the way Netflix handles this, but says nothing about how well these ideas generalize to other organizations. Nor does it provide evidence that the method is effective: one can infer that it probably is, since Netflix still uses it, but no data is provided. Another point of criticism is that nothing is said about the architectural principles needed to learn from the failures. This article is related to the adaptive cycle because it describes a method to prepare for crises and proposes building systems in such a way that crises become opportunities for improving the system and the organization.
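To make the failure-injection idea from the summary more concrete, here is a minimal Python sketch of a Chaos Monkey style drill. It is not Netflix's implementation. The names Instance, Cluster, ChaosMonkey, and run_drill are hypothetical, and the "instances" are simulated in memory rather than running in a real cloud. The sketch only illustrates the principle that a service built with redundancy and automated recovery keeps serving requests while instances are terminated at random.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Instance:
    """Hypothetical stand-in for one virtual instance serving traffic."""
    name: str
    alive: bool = True


@dataclass
class Cluster:
    """Hypothetical service made of redundant, interchangeable instances."""
    instances: List[Instance] = field(default_factory=list)

    def healthy(self) -> List[Instance]:
        return [i for i in self.instances if i.alive]

    def handle_request(self) -> str:
        # The service counts as available while any instance is alive.
        candidates = self.healthy()
        if not candidates:
            raise RuntimeError("service unavailable: no healthy instances")
        return candidates[0].name

    def replace_dead(self) -> int:
        # Assumed recovery step: swap each dead instance for a fresh one.
        dead = sum(1 for i in self.instances if not i.alive)
        self.instances = [
            i if i.alive else Instance(name=i.name + "+")
            for i in self.instances
        ]
        return dead


class ChaosMonkey:
    """Terminates one randomly chosen healthy instance per strike."""

    def __init__(self, cluster: Cluster, rng: random.Random) -> None:
        self.cluster = cluster
        self.rng = rng

    def strike(self) -> Optional[Instance]:
        victims = self.cluster.healthy()
        if not victims:
            return None
        victim = self.rng.choice(victims)
        victim.alive = False
        return victim


def run_drill(rounds: int, seed: int = 0) -> int:
    """Induce one failure per round; return how many requests were served."""
    cluster = Cluster([Instance(f"i-{n}") for n in range(3)])
    monkey = ChaosMonkey(cluster, random.Random(seed))
    served = 0
    for _ in range(rounds):
        monkey.strike()
        cluster.handle_request()  # raises if the failure took the service down
        served += 1
        cluster.replace_dead()
    return served
```

In this sketch, a call such as run_drill(rounds=5) completes only if every induced failure is absorbed. Removing the replace_dead step or shrinking the cluster to a single instance makes the drill fail, which is the kind of weakness regular failure induction is meant to expose before a real outage does.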