Yesterday I talked about mean time between failures (MTBF) and mean time to repair (MTTR) and their importance in building robustness and recoverability into our systems.
With these two metrics we still expect that a failure will cause an outage of our system that has to be repaired. But what if we could build a system that can absorb failures? A system where a failure actually makes the system stronger…
Nassim Taleb calls such systems antifragile systems.
From Wikipedia:
Antifragility is a property of systems in which they increase in capability to thrive as a result of stressors, shocks, volatility, noise, mistakes, faults, attacks, or failures. The concept was developed by Nassim Nicholas Taleb in his book, Antifragile, and in technical papers. As Taleb explains in his book, antifragility is fundamentally different from the concepts of resiliency (i.e. the ability to recover from failure) and robustness (that is, the ability to resist failure). The concept has been applied in risk analysis, physics, molecular biology, transportation planning, engineering, aerospace (NASA), and computer science.
Designing antifragile software systems requires a focus not only on the infrastructure and the application level but also on the outer world (the complete ‘system’). When building such a system we observe the system's behavior, the impact of disturbances, and how we can apply correcting actions to keep the system operational.
Remark: Testing and evolving such an antifragile system is a perfect candidate for chaos engineering.
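To make the observe, disturb and correct loop a bit more concrete, here is a minimal sketch in Python. It is a toy simulation, not a real chaos engineering tool: all names (`Replica`, `Service`, `inject_failure`, `correct`, `run_experiment`) are hypothetical and invented for this illustration, and the assumption that a service stays available as long as one replica is healthy is a simplification. It only shows how a disturbance can be injected on purpose, how its impact is observed, and how a correcting action is applied. Whether the system actually gets stronger depends on what you learn from the experiment.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    healthy: bool = True


@dataclass
class Service:
    replicas: list = field(default_factory=list)

    def is_available(self) -> bool:
        # Assumption: the service keeps answering as long as one replica is healthy.
        return any(r.healthy for r in self.replicas)


def inject_failure(service: Service, rng: random.Random) -> Replica:
    """Disturbance: take down one randomly chosen replica."""
    victim = rng.choice(service.replicas)
    victim.healthy = False
    return victim


def correct(service: Service) -> int:
    """Correcting action: restart every unhealthy replica and return the count."""
    repaired = 0
    for replica in service.replicas:
        if not replica.healthy:
            replica.healthy = True
            repaired += 1
    return repaired


def run_experiment(rounds: int, replica_count: int, seed: int = 0) -> dict:
    """Repeat disturb -> observe -> correct and report what happened."""
    rng = random.Random(seed)
    service = Service([Replica(f"replica-{i}") for i in range(replica_count)])
    outages = 0
    repairs = 0
    for _ in range(rounds):
        inject_failure(service, rng)      # disturb
        if not service.is_available():    # observe the impact
            outages += 1
        repairs += correct(service)       # apply the correcting action
    return {"outages": outages, "repairs": repairs}


if __name__ == "__main__":
    print(run_experiment(rounds=10, replica_count=1))
    print(run_experiment(rounds=10, replica_count=3))
```

With a single replica every injected failure is an outage; with three replicas the same disturbances are absorbed and only show up as repairs. That difference is the kind of insight such an experiment is meant to surface.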
Gregor Hohpe, in his Cloud Strategy book, describes the path from fragile to antifragile systems using the following table:
| Robust | Resilient | Antifragile |
Model | Prevent failure | Recover from failure | Invite failure |
Motto | “Hope for the best” | “Prepare for the worst” | “Bring it on!” |
Attitude | Fear | Preparedness | Confidence |
Mechanism | Planning & verification | Redundancy & automation | Chaos Engineering |
Scope | Infrastructure | Middleware/Application | Whole System |
Remark: We cannot call these systems ‘not fragile’ because that would describe a robust system, which is not what we mean here. Hence the term ‘antifragile’.
Read more: