Yesterday I talked about mean time between failures (MTBF) and mean time to repair (MTTR) and their importance in building robustness and recoverability into our systems.
With these two metrics we still expect that a failure will cause an outage of our system that has to be repaired. But what if we could build a system that absorbs failures? A system where a failure actually makes the system stronger…
Nassim Taleb calls such systems antifragile systems.
From Wikipedia:
Antifragility is a property of systems in which they increase in capability to thrive as a result of stressors, shocks, volatility, noise, mistakes, faults, attacks, or failures. The concept was developed by Nassim Nicholas Taleb in his book, Antifragile, and in technical papers. As Taleb explains in his book, antifragility is fundamentally different from the concepts of resiliency (i.e. the ability to recover from failure) and robustness (that is, the ability to resist failure). The concept has been applied in risk analysis, physics, molecular biology, transportation planning, engineering, aerospace (NASA), and computer science.
Designing antifragile software systems requires a focus not only on the infrastructure and the application level but also on the outer world (the complete "system"). When building such a system we observe the system's behavior, the impact of disturbances, and which corrective actions we can apply to keep the system operational.
Remark: Testing and evolving such an antifragile system is a perfect candidate for chaos engineering.
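To make the chaos engineering remark a bit more concrete, here is a minimal sketch of what such an experiment could look like. Everything in it is made up for illustration: `FlakyService`, `call_with_fallback` and `run_experiment` are hypothetical names, not part of any chaos engineering tool, and the failure rate is an arbitrary assumption. It only covers the "inject a disturbance and observe" half; the system becomes stronger when we act on what the experiment reveals.

```python
import random


class FlakyService:
    """Hypothetical dependency whose calls fail with a configurable probability."""

    def __init__(self, failure_rate, rng):
        self.failure_rate = failure_rate
        self.rng = rng

    def call(self):
        # Inject a fault with probability `failure_rate`.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return "fresh"


def call_with_fallback(service, retries=2, fallback="cached"):
    """Corrective action: retry a few times, then degrade instead of failing."""
    for _ in range(retries + 1):
        try:
            return service.call()
        except ConnectionError:
            continue
    return fallback


def run_experiment(failure_rate, requests=1000, seed=42):
    """Send requests while faults are injected and record what callers observed."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    service = FlakyService(failure_rate, rng)
    results = [call_with_fallback(service) for _ in range(requests)]
    return {
        "answered": len(results),
        "fresh": results.count("fresh"),
        "degraded": results.count("cached"),
    }


if __name__ == "__main__":
    print("steady state:", run_experiment(0.0))
    print("with faults: ", run_experiment(0.3))
```

Comparing the two runs shows how many requests were served in degraded mode once faults were injected. If that number is higher than we are comfortable with, that is the signal to adjust the retries, the fallback, or the design itself.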
In his book Cloud Strategy, Gregor Hohpe describes the path from fragile to antifragile systems using the following table:
| | Robust | Resilient | Antifragile |
| Model | Prevent failure | Recover from failure | Invite failure |
| Motto | "Hope for the best" | "Prepare for the worst" | "Bring it on!" |
| Attitude | Fear | Preparedness | Confidence |
| Mechanism | Planning & verification | Redundancy & automation | Chaos Engineering |
| Scope | Infrastructure | Middleware/Application | Whole System |
Remark: We cannot call these systems "not fragile", because that would describe a robust system, which is not what we mean here. Hence the term "antifragile".