Learning More from Accidents

When accidents happen, there’s a seductive call to look for a root cause: a chain of events without which the accident would not have happened.  In hindsight, root causes seem easy to identify; one works backwards from the accident, tracing causal threads until reaching the “root cause.”  It’s simple, and it’s generally wrong.

In complex systems, root causes rarely exist.  What exist are uncontrolled hazards, which are system states that, combined with environmental conditions, can lead to a loss, and triggers, which tip that risk into an unacceptable loss.  Consider a manual configuration change in which a typographical error triggers an accident.  In hindsight it’s easy to point at the human who made that change as the root cause, and a set of proposed safeties might include “have a second set of eyes for every change” or “every person who makes a change has to watch for errors for one hour.”  Both seem valuable at first glance, but are ultimately expensive with little benefit; neither acts to control the underlying hazards.

A hazard analysis for a system like this might start with identifying unacceptable losses such as “fail to serve users” or “send the wrong content to a user.”  The hazards that might lead to a loss are more complex: “edge software doesn’t validate input is correct,” “inputs are not generated programmatically,” “inputs are not validated in a simulated environment,” and “correct server operations are not monitored.”  Stepping back further, one might find organizational hazards, like “developers don’t have clear priorities to build safety controls,” or “changes are not built from documented specifications.”
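To make one of those hazards concrete, here is a minimal sketch of a control for “edge software doesn’t validate input is correct.”  The configuration format, field names, and checks are invented for illustration; they are not Akamai’s actual tooling.

```python
# Hypothetical sketch: validate a configuration change programmatically
# before it ships, so a typo is caught by a machine rather than by a
# second pair of human eyes. All names here are illustrative.

def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the config passes."""
    problems = []

    # Check that required fields are present at all.
    required = {"origin_host", "ttl_seconds"}
    missing = required - config.keys()
    if missing:
        problems.append(f"missing required keys: {sorted(missing)}")

    # Check that values have the right shape, not just that they exist.
    ttl = config.get("ttl_seconds")
    if ttl is not None and (not isinstance(ttl, int) or ttl <= 0):
        problems.append(f"ttl_seconds must be a positive integer, got {ttl!r}")

    host = config.get("origin_host")
    if isinstance(host, str) and " " in host:
        problems.append(f"origin_host contains whitespace (likely a typo): {host!r}")

    return problems

# A stray space, like the manual-change typo in the earlier example,
# is flagged before deployment.
print(validate_config({"origin_host": "origin.example .com", "ttl_seconds": 3600}))
```

The point is not this particular checker; it is that the control targets the hazard itself, independent of which human happens to be typing.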

Why does this matter?  In a sufficiently complex system, a single hazard is highly unlikely to result in a major accident.  Usually, a significant number of hazards are triggered on the way to an accident.  Selecting only one to “fix” will rarely stem the tide of accidents.  Fixes that target only one root cause are often heavyweight bandages that slow the system down.  A designed-safe system is a fast system; the true goal of safety is to enable high-performance operations (think of brakes as a system that enables speed)!

Enter STAMP.  The Systems-Theoretic Accident Model and Processes, developed by Prof. Nancy Leveson at MIT, is a causal model of accidents in complex systems that comes with two useful tools.  CAST (Causal Analysis based on System Theory) is a post-incident analysis method, heavily used by Akamai’s Safety Engineering Team when conducting incident reviews.  STPA (Systems-Theoretic Process Analysis) is a hazard analysis method, useful for analyzing systems before an accident.

STPA is a useful executive tool as well, even when reduced to its lightest incarnation.  When looking at any system – software or organizational – ask three questions:  What are the unacceptable losses?  What hazards make those more likely?  What control systems exist to mitigate those hazards?
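The lightweight three-question review can be sketched as a simple checklist structure.  This is a minimal illustration, not part of STPA proper; the class and field names are invented, and the example entries come from the edge-serving discussion above.

```python
# Hypothetical sketch: the three STPA-lite questions as a data structure,
# so an unanswered question is visible at a glance.

from dataclasses import dataclass, field

@dataclass
class HazardReview:
    system: str
    unacceptable_losses: list[str] = field(default_factory=list)  # what losses are unacceptable?
    hazards: list[str] = field(default_factory=list)              # what makes those more likely?
    controls: list[str] = field(default_factory=list)             # what mitigates those hazards?

    def gaps(self) -> list[str]:
        """Flag the review as incomplete if any of the three questions is unanswered."""
        out = []
        if not self.unacceptable_losses:
            out.append("no unacceptable losses identified")
        if not self.hazards:
            out.append("no hazards identified")
        if not self.controls:
            out.append("no control systems identified")
        return out

review = HazardReview(
    system="edge content serving",
    unacceptable_losses=["fail to serve users", "send the wrong content to a user"],
    hazards=["edge software doesn't validate input is correct"],
)
print(review.gaps())  # the third question is still unanswered
```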

If you struggle to answer those questions, one helpful mental model is the pre-mortem:  Spend two minutes, and ask yourself, “If, in a year, this ended up failing, what was the bad outcome (unacceptable loss)? How did it happen (hazard)?  What safeties should we have had (control systems)?”

Interested in reading more?  Prof. Leveson’s CAST tutorial slides can be found at http://sunnyday.mit.edu/workshop2019/CAST-Tutorial2019.pdf, and the Partnership for a Systems Approach to Safety and Security is at https://psas.scripts.mit.edu/.