In this talk — "Advanced PostMortem Fu and Human Error 101" — from the O'Reilly Velocity Conference, John Allspaw describes (in terms of software systems) the modern view of risk and responsibility. Allspaw draws upon recent research in the life-critical industries, particularly aerospace and medicine.
What has been happening in the "life-critical" fields is a movement away from the idea of "operator error" with irresponsible or malevolent human actors as the eventual "root cause" of a catastrophic failure. Instead, modern researchers propose that there is no root cause when a system fails. Complex systems (such as Web sites) are by definition so complex that their behavior cannot be reliably predicted. Failure is therefore a normal, inevitable consequence of doing business.
In my humble opinion, the field of Software Quality Assurance has a lot to learn from Allspaw's work, particularly the way he has connected Web application architecture with Sidney Dekker's idea of "drift into failure" and with Erik Hollnagel's concept of the "efficiency-thoroughness trade-off."
Part 1 of the talk is embedded above; the rest is available from Safari Books.
> I don't care how simple your Web site is, it is still a complex, dynamic system.
> What happens during [a protracted response to an incident] can infuriate, enthrall, fatigue, mystify or energize the emotions of an entire company [and] can have lasting effects on your culture and on your technology. This is what makes scar tissue. This is what makes (in a very small sense) some sort of post-traumatic stress disorder.
> Root cause analysis tends to make things look simple… But in reality there was a lot going on. There were multiple things going on, people's mindsets were in different places. We work in these complex systems where there are interactions that can't be told by a linear narrative.
> Safety is easy: just move the dominoes farther apart. That way they won't hit each other if one falls over.
> We have to start talking about contributors not causes. There is no root cause. There is no root cause of your failure… Finding a root cause of a failure is like finding a root cause of a success.
> Labeling [an incident] as "human error" is no good… What's the remediation item? Be more careful? Be more vigilant? Don't do that again? Where's the graph for carefulness? Human error isn't a cause. It's an effect. It's an effect of how you've built your organization.
> Failures are successes gone wrong… Why don't we fail all the time? Having an answer to that is going to get you to a better place where human error is concerned.
> We don't build systems that need protection from humans. If you think that then you've probably watched too much Star Trek. Human error is an inevitable by-product of strained complex systems. Because humans vary in their capability, from Sullenberger all the way to the guy who just caused the site outage.
> There is this idea that the amount of negligence is commensurate with the severity of the outage… But there's no way that you can take the severity of the outcome and map it to how good or bad the action that helped contribute to it. But we still do because we have this need for accountability.
> When the site goes down, nobody dies.
> Punishment as a deterrent is a losing proposition. Firing people, docking their pay, benching them… only produce anxiety and stress which all but guarantee that you are not going to get good information about failures in the future. And if you don't get good information about failure in the future, you're screwed.