Discover Top Posts Tagged with #code as craft

A community of developers helping each other master software craftsmanship.

Mentor the next generation of developers. Or learn from the best and master your craft. A New Year’s Resolution: Give or receive 1 hour of coding help in 2015

https://hackpledge.org/

(I pledged to mentor AND learn!)

#hackpledge #code as craft

In "Advanced PostMortem Fu and Human Error 101," a talk from the O'Reilly Velocity Conference, John Allspaw describes the modern view of risk and responsibility as it applies to software systems. Allspaw draws upon recent research in the life-critical industries, particularly aerospace and medicine.

The "life-critical" fields have been moving away from the idea of "operator error," in which irresponsible or malevolent human actors are treated as the eventual "root cause" of a catastrophic failure. Instead, modern researchers propose that there is no root cause when a system fails. Complex systems (such as Web sites) are by definition so complex that their behavior cannot be reliably predicted. Failure is therefore a normal, inevitable consequence of doing business.

In my humble opinion, the field of Software Quality Assurance has a lot to learn from Allspaw's work, particularly the way he has connected Web application architecture with Sidney Dekker's idea of "drift into failure" and with Erik Hollnagel's concept of the "efficiency-thoroughness trade-off."

Part 1 of the talk is embedded above; the rest is available from Safari Books.

Money quotes below :)

> I don't care how simple your Web site is, it is still a complex, dynamic system.

> What happens during [a protracted response to an incident] can infuriate, enthrall, fatigue, mystify or energize the emotions of an entire company [and] can have lasting effects on your culture and on your technology. This is what makes scar tissue. This is what makes (in a very small sense) some sort of post-traumatic stress disorder.

> Root cause analysis tends to make things look simple… But in reality there was a lot going on. There were multiple things going on, people's mindsets were in different places. We work in these complex systems where there are interactions that can't be told by a linear narrative.

> Safety is easy: just move the dominoes farther apart. That way they won't hit each other if one falls over.

> We have to start talking about contributors not causes. There is no root cause. There is no root cause of your failure… Finding a root cause of a failure is like finding a root cause of a success.

> Labeling [an incident] as "human error" is no good… What's the remediation item? Be more careful? Be more vigilant? Don't do that again? Where's the graph for carefulness? Human error isn't a cause. It's an effect. It's an effect of how you've built your organization.

> Failures are successes gone wrong… Why don't we fail all the time? Having an answer to that is going to get you to a better place where human error is concerned.

> We don't build systems that need protection from humans. If you think that then you've probably watched too much Star Trek. Human error is an inevitable by-product of strained complex systems. Because humans vary in their capability, from Sullenberger all the way to the guy who just caused the site outage.

> There is this idea that the amount of negligence is commensurate with the severity of the outage… But there's no way that you can take the severity of the outcome and map it to how good or bad the action that helped contribute to it. But we still do because we have this need for accountability.

> When the site goes down, nobody dies.

> Punishment as a deterrent is a losing proposition. Firing people, docking their pay, benching them… only produce anxiety and stress which all but guarantee that you are not going to get good information about failures in the future. And if you don't get good information about failure in the future, you're screwed.

> —John Allspaw

#devops #qa #devoqa #Etsy #Code As Craft #Agile #culture #risk #blame #failure #complexity #monitoring #complex systems #systems thinking