Discover Top Posts Tagged with #postmortems


How to handle an outage

I got some questions on Twitter about npm's incident response process, so here I am blogging yet again. I looked up our internal docs on the topic, and was surprised to notice how terse they are.

Here's the "how to handle an outage" document that's in our operations repo:

How to handle an outage

Take a deep breath. I know we joke that things are on fire, but they're not literally on fire. People can't install JavaScript packages. We'll fix it. It'll be okay.

The person on PagerDuty should assume point & handle initial investigations.

The point person should make sure that the following two roles are filled: operations, which acts to resolve the outage, and communications, which shares information about the outage to the public & the rest of the company. Escalate operations to a subject matter expert if you aren't qualified to fix the problem yourself, and assume the communications role. If you are the expert, pick somebody else as communicator. (Don't try to do both during a serious outage. It's too stressful.)

In very serious outages or security incidents, there might be a third role, that of coordinator who decides what actions to take next. If you move the discussion from #ops to #incident-response, it's probably serious enough to warrant that third role. Also, if more than one person is acting in operations to fix things, choose a coordinator to avoid collisions or conflicting efforts.

If the outage is visible to users in any way, the communicator should open a statuspage.io incident immediately. Do this even if we don't know anything yet. An open incident gives Support a place to point people for more information.

The communicator should keep the incident updated as we learn more: use the "identified", "monitoring", and "resolved" statuses to let people outside know what to expect.

People who aren't actively involved in the incident should ask their questions of the communicator, not the operations person.

The operations person or people should keep the Slack channel updated with new information as they can.

Serious incidents usually warrant postmortem discussions to figure out how to improve our response next time as well as to note what we did well.

That's it

It's just enough to guide the team to doing three things:

clarify who has initial responsibility

assign further roles so we can coordinate

write a status incident so our users know what's happening
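The roles and status updates described in the document above can be modeled in a few lines. This is only a minimal sketch, not npm's tooling: the `Incident` class, its field names, and the rule that statuses only move forward are all assumptions made for illustration. The "identified", "monitoring", and "resolved" names come from the document; "investigating" is an assumed starting state for an incident opened before anything is known. Nothing here talks to statuspage.io.

```python
from dataclasses import dataclass, field
from typing import Optional

# "identified", "monitoring", "resolved" are quoted in the outage document.
# "investigating" is an assumed initial state (hypothetical for this sketch).
STATUSES = ["investigating", "identified", "monitoring", "resolved"]


@dataclass
class Incident:
    point: str                             # whoever is on call takes point
    operations: Optional[str] = None       # acts to resolve the outage
    communications: Optional[str] = None   # talks to the public and the company
    coordinator: Optional[str] = None      # optional third role for serious incidents
    status: str = "investigating"
    updates: list = field(default_factory=list)

    def assign_roles(self, expert: Optional[str] = None,
                     communicator: Optional[str] = None) -> None:
        """Fill the two required roles without letting one person hold both."""
        if expert is None or expert == self.point:
            # The point person is the expert, so somebody else must communicate.
            if communicator is None or communicator == self.point:
                raise ValueError("pick somebody else as communicator")
            self.operations, self.communications = self.point, communicator
        else:
            # Escalate operations to the expert; the point person communicates.
            self.operations, self.communications = expert, self.point

    def update(self, status: str, message: str) -> None:
        """Record a public update. Assumption: statuses only move forward."""
        if STATUSES.index(status) < STATUSES.index(self.status):
            raise ValueError(f"cannot move from {self.status} back to {status}")
        self.status = status
        self.updates.append((status, message))
```

For example, `Incident(point="alice").assign_roles(expert="bob")` puts bob on operations and leaves alice as communicator, which mirrors the "escalate and assume the communications role" rule.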

In addition, when the outage is more than a trivial one (e.g., AWS is down, or a single point of failure has failed and cannot be restored easily), we make our response more formal. We move all communication to one place, a Slack channel set aside for incident response. In that channel, communications are more stylized. We acknowledge requests and responses, and I take pains to thank people for actions.

Why do I do this? To slow things down.

Yes, I literally slow incident response down. Usually people are stressed during incidents and they're doing things in a rushed way, because they feel the urgency of the problem. Pausing to breathe and think carefully results in better decision-making, and overall faster resolution to the incident. I also think being extra-polite in sticky situations helps us as humans cope with stress. Our users might be yelling at us, but we're not yelling at each other. Instead, we're a kind and thoughtful team, and that's being modeled by the person with the CTO title.

I also think acking communication is a good way to make sure nothing gets lost. SYN/SYN-ACK/ACK is a way of life, you know?
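To make the acking idea concrete, here is a small sketch of tracking which requests in an incident channel have been acknowledged. It is purely illustrative: `AckTracker` and its methods are hypothetical names, and it is not connected to Slack or any real chat API.

```python
from dataclasses import dataclass, field


@dataclass
class AckTracker:
    """Tracks requests in an incident channel until someone acknowledges them."""
    pending: dict = field(default_factory=dict)  # request id -> request text
    acked: dict = field(default_factory=dict)    # request id -> who acknowledged
    _next_id: int = 1

    def request(self, text: str) -> int:
        """Record a request (the SYN) and return its id."""
        request_id = self._next_id
        self._next_id += 1
        self.pending[request_id] = text
        return request_id

    def ack(self, request_id: int, who: str) -> None:
        """Record an acknowledgement (the SYN-ACK) from a named person."""
        self.pending.pop(request_id)  # raises KeyError if unknown or already acked
        self.acked[request_id] = who

    def unacked(self) -> list:
        """Requests nobody has acknowledged yet: the ones at risk of being lost."""
        return list(self.pending.values())


tracker = AckTracker()
first = tracker.request("restart the registry workers")
tracker.request("update the status page")
tracker.ack(first, "ops-person")
print(tracker.unacked())  # ['update the status page']
```

Anything still in `unacked()` after a few minutes is a message worth repeating out loud.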

Postmortems follow outages

The less grim word for this is "retrospective", but I seem to be stuck on the autopsy language. This is an important step for any incident or project. Some guidelines we follow:

The postmortem must happen after things have concluded and everyone is calm again.

Focus must be on processes, not people. This is what "blameless postmortem" means for us.

Sometimes people become aware during incident response that they've made mistakes or taken down production with some action. I go out of my way to make it clear that this is our collective fault. Our processes let us down by not catching the problem before it could go live. If it's got to be any one person's fault, it's mine for not putting a better safety net in place before it could happen.

"Process" is a scary word that feels heavyweight, but what it really means is "the way you normally do things". You always have process; the question is whether your process is intentionally designed or not. You can and should change your processes when you find better ways to do things. Your processes are there to help you, not to be something you serve mindlessly.

Process is a safety net. Process is what allows us fallible humans to work and make mistakes without regularly blowing things up. It's a safety latch on a gate, a pre-launch checklist, your belayer checking your harness before you start climbing the cliff.

Some examples of processes you might have:

Pull requests are reviewed by at least one other engineer before they're landed.

Deploys get tested in a staging environment before they go to production.

Deploys get tested on a canary with a portion of production traffic before they roll out fully.

Libraries must have 90% (or 100% or something else) code coverage from unit tests.
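As one illustration of the canary process in the list above, here is a minimal sketch of sending a fixed portion of traffic to a canary. The function name, the use of a request id as the hashing key, and the hashing scheme itself are assumptions for this example, not a description of how npm routes traffic.

```python
import hashlib


def routes_to_canary(request_id: str, canary_percent: float) -> bool:
    """Deterministically send roughly canary_percent of requests to the canary.

    Hashing the id (rather than picking at random) means the same request id
    always lands on the same side, which makes canary behavior reproducible.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < canary_percent * 100


# With canary_percent=5, about 5% of distinct ids go to the canary.
sample = [f"req-{i}" for i in range(10_000)]
share = sum(routes_to_canary(r, 5) for r in sample) / len(sample)
```

Setting `canary_percent` to 0 disables the canary and 100 sends everything to it, so a full rollout is the same code path with a different number.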

In our retrospectives, we look at our processes and where they helped us and where they let us down. Did we follow them? If not, why not? What can we change so we can avoid this category of problem next time? What can we change so we can recover even faster from an event that's out of our control?

It's most fascinating to me when we have processes that we ignore. There's always a good reason. Sometimes it's that the process is a pain to follow, or not obviously helpful.

My general maxim is that the easy thing to do should be the right thing. Yes, the sentence is ordered that way. In a pinch we'll always go with the easiest thing to do, so we'd better prepare hard so that when that moment comes, the easy thing is a good thing and not a trap. This can take a lot of planning and hard work to set up, but the trick is that you're doing that hard work when it's not stressful, when you're at your leisure to do it right. Danger operations used to have a saying: "maximize net slack". This means you work hard right now so you can put your feet up on your desk & drink a fancy drink with an umbrella in it later. I believe in this strategy.

The CTO's retro of the incident

We kinda hashed up internal communications during this incident. We did something new in this response, something we couldn't do before: most of the engineers involved in the response got into a Zoom video chat to share screens and discuss the problem. This was great for everybody in the video meeting, but was a comms black hole for everybody not in the meeting. We needed to do a better job of disentangling the person doing comms from any responsibility for debugging, so comms becomes their only job during incidents. I'll probably write something up about that. (If indeed this blog post by itself isn't enough! It might be.)

The npm team did a fantastic job of coming together and working without drama on the problem. I was in a 4-hour meeting that could not be interrupted so my participation was limited to initial escalation and kicking off the incident response process. My participation wasn't necessary, which was perfect. The retro meeting was huge because of how many people contributed meaningfully to the recovery, even the very newest members of the team. It was a good response to a bad event, and I was thrilled to watch it happen.

Making myself unnecessary is a victory.

#incident response #ceej has a job #postmortems #npm registry #maximize net slack

Some Reflections Platformer Jumping

While this platformer may seem simple on the surface, its true essence lies in making the gameplay genuinely enjoyable. Incorporating a variety of platforms, dynamic environments, unique characters, and engaging mechanics enhances the player experience. The real challenge, however, is combining these elements seamlessly. Through developing this game, I’ve gained a deeper understanding of how to make gameplay both fun and immersive. Games like Mario have inspired me — they demonstrate how creativity and design can elevate even the simplest concepts into timeless experiences.

#super mario #game #postmortems #reflection

The neverending griefing discussion

I’ve been doing MMOs and online worlds a long time. And that means that I’ve written and said a lot of things on the Internet over the years, about designing them. One of the funny things about reactions to the various vision blogs for Stars Reach is the number of people who have popped up on various MMO forums whose entire impression of me and my design approaches is formed by their experience…

#game design #griefing #postmortems #stars reach #starsreach #vw design

POSTMORTEMS is out

My new book Postmortems is now shipping. If you pre-ordered the Kindle version, it should be there for reading now.

You should be able to get it from pretty much any bookstore. I get a little bit extra if you order it through this link on Amazon:

Paperback

Kindle

It’s a hefty 700 pages. Early reviews are pretty flattering, and make me happy; I am looking forward to seeing what readers think! If…

#game design #game history #postmortems #vw history

Lots of good insight into systems thinking in this short paper, but I think the best part is this:

There are multiple contributors to accidents. Each of these is necessary insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident. [...] Thus, no isolation of the 'root cause' of an accident is possible. The evaluations based on such reasoning as 'root cause' do not reflect a technical understanding of the nature of failure but rather the social, cultural need to blame specific, localized forces or events for outcomes.

#postmortems #systems #software engineering #chaos engineering

You: POST MORTEM

Me, an intellectual: THE HOTDAGA

#buzzfeed unsolved #bfu #postmortems #the hotdaga #the hot dog saga #shane madej #ryan bergara

Apple confirms new Mac Pro will be arriving in 2019

Last year Apple held a small meeting for select media to basically confirm that it is working on a new Mac Pro. This was brought on by growing complaints with the Mac lineup, especially the Pro line, which constituted the MacBook Pro that had a disappointing launch and the Mac Pro, which, now five years old, is thoroughly outdated.

A year after that meeting, Apple once again held a small…

#bodies #carried #donegal #postmortems

