Links and resources from my talk about how to learn more from incidents!

nickstenning/learningfromincidents

Learning from Incidents

This repository contains links to resources mentioned in a conference talk on how to learn more effectively from incidents. I most recently gave this talk at Agile & Automation Afternoons 2020, and have previously shared it at Lead Dev Berlin 2019 and SREcon 2019 EMEA, as well as at internal conferences and events at Microsoft.

This talk was co-written and originally co-presented with Jessica DeVita. Most of the meaningful insights in the talk are hers, and all of the typos and errors are mine!

⚡💡 If you're looking for one thing to do after the talk, we recommend reading the Etsy Debriefing Guide, particularly pp. 21-23, which give examples of good questions to ask in post-incident reviews. 💡⚡

Contents

Slides

Here's a link to a PDF of slides for the most recent version of the talk.

References

The two links referenced in our talk were:

Recommendations

What follows is a series of recommendations for running better post-incident reviews and learning more from incidents.

You shouldn't attempt to adopt all of these practices at once for every incident! Start small with interesting incidents, and not necessarily your biggest incidents.

We recommend that you break up your post-incident learning process into the following stages:

  1. (Optionally) interview participants
  2. Run a facilitated post-incident review
  3. Run a separate meeting to plan repair items
  4. (Optionally) publish written incident reports

1. Run a facilitated post-incident review

At most a few days after an incident, get as many people as possible who were involved in incident response into a room together to talk about what happened.

Have a neutral facilitator whose job it is to guide the discussion. They should not have been involved in incident response themselves, if at all possible.

Focus on reconstructing the timeline of the incident and understanding how actions and decisions of operators made sense to them at the time, even if we know in hindsight that they were mistakes.

Limit your post-incident reviews to 60-90 minutes. You will probably have to pick and choose what to talk about.

2. Use 1:1 interviews for complex incidents

For many incidents, making effective use of a 60-90 minute incident review meeting will be challenging unless the facilitator already has some idea of the incident timeline.

Use 1:1 interviews (often no longer than 10-15 minutes each) with people involved in the incident response to ask about their experience of the incident.

Use your interview notes to look for interesting points in the timeline: points where hypotheses were formed or changed, where significant actions or decisions were taken, or where individual views on the situation diverged from one another.

3. Keep discussion of repair items separate

Including discussion of repair items in the main post-incident review meeting will make it difficult to keep focus on understanding what happened during the incident.

You will likely find that talking about repair items leads people to start discussing what didn't happen. This is fine in a meeting about possible repairs, but it doesn't help us learn from what did happen.

Have a separate meeting a day or two after the post-incident review, in which you discuss and agree upon repair items. This meeting can be shorter and include fewer people (typically those who have a say in prioritisation).

4. Publish written incident reports

Not everyone on your team will attend every post-incident review. Writing up reports (even on one or two pages) can provide a way for the rest of your team to share in what was learned.

It may make sense to prepare different documents for different audiences. Your immediate team may gain more from a detailed description of how the system surprised you. Your management chain may be more interested in understanding how repairs will reduce customer impact in future. Don't be afraid to prepare different documents for different audiences.

Further reading/viewing

If you're interested in learning more about this field, here is a curated selection of further references:

Practical guidance

Accessible research

Deeper research

Videos

Some people learn better from videos! Here is a collection of talks and educational videos which may be of interest:

Even more!

Here's a further introductory guide to resilience engineering for the software community put together by Lorin Hochstein and Jacob Scott.

If your thirst for knowledge is still not quenched, you will find an enormous quantity of research and writing on this topic assembled by Lorin in his resilience-engineering repository.
