Postmortem Template

Executive Summary of the Postmortem

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu, consequat vitae, eleifend ac, enim. Aliquam lorem ante, dapibus in, viverra quis, feugiat a, tellus. Phasellus viverra nulla ut metus varius laoreet. Quisque rutrum. Aenean imperdiet. Etiam ultricies nisi vel augue. Curabitur ullamcorper ultricies nisi. Nam eget dui. Etiam rhoncus. Maecenas tempus, tellus eget condimentum rhoncus, sem quam semper libero, sit amet adipiscing sem neque sed ipsum. Nam quam nunc, blandit vel, luctus pulvinar, hendrerit id, lorem. Maecenas nec odio et ante tincidunt tempus. Donec vitae sapien ut libero venenatis faucibus. Nullam quis ante. Etiam sit amet orci eget eros faucibus tincidunt. Duis leo. Sed fringilla mauris sit amet nibh. Donec sodales sagittis magna. Sed consequat, leo eget bibendum sodales, augue velit cursus nunc,

Incident Timeline Summary

Event Description
Starting Date / Time of Issue: The time the root problems began, not necessarily when the issue was alerted on or detected. How was the incident detected (e.g. customer complaint, our alarm, another team’s alarm, or manual monitoring)? Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et
Time to Detect (minutes): Duration between issue introduction and issue detection, i.e. from first symptoms until the issue was first noticed or alerted on. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et
Time to Respond (minutes): Time from the first alert or notification until it is acknowledged and triage begins. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et
Time to Root Cause Identification (minutes, from response time): Duration between issue detection and identification of the root cause. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et
Time to Repair (minutes): Time between identification of the root cause and issue mitigation. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et
Ending Date / Time of Issue: The time when remediations are completed. Total time from initial response to recovery (excludes post-incident action items). Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et
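
The durations in the table above can be derived from a handful of timestamps rather than filled in by hand. The following is a minimal sketch, not a prescribed tool: the function names, the dictionary keys, and the sample timestamps are all hypothetical, and it assumes that time to root cause is measured from the moment of acknowledgement and that recovery coincides with mitigation.

```python
from datetime import datetime


def minutes_between(start: datetime, end: datetime) -> int:
    """Return the whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


def timeline_metrics(started, detected, acknowledged, root_caused, mitigated) -> dict:
    """Compute the summary durations (in minutes) from five incident timestamps.

    Assumption: root cause time is counted from acknowledgement, and
    recovery is treated as the mitigation timestamp.
    """
    return {
        "time_to_detect": minutes_between(started, detected),
        "time_to_respond": minutes_between(detected, acknowledged),
        "time_to_root_cause": minutes_between(acknowledged, root_caused),
        "time_to_repair": minutes_between(root_caused, mitigated),
        "response_to_recovery": minutes_between(acknowledged, mitigated),
    }


# Hypothetical timestamps, for illustration only (all in the same time zone).
FMT = "%Y-%m-%d %H:%M"
metrics = timeline_metrics(
    started=datetime.strptime("2021-11-26 11:40", FMT),
    detected=datetime.strptime("2021-11-26 12:00", FMT),
    acknowledged=datetime.strptime("2021-11-26 12:10", FMT),
    root_caused=datetime.strptime("2021-11-26 12:34", FMT),
    mitigated=datetime.strptime("2021-11-26 14:40", FMT),
)

for name, value in metrics.items():
    print(f"{name}: {value} min")
```

Keeping all timestamps in a single time zone (or converting them to UTC first) avoids off-by-hours errors when responders are spread across regions.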

Roster

Role Team Individual
Incident Commander Team Name
Subject Matter Expert Team Name
Subject Matter Expert
Additional Responders

Detailed Timeline

Time Details
November 26, 2021: 11:40 AM PST Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et
November 26, 2021: 12:00 PM PST Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et
November 26, 2021: 12:34 PM PST Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et
November 26, 2021: 2:40 PM PST to 4:00 PM PST Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et

Customer Impact Questions to Answer

  1. What was the customer use case impacted?
  2. How did the issue manifest for the customer?
  3. What was the quantified impact of the incident? (Actual Transactions vs. Projected Transactions; Estimated TPV impact)
  4. Did the impact differ across geographies?
  5. What was the potential blast radius of the customer impact, i.e. what further issues could this incident have caused?
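
The comparison of actual and projected transactions in question 3 can be sketched as a small calculation. This is an illustrative assumption rather than an established formula: the function name, its parameters, and the sample figures are hypothetical, TPV is assumed here to mean total payment volume, and the estimate simply multiplies the transaction shortfall by an average transaction value.

```python
def quantify_impact(projected: int, actual: int, avg_transaction_value: float) -> dict:
    """Estimate the transaction shortfall and its value during an incident window.

    Assumption: lost value = (projected - actual) * average transaction value.
    """
    # If actual volume exceeded the projection, report no shortfall.
    lost = max(projected - actual, 0)
    shortfall_pct = (lost / projected * 100) if projected else 0.0
    return {
        "lost_transactions": lost,
        "shortfall_pct": round(shortfall_pct, 2),
        "estimated_tpv_impact": lost * avg_transaction_value,
    }


# Hypothetical figures for the incident window, for illustration only.
impact = quantify_impact(projected=12_000, actual=9_000, avg_transaction_value=40.0)
print(impact)
```

Running the same calculation once per region gives a direct answer to the question about differing impact across geographies.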

Root Cause Analysis

When a question asks “why”, use the “5 Whys” method to show your thought process as you drive toward deeper root causes and answers.

  1. What was the primary reason for the failure?
  2. What were we trying to accomplish?
  3. Explain the call pattern from the client interface points through the downstream systems to the point where the failure occurred.
  4. What is the end-to-end process the team uses? (code development, code review, testing, deployment, post-production validation and monitoring)
  5. What are the safeguards in place today and why were they not effective? (safeguards for code development, code review, testing, deployment, post production validation and monitoring/alerting)
  6. What testing was done, and why was the issue not caught with testing? Why wasn’t this caught in One-box if applicable?
  7. How was the root cause diagnosed?
  8. How can the root cause analysis time be improved? How would you have cut the time in half?

Triggering Events

  1. What were the triggering factors and events that led to the issue?
  2. Was there an existing InfoSec violation that, had it been addressed, would have prevented this event? If so, why was it not addressed?
  3. Has this or a similar issue occurred in the past?
  4. (Optional) Did you have an existing backlog item that would have prevented or greatly reduced the impact of this event? If yes, why was it not completed prior to the event?

Opportunities for Improvements

Detection

  • Describe in detail how the incident was detected (e.g. customer complaint, our alarm, another team’s alarm, or manual monitoring).
  • How could the issue have been detected sooner, and how could alerting be improved? As a thought exercise, how would you have cut the time in half? How could this issue have been better observed?

Observability

  • Demonstrate, through visualizations, graphs and charts that detail key metrics, how the issue was observed and its effects.

Mitigation

  • Describe the actions taken to mitigate the incident. What are the steps you followed to reach the point where you knew how to mitigate the impact of this incident?
  • How could time to mitigation be improved? As a thought exercise, how would you have cut the time in half?
  • Was a short-term mitigation applied?
  • Is there a long-term mitigation?
  • Describe the final repair.

Preparation for the Future

  • How could this issue be prevented from occurring in the future?
  • What, where and how can we improve?
  • How can we better prepare for this or similar issues in the future?

Validation

  • Describe and demonstrate, through observability systems and, when possible, graphs and charts, how and why we are confident that recovery from the incident is complete.

Lessons Learned

  • Which actions made a positive impact on incident remediation?
  • Which actions made a negative impact on incident remediation? Which actions had no impact?
  • What new lessons and insights did we learn about our systems as a whole? (people, processes and technology)

Additional Resources and Artifacts

  • Slack
  • Link to Incident Management Recordings (if needed)
  • Enumerate any additional resources, materials, or artifacts from the incident and associated URLs when possible

Corrective Actions

Prioritized CAR# Description Creation Date Target Date Owner State
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque nec June 1, 2021 June 2, 2021 Name Completed
