Incident Post-Mortem Template (Free)

What is an Incident Post-Mortem?

An incident post-mortem (also called an incident review or retrospective) is a structured analysis conducted after a service disruption. The goal isn't to assign blame — it's to understand what happened, why it happened, and what concrete steps will prevent it from happening again.

The best engineering teams treat every incident as a learning opportunity. Google, Cloudflare, and Atlassian all publish post-mortems, not because they enjoy admitting failures, but because they know that transparency and systematic review separate teams that repeat incidents from teams that eliminate them.

When to write a post-mortem:

Any incident affecting users for more than 15 minutes
Any incident requiring manual intervention
Any "near miss" that almost caused a user-facing issue
Any time the same type of failure happens twice
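
The criteria above can be expressed as a simple check. This is a minimal sketch rather than a definitive policy: the `Incident` fields and the `needs_postmortem` function are hypothetical names introduced for illustration, and only the 15-minute threshold comes from the list itself.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    user_impact_minutes: float   # how long users were affected
    manual_intervention: bool    # did someone have to step in by hand?
    near_miss: bool              # almost caused a user-facing issue
    is_repeat_failure: bool      # same type of failure has happened before


def needs_postmortem(incident: Incident, threshold_minutes: float = 15) -> bool:
    """Return True if any of the four criteria from the list applies."""
    return (
        incident.user_impact_minutes > threshold_minutes
        or incident.manual_intervention
        or incident.near_miss
        or incident.is_repeat_failure
    )


blip = Incident(user_impact_minutes=3, manual_intervention=False,
                near_miss=False, is_repeat_failure=False)
restart = Incident(user_impact_minutes=3, manual_intervention=True,
                   near_miss=False, is_repeat_failure=False)
print(needs_postmortem(blip))     # False
print(needs_postmortem(restart))  # True
```

Any single criterion is enough to trigger a review, which is why the conditions are combined with `or`.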

First time dealing with downtime? Start with our site down action plan for immediate steps, then come back here for the review.

Why "Blameless" Isn't Just a Buzzword

If people fear punishment, they hide mistakes. Hidden mistakes become recurring outages. A blameless culture means:

Focus on systems, not people — "The deployment pipeline lacked a rollback mechanism" not "Dave pushed bad code"
Ask "what" not "who" — "What allowed this to reach production?" not "Who approved this?"
Assume good intent — Everyone was making the best decision they could with the information they had at the time
Treat it as learning — The post-mortem is an investment in reliability, not a trial

This doesn't mean no accountability. It means accountability lives in the action items, not in finger-pointing.

The Post-Mortem Template

Copy this template and fill it in for each incident. Below the template, we'll walk through how to fill each section effectively.

Incident Post-Mortem Report

1. Incident Summary

Incident Title	[Brief descriptive title]
Date	[YYYY-MM-DD]
Severity	[Critical / Major / Minor]
Duration	[Total time from detection to resolution]
Impact	[Number of users affected, revenue impact, SLA impact]
Lead	[Person who led the incident response]
Status	[Complete / Action Items Pending]

2. Executive Summary

[2-3 sentences: what happened, what was the impact, what's the key takeaway. Write this for someone who will only read this section.]

3. Timeline

Time (UTC)	Event
[HH:MM]	First signs of issue (monitoring alert / user report)
[HH:MM]	Issue acknowledged by [team/person]
[HH:MM]	Root cause identified
[HH:MM]	Fix deployed / mitigation applied
[HH:MM]	Service fully restored
[HH:MM]	All-clear communicated to stakeholders

4. Root Cause

[Detailed technical explanation of what caused the incident. Be specific. "Database overload" is not enough — explain WHY the database was overloaded.]

5. Contributing Factors

[Factor 1: e.g., "No alerting on database connection pool saturation"]
[Factor 2: e.g., "Deployment happened on Friday afternoon with reduced staff"]
[Factor 3: e.g., "Runbook for this scenario was outdated"]

6. What Went Well

[Thing 1: e.g., "Alert fired within 2 minutes of first errors"]
[Thing 2: e.g., "Status page was updated promptly, reducing support tickets"]
[Thing 3: e.g., "Team coordination on Slack was efficient"]

7. What Went Poorly

[Thing 1: e.g., "Took 45 minutes to identify root cause"]
[Thing 2: e.g., "No runbook existed for this failure mode"]
[Thing 3: e.g., "Customer communication was delayed by 30 minutes"]

8. Action Items

Action	Owner	Priority	Due Date	Status
[Specific action]	[Name]	[P0/P1/P2]	[Date]	[Open/Done]
[Specific action]	[Name]	[P0/P1/P2]	[Date]	[Open/Done]
[Specific action]	[Name]	[P0/P1/P2]	[Date]	[Open/Done]

9. Lessons Learned

[Key insights from this incident that the broader team should know. What would you tell your past self?]

How to Fill Each Section Effectively

The Timeline: Your Most Important Section

The timeline is the backbone of a good post-mortem. Without an accurate timeline, everything else is guesswork.

Tips for building the timeline:

Use UTC timestamps — eliminates timezone confusion, especially for distributed teams
Pull from monitoring data first — your monitoring tool's alert history is more accurate than anyone's memory
Include detection-to-acknowledgment time — this gap reveals alerting effectiveness
Note communication timestamps — when was the status page updated? When were stakeholders notified?
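
Once the timeline is in UTC, the gaps between events are simple arithmetic. The sketch below assumes a hypothetical list of `(timestamp, label)` pairs; the dates, times, and label names are invented for illustration and are not tied to any particular monitoring tool.

```python
from datetime import datetime, timezone

# Hypothetical timeline entries: (UTC timestamp, event label).
# The labels loosely mirror the rows of the timeline table in the template.
timeline = [
    (datetime(2024, 1, 15, 14, 2, tzinfo=timezone.utc), "detected"),
    (datetime(2024, 1, 15, 14, 9, tzinfo=timezone.utc), "acknowledged"),
    (datetime(2024, 1, 15, 14, 47, tzinfo=timezone.utc), "root_cause_identified"),
    (datetime(2024, 1, 15, 15, 5, tzinfo=timezone.utc), "mitigated"),
    (datetime(2024, 1, 15, 15, 20, tzinfo=timezone.utc), "restored"),
]


def minutes_between(events, start_label, end_label):
    """Minutes elapsed between the first occurrences of two event labels."""
    times = {}
    for ts, label in events:
        times.setdefault(label, ts)  # keep the first entry seen for each label
    delta = times[end_label] - times[start_label]
    return delta.total_seconds() / 60


time_to_acknowledge = minutes_between(timeline, "detected", "acknowledged")
time_to_restore = minutes_between(timeline, "detected", "restored")
print(f"Detection to acknowledgment: {time_to_acknowledge:.0f} min")  # 7 min
print(f"Detection to restoration: {time_to_restore:.0f} min")         # 78 min
```

Storing timezone-aware timestamps means the subtraction stays correct even if contributors paste in times from different regions, as long as they are converted to UTC first.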

PerkyDash's incident timeline analysis automatically records every alert, status change, and response — giving you an accurate timeline without manual reconstruction.

Root Cause: Go Deep

"The server crashed" is a symptom, not a root cause. Use the "5 Whys" technique:

Why did the server crash? → Memory exhaustion
Why was memory exhausted? → A query returned 10x expected results
Why did the query return 10x results? → No pagination on the new endpoint
Why was there no pagination? → The endpoint skipped code review
Why did it skip code review? → It was marked as "minor change"

Root cause: changes marked as "minor" can skip code review, and no automated check catches missing pagination on new API endpoints. Now you have an actionable insight.

Action Items: The SMART Test

Bad action item: "Improve monitoring"

Good action item: "Add alerting for database connection pool > 80% saturation, with PagerDuty escalation, by March 1st. Owner: Sarah."

Every action item must be:

Specific — exactly what will change
Measurable — how will you know it's done
Assigned — one owner, not a team
Realistic — achievable with current resources
Time-bound — has a due date
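
Most of these criteria can be checked mechanically before the meeting ends. The following is a rough sketch, assuming a hypothetical `ActionItem` record and a deliberately crude word-count heuristic for "specific"; whether an item is realistic still needs human judgment and is not checked here. The year in the example due date is also an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    """Hypothetical action-item record mirroring the template's table."""
    action: str
    owner: Optional[str] = None      # one named person, not a team
    done_when: Optional[str] = None  # how you will know it is finished
    due: Optional[date] = None


def smart_problems(item: ActionItem) -> List[str]:
    """Return the checks the item fails; an empty list means it passes."""
    problems = []
    # Crude heuristic (an assumption): very short actions are rarely specific.
    if len(item.action.split()) < 4:
        problems.append("not specific: describe exactly what will change")
    if not item.done_when:
        problems.append("not measurable: no completion criterion")
    if not item.owner:
        problems.append("not assigned: needs a single named owner")
    if item.due is None:
        problems.append("not time-bound: no due date")
    return problems


bad = ActionItem(action="Improve monitoring")
good = ActionItem(
    action="Add alerting for database connection pool > 80% saturation",
    owner="Sarah",
    done_when="Alert reaches PagerDuty in a test at 80% saturation",
    due=date(2025, 3, 1),
)
print(smart_problems(bad))   # four problems
print(smart_problems(good))  # []
```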

Running the Post-Mortem Meeting

Don't just fill the template in isolation. The meeting is where insights emerge.

Before the Meeting

Fill in the Summary and Timeline sections in advance
Share the draft with attendees 24 hours before
Ask everyone to add their perspective to the timeline

During the Meeting (45-60 minutes)

5 min	Read the summary together
15 min	Walk through the timeline, fill gaps
10 min	Discuss root cause and contributing factors
10 min	What went well / what went poorly
10 min	Define action items with owners and dates
5 min	Summarize lessons learned

After the Meeting

Publish the post-mortem where the team can reference it
Track action items in your project management tool
Review action items in 2 weeks to ensure progress
Consider publishing a public version for customers
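
The two-week follow-up is easy to automate in outline. This is a minimal sketch, assuming action items are kept as simple tuples; the item descriptions, the owners other than Sarah, and all dates are hypothetical, and a real project management tool would have its own way of querying this.

```python
from datetime import date, timedelta

# Hypothetical action items: (description, owner, due date, status).
action_items = [
    ("Add connection pool alerting", "Sarah", date(2025, 3, 1), "Done"),
    ("Update the failover runbook", "Amir", date(2025, 3, 5), "Open"),
    ("Add rollback step to deploy pipeline", "Lena", date(2025, 3, 20), "Open"),
]


def review_date(meeting_day: date, weeks: int = 2) -> date:
    """Date of the follow-up review, two weeks after the meeting by default."""
    return meeting_day + timedelta(weeks=weeks)


def overdue_items(items, today: date):
    """Open items whose due date has already passed as of `today`."""
    return [item for item in items if item[3] == "Open" and item[2] < today]


follow_up = review_date(date(2025, 2, 24))
print(follow_up)  # 2025-03-10
for description, owner, due, _status in overdue_items(action_items, follow_up):
    print(f"OVERDUE: {description} (owner: {owner}, due {due})")
```

At the review, only the runbook update is flagged: it is still open and its due date has passed, while the pipeline change is open but not yet due.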

Need to communicate with users during an incident? Our guide on writing clear status updates covers the communication side.

Common Post-Mortem Mistakes

Writing the post-mortem weeks later

Do it within 48 hours. Your monitoring data will still be there weeks later, but people's memory of why they made each decision fades fast.

Stopping at "human error"

Humans make mistakes. The question is: why did the system allow that mistake to cause an outage? Always dig deeper.

Action items with no owner

"We should add better monitoring" assigned to nobody will never get done. One person, one date, one deliverable.

Skipping "What Went Well"

Reinforcing what worked is just as important as fixing what didn't. If your alerting caught the issue in 30 seconds, celebrate that.

Never reviewing action items

A post-mortem without follow-through is just a writing exercise. Schedule a review 2 weeks later.

Tools That Make Post-Mortems Easier

The hardest part of a post-mortem is reconstructing the timeline accurately. These tools help:

Monitoring + Timeline Data

Your monitoring tool should give you the raw data: when alerts fired, when the issue started, response times during the incident, and when recovery happened.

PerkyDash provides a complete incident timeline with every check result, alert, and status page update — so your post-mortem timeline writes itself.

Status Page History

If you updated a status page during the incident, that's timestamped communication data. It tells you when you first acknowledged the issue publicly and how communication evolved.

Don't have a status page yet? You can create an emergency status page in 60 seconds — no signup required. Having one ready before an incident makes the post-mortem communication section much easier to fill.

Communication Logs

Collect the Slack/Teams messages, emails, and support tickets from the incident, and export them while they're fresh.

Frequently Asked Questions

How soon after an incident should I write a post-mortem?

Within 24-48 hours. The technical data won't change, but the context — why decisions were made, what information was available when — fades quickly. Start the draft within a day and hold the meeting within a week.

Should we write post-mortems for minor incidents?

Yes, but adjust the depth. A 5-minute blip might just need the Summary, Timeline, and one action item. A 4-hour outage needs the full template. The habit of reviewing incidents matters more than the format.

What's the difference between a post-mortem and a root cause analysis?

Root cause analysis (RCA) focuses narrowly on the technical "why." A post-mortem is broader: it includes the timeline, communication effectiveness, what went well, and action items. Think of RCA as one section within the post-mortem.

Should post-mortems be public?

It depends. Public post-mortems build trust and show transparency. Companies like Cloudflare and GitLab regularly publish them. At minimum, share a summary with affected customers. Internally, always share the full post-mortem.

How do I handle repeat incidents?

If the same type of incident happens twice, reference the previous post-mortem. Check if action items were completed. If they were and it still happened, the fix was insufficient. If they weren't completed, that's a process problem to address.
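
The decision described above can be sketched as a small function. The function name and the "Open"/"Done" status strings are assumptions borrowed from the template's Action Items table, and the handling of an empty list is an added case that the text does not cover.

```python
def diagnose_repeat(previous_action_statuses):
    """Classify a repeat incident from the previous post-mortem's action items.

    `previous_action_statuses` is a list of "Open"/"Done" strings taken from
    the earlier report's Action Items table.
    """
    if not previous_action_statuses:
        # Added case (assumption): the earlier review recorded no action items.
        return "no previous action items: nothing was committed to last time"
    if all(status == "Done" for status in previous_action_statuses):
        return "fix insufficient: revisit the root cause analysis"
    return "process problem: action items were not completed"


print(diagnose_repeat(["Done", "Done"]))
print(diagnose_repeat(["Done", "Open"]))
```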

Get the Data You Need for Better Post-Mortems

The hardest part of a post-mortem is the timeline. PerkyDash automatically records every alert, status change, and response time — so when an incident happens, your timeline is already written.


