How To Handle Major Incidents (6 Steps)
Lessons from managing large-scale operational incidents.
If you have worked in software engineering for any length of time, you’ll know how often things go wrong.
When your pager goes off at 2 a.m., the last thing you want someone to ask is, “What should we do?”
Prolonged downtime in software systems can damage customer trust and sometimes cost thousands or millions of dollars. Many things can cause downtime, such as hardware failures, networking issues, bad deployments, software bugs, or third-party system outages.
Whatever the reason, when major incidents occur, you need a process to handle them effectively.
What Is A Major Incident?
The definition of a major incident will differ between companies, but typically, it means “a problem that has a major business impact affecting many customers.”
Examples:
An AWS service outage affecting thousands of customers.
Your service is not operational (it’s hard down).
Multiple customers are unable to use your service.
PagerDuty defines incident severity and categorises a major incident (Sev1) as a “critical issue that warrants public notification and liaison with executive teams.”
Examples:
The system is in a critical state and is actively impacting many customers.
Functionality has been severely impaired for a long time, breaking SLA.
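The examples above can be turned into a simple check that an alerting tool or runbook might apply. This is only a minimal sketch: the `IncidentSignal` fields, the `is_major_incident` function, and the `MANY_CUSTOMERS` threshold are all hypothetical names and values chosen for illustration, and every company will need to set its own criteria.

```python
from dataclasses import dataclass

# Hypothetical threshold: how many affected customers count as "multiple".
# This value is an assumption; pick one that matches your own definition.
MANY_CUSTOMERS = 2


@dataclass
class IncidentSignal:
    hard_down: bool          # the service is not operational at all
    customers_affected: int  # customers currently unable to use the service
    sla_breached: bool       # impairment has lasted long enough to break SLA


def is_major_incident(signal: IncidentSignal, many: int = MANY_CUSTOMERS) -> bool:
    """Return True if any of the example criteria for a major incident is met."""
    return (
        signal.hard_down
        or signal.sla_breached
        or signal.customers_affected >= many
    )


# One customer with a degraded experience is not treated as major here.
print(is_major_incident(IncidentSignal(False, 1, False)))
# A hard-down service is always treated as major.
print(is_major_incident(IncidentSignal(True, 0, False)))
```

Keeping the criteria in one small function makes them easy to review and change as your definition of a major incident evolves.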
How Should You Handle Major Incidents?
The secret to handling major incidents is to prepare and have clear policies and procedures. Here are the key steps you should follow:
During The Event
1. Get The Right People In The Room
Bring everyone together on a video call. Make sure you have the right subject matter experts to mitigate problems. In most cases, this will be an on-call engineer, but you may also need to page in additional experts.
2. Assess The Impact
Understand the high-level impact. What are customers experiencing? Which customers are impacted? What systems are affected?
3. Manage Communication + Roles
Major incidents can be stressful events, and strong communication is key.
Someone needs to be in charge during a major incident. This could be a manager, senior engineer, or a dedicated person who handles major incidents. Who it is will depend on the size of your company and the options available to you.
But make sure that one person is in charge of managing the incident. Operational emergencies are one of the few times when a command-and-control leadership style works best.
Communicate with your leadership, keeping them updated on what’s happening. Communicate with customers. Communicate with the team, ensuring they have the full context of what has happened and what is going on.
The following questions can be used to create clarity:
What happened? (increased 5xx errors in us-east-1 caused alarms to fire)
What is the suspected cause? (load balancer is overloaded)
What do you observe? (customers seeing 5xx errors, traffic not reaching API)
What actions are being taken, and by whom? (scaling load balancer)
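The four questions above can double as a template for status updates posted during the incident. Here is a minimal sketch of that idea; the `StatusUpdate` and `Action` classes and the owner name are hypothetical and exist only for illustration, while the example values come from the questions above.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    description: str
    owner: str  # who is carrying out the action


@dataclass
class StatusUpdate:
    what_happened: str
    suspected_cause: str
    observations: list[str]
    actions: list[Action] = field(default_factory=list)

    def render(self) -> str:
        """Format the update as plain text for chat, email, or a status page."""
        lines = [
            f"What happened: {self.what_happened}",
            f"Suspected cause: {self.suspected_cause}",
            "Observations: " + "; ".join(self.observations),
        ]
        if self.actions:
            for action in self.actions:
                lines.append(f"Action: {action.description} (owner: {action.owner})")
        else:
            # Make a missing owner visible instead of silently omitting it.
            lines.append("Action: none assigned yet")
        return "\n".join(lines)


update = StatusUpdate(
    what_happened="Increased 5xx errors in us-east-1 caused alarms to fire",
    suspected_cause="Load balancer is overloaded",
    observations=["Customers seeing 5xx errors", "Traffic not reaching API"],
    actions=[Action("Scaling load balancer", owner="on-call engineer")],
)
print(update.render())
```

Using the same structure for every update means leadership, customers, and the team always know where to find each piece of information.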
4. Mitigate Customer Impact As Your Priority
During an operational incident, you have one primary goal: mitigating customer impact. A common trap is getting lost in understanding root causes, which ultimately prolongs major incidents and their impact on customers.
Don’t spend time exploring root causes or conducting extensive investigations. Instead, focus on actions that will resolve the impact.
Make a deeper diagnosis after the event.
After The Event
5. Write A Blameless Postmortem
After the event, it’s best practice to write about what happened so that others can understand the incident in depth and learn from it.
The most critical part of this process is to make it blameless. Don’t include individual names; don’t blame. The purpose of a postmortem is shared learning and continuous improvement to prevent future errors. People will make errors. Fix systems and processes to reduce the probability of human error in future.
The write-up should contain a detailed analysis of the human and technical aspects and be shared widely so everyone can learn together.
6. Fix The Root Cause
As part of the postmortem, you should have identified actions that can be taken to prevent the incident from happening again. These could include improving testing, handling the incident better, or anything else you identify during your postmortem analysis.
Summary
Major incidents are inevitable in software teams. Having a clear and effective process in place when they occur is essential.
Six things you should do when major incidents occur:
Get the right people in the room
Assess impact
Communicate
Prioritise mitigating impact
Write a detailed postmortem analysis
Take action to prevent future incidents