Meetings have a bad reputation, but there are a few that all software engineering teams should consider. An operations review meeting (ops review) is at the top of my list.
A weekly operations review is a chance to drive operational excellence within your company. Many large tech companies use operational review meetings to maintain accountability, share insights, and continuously improve software systems.
Here are my suggestions based on running these meetings for several years.
A good operations review meeting answers three specific questions:
How did our products/services perform last week? (Review metrics and SLOs)
What went wrong that we should learn from? (Review major customer-impacting incidents)
What actions will we take to improve our services based on what we’ve learned?
Amazon highlights the importance of an operations review meeting in its Well-Architected Framework.
Here’s a summary:
Operational review meetings are regular gatherings where teams from across the organization come prepared with an operational dashboard that showcases telemetry data, performance metrics, and other insights into operations for their products. The aim is to present to a broad audience to share and gain different perspectives on changes in the data, whether it is a spike, dip, or trend. This promotes a culture of transparency, preparedness, and continuous improvement throughout the organization.
Amazon implements this by holding weekly ops review meetings and using the spinning wheel as a random selection method for which team will present. The randomness of the selection ensures that each team comes prepared, as any team can be called upon to present. When presenting, teams must be capable of deep diving into the data, explaining root causes behind notable data changes, and articulating the steps taken or planned to rectify any anomalies. This pushes teams to maintain high-quality operational dashboards that reflect the real-time health and performance of their services.
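The spinning wheel is easy to approximate in a few lines of code. Here is a minimal Python sketch of a uniform random pick. The team names and the `pick_presenting_team` helper are my own illustrative assumptions, not anything Amazon publishes.

```python
import random


def pick_presenting_team(teams, rng=None):
    """Pick one team uniformly at random, like a spin of the wheel."""
    if not teams:
        raise ValueError("need at least one team to pick from")
    rng = rng or random.Random()
    return rng.choice(teams)


# Hypothetical team names, for illustration only.
teams = ["payments", "search", "checkout", "platform"]
print("This week's presenter:", pick_presenting_team(teams))
```

Every team has the same chance of being picked each week, which is the point: nobody can skip preparing because they presented recently.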
What To Include In An Ops Review
Here’s a suggested agenda based on my experience running these meetings. Feel free to steal it or adjust it to your own needs. I’ve given some time hints, but these will depend on your company and the scale of your operations. Large organisations will need more time to review, and smaller companies will require less time.
If you’re lucky, you’ll have no major incidents and can spend more time in other areas. When there are major incidents, have a process to track follow-up actions. The goal is to change your processes or systems so that the same major incident does not happen again.
Sample Operations Review Agenda
PSAs
Follow Up
Incidents
Metrics
Dashboards
Let’s break these down.
1. PSAs 📢 (5 Mins)
Review any important public service announcements (PSAs) your engineering teams should know about.
The purpose of a PSA is to ensure everyone is aware of important information and actions they may need to take. For example, “There is a critical security vulnerability in the latest Ruby version, and all teams should plan to upgrade ASAP.”
2. Follow Up ✅ (5 Mins)
Any good operational review should focus on continual improvement. If you have action items or follow-ups from previous meetings, review them each week to maintain accountability.
Explicitly tracking important follow-up actions increases the likelihood they’ll get done.
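If you don't already have a tracker, even a very small data structure is enough to start. The sketch below is one possible shape, assuming each action has an owner and a due date. The `ActionItem` fields and the sample items are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    # Field names are illustrative assumptions, not a standard.
    description: str
    owner: str
    due: date
    done: bool = False


def review_list(items, today):
    """Return unfinished items, oldest due date first, each flagged if overdue."""
    pending = sorted((i for i in items if not i.done), key=lambda i: i.due)
    return [(item, item.due < today) for item in pending]


# Hypothetical follow-ups from earlier meetings.
items = [
    ActionItem("Add alarm on queue depth", "alice", date(2024, 5, 10)),
    ActionItem("Upgrade Ruby on API hosts", "bob", date(2024, 5, 24)),
    ActionItem("Write runbook for failover", "carol", date(2024, 5, 3), done=True),
]
for item, overdue in review_list(items, today=date(2024, 5, 17)):
    status = "OVERDUE" if overdue else "open"
    print(f"[{status}] {item.description} ({item.owner}, due {item.due})")
```

Listing overdue items first keeps the five-minute slot focused on the actions that most need a nudge.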
3. Incidents 🔥 (30 Mins)
Most software engineering teams maintain availability through on-call rotations, where engineers are paged when things go wrong (major incidents). A major incident is any operational event that impacts customers (e.g., a service is unavailable or starts returning errors). Incident reviews should always be blameless and focus on improving processes.
Teams should come prepared with details about what happened, preferably in written format, with timelines of the events and a clear understanding of the root cause.
If you’re facilitating the meeting, ensure people know what incidents will be discussed beforehand so they can prepare.
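A written incident summary can be as simple as a title, a root cause, and a timestamped timeline. The following Python sketch shows one way to structure it. The `Incident` class and the example events are assumptions I've made for illustration, not a prescribed format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical structure for a written incident summary.
    title: str
    root_cause: str
    timeline: list = field(default_factory=list)  # (datetime, note) tuples

    def add_event(self, when, note):
        """Record an event and keep the timeline in chronological order."""
        self.timeline.append((when, note))
        self.timeline.sort(key=lambda event: event[0])

    def duration(self):
        """Time between the first and last recorded events."""
        if len(self.timeline) < 2:
            return timedelta(0)
        return self.timeline[-1][0] - self.timeline[0][0]


# Example data is made up for illustration.
incident = Incident("Checkout API returning errors", "Expired TLS certificate")
incident.add_event(datetime(2024, 5, 14, 9, 45), "Certificate renewed, errors stop")
incident.add_event(datetime(2024, 5, 14, 9, 0), "Error rate alarm pages on-call")
incident.add_event(datetime(2024, 5, 14, 9, 20), "Root cause identified")
for when, note in incident.timeline:
    print(when.strftime("%H:%M"), note)
print("Duration:", incident.duration())
```

Keeping events sorted by time means people can add notes in any order while preparing, and the review still reads top to bottom.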
From Amazon:
As part of responding to incidents or events, evaluate which metrics were helpful in addressing the issue and which metrics could have helped that are not currently being tracked. Use this method to improve the quality of metrics you collect so that you can prevent, or more quickly resolve future incidents.
Review the top 1-3 major incidents in your organisation from the past week.
What are the learnings?
Were core KPIs and SLOs met?
What actions should you take based on what you have learned?
Ideally, at the end of an incident review, everyone should come away with a clear understanding of what happened, why it happened, and how to prevent it from happening again.
4. Metrics 📈 (10 Mins)
Reviewing metrics regularly helps you spot system trends that would otherwise go unnoticed. With any metrics review, the goal is to use data to understand how you’re doing and inform actions your team can take to improve.
You can also use the metrics review to spot gaps in what you measure and alarms that have gone stale.
From Amazon:
Continually review metrics that are being collected to verify that they properly identify, address, or prevent issues. Metrics can also become stale if you let them stay in an alarm state for an extended period of time.
The metrics you track will be individual to your service and team, but obvious metrics to review frequently are:
SLOs - Did you meet your performance goals for operations? If not, why not?
SLAs - Did you meet your SLAs, such as responding to customer requests or on-call incidents on time?
DORA - Metrics such as change failure rate, deployment frequency, and time to restore give a good insight into your team's health.
On-Call - Is the number of pages your on-call engineers receive going up or down over time?
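To make a couple of these concrete, here is a rough Python sketch of how an availability SLO check and a change failure rate could be computed from weekly counts. The 99.9% target, the function names, and the sample numbers are all assumptions for illustration. Your own SLO definitions may be based on latency or something else entirely.

```python
def slo_report(total_requests, failed_requests, target=0.999):
    """Compare measured availability with an availability SLO target.

    budget_used is the share of the allowed failures (the error budget)
    consumed in the period: 1.0 means the budget is exactly spent.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < target < 1:
        raise ValueError("target must be between 0 and 1")
    availability = 1 - failed_requests / total_requests
    allowed_failures = total_requests * (1 - target)
    return {
        "availability": availability,
        "met": availability >= target,
        "budget_used": failed_requests / allowed_failures,
    }


def change_failure_rate(deployments, failed_deployments):
    """DORA change failure rate: failed deployments / total deployments."""
    if deployments <= 0:
        raise ValueError("deployments must be positive")
    return failed_deployments / deployments


# Made-up weekly numbers, for illustration only.
report = slo_report(total_requests=1_000_000, failed_requests=500)
print(f"Availability {report['availability']:.4%}, SLO met: {report['met']}")
print(f"Error budget used: {report['budget_used']:.0%}")
print(f"Change failure rate: {change_failure_rate(20, 3):.0%}")
```

Reporting the error budget alongside the pass/fail result helps in the meeting, since it shows how close a service came to missing its target even in a good week.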
5. Dashboards 📊 (10 Mins)
Let one team deep dive into their operational dashboards each week. A dashboard review aims to ensure that all teams have high-quality operational dashboards that give real-time insight into system health.
Allow questions to drive the conversation and look for interesting spikes, dips, or trends that could indicate problems.
Dashboard reviews can also help spot trends that might get missed in higher-level metrics reviews. For example, “Why is API latency slowly increasing over time?”
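A slow upward drift like that is hard to see by eye, but easy to check with a simple trend line. Below is a minimal Python sketch that fits a least-squares slope to evenly spaced samples, such as daily p95 latency. The sample values and the `trend_slope` name are hypothetical, and a real dashboard tool may offer its own trend functions.

```python
def trend_slope(values):
    """Least-squares slope of evenly spaced samples (units per sample)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Made-up daily p95 API latency in milliseconds.
daily_p95_ms = [120, 122, 121, 125, 127, 126, 130]
slope = trend_slope(daily_p95_ms)
print(f"Latency trend: {slope:+.2f} ms per day")
```

A positive slope that persists week after week is a good prompt for the "why is this slowly increasing?" question, even when no single day looks alarming.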
Summary
Building a culture of operational excellence is essential for any software engineering team in the modern world, where customers expect software services to be reliable and performant.
A well-run operational review can be one of the most useful meetings in your team's calendar.
Use a weekly operations review to maintain high operational standards and collaborate with your team to ensure continuous improvement.
Thanks for reading.