SEM 12: The Engineering Manager Guide To DORA Metrics
What are DORA metrics? Why should you care as a manager?
Metrics can be a valuable tool for better understanding your team and identifying areas for improvement.
If you’re an engineering team that isn’t tracking any operational metrics, DORA offers a good starting point for becoming more data-driven.
DORA metrics do not measure “productivity”. They help answer questions about how quickly, frequently, and reliably your team can deploy working software to customers.
What are DORA metrics?
DevOps Research and Assessment (DORA) metrics are five key performance indicators (KPIs) used to measure the performance and efficiency of software development and delivery processes.
The four original DORA metrics are deployment frequency, lead time for change, change failure rate, and mean time to recover. Recently, a fifth metric, reliability, was added to the set.
Deployment frequency and lead time for change measure velocity. Change failure rate and time to recover measure stability. Reliability measures whether customers can depend on your service.
DORA Metrics Help Answer Specific Questions
I believe DORA metrics are best understood as a way to answer five specific questions about the efficiency of software delivery.
What you do with the answers will depend on your team and your goals. For example, if you work in AWS building cloud services for customers, you’ll likely prioritise stability and reliability over velocity. You will care a lot about failure rates, recovery from failure, and reliability. If you work at an early-stage startup with few customers, you might prioritise velocity over stability.
The specific questions that DORA answers are listed below.
Velocity: Can you deliver software quickly?
How often does your team deploy code to production? (Deployment Frequency)
How long does it take your team to go from code committed to code successfully running in production? (Lead Time For Change)
Stability: Can you deliver software without breaking things?
How frequently do deployments introduce a failure that requires immediate intervention? (Change Failure Rate)
On average, how long does it take to recover from a failed deployment? (Mean Time To Recover)
Reliability: Can your customers trust you?
How reliable is your service for customers? (Reliability)
The Five DORA Metrics
Deployment Frequency (DF) 🚀
Example calculation: absolute number of deployments to production in a period (day/week/month)
Deployment frequency is a straightforward metric that answers how often your team releases changes to production. Higher deployment frequency is typically associated with a healthier development process: you release smaller changes, which reduces deployment risk, and you ship value to customers sooner.
While higher deployment frequency is generally better, be careful not to overgeneralise. Every team and project will have a different benchmark for what “high deployment frequency” should be. I currently work at a company building a public cloud, where changes carry a high cost in review and process overhead. Releasing too frequently would burn developer time on process and could also introduce risk. For example, networking configuration changes require much stricter process control than pushing out low-risk UI changes on a web app.
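If you want to compute this yourself, here's a minimal Python sketch, assuming you can export a list of production deployment timestamps (the `deploy_times` below are made up):

```python
from collections import Counter
from datetime import datetime

# Hypothetical export of production deployment timestamps,
# e.g. pulled from your CI/CD pipeline's history.
deploy_times = [
    datetime(2024, 3, 4, 10, 15),
    datetime(2024, 3, 5, 16, 40),
    datetime(2024, 3, 12, 9, 5),
]

# Deployment frequency: count deployments per ISO week.
per_week = Counter(t.isocalendar()[:2] for t in deploy_times)
for (year, week), count in sorted(per_week.items()):
    print(f"{year}-W{week:02d}: {count} deployment(s)")
```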
Lead Time For Change (LT) 🚀
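Example calculation: median([deploy_time - commit_time …])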
Lead Time For Change shows you how long it typically takes for code changes to make it from a committed state to running in production.
In my opinion, this is the most complex and expensive metric to calculate. You might want to use sampling or some other simplification to make it workable. To calculate it accurately, you'll need to know when each commit happened and when it was successfully deployed to production. Once you have this data, you can calculate the median lead time (the median is less skewed by outlier changes than the mean).
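Here's a minimal Python sketch of that calculation, assuming you can join each commit's timestamp to the timestamp of the deployment that shipped it (the pairs below are hypothetical):

```python
from datetime import datetime
from statistics import median

# Hypothetical (committed, deployed) timestamp pairs, joined from
# your version control history and your deployment logs.
changes = [
    (datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 4, 15, 30)),
    (datetime(2024, 3, 5, 11, 0), datetime(2024, 3, 7, 10, 0)),
    (datetime(2024, 3, 6, 14, 0), datetime(2024, 3, 6, 18, 45)),
]

# Lead time per change, in hours.
lead_times = [(deployed - committed).total_seconds() / 3600
              for committed, deployed in changes]

print(f"Median lead time: {median(lead_times):.1f} hours")
```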
Change Failure Rate (CFR) ❌
Example calculation: (failed deployments / total deployments) * 100
You can calculate this as a percentage. It answers a specific question: what percentage of changes to production result in degraded service or require remediation (for example, a rollback)?
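A quick Python sketch, assuming you record for each deployment whether it needed remediation (the data below is made up):

```python
# Hypothetical deployment log for the period: True means the deployment
# caused a failure needing remediation (e.g. a rollback or hotfix).
deployments = [False, False, True, False, False, False, True, False]

failed = sum(deployments)  # True counts as 1
change_failure_rate = failed / len(deployments) * 100

print(f"Change failure rate: {change_failure_rate:.1f}%")  # 25.0%
```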
Mean Time To Recover (MTTR) 🔥
Example calculation: mean([incident_end_time - incident_start_time …])
Gartner estimates that unplanned downtime costs around $5,600 per minute. Therefore, it is essential to recover quickly from operational incidents.
To calculate this metric, record each incident's start time and end time (when the service is fully restored) to get its duration, then take the average across incidents.
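A minimal Python sketch of that calculation, assuming you've exported incident start and end timestamps (the records below are hypothetical):

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (start_time, fully_restored_time),
# e.g. exported from PagerDuty or auto-cut tickets.
incidents = [
    (datetime(2024, 3, 4, 2, 10), datetime(2024, 3, 4, 2, 55)),
    (datetime(2024, 3, 11, 14, 0), datetime(2024, 3, 11, 16, 20)),
]

durations = [end - start for start, end in incidents]
mttr = sum(durations, timedelta()) / len(durations)

print(f"Mean time to recover: {mttr}")  # 1:32:30
```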
Monitoring tools and incident management platforms like PagerDuty can provide timestamps and data on incident resolution times. Alternatively, you may have auto-cut JIRA tickets tracking operational outages that can simplify data collection.
This is a relatively actionable metric. For example, you could introduce automated rollbacks when alarms fire to speed up recovery times, improve your alarming to detect issues faster, or introduce a 24/7 on-call rotation if you don’t already have one.
Reliability 👍
Example calculation: It depends
A reliable service consistently meets or exceeds its availability, performance, and quality goals. Reliability measures your ability to meet your reliability objectives, such as SLAs, performance targets, and error budgets.
There is no single reliability KPI, so you must decide what makes sense for your purposes.
If the service you own is an API, you might look at a combination of error rates (5xx errors) to determine availability and latency to determine performance. We might say that a reliable API is one in which all requests succeed without error and complete within our latency SLO. We could calculate a reliability score from these two indicators:
total_requests = the total number of API calls
5xx_errors = the number of API calls that resulted in a 5xx error
latency_errors = the number of API calls that took > SLO (e.g. 2000ms) to complete
reliability = ((total_requests - (5xx_errors + latency_errors)) / total_requests) * 100
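A quick Python sketch of that score with made-up numbers:

```python
# Hypothetical counts from your API's request logs or metrics.
# Note: "5xx_errors" is renamed server_errors here, since Python
# identifiers can't start with a digit.
total_requests = 1_000_000
server_errors = 120      # requests that returned a 5xx error
latency_errors = 480     # requests that took longer than the 2000ms SLO

failed = server_errors + latency_errors
reliability = (total_requests - failed) / total_requests * 100

print(f"Reliability: {reliability:.3f}%")  # 99.940%
```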
How (NOT) To Use DORA Metrics
Don’t use metrics to measure productivity. Use them to understand problems and find ways to help your team.
The value of metrics like DORA comes from paying attention to them over time. For example, if your change failure rate is increasing, that's a prompt to investigate why.
Weekly Review
Once you have DORA metrics, you and your team can review them regularly and investigate when things look off. Change Failure Rate (CFR) and Reliability are two metrics worth reviewing weekly because they directly impact your customers' perception of your service.
For example, if your change failure rate is increasing weekly, you should take action to find out why. If your reliability score drops, investigate and potentially make changes.
Goal Setting
Metrics help you understand. With that understanding, you can make investments in improving your process. You can use DORA metrics in OKRs or SMART goals.
Objective: Improve the availability of our service for customers
Key Result: Reduce CFR from 5% to <1%
Key Result: Achieve > 99.99% Reliability
Takeaway
DORA metrics can be used to answer specific questions about your development process.
The questions are:
How often does your team deploy code to production? (Deployment Frequency)
How long does it take to go from code committed to code successfully running in production? (Lead Time for Change)
How frequently do deployments introduce a failure that requires immediate intervention? (Change Failure Rate)
On average, how long does it take to recover from a failed deployment? (Mean Time to Recover)
How reliable is our service for our customers? (Reliability)
Deployment frequency and lead time for changes measure velocity. Change failure rate and time to recover measure stability. Reliability measures the extent to which your customers can rely on your service to do what it’s meant to do.
Metrics answer questions.
If you don’t like the answer, set a goal and find ways to improve.
Thanks for reading.
Get in touch
I’m always happy to meet and talk with other engineering managers. Let’s connect on LinkedIn.