SEM 18: Availability In Software Systems
How to measure the availability of your software systems
Availability is an essential KPI to measure if you operate software systems. It represents whether or not your system was available to its users and is represented as a percentage over time.
A service can be available (availability SLO is met), have degraded performance (service is available to use but not fully functional), or be unavailable (a user cannot use the system).
If you operate services for customers, availability will often be the most important KPI you can offer because customers will pay a high cost if your system is unavailable. Mission-critical services must be architected with high availability in mind.
Various factors can influence availability. These include software and hardware failures, deployments introducing regressions or errors, and software system dependencies that may fail. If you’re responsible for operating software services, availability is the metric that will keep you up at night.
High availability (HA) refers to a system’s ability to operate continuously – without downtime or failure. Highly available systems are designed to operate even during unexpected events.
Highly available systems rarely fail. Services like Google and Amazon are two examples of highly available software systems. They achieve high availability using architectural best practices, such as redundancy and fault isolation mechanisms, to ensure that the service can operate normally if one part of the system fails.
Impact Of Outages
Since customers rely on many software systems to run their businesses, an outage can prevent them from operating. According to IBM, the average cost of unplanned application outages was estimated at $400,000 per hour.
In 2021, Meta experienced a massive outage that took down services such as WhatsApp, Instagram, and Messenger. The outage lasted around six hours and was caused by a “boring error” - a simple networking misconfiguration. The outage cost Facebook an estimated $60 million in lost ad revenue.
Defining Availability
Availability is the percentage of time a system is available and working for customers.
The availability of a system is typically expressed in terms of the number of “nines” we want to provide, such as 99.9%, 99.99%, or 99.999%. Achieving 99.999% availability, often called “five nines,” is the best-in-class standard. Five nines translates to a yearly downtime of 5.26 minutes.
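To see where a figure like 5.26 minutes comes from, here is a minimal Python sketch that converts an availability target into a yearly downtime budget. The function name allowed_downtime_minutes and the 365-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime budget, in minutes, for a target over a period."""
    return period_minutes * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% -> {budget:.2f} minutes of downtime per year")
```

Each extra nine divides the budget by ten: roughly 525.6 minutes per year at three nines, 52.56 at four, and 5.26 at five.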
Availability is always calculated over time, for example, as a percentage of availability this week, month, or quarter. This is important because your system might miss its availability target over a week but still meet it over a month. Be clear about the period when talking about availability.
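The effect of the measurement period can be sketched with a single hypothetical outage. The 20-minute outage, the 99.9% target, and the 30-day month below are illustrative assumptions, not figures from a real system.

```python
MINUTES_PER_WEEK = 7 * 24 * 60     # 10,080
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200; assumes a 30-day month

TARGET_PERCENT = 99.9              # hypothetical availability target
OUTAGE_MINUTES = 20                # hypothetical single outage


def availability_percent(downtime_minutes: float, period_minutes: float) -> float:
    """Share of the period, as a percentage, during which the system was up."""
    return (period_minutes - downtime_minutes) / period_minutes * 100


weekly = availability_percent(OUTAGE_MINUTES, MINUTES_PER_WEEK)
monthly = availability_percent(OUTAGE_MINUTES, MINUTES_PER_MONTH)

print(f"Week:  {weekly:.3f}% (target met: {weekly >= TARGET_PERCENT})")
print(f"Month: {monthly:.3f}% (target met: {monthly >= TARGET_PERCENT})")
```

The same outage misses the target when measured over the week and meets it when measured over the month, which is why the period always needs to be stated.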
Methods Of Calculating Availability
There are two common ways to calculate the availability of a system.
Time-based availability = (uptime / (uptime + downtime)) * 100
Error-based availability = (successful requests / total requests) * 100
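Both formulas translate directly into code. The sketch below is one possible rendering in Python; the function names and the choice to raise an error on an empty window are assumptions, not part of any standard.

```python
def time_based_availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """uptime / (uptime + downtime) * 100"""
    total = uptime_seconds + downtime_seconds
    if total == 0:
        # Assumption: an empty observation window is treated as a caller error.
        raise ValueError("observation window is empty")
    return uptime_seconds / total * 100


def error_based_availability(successful_requests: int, total_requests: int) -> float:
    """(successful requests / total requests) * 100"""
    if total_requests == 0:
        # Assumption: with no traffic there is nothing to measure.
        raise ValueError("no requests in the observation window")
    return successful_requests / total_requests * 100


# Hypothetical 30-day month (43,200 minutes) with 43 minutes of downtime.
print(f"{time_based_availability(43_157 * 60, 43 * 60):.3f}%")

# Hypothetical month with 1,000,000 requests, 500 of which failed.
print(f"{error_based_availability(999_500, 1_000_000):.3f}%")
```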
Time-based availability is a traditional way of measuring system availability, and it made a lot of sense when software applications were simpler than they are today. It defines availability in a binary way: “Was the system up or down?”
Time-based availability is problematic and relatively inflexible for modern software systems that run in multiple regions worldwide (e.g. if you offer cloud services to customers). Such a system is rarely entirely up or entirely down: it can fail in one region, or for a subset of requests, while continuing to serve everyone else.
Error-based availability is a more flexible and appropriate way to calculate availability for most modern systems. The definition of a successful request can be modified to suit the system you are working on. For example, an API-based system might define a failed request as one that returned a 5xx code or whose latency exceeded an acceptable threshold, and a batch-processing system might define a failure as a job that was not processed correctly.
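As a rough illustration of a configurable definition of success, the Python sketch below counts a request as successful when it did not return a 5xx code and finished within a latency threshold. The Request type, the 500 ms default threshold, and the decision to count 4xx responses as successful are all assumptions for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Request:
    status_code: int
    latency_ms: float


def is_successful(request: Request, latency_threshold_ms: float = 500.0) -> bool:
    """A request succeeds if it is not a 5xx and is fast enough."""
    # Assumption: 4xx responses are client errors and still count as served.
    return request.status_code < 500 and request.latency_ms <= latency_threshold_ms


def availability_percent(requests: Iterable[Request],
                         latency_threshold_ms: float = 500.0) -> float:
    """Error-based availability using the success definition above."""
    samples = list(requests)
    if not samples:
        raise ValueError("no requests to evaluate")
    good = sum(1 for r in samples if is_successful(r, latency_threshold_ms))
    return good / len(samples) * 100


sample = [
    Request(200, 120.0),   # fast and OK: success
    Request(404, 80.0),    # client error: still counted as success
    Request(503, 40.0),    # server error: failure
    Request(200, 900.0),   # too slow: failure
]
print(availability_percent(sample))
```

Changing is_successful is all it takes to adapt the calculation to a different kind of system, such as a batch pipeline that checks job outcomes instead of status codes.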
Calculating Availability
To calculate availability, we use Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to define precisely how to measure it. SLIs are the metrics that measure whether a service is available, and SLOs are the goals we are trying to meet for those metrics.
Error Rate (SLI) = The proportion of HTTP requests that resulted in a 5xx error response.
Goal (SLO): Error Rate < 0.05%.
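The SLI and SLO pair above can be checked with a few lines of Python. This is a sketch only: it treats the error rate as the share of requests that returned a 5xx response, and the function names and sample traffic are hypothetical.

```python
from typing import Iterable

SLO_MAX_ERROR_RATE_PERCENT = 0.05  # the goal: error rate < 0.05%


def error_rate_percent(status_codes: Iterable[int]) -> float:
    """SLI: percentage of requests that returned a 5xx status code."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no requests to evaluate")
    errors = sum(1 for code in codes if 500 <= code <= 599)
    return errors / len(codes) * 100


def meets_slo(status_codes: Iterable[int],
              max_error_rate_percent: float = SLO_MAX_ERROR_RATE_PERCENT) -> bool:
    """SLO: the measured error rate must stay strictly below the goal."""
    return error_rate_percent(status_codes) < max_error_rate_percent


# Hypothetical traffic: 10,000 requests with 4 and then 6 server errors.
good_period = [200] * 9_996 + [500] * 4
bad_period = [200] * 9_994 + [503] * 6

print(meets_slo(good_period))  # 0.04% error rate
print(meets_slo(bad_period))   # 0.06% error rate
```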
For typical software systems, some examples of factors you might include when calculating availability could be:
Success Rate: The proportion of requests that resulted in a successful response (the complement of the error rate).
Latency: The proportion of requests that were faster than some threshold.
Correctness: The proportion of records processed that resulted in the correct value coming out.
Data freshness: The proportion of the data updated more recently than some time threshold.
Putting it all together, the following example shows how GitLab defined availability for their web tier:
Within a 5-minute period:
At least 90% of requests have a latency within their “satisfactory” threshold
AND, less than 0.5% of requests return a 5XX error status response.
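A combined rule like this one can be evaluated per 5-minute window. The Python sketch below is a simplified illustration of the two conditions, not GitLab's actual implementation; the function name, the input shape, and the 1,000 ms “satisfactory” threshold are assumptions.

```python
from typing import Iterable, Tuple


def window_is_available(
    requests: Iterable[Tuple[int, float]],
    satisfactory_latency_ms: float,
    min_satisfactory_ratio: float = 0.90,
    max_error_ratio: float = 0.005,
) -> bool:
    """Evaluate one 5-minute window of (status_code, latency_ms) pairs."""
    samples = list(requests)
    if not samples:
        # Assumption: a window with no traffic counts as available.
        return True
    satisfactory = sum(1 for _, latency in samples
                       if latency <= satisfactory_latency_ms)
    errors = sum(1 for status, _ in samples if 500 <= status <= 599)
    return (
        satisfactory / len(samples) >= min_satisfactory_ratio
        and errors / len(samples) < max_error_ratio
    )


# Hypothetical windows of 100 requests each.
healthy = [(200, 120.0)] * 95 + [(200, 2500.0)] * 5   # 95% fast, no errors
failing = [(200, 120.0)] * 99 + [(503, 120.0)]        # 1% server errors

print(window_is_available(healthy, satisfactory_latency_ms=1000.0))
print(window_is_available(failing, satisfactory_latency_ms=1000.0))
```

The first window passes both conditions; the second fails because 1% of its requests returned a 5XX status, which is above the 0.5% limit.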
Availability Of Complex Systems
Many modern software systems consist of multiple services that work together to provide users with value. For example, your application might have an API server, a web UI, and a worker tier.
You can calculate availability for individual parts of the system (for example, the API only), but users care whether the system as a whole works as expected.
To arrive at a single availability score that reflects user experience, you can calculate an average or weighted average across the parts of the system.
API: 99.98%
Web: 99.97%
Worker: 99.95%
System Average Availability: (API + Web + Worker) / 3 = 99.97%
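The same calculation, along with a weighted variant, can be sketched in Python. The component figures come from the example above; the weights are hypothetical and would in practice reflect how much each part matters to users.

```python
from typing import Mapping


def average_availability(components: Mapping[str, float]) -> float:
    """Simple mean of the per-component availability percentages."""
    return sum(components.values()) / len(components)


def weighted_availability(components: Mapping[str, float],
                          weights: Mapping[str, float]) -> float:
    """Weighted mean; weights do not need to sum to 1."""
    total_weight = sum(weights[name] for name in components)
    weighted_sum = sum(components[name] * weights[name] for name in components)
    return weighted_sum / total_weight


components = {"API": 99.98, "Web": 99.97, "Worker": 99.95}

# Hypothetical weights, e.g. based on each tier's share of user traffic.
weights = {"API": 0.5, "Web": 0.3, "Worker": 0.2}

print(f"Average:  {average_availability(components):.2f}%")
print(f"Weighted: {weighted_availability(components, weights):.3f}%")
```

With these weights the API's higher availability pulls the weighted score slightly above the simple average.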
Summary
Availability is an essential KPI to measure if you operate software systems. It represents whether or not your system was available to its users and is represented as a percentage over time. It is calculated using metrics and service performance goals that define precisely what being “available to customers” means.
If you operate software systems, availability should be one of the KPIs you review every week.
Thanks for reading.