Balancing Speed And Safety In Software Engineering
How to achieve speed and safety in software engineering.
Events like the recent CrowdStrike outage highlight the risks of prioritising speed over safety, which can lead to significant business impact and a loss of user trust in software organisations.
Most software engineering teams face the pressure to deliver quickly while ensuring system stability and reliability. As an engineering manager you’re responsible for delivering value and meeting deadlines, but also preventing downtime and customer impact.
How can teams balance the need for speed while maintaining safety in software development and operations?
Let’s explore some practical ideas.
The Importance Of Speed And Safety
For the purposes of this article, I’ll define speed as the rate at which value is delivered to customers and safety as avoiding impact to customers.
When it comes to software, businesses want both speed and safety.
Faster delivery means more value to customers. It allows companies to adapt to market changes and accelerate feedback loops. But, moving too fast can introduce risks. Ensuring system stability, reducing downtime, and protecting security are essential to maintain customer trust and operational efficiency.
Downtime has a significant impact on businesses: it carries a direct financial cost, and it erodes customer trust.
Meta lost roughly $100 million during a two-hour glitch that took down Facebook, Instagram and Messenger.
Parametrix said the global IT outage linked to CrowdStrike will likely cost the Fortune 500, excluding Microsoft, at least $5.4 billion in direct financial losses.
Service downtime and degradation costs $400B annually.
So, what practical solutions are there to enable speed while increasing safety?
The Tradeoff
Going too fast can introduce risk due to rushing out changes that haven’t been well planned. But equally going too slow introduces risk due to larger batch sizes. If you haven’t deployed in a month, it’s likely your releases will be large and riskier than if you were to do smaller changes more often.
Everything is a tradeoff.
Techniques To Increase Safety
1. Segment Changes Based On Risk
One of the most surprising bits of guidance I discovered when moving into the cloud infrastructure space was to roll out certain types of changes as slowly as possible (DNS, networking). It took me a while to understand this advice since it was counter to all standard industry advice (to deploy fast and often). The truth is that some domains and changes are riskier than others. For example, a cloud provider making changes to networking configuration could take down an entire cloud region, impacting hundreds of thousands of customers. Going fast in these cases is not the right approach.
It was a helpful reminder to think critically about the type of changes you’re making when considering how quickly to roll them out. It’s important not to blindly apply generic industry advice without considering your unique circumstances.
Changes to database schemas, updates to networking configuration, destructive infrastructure changes, and DNS changes are categories of change that may be “high risk.” These should be done carefully and cautiously and may benefit from more upfront planning. On the other hand, simple changes to a website are low risk, easy to rollback, and could be done more quickly.
Not all changes are equal in terms of risk; therefore, you should not apply a “one size fits all” policy to software releases. You can categorise different types of change and use different release processes for each.
For example, you might opt for continuous, fast deployment of application changes but more diligent and planned releases (possibly with approval gating) for infrastructure or database changes.
High-Risk Changes: Slower, more careful deployment might be necessary for critical infrastructure (e.g., database, DNS, or network changes). You may want approvals or planning for these changes.
Low-Risk Changes: Application changes that are easy to roll back can be deployed faster. You might choose to deploy these continuously without approvals or planning.
Your classifications will vary based on your industry or team.
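As a sketch, this classification can be encoded in a small lookup that defaults to caution for unknown change types. The categories and process names here are illustrative, not prescriptive:

```python
from enum import Enum

class Risk(Enum):
    LOW = "low"
    HIGH = "high"

# Hypothetical mapping of change categories to risk; adjust for your team.
CHANGE_RISK = {
    "application": Risk.LOW,
    "database_schema": Risk.HIGH,
    "dns": Risk.HIGH,
    "networking": Risk.HIGH,
}

def release_process(change_type: str) -> str:
    """Pick a release process based on the change's risk category."""
    # Unknown change types default to HIGH: err on the side of caution.
    risk = CHANGE_RISK.get(change_type, Risk.HIGH)
    if risk is Risk.HIGH:
        return "staged rollout with approval gate"
    return "continuous deployment"
```

Note the default: a change type nobody has classified yet gets the careful path, not the fast one.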
2. Blue-Green Deployments
A blue-green deployment is a strategy where two identical environments, named "blue" and "green," are used to minimise downtime during updates. The blue environment is the live production system, while the green environment receives new code updates. After testing the green environment, traffic is gradually shifted from blue to green. If issues arise, traffic can be reverted to the blue version almost instantly, ensuring zero downtime.
Blue-green deployments make rollback easier, reduce risk, and allow for smoother production updates. This approach requires upfront work to build the infrastructure and tooling to support blue-green releases, but it can be very valuable over the longer term.
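A minimal sketch of the cutover logic, assuming hypothetical `deploy`, `healthy`, and `set_live` hooks that wrap your own deployment tooling, health checks, and load balancer:

```python
def blue_green_deploy(deploy, healthy, set_live):
    """Deploy to the idle 'green' environment, verify it, then cut over.

    deploy(env)   -> release new code to the named environment
    healthy(env)  -> True if the environment passes health checks
    set_live(env) -> point production traffic at the named environment
    Returns the environment that ends up serving traffic.
    """
    deploy("green")
    if healthy("green"):
        set_live("green")   # cut over; "blue" stays warm for instant rollback
        return "green"
    set_live("blue")        # revert traffic to the known-good environment
    return "blue"
```

In a real system the health check would run against the green environment before any traffic shift, and the shift itself is often gradual rather than all-at-once.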
3. Canary Releases And Baking
Sometimes automated testing isn’t enough to catch problems that manifest more slowly (memory leaks, performance problems). A simple way to gain more confidence in a release is to deploy to a smaller subset of customers. You might also choose to wait 24 hours (baking) before deploying more widely.
Canary testing releases new code or features to a subset of users, letting you verify there are no issues before releasing to a wider audience. By limiting the audience of a new change, you limit the potential blast radius if something goes wrong.
You can use feature flags to make features available to a percentage of users.
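One common way to implement percentage rollouts is to hash a stable user identifier into a bucket, so each user gets a consistent yes/no decision for a given feature. A minimal sketch (the function name is illustrative, not from any particular feature-flag library):

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically bucket a user into a percentage rollout.

    Hashing (feature, user_id) gives each user a stable bucket in 0-99,
    so the same user always gets the same decision for a given feature,
    while different features roll out to different user subsets.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent
```

Raising `percent` from, say, 5 to 50 to 100 widens the audience without ever flipping a user who already has the feature back off.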
4. Test Rollbacks Frequently
It’s easy to get complacent about your operations. If you’re not regularly testing your rollback procedures, there’s a high probability that your on-call engineers will struggle under pressure to do a safe rollback.
Practice rollbacks periodically and ensure everyone knows how to do them when needed. Ideally, rollbacks should be as simple as clicking a button. If that’s not the case, ensure that the steps to roll back a release safely are well documented.
When making a change ask yourself: what is the rollback plan if things go wrong?
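As an illustration of “rollback should be as simple as clicking a button,” a deployment wrapper can track version history so reverting is a single call. A toy sketch, not production tooling:

```python
class Releases:
    """Toy release tracker: deploy appends, rollback reverts one step."""

    def __init__(self):
        self.history = []  # ordered list of deployed versions

    def deploy(self, version: str) -> None:
        self.history.append(version)

    def current(self) -> str:
        return self.history[-1]

    def rollback(self) -> str:
        """Revert to the previous version; fail loudly if there is none."""
        if len(self.history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self.history.pop()
        return self.history[-1]
```

The useful property to copy is that rollback needs no knowledge beyond what the tooling already recorded at deploy time.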
5. Kill Switches And Operational Levers
Operational "levers" are a way to manage and adjust your system's performance, security, and operational efficiency in real time in a production environment. These levers enable you to fine-tune how your software and systems behave without redeploying the application or modifying its code directly.
Operational levers and kill switches let you react quickly to problems in production without changing configuration for every customer. For example, if one customer’s application instance is suffering performance problems, you might adjust operational levers to allocate more resources to that one instance, rather than make a global update, which may take longer and carries more risk.
Two common types of operational levers are:
Feature Levers (Feature Flags): These are boolean values that toggle specific features of an application on or off without changing the code. Feature toggles can be used for canary releases, A/B testing, or disabling features that are causing issues.
Configuration Levers: These allow you to change the configuration of applications and services without altering the code, such as adjusting timeout settings, cache sizes, or enabling/disabling logging levels.
When designing features, think about the possible operational levers you could add to make managing the feature in production easier.
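A configuration lever can be as simple as settings read at request time from a mutable store, so operators can change behaviour without a redeploy. A toy sketch, with an in-memory dict standing in for a real config service:

```python
# In-memory stand-in for a dynamic config service; real systems would
# back this with something operators can update at runtime.
LEVERS = {
    "request_timeout_seconds": 30,
    "verbose_logging": False,
}

def get_lever(name, default=None):
    return LEVERS.get(name, default)

def set_lever(name, value):
    """Operator-facing update; takes effect on the next read."""
    LEVERS[name] = value

def handle_request() -> str:
    # Read the lever on every request, not at startup, so changes
    # apply immediately without a restart or redeploy.
    timeout = get_lever("request_timeout_seconds", 30)
    return f"handling with timeout={timeout}s"
```

The key design choice is reading the lever per request rather than caching it at startup; otherwise the lever only takes effect after a restart, defeating its purpose.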
How To Measure Speed And Safety
To think objectively about speed and safety we need a way to measure them.
There are standard metrics that you can use to measure both speed and safety in engineering teams: deployment frequency, lead time, change failure rate, and time to restore service.
Deployment Frequency (Speed): Measures how often new code is deployed. Balancing frequent releases with maintaining safety is a key challenge.
Change Lead Time (Speed): The time from when work on a change begins to when it reaches production. It is most commonly measured from the first contribution for a task to its deployment.
Change Failure Rate (Safety): The percentage of deployments that result in failures or require rollback. If changes frequently result in impact it could be a sign you need better testing practices in place.
Time to Restore Service (Safety): How quickly the team can restore service after an incident; a direct measure of how fast a team recovers from failure.
Number of operational incidents (Safety): The number of customer impacting incidents in a period.
Always Communicate Risk
It’s not uncommon to feel pressure from the business to move quickly and to meet deadlines. If you feel as a manager that moving forward too quickly presents a risk, it’s important to clearly communicate that upwards.
Your stakeholders will always want you to go faster. It’s your job as a manager to push back when necessary and communicate why.
Summary
Software teams want both speed and safety.
You can achieve speed and safety by:
Applying stricter controls to high-risk changes with a larger potential blast radius.
Using blue-green deployments for incremental rollout and fast rollback.
Using feature flags and baking to test new features on a subset of customers first.
Testing rollbacks and operational practices often.
Making use of operational levers to react to problems in production faster.
You can use standard metrics to measure your ability to deploy software fast and safely.
Thanks for reading.
Get In Touch
I’d love to connect on LinkedIn.
PS: Join an exclusive community of ambitious software engineers, engineering managers, and execs who want to grow and learn together. This community is currently free but will move to a paid model soon. Join today and lock in a free membership.