The following is a rough draft of notes I have put together while researching SRE.
Edge cases. Not everything is linear. The impact of outages is not constant over time. The desired level of reliability for a service can change.
It’s Black Friday this week. Traffic will increase 4x and the adverse publicity in case of errors will be bigger. We have a 99.9% service but for this we want something close to 99.99%. Let’s over-provision resources, implement change freezes and use war rooms during this time.
Customers react differently to outages. Four one-hour outages, one four-hour outage and a constant 0.5% error rate can impact your error budget in roughly the same way, but customers will prefer one over another.
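The comparison above can be checked with some quick arithmetic. This is a minimal sketch, assuming a 30-day measurement window and counting a constant partial error rate as the equivalent number of fully bad minutes; the function names are hypothetical.

```python
# Assumption: the error budget is tracked over a 30-day window.
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes


def bad_minutes_from_outages(count: int, hours_each: float) -> float:
    """Full outages: every minute of each outage counts as a bad minute."""
    return count * hours_each * 60


def bad_minutes_from_error_rate(error_rate: float) -> float:
    """Constant partial failure, expressed as equivalent fully bad minutes."""
    return error_rate * WINDOW_MINUTES


patterns = {
    "4 x one-hour outages": bad_minutes_from_outages(4, 1),
    "1 x four-hour outage": bad_minutes_from_outages(1, 4),
    "0.5% constant errors": bad_minutes_from_error_rate(0.005),
}

for name, minutes in patterns.items():
    share = minutes / WINDOW_MINUTES
    print(f"{name}: {minutes:.0f} bad minutes ({share:.2%} of the window)")
```

Under these assumptions the three patterns land at 240, 240 and about 216 bad minutes, so the budget sees them as nearly identical even though users experience them very differently.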
Most of the users are fine with the current latency but some advanced users are complaining about speed.
Users are not equal. If your users are robots that scrape content from your service they might care less about reliability than if they were human users.
“100% is the wrong reliability target for basically everything.”
Making reliable systems more reliable costs money. Sometimes a minor increase in reliability can be more expensive than the value it brings to the system.
Making the system more reliable than it needs to be (happiness test) will result in users depending on it. Making the system too reliable will reduce the velocity of new features.
Things that usually trigger outages: pushing new configuration, new binaries.
Iteration of SLOs. You will probably need to adapt your original SLOs:
- This SLO is not a good fit anymore after 12 months.
- This SLO is not covering new features.
- This SLO is not covering the new company risk-reward profile.
Regular SLO Reviews. Every 3-12 months.
Error budgets. The complement of the availability target (100% minus the SLO). It helps you keep track of how much headroom you have left before violating your SLOs.
99.9% success = 0.1% failure.
What is in the 0.1%: bad pushes by the product teams, planned maintenance, hardware failures, etc.
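To make the 99.9% / 0.1% relationship concrete, here is a minimal sketch that turns an SLO into an error budget and tracks how much of it is left. The 30-day window, the request counts and all function names are assumptions for illustration.

```python
def error_budget_fraction(slo: float) -> float:
    """Fraction of events allowed to fail, e.g. 0.999 -> about 0.001."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return 1.0 - slo


def allowed_bad_minutes(slo: float, window_days: int = 30) -> float:
    """Error budget expressed as minutes of full outage in the window."""
    return error_budget_fraction(slo) * window_days * 24 * 60


def remaining_budget(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the budget still unspent; negative means the SLO is violated."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = error_budget_fraction(slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


print(f"99.9%  -> {allowed_bad_minutes(0.999):.2f} bad minutes per 30 days")
print(f"99.99% -> {allowed_bad_minutes(0.9999):.2f} bad minutes per 30 days")
print(f"remaining: {remaining_budget(0.999, 1_000_000, 250):.0%}")
```

With one million requests at 99.9%, roughly 1,000 failures are allowed, so 250 failures leave about three quarters of the budget.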
It works as a common incentive for developers and reliability engineers to balance innovation and reliability.
If the service is in SLOs, developers can take risks and push new features quickly. If it’s not, they need to be more conservative.
If the service is in SLOs, the SRE team can work on increasing the reliability.
Yes, we can increase reliability but it will cost 2x in cloud costs to have regional backups.
Yes, we can push faster but we will need better integration tests and automated canary analysis and rollback to keep the error budget burn within the SLO.
The SRE team must have and use authority to halt feature launches when there is no remaining error budget.
Effective SLOs need to:
- Have executive buy-in;
- Have consequences;
- Be measured with accuracy.
Error budget techniques
- Dynamic release cadence based on the remaining error budget;
- “Rainy day” fund to cover unexpected events;
- Error budget-based alerts: “The recent errors are greater than 3% of the monthly error budget”;
- Silver bullets.
On silver bullets: a very senior stakeholder holds a small number of tokens. If developers want to release a new feature and the error budget is blown, they need to present a case to the SRE team and give them one token to enable the release. This is generally regarded as a failure and should trigger a postmortem or other retrospective.
Excessive helpfulness is harmful. Of course you can always be flexible but if you are making too many exceptions, it’s a sign that something is not working.
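Two of the techniques listed above, the error budget-based alert and the dynamic release cadence, can be sketched in a few lines. The 3% threshold comes from the example alert in the notes; the cadence tiers, the use of expected monthly request volume and all names are hypothetical choices, not a recommended policy.

```python
def budget_consumed(recent_errors: int, monthly_requests: int, slo: float) -> float:
    """Share of the monthly error budget used up by the recent errors."""
    monthly_budget = (1.0 - slo) * monthly_requests  # allowed failures per month
    return recent_errors / monthly_budget


def should_alert(recent_errors: int, monthly_requests: int, slo: float,
                 threshold: float = 0.03) -> bool:
    """Alert when recent errors exceed `threshold` of the monthly budget."""
    return budget_consumed(recent_errors, monthly_requests, slo) > threshold


def release_cadence(remaining_budget: float) -> str:
    """Map the remaining budget (1.0 = untouched) to a cadence (assumed tiers)."""
    if remaining_budget <= 0.0:
        return "halted"   # budget is blown: no feature launches
    if remaining_budget < 0.25:
        return "weekly"   # little headroom left: slow down
    return "daily"        # plenty of headroom: ship quickly


print(should_alert(recent_errors=40, monthly_requests=1_000_000, slo=0.999))
print(release_cadence(0.10))
```

Here 40 errors against a budget of about 1,000 is roughly 4% of the monthly budget, which is above the 3% threshold.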
Ways to increase reliability:
- Implement automated canary analysis;
- Improve monitoring;
- Build automation to reduce or eliminate toil;
- Roll out changes gradually so only a small group of users is impacted by failures. The new feature will first be released to 0.1% of users, then 1%, then 10% and so on;
- Deploy the service in more than one region;
- Set up automated alerts that page a human (vs. relying on people to notice abnormalities on graphs);
- Develop a playbook or make it easier to parse and collate the server’s debug logs;
- Automate tasks such as draining a zone and redirecting traffic as you investigate;
- Engineer the service to run in a degraded mode during a failure. For example, during an outage only read-only operations are allowed and all writes are rejected.
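The gradual rollout item in the list above (0.1%, then 1%, then 10%) can be sketched as a deterministic percentage gate. This is only an illustration: the hashing scheme, the bucket count, the final 100% stage and every name here are assumptions.

```python
import hashlib

# Stages from the notes, plus an assumed final stage of 100%.
ROLLOUT_STAGES = [0.1, 1.0, 10.0, 100.0]  # percent of users

BUCKETS = 100_000  # each bucket represents 0.001% of users


def user_bucket(user_id: str, feature: str) -> int:
    """Stable bucket in [0, BUCKETS) derived from the feature and user id."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def is_enabled(user_id: str, feature: str, percent: float) -> bool:
    """True if this user falls inside the first `percent` of buckets."""
    return user_bucket(user_id, feature) < round(percent * BUCKETS / 100)


users = [f"user-{i}" for i in range(10_000)]
for stage in ROLLOUT_STAGES:
    enabled = sum(is_enabled(u, "new-checkout", stage) for u in users)
    print(f"{stage:>5}% stage: {enabled} of {len(users)} users enabled")
```

Because a user's bucket never changes, anyone included at 0.1% stays included at 1% and 10%, so a failure caught at an early stage has only touched that small group.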
Measuring the time between failures.
TTD: Time To Detect [a failure]. / TTR: Time To Resolution / TTF: Time to Failure.
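Expected impact of failures ≈ ((TTD + TTR) * % of impact) / TTF
TBF: Time Between Failures, measured from one failure to the next. The expected impact can be reduced by reducing the TTD (mechanisms to catch outages faster, such as automated alerts that page a human), reducing the TTR (playbooks, automated mitigation), reducing the % of impact, or increasing the TTF.
Reducing the expected impact means the reliability of the system has increased.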
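The expression ((TTD + TTR) * % of impact) / TTF is a dimensionless ratio, so one way to read it is as the approximate fraction of time during which users are affected. A minimal sketch under that reading, with hypothetical names and made-up numbers:

```python
def expected_bad_fraction(ttd_min: float, ttr_min: float,
                          impact: float, ttf_min: float) -> float:
    """(TTD + TTR) * impact / TTF, with times in minutes and impact in [0, 1]."""
    if ttf_min <= 0:
        raise ValueError("ttf_min must be positive")
    return (ttd_min + ttr_min) * impact / ttf_min


THIRTY_DAYS = 30 * 24 * 60  # assumed time to failure, in minutes

baseline = expected_bad_fraction(ttd_min=30, ttr_min=90, impact=0.5, ttf_min=THIRTY_DAYS)
faster_detection = expected_bad_fraction(ttd_min=5, ttr_min=90, impact=0.5, ttf_min=THIRTY_DAYS)
smaller_blast_radius = expected_bad_fraction(ttd_min=30, ttr_min=90, impact=0.1, ttf_min=THIRTY_DAYS)

print(f"baseline:             {baseline:.4%}")
print(f"faster detection:     {faster_detection:.4%}")
print(f"smaller blast radius: {smaller_blast_radius:.4%}")
```

In this example, cutting detection time and limiting the share of affected users both lower the ratio, which matches the reliability improvements listed earlier.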
More ways to improve reliability:
- Report worst customers or regions (or another category) to find cases where the error budget is not evenly distributed. Let’s focus the effort on this region;
- Standardize infrastructure to avoid the “But it works on my machine” problem;
- Safe release and rollback systems. Bad pushes will always happen, reverting them should be simple and safe;
- Postmortems to highlight why the SLO was violated, what the bug was and which actions were created to fix it;
- Canarying changes and releasing to production gradually. If we catch a failure, only a small percentage of the service will have been impacted.
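The first item in the list above, reporting the worst regions, can be sketched as a small report that ranks regions by error rate and shows each region's share of all failures. The region names and request counts are made-up sample data, and the function name is hypothetical.

```python
def worst_regions(stats: dict[str, tuple[int, int]], slo: float):
    """Rank regions by error rate.

    `stats` maps region -> (total_requests, failed_requests).
    Returns rows of (region, error_rate, share_of_all_failures, over_budget).
    """
    budget = 1.0 - slo
    total_failed = sum(failed for _, failed in stats.values())
    rows = []
    for region, (total, failed) in stats.items():
        error_rate = failed / total if total else 0.0
        share = failed / total_failed if total_failed else 0.0
        rows.append((region, error_rate, share, error_rate > budget))
    # Highest error rate first.
    return sorted(rows, key=lambda row: row[1], reverse=True)


sample = {  # made-up numbers for illustration
    "region-a": (100_000, 50),
    "region-b": (20_000, 200),
    "region-c": (80_000, 40),
}

for region, rate, share, over in worst_regions(sample, slo=0.999):
    flag = "OVER BUDGET" if over else "ok"
    print(f"{region}: error rate {rate:.2%}, {share:.0%} of all failures, {flag}")
```

In this sample the smallest region produces most of the failures, which is the kind of uneven distribution that tells you where to focus the effort.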