The following is a rough draft of notes I have put together while researching SRE.
SRE vs. DevOps. Both have the same purpose and do not compete. “If DevOps is the philosophy, SRE is a way to accomplish that philosophy”.
In the past: a big barrier between developers and operators (the people responsible for reliability). Developers focused on agility (pushing features), operators on stability (keeping things running and moving slowly).
DevOps as 5 different goals:
- Reduce organization silos: breaking barriers between teams to increase collaboration and throughput
- Accept failure as normal: the more humans you introduce into a system, the more imperfection they bring
- Implement gradual change: smaller changes are easier to review and to roll back (in case a bug is introduced)
- Leverage tooling & automation
- Measure everything: measuring is the only way to know if new strategies are successful or not
Applying SRE to each goal:
- Share ownership of the production environment with developers. Unified tooling to make sure everyone has the same view and approach when working with production
- Blameless post mortems. Error budget - how much the system is allowed to fail.
- Canary releases to reduce the cost and impact of failure
- Eliminate manual work. Automate things.
- Measure the amount of toil and the reliability of the systems
Customer Reliability Engineering (CRE)
Problem: a service behaves according to its SLOs and error budgets but still breaks customer expectations.
Problem: the provider doesn’t communicate how the system is designed, and the customer doesn’t say what they expect. The frustration can spread across millions of users.
Solution: communicate clearly to your customers how your service is intended to behave (expose your SLOs to them).
Role of a Google Cloud CRE: reach out to customers who are building their services on top of GCP and help them build their own SLOs. Help make failure acceptable and agree on an acceptable level of reliability.
3 principles of CRE
- Reliability is the most important feature of a system. If the system is unusable, then the customers cannot take advantage of the service you provide. The system should meet the expectations of its users;
- It’s users who decide reliability, not monitoring. If your users perceive your service to be unreliable, then it is not meeting their expectations, no matter what your logs and metrics say;
- Pursuit of ever-increasing reliability. 99.9% = following best practices for reliability. 99.99% = dedicated operations team. 99.999% = can require sacrificing aspects of the system like flexibility and release velocity. Each additional 9 makes the system 10x more reliable but can cost the business 10x more.
28-day error budget
- 99.9% = ~40 minutes of outage. Can be enough time to alert someone and for that person to fix the root cause;
- 99.99% = 4 minutes. Not enough time for a human to intervene. System must be able to detect and self-heal outages;
- 99.999% = 24 seconds. Very complex and costly to achieve.
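The error-budget arithmetic above can be sketched as follows (the function name is mine, for illustration):

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 28) -> float:
    """Error budget for a given SLO target, expressed as minutes of full outage."""
    window_minutes = window_days * 24 * 60  # 28 days = 40,320 minutes
    return window_minutes * (1 - slo_target)

# Each additional nine shrinks the budget by 10x.
print(round(allowed_downtime_minutes(0.999), 1))       # ~40 minutes
print(round(allowed_downtime_minutes(0.9999), 1))      # ~4 minutes
print(round(allowed_downtime_minutes(0.99999) * 60))   # ~24 seconds
```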
Problem: if reliability is a feature, how do you prioritize it over other features?
Solution: Service Level Objectives (SLOs) provide a common language and understanding around reliability using concrete data.
“It’s possible to build a super reliable system that has no features and never changes, but it’s hard to make any money by doing that.”
Problem: system reliability usually impacts developer velocity. How do you balance the risk to reliability vs. building new features for that system?
If you are burning most of your error budget trying to ship new features fast, you need to lift your foot off the accelerator.
“What is the right level of reliability for the system you support?”. Not discussing or answering this question results in firefighting, repetitive maintenance, pager fatigue, etc.
For SLOs to work, all parts of the business must agree that they are an accurate measure of user experience and that they can drive decisions.
If you miss your SLOs, there must be well-documented consequences. Engineering effort must be redirected into making reliability improvements.
Service Level Objectives (SLOs). A tool to help strike the balance between releasing new features and reliability. They help teams communicate expectations through data.
Three principles to establish SLOs:
- What do you want to promise and to whom;
- What metrics to measure;
- How much reliability is good enough.
Service Level Agreements (SLAs). Agreements that you make with your customer about reliability of your system.
There must be consequences if you violate them (otherwise there is no point in making them). Give the customer partial refunds or credits if you violate the SLAs.
Issues must be caught before they breach your SLAs so that you have time to fix them (otherwise 💸). These earlier thresholds are your SLOs. SLOs must be stricter than SLAs: a breached SLO is an early warning before the problem starts costing money and affecting users.
- SLAs: Promise with monetary consequences;
- SLOs: Internal promise and agreement on expectations.
SLA: All HTTP requests will respond within 300ms. / SLO: All HTTP requests will respond within 200ms.
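The SLO-stricter-than-SLA relationship can be sketched as a simple classifier (thresholds taken from the example above; the function name is mine):

```python
SLO_MS = 200  # internal target, stricter
SLA_MS = 300  # external promise with monetary consequences

def latency_status(p99_latency_ms: float) -> str:
    """Classify a measured latency against the internal SLO and external SLA."""
    if p99_latency_ms <= SLO_MS:
        return "ok"
    if p99_latency_ms <= SLA_MS:
        return "slo-breach"  # warning zone: fix before it costs money
    return "sla-breach"      # refunds/credits owed

print(latency_status(180))  # ok
print(latency_status(250))  # slo-breach
print(latency_status(320))  # sla-breach
```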
Happiness test. A rule of thumb to help set SLO targets: if your service is performing exactly at its target SLOs, your average user would be happy.
Users perceive a service to be unreliable when it fails to meet their expectations. Example: I clicked on a movie on Netflix and it played a different title.
Challenge: how do you measure and quantify happiness of customers?
Problem: a single user can absorb the entire error budget (every error hits them), leaving that user unhappy even though the aggregate SLO is met.
How do you consider Netflix to be “working” or “good enough”?
1. Time it takes to play a title after you select it.
2. No interruptions or issues with playback.
For 1., a possible metric is request latency: the time it takes for a request to return a response or for playback to start.
For 2., possible metrics are the ratio of errors (or successes) to the total number of requests, or the throughput (amount of data transmitted per second).
These metrics are called Service Level Indicators (SLIs).
SLIs are quantitative measurement of user experience.
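The two SLIs above can be sketched as ratios over request counters (all counter values here are hypothetical):

```python
# A hypothetical snapshot of request counters for a streaming service.
total_requests = 1_000_000
successful_requests = 998_500
requests_within_200ms = 985_000

# Availability SLI: fraction of requests that succeeded.
availability_sli = successful_requests / total_requests
# Latency SLI: fraction of requests served fast enough.
latency_sli = requests_within_200ms / total_requests

print(f"availability SLI: {availability_sli:.2%}")  # 99.85%
print(f"latency SLI: {latency_sli:.2%}")            # 98.50%
```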
Challenge: there are trade-offs when measuring. You may find the perfect data for your SLI, yet writing the code to collect that data is too complex to implement.
Missing the target SLO. Target SLO: 99% of requests served within 300ms over the last 4 weeks. You measure the SLI and see that only 95% of requests were served within 300ms over that window.
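Using the numbers from the example above, the size of the miss can be quantified by comparing the bad-request fraction against the allowance the SLO grants:

```python
SLO_TARGET = 0.99    # 99% of requests within 300ms over 4 weeks
measured_sli = 0.95  # only 95% were, per the example above

# The SLO allows 1% slow requests; 5% were actually slow.
allowed_bad = 1 - SLO_TARGET
actual_bad = 1 - measured_sli
budget_burn = actual_bad / allowed_bad

print(f"SLO met: {measured_sli >= SLO_TARGET}")                    # False
print(f"error budget consumed: {budget_burn:.0f}x the allowance")  # 5x
```

Burning 5x the budget is a clear signal to redirect engineering effort from features to reliability work.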