Notes on Site Reliability Engineering. Leave a 🌟 if you found this useful!
This is an ongoing list of notes on SRE I've learned over the past couple of years.
I pay careful attention to metrics and the math behind them 👨‍🔬
How Complex Systems Fail (Chicago)
Linux Command(s) (First-Principles)
USENIX Short Topics on SysAdmin (Interview Prep)
IIT Madras - Systems Engineering Course
SLO (Service Level Objective) - A quantitative target (a time or a quantity of actions) that the service must meet; missing it puts you into SLA territory (repercussions). SLOs are internal thresholds set to alert before an SLA violation, so they are quantitatively stricter than the SLA. A service can have multiple SLOs.
Example: HTTP (SLO) 200ms. If a request takes longer than 200ms you will enter SLA (usually financial repercussions). An SRE needs to be able to anticipate (ideally) or remedy (more common) a failed SLO.
SLA (Service Level Agreement) - Essentially the consequences of a failed SLO. Usually comes in the form of direct or indirect monetary compensation.
Example: GCP breaks their HTTP SLO. GCP reimburses the company with $100 in cloud credits.
The Happiness Test - The minimum threshold to ensure that customers are happy.
Example: Netflix - Playback latency (HTTP). Packet loss in the middle of a video.
SLI (Service Level Indicators) - The metrics you define to quantitatively measure your system performance.
Example: Error Rate (Network Health) - (failed requests / total requests) * 100
Example: Error Rate (Network Health) - (failed requests / throughput) * 100
Measuring Reliability (Edge Case) - Not every organization and/or system is linear. There are cases when you will need to give a customer exponentially better service than the standard service you normally offer.
Example: Black Friday - It is expected that Company Y will have an N% increase in their website (read: client) and thus will require X% increase in the “triangle of success”.
Reliability - % of time that the system functions properly for the user. Availability - % of time that the system is up and running. Scalability - # of users that the system can serve reliably.
Never Want 100% - The marginal cost of making an already reliable system more reliable often exceeds the value of delivering this to the customers.
Marginal cost in this case is how much it would cost (engineer time, compute cost, etc.) to make a proposed change.
Value to customers in this case could be thought of as the probability that new customers use the service due to the proposed change and/or the probability that you will lose a customer.
Measure your SLO achieved and be above the target.
❓: What do the users need and how does the system currently perform?
Measure how SLI is performing against the target.
❓: Will increasing the service availability result in positive externalities or negative externalities to the business function?
Note: If you make your service more reliable than an individual's ISP, your customer is going to blame the ISP, not you.
Error Budget = 1 - SLO
Allowed Downtime = (1 - SLO) * 28 (days) * 24 (hours/day) * 60 (minutes/hour)
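A minimal sketch of the arithmetic, assuming the SLO is expressed as a fraction (0.999 for 99.9%) and that allowed downtime is the error budget multiplied by the length of a 28-day window. The function names are illustrative only.

```python
MINUTES_PER_WINDOW = 28 * 24 * 60  # 28-day window expressed in minutes


def error_budget(slo: float) -> float:
    """Fraction of the window that is allowed to be bad, e.g. ~0.001 for a 99.9% SLO."""
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_minutes: int = MINUTES_PER_WINDOW) -> float:
    """Error budget converted into minutes of downtime for the window."""
    return error_budget(slo) * window_minutes


for slo in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(slo)
    print(f"SLO {slo:.2%}: {minutes:.2f} minutes of downtime per 28 days")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is one way to see why the last fraction of a percent gets so expensive.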
⚠️ The single largest source of outages is change to a system. New features = lower service availability.
Note: the relationship between new features and lowered service availability is non-linear.
Example: To improve reliability of a new feature incorporated into a system you could find that it will cost 10x the previous amount to ensure that the new system is reliable.
Advanced Error Budget Topics:
“Dynamic release cadence”
“Rainy day fund”
“Budget based alerts”
“Silver Bullets”
⚠️ Silver Bullets are treated as a failure and would require a postmortem. ⚠️
How to make devs happy?
How to reduce scale of failure amongst users?
TTD - Time to detect an issue in a system.
TTR - Time to resolve the issue in the system.
TTF - Time elapsed between failures.
Error Impact (TBF) = (TTD + TTR) * impact (%) / TTF
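A rough sketch of the error impact formula, assuming all durations are in minutes and impact is a fraction (0.5 for 50% of users). The incident numbers are made up for illustration.

```python
def error_impact(ttd_minutes: float, ttr_minutes: float, impact: float, ttf_minutes: float) -> float:
    """Expected fraction of bad service: (TTD + TTR) * impact / TTF."""
    return (ttd_minutes + ttr_minutes) * impact / ttf_minutes


ttf = 30 * 24 * 60  # hypothetical: one failure every 30 days

baseline = error_impact(ttd_minutes=5, ttr_minutes=55, impact=0.5, ttf_minutes=ttf)
faster_detection = error_impact(ttd_minutes=1, ttr_minutes=55, impact=0.5, ttf_minutes=ttf)

print(f"baseline impact:         {baseline:.4%}")
print(f"with faster detection:   {faster_detection:.4%}")
```

The result can be compared directly against the error budget (1 - SLO): shrinking TTD, TTR or impact, or stretching TTF, all lower the expected spend.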
How to improve reliability?
How to improve TTD?
How to improve TTR?
Examples: develop a playbook, increased data parsing and log analysis. Take a failed zone offline and redirect traffic to an available zone while the affected zone is getting repaired.
How to improve impact % ?
How to increase TTF?
Example: re-routing traffic from a failed region over to a region that is healthy.
Periodically report the worst customers, worst region, uneven error budget distribution. Focus extra hard on those regions.
Standardize infrastructure.
Consult SWE on system design.
Rollback speed.
Phased rollouts.
How to measure the happiness?
We define an SLI and measure how it changes over time.
We want an SLI that has a linear relationship (predictable) with the happiness of the users.
Predictability is very important because you will be making engineering changes based on the data.
Relationship between latency and user happiness is an “S” curve (non-linear).
Example: Website is slow to load or respond to other embedded features. User leaves site. Measure the load speed, then count the users that left the site in this window of time as a ratio of the users that didn’t. You will have a quantified metric of how unhappy the event made users.
Standard (computer) operational metrics: Load average, CPU util, memory usage, bandwidth.
CPU bound = slow service = unhappy user
SLI = good events / valid events
SLI is a measurement of user experience (quantitative)
Service's internal state metrics: thread pool fullness, request queue length, request queue outages
SLI Range: 0%-100%
Benefits: Consistent format
SLI aggregated over a long time period is needed to make a decision on the validity of the metric. Want high signal, low noise.
Processing server side request logs
Service requests.
Application servers have a conflict of interest since they are the ones who create the response data.
Client telemetry is data straight at the source.
Measuring at the client.
SLI: Request/response will tell us availability, latency, and quality of service.
Data Processing: Coverage, correctness, freshness, throughput.
Storage: Measure the durability of the storage layer.
HTTP(S): Parameters include host name, requested path to set the scope to a set of tasks or response handlers.
Problem with HTTP Status Code(s):
Data Processing: Selection of inputs to set the scope to some data subset.
Request/Response SLIs: the ratio of successful requests received.
Request/Response Latency: % of requests that are served faster than some threshold.
Which of the requests that are served are valid for the SLI?
How can you tell if the request was served with degrading quality?
Example: Availability of a VM - proportion of minutes that it was booted and available via SSH.
Note:
How accurate is the correlation between latency and user experience?
Probably pretty high. A system can be optimized for this if we find a strong correlation. Example - prefetching, caching. This would increase the SLO. Since the relationship is “S” shaped, you want 75-90% of the requests to fall into this area.
Example: When it’s ok to only sample ~75% of the requests?
Latency
Latency Reporting - only report long running applications on their success/failure.
Example:
Threshold (T) = 30 minutes
Reported (R) = 120 minutes
⚠️ 90 minutes of “unknown” failing. You can only make decisions off of data that you measure! ⚠️
Example:
Main topics: Freshness, correctness, coverage, and throughput.
Freshness:
Output freshness decays as a function of time and of new user input data. What mainly diminishes is the utility of the output.
Users expect that those outputs are up to date and are not aware of the degradation over time. Data pipelines have to be constantly rebuilt and checked to ensure they meet the freshness threshold.
Freshness SLI: Ratio of valid data updated more recently than threshold X.
(t=N). Freshness is the time since this completion.
Streaming (continuous processing):
Example: 1/5 streaming shards slow (latency is not within threshold) therefore 20% of data is stale.
Example: Requests read data unevenly. The ratio of unread/total requests = N% that are stale.
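A minimal sketch of the streaming shard example, assuming every shard holds the same amount of data and that a shard counts as fresh when its processing lag is within the threshold. The lag values and the 60-second threshold are hypothetical.

```python
from typing import Sequence


def freshness_sli(shard_lag_seconds: Sequence[float], threshold_seconds: float) -> float:
    """Fraction of shards whose lag is within the freshness threshold."""
    fresh = sum(1 for lag in shard_lag_seconds if lag <= threshold_seconds)
    return fresh / len(shard_lag_seconds)


# Five shards, one of which has fallen far behind (hypothetical lags in seconds).
lags = [12.0, 8.0, 15.0, 310.0, 9.0]
sli = freshness_sli(lags, threshold_seconds=60.0)

print(f"fresh: {sli:.0%}, stale: {1 - sli:.0%}")
```

If requests read shards unevenly, weighting each shard by its read traffic instead of counting shards equally would track what users actually see more closely.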
Correctness: Users will independently verify that your data is correct. Need an SLI for measuring this.
Correctness SLI: Ratio of valid data producing the correct output.
Example:
Coverage SLI: The ratio of valid data that has been successfully processed.
Example:
Input records = 2,147,483,647
Records that returned status “OK” = 2,147,452,310
Coverage = OK/IR = 99.9985%
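The coverage figure can be checked with a few lines, taking coverage as successfully processed records divided by input records. The function name is illustrative.

```python
def coverage_sli(ok_records: int, input_records: int) -> float:
    """Ratio of valid input records that were processed successfully."""
    return ok_records / input_records


input_records = 2_147_483_647
ok_records = 2_147_452_310

coverage = coverage_sli(ok_records, input_records)
print(f"Coverage = {coverage:.4%}")
print(f"Records not processed = {input_records - ok_records:,}")
```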
Throughput SLI:
Throughput = events / time
Throughput thresholds (GB/s):
⚠️ Huge drops in throughput are almost guaranteed to cause angry customers. ⚠️
⚠️ Only need 1-3 SLIs for each part of the user journey. ⚠️
Why?
Example:
Example: App Store. You have 4 main user “journeys”.
These user actions are nothing more than web requests so we have latency and availability SLI for them.
Problem: Variance in request rates can cause an SLI to get “lost” in the noise.
Solution: Assigning a weight to each component SLI based on traffic/importance reduces risk.
Journey | Good | Fast | Total |
---|---|---|---|
home | 9994 | 9866 | 10000 |
search | 9989 | 9729 | 10000 |
category | 9997 | 9913 | 10000 |
open page | 10000 | 9849 | 10000 |
sum | 39980 | 39357 | 40000 |
browse (%) | 99.95 | 98.39 | |
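A small sketch of how the combined "browse" SLIs in the table can be derived, treating the last column as the total number of valid requests per journey (an assumption about the table). Summing event counts before dividing weights each journey by its traffic.

```python
from typing import Dict, Tuple

# journey -> (good responses, fast responses, total valid requests), from the table
journeys: Dict[str, Tuple[int, int, int]] = {
    "home": (9994, 9866, 10000),
    "search": (9989, 9729, 10000),
    "category": (9997, 9913, 10000),
    "open page": (10000, 9849, 10000),
}


def aggregate(rows: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float]:
    """Return (availability, latency) SLIs for the combined journey."""
    good = sum(g for g, _, _ in rows.values())
    fast = sum(f for _, f, _ in rows.values())
    total = sum(t for _, _, t in rows.values())
    return good / total, fast / total


availability, latency = aggregate(journeys)
print(f"browse availability: {availability:.2%}")
print(f"browse latency:      {latency:.2%}")
```

With equal traffic per journey this is a plain average. If one journey matters more than its traffic suggests, an explicit per-journey weight could be multiplied in before summing.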
Problem:
Solution:
Assumptions:
Note:
In distributed systems consistent writes have different latency than reads.
Google has found that users are ok with slower write speeds (such as hitting the “submit” button). BUT, users expect FAST read speeds.
Example thresholds:
Slow target: 50-75% of requests faster than the “annoying” threshold. Long tail: 90% of requests beat the “awful” threshold (1s).
Third-party: Users are generally understanding that third party (payment processor, identity auth) services will be slower. Ok to set reasonable thresholds.
Google’s SLI Philosophy:
Problem:
Notes:
Aspirational Targets:
Achievable Targets:
Figure out SLI types & architect high level spec.
Describe in great detail the events being measured. Where/How the SLI will be measured?
Walk through user journey (trace through architecture) and identify coverage gaps. Document points of failure extensively. High risk failure points indicate a re-work needed.
Set SLO targets. Set measurement windows to gather performance data.
You have a video game that has 50MM 30-day-trial DAU playing.
Average 1-10MM users online at any given time.
1 new world each month added that causes traffic and revenue spikes.
Largest revenue stream = real world ($$) -> game currency ($)
2nd largest revenue stream = PvP battles, mini-games, resource production.
Largest player expense = settlement upgrades, defensive weapons for battles, setting up recruitment for other players to join them, etc.
Mobile client & web UI applications.
Mobile Client
HTTP Requests: JSON-RPC messages over REST HTTP.
Socket: Open web-socket to receive game updates.
Web
Other
Use a sequence diagram to plan out the client <-> server interactions done by the infrastructure.
Client: Requests the profile URL over HTTPS.
Web Server:
Send HTML in response.
Service the user's profile data from the user profile data store.
Render the leaderboard for the player's location using latitude/longitude data.
Build the HTML response for player and send back to client.
CDN: Content delivery network will automatically serve the CSS/JavaScript via HTML response.
Load Balancer: Accepts the incoming request and forwards it to a pool of web servers.
Steps:
How do you want to measure the performance of the service against users' expectations?
The user interaction is a request-response interaction so we want to measure availability.
Measuring Availability:
Filtering:
If we want profile page requests from HTTP we want to scrape the following: /profile/user/*
Further Refining:
GET requests for /profile/user/* that have a Response Code of 200, 300 or 400.
Final SLI Location / Metric:
GET requests for /profile/user or /profile/user/avatar that have 200, 300 or 400 response codes measured at the load balancer.
Latency SLI:
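A minimal sketch of how both the availability SLI and a latency SLI could be computed from load balancer request logs. The `Request` record, the sample data and the 500 ms latency threshold are all hypothetical; the path filter and the "good" status codes follow the notes above.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List, Optional


@dataclass
class Request:
    method: str
    path: str
    status: int
    latency_ms: float


PATH_PATTERN = "/profile/user/*"   # scope of the SLI, from the notes
GOOD_STATUSES = {200, 300, 400}    # status codes counted as good, from the notes
LATENCY_THRESHOLD_MS = 500.0       # hypothetical threshold


def valid_requests(requests: List[Request]) -> List[Request]:
    """Keep only GET requests for profile pages; everything else is out of scope."""
    return [r for r in requests if r.method == "GET" and fnmatchcase(r.path, PATH_PATTERN)]


def availability_sli(requests: List[Request]) -> Optional[float]:
    valid = valid_requests(requests)
    if not valid:
        return None  # no valid events in this window
    return sum(1 for r in valid if r.status in GOOD_STATUSES) / len(valid)


def latency_sli(requests: List[Request]) -> Optional[float]:
    valid = valid_requests(requests)
    if not valid:
        return None
    return sum(1 for r in valid if r.latency_ms <= LATENCY_THRESHOLD_MS) / len(valid)


log = [
    Request("GET", "/profile/user/42", 200, 120.0),
    Request("GET", "/profile/user/42/avatar", 200, 640.0),
    Request("GET", "/profile/user/7", 500, 90.0),
    Request("POST", "/profile/user/7", 200, 80.0),   # not a GET: ignored
    Request("GET", "/leaderboard", 200, 50.0),       # outside the path scope: ignored
    Request("GET", "/profile/user/9", 400, 30.0),
]

print(f"availability: {availability_sli(log):.0%}")
print(f"latency:      {latency_sli(log):.0%}")
```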
Issues:
Reflection Questions:
Do the SLIs capture the entire user journey and failures?
What are the edge cases and exceptions?
Do the SLIs capture all journey permutations?
Resolution:
Inspects the HTTP response body.
Validates load balancer routes.
Tests CDN serving data.
System Analysis:
- Ignore the front end SLIs since we know these have visibility issues.
- Ignore the client side (CDN) since it's only used for ads and analytics.
- Focus on the core backend infrastructure.
Focus:
- Focus on the load in the load balancer and server pools.
Identify elevated latency and 500 errors.
Focus:
- Focus on "bad code" being pushed into production.
This bad code would have to get past (i) the OK response header and (ii) the prober checking the response body.
Focus:
- Focus on a middle ground solution. You don't have to send them a failure code if you can't get the leaderboard to send.
- You can serve the user a partial response in the event of a failure to lookup.
Focus:
- The prober (sitting at the front end) won't catch the case where the wrong user profile is served.
- This is a huge issue and you'd want to measure this.
Focus:
- Find all possible risks.
- Estimate their cost to the error budget.
- If total cost > error budget (SLO targets) need to follow Pareto Principle and solve the most vital ones first.
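One way to sketch this risk analysis: list each risk with an estimated cost in "bad minutes" per 28-day window, compare the total with the error budget, and pick the few risks that account for most of the cost. The risk names, their costs, the 99.9% target and the 80% cut-off are all hypothetical.

```python
from typing import Dict, List

WINDOW_MINUTES = 28 * 24 * 60
SLO_TARGET = 0.999  # hypothetical target

# Hypothetical risks and their estimated cost in bad minutes per window.
risks: Dict[str, float] = {
    "bad config push": 20.0,
    "overload during new world launch": 15.0,
    "wrong profile served": 8.0,
    "CDN outage": 4.0,
    "leaderboard lookup failure": 3.0,
}


def vital_few(costs: Dict[str, float], share: float = 0.8) -> List[str]:
    """Largest risks whose cumulative cost reaches `share` of the total cost."""
    total = sum(costs.values())
    chosen: List[str] = []
    running = 0.0
    for name, cost in sorted(costs.items(), key=lambda item: item[1], reverse=True):
        chosen.append(name)
        running += cost
        if running >= share * total:
            break
    return chosen


budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES
total_cost = sum(risks.values())

print(f"error budget: {budget_minutes:.2f} min, estimated risk cost: {total_cost:.2f} min")
if total_cost > budget_minutes:
    print("over budget, address first:", vital_few(risks))
```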
Everyone in your organization working on the service (product managers, developers, SREs and executives) needs to know the following:
- Where the line is (SLO/SLI)
- What happens if it's crossed (SLA)
- Exceptions to standard measuring procedures.
- Set owners for each SLO.
There should be historical data on previous SLOs and documentation to show why each SLO was changed.
Example: Don't count 503 as errors because load balancer handles them.
The status (development, staging, paging) of the SLO should always be tracked.
Capture all of your metadata in one place, in a version-controlled configuration file, to have a "single source of truth".
Important Metadata Features:
- Measurement Window: Defines the time period of each data "chunk measured" before restart.
- Graph Duration: The time period that an SRE would see on their dashboard GUI.
- Target: The availability target that you want to maintain.
- Owner: The person that owns the product or service. (Product Manager for example)
- Contacts: Top down contacts on the product/service. (Tech Lead -> SRE)
- Status: The status of the service.
- Rationale: What results from the SLI being triggered. (The negative externalities)
- References: Links to any relevant internal notes around this service.
- Changelog: A record of any changes made to the service's SLI/SLO.
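As an illustration, the metadata fields above could be captured in a small typed record and serialized to a file kept under version control. This is only a sketch: the field names mirror the list, while every value (SLO name, owner, contacts, link, dates) is a made-up placeholder.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SLOMetadata:
    name: str
    measurement_window: str
    graph_duration: str
    target: float
    owner: str
    contacts: List[str]
    status: str          # development, staging or paging
    rationale: str
    references: List[str] = field(default_factory=list)
    changelog: List[str] = field(default_factory=list)


# All values below are hypothetical placeholders.
profile_availability = SLOMetadata(
    name="profile-page-availability",
    measurement_window="28d",
    graph_duration="7d",
    target=0.999,
    owner="product-manager@example.com",
    contacts=["tech-lead@example.com", "sre-oncall@example.com"],
    status="staging",
    rationale="Players who cannot load their profile cannot buy in-game currency.",
    references=["https://wiki.example.com/slo/profile-page"],
    changelog=["Initial target set to 99.9%"],
)

# The JSON output is what would be committed to the repository.
print(json.dumps(asdict(profile_availability), indent=2))
```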
7 Characteristics:
- Result in engineering efforts to improve reliability if error budget is spent.
- Quantitatively describe WHEN the error budget will kick in.
- Quantitatively describe HOW the error budget will kick in. Specifically how the Dev/SRE teams will respond.
- Set in place consequences if the SLO consistently fails over a long time horizon.
Example: devs that sacrifice reliability for features should be let go.
- Policy is a consistently applied set of rules.
- Document who disagreements get escalated to.
- Policy should be agreed upon and signed off by all parties.
Steps:
- Increase consequences w.r.t. increased levels of error budget burn.
- Developers and SREs must work together to push out new features to the user while maintaining the reliability. This is impossible to do all the time so it's a constant balance.
- SREs want to help the SWEs develop more features safely. Want to have aligned incentives. This is the only way that SRE will work in an organization.
- Example consequence: Pulling back the reins on feature releases. This will be a consequence to the devs for breaking the codebase which annoyed the users.
Threshold 1:
- Automated alerts are set up to notify the SRE of an SLO that is at risk.
Threshold 2:
- SREs decide they need collaborative efforts to defend the SLO. Devs will come in to assist.
Threshold 3:
- 30-day error budget has been spent without a root cause figured out. All feature releases will be blocked as a consequence and dev will need to interlock with SRE to fix the issue.
- The changes are bundled up into a weekly hotfix patch.
- DO NOT cut a new release from the development branch.
Threshold 4:
- 90-day error budget has been spent without a root cause figured out. SRE will escalate to executive leadership to get more engineering time dedicated to this issue at hand.
Scenario:
- Availability SLO: 99.9% over past 30 days.
- Bug causes 99.85% availability (0.15% of the time serves errors)
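A quick sketch of what this scenario means for the error budget, assuming the bug persists at the same error rate for the whole window. The burn rate is the observed error rate divided by the error rate the SLO allows; the function names are illustrative.

```python
def burn_rate(slo: float, observed_availability: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    return (1.0 - observed_availability) / (1.0 - slo)


def days_until_budget_spent(slo: float, observed_availability: float, window_days: float = 30.0) -> float:
    """Days until the window's budget is gone at the current burn rate."""
    rate = burn_rate(slo, observed_availability)
    return float("inf") if rate <= 0 else window_days / rate


rate = burn_rate(slo=0.999, observed_availability=0.9985)
days = days_until_budget_spent(slo=0.999, observed_availability=0.9985)

print(f"burn rate: {rate:.2f}x")
print(f"30-day budget spent after {days:.1f} days")
```

A burn rate above 1x means the 30-day budget runs out before the window ends, which is the condition the escalation thresholds above are meant to catch.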
Resolution:
- See visual example **[here](https://imgur.com/a/QWzdLxz)**