A set of Site Reliability Engineering notes & challenges
The goal of this project is to introduce you to basic SRE topics. It is designed to give an overview of what SRE covers during the whole application life cycle.
We will divide this challenge into multiple stages to more closely explain what is happening in each cycle.
Start by thinking about how the whole infrastructure can be deployed, maintained, monitored, discarded, extended, or automated. The point of this step is to get a clearer picture of how to proceed and to rule out improbable scenarios and major blockers. In this step, try to answer some basic architectural and support questions.
It will give you an idea of which technologies you, as an SRE, should continue with. Remember, you are defining how the whole process is going to work, so you need to decide on the tools you are going to work with.
Get a nice overview of how the solution works. Usually, here you get a lot of graphs, or you need to create them. Graphs help you get a grasp of what is happening. Some services talk to each other, some are independent, some require more memory, some require a lot of computing power. Here you need information - well-defined and concise information. Without understanding what is happening, you cannot decide what you want to happen.
Here, you need to think about how the whole solution is going to be updated, deleted, recreated, or moved. Most likely, you will have to implement some kind of automated mechanism to handle these operations. This is typically the moment when Continuous Integration (CI), Continuous Deployment (CD), and Anything-as-a-Code (XaaC) come into the picture. They are not your enemy!
One of the big challenges of automation is knowing how to secure things. During development, you want to isolate the parts that require authorization from the public parts and, once a part has been built, integrate it into the larger system without exposing how it was put together. In this scenario, you have to integrate secret-management mechanisms that strip all private data out of the service logic.
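One way to keep private data out of the service logic can be sketched as follows. This is a minimal illustration, not a prescribed mechanism: it assumes the platform's secret store injects values as environment variables, and the names `load_secret`, `redact`, and `MissingSecretError` are hypothetical.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is not present in the environment."""


def load_secret(name: str) -> str:
    """Return the secret stored in the environment variable `name`.

    Assumption: a secret store injects the value at deploy time, so the
    code and the repository never contain it.
    """
    value = os.environ.get(name)
    if not value:
        raise MissingSecretError(f"secret {name!r} is not set")
    return value


def redact(value: str, visible: int = 2) -> str:
    """Mask a secret so it can be mentioned in logs without leaking it."""
    if len(value) <= visible:
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible)
```

The service then calls `load_secret("SOME_TOKEN")` at startup and fails fast when the value is missing, instead of carrying a hard-coded fallback.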
If you want the solution to be bullet-proof, you have to take into account how everything is going to behave once it is up and running. Such situations are referred to as edge cases, and a proper solution has to cover them. Your infrastructure is built to serve the US West Coast in under 100ms, but what do you do for connections from India? It is Black Friday, and suddenly you have 100x higher traffic than usual. Typically, the answer has something to do with auto-scaling, load balancing, or networking. Your task now is to think of every unlikely scenario in which your infrastructure would fail, and to extend it to handle that scenario.
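The Black Friday case can be made concrete with a small capacity calculation. The sketch below is only an illustration of the reasoning behind auto-scaling: the function name, the per-replica capacity, and the replica limits are all assumed values, not part of the challenge.

```python
import math


def desired_replicas(current_rps: float, rps_per_replica: float,
                     min_replicas: int = 2, max_replicas: int = 50) -> int:
    """Return how many replicas are needed to serve `current_rps`.

    The result is clamped to [min_replicas, max_replicas]; both limits
    are illustrative defaults.
    """
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")
    needed = math.ceil(current_rps / rps_per_replica)
    return max(min_replicas, min(max_replicas, needed))


# Normal day: 100 requests/s, each replica handles about 50 requests/s.
normal = desired_replicas(100, 50)
# 100x traffic: 200 replicas would be needed, so the cap of 50 is hit.
black_friday = desired_replicas(100 * 100, 50)
```

When the result equals `max_replicas`, the limit itself has become the edge case, and the infrastructure has to be extended rather than merely scaled.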
A question you will often ask throughout the whole process. Everything is now operational; you have come this far. But one of the services is down (perhaps a like button on Facebook is not counting likes properly), and you don't know what to do. Here, you need to set up Monitoring, Logging, and Observability services. They help you troubleshoot and see what is happening in real time. They also produce a lot of unnecessary output, so configuring them correctly will help you manage everything more easily.
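As a small illustration of configuring logging to cut down on noise, the sketch below uses Python's standard `logging` module. The helper name `configure_logging` and the logger name `chatty.client` are hypothetical; a real setup would name whichever libraries are noisy in your stack.

```python
import logging


def configure_logging(noisy_loggers=("chatty.client",),
                      level=logging.INFO) -> None:
    """Set a default log level and quieten known-noisy loggers."""
    # basicConfig only takes effect if the root logger has no handlers yet.
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
    # Noisy loggers only report warnings and errors.
    for name in noisy_loggers:
        logging.getLogger(name).setLevel(logging.WARNING)
```

After calling `configure_logging()`, informational messages from the listed loggers are dropped, while their warnings and errors still reach the output.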
Whenever you are building something, you need to know how well it behaves. This part focuses on answering how your infrastructure is doing, both health-wise and functionality-wise. This is commonly expressed as a Service Level Agreement (SLA), which is evaluated using Service Level Indicators (SLIs). An SLA describes the legal requirements and arrangements between a user and the service provider.
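To make the idea of an indicator concrete, here is a minimal sketch of two common SLIs: availability (share of successful requests) and latency (share of requests faster than a threshold). The function names and the sample numbers are illustrative assumptions.

```python
from typing import Sequence


def availability_sli(successful: int, total: int) -> float:
    """Fraction of requests that succeeded (1.0 when there was no traffic)."""
    if total == 0:
        return 1.0
    return successful / total


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Fraction of requests that completed faster than `threshold_ms`."""
    if not latencies_ms:
        return 1.0
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)


def meets_target(sli: float, target: float) -> bool:
    """Compare a measured indicator with the agreed target."""
    return sli >= target
```

For example, `meets_target(availability_sli(999, 1000), 0.995)` checks whether 999 successful requests out of 1000 satisfy an assumed target of 99.5%.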
Service | Model | Networking Type | Port | Paths | Response time | Dependencies | Extras |
---|---|---|---|---|---|---|---|
data | container | Internal | 9876 | /api/* | < 200ms | ||
info | container | Internal | 5555 | /* | < 50ms | data | |
load balancer | platform supported | External | 80, 443 | / | | info | |
monitoring | optional |
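The table above can also be captured as data, which makes it possible to derive a start-up order and to check response times automatically. The sketch below is one possible encoding, not a required one: it assumes the load balancer depends on `info`, it leaves out the optional monitoring service, and the helper names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each service maps to the services it depends on, as read from the table.
# Assumption: the load balancer row lists "info" as its dependency.
DEPENDENCIES = {
    "data": [],
    "info": ["data"],
    "load balancer": ["info"],
}

# Response-time budgets from the table, in milliseconds.
RESPONSE_BUDGET_MS = {"data": 200, "info": 50}


def start_order(dependencies):
    """Return the services ordered so that dependencies start first."""
    return list(TopologicalSorter(dependencies).static_order())


def within_budget(service, measured_ms):
    """Check a measured response time against the budget in the table.

    Services without a budget in the table always pass.
    """
    budget = RESPONSE_BUDGET_MS.get(service)
    return budget is None or measured_ms < budget
```

With this data, `start_order(DEPENDENCIES)` yields `data`, then `info`, then the load balancer, and the same order reversed is a reasonable tear-down order.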
Service | Rate of change (weekly) | Versioning (optional) | Details | Rollout strategy (optional) | Type |
---|---|---|---|---|---|
infrastructure | 5 | Yes | | A/B Deployment | IaaC |
load balancer | 1-2 | No | Should be part of infrastructure code, but supported separately. | Big Bang Deployment | IaaC |
data | > 50 | Yes | Storage must not be discarded. Keep logs. Enhance security rules. Reconfigure path rules before swapping with the old version. | Rolling Deployment | SaaC |
info | > 50 | Yes | Must support high availability. | Rolling Deployment | SaaC |
monitoring | 1-2 | Yes | Ensure firewall rules, authentication, and authorization. Support only internal networks. | Rolling Deployment | SaaC |
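The rolling deployment strategy named in the table can be sketched as updating instances in small batches and stopping as soon as a batch turns out unhealthy. This is a simplified model under stated assumptions: `update` and `healthy` are hypothetical callables that stand in for whatever your platform uses to deploy a version and probe an instance.

```python
from typing import Callable, Iterator, List, Sequence


def rolling_batches(instances: Sequence[str],
                    batch_size: int) -> Iterator[List[str]]:
    """Yield the instances in consecutive batches of `batch_size`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(instances), batch_size):
        yield list(instances[start:start + batch_size])


def rolling_deploy(instances: Sequence[str], batch_size: int,
                   update: Callable[[str], None],
                   healthy: Callable[[str], bool]) -> bool:
    """Update batch by batch; return False at the first unhealthy batch.

    Instances in later batches are left untouched on failure, so they
    keep serving the old version.
    """
    for batch in rolling_batches(instances, batch_size):
        for instance in batch:
            update(instance)
        if not all(healthy(instance) for instance in batch):
            return False
    return True
```

A smaller batch size lowers the share of traffic exposed to a bad version at the cost of a longer rollout, which matters for services such as `info` that must stay highly available.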
We would also like to have a way of knowing when the deployments fail or succeed. Find a way to notify the users working on the project about the deployment statuses.
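One way to approach deployment notifications is to build a small, structured status message and hand it to whatever channel the team uses. The sketch below assumes nothing about that channel: `transport` is a hypothetical callable (for example, a function that posts to a chat webhook or sends an email), and the payload fields are illustrative.

```python
import json
from datetime import datetime, timezone
from typing import Callable, Optional


def build_notification(service: str, version: str, succeeded: bool,
                       when: Optional[datetime] = None) -> dict:
    """Build a deployment status message as a plain dictionary."""
    when = when or datetime.now(timezone.utc)
    status = "succeeded" if succeeded else "failed"
    return {
        "service": service,
        "version": version,
        "status": status,
        "timestamp": when.isoformat(),
        "text": f"Deployment of {service} {version} {status}",
    }


def send_notification(payload: dict,
                      transport: Callable[[str], None]) -> None:
    """Serialize the payload to JSON and pass it to the transport."""
    transport(json.dumps(payload))
```

Keeping the transport separate from the message makes it easy to test the notification logic in CI and to swap the delivery channel later.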
This can contain everything regarding a service (e.g. resource requirements, availability,...).
An example approach to solving such a challenge could be summarized briefly in the following steps:
More challenges will be added... - Author
Every implementation will be scored based on the criteria below.
Visit Awesome Site Reliability Engineering to find more information about most SRE-related topics.