A curated list of Site Reliability and Production Engineering resources.
A compilation of public failure and horror stories related to Kubernetes
A collection of postmortem templates
A curated list of Site Reliability and Production Engineering Tools
A comprehensive list of Game Design related learning materials, examples...
A role-playing game for incident management training
Calculate how much downtime should be permitted in your Service Level Ag...
coredumpy saves your crash site for post-mortem debugging
An incident management process and postmortem template
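
The downtime calculator entry above rests on simple arithmetic: the permitted downtime is the part of a period not covered by the availability target. A minimal sketch of that calculation follows, assuming the target is given as a percentage and the period in seconds. The function name `allowed_downtime_seconds` is hypothetical and is not taken from the listed tool.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(sla_percent: float, period_seconds: float) -> float:
    """Return the downtime budget (in seconds) for an availability target.

    sla_percent is the availability target, e.g. 99.9 for "three nines".
    period_seconds is the length of the measurement window.
    Both names are illustrative assumptions, not an existing API.
    """
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    if period_seconds < 0:
        raise ValueError("period_seconds must be non-negative")
    # The budget is the fraction of the period NOT covered by the target.
    return period_seconds * (100 - sla_percent) / 100


if __name__ == "__main__":
    # Example: a 99.9% target over a 30-day window.
    budget = allowed_downtime_seconds(99.9, 30 * SECONDS_PER_DAY)
    print(f"{budget / 60:.1f} minutes of downtime permitted")
```

For a 99.9% target over 30 days this works out to roughly 43.2 minutes. Real agreements often add exclusions such as scheduled maintenance, which this sketch does not model.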