:monkey: :fire: Datadog Failure Injection System for Kubernetes
Oldest Kubernetes version supported: 1.16
:warning: Kubernetes versions 1.20.0 through 1.20.4 are not supported! This Kubernetes issue prevents the controller from running properly on those versions. Earlier versions of Kubernetes, as well as 1.20.5 and later, are still supported.
:bomb: Disclaimer :bomb:
The Chaos Controller allows you to disrupt your Kubernetes infrastructure through various means including but not limited to: bringing down resources you have provisioned and preventing critical data from being transmitted between resources. The use of Chaos Controller on your production system is done at your own discretion and risk.
The Chaos Controller is a Kubernetes controller with which you can inject various systemic failures, at scale, and without caring about the implementation details of your Kubernetes infrastructure. It was created with a specific mindset answering Datadog's internal needs:
:bulb: Read the latest release quick installation guide and the configuration guide to learn how to deploy the controller.
Disruptions are built as short-lived resources that should be created manually and removed once your experiments are done. They should not be part of any application deployment. The Disruption resource is immutable: once applied, it cannot be edited. To change the disruption definition, delete the existing resource and re-create it.
Getting started is as simple as creating a Kubernetes resource:
apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: node-failure
  namespace: chaos-demo # it must be in the same namespace as targeted resources
spec:
  selector: # a label selector used to target resources
    app: demo-curl
  count: 1 # the number of resources to target, can be a percentage
  duration: 1h # the amount of time before your disruption automatically terminates itself, for safety
  nodeFailure: # trigger a kernel panic on the target node
    shutdown: false # do not force the node to be kept down
To disrupt your cluster, run kubectl apply -f <disruption_file>.yaml. You can clean up the disruption with kubectl delete -f <disruption_file>.yaml. For your safety, we recommend you get started with the dry-run mode enabled.
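As a rough sketch of that recommendation, the disruption above could be declared with dry-run turned on, so that targets are selected without the failure actually being injected. The `dryRun` field name and the resource name `node-failure-dry-run` are assumptions made for illustration; check the features guide for the exact field before relying on it.

```yaml
apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: node-failure-dry-run # hypothetical name, chosen for this example
  namespace: chaos-demo # same namespace as the targeted resources
spec:
  dryRun: true # assumed field name: go through the motions without injecting the failure
  selector:
    app: demo-curl
  count: 1
  duration: 1h
  nodeFailure:
    shutdown: false
```

Because the Disruption resource is immutable, moving from a dry run to a real run means deleting this resource and applying a new one without the dry-run setting.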
:open_book: The features guide details all the features of the Chaos Controller.
:open_book: The examples guide contains a list of various disruption files that you can use.
Check out Chaosli if you want some help understanding/creating disruption configurations.
New feature in 8.0.0
The Chaos Controller has expanded its capabilities by introducing disruption scheduling, enhancing your ability to automate and test system resilience consistently. Instead of manual creation and deletion, use DisruptionCron to regularly disrupt long-lived Kubernetes resources like Deployments and StatefulSets.
apiVersion: chaos.datadoghq.com/v1beta1
kind: DisruptionCron
metadata:
  name: node-failure
  namespace: chaos-demo
spec:
  schedule: "*/15 * * * *" # every 15 minutes
  targetResource:
    kind: deployment
    name: demo-curl
  disruptionTemplate:
    count: 1
    duration: 1h
    nodeFailure:
      shutdown: false
To schedule disruptions in your cluster, run kubectl apply -f <disruption_cron_file>.yaml. To stop them, run kubectl delete -f <disruption_cron_file>.yaml.
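Since DisruptionCron is meant for long-lived resources such as StatefulSets as well as Deployments, a variant targeting a StatefulSet on a weekday schedule might look like the sketch below. The resource name `demo-db`, the lowercase `statefulset` value (by analogy with `deployment`), and the shorter duration are assumptions made for illustration; the schedule uses standard five-field cron syntax.

```yaml
apiVersion: chaos.datadoghq.com/v1beta1
kind: DisruptionCron
metadata:
  name: node-failure-weekdays # hypothetical name
  namespace: chaos-demo
spec:
  schedule: "0 9 * * 1-5" # at 09:00, Monday through Friday
  targetResource:
    kind: statefulset # assumed spelling, mirroring the lowercase "deployment" value
    name: demo-db # hypothetical StatefulSet name
  disruptionTemplate:
    count: 1
    duration: 15m # kept shorter than the interval between runs
    nodeFailure:
      shutdown: false
```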
:mag_right: Check out the DisruptionCron guide for more detailed information on how to schedule disruptions.
Chaos Engineering is necessarily different from system to system. We encourage you to try out this tool, and extend it for your own use cases. If you want to run the source code locally to make and test implementation changes, visit the Contributing Doc. By the way, we welcome Pull Requests.