Cruise Control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
Cruise Control is a product that helps run Apache Kafka clusters at large scale. Due to the popularity of Apache Kafka, many companies run increasingly large Kafka clusters. At LinkedIn, we have more than 7K Kafka brokers, which means broker failures are an almost daily occurrence and balancing the Kafka workload becomes a significant operational overhead.
Kafka Cruise Control is designed to address this operation scalability issue.
Kafka Cruise Control provides the following features out of the box:
Resource utilization tracking for brokers, topics, and partitions.
Query the current Kafka cluster state to see the online and offline partitions, in-sync and out-of-sync replicas, replicas under min.insync.replicas, online and offline logDirs, and distribution of replicas in the cluster.
Multi-goal rebalance proposal generation for:
Anomaly detection, alerting, and self-healing for the Kafka cluster, including:
Admin operations, including:
The main (previously migrate_to_kafka_2_5) branch of Cruise Control is compatible with Apache Kafka 2.5+ (i.e. releases with 2.5.*), 2.6 (i.e. releases with 2.5.11+), 2.7 (i.e. releases with 2.5.36+), 2.8 (i.e. releases with 2.5.66+), 3.0 (i.e. releases with 2.5.85+), and 3.1 (i.e. releases with 2.5.85+).
The migrate_to_kafka_2_4 branch of Cruise Control is compatible with Apache Kafka 2.4 (i.e. releases with 2.4.*).
The kafka_2_0_to_2_3 branch (deprecated) of Cruise Control is compatible with Apache Kafka 2.0, 2.1, 2.2, and 2.3 (i.e. releases with 2.0.*).
The kafka_0_11_and_1_0 branch (deprecated) of Cruise Control is compatible with Apache Kafka 0.11.0.0, 1.0, and 1.1 (i.e. releases with 0.1.*).
message.format.version 0.10.0 and above is needed.
The kafka_2_0_to_2_3 and kafka_0_11_and_1_0 branches compile with Scala 2.11.
The migrate_to_kafka_2_4 branch compiles with Scala 2.12.
The migrate_to_kafka_2_5 branch compiles with Scala 2.13.
Kafka 2.0, 2.1, 2.2, and 2.3 require the KAFKA-8875 hotfix.
Get Cruise Control via git clone:
git clone https://github.com/linkedin/cruise-control.git && cd cruise-control/
Alternatively, browse https://github.com/linkedin/cruise-control/releases to pick a release (e.g. 0.1.10), then download and extract it and initialize a local repository:
wget https://github.com/linkedin/cruise-control/archive/0.1.10.tar.gz && tar zxvf 0.1.10.tar.gz && cd cruise-control-0.1.10/
git init && git add . && git commit -m "Init local repo." && git tag -a 0.1.10 -m "Init local version."
The following steps apply if CruiseControlMetricsReporter is used for metrics collection (i.e. the default for Cruise Control). The metrics reporter periodically samples the raw Kafka metrics on the broker and sends them to a Kafka topic.
./gradlew jar (Note: This project requires Java 11).
Copy ./cruise-control-metrics-reporter/build/libs/cruise-control-metrics-reporter-A.B.C.jar (where A.B.C is the version of Cruise Control) to your Kafka server dependency jar folder. For Apache Kafka, the folder would be core/build/dependant-libs-SCALA_VERSION/ (for a Kafka source checkout) or libs/ (for a Kafka release download).
Set metric.reporters to com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter in the Kafka server configuration. For Apache Kafka, server properties are located at ./config/server.properties.
If SSL is enabled, ensure that the relevant client configurations are properly set for all brokers in ./config/server.properties. Note that CruiseControlMetricsReporter takes all configurations for a vanilla KafkaProducer with a prefix of cruise.control.metrics.reporter. (e.g. cruise.control.metrics.reporter.ssl.truststore.password).
If the brokers' default cleanup policy is compact, make sure that the topic to which the Cruise Control metrics reporter sends messages is created with the delete cleanup policy. The default metrics reporter topic is __CruiseControlMetrics.
Modify config/cruisecontrol.properties of Cruise Control:
Set bootstrap.servers and zookeeper.connect to the Kafka cluster to be monitored.
Set capacity.config.file to the path of your capacity file.
You can start with the default capacity file (config/capacityJBOD.json), but it may not reflect the actual capacity of the brokers.
Optionally, set metric.sampler.class to your implementation (the default sampler class is CruiseControlMetricsReporterSampler).
Optionally, set sample.store.class to your implementation if you have one (the default SampleStore is KafkaSampleStore).
Run the following commands:
./gradlew jar copyDependantLibs
./kafka-cruise-control-start.sh [-jars PATH_TO_YOUR_JAR_1,PATH_TO_YOUR_JAR_2] config/cruisecontrol.properties [port]
JAR files correspond to your applications, and port enables customizing the Cruise Control port number (default: 9090).
To use a particular JMX port (e.g. 56666), export JMX_PORT=56666 before running kafka-cruise-control-start.sh.
To verify the setup, visit http://localhost:9090/kafkacruisecontrol/state (or http://localhost:[port]/kafkacruisecontrol/state if you specified the port when starting Cruise Control).
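The cruise.control.metrics.reporter. prefix convention from the setup steps can be illustrated with a small Python sketch. It only demonstrates the naming scheme (prefixed broker settings map to plain producer settings). It is not the reporter's actual code, and the function name and sample values are hypothetical.

```python
from typing import Dict

# Prefix named in the setup steps; everything after it is a producer setting.
PREFIX = "cruise.control.metrics.reporter."


def reporter_producer_configs(server_props: Dict[str, str]) -> Dict[str, str]:
    """Return the prefixed entries of server.properties with the prefix removed.

    Illustrative helper (hypothetical name): shows which broker settings
    would be read as producer settings under the prefix convention.
    """
    return {
        key[len(PREFIX):]: value
        for key, value in server_props.items()
        if key.startswith(PREFIX)
    }


# Sample broker settings; the password is a placeholder value.
example_props = {
    "metric.reporters": "com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter",
    "cruise.control.metrics.reporter.ssl.truststore.password": "changeit",
}
producer_view = reporter_producer_configs(example_props)
```

Here producer_view contains only ssl.truststore.password, which is the name a plain KafkaProducer would expect.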
Cruise Control provides a REST API for users to interact with. See the wiki page for more details.
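As a minimal sketch, the state endpoint used to verify the setup can also be called from Python with only the standard library. The host, port, and timeout defaults below are assumptions based on the quick-start values. The response is returned as raw text because its format is not described here.

```python
import urllib.request


def state_url(host: str = "localhost", port: int = 9090) -> str:
    """Build the URL of the state endpoint shown in the quick start."""
    return f"http://{host}:{port}/kafkacruisecontrol/state"


def fetch_state(host: str = "localhost", port: int = 9090, timeout: float = 10.0) -> str:
    """Send a GET request to the state endpoint and return the body as text.

    Raises urllib.error.URLError if Cruise Control is not reachable.
    """
    with urllib.request.urlopen(state_url(host, port), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With Cruise Control running locally on the default port, print(fetch_state()) shows the current state. Pass port=... if you started it on a custom port.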
Cruise Control relies on the recent load information of replicas to optimize the cluster.
Cruise Control periodically collects resource utilization samples at both broker- and partition-level to infer the traffic pattern of each partition. Based on the traffic characteristics and distribution of all the partitions, it derives the load impact of each partition over the brokers. Cruise Control then builds a workload model to simulate the workload of the Kafka cluster. The goal optimizer explores different ways to generate cluster workload optimization proposals based on the user-specified list of goals.
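The idea of deriving per-broker load from per-partition load can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Cruise Control's workload model. It assumes every replica carries the partition's full load, and the resource names and data shapes are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def broker_load(
    assignment: Dict[str, List[int]],
    partition_load: Dict[str, Dict[str, float]],
) -> Dict[int, Dict[str, float]]:
    """Sum the load of every hosted replica per broker and resource.

    assignment maps a partition name to the broker ids hosting its replicas;
    partition_load maps a partition name to its per-resource utilization.
    Assumption: each replica is charged the partition's full load.
    """
    totals: Dict[int, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for partition, brokers in assignment.items():
        for broker in brokers:
            for resource, value in partition_load[partition].items():
                totals[broker][resource] += value
    # Convert the nested defaultdicts to plain dicts for the caller.
    return {broker: dict(resources) for broker, resources in totals.items()}
```

An optimizer could then compare these per-broker totals against capacity limits or against each other when evaluating a proposed replica move.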
Cruise Control also monitors the liveness of all the brokers in the cluster. To avoid the loss of redundancy, Cruise Control automatically moves replicas from failed brokers to alive ones.
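Moving replicas off failed brokers can be sketched as follows. This is a toy placement rule (pick the alive broker with the fewest replicas that does not already host the partition), chosen for illustration. It is not the strategy Cruise Control uses, and all names are hypothetical.

```python
from typing import Dict, List, Set


def heal(assignment: Dict[str, List[int]], alive: Set[int]) -> Dict[str, List[int]]:
    """Return a new assignment in which no replica sits on a dead broker.

    Replicas on alive brokers stay where they are. Each replica on a dead
    broker is moved to the alive broker with the fewest replicas that does
    not already host the partition (ties broken by lowest broker id).
    """
    # Count replicas currently hosted on each alive broker.
    counts = {broker: 0 for broker in alive}
    for brokers in assignment.values():
        for broker in brokers:
            if broker in counts:
                counts[broker] += 1

    healed: Dict[str, List[int]] = {}
    for partition in sorted(assignment):
        original = assignment[partition]
        replicas = [b for b in original if b in counts]
        for _ in range(len(original) - len(replicas)):
            candidates = [b for b in counts if b not in replicas]
            if not candidates:
                break  # Not enough alive brokers to restore full redundancy.
            target = min(candidates, key=lambda b: (counts[b], b))
            replicas.append(target)
            counts[target] += 1
        healed[partition] = replicas
    return healed
```

If fewer alive brokers remain than the replication factor, the sketch leaves the partition under-replicated rather than placing two replicas on one broker.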
For more details about how Cruise Control achieves that, see these slides.
To read more about the configurations, check the configurations wiki page.
Artifacts are published at JFrog Artifactory. See available releases.
More about pluggable components can be found in the pluggable components wiki page.
The metric sampler enables users to deploy Cruise Control to various environments and work with the existing metric systems.
Cruise Control provides a metrics reporter that can be configured in your Apache Kafka server. The metrics reporter publishes performance metrics to a Kafka metrics topic that Cruise Control can consume.
The Sample Store enables storage of collected metric samples and training samples in external storage.
Metric sampling uses data derived from the raw metrics, and the accuracy of the derived data depends on the metadata of the cluster at that point in time. Hence, when we look at old metrics without knowing the metadata at the time they were collected, the derived data would not be accurate. The Sample Store helps solve this problem by storing the derived data directly in external storage for later loading.
The default Sample Store implementation produces metric samples back to Kafka.
The goals in Cruise Control are pluggable with different priorities. The default goals in order of decreasing priority are:
RackAwareGoal. Contrary to RackAwareGoal, as long as the replicas of each partition can achieve a perfectly even distribution across the racks, this goal allows multiple replicas of a partition to be placed in a single rack.
(Not available in the kafka_0_11_and_1_0 branch) Ensures that the disk space usage of each disk is below a given threshold.
(Not available in the kafka_0_11_and_1_0 branch) Attempts to keep the disk space usage variance among disks within a certain range relative to the average broker disk utilization.
The anomaly notifier allows users to be notified when an anomaly is detected. Anomalies include:
In addition to anomaly notifications, users can enable actions to be taken in response to an anomaly by turning self-healing on for the relevant anomaly detectors. Multiple anomaly detectors work in harmony using distinct mitigation mechanisms. Their actions broadly fall into the following categories: