The Apache Kafka C/C++ library

# librdkafka v1.9.0

librdkafka v1.9.0 is a feature release:
## Upgrade considerations

 * `rd_kafka_offsets_store()` (et al.) will now return an error for any
   partition that is not currently assigned (through `rd_kafka_*assign()`).
   This prevents a race condition where an application would store offsets
   after the assigned partitions had been revoked (which resets the stored
   offset), which could cause these old stored offsets to be committed later
   when the same partitions were assigned to this consumer again, effectively
   overwriting any offsets committed by consumers that were previously
   assigned the same partitions. This would typically result in the offsets
   rewinding and messages being reprocessed.
   As an extra effort to avoid this situation the stored offset is now
   also reset when partitions are assigned (through `rd_kafka_*assign()`).
   Applications that explicitly call `..offset*_store()` will now need
   to handle the case where `RD_KAFKA_RESP_ERR__STATE` is returned
   in the per-partition `.err` field, meaning the partition is no longer
   assigned to this consumer and the offset could not be stored for commit.
 * Added `socket.connection.setup.timeout.ms` (default 30s).
   The maximum time allowed for broker connection setups (TCP connection as
   well as SSL and SASL handshakes) is now limited to this value.
   This fixes the issue with stalled broker connections in the case of
   network or load balancer problems.
 * The Java clients have an exponential backoff to this timeout which is
   limited by `socket.connection.setup.timeout.max.ms` - this was not
   implemented in librdkafka due to differences in connection handling and
   `ERR__ALL_BROKERS_DOWN` error reporting. Having a lower initial connection
   setup timeout and then increasing the timeout for the next attempt would
   possibly yield false-positive `ERR__ALL_BROKERS_DOWN` too early.
 * Applications with a SASL token refresh callback previously needed to
   call `rd_kafka_poll()` (et al.) at least once to trigger the
   refresh callback before being able to connect to brokers.
   With the new `rd_kafka_conf_enable_sasl_queue()` configuration API and
   `rd_kafka_sasl_background_callbacks_enable()` the refresh callbacks
   can now be triggered automatically on the librdkafka background thread.
 * `rd_kafka_queue_get_background()` now creates the background thread
   if not already created.
 * Added `rd_kafka_consumer_close_queue()` and `rd_kafka_consumer_closed()`.
   This allows applications and language bindings to implement asynchronous
   consumer close.
 * Added `test.mock.broker.rtt`
   to simulate RTT/latency for mock brokers.
 * `commit_transaction()`, `list_groups()`, etc.
 * A `no OPENSSL_Applink()` error was written to the console if
   `ssl.keystore.location` was configured.
   This regression was introduced in v1.8.0 due to the use of vcpkg and how
   the keystore file was read. #3554.
 * Fixed the `max.poll.interval.ms` consumer timer expiring even though the
   application was polling according to profile. Fixed by @WhiteWind (#3815).
 * `rd_kafka_clusterid()` would previously fail with a timeout if
   called on a cluster with no visible topics (#3620).
   The cluster id is now returned as soon as metadata has been retrieved.
 * Fixed `rd_kafka_list_groups()` when there are no available brokers
   to connect to (#3705).
 * Millisecond timeouts (`timeout_ms`) in various APIs, such as
   `rd_kafka_poll()`, were limited to roughly 36 hours before wrapping.
   (#3034)
 * If a metadata operation triggered by `rd_kafka_metadata()` or consumer
   group rebalancing encountered a non-retriable error, it would not be
   propagated to the caller, causing a stall or timeout; this has now been
   fixed. (@aiquestion, #3625)
 * `DeleteGroups()` and `DeleteConsumerGroupOffsets()`:
   if the given coordinator connection was not up by the time these calls
   were initiated and the first connection attempt failed, then no further
   connection attempts were performed, ultimately leading to the calls
   timing out. This is now fixed by retrying to connect to the group
   coordinator until the connection is successful or the call times out.
   Additionally, the coordinator will now be re-queried once per second
   until the coordinator comes up or the call times out, to detect
   coordinator changes.
 * `rd_kafka_mock_broker_set_down()` would previously
   accept and then disconnect new connections; it now refuses new
   connections.
 * `rd_kafka_offsets_store()`
   (et al.) will now return an error for any
   partition that is not currently assigned (through `rd_kafka_*assign()`).
   See Upgrade considerations above for more information.
 * `rd_kafka_*assign()` will now reset/clear the stored offset.
   See Upgrade considerations above for more information.
 * `seek()` followed by `pause()` would overwrite the seeked offset when
   later calling `resume()`. This is now fixed. (#3471)
   Note: Avoid storing offsets (`offsets_store()`) after calling
   `seek()`, as this may later interfere with resuming a paused partition;
   instead store offsets prior to calling `seek()`.
 * An `ERR_MSG_SIZE_TOO_LARGE`
   consumer error would previously be raised
   if the consumer received a maximum-sized FetchResponse only containing
   (transaction) aborted messages with no control messages. The fetching did
   not stop, but some applications would terminate upon receiving this error.
   No error is now raised in this case. (#2993)
   Thanks to @jacobmikesell for providing an application to reproduce the
   issue.
 * Fixed a crash (`assert: rkbuf->rkbuf_rkb`) when parsing
   malformed JoinGroupResponse consumer group metadata state.
 * Fixed a crash (`cant handle op type`) when using `consume_batch_queue()`
   (et al.) and an OAUTHBEARER refresh callback was set.
   The callback is now triggered by the consume call. (#3263)
 * Fixed `partition.assignment.strategy` ordering when multiple strategies
   are configured.
   If there is more than one eligible strategy, preference is determined by
   the configured order of strategies. The partitions are now assigned to
   group members according to that strategy order preference. (#3818)
 * A `PRODUCER_FENCED`
   error returned from the broker (added in Apache Kafka 2.8) could cause
   the producer to seemingly hang.
   This error code is now correctly handled by raising a fatal error.
 * If the coordinator connection was not up when
   `send_offsets_to_transaction()` was called, and the first connection
   attempt failed, then no further connection attempts were performed,
   ultimately leading to `send_offsets_to_transaction()` timing out, and
   possibly also the transaction timing out on the transaction coordinator.
   This is now fixed by retrying to connect to the group coordinator
   until the connection is successful or the call times out.
   Additionally, the coordinator will now be re-queried once per second
   until the coordinator comes up or the call times out, to detect
   coordinator changes.
 * The check that `message.timeout.ms` is greater than
   an explicitly configured `linger.ms` was incorrect; instead of
   erroring out early, the lingering time was automatically adjusted to the
   message timeout, ignoring the configured `linger.ms`.
   This has now been fixed so that an error is returned when instantiating
   the producer. Thanks to @larry-cdn77 for analysis and test-cases. (#3709)

Release asset checksums:
    a2d124cfb2937ec5efc8f85123dbcfeba177fb778762da506bfc5a9665ed9e57
    59b6088b69ca6cf278c3f9de5cd6b7f3fd604212cd1c59870bc531c54147e889
# librdkafka v1.6.2

librdkafka v1.6.2 is a maintenance release with the following backported
fixes:
 * The transactional producer could receive an `OUT_OF_ORDER_SEQUENCE`
   error from the broker, which triggered an
   Epoch bump on the producer resulting in an InitProducerIdRequest being
   sent to the transaction coordinator in the middle of a transaction.
   This request would start a new transaction on the coordinator, but the
   producer would still think (erroneously) it was in the current
   transaction.
   Any messages produced in the current transaction prior to this event
   would be silently lost when the application committed the transaction,
   leading to message loss.
   To avoid message loss a fatal error is now raised.
   This fix is specific to v1.6.x; librdkafka v1.8.x implements a
   recoverable error state instead. #3575.

Release asset checksums:
    1d389a98bda374483a7b08ff5ff39708f5a923e5add88b80b71b078cb2d0c92e
    b9be26c632265a7db2fdd5ab439f2583d14be08ab44dc2e33138323af60c39db

# librdkafka v1.8.2

librdkafka v1.8.2 is a maintenance release.

 * Added `ssl.ca.pem` to add a CA certificate by PEM string. (#2380)
 * The `librdkafka.redist`
   1.8.0 package had two flaws:
 * It was not possible to set `ssl.ca.location` on OSX; the property
   would automatically revert back to `probe` (the default value).
   This regression was introduced in v1.8.0. (#3566)
 * The transactional producer could receive an `OUT_OF_ORDER_SEQUENCE`
   error from the broker, which triggered an
   Epoch bump on the producer resulting in an InitProducerIdRequest being
   sent to the transaction coordinator in the middle of a transaction.
   This request would start a new transaction on the coordinator, but the
   producer would still think (erroneously) it was in the current
   transaction.
   Any messages produced in the current transaction prior to this event
   would be silently lost when the application committed the transaction,
   leading to message loss.
   This has been fixed by setting the Abortable transaction error state
   in the producer. #3575.

Release asset checksums:

    8b03d8b650f102f3a6a6cff6eedc29b9e2f68df9ba7e3c0f3fb00838cce794b8
    6a747d293a7a4613bd2897e28e8791476fbe1ae7361f2530a876e0fd483482a6

Note: there was no v1.8.1 librdkafka release
# librdkafka v1.8.0

librdkafka v1.8.0 is a security release:

 * Upgraded the bundled zlib version in the `librdkafka.redist`
   NuGet package. The updated zlib version fixes CVEs:
   CVE-2016-9840, CVE-2016-9841, CVE-2016-9842, CVE-2016-9843.
   See https://github.com/edenhill/librdkafka/issues/2934 for more
   information.
 * Bundled dependencies in the `librdkafka.redist` NuGet package:
   OpenSSL 1.1.1l, zlib 1.2.11, zstd 1.5.0.
 * Dependency builds made with `./configure --install-deps`:
   these builds are used by the librdkafka builds bundled with
   confluent-kafka-go, confluent-kafka-python and confluent-kafka-dotnet.
 * `flush()` now overrides the `linger.ms` setting for the duration
   of the `flush()` call, effectively triggering immediate transmission of
   queued messages. (#3489)
 * `ERR__ALL_BROKERS_DOWN` is no longer emitted when the coordinator
   connection goes down, only when all standard named brokers have been
   tried.
   This fixes the issue with `ERR__ALL_BROKERS_DOWN` being triggered on
   `consumer_close()`. It is also now only emitted if the connection was
   fully up (past handshake), and not just connected.
 * `rd_kafka_query_watermark_offsets()`, `rd_kafka_offsets_for_times()`,
   the `consumer_lag` metric, and `auto.offset.reset` now honour
   `isolation.level` and will return the Last Stable Offset (LSO)
   when `isolation.level` is set to `read_committed` (the default), rather
   than the uncommitted high-watermark when it is set to
   `read_uncommitted`. (#3423)
 * When `sasl.kerberos.min.time.before.relogin` is set to 0, ticket
   refreshes are disabled (by @mpekalski, #3431).
 * `txidle` and `rxidle` in the statistics object were emitted as
   18446744073709551615 when no idle was known; -1 is now emitted instead.
   (#3519)
 * Offset commits are now retried on `ERR_REQUEST_TIMED_OUT`
   `ERR_COORDINATOR_NOT_AVAILABLE`, and `ERR_NOT_COORDINATOR` (#3398).
   Offset commits will be retried twice.
 * `auto.offset.reset` could previously be triggered by temporary errors,
   such as disconnects and timeouts (after the two retries are exhausted).
   This is now fixed so that the auto offset reset policy is only triggered
   for permanent errors.
 * The reason for an `auto.offset.reset` is now logged to help the
   application owner identify the cause of the reset.
 * `session.timeout.ms`: the consumer will remain in the group as long as
   it receives heartbeat responses from the broker.
 * `DeleteRecords()` could crash if one of the underlying requests
   (for a given partition leader) failed at the transport level (e.g., a
   timeout). (#3476)

Release asset checksums:
    4b173f759ea5fdbc849fdad00d3a836b973f76cbd3aa8333290f0398fd07a1c4
    93b12f554fa1c8393ce49ab52812a5f63e264d9af6a50fd6e6c318c481838b7f
# librdkafka v1.7.0

librdkafka v1.7.0 is a feature release:

 * OpenSSL engine support (`ssl.engine.location`) by @adinigam and @ajbarb.
 * Added `connections.max.idle.ms`
   to automatically close idle broker connections.
   This feature is disabled by default unless `bootstrap.servers` contains
   the string `azure`, in which case the default is set to <4 minutes to
   improve connection reliability and circumvent limitations with the Azure
   load balancers (see #3109 for more information).
 * The C++ `oauthbearer_token_refresh_cb()` was missing a `Handle *`
   argument that has now been added. This is a breaking change but the
   original function signature is considered a bug.
   This change only affects C++ OAuth developers.
 * The consumer `session.timeout.ms`
   default was changed from 10 to 45 seconds to make consumer groups more
   robust and less sensitive to temporary network and cluster issues.
 * `consumer_lag` is now using the `committed_offset`,
   while the new `consumer_lag_stored` is using the `stored_offset`
   (the offset to be committed).
   This is more correct than the previous `consumer_lag`, which was using
   either `committed_offset` or `app_offset` (the last message passed
   to the application).
 * `ERR__UNKNOWN_PARTITION` could previously be raised for existing
   partitions shortly after the client instance was created.
 * librdkafka now uses `TLS_client_method()` (on OpenSSL >= 1.1.0) instead
   of the deprecated and outdated `SSLv23_client_method()`.
 * `group.id`
   or `transactional.id` was configured (#3305).
 * The refresh command was previously executed from `rd_kafka_new()`,
   which made this call blocking if the refresh command was taking long.
   The refresh is now performed by the background rdkafka main thread.
 * `brokers[].throttle` statistics object.
 * `default_topic_conf`.
 * During a `consume_batch..()`
   call, the already accumulated messages for revoked partitions were not
   purged, which would pass messages to the application for partitions that
   were no longer owned by the consumer. Fixed by @jliunyu. #3340.
 * A warning log, `Abort txn ctrl msg bad order at offset 7501: expected
   before or at 7702: messages in aborted transactions may be delivered to
   the application`, would be seen.
   This is a rare occurrence where a transactional producer would register
   with the partition but not produce any messages before aborting the
   transaction.
 * `topic.metadata.refresh.interval.ms`: if this property was set too low
   it would cause cached metadata to be unusable and new metadata to be
   fetched, which could delay the time it took for a rebalance to settle.
   It now correctly uses `metadata.max.age.ms` instead.
 * Fixed a crash (`rktp_started`
   ) on consumer termination (introduced in v1.6.0).
 * A consumer using the `cooperative-sticky` assignor did
   not actively Leave the group on unsubscribe(). This delayed the
   rebalance for the remaining group members by up to `session.timeout.ms`.
 * The timeout value of `flush()` was not respected when delivery reports
   were scheduled as events (such as for confluent-kafka-go) rather than
   callbacks.
 * There was a race condition in `purge()` which could cause newly
   created partition objects, or partitions that were changing leaders, to
   not have their message queues purged. This could cause
   `abort_transaction()` to time out. This issue is now fixed.
 * Message transmission could be delayed regardless of `linger.ms`, due to
   rate-limiting of internal queue wakeups. This is now fixed by not
   rate-limiting queue wakeups but instead limiting them to one wakeup per
   queue reader poll. #2912.
 * `txn_requires_abort()`
   error.
 * If all messages timed out between `produce()` and `commit_transaction()`,
   and before any partitions had been registered with the coordinator, the
   messages would time out but the commit would succeed because nothing
   had been sent to the coordinator. This is now fixed.
 * If the transaction failed while `commit_transaction()` was
   checking the current transaction state, an invalid state transition
   could occur, which in turn would trigger an assertion crash.
   This issue showed up as "Invalid txn state transition: .." crashes, and
   is now fixed by properly synchronizing both checking and transition of
   state.

# librdkafka v1.6.1

librdkafka v1.6.1 is a maintenance release.
 * The consumer error raised by `auto.offset.reset=error` now has its
   error code set to `ERR__AUTO_OFFSET_RESET`, to allow an application to
   differentiate between auto offset resets and other consumer errors.
 * `send_offsets_to_transaction()` coordinator
   requests, such as TxnOffsetCommitRequest, could in rare cases be sent
   multiple times, which could cause a crash.
 * `ssl.ca.location=probe` is now enabled by default on Mac OSX since the
   librdkafka-bundled OpenSSL might not have the same default CA search
   paths as the system or brew installed OpenSSL. Probing scans all known
   locations.
 * `send_offsets_to_transaction()` was called.
 * `send_offsets_to_transaction()` calls would leak memory if the
   underlying request was attempted to be sent after the transaction had
   failed.
 * `ERR_UNSTABLE_OFFSET_COMMIT`. #3265

# librdkafka v1.6.0

librdkafka v1.6.0 is a feature release:
 * Sticky producer partitioning (`sticky.partitioning.linger.ms`) -
   achieves higher throughput and lower latency through sticky selection of
   a random partition (by @abbycriswell).
 * Added `DeleteRecords()`, `DeleteGroups()` and
   `DeleteConsumerGroupOffsets()` (by @gridaphobe).
 * Sticky producer partitioning (`sticky.partitioning.linger.ms`) is
   enabled by default (10 milliseconds), which affects the distribution of
   randomly partitioned messages: where previously these messages would be
   evenly distributed over the available partitions, they are now
   partitioned to a single partition for the duration of the sticky time
   (10 milliseconds by default) before a new random sticky partition
   is selected.
 * `rd_kafka_commit_transaction()` and `rd_kafka_abort_transaction()` may
   now be called with the `timeout_ms` parameter set to -1, which will use
   the remaining transaction timeout.
 * Added `DeleteRecords()` (by @gridaphobe).
 * Added `DeleteGroups()` (by @gridaphobe).
 * Added `DeleteConsumerGroupOffsets()`.
 * `CreateTopics()`.
 * Added `ssl.ca.certificate.stores` to specify a list of
   Windows Certificate Stores to read CA certificates from, e.g.,
   `CA,Root`. `Root` remains the default store.
 * Use `rand_r()` on supporting platforms, which decreases lock
   contention (@azat).
 * Added the `assignor` debug context for troubleshooting consumer
   partition assignments.
 * Updated bundled lz4 (used when `./configure --disable-lz4-ext`) to
   v1.9.3, which has vast performance improvements.
 * Added `rd_kafka_conf_get_default_topic_conf()` to retrieve the
   default topic configuration object from a global configuration object.
 * Added the `conf`
   debugging context to `debug` - shows set configuration
   properties on client and topic instantiation. Sensitive properties
   are redacted.
 * Added `rd_kafka_queue_yield()` to cancel a blocking queue call.
 * Added `rd_kafka_seek_partitions()` to seek multiple partitions to
   per-partition specific offsets.
 * The `oauthbearer_set_token()` function would call `free()` on
   a `new`-created pointer, possibly leading to crashes or heap corruption
   (#3194).
 * Instead of returning the generic `RD_KAFKA_RESP_ERR__STATE`
   when API calls were attempted after the transaction had failed, we now
   try to return the error that caused the transaction to fail in the first
   place, such as `RD_KAFKA_RESP_ERR__FENCED` when the producer has
   been fenced, or `RD_KAFKA_RESP_ERR__TIMED_OUT` when the transaction
   has timed out.
 * `rd_kafka_send_offsets_to_transaction()` would fail the current
   transaction into an abortable state when `CONCURRENT_TRANSACTIONS` was
   returned by the broker (which is a transient error) and the 3 retries
   were exhausted.
 * Calling `rd_kafka_topic_new()` with a topic config object with
   `message.timeout.ms` set could sometimes adjust the global `linger.ms`
   property (if not explicitly configured), which was not desired. This is
   now fixed and the auto adjustment is only done based on the
   `default_topic_conf` at producer creation.
 * `rd_kafka_flush()` could previously return
   `RD_KAFKA_RESP_ERR__TIMED_OUT` just as the timeout was reached if the
   messages had been flushed but there were no longer any messages.
   This has been fixed.

Release asset checksums:

    af6f301a1c35abb8ad2bb0bab0e8919957be26c03a9a10f833c8f97d6c405aa8
    3130cbd391ef683dc9acf9f83fe82ff93b8730a1a34d0518e93c250929be9f6b
# librdkafka v1.5.3

librdkafka v1.5.3 is a maintenance release.

 * Consumer `close()` could hang in certain
   cgrp states (@gridaphobe, #3127).
 * `Message::errstr()` (#3140).
 * The `roundrobin` partition assignment strategy could get stuck in an
   endless loop or generate uneven assignments in case the group members
   had asymmetric subscriptions (e.g., c1 subscribes to t1,t2 while c2
   subscribes to t2,t3). (#3159)

Release asset checksums:

    3f24271232a42f2d5ac8aab3ab1a5ddbf305f9a1ae223c840d17c221d12fe4c1
    2105ca01fef5beca10c9f010bc50342b15d5ce6b73b2489b012e6d09a008b7bf
# librdkafka v1.5.2

librdkafka v1.5.2 is a maintenance release.

 * The default value for the producer configuration property `retries` has
   been increased from 2 to infinity, effectively limiting Produce retries
   to only `message.timeout.ms`.
   As the reasons for the automatic internal retries vary (various broker
   error codes as well as transport layer issues), it doesn't make much
   sense to limit the number of retries for retriable errors, but instead
   only limit the retries based on the allowed time to produce a message.
 * The default value for `request.timeout.ms` has been increased from 5 to
   30 seconds to match the Apache Kafka Java producer default.
   This change yields increased robustness for broker-side congestion.
 * `CONFIGURATION.md` (through `rd_kafka_conf_properties_show()`)
   now includes all properties and values, regardless of whether they were
   included in the build, and setting a disabled property or value through
   `rd_kafka_conf_set()` now returns `RD_KAFKA_CONF_INVALID` and provides
   a more useful error string saying why the property can't be set.
 * The GZIP decompressor called `inflateGetHeader()`
   with uninitialized memory pointers that could lead to the GZIP header of
   a fetched message batch being copied to arbitrary memory.
   This function call has now been completely removed since the result was
   not used.
   Reported by Ilja van Sprundel.
 * `rd_kafka_topic_opaque()` (used by the C++ API) would cause object
   refcounting issues when used on light-weight (error-only) topic objects
   such as consumer errors (#2693).
 * `socket_cb` would return error.
 * The `roundrobin` `partition.assignment.strategy` could crash (assert)
   for certain combinations of members and partitions.
   This is a regression in v1.5.0. (#3024)
 * The C++ `KafkaConsumer` destructor did not destroy the underlying
   C `rd_kafka_t` instance, causing a leak if `close()` was not used.
 * `Message->errstr()`.
 * `ERR_TOPIC_AUTHORIZATION_FAILED` return value from `produce*()` (#2215).
 * `ERR_KAFKA_STORAGE_ERROR` is now correctly treated as a retriable
   produce error (#3026).

Note: there was no v1.5.1 librdkafka release

Release asset checksums:

    de70ebdb74c7ef8c913e9a555e6985bcd4b96eb0c8904572f3c578808e0992e1
    ca3db90d04ef81ca791e55e9eed67e004b547b7adedf11df6c24ac377d4840c6
# librdkafka v1.5.0

The v1.5.0 release brings usability improvements, enhancements and fixes to
librdkafka.

 * Added the `batch.size` producer configuration property (#638).
 * Added `topic.metadata.propagation.max.ms` to allow newly manually
   created topics to be propagated throughout the cluster before reporting
   them as non-existent. This fixes race issues where `CreateTopics()` is
   quickly followed by `produce()`.
 * Added `rd_kafka_event_debug_contexts()` to get the debug contexts for
   a debug log line (by @wolfchimneyrock).
 * `./configure --enable-XYZ` now requires the XYZ check to pass,
   and `--disable-XYZ` disables the feature altogether (@benesch).
 * Added `rd_kafka_produceva()` which takes an array of produce arguments
   for situations where the existing `rd_kafka_producev()` va-arg approach
   can't be used.
 * Added `rd_kafka_message_broker_id()` to see the broker that a message
   was produced or fetched from, or that an error was associated with.
 * `RD_KAFKA_RESP_ERR_UNKNOWN_TOPIC_OR_PART` and
   `RD_KAFKA_RESP_ERR_TOPIC_AUTHORIZATION_FAILED` are now propagated to
   the application through the standard consumer error (the `err` field in
   the message object).
 * `allow.auto.create.topics=true`
   may be used to re-enable the old deprecated functionality.
 * The default value for `queued.max.messages.kbytes`
   has been decreased from 1GB to 64MB to avoid excessive network usage
   for low and medium throughput consumer applications. High throughput
   consumer applications may need to manually set this property to a
   higher value.
 * When `ssl.ca.location=probe` is configured,
   librdkafka will probe known CA certificate paths and automatically use
   the first one found. This should alleviate the need to configure
   `ssl.ca.location` when the statically linked OpenSSL's OPENSSLDIR
   differs from the system's CA certificate path.
 * `api.version.request=false` and `broker.version.fallback=...`: there
   should be no functional change.
 * The default value for the producer configuration property `linger.ms`
   has been changed from 0.5 ms to 5 ms to improve batch sizes and
   throughput while reducing the per-message protocol overhead.
   Applications that require lower produce latency than 5 ms will need to
   manually set `linger.ms` to a lower value.
 * `./configure --LDFLAGS='a=b, c=d'` with arguments containing `=` are now
   supported (by @sky92zwq).
 * `./configure` arguments now take precedence over cached `configure`
   variables from previous invocation.
 * `topic.metadata.refresh.interval.ms`. (#2955)

Release asset checksums:

    76a1e83d643405dd1c0e3e62c7872b74e3a96c52be910233e8ec02d501fa33c8
    f7fee59fdbf1286ec23ef0b35b2dfb41031c8727c90ced6435b8cf576f23a656