High Performance Rate Limiting MicroService and Library
GetRateLimits would reset the ResetTime and not the Remaining counter. This would cause counters to eventually deplete and never fully reset. The solution involved fixing two issues:
Duration value was never properly propagated in global behavior. This was added to the global broadcast logic.
UpdatePeerGlobals during a global broadcast but neglected to propagate Duration
.Duration to zero and trigger a reset of the ResetTime. This code path does not reset the Remaining counter because it's meant for cases where an existing rate limit had been extended or abbreviated in duration.
CacheItem struct and logic in algorithms.go would ignore it, causing it to short circuit around the logic that checks Duration. Once the data type was corrected, the Duration bug was revealed.
ResetTime generated by the owning and non-owning peers did not always match exactly.
ResetTime in multiple places based on clock.Now().
ResetTime doesn't change due to the above bug.
GetRateLimits() will set a requestTime and pass it around so that any date/time computation to set ResetTime will always use the same base value instead of clock.Now().
QueueUpdate() used by peers to propagate updates to rate limits that it owns.
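The requestTime approach described above can be sketched as follows. This is a minimal illustration, not Gubernator's actual code: the rateLimitRequest type and the resetTimeFor helper are hypothetical names, and times are assumed to be epoch milliseconds. The point it shows is that the request time is captured once and every ResetTime computation derives from it rather than from a fresh clock reading.

```go
package main

import (
	"fmt"
	"time"
)

// rateLimitRequest is a hypothetical, trimmed-down request used only for
// this sketch; the real request type has more fields.
type rateLimitRequest struct {
	Key        string
	DurationMs int64 // length of the rate limit window in milliseconds
}

// resetTimeFor derives the reset time from the caller-supplied request time,
// never from the wall clock, so every code path handling the same request
// computes the same value.
func resetTimeFor(requestTimeMs int64, req rateLimitRequest) int64 {
	return requestTimeMs + req.DurationMs
}

func main() {
	// Capture the request time once at the entry point...
	requestTimeMs := time.Now().UnixMilli()
	req := rateLimitRequest{Key: "account:1234", DurationMs: 60_000}

	// ...and pass it to every computation that follows.
	first := resetTimeFor(requestTimeMs, req)
	time.Sleep(5 * time.Millisecond) // simulated processing delay
	second := resetTimeFor(requestTimeMs, req)

	fmt.Println(first == second) // true
}
```

Had each call read the clock itself, the two results would differ by the processing delay, which is the kind of mismatch between peers that the fix is meant to remove.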
Remaining counter. So, if the same key were updated multiple times it may get added in non-chronological order. The last update wins, potentially passing a stale Remaining count, thereby dropping hits already applied.
QueueUpdates(). Then, when the timer calls to propagate the update, get the current ratelimit state of each queued update just before sending to the peers.
GetRateLimits on a non-owner peer with global behavior.
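The queued-update fix described above (queue the update, then read the current rate limit state just before sending) might look roughly like the sketch below. All names here (state, updateQueue, queue, flush) are hypothetical and chosen for illustration; this is not the project's implementation. The queue stores only keys, so the order in which updates arrive cannot cause a stale Remaining value to be broadcast.

```go
package main

import (
	"fmt"
	"sync"
)

// state is a hypothetical stand-in for a cached rate limit's status.
type state struct {
	Remaining int64
}

// updateQueue records only which keys changed, not a snapshot of their state.
type updateQueue struct {
	mu      sync.Mutex
	pending map[string]struct{}
}

func newUpdateQueue() *updateQueue {
	return &updateQueue{pending: make(map[string]struct{})}
}

// queue marks a key as needing a broadcast. Queuing the same key many times,
// in any order, has the same effect as queuing it once.
func (q *updateQueue) queue(key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.pending[key] = struct{}{}
}

// flush drains the pending keys and reads each key's current state through
// lookup just before it would be sent to peers.
func (q *updateQueue) flush(lookup func(key string) (state, bool)) map[string]state {
	q.mu.Lock()
	keys := q.pending
	q.pending = make(map[string]struct{})
	q.mu.Unlock()

	out := make(map[string]state, len(keys))
	for key := range keys {
		if s, ok := lookup(key); ok {
			out[key] = s
		}
	}
	return out
}

func main() {
	cache := map[string]state{"account:1234": {Remaining: 10}}
	lookup := func(key string) (state, bool) { s, ok := cache[key]; return s, ok }

	q := newUpdateQueue()
	q.queue("account:1234") // queued while Remaining was 10
	cache["account:1234"] = state{Remaining: 7} // more hits applied afterwards
	q.queue("account:1234")

	fmt.Println(q.flush(lookup)["account:1234"].Remaining) // 7
}
```

Because the snapshot is taken at flush time, the broadcast carries whatever the cache holds at that moment, regardless of how many times or in what order the key was queued.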
UNDER_LIMIT no matter how many hits were applied.
GetRateLimits with zero hits to not trigger any global updates because nothing changed.
GetRateLimits.
gubernator_over_limit_counter on both non-owner and owner peers. Only count on owner peer.
gubernator_getratelimit_counter. When a non-owner uses Global behavior to process a request, do not increment the counter. After it global sends to the owner, the owner will increment the counter. This counter shall be the accurate count of rate limits checked.
gubernator_broadcast_counter. Use gubernator_broadcast_duration_count instead.
TestHealthCheck that causes the next test to fail because the Gubernator services were restarted and aren't always ready in time to accept requests.
global.go for global behavior by @miparnisari in #213
SetupDaemonConfig no longer needs a file by @miparnisari in #214
DRAIN_OVER_LIMIT.
TestGlobalRateLimits that was not checking for exact behavior.
GlobalManager fields so that tests would not read from global metric variables that were impacted by other tests.
GlobalSyncWait from 500ms to 100ms
runAsyncHits() and runBroadcasts(). When a ratelimit is hit, it could take up to 2x this setting before it's replicated to each peer.
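The "up to 2x" figure can be worked through with a small calculation. The sketch below assumes the setting acts as the wait interval in both runAsyncHits() and runBroadcasts(), so a hit may wait one full interval before being forwarded and another before being broadcast; worstCaseReplication is a hypothetical helper, not part of the library.

```go
package main

import (
	"fmt"
	"time"
)

// worstCaseReplication estimates the longest time a hit may take to reach
// all peers, assuming one full wait before the hit is forwarded to the owner
// and one more before the owner broadcasts the result.
func worstCaseReplication(globalSyncWait time.Duration) time.Duration {
	return 2 * globalSyncWait
}

func main() {
	fmt.Println(worstCaseReplication(500 * time.Millisecond)) // 1s
	fmt.Println(worstCaseReplication(100 * time.Millisecond)) // 200ms
}
```

Under that assumption, moving from 500ms to 100ms lowers the worst-case replication delay from about one second to about 200 milliseconds.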