Gubernator Versions

High Performance Rate Limiting MicroService and Library

v2.4.0

1 month ago

What's Changed

  • MegaFix global behavior bugs by @Baliedge in #225
    • Every call to GetRateLimits would reset the ResetTime and not the Remaining counter. This would cause counters to eventually deplete and never fully reset. The solution involved fixing two issues:
      • The Duration value was never properly propagated in global behavior. This was added to the global broadcast logic.
        • The changes in PR https://github.com/mailgun/gubernator/pull/219 fixed propagation issues in UpdatePeerGlobals during a global broadcast but neglected to propagate Duration.
        • As a result, logic in algorithms.go would detect a change in Duration to zero and trigger a reset of the ResetTime. This code path does not reset the Remaining counter because it's meant for cases where an existing rate limit had been extended or abbreviated in duration.
        • I had wondered why this was never a problem before that PR. That's because that PR fixed a global broadcast bug that was setting the wrong data type in a CacheItem struct and logic in algorithms.go would ignore it, causing it to short circuit around the logic that checks Duration. Once the data type was corrected, the Duration bug was revealed.
      • The ResetTime generated by the owning and non-owning peers did not always match exactly.
        • The value varied slightly depending on network lag and system time synchronization because peers generated ResetTime in multiple places, each based on clock.Now().
        • This isn't a showstopper normally, but it does prevent writing a unit test to ensure ResetTime doesn't change due to the above bug.
        • GetRateLimits() now sets a requestTime and passes it along so that every date/time computation that sets ResetTime uses the same base value instead of clock.Now().
    • Fix race condition in QueueUpdate(), which peers use to propagate updates to the rate limits they own.
      • Updates include rate limit state, such as the Remaining counter. If the same key was updated multiple times, the updates could be queued in non-chronological order. The last update wins, potentially passing a stale Remaining count and thereby dropping hits already applied.
      • The fix is to pass only rate limit key info to QueueUpdate(). Then, when the timer fires to propagate the updates, get the current rate limit state of each queued key just before sending to the peers.
    • Fix inconsistency with over limit status when calling GetRateLimits on a non-owner peer with global behavior.
      • The logic would always return a response with status UNDER_LIMIT no matter how many hits were applied.
      • This differs when the same request reaches the owner peer, which will return the appropriate status.
      • The fix adds a check for hits > remaining and sets the status accordingly.
    • Optimize calls to GetRateLimits with zero hits to not trigger any global updates because nothing changed.
    • Add rigorous functional tests around global behavior to verify full peer-to-peer propagation after a call to GetRateLimits.
    • Fix double counting of metric gubernator_over_limit_counter on both non-owner and owner peers. Only count on the owner peer.
    • Fix double counting of metric gubernator_getratelimit_counter. When a non-owner uses Global behavior to process a request, do not increment the counter. After the non-owner sends the hits to the owner, the owner will increment the counter. This counter shall be the accurate count of rate limits checked.
    • Remove redundant metric gubernator_broadcast_counter. Use gubernator_broadcast_duration_count instead.
    • Fix intermittent test error related to TestHealthCheck that causes the next test to fail because the Gubernator services were restarted and aren't always ready in time to accept requests.
  • Fix mutex deadlocks in PeerClient by @miparnisari in #223
  • Fix goroutine leaks by @miparnisari in #221
  • Add test for global rate limiting with load balancing by @Baliedge and @philipgough in #224
  • Update protobufs and Makefile by @miparnisari in #211
    • Update versions and run buf mod update and make proto
    • Fix version of gateway
    • Generate reverse proxy for peers v1
  • Change global behavior by @thrawn01 in #219
    • Changes how GLOBAL behavior operates. Previously, the owner of the rate limit would broadcast the computed result of the rate limit to other peers, and the computed result from the owner was returned to clients who submit hits to a peer. However, after some great feedback on https://github.com/mailgun/gubernator/pull/218 and https://github.com/mailgun/gubernator/pull/216, it has become clear that we should instead allow the local peers to compute the result of the request based on the hits broadcast by the owning peer.
    • In the new behavior, a peer computes the result of the rate limit request and immediately returns that computed result. This means that the non-owning peer computes the result with the Remaining value it currently has in its cache. To put it another way, the peer cache will no longer hold the computed result from the owner.
    • In order to facilitate this change, I've added many more tests around global functionality which should help ensure we don't break behavior going forward.
  • Add docs in global.go for global behavior by @miparnisari in #213
  • SetupDaemonConfig no longer needs a file by @miparnisari in #214
  • Update GitHub Action dep versions by @miparnisari in #201
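
The QueueUpdate() race fix described under #225 above can be illustrated with a minimal sketch. It is not Gubernator's actual code: the cache, updateQueue, and rateLimitState types and their methods are hypothetical names chosen for illustration. The idea shown is that the queue records only which keys changed, and the state is read at flush time, so the value sent to peers cannot be a stale snapshot.

```go
package main

import (
	"fmt"
	"sync"
)

// rateLimitState is a hypothetical stand-in for the cached state of one rate limit.
type rateLimitState struct {
	Remaining int64
}

// cache is a hypothetical store of current rate limit state, keyed by rate limit key.
type cache struct {
	mu    sync.Mutex
	items map[string]rateLimitState
}

// applyHits subtracts hits from the key's Remaining counter, creating the
// entry with the given limit if it does not exist yet.
func (c *cache) applyHits(key string, limit, hits int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	st, ok := c.items[key]
	if !ok {
		st = rateLimitState{Remaining: limit}
	}
	st.Remaining -= hits
	c.items[key] = st
}

// get returns the current state for key, if any.
func (c *cache) get(key string) (rateLimitState, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	st, ok := c.items[key]
	return st, ok
}

// updateQueue remembers only which keys changed, never a snapshot of their state.
type updateQueue struct {
	mu   sync.Mutex
	keys map[string]struct{}
}

// queue marks key as needing propagation; repeated calls collapse into one entry.
func (q *updateQueue) queue(key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.keys[key] = struct{}{}
}

// flush drains the queued keys and reads the current state of each one just
// before it would be sent to peers.
func (q *updateQueue) flush(c *cache) map[string]rateLimitState {
	q.mu.Lock()
	keys := q.keys
	q.keys = make(map[string]struct{})
	q.mu.Unlock()

	out := make(map[string]rateLimitState, len(keys))
	for k := range keys {
		if st, ok := c.get(k); ok {
			out[k] = st
		}
	}
	return out
}

func main() {
	c := &cache{items: make(map[string]rateLimitState)}
	q := &updateQueue{keys: make(map[string]struct{})}

	// Five concurrent hits on the same key, queued in arbitrary order.
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.applyHits("account:1234", 10, 1)
			q.queue("account:1234")
		}()
	}
	wg.Wait()

	updates := q.flush(c)
	fmt.Println(updates["account:1234"].Remaining) // prints 5
}
```

Because the queue holds keys rather than values, the order in which concurrent callers enqueue no longer matters: whatever is flushed reflects every hit applied up to that moment.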

v2.3.2

3 months ago

What's Changed

  • Bump grpcio from 1.53.0 to 1.53.2 in /python by @dependabot in #220

v2.3.1

3 months ago

What's Changed

  • Fix improperly set version. Now set to v2.3.1.

v2.3.0

3 months ago

What's Changed

  • New "Drain Over Limit" behavior by @Baliedge in #209
    • Add new behavior DRAIN_OVER_LIMIT.
    • Tech debt
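
As a rough illustration of what a "drain over limit" option might do, the sketch below assumes the behavior sets Remaining to zero when a request is rejected for asking for more hits than remain. That reading is an assumption, not something stated in these notes, and the bucket type and hit method are hypothetical rather than Gubernator's API.

```go
package main

import "fmt"

// bucket is a hypothetical, simplified token bucket; it is not Gubernator's type.
type bucket struct {
	Remaining int64
}

// hit applies hits to the bucket and reports whether the request is over the limit.
// When drainOverLimit is true, an over-limit request also drains Remaining to
// zero (the assumed DRAIN_OVER_LIMIT semantics).
func (b *bucket) hit(hits int64, drainOverLimit bool) (overLimit bool) {
	if hits > b.Remaining {
		if drainOverLimit {
			b.Remaining = 0
		}
		return true
	}
	b.Remaining -= hits
	return false
}

func main() {
	plain := &bucket{Remaining: 3}
	drain := &bucket{Remaining: 3}

	fmt.Println(plain.hit(5, false), plain.Remaining) // prints "true 3"
	fmt.Println(drain.hit(5, true), drain.Remaining)  // prints "true 0"
}
```

Under this assumption, a client that exceeds the limit once sees zero remaining on subsequent checks instead of a residual value.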

v2.2.1

6 months ago

What's Changed

  • Update version strings properly.
  • The version strings were not updated in v2.2.0.

v2.2.0

6 months ago

What's Changed

  • Fanout for async updates in global by @miparnisari in #198
    • Global updates occur concurrently instead of sequentially.
    • Configure concurrency with Behaviors.GlobalPeerRequestsConcurrency, default is 100.

v2.1.4

7 months ago

What's Changed

  • Fix startup error due to OTel dependency upgrade by @Baliedge in #196
    • "error in resource.Merge on resources index 0: cannot merge resource due to conflicting Schema URL"

v2.1.3

7 months ago

What's Changed

v2.1.2

7 months ago

What's Changed

  • Security update golang.org/x/net and Tidy code by @Baliedge in #192
    • Improve TestGlobalRateLimits that was not checking for exact behavior.
      • Required moving metric variables into GlobalManager fields so that tests would not read from global metric variables that were impacted by other tests.
    • Tidy up code.
    • Update default for GlobalSyncWait from 500ms to 100ms
      • This applies to both runAsyncHits() and runBroadcasts(). When a ratelimit is hit, it could take up to 2x this setting before it's replicated to each peer.

v2.1.1

7 months ago

What's Changed

  • Fix repo version tags.