sturdyc: a caching library for building sturdy systems

A highly concurrent cache packed with functionality for building sturdy systems.

Sturdyc is a highly concurrent cache that supports non-blocking reads and has a configurable number of shards that makes it possible to achieve parallel writes. The xxhash algorithm is used for efficient key distribution. Evictions are performed per shard based on recency at O(N) time complexity using quickselect.
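To give a rough sense of how the sharded writes work, here is a minimal sketch (not sturdyc's actual implementation) that hashes a key with xxhash and reduces the hash to a shard index, using the github.com/cespare/xxhash/v2 package for illustration:

package main

import (
	"fmt"

	"github.com/cespare/xxhash/v2"
)

// shardIndex sketches how a sharded cache can map a key to one of its shards:
// hash the key and reduce the hash to an index. Keys that land on different
// shards can be written concurrently without contending for the same lock.
func shardIndex(key string, numShards int) int {
	return int(xxhash.Sum64String(key) % uint64(numShards))
}

func main() {
	for _, key := range []string{"user-1", "user-2", "user-3"} {
		fmt.Printf("%s -> shard %d\n", key, shardIndex(key, 10))
	}
}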

It has all the functionality you would expect from a caching library, along with additional features designed to help you build robust applications, such as:

  • Great support for batchable data sources: the package takes the response apart and caches each record individually, based on the permutation of options with which it was fetched. It can also significantly reduce the traffic to the underlying data sources by creating a buffer for each unique option set and delaying the refreshes until enough IDs have been gathered.

  • Background refreshes: records can be configured to refresh either based on time or at a certain rate of requests. All refreshes occur in the background, so users never have to wait for a record to be updated, which results in very low latency applications.

There are examples for all of these configurations further down this file!

Installing

go get github.com/creativecreature/sturdyc

At a glance

The package exports functions such as sturdyc.Get, sturdyc.Set, sturdyc.GetFetch, sturdyc.GetFetchBatch, sturdyc.Passthrough, and sturdyc.PassthroughBatch, all of which are shown in the examples below.

To utilize these functions, you will first have to set up a client to manage your configuration:

func main() {
	// Maximum number of entries in the cache.
	capacity := 10000
	// Number of shards to use.
	numShards := 10
	// Time-to-live for cache entries.
	ttl := 2 * time.Hour
	// Percentage of entries to evict when the cache is full. Setting this
	// to 0 will make Set a no-op when the cache has reached its capacity.
	// Expired records are evicted continuously by a background job.
	evictionPercentage := 10

	// Create a cache client with the specified configuration.
	cacheClient := sturdyc.New(capacity, numShards, ttl, evictionPercentage)

	// We can then use the client to store and retrieve values.
	sturdyc.Set(cacheClient, "key1", "value")
	if val, ok := sturdyc.Get[string](cacheClient, "key1"); ok {
		log.Println(val) // Prints "value"
	}

	sturdyc.Set(cacheClient, "key2", 1)
	if val, ok := sturdyc.Get[int](cacheClient, "key2"); ok {
		log.Println(val) // Prints 1
	}
}

Next, we'll look at some of the more advanced features.

Stampede protection

Cache stampedes (also known as thundering herd) occur when many requests for a particular piece of data, which has just expired or been evicted from the cache, come in at once. All of those requests then fetch the data from the underlying source concurrently, which can put significant load on it.

To prevent this, we can enable stampede protection:

func main() {
	// Set a minimum and maximum refresh delay for the cache. This is
	// used to spread out the refreshes of our entries evenly over time.
	minRefreshDelay := time.Millisecond * 10
	maxRefreshDelay := time.Millisecond * 30
	// The base used for exponential backoff when retrying a refresh.
	retryBaseDelay := time.Millisecond * 10
	// NOTE: Ignore this for now, it will be shown in the next example.
	storeMisses := false

	// Create a cache client with the specified configuration.
	cacheClient := sturdyc.New(capacity, numShards, ttl, evictionPercentage,
		sturdyc.WithStampedeProtection(minRefreshDelay, maxRefreshDelay, retryBaseDelay, storeMisses),
	)
}

With this configuration, the cache will prevent records from expiring by scheduling a background refresh whenever a key is requested again after a random interval of 10 to 30 milliseconds has passed. Performing the refreshes upon cache key retrieval, rather than at a fixed interval, allows keys that are no longer requested to expire.
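Conceptually, picking the delay is as simple as drawing a random duration between the two bounds. Here is a minimal sketch of the idea (not the library's internal code), using rand.N from math/rand/v2:

// nextRefreshDelay returns a random duration in the half-open interval
// [minDelay, maxDelay). Illustrative only; sturdyc's internals may differ.
func nextRefreshDelay(minDelay, maxDelay time.Duration) time.Duration {
	return minDelay + rand.N(maxDelay-minDelay)
}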

To demonstrate this, we can create a simple API client:

type API struct {
	cacheClient *sturdyc.Client
}

func NewAPI(c *sturdyc.Client) *API {
	return &API{c}
}

func (a *API) Get(ctx context.Context, key string) (string, error) {
	// This could be a call to a rate limited service, a database query, etc.
	fetchFn := func(_ context.Context) (string, error) {
		log.Printf("Fetching value for key: %s\n", key)
		return "value", nil
	}
	return sturdyc.GetFetch(ctx, a.cacheClient, key, fetchFn)
}

and use it in our main function:

func main() {
	// ...

	// Create a cache client with the specified configuration.
	cacheClient := sturdyc.New(capacity, numShards, ttl, evictionPercentage,
		sturdyc.WithStampedeProtection(minRefreshDelay, maxRefreshDelay, retryBaseDelay, storeMisses),
	)

	// Create a new API instance with the cache client.
	api := NewAPI(cacheClient)

	// We are going to retrieve the values every 10 milliseconds; however, the
	// logs will reveal that the actual refreshes fluctuate randomly within a 10-30
	// millisecond range. Even if this loop is executed across multiple
	// goroutines, the API call frequency will maintain this variability,
	// ensuring we avoid overloading the API with requests.
	for i := 0; i < 100; i++ {
		val, _ := api.Get(context.Background(), "key")
		log.Printf("Value: %s\n", val)
		time.Sleep(minRefreshDelay)
	}
}

Running this program, we're going to see that the value gets refreshed once every 2-3 retrievals:

go run .
2024/04/07 09:05:29 Fetching value for key: key
2024/04/07 09:05:29 Value: value
2024/04/07 09:05:29 Value: value
2024/04/07 09:05:29 Fetching value for key: key
2024/04/07 09:05:29 Value: value
2024/04/07 09:05:29 Value: value
2024/04/07 09:05:29 Value: value
2024/04/07 09:05:29 Fetching value for key: key
...

The entire example is available here.

Non-existent records

Another factor to consider is non-existent keys. It could be an ID that was added manually to a CMS with a typo, which leads to no data being returned from the upstream source. This can significantly increase our system's latency, as we're never able to get a cache hit and serve the response from memory.

However, it could also be caused by a slow ingestion process. Perhaps it takes some time for a new entity to propagate through a distributed system.

The cache allows us to store these IDs as missing records, while refreshing them like any other entity. To illustrate, we can extend the previous example to enable this functionality:

func main() {
	// ...

	// Tell the cache to store missing records.
	storeMisses := true

	// Create a cache client with the specified configuration.
	cacheClient := sturdyc.New(capacity, numShards, ttl, evictionPercentage,
		sturdyc.WithStampedeProtection(minRefreshDelay, maxRefreshDelay, retryBaseDelay, storeMisses),
	)

	// Create a new API instance with the cache client.
	api := NewAPI(cacheClient)

	// ...
	for i := 0; i < 100; i++ {
		val, err := api.Get(context.Background(), "key")
		if errors.Is(err, sturdyc.ErrMissingRecord) {
			log.Println("Record does not exist.")
		}
		if err == nil {
			log.Printf("Value: %s\n", val)
		}
		time.Sleep(minRefreshDelay)
	}
}

Next, we'll modify the API client to return the ErrStoreMissingRecord error for the first two calls. This error instructs the cache to store the key as a missing record:

type API struct {
	count       int
	cacheClient *sturdyc.Client
}

func NewAPI(c *sturdyc.Client) *API {
	return &API{
		count:       0,
		cacheClient: c,
	}
}

func (a *API) Get(ctx context.Context, key string) (string, error) {
	fetchFn := func(_ context.Context) (string, error) {
		log.Printf("Fetching value for key: %s\n", key)
		a.count++
		if a.count < 3 {
			return "", sturdyc.ErrStoreMissingRecord
		}
		return "value", nil
	}
	return sturdyc.GetFetch(ctx, a.cacheClient, key, fetchFn)
}

and then call it:

func main() {
	// ...

	for i := 0; i < 100; i++ {
		val, err := api.Get(context.Background(), "key")
		if errors.Is(err, sturdyc.ErrMissingRecord) {
			log.Println("Record does not exist.")
		}
		if err == nil {
			log.Printf("Value: %s\n", val)
		}
		time.Sleep(minRefreshDelay)
	}
}

Running this program, we'll see that the record is reported as missing for the first two fetches and then transitions into having a value:

2024/04/07 09:42:49 Fetching value for key: key
2024/04/07 09:42:49 Record does not exist.
2024/04/07 09:42:49 Record does not exist.
2024/04/07 09:42:49 Record does not exist.
2024/04/07 09:42:49 Fetching value for key: key
2024/04/07 09:42:49 Record does not exist.
2024/04/07 09:42:49 Record does not exist.
2024/04/07 09:42:49 Record does not exist.
2024/04/07 09:42:49 Fetching value for key: key
2024/04/07 09:42:49 Value: value
2024/04/07 09:42:49 Value: value
2024/04/07 09:42:49 Fetching value for key: key
...

The entire example is available here.

Batch endpoints

One challenge with caching batchable endpoints is that you have to find a way to reduce the number of cache keys. To illustrate, let's say that we have 10,000 records and an endpoint for fetching them that allows for batches of 20.

Now, let's calculate the number of combinations if we were to create the cache keys from the query params:

$$ C(n, k) = \binom{n}{k} = \frac{n!}{k!(n-k)!} $$

For $n = 10,000$ and $k = 20$, this becomes:

$$ C(10,000, 20) = \binom{10,000}{20} = \frac{10,000!}{20!(10,000-20)!} $$

This results in an approximate value of:

$$ \approx 4.032 \times 10^{61} $$

and that is assuming we always send perfect batches of exactly 20 IDs. If the batches could contain anywhere from 1 to 20 IDs, the total number of combinations would be the sum of the combinations for each k from 1 to 20.
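Expressed as a formula, that total is:

$$ \sum_{k=1}^{20} \binom{10,000}{k} $$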

sturdyc caches each record individually, which eliminates this combinatorial explosion: the number of cache keys grows linearly with the number of records instead.

To see how this works, we can look at a small example application. This time, we'll start with the API client:

type API struct {
	cacheClient *sturdyc.Client
}

func NewAPI(c *sturdyc.Client) *API {
	return &API{c}
}

func (a *API) GetBatch(ctx context.Context, ids []string) (map[string]string, error) {
	// We are going to pass the cache a key function that prefixes each id.
	// This makes it possible to save the same id for different data sources.
	cacheKeyFn := a.cacheClient.BatchKeyFn("some-prefix")

	// The fetchFn is only going to retrieve the IDs that are not in the cache.
	fetchFn := func(_ context.Context, cacheMisses []string) (map[string]string, error) {
		log.Printf("Cache miss. Fetching ids: %s\n", strings.Join(cacheMisses, ", "))
		// Batch functions should return a map where the key is the id of the record.
		// If you have storage of missing records enabled, any ID that isn't present
		// in this map is going to be stored as a cache miss.
		response := make(map[string]string, len(cacheMisses))
		for _, id := range cacheMisses {
			response[id] = "value"
		}
		return response, nil
	}

	return sturdyc.GetFetchBatch(ctx, a.cacheClient, ids, cacheKeyFn, fetchFn)
}

and we're going to use the same cache configuration as the previous example, so I've omitted it for brevity:

func main() {
	// ...

	// Create a new API instance with the cache client.
	api := NewAPI(cacheClient)

	// Seed the cache with ids 1-10.
	log.Println("Seeding ids 1-10")
	ids := []string{"1", "2", "3", "4", "5", "6", "7", "8", "9", "10"}
	api.GetBatch(context.Background(), ids)
	log.Println("Seed completed")

	// Each record has been cached individually. To illustrate this, we can keep
	// fetching a random number of records from the original batch, plus a new ID.
	// Looking at the logs, we should be able to see that the cache only
	// fetches the ID that wasn't present in the original batch.
	for i := 1; i <= 100; i++ {
		// Get N ids from the original batch.
		recordsToFetch := rand.IntN(10) + 1
		batch := make([]string, recordsToFetch)
		copy(batch, ids[:recordsToFetch])
		// Add a random ID between 10 and 1009 to the batch.
		batch = append(batch, strconv.Itoa(rand.IntN(1000)+10))
		values, _ := api.GetBatch(context.Background(), batch)
		// Print the records we retrieved from the cache.
		log.Println(values)
	}
}

Running this code, we can see that we only end up fetching the randomized ID, while continuously getting cache hits for IDs 1-10, regardless of what the batch looks like:

...
2024/04/07 11:09:58 Seed completed
2024/04/07 11:09:58 Cache miss. Fetching ids: 173
2024/04/07 11:09:58 map[1:value 173:value 2:value 3:value 4:value]
2024/04/07 11:09:58 Cache miss. Fetching ids: 12
2024/04/07 11:09:58 map[1:value 12:value 2:value 3:value 4:value]
2024/04/07 11:09:58 Cache miss. Fetching ids: 730
2024/04/07 11:09:58 map[1:value 2:value 3:value 4:value 730:value]
2024/04/07 11:09:58 Cache miss. Fetching ids: 520
2024/04/07 11:09:58 map[1:value 2:value 3:value 4:value 5:value 520:value 6:value 7:value 8:value]
...

The entire example is available here.

Cache key permutations

If you're attempting to cache data from an upstream system, the ID alone may be insufficient to uniquely identify the record in your cache. The endpoint you're calling might accept a variety of options that transform the data in different ways. Therefore, it's important to consider this and store records for each unique set of options.

To showcase this, we can create a simple API client that interacts with a service for retrieving order statuses:

type OrderOptions struct {
	CarrierName        string
	LatestDeliveryTime string
}

type OrderAPI struct {
	cacheClient *sturdyc.Client
}

func NewOrderAPI(client *sturdyc.Client) *OrderAPI {
	return &OrderAPI{
		cacheClient: client,
	}
}

func (a *OrderAPI) OrderStatus(ctx context.Context, ids []string, opts OrderOptions) (map[string]string, error) {
	// We use the PermutatedBatchKeyFn when an ID isn't enough to uniquely identify a
	// record. The cache is going to store each ID once per set of options. In a more
	// realistic scenario, the opts would be query params or arguments to a DB query.
	cacheKeyFn := a.cacheClient.PermutatedBatchKeyFn("key", opts)

	// We'll create a fetchFn with a closure that captures the options. For this
	// simple example, it logs and returns the status for each order, but you could
	// just as easily have called an external API.
	fetchFn := func(_ context.Context, cacheMisses []string) (map[string]string, error) {
		log.Printf("Fetching: %v, carrier: %s, delivery time: %s\n", cacheMisses, opts.CarrierName, opts.LatestDeliveryTime)
		response := map[string]string{}
		for _, id := range cacheMisses {
			response[id] = fmt.Sprintf("Available for %s", opts.CarrierName)
		}
		return response, nil
	}
	return sturdyc.GetFetchBatch(ctx, a.cacheClient, ids, cacheKeyFn, fetchFn)
}

The main difference from the previous example is that we're using the PermutatedBatchKeyFn function. Internally, the cache uses reflection to extract the names and values of every exported field, and uses them to build the cache keys.
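As a rough illustration of that idea (a simplified sketch using fmt, reflect, and strings, not the library's actual key format or implementation), flattening the exported fields of an options struct into a key segment could look something like this:

// optionsKey sketches how the exported fields of an options struct can be
// flattened into a cache key segment with reflection. Illustrative only;
// the exact key format sturdyc produces may differ.
func optionsKey(opts any) string {
	v := reflect.ValueOf(opts)
	t := v.Type()
	parts := make([]string, 0, t.NumField())
	for i := 0; i < t.NumField(); i++ {
		if !t.Field(i).IsExported() {
			continue
		}
		parts = append(parts, fmt.Sprintf("%v", v.Field(i).Interface()))
	}
	return strings.Join(parts, "-")
}

For OrderOptions{CarrierName: "FEDEX", LatestDeliveryTime: "2024-04-06"}, this sketch would produce "FEDEX-2024-04-06", which combined with an ID yields keys like the FEDEX-2024-04-06-id1 entry shown further down.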

To demonstrate this, we can write another small program:

func main() {
	// ...

	// Create a new cache client with the specified configuration.
	cacheClient := sturdyc.New(capacity, numShards, ttl, evictionPercentage,
		sturdyc.WithStampedeProtection(minRefreshDelay, maxRefreshDelay, retryBaseDelay, storeMisses),
	)

	// We will fetch these IDs using various option sets.
	ids := []string{"id1", "id2", "id3"}
	optionSetOne := OrderOptions{CarrierName: "FEDEX", LatestDeliveryTime: "2024-04-06"}
	optionSetTwo := OrderOptions{CarrierName: "DHL", LatestDeliveryTime: "2024-04-07"}
	optionSetThree := OrderOptions{CarrierName: "UPS", LatestDeliveryTime: "2024-04-08"}

	orderClient := NewOrderAPI(cacheClient)
	ctx := context.Background()

	// Next, we'll seed our cache by fetching the entire list of IDs for all option sets.
	log.Println("Filling the cache with all IDs for all option sets")
	orderClient.OrderStatus(ctx, ids, optionSetOne)
	orderClient.OrderStatus(ctx, ids, optionSetTwo)
	orderClient.OrderStatus(ctx, ids, optionSetThree)
	log.Println("Cache filled")
}

At this point, the cache has stored each record individually for each option set:

  • FEDEX-2024-04-06-id1
  • DHL-2024-04-07-id1
  • UPS-2024-04-08-id1
  • etc.

Next, we'll add a sleep to make sure that all of the records are due for a refresh, and then request the ids individually for each set of options:

func main() {
	// ...

	// Sleep to make sure that all records are due for a refresh.
	time.Sleep(maxRefreshDelay + 1)

	// Fetch each id for each option set.
	for i := 0; i < len(ids); i++ {
		orderClient.OrderStatus(ctx, []string{ids[i]}, optionSetOne)
		orderClient.OrderStatus(ctx, []string{ids[i]}, optionSetTwo)
		orderClient.OrderStatus(ctx, []string{ids[i]}, optionSetThree)
	}

	// Sleep for a second to allow the refresh logs to print.
	time.Sleep(time.Second)
}

Running this program, we can see that the records are refreshed once per unique id+option combination:

go run .
2024/04/07 13:33:56 Filling the cache with all IDs for all option sets
2024/04/07 13:33:56 Fetching: [id1 id2 id3], carrier: FEDEX, delivery time: 2024-04-06
2024/04/07 13:33:56 Fetching: [id1 id2 id3], carrier: DHL, delivery time: 2024-04-07
2024/04/07 13:33:56 Fetching: [id1 id2 id3], carrier: UPS, delivery time: 2024-04-08
2024/04/07 13:33:56 Cache filled
2024/04/07 13:33:58 Fetching: [id3], carrier: UPS, delivery time: 2024-04-08
2024/04/07 13:33:58 Fetching: [id1], carrier: FEDEX, delivery time: 2024-04-06
2024/04/07 13:33:58 Fetching: [id1], carrier: UPS, delivery time: 2024-04-08
2024/04/07 13:33:58 Fetching: [id2], carrier: UPS, delivery time: 2024-04-08
2024/04/07 13:33:58 Fetching: [id2], carrier: FEDEX, delivery time: 2024-04-06
2024/04/07 13:33:58 Fetching: [id3], carrier: FEDEX, delivery time: 2024-04-06
2024/04/07 13:33:58 Fetching: [id2], carrier: DHL, delivery time: 2024-04-07
2024/04/07 13:33:58 Fetching: [id3], carrier: DHL, delivery time: 2024-04-07
2024/04/07 13:33:58 Fetching: [id1], carrier: DHL, delivery time: 2024-04-07

The entire example is available here.

Refresh buffering

As seen in the example above, we're not really utilizing the fact that the endpoint is batchable when we're performing the refreshes.

To make this more efficient, we can enable the refresh buffering functionality. Internally, the cache creates a buffer for each permutation of our options and collects IDs until the buffer reaches a certain size or a time threshold is exceeded.
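The general pattern looks roughly like the sketch below (illustrative only, using sync and time from the standard library; not how sturdyc implements it internally): IDs are collected per permutation and flushed either when the batch size is reached or when the timeout fires.

// idBuffer sketches the general batching pattern: collect IDs and flush them
// once the batch size is reached, or when the timeout expires, whichever
// happens first. Illustrative only, not sturdyc's internal implementation.
type idBuffer struct {
	mu        sync.Mutex
	ids       []string
	batchSize int
	timeout   time.Duration
	timer     *time.Timer
	flush     func(ids []string)
}

func (b *idBuffer) add(id string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.ids) == 0 {
		// First ID of a new batch: start the timeout clock.
		b.timer = time.AfterFunc(b.timeout, b.flushOnTimeout)
	}
	b.ids = append(b.ids, id)
	if len(b.ids) >= b.batchSize {
		b.timer.Stop()
		go b.flush(b.ids)
		b.ids = nil
	}
}

func (b *idBuffer) flushOnTimeout() {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.ids) > 0 {
		go b.flush(b.ids)
		b.ids = nil
	}
}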

The only change we have to make to the example above is to enable this functionality:

func main() {
	// ...

	// With refresh buffering enabled, the cache will buffer refreshes
	// until the batch size is reached or the buffer timeout is hit.
	batchSize := 3
	batchBufferTimeout := time.Second * 30

	// Create a new cache client with the specified configuration.
	cacheClient := sturdyc.New(capacity, numShards, ttl, evictionPercentage,
		sturdyc.WithStampedeProtection(minRefreshDelay, maxRefreshDelay, retryBaseDelay, storeMisses),
		sturdyc.WithRefreshBuffering(batchSize, batchBufferTimeout),
	)

	// ...
}

and now we can see that the cache performs the refreshes in batches per permutation of our query params:

go run .
2024/04/07 13:45:42 Filling the cache with all IDs for all option sets
2024/04/07 13:45:42 Fetching: [id1 id2 id3], carrier: FEDEX, delivery time: 2024-04-06
2024/04/07 13:45:42 Fetching: [id1 id2 id3], carrier: DHL, delivery time: 2024-04-07
2024/04/07 13:45:42 Fetching: [id1 id2 id3], carrier: UPS, delivery time: 2024-04-08
2024/04/07 13:45:42 Cache filled
2024/04/07 13:45:44 Fetching: [id1 id2 id3], carrier: FEDEX, delivery time: 2024-04-06
2024/04/07 13:45:44 Fetching: [id1 id3 id2], carrier: DHL, delivery time: 2024-04-07
2024/04/07 13:45:44 Fetching: [id1 id2 id3], carrier: UPS, delivery time: 2024-04-08

The entire example is available here.

Passthrough

Time-based refreshes work really well for most use cases. However, there are scenarios where you might want to allow a certain amount of traffic to hit the underlying data source. For example, you might achieve a 99.99% cache hit rate, and even though you refresh the data every 1-2 seconds, it still only amounts to a handful of requests. This could cause the other system to scale down too much.

To solve this problem, sturdyc provides sturdyc.Passthrough and sturdyc.PassthroughBatch. These behave like sturdyc.GetFetch and sturdyc.GetFetchBatch, except that they let a configurable percentage of requests through to the underlying data source, rather than letting one request through every x milliseconds/seconds/minutes/hours.

The passthroughs will still be performed in the background, which means that your application will maintain low latency response times. Moreover, if the underlying system goes down, sturdyc.Passthrough and sturdyc.PassthroughBatch will still be able to serve stale data until the record's TTL expires.

capacity := 5
numShards := 2
ttl := time.Minute
evictionPercentage := 10
c := sturdyc.New(capacity, numShards, ttl, evictionPercentage,
    // Allow 50% of the requests to pass-through. Default is 100%.
    sturdyc.WithPassthroughPercentage(50),
    // Buffer the batchable pass-throughs. Default is false.
    sturdyc.WithPassthroughBuffering(),
)

res, err := sturdyc.Passthrough(ctx, c, "id", fetchFn)
batchRes, batchErr := sturdyc.PassthroughBatch(ctx, c, idBatch, c.BatchKeyFn("item"), batchFetchFn)

Custom metrics

The cache can be configured to report custom metrics for:

  • Size of the cache
  • Cache hits
  • Cache misses
  • Evictions
  • Forced evictions
  • The number of entries evicted
  • Shard distribution
  • The size of the refresh buckets

All you have to do is implement the MetricsRecorder interface:

type MetricsRecorder interface {
	CacheHit()
	CacheMiss()
	Eviction()
	ForcedEviction()
	EntriesEvicted(int)
	ShardIndex(int)
	CacheBatchRefreshSize(size int)
	ObserveCacheSize(callback func() int)
}

and pass it as an option when you create the client:

cache := sturdyc.New(
    cacheSize,
    shardSize,
    cacheTTL,
    evictWhenFullPercentage,
    sturdyc.WithMetrics(metricsRecorder),
)
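For reference, here is a minimal sketch of a recorder that just keeps in-memory counters using sync/atomic (illustrative only); in practice you would forward these calls to something like Prometheus or StatsD:

// countingRecorder is a minimal MetricsRecorder that only keeps in-memory
// counters. It's meant as a sketch; a real implementation would typically
// forward these calls to a metrics system such as Prometheus.
type countingRecorder struct {
	hits, misses, evictions, forcedEvictions, entriesEvicted atomic.Int64
}

func (r *countingRecorder) CacheHit()            { r.hits.Add(1) }
func (r *countingRecorder) CacheMiss()           { r.misses.Add(1) }
func (r *countingRecorder) Eviction()            { r.evictions.Add(1) }
func (r *countingRecorder) ForcedEviction()      { r.forcedEvictions.Add(1) }
func (r *countingRecorder) EntriesEvicted(n int) { r.entriesEvicted.Add(int64(n)) }

// These are no-ops in this sketch, but this is where you would record the
// shard distribution, refresh batch sizes, and the observed cache size.
func (r *countingRecorder) ShardIndex(int)              {}
func (r *countingRecorder) CacheBatchRefreshSize(int)   {}
func (r *countingRecorder) ObserveCacheSize(func() int) {}

You would then pass a &countingRecorder{} as the metricsRecorder in the snippet above.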

Below are a few images where these metrics have been visualized in Grafana:

[Grafana dashboard screenshots]