Provides a reliable cache contingency backup plan in case a GraphCMS/GraphQL endpoint is failing.
The main goal of this service is to provide a reliable cache contingency backup plan in case a GraphQL endpoint fails. This service's top priority is reliability, not data consistency: the data it serves may not always be up-to-date.
This Cache service is meant to run on your own AWS account and be managed by you. It is powered by the Serverless Framework. It uses an optional 3rd party monitoring tool: Epsagon (the free plan is most likely enough).
It is a whole service meant to be used by developers/teams who rely on GraphCMS, and are looking for a safe and reliable way to provide services to their own customers.
You keep complete ownership of this service, as it runs under your plan, and you are free to make any change to fit your business.
P.S: Please share the awesome things you build with the community!
Using this service instead of directly hitting a GraphQL endpoint provides the following benefits:
Watch this 10-minute video to understand it and see it in action!
Clone the repo, then configure your local install:

- `nvm use` or `nvm install` (optional, just make sure to use the same node version as specified in `/.nvmrc`)
- `yarn install`
- Configure your `/.env.development` and `/.env.test` files (only the GraphCMS credentials are really necessary, if you're just playing around)
- `yarn start`: starts the service at `localhost:8085/status` and `/read-cache`
- `yarn emulate:local`: play around with fake queries sent to `http://localhost:8085`, and go to `/read-cache` to see the changes

Before deploying on AWS:
- Configure the `serverless-domain-manager` plugin
- Configure the `redis.url` for the project you mean to deploy

On AWS (staging):

- Configure your `.env.staging` file first
- `yarn deploy:demo` (you may want to either disable or configure the `serverless-domain-manager` plugin)
- `yarn emulate:client:demo`: sends queries to your staging endpoint so you can manually test the behavior there

On AWS (prod):

- Configure your `.env.production` file first
- `yarn deploy:demo:production`
- `yarn emulate:client:demo:production`: sends queries to your production endpoint so you can manually test the behavior there

If you've decided to clone/fork this project, please do the following:

- `slack-codebuild` is used to send build notifications to a Slack channel (it's MIT too)
- `SLACK_WEBHOOK_URL`: use your own or remove it, or your build notifications will appear on our Slack channel (please don't do that)
- `CC_TEST_REPORTER_ID`: use your own or remove it, or your build results will be mixed with our own

How you plug your app into the cache really depends on your app's implementation.
If you're using React with Apollo, for instance, it's just a matter of changing the endpoint to target your cache (the `/cache-query` endpoint) rather than your GCMS endpoint, and not using any credentials (the cache doesn't need any).
It should be simple and straightforward: just fetch your cache's `/cache-query` endpoint instead of hitting your GraphCMS endpoint directly.
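For instance, here is a minimal sketch of what that endpoint swap could look like with Apollo Client (assuming a recent `@apollo/client`; the cache URL below is a placeholder for your own deployment):

```js
// Sketch only: point Apollo at the cache endpoint instead of GraphCMS.
// 'https://cache.example.com/cache-query' is a placeholder URL.
import { ApolloClient, HttpLink, InMemoryCache } from '@apollo/client';

const client = new ApolloClient({
  link: new HttpLink({
    uri: 'https://cache.example.com/cache-query', // instead of your GCMS endpoint
    // No Authorization header: the cache service holds the GCMS credentials itself.
  }),
  cache: new InMemoryCache(),
});
```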
Testing with a non-production application is strongly recommended to begin with. Also, use a `QUERY` GraphCMS token: you don't need a token that can write, read-only is enough and therefore more secure.
Follow these steps when you need to deploy a new customer:

- In `package.json`, replace the customer name (`demo`) with the new customer's name
- In `serverless.yml`, update the `custom.envs` section with appropriate values (basically, duplicate another customer (staging + prod) and change the values)
- Update `secrets-staging.yml` and `secrets-production.yml`
This is only useful if you've kept the `serverless-domain-manager` plugin and thus want to deploy your service using a custom domain.
You'll need to configure AWS Route53 and AWS Certificate Manager to create your custom domain first.
The domain is defined in `custom.env.$customer.domain.name`, and custom domains (`sls create_domain`) are managed here using `yarn create:$customer` and `yarn create:$customer:production`.
If you use custom domains and they aren't ready yet, the deployment will fail (check your API Gateway).
Then:

- `nvm use`
- `yarn deploy:$customer`: deploy the newly created customer in staging
- `yarn deploy:$customer:production`: deploy the newly created customer in production

This Cache uses a mix of the GraphQL query and headers as index (redis key), and GCMS API responses as values (redis value).
"Always reliable, eventually synchronized"
This Cache service will always return the value from the redis cache. It will never check whether a newer value exists on the GCMS side.
Therefore, it may not be in sync with the actual values held by GCMS.
Due to this behaviour, this Cache service will never serve fresher data on its own. That's why there are different "cache invalidation" strategies.
Those strategies are optional and you are not required to use any of them. You may use none, one, or several, as you see fit. We implemented Strategy 1 first, then switched to Strategy 2, which is less complex and more reliable for our use-case.
This strategy is very useful if you have lots of reads and very few writes.
It doesn't play nice if you write a lot in GraphCMS (like automated massive writes in batches, such as a massive data import).
On GCMS's side, a WebHook is meant to trigger a cache invalidation every time a change is made to the data held by GCMS.
WebHooks can be configured from https://app.graphcms.com/YOURS/staging/webhooks; each stage has its own WebHooks.
The WebHook should be configured to hit the cache invalidation endpoint (`/refresh-cache`), which will re-run the query for every existing key in the redis cache.
Note that a cached entry will only be invalidated if the refresh query to the GCMS API actually succeeded. So, if the GCMS API is down during the cache refresh, the cache won't be changed. (there is no retry strategy)
This is an important detail, as the cache should always contain reliable data.
Reminder: The cache uses a Redis storage, with the query (as string) used as key, and the query results (as json) used as value.
In short, every time any data is changed in GCMS, the whole cache is refreshed.
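Conceptually, a cache write boils down to something like this sketch (assuming ioredis; the actual implementation also mixes headers into the key, as noted above, and stores metadata such as `version` and `updatedAt`, shown in the Monitoring section below):

```js
// Sketch only: the GraphQL request body serves as the redis key,
// and the GCMS API response serves as the value.
// (the real key also incorporates request headers)
async function cacheQueryResult(redis, requestBody, gcmsResponse) {
  const key = JSON.stringify(requestBody); // query + operationName + variables
  await redis.set(key, JSON.stringify(gcmsResponse));
}
```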
N.B: Special protection has been put in place to avoid concurrent access to the `/refresh-cache` endpoint.
Only one concurrent call is authorized; this is gracefully handled by the `reservedConcurrency` option in `serverless.yml`.
Known limitations:
This strategy hasn't been designed the best way it could have been, and it suffers from some rare race conditions.
In the case of a massive write (such as an automated import tool that performs lots of writes really fast, like 100-200 writes in 30-50 seconds), the `/refresh-cache` endpoint may be called several times (despite the concurrency lock), because the import script takes so long that multiple calls to `/refresh-cache` are executed.
The bad part is that the last call to fetch data from the GraphCMS API and store it in the cache isn't necessarily the last one to execute, so the data stored in the cache may not be the most recent version.
The proper way to tackle this issue would be to use a queue with a debounce strategy: wait until no more requests are being received, and only then perform the cache refresh (instead of performing it immediately).
Unfortunately, we ran out of time and haven't tackled this issue yet (instead, we implemented Strategy 2, which is simpler). We're also not really familiar with queue services (SQS, SNS, EventBridge, ...) and don't know which one would be best for the job.
Contributor help needed!: That would be a very appreciated contribution! We'd definitely love a PR for this :)
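To illustrate the idea (this is not part of the project), here is a minimal sketch of a redis-backed debounce, assuming ioredis and a hypothetical `refreshAllQueries()` helper: the webhook handler only records that a refresh is wanted, and a scheduled function performs it once the writes have settled.

```js
// Sketch only: debounced cache refresh.
const Redis = require('ioredis');

const redis = new Redis(process.env.REDIS_URL);
const DEBOUNCE_MS = 10000; // wait until no write happened for 10s

// Called by the GCMS WebHook: record the request, don't refresh yet.
async function requestRefresh() {
  await redis.set('refresh:requestedAt', Date.now());
}

// Called periodically (e.g. by a scheduled Lambda, every minute).
async function maybeRefresh() {
  const requestedAt = Number(await redis.get('refresh:requestedAt'));
  if (!requestedAt) return; // no refresh pending
  if (Date.now() - requestedAt < DEBOUNCE_MS) return; // writes still ongoing
  await redis.del('refresh:requestedAt');
  await refreshAllQueries(); // hypothetical: re-runs every cached query
}
```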
If there are many queries stored in redis (hundreds), they may not all resolve within the 30s limit imposed by API GW. In that case, they'd likely start to fail randomly depending on the GCMS API response time, it'd become very difficult to ensure the integrity of all requests, and (in the current state) it'd be very hard to fix.
One possible way to tackle this issue would be to spawn calls (async, parallel) to another lambda, whose role would be to refresh one query only (see the sketch below). We only have a handful of queries in our cache, so we're not affected by this limitation yet and aren't planning on working on it anytime soon.
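For reference, such a fan-out could look like the following sketch, assuming the AWS SDK v2 and a hypothetical `refresh-one-query` Lambda that refreshes a single redis key:

```js
// Sketch only: fan out one async Lambda invocation per cached query,
// so each refresh gets its own timeout window instead of sharing one.
const AWS = require('aws-sdk');

const lambda = new AWS.Lambda();

async function fanOutRefresh(keys) {
  await Promise.all(keys.map((key) => lambda.invoke({
    FunctionName: 'refresh-one-query', // hypothetical per-query refresh Lambda
    InvocationType: 'Event', // async: don't wait for the result
    Payload: JSON.stringify({ key }),
  }).promise()));
}
```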
This strategy is very useful if you have lots of reads and very few writes.
It is very inefficient if you write a lot in GraphCMS (like automated massive writes).
Much simpler, this strategy fixes several downsides of Strategy 1, such as the race conditions and timeout issues described above.
Known limitations:
Because there is no automated refill of the cache, it only gets filled when a client performs an action that generates a query. If a query is rarely executed, it may happen that its first execution falls during an outage; the query would then fail, potentially crashing your app.
If the cache reset happens during a GCMS outage, your app will crash anyway: we don't check that GCMS is up and running before performing the cache reset. (but that'd be awesome, once they provide a way to do that!)
Contributor help needed!: If you know a way to detect the GraphCMS status, and therefore avoid a cache reset during an outage, we're very interested. To our knowledge, they don't provide any automated tool we could rely on to detect this before wiping all the data from the cache, but that'd definitely be an awesome addition!
This is more a workaround than a real feature, but because all the data sent in the request `body` is used as the redis key to index a query's results, you can take advantage of it.
In GraphQL, all queries (and mutations) accept an `operationName`.
For instance, the following GraphQL query:
```graphql
query {
  __schema {
    mutationType {
      kind
    }
  }
}
```
Will yield the following request `body`:
```json
{
  "operationName": null,
  "variables": {},
  "query": "{ __schema { mutationType { kind } }}"
}
```
Here, the `operationName` is `null`.
But if you specify it (`query myQueryName {}`), it will be reflected in the `operationName`, and this field is also used to index the query in redis.
So, if you wanted to automatically invalidate your cache every hour, you could just make the `operationName` dynamic, such as `query myQueryName_01_01_2019_11am {}`.
Since the value would change every hour, a different GraphQL query would be sent every hour, and the key used by redis would therefore be different every hour, leading to a cache refresh, because the newer query would actually be executed against the GraphCMS API before being cached.
This is a nice workaround that lets you define, very precisely, a different strategy, which works very differently and could basically be used to ensure the cached data is refreshed periodically. On the other hand, it wouldn't protect against outages, because it doesn't provide a fallback: if GraphCMS is down when a new query is executed for the first time, that query will fail.
But it's still nice to know, and it perfectly fits a "simple cache strategy" use-case.
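As an illustration, here is a minimal sketch of generating such a time-bucketed operation name on the client side (the `myQueryName` prefix and the query are just examples):

```js
// Sketch only: bucket the operationName by hour, so the redis key
// changes every hour and forces a fresh query against GraphCMS.
function hourlyOperationName(prefix = 'myQueryName') {
  const now = new Date();
  // e.g. "myQueryName_2019_1_1_11": one distinct name per hour
  return `${prefix}_${now.getUTCFullYear()}_${now.getUTCMonth() + 1}_${now.getUTCDate()}_${now.getUTCHours()}`;
}

const operationName = hourlyOperationName();
const query = `query ${operationName} { organisations { name } }`;
```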
Disclaimer: We'll likely not have the time to add another strategy if we don't need it ourselves. But feel free to open an issue and let's discuss it: we'll gladly advise you regarding the implementation details and discuss the specs together.
Using a protected endpoint, `/read-cache`, you can visualise all the queries (redis indexes) stored in the cache.
Each query comes with `version` and `updatedAt` fields, which help you understand when the cached value was last refreshed (and how many times it has been refreshed since it was initially added).
Structure example:
```json
{
  "createdAt": 1564566367896,
  "updatedAt": 1564566603538,
  "version": 2,
  "body": {
    "operationName": null,
    "variables": {},
    "query": "{ organisations { name __typename }}"
  }
}
```
Good to know:

- The `body` is the object representation of the `gql` version of the query (basically, what's sent over the network). It contains a `query`, which is the string representation of the query.
- The `body.query` is sanitized and doesn't fully represent the key stored in redis (`\n` trimmed, truncated to 50 chars, etc.), for the sake of readability.
- There is no way to see the data from this endpoint (as it could be sensitive); only the keys are shown. (it's also password-protected just in case, see `BASIC_AUTH_PASSWORD`)
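For example, reading the cached keys from Node could look like this sketch (assuming Node 18+ for the global `fetch`; the domain is a placeholder for your own deployment):

```js
// Sketch only: list the cached keys, authenticating with Basic Auth.
const credentials = Buffer.from(
  `${process.env.BASIC_AUTH_USERNAME}:${process.env.BASIC_AUTH_PASSWORD}`,
).toString('base64');

fetch('https://cache.example.com/read-cache', {
  headers: { Authorization: `Basic ${credentials}` },
})
  .then((res) => res.json())
  .then((keys) => console.log(keys));
```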
This service must be resilient and reliable: it relies on Redis when the GraphCMS endpoint is down.
But what happens if Redis fails instead of GraphCMS?
In that scenario, the outcome depends on the Cache API endpoint used:

- `/cache-query`: A Redis error when looking up a previous query result is gracefully handled and redis is bypassed: a request is sent to the GraphCMS endpoint and the results are returned to the client. This makes the service very reliable, as clients will still receive proper results even if Redis is down. In the catastrophic case where both GraphCMS and Redis are down at the same time, a 500 response is returned to the client.
- `/refresh-cache`: This endpoint cannot work without a working redis connection, and will therefore return a 500 response.
- `/reset-cache`: Same, it cannot work without a working redis connection, and will return a 500 response.
- `/read-cache`: Same, it cannot work without a working redis connection, and will return a 500 response.

The most important endpoint is `/cache-query`, as it's what clients use to fetch data from GraphCMS. It is therefore the most resilient: it will return proper results even if GraphCMS is down (provided the query was executed previously and its result was properly cached), or if Redis is down (by re-playing the query through GraphCMS). But it can't handle both being down simultaneously.
We use a `logger` instance of Winston, which is configured to silence any logs on production environments that aren't of level `error` or higher.
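A minimal sketch of such a configuration (the exact setup in this project may differ):

```js
// Sketch only: a Winston logger that only emits `error` and above in production.
const winston = require('winston');

const logger = winston.createLogger({
  level: process.env.NODE_ENV === 'production' ? 'error' : 'debug',
  transports: [new winston.transports.Console()],
});

logger.info('silenced in production');
logger.error('always visible');
```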
Logs on AWS (CloudWatch) can be accessed by running:

- `NODE_ENV=production yarn logs:cache-query`
- `NODE_ENV=production yarn logs:read-cache`
- `NODE_ENV=production yarn logs:refresh-cache`
- `NODE_ENV=production yarn logs:status`

If no `NODE_ENV` is defined, the `staging` environment is used by default.
Epsagon is a tool that helps troubleshoot what happens on AWS. It lets you see what happens on the backend by analysing I/O network calls, and it generates graphs that are very helpful for pinpointing a problem's source. (See blog)
Traces are configured within the project; the only required information is the `EPSAGON_APP_TOKEN` environment variable.
Traces are the most interesting feature of Epsagon, and what you may eventually pay for. They allow you to visually understand what happens on the backend, and to get meaningful information such as delays, return codes, logs, etc.
Epsagon can also be used as a monitoring service, through its `setError` function. (it's manually disabled in the `test` environment through the `DISABLE_EPSAGON` env variable)
Errors caught through `setError` are handled by Epsagon as `Exception`s and can be redirected to a slack channel using their alerts service.
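For illustration, reporting an error to Epsagon might look like the following sketch (assuming the `epsagon` Node package; `doRefresh()` is a hypothetical stand-in for the actual logic):

```js
// Sketch only: flag a caught error in Epsagon so it shows up as an
// Exception (and can trigger a Slack alert through their alerts service).
const epsagon = require('epsagon');

async function refreshCacheSafely() {
  try {
    await doRefresh(); // hypothetical: the actual refresh logic
  } catch (err) {
    epsagon.setError(err); // mark the trace as failed in Epsagon
    throw err;
  }
}
```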
They are very active on their slack and offer engineering-level support.
Epsagon comes with a Free plan that includes 100,000 traces/month, which is more than enough for our use-case. See their pricing page for more information.
Epsagon will automatically be disabled if you don't provide an `EPSAGON_APP_TOKEN` environment variable.
Epsagon is disabled in the `test` environment, see jest-preload.js.
Known issues:

- `basic-auth` Authorizer. (issue on their side with AWS API GW)
- `API GW > Lambda > API GW` infinite loop on `/refresh-cache` when used. It has therefore been disabled for that particular endpoint, see "FIXME". If you are interested in the issue, watch this and this; it basically generated 10k calls in 1h and cost $3.

Endpoints:

- `POST /cache-query`: expects a GraphQL query in the request `body`. (the same way it's natively handled by the GCMS API)
- `POST /refresh-cache`: protected by an authorization header `GraphCMS-WebhookToken`, which must contain the same token as the one defined in your `REFRESH_CACHE_TOKEN` environment variable.
- `POST /reset-cache`: protected by an authorization header `GraphCMS-WebhookToken`, which must contain the same token as the one defined in your `REFRESH_CACHE_TOKEN` environment variable.
- `GET /read-cache`: protected by Basic Auth, see the `BASIC_AUTH_USERNAME` and `BASIC_AUTH_PASSWORD` env variables.
- `GET /status`
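For instance, hitting `/cache-query` from Node could look like this sketch (placeholder domain; assuming Node 18+ for the global `fetch`; the body shape matches the examples above):

```js
// Sketch only: send a GraphQL query to the cache instead of GraphCMS.
fetch('https://cache.example.com/cache-query', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    operationName: null,
    variables: {},
    query: '{ organisations { name __typename }}',
  }),
})
  .then((res) => res.json())
  .then((data) => console.log(data));
```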
This service also supports the deployment and management of multiple redis caches: one per customer (AKA "an instance").
Basically, it allows you to spawn multiple Cache services, each with its own Redis connection and its own GraphCMS/GraphQL connection. You could also re-use credentials and tokens to share the same redis connection between several instances; it's not what we recommend, but it's up to you.
Each instance is therefore completely separated from the others, with its own redis cache, its own Lambda, and its own API Gateway. It's not more expensive either (assuming you're using a free RedisLabs plan and thus ignoring Redis costs): since the AWS infrastructure is on-demand, it costs the same whether all the load goes through one lambda or is spread across multiple lambdas. See Limitations.
It would still be possible to use just one redis instance with different databases (one db per customer, but the same connection for all); it really depends on your Redis service. Separation by clusters, though, is not handled by our Cache system. (feel free to open an issue and propose a PR!)
In case you forked the project and would like to keep it up to date with this boilerplate, here are a few built-in scripts to help you out:

- `yarn sync:fork` will `git pull --rebase` the boilerplate `master` branch into your own
- `yarn sync:fork:merge` will `git pull` the boilerplate `master` branch into your own

This is meant to be used manually, if you ever want to upgrade without trouble.
N.B: Using the rebase mode will force you to force-push afterwards (use it if you know what you're doing). Using the merge mode will create a merge commit (ugly, but simpler). We use the rebase mode for our own private fork.
You can run interactive tests using Jest with the `yarn test` script.
CodeBuild is configured to run CI tests using the `yarn test:coverage` script.
The `test:coverage` script is executed with the `--detectOpenHandles --forceExit` options, because the tests don't close all redis connections; without `--forceExit`, jest hangs and doesn't send the coverage report.
We weren't able to figure out the source of this, as it is very hard to see when connections are opened/closed during tests. (Note to self: may be related to `beforeEach/afterEach` not being executed on children `describe > test`)
This step is useful only if you've forked/cloned the project and want to configure CI using AWS CodeBuild.
Using the AWS Console > CodeBuild: watch the video tutorial.
Disclaimer: In the video, we forgot to enable "Privileged" mode in `Environment > Image > Additional configuration`, and had to go to `Environment > Override image` to fix it.
We created our Redis instances on Redis Labs.
As we run on a Free plan, there are a few limitations to consider.
Due to those limitations, we strongly recommend running this service with one instance per customer (multi-tenancy). This way, you will avoid edge cases such as:

- CustomerA triggering too many connections, which would take down CustomerD.
- Adding a CustomerZ that caches a bit more data, going over the 30MB limit and hence impacting all your customers.
- Triggering a cache refresh refreshes all queries, without any knowledge of "to whom" each query/data belongs, which is likely not what you want. Using a dedicated redis instance per customer fixes that too.
One important thing not to miss when creating the Subscription is to select the right availability zone (AZ), which depends on where you're located.
We selected `eu-west-1` (Ireland), because it's the closest to us.
You won't be able to select a different AZ on the free plan, so choose carefully: the database can only be created in the same region as the one selected for the subscription.
Once a subscription is created, you can create a database (our redis instance).
A redis instance can be configured with those eviction policies:

- `noeviction`: returns an error if the memory limit has been reached when trying to insert more data
- `allkeys-lru`: evicts the least recently used keys out of all keys
- `allkeys-lfu`: evicts the least frequently used keys out of all keys
- `allkeys-random`: randomly evicts keys out of all keys
- `volatile-lru`: evicts the least recently used keys out of keys with an "expire" field set
- `volatile-lfu`: evicts the least frequently used keys out of keys with an "expire" field set
- `volatile-ttl`: evicts the shortest time-to-live and least recently used keys out of keys with an "expire" field set
- `volatile-random`: randomly evicts keys with an "expire" field set

The recommended choice is `allkeys-lfu`, so that the impact of re-fetching data is minimised as much as possible.
Known limitations:

- The `/refresh-cache` endpoint has a timeout of 30 seconds, and there is no built-in way to handle workloads longer than 30s yet. This can be an issue if there are too many GraphCMS queries in the redis cache (which will trigger a timeout error), as they may not all be updated when trying to invalidate the redis cache. If a timeout happens, you can tell which keys have been updated by looking at the `updatedAt` data in `/read-cache`, but there is no built-in way to automatically handle this limitation (yet). Also, even if you run `/refresh-cache` multiple times, the redis keys are going to be refreshed in the same order, so it will likely fail on the same keys across multiple attempts. (though that also depends on how fast the GCMS API replies to each call, which is not predictable at all)
- When `/refresh-cache` or `/read-cache` is called, the `redis.keys` method is used, which is blocking and not recommended for production applications. A better implementation should be made there, probably following this (see the sketch below).
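A non-blocking alternative could look like this sketch, assuming ioredis (which exposes `SCAN` through `scanStream`):

```js
// Sketch only: collect keys with SCAN (non-blocking, incremental)
// instead of KEYS (which blocks redis while it scans everything).
const Redis = require('ioredis');

const redis = new Redis(process.env.REDIS_URL);

function scanAllKeys(pattern = '*') {
  return new Promise((resolve, reject) => {
    const keys = [];
    const stream = redis.scanStream({ match: pattern, count: 100 });
    stream.on('data', (batch) => keys.push(...batch)); // a few keys per batch
    stream.on('end', () => resolve(keys));
    stream.on('error', reject);
  });
}
```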
It is not such a concern though, since those endpoints should rarely be called, and it won't be an issue as long as the redis store doesn't contain lots of keys anyway.

Updates are consolidated in our CHANGELOG file.
It's meant to be a developer-friendly way to know what benefits you'll get from updating your clone/fork, and it provides an update history.
We use Semantic Versioning for this project: https://semver.org/. (`vMAJOR.MINOR.PATCH`, e.g. `v1.0.1`)
Note: You should write the CHANGELOG.md doc before releasing the version; this way, it'll be included in the same commit as the built files and the version update.

Then, release a new version:

`yarn run release`

This command will prompt you for the version to update to, create a git tag, build the files, and commit/push everything automatically.
Don't forget we are using SemVer; please follow our SemVer rules.

Pro hint: use a `beta` tag if the work is in progress (or you're unsure), to avoid releasing WIP versions that look legit.
Code style is enforced by `.editorconfig` and the files within the `.idea/` folder.
We also use ESLint, and extend the Airbnb code style.
WebStorm is the preferred IDE for this project, as it comes pre-configured with debug configurations and code style rules.
Only common configuration files (meant to be shared) are tracked on git. (see `.gitignore`)
This project is being maintained by:
Unly is a socially responsible company, fighting inequality and facilitating access to higher education. Unly is committed to making education more inclusive, through responsible funding for students. We provide technological solutions to help students find the necessary funding for their studies.
We proudly participate in many TechForGood initiatives. To support and learn more about our actions to make education accessible, visit:
Tech tips and tricks from our CTO on our Medium page!
#TECHFORGOOD #EDUCATIONFORALL