Create backups of BigQuery datasets/tables
BBQ (read: barbecue) is a Python app that runs on Google App Engine (GAE) and creates daily backups of BigQuery tables.
To install BBQ in GCP, follow the installation steps in the Setup.md doc.
The button below opens Setup.md in Google Cloud Shell, where you can follow the described steps right away.
Google BigQuery is a fast, highly scalable, cost-effective, and fully managed enterprise data warehouse for analytics at any scale. BigQuery automatically replicates data and keeps a 7-day history of changes.
When BBQ development started in 2016, BigQuery supported only snapshot decorators, which had some limitations when tables and datasets were recreated. Currently the BigQuery time travel feature allows you to access data from any point within the last seven days. If this feature is sufficient for you, then you don't need BBQ at all.
The main use case of BBQ today is to keep BigQuery backups for longer than 7 days.
BBQ performs three operations: backup, retention, and restore.
BBQ uses Datastore as its main database to store all metadata about backups.
BBQ initially creates backups for all source tables to which it has access. When a source table is modified, BBQ creates a new backup within 36 hours. Backups are created using copy jobs in the same location/region as the source data.
BBQ can hold multiple versions of the same source table. Every partition in a partitioned table is treated as a separate table (i.e. BBQ copies only modified partitions). If the source table has an expiration time set, the backup table does not preserve this property (so that the backup never expires).
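The per-partition bookkeeping described above can be sketched in plain Python. This is an illustrative model, not BBQ's actual code: a backup is scheduled only for partitions modified since their last backup (or never backed up at all).

```python
from datetime import datetime

def partitions_needing_backup(source_partitions, last_backups):
    """Select partitions whose source data changed after their last backup.

    source_partitions: dict partition_id -> last_modified datetime
    last_backups: dict partition_id -> datetime of the newest backup (may be absent)
    """
    to_backup = []
    for partition_id, modified_at in source_partitions.items():
        backed_up_at = last_backups.get(partition_id)
        if backed_up_at is None or modified_at > backed_up_at:
            to_backup.append(partition_id)
    return sorted(to_backup)

source = {
    "20240101": datetime(2024, 1, 2, 12, 0),
    "20240102": datetime(2024, 1, 3, 9, 30),
}
backups = {"20240101": datetime(2024, 1, 2, 13, 0)}  # 20240102 was never backed up
print(partitions_needing_backup(source, backups))  # ['20240102']
```

Only `20240102` is selected: `20240101` was backed up after its last modification, so copying it again would be wasted work.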
There are a few ways in which you may restore data:
BBQ supports restoring data only into the original source project or into the default restoration project provided in config.yaml. Both require assigning the BBQ service account an IAM role that allows it to manage BigQuery data. It is also possible to set the write and create dispositions, which specify whether BBQ should append to or replace the source data.
As appending to or replacing production data is risky, do it carefully; giving BBQ write access to the source project should only be temporary!
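The append/replace choice above maps onto BigQuery's standard write dispositions. The following toy simulation uses the real disposition names but deliberately simplified semantics, to show what each one does to the destination table:

```python
def apply_write_disposition(existing_rows, restored_rows, disposition):
    """Simulate the effect of a restore copy job on destination table rows."""
    if disposition == "WRITE_TRUNCATE":   # replace destination data
        return list(restored_rows)
    if disposition == "WRITE_APPEND":     # append to destination data
        return list(existing_rows) + list(restored_rows)
    if disposition == "WRITE_EMPTY":      # fail unless the destination is empty
        if existing_rows:
            raise ValueError("destination table is not empty")
        return list(restored_rows)
    raise ValueError("unknown write disposition: %s" % disposition)

print(apply_write_disposition([1, 2], [3], "WRITE_APPEND"))    # [1, 2, 3]
print(apply_write_disposition([1, 2], [3], "WRITE_TRUNCATE"))  # [3]
```

WRITE_TRUNCATE is the destructive option, which is why write access for BBQ should be granted only for the duration of the restore.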
There is a limit of 10,000 copy jobs per project per day, which you may hit during restoration. This limit can be increased by Google Support.
Note that partitions are represented as individual tables in backup storage. If you try to restore a dataset of 10 tables with 500 partitions each, it will require 5,000 copy jobs, which is half of your daily quota.
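The arithmetic from the example above can be wrapped in a quick helper for estimating restoration cost against the daily quota (the 10,000 figure is the documented limit at the time of writing and may change):

```python
COPY_JOBS_DAILY_QUOTA = 10_000  # per project per day; can be raised by Google Support

def restore_copy_jobs(tables, partitions_per_table):
    """Each partition is stored as a separate backup table, so each needs its own copy job."""
    return tables * partitions_per_table

jobs = restore_copy_jobs(10, 500)
print(jobs)                          # 5000
print(jobs / COPY_JOBS_DAILY_QUOTA)  # 0.5 -> half the daily quota
```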
Every day, the retention process scans all backups to find and delete those matching specific criteria for a given source table/partition.
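The exact retention criteria are not listed here, but the shape of such a pass can be sketched as follows. The policy below, keeping only the N most recent backups per source table/partition, is purely illustrative and not BBQ's actual rule set:

```python
def backups_to_delete(backups_by_source, keep_last=3):
    """For each source table/partition, mark all but the newest keep_last backups.

    backups_by_source: dict source_id -> list of backup creation dates (sortable)
    Returns: dict source_id -> creation dates to delete, oldest first.
    """
    doomed = {}
    for source_id, created in backups_by_source.items():
        ordered = sorted(created)  # oldest first
        if len(ordered) > keep_last:
            doomed[source_id] = ordered[:-keep_last]
    return doomed

history = {"project.dataset.table": ["2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01"]}
print(backups_to_delete(history))  # {'project.dataset.table': ['2024-01-01']}
```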
The backup process is scheduled periodically for all specified projects (check config.yaml to specify which projects to back up and config/cron.yaml to configure the schedule). Note that cron uses UTC.
You may also invoke the backup process manually from the cron jobs page.
BBQ provides an option to schedule an on-demand backup of a single non-partitioned table or a single partition of a partitioned table.
Note that the on-demand flow skips the prerequisite checks that the normal flow performs before scheduling a copy job.
On-demand table backup is available from the <your-project-id>.appspot.com site, in the 'Advanced' section.
To find a backup Y for table X and check its content in BigQuery, you can export the Datastore kinds and query them in BigQuery; this method is recommended for more frequent usage:
#standardSQL
WITH last_tables AS (
  SELECT *
  FROM `Y.datastore_export.Table_*`
  WHERE _TABLE_SUFFIX IN (
    SELECT MAX(_TABLE_SUFFIX) FROM `Y.datastore_export.Table_*`
  )
), last_backups AS (
  SELECT *, CAST(SPLIT(__key__.path, ',')[OFFSET(1)] AS INT64) AS PARENT_ID
  FROM `Y.datastore_export.Backup_*`
  WHERE _TABLE_SUFFIX IN (
    SELECT MAX(_TABLE_SUFFIX) FROM `Y.datastore_export.Backup_*`
  )
)
SELECT * FROM last_backups WHERE PARENT_ID IN (
  SELECT __key__.id FROM last_tables
  WHERE project_id = X.project_id AND dataset_id = X.dataset_id AND table_id = X.table_id
)
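The SPLIT(__key__.path, ',')[OFFSET(1)] expression in the query above extracts the parent Table entity's numeric id from each Backup entity's Datastore key path. The same parsing in Python, assuming the key-path string format matches the Datastore export used in the query:

```python
def parent_id(key_path):
    """Extract the parent Table id from a key path like '"Table", 123, "Backup", 456'."""
    parts = key_path.split(",")
    # int() tolerates the leading space left by split, like CAST in the SQL above
    return int(parts[1])

print(parent_id('"Table", 5629499534213120, "Backup", 5707702298738688'))
# 5629499534213120
```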
There are several options to restore data, available from <your-project-id>.appspot.com.
The restore process is asynchronous. To check its status, follow the links returned in the response:
restorationWarningsOnlyStatusEndpoint - shows overall progress and only the failed copy jobs (if any)
restorationStatusEndpoint - shows full information about the progress and each copy job
We advise using restorationWarningsOnlyStatusEndpoint for bigger datasets, as the response time is quicker and the response size is smaller.
Subscribe or post to [email protected] to follow along and ask questions about BBQ.
Feel free to submit feature requests and bug reports under Issues.
If you would like to contribute to our development efforts, please review our contributing guidelines and submit a pull request.