Ocadotechnology Bbq Save Abandoned

Create backups of BigQuery datasets/tables

Project README

DEPRECATED

Backup Big Query (BBQ) has been deprecated and is no longer supported.

If you require extended backup retention, we encourage you to opt for Big Query Snapshots as an alternative to using BQ Copy jobs.

Backup Big Query (BBQ)

BBQ (read: barbecue) is a python app that runs on Google App Engine (GAE) and creates daily backups of BigQuery tables.

Setup
Motivation
Features
High level architecture
Usage
Contributing

Setup

To install BBQ in GCP, follow installation steps from Setup.md doc.

Below button opens Setup.md in Google Cloud Shell, where you could instantly follow described steps.

Motivation

Google BigQuery is fast, highly scalable, cost-effective and fully-managed enterprise data warehouse for analytics at any scale. BigQuery automatically replicates data and keeps a 7-day history of changes.

BBQ development started in 2016. BigQuery supported only snapshot decorators, which had some limitations when tables and datasets were recreated. Currently BigQuery time travel feature allows you to access the data from any point within the last seven days. If this feature is sufficient for you, then you don't need BBQ at all.

Current main use case of BBQ is to create BigQuery backups for longer than 7 days.

Our motivation for building BBQ was to:

protect crucial data against application bug, user error or malicious attack,
be able to restore to multiple versions of our data,
restore data at scale (i.e. thousands of tables at the same time) which can be part of Disaster Recovery plan.

Features

Main BBQ features include:

highly scalable daily backup of BigQuery tables (hundreds of thousands backups),
- both single and partitioned tables are supported,
simple, access-based whitelisting strategy. BBQ will backup tables it has access to via service account,
retention - automatic deletion of old backups based on age and/or number of versions,
restore - BBQ can restore:
- whole datasets,
- selected tables/partitions/versions.

BBQ will not backup:

external data sources,
Views (you can use GCP Census for that),
Dataset/table labels as they are not copied by BigQuery copy job (again, you can use GCP Census for that)
Empty partitioned tables without any partitions,
Clustered partitioned table,
Tables in very rare use cases.

Known caveats

Modification of table metadata (including table description) qualifies table to be backed up at the next cycle. It can be a problem for partitioned tables, where such change updates last modified time in every partition. Then BBQ will backup all partitions again, even though there was no actually change in partition data,
There's 10,000 copy jobs per project per day limit, which you may hit on the first day. This limit can be increased by Google Support,
Data in table streaming buffer will be backed up on the next run, once the buffer is flushed. BBQ uses copy-job for creating backups and "Records in the streaming buffer are not considered when a copy or extract job runs" (check Life of a BigQuery streaming insert for more details),
When table name is longer than 400 characters, then in rare cases BBQ may backup tables more than once. Such backup duplicates are automatically removed by retention process.

High level architecture

Architecture diagram

BBQ consists of:

multiple source projects - data of those projects will be backed up,
BBQ project - main project where GAE runs and backups are stored,
bbq metadata storage project - project used to separate BBQ backups that should be immutable and invisible for users without any particular reason from metadata (e.g. datastore exports, SLI)
(optionally) default restoration project - BBQ will restore data into that project by default if provided in config.yaml.

BBQ allows to perform 3 operations:

backups - create backup tables of source tables,
restore - copies selected backup data into restore project,
retention - prunes backups based on selected rules.

BBQ is using Datastore as the main database to store all metadata about backups.

Backup process

BBQ initially creates backups for all source tables, to which it has access. When source table is modified, BBQ will create a backup within 36 hours. Backups are created using copy-job in the same location/region as source data.

Example of single table backup

Single table backup process

BBQ can hold multiple versions of the same source table. Every partition in partitioned table is treated as separate table (i.e. BBQ copies only modified partitions). If source table has expiration time set, the backup table will not preserve this property (so that backup never expires).

Example of partitioned table backup

Partitioned tabke backup process

Restore process

There are few ways in which you may restore data:

restoring whole dataset. Most recent backups of tables in this dataset will be restored,
restoring table by providing specific backup version. List of backups can be provided as input.

BBQ supports restoring data only into original source project or default restoration project provided in config.yaml. Both of them requires to assign BBQ Service Account appropriate IAM role that allows to manage Big Query data. It is also possible to set write and create disposition that specifies whether BBQ should append or replace source data.

As appending or replacing production data is super risky it is highly recommended to do it carefully, giving BBQ write access to source project should be temporary!

Example of restoring selected partitions from partitioned table

Restore process

Copy job limit

There's 10,000 copy jobs per project per day limit, which you may hit during the restoration. This limit can be increased by Google Support.

Note that partitions are represented as individual tables in backup storage. If you try to restore dataset with 10 tables with 500 partitions each, it will require 5000 copy jobs which is half of your daily quota.

Retention process

Every day retention process scans all backups to find and delete backups matching specific criteria for given source table/partition:

if there are multiple backups per day, the most recent one is retained. Multiple backups per day are created in rare cases (e.g. when task queue task is executed more than one time),
for backups younger than 7 months, the 10 most recent ones are retained,
for backups older than 7 months, single most recent backup is retained,
if source table is deleted, then the last backup is deleted after 7 months after deletion.

Example of 10 versions retention for backups younger than 7 months

Retention process

Example of 7 months old backup deletion and source deletion grace period

Retention process

Usage

How to run backups?

Backup process is scheduled periodically for all specified projects (check config.yaml to specify which projects to backup and config/cron.yaml to configure schedule time). Note that cron uses UTC.

However, you may also invoke backup process manually from cron jobs.

It's worth to underline that:

Backups for partitions are scheduled randomly within the range of time specified in config.yaml,
It is possible to check the progress via Task Queues.

On-demand table backup

BBQ provides option for scheduling on-demand backup for single non-partitioned table or single partition of partitioned table.

Note that on-demand flow will ignore checking prerequisites before scheduling copy job, as opposed to normal flow.

On-demand table backup is available from <your-project-id>.appspot.com site in 'Advanced' section.

How to find backup for given table?

Option 1

In order to find backup Y for table X:

In Cloud Console visit Datastore,
Find Key literal for table X:
- Select Table kind,
- Filter entities equal to X.project_id, X.dataset_id, X.table_id or X.partition_id,
- Find table X from results and copy Key literal,
Query backup Y:
- Select Backup kind,
- Filter entities by Key that has ancestor X.Key literal.

To check the content for given backup Y in Big Query:

Open Big Query in BBQ storage project,
Filter tables by Y.dataset_id or Y.table_id in search bar,
Select table and check Schema, Details or Preview tab.

Option 2

It is possible to export Datastore kinds and query them in Big Query, this method is recommended for more frequent usage.

To enable export, check Cloud Datastore export section.
Export is scheduled periodically; however, to have latest data you should invoke them manually from cron jobs.

To find backup Y for table X open Big Query in BBQ storage project Z - replace X, Y, Z in query below and execute:

#StandardSQL
WITH last_tables AS (
  SELECT *
  FROM `Y.datastore_export.Table_*`
  WHERE _TABLE_SUFFIX IN (
    SELECT MAX(_TABLE_SUFFIX) FROM `Y.datastore_export.Table_*`
  )
), last_backups AS (
  SELECT *, CAST(SPLIT(__key__.path, ',')[OFFSET(1)] AS INT64) AS PARENT_ID
  FROM `Y.datastore_export.Backup_*`
  WHERE _TABLE_SUFFIX IN (
    SELECT MAX(_TABLE_SUFFIX) FROM `Y.datastore_export.Backup_*`
  )
)
SELECT * FROM last_backups WHERE PARENT_ID IN (
  SELECT __key__.id FROM last_tables
  WHERE project_id = X.project_id AND dataset_id = X.dataset_id AND table_id = X.table_id
)

How to restore data from backups?

There are several options to restore data, available from <your-project-id>.appspot.com

Restore whole dataset
Restore single table
Restore tables from custom list of backups

Checking status of restoration process

Restore process is asynchronous. To check status of process, follow links returned in response:

restorationWarningsOnlyStatusEndpoint - it shows general progress only and only failed copy jobs (if any)
restorationStatusEndpoint - it shows all information about progress and each copy job.

We advise to use restorationWarningsOnlyStatusEndpoint for bigger datasets as the response time is quicker and response size smaller.

Community

Subscribe or post to [email protected] to follow along and ask questions about the BBQ.

Contributing

Feel free to submit feature requests and bug reports under Issues.

If you would like to contribute to our development efforts, please review our contributing guidelines and submit a pull request.

Open Source Agenda is not affiliated with "Ocadotechnology Bbq" Project. README Source: ocadotechnology/bbq

Stars

Open Issues

Last Commit

8 months ago

License

Apache-2.0

Open Source Agenda Badge

<a href="https://www.opensourceagenda.com/projects/ocadotechnology-bbq"><img src="https://www.opensourceagenda.com/projects/ocadotechnology-bbq/reviews/badge.svg" alt="Open Source Agenda"></a>

Submit Review Review Your Favorite Project

Submit Resource Articles, Courses, Videos

Submit Article Submit a post to our blog

From the blog

Dec 11, 2022

How to Choose Which Programming Language to Learn First?

From the blog

Dec 11, 2022

Ocadotechnology Bbq Save Abandoned

DEPRECATED

Backup Big Query (BBQ) has been deprecated and is no longer supported.

We recommend utilizing the Time Travel and Fail Safe features in Google Cloud Platform's Big Query instead.

If you require extended backup retention, we encourage you to opt for Big Query Snapshots as an alternative to using BQ Copy jobs.

Backup Big Query (BBQ)

Table of contents

Setup

Motivation

Our motivation for building BBQ was to:

Features

Main BBQ features include:

BBQ will not backup:

Known caveats

High level architecture

Backup process

Example of single table backup

Example of partitioned table backup

Restore process

Example of restoring selected partitions from partitioned table

Copy job limit

Retention process

Example of 10 versions retention for backups younger than 7 months

Example of 7 months old backup deletion and source deletion grace period

Usage

How to run backups?

On-demand table backup

How to find backup for given table?

Option 1

Option 2

How to restore data from backups?

Checking status of restoration process

Community

Contributing

Open Source Agenda Badge

From the blog

How to Choose Which Programming Language to Learn First?

From the blog

How to Choose Which Programming Language to Learn First?