Decentralized, trustless backup tool
Scatter your data before losing it
Backup tool that treats its stores as throwaway, untrustworthy commodity
Decentralization: avoid trusting any one third-party with all your data
Block-level de-duplication
RAID-like error correction
Redundancy: N-copies duplication
Stream-based: less is more
And:
...pick some or all of the above, apply in any order.
Indeed, scat decomposes backing up and restoring into basic stream processors ("procs") arranged like filters in a pipeline. They're chained together, piping the output of proc x to the input of proc x+1. As such, though created for backing up data, its core doesn't actually know anything about backups, but provides the necessary procs.
Such modularity enables unlimited flexibility: stream data from anywhere (local/remote file, arbitrary command, etc.), process it in any way (encrypt, compress, filter through arbitrary command, etc.), to anywhere: write/read/upload/download is just another proc at the end/beginning of a chain.
                 +---------------------------------+
                 | chain proc                      |
                 |                                 |
+---------+      | +--------+           +--------+ |
| chunk 0 +----->| | proc 0 |           | proc 1 | |
| (seed)  |      | +--+-----+           +--------+ |
+---------+      |    |                      ^     |
                 |    |    +-------+         |     |
                 |    +--->|+-------+   -----+     |
                 |         +|+-------+             |
                 |          +| chunk |             |
                 |           +-------+             |
                 +---------------------------------+
...where seed may be a tar stream and procs 0..n would be split, checksum, parity, gzip, scp, etc., all part of a chain that is itself a proc.
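To get a feel for the chaining idea outside of scat, here is a toy sketch in plain shell. The `chain` function is a hypothetical helper written for this illustration, not a scat proc or part of its implementation; it only shows how feeding the output of one filter into the next yields something that is itself a filter.

```shell
#!/bin/sh
# Toy illustration only: `chain` is a hypothetical helper, not a scat proc.
# It feeds stdin through each given command in order, the way a chain proc
# pipes the output of proc x into the input of proc x+1.
chain() {
    if [ "$#" -eq 0 ]; then
        cat                     # empty chain: pass the stream through untouched
        return
    fi
    first=$1
    shift
    # Run the first filter, then recurse on the remaining ones.
    sh -c "$first" | chain "$@"
}

# A chain is itself a filter, so it can sit inside a larger pipeline.
printf 'hello\n' | chain 'tr a-z A-Z' 'cut -c1-3'   # prints HEL
```

Each argument plays the role of one proc; swapping, adding or removing arguments changes the processing without touching the helper itself.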
Full-length 4K demo video: on YouTube
Put scat in your $PATH.
Stream processing, like performing a backup from a tar stream, is done via a proc chain formulated as a proc string.
The following examples showcase proc strings for typical use cases. They're good starting points: copy them into shell scripts and experiment, backing up and restoring test files until you fully understand the mechanics at play and get the behaviour you want. It's important to get comfortable in both directions, so that you back up often and don't fear the moment a restore becomes necessary.
See Proc string for syntax documentation and the full list of available procs.
Example of backing up dir foo/ in a RAID 5 fashion to 2 Google Drive accounts and 1 VPS (compress, encrypt, 2 data shards, 1 parity shard, upload >= 2 exclusive copies, using 8 threads and 4 concurrent transfers)
foo/
Command:
$ tar c foo | scat -stats "split | backlog 8 {
    checksum
    | index foo_index
    | gzip
    | parity 2 1
    | checksum
    | cmd gpg --batch -e -r 00828C1D
    | group 3
    | concur 4 stripe(1 2
        mydrive=rclone(drive:tmp)=7gib
        mydrive2=rclone(drive2:tmp)=14gib
        myvps=scp(bankmon tmp)
      )
}"
The combination of parity, group and stripe creates a RAID 5:
parity(2 1): split into 2 data shards and 1 parity shard
group(3): aggregate all 3 shards for striping
stripe(1 2 ...): interleave those across given stores, making 1 copy of each, ensuring at least 2 of 3 are on distinct stores from the others so we can afford to lose any one of them and still be able to recompute original data
Order matters. Notably:
index
gpg -e is not idempotent, to avoid re-writing/uploading identical chunks
stripe
Note:
Both backlog and concur are being used above. The former limits the number of concurrent instances of a chain proc ({}) to 8, while the latter limits the number of concurrent transfers by stripe to 4. They may appear redundant: why not one or the other for both? They actually take different types of arguments and have distinct purposes. See backlog and concur.
rclone(drive:tmp) and scp(bankmon tmp) have different argument layouts. The former takes a "remote" argument (passed as-is to rclone), while the latter's arguments are "[user@]host" (passed as-is to ssh) and the remote directory. See rclone and scp.
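To make the "lose any one of them and still recompute the original data" property of the 2 data + 1 parity layout concrete, here is a small worked example using XOR parity, the scheme classic RAID 5 relies on. This is an assumption made for illustration: the text does not say which algorithm the parity proc uses internally, and `xor_shards` is a hypothetical helper, not a scat command.

```shell
#!/bin/sh
# Illustration of single-parity recovery using XOR, as in classic RAID 5.
# Assumption: this is NOT necessarily the algorithm scat's parity proc uses.

# xor_shards "a1 a2 ..." "b1 b2 ...": XOR two shards given as
# space-separated decimal byte values of equal length.
xor_shards() {
    a=$1 b=$2 out=
    for x in $a; do
        y=${b%% *}              # first remaining byte of the second shard
        b=${b#* }               # drop it (no-op on the last byte)
        out="$out$(( x ^ y )) "
    done
    printf '%s\n' "${out% }"
}

shard1="104 105"                # two data shards...
shard2="33 10"
parity=$(xor_shards "$shard1" "$shard2")    # ...and one parity shard
echo "parity:    $parity"                   # prints parity:    73 99

# Pretend shard1 was lost: XOR the two survivors to get it back.
echo "recovered: $(xor_shards "$parity" "$shard2")"   # prints recovered: 104 105
```

Any single shard can be rebuilt from the other two in the same way, which is why spreading the three shards over distinct stores tolerates the loss of one store.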
Reverse chain:
foo
Command:
$ scat -stats "uindex | backlog 8 {
    backlog 4 multireader(
        drive=rclone(drive:tmp)
        drive2=rclone(drive2:tmp)
        bankmon=scp(bankmon tmp)
    )
    | cmd gpg --batch -d
    | uchecksum
    | group 3
    | uparity 2 1
    | ugzip
    | join -
}" < foo_index | tar x
The above only demonstrates a subset of what's possible with scat. More procs exist, and they may be assembled in different ways to suit one's particular needs. See Proc string.
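When experimenting with backup and restore chains on test files, a small round-trip check can confirm that what comes back matches what went in. The sketch below is one possible approach, not something scat provides: `roundtrip_check` is a hypothetical helper that takes the backup and restore commands as shell strings, so it makes no assumption about the proc strings themselves.

```shell
#!/bin/sh
# roundtrip_check SRC_DIR BACKUP_CMD RESTORE_CMD
# Hypothetical helper, not part of scat.
#   BACKUP_CMD  reads a tar stream on stdin and stores it somewhere.
#   RESTORE_CMD writes the restored tar stream to stdout.
# Succeeds only if the restored tree is identical to SRC_DIR.
roundtrip_check() {
    src=$1 backup=$2 restore=$3
    out=$(mktemp -d) || return 1
    parent=$(dirname "$src")
    name=$(basename "$src")
    status=0
    {
        # Back up, restore into a scratch directory, then compare trees.
        tar -C "$parent" -cf - "$name" | sh -c "$backup" &&
        sh -c "$restore" | tar -C "$out" -xf - &&
        diff -r "$parent/$name" "$out/$name"
    } || status=$?
    rm -rf "$out"
    return "$status"
}

# Possible usage with your own scripts (script names are placeholders):
#   roundtrip_check foo './backup.sh' './restore.sh' && echo "round trip OK"
```

Running this after every change to a proc string is a cheap way to catch a chain whose restore side no longer mirrors its backup side.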
$ scat [options] <proc>
Options:
-stats print stats: rates, quotas, etc.
-version show version
-help show usage
Args:
<proc> proc string: see Proc string
Being stream-based implies not knowing in advance the total size of the data to process. Thus, no progress percentage can be reported. However, when transferring files or directories, size can be known by the caller and passed to pv.
Note: When piping from pv, do not pass the -stats option to scat. Both commands would step on each other's toes writing to stderr and moving the terminal cursor.
File backup:
$ pv my_file | scat "..."
Directory backup (approximate progress, not taking into account tar headers):
# Using GNU du:
$ tar c my_dir | pv -s $(du -sb my_dir | cut -f1) | scat "..."
# On macOS, install GNU coreutils:
$ brew install coreutils
# ...then run the same command as above, replacing du with gdu
# ...or using stock Darwin du, even more approximate:
$ tar c my_dir | pv -s $(du -sk my_dir | cut -f1)k | scat "..."
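The GNU and Darwin variants above can be folded into one helper that picks whichever du flavour is available. This is a convenience sketch under the assumption that a failing `du -sb` means a non-GNU du; `dir_size_bytes` is a hypothetical name, not something shipped with scat or pv.

```shell
#!/bin/sh
# dir_size_bytes DIR: print an approximate size of DIR in bytes.
# Hypothetical helper; the result is only used as a progress estimate.
dir_size_bytes() {
    if size=$(du -sb "$1" 2>/dev/null); then
        # GNU du: -b reports the apparent size in bytes.
        printf '%s\n' "${size%%[!0-9]*}"
    else
        # BSD/Darwin du has no -b: fall back to 1024-byte blocks,
        # which is even more approximate.
        size=$(du -sk "$1") || return 1
        printf '%s\n' "$(( ${size%%[!0-9]*} * 1024 ))"
    fi
}

# Possible usage:
#   tar c my_dir | pv -s "$(dir_size_bytes my_dir)" | scat "..."
```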
Making snapshots boils down to versioning the index file in a git repository:
$ git init
$ git add foo_index
$ git commit -m "backup of foo"
Restoring a snapshot consists of checking out a particular commit and restoring using the old index file:
$ git checkout <commit-ish>
$ # ...use foo_index: see restore example
You could keep a single repository for all your backups and, after each backup, commit the index files along with the backup and restore scripts used to write and read them. This lets you modify proc strings from one backup to the next while reusing identical chunks, if any, and still restore old snapshots created with potentially different proc strings, without having to remember what they were at the time.
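The commit step can be scripted so that it runs right after each backup. The sketch below is one possible arrangement, assuming the index file is copied into a dedicated repository directory; `snapshot_index` is a hypothetical helper and the paths in the usage line are placeholders.

```shell
#!/bin/sh
# snapshot_index REPO_DIR INDEX_FILE [MESSAGE]
# Hypothetical helper: copy an index file into a git repository and commit
# it, creating the repository on first use.
snapshot_index() {
    repo=$1 index=$2 msg=${3:-"backup $(date -u +%Y-%m-%dT%H:%M:%SZ)"}
    [ -d "$repo/.git" ] || git init -q "$repo" || return 1
    cp "$index" "$repo/" || return 1
    name=$(basename "$index")
    git -C "$repo" add -- "$name" || return 1
    # Nothing staged means the index is unchanged: skip the empty commit.
    if git -C "$repo" diff --cached --quiet; then
        return 0
    fi
    git -C "$repo" commit -q -m "$msg"
}

# Possible usage, right after a backup that wrote foo_index:
#   snapshot_index ~/backups/indexes foo_index "backup of foo"
```

Skipping the commit when nothing changed keeps the history to one commit per distinct snapshot, which makes picking a commit to restore from easier.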
scat was born out of frustration with existing backup solutions.
As of writing the initial version, I had one or more of the following gripes with available solutions:
I wanted to be able to:
without:
I believe scat achieves these objectives 🙂