🚀 Back up Google Takeout archives (YouTube channel and Google Photos) to Azure Storage at 1GB/s+ periodically, with minimal human toil and financial cost
Liftoff from Google Takeout into Azure Storage, repeatedly and very fast: 1GB/s+, or about 10 minutes total per takeout.
Gargantuan Takeout Rocket (GTR) is a toolkit of guides and software that helps you take your data out of Google Takeout and put it somewhere else safe: easily, periodically, and fast. It makes it easy to do the right thing of regularly backing up your Google account and related services such as your YouTube account or Google Photos.
GTR is not a fully automated solution, as that is impossible with Google Takeout's anti-automation measures, but it is an assistive one. GTR takes less than an hour to set up and less than 10 minutes every 2 months (or whatever interval you want) to use. Backing up 1TB on Azure costs about $1 per month, as long as you store each backup archive for a minimum of 6 months. You don't need a fast internet connection on your client, as all data transfer from Google to the backup destination is handled remotely by many servers in data centers. There are no bandwidth charges for the backup process; restoration in case of an emergency, however, is fairly expensive. All resources used are serverless and highly scalable, including down to zero.
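As back-of-the-envelope arithmetic (the price below is an assumption in the neighborhood of Azure's Archive-tier list price; check the current Azure pricing page for real numbers), the $1/month figure works out like this:

```python
# Assumed Archive-tier price of roughly $0.001 per GB per month (LRS).
# Real prices vary by region and over time -- check Azure's pricing page.
# Note: the Archive tier has an early-deletion period of 180 days,
# hence the 6-month minimum retention mentioned above.
ARCHIVE_PRICE_PER_GB_MONTH = 0.001

def monthly_storage_cost(terabytes: float) -> float:
    """Approximate monthly cost of keeping `terabytes` in the Archive tier."""
    return terabytes * 1024 * ARCHIVE_PRICE_PER_GB_MONTH

print(f"1TB in Archive: ~${monthly_storage_cost(1):.2f}/month")
```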
The only backup destination currently available in GTR is Microsoft Azure Blob Storage, due to Azure's unique API that allows commanding Azure Blob Storage to download from a remote URL. A Cloudflare Workers proxy is used to work around a URL-escaping bug and a parallelism limitation in the Azure Blob Storage API. Speeds of 1GB/s or more from Google Takeout to Azure Blob Storage's Archive tier have been observed with this setup.
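The remote-download API in question is Azure's "Put Block From URL" operation. As a rough sketch (the URL shape and headers follow Azure's REST conventions, but this is illustrative Python, not the extension's actual code), a single block is transloaded with an empty-bodied PUT like this:

```python
import base64

def put_block_from_url(blob_sas_url: str, block_index: int,
                       source_url: str, start: int, end: int):
    """Build the URL and headers for one 'Put Block From URL' call.

    blob_sas_url: destination blob URL with its SAS query string attached.
    The HTTP PUT itself carries no body; Azure's servers fetch the byte
    range from source_url on their side, so no data touches the client.
    """
    # Block IDs must be base64 strings of equal length within a blob.
    block_id = base64.b64encode(f"{block_index:08d}".encode()).decode()
    base, _, sas_query = blob_sas_url.partition("?")
    url = f"{base}?comp=block&blockid={block_id}&{sas_query}"
    headers = {
        "x-ms-copy-source": source_url,               # Azure downloads from here
        "x-ms-source-range": f"bytes={start}-{end}",  # inclusive byte range
        "x-ms-version": "2020-10-02",
    }
    return url, headers
```

In GTR's case, `source_url` points at the Cloudflare Workers proxy rather than directly at Google, which is what works around the escaping and parallelism problems mentioned above.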
A browser extension is provided to intercept downloads from Google Takeout and command Azure to download the file instead. Behind the scenes, the extension immediately stops and prevents the local download, discovers the temporary direct URL (valid for 15 minutes) to the Google Takeout archive, and remotely reads the size of the source file to generate a download plan consisting of 1000MB chunks. It then specially encodes the URL so Azure is able to download from Google via the Cloudflare Workers proxy, executes the plan by shotgunning all the download commands to Azure in parallel through the proxy to transload the file from Google as quickly as possible, and finally commits all the 1000MB chunks into one seamless file on Azure. The download for each file completes in 30 to 60 seconds, well before the direct URL expires, and the limits on how many parallel downloads of this archive (or other archives in the same takeout) can happen at once are rather high.
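The per-file flow can be sketched as follows (an illustrative reconstruction, not the extension's actual code): split the reported size into inclusive byte ranges, issue one "Put Block From URL" per range in parallel, then commit the block list to stitch the blocks into one blob.

```python
import base64

CHUNK = 1000 * 1000 * 1000  # 1000MB per block, as described above

def plan_chunks(total_size: int, chunk_size: int = CHUNK):
    """Split a file of total_size bytes into inclusive (start, end) ranges."""
    return [(start, min(start + chunk_size, total_size) - 1)
            for start in range(0, total_size, chunk_size)]

def block_list_xml(num_blocks: int) -> str:
    """Body for the final 'Put Block List' call, committing blocks in order
    so they appear as one seamless file on Azure."""
    ids = (base64.b64encode(f"{i:08d}".encode()).decode()
           for i in range(num_blocks))
    inner = "".join(f"<Latest>{bid}</Latest>" for bid in ids)
    return ('<?xml version="1.0" encoding="utf-8"?>'
            f"<BlockList>{inner}</BlockList>")
```

A 50GB archive yields 50 ranges; because Azure fetches every range concurrently, the wall-clock time is roughly that of transferring a single 1000MB chunk.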
A public instance of the Cloudflare Workers proxy is provided for convenience, but users can set up and run their own Cloudflare Workers proxy if desired and target it in the extension instead of the public one for privacy reasons. For most users who run their own proxy instead of using the public one, the free tier of Cloudflare Workers should suffice.
The original author of GTR's Google account is about 1.25TB in size (80% YouTube videos, 20% other; Google Photos is ~200GB). Pre-GTR, the backup procedure took at least 3 hours even with a VPS facilitating the transfer from Google Takeout, as even large cloud instances with big disks, lots of memory, and many CPUs would eventually choke on too many files being downloaded in parallel. The highest speed seen was about 300MB/s. It was also exhaustingly high-touch and toilsome, requiring many clicks, reauthorizations, and workspace setup. By delegating the download to Azure, with assists from the Cloudflare Workers proxy and the browser extension that make up GTR, the original author can transfer the 1.25TB of 50GB Google Takeout files to Azure Storage in 10 minutes at any time with little to no setup.
GTR is right for you if:
This guide is a continual work in progress. PRs are very much welcome!
If you need help or have questions, feel free to hit me up over Twitter or make an issue.
Let me know if the guide works for you as well!
This is something that you'll only have to do once.
You can adjust the numbers and redundancies as needed or desired.
See the GTR Proxy readme for details on setting one up yourself. You may want to set up your own GTR Proxy for privacy reasons. The Cloudflare Worker implementation is serverless, and no fees or usage accrue while it is idle. There are also no charges for incoming and outgoing bandwidth as long as Azure's and Google's servers reside on the same continent, and for most people, usage of their own GTR Proxy should fall under Cloudflare's free tier.
If you decide to use the public GTR Proxy, please see its privacy policy.
Install the extension in a Chromium-derived browser such as Google Chrome, Edge, Opera, or Brave. At the moment, the extension is not published in the web store, and it might never be. Look at the purpose of this repository and guess why from the diagram below:
I have no intention of risking my Google account to publish the extension. I assure you it's not malware, but I can't guarantee a Google robot won't think differently. I'm not eager to test the worst-case scenario; I'm just interested in preparing for it.
The extension has a rocket icon: 🚀. If you don't see it, click the puzzle icon, then click the rocket icon.
The extension UI can be seen by clicking on the rocket icon. This may or may not be the current UI, but it should look something like this:
If you've set up your own Cloudflare Workers proxy, set the GTR Proxy Base URL to yours. The default URL in the field is the public instance.
In your planner application of choice, remind yourself every 2 months (or whatever interval you want) to perform a backup using this. I have Todoist set up to remind me every 2 months.
You may also want to configure Google Takeout to run automatically every two months to back up your whole account.
Click "Generate SAS Token and URL" and copy the "Blob SAS URL".
Chrome:
Edge:
Don't panic.
Restoration and download are fairly expensive. This is the tradeoff for the speed and durability. It's worth it for me, for what it's worth.
Let's consider a 1TB restore:
Costs:
For 1TB, this will cost about $108. Small price for salvation.
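As a sanity check on that figure, here is the back-of-the-envelope arithmetic. Both per-GB prices below are assumptions in the ballpark of Azure's list prices; check the current pricing pages before relying on them:

```python
# Both prices are assumptions and vary by region -- check Azure's
# current pricing pages before relying on these numbers.
RETRIEVAL_PER_GB = 0.02   # assumed Archive-tier data-retrieval price
EGRESS_PER_GB = 0.087     # assumed internet egress price

def restore_cost(gigabytes: float) -> float:
    """Approximate one-time cost to rehydrate and download `gigabytes`."""
    return gigabytes * (RETRIEVAL_PER_GB + EGRESS_PER_GB)

print(f"1TB restore: ~${restore_cost(1024):.0f}")
```

Under these assumed prices, a 1TB restore comes out to roughly $110, in the same ballpark as the figure above; egress dominates the bill.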
and there's many more. oh there's just so many. too many.
Sometimes a hit, sometimes not. Just depends on how the community is feeling.
Just search Twitter for "google takeout"; you'll find users complaining about sizes and archive counts quite a lot.
A future version of GTR may include S3 and S3-compatible APIs as a destination. There may be a possibility of teaching Cloudflare Workers to facilitate this in a highly parallel manner, as was done for Azure. Unfortunately, S3 does not have a similar "download from a remote server" API. However, we might be able to teach Cloudflare Workers to do the transloading themselves. This might not be compatible with Cloudflare's "unload workers from memory" optimization, though. Would this still work?
I'm also extremely curious about storing the "hot" data in Cloudflare R2. Without ingress or egress fees, one could transload and stage Takeout archives there temporarily, then download them for a local backup, resumable with one's download manager of choice. However, R2 is missing features like lifecycle rules, which are pretty important for preventing runaway costs when it's used as a staging area.
With the recent news about Cloudflare, some users may also wish to use a non-Cloudflare alternative. I don't know of a good alternative with the same free price point, geographical reach, computing power, network footprint, scalability, and permissive terms of use.
In the meantime:
https://sjwheel.net/cloud/computing/2019/08/01/aws_backup.html
https://benjamincongdon.me/blog/2021/05/03/Backing-up-my-Google-Takeout-data/
https://tyler.io/my-familys-photo-and-video-library-backup-strategy-in-2020/
The general idea of these is to use a single VPS instance to handle the coordination and traffic. Congdon's solution clocked in at about 65MB/s.
I used Azure's "Standard_L8s_v2" for my instance, and that topped out at about 300MB/s when writing to the temporary local NVMe storage before uploading from there to Azure Storage. The CPU was pegged pretty hard during the transfer, which makes me wonder how much CPU time it takes to push many GB/s through TLS. Probably a lot. And with GTR I'm not really paying for the CPU doing TLS; the cloud vendors are. Great!
If you go the VPS route, you may want to use aria2c along with an aria2c browser extension to streamline the transloading process without too much terminal work. This was fast for me, but I wanted something much faster and VPS-less.
Haven't tried, not sure. Might be something to try. YMMV, stuff may break.
Note that the GTR Proxy by default is limited to Google Takeout domains. You would need to fork the proxy and add domains to its whitelist.
In general, the high parallelism and concurrency that GTR relies on is a product of Google Takeout ultimately serving takeout archives via signed URLs to Google Cloud Storage, Google's S3-like object storage offering. Google Cloud Storage is very robust, very available, and very scalable. If you try the interceptor with something else, the intercepted URL must have no limit on parallelism or concurrency and must not use cookies to validate access.
Services to try:
Let me know if you try something and it works. Don't bother trying it on traditional server hosted Linux ISO mirrors though. They tend to limit concurrency and aren't object storage based.
I got inspired watching SpaceX launch rockets with a pile of Merlin engines. Starship is definitely a BFR! The fact that it launched such huge payloads with "off the shelf" engines combined in parallel was definitely somewhat inspirational to the architecture. Hence, GTR.