🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
WIP pre-release for the upcoming ArchiveBox v0.8.0 release.
[!WARNING] This is an unfinished alpha pre-release. We're promoting it a little earlier than usual because it contains a ✨ big Django upgrade ✨ that affects many areas of the codebase, and we want brave early adopters to help us test it! If that sounds like you, make sure to back up your archive first, then give it a try and let us know if you find any bugs by opening a new issue!
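One way to take the backup mentioned above is to tar up the whole data dir before upgrading. This is only a minimal sketch, not an official ArchiveBox tool: the `backup_data_dir` helper, the tarball naming, and the example paths are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of ArchiveBox): copy a collection's data dir
# into a timestamped .tar.gz and print the path of the tarball it created.
backup_data_dir() {
    data_dir="$1"      # assumed: path to your ArchiveBox data dir
    backup_dir="$2"    # assumed: destination dir, outside of data_dir
    stamp="$(date +%Y%m%d-%H%M%S)"
    out="$backup_dir/archivebox-backup-$stamp.tar.gz"

    mkdir -p "$backup_dir" || return 1
    # -C switches to the parent dir so the tarball stores relative paths
    tar -czf "$out" -C "$(dirname "$data_dir")" "$(basename "$data_dir")" || return 1
    printf '%s\n' "$out"
}
```

Usage might look like `backup_data_dir ~/path/to/data/dir ~/archivebox-backups`. It is probably safest to stop any running ArchiveBox server or container first, so the SQLite index is not being written to while it is copied.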
Try this release early using `docker` or `pip`:
# with docker (pre-built)
docker pull archivebox/archivebox:dev
# with docker (built from source)
docker build -t archivebox:dev https://github.com/ArchiveBox/ArchiveBox.git#dev
# with pip (built from source)
pip install 'git+https://github.com/pirate/ArchiveBox@dev'
To use the new noVNC container to view & control the ArchiveBox browser remotely, grab the updated `docker-compose.yml` and follow these instructions.
- `_EXTRA_ARGS` options (thanks @benmuth!)
- `generic_jsonl` parser (thanks @jimwins!)
- `feedparser` for RSS parsing (thanks @jimwins!)
- `Snapshot` detail page header expanded/collapsed state
- `./data/archive`
- `/`, `/data`, and `/data/archive` in Docker and warn if running low on disk space
- `/browsers` chown on Docker `armv7` entrypoint failing
- `is_staff` and `is_superuser` flags during LDAP first auth
- `yt-dlp` and `singlefile` versions
- `RESOLUTION` being ignored when using Chrome headless in Docker
- `:80` or `:443` port is present in the original URL
- `/var/spool/cron/crontabs` permissions when mounting it via Docker
- `sci-dl` scientific paper downloader being worked on by @benmuth
- `chown`'ing `./data/archive` dir if it's a network mount that prevents ownership changes by @gnattu in https://github.com/ArchiveBox/ArchiveBox/pull/1312
- `COOKIES_FILE` to fetch page titles by @benmuth in https://github.com/ArchiveBox/ArchiveBox/pull/1364
- `_EXTRA_ARGS` for various extractors by @benmuth in https://github.com/ArchiveBox/ArchiveBox/pull/1360
Full Changelog: https://github.com/ArchiveBox/ArchiveBox/compare/v0.7.2...v0.8.0-rc
Get this release via `pip`, `docker`, `brew`, or `dpkg` (`apt` & `brew` releases are delayed).
# Get it with Pip on any OS (`amd64`, `arm64`, `arm/v7`)
pip install --upgrade 'archivebox==0.7.2'
# Get it with Docker on any OS (`amd64`, `arm64`, `arm/v7`)
docker pull archivebox/archivebox:0.7.2
# Get it with brew on macOS (`amd64`, `arm64`)
brew tap archivebox/archivebox
brew install archivebox
pip install --upgrade 'archivebox==0.7.2'
# Get it with apt on Ubuntu/Debian based systems (`any`)
wget 'https://github.com/ArchiveBox/debian-archivebox/raw/main/archivebox-0.7.1.deb'
apt install ./archivebox-0.7.1.deb
# OR
dpkg -i ./archivebox-0.7.1.deb
# then run pip install after
pip install --upgrade 'archivebox==0.7.2'
Note: this is not packaged using "proper" Debian techniques like 0.6.2 was; instead, it's just a wrapper for executing `pip install archivebox` w/ a few extras. This is because ArchiveBox relies on some binary and dynamic dependencies (node, chrome, playwright, ffmpeg, yt-dlp, etc.) which aren't allowed in Debian packages.
(Launchpad `apt` PPA & `brew` updates are coming eventually; packaging all the vendored binaries that ArchiveBox depends on has gotten harder lately.)
# Then run this to upgrade an existing collection data dir to 0.7.2
cd ~/path/to/data/dir
archivebox init
- `--tag=tag1,tag2,tag3` support to `archivebox schedule` command
- `PGID=0` root-group ownership of data dir (but PUID=0 is still not allowed)
- `/` or `/data` volume mounts don't have any space available
- `archivebox add` instead of at the end (so that imports that are paused or interrupted still get tagged correctly)
- `CHROME_USER_AGENT` format string not getting interpolated properly
- `+editable` string and also add BUILD_TIME
- `/browsers/*` does not exist warning on startup

This is a pre-release. I'm getting ready to finally bundle all the changes from the last year and a half into a minor version bump.
Sorry for the delay everyone!
I'll update this with proper release notes once I'm preparing to roll the debian, homebrew, and pip packages.
For now you can use the pre-release version via the Docker `archivebox/archivebox:dev` tag.
Get this release via `pip`, `docker`, `brew`, or `dpkg` (the `apt` PPA update is delayed).
# Get it with Pip on any OS (`amd64`, `arm64`, `arm/v7`)
pip install --upgrade 'archivebox==0.7.1'
# Get it with Docker on any OS (`amd64`, `arm64`, `arm/v7`)
docker pull archivebox/archivebox:0.7.1
# Get it with brew on macOS (`amd64`, `arm64`)
brew tap archivebox/archivebox
brew install archivebox
# Get it with apt on Ubuntu/Debian based systems (`any`)
wget 'https://github.com/ArchiveBox/debian-archivebox/raw/main/archivebox-0.7.1.deb'
apt install ./archivebox-0.7.1.deb
# OR
dpkg -i ./archivebox-0.7.1.deb
Note: this is not packaged using "proper" Debian techniques like 0.6.2 was; instead, it's just a wrapper for executing `pip install archivebox` w/ a few extras. This is because ArchiveBox relies on some binary and dynamic dependencies (node, chrome, playwright, ffmpeg, yt-dlp, etc.) which aren't allowed in Debian packages.
(Launchpad `apt` PPA update is coming eventually; packaging for `apt` has gotten harder lately.)
# Then run this to upgrade an existing collection data dir to 0.7.1
cd ~/path/to/data/dir
archivebox init
Lots of bugfixes, speedups, and small convenience features.
- `None` by @overhacked in https://github.com/ArchiveBox/ArchiveBox/pull/822
Full Changelog: https://github.com/ArchiveBox/ArchiveBox/compare/v0.6.2...v0.7.1
- `Re-snapshot` button
- `init --quick` and `server --quick-init` options to quickly update the db version without doing a full re-init (for users with large archive collections this will make version upgrades a lot faster / less painful)
- `archivebox setup` command and `archivebox init --setup` flag to aid in automatically installing dependencies and creating a superuser during initial setup
- `SNAPSHOTS_PER_PAGE=40` and `MEDIA_MAX_SIZE=750m` config options
- `#hash` e.g. `/archive/<timestamp>/index.html#git`
- `/archive/https://example.com/some/url` -> redirects to -> `/archive/<timestamp>/index.html` (also works without scheme `/archive/example.com`)
- `archivebox add --tag=tag1,tag2,tag3 ...`
- `CUSTOM_TEMPLATES_DIR`
- `data/logs/errors.log` file (so users no longer have to run in `--debug` mode to see error details)
- `archivebox schedule` logging and changed logfile location to `./logs/schedule.log`
- `docker-compose.yml`
- `djdt_flamegraph` for developers to profile UI performance
- `--overwrite` flag support to `archivebox schedule`, archived URLs get added similarly to `add --overwrite`
- `SEARCH_BACKEND_TIMEOUT` config option to tune amount of time search backend can take before it gives up
- `Cache-Control` headers to dramatically speed up load times by caching favicons, screenshots, etc. in browsers/upstreams
- `AssertionError` error when archivebox would attempt to archive with `CHROME_BINARY=None` when Chrome was not found on host system
- `apt` / `deb` distribution on Launchpad PPA

Thank you contributors who helped with the 181 commits in this release!
@cdvv7788, @jdcaballerov, @thedanbob, @aggroskater, @mAAdhaTTah, @mario-campos, @mikaelf
- `.util` to fix windows import clash 72e2c7b
- `COOKIES_FILE` config param breaking in wget ef7711f
- `should_save_extractor` methods to accept `overwrite` parameter 5420903
- `archivebox oneshot` command for downloading a single site without starting a whole collection

Last stable version for the v0.4 branch; contains numerous last fixes and improvements to v0.4 before the leap to v0.5.