ArchiveBox Versions

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

v0.8.0-rc

1 month ago

WIP pre-release for the upcoming ArchiveBox v0.8.0 release.

[!WARNING] This is an unfinished alpha pre-release. We're promoting it a little earlier than usual because it contains a ✨ big Django upgrade ✨ that affects many areas of the codebase, and we want brave early adopters to help us test it! If that sounds like you, make sure to back up your archive first, then give it a try and let us know if you find any bugs by opening a new issue!

Try this release early using docker or pip:

# with docker (pre-built)
docker pull archivebox/archivebox:dev
# with docker (built from source)
docker build -t archivebox:dev https://github.com/ArchiveBox/ArchiveBox.git#dev
# with pip (built from source)
pip install 'git+https://github.com/pirate/ArchiveBox@dev'

To use the new noVNC container to view & control the ArchiveBox browser remotely, grab the updated docker-compose.yml and follow these instructions.

Highlights

  • upgrade to Django 4.2 (thanks @jimwins!)
  • add new _EXTRA_ARGS options (thanks @benmuth!)
  • add new generic_jsonl parser (thanks @jimwins!)
  • switch to feedparser for RSS parsing (thanks @jimwins!)
  • remember Snapshot detail page header expanded/collapsed state
  • allow more restrictive NFS permission coercion on ./data/archive
  • check /, /data, and /data/archive in Docker and warn if running low on disk space
  • fix /browsers chown on Docker armv7 entrypoint failing
  • disable chrome automatic self-updating when running headless
  • Add ability to populate is_staff and is_superuser flags during LDAP first auth
  • add gitea and other domains to default GIT_DOMAINS list to run git archiving on
  • bump yt-dlp and singlefile versions
  • fix RESOLUTION being ignored when using Chrome headless in Docker
  • fix sorting by Size / Files in the Admin Snapshots list page UI
  • fix spinner icon showing on some Snapshots instead of favicon when only a few extractors are enabled
  • fix yt-dlp sometimes failing to archive media due to filenames being too long or containing special characters
  • fix wget extractor not finding output when :80 or :443 port is present in the original URL
  • fix /var/spool/cron/crontabs permissions when mounting it via Docker
  • COMING SOON: new sci-dl scientific paper downloader being worked on by @benmuth
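
The wget fix listed above concerns URLs that spell out a default port (:80 or :443). As a rough illustration only, and not ArchiveBox's actual code, normalizing such URLs with the Python standard library might look like the sketch below. The helper name `strip_default_port` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing what a URL points to.
DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_default_port(url: str) -> str:
    """Return url with an explicit :80 (http) or :443 (https) removed."""
    parts = urlsplit(url)
    if parts.port is None or DEFAULT_PORTS.get(parts.scheme) != parts.port:
        return url  # no port given, or a non-default port: leave untouched
    # Rebuild the netloc without the port, keeping any user:pass@ prefix.
    host = parts.hostname or ""
    if ":" in host:  # IPv6 literal needs its brackets back
        host = f"[{host}]"
    userinfo = parts.netloc.rpartition("@")[0]
    netloc = f"{userinfo}@{host}" if userinfo else host
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

Comparing the normalized form of the original URL against the path wget wrote to disk is one plausible way to avoid a mismatch caused by the port suffix.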

Full Changelog: https://github.com/ArchiveBox/ArchiveBox/compare/v0.7.2...v0.8.0-rc

v0.7.2

4 months ago

Get this release via pip, docker, brew, or dpkg (apt & brew releases are delayed).

# Get it with Pip on any OS (`amd64`, `arm64`, `arm/v7`)
pip install --upgrade 'archivebox==0.7.2'
# Get it with Docker on any OS (`amd64`, `arm64`, `arm/v7`)
docker pull archivebox/archivebox:0.7.2
# Get it with brew on macOS (`amd64`, `arm64`)
brew tap archivebox/archivebox
brew install archivebox
pip install --upgrade 'archivebox==0.7.2'
# Get it with apt on Ubuntu/Debian based systems (`any`)
wget 'https://github.com/ArchiveBox/debian-archivebox/raw/main/archivebox-0.7.1.deb'
apt install ./archivebox-0.7.1.deb
# OR
dpkg -i ./archivebox-0.7.1.deb

# then run pip install after
pip install --upgrade 'archivebox==0.7.2'

Note: this is not packaged using "proper" debian techniques like 0.6.2 was, instead it's just a wrapper for executing pip install archivebox w/ a few extras. This is because ArchiveBox relies on some binary and dynamic dependencies (node, chrome, playwright, ffmpeg, yt-dlp, etc.) which aren't allowed in Debian packages.
(Launchpad apt ppa & brew updates coming eventually, packaging all the vendored binaries that archivebox depends on has gotten harder lately)


# Then run this to upgrade an existing collection data dir to 0.7.2
cd ~/path/to/data/dir
archivebox init

What's Changed

  • add --tag=tag1,tag2,tag3 support to archivebox schedule command
  • allow PGID=0 root-group ownership of data dir (but PUID=0 is still not allowed)
  • improve error messages, hints, and logging about permissions issues in Docker
  • notify users when new ArchiveBox version is available on Github (thanks @benmuth!)
  • bump dependency versions (yt-dlp, chrome, readability, node, python)
  • warn when Docker / or /data volume mounts don't have any space available
  • limit compatible Python versions to >= 3.8 and <= 3.11
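
The disk-space warning mentioned in this list can be pictured with a small sketch. This is an illustration under stated assumptions, not the project's implementation: the function name `low_space_warnings`, the 1 GiB threshold, and the message format are all made up for the example. Only `shutil.disk_usage` is a real standard-library call.

```python
import shutil


def low_space_warnings(paths, min_free_bytes=1024**3, usage=shutil.disk_usage):
    """Return one warning string per path with less than min_free_bytes free.

    The 1 GiB default threshold is an assumption chosen for this example.
    """
    warnings = []
    for path in paths:
        try:
            free = usage(path).free
        except OSError:
            continue  # path is missing or unreadable; skip it
        if free < min_free_bytes:
            warnings.append(f"[!] Low disk space on {path}: {free // 1024**2} MiB free")
    return warnings
```

Calling `low_space_warnings(["/", "/data"])` at container startup would return an empty list when both mounts have room to spare.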

Bug Fixes

  • fix action buttons in Snapshot admin page not showing up correctly
  • tag links immediately in first stage of archivebox add instead of at the end (so that imports that are paused or interrupted still get tagged correctly)
  • fix config variables in CHROME_USER_AGENT format string not getting interpolated properly
  • switch readability to prefer Chrome DOM dumps for article text instead of singlefile (because singlefile output is often huge and crashes readability/times out)
  • make Docker image smaller by removing unneeded docs files
  • better current version detection and remove annoying +editable string and also add BUILD_TIME
  • fix /browsers/* does not exist warning on startup
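
To make the CHROME_USER_AGENT fix above more concrete, here is a minimal sketch of interpolating config variables into a format string. It assumes config values are plain Python strings with `{OTHER_KEY}` placeholders. The function name `interpolate_config` and the example user agent are hypothetical and do not come from the ArchiveBox codebase.

```python
def interpolate_config(config: dict) -> dict:
    """Expand {OTHER_KEY} placeholders in string values using the other config values."""
    resolved = dict(config)
    # Repeat until nothing changes, so a placeholder may point at a value
    # that itself contains placeholders. The pass count is bounded.
    for _ in range(len(resolved) + 1):
        changed = False
        for key, value in resolved.items():
            if isinstance(value, str) and "{" in value:
                try:
                    new_value = value.format(**resolved)
                except (KeyError, IndexError, ValueError):
                    continue  # unknown placeholder or stray brace: leave as-is
                if new_value != value:
                    resolved[key] = new_value
                    changed = True
        if not changed:
            break
    return resolved
```

With a config such as `{"VERSION": "0.7.2", "CHROME_USER_AGENT": "Mozilla/5.0 ArchiveBox/{VERSION}"}`, the returned dict carries the user agent with the version filled in.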

v0.7.0

1 year ago

This is a pre-release. I'm getting ready to finally bundle all the changes from the last year and a half into a minor version bump.

Sorry for the delay everyone!

I'll update this with proper release notes once I'm preparing to roll the debian, homebrew, and pip packages.

For now you can use the pre-release version via the Docker archivebox/archivebox:dev tag.

v0.7.1

1 year ago

Get this release via pip, docker, brew, or dpkg (apt ppa update delayed).

# Get it with Pip on any OS (`amd64`, `arm64`, `arm/v7`)
pip install --upgrade 'archivebox==0.7.1'
# Get it with Docker on any OS (`amd64`, `arm64`, `arm/v7`)
docker pull archivebox/archivebox:0.7.1
# Get it with brew on macOS (`amd64`, `arm64`)
brew tap archivebox/archivebox
brew install archivebox
# Get it with apt on Ubuntu/Debian based systems (`any`)
wget 'https://github.com/ArchiveBox/debian-archivebox/raw/main/archivebox-0.7.1.deb'
apt install ./archivebox-0.7.1.deb
# OR
dpkg -i ./archivebox-0.7.1.deb

Note: this is not packaged using "proper" debian techniques like 0.6.2 was, instead it's just a wrapper for executing pip install archivebox w/ a few extras. This is because ArchiveBox relies on some binary and dynamic dependencies (node, chrome, playwright, ffmpeg, yt-dlp, etc.) which aren't allowed in Debian packages.
(Launchpad apt ppa update coming eventually, packaging for apt has gotten harder lately)


# Then run this to upgrade an existing collection data dir to 0.7.1
cd ~/path/to/data/dir
archivebox init

What's Changed

Lots of bugfixes, speedups, and small convenience features.

Full Changelog: https://github.com/ArchiveBox/ArchiveBox/compare/v0.6.2...v0.7.1

v0.6.2

3 years ago

New features

  • new ArchiveResult log in the admin web UI, with full editing ability of individual extractor outputs + list of outputs under each Snapshot admin entry
  • ability to save multiple snapshots of the same URL over time using new Re-snapshot button
  • add init --quick and server --quick-init options to quickly update the db version without doing a full re-init (for users with large archive collections this will make version upgrades a lot faster / less painful)
  • add new archivebox setup command and archivebox init --setup flag to aid in automatically installing dependencies and creating a superuser during initial setup
  • new SNAPSHOTS_PER_PAGE=40 and MEDIA_MAX_SIZE=750m config options
  • allow hotlinking directly to specific extractor output on the snapshot detail page using URL #hash e.g. /archive/<timestamp>/index.html#git
  • add ability to view the snapshot matching a given URL by visiting /archive/https://example.com/some/url -> redirects to -> /archive/<timestamp>/index.html (also works without scheme /archive/example.com)
  • #660 add ability to tag URLs while adding them via the web UI and via the CLI using archivebox add --tag=tag1,tag2,tag3 ...
  • #659 add back ability to override visual styling with custom HTML and CSS using new config option CUSTOM_TEMPLATES_DIR
  • ability to add and remove multiple tags at once from the snapshot admin using autocompleting dropdown

Enhancements

  • lots of performance improvements! (in testing with 100k entries, the main index was brought down from 10-14 second load times to ~110ms once cache warms up)
  • full text search now works on the public snapshot list
  • dates and times are now localized to your browser's timezone instead of showing in UTC
  • integrity and correctness improvements to readability, mercury, warc, and other extractors
  • video subtitles and description are now added to the full-text search index as well (including youtube's autogenerated transcripts in all languages)
  • log all errors with full tracebacks to new data/logs/errors.log file (so users no longer have to run in --debug mode to see error details)
  • better archivebox schedule logging and changed logfile location to ./logs/schedule.log
  • better docker-compose setup experience with sonic config example in docker-compose.yml
  • add Django Debug Toolbar + djdt_flamegraph for developers to profile UI performance
  • add --overwrite flag support to archivebox schedule, archived urls get added similarly to add --overwrite
  • #644 remove bootstrap and jquery network requests to CDNs by inlining them instead
  • #647 allow filtering by ArchiveResult status in the Snapshot admin UI to select only links that have been archived or not archived
  • #550 kill all orphan child processes after each extractor finishes to prevent dangling chromium/node subprocesses and memory leaks
  • 3276434 add new SEARCH_BACKEND_TIMEOUT config option to tune amount of time search backend can take before it gives up
  • more diagnostic info added to the Snapshot admin view including most recent status code, content type, detected server, etc
  • make the order of the table columns, layout, and spacing the same on the public view and private view (also remove DataTable, we're not using it)
  • better snapshot grid page (faster load times, nicer CSS for tags and cards, more actions supported and metadata shown)
  • added Cache-Control headers to dramatically speed up load times by caching favicons, screenshots, etc. in browsers/upstreams
  • new project releases page https://releases.archivebox.io and demo url https://demo.archivebox.io

Bugfixes

  • #673 fix searching by URL substring in Snapshot admin list
  • #658 fix Snapshot admin action buttons not working in Safari and some other browsers
  • #678 fix AssertionError when archivebox would attempt to archive with CHROME_BINARY=None when Chrome was not found on the host system
  • #654 fix some issues with sonic attempting to index massive text blobs or binary blobs on some pages and hanging
  • #674 fix UTF-8 encoding problems with file reading/writing on Windows (supporting a Python pkg on Windows is unreasonably painful y'all)
  • #433 fix deleted items sometimes reappearing on next import/update
  • #473 fix issue preventing use of archivebox python API inside raw REPL (not using archivebox shell)
  • fix stdin/stdout/stderr handling for some edge cases in Docker/Docker-Compose

v0.5.6

3 years ago
  • add ARMv7 and ARMv8 CPU support for apt / deb distribution on Launchpad PPA
  • fix nodesource apt repo not supported on i386 b90afc8
  • fix handling of skipped ArchiveResult entries with null output 0aea5ed
  • catch exception on import of old index.json into ArchiveResult 171bbeb
  • move debsign to release not build 66fb5b2
  • skip tests during debian build a32eac3
  • fix empty strings in cmd_version causing exception a49884a
  • automate deb dist better and bump version 0e6ac39
  • fix assertion 6705354
  • change wording of db not found error 683a087

v0.5.4

3 years ago

Thank you contributors who helped with the 181 commits in this release!
@cdvv7788, @jdcaballerov, @thedanbob, @aggroskater, @mAAdhaTTah, @mario-campos, @mikaelf

  • fix migration failing due to null cmd_versions in older archives a3008c8
  • Publish minor & major versions to DockerHub and set up CodeQL codeql-analysis.yml c5b7d9f, bbb6cc8
  • fix DATABASE_NAME posixpath, and dependencies dict bug 02bdb3b, 5c7842f
  • use relative imports for .util to fix windows import clash 72e2c7b
  • fix COOKIES_FILE config param breaking in wget ef7711f
  • Refactor should_save_extractor methods to accept overwrite parameter 5420903
  • Fix issue #617 by using mark_safe in combination with format_html … 1989275
  • make permission chowning on docker start less fancy, respect PUID/PGID #635
  • add createsuperuser flag to server command 39ec77e
  • fix files icons styling and use the db exclusively for rendering them, instead of filesystem f004058, 7d8fe66, 5c54bcc, 534ead2
  • limit youtubedl download size to 750m and stop splitting out audio files 3227f54
  • also search url, timestamp, tags on public index 8a4edb4
  • fix trailing slash problems and wget not detecting download path 9764a8e
  • add response status code to headers.json c089501
  • fix singlefile path used for sonic 24e2493
  • cleanup template layout in filesystem, new snapshot detail page UI

v0.5.3

3 years ago
  • ArchiveResult moved to SQLite3 DB for performance @cdvv7788
  • lots of assorted bugfixes and improvements courtesy of @cdvv7788 and @jdcaballerov
  • new full-text search support with ripgrep and sonic courtesy of @jdcaballerov
  • new archivebox oneshot command for downloading a single site without starting a whole collection
  • new Pocket API importer courtesy of @mAAdhaTTah
  • new Wallabag importer courtesy of @ehainry
  • new extractor options on Add page courtesy of @BlipRanger
  • new apt/deb/homebrew/pip packaging setup into separate repos under new Github Org https://github.com/ArchiveBox
  • new official PPA and Docker Hub accounts https://hub.docker.com/r/archivebox/archivebox (with automatic armv7 builds courtesy of @chrismeller)
  • new Snapshot grid view courtesy of @jdcaballerov

v0.4.24

3 years ago

Last stable version for the v0.4 branch, contains numerous final fixes and improvements to v0.4 before the leap to v0.5.

v0.4.21

3 years ago