🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
--init and --overwrite flags on add
A minor bugfix release for the Readability archive method, so that a timeout in it no longer kills the whole archiving process.
docker run -v $PWD:/data archivebox schedule --foreground --every=day --depth=1 'https://getpocket.com/users/USERNAME/feed/all'
# docker-compose.yml
version: '3.7'
services:
  archivebox:
    image: nikisweeting/archivebox:latest
    command: schedule --foreground --every=day --depth=1 'https://getpocket.com/users/USERNAME/feed/all'
    environment:
      - USE_COLOR=True
      - SHOW_PROGRESS=False
    volumes:
      - ./data:/data
Add support for the Readability article text extractor. By default it runs on the SingleFile, Wget, and DOM dump output, but if none of those are available it downloads the article from scratch to do text extraction. This release also officially adds Docker support for ARM architectures, including the Raspberry Pi. The image size was also shrunk from 1.5GB to 452MB by making sure unnecessary build tools are uninstalled after the package build process.
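The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not ArchiveBox's actual implementation: the function name and the three output filenames are hypothetical placeholders, and a return value of None stands for "nothing usable on disk, download the article from scratch".

```python
from pathlib import Path
from typing import Optional

# Hypothetical filenames for the SingleFile, Wget, and DOM dump outputs,
# in the order the release notes list them. Real snapshot layouts may differ.
CANDIDATE_OUTPUTS = ("singlefile.html", "wget.html", "dom.html")


def pick_readability_source(snapshot_dir: Path) -> Optional[Path]:
    """Return the first existing, non-empty HTML output in a snapshot folder.

    Returns None when no earlier extractor output is available, in which
    case the caller would have to fetch the page itself.
    """
    for name in CANDIDATE_OUTPUTS:
        candidate = snapshot_dir / name
        if candidate.is_file() and candidate.stat().st_size > 0:
            return candidate
    return None
```

Checking for a non-empty file, rather than mere existence, is a design choice in this sketch so that a failed earlier extractor that left an empty file behind does not block the fallback.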
This is a minor bugfix release with some Dockerfile improvements to qualify for the official docker image library.
We add a major new archive method in this release: SingleFile. On bare metal it requires installing Node and Chrome/Chromium, but it works out-of-the-box in the Docker version.
This finally allows ArchiveBox to pass all of the acid tests except one, and the archives of GitHub and many other sites are nicer than what Wget was able to produce on its own.
🌅 v0.4 is officially released. This is a long-awaited third-pass review over every corner of the ArchiveBox UX. It addresses many of the fundamental shortcomings around index consistency by using a new SQLite database, with automatic migrations provided by Django. It also smooths many of the rough edges, adds a new admin Web UI and a rich new CLI, closes 40+ GitHub tickets, and is the first official release available on PyPI.
pip install archivebox
docker run -v $PWD:/data nikisweeting/archivebox
Enjoy!
🎉 Big thanks to everyone who helped! Especially the Monadical team @cdvv7788 @apkallum @afreydev and also @drpfenderson who helped us track down the last few index importing bugs! 🎉
The docs still need some updating, but the CLI help text is all up-to-date (when in doubt, just pass --help).
Let us know if you find any rough edges here: https://github.com/pirate/ArchiveBox/issues/new/choose
pip install archivebox
cd path/to/your/archive/folder
archivebox init # this doubles as the migrate command, it will safely upgrade existing index files automatically
archivebox add 'https://example.com'
archivebox add 'https://getpocket.com/users/USERNAME/feed/all' --depth=1
archivebox status
archivebox server
archivebox help
Or if you prefer Docker, the CLI works exactly the same, archivebox [subcommand] [...args]:
docker run -v $PWD:/data nikisweeting/archivebox init
docker run -v $PWD:/data nikisweeting/archivebox add 'https://example.com'
docker run -v $PWD:/data -p 8000:8000 nikisweeting/archivebox server
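Because the subcommand and arguments are identical under both install methods, a script that drives ArchiveBox only needs to vary the command prefix. The sketch below assumes a hypothetical helper named archivebox_argv; it only builds the argument list from the commands shown above and leaves running it (for example with subprocess.run) to the caller.

```python
import os
from typing import List, Optional


def archivebox_argv(subcommand: str, *args: str, docker: bool = False,
                    data_dir: Optional[str] = None) -> List[str]:
    """Build the argv for an ArchiveBox CLI call, bare-metal or via Docker.

    archivebox_argv is a hypothetical helper for illustration only. The
    Docker prefix mirrors the `docker run -v $PWD:/data ...` form used in
    these notes, with data_dir standing in for $PWD.
    """
    if docker:
        host_dir = data_dir if data_dir is not None else os.getcwd()
        prefix = ["docker", "run", "-v", f"{host_dir}:/data",
                  "nikisweeting/archivebox"]
    else:
        prefix = ["archivebox"]
    return [*prefix, subcommand, *args]
```

For example, archivebox_argv("add", "https://example.com", docker=True) yields the same add invocation as the Docker command above, using the current working directory as the data folder.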
version: '3.7'
services:
  archivebox:
    image: nikisweeting/archivebox:latest
    command: server 0.0.0.0:8000
    stdin_open: true
    tty: true
    ports:
      - 8000:8000
    environment:
      - USE_COLOR=True
    volumes:
      - ./data:/data
A bunch of big changes:
pip install archivebox is now available
archivebox/cli/archivebox.py
archivebox (see below)
For more info, see: https://github.com/pirate/ArchiveBox/wiki/Roadmap
Install Methods:
pip/pipenv install archivebox [--dev]
docker run nikisweeting/archivebox / docker-compose up
apt/brew/pkg/yum/nix/etc install archivebox (maybe later)
Command Line Interface:
archivebox
archivebox version
archivebox help
archivebox init
archivebox status
archivebox add
archivebox remove
archivebox update
archivebox list
archivebox schedule
archivebox config
archivebox server
archivebox shell
archivebox manage
archivebox oneshot
archivebox export
archivebox proxy
Web UI:
/  Main index
/add  Page to add new links to the archive (but needs improvement)
/archive/<timestamp>/  Snapshot details page
/archive/<timestamp>/<url>  live wget archive of page
/archive/<timestamp>/<extractor>  get a specific extractor output for a given snapshot
/archive/<url>  shortcut to view most recent snapshot of given url
/archive/<url_hash>  shortcut to view most recent snapshot of given url
/admin  Admin interface to view and edit archive data
/old.html  Backwards-compatible static HTML index for the previous version
Python API:
from archivebox.main import add, remove, info, config, etc...
from archivebox.core.models import Snapshot, User, etc...
from archivebox.extractors import media, wget, screenshot, etc...
from archivebox.index import json, sql, html, etc...
from archivebox.parsers import pinboard_rss, pocket_html, generic_json, etc...
(Red ❌ features are still unfinished and will be released in later versions)
urllib.parse instead of hand-written lambdas
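As a rough illustration of what URL helpers built on urllib.parse can look like, here is a minimal sketch. The helper names (scheme, domain, path, extension, base_url) are hypothetical and are not taken from the ArchiveBox codebase; only the standard-library calls are real.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def scheme(url: str) -> str:
    """Return the URL scheme, e.g. 'https'."""
    return urlparse(url).scheme


def domain(url: str) -> str:
    """Return the network location, including any port."""
    return urlparse(url).netloc


def path(url: str) -> str:
    """Return the path component without query string or fragment."""
    return urlparse(url).path


def extension(url: str) -> str:
    """Return the file extension of the last path segment, or ''."""
    return PurePosixPath(urlparse(url).path).suffix


def base_url(url: str) -> str:
    """Return the URL with its query string and fragment removed."""
    return urlparse(url)._replace(query="", fragment="").geturl()
```

Delegating to urlparse means edge cases such as ports, empty paths, and fragments are handled by the standard library rather than by ad hoc string splitting.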