ArchiveBox Versions

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

v0.4.17

3 years ago
  • Fix bugs with parsing long URLs as paths
  • html-encoded URLs
  • new generic HTML parser
  • new --init and --overwrite flags on add
  • improve stdout and hints
  • fix Pull title button
  • other small bugfixes
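
As a rough illustration of the html-encoded URL item above: links copied out of HTML often carry entities such as `&amp;` that need decoding before the URL is usable. The sketch below uses only the Python standard library; `decode_html_url` is a hypothetical helper name and is not taken from the ArchiveBox codebase.

```python
import html


def decode_html_url(raw: str) -> str:
    """Decode HTML entities in a URL taken from HTML markup (hypothetical helper)."""
    # html.unescape turns "&amp;" into "&", "&#x2F;" into "/", and so on.
    return html.unescape(raw.strip())


print(decode_html_url("https://example.com/search?q=a&amp;page=2"))
```

This decoding only makes sense for text that really came from HTML; applying it to an already-plain URL could alter query strings that happen to look like entities.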

v0.4.16

3 years ago

A minor bugfix release for the Readability archive method, so that a Readability timeout no longer kills the whole archiving process.

v0.4.15

3 years ago
  • fix a bug where invalid URLs were parsed and imported, causing the whole archive process to crash
  • add support for scheduled archiving in docker
docker run -v $PWD:/data archivebox schedule --foreground --every=day --depth=1 'https://getpocket.com/users/USERNAME/feed/all'
# docker-compose.yml

version: '3.7'

services:
  archivebox:
    image: nikisweeting/archivebox:latest
    command: schedule --foreground --every=day --depth=1 'https://getpocket.com/users/USERNAME/feed/all'
    environment:
      - USE_COLOR=True
      - SHOW_PROGRESS=False
    volumes:
      - ./data:/data

v0.4.14

3 years ago

Adds support for the Readability article text extractor. By default it runs on the SingleFile, Wget, and DOM dump output, but if none of those are available it downloads the article from scratch to extract the text. This release also officially adds Docker support for ARM architectures, including the Raspberry Pi. The image size was also shrunk from 1.5GB to 452MB by making sure unnecessary build tools are uninstalled after the package build process.
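
The fallback behaviour described above (reuse an existing saved copy, otherwise download afresh) could be sketched roughly as follows. The file names and their order of preference are assumptions made for illustration, and `pick_html_source` is a hypothetical helper, not ArchiveBox's actual implementation.

```python
from pathlib import Path
from typing import Optional

# Hypothetical output file names, listed in an assumed order of preference.
CANDIDATE_OUTPUTS = ("singlefile.html", "wget.html", "dom.html")


def pick_html_source(snapshot_dir: Path) -> Optional[Path]:
    """Return the first existing, non-empty saved HTML file in snapshot_dir.

    Returns None when no saved copy is usable, which signals to the caller
    that the article has to be downloaded from scratch.
    """
    for name in CANDIDATE_OUTPUTS:
        candidate = snapshot_dir / name
        if candidate.is_file() and candidate.stat().st_size > 0:
            return candidate
    return None
```

Returning `None` instead of raising keeps the "download from scratch" decision with the caller, which is one plausible way to structure such a fallback.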

v0.4.13

3 years ago

v0.4.12

3 years ago

This is a minor bugfix release with some Dockerfile improvements to qualify for the official docker image library.

v0.4.11

3 years ago

We add a major new archive method in this release: SingleFile. On bare metal it requires installing Node and Chrome/Chromium, but it works out-of-the-box in the Docker version.

This finally allows ArchiveBox to pass all of the acid tests except one, and the archives of GitHub and many other sites are nicer than what Wget was able to produce on its own.

v0.4.9

3 years ago

🌅 v0.4 is officially released. This is a long-awaited 3rd-pass review over every corner of the ArchiveBox UX. It addresses many of the fundamental shortcomings around index consistency by using a new SQLite database, with automatic migrations provided by Django. It also smooths many of the rough edges, adds a new admin Web UI, a rich new CLI, closes 40+ GitHub tickets, and is the first official release available on PyPI.

Enjoy!

🎉 Big thanks to everyone who helped! Especially the Monadical team @cdvv7788 @apkallum @afreydev and also @drpfenderson who helped us track down the last few index importing bugs! 🎉

The docs still have some work left to finish updating, but the CLI help text is all up-to-date (when in doubt, just pass --help).
Let us know if you find any rough edges here: https://github.com/pirate/ArchiveBox/issues/new/choose

pip install archivebox

cd path/to/your/archive/folder

archivebox init  # this doubles as the migrate command, it will safely upgrade existing index files automatically
archivebox add 'https://example.com'
archivebox add 'https://getpocket.com/users/USERNAME/feed/all' --depth=1
archivebox status
archivebox server
archivebox help

Or, if you prefer Docker, the CLI works exactly the same (archivebox [subcommand] [...args]):

docker run -v $PWD:/data nikisweeting/archivebox init
docker run -v $PWD:/data nikisweeting/archivebox add 'https://example.com'
docker run -v $PWD:/data -p 8000:8000 nikisweeting/archivebox server

# docker-compose.yml
version: '3.7'

services:
    archivebox:
        image: nikisweeting/archivebox:latest
        command: server 0.0.0.0:8000
        stdin_open: true
        tty: true
        ports:
            - 8000:8000
        environment:
            - USE_COLOR=True
        volumes:
            - ./data:/data

New Features

A bunch of big changes:

  • pip install archivebox is now available
  • full transition to Django Sqlite DB with migrations (making upgrades between versions much safer now)
  • maintains an intuitive and helpful CLI that's backwards-compatible with all previous archivebox data versions
  • uses argparse instead of hand-written CLI system: see archivebox/cli/archivebox.py
  • new subcommands-based CLI for archivebox (see below)
  • new Web UI with pagination, better search, filtering, permissions, and more
  • 30+ assorted bugfixes, new features, and tickets closed

For more info, see: https://github.com/pirate/ArchiveBox/wiki/Roadmap

Released in this version:

Install Methods:

Command Line Interface:

Web UI:

  • / Main index
  • /add Page to add new links to the archive (but needs improvement)
  • /archive/<timestamp>/ Snapshot details page
  • /archive/<timestamp>/<url> live wget archive of page
  • /archive/<timestamp>/<extractor> get a specific extractor output for a given snapshot
  • /archive/<url> shortcut to view most recent snapshot of given url
  • /archive/<url_hash> shortcut to view most recent snapshot of the url with the given hash
  • /admin Admin interface to view and edit archive data
  • /old.html Backwards-compatible static HTML index for the previous version

Python API:

(Red ❌ features are still unfinished and will be released in later versions)

v0.2.4

5 years ago
  • better archive corruption guards (check structure invariants on every parse & save)
  • remove title prefetching in favor of new FETCH_TITLE archive method
  • slightly improved CLI output for parsing and remote url downloading
  • re-save index after archiving completes to update titles and urls
  • remove redundant derivable data from link json schema
  • markdown link parsing support
  • faster link parsing and better symbol handling using a new compiled URL_REGEX
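
To illustrate the compiled-regex approach mentioned in the last item, here is a minimal sketch. The pattern is a simplified stand-in written for this example, not the project's actual URL_REGEX, and `find_urls` is a hypothetical helper.

```python
import re
from typing import List

# Simplified illustrative pattern (an assumption, not ArchiveBox's real URL_REGEX).
# Compiling it once at import time avoids re-parsing the pattern on every call.
URL_REGEX = re.compile(
    r"https?://"         # scheme
    r"[^\s<>()\[\]]+",   # run until whitespace or a bracketing symbol
    re.IGNORECASE,
)


def find_urls(text: str) -> List[str]:
    """Return every URL-like substring found in text, in order of appearance."""
    return URL_REGEX.findall(text)


sample = "See [the docs](https://example.com/docs) and <https://example.org/a?b=1>."
print(find_urls(sample))
```

Because the pattern stops at parentheses and brackets, it pulls the target out of a markdown link cleanly, at the cost of truncating URLs that legitimately contain those characters.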

v0.2.3

5 years ago
  • fixed issues with parsing titles including trailing tags
  • fixed issues with titles defaulting to URLs instead of attempting to fetch
  • fixed issue where bookmark timestamps from RSS would be ignored and the current timestamp used instead
  • fixed issue where ONLY_NEW would overwrite existing links in archive with only new ones
  • fixed lots of issues with URL parsing by using urllib.parse instead of hand-written lambdas
  • ignore robots.txt when using wget (ssshhh don't tell anyone 😁)
  • fix RSS parser bailing out when there's whitespace around XML tags
  • fix issue with browser history export trying to run ls on wrong directory
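
For readers unfamiliar with the urllib.parse switch listed above, the following sketch shows the kind of URL splitting the standard library provides out of the box. `url_parts` is a hypothetical helper written for this example and does not mirror ArchiveBox's own utility functions.

```python
from urllib.parse import urlparse


def url_parts(url: str) -> dict:
    """Split a URL into its main components using the standard library."""
    parsed = urlparse(url)
    return {
        "scheme": parsed.scheme,
        "domain": parsed.netloc,   # includes the port when one is given
        "path": parsed.path,
        "query": parsed.query,
        "fragment": parsed.fragment,
    }


print(url_parts("https://example.com:8080/a/b.html?x=1#top"))
```

Relying on `urlparse` means edge cases such as ports, query strings, and fragments are handled by well-tested library code rather than by ad hoc string splitting.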