Heritrix3 Versions

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

3.4.0-20220727

1 year ago

This is the 2022-07-27 release. Despite being an interim release, it should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.

The binaries are available via Maven Central here.

If you want the distribution package, you can download it from here.

3.4.0-20210923

2 years ago

This is the 2021-09-23 release. Despite being an interim release, it should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.

The binaries are available via Maven Central here.

If you want the distribution package, you can download it from here.

3.4.0-20210803

2 years ago

This is the 2021-08-03 release. Despite being an interim release, it should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.

This release includes:

  • Upgrades http-client to version 4.5, including improved cookie handling and expiration.
  • A new browser-based extraction module, ExtractorChrome.
  • JDK16 compatibility improvements.
  • Many more smaller fixes and improvements (see changelog).

The binaries are available via Maven Central here.

If you want the distribution package, you can download it from here.

3.4.0-20210617

2 years ago

This is the 2021-06-17 release. Despite being an interim release, it should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.

IMPORTANT: This release was accidentally built with Java 15 and, due to changes in the run-time libraries, it is not compatible with Java 8 (Java 9 or later should work fine).
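As a rough illustration of the compatibility note above, a launcher script or wrapper could check the running Java version before starting this particular release. The class and method names below are hypothetical and are not part of Heritrix. The only assumption taken from the release note is that Java 9 or later should work and Java 8 does not.

```java
// Hypothetical helper, not part of Heritrix: a sketch of a pre-flight check
// for the 3.4.0-20210617 release, which the notes say does not run on Java 8.
final class JavaRuntimeCheck {

    /**
     * Converts a "java.specification.version" value into a feature number.
     * Java 8 and earlier report values like "1.8"; Java 9+ report "9", "11", "17".
     */
    static int featureVersion(String specVersion) {
        String v = specVersion.trim();
        if (v.startsWith("1.")) {
            v = v.substring(2); // "1.8" -> "8"
        }
        int dot = v.indexOf('.');
        if (dot >= 0) {
            v = v.substring(0, dot); // drop any minor component
        }
        return Integer.parseInt(v);
    }

    /** Per the release note, Java 9 or later should work; Java 8 does not. */
    static boolean supportsRelease20210617(String specVersion) {
        return featureVersion(specVersion) >= 9;
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        if (supportsRelease20210617(spec)) {
            System.out.println("Java " + spec + " should be able to run 3.4.0-20210617.");
        } else {
            System.out.println("Java " + spec + " is too old for 3.4.0-20210617; use Java 9+ or another release.");
        }
    }
}
```

The check reads the `java.specification.version` system property rather than calling `Runtime.version()`, because the latter does not exist on Java 8, which is exactly the runtime the check needs to detect.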

This release improves sitemap extraction, and fixes a bug that can interfere with checkpoint creation.

The binaries are available via Maven Central here.

If you want the distribution package, you can download it from here.

3.4.0-20210527

2 years ago

This is the 2021-05-27 release. Despite being an interim release, it includes a number of important fixes for bugs in Heritrix 3.4 and should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.

Notably, this release includes new modules for finding and using sitemaps. See: Support for extracting URLs in sitemaps #262

The binaries are available via Maven Central here.

If you want the distribution package, you can download it from here.

3.4.0-20200518

3 years ago

This is the 2020-05-18 release. Despite being an interim release, it includes a number of important fixes for bugs in Heritrix 3.4 and should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.

This release features new modules to support archiving over SFTP. Content fetched this way is currently stored as a response record rather than the resource record that has been more widely used in the past. The next release will resolve this as per this pull request.

The binaries are available via Maven Central here.

If you want the distribution package, you can download it from here.

3.4.0-20200304

4 years ago

This is the fourth dated, periodic release. It includes a number of significant changes, most importantly an upgrade of the Berkeley Database from the very old version 4.1.6 to version 7.5.11. This resolves a long-standing bug when recovering from checkpoints multiple times, but it also means that Heritrix state files from previous versions are not compatible with this version. In other words:

Any crawl state folders from previous versions of Heritrix are not compatible with this version! You can only use this new release with new crawls!

Some basic release notes are available here. You can find more detailed information in the changelog.

The binaries are available via Maven Central here.

If you want the distribution package, you can download it from here.

3.4.0-20190418

5 years ago

This is the third dated, periodic release. Despite being an interim release, it includes a number of important fixes for bugs in Heritrix 3.4 and should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.

The binaries are available via Maven Central here.

If you want the distribution package, you can download it from here.

3.4.0-20190207

5 years ago

This is the second dated, periodic release. Despite being an interim release, it includes a number of important fixes for bugs in Heritrix 3.2 and should be considered the latest stable version of Heritrix. Some basic release notes are available here.

The binaries are available via Maven Central here.

If you want the distribution package, you can download it from here.

3.4.0-20190205

5 years ago

This release is the first aiming to establish a stable baseline using dated, periodic releases. The binaries are available via Maven Central here.

If you want the distribution package, you can download it from here.
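The version names listed on this page follow the pattern `3.4.0-YYYYMMDD`, where the suffix matches the release date given in each entry (for example, 3.4.0-20220727 is the 2022-07-27 release). A minimal sketch of how a script might read the date from such a version string and pick the newest release is shown below. The `DatedVersion` class and its methods are hypothetical names for illustration and are not part of Heritrix.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Heritrix: interprets the date suffix of
// version strings such as "3.4.0-20220727".
final class DatedVersion {

    /** Returns the release date encoded after the last '-' in the version string. */
    static LocalDate releaseDate(String version) {
        int dash = version.lastIndexOf('-');
        if (dash < 0) {
            throw new IllegalArgumentException("No date suffix in version: " + version);
        }
        // BASIC_ISO_DATE parses the compact form yyyyMMdd, e.g. "20220727".
        return LocalDate.parse(version.substring(dash + 1), DateTimeFormatter.BASIC_ISO_DATE);
    }

    /** Returns the version with the most recent date suffix. */
    static String newest(List<String> versions) {
        return versions.stream()
                .max(Comparator.comparing(DatedVersion::releaseDate))
                .orElseThrow(() -> new IllegalArgumentException("No versions given"));
    }

    public static void main(String[] args) {
        List<String> versions = Arrays.asList(
                "3.4.0-20190205", "3.4.0-20200304", "3.4.0-20210617", "3.4.0-20220727");
        String latest = newest(versions);
        System.out.println(latest + " was released on " + releaseDate(latest));
    }
}
```

Comparing by the parsed date rather than by raw string order keeps the logic explicit and rejects malformed suffixes early, although for this fixed-width date format a plain string comparison would give the same ordering.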