Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
This is the 2022-07-27 release. Despite being an interim release, it should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.
The binaries are available via Maven Central here.
If you want the distribution package, you can download it from here
This is the 2021-09-23 release. Despite being an interim release, it should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.
The binaries are available via Maven Central here.
If you want the distribution package, you can download it from here
This is the 2021-08-03 release. Despite being an interim release, it should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.
This release includes ExtractorChrome.
The binaries are available via Maven Central here.
If you want the distribution package, you can download it from here
This is the 2021-06-17 release. Despite being an interim release, it should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.
IMPORTANT: This release was accidentally built with Java 15 and, due to changes in the run-time libraries, it is not compatible with Java 8 (Java 9 or later should work fine).
This release improves sitemap extraction, and fixes a bug that can interfere with checkpoint creation.
The binaries are available via Maven Central here.
If you want the distribution package, you can download it from here
This is the 2021-05-27 release. Despite being an interim release, it includes a number of important fixes for bugs in Heritrix 3.4 and should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.
Notably, this release includes new modules for finding and using sitemaps. See: Support for extracting URLs in sitemaps #262
The binaries are available via Maven Central here.
If you want the distribution package, you can download it from here
This is the 2020-05-18 release. Despite being an interim release, it includes a number of important fixes for bugs in Heritrix 3.4 and should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.
This release features new modules to support archiving over SFTP, but the results are stored as a response record rather than the resource record that has been more widely used in the past. The next release will resolve this as per this pull request.
The binaries are available via Maven Central here.
If you want the distribution package, you can download it from here
This is the fourth dated, periodic release. It includes a number of significant changes, most importantly an upgrade of the Berkeley Database from the very old version 4.1.6 to version 7.5.11. This resolves a long-standing bug when recovering from checkpoints multiple times, but it also means that Heritrix state files from previous versions are not compatible with this version. In other words:
Any crawl state folders from previous versions of Heritrix are not compatible with this version! You can only use this new release with new crawls!
Some basic release notes are available here. You can find more detailed information in the changelog.
The binaries are available via Maven Central here.
If you want the distribution package, you can download it from here
This is the third dated, periodic release. Despite being an interim release, it includes a number of important fixes for bugs in Heritrix 3.4 and should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.
The binaries are available via Maven Central here.
If you want the distribution package, you can download it from here
This is the second dated, periodic release. Despite being an interim release, it includes a number of important fixes for bugs in Heritrix 3.2 and should be considered the latest stable version of Heritrix. Some basic release notes are available here.
The binaries are available via Maven Central here.
If you want the distribution package, you can download it from here