CommonCrawlDocumentDownload Versions

A small tool that uses the CommonCrawl URL Index to download documents with certain file types or mime-types. It is used for mass-testing frameworks such as Apache POI and Apache Tika.

1.0.0.10

1 year ago

1.0.0.9

1 year ago
  • Switch to Gradle 7.6 and to the new maven-publish plugin
  • Update third-party libraries
  • Update to a more recent CC-MAIN crawl
  • Parse newer fields
  • Adjust logging configuration

Full Changelog: https://github.com/centic9/CommonCrawlDocumentDownload/compare/1.0.0.8...1.0.0.9

1.0.0.8

1 year ago

Intermediate release while switching to Gradle 7.6, not uploaded to Maven Central.

Full Changelog: https://github.com/centic9/CommonCrawlDocumentDownload/compare/1.0.0.7...1.0.0.8

1.0.0.7

2 years ago
  • Add extension .pot for PowerPoint
  • Switch to CC-MAIN-2019-39
  • Update third-party libraries

Full Changelog: https://github.com/centic9/CommonCrawlDocumentDownload/compare/1.0.0.6...1.0.0.7

1.0.0.6

5 years ago
  • Update third-party libraries
  • Use Common Crawl 2018-43 by default
  • Write accumulated mime-types to a separate text file after each index file
  • Add some support for detecting duplicate files and moving them out of the list, so that the post-processing steps do not re-process the same file over and over
  • Some small adjustments for behavior changes in Java 11
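
The duplicate detection mentioned above could, for illustration, be based on content hashes. The release notes do not say how the tool actually recognizes duplicates, so the following is only a minimal sketch under that assumption; the class `DuplicateDetector` and its methods are hypothetical names, not part of the project.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (not taken from the project): remembers the SHA-256
// hash of every document seen so far and reports repeated content.
class DuplicateDetector {
    private final Set<String> seenHashes = new HashSet<>();

    // Returns true if byte-identical content was passed in before.
    boolean isDuplicate(byte[] content) {
        // Set.add returns false when the hash is already present
        return !seenHashes.add(sha256Hex(content));
    }

    // Computes the SHA-256 digest of the content as a lowercase hex string.
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DuplicateDetector detector = new DuplicateDetector();
        byte[] first = "same content".getBytes(StandardCharsets.UTF_8);
        byte[] copy = "same content".getBytes(StandardCharsets.UTF_8);
        byte[] other = "different content".getBytes(StandardCharsets.UTF_8);
        System.out.println(detector.isDuplicate(first));
        System.out.println(detector.isDuplicate(copy));
        System.out.println(detector.isDuplicate(other));
    }
}
```

In a real run the bytes would come from the downloaded files, and files flagged as duplicates would be moved aside before post-processing. Hashing only catches byte-identical copies; near-duplicates would need a different technique.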

1.0.0.5

6 years ago
  • Update third-party libraries
  • Download some more mime-types out of the box
  • Use a longer socket timeout
  • Switch to the new S3 public dataset URL
  • Handle new item "mime-detected" in JSON
  • Some refactoring
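
To illustrate the "mime-detected" item, here is a minimal sketch of reading it from the JSON part of a URL Index line and falling back to the older "mime" field when it is absent. The class `IndexLineMime`, the sample lines, and the choice to prefer "mime-detected" over "mime" are assumptions made for this example; the project itself may handle the field differently. A regular expression stands in for a proper JSON parser only to keep the example free of dependencies, and it does not handle escaped quotes.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not taken from the project): picks the mime-type
// from the JSON part of an index line.
class IndexLineMime {

    // Extracts a flat string field such as "mime": "application/pdf".
    // Simplification: assumes the value contains no escaped quotes.
    static Optional<String> field(String json, String key) {
        Pattern pattern = Pattern.compile(
                "\"" + Pattern.quote(key) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher matcher = pattern.matcher(json);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    // Assumption: prefer "mime-detected" when present, else use "mime".
    static Optional<String> effectiveMime(String json) {
        Optional<String> detected = field(json, "mime-detected");
        return detected.isPresent() ? detected : field(json, "mime");
    }

    public static void main(String[] args) {
        // Made-up sample lines for illustration only
        String newer = "{\"url\": \"http://example.com/a.doc\", "
                + "\"mime\": \"text/html\", \"mime-detected\": \"application/msword\"}";
        String older = "{\"url\": \"http://example.com/b.pdf\", "
                + "\"mime\": \"application/pdf\"}";
        System.out.println(effectiveMime(newer).orElse("unknown"));
        System.out.println(effectiveMime(older).orElse("unknown"));
    }
}
```

Preferring the detected type is useful when a server reports a generic or wrong Content-Type, which is the case shown by the first sample line.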