Aut Versions

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

aut-1.2.0

1 year ago

Closed issues:

  • Include last modified date for a resource #546

aut-1.1.1

1 year ago

Fixed bugs:

  • DomainGraph should use YYYYMMDD not YYYYMMDDHHMMSS #544
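
To illustrate the distinction behind #544, here is a minimal Python sketch. It is not the toolkit's own code: `to_day` and the sample link records are hypothetical, and the sketch only assumes that crawl timestamps are 14-digit YYYYMMDDHHMMSS strings. Truncating to the 8-digit YYYYMMDD day lets links captured on the same day collapse into one graph edge.

```python
from collections import Counter


def to_day(crawl_date: str) -> str:
    """Truncate a YYYYMMDDHHMMSS timestamp to YYYYMMDD (hypothetical helper)."""
    if len(crawl_date) < 8 or not crawl_date[:8].isdigit():
        raise ValueError(f"not a crawl timestamp: {crawl_date!r}")
    return crawl_date[:8]


# Hypothetical (timestamp, source domain, target domain) link records.
links = [
    ("20200101093000", "example.com", "example.org"),
    ("20200101174500", "example.com", "example.org"),
    ("20200102080000", "example.com", "example.org"),
]

# Count edges per day rather than per second of capture.
edge_counts = Counter((to_day(ts), src, dst) for ts, src, dst in links)
```

With full timestamps as the key, each of the three records would form its own edge; keyed by day, the two captures from 1 January share one.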

aut-1.1.0

1 year ago

Fixed bugs:

  • org.apache.tika.mime.MimeTypeException: Invalid media type name: application/rss+xml lang=utf-8 #542

Closed issues:

  • Add ARCH text files derivatives #540

aut-1.0.0

1 year ago

Implemented enhancements:

  • Remove http headers, and html on webpages() #538
  • Add domain column to webpages() #534
  • Replace Java ARC/WARC record processing library #494
  • Method to perform finer-grained selection of ARCs and WARCs #247
  • Unnecessary buffer copying #18

Fixed bugs:

  • Discard date RDD filter only takes a single string, not a list of strings. #532
  • Extract gzip data from transfer-encoded WARC #493
  • ARC reader string vs int error on record length #492

Closed issues:

  • java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.Set$Set1 Set(liberal.ca) #529
  • Improve CommandLineApp.scala test coverage #262
  • Improve ExtractBoilerpipeText.scala test coverage #261
  • Improve ArchiveRecord.scala test coverage #260
  • Unit testing for RecordLoader #182
  • Improve ArchiveRecordWritable.java test coverage #76
  • Improve WarcRecordUtils.java test coverage #74
  • Improve ArcRecordUtils.java test coverage #73
  • Improve ExtractDate.scala test coverage #64
  • Remove org.apache.commons.httpclient #23

aut-0.91.0

2 years ago

Implemented enhancements:

  • Include timestamp in crawl date #525

Merged pull requests:

  • Change crawl_date format to YYYYMMDDHHMMSS, update hasDate filter. #526 (ruebot)

aut-0.90.4

2 years ago

Implemented enhancements:

  • Replace scala-uri library from ExtractDomain and just parse public_suffix_list.dat #521

Fixed bugs:

  • Scaladocs haven't been created since 0.90.0 release #522

aut-0.90.3

2 years ago

Fixed bugs:

  • ExtractDomains returns non-Apex Domains #519
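
As a rough illustration of what "apex domain" means in #519, the following Python sketch reduces a host name to its registrable domain using a set of public suffixes. It is an assumption-laden simplification, not the toolkit's implementation: `apex_domain` and the tiny `SUFFIXES` set are hypothetical, and the wildcard and exception rules of the real public suffix list are ignored.

```python
from typing import Set

# Hypothetical stand-in for entries parsed from a public suffix list.
SUFFIXES: Set[str] = {"com", "org", "uk", "co.uk", "ca"}


def apex_domain(host: str, suffixes: Set[str] = SUFFIXES) -> str:
    """Return the public suffix plus one label to its left (simplified)."""
    labels = host.lower().rstrip(".").split(".")
    # Scanning from the left finds the longest matching suffix first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            # Keep one extra label; a bare suffix is returned unchanged.
            return ".".join(labels[max(i - 1, 0):])
    # Unknown suffix: fall back to the last two labels.
    return ".".join(labels[-2:])
```

For example, `apex_domain("www.example.co.uk")` gives `example.co.uk`, whereas naively keeping the last two labels would give `co.uk`.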

aut-0.90.2

3 years ago

Fixed bugs:

  • ARC file name appearing in url list #516
  • WARC-Target-URI in Wget warc files is not parsed properly #514

Merged pull requests:

  • Filter out filedesc and dns records from arcs. #517 (ruebot)
  • Handle wget WARC-Target-URI formatting. #515 (ruebot)
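
The notes above do not say what the Wget formatting problem in #514 was. The Python sketch below assumes it is the angle-bracket wrapping that Wget-written WARC files are known to use (`WARC-Target-URI: <http://example.com/>`). `clean_target_uri` is a hypothetical helper for illustration only, not the change made in #515.

```python
def clean_target_uri(value: str) -> str:
    """Strip one pair of surrounding angle brackets, if present (hypothetical)."""
    value = value.strip()
    # Assumption: a wrapped URI starts with "<" and ends with ">".
    if len(value) >= 2 and value.startswith("<") and value.endswith(">"):
        return value[1:-1]
    return value
```

Values without brackets pass through unchanged, so the helper can be applied to every record regardless of which tool wrote the file.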

aut-0.90.1

3 years ago

Fixed bugs:

  • crawl_date is not included on binary information jobs when documentation says it is #512

Merged pull requests:

  • Add missing crawl_date column to binary information jobs. #513 (ruebot)
  • Update jsoup to 1.13.1 #511 (ruebot)

aut-0.90.0

3 years ago

Fixed bugs:

  • Python implementation of .all() has .keepValidPages() incorrectly applied to it #502
  • Extract hyperlinks from wayback machine #501
  • Release 0.80.0 JAR produces error; built 0.80.1 fatjar built on repo works #495

Closed issues:

  • Migrate CI infrastructure from TravisCI to GitHub Action #506
  • Split tf into its own repo #498
  • Change master branch to main branch #490
  • GitHub action - Run isort and black on Python code #488
  • Add scalafmt GitHub action #486
  • Add Google Java Formatter as a GitHub action #484
  • Packages build is often broken - should we support it? #483
  • Implement SaveToDisk in Python #478
  • Java 11 support #356

Merged pull requests:

  • ars-cloud compatibility with aut and Java 11 #510 (ruebot)
  • Update to Spark 3.0.1 #508 (ruebot)
  • Replace TravisCI with GitHub Actions. #507 (ruebot)
  • Bump junit from 4.12 to 4.13.1 #505 (dependabot[bot])
  • Fix relative links extraction #504 (yxzhu16)
  • Remove .keepValidPages() on .all() Python implementation. #503 (ruebot)
  • Updates read.me to include citation section #500 (SamFritz)
  • Remove tf project; resolves #498. #499 (ruebot)
  • Add Python formatter GitHub Action. #489 (ruebot)
  • Add scalafmt GitHub action and apply it to scala code. #487 (ruebot)
  • Add Google Java Formatter as an action, and apply it. #485 (ruebot)
  • Add Python implementation of SaveBytes. #482 (ruebot)
  • Bump xercesImpl from 2.11.0 to 2.12.0 #481 (dependabot[bot])
  • [Skip Travis] Trim README down given aut.docs.archivesunleashed.org #480 (ruebot)
  • Spark 3.0.0 + Java 11 support. #375 (ruebot)