Wget Lua Versions

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

v1.21.3-at.20231213.03

6 months ago

Wget-AT 20231213.03 (Wget 1.21.3-at.20231213.03) Release Notes

This release adds the recording of more information on the build process of the used Wget-AT.

Wget-AT can be configured with several options, and built on and for different systems. Information about this is now written to the WARC record with WARC-Type value warcinfo, using fields whose names start with wget-build-*.

New warcinfo headers

The new headers in the warcinfo record are:

Example

The new wget-build-* headers in the warcinfo record look like this, for example:

wget-build-version: 1.21.3-at.20231213.03
wget-build-system-host: x86_64-pc-linux-gnu
wget-build-system-build: x86_64-pc-linux-gnu
wget-build-system-target: x86_64-pc-linux-gnu
wget-build-compilation-string: gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/usr/local/etc/wgetrc" -DLOCALEDIR="/usr/local/share/locale" -I. -I../lib -I../lib -I/usr/include/luajit-2.1 -I/usr/local/include -DHAVE_LIBSSL -I/usr/local/include -DNDEBUG -g -O2
wget-build-link-string: gcc -I/usr/local/include -DHAVE_LIBSSL -I/usr/local/include -DNDEBUG -g -O2 -L/usr/local/lib -lcares -lpcre2-8 -lidn2 -lssl -lcrypto -L/usr/local/lib -lzstd -lz -lpsl -lm -ldl -lluajit-5.1 ../lib/libgnu.a 
wget-build-features: +cares +digest -gpgme +https +ipv6 +iri +large-file -metalink -nls +ntlm +opie +psl +ssl/openssl
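
As a rough illustration of how these fields could be consumed, the sketch below parses the text of a warcinfo block and extracts the wget-build-* fields. The function names and the extra non-build field in the sample are hypothetical, and reading the block out of an actual WARC file is left out.

```python
def parse_build_fields(warcinfo_block: str) -> dict:
    """Collect wget-build-* fields from the text of a warcinfo record block."""
    fields = {}
    for line in warcinfo_block.splitlines():
        # Split at the first colon only; values may contain colons themselves.
        name, sep, value = line.partition(":")
        name = name.strip().lower()
        if sep and name.startswith("wget-build-"):
            fields[name] = value.strip()
    return fields


def split_features(feature_string: str) -> tuple:
    """Split a wget-build-features value into (enabled, disabled) lists."""
    tokens = feature_string.split()
    enabled = [t[1:] for t in tokens if t.startswith("+")]
    disabled = [t[1:] for t in tokens if t.startswith("-")]
    return enabled, disabled


# Shortened sample; "example-other-field" is a made-up field that only
# demonstrates that fields without the wget-build- prefix are ignored.
sample = """\
example-other-field: ignored
wget-build-version: 1.21.3-at.20231213.03
wget-build-system-host: x86_64-pc-linux-gnu
wget-build-features: +cares +digest -gpgme +https -metalink +ssl/openssl
"""

build = parse_build_fields(sample)
enabled, disabled = split_features(build["wget-build-features"])
print(build["wget-build-version"])
print("disabled features:", disabled)
```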

Minor update

A minor update is that the [email protected] email address, the repository URL https://github.com/ArchiveTeam/wget-lua, and the IRC channel #archiveteam-dev on hackint IRC are now shown in the output of the --version and --help options.

v1.21.3-at.20231213.01

6 months ago

Wget-AT 20231213.01 (Wget 1.21.3-at.20231213.01) Release Notes

This release records, in the WARC, minimal SSL/TLS information about the connection used to send and receive data. In addition, the release allows Wget-AT to keep track of the protocols used.

WARC-Cipher-Suite header

Before this release, no details of a used SSL/TLS connection were recorded. The only hint was the URI, which would start with either https (indicating a secure connection with the SSL or TLS protocol) or http. Websites may (and some have been confirmed to) return different HTTP responses depending on the details of the secure connection, which makes it important to store such information in WARCs.

There is no information in the WARC format (https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/) on storing secure connection information. A discussion is taking place in the issue at https://github.com/iipc/warc-specifications/issues/86 for adding support. This issue originally suggested the use of a WARC-TLS-Cipher-Suite WARC header, but also notes WARC-Cipher-Suite as an option.

WARC records

The cipher suite WARC header is written on records with WARC-Type value request and response when request and response data for an HTTPS URL is recorded. It is written on a revisit record as well when that record is written as the result of deduplicating a response record.

A resource record may also have this header written to it, for example when an FTPS connection is used. Currently this is not being done, as the FTP WARC record writing functionality in Wget-AT is not optimal. A major overhaul of this functionality is being worked on, at which point this header will be written for data transferred over an FTPS connection.

This header is not currently written on a metadata record in Wget-AT, although it is supported on that record type according to https://github.com/iipc/warc-specifications/issues/86.

Allowed header values

The value of this field is the IANA defined cipher suite name (https://www.iana.org/assignments/tls-parameters/tls-parameters.txt) for TLS, or for SSL a name as defined in RFC 6101 for SSLv3 or "The SSL Protocol" (https://www.ietf.org/archive/id/draft-hickman-netscape-ssl-00.txt) for SSLv2 (SSL version 0.2).

The recorded value should be the cipher suite that is in use for the connection. It should not be the cipher suite presented in the client hello step of the handshake, or any other value before a cipher suite is agreed on and application data is being transferred.

Using header name WARC-Cipher-Suite over WARC-TLS-Cipher-Suite

Having TLS in the WARC header name restricts one to storing only TLS cipher suites. The obsolete SSLv3 protocol uses SSL cipher suites whose defined names start with SSL_*. If a TLS connection is used, the cipher suite name starts with TLS_*. Leaving TLS out of the WARC header name allows both SSL and TLS cipher suites to be stored, while it is still clear from the first three bytes of the header value whether an SSL or TLS cipher suite was used.

RFC 5246 notes that "cipher suite values { 0x00, 0x1C } and { 0x00, 0x1D } are reserved to avoid collision with Fortezza-based cipher suites in SSL 3." These two cipher suite values are defined in RFC 6101 as respectively SSL_FORTEZZA_KEA_WITH_NULL_SHA and SSL_FORTEZZA_KEA_WITH_FORTEZZA_CBC_SHA. While most SSL_* cipher suites have been assigned a similar TLS_* name for further use in the TLS protocol (for example SSL_DHE_DSS_WITH_DES_CBC_SHA was named TLS_DHE_DSS_WITH_DES_CBC_SHA), some have not and are only assigned by their SSL_* defined names.

Value { 0x00,0x1E } was defined with name TLS_KRB5_WITH_DES_CBC_SHA in RFC 2712, while it holds an entirely different name and definition in RFC 6101 with SSL_FORTEZZA_KEA_WITH_RC4_128_SHA. Some cipher suites like SSL_CK_RC4_128_WITH_MD5 (used in SSL version 0.2, see "The SSL Protocol") do not have a TLS_* defined name at all. These are examples that show how not all SSL_* cipher suites can simply be represented by their TLS_* cipher suites, and why cipher suites should be written with their SSL_* defined name when the SSL protocol is used. While the SSL protocol is obsolete, it is technically possible to be used, and should be accounted for in the WARC headers.

Next to the above, a "cipher suite" is well defined and widely accepted as being either a TLS or SSL cipher suite, making WARC-Cipher-Suite a more minimal representation of the type of data to store than WARC-TLS-Cipher-Suite. Any future set of cipher suites that are neither SSL nor TLS cipher suites can also be written under the WARC-Cipher-Suite header.
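
The point about the first three bytes of the header value can be sketched as follows. This is only an illustration: the function name is hypothetical, and it returns "unknown" for anything that is neither an SSL_* nor a TLS_* name, which is an assumption about how a reader might want to treat future cipher suite families.

```python
def cipher_suite_family(header_value: str) -> str:
    """Classify a WARC-Cipher-Suite value by its first three bytes."""
    prefix = header_value.strip()[:3]
    if prefix == "TLS":
        return "tls"
    if prefix == "SSL":
        return "ssl"
    # Neither SSL nor TLS: possibly a future family of cipher suites.
    return "unknown"


for name in (
    "TLS_DHE_DSS_WITH_DES_CBC_SHA",
    "SSL_FORTEZZA_KEA_WITH_NULL_SHA",
    "SSL_CK_RC4_128_WITH_MD5",
):
    print(name, "->", cipher_suite_family(name))
```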

WARC-Protocol header

Next to recording the used cipher suite, the SSL/TLS version should be recorded, for the same reason as given in the previous section. The WARC format currently does not define a way to store this version, but the issue at https://github.com/iipc/warc-specifications/issues/42 discusses this. The proposed definition in this issue is implemented in this release.

The WARC-Protocol header is allowed to be written on all records the WARC-Cipher-Suite is allowed on.

Allowed header values

The allowed header values are as defined in the issue at https://github.com/iipc/warc-specifications/issues/42 of which a subset is used in Wget-AT as of this release:

  • http/0.9
  • http/1.0
  • http/1.1
  • ssl/2
  • ssl/3
  • tls/1.0
  • tls/1.1
  • tls/1.2
  • tls/1.3

The value ftp is not currently in use for the same reasons the WARC-Cipher-Suite header is not yet written on WARC records for FTPS URLs as explained in the previous section.

As with WARC-Cipher-Suite, the value should be that of the connection that is used to actually transfer data over, not anything used during negotiations.

Exactly one of the http/* values is always written on the request and response records of an HTTP(S) URL, while one of the ssl/* or tls/* values is written only on a record of an HTTPS URL. For example, a record with an HTTP/1.1 payload whose data was transferred over a TLSv1.3 connection carries the following WARC-Protocol headers:

WARC-Protocol: http/1.1
WARC-Protocol: tls/1.3
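
Because the WARC-Protocol header can occur more than once on a record, a reader has to collect all occurrences rather than keep only the last one. The sketch below shows one way to do that on the text of a record's header block. The function names and the sample header block are illustrative and not taken from Wget-AT.

```python
def warc_protocols(header_block: str) -> list:
    """Return every WARC-Protocol value in a record's header block, in order."""
    values = []
    for line in header_block.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "warc-protocol":
            values.append(value.strip())
    return values


def split_protocols(values: list) -> tuple:
    """Separate http/* values from ssl/* and tls/* values."""
    application = [v for v in values if v.startswith("http/")]
    secure = [v for v in values if v.startswith(("ssl/", "tls/"))]
    return application, secure


# Illustrative header block for a response record of an HTTPS URL.
headers = """\
WARC-Type: response
WARC-Target-URI: https://example.com/
WARC-Protocol: http/1.1
WARC-Protocol: tls/1.3
"""

application, secure = split_protocols(warc_protocols(headers))
print(application, secure)
```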

Minor features

One minor feature was added in this release:

  • The Dockerfile now uses Debian bookworm instead of Debian bullseye.

Bug fixes

Three bugs have been fixed in this release:

  • A bug is fixed that prevented the use of option --warc-cdx.
  • The manual of Wget states that specifying a protocol of SSLv2, SSLv3, TLSv1, TLSv1_1, TLSv1_2, or TLSv1_3 with option --secure-protocol forces the use of this protocol. In practice this was not the case: the protocol was set as the minimum version. If --secure-protocol=TLSv1_1 was given, one of TLSv1_1, TLSv1_2, or TLSv1_3 would be used after negotiation. This is now fixed to follow the manual.
  • If a URL was transformed from an HTTP to an HTTPS URL due to HSTS, the HTTP version of the URL was still written in the WARC headers, while the HTTPS URL was used for data transfer. This is now fixed.
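
The difference between forcing a protocol version and setting it only as a minimum, as in the --secure-protocol fix in the list above, can be illustrated with Python's ssl module. This is an analogy only and not how Wget-AT implements the option; the helper names are hypothetical.

```python
import ssl


def minimum_only_context(version: ssl.TLSVersion) -> ssl.SSLContext:
    """Old behaviour in spirit: the version is only a lower bound."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = version
    return ctx


def pinned_context(version: ssl.TLSVersion) -> ssl.SSLContext:
    """Fixed behaviour in spirit: lower and upper bound are the same version."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = version
    ctx.maximum_version = version
    return ctx


floor = minimum_only_context(ssl.TLSVersion.TLSv1_2)
pinned = pinned_context(ssl.TLSVersion.TLSv1_2)

# With only a minimum set, negotiation may still end up on a newer version.
print(floor.minimum_version, floor.maximum_version)
# With both bounds set, only the requested version can be negotiated.
print(pinned.minimum_version, pinned.maximum_version)
```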

v1.21.3-at.20220528.01

2 years ago

v1.21.3-at.20220503.02

2 years ago

v1.21.3-at.20220503.01

2 years ago

v1.20.3-at.20211001.01

2 years ago

v1.20.3-at.20200401.01

3 years ago

Wget-AT 20200401.01 (Wget 1.20.3-at.20200401.01) Release Notes

This is the first official release of Wget-AT as continuation of Wget-Lua. Wget-AT is a new direction with Wget-Lua to add more modern features for web archiving, in addition to the already implemented Lua scripting.

This release adds support for Zstandard with dictionary compression, implements URL-agnostic deduplication and moves to version 1.1 of the WARC format.

WARC/1.1

Version 1.1 of the WARC format (https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/) adds a number of new fields and corrects a number of erroneous recommendations in version 1.0 of the format.

The notable changes to version 1.1 WARCs created with 1.20.3-at.20200401.01 compared to 1.0 WARCs created with previous versions are the addition of

  • the WARC-Refers-To-Target-URI header and
  • the WARC-Refers-To-Date header

for WARC revisit records. The version noted in the WARC records is now WARC/1.1 instead of WARC/1.0.

Zstandard with dictionary

Normally, according to the standard for WARC/1.1, WARC records are compressed using Zlib, creating .warc.gz files. Every record is compressed individually. If many webpages with overlapping content are stored in a WARC file, that overlap is stored again in every compressed record, because individually compressed records cannot reference each other. With a dictionary in which the overlapping parts can be referenced, these parts can be largely compressed away, so records compressed with Zstandard and a dictionary carry much less overhead in size.
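
The effect of a shared dictionary on individually compressed records can be sketched with the Python standard library. Note that this example uses zlib's preset dictionary feature as a stand-in, since no Zstandard binding is assumed to be installed. It does not produce Zstandard output, and the sample page content is made up.

```python
import zlib

# Boilerplate shared by many pages of a hypothetical site.
shared = (
    b"<html><head><title>Example site</title>"
    b"<link rel='stylesheet' href='/static/site.css'></head>"
    b"<body><nav>Home | About | Contact</nav>"
)
record = shared + b"<p>Unique text of page 1.</p></body></html>"


def compress_with_dict(data: bytes, dictionary: bytes) -> bytes:
    """Compress one record on its own, referencing a preset dictionary."""
    compressor = zlib.compressobj(zdict=dictionary)
    return compressor.compress(data) + compressor.flush()


def decompress_with_dict(blob: bytes, dictionary: bytes) -> bytes:
    """Decompress a record; the same dictionary must be supplied again."""
    decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(blob) + decompressor.flush()


plain = zlib.compress(record)
with_dict = compress_with_dict(record, shared)

print("original:", len(record))
print("compressed without dictionary:", len(plain))
print("compressed with dictionary:", len(with_dict))
```

The record compressed with the dictionary is only usable together with that dictionary, which is why the dictionary is stored with, or alongside, the WARC.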

Implementation

The implementation of Zstandard with dictionary compression has been created in cooperation with Internet Archive to allow playback of Zstandard compressed WARCs through the Wayback Machine. WARCs created with Zstandard compression have the extension .warc.zst, similar to .warc.gz when Zlib compression is used.

Zstandard can be used both with and without a dictionary. Even without a dictionary, Zstandard has been shown to perform better than many other compression algorithms, such as Zlib, which is normally used for WARC record compression. The additional use of a dictionary allows records to be compressed to smaller sizes, and allows overlapping data between records to be compressed away when a suitably trained dictionary is used.

Zstandard allows for skippable frames, which allow for any user data to be added between frames in an additional frame. This frame is normally skipped by software handling Zstandard compressed files. The skippable frame (see https://facebook.github.io/zstd/zstd_manual.html for details) consists of, in listed order,

  • the skippable frame ID with values between 0x184D2A50 and 0x184D2A5F, in little endian format,
  • the frame size in 4 bytes, in little endian format, and
  • the content of the frame.

A used dictionary can be stored in the skippable frame with frame ID 0x184D2A5D as the very first frame of the WARC file. By default the Zstandard dictionary is compressed with Zstandard before being added as content of the skippable frame, unless option --warc-zstd-dict-no-compression is given to prevent compression of the dictionary before storing it. To prevent the dictionary from being included at the start of the resulting WARC file, option --warc-zstd-dict-no-include should be used.
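
The skippable frame layout described above (frame ID, 4-byte size, content) can be sketched with the standard struct module. The function names are hypothetical, and the content is treated as opaque bytes: whether it is a raw or a Zstandard-compressed dictionary is not handled here.

```python
import struct

DICT_FRAME_ID = 0x184D2A5D  # frame ID used for the dictionary frame


def build_skippable_frame(content: bytes, frame_id: int = DICT_FRAME_ID) -> bytes:
    """Build a skippable frame: ID and size as little-endian uint32, then content."""
    if not 0x184D2A50 <= frame_id <= 0x184D2A5F:
        raise ValueError("not a skippable frame ID")
    return struct.pack("<II", frame_id, len(content)) + content


def read_skippable_frame(data: bytes) -> tuple:
    """Read a skippable frame at the start of data.

    Returns (frame_id, content, total_frame_length).
    """
    frame_id, size = struct.unpack_from("<II", data, 0)
    if not 0x184D2A50 <= frame_id <= 0x184D2A5F:
        raise ValueError("data does not start with a skippable frame")
    content = data[8:8 + size]
    if len(content) != size:
        raise ValueError("truncated skippable frame")
    return frame_id, content, 8 + size


frame = build_skippable_frame(b"dictionary bytes")
frame_id, content, length = read_skippable_frame(frame + b"rest of the file")
print(hex(frame_id), content, length)
```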

--warc-compression-use-zstd

Use Zstandard instead of Zlib compression for compressing WARC records. To use a Zstandard dictionary as well, use option --warc-zstd-dict=FILENAME.

--warc-zstd-dict=FILENAME

The Zstandard dictionary to use for compression. Option --warc-compression-use-zstd needs to be used in order to use this option.

By default the dictionary is compressed with Zstandard and included at the beginning of the WARC file, unless option --warc-zstd-dict-no-compression or option --warc-zstd-dict-no-include, respectively, is used.

--warc-zstd-dict-no-include

Prevent the used Zstandard dictionary from being included in a skippable frame at the start of the WARC file. Option --warc-zstd-dict=FILENAME needs to be used in order to use this option.

It can be useful not to include the dictionary if many separate WARCs are created using the same dictionary. Storing the dictionary in every WARC creates overhead in size. Instead, it may be useful to store the Zstandard dictionary separately.

--warc-zstd-dict-no-compression

Prevent the compression of the used Zstandard dictionary with Zstandard before writing it to the skippable frame. Option --warc-zstd-dict=FILENAME needs to be used in order to use this option.

Zstandard dictionaries themselves are not compressed, and compressing the dictionary can often make the skippable frame tens of percent smaller than one holding the uncompressed dictionary. Not compressing the dictionary might improve performance, as no decompression needs to take place in order to use the dictionary.
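
The dependencies between the four Zstandard options described above can be summarised in a small helper that assembles the option list. This is a sketch only: the helper name and its parameters are hypothetical, and raising an error on an invalid combination is a choice made for this example, not a description of how Wget-AT itself reacts.

```python
def zstd_warc_args(use_zstd=False, dict_path=None,
                   include_dict=True, compress_dict=True) -> list:
    """Build the Zstandard-related options, checking the documented dependencies."""
    if dict_path is not None and not use_zstd:
        raise ValueError("--warc-zstd-dict needs --warc-compression-use-zstd")
    if dict_path is None and not (include_dict and compress_dict):
        raise ValueError("the -no-include and -no-compression options need --warc-zstd-dict")

    args = []
    if use_zstd:
        args.append("--warc-compression-use-zstd")
    if dict_path is not None:
        args.append(f"--warc-zstd-dict={dict_path}")
    if not include_dict:
        args.append("--warc-zstd-dict-no-include")
    if not compress_dict:
        args.append("--warc-zstd-dict-no-compression")
    return args


# "site.zstdict" is a placeholder file name.
print(zstd_warc_args(use_zstd=True, dict_path="site.zstdict", include_dict=False))
```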

Deduplication

With deduplication on WARC records, a response record can be converted to a revisit record if it is found to be a duplicate of another record. In accordance with version 1.1 of the WARC format, the headers

  • WARC-Refers-To, referring to WARC-Record-ID of the original record,
  • WARC-Refers-To-Target-URI, referring to WARC-Target-URI of the original record,
  • WARC-Refers-To-Date, referring to WARC-Date of the original record,
  • WARC-Profile, with value http://netpreserve.org/warc/1.1/revisit/identical-payload-digest, and
  • WARC-Truncated, with value length,

are added and header WARC-Type is assigned value revisit. WARC-Block-Digest is set to the digest of the truncated data and WARC-Payload-Digest is the digest of the original payload.
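
The header changes listed above can be sketched as a function that derives the revisit-specific headers from the headers of the original record. The function name and all sample values are placeholders. WARC-Block-Digest is left out because it depends on the truncated block data, which this sketch does not model.

```python
REVISIT_PROFILE = "http://netpreserve.org/warc/1.1/revisit/identical-payload-digest"


def revisit_headers(original: dict) -> dict:
    """Derive the revisit-specific WARC headers from the original record's headers."""
    return {
        "WARC-Type": "revisit",
        "WARC-Refers-To": original["WARC-Record-ID"],
        "WARC-Refers-To-Target-URI": original["WARC-Target-URI"],
        "WARC-Refers-To-Date": original["WARC-Date"],
        "WARC-Profile": REVISIT_PROFILE,
        "WARC-Truncated": "length",
        # The payload is identical, so the payload digest matches the original's.
        "WARC-Payload-Digest": original["WARC-Payload-Digest"],
    }


# Placeholder values for illustration only.
original = {
    "WARC-Record-ID": "<urn:uuid:00000000-0000-0000-0000-000000000001>",
    "WARC-Target-URI": "https://example.com/",
    "WARC-Date": "2020-04-01T00:00:00Z",
    "WARC-Payload-Digest": "sha1:PLACEHOLDERDIGEST",
}

for name, value in revisit_headers(original).items():
    print(f"{name}: {value}")
```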

With this release URL-agnostic deduplication is supported for WARC records in a single Wget session with the --warc-dedup-url-agnostic option. URL-gnostic deduplication is used by default for WARC writing, unless disabled with --warc-dedup-disable.

--warc-dedup-url-agnostic

Allow URL-agnostic deduplication of WARC records in the same Wget session.

With URL-agnostic deduplication, a response record is converted into a revisit record when only the WARC-Payload-Digest matches that of a previously written record. Other WARC headers, like WARC-Target-URI, do not have to be equal in order for a revisit record to be written.

--warc-dedup-min-size=NUMBER

The minimum size in bytes a payload must have before it is deduplicated. The default value is 100.

When a response record is converted to a revisit record, a number of fields are added. The value of --warc-dedup-min-size is used to determine when it is 'worth it' to write a revisit record instead of the original, given the increase or decrease in size, performance, and other factors.

--warc-dedup-disable

Disables the URL-gnostic deduplication. This deduplication is turned on by default.

URL-gnostic deduplication converts a response record into a revisit record when another record was previously written with equal values for the WARC-Payload-Digest and WARC-Target-URI WARC headers.
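
The two deduplication modes and the minimum size can be sketched as a lookup keyed either on the payload digest alone (URL-agnostic) or on the digest together with the target URI (URL-gnostic). This is a simplified model and not Wget-AT's implementation. The class and parameter names are hypothetical, and treating a payload of exactly the minimum size as eligible is an assumption.

```python
class Deduplicator:
    """Decide whether a response should be written as a revisit record."""

    def __init__(self, mode="url-gnostic", min_size=100):
        # mode: "url-gnostic", "url-agnostic", or None for no deduplication.
        self.mode = mode
        self.min_size = min_size
        self._seen = {}  # lookup key -> WARC-Record-ID of the first record

    def _key(self, payload_digest, target_uri):
        if self.mode == "url-agnostic":
            return payload_digest
        return (payload_digest, target_uri)

    def check(self, record_id, target_uri, payload_digest, payload_size):
        """Return the ID of the record to refer to, or None to write a response."""
        if self.mode is None:
            return None
        key = self._key(payload_digest, target_uri)
        original = self._seen.get(key)
        if original is None:
            self._seen[key] = record_id  # first sighting becomes the original
            return None
        if payload_size < self.min_size:
            return None  # too small to be worth a revisit record
        return original


gnostic = Deduplicator(mode="url-gnostic")
print(gnostic.check("r1", "http://a.example/", "sha1:X", 500))  # first sighting
print(gnostic.check("r2", "http://b.example/", "sha1:X", 500))  # other URI
print(gnostic.check("r3", "http://a.example/", "sha1:X", 500))  # same URI and digest
```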

v1.20.3-lua

4 years ago

Wget-Lua with updated Wget version 1.20.3.