:spider: The pipeline for the OSCAR corpus
`annotation` to `quality_warnings` by @Uinelj in https://github.com/oscar-project/ungoliant/pull/85
Full Changelog: https://github.com/oscar-project/ungoliant/compare/v1.2.3...v2.0.0
Full Changelog: https://github.com/oscar-corpus/ungoliant/compare/v1.2.1...v1.2.3
Full Changelog: https://github.com/oscar-corpus/ungoliant/compare/v1.1.1...v1.2.1
This is the second release of Ungoliant, a project that provides tools to generate corpora from CommonCrawl. Ungoliant also includes already established pipeline(s), in particular to generate [OSCAR][oscar]-like corpora.
Ungoliant also replaces goclassy.
Get the release from the Releases tab or via cargo: `cargo install ungoliant`.
Ungoliant v1.1.0 features a new pipeline that produces document-oriented corpora instead of the previous line-oriented corpora.
The changes include:
This is the first release of Ungoliant, a project that provides tools to generate corpora from CommonCrawl. Ungoliant also includes already established pipeline(s), in particular to generate OSCAR-like corpora.
Ungoliant also replaces goclassy.
Get the release from the Releases tab or via cargo: `cargo install ungoliant`.
These changes are feature evolutions from goclassy
`ungoliant` command-line interface.
cmake installed if you plan on compiling Ungoliant yourself.