Trafilatura Versions

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

1 year ago

Impact on extraction and output format:

better extraction (#233, #243 & #250 with @knit-bee, #246 with @mrienstra, #258)
XML: preserve list type as attribute (#229)
XML TEI: better conformity with @knit-bee (#238, #242, #253, #254)
faster text cleaning and shorter code (#237 with @deedy5, #245)
metadata: add language when detector is activated (#224)
metadata: extend fallbacks and test coverage for json_metadata functions by @felipehertzer (#235)
TXT: change markdown formatting of headers by @LaundroMat (#257)

Smaller changes in convenience functions:

Updates:

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.3.0...v1.4.0

1 year ago

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.2...v1.3.0

2 years ago

What's Changed

--precision and --recall arguments added to the CLI
better text cleaning: paywalls and comments
improvements for Chinese websites (with @glacierck & @immortal-autumn): #186, #187, #188
further bugs fixed: #189, #192 (with @felipehertzer), #200
efficiency: faster module loading and improved RAM footprint

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.0...v1.2.1

2 years ago

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.1.0...v1.2.0

2 years ago

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.0.0...v1.1.0

2 years ago

compress HTML backup files & seamlessly open .gz files
support JSON web feeds
graphical user interface integrated into main package
faster downloads: reviewed backoff, compressed data
optional modules: downloads with pycurl, language identification with py3langid
bugs fixed (#111, #125, #132, #136, #140)
minor optimizations and fixes by @vbarbaresi in #124 & #130
fixed handling of arrays with single or multiple entries in the JSON extractor by @felipehertzer in #143
code base refactored with @sourcery-ai (#121), improved and optimized for Python 3.6+
drop support for Python 3.5
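To illustrate what "compress HTML backup files & seamlessly open .gz files" can look like in practice, here is a minimal standard-library sketch. It is not the library's actual implementation: the function names `write_backup` and `read_html` are hypothetical, and detecting compression through the gzip magic bytes is one possible approach among others.

```python
import gzip
from pathlib import Path

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def write_backup(path, html, compress=True):
    """Store an HTML string on disk, gzip-compressed by default.

    Returns the path actually written (".gz" is appended when compressing).
    """
    path = Path(path)
    data = html.encode("utf-8")
    if compress:
        path = path.with_name(path.name + ".gz")
        path.write_bytes(gzip.compress(data))
    else:
        path.write_bytes(data)
    return path


def read_html(path):
    """Read an HTML file, transparently decompressing gzip content."""
    raw = Path(path).read_bytes()
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    # Replace undecodable bytes instead of failing on odd encodings.
    return raw.decode("utf-8", errors="replace")
```

With these helpers, `read_html(write_backup("page.html", "<html></html>"))` returns the original string whether or not compression was used, which is the behavior the changelog entry describes from the user's point of view.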

Full Changelog: https://github.com/adbar/trafilatura/compare/v0.9.3...v1.0.0

2 years ago

Full Changelog: https://github.com/adbar/trafilatura/compare/v0.9.2...v0.9.3
