Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Impact on extraction and output format:
Smaller changes in convenience functions:
Updates:
Full Changelog: https://github.com/adbar/trafilatura/compare/v1.3.0...v1.4.0
html2txt() function added (#221)
process_record() function
Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.2...v1.3.0
Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.1...v1.2.2
--precision and --recall arguments added to the CLI
Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.0...v1.2.1
Full Changelog: https://github.com/adbar/trafilatura/compare/v1.1.0...v1.2.0
Full Changelog: https://github.com/adbar/trafilatura/compare/v1.0.0...v1.1.0
pycurl, language identification with py3langid
Full Changelog: https://github.com/adbar/trafilatura/compare/v0.9.3...v1.0.0
Full Changelog: https://github.com/adbar/trafilatura/compare/v0.9.2...v0.9.3