Trafilatura Versions

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

v1.4.0

1 year ago

Impact on extraction and output format:

  • better extraction (#233, #243 & #250 with @knit-bee, #246 with @mrienstra, #258)
  • XML: preserve list type as attribute (#229)
  • XML TEI: better conformity with @knit-bee (#238, #242, #253, #254)
  • faster text cleaning and shorter code (#237 with @deedy5, #245)
  • metadata: add language when detector is activated (#224)
  • metadata: extend fallbacks and test coverage for json_metadata functions by @felipehertzer (#235)
  • TXT: change markdown formatting of headers by @LaundroMat (#257)

Smaller changes in convenience functions:

  • add function to clear caches (#219)
  • CLI: change exit code if download fails (#223)
  • settings: use "\n" for multiple user agents by @k-sareen (#241)
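
As a rough illustration of the "\n"-separated user agents entry above, the sketch below parses a multi-line value with Python's standard configparser and splits it on newlines. The section name, the USER_AGENTS key and the agent strings are assumptions made for this example; check the settings file shipped with the package for the actual layout.

```python
import configparser
import random

# Hypothetical settings fragment: one user agent per line (assumed layout).
CONFIG_TEXT = """
[DEFAULT]
USER_AGENTS =
    Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0
    ExampleBot/0.1 (+https://example.org/bot)
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# configparser joins continuation lines with "\n", so splitting on it
# recovers one entry per user agent; blank entries are dropped.
raw_value = parser.get("DEFAULT", "USER_AGENTS")
agents = [line.strip() for line in raw_value.split("\n") if line.strip()]

# Pick one agent at random for a request.
chosen = random.choice(agents)
print(agents)
print(chosen)
```

Using a newline as the separator avoids ambiguity with commas and semicolons, which commonly appear inside user agent strings.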

Updates:

  • docs updated (including #244 by @dsgibbons)
  • package dependencies updated

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.3.0...v1.4.0

v1.3.0

1 year ago
  • fast and robust html2txt() function added (#221)
  • more robust parsing (#228)
  • fixed bugs in metadata extraction, with @felipehertzer in #213 & #226
  • extraction about 10-20% faster, slightly better recall
  • partial fixes for memory leaks (#216)
  • docs extended and updated (#217, #225)
  • prepared deprecation of old process_record() function
  • more stable processing with updated dependencies
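
To give an idea of what a plain "strip the markup, keep the text" function in the spirit of html2txt() does, here is a minimal sketch built only on the standard library's html.parser. It is not the library's implementation; the class and function names (TextCollector, naive_html_to_text) are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes, ignoring the content of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def naive_html_to_text(html):
    """Return all visible text of the document with whitespace collapsed."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Some <b>bold</b> text.</p>"
    "<script>var x = 1;</script></body></html>"
)
text = naive_html_to_text(sample)
print(text)
```

Such a function keeps everything, including navigation and boilerplate, which is why it serves as a fast fallback rather than as a replacement for main-content extraction.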

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.2...v1.3.0

v1.2.2

2 years ago
  • more efficient rules for extraction
  • metadata: further attributes used (with @felipehertzer)
  • better baseline extraction
  • issues fixed: #202, #204, #205
  • evaluation updated

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.1...v1.2.2

v1.2.1

2 years ago

What's Changed

  • --precision and --recall arguments added to the CLI
  • better text cleaning: paywalls and comments
  • improvements for Chinese websites (with @glacierck & @immortal-autumn): #186, #187, #188
  • further bugs fixed: #189, #192 (with @felipehertzer), #200
  • efficiency: faster module loading and improved RAM footprint
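
The trade-off behind the --precision and --recall arguments can be made concrete with a small worked example. The sketch below computes both measures for two hypothetical extraction results against a reference; the segment labels and the helper name precision_recall are invented for illustration and say nothing about how the tool evaluates itself.

```python
def precision_recall(extracted, reference):
    """Compute precision and recall of extracted segments against a reference."""
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Segments a human annotator considers main content (hypothetical).
reference = {"title", "para1", "para2", "para3"}

# A strict extractor keeps little, but everything it keeps is correct.
strict = {"title", "para1"}

# A greedy extractor keeps all the content, plus some boilerplate.
greedy = {"title", "para1", "para2", "para3", "menu", "footer"}

print(precision_recall(strict, reference))
print(precision_recall(greedy, reference))
```

The strict result reaches perfect precision at the cost of recall, while the greedy result does the opposite; a precision- or recall-oriented setting moves the extractor toward one end or the other.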

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.0...v1.2.1

v1.2.0

2 years ago
  • efficiency: replaced module readability-lxml by trimmed fork
  • bugs fixed (#179, #180, #183, #184)
  • improved baseline extraction
  • cleaner metadata (with @felipehertzer)

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.1.0...v1.2.0

v1.1.0

2 years ago
  • encodings: better detection, output NFC-normalized Unicode
  • maintenance and performance: more efficient code
  • bugs fixed (#119, #136, #147, #160, #161, #162, #164, #167 and others)
  • prepare compatibility with upcoming Python 3.11
  • changed default settings
  • extended documentation
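
For readers unfamiliar with NFC normalization, the following sketch shows its effect using the standard unicodedata module: a letter followed by a combining accent is composed into a single code point, so visually identical strings also compare equal. This only illustrates the normalization form itself, not the library's output code.

```python
import unicodedata

# "é" written as 'e' + combining acute accent: two code points.
decomposed = "e\u0301"

# NFC composes the sequence into the single precomposed code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print(composed == "\u00e9")
```

Normalized output matters downstream: deduplication, search and token counts all behave inconsistently when the same character can appear in two encodings.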

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.0.0...v1.1.0

v1.0.0

2 years ago
  • compress HTML backup files & seamlessly open .gz files
  • support JSON web feeds
  • graphical user interface integrated into main package
  • faster downloads: reviewed backoff, compressed data
  • optional modules: downloads with pycurl, language identification with py3langid
  • bugs fixed (#111, #125, #132, #136, #140)
  • minor optimizations and fixes by @vbarbaresi in #124 & #130
  • fixed handling of arrays with single or multiple entries in the JSON extractor by @felipehertzer in #143
  • code base refactored with @sourcery-ai (#121), improved and optimized for Python 3.6+
  • drop support for Python 3.5
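
One common way to open .gz files "seamlessly" is to check for the gzip magic bytes and decompress only when they are present. The sketch below shows that technique with the standard library; the function name read_html_backup is hypothetical, and the approach is an assumption about how such a feature can work, not a description of the package's code.

```python
import gzip
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def read_html_backup(path):
    """Read an HTML file, transparently decompressing it if it is gzipped."""
    data = Path(path).read_bytes()
    if data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)
    return data.decode("utf-8", errors="replace")


html = "<html><body><p>Hello</p></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    plain = Path(tmp) / "page.html"
    packed = Path(tmp) / "page.html.gz"
    plain.write_text(html, encoding="utf-8")
    packed.write_bytes(gzip.compress(html.encode("utf-8")))

    # Both files yield the same text, regardless of compression.
    from_plain = read_html_backup(plain)
    from_packed = read_html_backup(packed)

print(from_plain == from_packed)
```

Checking the content rather than the file extension means that misnamed files are still read correctly.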

Full Changelog: https://github.com/adbar/trafilatura/compare/v0.9.3...v1.0.0

v0.9.3

2 years ago
  • better, faster encoding detection: replaced chardet with charset_normalizer
  • faster execution: updated justext to 3.0
  • better extraction of sub-elements in tables (#78, #90)
  • more robust web feed parsing
  • further defined precision- and recall-oriented settings
  • license extraction in footers (#118)

Full Changelog: https://github.com/adbar/trafilatura/compare/v0.9.2...v0.9.3

v0.9.2

2 years ago
  • first precision- and recall-oriented presets defined
  • improvements in authorship extraction (thanks @felipehertzer)
  • requesting TXT output with formatting now results in Markdown format
  • bugs fixed: notably extraction robustness and consistency (#109, #111, #113)
  • setting for cookies in request headers (thanks @muellermartin)
  • better date extraction thanks to htmldate update

v0.9.1

2 years ago
  • improved author extraction (thanks @felipehertzer!)
  • bugs fixed: HTML element handling, HTML meta attributes, spider, CLI, ...
  • docs updated and extended
  • CLI: option names normalized (heed deprecation warnings), new option --explore