Trafilatura Versions

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

v1.8.1

1 month ago

Maintenance:

  • Pin LXML to prevent broken dependency (#535)

Extraction:

  • Improve extraction accuracy for major news outlets (#530)
  • Fix formatting by correcting order of element generation and space handling with @dlwh (#528)
  • Fix: prevent tail insertion before children in nested elements by @knit-bee (#536)

v1.8.0

1 month ago

Extraction:

  • Better precision by @felipehertzer (#509, #520)
  • Code formatting in TXT/Markdown output added (#498)
  • Improved CSV output (#496)
  • LXML: compile XPath expressions (#504)
  • Overall speedup of about 5%

Downloads and Navigation:

  • More robust scans with is_live_page() (#501)
  • Better sitemap start and safeguards (#503, #506)
  • Fix for headers in response object (#513)

Maintenance:

  • License changed to Apache 2.0
  • Response class: convenience functions added (#497)
  • lxml.html.Cleaner removed (#491)
  • CLI fixes: parallel cores and processing (#524)

v1.7.0

3 months ago

Extraction:

  • improved html2txt() function (#483)

Downloads:

  • add advanced fetch_response() function; fetch_url(decode=False) is pending deprecation

Maintenance:

  • support for LXML v5+ (#484 by @knit-bee, #485)
  • update htmldate

v1.6.4

4 months ago

Maintenance:

  • MacOS: fix setup, update htmldate and add tests (#460)
  • drop invalid XML element attributes with @vbarbaresi in #462
  • remove cyclic imports (#458)

Navigation:

  • introduce MAX_REDIRECTS config setting and fix urllib3 redirect handling by @vbarbaresi in #461
  • improve feed detection (#457)
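
As an illustration of how a setting such as MAX_REDIRECTS can be read from an INI-style settings file, here is a minimal standard-library sketch. The section name, the sample value, the fallback, and the helper read_max_redirects() are assumptions made for this example and are not taken from the release notes; the package's own configuration loader may work differently.

```python
import configparser

# Hypothetical settings text; the section name and value are assumptions.
SAMPLE_SETTINGS = """
[DEFAULT]
MAX_REDIRECTS = 2
"""


def read_max_redirects(text: str, fallback: int = 2) -> int:
    """Parse INI-style settings and return MAX_REDIRECTS as an integer."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # Option names are case-insensitive in configparser by default.
    return parser["DEFAULT"].getint("MAX_REDIRECTS", fallback=fallback)


print(read_max_redirects(SAMPLE_SETTINGS))
```

Keeping the limit in a settings file rather than in code lets it be tuned per crawl without touching the download logic.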

Documentation:

  • enhancements to documentation and testing with @Maddesea in #456

v1.6.3

5 months ago

Extraction:

  • preserve space in certain elements with @idoshamun (#429)
  • optional list of XPaths to prune by @HeLehm (#414)

Metadata:

  • more precise date extraction (see htmldate)
  • new htmldate extensive search parameter in config (#434)
  • changes in URLs: normalization, trackers removed (see courlan)

Navigation:

  • reviewed code for feeds (#443)
  • new config option: external URLs for feeds/sitemaps (#441)

Documentation:

  • update, add page on text embeddings with @tonyyanga (#428, #435, #447)
  • fix quickstart by @sashkab (#419)

v1.6.2

8 months ago

Extraction:

  • more lenient HTML parsing (#370)
  • improved code block support with @idoshamun (#372, #401)
  • conversion of relative links to absolute by @feltcat (#377)
  • remove use of signal from core functions (#384)
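
The conversion of relative links to absolute ones listed above can be illustrated with the standard library. This is only a sketch of the general technique, not the code added in #377; the helper make_absolute() and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def make_absolute(base_url: str, link: str) -> str:
    """Resolve a possibly relative link against the URL of the page it came from."""
    return urljoin(base_url, link)


page_url = "https://example.org/blog/post.html"
for href in ("/about", "images/pic.png", "../index.html", "https://other.example/x"):
    # Links that are already absolute are returned unchanged.
    print(make_absolute(page_url, href))
```

Resolving against the page URL rather than the bare domain matters for links such as images/pic.png, which are relative to the current directory.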

Metadata:

  • JSON-LD fix for sitenames by @felipehertzer (#383)

Command-line interface:

  • more robust batch processing (#381)
  • added --probe option to CLI to check for extractable content (#378, #392)

Maintenance:

  • simplified code (#408)
  • support for Python 3.12
  • pinned LXML version for MacOS (#393)
  • updated dependencies and parameters (notably htmldate and courlan)
  • code cleaning by @marksmayo (#406)

v1.6.1

10 months ago

Extraction:

  • minor fixes: tables in figures (#301), headings (#354) and lists (#318)

Metadata:

  • simplify and fully test JSON parsing code, with @felipehertzer (#352, #368)
  • authors, JSON and unicode fixes by @felipehertzer in #365
  • fix for authors without additionalName by @awwitecki in #363

Navigation:

  • reviewed link processing in feeds and sitemaps (#340, #350)
  • more robust spider (#359)
  • updated underlying courlan package (#360)

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.6.0...v1.6.1

v1.6.0

11 months ago

Extraction:

  • new content hashes and default file names (#314)
  • fix deprecation warning with @sdondley in #321
  • fix for metadata image by @andremacola in #328
  • fix potential unicode issue in third-party extraction with @Korben00 in #331
  • review logging levels (#347)

Command-line interface:

  • more efficient sitemap processing (#326)
  • more efficient downloads (#338)
  • fix for single URL processing (#324) and URL blacklisting (#339)

Navigation:

  • additional safety check on domain similarity for feeds and sitemaps
  • new function is_live_page() to test whether a page is reachable, using an HTTP HEAD request (#327)
  • code parts supported by new courlan version
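
The idea of testing whether a page is live with an HTTP HEAD request can be sketched with the standard library alone. This is an assumption-laden illustration and not the function added in #327: the name head_check(), the 5-second timeout, and the "status below 400" rule are all choices made for this example.

```python
import http.client
import urllib.request


def head_check(url: str, timeout: float = 5.0) -> bool:
    """Return True if a HEAD request to `url` succeeds with a status below 400."""
    try:
        # HEAD asks for headers only, so no response body is downloaded.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # URLError, HTTPError and timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


# A malformed URL is rejected without any network access.
print(head_check("not-a-url"))
```

A HEAD request is cheaper than a full GET, which makes it a reasonable pre-check before a larger crawl; some servers handle HEAD poorly, so a negative result may warrant a fallback.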

Maintenance:

  • allow urllib3 version 2.0+
  • minor code simplification and fixes

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.5.0...v1.6.0

v1.5.0

1 year ago

Extraction:

  • fixes for metadata extraction with @felipehertzer (#295, #296), @andremacola (#282, #310), and @edkrueger (#303)
  • pagetype and image urls added to metadata by @andremacola (#282, #310)
  • add as_dict method to Document class with @edkrueger in #306
  • XML output fix with @knit-bee in #315
  • various smaller fixes: lists (#309), XPaths, metadata hardening

Navigation:

  • transfer URL management to courlan.UrlStore (#232, #312)
  • fixes for spider module

Maintenance:

  • simplify code and extend tests
  • update underlying packages htmldate and courlan, as well as setup and docs

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.4.1...v1.5.0

v1.4.1

1 year ago

Extraction:

  • extraction bugs fixed (#263, #266), more robust HTML doctype parsing
  • XML output improvements by @knit-bee (#273, #274)
  • adjust thresholds for link density in paragraphs

Metadata:

  • improved title and sitename detection (#284)
  • faster author, categories, domain name, and tags extraction
  • fixes to author emoji regexes by @felipehertzer (#269)

Command-line interface:

  • review argument consistency and add deprecation warnings (#261)

Setup:

  • make download timeout configurable (#263)
  • updated dependencies, use of faust-cchardet for Python 3.11

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.4.0...v1.4.1