Trafilatura Versions

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

v1.8.1

1 month ago

Maintenance:

Pin LXML to prevent broken dependency (#535)

Extraction:

Improve extraction accuracy for major news outlets (#530)
Fix formatting by correcting order of element generation and space handling with @dlwh (#528)
Fix: prevent tail insertion before children in nested elements by @knit-bee (#536)

v1.8.0

1 month ago

Extraction:

Better precision by @felipehertzer (#509, #520)
Code formatting in TXT/Markdown output added (#498)
Improved CSV output (#496)
LXML: compile XPath expressions (#504)
Overall speedup of about 5%

Downloads and Navigation:

More robust scans with is_live_page() (#501)
Better sitemap start and safeguards (#503, #506)
Fix for headers in response object (#513)

Maintenance:

License changed to Apache 2.0
Response class: convenience functions added (#497)
lxml.html.Cleaner removed (#491)
CLI fixes: parallel cores and processing (#524)

v1.7.0

3 months ago

Extraction:

improved html2txt() function (#483)

Downloads:

add advanced fetch_response() function → pending deprecation for fetch_url(decode=False)

Maintenance:

support for LXML v5+ (#484 by @knit-bee, #485)
update htmldate

v1.6.4

4 months ago

Maintenance:

MacOS: fix setup, update htmldate and add tests (#460)
drop invalid XML element attributes with @vbarbaresi in #462
remove cyclic imports (#458)

Navigation:

introduce MAX_REDIRECTS config setting and fix urllib3 redirect handling by @vbarbaresi in #461
improve feed detection (#457)
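To illustrate how a setting such as MAX_REDIRECTS could be read from an INI-style configuration, here is a minimal sketch using only the standard library. The section name, the second key, and both values are assumptions made for the example; they are not taken from these release notes.

```python
from configparser import ConfigParser

# Hypothetical settings text; section name, keys and values are assumptions.
SETTINGS = """
[DEFAULT]
DOWNLOAD_TIMEOUT = 30
MAX_REDIRECTS = 2
"""


def load_max_redirects(text: str, fallback: int = 2) -> int:
    """Return MAX_REDIRECTS from INI-style text, or the fallback if unset."""
    parser = ConfigParser()
    parser.read_string(text)
    return parser.getint("DEFAULT", "MAX_REDIRECTS", fallback=fallback)


max_redirects = load_max_redirects(SETTINGS)
print(max_redirects)
```

A download routine would then stop following redirects once this limit is reached.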

Documentation:

enhancements to documentation and testing with @Maddesea in #456

v1.6.3

5 months ago

Extraction:

preserve space in certain elements with @idoshamun (#429)
optional list of XPaths to prune by @HeLehm (#414)
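The idea of pruning a tree with a list of path expressions can be sketched as follows. This is not the library's implementation: it uses the standard library's ElementTree, which supports only a limited XPath subset, as a stand-in for LXML. The function name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def prune(root: ET.Element, paths: Iterable[str]) -> ET.Element:
    """Remove every element matched by any of the given path expressions."""
    for path in paths:
        # Rebuild the child-to-parent map each round, since the tree changes.
        parents = {child: parent for parent in root.iter() for child in parent}
        for element in root.findall(path):
            parent = parents.get(element)
            if parent is not None:
                # Note: remove() also drops the element's tail text.
                parent.remove(element)
    return root


# Hypothetical sample document, for illustration only.
doc = ET.fromstring(
    "<html><body><nav>Menu</nav><p>Main text</p>"
    "<div class='ad'>Buy now</div></body></html>"
)
prune(doc, [".//nav", ".//div[@class='ad']"])
remaining = [el.tag for el in doc.iter()]
print(remaining)
```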

Metadata:

more precise date extraction (see htmldate)
new htmldate extensive search parameter in config (#434)
changes in URLs: normalization, trackers removed (see courlan)

Navigation:

reviewed code for feeds (#443)
new config option: external URLs for feeds/sitemaps (#441)

Documentation:

update, add page on text embeddings with @tonyyanga (#428, #435, #447)
fix quickstart by @sashkab (#419)

v1.6.2

8 months ago

Extraction:

more lenient HTML parsing (#370)
improved code block support with @idoshamun (#372, #401)
conversion of relative links to absolute by @feltcat (#377)
remove use of signal from core functions (#384)
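For readers unfamiliar with the relative-to-absolute link conversion listed above, the underlying operation can be sketched with the standard library. This is a generic illustration rather than the library's own code; the function name, page URL, and links are hypothetical.

```python
from urllib.parse import urljoin


def absolutize(links, base_url):
    """Resolve each link against the page URL; absolute links pass through."""
    return [urljoin(base_url, link) for link in links]


# Hypothetical page URL and links, for illustration only.
base = "https://example.org/blog/post.html"
links = ["/about", "images/pic.png", "https://other.example/page", "../index.html"]
resolved = absolutize(links, base)
for url in resolved:
    print(url)
```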

Metadata:

JSON-LD fix for sitenames by @felipehertzer (#383)

Command-line interface:

more robust batch processing (#381)
added --probe option to CLI to check for extractable content (#378, #392)

Maintenance:

simplified code (#408)
support for Python 3.12
pinned LXML version for MacOS (#393)
updated dependencies and parameters (notably htmldate and courlan)
code cleaning by @marksmayo (#406)

v1.6.1

10 months ago

Extraction:

minor fixes: tables in figures (#301), headings (#354) and lists (#318)

Metadata:

simplify and fully test JSON parsing code, with @felipehertzer (#352, #368)
authors, JSON and unicode fixes by @felipehertzer in #365
fix for authors without additionalName by @awwitecki in #363

Navigation:

reviewed link processing in feeds and sitemaps (#340, #350)
more robust spider (#359)
updated underlying courlan package (#360)

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.6.0...v1.6.1

v1.6.0

11 months ago

Extraction:

new content hashes and default file names (#314)
fix deprecation warning with @sdondley in #321
fix for metadata image by @andremacola in #328
fix potential unicode issue in third-party extraction with @Korben00 in #331
review logging levels (#347)
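One common way to derive a default file name from a content hash is shown below. This is a sketch under assumptions, not necessarily the scheme introduced in #314: the choice of BLAKE2b, the 12-byte digest, and the function name are all hypothetical.

```python
import hashlib
from base64 import urlsafe_b64encode


def hash_filename(content: str, digest_size: int = 12) -> str:
    """Derive a short, filesystem-safe name from the text content."""
    digest = hashlib.blake2b(
        content.encode("utf-8"), digest_size=digest_size
    ).digest()
    # URL-safe Base64 avoids "/" and "+", which are awkward in file names.
    return urlsafe_b64encode(digest).decode("ascii")


name = hash_filename("Some extracted article text.")
print(name + ".txt")
```

Identical content always maps to the same name, so repeated runs over the same pages do not create duplicate files.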

Command-line interface:

more efficient sitemap processing (#326)
more efficient downloads (#338)
fix for single URL processing (#324) and URL blacklisting (#339)

Navigation:

additional safety check on domain similarity for feeds and sitemaps
new function is_live_page() using HTTP HEAD request (#327)
code parts supported by new courlan version
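A liveness check based on an HTTP HEAD request, as mentioned in the list above, can be approximated with the standard library. The sketch below is not the library's function: the names `build_head_request` and `is_live`, the timeout, and the accepted status range are assumptions.

```python
from urllib.request import Request, urlopen


def build_head_request(url: str) -> Request:
    """Create a request that asks for headers only, not the page body."""
    return Request(url, method="HEAD")


def is_live(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers a HEAD request with a 2xx/3xx status."""
    try:
        with urlopen(build_head_request(url), timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # OSError covers URLError/HTTPError and timeouts;
        # ValueError covers malformed URLs.
        return False


# Example call (requires network access):
# is_live("https://example.org/")
```

A HEAD request transfers no body, which keeps such checks cheap when scanning many URLs.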

Maintenance:

allow urllib3 version 2.0+
minor code simplification and fixes

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.5.0...v1.6.0

v1.5.0

1 year ago

Extraction:

fixes for metadata extraction with @felipehertzer (#295, #296), @andremacola (#282, #310), and @edkrueger (#303)
pagetype and image urls added to metadata by @andremacola (#282, #310)
add as_dict method to Document class with @edkrueger in #306
XML output fix with @knit-bee in #315
various smaller fixes: lists (#309), XPaths, metadata hardening

Navigation:

transfer URL management to courlan.UrlStore (#232, #312)
fixes for spider module

Maintenance:

simplify code and extend tests
update underlying packages htmldate and courlan, as well as setup and docs

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.4.1...v1.5.0

v1.4.1

1 year ago

Extraction:

extraction bugs fixed (#263, #266), more robust HTML doctype parsing
XML output improvements by @knit-bee (#273, #274)
adjust thresholds for link density in paragraphs
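Link density, as referred to above, is usually understood as the share of a text block's characters that sit inside links. A minimal sketch of that measure follows; it is a generic illustration, not the library's code, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Count visible characters overall and those found inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0
        self.total_chars = 0
        self.link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        length = len(data.strip())
        self.total_chars += length
        if self.link_depth > 0:
            self.link_chars += length


def link_density(fragment: str) -> float:
    """Return link characters divided by all characters (0.0 if empty)."""
    parser = LinkDensityParser()
    parser.feed(fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


print(link_density("<p>abcd<a href='/x'>efgh</a></p>"))
```

A paragraph whose density exceeds some threshold is likely navigation or boilerplate rather than main text; the threshold itself is a tuning choice.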

Metadata:

improved title and sitename detection (#284)
faster author, categories, domain name, and tags extraction
fixes to author emoji regexes by @felipehertzer (#269)

Command-line interface:

review argument consistency and add deprecation warnings (#261)

Setup:

make download timeout configurable (#263)
updated dependencies, use of faust-cchardet for Python 3.11

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.4.0...v1.4.1