Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Maintenance:
Extraction:
Extraction:
Downloads and Navigation:
is_live_page() (#501)
Maintenance:
Response class: convenience functions added (#497)
lxml.html.Cleaner removed (#491)
Extraction:
html2txt() function (#483)
Downloads:
fetch_response() function → pending deprecation for fetch_url(decode=False)
Maintenance:
Maintenance:
Navigation:
MAX_REDIRECTS config setting and fix urllib3 redirect handling by @vbarbaresi in #461
Documentation:
Extraction:
Metadata:
htmldate extensive search parameter in config (#434)
Navigation:
Documentation:
Extraction:
Metadata:
Command-line interface:
--probe option to CLI to check for extractable content (#378, #392)
Maintenance:
htmldate and courlan)
Extraction:
Metadata:
additionalName by @awwitecki in #363
Navigation:
Full Changelog: https://github.com/adbar/trafilatura/compare/v1.6.0...v1.6.1
Extraction:
Command-line interface:
Navigation:
is_live test() using HTTP HEAD request (#327)
Maintenance:
urllib3 version 2.0+
Full Changelog: https://github.com/adbar/trafilatura/compare/v1.5.0...v1.6.0
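As a rough illustration of the liveness test over an HTTP HEAD request noted in the v1.6.0 entries (#327), the sketch below checks a URL with the Python standard library only. It is not Trafilatura's implementation: the function name is_live, the default timeout, and the rule that any status below 400 counts as live are assumptions made for this example.

```python
import http.client
import urllib.request


def is_live(url: str, timeout: float = 5.0) -> bool:
    """Return True if a HEAD request to `url` ends with a status below 400.

    Hypothetical helper for illustration; not part of Trafilatura's API.
    """
    try:
        # HEAD asks for headers only, so no response body is downloaded.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers URLError, HTTPError (4xx/5xx) and timeouts;
        # ValueError covers strings that are not valid URLs.
        return False
```

A HEAD request keeps the check cheap, but some servers reject or mishandle HEAD, so a False result from a sketch like this does not prove that the page is unreachable with GET.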
Extraction:
Navigation:
Maintenance:
Full Changelog: https://github.com/adbar/trafilatura/compare/v1.4.1...v1.5.0
Extraction:
Metadata:
Command-line interface:
Setup:
Full Changelog: https://github.com/adbar/trafilatura/compare/v1.4.0...v1.4.1