Awesome Web Scraper

A collection of awesome web scrapers and crawlers.

Java

Apache Nutch - Highly extensible, highly scalable Web crawler. Pluggable parsing, protocols, storage and indexing.
websphinx - Website-Specific Processors for HTML INformation eXtraction.
Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
crawler4j - Open source web crawler for Java that provides a simple interface for crawling the Web. With it, you can set up a multi-threaded web crawler in a few minutes.

C/C++

HTTrack - Website copier and offline browser utility. Downloads a website to a local directory for offline viewing.

C#

ccrawler - Built with C# 3.5. It contains a simple web content categorizer extension, which can separate web pages depending on their content.

Erlang

ebot - Open source web crawler built on top of a NoSQL database (Apache CouchDB, Riak), an AMQP message broker (RabbitMQ), webmachine and mochiweb.

Python

scrapy - Scrapy, a fast high-level web crawling & scraping framework for Python.
gdom - gdom, DOM Traversing and Scraping using GraphQL.
trafilatura - Library and command-line tool to extract metadata, main text, and comments.
extractnet - Machine learning based content & metadata extraction framework for Python.
Scrapegraph-ai - An open source library for scraping with the help of AI.

PHP

Goutte - Goutte, a simple PHP Web Scraper.
DiDOM - Simple and fast HTML parser.
simple_html_dom - Just a Simple HTML DOM library fork.
PHPCrawl - PHPCrawl is a framework for crawling/spidering websites written in PHP.
Crawler - A library for Rapid Web Crawler and Scraper Development.

Node.js

puppeteer - Headless Chrome Node API https://pptr.dev.
Phantomjs - Scriptable Headless WebKit.
node-crawler - Web Crawler/Spider for NodeJS + server-side jQuery.
node-simplecrawler - Flexible event driven crawler for node.
spider - Programmable spidering of web sites with node.js and jQuery.
slimerjs - A PhantomJS-like tool running Gecko.
casperjs - Navigation scripting & testing utility for PhantomJS and SlimerJS.
zombie - Insanely fast, full-stack, headless browser testing using node.js.
nightmare - Nightmare is a high-level wrapper for PhantomJS that lets you automate browser tasks.
jsdom - A JavaScript implementation of the WHATWG DOM and HTML standards, for use with node.js.
xray - The next web scraper. See through the <html> noise.
lightcrawler - Crawl a website and run it through Google Lighthouse.

Ruby

wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.

Go

gocrawl - Polite, slim and concurrent web crawler.
fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.

Rust

scraper - HTML parsing and querying with CSS selectors.
reqwest - An ergonomic, batteries-included HTTP Client for Rust.

License

MIT

Contributing

Please read the Contribution Guidelines before submitting your suggestion.

Feel free to open an issue or create a pull request with your additions.

Open Source Agenda is not affiliated with "Awesome Web Scraper" Project. README Source: duyet/awesome-web-scraper

