JSoup - Scrapes, parses, manipulates and cleans HTML.
websphinx - Website-Specific Processors for HTML information extraction.
Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
spider-flow - A visual spider framework that lets you crawl a website without writing any code.
Norconex Web Crawler - Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data in a repository of your choice (e.g. a search engine). Can be used as a standalone application or embedded into Java applications.
C#
ccrawler - Built in C# 3.5. It includes a simple web content categorizer extension that can separate web pages according to their content.
SimpleCrawler - Simple spider based on multithreading and regular expressions.
DotnetSpider - A cross-platform, lightweight spider developed in C#.
Abot - C# web crawler built for speed and flexibility.
Hawk - Advanced Crawler and ETL tool written in C#/WPF.
SkyScraper - An asynchronous web scraper / web crawler using async / await and Reactive Extensions.
Infinity Crawler - A simple but powerful web crawler library in C#.