ACHE Versions

ACHE is a web crawler for domain-specific search.

0.15.0

1 year ago

We are pleased to announce version 0.15.0 of ACHE Crawler!

This version includes several dependency updates and fixes a robots.txt serialization bug that only occurs when the robots.txt feature is enabled. This fix may make data from previous crawls that use robots.txt incompatible with this version. We also plan to upgrade Elasticsearch support in the next version, so this may be the last version to support legacy Elasticsearch versions (e.g., <6.x).

Following is a detailed log of the changes since the last version:

Bump okhttp from 3.14.0 to 4.9.3
Bump jackson-* libraries from 2.13.1 to 2.13.3
Bump logback-classic from 1.2.9 to 1.2.11
Bump slf4j-api from 1.7.32 to 1.7.36
Bump RoaringBitmap from 0.9.23 to 0.9.27
Bump metrics-* libraries from 4.2.7 to 4.2.17
Bump aws-java-sdk-s3 from 1.12.131 to 1.12.225
Remove aws-java-sdk-s3 dependency from main project
Add support for Elasticsearch 7.x and 8.x indexing (#282)
Bump jetty-server from 9.4.44.v20210927 to 9.4.48.v20220622
Bump kryo-serializers from 0.42 to 0.43
Bump RoaringBitmap from 0.9.27 to 0.9.39
Bump tika-parsers from 1.18 to 1.28.4
Bump gradle-node-plugin to version 3.5.1 and node.js to 18.14.2
Migrate tests from jUnit 4 to 5
Migrate test assertions from Hamcrest to AssertJ
Bump org.apache.httpcomponents:httpclient from 4.5.13 to 4.5.14
Bump ch.qos.logback:logback-classic from 1.2.+ to 1.4.5
Fix robots.txt serialization bug
Bump jackson-* libraries from 2.13.3 to 2.14.2
Bump org.apache.commons:commons-lang3 from 3.4 to 3.12.0
Bump org.apache.commons:commons-compress from 1.21 to 1.22
Bump org.apache.kafka:kafka-clients from 3.2.0 to 3.4.0
Bump com.squareup.okhttp3:okhttp from 4.9.3 to 4.10.0

0.14.0

2 years ago

We are pleased to announce version 0.14.0 of ACHE Crawler!

Following is a detailed log of the changes since the last version:

Remove support for CDR 3.1 format in Kafka target repository
Move tools and memex packages to the ache-tools sub-project
Moved forked crawler-commons classes to a separate sub-project
Remove tika dependency from ache and crawler-commons sub-project
Synchronize crawler-commons/http-fetcher with the upstream library
Setup gradle build using GitHub Actions
Build docker image with multi-arch support (amd64, arm64)
Upgrade build to Gradle 7.3.3
Upgrade gradle-node-plugin to version 3.0.1
Upgrade ache-dashboard npm dependencies
Pin slf4j-api version to 1.7.32
Bump airline from 0.8 to 0.9
Bump aws-java-sdk-s3 from 1.12.129 to 1.12.131
Bump crawler-commons from 1.1 to 1.2
Bump com.github.kt3k.coveralls from 2.10.2 to 2.12.0
Bump commons-codec from 1.10 to 1.15
Bump commons-compress from 1.12 to 1.21
Bump commons-lang3 from 3.4 to 3.12.0
Bump commons-validator from 1.6 to 1.7
Bump guava from 20.0 to 23.0
Bump jetty-server from 9.3.6.v20151106 to 9.4.44.v20210927
Bump kryo from 4.0.0 to 4.0.2
Bump kafka-clients from 0.11.0.1 to 3.0.0
Bump logback-classic from 1.1.+ to 1.2.9
Bump mockito-core from 1.10.+ to 4.2.0
Bump npm from 6.14.10 to 8.3.0
Bump rocksdbjni from 6.2.2 to 6.25.3
Bump RoaringBitmap from 0.7.8 to 0.9.23
Bump smile-core from 1.5.0 to 1.5.3
Bump lucene-analyzers-common from 7.3.1 to 8.10.1
Bump webarchive-commons from 1.1.8 to 1.1.9
Bump jsoup from 1.10.3 to 1.14.3
Bump junit from 4.12 to 4.13.2
Bump jackson-* libraries from 2.8.5 to 2.13.1
Bump metrics-* libraries from 3.1.3 to 4.2.7
Replace SparkJava framework (unmaintained) by Javalin 4.2.0
Add timeout configurations for the TOR fetcher
Update and improve the documentation
Change documentation theme to sphinx_material
Add support for HTTP BASIC auth for Elasticsearch data format

0.13.0

3 years ago

We are pleased to announce version 0.13.0 of ACHE Crawler!

Following is a detailed log of the changes since the last version:

Upgrade gradle-node-plugin to version 2.2.4
Upgrade gradle wrapper to version 6.6.1
Upgrade crawler-commons to version 1.1
Reorganized gradle module directory structure
Rename root package to achecrawler
Use multi-stage build to reduce Docker image size
Refactor Elasticsearch repository and make it wait until the server is ready
Upgrade npm dependencies

0.12.0

4 years ago

We are pleased to announce version 0.12.0 of ACHE Crawler!

Following is a detailed log of the changes since the last version:

Upgrade crawler-commons dependency to version 0.9
Removed Elasticsearch transport-client-based repository
Removed Elasticsearch 1.4.4 binaries dependency
Added DumpDataFromElasticsearch tool for dumping documents from Elasticsearch repositories
Added configuration for minimum relevance in link selectors
Added configuration for selecting whether to re-crawl sitemaps and robots.txt links
Added documentation about relevance_threshold parameters to the target page classifiers documentation page
Added support for crawling via HTTP proxy in okhttp3 fetcher (by @maqzi)
Added tracking of more HTTP error messages (301, 302, 3xx, 402) (by @maqzi)
Upgrade crawler-commons library to version 1.0
Upgrade commons-validator library to version 1.6
Upgrade okhttp3 library to version 3.14.0
Fix issue #177: Links from recent TLDs are considered invalid
Upgrade RocksDB dependency (rocksdbjni) to version 6.2.2
Added error code details to RocksDB exception logs
Upgrade gradle-node-plugin to version 1.3.1
Upgrade npm version to 6.10.2
Upgrade ache-dashboard npm dependencies
Upgrade gradle wrapper to version 5.6.1
Update Dockerfile to use openjdk:11-jdk (Java 11)
Added content_type field to RegexTargetClassifier
Change default link classifier to LinkClassifierBreadthSearch
Update io.airlift:airline dependency to version 0.8
Update gradle build script to use new plugins DSL
Update coveralls gradle plugin to version 2.9.0
Update searchkit to version ^2.4.0

0.11.0

5 years ago

We are pleased to announce version 0.11.0 of ACHE Crawler! Besides several technical improvements, we are really glad to announce the very first ACHE release under the Apache License 2 (APLv2).

Following is a detailed log of the major changes since the last version:

Removed dependency on Weka and reimplemented all machine-learning code using SMILE.
Added option to skip cross-validation on ache buildModel command
Added option to configure max number of features on ache buildModel command
Changed license from GNU GPL to Apache 2.0
Added tool (ache run ReplayCrawl) to replay old crawls using a new configuration file
Added near-duplicate page detection using min-hashing and LSH
Support ELASTIC format in Kafka data format (issue #155)
Upgrade react-scripts to get rid of vulnerable transitive dependency (hoek:4.2.0)
Upgrade npm version to 5.8.0 on gradle build script
Changed smile target page classifier to use Platt's scaling only when the parameter 'relevance_threshold' is provided in the pageclassifier.yml file.
Added Ansible scripts for automatic deployment
Added RocksDB-based target repository (RocksDBTargetRepository)
Fixed bug in ache-dashboard that prevented reloading search page on the browser page refresh (issue #163)
Support Elasticsearch 6.x (issue #158)
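
The near-duplicate detection item above names min-hashing and LSH without describing them. The following is a minimal Python sketch of the general technique, not ACHE's actual implementation (which is written in Java). The shingle size, signature length, band count, and all function names are assumptions chosen for illustration.

```python
import hashlib
import re
from collections import defaultdict

NUM_HASHES = 64  # signature length (assumed value)
BANDS = 16       # LSH bands; 64 / 16 = 4 rows per band (assumed value)


def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens):
    """Min-hash signature: the minimum hash value for each seeded hash."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(NUM_HASHES)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def lsh_candidates(signatures):
    """Return pairs of ids whose signatures agree on at least one full band."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(BANDS):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs
```

In this sketch, only the candidate pairs returned by lsh_candidates would need a full similarity check, which avoids comparing every page against every other page.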

0.10.0

6 years ago

We are pleased to announce version 0.10.0 of ACHE Crawler! This release contains very important changes, which include support for running multiple crawlers in a single server (multi-tenancy), and the start of our migration to Apache License 2 (APLv2).

Following is a detailed log of the major changes since last version:

Upgraded gradle-node plugin to version 1.2.0
Removed BerkeleyDB dependency (issue #143)
Allow for running multiple crawlers in a single server (issue #103)
REST API endpoints modified to support multiple crawlers (issue #103)
Web interface modified to support multiple crawlers (issue #103)
Display more metrics in crawler monitoring page
Upgrade RocksDB (org.rocksdb:rocksdbjni) to version 5.8.7 (issue #142)
Upgraded build script plugin "gradle-node" to version 1.2.0
Upgraded javascript dependencies from crawler web-interface:
- react to version 16.2.0
- react-vis to version 1.7.9
- searchkit to version 2.3.0
- npm to version 5.6.0
Allow cookies to be modified dynamically via REST API endpoint (issue #114)
Added crawlerId field to JSON output of target repositories to track provenance of crawled pages

0.9.0

6 years ago

We are pleased to announce version 0.9.0 of ACHE Focused Crawler! We also recently reached the milestone of 100+ stars on GitHub, 55+ forks, and 1000+ commits in the current git repository. We would like to thank all users for the feedback we have received in the past year.

This is a large release and it brings many improvements to the documentation and several new features. Following is a detailed log of major changes since last version:

Fixed multiple bugs and handling of exceptions
Several improvements made to ACHE documentation
Allow use of multiple data formats simultaneously (issue #92)
Added new data storage format using the standard WARC format (issue #64)
Added new data storage format using Apache Kafka (issue #123)
Re-crawling of sitemaps.xml files using fixed time intervals (issue #73)
Allow configuration of cookies in ache.yml (issue #81)
Allow configuration of full User-Agent string
Fixed memory issues that would cause OutOfMemoryError (issue #63)
Support for robots exclusion protocol a.k.a. robots.txt (issue #46)
Added new HTTP fetcher implementation using okhttp3 library with support for multiple SSL cipher suites
Non-HTML pages are no longer parsed as HTML
Training of new link classifiers (Online Learning) in a background thread (issue #76)
Added REST API endpoint to stop crawler
Added REST API endpoint to add new seeds to the crawl
Added documentation for the REST API
Persist run-time crawl metrics across crawler restarts (issue #101)
Added support for per-domain wildcard link filters (issue #121)
Add more detailed metrics for HTTP response codes (issue #120)
Changed referrer policies in the search dashboard for better security
Added various configuration options for timeouts in both fetcher implementations (issue #122)
Added support for Basic HTTP authentication in the web interface (issue #129)
Added REST API endpoints to support monitoring using Prometheus.io (issue #128)
Add page relevance metrics for better monitoring (issue #119)
Add parameters for elasticsearch index and type names through the /startCrawl REST API (issue #107)
Support for serving web interface from non-root path (issue #137)
Added button to stop crawler in web user interface (issue #139)
Upgraded searchkit library to 2.2.0 which supports Elasticsearch 5.x
Upgrade crawler-commons library to version 0.8
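
For readers unfamiliar with the robots exclusion protocol mentioned in the list above, here is a minimal Python sketch of how a crawler can check a URL against robots.txt rules. It uses Python's standard urllib.robotparser module and is not ACHE's own code. The sample rules, the user agent name "example-bot", and the helper names are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, user_agent, url):
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

A crawler following this approach would call is_allowed before scheduling each download and skip any URL for which it returns False.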

Notice that there were breaking changes in some data formats:

Repositories for relevant and irrelevant pages are now stored in the same folder (or same Elasticsearch index) and page entries include new properties to identify pages as relevant or irrelevant according to the target page classifier output. Double check the data formats documentation page and make sure you make appropriate changes if needed.

0.8.0

7 years ago

We are pleased to announce version 0.8.0 of ACHE Focused Crawler.

This release includes a more complete and reorganized documentation (available at http://ache.readthedocs.io/en/latest/) and a new REST API for real-time crawler monitoring.

Following is the detailed log of major changes since last version.

Added frontier load time metrics (issue #59)
Update some library versions on build.gradle
Update gradle wrapper to version 3.2.1
Added Dockerfile
Added connection timeouts to BingSearchAzureAPI
Changed seed finder to use SimpleHttpFetcher
Added option to configure a custom user agent string
Added option of not starting console reporter in MetricsManager
Change set_version script to work on MacOS
Updated test dependency (Jetty) to version 9.3.6
Rewrite all CLI programs using only airline library
Shutdown crawler and log errors on any error (any Throwable)
Simple WekaTargetClassifier refactoring
Added argument --seedsPath to specify the directory to store the seed file in SeedFinder command
Replaced the deprecated installApp by installDist gradle command in conda.recipe
Fixed type of links extracted from sitemaps
REST API for real-time metrics monitoring (issue #67)
Remove dependency on linkclassifier.features file from LinkClassifierBreadthSearch (issue #65)
Create an initial version of web-based crawler dashboard for visualization of system metrics (issue #68)
Avoid creating empty files when not necessary in FilesTargetRepository
Added Memex CDRv3 support
Added Elasticsearch indexer to AcheToCdrFileExporter and rename it to AcheToCdrExporter
Capture exceptions and retry on failures during ElasticSearch bulk indexing
Refactoring of TargetClassifierFactory
Added command annotation to MigrateToFilesTargetRepository tool
Added a simple in-memory duplicate detection tool
Added a new regex-based target classifier that matches multiple fields (issue #69)
Created an initial version of documentation using the documentation generation system Sphinx and published documentation online at http://ache.readthedocs.io/ (issue #66)
Added additional system descriptions and a scaffold for missing documentation (issue #66)
Added badge with link to documentation in README.md (issue #66)
Added an index to page-classifiers documentation page
Improved documentation on page classifiers
Added a tool to run a classifier over a file content
Adjusted regex matcher to use DOTALL mode (issue #69)
Rename test file correctly
Write a CSV with queries, classification result, and URLs (issue #71)
Moved SeedFinder documentation from wiki to Sphinx documentation

0.7.0

7 years ago

There were more than 100 commits since the last release, 0.6.0, on July 8. Following are some of the improvements.

ACHE is now simpler to use and to configure:

Added more specific configuration samples for focused crawling and in-depth website crawling
Stopwords are now an optional parameter, and an embedded stopword list is used by default
Added utility tools for working with CDR (Common Data Repository) files
Added utility to print frontier links along with relevance scores
Added configuration for HTTP connection pool size

ACHE is faster: we fixed synchronization and parallelism issues that led to improvements of crawler efficiency of 980% (a simple benchmark available at https://github.com/ViDA-NYU/ache/issues/56).

ACHE is more resilient due to fixes for bugs related to:

Extraction of malformed URLs during HTML parsing
Failures due to handling of URLs with IPv4 addresses
Failure to train the link classifier for certain configuration values
Corruption of binary data improperly stored in strings

URL normalization added for links extracted from web pages, so less duplicate content will be fetched
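
To illustrate why normalization reduces duplicate fetches, here is a minimal Python sketch of common normalization steps. The release note does not say which rules ACHE applies, so the rules below (lowercasing scheme and host, dropping default ports and fragments, using "/" for an empty path) and the function name are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the target resource.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return a canonical form so equivalent URLs compare equal.

    This sketch drops any user info and leaves the query string untouched.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if ":" in host:
        # IPv6 literals need their brackets restored.
        host = f"[{host}]"
    netloc = host
    if parts.port is not None and DEFAULT_PORTS.get(scheme) != parts.port:
        netloc = f"{host}:{parts.port}"
    path = parts.path or "/"
    # The fragment is dropped because it never reaches the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

With these rules, "HTTP://Example.COM:80/a/b#section" and "http://example.com/a/b" map to the same string, so a frontier keyed on the normalized form would download the page only once.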

Cleaned log messages and added logging of structured data in CSV files regarding:

Download requests
Links selected to be downloaded

Added detailed software metrics that allow better monitoring and detection of problems. The added metrics include counts, 1-, 5-, and 15-minute rates, mean, median, and 75th, 95th, 98th, and 99th percentiles for:

URL fetch time
Download page processing time
Current download queue size
Current processing and pending downloads in queue

ACHE has an improved data management:

Added new page repository that stores multiple pages in rolling compressed files
Added a new alternative database backend based on Facebook's RocksDB key-value store that improves efficiency and JVM memory management.

Some stability problems were solved, such as:

Limiting size of downloader thread-pool queue sizes
Properly close repository files during crawler shutdown
Avoid starting crawler shutdown multiple times

Other minor improvements, such as:

Migrated code base to Java 8
More refactoring, code cleaning, and tests (coverage 44%)

0.6.0

7 years ago

We are pleased to announce version 0.6.0 of ACHE Focused Crawler. Here we list the major changes since last version.

New features, improvements and bug fixes:

Implementation of the SeedFinder algorithm, which leverages search engine APIs to automatically create a large and diverse seed URL set to bootstrap the crawler.
Added a flexible way to use different handlers for different types of links, which will allow different extractors for each content type, such as HTML, media files, XML sitemaps, etc.
Support for the sitemap.xml protocol, which allows the crawler to automatically discover all links along with some metadata specified by webmasters.
More bug fixes and code refactoring.
More unit tests and integration tests (coverage raised to 42%)
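
As an illustration of the sitemap.xml support listed above, here is a minimal Python sketch that extracts links and their optional metadata from a sitemap document. It follows the public sitemaps.org schema and is not ACHE's own parser. The function name and the returned dictionary layout are assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 schema.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return one dict per <url> entry with its location and metadata."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            continue  # skip entries without a location
        entries.append({
            "loc": loc,
            # The fields below are optional in the schema and may be None.
            "lastmod": url.findtext("sm:lastmod", namespaces=SITEMAP_NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=SITEMAP_NS),
            "priority": url.findtext("sm:priority", namespaces=SITEMAP_NS),
        })
    return entries
```

A crawler could add each "loc" value to its frontier and use "lastmod" or "changefreq" as hints for when to revisit a page.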