ACHE Versions

ACHE is a web crawler for domain-specific search.

0.5.0

8 years ago

We are pleased to announce version 0.5.0 of ACHE Focused Crawler. Here we list the major changes since the last version.

New features, improvements and bug fixes:

  • New simplified configuration based on a single YAML file, ache.yml (see the configuration sketch after this list)
  • Fixed "backlink crawling", which uses the Mozscape API to fetch backlinks
  • Complete rewrite of the Crawler Manager module, with threading bug fixes and a new thread-management model
  • Allowed the HTTP fetcher to cancel downloads of undesired MIME types (via a valid MIME types configuration)
  • Added the ability to crawl .onion links on the Tor network using HTTP proxies such as Privoxy
  • Added more unit tests for several components (test coverage raised to 31% of the codebase)
  • Further code cleanup and refactoring
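
For illustration, here is a minimal ache.yml sketch enabling some of the features above. The key names follow the project's documentation for later releases, so treat them as assumptions and check them against your ACHE version:

    # ache.yml - a minimal sketch; key names are assumed from later ACHE documentation
    # Cancel downloads whose MIME type is not in this list:
    crawler_manager.downloader.valid_mime_types:
      - text/html
      - text/plain
    # Route requests through a local Privoxy instance so .onion links are fetched via Tor:
    crawler_manager.downloader.torproxy: "http://localhost:8118"
    # Enable backlink crawling (Mozscape API credentials are configured separately):
    link_storage.link_strategy.backlinks: true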

0.4.0

8 years ago

We are pleased to announce version 0.4.0 of ACHE Crawler. Here we list the major changes since the last version.

New features, improvements and bug fixes:

  • Improved and updated ACHE documentation
  • Added a configuration option to disable English language detection (see the configuration sketch after this list)
  • Configured a service to measure test code coverage (https://coveralls.io/github/ViDA-NYU/ache)
  • Added more unit tests for several components (test coverage raised to 24% of codebase)
  • Refactored RegexBasedDetector into a new type of page classifier that uses regular expressions
  • Refactored Link Storage to abstract a new component called LinkSelector
  • Extracted headers from HTTP responses
  • Added support for relative redirect URLs
  • Added support for redirected URLs and MIME types, and reorganized related code
  • Fixed a number of small issues and minor bugs
  • Removed legacy code and more code formatting
  • Fixed some memory leaks and wasteful memory usage
  • Removed LinkMonitor and the ability to print frontier pages, which caused memory leaks
  • Added a better caching policy with bounded memory usage in the frontier
  • Added a link selector with politeness restrictions (URLs from the same domain are accessed only after a minimum time interval)
  • Added a link selector that maximizes the number of websites downloaded
  • Added a link selector that crawls only web pages within a maximum depth from the seed URLs
  • Changed the default JVM garbage collector used by ACHE
  • Added command line option to train a Random Forest page classifier
  • Refactored page repositories to reuse code and enable future improvements
  • Added a configuration option to hash file names when using FILESYSTEM data formats
  • Added a new JSON data format
  • Stored the fetch time of downloaded pages in the JSON data format
  • Stored HTTP request headers in the JSON data format
  • Added deflate compression for page repositories
  • Improved command line help messages
  • Updated Gradle wrapper version to 2.8
  • Updated Weka version to 3.6.13
  • Fixed other minor bugs
  • Removed a lot of unused code and performed further cleanup
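
As a rough illustration of how several of these options fit together, here is a hypothetical configuration sketch. It uses the single-file ache.yml format introduced in 0.5.0 (in 0.4.0 these settings lived in separate config files), and the key names are taken from later ACHE documentation, so treat them as assumptions:

    # Disable English language detection:
    target_storage.english_language_detection_enabled: false
    # Store pages in the JSON data format, hashing file names and compressing with deflate:
    target_storage.data_format.type: FILESYSTEM_JSON
    target_storage.data_format.filesystem.hash_file_name: true
    target_storage.data_format.filesystem.compress_data: true
    # Politeness: access the same host only after a minimum interval (in milliseconds):
    link_storage.scheduler.host_min_access_interval: 5000
    # Pick a link selector, e.g. one that maximizes the number of websites downloaded:
    link_storage.link_selector: "MaximizeWebsitesLinkSelector"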

0.3.1

8 years ago

We are pleased to announce version 0.3.1 of ACHE Crawler. This is a minor release with some changes:

  • Added config files to the final package distribution
  • Added version information to the command line interface
  • Some code refactorings

0.3.0

8 years ago

We are pleased to announce version 0.3.0 of ACHE Crawler. Here we list the major changes since version 0.2.0 (note that some changes break compatibility with previous releases).

New features:

  • New command-line interface using named parameters
  • Integration with Elasticsearch, with configurable index names
  • Added a new way to configure different types of classifiers using YAML files; see the sketch after this list (this will also allow new types of classifiers to be added later, as well as "meta classifiers" that combine any type of classifier using, for example, voting or machine-learning ensembles)
  • Implemented a new type of page classifier based on simple URL regular expressions
  • Added filtering of extracted links using whitelists and blacklists of regular expressions
  • Added a tool to compress CBOR files with GZIP using the CCA format
  • Added a tool for offline indexing of crawled files on disk into Elasticsearch
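
For example, a page classifier is declared in a YAML file (pageclassifier.yml in later documentation) by giving its type and parameters. Below is a sketch of the URL regular-expression classifier described above; the exact type and parameter names are assumptions based on later ACHE documentation:

    # pageclassifier.yml - treat pages whose URL matches a pattern as on-topic
    type: url_regex
    parameters:
      regular_expressions:
        - "https?://[^/]*example\\.com/forum/.*"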

Improvements:

  • Improved the documentation on GitHub
  • Started writing automated unit tests for new features
  • Configured a continuous integration pipeline using Travis CI (it compiles and runs the tests for each new commit to the repository)
  • Embedded language detection into the crawler package to ease configuration for the end user (previously, the user had to download external language profile files and specify them on the command line)
  • Replaced the bash scripts used to build an SVM model with a single command written in cross-platform Java
  • Data from an existing crawl is no longer automatically removed; previous crawls are simply resumed

Bug fixes:

  • Escaped HTML entities in extracted links (this was causing wrong links to be extracted and the crawler to waste resources trying to download nonexistent pages)
  • Added checks for empty strings in the frontier and seed file
  • Fixed the computation of the CCA key
  • URLs from the seed file are now inserted only when they are not already present
  • Added a shutdown hook to close the LinkStorage database properly
  • Removed URL fragments (#) from extracted links (these were causing duplicate URLs to be downloaded)

Refactorings:

  • Refactored dozens of classes across the crawler

0.2.0

9 years ago

Release v0.2.0