ACHE Versions

ACHE is a web crawler for domain-specific search.

0.5.0

8 years ago

We are pleased to announce version 0.5.0 of ACHE Focused Crawler. Here we list the major changes since the last version.

New features, improvements and bug fixes:

  • New simplified configuration based on a single YAML file, ache.yml (see the configuration sketch after this list)
  • Fixed "backlink crawling", which uses the Mozscape API to fetch backlinks
  • Complete rewrite of the Crawler Manager module, with threading bug fixes and a new thread-management model
  • Allowed the HTTP fetcher to cancel downloads of undesired MIME types (via a valid MIME types configuration)
  • Added the ability to crawl .onion links on the Tor network using HTTP proxies such as Privoxy
  • Added more unit tests for several components (test coverage raised to 31% of the codebase)
  • Further code cleanup and refactoring
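
For illustration, here is a minimal ache.yml sketch enabling some of the features above. The key names follow the project's documentation for later releases, so treat them as assumptions and check them against your ACHE version:

    # ache.yml - a minimal sketch; key names are assumed from later ACHE documentation
    # Cancel downloads whose MIME type is not in this list:
    crawler_manager.downloader.valid_mime_types:
      - text/html
      - text/plain
    # Route requests through a local Privoxy instance so .onion links are fetched via Tor:
    crawler_manager.downloader.torproxy: "http://localhost:8118"
    # Enable backlink crawling (Mozscape API credentials are configured separately):
    link_storage.link_strategy.backlinks: true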

0.4.0

8 years ago

We are pleased to announce version 0.4.0 of ACHE Crawler. Here we list the major changes since the last version.

New features, improvements and bug fixes:

  • Improved and updated ACHE documentation
  • Added a configuration option to disable English language detection (see the configuration sketch after this list)
  • Configured a service to measure test code coverage (https://coveralls.io/github/ViDA-NYU/ache)
  • Added more unit tests for several components (test coverage raised to 24% of codebase)
  • Refactored RegexBasedDetector into a new type of page classifier that uses regular expressions
  • Refactored Link Storage to abstract a new component called LinkSelector
  • Extracted headers from HTTP responses
  • Added support for relative redirect URLs
  • Added support for redirected URLs and MIME types, and reorganized related code
  • Fixed a number of small issues and minor bugs
  • Removed legacy code and more code formatting
  • Fixed some memory leaks and wasteful memory usage
  • Removed LinkMonitor and the ability to print frontier pages, which caused memory leaks
  • Added a better caching policy with bounded memory usage in the frontier
  • Added a link selector with politeness restrictions (URLs from the same domain are accessed only after a minimum time interval)
  • Added a link selector that maximizes the number of websites downloaded
  • Added a link selector that crawls only web pages within a maximum depth from the seed URLs
  • Changed the default JVM garbage collector used by ACHE
  • Added command line option to train a Random Forest page classifier
  • Refactored page repositories to reuse code and enable future improvements
  • Added a configuration option to hash file names when using FILESYSTEM data formats
  • Added a new JSON data format
  • Stored the fetch time of downloaded pages in the JSON data format
  • Stored HTTP request headers in the JSON data format
  • Added deflate compression for page repositories
  • Improved command line help messages
  • Updated Gradle wrapper version to 2.8
  • Updated Weka version to 3.6.13
  • Fixed other minor bugs
  • Removed a lot of unused code and performed further cleanup
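
As a rough illustration of how several of these options fit together, here is a hypothetical configuration sketch. It uses the single-file ache.yml format introduced in 0.5.0 (in 0.4.0 these settings lived in separate config files), and the key names are taken from later ACHE documentation, so treat them as assumptions:

    # Disable English language detection:
    target_storage.english_language_detection_enabled: false
    # Store pages in the JSON data format, hashing file names and compressing with deflate:
    target_storage.data_format.type: FILESYSTEM_JSON
    target_storage.data_format.filesystem.hash_file_name: true
    target_storage.data_format.filesystem.compress_data: true
    # Politeness: access the same host only after a minimum interval (in milliseconds):
    link_storage.scheduler.host_min_access_interval: 5000
    # Pick a link selector, e.g. one that maximizes the number of websites downloaded:
    link_storage.link_selector: "MaximizeWebsitesLinkSelector"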

0.3.1

8 years ago

We are pleased to announce version 0.3.1 of ACHE Crawler. This is a minor release with some changes:

  • Added config files to the final package distribution
  • Added version information to the command line interface
  • Some code refactorings

0.3.0

8 years ago

We are pleased to announce version 0.3.0 of ACHE Crawler. Here we list the major changes since version 0.2.0 (note that some changes break compatibility with previous releases).

New features:

  • New command-line interface using named parameters
  • Integration with Elasticsearch, with configurable index names
  • Added a new way to configure different types of classifiers using YAML files; see the sketch after this list (this will also allow new types of classifiers to be added later, as well as "meta classifiers" that combine any type of classifier using, for example, voting or machine-learning ensembles)
  • Implemented a new type of page classifier based on simple URL regular expressions
  • Added filtering of extracted links using whitelists and blacklists of regular expressions
  • Added a tool to compress CBOR files with GZIP using the CCA format
  • Added a tool for offline indexing of crawled files on disk into Elasticsearch
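
For example, a page classifier is declared in a YAML file (pageclassifier.yml in later documentation) by giving its type and parameters. Below is a sketch of the URL regular-expression classifier described above; the exact type and parameter names are assumptions based on later ACHE documentation:

    # pageclassifier.yml - treat pages whose URL matches a pattern as on-topic
    type: url_regex
    parameters:
      regular_expressions:
        - "https?://[^/]*example\\.com/forum/.*"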

Improvements:

  • Improved the documentation on GitHub
  • Started writing automated unit tests for new features
  • Configured a continuous integration pipeline using Travis CI (it compiles and runs the tests for each new commit to the repository)
  • Embedded language detection into the crawler package to ease configuration for the end user (previously, the user had to download external language profile files and specify them on the command line)
  • Replaced the bash scripts used to build an SVM model with a single command written in cross-platform Java
  • Data from an existing crawl is no longer automatically removed; previous crawls are simply resumed

Bug fixes:

  • Escaped HTML entities in extracted links (this was causing wrong links to be extracted and the crawler to waste resources trying to download nonexistent pages)
  • Added checks for empty strings in the frontier and seed file
  • Fixed the computation of the CCA key
  • URLs from the seed file are now inserted only when they are not already present
  • Added a shutdown hook to close the LinkStorage database properly
  • Removed URL fragments (#) from extracted links (these were causing duplicate URLs to be downloaded)

Refactorings:

  • Refactored dozens of classes across the crawler

0.2.0

9 years ago

Release v0.2.0