Crawly Versions

Crawly, a high-level web crawling & scraping framework for Elixir.

0.15.0

1 year ago

What's Changed

  1. Created a simple management Web UI. Try it at localhost:4001
  2. Added the ability to define spiders in the YML format. Read more here: https://github.com/elixir-crawly/crawly/blob/master/documentation/spiders_in_yml.md
  3. Added the ability to run Crawly (and your scraping projects) without Elixir. Read more here: https://github.com/elixir-crawly/crawly/blob/master/documentation/standalone_crawly.md
  4. Added generators for Crawly spiders and configuration files to reduce boilerplate
  5. Improved UniqueRequest middleware so that it can store hashes instead of complete URLs (special thanks to @serpent213)
  6. Added SameDomainFilter middleware, my favorite, which will probably remove the need to rely on the base_url in the future. Again, thanks to @serpent213!
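
The idea behind item 5 (storing hashes instead of complete URLs) can be illustrated with plain Elixir. This is a minimal sketch of the general technique, not Crawly's implementation: the `SeenUrls` module and its functions are hypothetical names introduced only for this example, and SHA-1 is an assumed choice of digest.

```elixir
defmodule SeenUrls do
  # Hypothetical helper, NOT part of Crawly: remembers visited URLs by digest.

  # Start with an empty set of digests.
  def new, do: MapSet.new()

  # SHA-1 digest of the URL: always 20 bytes, however long the URL is.
  def digest(url) when is_binary(url), do: :crypto.hash(:sha, url)

  # Returns {:new, updated_set} the first time a URL is seen,
  # and {:seen, set} on every later occurrence.
  def check_and_put(seen, url) do
    key = digest(url)

    if MapSet.member?(seen, key) do
      {:seen, seen}
    else
      {:new, MapSet.put(seen, key)}
    end
  end
end

seen = SeenUrls.new()
{:new, seen} = SeenUrls.check_and_put(seen, "https://example.com/page?id=1")
{:seen, _seen} = SeenUrls.check_and_put(seen, "https://example.com/page?id=1")
```

The trade-off is that the set holds fixed-size digests rather than arbitrarily long URLs, at the cost of no longer being able to list the original URLs from the set.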

0.13.0

3 years ago

  1. Bugfix for start_urls size (now it's possible to have very large start URLs)
  2. Split business logs from other logs (per-spider logging)
  3. Send logs to CrawlyUI (optional)
  4. Allow overriding more spider options:
    • closespider_itemcount
    • closespider_timeout
    • concurrent_requests_per_domain (number of started workers)
  5. Change on_spider_log_callback (now it also gets the crawl_id)
  6. Parse pipelines

0.12.0

3 years ago

0.11.0

3 years ago

0.10.0

3 years ago

The release includes the following improvements:

  1. WriteToFile pipeline now adds timestamps to filenames
  2. WriteToFile pipeline will now create a folder if missing
  3. SendToUI item pipeline will send data to experimental CrawlyUI management dashboard
  4. Other smaller features

0.9.0

4 years ago

This release contains the following features:

  1. Automatic cookie management (allows scraping websites behind a login form or a form with a ZIP code)
  2. Spider custom settings (allows overriding settings like concurrency on the spider level)
  3. Injected on_spider_closed_callback (allows notifying other parts of the system when the crawl ends)
  4. Fixes and improvements to the documentation

0.8.0

4 years ago

This release contains the following features:

  1. Retries support
  2. Pluggable user agents
  3. Browser rendering support