Madeindjs Spider Versions

The fastest web crawler and indexer

v1.26.4

2 months ago

What's Changed

  1. control: feature flag to manually control crawls (pause, stop, resume, and shutdown); see the Cargo.toml sketch below.
  2. decentralized: feature flag to scale the workload across nodes/workers using flatbuffers.
  3. serde: feature flag to serialize/deserialize links on the fly.
  4. perf(regex): compile the regex blacklist on crawl start.
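A minimal Cargo.toml sketch of opting into the new flags; the feature names come from the list above, while the version pin and which flags you combine are your choice:

[dependencies]
# opt into the new flags from this release; "decentralized" changes how work
# is distributed, so enable it only when running spider_worker instances
spider = { version = "1.26.4", features = ["control", "serde"] }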

Decentralizing 🤖

The spider_worker crate offloads the heavy resource-gathering work to separate processes. Get started by running cargo install spider_worker and starting the spider_worker process. Enable the decentralized flag on the spider crate and optionally set the env variable SPIDER_WORKER to your instance. Put the worker behind an HTTPS load balancer that can scale elastically to run extreme workloads. ⚖️
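A rough shell sequence for the setup described above; the command names and the SPIDER_WORKER variable come from these notes, while the worker address and port are placeholders:

cargo install spider_worker                     # install the worker binary
spider_worker &                                 # start a worker process in the background
export SPIDER_WORKER="http://127.0.0.1:3030"    # placeholder address of your worker instance
cargo run                                       # your crawler, built with the decentralized feature enabled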

Full Changelog: https://github.com/spider-rs/spider/compare/v1.26.0...v1.26.4

v1.22.1

3 months ago

What's Changed

2x - 5x performance increase 🏎️

  1. perf(async): add current handle runtime set

Full Changelog: https://github.com/spider-rs/spider/compare/v1.21.3...v1.22.1

v1.21.3

3 months ago

Major 🚀

Planet-scale crawling activated: massive concurrency handling for extreme workloads and website sizes (concurrent implementation of the Firefox parser), plus a drastic memory reduction across crawls.

Can crawl 1 million URLs on a 64 GB Linux machine in under a minute on a 1 Gb network.

Spawn four crawls and target large websites (maybe some .gov pages); a sketch follows below.
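A sketch of running several crawls concurrently on one runtime; the target URLs are placeholders, and four is simply the count suggested above:

use spider::website::Website;

#[tokio::main]
async fn main() {
    // placeholder targets; swap in large sites to stress the crawler
    let mut a: Website = Website::new("https://www.usa.gov");
    let mut b: Website = Website::new("https://choosealicense.com");
    let mut c: Website = Website::new("https://example.com");
    let mut d: Website = Website::new("https://example.org");

    // drive all four crawls concurrently on the same tokio runtime
    tokio::join!(a.crawl(), b.crawl(), c.crawl(), d.crawl());
}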

What's Changed

  • perf(parser): add fast async tendril concurrency setup by @j-mendez in https://github.com/spider-rs/spider/pull/100
  • chore(crawl): fix concurrency selector gathering files that end with asset names that are valid html files
  • chore(crawl): improve coverage on finding links
  • perf(memory): add resource lists and heap html handling when running stream links

Full Changelog: https://github.com/spider-rs/spider/compare/v1.19.41...v1.21.3

v1.19.41

3 months ago

What's Changed

  • perf(crawl): add join handle task management by @j-mendez in https://github.com/spider-rs/spider/pull/99
  • chore(crawl): fix task shutdown on termination
  • perf(crawl): reduce memory allocation across crawls
  • chore(crawl): fix gathering timeout duration from robots config

Major speed and memory improvements on large crawls.

Full Changelog: https://github.com/spider-rs/spider/compare/v1.19.26...v1.19.41

v1.19.26

3 months ago

What's Changed

  1. perf(links): add fast pre-serialized URL anchor link extraction with reduced memory usage
  2. perf(links): fix case sensitivity handling
  3. perf(crawl): reduce memory usage on link gathering
  4. chore(crawl): remove the Website.reset method and improve crawl resource handling (reset is no longer needed)
  5. chore(crawl): add heap usage of links visited
  6. perf(crawl): allow massive scans to utilize more CPU
  7. feat(timeout): add optional configuration.request_timeout duration (see the sketch after this list)
  8. build(tokio): remove unused net feature
  9. chore(docs): add missing scrape section
  10. perf(crawl): add compact_str to reduce memory usage 2x
  11. perf(scraper): add ahash implementation to the default scraper fork
Full Changelog: https://github.com/spider-rs/spider/compare/v1.18.15...v1.19.26

v1.18.15

4 months ago

What's Changed

Major

  1. fix stream throttling/delay
  2. perf(selectors): add top-level selector building
  3. fix case-insensitive link capturing
  4. add inline trap detection
  5. subdomain and TLD crawl performance increase (see the sketch after this list)

Minor

  1. remove extra string compare conversions beforehand
  2. fix unwrap_or with default evaluations
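For context on item 5 above, a small sketch of a crawl that opts into subdomain and TLD handling and throttles requests; the configuration fields shown are assumed from the crate's README of that era rather than stated in these notes:

use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // include subdomains and sibling TLDs in the crawl (assumed boolean fields)
    website.configuration.subdomains = true;
    website.configuration.tld = true;
    // assumed delay in milliseconds between requests
    website.configuration.delay = 250;
    website.crawl().await;
}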

v1.17.5

4 months ago

v1.16.0

8 months ago

What's Changed

Performance

  • performance boost of 30–500+% depending on website size
  • drastic reduction in memory usage for large crawls

Full Changelog: https://github.com/spider-rs/spider/compare/v1.15.0...v1.16.0

v1.16.4

8 months ago

What's Changed

Performance

  • performance boost of 30–500+% depending on website size
  • drastic reduction in memory usage for large crawls

Full Changelog: https://github.com/spider-rs/spider/compare/v1.15.0...v1.16.4

v1.17.0

8 months ago

What's Changed

  • feat(controls): add pause, resume, and shutdown crawler

Pause/Resume active crawls

use spider::website::Website;
use spider::utils::{pause, resume};
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    let url = "https://choosealicense.com/";
    let mut website: Website = Website::new(url);

    // pause the active crawl, wait 5 seconds, then resume it
    tokio::spawn(async move {
        pause(url).await;
        sleep(Duration::from_millis(5000)).await;
        resume(url).await;
    });

    website.crawl().await;
}

Shutdown crawls

use spider::website::Website;
use spider::utils::shutdown;
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    let url = "https://choosealicense.com/";
    let mut website: Website = Website::new(url);

    // force a shutdown if the crawl runs longer than 30 seconds
    tokio::spawn(async move {
        sleep(Duration::from_secs(30)).await;
        shutdown(url).await;
    });

    website.crawl().await;
}

The examples above show pausing, resuming, and shutting down crawlers.