The fastest web crawler and indexer
- `control`: feature flag to control crawls manually (pause, stop, resume, and shutdown).
- `decentralized`: feature flag to scale the workload across nodes/workers using flatbuffers.
- `serde`: feature flag to de/serialize links on the fly.

The `spider_worker` crate splits off some of the heavy workload of gathering resources. Get started by running `cargo install spider_worker` and then starting the `spider_worker` process.
Enable the `decentralized` flag on the `spider` crate and optionally set the env variable `SPIDER_WORKER` to your instance. Put the worker behind an HTTPS load balancer that can scale elastically to run extreme workloads. ⚖️
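A minimal sketch of the crawler side of a decentralized run, assuming the `decentralized` feature is enabled and a `spider_worker` instance is already running; the worker address below is illustrative, not a documented default:

```rust
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Assumes the binary was launched with the worker address exported, e.g.
    //   SPIDER_WORKER=http://127.0.0.1:3030 cargo run
    // (address illustrative). With the `decentralized` feature enabled, the
    // heavy resource gathering is delegated to the worker(s) behind that address.
    let mut website: Website = Website::new("https://choosealicense.com/");
    website.crawl().await;
}
```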
Full Changelog: https://github.com/spider-rs/spider/compare/v1.26.0...v1.26.4
2x - 5x performance increase 🏎️
Full Changelog: https://github.com/spider-rs/spider/compare/v1.21.3...v1.22.1
Planet scale crawling activated. Massive concurrency handling for extreme workloads and website sizes (concurrent Firefox parsing impl). Drastic memory reduction across crawls included.
Can crawl 1 million URLs on a 64GB Linux machine in under a minute with a 1G network.
Spawn 4 crawls and target large websites (maybe some .gov pages), as in the sketch below.
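A rough sketch of spawning four concurrent crawls on the tokio runtime; the target URLs are placeholders:

```rust
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Illustrative targets only; swap in large sites to stress-test.
    let targets = [
        "https://choosealicense.com/",
        "https://www.example.com/",
        "https://www.example.org/",
        "https://www.example.net/",
    ];

    // Launch the four crawls concurrently as separate tokio tasks.
    let handles: Vec<_> = targets
        .iter()
        .map(|url| {
            let mut website: Website = Website::new(url);
            tokio::spawn(async move { website.crawl().await })
        })
        .collect();

    // Wait for all crawls to finish.
    for handle in handles {
        let _ = handle.await;
    }
}
```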
Full Changelog: https://github.com/spider-rs/spider/compare/v1.19.41...v1.21.3
- Major speed and memory improvements on large crawls.
Full Changelog: https://github.com/spider-rs/spider/compare/v1.19.26...v1.19.41
- `Website.reset` method and improved crawl resource handling (`reset` not needed now).
- `configuration.request_timeout` duration.
- `net` feature.
- `compact_str` to reduce memory usage by about 2x.

Full Changelog: https://github.com/spider-rs/spider/compare/v1.18.15...v1.19.26
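For example, the request timeout can be set on the crawl configuration before starting. A minimal sketch, assuming `configuration.request_timeout` holds an optional boxed `Duration` (check the crate docs for the exact field type in your version):

```rust
use spider::website::Website;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com/");
    // Assumption: request_timeout is an optional boxed Duration on the configuration.
    website.configuration.request_timeout = Some(Box::new(Duration::from_secs(15)));
    website.crawl().await;
}
```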
Full Changelog: https://github.com/spider-rs/spider/compare/v1.17.0...v1.17.5
Full Changelog: https://github.com/spider-rs/spider/compare/v1.15.0...v1.16.0
Full Changelog: https://github.com/spider-rs/spider/compare/v1.15.0...v1.16.4
```rust
use spider::utils::{pause, resume};
use spider::website::Website;
use std::time::Duration;
use tokio::time::sleep;

#[tokio::main]
async fn main() {
    let url = "https://choosealicense.com/";
    let mut website: Website = Website::new(url);

    tokio::spawn(async move {
        // Pause the crawl, wait five seconds, then resume it.
        pause(url).await;
        sleep(Duration::from_millis(5000)).await;
        resume(url).await;
    });

    website.crawl().await;
}
```
```rust
use spider::utils::shutdown;
use spider::website::Website;
use std::time::Duration;
use tokio::time::sleep;

#[tokio::main]
async fn main() {
    let url = "https://choosealicense.com/";
    let mut website: Website = Website::new(url);

    tokio::spawn(async move {
        // Force a shutdown if the crawl runs longer than 30 seconds.
        sleep(Duration::from_secs(30)).await;
        shutdown(url).await;
    });

    website.crawl().await;
}
```
The examples above show how to pause, resume, and shut down crawls with the `control` feature enabled.