The fastest web crawler written in Rust. Maintained by @a11ywatch.
The example below gathers links from different domains.
```rust
use spider::tokio;
use spider::website::Website;
use std::io::Error;
use std::time::Instant;

#[tokio::main]
async fn main() -> Result<(), Error> {
    // Crawl the site and also gather links from the listed external domains.
    let mut website = Website::new("https://rsseau.fr")
        .with_external_domains(Some(
            Vec::from(["http://loto.rsseau.fr/"].map(|d| d.to_string())).into_iter(),
        ))
        .build()?;

    let start = Instant::now();
    website.crawl().await;
    let duration = start.elapsed();

    let links = website.get_links();

    for link in links {
        println!("- {:?}", link.as_ref());
    }

    println!(
        "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        links.len()
    );

    Ok(())
}
```
Thank you @roniemartinez and @sebs for the help!
Full Changelog: https://github.com/spider-rs/spider/compare/v1.41.1...v1.42.3
The `sitemap` feature flag was added to include pages found in the sitemap in the results. Currently, the links found from those pages are not crawled.
If you want to set a custom sitemap location, use the configuration `website.configuration.sitemap_url`:
```rust
website.configuration.sitemap_url = Some(Box::new("sitemap.xml".into()));
```
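A fuller sketch of the same configuration in context (assuming the `sitemap` feature flag is enabled in Cargo.toml):

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://rsseau.fr");
    // Point the crawler at a custom sitemap location before crawling.
    website.configuration.sitemap_url = Some(Box::new("sitemap.xml".into()));
    website.crawl().await;
    println!("total pages: {}", website.get_links().len());
}
```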
The builder method to adjust the location was accidentally left out and will be available in the next version.
Full Changelog: https://github.com/spider-rs/spider/compare/v1.40.6...v1.41.1
If you need crawls to be sequential, use `configuration.delay` or `website.with_delay(1)` and set any value greater than 0; a sketch follows below.
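A minimal sketch (assuming the delay is in milliseconds, as with the rest of the configuration):

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://rsseau.fr");
    // Any delay greater than 0 makes requests run sequentially.
    website.with_delay(250);
    website.crawl().await;
}
```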
Use the feature flag `chrome` for headless and `chrome_headed` for headful crawling.
Chrome installations are detected automatically on the OS. The current implementation uses chromiumoxide and handles HTML as raw strings, so downloading media is not ideal since the bytes may be invalid. The `chrome` feature does not work with the `decentralized` flag at the moment.
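With the feature compiled in (e.g. `spider = { version = "1.40", features = ["chrome"] }` in Cargo.toml), the crawl API stays the same; a minimal sketch:

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // With the chrome feature enabled, pages are fetched through the
    // detected Chrome installation instead of the plain HTTP client.
    let mut website = Website::new("https://rsseau.fr");
    website.crawl().await;
    println!("total pages: {}", website.get_links().len());
}
```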
The video below shows 200+ pages being handled within a couple of seconds; headless runs drastically faster. Use headed mode only for debugging.
https://github.com/spider-rs/spider/assets/8095978/df341e32-df09-4ff6-9f77-468088b87b73
Full Changelog: https://github.com/spider-rs/spider/compare/v1.37.7...v1.40.6
Full Changelog: https://github.com/spider-rs/spider/compare/v1.7.22...v1.8.0
You can now get the bytes from `Page` to store as a valid resource.
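A minimal sketch of persisting those bytes, assuming `scrape` is used to retain page content and `get_bytes` exposes the raw response body (exact signatures may differ):

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://choosealicense.com");
    // scrape() stores the pages along with their content.
    website.scrape().await;

    if let Some(pages) = website.get_pages() {
        for (i, page) in pages.iter().enumerate() {
            // Assumption: get_bytes() returns the raw response body.
            if let Some(bytes) = page.get_bytes() {
                std::fs::write(format!("page-{}.bin", i), bytes).unwrap();
            }
        }
    }
}
```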
Thank you @Byter09 for the help!
Full Changelog: https://github.com/spider-rs/spider/compare/v1.36.5...v1.37.7
With the `sync` feature enabled (enabled by default):
```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    let mut rx2 = website.subscribe(16).unwrap();

    // Receive each page as it is crawled.
    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("{:?}", res.get_url());
        }
    });

    website.crawl().await;
}
```
If you need the events to finish first, spawn the `website.crawl().await` call instead and receive with `rx2.recv().await` on the current task.
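A sketch of that inverted pattern:

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    let mut rx2 = website.subscribe(16).unwrap();

    // Run the crawl in a task so the current task can drain every event;
    // recv() errors once the crawl finishes and the sender is dropped.
    let handle = tokio::spawn(async move {
        website.crawl().await;
    });

    while let Ok(res) = rx2.recv().await {
        println!("{:?}", res.get_url());
    }

    handle.await.unwrap();
}
```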
Full Changelog: https://github.com/spider-rs/spider/compare/v1.34.2...v1.36.3
```rust
let mut website = Website::new("https://choosealicense.com");

website
    .with_respect_robots_txt(true)
    .with_subdomains(true)
    .with_tld(false)
    .with_delay(0)
    .with_request_timeout(None)
    .with_http2_prior_knowledge(false)
    .with_user_agent(Some("myapp/version".into()))
    .with_on_link_find_callback(Some(|s| {
        println!("link target: {}", s.inner());
        s
    }))
    .with_headers(None)
    .with_blacklist_url(Some(Vec::from([
        "https://choosealicense.com/licenses/".into(),
    ])))
    .with_proxies(None);
```
Thank you @roniemartinez for the help!
Use `download` to store HTML locally at any target destination. Example, with a local server up and the spider CLI:
```sh
cargo install spider_cli
spider --domain http://localhost:3000 download
```
Full Changelog: https://github.com/spider-rs/spider/compare/v1.31.7...v1.32.8
Full Changelog: https://github.com/spider-rs/spider/compare/v1.31.5...v1.31.7