The fastest web crawler written in Rust. Maintained by @a11ywatch.
chrome_screenshot feature flag
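A minimal sketch of enabling the flag in Cargo.toml; pairing it with the chrome feature for headless rendering is an assumption:
# sketch: enable headless Chrome rendering plus the screenshot capture flag
[dependencies]
spider = { version = "1.50.20", features = ["chrome", "chrome_screenshot"] }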
Full Changelog: https://github.com/spider-rs/spider/compare/v1.50.2...v1.50.20
You can now run a cron job at any time to sync data from crawls. Use the cron together with subscribe
to handle data curation with ease.
[dependencies]
spider = { version = "1.50.0", features = ["sync", "cron"] }
extern crate spider;

use spider::website::{Website, run_cron};
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // set the cron to run, or use the builder pattern `website.with_cron`.
    website.cron_str = "1/5 * * * * *".into();

    let mut rx2 = website.subscribe(16).unwrap();

    let join_handle = tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("{:?}", res.get_url());
        }
    });

    // run_cron takes ownership of the website. You can also use website.run_cron,
    // but then you need to abort the spawned handles manually.
    let runner = run_cron(website).await;

    println!("Starting the Runner for 10 seconds");
    tokio::time::sleep(tokio::time::Duration::from_secs(10)).await;
    let _ = tokio::join!(runner.stop(), join_handle);
}
Full Changelog: https://github.com/spider-rs/spider/compare/v1.49.10...v1.50.5
You can set a cookie String directly with website.cookie_str; it is added to each request. Using the cookie feature flag also enables storing cookies that are received.
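A minimal sketch of setting the cookie string before a crawl, assuming the cookie feature flag is enabled in Cargo.toml (the cookie value here is illustrative):
extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // sent as the Cookie header on every request during the crawl
    website.cookie_str = "session=example-value".into();
    website.crawl().await;
}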
Full Changelog: https://github.com/spider-rs/spider/compare/v1.49.10...v1.49.12
Thank you @marlonbaeten for help!
Full Changelog: https://github.com/spider-rs/spider/compare/v1.48.0...v1.49.10
Full Changelog: https://github.com/spider-rs/spider/compare/v1.46.5...v1.48.0
Full Changelog: https://github.com/spider-rs/spider/compare/v1.46.4...v1.46.5
Crawling all domains found on a website is now possible with * in external_domains.
Example:
let mut website = Website::new("https://choosealicense.com");

website
    .with_external_domains(Some(Vec::from(["*"].map(|d| d.to_string())).into_iter()));
Use the crawl budget and blacklist features to help prevent infinite crawls:
website
    .with_blacklist_url(Some(Vec::from(["^/blog/".into()])))
    .with_budget(Some(spider::hashbrown::HashMap::from([("*", 300), ("/licenses", 10)])));
Thank you @sebs for the help!
Full Changelog: https://github.com/spider-rs/spider/compare/v1.45.10...v1.46.0
You can now use crawl budgeting and domain grouping with the CLI.
Example:
spider --domain https://choosealicense.com --budget "*,1" crawl -o
# ["https://choosealicense.com"]
Example of grouping domains in a crawl with the CLI:
spider --domain https://choosealicense.com -E https://loto.rsseau.fr/ crawl -o
Full Changelog: https://github.com/spider-rs/spider/compare/v1.45.8...v1.45.9
Crawl budget limits prevent paths from exceeding a page limit. It is possible to set a budget for a path depth such as /a/b/c.
Use the budget feature flag to enable it.
page.url().await extra CDP call
Example:
use spider::tokio;
use spider::website::Website;
use std::io::Error;

#[tokio::main]
async fn main() -> Result<(), Error> {
    let mut website = Website::new("https://rsseau.fr")
        .with_budget(Some(spider::hashbrown::HashMap::from([
            ("*", 15),
            ("en", 11),
            ("fr", 3),
        ])))
        .build()?;

    website.crawl().await;

    let links = website.get_links();

    for link in links.iter() {
        println!("- {:?}", link.as_ref());
    }

    println!("Total pages: {}", links.len());

    Ok(())
}
// - "https://rsseau.fr/en/tag/google"
// - "https://rsseau.fr/en/blog/debug-nodejs-with-vscode"
// - "https://rsseau.fr/en/books"
// - "https://rsseau.fr/en/blog"
// - "https://rsseau.fr/en/tag/zip"
// - "https://rsseau.fr/books"
// - "https://rsseau.fr/en/resume"
// - "https://rsseau.fr/en/"
// - "https://rsseau.fr/en/tag/wpa"
// - "https://rsseau.fr/en/blog/express-typescript"
// - "https://rsseau.fr/en/blog/zip-active-storage"
// - "https://rsseau.fr"
// - "https://rsseau.fr/blog"
// - "https://rsseau.fr/en"
// - "https://rsseau.fr/fr"
// Total pages: 15
Full Changelog: https://github.com/spider-rs/spider/compare/v1.43.1...v1.45.8
You can now get the final redirect destination from pages with page.get_url_final.
If you need to see whether a redirect was performed, you can access page.final_redirect_destination
to get the Option.
Example:
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            // return the final redirect if found, or the url used for the request
            println!("{:?}", res.get_url_final());
        }
    });

    website.crawl().await;
}
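A minimal sketch that checks page.final_redirect_destination directly, reusing the subscribe setup shown above:
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            // Some(destination) only when the request was redirected
            if let Some(dest) = &res.final_redirect_destination {
                println!("{:?} -> {:?}", res.get_url(), dest);
            }
        }
    });

    website.crawl().await;
}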
Thank you @matteoredaelli @joksas for the issue and help!
Full Changelog: https://github.com/spider-rs/spider/compare/v1.42.1...v1.43.1