The fastest web crawler written in Rust. Maintained by @a11ywatch.
RSS feeds handled automatically on crawls.
Full Changelog: https://github.com/spider-rs/spider/compare/v1.89.0...v1.89.7
You can now drive the browser with multiple steps towards a goal, as in the example below. Extracting content or gathering extra data can also be done using GPTConfigs.extra_ai_data, and the credits used can be checked with Page.openai_credits_used.
use spider::configuration::GPTConfigs;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let gpt_config: GPTConfigs = GPTConfigs::new_multi(
        "gpt-4-1106-preview",
        vec![
            "Search for Movies",
            "Click on the first movie result",
        ],
        500,
    );

    let mut website: Website = Website::new("https://www.google.com")
        .with_openai(Some(gpt_config))
        .with_limit(1)
        .build()
        .unwrap();

    website.crawl().await;
}
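A rough sketch of using both new fields follows. The exact types of GPTConfigs.extra_ai_data and Page.openai_credits_used are assumptions here (treated as a boolean toggle and a Debug-printable value); the field names come from this release.

extern crate spider;

use spider::configuration::GPTConfigs;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut gpt_config: GPTConfigs =
        GPTConfigs::new_multi("gpt-4-1106-preview", vec!["Search for Movies"], 500);
    // Assumption: extra_ai_data is a boolean toggle on the config.
    gpt_config.extra_ai_data = true;

    let mut website: Website = Website::new("https://www.google.com")
        .with_openai(Some(gpt_config))
        .with_limit(1)
        .build()
        .unwrap();

    let mut rx = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            // Assumption: openai_credits_used implements Debug.
            println!("{} used {:?} credits", page.get_url(), page.openai_credits_used);
        }
    });

    website.crawl().await;
}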
Full Changelog: https://github.com/spider-rs/spider/compare/v1.87.3...v1.88.7
You can now bypass Cloudflare protected pages with the feature flag real_browser.
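A minimal sketch of using it, assuming the crate is compiled with the real_browser feature enabled and that the flag builds on the chrome feature (both the Cargo setup and the lack of extra builder configuration are assumptions, not shown in the release notes):

extern crate spider;

// Assumed Cargo setup (not shown in the release notes):
//   cargo add spider --features chrome,real_browser
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // With the feature flag compiled in, the crawl itself is configured as usual.
    let mut website: Website = Website::new("https://example.com")
        .with_limit(1)
        .build()
        .unwrap();

    website.crawl().await;

    for link in website.get_links() {
        println!("{:?}", link);
    }
}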
Full Changelog: https://github.com/spider-rs/spider/compare/v1.86.16...v1.87.3
You can now dynamically drive the browser with custom scripts using OpenAI. Make sure to set the OPENAI_API_KEY env variable or pass it in to the program.
extern crate spider;

use spider::configuration::{GPTConfigs, WaitForIdleNetwork};
use spider::tokio;
use spider::website::Website;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let _ = tokio::fs::create_dir_all("./storage/").await;

    let screenshot_params =
        spider::configuration::ScreenshotParams::new(Default::default(), Some(true), Some(true));
    let screenshot_config =
        spider::configuration::ScreenShotConfig::new(screenshot_params, true, true, None);

    let mut website: Website = Website::new("https://google.com")
        .with_chrome_intercept(true, true)
        .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
        .with_screenshot(Some(screenshot_config))
        .with_limit(1)
        .with_openai(Some(GPTConfigs::new(
            "gpt-4-1106-preview",
            "Search for Movies",
            500,
        )))
        .build()
        .unwrap();

    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("{}\n{}", page.get_url(), page.get_html());
        }
    });

    website.crawl().await;
}
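Since the OpenAI-driven steps need the key at runtime, a small guard before starting the crawl can fail fast when it is missing. This helper is only a sketch and not part of the crate:

use std::env;

// Hypothetical helper: abort early if OPENAI_API_KEY is not exported.
fn ensure_openai_key() {
    if env::var("OPENAI_API_KEY").is_err() {
        panic!("OPENAI_API_KEY is not set; export it or pass it to the program");
    }
}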
The output of the custom script from the AI and a screenshot of the page are included with the original release.
Full Changelog: https://github.com/spider-rs/spider/compare/v1.85.4...v1.86.16
You can now update the crawl links outside of the crawl context by using website.queue to get a sender.
use spider::tokio;
use spider::url::Url;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();
    let mut g = website.subscribe_guard().unwrap();
    let q = website.queue(100).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            let u = res.get_url();
            println!("{:?}", u);
            let mut url = Url::parse(u).expect("Failed to parse URL");
            let mut segments: Vec<_> = url
                .path_segments()
                .map(|c| c.collect::<Vec<_>>())
                .unwrap_or_else(Vec::new);

            if !segments.is_empty() && segments[0] == "en" {
                segments[0] = "fr";
                let new_path = segments.join("/");
                url.set_path(&new_path);
                // get a new url here or perform an action and queue links
                // pre-fetch all fr locales
                let _ = q.send(url.into());
            }

            // mark the page as handled so the crawl can continue
            g.inc();
        }
    });

    let start = std::time::Instant::now();
    website.crawl().await;
    let duration = start.elapsed();

    println!(
        "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        website.get_links().len()
    );
}
Thanks @oiwn
Full Changelog: https://github.com/spider-rs/spider/compare/v1.84.11...v1.85.1
Full Changelog: https://github.com/spider-rs/spider/compare/v1.84.11...v1.85.4
You can now pre-set links to crawl, or extend the existing set, using website.set_extra_links.
@oiwn thanks for the help!
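A minimal sketch of the idea, assuming set_extra_links accepts a set of links built from strings; the exact parameter type and the hashbrown re-export used here are assumptions, only the method name comes from the release:

extern crate spider;

use spider::hashbrown::HashSet;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");

    // Assumption: set_extra_links takes a set whose items convert from &str.
    let extra: HashSet<_> = ["https://rsseau.fr/en/blog", "https://rsseau.fr/fr/blog"]
        .iter()
        .map(|u| (*u).into())
        .collect();
    website.set_extra_links(extra);

    website.crawl().await;
}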
Full Changelog: https://github.com/spider-rs/spider/compare/v1.84.9...v1.84.11
Chrome sitemap compile fix and defaulting to chrome.
Full Changelog: https://github.com/spider-rs/spider/compare/v1.84.3...v1.84.9
Major performance increase for chrome crawls and scrapes (2x).
Full Changelog: https://github.com/spider-rs/spider/compare/v1.84.1...v1.84.3
chrome_headless_new flag.
Thanks @esemeniuc
Full Changelog: https://github.com/spider-rs/spider/compare/v1.83.6...v1.84.1