Madeindjs Spider Versions

The fastest web crawler written in Rust. Maintained by @a11ywatch.

v1.89.7

3 weeks ago

What's Changed

RSS feeds are now handled automatically on crawls; a minimal sketch follows the list below.

  1. feat(rss): add rss support
  2. chore(openai): fix compile chrome flag
  3. chore(crate): remove serde pin
  4. chore(website): fix sitemap chrome build
  5. chore(crate): remove pins on common crates (reduces build size)
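
A minimal sketch of a crawl with the new rss cargo feature enabled (the feature name follows the commit above, and the target URL is only an example):

extern crate spider;
use spider::tokio;
use spider::website::Website;

// Build with: cargo run --features rss
#[tokio::main]
async fn main() {
    // A plain crawl; with the rss feature on, RSS feeds discovered on the
    // site are handled automatically during the crawl.
    let mut website: Website = Website::new("https://example.com");
    website.crawl().await;

    for link in website.get_links() {
        println!("{:?}", link);
    }
}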

Full Changelog: https://github.com/spider-rs/spider/compare/v1.89.0...v1.89.7

v1.88.7

3 weeks ago

What's Changed

You can now drive the browser through multiple steps toward a goal, as in the example below. Extracting content or gathering extra data can also be done using GPTConfigs.extra_ai_data.

The credits used can be checked with Page.openai_credits_used.

  1. chore(page): return all page content regardless of status
  2. chore(openai): fix svg removal
  3. feat(openai): add extra data gpt curating
  4. chore(openai): add credits used response
  5. feat(fingerprint): add fingerprint id configuration

extern crate spider;
use spider::configuration::GPTConfigs;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // The prompts are executed in order to drive the browser toward the goal.
    let gpt_config: GPTConfigs = GPTConfigs::new_multi(
        "gpt-4-1106-preview",
        vec![
            "Search for Movies",
            "Click on the first movie result",
        ],
        500,
    );

    let mut website: Website = Website::new("https://www.google.com")
        .with_openai(Some(gpt_config))
        .with_limit(1)
        .build()
        .unwrap();

    website.crawl().await;
}
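
Building on that example, a hedged sketch that also toggles GPTConfigs.extra_ai_data and reads Page.openai_credits_used from the subscription stream; the field names come from the notes above, but their exact types (a boolean toggle and a value printable with {:?}) are assumptions here:

extern crate spider;
use spider::configuration::GPTConfigs;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut gpt_config: GPTConfigs = GPTConfigs::new_multi(
        "gpt-4-1106-preview",
        vec!["Search for Movies", "Click on the first movie result"],
        500,
    );
    // Assumption: extra_ai_data is a plain boolean field on GPTConfigs.
    gpt_config.extra_ai_data = true;

    let mut website: Website = Website::new("https://www.google.com")
        .with_openai(Some(gpt_config))
        .with_limit(1)
        .build()
        .unwrap();

    let mut rx = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            // Assumption: openai_credits_used is readable per page once the AI steps run.
            println!("{} -> credits used: {:?}", page.get_url(), page.openai_credits_used);
        }
    });

    website.crawl().await;
}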

Image: Google clicking on the first search result, driven by the AI prompts.

Full Changelog: https://github.com/spider-rs/spider/compare/v1.87.3...v1.88.7

v1.87.3

3 weeks ago

What's Changed

You can now bypass Cloudflare-protected pages with the [real_browser] feature flag, as sketched below.

  • feat(real_browser): add real_browser feature flag for chrome
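
A minimal sketch of a crawl with the flag enabled (built alongside the chrome feature; the target URL below is only a placeholder for a Cloudflare-protected site):

extern crate spider;
use spider::tokio;
use spider::website::Website;

// Build with: cargo run --features "chrome real_browser"
#[tokio::main]
async fn main() {
    // With real_browser enabled, the headless Chrome session is launched in a
    // way intended to pass Cloudflare's checks (per the note above).
    let mut website: Website = Website::new("https://example.com");
    website.crawl().await;
    println!("crawled {:?} pages", website.get_links().len());
}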

Full Changelog: https://github.com/spider-rs/spider/compare/v1.86.16...v1.87.3

v1.86.16

1 month ago

What's Changed

You can now dynamically drive the browser with custom scripts using OpenAI. Make sure to set the OPENAI_API_KEY env variable or pass it to the program (a sketch of setting it programmatically follows the example).

extern crate spider;
use spider::configuration::{GPTConfigs, WaitForIdleNetwork};
use spider::tokio;
use spider::website::Website;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let _ = tokio::fs::create_dir_all("./storage/").await;

    let screenshot_params =
        spider::configuration::ScreenshotParams::new(Default::default(), Some(true), Some(true));
    let screenshot_config =
        spider::configuration::ScreenShotConfig::new(screenshot_params, true, true, None);

    let mut website: Website = Website::new("https://google.com")
        .with_chrome_intercept(true, true)
        .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
        .with_screenshot(Some(screenshot_config))
        .with_limit(1)
        .with_openai(Some(GPTConfigs::new(
            "gpt-4-1106-preview",
            "Search for Movies",
            500,
        )))
        .build()
        .unwrap();
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("{}\n{}", page.get_url(), page.get_html());
        }
    });

    website.crawl().await;
}
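
If the key is not already exported in your shell environment, one option is to set it at the very top of main before building the Website (a minimal sketch using the standard library; the key value is a placeholder and should come from a secrets store or config file in practice):

extern crate spider;
use spider::tokio;

#[tokio::main]
async fn main() {
    // Placeholder key: replace with your real OPENAI_API_KEY source.
    std::env::set_var("OPENAI_API_KEY", "sk-...");

    // ...build the Website with with_openai(...) and crawl as shown above...
}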

The output of the custom script from the AI:

[Image: custom script output]

The screenshot of the page output:

[Image: screenshot of the page]

Full Changelog: https://github.com/spider-rs/spider/compare/v1.85.4...v1.86.16

v1.85.4

1 month ago

What's Changed

You can now add links to the crawl from outside the crawl context by using website.queue to get a sender.

  • feat(q): add mid crawl queue
  • chore(chrome): fix semaphore limiting scrape

use spider::tokio;
use spider::url::Url;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();
    let mut g = website.subscribe_guard().unwrap();
    let q = website.queue(100).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            let u = res.get_url();
            println!("{:?}", u);
            let mut url = Url::parse(u).expect("Failed to parse URL");

            let mut segments: Vec<_> = url
                .path_segments()
                .map(|c| c.collect::<Vec<_>>())
                .unwrap_or_else(Vec::new);

            if segments.len() > 0 && segments[0] == "en" {
                segments[0] = "fr";
                let new_path = segments.join("/");
                url.set_path(&new_path);
                // get a new url here or perform an action and queue links
                // pre-fetch all fr locales
                let _ = q.send(url.into());
            }
            g.inc();
        }
    });

    let start = std::time::Instant::now();
    website.crawl().await;
    let duration = start.elapsed();

    println!(
        "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        website.get_links().len()
    )
}

Thanks @oiwn

Full Changelog: https://github.com/spider-rs/spider/compare/v1.84.11...v1.85.4

v1.84.11

1 month ago

What's Changed

You can now pre-set links to crawl, or extend an existing crawl, using website.set_extra_links; a sketch follows the note below.

  • chore(website): add set extra links extended crawls
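
A hedged sketch of seeding extra links before a crawl; the method name comes from the note above, but the exact argument type (here a hashbrown HashSet of spider's CaseInsensitiveString) is an assumption:

extern crate spider;
use spider::hashbrown::HashSet;
use spider::tokio;
use spider::website::Website;
use spider::CaseInsensitiveString;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");

    // Assumption: set_extra_links accepts a set of URLs that are crawled in
    // addition to the links discovered from the start page.
    let mut extra: HashSet<CaseInsensitiveString> = HashSet::new();
    extra.insert("https://rsseau.fr/en/blog".into());
    website.set_extra_links(extra);

    website.crawl().await;
    println!("total pages: {:?}", website.get_links().len());
}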

@oiwn thanks for the help!

Full Changelog: https://github.com/spider-rs/spider/compare/v1.84.9...v1.84.11

v1.84.9

1 month ago

What's Changed

Chrome sitemap compile fix, and defaulting to Chrome.

  1. chore(sitemap): fix chrome sitemap handling
  2. feat(chrome): add auth challenge response
  3. chore(smart): fix smart mode http defaults
  4. chore(chrom_intercept): fix page hang

Full Changelog: https://github.com/spider-rs/spider/compare/v1.84.3...v1.84.9

v1.84.3

1 month ago

What's Changed

Major performance increase for Chrome crawls/scrapes (about 2x).

  1. perf(chrome): add direct page navigation

Full Changelog: https://github.com/spider-rs/spider/compare/v1.84.1...v1.84.3

v1.84.1

1 month ago

What's Changed

  1. chore(chrome): fix network_wait_for page hang and inconsistent pages
  2. chore(chrome): fix concurrent page handling
  3. chore(chrome): fix smart mode http request default
  4. feat(chrome): add chrome_headless_new flag

Thanks @esemeniuc

Full Changelog: https://github.com/spider-rs/spider/compare/v1.83.6...v1.84.1