Spider Versions

The fastest web crawler written in Rust. Maintained by @a11ywatch.

v1.93.5

1 week ago

What's Changed

Updated crate compatibility with [email protected] and fixed the headers compile for the worker. The http3 feature flag was removed - follow the unstable instructions if you still need it.

Full Changelog: https://github.com/spider-rs/spider/compare/v1.93.3...v1.93.5

v1.93.3

2 weeks ago

What's Changed

You can now take screenshots on each step when using OpenAI to manipulate the page. Connecting to a proxy when using remote headless Chrome is now fixed.

  1. feat(openai): add screenshot js execution after effects
  2. feat(openai): add deserialization error determination
  3. chore(chrome): fix proxy server headless connecting

Example

use spider::configuration::GPTConfigs;

let mut gpt_config: GPTConfigs = GPTConfigs::new_multi(
    "gpt-4-1106-preview",
    vec!["Search for Movies", "Extract the hrefs found."],
    3000,
);

gpt_config.screenshot = true;
gpt_config.set_extra(true);
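
To actually capture the screenshots during a crawl, the config above still needs to be attached to a Website and run. A minimal sketch, assuming the builder API and feature setup used in the other examples in these notes (the target URL and limit are only placeholders):

extern crate spider;

use spider::configuration::GPTConfigs;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut gpt_config: GPTConfigs = GPTConfigs::new_multi(
        "gpt-4-1106-preview",
        vec!["Search for Movies", "Extract the hrefs found."],
        3000,
    );

    // Take a screenshot after each step the AI performs on the page.
    gpt_config.screenshot = true;
    gpt_config.set_extra(true);

    // Placeholder target and limit; the config is attached with with_openai.
    let mut website: Website = Website::new("https://www.google.com")
        .with_openai(Some(gpt_config))
        .with_limit(1)
        .build()
        .unwrap();

    website.crawl().await;
}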

Full Changelog: https://github.com/spider-rs/spider/compare/v1.92.0...v1.93.3

v1.92.0

2 weeks ago

What's Changed

Caching OpenAI responses can now be done using the 'cache_openai' flag and a builder method.

Example

extern crate spider;

use spider::configuration::{GPTConfigs, WaitForIdleNetwork};
use spider::moka::future::Cache;
use spider::tokio;
use spider::website::Website;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let cache = Cache::builder()
        .time_to_live(Duration::from_secs(30 * 60))
        .time_to_idle(Duration::from_secs(5 * 60))
        .max_capacity(10_000)
        .build();

    let mut gpt_config: GPTConfigs = GPTConfigs::new_multi_cache(
        "gpt-4-1106-preview",
        vec![
            "Search for Movies",
            "Click on the first result movie result",
        ],
        500,
        Some(cache),
    );
    gpt_config.set_extra(true);

    let mut website: Website = Website::new("https://www.google.com")
        .with_chrome_intercept(true, true)
        .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
        .with_limit(1)
        .with_openai(Some(gpt_config))
        .build()
        .unwrap();
    let mut rx2 = website.subscribe(16).unwrap();

    let handle = tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("---\n{}\n{:?}\n{:?}\n---", page.get_url(), page.openai_credits_used, page.extra_ai_data);
        }
    });

    let start = crate::tokio::time::Instant::now();
    website.crawl().await;
    let duration = start.elapsed();
    let links = website.get_links();

    println!(
        "(0) Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        links.len()
    );

    // crawl the page again to see if cache is re-used.
    let start = crate::tokio::time::Instant::now();
    website.crawl().await;
    let duration = start.elapsed();

    website.unsubscribe();

    let _ = handle.await;

    println!(
        "(1) Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        links.len()
    );
}

Full Changelog: https://github.com/spider-rs/spider/compare/v.1.91.1...v1.92.0

v.1.91.1

3 weeks ago

What's Changed

The AI results now return the input (prompt), js_output, and content_output.
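
A small sketch of reading those values off a crawled page, for instance inside a subscribe loop like the one in the v1.92.0 example above. The field names come from these notes; the exact types and the Option/list shape are assumptions:

extern crate spider;

use spider::page::Page;

// Sketch only: extra_ai_data is assumed to be an Option over a list of results,
// and the field names below come straight from the release notes.
fn print_ai_results(page: &Page) {
    if let Some(results) = page.extra_ai_data.as_ref() {
        for result in results {
            println!("input: {:?}", result.input);
            println!("js_output: {:?}", result.js_output);
            println!("content_output: {:?}", result.content_output);
        }
    }
}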

Full Changelog: https://github.com/spider-rs/spider/compare/v1.90.0...v.1.91.1

v1.90.0

1 month ago

What's Changed

RSS feeds are now handled automatically on crawls. A short sketch follows the change list below.

  1. feat(rss): add rss support
  2. chore(openai): fix compile chrome flag
  3. chore(crate): remove serde pin
  4. chore(website): fix sitemap chrome build
  5. chore(crate): remove pins on common crates (reduces build size)
  6. chore(openai): fix prompt deserialization
  7. chore(openai): add custom api key config
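
Since feeds are picked up during a normal crawl, no extra configuration should be needed. A minimal sketch, assuming the rss support is compiled in (the start URL is only a placeholder):

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Placeholder start URL; feeds discovered during the crawl are handled automatically.
    let mut website: Website = Website::new("https://rsseau.fr");

    website.crawl().await;

    println!("Total pages: {:?}", website.get_links().len());
}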

Full Changelog: https://github.com/spider-rs/spider/compare/v1.89.0...v1.90.0

v1.89.7

1 month ago

What's Changed

RSS feeds are now handled automatically on crawls.

  1. feat(rss): add rss support
  2. chore(openai): fix compile chrome flag
  3. chore(crate): remove serde pin
  4. chore(website): fix sitemap chrome build
  5. chore(crate): remove pins on common crates (reduces build size)

Full Changelog: https://github.com/spider-rs/spider/compare/v1.89.0...v1.89.7

v1.88.7

1 month ago

What's Changed

You can now drive the browser with multiple steps toward a goal - see the example below. Extracting content or gathering extra data can be done as well using GPTConfigs.extra_ai_data.

The credits used can be checked with Page.openai_credits_used. A short sketch covering both follows the example.

  1. chore(page): return all page content regardless of status
  2. chore(openai): fix svg removal
  3. feat(openai): add extra data gpt curating
  4. chore(openai): add credits used response
  5. feat(fingerprint): add fingerprint id configuration

extern crate spider;

use spider::configuration::GPTConfigs;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let gpt_config: GPTConfigs = GPTConfigs::new_multi(
        "gpt-4-1106-preview",
        vec![
            "Search for Movies",
            "Click on the first result movie result",
        ],
        500,
    );

    let mut website: Website = Website::new("https://www.google.com")
        .with_openai(Some(gpt_config))
        .with_limit(1)
        .build()
        .unwrap();

    website.crawl().await;
}

Image: Google clicking on the first search result, driven by the AI prompts.
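
A condensed sketch of checking the credits and the extra AI data on each crawled page, using the subscription API shown in the other examples in these notes (the target URL, prompts, and limit are placeholders):

extern crate spider;

use spider::configuration::GPTConfigs;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut gpt_config: GPTConfigs = GPTConfigs::new_multi(
        "gpt-4-1106-preview",
        vec!["Search for Movies", "Click on the first result movie result"],
        500,
    );
    // Enable the extra AI data output (Page.extra_ai_data).
    gpt_config.set_extra(true);

    let mut website: Website = Website::new("https://www.google.com")
        .with_openai(Some(gpt_config))
        .with_limit(1)
        .build()
        .unwrap();
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            // Credits consumed by the AI calls and the extra data gathered for this page.
            println!("{} - credits: {:?}", page.get_url(), page.openai_credits_used);
            println!("extra: {:?}", page.extra_ai_data);
        }
    });

    website.crawl().await;
    website.unsubscribe();
}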

Full Changelog: https://github.com/spider-rs/spider/compare/v1.87.3...v1.88.7

v1.87.3

1 month ago

What's Changed

You can now bypass Cloudflare-protected pages with the real_browser feature flag.

  • feat(real_browser): add real_browser feature flag for chrome

Full Changelog: https://github.com/spider-rs/spider/compare/v1.86.16...v1.87.3

v1.86.16

1 month ago

What's Changed

You can now dynamically drive the browser with custom scripts using OpenAI. Make sure to set the OPENAI_API_KEY env variable or pass it to the program.

extern crate spider;

use spider::configuration::{GPTConfigs, WaitForIdleNetwork};
use spider::tokio;
use spider::website::Website;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let _ = tokio::fs::create_dir_all("./storage/").await;

    let screenshot_params =
        spider::configuration::ScreenshotParams::new(Default::default(), Some(true), Some(true));
    let screenshot_config =
        spider::configuration::ScreenShotConfig::new(screenshot_params, true, true, None);

    let mut website: Website = Website::new("https://google.com")
        .with_chrome_intercept(true, true)
        .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
        .with_screenshot(Some(screenshot_config))
        .with_limit(1)
        .with_openai(Some(GPTConfigs::new(
            "gpt-4-1106-preview",
            "Search for Movies",
            500,
        )))
        .build()
        .unwrap();
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("{}\n{}", page.get_url(), page.get_html());
        }
    });

    website.crawl().await;
}

Image: the output of the custom script from the AI.

Image: the screenshot of the page output.

Full Changelog: https://github.com/spider-rs/spider/compare/v1.85.4...v1.86.16

v1.85.1

1 month ago

What's Changed

You can now update the crawl links from outside of the crawl context by using website.queue to get a sender.

  • feat(q): add mid crawl queue

use spider::tokio;
use spider::url::Url;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();
    let mut g = website.subscribe_guard().unwrap();
    let q = website.queue(100).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            let u = res.get_url();
            println!("{:?}", u);
            let mut url = Url::parse(u).expect("Failed to parse URL");

            let mut segments: Vec<_> = url
                .path_segments()
                .map(|c| c.collect::<Vec<_>>())
                .unwrap_or_else(Vec::new);

            if segments.len() > 0 && segments[0] == "en" {
                segments[0] = "fr";
                let new_path = segments.join("/");
                url.set_path(&new_path);
                // get a new url here or perform an action and queue links
                // pre-fetch all fr locales
                let _ = q.send(url.into());
            }
            g.inc();
        }
    });

    let start = std::time::Instant::now();
    website.crawl().await;
    let duration = start.elapsed();

    println!(
        "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        website.get_links().len()
    )
}

Thanks @oiwn

Full Changelog: https://github.com/spider-rs/spider/compare/v1.84.11...v1.85.1