The fastest web crawler written in Rust. Maintained by @a11ywatch.
chrome_screenshot feature flag
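A minimal sketch of enabling the flag in Cargo.toml; pairing it with the chrome feature for headless rendering is an assumption:
# sketch: enable headless Chrome rendering plus the screenshot capture flag
[dependencies]
spider = { version = "1.50.20", features = ["chrome", "chrome_screenshot"] }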
Full Changelog: https://github.com/spider-rs/spider/compare/v1.50.2...v1.50.20
You can now run a cron job at any time to sync data from crawls. Use the cron together with subscribe
to handle data curation with ease.
[dependencies]
spider = { version = "1.50.0", features = ["sync", "cron"] }
extern crate spider;

use spider::website::{Website, run_cron};
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // set the cron to run, or use the builder pattern `website.with_cron`.
    website.cron_str = "1/5 * * * * *".into();

    let mut rx2 = website.subscribe(16).unwrap();

    let join_handle = tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("{:?}", res.get_url());
        }
    });

    // run_cron takes ownership of the website. You can also use website.run_cron,
    // but then you need to abort the spawned handles manually.
    let runner = run_cron(website).await;

    println!("Starting the Runner for 10 seconds");
    tokio::time::sleep(tokio::time::Duration::from_secs(10)).await;
    let _ = tokio::join!(runner.stop(), join_handle);
}
Full Changelog: https://github.com/spider-rs/spider/compare/v1.49.10...v1.50.5
You can set a cookie String directly with website.cookie_str; it is added to each request. Using the cookie feature flag also enables storing cookies that are received.
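A minimal sketch of setting the cookie string before a crawl, assuming the cookie feature flag is enabled in Cargo.toml (the cookie value here is illustrative):
extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // sent as the Cookie header on every request during the crawl
    website.cookie_str = "session=example-value".into();
    website.crawl().await;
}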
Full Changelog: https://github.com/spider-rs/spider/compare/v1.49.10...v1.49.12
Thank you @marlonbaeten for help!
Full Changelog: https://github.com/spider-rs/spider/compare/v1.48.0...v1.49.10
Full Changelog: https://github.com/spider-rs/spider/compare/v1.46.5...v1.48.0
Full Changelog: https://github.com/spider-rs/spider/compare/v1.46.4...v1.46.5
Crawling all domains found on a website is now possible with * in external_domains.
Example:
let mut website = Website::new("https://choosealicense.com");

website
    .with_external_domains(Some(Vec::from(["*"].map(|d| d.to_string())).into_iter()));
Use the crawl budget and blacklist features to help prevent infinite crawls:
website
    .with_blacklist_url(Some(Vec::from(["^/blog/".into()])))
    .with_budget(Some(spider::hashbrown::HashMap::from([("*", 300), ("/licenses", 10)])));
Thank you @sebs for the help!
Full Changelog: https://github.com/spider-rs/spider/compare/v1.45.10...v1.46.0
You can now use crawl budgeting and domain grouping with the CLI.
Example:
spider --domain https://choosealicense.com --budget "*,1" crawl -o
# ["https://choosealicense.com"]
Example of grouping domains in a crawl with the CLI:
spider --domain https://choosealicense.com -E https://loto.rsseau.fr/ crawl -o
Full Changelog: https://github.com/spider-rs/spider/compare/v1.45.8...v1.45.9
Crawl budget limits prevent paths from exceeding a page limit. It is possible to set a budget for a path depth such as /a/b/c.
Use the budget feature flag to enable it.
page.url().await extra CDP call
Example:
use spider::tokio;
use spider::website::Website;
use std::io::Error;

#[tokio::main]
async fn main() -> Result<(), Error> {
    let mut website = Website::new("https://rsseau.fr")
        .with_budget(Some(spider::hashbrown::HashMap::from([
            ("*", 15),
            ("en", 11),
            ("fr", 3),
        ])))
        .build()?;

    website.crawl().await;

    let links = website.get_links();

    for link in links.iter() {
        println!("- {:?}", link.as_ref());
    }

    println!("Total pages: {}", links.len());

    Ok(())
}
// - "https://rsseau.fr/en/tag/google"
// - "https://rsseau.fr/en/blog/debug-nodejs-with-vscode"
// - "https://rsseau.fr/en/books"
// - "https://rsseau.fr/en/blog"
// - "https://rsseau.fr/en/tag/zip"
// - "https://rsseau.fr/books"
// - "https://rsseau.fr/en/resume"
// - "https://rsseau.fr/en/"
// - "https://rsseau.fr/en/tag/wpa"
// - "https://rsseau.fr/en/blog/express-typescript"
// - "https://rsseau.fr/en/blog/zip-active-storage"
// - "https://rsseau.fr"
// - "https://rsseau.fr/blog"
// - "https://rsseau.fr/en"
// - "https://rsseau.fr/fr"
// Total pages: 15
Full Changelog: https://github.com/spider-rs/spider/compare/v1.43.1...v1.45.8
You can now get the final redirect destination from pages with page.get_url_final.
If you need to see whether a redirect was performed, you can access page.final_redirect_destination
to get the Option.
Example:
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            // return the final redirect if found, or the url used for the request
            println!("{:?}", res.get_url_final());
        }
    });

    website.crawl().await;
}
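A minimal sketch that checks page.final_redirect_destination directly, reusing the subscribe setup shown above:
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            // Some(destination) only when the request was redirected
            if let Some(dest) = &res.final_redirect_destination {
                println!("{:?} -> {:?}", res.get_url(), dest);
            }
        }
    });

    website.crawl().await;
}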
Thank you @matteoredaelli @joksas for the issue and help!
Full Changelog: https://github.com/spider-rs/spider/compare/v1.42.1...v1.43.1