Spider Versions

The fastest web crawler written in Rust. Maintained by @a11ywatch.

v1.83.0

1 month ago

What's Changed

  1. feat(chrome): add wait_for configuration: delay, selector, and idle network (see the sketch below)
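
A minimal sketch of the new wait_for options on a headless Chrome crawl. The builder methods and wrapper types used here (with_wait_for_idle_network, with_wait_for_selector, with_wait_for_delay) are assumptions based on this release note and require the [chrome] feature flag; check the crate docs for the exact API.

extern crate spider;

use spider::configuration::{WaitForDelay, WaitForIdleNetwork, WaitForSelector};
use spider::tokio;
use spider::website::Website;
use std::time::Duration;

#[tokio::main]
async fn main() {
    // assumed builder methods for the wait_for settings added in this release.
    let mut website: Website = Website::new("https://rsseau.fr")
        // wait until the network is idle, up to 2 seconds.
        .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(2)))))
        // wait for a CSS selector to appear, up to 2 seconds.
        .with_wait_for_selector(Some(WaitForSelector::new(Some(Duration::from_secs(2)), "body".into())))
        // add a fixed delay before extracting the page.
        .with_wait_for_delay(Some(WaitForDelay::new(Some(Duration::from_millis(250)))))
        .build()
        .unwrap();

    website.crawl().await;

    println!("Links found {:?}", website.get_links().len());
}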

Full Changelog: https://github.com/spider-rs/spider/compare/v1.82.7...v1.83.0

v1.82.7

1 month ago

What's Changed

  1. chore(website): add direct-assign http client methods website.set_http_client and website.get_client
  2. chore(chrome): fix bytes handling
  3. chore(chrome): fix wait_for_network page extracting
  4. chore(chrome): fix final_url page when wait_for_network_idle

Thanks for the help, @esemeniuc.

Full Changelog: https://github.com/spider-rs/spider/compare/v1.81.2...v1.82.7

v1.81.2

2 months ago

What's Changed

This release exposes HTTP response headers behind the [headers] and [decentralized_headers] feature flags.
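
A minimal sketch of reading the captured headers while subscribed to a crawl, assuming the [headers] flag keeps the response headers on the page as a headers field (the field name and shape are assumptions; check the crate docs).

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr").build().unwrap();
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            // with the [headers] feature enabled the response headers are kept
            // alongside the page; `page.headers` is an assumed accessor.
            println!("{:?} - {:?}", page.get_url(), page.headers);
        }
    });

    website.crawl().await;
}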

Full Changelog: https://github.com/spider-rs/spider/compare/v1.80.85...v1.81.2

v1.80.85

3 months ago

What's Changed

This release includes some major changes; upgrading immediately is recommended.

  1. chore(sitemap): fix extracting multiple sitemap urls
  2. chore(website): fix un-controlled shutdown
  3. chore(website): fix http scrape persist links between crawls
  4. chore(docs): add subscribe_guard example
  5. chore(scrape): fix scrape persist active crawl
  6. feat(chrome): add wait_for_network_idle
  7. chore(sitemap): fix gathering all links
  8. chore(page): fix external domains include
  9. feat(cli): add page limiting
  10. chore(cli): fix depth optional arg
  11. chore(cli): fix external domain include [full_resources] feat flag
  12. feat(config): add with_danger_accept_invalid_certs builder method (see the sketch after this list)
  13. chore(page): fix subdomain detection
  14. chore(cli): add accept_invalid_certs arg
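
A minimal sketch of the with_danger_accept_invalid_certs builder from item 12, assuming it takes a bool like the underlying HTTP client option does (the parameter type and the test URL are illustrative assumptions).

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // accept self-signed or otherwise invalid TLS certificates for this crawl only.
    let mut website: Website = Website::new("https://self-signed.badssl.com")
        .with_danger_accept_invalid_certs(true)
        .build()
        .unwrap();

    website.crawl().await;

    println!("Links found {:?}", website.get_links().len());
}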

Thanks for the help, @apsaltis and @emgardner.

Full Changelog: https://github.com/spider-rs/spider/compare/v1.80.63...v1.80.85

v1.80.68

3 months ago

What's Changed

  1. chore(sitemap): fix extracting multiple sitemap urls
  2. chore(website): fix un-controlled shutdown
  3. chore(website): fix http scrape persist links between crawls
  4. chore(docs): add subscribe_guard example

Full Changelog: https://github.com/spider-rs/spider/compare/v1.80.63...v1.80.68

v1.80.27

3 months ago

What's Changed

  • feat(config): add re-usable configuration building

Example reusing the configuration across multiple crawls:

extern crate spider;

use spider::{tokio, website::Website, configuration::Configuration};
use std::{time::Instant, io::Error};

const CAPACITY: usize = 4;
const CRAWL_LIST: [&str; CAPACITY] = [
    "https://rsseau.fr",
    "https://jeffmendez.com",
    "https://spider-rs.github.io/spider-nodejs/",
    "https://spider-rs.github.io/spider-py/",
];

#[tokio::main]
async fn main() -> Result<(), Error> {
    let config = Configuration::new()
        .with_user_agent(Some("SpiderBot"))
        .with_blacklist_url(Some(Vec::from(["https://rsseau.fr/resume".into()])))
        .with_subdomains(false)
        .with_tld(false)
        .with_redirect_limit(3)
        .with_respect_robots_txt(true)
        .with_external_domains(Some(
            Vec::from(["http://loto.rsseau.fr/"].map(|d| d.to_string())).into_iter(),
        ))
        .build();

    let mut handles = Vec::with_capacity(CAPACITY);

    for website_url in CRAWL_LIST {
        match Website::new(website_url)
            .with_config(config.to_owned())
            .build()
        {
            Ok(mut website) => {
                let handle = tokio::spawn(async move {
                    println!("Starting Crawl - {:?}", website.get_domain().inner());

                    let start = Instant::now();
                    website.crawl().await;
                    let duration = start.elapsed();

                    let links = website.get_links();

                    for link in links {
                        println!("- {:?}", link.as_ref());
                    }

                    println!(
                        "{:?} - Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
                        website.get_domain().inner(),
                        duration,
                        links.len()
                    );
                });

                handles.push(handle);
            }
            Err(e) => println!("{:?}", e),
        }
    }

    for handle in handles {
        let _ = handle.await;
    }

    Ok(())
}

Full Changelog: https://github.com/spider-rs/spider/compare/v1.80.19...v1.80.27

v1.80.63

3 months ago

What's Changed

  • feat(config): add re-usable configuration building
  • feat(chrome): add override locale emulation
  • chore(chrome): fix request interception execute self hosted scripts
  • chore(intercept): add Stripe to the intercept allow list for js frameworks
  • chore(config): add simple crawl limit builder method [website.with_limit] (see the sketch after this list)
  • perf(page): fix inline hot link extracting
  • chore(chrome): fix glob compile
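
A minimal sketch of the new website.with_limit crawl cap, assuming the builder takes the maximum number of pages to crawl (the parameter type is an assumption; check the crate docs).

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // stop the crawl after roughly 10 pages; the numeric argument is assumed
    // from the builder name in this release note.
    let mut website: Website = Website::new("https://rsseau.fr")
        .with_limit(10)
        .build()
        .unwrap();

    website.crawl().await;

    println!("Links found {:?}", website.get_links().len());
}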

Example reusing the configuration across multiple crawls:

extern crate spider;

use spider::{tokio, website::Website, configuration::Configuration};
use std::{time::Instant, io::Error};

const CAPACITY: usize = 4;
const CRAWL_LIST: [&str; CAPACITY] = [
    "https://rsseau.fr",
    "https://jeffmendez.com",
    "https://spider-rs.github.io/spider-nodejs/",
    "https://spider-rs.github.io/spider-py/",
];

#[tokio::main]
async fn main() -> Result<(), Error> {
    let config = Configuration::new()
        .with_user_agent(Some("SpiderBot"))
        .with_blacklist_url(Some(Vec::from(["https://rsseau.fr/resume".into()])))
        .with_subdomains(false)
        .with_tld(false)
        .with_redirect_limit(3)
        .with_respect_robots_txt(true)
        .with_external_domains(Some(
            Vec::from(["http://loto.rsseau.fr/"].map(|d| d.to_string())).into_iter(),
        ))
        .build();

    let mut handles = Vec::with_capacity(CAPACITY);

    for website_url in CRAWL_LIST {
        match Website::new(website_url)
            .with_config(config.to_owned())
            .build()
        {
            Ok(mut website) => {
                let handle = tokio::spawn(async move {
                    println!("Starting Crawl - {:?}", website.get_domain().inner());

                    let start = Instant::now();
                    website.crawl().await;
                    let duration = start.elapsed();

                    let links = website.get_links();

                    for link in links {
                        println!("- {:?}", link.as_ref());
                    }

                    println!(
                        "{:?} - Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
                        website.get_domain().inner(),
                        duration,
                        links.len()
                    );
                });

                handles.push(handle);
            }
            Err(e) => println!("{:?}", e),
        }
    }

    for handle in handles {
        let _ = handle.await;
    }

    Ok(())
}

Full Changelog: https://github.com/spider-rs/spider/compare/v1.80.19...v1.80.63

v1.80.19

3 months ago

What's Changed

This release includes performance improvements, full builder-method defaults, and encoding support.

  • feat(encoding): add dynamic streaming encoding html

Example using dynamic streaming encoding. Enable the feature flag [encoding].

extern crate spider;

use spider::{tokio, hashbrown::HashMap, website::Website};

#[tokio::main]
async fn main() {
    let mut website: Website =
        Website::new("https://hoken.kakaku.com/health_check/blood_pressure/")
            .with_budget(Some(HashMap::from([("*", 2)])))
            .build()
            .unwrap();
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("{:?}", res.get_url());
            println!("{:?}", res.get_html_encoded("SHIFT_JIS"));
        }
    });

    website.crawl().await;
}

Full Changelog: https://github.com/spider-rs/spider/compare/v1.80.15...v1.80.19

v1.80.15

3 months ago

What's Changed

  • feat(depth): add crawl depth level control
  • feat(redirect): expose a redirect limit that respects server redirects
  • feat(redirect): add redirect policy Loose & Strict
  • perf(control): add rwlock crawl control

Example:

extern crate spider;

use spider::{tokio, website::Website, configuration::RedirectPolicy};
use std::io::Error;

#[tokio::main]
async fn main() -> Result<(), Error> {
    let mut website = Website::new("https://rsseau.fr")
        .with_depth(3)
        .with_redirect_limit(4)
        .with_redirect_policy(RedirectPolicy::Strict)
        .build()
        .unwrap();

    website.crawl().await;

    let links = website.get_links();

    for link in links {
        println!("- {:?}", link.as_ref());
    }

    println!("Total pages: {:?}", links.len());

    Ok(())
}

Full Changelog: https://github.com/spider-rs/spider/compare/v1.80.3...v1.80.15

v1.80.3

4 months ago

What's Changed

  • feat(cache): add caching backend feat flag by @j-mendez in https://github.com/spider-rs/spider/pull/156
  • chore(chrome_intercept): fix intercept redirect initial domain
  • perf(chrome_intercept): improve intercept handling of assets

Example:

Make sure the [cache] feature flag is enabled. To store the cache in memory instead of on disk, use the [cache_mem] flag.

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // we can use the builder method to enable caching or set `website.cache` to true directly.
    let mut website: Website = Website::new("https://rsseau.fr")
        .with_caching(true)
        .build()
        .unwrap();

    website.crawl().await;

    println!("Links found {:?}", website.get_links().len());
    // a subsequent website.crawl().await will be faster since the content is cached on disk.
}

Full Changelog: https://github.com/spider-rs/spider/compare/v1.70.4...v1.80.3