The fastest web crawler written in Rust. Maintained by @a11ywatch.
The example below gathers links from different domains.
```rust
use spider::tokio;
use spider::website::Website;
use std::io::Error;
use std::time::Instant;

#[tokio::main]
async fn main() -> Result<(), Error> {
    // Crawl the site and also gather links from the listed external domains.
    let mut website = Website::new("https://rsseau.fr")
        .with_external_domains(Some(
            Vec::from(["http://loto.rsseau.fr/"].map(|d| d.to_string())).into_iter(),
        ))
        .build()?;

    let start = Instant::now();
    website.crawl().await;
    let duration = start.elapsed();

    let links = website.get_links();

    for link in links {
        println!("- {:?}", link.as_ref());
    }

    println!(
        "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        links.len()
    );

    Ok(())
}
```
Thank you @roniemartinez and @sebs for the help!
Full Changelog: https://github.com/spider-rs/spider/compare/v1.41.1...v1.42.3
The `sitemap` feature flag was added to include pages found in the sitemap in the results. Currently, the links found from those pages are not crawled.
If you want to set a custom sitemap location, use the configuration `website.configuration.sitemap_url`:
```rust
website.configuration.sitemap_url = Some(Box::new("sitemap.xml".into()));
```
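A fuller sketch of the same configuration in context (assuming the `sitemap` feature flag is enabled in Cargo.toml):

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://rsseau.fr");
    // Point the crawler at a custom sitemap location before crawling.
    website.configuration.sitemap_url = Some(Box::new("sitemap.xml".into()));
    website.crawl().await;
    println!("total pages: {}", website.get_links().len());
}
```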
The builder method to adjust the location was accidentally left out and will be available in the next version.
Full Changelog: https://github.com/spider-rs/spider/compare/v1.40.6...v1.41.1
If you need crawls to be sequential, use `configuration.delay` or `website.with_delay(1)` and set any value greater than 0; a sketch follows below.
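A minimal sketch (assuming the delay is in milliseconds, as with the rest of the configuration):

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://rsseau.fr");
    // Any delay greater than 0 makes requests run sequentially.
    website.with_delay(250);
    website.crawl().await;
}
```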
Use the feature flag `chrome` for headless and `chrome_headed` for headful crawling.
Chrome installations are detected automatically on the OS. The current implementation uses chromiumoxide and handles HTML as raw strings, so downloading media is not ideal since the bytes may be invalid. The `chrome` feature does not work with the `decentralized` flag at the moment.
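With the feature compiled in (e.g. `spider = { version = "1.40", features = ["chrome"] }` in Cargo.toml), the crawl API stays the same; a minimal sketch:

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // With the chrome feature enabled, pages are fetched through the
    // detected Chrome installation instead of the plain HTTP client.
    let mut website = Website::new("https://rsseau.fr");
    website.crawl().await;
    println!("total pages: {}", website.get_links().len());
}
```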
The video below shows 200+ pages being handled within a couple of seconds; headless runs drastically faster. Use headed mode only for debugging.
https://github.com/spider-rs/spider/assets/8095978/df341e32-df09-4ff6-9f77-468088b87b73
Full Changelog: https://github.com/spider-rs/spider/compare/v1.37.7...v1.40.6
Full Changelog: https://github.com/spider-rs/spider/compare/v1.7.22...v1.8.0
You can now get the bytes from `Page` to store as a valid resource.
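A minimal sketch of persisting those bytes, assuming `scrape` is used to retain page content and `get_bytes` exposes the raw response body (exact signatures may differ):

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://choosealicense.com");
    // scrape() stores the pages along with their content.
    website.scrape().await;

    if let Some(pages) = website.get_pages() {
        for (i, page) in pages.iter().enumerate() {
            // Assumption: get_bytes() returns the raw response body.
            if let Some(bytes) = page.get_bytes() {
                std::fs::write(format!("page-{}.bin", i), bytes).unwrap();
            }
        }
    }
}
```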
Thank you @Byter09 for the help!
Full Changelog: https://github.com/spider-rs/spider/compare/v1.36.5...v1.37.7
With the `sync` feature enabled (enabled by default):
```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    let mut rx2 = website.subscribe(16).unwrap();

    // Receive each page as it is crawled.
    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("{:?}", res.get_url());
        }
    });

    website.crawl().await;
}
```
If you need the events to finish first, spawn the `website.crawl().await` call instead and receive with `rx2.recv().await` on the current task.
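A sketch of that inverted pattern:

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    let mut rx2 = website.subscribe(16).unwrap();

    // Run the crawl in a task so the current task can drain every event;
    // recv() errors once the crawl finishes and the sender is dropped.
    let handle = tokio::spawn(async move {
        website.crawl().await;
    });

    while let Ok(res) = rx2.recv().await {
        println!("{:?}", res.get_url());
    }

    handle.await.unwrap();
}
```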
Full Changelog: https://github.com/spider-rs/spider/compare/v1.34.2...v1.36.3
```rust
let mut website = Website::new("https://choosealicense.com");

website
    .with_respect_robots_txt(true)
    .with_subdomains(true)
    .with_tld(false)
    .with_delay(0)
    .with_request_timeout(None)
    .with_http2_prior_knowledge(false)
    .with_user_agent(Some("myapp/version".into()))
    .with_on_link_find_callback(Some(|s| {
        println!("link target: {}", s.inner());
        s
    }))
    .with_headers(None)
    .with_blacklist_url(Some(Vec::from([
        "https://choosealicense.com/licenses/".into(),
    ])))
    .with_proxies(None);
```
Thank you @roniemartinez for the help!
Use `download` to store HTML locally at any target destination. Example, with a local server up and the spider CLI:
```sh
cargo install spider_cli
spider --domain http://localhost:3000 download
```
Full Changelog: https://github.com/spider-rs/spider/compare/v1.31.7...v1.32.8
Full Changelog: https://github.com/spider-rs/spider/compare/v1.31.5...v1.31.7