Madeindjs Spider Versions

The fastest web crawler written in Rust. Maintained by @a11ywatch.

v1.42.3

8 months ago

What's Changed

  • feat(external): add external domains grouping #135
  • chore(website): website build method to perform validations with the builder chain
  • chore(cli): fix links json output
  • chore(glob): fix link callback #136

The example below gathers links from different domains.

use spider::tokio;
use spider::website::Website;
use std::io::Error;
use std::time::Instant;

#[tokio::main]
async fn main() -> Result<(), Error> {
    // Group links under the external domain into the same crawl.
    let mut website = Website::new("https://rsseau.fr")
        .with_external_domains(Some(Vec::from(["http://loto.rsseau.fr/"].map(|d| d.to_string())).into_iter()))
        .build()?;

    let start = Instant::now();
    website.crawl().await;
    let duration = start.elapsed();

    let links = website.get_links();

    for link in links {
        println!("- {:?}", link.as_ref());
    }

    println!(
        "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        links.len()
    );

    Ok(())
}

Thank you @roniemartinez and @sebs for the help!

Full Changelog: https://github.com/spider-rs/spider/compare/v1.41.1...v1.42.3

v1.41.1

8 months ago

What's Changed

The sitemap feature flag was added so pages found in the sitemap are included in the results. Links found on those pages are not yet crawled.

  1. feat(sitemap): add sitemap crawling feature flag

If you want to set a custom sitemap location, use the configuration field website.configuration.sitemap_url.

Example

website.configuration.sitemap_url = Some(Box::new("sitemap.xml".into()));

The builder method to adjust the location was accidentally left out and will be available in the next version.
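
A minimal sketch putting this together, assuming the sitemap feature flag is enabled for the spider crate in Cargo.toml (the target URL is only a placeholder):

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");

    // Point the crawler at a custom sitemap location before crawling.
    website.configuration.sitemap_url = Some(Box::new("sitemap.xml".into()));

    website.crawl().await;
}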

Full Changelog: https://github.com/spider-rs/spider/compare/v1.40.6...v1.41.1

v1.40.6

8 months ago

What's Changed

  • feat(chrome): enable chrome rendering page content [experimental]
  • chore(crawl): remove crawl sync method for Sequential crawls

If you need crawls to be sequential, set configuration.delay or use website.with_delay, passing any value greater than 0 (for example website.with_delay(1)).
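
A short sketch of a sequential crawl using the builder method mentioned above; the target URL is only a placeholder:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");

    // Any value greater than 0 makes the crawl run sequentially.
    website.with_delay(1);

    website.crawl().await;
}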

Headless

Use the feature flag chrome for headless and chrome_headed for headful crawling.
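
As a sketch, assuming the dependency is managed with cargo, the feature can be enabled with something like:

cargo add spider --features chrome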

Chrome installations are detected automatically on the OS. The current implementation uses chromiumoxide and handles HTML as raw strings, so downloading media is not ideal because the bytes may be invalid. The chrome feature does not work with the decentralized flag at the moment.

The video below shows 200+ pages being handled within a couple of seconds; headless runs drastically faster. Use headed mode only for debugging.

https://github.com/spider-rs/spider/assets/8095978/df341e32-df09-4ff6-9f77-468088b87b73

Full Changelog: https://github.com/spider-rs/spider/compare/v1.37.7...v1.40.6

v1.8.0

8 months ago

What's Changed

  • feat(time): add page duration uptime tracking with the feature flag time.

Full Changelog: https://github.com/spider-rs/spider/compare/v1.7.22...v1.8.0

v1.37.7

9 months ago

What's Changed

  • feat(page): add byte storing resource by @j-mendez in https://github.com/spider-rs/spider/pull/131
  • chore(pages): fix full_resource flag gathering scripts
  • chore(cli): fix resource extensions [130]
  • chore(full_resources): fix capturing link tag
  • chore(page): fix trailing slash url getter

You can now get the bytes from Page to store as a valid resource.
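
A rough sketch of persisting those bytes from a subscription; the get_bytes accessor name and its Option return type are assumptions here, so check the Page docs for the exact signature:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    let mut rx = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            // get_bytes (assumed name) exposes the raw response body so
            // binary resources can be written to disk unchanged.
            if let Some(bytes) = page.get_bytes() {
                std::fs::write("resource.bin", bytes).ok();
            }
        }
    });

    website.crawl().await;
}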

Thank you @Byter09 for the help!

Full Changelog: https://github.com/spider-rs/spider/compare/v1.36.5...v1.37.7

v1.36.5

9 months ago

What's Changed

Subscriptions 🚀

With the sync feature enabled (enabled by default):

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // Subscribe before crawling; the channel buffers up to 16 pages.
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("{:?}", res.get_url());
        }
    });

    website.crawl().await;
}

If you need all of the events to be processed before continuing, spawn the call to website.crawl().await instead and receive on rx2 in the current task.
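
A hedged sketch of that ordering: the crawl is spawned and the events are drained in the current task. The receive loop ends once the crawl finishes and the website, which holds the channel sender, is dropped (this shutdown behavior is an assumption):

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    let mut rx2 = website.subscribe(16).unwrap();

    // Run the crawl in the background so events can be drained here.
    let crawl = tokio::spawn(async move {
        website.crawl().await;
    });

    while let Ok(res) = rx2.recv().await {
        println!("{:?}", res.get_url());
    }

    crawl.await.unwrap();
}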

Full Changelog: https://github.com/spider-rs/spider/compare/v1.34.2...v1.36.3

v1.34.2

10 months ago

What's Changed

  • add builder pattern #115
  • remove unused buffer config
  • pin crates and update [email protected]
  • perf regex set one pass usage
  • fix regex test
  • add http header configuration
  • fix glob parsing links
let mut website = Website::new("https://choosealicense.com");

website
  .with_respect_robots_txt(true)
  .with_subdomains(true)
  .with_tld(false)
  .with_delay(0)
  .with_request_timeout(None)
  .with_http2_prior_knowledge(false)
  .with_user_agent(Some("myapp/version".into()))
  .with_on_link_find_callback(Some(|s| {
    println!("link target: {}", s.inner());
    s
  }))
  .with_headers(None)
  .with_blacklist_url(Some(Vec::from(["https://choosealicense.com/licenses/".into()])))
  .with_proxies(None);

Thank you @roniemartinez for the help!

v1.34.0

10 months ago

What's Changed

  • add builder pattern #115
  • remove unused buffer config
  • pin crates and update [email protected]
  • perf regex set one pass usage
  • fix regex test
  • add http header configuration
let mut website = Website::new("https://choosealicense.com");

website
  .with_respect_robots_txt(true)
  .with_subdomains(true)
  .with_tld(false)
  .with_delay(0)
  .with_request_timeout(None)
  .with_http2_prior_knowledge(false)
  .with_user_agent(Some("myapp/version".into()))
  .with_on_link_find_callback(Some(|s| {
    println!("link target: {}", s.inner());
    s
  }))
  .with_headers(None)
  .with_blacklist_url(Some(Vec::from(["https://choosealicense.com/licenses/".into()])))
  .with_proxies(None);

Thank you @roniemartinez for the help!

v1.32.8

10 months ago

What's Changed

  1. add cli command download to store html locally at any target destination
  2. fix regex crate compiling for blacklist urls
  3. add .jsp file detection support

Example:

With a local web server running and the spider CLI installed:

cargo install spider_cli

spider --domain http://localhost:3000 download

Example of HTML content downloaded locally for the domain localhost:3000.

Full Changelog: https://github.com/spider-rs/spider/compare/v1.31.7...v1.32.8

v1.31.7

11 months ago

What's Changed

Full Changelog: https://github.com/spider-rs/spider/compare/v1.31.5...v1.31.7