Nemo Scrape

Distributed Scraper

This is a scraper function that automatically pulls metadata from a page and supports simple HTML querying using cheerio.

It's built on top of stdlib, which makes it highly distributed and scalable.
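To give a feel for the metadata grouping, here is a minimal sketch that sorts a page's meta tags into general, openGraph and twitter buckets, like the example output further down. It is not this project's actual code: extractMetadata is a hypothetical helper, and it uses regular expressions instead of cheerio only to stay dependency-free. It assumes double-quoted attributes and will miss anything more unusual.

```javascript
// Hypothetical helper, not part of nemo/scrape: groups <meta> tags into
// general / openGraph / twitter buckets. A regex keeps the sketch
// dependency-free; a real HTML parser such as cheerio is more robust.
function extractMetadata(html) {
  const metadata = { general: {}, openGraph: {}, twitter: {} };

  // The <title> element goes under "general".
  const title = /<title[^>]*>([^<]*)<\/title>/i.exec(html);
  if (title) metadata.general.title = title[1].trim();

  // Visit every <meta ...> tag and collect its double-quoted attributes.
  for (const [tag] of html.matchAll(/<meta\b[^>]*>/gi)) {
    const attrs = {};
    for (const [, name, value] of tag.matchAll(/([\w:-]+)\s*=\s*"([^"]*)"/g)) {
      attrs[name.toLowerCase()] = value;
    }

    // Open Graph tags use "property"; most others use "name".
    const key = attrs.property || attrs.name;
    if (!key || attrs.content === undefined) continue;

    if (key.startsWith('og:')) {
      metadata.openGraph[key.slice('og:'.length)] = attrs.content;
    } else if (key.startsWith('twitter:')) {
      metadata.twitter[key.slice('twitter:'.length)] = attrs.content;
    } else if (key === 'description') {
      metadata.general.description = attrs.content;
    }
  }
  return metadata;
}
```

Calling extractMetadata on a page's HTML string returns an object whose keys drop the og: and twitter: prefixes, so og:site_name ends up as openGraph.site_name.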

Usage

You can either use the ready-made service deployed on stdlib here, or fork this repository and launch your own version on stdlib.

Example

For example, a simple scrape to pick up my own email address from GitHub (and a bunch of extra metadata):

lib nemo.scrape --url https://github.com/nemo --query "li[itemprop='email'] a"

{ metadata:
   { general:
      { description: 'nemo has 36 repositories available. Follow their code on GitHub.',
        title: 'nemo (Nima Gardideh) · GitHub',
        lang: 'en' },
     openGraph:
      { app_id: '1401488693436528',
        image: [Object],
        site_name: 'GitHub',
        type: 'profile',
        title: 'nemo (Nima Gardideh)',
        url: 'https://github.com/nemo',
        description: 'nemo has 36 repositories available. Follow their code on GitHub.',
        username: 'nemo' },
     schemaOrg: { items: [Object] },
     twitter:
      { image: [Object],
        site: '@github',
        card: 'summary',
        title: 'nemo (Nima Gardideh)',
        description: 'nemo has 36 repositories available. Follow their code on GitHub.' } },
  url: 'https://github.com/nemo',
  query: 'li[itemprop=\'email\'] a',
  query_value: '[email protected]'
}

You can view the function specification here.
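Once you have a result shaped like the example output above, you will usually want one value out of the nested metadata. The sketch below shows one way to do that. pickTitle is a hypothetical helper, not something the service provides, and the fallback order (Open Graph, then Twitter, then the page title) is just an assumption about what is most useful.

```javascript
// Hypothetical helper: picks the best available title from a scrape result,
// tolerating missing metadata groups.
function pickTitle(result) {
  const { openGraph = {}, twitter = {}, general = {} } = result.metadata || {};
  return openGraph.title || twitter.title || general.title || null;
}

// Trimmed copy of the example result shown above.
const result = {
  metadata: {
    general: { title: 'nemo (Nima Gardideh) · GitHub', lang: 'en' },
    openGraph: { title: 'nemo (Nima Gardideh)', site_name: 'GitHub' },
    twitter: { title: 'nemo (Nima Gardideh)', card: 'summary' },
  },
  url: 'https://github.com/nemo',
  query: "li[itemprop='email'] a",
};

console.log(pickTitle(result)); // prints "nemo (Nima Gardideh)"
```

Defaulting each group to an empty object means a page without Open Graph or Twitter tags still falls through to the plain page title.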

Notes

Note that this scraper does not support sites that are single-page JavaScript applications. You should also follow robots.txt rules when you're scraping websites. Use responsibly.
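If you want to check robots.txt before scraping, a rough sketch of the idea follows. isPathAllowed is a hypothetical helper and a deliberately simplified reading of the format: it only looks at the "User-agent: *" group and does plain prefix matching on Disallow lines, ignoring Allow rules, wildcards and agent-specific groups. Treat it as a starting point rather than a complete parser.

```javascript
// Hypothetical helper: returns true if `path` is not disallowed for all
// crawlers ("User-agent: *") in the given robots.txt text.
// Simplified: prefix matching only; no wildcards, Allow rules, or
// agent-specific groups.
function isPathAllowed(robotsTxt, path) {
  let appliesToAll = false;  // are we inside a group that includes "*"?
  let inAgentLines = false;  // true while reading consecutive User-agent lines
  const disallowed = [];

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue; // blank or malformed line

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      // A User-agent line that follows rules starts a new group.
      if (!inAgentLines) appliesToAll = false;
      if (value === '*') appliesToAll = true;
      inAgentLines = true;
    } else {
      inAgentLines = false;
      // An empty Disallow value means "nothing is disallowed".
      if (field === 'disallow' && appliesToAll && value !== '') {
        disallowed.push(value);
      }
    }
  }
  return !disallowed.some((prefix) => path.startsWith(prefix));
}
```

You would fetch the site's /robots.txt yourself, pass its text and the path you intend to scrape, and skip the request when the function returns false.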

License

MIT

Open Source Agenda is not affiliated with the "Nemo Scrape" project. README source: nemo/scrape

Last Commit: 7 years ago

Repository: nemo/scrape

Homepage: https://hackernoon.com/microservice-series-scraper-ee970df3e81f#.25rzprigt
