Spider Less

Web spider as a service, spider on serverless

spider-less

Web spider on Serverless!

About Spiderless

Spiderless is the backend layer of KMPPP, a web-spider-as-a-service application that lets you monitor, and get notified about, nearly anything on the web. It is built on the following technologies:

Technology        Used For
Bulma, Buefy      UI
Vue.js            Front-end logic
AWS S3            Website hosting
AWS Lambda        Backend API
AWS SNS           Message queue
AWS DynamoDB      Database
AWS API Gateway   API gateway
AWS CloudFront    CDN
AWS Route 53      DNS

Architecture

[Figure: serverless application architecture]

API Endpoints

GET subscriptions

Description

Get a list of subscriptions. The response holds at most 1 MB of data, a limit imposed by DynamoDB.

Parameters

None

Request

curl /api/subscriptions

Response

[
  {
    "createdAt": 1544833435070,
    "targets": [
      {
        "selector":"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span",
        "label":"ratingCount"
      }
    ],
    "id": "b4d98de0-ffff-11e8-a4c9-9b9ee9089058",
    "url": "https://www.imdb.com/title/tt0111161/",
    "interval": 60
  }
]
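
The `createdAt` value in the response above looks like a Unix timestamp in milliseconds. The README does not state the unit, so treat that as an assumption. Under it, a small Python sketch can turn the value into a readable UTC date (`to_utc` is a hypothetical helper, not part of the project):

```python
from datetime import datetime, timedelta, timezone


def to_utc(epoch_ms):
    """Convert epoch milliseconds (assumed unit) to an aware UTC datetime."""
    # Split with integer arithmetic to avoid float rounding on the milliseconds.
    seconds, millis = divmod(epoch_ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


created = to_utc(1544833435070)  # createdAt from the sample response
print(created.isoformat())
```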

POST subscriptions

Description

Create a new subscription to feed the spider.

Parameters

  • url (required) - URL of the target website
  • targets (required) - List of CSS selectors, each paired with a label, whose text content is extracted
  • interval (required) - Interval between scrapes, in minutes
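
The three parameters above could be checked before a subscription is submitted. The Python sketch below is only an illustration: `validate_subscription` is a hypothetical helper, and the specific rules (an http(s) URL, a non-empty list of label/selector pairs, a positive whole number of minutes) are assumptions, since the README only says the fields are required. It works on already-decoded values, that is, `targets` as a list and `interval` as an integer.

```python
from urllib.parse import urlparse


def validate_subscription(payload):
    """Return a list of problems; an empty list means the payload looks valid."""
    errors = []

    url = payload.get("url")
    parsed = urlparse(url) if isinstance(url, str) else None
    if parsed is None or parsed.scheme not in ("http", "https") or not parsed.netloc:
        errors.append("url must be an http(s) URL")

    targets = payload.get("targets")
    if not isinstance(targets, list) or not targets:
        errors.append("targets must be a non-empty list")
    else:
        for i, target in enumerate(targets):
            if not isinstance(target, dict) or not target.get("label") or not target.get("selector"):
                errors.append(f"targets[{i}] needs a label and a selector")

    interval = payload.get("interval")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(interval, bool) or not isinstance(interval, int) or interval <= 0:
        errors.append("interval must be a positive integer (minutes)")

    return errors


print(validate_subscription({"url": "https://www.imdb.com/title/tt0111161/", "interval": 60}))
```

In the last line `targets` is missing, so the returned list contains one message about it.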

Request

curl -X POST /api/subscriptions -d '{"url":"https://www.imdb.com/title/tt0111161/","targets":"[{\"label\":\"ratingCount\",\"selector\":\"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span\"}]","interval":"60"}' -H "Content-Type: application/json"
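
Note that the curl command above sends `targets` as a JSON-encoded string inside the JSON body and `interval` as a string, which is awkward to escape by hand. The Python sketch below builds an equivalent request programmatically. `build_create_request` and the `base_url` argument are hypothetical: the README does not give a host, so `https://example.invalid` is only a placeholder.

```python
import json
from urllib.request import Request


def build_create_request(base_url, url, targets, interval):
    """Build (but do not send) a POST request mirroring the curl example."""
    body = {
        "url": url,
        "targets": json.dumps(targets),  # JSON string, as in the curl example
        "interval": str(interval),       # string, as in the curl example
    }
    return Request(
        base_url.rstrip("/") + "/api/subscriptions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_create_request(
    "https://example.invalid",  # placeholder host, replace with a real deployment
    "https://www.imdb.com/title/tt0111161/",
    [{"label": "ratingCount", "selector": "#title-overview-widget span"}],
    60,
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it once `base_url` points at a real deployment.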

Response

{
  "id": "ef417d30-ffff-11e8-a4c9-9b9ee9089058",
  "url": "https://www.imdb.com/title/tt0111161/",
  "targets": [
    {
      "label":"ratingCount",
      "selector":"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span"
    }
  ],
  "interval": 60,
  "createdAt": 1544833533059,
  "updatedAt": 1544833533059
}

DELETE subscriptions

Description

Delete a subscription.

Parameters

  • id (required) - Subscription id

Request

curl -X DELETE /api/subscriptions/:id

Response

{
  "id": "d72c05d0-ffff-11e8-a4c9-9b9ee9089058"
}

Functions List

scrape

Description

Scrape the target website and extract the text content matched by each target selector.

Invoke

yarn invoke:local scrape -d '{"createdAt":1544833435070,"updatedAt":1544833435070,"targets":[{"selector":"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span","label":"ratingCount"}],"id":"b4d98de0-ffff-11e8-a4c9-9b9ee9089058","url":"https://www.imdb.com/title/tt0111161/","interval":60}'

Response

[
  {
    "label": "ratingCount",
    "content": "2,025,796"
  }
]
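
The mapping from targets to `label`/`content` pairs can be illustrated with a short Python sketch. This is not the project's implementation: it uses only the standard library and, as a simplifying assumption, supports nothing but plain `#id` selectors, whereas the real function accepts full CSS selectors like the one shown above. `IdTextExtractor` and this `scrape` are hypothetical names.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class IdTextExtractor(HTMLParser):
    """Collect the text inside the first element whose id matches."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0      # > 0 while inside the matched element
        self.done = False   # set once the matched element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def scrape(html, targets):
    """Return one {"label", "content"} dict per target, in order."""
    results = []
    for target in targets:
        match = re.fullmatch(r"#([\w-]+)", target["selector"])
        if match is None:
            raise ValueError("this sketch only supports '#id' selectors")
        parser = IdTextExtractor(match.group(1))
        parser.feed(html)
        parser.close()
        results.append({"label": target["label"], "content": "".join(parser.chunks).strip()})
    return results


page = '<div id="rating"><a href="#"><span>2,025,796</span></a> <br></div>'
print(scrape(page, [{"label": "ratingCount", "selector": "#rating"}]))
```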

cron

Description

Fetch subscriptions from the database and select the ones that are due to be executed.
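
The README does not say how a subscription is judged to be due. One possible rule, sketched in Python below, assumes that the function runs once per minute, that `createdAt` is in epoch milliseconds, and that a subscription is due whenever the whole minutes elapsed since `createdAt` are a multiple of `interval`. All three are assumptions, and `is_due` and `due_subscriptions` are hypothetical names.

```python
def is_due(subscription, now_ms):
    """True if a whole multiple of `interval` minutes has elapsed since creation."""
    interval = subscription["interval"]
    if interval <= 0:
        return False
    elapsed_minutes = (now_ms - subscription["createdAt"]) // 60_000
    return elapsed_minutes >= 0 and elapsed_minutes % interval == 0


def due_subscriptions(subscriptions, now_ms):
    """Keep only the subscriptions that should be scraped on this tick."""
    return [s for s in subscriptions if is_due(s, now_ms)]


subs = [
    {"id": "hourly", "createdAt": 1544833435070, "interval": 60},
    {"id": "every-7-min", "createdAt": 1544833435070, "interval": 7},
]
one_hour_later = 1544833435070 + 60 * 60_000
print([s["id"] for s in due_subscriptions(subs, one_hour_later)])
```

A rule based on a stored last-run timestamp would tolerate missed ticks better, but it needs a field that the sample records do not show.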

Invoke

yarn invoke:local cron

Response

None

Development

# install dependencies
yarn install

# start api server on port 8090
yarn start

# invoke function locally
yarn invoke:local function_name

# invoke remote function
yarn invoke function_name

Deploy

# first set up your AWS credentials: https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html
yarn deploy

README Source: genkio/spider-less
License
MIT