Prescience Data Foundation

🧱 A uniform template to use as a foundation for Puppeteer bot construction.

Project README

🧱 Foundation - Puppeteer Bot Starter Kit

Update:

Currently working on https://masqueradejs.com to replace this project, as it is quite a bit out of date now. In the meantime, you can check out https://github.com/clouedoc/puppeteer-boiler, which is similar and actively updated. 👾

What is it?


Foundation is intended as a simple entry-point / template for developers new to designing Puppeteer bots.

It uses the (in)famous Puppeteer-Extra package as the primary Puppeteer driver to enable its library of Stealth plugins and evasions.

👋 PS: If you're working on botting and looking for a great developer community, check out the Puppeteer-Extra Discord server: https://discord.gg/vz7PeKk

Foundation tries to avoid wrapping existing libraries and does not "add" much that doesn't already exist, but starting a new project with an unfamiliar library can come with a lot of questions around project structure and tooling.

This attempts to solve these issues with a ready-to-go scaffolding. Note, however, that the structure is just, like, my opinion man... and should be considered under heavy flux.

Breaking changes shouldn't matter though, because it's only intended as a starting point and you should take it in whatever direction makes sense.

"Ok, but I've come from Selenium / Python?"

If you're new to both modern JavaScript (ES6 & TypeScript) and Puppeteer, here's a quick rundown:

📚 Newbie Guide To Scraping With Puppeteer

Installation

⚠ Note for Windows users: This project does not include cross-env, so using WSL and Terminal Preview is essentially a requirement.

🎬 Download and init

Automatic

$ git clone https://github.com/prescience-data/foundation.git && cd ./foundation # Clone the project
$ npm run init

Manual

The automatic version runs the following commands:

$ git clone https://github.com/prescience-data/foundation.git && cd ./foundation # Clone the project
$ npm run update  # Updates the package.json file dependencies to latest versions
$ npm install --loglevel=error # Installs dependencies
$ npm run db:init # Initialises a sqlite database
$ npm run build:clean # Build the TypeScript code

👨‍🔧 Configure

Edit the .env to your liking and add any services like Google Cloud Logging etc.

⚠ Remember to add your .env file to .gitignore, and to run git rm --cached .env if it is already tracked, before committing to any public repositories.

⛷ Build / Run

The project is TypeScript so there are a few commands provided for this.

$ npm run build:clean # Just build the TypeScript files

or...

$ npm run bot # Builds the app and runs your entrypoint file

Project Structure

The project is split into two distinct parts, core and app.

This allows you to develop a quasi-framework that you can re-use between projects in the Core concern, while keeping all project-specific code within the App concern.

🛠 Config

core/config.ts

.env

The project uses a .env in the root to define most of the common environment variables, but you can load these from a database etc. if you prefer.

The main Puppeteer LaunchOptions are defined in the config.ts file.
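As a rough sketch of how .env values might feed the launch options: the variable names below (HEADLESS, CHROME_PATH, PROFILE_DIR, CHROME_ARGS) are assumptions made up for illustration, not names taken from the project. headless, executablePath, userDataDir and args are standard Puppeteer launch options, but the mapping itself is only one way to do it.

```typescript
// Minimal subset of Puppeteer launch options used in this sketch.
interface LaunchConfig {
  headless: boolean
  executablePath?: string
  userDataDir?: string
  args: string[]
}

// An env-like object, e.g. process.env once the .env file has been loaded.
type Env = Record<string, string | undefined>

// Hypothetical helper: the variable names are illustrative assumptions.
const buildLaunchConfig = (env: Env): LaunchConfig => {
  const config: LaunchConfig = {
    // Stay headless unless HEADLESS is explicitly set to "false".
    headless: env.HEADLESS !== "false",
    // Comma-separated Chrome flags, or none at all.
    args: env.CHROME_ARGS ? env.CHROME_ARGS.split(",") : [],
  }
  if (env.CHROME_PATH) config.executablePath = env.CHROME_PATH
  if (env.PROFILE_DIR) config.userDataDir = env.PROFILE_DIR
  return config
}
```

Taking the environment as a parameter rather than reading it globally keeps the function easy to test and makes it simple to swap in values from a database later.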

🤖 Bot

app/bot.ts

Main self-executing function entry-point.

This is where you execute each part of your scoped logic from the modules section cleanly.

Make some magic happen 🧙✨...


You call this module from the cli with:

$ npm run bot

Cli Arguments

You may wish to add cli arguments to direct the code in specific directions:

$ npm run bot -- --command=<CommandName>

Or if you prefer to shortcut your cli further you can add to your package.json scripts:

{
  "scripts": {
    "bot:moon-prism-power": "npm run bot -- --command=moon-prism-power"
  }
}
$ npm run bot:moon-prism-power ✨✨✨✨
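One way the entry-point could act on a --command flag is sketched below. This is not the project's actual implementation: parseFlags, dispatch, and the command table are hypothetical names, and the handlers just return strings so the flow is easy to follow.

```typescript
// Parse "--key=value" style flags from an argv-like array.
const parseFlags = (argv: string[]): Record<string, string> => {
  const flags: Record<string, string> = {}
  for (const arg of argv) {
    const match = /^--([^=]+)=(.*)$/.exec(arg)
    if (!match) continue
    const [, key, value] = match
    if (key !== undefined && value !== undefined) flags[key] = value
  }
  return flags
}

// Hypothetical command table; real handlers would call into your modules.
const commands: Record<string, () => string> = {
  "moon-prism-power": () => "transforming",
  default: () => "running default flow",
}

// Look up the requested command, falling back to "default" when none is given.
const dispatch = (argv: string[]): string => {
  const { command = "default" } = parseFlags(argv)
  if (!Object.prototype.hasOwnProperty.call(commands, command)) {
    throw new Error(`Unknown command: ${command}`)
  }
  return commands[command]()
}
```

In a Node entry-point you would typically call dispatch(process.argv.slice(2)), which receives everything passed after the -- separator.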

⚙ Business Logic

app/modules/<name>.ts

Your bot logic should be defined in clear logical scopes within the app/modules folder. It's best to keep things neat and abstracted from the start to avoid huge, confusing, single-file blobs as your bot grows.

It might seem like overkill to abstract logic out at the start (which may be true for very simple bots), but you'll notice very quickly how bloated a modestly complete bot can get.
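To make the idea concrete, here is a minimal sketch of two small modules composed by an entry-point. The file names in the comments and the function names are hypothetical. PageLike is a structural stand-in for the two Puppeteer Page methods used (goto and title), so the snippet type-checks without Puppeteer installed; in the project you would use the real Page type instead.

```typescript
// Stand-in for the parts of Puppeteer's Page that this sketch touches.
interface PageLike {
  goto(url: string): Promise<unknown>
  title(): Promise<string>
}

// app/modules/homepage.ts (hypothetical): one scoped step of the bot.
const openHomepage = async (page: PageLike, url: string): Promise<string> => {
  await page.goto(url)
  return page.title()
}

// app/modules/report.ts (hypothetical): a second, independent step.
const summarise = (title: string): string => `Loaded page titled "${title}"`

// app/bot.ts then only composes the modules and holds no logic of its own.
const run = async (page: PageLike): Promise<string> => {
  const title = await openHomepage(page, "https://example.com")
  return summarise(title)
}
```

Because each step only depends on the small interface it needs, it can be exercised with a fake page object before pointing it at a real browser.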


👨‍🔬 Detection Tests

core/tests/<name>.ts

A large part of building your bot is rapidly testing it against known detection code.

Long-term, you'll want to develop your own internal tests by de-obfuscating the vendor code of your target. For rapid early development, however, using hosted ones is fine.

You can use the existing detection tests provided, or build your own using the basic template provided.

Example

export const PixelScan: PageLogic = async (page: Page): Promise<Record<string, any>> => {
  // Load the test page and give its scripts a moment to settle.
  await page.goto("https://pixelscan.net", { waitUntil: "networkidle2" })
  await page.waitForTimeout(1500)
  // Extract the result element text.
  const element = await page.$("#consistency h1")
  if (!element) {
    throw new ElementNotFoundError(`Heading Tag`, element)
  }
  // textContent can be null, so fall back to an empty string before tidying whitespace.
  const text = await page.evaluate((el) => el.textContent ?? "", element)
  const result = text.replace(/\s+/g, " ").trim()
  // Notify and return result.
  return { result }
}

🧠 If you add new tests, remember to add them to index.ts so you can import all tests together if needed, and to the main run.ts file to allow cli access.


Running Detection Tests

To run your tests, use the command:

$ npm run tests -- --page=sannysoft

Available Tests

🧰 Utils

core/utils.ts

Aim to keep all your small, highly re-used utility functions in a single place.

  • rand(min: number, max: number, precision?: number) Returns a random number from a range.
  • delay(min: number, max: number) Shortcuts the rand method to return an options-ready object.
  • whitespace(value: string) Strips all duplicate whitespace and trims the string.
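The signatures above could be implemented roughly as follows. The bodies are assumptions rather than the project's code. In particular, the { delay } shape returned by delay is a guess at what "options-ready" means, chosen because Puppeteer's page.type() and page.click() accept a delay option in milliseconds.

```typescript
// Returns a random number between min and max, rounded to `precision`
// decimal places (0 by default, i.e. a whole number).
const rand = (min: number, max: number, precision = 0): number => {
  const factor = Math.pow(10, precision)
  const value = min + Math.random() * (max - min)
  return Math.round(value * factor) / factor
}

// Wraps rand in an object that can be passed straight in as an options argument.
const delay = (min: number, max: number): { delay: number } => ({
  delay: rand(min, max),
})

// Collapses runs of whitespace into a single space and trims both ends.
const whitespace = (value: string): string => value.replace(/\s+/g, " ").trim()
```

With these in place, a call such as page.type(selector, text, delay(50, 150)) would type with a randomised per-key delay instead of a fixed one.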

🖥 Browsers

core/browsers/<browser>.ts

Regular Browsers

All regular browsers are auto-loaded with the Stealth plugin.

Fancy Browsers


Examples

Chrome
  // Using Chrome via the executable.
  import Chrome from "../core/browsers" 
  const browser: Browser = await Chrome() 
  const page: Page = await browser.newPage()
MultiLogin
  // Using MultiLogin with a profile id.
  import MultiLogin from "../core/browsers" 
  const browser: Browser = await MultiLogin({ profileId: "fa3347ae-da62-4013-bcca-ef30825c9311"}) 
  const page: Page = await browser.newPage()
Browserless
  // Using Browserless with an api token.
  import Browserless from "../core/browsers" 
  const browser: Browser = await Browserless(env.BROWSERLESS_TOKEN) 
  const page: Page = await browser.newPage()

💾 Storage

storage/profiles/<uuid>

Local storage folder for switching Chrome profiles.
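A small helper can turn a profile id into its folder path, which could then be passed to Puppeteer as the userDataDir launch option. This is a sketch under assumptions: profileDir is a hypothetical name, and validating the id as a UUID is a design choice to keep stray input from escaping the storage folder, not something the project is known to do.

```typescript
// Matches the canonical 8-4-4-4-12 hexadecimal UUID layout.
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i

// Hypothetical helper: resolves storage/profiles/<uuid> for a given profile id.
const profileDir = (uuid: string, root = "storage/profiles"): string => {
  if (!UUID_PATTERN.test(uuid)) {
    throw new Error(`Invalid profile id: ${uuid}`)
  }
  return `${root}/${uuid.toLowerCase()}`
}
```

Switching profiles then amounts to launching with a different id, for example userDataDir: profileDir(id).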

📦 Database

core/services/db.ts

prisma/schema.prisma

Uses the fantastic Prisma database abstraction library with a simple SQLite database, but this can easily be configured for any local or remote RDBMS or keystore database.

https://www.prisma.io

Commands

$ npm run db:init # Wipes the database and regenerates types and migrations
$ npm run db:migrate # Creates migrations
$ npm run db:migrate:refresh # Long version of init
$ npm run db:generate # Generates fresh prisma files

Example

import { db } from "../core/services"
;(async () => {

// Bot execution code...

// If a result was returned, store it in the database.
// Prisma queries only execute once awaited, so keep the await.
if (result) {
 await db.scrape.create({
  data: {
    url: "https://www.startpage.com/en/privacy-policy/",
    html: result,
  },
 })
}

})()

Additionally, you can build out shortcut methods in the database folder to DRY out common database transactions.

/**
 * Basic Prisma abstraction for a common task.
 *
 * @param {string} url
 * @param {string} data
 * @return {Promise<void>}
 */
export const storeScrape = async (
  url: string,
  data: string | Record<string, any>
): Promise<void> => {
  // Flatten any objects passed in.
  if (typeof data !== "string") {
    data = JSON.stringify(data)
  }
  // Store the data (awaited, because Prisma queries only execute once awaited).
  await db.scrape.create({
    data: {
      url: url,
      data: data,
    },
  })
}

📃 Logging

core/services/logger.ts

Uses Winston to handle logging and output. Can be configured to transport to console, file, or a third-party transport like Google Cloud Logging (provided).

Check the Winston docs to extend or configure transports, or to switch it out completely.

Google Cloud Logging configuration

To set up Google Cloud Logging, you'll need a service account with Logs Writer and Monitoring Metric Writer permissions.

Guide:

  1. Create a GCP project https://console.cloud.google.com
  2. Enable the Cloud Logging API
  3. Create a service account
    • required roles:
      • Logging > Logs Writer
      • Monitoring > Monitoring Metric Writer
  4. Add a JSON key to the service account and download it to resources/google
  5. Make sure to edit the .env to match your service account key's filename (the GOOGLE_LOGGING_KEYFILE property)

Tooling

The project comes preconfigured with the following tooling to keep your code neat and readable. Make sure to configure your IDE to pick up the configs.

Work In Progress

🤷‍♀️ Any contributions on this would be much appreciated!

  • Writing Mocha tests
  • More demos!
  • Define other database systems eg Firebase
  • Containerize with Docker
  • Write mouse movement recorder and database storage driver
  • Add ghost-cursor to demo
  • Apply optional world isolation
  • Add emojis to logger
  • Migrate css selectors to xpath
Open Source Agenda is not affiliated with "Prescience Data Foundation" Project. README Source: prescience-data/foundation