page-fetch is a tool for researchers that lets you fetch web pages using headless Chrome, storing all fetched resources (including JavaScript files), and run arbitrary JavaScript on many pages, capturing the returned values.
page-fetch is written in Go and can be installed with go install:
▶ go install github.com/detectify/page-fetch@latest
Or you can clone the repository and build it manually:
▶ git clone https://github.com/detectify/page-fetch.git
▶ cd page-fetch
▶ go install
page-fetch uses chromedp, which requires that a Chrome or Chromium browser be installed. When launching the browser it tries chromedp's default list of executable names, so a standard Chrome or Chromium installation should be found automatically.
page-fetch takes a list of URLs as its input on
stdin. You can provide the input list using IO redirection:
▶ page-fetch < urls.txt
Or using the output of another command:
▶ grep admin urls.txt | page-fetch
By default, responses are stored in a directory called 'out', which is created if it does not exist:
▶ echo https://detectify.com | page-fetch
GET https://detectify.com/ 200 text/html; charset=utf-8
GET https://detectify.com/site/themes/detectify/css/detectify.css?v=1621498751 200 text/css
GET https://detectify.com/site/themes/detectify/img/detectify_logo_black.svg 200 image/svg+xml
GET https://fonts.googleapis.com/css?family=Merriweather:300i 200 text/css; charset=utf-8
...
▶ tree out
out
├── detectify.com
│   ├── index
│   ├── index.meta
│   └── site
│       └── themes
│           └── detectify
│               ├── css
│               │   ├── detectify.css
│               │   └── detectify.css.meta
...
The directory structure used in the output directory mirrors the directory structure used on the target websites. A ".meta" file is stored for each request; it contains the originally requested URL (including the query string), the request and response headers, etc.
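Because every saved response body has a sibling ".meta" file, pairing them up takes only a short shell loop. The sketch below fabricates a tiny output tree first so it is self-contained; a real `out` directory would come from an actual page-fetch run:

```shell
# Fabricate a minimal output tree shaped like page-fetch's, then list
# each saved response body next to its .meta file.
mkdir -p out/example.com
touch out/example.com/index out/example.com/index.meta

find out -name '*.meta' | while read -r meta; do
  body="${meta%.meta}"   # strip the .meta suffix to get the body path
  echo "$body <- $meta"
done
```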
You can get the page-fetch help output by running page-fetch with the -h flag:
▶ page-fetch -h
You can change how many headless Chrome processes are used with the -c / --concurrency option. The default value is 2.
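The effect of the concurrency level can be pictured with a plain xargs pipeline: -P caps how many jobs run at once, just as the concurrency option caps the number of browser workers. The echo here is only a stand-in for a real fetch:

```shell
# Run at most 2 "fetches" at a time; with a real workload each of the
# two slots would drive a headless Chrome process instead of echo.
printf '%s\n' https://a.example https://b.example https://c.example |
  xargs -P 2 -I{} echo "fetching {}"
```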
You can choose not to save responses that match particular content types with the -e / --exclude option.
Any response with a content-type that partially matches the provided value will not be stored; so you can,
for example, avoid storing image files by specifying:
▶ page-fetch --exclude image/
The option can be specified multiple times to exclude multiple different content-types.
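The "partial match" rule means the provided value is treated as a substring of the response's Content-Type header rather than an exact match. That semantics can be sketched in plain shell; should_skip is a hypothetical helper illustrating the rule, not part of page-fetch:

```shell
# Return success (0) if the content type contains any excluded pattern.
should_skip() {
  ct="$1"; shift
  for pattern in "$@"; do
    case "$ct" in *"$pattern"*) return 0 ;; esac
  done
  return 1
}

should_skip "image/png" "image/" "font/" && echo "image/png: skipped"
should_skip "text/html; charset=utf-8" "image/" "font/" || echo "text/html: kept"
```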
Rather than excluding specific content-types, you can opt to only save certain content-types with the -i / --include option:
▶ page-fetch --include text/html
The option can be specified multiple times to include multiple different content-types.
You can run arbitrary JavaScript on each page with the -j / --javascript option. This option can be used for a very wide variety of purposes. As an example, you could extract the href attribute from all links on a webpage:
▶ echo https://example.com | page-fetch --javascript '[...document.querySelectorAll("a")].map(n => n.href)'
By default, files are stored in a directory called out. This can be changed with the -o / --output option:
▶ echo https://example.com | page-fetch --output example
GET https://example.com/ 200 text/html; charset=utf-8
▶ find example/ -type f
example/example.com/index
example/example.com/index.meta
The directory is created if it does not already exist.
The --proxy option can be used to specify a proxy for all requests. For example, to use the Burp Suite proxy with its default settings, you could run:
▶ echo https://example.com | page-fetch --proxy http://localhost:8080
By default, when a file already exists, a new file is created with a numeric suffix: if index already exists, index.1 will be created, and so on. This behaviour can be changed with the -w / --overwrite option; when it is used, matching files are overwritten instead.
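The default collision handling can be sketched as a small naming function. next_name is hypothetical; it mirrors the behaviour described above, not page-fetch's actual code:

```shell
# Pick the first free name: index, then index.1, index.2, ...
next_name() {
  name="$1"
  if [ ! -e "$name" ]; then echo "$name"; return; fi
  i=1
  while [ -e "$name.$i" ]; do i=$((i + 1)); done
  echo "$name.$i"
}

cd "$(mktemp -d)"
touch index
next_name index    # prints "index.1" because "index" is taken
```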
You may sometimes wish to exclude responses from third-party domains. This can be done with the --no-third-party option. Any response to a request for a domain that does not match the input URL, or one of its subdomains, will not be saved.
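The domain test described above (the input host itself, or any of its subdomains, counts as first-party) can be sketched like this; is_first_party is a hypothetical helper illustrating the rule, not page-fetch's implementation:

```shell
# First-party: the request host equals the input host, or ends in ".<input host>".
is_first_party() {
  input_host="$1"; request_host="$2"
  [ "$request_host" = "$input_host" ] && return 0
  case "$request_host" in *".$input_host") return 0 ;; esac
  return 1
}

is_first_party example.com www.example.com && echo "www.example.com: saved"
is_first_party example.com cdn.thirdparty.net || echo "cdn.thirdparty.net: skipped"
```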
On rare occasions you may wish to store only responses from third-party domains. This can be done with the --third-party option.