page-fetch is a tool for researchers that lets you fetch web pages using headless Chrome, storing all fetched resources (including JavaScript files), and run arbitrary JavaScript on many pages, capturing the returned values.
page-fetch is written in Go and can be installed with go install:
▶ go install github.com/detectify/page-fetch@latest
Or you can clone the repository and build it manually:
▶ git clone https://github.com/detectify/page-fetch.git
▶ cd page-fetch
▶ go install
page-fetch uses chromedp, which requires that a Chrome or Chromium browser be installed. When launching the browser it tries chromedp's default list of executable names, so a standard Chrome or Chromium installation should be found automatically.
page-fetch takes a list of URLs as its input on
stdin. You can provide the input list using IO redirection:
▶ page-fetch < urls.txt
Or using the output of another command:
▶ grep admin urls.txt | page-fetch
By default, responses are stored in a directory called 'out', which is created if it does not exist:
▶ echo https://detectify.com | page-fetch
GET https://detectify.com/ 200 text/html; charset=utf-8
GET https://detectify.com/site/themes/detectify/css/detectify.css?v=1621498751 200 text/css
GET https://detectify.com/site/themes/detectify/img/detectify_logo_black.svg 200 image/svg+xml
GET https://fonts.googleapis.com/css?family=Merriweather:300i 200 text/css; charset=utf-8
...
▶ tree out
out
├── detectify.com
│   ├── index
│   ├── index.meta
│   └── site
│       └── themes
│           └── detectify
│               ├── css
│               │   ├── detectify.css
│               │   └── detectify.css.meta
...
The directory structure used in the output directory mirrors the directory structure used on the target websites. A ".meta" file is stored for each request; it contains the originally requested URL (including the query string), the request and response headers, etc.
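Because every saved response body has a sibling ".meta" file, pairing them up takes only a short shell loop. The sketch below fabricates a tiny output tree first so it is self-contained; a real `out` directory would come from an actual page-fetch run:

```shell
# Fabricate a minimal output tree shaped like page-fetch's, then list
# each saved response body next to its .meta file.
mkdir -p out/example.com
touch out/example.com/index out/example.com/index.meta

find out -name '*.meta' | while read -r meta; do
  body="${meta%.meta}"   # strip the .meta suffix to get the body path
  echo "$body <- $meta"
done
```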
You can get the page-fetch help output by running page-fetch with the -h flag:
▶ page-fetch -h
You can change how many headless Chrome processes are used with the -c / --concurrency option. The default value is 2.
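The effect of the concurrency level can be pictured with a plain xargs pipeline: -P caps how many jobs run at once, just as the concurrency option caps the number of browser workers. The echo here is only a stand-in for a real fetch:

```shell
# Run at most 2 "fetches" at a time; with a real workload each of the
# two slots would drive a headless Chrome process instead of echo.
printf '%s\n' https://a.example https://b.example https://c.example |
  xargs -P 2 -I{} echo "fetching {}"
```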
You can choose not to save responses that match particular content types with the -e / --exclude option.
Any response with a content-type that partially matches the provided value will not be stored; so you can,
for example, avoid storing image files by specifying:
▶ page-fetch --exclude image/
The option can be specified multiple times to exclude multiple different content-types.
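The "partial match" rule means the provided value is treated as a substring of the response's Content-Type header rather than an exact match. That semantics can be sketched in plain shell; should_skip is a hypothetical helper illustrating the rule, not part of page-fetch:

```shell
# Return success (0) if the content type contains any excluded pattern.
should_skip() {
  ct="$1"; shift
  for pattern in "$@"; do
    case "$ct" in *"$pattern"*) return 0 ;; esac
  done
  return 1
}

should_skip "image/png" "image/" "font/" && echo "image/png: skipped"
should_skip "text/html; charset=utf-8" "image/" "font/" || echo "text/html: kept"
```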
Rather than excluding specific content-types, you can opt to only save certain content-types with the -i / --include option:
▶ page-fetch --include text/html
The option can be specified multiple times to include multiple different content-types.
You can run arbitrary JavaScript on each page with the -j / --javascript option. This option can be used for a very wide variety of purposes. As an example, you could extract the href attribute from all links on a webpage:
▶ echo https://example.com | page-fetch --javascript '[...document.querySelectorAll("a")].map(n => n.href)'
By default, files are stored in a directory called out. This can be changed with the -o / --output option:
▶ echo https://example.com | page-fetch --output example
GET https://example.com/ 200 text/html; charset=utf-8
▶ find example/ -type f
example/example.com/index
example/example.com/index.meta
The directory is created if it does not already exist.
The --proxy option can be used to specify a proxy for all requests. For example, to use the Burp Suite proxy with its default settings, you could run:
▶ echo https://example.com | page-fetch --proxy http://localhost:8080
By default, when a file already exists, a new file is created with a numeric suffix: if index already exists, index.1 will be created, and so on. This behaviour can be changed with the -w / --overwrite option; when it is used, matching files are overwritten instead.
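The default collision handling can be sketched as a small naming function. next_name is hypothetical; it mirrors the behaviour described above, not page-fetch's actual code:

```shell
# Pick the first free name: index, then index.1, index.2, ...
next_name() {
  name="$1"
  if [ ! -e "$name" ]; then echo "$name"; return; fi
  i=1
  while [ -e "$name.$i" ]; do i=$((i + 1)); done
  echo "$name.$i"
}

cd "$(mktemp -d)"
touch index
next_name index    # prints "index.1" because "index" is taken
```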
You may sometimes wish to exclude responses from third-party domains. This can be done with the --no-third-party option. Any response to a request for a domain that does not match the input URL, or one of its subdomains, will not be saved.
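The domain test described above (the input host itself, or any of its subdomains, counts as first-party) can be sketched like this; is_first_party is a hypothetical helper illustrating the rule, not page-fetch's implementation:

```shell
# First-party: the request host equals the input host, or ends in ".<input host>".
is_first_party() {
  input_host="$1"; request_host="$2"
  [ "$request_host" = "$input_host" ] && return 0
  case "$request_host" in *".$input_host") return 0 ;; esac
  return 1
}

is_first_party example.com www.example.com && echo "www.example.com: saved"
is_first_party example.com cdn.thirdparty.net || echo "cdn.thirdparty.net: skipped"
```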
On rare occasions you may wish to store only responses from third-party domains. This can be done with the --third-party option.