Crawley - the unix-way web crawler
Crawls web pages and prints any link it can find.
features:

- css scan - extracts urls from css url() properties (see the -css flag)
- robots.txt support, controlled by the -robots policy flag (ignore / crawl / respect)
- brute mode - scan html comments for urls (this can lead to bogus results)
- makes use of HTTP_PROXY / HTTPS_PROXY environment values + handles proxy auth (use HTTP_PROXY="socks5://127.0.0.1:1080/" crawley for socks5)
- directory-only scan mode (aka fast-scan, see the -dirs flag)
- user-defined cookies, in curl-compatible format, accepting files with the '@'-prefix: -cookie "ONE=1; TWO=2" -cookie "ITS=ME" -cookie @cookie-file (a combined example follows this list)
- user-defined headers, same as curl: -header "ONE: 1" -header "TWO: 2" -header @headers-file
- tag filter - limits crawling to the given tags (single: -tag a -tag form, multiple: -tag a,form, or mixed)
- url ignore - skips urls that match the given patterns (i.e.: -ignore logout)
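Several of these options can be combined in a single run; a minimal sketch, assuming a saved cookie file and the placeholder site used in the examples below:

# reuse saved cookies, send an extra header, crawl only <a> and <form> tags, skip logout urls:
crawley -cookie @cookie-file -header "ONE: 1" -tag a,form -ignore logout http://some-test.site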
examples:

# print all links from first page:
crawley http://some-test.site
# print all js files and api endpoints:
crawley -depth -1 -tag script -js http://some-test.site
# print all endpoints from js:
crawley -js http://some-test.site/app.js
# download all png images from site:
crawley -depth -1 -tag img http://some-test.site | grep '\.png$' | wget -i -
# fast directory traversal:
crawley -headless -delay 0 -depth -1 -dirs only http://some-test.site
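The HTTP_PROXY / HTTPS_PROXY support from the feature list applies to any of the examples above; a sketch using the placeholder socks5 address:

# crawl the whole site through a local socks5 proxy:
HTTP_PROXY="socks5://127.0.0.1:1080/" crawley -depth -1 http://some-test.site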
installation (Arch Linux, from the AUR):

paru -S crawley-bin
usage:

crawley [flags] url
possible flags with default values:
-all
scan all known sources (js/css/...)
-brute
scan html comments
-cookie value
extra cookies for request, can be used multiple times, accept files with '@'-prefix
-css
scan css for urls
-delay duration
per-request delay (0 - disable) (default 150ms)
-depth int
scan depth (set -1 for unlimited)
-dirs string
policy for non-resource urls: show / hide / only (default "show")
-header value
extra headers for request, can be used multiple times, accept files with '@'-prefix
-headless
disable pre-flight HEAD requests
-ignore value
patterns (in urls) to be ignored in crawl process
-js
scan js code for endpoints
-proxy-auth string
credentials for proxy: user:password
-robots string
policy for robots.txt: ignore / crawl / respect (default "ignore")
-silent
suppress info and error messages in stderr
-skip-ssl
skip ssl verification
-tag value
tags filter, single or comma-separated tag names
-timeout duration
request timeout (min: 1 second, max: 10 minutes) (default 5s)
-user-agent string
user-agent string
-version
show version
-workers int
number of workers (default - number of CPU cores)
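Flags can be freely combined; a sketch of a polite, site-wide crawl (the target url is a placeholder):

# unlimited depth, respect robots.txt, 1s per-request delay, 4 workers, no stderr noise:
crawley -depth -1 -robots respect -delay 1s -workers 4 -silent http://some-test.site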
Crawley can handle flags autocompletion in bash and zsh via complete:
complete -C "/full-path-to/bin/crawley" crawley
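To keep the completion across shell sessions, the same line can be added to the shell startup files (the paths below are the usual defaults, adjust for your setup; in zsh the complete builtin becomes available after bashcompinit is loaded):

# bash - append to ~/.bashrc:
echo 'complete -C "/full-path-to/bin/crawley" crawley' >> ~/.bashrc

# zsh - add to ~/.zshrc:
autoload -U +X bashcompinit && bashcompinit
complete -C "/full-path-to/bin/crawley" crawley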