🕸️ Crawl in the web network

WebPalm

What is webpalm?

WebPalm is a command-line tool that traverses a website and generates a tree of its webpages and the links they contain. It recursively follows each link found on a page until the requested number of levels (the -l flag under Usage) has been explored. In addition to generating a site map, WebPalm can extract data from the body of each page using regular expressions and save the results to a file, which is useful for web scraping or for pulling out specific information.
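
The recursive, level-limited traversal described above can be illustrated with a minimal Go sketch. This is not WebPalm's actual implementation: the names Node, Crawl, and Render are hypothetical, and an in-memory map stands in for fetching a page and extracting its links so the example runs without network access.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical tree node: one page and the pages it links to.
type Node struct {
	URL      string
	Children []*Node
}

// Crawl builds a link tree rooted at url, descending at most maxLevel levels.
// links is a stand-in for "fetch the page and extract its links".
func Crawl(url string, maxLevel int, links map[string][]string, seen map[string]bool) *Node {
	node := &Node{URL: url}
	seen[url] = true
	if maxLevel == 0 {
		return node // depth limit reached: do not follow further links
	}
	for _, next := range links[url] {
		if seen[next] {
			continue // skip pages already in the tree to avoid cycles
		}
		node.Children = append(node.Children, Crawl(next, maxLevel-1, links, seen))
	}
	return node
}

// Render returns the tree as indented text, one URL per line.
func Render(n *Node, depth int) string {
	var b strings.Builder
	b.WriteString(strings.Repeat("  ", depth) + n.URL + "\n")
	for _, c := range n.Children {
		b.WriteString(Render(c, depth+1))
	}
	return b.String()
}

func main() {
	links := map[string][]string{
		"https://example.com":   {"https://example.com/a", "https://example.com/b"},
		"https://example.com/a": {"https://example.com", "https://example.com/c"},
	}
	tree := Crawl("https://example.com", 2, links, map[string]bool{})
	fmt.Print(Render(tree, 0))
}
```

The seen map is what keeps a page that links back to its parent from producing an infinite tree; whether WebPalm deduplicates pages the same way is an assumption of this sketch.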

⚠️ DISCLAIMER ⚠️:

This tool is intended for legal purposes only, and you are responsible for your own actions.

Features

  • Generate a palm-tree structure of web URLs
  • Dump data from page bodies using regular expressions
  • Multi-threading and parallelism
  • Export the web tree to JSON, XML, or TXT
  • Fast and easy to use
  • Colorized output and error handling
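
To give a feel for what exporting a web tree to JSON or XML involves, here is a small Go sketch using the standard encoding/json and encoding/xml packages. The Page type and its field names are hypothetical; the schema WebPalm actually writes may look different.

```go
package main

import (
	"encoding/json"
	"encoding/xml"
	"fmt"
)

// Page is a hypothetical node shape; WebPalm's real export schema may differ.
type Page struct {
	XMLName  xml.Name `json:"-" xml:"page"`
	URL      string   `json:"url" xml:"url,attr"`
	Status   int      `json:"status" xml:"status,attr"`
	Children []*Page  `json:"children,omitempty" xml:"page"`
}

// ToJSON serializes the tree as a nested JSON object.
func ToJSON(p *Page) (string, error) {
	b, err := json.Marshal(p)
	return string(b), err
}

// ToXML serializes the tree as nested <page> elements.
func ToXML(p *Page) (string, error) {
	b, err := xml.Marshal(p)
	return string(b), err
}

func main() {
	root := &Page{URL: "https://example.com", Status: 200,
		Children: []*Page{{URL: "https://example.com/a", Status: 404}}}

	js, err := ToJSON(root)
	if err != nil {
		panic(err)
	}
	fmt.Println(js)

	x, err := ToXML(root)
	if err != nil {
		panic(err)
	}
	fmt.Println(x)
}
```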

Installation

From source

git clone https://github.com/Malwarize/webpalm.git
cd webpalm
go build -o webpalm && ./webpalm

From binary

You can download the binary from Releases:

wget https://github.com/Malwarize/webpalm/releases/download/v0.0.1/webpalm_x.x.x_os_arch.tar.gz
tar -xvf webpalm_x.x.x_os_arch.tar.gz
cd webpalm
./webpalm

If you have Go installed:

go install github.com/Malwarize/webpalm/v2@latest

Usage

webpalm -h
Flags:
  -d, --delay int                delay (ms) between each request / ex: -d 200
  -x, --exclude-code ints        status codes to exclude / ex : -x 404,500
  -h, --help                     help for webpalm
  -i, --include strings          include only domains / ex : -i google.com,facebook.com
  -l, --level int                level of palming / ex: -l2
  -o, --output string            file to export the result (f.json, f.xml, f.txt) / ex: -o result.json
  -p, --proxy string             proxy to use / ex: -p http://proxy.com:8080
      --regexes stringToString   regexes to match in each page / ex: --regexes comments="\<\!--.*?-->" (default [])
  -t, --timeout int              timeout in seconds / ex: -t 10 (default 10)
  -u, --url string               target url / ex: -u https://google.com
  -a, --user-agent string        user agent to use / ex: -a chrome, firefox, safari, ie, edge, opera, android, ios, custom
  -v, --version                  version for webpalm
  -w, --worker int               number of workers for multi-threading  / ex: -w 10

Examples

Get the palm tree of a website:

webpalm -u https://google.com -l1
# or
webpalm -u https://google.com -l1 -w 3 # 3 workers (multi-threading)

Get the palm tree of a website and exclude some status codes:

webpalm -u https://google.com -l1 -x 404,500 

Get the palm tree of a website and dump data from the body of each page:

webpalm -u https://google.com -l1 --regexes comments="\<\!--.*?-->" -o result.json

This dumps the HTML comments found in the body of each page.

webpalm -u https://google.com -l1 --regexes comments="\<\!--.*?-->",emails="([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)"

This dumps both the comments and the email addresses found in the body of each page.
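
As a rough illustration of what matching named regexes against a page body means, the following Go sketch applies a map of name-to-pattern pairs to a string using the standard regexp package. The Extract function is hypothetical and is not taken from WebPalm's source. The emails pattern here escapes the dot before the top-level domain so that it matches a literal dot only.

```go
package main

import (
	"fmt"
	"regexp"
)

// Extract applies each named pattern to body and returns all matches per name.
// A pattern that fails to compile is reported as an error.
func Extract(body string, patterns map[string]string) (map[string][]string, error) {
	out := make(map[string][]string)
	for name, pat := range patterns {
		re, err := regexp.Compile(pat)
		if err != nil {
			return nil, fmt.Errorf("pattern %q: %w", name, err)
		}
		out[name] = re.FindAllString(body, -1)
	}
	return out, nil
}

func main() {
	body := `<p>Contact: admin@example.com</p><!-- TODO: remove debug -->`
	found, err := Extract(body, map[string]string{
		"comments": `\<\!--.*?-->`,
		"emails":   `[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+`,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(found["comments"])
	fmt.Println(found["emails"])
}
```

Trying a pattern this way before passing it to --regexes is a quick check that it compiles and matches what you expect.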

Get the palm tree of a website and export it to XML or TXT:

webpalm -u https://google.com -l3 -o result.xml
webpalm -u https://google.com -l2 -o result.txt

Get the palm tree of a website and include only some domains:

webpalm -u https://google.com -l2 -i google.com,facebook.com

This crawls only the URLs that contain google.com or facebook.com.
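
The "contains" behaviour described above can be sketched in Go as a plain substring check. The Included function is hypothetical, and treating an empty list as "allow everything" is an assumption of this sketch, not documented WebPalm behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// Included reports whether rawURL contains any of the allowed domains.
// An empty allow-list admits every URL (an assumption of this sketch).
func Included(rawURL string, domains []string) bool {
	if len(domains) == 0 {
		return true
	}
	for _, d := range domains {
		if strings.Contains(rawURL, d) {
			return true
		}
	}
	return false
}

func main() {
	allow := []string{"google.com", "facebook.com"}
	fmt.Println(Included("https://mail.google.com/inbox", allow))
	fmt.Println(Included("https://example.org/about", allow))
}
```

Note that a substring check also admits URLs that merely mention the domain in a path or query string, so a stricter filter would compare against the parsed host instead.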

Threading and concurrency

Get the palm tree of a website using 100 workers:

webpalm -u https://google.com -l2 -w 100
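
For readers unfamiliar with the worker model behind -w, here is a generic Go worker-pool sketch: a fixed number of goroutines pull URLs from a channel and record a result for each. FetchAll and its fetch parameter are hypothetical placeholders, and WebPalm's internal scheduling may be organized differently.

```go
package main

import (
	"fmt"
	"sync"
)

// FetchAll processes urls with a fixed number of worker goroutines and
// returns one result per URL. fetch is a placeholder for a real HTTP request.
func FetchAll(urls []string, workers int, fetch func(string) int) map[string]int {
	if workers < 1 {
		workers = 1 // always keep at least one worker so sends cannot block forever
	}
	jobs := make(chan string)
	results := make(map[string]int, len(urls))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				code := fetch(u)
				mu.Lock() // the map is shared, so guard each write
				results[u] = code
				mu.Unlock()
			}
		}()
	}
	for _, u := range urls {
		jobs <- u
	}
	close(jobs) // lets the workers' range loops finish
	wg.Wait()
	return results
}

func main() {
	urls := []string{"https://example.com/a", "https://example.com/b", "https://example.com/c"}
	// The fake fetch returns 200 for every URL instead of doing network I/O.
	got := FetchAll(urls, 2, func(string) int { return 200 })
	fmt.Println(len(got), "pages fetched")
}
```

More workers means more requests in flight at once, so a large value such as 100 is best combined with a delay or reserved for targets you are allowed to load heavily.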

Regexes Examples

Regex      Pattern
emails     ([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)
comments   \<\!--.*?-->
tokens     [a-zA-Z0-9]{32}
password   \bpassword\b.{0,10}

Don't forget to escape the regexes if needed.

Tests

Run the unit tests with go test -v ./... to gain more confidence in enhancements or changes to the code.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. You can also contact me on Discord: xorbit.

Powered By Malwarize

Open Source Agenda is not affiliated with "Webpalm" Project. README Source: Malwarize/webpalm