CommonCrawlParser

Simple multi-threaded tool to extract domain-related data from commoncrawl.org


Usage

ccp.py [-h] -d domain -o path [-t THREADS] [-f index1] [-f index2]

required arguments:
  -d, --domain   The domain you want to search for in the CC data.
  -o, --outfile  The path and filename the results will be saved to.

optional arguments:
  -h, --help     Show this help message and exit
  -f, --filter   Only use indices whose names contain this string (can be given multiple times)
  -t, --threads  Number of threads to use for requests

Examples

Search for github.com and save to /home/folder/cc/data.txt

python3 ccp.py -d github.com -o /home/folder/cc/data.txt

Search for github.com in indices which contain "CC-MAIN-2017-09", save to data.txt

python3 ccp.py -d github.com -o ./data.txt -f CC-MAIN-2017-09

Search for github.com in indices which contain "2013" and "2014", save to data.txt

python3 ccp.py -d github.com -o ./data.txt -f 2014 -f 2013

Search for github.com using 10 threads, save to data.txt

python3 ccp.py -d github.com -o ./data.txt -t 10
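
To give a sense of what such a search involves under the hood: the dependency on requests and threads suggests the tool queries the public Common Crawl index API at index.commoncrawl.org, one index per request. The sketch below is not ccp.py's actual code, only a minimal illustration of that API under the assumption that this is roughly how the tool works; the collinfo.json index list and the per-index CDX endpoints are public, while the function names and the wildcard URL pattern are choices made for this example.

# Minimal sketch of a multi-threaded Common Crawl index lookup.
# Not the project's actual code; function names are made up.
import json
from concurrent.futures import ThreadPoolExecutor

import requests

def list_indices(filters=()):
    # collinfo.json lists every index (e.g. CC-MAIN-2017-09) with its CDX API URL
    indices = requests.get("https://index.commoncrawl.org/collinfo.json").json()
    apis = [idx["cdx-api"] for idx in indices]
    if filters:
        apis = [api for api in apis if any(f in api for f in filters)]
    return apis

def query_index(api, domain):
    # One CDX query per index; the response is one JSON record per line
    resp = requests.get(api, params={"url": "*.%s/*" % domain, "output": "json"})
    if resp.status_code != 200:  # an index with no matches typically answers 404
        return []
    return [json.loads(line)["url"] for line in resp.text.splitlines() if line]

def search(domain, threads=4, filters=()):
    apis = list_indices(filters)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(lambda api: query_index(api, domain), apis)
    return sorted({url for chunk in results for url in chunk})

if __name__ == "__main__":
    for url in search("github.com", threads=10, filters=("2017",)):
        print(url)

A real run needs more than this: busy domains return many pages of results per index (the CDX API paginates via a page parameter), and failed requests should be retried, so treat the sketch as the shape of the requests rather than a drop-in replacement for ccp.py.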

grep tips

I am no grep expert, but I know how to extract data. If you have better solutions for my existing commands, or additional ideas about what to search for: open a PR.

  1. Find entries which end with popular file extensions indicating dynamic pages etc.:
grep -i -E '\.(php|asp|dev|jsp|wsdl|xml|cgi|json|html)$' /home/folder/cc/data.txt
  2. Find interesting files like backups, archives, log files...:
grep -i -E '\.(zip|rar|tar|bkp|sql|bz2|gz|txt|bak|conf|log|error|debug|yml|lock|template|tpl)$' /home/folder/cc/data.txt
  3. Find entries which contain popular strings like "admin" etc.:
grep -i -E '(admin|account|debug|control|config|upload|system|secret|environment|dashboard)' /home/folder/cc/data.txt
  4. Find files which begin with "." (htaccess, ...):
grep -i -E '/\.' /home/folder/cc/data.txt
  5. Find obvious backup files:
grep -i -E '(\.bkp|\.bak|backup|\.dump|\.sql)' /home/folder/cc/data.txt
  6. Extract subdomains (see the Python sketch after this list for an alternative):
sed -e 's|^[^/]*//||' -e 's|^www\.||' -e 's|/.*$||' /home/folder/cc/data.txt | grep -v ":" | grep -v "@" | grep -v "?" | grep -v "/" | sort -u
  7. Find URLs with parameters in them:
grep -i -E '[?&][^=&]+=' /home/folder/cc/data.txt | sort -u
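
If you prefer to post-process data.txt in Python rather than chain sed and grep, the snippet below is a rough equivalent of tips 6 and 7 (subdomain extraction and URLs with query parameters). It only assumes that data.txt holds one full URL per line; the function names are made up for this example.

# Rough Python equivalent of tips 6 and 7 above; assumes one URL per line.
from urllib.parse import urlsplit

def subdomains(path="/home/folder/cc/data.txt"):
    hosts = set()
    with open(path) as fh:
        for line in fh:
            host = urlsplit(line.strip()).hostname or ""
            if host.startswith("www."):
                host = host[4:]
            if host:
                hosts.add(host)
    return sorted(hosts)

def urls_with_parameters(path="/home/folder/cc/data.txt"):
    with open(path) as fh:
        return sorted({line.strip() for line in fh if urlsplit(line.strip()).query})

if __name__ == "__main__":
    print("\n".join(subdomains()))

urlsplit also drops ports and credentials from the hostname, which is what the chain of grep -v filters in tip 6 approximates.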

Dependencies

  • python3
  • requests
  • argparse (Python standard library)
  • json (Python standard library)
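
Of these, only requests has to be installed separately; assuming pip is available, something like:

python3 -m pip install requests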

Information

This project was initially forked from cc.py, but since I refactored it completely and si9int's version took another path, I decided to create a stand-alone project.

License

MIT
