Hadoop, MapReduce Distributed Crawling of Data Information from All Chinese Universities.
The widely used MapReduce distributed crawler setups still recommend Jsoup, but Jsoup cannot parse data loaded by JavaScript. This repository therefore uses FastJson to crawl data on all Chinese universities, built on the MapReduce distributed computing framework in the Hadoop ecosystem. My current development environment is Windows 10 with a virtualized Hadoop, so the project has not been tested on Linux or macOS, and the configured HDFS path is currently Linux-style. If you are interested, please submit issues or PRs.
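As a rough illustration of why a JSON parser replaces Jsoup here (the payload shape and class names below are hypothetical, not taken from this repository): JavaScript-rendered pages usually fetch their data from a JSON API, so the crawler can parse that response directly instead of scraping rendered HTML. In the real project FastJson's `JSON.parseObject`/`JSON.parseArray` would do the deserialization; this stdlib-only sketch stands in for it:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonNameExtractor {
    // Hypothetical payload shaped like a university-list API response.
    // In the actual project, FastJson would deserialize this into typed objects.
    static final String SAMPLE =
            "{\"data\":[{\"name\":\"Peking University\"},{\"name\":\"Tsinghua University\"}]}";

    // Extract every "name" field with a regex: a minimal stand-in for a JSON parser.
    static List<String> extractNames(String json) {
        List<String> names = new ArrayList<>();
        Matcher m = Pattern.compile("\"name\"\\s*:\\s*\"([^\"]+)\"").matcher(json);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints: [Peking University, Tsinghua University]
        System.out.println(extractNames(SAMPLE));
    }
}
```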
This repository contains:
This project uses Java and Git. Go check them out if you don't have them locally installed.
git clone https://github.com/weiensong/ScrapySchoolAll.git
mvn package
# in Master
hadoop jar PackageName.jar
cd /d "%~dp0"
copy hadoop.dll C:\Windows\System32
cd src\main\java\job
javac MyJob.java
java MyJob
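The local-run steps above compile and launch the MapReduce job. As a simplified, Hadoop-free sketch of the job's shape (the record format and names here are hypothetical, not the actual `MyJob` implementation), the map phase extracts a key per crawled school and the reduce phase aggregates by that key:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SchoolCountSketch {
    // Hypothetical crawled records in "schoolName,province" form.
    static final List<String> RECORDS = List.of(
            "Peking University,Beijing",
            "Tsinghua University,Beijing",
            "Fudan University,Shanghai");

    // Map: emit the province as the key; shuffle + reduce: count records per key.
    // A real Hadoop job would do this with Mapper/Reducer classes over HDFS splits.
    static Map<String, Long> countByProvince(List<String> records) {
        return records.stream()
                .map(r -> r.split(",")[1])            // map phase: extract key
                .collect(Collectors.groupingBy(       // shuffle + reduce phase
                        k -> k, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(countByProvince(RECORDS));
    }
}
```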
Feel free to dive in! Open an issue or submit PRs.
This project uses standard Java style and follows the Apache Code of Conduct.
This project exists thanks to all the people who contribute.
Apache License 2.0 © weiensong