Ctenopharyngodon Idella

Distributed crawling of data on all Chinese universities with Hadoop MapReduce.

Project README

ctenopharyngodon-idella

Repository Introduction

Jsoup is still the commonly recommended parser for MapReduce distributed crawlers, but it cannot parse data that is loaded by JavaScript. This repository therefore uses Fastjson to crawl data on all Chinese universities, running the crawler as a MapReduce job in the Hadoop ecosystem. My current development environment is Windows 10 with a simulated Hadoop setup, so the project has not been tested on Linux or macOS. For now, the working assumption is that on Linux the path is an HDFS path. If you are interested, please submit issues or PRs.
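The map and reduce roles described above can be sketched in plain Java, without Hadoop or Fastjson on the classpath. This is only an illustration under assumptions: `fetchPage`, the page numbers, and the province and school names are hypothetical stand-ins for an already-parsed JSON response, not this repository's code or the real API.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class CrawlFlowSketch {

    // Hypothetical stand-in for one page of a parsed JSON response.
    // Each entry is (province, school name). No network access happens here.
    static List<Map.Entry<String, String>> fetchPage(int page) {
        if (page == 1) {
            return List.of(
                Map.entry("Beijing", "School A"),
                Map.entry("Shanghai", "School B"));
        }
        if (page == 2) {
            return List.of(Map.entry("Beijing", "School C"));
        }
        return List.of(); // pages past the end yield nothing
    }

    // "Map" step: every page number emits (province, school) pairs.
    // "Reduce" step: pairs are grouped by province and the schools collected.
    static Map<String, List<String>> run(List<Integer> pages) {
        return pages.stream()
            .flatMap(page -> fetchPage(page).stream())
            .collect(Collectors.groupingBy(
                Map.Entry::getKey,
                TreeMap::new,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(1, 2, 3)));
    }
}
```

In a real job, the body of `fetchPage` would be an HTTP request plus JSON parsing inside a Hadoop Mapper, and the grouping would be done by the framework's shuffle and a Reducer rather than by a stream collector.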

This repository contains:

  1. A simulated distributed environment on Windows
  2. A crawler for 掌上高考 (Zhangshang Gaokao)
  3. Data storage
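As a rough illustration of the storage step, the sketch below formats crawled records as tab-separated key/value lines, which is the layout Hadoop's `TextOutputFormat` writes by default. The class name `StorageSketch`, the record shape (an id mapped to a school name), and the sample values are assumptions made for this example; the repository's actual storage format may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class StorageSketch {

    // Turn (id, name) records into "key<TAB>value" lines.
    // Tabs and newlines inside a value are replaced with spaces so that
    // each record stays on one line with exactly one separator.
    static List<String> toLines(Map<String, String> records) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> e : records.entrySet()) {
            String value = e.getValue().replace('\t', ' ').replace('\n', ' ');
            lines.add(e.getKey() + "\t" + value);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical record, for demonstration only.
        toLines(Map.of("1001", "School A")).forEach(System.out::println);
    }
}
```

Keeping one record per line makes the output easy to feed into a later MapReduce job or to load into a database.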

Install

This project uses Java and Git; building the jar also requires Maven. Go check them out if you don't have them installed locally.

git clone https://github.com/weiensong/ScrapySchoolAll.git

Usage

  • A truly distributed environment
        mvn package

        # on the master node
        hadoop jar PackageName.jar
  • Distributed environment simulated on Windows
    • Run initTest.bat directly as administrator
    •   cd /d "%~dp0"
        copy hadoop.dll C:\Windows\System32
        cd src\main\java\job
        javac MyJob.java
        java MyJob
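One way a single job class could serve both environments above is to choose its output location from the operating system: a local path on Windows and an HDFS path elsewhere. The sketch below is hedged accordingly. The class name, both path constants, and the host `master:9000` are hypothetical placeholders, not values taken from this repository.

```java
import java.util.Locale;

class OutputPathSketch {

    // Hypothetical locations; replace with the paths of your own setup.
    static final String LOCAL_OUTPUT = "file:///C:/tmp/school-output";
    static final String HDFS_OUTPUT = "hdfs://master:9000/school-output";

    // Pick the output path from an OS name such as the value of the
    // "os.name" system property ("Windows 10", "Linux", "Mac OS X", ...).
    static String outputPathFor(String osName) {
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        return windows ? LOCAL_OUTPUT : HDFS_OUTPUT;
    }

    public static void main(String[] args) {
        System.out.println(outputPathFor(System.getProperty("os.name")));
    }
}
```

Passing the OS name in as a parameter, instead of reading the system property inside the method, keeps the choice easy to check on any machine.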
      
Related Efforts

  • hadoop - Apache Hadoop
  • opsariichthys-bidens - Basic information API for Chinese national universities

Maintainers

@weiensong

Contributing

Feel free to dive in! Open an issue or submit PRs.

Standard Java follows the Google and Apache Code of Conduct.

Contributors

This project exists thanks to all the people who contribute.

License

Apache License 2.0 © weiensong

Open Source Agenda is not affiliated with "Ctenopharyngodon Idella" Project. README Source: touero/ctenopharyngodon-idella