Wikihadoop

Stream-based InputFormat for processing the compressed XML dumps of Wikipedia with Hadoop


==========
WikiHadoop
==========


:Homepage: http://github.com/whym/wikihadoop
:Date: 2012-04-16
:Version: 0.2

Overview

Wikipedia XML dumps with complete edit histories [#]_ have been difficult to process because of their exceptional size and structure. While a "page" is a common processing unit, a single Wikipedia page may contain gigabytes of text when its edit history is very long.

This software provides an InputFormat for the `Hadoop Streaming`_ interface that processes Wikipedia bzip2 XML dumps in a streaming manner. Using this InputFormat, the content of every page is fed to a mapper through standard input without using too much memory, and the mapper writes its results to standard output. Thanks to Hadoop Streaming, mappers can be implemented in any language.

See the `wiki page`__ for a more detailed introduction and tutorial.

__ https://github.com/whym/wikihadoop/wiki

.. _Hadoop Streaming: http://hadoop.apache.org/common/docs/current/streaming.html
.. _Apache Hadoop: http://hadoop.apache.org
.. _Apache Maven: http://maven.apache.org
.. _WikiHadoop: http://github.com/whym/wikihadoop

.. [#] For example, each of the dump files provided at http://dumps.wikimedia.org/enwiki/20110803/ (pages-meta-history1.xml.bz2, pages-meta-history6.xml.bz2, etc.) is more than 30 gigabytes in compressed form, and more than 700 gigabytes when decompressed.

How to use

Essentially, WikiHadoop is an input format for Hadoop Streaming. Once you have StreamWikiDumpInputFormat in the class path, you can pass it to the -inputformat option.

To get the input format class working with Hadoop Streaming, follow these steps:

  1. Install `Apache Hadoop`_. Versions 0.21, 0.22 and 0.23 are the ones we tested.

    • See also Requirements_.
  2. Obtain our jar file from the `download page`_. Alternatively, you can build the class and/or the jar yourself (see `How to build`_). We will call the jar file wikihadoop.jar in this document.

  3. Find the jar file of Hadoop Streaming in your copy of Hadoop. It is probably found at mapred/contrib/streaming/hadoop-*-streaming.jar. We will call it hadoop-streaming.jar in this document.

  4. Run a `Hadoop Streaming`_ command with the jar file and our input format specified.

    • A command will look like this: ::

        hadoop jar hadoop-streaming.jar -libjars wikihadoop.jar -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat -mapper /bin/cat

      See `Sample command line usage`_, `Configuration variables`_ and the official documentation of `Hadoop Streaming`_ for more details.

    We recommend using our differ_ as the mapper when creating text diffs between consecutive revisions. The differ, revision_differ.py, is included in the tarball under diffs, or can be downloaded from the MediaWiki SVN repository with svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/tools/wsor/diffs. See the `Differ's readme file`_ for more details and other requirements.

    Note: mappers need to be distributed to the computing nodes under the same path. To do so, you can use the -file option of Hadoop Streaming or copy the necessary files manually.
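As an alternative to /bin/cat, a mapper can be a small script in any language. The following is a minimal sketch in Python, not part of WikiHadoop. It assumes that each record arrives on standard input as a page-like element delimited by ``<page>`` and ``</page>``, that the usual dump element names (title, id, revision) are used, and that a revision's own id is the first child of its ``<revision>`` element. All function names here are hypothetical.

```python
import re
import sys

TITLE_RE = re.compile(r"<title>(.*?)</title>")
# Assumption: a revision's own <id> is the first child of <revision>.
REVISION_ID_RE = re.compile(r"<revision[^>]*>\s*<id>(\d+)</id>")


def iter_elements(lines):
    """Yield the text of each <page>...</page> block from an iterable of lines."""
    buffer = None
    for line in lines:
        if buffer is None and "<page>" in line:
            buffer = []
        if buffer is not None:
            buffer.append(line)
            if "</page>" in line:
                yield "".join(buffer)
                buffer = None


def summarize(element):
    """Return 'title<TAB>id of the last revision' for one page-like element."""
    title = TITLE_RE.search(element)
    revision_ids = REVISION_ID_RE.findall(element)
    return "%s\t%s" % (
        title.group(1) if title else "",
        revision_ids[-1] if revision_ids else "",
    )


def map_records(lines):
    """Turn the mapper's input lines into one output line per element."""
    for element in iter_elements(lines):
        yield summarize(element)


def main():
    # Hadoop Streaming feeds records on stdin and collects whatever is
    # printed to stdout.
    for record in map_records(sys.stdin):
        print(record)
```

A script that ends by calling ``main()`` could then be passed to -mapper in place of /bin/cat and shipped to the nodes with the -file option. Only one page-like element is buffered at a time, so memory use stays bounded by the size of the revisions it contains.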

.. _Differ's readme file: http://svn.wikimedia.org/svnroot/mediawiki/trunk/tools/wsor/diffs/README.txt
.. _StreamWikiDumpInputFormat: https://github.com/whym/wikihadoop/blob/master/mapreduce/src/contrib/streaming/src/java/org/wikimedia/wikihadoop/StreamWikiDumpInputFormat.java
.. _download page: http://whym.github.io/wikihadoop/

How to build

  1. Download WikiHadoop_ and extract the source tree.

    We provide both our git repository and a tarball package.

    • Use git clone https://github.com/whym/wikihadoop.git to get the latest source.
  2. Run Maven to build a jar file. ::

    mvn package

    • By default it compiles against the Hadoop 0.22 code base. We have found that the resulting jar file is compatible with Hadoop 0.21, 0.23, 2.0 and CDH4. If it turns out to be incompatible, you can also try building it with a customized pom file by running a command like mvn -f pom-hadoop-0.21.xml package or mvn -f pom-hadoop-0.23.xml package, or by changing the dependencies manually.
  3. Find the resulting jar file at target/wikihadoop-*.jar.

Input & Output format

Input can be Wikipedia XML dumps, either compressed with bzip2 (this is what you get directly from the distribution site) or uncompressed.

The record reader embedded in this input format converts a page into a sequence of page-like elements, each of which contains two consecutive revisions. Output is given as key-value records where the key is a page-like element and the value is always empty. For example, given the following input containing two pages and four revisions, ::

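  <page>
    <title>ABC</title>
    <id>123</id>
    <revision>
      <id>100</id>
      ....
    </revision>
    <revision>
      <id>200</id>
      ....
    </revision>
    <revision>
      <id>300</id>
      ....
    </revision>
  </page>
  <page>
    <title>DEF</title>
    <id>456</id>
    <revision>
      <id>400</id>
      ....
    </revision>
  </page>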

it will produce four keys formatted in page-like elements as follows ::

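  <page>
    <title>ABC</title>
    <id>123</id>
    <revision>
      <id>100</id>
      ....
    </revision>
  </page>

::

  <page>
    <title>ABC</title>
    <id>123</id>
    <revision>
      <id>100</id>
      ....
    </revision>
    <revision>
      <id>200</id>
      ....
    </revision>
  </page>

::

  <page>
    <title>ABC</title>
    <id>123</id>
    <revision>
      <id>200</id>
      ....
    </revision>
    <revision>
      <id>300</id>
      ....
    </revision>
  </page>

::

  <page>
    <title>DEF</title>
    <id>456</id>
    <revision>
      <id>400</id>
      ....
    </revision>
  </page>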

This result provides a mapper with all the information about the revision, including the title and page ID. We recommend using our differ_ to get diffs.
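The pairing of consecutive revisions can be modelled with a few lines of Python. This is only a sketch of the idea, not the record reader's actual Java code. The function name ``page_like_elements`` and the dictionary layout are hypothetical, and the sketch does not model how the real output represents the missing predecessor of a page's first revision.

```python
def page_like_elements(title, page_id, revision_ids, previous_revision=True):
    """Return one page-like element (as a dict) per revision of a page.

    With previous_revision=True, each element carries the preceding
    revision, if there is one, followed by the current revision.
    With previous_revision=False, each element carries one revision only.
    """
    elements = []
    previous = None
    for current in revision_ids:
        if previous_revision and previous is not None:
            revisions = (previous, current)
        else:
            revisions = (current,)
        elements.append({"title": title, "id": page_id, "revisions": revisions})
        previous = current
    return elements


# The two pages and four revisions from the example above.
pages = [("ABC", 123, [100, 200, 300]), ("DEF", 456, [400])]
keys = [
    element
    for title, page_id, revision_ids in pages
    for element in page_like_elements(title, page_id, revision_ids)
]
for key in keys:
    print(key["title"], key["id"], key["revisions"])
```

Applied to the example input, the sketch yields four elements, one per revision, in the same order as the four keys shown above.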

.. _differ: http://svn.wikimedia.org/svnroot/mediawiki/trunk/tools/wsor/diffs/

Requirements

The following software is required.

  • `Apache Hadoop`_

    • Versions 0.21, 0.22, 0.23, 2.0 and CDH4 are supported.
    • `Cloudera's`_ cdh3u1 is also supported on the `cdh3u1 branch`_ (thanks to François Kawla).
  • `Apache Maven`_

    • Version 2 or 3 (the default version we test against is 2.2.1).

See also `Supported Versions of Hadoop`_ for more information.

.. _Cloudera's: https://ccp.cloudera.com/display/SUPPORT/Downloads
.. _cdh3u1 branch: https://github.com/whym/wikihadoop/tree/cdh3u1
.. _Supported Versions of Hadoop: https://github.com/whym/wikihadoop/wiki/Supported-Versions-of-Hadoop

Sample command line usage

  • To process an English Wikipedia dump with the cat command: ::

    hadoop jar hadoop-streaming.jar \
      -libjars wikihadoop.jar \
      -D mapreduce.input.fileinputformat.split.minsize=300000000 \
      -D mapreduce.task.timeout=6000000 \
      -input /enwiki-20110722-pages-meta-history27.xml.bz2 \
      -output /usr/hadoop/out \
      -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat \
      -mapper /bin/cat
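When the job is launched from a script, the same command can be assembled as an argument list. The helper below is a hypothetical sketch (``streaming_command`` and ``to_shell`` are not part of WikiHadoop or Hadoop). It places the generic options (-libjars, -D) before the streaming options, as in the sample command above, and appends one -file option per file to distribute.

```python
import shlex

INPUT_FORMAT = "org.wikimedia.wikihadoop.StreamWikiDumpInputFormat"


def streaming_command(streaming_jar, wikihadoop_jar, input_path, output_path,
                      mapper, defines=None, files=()):
    """Build the argument list for a Hadoop Streaming job using WikiHadoop."""
    args = ["hadoop", "jar", streaming_jar, "-libjars", wikihadoop_jar]
    # Generic options such as -D go before the streaming options.
    for key, value in (defines or {}).items():
        args += ["-D", "%s=%s" % (key, value)]
    args += [
        "-input", input_path,
        "-output", output_path,
        "-inputformat", INPUT_FORMAT,
        "-mapper", mapper,
    ]
    # Files listed here are shipped to the computing nodes with the job.
    for path in files:
        args += ["-file", path]
    return args


def to_shell(args):
    """Render an argument list as a single shell-quoted command line."""
    return " ".join(shlex.quote(arg) for arg in args)


command = streaming_command(
    "hadoop-streaming.jar",
    "wikihadoop.jar",
    "/enwiki-20110722-pages-meta-history27.xml.bz2",
    "/usr/hadoop/out",
    "/bin/cat",
    defines={
        "mapreduce.input.fileinputformat.split.minsize": 300000000,
        "mapreduce.task.timeout": 6000000,
    },
)
print(to_shell(command))
```

The list can be handed to ``subprocess.run(command, check=True)`` on a machine where the hadoop executable is on the path.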

Configuration variables

The following parameters can be configured in the same way as the other parameters described in `Hadoop Streaming`_.

org.wikimedia.wikihadoop.excludePagesWith=REGEX
    Used to exclude pages whose headers match this regular expression. For example, to exclude all namespaces except the main article space, use -D org.wikimedia.wikihadoop.excludePagesWith="<title>(Media|Special|Talk|User|User talk|Wikipedia|Wikipedia talk|File|File talk|MediaWiki|MediaWiki talk|Template|Template talk|Help|Help talk|Category|Category talk|Portal|Portal talk|Book|Book talk):". When unspecified, WikiHadoop sends all pages to mappers.

    Ignoring pages irrelevant to the task is a good idea if you want to speed up the process.

org.wikimedia.wikihadoop.previousRevision=true or false
    When set to false, WikiHadoop writes only one revision in each page-like element, without attaching the previous revision. The default behaviour (true) is to write two consecutive revisions in one page-like element.

mapreduce.input.fileinputformat.split.minsize=BYTES
    This variable specifies the minimum size of a split sent to input readers.

    The default size tends to be too small. Try setting it to a larger value. The optimal value seems to be around (size of the input dump file) / (number of processors) / 5. For example, it will be 500000000 for English Wikipedia dumps when processing with 12 processors.

mapreduce.task.timeout=MSECS
    A timeout may happen when pages are too long. Try setting this to a value larger than 6000000. Before it starts parsing the data and reporting progress, WikiHadoop can take more than 6000 seconds to preprocess XML dumps.
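The values discussed above can be derived with a short script. The sketch below is illustrative only; the helper names are hypothetical. The 30-gigabyte dump size is a round figure taken from the footnote in the overview. Python's ``re`` module stands in for WikiHadoop's own matching, assuming the pattern only needs to be found somewhere in a page header.

```python
import re

NON_ARTICLE_NAMESPACES = [
    "Media", "Special", "Talk", "User", "User talk",
    "Wikipedia", "Wikipedia talk", "File", "File talk",
    "MediaWiki", "MediaWiki talk", "Template", "Template talk",
    "Help", "Help talk", "Category", "Category talk",
    "Portal", "Portal talk", "Book", "Book talk",
]


def exclude_pattern(namespaces):
    """Build an excludePagesWith pattern from plain namespace names.

    The names are assumed to contain no regular-expression metacharacters.
    """
    return "<title>(%s):" % "|".join(namespaces)


def split_minsize(dump_bytes, processors):
    """Rule of thumb: (size of the dump file) / (number of processors) / 5."""
    return dump_bytes // processors // 5


def defines(dump_bytes, processors, timeout_seconds):
    """Collect the -D settings for one job as a dict of name -> value."""
    return {
        "org.wikimedia.wikihadoop.excludePagesWith":
            exclude_pattern(NON_ARTICLE_NAMESPACES),
        "mapreduce.input.fileinputformat.split.minsize":
            split_minsize(dump_bytes, processors),
        # The timeout is given in milliseconds.
        "mapreduce.task.timeout": timeout_seconds * 1000,
    }


settings = defines(30 * 10**9, 12, 6000)
pattern = settings["org.wikimedia.wikihadoop.excludePagesWith"]
for header in ("<title>ABC</title>", "<title>User talk:Example</title>"):
    verdict = "excluded" if re.search(pattern, header) else "kept"
    print(header, verdict)
```

With a 30-gigabyte input and 12 processors the rule of thumb gives 500000000 bytes, in line with the figure quoted for English Wikipedia dumps, and 6000 seconds corresponds to a timeout of 6000000 milliseconds.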

Mechanism

Splitting

Input dump files are divided into splits with sizes close to the value of mapreduce.input.fileinputformat.split.minsize. When uncompressed input is used, each split ends exactly at the end of a page. When bzip2 (or other splittable compression) input is used, each split is adjusted so that every page is contained in at least one of the splits.

Parsing

WikiHadoop's parser can be seen as a SAX parser tuned for Wikipedia dump XML. By limiting its flexibility, it aims to achieve higher efficiency. Instead of extracting all occurrences of elements and attributes, it only looks for the beginnings and endings of page elements and revision elements.
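The idea of looking only for page and revision boundaries can be illustrated as follows. This is a sketch in Python under simplifying assumptions, not WikiHadoop's Java implementation: it scans an in-memory string with a regular expression, whereas the real parser works on a stream, and the name ``boundaries`` is hypothetical.

```python
import re

# Matches only the opening and closing tags of page and revision elements.
# The \b keeps longer names that merely start with "page" or "revision"
# from matching.
BOUNDARY_RE = re.compile(r"<(/?)(page|revision)\b[^>]*>")


def boundaries(text):
    """Yield (kind, name, offset) for each page or revision boundary.

    kind is "start" or "end", name is "page" or "revision", and offset
    is the position of the tag in the text. Everything else in the
    document is skipped without being interpreted.
    """
    for match in BOUNDARY_RE.finditer(text):
        kind = "end" if match.group(1) else "start"
        yield kind, match.group(2), match.start()


sample = "<page><title>A</title><revision><id>1</id></revision></page>"
for event in boundaries(sample):
    print(event)
```

Because the text of a revision is XML-escaped inside a dump, literal tags of this form are not expected to occur inside revision content, which is what makes such a narrow scan workable.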

Known problems

  • Hadoop map tasks with StreamWikiDumpInputFormat may take a long time to finish preprocessing before they start reporting progress.
  • Some revision pairs may be emitted twice when bzip2 input is used. (`Issue #1`_)

.. _Issue #1: https://github.com/whym/wikihadoop/issues/1
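One possible way to tolerate the duplicated revision pairs is to drop repeats in a reducer. The sketch below is an assumption-laden illustration, not a fix provided by the project. It assumes the mapper emits exactly one line per revision pair, that the whole line is a key identifying the pair (for example a hypothetical ``page_id:revision_id`` string), and that the reducer receives its lines sorted by key, so that repeats are adjacent.

```python
from itertools import groupby


def drop_repeated_keys(sorted_lines):
    """Yield each distinct key once from an iterable of key-sorted lines."""
    # Strip the trailing tab and newline that follow a key with an empty value.
    keys = (raw.rstrip("\t\n") for raw in sorted_lines)
    # groupby collapses runs of equal, adjacent keys into one group each.
    for key, _repeats in groupby(keys):
        yield key


reducer_input = ["123:100\t\n", "123:200\t\n", "123:200\t\n", "456:400\t\n"]
for key in drop_repeated_keys(reducer_input):
    print(key)
```

The approach only holds one key in memory at a time, but it relies on the sort order. If the identifying information is split between key and value, repeats are not guaranteed to be adjacent and a different strategy would be needed.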

.. Local variables:
.. mode: rst
.. End:

Open Source Agenda is not affiliated with "Wikihadoop" Project. README Source: whym/wikihadoop