Reading the data from OPIEC - an Open Information Extraction corpus
OPIEC is an Open Information Extraction (OIE) corpus consisting of more than 341M triples extracted from the entire English Wikipedia. Each triple in the corpus comes with rich metadata: each token of the subject, relation, and object along with its NLP annotations (POS tag, NER tag, ...), the provenance sentence along with its dependency parse, the original (golden) links from Wikipedia, sentence order, space/time information, etc. (for a more detailed explanation of the metadata, see the metadata description below).
There are two major corpora released with OPIEC: OPIEC itself, the corpus of OIE triples, and WikipediaNLP, the NLP annotation corpus for the English Wikipedia.
For more details concerning the construction, analysis and statistics of the corpus, read the AKBC paper "OPIEC: An Open Information Extraction Corpus". To download the data and get additional resources, please visit the project page. To use the code for the construction of the pipeline, please visit the GitHub repository OPIEC-pipeline.
The data is stored in Avro format. For details about the metadata, see the metadata description below. To read the data, you need the Avro schema files found in the directory avroschema; more specifically, TripleLinked.avsc and WikiArticleLinkedNLP.avsc for OPIEC and WikipediaNLP, respectively.
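Because an `.avsc` file is plain JSON, the fields it declares can be listed before opening any data file. The following is a minimal sketch using only the Python standard library; the helper name `list_schema_fields` is hypothetical and is not part of the OPIEC code:

```python
import json


def list_schema_fields(schema_path):
    """Return (name, type) pairs for the top-level fields of an Avro record schema."""
    with open(schema_path, "r", encoding="utf-8") as f:
        schema = json.load(f)
    # A record schema keeps its fields in a "fields" list; other schema
    # kinds (a bare type name, an enum, ...) have none, so return nothing.
    if not isinstance(schema, dict):
        return []
    return [(field["name"], field["type"]) for field in schema.get("fields", [])]
```

Assuming the repository layout described above, calling `list_schema_fields("avroschema/TripleLinked.avsc")` from the repository root should list the fields of an OPIEC triple.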
There are two corpora that you can read: OPIEC and WikipediaNLP. To read OPIEC in Python, see src/main/py3/read_articles_from_avro_demo.py:
from avro.datafile import DataFileReader
from avro.io import DatumReader
import pdb

# Not passed to the reader below; kept as a pointer to the schema definition
AVRO_SCHEMA_FILE = "../../../avroschema/TripleLinked.avsc"
AVRO_FILE = "../../../data/triples.avro"  # edit this line

reader = DataFileReader(open(AVRO_FILE, "rb"), DatumReader())
for triple in reader:
    print(triple)
    # use triple.keys() to see every field in the schema (it's a dictionary)
    pdb.set_trace()  # pause after each triple for interactive inspection
reader.close()
Similarly, for reading WikipediaNLP, see src/main/py3/read_articles_from_avro_demo.py:
from avro.datafile import DataFileReader
from avro.io import DatumReader
import pdb

# Not passed to the reader below; kept as a pointer to the schema definition
AVRO_SCHEMA_FILE = "../../../avroschema/WikiArticleLinkedNLP.avsc"
AVRO_FILE = "../../../data/articles.avro"  # edit this line

reader = DataFileReader(open(AVRO_FILE, "rb"), DatumReader())
for article in reader:
    print(article['title'])
    # use article.keys() to see every field in the schema (it's a dictionary)
    pdb.set_trace()  # pause after each article for interactive inspection
reader.close()
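Both demos stop at a `pdb` breakpoint on every record, which suits interactive exploration but not a quick look at a large file. The following is a minimal sketch of a non-interactive alternative, assuming the same `avro` package as the demos; the helper names `iter_avro_records` and `populated_field_counts` are hypothetical and are not part of the OPIEC code:

```python
from collections import Counter
from itertools import islice


def iter_avro_records(avro_path, limit=None):
    """Yield at most `limit` records (dictionaries) from an Avro data file."""
    # Imported lazily so the helper below also works without avro installed.
    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    reader = DataFileReader(open(avro_path, "rb"), DatumReader())
    try:
        # limit=None means "no limit": islice then passes every record through.
        yield from islice(reader, limit)
    finally:
        reader.close()


def populated_field_counts(records):
    """Count, per field name, how many records hold a non-empty value."""
    counts = Counter()
    for record in records:
        for key, value in record.items():
            if value not in (None, "", [], {}):
                counts[key] += 1
    return counts
```

For example, `populated_field_counts(iter_avro_records("../../../data/triples.avro", limit=1000))` would show which fields are filled in among the first 1000 triples, without pausing on each one.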
The same two corpora, OPIEC and WikipediaNLP, can also be read in Java. For reading OPIEC, see src/main/java/de/uni_mannheim/ReadTriplesAvro.java:
package de.uni_mannheim;

import avroschema.linked.TripleLinked;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.io.DatumReader;
import org.apache.avro.specific.SpecificDatumReader;

import java.io.File;
import java.io.IOException;

public class ReadTriplesAvro {
    public static void main(String[] args) throws IOException {
        File f = new File("data/triples.avro");
        DatumReader<TripleLinked> tripleDatumReader = new SpecificDatumReader<>(TripleLinked.class);
        // try-with-resources closes the reader once all triples are read
        try (DataFileReader<TripleLinked> dataFileReader = new DataFileReader<>(f, tripleDatumReader)) {
            while (dataFileReader.hasNext()) {
                TripleLinked triple = dataFileReader.next();
                System.out.println("Processing triple: " + triple.getTripleId());
            }
        }
    }
}
Similarly, for reading WikipediaNLP, see src/main/java/de/uni_mannheim/ReadArticlesAvro.java:
package de.uni_mannheim;

import avroschema.linked.WikiArticleLinkedNLP;

import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.io.DatumReader;
import org.apache.avro.specific.SpecificDatumReader;

public class ReadArticlesAvro {
    public static void main(String[] args) throws IOException {
        File f = new File("data/articles.avro");
        DatumReader<WikiArticleLinkedNLP> articleDatumReader = new SpecificDatumReader<>(WikiArticleLinkedNLP.class);
        // try-with-resources closes the reader once all articles are read
        try (DataFileReader<WikiArticleLinkedNLP> dataFileReader = new DataFileReader<>(f, articleDatumReader)) {
            while (dataFileReader.hasNext()) {
                WikiArticleLinkedNLP article = dataFileReader.next();
                System.out.println("Processing article: " + article.getTitle());
            }
        }
    }
}
We are releasing two corpora: OPIEC and WikipediaNLP. This section describes the metadata of each.
WikipediaNLP is the NLP annotation corpus for the English Wikipedia. Each object is a Wikipedia article containing:
Each OIE triple in OPIEC contains the following metadata:
If you use any of these corpora, or use the findings from the paper, please cite:
@inproceedings{gashteovski2019opiec,
  title={OPIEC: An Open Information Extraction Corpus},
  author={Gashteovski, Kiril and Wanner, Sebastian and Hertling, Sven and Broscheit, Samuel and Gemulla, Rainer},
  booktitle={Proceedings of the Conference on Automatic Knowledge Base Construction (AKBC)},
  year={2019}
}