# CREST: A Causal Relation Schema for Text

CREST is a machine-readable format/schema created to help researchers working on causal/counterfactual relation extraction and commonsense causal reasoning use and leverage the scattered data resources around these topics more easily. CREST-formatted data are stored as pandas DataFrames.
To install the requirements, run the following from the `/CREST` directory:

```shell
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```
To convert datasets to CREST format, use the `/crest/convert.py` script as follows:

* `python convert.py -i`: print the full list of currently supported datasets.
* `python convert.py [DATASET_ID_0] ... [DATASET_ID_n] [OUTPUT_FILE_NAME]`: convert one or more datasets, where:
  * `DATASET_ID_*`: id of a dataset.
  * `OUTPUT_FILE_NAME`: name of the output file, which should be in `.xlsx` format.

Examples:

* Converting datasets `1` and `2`: `python convert.py 1 2 output.xlsx`
* Converting dataset `5`: `python convert.py 5 output.xlsx`
The Excel file of all converted datasets: `crest_v2.xlsx`

* Note: PDTB is not available in this file due to copyright. However, you can still use CREST to convert this dataset if you have access to PDTB.
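Since CREST-formatted data are stored as pandas DataFrames, a converted `.xlsx` file can be loaded and filtered with pandas. A minimal sketch, assuming only the CREST field names; the one relation below is made up for illustration and is not taken from any converted dataset:

```python
import pandas as pd

# Build a one-row DataFrame with the CREST fields (made-up example).
# A real workflow would instead load a converted file, e.g.:
#   df = pd.read_excel("crest_v2.xlsx")
df = pd.DataFrame([{
    "original_id": "1",
    "span1": ["smoke"],
    "span2": ["fire"],
    "signal": ["because of"],
    "context": "There was smoke because of the fire.",
    "idx": "span1 10:15\nspan2 31:35\nsignal 16:26",
    "label": 1,        # 1: causal
    "direction": 1,    # 1: span1 <= span2
    "source": 2,
    "split": 0,        # 0: train
}])

# Keep only causal relations from the train split.
causal_train = df[(df["label"] == 1) & (df["split"] == 0)]
print(len(causal_train))  # → 1
```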
## CREST format

Each relation in a CREST-formatted DataFrame has the following fields/values:

* `original_id`: the id of a relation in the original dataset, if such an id exists.
* `span1`: a list of strings of the first span/argument of the relation.
* `span2`: a list of strings of the second span/argument of the relation.
* `signal`: a list of strings of signals/markers of the relation in context, if any.
* `context`: a text string of the context in which the relation appears.
* `idx`: indices of `span1`, `span2`, and `signal` tokens/spans in `context`, stored in three lines, each line in the form `span_type start_1:end_1 ... start_n:end_n`. For example, if `span1` has multiple tokens/spans with `start:end` indices `2:5` and `10:13`, respectively, `span1`'s line in `idx` is `span1 2:5 10:13`. Indices are sorted by the start indices of tokens/spans.
* `label`: label of the relation. `0`: non-causal, `1`: causal.
* `direction`: direction between `span1` and `span2`. `0`: `span1 => span2`, `1`: `span1 <= span2`, `-1`: not specified.
* `source`: id of the source dataset (ids are listed in the table below).
* `split`: `0`: train, `1`: dev, `2`: test. This is the split to which the relation belongs in the original dataset. If no split is specified for a relation in the original dataset, we assign the relation to the train split by default.

Note: The reason we store a list of strings instead of a single string for `span1`, `span2`, and `signal` is that these text spans may consist of multiple non-consecutive sub-spans in context.
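As a concrete illustration of the `idx` layout, the sketch below parses an `idx` value into `(start, end)` pairs per span type. `parse_idx` is a hypothetical helper written for this example, not part of the CREST codebase:

```python
def parse_idx(idx_value):
    """Parse a CREST-style `idx` string into {span_type: [(start, end), ...]}.

    Hypothetical helper for illustration; each line of the input looks
    like `span1 2:5 10:13`.
    """
    spans = {}
    for line in idx_value.strip().split("\n"):
        parts = line.split()
        span_type, pairs = parts[0], parts[1:]
        spans[span_type] = [tuple(int(i) for i in p.split(":")) for p in pairs]
    return spans

idx = "span1 2:5 10:13\nspan2 20:27\nsignal 14:19"
print(parse_idx(idx)["span1"])  # → [(2, 5), (10, 13)]
```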
List of data resources already converted to CREST format:
| Id | Data resource | Samples | Causal | Non-causal | Document | Year |
|---|---|---|---|---|---|---|
| 1 | SemEval 2007 Task 4 | 1,529 | 114 | 1,415 | Paper | 2007 |
| 2 | SemEval 2010 Task 8 | 10,717 | 1,331 | 9,386 | Paper | 2010 |
| 3 | EventCausality | 583 | 583 | - | Paper | 2011 |
| 4 | Causal-TimeBank | 318 | 318 | - | Paper | 2014 |
| 5 | EventStoryLine v1.5 | 2,608 | 2,608 | - | Paper | 2016 |
| 6 | CaTeRS | 2,502 | 308 | 2,194 | Paper | 2016 |
| 7 | BECauSE v2.1 :warning: | 729 | 554 | 175 | Paper | 2017 |
| 8 | Choice of Plausible Alternatives (COPA) | 2,000 | 1,000 | 1,000 | Paper | 2011 |
| 9 | The Penn Discourse Treebank (PDTB) 3.0 :warning: | 7,991 | 7,991 | - | Manual | 2019 |
| 10 | BioCause Corpus | 844 | 844 | - | Paper | 2013 |
| 11 | Temporal and Causal Reasoning (TCR) | 172 | 172 | - | Paper | 2018 |
| 12 | Benchmark Corpus for Adverse Drug Effects | 5,671 | 5,671 | - | Paper | 2012 |
| 13 | SemEval 2020 Task 5 :atom: | 5,501 | 5,501 | - | Paper | 2020 |
:warning: The dataset is either not publicly available or only partially available. You can still use CREST for conversion if you have full access to it.
:atom: Counterfactual Relations
## CREST conversion

We provide helper methods to convert CREST-formatted data to popular formats and annotation schemes, mainly formats used across relation extraction/classification tasks. The following is a list of formats for which we have already developed CREST converter methods:

* `brat`: we provide helper methods for two-way conversion between CREST data frames and brat (see example here). brat is a popular web-based annotation tool that has been used for a variety of relation extraction NLP tasks. We use brat for two main reasons: 1) better visualization of causal and non-causal relations and their arguments, and 2) modifying existing annotations and adding new annotations to the provided context, if needed. The following is a sample of a CREST-formatted relation converted to brat (the example is taken from the CaTeRS dataset):
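For orientation, brat's standoff format marks text spans with `T` lines and binary relations with `R` lines. Below is a rough, made-up sketch for a causal relation; the text, offsets, and entity/relation type names are assumptions for illustration, not CREST's actual brat output:

```python
# Sketch of brat standoff annotation (.ann) lines for one relation.
# Offsets index into the context string; names are illustrative only.
context = "There was smoke because of the fire."
ann_lines = [
    "T1\tSpan1 10 15\t" + context[10:15],  # span1: "smoke"
    "T2\tSpan2 31 35\t" + context[31:35],  # span2: "fire"
    "R1\tCausal Arg1:T2 Arg2:T1",          # direction: span1 <= span2
]
print("\n".join(ann_lines))
```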
* `TACRED`: TACRED is a large-scale relation extraction dataset. We convert samples from CREST to TACRED since TACRED-formatted data can easily be used as input to many transformer-based language models (e.g. for relation classification/extraction). You can find an example of converting CREST-formatted data to TACRED in this notebook.
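For orientation, a single CREST relation could map to a TACRED-style record roughly as follows. This is a made-up sketch: the token text, relation name, and exact field mapping are assumptions, not the converter's actual output:

```python
# Illustrative TACRED-style record for one CREST relation (a sketch;
# the real converter's field mapping may differ).
tokens = "There was smoke because of the fire .".split()
record = {
    "id": "crest-1",
    "relation": "causal",            # derived from CREST `label`
    "token": tokens,
    "subj_start": 2, "subj_end": 2,  # "smoke" (span1)
    "obj_start": 6, "obj_end": 6,    # "fire"  (span2)
}
print(record["token"][record["subj_start"]])  # → smoke
```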
## Contributing

If you would like to contribute, please submit a Pull Request (PR); we actively check the PRs and appreciate it :relaxed:

## Citation

For now, please cite our arXiv paper:
```bibtex
@article{hosseini2021predicting,
  title={Predicting Directionality in Causal Relations in Text},
  author={Hosseini, Pedram and Broniatowski, David A and Diab, Mona},
  journal={arXiv preprint arXiv:2103.13606},
  year={2021}
}
```