Extracts Transcript and Summary (Abstractive and Extractive) from the AMI Meeting Corpus (`.story` files with article and highlight in it)

Requirements • About AMI Meeting Corpus • AMI DialSum Corpus • How to Use • How to Cite
Tested on Python 3.6+, Ubuntu 16.04, Mac OS
pip install nltk
Download the AMI Corpus and extract `.story` files:
python main_obtain_meeting2summary_data.py --summary_type abstractive
A ready-made `.story` dataset is provided under `data/ami-transcripts-stories/`.
| Argument | Type | Default |
|---|---|---|
| `summary_type` | string | `"abstractive"` |
| `ami_xml_dir` | string | `"data/"` |
| `results_transcripts_speaker_dir` | string | `"data/ami-transcripts-speaker/"` |
| `results_transcripts_dir` | string | `"data/ami-transcript/"` |
| `results_summary_dir` | string | `"data/ami-summary/"` |
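The arguments in the table could be declared with `argparse` roughly as follows. This is a sketch reconstructed from the table, not the script's actual source: the `build_parser` name is hypothetical, and restricting `summary_type` with `choices` is an assumption based on the two documented options.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the argument table (not the script's own code)."""
    parser = argparse.ArgumentParser(
        description="Extract transcripts and summaries from the AMI Meeting Corpus")
    # Assumption: only the two documented summary types are accepted.
    parser.add_argument("--summary_type", type=str, default="abstractive",
                        choices=["abstractive", "extractive"])
    parser.add_argument("--ami_xml_dir", type=str, default="data/")
    parser.add_argument("--results_transcripts_speaker_dir", type=str,
                        default="data/ami-transcripts-speaker/")
    # Default copied as listed in the table.
    parser.add_argument("--results_transcripts_dir", type=str,
                        default="data/ami-transcript/")
    parser.add_argument("--results_summary_dir", type=str,
                        default="data/ami-summary/")
    return parser


# Equivalent to passing `--summary_type extractive` on the command line.
args = build_parser().parse_args(["--summary_type", "extractive"])
print(args.summary_type, args.results_summary_dir)
# prints: extractive data/ami-summary/
```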
- `summary_type` is the type of summary to be extracted. Options=["abstractive", "extractive"].
- `ami_xml_dir` is the directory where the AMI Corpus will be downloaded
- `results_transcripts_speaker_dir` is the directory where each speaker's transcript will be saved
- `results_transcripts_dir` is the directory where each meeting's transcript will be saved
- `results_summary_dir` is the directory where each meeting's summary will be saved

Obtain summaries
- Words are in `data/ami_public_manual_1.6.2/words/*.xml`
- File naming example: `EN2001a.A.words.xml`
  - Meeting: `EN2001`
  - Meeting part: `a` (each hour is a consecutive lowercase letter)
  - Speaker: `A` (usually there are four speakers named A, B, C and D, but E is sometimes also present)
- Each `.xml` file has a number of tags with the words and their respective times in the audio/video file
- `.xml` parsing is required.
- Results are saved in:
  - `data/ami-transcripts-speaker/`: meeting transcripts for each speaker
  - `data/ami-transcripts/`: complete meeting transcripts (all speakers together)

Obtain abstractive summaries
- Abstractive summaries are in `data/ami_public_manual_1.6.2/abstractive/*.xml`
- The summary is in the `abstract` tag
- The `abstract` tag is composed of text in `sentence` tags
- Results are saved in `data/ami-summary/abstractive/`
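Reading the `sentence` text inside the `abstract` tag can be sketched with the standard library as below. The sample XML is invented for illustration (including the `nite` namespace URI and the ids), and `abstract_sentences` is a hypothetical helper, not a function from this repository.

```python
import xml.etree.ElementTree as ET

# Invented sample; real files live in data/ami_public_manual_1.6.2/abstractive/.
SAMPLE = """<nite:root xmlns:nite="http://nite.sourceforge.net/" nite:id="demo.abssumm">
  <abstract nite:id="demo.abstract.1">
    <sentence nite:id="demo.s.1">The team discussed the project plan.</sentence>
    <sentence nite:id="demo.s.2">They agreed on next steps.</sentence>
  </abstract>
</nite:root>"""


def abstract_sentences(xml_text):
    """Return the text of every `sentence` found inside an `abstract` tag."""
    root = ET.fromstring(xml_text)
    sentences = []
    for abstract in root.iter("abstract"):
        for sentence in abstract.iter("sentence"):
            text = "".join(sentence.itertext()).strip()
            if text:
                sentences.append(text)
    return sentences


print(" ".join(abstract_sentences(SAMPLE)))
# prints: The team discussed the project plan. They agreed on next steps.
```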
Obtain extractive summaries
- Extractive summaries are in `data/ami_public_manual_1.6.2/extractive/*.xml`
- The summary is in the `extsumm` tag
- The `extsumm` tag is composed of children nodes such as the below examples:
<nite:child href="ES2002a.B.dialog-act.xml#id(ES2002a.B.dialog-act.dharshi.3)"/>
This child refers to the dialogue act `ES2002a.B.dialog-act.dharshi.3` in a file named `ES2002a.B.dialog-act.xml` in `data/dialogueActs/`:
<dact nite:id="ES2002a.B.dialog-act.dharshi.3" reflexivity="true">
<nite:pointer role="da-aspect" href="da-types.xml#id(ami_da_4)"/>
<nite:child href="ES2002a.B.words.xml#id(ES2002a.B.words4)..id(ES2002a.B.words16)"/>
</dact>
That dialogue act in turn points to a range of words in `ES2002a.B.words.xml` in `data/words/`.
<nite:child href="ES2002a.D.dialog-act.xml#id(ES2002a.D.dialog-act.dharshi.16)..id(ES2002a.D.dialog-act.dharshi.20)"/>
This child refers to the dialogue acts `ES2002a.D.dialog-act.dharshi.16` to `20` in a file named `ES2002a.D.dialog-act.xml` in `data/dialogueActs/`.
- Results are saved in `data/ami-summary/extractive/`
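The `href` values in the examples above can be split into a file name and an id range with a small regular expression. This is a minimal sketch based only on the two `href` shapes shown here; `parse_href` and `select_range` are hypothetical helpers, and selecting by document order (rather than by the number at the end of the id) is an assumption made to avoid relying on ids being consecutive.

```python
import re

# Matches `file.xml#id(start)` with an optional `..id(end)` suffix.
HREF_RE = re.compile(
    r"^(?P<file>[^#]+)#id\((?P<start>[^)]+)\)(?:\.\.id\((?P<end>[^)]+)\))?$")


def parse_href(href):
    """Return (file name, first id, last id); a single id is its own range."""
    match = HREF_RE.match(href)
    if match is None:
        raise ValueError(f"Unrecognised href: {href!r}")
    start = match.group("start")
    end = match.group("end") or start
    return match.group("file"), start, end


def select_range(ordered_ids, start, end):
    """Return the ids from `start` to `end` inclusive, in document order."""
    first = ordered_ids.index(start)
    last = ordered_ids.index(end)
    return ordered_ids[first:last + 1]


single = "ES2002a.B.dialog-act.xml#id(ES2002a.B.dialog-act.dharshi.3)"
ranged = "ES2002a.B.words.xml#id(ES2002a.B.words4)..id(ES2002a.B.words16)"
print(parse_href(single))
print(parse_href(ranged))
```

With the ids of a file collected in document order, `select_range` then yields the referenced elements, whose text can be joined to form the extractive summary.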
`ami_dialsum_meeting_story.py`: reads the AMI DialSum Corpus (files `in` and `sum`) and formats it into a series of `.story` files compatible with the CNN/DM format.
Each line in `in` corresponds to a meeting transcript, with its summary present in the same line in file `sum`.
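The line-by-line pairing described above can be sketched as follows. This is not the code of `ami_dialsum_meeting_story.py`: `to_story` and `write_stories` are hypothetical helpers, the `meeting_0000.story` naming is invented, and the layout (article text followed by `@highlight` blocks) is the commonly used CNN/DM `.story` convention, assumed here to be the target.

```python
import os
import tempfile


def to_story(article, highlights):
    """Join an article and its highlights in the CNN/DM `.story` layout."""
    parts = [article.strip()]
    for highlight in highlights:
        parts.append("@highlight")
        parts.append(highlight.strip())
    return "\n\n".join(parts) + "\n"


def write_stories(in_path, sum_path, out_dir):
    """Pair line i of `in` with line i of `sum` and write one `.story` per pair."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with open(in_path, encoding="utf-8") as f_in, \
            open(sum_path, encoding="utf-8") as f_sum:
        for index, (article, summary) in enumerate(zip(f_in, f_sum)):
            # Hypothetical file naming, chosen only for this example.
            story_path = os.path.join(out_dir, f"meeting_{index:04d}.story")
            with open(story_path, "w", encoding="utf-8") as f_out:
                f_out.write(to_story(article, [summary]))
            written.append(story_path)
    return written


# Small demonstration on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    in_file = os.path.join(tmp, "in")
    sum_file = os.path.join(tmp, "sum")
    with open(in_file, "w", encoding="utf-8") as f:
        f.write("okay let us start the meeting\n")
    with open(sum_file, "w", encoding="utf-8") as f:
        f.write("the meeting was opened\n")
    paths = write_stories(in_file, sum_file, os.path.join(tmp, "stories"))
    with open(paths[0], encoding="utf-8") as f:
        print(f.read())
```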
XML reader in Python:
TODO
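As a possible starting point, a words file could be read with the standard library as sketched below. The sample XML is invented: the `w` element name, the `starttime`, `endtime` and `punc` attributes, and the `nite` namespace URI are assumptions about the corpus format and should be checked against the real files. `parse_words_filename`, `read_words` and `join_words` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

NITE_NS = "http://nite.sourceforge.net/"  # assumed namespace URI

# Invented sample standing in for a file such as EN2001a.A.words.xml.
SAMPLE_WORDS = """<nite:root xmlns:nite="http://nite.sourceforge.net/" nite:id="demo.A.words">
  <w nite:id="demo.A.words0" starttime="0.50" endtime="0.80">Okay</w>
  <w nite:id="demo.A.words1" starttime="0.80" endtime="0.80" punc="true">.</w>
  <w nite:id="demo.A.words2" starttime="1.10" endtime="1.40">Good</w>
  <w nite:id="demo.A.words3" starttime="1.40" endtime="1.90">morning</w>
</nite:root>"""


def parse_words_filename(name):
    """Split e.g. 'EN2001a.A.words.xml' into (meeting, part, speaker)."""
    meeting_part, speaker, _rest = name.split(".", 2)
    return meeting_part[:-1], meeting_part[-1], speaker


def read_words(xml_text):
    """Return one dict per `w` element, in document order."""
    root = ET.fromstring(xml_text)
    words = []
    for w in root.iter("w"):
        start = w.get("starttime")
        words.append({
            "id": w.get(f"{{{NITE_NS}}}id"),
            "start": float(start) if start else None,
            "text": (w.text or "").strip(),
            "punc": w.get("punc") == "true",
        })
    return words


def join_words(words):
    """Join word texts with spaces, attaching punctuation to the previous word."""
    out = ""
    for word in words:
        if not word["text"]:
            continue
        if word["punc"] or not out:
            out += word["text"]
        else:
            out += " " + word["text"]
    return out


print(parse_words_filename("EN2001a.A.words.xml"))  # ('EN2001', 'a', 'A')
print(join_words(read_words(SAMPLE_WORDS)))          # Okay. Good morning
```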
Please star or fork if this code was useful for you. If you use it in a paper, please cite as:
@software{cunha_sergio2019ami_xml2story,
author = {Gwenaelle Cunha Sergio},
title = {{gcunhase/AMICorpusXML: Obtaining Transcript and Abstractive and Extractive Summaries from the AMI Meeting Corpus and formatting the AMI DialSum Meeting Corpus}},
month = dec,
year = 2019,
doi = {10.5281/zenodo.3561298},
version = {v2.1},
publisher = {Zenodo},
url = {https://github.com/gcunhase/AMICorpusXML}
}
If you use the AMI Meeting Corpus, please also add the following citation:
@INPROCEEDINGS{Mccowan05theami,
author = {I. Mccowan and G. Lathoud and M. Lincoln and A. Lisowska and W. Post and D. Reidsma and P. Wellner},
title = {The AMI Meeting Corpus},
booktitle = {In: Proceedings Measuring Behavior 2005, 5th International Conference on Methods and Techniques in Behavioral Research. L.P.J.J. Noldus, F. Grieco, L.W.S. Loijens and P.H. Zimmerman (Eds.), Wageningen: Noldus Information Technology},
year = {2005}
}