AMICorpusXML Save

Extracts Transcript and Summary (Abstractive and Extractive) from the AMI Meeting Corpus

Project README

DOI

About

  • Extracts meetings transcript and summary from AMI Meeting Corpus: extractive and abstractive
  • Transforms into CNN-DailyMail News dataset (.story files with article and highlight in it)

Contents

RequirementsAbout AMI Meeting CorpusAMI DialSum CorpusHow to UseHow to Cite

Requirements

Tested on Python 3.6+, Ubuntu 16.04, Mac OS

pip install nltk

AMI Meeting Corpus

  • More info
  • Number of meetings (including scenario and non-scenario): 171
    • Number of speakers per meeting: 4-5
    • Total number of transcripts: 687
  • Number of summaries: 142 (abstractive) and 137 (extractive)
    • Only available for meetings with names starting with ES, IS and TS

How to Use

Download AMI Corpus and extract .story files

python main_obtain_meeting2summary_data.py --summary_type abstractive

Already made .story dataset has been provided under data/ami-transcripts-stories/

Configuration options

Argument Type Default
summary_type string "abstractive"
ami_xml_dir string "data/"
results_transcripts_speaker_dir string "data/ami-transcripts-speaker/"
results_transcripts_dir string "data/ami-transcript/"
results_summary_dir string "data/ami-summary/"
  • summary_type is the type of summary to be extracted. Options=["abstractive", "extractive"].
  • ami_xml_dir is the directory where the AMI Corpus will be downloaded
  • results_transcripts_speaker_dir is the directory where each speaker's transcript will be saved
  • results_transcripts_dir is the directory where each meeting's transcript will be saved
  • results_summary_dir is the directory where each meeting's summary will be saved

Code explanation

  1. Obtain summaries

    • Summaries are originally saved in data/ami_public_manual_1.6.2/words/*.xml
    • Example: EN2001a.A.words.xml
      • Meeting name: EN2001
      • Meetings are divided into 1 hour parts: a (each hour is a consecutive lowercase letter)
      • Speaker: A (usually there are four speakers named A, B, C and D, but E is sometimes also present)
    • How:
      • Each .xml file has a number of tags with the words and their respective times in the audio/video file.
      • In order for us to extract the summaries, we have to put those words back together in sentences and paragraphs.
      • Thus xml parsing is required.
    • Output: 2 folders with corresponding .txt files
      • data/ami-transcripts-speaker/: meeting transcripts for each speaker
      • data/ami-transcripts/: complete meeting transcripts (all speakers together)
  2. Obtain abstractive summaries

    • Located in data/ami_public_manual_1.6.2/abstractive/*.xml
    • How:
      • Extract text between abstract tag
      • Text between abstract tag is composed of text in sentence tags
      • Return all these tags as a paragraph
    • Output: data/ami-summary/abstractive/
  3. Obtain extractive summaries

    • Located in data/ami_public_manual_1.6.2/extractive/*.xml
    • How:
      • Extract text between extsumm tag
      • Text between extsumm tag is composed of children nodes such as the below examples:
        • Example 1: <nite:child href="ES2002a.B.dialog-act.xml#id(ES2002a.B.dialog-act.dharshi.3)"/>
          • This node refers to a node of ID ES2002a.B.dialog-act.dharshi.3 in a file named ES2002a.B.dialog-act.xml in data/dialogueActs/
          • This ID is:
            <dact nite:id="ES2002a.B.dialog-act.dharshi.3" reflexivity="true">
                <nite:pointer role="da-aspect"  href="da-types.xml#id(ami_da_4)"/>
                <nite:child href="ES2002a.B.words.xml#id(ES2002a.B.words4)..id(ES2002a.B.words16)"/>
            </dact>
            
          • This indicates words 4 to 16 in ES2002a.B.words.xml in data/words/
        • Example 2: <nite:child href="ES2002a.D.dialog-act.xml#id(ES2002a.D.dialog-act.dharshi.16)..id(ES2002a.D.dialog-act.dharshi.20)"/>
          • This node refers to nodes of ID ES2002a.D.dialog-act.dharshi.16 to 20 in a file named ES2002a.D.dialog-act.xml in data/dialogueActs/
      • Return all the collected words as a paragraph
    • Output: data/ami-summary/extractive/

Extra: AMI DialSum Meeting Corpus

  • DialSum: modified version of the AMI Meeting Dataset
  • Use script ami_dialsum_meeting_story.py:
    • This script takes 2 text files (in and sum) and formats it into a series of .story files compatible with the CNN/DM format
    • Each line in file in corresponds to a meeting transcript with summary present in the same line in file sum/.

Notes

  • XML reader in Python:

  • TODO

    • Overlapping meeting transcript
    • Decision abstract

Acknowledgement

Please star or fork if this code was useful for you. If you use it in a paper, please cite as:

@software{cunha_sergio2019ami_xml2story,
    author       = {Gwenaelle Cunha Sergio},
    title        = {{gcunhase/AMICorpusXML: Obtaining Transcript and Abstractive and Extractive Summaries from the AMI Meeting Corpus and formatting the AMI DialSum Meeting Corpus}},
    month        = dec,
    year         = 2019,
    doi          = {10.5281/zenodo.3561298},
    version      = {v2.1},
    publisher    = {Zenodo},
    url          = {https://github.com/gcunhase/AMICorpusXML}
    }

If you use the AMI Meeting Corpus, please also add the following citation:

@INPROCEEDINGS{Mccowan05theami,
    author = {I. Mccowan and G. Lathoud and M. Lincoln and A. Lisowska and W. Post and D. Reidsma and P. Wellner},
    title = {The AMI Meeting Corpus},
    booktitle = {In: Proceedings Measuring Behavior 2005, 5th International Conference on Methods and Techniques in Behavioral Research. L.P.J.J. Noldus, F. Grieco, L.W.S. Loijens and P.H. Zimmerman (Eds.), Wageningen: Noldus Information Technology},
    year = {2005}
}
Open Source Agenda is not affiliated with "AMICorpusXML" Project. README Source: gcunhase/AMICorpusXML
Stars
52
Open Issues
3
Last Commit
4 years ago
License
MIT

Open Source Agenda Badge

Open Source Agenda Rating