
Processing the Meeting Recorder Dialogue Act Corpus

Utilities for processing the Meeting Recorder Dialogue Act (MRDA) Corpus, outlined in this paper by Shriberg et al. (2004), for the purpose of dialogue act (DA) classification. The data can also be downloaded here. The data is split into the original training and test sets suggested by the authors. Two dialogues were unused in that split and have been added to the validation and test sets.

Scripts

The mrda_to_text.py script processes all dialogues into a plain text format. Individual dialogues are saved into directories corresponding to the set they belong to (train, test, etc.). All utterances in a particular set are also saved to a single text file.

The utilities.py script contains various helper functions for loading/saving the data.

The process_transcript.py script includes functions for processing each dialogue.

The mrda_metadata.py script generates various metadata from the processed dialogues and saves them as a dictionary to a pickle file. The words, labels and frequencies are also saved as plain text files in the /metadata directory.

Data Format

Utterances are tagged with the MRDA tagset, which is a variation of the SWBD-DAMSL tagset. The original MRDA label construction allowed DA tags to be concatenated in the form <general tag> ^ <specific tag> . <disruption form> (see the MRDA manual).

There are three sets of DA included:

  • Basic collapses all DA tags into the 5 labels outlined in the original documentation's DA maps (see the basic DA map file).
  • General uses the 12 <general tag> labels described in the MRDA manual.
  • Full uses only the first <specific tag> from the original labels, 52 in total (see the sketch below).
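
For illustration, the following minimal sketch shows how a raw MRDA label could be split into its general and full tags. split_mrda_label is a hypothetical helper written for this README, not the repository's own code; the actual processing lives in process_transcript.py and may differ:

```python
def split_mrda_label(raw_label):
    """Split a raw MRDA label of the form
    <general tag>^<specific tag>...<.disruption form>
    into (general, full) tags.

    Hypothetical helper for illustration only.
    """
    # Drop any disruption form, e.g. 's^df.%--' -> 's^df'
    label = raw_label.split('.')[0]
    parts = label.split('^')
    general = parts[0]
    # 'Full' keeps only the first specific tag, falling back
    # to the general tag when no specific tag is present.
    full = parts[1] if len(parts) > 1 else general
    return general, full

print(split_mrda_label('s^df.%--'))  # ('s', 'df')
print(split_mrda_label('fh'))        # ('fh', 'fh')
```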

By default:

  • Utterances are written one per line in the format Speaker | Utterance Text | Basic DA Tag | General DA Tag | Full DA Tag.
  • Setting utterance_only_flag = True changes the output to one utterance per line, with no speaker or DA tags.
  • Utterances marked as non-verbal ('x' tags), e.g. 'Laughter' or 'Throat_clearing', are removed.
  • Utterances marked as non-labeled ('z' tags) are removed.
  • Interrupted, abandoned and uninterpretable tags are collapsed into a single tag ('%'), as approximated in the sketch below.
  • All disfluency annotations are removed, e.g. '#', '<' and '>'.
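
The tag filtering described above could be approximated as follows. This is a minimal sketch, assuming the raw general tags 'x' and 'z' and the disruption forms '%-' (interrupted) and '%--' (abandoned) as described in the MRDA manual; it is not the repository's actual implementation:

```python
def filter_tag(general_tag):
    """Apply the default tag handling described above.

    Returns None when the utterance should be dropped entirely.
    Hypothetical sketch; the exact raw tag spellings are assumptions.
    """
    if general_tag == 'x':  # non-verbal, e.g. 'Laughter' or 'Throat_clearing'
        return None
    if general_tag == 'z':  # non-labeled
        return None
    if general_tag in ('%', '%-', '%--'):
        # Uninterpretable, interrupted and abandoned collapse to '%'
        return '%'
    return general_tag
```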

Example Utterances

fe016|okay.|F|fg|fg
fe016|so um|F|fh|fh
fe016|i was going to try to get out of here like in half an hour.|S|s|rt
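
Since the fields are '|'-delimited, a processed line can be read back with a simple split. A minimal sketch, assuming the default five-field format described above:

```python
# One line from a processed dialogue file (default format).
line = "fe016|i was going to try to get out of here like in half an hour.|S|s|rt"

speaker, text, basic_da, general_da, full_da = line.strip().split('|')
print(speaker)     # fe016
print(general_da)  # s
print(full_da)     # rt
```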

Dialogue Acts

Basic Labels

| Dialogue Act | Label | Count | % | Train Count | Train % | Test Count | Test % | Val Count | Val % |
|---|---|---|---|---|---|---|---|---|---|
| Statement | S | 64233 | 59.36 | 45099 | 60.08 | 9571 | 57.30 | 9563 | 58.19 |
| BackChannel | B | 14620 | 13.51 | 10265 | 13.67 | 2152 | 12.88 | 2203 | 13.41 |
| Disruption | D | 14548 | 13.45 | 9739 | 12.97 | 2339 | 14.00 | 2470 | 15.03 |
| FloorGrabber | F | 7818 | 7.23 | 5324 | 7.09 | 1409 | 8.44 | 1085 | 6.60 |
| Question | Q | 6983 | 6.45 | 4640 | 6.18 | 1231 | 7.37 | 1112 | 6.77 |

Basic Label Frequencies

General Labels

| Dialogue Act | Label | Count | % | Train Count | Train % | Test Count | Test % | Val Count | Val % |
|---|---|---|---|---|---|---|---|---|---|
| Statement | s | 69873 | 64.58 | 48952 | 65.21 | 10472 | 62.70 | 10449 | 63.59 |
| Continuer | b | 15167 | 14.02 | 10606 | 14.13 | 2219 | 13.29 | 2342 | 14.25 |
| Floor Holder | fh | 8362 | 7.73 | 5617 | 7.48 | 1520 | 9.10 | 1225 | 7.45 |
| Yes-No-question | qy | 4986 | 4.61 | 3310 | 4.41 | 870 | 5.21 | 806 | 4.90 |
| Interrupted/Abandoned/Uninterpretable | % | 3103 | 2.87 | 2171 | 2.89 | 492 | 2.95 | 440 | 2.68 |
| Floor Grabber | fg | 3092 | 2.86 | 2076 | 2.77 | 489 | 2.93 | 527 | 3.21 |
| Wh-Question | qw | 1707 | 1.58 | 1110 | 1.48 | 310 | 1.86 | 287 | 1.75 |
| Hold Before Answer/Agreement | h | 792 | 0.73 | 474 | 0.63 | 134 | 0.80 | 184 | 1.12 |
| Or-Clause | qrr | 392 | 0.36 | 244 | 0.33 | 75 | 0.45 | 73 | 0.44 |
| Rhetorical Question | qh | 352 | 0.33 | 260 | 0.35 | 56 | 0.34 | 36 | 0.22 |
| Or Question | qr | 207 | 0.19 | 131 | 0.17 | 37 | 0.22 | 39 | 0.24 |
| Open-ended Question | qo | 169 | 0.16 | 116 | 0.15 | 28 | 0.17 | 25 | 0.15 |

General Label Frequencies

Full Labels

| Dialogue Act | Label | Count | % | Train Count | Train % | Test Count | Test % | Val Count | Val % |
|---|---|---|---|---|---|---|---|---|---|
| Statement | s | 33472 | 30.93 | 23238 | 30.96 | 4971 | 29.76 | 5263 | 32.03 |
| Continuer | b | 15013 | 13.87 | 10517 | 14.01 | 2175 | 13.02 | 2321 | 14.12 |
| Floor Holder | fh | 8362 | 7.73 | 5617 | 7.48 | 1520 | 9.10 | 1225 | 7.45 |
| Acknowledge-answer | bk | 7177 | 6.63 | 5117 | 6.82 | 1031 | 6.17 | 1029 | 6.26 |
| Accept | aa | 5898 | 5.45 | 4097 | 5.46 | 903 | 5.41 | 898 | 5.46 |
| Defending/Explanation | df | 3724 | 3.44 | 2790 | 3.72 | 531 | 3.18 | 403 | 2.45 |
| Expansions of y/n Answers | e | 3200 | 2.96 | 2360 | 3.14 | 540 | 3.23 | 300 | 1.83 |
| Interrupted/Abandoned/Uninterpretable | % | 3103 | 2.87 | 2171 | 2.89 | 492 | 2.95 | 440 | 2.68 |
| Rising Tone | rt | 3101 | 2.87 | 2015 | 2.68 | 516 | 3.09 | 570 | 3.47 |
| Floor Grabber | fg | 3092 | 2.86 | 2076 | 2.77 | 489 | 2.93 | 527 | 3.21 |
| Offer | cs | 2662 | 2.46 | 1878 | 2.50 | 402 | 2.41 | 382 | 2.32 |
| Assessment/Appreciation | ba | 2216 | 2.05 | 1605 | 2.14 | 354 | 2.12 | 257 | 1.56 |
| Understanding Check | bu | 2091 | 1.93 | 1405 | 1.87 | 371 | 2.22 | 315 | 1.92 |
| Declarative-Question | d | 1805 | 1.67 | 1153 | 1.54 | 350 | 2.10 | 302 | 1.84 |
| Affirmative Non-yes Answers | na | 1112 | 1.03 | 870 | 1.16 | 133 | 0.80 | 109 | 0.66 |
| Wh-Question | qw | 951 | 0.88 | 630 | 0.84 | 160 | 0.96 | 161 | 0.98 |
| Reject | ar | 908 | 0.84 | 594 | 0.79 | 152 | 0.91 | 162 | 0.99 |
| Collaborative Completion | 2 | 841 | 0.78 | 571 | 0.76 | 136 | 0.81 | 134 | 0.82 |
| Other Answers | no | 828 | 0.77 | 563 | 0.75 | 98 | 0.59 | 167 | 1.02 |
| Hold Before Answer/Agreement | h | 792 | 0.73 | 474 | 0.63 | 134 | 0.80 | 184 | 1.12 |
| Action-directive | co | 674 | 0.62 | 460 | 0.61 | 97 | 0.58 | 117 | 0.71 |
| Yes-No-question | qy | 669 | 0.62 | 476 | 0.63 | 90 | 0.54 | 103 | 0.63 |
| Dispreferred Answers | nd | 483 | 0.45 | 341 | 0.45 | 82 | 0.49 | 60 | 0.37 |
| Humorous Material | j | 463 | 0.43 | 326 | 0.43 | 67 | 0.40 | 70 | 0.43 |
| Downplayer | bd | 387 | 0.36 | 290 | 0.39 | 68 | 0.41 | 29 | 0.18 |
| Commit | cc | 371 | 0.34 | 258 | 0.34 | 51 | 0.31 | 62 | 0.38 |
| Negative Non-no Answers | ng | 351 | 0.32 | 236 | 0.31 | 56 | 0.34 | 59 | 0.36 |
| Maybe | am | 349 | 0.32 | 224 | 0.30 | 66 | 0.40 | 59 | 0.36 |
| Or-Clause | qrr | 345 | 0.32 | 216 | 0.29 | 66 | 0.40 | 63 | 0.38 |
| Exclamation | fe | 307 | 0.28 | 195 | 0.26 | 56 | 0.34 | 56 | 0.34 |
| Mimic Other | m | 293 | 0.27 | 200 | 0.27 | 48 | 0.29 | 45 | 0.27 |
| Apology | fa | 259 | 0.24 | 181 | 0.24 | 46 | 0.28 | 32 | 0.19 |
| About-task | t | 253 | 0.23 | 154 | 0.21 | 42 | 0.25 | 57 | 0.35 |
| Signal-non-understanding | br | 236 | 0.22 | 161 | 0.21 | 39 | 0.23 | 36 | 0.22 |
| Accept-part | aap | 219 | 0.20 | 158 | 0.21 | 27 | 0.16 | 34 | 0.21 |
| Rhetorical-Question | qh | 214 | 0.20 | 166 | 0.22 | 30 | 0.18 | 18 | 0.11 |
| Topic Change | tc | 212 | 0.20 | 127 | 0.17 | 35 | 0.21 | 50 | 0.30 |
| Repeat | r | 208 | 0.19 | 131 | 0.17 | 45 | 0.27 | 32 | 0.19 |
| Self-talk | t1 | 198 | 0.18 | 120 | 0.16 | 38 | 0.23 | 40 | 0.24 |
| 3rd-party-talk | t3 | 165 | 0.15 | 105 | 0.14 | 36 | 0.22 | 24 | 0.15 |
| Rhetorical-question Continue | bh | 154 | 0.14 | 109 | 0.15 | 26 | 0.16 | 19 | 0.12 |
| Reject-part | bsc | 150 | 0.14 | 94 | 0.13 | 22 | 0.13 | 34 | 0.21 |
| Misspeak Self-Correction | arp | 150 | 0.14 | 89 | 0.12 | 18 | 0.11 | 43 | 0.26 |
| Reformulate/Summarize | bs | 141 | 0.13 | 89 | 0.12 | 17 | 0.10 | 35 | 0.21 |
| "Follow Me" | f | 128 | 0.12 | 98 | 0.13 | 12 | 0.07 | 18 | 0.11 |
| Or-Question | qr | 127 | 0.12 | 88 | 0.12 | 17 | 0.10 | 22 | 0.13 |
| Thanking | ft | 119 | 0.11 | 88 | 0.12 | 9 | 0.05 | 22 | 0.13 |
| Tag-Question | g | 87 | 0.08 | 58 | 0.08 | 9 | 0.05 | 20 | 0.12 |
| Open-Question | qo | 74 | 0.07 | 49 | 0.07 | 14 | 0.08 | 11 | 0.07 |
| Correct-misspeaking | bc | 51 | 0.05 | 29 | 0.04 | 13 | 0.08 | 9 | 0.05 |
| Sympathy | by | 11 | 0.01 | 5 | 0.01 | 2 | 0.01 | 4 | 0.02 |
| Welcome | fw | 6 | 0.01 | 5 | 0.01 | 0 | 0.00 | 1 | 0.01 |

Full Label Frequencies

Metadata

  • Total number of utterances: 108202
  • Max utterance length: 85
  • Mean utterance length: 8.01
  • Total number of dialogues: 75
  • Max dialogue length: 3391
  • Mean dialogue length: 1442.69
  • Vocabulary size: 10866
  • Number of basic labels: 5
  • Number of general labels: 12
  • Number of full labels: 52
  • Number of speakers: 52

Train set

  • Number of dialogues: 51
  • Max dialogue length: 3391
  • Mean dialogue length: 1471.9
  • Number of utterances: 75067

Test set

  • Number of dialogues: 12
  • Max dialogue length: 2028
  • Mean dialogue length: 1391.83
  • Number of utterances: 16702

Val set

  • Number of dialogues: 12
  • Max dialogue length: 1969
  • Mean dialogue length: 1369.42
  • Number of utterances: 16433

Keys and values for the metadata dictionary

  • num_utterances = Total number of utterances in the full corpus.
  • max_utterance_len = Number of words in the longest utterance in the corpus.
  • mean_utterance_len = Average number of words in utterances.
  • num_dialogues = Total number of dialogues in the corpus.
  • max_dialogues_len = Number of utterances in the longest dialogue in the corpus.
  • mean_dialogues_len = Average number of utterances in dialogues.
  • word_freq = Dataframe with Word and Count columns.
  • vocabulary = List of all words in vocabulary.
  • vocabulary_size = Number of words in the vocabulary.
  • speakers = List of all speakers.
  • num_speakers = Number of speakers in the MRDA data.

Each DA label set (basic, general or full) also has:

  • <setname>_label_freq = Dataframe containing all data in the set's Dialogue Acts table above.
  • <setname>_labels = List of all DA labels.
  • num_<setname>_labels = Number of labels used from each of the label sets.

Each data set also has:

  • <setname>_num_utterances = Number of utterances in the set.
  • <setname>_num_dialogues = Number of dialogues in the set.
  • <setname>_max_dialogue_len = Length of the longest dialogue in the set.
  • <setname>_mean_dialogue_len = Mean length of dialogues in the set.
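
For example, the dictionary produced by mrda_metadata.py can be loaded with Python's pickle module. The file path below is an assumption; adjust it to wherever the script saved the pickle in the /metadata directory:

```python
import pickle

# Hypothetical path; the actual file name is set by mrda_metadata.py.
with open('metadata/metadata.pkl', 'rb') as file:
    metadata = pickle.load(file)

print(metadata['num_utterances'])       # 108202
print(metadata['num_full_labels'])      # 52
print(metadata['train_num_dialogues'])  # 51
```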