TUFS Asian Language Parallel Corpus
The TUFS Asian Language Parallel Corpus (TALPCo) is an open parallel corpus consisting of Japanese sentences and their translations into Korean, Burmese (Myanmar; the official language of the Republic of the Union of Myanmar), Malay (the national language of Malaysia, Singapore and Brunei), Indonesian, Thai, Vietnamese and English. TALPCo is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. See the paper below for the details of TALPCo.
data_jpn.txt: Japanese (raw sentences)
data_jpn-sound.txt: Japanese (links to sound files)
data_jpn-token.txt: Japanese (tokenized sentences)
data_jpn-IPSpkr.csv: Japanese (interpersonal meaning annotation, speaker)
data_jpn-IPAddr.csv: Japanese (interpersonal meaning annotation, addressee)
data_jpn-IPLex.csv: Japanese (interpersonal meaning annotation, lexical)
data_jpn-prosub.jsonl: Japanese (pronoun substitute annotation)
data_jpn-prosub.txt: Japanese (pronoun substitute annotation)
data_kor.txt: Korean (raw sentences)
data_kor-sound.txt: Korean (links to sound files)
data_kor-token.txt: Korean (tokenized sentences)
data_kor-prosub.jsonl: Korean (pronoun substitute annotation)
data_kor-prosub.txt: Korean (pronoun substitute annotation)
data_zsm.txt: Malay (raw sentences)
data_zsm-sound.txt: Malay (links to sound files)
data_zsm-token.txt: Malay (tokenized sentences)
data_zsm-MWE.txt: Malay (multiword expression list)
data_zsm-tree.txt: Malay (constituency tree annotation)
data_zsm.jpn-zsm: Malay (partial Japanese-Malay alignment)
data_zsm-IPSpkr.csv: Malay (interpersonal meaning annotation, speaker)
data_zsm-IPAddr.csv: Malay (interpersonal meaning annotation, addressee)
data_zsm-IPLex.csv: Malay (interpersonal meaning annotation, lexical)
data_zsm-prosub.jsonl: Malay (pronoun substitute annotation)
data_zsm-prosub.txt: Malay (pronoun substitute annotation)
data_ind.txt: Indonesian (raw sentences)
data_ind-sound.txt: Indonesian (links to sound files)
data_ind-token.txt: Indonesian (tokenized sentences)
data_ind-MWE.txt: Indonesian (multiword expression list)
data_ind-tree.txt: Indonesian (constituency tree annotation)
data_ind.jpn-ind: Indonesian (partial Japanese-Indonesian alignment)
data_ind-IPSpkr.csv: Indonesian (interpersonal meaning annotation, speaker)
data_ind-IPAddr.csv: Indonesian (interpersonal meaning annotation, addressee)
data_ind-IPLex.csv: Indonesian (interpersonal meaning annotation, lexical)
data_ind-prosub.jsonl: Indonesian (pronoun substitute annotation)
data_ind-prosub.txt: Indonesian (pronoun substitute annotation)
data_jav.txt: Javanese (raw sentences)
data_jav-prosub.jsonl: Javanese (pronoun substitute annotation)
data_jav-prosub.txt: Javanese (pronoun substitute annotation)
data_tha.txt: Thai (raw sentences)
data_tha-sound.txt: Thai (links to sound files)
data_tha-token.txt: Thai (tokenized sentences)
data_tha.jpn-tha: Thai (partial Japanese-Thai alignment)
data_tha-IPSpkr.csv: Thai (interpersonal meaning annotation, speaker)
data_tha-IPAddr.csv: Thai (interpersonal meaning annotation, addressee)
data_tha-IPLex.csv: Thai (interpersonal meaning annotation, lexical)
data_tha-prosub.jsonl: Thai (pronoun substitute annotation)
data_tha-prosub.txt: Thai (pronoun substitute annotation)
data_vie.txt: Vietnamese (raw sentences)
data_vie-sound.txt: Vietnamese (links to sound files)
data_vie-token.txt: Vietnamese (tokenized sentences)
data_vie-MWE.txt: Vietnamese (multi-syllable expression list)
data_vie.jpn-vie: Vietnamese (partial Japanese-Vietnamese alignment)
data_vie-IPSpkr.csv: Vietnamese (interpersonal meaning annotation, speaker)
data_vie-IPAddr.csv: Vietnamese (interpersonal meaning annotation, addressee)
data_vie-IPLex.csv: Vietnamese (interpersonal meaning annotation, lexical)
data_vie-prosub.jsonl: Vietnamese (pronoun substitute annotation)
data_vie-prosub.txt: Vietnamese (pronoun substitute annotation)
data_myn.txt: Burmese (raw sentences)
data_myn-sound.txt: Burmese (links to sound files)
data_myn-token.txt: Burmese (tokenized sentences)
data_myn-ps.txt: Burmese (POS-tagged sentences)
data_myn-prosub.jsonl: Burmese (pronoun substitute annotation)
data_myn-prosub.txt: Burmese (pronoun substitute annotation)
data_eng.txt: English (raw sentences)
data_eng-US.txt: English (US) (raw sentences) [by courtesy of Charles Kelly]
readme.me: (this document)
All files are encoded in UTF-8 with DOS format.
Sentence_ID [TAB] Sentence
1176 田中さんは 学生では ありません。
1176 Mr Tanaka is not a student.
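Because every raw-sentence file shares the same Sentence_ID key, parallel pairs can be built by joining two files on that ID. The following is a minimal sketch, not part of the corpus tooling: the function names read_sentences and pair_by_id are hypothetical, the sample data is held in memory instead of being read from data_jpn.txt and data_eng.txt, and IDs are kept as strings.

```python
import io


def read_sentences(stream):
    """Map Sentence_ID -> sentence for one raw-sentence file."""
    sentences = {}
    for line in stream:
        line = line.rstrip("\r\n")  # files use DOS (CRLF) line endings
        if not line:
            continue
        sentence_id, sentence = line.split("\t", 1)
        sentences[sentence_id] = sentence
    return sentences


def pair_by_id(source, target):
    """Keep only the IDs that occur in both languages."""
    shared = source.keys() & target.keys()
    return {i: (source[i], target[i]) for i in shared}


# In-memory stand-ins for data_jpn.txt and data_eng.txt.
jpn = read_sentences(io.StringIO("1176\t田中さんは 学生では ありません。\r\n"))
eng = read_sentences(io.StringIO("1176\tMr Tanaka is not a student.\r\n"))
pairs = pair_by_id(jpn, eng)
print(pairs["1176"])
```

For files on disk, passing a file object opened with open(path, encoding="utf-8") to read_sentences should work the same way.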
Sentence_ID [TAB] URL
Sentence_ID [LINEBREAK] token [LINEBREAK] token [LINEBREAK] <EOS>
3627
Buku
ini
mempunyai
se-
ratus
dua
puluh
muka surat
.
<EOS>
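The one-token-per-line layout above can be read with a small state machine: the first line of a block is the Sentence_ID, every following line is a token, and <EOS> closes the block. This is a sketch under those assumptions; read_tokens is a hypothetical helper, and blank lines are skipped as a precaution.

```python
def read_tokens(lines):
    """Map Sentence_ID -> list of tokens for one tokenized file."""
    sentences = {}
    sentence_id = None
    tokens = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # tolerate stray blank lines
        if line == "<EOS>":
            if sentence_id is not None:
                sentences[sentence_id] = tokens
            sentence_id, tokens = None, []
        elif sentence_id is None:
            sentence_id = line  # first line of a block is the ID
        else:
            tokens.append(line)  # a token may contain spaces
    return sentences


sample = (
    "3627\r\nBuku\r\nini\r\nmempunyai\r\nse-\r\nratus\r\n"
    "dua\r\npuluh\r\nmuka surat\r\n.\r\n<EOS>\r\n"
)
token_map = read_tokens(sample.splitlines(keepends=True))
print(token_map["3627"])
```

Splitting on line breaks rather than on spaces is what keeps a multiword token such as "muka surat" in one piece.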
Sentence_ID [TAB] Sentence
White space: Phrasal boundary
Dash: Morpheme boundary
1176 n-pr-postp n pref-v-suf-suf
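The two boundary markers above suggest a two-level split: whitespace separates phrases and a dash separates the morphemes inside a phrase. The sketch below applies that to the sample line. The helper name parse_pos_line is hypothetical, and the code assumes that no tag itself contains a dash or whitespace.

```python
def parse_pos_line(line):
    """Return (Sentence_ID, phrases), each phrase being a list of morpheme tags."""
    sentence_id, tagged = line.rstrip("\r\n").split("\t", 1)
    # Whitespace marks phrasal boundaries; dashes mark morpheme boundaries.
    phrases = [phrase.split("-") for phrase in tagged.split()]
    return sentence_id, phrases


pos_id, pos_phrases = parse_pos_line("1176\tn-pr-postp n pref-v-suf-suf\r\n")
print(pos_id, pos_phrases)
```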
Sentence_ID.n [TAB] bracketing (for the n-th sentence for Sentence_ID)
3695.2 [S [CP [Conj Tapi] [CP [C *C_decl*] [TP [DP_a [D saya]] [T'[T *T*] [AP [AP [AdvP [Adv agak]] [AP [DP *t*<a>] [A letih]]] [AdvP [Adv sedikit]]]]]]] [PU .]]
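A bracketing of this shape can be turned into a nested structure with a short recursive-descent parser. The following sketch is only illustrative: parse_tree and leaves are hypothetical names, and treating asterisk-initial leaves such as *T* as unpronounced elements is an assumption made here, not something the format description states.

```python
import re

# A token is an opening bracket, a closing bracket, or a run of other characters.
TOKEN = re.compile(r"\[|\]|[^\s\[\]]+")


def parse_tree(bracketing):
    """Parse '[Label child ...]' into nested (label, children) tuples."""
    tokens = TOKEN.findall(bracketing)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "[":
            raise ValueError("expected '[' at token %d" % pos)
        pos += 1  # consume "["
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())
            else:
                children.append(tokens[pos])  # terminal string
                pos += 1
        pos += 1  # consume "]"
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("unexpected material after the root node")
    return tree


def leaves(tree):
    """Collect terminal strings from left to right."""
    out = []
    for child in tree[1]:
        out.extend(leaves(child) if isinstance(child, tuple) else [child])
    return out


line = ("3695.2\t[S [CP [Conj Tapi] [CP [C *C_decl*] [TP [DP_a [D saya]] "
        "[T'[T *T*] [AP [AP [AdvP [Adv agak]] [AP [DP *t*<a>] [A letih]]] "
        "[AdvP [Adv sedikit]]]]]]] [PU .]]")
tree_id, bracketing = line.split("\t", 1)
sentence_id, n = tree_id.split(".")  # "3695", "2"
tree = parse_tree(bracketing)
# Assumption: leaves beginning with "*" are not pronounced.
overt = [leaf for leaf in leaves(tree) if not leaf.startswith("*")]
print(sentence_id, n, " ".join(overt))
```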
Sentence_ID [TAB] Japanese_token_index-target_language_token_index
1176 0-1 1-0 3-3 8-2 9-4
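Each alignment link pairs a Japanese token index with a target-language token index, so a line can be unpacked into integer tuples. This is a rough sketch with a hypothetical helper name, parse_alignment. The sample contains index 0, which suggests zero-based indexing into the tokenized files, but that is an inference rather than something stated here.

```python
def parse_alignment(line):
    """Return (Sentence_ID, [(japanese_index, target_index), ...])."""
    sentence_id, *pairs = line.split()  # splits on the tab, spaces and CRLF
    links = []
    for pair in pairs:
        jpn_index, target_index = pair.split("-")
        links.append((int(jpn_index), int(target_index)))
    return sentence_id, links


align_id, links = parse_alignment("1176\t0-1 1-0 3-3 8-2 9-4\r\n")
print(align_id, links)
```

Since the alignment is described as partial, code that consumes the links should not expect every token index to appear.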
See the second paper above and its supplement for the details of the interpersonal meaning feature system.
Sentence_ID, Gender, Marital status, Honour, Age, Social status, Role, Group, Formality, Number
3243,female,,,,neutral,,,,sg
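A row in this layout maps naturally onto a dictionary keyed by the column names listed above. The sketch below uses Python's csv module and drops empty cells so that only annotated features remain. It assumes the files carry no header row (if they do, the first row would need to be skipped); read_ip_rows is a hypothetical name.

```python
import csv
import io

FIELDS = ["Sentence_ID", "Gender", "Marital status", "Honour", "Age",
          "Social status", "Role", "Group", "Formality", "Number"]


def read_ip_rows(stream):
    """Return one dict per row, keeping only the non-empty cells."""
    reader = csv.DictReader(stream, fieldnames=FIELDS)
    return [{key: value for key, value in row.items() if value}
            for row in reader]


# In-memory stand-in for a speaker or addressee annotation file.
sample = io.StringIO("3243,female,,,,neutral,,,,sg\r\n", newline="")
rows = read_ip_rows(sample)
print(rows[0])
```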
Token_index, token, Gender, Marital status, Honour, Age, Social status, Role, Group, Formality, Number
3845,,,,,,,,,,
0,Cô,female,,,elder.parents_younger_sibling,,parents_sibling.paternal,,,
1,tôi,,,,,neutral,,,,sg
2,làm việc,,,,,,,,,
3,ở,,,,,,,,,
4,cửa hàng,,,,,,,,,
5,hoa,,,,,,,,,
6,.,,,,,,,,,
<EOS>,,,,,,,,,,
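The lexical annotation combines the block layout of the tokenized files with the feature columns: a row holding only the Sentence_ID opens a block, token rows follow, and an <EOS> row closes it. A minimal sketch of a reader under that reading of the example is given below; read_ip_lex is a hypothetical name, and the rule that the first row after <EOS> is the Sentence_ID row is an assumption drawn from the sample.

```python
import csv
import io

FEATURES = ["Gender", "Marital status", "Honour", "Age", "Social status",
            "Role", "Group", "Formality", "Number"]


def read_ip_lex(stream):
    """Map Sentence_ID -> list of (token_index, token, features)."""
    sentences = {}
    sentence_id = None
    for row in csv.reader(stream):
        if not row:
            continue
        if row[0] == "<EOS>":
            sentence_id = None  # block finished
        elif sentence_id is None:
            sentence_id = row[0]  # first row of a block holds the ID
            sentences[sentence_id] = []
        else:
            features = {name: value
                        for name, value in zip(FEATURES, row[2:]) if value}
            sentences[sentence_id].append((int(row[0]), row[1], features))
    return sentences


sample = io.StringIO(
    "3845,,,,,,,,,,\r\n"
    "0,Cô,female,,,elder.parents_younger_sibling,,parents_sibling.paternal,,,\r\n"
    "1,tôi,,,,,neutral,,,,sg\r\n"
    "2,làm việc,,,,,,,,,\r\n"
    "3,ở,,,,,,,,,\r\n"
    "4,cửa hàng,,,,,,,,,\r\n"
    "5,hoa,,,,,,,,,\r\n"
    "6,.,,,,,,,,,\r\n"
    "<EOS>,,,,,,,,,,\r\n",
    newline="",
)
lex = read_ip_lex(sample)
print(lex["3845"][0])
```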
The sound files for the following sentences come from TUFS Open Language Resources.
The Malay and Indonesian sentences were tokenized manually by Hiroki Nomoto and David Moeljadi, respectively. All clitics (i.e. -nya, -lah, -kah) were tokenized. In addition, instances of the prefix se- were tokenized when they were cardinal numerals. Note that -nya used as a suffix (as opposed to the clitic -nya) and the non-numeral instances of se- were not tokenized. The following dictionaries were consulted when it was not immediately obvious whether a word sequence constituted a multiword expression.
The Thai sentences were tokenized using the tokenize function of Deepcut and then checked by Sunisa Wittayapanyanon and Yuka Sato. The principle adopted for the manual correction is:
The Vietnamese sentences were tokenized using the word_tokenize function of the Undersea - Vietnamese NLP Project and then checked by Junta Nomura and Hiroki Nomoto. The following dictionary was consulted when it was not immediately obvious whether a syllable sequence constituted a multi-syllable expression.
.txt files to ETA: Easy Text Annotator.