Yorùbá language training text for NLP, ASR and TTS tasks
This repository contains fully diacritized Yorùbá text, converted to Unicode Normalization Form Composition (NFC) format, where diacritized characters are composed into a single character with the following code:
def convert_to_NFC(filename, outfilename):
text=''.join(c for c in unicodedata.normalize('NFC', open(filename).read()))
with open(outfilename, 'w') as f:
f.write(text)
Text has been gathered with permission from online sources, and lightly preprocessed for use in NLP, TTS, ASR applications. Note, some of the sentences may have errors, please submit a pull-request if you have corrections!
If you want to cite this repo in your work, please use:
@misc{Orife_yoruba-text_2018,
author = {Orife, Iroro and Fasubaa, Timilehin and Wahab, Olamilekan},
month = {1},
title = {{yoruba-text}},
url = {https://github.com/Niger-Volta-LTI/yoruba-text},
year = {2018}
}