# Pretraining transformer-based Thai language models
thai2transformers provides customized scripts to pretrain transformer-based masked language models on Thai texts with the following token types:

- SentencePiece BPE (`spm`)
- word-level tokens (`newmm`)
- syllable-level tokens (`syllable`)
- word-level tokens from Limkonchotiwat et al., 2020 (`sefr-cut`, `engine="best"`)
We curate a list of sources that can be used to pretrain language models. The statistics for each data source are listed in this page. You can also download the current version of the cleaned datasets from here.
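The exact cleaning rules are documented with the datasets; as a rough illustration only (not the project's actual pipeline), a minimal pass that normalizes whitespace and drops empty or duplicate lines might look like:

```python
# Minimal text-cleaning sketch for a pretraining corpus (illustrative only;
# the actual thai2transformers cleaning rules live in the repository docs).

def clean_corpus(lines):
    """Normalize whitespace, drop empty lines, and deduplicate while keeping order."""
    seen = set()
    cleaned = []
    for line in lines:
        text = " ".join(line.split())  # collapse runs of whitespace
        if not text or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

if __name__ == "__main__":
    raw = ["  สวัสดี   ครับ ", "", "สวัสดี ครับ", "ภาษาไทย"]
    print(clean_corpus(raw))  # duplicates and the empty line are removed
```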
## a) Instructions for RoBERTa BASE model pretraining on a Thai Wikipedia dump

In this example, we demonstrate how to pretrain a RoBERTa base model on a Thai Wikipedia dump from scratch.
1. Install required libraries: 1_installation.md
2. Prepare the `thwiki` dataset from a Thai Wikipedia dump: 2_thwiki_data-preparation.md
3. Train the tokenizer and build the vocabulary:
   a) For SentencePiece BPE (`spm`), word-level tokens (`newmm`), and syllable-level tokens (`syllable`): 3_train_tokenizer.md
   b) For word-level tokens from Limkonchotiwat et al., 2020 (`sefr-cut`): 3b_sefr-cut_pretokenize.md
4. Pretrain a masked language model: 4_run_mlm.md
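Step 3a uses the SentencePiece library to learn a BPE vocabulary. The core idea of BPE can be sketched in pure Python as repeatedly merging the most frequent adjacent symbol pair — a toy version, not the SentencePiece implementation, which also handles normalization, coverage, and special tokens:

```python
# Toy byte-pair-encoding trainer (illustrative; step 3a uses the
# SentencePiece library rather than this hand-rolled loop).
from collections import Counter

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Represent each word as a tuple of symbols (single characters to start).
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word in the corpus.
        merged = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges

if __name__ == "__main__":
    print(train_bpe(["low", "low", "lower", "lowest"], num_merges=3))
```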
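Step 4 trains with the standard masked-language-modeling objective: RoBERTa-style dynamic masking selects roughly 15% of tokens each pass; of those, 80% are replaced with the mask token, 10% with a random token, and 10% are left unchanged. A minimal sketch of that corruption scheme (illustrative; the actual run relies on the transformers library's data collator, and `VOCAB` below is a made-up toy vocabulary):

```python
# RoBERTa-style dynamic masking sketch (illustrative only).
import random

MASK = "<mask>"
VOCAB = ["กา", "ข่าว", "ดี", "ไทย", "ภาษา"]  # toy vocabulary for random replacement

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Return (corrupted tokens, labels); labels are None at unmasked positions."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                   # the model must predict the original
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)           # 80%: replace with the mask token
            elif roll < 0.9:
                corrupted.append(rng.choice(VOCAB))  # 10%: random token
            else:
                corrupted.append(tok)            # 10%: keep unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels
```

Because the masking is sampled independently each call, every epoch sees a different corruption of the same sentence, which is the "dynamic" part of RoBERTa's scheme.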
## b) Instructions for RoBERTa model finetuning on existing Thai text classification and NER/POS tagging datasets

In this example, we demonstrate how to finetune WangchanBERTa, a RoBERTa base model pretrained on a Thai Wikipedia dump and assorted Thai texts.
1. Finetune the model for sequence classification tasks on existing datasets, including `wisesight_sentiment`, `wongnai_reviews`, `generated_reviews_enth` (review star prediction), and `prachathai67k`: 5a_finetune_sequence_classificaition.md
2. Finetune the model for token classification tasks (NER and POS tagging) on existing datasets, including `thainer` and `lst20`: 5b_finetune_token_classificaition.md
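When finetuning on token classification datasets such as `thainer` or `lst20`, the word-level labels must be aligned to the model's subword tokens. A common convention (a sketch of the usual approach, not necessarily the exact scheme in the linked notebook) labels only the first subword of each word and masks the rest with -100 so they are ignored by the loss:

```python
# Word-to-subword label alignment sketch for token classification.
IGNORE = -100  # loss-ignored label value, the convention used by PyTorch cross-entropy

def align_labels(word_labels, word_ids):
    """Map word-level labels onto subword positions.

    word_ids: for each subword, the index of the word it came from
              (None for special tokens), as produced by fast tokenizers.
    """
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE)            # special tokens carry no label
        elif wid != prev:
            aligned.append(word_labels[wid])  # first subword keeps the word's label
        else:
            aligned.append(IGNORE)            # continuation subwords are ignored
        prev = wid
    return aligned
```

At evaluation time the same mask lets predictions be read off one per word, so word-level metrics stay comparable across tokenizers with different subword granularity.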
## Citation

```
@misc{lowphansirikul2021wangchanberta,
    title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
    author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
    year={2021},
    eprint={2101.09635},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```