Thai2transformers Versions

Pretraining transformer-based Thai language models

att-v1.0

2 years ago

This release contains the cleaned datasets we used to pre-train a transformer-based Thai language model (WangchanBERTa; wangchanberta-base-att-spm-uncased).

The cleaned datasets are only partially available, since the data from Wisesight, Pantip, and TNC is not under explicit open-source licenses.

qa-v0.2

2 years ago

Combines the iapp_wiki_qa_squad, thaiqa_squad and xquad training sets, and uses the validation and test sets from iapp_wiki_qa_squad. All contexts that are similar (mUSE cosine similarity > 0.8) are removed from the training sets.

DatasetDict({
    train: Dataset({
        features: ['question_id', 'article_id', 'title', 'context', 'question', 'answers'],
        num_rows: 10916
    })
    validation: Dataset({
        features: ['question_id', 'article_id', 'title', 'context', 'question', 'answers'],
        num_rows: 742
    })
    test: Dataset({
        features: ['question_id', 'article_id', 'title', 'context', 'question', 'answers'],
        num_rows: 739
    })
})
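
The similarity filter described for this release can be sketched roughly as follows. This is a minimal illustration, not the code used to build the dataset. It assumes that context embeddings (for example from mUSE) have already been computed, and that training contexts are compared against held-out (validation and test) contexts, which the release note does not state explicitly. The function name `keep_mask` and the toy vectors are hypothetical.

```python
import numpy as np

def keep_mask(train_emb, heldout_emb, threshold=0.8):
    """Return a boolean mask over training rows to keep.

    A training row is kept when its highest cosine similarity to any
    held-out row is at or below `threshold`.
    Assumes no embedding is the zero vector (it would divide by zero).
    """
    train = np.asarray(train_emb, dtype=float)
    held = np.asarray(heldout_emb, dtype=float)
    # L2-normalise rows so that a dot product equals cosine similarity.
    train_n = train / np.linalg.norm(train, axis=1, keepdims=True)
    held_n = held / np.linalg.norm(held, axis=1, keepdims=True)
    sims = train_n @ held_n.T  # shape: (n_train, n_heldout)
    return sims.max(axis=1) <= threshold

# Toy 2-D vectors stand in for real sentence embeddings (assumption).
train_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
heldout_emb = np.array([[1.0, 0.1]])
mask = keep_mask(train_emb, heldout_emb)
print(mask.tolist())
```

With a Hugging Face `Dataset`, the kept rows could then be selected with `train.select(np.flatnonzero(mask))`, assuming the embedding rows are in the same order as the dataset rows.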