NLP made easy
This release includes the following fixes:
As we prepare for the NumPy-based GluonNLP development, we are making the following adjustments to the branch usage:
This patch release includes the following bug fix:
This release includes the bug fix for https://github.com/dmlc/gluon-nlp/pull/1158 (#1167). It affects the determinism of the order of special tokens in the instantiated vocabulary object on Python 3.5. Users of Python 3.5 are strongly encouraged to upgrade to this version.
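A minimal sketch of the behavior in question (the toy counter and the printed slice are illustrative, not taken from the release notes): instantiating a Vocab places the special tokens at the front of idx_to_token, and it is the relative order of those entries that could vary between runs on Python 3.5 before this fix.

```python
from collections import Counter
import gluonnlp as nlp

# Toy corpus counts; the actual tokens do not matter for the ordering issue.
counter = Counter(['hello', 'world', 'hello'])

# The special tokens (unknown, padding, bos, eos) occupy the first slots of
# idx_to_token. Prior to this fix, their relative order was not guaranteed to
# be deterministic on Python 3.5.
vocab = nlp.Vocab(counter, bos_token='<bos>', eos_token='<eos>')
print(vocab.idx_to_token[:4])
```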
INT8 Quantization for BERT Sentence Classification and Question Answering (#1080)! Also check out the blog post.
Enhancements to the pretraining script (#1121, #1099) and faster tokenizer for BERT (#921, #1024) as well as multi-GPU support for SQuAD fine-tuning (#1079).
Make BERT a HybridBlock (#877).
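With BERT implemented as a HybridBlock, the model can be hybridized and exported to a symbol/params pair after a single forward pass. A minimal sketch assuming the standard model zoo call; the dummy input shapes and the 'bert_base' output prefix are illustrative choices, not part of the release:

```python
import mxnet as mx
import gluonnlp as nlp

# Load pre-trained BERT base without the pre-training decoder/classifier heads.
model, vocab = nlp.model.get_model(
    'bert_12_768_12',
    dataset_name='book_corpus_wiki_en_uncased',
    use_decoder=False,
    use_classifier=False)

model.hybridize(static_alloc=True)

# One forward pass with dummy token ids, segment ids and valid lengths builds
# the cached symbolic graph.
token_ids = mx.nd.ones((1, 8))
token_types = mx.nd.zeros((1, 8))
valid_length = mx.nd.array([8])
seq_encoding, cls_encoding = model(token_ids, token_types, valid_length)

# Writes bert_base-symbol.json and bert_base-0000.params for use outside Gluon.
model.export('bert_base')
```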
The XLNet model introduced by Yang, Zhilin, et al. in "XLNet: Generalized Autoregressive Pretraining for Language Understanding". The model was converted from the original repository (#866).
GluonNLP further provides scripts for finetuning XLNet on the GLUE (#995) and SQuAD (#1130) datasets that reproduce the authors' results. Check out the usage.
The DistilBERT model introduced by Sanh, Victor, et al. in "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter" (#922).
Add a separate Transformer inference script to make inference easy and to make it convenient to analyze the performance of Transformer inference (#852).
Pre-trained Korean BERT is available as part of GluonNLP (#1057)
GluonNLP now provides scripts for finetuning RoBERTa (#931).
GPT-2 is now a HybridBlock; the model can be exported for running from other MXNet language bindings (#1010).
Fixed the analogy-max-vocab-size argument (#904).
Fixed the blocks for importance sampling (model.ISDense) and noise contrastive estimation (model.NCEDense).
This release covers a few fixes for the bugs reported:
Fixed the bert/embedding.py script.
Updated the SimVerb3500 dataset URL to the aclweb hosted version.
Fixed bert/pretraining_utils.py, which potentially caused a crash when Horovod MPI is used for training.
Trainer assumes a deterministic parameter creation order for distributed training. The attention cell for BERT and Transformer has a non-deterministic parameter creation order in v0.8.1 and v0.8.0, which will cause divergence during distributed training. It is now fixed.
Note that since v0.8.2, the default branch of the gluon-nlp GitHub repository will be switched to the latest stable branch, instead of the master branch under development.
Source | GluonNLP | google-research/bert | google-research/bert |
---|---|---|---|
Model | bert_12_768_12 | bert_12_768_12 | bert_24_1024_16 |
Dataset | openwebtext_book_corpus_wiki_en_uncased | book_corpus_wiki_en_uncased | book_corpus_wiki_en_uncased |
SST-2 | 95.3 | 93.5 | 94.9 |
RTE | 73.6 | 66.4 | 70.1 |
QQP | 72.3 | 71.2 | 72.1 |
SQuAD 1.1 | 91.0/84.4 | 88.5/80.8 | 90.9/84.1 |
STS-B | 87.5 | 85.8 | 86.5 |
MNLI-m/mm | 85.3/84.9 | 84.6/83.4 | 86.7/85.9 |
The SciBERT model introduced by Iz Beltagy, Arman Cohan, and Kyle Lo in "SciBERT: Pretrained Contextualized Embeddings for Scientific Text". The model checkpoints are converted from the original AllenAI repository with the following datasets (#735):
scibert_scivocab_uncased
scibert_scivocab_cased
scibert_basevocab_uncased
scibert_basevocab_cased
The BioBERT model introduced by Lee, Jinhyuk, et al. in "BioBERT: a pre-trained biomedical language representation model for biomedical text mining". The model checkpoints are converted from the original repository with the following datasets (#735):
biobert_v1.0_pmc_cased
biobert_v1.0_pubmed_cased
biobert_v1.0_pubmed_pmc_cased
biobert_v1.1_pubmed_cased
The ClinicalBERT model introduced by Kexin Huang, Jaan Altosaar, and Rajesh Ranganath in "ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission". The model checkpoints are converted from the original repository with the clinicalbert_uncased dataset (#735).
The ERNIE model introduced by Sun, Yu, et al. in "ERNIE: Enhanced Representation through Knowledge Integration". You can get the model checkpoints converted from the original repository with model.get_model("ernie_12_768_12", "baidu_ernie_uncased") (#759), thanks @paperplanet.
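For reference, a minimal sketch of loading the converted checkpoint through the model zoo call quoted above; printing the block and the vocabulary size is only a quick sanity check:

```python
import gluonnlp as nlp

# Retrieve the converted ERNIE model together with its vocabulary.
ernie, vocab = nlp.model.get_model('ernie_12_768_12',
                                   dataset_name='baidu_ernie_uncased')
print(ernie)       # inspect the architecture
print(len(vocab))  # vocabulary size of baidu_ernie_uncased
```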
BERT fine-tuning script for named entity recognition on CoNLL2003 with test F1 92.2 (#612).
BERT fine-tuning script for the Chinese XNLI dataset with 78.3% validation accuracy (#759), thanks @paperplanet.
BERT fine-tuning script for intent classification and slot labelling on ATIS (95.9 F1) and SNIPS (95.9 F1). (#817)
GPT-2 language models (gpt2_117m, gpt2_345m) trained on the openai_webtext dataset (#761).
Fix for the case where emb[emb.unknown_token] != 0 (#763).