EasyNLP: A Comprehensive and Easy-to-use NLP Toolkit
EasyNLP is an easy-to-use NLP development and application toolkit in PyTorch, first released inside Alibaba in 2021. It is built with scalable distributed training strategies and supports a comprehensive suite of NLP algorithms for various NLP applications. EasyNLP integrates knowledge distillation and few-shot learning to make large pre-trained models practical to deploy, together with various popular multi-modality pre-trained models. It provides a unified framework for model training, inference, and deployment in real-world applications. It has powered more than 10 business units (BUs) and more than 20 business scenarios within the Alibaba Group. It is seamlessly integrated with Platform of AI (PAI) products, including PAI-DSW for development, PAI-DLC for cloud-native training, PAI-EAS for serving, and PAI-Designer for zero-code model training.
We have a series of technical articles on the functionalities of EasyNLP.
You can set up EasyNLP from source:
$ git clone https://github.com/alibaba/EasyNLP.git
$ cd EasyNLP
$ python setup.py install
This repo is tested on Python 3.6 and PyTorch >= 1.8.
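After installation, a quick sanity check is to import one of the classes used in the examples below (a minimal sketch; the exact message printed is just illustrative):

$ python -c "from easynlp.core import Trainer; print('EasyNLP is ready')"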
The following example shows how to build a BERT-based text classification model with just a few lines of code.
from easynlp.appzoo import ClassificationDataset
from easynlp.appzoo import get_application_model, get_application_evaluator
from easynlp.core import Trainer
from easynlp.utils import initialize_easynlp, get_args
from easynlp.utils.global_vars import parse_user_defined_parameters
from easynlp.utils import get_pretrain_model_path

# Initialize the distributed environment and parse command-line arguments
initialize_easynlp()
args = get_args()
user_defined_parameters = parse_user_defined_parameters(args.user_defined_parameters)
pretrained_model_name_or_path = get_pretrain_model_path(
    user_defined_parameters.get('pretrain_model_name_or_path', None))

# Build the training dataset from the first file in --tables
train_dataset = ClassificationDataset(
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    data_file=args.tables.split(",")[0],
    max_seq_length=args.sequence_length,
    input_schema=args.input_schema,
    first_sequence=args.first_sequence,
    second_sequence=args.second_sequence,
    label_name=args.label_name,
    label_enumerate_values=args.label_enumerate_values,
    user_defined_parameters=user_defined_parameters,
    is_training=True)

# Build the validation dataset from the last file in --tables
valid_dataset = ClassificationDataset(
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    data_file=args.tables.split(",")[-1],
    max_seq_length=args.sequence_length,
    input_schema=args.input_schema,
    first_sequence=args.first_sequence,
    second_sequence=args.second_sequence,
    label_name=args.label_name,
    label_enumerate_values=args.label_enumerate_values,
    user_defined_parameters=user_defined_parameters,
    is_training=False)

# Instantiate the application model (here, a BERT classifier for text_classify)
model = get_application_model(
    app_name=args.app_name,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    num_labels=len(valid_dataset.label_enumerate_values),
    user_defined_parameters=user_defined_parameters)

# Train the model, evaluating on the validation set during training
trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    user_defined_parameters=user_defined_parameters,
    evaluator=get_application_evaluator(
        app_name=args.app_name,
        valid_dataset=valid_dataset,
        user_defined_parameters=user_defined_parameters,
        eval_batch_size=args.micro_batch_size))
trainer.train()
The complete example can be found here.
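Since the snippet reads its configuration from the command line via initialize_easynlp() and get_args(), it is run as a script. A minimal invocation sketch, assuming the code above is saved as main.py (a hypothetical filename) and the SST-2-style TSV files from the AppZoo example below:

$ python main.py \
  --mode=train \
  --tables=train.tsv,dev.tsv \
  --input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
  --first_sequence=sent1 \
  --label_name=label \
  --label_enumerate_values=0,1 \
  --checkpoint_dir=./classification_model \
  --epoch_num=1 \
  --sequence_length=128 \
  --app_name=text_classify \
  --user_defined_parameters='pretrain_model_name_or_path=bert-small-uncased'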
You can also use the AppZoo Command Line Tools to quickly train an application model. Take text classification on the SST-2 dataset as an example. First, download train.tsv and dev.tsv, then start training:
$ easynlp \
--mode=train \
--worker_gpu=1 \
--tables=train.tsv,dev.tsv \
--input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
--first_sequence=sent1 \
--label_name=label \
--label_enumerate_values=0,1 \
--checkpoint_dir=./classification_model \
--epoch_num=1 \
--sequence_length=128 \
--app_name=text_classify \
--user_defined_parameters='pretrain_model_name_or_path=bert-small-uncased'
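Here, --tables points to the training and evaluation files, and --input_schema describes the columns of the input TSV files; following the convention used in EasyNLP, each comma-separated entry has the form column_name:column_type:field_length (e.g., sent1:str:1 declares a string column named sent1). --first_sequence then selects which column is fed to the model as input text.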
Then run prediction:
$ easynlp \
--mode=predict \
--tables=dev.tsv \
--outputs=dev.pred.tsv \
--input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
--output_schema=predictions,probabilities,logits,output \
--append_cols=label \
--first_sequence=sent1 \
--checkpoint_path=./classification_model \
--app_name=text_classify
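Assuming the standard TSV output, dev.pred.tsv contains one column for each field listed in --output_schema, with the gold label appended via --append_cols for easy comparison against the predictions.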
To learn more about the usage of AppZoo, please refer to our documentation.
EasyNLP currently provides a range of pre-trained models in ModelZoo. Please refer to this readme for the usage of these models in EasyNLP. Meanwhile, EasyNLP supports loading pretrained models from Huggingface/Transformers; please refer to this tutorial for details.
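For instance, assuming a Hugging Face model identifier is accepted by the pretrain_model_name_or_path parameter (the tutorial describes the exact naming scheme), the AppZoo training command above can be pointed at a Hugging Face checkpoint by changing a single flag:

--user_defined_parameters='pretrain_model_name_or_path=bert-base-uncased'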
EasyNLP also integrates various popular multi-modality pre-trained models for vision-language tasks that require visual knowledge. For example, it is equipped with CLIP-style models for text-image matching and DALLE-style models for text-to-image generation.
EasyNLP provides few-shot learning and knowledge distillation to help deploy large pre-trained models in practice.
EasyNLP provides a simple toolkit for benchmarking CLUE datasets. You can run the benchmark with a single command:
# Format: bash run_clue.sh device_id train/predict dataset
# e.g.:
bash run_clue.sh 0 train csl
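Following the same format, evaluation on a dataset uses the predict mode, e.g.:

bash run_clue.sh 0 predict csl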
We have tested the Chinese BERT and RoBERTa models on these datasets; the results on the dev sets are:
(1) bert-base-chinese:
Task | AFQMC | CMNLI | CSL | IFLYTEK | OCNLI | TNEWS | WSC |
---|---|---|---|---|---|---|---|
P | 72.17% | 75.74% | 80.93% | 60.22% | 78.31% | 57.52% | 75.33% |
F1 | 52.96% | 75.74% | 81.71% | 60.22% | 78.30% | 57.52% | 80.82% |
(2) chinese-roberta-wwm-ext:
Task | AFQMC | CMNLI | CSL | IFLYTEK | OCNLI | TNEWS | WSC |
---|---|---|---|---|---|---|---|
P | 73.10% | 80.75% | 80.07% | 60.98% | 80.75% | 57.93% | 86.84% |
F1 | 56.04% | 80.75% | 81.50% | 60.98% | 80.75% | 57.93% | 89.58% |
Here is the detailed CLUE benchmark example.
This project is licensed under the Apache License (Version 2.0). This toolkit also contains some code modified from other repos under other open-source licenses. See the NOTICE file for more information.
Scan the following QR codes to join the DingTalk discussion group. The group discussions are mostly in Chinese, but English is also welcome.
We have an arXiv paper that you can cite for the EasyNLP library:
@article{easynlp,
  doi = {10.48550/ARXIV.2205.00258},
  url = {https://arxiv.org/abs/2205.00258},
  author = {Wang, Chengyu and Qiu, Minghui and Zhang, Taolin and Liu, Tingting and Li, Lei and Wang, Jianing and Wang, Ming and Huang, Jun and Lin, Wei},
  title = {EasyNLP: A Comprehensive and Easy-to-use Toolkit for Natural Language Processing},
  publisher = {arXiv},
  year = {2022}
}