A TensorFlow implementation of the BiLSTM+CRF model, for sequence labeling tasks.
Sequential labeling is a typical methodology for modeling sequence prediction tasks in NLP. Common sequential labeling tasks include, e.g., part-of-speech (POS) tagging, chunking, and named entity recognition (NER). Taking the NER task as an example:
Stanford University located at California .
B-ORG I-ORG O O B-LOC O
Here, two entities, Stanford University and California, are to be extracted. Specifically, each token in the text is tagged with a corresponding label, e.g., {token: Stanford, label: B-ORG}.
The sequence labeling model aims to predict the label sequence, given a token sequence.
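To make the token/label correspondence concrete, BIO labels can be decoded back into entity spans. Below is a minimal, illustrative sketch (the helper `extract_entities` is not part of this repo), applied to the example above:

```python
def extract_entities(tokens, labels):
    """Decode a BIO label sequence into (entity, type) pairs.

    Assumes a well-formed BIO sequence (every I- tag continues the
    span opened by the preceding B- tag).
    """
    entities, current, current_type = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current:  # close any span still open
                entities.append((" ".join(current), current_type))
            current, current_type = [token], label[2:]
        elif label.startswith("I-") and current:
            current.append(token)
        else:  # "O" (or a stray I- tag): close the open span, if any
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

tokens = ["Stanford", "University", "located", "at", "California", "."]
labels = ["B-ORG", "I-ORG", "O", "O", "B-LOC", "O"]
print(extract_entities(tokens, labels))
# [('Stanford University', 'ORG'), ('California', 'LOC')]
```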
BiLSTM+CRF, proposed by Lample et al., 2016, is so far the most classical and stable neural model for sequential labeling tasks.
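Conceptually, the BiLSTM encoder produces a per-token emission score for every tag, and the CRF layer adds tag-to-tag transition scores; the score of a whole tag path is the sum of both. A toy sketch with made-up numbers (not taken from the model):

```python
# Tags: 0 = O, 1 = B-LOC, 2 = I-LOC
# Hypothetical emission scores from the BiLSTM, one row per token.
emissions = [
    [0.1, 2.0, 0.2],   # token 1 strongly favors B-LOC
    [0.3, 0.1, 1.5],   # token 2 favors I-LOC
    [2.2, 0.1, 0.1],   # token 3 favors O
]
# Transition scores between consecutive tags: transitions[prev][next].
transitions = [
    [0.5, 0.4, -2.0],  # O -> I-LOC is penalized
    [0.1, -1.0, 1.0],  # B-LOC -> I-LOC is encouraged
    [0.3, 0.2, 0.5],
]

def path_score(path, emissions, transitions):
    """Sum of emission scores along the path plus transition scores."""
    score = sum(emissions[i][tag] for i, tag in enumerate(path))
    score += sum(transitions[a][b] for a, b in zip(path, path[1:]))
    return score

print(path_score([1, 2, 0], emissions, transitions))  # B-LOC I-LOC O
```

Training maximizes the (normalized) score of the gold path; decoding picks the highest-scoring path with the Viterbi algorithm.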
Features:

- configuring all settings
- four modes: train / test / interactive_predict / api_service
- labeling schemes: BIO / BIESO
- label types: e.g., PER | LOC | ORG
- logging everything
- web app demo for easy demonstration
- object oriented: BILSTM_CRF, Datasets, Configer, utils
- modularized with clear structure, easy for DIY

See more in the HandBook.
Download the repo for direct use:

```shell
git clone https://github.com/scofield7419/sequence-labeling-BiLSTM-CRF.git
pip install -r requirements.txt
```

Or install the BiLSTM-CRF package as a module:

```shell
pip install BiLSTM-CRF
```

Usage:
```python
from BiLSTM-CRF.engines.BiLSTM_CRFs import BiLSTM_CRFs as BC
from BiLSTM-CRF.engines.DataManager import DataManager
from BiLSTM-CRF.engines.Configer import Configer
from BiLSTM-CRF.engines.utils import get_logger
import os

...

config_file = r'/home/projects/system.config'
configs = Configer(config_file)
logger = get_logger(configs.log_dir)
configs.show_data_summary(logger)  # optional
dataManager = DataManager(configs, logger)
model = BC(configs, logger, dataManager)

###### mode == 'train':
model.train()

###### mode == 'test':
model.test()

###### mode == 'single predicting':
sentence_tokens, entities, entities_type, entities_index = model.predict_single(sentence)
if configs.label_level == 1:
    print("\nExtracted entities:\n %s\n\n" % ("\n".join(entities)))
elif configs.label_level == 2:
    print("\nExtracted entities:\n %s\n\n" % ("\n".join(
        [a + "\t(%s)" % b for a, b in zip(entities, entities_type)])))

###### mode == 'api service webapp':
cmd_new = r'cd demo_webapp; python manage.py runserver %s:%s' % (configs.ip, configs.port)
res = os.system(cmd_new)
```
open `ip:port` in your browser.
```
├── main.py
├── system.config
├── HandBook.md
├── README.md
│
├── checkpoints
│   ├── BILSTM-CRFs-datasets1
│   │   ├── checkpoint
│   │   └── ...
│   └── ...
├── data
│   ├── example_datasets1
│   │   ├── logs
│   │   ├── vocabs
│   │   ├── test.csv
│   │   ├── train.csv
│   │   └── dev.csv
│   └── ...
├── demo_webapp
│   ├── demo_webapp
│   ├── interface
│   └── manage.py
├── engines
│   ├── BiLSTM_CRFs.py
│   ├── Configer.py
│   ├── DataManager.py
│   └── utils.py
└── tools
    ├── calcu_measure_testout.py
    └── statis.py
```
Folders:

- engines: provides the core functioning .py files.
- data: where the datasets are placed.
- checkpoints: where model checkpoints are stored.
- demo_webapp: demonstrates the system in a web page and provides the API.
- tools: provides some offline utilities.

Files:

- main.py: the entry python file for the system.
- system.config: the configuration file for all the system settings.
- HandBook.md: provides some usage instructions.
- BiLSTM_CRFs.py: the main model.
- Configer.py: parses system.config.
- DataManager.py: manages the datasets and scheduling.
- utils.py: provides on-the-fly tools.

Usage proceeds in the following steps:
1. Configure all settings in system.config.
2. Run main.py to train the model.
3. Run main.py to test.
4. Run main.py for interactive prediction.
5. Run main.py for the API service.
Datasets including the trainset, testset, and devset are necessary for full usage. However, if you only want to train the model and then use it offline, only the trainset is needed; after training, you can make inferences with the saved model checkpoint files. If you want to run testing, a testset is also required.
For the trainset, testset, and devset, the common format is as follows:
(Token) (Label)
for O
the O
lattice B_TAS
QCD I_TAS
computation I_TAS
of I_TAS
nucleon-nucleon I_TAS
low-energy I_TAS
interactions E_TAS
. O
It O
consists O
in O
simulating B_PRO
...
(Token) (Label)
马 B-LOC
来 I-LOC
西 I-LOC
亚 I-LOC
副 O
总 O
理 O
。 O
他 O
兼 O
任 O
财 B-ORG
政 I-ORG
部 I-ORG
长 O
...
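Files in this format (one whitespace-separated `token label` pair per line, blank lines between sentences) can be parsed with a few lines of Python. The helper below is an illustrative sketch, not part of the repo:

```python
def read_conll(lines):
    """Parse 'token label' lines into sentences.

    Blank lines separate sentences; each sentence becomes a list of
    (token, label) tuples.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        token, label = line.split()
        current.append((token, label))
    if current:  # flush the last sentence if the file lacks a trailing blank line
        sentences.append(current)
    return sentences

sample = ["for O", "the O", "lattice B_TAS", "QCD I_TAS", "", "It O"]
print(read_conll(sample))
# [[('for', 'O'), ('the', 'O'), ('lattice', 'B_TAS'), ('QCD', 'I_TAS')], [('It', 'O')]]
```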
Note that the testset may contain only the Token column. During testing, the model will output the predicted entities based on test.csv.
Two output files are produced: test.out and test.entity.out (optional). test.out has the same format as the input test.csv. test.entity.out lists the extracted entities for each sentence:
Sentence
entity1 (Type)
entity2 (Type)
entity3 (Type)
...
If you want to adapt this project to your own specific sequence labeling task, the following tips may help.

- Download the repo sources.
- Labeling scheme (most important): choose the scheme (BIO or BIESO) and the label set (e.g., 'B_PER', 'I_LOC') for your task.
- Model: modify the model architecture into the one you want, in BiLSTM_CRFs.py.
- Dataset: adapt your dataset into the correct format.
- Training: train the model on your data.
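Regarding the labeling scheme: the two supported schemes, BIO and BIESO, are mechanically interconvertible (BIESO additionally marks span-final tokens with E- and single-token entities with S-). A hedged sketch of a BIO-to-BIESO converter (illustrative, not part of the repo):

```python
def bio_to_bieso(labels):
    """Convert a BIO label sequence to BIESO.

    Single-token spans become S-, and span-final I- tags become E-.
    Assumes a well-formed BIO input.
    """
    out = []
    for i, label in enumerate(labels):
        nxt = labels[i + 1] if i + 1 < len(labels) else "O"
        if label.startswith("B-"):
            # A B- tag with no continuing I- tag is a single-token entity.
            out.append(("B-" if nxt == "I-" + label[2:] else "S-") + label[2:])
        elif label.startswith("I-"):
            # An I- tag not followed by another I- of the same type ends the span.
            out.append(("I-" if nxt == label else "E-") + label[2:])
        else:
            out.append(label)
    return out

print(bio_to_bieso(["B-ORG", "I-ORG", "O", "B-LOC"]))
# ['B-ORG', 'E-ORG', 'O', 'S-LOC']
```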
For more usage details, please refer to the HandBook.

Feel free to open an issue if anything is wrong.