# Entity recognition codes for "2019 Datagrand Cup: Text Information Extraction Challenge"

Some entity recognition models for the 2019 Datagrand Cup: Text Information Extraction Challenge.
## Models

Each model is a combination of three components:

- input representation: pre-trained word2vec embeddings, or pre-trained ELMo representations (file suffix `_elmo`);
- encoder: BiLSTM or DGCNN (see `sequence_labeling.py`);
- decoder: softmax, CRF, or pointer network (file suffix `_pointer`).

According to the three components described above, there actually exist 12 (2 × 2 × 3) models in all. However, this repo only implements the following 6:

- `sequence_labeling.py`: BiLSTM or DGCNN encoder with a softmax or CRF decoder (4 models);
- `bilstm_pointer.py`: BiLSTM encoder with a pointer decoder;
- `bilstm_pointer_elmo.py`: like `bilstm_pointer.py`, but with ELMo input representations.

Other models can be implemented by adding or modifying a few lines of code; a minimal sketch of one combination follows.
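To make the encoder/decoder split concrete, here is a minimal TensorFlow 1.x sketch of the BiLSTM encoder + CRF decoder combination. It is an illustration only, assuming placeholder layer sizes and tag counts; it is not the actual code in `sequence_labeling.py`.

```python
import tensorflow as tf

def bilstm_crf(word_ids, seq_lens, labels,
               vocab_size=20000, embed_dim=256,
               hidden_dim=256, num_tags=17):
    """word_ids/labels: [batch, max_len] int32; seq_lens: [batch] int32."""
    embeddings = tf.get_variable("embeddings", [vocab_size, embed_dim])
    inputs = tf.nn.embedding_lookup(embeddings, word_ids)

    # BiLSTM encoder: concatenate forward and backward hidden states
    cell_fw = tf.nn.rnn_cell.LSTMCell(hidden_dim)
    cell_bw = tf.nn.rnn_cell.LSTMCell(hidden_dim)
    (out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
        cell_fw, cell_bw, inputs, sequence_length=seq_lens, dtype=tf.float32)
    hidden = tf.concat([out_fw, out_bw], axis=-1)

    # per-token tag scores, with a CRF decoder on top
    logits = tf.layers.dense(hidden, num_tags)
    log_likelihood, trans = tf.contrib.crf.crf_log_likelihood(
        logits, labels, seq_lens)
    loss = tf.reduce_mean(-log_likelihood)
    pred_tags, _ = tf.contrib.crf.crf_decode(logits, trans, seq_lens)
    return loss, pred_tags
```

Swapping the CRF for softmax amounts to replacing the CRF layer with `tf.nn.sparse_softmax_cross_entropy_with_logits` over the same logits, which is why the encoder and decoder choices combine freely.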
## Usage

Data preparation:

1. Put the raw competition data in the `data` folder and run `bin/trans_data.py` to transform it into the required format.
2. Prepare the `vocab` and `tag` files:
   - `vocab`: the word vocabulary, one word per line, in `word word_count` format;
   - `tag`: the BIOES NER tag list, one tag per line (with `O` on the first line).
3. Train word2vec embeddings; see `bin/train_w2v.py` for an example. (A sketch of building the `vocab` file follows this list.)
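For concreteness, here is one hypothetical way to produce the `vocab` file in the `word word_count` format described above. The corpus path `data/train.txt` is an assumption about what `bin/trans_data.py` emits.

```python
from collections import Counter

# count whitespace-separated tokens in the transformed training corpus
counter = Counter()
with open("data/train.txt", encoding="utf-8") as f:
    for line in f:
        counter.update(line.split())

# write one "word word_count" pair per line, most frequent first
with open("data/vocab", "w", encoding="utf-8") as f:
    for word, count in counter.most_common():
        f.write("{} {}\n".format(word, count))
```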
Training and prediction:

- Modify `config.py`, then run `python sequence_labeling.py [bilstm/dgcnn] [softmax/crf]` or `python bilstm_pointer.py`. Remember to modify `config.model_name` before a new run, or the old model will be overridden.
- For the ELMo model, the ELMo representations are dumped to file first and then loaded for train/dev/test, rather than computed on the fly. Dump the representations of the `train_full`/`dev`/`test` data from the pre-trained ELMo weights (see the sketch below), modify `config.py`, then run `python bilstm_pointer_elmo.py`.
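The dumping step can be done with `dump_bilm_embeddings` from [allenai/bilm-tf](https://github.com/allenai/bilm-tf). This is a minimal sketch; all file paths are assumptions, and the repo's own dumping script may differ.

```python
import h5py
from bilm import dump_bilm_embeddings

# dump all biLM layers for every sentence in the dataset to an HDF5 file
dump_bilm_embeddings(
    "data/vocab_elmo",            # vocabulary file used to train the biLM
    "data/train_full.txt",        # one whitespace-tokenized sentence per line
    "elmo/options.json",          # biLM architecture options
    "elmo/weights.hdf5",          # pre-trained biLM weights
    "data/train_full_elmo.hdf5",  # output file
)

# each sentence's representation is stored under its line index as key
with h5py.File("data/train_full_elmo.hdf5", "r") as fin:
    sent0 = fin["0"][...]  # shape: [n_layers, n_tokens, 2 * projection_dim]
```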
## Training ELMo

To train ELMo on the competition corpus, modify `bin/train_elmo.py` (the training code follows [allenai/bilm-tf](https://github.com/allenai/bilm-tf)):

- change `vocab = load_vocab(args.vocab_file, 50)` to `vocab = load_vocab(args.vocab_file, None)`, so the vocabulary is loaded without character inputs;
- set `n_train_tokens` to the token count of your corpus;
- remove `char_cnn` from `options`;
- modify `lstm.dim`/`lstm.projection_dim` as you wish.

My settings were `n_gpus=2`, `n_train_tokens=94114921`, `lstm['dim']=2048`, `projection_dim=256`, and `n_epochs=10`; training took about 17 hours on 2 GTX 1080 Ti GPUs.
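For reference, here is a sketch of how the modified section of `bin/train_elmo.py` might look with the settings above. It is based on the default script in bilm-tf; options not mentioned above (batch size, unroll steps, etc.) are assumed to keep their defaults.

```python
from bilm.training import load_vocab

# args comes from the script's argparse setup;
# None (instead of 50) disables character inputs
vocab = load_vocab(args.vocab_file, None)

batch_size = 128           # per-GPU batch size (bilm-tf default; an assumption here)
n_gpus = 2
n_train_tokens = 94114921  # token count of the competition corpus

options = {
    'bidirectional': True,
    # the 'char_cnn' entry is removed: inputs are word-level only
    'dropout': 0.1,
    'lstm': {
        'cell_clip': 3,
        'dim': 2048,
        'n_layers': 2,
        'proj_clip': 3,
        'projection_dim': 256,
        'use_skip_connections': True,
    },
    'all_clip_norm_val': 10.0,
    'n_epochs': 10,
    'n_train_tokens': n_train_tokens,
    'batch_size': batch_size,
    'n_tokens_vocab': vocab.size,
    'unroll_steps': 20,
    'n_negative_samples_batch': 8192,
}
```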