Subword Encoding in Lattice LSTM for Chinese Word Segmentation
Subword encoding for Word Segmentation using Lattice LSTM.
Models and results are described in our paper, Subword Encoding in Lattice LSTM for Chinese Word Segmentation.
Python: 2.7
PyTorch: 0.3.0
Input is in CoNLL format (the BMES tag scheme is preferred), with one character and its label per line. Sentences are separated by a blank line.
中 B-SEG
国 E-SEG
最 B-SEG
大 E-SEG
氨 B-SEG
纶 M-SEG
丝 E-SEG
生 B-SEG
产 E-SEG
基 B-SEG
地 E-SEG
在 S-SEG
连 B-SEG
云 M-SEG
港 E-SEG
建 B-SEG
成 E-SEG

新 B-SEG
华 M-SEG
社 E-SEG
北 B-SEG
京 E-SEG
十 B-SEG
二 M-SEG
月 E-SEG
二 B-SEG
十 M-SEG
六 M-SEG
日 E-SEG
电 S-SEG
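The format above can be read and written with a few lines of code. The following is a minimal sketch, not part of this repository: the function names `read_bmes`, `bmes_to_words`, and `words_to_bmes` are hypothetical, and the `SEG` label suffix is taken from the sample. It assumes each non-empty line holds exactly one character and one tag separated by whitespace.

```python
def read_bmes(lines):
    """Group 'char tag' lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # Blank line: close the current sentence if it has any tokens.
            if current:
                sentences.append(current)
                current = []
            continue
        char, tag = line.split()
        current.append((char, tag))
    if current:
        sentences.append(current)
    return sentences


def bmes_to_words(sentence):
    """Recover the word sequence from a list of (char, tag) pairs."""
    words, buf = [], ""
    for char, tag in sentence:
        prefix = tag.split("-")[0]
        if prefix in ("B", "S") and buf:
            # A new word starts while one is still open: flush the open one.
            words.append(buf)
            buf = ""
        buf += char
        if prefix in ("E", "S"):
            words.append(buf)
            buf = ""
    if buf:
        words.append(buf)
    return words


def words_to_bmes(words, label="SEG"):
    """Convert segmented words into (char, tag) pairs in the BMES scheme."""
    pairs = []
    for word in words:
        if len(word) == 1:
            pairs.append((word, "S-" + label))
            continue
        pairs.append((word[0], "B-" + label))
        for char in word[1:-1]:
            pairs.append((char, "M-" + label))
        pairs.append((word[-1], "E-" + label))
    return pairs
```

Passing an open file object (or any iterable of lines) to `read_bmes` and then calling `bmes_to_words` on each sentence yields the segmented words, for example 连云港 from the B/M/E run in the sample. The sketch is written for Python 3 string handling; under Python 2.7 the input would need to be decoded to unicode first.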
The pretrained character and word embeddings are the same as those used in the baseline of RichWordSegmentor.
main.py
Modify run_seg.py by adding your train/dev/test file directory.
sh run_seg.py
Cite our paper as:
@inproceedings{yang2019subword,
title={Subword Encoding in Lattice LSTM for Chinese Word Segmentation},
author={Jie Yang and Yue Zhang and Shuailong Liang},
booktitle={NAACL},
year={2019}
}