Sub-Character Representation Learning
Code and corpora for the paper "Dual Long Short-Term Memory Networks for Sub-Character Representation Learning" (accepted at ITNG 2018).
We propose learning character-level and sub-character-level representations jointly to capture deeper semantic meaning. Applied to Chinese Word Segmentation as a case study, our approach achieved state-of-the-art results on both Simplified and Traditional Chinese, without any extra Traditional-to-Simplified Chinese conversion.
Simply run one command:
./script/run.sh pku 1
It does everything for you on the fly, including data preparation, training, and testing.
You can replace pku with msr, cityu, or as. You can also change 1 to any model number up to 6; details are listed in the next section.
We have presented 6 models in our paper. Their configurations are shown in the following table:
#. model | char | subchar | radical | tie | bigram |
---|---|---|---|---|---|
1. baseline | YES | ||||
2. +subchar | YES | ||||
3. +radical | YES | YES | |||
4. +radical -char | YES | ||||
5. +radical +tie | YES | YES | YES | ||
6. +radical +tie +bigram | YES | YES | YES | YES |
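To try every model on every corpus, the run command above can be wrapped in a small loop. The following is a minimal sketch, not part of the repository: the function name `run_all` and its optional runner argument are hypothetical, and it assumes only that `./script/run.sh` takes a corpus name followed by a model number, as in the command shown earlier. Each invocation performs its own data preparation, training, and testing, so a full sweep may take a long time.

```shell
# Hypothetical helper (not part of the repository): runs every corpus/model pair.
# The runner defaults to ./script/run.sh; pass another command (e.g. echo) for a dry run.
run_all() {
  runner="${1:-./script/run.sh}"
  for corpus in pku msr cityu as; do
    for model in 1 2 3 4 5 6; do
      # Stop at the first failing run instead of continuing silently.
      "$runner" "$corpus" "$model" || return 1
    done
  done
}
```

Calling `run_all echo` prints the 24 corpus/model pairs without training anything, which is a cheap way to check the sweep before calling `run_all` with no argument to invoke the real script.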