Can CNNs transliterate Pinyin into Chinese characters correctly?
In this project, we examine how well neural networks can convert Pinyin, the official romanization system for Chinese, into Chinese characters.
We frame the problem as a labelling task. In other words, every character of the Pinyin input is labelled with either a Chinese character or _, which means a blank.
Inputs: woaini。
Outputs: 我_爱_你_。
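To make the labelling scheme concrete, here is a minimal sketch of how an aligned input/output pair like the one above could be built. It assumes the Pinyin is already split into syllables, one per Chinese character, and that punctuation maps to itself. The helper names make_labels and strip_blanks are hypothetical and are not taken from the project's scripts.

```python
BLANK = "_"  # label for Pinyin letters that produce no Chinese character


def make_labels(syllables, hanzi):
    """Return (inputs, outputs) strings of equal length.

    Each Chinese character is placed on the first letter of its syllable,
    and the remaining letters of that syllable are labelled with a blank.
    Assumes one syllable (or punctuation mark) per character in `hanzi`.
    """
    if len(syllables) != len(hanzi):
        raise ValueError("need exactly one syllable per Chinese character")
    labels = []
    for syllable, char in zip(syllables, hanzi):
        labels.append(char + BLANK * (len(syllable) - 1))
    return "".join(syllables), "".join(labels)


def strip_blanks(labels):
    """Recover the plain Chinese sentence from a label sequence."""
    return labels.replace(BLANK, "")


inputs, outputs = make_labels(["wo", "ai", "ni", "。"], "我爱你。")
print(inputs)                 # woaini。
print(outputs)                # 我_爱_你_。
print(strip_blanks(outputs))  # 我爱你。
```

Because inputs and outputs have the same length, the model can predict one label per input position, and the final sentence is obtained by dropping the blanks.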
data/input.csv
* Copy zho_news_2007-2009_1M-sentences.txt to the data/ folder.
* Run build_corpus.py to build a Pinyin-Chinese parallel corpus.
* Run prepro.py to make vocabulary and training data.
* Run train.py. Or download the pretrained files.
* Run eval.py.

The accuracy changes like this:
[Figure: accuracy curve]
The evaluation metric is CER (Character Error Rate). Its formula is CER = edit distance / number of characters.
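As a rough illustration of how such a score can be computed, the sketch below takes CER to be the Levenshtein edit distance between the predicted and reference sentences, divided by the number of reference characters. The function names edit_distance and cer are hypothetical and are not taken from eval.py. Pooling the counts over the whole test set, rather than averaging per sentence, is an assumption suggested by the ratios reported in the results table.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Total edit distance divided by total reference characters (assumed pooling)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


# One wrong character out of three reference characters.
score = cer(["我爱你"], ["我爱他"])
print(round(score, 2))
```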
The following are the results after 19 (nine) or 20 (qwerty) epochs. Details are available in the eval folder.
Layout | Proposed | SwiftKey 6.4.8.57 |
---|---|---|
QWERTY | 1203/10437=0.12 | 717/10437=0.07 |
NINE | 2104/10437=0.20 | 1775/10437=0.17 |