Somiao Pinyin: Train your own Chinese Input Method with Seq2seq Model 搜喵拼音输入法
Personalized Chinese Pinyin Input Method with Seq2seq model
The original code, written for research purposes, is at https://github.com/Kyubyong/neural_chinese_transliterator.
This repository experiments with different training data and interactive user input, with the possible goal of developing a real Pinyin input product that is personalized to the user's data and runs its model locally.
Python (>=3.5)
TensorFlow (>=r1.2)
xpinyin (for Chinese pinyin annotation)
distance (for calculating the similarity score between two strings)
tqdm
STEP 1. Download Leipzig Chinese Corpus
Extract it and copy zho_news_2007-2009_1M-sentences.txt to the data/ folder.
Or use your own Chinese corpus in the same format.
STEP 2. Build a Pinyin-Chinese parallel corpus.
#python3 build_corpus.py
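As a rough illustration of what a Pinyin-Chinese parallel corpus looks like, the sketch below pairs each sentence with its unspaced pinyin. It is not the code in build_corpus.py: it assumes each corpus line has the form `id<TAB>sentence`, and it uses a tiny hypothetical lookup table (`TOY_PINYIN`) in place of a real annotator such as xpinyin so that it runs with the standard library only. The names `annotate` and `build_pairs` are made up for this example.

```python
# Hypothetical toy table; a real pipeline would use an annotator such as xpinyin.
TOY_PINYIN = {"你": "ni", "好": "hao", "成": "cheng", "功": "gong", "了": "le"}


def annotate(sentence, table=TOY_PINYIN):
    """Return (pinyin, hanzi) for a sentence, or None if any character is unknown."""
    syllables = []
    for ch in sentence:
        if ch not in table:
            return None  # skip sentences the toy table cannot annotate
        syllables.append(table[ch])
    return "".join(syllables), sentence


def build_pairs(lines):
    """Parse 'id<TAB>sentence' lines (assumed format) into (id, pinyin, hanzi) tuples."""
    for line in lines:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) != 2:
            continue  # malformed line without a tab separator
        sent_id, sentence = parts
        pair = annotate(sentence.strip())
        if pair is not None:
            yield sent_id, pair[0], pair[1]


sample = ["1\t你好", "2\t成功了", "no tab here", "3\t你们"]
pairs = list(build_pairs(sample))
for sent_id, pinyin, hanzi in pairs:
    print(sent_id, pinyin, hanzi)
```

Sentences containing characters the table does not cover are dropped here for simplicity; a real annotator would cover far more characters and would also need a policy for digits, punctuation, and Latin text.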
STEP 3. Run prepro.py to make vocabulary and training data.
#python3 prepro.py
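To make the idea of "vocabulary and training data" concrete, here is a minimal sketch of a character-level vocabulary and a fixed-length encoder. It is an assumption about the general technique, not a copy of prepro.py: the special tokens `<PAD>` and `<UNK>`, the `min_count` threshold, and the function names are all hypothetical.

```python
from collections import Counter


def build_vocab(sequences, min_count=1):
    """Map each symbol seen at least min_count times to an integer index."""
    counts = Counter(ch for seq in sequences for ch in seq)
    # Index 0 is reserved for padding, index 1 for rare or unseen symbols.
    tokens = ["<PAD>", "<UNK>"] + sorted(
        ch for ch, n in counts.items() if n >= min_count
    )
    tok2idx = {tok: i for i, tok in enumerate(tokens)}
    idx2tok = {i: tok for tok, i in tok2idx.items()}
    return tok2idx, idx2tok


def encode(seq, tok2idx, max_len):
    """Convert a string to a list of max_len indices, truncating or padding."""
    ids = [tok2idx.get(ch, tok2idx["<UNK>"]) for ch in seq[:max_len]]
    return ids + [tok2idx["<PAD>"]] * (max_len - len(ids))


hanzi2idx, idx2hanzi = build_vocab(["你好", "成功了"])
encoded = encode("你好吗", hanzi2idx, max_len=5)
print(encoded)
```

The same two functions could be applied to the pinyin side with its own vocabulary. The resulting dictionaries are the kind of object one would typically save with pickle so that training and evaluation agree on the indices.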
STEP 4. Adjust hyperparameters in hyperparams.py if necessary.
STEP 5. Train the model
#python3 train.py
For command line input testing, run:
python3 eval.py
To evaluate on the original test data instead, change which function eval.py calls as its main function.
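The requirements list the distance package for scoring the similarity between two strings. As a hedged sketch of how such a score could be computed when comparing a prediction with the expected text, the pure-Python version below uses Levenshtein edit distance normalized by the longer length. Whether eval.py uses exactly this normalization is an assumption, and `levenshtein` and `similarity` are names chosen for this example.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(expected, predicted):
    """Return a score in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(expected), len(predicted))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(expected, predicted) / longest


# One wrong character out of three, as in a prediction that confuses homophones.
score = similarity("大使馆", "大事馆")
print(round(score, 3))
```

Averaging this score over a test set gives a simple character-level accuracy measure that tolerates insertions and deletions as well as substitutions.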
Download the pre-trained model from the blog and unzip it to generate /log and /data.
Remember to overwrite the pickle files in /data with the ones shipped with the pre-trained model.
Then run the command line input test:
python3 eval.py
The model was trained on Chinese news from 2007-2009, so many sayings that are common now were not learned.
请输入测试拼音:nihao
你好
请输入测试拼音:chenggongle
成功了
请输入测试拼音:wolegequ
我了个曲
请输入测试拼音:taibangla
太棒啦
请输入测试拼音:dacolehuizenmeyang
打破了会怎么样
请输入测试拼音:pujinghehujintaotongdianhua
普京和胡锦涛通电话
请输入测试拼音:xiangbuqilaishinianqianfashengleshenme
想不起来十年前发生了什么
请输入测试拼音:meiguohongzhawomenzainansilafudedashiguan
美国轰炸我们在南斯拉夫的大事馆
请输入测试拼音:liudehuanageshihouhaonianqing
刘德华那个时候好年轻
请输入测试拼音:shishihouxunlianyixiabilibilideyuliaole
是时候训练一下比例比例的预料了
Pretrained models on different contexts
Model selection, to switch between models depending on what is being typed (chatting, writing scientific papers, etc.)
Function to record LOCALLY what the user has typed, as a personalized corpus
User Interface
...