汉字字符特征提取器 (featurizer),提取汉字的特征(发音特征、字形特征)用做深度学习的特征 | A Chinese character feature extractor, which extracts the features of Chinese characters (pronunciation features, glyph features) as features for deep learning
在深度学习中,很多场合需要提取汉字的特征(发音特征、字形特征)。本项目提供了一个通用的字符特征提取框架,并内建了 拼音
、字形
(四角编码) 和 部首拆解
的特征。
胡
-> hú
,福
-> fú
门
-> 37001
,闩
-> 37101
闩
-> ['门', '一']
,闫
-> ['门', '三']
from hanzi_char_featurizer import Featurizor
featurizor = Featurizor()
result = featurizor.featurize('明天')
print(result)
输出
([['m'], ['t']], [['ing'], ['ian']], [['2'], ['1']], ('6', '1'), ('7', '0'), ('0', '8'), ('2', '0'), ('0', '4'))
import tensorflow as tf
import hanzi_char_featurizer
feature = hanzi_char_featurizer.featurize_as_tensor('./usage/data.txt')
with tf.Session() as sess:
sess.run(tf.initializers.tables_initializer())
for _ in range(1):
print('+' * 20)
data = sess.run(feature)
print(data)
输出
++++++++++++++++++++
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]