Pre Modern Chinese Corpus Dataset

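A pre-modern Chinese corpus dataset for natural language processing: a corpus of ancient and classical Chinese (文言文) for digital humanities and computational linguistics.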

Project README

Pre-modern_Chinese_language_corpus

If you use this pre-modern Chinese corpus/dataset in a research paper or engineering project, please cite:

蒋彦廷，潘雨婷，杨乐. 基于统计与词嵌入的近代汉语动量结构研究[J]. 西华大学学报（哲学社会科学版），2020，39（2）：23−32.

JIANG Yan-ting, PAN Yu-ting, YANG Le. A Research on Verbal Classifiers Collocation in Pre-modern Chinese Based on Statistics and Word Embedding[J]. Journal of Xihua University (Philosophy & Social Sciences), 2020, 39(2): 23-32.

2020-02-18 update:

Fixed the broken download link.

2018-11-21 update:

1. Added essay (prose) texts for all 6 eras.

2. The total number of characters increased by more than 19.38 million.

3. Representative works updated:

元_散文_姚燧_牧庵集.txt
元_散文_戴表元_剡源文集（不含韵文部分）.txt
元_散文_掲傒斯_文安集.txt
元_散文_苏天爵_元文类.txt
元_散文_苏天爵_滋溪文稿.txt
宋_散文_王安石_临川文集（不含前38卷韵文）.txt
宋_散文_祖无择_龙学文集.txt
宋_散文_群星_五百家播芳大全文粹.txt
宋_散文_群星_宋文鉴（不含韵文部分）.txt
宋_散文_群星_辽文萃.txt
宋_散文_苏轼_东坡全集（不含前33卷韵文）.txt
明_散文_群星_明文海.txt
明_散文_群星_晚明二十家小品.txt
明_散文_群星_皇明文征（不含韵文部分）.txt
民国_散文_巴金_巴金散文集.txt
民国_散文_徐志摩_徐志摩散文集.txt
民国_散文_朱自清_朱自清散文集.txt
民国_散文_杨绛_杨绛文集.TXT
民国_散文_梁实秋_林语堂散文集.txt
民国_散文_梁实秋_梁实秋散文集.txt
民国_散文_老舍_老舍散文集.txt
民国_散文_茅盾_茅盾散文集.txt
民国_散文_萧红_散文集.txt
民国_散文_郭沫若_郭沫若散文选集.txt
民国_散文_鲁迅_鲁迅文集.txt
清_散文_刘文武_清文精选(不含晚清梁启超林纾等).txt
清_散文_游戏主人_笑林广记.txt
清_散文_群星_皇清文颖.txt
清末_散文_群星_晚清文选.txt

1.【Introduction】

This is a pre-modern Chinese corpus of more than 280 million characters.

The total size is over 966 MB, spread across 968 text (TXT) files. All texts are UTF-8 encoded and arranged in dynastic order (Song, Yuan, Ming, Early Qing, Late Qing, and Republic of China).

Each file is also labelled with its literature type and author.
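File names listed in the 2018-11-21 update, such as 元_散文_姚燧_牧庵集.txt, suggest a dynasty_category_author_title naming pattern. The following is a minimal sketch, assuming that pattern holds for the whole corpus, of how the labels might be read back from the file names and how characters might be tallied per dynasty and category. The function names (`parse_name`, `count_chars`, `summarize`) are hypothetical, and the counting rule (skip whitespace and Unicode punctuation or symbols) is an assumption, not necessarily the rule used for the statistics in this README.

```python
from collections import Counter
from pathlib import Path
import unicodedata


def parse_name(filename):
    """Split 'dynasty_category_author_title.txt' into its four labels.

    Assumption: the pattern is inferred from the example file names in
    this README; names that do not fit it return None.
    """
    stem = Path(filename).stem
    parts = stem.split("_", 3)  # the title may itself contain underscores
    if len(parts) != 4:
        return None
    dynasty, category, author, title = parts
    return {"dynasty": dynasty, "category": category,
            "author": author, "title": title}


def count_chars(text):
    """Count characters, skipping whitespace, punctuation and symbols.

    Assumption: 'punctuation' means Unicode general categories P* and S*.
    Digits and Latin letters are still counted.
    """
    return sum(
        1
        for ch in text
        if not ch.isspace() and unicodedata.category(ch)[0] not in ("P", "S")
    )


def summarize(corpus_dir):
    """Return a Counter mapping (dynasty, category) to a character count."""
    totals = Counter()
    for path in sorted(Path(corpus_dir).rglob("*")):
        # Some files use an upper-case .TXT extension.
        if not path.is_file() or path.suffix.lower() != ".txt":
            continue
        labels = parse_name(path.name)
        if labels is None:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        totals[(labels["dynasty"], labels["category"])] += count_chars(text)
    return totals
```

Calling `summarize("corpus/")`, where `corpus/` is a placeholder for wherever the files were unpacked, would return one count per (dynasty, category) pair found in the file names.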

2.【Application areas of this corpus】

These language resources can support research in literature, philology, history, linguistics, the arts, traditional Chinese medicine, and the history of science and technology, as well as Chinese teaching, data mining, and automatic text classification.

3.【Types of language resources】

The types of literature include:

(1) Poetry (诗歌);

(2) Ci lyric poetry (词);

(3) Drama (剧曲);

(4) Novels and huaben story texts (小说话本);

(5) Military literature (军事类);

(6) Traditional Chinese medicine literature (中医类);

(7) Arts literature (技艺类), e.g. musical instruments, chess, calligraphy, cooking, tea, Chinese kung fu;

(8) Mathematical and physical sciences (数理科学): mathematics, algorithms, astronomy, chemistry, physics;

(9) Agricultural literature (农业类);

(10) History and geography literature (历史地理类);

(11) Essays, i.e. non-verse prose (散文类).

4.【Corpus classification】

All texts are divided into 6 parts: (1) Song dynasty, (2) Yuan dynasty, (3) Ming dynasty, (4) Early Qing dynasty (1644-1840), (5) Late Qing dynasty (1840-1911), (6) Republic of China (1912-1948).

5.【Number of characters in each category (excluding punctuation)】

Dynasty \ Category	Essays	Novels and huaben	History and geography	Poetry and Ci	Medicine	Agriculture	Drama	Mathematical and physical sciences	Arts	Military	Total characters
Song	5820561	141317	12835787	1680594	5419232	18930	0	285620	33288	445545	26680874
Yuan	1319350	1378162	5375872	2835050	1869542	189182	2423584	116977	50850	0	15558569
Ming	6423460	17357555	27279817	929987	15728504	552105	2639445	1454890	187069	803206	73356038
Early Qing	882491	33290363	39011391	544178	10659597	5692	1040341	3749246	501007	0	89684306
Late Qing	744835	9436857	19075096	124220	511873	0	1411883	0	0	19670	31324434
Republic of China	3853165	9458024	20204169	160852	319042	0	427896	0	0	136671	34559819
Total	19043862	71062278	123782132	6274881	34507790	765909	7943149	5606733	772214	1405092	271164040
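The totals can be cross-checked by summing the per-dynasty figures. A minimal sketch, with the numbers copied from the table above; the English row labels and variable names are shorthand introduced here for illustration.

```python
# Per-dynasty character counts copied from the table (punctuation excluded).
# Column order: essays, novels, history/geography, poetry, medicine,
# agriculture, drama, math/science, arts, military.
ROWS = {
    "Song": [5820561, 141317, 12835787, 1680594, 5419232, 18930, 0, 285620, 33288, 445545],
    "Yuan": [1319350, 1378162, 5375872, 2835050, 1869542, 189182, 2423584, 116977, 50850, 0],
    "Ming": [6423460, 17357555, 27279817, 929987, 15728504, 552105, 2639445, 1454890, 187069, 803206],
    "Early Qing": [882491, 33290363, 39011391, 544178, 10659597, 5692, 1040341, 3749246, 501007, 0],
    "Late Qing": [744835, 9436857, 19075096, 124220, 511873, 0, 1411883, 0, 0, 19670],
    "Republic of China": [3853165, 9458024, 20204169, 160852, 319042, 0, 427896, 0, 0, 136671],
}

# Row totals and grand total as printed in the table's last column.
ROW_TOTALS = {
    "Song": 26680874,
    "Yuan": 15558569,
    "Ming": 73356038,
    "Early Qing": 89684306,
    "Late Qing": 31324434,
    "Republic of China": 34559819,
}
GRAND_TOTAL = 271164040

# Sum each category column across the six eras, and each era across categories.
column_sums = [sum(col) for col in zip(*ROWS.values())]
row_sums = {name: sum(values) for name, values in ROWS.items()}

assert row_sums == ROW_TOTALS
assert sum(column_sums) == GRAND_TOTAL
print(column_sums)
```

Both assertions pass with these figures, and the second column (novels and huaben) sums to 71062278.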

6.【Where to download these language resources?】

To obtain the corpus, email [email protected], add QQ 540980735, or add WeChat ID jyt629000.

If you have any questions, or want to help enlarge this free, open corpus, please contact the editor: Jiang Yanting ([email protected]). Thanks!

Open Source Agenda is not affiliated with "Pre Modern Chinese Corpus Dataset" Project. README Source: JiangYanting/Pre-modern_Chinese_corpus_dataset

