Kanji usage frequency

Datasets built from various Japanese language corpora

For a description of the datasets, see the website: https://scriptin.github.io/kanji-frequency/. This README covers only the technical aspects.

You can download the datasets here: https://github.com/scriptin/kanji-frequency/tree/master/data

Building the datasets

You'll need Node.js 18 or later.

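See the scripts section in package.json. Each step listed below is the name of an npm script, so it can be run as npm run followed by the script name (for example, npm run aozora:download).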

Aozora:

aozora:download - use a crawler/scraper to collect the data
aozora:gaiji:extract - extract gaiji notation data from the scraped pages. Gaiji are kanji characters that are replaced with images in the documents, because the Shift-JIS encoding cannot represent them
aozora:gaiji:replacements - build the gaiji replacements file. This produces only partial results, which may need to be completed manually
aozora:clean - clean the scraped pages (apply the gaiji replacements)
aozora:count - create the dataset
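
To make the cleaning step more concrete, here is a minimal sketch of applying gaiji replacements to a scraped page. It is not the project's actual code: the image markup (an img tag with class="gaiji" and an alt description), the map keys such as gaiji-0001, and the function name applyGaijiReplacements are all assumptions made for illustration.

```typescript
// Hypothetical markup: <img class="gaiji" alt="..."> stands in for one kanji.
// The keys and values of the replacements map are made up for this example.
type GaijiReplacements = Map<string, string>;

const GAIJI_IMG = /<img\b[^>]*\bclass="gaiji"[^>]*>/g;
const ALT_ATTR = /\balt="([^"]*)"/;

function applyGaijiReplacements(
  html: string,
  replacements: GaijiReplacements
): string {
  return html.replace(GAIJI_IMG, (tag) => {
    const alt = ALT_ATTR.exec(tag)?.[1];
    // Tags without an alt attribute cannot be looked up; keep them as-is.
    if (alt === undefined) return tag;
    // Unknown gaiji are left untouched so they can be completed manually.
    return replacements.get(alt) ?? tag;
  });
}

const replacements: GaijiReplacements = new Map([["gaiji-0001", "𠮟"]]);
const page =
  '彼は<img class="gaiji" alt="gaiji-0001">られた' +
  '<img class="gaiji" alt="gaiji-9999">';
const cleaned = applyGaijiReplacements(page, replacements);
console.log(cleaned);
```

Leaving unmatched images in place mirrors the note above that the replacements file is only partial: anything still present after cleaning is a candidate for a manual entry.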

Wikipedia:

wikipedia:fetch - fetch random pages using the MediaWiki API
wikipedia:count - create the dataset
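
As a rough illustration of fetching random pages, the sketch below builds a MediaWiki API request using the list=random query module and reads page titles out of a response. The endpoint (Japanese Wikipedia), the page limit, and the helper names randomPagesUrl and extractTitles are assumptions; the project's own script may use different parameters. The response is a canned sample so the sketch runs without network access.

```typescript
// Assumed endpoint for illustration; the project's script may differ.
const API_ENDPOINT = "https://ja.wikipedia.org/w/api.php";

function randomPagesUrl(limit: number): string {
  const params = new URLSearchParams({
    action: "query",
    list: "random",
    rnnamespace: "0", // main (article) namespace only
    rnlimit: String(limit),
    format: "json",
  });
  return `${API_ENDPOINT}?${params.toString()}`;
}

// Shape of the part of a list=random response that this sketch reads.
interface RandomListResponse {
  query: { random: { id: number; ns: number; title: string }[] };
}

function extractTitles(response: RandomListResponse): string[] {
  return response.query.random.map((page) => page.title);
}

const url = randomPagesUrl(5);
console.log(url);

// Canned sample with made-up ids, standing in for a real API response.
const sample: RandomListResponse = {
  query: {
    random: [
      { id: 1, ns: 0, title: "漢字" },
      { id: 2, ns: 0, title: "青空文庫" },
    ],
  },
};
const titles = extractTitles(sample);
console.log(titles);
```

In a real run, the URL would be requested with an HTTP client (Node.js 18 ships a global fetch) and the page text retrieved in a follow-up request for each title.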

News:

news:wikinews:fetch - fetch random pages from Wikinews using the MediaWiki API
news:count - create the dataset
news:dates - create an additional file with the dates of the articles
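
The count scripts in the three sections above each turn collected text into a frequency dataset. A minimal sketch of that idea follows, assuming that "kanji" means any character with the Unicode script property Han and that the output is a list of rows sorted by descending count. Both assumptions, as well as the names countKanji and toSortedRows, are illustrative and not taken from the project.

```typescript
// Assumption: a kanji is any character whose Unicode script is Han.
// Built with the RegExp constructor so it does not depend on compiler target.
const KANJI = new RegExp("\\p{Script=Han}", "gu");

function countKanji(texts: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const text of texts) {
    // match() with the g flag returns every match, or null if there are none.
    for (const ch of text.match(KANJI) ?? []) {
      counts.set(ch, (counts.get(ch) ?? 0) + 1);
    }
  }
  return counts;
}

function toSortedRows(counts: Map<string, number>): [string, number][] {
  // Highest count first; ties are broken by code unit order for stable output.
  return Array.from(counts.entries()).sort(
    (a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : 1)
  );
}

const rows = toSortedRows(countKanji(["日本語の本", "日本"]));
console.log(rows);
```

Hiragana such as の is skipped because it does not have the Han script property, so only the three kanji in the sample input appear in the rows.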

Building the website

See the Astro docs and the scripts section in package.json.

License

Creative Commons Attribution 4.0
