Lemmatization Lists

Machine-readable lists of lemma-token pairs in 25 languages.

These are large-coverage, machine-readable lists of lemma/token pairs in several languages, which I have collected (legally) from various sources, mostly as part of my work on the Global Glossary project. I use them for query expansion in full-text search: if a user searches for the lemma walk, the query is expanded to also match the tokens walking, walked, etc.
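The query expansion described above can be sketched roughly as follows. This is only an illustration, not the code used in Global Glossary: the function name `expand_query` and the in-memory `index` dictionary (lemma mapped to a set of tokens) are assumptions made for the example.

```python
def expand_query(term, index):
    """Return the search term followed by every token listed under it as a lemma."""
    forms = [term]
    # Sort the tokens so the result is deterministic; skip duplicates of the term itself.
    for token in sorted(index.get(term, ())):
        if token not in forms:
            forms.append(token)
    return forms


# Hypothetical in-memory index: lemma -> set of tokens.
index = {"walk": {"walking", "walked", "walks"}}

expanded = expand_query("walk", index)
print(expanded)              # ['walk', 'walked', 'walking', 'walks']
print(" OR ".join(expanded))  # one possible way to hand the forms to a search engine
```

A term that is not present as a lemma is returned unchanged, so the search still runs on the original word.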

The files are plain text (zipped). Each line contains one lemma/token pair, separated by a tab character, in this order: lemma, tab, token. The files are encoded in UTF-8 with Windows-style (CRLF) line breaks.
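As a minimal sketch of reading this format, the Python below parses the contents of one unzipped list into a lemma-to-tokens mapping. The names `parse_pairs`, `build_index`, and `load_index` are hypothetical. Opening the file with the `utf-8-sig` codec is an assumption that guards against a possible byte order mark and is harmless if there is none.

```python
from collections import defaultdict


def parse_pairs(text):
    """Yield (lemma, token) tuples from the text of one list file."""
    # str.splitlines() handles Windows-style \r\n line breaks as well as \n.
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        lemma, sep, token = line.partition("\t")
        if not sep:
            continue  # ignore malformed lines that contain no tab
        yield lemma, token


def build_index(pairs):
    """Group tokens by lemma: {lemma: {token, ...}}."""
    index = defaultdict(set)
    for lemma, token in pairs:
        index[lemma].add(token)
    return index


def load_index(path):
    """Read an unzipped list file from disk (the path is supplied by the caller)."""
    with open(path, encoding="utf-8-sig") as handle:
        return build_index(parse_pairs(handle.read()))


# Small in-memory sample in the same layout: lemma, tab, token, CRLF.
sample = "walk\twalking\r\nwalk\twalked\r\ngo\twent\r\n"
index = build_index(parse_pairs(sample))
```

Here `index["walk"]` holds the tokens `walking` and `walked`, which is the shape needed for looking up all forms of a lemma at query time.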

Asturian (ast) (108,792 pairs)
Bulgarian (bg) (30,323 pairs)
Catalan (ca) (591,534 pairs)
Czech (cs) (36,400 pairs)
English (en) (41,760 pairs)
Estonian (et) (80,536 pairs)
French (fr) (224,002 pairs)
Galician (gl) (392,856 pairs)
German (de) (358,473 pairs)
Hungarian (hu) (39,898 pairs)
Irish (ga) (415,502 pairs)
Manx Gaelic (gv) (67,177 pairs)
Italian (it) (341,074 pairs)
Persian/Farsi (fa) (6,273 pairs)
Polish (pl) (3,296,232 pairs)
Portuguese (pt) (850,264 pairs)
Romanian (ro) (314,810 pairs)
Russian (ru) (537,810 pairs)
Scottish Gaelic (gd) (51,624 pairs)
Slovak (sk) (858,414 pairs)
Slovene (sl) (99,063 pairs)
Spanish (es) (497,560 pairs)
Swedish (sv) (675,137 pairs)
Ukrainian (uk) (193,703 pairs)
Welsh (cy) (359,224 pairs)

Licence

Available under the Open Database License

Sources

Various Hunspell dictionaries from the OpenOffice.org website
Deutsches Morphologie-Lexikon by Daniel Naber
Lexique by Boris New and Christophe Pallier
e_lemma.txt by Yasumasa Someya
Multext East (only those morphological lexicons that are under a free licence are used)
Morphological dictionaries from FreeLing
SALDO morphological lexicon
Irish National Morphology Database
Various lists by Kevin Scannell
OpenRussian.org

Open Source Agenda is not affiliated with "Lemmatization Lists" Project. README Source: michmech/lemmatization-lists
