A curated list of resources dedicated to Natural Language Processing (NLP) in Polish: models, tools, and datasets.
Clean Polish OSCAR - preprocessed Polish OSCAR corpus with foreign (non-Polish) sentences and invalid Polish sentences (e.g. enumerations) removed. Preprocessed by @Ermlab.
OSCAR or Open Super-large Crawled ALMAnaCH coRpus - a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus. Contains 109GB or 49GB of Polish text, depending on the version.
Polish Wikipedia dump - regular monthly copy of Polish Wikipedia. More than 4GB of text.
Opus - the open parallel corpus. You can select languages and download only the Polish files.
Polish Parliamentary Corpus - text from proceedings of the Polish Parliament (Sejm and Senate)
Morfologik (Java) and pyMorfologik (Python wrapper) - dictionary-based morphological analyzer
Morfeusz - morphological analyzer. See also Elasticsearch plugin
Stempel (Python port) - algorithmic stemmer. See also Elasticsearch plugin
spaCy for Polish - extends spaCy, a popular production-ready NLP library, to fully support the Polish language.
spacy-pl by IPI PAN - integrating existing Polish language tools and resources into the spaCy pipeline
KRNNT Polish morphological tagger - a morphological tagger for Polish based on recurrent neural networks. Paper
Stanza (Python) - a natural language analysis package from Stanford University. It provides tools for sentence and word tokenization, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. Includes a Polish model.
Duckling (Haskell) - library for parsing text into structured data with support for Polish
A curated list of Polish abbreviations for the NLTK sentence tokenizer, based on Wikipedia text
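
To illustrate why an abbreviation list matters for sentence tokenization, here is a minimal sketch in plain Python. It is a simplified stand-in, not NLTK's Punkt algorithm, and the `POLISH_ABBREVIATIONS` sample and the `split_sentences` helper are hypothetical names introduced only for this example. The real abbreviation list linked above is much longer.

```python
import re

# Hypothetical sample of abbreviations (assumption, not the curated list itself).
POLISH_ABBREVIATIONS = {"np", "prof", "dr", "ul", "tzw", "m.in"}


def split_sentences(text, abbreviations=POLISH_ABBREVIATIONS):
    """Split on '.', '!' or '?' followed by whitespace,
    unless the period ends a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        # Last word before the punctuation mark, within the current sentence.
        preceding = text[start:match.start()].split()
        last_word = preceding[-1].lower() if preceding else ""
        if text[match.start()] == "." and last_word in abbreviations:
            continue  # the period belongs to an abbreviation, keep going
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "Prof. Nowak mieszka przy ul. Długiej. Lubi owoce, np. jabłka i gruszki. A ty?"
for sentence in split_sentences(text):
    print(sentence)
# Prof. Nowak mieszka przy ul. Długiej.
# Lubi owoce, np. jabłka i gruszki.
# A ty?
```

Without the abbreviation set, the same text would be cut after "Prof.", "ul." and "np.", which is the kind of error a curated abbreviation list is meant to prevent.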
If you have or know valuable materials (datasets, models, posts, articles) that are missing here, please feel free to edit and submit a pull request. You can also send me a note on LinkedIn or via email:[email protected].