Spacy Fi

Experimental Finnish language model for spaCy

Finnish language model for spaCy. The model does POS tagging, dependency parsing, word vectors, noun phrase extraction, word occurrence probability estimates, morphological features, lemmatization and named entity recognition (NER). The lemmatization is based on Voikko.

The main differences between this model and the Finnish language model in the spaCy core:

This model includes a different lemmatizer implementation compared to spaCy core. My model's lemmatization accuracy is considerably better but the execution speed is slightly lower.
This model requires libvoikko. The spaCy core model does not need any external dependencies.
The training data for this model is partly different, and there are other minor tweaks in the pipeline implementation.

Want a hassle-free installation? Install the spaCy core model. Need the highest possible accuracy, especially for lemmatization? Install this model.

I'm planning to keep experimenting with new ideas in this repository and to push the useful features to the spaCy core after testing them here.

Install the Finnish language model

First, install the libvoikko native library and the Finnish morphology data files.

Next, install the model by running:

pip install spacy_fi_experimental_web_md
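
To check that the pieces are in place after installing, something like the following sketch may help. It only tests whether the Python modules can be found. It does not load the libvoikko native library or the morphology data, so a missing native library would still only show up when the model is loaded. The helper name check_installation is made up for this example, and the module name libvoikko is an assumption about how the Voikko Python bindings are installed.

```python
import importlib.util


def check_installation(model_name="spacy_fi_experimental_web_md"):
    """Report which of the expected Python modules can be found.

    Hypothetical helper: it only looks the modules up, it does not import
    them and does not verify the libvoikko native library or data files.
    """
    # Assumption: the Voikko Python bindings are importable as "libvoikko".
    required = ["spacy", "libvoikko", model_name]
    return {name: importlib.util.find_spec(name) is not None for name in required}


if __name__ == "__main__":
    for name, found in check_installation().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```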

Compatibility with spaCy versions:

spacy-fi version	Compatible with spaCy versions
0.14.0	3.7.x
0.13.0	3.6.x
0.12.0	3.5.x
0.11.0	3.4.x
0.10.0	3.3.x
0.9.0	>= 3.2.1 and < 3.3.0
0.8.x	3.2.x
0.7.x	3.0.x, 3.1.x
0.6.0	3.0.x
0.5.0	3.0.x
0.4.x	2.3.x
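
The table can also be expressed as a small lookup, for example to pick a spacy-fi version for an existing spaCy installation. This is only a sketch transcribed from the rows above. The names COMPATIBILITY and compatible_spacy_fi_versions are made up for this example and are not part of the package.

```python
# (major, minor) of spaCy -> spacy-fi versions, newest first.
# Transcribed by hand from the compatibility table; not an official API.
COMPATIBILITY = {
    (3, 7): ["0.14.0"],
    (3, 6): ["0.13.0"],
    (3, 5): ["0.12.0"],
    (3, 4): ["0.11.0"],
    (3, 3): ["0.10.0"],
    (3, 2): ["0.9.0", "0.8.x"],  # 0.9.0 needs spaCy >= 3.2.1
    (3, 1): ["0.7.x"],
    (3, 0): ["0.7.x", "0.6.0", "0.5.0"],
    (2, 3): ["0.4.x"],
}


def compatible_spacy_fi_versions(spacy_version):
    """Return the spacy-fi versions listed for a spaCy version string."""
    parts = spacy_version.split(".")
    major, minor = int(parts[0]), int(parts[1])
    # Treat a missing or non-numeric patch component (e.g. "0rc1") as 0.
    patch = int(parts[2]) if len(parts) > 2 and parts[2].isdigit() else 0
    candidates = list(COMPATIBILITY.get((major, minor), []))
    # The table lists 0.9.0 only for spaCy >= 3.2.1 and < 3.3.0.
    if (major, minor) == (3, 2) and patch < 1:
        candidates.remove("0.9.0")
    return candidates
```

With spaCy installed, calling compatible_spacy_fi_versions(spacy.__version__) returns the candidates for the current environment. An empty list means the table has no entry for that spaCy version.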

Usage

import spacy

nlp = spacy.load('spacy_fi_experimental_web_md')

doc = nlp('Hän ajoi punaisella autolla.')
for t in doc:
    print(f'{t.lemma_}\t{t.pos_}')

The dependency, part-of-speech and named entity labels are documented on a separate page.
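
Besides lemmas and part-of-speech tags, the other annotations mentioned at the top of this README should be reachable through the standard spaCy Doc and Token attributes. The sketch below collects a few of them. The function names summarize and demo and the example sentence are made up for illustration, and the sketch assumes that the pipeline provides named entities and noun chunks as described above.

```python
def summarize(doc):
    """Collect a few annotations from a processed spaCy Doc."""
    return {
        # Surface form, lemma, POS tag, morphological features, dependency label.
        "tokens": [(t.text, t.lemma_, t.pos_, str(t.morph), t.dep_) for t in doc],
        # Named entities with their labels.
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
        # Noun phrases; assumes the pipeline supports noun chunks.
        "noun_chunks": [chunk.text for chunk in doc.noun_chunks],
    }


def demo():
    """Load the installed model and print the summary for one sentence."""
    import spacy  # imported here so summarize() can be used without spaCy

    nlp = spacy.load("spacy_fi_experimental_web_md")
    summary = summarize(nlp("Hän ajoi punaisella autolla Helsingistä Turkuun."))
    for key, values in summary.items():
        print(key)
        for value in values:
            print("   ", value)
```

Calling demo() requires the model and libvoikko to be installed as described in the installation section.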

Updating the model

Setting up a development environment

# Install the libvoikko native library with Finnish morphology data.
#
# This will install Voikko on Debian/Ubuntu.
# For other distros and operating systems, see https://voikko.puimula.org/python.html
sudo apt install libvoikko1 voikko-fi

python3 -m venv .venv
source .venv/bin/activate
pip install wheel
pip install -r requirements.txt

Training the model

spacy project assets
spacy project run train-pipeline

Optional steps (slow!) for training certain model components. These steps are not strictly required because their results have been pre-computed and stored in git.

Train floret embeddings:

spacy project run floret-vectors

Pretrain tok2vec weights:

spacy project run pretrain

Testing

Unit tests:

python -m pytest tests/unit

Functional tests for a trained model:

python -m pytest tests/functional

Importing the trained model directly from the file system without packaging it as a module:

import spacy
import fi

nlp = spacy.load('training/merged')

doc = nlp('Hän ajoi punaisella autolla.')
for t in doc:
    print(f'{t.lemma_}\t{t.pos_}')

Packaging and publishing

See packaging.md.

License

MIT license

License for the training data

The data sets downloaded by the tools/download_data.sh script are licensed as follows:

UD_Finnish-TDT: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
TurkuONE: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
MC4: ODC-BY and Common Crawl terms of use

Open Source Agenda is not affiliated with "Spacy Fi" Project. README Source: aajanki/spacy-fi
