Nlprule Versions

A fast, low-resource Natural Language Processing and Text Correction library written in Rust.

0.6.4

3 years ago

Internal improvements

Decrease the time it takes to load the Tokenizer by ~40% (#70).
Tag lookup is now backed by a vector instead of a hashmap.

Breaking changes

The tagger now returns iterators over tags instead of allocating a vector.
Remove get_group_members function.
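
To illustrate what the iterator change means for callers, here is a minimal sketch. The `Tagger` struct, its `tags` field, and the `get_tags` method are hypothetical stand-ins invented for this example; they are not nlprule's actual definitions. The sketch only shows the general pattern of returning a borrowing iterator over a vector-backed lookup instead of a freshly allocated `Vec`.

```rust
// Hypothetical stand-in type; NOT nlprule's real tagger.
struct Tagger {
    // Tags stored per word id in a vector, so a lookup is an index
    // operation rather than a hash computation.
    tags: Vec<Vec<&'static str>>,
}

impl Tagger {
    // Returns a borrowing iterator; no vector is allocated for the result.
    fn get_tags(&self, word_id: usize) -> impl Iterator<Item = &'static str> + '_ {
        self.tags
            .get(word_id)
            .into_iter()
            .flat_map(|tags| tags.iter().copied())
    }
}

fn main() {
    let tagger = Tagger {
        tags: vec![vec!["NN", "VB"], vec!["DT"]],
    };

    // Callers that still need a vector can collect explicitly.
    let collected: Vec<&str> = tagger.get_tags(0).collect();
    println!("{:?}", collected);

    // Unknown ids yield an empty iterator instead of an error.
    println!("{}", tagger.get_tags(7).count());
}
```

Code that previously indexed into a returned vector would, under this pattern, either iterate directly or call `.collect()` at the call site.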

0.6.3

3 years ago

Fixes

Fix a bug where calling Rule::suggest in parallel across threads would cause a panic (#68, thanks @drahnr!)

0.6.2

3 years ago

Internal improvements

Speed up loading the Tokenizer by ~25% (#66).

0.6.1

3 years ago

Fixes

Build Python wheels in a container for full manylinux2014 compliance; the wheels now work with glibc 2.17 and above (thanks @dvwright!)
Speed up loading the Tokenizer by avoiding an allocation (thanks @drahnr!)

0.6.0

3 years ago

Fix a significant bug where text with multiple sentences would sometimes cause an error if one of the later sentences matched some pattern (#61, #63, thanks @drahnr!).

Breaking changes

Remove multiword_tags on tokens (now part of the regular tags).
Make the fields of Word private and add getter methods.
The Word constructor is now called new instead of new_with_tags.

New features

Adds an as_str convenience method to multiple structs (WordId, PosId, Word).

0.5.3

3 years ago

CI failed for Release 0.5.2

0.5.2

3 years ago

Restore FromIterator and IntoIterator impl on Rules (#58, thanks @drahnr!)
Add Clone derives on Tokenizer and Rules (and, accordingly, on their fields)

0.5.1

3 years ago

Breaking changes

Changes the focus from Vec<Token> to Sentence (#54). pipe and sentencize return iterators over Sentence / IncompleteSentence now.
Removes the special SENT_START token (now only used internally). Each token corresponds to at least one character in the input text now.
Makes the fields of Token and IncompleteToken private and adds getter methods (#54).
char_span and byte_span are replaced by a Span struct which keeps track of char and byte indices at the same time (#54). For example, to get the byte range, use token.span().byte().
Spans are now relative to the input text, no longer to sentence boundaries (#53, thanks @drahnr!).
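
The idea of a span that tracks char and byte indices together can be sketched as follows. This is a self-contained illustration under stated assumptions: the `Span` struct below, its field names, and the `from_byte_range` constructor are hypothetical and written only for this example; nlprule's real `Span` may differ. Only the `byte()` accessor name is taken from the note above, and `char()` is assumed by analogy.

```rust
use std::ops::Range;

// Hypothetical sketch; nlprule's real `Span` may differ in fields and methods.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    byte_range: Range<usize>,
    char_range: Range<usize>,
}

impl Span {
    // Builds a span for the byte range `byte` of `text` and derives the
    // matching char range. Panics if `byte` does not lie on char boundaries.
    fn from_byte_range(text: &str, byte: Range<usize>) -> Self {
        let char_start = text[..byte.start].chars().count();
        let char_len = text[byte.clone()].chars().count();
        Span {
            byte_range: byte,
            char_range: char_start..char_start + char_len,
        }
    }

    fn byte(&self) -> &Range<usize> {
        &self.byte_range
    }

    fn char(&self) -> &Range<usize> {
        &self.char_range
    }
}

fn main() {
    // '€' takes three bytes in UTF-8, so byte and char offsets diverge after it.
    let text = "10 € total";
    let start = text.find("total").unwrap();
    let span = Span::from_byte_range(text, start..start + "total".len());

    // Byte ranges index directly into the original text.
    assert_eq!(&text[span.byte().clone()], "total");
    println!("byte: {:?}, char: {:?}", span.byte(), span.char());
}
```

Because the offsets are relative to the whole input text, the byte range can be used to slice the original string directly, without adding a sentence offset first.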

New features

The regex backend can now be chosen between Oniguruma and fancy-regex via the features regex-onig and regex-fancy. regex-onig is the default.
nlprule now compiles to WebAssembly. WebAssembly support is guaranteed for future versions and tested in CI.
A new selector API to select individual rules (details documented in nlprule::rule::id). For example:

use nlprule::{Tokenizer, Rules, rule::id::Category};
use std::convert::TryInto;

let mut rules = Rules::new("path/to/en_rules.bin")?;

// disable rules named "confusion_due_do" in category "confused_words"
rules
    .select_mut(
        &Category::new("confused_words")
            .join("confusion_due_do")
            .into(),
    )
    .for_each(|rule| rule.disable());

// disable all grammar rules
rules
    .select_mut(&Category::new("grammar").into())
    .for_each(|rule| rule.disable());

// a string syntax where slashes are the separator is also supported
rules
    .select_mut(&"confused_words/confusion_due_do".try_into()?)
    .for_each(|rule| rule.enable());

0.5.0

3 years ago

Superseded by 0.5.1. The release script for 0.5.0 did not finish.

0.4.6

3 years ago

Breaking changes

.validate() in nlprule-build now returns a Result<()> to encourage calling it after .postprocess().
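
One plausible reading of this change is that a method returning `Result<()>` ends a builder chain, so it naturally becomes the last call, after any post-processing. The sketch below illustrates that pattern only. `BinaryBuilder` here is a hypothetical toy type with made-up behavior and `String` errors; it is not nlprule-build's actual API.

```rust
// Hypothetical toy builder; names and behavior are illustrative only.
#[derive(Debug, Default)]
struct BinaryBuilder {
    built: bool,
}

impl BinaryBuilder {
    // Chainable step: hands the builder back on success.
    fn build(mut self) -> Result<Self, String> {
        self.built = true;
        Ok(self)
    }

    // Chainable step: also hands the builder back on success.
    fn postprocess(self) -> Result<Self, String> {
        if !self.built {
            return Err("postprocess called before build".to_string());
        }
        Ok(self)
    }

    // Terminal step: consumes the builder and returns unit, so nothing
    // can be chained after it.
    fn validate(self) -> Result<(), String> {
        if self.built {
            Ok(())
        } else {
            Err("nothing was built".to_string())
        }
    }
}

fn main() -> Result<(), String> {
    // `validate` has to come last; `.validate()?.postprocess()` would not
    // compile because `()` has no `postprocess` method.
    BinaryBuilder::default().build()?.postprocess()?.validate()?;
    Ok(())
}
```

With this shape, the type system rather than documentation nudges callers toward validating the final, post-processed output.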

Fixes

Fixes an error where Cursor position in nlprule-build was not reset appropriately.
Use fs_err everywhere for better error messages.