Nlprule Versions

A fast, low-resource Natural Language Processing and Text Correction library written in Rust.

0.6.4

3 years ago

Internal improvements

  • Decrease time it takes to load the Tokenizer by ~ 40% (#70).
  • Tag lookup is backed by a vector instead of a hashmap now.
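A vector can replace a hashmap when the keys are small, dense integers, because the key itself can serve as the index. The changelog does not describe the actual layout, so the following is only a minimal sketch under that assumption; `TagStore` and its methods are hypothetical names, not part of nlprule.

```rust
/// Hypothetical store mapping dense word IDs to tag IDs.
/// Not nlprule's real data structure; it only illustrates the idea.
pub struct TagStore {
    // Index = word ID, value = tag IDs registered for that word.
    tags_by_word: Vec<Vec<u16>>,
}

impl TagStore {
    pub fn new() -> Self {
        TagStore { tags_by_word: Vec::new() }
    }

    /// Registers a tag for `word_id`, growing the vector as needed.
    pub fn insert(&mut self, word_id: usize, tag_id: u16) {
        if word_id >= self.tags_by_word.len() {
            self.tags_by_word.resize_with(word_id + 1, Vec::new);
        }
        self.tags_by_word[word_id].push(tag_id);
    }

    /// Lookup is a bounds-checked index; no hashing is involved.
    /// Unknown IDs yield an empty slice.
    pub fn get(&self, word_id: usize) -> &[u16] {
        self.tags_by_word
            .get(word_id)
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }
}
```

The trade-off is memory: a vector reserves a slot for every ID up to the largest one, so this only pays off when IDs are assigned contiguously.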

Breaking changes

  • The tagger now returns iterators over tags instead of allocating a vector.
  • Remove get_group_members function.
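Returning an iterator instead of a vector lets callers decide whether to allocate at all. The sketch below shows the general pattern with a hypothetical `Tagger` type; the field layout and method names are assumptions for illustration and do not reflect nlprule's actual signatures.

```rust
/// Hypothetical tagger; the names are illustrative, not nlprule's API.
pub struct Tagger {
    entries: Vec<(String, Vec<String>)>,
}

impl Tagger {
    pub fn new(entries: Vec<(String, Vec<String>)>) -> Self {
        Tagger { entries }
    }

    /// Yields the tags of `word` lazily by borrowing from the tagger.
    /// No vector is allocated per call.
    pub fn tags<'a>(&'a self, word: &'a str) -> impl Iterator<Item = &'a str> + 'a {
        self.entries
            .iter()
            .filter(move |(w, _)| w.as_str() == word)
            .flat_map(|(_, tags)| tags.iter().map(String::as_str))
    }
}
```

Code that previously relied on a vector can call `.collect::<Vec<_>>()` on the result, while code that only iterates once avoids the allocation.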

0.6.3

3 years ago

Fixes

  • Fix a bug where calling Rule::suggest in parallel across threads would cause a panic (#68, thanks @drahnr!)

0.6.2

3 years ago

Internal improvements

  • Speed up loading the Tokenizer by ~ 25% (#66).

0.6.1

3 years ago

Fixes

  • Build Python wheels in a container for full manylinux2014 compliance. The wheels now work with glibc 2.17 and above (thanks @dvwright!)
  • Speed up loading the Tokenizer by avoiding an allocation (thanks @drahnr!)

0.6.0

3 years ago
  • Fix a significant bug where text with multiple sentences would sometimes cause an error if one of the later sentences matched some pattern (#61, #63, thanks @drahnr!).

Breaking changes

  • Remove multiword_tags on tokens (now part of the regular tags).
  • Make fields of the Word private and add getter methods.
  • Word constructor is now called new instead of new_with_tags.

New features

  • Adds as_str convenience method to multiple structs (WordId, PosId, Word).
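The shape of these changes (private fields, a single `new` constructor, getters, and an `as_str` accessor) follows a common Rust pattern. The sketch below uses a hypothetical `ExampleWord` type whose fields are assumed for illustration; it is not the definition of nlprule's `Word`.

```rust
/// Hypothetical word type with private fields; not nlprule's actual `Word`.
#[derive(Debug, Clone)]
pub struct ExampleWord {
    text: String,
    tags: Vec<String>,
}

impl ExampleWord {
    /// Single constructor named `new` that takes the tags directly.
    pub fn new(text: impl Into<String>, tags: Vec<String>) -> Self {
        ExampleWord { text: text.into(), tags }
    }

    /// Borrowing accessor, in the spirit of an `as_str` convenience method.
    pub fn as_str(&self) -> &str {
        &self.text
    }

    /// Read-only getter; callers cannot mutate the field directly.
    pub fn tags(&self) -> &[String] {
        &self.tags
    }
}
```

Keeping fields private means the internal representation can change later without another breaking release, as long as the getters keep their signatures.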

0.5.3

3 years ago
  • CI failed for Release 0.5.2

0.5.2

3 years ago
  • Restore FromIterator and IntoIterator impl on Rules (#58, thanks @drahnr!)
  • Add Clone derives on Tokenizer and Rules (and, accordingly, on their fields)

0.5.1

3 years ago

Breaking changes

  • Changes the focus from Vec<Token> to Sentence (#54). pipe and sentencize return iterators over Sentence / IncompleteSentence now.
  • Removes the special SENT_START token (now only used internally). Each token corresponds to at least one character in the input text now.
  • Makes the fields of Token and IncompleteToken private and adds getter methods (#54).
  • char_span and byte_span are replaced by a Span struct which keeps track of char and byte indices at the same time (#54). For example, to get the byte range, use token.span().byte().
  • Spans are now relative to the input text, no longer to sentence boundaries (#53, thanks @drahnr!).
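Tracking both index kinds matters because byte and char offsets diverge as soon as the text contains multi-byte characters. The following is a minimal sketch of a span type that stores both ranges relative to the full input text. The `from_bytes` constructor and the field layout are assumptions for illustration, not nlprule's actual `Span`.

```rust
use std::ops::Range;

/// Hypothetical span that tracks byte and char offsets together.
/// It mirrors the idea described above, not nlprule's definition.
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    byte: Range<usize>,
    char: Range<usize>,
}

impl Span {
    /// Builds a span from a byte range of `text` and derives the char range.
    /// Panics if the byte range does not fall on char boundaries.
    pub fn from_bytes(text: &str, byte: Range<usize>) -> Self {
        let char_start = text[..byte.start].chars().count();
        let char_end = char_start + text[byte.clone()].chars().count();
        Span { byte, char: char_start..char_end }
    }

    /// Byte range, suitable for slicing the input `&str`.
    pub fn byte(&self) -> &Range<usize> {
        &self.byte
    }

    /// Char range, suitable for reporting positions to users.
    pub fn char(&self) -> &Range<usize> {
        &self.char
    }
}
```

For the input "héllo wörld", the second word occupies bytes 7..13 but chars 6..11, so a consumer that slices the string needs the byte range while one that reports column positions needs the char range.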

New features

  • The regex backend can now be chosen from Oniguruma or fancy-regex with the features regex-onig and regex-fancy. regex-onig is the default.
  • nlprule now compiles to WebAssembly. WebAssembly support is guaranteed for future versions and tested in CI.
  • A new selector API to select individual rules (details documented in nlprule::rule::id). For example:
use nlprule::{Tokenizer, Rules, rule::id::Category};
use std::convert::TryInto;

let mut rules = Rules::new("path/to/en_rules.bin")?;

// disable rules named "confusion_due_do" in category "confused_words"
rules
   .select_mut(
       &Category::new("confused_words")
           .join("confusion_due_do")
           .into(),
   )
   .for_each(|rule| rule.disable());

// disable all grammar rules
rules
   .select_mut(&Category::new("grammar").into())
   .for_each(|rule| rule.disable());

// a string syntax where slashes are the separator is also supported
rules
   .select_mut(&"confused_words/confusion_due_do".try_into()?)
   .for_each(|rule| rule.enable());

0.5.0

3 years ago
  • Superseded by 0.5.1. The release script for 0.5.0 did not finish.

0.4.6

3 years ago

Breaking changes

  • .validate() in nlprule-build now returns a Result<()> to encourage calling it after .postprocess().
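One way to read this change: a method that consumes the builder and returns `Result<()>` ends the method chain, so nothing can be called after it and the compiler warns if the result is ignored. The sketch below demonstrates that pattern with a hypothetical `ExampleBuilder`; its fields, checks, and signatures are assumptions and do not reproduce the nlprule-build API.

```rust
/// Hypothetical builder; not the nlprule-build API.
pub struct ExampleBuilder {
    data: Vec<u8>,
}

impl ExampleBuilder {
    pub fn new(data: Vec<u8>) -> Self {
        ExampleBuilder { data }
    }

    /// Transforms the data and returns the builder, so the chain can continue.
    pub fn postprocess(self, f: impl FnOnce(Vec<u8>) -> Vec<u8>) -> Self {
        ExampleBuilder { data: f(self.data) }
    }

    /// Consumes the builder and returns `Result<()>`.
    /// Because no builder is returned, this has to be the last call in the chain.
    pub fn validate(self) -> Result<(), String> {
        if self.data.is_empty() {
            Err("no data to validate".to_string())
        } else {
            Ok(())
        }
    }
}
```

With this shape, `ExampleBuilder::new(data).postprocess(f).validate()?` type-checks, while calling `postprocess` after `validate` does not compile.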

Fixes

  • Fixes an error where the Cursor position in nlprule-build was not reset appropriately.
  • Use fs_err everywhere for better error messages.