nlprule Versions

A fast, low-resource Natural Language Processing and Text Correction library written in Rust.

0.4.5

3 years ago

New features

  • A transform function in nlprule-build to transform binaries immediately after acquiring them, suited e.g. for compressing the binaries before caching them.
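The release notes don't show the transform hook itself; as a stdlib-only sketch of the kind of step such a transform might perform (the compress_binary helper and file names are hypothetical, not part of the nlprule-build API):

```python
import gzip
import shutil
import tempfile
from pathlib import Path

def compress_binary(src: Path, dest: Path) -> None:
    """Gzip-compress an acquired binary before caching it."""
    with open(src, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

# A throwaway file stands in for a downloaded binary:
tmp = Path(tempfile.mkdtemp())
binary = tmp / "en_tokenizer.bin"
binary.write_bytes(b"\x00" * 4096)  # placeholder contents
compress_binary(binary, tmp / "en_tokenizer.bin.gz")
print((tmp / "en_tokenizer.bin.gz").stat().st_size < binary.stat().st_size)
```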

Fixes

  • Require srx=^0.1.2 to include a patch for an out-of-bounds access.

0.4.4

3 years ago

Breaking changes

This is a patch release but there are some small breaking changes to the public API:

  • from_reader and new methods of the Tokenizer and Rules now return an nlprule::Error instead of a bincode::Error.
  • tag_store and word_store methods of the Tagger are now private.

New features

  • The nlprule-build crate now has a postprocess method to allow e.g. compression of the produced binaries (#32, thanks @drahnr!).
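If postprocess is used to gzip the produced binaries, the consuming side needs a matching decompression step before handing the bytes to a from_reader-style constructor. A stdlib-only sketch (read_decompressed and the file name are hypothetical, not nlprule API):

```python
import gzip
import io
import tempfile
from pathlib import Path

def read_decompressed(path: Path) -> io.BytesIO:
    """Decompress a gzipped binary into an in-memory reader,
    ready to pass to a from_reader-style constructor."""
    return io.BytesIO(gzip.decompress(path.read_bytes()))

# A gzipped placeholder stands in for a postprocessed binary:
tmp = Path(tempfile.mkdtemp())
packed = tmp / "en_rules.bin.gz"
packed.write_bytes(gzip.compress(b"binary payload"))
reader = read_decompressed(packed)
print(reader.read())  # b'binary payload'
```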

Internal improvements

  • Newtypes for PosIdInt and WordIdInt to clarify the use of IDs in the tagger (#31).
  • Newtype for indices into the match graph (GraphId). All graph IDs are validated at build time now (this also fixed a bug where invalid graph IDs in the XML files were silently ignored) (#31).
  • Reduced the size of the English tokenizer through better serialization of the chunker: from 15MB (7.7MB gzipped) to 11MB (6.9MB gzipped).
  • Reduced allocations by making more use of iterators internally (#30). This improves speed, but there is no significant benchmark improvement on my machine.
  • Improved error handling by propagating more errors in the compile module instead of panicking, and through better build-time validation. Reduces unwraps from ~80 to ~40.

0.4.3

3 years ago

Breaking changes

  • nlprule does sentence segmentation internally now using srx. The Python API has changed, removing the SplitOn class and the *_sentence methods:
tokenizer = Tokenizer.load("en")
rules = Rules.load("en", tokenizer)

rules.correct("He wants that you send him an email.") # this takes an arbitrary text
  • new_from is now called from_reader in the Rust API (thanks @drahnr!).
  • Token.text and IncompleteToken.text are now called Token.sentence / IncompleteToken.sentence to avoid confusion with Token.word.text.
  • Tokenizer.tokenize is now private. Use Tokenizer.pipe instead (also does sentence segmentation).
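SRX-style segmentation works by applying ordered break/no-break rules at candidate boundaries, with no-break rules (e.g. for abbreviations) taking precedence. A toy, stdlib-only illustration of that idea (the two rules are invented for the example; this is not the srx crate's actual algorithm or rule set):

```python
import re

# Ordered rules: the first pattern matching a token decides whether a
# sentence boundary follows it. No-break rules are listed first so they
# take precedence over the generic break rule.
RULES = [
    (re.compile(r"\b(e\.g|i\.e|Mr|Dr)\.$"), False),  # abbreviation: no break
    (re.compile(r"[.!?]$"), True),                   # final punctuation: break
]

def sentencize(text: str) -> list[str]:
    sentences, start = [], 0
    for tok in re.finditer(r"\S+", text):
        for pattern, is_break in RULES:
            if pattern.search(tok.group()):
                if is_break:
                    sentences.append(text[start:tok.end()].strip())
                    start = tok.end()
                break  # first matching rule wins
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(sentencize("Ask Dr. Smith about it. She agreed."))
```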

New features

  • Support for Spanish (experimental).
  • A new multiword tagger improves tagging of e.g. named entities for English and Spanish.
  • Adds the nlprule-build crate which makes using the correct binaries in Rust easier (thanks @drahnr for the suggestion and discussion!)
  • Scripts and docs in build/README.md to make creating the nlprule build directories easier and more reproducible.
  • Full support for LanguageTool unifications.
  • Binary size of the Tokenizer is greatly reduced: now roughly 6x smaller for German and 2x smaller for English.
  • New iterator helpers for Rules (thanks @drahnr!)
  • A .sentencize method on the Tokenizer which performs only sentence segmentation, nothing else.

0.4.0

3 years ago

0.3.0

3 years ago

BREAKING: suggestion.text is now more accurately called suggestion.replacements.

Lots of speed improvements: NLPRule is now roughly 2.5x faster for German and 5x faster for English.

Rules have more information in the public API now: See #5

0.2.2

3 years ago

Python 3.9 support (fixes #7)

0.2.1

3 years ago

Fix precedence of Rule IDs over Rule Group IDs.

0.2.0

3 years ago
  • Updated to LT version 5.2.
  • Suggestions now have a message and source attribute (#5):
suggestions = rules.suggest_sentence("She was not been here since Monday.")
for s in suggestions:
  print(s.start, s.end, s.text, s.source, s.message)

# prints:
# 4 16 ['was not', 'has not been'] WAS_BEEN.1 Did you mean was not or has not been?
  • NLPRule is parallelized by default now. Parallelism can be turned off by setting the NLPRULE_PARALLELISM environment variable to false.
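The toggle can be mirrored in a few lines. Only the variable name NLPRULE_PARALLELISM comes from the release notes; the exact parsing below (treating any value other than "false" as enabled) is an assumption for illustration:

```python
import os

def parallelism_enabled(default: bool = True) -> bool:
    """Read the NLPRULE_PARALLELISM toggle: 'false' (case-insensitive)
    disables parallelism, anything else leaves it on. The parsing here
    is an assumption for illustration, not nlprule's own code."""
    value = os.environ.get("NLPRULE_PARALLELISM")
    if value is None:
        return default
    return value.strip().lower() != "false"

os.environ["NLPRULE_PARALLELISM"] = "false"
print(parallelism_enabled())  # False
```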

0.1.9

3 years ago

Testing new distribution of binaries.

0.1.8

3 years ago

Testing new distribution of binaries.