nlprule Versions

A fast, low-resource Natural Language Processing and Text Correction library written in Rust.

0.4.5

3 years ago

New features

  • A transform function in nlprule-build to transform binaries immediately after acquiring them, suited e.g. for compressing the binaries before caching them.
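The release notes don't show the transform hook itself; as a stdlib-only sketch of the kind of step such a transform might perform (the compress_binary helper and file names are hypothetical, not part of the nlprule-build API):

```python
import gzip
import shutil
import tempfile
from pathlib import Path

def compress_binary(src: Path, dest: Path) -> None:
    """Gzip-compress an acquired binary before caching it."""
    with open(src, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

# A throwaway file stands in for a downloaded binary:
tmp = Path(tempfile.mkdtemp())
binary = tmp / "en_tokenizer.bin"
binary.write_bytes(b"\x00" * 4096)  # placeholder contents
compress_binary(binary, tmp / "en_tokenizer.bin.gz")
print((tmp / "en_tokenizer.bin.gz").stat().st_size < binary.stat().st_size)
```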

Fixes

  • Require srx=^0.1.2 to include a patch for an out-of-bounds access.

0.4.4

3 years ago

Breaking changes

This is a patch release but there are some small breaking changes to the public API:

  • from_reader and new methods of the Tokenizer and Rules now return an nlprule::Error instead of a bincode::Error.
  • tag_store and word_store methods of the Tagger are now private.

New features

  • The nlprule-build crate now has a postprocess method to allow e.g. compression of the produced binaries (#32, thanks @drahnr!).
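If postprocess is used to gzip the produced binaries, the consuming side needs a matching decompression step before handing the bytes to a from_reader-style constructor. A stdlib-only sketch (read_decompressed and the file name are hypothetical, not nlprule API):

```python
import gzip
import io
import tempfile
from pathlib import Path

def read_decompressed(path: Path) -> io.BytesIO:
    """Decompress a gzipped binary into an in-memory reader,
    ready to pass to a from_reader-style constructor."""
    return io.BytesIO(gzip.decompress(path.read_bytes()))

# A gzipped placeholder stands in for a postprocessed binary:
tmp = Path(tempfile.mkdtemp())
packed = tmp / "en_rules.bin.gz"
packed.write_bytes(gzip.compress(b"binary payload"))
reader = read_decompressed(packed)
print(reader.read())  # b'binary payload'
```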

Internal improvements

  • Newtypes for PosIdInt and WordIdInt to clarify the use of IDs in the tagger (#31).
  • Newtype for indices into the match graph (GraphId). All graph IDs are validated at build time now (this also fixed a bug where invalid graph IDs in the XML files were silently ignored) (#31).
  • Reduced the size of the English tokenizer through better serialization of the chunker: from 15MB (7.7MB gzipped) to 11MB (6.9MB gzipped).
  • Reduced allocations by making more use of iterators internally (#30). This improves speed, but there is no significant benchmark improvement on my machine.
  • Improved error handling by propagating more errors in the compile module instead of panicking, and through better build-time validation. Reduces unwraps from ~80 to ~40.

0.4.3

3 years ago

Breaking changes

  • nlprule does sentence segmentation internally now using srx. The Python API has changed, removing the SplitOn class and the *_sentence methods:
tokenizer = Tokenizer.load("en")
rules = Rules.load("en", tokenizer)

rules.correct("He wants that you send him an email.") # this takes an arbitrary text
  • new_from is now called from_reader in the Rust API (thanks @drahnr!).
  • Token.text and IncompleteToken.text are now called Token.sentence / IncompleteToken.sentence to avoid confusion with Token.word.text.
  • Tokenizer.tokenize is now private. Use Tokenizer.pipe instead (also does sentence segmentation).
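SRX-style segmentation works by applying ordered break/no-break rules at candidate boundaries, with no-break rules (e.g. for abbreviations) taking precedence. A toy, stdlib-only illustration of that idea (the two rules are invented for the example; this is not the srx crate's actual algorithm or rule set):

```python
import re

# Ordered rules: the first pattern matching a token decides whether a
# sentence boundary follows it. No-break rules are listed first so they
# take precedence over the generic break rule.
RULES = [
    (re.compile(r"\b(e\.g|i\.e|Mr|Dr)\.$"), False),  # abbreviation: no break
    (re.compile(r"[.!?]$"), True),                   # final punctuation: break
]

def sentencize(text: str) -> list[str]:
    sentences, start = [], 0
    for tok in re.finditer(r"\S+", text):
        for pattern, is_break in RULES:
            if pattern.search(tok.group()):
                if is_break:
                    sentences.append(text[start:tok.end()].strip())
                    start = tok.end()
                break  # first matching rule wins
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(sentencize("Ask Dr. Smith about it. She agreed."))
```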

New features

  • Support for Spanish (experimental).
  • A new multiword tagger improves tagging of e.g. named entities for English and Spanish.
  • Adds the nlprule-build crate which makes using the correct binaries in Rust easier (thanks @drahnr for the suggestion and discussion!)
  • Scripts and docs in build/README.md to make creating the nlprule build directories easier and more reproducible.
  • Full support for LanguageTool unifications.
  • Binary size of the Tokenizer is greatly reduced: now roughly 6x smaller for German and 2x smaller for English.
  • New iterator helpers for Rules (thanks @drahnr!)
  • A .sentencize method on the Tokenizer which performs only sentence segmentation, nothing else.

0.4.0

3 years ago

0.3.0

3 years ago

BREAKING: suggestion.text is now more accurately called suggestion.replacements.

Lots of speed improvements: NLPRule is now roughly 2.5x faster for German and 5x faster for English.

Rules have more information in the public API now: See #5

0.2.2

3 years ago

Python 3.9 support (fixes #7)

0.2.1

3 years ago

Fix precedence of Rule IDs over Rule Group IDs.

0.2.0

3 years ago
  • Updated to LT version 5.2.
  • Suggestions now have a message and source attribute (#5):
suggestions = rules.suggest_sentence("She was not been here since Monday.")
for s in suggestions:
  print(s.start, s.end, s.text, s.source, s.message)

# prints:
# 4 16 ['was not', 'has not been'] WAS_BEEN.1 Did you mean was not or has not been?
  • NLPRule is parallelized by default now. Parallelism can be turned off by setting the NLPRULE_PARALLELISM environment variable to false.
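The toggle can be mirrored in a few lines. Only the variable name NLPRULE_PARALLELISM comes from the release notes; the exact parsing below (treating any value other than "false" as enabled) is an assumption for illustration:

```python
import os

def parallelism_enabled(default: bool = True) -> bool:
    """Read the NLPRULE_PARALLELISM toggle: 'false' (case-insensitive)
    disables parallelism, anything else leaves it on. The parsing here
    is an assumption for illustration, not nlprule's own code."""
    value = os.environ.get("NLPRULE_PARALLELISM")
    if value is None:
        return default
    return value.strip().lower() != "false"

os.environ["NLPRULE_PARALLELISM"] = "false"
print(parallelism_enabled())  # False
```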

0.1.9

3 years ago

Testing new distribution of binaries.

0.1.8

3 years ago

Testing new distribution of binaries.