A fast, low-resource Natural Language Processing and Text Correction library written in Rust.
This is a patch release but there are some small breaking changes to the public API:
The from_reader and new methods of the Tokenizer and Rules now return an nlprule::Error instead of a bincode::Error.
The tag_store and word_store methods of the Tagger are now private.
The nlprule-build crate now has a postprocess method to allow e.g. compression of the produced binaries (#32, thanks @drahnr!).
PosIdInt and WordIdInt to clarify use of ids in the tagger (#31).
(GraphId). All graph ids are now validated at build time (this also fixed an error where invalid graph ids in the XML files were ignored) (#31).
compile module instead of panicking and better build-time validation. Reduces unwraps from ~80 to ~40.
nlprule does sentence segmentation internally now using srx. The Python API has changed, removing the SplitOn class and the *_sentence methods:
    from nlprule import Tokenizer, Rules

    tokenizer = Tokenizer.load("en")
    rules = Rules.load("en", tokenizer)
    rules.correct("He wants that you send him an email.")  # this takes an arbitrary text
new_from is now called from_reader in the Rust API (thanks @drahnr!).
Token.text and IncompleteToken.text are now called Token.sentence / IncompleteToken.sentence to avoid confusion with Token.word.text.
Tokenizer.tokenize is now private. Use Tokenizer.pipe instead (also does sentence segmentation).
nlprule-build crate which makes using the correct binaries in Rust easier (thanks @drahnr for the suggestion and discussion!).
build/README.md to make creating the nlprule build directories easier and more reproducible.
Tokenizer improved a lot. Now roughly 6x smaller for German and 2x smaller for English.
Rules (thanks @drahnr!).
sentencize on the Tokenizer which does only sentence segmentation and nothing else.
BREAKING: suggestion.text is now more accurately called suggestion.replacements.
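Code that has to run against releases from both before and after the suggestion.text rename could read the attribute through a small helper. The following is only a sketch: get_replacements is a hypothetical name, not part of nlprule, and the SimpleNamespace objects are stand-ins for real suggestions.

```python
from types import SimpleNamespace
from typing import Any, List


def get_replacements(suggestion: Any) -> List[str]:
    """Read replacement candidates under either attribute name (hypothetical helper)."""
    # Prefer the new attribute name; fall back to the pre-rename one.
    if hasattr(suggestion, "replacements"):
        return list(suggestion.replacements)
    return list(suggestion.text)


# Stand-in objects for illustration only; real suggestions come from nlprule.
old_style = SimpleNamespace(text=["was not", "has not been"])
new_style = SimpleNamespace(replacements=["was not", "has not been"])

old_result = get_replacements(old_style)
new_result = get_replacements(new_style)
```

Both calls return the same list, so downstream code does not need to know which release produced the suggestion.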
Lots of speed improvements: NLPRule is now roughly 2.5x to 5x faster for German and English, respectively.
Rules have more information in the public API now: See #5
Python 3.9 support (fixes #7)
Fix precedence of Rule IDs over Rule Group IDs.
message and source attribute (#5):
    suggestions = rules.suggest_sentence("She was not been here since Monday.")
    for s in suggestions:
        print(s.start, s.end, s.text, s.source, s.message)
    # prints:
    # 4 16 ['was not', 'has not been'] WAS_BEEN.1 Did you mean was not or has not been?
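The start, end and text fields printed above are enough to apply a suggestion by hand. The sketch below assumes that start and end are character offsets with an exclusive end, which matches the 4 and 16 in the output for this ASCII sentence. The Suggestion dataclass and apply_suggestion are hypothetical stand-ins, not nlprule types.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Suggestion:
    """Hypothetical stand-in mirroring the fields printed in the example."""
    start: int
    end: int
    text: List[str]  # candidate replacements, as in the printed output
    source: str
    message: str


def apply_suggestion(sentence: str, suggestion: Suggestion, choice: int = 0) -> str:
    """Replace the span [start, end) of the sentence with the chosen candidate."""
    return (
        sentence[:suggestion.start]
        + suggestion.text[choice]
        + sentence[suggestion.end:]
    )


sentence = "She was not been here since Monday."
s = Suggestion(
    start=4,
    end=16,
    text=["was not", "has not been"],
    source="WAS_BEEN.1",
    message="Did you mean was not or has not been?",
)

corrected = apply_suggestion(sentence, s)
alternative = apply_suggestion(sentence, s, choice=1)
print(corrected)    # She was not here since Monday.
print(alternative)  # She has not been here since Monday.
```

Taking the first candidate by default is a design choice of this sketch only; a caller may want to present all candidates instead.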
NLPRULE_PARALLELISM environment variable to false.
Testing new distribution of binaries.