A fast, low-resource Natural Language Processing and Text Correction library written in Rust.
- Removes `multiword_tags` on tokens (now part of the regular tags).
- Makes the fields of `Word` private and adds getter methods.
- The `Word` constructor is now called `new` instead of `new_with_tags`.
- Adds an `.as_str` convenience method to multiple structs (`WordId`, `PosId`, `Word`).
- Adds `FromIterator` and `IntoIterator` impls on `Rules` (#58, thanks @drahnr!).
- Adds `Clone` derives on `Tokenizer` and `Rules` (and, accordingly, on their fields).
- Changes the focus from `Vec<Token>` to `Sentence` (#54). `pipe` and `sentencize` now return iterators over `Sentence` / `IncompleteSentence`.
- Removes the `SENT_START` token (now only used internally). Each token now corresponds to at least one character in the input text.
- Makes the fields of `Token` and `IncompleteToken` private and adds getter methods (#54).
- `char_span` and `byte_span` are replaced by a `Span` struct which keeps track of char and byte indices at the same time (#54). For example, to get the byte range, use `token.span().byte()`.
- The regex backend is selected with the features `regex-onig` and `regex-fancy`. `regex-onig` is the default.
- Adds a selector API for enabling and disabling individual rules (see `nlprule::rule::id`). For example:

use nlprule::{Tokenizer, Rules, rule::id::Category};
use std::convert::TryInto;

let mut rules = Rules::new("path/to/en_rules.bin")?;

// disable rules named "confusion_due_do" in category "confused_words"
rules
    .select_mut(
        &Category::new("confused_words")
            .join("confusion_due_do")
            .into(),
    )
    .for_each(|rule| rule.disable());

// disable all grammar rules
rules
    .select_mut(&Category::new("grammar").into())
    .for_each(|rule| rule.disable());

// a string syntax where slashes are the separator is also supported
rules
    .select_mut(&"confused_words/confusion_due_do".try_into()?)
    .for_each(|rule| rule.enable());
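
The `Span` change listed above is easier to picture with a small example. The following is only a sketch of the idea of tracking char and byte indices together: the `Span` type, its fields, and the `from_bytes` constructor here are hypothetical stand-ins written for illustration, not nlprule's actual implementation. The accessor names are modelled on the `token.span().byte()` call mentioned above.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a span type (NOT nlprule's actual `Span`).
/// It stores the same region of text as both a byte range and a char range.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Span {
    byte_range: Range<usize>,
    char_range: Range<usize>,
}

impl Span {
    /// Hypothetical constructor: takes a byte range into `text` and derives
    /// the matching char range by counting chars before and inside the range.
    /// Panics if the byte range does not fall on UTF-8 char boundaries.
    fn from_bytes(text: &str, byte_range: Range<usize>) -> Span {
        let char_start = text[..byte_range.start].chars().count();
        let char_len = text[byte_range.clone()].chars().count();
        Span {
            byte_range,
            char_range: char_start..char_start + char_len,
        }
    }

    /// Byte indices, suitable for slicing a `&str`.
    fn byte(&self) -> &Range<usize> {
        &self.byte_range
    }

    /// Char indices, suitable for reporting positions to users.
    fn char(&self) -> &Range<usize> {
        &self.char_range
    }
}

fn main() {
    // 'é' and 'ö' take two bytes each, so byte and char offsets diverge.
    let text = "héllo wörld";
    let span = Span::from_bytes(text, 7..13);

    println!("bytes: {:?}, chars: {:?}", span.byte(), span.char());
    println!("slice: {}", &text[span.byte().clone()]);
}
```

Keeping both ranges in one value avoids recounting chars every time a caller needs to convert between the two index spaces, at the cost of a slightly larger struct.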