Bnosac Udpipe Versions

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit

1 year ago

CHANGES IN udpipe VERSION 0.8.11

replace move with std::move to fix R CMD check warning on recent versions of clang compilers
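
As a rough illustration of this kind of fix, the sketch below uses a qualified std::move call. The helper store_token is hypothetical and not taken from the udpipe sources; it only shows why the qualified form is preferred: it always resolves to the standard library function, whereas an unqualified move(...) depends on a using-directive or argument-dependent lookup, which recent clang versions flag.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper for illustration only (not part of udpipe).
// Stores a token in a list, handing over the string's buffer instead of copying it.
void store_token(std::vector<std::string>& tokens, std::string token) {
  // Qualified call: unambiguously std::move from <utility>.
  // Writing move(token) here is the pattern that recent clang warns about.
  tokens.push_back(std::move(token));
}
```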

1 year ago

use snprintf instead of sprintf to handle the R CMD check deprecation note on M1mac
reduced the run time of the examples of document_term_matrix, document_term_frequencies, document_term_frequencies_statistics, cooccurrence, dtm_bind, keywords_collocation
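
The sprintf to snprintf change in this release can be sketched as follows. This is a minimal example under assumptions: the helper format_token_id and the "token_%d" format are hypothetical and do not come from the udpipe sources. The point is that snprintf receives the buffer size and therefore cannot write past the end of the buffer, which is why sprintf is flagged as deprecated on some platforms.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper for illustration only (not part of udpipe).
std::string format_token_id(int id) {
  char buffer[32];
  // snprintf writes at most sizeof(buffer) bytes, including the terminating '\0'.
  // It returns the length the full output would have had, or a negative value on error.
  int written = std::snprintf(buffer, sizeof(buffer), "token_%d", id);
  if (written < 0) {
    return std::string();  // formatting failed
  }
  std::size_t length = static_cast<std::size_t>(written);
  if (length >= sizeof(buffer)) {
    length = sizeof(buffer) - 1;  // output was truncated to fit the buffer
  }
  return std::string(buffer, length);
}
```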

2 years ago

fix R CMD check message on Fedora clang infrastructure: rcpp_udpipe.cpp:243:8: warning: use of bitwise '&' with boolean operands
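
A fix for this kind of warning typically replaces the bitwise operator with the logical one. The sketch below assumes that is what was done here; the function and parameter names are hypothetical and are not the code at rcpp_udpipe.cpp:243.

```cpp
// Hypothetical helper for illustration only (not part of udpipe).
// Decides whether dependency parsing can run, given two boolean conditions.
bool can_parse(bool model_loaded, bool has_tokens) {
  // Flagged form: model_loaded & has_tokens
  //   (bitwise AND on booleans, always evaluates both operands)
  // Preferred form: logical AND, which states the intent and short-circuits.
  return model_loaded && has_tokens;
}
```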

2 years ago

dtm_svd_similarity: fix to make sure that, when given a dtm with features which are all missing/zero, the scoring still works as expected instead of removing the features which contain no data whatsoever. This allows dtm_svd_similarity to be used alongside embeddings from the R package word2vec, which might contain words which are not in the dtm. See the example in ?dtm_svd_similarity
added txt_grepl

txt_count now always returns an integer, even in the edge case where a character vector of length 0 is supplied

2 years ago

Downloading models to paths containing non-ASCII characters now works (issue #95)
strsplit.data.frame gains ... which are passed on to strsplit (e.g. to use fixed=TRUE for speeding up)
read_connlu is now using fixed=TRUE when splitting by newline symbol (for speeding up parsing with function udpipe)
Added txt_paste
Added txt_context
Use html_vignette instead of html_document in the vignettes in order to reduce package size

3 years ago

Added document_term_matrix.default, document_term_matrix.integer and document_term_matrix.numeric
Added groups argument to dtm_colsums and dtm_rowsums
Added dtm_align
Added dtm_sample
Added document_term_matrix.matrix
dtm_cbind and dtm_rbind now accept more than 2 sparse matrices
cbind_morphological gains an argument named which, to specify which morphological features to extract
txt_count now returns NA when NA is provided instead of an error
txt_contains now returns NA when NA is provided instead of FALSE, unless value is set to TRUE
txt_collapse now also works if provided a list of character vectors
paste.data.frame now works as well if a data.table is passed instead of a data.frame
txt_recode gains an extra argument na.rm

3 years ago

Fixing the Solaris compilation issue in ufal::udpipe::multiword_splitter::append_token

3 years ago

Update to UDPipe 1.2.1 (28 Sep 2018)
- this adds segment_size and learning_rate_final parameters to tokenizer training
- correctly set SpaceAfter for last token when normalizing spaces.
The default of udpipe_download_model has changed: it now downloads models built on Universal Dependencies 2.5 instead of models built on Universal Dependencies 2.4
Added txt_count
Added txt_overlap
Added dtm_conform
Added dtm_chisq
Added dtm_svd_similarity
Added as_fasttext
Added unlist_tokens
txt_recode_ngram now also works gracefully when ngram is set to 1, although it is not intended to be used with ngram set to 1
Experimental changes regarding cbind_dependencies which might change in a subsequent release.
- cbind_dependencies has now been implemented for type 'child'.
- cbind_dependencies now allows adding the row numbers of the parent or children to which the token is linked, using the dependency parsing output.
Experimental and unfinished work on allowing to easily query dependency relations

4 years ago

The default of udpipe_download_model has changed: it now downloads models built on Universal Dependencies 2.4 instead of models built on Universal Dependencies 2.3
also allow strsplit.data.frame to work if the data argument is a data.table
in case the model loaded with udpipe_load_model is a nil pointer (most likely because the user restarted the R session without realising it), try reloading the model file in udpipe_annotate
fix issue in udpipe_reconstruct giving wrong values in the start/end positions of a token when the token had both SpacesBefore and SpacesAfter. Users of versions prior to 0.8.3 can circumvent this issue by removing leading/trailing white space with trimws on the text before using udpipe::udpipe.
document_term_matrix now gains argument weight allowing to select another column to put into the matrix cells
add txt_contains

4 years ago

udpipe::udpipe now gains 2 arguments: parallel.cores and parallel.chunksize in order to annotate in parallel over your CPU cores.
document_term_matrix.data.frame now preserves order of the documents (issue #44)
dtm_remove_lowfreq, dtm_remove_tfidf, dtm_remove_terms gain extra argument remove_emptydocs
explicitly add drop=FALSE to internal dtm_... calls
add dtm_remove_sparseterms (issue #44)
make sure downloading a model fails gracefully if the GitHub internet resource is not available on CRAN machines
udpipe_download_model now also returns download_failed/download_message indicating if the download failed due to internet connectivity issues