text2vec Versions

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.

0.6

3 years ago

0.5.1.5

4 years ago

Minor release 0.5.1.5 in order to allow Zenodo to issue a DOI.

0.5.1

6 years ago

See NEWS file for details.

0.5.0

6 years ago

A lot of improvements; check the NEWS file for details.

0.4.0

7 years ago

text2vec 0.4.0

See 0.4 milestone tags.

  1. Now under GPL (>= 2) Licence
  2. "immutable" iterators - no need to reinitialize them
  3. Unified models interface (see the short sketch after this list)
  4. New models: LSA, LDA, GloVe with L1 regularization
  5. Fast similarity and distance calculation: Cosine, Jaccard, Relaxed Word Mover's Distance, Euclidean
  6. Better handling of UTF-8 strings, thanks to @qinwf
  7. Iterators and models rely on the R6 package
  8. GloVe is even faster now - roughly a 3x performance boost from code optimizations and single-precision float arithmetic
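
Below is a minimal sketch of the unified models interface and the new similarity helpers. It is written against the current text2vec API, so function names and arguments may differ slightly from the 0.4.0 release; the movie_review data set ships with the package.

```r
library(text2vec)
data("movie_review")

# "immutable" iterator: the same object can be reused for the vocabulary and the DTM
it = itoken(movie_review$review[1:100], preprocessor = tolower,
            tokenizer = word_tokenizer, ids = movie_review$id[1:100],
            progressbar = FALSE)

v   = prune_vocabulary(create_vocabulary(it), term_count_min = 5)
dtm = create_dtm(it, vocab_vectorizer(v))

# unified models interface: models are R6 objects driven via fit_transform()
lsa     = LSA$new(n_topics = 10)
dtm_lsa = fit_transform(dtm, lsa)

# cosine similarity between documents in the reduced LSA space
doc_sim = sim2(dtm_lsa, method = "cosine", norm = "l2")
```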

0.3.0

8 years ago

text2vec 0.3.0 (2016-03-31)

  1. 2016-01-13 fix for #46, thanks to @buhrmann for reporting
  2. 2016-01-16 the vocabulary format changed.
    • doc_proportions are no longer kept; see #52.
    • added a stop_words argument to prune_vocabulary; its signature also changed.
  3. 2016-01-17 fix for #51. If the iterator over tokens returns a named list, these names will be:
    • stored as attr(corpus, 'ids')
    • used as rownames of the dtm
    • used as names of the dtm list in lda_c format
  4. 2016-02-02 high-level functions for corpus and vocabulary construction:
    • construction of a vocabulary from a list of itoken iterators.
    • construction of a dtm from a list of itoken iterators.
  5. 2016-02-10 renamed transformers
    • all transformers now start with transform_* - more intuitive, and simpler to use with autocompletion
  6. 2016-03-29 (accumulated since 2016-02-10)
    • renamed vocabulary to create_vocabulary.
    • new functions create_dtm and create_tcm.
    • all core functions can benefit from multicore machines (users have to register a parallel backend themselves).
    • fix for progress bars: they now reach 100%, and ticks are incremented after computation.
    • ids argument to itoken, which simplifies assignment of ids to DTM rows (illustrated in the sketch after this list).
    • create_vocabulary can now handle stopwords.
    • see all updates here
  7. 2016-03-30 more robust split_into() util.
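
A minimal sketch of the renamed API (create_vocabulary / create_dtm), the ids argument to itoken(), and stopword handling during vocabulary construction. It uses the current text2vec signatures, which may differ in detail from 0.3.0; the toy documents and the ids d1/d2 are made up for illustration.

```r
library(text2vec)

docs = c(d1 = "the quick brown fox jumps over the lazy dog",
         d2 = "the dog sleeps in the sun")
it = itoken(docs, tokenizer = word_tokenizer, ids = names(docs))

# stopwords are dropped during vocabulary construction
v   = create_vocabulary(it, stopwords = c("the", "in", "over"))
dtm = create_dtm(it, vocab_vectorizer(v))

rownames(dtm)  # "d1" "d2" -- taken from the ids passed to itoken()
```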

0.2.0

8 years ago
  • Fast text vectorization with a stable streaming API and arbitrary n-grams (a short usage sketch follows this list).
    • Functions for vocabulary extraction and management
    • Hash vectorizer (based on murmurhash3 from the digest package)
    • Vocabulary vectorizer
  • GloVe word embeddings.
    • Fast term co-occurrence matrix factorization via parallel asynchronous AdaGrad.
  • All core functions are written in C++.
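
A minimal sketch of the two vectorizers and GloVe training described above. It is written against the current text2vec API (for example, GlobalVectors$new() now takes rank and x_max), so argument names may differ from the 0.2.0 release.

```r
library(text2vec)
data("movie_review")

it = itoken(movie_review$review, preprocessor = tolower,
            tokenizer = word_tokenizer, progressbar = FALSE)

# hashing trick: no vocabulary has to be built or kept in memory
dtm_hash = create_dtm(it, hash_vectorizer(hash_size = 2^18))

# vocabulary vectorizer + term co-occurrence matrix for GloVe
v   = prune_vocabulary(create_vocabulary(it), term_count_min = 5)
tcm = create_tcm(it, vocab_vectorizer(v), skip_grams_window = 5L)

# GloVe factorizes the TCM with parallel asynchronous AdaGrad
glove = GlobalVectors$new(rank = 50, x_max = 10)
word_vectors = glove$fit_transform(tcm, n_iter = 10)
```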