Topic Modelling for Humans
- `hs` and `negative` in Word2Vec (gau-nernst, #3443)
- Morfessor, tox and `gensim.models.wrappers` by @pabs3 in https://github.com/RaRe-Technologies/gensim/pull/3345
- `wmdistance` by @TLouf in https://github.com/RaRe-Technologies/gensim/pull/3327
Full Changelog: https://github.com/RaRe-Technologies/gensim/compare/4.2.0...4.3.0
A number of incremental improvements, optimizations and bugfixes; see the CHANGELOG.
This is a bugfix release that addresses leftover compatibility issues with older versions of numpy and macOS.
This is a bugfix release that addresses compatibility issues with older versions of numpy.
Gensim 4.1 brings two major new functionalities:
Several minor changes are not backwards compatible with previous versions of Gensim. The affected functionality is relatively little used and unlikely to affect most users, so we opted not to require a major version bump. Nevertheless, we describe the changes below.
We now handle both the `positive` and `negative` keyword parameters consistently. Each may now be either a single key (a string) or a list of keys.
So you can now simply do:

```python
model.most_similar(positive='war', negative='peace')
```

instead of the slightly more involved

```python
model.most_similar(positive=['war'], negative=['peace'])
```

Both invocations remain correct, so you can use whichever is most convenient. If you were somehow expecting gensim to interpret the strings as lists of characters, e.g.

```python
model.most_similar(positive=['w', 'a', 'r'], negative=['p', 'e', 'a', 'c', 'e'])
```

then you will need to specify the lists explicitly in gensim 4.1.
`steps` parameter from doc2vec

With the newer version, do this:
```python
model.infer_vector(..., epochs=123)
```

instead of this:

```python
model.infer_vector(..., steps=123)
```
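If you need to support both old and new Gensim from the same codebase, a small wrapper can translate the old keyword. This helper (`infer_vector_compat`) is a hypothetical sketch, not part of gensim:

```python
def infer_vector_compat(model, doc_words, **kwargs):
    """Call model.infer_vector(), also accepting the old `steps` keyword.

    Translates `steps` to the newer `epochs` name before delegating.
    """
    if 'steps' in kwargs:
        kwargs['epochs'] = kwargs.pop('steps')
    return model.infer_vector(doc_words, **kwargs)
```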
Plus a large number of smaller improvements and fixes, as usual.
⚠️ If migrating from old Gensim 3.x, read the Migration guide first.
- `shrink_windows` argument for Word2Vec, by @M-Demay
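For context, `shrink_windows` controls word2vec's classic reduced-window sampling: when enabled, the effective context window for each target word is drawn uniformly from 1 to `window`, which weights nearby words more heavily on average; when disabled, the full window is always used. A rough sketch of the idea (not gensim's actual implementation):

```python
import random

def effective_window(window, shrink_windows, rng=random):
    # shrink_windows=True: sample the per-target context window uniformly
    # from 1..window (the original word2vec behaviour).
    # shrink_windows=False: always use the full window.
    return rng.randint(1, window) if shrink_windows else window

print(effective_window(5, shrink_windows=False))  # always 5
```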
Bugfix release to address issues with wheels on Windows:
⚠️ Gensim 4.0 contains breaking API changes! See the Migration guide to update your existing Gensim 3.x code and models.
Gensim 4.0 is a major release with lots of performance & robustness improvements, and a new website.
Massively optimized popular algorithms the community has grown to love: fastText, word2vec, doc2vec, phrases:
a. Efficiency
model | 3.8.3: wall time / peak RAM / throughput | 4.0.0: wall time / peak RAM / throughput
---|---|---
fastText | 2.9h / 4.11 GB / 822k words/s | 2.3h / 1.26 GB / 914k words/s
word2vec | 1.7h / 0.36 GB / 1685k words/s | 1.2h / 0.33 GB / 1762k words/s
In other words, fastText now needs 3x less RAM (and is faster); word2vec has 2x faster init (and needs less RAM, and is faster); detecting collocation phrases is 2x faster. (4.0 benchmarks)
b. Robustness. We fixed a bunch of long-standing bugs by refactoring the internal code structure (see 🔴 Bug fixes below)
c. Simplified OOP model for easier model exports and integration with TensorFlow, PyTorch &co.
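The headline efficiency ratios under (a) follow directly from the benchmark table; a quick back-of-the-envelope check:

```python
# Figures taken from the benchmark table above:
# fastText peak RAM (GB) and word2vec wall time (hours), 3.8.3 vs 4.0.0.
fasttext_ram_old, fasttext_ram_new = 4.11, 1.26
w2v_time_old, w2v_time_new = 1.7, 1.2

print(round(fasttext_ram_old / fasttext_ram_new, 2))  # ~3.26, i.e. "3x less RAM"
print(round(w2v_time_old / w2v_time_new, 2))          # ~1.42x faster end-to-end
```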
These improvements come to you transparently aka "for free", but see Migration guide for some changes that break the old Gensim 3.x API. Update your code accordingly.
Dropped a bunch of externally contributed modules and wrappers: summarization, pivoted TFIDF normalization, and wrappers for 3rd-party libraries: Mallet, scikit-learn, DTM, Vowpal Wabbit, wordrank, varembed.
Their code quality was not up to our standards, and there was no one to maintain these modules, answer user questions, or support them.
So rather than let them rot, we took the hard decision of removing these contributed modules from Gensim. If anyone's interested in maintaining them, please fork & publish into your own repo. They can live happily outside of Gensim, linked to as "contributed" from the Gensim docs.
Dropped Python 2. Gensim 4.0 is Py3.6+. Read our Python version support policy.
A new Gensim website – finally! 🙃
So, a major clean-up release overall. We're happy with this tighter, leaner and faster Gensim.
This is the direction we'll keep going forward: less kitchen-sink of "latest academic algorithms", more focus on robust engineering, targeting concrete NLP & document similarity use-cases.
- `max_final_vocab` parameter in fastText constructor, by @mpenkov
- `alpha` parameter in LDA model, by @xh2
- `save_facebook_model` failure after update-vocab & other initialization streamlining, by @gojomo
- `xml.etree.cElementTree`, by @hugovk
- `similarities.index` to the more appropriate `similarities.annoy`, by @piskvorky
- `num_words` to `topn` in dtm_coherence, by @MeganStodel
- `on_batch_begin` and `on_batch_end` callbacks, by @mpenkov
- `pattern` dependency, by @mpenkov
- `gensim.viz` subpackage, by @mpenkov
This 4.0.0beta pre-release is for users who want cutting-edge performance and bug fixes, plus users who want to help out by testing and providing feedback: code, documentation, workflows… Please let us know on the mailing list!
Install the pre-release with:
pip install --pre --upgrade gensim
Production stability is important to Gensim, so we're improving the process of upgrading already-trained saved models. There'll be an explicit model upgrade script between each 4.n and 4.(n+1) Gensim release. Check progress here.