Topic Modelling for Humans
This is primarily a bugfix release to bring back Py2.7 compatibility to gensim 3.8.
- lxml.etree.cElementTree (PR #2777, @tirkarthi)

Remove:

- gensim.models.FastText.load_fasttext_format: use load_facebook_vectors to load embeddings only (faster, less CPU/memory usage, does not support training continuation) and load_facebook_model to load the full model (slower, more CPU/memory intensive, supports training continuation)
- gensim.models.wrappers.fasttext (obsoleted by the new native gensim.models.fasttext implementation)
- gensim.examples
- gensim.nosy
- gensim.scripts.word2vec_standalone
- gensim.scripts.make_wiki_lemma
- gensim.scripts.make_wiki_online
- gensim.scripts.make_wiki_online_lemma
- gensim.scripts.make_wiki_online_nodebug
- gensim.scripts.make_wiki

(all of these obsoleted by the new native gensim.scripts.segment_wiki implementation)

Move:

- gensim.scripts.make_wikicorpus ➡ gensim.scripts.make_wiki.py
- gensim.summarization ➡ gensim.models.summarization
- gensim.topic_coherence ➡ gensim.models._coherence
- gensim.utils ➡ gensim.utils.utils
- gensim.parsing.* ➡ gensim.utils.text_utils

(old imports will continue to work)
- Pin the smart_open version for compatibility with Py2.7
- Allow gensim.downloader to run offline, by introducing a local file cache (mpenkov, #2545)
- Make the gensim.downloader target directory configurable (mpenkov, #2456)
- Suppress the smart_open deprecation warning globally (itayB, #2530)
- Fix the topn=0 versus topn=None bug in most_similar; accept topn of any integer type (Witiko, #2497)
- CHANGELOG.md (mpenkov, #2482)
- gensim.similarities.termsim module (Witiko, #2485)
- Add Support section in README (piskvorky, #2542)
- Fix WordEmbeddingsKeyedVectors.most_similar (Witiko, #2461)
- Make matutils.unitvec always return a float norm when requested (Witiko, #2419)
- Update the Doc2Vec.docvecs comment (gojomo, #2472)
gensim.models.fasttext.load_facebook_model function: load the full model (slower, more CPU/memory intensive, supports training continuation):
>>> from gensim.models.fasttext import load_facebook_model
>>> from gensim.test.utils import datapath
>>>
>>> cap_path = datapath("crime-and-punishment.bin")
>>> fb_model = load_facebook_model(cap_path)
>>>
>>> 'landlord' in fb_model.wv.vocab # Word is out of vocabulary
False
>>> oov_term = fb_model.wv['landlord']
>>>
>>> 'landlady' in fb_model.wv.vocab # Word is in the vocabulary
True
>>> iv_term = fb_model.wv['landlady']
>>>
>>> new_sent = [['lord', 'of', 'the', 'rings'], ['lord', 'of', 'the', 'flies']]
>>> fb_model.build_vocab(new_sent, update=True)
>>> fb_model.train(sentences=new_sent, total_examples=len(new_sent), epochs=5)
gensim.models.fasttext.load_facebook_vectors function: load the embeddings only (faster, less CPU/memory usage, does not support training continuation):
>>> from gensim.models.fasttext import load_facebook_vectors
>>>
>>> fbkv = load_facebook_vectors(cap_path)
>>>
>>> 'landlord' in fbkv.vocab # Word is out of vocabulary
False
>>> oov_vector = fbkv['landlord']
>>>
>>> 'landlady' in fbkv.vocab # Word is in the vocabulary
True
>>> iv_vector = fbkv['landlady']
To achieve consistency with the reference implementation from Facebook, a FastText model will now always report any word, out-of-vocabulary or not, as being in the model, and will always return some vector for any word looked up. Specifically:

- 'any_word' in ft_model will always return True. Previously, it returned True only if the full word was in the vocabulary. (To test whether a full word is in the known vocabulary, consult the wv.vocab property: 'any_word' in ft_model.wv.vocab will return False if the full word wasn't learned during model training.)
- ft_model['any_word'] will always return a vector. Previously, it raised KeyError for OOV words when the model had no vectors for any ngrams of the word.
- The gensim.models.FastText.load_fasttext_format function (deprecated) now loads the entire model contained in the .bin file, including the shallow neural network that enables training continuation. Loading this NN requires more CPU and RAM than before. Since this function is deprecated, consider using one of its alternatives, load_facebook_model or load_facebook_vectors, described above. Furthermore, you must now pass the full path to the file to load, including the file extension. Previously, if you specified a model path that ended with anything other than .bin, the code automatically appended .bin to the path before loading the model. This behavior was confusing, so we removed it.
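This always-return-a-vector behavior follows from how FastText composes a word's vector out of its character n-grams: every string has n-grams, so every string gets a vector. A minimal pure-Python sketch of the idea (the hash bucketing, toy vector table, and helper names here are illustrative assumptions, not gensim's real internals):

```python
# Simplified sketch of FastText-style OOV lookup: a word's vector is the
# average of its character n-gram vectors, so *any* word maps to some vector.
# The hash bucketing and toy vector table are assumptions for illustration.

def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of '<word>', with the angle-bracket padding FastText uses."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(wrapped) - n + 1)]

def oov_vector(word, ngram_vectors, dim=4, buckets=100):
    """Average the (hashed) n-gram vectors; never raises KeyError."""
    grams = char_ngrams(word)
    acc = [0.0] * dim
    for g in grams:
        vec = ngram_vectors[hash(g) % buckets]  # toy hash bucketing
        acc = [a + v for a, v in zip(acc, vec)]
    return [a / len(grams) for a in acc]

# Toy table: every hash bucket has a vector, so lookup always succeeds.
table = {b: [float(b % 7)] * 4 for b in range(100)}
vec = oov_vector("landlord", table)
assert len(vec) == 4  # some vector comes back even for an OOV word
```

In the real model, the n-gram vectors are learned during training, which is also why OOV words with similar spellings end up with similar vectors.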
- FastText.load_fasttext_model (@mpenkov, #2340)
- Doc2Vec.infer_vector (@tobycheese, #2347)
- LdaSeqModel (@horpto, #2360)
- process_result_queue from cycle in LdaMulticore (@horpto, #2358)
- LdaModel.do_mstep (@horpto, #2344)
- FastTextKeyedVectors using KeyedVectors (missing attribute compatible_hash) (@menshikh-iv, #2349)
- WordEmbeddingsKeyedVectors.most_similar (@Witiko, #2356)
- flake8==3.7.1 (@horpto, #2365)
- FastText documentation (@mpenkov, #2353)
- Any*Vec docstrings (@tobycheese, #2345)
- poincare documentation to indicate the relation format (@AMR-KELEG, #2357)
Fast Online NMF (@anotherbugmaster, #2007)
Benchmark on the wiki-english-20171001 corpus:
Model | Perplexity | Coherence | L2 norm | Train time (minutes) |
---|---|---|---|---|
LDA | 4727.07 | -2.514 | 7.372 | 138 |
NMF | 975.74 | -2.814 | 7.265 | 73 |
NMF (with regularization) | 985.57 | -2.436 | 7.269 | 441 |
Simple to use (same interface as LdaModel):
from gensim.models.nmf import Nmf
from gensim.corpora import Dictionary
import gensim.downloader as api
text8 = api.load('text8')
dictionary = Dictionary(text8)
dictionary.filter_extremes()
corpus = [
dictionary.doc2bow(doc) for doc in text8
]
nmf = Nmf(
corpus=corpus,
num_topics=5,
id2word=dictionary,
chunksize=2000,
passes=5,
random_state=42,
)
nmf.show_topics()
"""
[(0, '0.007*"km" + 0.006*"est" + 0.006*"islands" + 0.004*"league" + 0.004*"rate" + 0.004*"female" + 0.004*"economy" + 0.003*"male" + 0.003*"team" + 0.003*"elections"'),
(1, '0.006*"actor" + 0.006*"player" + 0.004*"bwv" + 0.004*"writer" + 0.004*"actress" + 0.004*"singer" + 0.003*"emperor" + 0.003*"jewish" + 0.003*"italian" + 0.003*"prize"'),
(2, '0.036*"college" + 0.007*"institute" + 0.004*"jewish" + 0.004*"universidad" + 0.003*"engineering" + 0.003*"colleges" + 0.003*"connecticut" + 0.003*"technical" + 0.003*"jews" + 0.003*"universities"'),
(3, '0.016*"import" + 0.008*"insubstantial" + 0.007*"y" + 0.006*"soviet" + 0.004*"energy" + 0.004*"info" + 0.003*"duplicate" + 0.003*"function" + 0.003*"z" + 0.003*"jargon"'),
(4, '0.005*"software" + 0.004*"games" + 0.004*"windows" + 0.003*"microsoft" + 0.003*"films" + 0.003*"apple" + 0.003*"video" + 0.002*"album" + 0.002*"fiction" + 0.002*"characters"')]
"""
Massive improvement of FastText compatibility (@mpenkov, #2313)
from gensim.models import FastText
# 'cc.ru.300.bin' - Russian Facebook FT model trained on Common Crawl
# Can be downloaded from https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ru.300.bin.gz
model = FastText.load_fasttext_format("cc.ru.300.bin")
# The fixed hash function produces the same output as Facebook's FastText and works correctly for non-Latin languages (for example, Russian)
assert "мяу" in model.wv.vocab  # 'мяу' is an in-vocabulary word
model.wv.most_similar("мяу")
"""
[('Мяу', 0.6820122003555298),
('МЯУ', 0.6373013257980347),
('мяу-мяу', 0.593108594417572),
('кис-кис', 0.5899622440338135),
('гав', 0.5866007804870605),
('Кис-кис', 0.5798211097717285),
('Кис-кис-кис', 0.5742273330688477),
('Мяу-мяу', 0.5699705481529236),
('хрю-хрю', 0.5508339405059814),
('ав-ав', 0.5479759573936462)]
"""
assert "котогород" not in m.wv.vocab # 'котогород' - out-of-vocab word
model.wv.most_similar("котогород", topn=3)
"""
[('автогород', 0.5463314652442932),
('ТагилНовокузнецкНовомосковскНовороссийскНовосибирскНовотроицкНовочеркасскНовошахтинскНовый',
0.5423436164855957),
('областьНовосибирскБарабинскБердскБолотноеИскитимКарасукКаргатКуйбышевКупиноОбьТатарскТогучинЧерепаново',
0.5377570390701294)]
"""
# Now that we have loaded the full model, we can continue training it
from gensim.test.utils import datapath
from smart_open import smart_open
with smart_open(datapath("crime-and-punishment.txt"), encoding="utf-8") as infile: # russian text
corpus = [line.strip().split() for line in infile]
model.train(corpus, total_examples=len(corpus), epochs=5)
Similarity search improvements (@Witiko, #2016)
Add similarity search using the Levenshtein distance in gensim.similarities.LevenshteinSimilarityIndex
Performance optimizations to gensim.similarities.SoftCosineSimilarity
(full benchmark)
dictionary size | corpus size | speed |
---|---|---|
1000 | 100 | 1.0× |
1000 | 1000 | 53.4× |
1000 | 100000 | 156784.8× |
100000 | 100 | 3.8× |
100000 | 1000 | 405.8× |
100000 | 100000 | 66262.0× |
See updated soft-cosine tutorial for more information and usage examples
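The soft cosine measure that these optimizations accelerate generalizes cosine similarity with a term-similarity matrix S, so documents that share no words can still score as similar if their words are related. A small self-contained sketch (the toy vocabulary and similarity values are made up for illustration; gensim computes the same quantity over sparse matrices):

```python
# Soft cosine similarity between two bag-of-words vectors, given a
# term-similarity matrix S (S[i][j] = similarity of terms i and j).
# Plain-Python sketch of the measure behind SoftCosineSimilarity.
from math import sqrt

def soft_cosine(x, y, S):
    def bilinear(a, b):
        # a^T S b over the shared vocabulary
        return sum(a[i] * S[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))
    denom = sqrt(bilinear(x, x)) * sqrt(bilinear(y, y))
    return bilinear(x, y) / denom if denom else 0.0

# Toy vocabulary: ["play", "game", "weather"]; "play" and "game" are related (0.8).
S = [[1.0, 0.8, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
doc1 = [1, 0, 0]   # "play"
doc2 = [0, 1, 0]   # "game"
doc3 = [0, 0, 1]   # "weather"
assert soft_cosine(doc1, doc2, S) > soft_cosine(doc1, doc3, S)
# With S = identity matrix, soft cosine reduces to ordinary cosine similarity.
```

Note how doc1 and doc2 share no terms, yet score 0.8 because S encodes that their terms are related.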
Add python3.7
support (@menshikh-iv, #2211)
- Reduce Phraser memory usage (drop frequencies) (@jenishah, #2208)
- ldamodel.update_dir_prior (@horpto, #2274)
- KeyedVector.wmdistance (@horpto, #2326)
- remove_unreachable_nodes in gensim.summarization (@horpto, #2263)
- mz_entropy from gensim.summarization (@horpto, #2267)
- filter_extremes methods in Dictionary and HashDictionary (@horpto, #2303)
- KeyedVectors.relative_cosine_similarity (@rsdel2007, #2307)
- Add random_seed to LdaMallet (@Zohaggie & @menshikh-iv, #2153)
- Add common_terms parameter to sklearn_api.PhrasesTransformer (@pmlk, #2074)
- corpora.Dictionary based on special tokens (@Froskekongen, #2200)
- six usage (xrange, map, zip) (@horpto, #2264)
- line2doc methods of LowCorpus and MalletCorpus (@horpto, #2269)
- PYTHONHASHSEED (@menshikh-iv, #2196)
- __getitem__ code duplication in gensim.models.phrases (@jenishah, #2206)
- flake8-rst for docstring code examples (@kataev, #2192)
- py26 stuff (@menshikh-iv, #2214)
- Use itertools.chain instead of sum to concatenate lists (@Stigjb, #2212)
- utils.get_max_id (@horpto, #2254)
- np.sum(generator) (@rsdel2007, #2296)
- BM25 (@horpto, #2275)
- Set metadata=True for the make_wikicorpus script by default (@Xinyi2016, #2245)
- Phrases (@rsdel2007, #2331)
- Replace open() by smart_open() in gensim.models.fasttext._load_fasttext_format (@rsdel2007, #2335)
- *Vec corpusfile-based training (@bm371613, #2239)
- malletmodel2ldamodel conversion (@horpto, #2288)
- LdaModel (@horpto, #2308)
- SvmLightCorpus.serialize if labels is an instance of numpy.ndarray (@aquatiko, #2243)
- plotly>=3.0.0 (@jenishah, #2226)
- keep_n behavior for Dictionary.filter_extremes (@johann-petrak, #2232)
- sphinx==1.8.1 (@menshikh-iv, #None)
- np.issubdtype warnings (@marioyc, #2210)
- -c from gensim.downloader description (@horpto, #2262)
- Use viz.line() instead of viz.updatetrace() (@allenyllee, #2252)
- gensim.downloader & fix rendering of code examples (@menshikh-iv, #2327)
- gensim.models (@rsdel2007, #2323)
- Doc2Vec documentation: how tags are assigned in corpus_file mode (@persiyanov, #2320)
- gensim/models/keyedvectors.py (@rsdel2007, #2290)
- Phrases (@jenishah, #2242)
- KeyedVectors.evaluate_word_* (@Stigjb, #2205)
- KeyedVector.evaluate_word_analogies (@Stigjb, #2207)
- WmdSimilarity documentation (@jagmoreira, #2217)
- fify -> fifty in gensim.parsing.preprocessing.STOPWORDS (@coderwassananmol, #2220)
- Remove alpha="auto" from LdaMulticore (not supported yet) (@johann-petrak, #2225)
- tutorials.md (@rsdel2007, #2302)
File-based training for *2Vec
models (@persiyanov, #2127 & #2078 & #2048)
New training mode for *2Vec
models (word2vec, doc2vec, fasttext) that allows model training to scale linearly with the number of cores (full GIL elimination). The result of our Google Summer of Code 2018 project by Dmitry Persiyanov.
Benchmark on the full English Wikipedia, Intel(R) Xeon(R) CPU @ 2.30GHz 32 cores (GCE cloud), MKL BLAS:
Model | Queue-based version [sec] | File-based version [sec] | speed up | Accuracy (queue-based) | Accuracy (file-based) |
---|---|---|---|---|---|
Word2Vec | 9230 | 2437 | 3.79x | 0.754 (± 0.003) | 0.750 (± 0.001) |
Doc2Vec | 18264 | 2889 | 6.32x | 0.721 (± 0.002) | 0.683 (± 0.003) |
FastText | 16361 | 10625 | 1.54x | 0.642 (± 0.002) | 0.660 (± 0.001) |
Usage:
import gensim.downloader as api
from multiprocessing import cpu_count
from gensim.utils import save_as_line_sentence
from gensim.test.utils import get_tmpfile
from gensim.models import Word2Vec, Doc2Vec, FastText
# Convert any corpus to the needed format: 1 document per line, words delimited by " "
corpus = api.load("text8")
corpus_fname = get_tmpfile("text8-file-sentence.txt")
save_as_line_sentence(corpus, corpus_fname)
# Choose num of cores that you want to use (let's use all, models scale linearly now!)
num_cores = cpu_count()
# Train models using all cores
w2v_model = Word2Vec(corpus_file=corpus_fname, workers=num_cores)
d2v_model = Doc2Vec(corpus_file=corpus_fname, workers=num_cores)
ft_model = FastText(corpus_file=corpus_fname, workers=num_cores)
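The corpus_file code path reads the LineSentence format: plain text, one document per line, tokens separated by single spaces, which is exactly what save_as_line_sentence produces. A stdlib-only sketch of writing and reading such a file by hand (the file name and toy corpus are illustrative):

```python
# Sketch of the LineSentence format used by corpus_file-based training:
# one document per line, tokens separated by " ". This mimics what
# save_as_line_sentence does, modulo encoding details.
import os
import tempfile

docs = [["hello", "world"], ["file", "based", "training"]]
path = os.path.join(tempfile.mkdtemp(), "corpus.txt")

# Write: join each document's tokens with spaces, one document per line.
with open(path, "w", encoding="utf-8") as fout:
    for doc in docs:
        fout.write(" ".join(doc) + "\n")

# Read: splitting each line on whitespace recovers the token lists.
with open(path, encoding="utf-8") as fin:
    restored = [line.split() for line in fin]
assert restored == docs
```

Because the file is plain text, workers can each seek to their own byte range and parse independently, which is what enables the linear scaling without the GIL-bound job queue.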
- FastText (@mcemilg, #2178)
- BM25 (@Shiki-H, #2146)
- Add name_only option for the downloader api (@aneesh-joshi, #2143)
- Make the word2vec2tensor script compatible with python3 (@vsocrates, #2147)
- Wikicorpus (@mattilyra, #2089)
- Make similarity_matrix support non-contiguous dictionaries (@Witiko, #2047)
- AuthorTopicModel (@philipphager, #2122)
- AuthorTopicModel (@probinso, #2133)
- Fix keywords issue with short input (@LShostenko, #2154)
- min_count handling in phrases detection using npmi_scorer (@lopusz, #2072)
- Phraser log message (@robguinness, #2151)
- np.integer -> np.int in AuthorTopicModel (@menshikh-iv, #2145)
- prune_at parameter description for gensim.corpora.Dictionary (@yxonic, #2128)
- default -> auto prior parameter in documentation for lda-related models (@Laubeee, #2156)
- gensim.models.translation_matrix (@nzw0301, #2164)
- gensim.models.Word2Vec (@nzw0301, #2161)
- gensim.models.Doc2Vec (@xuhdev, #2165)
- Phrases (@RunHorst, #2148)
This release comprises a glorious 38 pull requests from 28 contributors. Most of the effort went into improving the documentation—hence the release code name "Docs 💬"!
Apart from the massive overhaul of all Gensim documentation (including docstring style and examples—you asked for it), we also managed to sneak in some new functionality and a number of bug fixes. As usual, see the notes below for a complete list, with links to pull requests for more details.
Huge thanks to all contributors! Nobody loves working on documentation. 3.5.0 is a result of several months of laborious, unglamorous, and sometimes invisible work. Enjoy!
- *2vec models (@steremma & @piskvorky & @menshikh-iv, #1944, #2087)
- gensim.models.phrases (@CLearERR & @menshikh-iv, #1950)
- gensim.models.AuthorTopicModel (@souravsingh & @menshikh-iv, #1907)
- gensim.similarities.docsim (@CLearERR & @menshikh-iv, #2030)
- IndexedCorpus (@darindf, #2033)
- gensim.models.coherencemodel (@CLearERR & @menshikh-iv, #1933)
- gensim.sklearn_api (@steremma & @menshikh-iv, #1895)
- gensim.models.KeyedVectors.similarity_matrix (@Witiko, #1971)
- Use smart_open() instead of open() in notebooks (@sharanry, #1812)
- Add add_entity method to KeyedVectors to allow adding word vectors manually (@persiyanov, #1957)
- AuthorTopicModel (@Stamenov, #1766)
- Add evaluate_word_analogies (will replace accuracy) method to KeyedVectors (@akutuzov, #1935)
- TfidfModel (@markroxor, #1780)
- Add max_final_vocab in lieu of min_count in Word2Vec (@aneesh-joshi, #1915)
- Add dtype argument for chunkize_serial in LdaModel (@darindf, #2027)
- Phrases.analyze_sentence (@JonathanHourany, #2070)
- Add ns_exponent parameter to control the negative sampling distribution for *2vec models (@fernandocamargoti, #2093)
- Doc2Vec.infer_vector + notebook cleanup (@gojomo, #2103)
- Doc2Vec.infer_vector (@umangv, #2063)
- word2vec and doc2vec models saved using old Gensim versions (@manneshiva, #2012)
- SoftCosineSimilarity.get_similarities on corpora (@Witiko, #1972)
- matutils.unitvec according to input dtype (@o-P-o, #1992)
- gensim.corpora.WikiCorpus (@steremma, #2042)
- Similarity.query_shards in multiprocessing case (@bohea, #2044)
- df == "n" (@PeteBleackley, #2021)
- _is_single from Phrases for case when corpus is a NumPy array (@rmalouf, #1987)
- EuclideanKeyedVectors.similarity_matrix (@Witiko, #1984)
- D2VTransformer and W2VTransformer (@MritunjayMohitesh, #1945)
- Doc2Vec.infer_vector after loading old Doc2Vec (gensim<=3.2) (@manneshiva, #1974)
- load_word2vec_format (@DennisChen0307, #1968)
- keras==2.1.5 (@menshikh-iv, #1963)