Compress Fasttext Versions

Tools for shrinking fastText models (in gensim format)

v0.1.5

2 weeks ago

v0.1.4

7 months ago

Closes https://github.com/avidale/compress-fasttext/issues/19#event-10655768121 by removing the pqkmeans dependency, which has uninstallable dependencies of its own.

v0.1.1

2 years ago

Wrap up the refactoring related to the new gensim version
Add FastTextTransformer, a scikit-learn-like wrapper for feature extraction
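
To illustrate what a scikit-learn-like wrapper for feature extraction can look like, here is a minimal sketch. It is not the library's code: the class name MeanVectorTransformer, the toy lookup table, lowercase whitespace tokenization, and mean pooling are all assumptions made for illustration, and the actual FastTextTransformer may tokenize and pool differently. A real fastText model would also build vectors for unknown words from n-grams, which this toy skips.

```python
from typing import Dict, List, Sequence


class MeanVectorTransformer:
    """Hypothetical stand-in for a scikit-learn-like text featurizer.

    Follows the fit/transform convention and maps each text to the
    average of its word vectors. This is not compress-fasttext code.
    """

    def __init__(self, vectors: Dict[str, Sequence[float]], dim: int):
        self.vectors = vectors  # word -> embedding (toy lookup table)
        self.dim = dim

    def fit(self, texts: List[str], y=None) -> "MeanVectorTransformer":
        # Nothing to learn: the embeddings are assumed to be pretrained.
        return self

    def transform(self, texts: List[str]) -> List[List[float]]:
        rows = []
        for text in texts:
            # Assumption: lowercase whitespace tokenization, unknown words skipped.
            found = [self.vectors[t] for t in text.lower().split() if t in self.vectors]
            if not found:
                rows.append([0.0] * self.dim)  # no known words: zero vector
                continue
            # Average the vectors column by column.
            rows.append([sum(col) / len(found) for col in zip(*found)])
        return rows


toy_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
transformer = MeanVectorTransformer(toy_vectors, dim=2)
features = transformer.fit([]).transform(["cat dog", "unknown"])
print(features)
```

Because the object exposes fit and transform, it can in principle be placed in front of a classifier in a pipeline, which is the usual reason for wrapping embeddings this way.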

gensim-4-draft

2 years ago

Support gensim>=4.0.0 and deprecate earlier gensim versions

New released models

Russian models based on geowac_tokens_none_fasttextskipgram_300_5_2020 from RusVectores, 1.9GB:

Model	RAM size, MB	Similarity to the original model
geowac_tokens_sg_300_5_2020-100K-20K-100.bin	26	0.9619
geowac_tokens_sg_300_5_2020-400K-100K-300.bin	202	0.9990

English models based on cc.en.300.bin from the Facebook website, 7.2GB:

Model	RAM size, MB	Similarity to the original model
ft_cc.en.300_freqprune_50K_5K_pq_100.bin	12	0.3570
ft_cc.en.300_freqprune_100K_20K_pq_100.bin	25	0.6081
ft_cc.en.300_freqprune_100K_20K_pq_300.bin	48	0.6268
ft_cc.en.300_freqprune_400K_100K_pq_300.bin	199	0.8782

Many more small models for various languages are available at https://zenodo.org/record/4905385.
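
These notes do not state how the "similarity to the original model" column is computed. One plausible reading, sketched here purely as an assumption, is the mean cosine similarity between the original and compressed vectors over a sample of words. The function names and the toy vectors below are hypothetical.

```python
import math
from typing import Dict, Iterable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_similarity(
    original: Dict[str, Sequence[float]],
    compressed: Dict[str, Sequence[float]],
    words: Iterable[str],
) -> float:
    """Average per-word cosine between the two models (assumed metric)."""
    scores = [cosine(original[w], compressed[w]) for w in words]
    return sum(scores) / len(scores)


# Toy stand-ins for the two models: word -> vector.
original = {"cat": [1.0, 0.0], "dog": [0.0, 2.0]}
compressed = {"cat": [0.9, 0.0], "dog": [1.0, 1.0]}  # "dog" drifted in direction
score = mean_similarity(original, compressed, ["cat", "dog"])
print(round(score, 4))
```

Under this reading, a value close to 1 means the compressed vectors point in nearly the same directions as the original ones.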

v0.0.4

2 years ago

Publish more compressed models and compare their quality
Make the compressed models downloadable

v0.0.6

4 years ago

Require sklearn and pqkmeans only in the [full] setup mode

v0.0.3

4 years ago

Arithmetic operations on compressed matrices no longer raise errors. However, they convert these matrices to numpy.array, which costs time and memory.
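
The trade-off described above can be sketched with a toy class. This is not the library's implementation: ToyQuantizedMatrix is hypothetical, it uses a single codebook of whole rows rather than real product quantization, and it returns plain Python lists instead of numpy.array so that the example has no dependencies.

```python
from typing import List


class ToyQuantizedMatrix:
    """Hypothetical compressed matrix: each row is an index into a small codebook."""

    def __init__(self, codebook: List[List[float]], codes: List[int]):
        self.codebook = codebook  # a few distinct rows
        self.codes = codes        # one small integer per matrix row

    def __getitem__(self, i: int) -> List[float]:
        # Row lookup decodes a single row; the matrix itself stays compressed.
        return list(self.codebook[self.codes[i]])

    def to_dense(self) -> List[List[float]]:
        # Materializes every row: memory grows to rows x columns.
        return [self[i] for i in range(len(self.codes))]

    def __mul__(self, scalar: float) -> List[List[float]]:
        # Arithmetic falls back to the dense form instead of raising an error.
        return [[x * scalar for x in row] for row in self.to_dense()]


m = ToyQuantizedMatrix(codebook=[[1.0, 2.0], [3.0, 4.0]], codes=[0, 1, 1, 0])
doubled = m * 2  # result is a dense list of rows, no longer compressed
print(doubled)
```

Looking up individual rows keeps the compact representation, while any whole-matrix arithmetic pays for a full dense copy, so it is worth avoiding on large models.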

0.0.2

4 years ago

The prune_ft_freq method now takes into account not only the frequency of each n-gram but also the norm of its embedding. This improves the accuracy of the compressed model at the same model size.
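
The idea of ranking n-grams by both frequency and embedding norm can be sketched as follows. The notes do not say how the two signals are combined, so the product used here is an assumption, and select_ngrams is a hypothetical helper, not the signature of prune_ft_freq.

```python
import math
from typing import Dict, List, Sequence


def select_ngrams(
    freqs: Dict[str, int],
    vectors: Dict[str, Sequence[float]],
    keep: int,
) -> List[str]:
    """Keep the `keep` n-grams with the highest frequency * norm score."""

    def score(ngram: str) -> float:
        norm = math.sqrt(sum(x * x for x in vectors[ngram]))
        # Assumption: multiply the two signals; the real weighting may differ.
        return freqs[ngram] * norm

    return sorted(freqs, key=score, reverse=True)[:keep]


# Toy data: "<ca" is the most frequent n-gram but has a near-zero embedding.
freqs = {"<ca": 100, "cat": 90, "at>": 10}
vectors = {"<ca": [0.1, 0.0], "cat": [3.0, 4.0], "at>": [6.0, 8.0]}
kept = select_ngrams(freqs, vectors, keep=2)
print(kept)
```

In this toy data, ranking by frequency alone would keep "<ca" and "cat", whereas the combined score drops "<ca" because its embedding contributes almost nothing to the word vectors built from it.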

v0.0.1

4 years ago

We publish the code for compressing Gensim FastText models and using their small versions.

We also publish 4 compressed versions of the ruscorpora_none_fasttextskipgram_300_2_2019 model from RusVectores.

Model	RAM, MB	Similarity to the original	Intrinsic evaluation (relative to the original)
ft_freqprune_50K_5K_pq_100.bin	13	92.7%	89.9%
ft_freqprune_100K_20K_pq_100.bin	28	96.1%	96.6%
ft_freqprune_100K_20K_pq_300.bin	51	98.2%	97.9%
ft_freqprune_400K_100K_pq_300.bin	180	99.7%	99.9%