Compress Fasttext Versions

Tools for shrinking fastText models (in gensim format)

v0.1.5

2 weeks ago

v0.1.4

7 months ago

Closes https://github.com/avidale/compress-fasttext/issues/19#event-10655768121 by removing the pqkmeans dependency, which has uninstallable dependencies of its own.

v0.1.1

2 years ago
  • Wrap up the refactoring related to the new gensim version
  • Add FastTextTransformer, a scikit-learn-like wrapper for feature extraction
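
To illustrate what a scikit-learn-like wrapper for feature extraction generally looks like, here is a minimal sketch. It is not the library's FastTextTransformer: the class name AverageVectorTransformer, the toy word-to-vector dictionary, and the choice to average word vectors are all assumptions made for illustration.

```python
class AverageVectorTransformer:
    """Hypothetical scikit-learn-style transformer: text -> mean word vector."""

    def __init__(self, vectors, dim):
        # `vectors` is a plain dict standing in for an embedding model.
        # Unlike a real fastText model, it cannot embed unseen words.
        self.vectors = vectors
        self.dim = dim

    def fit(self, X, y=None):
        # Nothing to learn: the embeddings are assumed to be pre-trained.
        return self

    def transform(self, X):
        result = []
        for text in X:
            words = [w for w in text.lower().split() if w in self.vectors]
            if not words:
                # No known words: fall back to a zero vector.
                result.append([0.0] * self.dim)
                continue
            mean = [
                sum(self.vectors[w][i] for w in words) / len(words)
                for i in range(self.dim)
            ]
            result.append(mean)
        return result

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


toy_vectors = {"banana": [1.0, 0.0], "soup": [0.0, 1.0]}
transformer = AverageVectorTransformer(toy_vectors, dim=2)
features = transformer.fit_transform(["banana soup", "unknown"])
```

Because the class exposes fit, transform, and fit_transform, an object of this shape can usually be placed as the first step of a scikit-learn pipeline, ahead of a classifier.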

gensim-4-draft

2 years ago

Supports gensim>=4.0.0 and deprecates earlier gensim versions.

Newly released models

Russian models based on geowac_tokens_none_fasttextskipgram_300_5_2020 from RusVectores, 1.9GB:

| Model | RAM size, MB | Similarity to the original model |
|---|---|---|
| geowac_tokens_sg_300_5_2020-100K-20K-100.bin | 26 | 0.9619 |
| geowac_tokens_sg_300_5_2020-400K-100K-300.bin | 202 | 0.9990 |

English models based on cc.en.300.bin from the Facebook website, 7.2GB:

| Model | RAM size, MB | Similarity to the original model |
|---|---|---|
| ft_cc.en.300_freqprune_50K_5K_pq_100.bin | 12 | 0.3570 |
| ft_cc.en.300_freqprune_100K_20K_pq_100.bin | 25 | 0.6081 |
| ft_cc.en.300_freqprune_100K_20K_pq_300.bin | 48 | 0.6268 |
| ft_cc.en.300_freqprune_400K_100K_pq_300.bin | 199 | 0.8782 |
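
The release notes do not define how the "similarity to the original model" column is computed. One plausible reading is the mean cosine similarity between the original and compressed vectors over a fixed word list. The sketch below shows that calculation under this assumption, with toy dictionaries standing in for the two models. The function names and data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_similarity(original, compressed, words):
    """Average the per-word cosine similarity between two embedding tables."""
    return sum(cosine(original[w], compressed[w]) for w in words) / len(words)


# Toy stand-ins for the original and the compressed model.
original = {"cat": [1.0, 0.0], "dog": [0.0, 2.0]}
compressed = {"cat": [2.0, 0.0], "dog": [1.0, 1.0]}

score = mean_similarity(original, compressed, ["cat", "dog"])
```

Here the "cat" vector keeps its direction (similarity 1) while "dog" is rotated by 45 degrees, so the mean falls between the two values.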

Many more small models for various languages are available at https://zenodo.org/record/4905385.

v0.0.4

2 years ago
  • Publish more compressed models and compare their quality
  • Make the compressed models downloadable

v0.0.6

4 years ago
  • Require sklearn and pqkmeans only in the [full] setup mode

v0.0.3

4 years ago

Arithmetic operations on compressed matrices no longer raise errors. However, they convert these matrices to numpy.array, which costs time and memory.
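
The trade-off described above can be sketched with a toy compressed matrix. This is not the library's implementation: the class name ToyQuantizedMatrix is hypothetical, it quantizes whole rows rather than sub-vectors, and it uses plain Python lists in place of numpy.array so that the example stays dependency-free.

```python
class ToyQuantizedMatrix:
    """Hypothetical compressed matrix: each row is an index into a small codebook."""

    def __init__(self, codebook, codes):
        self.codebook = codebook  # list of centroid rows
        self.codes = codes        # one codebook index per matrix row

    def __getitem__(self, row):
        # Cheap access: decode a single row on demand.
        return list(self.codebook[self.codes[row]])

    def to_dense(self):
        # Expensive: materialises every row as a full list of floats.
        return [list(self.codebook[c]) for c in self.codes]

    def __mul__(self, scalar):
        # Arithmetic falls back to the dense form,
        # so the result is no longer compressed.
        return [[value * scalar for value in row] for row in self.to_dense()]

    __rmul__ = __mul__


matrix = ToyQuantizedMatrix(codebook=[[1.0, 2.0], [3.0, 4.0]], codes=[0, 1, 1, 0])
doubled = matrix * 2  # a plain dense nested list, four full rows
```

The compressed object stores two centroid rows plus four small indices, while the result of the multiplication stores four full rows. On real models that difference is where the extra time and memory go.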

0.0.2

4 years ago

The prune_ft_freq method now takes into account not only the frequency of each n-gram but also the norm of its embedding. This improves the accuracy of the compressed model at the same model size.
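
The notes do not say how frequency and norm are combined. As a rough sketch, assume the score of an n-gram is its count multiplied by the Euclidean norm of its embedding, and that the highest-scoring n-grams are kept. The function select_ngrams and the toy data are hypothetical and do not reproduce the actual prune_ft_freq code.

```python
import math


def select_ngrams(counts, embeddings, keep):
    """Keep the `keep` n-grams with the highest count * embedding-norm score."""

    def score(ngram):
        norm = math.sqrt(sum(x * x for x in embeddings[ngram]))
        return counts[ngram] * norm  # assumed combination: a simple product

    return sorted(counts, key=score, reverse=True)[:keep]


counts = {"<ca": 100, "cat": 80, "at>": 90}
embeddings = {"<ca": [0.1, 0.0], "cat": [3.0, 4.0], "at>": [0.6, 0.8]}

kept = select_ngrams(counts, embeddings, keep=2)
by_frequency_only = sorted(counts, key=counts.get, reverse=True)[:2]
```

In this toy data, "<ca" is the most frequent n-gram but has a near-zero embedding, so it contributes little to any word vector. Ranking by frequency alone would keep it, whereas the combined score drops it in favour of "cat".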

v0.0.1

4 years ago

We publish the code for compressing Gensim FastText models and using their small versions.

We also publish 4 compressed versions of the ruscorpora_none_fasttextskipgram_300_2_2019 model from RusVectores.

| Model | RAM, MB | Similarity to the original | Intrinsic evaluation (relative to the original) |
|---|---|---|---|
| ft_freqprune_50K_5K_pq_100.bin | 13 | 92.7% | 89.9% |
| ft_freqprune_100K_20K_pq_100.bin | 28 | 96.1% | 96.6% |
| ft_freqprune_100K_20K_pq_300.bin | 51 | 98.2% | 97.9% |
| ft_freqprune_400K_100K_pq_300.bin | 180 | 99.7% | 99.9% |