             Near-lossless Binarization of Word Embeddings
             =============================================

PREAMBLE

This work is  one  of  the  contributions  of  my  PhD  thesis  entitled
"Improving methods to learn word representations for efficient  semantic
similarities computations" in which  I  propose  new  methods  to  learn
better word embeddings. You can find and read my thesis freely available
at https://github.com/tca19/phd-thesis.

ABOUT

This repository contains source code to  binarize  any  real-value  word
embeddings into binary  vectors.   It  also  contains  some  scripts  to
evaluate the performance of the binary vectors  on  semantic  similarity
tasks  and  top-k  queries.   The   related   paper   can  be  found  at
https://aaai.org/ojs/index.php/AAAI/article/view/4692/4570.

If you use this repository, please cite:

@inproceedings{tissier2019near,
  author    = {Tissier, Julien and Gravier, Christophe and Habrard, Amaury},
  title     = {Near-Lossless Binarization of Word Embeddings},
  booktitle = {Proceedings of the Thirty-Third {AAAI} Conference on
               Artificial Intelligence, Honolulu, Hawaii, USA,
               January 27 - February 1, 2019.},
  volume    = {33},
  pages     = {7104--7111},
  year      = {2019},
  url       = {https://aaai.org/ojs/index.php/AAAI/article/view/4692},
  doi       = {10.1609/aaai.v33i01.33017104}
}

INSTALLATION

To compile the source files of this repository, you need to have on your
system:
  - OpenBLAS [1]
  - a C compiler (gcc, clang ...)
  - make

Then run the command `make` to build the different  binary  executables.

[1] https://github.com/xianyi/OpenBLAS/wiki/Precompiled-installation-packages

USAGE

1. Binarize word vectors
------------------------
Run the executable `binarize` to transform  real-value  embeddings  into
binary vectors.  The only mandatory command line argument  is  `-input`,
the filename containing the real-value vectors.

./binarize -input vectors.vec

The documentation  for  all  the  other  flags  can  be  displayed  with
`./binarize -h` or `./binarize --help`.

Binary vectors are saved by default into the file  `binary_vectors.vec`.
The first line of this file indicates the number of binary word  vectors
and the number of bits in each vector. Each following line is  formatted
like:

WORD INTEGER_1 INTEGER_2 [...]

Binary vectors are not saved as strings of zeros (0) and ones (1) but as
groups of unsigned long integers. Each integer represents 64 bits so for
a binary vector of 256 bits, there are 4 integers (4 * 64 =  256).   The
binary  vector  of  a  word  is  the   concatenation   of   the   binary
representations  of  all  the  integers  on  the  rest  of   its   line.
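
The format described above can be read back with a few lines of code.
The following is a minimal Python sketch, not part of this repository.
It assumes that fields are separated by whitespace, that the integers
are written in decimal, that the number of bits is a multiple of 64, and
that each integer is expanded most significant bit first. None of these
details is stated above, and the helper names are hypothetical.

```python
def parse_binary_vectors(lines):
    """Parse the text format into (n_words, n_bits, {word: (int, ...)}).

    Assumes whitespace-separated fields and decimal integers.
    """
    lines = iter(lines)
    # Header line: number of vectors, then number of bits per vector.
    n_words, n_bits = (int(x) for x in next(lines).split())
    n_ints = n_bits // 64  # e.g. 256 bits -> 4 integers
    vectors = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word, ints = fields[0], tuple(int(x) for x in fields[1:])
        if len(ints) != n_ints:
            raise ValueError(f"{word}: expected {n_ints} integers, got {len(ints)}")
        vectors[word] = ints
    return n_words, n_bits, vectors


def to_bit_string(ints):
    """Concatenate the 64-bit binary representation of each integer.

    Assumption: most significant bit first within each integer.
    """
    return "".join(format(i, "064b") for i in ints)


# Made-up content standing in for a real binary_vectors.vec file.
sample = ["2 128", "cat 1 3", "dog 18446744073709551615 0"]
n_words, n_bits, vectors = parse_binary_vectors(sample)
bits = to_bit_string(vectors["cat"])
print(n_words, n_bits, len(bits))
```

To read a real file, pass an open file object instead of the sample
list, for instance `parse_binary_vectors(open("binary_vectors.vec"))`.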

2. Evaluate semantic similarity
-------------------------------
Run  the  executable  `similarity_binary`  to  evaluate   the   semantic
similarity  correlation  scores  of   the   produced   binary   vectors.

./similarity_binary binary_vectors.vec

This repository includes some semantic similarity datasets:
  - MEN
  - Rare Word (RW)
  - SimVerb 3500 (SimVerb)
  - SimLex 999 (SimLex)
  - WordSim 353 (WS353)
To evaluate on other semantic similarity datasets, simply add them  into
the datasets/ folder and run the `./similarity_binary` executable again.
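
To give an idea of what such a correlation score measures, here is a
small Python sketch, independent of the C code in this repository. It
rests on two assumptions that this README does not confirm: that the
similarity of two binary vectors is the fraction of bit positions on
which they agree, and that the reported score is a Spearman rank
correlation between predicted and human similarity scores. The vectors
and scores below are made up for illustration.

```python
def bit_similarity(a, b, n_bits):
    """Fraction of bit positions on which two vectors agree.

    a and b are equal-length sequences of 64-bit unsigned integers.
    """
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return (n_bits - differing) / n_bits


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up 64-bit vectors and human scores, for illustration only.
vectors = {"a": (0b1111,), "b": (0b1110,), "c": (0b0001,), "d": (0,)}
pairs = [("a", "b", 9.0), ("a", "c", 3.0), ("a", "d", 1.0)]
predicted = [bit_similarity(vectors[w1], vectors[w2], 64) for w1, w2, _ in pairs]
human = [score for _, _, score in pairs]
print(spearman(predicted, human))
```

A score close to 1 means that the binary vectors rank the word pairs in
almost the same order as the human annotators.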

3. Top-K queries
----------------
Run the executable `topk_binary` to compute the K  words  closest  to  a
QUERY word, along with their respective similarity to it.

./topk_binary binary_vectors.vec K QUERY

The script will report the closest words and their similarity,  as  well
as the time needed to compute the K closest neighbors.  You can also run
multiple top-k queries at the same time: simply replace the  QUERY  word
with a list of space-separated words, like:

./topk_binary binary_vectors.vec 10 queen automobile man moon computer
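
The idea behind a top-k query on binary vectors can be sketched in
Python as follows. This is an illustration only, not the implementation
used by `topk_binary`: it assumes a brute-force scan over all words and
a similarity equal to one minus the normalized Hamming distance, neither
of which is stated in this README. The vectors and function names are
made up.

```python
import heapq
import time


def hamming_distance(a, b):
    """Number of differing bits between two vectors of 64-bit integers."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def top_k(vectors, query, k, n_bits):
    """Return the k words closest to `query` as (word, similarity) pairs."""
    q = vectors[query]
    # Score every other word; the query itself is excluded.
    scored = (
        (1 - hamming_distance(q, v) / n_bits, word)
        for word, v in vectors.items()
        if word != query
    )
    # nlargest keeps only the k best scores, highest similarity first.
    return [(word, sim) for sim, word in heapq.nlargest(k, scored)]


# Made-up 64-bit vectors, for illustration only.
vectors = {
    "queen": (0b11110000,),
    "king": (0b11110001,),
    "woman": (0b11000000,),
    "moon": (0b00001111,),
}
start = time.perf_counter()
neighbors = top_k(vectors, "queen", 2, 64)
elapsed = time.perf_counter() - start
print(neighbors, f"{elapsed:.6f}s")
```

Because the distance only needs an XOR and a bit count per 64-bit
integer, a full scan of the vocabulary stays cheap even without an
index.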

AUTHOR

Written by Julien Tissier.

COPYRIGHT

This software is licensed under the GNU GPLv3 license.  See the  LICENSE
file for more details.