Parallelizing word2vec in shared and distributed memory
:warning: DISCONTINUATION OF PROJECT - This project will no longer be maintained by Intel. Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project. Intel no longer accepts patches to this project. If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.
This is a C++ implementation of word2vec optimized for Intel CPUs, particularly Intel Xeon and Xeon Phi (Knights Landing) processors. It supports the "HogBatch" parallel SGD described in the NIPS workshop paper "Parallelizing Word2Vec in Multi-Core and Many-Core Architectures". It also uses data parallelism to distribute the computation over a CPU cluster via MPI.
The code is based on the original word2vec implementation from Google.
All source code files in the package are under Apache License 2.0.
The code is developed and tested on UNIX-based systems with the following software dependencies:
source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64 (adjust the path to match your installation)
source /opt/intel/impi/latest/compilers_and_libraries/linux/bin/compilervars.sh intel64 (adjust the path to match your installation)
sudo yum install numactl (on Red Hat/CentOS)
sudo apt-get install numactl (on Ubuntu)
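Before building, it can help to confirm that the tools set up above are actually on the PATH. The following is a minimal sketch and not part of the package: `check_tools` is a hypothetical helper, and the command names `icpc` (Intel C++ compiler) and `mpirun` are assumptions about your installation, so substitute whatever your environment provides.

```shell
# Hypothetical helper: report which of the given commands are on the PATH.
# Returns 0 if all are found, 1 if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Command names below are assumptions; adjust to your compiler and MPI.
check_tools icpc mpirun numactl || echo "Some dependencies are not on the PATH."
```

If a tool is reported missing after sourcing the environment scripts, the most likely cause is that the path passed to `source` does not match your installation.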
git clone https://github.com/IntelLabs/pWord2Vec
cd data; ./getText8.sh or ./getBillion.sh
cd sandbox; ./run_single_text8.sh (for a single-machine demo) or ./run_mpi_text8.sh (for a distributed word2vec demo)
cd billion; ./run_single.sh (for single-machine word2vec) or ./run_mpi.sh (for distributed word2vec) (set ncores to the number of logical cores on your machine)
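The logical core count mentioned above can be queried from the shell. This is a minimal sketch, assuming a Linux or other POSIX-like system where `getconf _NPROCESSORS_ONLN` is available (on Linux, `nproc` reports a similar figure). The variable name `ncores` follows the note above; how the run scripts consume it is not shown here.

```shell
# Number of logical processors currently online.
ncores=$(getconf _NPROCESSORS_ONLN)
echo "ncores=${ncores}"
```

Copy the printed value into the `ncores` setting of whichever run script you use.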
cd sandbox; ./eval.sh or cd billion; ./eval.sh
Parallelizing Word2Vec in Shared and Distributed Memory, IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS), Volume 30, Issue 9, Pages 2090-2100, Sept. 1 2019.
Parallelizing Word2Vec in Multi-Core and Many-Core Architectures, NIPS workshop on Efficient Methods for Deep Neural Networks, Dec. 2016.
For questions and bug reports, you can reach me at https://grid.cs.gsu.edu/~sji/