Dict2vec : Learning Word Embeddings using Lexical Dictionaries
==============================================================
CONTENT
1. PREAMBLE
2. ABOUT
3. REQUIREMENTS
4. USAGE
1. Train word embeddings
2. Evaluate word embeddings
3. Download Dict2vec pre-trained word embeddings
4. Download Wikipedia training corpora and dictionary definitions
5. AUTHOR
6. COPYRIGHT
------------------------------
PREAMBLE
This work is one of the contributions of my PhD thesis, entitled "Improving methods to learn word representations for efficient semantic similarities computations", in which I propose new methods to learn better word embeddings. My thesis is freely available at https://github.com/tca19/phd-thesis.
ABOUT
This repository contains the source code to train word embeddings with the Dict2vec model, which uses both Wikipedia and dictionary definitions during training. It also contains scripts to evaluate learned word embeddings (trained with Dict2vec or any other method), to download Wikipedia training corpora, to fetch dictionary definitions from online dictionaries, and to generate strong and weak pairs from these definitions. The paper describing the Dict2vec model can be found at https://www.aclweb.org/anthology/D17-1024/.
If you use this repository, please cite:
@inproceedings{tissier2017dict2vec,
  title     = {Dict2vec : Learning Word Embeddings using Lexical Dictionaries},
  author    = {Tissier, Julien and Gravier, Christophe and Habrard, Amaury},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
  month     = {sep},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  url       = {https://www.aclweb.org/anthology/D17-1024},
  doi       = {10.18653/v1/D17-1024},
  pages     = {254--263},
}
REQUIREMENTS
To compile and run the Dict2vec model, you will need the following programs:
To evaluate the learned embeddings on the word similarity task, you will need to install on your system:
To fetch definitions from online dictionaries, you will need to install on your system:
To run demo scripts and download training data, you will also need a system with wget, bzip2, perl and bash installed.
USAGE
Before running the example script, open the file demo-train.sh and modify line 62 so that the variable THREADS is equal to the number of cores in your machine. By default it is set to 8, so if your machine only has 4 cores, update it to:
THREADS=4
Run demo-train.sh to get a quick glimpse of Dict2vec performance.
./demo-train.sh
This will:
To directly compile the code and interact with the software, run:
make && ./dict2vec
Full documentation of every available parameter is displayed when you run ./dict2vec without any arguments.
Run evaluate.py to evaluate trained word embeddings. Once the evaluation is done, you get something like this:
./evaluate.py embeddings.txt
W.Average | 0.570
The script computes the Spearman's rank correlation score for some word
similarity datasets, as well as the OOV rate for each dataset and the
weighted average based on the number of pairs evaluated on each dataset.
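As an illustration of this scoring scheme (a sketch of the metric, not the actual evaluate.py implementation), the snippet below computes Spearman's rank correlation from scratch and the pair-count-weighted average of dataset scores:

```python
from statistics import mean

def ranks(values):
    """Return 1-based average ranks; tied values share the mean of
    their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def weighted_average(scores, pair_counts):
    """Average of per-dataset scores, each weighted by the number of
    word pairs actually evaluated (i.e. non-OOV pairs)."""
    total = sum(pair_counts)
    return sum(s * n for s, n in zip(scores, pair_counts)) / total
```

With this weighting, a large dataset like MEN (3000 pairs) influences the W.Average figure far more than a small one like RG-65.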
We provide the following evaluation datasets in data/eval/:
This script is also able to evaluate several embeddings files at the same time, and compute the average score as well as the standard deviation. To evaluate several embeddings, simply add multiple filenames as arguments:
./evaluate.py embedding-1.txt embedding-2.txt embedding-3.txt
The evaluation script indicates:
When you evaluate only one embedding, you get the same value for AVG/MIN/MAX and a standard deviation STD of 0.
We provide word embeddings trained with the Dict2vec model on the July 2017 English version of Wikipedia. Vectors of dimension 100 (resp. 200) were trained on the first 50M (resp. 200M) words of this corpus, whereas vectors of dimension 300 were trained on the full corpus. The first line of each file contains the number of words and the vector dimension; each following line contains a word and its space-separated vector values. If you use these word embeddings, please cite the paper as explained in section "2. ABOUT".
You need to extract the embeddings before using them. Use the following command to do so:
tar xvjf dict2vec100.tar.bz2
For Wikipedia corpora, you can generate the same 3 files (50M, 200M and full) we use for training in the paper by running ./wiki-dl.sh.
This script will download the full English Wikipedia dump of January 2021, uncompress it and feed it directly into Mahoney's parser script [1]. It also cuts the entire dump into two smaller datasets: one containing the first 50M tokens (enwiki-50M) and the other containing the first 200M tokens (enwiki-200M). The training corpora have the following file sizes:
[1] http://mattmahoney.net/dc/textdata#appendixa
For dictionary definitions, we provide scripts to download online definitions and generate strong/weak pairs based on these definitions. More information and full documentation can be found in the folder dict-dl/ of this repository.
AUTHOR
Written by Julien Tissier [email protected].
COPYRIGHT
This software is licensed under the GNU GPLv3 license. See the LICENSE file for more details.