An evaluation of word embeddings for classification
This is the source code that accompanies the series of blog articles.
The code uses:

- Elasticsearch (localhost:9200) as the repository for tokens and vectors
- Python; see the Pipfile for dependencies
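As a quick sanity check that Elasticsearch is up before running anything, a minimal sketch assuming the official elasticsearch Python client (8.x style; the 7.x client takes a hosts list instead):

```python
from elasticsearch import Elasticsearch

# The code expects a local Elasticsearch instance on the default port.
es = Elasticsearch("http://localhost:9200")

# ping() returns True when the cluster answers; fail early otherwise.
if not es.ping():
    raise RuntimeError("Elasticsearch is not reachable at localhost:9200")
```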
1. Generate tokens for the 20-news corpus and the movie review data set, and save them to Elasticsearch (see the first sketch after this list).
2. Generate custom word vectors for the two text corpora in step 1 and save them to Elasticsearch. The text-data/twenty-news/vectors and text-data/acl-imdb/vectors directories have the scripts (see the second sketch below).
3. Process pre-trained vectors and save them to Elasticsearch. Look in pre-trained-vectors/ for the code. You need to download the actual published vectors from their sources. We have used Word2Vec, GloVe, and FastText in these articles (see the third sketch below).
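A minimal sketch of step 1; the index name twenty-news-tokens and the regex tokenizer here are illustrative stand-ins for whatever the repository's scripts actually use:

```python
import re
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def tokenize(text):
    # Lowercase and split on non-letter runs; the repository's scripts
    # may use a different tokenization scheme.
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

doc = {
    "tokens": tokenize("NASA announced a new rocket launch ..."),
    "label": "sci.space",
}

# "twenty-news-tokens" is a hypothetical index name for this sketch.
# The 8.x client takes document=; the 7.x client uses body= instead.
es.index(index="twenty-news-tokens", document=doc)
```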
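For step 2, a sketch of training custom vectors with gensim's Word2Vec; the parameters shown are illustrative, not the settings used in the articles:

```python
from gensim.models import Word2Vec

# One token list per document, e.g. read back from Elasticsearch.
token_lists = [
    ["nasa", "announced", "a", "new", "rocket", "launch"],
    ["the", "film", "was", "long", "and", "dull"],
]

model = Word2Vec(
    sentences=token_lists,
    vector_size=300,  # gensim 4.x keyword; older releases call it size
    window=5,
    min_count=1,
    workers=4,
)

vector = model.wv["rocket"]  # a 300-dimensional numpy array, ready to index
```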
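For step 3, the downloaded pre-trained vectors can be read with gensim's KeyedVectors before indexing; the file names below are the ones published by the respective projects, and the no_header flag assumes gensim 4.x:

```python
from gensim.models import KeyedVectors

# Word2Vec: Google's binary file.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# GloVe: plain text with no header line (hence no_header=True).
glove = KeyedVectors.load_word2vec_format(
    "glove.6B.300d.txt", binary=False, no_header=True
)

# FastText .vec files are plain text with a header, so the defaults work.
ft = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")

print(w2v["classification"][:5])  # first few components of one vector
```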
The script run.sh can be configured to run any combination of the pipeline steps above.
The logs contain the F-scores and timing results. Create a "logs" directory before running the run.sh script:
mkdir logs