# Phrase-At-Scale

Phrase-At-Scale provides a fast and easy way to discover common phrases in large text corpora using PySpark. The approach is data-driven: discovered phrases can be of arbitrary length, and it can be used in languages other than English. The default example below extracts phrases from a hotel-review dataset.
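To give a feel for the approach, here is a minimal pure-Python sketch of the general idea: stop words act as phrase boundaries, and spans between them that repeat often enough become phrases. The stop-word list, thresholds, and helper names here are illustrative assumptions, not the repo's actual implementation, which runs a similar idea at scale on PySpark.

```python
from collections import Counter
import re

# Illustrative stop-word list; the real job reads one from a stop-file.
STOP_WORDS = {"the", "a", "an", "and", "was", "of", "to", "in"}

def emit(run, max_len):
    """Yield all multi-word sub-spans (2..max_len tokens) of a run."""
    for n in range(2, min(len(run), max_len) + 1):
        for i in range(len(run) - n + 1):
            yield " ".join(run[i:i + n])

def candidate_spans(line, max_len=4):
    """Yield candidate phrases: runs of tokens bounded by stop words."""
    tokens = re.findall(r"[a-z']+", line.lower())
    run = []
    for tok in tokens:
        if tok in STOP_WORDS:
            yield from emit(run, max_len)  # stop word closes the run
            run = []
        else:
            run.append(tok)
    yield from emit(run, max_len)

def discover_phrases(lines, min_count=2):
    """Keep candidate spans that occur at least min_count times."""
    counts = Counter(s for line in lines for s in candidate_spans(line))
    return {p: c for p, c in counts.items() if c >= min_count}

reviews = [
    "The front desk staff was friendly and the room service was fast.",
    "Front desk staff was friendly and room service was fast too.",
]
print(discover_phrases(reviews))
# {'front desk': 2, 'desk staff': 2, 'front desk staff': 2, 'room service': 2}
```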
## Quick Start

To re-run phrase discovery using the default dataset:

1. Install Spark.
2. Clone this repo and move into its top-level directory:

    ```
    git clone [email protected]:kavgan/phrase-at-scale.git
    ```

3. Run the Spark job:

    ```
    <your_path_to_spark>/bin/spark-submit --master local[200] --driver-memory 4G phrase_generator.py
    ```

This will use the settings (including input data files) specified in `config.py`.
Notes:

- `local[200]` sets the number of worker threads; change it to `local[num_of_threads]` to suit your machine.
- The list of discovered phrases is written to `top-opinrank-phrases.txt`.
- The annotated corpus is written under `data/tagged-data/`.
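Once the job finishes, the phrase list is plain text and easy to consume downstream. The following is a sketch assuming the default phrase-file name from the note above; the `tag` helper and its underscore-joining convention are hypothetical illustrations, not the repo's exact tagged-data format.

```python
# Load the discovered phrases, one per line, assuming the default filename.
def load_phrases(path="top-opinrank-phrases.txt"):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# Hypothetical helper: mark discovered phrases in new text by joining their
# tokens with underscores, a common annotation convention; not guaranteed
# to match the format the job writes under data/tagged-data/.
def tag(text, phrases):
    for p in sorted(phrases, key=len, reverse=True):  # longest phrases first
        text = text.replace(p, p.replace(" ", "_"))
    return text

phrases = load_phrases()
print(tag("the front desk staff was friendly", phrases))
# e.g. "the front_desk_staff was friendly"
```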
## Configuration

To change the configuration, just edit the `config.py` file.
| Config | Description |
|---|---|
| input_file | Path to your input data files. This can be a file or a folder of files. The default assumption is one text document (of any size) per line: one sentence per line, one paragraph per line, etc. |
| output-folder | Path for the annotated corpora. Can be a local path or on HDFS. |
| phrase-file | Path to the file that will hold the list of discovered phrases. |
| stop-file | Stop-words file used to mark phrase boundaries. |
| min-phrase-count | Minimum number of occurrences for a phrase. Guidelines: use 50 for < 300 MB of text, 100 for < 2 GB, and larger values for much larger datasets. |
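For orientation, here is a sketch of what a configuration with these options might look like. The dictionary form and the specific input and stop-file paths are assumptions for illustration; consult the repo's actual `config.py` for the real structure and defaults.

```python
# Hypothetical config.py contents using the option names documented above.
config = {
    "input_file": "data/input/reviews.txt",      # hypothetical path: file or folder, one document per line
    "output-folder": "data/tagged-data/",        # annotated corpora; local path or HDFS
    "phrase-file": "top-opinrank-phrases.txt",   # discovered phrases end up here
    "stop-file": "data/stopwords.txt",           # hypothetical path: stop words mark phrase boundaries
    "min-phrase-count": 50,                      # ~50 for < 300 MB of text, 100 for < 2 GB
}
```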
## Dataset

The default configuration uses a subset of the OpinRank dataset, consisting of about 255,000 hotel reviews. You can use the following BibTeX entry to cite the dataset:
```bibtex
@article{ganesan2012opinion,
  title={Opinion-based entity ranking},
  author={Ganesan, Kavita and Zhai, ChengXiang},
  journal={Information Retrieval},
  volume={15},
  number={2},
  pages={116--150},
  year={2012},
  publisher={Springer}
}
```
## Contact

This repository is maintained by Kavita Ganesan. Please send me an e-mail or open a GitHub issue if you have questions.