cat🐈: the repository for the paper "Embarrassingly Simple Unsupervised Aspect Extraction"
This is the repository for the ACL 2020 paper Embarrassingly Simple Unsupervised Aspect Extraction. In this work, we extract aspects from restaurant reviews with attention that uses RBF kernels.
Install the dependencies with pip install -r requirements.txt
If you want to apply cat to your data, you need a couple of things.
If you have all these things, you can simply look at example_pipeline/run.py and replace the paths in this file with the paths to the appropriate files/instances.
cat🐈 has two hyperparameters: the gamma of the kernel, and the set of aspect words on which the attention is computed.
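To make the role of the two hyperparameters concrete, here is a minimal sketch of RBF-kernel attention over a sentence. It is an illustration written for this README, not the code from cat/simple.py: the function names, the toy 2-d vectors, and the gamma values are all assumptions. Each word is scored by its summed RBF similarity to the aspect word vectors, and the scores are normalized over the sentence.

```python
import math


def rbf(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def rbf_attention(word_vectors, aspect_vectors, gamma):
    """Return one attention weight per word; the weights sum to 1.

    Hypothetical helper: a word's score is its summed kernel similarity
    to every aspect word vector, normalized over the sentence.
    """
    scores = [
        sum(rbf(w, a, gamma) for a in aspect_vectors) for w in word_vectors
    ]
    total = sum(scores)
    if total == 0.0:
        # All similarities underflowed; fall back to uniform attention.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


# Toy vectors, made up for illustration only.
sentence = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
aspects = [[1.0, 1.0], [1.0, 2.0]]

weights = rbf_attention(sentence, aspects, gamma=0.5)
sharper = rbf_attention(sentence, aspects, gamma=5.0)
```

In this sketch, a larger gamma concentrates the attention on the words closest to the aspect words, while a different set of aspect words changes which words count as close.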
If you do not have access to pre-trained embeddings or aspect words, but you do have access to in-domain text, you will need a parser to extract either nouns or tree fragments. For maximum portability, we adopt the CoNLLu format, which many parsers can output. If you use spaCy, you can use the spacyconllu script to convert text to CoNLLu format.
To obtain the nouns and embeddings for a given set of text in CoNLLu format, replace the paths in example_pipeline/preprocessing.py with the appropriate paths to your CoNLLu parsed file, and run it.
This will train your embeddings and extract the aspect words, which you can then use in example_pipeline/run.py.
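As a rough illustration of the noun extraction step, the sketch below counts nouns in CoNLLu-formatted lines. It is not the code in example_pipeline/preprocessing.py; the function name, the helper `row`, and the sample sentence are assumptions made for this example. It relies only on the standard CoNLLu layout: ten tab-separated columns per token, with the word form in the second column and the universal POS tag in the fourth.

```python
from collections import Counter


def count_nouns(conllu_lines):
    """Count lowercased word forms tagged NOUN in CoNLLu-formatted lines."""
    counts = Counter()
    for line in conllu_lines:
        line = line.rstrip("\n")
        # Skip sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            continue
        token_id, form, upos = fields[0], fields[1], fields[3]
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if not token_id.isdigit():
            continue
        if upos == "NOUN":
            counts[form.lower()] += 1
    return counts


def row(index, form, upos):
    """Build a minimal CoNLLu token line; unused columns are left as '_'."""
    return "\t".join(
        [str(index), form, form.lower(), upos, "_", "_", "_", "_", "_", "_"]
    )


# Made-up sample sentence for illustration.
sample = [
    "# text = The pizza and the staff were great",
    row(1, "The", "DET"),
    row(2, "pizza", "NOUN"),
    row(3, "and", "CCONJ"),
    row(4, "the", "DET"),
    row(5, "staff", "NOUN"),
    row(6, "were", "AUX"),
    row(7, "great", "ADJ"),
    "",
]

noun_counts = count_nouns(sample)
```

The most frequent nouns in such a count are natural candidates for aspect words, although how many to keep is a choice this sketch does not make for you.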
If you just want to use or adapt cat🐈 in your own project, check out cat/simple.py. This contains all the relevant code for computing the attention distribution.
You can reproduce the experiments by obtaining the data, putting it in the data/ folder and running the experiments from experiments/.
In the paper, we use the SemEval 2014, SemEval 2015 and Citysearch datasets, which you can download here:
If you extract the text from these XML files and put the tokenized training data in data/, you can rerun our experiments.
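The extraction step could look roughly like the sketch below. It assumes that each sentence is stored in a `text` element, and it uses lowercasing plus whitespace splitting as a stand-in tokenizer; the sample XML, the function name, and the one-sentence-per-line output are assumptions for illustration, so check them against the actual files and against what the scripts in experiments/ expect.

```python
import xml.etree.ElementTree as ET


def sentences_from_xml(xml_string):
    """Yield one token list per non-empty <text> element in the XML."""
    root = ET.fromstring(xml_string)
    for node in root.iter("text"):
        if node.text and node.text.strip():
            # Stand-in tokenizer (assumption): lowercase, split on whitespace.
            yield node.text.strip().lower().split()


# Made-up sample that only mimics the assumed structure.
sample_xml = """
<sentences>
  <sentence id="1"><text>The food was great</text></sentence>
  <sentence id="2"><text>Service was slow</text></sentence>
  <sentence id="3"><text>   </text></sentence>
</sentences>
"""

tokenized = list(sentences_from_xml(sample_xml))

# One space-separated sentence per line, ready to be written to a file.
output_lines = [" ".join(tokens) for tokens in tokenized]
```

For a real file, `ET.parse(path).getroot()` can replace `ET.fromstring`, and a proper tokenizer can replace the whitespace split.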
If you use the code or the techniques therein, please cite the paper:
@inproceedings{tulkens2020embarrassingly,
title = "Embarrassingly Simple Unsupervised Aspect Extraction",
author = "Tulkens, St{\'e}phan and van Cranenburgh, Andreas",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.290",
doi = "10.18653/v1/2020.acl-main.290",
pages = "3182--3187",
}
GPL-V3