Convolutional Neural Network based on Hierarchical Category Structure for Multi-label Short Text Categorization
These four models are Chainer-based implementations of text categorization with Convolutional Neural Networks.
If you use any part of this code in your research, please cite my paper:
@inproceedings{HFT-CNN,
title={HFT-CNN: Learning Hierarchical Category Structure for Multi-label Short Text Categorization},
Author={Kazuya Shimura and Jiyi Li and Fumiyo Fukumoto},
booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
pages={811--816},
year={2018},
}
Contact: Kazuya Shimura, g17tk008(at)yamanashi(dot)ac(dot)jp
If you have further questions, please feel free to contact me.
Feature\Method | Flat model | WoFt model | HFT model | XML-CNN model |
---|---|---|---|---|
Hierarchical Structure | | ✔ | ✔ | |
Fine-tuning | ✔ | ✔ | ||
Pooling Type | 1-max pooling | 1-max pooling | 1-max pooling | dynamic max pooling |
Compact Representation | | | | ✔ |
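The difference between the two pooling types in the table can be illustrated with a minimal pure-Python sketch. It is not taken from cnn_model.py or xml_cnn_model.py; the function names and the chunk count `p` are hypothetical, and dynamic max pooling is shown here in the sense of [Liu+'17], where a feature map is split into chunks and the maximum of each chunk is kept.

```python
from typing import List


def one_max_pooling(feature_map: List[float]) -> List[float]:
    """1-max pooling: keep only the single largest activation."""
    return [max(feature_map)]


def dynamic_max_pooling(feature_map: List[float], p: int) -> List[float]:
    """Dynamic max pooling: split the map into p chunks, keep each chunk's max."""
    n = len(feature_map)
    if not 1 <= p <= n:
        raise ValueError("p must be between 1 and len(feature_map)")
    # Chunk boundaries spread as evenly as possible over the feature map.
    bounds = [i * n // p for i in range(p + 1)]
    return [max(feature_map[bounds[i]:bounds[i + 1]]) for i in range(p)]


# Hypothetical activations of one convolution filter over a short text.
feature_map = [0.1, 0.7, 0.3, 0.9, 0.2, 0.4]
pooled_single = one_max_pooling(feature_map)           # one value per filter
pooled_chunks = dynamic_max_pooling(feature_map, p=3)  # p values per filter
```

With 1-max pooling each filter contributes one value, while dynamic max pooling keeps `p` values per filter and so retains coarse position information.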
To run the code, I recommend the following environment.
The code requires a GPU environment. Please see requirements.txt for the dependencies needed to run my code.
wget https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh
bash Anaconda3-5.1.0-Linux-x86_64.sh
## Create virtual environments with the Anaconda Python distribution ##
conda env create -f=hft_cnn_env.yml
source activate hft_cnn_env
|--CNN ## Directory for saving the models
| |--LOG ## Log files
| |--PARAMS ## CNN parameters
| |--RESULT ## Store categorization results
|--cnn_model.py ## CNN model
|--cnn_train.py ## CNN training
|--data_helper.py ## Data helper
|--example.sh ## Script that categorizes the sample data with my code
|--hft_cnn_env.yml ## Anaconda components dependencies
|--LICENSE ## MIT LICENSE
|--MyEvaluator.py ## CNN training (validation)
|--MyUpdater.py ## CNN training (iteration)
|--README.md ## README
|--requirements.txt ## Dependencies(pip)
|--Sample_data ## Amazon sample data
| |--sample_test.txt ## Sample test data
| |--sample_train.txt ## Sample training data
| |--sample_valid.txt ## Sample validation data
|--train.py ## Main
|--Tree
| |--Amazon_all.tree ## a hierarchical structure provided by Amazon
|--tree.py ## Tree operation
|--Word_embedding ## Directory of word embedding
|--xml_cnn_model.py ## Chainer's version of XML-CNN model [Liu+'17]
You can categorize sample data (Amazon product reviews) by running example.sh, with the Flat model.
bash example.sh
--------------------------------------------------
Loading data...
Loading train data: 100%|██████████| 465927/465927 [00:18<00:00, 24959.42it/s]
Loading valid data: 100%|██████████| 24522/24522 [00:00<00:00, 27551.44it/s]
Loading test data: 100%|██████████| 153025/153025 [00:05<00:00, 27051.62it/s]
--------------------------------------------------
Loading Word embedings...
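The results are stored in the CNN directory: log files in CNN/LOG, CNN parameters in CNN/PARAMS, and categorization results in CNN/RESULT.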
You can change the model to train by modifying "ModelType" in the file "example.sh".
## Network Type (XML-CNN, CNN-Flat, CNN-Hierarchy, CNN-fine-tuning or Pre-process)
ModelType=XML-CNN
Notes:
Run with ModelType=Pre-process first, then with ModelType=CNN-Hierarchy: ModelType=Pre-process => ModelType=CNN-Hierarchy
My code uses word embeddings obtained by fastText. There are two options:
You can simply run example.sh. In this case, wiki.en.vec
is downloaded into the directory Word_embedding and is used for training.
You can specify your own "bin" file by setting the path EmbeddingWeightsPath
in the example.sh file.
## Embedding Weights Type (fastText .bin)
EmbeddingWeightsPath=./Word_embedding/
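For readers unfamiliar with fastText's text format, the following minimal sketch reads a ".vec" file such as wiki.en.vec. It relies only on the format itself (a header line with vocabulary size and dimension, then one word and its vector per line). The function name `load_vec` is hypothetical, and this is not the loading logic used in this repository.

```python
import io
from typing import Dict, List, TextIO


def load_vec(handle: TextIO) -> Dict[str, List[float]]:
    """Read fastText .vec text: a header line, then 'word v1 v2 ... vd' lines."""
    header = handle.readline().split()
    dim = int(header[1])  # header is "<vocabulary size> <dimension>"
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny in-memory stand-in for a real .vec file (values are made up).
sample = io.StringIO("2 3\nboy 0.1 0.2 0.3\npen 0.4 0.5 0.6\n")
embeddings = load_vec(sample)
```

For a real file, pass an open file handle instead, for example `open(path, encoding="utf-8")`.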
Validation data is used to estimate the generalization error after each epoch and thereby detect when overfitting starts during training. Training is then stopped before convergence to avoid overfitting (early stopping). The parameters with the lowest generalization error across all epochs are stored.
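The idea of early stopping can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the README does not say which stopping criterion the training code applies, so the `patience` parameter and the function name `early_stopping` are hypothetical, and the per-epoch losses are made-up numbers.

```python
from typing import List, Tuple


def early_stopping(valid_losses: List[float], patience: int = 2) -> Tuple[int, int]:
    """Return (best_epoch, stopped_at) for a sequence of validation losses.

    patience is an assumed criterion: stop once the loss has not improved
    for that many consecutive epochs.
    """
    best_epoch = 0
    best_loss = float("inf")
    stopped_at = len(valid_losses)  # default: ran through every epoch
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            # A real training loop would save the model parameters here.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            stopped_at = epoch
            break
    return best_epoch, stopped_at


# Validation loss falls, then rises once overfitting starts.
losses = [0.90, 0.70, 0.60, 0.65, 0.72, 0.80]
best_epoch, stopped_at = early_stopping(losses, patience=2)
```

In this example the parameters from the epoch with the lowest validation loss would be the ones kept, and the last epoch is never run.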
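The data format is one document per line, with the comma-separated labels in the first column and the text in the second.
Columns are separated by a tab (\t).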
Example:
LABEL1 I am a boy .
LABEL2,LABEL6 This is my pen .
LABEL3,LABEL1 ...
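A minimal sketch of reading one line in this format is shown below. It assumes, as in the example, that labels are comma-separated and that the text is already tokenized with spaces. The function name `parse_line` is hypothetical and is not part of data_helper.py.

```python
from typing import List, Tuple


def parse_line(line: str) -> Tuple[List[str], List[str]]:
    """Split 'LABEL1,LABEL2<TAB>text' into a label list and a token list."""
    # Split on the first tab only, so tabs inside the text would be preserved.
    labels_field, text = line.rstrip("\n").split("\t", 1)
    labels = [label.strip() for label in labels_field.split(",") if label.strip()]
    tokens = text.split()
    return labels, tokens


sample_lines = [
    "LABEL1\tI am a boy .",
    "LABEL2,LABEL6\tThis is my pen .",
]
parsed = [parse_line(line) for line in sample_lines]
```

Each entry of `parsed` pairs the label list of a document with its tokens, which is the multi-label form the training data takes.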
When your data has a hierarchical structure, you can use my WoFt model and HFT model. Please see "Tree/Amazon_all.tree" for an example. You can use your own hierarchical structure by overwriting "TreefilePath" in the example.sh file.
MIT
[Liu+'17]
J. Liu, W-C. Chang, Y. Wu, and Y. Yang. 2017. Deep Learning for Extreme Multi-Label Text Classification. In Proc. of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 115–124.