PyTorch repository for text categorization and NER experiments in Turkish and English.
This is my personal pet project, in which I apply machine learning and natural language processing techniques using PyTorch. I stopped working with Tensorflow after some hellish times in which I could not implement some basic extensions (such as fastText-based OOV embeddings; details are below). Also, Tensorflow's rate of updates and functionality deprecation is annoying to me.

In this repository, I implement popular learning models and extend them with different minor adjustments (like variational dropouts). Even though it is really slow, I run experiments with these models on a dataset that my former colleagues at Huawei and I constructed (details are below, again) and try to share the experiment results.
Before diving into details, the Python and library versions are as follows:
I try to keep every part of the project clean and easy to follow. Even though the folders are self-explanatory to me, let me explain them for those who may have a hard time understanding them.
- `./crf/CRF.py` contains the conditional random field implementation (not finished yet).
- `./datahelper/dataset_reader.py` contains the `DatasetLoader` object that reads a text dataset, splits it into 3 subsets (train/validation/test), and creates the vocabulary and iterators. It is a little bit hard-coded for the dataset I am using now; however, it is easy to change it for your own dataset.
- `./datahelper/embedding_helper.py` is a helper class to generate OOV word embeddings. For fastText-based OOV embedding generation, it leverages Gensim (see the first sketch after this list).
- `./datahelper/preprocessor.py` contains the `Preprocessor` object and the actions applied to sentences.
- `./dropout_models/gaussian_dropout.py` contains the Gaussian Dropout object (see the second sketch after this list).
- `./dropout_models/variational_dropout.py` contains the Variational Dropout object.
- `./dropout_models/dropout.py` contains the Dropout object, which lets you select your dropout type among the Bernoulli (basic), Gaussian, and Variational types.
- `./evaluation/evaluator.py` is the factory for the evaluation objects used in model training as well as in interactive evaluation.
- `./evaluation/xyz_evaluator.py` methods are the evaluator functions for the specified models.
- `./model/xyz.py` contains the network objects.
- `./model/Util_xyz.py` contains custom-defined objects that are used in the respective xyz models.
- `./optimizer/custom_optimizer.py` contains custom-defined optimizer objects.
- `./scorer/accuracy_scorer.py` contains the classification accuracy metric calculations.
- `./scorer/ner_scorer.py` contains the NER-task-related metric calculations.
- `./training/trainer.py` is a class that returns the necessary trainer for the user's selected learning model.
- `./training/xyz_trainer.py` methods are the trainer functions for the specified models.
- `./utils/utils.py` contains both utility and common methods that are used in several places in the project.
- `./main.py` is the main code. To execute this project, one needs to provide a valid `config.json` file which contains the necessary configuration properties.
- `./config/config.json` is the configuration file.
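To illustrate the fastText-based OOV idea mentioned above, here is a minimal, self-contained sketch using Gensim's `FastText`; the toy corpus and hyperparameters are purely illustrative and not taken from `embedding_helper.py`:

```python
from gensim.models import FastText

# Toy corpus; the project presumably uses real data and/or pre-trained vectors.
sentences = [["the", "quick", "brown", "fox"],
             ["jumps", "over", "the", "lazy", "dog"]]

# Train a tiny fastText model (illustrative hyperparameters).
model = FastText(sentences=sentences, vector_size=32, window=3,
                 min_count=1, epochs=10)

# fastText builds word vectors from character n-grams, so it can produce an
# embedding even for a word that never appeared in training:
oov_vector = model.wv["foxes"]  # no KeyError, unlike a plain word2vec lookup
print(oov_vector.shape)  # (32,)
```

Similarly, the core idea behind Gaussian dropout fits in a few lines of PyTorch. This is my simplified take on the technique, not necessarily identical to what `gaussian_dropout.py` does:

```python
import torch
import torch.nn as nn

class GaussianDropout(nn.Module):
    """Multiplies activations by noise drawn from N(1, alpha), where
    alpha = p / (1 - p) matches the variance of Bernoulli dropout with rate p."""

    def __init__(self, p: float = 0.5):
        super().__init__()
        self.alpha = p / (1.0 - p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Mean-1 noise keeps the activation scale unchanged in expectation,
            # so no extra rescaling is needed at inference time.
            noise = 1.0 + (self.alpha ** 0.5) * torch.randn_like(x)
            return x * noise
        return x
```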
I had to make some changes to the torchtext backend code to be able to implement several features; these modifications are listed in "changes in torchtext.txt".
To be able to run the main code, you need to provide a valid JSON file which contains 4 main properties: `dataset_properties`, `model_properties`, `training_properties`, and `evaluation_properties`:

- `dataset_properties` contains dataset-related information such as path, embedding, and batch information.
- `model_properties` contains model-related parameters. Inside this property,
  - `common_model_properties` contains common properties for all models, like embeddings, vocabulary size, etc.
  - `model_name` (like text_cnn, char_cnn, etc.) contains model-specific properties.
- `training_properties` contains training-related properties.
- `evaluation_properties` contains evaluation-related properties.

Details of `config.json` can be found in the "/config/README.md" file.
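For illustration only, the overall shape of such a file can be pictured with the hypothetical skeleton below; apart from the four top-level property names and the `common_model_properties`/model-name blocks described above, every key here is an assumption, so consult "/config/README.md" for the real ones:

```python
import json

# Hypothetical skeleton of config.json; only the property names discussed
# above are taken from this README, the rest is a placeholder.
config = {
    "dataset_properties": {},           # dataset paths, embedding and batch settings
    "model_properties": {
        "common_model_properties": {},  # shared settings: embeddings, vocabulary size, ...
        "text_cnn": {},                 # model-specific block for the chosen model
    },
    "training_properties": {},          # training-related settings
    "evaluation_properties": {},        # evaluation-related settings
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```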
If you make the necessary changes described in "changes in torchtext.txt" and prepare "config.json", you have two ways to run the code: training a model from scratch, or continuing a previously saved training. Both start from the same command:

```
python main.py --config /path/to/config.json
```
You can train your model from the 0th epoch until max_epoch, or continue your training from the xth epoch to the end. You do not need to do anything extra for the first case; however, to be able to continue your training you need to make the necessary changes in "config.json":

- If `dataset_properties/checkpoint_path` is empty, the code will start a new training process. If you provide the path of your saved PyTorch model instead, the main flow will automatically load it and continue from where it left off (a generic sketch of this pattern follows the list).
- You also need to provide the saved vocabulary files for sentences (`dataset_properties/saved_sentence_vocab`; don't ask why it is "sentence") and labels (`dataset_properties/saved_category_vocab`).
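For intuition, resuming from a checkpoint in PyTorch generally follows the pattern below. This is a generic sketch with made-up names (`maybe_resume`, the checkpoint dictionary keys), not the repository's actual flow:

```python
import torch

def maybe_resume(model, optimizer, checkpoint_path):
    """Start a fresh training if no checkpoint path is given; otherwise
    restore the model/optimizer state and report the epoch to resume from."""
    if not checkpoint_path:
        return 0  # empty path: start a new training process from epoch 0
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    # The dictionary keys below are hypothetical; they depend on how the
    # checkpoint was saved in the first place.
    model.load_state_dict(checkpoint["model_state"])
    optimizer.load_state_dict(checkpoint["optimizer_state"])
    return checkpoint["epoch"] + 1  # continue from where training left off
```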
To be able to activate interactive evaluation, you need to make the necessary changes in "config.json":

- Change the value of `model_properties/common_model_properties/run_mode` to "eval_interactive".
- Provide the paths of your saved model and vocabulary files via `evaluation_properties` (a toy sketch of the interactive loop follows).
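Conceptually, interactive evaluation boils down to a read-predict-print loop like the toy sketch below; `predict` stands in for whatever the repository's evaluator actually does with the loaded model and vocabularies:

```python
def interactive_eval(model, predict):
    # Toy loop, not the repository's actual evaluator code.
    model.eval()
    while True:
        sentence = input("Enter a sentence (or 'q' to quit): ")
        if sentence == "q":
            break
        print(predict(model, sentence))  # e.g. top-k categories with scores
```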
This section presents the Top-1 and Top-5 test accuracies of my text categorization experiments. Due to computational resource limits, I cannot test every single parameter/hyperparameter combination. In general, I hold the algorithm parameters the same for all experiments; however, I vary the embedding-related parameters. I assume the result table is self-explanatory. As a final note, I won't share my best models, and I won't guarantee reproducibility: the dataset splits (training/validation/test) are deterministic for all experiments, but anything else that needs random initialization is non-deterministic.
Note: The number of epochs is set to 20 for all experiments, until further notice (last update: 31-10-2018). However, if I believe the results may improve, I let the experiment run for 10 more epochs (at most 30 epochs per experiment).
Note 2 (Update: 22-01-2019): Most of the English-language experiments were executed in Google Cloud (using the $300 initial credit). Since I wanted to finish as many experiments as possible, I could not increase max_epoch from 20 to 30. In these experiments, I saw that validation loss and accuracy were still improving at the 20th epoch, and I am pretty sure the models could improve further. Unfortunately, in this trade-off I chose the maximum number of experiment runs over the best possible result for each experiment.
# | Language | # of Categories | Pre-trained Embedding | OOV Embedding | Embedding Training | Top-1 Test Accuracy (%) | Top-5 Test Accuracy (%) |
---|---|---|---|---|---|---|---|
1 | Turkish | 25 | Fasttext | zeros | static | 49.4565 | 76.2760 |
2 | Turkish | 25 | Fasttext | zeros | nonstatic | 62.6054 | 86.3384 |
3 | Turkish | 25 | Fasttext | Fasttext | static | 49.6810 | 75.2684 |
4 | Turkish | 25 | Fasttext | Fasttext | nonstatic | 63.9391 | 87.9597 |
5 | Turkish | 49 | Fasttext | zeros | static | 43.5519 | 68.4336 |
6 | Turkish | 49 | Fasttext | zeros | nonstatic | 56.0081 | 79.8634 |
7 | Turkish | 49 | Fasttext | Fasttext | static | 43.8025 | 68.8641 |
8 | Turkish | 49 | Fasttext | Fasttext | nonstatic | 60.4009 | 82.7879 |
9 | English | 25 | Fasttext | zeros | static | 56.2290 | 83.2425 |
10 | English | 25 | Fasttext | zeros | nonstatic | 64.2642 | 89.2115 |
11 | English | 25 | Fasttext | Fasttext | static | 56.5313 | 83.9873 |
12 | English | 25 | Fasttext | Fasttext | nonstatic | 65.9558 | 91.1536 |
13 | English | 49 | Fasttext | zeros | static | 51.3862 | 78.7806 |
14 | English | 49 | Fasttext | zeros | nonstatic | 59.2086* | 84.8054 |
15 | English | 49 | Fasttext | Fasttext | static | 51.7878 | 79.9472 |
16 | English | 49 | Fasttext | Fasttext | nonstatic | 55.3833* | 80.4958 |
In this section, I keep the previous updates so that I and the visitors can keep track of the project's history.
- Since the newly added AdaBound optimizer is installed via `pip install`, I updated the requirement.txt.
- `config.json` is also updated; two new parameters related to AdaBound were added (a usage sketch follows).
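For reference, the AdaBound optimizer from the `adabound` package is constructed as shown below; how the two new config parameters map onto its arguments is my assumption, not something this README specifies:

```python
import adabound  # pip install adabound
import torch.nn as nn

model = nn.Linear(300, 25)  # placeholder model for illustration

# lr is the initial, Adam-like step size; final_lr is the SGD-like bound that
# AdaBound's dynamic learning-rate clipping converges to during training.
optimizer = adabound.AdaBound(model.parameters(), lr=1e-3, final_lr=0.1)
```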
The repositories below really helped me to write decent, working code: