Knodle (Knowledge-supervised Deep Learning Framework) is a PyTorch-based open-source framework for weak supervision with neural networks. It provides methods for improving weakly annotated data and a modularization that separates weak data annotations, powerful deep learning models, and methods for denoising weakly supervised training, allowing researchers to efficiently develop and compare their own methods.
More details about Knodle are in our recent paper.
pip install knodle
Knodle offers various methods for denoising weak supervision sources and improving the resulting labels. Examples can be found in the tutorials folder.
There are four mandatory inputs for Knodle:

- model_input_x: your model features (e.g. TF-IDF values) without any labels. Shape: (n_instances x n_features)
- mapping_rules_labels_t: maps each weak rule to a label. Shape: (n_rules x n_classes)
- rule_matches_z: records which rules matched on which instances of your dataset. Shape: (n_instances x n_rules)
- model: a PyTorch model that can take your provided model_input_x as input. Examples are in the model folder.

If you know which denoising method you want to use, you can directly call the corresponding module (the list of currently supported methods is provided below).
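A toy illustration of the four inputs above and their shapes (all numbers and values below are made up for illustration):

```python
import numpy as np

# 4 instances, 5 features, 3 weak rules, 2 classes (illustrative numbers)
n_instances, n_features, n_rules, n_classes = 4, 5, 3, 2

# model_input_x: features only, no labels (e.g. TF-IDF values)
model_input_x = np.random.rand(n_instances, n_features)

# mapping_rules_labels_t: rules 0 and 1 vote for class 0, rule 2 for class 1
mapping_rules_labels_t = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
])

# rule_matches_z: which rules fired on which instances
rule_matches_z = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],  # no rule matched this instance
])

assert model_input_x.shape == (n_instances, n_features)
assert mapping_rules_labels_t.shape == (n_rules, n_classes)
assert rule_matches_z.shape == (n_instances, n_rules)
```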
Example for training the baseline classifier:

```python
from knodle.model.logistic_regression_model import LogisticRegressionModel
from knodle.trainer.baseline.majority import MajorityVoteTrainer

NUM_OUTPUT_CLASSES = 2

model = LogisticRegressionModel(model_input_x.shape[1], NUM_OUTPUT_CLASSES)

trainer = MajorityVoteTrainer(
    model=model,
    mapping_rules_labels_t=mapping_rules_labels_t,
    model_input_x=model_input_x,
    rule_matches_z=rule_matches_z,
    dev_model_input_x=X_dev,
    dev_gold_labels_y=Y_dev
)

trainer.train()
trainer.test(X_test, Y_test)
```
A more detailed example of classifier training is here.
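The majority vote used here boils down to a matrix product: multiplying rule_matches_z by mapping_rules_labels_t gives per-class vote counts for each instance. A minimal numpy sketch (illustrative only, not Knodle's actual implementation):

```python
import numpy as np

rule_matches_z = np.array([
    [1, 1, 0],  # rules 0 and 1 fired on instance 0
    [0, 0, 1],  # only rule 2 fired on instance 1
])
mapping_rules_labels_t = np.array([
    [1, 0],  # rule 0 votes for class 0
    [1, 0],  # rule 1 votes for class 0
    [0, 1],  # rule 2 votes for class 1
])

# vote counts per class, shape (n_instances x n_classes)
votes = rule_matches_z @ mapping_rules_labels_t
majority_labels = votes.argmax(axis=1)
print(majority_labels)  # [0 1]
```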
The framework provides a simple tensor-driven abstraction based on PyTorch that allows researchers to efficiently develop and compare their methods. The emergence of machine learning software frameworks has been the biggest enabler for the widespread adoption of machine learning and its speed of development. With Knodle we want to empower researchers in a similar fashion.
Knodle's main goals:

Apart from that, Knodle includes a selection of well-known datasets from prior work in weak supervision. The Knodle ecosystem provides modular access to datasets and denoising methods (which can, in turn, be combined with arbitrary deep learning models), enabling easy experimentation.
Datasets currently provided in Knodle:
All datasets are added to the Knodle framework in the tensor format described above and can be downloaded here. To see how the datasets were created, please have a look at the dedicated tutorial.
There are several denoising methods available.
Trainer Name | Module | Description |
---|---|---|
MajorityVoteTrainer | knodle.trainer.baseline | The baseline for all methods: no denoising takes place. The final label is decided by a simple majority vote, and the provided model is trained with these labels. |
AutoTrainer | knodle.trainer | Incorporates all denoising methods currently provided in Knodle. |
KNNAggregationTrainer | knodle.trainer.knn_aggregation | Looks at similarities in sentence values. The intuition is that, under a smoothness assumption on the target space, similar samples should be activated by the same rules. Similar sentences therefore receive the same rule matches, which counteracts the problem of missing rules for certain labels. |
WSCrossWeighTrainer | knodle.trainer.wscrossweigh | Weighs the training samples based on how reliable their labels are. Less reliable sentences (i.e. sentences whose weak labels are possibly wrong) are detected with the DS-CrossWeigh method, which is similar to k-fold cross-validation, and receive reduced weights in further training. This counteracts the problem of wrongly labelled sentences. |
SnorkelTrainer | knodle.trainer.snorkel | A wrapper of the Snorkel system that incorporates both the generative and the discriminative Snorkel steps in a single call. |
CleanlabTrainer | knodle.trainer.cleanlab | A wrapper of the Cleanlab framework. |
WSCleanlabTrainer | knodle.trainer.wscleanlab | An adaptation of the Cleanlab framework for weak supervision. |
ULF | knodle.trainer.ulf | A method for Unsupervised Labeling Function correction that denoises WS data by leveraging models trained on all but some LFs to identify and correct biases specific to the held-out LFs. ULF refines the allocation of LFs to classes by re-estimating this assignment on highly reliable cross-validated samples. For more details, see Sedova and Roth, "ULF: Unsupervised Labeling Function Correction using Cross-Validation for Weak Supervision", EMNLP 2023. |
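The kNN intuition behind KNNAggregationTrainer can be sketched as follows (a heavily simplified illustration: knn_aggregate is a hypothetical helper, not Knodle's API):

```python
import numpy as np

def knn_aggregate(model_input_x, rule_matches_z, k=2):
    """Let each sample inherit the rule matches of its k nearest
    neighbours in feature space (including itself)."""
    # pairwise Euclidean distances between all instances
    diff = model_input_x[:, None, :] - model_input_x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    # indices of the k nearest neighbours per instance
    nn = dist.argsort(axis=1)[:, :k]
    # a rule counts as matched if it fired on any neighbour
    return (rule_matches_z[nn].sum(axis=1) > 0).astype(int)

x = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
z = np.array([[1, 0], [0, 0], [0, 1]])  # no rule fired on sample 1
smoothed_z = knn_aggregate(x, z, k=2)
print(smoothed_z)  # sample 1 inherits rule 0 from its close neighbour
```

Sample 1 has no rule matches of its own, but because it sits next to sample 0 in feature space, it receives rule 0's match after smoothing.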
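The sample-weighting idea behind WSCrossWeighTrainer can be sketched like this (heavily simplified: the real method trains a model per fold; sample_weights and the reduction value are hypothetical):

```python
import numpy as np

def sample_weights(weak_labels, held_out_preds, reduction=0.3):
    """Down-weight samples whose held-out (cross-validation) prediction
    disagrees with their weak label, i.e. likely mislabelled samples."""
    weights = np.ones(len(weak_labels))
    weights[held_out_preds != weak_labels] = reduction
    return weights

weak_labels = np.array([0, 1, 1, 0])
held_out_preds = np.array([0, 1, 0, 0])  # fold model disagrees on sample 2
weights = sample_weights(weak_labels, held_out_preds)
print(weights)  # sample 2 gets a reduced weight in further training
```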
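The t-matrix re-estimation step of ULF can be illustrated with a toy sketch (heavily simplified; reestimate_t is a hypothetical helper and p is an assumed blending weight, not ULF's actual parameters):

```python
import numpy as np

def reestimate_t(rule_matches_z, corrected_labels, old_t, p=0.3):
    """Re-count rule/label co-occurrences on reliable, cross-validated
    samples and blend the result with the old rule-to-label mapping."""
    counts = np.zeros_like(old_t, dtype=float)
    for i, y in enumerate(corrected_labels):
        counts[rule_matches_z[i] == 1, y] += 1
    # normalize each rule's counts into a label distribution
    row_sums = counts.sum(axis=1, keepdims=True)
    new_t = np.divide(counts, row_sums,
                      out=np.zeros_like(counts), where=row_sums > 0)
    return p * old_t + (1 - p) * new_t

z = np.array([[1, 0], [1, 0], [0, 1]])
corrected = np.array([1, 1, 0])             # labels after correction
old_t = np.array([[1.0, 0.0], [0.0, 1.0]])  # rule 0 was mapped to class 0
new_t = reestimate_t(z, corrected, old_t)
print(new_t)  # rule 0 now leans towards class 1
```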
Each of the methods has its own default config file, which will be used in training if no custom config is provided.
We also aim to provide users with basic tutorials that explain how to use our framework. All of them are stored in the examples folder and logically divided into two groups:
Currently, the package is tested on Python 3.7. Further versions can be added; the CI/CD pipeline needs to be updated in that case.
The structure of the code is as follows:

```
knodle
├── knodle
│   ├── evaluation
│   ├── model
│   ├── trainer
│   │   ├── baseline
│   │   ├── knn_aggregation
│   │   ├── snorkel
│   │   ├── wscrossweigh
│   │   └── utils
│   ├── transformation
│   └── utils
├── tests
│   ├── data
│   ├── evaluation
│   ├── trainer
│   │   ├── baseline
│   │   ├── snorkel
│   │   ├── cleanlab
│   │   ├── wscrossweigh
│   │   ├── wscleanlab
│   │   ├── ulf
│   │   └── utils
│   └── transformation
└── examples
    ├── data_preprocessing
    │   ├── imdb_dataset
    │   └── tac_based_dataset
    └── training
        ├── simple_auto_trainer
        └── wscrossweigh
```
Licensed under the Apache 2.0 License.
If you notice a problem in the code, you can report it by submitting an issue.
If you want to share your feedback with us or take part in the project, contact us via [email protected].
And don't forget to follow @knodle_ai on Twitter :)
```
@inproceedings{Sedova_2021,
    title={Knodle: Modular Weakly Supervised Learning with PyTorch},
    url={http://dx.doi.org/10.18653/v1/2021.repl4nlp-1.12},
    DOI={10.18653/v1/2021.repl4nlp-1.12},
    booktitle={Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)},
    publisher={Association for Computational Linguistics},
    author={Sedova, Anastasiia and Stephan, Andreas and Speranskaya, Marina and Roth, Benjamin},
    year={2021}
}

@misc{sedova2023ulf,
    title={ULF: Unsupervised Labeling Function Correction using Cross-Validation for Weak Supervision},
    author={Anastasiia Sedova and Benjamin Roth},
    year={2023},
    eprint={2204.06863},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```
This research was funded by the WWTF through the project “Knowledge-infused Deep Learning for Natural Language Processing” (WWTF Vienna Research Group VRG19-008).