[NeurIPS 2020] Disentangling Human Error from the Ground Truth in Segmentation of Medical Images
This repository contains a PyTorch implementation of the NeurIPS 2020 paper "Disentangling Human Error from the Ground Truth in Segmentation of Medical Images".
Mou-Cheng Xu is the main developer of the Python code; Le Zhang is the main developer of the data simulation code.
We recommend starting with the toy example in MNIST_example.ipynb to understand the pipeline; it is a simplified main function for MNIST, similar to the other main functions in Train_GCM.py, Train_ours.py, Train_puunet.py and Train_unet.py.
Following MNIST_example.ipynb, you may want to replace the data loader with your own for your preferred pre-processing. An example data loader, CustomDataset_punet, can be found in Utilis.py.
The loss function is implemented in Loss.py as noisy_label_loss.
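Conceptually, the loss combines a cross-entropy term on the annotator-specific predictions (each estimated confusion matrix applied to the estimated true label distribution) with a trace regulariser on the confusion matrices. Below is a minimal per-pixel NumPy sketch of that idea; the function name and exact form are illustrative, and the actual PyTorch implementation in Loss.py differs in its details:

```python
import numpy as np

def noisy_label_loss_sketch(p_true, cms, noisy_labels, alpha=1.5):
    """Per-pixel sketch of the trace-regularised noisy-label loss.

    p_true       : (C,) estimated true label distribution for one pixel.
    cms          : list of (C, C) annotator confusion matrices; column j is
                   P(annotator says class i | true class is j).
    noisy_labels : list of observed class indices, one per annotator.
    """
    ce, trace = 0.0, 0.0
    for cm, y in zip(cms, noisy_labels):
        noisy_pred = cm @ p_true              # predicted noisy label distribution
        ce += -np.log(noisy_pred[y] + 1e-12)  # cross-entropy with the noisy label
        trace += np.trace(cm)
    # Minimising the trace term pushes the estimated annotators towards being
    # maximally unreliable, which is what disentangles them from the true labels.
    return ce + alpha * trace
```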
Set the data path and dataset tag, e.g.:
```
data_path='/.../.../all_of_datasets'
data_tag='some_data_set'
```
The full path to the data is then '/.../.../all_of_datasets/some_data_set'. Below is an example of the BraTS layout used in our experiments:
```
your path
│
└───data sets
│ │
│ └───brats
│ │
│ └───train
│ │ │
│ │ └───Over # where all over-segmentation labels for training are stored
│ │ │
│ │ └───Under # where all under-segmentation labels for training are stored
│ │ │
│ │ └───Wrong # where all wrong-segmentation labels for training are stored
│ │ │
│ │ └───Good # where all good-segmentation labels for training are stored
│ │ │
│ │ └───Image # where all training images are stored
│ │
│ └───validate
│ │ │
│ │ └───Over
│ │ │
│ │ └───Under
│ │ │
│ │ └───Wrong
│ │ │
│ │ └───Good
│ │ │
│ │ └───Image # where all validation images are stored
│ │
│ └───test
│ │ │
│ │ └───Over
│ │ │
│ │ └───Under
│ │ │
│ │ └───Wrong
│ │ │
│ │ └───Good
│ │ │
│ │ └───Image # where all testing images are stored
```
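A quick way to sanity-check that your data folder matches the layout above is a small helper like the following (this helper is illustrative and not part of the repo):

```python
import os

def check_layout(root, dataset_tag='brats'):
    """Return the list of expected sub-folders missing under the given root."""
    missing = []
    for split in ('train', 'validate', 'test'):
        for sub in ('Over', 'Under', 'Wrong', 'Good', 'Image'):
            path = os.path.join(root, 'data sets', dataset_tag, split, sub)
            if not os.path.isdir(path):
                missing.append(path)
    return missing
```

An empty return value means every expected sub-folder is present.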
We present a method for jointly learning, purely from noisy observations, the reliability of individual annotators and the true segmentation label distributions, using two coupled CNNs. The separation of the two is achieved by encouraging the estimated annotators to be maximally unreliable while maintaining high fidelity with the noisy training data.
The architecture of our model is depicted below:
All required libraries can be installed via conda (Anaconda). We recommend creating a conda environment with all dependencies via the environment file, e.g.:
```
conda env create -f conda_env.yml
```
We generate synthetic annotations from an assumed GT on the MNIST, MS lesion and BraTS datasets, to demonstrate the efficacy of the approach in an idealised situation where the GT is known. We simulate a group of 5 annotators with disparate characteristics by performing morphological transformations (e.g., thinning, thickening, fractures) on the ground-truth (GT) segmentation labels, using the Morpho-MNIST software. In particular, the first annotator provides faithful segmentation ("good-segmentation") approximating the GT, the second tends to over-segment ("over-segmentation"), the third tends to under-segment ("under-segmentation"), the fourth is prone to a combination of small fractures and over-segmentation ("wrong-segmentation") and the fifth always annotates everything as background ("blank-segmentation"). To create synthetic noisy labels in the multi-class scenario, we first choose a target class and then apply morphological operations on the provided GT mask to create 4 synthetic noisy labels with different patterns, namely over-segmentation, under-segmentation, wrong segmentation and good segmentation. We create training data by deriving labels from the simulated annotators. Several example images are provided in data_simulation.
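The over-, under- and wrong-segmentation annotators described above can be sketched with standard morphological operations. A minimal NumPy/SciPy sketch for a binary mask follows; the function name, iteration counts and fracture rate are illustrative, not the actual Morpho-MNIST API:

```python
import numpy as np
from scipy import ndimage

def simulate_annotators(gt_mask, seed=0):
    """Simulate 5 annotators from a binary GT mask via morphology.

    gt_mask: 2D boolean array (the assumed GT segmentation).
    Returns (over, under, wrong, good, blank) masks.
    """
    rng = np.random.default_rng(seed)
    over = ndimage.binary_dilation(gt_mask, iterations=2)   # over-segmentation
    under = ndimage.binary_erosion(gt_mask, iterations=2)   # under-segmentation
    # "wrong": over-segmentation plus random fractures (holes in the mask)
    wrong = ndimage.binary_dilation(gt_mask, iterations=2)
    wrong[rng.random(gt_mask.shape) < 0.05] = False
    good = gt_mask.copy()                                   # faithful annotator
    blank = np.zeros_like(gt_mask)                          # all-background
    return over, under, wrong, good, blank
```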
Here we also introduce another simple method for simulating annotator data for both binary and multi-class masks. Before running the data simulator, please make sure FSL is installed on your machine.
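One quick way to verify FSL is installed is to check that its command-line tools are on your PATH; the sketch below assumes the standard fslmaths binary name:

```python
import shutil

def fsl_available():
    """Return True if FSL's command-line tools appear to be on PATH
    (checks for the standard 'fslmaths' binary)."""
    return shutil.which('fslmaths') is not None
```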
For binary masks, run:
- ./data_simulation/over-segmentation.m to generate the simulated over-segmentation mask;
- ./data_simulation/under-segmentation.m to generate the simulated under-segmentation mask;
- ./data_simulation/artificial_wrong_mask.py to generate the simulated wrong-segmentation mask.
For multi-class masks: change the folder path to your data folder and run ./data_simulation/multiclass_data_simulator.m to generate the over-segmentation, under-segmentation and wrong-segmentation masks simultaneously.
Download the example datasets in the following table, as used in the paper, and pre-process them using the following steps for multi-class segmentation:
Download the training dataset with annotations from the corresponding link (e.g., BraTS2019).
Unzip the data; you will have two folders.
Extract the 2D images and annotations from the .nii.gz files by running:
- cd Brats
- python ./preprocessing/Prepare_BRATS.py
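The core of that extraction step is slicing each 3D volume into 2D images. A minimal sketch of that idea is below; the function name is illustrative, and the commented usage assumes the nibabel package for reading .nii.gz volumes:

```python
import numpy as np

def volume_to_slices(volume, axis=2):
    """Split a 3D volume into a list of 2D slices along one axis."""
    return [np.take(volume, i, axis=axis) for i in range(volume.shape[axis])]

# Usage with a real BraTS volume (requires the nibabel package):
#   import nibabel as nib
#   volume = nib.load('path/to/subject_t1.nii.gz').get_fdata()  # 240 x 240 x 155
#   slices = volume_to_slices(volume)                           # 155 slices of 240 x 240
```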
Dataset (with Link) | Content | Resolution (pixels) | Number of Classes |
---|---|---|---|
MNIST | Handwritten Digits | 28 x 28 | 2 |
ISBI2015 | Multiple Sclerosis Lesion | 181 x 217 x 181 | 2 |
BraTS2019 | Multimodal Brain Tumor | 240 x 240 x 155 | 4 |
LIDC-IDRI | Lung Image Database Consortium image collection | 180 x 180 | 2 |
For the BraTS dataset, set the following hyper-parameters in run.py:
- input_dim=4,
- class_no=4,
- repeat=1,
- train_batchsize=2,
- validate_batchsize=1,
- num_epochs=30,
- learning_rate=1e-4,
- alpha=1.5,
- width=16,
- depth=4,
- data_path=your path,
- dataset_tag='brats',
- label_mode='multi',
- save_probability_map=True,
- low_rank_mode=False
and run:
python run.py
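For reference, the hyper-parameters above collected as a plain Python dict (the exact argument names run.py expects may differ; this is only illustrative):

```python
# BraTS hyper-parameter settings from the list above.
config = dict(
    input_dim=4,            # 4 MRI modalities in BraTS
    class_no=4,
    repeat=1,
    train_batchsize=2,
    validate_batchsize=1,
    num_epochs=30,
    learning_rate=1e-4,
    alpha=1.5,              # weight of the trace regulariser
    width=16,
    depth=4,
    dataset_tag='brats',
    label_mode='multi',
    save_probability_map=True,
    low_rank_mode=False,
)
# data_path is omitted here: set it to your own data folder.
```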
To test our model, run segmentation.py with the following settings:
- model_path: your pre-trained model;
- test_path: your testing data.

Models | Brats Dice (%) | Brats CM estimation | LIDC-IDRI Dice (%) | LIDC-IDRI CM estimation |
---|---|---|---|---|
Naive CNN on mean labels | 29.42±0.58 | n/a | 56.72±0.61 | n/a |
Naive CNN on mode labels | 34.12±0.45 | n/a | 58.64±0.47 | n/a |
Probabilistic U-net | 40.53±0.75 | n/a | 61.26±0.69 | n/a |
STAPLE | 46.73±0.17 | 0.2147±0.0103 | 69.34±0.58 | 0.0832±0.0043 |
Spatial STAPLE | 47.31±0.21 | 0.1871±0.0094 | 70.92±0.18 | 0.0746±0.0057 |
Ours without Trace | 49.03±0.34 | 0.1569±0.0072 | 71.25±0.12 | 0.0482±0.0038 |
Ours | 53.47±0.24 | 0.1185±0.0056 | 74.12±0.19 | 0.0451±0.0025 |
Oracle (Ours but with known CMs) | 67.13±0.14 | 0.0843±0.0029 | 79.41±0.17 | 0.0381±0.0021 |
Models | Brats Dice (%) | Brats CM estimation | LIDC-IDRI Dice (%) | LIDC-IDRI CM estimation |
---|---|---|---|---|
Naive CNN on mean & mode labels | 36.12±0.93 | n/a | 48.36±0.79 | n/a |
STAPLE | 38.74±0.85 | 0.2956±0.1047 | 57.32±0.87 | 0.1715±0.0134 |
Spatial STAPLE | 41.59±0.74 | 0.2543±0.0867 | 62.35±0.64 | 0.1419±0.0207 |
Ours without Trace | 43.74±0.49 | 0.1825±0.0724 | 66.95±0.51 | 0.0921±0.0167 |
Ours | 46.21±0.28 | 0.1576±0.0487 | 68.12±0.48 | 0.0587±0.0098 |
If you use this code or the dataset for your research, please cite our paper:
```
@inproceedings{HumanError2020,
  title={Disentangling Human Error from the Ground Truth in Segmentation of Medical Images},
  author={Zhang, Le and Tanno, Ryutaro and Xu, Mou-Cheng and Jacob, Joseph and Ciccarelli, Olga and Barkhof, Frederik and Alexander, Daniel C.},
  booktitle={Advances in Neural Information Processing Systems},
  year={2020},
}
```