Catalyst.Classification
PyTorch framework for Deep Learning research and development.
It was developed with a focus on reproducibility,
fast experimentation, and reuse of code and ideas.
It lets you research and develop something new
rather than write yet another regular train loop.
Break the cycle - use the Catalyst!
Project manifest. Part of PyTorch Ecosystem. Part of Catalyst Ecosystem:
Note: this repo uses the advanced Catalyst Config API and may be somewhat out of date. Please use Catalyst's minimal examples section as a starting point and for up-to-date use cases.
You will learn how to build an image classification pipeline with transfer learning using the Catalyst framework to get reproducible results.
pip install -r requirements/requirements.txt
This builds a Docker image, catalyst-classification,
with the necessary libraries:
make docker-build
export DATASET="artworks"
rm -rf data/
mkdir -p data
if [[ "$DATASET" == "ants_bees" ]]; then
# https://www.kaggle.com/ajayrana/hymenoptera-data
download-gdrive 1czneYKcE2sT8dAMHz3FL12hOU7m1ZkE7 ants_bees_cleared_190806.tar.gz
tar -xf ants_bees_cleared_190806.tar.gz &>/dev/null
mv ants_bees_cleared_190806 ./data/origin
elif [[ "$DATASET" == "flowers" ]]; then
# https://www.kaggle.com/alxmamaev/flowers-recognition
download-gdrive 1rvZGAkdLlbR_MEd4aDvXW11KnLaVRGFM flowers.tar.gz
tar -xf flowers.tar.gz &>/dev/null
mv flowers ./data/origin
elif [[ "$DATASET" == "artworks" ]]; then
# https://www.kaggle.com/ikarus777/best-artworks-of-all-time
download-gdrive 1eAk36MEMjKPKL5j9VWLvNTVKk4ube9Ml artworks.tar.gz
tar -xf artworks.tar.gz &>/dev/null
mv artworks ./data/origin
fi
Make sure that the final data folder has the required structure:
/path/to/your_dataset/
class_name_1/
images
class_name_2/
images
...
class_name_100500/
...
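Before starting a long training run, it can help to confirm that the folder really follows the layout above. The following is a minimal sketch, not part of this repository: the helper names `summarize_dataset` and `find_problems` and the list of image extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to your data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def summarize_dataset(root):
    """Return {class_name: image_count} for every class folder under root."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # stray files at the top level are ignored
        counts[class_dir.name] = sum(
            1
            for path in class_dir.iterdir()
            if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts


def find_problems(counts):
    """List human-readable problems: no classes, or classes without images."""
    if not counts:
        return ["no class folders found"]
    return [f"class '{name}' has no images" for name, n in counts.items() if n == 0]
```

For example, `find_problems(summarize_dataset("./data/origin"))` returns an empty list when every class folder contains at least one image.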
The easiest way is to move your data:
mv /path/to/your_dataset/* /catalyst.classification/data/origin
This way you can run the pipeline with the default settings.
If you prefer to leave the data in /path/to/your_dataset/:
In a local environment, either link the directory:
ln -s /path/to/your_dataset $(pwd)/data/origin
or set the path to your dataset:
DATADIR=/path/to/your_dataset
when you start the pipeline.
Using Docker, set:
-v /path/to/your_dataset:/data \
instead of the default -v $(pwd)/data/origin:/data \ in the script below to start the pipeline.
The pipeline automatically guides you from raw data to a production-ready model.
We initialize the ResNet-18 model from a pre-trained network. In this pipeline the model is trained sequentially in two stages; in the first stage, several heads are trained simultaneously.
# --max-image-size: 224 or 448 works well
# --balance-strategy: images per class in an epoch; 1024 works well
# --criterion: one of CrossEntropyLoss, BCEWithLogits, FocalLossMultiClass
CUDA_VISIBLE_DEVICES=0 \
CUDNN_BENCHMARK="True" \
CUDNN_DETERMINISTIC="True" \
bash ./bin/catalyst-classification-pipeline.sh \
--workdir ./logs \
--datadir ./data/origin \
--max-image-size 224 \
--balance-strategy 256 \
--config-template ./configs/templates/main.yml \
--num-workers 4 \
--batch-size 256 \
--criterion CrossEntropyLoss
# --max-image-size: 224 or 448 works well
# --balance-strategy: images per class in an epoch; 1024 works well
# --criterion: one of CrossEntropyLoss, BCEWithLogits, FocalLossMultiClass
docker run -it --rm --shm-size 8G --runtime=nvidia \
-v $(pwd):/workspace/ \
-v $(pwd)/logs:/logdir/ \
-v $(pwd)/data/origin:/data \
-e "CUDA_VISIBLE_DEVICES=0" \
-e "CUDNN_BENCHMARK='True'" \
-e "CUDNN_DETERMINISTIC='True'" \
catalyst-classification ./bin/catalyst-classification-pipeline.sh \
--workdir /logdir \
--datadir /data \
--max-image-size 224 \
--balance-strategy 256 \
--config-template ./configs/templates/main.yml \
--num-workers 4 \
--batch-size 256 \
--criterion CrossEntropyLoss
The pipeline is now running and you don't have to do anything else; just wait for the best model!
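The --balance-strategy option used in the commands above is described as the number of images per class in an epoch. One way to picture such balancing is the sketch below. It is an illustration under assumptions, not the pipeline's actual sampler: the function name `balanced_epoch` is hypothetical, and upsampling small classes with replacement is an assumed policy.

```python
import random
from collections import defaultdict


def balanced_epoch(samples, per_class, seed=0):
    """Build one epoch with exactly per_class items for every label.

    samples: list of (path, label) pairs.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append(path)

    epoch = []
    for label in sorted(by_label):
        paths = by_label[label]
        if len(paths) >= per_class:
            chosen = rng.sample(paths, per_class)  # downsample without replacement
        else:
            chosen = rng.choices(paths, k=per_class)  # upsample with replacement
        epoch.extend((path, label) for path in chosen)
    rng.shuffle(epoch)  # mix classes so batches are not grouped by label
    return epoch
```

With two classes and `per_class=256`, each epoch in this sketch contains 512 items regardless of how imbalanced the underlying folders are.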
You can use a W&B account for visualisation right after running pip install wandb:
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
TensorBoard can also be used for visualisation:
tensorboard --logdir=/catalyst.classification/logs
All results of all experiments can be found locally in WORKDIR, by default catalyst.classification/logs. The results of an experiment, for instance catalyst.classification/logs/logdir-191010-141450-c30c8b84, contain:
best.pth and last.pth can also be found in the corresponding experiment in your W&B account.
For your future experiments, the framework provides powerful configs that let you optimize the configuration of the whole classification pipeline in a controlled and reproducible way.
Common settings for the training stages and model parameters can be found in catalyst.classification/configs/_common.yml.
model_params: detailed configuration of models, including:
MultiHeadNet
stages: you can configure training or inference in several stages with different hyperparameters. In our example:
The CONFIG_TEMPLATE with the experiment's other hyperparameters, such as data_params, is here: catalyst.classification/configs/templates/main.yml. The config allows you to define:
data_params: path, batch size, number of workers and so on
callbacks_params: callbacks are used to execute code during training, for example, to compute metrics or save checkpoints. Catalyst provides a wide variety of helpful callbacks, and you can also use custom ones.
You can find many more options for configuring experiments in the Catalyst documentation.
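The idea of several stages with different hyperparameters can be pictured as shared settings plus per-stage overrides. The sketch below only illustrates that idea with a recursive dictionary merge. It is not Catalyst's config loader, and the keys and values (`optimizer_params`, `lr`, `stage1`, `stage2`) are hypothetical.

```python
import copy


def merge(base, override):
    """Recursively merge override into a copy of base; override wins."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Hypothetical values, for illustration only.
shared = {
    "data_params": {"batch_size": 256, "num_workers": 4},
    "optimizer_params": {"lr": 0.001},
}
stage_overrides = {
    "stage1": {},  # uses the shared settings unchanged
    "stage2": {"optimizer_params": {"lr": 0.0001}},  # lower learning rate
}

# Each stage sees the shared settings with its own overrides applied.
resolved = {name: merge(shared, extra) for name, extra in stage_overrides.items()}
```

Keeping shared values in one place means that a change to, say, the batch size applies to every stage unless a stage overrides it explicitly.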
The classical way to reduce the amount of unlabeled data with a trained model is to run the unlabeled dataset through the model and automatically label the images whose prediction confidence is above a threshold. The automatically labeled data is then added to the training process to improve prediction accuracy.
To run this iterative process, specify the number of iterations, n-trials, and the confidence threshold, threshold, for labeling an image.
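The iteration described above can be sketched in a few lines. This is a simplified illustration, not the code behind the autolabel script: `pseudo_label`, `predict`, and `retrain` are hypothetical names, and the model is reduced to a callable that returns class probabilities.

```python
def pseudo_label(unlabeled, predict, threshold, n_trials, retrain=None):
    """Iteratively move confidently predicted items from unlabeled to labeled.

    unlabeled: iterable of item ids (for example, image paths)
    predict:   callable item -> list of class probabilities (hypothetical hook)
    retrain:   optional callable(labeled_dict), called after each productive round
    """
    remaining = list(unlabeled)
    labeled = {}
    for _ in range(n_trials):
        still_unlabeled = []
        newly_labeled = {}
        for item in remaining:
            probs = predict(item)
            best = max(range(len(probs)), key=probs.__getitem__)
            if probs[best] > threshold:
                newly_labeled[item] = best  # confident: accept predicted class
            else:
                still_unlabeled.append(item)
        if not newly_labeled:
            break  # nothing crossed the threshold; more rounds would not help
        labeled.update(newly_labeled)
        remaining = still_unlabeled
        if retrain is not None:
            retrain(labeled)  # a real pipeline would retrain the model here
        if not remaining:
            break  # everything has been labeled
    return labeled, remaining
```

In this sketch the loop stops early when a round labels nothing new or when no unlabeled items remain, so `n_trials` acts as an upper bound on the number of rounds.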
catalyst.classification/data/
raw/
all/
...
clean/
0/
...
1/
...
CUDA_VISIBLE_DEVICES=0 \
CUDNN_BENCHMARK="True" \
CUDNN_DETERMINISTIC="True" \
bash ./bin/catalyst-autolabel-pipeline.sh \
--workdir ./logs \
--datadir-clean ./data/clean \
--datadir-raw ./data/raw \
--n-trials 10 \
--threshold 0.8 \
--config-template ./configs/templates/autolabel.yml \
--max-image-size 224 \
--num-workers 4 \
--batch-size 256
docker run -it --rm --shm-size 8G --runtime=nvidia \
-v $(pwd):/workspace/ \
-e "CUDA_VISIBLE_DEVICES=0" \
-e CUDNN_BENCHMARK="True" \
-e CUDNN_DETERMINISTIC="True" \
catalyst-classification bash ./bin/catalyst-autolabel-pipeline.sh \
--workdir ./logs \
--datadir-clean ./data/clean \
--datadir-raw ./data/raw \
--n-trials 10 \
--threshold 0.8 \
--config-template ./configs/templates/autolabel.yml \
--max-image-size 224 \
--num-workers 4 \
--batch-size 256
Out:
Predicted: 23 (100.00%)
...
Pseudo Labeling done. Nothing more to label.
Logs for trainings visualisation can be found here: ./logs/autolabel
Labeled raw data can be found here: /data/data_clean/dataset.csv