Deep Learning project template for PyTorch (multi-gpu training is supported)
use_background_generator in config
assets dir: icon image of Pytorch Project Template. You can remove this directory.
config dir: directory for config files.
dataset dir: dataloader and dataset code is here. Also, put the dataset in the meta dir.
model dir: model.py wraps the network architecture. model_arch.py is for coding the network architecture.
tests dir: directory for pytest test code. You can check the flow of tensors through your network by editing tests/model/net_arch_test.py. Just copy and paste the Net_arch.forward method into net_arch_test.py and add assert statements to check the tensors.
utils dir:
  train_model.py and test_model.py train and test the model once.
  utils.py is for utilities: random seed setting, dot-access hyperparameters, getting the commit hash, etc.
  writer.py is for writing logs to tensorboard / wandb.
trainer.py file: sets up training and iterates over epochs.
requirements.txt (https://pytorch.org/get-started/)
pip install -r requirements.txt
The default configuration is config/default.yaml. Custom configs are under config/job/
name is the name of the training run.
working_dir is the root directory for saving checkpoints and logs.
device is the device mode for running your model. You can choose cpu or cuda.
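
The top-level fields above (name, working_dir, device) pair naturally with the dot-access hyperparameter helper mentioned for utils.py. The following is only a minimal sketch of that idea: the DotDict class and the example values are hypothetical and are not taken from this repository.

```python
class DotDict(dict):
    """Dictionary whose keys can also be read as attributes (hypothetical helper)."""

    def __getattr__(self, key):
        try:
            value = self[key]
        except KeyError as err:
            raise AttributeError(key) from err
        # Wrap nested dicts so chained access such as cfg.data.train_dir works.
        return DotDict(value) if isinstance(value, dict) else value


# Example values only; the real defaults live in config/default.yaml.
cfg = DotDict(
    {
        "name": "first_training",
        "working_dir": "/tmp/project",
        "device": "cuda",
    }
)

# The README lists cpu and cuda as the two device modes.
if cfg.device not in ("cpu", "cuda"):
    raise ValueError(f"device must be 'cpu' or 'cuda', got {cfg.device!r}")

print(cfg.name, cfg.working_dir, cfg.device)
```

Raising AttributeError for a missing key keeps hasattr(cfg, "missing") working as usual.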
data field
  train_dir / test_dir with file_format for Dataloader.
  If divide_dataset_per_gpu is true, the original dataset is divided into one sub-dataset per GPU. This could mean the size of the original dataset should be a multiple of the number of GPUs in use. If this option is false, the dataset is not divided, but the epoch count goes up in multiples of the number of GPUs.
train / test field
  random_seed sets the Python, NumPy, and PyTorch random seeds.
  num_epoch is the number of epochs after which training ends.
  optimizer selects the optimizer. Only the adam optimizer is supported for now.
  dist configures Distributed Data Parallel (DDP).
    gpus is the number of GPUs you want to use with DDP (the gpus value is used as world_size in DDP). DDP is not used when gpus is 0, and all GPUs are used when gpus is -1.
    timeout is the timeout, in seconds, for process interaction in DDP. When this is set to ~, the default timeout (1800 seconds) is applied in gloo mode and the timeout is turned off in nccl mode.
model field
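
To make the divide_dataset_per_gpu and gpus options from the data and dist settings more concrete, here is a small pure-Python sketch. The function names, the strided split, and the strict multiple-of-GPU-count check are all assumptions for illustration. The template's actual code may divide the dataset differently.

```python
def resolve_world_size(gpus, available_gpus):
    """Map the gpus setting to a DDP world size (0 means DDP is not used)."""
    if gpus == -1:
        return available_gpus  # -1: use every available GPU
    if gpus < -1 or gpus > available_gpus:
        raise ValueError(f"invalid gpus value: {gpus}")
    return gpus


def indices_for_rank(dataset_size, world_size, rank, divide_dataset_per_gpu):
    """Return the sample indices one process would iterate over in an epoch."""
    all_indices = list(range(dataset_size))
    if world_size <= 1 or not divide_dataset_per_gpu:
        # No division: every process sees the whole dataset.
        return all_indices
    if dataset_size % world_size != 0:
        # Assumption: treat the "should be a multiple" note as a hard requirement.
        raise ValueError("dataset size should be a multiple of the GPU count")
    # Strided split (assumed): rank r takes samples r, r + world_size, ...
    return all_indices[rank::world_size]


world_size = resolve_world_size(gpus=-1, available_gpus=4)
print(world_size)  # 4
print(indices_for_rank(8, world_size, rank=1, divide_dataset_per_gpu=True))  # [1, 5]
print(len(indices_for_rank(8, world_size, rank=1, divide_dataset_per_gpu=False)))  # 8
```

With the option off, each of the 4 processes walks all 8 samples, so one pass covers the data 4 times. This is one way to read the note that the epoch count goes up in multiples of the number of GPUs.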
log field
  summary_interval is the number of steps between training log entries, and checkpoint_interval is the number of epochs between checkpoint saves.
  Checkpoints and logs are saved under working_dir/chkpt_dir and working_dir/trainer.log. Tensorboard logs are saved under working_dir/outputs/tensorboard
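
The two intervals in the log field can be pictured with a toy training loop. This sketch assumes summary_interval counts steps and checkpoint_interval counts epochs, and that both fire when the counter is an exact multiple of the interval. The helper names and the numbers are made up for illustration and do not come from trainer.py.

```python
def should_log(step, summary_interval):
    """True when a training summary would be written at this step."""
    return step % summary_interval == 0


def should_checkpoint(epoch, checkpoint_interval):
    """True when a checkpoint would be saved at the end of this epoch."""
    return epoch % checkpoint_interval == 0


summary_interval = 5      # in steps
checkpoint_interval = 2   # in epochs
steps_per_epoch = 3       # made-up dataset size / batch size

events = []
step = 0
for epoch in range(1, 5):
    for _ in range(steps_per_epoch):
        step += 1  # the step counter keeps running across epochs
        if should_log(step, summary_interval):
            events.append(("log", step))
    if should_checkpoint(epoch, checkpoint_interval):
        events.append(("checkpoint", epoch))

print(events)
# [('log', 5), ('checkpoint', 2), ('log', 10), ('checkpoint', 4)]
```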
load field
  wandb_load_path is the Run path shown in the overview of a run. If you don't want to use wandb loading, this field should be ~.
  network_chkpt_path is the path to the network checkpoint file. If using wandb loading, this field should be the checkpoint file name of the wandb run.
  resume_state_path is the path to the training state file. If using wandb loading, this field should be the training state file name of the wandb run.
pip install -r requirements-dev.txt to install development dependencies (this requires Python 3.6 or above because of black)
pre-commit install to add pre-commit to the git hooks
python trainer.py working_dir=$(pwd)