https://tacc.ust.hk
The above picture illustrates the submission and debug workflows of a TACC job.
Before using the tcloud SDK, make sure that you have applied for a TACC account and submitted your public key to TACC. You can generate an SSH public key by following the steps. To apply for a TACC account, please visit our website.
Place the setup.sh and tcloud files from the tcloud SDK in the same directory, and run setup.sh.
Set your username and private key file with the tcloud config command:
$ tcloud config [-u/--username] MYUSERNAME
$ tcloud config [-f/--file] MYPRIVATEFILEPATH
Run the tcloud init command to obtain the latest cluster hardware information from the TACC cluster.
PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
tacc*      up     infinite   5      alloc  10-0-7-[18-19],10-0-8-[18-19]
tacc*      up     infinite   19     idle   10-0-2-[18-19],10-0-3-[10-13]
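To see at a glance how many nodes are free, the listing above can be summarized per state. The following is a minimal sketch, assuming the listing has been captured as whitespace-separated text exactly as shown; the function name nodes_by_state and the SAMPLE string are illustrative and not part of tcloud.

```python
from collections import Counter

# Sample text copied from the listing above (assumed to be whitespace-separated).
SAMPLE = """\
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
tacc* up infinite 5 alloc 10-0-7-[18-19],10-0-8-[18-19]
tacc* up infinite 19 idle 10-0-2-[18-19],10-0-3-[10-13]
"""


def nodes_by_state(text):
    """Sum the NODES column for each STATE value in a partition listing."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    # Locate columns by header name rather than by fixed position.
    nodes_col = header.index("NODES")
    state_col = header.index("STATE")
    totals = Counter()
    for row in rows:
        totals[row[state_col]] += int(row[nodes_col])
    return dict(totals)


if __name__ == "__main__":
    print(nodes_by_state(SAMPLE))
```

Looking columns up by header name keeps the helper working if the column order changes, as long as the NODES and STATE headers are present.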
You can use this link to download our example code.
Each job requires a main.py together with a tuxiv.conf:
main.py: Your machine learning training code.
tuxiv.conf: The job configuration file; see the tuxiv.conf documentation for details.
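If you do not have training code at hand, a small placeholder can be used to try out the workflow. The sketch below is a hypothetical stand-in for main.py, not the code shipped with the examples: it fits y = 2x with plain gradient descent using only the standard library, so it needs no Conda dependencies.

```python
"""Hypothetical stand-in for main.py: fits y = 2x with gradient descent."""


def train(steps=200, lr=0.05):
    # Toy dataset: five points on the line y = 2x.
    data = [(x, 2.0 * x) for x in range(1, 6)]
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


if __name__ == "__main__":
    print(f"learned weight: {train():.4f}")
```

Real training code would replace train() with your own model and data loading; the entry-point structure stays the same.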
After tcloud is configured correctly, you can try to submit your first job.
Run the tcloud submit command from the job directory:
~/Dow/quickstart-master/example/helloworld ❯ tcloud submit
Start parsing tuxiv.conf...
building file list ...
8 files to consider
helloworld/
helloworld/run.sh
151 100% 0.00kB/s 0:00:00 (xfer#1, to-check=5/8)
helloworld/configurations/
helloworld/configurations/citynet.sh
12 100% 11.72kB/s 0:00:00 (xfer#2, to-check=2/8)
helloworld/configurations/conda.yaml
107 100% 104.49kB/s 0:00:00 (xfer#3, to-check=1/8)
helloworld/configurations/run.slurm
278 100% 271.48kB/s 0:00:00 (xfer#4, to-check=0/8)
sent 429 bytes received 144 bytes 382.00 bytes/sec
total size is 1071 speedup is 1.87
Submitted batch job 2000
Job helloworld submitted.
In this section, we provide two methods to monitor the job log.
After training, you can use tcloud ls [filepath] to find the output files.
cat
You can configure your log path in tuxiv.conf. The default path is slurm_log/slurm-jobid.out.
tcloud cat slurm_log/slurm-jobid.out
In the helloworld example, the tuxiv.conf file specifies the log path as slurm_log/hello.log
download
You can use tcloud download [filepath] to download a file.
Note that you can only read and download files in USERDIR, and the files in WORKDIR may be removed after the job is finished.
tcloud download slurm_log/slurm-jobid.out
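Both methods take the same log path, so it can be convenient to build the two commands from a job ID. This is a minimal sketch under one loud assumption: that "jobid" in the default path stands for the numeric ID printed at submission (for example 2000). The helper names default_log_path and log_commands are illustrative and not part of tcloud; the sketch only prints the commands and does not run them.

```python
def default_log_path(job_id):
    # Assumption: "jobid" in slurm_log/slurm-jobid.out is the numeric job ID.
    return f"slurm_log/slurm-{job_id}.out"


def log_commands(job_id):
    """Return the cat and download command lines for a job's default log."""
    path = default_log_path(job_id)
    return [
        ["tcloud", "cat", path],
        ["tcloud", "download", path],
    ]


if __name__ == "__main__":
    for cmd in log_commands(2000):
        print(" ".join(cmd))
```

If your tuxiv.conf sets a custom log path, pass that path to tcloud cat or tcloud download directly instead.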
tcloud uses Conda to manage your dependencies, and all dependencies are installed through Conda. Specify the Conda channels that your packages require. tcloud offers two ways to manage environments:
Do not specify an environment name in tuxiv.conf: we will reuse the previous environment to save time. This is the default behavior.
environment:
    # name: # do not specify environment name
    dependencies:
        - pytorch=1.6.0
        - torchvision=0.7.0
    channels: pytorch
Specify a dedicated environment name in tuxiv.conf for each project. When you change the dependency configuration of an existing environment, tcloud will update this environment instead of creating a new one. Learn how to do this in the environment part of the tuxiv.conf documentation.
environment:
    name: torch-env # dedicated environment name
    dependencies:
        - pytorch=1.6.0
        - torchvision=0.7.0
    channels: pytorch
The following video will help you use the tcloud CLI to begin your TACC journey: demo video.
Basic examples are provided under the example folder. These examples include: HelloWorld, TensorFlow, PyTorch and MXNet.