Slurm Docker Cluster Save

A Slurm cluster using docker-compose

Project README

Slurm Docker Cluster

This is a multi-container Slurm cluster using docker-compose. The compose file creates named volumes for persistent storage of MySQL data files as well as Slurm state and log directories.

Containers and Volumes

The compose file will run the following containers:

  • mysql
  • slurmdbd
  • slurmctld
  • c1 (slurmd)
  • c2 (slurmd)

The compose file will create the following named volumes:

  • etc_munge ( -> /etc/munge )
  • etc_slurm ( -> /etc/slurm )
  • slurm_jobdir ( -> /data )
  • var_lib_mysql ( -> /var/lib/mysql )
  • var_log_slurm ( -> /var/log/slurm )

Building the Docker Image

Build the image locally:

docker build -t slurm-docker-cluster:21.08.6 .

Build a different version of Slurm using Docker build args and the Slurm Git tag:

docker build --build-arg SLURM_TAG="slurm-19-05-2-1" -t slurm-docker-cluster:19.05.2 .

Or equivalently using docker-compose:

SLURM_TAG=slurm-19-05-2-1 IMAGE_TAG=19.05.2 docker-compose build

Starting the Cluster

Run docker-compose to instantiate the cluster:

IMAGE_TAG=19.05.2 docker-compose up -d

Register the Cluster with SlurmDBD

To register the cluster to the slurmdbd daemon, run the register_cluster.sh script:

./register_cluster.sh

Note: You may have to wait a few seconds for the cluster daemons to become ready before registering the cluster. Otherwise, you may get an error such as sacctmgr: error: Problem talking to the database: Connection refused.

You can check the status of the cluster by viewing the logs: docker-compose logs -f

Accessing the Cluster

Use docker exec to run a bash shell on the controller container:

docker exec -it slurmctld bash

From the shell, execute slurm commands, for example:

[root@slurmctld /]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 5-00:00:00      2   idle c[1-2]

Submitting Jobs

The slurm_jobdir named volume is mounted on each Slurm container as /data. Therefore, in order to see job output files while on the controller, change to the /data directory when on the slurmctld container and then submit a job:

[root@slurmctld /]# cd /data/
[root@slurmctld data]# sbatch --wrap="hostname"
Submitted batch job 2
[root@slurmctld data]# ls
slurm-2.out
[root@slurmctld data]# cat slurm-2.out
c1

Stopping and Restarting the Cluster

docker-compose stop
docker-compose start

Deleting the Cluster

To remove all containers and volumes, run:

docker-compose stop
docker-compose rm -f
docker volume rm slurm-docker-cluster_etc_munge slurm-docker-cluster_etc_slurm slurm-docker-cluster_slurm_jobdir slurm-docker-cluster_var_lib_mysql slurm-docker-cluster_var_log_slurm

Updating the Cluster

If you want to change the slurm.conf or slurmdbd.conf file without a rebuilding you can do so by calling

./update_slurmfiles.sh slurm.conf slurmdbd.conf

(or just one of the files). The Cluster will automatically be restarted afterwards with

docker-compose restart

This might come in handy if you add or remove a node to your cluster or want to test a new setting.

Open Source Agenda is not affiliated with "Slurm Docker Cluster" Project. README Source: giovtorres/slurm-docker-cluster
Stars
261
Open Issues
13
Last Commit
2 weeks ago
License
MIT

Open Source Agenda Badge

Open Source Agenda Rating