This is a multi-container Slurm cluster using docker compose. The compose file creates named volumes for persistent storage of MySQL data files as well as Slurm state and log directories.
The compose file will run the following containers:
- mysql
- slurmdbd
- slurmctld
- c1 (slurmd)
- c2 (slurmd)
The compose file will create the following named volumes:
- etc_munge ( -> /etc/munge )
- etc_slurm ( -> /etc/slurm )
- slurm_jobdir ( -> /data )
- var_lib_mysql ( -> /var/lib/mysql )
- var_log_slurm ( -> /var/log/slurm )
Build the image locally:
docker build -t slurm-docker-cluster .
Or equivalently using docker compose
:
docker compose build
Run docker compose
to instantiate the cluster:
docker compose up -d
To register the cluster to the slurmdbd daemon, run the register_cluster.sh
script:
./register_cluster.sh
Note: You may have to wait a few seconds for the cluster daemons to become ready before registering the cluster. Otherwise, you may get an error such as sacctmgr: error: Problem talking to the database: Connection refused.
You can check the status of the cluster by viewing the logs:
docker compose logs -f
Use docker exec
to run a bash shell on the controller container:
docker exec -it slurmctld bash
From the shell, execute slurm commands, for example:
[root@slurmctld /]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up 5-00:00:00 2 idle c[1-2]
The slurm_jobdir
named volume is mounted on each Slurm container as /data
.
Therefore, in order to see job output files while on the controller, change to
the /data
directory when on the slurmctld container and then submit a job:
[root@slurmctld /]# cd /data/
[root@slurmctld data]# sbatch --wrap="uptime"
Submitted batch job 2
[root@slurmctld data]# ls
slurm-2.out
docker compose stop
docker compose start
To remove all containers and volumes, run:
docker compose stop
docker compose rm -f
docker volume rm slurm-docker-cluster_etc_munge slurm-docker-cluster_etc_slurm slurm-docker-cluster_slurm_jobdir slurm-docker-cluster_var_lib_mysql slurm-docker-cluster_var_log_slurm