Skip to content

wu-kan/slurm-docker-cluster

 
 

Repository files navigation

Slurm Docker Cluster

This is a multi-container Slurm cluster using docker compose. The compose file creates named volumes for persistent storage of MySQL data files as well as Slurm state and log directories.

Containers and Volumes

The compose file will run the following containers:

  • mysql
  • slurmdbd
  • slurmctld
  • c1 (slurmd)
  • c2 (slurmd)

The compose file will create the following named volumes:

  • etc_munge ( -> /etc/munge )
  • etc_slurm ( -> /etc/slurm )
  • slurm_jobdir ( -> /data )
  • var_lib_mysql ( -> /var/lib/mysql )
  • var_log_slurm ( -> /var/log/slurm )

Building the Docker Image

Build the image locally:

docker build -t slurm-docker-cluster .

Or equivalently using docker compose:

docker compose build

Starting the Cluster

Run docker compose to instantiate the cluster:

docker compose up -d

Register the Cluster with SlurmDBD

To register the cluster to the slurmdbd daemon, run the register_cluster.sh script:

./register_cluster.sh

Note: You may have to wait a few seconds for the cluster daemons to become ready before registering the cluster. Otherwise, you may get an error such as sacctmgr: error: Problem talking to the database: Connection refused.

You can check the status of the cluster by viewing the logs: docker compose logs -f

Accessing the Cluster

Use docker exec to run a bash shell on the controller container:

docker exec -it slurmctld bash

From the shell, execute slurm commands, for example:

[root@slurmctld /]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 5-00:00:00      2   idle c[1-2]

Submitting Jobs

The slurm_jobdir named volume is mounted on each Slurm container as /data. Therefore, in order to see job output files while on the controller, change to the /data directory when on the slurmctld container and then submit a job:

[root@slurmctld /]# cd /data/
[root@slurmctld data]# sbatch --wrap="uptime"
Submitted batch job 2
[root@slurmctld data]# ls
slurm-2.out

Stopping and Restarting the Cluster

docker compose stop
docker compose start

Deleting the Cluster

To remove all containers and volumes, run:

docker compose stop
docker compose rm -f
docker volume rm slurm-docker-cluster_etc_munge slurm-docker-cluster_etc_slurm slurm-docker-cluster_slurm_jobdir slurm-docker-cluster_var_lib_mysql slurm-docker-cluster_var_log_slurm

About

A Slurm cluster using docker-compose

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Shell 59.4%
  • Dockerfile 40.6%