Basic Terms

From a user's perspective, not much of Slurm's architecture is visible. Users do not interact with the individual nodes of a cluster directly. Instead, they interact with a dedicated control node through a certain set of command line tools, while Slurm handles the orchestration of the compute nodes in the background. Nonetheless, it is useful to know some of its components and terminology.

You interact with the cluster by talking to the Slurm control daemon (slurmctld), which runs on the control node of the cluster. This daemon takes care of the cluster by orchestrating the Slurm daemons (slurmd) running on the individual compute nodes. The control node runs a Slurm daemon as well, but is not used as a compute resource itself. Usually a cluster is split into several partitions, which may have different configurations. Partitions are purely logical groupings that facilitate job allocation. It is also possible to submit a job to multiple partitions at once.
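
For example, the partition layout can be inspected directly from a shell on the control node. The commands below are standard Slurm tools and should be available on any installation; the partition name main refers to the partition described further down:

    # list all partitions together with the state of their nodes
    sinfo

    # show the detailed configuration of a single partition (here: main)
    scontrol show partition main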

Everything is managed and controlled from a shell on the control node using a set of Slurm commands. Most of these "s-commands" (srun, sbatch, ...) offer many options to control how jobs should use the cluster.
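
As a minimal sketch, a single command can be run as a job with srun:

    srun hostname

Larger jobs are usually described in a batch script and submitted with sbatch. The values below are placeholders, not recommendations:

    #!/bin/bash
    #SBATCH --job-name=hello        # name shown in the queue
    #SBATCH --ntasks=1              # number of tasks (processes)
    #SBATCH --time=00:10:00         # wall-time limit (placeholder)

    srun hostname

If this script is saved as job.sh, it is submitted to the scheduler with sbatch job.sh.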

Hardware

guenther cluster

D-PHYS maintains a cluster called guenther. It consists of 64 identical nodes, which are split into the two partitions main and test. The first node, guenther1, is not used for computations but serves as a dedicated control node.

Partition   Number of Nodes   Nodes
-           1                 phd-guenther (Control Node)
main        60                guenther2-61
test        3                 guenther62-64

The test partition is meant as a toy cluster for testing things without interfering with calculations running on the other nodes.
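
For instance, a quick test run can be directed to this partition by selecting it explicitly (hostname is just a stand-in for a real program):

    # run a short test job on the test partition instead of the default
    srun --partition=test hostname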

Each node is running Ubuntu 20.04 and has the same hardware specifications:

CPU                            Frequency   HW Threads (total)   Sockets   Cores per Socket   Threads per Core   Memory     Scratch
Intel(R) Xeon(R) CPU E5-2660   2.20 GHz    40                   2         10                 2                  32/64 GB   4.5 TB
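
The same information is also available through Slurm itself. As a sketch, the resources of a single node can be queried with scontrol (guenther2 stands for any compute node):

    # show CPUs, memory, and state of one node as seen by Slurm
    scontrol show node guenther2

    # per-node overview of the whole cluster
    sinfo --Node --long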

So in principle, using all hardware threads on all 60 nodes of the main partition, 2400 individual tasks can run in parallel. Usually it is more useful to tailor the requested resources as precisely as possible to the computation at hand. If Slurm cannot provide the requested resources, for example because jobs from other users are already running, the job is queued and has to wait until enough resources can be allocated.
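
As a hedged example of such a tailored request, the sketch below asks for two full nodes of the main partition. The memory and time values are placeholders that should be adapted to the actual computation, and my_program is a stand-in for your own executable:

    #!/bin/bash
    #SBATCH --partition=main         # run on the main partition
    #SBATCH --nodes=2                # number of nodes
    #SBATCH --ntasks-per-node=40     # one task per hardware thread
    #SBATCH --mem=30G                # memory per node (placeholder)
    #SBATCH --time=02:00:00          # wall-time limit (placeholder)

    srun ./my_program

Whether such a job is already running or still waiting in the queue can be checked with squeue -u $USER.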