## Basic Terms
From a user's perspective, not much of Slurm's architecture is visible. Users do not interact with the individual nodes of a cluster directly. Instead, they interact with a dedicated control node through a set of command line tools, while Slurm handles the orchestration of the compute nodes in the background. Nonetheless, it is useful to know some of its components and terminology.
You interact with the cluster by talking to the Slurm control daemon (`slurmctld`), which runs on the control node of the cluster. This daemon takes care of the cluster by orchestrating the Slurm daemons (`slurmd`) running on the individual compute nodes. The control node runs a Slurm daemon as well, but is not used as a compute resource itself. Usually a cluster is split into several partitions, which may have different configurations. Partitions are purely logical groupings that facilitate job allocations, and it is possible to distribute a job across multiple partitions.
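For a first look at what `slurmctld` manages, the state of partitions and nodes can be queried from the control node; a minimal sketch (the partition name is just an example):

```bash
# Ask the control daemon for an overview of all partitions and their nodes
sinfo

# Show the detailed configuration of a single partition
# ("main" is an example name; use a partition that exists on your cluster)
scontrol show partition main
```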
Everything is managed and controlled from a shell on the control node using a set of Slurm commands. Most of these "s-commands" (`srun`, `sbatch`, ...) have a lot of options to control how jobs should use the cluster.
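As a minimal sketch of how these commands are used, a job can be described in a small batch script and handed to `sbatch`; the job name, output file, and resource values below are illustrative, not site defaults:

```bash
#!/bin/bash
#SBATCH --job-name=hello         # illustrative job name
#SBATCH --output=hello-%j.out    # stdout/stderr file, %j expands to the job ID
#SBATCH --ntasks=1               # request a single task
#SBATCH --time=00:05:00          # wall-clock limit of five minutes

# These commands run on the compute node that slurmctld allocates
hostname
echo "Hello from job $SLURM_JOB_ID"
```

The script would be submitted with `sbatch hello.sh`; the same single command could also be launched interactively with `srun --ntasks=1 hostname`.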
## Hardware
### guenther cluster
D-PHYS maintains a cluster called `guenther`. It consists of 64 identical nodes which are split into the two partitions `main` and `test`. The first node, `guenther1`, is not used for computations but serves as a dedicated control node.
| Partition | Number of Nodes | Nodes |
|---|---|---|
| - | 1 | phd-guenther (Control Node) |
| main | 60 | guenther2-61 |
| test | 3 | guenther62-64 |
The `test` partition is meant as a toy cluster for trying things out without interfering with running calculations on the other nodes.
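To try something without touching the production nodes, a job can be pointed explicitly at the `test` partition; a small sketch (the commands and the `hello.sh` script are the illustrative examples from above):

```bash
# Run a short interactive command on one node of the test partition
srun --partition=test --ntasks=1 hostname

# Submit a batch script to the test partition instead of the default partition
sbatch --partition=test hello.sh
```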
Each node is running Ubuntu 20.04 and has the same hardware specifications:
| CPU | Frequency | HW Threads total | Sockets | Cores per Socket | Threads per Core | Memory | Scratch |
|---|---|---|---|---|---|---|---|
| Intel(R) Xeon(R) CPU E5-2660 | 2.20 GHz | 40 | 2 | 10 | 2 | 32/64 GB | 4.5 TB |
So in principle, using all 40 hardware threads on each of the 60 `main` nodes, 2400 individual tasks can run in parallel. Usually it is more useful to tailor the requested resources as precisely as possible to the computation at hand. If Slurm cannot provide the requested resources, for example because jobs from other users are already running, the job is queued and has to wait until enough resources can be allocated.
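As an example of tailoring a request to this hardware, the sketch below assumes a hypothetical program with 10 tasks that each use 4 hardware threads; all values are illustrative:

```bash
#!/bin/bash
#SBATCH --partition=main          # run on the production partition
#SBATCH --ntasks=10               # e.g. 10 MPI ranks
#SBATCH --cpus-per-task=4         # 4 threads per task, 40 hardware threads in total
#SBATCH --mem-per-cpu=1G          # memory per allocated hardware thread (illustrative)
#SBATCH --time=02:00:00           # upper bound on the run time

srun ./my_program                 # my_program is a placeholder for your executable
```

While the job is waiting for resources, `squeue -u $USER` shows its position and state in the queue.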