Slurm
Slurm (historically: Simple Linux Utility for Resource Management) is software for orchestrating workloads on HPC clusters. It is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm has three key functions:
- Managing access and allocating resources (compute nodes) to users for some duration of time so they can perform work.
- Starting, executing, and monitoring work (normally parallel jobs) on the set of allocated nodes.
- Arbitrating contention for resources by managing work queues and scheduling jobs.
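As a rough illustration of how these three functions appear from a user's point of view, the sketch below builds the text of a minimal batch script. The helper `render_batch_script` and the example values (job name, node count, time limit, command) are hypothetical and not taken from the D-PHYS setup; the `#SBATCH` options shown (`--job-name`, `--nodes`, `--time`) and the `srun` launcher are standard Slurm, but any partition, account, or access settings a particular cluster requires are not covered here.

```python
def render_batch_script(job_name, nodes, time_limit, command):
    """Return the text of a minimal Slurm batch script.

    Hypothetical helper for illustration only; the values passed in
    are assumptions, not settings of any particular cluster.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",  # label shown in the queue
        f"#SBATCH --nodes={nodes}",        # resources: number of compute nodes
        f"#SBATCH --time={time_limit}",    # duration: wall-clock limit (HH:MM:SS)
        "",
        f"srun {command}",                 # start the work on the allocated nodes
    ]
    return "\n".join(lines) + "\n"


# Example values are assumptions chosen for the sketch.
script = render_batch_script("hello", nodes=2, time_limit="00:10:00", command="hostname")
print(script)
```

Saving the printed text to a file and passing that file to `sbatch` would place the job in the queue, where it waits until the scheduler grants the requested nodes; `squeue` lists pending and running jobs.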
At the moment, D-PHYS maintains a 64-node Slurm cluster with restricted access rights.
External docs: slurm.schedmd.com/quickstart