Resource control¶
To ensure system stability and fair distribution of compute resources (CPU, memory and IO), we use kernel resource control features (cgroup v2) and a userspace OOM killer (systemd-oomd), which uses pressure stall information (PSI) metrics to monitor user cgroups and take corrective action before an OOM occurs in kernel space.
Introduction¶
Most actions (such as entering a command or starting a program) require allocation of memory. When the system memory is fully exhausted, the Kernel OOM (Out Of Memory) killer is invoked reactively to kill a (basically random) process in order to free up some memory. This can leave the system in an undefined or broken state if the wrong process is killed.
A more serious problem arises when multiple users/processes compete for memory or when processes fork heavily during an OOM condition (for instance compute jobs that start many processes). The OOM killer alone is unlikely to correct the problem while new processes are constantly being forked.
Memory starvation will always end up in disk IO at some point, because the Kernel has to drop page caches and possibly needed code pages from memory. At the latest by then the system will be completely unresponsive, because disks (SSDs, HDDs) are many orders of magnitude slower than RAM, which slows everything down to a snail's pace. This is sometimes incorrectly described as "it crashed" or "it froze", but the Kernel is probably still trying to fix the problem; we will just never know.
This kind of problem is difficult to debug and the only solution is pulling the power cord (power cycle). All unsaved work, including that of other users, is lost.
We can prevent this situation by enforcing resource limits, accounting resources per user or per group of processes, and using a userspace OOM killer that is aware of all those variables. This guide shows our default settings on managed Linux workstations and how you can monitor and configure your jobs to stay within healthy limits.
Control Group v2¶
Our managed Linux workstations use the newer version of cgroup (v2). This can be confirmed with:
user@host:~$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 ...
cgroup is a mechanism to organize processes hierarchically and distribute system resources along the hierarchy in a controlled and configurable manner. cgroups form a tree structure and every process in the system belongs to one and only one cgroup. All threads of a process belong to the same cgroup. On creation, all processes are put in the cgroup that the parent process belongs to at the time.
For more information refer to the kernel docs.
cgroup v2 only allows a single writer of the configuration, which on Debian is systemd. So we first need to cover some systemd resource control concepts.
Systemd's Resource Control Concepts¶
Systemd provides three unit types that are useful for the purpose of resource control, encapsulation of processes or grouping of systemd units:
- services: encapsulate processes that are managed by systemd (defined by configuration)
- scopes: encapsulate processes that are NOT managed by systemd (created programmatically)
- slices: group services and scopes together in a hierarchical tree
See man 5 systemd.resource-control for more information.
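To get a quick overview of these unit types on a running system, the standard systemctl listing commands can be used, for example:
systemctl list-units --type=slice            # all slices (system manager)
systemctl --user list-units --type=scope     # your own transient scopes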
Slices and cgroups¶
Our systemd slice hierarchy also defines the cgroup hierarchy.
The resources are distributed across 3 slices:
system.slice
: contains all system services
user.slice
: contains all user slices
hostcritical.slice
: contains critical services required for system responsiveness
The actual tree structure and the processes in each cgroup can be inspected with systemd-cgls. To view only the user.slice, which contains all user slices, use systemd-cgls -u user.slice.
CPU and IO¶
We deploy the following default settings (CPU and IO) for all user cgroups:
CPUWeight=100
IOWeight=100
...
This equally distributes those resources between cgroups attached to the same parent branch in the tree.
In the user.slice this effectively distributes resources equally between users. In previous versions of Linux, or without those settings, the distribution of CPU cycles was determined only by the CPU scheduler and nice levels, so whoever had the most processes got the most CPU resources.
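The effective weights for your own user slice can be inspected with systemctl show, for example:
systemctl show user-${UID}.slice | grep -iE '(cpu|io)weight'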
Memory¶
Memory is the most important resource to control to ensure system stability as described above.
Our settings are mostly work-conserving settings for services in hostcritical.slice and for user-0.slice (root).
On multi-user machines (and currently activated by default for all workstations with >60 GiB physical RAM) we deploy additional settings to reserve some memory for the system and for other users. This helps to ensure system stability and responsiveness during user-induced OOM conditions: there should always be some resources available for other users to open a new session, and one user alone should not be able to consume all resources and thereby interrupt the work of other users.
The settings for the user.slice effectively reserve 1-2 GiB (MemoryMax) for the system:
MemoryMax=RAM - reserved
MemoryHigh=RAM - 2x reserved
...
The default settings for all user-<UID>.slices, defined in user-.slice, reserve an additional 1-2 GiB (MemoryMax) inside the user.slice:
MemoryMax=RAM - 2x reserved
MemoryHigh=RAM - 3x reserved
...
This means the maximum memory usage of a user can be: installed_physical_RAM - 2x reserved GiB.
On top of that we set an additional 1-2 GiB using MemoryHigh, which starts to induce memory reclaim pressure inside the cgroup.
Amount of reserved memory by physical RAM size:
- RAM >60 GiB: reserved=1 GiB
- RAM >120 GiB: reserved=2 GiB
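As an illustrative calculation (assuming a workstation with 128 GiB of physical RAM, i.e. reserved=2 GiB), the formulas above give roughly:
user.slice:       MemoryMax = 128 - 2 = 126 GiB, MemoryHigh = 128 - 4 = 124 GiB
user-<UID>.slice: MemoryMax = 128 - 4 = 124 GiB, MemoryHigh = 128 - 6 = 122 GiB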
You can check your actual settings using:
systemctl show user-${UID}.slice | grep -iE 'memory(cur|avail|high|max|swap)'
Tasks¶
The maximum number of tasks for a user is a percentage of the system maximum:
TasksMax=5%
The system maximum is hardware dependent and is the minimum of these two values:
cat /proc/sys/kernel/pid_max # usually ~4M
cat /proc/sys/kernel/threads-max # variable, depends on installed physical memory
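As an illustrative example, if the smaller of the two is pid_max at its usual value of 4194304, then TasksMax=5% corresponds to roughly 210,000 tasks per user.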
You can check your actual settings using:
systemctl show user-${UID}.slice | grep -iE 'tasks(cur|max)'
If you are exceeding the task limit you should get the following message in your shell:
-bash: fork: retry: Resource temporarily unavailable
Should you run into the task limit please contact us. We can increase the limit for specific users if needed.
User slices and session scopes¶
Every user gets their own user-<UID>.slice, which contains all their processes grouped into service and scope units. A user's login session, which contains the login shell process, lives inside a transient session-<NNN>.scope. The services are grouped beneath the user service manager user@<UID>.service.
To inspect your own user slice use:
systemd-cgls -u user-${UID}.slice
When connected via ssh, the output would look like this:
Unit user-11804.slice (/user.slice/user-11804.slice):
├─session-4837.scope (#18827)
│ ├─235070 sshd: user [priv]
│ ├─235100 sshd: user@pts/1
│ ├─235108 -bash
│ ├─235120 systemd-cgls -u user-11804.slice
│ └─235121 pager
└─user@11804.service … (#18570)
→ user.delegate: 1
→ user.invocation_id: 0a9252c67f51479e89f75239432ba068
├─session.slice (#18705)
│ └─dbus.service (#18991)
│ └─235101 /usr/bin/dbus-daemon --session --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
└─init.scope (#18617)
├─235073 /lib/systemd/systemd --user
└─235074 (sd-pam)
If you connect via ssh again from another terminal and are using session multiplexing as described in our recommended ssh settings, you will end up in a new login shell process (-bash) in the same session-4837.scope.
In that case to open a new session you can always use:
ssh -o ControlMaster=no -o ControlPath=none user@host
cgroup kernel interface¶
The actual Kernel interface for cgroup is mounted on /sys/fs/cgroup. It reflects the slices/services/scopes defined by systemd; you can inspect the structure with:
tree -d /sys/fs/cgroup
The actual cgroups and their settings are managed by systemd exclusively, except for delegated cgroups which may be managed by another process:
└─user@11804.service … (#18570)
→ user.delegate: 1
The service manager sets the user.delegate extended attribute (readable via getxattr(2) and related calls) to the character 1 on cgroup directories where delegation is enabled:
user@host:~$ getfattr -m - -de text /sys/fs/cgroup/user.slice/user-11804.slice/user@11804.service
getfattr: Removing leading '/' from absolute path names
# file: sys/fs/cgroup/user.slice/user-11804.slice/user@11804.service
user.delegate="1"
Refer to https://systemd.io/CGROUP_DELEGATION/ for details (TLDR: better don't touch it).
Locating the cgroup a process belongs to¶
To get the cgroup of your current shell/process use:
cat /proc/self/cgroup
or of another process:
cat /proc/<PID>/cgroup
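On cgroup v2 the output is a single line with the cgroup path relative to the cgroup mount point, for example (illustrative, reusing the session from the listing above):
0::/user.slice/user-11804.slice/session-4837.scope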
Alternatively, inspect the output of systemd-cgls or ps:
ps xawf -ewwo pid,nlwp,ppid,c,tname,stat,time:12,user:12,cgroup:54,args
Custom user scopes and settings¶
To execute processes in a custom scope (and cgroup) with user-defined settings, systemd-run can be used:
systemd-run --user --scope -u myscope -p MemoryMax=1G -p OOMPolicy=continue bash
This opens a bash shell in a new transient scope named myscope.scope in the app.slice beneath the user service manager:
└─user@11804.service … (#18570)
→ user.delegate: 1
→ user.invocation_id: 0a9252c67f51479e89f75239432ba068
├─session.slice (#18705)
│ └─dbus.service (#18991)
│ └─235101 /usr/bin/dbus-daemon --session --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
├─app.slice (#18661)
│ └─myscope.scope (#19397)
│ ├─245867 /usr/bin/bash
│ ├─245929 systemd-cgls -u user-11804.slice
│ └─245930 pager
It should have an upper limit for memory usage set to 1 GiB. This can be confirmed with:
systemctl --user show myscope.scope | grep MemoryMax
Due to user.delegate: 1 in the user service manager (user@<UID>.service) we can modify the scope's cgroup settings:
systemctl --user set-property myscope.scope MemoryMax=100M
If we now start a memory runaway process in myscope.scope, the process or scope should get killed by the kernel OOM killer when the cgroup reaches MemoryMax. You can check what happened in the user journal:
journalctl --user -f
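A minimal way to provoke this from the shell running inside myscope.scope (a sketch; any memory-hungry command will do) is:
tail /dev/zero    # buffers an endless stream of zeroes in memory until the limit is enforced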
If you omit -p OOMPolicy=continue, the default policy for services applies (OOMPolicy=stop) (systemd bug?), which results in systemd stopping the whole unit/cgroup as soon as one of its processes gets killed. If it cannot stop the unit/cgroup, it will be killed after some timeout.
If the scope was killed it should be in the list of failed services. To start it again it needs to be reset first:
systemctl --user --all --failed
systemctl --user reset-failed
You may also kill a scope yourself:
systemctl --user kill -s SIGKILL myscope.scope
See man 1 systemd-run for details.
Background or compute jobs¶
Background or compute jobs should be placed in the pre-defined user background.slice:
systemd-run --user --scope --slice background -u myscope1 -p OOMPolicy=continue screen -S myscreen1
Multiple jobs are best started in separate scopes (and screens) to fully isolate them, so that if one job exceeds a cgroup memory/swap limit, only the cgroup (scope) containing that job is killed:
systemd-run --user --scope --slice background -u myscope2 -p OOMPolicy=continue screen -S myscreen2
systemd-run --user --scope --slice background -u myscope3 -p OOMPolicy=continue screen -S myscreen3 path/to/compute_job.py
The background.slice has a lower default CPUWeight=30 while other user slices default to CPUWeight=100, which effectively prioritizes foreground tasks and improves the responsiveness of desktop applications and CLI interaction.
See man 7 systemd.special and also GNU screen for more information.
As an alternative to screen you may also use any other command (bash, python, etc.) or tmux -L <unique-socket-name>. Note that with tmux you must start each instance/session with the parameter -L <unique-socket-name> to ensure tmux forks off as a new process from init (PID 1) instead of spawning a shell in a child process of the main tmux process. For simplicity we recommend screen over tmux.
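A hypothetical tmux equivalent of the screen examples above (socket and session names chosen arbitrarily) would be:
systemd-run --user --scope --slice background -u myscope4 -p OOMPolicy=continue tmux -L myscope4 new-session -s myjob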
systemd-cgtop¶
The CLI program systemd-cgtop interactively shows resource usage (tasks, CPU, memory, IO) per cgroup, is less resource intensive than top or htop, and still works when huge numbers of tasks are running.
See man 1 systemd-cgtop for details.
You can also use it in batch mode, for instance to show which cgroup uses the most memory:
user@host:~$ systemd-cgtop -b1m
Control Group Tasks %CPU Memory Input/s Output/s
/ 223 - 7.7G - -
user.slice 33 - 7.3G - -
user.slice/user-11804.slice 20 - 7.1G - -
user.slice/user-11804.slice/session-381.scope 4 - 7.1G - -
hostcritical.slice 17 - 361.3M - -
user.slice/user-0.slice 13 - 151.9M - -
It is important to note that it shows memory usage as accounted by cgroup, which accounts for all memory types (incl. buffers, caches, socket and kernel memory), unlike other tools such as htop which usually only show RSS (basically just anonymous pages and mapped file memory).
PSI¶
Pressure Stall Information (PSI) provides a canonical way to see resource pressure increases as they develop, with new pressure metrics for three major resources: cpu, memory and io.
For instance memory.pressure tells us the percentage of time during which some or all (full) tasks in the cgroup were stuck because they had to do memory work (waiting for a kernel memory lock, throttling, reclaim, swapping) during the last 10, 60 or 300 seconds. Or in other words: "If I had more of this resource, I could probably run N% faster."
System-wide metrics are in /proc/pressure/*:
user@host:~$ cat /proc/pressure/cpu
some avg10=6.85 avg60=6.26 avg300=3.94 total=1050213423
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
user@host:~$ cat /proc/pressure/memory
some avg10=3.98 avg60=3.68 avg300=1.97 total=829285049
full avg10=3.73 avg60=3.51 avg300=1.90 total=699656579
user@host:~$ cat /proc/pressure/io
some avg10=0.07 avg60=0.18 avg300=0.22 total=344046532
full avg10=0.01 avg60=0.08 avg300=0.18 total=191071593
Per-cgroup metrics are in /sys/fs/cgroup/*/*.pressure:
user@host:~$ cat /sys/fs/cgroup/user.slice/user-11804.slice/cpu.pressure
some avg10=7.21 avg60=3.54 avg300=0.99 total=3570628
full avg10=0.42 avg60=0.41 avg300=0.12 total=724864
user@host:~$ cat /sys/fs/cgroup/user.slice/user-11804.slice/memory.pressure
some avg10=12.96 avg60=7.47 avg300=1.98 total=6428873
full avg10=12.93 avg60=7.45 avg300=1.97 total=6417159
user@host:~$ cat /sys/fs/cgroup/user.slice/user-11804.slice/io.pressure
some avg10=0.21 avg60=0.20 avg300=0.06 total=372980
full avg10=0.11 avg60=0.15 avg300=0.04 total=303310
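To follow the pressure of your own user slice as it develops, a simple loop over the cgroup files shown above can be used, for instance:
watch -n1 cat /sys/fs/cgroup/user.slice/user-$UID.slice/memory.pressure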
Refer to the PSI overview by Facebook or the kernel PSI docs for more information.
psi-notify¶
psi-notify is a minimal unprivileged notifier for system-wide resource pressure using PSI.
This can help you to identify misbehaving applications before they start to severely impact system responsiveness.
The service is automatically started when you log in, see systemctl --user status psi-notify.service.
It will show a notification on your desktop when a certain resource pressure limit is reached, and produce a log entry in your journal (journalctl --user).
For custom or default configuration settings refer to the GitHub page.
psitop¶
psitop is like top for /proc/pressure. It allows you to see resource contention for CPU, IO and memory separately, with high-resolution 10 second load averages. Use the keybindings shown in the interface. See also its GitHub page.
Memory types¶
There are different kinds of memory in Linux. Some of them and their corresponding colors in htop are:
- anonymous pages (green): unreclaimable, not backed by a backing store
- page cache (yellow): maybe reclaimable (file cache and code pages)
- other memory (blue/purple): buffers, shared, socket, kernel slab, stack
Refer to Linux Kernel memory management concepts for an explanation of terms like page cache, anonymous pages or reclaim.
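The per-cgroup breakdown of these memory types can also be read from the memory.stat file in the cgroup v2 interface, for example (path as in the examples above):
grep -E '^(anon|file|slab) ' /sys/fs/cgroup/user.slice/user-$UID.slice/memory.stat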
Swap¶
Swap provides a backing store for otherwise unreclaimable memory (anonymous pages). This usually is memory allocated using malloc or mmap MAP_ANONYMOUS, where the only copy of the data resides in memory (it is locked in memory).
Swap allows the Kernel to reclaim those kinds of pages. Under memory pressure the Kernel can swap out pages to free up memory. If a program accesses a swapped-out memory page, this results in a page fault and the Kernel has to load the page back into memory. If this cycle happens over and over again, it is called "thrashing".
So swap basically allows memory pressure to ramp up more slowly. It allows the maximum amount of physical memory to be used efficiently, without the immediate risk of a Kernel-space OOM kill if programs go 1 byte over the edge.
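To observe swap-in/swap-out activity while this happens, standard tools such as vmstat can be used (the si/so columns show swapping activity):
vmstat 1 5    # 5 samples at 1 second intervals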
Swap misconceptions¶
There are some misconceptions about swap among users and sysadmins:
Swap is not an extension of memory or emergency memory (swapped-out pages cannot be used by programs directly). Having no swap does not mean that there is no disk IO: memory pressure will always result in disk IO, with or without swap.
Refer to bit.ly/whyswap for a more detailed and accurate explanation.
zram or zswap¶
In the past swap space was provided by HDDs or SSDs. Newer Kernel features allow swap space to be provided by compressing memory pages and storing them in memory. This costs additional CPU cycles for the compression, but it can still be many times faster than having to store the pages on a disk. It is important to note that the compressed pages are stored in memory and therefore the actual amount of memory usable by programs is reduced.
Refer to the kernel docs about zram and zswap for more information.
Swap on managed Linux workstations¶
We deploy swap backed by zram (0.5 x RAM size) with additional disk-based swap (0.5 x RAM size, up to a maximum of 64 GiB). The zram swap has a higher priority and is used first; the disk swap is only there for emergencies and abnormal situations. Normally operating workloads will never reach the low-priority disk-based swap. The sometimes huge amount of zram (0.5 x RAM) is that large mainly for abnormal workloads, as we will see later.
To see the actual amount of configured swap use /usr/sbin/swapon or cat /proc/swaps.
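On such a machine the output of swapon looks roughly like this (illustrative sizes; note the higher priority of the zram device):
NAME       TYPE      SIZE USED PRIO
/dev/zram0 partition  32G   0B  100
/dev/sda3  partition  32G   0B   -2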
Keep in mind that if you are using a lot of swap, something is probably wrong with the software/code, or it could run faster, since swapped-out pages can never be used in computations. We believe that for the computational use-cases at the D-PHYS it makes more sense to simply install more RAM than to optimize tasks to run with 100% memory efficiency.
To check your actual swap usage use oomctl, provided by systemd-oomd, which is explained in the next section.
systemd-oomd¶
systemd-oomd is a system service that uses cgroups-v2 and pressure stall information (PSI) to monitor and take corrective action before an OOM occurs in the kernel space. It requires swap to function properly.
It is basically a userspace OOM killer, which is activated proactively while memory pressure rises, whilst the Kernel OOM killer is only reactive and activated when it is already too late.
It is activated either when a certain amount of total swap is used (Swap Used Limit) or when memory pressure stays above a certain level (Memory Pressure Limit) for some time (Memory Pressure Duration). If the configured limits are exceeded, systemd-oomd will select a cgroup to terminate and send SIGKILL to all processes in it.
See man 8 systemd-oomd for details and man 5 oomd.conf (/SwapUsedLimit=) for the kill selection algorithm.
You can check the actual systemd-oomd settings on your system and your swap usage using the command oomctl.
We deploy different settings based on the hardware. In particular, SwapUsedLimit is set to 90% of zram by default.