Torque

Torque is a resource manager and queueing system. It is a fork of OpenPBS, which is no longer actively maintained. The name stands for Terascale Open-Source Resource and QUEue Manager.

Job configuration

Any executable script can be submitted to Torque. The code does not need to be linked to any specific libraries. The configuration of a job is done through PBS directives at the beginning of the script. Consider the following example testjob.sh:

#!/bin/bash
#PBS -M username@phys.ethz.ch
#PBS -N myjobname
#PBS -l walltime=02:30:00
#PBS -l nodes=1:ppn=24
#PBS -l mem=2gb
#PBS -l vmem=8gb
cd $PBS_O_WORKDIR
echo "Sleeping"
sleep 10
exit 0

The job named myjobname is expected to run for less than two and a half hours on 24 cores of a single node. It requests 2GB of memory (mem) and may use up to 8GB of virtual memory (vmem). The -M directive sets the email address used for job notifications. All the job does is print a line of text to standard output and sleep for 10 seconds before exiting successfully.

The cd $PBS_O_WORKDIR line changes into the directory from which the job was submitted, so that the job output is saved there instead of in the home directory of the user, where jobs start by default.
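As a minimal sketch of the same idea, the following lines could stand in for the plain cd at the top of a job script. PBS_O_WORKDIR and PBS_JOBID are set by Torque inside a job; the fallback to the current directory and the variable name workdir are assumptions added for illustration, so that the script can also be run by hand outside the queue.

```shell
#!/bin/bash
# Use the submission directory under Torque; fall back to the current
# directory when the script is run by hand (assumption for illustration).
workdir="${PBS_O_WORKDIR:-$PWD}"
cd "$workdir" || exit 1

# PBS_JOBID is set by Torque; "interactive" is a placeholder used outside a job.
echo "Job ${PBS_JOBID:-interactive} running in $(pwd)"
```

Quoting the variable keeps the cd working when the submission directory contains spaces.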

Queue policies

The queue delays the start of the job until all required resources are available. In the above example, the queue waits until 24 cores and at least 2GB of memory are free.

If a running job exceeds the resources it requested, it can be killed by the queue. This happens, for instance, if the job is still running after two and a half hours or if its memory use exceeds 8GB. This ensures that jobs with potential memory leaks are killed before they block other jobs or, even worse, the whole system.
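The walltime limit is written as HH:MM:SS. As a small illustration of the arithmetic, the helper below converts such a string to seconds, which may be useful if a script wants to compare its own elapsed time against the limit. The function name walltime_to_seconds is hypothetical and not part of Torque.

```shell
#!/bin/bash
# Convert a walltime string of the form HH:MM:SS to seconds.
# The body runs in a subshell, so changing IFS does not affect the caller.
walltime_to_seconds() (
    IFS=:
    set -- $1                      # split the argument on colons
    # Strip one leading zero so values such as 08 are not read as octal
    h=${1#0}; m=${2#0}; s=${3#0}
    echo $(( ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} ))
)

walltime_to_seconds 02:30:00   # prints 9000
```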

Job submission

qsub testjob.sh  # submit job to the queue
qstat -a         # list jobs in the queue
qdel <jobID>     # remove job numbered <jobID> from the queue

(The Torque binaries can be found in /opt/torque/bin/.)

Jobs in the queue are flagged R while they are running, Q while queued, and C for 15 minutes after they have completed. After that time they disappear from the queue.
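The three commands can be combined in a script. The sketch below relies on qsub printing the job identifier (of the form number.servername) on standard output; the function name submit_job is hypothetical, and the guard around the call is only there so that the snippet does nothing harmful on a machine without Torque.

```shell
#!/bin/bash
# Submit a job script and print only the numeric part of the job ID.
submit_job() {
    jobid=$(qsub "$1") || return 1
    # qsub prints an identifier such as 1234.servername;
    # keep the part before the first dot
    echo "${jobid%%.*}"
}

if command -v qsub >/dev/null 2>&1; then
    # Submit the example script, then show its state (Q, R or C)
    id=$(submit_job testjob.sh) && qstat "$id"
else
    echo "qsub not found; this sketch needs a Torque installation" >&2
fi
```

The number stored in id can later be passed to qdel to remove the job from the queue.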

If you get an error message related to a wrong ssh passphrase, try submitting your job as follows:

qsub -v SSH_AUTH_SOCK testjob.sh

When submitting MPI jobs, you can add an option so that termination signals received by mpirun are forwarded to the computing processes. This ensures that all processes are properly killed when the job is removed from the queue using qdel.

mpirun -mca orte_forward_job_control 1 -np 2 a.out
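Put together, an MPI job script might look like the sketch below. It assumes that Torque sets PBS_NODEFILE to a file with one line per allocated core, and that the program is an executable ./a.out in the submission directory. The job name, the resource values and the helper count_slots are placeholders chosen for illustration.

```shell
#!/bin/bash
#PBS -N mympijob
#PBS -l walltime=00:10:00
#PBS -l nodes=1:ppn=2

# Count the slots listed in a node file (one non-empty line per core)
count_slots() {
    grep -c . "$1"
}

if [ -n "$PBS_NODEFILE" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    nprocs=$(count_slots "$PBS_NODEFILE")
    # Forward job control signals so that qdel stops all processes
    mpirun -mca orte_forward_job_control 1 -np "$nprocs" ./a.out
else
    echo "PBS_NODEFILE is not set; submit this script with qsub" >&2
fi
```

Deriving the process count from the node file keeps the -np value in step with the nodes and ppn request, so only the directive needs to be changed.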

Job output

After completion of the job, two files are created for it:

  • myjobname.o<jobID> : contains the standard output of your job
  • myjobname.e<jobID> : contains the error output of your job

By default these files are stored at the top level of your home directory. If you include the cd $PBS_O_WORKDIR command in your script, they are saved in the directory from which the job was submitted.
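To inspect the output of a finished job, the file names can be built from the job name and the job ID. The following is a sketch under the naming scheme described above; the function output_files is hypothetical, and the job name and ID are placeholders.

```shell
#!/bin/bash
# Print the names of the two files written for a finished job.
# Arguments: job name (from #PBS -N) and job ID as printed by qsub.
output_files() {
    id=${2%%.*}   # keep only the number from an ID such as 1234.servername
    echo "$1.o$id"
    echo "$1.e$id"
}

# Show the last lines of each file if it exists in the current directory.
for f in $(output_files myjobname 1234.servername); do
    if [ -f "$f" ]; then
        echo "== $f =="
        tail -n 5 "$f"
    fi
done
```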