SLURM

The cluster runs Slurm (formerly an acronym for Simple Linux Utility for Resource Management).
All jobs MUST be scheduled using Slurm!
For a quick start you need to know a few basic Slurm commands:
Listing nodes
To list basic information about all nodes, including their status, type:
user@nod-mgmt:~$ slurmls
Sample output:
PARTITION NODELIST CPUS(A/I/O/T) STATE MEMORY GRES REASON
cpus* nod-01 36/0/0/36 alloc 1 (null) none
cpus* nod-02 36/0/0/36 alloc 1 (null) none
cpus* nod-03 0/36/0/36 idle 1 (null) none
cpus* nod-04 36/0/0/36 alloc 1 (null) none
cpus* nod-05 0/36/0/36 idle 1 (null) none
cpus* nod-06 0/36/0/36 idle 1 (null) none
cpus* nod-07 0/36/0/36 idle 1 (null) none
cpus* nod-08 0/36/0/36 idle 1 (null) none
cpus* nod-09 0/36/0/36 idle 1 (null) none
cpus* nod-10 0/36/0/36 idle 1 (null) none
cpus* nod-11 0/36/0/36 idle 1 (null) none
cpus* nod-12 0/36/0/36 idle 1 (null) none
cpus* nod-13 0/36/0/36 idle 1 (null) none
cpus* nod-14 0/36/0/36 idle 1 (null) none
cpus* nod-15 0/36/0/36 idle 1 (null) none
cpus* nod-16 0/36/0/36 idle 1 (null) none
gpus nod-17 40/0/0/40 alloc 1 gpu:Tesla:2 none
gpus nod-18 0/40/0/40 idle 1 gpu:Tesla:2 none
gpus nod-19 36/4/0/40 mix 1 gpu:Tesla:2 none
gpus nod-20 40/0/0/40 alloc 1 gpu:Tesla:2 none
Explanation of the above output:
PARTITION | the cluster is logically divided into CPU nodes (CPU cores only) and GPU nodes (CPU & GPU cores). When scheduling a job you have to specify which partition you want to use - 'cpus' or 'gpus' |
NODELIST | the hostname of a computing node - 'nod-01' to 'nod-20' |
CPUS (A/I/O/T) | the number of CPUs on a node: A (allocated), I (idle), O (other) and T (total) |
STATE | may be: alloc - fully allocated node, idle - no resources allocated, mix - partially allocated (some CPUs in use, some idle), offline - disabled from scheduling by the administrator, down - failed |
MEMORY | the amount of memory available on a node |
GRES | Generic RESource - additional resources a node may have - in our case, the GPU nodes have 2 Tesla GPU cards each |
REASON | a comment (usually set by the administrator) |
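Note: slurmls is a site-specific convenience command, not part of a stock Slurm installation. On any Slurm cluster, similar per-node output can be produced with sinfo, for example:
user@nod-mgmt:~$ sinfo -N -O partition,nodelist,cpusstate,statecompact,memory,gres,reason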
Listing the queue
To list all scheduled jobs, type:
user@nod-mgmt:~$ squeue
Sample output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1894 gpus bash krider R 46-06:30:12 1 nod-17
1948 cpus bash jdoe R 16-18:50:01 1 nod-04
1966 gpus bash jdoe R 16-17:52:11 1 nod-19
1967 gpus bash jdoe R 16-17:51:49 1 nod-20
1977 cpus bash jdoe R 3-02:15:11 1 nod-01
1982 gpus bash jdoe R 23:15:26 1 nod-17
1983 cpus bash alincoln R 1:16:58 1 nod-02
Explanation:
JOBID | the identity number of a job |
PARTITION | the partition a job is scheduled on |
NAME | the name of the job |
USER | the user who scheduled the job |
ST | the status of a job; the most common states are: R - running, PD - pending (waiting in the queue) |
TIME | run time, in the format days-hours:minutes:seconds |
NODES | the number of nodes allocated to the job |
NODELIST(REASON) | the list of nodes involved in the job, or - for a pending job - the reason it is waiting |
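To show only your own jobs, filter the queue by user name:
user@nod-mgmt:~$ squeue -u $USER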
Running a job
srun
Example of running a job that requests 4 nodes from the "gpus" partition, with the generic resource "gpu:Tesla:2" on each node, running the "hostname" command:
user@nod-mgmt:~$ srun -N4 --partition=gpus --gres=gpu:Tesla:2 hostname
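srun can also start an interactive session on a compute node (the jobs named "bash" in the queue listing above are such sessions). A minimal sketch, assuming interactive jobs are permitted on this cluster:
user@nod-mgmt:~$ srun --partition=cpus --pty bash
This allocates resources on a 'cpus' node and opens a shell there; exit the shell to release the allocation.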
sbatch
To submit a batch script to Slurm, use:
Example:
user@nod-mgmt:~$ sbatch my_batch_job.sh
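If you are unsure what a batch script should contain, here is a minimal sketch of what my_batch_job.sh might look like; the job name, time limit and resource values are illustrative, not site defaults:
#!/bin/bash
#SBATCH --job-name=my_job        # job name shown in squeue
#SBATCH --partition=gpus         # 'cpus' or 'gpus'
#SBATCH --nodes=1                # number of nodes to allocate
#SBATCH --ntasks=1               # number of tasks to run
#SBATCH --gres=gpu:Tesla:2       # request both Tesla GPUs on the node
#SBATCH --time=01:00:00          # time limit (hh:mm:ss)
#SBATCH --output=my_job_%j.out   # output file; %j expands to the JOBID

hostname

sbatch returns immediately after queuing the job; output appears in the file given by --output once the job runs.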
Canceling a job
scancel
To cancel a job you have to know its JOBID, which you pass as a parameter.
Example:
user@nod-mgmt:~$ scancel 19467
You can only cancel your own jobs.
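The JOBID of your running jobs can be found with squeue (see above). To cancel all of your own jobs at once, scancel also accepts a user name:
user@nod-mgmt:~$ scancel -u $USER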