SLURM


The cluster runs SLURM (formerly known as the Simple Linux Utility for Resource Management), a workload manager and job scheduler.

All jobs MUST BE scheduled using SLURM!

For a quick start you have to know some basic Slurm commands:

Listing nodes

To list basic information about all nodes, including their state, type:


  user@nod-mgmt:~$ slurmls

Sample output:


 PARTITION     NODELIST    CPUS(A/I/O/T)    STATE        MEMORY       GRES        REASON
 cpus*          nod-01       36/0/0/36      alloc          1          (null)       none
 cpus*          nod-02       36/0/0/36      alloc          1          (null)       none
 cpus*          nod-03       0/36/0/36       idle          1          (null)       none
 cpus*          nod-04       36/0/0/36      alloc          1          (null)       none
 cpus*          nod-05       0/36/0/36       idle          1          (null)       none
 cpus*          nod-06       0/36/0/36       idle          1          (null)       none
 cpus*          nod-07       0/36/0/36       idle          1          (null)       none
 cpus*          nod-08       0/36/0/36       idle          1          (null)       none
 cpus*          nod-09       0/36/0/36       idle          1          (null)       none
 cpus*          nod-10       0/36/0/36       idle          1          (null)       none
 cpus*          nod-11       0/36/0/36       idle          1          (null)       none
 cpus*          nod-12       0/36/0/36       idle          1          (null)       none
 cpus*          nod-13       0/36/0/36       idle          1          (null)       none
 cpus*          nod-14       0/36/0/36       idle          1          (null)       none
 cpus*          nod-15       0/36/0/36       idle          1          (null)       none
 cpus*          nod-16       0/36/0/36       idle          1          (null)       none
  gpus          nod-17       40/0/0/40      alloc          1     gpu:Tesla:2       none
  gpus          nod-18       0/40/0/40       idle          1     gpu:Tesla:2       none
  gpus          nod-19       36/4/0/40        mix          1     gpu:Tesla:2       none
  gpus          nod-20       40/0/0/40      alloc          1     gpu:Tesla:2       none
  

Explanation of the above output:

 PARTITION - the cluster is logically divided into CPU nodes (CPU cores only) and GPU nodes (CPU & GPU cores). When scheduling a job you have to specify which partition you want to use - 'cpus' or 'gpus'
 NODELIST  - the hostname of a computing node - 'nod-01' to 'nod-20'
 CPUS      - the number of A (allocated), I (idle), O (other) and T (total) CPUs on a specific node
 STATE     - may be: alloc - fully allocated node, idle - no CPUs allocated, mix - partially allocated (some CPUs allocated, some idle), offline - disabled from scheduling by the administrator, down - failed
 MEMORY    - the amount of memory available on a node
 GRES      - Generic RESource - additional resources a node may have - in our case 2 x Tesla GPU cards on each GPU node
 REASON    - a comment (usually set by the administrator)
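
Note that the listing above comes from a site-specific command; plain Slurm provides the 'sinfo' command, which can print similar per-node information. A possible invocation (the format string below is only an illustrative sketch - the exact column layout of 'slurmls' may differ):


  user@nod-mgmt:~$ sinfo -N -o "%9P %10N %14C %8T %8m %14G %E"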

Listing the queue

To list all scheduled jobs, type:


  user@nod-mgmt:~$ squeue

Sample output:

 
  JOBID PARTITION     NAME       USER   ST        TIME   NODES  NODELIST(REASON)
   1894      gpus     bash     krider    R 46-06:30:12      1   nod-17
   1948      cpus     bash       jdoe    R 16-18:50:01      1   nod-04
   1966      gpus     bash       jdoe    R 16-17:52:11      1   nod-19
   1967      gpus     bash       jdoe    R 16-17:51:49      1   nod-20
   1977      cpus     bash       jdoe    R  3-02:15:11      1   nod-01
   1982      gpus     bash       jdoe    R    23:15:26      1   nod-17
   1983      cpus     bash   alincoln    R     1:16:58      1   nod-02

Explanation:

 JOBID     - the identity number of a job
 PARTITION - the partition on which a job is scheduled
 NAME      - the name of a job
 USER      - the user who scheduled a job
 ST        - the status of a job; the most common states are: R - running, PD - pending (waiting in the queue)
 TIME      - the run time of a job
 NODES     - the number of nodes allocated to a job
 NODELIST  - the list of nodes involved in a job
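
By default squeue shows the jobs of all users. To list only your own jobs you can filter by user name, e.g.:


  user@nod-mgmt:~$ squeue -u $USER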

Running a job

To run a job use:


  srun

Example of running the "hostname" command with 4 tasks (-n4) on the "gpus" partition, requesting the "gpu:Tesla:2" resource:

  
  user@nod-mgmt:~$ srun -n4 --partition=gpus --gres=gpu:Tesla:2 hostname
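
srun can also be used to start an interactive shell on a compute node. A minimal sketch (the requested resource amounts below are only examples - adjust them to your needs):


  user@nod-mgmt:~$ srun --partition=gpus --gres=gpu:Tesla:1 --cpus-per-task=4 --pty bash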

To submit a batch script to Slurm use:


  sbatch

Example:


 user@nod-mgmt:~$ sbatch my_batch_job.sh
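
A batch script is an ordinary shell script whose '#SBATCH' lines describe the resources to request. A minimal sketch of such a script (the job name, time limit and resource values below are placeholders - adjust them to your job):


  #!/bin/bash
  #SBATCH --job-name=my_job           # job name shown in squeue
  #SBATCH --partition=cpus            # 'cpus' or 'gpus'
  #SBATCH --ntasks=1                  # number of tasks to run
  #SBATCH --cpus-per-task=4           # CPU cores per task
  #SBATCH --time=01:00:00             # wall-time limit (HH:MM:SS)
  #SBATCH --output=my_job_%j.log      # output file, %j expands to the JOBID

  # the actual work; 'hostname' is just a placeholder command
  srun hostname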

Canceling a job

To cancel a scheduled or running job use:


  scancel

To cancel a job you have to know its JOBID, which you pass to scancel as a parameter.
Example:


 user@nod-mgmt:~$ scancel 19467

You can only cancel your own jobs.
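
scancel can also cancel several jobs at once; for example, to cancel all of your own jobs you can filter by user name:


  user@nod-mgmt:~$ scancel -u $USER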