SLURM

The cluster runs Slurm (formerly an acronym for Simple Linux Utility for Resource Management).
All jobs MUST be scheduled using Slurm!
For a quick start you need to know a few basic Slurm commands:
Listing nodes
To list basic information about all nodes, including their status, type:
user@nod-mgmt:~$ slurmls
Sample output:
PARTITION NODELIST CPUS(A/I/O/T) STATE MEMORY GRES REASON
cpus* nod-01 36/0/0/36 alloc 1 (null) none
cpus* nod-02 36/0/0/36 alloc 1 (null) none
cpus* nod-03 0/36/0/36 idle 1 (null) none
cpus* nod-04 36/0/0/36 alloc 1 (null) none
cpus* nod-05 0/36/0/36 idle 1 (null) none
cpus* nod-06 0/36/0/36 idle 1 (null) none
cpus* nod-07 0/36/0/36 idle 1 (null) none
cpus* nod-08 0/36/0/36 idle 1 (null) none
cpus* nod-09 0/36/0/36 idle 1 (null) none
cpus* nod-10 0/36/0/36 idle 1 (null) none
cpus* nod-11 0/36/0/36 idle 1 (null) none
cpus* nod-12 0/36/0/36 idle 1 (null) none
cpus* nod-13 0/36/0/36 idle 1 (null) none
cpus* nod-14 0/36/0/36 idle 1 (null) none
cpus* nod-15 0/36/0/36 idle 1 (null) none
cpus* nod-16 0/36/0/36 idle 1 (null) none
gpus nod-17 40/0/0/40 alloc 1 gpu:Tesla:2 none
gpus nod-18 0/40/0/40 idle 1 gpu:Tesla:2 none
gpus nod-19 36/4/0/40 mix 1 gpu:Tesla:2 none
gpus nod-20 40/0/0/40 alloc 1 gpu:Tesla:2 none
Explanation of the above output:
PARTITION | the cluster is logically divided into CPU nodes (CPU cores only) and GPU nodes (CPU & GPU cores). When scheduling a job you have to specify which partition you want to use - 'cpus' or 'gpus' |
NODELIST | the hostname of a computing node - 'nod-01' to 'nod-20' |
CPUS (A/I/O/T) | the number of CPUs on a node: A (allocated), I (idle), O (other) and T (total) |
STATE | may be: alloc - fully allocated node, idle - no resources allocated, mix - partially allocated (some CPUs in use, some idle), offline - disabled from scheduling by the administrator, down - failed |
MEMORY | the amount of memory available on a node |
GRES | Generic RESource - additional resources a node may have - in our case, the GPU nodes have 2 Tesla GPU cards each |
REASON | a comment (usually set by the administrator) |
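Note: slurmls is a site-specific convenience command, not part of a stock Slurm installation. On any Slurm cluster, similar per-node output can be produced with sinfo, for example:
user@nod-mgmt:~$ sinfo -N -O partition,nodelist,cpusstate,statecompact,memory,gres,reason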
Listing the queue
To list all scheduled jobs, type:
user@nod-mgmt:~$ squeue
Sample output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1894 gpus bash krider R 46-06:30:12 1 nod-17
1948 cpus bash jdoe R 16-18:50:01 1 nod-04
1966 gpus bash jdoe R 16-17:52:11 1 nod-19
1967 gpus bash jdoe R 16-17:51:49 1 nod-20
1977 cpus bash jdoe R 3-02:15:11 1 nod-01
1982 gpus bash jdoe R 23:15:26 1 nod-17
1983 cpus bash alincoln R 1:16:58 1 nod-02
Explanation:
JOBID | the identity number of a job |
PARTITION | the partition a job is scheduled on |
NAME | the name of the job |
USER | the user who scheduled the job |
ST | the status of a job; the most common states are: R - running, PD - pending (waiting in the queue) |
TIME | run time, in the format days-hours:minutes:seconds |
NODES | the number of nodes allocated to the job |
NODELIST(REASON) | the list of nodes involved in the job, or - for a pending job - the reason it is waiting |
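To show only your own jobs, filter the queue by user name:
user@nod-mgmt:~$ squeue -u $USER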
Running a job
srun
Example of running a job that requests 4 nodes from the "gpus" partition, with the generic resource "gpu:Tesla:2" on each node, running the "hostname" command:
user@nod-mgmt:~$ srun -N4 --partition=gpus --gres=gpu:Tesla:2 hostname
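srun can also start an interactive session on a compute node (the jobs named "bash" in the queue listing above are such sessions). A minimal sketch, assuming interactive jobs are permitted on this cluster:
user@nod-mgmt:~$ srun --partition=cpus --pty bash
This allocates resources on a 'cpus' node and opens a shell there; exit the shell to release the allocation.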
sbatch
To submit a batch script to Slurm, use:
Example:
user@nod-mgmt:~$ sbatch my_batch_job.sh
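If you are unsure what a batch script should contain, here is a minimal sketch of what my_batch_job.sh might look like; the job name, time limit and resource values are illustrative, not site defaults:
#!/bin/bash
#SBATCH --job-name=my_job        # job name shown in squeue
#SBATCH --partition=gpus         # 'cpus' or 'gpus'
#SBATCH --nodes=1                # number of nodes to allocate
#SBATCH --ntasks=1               # number of tasks to run
#SBATCH --gres=gpu:Tesla:2       # request both Tesla GPUs on the node
#SBATCH --time=01:00:00          # time limit (hh:mm:ss)
#SBATCH --output=my_job_%j.out   # output file; %j expands to the JOBID

hostname

sbatch returns immediately after queuing the job; output appears in the file given by --output once the job runs.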
Canceling a job
scancel
To cancel a job you have to know its JOBID, which you pass as a parameter.
Example:
user@nod-mgmt:~$ scancel 19467
You can only cancel your own jobs.
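The JOBID of your running jobs can be found with squeue (see above). To cancel all of your own jobs at once, scancel also accepts a user name:
user@nod-mgmt:~$ scancel -u $USER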