Action
To schedule research computing work using SLURM, follow the instructions below.
...
All jobs on the general purpose cluster request resources via SLURM. SLURM is open-source software that allocates resources to users for their computations, provides a framework for starting, executing, and monitoring compute jobs, and arbitrates contention for resources by managing a queue of pending work. SLURM is widely used in the high-performance computing (HPC) landscape, and you are likely to encounter it outside of our systems. For more information, please see https://slurm.schedmd.com/
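As a preview of what "requesting resources via SLURM" looks like in practice, here is a minimal batch-job script sketch. The resource values are illustrative placeholders, not site defaults:

```
#!/bin/bash
# Minimal SLURM batch script (sketch; all values are placeholders).
#SBATCH --job-name=example       # name shown in the queue
#SBATCH --partition=batch        # partition to submit to
#SBATCH --ntasks=1               # number of tasks (processes)
#SBATCH --cpus-per-task=4        # CPU cores for the task
#SBATCH --mem=8G                 # memory for the whole job
#SBATCH --time=01:00:00          # wall-clock limit (HH:MM:SS)

# Commands below run on the allocated compute node.
echo "Running on $(hostname) with $SLURM_CPUS_PER_TASK cores"
```

Submit it with `sbatch script.sh` and monitor it with `squeue -u $USER`.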
General Purpose Computing
...
```
$ sinfo -p batch -o "%n, %c, %m" | sort
HOSTNAMES, CPUS, MEMORY
rhea-01, 24, 64133
rhea-02, 24, 64133
rhea-03, 24, 64133
rhea-04, 32, 96411
rhea-05, 32, 96411
rhea-06, 32, 96411
rhea-07, 40, 128619
rhea-08, 40, 128619
rhea-09, 48, 257627
rhea-10, 48, 257566
uagc12-01, 12, 64166
uagc12-02, 12, 64166
uagc12-03, 12, 64166
uagc12-04, 12, 64166
uagc12-05, 32, 128703
uagc19-01, 20, 94956
uagc19-02, 20, 94956
uagc19-03, 20, 94956
uagc19-04, 20, 94956
uagc19-05, 20, 94956
uagc19-06, 20, 94956
uagc20-01, 64, 191716
uagc20-02, 64, 191716
uagc20-03, 64, 191716
uagc20-04, 64, 191716
uagc20-05, 64, 191716
uagc20-06, 64, 191716
uagc20-07, 64, 191716
uagc20-08, 64, 191716
uagc20-09, 64, 191716
uagc20-10, 64, 191716
uagc20-11, 64, 191716
uagc20-12, 64, 191716
uagc20-13, 64, 191716
uagc20-14, 64, 191716
uagc20-15, 64, 191716
uagc21-01, 80, 385236
uagc21-02, 80, 385236
uagc21-03, 80, 385236
uagc21-04, 80, 385236
uagc21-05, 80, 385236
uagc21-06, 80, 385236
uagc21-07, 80, 385236
uagc21-08, 80, 385236
uagc21-09, 80, 385234
uagc21-10, 80, 385234
uagc21-11, 80, 385234
uagc21-12, 80, 385234
```
Frequently asked questions
...
```
$ sinfo -p batch -o "%n, %a, %C, %e, %O"
HOSTNAMES, AVAIL, CPUS(A/I/O/T), FREE_MEM, CPU_LOAD
rhea-01, up, 1/23/0/24, 47457, 1.02
rhea-07, up, 8/32/0/40, 106761, 8.03
rhea-08, up, 8/32/0/40, 111833, 8.07
rhea-10, up, 8/40/0/48, 238471, 8.01
rhea-09, up, 48/0/0/48, 243033, 45.69
rhea-04, up, 32/0/0/32, 50843, 20.25
uagc19-01, up, 2/38/0/40, 93618, 0.01
uagc19-02, up, 12/28/0/40, 54256, 15.74
uagc19-03, up, 14/26/0/40, 67932, 15.87
uagc20-05, up, 32/32/0/64, 98158, 32.10
uagc20-06, up, 32/32/0/64, 97454, 32.03
uagc20-07, up, 2/62/0/64, 188180, 0.00
uagc20-08, up, 32/32/0/64, 94103, 32.10
uagc20-13, up, 24/40/0/64, 175916, 0.01
uagc20-10, up, 64/0/0/64, 9002, 64.30
uagc20-12, up, 64/0/0/64, 9312, 64.23
uagc19-04, up, 36/4/0/40, 75889, 16.60
uagc20-09, up, 40/24/0/64, 180208, 48.21
uagc21-06, up, 80/0/0/80, 339206, 25.01
uagc21-07, up, 80/0/0/80, 380445, 13.35
uagc21-08, up, 80/0/0/80, 339179, 25.00
uagc20-01, up, 0/64/0/64, 189369, 0.00
uagc20-02, up, 0/64/0/64, 189358, 0.00
uagc20-03, up, 0/64/0/64, 189372, 0.00
uagc20-04, up, 0/64/0/64, 189462, 0.00
uagc20-11, up, 0/64/0/64, 189304, 0.00
uagc20-14, up, 0/64/0/64, 189369, 0.00
uagc20-15, up, 0/64/0/64, 189345, 0.00
uagc21-01, up, 0/80/0/80, 278987, 1.47
uagc21-02, up, 0/80/0/80, 371335, 0.01
uagc21-03, up, 0/80/0/80, 381238, 0.00
uagc21-04, up, 0/80/0/80, 381046, 0.00
uagc21-05, up, 0/80/0/80, 283133, 4.02
uagc21-09, up, 0/80/0/80, 380172, 0.75
uagc21-10, up, 0/80/0/80, 380682, 0.20
uagc21-11, up, 0/80/0/80, 381268, 0.00
uagc21-12, up, 0/80/0/80, 303282, 5.27
```
Info: Note that %C reports CPUS as allocated/idle/other/total (A/I/O/T). In this example, rhea-09 has all of its cores allocated (48 out of 48) and is showing a CPU load of 45.69 (that is, roughly 46 cores are active), while uagc20-05 has only half of its cores allocated (32 of 64, CPU load 32.10). Many of the other nodes have lower utilization. We can use this information to make smart decisions about how many resources we request.
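One simple way to act on this output is to filter on the allocated-CPU count (the "A" in A/I/O/T). The snippet below is a sketch: in practice you would pipe the live output of `sinfo -p batch -h -o "%n, %C"` into the awk command; here, a few sample lines stand in for it.

```shell
# Print nodes whose allocated-CPU count is zero (fully idle nodes).
# The printf lines are stand-ins for live `sinfo -p batch -h -o "%n, %C"` output.
printf '%s\n' \
  'uagc20-01, 0/64/0/64' \
  'uagc20-05, 32/32/0/64' \
  'uagc21-03, 0/80/0/80' |
awk -F', ' '{ split($2, c, "/"); if (c[1] == 0) print $1 }'
# prints: uagc20-01
#         uagc21-03
```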
...
Info: batch has some important restrictions: a job can request at most 3 nodes and will run for at most 14 days before being automatically terminated. If you need an exception to this rule, please contact askIT@albany.edu
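For reference, a job-script header that stays inside these limits might look like the sketch below (the program name is a placeholder):

```
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --nodes=3              # at most 3 nodes on batch
#SBATCH --time=14-00:00:00     # at most 14 days, in days-HH:MM:SS form

srun ./my_program              # placeholder for your executable
```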
Request access to more nodes, or a longer time limit
On a case-by-case basis, ARCC will grant users temporary exceptions to the default job limits. Please contact askIT@albany.edu if you would like to request access to more nodes or a longer time limit.
...
Info: This job ran on rhea-09, and its peak memory usage was ~52 GB. Note that I requested 60000 MB, so I could refine this job to request slightly less memory. It ran for 14:50:14 and used about 350 CPU-hours.
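Figures like these can be pulled from SLURM's accounting database with sacct; the job ID 12345 below is a placeholder for your own job's ID:

```
# Show elapsed time, CPU time, and peak memory (MaxRSS) for a finished job.
sacct -j 12345 --format=JobID,Elapsed,TotalCPU,MaxRSS,ReqMem
```

If the seff utility is installed on your cluster, `seff 12345` prints a similar per-job efficiency summary.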
Restrict a job to a certain CPU architecture
Use the --constraint flag in your #SBATCH directives. To view the available features (such as CPU architecture) on individual nodes, use scontrol show node.
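Putting the two commands together, the sketch below shows the workflow. The feature name "intel" is hypothetical; use whichever tags scontrol actually reports for your nodes:

```
# 1) Inspect the feature tags a node advertises (the field is named
#    AvailableFeatures or Features, depending on SLURM version):
scontrol show node uagc20-01 | grep -i features

# 2) Then pin the job to nodes carrying a chosen tag in your job script:
#SBATCH --constraint=intel     # "intel" is a hypothetical feature name
```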
...