Action

To schedule research computing work using SLURM, follow the instructions below.

...

All jobs on the general purpose cluster request resources via SLURM. SLURM is open source software that allocates resources to users for their computations, provides a framework for starting, executing, and monitoring compute jobs, and arbitrates contention for resources by managing a queue of pending work. SLURM is widely used in high performance computing (HPC), and you are likely to encounter it outside of our systems. For more information, please see https://slurm.schedmd.com/

General Purpose Computing

...

Viewing available resources
-bash-4.2$ sinfo -p batch -o "%n, %a, %C, %e, %O" | sort
HOSTNAMES, AVAIL, CPUS(A/I/O/T), FREE_MEM, CPU_LOAD
uagc19-02, up, 12/28/0/40, 54240, 15.90
...
uagc20-10, up, 64/0/0/64, 90029, 64.30
...
uagc21-09, up, 0/80/0/80, 380172, 0.75
uagc21-10, up, 0/80/0/80, 380682, 0.20
...
Info
Note that %C reports CPUs as allocated/idle/other/total. In this example, uagc20-10 has all of its threads allocated (64 out of 64) and shows a CPU load of 64.30 (that is, about 64.30 threads are active), whereas many of the other nodes have lower utilization. We can use this information to make smart decisions about how many resources we request.
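As a rough sketch of how the CPUS(A/I/O/T) column can be used when sizing a request, the helper below filters sinfo output for nodes with at least a given number of idle CPUs. The function name nodes_with_idle_cpus is hypothetical, and the sketch assumes the exact "%n, %a, %C, %e, %O" output format shown above.

```shell
# Read sinfo output in the "%n, %a, %C, %e, %O" format on stdin and print
# "hostname idle_cpus" for every node with at least MIN_IDLE idle CPUs.
# nodes_with_idle_cpus is a hypothetical helper, not part of SLURM.
nodes_with_idle_cpus() {
    min_idle="${1:-1}"
    awk -F', ' -v min="$min_idle" '
        $1 == "HOSTNAMES" { next }          # skip the header row
        {
            split($3, cpus, "/")            # allocated/idle/other/total
            if (cpus[2] + 0 >= min + 0) print $1, cpus[2]
        }'
}

# On the cluster (not run here):
#   sinfo -p batch -o "%n, %a, %C, %e, %O" | sort | nodes_with_idle_cpus 32
```

With a threshold of 32, a node reporting 12/28/0/40 would be skipped and a node reporting 0/80/0/80 would be listed.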

...

Info

The batch partition has some important restrictions. A job can request at most 3 nodes and can run for at most 14 days before being automatically terminated. If you need an exception to this rule, please contact askIT@albany.edu.
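For illustration, a job script that stays within these limits might look like the sketch below. The file name example_job.sh and the srun hostname workload are placeholders, not part of the original instructions. Here the script is written out with a here-document so that the snippet can be tried on any machine.

```shell
# Write a job script that requests the maximum allowed on batch.
# example_job.sh and the workload line are placeholders.
cat > example_job.sh <<'EOF'
#!/bin/bash
# Run on the batch partition.
#SBATCH --partition=batch
# Request 3 nodes, the most a single batch job may use.
#SBATCH --nodes=3
# Time limit in days-hours:minutes:seconds; 14 days is the batch maximum.
#SBATCH --time=14-00:00:00

srun hostname
EOF

# On the cluster (not run here):
#   sbatch example_job.sh
```

Requesting less than the maximum where possible generally helps a job start sooner, since the scheduler has more places to fit it.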

Request access to more nodes, or a longer time limit

 On a case-by-case basis, ARCC can grant users temporary exceptions to the default job limits. Please contact askIT@albany.edu if you would like to request access to more nodes or a longer time limit.

...

Info

This job ran on rhea-09, and its maximum memory use was ~52 GB. Note that I requested 60000 MB, so I could refine this job to request slightly less memory. It ran for 14:50:14 and used about 350 CPU hours.
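The same figures can also guide how many CPUs to request. Assuming the 350 CPU hours figure is total CPU time, dividing it by the elapsed wall time gives the average number of busy CPUs over the run. The helper name average_busy_cpus below is hypothetical, and the calculation is only a rough guide.

```shell
# Divide CPU hours by elapsed wall time (HH:MM:SS) to estimate the
# average number of CPUs kept busy. average_busy_cpus is a hypothetical helper.
average_busy_cpus() {
    cpu_hours="$1"
    elapsed="$2"
    awk -v ch="$cpu_hours" -v el="$elapsed" 'BEGIN {
        split(el, t, ":")                              # hours, minutes, seconds
        wall_hours = t[1] + t[2] / 60 + t[3] / 3600
        printf "%.1f\n", ch / wall_hours
    }'
}

average_busy_cpus 350 14:50:14    # prints 23.6
```

If the job had requested many more CPUs than this average, a smaller CPU request may be worth trying on the next run.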

Restrict a job to a certain CPU architecture

 Use the --constraint flag in an #SBATCH directive. To view the architectures available on individual nodes, use scontrol show node.
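As a sketch of how the two commands fit together, the helper below pulls the feature list out of scontrol show node output. It assumes a SLURM version that prints an AvailableFeatures= field. The function name node_features and the feature name in the usage comments are placeholders, since the actual feature names are defined by the site.

```shell
# Print the AvailableFeatures value from "scontrol show node" text on stdin.
# node_features is a hypothetical helper, not part of SLURM.
node_features() {
    sed -n 's/.*AvailableFeatures=\([^ ]*\).*/\1/p'
}

# On the cluster (not run here):
#   scontrol show node uagc20-10 | node_features
#
# Then, in a job script, request one of the listed features
# (some_feature is a placeholder):
#   #SBATCH --constraint=some_feature
```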

...