...
...
...
...
...
...
...
...
...
...
...
...
...
Action
To schedule research computing work using SLURM, follow the instructions below.
Instructions
SLURM
All jobs on the general purpose cluster request resources via SLURM. SLURM is open-source software that allocates resources to users for their computations, provides a framework for starting, executing, and monitoring compute jobs, and arbitrates contention for resources by managing a queue of pending work. SLURM is widely used in high performance computing (HPC), and you are likely to encounter it outside of our systems. For more information, please see https://slurm.schedmd.com/
General Purpose Computing
All jobs on the general purpose cluster are submitted through the SLURM scheduler. For more information, please read the Frequently asked questions. Jobs can be submitted from the following headnodes:
- head.arcc.albany.edu
- headnode7.rit.albany.edu
Or from the large memory machine:
- lmm.
...
- its.albany.edu
Resource information
All users have access to the "batch" partition for general purpose computing.
Info |
---|
The batch partition comprises 1040 CPU cores (2080 threads) across 31 compute nodes. Note that a job can request at most 3 nodes and may be active for at most 14 days. If you need an exception to this, please contact askIT@albany.edu. |
Code Block | ||
---|---|---|
| ||
$ sinfo -p batch -o "%n, %c, %m" | sort
HOSTNAMES, CPUS, MEMORY
uagc19-01, 40, 95902
uagc19-02, 40, 95902
uagc19-03, 40, 95902
uagc19-04, 40, 95902
uagc20-01, 64, 191716
uagc20-02, 64, 191716
uagc20-03, 64, 191716
uagc20-04, 64, 191716
uagc20-05, 64, 191716
uagc20-06, 64, 191716
uagc20-07, 64, 191716
uagc20-08, 64, 191716
uagc20-09, 64, 191716
uagc20-10, 64, 191716
uagc20-11, 64, 191716
uagc20-12, 64, 191716
uagc20-13, 64, 191716
uagc20-14, 64, 191716
uagc20-15, 64, 191716
uagc21-01, 80, 385236
uagc21-02, 80, 385236
uagc21-03, 80, 385236
uagc21-04, 80, 385236
uagc21-05, 80, 385236
uagc21-06, 80, 385236
uagc21-07, 80, 385236
uagc21-08, 80, 385236
uagc21-09, 80, 385234
uagc21-10, 80, 385234
uagc21-11, 80, 385234
uagc21-12, 80, 385234 |
Frequently asked questions
SLURM documentation can be found at the SLURM website (https://slurm.schedmd.com). Below are answers to frequently asked questions that demonstrate several useful SLURM commands.
...
View the current status, or resources available, of batch nodes
...
sinfo is commonly used to view the status of a given cluster or node, or how many resources are available to schedule.
Code Block | ||||
---|---|---|---|---|
| ||||
$ sinfo -p batch -o "%n, %a, %C, %e, %O" | sort
HOSTNAMES, AVAIL, CPUS(A/I/O/T), FREE_MEM, CPU_LOAD
rhea-09, up, 0/0/96/96, N/A, N/A
rhea-10, up, 0/0/96/96, N/A, N/A
uagc19-01, up, 2/38/0/40, 93621, 0.00
uagc19-02, up, 12/28/0/40, 54240, 15.90
uagc19-03, up, 14/26/0/40, 67920, 17.12
uagc19-04, up, 36/4/0/40, 75889, 16.60
uagc20-01, up, 0/64/0/64, 189368, 0.00
uagc20-02, up, 0/64/0/64, 189359, 0.00
uagc20-03, up, 0/64/0/64, 189367, 0.00
uagc20-04, up, 0/64/0/64, 189461, 0.00
uagc20-05, up, 32/32/0/64, 98151, 32.09
uagc20-06, up, 32/32/0/64, 97446, 32.10
uagc20-07, up, 2/62/0/64, 188191, 0.00
uagc20-08, up, 32/32/0/64, 94065, 32.11
uagc20-09, up, 40/24/0/64, 180208, 48.21
uagc20-10, up, 64/0/0/64, 8985, 64.35
uagc20-11, up, 0/64/0/64, 189303, 0.00
uagc20-12, up, 64/0/0/64, 9337, 64.24
uagc20-13, up, 24/40/0/64, 176151, 0.00
uagc20-14, up, 0/64/0/64, 189364, 0.00
uagc20-15, up, 0/64/0/64, 189343, 0.00
uagc21-01, up, 80/0/0/80, 278987, 1.47
uagc21-02, up, 20/60/0/80, 371334, 0.50
uagc21-03, up, 0/80/0/80, 381238, 0.00
uagc21-04, up, 0/80/0/80, 381046, 0.00
uagc21-05, up, 0/80/0/80, 290550, 0.00
uagc21-06, up, 80/0/0/80, 339206, 25.01
uagc21-07, up, 80/0/0/80, 247070, 4.17
uagc21-08, up, 80/0/0/80, 338115, 25.03
uagc21-09, up, 0/80/0/80, 380172, 0.75
uagc21-10, up, 0/80/0/80, 380682, 0.20
uagc21-11, up, 0/80/0/80, 381265, 0.00
uagc21-12, up, 0/80/0/80, 260703, 0.00 |
Info |
---|
Note that %C reports CPUs as allocated/idle/other/total. In this example, uagc20-10 has all of its threads allocated (64 out of 64) and shows a CPU load of 64.35 (roughly 64 threads are busy), whereas many of the other nodes have lower utilization. We can use this information to make smart decisions about how many resources we request. |
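The allocated/idle/other/total field can also be filtered mechanically. The following is a minimal sketch, not part of the cluster tooling: the helper name nodes_with_idle is hypothetical, and it assumes sinfo was called with a format of "%n, %C" so that the second comma-separated column holds the CPU counts.

```shell
# nodes_with_idle MIN
# Reads sinfo output in the form "HOSTNAMES, CPUS(A/I/O/T)" on stdin and
# prints "host idle" for every node with at least MIN idle CPUs.
# (Hypothetical helper; assumes the header is the first line of input.)
nodes_with_idle() {
  awk -F', *' -v min="$1" '
    NR > 1 {                       # skip the header line
      split($2, cpus, "/")         # cpus[1]=allocated, cpus[2]=idle
      if (cpus[2] + 0 >= min + 0) print $1, cpus[2]
    }'
}
```

A possible invocation on a headnode would be `sinfo -p batch -o "%n, %C" | nodes_with_idle 32`, which lists only the nodes that could host a 32-CPU job right now.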
View jobs currently running, and waiting in queue
squeue will show jobs currently waiting in the queue or running, for all partitions that you have access to.
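Code Block | ||||
---|---|---|---|---|
| ||||
$ squeue
JOBID  PARTITION NAME     USER    ST TIME        NODES NODELIST(REASON)
140574 batch     g.slurm  [netid] PD 0:00        1     (Resources)
140486 batch     g.slurm  [netid] R  21:53:54    1     rhea-04
140290 batch     run.sh   [netid] R  2-19:09:35  1     rhea-01
140216 batch     shell1_5 [netid] R  3-08:48:18  1     rhea-09
135093 batch     g.sh     [netid] R  28-19:56:31 1     rhea-08
135087 batch     g.sh     [netid] R  28-20:43:49 1     rhea-10
135090 batch     g.sh     [netid] R  28-20:49:42 1     rhea-07 |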
Info |
---|
At the time this command was run, there were 7 jobs running or waiting in queue. JOBID 140574 is waiting in the queue due to inadequate available resources, while the other jobs have been running for a few days. |
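When the queue is long, a per-state tally is easier to read than the full listing. The sketch below assumes the default squeue column layout shown above, where the fifth column is the job state (ST); count_by_state is a hypothetical helper name.

```shell
# count_by_state
# Reads default squeue output on stdin and prints "STATE count" per job
# state (for example PD for pending, R for running), sorted by state.
# (Hypothetical helper; assumes ST is the fifth whitespace-separated column.)
count_by_state() {
  awk 'NR > 1 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}
```

For example, `squeue | count_by_state` summarizes every partition you can see, and `squeue -u "$USER" | count_by_state` restricts the tally to your own jobs.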
View the resources requested for an active job
scontrol show job [jobid] will generate a report with information about how a job was scheduled.
Info |
---|
Note that once a job is completed, this report can no longer be generated via scontrol. See How do I view the resources used by my job? for accessing similar information upon job completion. |
Code Block | ||||
---|---|---|---|---|
| ||||
$ scontrol show job ######
JobId=###### JobName=g.slurm
UserId=[netid](52639) GroupId=faculty(972) MCS_label=N/A
Priority=1 Nice=0 Account=rit QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=14-00:00:00 TimeMin=N/A
SubmitTime=2019-02-13T07:48:25 EligibleTime=2019-02-13T07:48:25
StartTime=2019-02-14T11:53:10 EndTime=2019-02-28T11:53:10 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-02-13T08:42:58
Partition=batch AllocNode:Sid=headnode7:86819
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1 NumCPUs=32 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:*
TRES=cpu=32,mem=87.50G,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=32 MinMemoryCPU=2800M MinTmpDiskNode=0
Features=avx2 DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/path/to/command/
WorkDir=/path/to/workdir/
StdErr=/path/to/stderr/
StdIn=/dev/null
StdOut=/path/to/stdout/
Power= |
Info |
---|
Here, the job was submitted at 2019-02-13T07:48:25 and requested 32 CPUs on one node with 87.5 GB of memory, with a constraint of Features=avx2. The relevant fields are NumNodes=1, NumCPUs=32, TRES=cpu=32,mem=87.50G,node=1, and Features=avx2. |
Maximum resources allowed
Code Block | ||
---|---|---|
| ||
$ scontrol show partition batch
PartitionName=batch
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=3 MaxTime=14-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=rhea-[01-10],uagc19-[01-06],uagc12-[01-05]
PriorityJobFactor=1 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=544 TotalNodes=21 SelectTypeParameters=NONE
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED |
Info |
---|
batch has some important restrictions. A job can request at most 3 nodes and will be automatically terminated after running for 14 days. If you need an exception to this rule, please contact askIT@albany.edu. |
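The limits can be pulled out of the scontrol report without reading the whole thing. This is a minimal sketch under the assumption that the report keeps the KEY=VALUE layout shown above; partition_field is a hypothetical helper name, and it only handles values that contain no spaces and no further "=" characters.

```shell
# partition_field KEY
# Reads `scontrol show partition` output on stdin and prints the value of
# the given KEY (for example MaxNodes or MaxTime).
# (Hypothetical helper; values containing spaces or "=" are not supported.)
partition_field() {
  tr ' ' '\n' | awk -F= -v key="$1" '$1 == key { print $2 }'
}
```

On a headnode, `scontrol show partition batch | partition_field MaxTime` would print the time limit, which is 14-00:00:00 in the report above.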
Request access to more nodes, or a longer time limit
On a case-by-case basis, ITS will grant users temporary access beyond the default job limits. Please contact askIT@albany.edu if you would like to request access to more nodes or a longer time limit.
Schedule a non-interactive job
There are many ways to schedule jobs via SLURM. For non-interactive jobs, we recommend using sbatch with a shell script that runs your program. The shell script uses #SBATCH directives to request the resources the program needs. Below is an example workflow that submits a Python script to the batch partition via sbatch.
First, ssh into head.arcc.albany.edu. On Windows, you can use an ssh client such as PuTTY; on macOS, simply use the terminal. Replace [netid] below with your username and type your password at the prompt. The password is not displayed as you type, but it is being entered.
Code Block language bash
$ ssh [netid]@head.arcc.albany.edu
Warning: Permanently added the ECDSA host key for IP address '169.226.65.82' to the list of known hosts.
[netid]@head.arcc.albany.edu's password:
Warning: No xauth data; using fake authentication data for X11 forwarding.
Last login: Wed Jan 30 13:49:20 2019 from lmm.its.albany.edu
================================================================================
This University at Albany computer system is reserved for authorized use only.
http://www.albany.edu/its/authorizeduse.htm
Headnodes:
head.arcc.albany.edu
headnode7.rit.albany.edu
headnode.rit.albany.edu - LEGACY SUPPORT
General Purpose Computing:
lmm.its.albany.edu - Large memory
x2go headnode:
eagle.arcc.albany.edu
Questions / Assistance - askIT@albany.edu
================================================================================
Next, change directories to /network/rit/misc/software/examples/slurm/
Code Block language bash
$ cd /network/rit/misc/software/examples/slurm/
/network/rit/misc/software/examples/slurm/run.sh contains #SBATCH directives that request the appropriate amount of resources for our Python code, then execute the code.
Code Block language bash
$ more run.sh
#!/bin/bash
#SBATCH -p batch
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=100
#SBATCH --mail-type=ALL
#SBATCH -o /network/rit/home/%u/example-slurm-%j.out
# Now, run the python script
/network/rit/misc/software/examples/slurm/simple_multiprocessing.py
Info --cpus-per-task=4 tells SLURM how many cores we want to allocate on one node
--mem-per-cpu=100 tells SLURM how much memory (in MB) to allocate per core (see also --mem)
In total, we are requesting 4 cores and 400 MB of memory for this simple Python code
To submit the job, we simply run sbatch run.sh. Keep note of the job ID that is printed to the terminal; it will be different from what is shown below.
Code Block language bash
$ sbatch run.sh
Submitted batch job 140584
Info Note that you can use squeue to view the job status
The job will write a file to your home directory called ~/example-slurm-[jobid].out. We will view it using the "more" command. You should see output similar to the following.
Code Block language bash
$ more ~/example-slurm-140584.out
USER [netid] was granted 4 cores and 100 MB per node on [hostname].
The job is current running with job # [jobid]
Process D waiting 3 seconds
Process D Finished.
Process C waiting 1 seconds
Process C Finished.
Process E waiting 4 seconds
Process E Finished.
Process A waiting 5 seconds
Process A Finished.
Process B waiting 2 seconds
Process B Finished.
Process F waiting 5 seconds
Process F Finished.
Congratulations, you just ran your first job on the cluster!
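To adapt this workflow to your own program, a small generator can help keep the #SBATCH header and the memory arithmetic consistent. The sketch below is an illustration only: make_job_script is a hypothetical helper, it reuses just the directives shown in run.sh, and the output path is the one from the example.

```shell
# make_job_script CPUS MEM_PER_CPU_MB COMMAND
# Prints a batch script to stdout that requests CPUS cores on one node and
# MEM_PER_CPU_MB megabytes per core, then runs COMMAND.
# (Hypothetical helper; directives mirror the example run.sh.)
make_job_script() {
  cat <<EOF
#!/bin/bash
#SBATCH -p batch
#SBATCH --cpus-per-task=$1
#SBATCH --mem-per-cpu=$2
#SBATCH -o /network/rit/home/%u/example-slurm-%j.out
# Total memory requested: $(( $1 * $2 )) MB
$3
EOF
}
```

For example, `make_job_script 4 100 /network/rit/misc/software/examples/slurm/simple_multiprocessing.py > my_job.sh` followed by `sbatch my_job.sh` would reproduce the request above (4 cores, 400 MB in total); my_job.sh is an arbitrary file name.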
Schedule an interactive job
An "interactive" job gives you a terminal on a compute node so that you can run commands interactively on the cluster. To achieve this, we will use srun. Interactive sessions are useful for debugging code or making sure that software compiles correctly.
First, ssh into head.arcc.albany.edu. On Windows, you can use an ssh client such as PuTTY; on macOS, simply use the terminal. Replace [netid] below with your username and type your password at the prompt. The password is not displayed as you type, but it is being entered.
Code Block language bash
$ ssh [netid]@head.arcc.albany.edu
Next, allocate resources on the cluster for your interactive session. We will request a session that lasts 1 hour, with 4 CPUs and 400 MB of memory. Note that your job number will be different.
Code Block language bash
$ srun --partition=batch --nodes=1 --time=01:00:00 --cpus-per-task=4 --mem=400 --pty $SHELL -i
$ hostname
uagc12-01.arcc.albany.edu
We are now running a terminal session on a specific node of the cluster. Notice that the hostname command printed a host other than head.arcc.albany.edu.
Code Block language bash
$ cd /network/rit/misc/software/examples/slurm/
$ ./simple_multiprocessing.py
USER ns742711 was granted 4 cores and None MB per node on uagc12-01.
The job is current running with job # 140590
Process D waiting 3 seconds
Process D Finished.
Process A waiting 5 seconds
Process A Finished.
Process C waiting 1 seconds
Process C Finished.
Process E waiting 4 seconds
Process E Finished.
Process B waiting 2 seconds
Process B Finished.
Process F waiting 5 seconds
Process F Finished.
When you are finished, type exit and then use scancel to relinquish the allocation.
Code Block
$ exit
$ scancel 140590
salloc: Job allocation 140590 has been revoked.
View the resources used by a completed job
sacct is useful to view accounting information on completed jobs. Read the documentation for all output fields.
Code Block | ||
---|---|---|
| ||
$ sacct -u ns742711 -j 139907 -o "Nodelist, JobID, AllocNodes, AllocTRES%30, MaxVMSize, MaxVMSizeTask, AveVMSize, TotalCPU, Elapsed"
NodeList JobID AllocNodes AllocTRES MaxVMSize MaxVMSizeTask AveVMSize TotalCPU Elapsed
--------------- ------------ ---------- ------------------------------ ---------- -------------- ---------- ---------- ----------
rhea-09 139907 1 cpu=24,mem=60000M,energy=1844+ 13-00:45:+ 14:50:14
rhea-09 139907.batch 1 cpu=24,mem=60000M,node=1 54764616K 0 54506520K 13-00:45:+ 14:50:1 |
Info |
---|
This job ran on rhea-09, and its maximum virtual memory size was ~52 GB. Note that I requested 60000 MB, so I could refine this job to request slightly less memory. It ran for 14:50:14 on 24 CPUs, which is about 350 CPU hours. |
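The figures quoted for this job can be derived from the sacct fields with a little arithmetic. The helpers below are a hypothetical sketch, assuming Elapsed is formatted as [D-]HH:MM:SS and memory values carry a trailing K (kibibytes), as in the output above.

```shell
# elapsed_to_seconds [D-]HH:MM:SS
# Converts a SLURM elapsed-time string to seconds. (Hypothetical helper.)
elapsed_to_seconds() {
  awk -v t="$1" 'BEGIN {
    days = 0
    if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
    split(t, p, ":")
    print days * 86400 + p[1] * 3600 + p[2] * 60 + p[3]
  }'
}

# cpu_hours ELAPSED CPUS
# Allocated CPU-hours: wall-clock time multiplied by the number of CPUs.
cpu_hours() {
  awk -v s="$(elapsed_to_seconds "$1")" -v c="$2" \
    'BEGIN { printf "%.0f\n", s * c / 3600 }'
}

# kib_to_gib VALUE
# Converts a value such as 54764616K to GiB with one decimal place.
kib_to_gib() {
  awk -v k="${1%K}" 'BEGIN { printf "%.1f\n", k / 1048576 }'
}
```

With the values from the example, `cpu_hours 14:50:14 24` gives 356 and `kib_to_gib 54764616K` gives 52.2, in line with the rounded figures in the note above.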
Restrict a job to a certain CPU architecture
Use the --constraint flag in an #SBATCH directive. To view the features (including the CPU architecture) available on an individual node, use scontrol show node.
Code Block | ||
---|---|---|
| ||
$ scontrol show node uagc19-06
NodeName=uagc19-06 Arch=x86_64 CoresPerSocket=10
CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=0.00
AvailableFeatures=intel,skylake,sse4_2,avx,avx2,avx512
ActiveFeatures=intel,skylake,sse4_2,avx,avx2,avx512
Gres=(null)
NodeAddr=uagc19-06.arcc.albany.edu NodeHostName=uagc19-06 Version=17.11
OS=Linux 4.14.35-1844.0.7.el7uek.x86_64 #2 SMP Wed Dec 12 19:48:02 PST 2018
RealMemory=94956 AllocMem=0 FreeMem=93582 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=4086 Weight=256 Owner=N/A MCS_label=N/A
Partitions=batch
BootTime=2019-02-11T10:15:23 SlurmdStartTime=2019-02-11T10:15:48
CfgTRES=cpu=20,mem=94956M,billing=20
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s |
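The AvailableFeatures line is the part that matters for --constraint. As a minimal sketch (node_features and has_feature are hypothetical helpers, and the KEY=VALUE layout is assumed to match the report above), the feature list can be extracted and tested like this:

```shell
# node_features
# Reads `scontrol show node` output on stdin and prints each entry of
# AvailableFeatures on its own line. (Hypothetical helper.)
node_features() {
  tr ' ' '\n' | awk -F= '$1 == "AvailableFeatures" {
    n = split($2, f, ",")
    for (i = 1; i <= n; i++) print f[i]
  }'
}

# has_feature FEATURE
# Succeeds if FEATURE is among the node's available features.
has_feature() {
  node_features | grep -qx "$1"
}
```

For example, `scontrol show node uagc19-06 | has_feature avx512 && echo yes` would confirm that a job submitted with `#SBATCH --constraint=avx512` can be placed on that node.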
Spawn on the InfiniBand nodes
To run on the InfiniBand nodes, add the directive --constraint=mpi_ib:
Code Block | ||
---|---|---|
| ||
srun --partition=batch --nodes=2 --constraint=mpi_ib --time=01:00:00 --cpus-per-task=4 --mem=400 --pty $SHELL -i |
OR
Code Block | ||
---|---|---|
| ||
#SBATCH --constraint=mpi_ib |
Allocate GPU resources
You can request access to the GPUs on --partition=ceashpc by adding the following flag:
--gres=gpu:1 # For half of the K80
--gres=gpu:2 # For the full K80
To request access to the A40s on the batch cluster for your research lab, please email askIT@albany.edu.
Once your group is added you can request access to the GPUs on --partition=batch-gpu by adding the following flag:
--gres=gpu:1 # For one of the A40s
--gres=gpu:2 # For two of the A40s
etc.
Run jupyter notebook on the cluster
There are two ways to spawn jupyter notebooks on the server:
- https://jupyterlab.its.albany.edu ; please see How-to: Using Jupyterhub for more information
If you need more resources, or longer than an eight hour time limit, you can run jupyter notebook interactively
First, ssh into head.arcc.albany.edu and run the command below, then enter a password at the prompt (note that the password is not displayed as you type, but it is being registered).
Code Block language bash /network/rit/misc/software/jupyterhub/miniconda3/bin/jupyter notebook password
Next, you can either run jupyter notebook interactively with srun, or you can submit the process via sbatch script located at /network/rit/misc/software/examples/slurm/spawn_jhub.sh (see below)
Spawning jupyter notebook interactively using ITS's anaconda (you may change the path to your own conda distribution)
Code Block language bash
srun --partition=batch --nodes=1 --time=01:00:00 --cpus-per-task=4 --mem=400 --pty $SHELL -i
unset XDG_RUNTIME_DIR
/network/rit/misc/software/jupyterhub/miniconda3/bin/jupyter notebook --no-browser --ip=0.0.0.0
You should see Jupyter output related to launching the server. Once it is complete, you should see output that looks like:
Code Block language bash
[I 08:31:49.694 NotebookApp] http://(uagc19-02.rit.albany.edu or 127.0.0.1):8889/
[I 08:31:49.694 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Open a web browser and navigate to the suggested location (in this example, uagc19-02.rit.albany.edu:8889), enter the configured password at the prompt, and you are all set!
- Spawning jupyter notebook via sbatch using ITS's anaconda (you may change the path to your own conda distribution):
ssh into head.its.albany.edu, copy the file below to your home directory, and submit the script with sbatch.
Code Block language bash
# Copy the file
cp /network/rit/misc/software/examples/slurm/spawn_jupyter.sh ~/spawn_jupyter.sh
# change the directory to the home directory
cd ~/
# submit the script
sbatch spawn_jupyter.sh
Info Note that you will want to edit the script to request the amount of resources that you need
This script will create an output file called juptyer.[jobid].log. Open up this file, replacing [jobid] with the allocation number you were given (you can get this by looking at squeue) and you will see output that looks like:
Code Block language bash
USER [netid] was granted 1 cores and MB per node on uagc12-02. The job is current running with job #144168.\n
[I 10:06:31.758 NotebookApp] JupyterLab extension loaded from /network/rit/misc/software/jupyterhub/miniconda3/lib/python3.6/site-packages/jupyterlab
[I 10:06:31.758 NotebookApp] JupyterLab application directory is /network/rit/misc/software/jupyterhub/miniconda3/share/jupyter/lab
[I 10:06:31.779 NotebookApp] Serving notebooks from local directory: /network/rit/home/[netid]
[I 10:06:31.779 NotebookApp] The Jupyter Notebook is running at:
[I 10:06:31.780 NotebookApp] http://(uagc12-02.arcc.albany.edu or 127.0.0.1):8888/
[I 10:06:31.780 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Open a web browser and point it to the location noted in the second-to-last line (in the above example, http://uagc12-02.arcc.albany.edu:8888), enter your password, and you are all set!
...