Action
To schedule research computing work using SLURM, follow the instructions below.
...
All jobs on the general purpose cluster request resources via SLURM. SLURM is open-source software that allocates resources to users for their computations, provides a framework for starting, executing, and monitoring compute jobs, and arbitrates contention for resources by managing a queue of pending work. SLURM is widely used in the high performance computing (HPC) landscape, and it is likely you will encounter it outside of our systems. For more information, please see https://slurm.schedmd.com/
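As a quick orientation once you are logged into a headnode, the standard SLURM commands below list the partitions (queues) the scheduler manages and the current queue of pending and running work; the exact partitions and output will differ on our systems.
Code Block language bash
# List partitions (queues) and the state of their nodes
$ sinfo
# Show all pending and running jobs in the queue
$ squeue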
General Purpose Computing
...
Or from the large memory machine:
- lmm.ritits.albany.edu
Resource information
...
Info batch has some important restrictions. A job can only request 3 nodes and will run for at most 14 days before being automatically terminated. If you need an exception to this rule, please contact askIT@albany.edu
Request access to more nodes or a longer time limit
On a case-by-case basis, ITS will grant users temporary access beyond the default job limits. Please contact askIT@albany.edu if you would like to request access to more nodes or a longer time limit.
...
First, SSH into head.arcc.albany.edu. On Windows, you can use an SSH client such as PuTTY; on a Mac, simply use the Terminal. Replace [netid] below with your username and type your password at the prompt. You will not see your password as you type, but it is being entered.
Code Block language bash
$ ssh [netid]@head.arcc.albany.edu
Warning: Permanently added the ECDSA host key for IP address '169.226.65.82' to the list of known hosts.
[netid]@head.arcc.albany.edu's password:
Warning: No xauth data; using fake authentication data for X11 forwarding.
Last login: Wed Jan 30 13:49:20 2019 from lmm.ritits.albany.edu
================================================================================
This University at Albany computer system is reserved for authorized use only.
http://www.albany.edu/its/authorizeduse.htm

Headnodes:
head.arcc.albany.edu
headnode7.rit.albany.edu
headnode.rit.albany.edu - LEGACY SUPPORT

General Purpose Computing:
lmm.ritits.albany.edu - Large memory

x2go headnode:
eagle.arcc.albany.edu

Questions / Assistance - askIT@albany.edu
================================================================================
Next, change directories to /network/rit/misc/software/examples/slurm/
Code Block language bash
$ cd /network/rit/misc/software/examples/slurm/
/network/rit/misc/software/examples/slurm/run.sh contains #SBATCH directives that request the appropriate amount of resources for our Python code, then executes the code.
Code Block language bash
$ more run.sh
#!/bin/bash
#SBATCH -p batch
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=100
#SBATCH --mail-type=ALL
#SBATCH -o /network/rit/home/%u/example-slurm-%j.out

# Now, run the python script
/network/rit/misc/software/examples/slurm/simple_multiprocessing.py
Info --cpus-per-task=4 tells SLURM how many cores we want to allocate on one node
--mem-per-cpu=100 tells SLURM how much memory (in MB) to allocate per core (see also --mem)
In total, we are requesting 4 cores and 400 MB of memory for this simple Python code
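For reference, the same total request could also be expressed with --mem, which specifies memory per node rather than per core. This is a sketch of the alternative directives, not part of the example run.sh:
Code Block language bash
#SBATCH --cpus-per-task=4
# 400 MB total for the job on this node, instead of 100 MB per core
#SBATCH --mem=400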
To submit the job, we simply run sbatch run.sh. Keep note of the Job ID that is output to the terminal; it will be different from what is shown below.
Code Block language bash
$ sbatch run.sh
Submitted batch job 140584
Info Note that you can use squeue to view the job status
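For example, to list only your own jobs (replace [netid] with your username; the columns shown may vary slightly):
Code Block language bash
$ squeue -u [netid]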
The job will output a file to your home directory called ~/example-slurm-[jobid].out. We will view it using the "more" command. You should see output similar to below.
Code Block language bash
$ more ~/example-slurm-140584.out
USER [netid] was granted 4 cores and 100 MB per node on [hostname].
The job is current running with job # [jobid]
Process D waiting 3 seconds
Process D Finished.
Process C waiting 1 seconds
Process C Finished.
Process E waiting 4 seconds
Process E Finished.
Process A waiting 5 seconds
Process A Finished.
Process B waiting 2 seconds
Process B Finished.
Process F waiting 5 seconds
Process F Finished.
- Congratulations, you just ran your first job on the cluster!
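If you ever need to stop a job before it finishes (for example, one submitted with the wrong resources), the standard SLURM command for that is scancel; substitute the Job ID reported by sbatch or squeue:
Code Block language bash
$ scancel [jobid]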
...
To spawn a terminal session on a cluster node, with X11 forwarding, run:
Code Block language bash
srun --partition=batch --nodes=1 --time=01:00:00 --cpus-per-task=4 --mem=400 --x11 --pty $SHELL -i
This will spawn a one-hour (01:00:00) session with 4 CPUs and 400 MB of RAM. To spawn the same terminal session without X11 forwarding:
Code Block language bash
srun --partition=batch --nodes=1 --time=01:00:00 --cpus-per-task=4 --mem=400 --pty $SHELL -i
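Once the interactive shell starts, you are on a compute node rather than the headnode. A quick, optional sanity check (SLURM sets these environment variables inside the job shell):
Code Block language bash
# Which node am I on?
$ hostname
# Which job allocation is this, and how many CPUs were granted?
$ echo $SLURM_JOB_ID $SLURM_CPUS_PER_TASK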
View the resources used by a completed job
...
Info This job ran on rhea-09, and its maximum memory usage was ~52 GB. Note that I requested 60000 MB, so I could refine this job to request slightly less memory. It ran for 14:50:14 and used about 350 CPU hours.
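The exact command used to gather those numbers is not shown here, but SLURM's accounting tool sacct can report this kind of information for a completed job. A minimal sketch (substitute your own Job ID; the fields available depend on the cluster's accounting configuration):
Code Block language bash
$ sacct -j [jobid] --format=JobID,JobName,Partition,AllocCPUS,Elapsed,TotalCPU,MaxRSS,State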
Restrict a job to a certain CPU architecture
Use the --constraint flag in #SBATCH. To view the available architectures (features) on individual nodes, use scontrol show node
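For example, to list the features advertised by a particular node ([nodename] is a placeholder; use a node name from the cluster):
Code Block language bash
$ scontrol show node [nodename] | grep -i features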
...
Code Block language bash
srun --partition=batch --nodes=2 --constraint=mpi_ib --time=01:00:00 --cpus-per-task=4 --mem=400 --x11 --pty $SHELL -i
OR
Code Block language bash
#SBATCH --constraint=mpi_ib
...
There are two ways to spawn Jupyter notebooks on the server:
- https://jupyterlab.arccits.albany.edu; please see How-to: Using Jupyterhub for more information
- If you need more resources, or longer than the eight-hour time limit, you can run Jupyter Notebook interactively (see the steps below)
First, SSH into head.arcc.albany.edu and run the command below; then enter a password at the prompt (note that you will not see your password as you type, but it is being registered)
Code Block language bash
/network/rit/misc/software/jupyterhub/miniconda3/bin/jupyter notebook password
Next, you can either run Jupyter Notebook interactively with srun, or you can submit the process via the sbatch script located at /network/rit/misc/software/examples/slurm/spawn_jhub.sh (see below)
Spawning Jupyter Notebook interactively using ITS's Anaconda (you may change the path to your own conda distribution):
Code Block language bash
srun --partition=batch --nodes=1 --time=01:00:00 --cpus-per-task=4 --mem=400 --pty $SHELL -i
unset XDG_RUNTIME_DIR
/network/rit/misc/software/jupyterhub/miniconda3/bin/jupyter notebook --no-browser --ip=0.0.0.0
You should see Jupyter output related to launching the server. Once startup is complete, you should see output that looks like:
Code Block language bash
[I 08:31:49.694 NotebookApp] http://(uagc19-02.rit.albany.edu or 127.0.0.1):8889/
[I 08:31:49.694 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Open up a web browser and navigate to the suggested location; in this example we would navigate to uagc19-02.rit.albany.edu:8889. Enter the configured password at the prompt and you are all set!
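If your browser cannot reach the compute node directly (for example, from off campus), one common workaround is an SSH tunnel through the headnode. This is only a sketch, with the node name and port taken from the example output above:
Code Block language bash
# Forward local port 8889 to the notebook on the compute node, then browse to http://localhost:8889
$ ssh -L 8889:uagc19-02.rit.albany.edu:8889 [netid]@head.arcc.albany.edu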
- Spawning Jupyter Notebook via sbatch using ITS's Anaconda (you may change the path to your own conda distribution):
SSH into head.arccits.albany.edu, copy the file below to your home directory, and submit the script with sbatch.
Code Block language bash
# Copy the file
cp /network/rit/misc/software/examples/slurm/spawn_jupyter.sh ~/spawn_jupyter.sh
# change the directory to the home directory
cd ~/
# submit the script
sbatch spawn_jupyter.sh
Info Note that you will want to edit the script to request the amount of resources that you need
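The script's actual contents live at the path above and may differ from this, but as a rough, hypothetical sketch of the shape such a submission script takes (the resource values are placeholders, and the last two commands mirror the interactive example earlier on this page):
Code Block language bash
#!/bin/bash
# Hypothetical sketch only -- see the real spawn_jupyter.sh on the cluster for the actual settings
#SBATCH -p batch
#SBATCH --cpus-per-task=1
#SBATCH --mem=4000
#SBATCH -o jupyter.%j.log
unset XDG_RUNTIME_DIR
/network/rit/misc/software/jupyterhub/miniconda3/bin/jupyter notebook --no-browser --ip=0.0.0.0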
This script will create an output file called jupyter.[jobid].log. Open this file, replacing [jobid] with the allocation number you were given (you can get this by looking at squeue), and you will see output that looks like:
Code Block language bash
USER [netid] was granted 1 cores and MB per node on uagc12-02. The job is current running with job #144168.\n
[I 10:06:31.758 NotebookApp] JupyterLab extension loaded from /network/rit/misc/software/jupyterhub/miniconda3/lib/python3.6/site-packages/jupyterlab
[I 10:06:31.758 NotebookApp] JupyterLab application directory is /network/rit/misc/software/jupyterhub/miniconda3/share/jupyter/lab
[I 10:06:31.779 NotebookApp] Serving notebooks from local directory: /network/rit/home/[netid]
[I 10:06:31.779 NotebookApp] The Jupyter Notebook is running at:
[I 10:06:31.780 NotebookApp] http://(uagc12-02.arcc.albany.edu or 127.0.0.1):8888/
[I 10:06:31.780 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Open up a web browser and point it to the location noted in the second-to-last line; in the above example that is http://uagc12-02.arcc.albany.edu:8888. Enter your password and you are all set!
...