Action

To schedule research computing work using SLURM, follow the instructions below.

...

All jobs on the general purpose cluster request resources via SLURM. SLURM is open-source software that allocates resources to users for their computations, provides a framework for starting, executing, and monitoring compute jobs, and arbitrates contention for resources by managing a queue of pending work. SLURM is widely used in the high performance computing (HPC) landscape, and it is likely you will encounter it outside of our systems. For more information, please see https://slurm.schedmd.com/
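SLURM's day-to-day interface is a small set of commands for submitting, inspecting, and cancelling work. The sketch below is a quick reference, assuming a job script named run.sh like the one used later on this page; the job ID is a placeholder.

Code Block
languagebash
# Submit a job script to the scheduler; SLURM prints the assigned job ID
$ sbatch run.sh

# List your own pending and running jobs
$ squeue -u $USER

# Show partitions (queues) and the state of their nodes
$ sinfo

# Cancel a job by its ID
$ scancel 140584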

General Purpose Computing

...

Info

batch has some important restrictions. A job can request at most 3 nodes and will run for at most 14 days before being automatically terminated. If you need an exception to this rule, please contact askIT@albany.edu

Request access to more nodes or a longer time limit

On a case-by-case basis, ITS will grant users temporary access to more than the default job limits. Please contact askIT@albany.edu if you would like to request access to more nodes or a longer time limit.
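Within the default limits, the node count and wall time for a job are requested in the job script itself. Below is a minimal sketch, assuming the batch partition described above; the values are placeholders you would tune to your own work.

Code Block
languagebash
#!/bin/bash
#SBATCH -p batch                 # partition (queue)
#SBATCH --nodes=2                # number of nodes (up to 3 by default)
#SBATCH --time=7-00:00:00        # wall time as days-hours:min:sec (up to 14 days by default)
#SBATCH --ntasks-per-node=1      # one task per node

# Replace with your actual workload
srun hostname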

...

  1. First, ssh into head.arcc.albany.edu. On Windows, you can use an SSH client such as PuTTY; on a Mac, simply use the Terminal. Replace [netid] below with your username and type in your password at the prompt. You will not see your password as you type, but it is being entered.

    Code Block
    languagebash
    $ ssh [netid]@head.arcc.albany.edu
    
    Warning: Permanently added the ECDSA host key for IP address '169.226.65.82' to the list of known hosts.
    [netid]@head.arcc.albany.edu's password:
     
    Warning: No xauth data; using fake authentication data for X11 forwarding.
    Last login: Wed Jan 30 13:49:20 2019 from lmm.ritits.albany.edu
    ================================================================================
     This University at Albany computer system is reserved for authorized use only.
                  http://www.albany.edu/its/authorizeduse.htm
    Headnodes:
     head.arcc.albany.edu
     headnode7.rit.albany.edu
     headnode.rit.albany.edu - LEGACY SUPPORT
    General Purpose Computing:
     lmm.ritits.albany.edu - Large memory
    x2go headnode:
     eagle.arcc.albany.edu
       Questions / Assistance - askIT@albany.edu
    ================================================================================
  2. Next, change directories to /network/rit/misc/software/examples/slurm/

    Code Block
    languagebash
    $ cd /network/rit/misc/software/examples/slurm/
  3. /network/rit/misc/software/examples/slurm/run.sh contains #SBATCH directives that request the appropriate resources for our Python code and then execute it.

    Code Block
    languagebash
    $ more run.sh
    
    #!/bin/bash
    #SBATCH -p batch
    #SBATCH --cpus-per-task=4
    #SBATCH --mem-per-cpu=100
    #SBATCH --mail-type=ALL
    #SBATCH -o /network/rit/home/%u/example-slurm-%j.out
    
    # Now, run the python script
    /network/rit/misc/software/examples/slurm/simple_multiprocessing.py
    Info
    --cpus-per-task=4 tells SLURM how many cores we want to allocate on one node
    --mem-per-cpu=100 tells SLURM how much memory (in MB) to allocate per core (see also --mem)
    In total, we are requesting 4 cores and 400 MB of memory for this simple Python code
  4. To submit the job, we simply run sbatch run.sh. Keep note of the Job ID that is output to the terminal; it will be different from what is shown below.

    Code Block
    languagebash
    $ sbatch run.sh
    Submitted batch job 140584
    Info

    Note that you can use squeue to view the job status (see the monitoring example after these steps)

  5. The job will output a file to your home directory called ~/example-slurm-[jobid].out. We will view it using the "more" command. You should see output similar to the following.

    Code Block
    languagebash
    $ more ~/example-slurm-140584.out
    USER [netid] was granted 4 cores and 100 MB per node on [hostname].
    The job is current running with job # [jobid]
     Process D waiting 3 seconds
     Process D Finished.
     Process C waiting 1 seconds
     Process C Finished.
     Process E waiting 4 seconds
     Process E Finished.
     Process A waiting 5 seconds
     Process A Finished.
     Process B waiting 2 seconds
     Process B Finished.
     Process F waiting 5 seconds
     Process F Finished.
  6. Congratulations, you just ran your first job on the cluster!
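While the job is queued or running, you can keep an eye on it from the head node. A brief sketch, using the job ID from the submission step as a placeholder:

Code Block
languagebash
# Show only your own jobs, with state, partition, run time, and node list
$ squeue -u $USER

# Show the full scheduler record for a single job
$ scontrol show job 140584

# Cancel the job if something has gone wrong
$ scancel 140584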

...

Info

This job ran on rhea-09, and its maximum memory usage was ~52 GB. Note that I requested 60000 MB, so I could refine this job to request slightly less memory. It ran for 14:50:14 and used about 350 CPU hours.
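Figures like these can be pulled from SLURM's accounting records once a job has finished. A minimal sketch; the job ID is a placeholder, and the seff utility may or may not be installed on a given cluster.

Code Block
languagebash
# Summarize elapsed time, total CPU time, and peak memory for a finished job
$ sacct -j 140584 --format=JobID,Elapsed,TotalCPU,MaxRSS,State

# If seff is available, it prints a short CPU and memory efficiency report
$ seff 140584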

Restrict a job to a certain CPU architecture

Use the --constraint flag with #SBATCH. To see the available architectures (features) on individual nodes, use scontrol show node.
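As a sketch, the feature name below (avx2) is only an illustration; run scontrol show node on our systems to see which feature tags the nodes actually advertise.

Code Block
languagebash
# List each node's advertised features
$ scontrol show node | grep -i features

# In a job script, request only nodes that advertise a given feature
#SBATCH --constraint=avx2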

...