DGX On-Prem How-To
This article details how to launch a Jupyter Notebook session on the on-prem DGX systems. The DGX cluster in our data center has 24 nodes with 8 A100 GPUs each, for a total of 192 A100 GPUs. Because free resources must be shared, each user is limited to 1 running job and 4 queued jobs at a time, with a maximum of 4 GPUs per job. All jobs have a maximum runtime of 8 hours. If you plan to train a model for longer than 8 hours, be sure to create model checkpoint files within your code so you can resume training in a new job.
API Key
To generate your API key from Nvidia, follow the instructions here. Once your key is generated, log into dgx-head01.its.albany.edu and load the SLURM module:
dgx-head01.its.albany.edu
module load slurm
Enroot & Container Setup
In order to use containers with SLURM, you’ll need to set up Enroot. SSH into the DGX head node using PuTTY or your terminal of choice; instructions can be found here. The address is ‘dgx-head01.its.albany.edu’. Remember not to run jobs directly on the head node. Once logged in, navigate to your home directory, then into your .config directory, and create an enroot directory:
cd
cd ~/.config
mkdir enroot
cd enroot
From inside your ~/.config/enroot directory, create a .credentials file and format it as such:
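Enroot credentials use netrc-style machine/login/password entries; for the NGC registry at nvcr.io the file should look something like this (APIKEY is a literal placeholder for now):

machine nvcr.io login $oauthtoken password APIKEY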
Replace ‘APIKEY’ with the API key you retrieved earlier. This allows Enroot to use your credentials to pull and run container images from nvcr.io. Leave $oauthtoken as-is, since your login is authenticated via your API key. If you have containers in other repositories, you can format your .credentials file as such:
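Each registry gets its own machine line. In this sketch the second registry and its credentials are purely placeholders:

machine nvcr.io login $oauthtoken password APIKEY
machine registry.example.com login myusername password mypassword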
With this file set up, you are now ready to interact with containers for on-prem jobs.
Conda & Custom Kernels
If you would like to use conda and custom kernels with local software instead of a container, you can follow the instructions here to create your custom environment.
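As a rough sketch (the environment name and Python version below are just examples; the linked instructions are authoritative), creating an environment and registering it as a Jupyter kernel generally looks like this:

conda create -n myenv python=3.10
conda activate myenv
# ipykernel is needed to register the environment as a Jupyter kernel
pip install ipykernel
python -m ipykernel install --user --name myenv --display-name "Python (myenv)"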
SBATCH Script & Jupyter Notebook
The DGX On-Prem system uses SLURM to schedule jobs. This is the same scheduler as our other HPC resources, so you can learn more about SLURM scheduling on this page. For more information on how to run a local jupyter notebook using conda, click here. To create a custom kernel, see the section above.
Before anything else, make sure that SLURM is loaded. If the command ‘squeue’ returns an error, you can load the SLURM module by doing:
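module load slurm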
The following SBATCH script pulls a PyTorch container from nvcr.io and prints the torch version to your output log, which will be created in the directory you submitted the job from as a slurm-##.out file. To create your SBATCH script, open a new file in your editor of choice and enter the following:
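This is a minimal sketch; the job name, resource values (1 GPU, 80 GB of memory, a 1-hour limit), and the example file name pytorch-version.sh are assumptions to adjust for your own work. The --container-image value uses the Pyxis registry#image:tag format.

#!/bin/bash
#SBATCH --job-name=pytorch-version     # name shown in squeue
#SBATCH --gres=gpu:1                   # request 1 GPU
#SBATCH --mem=80G                      # roughly 80 GB per GPU requested
#SBATCH --time=01:00:00                # 1-hour limit for this quick test

# Pull the PyTorch 21.12 container from nvcr.io and print the torch version
srun --container-image=nvcr.io#nvidia/pytorch:21.12-py3 \
     python -c "import torch; print(torch.__version__)"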
This will create a job using a PyTorch container, version 21.12. To use other containers, you can use the same script but change the container image as needed. Containers can be found in the Nvidia catalog.
To submit the job, run the following:
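Assuming you saved the script as pytorch-version.sh (any name works):

sbatch pytorch-version.sh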
To start a Jupyter Notebook server within the container, you can use the following:
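Here is a sketch of a notebook job using the same PyTorch container. The lab directory path, mount target, and port are placeholders, and the Jupyter invocation assumes the container ships with Jupyter (the NGC PyTorch images do):

#!/bin/bash
#SBATCH --job-name=jupyter
#SBATCH --gres=gpu:1                   # request 1 GPU
#SBATCH --mem=80G
#SBATCH --time=08:00:00                # maximum runtime

# Start Jupyter inside the container, mounting your lab directory as the working area.
# Replace the mount source with your actual lab directory path.
srun --container-image=nvcr.io#nvidia/pytorch:21.12-py3 \
     --container-mounts=/path/to/your/lab_directory:/workspace/lab \
     jupyter notebook --no-browser --ip=0.0.0.0 --port=8888

The notebook URL and token will appear in the slurm-##.out file. To reach the notebook from your workstation you will typically need an SSH tunnel through the head node to the compute node your job landed on, for example ssh -L 8888:NODENAME:8888 youruser@dgx-head01.its.albany.edu, where NODENAME comes from squeue.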
As usual, you can check whether your job was submitted by looking for an output file, or by invoking ‘squeue’ to see if your job is in the queue. You can use ‘scancel JOBID’ to cancel a job you have submitted. Jobs that pull containers may take 5-10 minutes to spin up, since the container image must be pulled from the registry; this wait time scales with the size of the container. Containers on the DGX On-Prem system are run via Pyxis and Enroot. Usage documentation can be found here.
Be careful about mounting your home directory. AI-related work uses a lot of storage, so it is important that you mount your lab directory and work from there. Don’t attempt to put large AI models in your 10 GB home directory; operate from your 10 TB lab directory instead. You can mount multiple directories by separating them with commas, as shown below. Be sure to terminate your Jupyter Notebook session once you are done.
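For example, a pair of mounts in the srun line above could look like this (both source paths are placeholders for your own directories):

srun --container-image=nvcr.io#nvidia/pytorch:21.12-py3 \
     --container-mounts=/path/to/lab_directory:/workspace/lab,/path/to/datasets:/workspace/data \
     jupyter notebook --no-browser --ip=0.0.0.0 --port=8888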
Interactive Jobs
For interactive jobs, you can use ‘srun’ to log directly into a node with a simple command:
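In its simplest form (add --gres if you need GPUs in the interactive session):

srun --pty $SHELL -i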
To be more specific with which node or how many nodes to spawn on, you can specify with additional flags:
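For example (the node name dgx01 is a placeholder; check sinfo for actual node names, and adjust the resource values as discussed below):

srun --nodes=1 --nodelist=dgx01 --gres=gpu:2 --mem=160G --time=02:00:00 --pty $SHELL -i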
If requesting 2 GPUs, add --gres=gpu:2. Each node has 8 GPUs total. Each A100 GPU has 80 GB of VRAM, so set your --mem flag to roughly 80 GB times the number of GPUs requested. For 2 GPUs this would be --mem=160G; for 12 GPUs it works out to 960 GB; and for 1 full node (8 GPUs) it amounts to --mem=640G.
The --pty $SHELL -i flags tell srun to spawn an interactive shell on the node.
Job Restrictions
Jobs are limited to 1 running job per user, with up to 4 jobs waiting in the queue. Jobs can run for up to 1 week (--time=07-00:00:00) and use up to 4 GPUs (--gres=gpu:4). If you require more GPUs, please reach out to askIT@albany.edu regarding purchase information for additional GPUs or priority queues.