DGX On-Prem How-To

This article will detail how launch a jupyter notebook session on DGX systems that are located on-prem. The DGX cluster in our data center has 24 nodes that house 8 A100 GPUs each, for a total of 192 A100 GPUs. Running jobs on the DGX Cluster is limited to a maximum of 8 GPUs at once, as free resources must be shared. Any jobs beyond this limited will be queued and must wait until your previous resources are freed up. All jobs have a maximum runtime of 8 hours. If you are planning on training a model for longer than 8 hours, please be sure to create model checkpoint files within your code to resume training in a new job.

API Key

To generate your API key from Nvidia, follow the instructions here. Once your key is generated, log into dgx-head01.its.albany.edu and module load slurm.

dgx-head01.its.albany.edu

module load slurm

Enroot & Container Setup

In order to use containers with SLURM, you’ll need to set up Enroot. Navigate to your home directory and then your .config directory. SSH into the DGX Headnode using PuTTY or your terminal of choice. Instructions can be found here. The address is ‘dgx-head01.its.albany.edu’. Remember to not run jobs directly on the head node.

cd
cd ~/.config
mkdir enroot
cd enroot

From inside your /enroot/ directory, create a .credentials file and format it as such.

nano .credentials

# NVIDIA GPU Cloud (both endpoints are required)
machine nvcr.io login $oauthtoken password APIKEY
machine authn.nvidia.com login $oauthtoken password APIKEY

Replace the ‘APIKEY’ with your API key you retrieved earlier. This will allow enroot to use your credentials in order to pull and run container images from nvcr.io. Leave $oauthtoken as-is, as your login is authenticated via your API key. If you have containers in other repos, you can format your .credentials file as such:

# DockerHub
machine auth.docker.io login <login> password <password>

# Google Container Registry with OAuth
machine gcr.io login oauth2accesstoken password $(gcloud auth print-access-token)
# Google Container Registry with JSON
machine gcr.io login _json_key password $(jq -c '.' $GOOGLE_APPLICATION_CREDENTIALS | sed 's/ /\\u0020/g')

# Amazon Elastic Container Registry
machine 12345.dkr.ecr.eu-west-2.amazonaws.com login AWS password $(aws ecr get-login-password --region eu-west-2)

# Azure Container Registry with ACR refresh token
machine myregistry.azurecr.io login 00000000-0000-0000-0000-000000000000 password $(az acr login --name myregistry --expose-token --query accessToken  | tr -d '"')
# Azure Container Registry with ACR admin user
machine myregistry.azurecr.io login myregistry password $(az acr credential show --name myregistry --subscription mysub --query passwords[0].value | tr -d '"')

With this file set up, you are now ready to interact with containers for on-prem jobs.

Conda & Custom Kernels

If you would like to use conda and use custom kernels with local software instead of a container, you can follow the instructions here to create your custom environment.

SBATCH Script & Jupyter Notebook

The DGX On-Prem system uses SLURM to schedule jobs. This is the same as our other HPC resources, and thus you can learn more about SLURM scheduling on this page. For more information on how to run a local jupyter notebook using conda, click here. To create a custom kernel, see the above section.

Before anything else, make sure that SLURM is loaded. If the command ‘squeue’ returns an error, you can module load SLURM by doing:

module load slurm

The following SBATCH script pulls a pytorch container from nvcr.io and prints the torch version to your output log, which will be created in the directory you executed the job from as a slurm-##.out file. To create your SBATCH script, invoke:

nano run.sh

and then enter the following:

#!/bin/bash
#SBATCH --job-name=test_job
#SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:21.12-py3'

python -c 'import torch ; print(torch.__version__)'

This will create a job using a pytorch container, version 21.12. To use other containers, you can invoke the same script but change the container image as needed. Containers can be found in the Nvidia catalog.

To submit the job, run the following:

sbatch run.sh

To create a jupyter notebook within the container, you can use the following:

#!/bin/bash

#SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:24.10-py3'
#SBATCH --time=00-01:00:00
#SBATCH --mem=80gb
#SBATCH --gres=gpu:1
#SBATCH --container-mounts=/network/rit/lab/YOUR_LAB_HERE:/mnt/YOUR_LAB_HERE,/network/rit/lab/YOUR_LAB_HERE/YOUR_RESULTS_FOLDER_HERE:/mnt/YOUR_RESULTS_FOLDER

jupyter lab --allow-root --port=8888 --no-browser --ip=0.0.0.0 --IdentityProvider.token="<YOUR_CUSTOM_STRING>" --NotebookApp.allow_origin='*' --notebook-dir=/mnt/YOUR_RESULTS_FOLDER

As usual, you can check if your job submitted by looking for an output file, or by invoking ‘squeue’ to see if your job is in the queue. You can use ‘scancel JOBID' to cancel a job you have submitted. Jobs that pull containers may take 5-10 minutes to spin up as containers must be pulled over from the registry. This wait time may scale with the size of the container. Containers on the DGX On-Prem System are run via Pyxis and Enroot. Usage documentation can be found here.

For jupyter notebook connections, your notebook link will typically follow the format: http://${node_name}.its.albany.edu:${port}, where node_name is the node that your job is running on (gpu001, gpu002, etc) and the port is the port specified by your jupyter script (typically 8888).

Be careful about mounting your home directory. AI related work uses lots of storage, thus it is important that you mount your Lab Directory and work from there. Don’t attempt to put large AI models in your 10 gb home directory. Make sure you operate from a proper 10 TB lab directory instead. You can mount multiple directories by separating with a comma. Be sure to terminate your jupyter notebook session once you are done.

Interactive Jobs

For interactive jobs, you can use ‘srun’ to log directly into a node with a simple command:

srun --time=01:00:00 --mem=80gb --pty $SHELL -i

To be more specific with which node or how many nodes to spawn on, you can specify with additional flags:

srun --nodelist=dgx## --nodes=# --time=01:00:00 --mem=#gb --pty $SHELL -i

If requesting 2 GPUs, add --gres=gpu:2. Each node has 8 GPUs total. Each A100 GPU has 80gb of vRAM, thus your --mem flag will be 80 * n GPUs. For 2 GPUs, this would then be --mem=160gb. For 12 GPUs, this will math out to 960gb. For 1 node (which is 8 GPUs), this amounts to 640 gb.

The --pty $SHELL -i flags indicate an interactive shell to spawn on the node.

Job Restrictions

Jobs are limited to 8 GPUs maximum across all jobs. Jobs can be run up to 1 week (--time=07-00:00:00). If you require more GPUs, please reach out to askIT@albany.edu regarding purchase information for additional GPUs or priority queues.