This article details how to launch a Jupyter notebook session on the DGX systems located on-prem. The DGX cluster in our data center has 24 nodes that house 8 A100 GPUs each, for a total of 192 A100 GPUs. Because free resources must be shared, each user is limited to 1 running job and 4 queued jobs at a time, with a maximum of 4 GPUs per job. All jobs have a maximum runtime of 8 hours. If you are planning on training a model for longer than 8 hours, please be sure to create model checkpoint files within your code so that training can resume in a new job.

API Key

To generate your API key from Nvidia, follow the instructions here. Once your key is generated, log in to dgx-head01.its.albany.edu and load the Slurm module.

Code Block
dgx-head01.its.albany.edu
Code Block
module load slurm
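The API key is what authenticates container pulls from nvcr.io. One common way to store it for Enroot (the container runtime used on this system, see below) is an Enroot credentials file; the path and the $oauthtoken login shown here follow NVIDIA's Enroot/NGC documentation, so treat this as a sketch and confirm it against the instructions linked above.

Code Block
# Sketch: store your NGC API key so Enroot can authenticate to nvcr.io
# (path and netrc format assumed from NVIDIA's Enroot documentation)
mkdir -p ~/.config/enroot
cat >> ~/.config/enroot/.credentials << 'EOF'
machine nvcr.io login $oauthtoken password <YOUR_API_KEY_HERE>
EOF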

...

Code Block
#!/bin/bash

# NGC PyTorch container image (pick the YY.MM release tag you need)
#SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:24.10-py3'
#SBATCH --time=00-01:00:00
#SBATCH --nodelist=dgx01
#SBATCH --mem=80gb
#SBATCH --gres=gpu:1
#SBATCH --container-mounts=/network/rit/lab/YOUR_LAB_HERE:/mnt/YOUR_LAB_HERE,/network/rit/lab/YOUR_LAB_HERE/YOUR_RESULTS_FOLDER_HERE:/mnt/YOUR_RESULTS_FOLDER

jupyter lab --allow-root --port=8888 --no-browser --ip=0.0.0.0 --IdentityProvider.token="<YOUR_CUSTOM_STRING>" --NotebookApp.allow_origin='*' --notebook-dir=/mnt/YOUR_RESULTS_FOLDER
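Save the contents above into a batch script (the filename below is just an example) and submit it from the head node with sbatch:

Code Block
# Submit the batch script; Slurm prints the job ID on success
sbatch jupyter_lab.sh

Once the job is running, the notebook listens on port 8888 of the node the job landed on (check squeue for the node name); how you reach it from your workstation, for example through an SSH tunnel, depends on your network setup, and you will need the token string you set above.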

As usual, you can check whether your job was submitted by looking for an output file, or by invoking ‘squeue’ to see if your job is in the queue. You can use ‘scancel JOBID’ to cancel a job you have submitted. Jobs that pull containers may take 5-10 minutes to spin up, since the container image must be pulled from the registry; this wait time may scale with the size of the container. Containers on the DGX On-Prem System are run via Pyxis and Enroot. Usage documentation can be found here.
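For example (the job ID below is a placeholder):

Code Block
# List your own jobs in the queue
squeue -u $USER

# Cancel a specific job by its job ID
scancel 123456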

Be careful about mounting your home directory. AI-related work uses lots of storage, so it is paramount that you mount your Lab Directory and work from there. Don’t attempt to put large AI models in your 10 GB home directory; make sure you operate from your 10 TB lab directory instead. You can mount multiple directories by separating them with a comma, as in the --container-mounts line above. Be sure to terminate your Jupyter notebook session once you are done.

Interactive Jobs

For interactive jobs, you can use ‘srun’ to log directly into a node with a simple command:

...

The --pty $SHELL -i flags tell Slurm to spawn an interactive shell on the node.
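The exact resource flags depend on what you need; a minimal sketch, assuming one GPU for one hour, might look like this:

Code Block
# Request 1 GPU for 1 hour and open an interactive shell on the allocated node
srun --gres=gpu:1 --time=00-01:00:00 --pty $SHELL -i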

Job Restrictions

Jobs are limited to 1 running job per user, with up to 4 jobs waiting in the queue. Jobs can run for up to 1 week (--time=07-00:00:00) and use up to 4 GPUs (--gres=gpu:4). If you require more GPUs, please reach out to askIT@albany.edu regarding purchase information for additional GPUs or priority queues.
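Put together, an SBATCH header requesting the maximum single-job allocation under these limits would look roughly like this (other directives, such as the container image and mounts, are the same as in the example above):

Code Block
# Maximum single-job request under the current limits: 4 GPUs for 1 week
#SBATCH --time=07-00:00:00
#SBATCH --gres=gpu:4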