This article details how to launch a Jupyter notebook session on the on-prem DGX systems. The DGX cluster in our data center has 24 nodes with 8 A100 GPUs each, for a total of 192 A100 GPUs. Because these resources are shared, each user is limited to 3 concurrent jobs and a maximum of 4 GPUs at once.
...
Code Block:

#!/bin/bash
#SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:21.12-py3'
#SBATCH --nodelist=dgx01
#SBATCH --gres=gpu:1
#SBATCH --container-mounts=/network/rit/lab/YOUR_LAB_HERE:/mnt/YOUR_LAB_HERE,/network/rit/lab/YOUR_LAB_HERE/YOUR_RESULTS_FOLDER_HERE:/mnt/YOUR_RESULTS_FOLDER

jupyter lab --allow-root --port=8888 --no-browser --ip=0.0.0.0 --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/mnt/YOUR_LAB_HERE
As usual, you can check whether your job was submitted by looking for an output file, or by invoking ‘squeue’ to see if your job is in the queue. Use ‘scancel JOBID’ to cancel a job you have submitted. Jobs that pull containers may take 5-10 minutes to start, since the container image must first be pulled from the registry; this wait time scales with the size of the image. Containers on the DGX On-Prem System are run via Pyxis and Enroot. Usage documentation can be found here.
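As a sketch of the submit/check/cancel cycle described above (the filename jupyter.sbatch and the job ID are hypothetical; substitute your own):

```shell
# Submit the batch script to Slurm; prints "Submitted batch job <JOBID>"
sbatch jupyter.sbatch

# Check whether your job is queued or running
squeue -u $USER

# Cancel a job you submitted, using the ID shown by squeue
scancel 123456
```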
Be careful about mounting your home directory. AI-related work uses a lot of storage, so it is important to mount your lab directory and work from there. Do not attempt to put large AI models in your 10 GB home directory; operate from your 10 TB lab directory instead. You can mount multiple directories by separating them with commas.
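For illustration, here is what a multi-directory mount might look like in the SBATCH header; the lab and results paths are hypothetical placeholders, and note there are no spaces around the comma:

```shell
# Mount two host directories into the container: host_path:container_path pairs,
# comma-separated with no spaces between entries
#SBATCH --container-mounts=/network/rit/lab/mylab:/mnt/mylab,/network/rit/lab/mylab/results:/mnt/results
```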
...