
This article details how to launch a Jupyter notebook session on our on-prem DGX systems. The DGX cluster in our data center has 24 nodes with 8 A100 GPUs each, for a total of 192 A100 GPUs. Because these resources are free and shared, each user is limited to 3 running jobs and a maximum of 4 GPUs at once.

API Key

To generate your API key from Nvidia, follow the instructions here. Once your key is generated, log into dgx-head01.its.albany.edu and load the SLURM module:

module load slurm

Enroot & Container Setup

In order to use containers with SLURM, you’ll need to set up Enroot. SSH into the DGX head node at dgx-head01.its.albany.edu using PuTTY or your terminal of choice (instructions can be found here), and remember not to run jobs directly on the head node. Then create an enroot directory inside the .config directory in your home directory:

mkdir -p ~/.config/enroot
cd ~/.config/enroot

From inside your ~/.config/enroot directory, create a .credentials file and format it as follows.

nano .credentials
# NVIDIA GPU Cloud (both endpoints are required)
machine nvcr.io login $oauthtoken password APIKEY
machine authn.nvidia.com login $oauthtoken password APIKEY

Replace ‘APIKEY’ with the API key you generated earlier. This allows Enroot to use your credentials to pull and run container images from nvcr.io. Leave $oauthtoken as-is; your login is authenticated via your API key. If you have containers in other registries, you can add entries to your .credentials file as follows:

# DockerHub
machine auth.docker.io login <login> password <password>

# Google Container Registry with OAuth
machine gcr.io login oauth2accesstoken password $(gcloud auth print-access-token)
# Google Container Registry with JSON
machine gcr.io login _json_key password $(jq -c '.' $GOOGLE_APPLICATION_CREDENTIALS | sed 's/ /\\u0020/g')

# Amazon Elastic Container Registry
machine 12345.dkr.ecr.eu-west-2.amazonaws.com login AWS password $(aws ecr get-login-password --region eu-west-2)

# Azure Container Registry with ACR refresh token
machine myregistry.azurecr.io login 00000000-0000-0000-0000-000000000000 password $(az acr login --name myregistry --expose-token --query accessToken  | tr -d '"')
# Azure Container Registry with ACR admin user
machine myregistry.azurecr.io login myregistry password $(az acr credential show --name myregistry --subscription mysub --query passwords[0].value | tr -d '"')
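
Because the .credentials file stores registry credentials in plain text, you may also want to restrict its permissions so only you can read it (a suggested precaution, not an Enroot requirement):

chmod 600 ~/.config/enroot/.credentials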
 

With this file set up, you are now ready to interact with containers for on-prem jobs.
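
If you’d like to confirm that the credentials work before writing a full batch script, one quick check (a sketch; it assumes the slurm module is loaded and uses the same PyTorch image as the examples below) is a one-off srun that runs a single command inside the container via Pyxis:

srun --container-image='docker://nvcr.io/nvidia/pytorch:21.12-py3' python -c 'import torch ; print(torch.__version__)'

If the torch version prints, Enroot was able to authenticate to nvcr.io and pull the image.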

Conda & Custom Kernels

If you would like to use conda with custom kernels and locally installed software instead of a container, you can follow the instructions here to create your custom environment.
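
As a rough sketch of what that typically involves (the environment name demo-env and Python version are placeholders; follow the linked instructions for the supported workflow):

conda create -n demo-env python=3.10 ipykernel
conda activate demo-env
python -m ipykernel install --user --name demo-env --display-name "Python (demo-env)"

The last command registers the environment as a Jupyter kernel for your user, so it shows up in the notebook’s kernel list.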

SBATCH Script & Jupyter Notebook

The DGX on-prem system uses SLURM to schedule jobs, just like our other HPC resources, so you can learn more about SLURM scheduling on this page. For more information on how to run a local Jupyter notebook using conda, click here. To create a custom kernel, see the section above.

Before anything else, make sure that SLURM is loaded. If the command ‘squeue’ returns an error, load the SLURM module with:

module load slurm

The following SBATCH script pulls a PyTorch container from nvcr.io and prints the torch version to your output log, which will be created as a slurm-##.out file in the directory you submitted the job from. To create your SBATCH script, invoke:

nano run.sh

and then enter the following:

#!/bin/bash
#SBATCH --job-name=test_job
#SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:21.12-py3'

python -c 'import torch ; print(torch.__version__)'

This creates a job using a PyTorch container, version 21.12. To use other containers, use the same script but change the container image as needed. Containers can be found in the Nvidia NGC catalog.
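
If you want to attach explicit resource requests to this test job, standard SBATCH directives can be added alongside the container image. The values below are placeholders (and remember the 4-GPU-per-user limit); the script is submitted the same way:

#!/bin/bash
#SBATCH --job-name=test_job
#SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:21.12-py3'
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
#SBATCH --mem=80gb

python -c 'import torch ; print(torch.__version__)'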

To submit the job, run the following:

sbatch run.sh
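
Once submitted, you can watch the queue and then read the log after the job finishes (the job ID reported by sbatch fills in the output file name):

squeue -u $USER
cat slurm-JOBID.out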

To launch a Jupyter notebook inside the container, you can use the following script:

#!/bin/bash

#SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:21.12-py3'
#SBATCH --nodelist=dgx01
#SBATCH --gres=gpu:1
#SBATCH --container-mounts=/network/rit/lab/YOUR_LAB_HERE:/mnt/YOUR_LAB_HERE,/network/rit/lab/YOUR_LAB_HERE/YOUR_RESULTS_FOLDER_HERE:/mnt/YOUR_RESULTS_FOLDER

jupyter lab --allow-root --port=8888 --no-browser --ip=0.0.0.0 --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/mnt/YOUR_LAB_HERE
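
The script above starts JupyterLab on port 8888 of the compute node (dgx01 in this example) with no token set. One common way to reach it from your own machine, sketched here with example hostnames and assuming the head node can reach the compute node, is an SSH tunnel through the head node; the notebook is then available at http://localhost:8888 in your browser:

ssh -L 8888:dgx01:8888 YOUR_NETID@dgx-head01.its.albany.edu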

As usual, you can check whether your job submitted by looking for an output file, or by invoking ‘squeue’ to see if your job is in the queue. You can use ‘scancel JOBID’ to cancel a job you have submitted. Jobs that use containers may take 5-10 minutes to spin up, since the container image must first be pulled from the registry; this wait time scales with the size of the container. Containers on the DGX on-prem system are run via Pyxis and Enroot. Usage documentation can be found here.

Be careful about mounting your home directory. AI-related work uses a lot of storage, so it is important that you mount your lab directory and work from there. Don’t attempt to put large AI models in your 10 GB home directory; operate from a 10 TB lab directory instead. You can mount multiple directories by separating them with commas, as in the --container-mounts line above.

Interactive Jobs

For interactive jobs, you can use ‘srun’ to log directly into a node with a simple command:

srun --time=01:00:00 --mem=80gb --pty $SHELL -i

To choose a specific node or number of nodes, add the corresponding flags:

srun --nodelist=dgx## --nodes=# --time=01:00:00 --mem=#gb --pty $SHELL -i

To request GPUs, add --gres=gpu:N (for example, --gres=gpu:2 for two GPUs); each node has 8 GPUs total. Each A100 GPU has 80 GB of vRAM, so size your --mem flag as 80 GB times the number of GPUs requested: 2 GPUs would be --mem=160gb, and a full node of 8 GPUs would be --mem=640gb. Keep in mind the per-user limit of 4 GPUs noted above.
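
Putting that together, a 2-GPU interactive session (the node name and walltime here are just examples) would look like:

srun --nodelist=dgx01 --nodes=1 --gres=gpu:2 --time=02:00:00 --mem=160gb --pty $SHELL -i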

The --pty $SHELL -i flags spawn an interactive shell on the node.
