Faculty members should submit a request for access by completing the form here. If you are a student, graduate assistant, or postdoctoral associate, the request for access should come from your PI. After your PI has a lab group set up, you can be added to their lab group.
Logging in and Setting Up
The DGX H100 cloud is terminal-based rather than GUI-based. To access this resource, you will need to create SSH key pairs on LMM. This guide will walk you through how to do this. Make sure you are connected to the UAlbany VPN or are on UAlbany Wi-Fi.
First, log into LMM and navigate to your home directory with ‘cd’.
lmm.its.albany.edu
cd
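If you are starting from your own computer, the LMM login itself is a plain SSH connection; a minimal sketch, assuming your LMM username is your netID:
ssh netID@lmm.its.albany.edu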
Next you will generate your SSH key pair. Make sure to replace NETID with your own netID in lowercase, and replace YOUR_EMAIL@albany.edu with your own UAlbany email. It is also a good idea to copy the resulting keys over to your personal machine, especially on Windows; you will need the private key locally later when forwarding a Jupyter notebook instance to localhost:8888.
ssh-keygen -t ed25519 -f ~/.ssh/NETID-ed25519-dgxc -C "YOUR_EMAIL@albany.edu"
After entering this, ssh-keygen will print ‘Generating public/private ed25519 key pair’ and ask you to set a passphrase for the key. Your SSH keys will now be under your home directory in a hidden folder called .ssh. There are two keys: one is public and one is private. To log into the cloud, you must supply the private key. Make a copy of these files and also place them into your local machine’s .ssh folder through PowerShell (Windows) or Terminal (Mac/Linux). You will need this in order to open a Jupyter notebook on localhost later.
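One way to make that local copy is to run scp from your personal machine’s PowerShell or Terminal (a sketch assuming the OpenSSH client is installed locally; on Windows the ~/.ssh destination resolves to C:\Users\you\.ssh):
scp NETID@lmm.its.albany.edu:.ssh/NETID-ed25519-dgxc ~/.ssh/
scp NETID@lmm.its.albany.edu:.ssh/NETID-ed25519-dgxc.pub ~/.ssh/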
ssh -i ~/.ssh/netID-ed25519-dgxc netID@207.211.163.76
Here you will be prompted to trust this host and to enter your passphrase, which is your password equivalent for logging into the cloud. Do not forget this passphrase! The IP for the login node is 207.211.163.76, and invoking ‘hostname’ should return ‘slogin001’. Don’t forget to load Slurm as a module here, otherwise commands such as squeue/sbatch will not work.
module load slurm
sinfo
Use ‘sinfo’ to confirm that you can see the CPU/GPU resources.
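Exact partitions and node names depend on the cluster configuration, but the sinfo output will be formatted roughly like this (the values below are purely illustrative, not the real cluster layout):
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu*         up   infinite      4   idle gpu[001-004]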
Once you are logged in, you must now set up your Enroot folder in order to start using containers, including containers used for working within a jupyter notebook. To start, navigate to your home directory and create a .config folder. Within that folder, create your enroot folder.
cd
mkdir .config
cd .config
mkdir enroot
cd enroot
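Equivalently, you can create the whole directory tree in one step from anywhere and then move into it:
mkdir -p ~/.config/enroot
cd ~/.config/enroot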
Once you are inside your enroot directory, create a .credentials file and format it as such with your API key.
nano .credentials
# NVIDIA GPU Cloud (both endpoints are required)
machine nvcr.io login $oauthtoken password APIKEY
machine authn.nvidia.com login $oauthtoken password APIKEY
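Because this file holds your NGC API key, it is good practice to make it readable only by you:
chmod 600 ~/.config/enroot/.credentials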
If you have not made your API key yet, please do so by following the instructions here. If you’ve already made an API key but forgot to write it down, you’ll have to make a new key and re-do your NGC CLI setup according to the instructions.
Starting a Jupyter Notebook
To start a Jupyter notebook, we’ll create a run.sh script, launch the notebook job, and then port forward local port 8888 to the GPU that is hosting the job. On the H100 cloud, you’ll make the following sbatch script. Use ‘nano run.sh’ to create your run script. You can name the script whatever you like, which is useful if you keep multiple run scripts.
#!/bin/bash
#SBATCH --job-name=notebook                # name of the job
#SBATCH --time=00-01:00:00                 # time limit in dd-hh:mm:ss
#SBATCH --gres=gpu:1                       # GPU allocation
#SBATCH --mem=80gb                         # memory allocation
#SBATCH --output=notebook-%j.out           # output file
#SBATCH --error=notebook-%j.err            # log file
#SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:21.12-py3'

jupyter lab --allow-root --port=8888 --no-browser --ip=0.0.0.0 --IdentityProvider.token='' --NotebookApp.allow_origin='*' --notebook-dir=/
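Note that with --notebook-dir=/ you will be browsing the container’s own filesystem. If you also want your cluster home directory visible inside the container, the pyxis plugin that provides --container-image also accepts a --container-mounts option; a sketch, assuming your cloud home directory is /home/netID and that the cluster’s enroot/pyxis setup permits the mount:
#SBATCH --container-mounts=/home/netID:/home/netID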
The script above will launch a notebook using the specified container from the nvcr.io registry. You can then sbatch the job; make sure you have loaded Slurm first with ‘module load slurm’ (you can check by invoking ‘sinfo’).
sbatch run.sh
Your Jupyter notebook address will be in the .err log file once the job has downloaded its container and started to run. You can check the status of your job with:
squeue -u netID
You can then check the log:
more notebook-JOBID.err
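The exact log format depends on the Jupyter version inside the container, but the line you are looking for resembles the following, where the hostname (gpu001 here is only an illustration) is the GPU node running your job:
http://gpu001:8888/lab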
Opening this link directly will not work yet, as you first need to port forward from your local computer to reach the notebook. To do this, open your Terminal/PowerShell window and invoke the following, replacing netID with your own netID, adjusting the path to your own private key, and replacing JOBGPU with the name of the GPU node your job is running on (shown in the NODELIST column of squeue).
ssh -L 8888:JOBGPU:8888 netID@207.211.163.76 -i C:\Users\netID\.ssh\netID-ed25519-dgxc
This will log you in to the DGX Cloud and forward port 8888 on your local machine to port 8888 of the GPU that your job is running on.
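On Mac or Linux, the same tunnel looks like the following (same placeholders as above). Once the tunnel is up, open http://localhost:8888 in your local browser to reach the notebook.
ssh -L 8888:JOBGPU:8888 -i ~/.ssh/netID-ed25519-dgxc netID@207.211.163.76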
Uploading Data
To upload your data to the H100 cluster, you’ll have to first identify the source of your data. Please see the guide above for how to move your data into the cluster.
scp archive.tgz netID@ip-addr-of-login-node:/path/to/your/folder
If you receive a permission denied error, you may need to tweak the command above to the following:
scp -i /path/to/private_key archive.tgz netID@ip-addr-of-login-node:/path/to/your/folder
The simplest route from our cluster to the cloud cluster is to run scp on LMM and target the cloud cluster’s login node, as in the commands above. Make sure to specify the path to your destination directory.
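For instance, from LMM a transfer into your cloud home directory might look like this (the -i key path and the ~/ destination are a sketch; adjust them to your own key and target folder):
scp -i ~/.ssh/NETID-ed25519-dgxc archive.tgz netID@207.211.163.76:~/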
For larger file transfers, use SFTP instead. Transferring a single compressed archive and unzipping/untarring it on the cloud cluster afterwards will also improve transfer speed.
sftp netID@ip-addr-of-login-node
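Once connected (add -i /path/to/private_key if you hit the same permission denied issue as with scp), a typical interactive session might look like this, with illustrative paths:
sftp> cd /path/to/your/folder
sftp> put archive.tgz
sftp> exit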
You can also use rsync to transfer individual files and entire directory structures.
rsync archive.tgz netID@ip-addr-of-login-node:/path/to/your/folder
rsync -r local-directory/ netID@ip-addr-of-login-node:/path/to/your/folder
The -r flag in the second command copies the directory recursively, and the trailing / on the source means the contents of local-directory are copied into the destination folder rather than being placed in a local-directory subfolder there. Rsync has many flags, and you should read the official documentation if you plan to use this method extensively.
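As a starting point, a common invocation combines -a (recursive copy that preserves permissions and timestamps), -v (verbose output), and -P (progress display and resumable partial transfers), with -e used to point rsync at your private key if key authentication is required:
rsync -avP -e "ssh -i /path/to/private_key" local-directory/ netID@ip-addr-of-login-node:/path/to/your/folder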