DGX Cloud: H100 How-To

Faculty members should submit a request for access by completing the form here. If you are a student, graduate assistant, or postdoctoral associate, the request for access should come from your PI. After your PI has a lab group set up, you can be added to their lab group.

Logging in and Setting Up

The DGX H100 cloud is terminal based rather than GUI based. To access this resource, please have your PI submit the DGX Cloud request form and include your lab’s netIDs to be provisioned.

Make sure you are connected to the UAlbany VPN or are on UAlbany Wifi. Firstly, log into LMM and navigate to your home directory via ‘cd’.

lmm.its.albany.edu

cd

Next you will log into the cloud using this command but only after you have been provisioned by ITS.

ssh ngc

Here you will be prompted to remember this host and enter your passphrase, which is your password equivalent for logging into the cloud. Do not forget this passphrase! The IP for the login node is 207.211.163.76, though you should not need this to ssh as the config should take care of this. Invoking ‘hostname’ should return ‘slogin001’. Don’t forget to load slurm as a module here, otherwise commands such as squeue/sbatch will not work.

module load slurm
sinfo

Use ‘sinfo’ to confirm that you can see the CPU/GPU resources.

Once you are logged in, you must now set up your Enroot folder in order to start using containers, including containers used for working within a jupyter notebook. To start, navigate to your home directory and create a .config folder. Within that folder, create your enroot folder.

cd
mkdir .config
cd .config
mkdir enroot
cd enroot

Once you are inside your enroot directory, create a .credentials file and format it as such with your API key.

nano .credentials

# NVIDIA GPU Cloud (both endpoints are required)
machine nvcr.io login $oauthtoken password APIKEY
machine authn.nvidia.com login $oauthtoken password APIKEY

If you have not made your API key yet, please do so by following the instructions here. If you’ve already made an API key but forgot to write it down, you’ll have to make a new key and re-do your NGC CLI setup according to the instructions. Use the Generate Legacy Key options at the bottom of the Personal Key Generation page on Nvidia’s site in order to generate a viable key.

Starting a Jupyter Notebook

To start a jupyter notebook, we’ll create a run.sh script, launch the notebook job, and then port forward localhost 8888 to the GPU that is hosting the job. On the H100 cloud, you’ll make the following sbatch script. Use ‘nano run.sh’ to create your run script. You can name this however you want as well if you have multiple run scripts.

#!/bin/bash
#SBATCH --job-name=notebook             #name of the job
#SBATCH --time=00-01:00:00              #time limit in dd-hh:mm:ss
#SBATCH --gres=gpu:1                    #GPU allocation
#SBATCH --mem=80gb                      #memory allocation
#SBATCH --output=notebook-%j.out        #output file
#SBATCH --error=notebook-%j.err         #log file
#SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:25.02-py3-igpu'

jupyter lab --allow-root --port=8888 --no-browser --ip=0.0.0.0 --IdentityProvider.token='' --NotebookApp.allow_origin='*' --notebook-dir=/

This will make a notebook using the specified container from the nvcr.io registry. You can then sbatch this job, make sure you have used ‘module load slurm’ to load slurm. You can check by invoking ‘sinfo’.

sbatch run.sh

Your jupyter notebook address will be in the .err log file once the job has downloaded its container and started to run.

squeue -u netID

You can then check the log:

more notebook-JOBID.err

And you should see a link to your jupyter notebook. Opening this now will not work as you need to now port forward your local computer to access the notebook. To do this, open your terminal/powershell window and invoke the following where you replace netID with your own netID, adjust the path to your own private key, and replace GPU_ID with the node name or GPU ID of your job (ex: gpu001, gpu002, etc).

ssh -L 8888:GPU_ID:8888 netID@207.211.163.76 -i C:\Users\netID\.ssh\netID-ed25519-dgxc

This will log you in to the DGX Cloud and forward port 8888 on your local machine to port 8888 of the GPU that your job is running on. You can then open your notebook address on your local machine.

Uploading Data

3. Cluster User Guide — NVIDIA DGX Cloud Slurm Documentation

To upload your data to the H100 cluster, you’ll have to first identify the source of your data. Please see the guide above for how to move your data into the cluster.

scp archive.tgz netID@ip-addr-of-login-node:/path/to/your/folder

If you receive a permission denied error, you may need to tweak this above command to the following:

scp -i /path/to/private_key archive.tgz netID@ip-addr-of-login-node:/path/to/your/folder

For example, the simplest way from our cluster to the cloud cluster will be to use SCP on LMM and target the cloud cluster’s login node. Make sure to specify the path to your directory.

For larger file transfers, use SFTP instead. You can unzip/untar your data on the cloud cluster later to improve transfer speed.

sftp netID@ip-addr-of-login-node

You can also use rsync to transfer files and file structures as well.

rsync archive.tgz netID@ip-addr-of-login-node:/path/to/your/folder

rsync local-directory/ netID@ip-addr-of-login-node:/path/to/your/folder

The trailing / in the second command will ensure your local directory structure is preserved. Rsync has many flags and you should read up on the official documentation if you plan to use this method extensively.

Interactive Jobs

For interactive jobs, you can use ‘srun’ to log directly into a node with a simple command:

srun --time=01:00:00 --mem=80gb --pty $SHELL -i

To be more specific with which node or how many nodes to spawn on, you can specify with additional flags:

srun --nodelist=dgx## --nodes=# --time=01:00:00 --mem=#gb --pty $SHELL -i

If requesting 2 GPUs, add --gres=gpu:2. Each node has 8 GPUs total. Each H100 GPU has 80gb of vRAM, thus your --mem flag will be 80 * n GPUs. For 2 GPUs, this would then be --mem=160gb. You cannot request more than 2 H100 GPUs at a time.

The --pty $SHELL -i flags indicate an interactive shell to spawn on the node.

Job Restrictions

Jobs are limited to 2 GPUs maximum across all jobs. Jobs can be run up to 1 week (--time=07-00:00:00). If you require more GPUs, please reach out to askIT@albany.edu for more information regarding resources and purchasing priority access.