DGX Cloud: H100 How-To

Faculty members should submit a request for access by completing the form here. If you are a student, graduate assistant, or postdoctoral associate, the request for access should come from your PI. Once your PI has a lab group set up, you can be added to it.

Logging in and Setting Up

The DGX H100 cloud is terminal-based rather than GUI-based. To access this resource, please have your PI submit the DGX Cloud request form and include the netIDs of the lab members to be provisioned.

Make sure you are connected to the UAlbany VPN or are on UAlbany Wi-Fi. First, log into LMM and navigate to your home directory via ‘cd’.

ssh NETID@lmm.its.albany.edu
cd

Next you will generate your SSH key pair. Make sure to replace NETID with your own netID in lowercase, and replace YOUR_EMAIL@albany.edu with your own UAlbany email. It is also a good idea to copy the key pair over to your personal machine, especially in the case of Windows; you will need it later when forwarding a Jupyter Notebook instance to your localhost:8888.
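
The exact commands are not reproduced above, so the following is only a sketch. It assumes an ed25519 key stored at ~/.ssh/id_ed25519 and an SSH config alias named ngc pointing at the login node; adjust the key type, file names, and paths to match your own setup.

ssh-keygen -t ed25519 -C "YOUR_EMAIL@albany.edu" -f ~/.ssh/id_ed25519

# ~/.ssh/config entry so that 'ssh ngc' works (the user and key path are assumptions)
Host ngc
    HostName 207.211.163.76
    User NETID
    IdentityFile ~/.ssh/id_ed25519

Once the key and config are in place, connect to the cloud: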

ssh ngc

Here you will be prompted to remember this host and to enter your passphrase, which is your password equivalent for logging into the cloud. Do not forget this passphrase! The IP for the login node is 207.211.163.76, though you should not need it to ssh since the config takes care of this. Invoking ‘hostname’ should return ‘slogin001’. Don’t forget to load Slurm as a module here, otherwise commands such as squeue/sbatch will not work.

Use ‘sinfo’ to confirm that you can see the CPU/GPU resources.
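
For example, once you are on the login node:

hostname          # should print slogin001
module load slurm
sinfo             # lists the CPU/GPU partitions available to you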

Once you are logged in, you must set up your Enroot folder in order to start using containers, including the containers used for working within a Jupyter Notebook. To start, navigate to your home directory and create a .config folder; within that folder, create your enroot folder.
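
A minimal sketch of those steps (the directory names follow the paragraph above):

cd ~
mkdir -p ~/.config/enroot
cd ~/.config/enroot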

Once you are inside your enroot directory, create a .credentials file and format it with your API key as shown below.
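
The original example is not reproduced here; the line below is a sketch of the netrc-style entry Enroot expects for the nvcr.io registry. Replace YOUR_NGC_API_KEY with your own key.

# ~/.config/enroot/.credentials
machine nvcr.io login $oauthtoken password YOUR_NGC_API_KEY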

If you have not made your API key yet, please do so by following the instructions here. If you’ve already made an API key but forgot to write it down, you’ll have to make a new key and redo your NGC CLI setup according to the instructions.

Starting a Jupyter Notebook

To start a Jupyter Notebook, we’ll create a run.sh script, launch the notebook job, and then port-forward localhost:8888 to the GPU node that is hosting the job. On the H100 cloud, you’ll make an sbatch script like the one sketched below. Use ‘nano run.sh’ to create your run script; you can name it however you want if you keep multiple run scripts.
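
The original script is not reproduced here, so this is a sketch only: the partition name, time limit, log file names, and container image tag are assumptions, and the srun line relies on the pyxis/enroot --container-image and --container-mounts options.

#!/bin/bash
#SBATCH --job-name=jupyter
#SBATCH --partition=gpu            # assumption: check sinfo for the real partition name
#SBATCH --gpus=1
#SBATCH --time=04:00:00
#SBATCH --output=jupyter_%j.out
#SBATCH --error=jupyter_%j.err

# Run Jupyter inside an NGC container via pyxis/enroot; the image tag is only an example.
srun --container-image=nvcr.io#nvidia/pytorch:24.05-py3 \
     --container-mounts=$HOME:$HOME \
     jupyter notebook --no-browser --ip=0.0.0.0 --port=8888 --notebook-dir=$HOME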

This will make a notebook using the specified container from the nvcr.io registry. You can then sbatch this job; make sure you have used ‘module load slurm’ to load Slurm first (you can check by invoking ‘sinfo’).
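
For example:

sbatch run.sh
squeue -u NETID      # confirm the job is pending or running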

Your Jupyter Notebook address will be in the .err log file once the job has downloaded its container and started to run.

You can then check the log:
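
# the file name assumes the --error pattern from the run.sh sketch above;
# use ls *.err to find your actual log, and replace JOBID with your job ID
cat jupyter_JOBID.err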

There you should see a link to your Jupyter Notebook. Opening this link right away will not work, because you first need to port-forward from your local computer to reach the notebook. To do this, open a terminal/PowerShell window on your own machine and invoke a command like the one below, replacing NETID with your own netID, adjusting the path to your own private key, and replacing GPU_ID with the node name or GPU ID of your job (ex: gpu001, gpu002, etc.).
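
The original command is not reproduced here; this sketch assumes the login node IP given earlier and the key location from the key-generation step:

ssh -i ~/.ssh/id_ed25519 -L 8888:GPU_ID:8888 NETID@207.211.163.76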

This will log you in to the DGX Cloud and forward port 8888 on your local machine to port 8888 on the GPU node where your job is running. You can then open your notebook address in a browser on your local machine.

 

Uploading Data

Cluster User Guide — NVIDIA DGX Cloud Slurm Documentation

To upload your data to the H100 cluster, you’ll first have to identify where your data currently lives. Please see the guide above for how to move your data into the cluster.

For example, the simplest way to move data from our cluster to the cloud cluster is to use SCP on LMM and target the cloud cluster’s login node; make sure to specify the path to your own directory, as in the sketch below.

If you receive a permission denied error, you may need to tweak the command, for example by pointing SCP at your private key explicitly (second command below).
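
A sketch, assuming your data lives in /path/to/data on LMM and that your cloud home directory is the destination:

scp -r /path/to/data NETID@207.211.163.76:~/

# if you hit a permission denied error, point scp at your private key explicitly
scp -r -i ~/.ssh/id_ed25519 /path/to/data NETID@207.211.163.76:~/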

For larger file transfers, use SFTP instead. To improve transfer speed, tar or zip your data before transferring it; you can unzip/untar it on the cloud cluster afterwards.
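
A sketch of an SFTP upload, assuming a tarball named data.tar.gz in your current directory on LMM:

sftp NETID@207.211.163.76
put data.tar.gz
exit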

You can also use rsync to transfer individual files as well as whole directory trees.
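
The original commands are not reproduced here; the two commands below are a sketch, assuming a single file and a project directory under /path/to on LMM:

rsync -av /path/to/results.tar.gz NETID@207.211.163.76:~/
rsync -av /path/to/project/ NETID@207.211.163.76:~/project/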

The trailing / in the second command ensures your local directory structure is preserved under the destination. rsync has many flags, and you should read the official documentation if you plan to use this method extensively.