...

The University's Research Storage

Getting Started:

To begin, log into NVIDIA AI Enterprise Base Command using your University email (Web SSO).

...

Once you successfully log in, select University at Albany (SUNY) as your organization, and then select your team. If you do not have a team, please reach out to ITS to be added to your lab's team. You can still access resources, but your lab's data and workspace will be unavailable until you are placed on the team.

Generating an API Key

To upload your data, or even to use the NVIDIA command line, you will need to generate an API key. This key lets you log into your workspace as yourself, so it is important not to lose it or give it out to anyone.

...

On the next page, click 'Generate API Key'. Save this key to a text file for easy access on your machine. If you lose your key, you can return to this page and generate a new one. Next, you will install the NVIDIA CLI.
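
If it helps, one hedged way to keep the key on a Linux machine is to store it in a file only you can read (the file name and location below are just examples):

Code Block
echo 'PASTE_YOUR_API_KEY_HERE' > ~/ngc_api_key.txt   # example path; replace with your actual key
chmod 600 ~/ngc_api_key.txt                          # make the file readable by you only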

Using the NVIDIA CLI via LMM

Access LMM by SSHing to lmm.its.albany.edu. The ngc command is available there and can be checked by invoking:

...

You should see the following output to your terminal.
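
For reference, the connection and check from your own terminal might look like the sketch below (your_netid is a placeholder, and the exact output depends on the installed CLI version):

Code Block
ssh your_netid@lmm.its.albany.edu   # connect to LMM
ngc --version                       # confirm the NGC CLI is available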

Configuring your Terminal on LMM

Next, you will need to configure your terminal so that you can upload your data in a clean, usable format. Change directories to where your data is located.

...

Congrats on setting up the terminal! You are now ready to upload your data.
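
For reference, the configuration step is typically done with ngc config set, which prompts interactively for your API key and for your org, team, ace, and output format (a sketch; the exact prompts may vary by CLI version):

Code Block
ngc config set
# Paste your API key when prompted, then choose your org, team, and ace.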

Uploading Data

In your Linux or PowerShell terminal, navigate to where your data is stored. To do this, you can change directories via:

...

In this fashion, your original data is never modified, ensuring the reproducibility of your work. If you are doing data cleaning, you can do it in a workspace and save the result as a new CSV, or clean the data locally if that is easier.
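
A hedged sketch of what the upload might look like from the CLI (the paths and names here are placeholders, and exact flags may differ by CLI version):

Code Block
cd /path/to/your/data                                                         # placeholder path to your local data
ngc dataset upload --source . --desc "raw CSVs for my project" my_raw_data    # create a read-only dataset
ngc workspace create --name my_lab_workspace                                  # create a read-write workspace
ngc workspace upload --source ./notebooks my_lab_workspace                    # put living files (notebooks, scripts) there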

Starting a Job

The default time limit for a job is 30 days. There are two ways to start a job on DGX Cloud: via the web interface or via the CLI. The web interface is graphical, easy to use, and also generates a CLI command for you to use if you wish. In this example, we will submit a job that launches a Jupyter Notebook instance and accesses our data from inside the notebook.

From the Web Interface:

In the create job section, there are templates created by people on your team, by NVIDIA, or by UAlbany ITS. In this example, we'll use a template to launch a Jupyter Notebook session. Here we see a templates tab with one template available to us: tf_3.1_jupyter_notebook, which uses 1 GPU, 1 node, and the NVIDIA TensorFlow container from NVIDIA AI Enterprise 3.1.

...

Once you load the template, you can edit its options in the web interface: change the number of GPUs or nodes, or swap the container, to better suit your computing needs. Scroll down to 'Inputs' to see which datasets and workspaces you can load; you can attach multiple of each. Both datasets and workspaces can hold data you will use, but there are some key differences between them.

Datasets

Datasets are read-only and stay the same between sessions. This is useful for reference data that you do not want to change. Files such as CSVs belong here for your code to reference and load into memory.

Workspaces

Files in a workspace are readable and writable. This space is useful for living files that are being actively edited, such as Jupyter notebooks (.ipynb) and Python scripts, which persist between sessions as you work on them.

...

You can also download datasets and workspaces, and convert them to results for re-upload into other spaces. For example, once you are finished working on a script in a workspace, you might download it and re-upload it as a dataset so you have an immutable copy of the script to reference. Once you have selected a dataset or workspace to include, a text box will appear under the 'Mount Point' column. Enter /mount/data, or any other custom path, there; in your Jupyter Notebook you can access the data using that path. Scrolling down further shows a /results path for any output you may generate.
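
Inside a running job, these mount points behave like ordinary directories. For example, from a terminal in the Jupyter session you could check them as in this sketch (paths assume the /mount/data and /mount/workspace mount points used here):

Code Block
ls /mount/data           # read-only dataset files
ls /mount/workspace      # read-write workspace files
cp output.csv /results/  # anything written to /results is kept as job results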


Containers

The container is similar to a conda environment: packages relevant to your work are pre-loaded and ready to use. In this job, we are using nvaie/tensorflow-3-1 with the specified tag, and we will open the notebook on port 8888.

...

This is the same as you would do normally when submitting a non-interactive batch job.


Starting the Job

Finally, to start the job, scroll down to the Launch Job section. The default runtime for a job is 30 days (2592000 seconds).

...

You can also copy and paste the generated CLI command from the web interface directly into your terminal, though you must specify your dataset and workspace IDs and mount paths.

From the CLI:

Code Block
ngc batch run --name "Job-univ-of-albany-iad2-ace-622835" --priority NORMAL --order 50 --preempt RUNONCE --min-timeslice 2592000s --total-runtime 2592000s --ace univ-of-albany-iad2-ace --instance dgxa100.80g.1.norm --commandline "jupyter lab --allow-root --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/" --result /results --image "nvaie/tensorflow-3-1:23.03-tf1-nvaie-3.1-py3" --org tt6xxv6at61b --datasetid dataset_ID_here:/mount/data --workspace workspace_ID_here:/mount/workspace:RW --port 8888

...

Here we can see that 1 GPU is available. You will see more if you selected multiple GPUs when creating the job.
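
If you want to confirm this from inside the session yourself, a terminal opened in the Jupyter Notebook can run the standard NVIDIA utility (the output will reflect however many GPUs your job requested):

Code Block
nvidia-smi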

Closing the Notebook

Once you are done working on your code, use File→Save to save your work, then File→Shut Down to close the notebook and end the session. It will take a few minutes for the compute resources to become available again while the system saves work and clears memory.
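
If you also want to end the job itself from the CLI, for example to release any remaining runtime, the NGC batch commands can be used; a sketch, with the job ID taken from the job list:

Code Block
ngc batch list            # find your job's ID
ngc batch kill <job_id>   # end the job and release its resources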

Results & Logs

You can now look at the job and see that it has the 'Finished Success' status. From here you can download any results that were generated, as well as a log file where output and errors are documented.
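
Results and logs can also be retrieved from the CLI; a hedged sketch using the job ID from ngc batch list:

Code Block
ngc result download <job_id>   # downloads the job's results (including the log) to your machine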

...