
In this article, you will learn how to access the NVIDIA DGX Cloud resources available from the University.

Compute Request Form

These resources are available at the request of faculty only.


To request access to DGX Cloud Services, fill out this form.


Before getting started, please ensure you are familiar with the following:

The University's SSH Protocols

The University's Research Storage

Please note: Research Groups are allowed 1-2 GPUs and 20 TB of storage. Users who abuse this system will be notified and may be suspended.

Getting Started:

To begin, log into NVIDIA AI Enterprise Base Command using your University email (Web SSO).

...

Once you successfully log in, select University at Albany (SUNY) as your organization, and then select your team. If you do not have a team, please reach out to ITS to be added to your lab's team. You may still access resources, but your lab's data and workspace will be unavailable until you are placed into that team.

Generating an API Key

In order to upload your data and use the NVIDIA command line, you will need to generate an API key. This key allows you to log into your workspace as yourself, so it is important not to lose it or share it with others.

Once you have selected a team and logged in, you will be on the Base Command homepage. On the left-hand side, select 'BASE COMMAND' and, from the dropdown, select 'Dashboard' to bring you to the overview of the system.

...

On the next page, click 'Generate API Key'. Save this key to a text file on your machine for easy access. If you lose your key, you can generate a new one on this page. Next, you will install the NVIDIA CLI.

...

Using the NVIDIA CLI

...

Next, download the CLI according to your system specifications. To find your system specifications, press the Windows key and type 'System Information'.


Your system information on a Windows machine will show the System Type; if it is x64-based, you have a 64-bit installation of Windows.


Next, select the appropriate download of the CLI and install it.


Once the installation is complete, you will configure the terminal and check that the CLI has installed correctly.

Configuring your Terminal

...

via LMM

Access LMM by SSHing to lmm.its.albany.edu. The NGC command is available there and can be checked by invoking:

Code Block
ngc --version

You should see the installed version of the CLI printed to your terminal.


Configuring your Terminal on LMM

Next, you will need to configure your terminal so that you can upload your data in a neat and usable format. On a Windows machine, press the Windows key and type 'Windows PowerShell' to open a PowerShell terminal. PowerShell has many similarities to the Linux terminals you may be familiar with from accessing existing University resources, though the syntax differs slightly; you can still use commands such as 'cd' and 'ls' just as you can in Linux. To begin, change directories to your data.

Code Block
cd #brings you home
cd /path/to/my/data

Next, check that the NVIDIA CLI has installed correctly by invoking:

Code Block
ngc --version


This should return the version of the CLI that you installed. Your username should appear here as well. On Windows, you can change directories by invoking:

Code Block
cd C:\Users\your_username\

To list the contents of your directory, invoke the command 'ls'.

Next, you will configure NGC such that you can upload your data to your workspace and access it within a job. To begin, invoke:

Code Block
ngc config set

You will be prompted for the API key that you generated earlier. You can copy (Ctrl+C) and then paste (Ctrl+V) in your PowerShell terminal to submit your API key.

Next, it will prompt you for your CLI output type; select ascii by typing in:

Code Block
ascii

If you entered a different option or accidentally skipped this entry, you can invoke 'ngc config set' again to redo your choices. Hitting Enter without any input will skip the prompt, so you do not need to re-enter your API key unless necessary.

Next, you will be asked to enter your organization; enter the following:

Code Block
University at Albany (SUNY)

Next, it will ask you to enter your team name. This should be the lab team in which you will be working. Lastly, the terminal will prompt you for 'ace'. Enter the following in order to set your Accelerated Computing Environment:

Code Block
univ-of-albany-iad2-ace

If done successfully, the CLI will print a summary of your configuration.


You should see your username and lab name in the appropriate spaces. Congrats on setting up the terminal! You are now ready to upload your data.

Uploading Data

From your computer:

In your PowerShell terminal, navigate to where your data is stored. To do this, you can change directories via:

Code Block
cd \Users\your_username\path\to\your\data\

You can also obtain this path by opening a file explorer and copy-pasting the address at the top into your terminal.

...

config set

You will be prompted for the API key that you generated earlier. You can copy (Ctrl+C) and then right-click in your terminal to paste. You will need this API key to upload your data.

Next, it will prompt you for your CLI output type; select ascii by typing in:

Code Block
ascii

If you entered a different option or accidentally skipped this entry, you can invoke 'ngc config set' again to redo your choices. Hitting Enter without any input will skip the prompt, so you do not need to re-enter your API key unless necessary.

Next, you will be asked to enter your organization; enter the following:

Code Block
tt6xxv6at61b

Next, it will ask you to enter your team name. This should be the lab team in which you will be working. Lastly, the terminal will prompt you for 'ace'. Enter the following in order to set up your Accelerated Computing Environment:

Code Block
univ-of-albany-iad2-ace

If done successfully, the CLI will print a summary of your configuration.


You should see your username and lab name in the appropriate spaces. If you mis-entered a value, you can invoke 'ngc config set' again to go through each step. Pressing Enter without any input will not overwrite previously entered information, so you can hit Enter to skip portions you entered correctly. To clear the entire config, invoke 'ngc config clear'.
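
For quick reference, the two configuration commands described above are:

Code Block
ngc config set      # re-run to fix individual values; press Enter to keep a previous entry
ngc config clear    # wipe the saved configuration and start over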

If you do not see 'univ-of-albany-iad2-ace', ITS has not yet added you to a team, so you cannot access this part. Please wait for ITS to add you to a team and then regenerate your API key.

Congrats on setting up the terminal! You are now ready to upload your data.

Uploading Data

In your Linux or PowerShell terminal, navigate to where your data is stored. To do this, you can change directories via:

Code Block
cd /path/to/your/data

On Windows, you can also obtain this path by opening a file explorer and copy-pasting the address at the top into your terminal.


Make sure your data is not zipped, tarred, or archived. If your data is zipped or in .tar format, it will upload as-is and will not be as accessible on the cloud. Unzip/untar your data before uploading (see the sketch below). The option under --source should be the file or folder that contains your files. The --desc option has a descriptor and name section. Lastly, the --share option designates the team to which you want to upload the data, if you are a part of multiple teams. Add a second --share option if you want to upload to another team at the same time.
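
If your data did arrive as an archive, extract it before uploading. A minimal sketch in a Linux or LMM terminal (the archive names here are placeholders):

Code Block
unzip my_data.zip        # extract a .zip archive
tar -xf my_data.tar      # extract a .tar archive
tar -xzf my_data.tar.gz  # extract a gzipped .tar archive

On Windows, you can also right-click the archive in File Explorer and choose 'Extract All'.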

...

An example command to upload a series of CSV files in a folder labeled 'world_happiness' to the team awan_lab would look like the following:

Code Block
titleEXAMPLE
ngc dataset upload --source world_happiness --desc "csvs of world happiness data ranging from 2015 to 2019" world_happiness --share awan_lab

Now if you return to the Base Command dashboard and look under 'Datasets', you should see the file/folder you just uploaded. You can upload more files in the same fashion and they will appear in the same manner. The data you upload in this fashion is immutable, so you do not need to worry about accidentally editing the data during your work. In practice, you should make a copy of the data when executing code and have the code interface with the copy. This would be accomplished in a manner such as:

Code Block
titlepython
linenumberstrue
import pandas as pd
df = pd.read_csv("/mount/data/folder/name_of_your_data.csv")
print(df)

In this fashion, your original data is never truly changed, ensuring the reproducibility of your work. If you are doing data cleaning, you can do so in a workspace and save the result as a new CSV, or clean the data locally if that is easier.
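
For instance, a minimal sketch of that workflow (the file names and mount paths below are illustrative; use whatever mount points you set for your own dataset and workspace):

Code Block
titlepython
import pandas as pd

# load the immutable copy from the read-only dataset mount
df = pd.read_csv("/mount/data/world_happiness/2015.csv")

# hypothetical cleaning step: drop incomplete rows
df_clean = df.dropna()

# write the cleaned copy to the writable workspace as a new CSV
df_clean.to_csv("/mount/workspace/2015_clean.csv", index=False)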

From your lab folder:

...

EXAMPLE
ngc dataset upload --source world_happiness --desc "csvs of world happiness data ranging from 2015 to 2019" world_happiness --share awan_lab

Now if you return to the Base Command dashboard and look under 'Datasets', you should see the file/folder you just uploaded. You can upload more files in the same fashion and they will appear in the same manner. The data you upload in this fashion is immutable, so you do not need to worry about accidentally editing the data during your work. In practice, you should make a copy of the data when executing code and have the code interface with the copy. This would be accomplished in a manner such as:

Code Block
titlepython
linenumberstrue
import pandas as pd
df = pd.read_csv("/mount/data/folder/name_of_your_data.csv")
print(df)

In this fashion, your original data is never truly changed, ensuring the reproducibility of your work. If you are doing data cleaning, you can do so in a workspace and save the result as a new CSV, or clean the data locally if that is easier.

Starting a Job

The default time limit for a job is 30 days. There are two ways to start a job on DGX Cloud: via the web interface or via the CLI. The web interface is graphical, easy to use, and also generates a CLI prompt for you to use if you wish. In this example, we will submit a job to launch a Jupyter Notebook instance where we can access our data from inside the notebook.

From the Web Interface:

In the Create Job section, there are templates created by people in your team, NVIDIA, or UAlbany ITS. In this example, we'll be using a template to launch a Jupyter Notebook session. Here we see a Templates tab and one template available for us to use. This template is named tf_3.1_jupyter_notebook and uses 1 GPU, 1 node, and a container image made by NVIDIA for TensorFlow 3.1.

Once you load the template, you can edit the options in the web interface to suit your needs. You can swap the number of GPUs/nodes or the container type to better suit your computing needs. Scroll down to 'Inputs' to see which datasets and workspaces you can load; you can mount multiple datasets/workspaces. Both datasets and workspaces can contain data that you would use, but there are some key differences to know about them.

Datasets

These are read-only and will be the same between sessions. This is useful for reference data that you don't want to change. Files such as CSVs are useful to put here for your code to reference and load into memory.


Workspaces

Files in this artifact are readable and writable. This space is useful for living files that are being edited and worked on. Files such as Jupyter notebooks (.ipynb), Python scripts, etc., are useful to put here as you work on them between sessions. It is wise to always have a workspace mounted as a save space for in-progress files.

Results

This directory exists in each job and is where you can direct outputs. Doing so ensures that your outputs remain available after the job completes.
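
For example, a minimal sketch (the output file and its contents are hypothetical) of directing output to /results from Python:

Code Block
titlepython
import pandas as pd

# hypothetical metrics produced by your job
metrics = pd.DataFrame({"epoch": [1, 2, 3], "loss": [0.9, 0.5, 0.3]})

# anything written under /results stays available after the job finishes
metrics.to_csv("/results/training_metrics.csv", index=False)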

**Do not save notebooks or files in directories that are not under /mount or under /results**


You can also download datasets and workspaces, and convert them to results as well, to re-upload into other spaces. For example, if you are finished working on a script in a workspace, you might download it and then re-upload it as a dataset so that you have an immutable copy of the script to reference. Once you have selected a dataset or workspace to include, a text box will appear under the 'Mount Point' column. Here you can enter /mount/data, or any other custom path to your data or workspace. In your Jupyter Notebook, you can access this data using this path. Scrolling down even further will show a /results path for any output you may generate.
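
A hedged sketch of the corresponding CLI downloads (the subcommands shown here are assumptions about the NGC CLI and the IDs are placeholders; confirm the exact options with 'ngc dataset download --help' and 'ngc workspace download --help'):

Code Block
ngc dataset download dataset_ID_here        # pull down a dataset by its ID
ngc workspace download workspace_ID_here    # pull down a workspace by its ID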


Containers

The container is similar to a conda environment where packages that are relevant to your work are pre-loaded and ready to use. In this job, we are using nvaie/tensorflow-3-1 with the specified tag. We will open the notebook on port 8888.


...



Code Block
jupyter lab --allow-root -port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/
--or--
jupyter lab --ip=0.0.0.0 --allow-root --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/

...

More information can be found here.

To run a non-interactive script, you would upload your script and do the following:

Code Block
python myscript.py --param1 x --param2 y etc

This is just as you would do normally when submitting a non-interactive batch job.
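
As a hedged sketch, a CLI submission for such a script could mirror the generated command shown under 'From the CLI:' below, swapping the Jupyter command for your script invocation (the job name, script path, and the dataset/workspace IDs here are placeholders):

Code Block
ngc batch run --name "my-script-job" --priority NORMAL --ace univ-of-albany-iad2-ace --instance dgxa100.80g.1.norm --commandline "python /mount/workspace/myscript.py --param1 x --param2 y" --result /results --image "nvaie/tensorflow-3-1:23.03-tf1-nvaie-3.1-py3" --org tt6xxv6at61b --datasetid dataset_ID_here:/mount/data --workspace workspace_ID_here:/mount/workspace:RW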


Starting the Job

Finally, to start the job, scroll down to the Launch Job section. The default runtime for a job is 30 days (2592000 seconds).


Job Priority should always be set to Normal. Changing this priority can disrupt jobs for other users, and if everyone sets the priority to High, then no one is prioritized. Please respect your colleagues by not using higher-priority values in this field. ITS may terminate jobs that disrupt the usability of computing resources.

Preemption Options should be 'Resumable'. This ensures that your job can be paused if the system experiences high usage or needs to be taken down for maintenance.
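
In the CLI, preemption is set with the --preempt flag (visible in the generated command later in this article). A sketch, assuming the resumable option is spelled RESUMABLE:

Code Block
ngc batch run --name test-preempt (job details...) --preempt RESUMABLE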

Job Order will run jobs in the order specified, ranging from 1 to 99. If you submit two jobs, one with order 2 and another with order 1, the job with order 1 will execute first. This ordering is only relevant to you as the user and does not affect other users. The default value if left blank is 50. In the CLI, you can set job order using the --order flag; here is an example job submitted with order 66.

Code Block
ngc batch run --name test-order (job details...) --order=66

...

You can also copy-paste the generated CLI command from the web interface directly into your terminal, though you must specify your dataset and workspace IDs and paths.

From the CLI:

Code Block
ngc batch run --name "Job-univ-of-albany-iad2-ace-622835" --priority NORMAL --order 50 --preempt RUNONCE --min-timeslice 2592000s --total-runtime 2592000s --ace univ-of-albany-iad2-ace --instance dgxa100.80g.1.norm --commandline "jupyter lab --allow-root -port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/" --result /results --image "nvaie/tensorflow-3-1:23.03-tf1-nvaie-3.1-py3" --org tt6xxv6at61b --datasetid dataset_ID_here:/mount/data --workspace workspace_ID_here:/mount/workspace:RW --port 8888

You can specify your datasets and workspace with the respective --datasetid and --workspace flags, and then use the ID for each. The option RW for --workspace denotes read and write permissions.

...

Click the newly created job to see the Overview page. Here you can see the generating command that spawned the job, telemetry monitoring of the job's performance, and open ports for any related services, among many other features. To access the actual Jupyter Notebook, click on the URL/hostname under Service Mapped Ports. Please note that anyone with the URL can access your work and data. ITS reminds you not to share sensitive information such as generated URLs or API keys.

...

Once you open the link, you will be greeted by the standard Jupyter Notebook launch page. From here you can open your uploaded code or start a new .ipynb. Your data will be found at the same path that you specified for mounting; in our case, these are the /mount/data and /mount/workspace folders. If you are making a new Jupyter notebook, save it within /mount/workspace so that you may edit and access it later. You will not be able to save your notebook in the /mount/data folder, as that is read-only.

Lastly, to access your data in a notebook, you can simply invoke:

Code Block
import pandas as pd
data = pd.read_csv('/mount/data/your_data.csv')
data

...

Here we can see that 1 GPU is available. You will see more if you selected multiple GPUs upon job creation.
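
If you want to confirm this yourself from inside the notebook, running nvidia-smi in a notebook cell should list the GPUs attached to the job:

Code Block
!nvidia-smi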

Closing the Notebook

Once you are done working on your code, use File→Save to save your work. Then use File→Shut Down to close the notebook and end the session. It will take a few minutes for the compute resources to become available again as the system saves work and clears memory.

Results & Logs

You can now look at the job and see that it has the 'Finished Success' status. From here, you can download results if any were generated, and also obtain a log file where changes and errors are documented.

...