Jupyter Notebooks are excellent for building a strong foundation for your project. However, they can be resource-intensive and less practical for running multiple models with different hyperparameters simultaneously. In such cases, non-interactive approaches are more efficient. While the steps may vary slightly between DGX Cloud and On-Prem environments, you will need to convert your Notebook into a standard Python script for these methods.

Export a Python Script from Jupyter Notebook Using the Command Line

You can export your notebook to a .py file using the command line with the nbconvert tool. This tool comes pre-installed with Jupyter Notebook, so you don’t need to install anything extra - it should work right out of the box.

Code Block
languagebash
jupyter nbconvert --to python <your_notebook>.ipynb 

Depending on your code, you may need to make adjustments after you export the script. For example, with the Multiclass Classification model we implemented previously, it’s advisable to either comment out charting and output display or redirect these outputs to appropriate files. These and other changes will be addressed next.
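
For instance, a minimal way to keep the charting code while avoiding any interactive display is to switch Matplotlib to a non-interactive backend and replace plt.show() calls with plt.savefig(). The sketch below uses placeholder data and an illustrative filename.

Code Block
languagepy
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; figures render off-screen
import matplotlib.pyplot as plt

plt.plot([0.9, 0.5, 0.3], label='train loss')  # placeholder data
plt.legend()
plt.savefig('plot_train_loss.png')  # write the figure to disk instead of plt.show()
plt.close()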

Refactoring Your Script for a Non-Interactive Approach

When transitioning to a non-interactive approach, it’s important to ensure that results (and any other relevant outputs) are properly stored. With that in mind, we will continue from where we left off with our Multiclass Classification model and refactor it for a non-interactive setup. The goal is to make it flexible enough to run for different hyperparameters in a single execution. To achieve this, the first step is to identify the parameters we want to tune:

  • Learning Rate

  • Momentum

  • Number of Epochs

  • Batch Size

We won’t modify the model’s architecture (such as the number of layers or neurons) but will instead focus on experimenting with different parameters. To make this process more efficient, we’ll create a file to store all the hyperparameter combinations we want to test. This allows us to update hyperparameters without changing the code, and it also makes sharing hyperparameter grids with others much simpler.

Set | Learning Rate | Momentum | Number of Epochs | Batch Size
----|---------------|----------|------------------|-----------
 1  | 1%            | 90%      | 500              | 10
 2  | 2%            | 90%      | 500              | 10
 3  | 1%            | 80%      | 500              | 5
 4  | 2%            | 85%      | 500              | 1
We'll also include a column called set to help identify the results for each parameter combination. You can use formats such as CSV, JSON, or YAML to store hyperparameters. For simplicity, we'll use a CSV file (hyperparameters.csv). The table above can be represented in this CSV format as shown below; note that the percentages are stored as decimal fractions (for example, 1% becomes 0.01 and 90% becomes 0.9).

Code Block
languagetext
set,learning_rate,momentum,n_epochs,batch_size
1,0.01,0.9,500,10
2,0.02,0.9,500,10
3,0.01,0.8,500,5
4,0.02,0.85,500,1
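
If you prefer to generate the grid programmatically rather than writing it by hand, a sketch along the following lines works. Note that it writes the full cross product of the candidate values (which are illustrative), not just the four sets in the table above.

Code Block
languagepy
import csv
from itertools import product

# Illustrative candidate values for each hyperparameter
learning_rates = [0.01, 0.02]
momentums = [0.8, 0.85, 0.9]
epoch_counts = [500]
batch_sizes = [1, 5, 10]

with open('hyperparameters.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['set', 'learning_rate', 'momentum', 'n_epochs', 'batch_size'])
    for i, combo in enumerate(product(learning_rates, momentums, epoch_counts, batch_sizes), start=1):
        writer.writerow([i, *combo])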

To read and iterate through the rows of the CSV file, we can adjust the code as follows.

Code Block
languagepy
import csv

# Read the Hyperparameter Grid from a CSV File
with open('hyperparameters.csv', 'r') as f:
    reader = csv.DictReader(f)
    hyperparam_list = list(reader)

# Loop Over Each Row of Hyperparameters in the CSV File
for params in hyperparam_list:
    hset = params['set']
    lr = float(params['learning_rate'])
    momentum = float(params['momentum'])
    batch_size = int(params['batch_size'])
    n_epochs = int(params['n_epochs'])
    print(f"Training with LR: {lr}, Momentum: {momentum}, Epochs: {n_epochs}, Batch Size: {batch_size}")

We need to update how we save our trained model parameters and results. To do this, we'll use the set column in our CSV file. Once training is complete, the model's parameters and results will be saved to their respective files based on this column.

Code Block
languagepy
# The snippets below run inside the hyperparameter loop and assume torch,
# numpy (as np), and matplotlib.pyplot (as plt) are already imported.

# Save Model State
torch.save(model.state_dict(), hset + '_UA_Multiclass_state.pt')

# Save Loss History
np.save(hset + '_train_loss_hist.npy', np.array(train_loss_hist))
np.save(hset + '_val_loss_hist.npy', np.array(val_loss_hist))

# Save Accuracy History
np.save(hset + '_train_acc_hist.npy', np.array(train_acc_hist))
np.save(hset + '_val_acc_hist.npy', np.array(val_acc_hist))

# Plot Loss Metric
# ...
plt.savefig(hset + '_plot_train_loss.png')

# Plot Accuracy
# ...
plt.savefig(hset + '_plot_acc.png')

# Save Test Inference Data
np.save(hset + '_y_test.npy', np.array(y_test))
np.save(hset + '_y_test_pred.npy', np.array(y_test_pred))

Finally, we will replace the last set of print statements with a method that saves the output to a file, allowing us to analyze it later.

Code Block
languagepy
# Open the File in Write Mode
with open(hset + '_out.txt', 'w') as file:
    file.write('--- Dataset\n\n')
    file.write(f'Number of Instances: {n_instances}\n')
    file.write(f'Number of Features: {n_features}\n')
    file.write(f'Number of Classes: {n_classes}\n\n')
    file.write('--- Training Parameters\n\n')
    file.write(f'Learning Rate: {lr * 100:.1f}%\n')
    file.write(f'Momentum: {momentum * 100:.1f}%\n')
    file.write(f'Epochs: {n_epochs}\n')
    file.write(f'Batch Size: {batch_size}\n\n')
    file.write('--- Hardware\n\n')
    file.write(f'Using Device: {device}\n')
    if device == torch.device('cuda'):
        file.write(f'Available GPUs: {torch.cuda.device_count()}\n')
    file.write('\n--- Results\n\n')
    file.write(f'Training Time: {toc - tic:.2f}s\n')
    file.write(f'Accuracy: {acc * 100:.1f}%\n')
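
Note that toc - tic in the snippet above assumes you captured simple timestamps around the training loop, for example:

Code Block
languagepy
import time

tic = time.time()   # immediately before the training loop
# ... training loop ...
toc = time.time()   # immediately after the training loop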

We are now ready to test our script. If everything looks good, we can proceed with submitting it as a job, either on SLURM (DGX On-Prem) or on NGC (DGX Cloud).

Info

For your convenience, the updated code is also available for download as a Python script (UA_Multiclass.py) along with the CSV file (hyperparameters.csv).

You may notice that the downloadable script goes slightly beyond the instructions provided here: it saves its output in the /results folder and minimizes console output. Be sure to review these notes alongside the script to fully understand how to adapt these techniques to your project.
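
If you want to mirror that behavior in your own refactor, a minimal sketch is to create the folder once at startup and prefix every output path with it; the folder name below simply matches the downloadable script.

Code Block
languagepy
import os

# Create the results folder once at startup
results_dir = 'results'
os.makedirs(results_dir, exist_ok=True)

# Then prefix every output path with it, e.g. (model and hset as in the snippets above):
# torch.save(model.state_dict(),
#            os.path.join(results_dir, hset + '_UA_Multiclass_state.pt'))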

Please note that this guide is not a one-size-fits-all approach. The way to refactor your project will depend on your specific code, the libraries you use, and other factors. Nevertheless, this example should give you a good idea of the possibilities and methods for adapting your project.

DGX On-Prem - Submitting a Job on SLURM

Before moving forward, please ensure you have all the necessary access in place and have reviewed the steps in the DGX On-Prem How-To.

Step 1 - Connect to the Head Node

First, connect to the head node via SSH at dgx-head01.its.albany.edu. Once connected, make sure SLURM is loaded by running the following command.

Code Block
languagebash
module load slurm

It’s recommended to work from your lab directory, so be sure to navigate to it and place your script and the CSV file there.

Code Block
languagebash
cd /network/rit/lab/<your_lab>

This step is especially important: the script will run inside a container with your lab directory mounted into it, so files placed elsewhere won't be visible to the job. Be sure not to skip it.

Step 2 - Create an SBATCH Script

Next, you'll create an SBATCH script that pulls a PyTorch container from the NVIDIA Container Registry and runs your custom script. You can use any text editor, but we'll use VIM for this example.

Code Block
languagebash
vim run.sh

In the editor, enter the following script.

Code Block
languagebash
#!/bin/bash
#SBATCH --job-name=test_job
#SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:24.07-py3'
#SBATCH --container-mounts=/network/rit/lab/<your_lab>:/mnt/<your_lab>

pip install ucimlrepo
python /mnt/<your_lab>/UA_Multiclass.py

To save and exit VIM:

  • Press Esc to return to normal mode

  • Type :wq and press Enter to save and quit

Step 3 - Submit and Monitor the Job

To submit the job, use the following command.

Code Block
languagebash
sbatch run.sh

You can check if your job has been submitted by either:

  • Looking for an output file generated by the job

  • Running squeue to see if your job is in the queue

Please note that jobs that pull containers may take 5-10 minutes to start, as the containers need to be fetched from the registry.

Info

You can also use the command tail -f slurm-<job_id>.out to watch your script log in real time.

To cancel a submitted job, use scancel <job_id>. When your job completes, you should see a /results folder with the outputs of your script.
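
Once the job finishes, you can load the saved metric histories back into Python to compare hyperparameter sets. Below is a minimal sketch, assuming the /results layout and the filenames used earlier.

Code Block
languagepy
import numpy as np

# Compare the final validation accuracy of each hyperparameter set
for hset in ['1', '2', '3', '4']:
    val_acc = np.load(f'results/{hset}_val_acc_hist.npy')
    print(f"Set {hset}: final validation accuracy {val_acc[-1] * 100:.1f}%")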

DGX Cloud - Submitting a Job on NGC

Before moving forward, please ensure you have all the necessary access in place and have reviewed the steps in the How-to: NVIDIA DGX Cloud. There are two ways to submit a job: through the NGC web interface or through the NGC CLI.

To remain consistent with the DGX On-Prem instructions, we will submit this job using the CLI. You can run the NGC CLI from lmm.its.albany.edu, dgx-head01.its.albany.edu, or your own machine (installation required). In this example, we will connect to lmm.its.albany.edu and use the CLI from there. Once again, detailed setup instructions for the NGC CLI are available in How-to: NVIDIA DGX Cloud.

Step 1 - Connect to LMM

First, connect via SSH to lmm.its.albany.edu. If you have any questions on how to connect, please refer to How-to: Connect via SSH. It's recommended to work from your lab directory, so make sure to navigate there first.

Code Block
languagebash
cd /network/rit/lab/<your_lab>

This step is especially important since we’ll be mounting the NGC workspace on this directory, so be sure not to skip it.

Step 2 - Upload the Script to NGC Workspace

Now, from the terminal, first create a directory to mount your workspace to. Then use the ngc workspace mount command to mount the workspace to this new directory, passing --mode RW so the workspace is mounted readable and writable and data can be copied to it.

Code Block
languagebash
cd /network/rit/lab/<your_lab>

mkdir ngc-mount

ngc workspace mount <your_workspace> ./ngc-mount --mode RW

cd ngc-mount

Place your script and the CSV file in the mounted directory. When done, feel free to unmount the workspace.

Code Block
languagebash
ngc workspace unmount ngc-mount

If you have any questions on how to use ngc workspace mount or ngc workspace unmount, please refer to the official NVIDIA documentation on this topic.

Step 3 - Submit and Monitor the Job

The next and final step is to submit your job through the ngc base-command job run command.

Code Block
languagebash
ngc base-command job run \
  --name "<your_job_name>" \
  --priority NORMAL \
  --order 50 \
  --preempt RUNONCE \
  --min-timeslice 2592000s \
  --total-runtime 2592000s \
  --ace univ-of-albany-iad2-ace \
  --instance dgxa100.80g.1.norm \
  --commandline "pip install ucimlrepo; python /mount/workspace/UA_Multiclass.py" \
  --result /results \
  --image "nvidia/pytorch:24.07-py3" \
  --org tt6xxv6at61b \
  --team <your_team_name> \
  --workspace <your_workspace_id>:/mount/workspace:RW

This command schedules a job to run on an NVIDIA DGX A100 GPU instance. The job installs the ucimlrepo package and then runs the Python script (UA_Multiclass.py) stored in your workspace, with normal priority and a runtime limit of 30 days (2592000 seconds). Although we are saving our results within the workspace, you can also store results in the /results directory. Please note that you need to specify your Workspace ID (not its name), which you can find using the following command.

Code Block
languagebash
ngc workspace list

Finally, make sure you have replaced the following placeholders with your own information: <your_job_name>, <your_team_name>, and <your_workspace_id>. If the command executes successfully, you can expect output similar to the following.

Code Block
languagetext
---------------------------------------------------------------------------
 Job Information
   Id: 7010723
   Name: UA_Multiclass
   Number of Replicas: 1
   Job Type: BATCH
   Submitted By: Vieira Sobrinho, Jose
   Order: 50
   Priority: NORMAL
 Job Container Information
   Docker Image URL: nvidia/pytorch:24.07-py3
 Job Commands
   Command: pip install ucimlrepo; python /mount/workspace/UA_Multiclass.py
 Datasets, Workspaces and Results
     Workspace ID: hGsgUUQaQRu3kwe783doHA
       Workspace Name: vieirasobrinho_lab
       Workspace Mount Point: /mount/workspace
       Workspace Mount Mode: RW
     Result Mount Point: /results
 Job Resources
   Instance Type: dgxa100.80g.1.norm
   Instance Details: 1 GPU, 30.0 CPU, 244 GB System Memory
   ACE: univ-of-albany-iad2-ace
   Team: vierasobrino_lab
 Job Labels
   Locked: False
 Job Status
   Created at: 2024-09-11 19:13:56 UTC
   Status: CREATED
   Preempt Class: RUNONCE
   Total Runtime: 30D00H00M00S
   Minimum Timeslice: 30D00H00M00S
---------------------------------------------------------------------------

You can now check the status of your job with ngc batch job status <your_job_id>. Make sure to replace <your_job_id> with the Job ID provided in the previous step. If your job has executed successfully, you can expect output similar to the following.

Code Block
languagetext
Job Status
   Created at: 2024-09-11 19:13:56 UTC
   Started at: 2024-09-11 19:14:13 UTC
   Ended at: 2024-09-11 19:15:14 UTC
   Duration: 01M01S
   Status: FINISHED_SUCCESS
   Status Type: OK
   Preempt Class: RUNONCE
   Total Runtime: 30D00H00M00S
   Minimum Timeslice: 30D00H00M00S 

Your results should now be available in your workspace.