Jupyter Notebooks are excellent for building a strong foundation for your project. However, they can be resource-intensive and are less practical for running multiple models with different hyperparameters at once. In such cases, non-interactive approaches are more efficient. While the steps vary slightly between the DGX Cloud and On-Prem environments, both require you to first convert your notebook into a standard Python script.
Export a Python Script from Jupyter Notebook Using the Command Line
You can export your notebook to a `.py` file from the command line using the `nbconvert` tool. It comes pre-installed with Jupyter Notebook, so you don't need to install anything extra; it should work right out of the box.

```bash
jupyter nbconvert --to python <your_notebook>.ipynb
```
Depending on your code, you may need to make adjustments after you export the script. For example, with the Multiclass Classification model we implemented previously, it’s advisable to either comment out charting and output display or redirect these outputs to appropriate files. These and other changes will be addressed next.
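As a quick preview of that change, here is a minimal sketch, assuming the notebook charts with matplotlib (the loss-history values below are hypothetical stand-ins for those produced by the training loop):

```python
import matplotlib

# Select a non-interactive backend so the script can run without a display
matplotlib.use('Agg')
import matplotlib.pyplot as plt

# Hypothetical loss history produced earlier in the script
train_loss_hist = [0.9, 0.7, 0.5, 0.4]

plt.plot(train_loss_hist)
plt.xlabel('Epoch')
plt.ylabel('Loss')
# plt.show()  # Interactive display: comment out in a non-interactive script
plt.savefig('plot_train_loss.png')  # Redirect the chart to a file instead
```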
Refactoring Your Script for a Non-Interactive Approach
When transitioning to a non-interactive approach, it’s important to ensure that results (and any other relevant outputs) are properly stored. With that in mind, we will continue from where we left off with our Multiclass Classification model and refactor it for a non-interactive setup. The goal is to make it flexible enough to run for different hyperparameters in a single execution. To achieve this, the first step is to identify the parameters we want to tune:
- Learning Rate
- Momentum
- Number of Epochs
- Batch Size
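In the notebook, these values were hard-coded in the training code. For reference, they looked something like the hypothetical snippet below; the variable names match those used in the refactored loop later in this section.

```python
# Notebook-style setup: hyperparameters hard-coded into the training code
lr = 0.01        # learning rate (1%)
momentum = 0.9   # momentum (90%)
n_epochs = 500   # number of epochs
batch_size = 10  # batch size
```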
We won’t modify the model’s architecture (such as the number of layers or neurons) but will instead focus on experimenting with different parameters. To make this process more efficient, we’ll create a file to store all the hyperparameter combinations we want to test. This allows us to update hyperparameters without changing the code, and it also makes sharing hyperparameter grids with others much simpler.
| Set | Learning Rate | Momentum | Number of Epochs | Batch Size |
|-----|---------------|----------|------------------|------------|
| 1   | 1%            | 90%      | 500              | 10         |
| 2   | 2%            | 90%      | 500              | 10         |
| 3   | 1%            | 80%      | 500              | 5          |
| 4   | 2%            | 85%      | 500              | 1          |
We'll also include a column called `set` to help identify the results for each parameter combination. You can use formats such as CSV, JSON, or YAML to store hyperparameters. For simplicity, we'll use a CSV file (`hyperparameters.csv`). The table presented earlier can be represented in this CSV format as shown below.
```csv
set,learning_rate,momentum,n_epochs,batch_size
1,0.01,0.9,500,10
2,0.02,0.9,500,10
3,0.01,0.8,500,5
4,0.02,0.85,500,1
```
To read and iterate through the rows of the CSV file, we can adjust the code as follows.
```python
import csv

# Read the Hyperparameter Grid from a CSV File
with open('hyperparameters.csv', 'r') as f:
    reader = csv.DictReader(f)
    hyperparam_list = list(reader)

# Loop Over Each Row of Hyperparameters in the CSV File
for params in hyperparam_list:
    # DictReader yields strings, so numeric values must be cast explicitly
    hset = params['set']
    lr = float(params['learning_rate'])
    momentum = float(params['momentum'])
    batch_size = int(params['batch_size'])
    n_epochs = int(params['n_epochs'])

    print(f"Training with LR: {lr}, Momentum: {momentum}, Epochs: {n_epochs}, Batch Size: {batch_size}")
```
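One detail worth highlighting: the model and optimizer should be recreated at the start of each loop iteration, so that every hyperparameter set trains from freshly initialized weights rather than continuing from the previous run. A minimal sketch, assuming the original code uses `torch.optim.SGD` (which the learning-rate and momentum parameters suggest), with a hypothetical stand-in for the model built earlier:

```python
import torch.nn as nn
import torch.optim as optim

def build_model():
    # Hypothetical stand-in for the multiclass model defined earlier in the guide
    return nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))

# Inside the loop: fresh model and optimizer for each hyperparameter set
model = build_model()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```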
We also need to update how we save our trained model parameters and results. To do this, we'll use the `set` column from our CSV file. Once training is complete, the model's parameters and results will be saved to their respective files, named after this column.
```python
# Save Model State
torch.save(model.state_dict(), hset + '_UA_Multiclass_state.pt')

# Save Loss History
np.save(hset + '_train_loss_hist.npy', np.array(train_loss_hist))
np.save(hset + '_val_loss_hist.npy', np.array(val_loss_hist))

# Save Accuracy History
np.save(hset + '_train_acc_hist.npy', np.array(train_acc_hist))
np.save(hset + '_val_acc_hist.npy', np.array(val_acc_hist))

# Plot Loss Metric
# ...
plt.savefig(hset + '_plot_train_loss.png')

# Plot Accuracy
# ...
plt.savefig(hset + '_plot_acc.png')

# Save Test Inference Data
np.save(hset + '_y_test.npy', np.array(y_test))
np.save(hset + '_y_test_pred.npy', np.array(y_test_pred))
```
Finally, we will replace the last set of print statements with a method that saves the output to a file, allowing us to analyze it later.
```python
# Open the File in Write Mode
with open(hset + '_out.txt', 'w') as file:
    file.write('--- Dataset\n\n')
    file.write(f'Number of Instances: {n_instances}\n')
    file.write(f'Number of Features: {n_features}\n')
    file.write(f'Number of Classes: {n_classes}\n\n')
    file.write('--- Training Parameters\n\n')
    file.write(f'Learning Rate: {lr * 100:.1f}%\n')
    file.write(f'Momentum: {momentum * 100:.1f}%\n')
    file.write(f'Epochs: {n_epochs}\n')
    file.write(f'Batch Size: {batch_size}\n\n')
    file.write('--- Hardware\n\n')
    file.write(f'Using Device: {device}\n')
    if device == torch.device('cuda'):
        file.write(f'Available GPUs: {torch.cuda.device_count()}\n')
    file.write('\n--- Results\n\n')
    file.write(f'Training Time: {toc - tic:.2f}s\n')
    file.write(f'Accuracy: {acc * 100:.1f}%\n')
```
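The `toc - tic` expression assumes the training loop was timed. If your exported script doesn't already do this, here is a minimal sketch using Python's standard `time` module:

```python
import time

tic = time.time()  # Start the timer just before the training loop
# ... training loop runs here ...
toc = time.time()  # Stop the timer once training completes

print(f'Training Time: {toc - tic:.2f}s')
```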
We are now ready to test our script. If everything looks good, we can proceed with submitting it to SLURM (DGX On-Prem) or NGC (DGX Cloud).
For your convenience, the updated code is also available as a downloadable Python script and a CSV file.

You may notice that the downloadable script is slightly more polished than the version built up here: it saves its output in the /results folder and minimizes console output. Be sure to review these notes alongside the script to fully understand how to adapt these techniques to your project.
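For reference, directing outputs into a dedicated folder only takes a few lines. A minimal sketch, assuming a relative `results` directory (not necessarily how the downloadable script implements it):

```python
import os

# Create the results directory if it doesn't exist yet
results_dir = 'results'
os.makedirs(results_dir, exist_ok=True)

# Build output paths inside the results directory
hset = '1'  # Hypothetical hyperparameter-set identifier from the CSV
out_path = os.path.join(results_dir, hset + '_out.txt')
with open(out_path, 'w') as file:
    file.write('--- Results\n')
```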
Please note that this guide is not a one-size-fits-all approach. The way to refactor your project will depend on your specific code, the libraries you use, and other factors. Nevertheless, this example should give you a good idea of the possibilities and methods for adapting your project.
DGX On-Prem - Submitting a Job on SLURM
Before moving forward, please ensure you have all the necessary access in place and have reviewed the steps in the DGX On-Prem How-To.
If you have any questions on how to connect, please refer to How-to: Connect via SSH.
If you are not familiar with SLURM, please refer to How-to: Scheduling via SLURM.
Step 1 - Connect to the Head Node
First, connect to the head node via SSH at `dgx-head01.its.albany.edu`. Once connected, make sure SLURM is loaded by running the following command.
```bash
module load slurm
```
It’s recommended to work from your lab directory, so be sure to navigate to it and place your script and the CSV file there.
```bash
cd /network/rit/lab/<your_lab>
```
This step is especially important since we'll be using a container to run the script and this directory will be mounted into it, so be sure not to skip it.
Step 2 - Create an SBATCH Script
Next, you'll create an SBATCH script that pulls a PyTorch container from the NVIDIA Container Registry and runs your custom script. You can use any text editor, but we'll use VIM for this example.
```bash
vim run.sh
```
In the editor, enter the following script.
```bash
#!/bin/bash
#SBATCH --job-name=test_job
#SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:24.07-py3'
#SBATCH --container-mounts=/network/rit/lab/<your_lab>:/mnt/<your_lab>

pip install ucimlrepo
python /mnt/<your_lab>/UA_Multiclass.py
```
To save and exit VIM, press `Esc` to return to normal mode, then type `:wq` and press `Enter`.
Step 3 - Submit and Monitor the Job
To submit the job, use the following command.
```bash
sbatch run.sh
```
You can check whether your job has been submitted by either:

- Looking for an output file generated by the job
- Running `squeue` to see if your job is in the queue
Please note that jobs that pull containers may take 5-10 minutes to start, as the containers need to be fetched from the registry.
You can also use the command `tail -f slurm-<job_id>.out` to watch your script's log in real time.
To cancel a submitted job, use `scancel <job_id>`. When your job completes, you should see a /results folder with the outputs of our script.
DGX Cloud - Submitting a Job on NGC
Before moving forward, please ensure you have all the necessary access in place and have reviewed the steps in the How-to: NVIDIA DGX Cloud. There are two ways to submit a job.
- Start a job using the web interface on NGC Base Command.
- Start a job using the NGC CLI.
To remain consistent with the DGX On-Prem instructions, we will submit this job using the CLI. You can run the NGC CLI from `lmm.its.albany.edu`, `dgx-head01.its.albany.edu`, or your own machine (installation required). In this example, we will connect to `lmm.its.albany.edu` and use the CLI from there. Once again, detailed setup instructions for the NGC CLI are available in How-to: NVIDIA DGX Cloud.
Step 1 - Connect to LMM
First, connect via SSH to `lmm.its.albany.edu`. If you have any questions on how to connect, please refer to How-to: Connect via SSH. It's recommended to work from your lab directory, so make sure to navigate there first.
```bash
cd /network/rit/lab/<your_lab>
```
This step is especially important since we’ll be mounting the NGC workspace on this directory, so be sure not to skip it.
Step 2 - Upload the Script to NGC Workspace
Now, from the terminal, first create a directory to mount your workspace to. Then use the `ngc workspace mount` command to mount the workspace to this new directory, passing the `--mode RW` flag to mount it as readable and writable so that data can be copied to it.
```bash
cd /network/rit/lab/<your_lab>
mkdir ngc-mount
ngc workspace mount <your_workspace> ./ngc-mount --mode RW
cd ngc-mount
```
Place your script and the CSV file in the mounted directory. When done, feel free to unmount the workspace.
```bash
ngc workspace unmount ngc-mount
```
If you have any questions on how to use `ngc workspace mount` or `ngc workspace unmount`, please refer to the official NVIDIA documentation on this topic.
Step 3 - Submit and Monitor the Job
The next and final step is to submit your job through the `ngc base-command job run` command.
```bash
ngc base-command job run \
  --name "<your_job_name>" \
  --priority NORMAL \
  --order 50 \
  --preempt RUNONCE \
  --min-timeslice 2592000s \
  --total-runtime 2592000s \
  --ace univ-of-albany-iad2-ace \
  --instance dgxa100.80g.1.norm \
  --commandline "pip install ucimlrepo; python /mount/workspace/UA_Multiclass.py" \
  --result /results \
  --image "nvidia/pytorch:24.07-py3" \
  --org tt6xxv6at61b \
  --team <your_team_name> \
  --workspace <your_workspace_id>:/mount/workspace:RW
```
This command schedules a job to run on an NVIDIA DGX A100 GPU instance. The job installs the `ucimlrepo` package and runs a Python script (`UA_Multiclass.py`) stored in your workspace. The job has normal priority and a runtime limit of 30 days. Although we are saving our results within the workspace, you can also store them in the `/results` directory. Please note that you need to specify your Workspace ID (not its name), which you can find using the following command.
```bash
ngc workspace list
```
Finally, make sure you have replaced the following parameters with your own info: `<your_job_name>`, `<your_team_name>`, and `<your_workspace_id>`. If the command executes successfully, you can expect something similar to the following output.
```
---------------------------------------------------------------------------
 Job Information
   Id: 7010723
   Name: UA_Multiclass
   Number of Replicas: 1
   Job Type: BATCH
   Submitted By: Vieira Sobrinho, Jose
   Order: 50
   Priority: NORMAL
 Job Container Information
   Docker Image URL: nvidia/pytorch:24.07-py3
 Job Commands
   Command: pip install ucimlrepo; python /mount/workspace/UA_Multiclass.py
 Datasets, Workspaces and Results
   Workspace ID: hGsgUUQaQRu3kwe783doHA
   Workspace Name: vieirasobrinho_lab
   Workspace Mount Point: /mount/workspace
   Workspace Mount Mode: RW
   Result Mount Point: /results
 Job Resources
   Instance Type: dgxa100.80g.1.norm
   Instance Details: 1 GPU, 30.0 CPU, 244 GB System Memory
   ACE: univ-of-albany-iad2-ace
   Team: vierasobrino_lab
 Job Labels
   Locked: False
 Job Status
   Created at: 2024-09-11 19:13:56 UTC
   Status: CREATED
   Preempt Class: RUNONCE
   Total Runtime: 30D00H00M00S
   Minimum Timeslice: 30D00H00M00S
---------------------------------------------------------------------------
```
You can now check the status of your job with `ngc batch job status <your_job_id>`. Make sure to replace `<your_job_id>` with the Job ID provided in the previous step. If your job has executed successfully, you can expect output similar to the following.
```
 Job Status
   Created at: 2024-09-11 19:13:56 UTC
   Started at: 2024-09-11 19:14:13 UTC
   Ended at: 2024-09-11 19:15:14 UTC
   Duration: 01M01S
   Status: FINISHED_SUCCESS
   Status Type: OK
   Preempt Class: RUNONCE
   Total Runtime: 30D00H00M00S
   Minimum Timeslice: 30D00H00M00S
```
Your results should now be available in your workspace.