In this tutorial, you will discover how to use PyTorch to develop neural network models for multi-class classification problems and run them on NVIDIA DGX hardware. This guide will walk you through the fundamentals and provide you with the tools to build machine learning models.
Fundamentals
If you're interested in understanding the fundamentals behind this application, feel free to explore this section. Otherwise, you can jump straight into the code.
Code
First, we need to import the required libraries for this project. In this setup:

- We import PyTorch libraries for building and training our neural network.
- We include data manipulation libraries like NumPy and Pandas.
- We import scikit-learn for data preprocessing and evaluation metrics.
- We import Matplotlib for data visualization.
Make sure to select a PyTorch container (e.g., `nvidia/pytorch:24.07-py3`) in order to have these dependencies automatically resolved.
- DGX Cloud - Select from the container dropdown when creating a job.
- DGX On-Prem - Specify the container image attribute on your SLURM job (`#SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:24.07-py3'`).
```python
import copy
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time
import torch
import torch.nn as nn
import torch.optim as optim
import tqdm
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
```
Before we proceed with our model development, it's crucial to understand the hardware resources available to us. This will allow us to optimize our code accordingly, making the most out of the available resources. There are three possible scenarios we need to account for:
- CPU
- Single GPU
- Multiple GPUs
Understanding our hardware setup allows us to make informed decisions, such as whether to use `DataParallel` or `DistributedDataParallel` for multi-GPU training.
For more information on these two classes, please refer to DataParallel and DistributedDataParallel.
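For reference, a minimal `DistributedDataParallel` setup looks something like the sketch below. This is an illustrative fragment, not part of this tutorial's code: it assumes launching with `torchrun` (which sets the `LOCAL_RANK` environment variable for each process) and uses a placeholder `nn.Linear` model.

```python
# Minimal DistributedDataParallel sketch (assumes launch via
# `torchrun --nproc_per_node=<num_gpus> script.py`, one process per GPU)
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend='nccl')     # NCCL backend for GPU training
local_rank = int(os.environ['LOCAL_RANK'])  # set by torchrun
torch.cuda.set_device(local_rank)

model = nn.Linear(4, 3).to(local_rank)      # placeholder model for illustration
ddp_model = DDP(model, device_ids=[local_rank])

# ...run the usual training loop with ddp_model; gradients are
# synchronized across processes automatically...

dist.destroy_process_group()
```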
When running on a container on either DGX Cloud or On-Prem, we will always have at least one GPU available.
- DGX Cloud - Select the number of GPUs from the compute resource selection when creating a job.
- DGX On-Prem - Specify the number of GPUs attribute on your SLURM job (`#SBATCH --gres=gpu:1`); a minimal batch script combining both directives is sketched below.
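For reference, a minimal SLURM batch script combining the container image and GPU directives might look like the following (the job name and script path are placeholders for your environment):

```bash
#!/bin/bash
#SBATCH --job-name=pytorch-multiclass
#SBATCH --gres=gpu:1
#SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:24.07-py3'

# Run the training script inside the container (placeholder path)
python train.py
```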
From now on, we'll use the `device` variable to ensure our data and model are on the correct hardware for optimal performance.
```python
# Check GPU Availability
if torch.cuda.is_available():
    device = torch.device('cuda')
    print('Using Device:', device)
    print('Available GPUs:', torch.cuda.device_count())
else:
    device = torch.device('cpu')
    print('Using Device:', device)
```
In this tutorial, we will be leveraging the datasets available on the UCI Machine Learning Repository. The UCI ML Repository provides a convenient library to download datasets, making it easy to access a wide range of machine learning problems. To load the dataset, we'll use the `ucimlrepo` library.
Installing the `ucimlrepo` library:

- In a Jupyter Notebook: you don't need to take any action, as `!pip install ucimlrepo` will install the library directly within the notebook.
- In a Python script: before running your script, ensure the library is installed by executing `pip install ucimlrepo` in your terminal.
For this example, we will use the Iris dataset. The goal of this dataset is to classify iris flowers into three species (setosa, versicolor, and virginica) based on the length and width of their sepals and petals.
```python
# Install UCI ML Repo Lib
!pip install ucimlrepo

from ucimlrepo import fetch_ucirepo

# Import Iris from UCI ML Repo
ucirepo = fetch_ucirepo(id=53)

# Data (as Pandas DataFrames)
X = ucirepo.data.features
y = ucirepo.data.targets

# Target Variable (Class)
target = 'class'
```
This code will fetch the dataset and load it into pandas DataFrames. `X` contains the feature data, and `y` contains the target labels.
If you want to try a different UCI dataset, you can do so by changing the `id` attribute in the `fetch_ucirepo` function. Make sure to adjust the `target` column name accordingly in your subsequent code. For example, to use the Musk (Version 2) dataset, simply use `id=75` to run the notebook.
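As a hypothetical variation, swapping in the Musk (Version 2) dataset would look like this; verify the dataset's actual target column name before relying on it:

```python
# Hypothetical variation: load the Musk (Version 2) dataset instead of Iris
ucirepo = fetch_ucirepo(id=75)
X = ucirepo.data.features
y = ucirepo.data.targets
target = 'class'  # adjust if this dataset's target column is named differently
```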
The following code gathers and prints some basic information about the dataset, displays the first few rows, and creates a bar plot of the class distribution. This initial exploration helps us understand the structure and balance of our dataset.
```python
# Gather Dataset Info
n_instances = X.shape[0]
n_features = X.shape[1]
n_classes = y.nunique()[target]

# Print Dataset Info
print('Number of Instances:', n_instances)
print('Number of Features:', n_features)
print('Number of Classes:', n_classes)

# Print Dataset Sample
print('\nDataset Sample\n')
ds_sample = X.copy()
ds_sample[target] = y
ds_sample.sample(n=5)
```
For the Iris dataset, these are the numbers you should expect.
| Number of Instances | Number of Features | Number of Classes |
|---|---|---|
| 150 | 4 | 3 |
Furthermore, we can analyze how instances are distributed among the different classes.
```python
# Plot Instances by Class
counts = ds_sample.groupby(target).size()
colors = plt.cm.Paired(range(len(counts)))
ax = counts.plot(kind='bar', color=colors)
for index, value in enumerate(counts):
    ax.text(index, value, str(value), ha='center', va='bottom')
ax.set_ylabel('Instances')
plt.show()
```
Instances are evenly distributed across the different classes.
Now let's reshape and encode the input (`X`) and output (`y`) data to prepare it optimally for the neural network.
```python
# Reshape Input
X = np.array(X)

# Reshape Output (Label Encode, Then One-Hot Encode)
labelencoder = LabelEncoder()
y = labelencoder.fit_transform(y[target])
a = np.array(y)
b = np.zeros((a.size, a.max() + 1))
b[np.arange(a.size), a] = 1
y = b

# Print Reshaped Input/Output Samples
print('Input Sample\n')
print(X[0:5])
print('\nOutput Sample\n')
print(y[0:5])
```
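As a side note, the same one-hot encoding can be produced directly in PyTorch. A minimal equivalent sketch, assuming `a` still holds the integer labels from above (not used in the rest of the tutorial):

```python
import torch.nn.functional as F

# Equivalent one-hot encoding in PyTorch
y_onehot = F.one_hot(torch.tensor(a), num_classes=n_classes).float()
```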
Next, we convert the data from NumPy arrays into PyTorch tensors, which are the primary data structures used in PyTorch for efficient computation. By converting `X` and `y` into tensors, we enable the neural network to perform operations like matrix multiplication and backpropagation.

Moving the tensors to a device (such as a GPU) with `.to(device)` allows for faster computation, taking advantage of hardware acceleration. This is crucial for training deep learning models efficiently, especially with large datasets or complex architectures, and is therefore a prerequisite for getting the most out of NVIDIA DGX.
```python
# Convert Arrays Into Tensors
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32)

# Move Tensors to Same Hardware
X = X.to(device)
y = y.to(device)
```
Now we need to allocate portions of our dataset for training, testing, and validation of the neural network. The training set will be used to teach the model, the validation set will help tune the model during development, and the test set will evaluate its performance on unseen data. By splitting the dataset:
- 60% of the data is assigned to training (`X_train`, `y_train`), allowing the model to learn patterns and relationships within the data.
- 20% is assigned to validation (`X_val`, `y_val`), enabling us to fine-tune the model and adjust hyperparameters.
- 20% is set aside for testing (`X_test`, `y_test`), allowing us to assess the model's generalization ability on new, unseen data.
Shuffling the data (`shuffle=True`) before splitting helps ensure that the resulting subsets are representative of the overall dataset, reducing the risk of biased results.
This approach first assigns 60% of the data to training, then divides the remaining 40% equally between validation and testing (20% each). Feel free to adjust the `train_size` parameter to see how it impacts the results. As a rule of thumb, a larger training set generally helps the model learn better, but it's important to keep enough data for testing to accurately evaluate performance.
```python
# Split Dataset Into Training and Temporary Sets (Validation + Test)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, train_size=0.6, shuffle=True)

# Split Temporary Set Into Validation and Test Sets (50% for Validation & 50% for Testing)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, shuffle=True)
```
The moment to define our model has finally arrived! We will set up our neural network with an input layer, one hidden layer, and an output layer to handle the classification task.
- Activation Function (ReLU): We use the ReLU function to introduce non-linearity, enabling the network to learn complex patterns.
- Hidden Layer: This layer takes the input features and reduces them to half the number of features (`n_features/2`). The hidden layer is where the network begins to learn abstract representations of the input data.
- Output Layer: Finally, the output layer reduces the hidden layer's outputs to the number of classes (`n_classes`). This layer produces the final predictions for each input.
You can experiment with adding more hidden layers or changing the number of neurons in each layer to see how it impacts the model's learning and performance. Increasing complexity might help the model learn better, but it could also lead to overfitting if not managed carefully.
In the forward pass, the input data is passed through the hidden layer, transformed by the activation function, and then processed by the output layer to produce the model's predictions.
```python
# Neural Network Setup
class Multiclass(nn.Module):
    def __init__(self):
        super().__init__()
        # Activation Function (ReLU)
        self.act = nn.ReLU()
        # Hidden Layer
        self.hidden = nn.Linear(n_features, round(n_features / 2))
        # Output Layer
        self.output = nn.Linear(round(n_features / 2), n_classes)

    def forward(self, x):
        x = self.hidden(x)
        x = self.act(x)
        x = self.output(x)
        return x
```
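If you want to try a deeper architecture, one possible variant is sketched below. This is an illustrative alternative, not used in the rest of the tutorial, and the layer widths are arbitrary choices:

```python
# A deeper variant for experimentation (illustrative only)
class MulticlassDeeper(nn.Module):
    def __init__(self):
        super().__init__()
        self.act = nn.ReLU()
        # Widen first, then compress toward the output
        self.hidden1 = nn.Linear(n_features, n_features * 2)
        self.hidden2 = nn.Linear(n_features * 2, round(n_features / 2))
        self.output = nn.Linear(round(n_features / 2), n_classes)

    def forward(self, x):
        x = self.act(self.hidden1(x))
        x = self.act(self.hidden2(x))
        return self.output(x)
```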
Next, we need to configure the parameters for training our neural network.
- Learning Rate (`lr = 0.01`): This controls how much the network's weights are adjusted with each step. A smaller learning rate means more gradual updates, which can lead to more stable training.
- Momentum (`momentum = 0.9`): Momentum helps accelerate gradient vectors in the right direction, leading to faster convergence.
- Number of Epochs (`n_epochs = 500`): This defines how many times the entire dataset will pass through the network during training. More epochs generally lead to better learning, up to a point where the model might start overfitting.
- Batch Size (`batch_size = 10`): This determines how many samples the network will process before updating its weights. Smaller batch sizes reduce memory usage per step and can help the model generalize better.
Feel free to adjust these parameters to see how they affect the training process and the model's performance. Fine-tuning these hyperparameters is often necessary to achieve the best results.
```python
# Training Parameters
lr = 0.01
momentum = 0.9
n_epochs = 500
batch_size = 10
```
Now we will instantiate our neural network and configure it based on the available hardware to ensure optimal performance.
First, we create an instance of our `Multiclass` neural network. If a GPU is available (`torch.cuda.is_available()`), we transfer the model to the GPU (`model.to(device)`) for faster computation. If multiple GPUs are available, we enable parallel processing with `nn.DataParallel(model)` to further accelerate training.
When running on DGX, either Cloud or On-Prem, parallel processing will be available if the target container has 2 or more GPUs allocated.
We use `CrossEntropyLoss` as our loss function, which is well-suited for multi-class classification tasks, calculating the difference between the predicted and actual labels. Then we initialize the optimizer as Stochastic Gradient Descent (`SGD`) with the previously defined learning rate (`lr`) and momentum. This optimizer will update the model's parameters during training based on the gradients of the loss function.
```python
# Instantiate Model
model = Multiclass()

# Enable Parallel Processing (If Applicable)
if device == torch.device('cuda'):
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)

# Move Model to Same Hardware
model = model.to(device)

# Loss Function & Optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)
```
These steps ensure that the model is ready to begin training efficiently, leveraging the best available hardware and an appropriate optimization strategy.
```python
# Prepare Model and Training Parameters
batches_per_epoch = len(X_train) // batch_size
best_acc = -np.inf
best_weights = None
train_loss_hist = []
train_acc_hist = []
val_loss_hist = []
val_acc_hist = []
```
It's time to begin the training process, where the model will learn from the training data over multiple epochs. The training is conducted over a set number of epochs (`n_epochs`). Each epoch represents one complete pass through the training data. The data is processed in smaller batches (`batch_size`), which helps manage memory usage and can lead to better generalization. After each epoch, the model is evaluated on the validation set to assess its performance.
```python
tic = time.time()

# Training
for epoch in range(n_epochs):
    epoch_loss = []
    epoch_acc = []
    epoch_f1 = []
    # Set Model in Training Mode and Run Through Each Batch
    model.train()
    with tqdm.trange(batches_per_epoch, unit='batch', mininterval=0, disable=True) as bar:
        bar.set_description(f'Epoch {epoch}')
        for i in bar:
            # Take a Batch
            start = i * batch_size
            X_batch = X_train[start:start + batch_size]
            y_batch = y_train[start:start + batch_size]
            # Forward Pass
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            # Backward Pass
            optimizer.zero_grad()
            loss.backward()
            # Update Weights
            optimizer.step()
            # Calculate Metrics
            acc = (torch.argmax(y_pred, 1) == torch.argmax(y_batch, 1)).float().mean()
            epoch_loss.append(float(loss))
            epoch_acc.append(float(acc))
            bar.set_postfix(
                loss=float(loss),
                acc=float(acc),
            )
    # Set Model in Evaluation Mode and Run Through the Validation Set
    model.eval()
    y_pred = model(X_val)
    ce = loss_fn(y_pred, y_val)
    acc = (torch.argmax(y_pred, 1) == torch.argmax(y_val, 1)).float().mean()
    ce = float(ce)
    acc = float(acc)
    train_loss_hist.append(np.mean(epoch_loss))
    train_acc_hist.append(np.mean(epoch_acc))
    val_loss_hist.append(ce)
    val_acc_hist.append(acc)
    if acc > best_acc:
        best_acc = acc
        best_weights = copy.deepcopy(model.state_dict())
    print(f'Epoch {epoch} validation: Cross-entropy={ce:.2f}, Accuracy={acc * 100:.1f}%')

toc = time.time()
print('Training Completed in %0.2fs' % (toc - tic))
```
This process iteratively improves the model's ability to generalize to unseen data by refining its parameters based on the training data while monitoring its performance on the validation data. After training is completed, we restore the best model weights identified during the training process. This ensures that we are using the most effective version of the model for any further tasks.
```python
# Restore Best Weights
model.load_state_dict(best_weights)
```
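The test split (`X_test`, `y_test`) was set aside earlier but is not used by the training loop, which monitors validation metrics only. Below is a minimal sketch of a final evaluation on this held-out data; it is an assumed addition, reusing the `f1_score` import from the top of the notebook:

```python
# Evaluate the Restored Model on the Held-Out Test Set
model.eval()
with torch.no_grad():
    y_pred_test = model(X_test)

test_acc = (torch.argmax(y_pred_test, 1) == torch.argmax(y_test, 1)).float().mean()
test_f1 = f1_score(torch.argmax(y_test, 1).cpu(), torch.argmax(y_pred_test, 1).cpu(), average='macro')
print(f'Test Accuracy: {float(test_acc) * 100:.1f}%')
print(f'Test F1 (macro): {test_f1:.2f}')
```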
Finally, we visualize the results and print the overall training time.
- Adding Data Labels: The `add_data_labels` function adds numerical labels to specific data points on the plots to make it easier to interpret the results visually.
- Plotting Loss: We plot the cross-entropy loss over the epochs for both the training and validation sets. This helps us visualize how the model's error decreased during training and how well it generalized to unseen data. Adding data labels highlights the loss at regular intervals.
- Plotting Accuracy: Similarly, we plot the accuracy for both the training and validation sets over the epochs. This shows how the model's ability to correctly classify inputs improved over time. Data labels are added to display accuracy values at regular intervals.
```python
# Function to Add Data Labels
plot_interval = n_epochs if n_epochs < 10 else round(n_epochs / 10)

def add_data_labels(x, y, interval=plot_interval):
    for i, value in enumerate(y):
        if i % interval == 0:
            plt.text(i, value, f'{value:.2f}', ha='center')
```
```python
# Plot Loss Metric
plt.plot(train_loss_hist, label='train')
plt.plot(val_loss_hist, label='val')
add_data_labels(range(len(train_loss_hist)), train_loss_hist)
add_data_labels(range(len(val_loss_hist)), val_loss_hist)
plt.xlabel('Epochs')
plt.ylabel('Cross Entropy')
plt.legend()
plt.show()
```
```python
# Plot Accuracy
plt.plot(train_acc_hist, label='train')
plt.plot(val_acc_hist, label='val')
add_data_labels(range(len(train_acc_hist)), train_acc_hist)
add_data_labels(range(len(val_acc_hist)), val_acc_hist)
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
```
These visualizations and metrics provide a comprehensive overview of the model's performance, making it easier to assess whether the training process was successful and where further improvements might be needed.
```python
print('--- Dataset\n')
print('Number of Instances:', n_instances)
print('Number of Features:', n_features)
print('Number of Classes:', n_classes)

print('\n--- Training Parameters\n')
print(f'Learning Rate: {lr * 100:.1f}%')
print(f'Momentum: {momentum * 100:.1f}%')
print('Epochs:', n_epochs)
print('Batch Size:', batch_size)

print('\n--- Hardware\n')
print('Using Device:', device)
if device == torch.device('cuda'):
    print('Available GPUs:', torch.cuda.device_count())

print('\n--- Results\n')
print('Training Time: %0.2fs' % (toc - tic))
print(f'Accuracy: {acc * 100:.1f}%')
```
If the code runs successfully, these are the kind of results you should expect.
```
--- Dataset

Number of Instances: 150
Number of Features: 4
Number of Classes: 3

--- Training Parameters

Learning Rate: 1.0%
Momentum: 90.0%
Epochs: 500
Batch Size: 10

--- Hardware

Using Device: cuda
Available GPUs: 1

--- Results

Training Time: 0.57s
Accuracy: 100.0%
```
To achieve the best results in the shortest time, ensure your code is properly utilizing the GPU. GPU workloads can be dramatically faster, often by an order of magnitude, compared to CPU-based processing.
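One simple way to verify that the GPU is actually being used is to watch `nvidia-smi` from a separate terminal on the same node while training runs:

```bash
# Refresh GPU utilization and memory usage every second
watch -n 1 nvidia-smi
```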
For your convenience, this code is also available as a downloadable Jupyter Notebook.
If you have any questions regarding how to run a Jupyter notebook on NVIDIA DGX, please refer to the following resources.