In this tutorial, you will discover how to use PyTorch to develop neural network models for multi-class classification problems and run them on NVIDIA DGX hardware. This guide will walk you through the fundamentals and provide you with the tools to build machine learning models.
Fundamentals
If you're interested in understanding the fundamentals behind this application, feel free to explore this section. Otherwise, you can jump straight into the code.
Code
First, we need to import the required libraries for this project. In this setup:

- We import PyTorch libraries for building and training our neural network.
- We include data manipulation libraries like NumPy and Pandas.
- We import scikit-learn for data preprocessing and evaluation metrics.
- We import Matplotlib for data visualization.
Make sure to select a PyTorch container (e.g., `nvidia/pytorch:24.07-py3`) in order to have these dependencies automatically resolved.
- DGX Cloud - Select from the container dropdown when creating a job.
- DGX On-Prem - Specify the container image attribute on your SLURM job (`#SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:24.07-py3'`).
```python
import copy
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time
import torch
import torch.nn as nn
import torch.optim as optim
import tqdm
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
```
Before we proceed with our model development, it's crucial to understand the hardware resources available to us. This will allow us to optimize our code accordingly, making the most out of the available resources. There are three possible scenarios we need to account for:
- CPU
- Single GPU
- Multiple GPUs
Understanding our hardware setup allows us to make informed decisions, such as whether to use `DataParallel` or `DistributedDataParallel` for multi-GPU training.
For more information on these two classes, please refer to DataParallel and DistributedDataParallel.
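For reference, a minimal `DistributedDataParallel` setup looks something like the sketch below. This is an illustrative fragment, not part of this tutorial's code: it assumes launching with `torchrun` (which sets the `LOCAL_RANK` environment variable for each process) and uses a placeholder `nn.Linear` model.

```python
# Minimal DistributedDataParallel sketch (assumes launch via
# `torchrun --nproc_per_node=<num_gpus> script.py`, one process per GPU)
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend='nccl')     # NCCL backend for GPU training
local_rank = int(os.environ['LOCAL_RANK'])  # set by torchrun
torch.cuda.set_device(local_rank)

model = nn.Linear(4, 3).to(local_rank)      # placeholder model for illustration
ddp_model = DDP(model, device_ids=[local_rank])

# ...run the usual training loop with ddp_model; gradients are
# synchronized across processes automatically...

dist.destroy_process_group()
```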
When running on a container on either DGX Cloud or On-Prem, we will always have at least one GPU available.
- DGX Cloud - Select the number of GPUs from the compute resource selection when creating a job.
- DGX On-Prem - Specify the number of GPUs attribute on your SLURM job (`#SBATCH --gres=gpu:1`); a minimal batch script combining both directives is sketched below.
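For reference, a minimal SLURM batch script combining the container image and GPU directives might look like the following (the job name and script path are placeholders for your environment):

```bash
#!/bin/bash
#SBATCH --job-name=pytorch-multiclass
#SBATCH --gres=gpu:1
#SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:24.07-py3'

# Run the training script inside the container (placeholder path)
python train.py
```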
From now on, we'll use the `device` variable to ensure our data and model are on the correct hardware for optimal performance.
```python
# Check GPU Availability
if torch.cuda.is_available():
    device = torch.device('cuda')
    print('Using Device:', device)
    print('Available GPUs:', torch.cuda.device_count())
else:
    device = torch.device('cpu')
    print('Using Device:', device)
```
In this tutorial, we will be leveraging the datasets available on the UCI Machine Learning Repository. The UCI ML Repository provides a convenient library to download datasets, making it easy to access a wide range of machine learning problems. To load the dataset, we'll use the `ucimlrepo` library.
Installing the `ucimlrepo` library:

- In a Jupyter Notebook: you don't need to take any action, as `!pip install ucimlrepo` will install the library directly within the notebook.
- In a Python script: before running your script, ensure the library is installed by executing `pip install ucimlrepo` in your terminal.
For this example, we will use the Iris dataset. The goal of this dataset is to classify iris flowers into three species (setosa, versicolor, and virginica) based on the length and width of their sepals and petals.
```python
# Install UCI ML Repo Lib
!pip install ucimlrepo

from ucimlrepo import fetch_ucirepo

# Import Iris from UCI ML Repo
ucirepo = fetch_ucirepo(id=53)

# Data (as Pandas DataFrames)
X = ucirepo.data.features
y = ucirepo.data.targets

# Target Variable (Class)
target = 'class'
```
This code will fetch the dataset and load it into pandas DataFrames. `X` contains the feature data, and `y` contains the target labels.
If you want to try a different UCI dataset, you can do so by changing the `id` attribute in the `fetch_ucirepo` function. Make sure to adjust the `target` column name accordingly in your subsequent code. For example, to use the Musk (Version 2) dataset, simply use `id=75` to run the notebook.
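As a hypothetical variation, swapping in the Musk (Version 2) dataset would look like this; verify the dataset's actual target column name before relying on it:

```python
# Hypothetical variation: load the Musk (Version 2) dataset instead of Iris
ucirepo = fetch_ucirepo(id=75)
X = ucirepo.data.features
y = ucirepo.data.targets
target = 'class'  # adjust if this dataset's target column is named differently
```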
The following code gathers and prints some basic information about the dataset, displays the first few rows, and creates a bar plot of the class distribution. This initial exploration helps us understand the structure and balance of our dataset.
```python
# Gather Dataset Info
n_instances = X.shape[0]
n_features = X.shape[1]
n_classes = y.nunique()[target]

# Print Dataset Info
print('Number of Instances:', n_instances)
print('Number of Features:', n_features)
print('Number of Classes:', n_classes)

# Print Dataset Sample
print('\nDataset Sample\n')
ds_sample = X.copy()
ds_sample[target] = y
ds_sample.sample(n=5)
```
For the Iris dataset, these are the numbers you should expect.
| Number of Instances | Number of Features | Number of Classes |
|---|---|---|
| 150 | 4 | 3 |
Furthermore, we can analyze how instances are distributed among the different classes.
```python
# Plot Instances by Class
counts = ds_sample.groupby(target).size()
colors = plt.cm.Paired(range(len(counts)))
ax = counts.plot(kind='bar', color=colors)
for index, value in enumerate(counts):
    ax.text(index, value, str(value), ha='center', va='bottom')
ax.set_ylabel('Instances')
plt.show()
```
Instances are evenly distributed across the different classes.
Now let's reshape and encode the input (`X`) and output (`y`) data to prepare it optimally for the neural network.
```python
# Reshape Input
X = np.array(X)

# Reshape Output (Label Encode, Then One-Hot Encode)
labelencoder = LabelEncoder()
y = labelencoder.fit_transform(y[target])
a = np.array(y)
b = np.zeros((a.size, a.max() + 1))
b[np.arange(a.size), a] = 1
y = b

# Print Reshaped Input/Output Samples
print('Input Sample\n')
print(X[0:5])
print('\nOutput Sample\n')
print(y[0:5])
```
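As a side note, the same one-hot encoding can be produced directly in PyTorch. A minimal equivalent sketch, assuming `a` still holds the integer labels from above (not used in the rest of the tutorial):

```python
import torch.nn.functional as F

# Equivalent one-hot encoding in PyTorch
y_onehot = F.one_hot(torch.tensor(a), num_classes=n_classes).float()
```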
Next, we convert the data from NumPy arrays into PyTorch tensors, which are the primary data structures used in PyTorch for efficient computation. By converting `X` and `y` into tensors, we enable the neural network to perform operations like matrix multiplication and backpropagation.

Moving the tensors to a device (such as a GPU) with `.to(device)` allows for faster computation, taking advantage of hardware acceleration. This is crucial for training deep learning models efficiently, especially with large datasets or complex architectures, and is therefore a prerequisite for getting the most out of NVIDIA DGX.
```python
# Convert Arrays Into Tensors
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32)

# Move Tensors to Same Hardware
X = X.to(device)
y = y.to(device)
```
Now we need to allocate portions of our dataset for training, testing, and validation of the neural network. The training set will be used to teach the model, the validation set will help tune the model during development, and the test set will evaluate its performance on unseen data. By splitting the dataset:
- 60% of the data is assigned to training (`X_train`, `y_train`), allowing the model to learn patterns and relationships within the data.
- 20% is assigned to validation (`X_val`, `y_val`), enabling us to fine-tune the model and adjust hyperparameters.
- 20% is set aside for testing (`X_test`, `y_test`), allowing us to assess the model's generalization ability on new, unseen data.
Shuffling the data (`shuffle=True`) before splitting helps ensure that the resulting subsets are representative of the overall dataset, reducing the risk of biased results.
This approach first assigns 60% of the data to training, then divides the remaining 40% equally between validation and testing (20% each). Feel free to adjust the `train_size` parameter to see how it impacts the results. As a rule of thumb, a larger training set generally helps the model learn better, but it's important to keep enough data for testing to accurately evaluate performance.
```python
# Split Dataset Into Training and Temporary Sets (Validation + Test)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, train_size=0.6, shuffle=True)

# Split Temporary Set Into Validation and Test Sets (50% for Validation & 50% for Testing)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, shuffle=True)
```
The moment to define our model has finally arrived! We will set up our neural network with an input layer, one hidden layer, and an output layer to handle the classification task.
- Activation Function (ReLU): We use the ReLU function to introduce non-linearity, enabling the network to learn complex patterns.
- Hidden Layer: This layer takes the input features and reduces them to half the number of features (`n_features/2`). The hidden layer is where the network begins to learn abstract representations of the input data.
- Output Layer: Finally, the output layer reduces the hidden layer's outputs to the number of classes (`n_classes`). This layer produces the final predictions for each input.
You can experiment with adding more hidden layers or changing the number of neurons in each layer to see how it impacts the model's learning and performance. Increasing complexity might help the model learn better, but it could also lead to overfitting if not managed carefully.
In the forward pass, the input data is passed through the hidden layer, transformed by the activation function, and then processed by the output layer to produce the model's predictions.
```python
# Neural Network Setup
class Multiclass(nn.Module):
    def __init__(self):
        super().__init__()
        # Activation Function (ReLU)
        self.act = nn.ReLU()
        # Hidden Layer
        self.hidden = nn.Linear(n_features, round(n_features / 2))
        # Output Layer
        self.output = nn.Linear(round(n_features / 2), n_classes)

    def forward(self, x):
        x = self.hidden(x)
        x = self.act(x)
        x = self.output(x)
        return x
```
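If you want to try a deeper architecture, one possible variant is sketched below. This is an illustrative alternative, not used in the rest of the tutorial, and the layer widths are arbitrary choices:

```python
# A deeper variant for experimentation (illustrative only)
class MulticlassDeeper(nn.Module):
    def __init__(self):
        super().__init__()
        self.act = nn.ReLU()
        # Widen first, then compress toward the output
        self.hidden1 = nn.Linear(n_features, n_features * 2)
        self.hidden2 = nn.Linear(n_features * 2, round(n_features / 2))
        self.output = nn.Linear(round(n_features / 2), n_classes)

    def forward(self, x):
        x = self.act(self.hidden1(x))
        x = self.act(self.hidden2(x))
        return self.output(x)
```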
Next, we need to configure the parameters for training our neural network.
- Learning Rate (`lr = 0.01`): This controls how much the network's weights are adjusted with each step. A smaller learning rate means more gradual updates, which can lead to more stable training.
- Momentum (`momentum = 0.9`): Momentum helps accelerate gradient vectors in the right direction, leading to faster convergence.
- Number of Epochs (`n_epochs = 500`): This defines how many times the entire dataset will pass through the network during training. More epochs generally lead to better learning, up to a point where the model might start overfitting.
- Batch Size (`batch_size = 10`): This determines how many samples the network will process before updating its weights. Smaller batch sizes reduce memory usage per step and can help the model generalize better.
Feel free to adjust these parameters to see how they affect the training process and the model's performance. Fine-tuning these hyperparameters is often necessary to achieve the best results.
```python
# Training Parameters
lr = 0.01
momentum = 0.9
n_epochs = 500
batch_size = 10
```
Now we will instantiate our neural network and configure it based on the available hardware to ensure optimal performance.
First, we create an instance of our `Multiclass` neural network. If a GPU is available (`torch.cuda.is_available()`), we transfer the model to the GPU (`model.to(device)`) for faster computation. If multiple GPUs are available, we enable parallel processing with `nn.DataParallel(model)` to further accelerate training.
When running on DGX, either Cloud or On-Prem, parallel processing will be available if the target container has 2 or more GPUs allocated.
We use `CrossEntropyLoss` as our loss function, which is well-suited for multi-class classification tasks, calculating the difference between the predicted and actual labels. Then we initialize the optimizer as Stochastic Gradient Descent (`SGD`) with the previously defined learning rate (`lr`) and momentum. This optimizer will update the model's parameters during training based on the gradients of the loss function.
```python
# Instantiate Model
model = Multiclass()

# Enable Parallel Processing (If Applicable)
if device == torch.device('cuda'):
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)

# Move Model to Same Hardware
model = model.to(device)

# Loss Function & Optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)
```
These steps ensure that the model is ready to begin training efficiently, leveraging the best available hardware and an appropriate optimization strategy.
```python
# Prepare Model and Training Parameters
batches_per_epoch = len(X_train) // batch_size
best_acc = -np.inf
best_weights = None
train_loss_hist = []
train_acc_hist = []
val_loss_hist = []
val_acc_hist = []
```
It's time to begin the training process, where the model will learn from the training data over multiple epochs. The training is conducted over a set number of epochs (`n_epochs`). Each epoch represents one complete pass through the training data. The data is processed in smaller batches (`batch_size`), which helps manage memory usage and can lead to better generalization. After each epoch, the model is evaluated on the validation set to assess its performance.
```python
tic = time.time()

# Training
for epoch in range(n_epochs):
    epoch_loss = []
    epoch_acc = []
    epoch_f1 = []
    # Set Model in Training Mode and Run Through Each Batch
    model.train()
    with tqdm.trange(batches_per_epoch, unit='batch', mininterval=0, disable=True) as bar:
        bar.set_description(f'Epoch {epoch}')
        for i in bar:
            # Take a Batch
            start = i * batch_size
            X_batch = X_train[start:start + batch_size]
            y_batch = y_train[start:start + batch_size]
            # Forward Pass
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            # Backward Pass
            optimizer.zero_grad()
            loss.backward()
            # Update Weights
            optimizer.step()
            # Calculate Metrics
            acc = (torch.argmax(y_pred, 1) == torch.argmax(y_batch, 1)).float().mean()
            epoch_loss.append(float(loss))
            epoch_acc.append(float(acc))
            bar.set_postfix(
                loss=float(loss),
                acc=float(acc),
            )
    # Set Model in Evaluation Mode and Run Through the Validation Set
    model.eval()
    y_pred = model(X_val)
    ce = loss_fn(y_pred, y_val)
    acc = (torch.argmax(y_pred, 1) == torch.argmax(y_val, 1)).float().mean()
    ce = float(ce)
    acc = float(acc)
    train_loss_hist.append(np.mean(epoch_loss))
    train_acc_hist.append(np.mean(epoch_acc))
    val_loss_hist.append(ce)
    val_acc_hist.append(acc)
    if acc > best_acc:
        best_acc = acc
        best_weights = copy.deepcopy(model.state_dict())
    print(f'Epoch {epoch} validation: Cross-entropy={ce:.2f}, Accuracy={acc * 100:.1f}%')

toc = time.time()
print('Training Completed in %0.2fs' % (toc - tic))
```
This process iteratively improves the model's ability to generalize to unseen data by refining its parameters based on the training data while monitoring its performance on the validation data. After training is completed, we restore the best model weights identified during the training process. This ensures that we are using the most effective version of the model for any further tasks.
```python
# Restore Best Weights
model.load_state_dict(best_weights)
```
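The test split (`X_test`, `y_test`) was set aside earlier but is not used by the training loop, which monitors validation metrics only. Below is a minimal sketch of a final evaluation on this held-out data; it is an assumed addition, reusing the `f1_score` import from the top of the notebook:

```python
# Evaluate the Restored Model on the Held-Out Test Set
model.eval()
with torch.no_grad():
    y_pred_test = model(X_test)

test_acc = (torch.argmax(y_pred_test, 1) == torch.argmax(y_test, 1)).float().mean()
test_f1 = f1_score(torch.argmax(y_test, 1).cpu(), torch.argmax(y_pred_test, 1).cpu(), average='macro')
print(f'Test Accuracy: {float(test_acc) * 100:.1f}%')
print(f'Test F1 (macro): {test_f1:.2f}')
```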
Finally, we visualize the results and print the overall training time.
- Adding Data Labels: The `add_data_labels` function adds numerical labels to specific data points on the plots to make it easier to interpret the results visually.
- Plotting Loss: We plot the cross-entropy loss over the epochs for both the training and validation sets. This helps us visualize how the model's error decreased during training and how well it generalized to unseen data. Adding data labels highlights the loss at regular intervals.
- Plotting Accuracy: Similarly, we plot the accuracy for both the training and validation sets over the epochs. This shows how the model's ability to correctly classify inputs improved over time. Data labels are added to display accuracy values at regular intervals.
```python
# Function to Add Data Labels
plot_interval = n_epochs if n_epochs < 10 else round(n_epochs / 10)

def add_data_labels(x, y, interval=plot_interval):
    for i, value in enumerate(y):
        if i % interval == 0:
            plt.text(i, value, f'{value:.2f}', ha='center')
```
```python
# Plot Loss Metric
plt.plot(train_loss_hist, label='train')
plt.plot(val_loss_hist, label='val')
add_data_labels(range(len(train_loss_hist)), train_loss_hist)
add_data_labels(range(len(val_loss_hist)), val_loss_hist)
plt.xlabel('Epochs')
plt.ylabel('Cross Entropy')
plt.legend()
plt.show()
```
```python
# Plot Accuracy
plt.plot(train_acc_hist, label='train')
plt.plot(val_acc_hist, label='val')
add_data_labels(range(len(train_acc_hist)), train_acc_hist)
add_data_labels(range(len(val_acc_hist)), val_acc_hist)
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
```
These visualizations and metrics provide a comprehensive overview of the model's performance, making it easier to assess whether the training process was successful and where further improvements might be needed.
```python
print('--- Dataset\n')
print('Number of Instances:', n_instances)
print('Number of Features:', n_features)
print('Number of Classes:', n_classes)

print('\n--- Training Parameters\n')
print(f'Learning Rate: {lr * 100:.1f}%')
print(f'Momentum: {momentum * 100:.1f}%')
print('Epochs:', n_epochs)
print('Batch Size:', batch_size)

print('\n--- Hardware\n')
print('Using Device:', device)
if device == torch.device('cuda'):
    print('Available GPUs:', torch.cuda.device_count())

print('\n--- Results\n')
print('Training Time: %0.2fs' % (toc - tic))
print(f'Accuracy: {acc * 100:.1f}%')
```
If the code runs successfully, these are the kind of results you should expect.
```
--- Dataset

Number of Instances: 150
Number of Features: 4
Number of Classes: 3

--- Training Parameters

Learning Rate: 1.0%
Momentum: 90.0%
Epochs: 500
Batch Size: 10

--- Hardware

Using Device: cuda
Available GPUs: 1

--- Results

Training Time: 0.57s
Accuracy: 100.0%
```
To achieve the best results in the shortest time, ensure your code is properly utilizing the GPU. GPU workloads can be dramatically faster, often by an order of magnitude, compared to CPU-based processing.
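One simple way to verify that the GPU is actually being used is to watch `nvidia-smi` from a separate terminal on the same node while training runs:

```bash
# Refresh GPU utilization and memory usage every second
watch -n 1 nvidia-smi
```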
For your convenience, this code is also available as a downloadable Jupyter Notebook.
If you have any questions regarding how to run a Jupyter notebook on NVIDIA DGX, please refer to the following resources.