The IBM AIU hardware and software are designed to accelerate inference of Deep Neural Networks (DNNs). The design supports a technique pioneered by IBM called approximate computing, which leverages lower-precision computation and a purpose-built architecture to deliver significant energy-efficiency gains for AI workloads. The chip's streamlined layout simplifies AI workflows by sending data directly from one compute engine to the next.

Things to Know Before You Start

  • The AIU is optimized specifically for the inference process of AI workloads. Any training performed on this environment will rely solely on its CPUs, which may result in suboptimal performance.

  • As per the latest official documentation, the IBM AIU SDK supports only PyTorch models. Models relying on other libraries, such as TensorFlow, will default to the CPU, which may result in suboptimal performance.

The University at Albany has a cluster of IBM AIU prototype chips available for its students, faculty, and researchers to use for work on various AI technologies. Please note that these AIU chips are currently prototypes and that the environment in which they operate is experimental; hardware and software configurations are subject to change as development continues.

Connecting to the AIU Cluster

First, connect to the AIU head cluster (aiu-headnode.its.albany.edu) via SSH using your preferred method. Next, log in to OpenShift using the oc login command.

OpenShift is a Kubernetes-based containerized environment, which allows multiple software environments to co-exist on the same hardware without virtualization. IBM provides two different containers that ride on top of OpenShift and allow utilization of the IBM AIU cards.

  • The E2E Runtime Environment (RTE) Stable container includes the portions of the IBM AIU software stack for running on an IBM AIU card.

  • The E2E Software Development Toolkit (SDK) Stable container includes the portions of the IBM AIU software stack for developing models and running them on an IBM AIU card.

For the purposes of this guide, we will be focusing on the SDK container, which includes a compiler backend for the PyTorch library that allows access to the IBM AIU.

Refer to the example below, and be sure to replace <your_user> and <your_password> with the credentials we provide you.

oc login -u <your_user> -p <your_password> --server=https://api.ua-aiu.its.albany.edu:6443

Then, switch to the project that we will also provide you.

oc project <your_project>

Once connected, navigate to the IBM-AIU-Files directory within your lab directory. This directory will have been placed there by RTS during your onboarding. Please refer to the following example, and be sure to replace <your_lab> with your own lab name.

cd /network/rit/lab/<your_lab>/IBM-AIU-Files

Deploying a Pod

To test deploying a single AIU pod, start by creating a 1aiu.yaml file using the following example YAML. Replace <your pod name> with the desired name for your pod.

apiVersion: v1
kind: Pod
metadata:
  name: <your pod name>
  labels:
    app: <your pod name>
spec:
  securityContext:
    runAsUser: 56551
    runAsGroup: 972
    fsGroup: 3052
  containers:
  - name: c1
    imagePullPolicy: Always
    image: icr.io/ibmaiu/release_2024_08/e2e_stable
    command: ["/usr/bin/pause"] ## keeps the container running so you can exec into it
    workingDir: /tmp/
    resources: ## request one IBM AIU physical function for this container
      requests:
        ibm.com/aiu_pf: 1
      limits:
        ibm.com/aiu_pf: 1
    env:
    - name: HOME
      value: /tmp
    - name: HF_HOME
      value: /tmp/.cache
    - name: FLEX_COMPUTE
      value: "SENTIENT"
    - name: FLEX_DEVICE
      value: "VFIO"
    volumeMounts:
    - name: dev-shm
      mountPath: /dev/shm
    - name: modeldata
      mountPath: /datasets
  volumes:
  - name: dev-shm
    emptyDir:
      medium: Memory ## assumed memory-backed volume for /dev/shm; adjust if your cluster provides one
  - name: modeldata
    persistentVolumeClaim:
      claimName: modelstore
      readOnly: true

The descriptions of the variables and their accepted values are as follows:

Variable        Value          Description
aiu_pf          1              Number of AIU physical functions requested (1 = a single-AIU pod)
HF_HOME         /tmp/.cache    Download/cache path for Hugging Face models
FLEX_COMPUTE    "SENTIENT"     Targets the IBM AIU card
FLEX_DEVICE     "VFIO"         Virtual Function I/O interface into the IBM AIU card

To start your pod, use the following command:

oc create -f 1aiu.yaml

To verify that your pod has been created, you can use:

oc get pod <pod-name>

If the status indicates "Running," as shown below, you can log into your pod. Otherwise, please wait until it transitions to the "Running" state.

NAME                      READY   STATUS    RESTARTS   AGE
jun-pod                   2/2     Running   0          9s
login-node-client-pzmj7   1/1     Running   0          3d4h

Now log in to the pod you created with 1aiu.yaml (replacing <pod-name> with your actual pod name):

oc exec -it <pod-name> -- bash --login

If you log in successfully, you will see a prompt similar to the following:

Defaulted container "c1" out of: c1, aiu-monitor
[56551@jun-pod ~]$

Inference Using PyTorch Models

As part of the software stack within the container, PyTorch 2 is available for use.
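
To confirm the stack is available from inside your pod, a quick sanity check (a minimal sketch; torch_sendnn is the IBM AIU compiler backend used in the examples below) is:

# Verify that PyTorch 2 and the IBM AIU backend import cleanly inside the pod.
import torch
from torch_sendnn import torch_sendnn  # IBM AIU compiler backend shipped in the SDK container

print("PyTorch version:", torch.__version__)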

Custom Models From HuggingFace

This example will allow the user to point to a pre-trained model on HuggingFace and to run that model with an IBM AIU that is assigned to a container within an OpenShift pod. Within the /opt/ibm/aiu/examples directory, there is a torch_roberta.py script that will download, compile, and run a RoBERTa model from HuggingFace. The script provides context to the model along with a question, as shown below.

question, text = "Who was Miss Piggy?", "Miss Piggy was a muppet"

Now you are ready to run the torch_roberta.py script.

python3 /opt/ibm/aiu/examples/torch_roberta.py

The script will generate several pages of output, and will end with the following.

--------------------------------------------------
Answer: " muppet"
==================================================
2025/01/13 16:50:56 503918      WARNING [pf_interface.cpp::~PfInterface_impl:463]       Stoping MSI monitor
2025/01/13 16:50:56 504060      WARNING [pf_msi_monitor.cpp::PollingThread:144] PfMSIMonitor: Ending
2025/01/13 16:50:56 506712      INFO [vfio_hal_mci_ddr_init_util.cpp::ddr_ECC_error_report:1283]        ECC error count:  UE = 0  CE = 0
2025/01/13 16:50:56 506717      WARNING [pf_interface.cpp::stop_monitor_thread:485]     Stopping monitor thread ...
2025/01/13 16:50:56 506857      INFO [monitoring.cpp::closeMetrics:158] Stopping Metrics thread

If you get the answer “muppet” then the model has run successfully. But why? Let’s take a look at the script contents.

from transformers import AutoTokenizer, RobertaForQuestionAnswering
import torch
from torch_sendnn import torch_sendnn

tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
model = RobertaForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")

question, text = "Who was Miss Piggy?", "Miss Piggy was a muppet"

inputs = tokenizer(question, text, return_tensors="pt", max_length=384, padding="max_length")
with torch.no_grad():
    model = torch.compile(model, backend="sendnn")
    #model = torch.compile(model, backend="inductor")
    outputs = model(**inputs)

answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
answer = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

print("-"*50)
print('Answer: "{}"'.format(answer))
print("="*50)

The script pulls in a pre-trained model from the deepset section of HuggingFace and then compiles the model using the torch.compile method. This function optimizes the performance of PyTorch models by compiling them to run faster, applying various optimizations, and leveraging specific hardware capabilities. This makes models more efficient during training and inference.

The backend parameter in torch.compile specifies which optimization backend to use. Different backends apply various optimizations based on factors such as target hardware or the desired balance between speed and accuracy. Choosing the right backend can significantly impact the model's runtime performance. In this scenario, sendnn is a custom backend for the PyTorch library that enables access to the IBM AIU.

You can also run on the CPU instead by changing the torch.compile backend to inductor, or by omitting the backend argument entirely (see line 13).
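
For example, a minimal sketch of the CPU fallback (reusing the model object from the script above):

# Compile for the CPU instead of the AIU; "inductor" is PyTorch's default compiler backend.
model_cpu = torch.compile(model, backend="inductor")
# Omitting the backend argument also selects the default backend.
model_cpu_default = torch.compile(model)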

Another important aspect to highlight is the use of with torch.no_grad() on line 11. This context manager disables gradient calculation, which is useful during inference or model evaluation when gradients are not needed. This reduces memory consumption and speeds up computation. If gradient calculation remains enabled, you will be unable to compile the model with the AIU backend.
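
A quick illustration of the effect (a standalone sketch, unrelated to the RoBERTa model):

import torch

x = torch.ones(3, requires_grad=True)
with torch.no_grad():
    y = x * 2           # no gradient graph is recorded inside the context
print(y.requires_grad)  # False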

For more information on these PyTorch features, please refer to the official PyTorch documentation for torch.compile and torch.no_grad.

If you want to use a different model from HuggingFace, there are several deepset models available. You'll need an account to select a different model and update lines 5 and 6 (the tokenizer and model, respectively) as needed. You may need to adjust your code slightly, but the overall approach remains the same. Feel free to copy and paste torch_roberta.py and use it as a boilerplate for your own tests.
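
For instance, a sketch of that substitution (deepset/tinyroberta-squad2 is used purely as an illustrative choice) replaces only the loading lines:

# Example substitution for lines 5 and 6 of torch_roberta.py; the model name is illustrative.
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("deepset/tinyroberta-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("deepset/tinyroberta-squad2")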

Your Own Models

To leverage the AIU Cluster for the inference part of your AI workloads, you need to disable gradient calculation in the PyTorch model and compile it with the appropriate backend, as outlined in the IBM documentation. The following code snippet provides a simple way to adjust your model code and run it on this hardware.

# Import PyTorch and the IBM AIU Compiler Backend
import torch
from torch_sendnn import torch_sendnn

# Load Your Model
model = torch.load('/path/to/my/model.pt')

# Disable Gradient Calculation
with torch.no_grad():
  # Compile Model with the IBM AIU backend
  modelAIU = torch.compile(model, backend="sendnn")
  # Set Model to Evaluation Mode
  modelAIU.eval()
  # Run Inference (X is your input tensor or batch)
  yAIU = modelAIU(X)

As previously mentioned, the AIU is optimized specifically for the inference process of AI workloads. While it is technically possible to train models in this environment, it is not recommended, as it relies solely on CPUs. This can lead to suboptimal performance, particularly for larger models and datasets.

Wrapping Up

When you finish, make sure to release the allocated resources. Please note that the pod's storage is not persistent: any files you create or modify inside the pod will be lost when it is deleted, so be sure to save your work first. Exit the pod by running exit, then delete the pod with the following command.

oc delete pod <pod-name>
