The IBM AIU hardware and software are designed to accelerate inference of Deep Neural Networks (DNNs). The design supports a technique pioneered by IBM called approximate computing, which leverages lower-precision computation and a purpose-built architecture, resulting in significant energy-efficiency gains for AI workloads. The simple layout is designed to streamline AI workflows by sending data directly from one compute engine to the next.
Things to Know Before You Start
The AIU is optimized specifically for the inference stage of AI workloads. Any training performed in this environment will rely solely on its CPUs, which may result in suboptimal performance.
As per the latest official documentation, the IBM AIU SDK supports only PyTorch models. Models relying on other libraries, such as TensorFlow, will default to the CPU, which may result in suboptimal performance.
The University at Albany has a cluster of IBM AIU prototype chips available for its students, faculty, and researchers to work on various AI technologies. Please note that these AIU chips are currently prototypes, and the environment in which they operate is experimental. As such, hardware and software configurations are subject to change as development continues.
Connecting to the AIU Cluster
First, connect to the AIU cluster's head node (aiu-headnode.its.albany.edu) via SSH using your preferred method. Next, log in to OpenShift using the oc login command.
OpenShift is a Kubernetes-based containerized environment. This allows multiple software environments to co-exist on the same hardware without virtualization. IBM provides two different containers that ride on top of OpenShift and allow utilization of the IBM AIU cards.
The E2E Runtime Environment (RTE) Stable container includes the portions of the IBM AIU software stack for running on an IBM AIU card.
The E2E Software Development Toolkit (SDK) Stable container includes the portions of the IBM AIU software stack for developing and compiling models to run on an IBM AIU card.
For the purposes of this guide, we will be focusing on the SDK container, which includes a compiler backend for the PyTorch library that allows access to the IBM AIU.
Refer to the example below, and be sure to replace <NetID> and <your_password> with the credentials we provide you.
oc login -u <NetID> --server=https://api.ua-aiu.its.albany.edu:6443
Then, switch to the project that we will also provide to you.
oc project <your_project>
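If you are unsure which project is currently selected, running oc project with no arguments prints the active project.

oc project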
Once connected, navigate to the IBM AIU Files directory within your lab directory. This directory will have been placed there by RTS during your onboarding. Please refer to the following example, and be sure to replace <your_lab> with your own lab name.
cd /network/rit/lab/<your_lab>/IBM-AIU-Files
Deploying a Pod
To test deploying a single AIU pod, start by creating a 1aiu.yaml file using the following example YAML. Replace <your pod name> with the desired name for your pod.
apiVersion: v1
kind: Pod
metadata:
  name: <your pod name>
  labels:
    app: <your pod name>
spec:
  securityContext:
    runAsUser: 56551
    runAsGroup: 972
    fsGroup: 3052
  containers:
    - name: c1
      imagePullPolicy: Always
      image: icr.io/ibmaiu/release_2024_08/e2e_stable
      command: ["/usr/bin/pause"]   ## starts the pod
      workingDir: /tmp/
      resources:                    ## requests one IBM AIU for this pod
        requests:
          ibm.com/aiu_pf: 1
        limits:
          ibm.com/aiu_pf: 1
      env:
        - name: HOME
          value: /tmp
        - name: HF_HOME
          value: /tmp/.cache
        - name: FLEX_COMPUTE
          value: "SENTIENT"
        - name: FLEX_DEVICE
          value: "VFIO"
      volumeMounts:
        - name: dev-shm             ## shared-memory mount
          mountPath: /dev/shm
        - name: modeldata
          mountPath: /datasets
  volumes:
    - name: dev-shm                 ## assumed: in-memory emptyDir backing /dev/shm
      emptyDir:
        medium: Memory
    - name: modeldata
      persistentVolumeClaim:
        claimName: modelstore
        readOnly: true
The descriptions of the variables and the accepted values are as follows:
Variable | Value | Description
---|---|---
aiu_pf | 1 | Requests a single AIU for the pod
HF_HOME | /tmp/.cache | Hugging Face model download path
FLEX_COMPUTE | "SENTIENT" | Targets the IBM AIU card
FLEX_DEVICE | "VFIO" | Virtual Function I/O interface into the IBM AIU card
To start your pod, use the following command:
oc create -f 1aiu.yaml
To verify that your pod has been created, you can use:
oc get pod <pod-name>
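If the pod is still starting, you can add the -w (watch) flag to follow its status changes in real time:

oc get pod <pod-name> -w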
If the status indicates "Running," as shown below, you can log into your pod. Otherwise, please wait until it transitions to the "Running" state.
NAME                      READY   STATUS    RESTARTS   AGE
jun-pod                   2/2     Running   0          9s
login-node-client-pzmj7   1/1     Running   0          3d4h
Now log in to the pod you created with 1aiu.yaml (replacing <pod-name> with your actual pod name):
oc exec -it <pod-name> -- bash --login
If you log in successfully, you will see output like the following:
Defaulted container "c1" out of: c1, aiu-monitor
[56551@jun-pod ~]$
Inference Using PyTorch Models
As part of the software stack within the container, PyTorch 2 is available for use.
Custom Models From HuggingFace
This example will allow the user to point to a pre-trained model on HuggingFace and run that model with an IBM AIU that is assigned to a container within an OpenShift pod. Within the /opt/ibm/aiu/examples directory, there is a torch_roberta.py script that will download, compile, and run a RoBERTa model from HuggingFace. The script provides context to the model along with a question, as shown below.
question, text = "Who was Miss Piggy?", "Miss Piggy was a muppet"
Now you are ready to run the torch_roberta.py script.
python3 /opt/ibm/aiu/examples/torch_roberta.py
The script will generate several pages of output, and will end with the following.
--------------------------------------------------
Answer: " muppet"
==================================================
2025/01/13 16:50:56 503918 WARNING [pf_interface.cpp::~PfInterface_impl:463] Stoping MSI monitor
2025/01/13 16:50:56 504060 WARNING [pf_msi_monitor.cpp::PollingThread:144] PfMSIMonitor: Ending
2025/01/13 16:50:56 506712 INFO [vfio_hal_mci_ddr_init_util.cpp::ddr_ECC_error_report:1283] ECC error count: UE = 0 CE = 0
2025/01/13 16:50:56 506717 WARNING [pf_interface.cpp::stop_monitor_thread:485] Stopping monitor thread ...
2025/01/13 16:50:56 506857 INFO [monitoring.cpp::closeMetrics:158] Stopping Metrics thread
If you get the answer “muppet” then the model has run successfully. But why? Let’s take a look at the script contents.
from transformers import AutoTokenizer, RobertaForQuestionAnswering
import torch
from torch_sendnn import torch_sendnn

tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
model = RobertaForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")

question, text = "Who was Miss Piggy?", "Miss Piggy was a muppet"
inputs = tokenizer(question, text, return_tensors="pt", max_length=384, padding="max_length")

with torch.no_grad():
    model = torch.compile(model, backend="sendnn")
    #model = torch.compile(model, backend="inductor")
    outputs = model(**inputs)

answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
answer = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

print("-"*50)
print('Answer: "{}"'.format(answer))
print("="*50)
The script pulls in a pre-trained model from the deepset section of HuggingFace and then compiles the model using the torch.compile method. This function optimizes the performance of PyTorch models by compiling them to run faster, applying various optimizations, and leveraging specific hardware capabilities. This makes models more efficient during training and inference.
The backend parameter in torch.compile specifies which optimization backend to use. Different backends apply various optimizations based on factors such as target hardware or the desired balance between speed and accuracy. Choosing the right backend can significantly impact the model's runtime performance. In this scenario, sendnn is a custom backend for the PyTorch library that enables access to the IBM AIU.
You can also run on the CPU instead by changing the torch.compile backend to inductor or leaving it unspecified (see line 13 of the script).
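As a rough illustration of how the backend argument changes where the compiled model runs, the toy example below (not part of the IBM examples; it uses an arbitrary torch.nn.Linear layer) compiles the same module for the CPU with inductor. The sendnn line is left commented out because it only works inside the SDK container with an AIU attached.

import torch

# Toy model purely for illustration; any torch.nn.Module works the same way.
toy = torch.nn.Linear(4, 2)
x = torch.randn(1, 4)

with torch.no_grad():
    cpu_model = torch.compile(toy, backend="inductor")   # CPU path (the default backend)
    # aiu_model = torch.compile(toy, backend="sendnn")   # AIU path; requires the SDK container
    print(cpu_model(x))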
Another important aspect to highlight is the use of with torch.no_grad() on line 11. This context manager disables gradient calculation, which is useful during inference or model evaluation when gradients are not needed. This reduces memory consumption and speeds up computation. If gradient calculation remains enabled, you will be unable to compile the model with the AIU backend.
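As a minimal, AIU-independent sketch of what this context manager does, the snippet below shows that tensors produced inside torch.no_grad() are not tracked by autograd.

import torch

w = torch.randn(3, requires_grad=True)

y = (w * 2).sum()
print(y.requires_grad)      # True: autograd is tracking this computation

with torch.no_grad():
    z = (w * 2).sum()
    print(z.requires_grad)  # False: gradient tracking is disabled inside the block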
For more information on these PyTorch features, please refer to the PyTorch documentation for torch.compile and torch.no_grad.
If you want to use a different model from HuggingFace, there are several deepset models available. You'll need an account to select a different model and update lines 5 and 6 (the tokenizer and model, respectively) as needed. You may need to adjust your code slightly, but the overall approach remains the same. Feel free to copy and paste torch_roberta.py and use it as a boilerplate for your own tests.
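For example, lines 5 and 6 might change to something like the following; the checkpoint name here is only an illustration, so substitute whichever deepset question-answering model you choose.

# Hypothetical substitution; any compatible deepset question-answering checkpoint works.
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-large-squad2")
model = RobertaForQuestionAnswering.from_pretrained("deepset/roberta-large-squad2")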
Your Own Models
To leverage the AIU Cluster for the inference part of your AI workloads, you need to disable gradient calculation in the PyTorch model and compile it with the appropriate backend, as outlined in the IBM documentation. The following code snippet provides a simple way to adjust your model code and run it on this hardware.
import torch

# Import the IBM AIU Compiler Backend
from torch_sendnn import torch_sendnn

# Load Your Model
model = torch.load('/path/to/my/model.pt')

# Disable Gradient Calculation
with torch.no_grad():
    # Compile Model
    modelAIU = torch.compile(model, backend="sendnn")
    # Set Model to Evaluation Mode
    modelAIU.eval()
    # Run Inference
    yAIU = modelAIU(X)
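If you do not yet have a saved model file, a self-contained sketch like the one below (which swaps torch.load for an arbitrary small torch.nn.Module) can be used to confirm that the sendnn backend works inside your pod. It assumes only that torch and torch_sendnn are available in the SDK container, as in the earlier example.

import torch
from torch_sendnn import torch_sendnn  # IBM AIU compiler backend

# Small stand-in model for testing; replace with your own architecture or a torch.load() call.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 4),
)
model.eval()

X = torch.randn(8, 16)  # example input batch

with torch.no_grad():
    modelAIU = torch.compile(model, backend="sendnn")
    yAIU = modelAIU(X)

print(yAIU.shape)  # expected: torch.Size([8, 4])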
As previously mentioned, the AIU is optimized specifically for the inference process of AI workloads. While it is technically possible to train models in this environment, it is not recommended, as it relies solely on CPUs. This can lead to suboptimal performance, particularly for larger models and datasets.
Wrapping Up
When you finish, make sure to release the allocated resources. Please note that the pod's storage is not persistent; every time you start or stop it, all the files you modified will be lost, so be sure to save all your work first. Exit the pod by running exit, then run the following command.
oc delete pod <pod-name>
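To confirm that the pod has been removed, you can list your pods again:

oc get pods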