The IBM AIU hardware and software are designed to accelerate inference of Deep Neural Networks (DNNs). The design supports a technique pioneered by IBM called approximate computing, which leverages lower-precision computation and a purpose-built architecture, resulting in significant energy-efficiency gains for AI workloads. The simple layout is designed to streamline AI workflows by sending data directly from one compute engine to the next.
Things to Know Before You Start
The University at Albany has a cluster of IBM AIU prototype chips available for its students, faculty, and researchers to work on various AI technologies. Please note that these AIU chips are currently prototypes and that the environment in which they operate is experimental. As such, hardware and software configurations are subject to change as development continues.
Connecting to the AIU Cluster
First, connect to the AIU head node (aiu-headnode.its.albany.edu) via SSH using your preferred method. Next, log in to OpenShift using the oc login command.
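For the SSH step, a typical connection from a terminal looks like the following (the username placeholder is illustrative; use the account credentials provided to you):

```bash
# Connect to the AIU head node (replace <your_username> with your own account)
ssh <your_username>@aiu-headnode.its.albany.edu
```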
OpenShift is a Kubernetes-based containerized environment. This allows multiple software environments to co-exist on the same hardware without virtualization. IBM provides two different containers that ride on top of OpenShift and allow utilization of the IBM AIU cards.
For the purposes of this guide, we will be focusing on the SDK container, which includes a compiler backend for the PyTorch library that allows access to the IBM AIU.
Refer to the example below, and be sure to replace <your_user>
and <your_password>
with the credentials we provide you.
```bash
oc login -u <your_user> -p <your_password> --server=https://api.ua-aiu.its.albany.edu:6443
```
Then, switch to the project that we will also provide you.
```bash
oc project <your_project>
```
Once connected, navigate to the IBM AIU Files directory within your lab directory. This directory will have been placed there by RTS during your onboarding. Please refer to the following example, and be sure to replace <your_lab>
with your own lab name.
```bash
cd /network/rit/lab/<your_lab>/IBM-AIU-Files
```
Deploying Pre-Compiled Models
There are pre-compiled models, such as BERT and RoBERTa, installed on the system and available to use off the shelf. If you are not interested in testing inference on pre-compiled models, you can skip this section. From the IBM-AIU-Files directory within your lab, you can deploy all of the models with a single command.
```bash
make deploy-all
```
This command creates deployments for all the models with a scale of 0 replicas. You will need to manually scale the preferred models up or down depending on your needs.
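To confirm which deployments were created and check their current replica counts, you can list them with oc:

```bash
# List the model deployments and their current replica counts
oc get deployments
```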
The number of deployment replicas is limited by the number of AIUs available within the racks. For example, UAlbany has 8 nodes with 12 AIUs each, so the total number of replicas across all deployments cannot exceed 96.
The following example scales RoBERTa and BERT-Base to 1 replica each.
```bash
oc scale --replicas=1 deployment/ibm-aiu-server-roberta deployment/ibm-aiu-server-bert-base
```
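When you no longer need a model, you can free its AIUs the same way by scaling the deployment back down to 0 replicas, for example:

```bash
# Scale a deployment back down to release its AIUs
oc scale --replicas=0 deployment/ibm-aiu-server-roberta
```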
Deploying SDK
As mentioned before, the E2E Software Development Kit (SDK) Stable release includes the portions of the IBM AIU software stack needed to run workloads on an IBM AIU card, such as the following.
Senbfcc: the IBM AIU Build Framework and Common Code.
deepTools: a suite of Python tools developed for the efficient analysis of high-throughput sequencing data.
senlib: a pure Python library for I2C sensors.
TVM (Tensor Virtual Machine): a machine-learning compiler framework for CPUs, GPUs, and machine learning accelerators.
torch_sendnn: a compiler backend for the PyTorch library that allows access to the IBM AIU.
You can start and log in to a client job via the following commands.
```bash
make start-sdk
make login-sdk
```
You will notice a change in your shell prompt - from ibm-ai-pod1
to ibm-aiu-sdk
. You can verify this by running the hostname
command. Ensure you are connected to ibm-aiu-sdk
before proceeding.
```
[jv535825@ibm-ai-pod1 IBM-AIU-Files-release_2024_05_v2]$ hostname
ibm-ai-pod1
[jv535825@ibm-ai-pod1 IBM-AIU-Files-release_2024_05_v2]$ make login-sdk
------------------------------------------------------------
 Logging into the Software Development Kit (SDK) Environment
------------------------------------------------------------
pod/ibm-aiu-sdk condition met
[1000750000@ibm-aiu-sdk ~]$ hostname
ibm-aiu-sdk
```
After logging in, you can run any of the "Inference Using Custom Models" examples. To run the "Inference Using Pre-Compiled Models" examples as well, ensure that you have scaled up the deployment as instructed in the previous step. The following sections go into further information on how to run each of the tests.
Inference Using Pre-Compiled Models
In this section, we will be running two Python scripts. The first is question_answering_test_request.py, which runs the interactive language models (RoBERTa and BERT-Base). It prompts you for a contextual statement followed by a related question, then attempts to predict the answer.
```bash
# To run inference using RoBERTa
python3 /opt/ibm/aiu/mlserver-tvm/tests/question_answering_test_request.py --model roberta-base

# To run inference using BERT
python3 /opt/ibm/aiu/mlserver-tvm/tests/question_answering_test_request.py --model bert-base-uncased
```
For general question answering, refer to the following sample interaction.
```
Please provide the 'context': Tom likes to eat ice cream.
Please provide the 'question': What does Tom like?
payload {'inputs': [{'name': 'question0', 'datatype': 'BYTES', 'data': 'What does Tom like?', 'shape': [19], 'parameters': {'content_type': 'str'}}, {'name': 'context0', 'datatype': 'BYTES', 'data': 'Tom likes to eat ice cream.', 'shape': [27], 'parameters': {'content_type': 'str'}}]}
{'predict': ['ice cream']}
```
The second is squad_test_request.py, which uses one of the interactive models along with the Stanford Question Answering Dataset (SQuAD) to conduct a stress test that runs until you stop it.
```bash
# To run inference using RoBERTa
python3 /opt/ibm/aiu/mlserver-tvm/tests/squad_test_request.py --model roberta-base

# To run inference using BERT
python3 /opt/ibm/aiu/mlserver-tvm/tests/squad_test_request.py --model bert-base-uncased
```
This stress test is designed to run indefinitely. To stop it, press CTRL-C
. When the stress test begins, a squad_stress_test.log
file is created, as shown in the following example.
```
[INFO] 2023-07-26 14:19:51,220:Validation Dataset Inference Time (seconds): 0.1968059539794922 for 1 num_workers and batch size 1
[INFO] 2023-07-26 14:19:51,221:validation_data_inference : 0.19719219207763672
[INFO] 2023-07-26 14:19:51,221:Predictions: [{'prediction_text': 'Denver Broncos', 'id': '56be4db0acb8001400a502ec'}, {'prediction_text': 'Carolina Panthers', 'id': '56be4db0acb8001400a502ed'}, {'prediction_text': "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California", 'id': '56be4db0acb8001400a502ee'}, {'prediction_text': 'Denver Broncos', 'id': '56be4db0acb8001400a502ef'}, {'prediction_text': 'gold', 'id': '56be4db0acb8001400a502f0'}, {'prediction_text': 'golden anniversary', 'id': '56be8e613aeaaa14008c90d1'}, {'prediction_text': 'February 7, 2016', 'id': '56be8e613aeaaa14008c90d2'}, {'prediction_text': 'American Football Conference', 'id': '56be8e613aeaaa14008c90d3'}, {'prediction_text': 'golden anniversary', 'id': '56bea9923aeaaa14008c91b9'}, {'prediction_text': 'American Football Conference', 'id': '56bea9923aeaaa14008c91ba'}]
[INFO] 2023-07-26 14:19:51,225:scores: {'exact_match': tensor(100.), 'f1': tensor(100.)}
```
If you encounter an error like ConnectionRefusedError: [Errno 111] Connection refused
when running either example, ensure that you have scaled up the models as instructed in the "Deploying Pre-Compiled Models" section. If you're unsure, run the oc get pods
command from the head node. The following example shows a healthy state with both model pods running; if they do not appear, you likely missed the step to scale them up.
```
NAME                                        READY   STATUS    RESTARTS   AGE
ibm-aiu-sdk                                 2/2     Running   0          5h11m
ibm-aiu-server-bert-base-6fdc586ddd-wndvf   2/2     Running   0          5h11m
ibm-aiu-server-roberta-5c76cb849b-25vb9     2/2     Running   0          5h11m
```
Inference Using PyTorch Models
As part of the software stack within the SDK container, PyTorch 2 is available for use.
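For example, you can confirm which PyTorch version is installed inside the SDK container with a quick one-liner (the exact version printed will depend on the release installed):

```bash
python3 -c "import torch; print(torch.__version__)"
```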
Custom Models From HuggingFace
This example will allow the user to point to a pre-trained model on HuggingFace and to run that model with an IBM AIU that is assigned to a container within an OpenShift pod. Within the /opt/ibm/aiu/examples
directory, there is a torch_roberta.py
script that will download, compile, and run a RoBERTa model from HuggingFace. The script provides context to the model along with a question, as shown below.
```python
question, text = "Who was Miss Piggy?", "Miss Piggy was a muppet"
```
The model will then make a prediction, and the script will verify that the answer is correct. To run this script from within an SDK container, you first need to set up the required compiler environment variables. The easiest way to accomplish that is by creating an envars.sh
file with the following contents.
```bash
export FLEX_COMPUTE=SENTIENT
export FLEX_DEVICE=VFIO
export DATA_PREC=fp16
export SENCORES=32
export TOKENIZERS_PARALLELISM=false
export DTLOG_LEVEL=error
export TORCH_SENDNN_LOG=CRITICAL
```
Feel free to create this script using your preferred text editor (e.g., vi, vim, nano). Below is a brief explanation of the compiler environment variables provided by IBM. Additionally, it's a good idea to save a copy of this file for your records, as files in /tmp
will be deleted when you end your job.
Variable | Description |
---|---|
FLEX_COMPUTE | Targets the IBM AIU card |
FLEX_DEVICE | Refers to the Virtual File Input Output interface into the IBM AIU card. |
SENCORES | Refers to the number of cores on the IBM AIU chip. It is recommended to use all 32 cores. |
DATA_PREC | Refers to the data precision of the model. The model needs to be quantized into the setting that is chosen. |
TOKENIZERS_PARALLELISM | Controls parallelism in the HuggingFace tokenizers; we set this to false to reduce extraneous output. |
DTLOG_LEVEL | Refers to the general logging level (setting it to error silences the senlib output). |
TORCH_SENDNN_LOG | Prevents the FX graph from printing when it doesn’t need to. |
You will also need to set HOME to a directory to which you have write permission.
```bash
export HOME=/tmp
```
Then, source the script you just created to load the environment variables into your shell.
```bash
source envars.sh
```
Now you are ready to run the torch_roberta.py
script.
```bash
python3 /opt/ibm/aiu/examples/torch_roberta.py
```
The script will generate several pages of output, and will end with the following.
```
====== Perf Summary End ======
[DeepRT] ===== Perf END =====
[DeepRT] ===== DSM-Act2 BEGIN =====
[DeepRT] ===== Calling DSM (ACT2) ====
[DeepRT] ===== DSM-Act2 END =====
[DeepRT] ===== DSM-AutoPilot BEGIN =====
[DeepRT] ===== DSM-AutoPilot END =====
[DeepRT] ===== DSM-SplitDSenGraph BEGIN =====
[DeepRT] ===== DSM-SplitDSenGraph END =====
[DeepRT] ===== Calling Export ====
Progress: [====================]
--------------------------------------------------
Answer: "muppet"
```
If you get the answer “muppet” then the model has run successfully. But why? Let’s take a look at the script contents.
```python
from transformers import AutoTokenizer, RobertaForQuestionAnswering
import torch
from torch_sendnn import torch_sendnn

tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
model = RobertaForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")

question, text = "Who was Miss Piggy?", "Miss Piggy was a muppet"
inputs = tokenizer(question, text, return_tensors="pt", max_length=384, padding="max_length")

with torch.no_grad():
    model = torch.compile(model, backend="sendnn")
    #model = torch.compile(model, backend="inductor")
    outputs = model(**inputs)

answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
answer = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

print("-"*50)
print('Answer: "{}"'.format(answer))
print("="*50)
```
The script pulls in a pre-trained model from the deepset section of HuggingFace and then compiles the model using the torch.compile
method. This function optimizes the performance of PyTorch models by compiling them to run faster, applying various optimizations, and leveraging specific hardware capabilities. This makes models more efficient during training and inference.
The backend
parameter in torch.compile
specifies which optimization backend to use. Different backends apply various optimizations based on factors such as target hardware or the desired balance between speed and accuracy. Choosing the right backend can significantly impact the model's runtime performance. In this scenario, sendnn
is a custom backend for the PyTorch library that enables access to the IBM AIU.
You can also run on the CPU instead by changing the backend parameter from sendnn to inductor.
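The script already contains a commented-out line for this; switching backends is a one-line change:

```python
# Compile for the CPU using PyTorch's built-in inductor backend instead of the AIU
model = torch.compile(model, backend="inductor")
```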
Another important aspect to highlight is the use of with torch.no_grad()
on line 11. This context manager disables gradient calculation, which is useful during inference or model evaluation when gradients are not needed. This reduces memory consumption and speeds up computation. If gradient calculation remains enabled, you will be unable to compile the model with the AIU backend.
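As a minimal illustration of what the context manager does (independent of the AIU), tensors produced inside torch.no_grad() carry no autograd history:

```python
import torch

x = torch.randn(3, requires_grad=True)
with torch.no_grad():
    y = x * 2

# y was computed without building a gradient graph
print(y.requires_grad)  # False
```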
For more information on these PyTorch features, please refer to the PyTorch documentation for torch.compile and torch.no_grad.
If you want to use a different model from HuggingFace, there are several deepset models available. You'll need an account to select a different model; then update lines 5 and 6 (the tokenizer
and model
, respectively) as needed. You may need to adjust your code slightly, but the overall approach remains the same. Feel free to copy and paste torch_roberta.py
and use it as a boilerplate for your own tests.
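For example, a minimal sketch of that change, assuming you chose another deepset question-answering checkpoint (the model name below is only illustrative), would be:

```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Illustrative checkpoint; substitute the deepset model you selected on HuggingFace
tokenizer = AutoTokenizer.from_pretrained("deepset/bert-base-cased-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("deepset/bert-base-cased-squad2")
```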
Your Own Models
To leverage the AIU Cluster for the inference part of your AI workloads, you need to disable gradient calculation in the PyTorch model and compile it with the appropriate backend, as outlined in the IBM documentation. The following code snippet provides a simple way to adjust your model code and run it on this hardware.
```python
import torch

# Import the IBM AIU Compiler Backend
from torch_sendnn import torch_sendnn

# Load Your Model
model = torch.load('/path/to/my/model.pt')

# Disable Gradient Calculation
with torch.no_grad():
    # Compile Model
    modelAIU = torch.compile(model, backend="sendnn")
    # Set Model to Evaluation Mode
    modelAIU.eval()
    # Run Inference (X is your input tensor or batch)
    yAIU = modelAIU(X)
```
As previously mentioned, the AIU is optimized specifically for the inference process of AI workloads. While it is technically possible to train models in this environment, it is not recommended, as it relies solely on CPUs. This can lead to suboptimal performance, particularly for larger models and datasets.
Wrapping Up
When you finish, make sure to release the allocated resources. Please note that the SDK environment storage is not persistent; every time you start or stop it, any files you created or modified will be lost, so be sure to save your work first. Exit the SDK pod by running exit
. Then, run the following commands.
```bash
make stop-sdk
make undeploy-all
```