The IBM AIU hardware and software are designed to accelerate inference of Deep Neural Networks (DNNs). The design supports a technique pioneered by IBM called approximate computing, which leverages lower-precision computation and a purpose-built architecture to deliver large energy-efficiency gains for AI workloads. The simple layout streamlines AI workflows by sending data directly from one compute engine to the next.
Things to Know Before You Start
The AIU is optimized specifically for the inference stage of AI workloads. Any training performed in this environment will rely solely on its CPUs, which may result in suboptimal performance.
As per the latest official documentation, the IBM AIU SDK supports only PyTorch models. Models relying on other libraries, such as TensorFlow, will default to the CPU, which may result in suboptimal performance.
The University at Albany has a cluster of IBM AIU prototype chips available for its students, faculty, and researchers to work on various AI technologies. Please note that these AIU chips are currently prototypes and the environment in which they operate is experimental; hardware and software configurations are subject to change as development continues.
Connecting to the AIU Cluster
First, connect to the AIU cluster head node (aiu-headnode.its.albany.edu) via SSH using your preferred method, as in the example below.
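The following is a minimal sketch assuming a command-line OpenSSH client; replace <your_user> with your username (typically your UAlbany NetID, or whatever credentials were provided to you).
ssh <your_user>@aiu-headnode.its.albany.edu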
Next, log in to OpenShift using the oc login command.
OpenShift is a Kubernetes-based containerized environment, which allows multiple software environments to co-exist on the same hardware without virtualization. IBM provides two different containers that ride on top of OpenShift and allow utilization of the IBM AIU cards.
The E2E Runtime Environment (RTE) Stable container includes the portions of the IBM AIU software stack needed to run models on an IBM AIU card.
The E2E Software Development Toolkit (SDK) Stable container includes the portions of the IBM AIU software stack for developing and running models on an IBM AIU card.
For the purposes of this guide, we will be focusing on the SDK container, which includes a compiler backend for the PyTorch library that allows access to the IBM AIU.
Refer to the example below, and be sure to replace <your_user> and <your_password> with the credentials we provide you.
oc login -u <your_user> -p <your_password> --server=https://api.ua-aiu.its.albany.edu:6443
Then, switch to the project that we will also provide you.
oc project <your_project>
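If you are unsure of the exact project name, you can list the OpenShift projects available to your account with the following command.
oc projects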
Once connected, navigate to the IBM AIU Files directory within your lab directory. This directory will have been placed there by RTS during your onboarding. Please refer to the following example, and be sure to replace <your_lab> with your own lab name.
cd /network/rit/lab/<your_lab>/IBM-AIU-Files
Deploying Pre-Compiled Models
There are pre-compiled models, such as BERT and RoBERTa, installed on the system and available to use off the shelf. If you are not interested in testing inference on pre-compiled models, you can skip this section. From the IBM-AIU-Files directory within your lab, you can deploy all of the models with a single command.
make deploy-all
This command creates deployments for all the models with a scale of 0 replicas. You will need to manually scale the preferred models up or down depending on your needs.
The total number of replicas you can run is limited by the number of AIUs available within the racks. For example, UAlbany has 8 nodes with 12 AIUs each, so the total number of replicas across all deployments cannot exceed 96.
The following example scales RoBERTa and BERT-Base to 1 replica each.
oc scale --replicas=1 deployment/ibm-aiu-server-roberta deployment/ibm-aiu-server-bert-base
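When you are finished with a model, you can release its AIU by scaling the deployment back down to 0 replicas, for example:
oc scale --replicas=0 deployment/ibm-aiu-server-roberta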
Deploying SDK
As mentioned before, the E2E Software Development Toolkit (SDK) Stable container includes the portions of the IBM AIU software stack for developing and running models on an IBM AIU card, including the following components.
Senbfcc: the IBM AIU Build Framework and Common Code.
deepTools: a suite of Python tools developed for the efficient analysis of high-throughput sequencing data.
senlib: a pure Python-based library for I2C sensors.
TVM (Tensor Virtual Machine): a machine-learning compiler framework for CPUs, GPUs, and machine learning accelerators.
torch_sendnn: a compiler backend for the PyTorch library that allows access to the IBM AIU.
You can start and log in to a client job via the following commands.
make start-sdk
make login-sdk
You will notice a change in your shell prompt, from ibm-ai-pod1 to ibm-aiu-sdk. You can verify this by running the hostname command. Ensure you are connected to ibm-aiu-sdk before proceeding.
[jv535825@ibm-ai-pod1 IBM-AIU-Files-release_2024_05_v2]$ hostname
ibm-ai-pod1
[jv535825@ibm-ai-pod1 IBM-AIU-Files-release_2024_05_v2]$ make login-sdk
------------------------------------------------------------
Logging into the Software Development Kit (SDK) Environment
------------------------------------------------------------
pod/ibm-aiu-sdk condition met
[1000750000@ibm-aiu-sdk ~]$ hostname
ibm-aiu-sdk
After logging in, you can run any of the "Inference Using PyTorch Models" examples. To run the "Inference Using Pre-Compiled Models" examples as well, ensure that you have scaled up the deployments as instructed in the previous step. The following sections provide further information on how to run each of the tests.
Inference Using Pre-Compiled Models
In this section, we will be running two Python scripts. The first one is question_answering_test_request.py, which runs interactive language models (RoBERTa and BERT-Base). It prompts you for a contextual statement followed by a related question, then attempts to predict the answer.
# To run inference using RoBERTa
python3 /opt/ibm/aiu/mlserver-tvm/tests/question_answering_test_request.py --model roberta-base

# To run inference using BERT
python3 /opt/ibm/aiu/mlserver-tvm/tests/question_answering_test_request.py --model bert-base-uncased
For general question answering, refer to the following sample interaction.
Please provide the 'context': Tom likes to eat ice cream.
Please provide the 'question': What does Tom like?
payload {'inputs': [{'name': 'question0', 'datatype': 'BYTES', 'data': 'What does Tom like?', 'shape': [19], 'parameters': {'content_type': 'str'}}, {'name': 'context0', 'datatype': 'BYTES', 'data': 'Tom likes to eat ice cream.', 'shape': [27], 'parameters': {'content_type': 'str'}}]}
{'predict': ['ice cream']}
The second one is squad_test_request.py, which uses one of the interactive models along with the Stanford Question Answering Dataset (SQuAD) to conduct a stress test.
# To run inference using RoBERTa
python3 /opt/ibm/aiu/mlserver-tvm/tests/squad_test_request.py --model roberta-base

# To run inference using BERT
python3 /opt/ibm/aiu/mlserver-tvm/tests/squad_test_request.py --model bert-base-uncased
This stress test is designed to run indefinitely. To stop it, press CTRL-C. When the stress test begins, a squad_stress_test.log file is created, as shown in the following example.
[INFO] 2023-07-26 14:19:51,220:Validation Dataset Inference Time (seconds): 0.1968059539794922 for 1 num_workers and batch size 1
[INFO] 2023-07-26 14:19:51,221:validation_data_inference : 0.19719219207763672
[INFO] 2023-07-26 14:19:51,221:Predictions: [{'prediction_text': 'Denver Broncos', 'id': '56be4db0acb8001400a502ec'}, {'prediction_text': 'Carolina Panthers', 'id': '56be4db0acb8001400a502ed'}, {'prediction_text': "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California", 'id': '56be4db0acb8001400a502ee'}, {'prediction_text': 'Denver Broncos', 'id': '56be4db0acb8001400a502ef'}, {'prediction_text': 'gold', 'id': '56be4db0acb8001400a502f0'}, {'prediction_text': 'golden anniversary', 'id': '56be8e613aeaaa14008c90d1'}, {'prediction_text': 'February 7, 2016', 'id': '56be8e613aeaaa14008c90d2'}, {'prediction_text': 'American Football Conference', 'id': '56be8e613aeaaa14008c90d3'}, {'prediction_text': 'golden anniversary', 'id': '56bea9923aeaaa14008c91b9'}, {'prediction_text': 'American Football Conference', 'id': '56bea9923aeaaa14008c91ba'}]
[INFO] 2023-07-26 14:19:51,225:scores: {'exact_match': tensor(100.), 'f1': tensor(100.)}
If you encounter an error like ConnectionRefusedError: [Errno 111] Connection refused when running either example, ensure that you have scaled up the models as instructed in the "Deploying Pre-Compiled Models" section. If you're unsure, run the oc get pods command from the head node. If the model servers are running, the output will look similar to the following example; if they are missing, you likely skipped the scale-up step.
NAME                                        READY   STATUS    RESTARTS   AGE
ibm-aiu-sdk                                 2/2     Running   0          5h11m
ibm-aiu-server-bert-base-6fdc586ddd-wndvf   2/2     Running   0          5h11m
ibm-aiu-server-roberta-5c76cb849b-25vb9     2/2     Running   0          5h11m
Inference Using PyTorch Models
As part of the software stack within the SDK container, PyTorch 2 is available for use.
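If you want to confirm the exact PyTorch version available inside the SDK pod, a quick one-liner such as the following will print it.
python3 -c "import torch; print(torch.__version__)"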
Custom Models From HuggingFace
This example will allow the user to point to a pre-trained model on HuggingFace and to run that model with an IBM AIU that is assigned to a container within an OpenShift pod. Within the /opt/ibm/aiu/examples directory, there is a torch_roberta.py script that will download, compile, and run a RoBERTa model from HuggingFace. The script provides context to the model along with a question, as shown below.
question, text = "Who was Miss Piggy?", "Miss Piggy was a muppet"
The model will then make a prediction, and the script will verify that the answer is correct. To run this script from within an SDK container, you first need to set up the required compiler environment settings. The easiest way to accomplish that is by creating an envars.sh file with the following contents.
export FLEX_COMPUTE=SENTIENT
export FLEX_DEVICE=VFIO
export DATA_PREC=fp16
export SENCORES=32
export TOKENIZERS_PARALLELISM=false
export DTLOG_LEVEL=error
export TORCH_SENDNN_LOG=CRITICAL
Feel free to create this script using your preferred text editor (e.g., vi, vim, nano). Below is a brief explanation of the compiler environment variables provided by IBM. Additionally, it's a good idea to save a copy of this file for your records, as files in /tmp will be deleted when you end your job.
Variable | Description |
---|---|
FLEX_COMPUTE | Targets the IBM AIU card. |
FLEX_DEVICE | Refers to the Virtual File Input Output (VFIO) interface into the IBM AIU card. |
SENCORES | Refers to the number of cores on the IBM AIU chip. It is recommended to use all 32 cores. |
DATA_PREC | Refers to the data precision of the model. The model needs to be quantized to the chosen precision. |
TOKENIZERS_PARALLELISM | Controls HuggingFace tokenizer parallelism; we set this to false to reduce extraneous output. |
DTLOG_LEVEL | Controls the general logging level (silences the senlib output). |
TORCH_SENDNN_LOG | Prevents the FX graph from printing when it doesn't need to. |
You will also need to set HOME to a directory to which you have write permission.
export HOME=/tmp
Then go ahead and source the file you just created.
source envars.sh
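If you would like to double-check that the variables are set in your current shell, a simple grep over the environment (the pattern below just matches the variable names used above) will list them.
env | grep -E 'FLEX_|SENCORES|DATA_PREC|TOKENIZERS_PARALLELISM|DTLOG_LEVEL|TORCH_SENDNN_LOG'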
Now you are ready to run the torch_roberta.py script.
python3 /opt/ibm/aiu/examples/torch_roberta.py
The script will generate several pages of output, and will end with the following.
====== Perf Summary End ======
[DeepRT] ===== Perf END =====
[DeepRT] ===== DSM-Act2 BEGIN =====
[DeepRT] ===== Calling DSM (ACT2) ====
[DeepRT] ===== DSM-Act2 END =====
[DeepRT] ===== DSM-AutoPilot BEGIN =====
[DeepRT] ===== DSM-AutoPilot END =====
[DeepRT] ===== DSM-SplitDSenGraph BEGIN =====
[DeepRT] ===== DSM-SplitDSenGraph END =====
[DeepRT] ===== Calling Export ====
Progress: [====================]
--------------------------------------------------
Answer: "muppet"
If you get the answer “muppet” then the model has run successfully. But why? Let’s take a look at the script contents.
from transformers import AutoTokenizer, RobertaForQuestionAnswering
import torch
from torch_sendnn import torch_sendnn

tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
model = RobertaForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")

question, text = "Who was Miss Piggy?", "Miss Piggy was a muppet"
inputs = tokenizer(question, text, return_tensors="pt", max_length=384, padding="max_length")

with torch.no_grad():
    model = torch.compile(model, backend="sendnn")
    #model = torch.compile(model, backend="inductor")
    outputs = model(**inputs)

answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
answer = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

print("-"*50)
print('Answer: "{}"'.format(answer))
print("="*50)
The script pulls in a pre-trained model from the deepset section of HuggingFace and then compiles the model using the torch.compile method. This function optimizes the performance of PyTorch models by compiling them to run faster, applying various optimizations, and leveraging specific hardware capabilities. This makes models more efficient during training and inference.
The backend parameter in torch.compile specifies which optimization backend to use. Different backends apply various optimizations based on factors such as target hardware or the desired balance between speed and accuracy. Choosing the right backend can significantly impact the model's runtime performance. In this scenario, sendnn is a custom backend for the PyTorch library that enables access to the IBM AIU.
You can also run on the CPU instead by changing the torch.compile backend to inductor, or by omitting the backend argument altogether (see line 13).
Another important aspect to highlight is the use of with torch.no_grad() on line 11. This context manager disables gradient calculation, which is useful during inference or model evaluation when gradients are not needed. This reduces memory consumption and speeds up computation. If gradient calculation remains enabled, you will be unable to compile the model with the AIU backend.
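As a minimal, AIU-independent sketch of these two features, the toy model below (illustrative only) is compiled with the standard inductor backend and run on the CPU under torch.no_grad().

import torch
import torch.nn as nn

# Toy model used only to illustrate torch.compile and torch.no_grad
toy_model = nn.Linear(4, 2)
compiled_model = torch.compile(toy_model, backend="inductor")

x = torch.randn(1, 4)
with torch.no_grad():        # no gradient tracking during inference
    y = compiled_model(x)    # first call triggers compilation
print(y.shape)               # torch.Size([1, 2])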
For more information on these PyTorch features, please refer to the official PyTorch documentation for torch.compile and torch.no_grad.
If you want to use a different model from HuggingFace, there are several deepset models available. You'll need a HuggingFace account to select a different model; update lines 5 and 6 (the tokenizer and model, respectively) as needed. You may need to adjust your code slightly, but the overall approach remains the same. Feel free to copy and paste torch_roberta.py and use it as a boilerplate for your own tests, as sketched below.
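For example, a hedged sketch of swapping in a different question-answering checkpoint; the model name below is illustrative, so verify that it exists on HuggingFace before using it.

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "deepset/bert-base-cased-squad2"  # illustrative checkpoint; confirm it exists on HuggingFace
tokenizer = AutoTokenizer.from_pretrained(model_name)              # replaces line 5
model = AutoModelForQuestionAnswering.from_pretrained(model_name)  # replaces line 6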
Your Own Models
To leverage the AIU Cluster for the inference part of your AI workloads, you need to disable gradient calculation in the PyTorch model and compile it with the appropriate backend, as outlined in the IBM documentation. The following code snippet provides a simple way to adjust your model code and run it on this hardware.
# Import PyTorch and the IBM AIU Compiler Backend
import torch
from torch_sendnn import torch_sendnn

# Load Your Model
model = torch.load('/path/to/my/model.pt')

# Disable Gradient Calculation
with torch.no_grad():
    # Compile Model
    modelAIU = torch.compile(model, backend="sendnn")
    # Set Model to Evaluation Mode
    modelAIU.eval()
    # Run Inference
    yAIU = modelAIU(X)
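For reference, here is a self-contained variant of the same pattern using a placeholder model and input; the architecture, shapes, and names below are stand-ins for your own, and the sendnn backend only works inside the SDK container where torch_sendnn is installed (swap it to inductor to try the flow on the CPU).

import torch
import torch.nn as nn
from torch_sendnn import torch_sendnn  # IBM AIU compiler backend (SDK container only)

# Placeholder model and input batch standing in for your own
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))
X = torch.randn(4, 16)

with torch.no_grad():
    model.eval()                                       # evaluation mode for inference
    modelAIU = torch.compile(model, backend="sendnn")  # use backend="inductor" to stay on the CPU
    yAIU = modelAIU(X)                                 # run inference
print(yAIU.shape)  # expected: torch.Size([4, 2])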
As previously mentioned, the AIU is optimized specifically for the inference process of AI workloads. While it is technically possible to train models in this environment, it is not recommended, as it relies solely on CPUs. This can lead to suboptimal performance, particularly for larger models and datasets.
Wrapping Up
When you finish, make sure to release the allocated resources. Please note that the SDK environment storage is not persistent; every time you start or stop it, any files you modified will be lost, so be sure to save all your work. Exit the SDK pod by running exit, then run the following commands.
make stop-sdk
make undeploy-all
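If you want to confirm that everything has been torn down, you can list the remaining pods from the head node; once the SDK pod and the model server pods are no longer listed, the resources have been released.
oc get pods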