IBM AIU Quick-Start Guide

The IBM AIU hardware and software are designed to accelerate inference of Deep Neural Networks (DNNs). The design supports a technique pioneered by IBM called approximate computing, which leverages lower-precision computation and a purpose-built architecture, resulting in significant energy-efficiency gains for AI workloads. The simple layout streamlines AI workflows by sending data directly from one compute engine to the next.

Things to Know Before You Start

  • The AIU is optimized specifically for the inference process of AI workloads. Any training performed in this environment will rely solely on its CPUs, which may result in suboptimal performance.

  • As per the latest official documentation, the IBM AIU SDK supports only PyTorch models. Models relying on other libraries, such as TensorFlow, will default to the CPU, which may result in suboptimal performance.

The University at Albany has a cluster of IBM AIU prototype chips available for its students, faculty, and researchers to work on various AI technologies. Please note that these AIU chips are currently prototypes and that the environment in which they operate is experimental. As such, hardware and software configurations are subject to change as development continues.

Connecting to the AIU Cluster

First, connect to the AIU head cluster (aiu-headnode.its.albany.edu) via SSH using your preferred method. Next, log in to OpenShift using the oc login command.

OpenShift is a Kubernetes-based containerized environment, which allows multiple software environments to co-exist on the same hardware without virtualization. IBM provides two different containers that ride on top of OpenShift and allow utilization of the IBM AIU cards.

  • The E2E Runtime Environment (RTE) Stable container includes the portions of the IBM AIU software stack needed to run models on an IBM AIU card.

  • The E2E Software Development Toolkit (SDK) Stable container includes the portions of the IBM AIU software stack needed to develop, compile, and run models on an IBM AIU card.

For the purposes of this guide, we will be focusing on the SDK container, which includes a compiler backend for the PyTorch library that allows access to the IBM AIU.

Refer to the example below, and be sure to replace <your_user> and <your_password> with the credentials we provide you.

oc login -u <your_user> -p <your_password> --server=https://api.ua-aiu.its.albany.edu:6443

Then, switch to the project that we will also provide you.

oc project <your_project>

Once connected, navigate to the IBM AIU Files directory within your lab directory. This directory will have been placed there by RTS during your onboarding. Please refer to the following example, and be sure to replace <your_lab> with your own lab name.

cd /network/rit/lab/<your_lab>/IBM-AIU-Files

Deploying Pre-Compiled Models

There are pre-compiled models, such as BERT and RoBERTa, installed on the system and available to use off the shelf. If you are not interested in testing inference on pre-compiled models, you can skip this section. From the IBM-AIU-Files directory within your lab, you can deploy all of the models with a single command.
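The deployment is run from that directory; as an illustration only (the actual manifest layout in IBM-AIU-Files may differ), a typical OpenShift form of such a command would be:

oc apply -f .    # create the deployments from the manifest files in the current directory (layout assumed)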

This command creates deployments for all the models with a scale of 0 replicas. You will need to manually scale the preferred models up or down depending on your needs.

The total number of replicas is limited by the number of AIUs available in the racks. For example, UAlbany has 8 nodes with 12 AIUs each, so the total number of replicas across all deployments cannot exceed 96.

The following example scales RoBERTa and BERT-Base to 1 replica each.
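The deployment names below are illustrative; run oc get deployments from the head node to list the actual names in your project.

oc scale deployment/roberta --replicas=1
oc scale deployment/bert-base --replicas=1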

Deploying the SDK

As mentioned before, the E2E Software Development Toolkit (SDK) Stable container includes the portions of the IBM AIU software stack needed to develop, compile, and run models on an IBM AIU card, such as the following.

  • Senbfcc: the IBM AIU Build Framework and Common Code.

  • deepTools: a suite of Python tools developed for the efficient analysis of high-throughput sequencing data.

  • senlib: a pure Python library for interfacing with I2C sensors.

  • TVM (Tensor Virtual Machine): a machine-learning compiler framework for CPUs, GPUs, and machine learning accelerators.

  • torch_sendnn: a compiler backend for the PyTorch library that allows access to the IBM AIU.

You can start and log in to a client job via the following commands.
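The exact commands are environment-specific; as a rough sketch only, a typical OpenShift workflow for attaching to a running pod (the pod name below is an assumption) looks like this.

oc get pods                  # find the name of the SDK client pod
oc rsh ibm-aiu-sdk-client    # open a shell inside it (pod name is illustrative)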

You will notice a change in your shell prompt - from ibm-ai-pod1 to ibm-aiu-sdk. You can verify this by running the hostname command. Ensure you are connected to ibm-aiu-sdk before proceeding.

After logging in, you can run any of the "Inference Using PyTorch Models" examples. To run the "Inference Using Pre-Compiled Models" examples as well, ensure that you have scaled up the deployment as instructed in the previous step. The following sections provide further information on how to run each of the tests.

Inference Using Pre-Compiled Models

In this section, we will be running two Python scripts. The first is question_answering_test_request.py, which runs the interactive language models (RoBERTa and BERT-Base). It prompts you for a contextual statement followed by a related question, then attempts to predict the answer.

For general question answering, refer to the following sample interaction.
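The interactive session itself will vary with your inputs; the invocation, assuming the script is in your working directory or on your path inside the SDK pod, looks like this.

python question_answering_test_request.py    # prompts for a context statement and a question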

The second is squad_test_request.py, which uses one of the interactive models along with the Stanford Question Answering Dataset (SQuAD) to conduct a stress test.

This stress test is designed to run indefinitely. To stop it, press CTRL-C. When the stress test begins, a squad_stress_test.log file is created, as shown in the following example.
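For example (the script location is assumed, as above):

python squad_test_request.py       # starts the stress test; press CTRL-C to stop it
tail -f squad_stress_test.log      # follow the log file from another shell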

If you encounter an error like ConnectionRefusedError: [Errno 111] Connection refused when running either example, ensure that you have scaled up the models as instructed in the "Deploying Pre-Compiled Models" section. If you're unsure, run the oc get pods command from the head node. If the models aren't running, as shown in the following example, you likely missed the step to scale them up.

Inference Using PyTorch Models

As part of the software stack within the SDK container, PyTorch 2 is available for use.

Custom Models From HuggingFace

This example will allow the user to point to a pre-trained model on HuggingFace and to run that model with an IBM AIU that is assigned to a container within an OpenShift pod. Within the /opt/ibm/aiu/examples directory, there is a torch_roberta.py script that will download, compile, and run a RoBERTa model from HuggingFace. The script provides context to the model along with a question, as shown below.

The model will then make a prediction, and the script will verify that the answer is correct. To run this script from within an SDK container, you first need to set up the required compiler environment variables. The easiest way to accomplish that is by creating an envars.sh file with the following contents.
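The values below are placeholders illustrating the shape of the file; confirm the exact settings for your environment against the IBM documentation and the variable descriptions that follow.

#!/bin/bash
# envars.sh - compiler environment for the IBM AIU (all values shown are illustrative placeholders)
export FLEX_COMPUTE=SENTIENT          # targets the IBM AIU card (value assumed)
export FLEX_DEVICE=VFIO               # Virtual File Input Output interface into the card (value assumed)
export SENCORES=32                    # use all 32 cores of the AIU chip
export DATA_PREC=fp16                 # data precision; the model must be quantized to match (value assumed)
export TOKENIZERS_PARALLELISM=false   # reduce extraneous HuggingFace output
export DTLOG_LEVEL=error              # silence the senlib logging output (value assumed)
export TORCH_SENDNN_LOG=WARNING       # keep the FX graph from printing unnecessarily (value assumed)
export HOME=/tmp                      # any directory you can write to (see the note below; path assumed)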

Feel free to create this script using your preferred text editor (e.g., vi, vim, nano). Below is a brief explanation of the compiler environment variables provided by IBM. Additionally, it's a good idea to save a copy of this file for your records, as files in /tmp will be deleted when you end your job.

  • FLEX_COMPUTE: Targets the IBM AIU card.

  • FLEX_DEVICE: Refers to the Virtual File Input Output (VFIO) interface into the IBM AIU card.

  • SENCORES: The number of cores on the IBM AIU chip. It is recommended to use all 32 cores.

  • DATA_PREC: The data precision of the model. The model needs to be quantized to match the chosen setting.

  • TOKENIZERS_PARALLELISM: Applies to the HuggingFace transformers library; we set this to false to reduce extraneous output.

  • DTLOG_LEVEL: Controls the general logging level (can be used to silence the senlib output).

  • TORCH_SENDNN_LOG: Prevents the FX graph from printing when it doesn't need to.

You will also need to set HOME to a directory to which you have write permission.

Then source the file you just created so that the variables are set in your current shell.
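Assuming you created the file as /tmp/envars.sh (the path is an assumption), that would be:

source /tmp/envars.sh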

Now you are ready to run the torch_roberta.py script.
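For example, assuming python points at the Python interpreter inside the SDK container:

python /opt/ibm/aiu/examples/torch_roberta.py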

The script will generate several pages of output, and will end with the following.

If you get the answer “muppet” then the model has run successfully. But why? Let’s take a look at the script contents.

The script pulls in a pre-trained model from the deepset section of HuggingFace and then compiles the model using the torch.compile method. This function optimizes the performance of PyTorch models by compiling them to run faster, applying various optimizations, and leveraging specific hardware capabilities. This makes models more efficient during training and inference.

The backend parameter in torch.compile specifies which optimization backend to use. Different backends apply various optimizations based on factors such as target hardware or the desired balance between speed and accuracy. Choosing the right backend can significantly impact the model's runtime performance. In this scenario, sendnn is a custom backend for the PyTorch library that enables access to the IBM AIU.

Another important aspect to highlight is the use of with torch.no_grad() on line 11. This context manager disables gradient calculation, which is useful during inference or model evaluation when gradients are not needed. This reduces memory consumption and speeds up computation. If gradient calculation remains enabled, you will be unable to compile the model with the AIU backend.

For more information on these PyTorch features, please refer to the following links.

If you want to use a different model from HuggingFace, there are several deepset models available. You'll need an account to select a different model and update lines 5 and 6 (the tokenizer and model, respectively) as needed. You may need to adjust your code slightly, but the overall approach remains the same. Feel free to copy and paste torch_roberta.py and use it as a boilerplate for your own tests.

Your Own Models

To leverage the AIU Cluster for the inference part of your AI workloads, you need to disable gradient calculation in the PyTorch model and compile it with the appropriate backend, as outlined in the IBM documentation. The following code snippet provides a simple way to adjust your model code and run it on this hardware.
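A minimal sketch of that pattern is shown below; the model, weights, and input are placeholders to substitute with your own, and the sendnn backend is only available inside the SDK container.

import torch

# Placeholder model: substitute your own trained PyTorch model and load its weights here
model = torch.nn.Sequential(torch.nn.Linear(16, 4), torch.nn.Softmax(dim=-1))
model.eval()

# Compile the model for the IBM AIU using the torch_sendnn backend
compiled_model = torch.compile(model, backend="sendnn")

# Gradient calculation must be disabled; otherwise the model cannot be compiled for the AIU
with torch.no_grad():
    example_input = torch.randn(1, 16)   # substitute a real input batch for your model
    output = compiled_model(example_input)
    print(output)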

As previously mentioned, the AIU is optimized specifically for the inference process of AI workloads. While it is technically possible to train models in this environment, it is not recommended, as it relies solely on CPUs. This can lead to suboptimal performance, particularly for larger models and datasets.

Wrapping Up

When you finish, make sure to release the allocated resources. Please note that the SDK environment's storage is not persistent: every time you start or stop it, the files you modified inside it will be lost, so be sure to save all your work. Exit the SDK pod by running exit. Then, run the following commands.
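For example, to scale the pre-compiled model deployments back down after exiting the pod (the deployment names are illustrative, matching the earlier scale-up example):

oc scale deployment/roberta --replicas=0
oc scale deployment/bert-base --replicas=0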