UAlbany Supercomputing Resources Guide

Table of Contents

  1. Introduction

  2. Overview of Available Systems

  3. HPC Cluster

  4. DGX Cloud Cluster (H100)

  5. DGX On-Prem Cluster (A100)

  6. Storage Resources and Management

  7. Requesting Access to UAlbany Supercomputing Resources

  8. Connection and Access Guide

  9. Working with SLURM

  10. Using Container Images with DGX Environments

  11. Demo

  12. Additional Resources

1. Introduction

What is a Supercomputer?

Ever tried running a complex analysis on your laptop and watched it freeze up or take forever? That's where supercomputers come in. A supercomputer isn't just one massive computer - it's actually a bunch of computers (we call them “nodes”) working together as a team. Think of it like a group project where each team member tackles part of the work simultaneously instead of one person doing everything.

These systems have a “head node” which is basically the team leader - it's where you log in and tell the system what you want to do. Then it delegates the actual work to all the other computers in the cluster. It's like the difference between having one person solve a 1000-piece puzzle versus having 50 people each working on 20 pieces at once.

Why Use UAlbany's Supercomputing Resources?

Let's be real - some research and computational tasks would take weeks or even months on your personal computer (if they'd run at all). UAlbany's supercomputing systems can turn those weeks into hours or even minutes. Whether you're crunching numbers for climate models, analyzing genome sequences, training the next cool AI model, or rendering complex visualizations, these systems have your back.

Plus, you don't have to worry about upgrading your personal computer or maxing out your credit card on cloud computing fees - these resources are available to you as part of the university community!

2. Overview of Available Systems

Comparison of Available Environments

UAlbany has several different supercomputing environments, each with its own personality and strengths. They're all built on the same basic concept (clusters of computers working together), but the hardware inside each one makes a big difference in what they're good at.

All our clusters have GPUs (Graphics Processing Units), which are specialized chips that excel at certain types of calculations. Think of them as the math wizards of the computing world - they might not be great at everything, but when it comes to specific types of problems (especially in AI and data science), they'll run circles around regular CPUs.

Here's how our systems stack up:

Environments & Resources      HPC              DGX Cloud Cluster (H100)    DGX On-Prem Cluster (A100)
Computing Nodes               60               4                           24
GPU/Acceleration Cards        12 NVIDIA A40    32 NVIDIA H100              192 NVIDIA A100
Computing Node Datasheet      N/A

By the way, if you spot a "DGX Cloud Cluster (A100)" mentioned in our documentation, that's not a mistake! Our first cloud offering (from '23 to '24) came with 32 NVIDIA A100 GPUs, but when we renewed our agreement with NVIDIA, we got our hands on these awesome state-of-the-art H100 GPUs instead. Our documentation is constantly being updated - it's tough keeping up with all the cool stuff happening at once. So if you see references to the older A100 setup, just know we've upgraded, and please let us know if you find anything confusing!

Understanding GPU Technology

GPUs were originally created to render video games (making those stunning graphics in your favorite games possible), but researchers quickly realized they could be repurposed for scientific computing. Why? Because they're built to process lots of similar calculations at the same time – super helpful for both rendering explosions in games AND training neural networks.

GPU Comparison Chart

GPUs & Specifications         NVIDIA A40    NVIDIA A100    NVIDIA H100
GPU Memory (GB)               48            80             80
Memory Bandwidth (GB/s)       696           2039           3350
FP16 Tensor Core (TFLOPS)     299.4         624            1979
GPU Datasheet

Memory, Bandwidth, and TFLOPS Explained

GPU Memory: This is like your GPU's personal workspace. Imagine you're working on a project - GPU memory is the size of your desk. A bigger desk (more memory) means you can spread out more materials (data) without having to constantly file things away and retrieve them later. When you're working with huge datasets or models, having enough GPU memory is crucial so your GPU doesn't waste time shuffling data back and forth to the CPU.

Memory Bandwidth: If GPU memory is the size of your desk, bandwidth is how quickly you can move things around on it. Higher bandwidth means you can grab that reference book, flip to the right page, and get the info you need faster. For data-intensive applications, higher bandwidth means less time waiting and more time computing.

TFLOPS (Trillion Floating Point Operations Per Second): This is the raw horsepower measurement. One TFLOP means a trillion calculations per second (wrap your head around that!). To put this in perspective: if each calculation were a grain of rice, the H100 GPU could fill thousands of Olympic swimming pools with rice... every second. At FP16, the H100 can handle roughly 3 times more operations per second than the A100 (1979 vs 624 TFLOPS), which is mind-blowingly powerful.

Which System Is Right for Me?

[Figure: System Selection Decision Tree]

HPC On-Prem Cluster: This is our “starter” supercomputer - think of it as the reliable sedan of our fleet. It has more CPU power than GPU power, with 12 NVIDIA A40 GPUs spread across 60 computing nodes. It's great for cutting your teeth on supercomputing and handles a wide variety of tasks well without being specialized in any particular area. Perfect for data science work, statistical analysis, and smaller machine learning jobs where you're still figuring things out.

DGX Cloud Cluster (H100): This cloud-based system is like having a small team of super-athletes at your disposal. It consists of 4 computing nodes, each packing 8 powerhouse NVIDIA H100 GPUs (32 total). This is our only system that lives “off-campus” in NVIDIA's cloud rather than in our data center. While each H100 is about 3 times more powerful than an A100 (seriously, these things are beasts!), the math still favors our on-prem system in total computing power. Why? Because the DGX A100 On-Prem has 6 times more GPUs in total (192 vs 32). Think of it as choosing between 32 Olympic weightlifters or 192 college athletes - those 192 can collectively lift more, even if individually they're not as strong!

DGX On-Prem Cluster (A100): The heavyweight champion of our supercomputing lineup. With 192 NVIDIA A100 GPUs spread across 24 nodes, this system has the most total computing power. It's perfect for those massive projects where you need serious computational muscle. And if your research budget allows it, there's even a paid tier that lets you reserve a larger chunk of these resources for your team's exclusive use. This system is ideal when you've graduated from experimenting and are ready to train serious models or crunch massive datasets.

Key H100 Advantages

While our on-prem A100 system still wins the total compute battle, the following features make the H100s the go-to choice when you need cutting-edge capabilities rather than just raw horsepower.

[Figure: A100 vs H100]

  1. FP8 Precision Support: These GPUs come with remarkable FP8 (8-bit floating point) precision capabilities - it's like upgrading from a standard toolbox to precision surgical instruments! While A100s could handle FP16, the H100s go even further with FP8, letting you run models at lower precision with little to no loss in accuracy. The result? Up to 4x performance boost on certain workloads. Pretty incredible, right?

  2. Built-in Transformer Engine: The H100s pack a specialized engine specifically designed for transformer-based AI models (think GPT and BERT). It's like having a Ferrari engine installed specifically for the most demanding part of your AI workload. These custom circuits make language models absolutely fly compared to previous generations.

  3. Dynamic Sparsity: This is a game-changer that works like an efficiency expert for your models. The H100's second-generation structured sparsity support skips the computations tied to zero-valued weights, so models pruned to the supported sparsity pattern run noticeably faster on the same hardware. Imagine having a smart assistant that finds all the shortcuts through a maze - that's what this does for large neural networks. By the way, here is a very good article from NVIDIA on sparsity: What Is Sparsity in AI.

Use Case Scenarios

For Data Science Projects:
Working with R, Python, and your favorite data science libraries? The HPC cluster has you covered for most projects. It's perfect for data wrangling, creating visualizations, and running analyses that don't need specialized hardware. Think of it as your go-to for everyday data science tasks.

For Machine Learning Research:
If you're training traditional ML models (random forests, SVMs, etc.), the HPC environment will serve you well. But once you start getting into more complex territory, you might want to level up to one of the DGX environments. It's like the difference between making a quick sketch and painting a masterpiece - different tools for different ambitions.

For Deep Learning Applications:
Deep learning is where our specialized DGX environments really shine. If you're working with frameworks like TensorFlow, PyTorch, or JAX:

  • Start with HPC for your initial prototyping, then graduate to the bigger systems when you're ready to scale up

  • Try the DGX H100 Cloud when you have a particularly calculation-heavy model that would benefit from the H100's raw number-crunching power

  • Go with the DGX A100 On-Prem when you need to process huge batches or want to spread your workload across multiple GPUs

For Statistical Analysis:
For most statistical work, the HPC cluster will be your best friend. But if you're venturing into computationally intensive methods like large-scale Bayesian modeling or working with truly massive datasets, consider the extra firepower of the DGX systems.

3. HPC Cluster

Technical Specifications

  • 60 computing nodes

  • 12 NVIDIA A40 GPUs (48GB memory each)

  • Best suited for general-purpose computing and entry-level GPU workloads

Ideal Use Cases

  • Data preprocessing and cleaning (the digital equivalent of washing your dishes before cooking)

  • Statistical analysis and modeling

  • Getting your feet wet with machine learning

  • Simulations that don't require the absolute latest in GPU technology

Getting Started

Connect to the head node at head.arcc.albany.edu using SSH. Just need a JupyterLab session? Say no more! The HPC offers a very convenient way to start one through the JupyterHub server, so you don't have to SSH at all - just open the following link and be happy: https://jupyterlab.its.albany.edu/. Have any questions? We've got your back - take a look at this Wiki page on how to use the tool: JupyterHub Service Offering.
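
If you'd rather work in a terminal, the SSH route is a one-liner (swap in your own NetID):

ssh your_netid@head.arcc.albany.edu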

4. DGX Cloud Cluster (H100)

Technical Specifications

  • 4 computing nodes

  • 32 NVIDIA H100 GPUs (80GB memory each)

  • Cloud-hosted environment provided by NVIDIA

  • The newest, shiniest GPUs in our fleet

Ideal Use Cases

  • Cutting-edge deep learning research

  • Training large language models (think cousin-of-ChatGPT level stuff)

  • Projects that benefit from the latest GPU architecture

  • Workloads that can take advantage of the H100's new FP8 precision capability

Getting Started

Connect to the head node at 207.211.163.76 using SSH with certificate authentication (slightly trickier than password login, but we'll help you get set up).
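
In practice, certificate authentication just means pointing SSH at the key/certificate file you receive once your access request is approved. The file name below is only a placeholder - use whatever path you saved yours under:

ssh -i ~/.ssh/your_dgx_cloud_key your_netid@207.211.163.76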

5. DGX On-Prem Cluster (A100)

Technical Specifications

  • 24 computing nodes

  • 192 NVIDIA A100 GPUs (80GB memory each)

  • Our largest GPU cluster by total compute capacity

  • Available in a paid tier for research groups needing dedicated resources

Ideal Use Cases

  • Large-scale deep learning projects

  • Multi-GPU and multi-node training

  • When you need to throw massive computational resources at a problem

  • Research teams with funding who need guaranteed access to high-end computing

Getting Started

Connect to the head node at dgx-head01.its.albany.edu using SSH.

6. Storage Resources and Management

Let's talk about where you'll keep all that data and code while working on our supercomputers. After all, having powerful computing is great, but you also need somewhere to store your stuff!

What Storage You'll Get

Your Personal Space:

  • Everyone gets 10GB of personal research storage - think of it as your private locker

  • Perfect for keeping your scripts, configuration files, and those special research nuggets no one else needs to access

The Lab Share ($LAB Directory):

  • Every faculty researcher gets a generous 10TiB (that's 10,240GiB!) of shared space

  • This is your team's clubhouse - a place where everyone in your research group can collaborate

  • To get your lab space set up, just fill out the Research Storage Request Form

  • Your $LAB directory shows up automatically on all our cluster systems, so it's always right where you need it

Bonus for DGX Users:

  • If you're working on the DGX On-Prem system, you'll also get 1TB of extra-speedy flash storage

  • This is in addition to your regular 10TiB lab share

  • It's like having both a filing cabinet and a whiteboard - regular storage plus some space optimized for quick access

How We Keep Your Data Safe

We take your research data seriously - here's how we protect it:

  • Everything's encrypted - so even if someone got physical access to the storage, they couldn't read your data

  • We use clever compression and de-duplication - so you can store more with your allocation

  • Automatic backups galore:

    • Hourly “snapshots” for the last 23 hours (oops, deleted that file 2 hours ago? No problem!)

    • Daily snapshots kept for 21 days (accidentally deleted something last week? We've still got you!)

    • And the best part? These backups don't count against your storage quota!

  • We keep a second copy of your $LAB directory at another location - because one backup is never enough!

Getting to Your Files

Accessing your storage is easy - you just need to map a network drive on your device. If you have any questions on how to do that, we have a neat tutorial right here to get you started: How to Map a Network Drive. That said, are you:

  • On campus? It's available from any university computer you log in to

  • Working from home? Just connect to the VPN from your device and it's all there

  • Using the supercomputers? Your storage is automatically mounted and ready to use

Need Even More Space?

If your research is grant-funded and you need additional storage: reach out to askIT@albany.edu for a consultation to get your storage estimates right.

7. Requesting Access to UAlbany Supercomputing Resources

Each of UAlbany's supercomputing environments has a specific access request process. Here's how to get started with each system:

Important Note: All access requests must be initiated by Principal Investigators (PIs). If you are a student, postdoc, or lab member requiring access, please ask your PI to submit the request on your behalf.

HPC Cluster Access

Eligibility: Available to all members of the University at Albany research community, collaborators, and partners.

Request Process:

  • Email askIT@albany.edu to request access

  • Include your NetID, department, and a brief description of your research needs

  • Access is typically granted within 1-2 business days

DGX On-Prem Cluster (A100) Access

For complete information on limits and availability: DGX On-Prem Service Offering.

Eligibility: The DGX On-Prem cluster offers two tiers of access to accommodate different research needs:

Free Tier:

  • No cost for UAlbany faculty

  • Access to GPU resources on a first-come, first-served basis

  • Workloads may be preempted by prioritized jobs

  • Suitable for research projects with flexible timelines

Prioritized Access Tier:

  • $1,200 annually per prioritized GPU

  • Priority scheduling of your workloads over free-tier jobs

  • Guaranteed resource allocation for time-sensitive research projects

Request Process:

  1. Complete the Research Storage Request Form to provision your lab directory

  2. Complete the DGX On-Prem Computation Request Form for access to the NVIDIA On-Prem resources

  3. For prioritized access, indicate your interest in the request form, and the ITS team will follow up with details

DGX Cloud Cluster (H100) Access

For complete information on limits and availability: DGX Cloud Service Offering.

Eligibility: Currently available to UAlbany faculty free of charge.

Request Process:

  • Complete the DGX Cloud Computation Request Form

  • You'll receive connection instructions and certificate authentication details by email

  • Due to the limited number of resources, requests may be subject to approval based on research needs

After Access is Granted

Once you've been granted access to any of our systems:

  • You'll receive an email with your request details and connection instructions

  • Check out our wiki pages for tutorials and example job scripts

For any questions about access or to check on the status of your request, please contact askIT@albany.edu.

8. Connection and Access Guide

VPN Setup

First things first - if you're off campus, you'll need to connect to the university VPN (GlobalProtect) before you can access any of our supercomputing systems. Think of it as getting your ID checked before entering the building. It's a pretty simple and straightforward process, and if you're not familiar with it, please check the instructions here: How to Use the VPN. That said, we recommend connecting to the VPN regardless of where you're working from (even on campus) to avoid any connectivity issues that might pop up.

SSH Connection Instructions

This is another simple and straightforward process, and if you're not familiar with it, please check the instructions here: How to Connect via SSH. To connect to our systems, you'll need:

  • The hostname (like an address for the system)

  • Port 22 (the standard door for SSH connections)

  • Your NetID (your campus username)

  • Either your NetID password (for on-campus systems) or a certificate (for the cloud system)

Here's a handy reference table:

Environment     Hostname
HPC             head.arcc.albany.edu
DGX On-Prem     dgx-head01.its.albany.edu
DGX Cloud       207.211.163.76

Connection Methods

On macOS or Linux:
You're in luck! These systems come with SSH built in. Just open a terminal window and type:

ssh your_netid@hostname

On Windows:
You'll need to download an SSH client first. We recommend PuTTY (it's free!), but VS Code's Remote - SSH extension is also a great option if you're already using VS Code (also free).

Pro Tip: If you find yourself connecting to these systems often, set up an SSH config file on your computer. It's like creating speed dial entries for your favorite contacts - you'll save time and typing errors.
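
For reference, here's a sketch of what those "speed dial" entries can look like in ~/.ssh/config, using the hostnames from the table above (the Host nicknames, username, and key path are placeholders - adjust them to your own setup):

Host hpc
    HostName head.arcc.albany.edu
    User your_netid

Host dgx-onprem
    HostName dgx-head01.its.albany.edu
    User your_netid

Host dgx-cloud
    HostName 207.211.163.76
    User your_netid
    IdentityFile ~/.ssh/your_dgx_cloud_key

With that in place, connecting is as simple as typing ssh hpc.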

9. Working with SLURM

All our supercomputing environments use a system called SLURM to manage jobs. Think of SLURM as the scheduling assistant who makes sure everyone gets their fair share of computing time. Our team has put together a very nice page on SLURM that you can check here: How to Schedule via SLURM.

At this point you might be asking: “Where do all these parts fit in? VPN? SLURM? This seems like a lot of trouble...” Don't worry, we've got your back, and once you go through this process for the first time, you'll see it's actually easier done than said (pun intended). We understand there are a lot of moving parts, but hey, this is how the actual AI world works: nothing fancy. Take a look at the following diagram for a clearer picture of the whole process.

Here's the typical workflow:

  1. Connect to the head node via SSH

  2. Prepare your job script (a file telling the system what you want to run)

  3. Submit your job using SLURM commands

  4. Kick back while SLURM finds the right computers to run your job

  5. Come back later to collect your results

Basic Commands

  • squeue: See what jobs are currently running or waiting (like checking the status board at an airport)

  • srun: Run a command directly on compute nodes (for interactive work - see next topic)

  • sbatch: Submit a job script to be run when resources are available (for non-interactive work - see next topic)

  • scancel: Cancel a job (in case you spot a mistake or change your mind)

  • sinfo: See information about the available partitions (groups of nodes)
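
Here's roughly how these look in day-to-day use (your_netid and the job ID are placeholders):

sinfo                      # which partitions and nodes are available right now?
squeue -u your_netid       # what are my jobs up to?
scancel 123456             # changed my mind - cancel job 123456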

Differences Between Non-Interactive and Interactive Jobs

With srun you can run a command directly on compute nodes (for interactive work) - Think of this like the movie Inception. You first login to the head node (your first dream level), and then using srun, you “dream deeper” into a compute node where the real computational power exists. Just like in Inception where they needed to go deeper to accomplish the mission, you use srun to dive into the more powerful compute environment where your tasks can actually run. Spoiler alert - No spinning top required to check if you're in reality - just the command prompt will tell you which node you're on!

When should you use srun instead of sbatch? Use srun when you need to be “present in the dream” - when you require real-time interaction with your work. This is perfect for debugging, exploratory data analysis, or interactive Python sessions where you're actively typing commands and expecting immediate responses. Meanwhile, sbatch is like planting an idea and walking away - you submit your job script and let it run in the background while you do something else, checking in later to see the results. Choose srun when you need to be hands-on and sbatch when you want to “set it and forget it.”
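
For instance, a typical way to "dream deeper" into a compute node is to ask SLURM for an interactive shell - the sketch below requests one GPU for an hour (partition names and limits vary between our clusters, so adjust as needed):

srun --gpus=1 --time=01:00:00 --pty bash

When it starts, the prompt shows the compute node you landed on, and exiting the shell hands the resources back to SLURM.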

Job Submission Guide

The most common way to run jobs is by creating a script and submitting it with sbatch. A basic script looks something like this:

#!/bin/bash
#SBATCH --job-name=my_awesome_analysis
#SBATCH --output=results_%j.out
#SBATCH --error=results_%j.err
#SBATCH --time=01:00:00
#SBATCH --gpus=1

# Run your actual program
python my_analysis.py

Submit this with sbatch my_script.sh and SLURM will take care of the rest!

Resource Allocation Best Practices

  • Only request what you actually need - overestimating resources means longer wait times

  • For long-running jobs, use checkpointing (see this awesome page: Checkpointing Guide) to save progress periodically

  • Start small when testing new code, then scale up once you know it works

  • Be specific about your requirements (memory, GPUs, etc.) so SLURM can match you with the right hardware

Monitoring Jobs

Once your job is running, you can monitor it with:

  • squeue -u your_netid to see all your jobs

  • sacct -j job_id for detailed info about a specific job
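
For example, with 123456 standing in for your actual job ID:

sacct -j 123456 --format=JobID,JobName,State,Elapsed,MaxRSS

The extra --format columns show how long the job ran and how much memory it actually used - handy for right-sizing your next resource request.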

10. Using Container Images with DGX Environments

What Are Containers (And Why Should You Care)?

Think of containers like those meal prep kits that deliver everything you need to cook a specific dish. All the ingredients, spices, and instructions come in one package, so you don't have to shop for each item or worry if your local store carries that obscure spice the recipe needs.

In the computing world, containers work the same way - they package up software, libraries, and dependencies so everything “just works” together. No more “but it worked on my laptop!” frustrations or spending hours installing the right version of every library.

NVIDIA NGC Catalog: Your Container Buffet

For both our DGX environments (Cloud H100 and On-Prem A100), we highly recommend using container images from the NVIDIA NGC Catalog (https://catalog.ngc.nvidia.com/containers). Think of NGC as a massive buffet of pre-configured containers specifically optimized for NVIDIA GPUs.

Why use these containers? Because:

  • They're pre-optimized for our hardware - these containers were literally made by the same folks who built the GPUs in our systems

  • They save you tons of setup time - no need to install and configure all the right libraries

  • They're performance-tuned - often running faster than manually installed software

  • They're regularly updated - security patches and new features get added without breaking your workflow

Popular Container Images for Research

The NGC catalog offers containers for just about everything GPU-related, but here are some crowd favorites:

  • Deep Learning Frameworks: Ready-to-use containers for PyTorch, TensorFlow, JAX, and more

  • HPC Applications: Containers for scientific computing, simulations, and modeling

  • AI/ML Tools: Specialized containers for computer vision, speech recognition, and NLP

Getting Started with Containers

Using containers on our DGX systems is straightforward:

  1. Browse the NGC catalog to find the container you need

  2. In your SLURM job script, specify the container directly:

    #SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:25.01-py3'

Mounting Storage with Containers

Since containers have their own isolated filesystem, you'll need to explicitly mount your storage directories:
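
A single SLURM directive takes care of this. The left-hand path below is just a placeholder for your own lab directory; the right-hand side is where it appears inside the container (matching the /mnt/lab path mentioned later in this guide):

#SBATCH --container-mounts=/path/to/your/lab/share:/mnt/lab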

This maps your lab folder to a mount point inside the container, so your data and code remain accessible.

Example - Using a Container Image to Start a Jupyter Notebook on DGX On-Prem

Let's walk through a practical example that you'll likely use all the time - setting up a Jupyter notebook session on the DGX On-Prem cluster. This script creates an interactive JupyterLab environment where you can develop and test your code with all the perks of our powerful GPUs. It automatically generates a secure password and gives you a URL to access your notebook from your browser.
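
Here's a minimal sketch of what such a jupyter.sh can look like, assuming the PyTorch NGC container from the previous section and a placeholder lab path - treat it as a starting point and adapt the image, port, time limit, and mounts to your project:

#!/bin/bash
#SBATCH --job-name=jupyter
#SBATCH --output=jupyter-%j.out
#SBATCH --time=8:00:00
#SBATCH --gres=gpu:1
#SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:25.01-py3'
#SBATCH --container-mounts=/path/to/your/lab/share:/mnt/dgx_lab

# Generate a random token that doubles as the session password
TOKEN=$(openssl rand -hex 16)

# Print the URL and password so they land in the jupyter-*.out file
echo "JupyterLab URL: http://$(hostname -f):8888/?token=${TOKEN}"
echo "Password/token: ${TOKEN}"

# Start JupyterLab on the compute node, listening on all interfaces
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser \
    --ServerApp.token="${TOKEN}" --notebook-dir=/mnt/dgx_lab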

Once you submit this script with sbatch jupyter.sh, SLURM will find an available node, launch your container, and start the Jupyter server. Simply copy the URL and password from the output file (check jupyter-*.out) and paste it into your browser. Voilà! You now have a full-featured development environment running on our high-performance hardware. As you may have noticed, the script contains several SLURM flags that you can customize based on your needs. Feel free to adjust the time limit, GPU count, or container image to match your specific research requirements. Let's break down some key flags you might want to customize:

  • --time=8:00:00: Need a longer session? Change this to increase your time allocation

  • --gres=gpu:1: Working with bigger models? Bump this up to request more GPUs (e.g., gpu:2 or gpu:4)

  • --container-image: Want a different framework? Swap in any NGC container that suits your project (check all the amazing available images here: https://catalog.ngc.nvidia.com/containers)

  • --container-mounts: Make sure to change this according to your own lab directories

Remember that your session will run for the time specified (8 hours in this example) and then automatically terminate. Need to save your work? No worries - everything you save in the /mnt/dgx_lab or /mnt/lab directories will be stored in your lab's permanent storage space, so it'll be there waiting for you next time you log in!

Container Tips and Tricks

  • Persistent Storage: Always mount your home directory or lab folder as shown above - containers are temporary!

  • Custom Containers: Already have a working setup? You can build your own container to ensure reproducibility

  • Container Versions: Always specify the exact version of containers (like '25.01-py3' instead of 'latest') to avoid surprise updates

11. Demo

Need some visual assistance? Not a problem! Take a look at the video below and see how to start a JupyterLab session on the DGX On-Prem.

12. Additional Resources

Once again, don't worry - you're not alone on this supercomputing journey! We've created a wealth of resources to help you make the most of UAlbany's computational power, and our documentation is constantly being updated.

Our AI Tutorials page is like a comprehensive table of contents on everything related to our supercomputing resources. It’s like a cheat sheet, where you can quickly navigate to the topic you most need help with.

Looking for ready-to-run examples? Check out the Code Tutorials section. We've prepared sample Python scripts and Jupyter notebooks that are specifically designed for our DGX environments. It's like having a cookbook full of recipes that are guaranteed to work in our kitchen!

The supercomputing community at UAlbany is constantly growing and evolving. Have a question that isn't covered in our documentation? Found a clever way to use our systems that might help others? We'd love to hear from you! Reach out to askIT@albany.edu - your feedback helps us make these resources better for everyone.

Remember: today's supercomputing question is tomorrow's wiki article. Your curiosity drives our documentation!