Table of Contents
1. Introduction
What is a Supercomputer?
Ever tried running a complex analysis on your laptop and watched it freeze up or take forever? That's where supercomputers come in. A supercomputer isn't just one massive computer - it's actually a bunch of computers (we call them “nodes”) working together as a team. Think of it like a group project where each team member tackles part of the work simultaneously instead of one person doing everything.
These systems have a “head node” which is basically the team leader - it's where you log in and tell the system what you want to do. Then it delegates the actual work to all the other computers in the cluster. It's like the difference between having one person solve a 1000-piece puzzle versus having 50 people each working on 20 pieces at once.
Why Use UAlbany's Supercomputing Resources?
Let's be real - some research and computational tasks would take weeks or even months on your personal computer (if they'd run at all). UAlbany's supercomputing systems can turn those weeks into hours or even minutes. Whether you're crunching numbers for climate models, analyzing genome sequences, training the next cool AI model, or rendering complex visualizations, these systems have your back.
Plus, you don't have to worry about upgrading your personal computer or maxing out your credit card on cloud computing fees - these resources are available to you as part of the university community!
2. Overview of Available Systems
Comparison of Available Environments
UAlbany has several different supercomputing environments, each with its own personality and strengths. They're all built on the same basic concept (clusters of computers working together), but the hardware inside each one makes a big difference in what they're good at.
All our clusters have GPUs (Graphics Processing Units), which are specialized chips that excel at certain types of calculations. Think of them as the math wizards of the computing world - they might not be great at everything, but when it comes to specific types of problems (especially in AI and data science), they'll run circles around regular CPUs.
Here's how our systems stack up:
Environments & Resources | HPC | DGX Cloud Cluster (H100) | DGX On-Prem Cluster (A100)
---|---|---|---
Computing Nodes | 60 | 4 | 24
GPU/Acceleration Cards | 12 NVIDIA A40 | 32 NVIDIA H100 | 192 NVIDIA A100
Computing Node Datasheet | N/A | (see vendor datasheet) | (see vendor datasheet)
By the way, if you spot a "DGX Cloud Cluster (A100)" mentioned in our documentation, that's not a mistake! Our first cloud offering (from '23 to '24) came with 32 NVIDIA A100 GPUs, but when we renewed our agreement with NVIDIA, we got our hands on these awesome state-of-the-art H100 GPUs instead. Our documentation is constantly being updated - it's tough keeping up with all the cool stuff happening at once. So if you see references to the older A100 setup, just know we've upgraded, and please let us know if you find anything confusing!
Understanding GPU Technology
GPUs were originally created to render video games (making those stunning graphics in your favorite games possible), but researchers quickly realized they could be repurposed for scientific computing. Why? Because they're built to process lots of similar calculations at the same time – super helpful for both rendering explosions in games AND training neural networks.
GPU Comparison Chart
GPUs & Specifications | NVIDIA A40 | NVIDIA A100 | NVIDIA H100
---|---|---|---
GPU Memory (GB) | 48 | 80 | 80
Memory Bandwidth (GB/s) | 696 | 2039 | 3350
FP16 Tensor Core (TFLOPS) | 299.4 | 624 | 1979
GPU Datasheet | (see vendor datasheet) | (see vendor datasheet) | (see vendor datasheet)
Memory, Bandwidth, and TFLOPS Explained
GPU Memory: This is like your GPU's personal workspace. Imagine you're working on a project - GPU memory is the size of your desk. A bigger desk (more memory) means you can spread out more materials (data) without having to constantly file things away and retrieve them later. When you're working with huge datasets or models, having enough GPU memory is crucial so your GPU doesn't waste time shuffling data back and forth to the CPU.
Memory Bandwidth: If GPU memory is the size of your desk, bandwidth is how quickly you can move things around on it. Higher bandwidth means you can grab that reference book, flip to the right page, and get the info you need faster. For data-intensive applications, higher bandwidth means less time waiting and more time computing.
TFLOPS (Trillion Floating Point Operations Per Second): This is the raw horsepower measurement. One TFLOPS means a trillion floating-point calculations every second (wrap your head around that!). To put this in perspective: if each calculation were a grain of rice, the H100 could fill thousands of Olympic swimming pools with rice... every second. The H100 can also handle approximately 3 times more operations per second than the A100, which is mind-blowingly powerful.
Which System Is Right for Me?
HPC On-Prem Cluster: This is our “starter” supercomputer - think of it as the reliable sedan of our fleet. It has more CPU power than GPU power, with 12 NVIDIA A40 GPUs spread across 60 computing nodes. It's great for cutting your teeth on supercomputing and handles a wide variety of tasks well without being specialized in any particular area. Perfect for data science work, statistical analysis, and smaller machine learning jobs where you're still figuring things out.
DGX Cloud Cluster (H100): This cloud-based system is like having a small team of super-athletes at your disposal. It consists of 4 computing nodes, each packing 8 powerhouse NVIDIA H100 GPUs (32 total). This is our only system that lives “off-campus” in NVIDIA's cloud rather than in our data center. While each H100 is about 3 times more powerful than an A100 (seriously, these things are beasts!), the math still favors our on-prem system in total computing power. Why? Because the DGX A100 On-Prem has 6 times more GPUs in total (192 vs 32). Think of it as choosing between 32 Olympic weightlifters or 192 college athletes - those 192 can collectively lift more, even if individually they're not as strong!
DGX On-Prem Cluster (A100): The heavyweight champion of our supercomputing lineup. With 192 NVIDIA A100 GPUs spread across 24 nodes, this system has the most total computing power. It's perfect for those massive projects where you need serious computational muscle. And if your research budget allows it, there's even a paid tier that lets you reserve a larger chunk of these resources for your team's exclusive use. This system is ideal when you've graduated from experimenting and are ready to train serious models or crunch massive datasets.
Key H100 Advantages
While our on-prem A100 system still wins the total compute battle, the following features make the H100s the go-to choice when you need cutting-edge capabilities rather than just raw horsepower.
FP8 Precision Support: These GPUs come with remarkable FP8 (8-bit floating point) precision capabilities - it's like upgrading from a standard toolbox to precision surgical instruments! While A100s could handle FP16, the H100s go even further with FP8, letting you run models at lower precision with little to no loss in accuracy. The result? Up to a 4x performance boost on certain workloads. Pretty incredible, right?
Built-in Transformer Engine: The H100s pack a specialized engine specifically designed for transformer-based AI models (think GPT and BERT). It's like having a Ferrari engine installed specifically for the most demanding part of your AI workload. These custom circuits make language models absolutely fly compared to previous generations.
Dynamic Sparsity: This is a game-changer that works like an efficiency expert for your models. The H100's second-generation structured sparsity support skips computations on weights that have been pruned to zero, so the GPU spends its time only on the math that matters. Imagine having a smart assistant that finds all the shortcuts through a maze - that's what this does for large neural networks, boosting performance with only minor changes to how you prepare your model! By the way, here is a very good article from NVIDIA on sparsity: What Is Sparsity in AI.
Use Case Scenarios
For Data Science Projects:
Working with R, Python, and your favorite data science libraries? The HPC cluster has you covered for most projects. It's perfect for data wrangling, creating visualizations, and running analyses that don't need specialized hardware. Think of it as your go-to for everyday data science tasks.
For Machine Learning Research:
If you're training traditional ML models (random forests, SVMs, etc.), the HPC environment will serve you well. But once you start getting into more complex territory, you might want to level up to one of the DGX environments. It's like the difference between making a quick sketch and painting a masterpiece - different tools for different ambitions.
For Deep Learning Applications:
Deep learning is where our specialized DGX environments really shine. If you're working with frameworks like TensorFlow, PyTorch, or JAX:
Start with HPC for your initial prototyping, then graduate to the bigger systems when you're ready to scale up
Try the DGX H100 Cloud when you have a particularly calculation-heavy model that would benefit from the H100's raw number-crunching power
Go with the DGX A100 On-Prem when you need to process huge batches or want to spread your workload across multiple GPUs
For Statistical Analysis:
For most statistical work, the HPC cluster will be your best friend. But if you're venturing into computationally intensive methods like large-scale Bayesian modeling or working with truly massive datasets, consider the extra firepower of the DGX systems.
3. HPC Cluster
Technical Specifications
60 computing nodes
12 NVIDIA A40 GPUs (48GB memory each)
Best suited for general-purpose computing and entry-level GPU workloads
Ideal Use Cases
Data preprocessing and cleaning (the digital equivalent of washing your dishes before cooking)
Statistical analysis and modeling
Getting your feet wet with machine learning
Simulations that don't require the absolute latest in GPU technology
Getting Started
Connect to the head node at head.arcc.albany.edu using SSH. Do you need just a JupyterLab session? Then say no more! The HPC offers an extremely convenient way to start a JupyterLab session through the JupyterHub server, so you don't have to SSH or do anything like that - just access the following link and be happy: http://jupyterlab.its.albany.edu/. Have any questions? We've got your back - take a look at this great Wiki page on how to use this tool: JupyterHub Service Offering.
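If you'd rather go the SSH route, the connection is a one-liner from a terminal (or PuTTY on Windows) - just swap in your own NetID:

ssh your_netid@head.arcc.albany.edu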
4. DGX Cloud Cluster (H100)
Technical Specifications
4 computing nodes
32 NVIDIA H100 GPUs (80GB memory each)
Cloud-hosted environment provided by NVIDIA
The newest, shiniest GPUs in our fleet
Ideal Use Cases
Cutting-edge deep learning research
Training large language models (think cousin-of-ChatGPT level stuff)
Projects that benefit from the latest GPU architecture
Workloads that can take advantage of the H100's new FP8 precision capability
Getting Started
Connect to the head node at 207.211.163.76 using SSH with certificate authentication (slightly trickier than password login, but we'll help you get set up).
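As a rough sketch (the key/certificate file name below is just a placeholder - the real details arrive with your onboarding email), the connection looks like this:

ssh -i ~/.ssh/your_dgx_cloud_key your_netid@207.211.163.76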
5. DGX On-Prem Cluster (A100)
Technical Specifications
24 computing nodes
192 NVIDIA A100 GPUs (80GB memory each)
Our largest GPU cluster by total compute capacity
Available in a paid tier for research groups needing dedicated resources
Ideal Use Cases
Large-scale deep learning projects
Multi-GPU and multi-node training
When you need to throw massive computational resources at a problem
Research teams with funding who need guaranteed access to high-end computing
Getting Started
Connect to the head node at dgx-head01.its.albany.edu using SSH.
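For example, from a terminal (replace your_netid with your campus NetID):

ssh your_netid@dgx-head01.its.albany.edu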
6. Storage Resources and Management
Let's talk about where you'll keep all that data and code while working on our supercomputers. After all, having powerful computing is great, but you also need somewhere to store your stuff!
What Storage You'll Get
Your Personal Space:
Everyone gets 10GB of personal research storage - think of it as your private locker
Perfect for keeping your scripts, configuration files, and those special research nuggets no one else needs to access
The Lab Share ($LAB Directory):
Every faculty researcher gets a generous 10TiB (that's 10,240GiB!) of shared space
This is your team's clubhouse - a place where everyone in your research group can collaborate
To get your lab space set up, just fill out the Research Storage Request Form
Your $LAB directory shows up automatically on all our cluster systems, so it's always right where you need it
Bonus for DGX Users:
If you're working on the DGX On-Prem system, you'll also get 1TB of extra-speedy flash storage
This is in addition to your regular 10TiB lab share
It's like having both a filing cabinet and a whiteboard - regular storage plus some space optimized for quick access
How We Keep Your Data Safe
We take your research data seriously - here's how we protect it:
Everything's encrypted - so even if someone got physical access to the storage, they couldn't read your data
We use clever compression and de-duplication - so you can store more with your allocation
Automatic backups galore:
Hourly “snapshots” for the last 23 hours (oops, deleted that file 2 hours ago? No problem!)
Daily snapshots kept for 21 days (accidentally deleted something last week? We've still got you!)
And the best part? These backups don't count against your storage quota!
We keep a second copy of your $LAB directory at another location - because one backup is never enough!
Getting to Your Files
Accessing your storage is super easy - you just need to map a network drive on your device. If you have any questions on how to do that, we have this neat tutorial right here to get you started: How to Map a Network Drive. That said, are you:
On campus? It's available from any university computer you log in to
Working from home? Just connect to the VPN from your device and it's all there
Using the supercomputers? Your storage is automatically mounted and ready to use
Need Even More Space?
If your research is grant-funded and you need additional storage: reach out to askIT@albany.edu for a consultation to get your storage estimates right.
7. Requesting Access to UAlbany Supercomputing Resources
Each of UAlbany's supercomputing environments has a specific access request process. Here's how to get started with each system:
Important Note: All access requests must be initiated by Principal Investigators (PIs). If you are a student, postdoc, or lab member requiring access, please ask your PI to submit the request on your behalf.
HPC Cluster Access
Eligibility: Available to all members of the University at Albany research community, collaborators, and partners.
Request Process:
Email askIT@albany.edu to request access
Include your NetID, department, and a brief description of your research needs
Access is typically granted within 1-2 business days
DGX On-Prem Cluster (A100) Access
For complete information on limits and availability: DGX On-Prem Service Offering.
Eligibility: The DGX On-Prem cluster offers two tiers of access to accommodate different research needs:
Free Tier:
No cost for UAlbany faculty
Access to GPU resources on a first-come, first-served basis
Workloads may be preempted by prioritized jobs
Suitable for research projects with flexible timelines
Prioritized Access Tier:
$1,200 annually per prioritized GPU
Priority scheduling of your workloads over free-tier jobs
Guaranteed resource allocation for time-sensitive research projects
Request Process:
Complete the Research Storage Request Form to provision your lab directory
Complete the DGX On-Prem Computation Request Form for access to the NVIDIA On-Prem resources
For prioritized access, indicate your interest in the request form, and the ITS team will follow up with details
DGX Cloud Cluster (H100) Access
For complete information on limits and availability: DGX Cloud Service Offering.
Eligibility: Currently available to UAlbany faculty free of charge.
Request Process:
Complete the DGX Cloud Computation Request Form
You'll receive connection instructions and certificate authentication details by email
Due to the limited number of resources, requests may be subject to approval based on research needs
After Access is Granted
Once you've been granted access to any of our systems:
You'll receive an email with your request details and connection instructions
Check out our wiki pages for tutorials and example job scripts
For any questions about access or to check on the status of your request, please contact askIT@albany.edu.
8. Connection and Access Guide
VPN Setup
First things first - if you're off campus, you'll need to connect to the university VPN (GlobalProtect) before you can access any of our supercomputing systems. Think of it as getting your ID checked before entering the building. It's a pretty simple and straightforward process, and if you're not familiar with it, please check the instructions here: How to Use the VPN. That said, we recommend connecting to the VPN regardless of where you're working from (even on campus) to avoid any connectivity issues that might pop up.
SSH Connection Instructions
This is another simple and straightforward process, and if you're not familiar with it, please check the instructions here: How to Connect via SSH. To connect to our systems, you'll need:
The hostname (like an address for the system)
Port 22 (the standard door for SSH connections)
Your NetID (your campus username)
Either your NetID password (for on-campus systems) or a certificate (for the cloud system)
Here's a handy reference table:
Environment | Hostname |
---|---|
HPC | head.arcc.albany.edu |
DGX On-Prem | dgx-head01.its.albany.edu |
DGX Cloud | 207.211.163.76 |
Connection Methods
On macOS or Linux:
You're in luck! These systems come with SSH built in. Just open a terminal window and type:
ssh your_netid@hostname
On Windows:
You'll need to download an SSH client first. We recommend PuTTY (it's free!), but VS Code's Remote - SSH extension is also a great option if you're already using VS Code (also free).
Pro Tip: If you find yourself connecting to these systems often, set up an SSH config file on your computer. It's like creating speed dial entries for your favorite contacts - you'll save time and typing errors.
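As a minimal sketch (the aliases hpc and dgx-onprem are made up - pick whatever names you like), your ~/.ssh/config file on macOS or Linux could contain:

# Example host aliases for the UAlbany clusters
Host hpc
    HostName head.arcc.albany.edu
    User your_netid
    Port 22

Host dgx-onprem
    HostName dgx-head01.its.albany.edu
    User your_netid
    Port 22

With those entries in place, connecting is as simple as typing ssh hpc or ssh dgx-onprem.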
9. Working with SLURM
All our supercomputing environments use a system called SLURM to manage jobs. Think of SLURM as the scheduling assistant who makes sure everyone gets their fair share of computing time. Our team has put together a very nice page on SLURM that you can check here: How to Schedule via SLURM.
At this point you might be asking: “Where do all these parts fit in? VPN? SLURM? This seems like a lot of trouble...” Don't worry, we've got your back, and once you go through this process for the first time, you'll see it's actually easier done than said (pun intended). We understand there are a lot of moving parts, but hey, this is how the actual AI world works: nothing fancy. Take a look at the following diagram for a clearer picture of the whole process.
Here's the typical workflow:
Connect to the head node via SSH
Prepare your job script (a file telling the system what you want to run)
Submit your job using SLURM commands
Kick back while SLURM finds the right computers to run your job
Come back later to collect your results
Basic Commands
squeue: See what jobs are currently running or waiting (like checking the status board at an airport)
srun: Run a command directly on compute nodes (for interactive work - see next topic)
sbatch: Submit a job script to be run when resources are available (for non-interactive work - see next topic)
scancel: Cancel a job (in case you spot a mistake or change your mind)
sinfo: See information about the available partitions (groups of nodes)
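For instance, a quick way to get your bearings right after logging in (the exact output will vary from cluster to cluster) is:

sinfo                  # list the partitions and how many nodes are idle, mixed, or allocated
squeue                 # see every job currently queued or running
squeue -u your_netid   # narrow the view down to just your own jobs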
Differences Between Non-Interactive and Interactive Jobs
With srun you can run a command directly on compute nodes (for interactive work). Think of this like the movie Inception: you first log in to the head node (your first dream level), and then, using srun, you "dream deeper" into a compute node where the real computational power exists. Just like in Inception where they needed to go deeper to accomplish the mission, you use srun to dive into the more powerful compute environment where your tasks can actually run. Spoiler alert - no spinning top required to check if you're in reality; the command prompt will tell you which node you're on!
When should you use srun instead of sbatch? Use srun when you need to be "present in the dream" - when you require real-time interaction with your work. This is perfect for debugging, exploratory data analysis, or interactive Python sessions where you're actively typing commands and expecting immediate responses. Meanwhile, sbatch is like planting an idea and walking away - you submit your job script and let it run in the background while you do something else, checking in later to see the results. Choose srun when you need to be hands-on and sbatch when you want to "set it and forget it."
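Here's a hedged example of that "dream deeper" step (partition and account flags vary by cluster, so this sketch leaves them out - add them if your site requires them):

# ask SLURM for one GPU for an hour and open an interactive shell on a compute node
srun --gres=gpu:1 --time=01:00:00 --pty bash

# once the prompt shows a compute node's name, you can sanity-check the GPU
nvidia-smi

# type exit when you're done to "wake up" back on the head node
exit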
Job Submission Guide
The most common way to run jobs is by creating a script and submitting it with sbatch. A basic script looks something like this:

#!/bin/bash
#SBATCH --job-name=my_awesome_analysis
#SBATCH --output=results_%j.out
#SBATCH --error=results_%j.err
#SBATCH --time=01:00:00
#SBATCH --gpus=1

# Run your actual program
python my_analysis.py

Submit this with sbatch my_script.sh and SLURM will take care of the rest!
Resource Allocation Best Practices
Only request what you actually need - overestimating resources means longer wait times
For long-running jobs, use checkpointing (see this awesome page: Checkpointing Guide) to save progress periodically
Start small when testing new code, then scale up once you know it works
Be specific about your requirements (memory, GPUs, etc.) so SLURM can match you with the right hardware - see the sketch below for what this looks like in practice
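As a rough sketch (the numbers are placeholders - size them to your actual job), being specific means adding lines like these to the script from the previous section:

#SBATCH --cpus-per-task=4      # CPU cores for your task
#SBATCH --mem=32G              # system memory (RAM) for the job
#SBATCH --gres=gpu:1           # number of GPUs
#SBATCH --time=02:00:00        # wall-clock limit - the job stops when this runs out

Requesting a tight-but-realistic amount keeps your own queue times short and leaves more room for everyone else.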
Monitoring Jobs
Once your job is running, you can monitor it with:
squeue -u your_netid to see all your jobs
sacct -j job_id for detailed info about a specific job
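For example (job ID 12345 is just a placeholder - use the ID that sbatch or squeue reports for your job):

squeue -u your_netid                                          # everything you have queued or running
sacct -j 12345 --format=JobID,JobName,State,Elapsed,MaxRSS    # accounting details, even after the job ends
scancel 12345                                                 # cancel the job if something looks wrong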
10. Using Container Images with DGX Environments
What Are Containers (And Why Should You Care)?
Think of containers like those meal prep kits that deliver everything you need to cook a specific dish. All the ingredients, spices, and instructions come in one package, so you don't have to shop for each item or worry if your local store carries that obscure spice the recipe needs.
In the computing world, containers work the same way - they package up software, libraries, and dependencies so everything “just works” together. No more “but it worked on my laptop!” frustrations or spending hours installing the right version of every library.
NVIDIA NGC Catalog: Your Container Buffet
For both our DGX environments (Cloud H100 and On-Prem A100), we highly recommend using container images from the NVIDIA NGC Catalog (https://catalog.ngc.nvidia.com/containers). Think of NGC as a massive buffet of pre-configured containers specifically optimized for NVIDIA GPUs.
Why use these containers? Because:
They're pre-optimized for our hardware - these containers were literally made by the same folks who built the GPUs in our systems
They save you tons of setup time - no need to install and configure all the right libraries
They're performance-tuned - often running faster than manually installed software
They're regularly updated - security patches and new features get added without breaking your workflow
Popular Container Images for Research
The NGC catalog offers containers for just about everything GPU-related, but here are some crowd favorites:
Deep Learning Frameworks: Ready-to-use containers for PyTorch, TensorFlow, JAX, and more
HPC Applications: Containers for scientific computing, simulations, and modeling
AI/ML Tools: Specialized containers for computer vision, speech recognition, and NLP
Getting Started with Containers
Using containers on our DGX systems is straightforward:
Browse the NGC catalog to find the container you need
In your SLURM job script, specify the container directly:
#SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:25.01-py3'
Mounting Storage with Containers
Since containers have their own isolated filesystem, you'll need to explicitly mount your storage directories:
#SBATCH --container-mounts=/network/rit/dgx/dgx_[your_lab_here]:/mnt/dgx_[your_lab_here]
This maps your lab folder to a mount point inside the container, so your data and code remain accessible.
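Putting those two flags together, here's a minimal sketch of a container job (the lab path placeholder is the same one shown above, and the PyTorch image tag is just one example from NGC) that's handy for checking your setup before running anything serious:

#!/bin/bash
#SBATCH --job-name=container_test
#SBATCH --output=container_test-%j.out
#SBATCH --time=00:10:00
#SBATCH --gres=gpu:1
#SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:24.09-py3'
#SBATCH --container-mounts=/network/rit/dgx/dgx_[your_lab_here]:/mnt/dgx_[your_lab_here]

# quick sanity checks inside the container: is the GPU visible, and does PyTorch see it?
nvidia-smi
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

Submit it with sbatch, peek at the .out file, and once both checks pass you know your image and mounts are wired up correctly.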
Example - Using a Container Image to Start a Jupyter Notebook on DGX On-Prem
Let's walk through a practical example that you'll likely use all the time - setting up a Jupyter notebook session on the DGX On-Prem cluster. This script creates an interactive JupyterLab environment where you can develop and test your code with all the perks of our powerful GPUs. It automatically generates a secure password and gives you a URL to access your notebook from your browser.
#!/bin/bash
#SBATCH --job-name=jupyter
#SBATCH --output=jupyter-%j.out
#SBATCH --error=jupyter-%j.err
#SBATCH --time=8:00:00
#SBATCH --gres=gpu:1
#SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:24.09-py3'
#SBATCH --container-mounts=/network/rit/dgx/dgx_vieirasobrinho_lab:/mnt/dgx_lab,/network/rit/lab/vieirasobrinho_lab:/mnt/lab

# Get the DGX node name
node_name="$SLURMD_NODENAME"

echo -e "JupyterLab is being loaded..."

# Generate a random port number between 8000 and 8999
port=$((RANDOM % 1000 + 8000))

# Generate a random password (alphanumeric, 6 characters)
password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 6)

# Build the Jupyter URL
jupyter_url="http://${node_name}.its.albany.edu:${port}"

# Print session details
echo -e "\nYour JupyterLab session is available at: ${jupyter_url}\n"
echo -e "Your password is: ${password}\n"
echo -e "Please copy and paste the link into your browser and use the password to log in.\n"
echo -e "================================================================================\n"

# Start JupyterLab session
jupyter lab --allow-root --no-browser --NotebookApp.token="${password}" --NotebookApp.allow_origin='*' --NotebookApp.log_level='CRITICAL' --notebook-dir=/mnt --port=$port
Once you submit this script with sbatch jupyter.sh, SLURM will find an available node, launch your container, and start the Jupyter server. Simply copy the URL and password from the output file (check jupyter-*.out) and paste it into your browser. Voilà! You now have a full-featured development environment running on our high-performance hardware. As you may have noticed, the script contains several SLURM flags that you can customize based on your needs. Feel free to adjust the time limit, GPU count, or container image to match your specific research requirements. Let's break down some key flags you might want to customize:
--time=8:00:00: Need a longer session? Change this to increase your time allocation
--gres=gpu:1: Working with bigger models? Bump this up to request more GPUs (e.g., gpu:2 or gpu:4)
--container-image: Want a different framework? Swap in any NGC container that suits your project (check all the amazing available images here: https://catalog.ngc.nvidia.com/containers)
--container-mounts: Make sure to change this according to your own lab directories
Remember that your session will run for the time specified (8 hours in this example) and then automatically terminate. Need to save your work? No worries - everything you save in the /mnt/dgx_lab or /mnt/lab directories will be stored in your lab's permanent storage space, so it'll be there waiting for you next time you log in!
Container Tips and Tricks
Persistent Storage: Always mount your home directory or lab folder as shown above - containers are temporary!
Custom Containers: Already have a working setup? You can build your own container to ensure reproducibility (see the sketch after this list)
Container Versions: Always specify the exact version of containers (like '25.01-py3' instead of 'latest') to avoid surprise updates
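If you do go the custom route, here's a minimal sketch (it assumes you have Docker and NGC access on your own machine - the registry name and extra packages are purely illustrative) that extends an NGC base image:

# build a custom image on top of an NGC PyTorch base (example packages only)
docker build -t registry.example.edu/my_lab/pytorch-custom:1.0 - <<'EOF'
FROM nvcr.io/nvidia/pytorch:24.09-py3
RUN pip install --no-cache-dir scikit-learn wandb
EOF

# push it somewhere the cluster can pull from, then point --container-image at it in your job script
docker push registry.example.edu/my_lab/pytorch-custom:1.0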
11. Demo
Need some visual assistance? Not a problem! Take a look at the video below and see how to start a JupyterLab session on the DGX On-Prem.
12. Additional Resources
Don't worry - you're not alone on this supercomputing journey! We've created a wealth of resources to help you make the most of UAlbany's computational power.
Our AI Tutorials page is your one-stop shop for getting started. Think of it as the trailhead that connects to all the important paths through our supercomputing landscape. You'll find guides for each cluster, step-by-step instructions, and best practices developed by folks who've already blazed these trails.
Looking for ready-to-run examples? Check out the Code Tutorials section. We've prepared sample Python scripts and Jupyter notebooks that are specifically designed for our DGX environments. It's like having a cookbook full of recipes that are guaranteed to work in our kitchen!
The supercomputing community at UAlbany is constantly growing and evolving. Have a question that isn't covered in our documentation? Found a clever way to use our systems that might help others? We'd love to hear from you! Reach out to askIT@albany.edu - your feedback helps us make these resources better for everyone.
Remember: today's supercomputing question is tomorrow's wiki article. Your curiosity drives our documentation!