UAlbany Supercomputing Resources Guide

Table of Contents

  1. Introduction

  2. Overview of Available Systems

  3. HPC Cluster

  4. DGX Cloud Cluster (H100)

  5. DGX On-Prem Cluster (A100)

  6. Storage Resources and Management

  7. Requesting Access to UAlbany Supercomputing Resources

  8. Connection and Access Guide

  9. Working with SLURM

  10. Using Container Images with DGX Environments

  11. Demo

  12. Additional Resources

1. Introduction

What is a Supercomputer?

Ever tried running a complex analysis on your laptop and watched it freeze up or take forever? That's where supercomputers come in. A supercomputer isn't just one massive computer - it's actually a bunch of computers (we call them “nodes”) working together as a team. Think of it like a group project where each team member tackles part of the work simultaneously instead of one person doing everything.

These systems have a “head node” which is basically the team leader - it's where you log in and tell the system what you want to do. Then it delegates the actual work to all the other computers in the cluster. It's like the difference between having one person solve a 1000-piece puzzle versus having 50 people each working on 20 pieces at once.

Why Use UAlbany's Supercomputing Resources?

Let's be real - some research and computational tasks would take weeks or even months on your personal computer (if they'd run at all). UAlbany's supercomputing systems can turn those weeks into hours or even minutes. Whether you're crunching numbers for climate models, analyzing genome sequences, training the next cool AI model, or rendering complex visualizations, these systems have your back.

Plus, you don't have to worry about upgrading your personal computer or maxing out your credit card on cloud computing fees - these resources are available to you as part of the university community!

2. Overview of Available Systems

Comparison of Available Environments

UAlbany has several different supercomputing environments, each with its own personality and strengths. They're all built on the same basic concept (clusters of computers working together), but the hardware inside each one makes a big difference in what they're good at.

All our clusters have GPUs (Graphics Processing Units), which are specialized chips that excel at certain types of calculations. Think of them as the math wizards of the computing world - they might not be great at everything, but when it comes to specific types of problems (especially in AI and data science), they'll run circles around regular CPUs.

Here's how our systems stack up:

Environments & Resources      HPC              DGX Cloud Cluster (H100)    DGX On-Prem Cluster (A100)
Computing Nodes               60               4                           24
GPU/Acceleration Cards        12 NVIDIA A40    32 NVIDIA H100              192 NVIDIA A100
Computing Node Datasheet      N/A

By the way, if you spot a "DGX Cloud Cluster (A100)" mentioned in our documentation, that's not a mistake! Our first cloud offering (from '23 to '24) came with 32 NVIDIA A100 GPUs, but when we renewed our agreement with NVIDIA, we got our hands on these awesome state-of-the-art H100 GPUs instead. Our documentation is constantly being updated - it's tough keeping up with all the cool stuff happening at once. So if you see references to the older A100 setup, just know we've upgraded, and please let us know if you find anything confusing!

Understanding GPU Technology

GPUs were originally created to render video games (making those stunning graphics in your favorite games possible), but researchers quickly realized they could be repurposed for scientific computing. Why? Because they're built to process lots of similar calculations at the same time – super helpful for both rendering explosions in games AND training neural networks.

GPU Comparison Chart

GPUs & Specifications         NVIDIA A40    NVIDIA A100    NVIDIA H100
GPU Memory (GB)               48            80             80
Memory Bandwidth (GB/s)       696           2039           3350
FP16 Tensor Core (TFLOPS)     299.4         624            1979
GPU Datasheet

Memory, Bandwidth, and TFLOPS Explained

GPU Memory: This is like your GPU's personal workspace. Imagine you're working on a project - GPU memory is the size of your desk. A bigger desk (more memory) means you can spread out more materials (data) without having to constantly file things away and retrieve them later. When you're working with huge datasets or models, having enough GPU memory is crucial so your GPU doesn't waste time shuffling data back and forth to the CPU.

Memory Bandwidth: If GPU memory is the size of your desk, bandwidth is how quickly you can move things around on it. Higher bandwidth means you can grab that reference book, flip to the right page, and get the info you need faster. For data-intensive applications, higher bandwidth means less time waiting and more time computing.

TFLOPS (Trillion Floating Point Operations Per Second): This is the raw horsepower measurement. One TFLOP means a trillion calculations per second (wrap your head around that!). To put this in perspective: if each calculation were a grain of rice, the H100 GPU could fill thousands of Olympic swimming pools with rice... every second. At FP16, the H100 can handle roughly 3 times more operations per second than the A100 (1979 vs 624 TFLOPS), which is mind-blowingly powerful.

Which System Is Right for Me?

[Figure: System Selection Decision Tree]

HPC On-Prem Cluster: This is our “starter” supercomputer - think of it as the reliable sedan of our fleet. It has more CPU power than GPU power, with 12 NVIDIA A40 GPUs spread across 60 computing nodes. It's great for cutting your teeth on supercomputing and handles a wide variety of tasks well without being specialized in any particular area. Perfect for data science work, statistical analysis, and smaller machine learning jobs where you're still figuring things out.

DGX Cloud Cluster (H100): This cloud-based system is like having a small team of super-athletes at your disposal. It consists of 4 computing nodes, each packing 8 powerhouse NVIDIA H100 GPUs (32 total). This is our only system that lives “off-campus” in NVIDIA's cloud rather than in our data center. While each H100 is about 3 times more powerful than an A100 (seriously, these things are beasts!), the math still favors our on-prem system in total computing power. Why? Because the DGX A100 On-Prem has 6 times more GPUs in total (192 vs 32). Think of it as choosing between 32 Olympic weightlifters or 192 college athletes - those 192 can collectively lift more, even if individually they're not as strong!

DGX On-Prem Cluster (A100): The heavyweight champion of our supercomputing lineup. With 192 NVIDIA A100 GPUs spread across 24 nodes, this system has the most total computing power. It's perfect for those massive projects where you need serious computational muscle. And if your research budget allows it, there's even a paid tier that lets you reserve a larger chunk of these resources for your team's exclusive use. This system is ideal when you've graduated from experimenting and are ready to train serious models or crunch massive datasets.

Key H100 Advantages

While our on-prem A100 system still wins the total compute battle, the following features make the H100s the go-to choice when you need cutting-edge capabilities rather than just raw horsepower.

[Figure: A100 vs H100]

  1. FP8 Precision Support: These GPUs come with remarkable FP8 (8-bit floating point) precision capabilities - it's like upgrading from a standard toolbox to precision surgical instruments! While A100s could handle FP16, the H100s go even further with FP8, letting you run models at lower precision with little to no loss in accuracy. The result? Up to 4x performance boost on certain workloads. Pretty incredible, right?

  2. Built-in Transformer Engine: The H100s pack a specialized engine specifically designed for transformer-based AI models (think GPT and BERT). It's like having a Ferrari engine installed specifically for the most demanding part of your AI workload. These custom circuits make language models absolutely fly compared to previous generations.

  3. Dynamic Sparsity: This is a game-changer that works like an efficiency expert for your models. The H100's second-generation structured sparsity support skips the computations tied to zero-valued weights, so models pruned to the supported sparsity pattern run noticeably faster on the same hardware. Imagine having a smart assistant that finds all the shortcuts through a maze - that's what this does for large neural networks. By the way, here is a very good article from NVIDIA on sparsity: What Is Sparsity in AI.

Use Case Scenarios

For Data Science Projects:
Working with R, Python, and your favorite data science libraries? The HPC cluster has you covered for most projects. It's perfect for data wrangling, creating visualizations, and running analyses that don't need specialized hardware. Think of it as your go-to for everyday data science tasks.

For Machine Learning Research:
If you're training traditional ML models (random forests, SVMs, etc.), the HPC environment will serve you well. But once you start getting into more complex territory, you might want to level up to one of the DGX environments. It's like the difference between making a quick sketch and painting a masterpiece - different tools for different ambitions.

For Deep Learning Applications:
Deep learning is where our specialized DGX environments really shine. If you're working with frameworks like TensorFlow, PyTorch, or JAX:

  • Start with HPC for your initial prototyping, then graduate to the bigger systems when you're ready to scale up

  • Try the DGX H100 Cloud when you have a particularly calculation-heavy model that would benefit from the H100's raw number-crunching power

  • Go with the DGX A100 On-Prem when you need to process huge batches or want to spread your workload across multiple GPUs

For Statistical Analysis:
For most statistical work, the HPC cluster will be your best friend. But if you're venturing into computationally intensive methods like large-scale Bayesian modeling or working with truly massive datasets, consider the extra firepower of the DGX systems.

3. HPC Cluster

Technical Specifications

  • 60 computing nodes

  • 12 NVIDIA A40 GPUs (48GB memory each)

  • Best suited for general-purpose computing and entry-level GPU workloads

Ideal Use Cases

  • Data preprocessing and cleaning (the digital equivalent of washing your dishes before cooking)

  • Statistical analysis and modeling

  • Getting your feet wet with machine learning

  • Simulations that don't require the absolute latest in GPU technology

Getting Started

Connect to the head node at head.arcc.albany.edu using SSH. Just need a JupyterLab session? Say no more! The HPC offers a very convenient way to start one through the JupyterHub server, so you don't have to SSH at all - just open the following link and be happy: https://jupyterlab.its.albany.edu/. Have any questions? We've got your back - take a look at this Wiki page on how to use the tool: JupyterHub Service Offering.
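
If you'd rather work in a terminal, the SSH route is a one-liner (swap in your own NetID):

ssh your_netid@head.arcc.albany.edu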

4. DGX Cloud Cluster (H100)

Technical Specifications

  • 4 computing nodes

  • 32 NVIDIA H100 GPUs (80GB memory each)

  • Cloud-hosted environment provided by NVIDIA

  • The newest, shiniest GPUs in our fleet

Ideal Use Cases

  • Cutting-edge deep learning research

  • Training large language models (think cousin-of-ChatGPT level stuff)

  • Projects that benefit from the latest GPU architecture

  • Workloads that can take advantage of the H100's new FP8 precision capability

Getting Started

Connect to the head node at 207.211.163.76 using SSH with certificate authentication (slightly trickier than password login, but we'll help you get set up).
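
In practice, certificate authentication just means pointing SSH at the key/certificate file you receive once your access request is approved. The file name below is only a placeholder - use whatever path you saved yours under:

ssh -i ~/.ssh/your_dgx_cloud_key your_netid@207.211.163.76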

5. DGX On-Prem Cluster (A100)

Technical Specifications

  • 24 computing nodes

  • 192 NVIDIA A100 GPUs (80GB memory each)

  • Our largest GPU cluster by total compute capacity

  • Available in a paid tier for research groups needing dedicated resources

Ideal Use Cases

  • Large-scale deep learning projects

  • Multi-GPU and multi-node training

  • When you need to throw massive computational resources at a problem

  • Research teams with funding who need guaranteed access to high-end computing

Getting Started

Connect to the head node at dgx-head01.its.albany.edu using SSH.

6. Storage Resources and Management

Let's talk about where you'll keep all that data and code while working on our supercomputers. After all, having powerful computing is great, but you also need somewhere to store your stuff!

What Storage You'll Get

Your Personal Space:

  • Everyone gets 10GB of personal research storage - think of it as your private locker

  • Perfect for keeping your scripts, configuration files, and those special research nuggets no one else needs to access

The Lab Share ($LAB Directory):

  • Every faculty researcher gets a generous 10TiB (that's 10,240GiB!) of shared space

  • This is your team's clubhouse - a place where everyone in your research group can collaborate

  • To get your lab space set up, just fill out the Research Storage Request Form

  • Your $LAB directory shows up automatically on all our cluster systems, so it's always right where you need it

Bonus for DGX Users:

  • If you're working on the DGX On-Prem system, you'll also get 1TB of extra-speedy flash storage

  • This is in addition to your regular 10TiB lab share

  • It's like having both a filing cabinet and a whiteboard - regular storage plus some space optimized for quick access

How We Keep Your Data Safe

We take your research data seriously - here's how we protect it:

  • Everything's encrypted - so even if someone got physical access to the storage, they couldn't read your data

  • We use clever compression and de-duplication - so you can store more with your allocation

  • Automatic backups galore:

    • Hourly “snapshots” for the last 23 hours (oops, deleted that file 2 hours ago? No problem!)

    • Daily snapshots kept for 21 days (accidentally deleted something last week? We've still got you!)

    • And the best part? These backups don't count against your storage quota!

  • We keep a second copy of your $LAB directory at another location - because one backup is never enough!

Getting to Your Files

Accessing your storage is easy - you just need to map a network drive on your device. If you have any questions on how to do that, we have a neat tutorial right here to get you started: How to Map a Network Drive. That said, are you:

  • On campus? It's available from any university computer you log in to

  • Working from home? Just connect to the VPN from your device and it's all there

  • Using the supercomputers? Your storage is automatically mounted and ready to use

Need Even More Space?

If your research is grant-funded and you need additional storage: reach out to askIT@albany.edu for a consultation to get your storage estimates right.

7. Requesting Access to UAlbany Supercomputing Resources

Each of UAlbany's supercomputing environments has a specific access request process. Here's how to get started with each system:

Important Note: All access requests must be initiated by Principal Investigators (PIs). If you are a student, postdoc, or lab member requiring access, please ask your PI to submit the request on your behalf.

HPC Cluster Access

Eligibility: Available to all members of the University at Albany research community, collaborators, and partners.

Request Process:

  • Email askIT@albany.edu to request access

  • Include your NetID, department, and a brief description of your research needs

  • Access is typically granted within 1-2 business days

DGX On-Prem Cluster (A100) Access

For complete information on limits and availability: DGX On-Prem Service Offering.

Eligibility: The DGX On-Prem cluster offers two tiers of access to accommodate different research needs:

Free Tier:

  • No cost for UAlbany faculty

  • Access to GPU resources on a first-come, first-served basis

  • Workloads may be preempted by prioritized jobs

  • Suitable for research projects with flexible timelines

Prioritized Access Tier:

  • $1,200 annually per prioritized GPU

  • Priority scheduling of your workloads over free-tier jobs

  • Guaranteed resource allocation for time-sensitive research projects

Request Process:

  1. Complete the Research Storage Request Form to provision your lab directory

  2. Complete the DGX On-Prem Computation Request Form for access to the NVIDIA On-Prem resources

  3. For prioritized access, indicate your interest in the request form, and the ITS team will follow up with details

DGX Cloud Cluster (H100) Access

For complete information on limits and availability: DGX Cloud Service Offering.

Eligibility: Currently available to UAlbany faculty free of charge.

Request Process:

  • Complete the DGX Cloud Computation Request Form

  • You'll receive connection instructions and certificate authentication details by email

  • Due to the limited number of resources, requests may be subject to approval based on research needs

After Access is Granted

Once you've been granted access to any of our systems:

  • You'll receive an email with your request details and connection instructions

  • Check out our wiki pages for tutorials and example job scripts

For any questions about access or to check on the status of your request, please contact askIT@albany.edu.

8. Connection and Access Guide

VPN Setup

First things first - if you're off campus, you'll need to connect to the university VPN (GlobalProtect) before you can access any of our supercomputing systems. Think of it as getting your ID checked before entering the building. It's a pretty simple and straightforward process, and if you're not familiar with it, please check the instructions here: How to Use the VPN. That said, we recommend connecting to the VPN regardless of where you're working from (even on campus) to avoid any connectivity issues that might pop up.

SSH Connection Instructions

This is another simple and straightforward process, and if you're not familiar with it, please check the instructions here: How to Connect via SSH. To connect to our systems, you'll need:

  • The hostname (like an address for the system)

  • Port 22 (the standard door for SSH connections)

  • Your NetID (your campus username)

  • Either your NetID password (for on-campus systems) or a certificate (for the cloud system)

Here's a handy reference table:

Environment     Hostname
HPC             head.arcc.albany.edu
DGX On-Prem     dgx-head01.its.albany.edu
DGX Cloud       207.211.163.76

Connection Methods

On macOS or Linux:
You're in luck! These systems come with SSH built in. Just open a terminal window and type:

ssh your_netid@hostname

On Windows:
You'll need to download an SSH client first. We recommend PuTTY (it's free!), but VS Code's Remote - SSH extension is also a great option if you're already using VS Code (also free).

Pro Tip: If you find yourself connecting to these systems often, set up an SSH config file on your computer. It's like creating speed dial entries for your favorite contacts - you'll save time and typing errors.
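
For reference, here's a sketch of what those "speed dial" entries can look like in ~/.ssh/config, using the hostnames from the table above (the Host nicknames, username, and key path are placeholders - adjust them to your own setup):

Host hpc
    HostName head.arcc.albany.edu
    User your_netid

Host dgx-onprem
    HostName dgx-head01.its.albany.edu
    User your_netid

Host dgx-cloud
    HostName 207.211.163.76
    User your_netid
    IdentityFile ~/.ssh/your_dgx_cloud_key

With that in place, connecting is as simple as typing ssh hpc.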

9. Working with SLURM

All our supercomputing environments use a system called SLURM to manage jobs. Think of SLURM as the scheduling assistant who makes sure everyone gets their fair share of computing time. Our team has put together a very nice page on SLURM that you can check here: How to Schedule via SLURM.

At this point you might be asking: “Where do all these parts fit in? VPN? SLURM? This seems like a lot of trouble...” Don't worry, we've got your back, and once you go through this process for the first time, you'll see it's actually easier done than said (pun intended). We understand there are a lot of moving parts, but hey, this is how the actual AI world works: nothing fancy. Take a look at the following diagram for a clearer picture of the whole process.

Here's the typical workflow:

  1. Connect to the head node via SSH

  2. Prepare your job script (a file telling the system what you want to run)

  3. Submit your job using SLURM commands

  4. Kick back while SLURM finds the right computers to run your job

  5. Come back later to collect your results

Basic Commands

  • squeue: See what jobs are currently running or waiting (like checking the status board at an airport)

  • srun: Run a command directly on compute nodes (for interactive work - see next topic)

  • sbatch: Submit a job script to be run when resources are available (for non-interactive work - see next topic)

  • scancel: Cancel a job (in case you spot a mistake or change your mind)

  • sinfo: See information about the available partitions (groups of nodes)
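
Here's roughly how these look in day-to-day use (your_netid and the job ID are placeholders):

sinfo                      # which partitions and nodes are available right now?
squeue -u your_netid       # what are my jobs up to?
scancel 123456             # changed my mind - cancel job 123456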

Differences Between Non-Interactive and Interactive Jobs

With srun you can run a command directly on compute nodes (for interactive work) - Think of this like the movie Inception. You first login to the head node (your first dream level), and then using srun, you “dream deeper” into a compute node where the real computational power exists. Just like in Inception where they needed to go deeper to accomplish the mission, you use srun to dive into the more powerful compute environment where your tasks can actually run. Spoiler alert - No spinning top required to check if you're in reality - just the command prompt will tell you which node you're on!

When should you use srun instead of sbatch? Use srun when you need to be “present in the dream” - when you require real-time interaction with your work. This is perfect for debugging, exploratory data analysis, or interactive Python sessions where you're actively typing commands and expecting immediate responses. Meanwhile, sbatch is like planting an idea and walking away - you submit your job script and let it run in the background while you do something else, checking in later to see the results. Choose srun when you need to be hands-on and sbatch when you want to “set it and forget it.”
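
For instance, a typical way to "dream deeper" into a compute node is to ask SLURM for an interactive shell - the sketch below requests one GPU for an hour (partition names and limits vary between our clusters, so adjust as needed):

srun --gpus=1 --time=01:00:00 --pty bash

When it starts, the prompt shows the compute node you landed on, and exiting the shell hands the resources back to SLURM.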

Job Submission Guide

The most common way to run jobs is by creating a script and submitting it with sbatch. A basic script looks something like this:

#!/bin/bash
#SBATCH --job-name=my_awesome_analysis
#SBATCH --output=results_%j.out
#SBATCH --error=results_%j.err
#SBATCH --time=01:00:00
#SBATCH --gpus=1

# Run your actual program
python my_analysis.py

Submit this with sbatch my_script.sh and SLURM will take care of the rest!

Resource Allocation Best Practices

  • Only request what you actually need - overestimating resources means longer wait times

  • For long-running jobs, use checkpointing (see this awesome page: Checkpointing Guide) to save progress periodically

  • Start small when testing new code, then scale up once you know it works

  • Be specific about your requirements (memory, GPUs, etc.) so SLURM can match you with the right hardware

Monitoring Jobs

Once your job is running, you can monitor it with:

  • squeue -u your_netid to see all your jobs

  • sacct -j job_id for detailed info about a specific job
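
For example, with 123456 standing in for your actual job ID:

sacct -j 123456 --format=JobID,JobName,State,Elapsed,MaxRSS

The extra --format columns show how long the job ran and how much memory it actually used - handy for right-sizing your next resource request.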

10. Using Container Images with DGX Environments

What Are Containers (And Why Should You Care)?

Think of containers like those meal prep kits that deliver everything you need to cook a specific dish. All the ingredients, spices, and instructions come in one package, so you don't have to shop for each item or worry if your local store carries that obscure spice the recipe needs.

In the computing world, containers work the same way - they package up software, libraries, and dependencies so everything “just works” together. No more “but it worked on my laptop!” frustrations or spending hours installing the right version of every library.

NVIDIA NGC Catalog: Your Container Buffet

For both our DGX environments (Cloud H100 and On-Prem A100), we highly recommend using container images from the NVIDIA NGC Catalog (https://catalog.ngc.nvidia.com/containers). Think of NGC as a massive buffet of pre-configured containers specifically optimized for NVIDIA GPUs.

Why use these containers? Because:

  • They're pre-optimized for our hardware - these containers were literally made by the same folks who built the GPUs in our systems

  • They save you tons of setup time - no need to install and configure all the right libraries

  • They're performance-tuned - often running faster than manually installed software

  • They're regularly updated - security patches and new features get added without breaking your workflow

Popular Container Images for Research

The NGC catalog offers containers for just about everything GPU-related, but here are some crowd favorites:

  • Deep Learning Frameworks: Ready-to-use containers for PyTorch, TensorFlow, JAX, and more

  • HPC Applications: Containers for scientific computing, simulations, and modeling

  • AI/ML Tools: Specialized containers for computer vision, speech recognition, and NLP

Getting Started with Containers

Using containers on our DGX systems is straightforward:

  1. Browse the NGC catalog to find the container you need

  2. In your SLURM job script, specify the container directly:

    #SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:25.01-py3'

Mounting Storage with Containers

Since containers have their own isolated filesystem, you'll need to explicitly mount your storage directories:
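
A single SLURM directive takes care of this. The left-hand path below is just a placeholder for your own lab directory; the right-hand side is where it appears inside the container (matching the /mnt/lab path mentioned later in this guide):

#SBATCH --container-mounts=/path/to/your/lab/share:/mnt/lab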

This maps your lab folder to a mount point inside the container, so your data and code remain accessible.

Example - Using a Container Image to Start a Jupyter Notebook on DGX On-Prem

Let's walk through a practical example that you'll likely use all the time - setting up a Jupyter notebook session on the DGX On-Prem cluster. This script creates an interactive JupyterLab environment where you can develop and test your code with all the perks of our powerful GPUs. It automatically generates a secure password and gives you a URL to access your notebook from your browser.
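
Here's a minimal sketch of what such a jupyter.sh can look like, assuming the PyTorch NGC container from the previous section and a placeholder lab path - treat it as a starting point and adapt the image, port, time limit, and mounts to your project:

#!/bin/bash
#SBATCH --job-name=jupyter
#SBATCH --output=jupyter-%j.out
#SBATCH --time=8:00:00
#SBATCH --gres=gpu:1
#SBATCH --container-image='docker://nvcr.io/nvidia/pytorch:25.01-py3'
#SBATCH --container-mounts=/path/to/your/lab/share:/mnt/dgx_lab

# Generate a random token that doubles as the session password
TOKEN=$(openssl rand -hex 16)

# Print the URL and password so they land in the jupyter-*.out file
echo "JupyterLab URL: http://$(hostname -f):8888/?token=${TOKEN}"
echo "Password/token: ${TOKEN}"

# Start JupyterLab on the compute node, listening on all interfaces
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser \
    --ServerApp.token="${TOKEN}" --notebook-dir=/mnt/dgx_lab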

Once you submit this script with sbatch jupyter.sh, SLURM will find an available node, launch your container, and start the Jupyter server. Simply copy the URL and password from the output file (check jupyter-*.out) and paste it into your browser. Voilà! You now have a full-featured development environment running on our high-performance hardware. As you may have noticed, the script contains several SLURM flags that you can customize based on your needs. Feel free to adjust the time limit, GPU count, or container image to match your specific research requirements. Let's break down some key flags you might want to customize:

  • --time=8:00:00: Need a longer session? Change this to increase your time allocation

  • --gres=gpu:1: Working with bigger models? Bump this up to request more GPUs (e.g., gpu:2 or gpu:4)

  • --container-image: Want a different framework? Swap in any NGC container that suits your project (check all the amazing available images here: https://catalog.ngc.nvidia.com/containers)

  • --container-mounts: Make sure to change this according to your own lab directories

Remember that your session will run for the time specified (8 hours in this example) and then automatically terminate. Need to save your work? No worries - everything you save in the /mnt/dgx_lab or /mnt/lab directories will be stored in your lab's permanent storage space, so it'll be there waiting for you next time you log in!

Container Tips and Tricks

  • Persistent Storage: Always mount your home directory or lab folder as shown above - containers are temporary!

  • Custom Containers: Already have a working setup? You can build your own container to ensure reproducibility

  • Container Versions: Always specify the exact version of containers (like '25.01-py3' instead of 'latest') to avoid surprise updates

11. Demo

Need some visual assistance? Not a problem! Take a look at the video below and see how to start a JupyterLab session on the DGX On-Prem.

12. Additional Resources

Once again, don't worry - you're not alone on this supercomputing journey! We've created a wealth of resources to help you make the most of UAlbany's computational power, and our documentation is constantly being updated.

Our AI Tutorials page is like a comprehensive table of contents on everything related to our supercomputing resources. It’s like a cheat sheet, where you can quickly navigate to the topic you most need help with.

Looking for ready-to-run examples? Check out the Code Tutorials section. We've prepared sample Python scripts and Jupyter notebooks that are specifically designed for our DGX environments. It's like having a cookbook full of recipes that are guaranteed to work in our kitchen!

The supercomputing community at UAlbany is constantly growing and evolving. Have a question that isn't covered in our documentation? Found a clever way to use our systems that might help others? We'd love to hear from you! Reach out to askIT@albany.edu - your feedback helps us make these resources better for everyone.

Remember: today's supercomputing question is tomorrow's wiki article. Your curiosity drives our documentation!