TL;DR: This is a guide for using the University at Albany's DGX On-Prem cluster's free tier. Key points:

  1. Free tier limits: max 4 concurrent jobs, up to 4 GPUs/256 CPUs, and a 7-day job duration limit. For current limits, please refer to the Service Level Agreement (SLA).

  2. Jobs can be preempted (stopped and requeued) if paid-tier users need resources. When this happens, all progress is lost unless properly saved.

  3. The solution is checkpointing - regularly saving your work's state (like autosave in a document). The guide provides PyTorch code examples for:

    • Saving/loading model training progress

    • Tracking training metrics

    • Managing inference tasks

The main message is: Always implement checkpointing in your jobs, or you risk losing hours/days of work when preemption occurs.
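As a rough illustration of the save-and-resume pattern described above, here is a minimal, framework-agnostic sketch using only the Python standard library. It is not the guide's PyTorch code: the file name `checkpoint.json`, the `run` loop, and the `step`/`total` state fields are all hypothetical stand-ins. In a PyTorch job you would typically save the model and optimizer `state_dict()` with `torch.save` and restore them with `torch.load` instead of using JSON.

```python
import json
import os
import tempfile

CHECKPOINT_PATH = "checkpoint.json"  # hypothetical location, adjust for your job


def save_checkpoint(state, path=CHECKPOINT_PATH):
    """Write state to a temp file, then rename it over the old checkpoint.

    The rename is atomic on a single filesystem, so a preemption in the
    middle of a save leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path=CHECKPOINT_PATH):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(num_steps, path=CHECKPOINT_PATH, save_every=10):
    """Resume from the last checkpoint and work up to num_steps."""
    state = load_checkpoint(path)
    for step in range(state["step"], num_steps):
        state["total"] += step      # stand-in for one unit of real work
        state["step"] = step + 1
        if state["step"] % save_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)    # final save when the loop finishes
    return state
```

If the job is preempted and requeued, calling `run` again picks up at the last saved step rather than at step 0. The `save_every` interval is a trade-off: saving more often costs more I/O, while saving less often means more repeated work after a preemption.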

...

The free tier provides substantial computing resources with specific limitations such as:

  • Maximum of 4 concurrent jobs per user

  • Access to up to 4 GPUs and 256 CPUs per user

  • Maximum job duration of 7 days

  • Automatic job requeuing upon preemption

For current free tier limits, please refer to the Service Level Agreement (SLA).

Understanding Job Preemption

...