Zaratan Cluster

The current flagship HPC cluster at the University of Maryland is the Zaratan cluster. Named after a mythological sea turtle known for its long life and gargantuan size, the Zaratan cluster went online in August 2022 and consists of:
- 360 compute nodes each with 128 CPU cores and 512 GiB of memory
- 20 GPU nodes, each with four A100 GPUs, 128 CPU cores, 512 GiB of RAM
- 8 GPU nodes, each with four H100 GPUs, 96 CPU cores
- 6 large memory nodes, each with 2 TiB RAM, 128 CPU cores
- 2 PB BeeGFS scratch storage
- 10 PB AuriStorFS medium-term SHELL storage
- HDR/HDR-100 Infiniband interconnects between nodes
- 400 Gbps Ethernet connections to national networks
- 5.7 PetaFlops theoretical performance
- Open OnDemand web portal
Details of the Zaratan cluster:
| Description | Number of nodes | Processor | Cores/node | Mem/node (GiB) | Mem/core | /tmp size (TB) | GPUs/node |
|---|---|---|---|---|---|---|---|
| Standard compute | 360 | AMD Zen3 | 128 | 512 | 4 | 0.75 | none |
| Large memory | 6 | AMD Zen3 | 128 | 2048 | 16 | 6 | none |
| H100 | 8 | Intel SapphireRapids | 96 | 512 | 5.3 | 12 | 4 x H100 (80 GB) |
| A100 | 20 | AMD Zen3 | 128 | 512 | 4 | 0.75 | 4 x A100 |
The standard compute, large memory, and A100 nodes have dual AMD EPYC 7763 Zen3 processors, with 64 cores per CPU and a base speed of 2.45 GHz (3.5 GHz turbo speed).
The H100 nodes have dual Intel Xeon Platinum 8468 SapphireRapids processors, with 48 cores per CPU and a base speed of 2.1 GHz (3.8 GHz turbo speed).
Each NVIDIA A100 Tensor Core GPU has 40 GiB of GPU RAM (using the Ampere architecture supporting CUDA compute capability 8.0). These are SXM models of the GPUs which support NVLink. As indicated by the name, each fractional a100_1g.5gb multi-instance GPU has 5 GiB of GPU RAM; the CUDA compute capability of 8.0 is not changed.
Each NVIDIA H100 Tensor Core GPU has 80 GiB of GPU RAM (using the Hopper architecture supporting CUDA compute capability 9.0). These are SXM models which support NVLink.
The compute and large memory nodes have HDR100 interconnects. The GPU nodes have HDR interconnects.
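The per-node figures above translate directly into Slurm resource requests. A minimal job-script sketch for one full standard compute node (the executable name is a placeholder):

```shell
#!/bin/bash
#SBATCH --ntasks=128          # one full standard node: 128 CPU cores
#SBATCH --mem-per-cpu=4096    # 4 GiB per core, matching the 512 GiB / 128-core ratio
#SBATCH --time=01:00:00

srun ./my_program             # placeholder executable
```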
Zaratan Partitions
| Partition name | Maximum Walltime | Notes |
|---|---|---|
| standard | 7 days | All jobs w/out special requirements |
| debug | 15 min | Short test/debug jobs |
| bigmem | 7 days | Jobs needing large amounts of memory |
| gpu | 7 days | Jobs needing GPUs |
| scavenger | 14 days | Free, but low priority and preemptible |
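For instance, a short test run can be steered to the debug partition by requesting it explicitly and staying within its walltime cap (a sketch; the executable name is a placeholder):

```shell
#!/bin/bash
#SBATCH --partition=debug
#SBATCH --time=00:10:00       # must stay within the 15-minute debug limit
#SBATCH --ntasks=8

srun ./test_program           # placeholder executable
```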
Zaratan Features/Constraints
The following features or constraints are defined on the Zaratan cluster and can be requested with the sbatch --constraint flag:
| Feature | Description |
|---|---|
| amd | Node has AMD based CPUs |
| beeond | Node supports BeeOND |
| epyc_7702 | Node has AMD EPYC 7702 CPUs |
| epyc_7763 | Node has AMD EPYC 7763 CPUs |
| epyc_9124 | Node has AMD EPYC 9124 CPUs |
| ib | Node supports Infiniband |
| intel | Node has Intel based CPUs |
| noib | Node does not have Infiniband |
| nvme | Node has NVMe disks |
| rhel8 | Node is running Red Hat Enterprise Linux version 8 |
| xeon_6248 | Node has Intel Xeon 6248 CPUs |
| xeon_8468 | Node has Intel Xeon 8468 CPUs |
| xeon_8592 | Node has Intel Xeon 8592 CPUs |
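Constraints can be requested individually or combined with standard Slurm constraint syntax. For example, to restrict a job to the AMD EPYC 7763 nodes (a sketch; job.sh is a placeholder batch script):

```shell
sbatch --constraint=epyc_7763 job.sh
# Constraints can be AND-ed, e.g. nodes with both NVMe disks and Infiniband:
sbatch --constraint="nvme&ib" job.sh
```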
Zaratan GRESes
| GRES | Description | Number in cluster | Hourly SU cost | Cuda Compute Capability |
|---|---|---|---|---|
| gpu:h100 | NVIDIA Hopper H100 GPU (80GB) | 32 | 144 SU/hr | 9.0 |
| gpu:a100 | NVIDIA Ampere A100 GPU (40GB) | 76 | 48 SU/hr | 8.0 |
| gpu:a100_1g.5gb | Fractional (1/7) A100 GPU (5GB) | 28† | 7 SU/hr | 8.0 |
† Note: The number of physical A100 GPUs that are split into smaller virtual GPUs, and potentially the sizes of those virtual GPUs, is subject to fluctuation without advance notice as we gauge how best to distribute the resources to meet user demand (and as demand changes). The numbers listed are accurate as of this writing; since there are currently 80 physical A100 GPUs on Zaratan, the number of a100 GPUs plus 1/7 of the number of a100_1g.5gb virtual GPUs will equal 80.
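The hourly SU rates above scale with both GPU count and walltime, so the charge for a GPU job is rate × GPUs × hours. A small sketch, using only the rates from the table, to estimate that charge (the function name is our own, not a site utility):

```python
# Hourly SU rates per GPU GRES, taken from the table above.
SU_PER_HOUR = {
    "gpu:h100": 144,
    "gpu:a100": 48,
    "gpu:a100_1g.5gb": 7,
}

def gpu_su_cost(gres: str, count: int, hours: float) -> float:
    """Estimated SU charge for `count` GPUs of type `gres` over `hours` of walltime."""
    return SU_PER_HOUR[gres] * count * hours

# Example: two A100s for 10 hours -> 2 * 48 * 10 = 960 SU
print(gpu_su_cost("gpu:a100", 2, 10))
```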