The Zaratan High-Performance Computing (HPC) cluster is the University of Maryland's flagship HPC cluster, maintained by the Division of Information Technology and replacing the Deepthought2 cluster. Coming online in spring 2022, it features 360 compute nodes, each with dual AMD EPYC 7763 64-core CPUs. These CPUs are direct liquid-cooled, enabling all of the approximately 50,000 CPU cores to run at full speed. There are also 20 GPU nodes, each containing four Nvidia A100 GPUs (80 GPUs in total). Theoretical peak performance is 3.5 PFLOPS.
The cluster has HDR-100 (100 Gbit) InfiniBand interconnects between the nodes, with storage and service nodes connected by full HDR (200 Gbit). The cluster connects to various national networks over 200 Gbit Ethernet.
The cluster provides 2 PB of high-performance parallel file storage (using BeeGFS) and 10 PB of archival storage (using Auristor).
The following table lists the hardware on the Zaratan cluster:
Description | Processor | Number of nodes | Cores/node | Total cores | Memory/node (GiB) | Memory/core (GiB) | Node-local /tmp per node (GB) | GPUs/node | Interconnect | Comments
---|---|---|---|---|---|---|---|---|---|---
Standard compute | AMD EPYC 7763, 2.45 GHz base (3.5 GHz turbo) | 360 | 128 | 46080 | 512 | 4 | 750 | 0 | HDR-100 | DLC of CPUs
A100 GPU nodes | AMD EPYC 7763, 2.45 GHz base (3.5 GHz turbo) | 20 | 128 | 2560 | 512 | 4 | 750 | 4 Nvidia A100 | HDR-100 | 
H100 GPU nodes | Intel Xeon Platinum 8468 | 8 | 96 | 768 | 512 | 5.3 | 12 | 4 Nvidia H100 | HDR-100 | 
Large memory nodes | AMD EPYC 7763, 2.45 GHz base (3.5 GHz turbo) | 6 | 128 | 768 | 2048 | 16 | 6 | 0 | HDR-100 | Partition bigmem
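As a quick sanity check, the cluster-wide totals can be recomputed from the per-node figures above; a minimal Python sketch (values transcribed from the table):

```python
# Recompute cluster-wide totals from the per-node figures in the table above.
nodes = [
    # (description, node count, cores/node, memory/node GiB, GPUs/node)
    ("Standard compute",   360, 128,  512, 0),
    ("A100 GPU nodes",      20, 128,  512, 4),
    ("H100 GPU nodes",       8,  96,  512, 4),
    ("Large memory nodes",   6, 128, 2048, 0),
]

total_cores = sum(count * cores for _, count, cores, _, _ in nodes)
total_gpus = sum(count * gpus for _, count, _, _, gpus in nodes)

print(total_cores)  # 50176 -- the "approximately 50,000 CPU cores" quoted above
print(total_gpus)   # 112 GPUs (80 A100 + 32 H100)
```

This also confirms the per-row "Total cores" column: for example, 360 nodes x 128 cores = 46,080 for the standard compute partition.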
The nodes containing GPUs have either quad Nvidia A100 Tensor Core GPUs with 40 GB of GPU RAM (using the Ampere architecture supporting CUDA compute capability 8.0) or quad Nvidia H100 Tensor Core GPUs with 80 GB of GPU RAM (using the Hopper architecture supporting CUDA compute capability 9.0).
The cluster has 2 PB of high-performance short-term file storage (using BeeGFS) as well as 10 PB of longer-term storage (using Auristor).
The standard compute nodes are connected with HDR-100 (100 Gb/s) InfiniBand interconnects, and the GPU nodes have full HDR (200 Gb/s) InfiniBand.
The theoretical peak performance is 3.5 PFLOPS. This figure assumes ideal conditions, in which the calculations keep all the CPUs and GPUs fully utilized, which of course does not happen in practice; but such numbers are easy to compute and useful for rough comparisons. High-end laptops in 2022 (e.g. a MacBook Pro with an M1 Max and a 24- or 32-core GPU) have theoretical peaks of 5-10 TFLOPS, so Zaratan should be roughly 350 to 700 times faster.
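The 3.5 PFLOPS figure can be roughly reconstructed from the hardware table. The sketch below is an estimate, not the official calculation: the 16 double-precision FLOPs per core per cycle (two 256-bit FMA pipes on Zen 3) and the A100's 19.5 TFLOPS FP64 tensor-core rate are assumptions drawn from public spec sheets, not stated on this page.

```python
# Rough theoretical-peak estimate for Zaratan (assumptions noted in comments).
cpu_cores = 46080        # standard compute partition (from the table above)
base_ghz = 2.45          # AMD EPYC 7763 base clock
flops_per_cycle = 16     # assumption: Zen 3 sustains 2 x 256-bit FMA/cycle,
                         # i.e. 16 double-precision FLOPs per core per cycle

# cores x GHz x FLOPs/cycle gives GFLOPS; divide by 1000 for TFLOPS.
cpu_peak_tflops = cpu_cores * base_ghz * flops_per_cycle / 1000
print(round(cpu_peak_tflops))   # ~1806 TFLOPS, i.e. about 1.8 PFLOPS

a100_fp64_tflops = 19.5  # assumption: Nvidia-quoted FP64 tensor-core peak
gpu_peak_tflops = 80 * a100_fp64_tflops
print(gpu_peak_tflops)          # 1560 TFLOPS, i.e. about 1.6 PFLOPS

# CPU + A100 partitions together land near the quoted 3.5 PFLOPS.
```

Under these assumptions the CPU and A100 partitions contribute roughly 1.8 and 1.6 PFLOPS respectively, which is consistent with the quoted cluster-wide peak.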