The Zaratan High-Performance Computing (HPC) cluster is the University of Maryland's flagship HPC cluster, maintained by the Division of Information Technology and replacing the Deepthought2 cluster. Coming online in spring 2022, it features 360 compute nodes, each with dual AMD EPYC 7763 64-core CPUs. These CPUs are direct liquid-cooled, enabling all of the approximately 50,000 CPU cores to run at full speed. There are also 20 GPU nodes, each containing four Nvidia A100 GPUs (80 GPUs in total). Theoretical peak performance is 3.5 PFLOPS.
The cluster has HDR-100 (100 Gbit) InfiniBand interconnects between the nodes, with storage and service nodes connected by full HDR (200 Gbit). The cluster connects to various national networks over 200 Gbit Ethernet.
The cluster provides 2 PB of high-performance parallel file storage (using BeeGFS) and 10 PB of archival storage (using Auristor).
The following table lists the hardware on the Zaratan cluster:
Description | Processor | Number of nodes | Cores/node | Total cores | Memory/node (GiB) | Memory/core (GiB) | Node-local /tmp per node (GB) | GPUs/node | Interconnect | Comments
---|---|---|---|---|---|---|---|---|---|---
Standard compute | AMD EPYC 7763, 2.45 GHz base (3.5 GHz turbo) | 360 | 128 | 46080 | 512 | 4 | 750 | 0 | HDR-100 | DLC of CPUs
A100 GPU nodes | AMD EPYC 7763, 2.45 GHz base (3.5 GHz turbo) | 20 | 128 | 2560 | 512 | 4 | 750 | 4 Nvidia A100 | HDR-100 | 
H100 GPU nodes | Intel Xeon Platinum 8468 | 8 | 96 | 768 | 512 | 5.3 | 12 | 4 Nvidia H100 | HDR-100 | 
Large memory nodes | AMD EPYC 7763, 2.45 GHz base (3.5 GHz turbo) | 6 | 128 | 768 | 2048 | 16 | 6 | 0 | HDR-100 | Partition bigmem
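As a quick sanity check, the cluster-wide totals can be recomputed from the per-node figures above; a minimal Python sketch (values transcribed from the table):

```python
# Recompute cluster-wide totals from the per-node figures in the table above.
nodes = [
    # (description, node count, cores/node, memory/node GiB, GPUs/node)
    ("Standard compute",   360, 128,  512, 0),
    ("A100 GPU nodes",      20, 128,  512, 4),
    ("H100 GPU nodes",       8,  96,  512, 4),
    ("Large memory nodes",   6, 128, 2048, 0),
]

total_cores = sum(count * cores for _, count, cores, _, _ in nodes)
total_gpus = sum(count * gpus for _, count, _, _, gpus in nodes)

print(total_cores)  # 50176 -- the "approximately 50,000 CPU cores" quoted above
print(total_gpus)   # 112 GPUs (80 A100 + 32 H100)
```

This also confirms the per-row "Total cores" column: for example, 360 nodes x 128 cores = 46,080 for the standard compute partition.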
The nodes containing GPUs have either quad Nvidia A100 Tensor Core GPUs with 40 GB of GPU RAM (using the Ampere architecture supporting CUDA compute capability 8.0) or quad Nvidia H100 Tensor Core GPUs with 80 GB of GPU RAM (using the Hopper architecture supporting CUDA compute capability 9.0).
The cluster has 2 PB of high-performance short-term file storage (using BeeGFS) as well as 10 PB of longer-term storage (using Auristor).
The standard compute nodes are connected with HDR-100 (100 Gb/s) InfiniBand interconnects, and the GPU nodes have full HDR (200 Gb/s) InfiniBand.
The theoretical peak performance is 3.5 PFLOPS. This figure assumes ideal conditions, in which the calculations keep all the CPUs and GPUs fully utilized, which of course does not happen in practice; but such numbers are easy to compute and useful for rough comparisons. High-end laptops in 2022 (e.g. a MacBook Pro with an M1 Max and a 24- or 32-core GPU) have theoretical peaks of 5-10 TFLOPS, so Zaratan should be roughly 350 to 700 times faster.
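The 3.5 PFLOPS figure can be roughly reconstructed from the hardware table. The sketch below is an estimate, not the official calculation: the 16 double-precision FLOPs per core per cycle (two 256-bit FMA pipes on Zen 3) and the A100's 19.5 TFLOPS FP64 tensor-core rate are assumptions drawn from public spec sheets, not stated on this page.

```python
# Rough theoretical-peak estimate for Zaratan (assumptions noted in comments).
cpu_cores = 46080        # standard compute partition (from the table above)
base_ghz = 2.45          # AMD EPYC 7763 base clock
flops_per_cycle = 16     # assumption: Zen 3 sustains 2 x 256-bit FMA/cycle,
                         # i.e. 16 double-precision FLOPs per core per cycle

# cores x GHz x FLOPs/cycle gives GFLOPS; divide by 1000 for TFLOPS.
cpu_peak_tflops = cpu_cores * base_ghz * flops_per_cycle / 1000
print(round(cpu_peak_tflops))   # ~1806 TFLOPS, i.e. about 1.8 PFLOPS

a100_fp64_tflops = 19.5  # assumption: Nvidia-quoted FP64 tensor-core peak
gpu_peak_tflops = 80 * a100_fp64_tflops
print(gpu_peak_tflops)          # 1560 TFLOPS, i.e. about 1.6 PFLOPS

# CPU + A100 partitions together land near the quoted 3.5 PFLOPS.
```

Under these assumptions the CPU and A100 partitions contribute roughly 1.8 and 1.6 PFLOPS respectively, which is consistent with the quoted cluster-wide peak.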