Compute Cluster

The CM department's demand for high-performance computing is met by its own compute cluster. Since 2014 the cluster has been located at, and is operated by, the Max Planck Computing and Data Facility in Garching (near Munich), which offers not only great expertise and excellent support but also energy-efficient cooling via groundwater. The computational resources are reserved exclusively for the department.

The cluster consists of two parts, which are replaced alternately after six years each, i.e. every three years half of the cluster is renewed. At present, we have

  • cmti (since 2020): 358 compute nodes, each with 40 CPU cores (namely 2x 20-core Intel "Skylake" Xeon Gold 6230 @ 2.1 GHz) and 192 GB of RAM, connected via 100 Gbit/s Omnipath (1:12 blocking); Linpack performance ~1.8 Tflops/node when run within islands (~640 Tflops in total).
  • cmmg (since 2024): 96 compute nodes, each with 256 CPU cores (namely 2x 128-core AMD EPYC 9754, operated @ 1.8 GHz) and 768 GB of RAM, connected via 200 Gbit/s Infiniband; Linpack performance ~7 Tflops per single node (672 Tflops in total; all-node parallel test: 607 Tflops).
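The per-part totals quoted in the list above follow directly from the per-node specifications; a quick sanity check (the node counts and per-node figures are taken from the list, everything else is plain multiplication):

```python
# Per-part totals derived from the per-node specifications listed above.
parts = {
    "cmti": {"nodes": 358, "cores_per_node": 40, "tflops_per_node": 1.8},
    "cmmg": {"nodes": 96, "cores_per_node": 256, "tflops_per_node": 7.0},
}

for name, p in parts.items():
    cores = p["nodes"] * p["cores_per_node"]
    tflops = p["nodes"] * p["tflops_per_node"]
    print(f"{name}: {cores} cores, ~{tflops:.0f} Tflops Linpack")
```

Note that the naive product slightly exceeds the quoted aggregate Linpack numbers (e.g. 358 × 1.8 ≈ 644 vs. ~640 Tflops for cmti), since the per-node figure is measured within islands and does not account for cross-island communication.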

The cluster design is optimized for medium-scale plane-wave density-functional-theory calculations (20-400 cores, compute bound), which account for the major part of its actual use. The cluster is heavily used, typically running at more than 90% load throughout the year. A unique advantage of operating our own cluster is that we can adapt the usage rules to the scientific demands. For data, a 420 TB file server is available.

On the cmti part, the compute nodes are connected in an imbalanced tree topology with a high blocking factor: up to 44 nodes are attached to a single leaf switch. Within such a network island, the nodes communicate with each other at low latency, at the expense of bandwidth for communication between islands and with the file server. Consequently, each of our calculations is restricted to a single island (max. 1760 cores).

The cmmg part has a single interconnect switch, so we can run even very large jobs if necessary. To save electricity (and the associated cost), the cmmg part is currently run at a fixed, reduced frequency of 1.8 GHz instead of the usual dynamic scaling between 2.25 and 3.1 GHz. This reduces the compute speed to 80% (benchmarks) to 85% (realistic workloads) of what the 2.25 GHz base frequency delivers, while saving 23% of power compared to a fixed 2.25 GHz and 35% compared to dynamic scaling up to 3.1 GHz. (Frequency boosts increase single-node performance by up to 9%, but the variable speed limits parallel efficiency.)
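The 80% benchmark figure is simply the clock-frequency ratio: compute-bound codes scale roughly linearly with frequency, so fixing the clock at 1.8 GHz instead of the 2.25 GHz base frequency yields 1.8/2.25 = 0.8. Realistic workloads lose less (85%) because they are partly memory-bound and thus less sensitive to the CPU clock. A minimal check:

```python
# Clock-frequency ratio behind the quoted slowdown on the cmmg nodes:
# fixed 1.8 GHz vs. the 2.25 GHz base frequency.
reduced_ghz, base_ghz = 1.8, 2.25
ratio = reduced_ghz / base_ghz
print(f"speed ratio for compute-bound code: {ratio:.0%}")
```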

The next cluster upgrade is planned for 2026. In sum, ~38900 CPU cores are available, delivering ~28 million core-hours per month. The total amount of RAM is 142 TB, but this number is about as meaningful as the total number of pages in a library or the total weight of fruits available in a grocery store.
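The aggregate figures above can be reconstructed from the per-node specifications of the two parts (assuming a 30-day month and decimal TB for the RAM total):

```python
# Aggregate cluster figures, reconstructed from the per-node specs.
cmti_cores = 358 * 40                      # cmti: 358 nodes x 40 cores
cmmg_cores = 96 * 256                      # cmmg: 96 nodes x 256 cores
total_cores = cmti_cores + cmmg_cores      # -> 38896, i.e. ~38900

ram_tb = (358 * 192 + 96 * 768) / 1000     # GB -> TB (decimal), ~142 TB

# Million core-hours per 30-day month at full utilization.
core_hours_month = total_cores * 24 * 30 / 1e6

print(total_cores, round(ram_tb), round(core_hours_month))
```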
