Considerations for HPC cluster compute types
HPC workloads have varying requirements for CPU, memory, network, and storage resources. Key characteristics include:
- core count
- memory per core
- network bandwidth and latency
- processor clock speed
The goal is to pick a compute configuration that delivers the best price-performance. An HPC workload can be as simple as a single-core job or as complex as a job that needs hundreds to thousands of cores. HPC aggregates computing power through clustering to deliver higher performance and solve large problems.
Most HPC workloads use fewer than 1,000 cores, although some require a much higher core count, in the range of 10K - 50K. Execution time can range from a few seconds for simple jobs to several days for complex ones.
As an example, Electronic Design Automation (EDA) workloads include component-level simulation jobs that are run millions of times every day, where each individual job needs only a single core for approximately 10 seconds. In contrast, Optical Proximity Correction can take multiple hours or even days, depending on the size of the chip and the size of the HPC cluster.
Typically, an HPC cluster contains a set of virtual systems and is used to run multiple workloads; the right mix depends on each workload's resource requirements and duration.
To address this range of characteristics, IBM Cloud® provides various VPC virtual machine (VM) configurations in different profile families:
- Balanced configurations that provide on average 4 GB of memory per core and 4 - 64 Gbps of network bandwidth
- Compute-intensive configurations that provide 2 GB of memory per core and 4 - 80 Gbps of network bandwidth
- Memory-intensive configurations that provide 8 - 28 GB of memory per core and 2 - 80 Gbps of network bandwidth
For more information, see Instance profiles.
The IBM® Spectrum LSF solution supports the creation of static compute nodes by leveraging different instance type profiles based on resource requirements. This allows you to tailor the compute resources to meet the specific workload demands.
For instance, if the solution requires 100 static worker nodes, you can allocate 70 nodes from the bx3d-176x880 instance type and the remaining 30 from the cx3d-96x240 instance type. This approach enables a balanced distribution of compute resources, accommodating both high-performance and general-purpose tasks. However, when dynamic nodes are created based on the maximum node count, the automation currently selects only the first instance profile from the list to calculate CPU and memory resources for dynamic node creation.
Configuration example:
[
  {
    "count": 0,
    "instance_type": "bx2-4x16"
  },
  {
    "count": 0,
    "instance_type": "cx2-8x16"
  }
]
In this scenario, the automation chooses the first profile in the list (bx2-4x16 in this example) to compute CPU, memory, and other specifications, so dynamic nodes are provisioned solely based on this profile, regardless of any additional instance type definitions.
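The following minimal Python sketch illustrates that behavior under the assumption that the profile name encodes resources as <family>-<vCPUs>x<memory in GB>; it is illustrative only and not the actual IBM Spectrum LSF automation code.

```python
# Hedged sketch of the first-profile behavior described above; this is
# illustrative Python, not the actual IBM Spectrum LSF automation code.
worker_instance_types = [
    {"count": 0, "instance_type": "bx2-4x16"},
    {"count": 0, "instance_type": "cx2-8x16"},
]

# Only the first entry is consulted when dynamic nodes are sized.
first = worker_instance_types[0]["instance_type"]

# VPC profile names encode resources as <family>-<vCPUs>x<memory in GB>.
vcpus, memory_gb = (int(x) for x in first.split("-")[1].split("x"))
print(f"Dynamic nodes are sized from {first}: {vcpus} vCPUs, {memory_gb} GB memory")
```

Given this behavior, if dynamic nodes should be sized from a particular profile, list that profile first.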
For all of these configurations, the core count ranges from 2 - 128 per virtual system. A special ultra-high-memory virtual system type might be applicable for workloads that require more memory per core; it can go up to 200 cores and as high as 28 GB per core.
The network bandwidth on a single NIC can reach a maximum of 16 Gbps. If higher bandwidth is needed, multi-NIC configurations that go up to 80 Gbps might be required; under those circumstances, 5 NICs need to be configured for the virtual system.
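As a quick check of that arithmetic, the short snippet below computes how many NICs are needed for a target aggregate bandwidth, assuming the 16 Gbps per-NIC cap described above.

```python
# Quick arithmetic for the NIC sizing above: with a 16 Gbps cap per NIC,
# an 80 Gbps virtual system needs 5 NICs.
import math

def nics_needed(target_gbps, per_nic_gbps=16):
    return math.ceil(target_gbps / per_nic_gbps)

print(nics_needed(80))   # -> 5
```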
By default, hyper-threading is enabled on an IBM Cloud virtual system, so you get 2 vCPUs per physical core, but it can be disabled easily.
Most HPC applications perform best with one process or thread per physical core.
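If you keep hyper-threading enabled, a small helper like the following (an illustrative sketch assuming SMT-2, that is, 2 vCPUs per physical core, and not part of any IBM tooling) can estimate how many ranks or threads to start per instance.

```python
# Illustrative helper, not IBM tooling: estimate how many ranks or threads to
# start on this instance when hyper-threading (SMT-2) is enabled.
import os

vcpus = os.cpu_count() or 1
threads_per_core = 2                       # 2 vCPUs per physical core with hyper-threading
physical_cores = max(1, vcpus // threads_per_core)

print(f"{vcpus} vCPUs -> start {physical_cores} ranks or threads on this instance")
```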
For communication-intensive workloads that can fit on a single virtual system, it might be better to pick the closest match, up to a 128-core virtual system instance, instead of splitting the workload across multiple virtual system instances with smaller core counts. This allows the processes to take advantage of faster communication through shared memory on a single virtual system rather than communicating across multiple virtual machines over an Ethernet network.
To put it in perspective, two processes running on the same virtual system might communicate in a fraction of a microsecond (for example, 0.3 microseconds), whereas across two virtual system instances it can take more than 30 microseconds: roughly 100 times faster communication through shared memory on a single virtual system.
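If you want to measure this difference yourself, a minimal MPI ping-pong such as the following sketch (assuming mpi4py and an MPI runtime are installed on the instances; the iteration count and 8-byte message size are arbitrary illustration values) reports an approximate one-way latency for small messages. Run it once with both ranks on the same virtual system and once with the ranks placed on two different instances, then compare the results.

```python
# Minimal MPI ping-pong latency sketch.
# Launch with, for example: mpirun -n 2 python pingpong.py
import time
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

iters = 10000
buf = bytearray(8)   # small message: exposes latency rather than bandwidth

comm.Barrier()
start = time.perf_counter()
for _ in range(iters):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)
        comm.Recv(buf, source=1, tag=0)
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=0)
elapsed = time.perf_counter() - start

if rank == 0:
    # Each iteration is a round trip (2 messages); half of it approximates one-way latency.
    print(f"Average one-way latency: {elapsed / iters / 2 * 1e6:.2f} microseconds")
```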
A very cost-effective configuration is cx2-128x256, which provides 128 cores with 2 GB of memory per core and can cover a broad range of MPI workloads.
Scalable MPI jobs can be set up across multiple virtual machines that are configured at up to 80 Gbps apiece, but that requires multiple NICs and might not be desirable. It is recommended to pick a configuration that provides the best network bandwidth per core with a single NIC; bx2-16x64 might be a good starting point for your MPI benchmarking.
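One way to compare candidate profiles is by network bandwidth per physical core, as in the following sketch; the bandwidth figures in it are placeholders only, so substitute the values from the Instance profiles documentation for the profiles you are considering.

```python
# Placeholder comparison of network bandwidth per physical core. The Gbps values
# below are illustrative only; use the figures from the Instance profiles
# documentation for the profiles you are actually considering.
profiles = {
    # profile name: (vCPUs, assumed single-NIC bandwidth in Gbps)
    "bx2-16x64": (16, 16),
    "cx2-128x256": (128, 16),
}

for name, (vcpus, gbps) in profiles.items():
    physical_cores = vcpus // 2       # hyper-threading: 2 vCPUs per physical core
    print(f"{name}: {gbps / physical_cores:.2f} Gbps per physical core")
```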
Benchmarking of specific workloads
Electronic Design Automation (EDA)
IBM Systems and IBM Research work in this industry domain and have successfully used IBM Cloud for such workloads. The following graph displays a scale test of up to 30K cores. To showcase how cloud zones can be used as a single data center, we built a large HPC cluster that aggregates resources across three IBM Cloud locations. The setup also uses IBM Storage Scale as a scratch-based, high-performing file system along with the IBM Spectrum LSF cluster configuration. We used BX2-48x192 for the IBM® Spectrum LSF worker nodes and MX2d-16x128 for the storage nodes under IBM Storage Scale.

Weather (WRF Model)
IBM Cloud shows linear performance, performs favorably, and can scale well into thousands of cores. The virtual system configuration that is used for this benchmark is bx2-16x64. The WRF model is not sensitive to network latency because it packs many variables into each message, resulting in fairly large messages rather than many small ones.

The red line represents the HPC environment with InfiniBand HDR, which gives the highest bandwidth and lowest latency and is the best configuration for such workloads. The green line shows IBM Cloud benchmarking in the Lon2 data center. The blue line is the Summit supercomputer. In summary, any workload with characteristics similar to the WRF model should scale well on IBM Cloud. As you can see, IBM Cloud shows reasonable performance against state-of-the-art HPC systems.
DoE (Department of Energy) benchmarking
SNAP and Quicksilver are two applications that the DoE uses for benchmarking and for selecting specific commodity technology systems.
The following graphs show results on how IBM Cloud compares with the state-of-the-art HPC system.
On IBM Cloud, the benchmarks use two different configurations:
- bx2-8x32
- bx2-16x64
As you can see, the SNAP results show that bx2-8x32 provides more performance because of its higher effective network bandwidth per core, whereas Quicksilver does well with bx2-16x64 because it has moderate communication requirements and mostly near-neighbor communication to track particle motion across the global domain.


Even though the scaling is not as good as for the weather model, IBM Cloud can scale reasonably to thousands of cores with a linear curve.
Virtual system use cases
The choice of virtual system profile type on IBM Cloud depends on your workload's core, memory, and network requirements.
Single node virtual system use cases
This should be the first evaluation as it can provide the best price and performance for running such jobs on IBM Cloud.
You can choose from the following set of virtual system profiles:
- CX2-16x32 to CX2-128x256
- BX2-16x64 to BX2-128x512
- MX2-16x128 to MX2-128x1024
Depending on the memory required per core, you might pick the MX2 configuration, which can support up to 1 TB on the MX2-128x1024 profile.
If the memory required per core is less than 2 GB, an appropriate CX2 profile might give you the best price and performance. The advantages of faster communication over shared memory help with the performance if the workload can run on a single virtual system.
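As a rough illustration of this rule of thumb, the following hedged helper maps a memory-per-core requirement onto the profile families listed above (2 GB per core for CX2, 4 GB for BX2, 8 GB and more for MX2); treat it as a starting point for evaluation, not a sizing tool.

```python
# Hedged rule-of-thumb helper based on the memory-per-core figures in this
# section (2 GB per core for CX2, 4 GB for BX2, 8 GB and more for MX2).
def suggest_profile_family(mem_gb_per_core: float) -> str:
    if mem_gb_per_core <= 2:
        return "CX2"   # compute-intensive
    if mem_gb_per_core <= 4:
        return "BX2"   # balanced
    return "MX2"       # memory-intensive

print(suggest_profile_family(1.5))   # CX2
print(suggest_profile_family(6.0))   # MX2
```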
Some examples of such workloads:
- Local area weather forecasting: models that are not high-resolution or large, but modest in size.
- OpenFOAM computational fluid dynamics, with problem sizes between 2 million - 10 million grid cells.
- Design rule checking (DRC) in Electronic Design Automation (EDA) as part of chip designs.
- EDA single-component simulation and verification jobs.
Scale out use case with multiple virtual machines
This is the second category of workloads, where multiple virtual machines are required for execution. Even in this category, depending on the network bandwidth and latency requirements, a significant set of HPC workloads scales well on IBM Cloud.
There are cases where small updates to the application code might be needed for it to run well in a cloud-like environment; these changes are not specific to IBM Cloud and would be desirable when bursting to any cloud provider.
IBM carries deep HPC expertise and can provide specific recommendations to get you the best price and performance for an HPC cloud environment.
IBM Cloud has been successful in the following examples:
- Optical Proximity Correction (OPC) in EDA
- Full chip Integrated Circuit Validator (ICV) in EDA
- Any Hadoop map/reduce or Spark workload
- MPI workloads that cannot fit on a single virtual system
The recommendation for such workloads is to start with BX2-16x64, as it has given the best performance in tests so far. Based on your results, you might identify alternative options that better suit your specific workload. Some workloads might be network-latency sensitive, and IBM Cloud configurations might not appear promising; even in these cases, engage with the offering owner and your sales team so that the HPC experts can evaluate your specific requirements and provide assistance.
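To complement the single-node latency sketch earlier in this section, the following example (again assuming mpi4py and an MPI runtime are available; the message size and iteration count are arbitrary illustration values) times an MPI Allreduce across however many ranks you launch. Running it over a handful of BX2-16x64 instances, for example with one rank per physical core, gives a first impression of scale-out behavior before you move to your real workload.

```python
# Simple MPI Allreduce timing loop for a first look at scale-out behavior.
import time
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

data = np.ones(1_000_000, dtype=np.float64)   # roughly 8 MB per rank
result = np.empty_like(data)
iters = 50

comm.Barrier()
start = time.perf_counter()
for _ in range(iters):
    comm.Allreduce(data, result, op=MPI.SUM)
elapsed = time.perf_counter() - start

if rank == 0:
    print(f"Average Allreduce time across {comm.Get_size()} ranks: "
          f"{elapsed / iters * 1e3:.2f} ms")
```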