IBM Cloud Docs
About cluster networks

Cluster Networks for VPC is available for select customers only. Contact IBM Support if you are interested in using this functionality.

A cluster network is a software-defined network within a Virtual Private Cloud (VPC) used to connect multiple computing systems or nodes in a way that optimizes performance and communication between them. These networks are designed to support tasks that require high-speed data transfer and low latency, such as high-performance computing (HPC) and large-scale data processing. Ideal for large-scale AI training use cases, cluster networks also allow you to define sets of performance criteria for a given group of interconnected systems.

Each cluster network has an associated cluster network profile that describes the type of components that it can connect with. Within the cluster network is a set of basic networking abstractions that give you the flexibility and control you need to configure high-performance workloads.

Key features include:

Isolation and optimization
Provides high-bandwidth, low-latency networking for groups of compute resources. Cluster networks are isolated in a separate IPv4 address space, which is not routed externally.
Specialized technologies
Supports advanced networking technologies like Remote Direct Memory Access (RDMA), which allows direct data transfer between the memory of different nodes without involving the CPU, further enhancing performance.
High-performance computing
Suited for demanding applications, such as artificial intelligence (AI) training or complex simulations, where high bandwidth and low latency are critical.
Flexibility and control
Supplements the existing VPC network. Cluster network attachments are separate from the VPC networks, so you can use high-speed RDMA networks alongside your VPC networks.

Getting started with cluster networks

A cluster network enhances the efficiency and speed of data transfer within a networked group of systems, making it an essential component for high-performance computing tasks. Follow these general steps to create a simple cluster network for AI training:

  1. Review planning considerations for cluster networks and be aware of any known issues and limitations.

  2. Determine the total resources required for your cluster by multiplying the number of instances you intend to create by the resources defined in the corresponding instance profile.

  3. Check the calculated total resources required for your cluster against the default quotas to determine if a quota increase is necessary.

  4. Ensure that you have an existing VPC in a region that has capacity for NVIDIA H100 profiles with clustering support.

    Currently, the only supported zone is us-east-wdc07-a. For more information about zones, see zone mapping per account.

  5. Create a cluster network. Currently, only the NVIDIA H100 cluster profile is supported.

  6. Create cluster network subnets (8, 16, or 32) as child objects on the cluster network.

    If creating a cluster network in the UI, you can create cluster network subnets at the same time. While it is recommended that you use 8 subnets, certain scenarios will utilize a larger number of subnets.

    Subnets within the H100 cluster network type are routable to each other. However, the cluster network is not routable externally.

  7. Do one of the following:

    Advanced users might want to preallocate IP addresses or interfaces. However, it is recommended that you create IPs or interfaces when creating an instance.
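
The sizing check in steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration: the `ProfileResources` type, the resource names, and every number below are hypothetical placeholders, not values taken from an IBM Cloud instance profile or quota page. Substitute the figures from your own profile and account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileResources:
    """Per-instance resources (placeholder fields, fill in from your instance profile)."""
    vcpus: int
    memory_gib: int
    gpus: int


def total_required(profile: ProfileResources, instance_count: int) -> dict[str, int]:
    """Step 2: multiply the per-instance resources by the number of instances."""
    return {
        "vcpus": profile.vcpus * instance_count,
        "memory_gib": profile.memory_gib * instance_count,
        "gpus": profile.gpus * instance_count,
    }


def quota_shortfall(required: dict[str, int], quotas: dict[str, int]) -> dict[str, int]:
    """Step 3: return how far each requirement exceeds its quota (empty if none do)."""
    return {
        name: need - quotas.get(name, 0)
        for name, need in required.items()
        if need > quotas.get(name, 0)
    }


# All numbers here are made-up placeholders for illustration only.
profile = ProfileResources(vcpus=160, memory_gib=1792, gpus=8)
required = total_required(profile, instance_count=4)
quotas = {"vcpus": 200, "memory_gib": 5600, "gpus": 16}
shortfall = quota_shortfall(required, quotas)

print(required)
print(shortfall)
```

A non-empty shortfall suggests that a quota increase is needed before you create the instances.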

Cluster network use cases

Cluster Networks for VPC supports the following use cases.

Use case 1: Networking H100-enabled instances to use RDMA on IBM Cloud

The following diagram demonstrates how you can connect your H100-enabled instances through cluster networks to use RDMA on IBM Cloud:

First, make sure that you have a cluster network set up with cluster network subnets. Then, create H100 instances that will connect to the cluster network. When you create an instance, you specify both the VPC networks and the cluster networks. The VPC networks provide connectivity to your cloud resources and can provide external routing. The cluster networks are additional interfaces within your instance that provide connectivity between the nodes.

Networking H100-enabled instances to use RDMA on IBM Cloud

Use case 2: Securing your cluster network from the rest of the IBM Cloud ecosystem

The cluster network is isolated in a network domain that is separate from the VPC cloud network. This isolation domain keeps traffic on the cluster network secure without the need for security groups, network ACLs, or routing tables. Communication within the cluster network occurs only between devices that are directly connected to the cluster network.

Resources that are connected to the cluster network must also connect to at least one VPC network. The VPC network supports all of the IBM Cloud network use cases of a standard VPC resource, such as Floating IPs, Public Gateways, and Transit Gateway. As such, it's recommended that you review the security policies of the Virtual Network Interfaces attached to resources on the cluster network.
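
The requirement that every resource on the cluster network also has a VPC network connection can be checked before provisioning. The following Python sketch works on a plain in-memory plan. The `InstancePlan` type, the `validate_plan` helper, and all subnet names are hypothetical and do not correspond to any IBM Cloud API object.

```python
from dataclasses import dataclass, field


@dataclass
class InstancePlan:
    """Hypothetical planning record for one instance (not an IBM Cloud API type)."""
    name: str
    vpc_interfaces: list[str] = field(default_factory=list)       # VPC subnet names
    cluster_attachments: list[str] = field(default_factory=list)  # cluster network subnet names


def validate_plan(plan: InstancePlan, cluster_subnets: set[str]) -> list[str]:
    """Return a list of problems found in the plan (empty if it looks consistent)."""
    problems = []
    # An instance on the cluster network must also connect to a VPC network.
    if plan.cluster_attachments and not plan.vpc_interfaces:
        problems.append(f"{plan.name}: needs at least one VPC network interface")
    # Every attachment should point at a subnet that exists on the cluster network.
    for subnet in plan.cluster_attachments:
        if subnet not in cluster_subnets:
            problems.append(f"{plan.name}: unknown cluster network subnet {subnet!r}")
    return problems


# Placeholder names: eight cluster network subnets, cn-subnet-0 to cn-subnet-7.
cluster_subnets = {f"cn-subnet-{i}" for i in range(8)}
good = InstancePlan("node-0", ["vpc-subnet-a"], ["cn-subnet-0", "cn-subnet-1"])
bad = InstancePlan("node-1", [], ["cn-subnet-0", "cn-subnet-9"])

print(validate_plan(good, cluster_subnets))
print(validate_plan(bad, cluster_subnets))
```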

Securing your cluster network from the rest of the IBM Cloud ecosystem

To maintain minimal access to the cluster network, you must:

  • Limit the access to the instances that are connected to the cluster network.
  • Ensure that there is a tight security group for the VPC Virtual Network Interfaces (VNIs) on each instance that has access to the cluster network.
    • Make sure to carefully guard inbound TCP requests and consider guarding outbound TCP requests. For more information, see Setting up a security group for your resource.
  • Ensure that the subnets attached to your cluster network enabled instances have appropriate network ACLs.
  • Configure your IAM policies for cluster network permissions.
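
One way to review how tightly inbound TCP is guarded is to scan an exported list of security group rules for inbound rules that accept traffic from any address. The Python sketch below assumes a simplified, hypothetical `Rule` record; its field names are illustrative and do not mirror the IBM Cloud security group rule schema.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """Simplified, hypothetical security group rule (field names are assumptions)."""
    direction: str                  # "inbound" or "outbound"
    protocol: str                   # "tcp", "udp", "icmp", or "all"
    remote_cidr: str                # source (inbound) or destination (outbound) range
    port_min: Optional[int] = None
    port_max: Optional[int] = None


def is_open_to_any(rule: Rule) -> bool:
    """True when the remote range covers every address, for example 0.0.0.0/0."""
    return ipaddress.ip_network(rule.remote_cidr).prefixlen == 0


def risky_inbound_tcp(rules: list[Rule]) -> list[Rule]:
    """Return inbound rules that admit TCP from any address."""
    return [
        rule
        for rule in rules
        if rule.direction == "inbound"
        and rule.protocol in ("tcp", "all")
        and is_open_to_any(rule)
    ]


# Example rules with placeholder address ranges.
rules = [
    Rule("inbound", "tcp", "10.240.0.0/24", 22, 22),   # SSH from one internal range
    Rule("inbound", "tcp", "0.0.0.0/0", 22, 22),       # SSH from anywhere
    Rule("outbound", "all", "0.0.0.0/0"),              # unrestricted egress
]

for rule in risky_inbound_tcp(rules):
    print("review:", rule)
```

The same filter could be adapted to outbound rules if you also decide to guard outbound TCP requests.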