Managing data with Spectrum LSF on IBM Cloud
When you work with HPC workloads in a cloud environment, a critical challenge to address is how best to manage the data that is needed for running the workloads, as well as the output that might need to be analyzed for further processing and decision making. With IBM® Spectrum LSF clusters deployed on IBM Cloud®, you can use the following methods to manage your data.
Hybrid setup with IBM Cloud
If your setup uses a VPN or direct link to connect the on-premises Spectrum LSF environment to the Spectrum LSF cluster on IBM Cloud, you can configure the LSF multicluster capability and use the Spectrum LSF Data Manager component to stage data from on-premises to your Spectrum LSF cluster on IBM Cloud.
With an on-premises cluster, data typically resides on one or more file systems that are mounted on every compute node. In general, you don't need to consider data location or locality (certain workloads do benefit from being "closer" to the filer, and IBM Spectrum LSF does have a few capabilities that allow you to consider data locality).
In a hybrid setup, where work can run on-premises and in the cloud, how and when data is moved to and from the cloud is important. Unfortunately, there is not a one-size-fits-all solution for this, and your data movement strategy depends on multiple factors:
- Frequency and volume of workload being sent to the cloud - are you sending a handful of jobs or thousands?
- Frequency of change of that data - is the data set static or constantly changing?
- Size of the data - how much needs to be moved?
- Uniqueness of data per job - does every job require unique data inputs or are most of the jobs reusing or sharing common data?
- Runtime of the workload - if the ratio of compute time to data transfer time is low, it is probably not cost-effective to run that workload on the cloud
For data requirements that are static or that have small incremental changes, mirroring the data set between on-premises and the cloud can be cost-effective by using solutions such as IBM Aspera. Data can be moved in bulk when the cluster is created and resynced on a schedule or on a per-job basis to ensure that the latest changes are available.
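As an illustrative sketch, a bulk sync with Aspera's ascp command-line client might look like the following; the host name, paths, key file, and target rate are assumptions to adapt to your environment:

```bash
# A minimal sketch of a bulk sync with Aspera's ascp client.
# -i : SSH private key for the transfer user (placeholder path)
# -l : target transfer rate
# -k 2 : resume interrupted transfers
# -P/-O : TCP (SSH) and UDP (FASP) ports
ascp -i ~/.ssh/aspera_key -l 1G -k 2 -P 33001 -O 33001 \
     /data/designs/ lsfadmin@cloud-login.example.com:/mnt/nfs/designs/
```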
For jobs that have unique requirements, the data requirements can be specified as part of the job submission (bsub -f), and the files are transferred to the compute node when the job runs; the results are transferred back upon completion. In this case, the data is purely transient and is erased when the compute node is de-provisioned.
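As a minimal sketch, a submission with per-job staging might look like the following (the file names and solver command are hypothetical). The ">" operator copies a local file to the execution host before the job starts, and "<" copies a remote file back after the job finishes:

```bash
# Stage input.dat to the execution host before the job starts (">"),
# and copy results.out back to the submission host afterward ("<").
# run_solver and the file names are hypothetical placeholders.
bsub -f "input.dat > input.dat" \
     -f "results.out < results.out" \
     ./run_solver input.dat results.out
```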
As data transfer times increase, this approach becomes less efficient because the compute node is provisioned before the data is transferred, and the results must be transferred back before the node can be de-provisioned.
LSF's Data Manager component addresses this by scheduling data movement independently of the job. Input files are transferred to cloud storage before any nodes are provisioned, and results are transferred back after the node is de-provisioned. Data Manager also de-duplicates transfers, which avoids transferring the same files time and time again. This is important when conducting design-of-experiments analyses or regression and verification runs, where most of the data is common across thousands or hundreds of thousands of jobs.
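A hedged sketch of this workflow, assuming a source host hostA and hypothetical file names: the bsub -data option declares the input as a data requirement so that LSF can stage it before the job is dispatched, and the bstage command moves files between the staging cache and the job's working directory:

```bash
# Declare a data requirement at submission; LSF transfers the file to the
# staging area (once, de-duplicated across jobs) before dispatch.
bsub -data "hostA:/proj/common/model.bin" ./job.sh

# Inside job.sh: pull the cached input in, run, and push results out.
bstage in -src "hostA:/proj/common/model.bin" -dst ./model.bin
./run_analysis ./model.bin > results.out
bstage out -src ./results.out -dst "hostA:/proj/results/"
```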
To further reduce data movement, pre- and post-processing can also be conducted in the cloud. LSF's Application Center provides a web portal and restful API, which allows jobs to be submitted directly to the cloud cluster and starts remote visualization tools on the job or data in the cloud. The Application Center built in support for visualization of common output formats. It also has a client component that can be used to upload or download data directly from your laptop to the cloud cluster.
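As an illustrative sketch only, a session with the pacclient.py client that ships with Application Center might look like the following; the URL, credentials, application profile, and parameter names are assumptions that depend on how your installation and templates are configured:

```bash
# Log on to the Application Center portal (URL and credentials are
# placeholders for your installation).
pacclient.py logon -l https://pac.example.com:8443 -u lsfuser -p '<password>'

# Submit a job through an application profile; "generic" and COMMANDTORUN
# are assumptions that depend on your submission templates.
pacclient.py submit -a generic -p "COMMANDTORUN=./run_solver input.dat"
```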
The NFS instance that is deployed with the IBM Spectrum LSF cluster in the cloud can be used as the staging destination in the Data Manager configuration. When the data is available on NFS, it is visible to the management and worker nodes of your Spectrum LSF cluster.
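For example, pointing the Data Manager staging area at the cluster's NFS mount is a one-line setting in the lsf.datamanager.<cluster_name> configuration file; the mount point and cluster name below are placeholder assumptions:

```bash
# A minimal sketch: set STAGING_AREA in the Data Manager configuration.
# /mnt/nfs and "mycluster" are placeholders; if a Parameters section
# already exists, merge the setting into it instead of appending.
cat >> $LSF_ENVDIR/lsf.datamanager.mycluster <<'EOF'
Begin Parameters
STAGING_AREA = /mnt/nfs/lsf_staging
End Parameters
EOF
```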
Stand-alone cluster on IBM Cloud
With a dedicated or stand-alone cluster on the cloud, your data is resident in the cloud. You can use IBM Cloud Object Storage to bring the data into your IBM Cloud account. Object Storage provides an inexpensive and reliable option to manage your data in IBM Cloud. When the data is available in Object Storage, you can mount the Object Storage bucket onto the management node, or copy the data from the Object Storage bucket to the NFS instance, to make the data visible to the management and worker nodes of your Spectrum LSF cluster.
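As a hedged sketch, mounting a bucket with s3fs using HMAC credentials might look like the following; the bucket name, regional endpoint, credentials, and mount point are assumptions (see the s3fs link below for details, including IAM-based options):

```bash
# Store HMAC credentials for s3fs (placeholder values) and restrict access.
echo "<access_key_id>:<secret_access_key>" > ~/.passwd-s3fs
chmod 600 ~/.passwd-s3fs

# Mount the bucket; the bucket name, endpoint, and mount point are
# assumptions for illustration.
s3fs my-lsf-bucket /mnt/cos \
     -o url=https://s3.us-south.cloud-object-storage.appdomain.cloud \
     -o passwd_file=~/.passwd-s3fs
```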
Additional resources
- For more information on how to configure and handle issues with data management, see Four Hybrid Cloud Data Management Challenges.
- For more information on setting up a Spectrum LSF multicluster configuration, see IBM Spectrum LSF multicluster capability.
- For more information on setting up a Spectrum LSF data manager configuration, see IBM Spectrum LSF Data Manager.
- For more information on how to use the IBM Aspera plug-in with Spectrum LSF, see Configuring IBM Aspera as a data transfer tool.
- For more information about using IBM Cloud Object Storage, see Getting started with IBM Cloud Object Storage.
- For more information on how to use the s3fs interface with IBM Cloud Object Storage, see Mounting a bucket using s3fs.