IBM Cloud Docs
Disaster scenarios in watsonx.data

Scenario 1: Kubernetes configuration corruption

Description: Corruption or deletion of formations, ConfigMaps, Secrets, and other configuration resources.

  • General impact:

    • SRE receives alerts and restores the Kubernetes configurations or formations.
    • Inflight queries fail, resulting in a temporary outage.
    • RTO and RPO are reviewed with SRE.
  • Milvus:

    • Inflight queries and data ingestion fail.
    • SRE restores the formation from backup.
    • Temporary outage; RPO and RTO are updated.
  • Presto:

    • Inflight queries and ingestion fail.
    • Formation restored from backup.
  • MDS:

    • Inflight API calls fail.
    • Outage until formation is restored.
    • Velero backup ensures service restoration, but rare port conflicts might require manual service edits.
  • Spark:

    • Only workloads tied to corrupted configurations fail.
    • Other workloads continue.
    • Users must rerun failed jobs.
  • Customer involvement: None
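The Spark behavior in this scenario (only workloads tied to corrupted configurations fail and must be rerun) can be illustrated with a minimal sketch. The workload names, configuration names, and the `workloads_to_rerun` helper are hypothetical and are not part of any watsonx.data API; the sketch only assumes that each workload is known to reference a set of named configurations.

```python
from typing import Dict, List, Set


def workloads_to_rerun(
    workload_configs: Dict[str, Set[str]], corrupted: Set[str]
) -> List[str]:
    """Return the workloads that reference at least one corrupted configuration.

    Workloads with no corrupted configuration are assumed to keep running.
    """
    return sorted(
        name
        for name, configs in workload_configs.items()
        if configs & corrupted  # non-empty intersection means the workload is affected
    )


# Hypothetical inventory of Spark workloads and the configurations they use.
workloads = {
    "etl-nightly": {"spark-defaults", "s3-credentials"},
    "report-daily": {"spark-defaults"},
    "ad-hoc-scan": {"scan-settings"},
}

print(workloads_to_rerun(workloads, {"s3-credentials"}))  # ['etl-nightly']
```

Users could apply the same idea to their own job inventory to decide which jobs to resubmit after SRE restores the configurations.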

Scenario 2: Persistent storage corruption

Description: Corruption of persistent volumes.

  • General impact:

    • Services using PVCs are affected.
  • Milvus:

    • PVC restored from backup.
    • Temporary outage due to ETCD downtime.
    • No data loss.
  • Presto, MDS, Spark:

    • No impact (do not use PVCs).
  • Customer involvement: None

Scenario 3: Data or metadata corruption

Description: Corruption of stored data or metadata.

  • General impact:

    • Service outages during recovery.
  • Milvus:

    • ETCD metadata restored from hourly backups.
    • Potential loss of up to 1 hour of metadata.
    • Customer responsible for vector storage backups.
  • Presto:

    • Point-in-time backups used to restore configuration and metadata.
  • MDS, Spark:

    • No impact.
  • Customer involvement: None
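The "up to 1 hour of metadata" figure for Milvus follows from the hourly backup cadence: whatever was written after the most recent backup is lost. The sketch below works through that arithmetic. It assumes, purely for illustration, that backups are taken at the top of each hour; the actual schedule is not stated here, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# "Hourly" comes from the text; the top-of-the-hour alignment is an assumption.
BACKUP_INTERVAL = timedelta(hours=1)


def last_backup_before(failure: datetime) -> datetime:
    """Return the assumed time of the most recent backup before the failure."""
    return failure.replace(minute=0, second=0, microsecond=0)


def metadata_loss_window(failure: datetime) -> timedelta:
    """Return how much metadata history is lost when restoring the last backup."""
    return failure - last_backup_before(failure)


failure_time = datetime(2024, 5, 1, 14, 47)
loss = metadata_loss_window(failure_time)
print(loss)                    # 0:47:00
print(loss < BACKUP_INTERVAL)  # True: the loss stays under the one-hour interval
```

Under this assumption the worst case approaches, but never reaches, one full hour, which matches the stated potential loss.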

Scenario 4: Cluster failure

Description: Complete failure of the cluster.

  • Milvus:

    • Formation and data restored from backup.
    • Possible 1-hour metadata loss.
    • Customer responsible for vector storage backups.
  • Presto:

    • Formation and data restored from backup.
  • MDS:

    • Inflight API calls fail.
    • Outage until cluster or formation is restored.
  • Spark:

    • All running workloads fail.
    • No data loss.
    • SRE restores formation on a new cluster.
    • Users must rerun failed jobs.
  • Customer involvement: None

Scenario 5: Availability Zone (AZ) outage

Description: One AZ becomes unavailable.

  • General impact:

    • Cluster has capacity to migrate workloads.
    • Pods are automatically rescheduled to healthy AZs.
  • Milvus:

    • Metadata is Active-Active in the Enterprise plan.
    • Inflight queries fail; no long-term impact.
  • Presto:

    • Pods rescheduled; inflight queries fail.
  • MDS:

    • If only one AZ is down, no impact.
    • If two or more AZs are down, service is impacted until at least one AZ is restored.
  • Spark:

    • Workloads with drivers in the failed AZ fail.
    • Executors recover in healthy AZs.
    • No impact on workloads in unaffected AZs.
  • Customer involvement: None
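The MDS and Spark rules for an AZ outage can be written as two small predicates. This is a sketch of the rules as stated above, not of how the services implement them; the zone names, the `SparkWorkload` type, and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Set


def mds_impacted(failed_azs: Set[str]) -> bool:
    """MDS tolerates one failed AZ; two or more failed AZs impact the service."""
    return len(failed_azs) >= 2


@dataclass(frozen=True)
class SparkWorkload:
    name: str
    driver_az: str  # zone where the driver pod runs (illustrative field)


def spark_workload_fails(workload: SparkWorkload, failed_azs: Set[str]) -> bool:
    """A workload fails only if its driver is in a failed AZ.

    Executors in a failed AZ are assumed to be rescheduled to healthy AZs.
    """
    return workload.driver_az in failed_azs


failed = {"az-1"}  # hypothetical zone name
print(mds_impacted(failed))                                          # False
print(spark_workload_fails(SparkWorkload("etl", "az-1"), failed))    # True
print(spark_workload_fails(SparkWorkload("report", "az-2"), failed)) # False
```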

Scenario 6: Regional disaster

Description: Entire region becomes unavailable.

  • Milvus:

    • Customer provisions a new watsonx.data instance and Milvus formation in another region.
    • Same bucket and path must be used.
    • Customer shares CRNs of old and new formations.
    • SRE restores ETCD metadata.
  • Presto:

    • Customer provisions new formation.
    • SRE restores metadata and console DB.
  • MDS:

    • If hourly Postgres backups are enabled, restore to a new DB instance.
    • Update MDS pod environment variables to point to the new DB.
    • RPO: 1 hour; RTO: 2–3 hours.
    • Console DB and AMS DB also impacted.
  • Spark:

    • All running workloads fail.
    • Customer provisions a new watsonx.data instance and Spark engine.
    • No data loss (logs and events are kept in the object store).
  • Customer involvement:

    • Provision new formations and share CRNs.

    Milvus uses Kafka in Active-Active mode (Enterprise plan), so no customer action is needed for Kafka recovery.
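Before sharing CRNs with SRE, a customer might sanity-check the new Milvus formation against the requirements listed for this scenario: a different region, the same bucket and path, and both CRNs at hand. The sketch below is one way to express that checklist. The `MilvusFormation` type, its fields, and the placeholder CRN, region, bucket, and path values are hypothetical and do not correspond to a real API or to real identifiers.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MilvusFormation:
    crn: str      # placeholder strings below, not real CRNs
    region: str
    bucket: str
    path: str


def recovery_blockers(old: MilvusFormation, new: MilvusFormation) -> List[str]:
    """Return the unmet requirements; an empty list means the checklist passes."""
    problems = []
    if new.region == old.region:
        problems.append("new formation must be in another region")
    if new.bucket != old.bucket:
        problems.append("new formation must use the same bucket")
    if new.path != old.path:
        problems.append("new formation must use the same path")
    if not old.crn or not new.crn:
        problems.append("CRNs of both old and new formations are required")
    return problems


old = MilvusFormation("crn-old-placeholder", "region-a", "vectors", "milvus/prod")
new = MilvusFormation("crn-new-placeholder", "region-b", "vectors", "milvus/prod")
print(recovery_blockers(old, new))  # []
```

Returning a list of problems rather than a single boolean lets all mismatches be reported at once.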