IBM Spectrum Symphony Troubleshooting

Why is IBM Cloud Schematics not able to clone the private GitHub repo?

Schematics isn't able to clone the private GitHub repository, and you are seeing the following error message: Failed to clone git repository, repository not found (check url, also check the scope 'repo' of the personal access token if SCHEMATICSGITTOKEN is used)

You didn't provide the correct GitHub token, or you didn't provide a GitHub token at all.

Provide a GitHub token, and verify that the correct token is set in the github_token parameter of the create workspace API request.

Why is IBM Cloud Schematics not able to clone the public GitHub repo?

Schematics isn't able to clone the public GitHub repository, and you are seeing one of the following error messages:

Fatal, could not download repo, Failed to clone git repository, authentication required (or the git url is incorrect). Problems found with the Repository. Please Rectify and Retry
Template error: Failed to clone git repository, authentication required (or the git url is incorrect)

You didn't provide the correct GitHub URL, or you provided a GitHub token, which is not required to clone a public repo. A GitHub access token is only required to access a private repo.

Verify that the GitHub URL is correct, and do not provide a GitHub token for a public repo. Check whether a token was set in the github_token parameter when the workspace was created, and remove it if so.

Why is IBM Cloud Schematics not able to create a workspace?

Schematics is not able to create a workspace, and you are seeing the following error message: You don't have the required to create a workspace in any resource groups. You must be assigned the manager role on the Schematics service in at least one resource group. Contact your account administrator for access.

You don't have the required access to create a workspace in any resource groups. You must be assigned the manager role on the Schematics service in at least one resource group.

Contact your account administrator and ask to be assigned the manager role on the Schematics service in at least one resource group.

Why is IBM Cloud Schematics not able to provision the cluster and fails with an authorization error?

Schematics isn't able to provision the cluster, and you are seeing the following error message: Request is not authorized. Check your user permissions and authorizations and try again.

You don't have the required access to get any VPC resources provisioned.

Contact your account administrator and request the required access. For more information, see Required permissions.

Why is IBM Cloud Schematics not able to provision the cluster and fails with an error that the provided name is not unique?

Schematics isn't able to provision the cluster, and you are seeing the following example error message:

"code": "validation_unique_failed",
"message": "Provided Name (sample-symphony-vpc) is not unique",
"target": {
    "name": "name",
    "type": "field",
    "value": "sample-symphony-vpc"
}

VPC resource names must be unique. If a resource exists with the same name, you might get a similar error.

Deprovision the existing resource and try again.
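If you keep the error output in a log, the conflicting name can be extracted programmatically. The following is a minimal Python sketch. It assumes that the error is available as a JSON object with the code, message, and target fields shown in the example message; the function name find_duplicate_name is hypothetical and is not part of Schematics.

```python
import json


def find_duplicate_name(error_json: str):
    """Return the conflicting resource name, or None for unrelated errors.

    Assumes a JSON object shaped like the example message: the keys
    "code", "message", and "target" (which carries a "value" field).
    """
    error = json.loads(error_json)
    if error.get("code") != "validation_unique_failed":
        return None
    return error.get("target", {}).get("value")


# Sample input: the example fragment wrapped in braces to form valid JSON.
sample = """
{
  "code": "validation_unique_failed",
  "message": "Provided Name (sample-symphony-vpc) is not unique",
  "target": {"name": "name", "type": "field", "value": "sample-symphony-vpc"}
}
"""

print(find_duplicate_name(sample))  # sample-symphony-vpc
```

The returned name is the resource to look for, and deprovision, before you retry.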

Why is IBM Cloud Schematics not able to provision the cluster using a custom image?

While using a custom image, Schematics isn't able to provision the cluster, and you are seeing one of the following error messages:

The argument "image" is required, but no definition was found.
Unknown variable. There is no variable named "image_id".

The custom image that is used for one of the virtual server instances isn't present in the target region and zone or it is not accessible by the account and API key that is used to provision the cluster.

If you are using a custom image for any of your virtual server instances, ensure that the custom image is available in the target region and zone and is accessible by the account and API key that is used to provision the cluster.

Why does a refresh token error occur?

You are receiving a refresh token error when you generate a plan, apply a plan, or destroy resources: Error: The provided Refresh Token is invalid. Please provide a proper refresh token for Terraform to run the configuration. Code: 400

You didn't provide the correct refresh token, or you didn't provide a refresh token at all.

Verify that the refresh token that was generated by using the curl command is correct. If it is not, regenerate the refresh token.

Why am I receiving an error when I apply a change to my workspace?

You are receiving the following error when you try to apply a change to your workspace: Apply failed due to "Error: Error Deleting Volume : The volume is still attached to an instance."

After you reconfigure the volume profile, capacity, or IOPS, you must clean up your workspace before you apply the change.

Destroy your existing resources and try applying the change again. Be aware that destroying your existing resources deletes your data on the storage node.

Why am I receiving an error with the provided ssh_key_name value?

You are receiving the following error when you try to generate or apply a plan on a Schematics workspace: failed due to "Error: No SSH Key found with name <KEY_NAME>".

Terraform can't find one or more of the SSH key names that you provided.

Check whether the given SSH key is present in the current region where the cluster is being provisioned. If the given SSH key is not present, create the SSH key in the current region.
If you are using multiple SSH keys, use a comma (,) as the delimiter between the SSH key names and ensure that no white space is added before or after any SSH key name.
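The delimiter and white space rules can be checked before you generate a plan. The following is a minimal Python sketch; the function name parse_ssh_key_names and the sample key names are hypothetical. It checks only the format of the value and does not confirm that the keys exist in the target region.

```python
def parse_ssh_key_names(ssh_key_name: str) -> list:
    """Split a comma-delimited ssh_key_name value and reject stray white space."""
    names = ssh_key_name.split(",")
    for name in names:
        if not name:
            # Two commas in a row, or a leading or trailing comma.
            raise ValueError("empty SSH key name (check for a stray comma)")
        if name != name.strip():
            raise ValueError(f"white space around SSH key name: {name!r}")
    return names


print(parse_ssh_key_names("key-one,key-two"))  # ['key-one', 'key-two']

try:
    parse_ssh_key_names("key-one, key-two")
except ValueError as err:
    print(err)  # white space around SSH key name: ' key-two'
```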

Why am I getting an error when I try to run the Spectrum Symphony VaR simulation?

You are getting a Failed to Login error when you try to run the Spectrum Symphony Value at Risk (VaR) simulation.

You might be hitting a limitation of the VaR simulation, which requires that the cluster prefix be no longer than 10 characters and that the hostname of the Symphony primary host be fewer than 20 characters.

Ensure that your cluster prefix is no longer than 10 characters.
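Both length limits can be verified up front. The following is a minimal Python sketch; the function name check_var_name_limits and the sample values are hypothetical, and the hostname must be supplied by you because this sketch does not assume how the hostname is derived from the prefix.

```python
MAX_PREFIX_LENGTH = 10    # cluster prefix: no longer than 10 characters
MAX_HOSTNAME_LENGTH = 19  # primary host hostname: fewer than 20 characters


def check_var_name_limits(cluster_prefix: str, primary_hostname: str) -> list:
    """Return a list of problems; an empty list means both limits are met."""
    problems = []
    if len(cluster_prefix) > MAX_PREFIX_LENGTH:
        problems.append(
            f"cluster prefix is {len(cluster_prefix)} characters "
            f"(limit is {MAX_PREFIX_LENGTH})"
        )
    if len(primary_hostname) > MAX_HOSTNAME_LENGTH:
        problems.append(
            f"primary hostname is {len(primary_hostname)} characters "
            f"(limit is {MAX_HOSTNAME_LENGTH})"
        )
    return problems


# Hypothetical values for illustration only.
print(check_var_name_limits("symphony", "symphony-primary-0"))  # []
print(check_var_name_limits("my-long-prefix", "host"))
```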

Worker nodes are released when workload is in progress

The symA requester might release a compute node virtual machine while the workload is still in progress. This happens when the return_idle_only property is set to true and, under the immediate return policy, symA cannot get the allocation for the host and therefore assumes that the host has no allocation. The issue occurs when only a few tasks remain for the monitored applications. For more information, see Updating idle time before worker nodes are removed.

Incorrect provider configuration values do not result in an error

When you update the IBM Cloud provider configuration from the Symphony GUI (Resources > Cloud > Configuration), the values that are set in the configuration are not validated. If the configuration contains invalid values, virtual machine provisioning fails. If a failure happens, check the host factory logs in /opt/ibm/spectrumcomputing/hostfactory/log on the host that runs the HostFactory service for more information.

Limitation of available profiles for dedicated hosts

The offering automatically selects instance profiles for dedicated hosts that have the same prefix (for example, bx2 or cx2) as the profile that is used for worker instances (worker_node_instance_type). However, the available profile prefixes can be limited, depending on your target region. If you use dedicated hosts, run ibmcloud target -r {region_name} and then ibmcloud is dedicated-host-profiles to check whether the prefix of your worker_node_instance_type is available in your target region.
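The prefix comparison can be sketched in Python as follows. The sketch assumes that the prefix is the part of a profile name before the first hyphen and that you copy the profile names from the ibmcloud is dedicated-host-profiles output by hand. The function names and the sample profile list are illustrative only; the real list depends on your region.

```python
def profile_prefix(profile_name: str) -> str:
    """Return the text before the first hyphen, e.g. 'bx2' from 'bx2-4x16'."""
    return profile_name.split("-", 1)[0]


def prefix_available(worker_node_instance_type: str,
                     dedicated_host_profiles: list) -> bool:
    """True if any dedicated host profile shares the worker profile prefix."""
    wanted = profile_prefix(worker_node_instance_type)
    return any(profile_prefix(p) == wanted for p in dedicated_host_profiles)


# Illustrative list only; copy the real names from the CLI output.
profiles_in_region = ["bx2-host-152x608", "cx2-host-152x304"]

print(prefix_available("bx2-4x16", profiles_in_region))  # True
print(prefix_available("mx2-2x16", profiles_in_region))  # False
```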

Why do I see resource errors due to authentication or timeout issues?

You are receiving the following error messages during the creation of any specific resource:

Error: An error occurred while performing the 'authenticate' step: Post "https://iam.cloud.ibm.com/identity/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Error: timeout while waiting for state to become 'done, ' (last state: 'provisioning', timeout: 10m0s)

While Schematics deploys the infrastructure resources, it authenticates with IBM Cloud through API calls. If there are too many requests through the API to the cloud environment, Schematics will not be able to authenticate and might error out with the authentication error.

To fix either issue (resource failing due to authentication error or the timeout error), destroy the resources from the Schematics workspace and retry deploying the resources.

Why does cluster creation fail with SSH issues?

You are receiving the following error messages when the Ansible provisioner tries to set up the Storage Scale function on the worker and storage node resources:

msg": "Failed to connect to the host via ssh, Connection closed by UNKNOWN port 65535", "unreachable": true}
Error: Failed to connect to the host via ssh: Connection timed out during banner exchange", "unreachable

While Schematics deploys the infrastructure resources, the automation code runs a few Ansible playbooks through the Ansible provisioner to set up the Storage Scale function on the virtual server instance nodes. When the Ansible provisioner tries to SSH to these nodes to set up the Storage Scale feature, the nodes go to an unreachable state.

To fix the issue, you can:

Try to destroy the resources from the workspace and deploy again.
If this issue is observed on all of the deployments, raise a support issue with the IBM Cloud support team to investigate if there is an infrastructure issue.
If there are no issues with the infrastructure, report this issue to the automation team who can investigate further.

Why am I receiving an error that the image is not found?

You are receiving the following error when you try to either generate or apply a plan to your workspace: Apply failed due to "Error: [ERROR] No image found with name hpcc-symp731-scale5151-rhel84-v1-4".

When you generate or apply a plan, Terraform checks whether the provided image name and its image ID are present in the image_map.tf file. If Terraform finds the correct image details, it provisions the instances. If the correct image details can't be found, Terraform tries to fetch the image details from IBM Cloud through data_source.

If the provided image is also not present in IBM Cloud in that specific region, you receive the error.

Check that the provided image name does not contain any spaces and that the image is present in the region where you want to deploy.

Why am I receiving a `cannot_start_capacity` error?

You are receiving the following error when you try to apply a plan to your workspace: Apply failed due to "code : cannot_start_capacity : message : Can't start instance because resource capacity is unavailable.

During the apply plan process, Terraform initiates the virtual server instance or bare metal server provisioning process based on the selected deployment values. If there is a resource capacity issue or a quota issue in the region where you are trying to deploy, the resources are not provisioned as expected.

Ask the administrator of the account to increase the quota for the specific region, or clean up all of the unwanted resources that are associated with the cloud infrastructure. Cleaning up unwanted resources might free up enough capacity for the deployment to proceed.

Why does the cluster fail with an IBM Customer Number error?

You are receiving the following error when you try to apply a plan to your workspace: Apply failed due to "ERROR - [CLOUD-DEPLOY] Provided IBM Customer Number is not entitled to use Spectrum Symphony on Cloud. Kindly contact IBM Support Team. Exiting!

During the apply plan process, the bootstrap node starts provisioning the resources for storage and compute cluster creation. During this process, the RPM- and GPFS-related packages and the Symphony packages need to be decrypted under the bring your own license (BYOL) model. If the IBM Customer Number is valid, the deployment begins. If not, the automation stops the deployment with an error.

Provide a valid IBM Customer Number that is entitled to Spectrum Symphony, and make sure that the number contains no spaces. If the value that you provided is valid and you still receive this error, contact IBM support to clarify the entitlement.

Why does my instance provisioning remain in `Starting` state?

After you apply a plan, the workspace takes a long time to provision the virtual server instance, and you notice in the user interface that the virtual server instance remains in Starting state.

During the apply plan process, Terraform initiates the provisioning process of the virtual server instance in the cloud infrastructure. If there is a capacity issue or any issue from the infrastructure side for that particular region and zone, you might see this issue.

You can try to manually create an instance in the user interface with the same image that is used during the automation process to see whether you get the same issue, or you can try a different zone for the deployment. You can also raise a support request for this issue to see whether it's originating from the infrastructure side.

Why am I getting errors when I try to delete resources after a bare metal server fails to provision?

After a bare metal server fails to provision and you try to delete resources from the bootstrap node, you receive the following error after you run the mmcloudworkflows cluster destroy command: [ERROR] Error deleting security group target binding while deleting security group : The specified network interface is not attached to any other security groups.

If cluster provisioning fails, it's recommended to clean up all of the resources from the failed provision before you attempt to reprovision a cluster.

During the destroy process, the bootstrap node tries to clean up all of the resources that were created during the provisioning phase. If a bare metal server took longer than expected to provision and then failed, a subsequent cleanup can fail because the failed bare metal server is still attached to a security group that the destroy process is trying to delete.

Copy the cluster prefix name and complete the following steps:

Go to the security group and access the -storage-sg.
Go to the attached resources section for the security group.
Click the attached bare metal server and copy the ID of the server.

Run the following commands from the CLI to stop and delete the server:

ibmcloud is bare-metal-server-stop $bare_metal_server_id

ibmcloud is bare-metal-server-delete $bare_metal_server_id

Deletion of the bare metal server takes a few seconds. After the bare metal server is deleted, go to Schematics and destroy the resources.

What is the `subnet_not_in_address_prefix` error or `invalid CIDR format` error during VPC and subnet creation?

You are receiving the following error when you try to apply a plan to your workspace: Apply failed due to Error: [ERROR] Error while creating subnet. The specified CIDR does not fit in any of the address prefixes in the specified VPC. Make sure the subnet's CIDR is a subset of the CIDR of one of the address prefixes.

During the apply plan process, the workspace tries to create the VPC and subnet with the CIDR ranges that are specified in the deployment values. If the subnet CIDR falls outside every address prefix of the VPC, you get an error that the address is not in range.

Verify that the CIDR that's provided for subnet creation falls within an address prefix of the VPC. For example, if the VPC address prefix is 10.241.0.0/18, the subnet CIDR must fall within 10.241.0.0 to 10.241.63.255 (for example, 10.241.0.0/24). If you use a different IP address range for the VPC, divide that range into subnets and choose the subnet CIDR from within it.
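The containment rule can be checked before you apply a plan. The following is a minimal sketch that uses the ipaddress module from the Python standard library; the function name subnet_fits_vpc is hypothetical, and the sketch assumes IPv4 CIDR values.

```python
import ipaddress


def subnet_fits_vpc(subnet_cidr: str, vpc_address_prefixes: list) -> bool:
    """True if the subnet CIDR is contained in any VPC address prefix.

    ip_network raises ValueError for a malformed CIDR, including one with
    host bits set (for example, 10.241.0.5/24).
    """
    subnet = ipaddress.ip_network(subnet_cidr)
    return any(
        subnet.subnet_of(ipaddress.ip_network(prefix))
        for prefix in vpc_address_prefixes
    )


vpc_prefixes = ["10.241.0.0/18"]  # covers 10.241.0.0 to 10.241.63.255

print(subnet_fits_vpc("10.241.0.0/24", vpc_prefixes))   # True
print(subnet_fits_vpc("10.241.64.0/24", vpc_prefixes))  # False
```

The second call returns False even though the address starts with 10.241, because a /18 prefix spans only the third-octet values 0 through 63.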

Why does the deployment fail with a "process exited with status 2" error message?

You are receiving the following error when you try to apply a plan to your workspace: Apply failed due to Error: Error: remote-exec provisioner error and error executing "/tmp/terraform_1756078506.sh": Process exited with status 2.

The solution includes logic to evaluate whether the provided license number is valid. This logic is built into the custom image, so on every deployment it evaluates the license and decrypts the packages that are required for the product deployment to complete. If you use an image that doesn't have this functionality, the automation code fails with this error.

Use the appropriate custom image that is provided by the solution team, which has the function of evaluating the license and decrypting the packages.

Why is IBM Cloud Schematics not able to provision the cluster and fails with an `Enabling the Custom resolver` error?

While Schematics tries to create the VPC resources, it attempts to create a custom resolver but fails with the following error: [ERROR] Error Enabling the Custom resolver : MaxTimeout

Terraform attempts to create the custom resolver environment and waits for the custom resolver state to reach an active state. During this process, if the custom resolver takes more time than expected, Terraform throws the error message.

After a failed deployment, clean up all resources. During a subsequent attempt, use a new cluster prefix to avoid any name collisions with resources from the previous failed attempt. If the issue continues to occur, open an issue with IBM Cloud Support.

Why is IBM Cloud Schematics not able to provision the cluster and fails with a passwordless SSH error?

When spectrum_scale_enabled is set to true, after Schematics creates all the resources, the solution triggers the Ansible code to configure Scale on the storage bare metal servers. During the Ansible configuration, the following error occurs: [ERROR] Check passwordless SSH on all scale inventory hosts (1 retries left)

After all the infrastructure-related resources are up and running, the Ansible code tries to perform the Scale configuration through a passwordless SSH method. During this process, on the storage bare metal server, if the SSH service is not in a running state, then Ansible can't SSH to that specific bare metal storage node and it fails with the error.

Why is IBM Cloud Schematics not able to provision the cluster and fails with an `nmmcrcluster` error?

When spectrum_scale_enabled is set to true, after Schematics creates all the resources, the solution triggers the Ansible code to configure Scale on the storage bare metal servers. During the Ansible configuration, the following error occurs: [ERROR] nmmcrcluster: Error found while checking node descriptor

After all the infrastructure-related resources are up and running, the Ansible code tries to perform the Scale configuration through a passwordless SSH method. If the configuration of one of the storage bare metal servers was not cleaned up properly by a previous destroy command, this error can occur.