Risk mitigation and rollback strategies
The following information covers common difficulties that you might encounter after a migration and offers tips to correct the issues.
Common migration risks
The following information covers common difficulties that you might encounter after a migration.
Risk: Network connectivity failure
Symptom: After migration, the virtual server can't communicate with other systems.
Detection: Post-migration connectivity tests fail.
Fix:
- Access through VNC console to troubleshoot
- Verify VNI configuration (IP address, security groups)
- Check routing tables in VPC and Transit Gateway
- Review security group rules (use VPC flow logs to see dropped packets)
Rollback: Restart VMware virtual servers and update DNS and load balancers to point back to VMware.
Prevention:
- Test Transit Gateway connectivity thoroughly before migration
- Verify that the security group rules allow required traffic
- Test DNS resolution in the target VPC
Risk: The application doesn't start
Symptom: Application service doesn't start after migration, or starts but doesn't function.
Detection: The service fails to start, or starts but fails health checks.
Fix:
- Check application logs for errors
- Verify configuration files
- Check environment variables
- Verify database connectivity
- Check for licensing issues
Rollback: Stop the application in VPC, restart in a VMware environment.
Prevention:
- Verify that all dependencies are migrated or accessible through Transit Gateway.
- Test application startup procedures that are in the pilot wave.
- Document application-specific configuration that you might need to adjust.
Risk: Data corruption
Symptom: Corrupted data, file system errors, or application data inconsistencies.
Detection: File system check errors, application reports data errors, database doesn't start
Fix:
- Attempt a file system repair
- If repair fails, retry migration from the source virtual server
- Verify that the source virtual server was cleanly shut down
Rollback: Discard the corrupted VPC virtual server, restart the source virtual server, and investigate the root cause before you retry the migration.
Prevention:
- Cleanly shut down virtual servers before migration
- Verify transfers with checksums, where possible
- Use
blockdev --flushbufsbefore detaching volumes
Risk: Performance degradation
Symptom: Application performs worse in VPC than in VMware.
Detection: Increased response times and decreased throughput
Fix:
- Check CPU, memory, disk I/O, network bandwidth metrics
- Verify that the storage profile has adequate IOPS
- Verify that the instance profile has adequate network bandwidth
- Check for application configuration issues
- Consider upgrading instance or storage profiles
Rollback: If critical, failback to VMware while you investigate performance.
Prevention:
- Baseline performance in VMware before migration
- Select appropriate instance and storage profiles that are based on baselines
- Enable pooled storage bandwidth allocation
Rollback strategy design
Use the following criteria to determine whether you need to roll back.
- More than 20% of virtual servers in a wave fails to start
- Critical applications fail functional testing
- Corrupted data is found in migrated virtual servers
- Performance degradation greater than 50% from baseline with no quick fix
- Security group configuration errors expose sensitive services
Rollback decision authority:
- Define who can make the rollback decision
- Define escalation path if decision-makers disagree
- Timebox rollback decision
Rollback procedures
The following section explains the rollback procedure phases.
Using the Phase 1 rollback procedure
Before the virtual servers are created during the migration, follow these steps to use the Phase 1 rollback procedure. This process takes ~30 minutes.
- Stop the migration.
- Discard worker virtual servers and volumes.
- Restart VMware virtual servers.
- Update status communications.
Using the Phase 2 rollback procedure
After virtual servers created during the migration, but before DNS cutover, follow these steps to use the Phase 2 rollback procedure. This process takes ~1 hour.
- Stop and delete migrated virtual servers.
- Restart VMware virtual servers.
- Verify that VMware virtual servers are functional.
- Update status communications.
Using the Phase 3 rollback procedure
After DNS cutover during the migration, data might change. Use the following steps to use the Phase 3 rollback procedure. Depending on data size, this process takes 2-4 hours.
- Stop virtual servers, but don't delete them.
- Update DNS and load balancers to point back to VMware.
- Restart VMware virtual servers.
- Data decision:
- If no data was changed, proceed with the rollback.
- If data changed in VPC, you must sync data back to VMware before you can roll back.
- Sync data, if needed.
- Verify that VMware virtual servers are functional.
- After verification period, delete VPC virtual servers
Source virtual server preservation
To help make sure that the source virtual servers are preserved, use the following information.
- Don't delete VMware virtual servers immediately after the migration.
- The retention period is 7-30 days, depending on your risk tolerance.
- Create VMware snapshots.
- Document snapshot locations and your retention policies.