Terraform State Migration: Achieving Zero Downtime Across Multi-Account AWS
In this post, I detail my experience migrating a critical order service at Trustedshops using Terraform, ensuring zero downtime while consolidating resources across AWS accounts.
# Terraform State Migration: Achieving Zero Downtime Across Multi-Account AWS
Migrating the order service at Trustedshops was both a challenge and a necessity. Processing millions of orders per day, this service is one of the most business-critical facets of the company. With the mandate of zero downtime, I faced a daunting task: not only did I need to migrate the infrastructure, but I also needed to ensure it was left in a maintainable state for the long term.
## The Starting Mess
At the outset, I was confronted with a single Terraform repository containing over 500 resources, spanning five AWS accounts. Two of these accounts were obsolete, which added unnecessary complexity to the migration. On top of that, I discovered significant state drift both within individual accounts and across the entire setup. The Terraform tooling was also broken, featuring outdated versions, mismatched providers, and plans that wouldn't complete cleanly.
## Why a Naive Recreation Was Off the Table
A straightforward approach to recreate the infrastructure in a new repository was not an option. The service could not tolerate any downtime, and tearing down the existing state risked corrupting configurations. I needed a strategy that would allow me to migrate resources without disrupting the live service.
## The State-Manipulation Strategy
I adopted a surgical, resource-by-resource migration strategy. My approach involved directly manipulating the Terraform state files to ensure a smooth transition without causing any interruptions. Here’s how I did it:
1. **Import Resources**: For each resource, I first used the `terraform import` command to bring the live AWS resource under management in the new repository. This ensured that the resource retained its identity, crucial for maintaining traffic.
```bash
terraform import aws_instance.example i-1234567890abcdef0
```
2. **Move Resources**: Next, I leveraged the `terraform state mv` command to move the resource from the old state backend to the new one. This command was instrumental in ensuring that the migration was seamless.
```bash
terraform state mv aws_instance.example aws_instance.new_example
```
3. **Validate States**: After each migration, I ran `terraform plan` to confirm that the new state matched the live resource. This step was crucial for ensuring that my actions did not result in any discrepancies.
By following this methodical approach, I maintained a consistent state throughout the migration, preventing any interruptions to the service.
## Restoring Environment Parity
Once production was stable in the new repository, I focused on rebuilding the two remaining non-production accounts (which were not obsolete). This step was vital for restoring proper environment parity between the production and lower environments, ensuring that all accounts operated under a coherent and unified structure.
## Final Numbers
The outcome of this migration was a resounding success:
- **Zero downtime** end to end.
- The previously scattered 500+ resources were now organized into one coherent codebase.
- I successfully reduced the number of AWS accounts from five to three by retiring the two obsolete accounts.
- The number of Terraform files decreased from around 100 to just 15.
- Lines of code were cut from approximately 2,500 down to around 1,500.
- The team now has a Terraform setup that is not only stable but also easy to evolve moving forward.
This migration journey has reinforced the importance of careful planning and execution when managing infrastructure as code, especially in a multi-account AWS environment. If you're facing similar challenges with Terraform, multi-account setups, or state migrations, I invite you to reach out via my portfolio at [tahayusufkomur.me](https://tahayusufkomur.me).