Terraform at Scale: Managing 40+ AWS Accounts

On a pharma project I managed Terraform across 40+ AWS accounts in a Digital-SDLC organization with SOC 2 compliance requirements. At that scale, Terraform stops being “write some HCL and run apply” and becomes a software engineering problem. Module design, state isolation, CI/CD pipelines, and access control all need deliberate architecture.

Module Taxonomy

I built and maintained 20+ reusable Terraform modules organized by domain:

Networking — VPC with standardized CIDR allocation, Transit Gateway attachments, Route 53 zones, NAT Gateway configuration
Compute — ECS Fargate services, Lambda functions with associated IAM roles, EC2 launch templates
Data — RDS (PostgreSQL, MySQL), S3 buckets with versioning and lifecycle policies, DynamoDB tables
Security — IAM roles and policies, KMS keys with cross-account grants, WAF web ACLs, Security Hub configuration
Observability — CloudWatch log groups, metric alarms, CloudTrail organization trails, dashboard templates
DR — AWS Backup plans with cross-account vaulting, replication configurations

Every module follows the same contract: consistent variable naming (environment, project, tags), well-defined outputs that downstream modules can reference, and auto-generated documentation.
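A minimal sketch of that contract, assuming a VPC module as the example. Only the three variable names come from the contract above; the resource, the CIDR block, the tag keys, and the vpc_id output are illustrative placeholders.

```hcl
# variables.tf: inputs every module accepts (names from the shared contract).
# The descriptions are what terraform-docs picks up for the generated docs.
variable "environment" {
  description = "Deployment environment, for example dev, staging, or prod."
  type        = string
}

variable "project" {
  description = "Project identifier used in resource names and tags."
  type        = string
}

variable "tags" {
  description = "Additional tags merged onto every resource the module creates."
  type        = map(string)
  default     = {}
}

# main.tf: hypothetical resource so the output below has something to reference.
locals {
  common_tags = merge(var.tags, {
    Environment = var.environment
    Project     = var.project
  })
}

resource "aws_vpc" "this" {
  cidr_block = "10.0.0.0/16" # placeholder CIDR, not the real allocation scheme
  tags       = local.common_tags
}

# outputs.tf: values downstream modules can reference.
output "vpc_id" {
  description = "ID of the VPC created by this module."
  value       = aws_vpc.this.id
}
```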

Module Design Principles

The biggest mistake I see in large Terraform codebases is modules that do too much. A VPC module shouldn’t also configure Transit Gateway attachments — those have different lifecycles and different teams responsible for them.

I keep modules focused on a single resource or a tightly coupled group. A VPC module creates the VPC, subnets, route tables, and NACLs. A Transit Gateway attachment module takes a VPC ID as input and handles the attachment. Composition happens at the root module level, not inside the modules themselves.

Versioning matters. Every module lives in its own repository with semantic versioning. Workload teams pin to specific versions and upgrade on their own schedule. A breaking change in the VPC module doesn’t force every team to update simultaneously.
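As a rough illustration, a workload's root module could compose two single-purpose modules and pin each to a release tag. The repository URLs, version tags, and the input and output names (subnet_ids, private_subnet_ids, transit_gateway_id) are hypothetical, not taken from the real modules.

```hcl
# Root module for one workload. Composition happens here, not inside the modules.
variable "transit_gateway_id" {
  description = "ID of the shared Transit Gateway (hypothetical input)."
  type        = string
}

module "vpc" {
  # Pinned to a tag; upgrading is an explicit change to this ref.
  source = "git::https://github.com/example-org/terraform-aws-vpc.git?ref=v2.3.1"

  environment = "prod"
  project     = "example-app"
  tags        = { Owner = "platform-team" }
}

module "tgw_attachment" {
  source = "git::https://github.com/example-org/terraform-aws-tgw-attachment.git?ref=v1.4.0"

  environment = "prod"
  project     = "example-app"
  tags        = { Owner = "platform-team" }

  # Wiring between modules goes through outputs, never through shared internals.
  vpc_id             = module.vpc.vpc_id
  subnet_ids         = module.vpc.private_subnet_ids
  transit_gateway_id = var.transit_gateway_id
}
```

With this layout, a breaking v3.0.0 of the VPC module only affects a workload once that workload changes its own ref.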

State Management

State files live in a centralized S3 bucket in the management account with DynamoDB locking. Each workload gets its own state key path — no shared state files, no cross-team lock contention.
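A backend block along these lines would give each workload its own key in the shared bucket. The bucket name, lock table name, and key layout are assumptions for illustration; only the S3-plus-DynamoDB arrangement comes from the setup described above.

```hcl
terraform {
  backend "s3" {
    bucket         = "example-org-terraform-state" # hypothetical bucket in the management account
    key            = "workloads/example-app/prod/terraform.tfstate" # one key path per workload
    region         = "eu-west-1"
    dynamodb_table = "terraform-locks" # hypothetical lock table
    encrypt        = true
  }
}
```

Backend blocks cannot reference variables, so the key is either written out per workload or supplied at init time with -backend-config.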

Cross-account deployment uses IAM roles. CI/CD pipelines assume a deployment role in the target account via OIDC federation. Interactive development uses Identity Center (SSO) with permission sets scoped to specific accounts.

The state bucket itself has versioning enabled and a lifecycle policy that retains noncurrent object versions for 90 days. When someone runs a bad apply, we can restore the previous state version without scrambling.
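A sketch of how the versioning and 90-day retention could be expressed with the AWS provider. The bucket and rule names are placeholders, and the real bucket presumably carries more configuration (encryption, access policies) than shown here.

```hcl
resource "aws_s3_bucket" "state" {
  bucket = "example-org-terraform-state" # hypothetical name
}

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  # Noncurrent-version rules only make sense once versioning is on.
  depends_on = [aws_s3_bucket_versioning.state]

  rule {
    id     = "expire-old-state-versions"
    status = "Enabled"

    filter {} # empty filter: apply to every object in the bucket

    noncurrent_version_expiration {
      noncurrent_days = 90 # superseded state versions are deleted after 90 days
    }
  }
}
```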

CI/CD: GitHub Actions with OIDC

No static AWS credentials in CI/CD. Every pipeline authenticates through OIDC federation with GitHub Actions:

name: Terraform Deploy
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

permissions:
  id-token: write
  contents: read

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          # The input is role-to-assume; <ACCOUNT_ID> stands in for the target account ID
          role-to-assume: arn:aws:iam::<ACCOUNT_ID>:role/GitHubActionsDeployment
          aws-region: eu-west-1

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.9.x

      - name: Terraform Init
        run: terraform init

      - name: Terraform Plan
        run: terraform plan -out=tfplan

      - name: Terraform Apply
        if: github.ref == 'refs/heads/main'
        run: terraform apply tfplan

The IAM trust policy restricts which repositories and branches can assume the role by matching the sub claim in the GitHub-issued token. A PR from a fork can’t trigger an apply against production. The OIDC thumbprint validation ensures only GitHub’s token service is trusted.
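A trust policy of that kind might look like the sketch below. The organization and repository names are placeholders, and the OIDC provider ARN is passed in as a variable because the provider is assumed to exist already in the account.

```hcl
variable "github_oidc_provider_arn" {
  description = "ARN of the IAM OIDC provider for token.actions.githubusercontent.com."
  type        = string
}

data "aws_iam_policy_document" "github_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.github_oidc_provider_arn]
    }

    # Token must have been issued for AWS STS.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only workflows running on main in this one repository (hypothetical name).
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:example-org/example-infra:ref:refs/heads/main"]
    }
  }
}

resource "aws_iam_role" "github_deploy" {
  name               = "GitHubActionsDeployment"
  assume_role_policy = data.aws_iam_policy_document.github_trust.json
}
```

One consequence of pinning sub to the main branch: tokens issued for pull_request runs carry a different sub value, so a plan step on PRs would need its own role, typically a read-only one, with a matching condition.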

Each account has its own deployment role with least-privilege permissions scoped to the resources Terraform manages. The networking account role can modify VPCs and Transit Gateway. It can’t touch RDS instances. The data account role is the inverse.
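To make the scoping concrete, here is an illustrative permission policy for the networking deployment role. The action list is a guess at what a VPC and Transit Gateway attachment module would need, not the actual policy, and a real one would likely narrow resources further.

```hcl
data "aws_iam_policy_document" "networking_deploy" {
  statement {
    sid    = "ManageVpcAndTransitGateway"
    effect = "Allow"

    actions = [
      "ec2:Describe*",
      "ec2:CreateVpc",
      "ec2:DeleteVpc",
      "ec2:ModifyVpcAttribute",
      "ec2:CreateSubnet",
      "ec2:DeleteSubnet",
      "ec2:CreateRouteTable",
      "ec2:DeleteRouteTable",
      "ec2:CreateRoute",
      "ec2:DeleteRoute",
      "ec2:CreateTransitGatewayVpcAttachment",
      "ec2:ModifyTransitGatewayVpcAttachment",
      "ec2:DeleteTransitGatewayVpcAttachment",
      "ec2:CreateTags",
    ]

    resources = ["*"]
  }
}

resource "aws_iam_policy" "networking_deploy" {
  name   = "networking-deploy" # hypothetical policy name
  policy = data.aws_iam_policy_document.networking_deploy.json
}
```

Nothing here grants rds actions, so any attempt to touch an RDS instance with this role is denied by default.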

Pre-Commit Hooks

Every Terraform repository runs pre-commit hooks:

terraform fmt — consistent formatting, no style debates in reviews
terraform validate — catches syntax errors and missing provider configurations
terraform-docs — regenerates module documentation from variables, outputs, and descriptions
tflint — provider-specific linting (deprecated arguments, invalid instance types)
checkov — static security analysis (S3 buckets without encryption, security groups with 0.0.0.0/0)

These hooks run locally before every commit and again in CI as a gate. The combination catches the vast majority of issues before a human reviewer even looks at the PR.

The AFT Pipeline

New accounts flow through Account Factory for Terraform. The process:

A team submits a PR with an account request (email, name, OU, compliance tags)
PR gets reviewed and merged
AFT provisions the account through Control Tower
Account customizations run automatically — VPC creation, Transit Gateway attachment, CloudTrail configuration, Security Hub enablement, IAM baseline roles
The account appears in Identity Center with the appropriate permission sets

From PR to usable account takes about 30 minutes. The customizations are themselves Terraform modules — the same ones used for day-2 operations. No separate “account setup” scripts that drift from the main codebase.
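For reference, an account request in the PR could look roughly like this. It follows the layout of the AFT sample account-request repository as I understand it; every value (emails, names, OU, tags, customization name) is a placeholder.

```hcl
module "example_workload_account" {
  source = "./modules/aft-account-request"

  control_tower_parameters = {
    AccountEmail              = "aws-example-workload@example.com"
    AccountName               = "example-workload-prod"
    ManagedOrganizationalUnit = "Workloads"
    SSOUserEmail              = "aws-example-workload@example.com"
    SSOUserFirstName          = "Example"
    SSOUserLastName           = "Workload"
  }

  # Compliance tags from the request end up on the account itself.
  account_tags = {
    Compliance  = "soc2"
    Environment = "prod"
  }

  change_management_parameters = {
    change_requested_by = "example-team"
    change_reason       = "New production workload account"
  }

  # Selects which set of account customizations runs after provisioning.
  account_customizations_name = "workload-baseline"
}
```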

What Makes It Work

The scale isn’t the hard part. 40 accounts with consistent patterns are easier to manage than 5 accounts with ad-hoc configurations. The discipline is what makes it work: versioned modules, isolated state, OIDC-only authentication, automated compliance checks, and a provisioning pipeline that enforces the same baseline everywhere.

Every shortcut — shared state files, inline resources instead of modules, manually provisioned accounts — becomes a liability at this scale. The upfront investment in structure pays for itself within the first quarter.