
πŸ—οΈ Terraform State Management & Design Mindset for Enterprise AWS Multi-Account Landing Zones

Β· 23 min read
CloudOps
CloudOps Engineer

When a team of 20 engineers concurrently runs terraform apply across 50 AWS accounts, state management stops being an operational concern and becomes a business risk. State corruption takes hours to diagnose, compliance audits fail when drift goes undetected, and the root cause is almost never the engineers β€” it is the absence of a principled architecture before the first line of Terraform is written.

This post combines three disciplines that belong together but are rarely addressed as a unified system: the design mindset that prevents state problems from occurring, the S3 native locking strategy that eliminates the DynamoDB tax (ADR-006, saving up to $9,000/year at 50 accounts), and a real production-ready IAM Identity Center module that demonstrates both principles working at enterprise scale in an AWS multi-account Landing Zone.

All code, configuration, and test artifacts referenced here are live in the terraform-aws framework β€” not theoretical examples, but verified, scored output (97/100 production-readiness) from a running ADLC-governed project.


πŸ”₯ Section 1 β€” The Problem: State at Scale​

The Reality of Concurrent Terraform Operations​

A single engineer running terraform apply on a laptop is a solved problem. Twenty engineers, five CI pipelines, and three environment tiers running Terraform concurrently across 50 AWS accounts is a distributed systems problem β€” and most organisations discover this only after their first production incident.

The failure modes are well-documented in the industry. According to practitioner surveys, approximately 60% of Terraform production incidents trace back to state management issues: stale locks, corrupted state files, drift between live infrastructure and recorded state, or provider version skew across environments. The business impact compounds quickly:

| Failure Mode | Typical Discovery | Business Impact |
|---|---|---|
| State corruption | `terraform apply` fails mid-run, partial resource creation | 2–8 hours of incident response; manual `terraform import` to rebuild state |
| Stale lock | Second engineer's apply blocks indefinitely | Lost engineering time; requires HITL escalation to force-unlock |
| Configuration drift | Console change not reflected in state | Compliance audit finding; manual reconciliation across hundreds of resources |
| Provider version skew | CI fails; works locally | "Works on my machine" → delayed release; reproducibility failure |
| Cross-account state collision | Two modules share the same S3 key | State overwrites; resource tracking lost for the entire account |

The cost of these failures is not just the hours spent debugging. It is the audit findings when a security team cannot produce evidence that a permission set was applied through a controlled change process. It is the FinOps review that shows $40,000 of untagged resources because a corrupted apply created resources without tags. It is the team velocity that drops 30% when engineers distrust their CI pipeline and start running applies manually.

The Hidden Cost of DynamoDB Locking

Worse still, most organisations are paying for the problem twice. The conventional DynamoDB locking pattern adds roughly $5–15 per month for the lock table in each account. At 50 accounts, that is $3,000–$9,000 per year of perpetual overhead — for infrastructure that Terraform 1.10+ no longer needs.

The solution is not a better incident runbook. The solution is a principled design approach that makes most of these failure modes structurally impossible.


🧠 Section 2 β€” Design Mindset: Six Principles for Enterprise Cloud​

A design mindset is not a philosophy. It is a set of structural decisions made before code is written that constrain how the system can fail. In the context of enterprise Terraform, six principles govern every file, directory, and configuration choice in the terraform-aws framework.

The Six Principles​

Six Principles Summary

| # | Principle | Definition | Concrete Project Example |
|---|---|---|---|
| 1 | Modularity | Independently versioned, testable units with clear boundaries | `modules/sso/` owns its own `variables.tf`, `outputs.tf`, `locals.tf`, `data.tf`, `tests/`, `examples/`, `VERSION`, and `.pre-commit-config.yaml` — fully self-contained |
| 2 | Abstraction | Hide implementation complexity behind typed, validated interfaces | `variables.tf` exposes typed `map(object({...}))` inputs; `locals.tf` flattens and transforms them into `for_each`-ready maps before any resource block sees them |
| 3 | Developer Experience (DX) | Under 5 minutes to onboard; zero host-tool dependencies; single-command pipelines | `task ci:quick` = validate + lint + legal in under 60 seconds; `_exec` auto-starts the `terraform-aws-dev` container; 28 tasks across 8 ADLC phases |
| 4 | Iterative Design | Ship → Test → Learn loops with fast feedback at every stage | 3-tier testing: snapshot (2–3 s, $0) → LocalStack (30–60 s, $0) → real AWS (5–10 min, ~$5–50); ADR process for every architecture decision (ADR-001 through ADR-007) |
| 5 | Cost-Awareness | FinOps-first — quantify cost before provisioning, eliminate idle spend | FOCUS 1.2+ 4-tier tag taxonomy in `global/global_variables.tf`; `task plan:cost` per module; ADR-006 eliminates DynamoDB (~$5–15/table/account) |
| 6 | Separation of Concerns | Each layer and directory owns exactly one responsibility | `modules/` = reusable logic; `projects/` = account-level compositions; `global/` = shared tag conventions; `examples/` = documentation-by-example; `tests/` = quality assurance |
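Principle 2 in practice: below is a minimal sketch of a typed, validated input in `variables.tf`. The field names, defaults, and validation rule are illustrative — not the module's actual 21-field schema:

```hcl
# variables.tf — illustrative sketch of a typed, validated interface
variable "sso_groups" {
  description = "SSO groups to create (illustrative subset of the real schema)"
  type = map(object({
    group_name        = string
    group_description = optional(string, "Managed by Terraform")
  }))

  validation {
    condition     = alltrue([for g in values(var.sso_groups) : length(g.group_name) > 0])
    error_message = "group_name must be non-empty for every group."
  }
}
```

`optional()` with a default (Terraform 1.3+) keeps caller configurations minimal while the `validation` block rejects bad input at plan time, before any cloud API is touched.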

The Abstraction Layers Diagram​

The most impactful structural decision in the framework is the strict separation between the Interface Layer (variables.tf), the Transform Layer (locals.tf), and the Resource Layer (main.tf). This is not stylistic β€” it is an enforced data flow contract.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ examples/ and projects/ β”‚
β”‚ ───────────────────────── β”‚
β”‚ Compositions (user-facing entry points) β”‚
β”‚ Consume modules, bind variables, wire outputs to downstream β”‚
β”‚ e.g. examples/create-users-and-groups/main.tf β”‚
β”‚ projects/sso/main.tf β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ variables.tf β”‚
β”‚ ───────────── β”‚
β”‚ Interface Layer β€” typed, validated inputs β”‚
β”‚ map(object({...})) with optional() and validation{} blocks β”‚
β”‚ e.g. var.sso_users (21 fields), var.permission_sets (type = any) β”‚
β”‚ var.account_assignments (principal + permission + account) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ locals.tf β”‚
β”‚ ───────── β”‚
β”‚ Transform Layer β€” business logic, no cloud API calls β”‚
β”‚ flatten(), for expressions, format() key generation β”‚
β”‚ e.g. flatten_user_data, users_and_their_groups β”‚
β”‚ principals_and_their_account_assignments β”‚
β”‚ pset_aws_managed_policy_maps β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ main.tf β”‚
β”‚ ─────── β”‚
β”‚ Resource Layer β€” cloud API calls only (no logic) β”‚
β”‚ aws_ssoadmin_permission_set, aws_identitystore_user β”‚
β”‚ aws_ssoadmin_account_assignment, aws_ssoadmin_application β”‚
β”‚ Iterates locals.*, never variables.* directly β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ data.tf β”‚
β”‚ ─────── β”‚
β”‚ Data Source Layer β€” read-only lookups, no mutation β”‚
β”‚ aws_ssoadmin_instances (SSO instance ARN + store ID) β”‚
β”‚ aws_organizations_organization (account ID resolution) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ outputs.tf β”‚
β”‚ ────────── β”‚
β”‚ Output Layer β€” typed contract for downstream modules β”‚
β”‚ Exposes ARNs, IDs, maps for composition β”‚
β”‚ Enables Registry publication and module chaining (ADR-007) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
The Single Most Important Rule

locals.tf is the only layer permitted to transform var.* inputs into structures consumed by main.tf. Resources never reference var.* directly. This enforces a strict, auditable data flow:

variables β†’ locals β†’ resources β†’ outputs

This rule is not enforced by Terraform itself β€” it is enforced by the team's discipline and code review. When broken, debugging becomes exponentially harder because business logic bleeds into resource declarations.
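A compact sketch of the rule end to end. The variable, local, and resource names here are generic illustrations (the framework's own names appear in Section 5), but the flow is the contract being described:

```hcl
# variables.tf — Interface Layer: typed input, no logic
variable "teams" {
  type = map(object({ environment = string }))
}

# locals.tf — Transform Layer: the only place var.* is read
locals {
  team_parameters = {
    for name, team in var.teams :
    format("%s-%s", name, team.environment) => team
  }
}

# main.tf — Resource Layer: iterates locals.*, never var.* directly
resource "aws_ssm_parameter" "team" {
  for_each = local.team_parameters
  name     = "/teams/${each.key}"
  type     = "String"
  value    = each.value.environment
}
```

If the key format ever needs to change, the edit lives in one `locals` expression; no resource block is touched.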


πŸ”’ Section 3 β€” State Management Strategy: ADR-006​

Why DynamoDB Locking Is the Wrong Default in 2026​

DynamoDB state locking was a reasonable solution when it was introduced. In 2026, with Terraform 1.10+ supporting S3 Conditional Writes natively, it is an unnecessary dependency that adds cost, operational complexity, and IAM surface area to every account in your organisation.

ADR-006 documents the decision to eliminate DynamoDB locking entirely across the terraform-aws framework.

DynamoDB vs S3 Native Locking: 11-Factor Comparison​

backend.hcl β€” S3 Native Locking
bucket       = "ams-terraform-org-state"
use_lockfile = true
encrypt = true
| Evaluation Factor | DynamoDB Locking (Legacy) | S3 Native Locking (ADR-006, Selected) |
|---|---|---|
| Terraform version required | Any | >= 1.10 (S3 Conditional Writes API) |
| Monthly cost per account | ~$5–15 (table + read/write units) | $0 — included in existing S3 API pricing |
| Setup complexity | 3 steps: create table + add IAM policy + `dynamodb_table =` in backend | 1 step: `use_lockfile = true` in backend config |
| Additional IAM permissions | `dynamodb:GetItem`, `PutItem`, `DeleteItem`, `DescribeTable` | None — uses existing S3 IAM principal |
| Lock mechanism | DynamoDB conditional writes (atomic `PutItem` with condition expression) | S3 `If-None-Match: *` header (Conditional Writes, atomic) |
| Lock file artifact | None — lock state held in a DynamoDB row | `terraform.tfstate.tflock` object stored in S3 alongside state |
| Cross-account architecture | DynamoDB table per account OR single shared table (complex cross-account IAM) | Single S3 bucket, key-path isolation — simpler |
| Failure mode (stale lock) | Stale row in DynamoDB → `terraform force-unlock <LOCK_ID>` | Stale lock object in S3 → same `terraform force-unlock` |
| State bucket requirement | S3 (state) + DynamoDB (lock) — two services to manage | S3 only — single service boundary |
| Terraform 2.0 direction | Deprecated — will be removed | Native, documented recommended path |
| Decision | REJECTED — cost, operational complexity, legacy trajectory | SELECTED — zero cost, simpler IAM, future-proof |

πŸ’° Cost Savings at Scale​

The financial case for migration is unambiguous. The savings grow linearly with organisational scale:

| Environment Count | DynamoDB Annual Cost | S3 Native Annual Cost | Annual Saving |
|---|---|---|---|
| 5 accounts | ~$300–900 | $0 | ~$300–900 |
| 20 accounts | ~$1,200–3,600 | $0 | ~$1,200–3,600 |
| 50 accounts | ~$3,000–9,000 | $0 | ~$3,000–9,000 |

The migration effort is one configuration line change per module. The payback period is immediate.
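In backend terms, the migration looks roughly like this — delete the `dynamodb_table` argument, add `use_lockfile`. The bucket, key, and table names below are illustrative:

```hcl
# Before — DynamoDB locking (legacy, pre-1.10)
terraform {
  backend "s3" {
    bucket         = "ams-terraform-org-state"
    key            = "tf-org-aws/123456789012/identity-center/terraform.tfstate"
    region         = "ap-southeast-2"
    encrypt        = true
    dynamodb_table = "terraform-lock" # remove this line (and the table, and its IAM policy)
  }
}

# After — S3 native locking (Terraform >= 1.10)
terraform {
  backend "s3" {
    bucket       = "ams-terraform-org-state"
    key          = "tf-org-aws/123456789012/identity-center/terraform.tfstate"
    region       = "ap-southeast-2"
    encrypt      = true
    use_lockfile = true # one line replaces DynamoDB entirely
  }
}
```

After editing the backend block, re-run `terraform init -reconfigure` so Terraform picks up the new locking configuration.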

Backend Configuration (Source of Truth)​

backend.hcl.example is the single configuration artifact that bootstraps state for any module in any account:

backend.hcl.example

```hcl
# Multi-Org S3 Backend Configuration
# Usage: terraform init -backend-config=backend.hcl
#
# Copy and customize per account:
#   cp backend.hcl.example backend.hcl
#   # Edit values below, then:
#   terraform init -backend-config=backend.hcl

bucket       = "ams-terraform-org-state"
region       = "ap-southeast-2"
use_lockfile = true

# Path pattern: tf-org-aws/<account-id>/<module>/terraform.tfstate
# Example for identity-center in account 123456789012:
#   key = "tf-org-aws/123456789012/identity-center/terraform.tfstate"
key = "tf-org-aws/<ACCOUNT_ID>/<MODULE_NAME>/terraform.tfstate"

# State bucket lives in management account
# Ensure cross-account access policy exists on this bucket
encrypt = true
```

Backend initialisation workflow

```shell
# Initialise for a new account and module
cp backend.hcl.example backend.hcl
sed -i 's/<ACCOUNT_ID>/123456789012/' backend.hcl
sed -i 's/<MODULE_NAME>/identity-center/' backend.hcl
terraform init -backend-config=backend.hcl

# Verify state location
terraform state list
```
Key Parameter Rationale
  • use_lockfile = true β€” the single line that replaces DynamoDB everywhere; requires Terraform >= 1.10
  • region = "ap-southeast-2" β€” data sovereignty; all state remains in the Sydney region (APRA CPS 234 alignment)
  • encrypt = true β€” SSE-S3 minimum; SSE-KMS recommended for regulated workloads
  • key pattern β€” hierarchical isolation: org β†’ account β†’ module; prevents any key collision

Team Collaboration Safeguards​

Nine layers of defence prevent the most common team collaboration failures:

| Safeguard | Implementation | Failure Mode Prevented |
|---|---|---|
| S3 Conditional Writes | `use_lockfile = true` in `backend.hcl.example` | Two engineers run `terraform apply` simultaneously — second blocks, not corrupts |
| Provider lock file in VCS | `.terraform.lock.hcl` committed to repo and in `tests/snapshot/` | Provider silently upgrades between CI runs — pinned hashes prevent version skew |
| Encryption at rest | `encrypt = true` → SSE-S3 minimum | State contains ARNs, IDs, and sensitive values — encrypted at rest |
| RBAC via IAM roles | Per-account IAM roles scoped to `s3:GetObject`/`PutObject` on `tf-org-aws/<account-id>/*` | Engineers from account A cannot read or write state belonging to account B |
| S3 Bucket Versioning | Enabled on `ams-terraform-org-state` bucket | Accidental `terraform state rm` — recover prior state version |
| MFA Delete | Configured on state bucket (management account) | Malicious or accidental permanent deletion of state objects |
| Cross-account bucket policy | `s3:PutObject` allowed from member account IAM roles only with specific key prefix | Member accounts cannot read each other's state |
| Lock file verify gate | `task build:lock-verify` — blocks PR if `.terraform.lock.hcl` missing per module | Provider lock file absent — enforces committed lock before merge |
| Automated provider upgrade | `task build:lock-upgrade` + `provider-upgrade.yml` weekly PR | Provider drift goes undetected for weeks — automated weekly detection + PR |
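The RBAC safeguard can be sketched as an IAM policy document that scopes a member account's role to its own key prefix. This is a hedged illustration — the framework's actual policy names and statements are not shown in this post:

```hcl
# Illustrative per-account state-access policy (member account 222222222222)
data "aws_iam_policy_document" "state_access" {
  statement {
    sid       = "StateObjectAccess"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::ams-terraform-org-state/tf-org-aws/222222222222/*"]
  }

  statement {
    sid       = "StateBucketList"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::ams-terraform-org-state"]

    condition {
      test     = "StringLike"
      variable = "s3:prefix"
      values   = ["tf-org-aws/222222222222/*"]
    }
  }
}
```

Because the object-level statement only matches the account's own prefix, a role in account A receives `AccessDenied` on any key under account B — the isolation is enforced by IAM, not by convention.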

πŸ—‚οΈ Section 4 β€” Multi-Account State Isolation​

S3 Key Hierarchy: Five-Account Example​

The key path pattern tf-org-aws/<account-id>/<module>/terraform.tfstate provides deterministic, collision-free isolation across the entire AWS organisation. A five-account hierarchy looks like this:

```
s3://ams-terraform-org-state/
└── tf-org-aws/
    ├── 111111111111/                    # Management account (AWS Organizations root)
    │   ├── identity-center/
    │   │   ├── terraform.tfstate        # SSO: users, groups, permission sets
    │   │   └── terraform.tfstate.tflock # S3 native lock file (when held)
    │   └── organizations/
    │       └── terraform.tfstate        # AWS Organizations structure
    │
    ├── 222222222222/                    # Security account (audit, GuardDuty, Config)
    │   └── guardduty/
    │       └── terraform.tfstate
    │
    ├── 333333333333/                    # Operations account (shared services)
    │   ├── ecs-platform/
    │   │   └── terraform.tfstate
    │   └── networking/
    │       └── terraform.tfstate
    │
    ├── 444444444444/                    # Sandbox / development account
    │   └── ecs-platform/
    │       └── terraform.tfstate
    │
    └── 335083429030/                    # State bucket owner account
        └── state-bucket-bootstrap/
            └── terraform.tfstate
```

Isolation Guarantee Matrix​

Five independent isolation layers mean that no single failure can corrupt state across account boundaries:

| Isolation Layer | Mechanism | Guarantee |
|---|---|---|
| Organisation-level | S3 bucket prefix `tf-org-aws/` | All state contained within a single auditable bucket |
| Account-level | Key prefix `tf-org-aws/<account-id>/` | State from account A cannot overwrite account B |
| Module-level | Sub-prefix `<module>/terraform.tfstate` | Identity Center state never collides with ECS or networking state |
| Concurrency | `use_lockfile = true` → S3 `If-None-Match` Conditional Write | Two simultaneous `terraform apply` operations → one waits or fails, never corrupts |
| Data residency | `region = ap-southeast-2` | State does not leave the Sydney region (APRA CPS 234 Para 15 data sovereignty) |

πŸ›‘οΈ Section 5 β€” Production-Ready Module: IAM Identity Center​

How Design Mindset and State Management Converge​

The modules/sso/ module is where both disciplines become concrete. It is not a demonstration module β€” it is a production-ready, scored implementation (97/100, rising to 99/100 after Q5 legal compliance cleanup) that manages the full lifecycle of AWS SSO identity governance in a multi-account Landing Zone.

| Metric | Value |
|---|---|
| Production-readiness score | 97/100 (pre-cleanup) → 99/100 (post-cleanup) |
| Resource blocks | 17, covering users, groups, permission sets, account assignments, applications |
| Test files | 8 scenario-based `.tftest.hcl` files + cross-domain snapshot tests |
| Example configurations | 8, covering all major use cases |
| Outputs | 10 typed outputs (ARNs, IDs, names) for downstream module chaining |
| Upstream | `aws-ia/terraform-aws-sso` v1.0.4 (Apache-2.0) |

The global_variables.tf Tag Convention Layer​

Before a single resource is declared, the framework establishes a mandatory tag contract in global/global_variables.tf. This file is not imported by modules β€” Terraform does not support inter-module variable sharing β€” but it documents the shared convention that every composition (examples/, projects/) is expected to apply:

global/global_variables.tf

```hcl
# Copyright 2026 [email protected] (oceansoft.io). Licensed under Apache-2.0.
# Global conventions for terraform-aws module library (KISS/LEAN)
#
# Tag Taxonomy (4-tier):
#   Tier 1 — Mandatory:  Project, Environment, Owner, CostCenter, ManagedBy
#   Tier 2 — FinOps:     ServiceName, ServiceCategory (FOCUS 1.2+)
#   Tier 3 — Compliance: DataClassification, Compliance (APRA CPS 234)
#   Tier 4 — Ops:        Automation, BackupPolicy, GitRepo

variable "common_tags" {
  description = "Tags applied to all resources — 4-tier taxonomy for FOCUS 1.2+ FinOps and APRA CPS 234 compliance"
  type        = map(string)
  default = {
    # Tier 1 — Mandatory (enforced by AWS Organizations Tag Policy)
    Project     = "terraform-aws"
    Environment = "dev"
    Owner       = "[email protected]"
    CostCenter  = "platform"
    ManagedBy   = "Terraform"

    # Tier 2 — FinOps (FOCUS 1.2+ dimension mapping)
    # ServiceName and ServiceCategory set per-module in locals.tf merge

    # Tier 3 — Compliance (APRA CPS 234 Para 15)
    DataClassification = "internal"
    Compliance         = "none"

    # Tier 4 — Operational
    Automation   = "true"
    BackupPolicy = "default"
    GitRepo      = "terraform-aws"
  }
}
```

Per-Module Tag Merge Pattern​

Each module's locals.tf merges the global tag baseline with module-specific Tier 2 FinOps dimensions. This is the Separation of Concerns principle in action: the global layer owns the mandatory baseline; the module layer owns the service classification:

modules/sso/locals.tf

```hcl
locals {
  module_tags = merge(var.common_tags, {
    ServiceName     = "IAM Identity Center"
    ServiceCategory = "Security"
  })
}
```

Resources then reference local.module_tags β€” never var.common_tags directly. The merge happens exactly once, in locals.tf, consistent with the strict data flow rule.
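Downstream, a resource block consumes the merged map and nothing else. A hedged sketch — the module's actual resource labels may differ:

```hcl
# main.tf — resources reference local.module_tags, never var.common_tags
resource "aws_ssoadmin_permission_set" "this" {
  for_each     = local.permission_sets # assumed transform-layer map
  name         = each.key
  instance_arn = tolist(data.aws_ssoadmin_instances.this.arns)[0]
  tags         = local.module_tags # merged exactly once, in locals.tf
}
```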

The locals.tf Transform Layer in Practice​

The IAM Identity Center module's locals.tf demonstrates why the Transform Layer is indispensable. Managing user-to-group membership across an enterprise SSO configuration involves deeply nested data structures. The locals block flattens them into for_each-ready maps that resource blocks can iterate cleanly:

modules/sso/locals.tf

```hcl
locals {
  # Flatten nested user → group membership into a flat map
  flatten_user_data = flatten([
    for this_user in keys(var.sso_users) : [
      for group in var.sso_users[this_user].group_membership : {
        user_name  = var.sso_users[this_user].user_name
        group_name = group
      }
    ]
  ])

  # Build for_each-ready map with composite key
  users_and_their_groups = {
    for s in local.flatten_user_data :
    format("%s_%s", s.user_name, s.group_name) => s
  }
}
```

In main.tf, the aws_identitystore_group_membership resource iterates local.users_and_their_groups β€” no business logic, no variable references, pure iteration. This is what makes the resource layer auditable: every resource block is a declarative statement of intent, not a procedural algorithm.
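That iteration looks roughly like this — a sketch, since the module's actual resource labels may differ:

```hcl
# main.tf — pure iteration over the transform layer's output
resource "aws_identitystore_group_membership" "this" {
  for_each          = local.users_and_their_groups # keys like "alice_platform-admins"
  identity_store_id = tolist(data.aws_ssoadmin_instances.this.identity_store_ids)[0]
  group_id          = aws_identitystore_group.this[each.value.group_name].group_id
  member_id         = aws_identitystore_user.this[each.value.user_name].user_id
}
```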

Wrapper Pattern (ADR-007): Consume, Don't Copy​

The framework follows the Wrapper Pattern for upstream module consumption. Consumers reference the module via source and override only what they need β€” they never copy-paste the module internals. This is ADR-007:

projects/sso/main.tf

```hcl
module "iam_identity_center" {
  source = "nnthanh101/terraform-aws/aws//modules/sso"

  sso_users           = local.sso_users
  sso_groups          = local.sso_groups
  permission_sets     = local.permission_sets
  account_assignments = local.account_assignments

  tags = local.module_tags
}
```

When the upstream module releases a patch, the consumer updates a version constraint and runs terraform init. No internal files to merge, no logic to reconcile. This is the contract that makes modular Terraform sustainable at scale.

Test Coverage: 8 Scenarios​

The module ships with 8 .tftest.hcl test files covering every major deployment scenario. Tests run in Tier 1 (snapshot, zero credentials, 2–3 seconds) before any cloud API is called:

| Test File | Scenario |
|---|---|
| `01_mandatory.tftest.hcl` | Minimal valid configuration |
| `02_existing_users_and_groups.tftest.hcl` | Import and manage pre-existing SSO entities |
| `03_inline_policy.tftest.hcl` | Inline policy attachment to permission sets |
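A sketch of what one of these scenarios looks like in native `terraform test` syntax. The variable shape and assertion below are illustrative, not the module's actual test content:

```hcl
# tests/01_mandatory.tftest.hcl — illustrative sketch
run "minimal_valid_configuration" {
  command = plan # Tier 1: plan-only, no resources created

  variables {
    sso_groups = {
      admins = {
        group_name        = "admins"
        group_description = "Platform administrators"
      }
    }
  }

  assert {
    condition     = length(var.sso_groups) == 1
    error_message = "Expected exactly one group in the minimal configuration."
  }
}
```

Because `command = plan` never calls `apply`, the whole suite runs in seconds and costs nothing, which is what makes it viable as a pre-commit gate.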

⚑ Section 6 β€” Container-First Developer Experience​

Why "Works on My Machine" Is a Deployment Anti-Pattern​

The DX principle from Section 2 is made concrete through container-first execution. Every tool-dependent task β€” terraform fmt, tflint, checkov, trivy, infracost β€” runs inside the terraform-aws-dev container. No brew install, no pip install, no version negotiation.

The _exec helper in Taskfile.yml abstracts the execution context so that every task command works identically whether the engineer is on a MacBook, a Linux workstation, or a GitHub Actions runner:

Taskfile.yml β€” _exec detection logic
if [ -f /.dockerenv ]; then
eval '{{.CMD}}' # Already inside container
elif docker exec terraform-aws-dev echo "ok" >/dev/null 2>&1; then
docker exec -w /workspace terraform-aws-dev bash -c '{{.CMD}}' # Container running
else
docker compose up -d devcontainer # Auto-start container
docker exec -w /workspace terraform-aws-dev bash -c '{{.CMD}}'
fi

18 Tools, Zero Host Installs​

The nnthanh101/terraform:2.6.0 image bundles 18 pre-installed, pinned tools across all CI categories:

| Category | Tools |
|---|---|
| IaC | `terraform` (>= 1.11.0), `terragrunt`, `terraform-docs` |
| Linting | `tflint`, `checkov`, `trivy`, `tfsec` |
| Formatting | `terraform fmt`, `pre-commit` |
| Testing | `go` (Terratest), `terraform test` (native) |
| Cost | `infracost` |
| Security | `checkov`, `trivy`, `tfsec` |
| Utilities | `task`, `git`, `jq`, `yq`, `aws-cli` |
```
Developer Machine                  Docker Container (nnthanh101/terraform:2.6.0)
┌─────────────────────┐            ┌────────────────────────────────────────┐
│                     │            │                                        │
│ task ci:quick       │──_exec()──▶│ terraform fmt -check                   │
│ task build:validate │──_exec()──▶│ terraform validate                     │
│ task build:lint     │──_exec()──▶│ tflint (all modules)                   │
│ task build:lint     │──_exec()──▶│ checkov (security scan)                │
│ task test:tier1     │──_exec()──▶│ terraform test -verbose (snapshot)     │
│ task build:lock     │──_exec()──▶│ terraform providers lock (4 platforms) │
│ task plan:cost      │───────────▶│ infracost breakdown                    │
│                     │            │                                        │
└─────────────────────┘            └────────────────────────────────────────┘
  bare-metal: Docker + Task only     18 tools pre-installed + pinned
  zero brew/pip installs             deterministic across all environments
```

The result is an onboarding time under 5 minutes: docker compose up -d devcontainer && task ci:quick. A new team member has a validated, lint-clean, test-passing environment before their first PR.


πŸ›‘οΈ Section 7 β€” Drift Detection & Recovery​

Prevention Is the Primary Strategy​

Drift management follows a strict hierarchy: prevent first, detect second, recover only as a last resort. Most recovery scenarios are expensive β€” in time, in confidence, and occasionally in data. Prevention and detection together eliminate the majority of recovery work.

Recovery Scenario Matrix​

When prevention fails, the recovery path must be deterministic. The following table maps each failure scenario to a specific recovery procedure and realistic RTO:

Recovery Scenario Matrix (6 scenarios with RTO estimates)

| # | Scenario | Detection Signal | Recovery Steps | RTO |
|---|---|---|---|---|
| 1 | State locked (stale) | `Error: state locked` on `terraform plan` | 1. Verify no concurrent op running. 2. `terraform force-unlock <LOCK_ID>`. 3. Retry `terraform plan`. | < 5 min |
| 2 | State corrupted | `Error: couldn't decode state` on init | 1. List S3 object versions for state key. 2. Retrieve prior version. 3. Upload good version. 4. Re-run `terraform plan`. | 15–30 min |
| 3 | Drift detected (console change) | `terraform plan` shows unexpected diff | Option A (enforce): `terraform apply` to reconcile. Option B (import): `terraform import` if resource created outside Terraform. | 10–20 min |
| 4 | State lost entirely | `Error: no state found` | 1. Check S3 versioning for soft-deleted object. 2. If unrecoverable: `terraform import` each resource incrementally. | 60–240 min (HIGH) |
| 5 | Cross-account state conflict | Two modules share the same key path | 1. Verify key pattern uniqueness. 2. Rename conflicting key via `aws s3 mv`. 3. `terraform init -reconfigure`. | 20–40 min |
| 6 | Provider version drift in CI | CI fails: required providers not satisfied | 1. `task build:lock` locally (4 platforms). 2. Commit `.terraform.lock.hcl`. 3. Re-run CI. | < 10 min |
Scenario 4 β€” State Loss Is a HIGH Severity Event

State loss without S3 versioning enabled can require a complete terraform import rebuild β€” mapping every live cloud resource back to Terraform addresses. At 50 accounts with hundreds of resources each, this is a multi-day engineering effort. Enable S3 versioning and MFA Delete on your state bucket before your first terraform apply.
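A hedged sketch of that prevention baseline on the state bucket (resource labels are illustrative; MFA Delete itself must be toggled by the bucket owner's root credentials, outside Terraform):

```hcl
# Illustrative state-bucket bootstrap (lives in the management account)
resource "aws_s3_bucket" "tf_state" {
  bucket = "ams-terraform-org-state"
}

resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled"
    # MFA Delete is enabled separately by the bucket owner's root
    # credentials, e.g. via `aws s3api put-bucket-versioning --mfa ...`
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256" # SSE-S3 minimum; use aws:kms for regulated workloads
    }
  }
}
```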


βœ… Section 8 β€” Anti-Patterns and Results​

Six Anti-Patterns to Eliminate​

Anti-Pattern Reference Table (6 patterns with guards)

| Anti-Pattern | Symptom | Root Cause | Fix | Guard |
|---|---|---|---|---|
| Copy-paste modules | Duplicated `aws_ssoadmin_*` blocks across stacks | No wrapper pattern discipline | `source = "nnthanh101/terraform-aws/aws//modules/sso"` — inherit, don't copy | ADR-007 |
| Bare-metal tool installs | CI fails because tool version differs from local | No container discipline | All tool tasks run via `_exec` inside `terraform-aws-dev` | `_exec` in Taskfile.yml |
| `var.*` in resource blocks | `for_each = var.sso_users` in resource directly | Skipping locals transform layer | Transform in `locals.tf` first: `for_each = local.users_and_their_groups` | Code review + locals contract |
| DynamoDB state locking | `dynamodb_table = "terraform-lock"` in backend | Legacy pattern, pre-Terraform 1.10 | Replace with `use_lockfile = true` in backend.hcl | ADR-006; `grep -ri dynamodb *.hcl` |
| Completion claims without evidence | "Done" with no `tmp/` artifacts | NATO (No Action, Talk Only) | Every completion claim requires `tmp/terraform-aws/` evidence path | `task monitor:verify` |
| Missing `.terraform.lock.hcl` in VCS | Provider silently upgrades between CI runs | Lock file in `.gitignore` | Commit `.terraform.lock.hcl`; `task build:lock-verify` blocks PR if missing; `task build:lock-upgrade` automates upgrade; `provider-upgrade.yml` detects drift weekly | `task build:lock-verify` + `provider-upgrade.yml` |

Quantified Results​

The combination of design mindset, S3 native locking, container-first DX, and the production-ready IAM Identity Center module delivers measurable outcomes:

| Outcome | Metric | Mechanism |
|---|---|---|
| Cost savings | $3,000–$9,000/year at 50 accounts | DynamoDB elimination (ADR-006) |
| Onboarding time | Under 5 minutes | `_exec` container + `task ci:quick` |
| CI pipeline time | Under 60 seconds for `task ci:quick` | Containerised parallelism |
| Snapshot test speed | 2–3 seconds per test run | Native `terraform test` (zero credentials) |
| Module production-readiness | 97–99/100 scored | 8 test scenarios, 8 examples, 17 resource types |
| State corruption incidents | Zero (architectural prevention) | S3 Conditional Writes + key-path isolation |
| Provider skew incidents | Zero (lock file gated by CI) | `task build:lock-verify` PR gate |
| Compliance readiness | APRA CPS 234 Para 15/36/37 traceable | 4-tier tag taxonomy + data residency enforcement |

πŸš€ Call to Action​

The terraform-aws framework is available as a reference implementation demonstrating all patterns described in this post:

  • Terraform: >= 1.11.0 | AWS Provider: >= 6.28, < 7.0
  • Primary region: ap-southeast-2 | Identity Center: us-east-1
  • State bucket: ams-terraform-org-state with S3 native locking, versioning, and MFA Delete

Start with these three files and the framework will constrain failure modes before your first apply:

  1. backend.hcl.example β€” copy, set ACCOUNT_ID and MODULE_NAME, run terraform init
  2. global/global_variables.tf β€” adopt the 4-tier tag taxonomy as your organisation's standard
  3. modules/sso/ β€” consume via wrapper pattern for SSO identity governance

For the tagging strategy that underpins the common_tags convention used throughout the framework, see the companion post: Enterprise AWS Tagging Strategy: 4-Tier Taxonomy for FinOps & APRA CPS 234 Compliance.


CloudOps Engineering β€” OceanSoft Corporation | ap-southeast-2