troubleshooting-faq
sidebar_position: 5
title: Troubleshooting & FAQ
description: Practical troubleshooting guide for terraform-aws modules covering YAML config errors, tag validation, provider constraints, state locking, Identity Center IdP issues, CI pipeline failures, Checkov false positives, Infracost, terraform test, and debugging workflows
tags: [troubleshooting, FAQ, CI, checkov, infracost, identity-center, APRA, FOCUS, state-locking, terraform-test]
Troubleshooting & FAQ
This guide documents real error patterns encountered in the terraform-aws CI pipeline and during
module development. Each entry includes the exact error, root cause, and the fix. Start every
debugging session with task ci:quick to narrow the failure surface before diving into individual
sections.
1. YAML Config Errors: session_duration Must Be ISO 8601 PT Format
Symptom
terraform plan fails with a check block assertion or variable validation error:
Error: Invalid value for variable
session_duration must follow ISO 8601 format in hours or minutes (e.g., PT4H, PT8H, PT30M).
Or, when the bad value comes from a YAML file (bypassing variables.tf validation):
Warning: Check block assertion failed
check.yaml_session_duration_format
APRA CPS 234 / AC-3-10: The following permission sets (possibly from YAML) have a
session_duration that does not match ISO 8601 PT<n>H or PT<n>M format: ["BadDuration"].
Correct the YAML file or the HCL variable value.
Root Cause
AWS IAM Identity Center accepts only ISO 8601 duration strings. The module enforces ^PT[0-9]+[HM]$
in both variables.tf and the check.yaml_session_duration_format block in locals.tf. The check
block catches values that arrive through the YAML merge path (config_path) and bypass the HCL
variable validation block.
Common authoring mistakes:
| Wrong (will fail) | Correct |
|---|---|
| 4 hours | PT4H |
| 4h | PT4H |
| PT4 | PT4H |
| 240M | PT240M or PT4H |
| PT1.5H | PT90M (no decimals) |
| P1D | PT24H (days not accepted) |
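The pattern above can also be checked outside Terraform, for example in a pre-commit script that scans YAML values before a plan. The following is a minimal Python sketch, not the module's own code; the helper name `is_valid_session_duration` is hypothetical, and only the regular expression comes from this guide.

```python
import re

# Same shape the module enforces: the literal "PT", an integer, then H or M.
SESSION_DURATION_PATTERN = re.compile(r"PT[0-9]+[HM]")


def is_valid_session_duration(value: str) -> bool:
    """Return True when value is exactly PT<integer>H or PT<integer>M."""
    # fullmatch anchors both ends, so extra text or a trailing newline is rejected.
    return SESSION_DURATION_PATTERN.fullmatch(value) is not None
```

Every entry in the "Wrong" column of the table is rejected by this function, and every entry in the "Correct" column is accepted.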
Fix
- Open permission_sets.yaml (or your HCL permission_sets variable).
- Find all session_duration values.
- Replace with PT<integer>H (hours) or PT<integer>M (minutes).
- APRA CPS 234 Para 37: Administrative permission sets must not exceed PT1H. Standard permission sets must not exceed PT8H. See custom check CKV_APRA_003 and CKV_APRA_004.
- Re-run: task test:tier1
Test Coverage
modules/iam-identity-center/tests/snapshot/yaml_validation_test.tftest.hcl covers this failure
path via invalid_session_duration_fails_check and valid_session_duration_minutes_passes runs.
2. Tag Validation Failures: Required Keys Missing
Symptom A — variables.tf validation (HCL path)
Error: Invalid value for variable
on variables.tf line 190, in variable "default_tags":
When default_tags is provided, it must include:
CostCenter, Project, Environment, DataClassification.
Symptom B — check block (YAML path)
Warning: Check block assertion failed
check.yaml_required_tag_keys
APRA CPS 234 / AC-3-10: The following permission sets (possibly from YAML) have a tags map
that is missing required keys CostCenter and/or DataClassification: ["MissingCostCenterTag"].
Symptom C — Checkov custom check
Check: CKV_CUSTOM_FOCUS_001: "Ensure cost allocation tags are present for FOCUS 1.2+ exports"
FAILED for resource: aws_ssoadmin_permission_set.permission_set["ReadOnly"]
Missing cost allocation tags: CostCenter, Project, Environment, ServiceName
Root Cause
The module enforces two overlapping tag sets:
| Tag Key | Required By | Checked In |
|---|---|---|
| CostCenter | FOCUS 1.2+ + APRA Para 15 | variables.tf, check.yaml_required_tag_keys, CKV_CUSTOM_FOCUS_001 |
| Project | FOCUS 1.2+ | variables.tf, CKV_CUSTOM_FOCUS_001 |
| Environment | FOCUS 1.2+ | variables.tf, CKV_CUSTOM_FOCUS_001 |
| ServiceName | FOCUS 1.2+ | CKV_CUSTOM_FOCUS_001 |
| DataClassification | APRA CPS 234 Para 15 | variables.tf, check.yaml_required_tag_keys, CKV_APRA_001 |
When default_tags = {} (empty map), the variables.tf validation passes (empty is explicitly
allowed). However, per-permission-set tags maps are merged with the consumer-supplied
default_tags (not with _effective_default_tags, which contains hardcoded fallbacks for
Checkov). If neither default_tags nor the per-pset tags supplies CostCenter and
DataClassification, the check block fires.
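The merge-then-check behaviour described above can be reasoned about with a small model. This Python sketch is illustrative only: `missing_required_tags` and `REQUIRED_PSET_TAG_KEYS` are hypothetical names, and the dictionary merge stands in for Terraform's merge() precedence (later maps win). It is not the logic in locals.tf.

```python
# Keys the YAML-path check block looks for, per the table above.
REQUIRED_PSET_TAG_KEYS = ("CostCenter", "DataClassification")


def missing_required_tags(default_tags: dict, pset_tags: dict) -> list:
    """Return the required keys still absent after merging both tag maps."""
    # Per-permission-set tags override consumer-supplied defaults.
    effective = {**default_tags, **pset_tags}
    return [key for key in REQUIRED_PSET_TAG_KEYS if key not in effective]
```

With an empty default_tags map and a permission set that only supplies Project, both required keys come back as missing, which corresponds to the case where the check block fires.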
Fix
Set default_tags in the calling module:
module "iam_identity_center" {
  source = "oceansoft/terraform-aws/aws//modules/iam-identity-center"
  default_tags = {
    CostCenter         = "platform"
    Project            = "landing-zone"
    Environment        = "production"
    ServiceName        = "sso"
    DataClassification = "internal"
  }
}
Valid DataClassification values: public, internal, confidential, restricted.
Test Coverage
yaml_validation_test.tftest.hcl tests 5 and 6 cover missing CostCenter and
DataClassification respectively. Test 7 confirms a complete tag map passes.
Reference: ADR-008
3. Provider Version Conflicts
Symptom
Error: Unsatisfied requirements
Provider registry.terraform.io/hashicorp/aws version ">= 6.28.0, < 7.0.0" is required.
lock file contains hashicorp/aws 5.88.0 which does not satisfy ">= 6.28.0"
Or:
Error: Unsatisfied requirements
Terraform >= 1.11.0 is required by this module.
Current version: 1.9.5.
Root Cause
The module enforces strict provider constraints per ADR-003:
terraform {
  required_version = ">= 1.11.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 6.28, < 7.0"
    }
  }
}
AWS provider 6.x introduced breaking changes to aws_ssoadmin_* resource schemas from 5.x.
Terraform 1.11.0 is required for use_lockfile = true S3 native state locking (ADR-006) and
expect_failures in .tftest.hcl (requires >= 1.8, but 1.11 for full feature parity).
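To see why a 5.x entry in the lock file is rejected, it can help to evaluate the constraint by hand. The sketch below is a simplified Python model of the ">= 6.28.0, < 7.0.0" range only; the function names are hypothetical, and it deliberately ignores pre-release suffixes and the rest of Terraform's constraint syntax.

```python
def parse_version(text: str) -> tuple:
    """Turn '6.28' or '6.28.0' into (6, 28, 0) so versions compare numerically."""
    parts = [int(part) for part in text.split(".")]
    # Pad short versions so '6.28' and '6.28.0' compare as equal.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def satisfies_aws_constraint(installed: str) -> bool:
    """Check an installed provider version against '>= 6.28.0, < 7.0.0'."""
    version = parse_version(installed)
    return parse_version("6.28.0") <= version < parse_version("7.0.0")
```

The locked 5.88.0 from the error message fails the lower bound, so Terraform refuses to proceed until the lock file is regenerated.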
Fix
- Check your installed Terraform version: terraform version
- The devcontainer (nnthanh101/terraform:2.6.0) has the correct version pre-installed. Run task build:env to start the container and execute all commands inside it.
- If the lock file is stale (contains 5.x provider hash), regenerate it:
  cd modules/iam-identity-center
  rm .terraform.lock.hcl
  terraform init -upgrade
- Regenerate the lock file for all supported platforms: task build:lock
  This locks for linux_amd64, linux_arm64, darwin_amd64, darwin_arm64.
- Verify lock file exists: task build:lock-verify
Do not run terraform init on bare metal. Use the devcontainer to ensure reproducible
provider resolution. See ADR-003.
4. State Lock Errors: S3 Native Locking
Symptom
Error: Error acquiring the state lock
Lock Info:
ID: 20260215-083412-abc123
Path: s3://my-tfstate-bucket/terraform-aws/iam-identity-center/terraform.tfstate
Operation: OperationTypePlan
Who: github-actions@runner-1234
Created: 2026-02-15 08:34:12 UTC
Terraform acquires a state lock to protect the state from being written
by multiple users at the same time.
Root Cause
The project uses S3 native state locking (use_lockfile = true) per ADR-006, so no
DynamoDB table is required. The lock file is stored alongside the state file as
terraform.tfstate.tflock in the same S3 prefix. A stale lock occurs when a previous Terraform
process was interrupted (GitHub Actions timeout, SIGKILL, network failure) without releasing the
lock.
Fix
- Verify the lock is genuinely stale by checking GitHub Actions run history for the Lock ID.
- Inspect the S3 bucket for the lock file:
  aws s3 ls s3://my-tfstate-bucket/terraform-aws/iam-identity-center/ --profile <profile>
- Confirm no active plan or deployment is running in CI or locally.
- Force-unlock using the Lock ID from the error message:
  terraform force-unlock 20260215-083412-abc123
- This operation requires human approval (HITL gate HITL-006). Do not automate force-unlock.
Prevention
- Set TERRAFORM_CLI_TIMEOUT in GitHub Actions to avoid partial-lock-then-timeout scenarios.
- The registry-publish.yml workflow uses a concurrency key to serialize runs per module. Ensure no manual runs execute concurrently against the same state path.
- Ensure the S3 bucket has versioning enabled so the .tflock file can be inspected and restored if accidentally deleted.
Reference: ADR-006
5. Identity Center IdP Issues: INTERNAL vs EXTERNAL Users and Google Workspace SCIM
Symptom A — Wrong principal_idp value
Error: Invalid value for "principal_idp"
acceptable values are either "INTERNAL" or "EXTERNAL"
got: "external" (lowercase)
Symptom B — Google Workspace user not found
Error: reading Identity Store User (user_id=...) for group membership
ResourceNotFoundException: No user found for id ...
Symptom C — Mock provider UUID validation failure in tests
Error: Invalid UUID
The value for user_id must be a valid UUID format.
Got: "mock-user-12345" (generated by mock_provider)
Root Cause
principal_idp is case-sensitive and must be uppercase: "INTERNAL" for users created directly
in IAM Identity Center, "EXTERNAL" for users sourced from an external IdP (Google Workspace,
Azure AD, Okta) via SCIM sync.
Google Workspace users synced via AWS IAM Identity Center SCIM provisioning appear in the Identity
Store but their user_id is a UUID assigned by IAM Identity Center, not by Google. Terraform must
look up existing_google_sso_users by user_name (the email address), not by a Google-side ID.
The mock provider (mock_provider "aws" {}) generates random strings that fail UUID format
validation when used for user_id and group_id data sources. The .tftest.hcl files work
around this with override_data blocks that supply explicit valid UUIDs.
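When choosing values for override_data, a quick format check avoids a second round of "Invalid UUID" failures. The following Python sketch assumes the validation expects the canonical 8-4-4-4-12 hexadecimal layout; the provider's exact rule is not documented here, and `looks_like_uuid` is a hypothetical helper.

```python
import re

# Canonical UUID text layout: 8-4-4-4-12 hexadecimal digits.
UUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def looks_like_uuid(value: str) -> bool:
    """Return True when value has the canonical hyphenated UUID shape."""
    return UUID_PATTERN.fullmatch(value) is not None
```

The mock-generated string from Symptom C fails this check, while the override value used in the test files passes.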
Fix — Wrong principal_idp
account_assignments = {
  google_admins = {
    principal_name  = "GoogleAdmins"
    principal_type  = "GROUP"
    principal_idp   = "EXTERNAL" # Must be uppercase; EXTERNAL for Google Workspace groups
    permission_sets = ["AdministratorAccess"]
    account_ids     = ["123456789012"]
  }
}
Fix — Google Workspace users not resolving
Use existing_google_sso_users (not existing_sso_users) for Google-sourced identities:
existing_google_sso_users = {
  alice = {
    user_name        = "[email protected]" # Must match SCIM-synced email exactly
    group_membership = ["AdminGroup"]
  }
}
Verify SCIM provisioning is active in the AWS Console: IAM Identity Center > Settings >
Automatic provisioning must show "Enabled". If SCIM is not provisioning users, the Terraform
lookup will fail with ResourceNotFoundException.
Fix — Test mock UUID errors
Add override_data blocks with explicit UUIDs in your .tftest.hcl:
override_data {
  target = module.aws-iam-identity-center.data.aws_identitystore_user.existing_google_sso_users["alice"]
  values = {
    user_id = "b1c2d3e4-0002-4000-8000-000000000001"
  }
}
See modules/iam-identity-center/tests/04_google_workspace.tftest.hcl for the full pattern.
6. CI Pipeline Failures: EACCES on /__w/_temp/
Symptom
GitHub Actions job fails with:
Error: EACCES: permission denied, open '/__w/_temp/_runner_file_commands/set_env_...'
Or the runner cannot write artifacts, environment variables, or step outputs when a job
uses the container: directive.
Root Cause
When a GitHub Actions job uses the container: directive, the runner agent writes temporary files
to /__w/_temp/ inside the container. By default, the runner uses a non-root user, but the
nnthanh101/terraform:2.6.0 container image may not own that directory, resulting in permission
denied errors.
The fix is options: --user 0, which runs the container as root and gives the runner agent write
access to /__w/_temp/. The project mandates this pattern for every CI container job;
adlc-governance.md documents the failure mode as anti-pattern CI_CONTAINER_EACCES.
Fix
Every job that uses container: must include options: --user 0:
jobs:
  validate:
    runs-on: ubuntu-latest
    container:
      image: nnthanh101/terraform:2.6.0@sha256:3e159226f661171fb26baa360af7ddc0809076376a3cd6c37b8614186770f16a
      options: --user 0 # Required: prevents EACCES on /__w/_temp/
    steps:
      - uses: actions/checkout@v4
      - run: task build:validate
Verification
All jobs in ci.yml and registry-publish.yml already include options: --user 0 on every
container job. When adding a new job with container:, copy this pattern exactly.
The infracost.yml workflow runs on ubuntu-latest without a container: directive (Infracost
uses the official action). Do not add container: to that workflow. See FAQ #9.
7. CI Pipeline Failures: Docs Sync PAT Not Configured
Symptom
The cross-repo sync step in docs-sync.yml is commented out. Attempting to uncomment without
the PAT secret set produces:
Error: Context access might be invalid: DOCS_SYNC_PAT
Root Cause
Cross-repo documentation sync requires three HITL (Human-In-The-Loop) setup steps that cannot be automated:
- H1: Add the DevOps-TechDocs repository as a git submodule at docs/site/
- H2: Create a fine-grained GitHub PAT scoped to 1xOps/DevOps-TechDocs with Contents: Read+Write permission
- H3: Add the PAT as a repository secret named DOCS_SYNC_PAT
Until all three gates are satisfied, the sync step remains commented out. The terraform-docs
README generation step runs without a PAT and is always active. The comment block in
docs-sync.yml documents the exact commands to run once H1-H3 are complete.
Fix
- HITL gates H1-H3 must be completed by the platform engineer; these require GitHub admin access.
- Once complete, uncomment the "Sync to DevOps-TechDocs" step in .github/workflows/docs-sync.yml.
- Use only a fine-grained PAT scoped to the target repository, not a classic PAT with broad repository access.
- Verify PAT has not expired: GitHub > Settings > Developer settings > Personal access tokens.
Note on SUBMODULE_PAT
The SUBMODULE_PAT pattern was evaluated and rejected (ADR-017): the repository does not use
submodules: recursive in any actions/checkout@v4 step, so adding that secret to the
checkout step would create a secret dependency with zero business value and would break CI.
See ADR-017.
8. Checkov False Positives: Custom Check IDs and checkov:skip Pattern
Symptom
Checkov reports a failure for a resource that is intentionally compliant via a documented exception:
Check: CKV_APRA_002: "Ensure no AdministratorAccess AWS managed policy is attached"
FAILED for resource: aws_ssoadmin_managed_policy_attachment.pset["BreakGlass"]
Or CKV_APRA_005 fires on a high-privilege permission set that intentionally has no permissions
boundary (break-glass emergency access).
Root Cause
The project defines five custom APRA CPS 234 checks in .checkov/custom_checks/check_apra_cps234.py:
| Check ID | Para | What It Checks |
|---|---|---|
| CKV_APRA_001 | 15 | DataClassification tag present with valid value |
| CKV_APRA_002 | 36 | No AdministratorAccess policy attached |
| CKV_APRA_003 | 37 | Session duration does not exceed 8 hours |
| CKV_APRA_004 | 36 | Admin-named permission sets: session duration <= 1 hour (SoD) |
| CKV_APRA_005 | 37 | High-privilege sets have a permissions boundary |
Break-glass emergency access legitimately requires AdministratorAccess and may not have a
permissions boundary by design. These are documented exceptions, not defects.
Fix — Inline checkov:skip
Add a checkov:skip annotation with an ADR reference:
resource "aws_ssoadmin_managed_policy_attachment" "pset" {
  for_each = { ... }
  instance_arn       = local.ssoadmin_instance_arn
  permission_set_arn = aws_ssoadmin_permission_set.permission_set[each.key].arn
  managed_policy_arn = each.value
  #checkov:skip=CKV_APRA_002:ADR-020 break-glass emergency access requires AdministratorAccess
}
Fix — .checkov.yaml Project-Wide Baseline
For project-wide suppression with evidence:
skip-check:
  - CKV_APRA_005 # ADR-020: Break-glass sets intentionally skip permissions boundary
Checkov False Positive: merge() Tag Resolution
Checkov cannot evaluate Terraform built-in functions like merge(). When tags uses
merge(local._effective_default_tags, lookup(each.value, "tags", {})), Checkov stores the
expression as a string. The custom _resolve_tags() helper in both check files handles this by
extracting non-nested dict literals from the string expression. To inspect what Checkov sees:
checkov -d modules/ --external-checks-dir .checkov/custom_checks/ \
--check CKV_APRA_001 --output cli
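The idea of pulling non-nested dict literals out of an unevaluated merge() expression can be illustrated with a short Python sketch. This is not the project's _resolve_tags() helper: `extract_flat_tags` is a hypothetical function, and the exact string form in which Checkov stores the expression is an assumption.

```python
import re

# A brace pair with no braces inside it, i.e. a non-nested literal.
FLAT_BRACES = re.compile(r"\{([^{}]*)\}")
# key = "value" or 'key': 'value' pairs inside such a literal.
PAIR = re.compile(r"""["']?([A-Za-z_][\w-]*)["']?\s*[=:]\s*["']([^"']*)["']""")


def extract_flat_tags(expression: str) -> dict:
    """Collect string key/value pairs from every non-nested {...} literal."""
    tags = {}
    for body in FLAT_BRACES.findall(expression):
        for key, value in PAIR.findall(body):
            # Later literals overwrite earlier ones, like merge() does.
            tags[key] = value
    return tags
```

References such as lookup(each.value, "tags", {}) contribute nothing because they contain no literal pairs, which is one reason a string-based approach can under-report tags that are only known at plan time.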
Reference: task security:trivy for Trivy-based misconfig scan (complementary to Checkov).
9. Infracost Workflow Issues: Separate Workflow and API Key Configuration
Symptom A — Cost job missing from CI
You added cost estimation to ci.yml but it does not post PR comments or appear in CI results.
Symptom B — API key error
Error: You must set the INFRACOST_API_KEY environment variable.
Get a free API key from https://www.infracost.io/docs/#quick-start
Symptom C — Cost threshold gate blocks merge
FAIL: Cost exceeds $5/month threshold (HITL-003)
Root Cause
Per the MONOLITHIC_CI anti-pattern in adlc-governance.md: cost estimation runs in its own
single-responsibility workflow (infracost.yml), not in ci.yml. Keeping Infracost separate:
- Runs only on pull requests (not on push to main)
- Avoids container conflicts (infracost/actions/setup@v3 manages its own runtime)
- Makes the cost gate independently configurable
The infracost.yml workflow does NOT use container: because infracost/actions/setup@v3
manages its own runtime. Adding container: to this workflow would break the action.
Fix — Separate Workflow
Ensure Infracost lives in .github/workflows/infracost.yml with on: pull_request only.
Do not add Infracost steps to ci.yml.
Fix — API Key
- Register for a free API key at https://app.infracost.io/.
- Add it as a repository secret: GitHub > Settings > Secrets and variables > INFRACOST_API_KEY.
- The workflow references it as ${{ secrets.INFRACOST_API_KEY }}.
Fix — Cost Threshold
The workflow enforces:
- $5/month delta: FAIL gate (HITL-003), merge is blocked
- $2.50/month delta: WARNING, PR comment is posted, merge is not blocked
If the threshold is exceeded, review the Infracost PR comment for the cost breakdown per example directory. Threshold overrides require HITL approval and cannot be auto-bypassed.
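The two thresholds above amount to a three-way classification of the monthly cost delta. A minimal Python sketch follows; `classify_cost_delta` is a hypothetical name, and treating both thresholds as strict "exceeds" comparisons is an assumption based on the wording of the FAIL message, not on the workflow source.

```python
FAIL_THRESHOLD_USD = 5.00   # HITL-003 gate: merge is blocked above this delta
WARN_THRESHOLD_USD = 2.50   # PR comment only; merge stays allowed


def classify_cost_delta(monthly_delta_usd: float) -> str:
    """Map a monthly cost delta in USD to PASS, WARNING, or FAIL."""
    if monthly_delta_usd > FAIL_THRESHOLD_USD:
        return "FAIL"
    if monthly_delta_usd > WARN_THRESHOLD_USD:
        return "WARNING"
    return "PASS"
```

Under these assumptions a delta of exactly $5.00 is a WARNING rather than a FAIL; check infracost.yml if the boundary case matters for your change.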
Local Cost Check
task plan:cost # Runs infracost breakdown per module inside devcontainer
Evidence is written to tmp/terraform-aws/cost-reports/.
10. terraform test Failures: .tftest.hcl Common Errors
Symptom A — Missing mock_provider
Error: No mock or override provided
The provider "aws" is required but no mock or override was set up for the test run.
Symptom B — expect_failures syntax error on older Terraform
Error: Unsupported argument
on tests/snapshot/yaml_validation_test.tftest.hcl line 49:
An argument named "expect_failures" is not expected here.
Symptom C — override_data UUID validation failure
Error: Invalid UUID
The user_id value generated by mock_provider does not match the expected UUID format.
Symptom D — Missing override_data for data sources
Error: Reference to undeclared resource
A data resource "aws_ssm_parameter" "account1_account_id" has not been declared.
Root Cause
The Tier 1 snapshot tests use Terraform features that require version >= 1.8:
- mock_provider "aws" {}: generates mock resources without AWS credentials
- override_data { target = ... }: replaces specific data source reads with static values
- expect_failures = [check.yaml_session_duration_format]: asserts that a named check block fails
- All Tier 1 tests use command = plan: no deployment, no AWS credentials needed
The mock provider generates random strings that fail UUID format validation for data sources
returning user_id or group_id fields. Every such data source requires an explicit
override_data block with a valid UUID.
Fix — mock_provider missing
Every .tftest.hcl file must declare mock_provider "aws" {} at the top level:
mock_provider "aws" {}
Fix — expect_failures not supported
Upgrade Terraform to >= 1.8. The devcontainer (nnthanh101/terraform:2.6.0) ships with
Terraform 1.11.x. Always run tests inside the container: task test:tier1.
Fix — UUID validation errors
Override every data source that returns a UUID:
override_data {
  target = module.aws-iam-identity-center.data.aws_identitystore_user.existing_sso_users["alice"]
  values = {
    user_id = "b1c2d3e4-0002-4000-8000-000000000001"
  }
}
See modules/iam-identity-center/tests/02_existing_users_and_groups.tftest.hcl for the full
pattern with both group_id and user_id overrides.
Fix — Missing SSM parameter data source override
The examples/create-users-and-groups example reads data.aws_ssm_parameter.account1_account_id.
Add this override to every test that sources this example:
override_data {
  target = data.aws_ssm_parameter.account1_account_id
  values = {
    value = "111111111111"
  }
}
Fix — Assert count mismatch
When an assertion like length(module.aws-iam-identity-center.sso_groups_ids) == 4 fails:
- Verify the output sso_groups_ids is declared in modules/iam-identity-center/outputs.tf.
- Check that the example under test declares exactly the expected number of groups.
- A count mismatch typically indicates a logic error in the module, not in the test.
Run Tests
task test:tier1 # Tier 1 snapshot tests only (no AWS, no cost, approx 3s per module)
task test:ci # Tier 1 + Tier 2 combined (no AWS cost)
Reference: ADR-004
11. Debugging Workflow: Step-by-Step Narrowing
Use this sequence to diagnose any CI or local failure efficiently. Start broad and narrow to the specific failing component.
Step 1: Run task ci:quick First
task ci:quick
# Covers: terraform fmt, terraform validate, tflint, checkov, legal audit (approx 60s, $0 cost)
# Identifies: format errors, syntax errors, lint failures, license header issues
If ci:quick passes but CI fails, the issue is in a later phase (tests, security scan, or cost gate).
Step 2: Map the Failing Job to Its Local Task
Read the GitHub Actions job name from the workflow run summary:
| Failing Job | Local Reproduction |
|---|---|
| Validate & Lint | task build:validate then task build:lint |
| Legal Audit | task govern:legal |
| Governance Score | task govern:score then task govern:cps234 |
| CI-Safe Tests (Tier 1 + 2) | task test:ci (or task test:tier1 for Tier 1 only) |
| Lock File Verification | task build:lock-verify then task build:lock |
| Security Scan (Trivy) | task security:trivy |
| Infracost | task plan:cost |
Step 3: Narrow to the Failing Module
The validate job runs as a matrix over [iam-identity-center]. Narrow to the module:
cd /path/to/terraform-aws/modules/iam-identity-center
terraform fmt -check -recursive
terraform validate
Step 4: Check Container Logs
All task commands route through the _exec helper and run inside the
nnthanh101/terraform:2.6.0 devcontainer. If a command fails with a tool-not-found error
on bare metal but passes in CI, the tool is not installed on the host:
task build:env # Start devcontainer (30s timeout)
docker logs terraform-aws-dev --tail 50 # Inspect container startup
docker exec -it terraform-aws-dev bash # Shell into container for manual debugging
Step 5: Inspect Evidence Artifacts
All tasks write evidence to tmp/terraform-aws/. Download artifacts from
GitHub Actions > Summary > Artifacts and inspect:
tmp/terraform-aws/test-results/ # Tier 1 test output
tmp/terraform-aws/security-scans/ # Trivy results
tmp/terraform-aws/governance/ # Governance score and CPS 234 report
tmp/terraform-aws/legal-audit/ # License compliance
tmp/terraform-aws/cost-reports/ # Infracost diff JSON
Step 6: Reproduce the Exact CI Command
Every CI job maps to a single task command. To reproduce locally in the same container
environment as CI:
# Reproduce the 'test' job
docker exec -w /workspace terraform-aws-dev bash -c "task test:ci"
# Reproduce the 'security' job
docker exec -w /workspace terraform-aws-dev bash -c "task security:trivy"
Step 7: Sprint Validation Gate
Before marking a sprint complete, run the full 7-gate sprint validation:
task sprint:validate
# Gates: legal, governance score, Tier 1 tests, lock-verify, fmt, validate, security
# Exit code 0 = all gates pass; non-zero = at least one gate failed
Evidence is written to tmp/terraform-aws/ and is required for sprint completion claims.
Completion claims without evidence paths are rejected as NATO violations.
Quick Reference: Error-to-Task Mapping
| Error Pattern | Task to Run | ADR Reference |
|---|---|---|
| session_duration not ISO 8601 | task test:tier1 | ADR-008 |
| Missing FOCUS tags (CKV_CUSTOM_FOCUS_001) | task build:lint | ADR-008 |
| Provider version mismatch | task build:lock | ADR-003 |
| S3 state lock stale | Manual force-unlock (HITL required) | ADR-006 |
| EACCES /__w/_temp/ | Add options: --user 0 to CI job | adlc-governance.md |
| Infracost API key missing | Set INFRACOST_API_KEY repo secret | n/a |
| expect_failures syntax error | Run inside devcontainer (Terraform >= 1.8) | ADR-004 |
| UUID validation in tests | Add override_data blocks with valid UUIDs | ADR-004 |
| principal_idp case error | Use uppercase "INTERNAL" or "EXTERNAL" | n/a |
| CKV_APRA_* false positive | Add checkov:skip with ADR-020 reference | ADR-020 |
| Docs sync not running | Complete HITL gates H1-H3 for PAT setup | ADR-017 |