Troubleshooting & FAQ

This guide documents real error patterns encountered in the terraform-aws CI pipeline and during module development. Each entry includes the exact error, root cause, and the fix. Start every debugging session with task ci:quick to narrow the failure surface before diving into individual sections.


1. YAML Config Errors: session_duration Must Be ISO 8601 PT Format

Symptom

terraform plan reports a variable validation error or a check block assertion warning:

Error: Invalid value for variable
session_duration must follow ISO 8601 format in hours or minutes (e.g., PT4H, PT8H, PT30M).

Or, when the bad value comes from a YAML file (bypassing variables.tf validation):

Warning: Check block assertion failed
check.yaml_session_duration_format
APRA CPS 234 / AC-3-10: The following permission sets (possibly from YAML) have a
session_duration that does not match ISO 8601 PT<n>H or PT<n>M format: ["BadDuration"].
Correct the YAML file or the HCL variable value.

Root Cause

AWS IAM Identity Center accepts only ISO 8601 duration strings. The module enforces ^PT[0-9]+[HM]$ in both variables.tf and the check.yaml_session_duration_format block in locals.tf. The check block catches values that arrive through the YAML merge path (config_path) and bypass the HCL variable validation block.

Common authoring mistakes:

Wrong (will fail) | Correct
4 hours | PT4H
4h | PT4H
PT4 | PT4H
240M | PT240M or PT4H
PT1.5H | PT90M (no decimals)
P1D | PT24H (days not accepted)
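
As a quick pre-flight check, the pattern quoted in the Root Cause text can be exercised outside Terraform. The sketch below is only an illustration and is not part of the module: the helper name is_valid_session_duration is hypothetical, and Python's re.fullmatch stands in for the anchored ^...$ match that the module applies.

```python
import re

# Same character classes as the module's ^PT[0-9]+[HM]$ pattern.
# fullmatch() anchors both ends, so no explicit ^ or $ is needed.
SESSION_DURATION_RE = re.compile(r"PT[0-9]+[HM]")


def is_valid_session_duration(value: str) -> bool:
    """Return True when value has the PT<n>H or PT<n>M shape (hypothetical helper)."""
    return SESSION_DURATION_RE.fullmatch(value) is not None


if __name__ == "__main__":
    # Left column of the table above, followed by three corrected values.
    for candidate in ["4 hours", "4h", "PT4", "240M", "PT1.5H", "P1D",
                      "PT4H", "PT240M", "PT90M"]:
        verdict = "ok" if is_valid_session_duration(candidate) else "rejected"
        print(f"{candidate!r}: {verdict}")
```

Every entry from the "Wrong" column is rejected, while PT4H, PT240M, and PT90M pass.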

Fix

  1. Open permission_sets.yaml (or your HCL permission_sets variable).
  2. Find all session_duration values.
  3. Replace with PT<integer>H (hours) or PT<integer>M (minutes).
  4. APRA CPS 234 Para 37: Administrative permission sets must not exceed PT1H. Standard permission sets must not exceed PT8H. See custom check CKV_APRA_003 and CKV_APRA_004.
  5. Re-run: task test:tier1
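
The two ceilings in step 4 are easier to compare once durations are converted to minutes. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and is_admin is passed in by the caller because this guide does not say how the custom checks classify a permission set as administrative.

```python
import re

# Ceilings from step 4 above, expressed in minutes.
ADMIN_LIMIT_MINUTES = 60      # PT1H
STANDARD_LIMIT_MINUTES = 480  # PT8H


def duration_minutes(value: str) -> int:
    """Convert PT<n>H or PT<n>M to minutes; raise ValueError for anything else."""
    match = re.fullmatch(r"PT([0-9]+)([HM])", value)
    if match is None:
        raise ValueError(f"not a PT<n>H / PT<n>M duration: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return amount * 60 if unit == "H" else amount


def within_session_limit(value: str, is_admin: bool) -> bool:
    """Compare a duration with the administrative or standard ceiling.

    is_admin is an assumption of this sketch: the caller decides which
    ceiling applies to the permission set being checked.
    """
    limit = ADMIN_LIMIT_MINUTES if is_admin else STANDARD_LIMIT_MINUTES
    return duration_minutes(value) <= limit
```

For example, within_session_limit("PT90M", is_admin=True) is False because 90 minutes exceeds the one-hour ceiling, while the same value passes for a standard permission set.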

Test Coverage

modules/iam-identity-center/tests/snapshot/yaml_validation_test.tftest.hcl covers this failure path via invalid_session_duration_fails_check and valid_session_duration_minutes_passes runs.


2. Tag Validation Failures: Required Keys Missing

Symptom A — variables.tf validation (HCL path)

Error: Invalid value for variable
on variables.tf line 190, in variable "default_tags":
When default_tags is provided, it must include:
CostCenter, Project, Environment, DataClassification.

Symptom B — check block (YAML path)

Warning: Check block assertion failed
check.yaml_required_tag_keys
APRA CPS 234 / AC-3-10: The following permission sets (possibly from YAML) have a tags map
that is missing required keys CostCenter and/or DataClassification: ["MissingCostCenterTag"].

Symptom C — Checkov custom check

Check: CKV_CUSTOM_FOCUS_001: "Ensure cost allocation tags are present for FOCUS 1.2+ exports"
FAILED for resource: aws_ssoadmin_permission_set.permission_set["ReadOnly"]
Missing cost allocation tags: CostCenter, Project, Environment, ServiceName

Root Cause

The module enforces two overlapping tag sets:

Tag Key | Required By | Checked In
CostCenter | FOCUS 1.2+ and APRA CPS 234 Para 15 | variables.tf, check.yaml_required_tag_keys, CKV_CUSTOM_FOCUS_001
Project | FOCUS 1.2+ | variables.tf, CKV_CUSTOM_FOCUS_001
Environment | FOCUS 1.2+ | variables.tf, CKV_CUSTOM_FOCUS_001
ServiceName | FOCUS 1.2+ | CKV_CUSTOM_FOCUS_001
DataClassification | APRA CPS 234 Para 15 | variables.tf, check.yaml_required_tag_keys, CKV_APRA_001

When default_tags = {} (empty map), the variables.tf validation passes (empty is explicitly allowed). However, per-permission-set tags maps are merged with the consumer-supplied default_tags (not with _effective_default_tags, which contains hardcoded fallbacks for Checkov). If neither default_tags nor the per-pset tags supplies CostCenter and DataClassification, the check block fires.
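
The merge behaviour described above can be modelled in a few lines. This is a rough sketch, not the module's HCL: the function name is hypothetical, plain dictionaries stand in for the Terraform maps, and the precedence of per-set tags over default_tags is an assumption (it does not affect which keys are present).

```python
REQUIRED_KEYS = ("CostCenter", "DataClassification")


def psets_missing_required_tags(default_tags: dict, permission_sets: dict) -> list:
    """Return the names of permission sets whose merged tags lack a required key."""
    failing = []
    for name, pset in permission_sets.items():
        # Per-set tags layered over the consumer-supplied default_tags
        # (assumed precedence; no hardcoded fallbacks are applied here).
        merged = {**default_tags, **pset.get("tags", {})}
        if any(key not in merged for key in REQUIRED_KEYS):
            failing.append(name)
    return sorted(failing)


# Illustrative input only; the permission set names echo the messages above.
permission_sets = {
    "ReadOnly": {"tags": {"CostCenter": "platform", "DataClassification": "internal"}},
    "MissingCostCenterTag": {"tags": {"DataClassification": "internal"}},
}

print(psets_missing_required_tags({}, permission_sets))
print(psets_missing_required_tags({"CostCenter": "platform"}, permission_sets))
```

With an empty default_tags map only MissingCostCenterTag is reported; once default_tags supplies CostCenter, the list is empty.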

Fix

Set default_tags in the calling module:

module "iam_identity_center" {
  source = "oceansoft/terraform-aws/aws//modules/iam-identity-center"

  default_tags = {
    CostCenter         = "platform"
    Project            = "landing-zone"
    Environment        = "production"
    ServiceName        = "sso"
    DataClassification = "internal"
  }
}

Valid DataClassification values: public, internal, confidential, restricted.

Test Coverage

yaml_validation_test.tftest.hcl tests 5 and 6 cover missing CostCenter and DataClassification respectively. Test 7 confirms a complete tag map passes.

Reference: ADR-008


3. Provider Version Conflicts

Symptom

Error: Unsatisfied requirements
Provider registry.terraform.io/hashicorp/aws version ">= 6.28.0, < 7.0.0" is required.
lock file contains hashicorp/aws 5.88.0 which does not satisfy ">= 6.28.0"

Or:

Error: Unsatisfied requirements
Terraform >= 1.11.0 is required by this module.
Current version: 1.9.5.

Root Cause

The module enforces strict provider constraints per ADR-003:

terraform {
  required_version = ">= 1.11.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 6.28, < 7.0"
    }
  }
}

AWS provider 6.x introduced breaking changes to the aws_ssoadmin_* resource schemas relative to 5.x. Terraform 1.11.0 is required for S3 native state locking (use_lockfile = true, ADR-006). The expect_failures argument used in .tftest.hcl files needs only >= 1.8, but the project standardizes on 1.11 for full feature parity.

Fix

  1. Check your installed Terraform version: terraform version
  2. The devcontainer (nnthanh101/terraform:2.6.0) has the correct version pre-installed. Run task build:env to start the container and execute all commands inside it.
  3. If the lock file is stale (contains 5.x provider hash), regenerate it:
    cd modules/iam-identity-center
    rm .terraform.lock.hcl
    terraform init -upgrade
  4. Regenerate the lock file for all supported platforms:
    task build:lock
    This locks for linux_amd64, linux_arm64, darwin_amd64, darwin_arm64.
  5. Verify lock file exists: task build:lock-verify

Do not run terraform init on bare metal. Use the devcontainer to ensure reproducible provider resolution. See ADR-003.
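
To see at a glance whether a lock file is stale, the locked provider version can be compared with the module constraint. The sketch below is an illustration under assumptions: the function names and the SAMPLE_LOCK text are hypothetical, only plain major.minor.patch versions are handled, and the regular expressions rely on the usual layout of .terraform.lock.hcl (a provider block containing a version = "..." line).

```python
import re

AWS_PROVIDER = "registry.terraform.io/hashicorp/aws"


def locked_aws_version(lock_text: str):
    """Return the locked hashicorp/aws version as a (major, minor, patch) tuple, or None."""
    block = re.search(
        r'provider "' + re.escape(AWS_PROVIDER) + r'" \{(.*?)\n\}', lock_text, re.S
    )
    if block is None:
        return None
    version = re.search(
        r'^\s*version\s*=\s*"([0-9]+)\.([0-9]+)\.([0-9]+)"', block.group(1), re.M
    )
    if version is None:
        return None
    return tuple(int(part) for part in version.groups())


def satisfies_module_constraint(version) -> bool:
    """Mirror the module constraint '>= 6.28, < 7.0' using tuple comparison."""
    return (6, 28, 0) <= version < (7, 0, 0)


# Hypothetical stale lock file content, for illustration only.
SAMPLE_LOCK = '''
provider "registry.terraform.io/hashicorp/aws" {
  version     = "5.88.0"
  constraints = ">= 5.0.0"
}
'''

print(locked_aws_version(SAMPLE_LOCK))
print(satisfies_module_constraint(locked_aws_version(SAMPLE_LOCK)))
```

A result of False for the second line indicates that the lock file needs regenerating as described in the steps above.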


4. State Lock Errors: S3 Native Locking

Symptom

Error: Error acquiring the state lock
Lock Info:
ID: 20260215-083412-abc123
Path: s3://my-tfstate-bucket/terraform-aws/iam-identity-center/terraform.tfstate
Operation: OperationTypePlan
Who: github-actions@runner-1234
Created: 2026-02-15 08:34:12 UTC

Terraform acquires a state lock to protect the state from being written
by multiple users at the same time.

Root Cause

The project uses S3 native state locking (use_lockfile = true) per ADR-006 — no DynamoDB table required. The lock file is stored alongside the state file as terraform.tfstate.tflock in the same S3 prefix. A stale lock occurs when a previous Terraform process was interrupted (GitHub Actions timeout, SIGKILL, network failure) without releasing the lock.

Fix

  1. Verify the lock is genuinely stale by checking GitHub Actions run history for the Lock ID.
  2. Inspect the S3 bucket for the lock file:
    aws s3 ls s3://my-tfstate-bucket/terraform-aws/iam-identity-center/ --profile <profile>
  3. Confirm no active plan or deployment is running in CI or locally.
  4. Force-unlock using the Lock ID from the error message:
    terraform force-unlock 20260215-083412-abc123
  5. Force-unlock requires human approval (HITL gate HITL-006): obtain it before running step 4, and do not automate force-unlock.
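
When collecting evidence for the approval, it can help to pull the Lock ID and state path out of the error text and derive the lock object's location. This is a small sketch with hypothetical helper names; it assumes the "Key: value" layout shown in the Symptom above and the .tflock suffix described in the Root Cause.

```python
import re

LOCK_FIELDS = {"ID", "Path", "Operation", "Who", "Created"}


def parse_lock_info(error_text: str) -> dict:
    """Collect the 'Key: value' lines of the Lock Info block into a dict."""
    info = {}
    for line in error_text.splitlines():
        match = re.match(r"\s*(\w+):\s*(\S.*)$", line)
        # Ignore lines such as "Error: ..." whose key is not a lock field.
        if match and match.group(1) in LOCK_FIELDS:
            info[match.group(1)] = match.group(2).strip()
    return info


def lock_object_uri(state_uri: str) -> str:
    """The lock object sits next to the state object, with a .tflock suffix."""
    return state_uri + ".tflock"
```

Passing the Path value from the error into lock_object_uri gives the object to look for in the aws s3 ls output from step 2.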

Prevention

  • Set TERRAFORM_CLI_TIMEOUT in GitHub Actions to avoid partial-lock-then-timeout scenarios.
  • The registry-publish.yml workflow uses a concurrency key to serialize runs per module. Ensure no manual runs execute concurrently against the same state path.
  • Ensure the S3 bucket has versioning enabled so the .tflock file can be inspected and restored if accidentally deleted.

Reference: ADR-006


5. Identity Center IdP Issues: INTERNAL vs EXTERNAL Users and Google Workspace SCIM

Symptom A — Wrong principal_idp value

Error: Invalid value for "principal_idp"
acceptable values are either "INTERNAL" or "EXTERNAL"
got: "external" (lowercase)

Symptom B — Google Workspace user not found

Error: reading Identity Store User (user_id=...) for group membership
ResourceNotFoundException: No user found for id ...

Symptom C — Mock provider UUID validation failure in tests

Error: Invalid UUID
The value for user_id must be a valid UUID format.
Got: "mock-user-12345" (generated by mock_provider)

Root Cause

principal_idp is case-sensitive and must be uppercase: "INTERNAL" for users created directly in IAM Identity Center, "EXTERNAL" for users sourced from an external IdP (Google Workspace, Azure AD, Okta) via SCIM sync.

Google Workspace users synced via AWS IAM Identity Center SCIM provisioning appear in the Identity Store but their user_id is a UUID assigned by IAM Identity Center, not by Google. Terraform must look up existing_google_sso_users by user_name (the email address), not by a Google-side ID.

The mock provider (mock_provider "aws" {}) generates random strings that fail UUID format validation when used for user_id and group_id data sources. The .tftest.hcl files work around this with override_data blocks that supply explicit valid UUIDs.

Fix — Wrong principal_idp

account_assignments = {
  google_admins = {
    principal_name  = "GoogleAdmins"
    principal_type  = "GROUP"
    principal_idp   = "EXTERNAL" # Must be uppercase; EXTERNAL for Google Workspace groups
    permission_sets = ["AdministratorAccess"]
    account_ids     = ["123456789012"]
  }
}

Fix — Google Workspace users not resolving

Use existing_google_sso_users (not existing_sso_users) for Google-sourced identities:

existing_google_sso_users = {
  alice = {
    user_name        = "alice@example.com" # Placeholder address; must match the SCIM-synced email exactly
    group_membership = ["AdminGroup"]
  }
}

Verify SCIM provisioning is active in the AWS Console: IAM Identity Center > Settings > Automatic provisioning must show "Enabled". If SCIM is not provisioning users, the Terraform lookup will fail with ResourceNotFoundException.

Fix — Test mock UUID errors

Add override_data blocks with explicit UUIDs in your .tftest.hcl:

override_data {
  target = module.aws-iam-identity-center.data.aws_identitystore_user.existing_google_sso_users["alice"]
  values = {
    user_id = "b1c2d3e4-0002-4000-8000-000000000001"
  }
}

See modules/iam-identity-center/tests/04_google_workspace.tftest.hcl for the full pattern.
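
Before pasting an identifier into an override_data block, its shape can be checked quickly. The sketch below is hedged: looks_like_uuid is a hypothetical helper that tests only the canonical 8-4-4-4-12 hexadecimal layout, and it does not reproduce whatever exact rule the provider applies.

```python
import re

# Canonical UUID text layout: 8-4-4-4-12 hexadecimal digits.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def looks_like_uuid(value: str) -> bool:
    """Return True when value has the canonical UUID shape (hypothetical helper)."""
    return UUID_RE.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["mock-user-12345", "b1c2d3e4-0002-4000-8000-000000000001"]:
        print(candidate, looks_like_uuid(candidate))
```

The mock-generated string from Symptom C is rejected, and the explicit value used in the override above is accepted.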


6. CI Pipeline Failures: EACCES on /__w/_temp/

Symptom

GitHub Actions job fails with:

Error: EACCES: permission denied, open '/__w/_temp/_runner_file_commands/set_env_...'

Or the runner cannot write artifacts, environment variables, or step outputs when a job uses the container: directive.

Root Cause

When a GitHub Actions job uses the container: directive, the runner agent writes temporary files to /__w/_temp/ inside the container. That directory is created by the runner's own non-root user, so a container user that does not own it (as can happen with the nnthanh101/terraform:2.6.0 image) gets permission denied errors.

The fix is options: --user 0, which runs the container as root and gives the runner agent write access to /__w/_temp/. This is the CI container pattern mandated in the project (documented in adlc-governance.md as anti-pattern CI_CONTAINER_EACCES).

Fix

Every job that uses container: must include options: --user 0:

jobs:
  validate:
    runs-on: ubuntu-latest
    container:
      image: nnthanh101/terraform:2.6.0@sha256:3e159226f661171fb26baa360af7ddc0809076376a3cd6c37b8614186770f16a
      options: --user 0 # Required: prevents EACCES on /__w/_temp/
    steps:
      - uses: actions/checkout@v4
      - run: task build:validate

Verification

Every container job in ci.yml and registry-publish.yml already includes options: --user 0. When adding a new job with container:, copy this pattern exactly.
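
A new workflow can be checked for this rule mechanically. The following is a sketch, not an existing project tool: the function name is hypothetical, and the workflow is assumed to have already been loaded into a Python dict by a YAML parser of your choice.

```python
def container_jobs_missing_root(workflow: dict) -> list:
    """Return IDs of jobs that declare container: without '--user 0' in options."""
    flagged = []
    for job_id, job in workflow.get("jobs", {}).items():
        container = job.get("container")
        if container is None:
            continue  # job runs directly on the runner, so the rule does not apply
        # container: may be a bare image string, which carries no options at all.
        options = (container.get("options") or "") if isinstance(container, dict) else ""
        tokens = options.split()
        runs_as_root = "--user=0" in tokens or any(
            flag == "--user" and value == "0"
            for flag, value in zip(tokens, tokens[1:])
        )
        if not runs_as_root:
            flagged.append(job_id)
    return flagged
```

An empty list means every container job carries the option; any job ID returned needs options: --user 0 added.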

The infracost.yml workflow runs on ubuntu-latest without a container: directive (Infracost uses the official action). Do not add container: to that workflow. See FAQ #9.


7. CI Pipeline Failures: Docs Sync PAT Not Configured

Symptom

The cross-repo sync step in docs-sync.yml is commented out. Attempting to uncomment without the PAT secret set produces:

Error: Context access might be invalid: DOCS_SYNC_PAT

Root Cause

Cross-repo documentation sync requires three HITL (Human-In-The-Loop) setup steps that cannot be automated:

  • H1: Add the DevOps-TechDocs repository as a git submodule at docs/site/
  • H2: Create a fine-grained GitHub PAT scoped to 1xOps/DevOps-TechDocs with Contents: Read+Write permission
  • H3: Add the PAT as a repository secret named DOCS_SYNC_PAT

Until all three gates are satisfied, the sync step remains commented out. The terraform-docs README generation step runs without a PAT and is always active. The comment block in docs-sync.yml documents the exact commands to run once H1-H3 are complete.

Fix

  1. HITL gates H1-H3 must be completed by the platform engineer — these require GitHub admin access.
  2. Once complete, uncomment the "Sync to DevOps-TechDocs" step in .github/workflows/docs-sync.yml.
  3. Use only a fine-grained PAT scoped to the target repository, not a classic PAT with broad repository access.
  4. Verify PAT has not expired: GitHub > Settings > Developer settings > Personal access tokens.

Note on SUBMODULE_PAT

The SUBMODULE_PAT pattern was evaluated and rejected in ADR-017. The repository does not use submodules: recursive in any actions/checkout@v4 step, so adding that secret to the checkout step would create a secret dependency with zero business value and would break CI.


8. Checkov False Positives: Custom Check IDs and checkov:skip Pattern

Symptom

Checkov reports a failure for a resource that is intentionally compliant via a documented exception:

Check: CKV_APRA_002: "Ensure no AdministratorAccess AWS managed policy is attached"
FAILED for resource: aws_ssoadmin_managed_policy_attachment.pset["BreakGlass"]

Or CKV_APRA_005 fires on a high-privilege permission set that intentionally has no permissions boundary (break-glass emergency access).

Root Cause

The project defines five custom APRA CPS 234 checks in .checkov/custom_checks/check_apra_cps234.py:

Check ID | Para | What It Checks
CKV_APRA_001 | 15 | DataClassification tag present with valid value
CKV_APRA_002 | 36 | No AdministratorAccess policy attached
CKV_APRA_003 | 37 | Session duration does not exceed 8 hours
CKV_APRA_004 | 36 | Admin-named permission sets: session duration <= 1 hour (SoD)
CKV_APRA_005 | 37 | High-privilege sets have a permissions boundary

Break-glass emergency access legitimately requires AdministratorAccess and may not have a permissions boundary by design. These are documented exceptions, not defects.

Fix — Inline checkov:skip

Add a checkov:skip annotation with an ADR reference:

resource "aws_ssoadmin_managed_policy_attachment" "pset" {
  for_each = { ... }

  instance_arn       = local.ssoadmin_instance_arn
  permission_set_arn = aws_ssoadmin_permission_set.permission_set[each.key].arn
  managed_policy_arn = each.value

  #checkov:skip=CKV_APRA_002:ADR-020 break-glass emergency access requires AdministratorAccess
}

Fix — .checkov.yaml Project-Wide Baseline

For project-wide suppression with evidence:

skip-check:
  - CKV_APRA_005 # ADR-020: Break-glass sets intentionally skip permissions boundary

Checkov False Positive: merge() Tag Resolution

Checkov cannot evaluate Terraform built-in functions like merge(). When tags uses merge(local._effective_default_tags, lookup(each.value, "tags", {})), Checkov stores the expression as a string. The custom _resolve_tags() helper in both check files handles this by extracting non-nested dict literals from the string expression. To inspect what Checkov sees:

checkov -d modules/ --external-checks-dir .checkov/custom_checks/ \
  --check CKV_APRA_001 --output cli
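
The string-extraction idea behind _resolve_tags() can be pictured with a short sketch. This is not the project's helper: resolve_tags_sketch is a hypothetical name, and the code assumes that Checkov renders map literals inside the expression as Python-style dict text, which may not match what your Checkov version produces.

```python
import ast
import re

# A brace group that contains no further braces, i.e. a non-nested literal.
FLAT_BRACES = re.compile(r"\{[^{}]*\}")


def resolve_tags_sketch(expression: str) -> dict:
    """Merge every flat dict literal found in a stringified tags expression."""
    merged = {}
    for literal in FLAT_BRACES.findall(expression):
        try:
            value = ast.literal_eval(literal)
        except (ValueError, SyntaxError):
            continue  # not a Python-style literal (assumed format); skip it
        if isinstance(value, dict):
            merged.update(value)  # later literals win, as with merge()
    return merged


expr = "merge({'CostCenter': 'platform', 'DataClassification': 'internal'}, lookup(each.value, 'tags', {}))"
print(resolve_tags_sketch(expr))
```

Anything the sketch cannot parse is skipped rather than guessed, so an empty result means no flat literal was recoverable from the expression.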

Reference: task security:trivy for Trivy-based misconfig scan (complementary to Checkov).


9. Infracost Workflow Issues: Separate Workflow and API Key Configuration

Symptom A — Cost job missing from CI

You added cost estimation to ci.yml but it does not post PR comments or appear in CI results.

Symptom B — API key error

Error: You must set the INFRACOST_API_KEY environment variable.
Get a free API key from https://www.infracost.io/docs/#quick-start

Symptom C — Cost threshold gate blocks merge

FAIL: Cost exceeds $5/month threshold (HITL-003)

Root Cause

Per the MONOLITHIC_CI anti-pattern in adlc-governance.md, cost estimation runs in its own single-responsibility workflow (infracost.yml), not in ci.yml. Keeping Infracost separate means that it:

  • runs only on pull requests (not on push to main)
  • avoids container conflicts (infracost/actions/setup@v3 manages its own runtime)
  • keeps the cost gate independently configurable

For the same reason, the infracost.yml workflow does NOT use container:, and adding it would break the action.

Fix — Separate Workflow

Ensure Infracost lives in .github/workflows/infracost.yml with on: pull_request only. Do not add Infracost steps to ci.yml.

Fix — API Key

  1. Register for a free API key at https://app.infracost.io/.
  2. Add it as a repository secret named INFRACOST_API_KEY: GitHub > Settings > Secrets and variables > Actions.
  3. The workflow references it as ${{ secrets.INFRACOST_API_KEY }}.

Fix — Cost Threshold

The workflow enforces:

  • $5/month delta — FAIL gate (HITL-003): merge is blocked
  • $2.50/month delta — WARNING: PR comment is posted, merge is not blocked

If the threshold is exceeded, review the Infracost PR comment for the cost breakdown per example directory. Threshold overrides require HITL approval and cannot be auto-bypassed.
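
The two thresholds amount to a three-way decision on the monthly cost delta. The sketch below only illustrates that logic and is not taken from the workflow: the function name is hypothetical, and treating a delta exactly equal to a threshold as not exceeding it is an assumption.

```python
FAIL_THRESHOLD_USD = 5.00   # HITL-003 gate: merge is blocked
WARN_THRESHOLD_USD = 2.50   # PR comment only


def cost_gate(monthly_delta_usd: float) -> str:
    """Classify a monthly cost delta as 'fail', 'warn', or 'pass'.

    Boundary handling is an assumption: only deltas strictly above a
    threshold count as exceeding it.
    """
    if monthly_delta_usd > FAIL_THRESHOLD_USD:
        return "fail"
    if monthly_delta_usd > WARN_THRESHOLD_USD:
        return "warn"
    return "pass"
```

For example, a delta of 3.00 yields "warn" (comment posted, merge allowed) and a delta of 7.25 yields "fail".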

Local Cost Check

task plan:cost   # Runs infracost breakdown per module inside devcontainer

Evidence is written to tmp/terraform-aws/cost-reports/.


10. terraform test Failures: .tftest.hcl Common Errors

Symptom A — Missing mock_provider

Error: No mock or override provided
The provider "aws" is required but no mock or override was set up for the test run.

Symptom B — expect_failures syntax error on older Terraform

Error: Unsupported argument
on tests/snapshot/yaml_validation_test.tftest.hcl line 49:
An argument named "expect_failures" is not expected here.

Symptom C — override_data UUID validation failure

Error: Invalid UUID
The user_id value generated by mock_provider does not match the expected UUID format.

Symptom D — Missing override_data for data sources

Error: Reference to undeclared resource
A data resource "aws_ssm_parameter" "account1_account_id" has not been declared.

Root Cause

The Tier 1 snapshot tests use Terraform features that require version >= 1.8:

  • mock_provider "aws" {} — generates mock resources without AWS credentials
  • override_data { target = ... } — replaces specific data source reads with static values
  • expect_failures = [check.yaml_session_duration_format] — asserts that a named check block fails
  • All Tier 1 tests use command = plan — no deployment, no AWS credentials needed

The mock provider generates random strings that fail UUID format validation for data sources returning user_id or group_id fields. Every such data source requires an explicit override_data block with a valid UUID.

Fix — mock_provider missing

Every .tftest.hcl file must declare mock_provider "aws" {} at the top level:

mock_provider "aws" {}

Fix — expect_failures not supported

Upgrade Terraform to >= 1.8. The devcontainer (nnthanh101/terraform:2.6.0) ships with Terraform 1.11.x. Always run tests inside the container: task test:tier1.

Fix — UUID validation errors

Override every data source that returns a UUID:

override_data {
  target = module.aws-iam-identity-center.data.aws_identitystore_user.existing_sso_users["alice"]
  values = {
    user_id = "b1c2d3e4-0002-4000-8000-000000000001"
  }
}

See modules/iam-identity-center/tests/02_existing_users_and_groups.tftest.hcl for the full pattern with both group_id and user_id overrides.

Fix — Missing SSM parameter data source override

The examples/create-users-and-groups example reads data.aws_ssm_parameter.account1_account_id. Add this override to every test that sources this example:

override_data {
  target = data.aws_ssm_parameter.account1_account_id
  values = {
    value = "111111111111"
  }
}

Fix — Assert count mismatch

When an assertion like length(module.aws-iam-identity-center.sso_groups_ids) == 4 fails:

  1. Verify the output sso_groups_ids is declared in modules/iam-identity-center/outputs.tf.
  2. Check that the example under test declares exactly the expected number of groups.
  3. A count mismatch typically indicates a logic error in the module, not in the test.

Run Tests

task test:tier1   # Tier 1 snapshot tests only (no AWS, no cost, approx 3s per module)
task test:ci      # Tier 1 + Tier 2 combined (no AWS cost)

Reference: ADR-004


11. Debugging Workflow: Step-by-Step Narrowing

Use this sequence to diagnose any CI or local failure efficiently. Start broad and narrow to the specific failing component.

Step 1: Run task ci:quick First

task ci:quick
# Covers: terraform fmt, terraform validate, tflint, checkov, legal audit (approx 60s, $0 cost)
# Identifies: format errors, syntax errors, lint failures, license header issues

If ci:quick passes but CI fails, the issue is in a later phase (tests, security scan, or cost gate).

Step 2: Map the Failing Job to Its Local Task

Read the GitHub Actions job name from the workflow run summary:

Failing Job | Local Reproduction
Validate & Lint | task build:validate then task build:lint
Legal Audit | task govern:legal
Governance Score | task govern:score then task govern:cps234
CI-Safe Tests (Tier 1 + 2) | task test:ci (or task test:tier1 for Tier 1 only)
Lock File Verification | task build:lock-verify then task build:lock
Security Scan (Trivy) | task security:trivy
Infracost | task plan:cost

Step 3: Narrow to the Failing Module

The validate job runs as a matrix over [iam-identity-center]. Narrow to the module:

cd /path/to/terraform-aws/modules/iam-identity-center
terraform fmt -check -recursive
terraform validate

Step 4: Check Container Logs

All task commands route through the _exec helper and run inside the nnthanh101/terraform:2.6.0 devcontainer. If a command fails with a tool-not-found error on bare metal but passes in CI, the tool is not installed on the host:

task build:env                            # Start devcontainer (30s timeout)
docker logs terraform-aws-dev --tail 50   # Inspect container startup
docker exec -it terraform-aws-dev bash    # Shell into container for manual debugging

Step 5: Inspect Evidence Artifacts

All tasks write evidence to tmp/terraform-aws/. Download artifacts from GitHub Actions > Summary > Artifacts and inspect:

tmp/terraform-aws/test-results/     # Tier 1 test output
tmp/terraform-aws/security-scans/   # Trivy results
tmp/terraform-aws/governance/       # Governance score and CPS 234 report
tmp/terraform-aws/legal-audit/      # License compliance
tmp/terraform-aws/cost-reports/     # Infracost diff JSON

Step 6: Reproduce the Exact CI Command

Every CI job maps to a single task command. To reproduce locally in the same container environment as CI:

# Reproduce the 'test' job
docker exec -w /workspace terraform-aws-dev bash -c "task test:ci"

# Reproduce the 'security' job
docker exec -w /workspace terraform-aws-dev bash -c "task security:trivy"

Step 7: Sprint Validation Gate

Before marking a sprint complete, run the full 7-gate sprint validation:

task sprint:validate
# Gates: legal, governance score, Tier 1 tests, lock-verify, fmt, validate, security
# Exit code 0 = all gates pass; non-zero = at least one gate failed

Evidence is written to tmp/terraform-aws/ and is required for sprint completion claims. Completion claims without evidence paths are rejected as NATO violations.


Quick Reference: Error-to-Task Mapping

Error Pattern | Task to Run | ADR Reference
session_duration not ISO 8601 | task test:tier1 | ADR-008
Missing FOCUS tags (CKV_CUSTOM_FOCUS_001) | task build:lint | ADR-008
Provider version mismatch | task build:lock | ADR-003
S3 state lock stale | Manual force-unlock (HITL required) | ADR-006
EACCES /__w/_temp/ | Add options: --user 0 to CI job | adlc-governance.md
Infracost API key missing | Set INFRACOST_API_KEY repo secret | -
expect_failures syntax error | Run inside devcontainer (Terraform >= 1.8) | ADR-004
UUID validation in tests | Add override_data blocks with valid UUIDs | ADR-004
principal_idp case error | Use uppercase "INTERNAL" or "EXTERNAL" | -
CKV_APRA_* false positive | Add checkov:skip with ADR-020 reference | ADR-020
Docs sync not running | Complete HITL gates H1-H3 for PAT setup | ADR-017