
Data-Driven GitOps Platform


Welcome to our Data & AI/ML GitOps platform, built on a hybrid multi-cloud approach:

  • Dev (k3d): Local ephemeral clusters with k3d for rapid iteration and prototyping.
  • Staging (k3s): A pinned k3s environment providing a realistic test bed for integrated data workflows.
  • Production (AWS): A fully provisioned AWS environment (e.g., Amazon EKS) for large-scale data ingestion, training pipelines, and real-time inference.

Introduction & Goals

  • Data-Centric: We focus on data ingestion, transformation, and AI/ML training/inference pipelines, all driven by GitOps best practices.
  • Multi-Environment: Simplify dev, staging, and prod by reusing the same code, pinned versions, and reference paths.
  • Automation: Terraform provisions the clusters (k3s, AWS), while ArgoCD and related tools (e.g. Argo Workflows, External Secrets Operator, Atlantis) automate day-2 tasks, continuous delivery, and ephemeral environment creation.

Repository Structure

tf-k3s-template/              # K3s
├── registry/environments/
│   ├── development/          # ArgoCD resources & config for the dev environment (k3d)
│   ├── staging/              # ArgoCD resources & config for the staging environment (k3s)
│   └── production/           # ArgoCD resources & config for the production environment (AWS/EKS)
├── templates/
│   ├── mgmt/                 # Management-plane YAML (ArgoCD, Vault, Atlantis, etc.)
│   └── workload-vcluster/    # Optional: vcluster-based workloads or environment overlays
├── terraform/
│   ├── k3s/                  # Terraform code for K3s
│   ├── github/               # Terraform code for GitHub
│   ├── users/                # Terraform code for users
│   └── vault/                # Terraform code for Vault
└── Taskfile.yml              # Orchestrates tasks for K3s

...

tf-aws-template/              # AWS
tf-azure-template/            # Azure

Environments

Development (k3d)

  • Purpose: Rapid local iteration. Docker-based ephemeral clusters via k3d let you spin up and tear down for short dev cycles.
  • Typical Usage: Data engineers build or test smaller data transformations or AI/ML pipeline steps quickly.
  • Deployment: task dev-setup or task cluster-create can spin up the cluster; ArgoCD automatically syncs from environments/dev.
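
The Taskfile targets mentioned above could look roughly like the sketch below; this is a minimal illustration, and the cluster name, node count, and ArgoCD bootstrap steps are assumptions rather than the repository's actual Taskfile.

version: '3'

tasks:
  cluster-create:
    desc: Create a local ephemeral k3d cluster for development
    cmds:
      - k3d cluster create dev --agents 2 --wait

  dev-setup:
    desc: Create the cluster and bootstrap ArgoCD so it can sync environments/dev
    cmds:
      - task: cluster-create
      - kubectl create namespace argocd --dry-run=client -o yaml | kubectl apply -f -
      - kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

Tearing the cluster down again is a single k3d cluster delete dev, which is what keeps this environment ephemeral.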

Staging (k3s)

  • Purpose: A realistic but lightweight environment on a pinned k3s version.
  • Terraform: terraform/k3s sets up nodes, networking, domain, etc.
  • Integration: Full data pipeline flows (ingestion, transformation) run here for final QA before prod.
  • Argo Workflows: Typically, you push your data/ML workflow definitions to data-pipelines/, which the staging ArgoCD instance picks up.

Production (AWS)

  • Purpose: The final environment for large-scale data ingestion, AI model training, real-time inference, etc.
  • Terraform: terraform/aws provisions the EKS cluster, VPC, subnets, domain, and secrets in AWS Parameter Store or Vault.
  • ArgoCD: Syncs from environments/production/, deploying the same pipeline definitions but scaled up (a minimal Application sketch follows this list).
  • Performance: Larger node sizes, GPU-based instances for deep learning, etc.
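
As a rough illustration of that production sync, an ArgoCD Application could look like the sketch below; the repository URL, target namespace, and path are placeholders, not the platform's actual values.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-data-platform
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/tf-k3s-template.git   # placeholder repository URL
    targetRevision: main
    path: registry/environments/production
  destination:
    server: https://kubernetes.default.svc
    namespace: data-platform                                      # placeholder target namespace
  syncPolicy:
    automated:
      prune: true        # remove resources that were deleted in Git
      selfHeal: true     # revert manual drift back to the Git state

With automated sync, prune, and self-heal enabled, ArgoCD keeps production converged on whatever is committed to that path.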

Core Components & Tools

  1. ArgoCD: Primary GitOps engine across dev/staging/prod.
  2. Argo Workflows: CI-like data transformations, ML pipeline orchestration.
  3. Vault or External Secrets: Secure secrets management (see the ExternalSecret sketch after this list).
  4. Terraform: Infrastructure as code for k3s (staging) and AWS/EKS (prod).
  5. Taskfile: A simple CLI orchestrator for local tasks: spinning up dev clusters, applying mgmt YAML, running test checks.
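
For the secrets piece, an External Secrets Operator resource pulling a credential from Vault could look like this minimal sketch; the secret store name, Vault path, and keys are assumptions for illustration.

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: model-registry-credentials
  namespace: data-platform
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend                  # assumed ClusterSecretStore pointing at Vault
    kind: ClusterSecretStore
  target:
    name: model-registry-credentials     # Kubernetes Secret created by the operator
  data:
    - secretKey: password
      remoteRef:
        key: data-platform/model-registry   # assumed Vault path
        property: password

The operator then materializes a regular Kubernetes Secret that pipelines and inference services can mount, without the value ever being stored in Git.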

GitOps Workflow

We maintain a trunk-based or branch-based approach:

  1. Dev branches target environments/dev for local k3d testing.
  2. Staging merges confirm readiness in environments/staging.
  3. Production merges finalize updates to environments/production.

See our Mermaid diagram in .mermaid-diagrams/gitops-flow.mmd for a visual representation of multi-branch data changes.

GitOps Flow

This GitOps strategy uses Terraform for infrastructure as code, applies GitOps principles to drive automated deployments, and segments our environments and feature branches to keep operations robust, secure, and agile.

By combining Terraform-driven IaC with a clear multi-branch workflow, hotfixes, AWS cloud-foundation enhancements, and developer platform integrations are validated and deployed consistently across our environments, from local k3d development through k3s staging to multi-cloud production. The goal is agility, security, and operational consistency in every release.

  • DevContainer Flow:

    • Dev Environment (k3d): Rapid fixes are applied and validated locally using k3d clusters.
    • Staging Environment (k3s): Changes are promoted for integration testing on k3s clusters.
  • Feature Branches:

    • AWS Cloud-Foundation: Focuses on establishing and evolving our AWS cloud infrastructure using Terraform modules.
    • Backstage Software Catalog & Developer Platform: Drives improvements in our internal developer experience and tooling integration.
  • Release Management:

    • Controlled merging from development through staging and into production, ensuring that every commit is automatically validated and deployed.

Detailed GitOps Workflow

  1. Initialization & Base Setup

    • The repository is initialized with a base configuration that includes Terraform modules for our GitOps platform. This sets up the initial infrastructure and defines our multi-cloud foundation.
  2. Development Branch (develop)

    • All day-to-day changes and experiments are committed on the develop branch.
    • This branch contains core Terraform configurations and GitOps automation components (e.g., ArgoCD configurations).
  3. DevContainer & Hotfix Branch (hotfix)

    • When an urgent fix is needed—such as addressing a k3d-related issue in development—a dedicated hotfix branch is created.
    • Once validated in the Dev environment (k3d), the hotfix is merged back into develop to ensure that the fix is propagated.
  4. Feature Branches

    • Feature1 (AWS Cloud-Foundation):
      • Dedicated branch where changes to AWS-specific Terraform modules are developed and tested.
      • After successful local validation, the changes merge into develop, ensuring integration with the existing Terraform state and modules.
    • Feature2 (Backstage Developer Platform):
      • Dedicated branch to integrate and enhance Backstage (or similar developer portal) components.
      • Once integrated and tested, these changes merge back into develop.
  5. Staging Environment (staging)

    • A separate branch is maintained to deploy and test integrated changes in a staging environment (using k3s).
    • This branch receives updates from develop after hotfixes and feature integrations are merged.
    • Automated pipelines validate the end-to-end workflow in a staging scenario before production promotion.
  6. Production Promotion (main and release)

    • Once staging validations are complete, the develop branch is merged into main.
    • A release branch is then used to bundle and finalize production release candidates.
    • Final promotion commits trigger production deployments, ensuring high availability across our multi-cloud platforms.

Key Points & Best Practices

  • Infrastructure as Code (IaC):

    • All changes are codified using Terraform, ensuring consistency and reproducibility across multi-cloud environments (AWS, Azure, etc.).
  • Automated CI/CD Pipelines:

    • Every merge triggers automated pipelines that validate syntax, security policies, and compliance standards before applying changes; for the Terraform side, see the Atlantis configuration sketch after this list.
    • Environments are provisioned and updated using GitOps tools (such as ArgoCD) that monitor the Git repository as the single source of truth.
  • Environment Isolation:

    • Dev (k3d): Rapid iteration and testing occur locally.
    • Staging (k3s): Pre-production tests validate full integration.
    • Production (Multi-Cloud): Production releases are handled via controlled, well-tested merge and release processes.
  • Branch Naming & Semantic Versioning:

    • Each branch and commit is annotated to ensure traceability—from hotfixes and feature updates to full production releases.
    • Version tags (e.g., v0.1, v0.2, etc.) are applied to critical commits, enabling precise rollbacks if necessary.
  • Scalability & Security:

    • The strategy supports seamless integration of multi-cloud components, ensuring scalability.
    • Automated security checks, compliance audits, and monitoring (using integrated tools like Prometheus, Grafana, or ELK) are standard.
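
For the Terraform part of those pipelines, the Atlantis setup referenced in this post is typically driven by a repo-level atlantis.yaml; the sketch below is illustrative, and the project names, directories, and variable-file paths are assumptions rather than the repository's actual configuration.

version: 3
projects:
  - name: k3s-staging
    dir: terraform/k3s
    autoplan:
      when_modified: ["*.tf", "../../environments/staging/*.tfvars"]
    apply_requirements: [approved]      # only honored if the server-side config allows this override
  - name: aws-production
    dir: terraform/aws
    autoplan:
      when_modified: ["*.tf", "../../environments/production/*.tfvars"]
    apply_requirements: [approved]

With this in place, opening a PR that touches either directory would trigger an automatic terraform plan comment, and apply would only run after approval.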

Data & AI/ML Pipelines

  1. Data Pipelines: Ingestion from S3 or external sources, then transformations via Argo Workflows.
  2. AI/ML Training: Model training steps defined as Workflow DAGs referencing GPU-based nodes in staging/prod (see the workflow sketch after this list).
  3. Inference: Real-time or batch predictions served via a microservice, continuously updated by ArgoCD from registry/<environment> paths.
  4. Atlantis: Any Terraform changes to data-related infrastructure (e.g., S3 buckets, ECR for model images) are planned and applied from the pull request, ensuring safe changes.
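
A training pipeline expressed as an Argo Workflows DAG could look roughly like the sketch below; the container images, namespace, and step contents are placeholders, not the platform's actual pipeline definitions.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: train-model-
  namespace: argo
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      dag:
        tasks:
          - name: preprocess
            template: preprocess
          - name: train
            template: train
            dependencies: [preprocess]
    - name: preprocess
      container:
        image: python:3.11-slim                                   # placeholder image
        command: [python, -c]
        args: ["print('ingest and transform training data here')"]
    - name: train
      container:
        image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime      # placeholder GPU image
        command: [python, -c]
        args: ["print('run the training step here')"]
        resources:
          limits:
            nvidia.com/gpu: 1                                     # schedule onto a GPU node

Committing a definition like this to the data-pipelines/ path lets ArgoCD roll it out to staging or production without manual kubectl steps.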

Installation & Setup

  1. Local Dev:

    • Prerequisites: Docker, k3d, terraform, task.
    • task dev-setup or task cluster-create (depending on your Taskfile definitions).
    • task mgmt-manual-apply: apply mgmt-plane YAML to dev.
    • task test-all: checks pods, namespaces, Terraform code validity.
  2. Staging:

    • cd terraform/k3s
    • terraform init && terraform plan -var-file="../../environments/staging/terraform.tfvars"
    • terraform apply -auto-approve -var-file="../../environments/staging/terraform.tfvars"
    • ArgoCD picks up environments/staging changes, deploys your data pipelines, etc.
  3. Production:

    • cd terraform/aws
    • terraform init && terraform plan -var-file="../../environments/production/terraform.tfvars"
    • terraform apply -auto-approve -var-file="../../environments/production/terraform.tfvars"
    • Ensure your model training DAGs, inference services, or any advanced data flows are pinned to the environments/production folder, letting ArgoCD orchestrate them at scale.

Testing & Validation

  • task test-provision: Runs terraform validate or terraform plan for k3s or AWS code.
  • task test-deployed: Checks pods/namespaces in each environment.
  • task test-all: Aggregates both.
  • ArgoCD UI: watch for “Healthy” and “Synced” states in dev/staging/prod.
  • Argo Workflows: Data pipeline runs can be triggered by commits, verifying transformations and AI tasks succeed end-to-end.
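
The test tasks listed above could be wired together in the Taskfile roughly as follows; this is a sketch that assumes the Terraform directories are already initialized, and the real task definitions may differ.

version: '3'

tasks:
  test-provision:
    desc: Validate the Terraform code for k3s and AWS
    cmds:
      - terraform -chdir=terraform/k3s validate
      - terraform -chdir=terraform/aws validate

  test-deployed:
    desc: Check namespaces and pod health in the current cluster
    cmds:
      - kubectl get namespaces
      - kubectl get pods --all-namespaces

  test-all:
    desc: Run the provisioning and deployment checks together
    cmds:
      - task: test-provision
      - task: test-deployed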

Advanced Topics

  • vclusters: Some data teams isolate ephemeral dev/test pipelines in a “virtual cluster” inside staging or dev. See cluster-types/workload-vcluster/ for example YAML definitions.
  • GPU Workloads: For AI training, staging might run a single GPU node, while production can scale out to multiple GPU-based instance groups (see the pod sketch after this list).
  • Multi-Account AWS: Some teams store dev/staging in one AWS account, production in another. The same GitOps approach remains valid.
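
As a quick smoke test for such GPU nodes, a pod like the following sketch can verify scheduling and CUDA availability; the instance-type label, namespace, and image are assumptions for illustration.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
  namespace: data-platform                        # placeholder namespace
spec:
  restartPolicy: Never
  nodeSelector:
    node.kubernetes.io/instance-type: g5.xlarge   # assumed GPU node group in production
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda-check
      image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime        # placeholder GPU image
      command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
      resources:
        limits:
          nvidia.com/gpu: 1

If the pod logs print True, the GPU node group, drivers, and scheduling constraints are wired up correctly.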

Contributing

  1. Fork or branch from main.
  2. Add or modify environment code in environments/<dev|staging|production> or terraform/<k3s|aws>.
  3. Open a PR. Atlantis or your chosen CI pipeline comments with plan results.
  4. Review & merge. ArgoCD and your environment watchers do the rest.

We hope this multi-environment GitOps approach empowers your data & AI/ML workflows, ensuring consistent, automated deployments from local dev to production scale in AWS.


DevOps Docker & DevContainer


Overview

The nnthanh101/terraform:latest Docker image is a secure, lightweight, and production-ready environment tailored for modern CloudOps and DevOps workflows. Built on Chainguard's Wolfi Linux, this image incorporates best practices for multi-cloud, Infrastructure-as-Code (IaC), and Kubernetes ecosystem management.

Designed to meet the demands of multi-cloud environments and enterprise-grade automation, it includes tools for provisioning, configuration management, orchestration, and secrets management. The devops tag extends its functionality with Kubernetes tooling, making it ideal for hybrid-cloud operations.
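
One way to try the image locally is a small docker-compose service like the sketch below; the mounted paths and credentials are assumptions you would adapt to your own setup.

services:
  terraform:
    image: nnthanh101/terraform:latest
    working_dir: /workspace
    volumes:
      - ./:/workspace                   # mount the current IaC repository
      - ~/.aws:/root/.aws:ro            # reuse local AWS credentials (adjust as needed)
    entrypoint: ["sleep", "infinity"]   # keep the container running for interactive exec sessions

From there, docker compose exec terraform terraform version (or kubectl with the devops tag) should give the same toolchain on every machine.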

CloudOps Docker Container


Overview

The nnthanh101/runbooks:latest image is a secure, lightweight, and production-grade Python environment built on Chainguard's Wolfi Base. This image has been optimized to support multi-cloud environments (AWS, Azure) and cross-platform workflows for CloudOps, FinOps, Analytics, AI, and Data Science projects.

With a focus on modern CloudOps and DevOps practices, this image incorporates security, maintainability, and scalability into its design. It integrates essential extensions like MkDocs, JupyterLab, and Vizro for documentation and analytics workflows.