AutoScaleOps: I Built a Production-Grade DevSecOps Platform From Scratch — Here's Everything


A month of late nights, broken pipelines, unfixable CVEs, and one cluster that refused to scale — here's the full story of building AutoScaleOps: a complete CI/CD platform with security gates, GitOps, auto-scaling, and zero manual deployments.



Why I Built This

I've always believed that you don't truly understand a tool until you've broken it yourself. I've used Kubernetes, seen ArgoCD in action, and read about security gates in CI pipelines — but I'd never wired all of it together from scratch.

So I spent a month doing exactly that. No managed cloud Kubernetes. No "getting started" tutorials. Just a blank repo, Terraform, and a stubborn refusal to move on to the next phase until the current one actually worked.

The result is AutoScaleOps: a full DevSecOps CI/CD platform that:

  • Provisions a multi-node KinD cluster in under 5 minutes (it used to take 17 — more on that)

  • Runs every Docker build through a Trivy security gate that blocks HIGH/CRITICAL CVEs from reaching prod

  • Deploys via GitOps — ArgoCD watches the repo, not a shell script

  • Auto-scales pods under real load, validated by k6 load tests

  • Tracks everything through Grafana dashboards across dev and prod

  • Uses ArgoCD Image Updater to make the entire flow from git push to production completely hands-free

Let's walk through every decision, every config, and every problem I ran into.


The Full Architecture

The flow, from left to right:

  1. Developer pushes to main

  2. GitHub Actions builds the Docker image with layer caching

  3. Trivy scans the image — blocks on HIGH/CRITICAL CVEs

  4. If it passes, the image is promoted for the dev environment: it is pushed to GHCR and the Helm values file is updated with the new tag

  5. ArgoCD detects the drift and syncs the cluster

  6. HPA scales pods up/down based on CPU

  7. Grafana dashboards track everything in real time

  8. ArgoCD Image Updater handles prod promotion automatically


Phase 1 — Terraform: From 17 Minutes to 5

The problem

Every time I needed to tear down and rebuild the environment — and during development, that happened a lot — I was doing it manually. Create the KinD cluster, wait, install ArgoCD, wait for pods to be ready, port-forward, apply the root app... It was repetitive, error-prone, and genuinely the most annoying part of the day.

I clocked it at 17 minutes from kind create cluster to a fully working ArgoCD with all apps synced.

The solution

I moved the entire setup into Terraform with a bootstrap/start.sh script that does everything in one shot.

Core cluster config (terraform/main.tf):

terraform {
  required_providers {
    kind = {
      source  = "tehcyx/kind"
      version = "0.5.1"
    }
  }
}

provider "kind" {}

resource "kind_cluster" "dev" {
  name       = "devops-cluster"
  node_image = "kindest/node:v1.29.2"

  kind_config {
    kind        = "Cluster"
    api_version = "kind.x-k8s.io/v1alpha4"

    networking {
      api_server_address = "0.0.0.0"
      api_server_port    = 6443
    }

    node {
      role = "control-plane"

      extra_port_mappings {      # Application Service
        container_port = 30007   # NodePort inside cluster for DEV ENV.
        host_port      = 30007   # Port on your machine
        protocol       = "TCP"
      }

      extra_port_mappings {      # ArgoCD
        container_port = 30008   # NodePort inside cluster
        host_port      = 30008   # Port on your machine
        protocol       = "TCP"
      }

      extra_port_mappings {      # Grafana
        container_port = 30009   # NodePort inside cluster
        host_port      = 30009   # Port on your machine
        protocol       = "TCP"
      }

      extra_port_mappings {      # Grafana
        container_port = 30010   # NodePort inside cluster
        host_port      = 30010   # Port on your machine
        protocol       = "TCP"
      }

      extra_port_mappings {
        container_port = 30011   # Application
        host_port      = 30011   # For PROD env setup
        protocol       = "TCP"
      }
    }

    node {
      role = "worker"
    }

    node {
      role = "worker"
    }
  }
}

📄 Full main.tf (with provider config, kubeconfig output, and null resource provisioners)

Bootstrap script (bootstrap/start.sh):

#!/bin/bash
set -e


ENV_FILE="../.env"
if [ -f "$ENV_FILE" ]; then
    echo "🔑 Loading credentials from $ENV_FILE..."
    # Set permissions to 600 (Owner Read/Write Only) for extra safety
    chmod 600 "$ENV_FILE"
    export $(grep -v '^#' "$ENV_FILE" | xargs)
else
    echo "⚠️  $ENV_FILE not found!"
    read -p "Enter GitHub Username: " GITHUB_USER
    read -sp "Enter GitHub PAT (Input hidden): " GITHUB_TOKEN
    echo ""
fi

if [ -z "$GITHUB_TOKEN" ]; then
    echo "❌ Error: GitHub Token is required for Image Updater."
    exit 1
fi
# --------------------------------

echo "🚀 Creating cluster..."
terraform -chdir=../terraform apply -auto-approve

echo "📦 Installing Argo CD..."
kubectl create namespace argocd || true
kubectl apply -n argocd --server-side --force-conflicts \
  -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml


# Install Image Updater
echo "📦 Installing Argo CD Image Updater..."
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj-labs/argocd-image-updater/stable/config/install.yaml


echo "🔐 Creating Git credentials for Image Updater..."

kubectl create secret generic git-creds \
  -n argocd \
  --from-literal=username="$GITHUB_USER" \
  --from-literal=password="$GITHUB_TOKEN" \
  --dry-run=client -o yaml | kubectl apply -f -


echo "🌐 Applying custom Argo CD Service..."
kubectl apply -f ../argocd/argocd-server-nodeport.yaml -n argocd

echo "⏳ Waiting for Argo CD Server to be ready..."
kubectl wait --for=condition=available deployment/argocd-server -n argocd --timeout=300s

# Argo CD manages apps
echo "⚙️ Deploying apps via Argo CD..."
kubectl apply --server-side -f ../argocd/root-app.yaml

echo "⏳ Waiting for monitoring stack (Grafana) to be ready..."
sleep 30

# 🔑 Argo CD Password
echo "🔑 Fetching Argo CD Admin Password..."
ARGOPASS=$(kubectl -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath="{.data.password}" | base64 --decode)

# 🔑 Grafana Password
echo "🔑 Fetching Grafana Admin Password..."
GRAFANAPASS=$(kubectl -n monitoring get secret monitoring-grafana \
  -o jsonpath="{.data.admin-password}" 2>/dev/null | base64 --decode || true)
# Fall back to a placeholder when the secret does not exist yet
GRAFANAPASS="${GRAFANAPASS:-Grafana not ready yet}"


echo "------------------------------------------------------"
echo "✅ SETUP COMPLETE!"
echo ""
echo "📍 Argo CD URL: http://localhost:30008"
echo "👤 Argo Username: admin"
echo "🔐 Argo Password: $ARGOPASS"
echo ""
echo "📍 Grafana URL: http://localhost:30009"
echo "👤 Grafana Username: admin"
echo "🔐 Grafana Password: $GRAFANAPASS"
echo "------------------------------------------------------"

📄 Full script with error handling

The script is idempotent: running it twice doesn't break anything. Terraform handles the cluster readiness check, and the ArgoCD kubectl wait blocks until the server pod is actually up, not just scheduled.

Result: 17 minutes → < 5 minutes. 70% faster.
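
As a quick check on the arithmetic behind that figure, here is a minimal sketch that computes the time saved as a share of the original duration. The function name is mine, not something from the repo.

```python
def percent_faster(before_min: float, after_min: float) -> float:
    """Return the reduction in elapsed time as a percentage of the original."""
    if before_min <= 0:
        raise ValueError("before_min must be positive")
    return (before_min - after_min) / before_min * 100


# Figures from this section: 17 minutes before, 5 minutes after.
print(f"{percent_faster(17, 5):.1f}% faster")  # prints "70.6% faster"
```

Since the new run finishes in under 5 minutes, roughly 70% is the lower bound on the improvement.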


Phase 2 — ArgoCD App-of-Apps with Custom Helm Charts

Why App-of-Apps?

The naive approach is to kubectl apply every ArgoCD Application manifest yourself. That works fine for two or three apps, but it breaks the spirit of GitOps — you're still doing manual work on every environment change.

The App-of-Apps pattern fixes this. One root app manages all child apps. Add a YAML file to argocd/apps/, and ArgoCD picks it up automatically.

Root app (argocd/root-app.yaml):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-app
  namespace: argocd
spec:
  project: default

  source:
    repoURL: https://github.com/psaad2400/gitops-k8s-devops-platform
    targetRevision: main
    path: argocd/apps

  destination:
    server: https://kubernetes.default.svc
    namespace: default

  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - ServerSideApply=true

This single file, applied once, brings up everything else. The selfHeal: true flag means if someone manually edits a resource in the cluster, ArgoCD will correct it back to what Git says. No config drift.

Custom Helm Chart

The app is deployed via a Helm chart in AutoScaleOps/. Same chart, different values files for dev and prod:

# AutoScaleOps/values-dev.yaml (key section)
replicaCount: 1

image:
  repository: saadpatel2400/devops-app
  tag: dev-d6189e7

service:
  nodePort: 30007 # dev env assigned port 30007

hpa:
  minReplicas: 1
  maxReplicas: 3

# AutoScaleOps/values-prod.yaml (key section)
replicaCount: 2

image:
  tag: "prod"         # CI promotes here only on clean scan

service:
  nodePort: 30011   # assigned another port for prod env

hpa:
  minReplicas: 2
  maxReplicas: 6

Below is values.yaml, the base config applied to both dev and prod. The values-dev.yaml and values-prod.yaml files above override these base values as each environment requires.

# AutoScaleOps/values.yaml

replicaCount: 2

image:
  repository: saadpatel2400/devops-app
  tag: "v1"
  pullPolicy: IfNotPresent

service:
  type: NodePort
  port: 80
  targetPort: 5000
  nodePort: 30007

resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "500m"
    memory: "256Mi"

livenessProbe:
  path: /health
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  path: /ready
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 6

hpa:
  enabled: true
  minReplicas: 2
  maxReplicas: 6
  targetCPUUtilizationPercentage: 50
  # Scale-down behavior, read by templates/hpa.yaml
  scaleDown:
    stabilizationWindowSeconds: 60
    policies:
      - type: Percent
        value: 50
        periodSeconds: 15

📄 Full chart templates (deployment, service): https://github.com/psaad2400/gitops-k8s-devops-platform/tree/main/AutoScaleOps
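
To make the base-plus-override layering concrete, here is a rough Python sketch of how nested maps are merged when an environment file is layered on top of values.yaml. It only approximates Helm's behavior (maps merge recursively, scalars are replaced by the later file); merge_values is a name I made up, and the dictionaries are trimmed copies of the files above.

```python
from copy import deepcopy


def merge_values(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base; override wins on conflicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are maps: merge key by key instead of replacing.
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Trimmed copies of values.yaml and values-dev.yaml from this section.
base = {
    "replicaCount": 2,
    "image": {"repository": "saadpatel2400/devops-app", "tag": "v1"},
    "service": {"type": "NodePort", "nodePort": 30007},
    "hpa": {"enabled": True, "minReplicas": 2, "maxReplicas": 6,
            "targetCPUUtilizationPercentage": 50},
}
dev = {
    "replicaCount": 1,
    "image": {"tag": "dev-d6189e7"},
    "hpa": {"minReplicas": 1, "maxReplicas": 3},
}

effective = merge_values(base, dev)
print(effective["hpa"])
```

The dev file only states what differs, so keys such as hpa.enabled and targetCPUUtilizationPercentage still come from the base file.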


Phase 3 — GitHub Actions CI with Layer Caching

The pipeline structure

Every push to main triggers the CI pipeline. Four stages, in order:

  1. Build the Docker image using Buildx with GitHub Actions cache

  2. Scan with Trivy (pipeline fails here if CVEs are found)

  3. Push to GHCR (only reached if scan passes)

  4. Update the image tag in values-dev.yaml and push the commit

The build step with caching (.github/workflows/ci.yml key section):

build:
    name: 🏗️ Build Docker Image
    runs-on: self-hosted

    steps:
      - name: Checkout Repository
        uses: actions/checkout@v4

      - name: Set Up Docker Buildx
        uses: docker/setup-buildx-action@v3
        with:
          driver-opts: network=host

      - name: Log In to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}

      - name: Build Image (load only, no push)
        uses: docker/build-push-action@v5
        with:
          context: ./app
          file: ./app/Dockerfile
          platforms: linux/amd64
          push: false
          load: true
          tags: ${{ env.SHA_TAG }}
          cache-from: |
            type=registry,ref=${{ secrets.DOCKER_USERNAME }}/devops-app:cache-${{ github.ref_name }}
            type=registry,ref=${{ secrets.DOCKER_USERNAME }}/devops-app:cache-main
            type=gha,scope=buildkit-${{ github.ref_name }}
            type=gha,scope=buildkit-main
          cache-to: |
            type=registry,ref=${{ secrets.DOCKER_USERNAME }}/devops-app:cache-${{ github.ref_name }},mode=max
            type=gha,scope=buildkit-${{ github.ref_name }},mode=max

      - name: Export Image as Tar to /tmp
        run: |
          docker save ${{ env.SHA_TAG }} -o ${{ env.IMAGE_TAR }}
          echo "📦 Saved: $(du -sh ${{ env.IMAGE_TAR }} | cut -f1)"

The cache-from: type=gha / cache-to: type=gha,mode=max pair is where most of the build time savings come from. On the first run, every layer is built fresh. On subsequent runs — even across different commits — Docker reuses cached layers for anything that hasn't changed (OS packages, pip installs, etc). Only the changed application code gets rebuilt.

The image tag update step:

- name: Update image tag in values
  if: success()    # only runs if Trivy passed
  run: |
    sed -i "s/tag:.*/tag: ${{ github.sha }}/" AutoScaleOps/values-dev.yaml
    git config user.email "ci@autoscaleops"
    git config user.name "CI Bot"
    git add AutoScaleOps/values-dev.yaml
    git commit -m "ci: update image tag to ${{ github.sha }}"
    git push

This commit to values-dev.yaml is what triggers ArgoCD. It detects the repo changed, compares the desired state (new tag) to the actual state (old tag), and syncs.
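
For readers who want to see what the tag rewrite does to the file, here is a small Python sketch of the same substitution. It is an illustration, not what the pipeline runs: set_image_tag and the new tag value are hypothetical, and unlike the sed command (which rewrites every line containing tag:) it touches only the first match.

```python
import re


def set_image_tag(values_text: str, new_tag: str) -> str:
    """Replace the value of the first `tag:` key, keeping its indentation."""
    # Match only the "tag: ..." part so leading whitespace is left untouched.
    # A function replacement avoids backslash handling in the new tag.
    return re.sub(r"tag:.*", lambda _m: f"tag: {new_tag}", values_text, count=1)


values_dev = """replicaCount: 1
image:
  repository: saadpatel2400/devops-app
  tag: dev-d6189e7
service:
  nodePort: 30007
"""

updated = set_image_tag(values_dev, "dev-0a1b2c3")  # hypothetical new tag
print(updated)
```

Because the pattern matches everything after tag:, any trailing comment on that line is dropped as well, which is also true of the sed version.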

📄 Full ci.yml with all steps and environment variables: https://github.com/psaad2400/gitops-k8s-devops-platform/blob/main/.github/workflows/ci.yml

Self-hosted WSL runner

I ran the CI on a self-hosted WSL runner instead of GitHub-hosted runners. The advantages:

  • No cold-start overhead on the runner itself

  • Docker is already running — no setup step

  • The GitHub Actions layer cache (type=gha) works better with a persistent runner

Setting up a self-hosted runner took about 10 minutes — GitHub's runner application just needs to be registered with a token from the repo settings.

Result: Build time ~8 min → ~4 min. 50% faster.

  1. Downloading and configuring the GitHub self-hosted runner on the local machine

  2. Status showing offline since the runner is not started

  3. Running the GitHub self-hosted runner using the run.sh file

  4. Status now showing green (online) and Idle since no job is scheduled

  5. Workflow execution assigning jobs to the GitHub self-hosted runner

  6. Status showing Active since jobs are being executed on the self-hosted runner


Phase 4 — Trivy Security Gates (and the Unfixed CVE Problem)

This phase had the most interesting debugging story of the entire project, and I want to document it properly because I couldn't find a clear write-up when I was stuck on it.

The basic gate

Trivy scans the built image and fails the pipeline (exit-code: '1') if it finds any HIGH or CRITICAL vulnerabilities. If the pipeline fails, the image never gets pushed, values-dev.yaml never gets updated, and ArgoCD never deploys the bad image.

security-scan:
    name: 🔒 Trivy Security Scan
    runs-on: self-hosted
    needs: build
    outputs:
      scan-status: ${{ steps.set-result.outputs.result }}
      high-count:  ${{ steps.set-result.outputs.high }}
      crit-count:  ${{ steps.set-result.outputs.critical }}
      total-count: ${{ steps.set-result.outputs.total }}

    steps:
      - name: Load Docker Image
        run: docker load -i ${{ env.IMAGE_TAR }}

      - name: Run Trivy Scan (JSON + hard fail on findings)
        run: |
          docker run --rm \
            -v /var/run/docker.sock:/var/run/docker.sock \
            -v $GITHUB_WORKSPACE:/output \
            aquasec/trivy:0.69.1 image \
            --severity HIGH,CRITICAL \
            --ignore-unfixed \
            --format json \
            --output /output/${{ env.TRIVY_REPORT_PATH }} \
            --exit-code 1 \
            ${{ env.SHA_TAG }}

      - name: Debug — Print Trivy Report
        if: always()
        run: |
          echo "📄 Report: $GITHUB_WORKSPACE/${{ env.TRIVY_REPORT_PATH }}"
          cat $GITHUB_WORKSPACE/${{ env.TRIVY_REPORT_PATH }} | jq '.Results | length'

      - name: Parse Report & Set Outputs
        id: set-result
        if: always()
        run: |
          REPORT="$GITHUB_WORKSPACE/${{ env.TRIVY_REPORT_PATH }}"

          HIGH=$(jq '[.Results[]?.Vulnerabilities[]? | select(.Severity=="HIGH")] | length' "$REPORT")
          CRITICAL=$(jq '[.Results[]?.Vulnerabilities[]? | select(.Severity=="CRITICAL")] | length' "$REPORT")
          TOTAL=$((HIGH + CRITICAL))

          echo "high=$HIGH"         >> $GITHUB_OUTPUT
          echo "critical=$CRITICAL" >> $GITHUB_OUTPUT
          echo "total=$TOTAL"       >> $GITHUB_OUTPUT

          if [ "$TOTAL" -gt 0 ]; then
            echo "result=fail" >> $GITHUB_OUTPUT
            echo "❌ Scan FAILED — HIGH: $HIGH | CRITICAL: $CRITICAL | Total: $TOTAL"
          else
            echo "result=pass" >> $GITHUB_OUTPUT
            echo "✅ Scan PASSED — no HIGH/CRITICAL vulnerabilities."
          fi

      - name: Upload Trivy JSON Report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: trivy-vulnerability-report
          path: ${{ env.TRIVY_REPORT_PATH }}
          retention-days: 30
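
If the jq filters in the parse step are hard to read, the same counting logic can be sketched in Python. This is only an illustration of what the step computes: count_by_severity is my own name, and the sample report is a hypothetical one trimmed to the fields the filters read (Results, Vulnerabilities, Severity).

```python
import json


def count_by_severity(report: dict) -> dict:
    """Count HIGH and CRITICAL findings across every result in the report."""
    counts = {"HIGH": 0, "CRITICAL": 0}
    for result in report.get("Results") or []:
        # "Vulnerabilities" can be missing or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            severity = vuln.get("Severity")
            if severity in counts:
                counts[severity] += 1
    return counts


# Hypothetical report; the CVE IDs are placeholders, not real findings.
sample = json.loads("""
{
  "Results": [
    {"Target": "app (debian)", "Vulnerabilities": [
      {"VulnerabilityID": "CVE-0000-0001", "Severity": "HIGH"},
      {"VulnerabilityID": "CVE-0000-0002", "Severity": "CRITICAL"}
    ]},
    {"Target": "requirements.txt"}
  ]
}
""")

counts = count_by_severity(sample)
total = counts["HIGH"] + counts["CRITICAL"]
print("result=fail" if total > 0 else "result=pass")
```

The `or []` guards play the role of the `?` operators in the jq expressions: a target with no findings contributes zero instead of raising an error.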

I also pinned all dependencies in app/requirements.txt to eliminate floating version surprises:

# Before — floating versions, surprise CVEs on every build
flask
requests
gunicorn

# After — pinned, reproducible, auditable
flask==3.0.3
requests==2.31.0
gunicorn==22.0.0

[SCREENSHOT: GitHub Actions step showing Trivy blocking a build — output table with HIGH CVEs found, pipeline marked as failed]

The problem: unfixed CVEs

After pinning dependencies and resolving the clearly fixable CVEs, the pipeline was still failing consistently. Trivy was flagging CVEs with a status of will_not_fix, meaning the upstream maintainer had acknowledged the vulnerability but no patched version was available.

The Trivy output looked like this:

1. Output before fixing versions and vulnerabilities:

# requirements.txt before the fix
flask==2.3.3
prometheus-flask-exporter

The report (trivy-vulnerability-report.json) is stored as a GitHub Actions artifact.

2. Fixing the image's vulnerabilities:

# requirements.txt after the fix
flask==2.3.3
prometheus-flask-exporter
# Explicitly patching vulnerable sub-dependencies
wheel>=0.46.2
jaraco.context>=6.1.0

All HIGH and CRITICAL vulnerabilities with known fixes were resolved (vulnerabilities without a fix are skipped via --ignore-unfixed).

There is nothing you can do about a will_not_fix vulnerability in the short term. You cannot upgrade to a fixed version because there is no fixed version. You can't remove the library if it's a transitive dependency of something else. You just have to acknowledge it and accept the risk.

But Trivy, by default, treats it exactly the same as a CVE that has a patch available that you simply haven't applied yet.

The fix: --ignore-unfixed

- name: Run Trivy Scan (JSON + hard fail on findings)
  run: |
    docker run --rm \
       -v /var/run/docker.sock:/var/run/docker.sock \
       -v $GITHUB_WORKSPACE:/output \
       aquasec/trivy:0.69.1 image \
       --severity HIGH,CRITICAL \
       --ignore-unfixed \
       --format json \
       --output /output/${{ env.TRIVY_REPORT_PATH }} \
       --exit-code 1 \
       ${{ env.SHA_TAG }}


# Note: I used the flag below for testing only. I had already patched most of the HIGH and CRITICAL vulnerabilities that had fixes, but the pipeline kept failing on vulnerabilities with no patch available.

ignore-unfixed: true    # only block on CVEs that actually have a patch

This flag tells Trivy: "Only fail the build on vulnerabilities where a fix exists." If there's no patch, don't block — because blocking doesn't help anyone. It just creates alert fatigue.

This is the correct security posture. You're still blocking every vulnerability where you have no excuse not to patch. You're just not punishing yourself for vulnerabilities that are out of your control.

After adding this flag, the security gate started doing exactly what it's supposed to: blocking real, fixable problems while letting the pipeline flow normally otherwise.
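
Conceptually, the flag turns the gate into a filter on whether a fix exists. The sketch below shows that idea in Python, assuming each finding in the JSON report carries a FixedVersion field when a patched release is known. That field name is my assumption about the report layout, and the two findings are hypothetical examples, not output from my pipeline.

```python
def blocking_findings(vulnerabilities: list) -> list:
    """Keep only findings that have a fixed version available."""
    # Assumption: an empty or missing "FixedVersion" means no patch exists yet.
    return [v for v in vulnerabilities if v.get("FixedVersion")]


# Hypothetical findings for illustration only.
findings = [
    {"PkgName": "wheel", "Severity": "HIGH", "FixedVersion": "0.46.2"},
    {"PkgName": "libexample", "Severity": "CRITICAL", "Status": "will_not_fix"},
]

blockers = blocking_findings(findings)
print([v["PkgName"] for v in blockers])  # prints ['wheel']
```

Only the first finding would fail the build, because it is the only one that an upgrade can actually resolve.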

[SCREENSHOT: Trivy scan passing after --ignore-unfixed — showing "0 vulnerabilities found with available fixes" in green]

Result: Deployment time to prod dropped by 68% — largely because the security gate stopped being the unpredictable blocker it had been, and pipelines started completing reliably.


Phase 5 — HPA + k6 Load Testing

The HPA config

The Horizontal Pod Autoscaler watches CPU utilization and scales between 1 and 5 replicas:

# AutoScaleOps/templates/hpa.yaml

# values for the configuration are derived from 
# AutoScaleOps/values.yaml

{{- if .Values.hpa.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "autoscaleops.fullname" . }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "autoscaleops.fullname" . }}
  minReplicas: {{ .Values.hpa.minReplicas }}
  maxReplicas: {{ .Values.hpa.maxReplicas }}
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.hpa.targetCPUUtilizationPercentage }}
  behavior:
    scaleDown:
      stabilizationWindowSeconds: {{ .Values.hpa.scaleDown.stabilizationWindowSeconds }}
      policies:
      {{- range .Values.hpa.scaleDown.policies }}
        - type: {{ .type }}
          value: {{ .value }}
          periodSeconds: {{ .periodSeconds }}
      {{- end }}
{{- end }}

Important: HPA is completely useless without metrics-server installed in the cluster. Without it, every kubectl get hpa shows <unknown>/50% for the targets column, and nothing ever scales. This is one of those silent failures that will waste an hour if you don't know to look for it. I deployed metrics-server as its own ArgoCD application so it's always present.
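
It also helps to know how the CPU target turns into a replica count. The Kubernetes documentation gives the core rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), skipped when the ratio is within a tolerance of 1 (0.1 by default). The sketch below implements just that rule plus the min/max clamp. It deliberately ignores stabilization windows, scaling policies, and pod readiness, so treat it as a mental model and not a simulation of the controller.

```python
import math


def desired_replicas(current: int, cpu_percent: float, target_percent: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Apply the documented HPA ratio rule, then clamp to the configured bounds."""
    ratio = cpu_percent / target_percent
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: no scaling decision is made.
        return current
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# Illustrative readings against a 50% target.
print(desired_replicas(2, 150, 50, min_replicas=2, max_replicas=6))
print(desired_replicas(2, 52, 50, min_replicas=2, max_replicas=6))
```

The second call returns the current count unchanged because 52% is within the tolerance band around the 50% target.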

k6 load test

To validate that HPA actually fires under load, I wrote a k6 test that runs 10 virtual users for two minutes:

import http from 'k6/http';
import { sleep } from 'k6';

export let options = {
  vus: 10,        // virtual users
  duration: '2m', // run for 2 minutes
};

export default function () {
  // Dev application URL; to load test prod, change the port to 30011
  http.get('http://localhost:30007/load');
  sleep(1);
}
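
A rough way to reason about how much traffic this script generates: each virtual user sends one request, waits for the response, then sleeps for a second before looping. The sketch below estimates the resulting request rate under that model. The function name and the 0.25 s response time are assumptions for illustration, not measurements from my runs.

```python
def approx_rps(vus: int, response_seconds: float, sleep_seconds: float = 1.0) -> float:
    """Estimate requests per second when each VU loops request -> sleep."""
    iteration_seconds = response_seconds + sleep_seconds
    if iteration_seconds <= 0:
        raise ValueError("iteration time must be positive")
    return vus / iteration_seconds


# 10 VUs with sleep(1); 0.25 s is a hypothetical response time for /load.
print(approx_rps(10, 0.25))  # prints 8.0
```

With sleep(1) in the loop, 10 virtual users can never exceed about 10 requests per second, so raising vus or shortening the sleep is the lever for pushing CPU harder.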

Running this while watching the HPA:

kubectl get hpa -w
# NAME               TARGETS    MINPODS  MAXPODS  REPLICAS
# autoscaleops-hpa   12%/50%    1        5        1
# autoscaleops-hpa   67%/50%    1        5        1     # load hits
# autoscaleops-hpa   67%/50%    1        5        3     # scaled up
# autoscaleops-hpa   28%/50%    1        5        3
# autoscaleops-hpa   8%/50%     1        5        1     # scaled back down

Watching that REPLICAS column go from 1 to 3 in real time is genuinely satisfying after everything it took to get there.

[SCREENSHOT: Terminal with kubectl get hpa -w output showing REPLICAS jumping from 1 to 3 as k6 load ramps up, then dropping back to 1]


Phase 6 — Prometheus Metrics and Grafana Dashboards

[SCREENSHOT: Prometheus showing metrics collection from pods in the dev and prod environments]

The monitoring stack

The monitoring ArgoCD app deploys:

  • Prometheus — scrapes metrics from pods and nodes

  • Grafana — visualises them

  • kube-state-metrics — cluster-level metrics (deployment status, HPA current/desired replicas)

  • node-exporter — per-node CPU, memory, disk

What I tracked:

1. Grafana Dashboard Configuration

  • Dynamic variables configured for namespace and pod filtering (dev and prod)

  • Reusable dashboards to compare environments side by side

2. Resource Utilization Metrics:

  • Dev environment under normal conditions before autoscaling.

3. Load Testing with k6:

  • Performed a k6 load test against the dev application's /load endpoint.

4. Horizontal Pod Autoscaler Behavior:

  • HPA trigger points under stress

  • Scale-out events during load (for example, dev namespace scaling from 1 pod → 3 pods)

  • Pod replica changes mapped against traffic and resource consumption

  • Validation that scaling improved performance, not just replica count

5. HPA Scale-Up and Scale-Down Behavior:

  • Verified HPA scale-out under load based on 50% CPU utilization target, with replicas growing from minimum to meet demand.

  • Observed pods scale up during stress (for example, 1 → 3 replicas) and scale back down gradually after traffic drops.

  • Tracked controlled scale-down behavior (3 → 2 → 1) and validated the impact of HPA tuning parameters:

    • minReplicas / maxReplicas defining scaling boundaries

    • targetCPUUtilizationPercentage: 50 triggering scale decisions

    • scaleDown stabilization window (60s) preventing premature downscaling after short traffic dips

    • 50% reduction policy every 15s enabling gradual pod reduction instead of aggressive scale-in

  • Confirmed HPA was not only reacting to load spikes, but also scaling in smoothly without causing instability or thrashing.

The 50% scale-down policy allows HPA to remove only half the replicas at a time (rounded conservatively), which is why replicas reduced progressively (3 → 2 → 1) instead of dropping immediately.

HPA scaled back down to the minimum pod count for dev, which is 1.

Production Autoscaling Validation:

Baseline State (Min Replicas) — Application running at steady state with 2 pods (minReplicas) before load generation.

Ran a k6 load test against the prod environment's /load endpoint.

Scale-Out Triggered Under Load — During k6 stress testing, HPA increased replicas from 2 → 6 based on CPU utilization crossing the 50% target.

Controlled Scale-Down Begins — After load dropped, HPA reduced replicas from 6 → 3, following the configured 50% scale-down policy.

Return to Steady State — Replicas gradually scaled back from 3 → 2, respecting the 60-second stabilization window and returning to baseline.

HPA Tuning Validation — Verified scale-up responsiveness and smooth scale-in behavior without abrupt pod termination or thrashing.

I built dashboards focused on eight things across both dev and prod namespaces:

CPU Usage per Pod — Shows how much of the configured CPU limit each pod is consuming. Useful to spot pods nearing saturation, validate resource limits, and observe behavior before HPA scale-out triggers:

(
  sum(rate(container_cpu_usage_seconds_total{namespace="$namespace", pod=~"$pod", container!="", image!=""}[1m])) by (pod)
/
  sum(kube_pod_container_resource_limits{namespace="$namespace", resource="cpu", pod=~"$pod"}) by (pod)
) * 100

Memory Usage Per Pod (MB) — Displays memory consumption as a percentage of pod memory limits. Useful for identifying memory pressure, validating limits, and correlating usage spikes with k6 virtual users and HPA response:

(
  sum(container_memory_working_set_bytes{namespace="$namespace", pod=~"$pod", container!="POD", container!="", image!=""}) by (pod)
/
  sum(kube_pod_container_resource_limits{namespace="$namespace", resource="memory", pod=~"$pod"}) by (pod)
) * 100

Total Requests Per Endpoint — Shows the total number of requests handled by each endpoint over the last 5 minutes. Useful for identifying hot endpoints and traffic distribution:

round(
  sum(increase(flask_http_request_duration_seconds_count{namespace="$namespace", pod=~"$pod"}[5m])) by (path)
)

Endpoint-Wise Traffic — Shows live request rate (RPS) per endpoint. Useful for understanding traffic patterns and which routes are driving load:

sum(rate(flask_http_request_duration_seconds_count{namespace="$namespace", pod=~"$pod"}[5m])) by (path)

Total Traffic (RPS) — Shows total incoming request volume across the service over a 5-minute window. Useful for correlating traffic growth with scaling events:

sum(increase(flask_http_request_total{namespace="$namespace", pod=~"$pod"}[5m]))

Average Latency — Shows mean response latency across all requests. Useful for tracking general responsiveness under load:

sum(rate(flask_http_request_duration_seconds_sum{namespace="$namespace", pod=~"$pod"}[1m]))
/
sum(rate(flask_http_request_duration_seconds_count{namespace="$namespace", pod=~"$pod"}[1m]))
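
The average-latency query works because both series are counters: dividing the growth of the duration sum by the growth of the request count over the same window gives seconds per request. Here is a minimal Python sketch of that division using two scrapes. The sample numbers are invented for illustration.

```python
from typing import Optional


def average_latency(sum_start: float, sum_end: float,
                    count_start: float, count_end: float) -> Optional[float]:
    """Mean seconds per request between two scrapes of the _sum and _count counters."""
    requests = count_end - count_start
    if requests <= 0:
        # No traffic in the window: the average is undefined.
        return None
    return (sum_end - sum_start) / requests


# Hypothetical scrapes one minute apart: 40 requests took 6 seconds in total.
print(average_latency(12.0, 18.0, 100, 140))
```

The window length cancels out of the ratio, which is why the query only needs both rates to use the same range.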

P95 Latency Per Endpoint — Shows the 95th percentile response time for each endpoint, highlighting tail latency and endpoint-specific performance issues:

histogram_quantile(0.95,
  sum(rate(flask_http_request_duration_seconds_bucket{namespace="$namespace", pod=~"$pod"}[1m])) by (le, path)
)

P95 Latency (Service-Wide) — Shows overall 95th percentile latency across the application, useful for validating whether auto-scaling improves user experience under load:

histogram_quantile(0.95,
  sum(rate(flask_http_request_duration_seconds_bucket{namespace="$namespace", pod=~"$pod"}[1m])) by (le)
)
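
Since the P95 panels depend on histogram buckets, here is a simplified Python sketch of how a quantile can be estimated from cumulative bucket counts: find the bucket the target rank falls into, then interpolate linearly inside it. This mirrors the general approach Prometheus documents for classic histograms, but it is my own simplification, and the bucket boundaries and counts are made up.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with an +Inf bucket.
    """
    if not 0 <= q <= 1 or len(buckets) < 2:
        raise ValueError("need 0 <= q <= 1 and at least two buckets")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, cumulative = buckets[index]
    if math.isinf(upper):
        # The rank lands in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0.0
    in_bucket = cumulative - below
    if in_bucket == 0:
        return lower
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


# Hypothetical latency buckets in seconds.
sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, sample))
```

The estimate is only as fine-grained as the bucket layout, so a P95 that sits in a wide bucket should be read as approximate.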

Phase 7 — ArgoCD Image Updater (Closing the GitOps Loop)

This was the last phase, and the one that made everything feel complete.

The gap without it

Even with the full CI pipeline running, there was still one manual step: promoting a tested image from dev to prod. Someone had to edit values-prod.yaml, change the image tag, and push. That's not GitOps — that's just a slightly more structured shell script.

What Image Updater does

ArgoCD Image Updater watches your container registry and automatically commits an updated image tag to your Git repo when a new image appears. Combined with ArgoCD's sync automation, this makes the entire flow from git push to production completely automatic.

Configuration, as an ImageUpdater resource targeting the dev Application:

# argocd/image-updater-dev.yaml

apiVersion: argocd-image-updater.argoproj.io/v1alpha1
kind: ImageUpdater
metadata:
  name: dev-image-updater
  namespace: argocd
spec:
  applicationRefs:
    - namePattern: "devops-app-dev"
      images:
        - alias: "myapp"
          imageName: "saadpatel2400/devops-app"
          # Strategy must be inside commonUpdateSettings
          commonUpdateSettings:
            updateStrategy: "newest-build"
            allowTags: "regexp:^dev-[a-f0-9]+$"
          manifestTargets:
            helm:
              name: "image.repository"
              tag: "image.tag"
  writeBackConfig:
    # Secret reference goes directly in the 'method' field
    method: "git:secret:argocd/git-creds"
    gitConfig:
      repository: "https://github.com/psaad2400/gitops-k8s-devops-platform"
      branch: "main"
      writeBackTarget: "helmvalues:./values-dev.yaml"

📄 Full app-prod.yaml with sync policy: [GitHub link]
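
The allowTags pattern in the config above is what keeps unrelated tags out of the dev environment. As a small illustration, this Python sketch applies the same regular expression to a list of tags. The tag list is hypothetical, and the sketch covers only the filtering step: the update strategy then chooses among whatever passes.

```python
import re

# Same pattern as allowTags in the config, without the "regexp:" prefix.
DEV_TAG = re.compile(r"^dev-[a-f0-9]+$")


def allowed_tags(tags: list) -> list:
    """Return the tags that match the dev naming convention."""
    return [tag for tag in tags if DEV_TAG.match(tag)]


# Hypothetical registry listing.
registry_tags = ["dev-d6189e7", "prod", "v1", "dev-XYZ", "dev-", "cache-main"]
print(allowed_tags(registry_tags))  # prints ['dev-d6189e7']
```

Tags such as prod, v1, or the build-cache tags never match, so they cannot be written into values-dev.yaml by accident.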

With this in place, the complete flow becomes:

git push
 → CI builds image, Trivy scans
 → image pushed to GHCR as v1.2.3
 → Image Updater detects new semver tag
 → Image Updater commits updated tag to values-prod.yaml
 → ArgoCD detects drift → syncs prod
 → v1.2.3 live in prod

Zero manual steps. Zero human in the loop between writing code and it being live in production.

[SCREENSHOT: ArgoCD Image Updater logs showing "Found new image tag v1.2.4, updating values-prod.yaml"]

[SCREENSHOT: ArgoCD UI showing app-prod syncing automatically after Image Updater's commit]


Final Numbers

| What changed | Before | After | Improvement |
| --- | --- | --- | --- |
| Cluster provisioning | 17 min | 5 min | 70% faster |
| Docker build time | ~8 min | ~4 min | 50% faster |
| End-to-end deployment | ~10 min (manual) | ~0 min (automated) | 68% faster |
| CVE gate | None | Blocking HIGH/CRITICAL | Production protected |
| Manual steps per deploy | ~8 | 0 | Fully automated |

What I'd Do Differently

A month in, here's what I'd change if I started over:

Set up Image Updater on day one. I added it last, but it should be part of the initial bootstrap. Once it's running, you stop thinking about "pushing to prod" as a task — it just happens.

Use ignore-unfixed from the start. I spent two days chasing CVEs that had no fix before I found this flag. Save yourself the time and understand why the flag exists before assuming every Trivy failure means you have work to do.

Metrics-server should be the first ArgoCD app, not the last. Everything that depends on resource metrics (HPA, VPA, kubectl top) is broken until it's running. I learned this the hard way after an hour wondering why HPA wasn't scaling.

Write the k6 tests before setting up the HPA. Having a load test ready makes it trivially easy to verify that HPA thresholds are set correctly. Without it, you're guessing at averageUtilization values.


The full source is on GitHub — [ Full Link]. Feel free to fork it, break it, and build something better.

If you're building something similar or have questions about any of the config, drop a comment below.


Tags: #devops #kubernetes #cicd #devsecops #argocd #terraform #github-actions #grafana #trivy #gitops #k6 #helm
