AutoScaleOps: I Built a Production-Grade DevSecOps Platform From Scratch — Here's Everything

A month of late nights, broken pipelines, unfixable CVEs, and one cluster that refused to scale — here's the full story of building AutoScaleOps: a complete CI/CD platform with security gates, GitOps, auto-scaling, and zero manual deployments.
Why I Built This
I've always believed that you don't truly understand a tool until you've broken it yourself. I've used Kubernetes, seen ArgoCD in action, and read about security gates in CI pipelines — but I'd never wired all of it together from scratch.
So I spent a month doing exactly that. No managed cloud Kubernetes. No "getting started" tutorials. Just a blank repo, Terraform, and a stubborn refusal to move on to the next phase until the current one actually worked.
The result is AutoScaleOps: a full DevSecOps CI/CD platform that:
Provisions a multi-node KinD cluster in under 5 minutes (it used to take 17 — more on that)
Runs every Docker build through a Trivy security gate that blocks HIGH/CRITICAL CVEs from reaching prod
Deploys via GitOps — ArgoCD watches the repo, not a shell script
Auto-scales pods under real load, validated by k6 load tests
Tracks everything through Grafana dashboards across dev and prod
Uses ArgoCD Image Updater to make the entire flow from git push to production completely hands-free
Let's walk through every decision, every config, and every problem I ran into.
The Full Architecture
The flow, from left to right:
Developer pushes to main
GitHub Actions builds the Docker image with layer caching
Trivy scans the image — blocks on HIGH/CRITICAL CVEs
If the scan passes, the image is promoted to the dev environment: it is pushed to GHCR and the Helm values file is updated with the new tag
ArgoCD detects the drift and syncs the cluster
HPA scales pods up/down based on CPU
Grafana dashboards track everything in real time
ArgoCD Image Updater handles prod promotion automatically
Phase 1 — Terraform: From 17 Minutes to 5
The problem
Every time I needed to tear down and rebuild the environment — and during development, that happened a lot — I was doing it manually. Create the KinD cluster, wait, install ArgoCD, wait for pods to be ready, port-forward, apply the root app... It was repetitive, error-prone, and genuinely the most annoying part of the day.
I clocked it at 17 minutes from kind create cluster to a fully working ArgoCD with all apps synced.
The solution
I moved the entire setup into Terraform with a bootstrap/start.sh script that does everything in one shot.
Core cluster config (terraform/main.tf):
terraform {
required_providers {
kind = {
source = "tehcyx/kind"
version = "0.5.1"
}
}
}
provider "kind" {}
resource "kind_cluster" "dev" {
name = "devops-cluster"
node_image = "kindest/node:v1.29.2"
kind_config {
kind = "Cluster"
api_version = "kind.x-k8s.io/v1alpha4"
networking {
api_server_address = "0.0.0.0"
api_server_port = 6443
}
node {
role = "control-plane"
extra_port_mappings { # Application Service
container_port = 30007 # NodePort inside cluster for DEV ENV.
host_port = 30007 # Port on your machine
protocol = "TCP"
}
extra_port_mappings { # ArgoCD
container_port = 30008 # NodePort inside cluster
host_port = 30008 # Port on your machine
protocol = "TCP"
}
extra_port_mappings { # Grafana
container_port = 30009 # NodePort inside cluster
host_port = 30009 # Port on your machine
protocol = "TCP"
}
extra_port_mappings { # Grafana
container_port = 30010 # NodePort inside cluster
host_port = 30010 # Port on your machine
protocol = "TCP"
}
extra_port_mappings {
container_port = 30011 # Application
host_port = 30011 # For PROD env setup
protocol = "TCP"
}
}
node {
role = "worker"
}
node {
role = "worker"
}
}
}
📄 Full main.tf (with provider config, kubeconfig output, and null resource provisioners)
Bootstrap script (bootstrap/start.sh):
#!/bin/bash
set -e
ENV_FILE="../.env"
if [ -f "$ENV_FILE" ]; then
echo "🔑 Loading credentials from $ENV_FILE..."
# Set permissions to 600 (Owner Read/Write Only) for extra safety
chmod 600 "$ENV_FILE"
export $(grep -v '^#' "$ENV_FILE" | xargs)
else
echo "⚠️ $ENV_FILE not found!"
read -p "Enter GitHub Username: " GITHUB_USER
read -sp "Enter GitHub PAT (Input hidden): " GITHUB_TOKEN
echo ""
fi
if [ -z "$GITHUB_TOKEN" ]; then
echo "❌ Error: GitHub Token is required for Image Updater."
exit 1
fi
# --------------------------------
echo "🚀 Creating cluster..."
terraform -chdir=../terraform apply -auto-approve
echo "📦 Installing Argo CD..."
kubectl create namespace argocd || true
kubectl apply -n argocd --server-side --force-conflicts \
-f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
# Install Image Updater
echo "📦 Installing Argo CD Image Updater..."
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj-labs/argocd-image-updater/stable/config/install.yaml
echo "🔐 Creating Git credentials for Image Updater..."
kubectl create secret generic git-creds \
-n argocd \
--from-literal=username="$GITHUB_USER" \
--from-literal=password="$GITHUB_TOKEN" \
--dry-run=client -o yaml | kubectl apply -f -
echo "🌐 Applying custom Argo CD Service..."
kubectl apply -f ../argocd/argocd-server-nodeport.yaml -n argocd
echo "⏳ Waiting for Argo CD Server to be ready..."
kubectl wait --for=condition=available deployment/argocd-server -n argocd --timeout=300s
# Argo CD manages apps
echo "⚙️ Deploying apps via Argo CD..."
kubectl apply --server-side -f ../argocd/root-app.yaml
echo "⏳ Waiting for monitoring stack (Grafana) to be ready..."
sleep 30
# 🔑 Argo CD Password
echo "🔑 Fetching Argo CD Admin Password..."
ARGOPASS=$(kubectl -n argocd get secret argocd-initial-admin-secret \
-o jsonpath="{.data.password}" | base64 --decode)
# 🔑 Grafana Password
echo "🔑 Fetching Grafana Admin Password..."
GRAFANAPASS=$(kubectl -n monitoring get secret monitoring-grafana \
-o jsonpath="{.data.admin-password}" | base64 --decode 2>/dev/null || echo "Grafana not ready yet")
echo "------------------------------------------------------"
echo "✅ SETUP COMPLETE!"
echo ""
echo "📍 Argo CD URL: http://localhost:30008"
echo "👤 Argo Username: admin"
echo "🔐 Argo Password: $ARGOPASS"
echo ""
echo "📍 Grafana URL: http://localhost:30009"
echo "👤 Grafana Username: admin"
echo "🔐 Grafana Password: $GRAFANAPASS"
echo "------------------------------------------------------"
📄 Full script with error handling
The script is idempotent: running it twice doesn't break anything. Terraform handles the cluster readiness check, and the ArgoCD kubectl wait blocks until the argocd-server Deployment reports Available, not just until its pod is scheduled.
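The script still pauses with a fixed sleep 30 before reading the Grafana secret, which is why it needs the "Grafana not ready yet" fallback. A polling loop is one alternative. The sketch below is a generic Python helper, not code from the repo; wait_until and fake_secret_exists are hypothetical names used only for illustration.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool],
               timeout: float = 300.0,
               interval: float = 5.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll check() until it returns True or `timeout` seconds have passed."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Illustration only: a fake check that succeeds on the third attempt.
attempts = {"n": 0}


def fake_secret_exists() -> bool:
    attempts["n"] += 1
    return attempts["n"] >= 3


ready = wait_until(fake_secret_exists, timeout=10, interval=0, sleep=lambda s: None)
print(ready, attempts["n"])
```

In a real bootstrap the check would shell out to kubectl (for example, testing whether the secret exists), so the wait ends as soon as the resource appears instead of after a fixed delay.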
Result: 17 minutes → < 5 minutes. 70% faster.
Phase 2 — ArgoCD App-of-Apps with Custom Helm Charts
Why App-of-Apps?
The naive approach is to kubectl apply every ArgoCD Application manifest yourself. That works fine for two or three apps, but it breaks the spirit of GitOps — you're still doing manual work on every environment change.
The App-of-Apps pattern fixes this. One root app manages all child apps. Add a YAML file to argocd/apps/, and ArgoCD picks it up automatically.
Root app (argocd/root-app.yaml):
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: root-app
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/psaad2400/gitops-k8s-devops-platform
targetRevision: main
path: argocd/apps
destination:
server: https://kubernetes.default.svc
namespace: default
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- ServerSideApply=true
This single file, applied once, brings up everything else. The selfHeal: true flag means if someone manually edits a resource in the cluster, ArgoCD will correct it back to what Git says. No config drift.
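To make the selfHeal idea concrete, here is a toy model of drift detection in Python. It is only a conceptual sketch, not how ArgoCD is implemented: find_drift and self_heal are hypothetical helpers, manifests are reduced to nested dicts, and pruning of extra live fields is not modeled.

```python
from typing import Any, Dict, Tuple


def find_drift(desired: Dict[str, Any], live: Dict[str, Any],
               prefix: str = "") -> Dict[str, Tuple[Any, Any]]:
    """Return {dotted.path: (desired, live)} for each desired field that differs."""
    drift: Dict[str, Tuple[Any, Any]] = {}
    for key, want in desired.items():
        path = f"{prefix}.{key}" if prefix else key
        have = live.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            drift.update(find_drift(want, have, path))
        elif want != have:
            drift[path] = (want, have)
    return drift


def self_heal(desired: Dict[str, Any], live: Dict[str, Any]) -> Dict[str, Any]:
    """Overwrite drifted fields in `live` with the values Git declares."""
    for path, (want, _) in find_drift(desired, live).items():
        *parents, leaf = path.split(".")  # assumes keys contain no dots
        node = live
        for parent in parents:
            node = node[parent]
        node[leaf] = want
    return live


desired = {"spec": {"replicas": 2, "image": "devops-app:dev-d6189e7"}}
live = {"spec": {"replicas": 5, "image": "devops-app:dev-d6189e7"}}  # someone scaled by hand

print(find_drift(desired, live))
self_heal(desired, live)
```

After self_heal runs, the manual replica change is gone and live matches the Git-declared state again.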
Custom Helm Chart
The app is deployed via a Helm chart in AutoScaleOps/. Same chart, different values files for dev and prod:
# AutoScaleOps/values-dev.yaml (key section)
replicaCount: 1
image:
repository: saadpatel2400/devops-app
tag: dev-d6189e7
service:
nodePort: 30007 # dev env assigned port 30007
hpa:
minReplicas: 1
maxReplicas: 3
# AutoScaleOps/values-prod.yaml (key section)
replicaCount: 2
image:
tag: "prod" # CI promotes here only on clean scan
service:
nodePort: 30011 # assigned another port for prod env
hpa:
minReplicas: 2
maxReplicas: 6
Below is values.yaml, the base config shared by dev and prod. The values-dev and values-prod files above override these base values wherever an environment needs something different.
# AutoScaleOps/values.yaml
replicaCount: 2
image:
repository: saadpatel2400/devops-app
tag: "v1"
pullPolicy: IfNotPresent
service:
type: NodePort
port: 80
targetPort: 5000
nodePort: 30007
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "256Mi"
livenessProbe:
path: /health
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
path: /ready
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 6
hpa:
enabled: true
minReplicas: 2
maxReplicas: 6
targetCPUUtilizationPercentage: 50
# Scale-down behavior, read by templates/hpa.yaml
scaleDown:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 50
periodSeconds: 15
📄 Full chart templates (deployment, service): https://github.com/psaad2400/gitops-k8s-devops-platform/tree/main/AutoScaleOps
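The layering of values.yaml with an environment file can be pictured as a recursive dictionary merge. The Python sketch below approximates that behavior for nested maps only; merge_values is a hypothetical helper, and Helm details such as null-deletion and list handling are left out.

```python
import copy


def merge_values(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`; override wins on conflicts."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_values(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Trimmed-down copies of the values shown above.
base = {
    "replicaCount": 2,
    "service": {"type": "NodePort", "port": 80, "nodePort": 30007},
    "hpa": {"enabled": True, "minReplicas": 2, "maxReplicas": 6,
            "targetCPUUtilizationPercentage": 50},
}
dev = {"replicaCount": 1, "hpa": {"minReplicas": 1, "maxReplicas": 3}}

merged = merge_values(base, dev)
print(merged["hpa"])
```

The dev overrides replace minReplicas and maxReplicas, while enabled and targetCPUUtilizationPercentage fall through from the base file.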
Phase 3 — GitHub Actions CI with Layer Caching
The pipeline structure
Every push to main triggers the CI pipeline. Four stages, in order:
Build the Docker image using Buildx with GitHub Actions cache
Scan with Trivy (pipeline fails here if CVEs are found)
Push to GHCR (only reached if scan passes)
Update the image tag in values-dev.yaml and push the commit
The build step with caching (.github/workflows/ci.yml key section):
build:
name: 🏗️ Build Docker Image
runs-on: self-hosted
steps:
- name: Checkout Repository
uses: actions/checkout@v4
- name: Set Up Docker Buildx
uses: docker/setup-buildx-action@v3
with:
driver-opts: network=host
- name: Log In to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_PASSWORD }}
- name: Build Image (load only, no push)
uses: docker/build-push-action@v5
with:
context: ./app
file: ./app/Dockerfile
platforms: linux/amd64
push: false
load: true
tags: ${{ env.SHA_TAG }}
cache-from: |
type=registry,ref=${{ secrets.DOCKER_USERNAME }}/devops-app:cache-${{ github.ref_name }}
type=registry,ref=${{ secrets.DOCKER_USERNAME }}/devops-app:cache-main
type=gha,scope=buildkit-${{ github.ref_name }}
type=gha,scope=buildkit-main
cache-to: |
type=registry,ref=${{ secrets.DOCKER_USERNAME }}/devops-app:cache-${{ github.ref_name }},mode=max
type=gha,scope=buildkit-${{ github.ref_name }},mode=max
- name: Export Image as Tar to /tmp
run: |
docker save ${{ env.SHA_TAG }} -o ${{ env.IMAGE_TAR }}
echo "📦 Saved: $(du -sh ${{ env.IMAGE_TAR }} | cut -f1)"
The cache-from: type=gha / cache-to: type=gha,mode=max pair is where most of the build time savings come from. On the first run, every layer is built fresh. On subsequent runs — even across different commits — Docker reuses cached layers for anything that hasn't changed (OS packages, pip installs, etc). Only the changed application code gets rebuilt.
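Why only the changed code gets rebuilt can be shown with a toy model of chained cache keys. This is a conceptual Python sketch, not BuildKit's actual algorithm: each layer's key is derived from its parent's key plus its own instruction, and file contents are stood in for by a made-up digest string. The base image and file names are hypothetical.

```python
import hashlib
from typing import List


def layer_keys(instructions: List[str]) -> List[str]:
    """Chain-hash the instructions: each key depends on every earlier layer."""
    keys, parent = [], ""
    for instruction in instructions:
        parent = hashlib.sha256(f"{parent}\n{instruction}".encode()).hexdigest()
        keys.append(parent)
    return keys


def reused_layers(old: List[str], new: List[str]) -> int:
    """Count leading layers whose cache keys are unchanged between two builds."""
    count = 0
    for old_key, new_key in zip(layer_keys(old), layer_keys(new)):
        if old_key != new_key:
            break
        count += 1
    return count


previous = [
    "FROM python:3.12-slim",
    "COPY requirements.txt . (digest=aaa)",
    "RUN pip install -r requirements.txt",
    "COPY app.py . (digest=111)",
]
current = previous[:3] + ["COPY app.py . (digest=222)"]  # only app code changed

print(reused_layers(previous, current))
```

The same model explains why a change to requirements.txt is more expensive: it alters the second key, so every layer after it misses the cache.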
The image tag update step:
- name: Update image tag in values
if: success() # only runs if Trivy passed
run: |
sed -i "s/tag:.*/tag: ${{ github.sha }}/" AutoScaleOps/values-dev.yaml
git config user.email "ci@autoscaleops"
git config user.name "CI Bot"
git add AutoScaleOps/values-dev.yaml
git commit -m "ci: update image tag to ${{ github.sha }}"
git push
This commit to values-dev.yaml is what triggers ArgoCD. It detects the repo changed, compares the desired state (new tag) to the actual state (old tag), and syncs.
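One caveat with the sed expression: s/tag:.*/ rewrites every line containing tag:, not just the one under image:. A more targeted update could look like the Python sketch below. It is a hypothetical helper (set_image_tag), it assumes image: is a top-level key with an indented tag: beneath it, and it drops any trailing comment on the tag line. The sidecar key in the sample is invented for the demonstration.

```python
import re


def set_image_tag(values_yaml: str, new_tag: str) -> str:
    """Replace only the `tag:` nested under the top-level `image:` key."""
    lines = values_yaml.splitlines()
    in_image = False
    for i, line in enumerate(lines):
        if re.match(r"^image:\s*$", line):
            in_image = True
            continue
        if in_image:
            if line and not line[0].isspace():
                in_image = False  # a new top-level key ends the image: block
                continue
            match = re.match(r"^(\s+tag:\s*).*$", line)
            if match:
                lines[i] = f'{match.group(1)}"{new_tag}"'
                break
    trailing = "\n" if values_yaml.endswith("\n") else ""
    return "\n".join(lines) + trailing


sample = (
    "image:\n"
    "  repository: saadpatel2400/devops-app\n"
    "  tag: dev-d6189e7\n"
    "sidecar:\n"
    "  tag: keep-me\n"
)
print(set_image_tag(sample, "dev-abc1234"))
```

Only the tag under image: changes; the unrelated tag: line is left alone.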
📄 Full ci.yml with all steps and environment variables: https://github.com/psaad2400/gitops-k8s-devops-platform/blob/main/.github/workflows/ci.yml
Self-hosted WSL runner
I ran the CI on a self-hosted WSL runner instead of GitHub-hosted runners. The advantages:
No cold-start overhead on the runner itself
Docker is already running — no setup step
The GitHub Actions layer cache (type=gha) works better with a persistent runner
Setting up a self-hosted runner took about 10 minutes — GitHub's runner application just needs to be registered with a token from the repo settings.
Result: Build time ~8 min → ~4 min. 50% faster.
1. Downloading and configuring the GitHub self-hosted runner on the local machine.
2. Status shows offline because the runner has not been started.
3. Starting the self-hosted runner with run.sh.
4. Status now shows green (online) and Idle because no job is scheduled.
5. Workflow execution assigns jobs to the self-hosted runner.
6. Status shows Active while jobs run on the self-hosted runner.
Phase 4 — Trivy Security Gates (and the Unfixed CVE Problem)
This phase had the most interesting debugging story of the entire project, and I want to document it properly because I couldn't find a clear write-up when I was stuck on it.
The basic gate
Trivy scans the built image and fails the pipeline (exit-code: '1') if it finds any HIGH or CRITICAL vulnerabilities. If the pipeline fails, the image never gets pushed, values-dev.yaml never gets updated, and ArgoCD never deploys the bad image.
security-scan:
name: 🔒 Trivy Security Scan
runs-on: self-hosted
needs: build
outputs:
scan-status: ${{ steps.set-result.outputs.result }}
high-count: ${{ steps.set-result.outputs.high }}
crit-count: ${{ steps.set-result.outputs.critical }}
total-count: ${{ steps.set-result.outputs.total }}
steps:
- name: Load Docker Image
run: docker load -i ${{ env.IMAGE_TAR }}
- name: Run Trivy Scan (JSON + hard fail on findings)
run: |
docker run --rm \
-v /var/run/docker.sock:/var/run/docker.sock \
-v $GITHUB_WORKSPACE:/output \
aquasec/trivy:0.69.1 image \
--severity HIGH,CRITICAL \
--ignore-unfixed \
--format json \
--output /output/${{ env.TRIVY_REPORT_PATH }} \
--exit-code 1 \
${{ env.SHA_TAG }}
- name: Debug — Print Trivy Report
if: always()
run: |
echo "📄 Report: $GITHUB_WORKSPACE/${{ env.TRIVY_REPORT_PATH }}"
cat $GITHUB_WORKSPACE/${{ env.TRIVY_REPORT_PATH }} | jq '.Results | length'
- name: Parse Report & Set Outputs
id: set-result
if: always()
run: |
REPORT="$GITHUB_WORKSPACE/${{ env.TRIVY_REPORT_PATH }}"
HIGH=$(jq '[.Results[]?.Vulnerabilities[]? | select(.Severity=="HIGH")] | length' "$REPORT")
CRITICAL=$(jq '[.Results[]?.Vulnerabilities[]? | select(.Severity=="CRITICAL")] | length' "$REPORT")
TOTAL=$((HIGH + CRITICAL))
echo "high=$HIGH" >> $GITHUB_OUTPUT
echo "critical=$CRITICAL" >> $GITHUB_OUTPUT
echo "total=$TOTAL" >> $GITHUB_OUTPUT
if [ "$TOTAL" -gt 0 ]; then
echo "result=fail" >> $GITHUB_OUTPUT
echo "❌ Scan FAILED - HIGH: $HIGH | CRITICAL: $CRITICAL | Total: $TOTAL"
else
echo "result=pass" >> $GITHUB_OUTPUT
echo "✅ Scan PASSED — no HIGH/CRITICAL vulnerabilities."
fi
- name: Upload Trivy JSON Report
if: always()
uses: actions/upload-artifact@v4
with:
name: trivy-vulnerability-report
path: ${{ env.TRIVY_REPORT_PATH }}
retention-days: 30
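For anyone who prefers to post-process the report outside of jq, the same counting logic can be sketched in Python. This mirrors the jq filters above against the Results[].Vulnerabilities[].Severity shape; count_severities is a hypothetical helper and the CVE IDs in the sample are placeholders, not real findings.

```python
import json
from collections import Counter


def count_severities(report: dict) -> Counter:
    """Count vulnerabilities by severity, tolerating missing or null sections."""
    counts: Counter = Counter()
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts


# Placeholder report with the same shape the jq filters expect.
sample = json.loads("""
{
  "Results": [
    {"Target": "app", "Vulnerabilities": [
      {"VulnerabilityID": "CVE-0000-0001", "Severity": "HIGH"},
      {"VulnerabilityID": "CVE-0000-0002", "Severity": "CRITICAL"}
    ]},
    {"Target": "os-packages"}
  ]
}
""")

counts = count_severities(sample)
total = counts["HIGH"] + counts["CRITICAL"]
print(f"high={counts['HIGH']} critical={counts['CRITICAL']} total={total}")
```

The `or []` guards play the role of the `?` operators in the jq expressions: a target with no Vulnerabilities key simply contributes nothing.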
I also pinned all dependencies in app/requirements.txt to eliminate floating version surprises:
# Before — floating versions, surprise CVEs on every build
flask
requests
gunicorn
# After — pinned, reproducible, auditable
flask==3.0.3
requests==2.31.0
gunicorn==22.0.0
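A small check can keep floating versions from creeping back in. The Python sketch below flags any requirement line that is not pinned with ==; unpinned is a hypothetical helper, and the pattern is a simplification that does not cover every form the requirements format allows (URLs, environment markers, and so on).

```python
import re
from typing import List

# name, optional [extras], then an exact == pin
PINNED = re.compile(r"^[A-Za-z0-9_.\-]+(\[[^\]]+\])?==[^=\s]+$")


def unpinned(requirements_text: str) -> List[str]:
    """Return requirement lines that are not pinned to an exact version."""
    loose = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not PINNED.match(line):
            loose.append(line)
    return loose


before = "flask\nrequests\ngunicorn\n"
after = "flask==3.0.3\nrequests==2.31.0\ngunicorn==22.0.0\n"

print(unpinned(before))
print(unpinned(after))
```

Running it in CI before the build would fail fast on an unpinned dependency instead of waiting for a surprise CVE in the scan.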
[SCREENSHOT: GitHub Actions step showing Trivy blocking a build — output table with HIGH CVEs found, pipeline marked as failed]
The problem: unfixed CVEs
After pinning dependencies and resolving the clearly fixable CVEs, the pipeline was still failing consistently. Trivy was flagging CVEs with a status of will_not_fix, meaning the upstream maintainer had acknowledged the vulnerability but no patched version was available.
The Trivy output looked like this:
1. Output before fixing versions and vulnerabilities:
# requirements.txt before the fix
flask==2.3.3
prometheus-flask-exporter
The report (trivy-vulnerability-report.json) is stored as a GitHub Actions artifact.
2. Fixing the image's vulnerabilities:
# Updated requirements.txt with patches
flask==2.3.3
prometheus-flask-exporter
# Explicitly patch vulnerable sub-dependencies
wheel>=0.46.2
jaraco.context>=6.1.0
All HIGH and CRITICAL vulnerabilities that had known fixes are now resolved (the scan ran with --ignore-unfixed, so vulnerabilities without a fix were skipped).
There is nothing you can do about a will_not_fix vulnerability in the short term. You cannot upgrade to a fixed version because there is no fixed version. You can't remove the library if it's a transitive dependency of something else. You just have to acknowledge it and accept the risk.
But Trivy, by default, treats it exactly the same as a CVE that has a patch available and you simply haven't applied yet.
The fix: --ignore-unfixed
- name: Run Trivy Scan (JSON + hard fail on findings)
run: |
docker run --rm \
-v /var/run/docker.sock:/var/run/docker.sock \
-v $GITHUB_WORKSPACE:/output \
aquasec/trivy:0.69.1 image \
--severity HIGH,CRITICAL \
--ignore-unfixed \
--format json \
--output /output/${{ env.TRIVY_REPORT_PATH }} \
--exit-code 1 \
${{ env.SHA_TAG }}
# Note: I added --ignore-unfixed (above) for testing purposes. I had patched most of the HIGH and CRITICAL vulnerabilities that had a fix, but the pipeline kept failing on vulnerabilities with no patch available.
# With the aquasecurity/trivy-action wrapper, the equivalent input is:
# ignore-unfixed: true # only block on CVEs that actually have a patch
This flag tells Trivy: "Only fail the build on vulnerabilities where a fix exists." If there's no patch, don't block — because blocking doesn't help anyone. It just creates alert fatigue.
This is the correct security posture. You're still blocking every vulnerability where you have no excuse not to patch. You're just not punishing yourself for vulnerabilities that are out of your control.
After adding this flag, the security gate started doing exactly what it's supposed to: blocking real, fixable problems while letting the pipeline flow normally otherwise.
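The effect of the flag on the gate can be approximated in a few lines of Python. This is a sketch of the decision, not Trivy's implementation: blocking_vulns is a hypothetical helper, it assumes the report's FixedVersion field is empty or absent when no patch exists, and the IDs and package names are placeholders.

```python
from typing import Dict, Iterable, List


def blocking_vulns(vulns: Iterable[Dict[str, str]],
                   severities=("HIGH", "CRITICAL"),
                   ignore_unfixed: bool = True) -> List[Dict[str, str]]:
    """Return the vulnerabilities that should fail the build."""
    selected = []
    for vuln in vulns:
        if vuln.get("Severity") not in severities:
            continue
        if ignore_unfixed and not vuln.get("FixedVersion"):
            continue  # no patch available: report it, but do not block
        selected.append(vuln)
    return selected


# Placeholder findings for illustration.
findings = [
    {"VulnerabilityID": "CVE-0000-0001", "PkgName": "libexample", "Severity": "HIGH",
     "FixedVersion": ""},
    {"VulnerabilityID": "CVE-0000-0002", "PkgName": "wheel", "Severity": "HIGH",
     "FixedVersion": "0.46.2"},
    {"VulnerabilityID": "CVE-0000-0003", "PkgName": "otherlib", "Severity": "LOW",
     "FixedVersion": "1.0.1"},
]

strict = blocking_vulns(findings, ignore_unfixed=False)
relaxed = blocking_vulns(findings, ignore_unfixed=True)
exit_code = 1 if relaxed else 0
print(len(strict), len(relaxed), exit_code)
```

With the flag on, the unpatched HIGH finding no longer blocks, but the one with a fixed version still does, so the build keeps failing until that dependency is upgraded.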
[SCREENSHOT: Trivy scan passing after --ignore-unfixed — showing "0 vulnerabilities found with available fixes" in green]
Result: Deployment time to prod dropped by 68% — largely because the security gate stopped being the unpredictable blocker it had been, and pipelines started completing reliably.
Phase 5 — HPA + k6 Load Testing
The HPA config
The Horizontal Pod Autoscaler watches CPU utilization and scales between the minReplicas and maxReplicas set in the values files:
# AutoScaleOps/templates/hpa.yaml
# values for the configuration are derived from
# AutoScaleOps/values.yaml
{{- if .Values.hpa.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: {{ include "autoscaleops.fullname" . }}
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: {{ include "autoscaleops.fullname" . }}
minReplicas: {{ .Values.hpa.minReplicas }}
maxReplicas: {{ .Values.hpa.maxReplicas }}
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: {{ .Values.hpa.targetCPUUtilizationPercentage }}
behavior:
scaleDown:
stabilizationWindowSeconds: {{ .Values.hpa.scaleDown.stabilizationWindowSeconds }}
policies:
{{- range .Values.hpa.scaleDown.policies }}
- type: {{ .type }}
value: {{ .value }}
periodSeconds: {{ .periodSeconds }}
{{- end }}
{{- end }}
Important: HPA is completely useless without metrics-server installed in the cluster. Without it, every kubectl get hpa shows <unknown>/50% for the targets column, and nothing ever scales. This is one of those silent failures that will waste an hour if you don't know to look for it. I deployed metrics-server as its own ArgoCD application so it's always present.
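It also helps to know the arithmetic HPA applies once metrics are available. The Kubernetes documentation gives the core formula as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with a default tolerance of 10% around the target. The Python sketch below implements just that formula plus min/max clamping; desired_replicas is a hypothetical helper and it ignores pod readiness, missing metrics, and scaling behavior policies.

```python
import math


def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Core HPA calculation: scale proportionally to metric/target, then clamp."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        desired = current  # close enough to target: no change
    else:
        desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Prod-style bounds from values-prod.yaml (2 to 6), 50% CPU target.
print(desired_replicas(2, 90, 50, 2, 6))   # load: 2 * 1.8 = 3.6, rounded up
print(desired_replicas(2, 200, 50, 2, 6))  # heavy load: capped at maxReplicas
print(desired_replicas(3, 10, 50, 1, 3))   # idle: shrinks toward minReplicas
```

Because the utilization percentage is measured against the CPU request (100m in values.yaml), lowering the request makes the same workload look busier to HPA.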
k6 load test
To validate that HPA actually fires under load, I wrote a k6 test that runs 10 virtual users for two minutes:
import http from 'k6/http';
import { sleep } from 'k6';
export let options = {
vus: 10, // virtual users
duration: '2m', // run for 2 minutes
};
export default function () {
http.get('http://localhost:30007/load'); // Dev Application URL
// To load test the prod env, change the port to 30011
sleep(1);
}
Running this while watching the HPA:
kubectl get hpa -w
# NAME TARGETS MINPODS MAXPODS REPLICAS
# autoscaleops-hpa 12%/50% 1 5 1
# autoscaleops-hpa 67%/50% 1 5 1 # load hits
# autoscaleops-hpa 67%/50% 1 5 3 # scaled up
# autoscaleops-hpa 28%/50% 1 5 3
# autoscaleops-hpa 8%/50% 1 5 1 # scaled back down
Watching that REPLICAS column go from 1 to 3 in real time is genuinely satisfying after everything it took to get there.
[SCREENSHOT: Terminal with kubectl get hpa -w output showing REPLICAS jumping from 1 to 3 as k6 load ramps up, then dropping back to 1]
Phase 6 — Prometheus Metrics and Grafana Dashboards
Prometheus screenshot showing metrics collection from pods in the dev and prod environments.
The monitoring stack
The monitoring ArgoCD app deploys:
Prometheus — scrapes metrics from pods and nodes
Grafana — visualises them
kube-state-metrics — cluster-level metrics (deployment status, HPA current/desired replicas)
node-exporter — per-node CPU, memory, disk
What I tracked:
1. Grafana Dashboard Configuration
Dynamic variables configured for namespace and pod filtering (dev and prod)
Reusable dashboards to compare environments side by side
2. Resource Utilization Metrics:
- Dev environment under normal condition before auto scaling.
3. Load Testing with k6:
- Ran a k6 load test against the dev application's /load endpoint.
4. Horizontal Pod Autoscaler Behavior:
HPA trigger points under stress
Scale-out events during load (for example, dev namespace scaling from 1 pod → 3 pods)
Pod replica changes mapped against traffic and resource consumption
Validation that scaling improved performance, not just replica count
5. HPA Scale-Up and Scale-Down Behavior:
Verified HPA scale-out under load based on 50% CPU utilization target, with replicas growing from minimum to meet demand.
Observed pods scale up during stress (for example, 1 → 3 replicas) and scale back down gradually after traffic drops.
Tracked controlled scale-down behavior (3 → 2 → 1) and validated the impact of HPA tuning parameters:
minReplicas / maxReplicas defining scaling boundaries
targetCPUUtilizationPercentage: 50 triggering scale decisions
scaleDown stabilization window (60s) preventing premature downscaling after short traffic dips
50% reduction policy every 15s enabling gradual pod reduction instead of aggressive scale-in
Confirmed HPA was not only reacting to load spikes, but also scaling in smoothly without causing instability or thrashing.
The 50% scale-down policy caps how many replicas HPA may remove in each 15-second period, and replicas reduced progressively (3 → 2 → 1) instead of dropping immediately.
HPA scaled down to the minimum pod count for dev, which is 1.
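The stabilization window is easier to reason about with a small model. According to the Kubernetes documentation, when scaling down HPA considers the recommendations computed during the window and acts on the highest one. The Python sketch below models only that rule; stabilized_scale_down is a hypothetical helper and the recommendation history is made up for illustration.

```python
from typing import List, Tuple


def stabilized_scale_down(history: List[Tuple[float, int]], now: float,
                          window_s: float = 60.0) -> int:
    """Pick the highest replica recommendation seen within the last `window_s` seconds.

    history holds (timestamp_seconds, recommended_replicas) pairs and must
    contain at least one entry inside the window.
    """
    recent = [replicas for ts, replicas in history if now - ts <= window_s]
    return max(recent)


# Made-up recommendations sampled every 15 seconds after load stops.
history = [(0, 3), (15, 3), (30, 2), (45, 1), (60, 1)]

print(stabilized_scale_down(history, now=60))  # the 3 from t=0 is still in the window
print(stabilized_scale_down(history, now=80))  # oldest in-window entry now recommends 2
print(stabilized_scale_down(history, now=95))  # only the 1-replica entries remain
```

A short dip in traffic therefore cannot trigger an immediate scale-in: the lower recommendation has to persist for the full 60 seconds first.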
Production Autoscaling Validation:
Baseline State (Min Replicas) — Application running at steady state with 2 pods (minReplicas) before load generation.
Ran a k6 load test against the prod environment's /load endpoint
Scale-Out Triggered Under Load — During k6 stress testing, HPA increased replicas from 2 → 6 based on CPU utilization crossing the 50% target.
Controlled Scale-Down Begins — After load dropped, HPA reduced replicas from 6 → 3, following the configured 50% scale-down policy.
Return to Steady State — Replicas gradually scaled back from 3 → 2, respecting the 60-second stabilization window and returning to baseline.
HPA Tuning Validation — Verified scale-up responsiveness and smooth scale-in behavior without abrupt pod termination or thrashing.
I built dashboards focused on eight panels across both dev and prod namespaces:
CPU Usage per Pod — Shows how much of the configured CPU limit each pod is consuming. Useful to spot pods nearing saturation, validate resource limits, and observe behavior before HPA scale-out triggers:
(
sum(rate(container_cpu_usage_seconds_total{namespace="$namespace", pod=~"$pod", container!="", image!=""}[1m])) by (pod)
/
sum(kube_pod_container_resource_limits{namespace="$namespace", resource="cpu", pod=~"$pod"}) by (pod)
) * 100
Memory Usage Per Pod (MB) — Displays memory consumption as a percentage of pod memory limits. Useful for identifying memory pressure, validating limits, and correlating usage spikes with k6 virtual users and HPA response:
(
sum(container_memory_working_set_bytes{namespace="$namespace", pod=~"$pod", container!="POD", container!="", image!=""}) by (pod)
/
sum(kube_pod_container_resource_limits{namespace="$namespace", resource="memory", pod=~"$pod"}) by (pod)
) * 100
Total Requests Per Endpoint — Shows the total number of requests handled by each endpoint over the last 5 minutes. Useful for identifying hot endpoints and traffic distribution:
round(
sum(increase(flask_http_request_duration_seconds_count{namespace="$namespace", pod=~"$pod"}[5m])) by (path)
)
Endpoint-Wise Traffic — Shows live request rate (RPS) per endpoint. Useful for understanding traffic patterns and which routes are driving load:
sum(rate(flask_http_request_duration_seconds_count{namespace="$namespace", pod=~"$pod"}[5m])) by (path)
Total Traffic (RPS) — Shows total incoming request volume across the service over a 5-minute window. Useful for correlating traffic growth with scaling events:
sum(increase(flask_http_request_total{namespace="$namespace", pod=~"$pod"}[5m]))
Average Latency — Shows mean response latency across all requests. Useful for tracking general responsiveness under load:
sum(rate(flask_http_request_duration_seconds_sum{namespace="$namespace", pod=~"$pod"}[1m]))
/
sum(rate(flask_http_request_duration_seconds_count{namespace="$namespace", pod=~"$pod"}[1m]))
P95 Latency Per Endpoint — Shows the 95th percentile response time for each endpoint, highlighting tail latency and endpoint-specific performance issues:
histogram_quantile(0.95,
sum(rate(flask_http_request_duration_seconds_bucket{namespace="$namespace", pod=~"$pod"}[1m])) by (le, path)
)
P95 Latency (Service-Wide) — Shows overall 95th percentile latency across the application, useful for validating whether auto-scaling improves user experience under load:
histogram_quantile(0.95,
sum(rate(flask_http_request_duration_seconds_bucket{namespace="$namespace", pod=~"$pod"}[1m])) by (le)
)
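The P95 panels rely on histogram_quantile, which estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket where the target rank falls. The Python sketch below reproduces that idea for classic histograms; it is a simplified approximation with hypothetical names and made-up bucket data, and it assumes 0 < q ≤ 1 and that the lowest bucket starts at 0.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total observation count
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # rank falls in +Inf: report highest finite bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up latency buckets in seconds: le=0.1, 0.5, 1.0 and +Inf.
latency = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]

print(histogram_quantile(0.95, latency))
print(histogram_quantile(0.50, latency))
```

Because the result is interpolated, its precision depends on the bucket boundaries: with these buckets, any P95 between 0.5 s and 1.0 s is an estimate rather than an observed value.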
Phase 7 — ArgoCD Image Updater (Closing the GitOps Loop)
This was the last phase, and the one that made everything feel complete.
The gap without it
Even with the full CI pipeline running, there was still one manual step: promoting a tested image from dev to prod. Someone had to edit values-prod.yaml, change the image tag, and push. That's not GitOps — that's just a slightly more structured shell script.
What Image Updater does
ArgoCD Image Updater watches your container registry and automatically commits an updated image tag to your Git repo when a new image appears. Combined with ArgoCD's sync automation, this makes the entire flow from git push to production completely automatic.
Configuration: an ImageUpdater resource targeting the dev Application:
# argocd/image-updater-dev.yaml
apiVersion: argocd-image-updater.argoproj.io/v1alpha1
kind: ImageUpdater
metadata:
name: dev-image-updater
namespace: argocd
spec:
applicationRefs:
- namePattern: "devops-app-dev"
images:
- alias: "myapp"
imageName: "saadpatel2400/devops-app"
# Strategy must be inside commonUpdateSettings
commonUpdateSettings:
updateStrategy: "newest-build"
allowTags: "regexp:^dev-[a-f0-9]+$"
manifestTargets:
helm:
name: "image.repository"
tag: "image.tag"
writeBackConfig:
# Secret reference goes directly in the 'method' field
method: "git:secret:argocd/git-creds"
gitConfig:
repository: "https://github.com/psaad2400/gitops-k8s-devops-platform"
branch: "main"
writeBackTarget: "helmvalues:./values-dev.yaml"
📄 Full app-prod.yaml with sync policy: [GitHub link]
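The selection logic configured above (an allowTags regular expression plus the newest-build strategy) can be sketched in Python. This only illustrates the idea of filtering tags and then picking the most recently built one; pick_newest_build is a hypothetical helper, and every tag and build date in the sample other than dev-d6189e7 is invented.

```python
import re
from datetime import datetime
from typing import Dict, Optional

# Same pattern as the allowTags setting, without the "regexp:" prefix.
ALLOW_TAGS = re.compile(r"^dev-[a-f0-9]+$")


def pick_newest_build(tags: Dict[str, datetime]) -> Optional[str]:
    """Among tags matching ALLOW_TAGS, return the one with the latest build time."""
    candidates = {tag: built for tag, built in tags.items() if ALLOW_TAGS.match(tag)}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Invented registry contents: tag -> image build time.
registry = {
    "dev-d6189e7": datetime(2024, 1, 1),
    "dev-0a1b2c3": datetime(2024, 1, 5),
    "prod": datetime(2024, 1, 9),     # rejected: does not start with dev-
    "dev-XYZ": datetime(2024, 1, 10), # rejected: not lowercase hex
}

print(pick_newest_build(registry))
```

The tag filter matters as much as the strategy: without it, the newer prod image would be picked up by the dev Application.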
With this in place, the complete flow becomes:
git push
→ CI builds image, Trivy scans
→ image pushed to GHCR as v1.2.3
→ Image Updater detects new semver tag
→ Image Updater commits updated tag to values-prod.yaml
→ ArgoCD detects drift → syncs prod
→ v1.2.3 live in prod
Zero manual steps. Zero human in the loop between writing code and it being live in production.
[SCREENSHOT: ArgoCD Image Updater logs showing "Found new image tag v1.2.4, updating values-prod.yaml"]
[SCREENSHOT: ArgoCD UI showing app-prod syncing automatically after Image Updater's commit]
Final Numbers
| What changed | Before | After | Improvement |
|---|---|---|---|
| Cluster provisioning | 17 min | 5 min | 70% faster |
| Docker build time | ~8 min | ~4 min | 50% faster |
| End-to-end deployment | ~10 min (manual) | ~0 min (automated) | 68% faster |
| CVE gate | None | Blocking HIGH/CRITICAL | Production protected |
| Manual steps per deploy | ~8 | 0 | Fully automated |
What I'd Do Differently
A month in, here's what I'd change if I started over:
Set up Image Updater on day one. I added it last, but it should be part of the initial bootstrap. Once it's running, you stop thinking about "pushing to prod" as a task — it just happens.
Use ignore-unfixed from the start. I spent two days chasing CVEs that had no fix before I found this flag. Save yourself the time and understand why the flag exists before assuming every Trivy failure means you have work to do.
Metrics-server should be the first ArgoCD app, not the last. Everything that depends on resource metrics (HPA, VPA, kubectl top) is broken until it's running. I learned this the hard way after an hour wondering why HPA wasn't scaling.
Write the k6 tests before setting up the HPA. Having a load test ready makes it trivially easy to verify that HPA thresholds are set correctly. Without it, you're guessing at averageUtilization values.
The full source is on GitHub — [ Full Link]. Feel free to fork it, break it, and build something better.
If you're building something similar or have questions about any of the config, drop a comment below.
Tags: #devops #kubernetes #cicd #devsecops #argocd #terraform #github-actions #grafana #trivy #gitops #k6 #helm



