I Built an AI-Powered Nginx Observability Platform Using Temporal, Ollama, and Splunk — Here's Everything I Learned

I recently started learning DevOps. After finishing Linux, shell scripting, Git, and GitHub, I wanted to build something real: not a tutorial project, but something that works end to end and shows how all these tools fit together in practice.

So I built nginx-ai-ops — a platform where you can ask your nginx logs questions in plain English and get real answers, with full infrastructure monitoring and automated security response.

This post documents everything I built, why I made each decision, and what I learned along the way.

🏗️ Architecture Overview

Before I explain each piece, here's the full picture. The platform has six layers:

Primary Server (VM1) — nginx web server with iptables firewall
Splunk Stack — log ingestion, indexing, dashboards and alerts
AI Agent Layer — Temporal + Ollama converts plain English to Splunk queries
Monitoring Stack — Prometheus + Grafana for system metrics
Automation — shell scripts for log rotation, backup and IP blocking
Secondary Server (VM2) — receives log backups via SCP

🌐 Part 1 — Nginx with a Custom Log Format

The foundation of everything is nginx — it serves traffic and writes logs. But default nginx logs are hard to parse. I created a custom log format called splunk_format that names every field explicitly:

nginx

log_format splunk_format '$remote_addr - $remote_user [$time_local] '
                         '"$request" $status $body_bytes_sent '
                         '"$http_referer" "$http_user_agent" '
                         'request_time=$request_time '
                         'upstream_time=$upstream_response_time '
                         'upstream_addr=$upstream_addr '
                         'host=$host '
                         'server_name=$server_name '
                         'request_method=$request_method '
                         'uri=$uri '
                         'args=$args '
                         'bytes_sent=$bytes_sent '
                         'request_length=$request_length';

access_log /var/log/nginx/access.log splunk_format;

Why this matters: Splunk's automatic key=value extraction picks up the named fields at search time, so I did not have to write extraction rules by hand. Fields like uri, request_time, and request_method are ready to query as soon as the logs are indexed. Here are two sample lines:

192.168.0.105 - - [05/Mar/2026:15:58:26 +0530] "GET / HTTP/1.1" 200 409 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.0.0 Safari/537.36" request_time=0.000 upstream_time=- upstream_addr=- host=192.168.0.110 server_name=_ request_method=GET uri=/index.nginx-debian.html args=- bytes_sent=667 request_length=569
192.168.0.105 - - [05/Mar/2026:15:58:27 +0530] "GET /favicon.ico HTTP/1.1" 404 196 "http://192.168.0.110/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.0.0 Safari/537.36" request_time=0.000 upstream_time=- upstream_addr=- host=192.168.0.110 server_name=_ request_method=GET uri=/favicon.ico args=- bytes_sent=391 request_length=511
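
To make the structure of these lines concrete, here is a minimal Python sketch that splits one splunk_format line into fields. It is only an illustration of the format, not how Splunk extracts fields internally. The function name `parse_line` and both regexes are my own, and the user agent in the sample is shortened for readability. It assumes values in the trailing name=value part contain no spaces.

```python
import re

# Leading combined-style portion of splunk_format (positional fields).
PREFIX_RE = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)" ?'
)
# Trailing name=value pairs; assumes values contain no spaces.
KV_RE = re.compile(r'(\w+)=(\S*)')


def parse_line(line: str) -> dict:
    """Split one splunk_format line into a dict of field name -> string value."""
    match = PREFIX_RE.match(line)
    if match is None:
        raise ValueError("line does not match splunk_format")
    fields = match.groupdict()
    # Everything after the quoted user agent is name=value text.
    fields.update(KV_RE.findall(line[match.end():]))
    return fields


sample = (
    '192.168.0.105 - - [05/Mar/2026:15:58:27 +0530] "GET /favicon.ico HTTP/1.1" '
    '404 196 "http://192.168.0.110/" "Mozilla/5.0" request_time=0.000 '
    'upstream_time=- upstream_addr=- host=192.168.0.110 server_name=_ '
    'request_method=GET uri=/favicon.ico args=- bytes_sent=391 request_length=511'
)
parsed = parse_line(sample)
print(parsed["remote_addr"], parsed["status"], parsed["uri"], parsed["request_time"])
# prints: 192.168.0.105 404 /favicon.ico 0.000
```

Note that the first part of each line is positional, while everything after the user agent is self-describing name=value text.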

📊 Part 2 — Splunk Stack

Installing the Splunk Universal Forwarder

The Splunk Universal Forwarder runs on VM1 and watches the nginx log files:

ini

# inputs.conf
[monitor:///var/log/nginx/access.log]
disabled = false
index = nginx
sourcetype = nginx:access

[monitor:///var/log/nginx/error.log]
disabled = false
index = nginx
sourcetype = nginx:error

It forwards everything to the Splunk Indexer on port 9997:

ini

# outputs.conf
[tcpout]
defaultGroup = default-autolb-group

[tcpout:default-autolb-group]
server = 192.168.0.110:9997

Field Extraction

Because of my custom log format, Splunk extracted the fields at search time without any regex from me. Search index=nginx and they show up alongside the results.

Building the Dashboard

I built a Splunk dashboard with multiple tabs showing:

Total requests by status code
Top IP addresses
Slowest endpoints
Error rate over time
Bandwidth usage per URI

Setting Up Alerts

Alerts are the most powerful part of Splunk. I set up a real-time alert that fires when any IP makes more than 10 requests in a minute. The one-minute window comes from the alert's time range, not from the query itself:

index=nginx | stats count by remote_addr | where count > 10

When this alert fires, it triggers block_ip.sh automatically.
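
The logic behind the alert can be sketched in plain Python. This is only an illustration of "count per IP inside a one-minute window, then filter", not what Splunk runs internally. The function name `noisy_ips`, the second IP address, and the sample timestamps are made up for the example.

```python
from collections import Counter
from datetime import datetime, timedelta

THRESHOLD = 10  # same limit as the alert: more than 10 requests


def noisy_ips(events, now, window=timedelta(minutes=1), threshold=THRESHOLD):
    """Return {ip: count} for IPs with more than `threshold` requests in the window."""
    cutoff = now - window
    # Equivalent of "stats count by remote_addr" restricted to the time window.
    counts = Counter(ip for ts, ip in events if cutoff < ts <= now)
    # Equivalent of "where count > 10".
    return {ip: n for ip, n in counts.items() if n > threshold}


# Hypothetical events: (timestamp, client IP)
now = datetime(2026, 3, 5, 16, 0, 0)
events = [(now - timedelta(seconds=i), "192.168.0.105") for i in range(12)]
events += [(now - timedelta(seconds=i * 10), "192.168.0.106") for i in range(5)]
events += [(now - timedelta(minutes=5), "192.168.0.105")]  # outside the window

print(noisy_ips(events, now))  # {'192.168.0.105': 12}
```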

⚙️ Part 3 — Automation Scripts

Log Rotation

I wrote log_rotation.sh which runs via cron every 10 minutes:

Compresses access.log → access_TIMESTAMP.log.gz
Clears the current log and reloads nginx
Keeps only the 3 most recent backups — deletes oldest automatically
SCPs the compressed file to the backup server (VM2)

bash

# Cron entry
*/10 * * * * /usr/local/bin/log_rotation.sh >> /var/log/log_rotation.log 2>&1
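
The "keep only the 3 most recent backups" step is the part that is easiest to get wrong, so here is a small Python sketch of that retention rule. The real log_rotation.sh is a shell script; this only shows the idea. The function name `prune_backups` and the zero-padded timestamp format in the file names are assumptions on my part.

```python
import tempfile
from pathlib import Path

KEEP = 3  # the rotation keeps the 3 most recent backups


def prune_backups(backup_dir: Path, keep: int = KEEP) -> list:
    """Delete all but the `keep` newest access_*.log.gz files; return deleted names."""
    # Assumption: the timestamp is zero-padded (e.g. YYYYmmdd_HHMMSS), so
    # sorting by name is the same as sorting by age.
    backups = sorted(backup_dir.glob("access_*.log.gz"))
    stale = backups[:-keep] if keep > 0 else backups
    for path in stale:
        path.unlink()
    return [path.name for path in stale]


# Demo in a throwaway directory with five fake backups.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for minute in ("1500", "1510", "1520", "1530", "1540"):
        (folder / f"access_20260305_{minute}00.log.gz").touch()
    print(prune_backups(folder))
    print(sorted(p.name for p in folder.iterdir()))
```

The first print lists the two oldest files that were removed, the second the three that remain.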

IP Blocking Script

block_ip.sh is triggered by Splunk when an alert fires. Splunk passes the path to a gzipped CSV of the results as argument $8. The script:

Validates the results file exists
Extracts the IP from the CSV
Validates it looks like a real IP
Checks if already blocked
Runs iptables -I INPUT -s <IP> -j DROP
Logs the result to /var/log/ddos_block.log

bash

sudo /sbin/iptables -I INPUT -s "$IP" -j DROP
echo "$(date) - SUCCESS: Blocked $IP due to DoS alert" >> "$LOGFILE"
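
The extract-and-validate steps are worth showing, because you never want to feed unvalidated text to iptables. Below is a Python sketch of those two steps only. The actual block_ip.sh does this in shell, and the function name `first_ip` and the sample file contents are hypothetical. It assumes the results CSV has a remote_addr column, which matches the alert query above.

```python
import csv
import gzip
import ipaddress
import os
import tempfile
from typing import Optional


def first_ip(results_path: str, column: str = "remote_addr") -> Optional[str]:
    """Return the first valid IP in `column` of a gzipped CSV, or None."""
    with gzip.open(results_path, mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            candidate = (row.get(column) or "").strip()
            try:
                # ip_address() raises ValueError for anything that is not a
                # well-formed IPv4 or IPv6 address.
                return str(ipaddress.ip_address(candidate))
            except ValueError:
                continue
    return None


# Demo with a made-up results file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.csv.gz")
    with gzip.open(path, "wt", newline="") as handle:
        handle.write("remote_addr,count\nnot-an-ip,99\n192.168.0.105,42\n")
    print(first_ip(path))  # 192.168.0.105
```

Using the ipaddress module instead of a hand-written regex also rejects values like 999.1.1.1 that look like an IP but are not one.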

📈 Part 4 — Monitoring Stack

Node Exporter

Node Exporter runs on VM1 and exposes system metrics on port 9100 — CPU usage, memory, disk space, network traffic and more.

Prometheus

Prometheus scrapes Node Exporter every 15 seconds:

yaml

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          nodename: 'ubuntu'

Grafana

I imported the official Node Exporter Full dashboard (ID: 1860) which gives a complete view of the VM's health.

The dashboard shows in real time:

CPU busy %
System load
RAM usage (77.2% in my case)
Network traffic (kb/s in and out)
Disk space usage
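
For intuition on where a "CPU busy %" number comes from: Node Exporter exposes per-mode CPU time as counters, and a busy percentage is commonly derived from how much idle time accumulated between two scrapes. The sketch below is my own simplified arithmetic, not the dashboard's actual query. The function name and the sample numbers are made up.

```python
def cpu_busy_percent(idle_before: float, idle_after: float,
                     elapsed_seconds: float, cores: int) -> float:
    """Busy % over an interval, from two samples of summed idle CPU seconds."""
    # Total CPU time available in the interval is elapsed time * core count.
    idle_fraction = (idle_after - idle_before) / (elapsed_seconds * cores)
    return round(100.0 * (1.0 - idle_fraction), 1)


# Hypothetical: 2 cores, 15 s between scrapes, 24 of 30 core-seconds were idle.
print(cpu_busy_percent(1000.0, 1024.0, 15, 2))  # 20.0
```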

🤖 Part 5 — The AI Agent (Most Exciting Part)

This is what makes the project unique. Instead of writing SPL queries manually in Splunk, I built an AI agent that:

Takes your question in plain English
Generates the correct Splunk SPL query using a local LLM
Executes it against your Splunk server
Returns a plain English answer with the raw data

Why Temporal?

I used Temporal as the workflow orchestration engine. The biggest advantage is durable execution: Temporal records the result of every completed step, so if the worker process crashes mid-query, the workflow resumes from the last completed step instead of starting over. No lost state.
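
To show the idea of durable execution without any Temporal code, here is a toy Python sketch. It is not the Temporal API and is far simpler than what Temporal does (Temporal persists an event history on its server). It only demonstrates "record each completed step, skip it on the next run". All names and the result values are made up.

```python
def run_steps(steps, history):
    """Run named steps in order, reusing recorded results for completed steps."""
    result = None
    for name, func in steps:
        if name in history:
            result = history[name]  # already completed: reuse, do not re-run
            continue
        result = func(result)
        history[name] = result  # record the result before moving on
    return result


calls = []          # which steps actually executed
attempts = {"n": 0}


def generate(_previous):
    calls.append("generate")
    return "index=nginx status=200 | stats count"


def execute(spl):
    calls.append("execute")
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("simulated crash")  # first attempt dies mid-run
    return {"spl": spl, "count": 42}  # made-up result


steps = [("generate", generate), ("execute", execute)]
history = {}
try:
    run_steps(steps, history)
except RuntimeError:
    pass  # the "process" crashed; history still holds the first step

final = run_steps(steps, history)  # second run resumes after "generate"
print(calls)  # ['generate', 'execute', 'execute']
```

The LLM call ran once even though the run as a whole was attempted twice, which is the property that matters when each step is slow or expensive.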

Each step in the agent is a Temporal Activity:

python

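from datetime import timedelta

from temporalio import workflow

# generate_splunk_query, execute_splunk_query and format_answer are the
# activity functions defined elsewhere in the project.

@workflow.defn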
class SplunkAgentWorkflow:
    @workflow.run
    async def run(self, user_prompt: str) -> dict:

        # Step 1: Ollama converts NL → SPL
        query_info = await workflow.execute_activity(
            generate_splunk_query,
            args=[user_prompt],
            start_to_close_timeout=timedelta(seconds=90)
        )

        # Step 2: Execute on Splunk
        splunk_results = await workflow.execute_activity(
            execute_splunk_query,
            args=[query_info],
            start_to_close_timeout=timedelta(seconds=120)
        )

        # Step 3: Format answer
        final = await workflow.execute_activity(
            format_answer,
            args=[user_prompt, query_info, splunk_results],
            start_to_close_timeout=timedelta(seconds=90)
        )

        return final

Why Ollama?

I used Ollama to run llama3 locally — no API keys, no internet dependency, no cost. The model runs entirely on my machine.

The key to making it accurate was giving Ollama the exact nginx field names in the system prompt:

python

system_context = """
ALWAYS use these exact field names:
- remote_addr   : client IP address
- status        : HTTP response status code
- request_method: HTTP method
- uri           : request path
- bytes_sent    : response size in bytes
- request_time  : processing time in seconds
...
"""

Without this, the model would guess field names like level=ERROR or source which don't exist in nginx logs.
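
Here is a rough sketch of how a schema like this can be turned into a request for Ollama's /api/generate endpoint. It is an illustration, not the exact code from the project: the helper names, the wording of the prompt, and the use of urllib instead of a client library are my assumptions. The URL is Ollama's default local address.

```python
import json
import urllib.request

# Field names and descriptions from the system prompt above.
FIELDS = {
    "remote_addr": "client IP address",
    "status": "HTTP response status code",
    "request_method": "HTTP method",
    "uri": "request path",
    "bytes_sent": "response size in bytes",
    "request_time": "processing time in seconds",
}


def build_system_prompt(fields: dict) -> str:
    """Render the field schema as the system prompt text."""
    lines = ["ALWAYS use these exact field names:"]
    lines += [f"- {name}: {meaning}" for name, meaning in fields.items()]
    return "\n".join(lines)


def build_generate_request(question: str, model: str = "llama3") -> dict:
    """Build the JSON body for a non-streaming /api/generate call."""
    return {
        "model": model,
        "system": build_system_prompt(FIELDS),
        "prompt": question,
        "stream": False,
    }


def ask_ollama(question: str, url: str = "http://localhost:11434/api/generate") -> str:
    """Send the request to a local Ollama server and return the generated text."""
    body = json.dumps(build_generate_request(question)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=90) as response:
        return json.loads(response.read())["response"]


payload = build_generate_request("Give me total requests with 200 status code")
print(payload["system"])
```

Keeping the schema in a dict means the prompt and any later validation of the generated query can share one source of truth.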

Example Queries

"Give me total requests with 200 status code"
→ index=nginx status=200 | stats count

"Show top 10 IPs by request count last 7 days"  
→ index=nginx earliest=-7d | top limit=10 remote_addr

"For each IP show URLs hit, count and status codes last 15 days"
→ index=nginx earliest=-15d | stats count by remote_addr uri status | sort -count
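
For the "execute it against Splunk" step, one option is Splunk's REST search endpoint on the management port. The sketch below only builds the request. The function name is hypothetical, the host is the indexer address from outputs.conf, port 8089 is Splunk's default management port, and I am assuming a one-shot search is good enough for these small queries. Authentication is left out on purpose.

```python
from urllib.parse import parse_qs, urlencode


def build_search_request(spl: str, host: str = "192.168.0.110", port: int = 8089):
    """Return (url, form_body) for a one-shot search via the Splunk REST API."""
    spl = spl.strip()
    # Over REST the search string has to begin with a command, so a bare
    # "index=..." query gets an explicit leading "search".
    if not spl.startswith(("search ", "|")):
        spl = "search " + spl
    url = f"https://{host}:{port}/services/search/jobs"
    body = urlencode({"search": spl, "exec_mode": "oneshot", "output_mode": "json"})
    return url, body


url, body = build_search_request("index=nginx status=200 | stats count")
print(url)
print(parse_qs(body)["search"][0])
# prints: search index=nginx status=200 | stats count
```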

Temporal Dashboard

You can watch every step of the agent execute in real time in the Temporal Web UI at http://localhost:8233.

Proof — Splunk Query History

Every query the agent generated and executed is visible in Splunk's search history.

🔑 Key Lessons Learned

1. Custom log formats save hours. The single best decision I made was defining splunk_format in nginx. It made every downstream tool (Splunk, the AI agent) work better immediately.

2. Field names matter for LLMs. The AI agent was generating wrong queries until I added exact field names to the prompt. Giving the LLM a schema of your data is the most impactful thing you can do for accuracy.

3. Temporal is overkill for simple tasks but perfect for agents. For a simple script, Temporal is unnecessary. But for an AI agent that makes LLM calls, hits external APIs, and needs to handle failures gracefully, it's exactly the right tool.

4. Splunk alerts are powerful automation triggers. Connecting Splunk alerts to shell scripts creates a real event-driven security system. The whole pipeline (detect anomaly → trigger script → block IP) runs in seconds with zero manual intervention.

5. Build things that are observable. Every component in this project writes logs somewhere: log_rotation.log, ddos_block.log, Temporal's web UI, Splunk's search history, Prometheus metrics. You can always see what's happening and why.

📦 GitHub Repository

The full project with all configs, scripts, and READMEs is on GitHub:

🔗 nginx-ai-ops

nginx-ai-ops/
├── agents/query_agent/    — Temporal + Ollama + Flask
├── nginx/                 — nginx.conf with custom log format
├── splunk/                — forwarder and indexer configs
├── monitoring/            — Prometheus + Grafana setup
└── automation/            — log_rotation.sh + block_ip.sh

🛠️ What's Next

This project gave me a strong foundation in observability and automation. Next I'm moving to Docker — the goal is to containerize this entire stack so it can be deployed with a single docker-compose up command.

After that: Kubernetes.

If you're also learning DevOps, my advice is to pick one project and go deep rather than doing 10 shallow tutorials. The best way to learn is to break things and fix them in a real environment.

🙏 Connect

If you found this useful or want to discuss DevOps, connect with me on LinkedIn or drop a comment below.

⭐ If you use any part of this project, a star on GitHub means a lot!
