I Built an AI-Powered Nginx Observability Platform Using Temporal, Ollama, and Splunk — Here's Everything I Learned

From raw nginx logs to natural language queries, automated IP blocking, and full infrastructure monitoring — a complete DevOps project walkthrough.


I recently started learning DevOps. After finishing Linux, shell scripting, Git, and GitHub, I wanted to build something real. Not a tutorial project, but something that actually works end to end and shows how all these tools fit together in practice.

So I built nginx-ai-ops — a platform where you can ask your nginx logs questions in plain English and get real answers, with full infrastructure monitoring and automated security response.

This post documents everything I built, why I made each decision, and what I learned along the way.


🏗️ Architecture Overview

Before I explain each piece, here's the full picture. The platform has 6 layers:

  1. Primary Server (VM1) — nginx web server with iptables firewall

  2. Splunk Stack — log ingestion, indexing, dashboards and alerts

  3. AI Agent Layer — Temporal + Ollama converts plain English to Splunk queries

  4. Monitoring Stack — Prometheus + Grafana for system metrics

  5. Automation — shell scripts for log rotation, backup and IP blocking

  6. Secondary Server (VM2) — receives log backups via SCP


🌐 Part 1 — Nginx with a Custom Log Format

The foundation of everything is nginx — it serves traffic and writes logs. But default nginx logs are hard to parse. I created a custom log format called splunk_format that names every field explicitly:

nginx

log_format splunk_format '$remote_addr - $remote_user [$time_local] '
                         '"$request" $status $body_bytes_sent '
                         '"$http_referer" "$http_user_agent" '
                         'request_time=$request_time '
                         'upstream_time=$upstream_response_time '
                         'upstream_addr=$upstream_addr '
                         'host=$host '
                         'server_name=$server_name '
                         'request_method=$request_method '
                         'uri=$uri '
                         'args=$args '
                         'bytes_sent=$bytes_sent '
                         'request_length=$request_length';

access_log /var/log/nginx/access.log splunk_format;

Why this matters: When Splunk ingests these logs, it can automatically extract every field without any manual configuration. Fields like status, uri, request_time, remote_addr are all named and ready to query.

192.168.0.105 - - [05/Mar/2026:15:58:26 +0530] "GET / HTTP/1.1" 200 409 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.0.0 Safari/537.36" request_time=0.000 upstream_time=- upstream_addr=- host=192.168.0.110 server_name=_ request_method=GET uri=/index.nginx-debian.html args=- bytes_sent=667 request_length=569
192.168.0.105 - - [05/Mar/2026:15:58:27 +0530] "GET /favicon.ico HTTP/1.1" 404 196 "http://192.168.0.110/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.0.0 Safari/537.36" request_time=0.000 upstream_time=- upstream_addr=- host=192.168.0.110 server_name=_ request_method=GET uri=/favicon.ico args=- bytes_sent=391 request_length=511
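
To make the structure of one of these lines concrete, here is a minimal Python sketch that splits a splunk_format line into named fields. It only illustrates the format and is not how Splunk extracts fields internally. The regular expressions and the parse_line helper are assumptions written for this example, and SAMPLE is the first log line shown above.

```python
import re

# Leading positional part of the line: client IP, user, timestamp, request,
# status, body size, referer and user agent. Everything after it is key=value.
PREFIX = re.compile(
    r'^(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)" (?P<tail>.*)$'
)

# One key=value pair from the tail, e.g. request_time=0.000 or args=-
KV = re.compile(r'(\w+)=(\S*)')

SAMPLE = (
    '192.168.0.105 - - [05/Mar/2026:15:58:26 +0530] "GET / HTTP/1.1" 200 409 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/145.0.0.0 Safari/537.36" request_time=0.000 upstream_time=- upstream_addr=- '
    'host=192.168.0.110 server_name=_ request_method=GET uri=/index.nginx-debian.html '
    'args=- bytes_sent=667 request_length=569'
)


def parse_line(line: str) -> dict:
    """Return every field of one splunk_format log line as a dict of strings."""
    match = PREFIX.match(line.strip())
    if match is None:
        raise ValueError("line does not match splunk_format")
    fields = match.groupdict()
    tail = fields.pop("tail")
    # Merge the key=value pairs from the tail into the positional fields.
    fields.update(dict(KV.findall(tail)))
    return fields
```

Calling parse_line(SAMPLE) gives a dict with keys such as remote_addr, status, uri and request_time, which are the same names used in the Splunk searches later in this post.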

📊 Part 2 — Splunk Stack

Installing the Splunk Universal Forwarder

The Splunk Universal Forwarder runs on VM1 and watches the nginx log files:

ini

# inputs.conf
[monitor:///var/log/nginx/access.log]
disabled = false
index = nginx
sourcetype = nginx:access

[monitor:///var/log/nginx/error.log]
disabled = false
index = nginx
sourcetype = nginx:error

It forwards everything to the Splunk Indexer on port 9997:

ini

# outputs.conf
[tcpout]
defaultGroup = default-autolb-group

[tcpout:default-autolb-group]
server = 192.168.0.110:9997

Field Extraction

Because of my custom log format, Splunk extracted all fields automatically. No regex needed. Just search index=nginx and all fields appear instantly.

Building the Dashboard

I built a Splunk dashboard with multiple tabs showing:

  • Total requests by status code

  • Top IP addresses

  • Slowest endpoints

  • Error rate over time

  • Bandwidth usage per URI


Setting Up Alerts

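Alerts are the most powerful part of Splunk. I set up a real-time alert that fires when any IP makes more than 10 requests in a minute. The search itself only counts requests per IP; the one-minute window is set on the alert's time range, not in the search string: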

index=nginx | stats count by remote_addr | where count > 10

When this alert fires, it triggers block_ip.sh automatically.
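
The rule behind this alert can be sketched in a few lines of Python. This is only an illustration of the counting logic, not something that runs inside Splunk. The noisy_ips function, its parameters and the shape of the events input are assumptions made for this example.

```python
from collections import Counter
from datetime import datetime, timedelta


def noisy_ips(events, now, window=timedelta(minutes=1), threshold=10):
    """Return {ip: request_count} for IPs above the threshold inside the window.

    events is an iterable of (timestamp, remote_addr) pairs.
    """
    cutoff = now - window
    # Equivalent of "stats count by remote_addr", restricted to the time window.
    counts = Counter(ip for ts, ip in events if cutoff < ts <= now)
    # Equivalent of "where count > 10".
    return {ip: n for ip, n in counts.items() if n > threshold}
```

An IP with exactly 10 requests in the window is not reported, matching the strict "greater than" in the search.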


⚙️ Part 3 — Automation Scripts

Log Rotation

I wrote log_rotation.sh which runs via cron every 10 minutes:

  1. Compresses access.log → access_TIMESTAMP.log.gz

  2. Clears the current log and reloads nginx

  3. Keeps only the 3 most recent backups — deletes oldest automatically

  4. SCPs the compressed file to the backup server (VM2)

bash

# Cron entry
*/10 * * * * /usr/local/bin/log_rotation.sh >> /var/log/log_rotation.log 2>&1
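
For readers who think better in Python than in shell, the compress, clear and prune steps above can be sketched roughly as follows. This is not the actual log_rotation.sh. The rotate function, the timestamp format and the directory layout are assumptions, and the nginx reload and the SCP transfer to VM2 are left out.

```python
import gzip
import shutil
import time
from pathlib import Path


def rotate(log_path: Path, backup_dir: Path, keep: int = 3) -> Path:
    """Compress the log, empty it, and keep only the newest `keep` archives."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    # Assumed timestamp format; it sorts chronologically when sorted by name.
    stamp = time.strftime("%Y%m%d_%H%M%S")
    archive = backup_dir / f"access_{stamp}.log.gz"

    # Step 1: compress the current log into a timestamped archive.
    with log_path.open("rb") as src, gzip.open(archive, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Step 2: truncate the live log in place (the file itself is kept).
    with log_path.open("wb"):
        pass

    # Step 3: delete the oldest archives beyond the retention limit.
    archives = sorted(backup_dir.glob("access_*.log.gz"))
    for old in archives[: max(len(archives) - keep, 0)]:
        old.unlink()
    return archive
```

Lines written between the copy and the truncate would be lost in this sketch, which is one reason production setups often rename the file and signal nginx to reopen its logs instead.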

IP Blocking Script

block_ip.sh is triggered by Splunk when an alert fires. Splunk passes the path to a gzipped CSV of the search results as argument $8. The script:

  1. Validates the results file exists

  2. Extracts the IP from the CSV

  3. Validates it looks like a real IP

  4. Checks if already blocked

  5. Runs iptables -I INPUT -s <IP> -j DROP

  6. Logs the result to /var/log/ddos_block.log

bash

sudo /sbin/iptables -I INPUT -s "$IP" -j DROP
echo "$(date) - SUCCESS: Blocked $IP due to DoS alert" >> "$LOGFILE"
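
The parsing and validation steps are the part most worth getting right, so here is a hedged Python sketch of them. It is not the real block_ip.sh. The function names and the remote_addr column name are assumptions based on the alert search above, and the sketch only builds the iptables commands rather than running them.

```python
import csv
import gzip
import ipaddress


def first_ip(results_path: str, column: str = "remote_addr"):
    """Return the first valid IP in the gzipped results CSV, or None."""
    with gzip.open(results_path, "rt", newline="") as fh:
        for row in csv.DictReader(fh):
            candidate = (row.get(column) or "").strip()
            try:
                # ip_address() rejects anything that is not a real IPv4/IPv6 address.
                return str(ipaddress.ip_address(candidate))
            except ValueError:
                continue
    return None


def check_command(ip: str) -> list:
    """iptables -C exits with status 0 when the rule already exists."""
    return ["/sbin/iptables", "-C", "INPUT", "-s", ip, "-j", "DROP"]


def block_command(ip: str) -> list:
    """Insert a DROP rule for this source IP at the top of the INPUT chain."""
    return ["/sbin/iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"]
```

Validating with the ipaddress module instead of a loose pattern means a malformed value from the CSV can never end up as an argument to iptables.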

📈 Part 4 — Monitoring Stack

Node Exporter

Node Exporter runs on VM1 and exposes system metrics on port 9100 — CPU usage, memory, disk space, network traffic and more.

Prometheus

Prometheus scrapes Node Exporter every 15 seconds:

yaml

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          nodename: 'ubuntu'

Grafana

I imported the official Node Exporter Full dashboard (ID: 1860) which gives a complete view of the VM's health.

The dashboard shows in real time:

  • CPU busy %

  • System load

  • RAM usage (77.2% in my case)

  • Network traffic (kb/s in and out)

  • Disk space usage
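
To show where a number like the RAM usage comes from, here is a small Python sketch that reads Node Exporter's text output and computes used memory from two of its gauges. The formula is one common way to derive the figure and may not be exactly what the dashboard panel uses. The helper names are assumptions for this example.

```python
def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition lines into {series_name: value}."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token (no timestamps assumed).
        name, _, value = line.rpartition(" ")
        values[name] = float(value)
    return values


def ram_used_percent(metrics: dict) -> float:
    """Share of memory that is not available, as a percentage."""
    total = metrics["node_memory_MemTotal_bytes"]
    available = metrics["node_memory_MemAvailable_bytes"]
    return 100.0 * (1.0 - available / total)
```

The input text would be whatever Node Exporter serves at its /metrics endpoint on port 9100.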


🤖 Part 5 — The AI Agent (Most Exciting Part)

This is what makes the project unique. Instead of writing SPL queries manually in Splunk, I built an AI agent that:

  1. Takes your question in plain English

  2. Generates the correct Splunk SPL query using a local LLM

  3. Executes it against your Splunk server

  4. Returns a plain English answer with the raw data

Why Temporal?

I used Temporal as the workflow orchestration engine. The biggest advantage is durable execution: Temporal records the result of every completed step, so if the worker process crashes mid-query, the workflow resumes after the last completed step instead of starting over. No lost state.

Each step in the agent is a Temporal Activity:

python

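from datetime import timedelta

from temporalio import workflow

# generate_splunk_query, execute_splunk_query and format_answer are the
# activity functions, defined elsewhere in the project and imported here.

@workflow.defn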
class SplunkAgentWorkflow:
    @workflow.run
    async def run(self, user_prompt: str) -> dict:

        # Step 1: Ollama converts NL → SPL
        query_info = await workflow.execute_activity(
            generate_splunk_query,
            args=[user_prompt],
            start_to_close_timeout=timedelta(seconds=90)
        )

        # Step 2: Execute on Splunk
        splunk_results = await workflow.execute_activity(
            execute_splunk_query,
            args=[query_info],
            start_to_close_timeout=timedelta(seconds=120)
        )

        # Step 3: Format answer
        final = await workflow.execute_activity(
            format_answer,
            args=[user_prompt, query_info, splunk_results],
            start_to_close_timeout=timedelta(seconds=90)
        )

        return final
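
The idea of resuming after the last completed step can be shown with a toy sketch in plain Python. This is not Temporal's API and not how Temporal is implemented (Temporal keeps an event history on its server). The run_steps function and the JSON state file are assumptions invented purely to show the concept.

```python
import json
from pathlib import Path


def run_steps(steps, state_file: Path, first_input):
    """Run (name, function) steps in order, skipping those already recorded.

    Each step receives the previous step's result. Results must be
    JSON-serialisable because they are checkpointed to state_file.
    """
    done = json.loads(state_file.read_text()) if state_file.exists() else {}
    value = first_input
    for name, fn in steps:
        if name in done:
            # Completed in an earlier run: reuse the recorded result.
            value = done[name]
            continue
        value = fn(value)
        done[name] = value
        # Checkpoint after every step so a crash loses at most one step.
        state_file.write_text(json.dumps(done))
    return value
```

If the second of three steps raises, calling run_steps again with the same state file skips the first step and retries from the second, which is the behaviour the workflow above gets from Temporal without any hand-written checkpointing.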

Why Ollama?

I used Ollama to run llama3 locally — no API keys, no internet dependency, no cost. The model runs entirely on my machine.

The key to making it accurate was giving Ollama the exact nginx field names in the system prompt:

python

system_context = """
ALWAYS use these exact field names:
- remote_addr   : client IP address
- status        : HTTP response status code
- request_method: HTTP method
- uri           : request path
- bytes_sent    : response size in bytes
- request_time  : processing time in seconds
...
"""

Without this, the model would guess field names like level=ERROR or source which don't exist in nginx logs.
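
One way to keep that schema in a single place is to generate the prompt text from a dictionary. The sketch below is an assumption about how this could be organised, not the code in the repository. NGINX_FIELDS contains only the fields listed above, and the payload follows the shape of Ollama's generate endpoint (model, system, prompt, stream) without actually sending a request.

```python
import json

# Field names and descriptions taken from the system prompt shown above.
NGINX_FIELDS = {
    "remote_addr": "client IP address",
    "status": "HTTP response status code",
    "request_method": "HTTP method",
    "uri": "request path",
    "bytes_sent": "response size in bytes",
    "request_time": "processing time in seconds",
}


def build_system_context(fields: dict) -> str:
    """Render the schema as the aligned field list used in the system prompt."""
    width = max(len(name) for name in fields)
    lines = ["ALWAYS use these exact field names:"]
    lines += [f"- {name.ljust(width)}: {desc}" for name, desc in fields.items()]
    return "\n".join(lines)


def build_generate_payload(question: str, model: str = "llama3") -> str:
    """Build the JSON request body for a single, non-streaming generation."""
    return json.dumps({
        "model": model,
        "system": build_system_context(NGINX_FIELDS),
        "prompt": question,
        "stream": False,
    })
```

Adding a new field to the log format then means adding one dictionary entry, and the prompt stays in sync with the data.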

Example Queries

"Give me total requests with 200 status code"
→ index=nginx status=200 | stats count

"Show top 10 IPs by request count last 7 days"  
→ index=nginx | top limit=10 remote_addr

"For each IP show URLs hit, count and status codes last 15 days"
→ index=nginx earliest=-15d | stats count by remote_addr uri status | sort -count

Temporal Dashboard

You can watch every step of the agent execute in real time in the Temporal Web UI at http://localhost:8233.


Proof — Splunk Query History

Every query the agent generated and executed is visible in Splunk's search history.


🔑 Key Lessons Learned

1. Custom log formats save hours. The single best decision I made was defining splunk_format in nginx. It made every downstream tool (Splunk and the AI agent) work better immediately.

2. Field names matter for LLMs. The AI agent was generating wrong queries until I added exact field names to the prompt. Giving the LLM a schema of your data is the most impactful thing you can do for accuracy.

3. Temporal is overkill for simple tasks but perfect for agents. For a simple script, Temporal is unnecessary. But for an AI agent that makes LLM calls, hits external APIs, and needs to handle failures gracefully, it's exactly the right tool.

4. Splunk alerts are powerful automation triggers. Connecting Splunk alerts to shell scripts creates a real event-driven security system. The whole pipeline (detect anomaly → trigger script → block IP) happens in seconds with zero manual intervention.

5. Build things that are observable. Every component in this project leaves a trail somewhere. Between log_rotation.log, ddos_block.log, Temporal's web UI, Splunk's search history, and Prometheus metrics, you can always see what's happening and why.


📦 GitHub Repository

The full project with all configs, scripts, and READMEs is on GitHub:

🔗 nginx-ai-ops

nginx-ai-ops/
├── agents/query_agent/    — Temporal + Ollama + Flask
├── nginx/                 — nginx.conf with custom log format
├── splunk/                — forwarder and indexer configs
├── monitoring/            — Prometheus + Grafana setup
└── automation/            — log_rotation.sh + block_ip.sh

🛠️ What's Next

This project gave me a strong foundation in observability and automation. Next I'm moving to Docker — the goal is to containerize this entire stack so it can be deployed with a single docker-compose up command.

After that: Kubernetes.

If you're also learning DevOps, my advice is to pick one project and go deep rather than doing 10 shallow tutorials. The best way to learn is to break things and fix them in a real environment.


🙏 Connect

If you found this useful or want to discuss DevOps, connect with me on LinkedIn or drop a comment below.

⭐ If you use any part of this project, a star on GitHub means a lot!
