I Built an AI-Powered Nginx Observability Platform Using Temporal, Ollama, and Splunk — Here's Everything I Learned
From raw nginx logs to natural language queries, automated IP blocking, and full infrastructure monitoring — a complete DevOps project walkthrough.

I recently started learning DevOps. After finishing Linux, shell scripting, Git, and GitHub, I wanted to build something real. Not a tutorial project, but something that actually works end to end and shows how all these tools fit together in practice.
So I built nginx-ai-ops — a platform where you can ask your nginx logs questions in plain English and get real answers, with full infrastructure monitoring and automated security response.
This post documents everything I built, why I made each decision, and what I learned along the way.
🏗️ Architecture Overview
Before I explain each piece, here's the full picture.
The platform has 6 layers:
Primary Server (VM1) — nginx web server with iptables firewall
Splunk Stack — log ingestion, indexing, dashboards and alerts
AI Agent Layer — Temporal + Ollama converts plain English to Splunk queries
Monitoring Stack — Prometheus + Grafana for system metrics
Automation — shell scripts for log rotation, backup and IP blocking
Secondary Server (VM2) — receives log backups via SCP
🌐 Part 1 — Nginx with a Custom Log Format
The foundation of everything is nginx — it serves traffic and writes logs. But default nginx logs are hard to parse. I created a custom log format called splunk_format that names every field explicitly:
nginx
log_format splunk_format '$remote_addr - $remote_user [$time_local] '
'"$request" $status $body_bytes_sent '
'"$http_referer" "$http_user_agent" '
'request_time=$request_time '
'upstream_time=$upstream_response_time '
'upstream_addr=$upstream_addr '
'host=$host '
'server_name=$server_name '
'request_method=$request_method '
'uri=$uri '
'args=$args '
'bytes_sent=$bytes_sent '
'request_length=$request_length';
access_log /var/log/nginx/access.log splunk_format;
Why this matters: When Splunk ingests these logs, it can automatically extract every field without any manual configuration. Fields like status, uri, request_time, remote_addr are all named and ready to query.
192.168.0.105 - - [05/Mar/2026:15:58:26 +0530] "GET / HTTP/1.1" 200 409 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.0.0 Safari/537.36" request_time=0.000 upstream_time=- upstream_addr=- host=192.168.0.110 server_name=_ request_method=GET uri=/index.nginx-debian.html args=- bytes_sent=667 request_length=569
192.168.0.105 - - [05/Mar/2026:15:58:27 +0530] "GET /favicon.ico HTTP/1.1" 404 196 "http://192.168.0.110/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.0.0 Safari/537.36" request_time=0.000 upstream_time=- upstream_addr=- host=192.168.0.110 server_name=_ request_method=GET uri=/favicon.ico args=- bytes_sent=391 request_length=511
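To show how parseable this format is outside Splunk too, here is a minimal Python sketch. It assumes the line layout shown above: a positional prefix in the style of nginx's combined format, followed by name=value pairs. The names PREFIX_RE, KV_RE, and parse_line are illustrative and not part of the project.

```python
import re

# Positional prefix of splunk_format (everything before the name=value pairs).
PREFIX_RE = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)" ?'
)
# name=value pairs; values are assumed to contain no spaces.
KV_RE = re.compile(r'(\w+)=(\S*)')

SAMPLE = (
    '192.168.0.105 - - [05/Mar/2026:15:58:27 +0530] "GET /favicon.ico HTTP/1.1" '
    '404 196 "http://192.168.0.110/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)" '
    'request_time=0.000 upstream_time=- upstream_addr=- host=192.168.0.110 '
    'server_name=_ request_method=GET uri=/favicon.ico args=- bytes_sent=391 '
    'request_length=511'
)


def parse_line(line: str) -> dict:
    """Return every field of one splunk_format log line as strings."""
    match = PREFIX_RE.match(line)
    if match is None:
        raise ValueError("line does not match splunk_format")
    fields = match.groupdict()
    # Everything after the prefix is name=value, so one regex pass is enough.
    fields.update(KV_RE.findall(line[match.end():]))
    return fields


if __name__ == "__main__":
    parsed = parse_line(SAMPLE)
    print(parsed["remote_addr"], parsed["status"], parsed["uri"])
```

The name=value half needs no knowledge of field order, which is the property that makes the format convenient for downstream tools.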
📊 Part 2 — Splunk Stack
Installing the Splunk Universal Forwarder
The Splunk Universal Forwarder runs on VM1 and watches the nginx log files:
ini
# inputs.conf
[monitor:///var/log/nginx/access.log]
disabled = false
index = nginx
sourcetype = nginx:access
[monitor:///var/log/nginx/error.log]
disabled = false
index = nginx
sourcetype = nginx:error
It forwards everything to the Splunk Indexer on port 9997:
ini
# outputs.conf
[tcpout]
defaultGroup = default-autolb-group
[tcpout:default-autolb-group]
server = 192.168.0.110:9997
Field Extraction
Because of my custom log format, Splunk extracted all fields automatically. No regex needed. Just search index=nginx and all fields appear instantly.
Building the Dashboard
I built a Splunk dashboard with multiple tabs showing:
Total requests by status code
Top IP addresses
Slowest endpoints
Error rate over time
Bandwidth usage per URI
Setting Up Alerts
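The most powerful part of Splunk is alerts. I set up a real-time alert that fires when any IP makes more than 10 requests in a minute. The search itself only counts requests per IP; the one-minute window comes from the alert's time range settings: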
index=nginx | stats count by remote_addr | where count > 10
When this alert fires, it triggers block_ip.sh automatically.
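The logic of that search is simple enough to express in a few lines of Python. This is only a sketch of what `stats count by remote_addr | where count > 10` computes, assuming the input list has already been limited to one minute of traffic. The function name ips_over_threshold is made up for illustration.

```python
from collections import Counter


def ips_over_threshold(remote_addrs, threshold=10):
    """Count requests per client IP and keep the IPs above the threshold."""
    counts = Counter(remote_addrs)
    # Strictly greater than, matching "where count > 10" in the SPL.
    return {ip: n for ip, n in counts.items() if n > threshold}


if __name__ == "__main__":
    window = ["10.0.0.1"] * 11 + ["10.0.0.2"] * 10
    print(ips_over_threshold(window))
```

An IP with exactly 10 requests in the window does not trigger the alert; only the eleventh request does.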
⚙️ Part 3 — Automation Scripts
Log Rotation
I wrote log_rotation.sh which runs via cron every 10 minutes:
Compresses access.log → access_TIMESTAMP.log.gz
Clears the current log and reloads nginx
Keeps only the 3 most recent backups and deletes the oldest automatically
SCPs the compressed file to the backup server (VM2)
bash
# Cron entry
*/10 * * * * /usr/local/bin/log_rotation.sh >> /var/log/log_rotation.log 2>&1
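The retention step is the easiest one to get wrong, so here is a small Python sketch of it. The real log_rotation.sh is a shell script; this version only illustrates the "keep the 3 newest" rule, and the file pattern access_*.log.gz and the function name prune_backups are assumptions based on the naming described above.

```python
from pathlib import Path


def prune_backups(backup_dir, keep=3, pattern="access_*.log.gz"):
    """Delete all but the `keep` most recently modified backups.

    Returns the names of the files that were removed.
    """
    backups = sorted(
        Path(backup_dir).glob(pattern),
        key=lambda path: path.stat().st_mtime,
        reverse=True,  # newest first
    )
    removed = []
    for old in backups[keep:]:
        old.unlink()
        removed.append(old.name)
    return removed
```

Sorting by modification time rather than by file name keeps the rule correct even if the timestamp format in the file name changes.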
IP Blocking Script
block_ip.sh is triggered by Splunk when an alert fires. Splunk passes a gzipped CSV of results as argument $8. The script:
Validates the results file exists
Extracts the IP from the CSV
Validates it looks like a real IP
Checks if already blocked
Runs iptables -I INPUT -s <IP> -j DROP
Logs the result to /var/log/ddos_block.log
bash
sudo /sbin/iptables -I INPUT -s "$IP" -j DROP
echo "$(date) - SUCCESS: Blocked $IP due to DoS alert" >> "$LOGFILE"
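The extraction and validation steps can be sketched in Python as well. This is not the project's shell script, only an illustration of the same checks. It assumes the gzipped CSV has a header row with a remote_addr column (the field used in the alert search), and the function name first_valid_ip is hypothetical.

```python
import csv
import gzip
import ipaddress


def first_valid_ip(results_path, column="remote_addr"):
    """Return the first syntactically valid IP in the results file, or None."""
    with gzip.open(results_path, "rt", newline="") as handle:
        for row in csv.DictReader(handle):
            candidate = (row.get(column) or "").strip()
            try:
                # ip_address() rejects anything that is not a real IPv4/IPv6
                # literal, so shell metacharacters never reach iptables.
                return str(ipaddress.ip_address(candidate))
            except ValueError:
                continue
    return None
```

Validating with a real IP parser instead of a loose pattern matters here because the value ends up as an argument to a command run with sudo.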
📈 Part 4 — Monitoring Stack
Node Exporter
Node Exporter runs on VM1 and exposes system metrics on port 9100 — CPU usage, memory, disk space, network traffic and more.
Prometheus
Prometheus scrapes Node Exporter every 15 seconds:
yaml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          nodename: 'ubuntu'
Grafana
I imported the official Node Exporter Full dashboard (ID: 1860) which gives a complete view of the VM's health.
The dashboard shows in real time:
CPU busy %
System load
RAM usage (77.2% in my case)
Network traffic (kb/s in and out)
Disk space usage
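A figure like "CPU busy %" is derived from counters rather than measured directly. The sketch below shows one way such a percentage can be computed from two readings of Node Exporter's idle CPU counter (node_cpu_seconds_total with mode="idle", summed over all cores). It is an illustration of the arithmetic, not the dashboard's exact query, and the function name cpu_busy_percent is made up.

```python
def cpu_busy_percent(idle_start, idle_end, elapsed_seconds, cores=1):
    """Busy percentage from two idle-counter readings taken elapsed_seconds apart."""
    if elapsed_seconds <= 0 or cores <= 0:
        raise ValueError("elapsed_seconds and cores must be positive")
    # Fraction of the available CPU time that was spent idle.
    idle_fraction = (idle_end - idle_start) / (elapsed_seconds * cores)
    busy = 100.0 * (1.0 - idle_fraction)
    # Clamp to 0..100 to absorb small timing jitter between samples.
    return round(min(100.0, max(0.0, busy)), 1)


if __name__ == "__main__":
    # 12 idle seconds out of a 15-second scrape interval on one core.
    print(cpu_busy_percent(1000.0, 1012.0, 15))
```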
🤖 Part 5 — The AI Agent (Most Exciting Part)
This is what makes the project unique. Instead of writing SPL queries manually in Splunk, I built an AI agent that:
Takes your question in plain English
Generates the correct Splunk SPL query using a local LLM
Executes it against your Splunk server
Returns a plain English answer with the raw data
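For the "executes it against your Splunk server" step, one option is Splunk's REST search export endpoint. The sketch below only builds the HTTP request with the standard library. The management port 8089, the bearer token placeholder, and the helper names normalize_spl and build_export_request are assumptions for illustration, not necessarily how the project's agent talks to Splunk.

```python
import urllib.parse
import urllib.request


def normalize_spl(spl: str) -> str:
    """The REST API expects the query to start with 'search' or a pipe."""
    spl = spl.strip()
    if spl.startswith("|") or spl.lower().startswith("search "):
        return spl
    return "search " + spl


def build_export_request(spl, base_url="https://localhost:8089", token="CHANGE_ME"):
    """Build a POST request for /services/search/jobs/export (JSON output)."""
    body = urllib.parse.urlencode(
        {"search": normalize_spl(spl), "output_mode": "json"}
    ).encode("utf-8")
    return urllib.request.Request(
        base_url + "/services/search/jobs/export",
        data=body,
        headers={"Authorization": "Bearer " + token},
        method="POST",
    )


if __name__ == "__main__":
    request = build_export_request("index=nginx status=200 | stats count")
    print(request.get_method(), request.full_url)
    # Sending it would be: urllib.request.urlopen(request, timeout=120)
```

Keeping request construction separate from sending makes the step easy to test without a running Splunk instance.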
Why Temporal?
I used Temporal as the workflow orchestration engine. The biggest advantage is durable execution: if the worker process crashes mid-query, Temporal replays the workflow's recorded history and resumes after the last completed step automatically. No lost state, no starting over.
Each step in the agent is a Temporal Activity:
python
from datetime import timedelta

from temporalio import workflow

# generate_splunk_query, execute_splunk_query and format_answer are the
# project's activity functions, defined elsewhere and imported here.


@workflow.defn
class SplunkAgentWorkflow:
    @workflow.run
    async def run(self, user_prompt: str) -> dict:
        # Step 1: Ollama converts NL → SPL
        query_info = await workflow.execute_activity(
            generate_splunk_query,
            args=[user_prompt],
            start_to_close_timeout=timedelta(seconds=90),
        )
        # Step 2: Execute on Splunk
        splunk_results = await workflow.execute_activity(
            execute_splunk_query,
            args=[query_info],
            start_to_close_timeout=timedelta(seconds=120),
        )
        # Step 3: Format answer
        final = await workflow.execute_activity(
            format_answer,
            args=[user_prompt, query_info, splunk_results],
            start_to_close_timeout=timedelta(seconds=90),
        )
        return final
Why Ollama?
I used Ollama to run llama3 locally — no API keys, no internet dependency, no cost. The model runs entirely on my machine.
The key to making it accurate was giving Ollama the exact nginx field names in the system prompt:
python
system_context = """
ALWAYS use these exact field names:
- remote_addr : client IP address
- status : HTTP response status code
- request_method: HTTP method
- uri : request path
- bytes_sent : response size in bytes
- request_time : processing time in seconds
...
"""
Without this, the model would guess field names like level=ERROR or source which don't exist in nginx logs.
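Here is a minimal sketch of how a schema prompt like this can be sent to a local Ollama server using only the standard library. It assumes Ollama's default address (http://localhost:11434) and its /api/generate endpoint with streaming turned off. The helper names build_ollama_payload and ask_ollama are illustrative and may differ from the project's actual activity code.

```python
import json
import urllib.request


def build_ollama_payload(user_prompt, system_context, model="llama3"):
    """Assemble the JSON body: the schema goes in 'system', the question in 'prompt'."""
    return {
        "model": model,
        "system": system_context,
        "prompt": user_prompt,
        "stream": False,  # ask for one JSON object instead of a token stream
    }


def ask_ollama(payload, url="http://localhost:11434/api/generate"):
    """POST the payload to a local Ollama server and return the generated text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=90) as response:
        return json.loads(response.read())["response"].strip()


if __name__ == "__main__":
    schema = "ALWAYS use these exact field names:\n- status : HTTP response status code"
    payload = build_ollama_payload("Give me total requests with 200 status code", schema)
    print(json.dumps(payload, indent=2))
    # With Ollama running locally: print(ask_ollama(payload))
```

Putting the field list in the system message keeps it separate from the user's question, so the same schema is reused unchanged for every query.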
Example Queries
"Give me total requests with 200 status code"
→ index=nginx status=200 | stats count
"Show top 10 IPs by request count last 7 days"
→ index=nginx | top limit=10 remote_addr
"For each IP show URLs hit, count and status codes last 15 days"
→ index=nginx earliest=-15d | stats count by remote_addr uri status | sort -count
Temporal Dashboard
You can watch every step of the agent execute in real time at http://localhost:8233.
Proof — Splunk Query History
Every query the agent generated and executed is visible in Splunk's search history.
🔑 Key Lessons Learned
1. Custom log formats save hours
The single best decision I made was defining splunk_format in nginx. It made every downstream tool (Splunk, the AI agent) work better immediately.
2. Field names matter for LLMs
The AI agent was generating wrong queries until I added exact field names to the prompt. Giving the LLM a schema of your data is the most impactful thing you can do for accuracy.
3. Temporal is overkill for simple tasks but perfect for agents
For a simple script, Temporal is unnecessary. But for an AI agent that makes LLM calls, hits external APIs, and needs to handle failures gracefully, it's exactly the right tool.
4. Splunk alerts are powerful automation triggers
Connecting Splunk alerts to shell scripts creates a real event-driven security system. The whole pipeline (detect anomaly → trigger script → block IP) happens in seconds with zero manual intervention.
5. Build things that are observable
Every component in this project writes logs somewhere: log_rotation.log, ddos_block.log, Temporal's web UI, Splunk's search history, Prometheus metrics. You can always see what's happening and why.
📦 GitHub Repository
The full project with all configs, scripts, and READMEs is on GitHub:
nginx-ai-ops/
├── agents/query_agent/ — Temporal + Ollama + Flask
├── nginx/ — nginx.conf with custom log format
├── splunk/ — forwarder and indexer configs
├── monitoring/ — Prometheus + Grafana setup
└── automation/ — log_rotation.sh + block_ip.sh
🛠️ What's Next
This project gave me a strong foundation in observability and automation. Next I'm moving to Docker — the goal is to containerize this entire stack so it can be deployed with a single docker-compose up command.
After that: Kubernetes.
If you're also learning DevOps, my advice is to pick one project and go deep rather than doing 10 shallow tutorials. The best way to learn is to break things and fix them in a real environment.
🙏 Connect
If you found this useful or want to discuss DevOps, connect with me on LinkedIn or drop a comment below.
⭐ If you use any part of this project, a star on GitHub means a lot!














