Oracleworld: Grafana Monitoring Stack — Layman to SRE Expert Guide

First — The Analogy (Understand This, You Understand Everything)

Think of the monitoring stack like a Hospital Reporting System:

Oracle DB = The Patient (has all the health data inside)

Oracle Exporter = The Doctor (examines patient, writes report)

metrics.yaml = The Checklist (what symptoms/vitals to check)

DB Connection = The Hospital ID Card (proves doctor can see patient)

Grafana Agent = The Courier (picks up report, delivers to HQ)

River File = Courier's Job Instructions (where to pick up, where to deliver)

Prometheus = The Hospital HQ Archive (stores all reports over time)

Grafana = The Dashboard Screen in HQ (shows trends, alerts)

Component 1 — Oracle DB Exporter

What is it in layman terms?

It is a small program that sits next to your Oracle DB, logs in, runs some SQL queries, and then stands on a door waiting for someone to ask "what did you find?"

That door = http://localhost:9161/metrics

What does it actually return?

Plain text. Like this:

oracledb_sessions_total{status="ACTIVE"} 42

oracledb_tablespace_used_percent{name="USERS"} 78.5

oracledb_up 1

oracledb_up 1 means exporter connected to Oracle successfully. oracledb_up 0 means connection failed — first thing you check when something breaks.

SRE Level — What to verify:

# Is exporter running?

ps -ef | grep oracledb_exporter

# Can you hit its door?

curl http://localhost:9161/metrics | head -50

# Is Oracle reachable from exporter user?

sqlplus monitoring_user/password@//host:1521/SERVICE

Component 2 — DB Connection File

Layman explanation

This is the key to Oracle's front door. Without this the exporter cannot get in.

DATA_SOURCE_NAME="username/password@//hostname:1521/SERVICENAME"

Breaking it apart like a URL:

username → who is logging in

password → their password

hostname → which server Oracle is on

1521 → which port Oracle's listener is on

SERVICENAME → which database (like a room number inside the building)

Common mistakes beginners make:

Mistake	Result
Using SID instead of Service Name	Connection fails silently
Wrong port	Timeout error
User missing V$ grants	Exporter connects but metrics show zero
CRLF in file (Windows line endings)	Variable corrupted

SRE Level — Grants the monitoring user needs:

-- Minimum required grants

GRANT CREATE SESSION TO monitoring_user;

GRANT SELECT ON V_$SESSION TO monitoring_user;

GRANT SELECT ON V_$DATABASE TO monitoring_user;

GRANT SELECT ON V_$TABLESPACE TO monitoring_user;

GRANT SELECT ON V_$DATAFILE TO monitoring_user;

GRANT SELECT ON DBA_TABLESPACE_USAGE_METRICS TO monitoring_user;

GRANT SELECT ON V_$SYSSTAT TO monitoring_user;

GRANT SELECT ON V_$WAITSTAT TO monitoring_user;

Component 3 — metrics.yaml

Layman explanation

This is the SQL query book. It tells the exporter:

"Every time someone asks for metrics, run these SQL queries"
"Take the results and name them like this"
"These columns become labels, this column becomes the number"

Simple example — dissected line by line:

yaml

- context: "sessions" # ← metric will be named oracledb_sessions_XXXX

metricsdesc:

total: "Session count" # ← column 'total' from SQL = the number measured

request: > # ← the actual SQL that runs in Oracle

SELECT status,

COUNT(*) as total

FROM v$session

GROUP BY status

labels: # ← these SQL columns become Prometheus labels

- "status"

What Prometheus stores after this:

oracledb_sessions_total{status="ACTIVE"} = 42

oracledb_sessions_total{status="INACTIVE"} = 108

oracledb_sessions_total{status="KILLED"} = 3

SRE Level — Types of metrics you should build:

Component 4 — River File (Grafana Agent Brain)

Layman explanation

Think of it as the agent's job description written in a file. It tells the agent:

"Go to this address every 60 seconds and collect metrics"
"After collecting, send them to this central server"
"Add these labels so we know which server this came from"

Full river file explained line by line:

hcl

// ━━━ BLOCK 1: Go collect from Oracle Exporter ━━━━━━━━━

prometheus.scrape "oracle" {

targets = [{

__address__ = "localhost:9161" // exporter's door address

}]

scrape_interval = "60s" // knock on door every 60 seconds

// add extra labels to every metric collected

// so in Grafana you know which server it came from

extra_labels = {

env = "production",

db_name = "ORCL",

region = "EU",

}

forward_to = [prometheus.remote_write.central.receiver]

// ↑ after collecting, pass to Block 2

}

// ━━━ BLOCK 2: Send collected metrics to central server ━━

prometheus.remote_write "central" {

endpoint {

url = "http://central-server:9090/api/v1/write"

}

SRE Level — Key things to tune:

Setting	What it means	Recommended
`scrape_interval`	How often metrics are collected	60s for DB, 15s for critical
`scrape_timeout`	How long to wait before giving up	Always less than interval
`extra_labels`	Tags added to every metric	env, db_name, host, region
`remote_write url`	Where to push data	Your central Prometheus

Complete Flow — Step by Step What Happens Every 60 Seconds

Second 0: Agent timer fires — "time to scrape"

│

Second 1: Agent hits http://localhost:9161/metrics

│

Second 1-3: Exporter receives request

→ opens Oracle connection using connection.env

→ runs all SQL queries from metrics.yaml

→ formats results as Prometheus text

→ returns to agent

│

Second 3: Agent receives ~500 metric lines

→ adds your extra labels (env, db_name etc)

→ puts them in local buffer

│

Second 5: Agent sends buffer to central Prometheus

via HTTP POST to /api/v1/write

│

Second 6: Prometheus receives, stores in its time series DB

│

Any time: Grafana queries Prometheus

→ you see graphs, dashboards, alerts

What to Check When Something Breaks

Step 1: Is Oracle DB up?

→ sqlplus / tnsping

Step 2: Is Exporter running and connected?

→ curl localhost:9161/metrics

→ check oracledb_up value (1=good, 0=bad)

Step 3: Is Agent running?

→ systemctl status grafana-agent

→ check agent logs for scrape errors

Step 4: Is data reaching Prometheus?

→ open Prometheus UI :9090

→ query: oracledb_up

→ check Targets page — should show UP

Step 5: Is Grafana showing data?

→ check data source connection

→ test a simple query in Explore tab

Reference Documents — What to Read

Official Docs (Bookmark These):

What	URL
Grafana Alloy (new agent)	https://grafana.com/docs/alloy/latest/
River/Alloy config reference	https://grafana.com/docs/alloy/latest/reference/components/
Oracle DB Exporter	https://github.com/iamseth/oracledb_exporter
Prometheus remote write	https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write
PromQL queries	https://prometheus.io/docs/prometheus/latest/querying/basics/

Oracleworld

Thursday, May 28, 2026

Grafana Monitoring Stack — Layman to SRE Expert Guide

First — The Analogy (Understand This, You Understand Everything)

Component 1 — Oracle DB Exporter

What is it in layman terms?

What does it actually return?

Component 2 — DB Connection File

Layman explanation

Component 3 — metrics.yaml

Layman explanation

Component 4 — River File (Grafana Agent Brain)

Layman explanation

Full river file explained line by line:

No comments:

Post a Comment