Thursday, May 28, 2026

Grafana Monitoring Stack — Layman to SRE Expert Guide

First — The Analogy (Understand This, You Understand Everything)

Think of the monitoring stack like a Hospital Reporting System: 

Oracle DB               = The Patient (has all the health data inside)
Oracle Exporter       =     The Doctor (examines patient, writes report)
metrics.yaml            = The Checklist (what symptoms/vitals to check)
DB Connection        = The Hospital ID Card (proves doctor can see patient)
Grafana Agent         = The Courier (picks up report, delivers to HQ)
River File                = Courier's Job Instructions (where to pick up, where to deliver)
Prometheus             = The Hospital HQ Archive (stores all reports over time)
Grafana                   = The Dashboard Screen in HQ (shows trends, alerts)


Component 1 — Oracle DB Exporter

What is it in layman terms?

It is a small program that sits next to your Oracle DB, logs in, runs some SQL queries, and then stands on a door waiting for someone to ask "what did you find?"

That door = http://localhost:9161/metrics

What does it actually return?

Plain text. Like this:

oracledb_sessions_total{status="ACTIVE"} 42 

oracledb_tablespace_used_percent{name="USERS"} 78.5 

oracledb_up 1

oracledb_up 1 means exporter connected to Oracle successfully. oracledb_up 0 means connection failed — first thing you check when something breaks.

SRE Level — What to verify:

# Is exporter running?

ps -ef | grep oracledb_exporter

# Can you hit its door?

curl http://localhost:9161/metrics | head -50

# Is Oracle reachable from exporter user?

sqlplus monitoring_user/password@//host:1521/SERVICE


Component 2 — DB Connection File

Layman explanation

This is the key to Oracle's front door. Without this the exporter cannot get in.

DATA_SOURCE_NAME="username/password@//hostname:1521/SERVICENAME"

Breaking it apart like a URL:

username        → who is logging in

password        → their password

hostname        → which server Oracle is on

1521            → which port Oracle's listener is on

SERVICENAME     → which database (like a room number inside the building)

Common mistakes beginners make:

MistakeResult
Using SID instead of Service NameConnection fails silently
Wrong portTimeout error
User missing V$ grantsExporter connects but metrics show zero
CRLF in file (Windows line endings)Variable corrupted 

SRE Level — Grants the monitoring user needs:

-- Minimum required grants

GRANT CREATE SESSION TO monitoring_user;

GRANT SELECT ON V_$SESSION TO monitoring_user;

GRANT SELECT ON V_$DATABASE TO monitoring_user;

GRANT SELECT ON V_$TABLESPACE TO monitoring_user;

GRANT SELECT ON V_$DATAFILE TO monitoring_user;

GRANT SELECT ON DBA_TABLESPACE_USAGE_METRICS TO monitoring_user;

GRANT SELECT ON V_$SYSSTAT TO monitoring_user;

GRANT SELECT ON V_$WAITSTAT TO monitoring_user;

Component 3 — metrics.yaml

Layman explanation

This is the SQL query book. It tells the exporter:

  • "Every time someone asks for metrics, run these SQL queries"
  • "Take the results and name them like this"
  • "These columns become labels, this column becomes the number"
Simple example — dissected line by line:

yaml

- context: "sessions"           # ← metric will be named oracledb_sessions_XXXX
  
  metricsdesc:
    total: "Session count"      # ← column 'total' from SQL = the number measured

  request: >                    # ← the actual SQL that runs in Oracle
    SELECT status,
           COUNT(*) as total
    FROM v$session
    GROUP BY status

  labels:                       # ← these SQL columns become Prometheus labels
    - "status"
What Prometheus stores after this:

oracledb_sessions_total{status="ACTIVE"} = 42 
oracledb_sessions_total{status="INACTIVE"} = 108 
oracledb_sessions_total{status="KILLED"} = 3

SRE Level — Types of metrics you should build:


Component 4 — River File (Grafana Agent Brain)

Layman explanation

Think of it as the agent's job description written in a file. It tells the agent:

  • "Go to this address every 60 seconds and collect metrics"
  • "After collecting, send them to this central server"
  • "Add these labels so we know which server this came from"

Full river file explained line by line:

hcl
// ━━━ BLOCK 1: Go collect from Oracle Exporter ━━━━━━━━━
prometheus.scrape "oracle" {

  targets = [{
    __address__ = "localhost:9161"   // exporter's door address
  }]

  scrape_interval = "60s"           // knock on door every 60 seconds

  // add extra labels to every metric collected
  // so in Grafana you know which server it came from
  extra_labels = {
    env      = "production",
    db_name  = "ORCL",
    region   = "EU",
  }

  forward_to = [prometheus.remote_write.central.receiver]
  //            ↑ after collecting, pass to Block 2
}

// ━━━ BLOCK 2: Send collected metrics to central server ━━
prometheus.remote_write "central" {
  endpoint {
    url = "http://central-server:9090/api/v1/write"
  }
}

SRE Level — Key things to tune:

SettingWhat it meansRecommended
scrape_intervalHow often metrics are collected60s for DB, 15s for critical
scrape_timeoutHow long to wait before giving upAlways less than interval
extra_labelsTags added to every metricenv, db_name, host, region
remote_write urlWhere to push dataYour central Prometheus

Complete Flow — Step by Step What Happens Every 60 Seconds

Second 0:   Agent timer fires — "time to scrape"
            │
Second 1:   Agent hits http://localhost:9161/metrics
            │
Second 1-3: Exporter receives request
            → opens Oracle connection using connection.env
            → runs all SQL queries from metrics.yaml
            → formats results as Prometheus text
            → returns to agent
            │
Second 3:   Agent receives ~500 metric lines
            → adds your extra labels (env, db_name etc)
            → puts them in local buffer
            │
Second 5:   Agent sends buffer to central Prometheus
            via HTTP POST to /api/v1/write
            │
Second 6:   Prometheus receives, stores in its time series DB
            │
Any time:   Grafana queries Prometheus
            → you see graphs, dashboards, alerts

What to Check When Something Breaks

Step 1: Is Oracle DB up?
        → sqlplus / tnsping

Step 2: Is Exporter running and connected?
        → curl localhost:9161/metrics
        → check oracledb_up value (1=good, 0=bad)

Step 3: Is Agent running?
        → systemctl status grafana-agent
        → check agent logs for scrape errors

Step 4: Is data reaching Prometheus?
        → open Prometheus UI :9090
        → query: oracledb_up
        → check Targets page — should show UP

Step 5: Is Grafana showing data?
        → check data source connection
        → test a simple query in Explore tab

Reference Documents — What to Read

Official Docs (Bookmark These):

No comments:

Post a Comment