Oracleworld

Thursday, May 28, 2026

Oracle Monitoring with Grafana using TOM agent

Oracle DB → (someone needs to read it) → collect data → (store it somewhere) → visualize it

That "someone" is your monitoring stack. Let me introduce the cast of characters.

👥 Meet the Characters (Components)

1. 🤖 TOM Agent — "The Field Reporter"

Installed on every DB host (the actual server where Oracle runs)
It's the boots-on-the-ground agent
It contains two things inside it: SQL Exporter + River File
Think of it as a reporter stationed at each office branch

2. 🔍 SQL Exporter — "The Interviewer"

Lives inside TOM Agent
Its job: run SQL queries against Oracle and convert the results into a format called metrics
Example: It runs SELECT count(*) FROM v$session → converts that number into a metric called oracle_sessions_active
It speaks to Oracle, Oracle doesn't know it's being monitored

3. 📄 River File — "The Traffic Controller"

Also lives inside TOM Agent
It's a configuration file (written in a language called River/Alloy syntax)
It tells the agent: "Hey, go collect these metrics, then send them to THIS address"
Think of it as the routing instructions — what to collect, how often, where to send

4. 📊 Metrics — "The Data Itself"

Not a tool — it's the output format
Metrics are simple number+label combinations
Example: oracle_db_sessions{host="prod-db1", status="active"} 42
This means: "On prod-db1, there are 42 active sessions right now"

5. 🗄️ Central Server — "The Post Office"

All agents from all DB hosts push their metrics here
It's a central collection point
Prometheus sits here waiting to receive or scrape data

6. 📈 Prometheus — "The Database for Numbers"

Stores all the metrics with timestamps
It's a time-series database — meaning it remembers that at 2PM there were 42 sessions, at 3PM there were 67, etc.
Grafana talks to Prometheus to get historical data
Also handles alerting rules

7. 📉 Grafana — "The Dashboard TV Screen"

The thing your team actually looks at
Beautiful charts, graphs, tables
Queries Prometheus for data and renders it visually
Where you set up alert notifications (Slack, email, PagerDuty, etc.)

🌊 The Complete Data Flow

+----------------------------------------------------------+

| DB HOST 1 |

| |

| +-----------------------------------------------+ |

| | TOM AGENT | |

| | | |

| | +----------------+ +----------------+ | |

| | | SQL Exporter | | River File | | |

| | | | | (config) | | |

| | | Runs SQL on | | | | |

| | | Oracle DB | | - what SQL | | |

| | | | | - how often | | |

| | | Outputs | | - where to | | |

| | | metrics | | send | | |

| | +-------+--------+ +----------------+ | |

| | | | |

| +----------+-----------------------------------+ |

| | |

| Oracle DB <--- SQL Exporter queries it |

+-------------+--------------------------------------------+

| PUSH metrics (HTTP)

+---------------------------+

| CENTRAL SERVER |

| |

| +---------------------+ |

| | Prometheus | |

| | | |

| | Stores metrics | |

| | with timestamps | |

| | | |

| | Evaluates alert | |

| | rules | |

| +---------+-----------+ |

| | |

+------------+--------------+

| Query (PromQL)

+------------------------+

| GRAFANA |

| |

| [*] Dashboards |

| [!] Alert UI |

| [~] Visual monitoring |

+------------------------+

📄 Deep Dive: The River File

This is the brain of the TOM Agent. Let me explain it line by line conceptually.

What River File Does:

1. Define WHERE your Oracle DB is (connection string)

2. Define WHAT to collect (point to SQL exporter config)

3. Define HOW OFTEN to collect

4. Define WHERE TO SEND the data (your central server)

River File Structure (conceptual):

Hcl

// ================================================

// BLOCK 1: Discovery / Connection

// Tell the agent about your Oracle DB

// ================================================

prometheus.exporter.oracle "my_oracle_db" {

connection_string = "oracle://username:password@localhost:1521/ORCL"

// This is like a phone number for your Oracle DB

}

// ================================================

// BLOCK 2: Scrape Config

// Tell the agent HOW to collect metrics

// ================================================

prometheus.scrape "collect_oracle" {

targets = prometheus.exporter.oracle.my_oracle_db.targets

forward_to = [prometheus.remote_write.central.receiver]

scrape_interval = "60s" // collect every 60 seconds

scrape_timeout = "30s" // give up if Oracle doesn't respond in 30s

}

// ================================================

// BLOCK 3: Remote Write

// Tell the agent WHERE to send the data

// ================================================

prometheus.remote_write "central" {

endpoint {

url = "http://central-server:9090/api/v1/write"

// This is your central Prometheus server address

}

Mentor Tip: Think of River file blocks like LEGO pieces — each block does one job, and you connect them together using the names you give them (like my_oracle_db and central)

📋 Deep Dive: SQL Exporter & Metrics YAML

The SQL Exporter needs a YAML file that tells it which SQL queries to run and what to name the metrics.

YAML File Structure:

Yaml

# ================================================

# GLOBAL SETTINGS

# ================================================

global:

scrape_timeout: 30s # How long to wait for Oracle to respond

scrape_timeout_offset: 500ms

min_interval: 0s # Minimum time between scrapes

cache_size: 1000

# ================================================

# JOBS — Each "job" = one group of related queries

# ================================================

jobs:

# JOB 1: Session Monitoring

- job_name: oracle_sessions

interval: '1m' # Run these queries every 1 minute

connections:

- "oracle://user:pass@host:1521/SID"

queries:

# QUERY 1: Count active sessions

- query: |

SELECT

status,

COUNT(*) as session_count

FROM v$session

WHERE type = 'USER'

GROUP BY status

metrics:

# Each column from your SQL becomes a metric or a label

- metric_name: oracle_sessions_total

type: gauge # gauge = current value (can go up or down)

help: "Total Oracle sessions by status"

values: [session_count] # This SQL column becomes the NUMBER

labels:

status: status # This SQL column becomes a LABEL (category)

# QUERY 2: Tablespace usage

- query: |

SELECT

tablespace_name,

used_space,

tablespace_size,

used_percent

FROM dba_tablespace_usage_metrics

metrics:

- metric_name: oracle_tablespace_used_percent

type: gauge

help: "Tablespace usage percentage"

values: [used_percent]

labels:

tablespace: tablespace_name

# JOB 2: Performance Monitoring

- job_name: oracle_performance

interval: '5m' # Less frequent — every 5 minutes

queries:

- query: |

SELECT

metric_name,

value

FROM v$sysmetric

WHERE group_id = 2

metrics:

- metric_name: oracle_sysmetric

type: gauge

help: "Oracle system metrics"

values: [value]

labels:

metric_name: metric_name

🔑 Key YAML Concepts:

🔔 Alert Mechanism — How Alerts Work

Alerts flow through two stages:

Stage 1: Prometheus detects the problem

Stage 2: Grafana/Alertmanager notifies your team

Stage 1 — Prometheus Alert Rules:

Yaml

# alert_rules.yml — lives on your Prometheus server

groups:

- name: oracle_alerts

rules:

# Alert 1: Too many active sessions

- alert: OracleHighSessions

expr: oracle_sessions_total{status="ACTIVE"} > 200

for: 5m # Must be true for 5 CONTINUOUS minutes (avoid false alarms)

labels:

severity: warning

annotations:

summary: "High session count on {{ $labels.instance }}"

description: "Sessions = {{ $value }} (threshold: 200)"

# Alert 2: Tablespace almost full

- alert: OracleTablespaceFull

expr: oracle_tablespace_used_percent > 85

for: 10m

labels:

severity: critical

annotations:

summary: "Tablespace {{ $labels.tablespace }} is {{ $value }}% full"

description: "Free up space immediately on {{ $labels.instance }}"

# Alert 3: DB host not responding (agent stopped pushing)

- alert: OracleDBDown

expr: up{job="oracle"} == 0

for: 2m

labels:

severity: critical

annotations:

summary: "Oracle DB unreachable: {{ $labels.instance }}"

Stage 2 — Where Alerts Go (Alertmanager):

Yaml

# Alertmanager routes alerts to your team

route:

receiver: 'team-slack'

routes:

- match:

severity: critical

receiver: 'oncall-pagerduty' # Critical = wake someone up!

- match:

severity: warning

receiver: 'team-slack' # Warning = Slack message is fine

receivers:

- name: 'team-slack'

slack_configs:

- channel: '#db-alerts'

text: '{{ .Annotations.description }}'

- name: 'oncall-pagerduty'

pagerduty_configs:

- service_key: 'your-key-here'

🗺️ The Complete Mental Model

Here's how to think about all of this together:

EACH DB SERVER:

┌──────────────────────────────────────────────────┐

│ TOM Agent │

│ │

│ River File answers: │

│ "Where is Oracle?" ──► SQL Exporter connects │

│ "What to collect?" ──► Points to YAML file │

│ "How often?" ──► scrape_interval │

│ "Where to send?" ──► Central server URL │

│ │

│ YAML File answers: │

│ "Which SQL queries to run?" │

│ "What to name each metric?" │

│ "Which columns are values vs labels?" │

└──────────────────────────────────────────────────┘

↓ pushes metrics every N seconds

CENTRAL SERVER:

Prometheus stores → oracle_sessions_total{host="prod1"} 42 @timestamp

↓ Grafana queries Prometheus

GRAFANA:

Shows charts + fires alerts when rules are breached

Important:--

River file vs dbconnection, which one really make connection to oracle

Who Makes the Connection to Oracle?

SQL Exporter makes the actual connection. But it needs to be told how — and that's where the two configs play different roles.

The Simple Truth

River File = ORCHESTRATOR (tells the agent what to do and where to send)

DB Connection = CREDENTIALS (tells SQL Exporter how to reach Oracle)

SQL Exporter = THE ONE WHO ACTUALLY CONNECTS to Oracle

Think of it Like This

Imagine you're a delivery driver (SQL Exporter)

DB Connection string = the ADDRESS of the house (where to go, door code)

River File = your MANAGER's instructions (go collect from this address, every 60 seconds, bring it back to HQ)

You (SQL Exporter) = the one who actually drives there and knocks on the door

Where Each One Lives and What it Does

Config	Lives In	Purpose	Makes DB Connection?
DATA_SOURCE_NAME	.env file or environment variable	Oracle credentials + host + port + service	❌ No, just stores info
River File	TOM Agent config	Orchestrates collection and routing	❌ No, just instructs
SQL Exporter	TOM Agent	Reads the connection string and actually connects	✅ YES

The Correct Flow

.env file / Environment Variable

DATA_SOURCE_NAME="user/pass@//host:1521/SERVICE"

| SQL Exporter READS this

SQL Exporter

| Uses it to CONNECT to Oracle

Oracle DB <--- actual connection happens HERE

| Results come back as metrics

River File picks up those metrics

| and ships them to Central Server

Prometheus

Grafana Monitoring Stack — Layman to SRE Expert Guide

First — The Analogy (Understand This, You Understand Everything)

Think of the monitoring stack like a Hospital Reporting System:

Oracle DB = The Patient (has all the health data inside)

Oracle Exporter = The Doctor (examines patient, writes report)

metrics.yaml = The Checklist (what symptoms/vitals to check)

DB Connection = The Hospital ID Card (proves doctor can see patient)

Grafana Agent = The Courier (picks up report, delivers to HQ)

River File = Courier's Job Instructions (where to pick up, where to deliver)

Prometheus = The Hospital HQ Archive (stores all reports over time)

Grafana = The Dashboard Screen in HQ (shows trends, alerts)

Component 1 — Oracle DB Exporter

What is it in layman terms?

It is a small program that sits next to your Oracle DB, logs in, runs some SQL queries, and then stands on a door waiting for someone to ask "what did you find?"

That door = http://localhost:9161/metrics

What does it actually return?

Plain text. Like this:

oracledb_sessions_total{status="ACTIVE"} 42

oracledb_tablespace_used_percent{name="USERS"} 78.5

oracledb_up 1

oracledb_up 1 means exporter connected to Oracle successfully. oracledb_up 0 means connection failed — first thing you check when something breaks.

SRE Level — What to verify:

# Is exporter running?

ps -ef | grep oracledb_exporter

# Can you hit its door?

curl http://localhost:9161/metrics | head -50

# Is Oracle reachable from exporter user?

sqlplus monitoring_user/password@//host:1521/SERVICE

Component 2 — DB Connection File

Layman explanation

This is the key to Oracle's front door. Without this the exporter cannot get in.

DATA_SOURCE_NAME="username/password@//hostname:1521/SERVICENAME"

Breaking it apart like a URL:

username → who is logging in

password → their password

hostname → which server Oracle is on

1521 → which port Oracle's listener is on

SERVICENAME → which database (like a room number inside the building)

Common mistakes beginners make:

Mistake	Result
Using SID instead of Service Name	Connection fails silently
Wrong port	Timeout error
User missing V$ grants	Exporter connects but metrics show zero
CRLF in file (Windows line endings)	Variable corrupted

SRE Level — Grants the monitoring user needs:

-- Minimum required grants

GRANT CREATE SESSION TO monitoring_user;

GRANT SELECT ON V_$SESSION TO monitoring_user;

GRANT SELECT ON V_$DATABASE TO monitoring_user;

GRANT SELECT ON V_$TABLESPACE TO monitoring_user;

GRANT SELECT ON V_$DATAFILE TO monitoring_user;

GRANT SELECT ON DBA_TABLESPACE_USAGE_METRICS TO monitoring_user;

GRANT SELECT ON V_$SYSSTAT TO monitoring_user;

GRANT SELECT ON V_$WAITSTAT TO monitoring_user;

Component 3 — metrics.yaml

Layman explanation

This is the SQL query book. It tells the exporter:

"Every time someone asks for metrics, run these SQL queries"
"Take the results and name them like this"
"These columns become labels, this column becomes the number"

Simple example — dissected line by line:

yaml

- context: "sessions" # ← metric will be named oracledb_sessions_XXXX

metricsdesc:

total: "Session count" # ← column 'total' from SQL = the number measured

request: > # ← the actual SQL that runs in Oracle

SELECT status,

COUNT(*) as total

FROM v$session

GROUP BY status

labels: # ← these SQL columns become Prometheus labels

- "status"

What Prometheus stores after this:

oracledb_sessions_total{status="ACTIVE"} = 42

oracledb_sessions_total{status="INACTIVE"} = 108

oracledb_sessions_total{status="KILLED"} = 3

SRE Level — Types of metrics you should build:

Component 4 — River File (Grafana Agent Brain)

Layman explanation

Think of it as the agent's job description written in a file. It tells the agent:

"Go to this address every 60 seconds and collect metrics"
"After collecting, send them to this central server"
"Add these labels so we know which server this came from"

Full river file explained line by line:

hcl

// ━━━ BLOCK 1: Go collect from Oracle Exporter ━━━━━━━━━

prometheus.scrape "oracle" {

targets = [{

__address__ = "localhost:9161" // exporter's door address

}]

scrape_interval = "60s" // knock on door every 60 seconds

// add extra labels to every metric collected

// so in Grafana you know which server it came from

extra_labels = {

env = "production",

db_name = "ORCL",

region = "EU",

}

forward_to = [prometheus.remote_write.central.receiver]

// ↑ after collecting, pass to Block 2

}

// ━━━ BLOCK 2: Send collected metrics to central server ━━

prometheus.remote_write "central" {

endpoint {

url = "http://central-server:9090/api/v1/write"

}

SRE Level — Key things to tune:

Setting	What it means	Recommended
`scrape_interval`	How often metrics are collected	60s for DB, 15s for critical
`scrape_timeout`	How long to wait before giving up	Always less than interval
`extra_labels`	Tags added to every metric	env, db_name, host, region
`remote_write url`	Where to push data	Your central Prometheus

Complete Flow — Step by Step What Happens Every 60 Seconds

Second 0: Agent timer fires — "time to scrape"

│

Second 1: Agent hits http://localhost:9161/metrics

│

Second 1-3: Exporter receives request

→ opens Oracle connection using connection.env

→ runs all SQL queries from metrics.yaml

→ formats results as Prometheus text

→ returns to agent

│

Second 3: Agent receives ~500 metric lines

→ adds your extra labels (env, db_name etc)

→ puts them in local buffer

│

Second 5: Agent sends buffer to central Prometheus

via HTTP POST to /api/v1/write

│

Second 6: Prometheus receives, stores in its time series DB

│

Any time: Grafana queries Prometheus

→ you see graphs, dashboards, alerts

What to Check When Something Breaks

Step 1: Is Oracle DB up?

→ sqlplus / tnsping

Step 2: Is Exporter running and connected?

→ curl localhost:9161/metrics

→ check oracledb_up value (1=good, 0=bad)

Step 3: Is Agent running?

→ systemctl status grafana-agent

→ check agent logs for scrape errors

Step 4: Is data reaching Prometheus?

→ open Prometheus UI :9090

→ query: oracledb_up

→ check Targets page — should show UP

Step 5: Is Grafana showing data?

→ check data source connection

→ test a simple query in Explore tab

Reference Documents — What to Read

Official Docs (Bookmark These):

What	URL
Grafana Alloy (new agent)	https://grafana.com/docs/alloy/latest/
River/Alloy config reference	https://grafana.com/docs/alloy/latest/reference/components/
Oracle DB Exporter	https://github.com/iamseth/oracledb_exporter
Prometheus remote write	https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write
PromQL queries	https://prometheus.io/docs/prometheus/latest/querying/basics/