Thursday, May 28, 2026

Oracle Monitoring with Grafana using TOM agent

Oracle DB → (someone needs to read it) → collect data → (store it somewhere) → visualize it

That "someone" is your monitoring stack. Let me introduce the cast of characters.

👥 Meet the Characters (Components)

1. 🤖 TOM Agent — "The Field Reporter"

  • Installed on every DB host (the actual server where Oracle runs)
  • It's the boots-on-the-ground agent
  • It contains two things inside it: SQL Exporter + River File
  • Think of it as a reporter stationed at each office branch

2. 🔍 SQL Exporter — "The Interviewer"

  • Lives inside TOM Agent
  • Its job: run SQL queries against Oracle and convert the results into a format called metrics
  • Example: It runs SELECT count(*) FROM v$session → converts that number into a metric called oracle_sessions_active
  • It speaks to Oracle, Oracle doesn't know it's being monitored

3. 📄 River File — "The Traffic Controller"

  • Also lives inside TOM Agent
  • It's a configuration file (written in a language called River/Alloy syntax)
  • It tells the agent: "Hey, go collect these metrics, then send them to THIS address"
  • Think of it as the routing instructions — what to collect, how often, where to send

4. 📊 Metrics — "The Data Itself"

  • Not a tool — it's the output format
  • Metrics are simple number+label combinations
  • Example: oracle_db_sessions{host="prod-db1", status="active"} 42
  • This means: "On prod-db1, there are 42 active sessions right now"

5. 🗄️ Central Server — "The Post Office"

  • All agents from all DB hosts push their metrics here
  • It's a central collection point
  • Prometheus sits here waiting to receive or scrape data

6. 📈 Prometheus — "The Database for Numbers"

  • Stores all the metrics with timestamps
  • It's a time-series database — meaning it remembers that at 2PM there were 42 sessions, at 3PM there were 67, etc.
  • Grafana talks to Prometheus to get historical data
  • Also handles alerting rules

7. 📉 Grafana — "The Dashboard TV Screen"

  • The thing your team actually looks at
  • Beautiful charts, graphs, tables
  • Queries Prometheus for data and renders it visually
  • Where you set up alert notifications (Slack, email, PagerDuty, etc.)

🌊 The Complete Data Flow

+----------------------------------------------------------+

|  DB HOST 1                                               |

|                                                          |

|  +-----------------------------------------------+      |

|  |              TOM AGENT                        |      |

|  |                                               |      |

|  |  +----------------+     +----------------+   |      |

|  |  |  SQL Exporter  |     |   River File   |   |      |

|  |  |                |     |   (config)     |   |      |

|  |  |  Runs SQL on   |     |                |   |      |

|  |  |  Oracle DB     |     |  - what SQL    |   |      |

|  |  |                |     |  - how often   |   |      |

|  |  |  Outputs       |     |  - where to    |   |      |

|  |  |  metrics       |     |    send        |   |      |

|  |  +-------+--------+     +----------------+   |      |

|  |          |                                   |      |

|  +----------+-----------------------------------+      |

|             |                                          |

|         Oracle DB  <--- SQL Exporter queries it        |

+-------------+--------------------------------------------+

              |

              |  PUSH metrics (HTTP)

              v

+---------------------------+

|      CENTRAL SERVER       |

|                           |

|  +---------------------+  |

|  |     Prometheus      |  |

|  |                     |  |

|  |  Stores metrics     |  |

|  |  with timestamps    |  |

|  |                     |  |

|  |  Evaluates alert    |  |

|  |  rules              |  |

|  +---------+-----------+  |

|            |              |

+------------+--------------+

             |

             |  Query (PromQL)

             v

+------------------------+

|        GRAFANA         |

|                        |

|  [*] Dashboards        |

|  [!] Alert UI          |

|  [~] Visual monitoring |

+------------------------+

 

📄 Deep Dive: The River File

This is the brain of the TOM Agent. Let me explain it line by line conceptually.

What River File Does:

1. Define WHERE your Oracle DB is (connection string)

2. Define WHAT to collect (point to SQL exporter config)

3. Define HOW OFTEN to collect

4. Define WHERE TO SEND the data (your central server)

 

River File Structure (conceptual):

Hcl

// ================================================

// BLOCK 1: Discovery / Connection

// Tell the agent about your Oracle DB

// ================================================

prometheus.exporter.oracle "my_oracle_db" {

  connection_string = "oracle://username:password@localhost:1521/ORCL"

  // This is like a phone number for your Oracle DB

}

 

// ================================================

// BLOCK 2: Scrape Config

// Tell the agent HOW to collect metrics

// ================================================

prometheus.scrape "collect_oracle" {

  targets    = prometheus.exporter.oracle.my_oracle_db.targets

  forward_to = [prometheus.remote_write.central.receiver]

 

  scrape_interval = "60s"   // collect every 60 seconds

  scrape_timeout  = "30s"   // give up if Oracle doesn't respond in 30s

}

 

// ================================================

// BLOCK 3: Remote Write

// Tell the agent WHERE to send the data

// ================================================

prometheus.remote_write "central" {

  endpoint {

    url = "http://central-server:9090/api/v1/write"

    // This is your central Prometheus server address

  }

}



Mentor Tip: Think of River file blocks like LEGO pieces — each block does one job, and you connect them together using the names you give them (like my_oracle_db and central)

📋 Deep Dive: SQL Exporter & Metrics YAML

The SQL Exporter needs a YAML file that tells it which SQL queries to run and what to name the metrics.

YAML File Structure:

Yaml

Yaml

# ================================================

# GLOBAL SETTINGS

# ================================================

global:

  scrape_timeout: 30s          # How long to wait for Oracle to respond

  scrape_timeout_offset: 500ms

  min_interval: 0s             # Minimum time between scrapes

  cache_size: 1000

 

# ================================================

# JOBS — Each "job" = one group of related queries

# ================================================

jobs:

 

  # JOB 1: Session Monitoring

  - job_name: oracle_sessions

    interval: '1m'             # Run these queries every 1 minute

    connections:

      - "oracle://user:pass@host:1521/SID"

 

    queries:

 

      # QUERY 1: Count active sessions

      - query: |

          SELECT

            status,

            COUNT(*) as session_count

          FROM v$session

          WHERE type = 'USER'

          GROUP BY status

 

        metrics:

          # Each column from your SQL becomes a metric or a label

          - metric_name: oracle_sessions_total

            type: gauge               # gauge = current value (can go up or down)

            help: "Total Oracle sessions by status"

            values: [session_count]   # This SQL column becomes the NUMBER

            labels:

              status: status          # This SQL column becomes a LABEL (category)

 

      # QUERY 2: Tablespace usage

      - query: |

          SELECT

            tablespace_name,

            used_space,

            tablespace_size,

            used_percent

          FROM dba_tablespace_usage_metrics

 

        metrics:

          - metric_name: oracle_tablespace_used_percent

            type: gauge

            help: "Tablespace usage percentage"

            values: [used_percent]

            labels:

              tablespace: tablespace_name

 

  # JOB 2: Performance Monitoring

  - job_name: oracle_performance

    interval: '5m'             # Less frequent — every 5 minutes

 

    queries:

      - query: |

          SELECT

            metric_name,

            value

          FROM v$sysmetric

          WHERE group_id = 2

 

        metrics:

          - metric_name: oracle_sysmetric

            type: gauge

            help: "Oracle system metrics"

            values: [value]

            labels:

              metric_name: metric_name


🔑 Key YAML Concepts:


🔔 Alert Mechanism — How Alerts Work

Alerts flow through two stages:

Stage 1: Prometheus detects the problem

Stage 2: Grafana/Alertmanager notifies your team

 

Stage 1 — Prometheus Alert Rules:

Yaml

# alert_rules.yml — lives on your Prometheus server

 

groups:

  - name: oracle_alerts

    rules:

 

      # Alert 1: Too many active sessions

      - alert: OracleHighSessions

        expr: oracle_sessions_total{status="ACTIVE"} > 200

        for: 5m          # Must be true for 5 CONTINUOUS minutes (avoid false alarms)

        labels:

          severity: warning

        annotations:

          summary: "High session count on {{ $labels.instance }}"

          description: "Sessions = {{ $value }} (threshold: 200)"

 

      # Alert 2: Tablespace almost full

      - alert: OracleTablespaceFull

        expr: oracle_tablespace_used_percent > 85

        for: 10m

        labels:

          severity: critical

        annotations:

          summary: "Tablespace {{ $labels.tablespace }} is {{ $value }}% full"

          description: "Free up space immediately on {{ $labels.instance }}"

 

      # Alert 3: DB host not responding (agent stopped pushing)

      - alert: OracleDBDown

        expr: up{job="oracle"} == 0

        for: 2m

        labels:

          severity: critical

        annotations:

          summary: "Oracle DB unreachable: {{ $labels.instance }}"

 

Stage 2 — Where Alerts Go (Alertmanager):

Yaml

# Alertmanager routes alerts to your team

route:

  receiver: 'team-slack'

  routes:

    - match:

        severity: critical

      receiver: 'oncall-pagerduty'   # Critical = wake someone up!

    - match:

        severity: warning

      receiver: 'team-slack'         # Warning = Slack message is fine

 

receivers:

  - name: 'team-slack'

    slack_configs:

      - channel: '#db-alerts'

        text: '{{ .Annotations.description }}'

 

  - name: 'oncall-pagerduty'

    pagerduty_configs:

      - service_key: 'your-key-here'

 

🗺️ The Complete Mental Model

Here's how to think about all of this together:

 

EACH DB SERVER:

┌──────────────────────────────────────────────────┐

│  TOM Agent                                       │

│                                                  │

│  River File answers:                             │

│    "Where is Oracle?"  ──► SQL Exporter connects │

│    "What to collect?"  ──► Points to YAML file   │

│    "How often?"        ──► scrape_interval       │

│    "Where to send?"    ──► Central server URL    │

│                                                  │

│  YAML File answers:                              │

│    "Which SQL queries to run?"                   │

│    "What to name each metric?"                   │

│    "Which columns are values vs labels?"         │

└──────────────────────────────────────────────────┘

              ↓ pushes metrics every N seconds

CENTRAL SERVER:

  Prometheus stores → oracle_sessions_total{host="prod1"} 42 @timestamp

              ↓ Grafana queries Prometheus

GRAFANA:

  Shows charts + fires alerts when rules are breached

Important:--

River file vs dbconnection, which one really make connection to oracle

Who Makes the Connection to Oracle?

SQL Exporter makes the actual connection. But it needs to be told how — and that's where the two configs play different roles.

The Simple Truth

River File = ORCHESTRATOR (tells the agent what to do and where to send)

DB Connection = CREDENTIALS (tells SQL Exporter how to reach Oracle)

SQL Exporter = THE ONE WHO ACTUALLY CONNECTS to Oracle

Think of it Like This

Imagine you're a delivery driver (SQL Exporter)

DB Connection string = the ADDRESS of the house (where to go, door code)

River File = your MANAGER's instructions (go collect from this address, every 60 seconds, bring it back to HQ)

You (SQL Exporter) = the one who actually drives there and knocks on the door

Where Each One Lives and What it Does

Config

Lives In

Purpose

Makes DB Connection?

DATA_SOURCE_NAME

.env file or environment variable

Oracle credentials + host + port + service

No, just stores info

River File

TOM Agent config

Orchestrates collection and routing

No, just instructs

SQL Exporter

TOM Agent

Reads the connection string and actually connects

YES

 

The Correct Flow

 

 

 

.env file / Environment Variable

  DATA_SOURCE_NAME="user/pass@//host:1521/SERVICE"

              |

              | SQL Exporter READS this

              v

         SQL Exporter

              |

              | Uses it to CONNECT to Oracle

              v

          Oracle DB  <--- actual connection happens HERE

              |

              | Results come back as metrics

              v

         River File picks up those metrics

              |

              | and ships them to Central Server

              v

          Prometheus




Grafana Monitoring Stack — Layman to SRE Expert Guide

First — The Analogy (Understand This, You Understand Everything)

Think of the monitoring stack like a Hospital Reporting System: 

Oracle DB               = The Patient (has all the health data inside)
Oracle Exporter       =     The Doctor (examines patient, writes report)
metrics.yaml            = The Checklist (what symptoms/vitals to check)
DB Connection        = The Hospital ID Card (proves doctor can see patient)
Grafana Agent         = The Courier (picks up report, delivers to HQ)
River File                = Courier's Job Instructions (where to pick up, where to deliver)
Prometheus             = The Hospital HQ Archive (stores all reports over time)
Grafana                   = The Dashboard Screen in HQ (shows trends, alerts)


Component 1 — Oracle DB Exporter

What is it in layman terms?

It is a small program that sits next to your Oracle DB, logs in, runs some SQL queries, and then stands on a door waiting for someone to ask "what did you find?"

That door = http://localhost:9161/metrics

What does it actually return?

Plain text. Like this:

oracledb_sessions_total{status="ACTIVE"} 42 

oracledb_tablespace_used_percent{name="USERS"} 78.5 

oracledb_up 1

oracledb_up 1 means exporter connected to Oracle successfully. oracledb_up 0 means connection failed — first thing you check when something breaks.

SRE Level — What to verify:

# Is exporter running?

ps -ef | grep oracledb_exporter

# Can you hit its door?

curl http://localhost:9161/metrics | head -50

# Is Oracle reachable from exporter user?

sqlplus monitoring_user/password@//host:1521/SERVICE


Component 2 — DB Connection File

Layman explanation

This is the key to Oracle's front door. Without this the exporter cannot get in.

DATA_SOURCE_NAME="username/password@//hostname:1521/SERVICENAME"

Breaking it apart like a URL:

username        → who is logging in

password        → their password

hostname        → which server Oracle is on

1521            → which port Oracle's listener is on

SERVICENAME     → which database (like a room number inside the building)

Common mistakes beginners make:

MistakeResult
Using SID instead of Service NameConnection fails silently
Wrong portTimeout error
User missing V$ grantsExporter connects but metrics show zero
CRLF in file (Windows line endings)Variable corrupted 

SRE Level — Grants the monitoring user needs:

-- Minimum required grants

GRANT CREATE SESSION TO monitoring_user;

GRANT SELECT ON V_$SESSION TO monitoring_user;

GRANT SELECT ON V_$DATABASE TO monitoring_user;

GRANT SELECT ON V_$TABLESPACE TO monitoring_user;

GRANT SELECT ON V_$DATAFILE TO monitoring_user;

GRANT SELECT ON DBA_TABLESPACE_USAGE_METRICS TO monitoring_user;

GRANT SELECT ON V_$SYSSTAT TO monitoring_user;

GRANT SELECT ON V_$WAITSTAT TO monitoring_user;

Component 3 — metrics.yaml

Layman explanation

This is the SQL query book. It tells the exporter:

  • "Every time someone asks for metrics, run these SQL queries"
  • "Take the results and name them like this"
  • "These columns become labels, this column becomes the number"
Simple example — dissected line by line:

yaml

- context: "sessions"           # ← metric will be named oracledb_sessions_XXXX
  
  metricsdesc:
    total: "Session count"      # ← column 'total' from SQL = the number measured

  request: >                    # ← the actual SQL that runs in Oracle
    SELECT status,
           COUNT(*) as total
    FROM v$session
    GROUP BY status

  labels:                       # ← these SQL columns become Prometheus labels
    - "status"
What Prometheus stores after this:

oracledb_sessions_total{status="ACTIVE"} = 42 
oracledb_sessions_total{status="INACTIVE"} = 108 
oracledb_sessions_total{status="KILLED"} = 3

SRE Level — Types of metrics you should build:


Component 4 — River File (Grafana Agent Brain)

Layman explanation

Think of it as the agent's job description written in a file. It tells the agent:

  • "Go to this address every 60 seconds and collect metrics"
  • "After collecting, send them to this central server"
  • "Add these labels so we know which server this came from"

Full river file explained line by line:

hcl
// ━━━ BLOCK 1: Go collect from Oracle Exporter ━━━━━━━━━
prometheus.scrape "oracle" {

  targets = [{
    __address__ = "localhost:9161"   // exporter's door address
  }]

  scrape_interval = "60s"           // knock on door every 60 seconds

  // add extra labels to every metric collected
  // so in Grafana you know which server it came from
  extra_labels = {
    env      = "production",
    db_name  = "ORCL",
    region   = "EU",
  }

  forward_to = [prometheus.remote_write.central.receiver]
  //            ↑ after collecting, pass to Block 2
}

// ━━━ BLOCK 2: Send collected metrics to central server ━━
prometheus.remote_write "central" {
  endpoint {
    url = "http://central-server:9090/api/v1/write"
  }
}

SRE Level — Key things to tune:

SettingWhat it meansRecommended
scrape_intervalHow often metrics are collected60s for DB, 15s for critical
scrape_timeoutHow long to wait before giving upAlways less than interval
extra_labelsTags added to every metricenv, db_name, host, region
remote_write urlWhere to push dataYour central Prometheus

Complete Flow — Step by Step What Happens Every 60 Seconds

Second 0:   Agent timer fires — "time to scrape"
            │
Second 1:   Agent hits http://localhost:9161/metrics
            │
Second 1-3: Exporter receives request
            → opens Oracle connection using connection.env
            → runs all SQL queries from metrics.yaml
            → formats results as Prometheus text
            → returns to agent
            │
Second 3:   Agent receives ~500 metric lines
            → adds your extra labels (env, db_name etc)
            → puts them in local buffer
            │
Second 5:   Agent sends buffer to central Prometheus
            via HTTP POST to /api/v1/write
            │
Second 6:   Prometheus receives, stores in its time series DB
            │
Any time:   Grafana queries Prometheus
            → you see graphs, dashboards, alerts

What to Check When Something Breaks

Step 1: Is Oracle DB up?
        → sqlplus / tnsping

Step 2: Is Exporter running and connected?
        → curl localhost:9161/metrics
        → check oracledb_up value (1=good, 0=bad)

Step 3: Is Agent running?
        → systemctl status grafana-agent
        → check agent logs for scrape errors

Step 4: Is data reaching Prometheus?
        → open Prometheus UI :9090
        → query: oracledb_up
        → check Targets page — should show UP

Step 5: Is Grafana showing data?
        → check data source connection
        → test a simple query in Explore tab

Reference Documents — What to Read

Official Docs (Bookmark These):