Thursday, May 28, 2026

Oracle Monitoring with Grafana using TOM agent

Oracle DB → (someone needs to read it) → collect data → (store it somewhere) → visualize it

That "someone" is your monitoring stack. Let me introduce the cast of characters.

👥 Meet the Characters (Components)

1. 🤖 TOM Agent — "The Field Reporter"

  • Installed on every DB host (the actual server where Oracle runs)
  • It's the boots-on-the-ground agent
  • It contains two things inside it: SQL Exporter + River File
  • Think of it as a reporter stationed at each office branch

2. 🔍 SQL Exporter — "The Interviewer"

  • Lives inside TOM Agent
  • Its job: run SQL queries against Oracle and convert the results into a format called metrics
  • Example: It runs SELECT count(*) FROM v$session → converts that number into a metric called oracle_sessions_active
  • It speaks to Oracle, Oracle doesn't know it's being monitored

3. 📄 River File — "The Traffic Controller"

  • Also lives inside TOM Agent
  • It's a configuration file (written in a language called River/Alloy syntax)
  • It tells the agent: "Hey, go collect these metrics, then send them to THIS address"
  • Think of it as the routing instructions — what to collect, how often, where to send

4. 📊 Metrics — "The Data Itself"

  • Not a tool — it's the output format
  • Metrics are simple number+label combinations
  • Example: oracle_db_sessions{host="prod-db1", status="active"} 42
  • This means: "On prod-db1, there are 42 active sessions right now"

5. 🗄️ Central Server — "The Post Office"

  • All agents from all DB hosts push their metrics here
  • It's a central collection point
  • Prometheus sits here waiting to receive or scrape data

6. 📈 Prometheus — "The Database for Numbers"

  • Stores all the metrics with timestamps
  • It's a time-series database — meaning it remembers that at 2PM there were 42 sessions, at 3PM there were 67, etc.
  • Grafana talks to Prometheus to get historical data
  • Also handles alerting rules

7. 📉 Grafana — "The Dashboard TV Screen"

  • The thing your team actually looks at
  • Beautiful charts, graphs, tables
  • Queries Prometheus for data and renders it visually
  • Where you set up alert notifications (Slack, email, PagerDuty, etc.)

🌊 The Complete Data Flow

+----------------------------------------------------------+

|  DB HOST 1                                               |

|                                                          |

|  +-----------------------------------------------+      |

|  |              TOM AGENT                        |      |

|  |                                               |      |

|  |  +----------------+     +----------------+   |      |

|  |  |  SQL Exporter  |     |   River File   |   |      |

|  |  |                |     |   (config)     |   |      |

|  |  |  Runs SQL on   |     |                |   |      |

|  |  |  Oracle DB     |     |  - what SQL    |   |      |

|  |  |                |     |  - how often   |   |      |

|  |  |  Outputs       |     |  - where to    |   |      |

|  |  |  metrics       |     |    send        |   |      |

|  |  +-------+--------+     +----------------+   |      |

|  |          |                                   |      |

|  +----------+-----------------------------------+      |

|             |                                          |

|         Oracle DB  <--- SQL Exporter queries it        |

+-------------+--------------------------------------------+

              |

              |  PUSH metrics (HTTP)

              v

+---------------------------+

|      CENTRAL SERVER       |

|                           |

|  +---------------------+  |

|  |     Prometheus      |  |

|  |                     |  |

|  |  Stores metrics     |  |

|  |  with timestamps    |  |

|  |                     |  |

|  |  Evaluates alert    |  |

|  |  rules              |  |

|  +---------+-----------+  |

|            |              |

+------------+--------------+

             |

             |  Query (PromQL)

             v

+------------------------+

|        GRAFANA         |

|                        |

|  [*] Dashboards        |

|  [!] Alert UI          |

|  [~] Visual monitoring |

+------------------------+

 

📄 Deep Dive: The River File

This is the brain of the TOM Agent. Let me explain it line by line conceptually.

What River File Does:

1. Define WHERE your Oracle DB is (connection string)

2. Define WHAT to collect (point to SQL exporter config)

3. Define HOW OFTEN to collect

4. Define WHERE TO SEND the data (your central server)

 

River File Structure (conceptual):

Hcl

// ================================================

// BLOCK 1: Discovery / Connection

// Tell the agent about your Oracle DB

// ================================================

prometheus.exporter.oracle "my_oracle_db" {

  connection_string = "oracle://username:password@localhost:1521/ORCL"

  // This is like a phone number for your Oracle DB

}

 

// ================================================

// BLOCK 2: Scrape Config

// Tell the agent HOW to collect metrics

// ================================================

prometheus.scrape "collect_oracle" {

  targets    = prometheus.exporter.oracle.my_oracle_db.targets

  forward_to = [prometheus.remote_write.central.receiver]

 

  scrape_interval = "60s"   // collect every 60 seconds

  scrape_timeout  = "30s"   // give up if Oracle doesn't respond in 30s

}

 

// ================================================

// BLOCK 3: Remote Write

// Tell the agent WHERE to send the data

// ================================================

prometheus.remote_write "central" {

  endpoint {

    url = "http://central-server:9090/api/v1/write"

    // This is your central Prometheus server address

  }

}



Mentor Tip: Think of River file blocks like LEGO pieces — each block does one job, and you connect them together using the names you give them (like my_oracle_db and central)

📋 Deep Dive: SQL Exporter & Metrics YAML

The SQL Exporter needs a YAML file that tells it which SQL queries to run and what to name the metrics.

YAML File Structure:

Yaml

Yaml

# ================================================

# GLOBAL SETTINGS

# ================================================

global:

  scrape_timeout: 30s          # How long to wait for Oracle to respond

  scrape_timeout_offset: 500ms

  min_interval: 0s             # Minimum time between scrapes

  cache_size: 1000

 

# ================================================

# JOBS — Each "job" = one group of related queries

# ================================================

jobs:

 

  # JOB 1: Session Monitoring

  - job_name: oracle_sessions

    interval: '1m'             # Run these queries every 1 minute

    connections:

      - "oracle://user:pass@host:1521/SID"

 

    queries:

 

      # QUERY 1: Count active sessions

      - query: |

          SELECT

            status,

            COUNT(*) as session_count

          FROM v$session

          WHERE type = 'USER'

          GROUP BY status

 

        metrics:

          # Each column from your SQL becomes a metric or a label

          - metric_name: oracle_sessions_total

            type: gauge               # gauge = current value (can go up or down)

            help: "Total Oracle sessions by status"

            values: [session_count]   # This SQL column becomes the NUMBER

            labels:

              status: status          # This SQL column becomes a LABEL (category)

 

      # QUERY 2: Tablespace usage

      - query: |

          SELECT

            tablespace_name,

            used_space,

            tablespace_size,

            used_percent

          FROM dba_tablespace_usage_metrics

 

        metrics:

          - metric_name: oracle_tablespace_used_percent

            type: gauge

            help: "Tablespace usage percentage"

            values: [used_percent]

            labels:

              tablespace: tablespace_name

 

  # JOB 2: Performance Monitoring

  - job_name: oracle_performance

    interval: '5m'             # Less frequent — every 5 minutes

 

    queries:

      - query: |

          SELECT

            metric_name,

            value

          FROM v$sysmetric

          WHERE group_id = 2

 

        metrics:

          - metric_name: oracle_sysmetric

            type: gauge

            help: "Oracle system metrics"

            values: [value]

            labels:

              metric_name: metric_name


🔑 Key YAML Concepts:


🔔 Alert Mechanism — How Alerts Work

Alerts flow through two stages:

Stage 1: Prometheus detects the problem

Stage 2: Grafana/Alertmanager notifies your team

 

Stage 1 — Prometheus Alert Rules:

Yaml

# alert_rules.yml — lives on your Prometheus server

 

groups:

  - name: oracle_alerts

    rules:

 

      # Alert 1: Too many active sessions

      - alert: OracleHighSessions

        expr: oracle_sessions_total{status="ACTIVE"} > 200

        for: 5m          # Must be true for 5 CONTINUOUS minutes (avoid false alarms)

        labels:

          severity: warning

        annotations:

          summary: "High session count on {{ $labels.instance }}"

          description: "Sessions = {{ $value }} (threshold: 200)"

 

      # Alert 2: Tablespace almost full

      - alert: OracleTablespaceFull

        expr: oracle_tablespace_used_percent > 85

        for: 10m

        labels:

          severity: critical

        annotations:

          summary: "Tablespace {{ $labels.tablespace }} is {{ $value }}% full"

          description: "Free up space immediately on {{ $labels.instance }}"

 

      # Alert 3: DB host not responding (agent stopped pushing)

      - alert: OracleDBDown

        expr: up{job="oracle"} == 0

        for: 2m

        labels:

          severity: critical

        annotations:

          summary: "Oracle DB unreachable: {{ $labels.instance }}"

 

Stage 2 — Where Alerts Go (Alertmanager):

Yaml

# Alertmanager routes alerts to your team

route:

  receiver: 'team-slack'

  routes:

    - match:

        severity: critical

      receiver: 'oncall-pagerduty'   # Critical = wake someone up!

    - match:

        severity: warning

      receiver: 'team-slack'         # Warning = Slack message is fine

 

receivers:

  - name: 'team-slack'

    slack_configs:

      - channel: '#db-alerts'

        text: '{{ .Annotations.description }}'

 

  - name: 'oncall-pagerduty'

    pagerduty_configs:

      - service_key: 'your-key-here'

 

🗺️ The Complete Mental Model

Here's how to think about all of this together:

 

EACH DB SERVER:

┌──────────────────────────────────────────────────┐

│  TOM Agent                                       │

│                                                  │

│  River File answers:                             │

│    "Where is Oracle?"  ──► SQL Exporter connects │

│    "What to collect?"  ──► Points to YAML file   │

│    "How often?"        ──► scrape_interval       │

│    "Where to send?"    ──► Central server URL    │

│                                                  │

│  YAML File answers:                              │

│    "Which SQL queries to run?"                   │

│    "What to name each metric?"                   │

│    "Which columns are values vs labels?"         │

└──────────────────────────────────────────────────┘

              ↓ pushes metrics every N seconds

CENTRAL SERVER:

  Prometheus stores → oracle_sessions_total{host="prod1"} 42 @timestamp

              ↓ Grafana queries Prometheus

GRAFANA:

  Shows charts + fires alerts when rules are breached

Important:--

River file vs dbconnection, which one really make connection to oracle

Who Makes the Connection to Oracle?

SQL Exporter makes the actual connection. But it needs to be told how — and that's where the two configs play different roles.

The Simple Truth

River File = ORCHESTRATOR (tells the agent what to do and where to send)

DB Connection = CREDENTIALS (tells SQL Exporter how to reach Oracle)

SQL Exporter = THE ONE WHO ACTUALLY CONNECTS to Oracle

Think of it Like This

Imagine you're a delivery driver (SQL Exporter)

DB Connection string = the ADDRESS of the house (where to go, door code)

River File = your MANAGER's instructions (go collect from this address, every 60 seconds, bring it back to HQ)

You (SQL Exporter) = the one who actually drives there and knocks on the door

Where Each One Lives and What it Does

Config

Lives In

Purpose

Makes DB Connection?

DATA_SOURCE_NAME

.env file or environment variable

Oracle credentials + host + port + service

No, just stores info

River File

TOM Agent config

Orchestrates collection and routing

No, just instructs

SQL Exporter

TOM Agent

Reads the connection string and actually connects

YES

 

The Correct Flow

 

 

 

.env file / Environment Variable

  DATA_SOURCE_NAME="user/pass@//host:1521/SERVICE"

              |

              | SQL Exporter READS this

              v

         SQL Exporter

              |

              | Uses it to CONNECT to Oracle

              v

          Oracle DB  <--- actual connection happens HERE

              |

              | Results come back as metrics

              v

         River File picks up those metrics

              |

              | and ships them to Central Server

              v

          Prometheus




No comments:

Post a Comment