OEM 24ai Incident Rules: Why Your Alerts Are Firing at the Wrong People
Oracle observability post #3 — the last post covered connecting OEM 24ai to OCI Observability services. This one is about what happens when something actually goes wrong and why most OEM deployments route that signal badly.
There's a pattern I see at almost every customer site after a fresh OEM 24ai deployment.
The monitoring is working. Metrics are being collected. Thresholds are set. But six months in, the DBAs have started ignoring the alert emails — because there are too many, half of them don't apply to the person receiving them, and the format tells you something fired but not what to actually do about it.
That's not a monitoring problem. That's an incident rules problem.
Getting an alert is useless if the wrong person gets it, at the wrong time, in the wrong format, with no context about severity. This post is about getting that right in OEM 24ai — the rule set architecture, the common failure patterns I see in the field, and how to connect OEM's notification pipeline into the broader ops toolchain, including OCI Notifications for hybrid environments.
The OEM Incident Pipeline: What Actually Happens
Before you touch a rule set, you need to understand how OEM moves from a raw metric value to a page on someone's phone. There are four stages:
Metric → raw data collected by the OEM Agent on the target: CPU utilization, tablespace usage, active sessions, and so on. The agent collects on a schedule and uploads to the OMS, which stores the data in the Management Repository (OMR).
Event → a metric threshold crossing. When CPU utilization exceeds your configured warning or critical threshold, OEM generates an event. Events are also generated by target availability changes, compliance violations, and EM jobs.
Incident → a grouping of related events, managed through Incident Manager. By default OEM creates one incident per event, but you can configure rules to correlate multiple events into a single incident. This is how you avoid 47 separate incidents when an Exadata cell node goes offline and cascades.
Notification → an action triggered by an incident rule. Email, SNMP trap, PagerDuty webhook, OCI Notifications endpoint, or an EM CLI script. This is the step where routing decisions are made.
The rule set layer sits between incident creation and notification. Rules evaluate: For this incident, at this severity, on this target type — who gets told, how, and what do they see?
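To make the event-to-incident step concrete, here is a minimal sketch of the difference between one incident per event and correlated incidents. It is a model of the idea, not OEM's implementation: the Event and Incident classes and the parent field (the system a target belongs to) are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    target: str   # target that raised the event
    parent: str   # hypothetical: the system or cluster the target belongs to
    name: str     # e.g. "Target Down"


@dataclass
class Incident:
    key: str
    events: list = field(default_factory=list)


def one_incident_per_event(events):
    """Default behaviour described above: every event opens its own incident."""
    return [Incident(key=f"{e.target}/{e.name}", events=[e]) for e in events]


def correlate_by_parent(events):
    """Group events that share a parent system into a single incident."""
    grouped = defaultdict(list)
    for e in events:
        grouped[e.parent].append(e)
    return [Incident(key=parent, events=evts) for parent, evts in grouped.items()]


# 47 dependent targets go down when one cell node fails.
cascade = [Event(f"target{n:02d}", "exa01", "Target Down") for n in range(47)]
uncorrelated = one_incident_per_event(cascade)
correlated = correlate_by_parent(cascade)
```

With the cascade above, the uncorrelated path yields 47 incidents and the correlated path yields one incident carrying all 47 events.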
Rule Sets vs. Rules: The Container Model
In OEM 24ai, notification routing is controlled by Incident Rule Sets. The hierarchy is:
Rule Set — a container with an ordered list of rules, evaluated top-to-bottom. Rule sets are assigned to a scope: all targets, a specific group, or a named target list.
Rule — an individual condition + action pair. "For Severity 1 and 2 incidents on targets in the Production-DBs group, send email to dba-oncall@company.com and create a PagerDuty incident."
Action — what happens when the rule matches. Email, SNMP, custom script, OCI endpoint.
The key thing to understand: rule evaluation stops at the first matching rule within a rule set. Order matters. If your generic "send all alerts to dba-team" rule is first, your targeted routing rules below it will never fire.
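The ordering problem is easy to see in a small model. The sketch below assumes the first-match behaviour described above; Rule and first_match are hypothetical names for illustration and have nothing to do with OEM's internal classes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]
    action: str


def first_match(rules, incident) -> Optional[Rule]:
    """Walk the rules top to bottom and return the first one that matches."""
    for rule in rules:
        if rule.matches(incident):
            return rule
    return None


generic = Rule("all-to-dba-team", lambda i: True, "email dba-team")
targeted = Rule(
    "prod-sev1",
    lambda i: i["group"] == "Production-DBs" and i["severity"] == 1,
    "email dba-oncall@company.com + PagerDuty",
)

incident = {"group": "Production-DBs", "severity": 1}
wrong_order = first_match([generic, targeted], incident)
right_order = first_match([targeted, generic], incident)
print(wrong_order.name, "->", wrong_order.action)
print(right_order.name, "->", right_order.action)
```

With the generic rule first, the Severity 1 production incident is handled as a plain team email and the targeted rule never gets a look. Swapping the order fixes it without changing either rule.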
What Ships Out of the Box (and Why It's Not Enough)
OEM 24ai ships with two out-of-box (System Generated) enterprise rule sets, visible under Setup → Incidents → Incident Rules. Note that these Oracle-supplied rule sets cannot be exported or imported via EM CLI. They are:
Incident management rule set for all targets — the main system-generated set, scoped to all targets. It bundles roughly twenty rules that create, compress, and clear incidents for events like target-down and agent-unreachable, high-availability events, critical metric alerts, compliance violations, and SLA alerts. Its job is to create and manage incidents in Incident Manager — none of these default rules sends email or any other notification on its own.
A metric-alert rule set that fires email notifications for all Warning and Critical metric alerts. Default recipient: whatever email address was configured during OMS installation.
The second one is the source of most of the alert fatigue I see. It sends everything, to one address, at all hours. Within a few months in a real environment (dozens of targets, hundreds of monitored metrics) that mailbox is noise. DBAs stop reading it.
The fix is to replace the default routing with rule sets that are built around how your team actually works.
Building Rule Sets That Work
Here's the model I use with enterprise customers. Three rule sets, ordered by specificity:
Rule Set 1: Production Critical (Severity 1–2)
Scope: A named group containing your production Oracle databases, RAC clusters, and Exadata targets.
Rules inside:
Severity 1 incidents → immediate email + PagerDuty → DBA on-call rotation + database manager
Severity 2 incidents → email → DBA team distribution list, business hours only (06:00–22:00 on weekdays)
All target availability Down events → immediate email + SMS → DBA on-call, any hour
Key configuration detail: Set Notification Repeat Interval for Severity 1 to every 30 minutes until acknowledged. OEM supports this natively — set it in the Advanced section of the notification action. Without it, you get one email when the incident opens and silence afterward.
Rule Set 2: Non-Production and Dev/Test (Severity 1–3)
Scope: Non-prod target group.
Rules inside:
Severity 1 incidents only → email → DBA team DL, business hours only
Severity 2–3 incidents → email daily digest to team lead (not to individuals)
Non-prod environments don't need paging. A daily digest is enough visibility without generating noise that erodes trust in the alerting system.
Rule Set 3: Catch-All (Severity 4–5 / Everything Else)
Scope: All targets.
Rules inside:
All Severity 4–5 incidents → log to Incident Manager only, no notification
All unmatched incidents → email to monitoring team alias, low priority flag
Severity 4–5 in OEM are advisory. They should be reviewed periodically, not interrupt anyone's workflow.
Path: Setup → Incidents → Incident Rules → Create Rule Set
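Before clicking through the console, it can help to write the three rule sets down as a routing function and test it against a few incidents. The sketch below is a planning aid, not OEM configuration: the group names, action strings, and the route function are hypothetical, the availability (target Down) rule is left out for brevity, and it assumes a Severity 2 production incident outside business hours is simply held rather than passed to the catch-all.

```python
from datetime import datetime


def in_business_hours(when: datetime) -> bool:
    """06:00-22:00, Monday to Friday, as in the Severity 2 rule above."""
    return when.weekday() < 5 and 6 <= when.hour < 22


def route(group: str, severity: int, when: datetime) -> list:
    """Return notification actions for an incident (empty list = log only)."""
    # Rule Set 1: Production Critical
    if group == "prod":
        if severity == 1:
            return ["email:dba-oncall", "pagerduty:dba-oncall", "email:db-manager"]
        if severity == 2:
            return ["email:dba-team"] if in_business_hours(when) else []
    # Rule Set 2: Non-Production and Dev/Test
    elif group == "nonprod":
        if severity == 1:
            return ["email:dba-team"] if in_business_hours(when) else []
        if severity in (2, 3):
            return ["digest:team-lead"]
    # Rule Set 3: Catch-All
    if severity in (4, 5):
        return []
    return ["email:monitoring-alias"]


saturday_3am = datetime(2024, 1, 6, 3, 0)
monday_10am = datetime(2024, 1, 8, 10, 0)
print(route("prod", 1, saturday_3am))
print(route("prod", 2, monday_10am))
print(route("nonprod", 3, monday_10am))
```

Running a handful of real incidents from last month through a table like this is a quick way to spot gaps, such as a severity that no rule set claims.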
Notification Templates: Stop Sending Raw Alerts
Default OEM email notifications look like this:
Target: PRODDB01
Metric: CPU Utilization
Value: 94.3
Threshold: 90
That tells you what fired. It doesn't tell you whether this is normal weekend batch behavior, whether this same alert fired last Tuesday, or what the on-call engineer should look at first.
OEM 24ai supports custom notification message templates. Use them.
Path: Setup → Notifications → Notification Methods → Manage Notification Methods → OS Command or Email → Edit Message Template
A better template includes:
Target name and type
Metric name, current value, threshold breached
Severity level
Time of event
Direct link to the incident in Incident Manager (via the URL substitution variables available in OEM notification templates)
Last 3 occurrences of this metric alert on the same target (query from EM CLI)
The last point requires a small wrapper script — but a message that says "this is the 4th time this week" changes how urgently an on-call DBA responds versus "CPU was high one time."
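A minimal sketch of that wrapper's formatting step follows. It assumes the OS command notification passes incident details as environment variables; the variable names used here (TARGET_NAME, TARGET_TYPE, SEVERITY, MESSAGE, INCIDENT_ID) are assumptions to verify against your OEM release. The list of previous alert times is a hypothetical input: how you fetch that history is left out.

```python
from datetime import datetime, timedelta


def recurrence_line(previous: list, now: datetime, days: int = 7) -> str:
    """Summarise how often this alert fired recently, counting the current one."""
    cutoff = now - timedelta(days=days)
    count = len([t for t in previous if t >= cutoff]) + 1
    if count == 1:
        return f"First occurrence in the last {days} days."
    return f"Occurrence #{count} in the last {days} days."


def build_message(env: dict, previous: list, now: datetime) -> str:
    """Build the notification body; env var names are assumed, not confirmed."""
    lines = [
        f"Target:   {env.get('TARGET_NAME', 'unknown')} ({env.get('TARGET_TYPE', 'unknown')})",
        f"Severity: {env.get('SEVERITY', 'unknown')}",
        f"Message:  {env.get('MESSAGE', '')}",
        f"Incident: {env.get('INCIDENT_ID', 'n/a')}",
        recurrence_line(previous, now),
    ]
    # Show the three most recent earlier occurrences, newest first.
    for t in sorted(previous, reverse=True)[:3]:
        lines.append(f"  previous: {t:%Y-%m-%d %H:%M}")
    return "\n".join(lines)


now = datetime(2024, 1, 12, 2, 15)
history = [datetime(2024, 1, 8, 1, 0), datetime(2024, 1, 9, 1, 5), datetime(2024, 1, 11, 0, 55)]
sample_env = {"TARGET_NAME": "PRODDB01", "TARGET_TYPE": "oracle_database",
              "SEVERITY": "Critical", "MESSAGE": "CPU Utilization is 94.3%",
              "INCIDENT_ID": "12345"}
message = build_message(sample_env, history, now)
print(message)
```

In a real wrapper you would pass os.environ as env and hand the resulting text to whatever sends the mail or page.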
Escalation Rules: What Happens When No One Responds
OEM supports escalation rules within a rule set. An escalation rule fires when an incident has been open (or acknowledged but unresolved) for a defined time window.
Common pattern:
Severity 1 incident not acknowledged within 15 minutes → escalate to DBA manager
Severity 1 incident acknowledged but not resolved within 2 hours → escalate to database team lead + incident bridge
Configure this under the rule's Actions → Add Escalation section. The escalation target is a separate notification method, so you can route initial alerts to individuals and escalations to a group.
This is the piece most teams skip — and it's the piece that matters at 2am when the primary on-call is unavailable.
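The common pattern above reduces to two timers. The sketch below models that decision so you can sanity-check the thresholds before configuring them; the function and recipient names are hypothetical, and it assumes both timers run from the moment the incident opened.

```python
from datetime import datetime, timedelta
from typing import Optional

ACK_DEADLINE = timedelta(minutes=15)     # unacknowledged Severity 1
RESOLVE_DEADLINE = timedelta(hours=2)    # acknowledged but unresolved


def escalation_targets(opened: datetime,
                       acknowledged: Optional[datetime],
                       resolved: Optional[datetime],
                       now: datetime) -> list:
    """Return who should be escalated to right now (empty list = nobody)."""
    if resolved is not None:
        return []
    age = now - opened
    if acknowledged is None:
        return ["dba-manager"] if age >= ACK_DEADLINE else []
    if age >= RESOLVE_DEADLINE:
        return ["db-team-lead", "incident-bridge"]
    return []


opened = datetime(2024, 1, 8, 2, 0)
print(escalation_targets(opened, None, None, datetime(2024, 1, 8, 2, 20)))
```

An incident opened at 02:00 and still unacknowledged at 02:20 escalates to the manager; one acknowledged at 02:05 but still open at 04:30 goes to the team lead and the bridge.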
EM CLI: Audit and Manage Rules Programmatically
If you're managing more than one OMS or moving rule sets between environments, EM CLI is the right approach. You don't want to manually replicate a 15-rule set through the console.
Note: EM CLI has no verb that lists incident rules. You review rule sets in the console under Setup → Incidents → Incident Rules. To capture a rule set's full definition for review or backup, export it to XML:
emcli export_incident_rule_set -rule_set_name="Production Critical" -rule_set_owner=sysman -export_file="/tmp/"
Import the rule set into another OEM site (for example, when promoting rule sets from a test OEM to production):
emcli import_incident_rule_set -import_file="/tmp/Production_Critical.xml" -alt_rule_set_name="Production Critical"
This matters most when you run more than one OEM site. OMS nodes behind a load balancer in a single HA deployment share one Management Repository (OMR), so they already see the same rule sets. Either way, keeping version-controlled exports in a git repo means you can always roll back a bad rule change without hunting through Incident Manager history.
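If you export several rule sets on a schedule, a small helper that builds the commands keeps the flags consistent. The sketch below only uses the export_incident_rule_set flags shown above; the rule set names other than "Production Critical" are hypothetical, taken from the three-rule-set model in this post, and the helper prints the commands rather than running them.

```python
import shlex


def export_command(rule_set_name: str, owner: str, export_dir: str) -> list:
    """Build the argv list for one rule set export."""
    return [
        "emcli", "export_incident_rule_set",
        f"-rule_set_name={rule_set_name}",
        f"-rule_set_owner={owner}",
        f"-export_file={export_dir}",
    ]


# Hypothetical rule set names; replace with the ones in your console.
RULE_SETS = ["Production Critical", "Non-Production", "Catch-All"]

commands = [export_command(name, "sysman", "/tmp/") for name in RULE_SETS]
for cmd in commands:
    # shlex.join quotes arguments containing spaces for safe copy and paste.
    print(shlex.join(cmd))
```

To execute instead of print, each list can be passed to subprocess.run(cmd, check=True) from a session that is already logged in to EM CLI, followed by a git commit of the export directory.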
Connecting OEM to the Broader Notification Stack
For teams running hybrid environments — which, as I covered in the last post, is most Oracle shops today — OEM's notification pipeline needs to connect to the same tools the rest of the ops team uses. The DBA team might live in OEM and email. The cloud ops team is in Slack and PagerDuty. Leadership wants a dashboard, not an inbox.
Three patterns I've seen work well:
Pattern 1: OEM → SNMP → PagerDuty. OEM has native SNMP trap support. PagerDuty's generic webhook integration can receive SNMP traps via a bridge like OpsGenie's SNMP integration or a lightweight snmptrapd → HTTP relay. Lower latency than email, works without custom code.
Pattern 2: OEM → EM CLI Custom Notification → OCI Notifications. Set up a custom OS command notification that calls a Python script. That script POSTs the incident payload to an OCI Notifications topic via the OCI SDK. From there, the topic fans out to Slack, PagerDuty, email, or whatever your cloud ops team uses. I covered the OEM REST API endpoint for this in the last post:
https://<OMS_HOST>:<OMS_PORT>/em/api/v1/incidents # verify the exact host, port, and path for your OMS install
The same endpoint works in reverse — your notification script can pull full incident details and include them in the payload, not just the sparse EM notification defaults.
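A minimal sketch of the Pattern 2 script, assuming the OCI Python SDK is installed and a default profile exists in ~/.oci/config. The incident dictionary fields (severity, target, summary) and the topic OCID are hypothetical placeholders; the publish step uses the SDK's NotificationDataPlaneClient.publish_message call.

```python
import json


def build_message(incident: dict) -> tuple:
    """Turn an incident dict into a (title, body) pair for the topic."""
    title = f"[OEM Sev {incident['severity']}] {incident['target']}: {incident['summary']}"
    body = json.dumps(incident, indent=2, sort_keys=True)
    return title, body


def publish(topic_ocid: str, incident: dict) -> None:
    """Publish the incident to an OCI Notifications topic."""
    import oci  # imported lazily so build_message works without the SDK

    config = oci.config.from_file()  # assumes the default ~/.oci/config profile
    client = oci.ons.NotificationDataPlaneClient(config)
    title, body = build_message(incident)
    client.publish_message(
        topic_ocid,
        oci.ons.models.MessageDetails(title=title, body=body),
    )


sample = {"severity": 1, "target": "PRODDB01", "summary": "CPU Utilization is 94.3%"}
title, body = build_message(sample)
print(title)
```

Keeping the formatting separate from the publish call means you can test the message layout on a laptop and only need SDK credentials on the OMS host.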
Pattern 3: OCI Notifications as the Single Alerting Plane. If you've already set up OCI Monitoring alarms (as described in Step 4 of the last post), consider routing OEM Severity 1–2 notifications and OCI Monitoring alarms to the same OCI Notifications topic. One topic, one subscriber list, one PagerDuty service. Your on-call doesn't need to know whether the alert originated from OEM or OCI; they need to know something's broken and where to look.
This pattern requires some discipline: you need to deduplicate. If both OEM and OCI Monitoring are watching CPU on the same host and both alert, you get double pages. Split the responsibility by layer — OEM monitors Oracle process-level metrics, OCI Monitoring handles host-level infrastructure metrics. Don't overlap.
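One way to enforce the split is to write the ownership down once and have every forwarding script check it. The sketch below is an illustration under assumptions: the metric names, the HOST_METRICS set, and the source labels are hypothetical and would come from your own metric inventory.

```python
# Hypothetical list of host-level metrics owned by OCI Monitoring.
HOST_METRICS = {"cpu_utilization", "memory_utilization", "disk_io", "network_throughput"}


def owner(metric: str) -> str:
    """Host-level metrics belong to OCI Monitoring; everything else to OEM."""
    return "oci-monitoring" if metric in HOST_METRICS else "oem"


def should_page(source: str, metric: str) -> bool:
    """Forward an alert to the shared topic only if its source owns the metric."""
    return source == owner(metric)


print(should_page("oem", "cpu_utilization"))             # OEM does not own host CPU
print(should_page("oci-monitoring", "cpu_utilization"))  # OCI Monitoring does
print(should_page("oem", "tablespace_used_pct"))         # database metric stays with OEM
```

The cleaner fix is still to stop collecting the overlapping alert at the source; a check like this is a safety net while thresholds are being cleaned up.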
The Anti-Patterns (What I Fix First at New Customer Sites)
Alert on everything = alert on nothing. When the DBA team has 200 unread alert emails, the one that matters gets missed. Start by auditing which alerts have fired more than 50 times in the last 30 days. If it's that frequent, it's either a false positive or a chronic issue that needs a fix, not a repeated notification.
No severity segmentation. Warning and Critical going to the same mailbox, at the same priority, with the same notification format. If an on-call can't distinguish a "hey watch this" from a "wake up now," the severity system is broken.
No escalation path. A single point of contact for Severity 1 alerts with no escalation rule is a production risk. People travel, phones die, PagerDuty apps get removed. If your monitoring system has no fallback when the primary path fails, you've created a single point of failure in your incident response.
Notification bloat on test environments. Non-prod alerts drowning prod alerts in the same inbox is one of the most common problems I fix on day one of an engagement. Separate rule sets, separate distribution lists, or at minimum separate email subjects that let filters do the work.
No blackouts configured. Maintenance windows with no blackout = alert storms at 2am. Post #4 will cover OEM blackout configuration in detail — because this is its own topic worth getting right.
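The audit from the first anti-pattern (alerts that fired more than 50 times in 30 days) is a simple count once you have the alert history in hand. The sketch below assumes the history is already available as (target, metric, fired_at) tuples; how you extract it from OEM is not shown, and the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def noisy_alerts(history, now: datetime, days: int = 30, threshold: int = 50):
    """Return ((target, metric), count) pairs that fired more than threshold times."""
    cutoff = now - timedelta(days=days)
    counts = Counter(
        (target, metric)
        for target, metric, fired_at in history
        if fired_at >= cutoff
    )
    return [(key, n) for key, n in counts.most_common() if n > threshold]


now = datetime(2024, 2, 1)
history = (
    # 60 firings in the last 30 days: a candidate for a fix, not a notification
    [("PRODDB01", "CPU Utilization", now - timedelta(hours=6 * i)) for i in range(60)]
    # 10 recent firings: below the threshold
    + [("PRODDB02", "Tablespace Used", now - timedelta(days=i)) for i in range(10)]
    # 100 firings, all older than the window
    + [("DEVDB01", "Active Sessions", now - timedelta(days=40, hours=i)) for i in range(100)]
)
offenders = noisy_alerts(history, now)
print(offenders)
```

Each pair that comes back is either a threshold to retune or a chronic problem to fix at the source.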
What Good Looks Like
When OEM 24ai's incident routing is working correctly, here's the experience:
Severity 1 fires → the right on-call engineer gets paged within 60 seconds, with enough context (target, metric, value, link to Incident Manager) to act immediately
Severity 2 fires → the DBA team distribution list gets an email during business hours, formatted with enough detail to triage without opening OEM first
Non-prod fires → a daily digest review, not individual notifications
A 15-minute silence from the paged engineer → the manager gets a follow-up automatically
Post-incident → Incident Manager has a full audit trail, linked to the original metric event, with timestamps for acknowledgment and resolution
That's not a complex setup. It's around three rule sets, a handful of rules each, and a couple of notification methods — scale that up or down to fit your environment. The complexity isn't in the implementation — it's in taking the time to think through how your specific team works before clicking through the console.
The Bottom Line
OEM 24ai's monitoring is only as effective as its notification routing. An alert that goes to the wrong person, arrives with no context, or gets buried in noise from lower-severity events might as well not exist.
Start with a rule set audit: review your rule sets in the console (Setup → Incidents → Incident Rules), exporting them with export_incident_rule_set if you want a full offline copy, and map what you have against what your team actually receives and responds to. In most environments I've seen, there's a significant gap between what OEM is configured to send and what the ops team treats as actionable.
Fix the routing first. The rest of the observability work in this series depends on people trusting that when OEM fires, it means something real.
Next in this series: OEM 24ai Blackouts and Maintenance Windows — how to suppress alerts during planned maintenance without creating monitoring blind spots or drowning your team when the window ends.


