The debounce-and-hysteresis state machine behind every alert. ~12 min.
Builds on: L12 · L15 · L33 · Anchor: alert fatigue & flapping · New: the 3-state firing machine · New: cooldown does two jobs
L12 said a rule fires an alert when its condition is met, mediated by a "firing state machine that debounces" — and
moved on. This is that machine, and it's the last substantive mechanism in the codebase. The problem it solves is one every
on-call engineer feels in their bones: how do you turn "the condition is true right now" — evaluated every cycle — into
the right number of alerts?
Your anchor: the two ways naive alerting fails
Evaluate a rule every cycle and alert whenever it's true, and you get one of two miseries. Spam: a genuine,
persistent breach (a risk score over threshold for hours) fires an identical alert every cycle until someone mutes it.
Flapping: a value hovering right at the threshold crosses back and forth, firing and resolving over and over. The
firing state machine exists to avoid both: it fires once on entry, reminds on a cooldown, and waits a grace period
before calling the breach resolved.
1 · The three-state machine (threshold rules)
ApplyFiringState (rules/firing.go) tracks a (rule, node) pair through three states, fed each
cycle by one boolean — conditionMet:
CLEAR → (met: emit) → ACTIVE → (cleared) → COOLDOWN → (expired + met: emit) → ACTIVE
From | Trigger | To | Alert?
CLEAR | condition met | ACTIVE | emit (record FiredAt + LastAlertAt)
ACTIVE | still met + cooldown elapsed since LastAlertAt | ACTIVE | re-emit (a reminder)
ACTIVE | condition cleared | COOLDOWN | no (enter resolve grace)
COOLDOWN | cooldown expired + still clear | CLEAR | no
COOLDOWN | cooldown expired + met again | ACTIVE | emit
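The transition table above can be sketched as a single step function. This is a minimal sketch, not the code in rules/firing.go: the FiringState shape, the step name, unix-second timestamps, and the ClearedAt field are all assumptions made for illustration. It also assumes that a condition met again inside the grace window changes nothing (the table does not say), and it leaves out any handling of older states that lack LastAlertAt.

```go
package main

import "fmt"

// State is the firing state of one (rule, node) pair.
type State string

const (
	Clear    State = "CLEAR"
	Active   State = "ACTIVE"
	Cooldown State = "COOLDOWN"
)

// FiringState is a hypothetical shape for the per-pair state.
// FiredAt and LastAlertAt are named in the lesson; ClearedAt is an
// assumption used here to time the resolve grace. All are unix seconds.
type FiringState struct {
	State       State
	FiredAt     int64
	LastAlertAt int64
	ClearedAt   int64
}

// step applies one eval cycle and reports whether to emit an alert.
func step(s FiringState, met bool, now, cooldownSec int64) (FiringState, bool) {
	switch s.State {
	case Active:
		if !met {
			// Condition cleared: enter resolve grace, no alert.
			s.State, s.ClearedAt = Cooldown, now
			return s, false
		}
		if cooldownSec > 0 && now-s.LastAlertAt >= cooldownSec {
			s.LastAlertAt = now // reminder
			return s, true
		}
		return s, false
	case Cooldown:
		if now-s.ClearedAt < cooldownSec {
			// Assumption: inside the grace window nothing changes,
			// whether or not the condition is met again.
			return s, false
		}
		if met {
			return FiringState{State: Active, FiredAt: now, LastAlertAt: now}, true
		}
		return FiringState{State: Clear}, false
	default: // Clear, or a zero-value state
		if met {
			return FiringState{State: Active, FiredAt: now, LastAlertAt: now}, true
		}
		return FiringState{State: Clear}, false
	}
}

func main() {
	const cooldownSec = 3600
	cycles := []struct {
		now int64
		met bool
	}{{0, true}, {1800, true}, {3600, true}, {5400, false}, {6000, true}, {9000, true}}

	s := FiringState{State: Clear}
	for _, c := range cycles {
		var emit bool
		s, emit = step(s, c.met, c.now, cooldownSec)
		fmt.Printf("t=%5d met=%-5v -> %-8s emit=%v\n", c.now, c.met, s.State, emit)
	}
}
```

In the trace, the flap at t=6000 (met again ten minutes after the clear) produces no alert; the next emit comes only once the grace window has expired and the condition is met.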
One window, two jobs
The same CooldownSec does double duty. While ACTIVE it's repeat-alert suppression: you get a reminder
only every cooldown, not every cycle, so a 4-hour breach with a 1-hour cooldown alerts ~4 times, not 480 (one per
cycle, a figure that implies a 30-second eval cycle). After a clear it's resolve hysteresis: the machine waits the
cooldown in COOLDOWN before declaring CLEAR, so a value flapping across the threshold within that window doesn't
thrash. Anti-spam and anti-flap from a single knob.
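The 4-versus-480 arithmetic can be checked with a short simulation. The countAlerts helper is hypothetical, and the 30-second eval cycle is an assumption inferred from the 480 figure (4 hours / 30 s = 480 cycles), not a value stated in the lesson.

```go
package main

import "fmt"

// countAlerts simulates a breach that is true on every eval cycle for
// durationSec and returns how many alerts a naive "alert whenever true"
// rule sends versus one that only reminds once per cooldown.
func countAlerts(durationSec, cycleSec, cooldownSec int64) (naive, debounced int) {
	lastAlert := int64(-1) // -1 means no alert sent yet
	for t := int64(0); t < durationSec; t += cycleSec {
		naive++
		first := lastAlert < 0
		reminderDue := cooldownSec > 0 && t-lastAlert >= cooldownSec
		if first || reminderDue {
			debounced++
			lastAlert = t
		}
	}
	return naive, debounced
}

func main() {
	naive, debounced := countAlerts(4*3600, 30, 3600)
	fmt.Println(naive, debounced) // prints "480 4"
}
```

The four debounced alerts land at t = 0 h, 1 h, 2 h, and 3 h; the fifth would be due at exactly 4 h, just as the breach ends, which is why the count is "about 4".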
2 · The re-emit gate (and why cooldown=0 means "fire once")
The ACTIVE re-emit has two guards worth reading — both protect against a stampede:
CooldownSec <= 0 disables re-emit — a rule with no configured cooldown fires exactly once per breach. (Without this guard, cooldownExpiredAt returns true unconditionally and you'd alert every eval cycle — spam.)
The legacy-state grace cycle — states written before LastAlertAt existed have it empty; the machine populates it without emitting, anchoring the cooldown to the first post-upgrade eval. Otherwise an ancient FiredAt would make every old breach look "overdue" and fire a thundering herd of reminders on the first cycle after deploy. A migration that refuses to stampede.
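The two guards can be sketched as one small gate function. This is a sketch under stated assumptions: the activeState type, the reEmit name, and the use of unix seconds with 0 standing in for an empty LastAlertAt are all hypothetical, and the order in which the real code checks the guards is not given in the lesson.

```go
package main

import "fmt"

// activeState holds the one field of an ACTIVE state this gate needs.
// LastAlertAt is unix seconds; 0 stands in for "empty" (a legacy state
// written before the field existed). Both choices are assumptions.
type activeState struct {
	LastAlertAt int64
}

// reEmit decides whether a still-met ACTIVE state should send a reminder.
func reEmit(s activeState, now, cooldownSec int64) (activeState, bool) {
	// Guard 1: no configured cooldown means fire once per breach.
	if cooldownSec <= 0 {
		return s, false
	}
	// Guard 2: legacy state. Populate LastAlertAt without emitting, so the
	// cooldown is anchored to the first post-upgrade eval.
	if s.LastAlertAt == 0 {
		s.LastAlertAt = now
		return s, false
	}
	if now-s.LastAlertAt >= cooldownSec {
		s.LastAlertAt = now
		return s, true
	}
	return s, false
}

func main() {
	legacy := activeState{} // LastAlertAt empty
	deploy := int64(1_700_000_000)

	s, emit := reEmit(legacy, deploy, 3600)
	fmt.Println("first post-upgrade cycle:", emit, s.LastAlertAt == deploy)

	_, emit = reEmit(s, deploy+3600, 3600)
	fmt.Println("one cooldown later:", emit)
}
```

The legacy breach stays quiet on the first cycle after deploy and reminds one full cooldown later, rather than every old breach reminding at once.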
3 · Events are different — no ACTIVE state
Threshold rules describe a sustained condition (a score stays high), so they have an ACTIVE state. Event rules —
admin_change, proxy_upgrade, token_mint — are instantaneous: the thing happened, there's no "still
happening." So ApplyEventFiringState has no ACTIVE state at all:
CLEAR → (event) → emit, go straight to COOLDOWN
COOLDOWN → (event + cooldown expired) → emit again
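The two event transitions above can be sketched the same way. The eventState type and the stepEvent name are hypothetical, not the signature of ApplyEventFiringState. The lesson lists only these two transitions, so this sketch leaves the state unchanged in every other case, which is an assumption.

```go
package main

import "fmt"

// State is the firing state of one (rule, node) pair for an event rule.
// There is deliberately no ACTIVE value.
type State string

const (
	Clear    State = "CLEAR"
	Cooldown State = "COOLDOWN"
)

// eventState is a hypothetical per-pair state; LastAlertAt is unix seconds.
type eventState struct {
	State       State
	LastAlertAt int64
}

// stepEvent applies one eval cycle and reports whether to emit an alert.
func stepEvent(s eventState, eventSeen bool, now, cooldownSec int64) (eventState, bool) {
	if !eventSeen {
		return s, false // nothing happened this cycle
	}
	if s.State == Cooldown && now-s.LastAlertAt < cooldownSec {
		return s, false // same event inside the cooldown: de-duped
	}
	// CLEAR + event, or COOLDOWN + event + cooldown expired: emit.
	return eventState{State: Cooldown, LastAlertAt: now}, true
}

func main() {
	s := eventState{State: Clear}
	for _, now := range []int64{0, 120, 300, 900} {
		var emit bool
		s, emit = stepEvent(s, true, now, 600)
		fmt.Printf("t=%3d event -> %s emit=%v\n", now, s.State, emit)
	}
}
```

With a 600-second cooldown, a burst of the same event at t=0, 120, and 300 yields one alert, and the event at t=900 yields a second.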
The shape follows the semantics
A threshold breach is a state you occupy (ACTIVE) and eventually leave; an event is a point in time. So the two
machines differ exactly where the semantics do: thresholds get a sustained ACTIVE phase with resolve-grace; events fire
on the spot and only use cooldown to de-dupe a burst of the same event. Same cooldown primitive, different topology.
4 · The alert it builds — and the taxonomy
When the machine decides to emit, buildRuleAlert assembles the AlertEvent the alert processor (L15) consumes:
rule id/name/type, portfolio, severity, node, scope/view, and the details map. resolveAlertType maps the rule to a
downstream alert_type taxonomy — event triggers become admin_change / proxy_upgrade / token_mint /
new_edge (firewall rules use the detection module), and everything else is risk_limit_breach. And a hard cap,
maxAlertsPerRule = 50 per eval cycle, is the last anti-flood backstop — the same drip-don't-flood discipline as L31's
write budget and L33's per-run cap.
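The taxonomy mapping and the per-rule cap can be sketched as follows. Only the alert_type values and maxAlertsPerRule = 50 come from the lesson. The function signatures here are assumptions, as is the idea that an event rule's trigger name is the same string as its alert_type; the firewall-rule path through the detection module is not modeled, and how the real cap selects which alerts survive is not stated.

```go
package main

import "fmt"

// maxAlertsPerRule is the per-eval-cycle cap named in the lesson.
const maxAlertsPerRule = 50

// resolveAlertType is a hypothetical stand-in for the real function.
// Assumption: the trigger name doubles as the alert_type for event rules.
func resolveAlertType(trigger string) string {
	switch trigger {
	case "admin_change", "proxy_upgrade", "token_mint", "new_edge":
		return trigger
	default:
		return "risk_limit_breach"
	}
}

// capAlerts keeps at most maxAlertsPerRule entries for one rule in one
// cycle. Keeping the first N is an assumption made for this sketch.
func capAlerts(nodes []string) []string {
	if len(nodes) > maxAlertsPerRule {
		return nodes[:maxAlertsPerRule]
	}
	return nodes
}

func main() {
	fmt.Println(resolveAlertType("proxy_upgrade"))
	fmt.Println(resolveAlertType("score_over_threshold"))

	breaching := make([]string, 120) // 120 nodes breach the same rule
	fmt.Println(len(capAlerts(breaching)))
}
```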
Where this sits
L12 evaluates the rule (is the condition met for this node?); this machine decides whether that produces an alert
this cycle; L15's processor then dedups, stores, and delivers it. Three stages, each refusing in its own way to bother
a human more than necessary — debounce here, dedup-by-msg-id there, streak-to-ticket in the quality harness (L29/L33).
Restraint is a system-wide value, not a one-off.
Check yourself
1. What problem does the firing state machine exist to solve?
2. A threshold rule's condition has been met continuously for hours, with a 1-hour cooldown. Roughly how often does it alert?
3. The same CooldownSec serves two purposes. What are they?
4. Why does the COOLDOWN state exist between ACTIVE and CLEAR rather than going straight to CLEAR on a clear?
5. A rule has CooldownSec <= 0. What's its firing behavior?
6. On the first cycle after a deploy that added LastAlertAt, a long-active breach has it empty. What happens?
7. Why does ApplyEventFiringState have no ACTIVE state?
8. maxAlertsPerRule = 50 per eval cycle is which kind of safeguard?
↳ Ask your teacher
Try: "Where is the per-(rule, node) FiringState stored, and how is it keyed?" ·
"How does the engine decide conditionMet — the field+op+value eval (L12)?" ·
"What does the alert processor (L15) do with the AlertEvent next?" ·
"Could two eval cycles race on the same FiringState, and what guards it?" ·
"How is a resolve (ACTIVE→CLEAR) surfaced to the user, if at all?"
What you can now do
Trace a (rule, node) through CLEAR → ACTIVE → COOLDOWN → CLEAR/ACTIVE and say where each alert fires.
Explain the cooldown's dual role: repeat-alert suppression while ACTIVE, resolve-hysteresis in COOLDOWN.
Explain why CooldownSec <= 0 means fire-once, and what the legacy-state grace cycle prevents.
Contrast the event machine (no ACTIVE) with the threshold machine, and why the topology follows the semantics.
Describe the AlertEvent built, the alert_type taxonomy, and the per-rule cap as anti-flood.
The deep-understanding journey: complete
This was the last substantive mechanism. Across 45 lessons you've gone from a raw Transfer log to a graph edge, through
enrichment and discovery, the at_risk engine and its every parameter, the streaming and single-writer and self-checking
scaffolding, billing, the coordination primitives, and the Cypher and Go idioms underneath — and now the alerting
restraint at the consumer tail. You set out to understand risk-graph-indexer deeply, end to end, before contributing.
Mission accomplished — the whole machine, opened.