Lesson 34 · The Validation Suite · New Subsystem

The graph's self-test

A periodic, two-tier data-quality monitor — and how it differs from chainref. ~13 min.

Builds on: L29 · L32 · L8 | Anchor: usd_value arithmetic, Safe signers | New: fast / slow tiers | New: self-consistency vs chain-truth

I pitched this lesson as "validation — the inline cousin of chainref." Reading the code corrects that, and the correction is the lesson: pkg/validation isn't inline at all. It's a second periodic monitor, running its own loop, asking a different question than chainref. Two quality systems, two altitudes — and knowing which answers which is the point.

Two questions, two subsystems

chainref (L29–33) asks "does the graph match the blockchain?" — it re-reads on-chain truth, and it acts (heals, tickets). validation asks "does the graph obey its own rules and stay internally coherent?" — mostly self-consistency checks, and it only reports (logs + metrics, no auto-fix). One guards reality; the other guards coherence. They overlap deliberately at the edges, which we'll see.

1 · A catalog of numbered checks

Validation is a flat registry of small CheckFuncs, each with a stable ID, a category, and a graded severity. A check returns Finding{Severity, Category, CheckID, Message, Count, Examples}. The catalog reads like a linter for the graph:

Category	Examples	Asks
Schema (S)	S02 every node has `id`+`graph_id`, S03 IDs lowercase-hex, S04 no self-loops, S05 no duplicate edges, S06 no orphaned edges	structural well-formedness
Classification (C)	C01 valid `behavior_class`, C03 no junk protocol names, C04 multisig has signers + `safe_threshold`, C06 no type downgrades	labels are sane & stable
Balance (B)	B01 sampled `balanceOf()` ≈ `HOLDS.quantity_raw`, B02 `usd_value = qty×price/10^dec`, B03 Σ HOLDS ≤ totalSupply, B04 dust threshold	quantities are arithmetically right
Risk (R)	R07 focus-token count under a warn threshold, plus cross-source price agreement	derived data is in range

Severity is graded per finding, not binary: a check computes info / warn / error from how bad the result is (e.g. B01 is info when clean, warn on minor drift, error on a real mismatch). The suite is a dashboard of dozens of these, each a fresh gauge every cycle.
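A sketch of what graded severity might look like in code. The `Finding` field names come from the lesson; the field types, the `gradeMismatches` helper, and the 5% threshold are assumptions, not what types.go or check_balances.go actually contain.

```go
package main

import "fmt"

type Severity int

const (
	SeverityInfo Severity = iota
	SeverityWarn
	SeverityError
)

func (s Severity) String() string {
	return [...]string{"info", "warn", "error"}[s]
}

// Finding uses the field names from the lesson; the types are assumed.
type Finding struct {
	Severity Severity
	Category string
	CheckID  string
	Message  string
	Count    int
	Examples []string
}

// gradeMismatches turns "how bad" into a severity.
// Hypothetical helper with an illustrative 5% threshold.
func gradeMismatches(sampled, mismatched int) Severity {
	switch {
	case sampled == 0 || mismatched == 0:
		return SeverityInfo // clean
	case mismatched*100 <= sampled*5:
		return SeverityWarn // minor drift: at most 5% of the sample
	default:
		return SeverityError // a real mismatch
	}
}

func main() {
	for _, m := range []int{0, 3, 40} {
		f := Finding{
			Severity: gradeMismatches(100, m),
			Category: "balance",
			CheckID:  "B01",
			Message:  fmt.Sprintf("%d of 100 sampled balances differ", m),
			Count:    m,
		}
		fmt.Printf("%s %s count=%d\n", f.CheckID, f.Severity, f.Count)
	}
}
```

The same check produces three different readings depending on the data, which is what lets a dashboard separate "worth a glance" from "something is broken".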

2 · The two-tier cadence — separate by cost

The core design decision: checks are split into two independent tiers by what they touch, each with its own loop and cadence, so a cheap check never waits on an expensive one.

Fast tier

Neo4j + Redis only. Cadence ~10 min, 2-min per-check timeout. The structural/consistency checks (most of S, C, R).

Slow tier

RPC-dependent (on-chain spot-checks like B01's balanceOf). Cadence ~1 hour, 15-min per-check timeout.

Two registries, two goroutines (RunFast / RunSlow). The reasoning is exactly L24/L29's RPC-budget tension: RPC checks are slow and rate-limited, so isolating them means the cheap graph checks give fast feedback every few minutes instead of being dragged to the hourly RPC cadence. Cost dictates cadence; cadence dictates the loop.

3 · Per-check isolation — a check that breaks is itself a finding

Here's the robustness move. Each check runs in its own goroutine under a per-check timeout, and the runner converts its own failures into findings:

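done := make(chan result, 1) // assumed shape: buffered, so a check finishing after its timeout can still send and exit
go func() {
    var out result
    defer func() {
        if r := recover(); r != nil { out.panicVal = r } // a panic is caught…
        done <- out                                      // …and the result is delivered either way
    }()
    out.findings = check(checkCtx, deps)
}()
select {
case out := <-done:     // out.panicVal != nil → emit a CHECK_PANIC finding (SeverityError), keep going
case <-checkCtx.Done(): // timeout → emit a CHECK_TIMEOUT finding (SeverityError), keep going
}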

The monitor monitors itself

A panicking or hung check doesn't crash the suite or wedge the cycle — it's recovered and emitted as a synthetic CHECK_PANIC / CHECK_TIMEOUT finding, then the next check runs. This is L8's fail-loop discipline pushed to per-check granularity, with a twist: the suite's own breakage becomes first-class data on the same dashboard as the graph's. You can't have a silently-dead check — a dead check reports itself.

4 · Monitor, not control loop

The deepest contrast with chainref: validation has no actuator. Findings flow to telemetry gauges (per check, per category, per severity) and structured logs — and stop there. No streak, no Linear ticket, no healer. Humans read the dashboard and decide.

A subtle gauge detail worth stealing

Every check records its count each cycle, including 0 for passing checks. Why emit a zero? So a check that flips failing → passing doesn't leave a stale non-zero series lingering on the dashboard. A monitor that only reports problems can't show "the problem went away" — recording the zero is what makes the green state visible. Small habit, big difference in an observability system.
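The stale-series problem is easy to reproduce with a toy gauge store. In this sketch the `gauges` map stands in for a metrics backend that remembers the last value written per series; the type and both recorder functions are hypothetical, not the suite's telemetry API.

```go
package main

import "fmt"

// gauges keeps the last value written per check ID, like a gauge series.
type gauges map[string]int

// recordProblemsOnly writes a series only when a check found something.
func recordProblemsOnly(g gauges, counts map[string]int) {
	for id, n := range counts {
		if n > 0 {
			g[id] = n
		}
	}
}

// recordAll writes every check's count each cycle, zero included.
func recordAll(g gauges, counts map[string]int) {
	for id, n := range counts {
		g[id] = n
	}
}

func main() {
	cycle1 := map[string]int{"S05": 12, "S04": 0}
	cycle2 := map[string]int{"S05": 0, "S04": 0} // the duplicates were fixed

	sparse, full := gauges{}, gauges{}
	for _, counts := range []map[string]int{cycle1, cycle2} {
		recordProblemsOnly(sparse, counts)
		recordAll(full, counts)
	}

	fmt.Println("problems-only S05:", sparse["S05"]) // 12, stale
	fmt.Println("always-record S05:", full["S05"])   // 0, recovered
}
```

After the second cycle the problems-only store still claims 12 duplicate edges, because nothing ever overwrote the old value.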

5 · The deliberate overlap with chainref

Notice B03 here — Σ HOLDS ≤ totalSupply — is the same invariant as L32's BalanceConservationVerifier. That's not duplication by accident; it's the same fact checked at two altitudes:

	validation B03	chainref BalanceConservation (L32)
Role	a coarse data-quality alarm (warn finding on a gauge)	a precise audited drift check with an asymmetric band
Acts?	no — reports only	yes — feeds streak → ticket / heal
Question	"is the graph internally coherent right now?"	"does it match the chain, within policy?"

Defense in depth: the cheap monitor flags it fast; the precise harness confirms, grades, and acts. Real systems check the same critical invariant in more than one place, on purpose.

The full self-checking picture

The indexer guards itself three ways: chainref (does it match the chain? — audited + actuated, L29–33), validation (does it obey its own rules? — monitored, here), and parity (does Go match Python batch? — L23). Three lenses on "is the graph any good," each catching what the others can't.

Check yourself

1. What question does the validation suite answer, versus chainref?

2. Why are checks split into a fast tier and a slow tier with separate loops?

3. A validation check panics mid-cycle. What happens?

4. What does it mean that validation is a "monitor, not a control loop"?

5. Why does each check record a count every cycle, including 0 for passing checks?

6. B02 validates usd_value = quantity_raw × price / 10^decimals. What category of error does that catch?

7. Validation's B03 and chainref's BalanceConservation verifier check the same Σ HOLDS ≤ totalSupply invariant. Why have both?

8. A check's severity is graded (info / warn / error) rather than a pass/fail boolean. What does that buy?

↳ Ask your teacher

Try: "Show me S05's per-edge-type duplicate query and its memory cap." · "How does C06 detect a type downgrade (pool → token)?" · "Where is the Validator wired up — which binary runs it?" · "How do nil rdb / rpcPool make checks skip gracefully?" · "How does the /quality dashboard combine validation + chainref signals?"

What you can now do

State validation's question (internal coherence / own-rules) versus chainref's (chain truth) and parity's (Python match).
Read the numbered-check catalog (S / C / B / R) and the graded-severity Finding shape.
Explain the fast/slow tier split as cost-dictates-cadence, and why two independent loops matter.
Explain per-check isolation and how a panic/timeout becomes a first-class self-reported finding.
Explain monitor-vs-control-loop, the zero-count gauge habit, and the deliberate B03 ↔ chainref overlap.

Self-checking, fully mapped

With chainref (L29–33) and validation (here), you've seen both of the indexer's self-watching systems — one that audits against the chain and acts, one that monitors its own coherence and reports. A production data system that's trusted with billions doesn't assume it's correct; it continuously proves it, from several angles.

Grounded in: pkg/validation/validator.go (Validator two-tier TierFast/TierSlow registries, RunFast/RunSlow independent loops, per-check goroutine + timeout + panic→CHECK_PANIC / timeout→CHECK_TIMEOUT, gauge-per-check incl. 0, graceful nil rdb/rpcPool), check_schema.go (S01–S09 incl. S05 duplicate edges, FORTA-3063 mem cap), check_classification.go (C01/C03/C04/C06), check_balances.go (B01–B04, graded severity), types.go (Finding/Severity). Verify against source — the code is the truth.