Lesson 30 · The Healer Subsystem · Quality Internals

Letting a machine fix the graph

How chainref auto-corrects drift — and the layered guards that make that safe. ~14 min.

Builds on: L29 · L9 · L8
Anchor: Safe getOwners / AddedOwner
New: heal / prune ops
New: shadow mode + safety guards

L29 ended with a fork: a finding either becomes a Linear ticket, or a healer auto-fixes it. This lesson is that second branch — and it's less about the fix than about the fear. Auto-mutating a 2M-node production graph that prices billions in risk is genuinely dangerous: one bad prune deletes real edges. So the healer earns the right to write through a stack of guards. Studying them is a masterclass in safe automation.

Your anchor: keeping the Safe owner set true

Take the running example, the OWNS healer. A monitored Gnosis Safe's true owners are whatever getOwners() returns right now. The graph stores (safe)-[:OWNS]->(owner) edges — written from AddedOwner/SafeSetup events, which can be missed or go stale. The verifier (L29) finds the mismatch; the healer's job is to quietly bring the stored edges back in line with getOwners(). Simple intent — the danger is all in how.

1 · What a healer is, and the two ops

A Healer (runner_healer.go) is a per-class reconcile hook the runner invokes right after that class's verifier runs, handing it the HealInput (the gap/excess diff sets the run computed). It renders those diffs into exactly two op kinds — mirroring L29's finding taxonomy:

Op	Triggered by	Action
HEAL	a gap (owner on-chain, no stored edge)	`MERGE` the missing `(safe)-[:OWNS]->(owner)` edge with the exact props an event handler would stamp
PRUNE	an excess (stored edge, owner gone on-chain)	a temporal-guarded `DELETE` of the stale edge
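The gap-to-HEAL and excess-to-PRUNE mapping in the table can be sketched in a few lines of Go. This is a minimal illustration, not the code in healers/owns.go: the `Op` struct, the `RenderOps` function, and the field types on `HealInput` are assumptions made for the example (the real ops are graphwrite.Request values).

```go
package main

import "fmt"

// OpKind names the two op kinds from the lesson. The type itself is hypothetical.
type OpKind string

const (
	OpHeal  OpKind = "HEAL"
	OpPrune OpKind = "PRUNE"
)

// HealInput follows the lesson's description (gap and excess diff sets plus
// the block the verifier read at). Field types are assumptions.
type HealInput struct {
	Safe   string
	Gaps   []string // owners on-chain with no stored edge
	Excess []string // stored edges whose owner is gone on-chain
	Block  uint64
}

// Op is an illustrative stand-in for a rendered write request.
type Op struct {
	Kind   OpKind
	Safe   string
	Owner  string
	Block  uint64
	Source string // provenance stamp, set on healed edges only
}

// RenderOps turns every gap into a HEAL and every excess into a PRUNE.
func RenderOps(in HealInput) []Op {
	ops := make([]Op, 0, len(in.Gaps)+len(in.Excess))
	for _, owner := range in.Gaps {
		ops = append(ops, Op{Kind: OpHeal, Safe: in.Safe, Owner: owner, Block: in.Block, Source: "reconcile:owns"})
	}
	for _, owner := range in.Excess {
		ops = append(ops, Op{Kind: OpPrune, Safe: in.Safe, Owner: owner, Block: in.Block})
	}
	return ops
}

func main() {
	in := HealInput{Safe: "0xSafe", Gaps: []string{"0xA"}, Excess: []string{"0xB"}, Block: 100}
	for _, op := range RenderOps(in) {
		fmt.Printf("%s %s -> %s at block %d\n", op.Kind, op.Safe, op.Owner, op.Block)
	}
}
```

Rendering is kept separate from publishing here so that the same op list can be counted, budgeted, or dropped before anything reaches the writer.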

"Converge the graph toward chain truth" is the whole mandate. But every word of the guards below exists because a careless converge could corrupt the very graph it's auditing.

2 · The five guards (the actual lesson)

1. Subordinate to the audit. Heal is best-effort: a healer error is logged, never propagated into the verifier result. "A reconcile transport hiccup must never wedge the quality-gate cycle." Auditing is the job; healing is a bonus that can fail without blocking the audit, the same fail-loop posture as L8/L23.
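A rough sketch of that best-effort contract, assuming a hypothetical `runClass` function and a simplified `Healer` signature (the real interface lives in runner_healer.go and may differ): the verifier's error still surfaces, while the healer's error is only logged.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Report and HealInput are simplified stand-ins for the real types.
type Report struct {
	Class string
	Gaps  int
}

type HealInput struct {
	Gaps, Excess []string
}

// Healer uses an assumed signature for illustration.
type Healer interface {
	Heal(in HealInput) error
}

// failingHealer simulates a reconcile transport hiccup.
type failingHealer struct{}

func (failingHealer) Heal(HealInput) error {
	return errors.New("transport unavailable")
}

// runClass runs the verifier, then the healer. A verifier error is returned
// to the caller; a healer error is logged and dropped.
func runClass(verify func() (Report, HealInput, error), h Healer) (Report, error) {
	rep, in, err := verify()
	if err != nil {
		return rep, err
	}
	if h != nil {
		if herr := h.Heal(in); herr != nil {
			log.Printf("healer for class %q failed (ignored): %v", rep.Class, herr)
		}
	}
	return rep, nil
}

func main() {
	verify := func() (Report, HealInput, error) {
		return Report{Class: "owns", Gaps: 1}, HealInput{Gaps: []string{"0xA"}}, nil
	}
	rep, err := runClass(verify, failingHealer{})
	fmt.Println(rep.Class, rep.Gaps, err)
}
```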

2. Through the single writer only. A healer never touches the graphstore directly. Every mutation is a graphwrite.Request published through the reconcile transport onto the single-writer stream (L9). Auto-fixes go through the exact same one door as the indexer's writes, with no side channel that could race the canonical writer.

2½. Provenance stamp. A healed edge carries source = 'reconcile:owns', distinct from an event-written edge (source = 'event:AddedOwner'). You can always tell what the machine touched versus what the chain's own events wrote.

3. The partial-enumeration mass-delete guard. The crux. If the verifier's EnumerateOnChain was partial (an RPC truncated the truth set), the runner empties the Excess set before the healer sees it. Why: a partial chain read can only under-report, so it inflates "excess" with edges that are actually still live. Pruning on that would mass-delete real data. Gaps stay safe (every owner a partial read does return is real, so it can't fabricate a gap), so healing continues; only pruning is suppressed.
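A small worked example shows why a truncated read inflates the excess set and what emptying it protects. The `diff` and `guardPartial` helpers and the `partial` flag are hypothetical names for this sketch; the lesson only states that the runner empties Excess when the enumeration was partial.

```go
package main

import "fmt"

// HealInput is a simplified stand-in holding only the two diff sets.
type HealInput struct {
	Gaps   []string
	Excess []string
}

// diff compares stored edges against what the chain read returned.
func diff(stored, onChain []string) HealInput {
	chain := make(map[string]bool, len(onChain))
	for _, o := range onChain {
		chain[o] = true
	}
	have := make(map[string]bool, len(stored))
	var in HealInput
	for _, o := range stored {
		have[o] = true
		if !chain[o] {
			in.Excess = append(in.Excess, o) // stored, but not seen on-chain
		}
	}
	for _, o := range onChain {
		if !have[o] {
			in.Gaps = append(in.Gaps, o) // seen on-chain, but not stored
		}
	}
	return in
}

// guardPartial drops the excess set when the chain read was partial.
// Gaps are kept: every owner the read did return is real.
func guardPartial(in HealInput, partial bool) HealInput {
	if partial {
		in.Excess = nil
	}
	return in
}

func main() {
	stored := []string{"0xA", "0xB", "0xC"}
	// Suppose the true owner set is {0xA, 0xB, 0xC, 0xD} but the RPC truncated it.
	partialRead := []string{"0xA", "0xD"}

	raw := diff(stored, partialRead)
	fmt.Println("raw excess (live edges misreported):", raw.Excess)

	safe := guardPartial(raw, true)
	fmt.Println("gaps kept:", safe.Gaps, "excess after guard:", safe.Excess)
}
```

In this scenario the unguarded diff would prune 0xB and 0xC, both still live owners, while the gap for 0xD is genuine and safe to heal.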

4. Temporal guard on prunes. The DELETE is conditioned on block: it preserves an edge an event handler re-asserted at a newer block than the verifier read. So a race, where AddedOwner fires after the reconcile snapshot, doesn't get clobbered. Newer-block-wins, the idempotency discipline from L9.
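The newer-block-wins rule can be modeled in memory without any graph database. This sketch is not the Cypher the healer emits: the `Store` map, the `PruneIfNotNewer` method, and the exact comparison (delete when the edge's block is at or below the snapshot block) are assumptions chosen to match the behavior described above.

```go
package main

import "fmt"

// Edge is a stored OWNS edge with the block at which it was last asserted.
type Edge struct {
	Owner string
	Block uint64
}

// Store is a toy in-memory edge set keyed by owner.
type Store map[string]Edge

// PruneIfNotNewer deletes the edge only when it was last written at or before
// the verifier's snapshot block. It reports whether a row was deleted, so a
// skipped prune (the race case) is observable to the caller.
func (s Store) PruneIfNotNewer(owner string, snapshotBlock uint64) bool {
	e, ok := s[owner]
	if !ok || e.Block > snapshotBlock {
		return false // missing, or re-asserted at a newer block: leave it alone
	}
	delete(s, owner)
	return true
}

func main() {
	s := Store{
		"0xStale": {Owner: "0xStale", Block: 90},
		"0xRaced": {Owner: "0xRaced", Block: 105}, // AddedOwner landed after the snapshot
	}
	const snapshot = 100
	fmt.Println("stale pruned:", s.PruneIfNotNewer("0xStale", snapshot))
	fmt.Println("raced pruned:", s.PruneIfNotNewer("0xRaced", snapshot))
	fmt.Println("edges left:", len(s))
}
```

Returning a boolean mirrors the "matched 0 rows" signal: a prune that deletes nothing because an event handler won the race is a visible outcome, not a silent no-op.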

5. Shadow mode first. The OWNS healer ships with its write budget pinned to 0: it renders and counts every op (would_heal / would_prune on the metrics) but publishes nothing. Operators watch the counts in production before any write lands. Unparking is config-gated behind two tickets: volume calibration and race-skip visibility.
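One way to picture a budget of 0 is a healer that counts every op and publishes only while it is under budget. The `Transport` interface, the `Apply` method, the counter fields, and the per-call budget semantics below are all assumptions for illustration; the lesson only confirms that shadow mode counts would_heal / would_prune and publishes nothing.

```go
package main

import "fmt"

// Op is a simplified rendered operation.
type Op struct {
	Kind  string // "HEAL" or "PRUNE"
	Owner string
}

// Transport is a hypothetical stand-in for the reconcile transport, the only
// path by which a healer's writes reach the single-writer stream.
type Transport interface {
	Publish(op Op) error
}

// recorder is a test double that remembers what was published.
type recorder struct{ published []Op }

func (r *recorder) Publish(op Op) error {
	r.published = append(r.published, op)
	return nil
}

// OwnsHealer counts every op and publishes at most budget ops per call.
type OwnsHealer struct {
	transport  Transport
	budget     int // 0 means shadow mode: count only
	WouldHeal  int
	WouldPrune int
}

// Apply returns how many ops were actually published.
func (h *OwnsHealer) Apply(ops []Op) (int, error) {
	published := 0
	for _, op := range ops {
		switch op.Kind {
		case "HEAL":
			h.WouldHeal++
		case "PRUNE":
			h.WouldPrune++
		}
		if published >= h.budget {
			continue // over budget: counted, not written
		}
		if err := h.transport.Publish(op); err != nil {
			return published, err
		}
		published++
	}
	return published, nil
}

func main() {
	rec := &recorder{}
	h := &OwnsHealer{transport: rec, budget: 0}
	n, err := h.Apply([]Op{{Kind: "HEAL", Owner: "0xA"}, {Kind: "PRUNE", Owner: "0xB"}})
	fmt.Println("published:", n, "err:", err)
	fmt.Println("would_heal:", h.WouldHeal, "would_prune:", h.WouldPrune)
}
```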

Why the asymmetry — heal freely, prune fearfully

Notice that guards 3 and 4 (the partial-enumeration guard and the temporal guard) protect pruning specifically. That's the asymmetry at the heart of safe reconciliation: a wrong heal adds an extra edge you can later prune; a wrong prune destroys data you may not be able to recover. So the system treats adds as low-risk and deletes as high-risk, and pours its guards into the delete path, exactly how you'd hand-reconcile a production database.

3 · Why shadow mode, concretely

Shadow mode is the difference between "we wrote an auto-healer" and "we trust an auto-healer in prod." The two gates blocking the budget raise spell out what trust requires:

Volume calibration — how many heal/prune writes per cycle is the single-writer stream safe to absorb? You can't know without watching the shadow counts first.
Race-skip visibility — a temporal-guarded prune that matches 0 rows (because an event handler won the race) must be observable before you trust the prune path not to delete live edges. Shadow mode is how you confirm the guard fires as designed.

The package-structure detail (a real Go constraint)

Healers live in their own healers/ package, not in chainref. Reason: the shared transport (pkg/reconcile) imports chainref for its Ref/Kind types, and a concrete healer imports both reconcile and chainref. If the healer lived inside chainref, that'd be an import cycle. So the Healer interface stays in chainref (the runner depends on it) while the implementations live outside — a clean example of breaking a cycle by separating interface from implementation.

4 · The reap cousin

One adjacent cleanup worth naming: ReapOrphanedReports (reap.go). When a verifier's Class() is renamed, the old :QualityReport node is orphaned (no verifier writes it anymore, so its coverage freezes at the last pre-rename run). Reap deletes any QualityReport not in the live registry's keep set — routed, of course, through the single-writer path. A small reminder that self-maintenance includes cleaning up after the maintainers' own renames.
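The keep-set check at the heart of reaping is a set difference. The sketch below uses a hypothetical `orphanedReports` helper and plain class-name strings; the actual ReapOrphanedReports in reap.go issues its deletes through the single-writer path, which is not modeled here.

```go
package main

import "fmt"

// orphanedReports returns the report classes that no live verifier owns.
// existing is the set of QualityReport classes found in the graph; keep is
// the set of classes in the live registry.
func orphanedReports(existing []string, keep map[string]struct{}) []string {
	var orphans []string
	for _, class := range existing {
		if _, ok := keep[class]; !ok {
			orphans = append(orphans, class)
		}
	}
	return orphans
}

func main() {
	// "owns_legacy" stands for a class name left behind by a rename.
	existing := []string{"owns", "owns_legacy", "modules"}
	keep := map[string]struct{}{"owns": {}, "modules": {}}
	fmt.Println("to reap:", orphanedReports(existing, keep))
}
```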

The control loop is now complete — and safe

L29 measured drift; L30 closes it: heal gaps, prune excess, re-write drift — but only through the single writer, only with partial-safe and race-safe deletes, and only after shadow-mode proves the volume. Derive (L24–28) → measure (L29) → correct (L30), with every step that mutates production wrapped in a guard. That's the whole self-healing story.

Check yourself

1. When is a healer invoked, and what is it handed?

2. A healer renders two op kinds. Which pairing is correct?

3. A healer's Heal call returns an error mid-cycle. What does the runner do?

4. Why does a healer publish through the reconcile transport instead of writing to graphstore directly?

5. The verifier's EnumerateOnChain came back partial this cycle. What does the runner do to the Excess set before the healer sees it?

6. Why do the guards protect pruning far more heavily than healing?

7. The OWNS healer ships in shadow mode. What does that mean in practice?

8. The temporal guard on a prune exists to handle which situation?

↳ Ask your teacher

Try: "Show me the temporal guard's Cypher — how does it compare blocks?" · "What's in pkg/reconcile's Transport, and how does the write budget work?" · "Which classes have healers today vs. only ticket?" · "How does a healer's idem-key avoid double-applying across cycles?" · "What would unparking the OWNS budget actually require?"

What you can now do

Explain a healer as a per-class reconcile hook invoked after its verifier, rendering gap→HEAL and excess→PRUNE ops.
Recite the five guards (best-effort subordination, single-writer-only, partial mass-delete guard, temporal prune guard, shadow mode first) plus the 2½ provenance stamp.
Explain the heal-freely / prune-fearfully asymmetry and why deletes carry the heavy guards.
Describe shadow mode and the two trust gates (volume calibration, race-skip visibility) before writes are unparked.
Explain why healers live in their own package (the chainref↔reconcile import-cycle break) and what reap cleans up.

A study in earning the right to write

The healer isn't clever — its fixes are one-line MERGEs and DELETEs. What's sophisticated is the discipline around them: every guard answers a specific way auto-mutation could hurt a production graph. That's the transferable lesson — automated remediation is mostly about the guardrails, not the fix.


Grounded in: pkg/quality/chainref/runner_healer.go (Healer interface, HealInput Gaps/Excess/Report/Block, Partial mass-delete guard FORTA-2886, best-effort never-propagate contract, RegisterHealer), healers/owns.go (OwnsHealer HEAL/PRUNE ops, temporal-guarded DELETE, source='reconcile:owns' provenance, WithShadowMode budget=0 + FORTA-2776/2850 gates, single-writer-only via reconcile.Transport), reap.go (ReapOrphanedReports). Verify against source — the code is the truth.