L29 ended with a fork: a finding either becomes a Linear ticket, or a healer auto-fixes it. This lesson is
that second branch — and it's less about the fix than about the fear. Auto-mutating a 2M-node production graph that
prices billions in risk is genuinely dangerous: one bad prune deletes real edges. So the healer earns the right to write
through a stack of guards. Studying them is a masterclass in safe automation.
Your anchor: keeping the Safe owner set true
Take the running example, the OWNS healer. A monitored Gnosis Safe's true owners are whatever getOwners()
returns right now. The graph stores (safe)-[:OWNS]->(owner) edges — written from AddedOwner/SafeSetup
events, which can be missed or go stale. The verifier (L29) finds the mismatch; the healer's job is to quietly bring the
stored edges back in line with getOwners(). Simple intent — the danger is all in how.
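The mismatch the verifier hands over is just a two-way set difference. Here is a minimal sketch of that comparison, assuming owners are plain address strings; `OwnerDiff` and `DiffOwners` are hypothetical names for illustration, not the verifier's real types.

```go
package main

import (
	"sort"
	"strings"
)

// OwnerDiff holds the two drift sets for one Safe.
// Hypothetical type: the real verifier's shapes are not shown in this lesson.
type OwnerDiff struct {
	Gap    []string // owner on-chain, no stored OWNS edge
	Excess []string // stored OWNS edge, owner gone on-chain
}

// DiffOwners compares a getOwners() result with the owners the graph stores.
// Addresses are lower-cased before comparison (an assumption: hex casing may
// differ between the RPC response and the stored edge).
func DiffOwners(onChain, stored []string) OwnerDiff {
	chainSet := make(map[string]bool, len(onChain))
	for _, a := range onChain {
		chainSet[strings.ToLower(a)] = true
	}
	storedSet := make(map[string]bool, len(stored))
	for _, a := range stored {
		storedSet[strings.ToLower(a)] = true
	}

	var d OwnerDiff
	for a := range chainSet {
		if !storedSet[a] {
			d.Gap = append(d.Gap, a)
		}
	}
	for a := range storedSet {
		if !chainSet[a] {
			d.Excess = append(d.Excess, a)
		}
	}
	// Sort so the result is deterministic despite map iteration order.
	sort.Strings(d.Gap)
	sort.Strings(d.Excess)
	return d
}
```

Calling `DiffOwners` with the on-chain list and the stored list yields the gap set (candidates for healing) and the excess set (candidates for pruning).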
1 · What a healer is, and the two ops
A Healer (runner_healer.go) is a per-class reconcile hook the runner invokes right
after that class's verifier runs, handing it the HealInput (the gap/excess diff sets the run computed). It
renders those diffs into exactly two op kinds — mirroring L29's finding taxonomy:
| Op | Triggered by | Action |
|---|---|---|
| HEAL | a gap (owner on-chain, no stored edge) | MERGE the missing (safe)-[:OWNS]->(owner) edge with the exact props an event handler would stamp |
| PRUNE | an excess (stored edge, owner gone on-chain) | a temporal-guarded DELETE of the stale edge |
"Converge the graph toward chain truth" is the whole mandate. But every word of the guards below exists because a
careless converge could corrupt the very graph it's auditing.
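The gap-to-HEAL and excess-to-PRUNE rendering can be sketched roughly as follows. The field names on `HealInput` and `Op`, and the `RenderOps` function itself, are assumptions made for illustration; the real definitions in runner_healer.go are not shown here. Only the two op kinds and the `reconcile:owns` source value come from this lesson.

```go
package main

// OpKind names the two op kinds a healer may render.
type OpKind string

const (
	OpHeal  OpKind = "HEAL"  // MERGE a missing edge
	OpPrune OpKind = "PRUNE" // temporal-guarded DELETE of a stale edge
)

// HealInput is a hypothetical shape for the diff sets the run computed.
type HealInput struct {
	Safe      string
	Gap       []string // owners on-chain with no stored edge
	Excess    []string // stored edges whose owner is gone on-chain
	ReadBlock uint64   // block the verifier read the chain at (assumed field)
}

// Op is a hypothetical rendered operation, ready to be wrapped in a write request.
type Op struct {
	Kind      OpKind
	Safe      string
	Owner     string
	Source    string // provenance stamp
	ReadBlock uint64
}

// sourceReconcileOwns is the provenance value this lesson gives for healed OWNS edges.
const sourceReconcileOwns = "reconcile:owns"

// RenderOps turns every gap into a HEAL and every excess into a PRUNE.
// It only renders; whether anything is published is decided elsewhere.
func RenderOps(in HealInput) []Op {
	ops := make([]Op, 0, len(in.Gap)+len(in.Excess))
	for _, owner := range in.Gap {
		ops = append(ops, Op{Kind: OpHeal, Safe: in.Safe, Owner: owner,
			Source: sourceReconcileOwns, ReadBlock: in.ReadBlock})
	}
	for _, owner := range in.Excess {
		ops = append(ops, Op{Kind: OpPrune, Safe: in.Safe, Owner: owner,
			Source: sourceReconcileOwns, ReadBlock: in.ReadBlock})
	}
	return ops
}
```

Keeping rendering free of side effects is what lets the same ops be counted, budgeted, or dropped by the layers around the healer.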
2 · The five guards (the actual lesson)
1. Subordinate to the audit. Heal is best-effort: a healer error is logged, never propagated into the verifier result. "A reconcile transport hiccup must never wedge the quality-gate cycle." Auditing is the job; healing is a bonus that can fail without consequence, the same fail-loop posture as L8/L23.
2. Through the single writer only. A healer never touches the graphstore directly. Every mutation is a graphwrite.Request published through the reconcile transport onto the single-writer stream (L9). Auto-fixes go through the exact same one door as the indexer's writes, with no side channel that could race the canonical writer.
2½. Provenance stamp. A healed edge carries source = 'reconcile:owns', distinct from an event-written edge (source = 'event:AddedOwner'). You can always tell what the machine touched versus what the chain's own events wrote.
3. The partial-enumeration mass-delete guard. The crux. If the verifier's EnumerateOnChain was partial (an RPC truncated the truth set), the runner empties the Excess set before the healer sees it. Why: a partial chain read can only under-report, so it inflates "excess" with edges that are actually still live. Pruning on that would mass-delete real data. Gaps stay safe (a partial read can't fabricate a missing owner), so healing continues; only pruning is suppressed.
4. Temporal guard on prunes. The DELETE is conditioned on block: it preserves an edge an event handler re-asserted at a newer block than the verifier read. So a race, where AddedOwner fires after the reconcile snapshot, doesn't get clobbered. Newer-block-wins, the idempotency discipline from L9.
5. Shadow mode first. The OWNS healer ships with its write budget pinned to 0: it renders + counts every op (would_heal / would_prune on the metrics) but publishes nothing. Operators watch the counts in production before any write lands. Unparking is config-gated behind two tickets: volume calibration and race-skip visibility.
Why the asymmetry — heal freely, prune fearfully
Notice that guards 3 and 4 protect pruning specifically. That's the asymmetry at the heart of safe reconciliation: a
wrong heal adds a spurious edge you can later prune; a wrong prune destroys data you may not be able to
recover. So the system treats adds as low-risk and deletes as high-risk, and pours its guards into the delete path,
exactly how you'd hand-reconcile a production database.
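To make the delete-side caution concrete, here is an in-memory sketch of the newer-block-wins rule from guard 4. It models the stored edges as a Go map rather than showing the real conditional DELETE, which this lesson does not include; `Edge`, `PruneIfNotNewer`, and the treatment of an equal block as prunable are all assumptions.

```go
package main

// Edge is a hypothetical in-memory stand-in for a stored (safe)-[:OWNS]->(owner) edge.
type Edge struct {
	Owner string
	Block uint64 // block at which the edge was last asserted
}

// PruneIfNotNewer deletes the edge for owner only if it was last asserted at or
// before readBlock, the block the verifier read the chain at. It reports whether
// a delete happened; false means the prune was skipped (edge missing, or an
// event handler re-asserted it at a newer block and therefore wins the race).
func PruneIfNotNewer(edges map[string]Edge, owner string, readBlock uint64) bool {
	e, ok := edges[owner]
	if !ok {
		return false // nothing to prune
	}
	if e.Block > readBlock {
		return false // race-skip: newer block wins, the edge is preserved
	}
	delete(edges, owner)
	return true
}
```

The boolean return matters as much as the delete: a skipped prune is the signal operators need to see to confirm the guard fires.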
3 · Why shadow mode, concretely
Shadow mode is the difference between "we wrote an auto-healer" and "we trust an auto-healer in prod." The two gates
blocking the budget raise spell out what trust requires:
Volume calibration — how many heal/prune writes per cycle is the single-writer stream safe to absorb? You can't know without watching the shadow counts first.
Race-skip visibility — a temporal-guarded prune that matches 0 rows (because an event handler won the race) must be observable before you trust the prune path not to delete live edges. Shadow mode is how you confirm the guard fires as designed.
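A write budget pinned to 0 can be pictured as a gate in front of the publish call. The sketch below is an assumption about how such a gate might look, not the contents of pkg/reconcile: `PublishWithBudget`, `Metrics`, and the rule that ops beyond the budget are counted instead of sent are all hypothetical. Only the would_heal / would_prune counters and the budget-of-zero behavior come from this lesson.

```go
package main

// Op is a hypothetical rendered operation; Kind is "HEAL" or "PRUNE".
type Op struct {
	Kind string
}

// Metrics mirrors the would_heal / would_prune counters, plus a published count.
type Metrics struct {
	WouldHeal  int
	WouldPrune int
	Published  int
}

// PublishWithBudget sends at most budget ops through send and counts the rest
// as would-be writes. With budget == 0 (shadow mode) nothing is ever sent and
// every rendered op shows up in the would_* counters.
func PublishWithBudget(ops []Op, budget int, send func(Op) error, m *Metrics) error {
	sent := 0
	for _, op := range ops {
		if sent < budget {
			if err := send(op); err != nil {
				return err
			}
			sent++
			m.Published++
			continue
		}
		switch op.Kind {
		case "HEAL":
			m.WouldHeal++
		case "PRUNE":
			m.WouldPrune++
		}
	}
	return nil
}
```

Because the ops are still rendered in shadow mode, the counters reflect what production would actually have absorbed, which is the input volume calibration needs.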
The package-structure detail (a real Go constraint)
Healers live in their own healers/ package, not in chainref. Reason: the shared transport
(pkg/reconcile) imports chainref for its Ref/Kind types, and a concrete healer imports
both reconcile and chainref. If the healer lived inside chainref, that'd be an import cycle. So the
Healer interface stays in chainref (the runner depends on it) while the implementations live outside,
a clean example of breaking a cycle by separating interface from implementation.
4 · The reap cousin
One adjacent cleanup worth naming: ReapOrphanedReports (reap.go). When a verifier's
Class() is renamed, the old :QualityReport node is orphaned (no verifier writes it anymore, so its coverage
freezes at the last pre-rename run). Reap deletes any QualityReport not in the live registry's keep set — routed,
of course, through the single-writer path. A small reminder that self-maintenance includes cleaning up after the
maintainers' own renames.
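The selection step of reaping is a set difference against the keep set. A minimal sketch, assuming report nodes are identified by their class name; `OrphanedReports` is a hypothetical helper, and the actual deletes would still be published through the single-writer path rather than executed here.

```go
package main

import "sort"

// OrphanedReports returns the stored QualityReport classes that no verifier in
// the live registry claims any more. keep is the registry's keep set.
// It only selects candidates; it performs no deletion itself.
func OrphanedReports(stored []string, keep map[string]bool) []string {
	var orphans []string
	for _, class := range stored {
		if !keep[class] {
			orphans = append(orphans, class)
		}
	}
	// Sort for a stable, reviewable result.
	sort.Strings(orphans)
	return orphans
}
```

For example, if a verifier class was renamed, the old class name appears in the stored list but not in the keep set, so it is the only value returned.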
The control loop is now complete — and safe
L29 measured drift; L30 closes it: heal gaps, prune excess, re-write drift — but only through the single writer, only
with partial-safe and race-safe deletes, and only after shadow-mode proves the volume. Derive (L24–28) → measure (L29) →
correct (L30), with every step that mutates production wrapped in a guard. That's the whole self-healing story.
Check yourself
1. When is a healer invoked, and what is it handed?
2. A healer renders two op kinds. Which pairing is correct?
3. A healer's Heal call returns an error mid-cycle. What does the runner do?
4. Why does a healer publish through the reconcile transport instead of writing to graphstore directly?
5. The verifier's EnumerateOnChain came back partial this cycle. What does the runner do to the Excess set before the healer sees it?
6. Why do the guards protect pruning far more heavily than healing?
7. The OWNS healer ships in shadow mode. What does that mean in practice?
8. The temporal guard on a prune exists to handle which situation?
What you can now do
Explain a healer as a per-class reconcile hook invoked after its verifier, rendering gap→HEAL and excess→PRUNE ops.
Recite the five guards: best-effort subordination, single-writer-only (plus its provenance stamp), partial mass-delete guard, temporal prune guard, shadow mode first.
Explain the heal-freely / prune-fearfully asymmetry and why deletes carry the heavy guards.
Describe shadow mode and the two trust gates (volume calibration, race-skip visibility) before writes are unparked.
Explain why healers live in their own package (the chainref↔reconcile import-cycle break) and what reap cleans up.
A study in earning the right to write
The healer isn't clever — its fixes are one-line MERGEs and DELETEs. What's sophisticated is the discipline around them:
every guard answers a specific way auto-mutation could hurt a production graph. That's the transferable lesson — automated
remediation is mostly about the guardrails, not the fix.