Lesson 37 · Narrow Corners · Capstone

The primitives under the principles

Where "single writer," "Redis is rebuildable," and "heal safely" are actually implemented. ~14 min.

Builds on: L9 · L8 · L30
Anchor: distributed locks & fencing
New: lease + epoch fencing
New: the shadow-vs-live boundary

The last corners. L9 asserted "all writes go through one leased pod"; L8 said "Redis is a rebuildable projection"; L30 said healers "ship shadow-first." Each was a principle. This capstone opens the small packages where those principles are actually code — and the lowest-level one, statetracker, turns out to hold the sharpest distributed-systems idea in the whole repo: fencing tokens.

1. statetracker: the coordination primitives

This tiny package is the Redis-backed coordination layer under everything: the lease, the writer epoch, the cursor cache, and the transient-error classifier. It's where "one writer" stops being a diagram and becomes a lock.

The lease — electing one writer

AcquireLease is a Redis SET NX with a TTL: the first pod to set lease:graph-writer wins, and the others get ErrLeaseHeld. A keepalive renews the lease every ttl/3. The subtlety is in the renewal:

-- renew ONLY if the key still holds OUR id: atomic GET-and-PEXPIRE in one Lua script
if redis.call("get", KEYS[1]) == ARGV[1] then
  return redis.call("pexpire", KEYS[1], ARGV[2])
end
return 0

A naive EXPIRE would extend whatever id sits at the key, including a thief's, after a Redis brown-out let a second pod steal the slot. The atomic check returns 0, meaning "you lost it"; the keepalive then stops and never re-acquires (the legitimate next holder is whoever wins the next SET NX). IsHeld is the authoritative check, wired to /healthz so that a pod that has lost the lease gets recycled.

The gem: writer-epoch fencing
A lease guarantees liveness (someone is the writer), but not safety — a zombie pod, paused mid-GC during a brown-out, can believe it still holds the lease while a successor took over. The fix is a classic fencing token: IncrWriterEpoch does a Redis INCR once per lease acquisition, so each writer generation gets a strictly higher number. That epoch is stamped on BlockCursor.writer_epoch on every commit, and the database rejects a write carrying an epoch lower than what it already persisted. So even a zombie that wrongly thinks it holds the lease is fenced out at the DB — its stale epoch loses. Two layers: the lease elects, the epoch enforces.

Transient vs fatal — turning crashes into backpressure

IsTransientRedisErr classifies a Redis error as retryable (the caller backs off and retries) or fatal (the error bubbles up). Retryable errors are sentinel brown-outs (LOADING, NOREPLICAS, CLUSTERDOWN), network errors, and notably OOM (the maxmemory cap was hit, so treat it as backpressure: pause ingest and alert). This one predicate turned 2,599 production crashes in 9 days into retries with backoff: the difference between a crashloop and a pod that rides out a Redis hiccup. It's L8's "fail loud, don't corrupt" with a crucial refinement: distinguish a transient blip from a real fault before you decide to die.

And the cursor cache
CacheBlockCursor writes cursor:{chain}:committed to Redis after the Neo4j tx (which holds the authoritative cursor) commits. So the Redis cursor is a fast, rebuildable mirror of a source of truth that lives in the graph, exactly L8's "Redis is a projection." If it's lost, it's re-derived from the graph on startup.

2. genesis: birth and self-heal

L8 told you "Redis is a rebuildable projection; the system self-heals on startup." Here's the literal function: PopulateBalancesFromNeo4j rebuilds the entire Redis balance cache from Neo4j HOLDS.quantity_raw, the canonical balance as of the committed cursor. It exists to recover from a crash between the Neo4j commit and the deferred Redis SET, and it's idempotent: safe to run on every startup, free when nothing's missing, cheap when the cache is hot.
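The idempotence claim is easy to see in miniature. The sketch below uses two in-memory maps in place of the graph and Redis; rebuild, the key format, and the choice to overwrite mismatched entries are all assumptions made for illustration, not a description of PopulateBalancesFromNeo4j.

```go
package main

import "fmt"

// rebuild copies every canonical balance into the cache and returns how many
// entries it had to write. Entries that are already correct are left untouched,
// so a second run over the same data writes nothing.
func rebuild(graph, cache map[string]string) int {
	written := 0
	for key, want := range graph {
		if got, ok := cache[key]; ok && got == want {
			continue
		}
		cache[key] = want
		written++
	}
	return written
}

func main() {
	// Canonical balances, standing in for HOLDS.quantity_raw in the graph.
	graph := map[string]string{
		"holder-1:token-a": "1000",
		"holder-2:token-a": "250",
	}
	// The cache lost one entry: the crash hit after the graph commit
	// but before the deferred cache write.
	cache := map[string]string{
		"holder-1:token-a": "1000",
	}

	fmt.Println(rebuild(graph, cache)) // 1: only the missing entry is written
	fmt.Println(rebuild(graph, cache)) // 0: nothing left to do
}
```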

The recovery story, completed
Chain these and L8's claim is now fully concrete: the authoritative state is in the graph (committed atomically with the cursor, L4); the Redis caches, namely the balance cache (L36) and the cursor mirror, are projections; on startup, genesis re-derives any lost projection from the graph. Crash recovery is re-derivation, never repair, all the way down to the code.

3. vaultheal: the healer pattern, and when to skip shadow mode

The last corner is a second chainref healer (VAULT_ASSET / RECEIPT_FOR), riding the same reconcile transport from L31. By itself that just confirms the pattern generalizes — but it carries one sharp new idea: it ships live, with no shadow mode. Why is it allowed to write immediately when the OWNS / ADMIN_CTRL healers (L30/L31) couldn't?

The class boundary: immutable truth → heal now; mutable truth → shadow first
A vault's asset() and an aToken's UNDERLYING_ASSET_ADDRESS() are immutable on chain — they never change. So a chain-confirmed (src → asset) pair that disagrees with the graph is unambiguously wrong right now, with no race against a newer block. Contrast HOLDS balances, which move every block — there a healer must shadow-first because "stale" is ambiguous. Whether you can safely auto-write depends on whether the truth you're checking can change under you. That's the real lesson L30's shadow mode was pointing at.

One more detail: because asset() is single-valued, fixing a drift takes two legs — PRUNE the stale (src)→(old asset) edge AND HEAL toward (src)→(correct asset). Healing alone would leave the vault with two VAULT_ASSET edges, which every reader sees and which re-surfaces as a drift forever. The prune+heal pair is the convergence invariant.
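The difference between healing alone and prune+heal can be shown with a toy edge set. This is a sketch with hypothetical names (graph, healOnly, pruneAndHeal, drifted) and placeholder addresses; the real healer works against the graph database, not a Go map.

```go
package main

import (
	"fmt"
	"sort"
)

// graph maps a source vault to the set of assets it has VAULT_ASSET edges to.
type graph map[string]map[string]bool

func (g graph) addEdge(src, asset string) {
	if g[src] == nil {
		g[src] = map[string]bool{}
	}
	g[src][asset] = true
}

// healOnly adds the chain-confirmed edge but leaves any stale edge in place.
func (g graph) healOnly(src, asset string) { g.addEdge(src, asset) }

// pruneAndHeal removes every edge that disagrees with chain truth, then adds
// the correct one.
func (g graph) pruneAndHeal(src, asset string) {
	for a := range g[src] {
		if a != asset {
			delete(g[src], a)
		}
	}
	g.addEdge(src, asset)
}

// drifted reports whether src has anything other than exactly one edge to asset.
func (g graph) drifted(src, asset string) bool {
	return len(g[src]) != 1 || !g[src][asset]
}

// edges returns the assets src points at, sorted for stable output.
func (g graph) edges(src string) []string {
	out := make([]string, 0, len(g[src]))
	for a := range g[src] {
		out = append(out, a)
	}
	sort.Strings(out)
	return out
}

func main() {
	g := graph{}
	g.addEdge("vault", "0xOLD") // stale edge already in the graph

	g.healOnly("vault", "0xNEW")
	fmt.Println(g.edges("vault"), g.drifted("vault", "0xNEW")) // [0xNEW 0xOLD] true

	g.pruneAndHeal("vault", "0xNEW")
	fmt.Println(g.edges("vault"), g.drifted("vault", "0xNEW")) // [0xNEW] false
}
```

Running pruneAndHeal a second time changes nothing, which is what makes the pair a convergence invariant rather than a one-off repair.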

The codebase, fully toured
With the coordination primitives, genesis, and vaultheal, every package worth opening in risk-graph-indexer has been opened. You've gone from "a Transfer event becomes a HOLDS edge" all the way down to the Lua script that keeps two pods from both writing — and back up to the invoice line. The principles you learned early now each have a named primitive underneath them.

Check yourself

1. Why does the lease keepalive renew with an atomic GET-and-PEXPIRE Lua script instead of a plain EXPIRE?
2. The lease guarantees liveness but not safety. What does the writer epoch add?
3. A zombie pod (paused mid-GC during a brown-out) wakes up still believing it holds the lease and tries to write. What stops it?
4. IsTransientRedisErr classifies OOM (maxmemory cap) as transient, not fatal. Why?
5. CacheBlockCursor writes the Redis cursor only after the Neo4j tx commits. What does that ordering reflect?
6. PopulateBalancesFromNeo4j runs on every startup and is idempotent. What failure does it recover from?
7. The vaultheal healer ships live (no shadow mode) while the OWNS/ADMIN_CTRL healers don't. What's the deciding factor?
8. Fixing a VAULT_ASSET drift requires two legs — prune the stale edge AND heal the correct one. Why isn't a heal alone enough?
Ask your teacher
Try: "Show me where graph-writer calls AcquireLease then IncrWriterEpoch at startup." · "How does the DB-side epoch fence actually reject a stale write?" · "What's in statetracker's queue.go / backoff.go?" · "How does genesis loader.go relate to the frozen full_graph.json?" · "Are there other immutable-truth healers that could ship live?"

What you can now do

Deep-understanding tour: complete
Ingest → graph → enrich → risk → rules → alerts; the streaming, single-writer, recovery, and observability scaffolding; the at_risk engine end to end; the discovery flywheel; the self-checking harnesses; billing; and now the coordination primitives beneath it all. 37 lessons, every major corner opened, the system map current. You set out to understand risk-graph-indexer deeply, end to end — and you do.

Grounded in: pkg/statetracker/lease.go (AcquireLease SET NX, renewIfHeldScript atomic GET-and-PEXPIRE, IsHeld authoritative, lost-lease-never-reacquire), writer_epoch.go (IncrWriterEpoch once-per-acquire monotonic fence, stamped on BlockCursor.writer_epoch), transient.go (IsTransientRedisErr — sentinels/network/OOM→backpressure, the 2,599-crash fix), cursor.go (CacheBlockCursor after Neo4j commit), pkg/genesis/balance_rebuild.go (PopulateBalancesFromNeo4j idempotent self-heal), pkg/reconcile/vaultheal/healer.go (live-not-shadow on immutable asset()/underlying(), prune+heal two-leg convergence). Verify against source — the code is the truth.