Lesson 08 · Failure Modes & Recovery

What happens when things go wrong

Reorgs, crashes, lost caches, diverging cursors — and why none of them corrupt the graph. ~13 min.

Builds on: L4 · L7 · Anchor: reorgs & finality · The deep payoff · New: source-of-truth vs cache

A realtime indexer runs forever against an adversarial world: chains reorg, pods crash mid-write, Redis gets flushed, the graph-writer stalls. The remarkable thing about this system is that essentially every failure recovers automatically, and none of them corrupt the graph. This lesson explains why. It all rests on a single idea you've been circling since Lesson 4.

1 · The invariant that makes recovery possible

Straight from the codebase (cmd/indexer/cursor_watchdog.go), the load-bearing rule:

The Block-determinism invariant

State at BlockCursor = N is a pure function of blocks 0..N. Replay the same blocks → get the same graph, byte for byte. (You earned this in Lesson 4: coalesced state + MERGE + the monotonic guard. Lesson 7 then made redelivery safe.)

Why is this the key to recovery? Because if the state is a deterministic function of the blocks, then any lost or corrupted derived state can simply be recomputed by replaying blocks — recovery is never a delicate repair, it's a re-derivation.
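To make "recovery is a re-derivation" concrete, here is a minimal sketch of a replay function whose output depends only on its input blocks. The types `Transfer`, `Block`, and `State` and the function `Replay` are hypothetical simplifications for illustration, not the codebase's real types; the real graph holds far more than balances.

```go
package main

import "fmt"

// Transfer is a hypothetical, simplified mutation: move Amount from From to To.
type Transfer struct {
	From, To string
	Amount   int64
}

// Block is a hypothetical block: a number plus its ordered transfers.
type Block struct {
	Number    uint64
	Transfers []Transfer
}

// State is the derived state: balances plus the cursor of the last applied block.
type State struct {
	Cursor   uint64
	Balances map[string]int64
}

// Replay derives state purely from the blocks: no clock, no randomness, no I/O.
func Replay(blocks []Block) State {
	s := State{Balances: map[string]int64{}}
	for _, b := range blocks {
		for _, t := range b.Transfers {
			s.Balances[t.From] -= t.Amount
			s.Balances[t.To] += t.Amount
		}
		s.Cursor = b.Number
	}
	return s
}

// Equal reports whether two states are identical.
func Equal(a, b State) bool {
	if a.Cursor != b.Cursor || len(a.Balances) != len(b.Balances) {
		return false
	}
	for k, v := range a.Balances {
		if w, ok := b.Balances[k]; !ok || w != v {
			return false
		}
	}
	return true
}

func main() {
	blocks := []Block{
		{Number: 1, Transfers: []Transfer{{"mint", "alice", 100}}},
		{Number: 2, Transfers: []Transfer{{"alice", "bob", 30}}},
	}
	first := Replay(blocks)
	second := Replay(blocks) // e.g. after losing all derived state
	fmt.Println(Equal(first, second), first.Cursor, first.Balances["alice"], first.Balances["bob"])
}
```

Running it prints `true 2 70 30`: the second replay lands on exactly the same state as the first, which is all the invariant asks for.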

2 · Source of truth vs. rebuildable cache

The second pillar: the system is explicit about what is canonical and what is disposable.

Store	Role	If lost…
Memgraph	Source of truth — the graph + the applied `BlockCursor`	This is the real problem (mitigated by DB backups/replication, out of scope here).
Redis (:6379)	Rebuildable cache — monitored set, balances, cursor	Re-derived from Memgraph on startup (§4). No data loss.
redis-cache (:6380)	RPC cache + block stream	RPC cache refills from RPC; the stream is replayable from the cursor.

The mental model in one line

Memgraph is truth; Redis is a fast projection of truth. Anything in Redis can be thrown away and rebuilt from the graph. That single asymmetry is what makes most of the recovery code possible.

3 · The failure catalog

① Chain reorg — prevented, not repaired

A reorg would orphan blocks the indexer already processed — state would silently diverge from the canonical chain. The defense is refreshingly blunt: don't ingest blocks that could still reorg. The feed exposes only a safe head, lagging the chain tip by a finality margin:

// pkg/feed/rpc.go — safeHead
return head - f.finalityLag   // only expose blocks this far behind the tip

The code comment notes ~64 blocks for L1 (layer 1 here, not Lesson 1): "no realistic reorg lives that long." Set the lag to 0 and you ingest head-of-chain, and "state silently diverges from the canonical chain after a reorg."

Recovery: none needed — the failure is designed out. Finality lag trades a little latency for never having to unwind a reorg. (A simple, robust choice over complex rewind logic.)
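A standalone sketch of the same idea, for experimenting outside the codebase. It assumes block numbers are `uint64`, in which case a bare `head - lag` would wrap around on a young chain where `head < lag`; the guard and the `(value, ok)` return shape are assumptions of this sketch, not a claim about what pkg/feed/rpc.go does.

```go
package main

import "fmt"

// safeHead returns the highest block number considered safe to ingest,
// given the chain tip and a finality lag measured in blocks.
// Assumption: with unsigned block numbers, head < finalityLag would wrap
// to a huge value, so this sketch reports "nothing safe yet" instead.
func safeHead(head, finalityLag uint64) (uint64, bool) {
	if head < finalityLag {
		return 0, false
	}
	return head - finalityLag, true
}

func main() {
	fmt.Println(safeHead(1000, 64)) // tip is 1000, expose up to 936
	fmt.Println(safeHead(10, 64))   // chain younger than the lag
	fmt.Println(safeHead(1000, 0))  // lag 0: head-of-chain, reorg-exposed
}
```

The three calls print `936 true`, `0 false`, and `1000 true`. The last case is the configuration the source comment warns against.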

Your EVM instinct, validated

You already know reorgs are shallow and finality is probabilistic. The system encodes exactly that knowledge as a constant. There's no clever reorg-rollback engine because, by waiting for finality, there's nothing to roll back. Deep-systems lesson: the best handling of a hard failure mode is often to arrange never to hit it.

② Indexer crash mid-batch — self-healing by design

A pod dies halfway through writing a block's mutations. Two mechanisms you already know combine:

Atomic commit (L4): mutations + cursor are one transaction → a half-write rolls back entirely. The graph never reflects a partial block.
At-least-once delivery (L7): the block was never XACKed → it's redelivered and reprocessed.

Recovery: automatic. The redelivered block re-applies idempotently (the invariant), landing the system in the exact state it would have reached without the crash. No operator action.
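The interplay of atomic commit and redelivery can be simulated in memory. In this sketch `Graph`, `Mutation`, `ApplyBlock`, and the `crashAt` parameter are all hypothetical; a staged copy of the map stands in for a database transaction, and blocks are numbered from 1 so that a zero cursor means "nothing applied".

```go
package main

import (
	"errors"
	"fmt"
)

// Graph is a hypothetical in-memory stand-in for the graph store.
type Graph struct {
	Cursor   uint64
	Balances map[string]int64
}

// Mutation sets an account to an absolute (coalesced) balance.
type Mutation struct {
	Account string
	Balance int64
}

var errCrash = errors.New("simulated crash before commit")

// ApplyBlock stages mutations on a copy and swaps the copy in together with
// the cursor only at the end, imitating one atomic transaction.
// Blocks at or below the cursor are skipped (monotonic guard), so a
// redelivered block is a no-op. crashAt is the mutation index at which a
// crash is simulated; pass -1 for no crash.
func (g *Graph) ApplyBlock(n uint64, muts []Mutation, crashAt int) error {
	if n <= g.Cursor {
		return nil // already applied
	}
	staged := make(map[string]int64, len(g.Balances))
	for k, v := range g.Balances {
		staged[k] = v
	}
	for i, m := range muts {
		if i == crashAt {
			return errCrash // staged copy is discarded: full rollback
		}
		staged[m.Account] = m.Balance
	}
	g.Balances, g.Cursor = staged, n // commit: mutations + cursor together
	return nil
}

func main() {
	g := &Graph{Balances: map[string]int64{}}
	muts := []Mutation{{"alice", 70}, {"bob", 30}}

	err := g.ApplyBlock(1, muts, 1) // dies after staging the first mutation
	fmt.Println(err != nil, g.Cursor, len(g.Balances))

	_ = g.ApplyBlock(1, muts, -1) // redelivery, no crash this time
	_ = g.ApplyBlock(1, muts, -1) // duplicate redelivery is a no-op
	fmt.Println(g.Cursor, g.Balances["alice"], g.Balances["bob"])
}
```

The first line printed is `true 0 0` (the crash left no trace), the second is `1 70 30` (the same state a crash-free run would reach).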

③ Redis flushed / restarted — rebuilt from the graph

Redis loses the monitored set, balances, and its cursor. Because all of it is derivable from Memgraph, startup runs a sequence of self-heal steps (cmd/indexer/main.go):

// re-populate Redis state from Neo4j if lost (e.g. Redis flush/restart)
// 1. monitored set  ← all node ids in the graph
// 2. module set     ← from graph nodes
// 3. balance cache  ← HOLDS.quantity_raw (the canonical balances)
// 4. block cursor   ← Memgraph BlockCursor

Recovery: the graph re-seeds Redis. HOLDS.quantity_raw is the canonical balance, the monitored set is just "every node," and the cursor lives in Memgraph too.

A subtle ordering dependency worth noticing (this is the kind of detail you came for): the balance reconciler must run before the cursor self-heal. If you healed the cursor first, the Redis and Memgraph cursors would match — masking the drift signal — and the balance rebuild would be skipped even though the cache was empty. Order is load-bearing.
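The ordering hazard is easy to reproduce in a toy model. The sketch below assumes, as the paragraph above implies, that cursor drift is the signal the balance reconciler uses to decide whether to rebuild. `Truth`, `Cache`, `reconcileBalancesIfStale`, `healCursor`, and `startup` are hypothetical stand-ins, not the real functions in cmd/indexer/main.go or pkg/genesis/balance_rebuild.go.

```go
package main

import "fmt"

// Truth is a hypothetical stand-in for Memgraph: canonical balances + applied cursor.
type Truth struct {
	Cursor   uint64
	Balances map[string]int64
}

// Cache is a hypothetical stand-in for Redis: disposable copies of the same data.
type Cache struct {
	Cursor   uint64
	Balances map[string]int64
}

// reconcileBalancesIfStale rebuilds cached balances only when the cursors
// disagree. Assumption: cursor drift is the staleness signal.
func reconcileBalancesIfStale(t *Truth, c *Cache) bool {
	if c.Cursor == t.Cursor {
		return false // looks fresh, skip the rebuild
	}
	c.Balances = make(map[string]int64, len(t.Balances))
	for k, v := range t.Balances {
		c.Balances[k] = v
	}
	return true
}

// healCursor copies the applied cursor into the cache.
func healCursor(t *Truth, c *Cache) { c.Cursor = t.Cursor }

// startup runs the two steps in the requested order and reports whether
// the balance cache was actually rebuilt.
func startup(t *Truth, c *Cache, balancesFirst bool) bool {
	if balancesFirst {
		rebuilt := reconcileBalancesIfStale(t, c)
		healCursor(t, c)
		return rebuilt
	}
	healCursor(t, c)
	return reconcileBalancesIfStale(t, c)
}

func main() {
	truth := &Truth{Cursor: 500, Balances: map[string]int64{"alice": 70}}
	fmt.Println(startup(truth, &Cache{}, true))  // reconciler first
	fmt.Println(startup(truth, &Cache{}, false)) // cursor healed first
}
```

With a freshly flushed cache, the correct order prints `true` and the swapped order prints `false`: healing the cursor first erases the only evidence that the balances are missing.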

Plausibility gate (don't rebuild from garbage)

The balance rebuild refuses implausible HOLDS.quantity_raw rows (uint256 wrap-around) rather than copy them back into Redis — a guard echoing Lesson 2's "single position > $1T = bug." Self-healing must not launder corruption. (The next real Transfer re-derives the correct balance anyway.)
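A plausibility gate of this kind might look like the sketch below. The ceiling of 2^128 is an arbitrary assumption chosen for illustration (a real threshold would depend on token decimals and supply), and `plausible` and `rebuildBalances` are hypothetical names, not the functions in pkg/genesis/balance_rebuild.go.

```go
package main

import (
	"fmt"
	"math/big"
)

// maxPlausibleRaw is a hypothetical ceiling of 2^128. This value is an
// assumption for the sketch, not the threshold the codebase uses.
var maxPlausibleRaw = new(big.Int).Lsh(big.NewInt(1), 128)

// plausible reports whether a raw quantity string parses as a non-negative
// base-10 integer strictly below the ceiling.
func plausible(raw string) bool {
	v, ok := new(big.Int).SetString(raw, 10)
	if !ok || v.Sign() < 0 {
		return false
	}
	return v.Cmp(maxPlausibleRaw) < 0
}

// rebuildBalances copies only plausible rows into a fresh cache and counts
// the rows it refused.
func rebuildBalances(rows map[string]string) (cache map[string]string, skipped int) {
	cache = make(map[string]string, len(rows))
	for holder, raw := range rows {
		if !plausible(raw) {
			skipped++
			continue
		}
		cache[holder] = raw
	}
	return cache, skipped
}

func main() {
	// 2^256 - 1: what a uint256 underflow of (0 - 1) looks like.
	one := big.NewInt(1)
	wrapped := new(big.Int).Sub(new(big.Int).Lsh(one, 256), one).String()

	cache, skipped := rebuildBalances(map[string]string{
		"alice": "70000000000000000000",
		"bob":   wrapped,
	})
	fmt.Println(len(cache), skipped)
}
```

This prints `1 1`: the sane row is copied, the wrapped row is refused and left for the next real Transfer to correct.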

④ Cursor divergence (graph-writer stalls) — the watchdog & graceful pause

This is the most interesting one, and it needs a refinement to your Lesson 4 model. There are really two cursors:

Cursor	Means	Advanced by
Redis `BlockCursor`	"queued for apply"	indexer, immediately on publishing a batch
Memgraph `BlockCursor`	"actually applied"	graph-writer, when it commits the batch tx

Normally they track closely. But if the graph-writer halts (e.g. a DLQ/halt event), Redis keeps racing ahead while Memgraph stays put. The gap is dangerous: the indexer's in-process accumulators pile up state against a graph that has received none of it.

The cursor watchdog samples that gap. When it exceeds the threshold for N consecutive ticks, it flips a shared watchdogState to paused, and the stream-consumer loop reads that flag before every ReadBatch and short-circuits:

// pkg/indexer/indexer.go — Phase 2 consumer loop
if idx.ingestPaused() {   // watchdog flipped us to paused
    // short-circuit before ReadBatch — stop pulling new blocks
    continue
}

Recovery: in-process backpressure. The indexer stops consuming until Memgraph catches up, then resumes. Crucially (FORTA-2660), this replaced the old "log.Fatal → exit → let Kubernetes restart the pod" approach: a graceful pause beats a crash-loop.
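The sampling side of the watchdog can be sketched as a small state machine. Everything here is an assumption for illustration: the `Watchdog` type, its field names, the threshold values, and in particular the resume policy (unpause as soon as one sample is back under the threshold). The real logic lives in cmd/indexer/cursor_watchdog.go and may differ.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Watchdog is a hypothetical sketch of the cursor-gap sampler.
type Watchdog struct {
	threshold uint64      // largest tolerated gap, in blocks
	needTicks int         // consecutive breaches required before pausing
	breaches  int         // current run of consecutive breaches
	paused    atomic.Bool // read by the consumer loop, written by Tick
}

// Tick samples the two cursors once and updates the paused flag.
func (w *Watchdog) Tick(redisCursor, memgraphCursor uint64) {
	var gap uint64
	if redisCursor > memgraphCursor {
		gap = redisCursor - memgraphCursor
	}
	if gap > w.threshold {
		w.breaches++
		if w.breaches >= w.needTicks {
			w.paused.Store(true)
		}
		return
	}
	// Assumed resume policy: one healthy sample clears the pause.
	w.breaches = 0
	w.paused.Store(false)
}

// IngestPaused is what a consumer loop would check before each read.
func (w *Watchdog) IngestPaused() bool { return w.paused.Load() }

func main() {
	w := &Watchdog{threshold: 100, needTicks: 3}
	// Memgraph is stuck at 1000 while Redis keeps advancing.
	for _, redis := range []uint64{1150, 1200, 1250} {
		w.Tick(redis, 1000)
		fmt.Println(w.IngestPaused())
	}
	w.Tick(1250, 1240) // graph-writer caught up
	fmt.Println(w.IngestPaused())
}
```

The output is `false`, `false`, `true`, `false`: the pause only trips on the third consecutive breach and clears once the gap closes. Requiring several consecutive ticks keeps a single slow commit from pausing ingest.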

🔗 The same idea, third time

This is Lesson 7's backpressure applied at the write side: when the downstream (graph-writer) can't keep up, the upstream (stream consumer) pauses — which, via L7, pauses ingest, which pauses RPC. One flow-control philosophy runs end to end: slow down, don't drop, don't crash.

⑤ Dead consumer leaves blocks in-flight — reclaim

A pod dies with entries still in its PEL (Lesson 7). Those blocks are stranded until another consumer takes them over with XAUTOCLAIM — the literal cause of the "99 stuck blocks across two dead indexer pods" war story.

Recovery: XAUTOCLAIM reassigns idle PEL entries to a live consumer; they reprocess idempotently.
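To see what the reclaim does without a running Redis, here is an in-memory model of a pending entries list. `PEL`, `pending`, and `AutoClaim` are hypothetical names that only imitate the observable effect of XAUTOCLAIM (entries idle for at least a minimum time change owner and have their idle time reset); this is not the client code in pkg/queue/queue.go.

```go
package main

import (
	"fmt"
	"sort"
)

// pending is one PEL entry: its current owner and how long it has been idle.
type pending struct {
	consumer string
	idleMs   int64 // milliseconds since last delivery
}

// PEL is a hypothetical in-memory model of a consumer group's pending
// entries list, keyed by stream entry ID.
type PEL map[string]pending

// AutoClaim reassigns every entry idle for at least minIdleMs to newOwner,
// resets its idle time, and returns the claimed IDs in sorted order.
func (p PEL) AutoClaim(newOwner string, minIdleMs int64) []string {
	var claimed []string
	for id, e := range p {
		if e.idleMs >= minIdleMs {
			p[id] = pending{consumer: newOwner, idleMs: 0}
			claimed = append(claimed, id)
		}
	}
	sort.Strings(claimed)
	return claimed
}

func main() {
	pel := PEL{
		"100-0": {consumer: "dead-pod", idleMs: 600_000},
		"101-0": {consumer: "dead-pod", idleMs: 540_000},
		"102-0": {consumer: "live-pod", idleMs: 2_000}, // still being worked on
	}
	fmt.Println(pel.AutoClaim("live-pod", 60_000))
	fmt.Println(pel["100-0"].consumer, pel["102-0"].idleMs)
}
```

This prints `[100-0 101-0]` and then `live-pod 2000`: the dead pod's entries move to the live consumer, while the entry that is still actively in flight is left alone. Because reprocessing is idempotent, a min-idle time that is slightly too aggressive costs duplicate work, not correctness.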

4 · The pattern behind all of it

Step back and the five recoveries are one design:

Determinism → lost/corrupt derived state is recomputable, never irreparable.
One source of truth (Memgraph) → caches (Redis) are disposable and rebuildable.
Atomic cursor + at-least-once + idempotency → crashes and redelivery are no-ops.
Prevention over repair (finality lag) and pause over crash (watchdog) → avoid the hard cases, degrade gracefully when you can't.

That's why a system that runs forever against reorgs, crashes, and flushes still produces a correct graph.

Check yourself

1. Why can lost Redis state (monitored set, balances, cursor) be recovered with no data loss?

2. How does the system handle chain reorgs?

3. The "Block-determinism invariant" states that…

4. Redis cursor vs Memgraph cursor — what's the difference the watchdog cares about?

5. When the cursor gap exceeds threshold, the watchdog makes the indexer…

6. Why must the balance reconciler run before the cursor self-heal on startup?

7. The balance rebuild skips implausible HOLDS.quantity_raw rows (uint256 wrap-around) because…

↳ Ask your teacher

Try: "Show me the real cursor watchdog sampling logic," · "What's the DLQ + halt protocol in the graph-writer?" · "Walk me through ReconcileBalancesIfStale line by line," · "How are Memgraph backups handled (the one truth that isn't rebuildable)?"

What you can now do

State the Block-determinism invariant and explain why it makes recovery a re-derivation, not a repair.
Classify each store as source-of-truth (Memgraph) or rebuildable cache (Redis).
Walk through the five failure modes — reorg, crash, Redis flush, cursor divergence, dead consumer — and their recovery.
Explain the two-cursor model, the watchdog's graceful pause, and the startup ordering + plausibility-gate subtleties.
See backpressure as one philosophy running end to end: slow down, don't drop, don't crash.

The correctness story, complete

Cursor (L4) → atomic commit (L4) → at-least-once (L7) → idempotency (L4) → determinism + one source of truth + prevention/pause (L8). Eight lessons in, you can now argue why this system is correct, not just how it runs. That's a deep understanding most contributors never reach.


Grounded in: cmd/indexer/cursor_watchdog.go (determinism invariant, two-cursor model, paused state, FORTA-2660/2642), cmd/indexer/main.go (self-heal sequence + ordering), pkg/feed/rpc.go (finality lag / safeHead), pkg/genesis/balance_rebuild.go (ReconcileBalancesIfStale, plausibility gate), pkg/indexer/indexer.go (ingestPaused short-circuit), pkg/queue/queue.go (XAUTOCLAIM). Verify against source — the code is the truth.