Lesson 08 · Failure Modes & Recovery

What happens when things go wrong

Reorgs, crashes, lost caches, diverging cursors — and why none of them corrupt the graph. ~13 min.

Builds on: L4 · L7 | Anchor: reorgs & finality | The deep payoff | New: source-of-truth vs cache

A realtime indexer runs forever against an adversarial world: chains reorg, pods crash mid-write, Redis gets flushed, the graph-writer stalls. The remarkable thing about this system is that essentially every failure recovers automatically, and none of them corrupt the graph. This lesson explains why, and the explanation rests on a single idea you've been circling since Lesson 4.

1 · The invariant that makes recovery possible

Straight from the codebase (cmd/indexer/cursor_watchdog.go), the load-bearing rule:

The Block-determinism invariant
State at BlockCursor = N is a pure function of blocks 0..N. Replay the same blocks → get the same graph, byte for byte. (You earned this in Lesson 4: coalesced state + MERGE + the monotonic guard. Lesson 7 then made redelivery safe.)

Why is this the key to recovery? Because if the state is a deterministic function of the blocks, then any lost or corrupted derived state can simply be recomputed by replaying blocks — recovery is never a delicate repair, it's a re-derivation.

2 · Source of truth vs. rebuildable cache

The second pillar: the system is explicit about what is canonical and what is disposable.

Store | Role | If lost…
Memgraph | Source of truth: the graph + the applied BlockCursor | This is the real problem (mitigated by DB backups/replication, out of scope here).
Redis (:6379) | Rebuildable cache: monitored set, balances, cursor | Re-derived from Memgraph on startup (§3, ③). No data loss.
redis-cache (:6380) | RPC cache + block stream | RPC cache refills from RPC; the stream is replayable from the cursor.

The mental model in one line
Memgraph is truth; Redis is a fast projection of truth. Anything in Redis can be thrown away and rebuilt from the graph. That single asymmetry is what makes most of the recovery code possible.

3 · The failure catalog

① Chain reorg — prevented, not repaired

A reorg would orphan blocks the indexer already processed — state would silently diverge from the canonical chain. The defense is refreshingly blunt: don't ingest blocks that could still reorg. The feed exposes only a safe head, lagging the chain tip by a finality margin:

// pkg/feed/rpc.go — safeHead
return head - f.finalityLag   // only expose blocks this far behind the tip

The comment notes ~64 blocks for L1: "no realistic reorg lives that long." Set it to 0 and you ingest head-of-chain — and "state silently diverges from the canonical chain after a reorg."
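A slightly fuller sketch of the same idea, under assumptions: the Feed type here is a hypothetical stand-in, and block numbers are assumed to be unsigned. With unsigned arithmetic, head - finalityLag wraps around on a chain shorter than the lag, so this version reports "no safe block yet" instead.

```go
package main

import "fmt"

// Feed is a hypothetical stand-in for the real feed type in pkg/feed.
type Feed struct {
	finalityLag uint64 // how many blocks to stay behind the chain tip
}

// safeHead returns the highest block considered safe to ingest and reports
// whether one exists yet. The guard avoids unsigned wrap-around when the
// chain is still shorter than the finality lag.
func (f *Feed) safeHead(head uint64) (uint64, bool) {
	if head < f.finalityLag {
		return 0, false
	}
	return head - f.finalityLag, true
}

func main() {
	f := &Feed{finalityLag: 64}

	safe, ok := f.safeHead(1000)
	fmt.Println(safe, ok) // 936 true

	safe, ok = f.safeHead(10)
	fmt.Println(safe, ok) // 0 false
}
```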

Recovery: none needed, because the failure is designed out. Finality lag trades a little latency for never having to unwind a reorg. (A simple, robust choice over complex rewind logic.)

Your EVM instinct, validated
You already know reorgs are shallow and finality is probabilistic. The system encodes exactly that knowledge as a constant. There's no clever reorg-rollback engine because, by waiting for finality, there's nothing to roll back. Deep-systems lesson: the best handling of a hard failure mode is often to arrange never to hit it.

② Indexer crash mid-batch — self-healing by design

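A pod dies halfway through writing a block's mutations. Two mechanisms you already know combine: at-least-once redelivery (Lesson 7) hands the un-ACKed block to a consumer again, and the idempotent apply path (Lesson 4) makes a second application harmless.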

Recovery: automatic. The redelivered block re-applies idempotently (the invariant), landing the system in the exact state it would have reached without the crash. No operator action.
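A minimal in-memory sketch of why redelivery is safe. Graph, Mutation, and Commit are hypothetical names, and a real store would wrap the body of Commit in a database transaction so that a crash leaves either everything or nothing. Mutations carry absolute values (coalesced state), and a monotonic guard on the cursor skips blocks that were already applied.

```go
package main

import "fmt"

// Graph is a hypothetical in-memory stand-in for the graph store.
type Graph struct {
	Cursor   uint64
	Balances map[string]int64
}

// Mutation sets a holder's balance to an absolute value (coalesced state),
// so applying it twice gives the same result as applying it once.
type Mutation struct {
	Holder  string
	Balance int64
}

// Commit applies one block's mutations and advances the cursor together.
// The monotonic guard turns a redelivered block into a no-op.
func (g *Graph) Commit(block uint64, muts []Mutation) (applied bool) {
	if block <= g.Cursor {
		return false
	}
	for _, m := range muts {
		g.Balances[m.Holder] = m.Balance
	}
	g.Cursor = block
	return true
}

func main() {
	g := &Graph{Balances: map[string]int64{}}
	muts := []Mutation{{"alice", 10}}

	fmt.Println(g.Commit(7, muts)) // true: first delivery applies

	// The pod dies before ACKing; the stream redelivers block 7.
	fmt.Println(g.Commit(7, muts)) // false: the guard skips it

	fmt.Println(g.Cursor, g.Balances["alice"]) // 7 10
}
```

If the crash had happened before the commit instead, the redelivered block would simply apply for the first time. Either way the end state is the same.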

③ Redis flushed / restarted — rebuilt from the graph

Redis loses the monitored set, balances, and its cursor. Because all of it is derivable from Memgraph, startup runs a sequence of self-heal steps (cmd/indexer/main.go):

// re-populate Redis state from Neo4j if lost (e.g. Redis flush/restart)
// 1. monitored set  ← all node ids in the graph
// 2. module set     ← from graph nodes
// 3. balance cache  ← HOLDS.quantity_raw (the canonical balances)
// 4. block cursor   ← Memgraph BlockCursor

Recovery: the graph re-seeds Redis. HOLDS.quantity_raw is the canonical balance, the monitored set is just "every node," and the cursor lives in Memgraph too.

A subtle ordering dependency worth noticing (this is the kind of detail you came for): the balance reconciler must run before the cursor self-heal. If you healed the cursor first, the Redis and Memgraph cursors would match — masking the drift signal — and the balance rebuild would be skipped even though the cache was empty. Order is load-bearing.
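The ordering hazard can be reproduced in a few lines. This is a sketch under assumptions: the types are hypothetical, and the staleness check is modeled as a plain cursor comparison, which is what the paragraph above implies but not necessarily how ReconcileBalancesIfStale is written.

```go
package main

import "fmt"

// Hypothetical stand-ins carrying only the fields needed to show the ordering.
type Graph struct {
	Cursor   uint64
	Balances map[string]int64 // stands in for HOLDS.quantity_raw
}

type Cache struct {
	Cursor   uint64
	Balances map[string]int64
}

// reconcileBalancesIfStale rebuilds the balance cache only when the cursors
// disagree; that disagreement is the drift signal. It reports whether it rebuilt.
func reconcileBalancesIfStale(g *Graph, c *Cache) bool {
	if c.Cursor == g.Cursor {
		return false
	}
	c.Balances = make(map[string]int64, len(g.Balances))
	for k, v := range g.Balances {
		c.Balances[k] = v
	}
	return true
}

// healCursor copies the applied cursor from the graph into the cache.
func healCursor(g *Graph, c *Cache) { c.Cursor = g.Cursor }

func main() {
	g := &Graph{Cursor: 100, Balances: map[string]int64{"alice": 10}}

	right := &Cache{} // freshly flushed
	rebuilt := reconcileBalancesIfStale(g, right)
	healCursor(g, right)
	fmt.Println(rebuilt, len(right.Balances)) // true 1

	wrong := &Cache{}
	healCursor(g, wrong) // cursors now match, so the drift signal is gone
	rebuilt = reconcileBalancesIfStale(g, wrong)
	fmt.Println(rebuilt, len(wrong.Balances)) // false 0
}
```

The second run ends with a healthy-looking cursor and an empty balance cache, which is exactly the silent failure the ordering prevents.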

Plausibility gate (don't rebuild from garbage)
The balance rebuild refuses implausible HOLDS.quantity_raw rows (uint256 wrap-around) rather than copy them back into Redis — a guard echoing Lesson 2's "single position > $1T = bug." Self-healing must not launder corruption. (The next real Transfer re-derives the correct balance anyway.)

④ Cursor divergence (graph-writer stalls) — the watchdog & graceful pause

This is the most interesting one, and it needs a refinement to your Lesson 4 model. There are really two cursors:

Cursor | Means | Advanced by
Redis BlockCursor | "queued for apply" | indexer, immediately on publishing a batch
Memgraph BlockCursor | "actually applied" | graph-writer, when it commits the batch tx

Normally they track closely. But if the graph-writer halts (e.g. a DLQ/halt event), Redis keeps racing ahead while Memgraph stays put. The gap is dangerous: the indexer's in-process accumulators pile up state against a graph that has received none of it.

The cursor watchdog samples that gap. When it exceeds the threshold for N consecutive ticks, it flips a shared watchdogState to paused, and the stream-consumer loop reads that flag before every ReadBatch and short-circuits:

// pkg/indexer/indexer.go — Phase 2 consumer loop
if idx.ingestPaused() {   // watchdog flipped us to paused
    // short-circuit before ReadBatch — stop pulling new blocks
    continue
}

Recovery: in-process backpressure. The indexer stops consuming until Memgraph catches up, then resumes. Crucially (FORTA-2660), this replaced the old "log.Fatal → exit → let Kubernetes restart the pod" approach: a graceful pause beats a crash-loop.

🔗 The same idea, third time
This is Lesson 7's backpressure applied at the write side: when the downstream (graph-writer) can't keep up, the upstream (stream consumer) pauses — which, via L7, pauses ingest, which pauses RPC. One flow-control philosophy runs end to end: slow down, don't drop, don't crash.
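The watchdog's sampling rule, "gap over threshold for N consecutive ticks," can be sketched as follows. Everything here is hypothetical (type, field names, thresholds); it is not the code in cursor_watchdog.go, only one plausible shape for it. The flag is an atomic.Bool so the consumer loop can read it from another goroutine.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Watchdog is a hypothetical sketch; field names and thresholds are assumed.
type Watchdog struct {
	maxGap   uint64 // largest tolerated queued-minus-applied gap
	maxTicks int    // consecutive bad samples required before pausing
	bad      int    // current run of bad samples
	paused   atomic.Bool
}

// Sample records one tick. queued is the Redis cursor, applied the Memgraph one.
func (w *Watchdog) Sample(queued, applied uint64) {
	if queued > applied && queued-applied > w.maxGap {
		w.bad++
		if w.bad >= w.maxTicks {
			w.paused.Store(true)
		}
		return
	}
	w.bad = 0
	w.paused.Store(false) // gap is back within bounds: resume ingest
}

// Paused is what the consumer loop would check before each read.
func (w *Watchdog) Paused() bool { return w.paused.Load() }

func main() {
	w := &Watchdog{maxGap: 50, maxTicks: 3}

	for i := 0; i < 3; i++ {
		w.Sample(200, 100) // gap of 100, three ticks in a row
	}
	fmt.Println(w.Paused()) // true

	w.Sample(200, 190) // graph-writer caught up
	fmt.Println(w.Paused()) // false
}
```

Requiring consecutive bad ticks keeps a single slow commit from pausing ingest.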

⑤ Dead consumer leaves blocks in-flight — reclaim

A pod dies with entries still in its PEL (Lesson 7). Those blocks are stranded until another consumer takes them over with XAUTOCLAIM — the literal cause of the "99 stuck blocks across two dead indexer pods" war story.

Recovery: XAUTOCLAIM reassigns idle PEL entries to a live consumer; they reprocess idempotently.

4 · The pattern behind all of it

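Step back and the five recoveries are one design: state is a deterministic function of blocks, Memgraph is the single source of truth, and hard failures are prevented or paused rather than repaired.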

That's why a system that runs forever against reorgs, crashes, and flushes still produces a correct graph.

Check yourself

1. Why can lost Redis state (monitored set, balances, cursor) be recovered with no data loss?
2. How does the system handle chain reorgs?
3. The "Block-determinism invariant" states that…
4. Redis cursor vs Memgraph cursor — what's the difference the watchdog cares about?
5. When the cursor gap exceeds threshold, the watchdog makes the indexer…
6. Why must the balance reconciler run before the cursor self-heal on startup?
7. The balance rebuild skips implausible HOLDS.quantity_raw rows (uint256 wrap-around) because…
↳ Ask your teacher
Try: "Show me the real cursor watchdog sampling logic" · "What's the DLQ + halt protocol in the graph-writer?" · "Walk me through ReconcileBalancesIfStale line by line" · "How are Memgraph backups handled (the one truth that isn't rebuildable)?"

What you can now do

The correctness story, complete
Cursor (L4) → atomic commit (L4) → at-least-once (L7) → idempotency (L4) → determinism + one source of truth + prevention/pause (L8). Eight lessons in, you can now argue why this system is correct, not just how it runs. That's a deep understanding most contributors never reach.

Grounded in: cmd/indexer/cursor_watchdog.go (determinism invariant, two-cursor model, paused state, FORTA-2660/2642), cmd/indexer/main.go (self-heal sequence + ordering), pkg/feed/rpc.go (finality lag / safeHead), pkg/genesis/balance_rebuild.go (ReconcileBalancesIfStale, plausibility gate), pkg/indexer/indexer.go (ingestPaused short-circuit), pkg/queue/queue.go (XAUTOCLAIM). Verify against source — the code is the truth.