Reorgs, crashes, lost caches, diverging cursors — and why none of them corrupt the graph. ~13 min.
A realtime indexer runs forever against an adversarial world: chains reorg, pods crash mid-write, Redis gets flushed, the graph-writer stalls. The remarkable thing about this system is that essentially every failure recovers automatically, and none of them corrupt the graph. This lesson is why. It rests on a single idea you've been circling since Lesson 4.
Straight from the codebase (cmd/indexer/cursor_watchdog.go), the load-bearing rule:
BlockCursor = N is a pure function of blocks 0..N.
Replay the same blocks → get the same graph, byte for byte. (You earned this in Lesson 4: coalesced state + MERGE + the monotonic guard. Lesson 7 then made redelivery safe.)

Why is this the key to recovery? Because if the state is a deterministic function of the blocks, then any lost or corrupted derived state can simply be recomputed by replaying blocks. Recovery is never a delicate repair; it's a re-derivation.
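To make the invariant concrete, here is a minimal sketch of state as a pure fold over blocks, with a monotonic guard that turns a redelivered block into a no-op. The `Transfer`, `Block`, `State`, and `Replay` names are hypothetical simplifications for illustration, not the indexer's real types.

```go
package main

import "fmt"

// Transfer is a hypothetical, simplified event: move Amount from From to To.
type Transfer struct {
	From, To string
	Amount   int64
}

// Block is a hypothetical block: a height plus the transfers it contains.
type Block struct {
	Number    uint64
	Transfers []Transfer
}

// State is the derived state: balances plus the cursor of the last applied block.
type State struct {
	Balances map[string]int64
	Cursor   uint64
	Started  bool // false until the first block has been applied
}

// Apply folds one block into the state. The monotonic guard makes an
// already-applied block a no-op, so redelivery cannot double-count.
func (s *State) Apply(b Block) {
	if s.Started && b.Number <= s.Cursor {
		return // already applied; ignore the redelivery
	}
	for _, t := range b.Transfers {
		s.Balances[t.From] -= t.Amount
		s.Balances[t.To] += t.Amount
	}
	s.Cursor = b.Number
	s.Started = true
}

// Replay derives state from scratch; the result depends only on the blocks.
func Replay(blocks []Block) *State {
	s := &State{Balances: map[string]int64{}}
	for _, b := range blocks {
		s.Apply(b)
	}
	return s
}

func main() {
	blocks := []Block{
		{Number: 0, Transfers: []Transfer{{"mint", "alice", 100}}},
		{Number: 1, Transfers: []Transfer{{"alice", "bob", 30}}},
	}
	clean := Replay(blocks)
	// Simulate a redelivery of block 1: the guard drops the duplicate.
	redelivered := Replay(append(blocks, blocks[1]))
	fmt.Println(clean.Balances["alice"], clean.Balances["bob"], clean.Cursor)
	fmt.Println(redelivered.Balances["alice"], redelivered.Balances["bob"], redelivered.Cursor)
}
```

Both runs end in the same balances and the same cursor, which is the property recovery leans on: throwing away derived state and replaying costs time, not correctness.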
The second pillar: the system is explicit about what is canonical and what is disposable.
| Store | Role | If lost… |
|---|---|---|
| Memgraph | Source of truth — the graph + the applied BlockCursor | This is the real problem (mitigated by DB backups/replication, out of scope here). |
| Redis (:6379) | Rebuildable cache — monitored set, balances, cursor | Re-derived from Memgraph on startup (§4). No data loss. |
| redis-cache (:6380) | RPC cache + block stream | RPC cache refills from RPC; the stream is replayable from the cursor. |
A reorg would orphan blocks the indexer already processed — state would silently diverge from the canonical chain. The defense is refreshingly blunt: don't ingest blocks that could still reorg. The feed exposes only a safe head, lagging the chain tip by a finality margin:
    // pkg/feed/rpc.go: safeHead
    return head - f.finalityLag // only expose blocks this far behind the tip
The comment notes ~64 blocks for L1: "no realistic reorg lives that long." Set it to 0 and you ingest head-of-chain — and "state silently diverges from the canonical chain after a reorg."
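A standalone sketch of the safe-head calculation follows. The free function `safeHead`, its `(uint64, bool)` return shape, and the underflow guard are assumptions made for illustration; the real method lives on the feed in pkg/feed/rpc.go and may handle a short chain differently.

```go
package main

import "fmt"

// safeHead returns the highest block number considered safe to ingest:
// the chain tip minus a finality lag. The second result is false when
// the chain is still shorter than the lag, so nothing is safe yet.
func safeHead(head, finalityLag uint64) (uint64, bool) {
	if head < finalityLag {
		// Guard against unsigned underflow: head - finalityLag would wrap
		// around to a huge block number instead of going negative.
		return 0, false
	}
	return head - finalityLag, true
}

func main() {
	if safe, ok := safeHead(1_000, 64); ok {
		fmt.Println("ingest up to block", safe)
	}
	if _, ok := safeHead(10, 64); !ok {
		fmt.Println("chain shorter than the finality lag; wait")
	}
}
```

With a lag of 0 the function returns the tip itself, which is exactly the head-of-chain configuration the comment warns about.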
A pod dies halfway through writing a block's mutations. Two mechanisms you already know combine:
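Idempotent writes (Lesson 4) and at-least-once delivery (Lesson 7): a block that was never XACKed is redelivered and reprocessed, and reapplying it yields the same graph.

Redis loses the monitored set, balances, and its cursor. Because all of it is derivable from Memgraph, startup runs a sequence of self-heal steps (cmd/indexer/main.go):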
    // re-populate Redis state from Neo4j if lost (e.g. Redis flush/restart)
    // 1. monitored set ← all node ids in the graph
    // 2. module set    ← from graph nodes
    // 3. balance cache ← HOLDS.quantity_raw (the canonical balances)
    // 4. block cursor  ← Memgraph BlockCursor
HOLDS.quantity_raw is the canonical balance, the monitored set is just "every node," and the cursor lives in Memgraph too.

A subtle ordering dependency worth noticing (this is the kind of detail you came for): the balance reconciler must run before the cursor self-heal. If you healed the cursor first, the Redis and Memgraph cursors would match, masking the drift signal, and the balance rebuild would be skipped even though the cache was empty. Order is load-bearing.
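The ordering dependency can be reproduced in a few lines. This is a sketch under stated assumptions: the `cache` and `graph` types, `healCursor`, and the lower-case `reconcileBalancesIfStale` are hypothetical in-memory stand-ins (not the real ReconcileBalancesIfStale in pkg/genesis/balance_rebuild.go), and cursor mismatch is taken to be the only staleness signal.

```go
package main

import "fmt"

// cache is a hypothetical stand-in for the Redis-side state.
type cache struct {
	cursor    uint64
	hasCursor bool // false after a flush: the cursor key is gone
	balances  map[string]string
}

// graph is a hypothetical stand-in for the Memgraph source of truth.
type graph struct {
	cursor   uint64
	balances map[string]string // HOLDS.quantity_raw per holder
}

// reconcileBalancesIfStale rebuilds the balance cache only when the cursors
// disagree, because cursor drift is the staleness signal in this sketch.
// It reports whether a rebuild happened.
func reconcileBalancesIfStale(c *cache, g *graph) bool {
	if c.hasCursor && c.cursor == g.cursor {
		return false // cursors match: the cache is assumed fresh
	}
	c.balances = make(map[string]string, len(g.balances))
	for holder, raw := range g.balances {
		c.balances[holder] = raw
	}
	return true
}

// healCursor copies the applied cursor from the graph into the cache.
func healCursor(c *cache, g *graph) {
	c.cursor, c.hasCursor = g.cursor, true
}

func main() {
	g := &graph{cursor: 100, balances: map[string]string{"alice": "70"}}

	// Correct order: reconcile balances first, then heal the cursor.
	right := &cache{} // freshly flushed
	rebuilt := reconcileBalancesIfStale(right, g)
	healCursor(right, g)
	fmt.Println("balances first:", rebuilt, len(right.balances))

	// Wrong order: healing the cursor first hides the drift signal.
	wrong := &cache{} // freshly flushed
	healCursor(wrong, g)
	rebuilt = reconcileBalancesIfStale(wrong, g)
	fmt.Println("cursor first:  ", rebuilt, len(wrong.balances))
}
```

In the wrong order the reconciler sees matching cursors and returns without rebuilding, leaving an empty balance cache that looks healthy.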
The balance rebuild also applies a plausibility gate: implausible HOLDS.quantity_raw rows (uint256 wrap-around) are left out rather than copied back into Redis, a guard echoing Lesson 2's "single position > $1T = bug." Self-healing must not launder corruption. (The next real Transfer re-derives the correct balance anyway.)

This is the most interesting one, and it needs a refinement to your Lesson 4 model. There are really two cursors:
| Cursor | Means | Advanced by |
|---|---|---|
| Redis BlockCursor | "queued for apply" | indexer, immediately on publishing a batch |
| Memgraph BlockCursor | "actually applied" | graph-writer, when it commits the batch tx |
Normally they track closely. But if the graph-writer halts (e.g. a DLQ/halt event), Redis keeps racing ahead while Memgraph stays put. The gap is dangerous: the indexer's in-process accumulators pile up state against a graph that has received none of it.
The cursor watchdog samples that gap. When it exceeds the threshold for N consecutive ticks,
it flips a shared watchdogState to paused, and the stream-consumer loop reads
that flag before every ReadBatch and short-circuits:
    // pkg/indexer/indexer.go: Phase 2 consumer loop
    if idx.ingestPaused() { // watchdog flipped us to paused
        // short-circuit before ReadBatch: stop pulling new blocks
        continue
    }
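The watchdog side of that handshake can be sketched as a small state machine. Everything here is an assumption made for illustration: the `watchdog` struct, its field names, the example threshold values, and in particular the choice to resume automatically once the gap closes, which the real cmd/indexer/cursor_watchdog.go may not do.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// watchdog is a hypothetical sketch of the cursor watchdog's shared state.
type watchdog struct {
	threshold uint64      // largest tolerated gap, in blocks
	needTicks int         // consecutive over-threshold ticks before pausing
	overTicks int         // touched only by the watchdog's own goroutine
	paused    atomic.Bool // read by the consumer loop before each ReadBatch
}

// tick samples the gap between the "queued" and "applied" cursors.
func (w *watchdog) tick(redisCursor, graphCursor uint64) {
	var gap uint64
	if redisCursor > graphCursor { // avoid unsigned underflow
		gap = redisCursor - graphCursor
	}
	if gap > w.threshold {
		w.overTicks++
		if w.overTicks >= w.needTicks {
			w.paused.Store(true)
		}
		return
	}
	// Assumption: a healthy sample resets the streak and resumes ingest.
	w.overTicks = 0
	w.paused.Store(false)
}

// ingestPaused is what the consumer loop would check before pulling blocks.
func (w *watchdog) ingestPaused() bool { return w.paused.Load() }

func main() {
	w := &watchdog{threshold: 10, needTicks: 3}
	w.tick(120, 100) // gap 20: streak 1
	w.tick(125, 100) // gap 25: streak 2
	fmt.Println("after two bad ticks:", w.ingestPaused())
	w.tick(130, 100) // gap 30: streak 3, pause
	fmt.Println("after three bad ticks:", w.ingestPaused())
	w.tick(130, 130) // graph-writer caught up
	fmt.Println("after catch-up:", w.ingestPaused())
}
```

Requiring several consecutive bad samples keeps a single slow commit from pausing ingest, while the atomic flag lets the consumer loop read the decision without taking a lock.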
A pod dies with entries still in its PEL (Lesson 7). Those blocks are stranded until another consumer
takes them over with XAUTOCLAIM — the literal cause of the "99 stuck blocks across two dead
indexer pods" war story.
XAUTOCLAIM reassigns idle PEL entries to a live consumer; they reprocess idempotently.

Step back and the five recoveries are one design:
That's why a system that runs forever against reorgs, crashes, and flushes still produces a correct graph.
Grounded in: cmd/indexer/cursor_watchdog.go (determinism invariant, two-cursor model, paused state, FORTA-2660/2642),
cmd/indexer/main.go (self-heal sequence + ordering), pkg/feed/rpc.go (finality lag / safeHead),
pkg/genesis/balance_rebuild.go (ReconcileBalancesIfStale, plausibility gate), pkg/indexer/indexer.go (ingestPaused short-circuit),
pkg/queue/queue.go (XAUTOCLAIM). Verify against source — the code is the truth.