Every mechanism you've learned has a concrete signal. This is how you see them. ~12 min.
You've learned the invariants (determinism, atomicity, idempotency), the failure modes (reorg, crash, divergence, backpressure), and the topology (one writer, two cursors, lanes). All of that is invisible in production unless it's instrumented. This final phase-1 lesson is the lens: the three pillars of observability, and a map from every concept in this course to the exact signal you'd watch.
logrus here). "What exactly happened at 14:03?" The detail you reach for after a metric or trace points you somewhere.
Metrics tell you something is wrong; traces tell you where; logs tell you what. You've seen all three already without naming them: the L8 watchdog logs, the L9 backlog gauge, the L7 backpressure counter.
Every binary is instrumented with OpenTelemetry (vendor-neutral). Services emit traces and metrics over OTLP to a collector, which fans them out: traces to Jaeger, metrics to Prometheus.
Source: otel-collector-config.yaml (OTLP in → Jaeger + Prometheus out), pkg/telemetry/telemetry.go, tracers.go (one named tracer per package: feed, decoder, indexer, enrichment, risk…).
Three metric shapes, and you'll recognise each subsystem's metrics instantly because they're named after the mechanisms you learned:
| Shape | Means | Example here |
|---|---|---|
| Counter | Monotonic count (rate matters) | blockscout.api.calls, at_risk.cycle.errors |
| Gauge | Value that goes up & down | block_ingest.backpressure.blocked, graphwrite_stream_backlog |
| Histogram | Distribution (p50/p95/p99) | at_risk.cycle.duration, blockscout.api.duration |
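
To make the three shapes concrete, here is a minimal stdlib-only sketch of their semantics. It is not the OpenTelemetry API and not the code in pkg/telemetry; the types `Counter`, `Gauge`, and `Histogram` and the nearest-rank `Quantile` method are hypothetical names chosen for illustration. Real histograms bucket observations rather than storing every sample.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Counter only ever goes up; dashboards usually plot its rate of change.
type Counter struct{ total uint64 }

func (c *Counter) Add(n uint64)  { c.total += n }
func (c *Counter) Value() uint64 { return c.total }

// Gauge is a point-in-time value that can rise and fall.
type Gauge struct{ value int64 }

func (g *Gauge) Set(v int64)  { g.value = v }
func (g *Gauge) Value() int64 { return g.value }

// Histogram keeps observations so percentiles can be read back.
// (Illustrative only: production histograms use fixed buckets.)
type Histogram struct{ samples []float64 }

func (h *Histogram) Observe(v float64) { h.samples = append(h.samples, v) }

// Quantile returns the nearest-rank q-quantile for 0 < q <= 1,
// or 0 when nothing has been observed.
func (h *Histogram) Quantile(q float64) float64 {
	n := len(h.samples)
	if n == 0 {
		return 0
	}
	sorted := append([]float64(nil), h.samples...) // copy, then sort the copy
	sort.Float64s(sorted)
	rank := int(math.Ceil(q * float64(n)))
	if rank < 1 {
		rank = 1
	}
	if rank > n {
		rank = n
	}
	return sorted[rank-1]
}

func main() {
	calls := &Counter{}
	backlog := &Gauge{}
	latency := &Histogram{}

	for i := 1; i <= 10; i++ {
		calls.Add(1)
		latency.Observe(float64(i * 10)) // 10ms, 20ms, ... 100ms
	}
	backlog.Set(1200) // backlog spikes...
	backlog.Set(40)   // ...and drains again; a counter could not express this

	fmt.Println("calls:", calls.Value())
	fmt.Println("backlog:", backlog.Value())
	fmt.Println("p50 ms:", latency.Quantile(0.5))
}
```

The point of the sketch is the difference in what each shape can answer: a counter answers "how many, how fast", a gauge answers "how much right now", and a histogram answers "how is it distributed".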

Each concept from the course maps to a signal you can watch:

| Concept (lesson) | Metric / signal |
|---|---|
| Backpressure (L7) | block_ingest.backpressure.blocked / .waits |
| Two cursors & divergence (L8) | Redis vs Memgraph BlockCursor gap (watchdog gauge) |
| Self-healing reconcilers (L8) | chainref_reconcile.drift.total / .heal.total |
| Single-writer backlog & lanes (L9) | graphwrite_stream_backlog{chain_id} |
| Enrichment external APIs & breakers (L5) | blockscout.api.{calls,errors,duration}, breaker.state |
| Risk engine cycles (L6) | at_risk.cycle.duration, at_risk.edges, at_risk.tokens.stamped |
| Bootstrap task DAG (L10) | bootstrap.task.attempts / .duration |
Look at blockscout.api.{calls,errors,duration} in the table: Rate (calls), Errors (errors), Duration (duration). That's the RED method, a standard recipe for instrumenting any request-driven component. Once you see it, you can read the enrichment-apis dashboard at a glance.
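
As a rough illustration of the RED recipe, the sketch below wraps any call and records the three signals. It uses only the standard library; `REDStats`, `Track`, `ErrorRatio`, and `errUpstream` are hypothetical names, not identifiers from the repo, and a real implementation would feed a counter and a histogram instead of plain fields.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUpstream stands in for a failed external API call (hypothetical).
var errUpstream = errors.New("upstream returned 502")

// REDStats holds the three RED signals for one request-driven component.
type REDStats struct {
	Calls     uint64          // Rate: calls/sec is derived from how fast this grows
	Errors    uint64          // Errors: calls that returned a non-nil error
	Durations []time.Duration // Duration: one sample per call
}

// Track runs fn, then records one call, its duration, and whether it failed.
func (s *REDStats) Track(fn func() error) error {
	start := time.Now()
	err := fn()
	s.Calls++
	s.Durations = append(s.Durations, time.Since(start))
	if err != nil {
		s.Errors++
	}
	return err
}

// ErrorRatio is errors divided by calls, or 0 when nothing has been called.
func (s *REDStats) ErrorRatio() float64 {
	if s.Calls == 0 {
		return 0
	}
	return float64(s.Errors) / float64(s.Calls)
}

func main() {
	stats := &REDStats{}
	_ = stats.Track(func() error { return nil })         // a successful call
	_ = stats.Track(func() error { return errUpstream }) // a failed call
	fmt.Printf("calls=%d errors=%d ratio=%.2f\n",
		stats.Calls, stats.Errors, stats.ErrorRatio())
}
```

Wrapping at the call site keeps the three signals in lockstep: every call is counted exactly once, whether it succeeds or fails.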
A metric says "stream backlog is rising." A trace says "this block took 4s, and 3.8s of it
was in at_risk.writeback." Each instrumented operation opens a span (e.g.
indexer.accumulate, enrichment.rpc.classify, risk.exposure.cycle); spans nest into a tree.
```go
// producer (block-ingest): write side
telemetry.InjectIntoStreamValues(ctx, values) // trace id rides inside the XADD

// consumer (indexer): read side
ctx = telemetry.ExtractFromStreamValues(parent, values) // re-parent to the producer's span
```
So in Jaeger you see one trace:
block-ingest fetch → (queue) → indexer decode → … → graph-writer apply,
even though it crossed two process boundaries and a message broker. The StreamCarrier makes the Redis
Stream trace-transparent. This is exactly the L7 backbone, now observable end to end.
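
To show the idea of carrying trace context inside stream values, here is a stdlib-only sketch. It is an assumption-laden stand-in, not the repo's StreamCarrier: `valuesCarrier`, `SpanContext`, `Inject`, and `Extract` are hypothetical, and the real helpers take a `context.Context` rather than a plain struct. The field layout follows the W3C `traceparent` format (`version-traceid-spanid-flags`).

```go
package main

import (
	"fmt"
	"strings"
)

// valuesCarrier adapts the field map of a stream entry so trace context
// can be written to and read from it (hypothetical stand-in).
type valuesCarrier map[string]interface{}

func (c valuesCarrier) Set(key, value string) { c[key] = value }

// Get returns the field as a string, or "" if it is missing or not a string.
func (c valuesCarrier) Get(key string) string {
	s, _ := c[key].(string)
	return s
}

// SpanContext is the minimum that has to cross the stream.
type SpanContext struct {
	TraceID string // 32 hex characters
	SpanID  string // 16 hex characters
}

const traceparentKey = "traceparent"

// Inject writes a W3C-style traceparent field next to the payload fields.
func Inject(sc SpanContext, values map[string]interface{}) {
	valuesCarrier(values).Set(traceparentKey,
		fmt.Sprintf("00-%s-%s-01", sc.TraceID, sc.SpanID))
}

// Extract reads the traceparent back; ok is false if it is absent or malformed.
func Extract(values map[string]interface{}) (sc SpanContext, ok bool) {
	parts := strings.Split(valuesCarrier(values).Get(traceparentKey), "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return SpanContext{}, false
	}
	return SpanContext{TraceID: parts[1], SpanID: parts[2]}, true
}

func main() {
	// Producer side: payload fields plus the injected trace context.
	values := map[string]interface{}{"block": "19000000"}
	Inject(SpanContext{
		TraceID: "4bf92f3577b34da6a3ce929d0e0e4736",
		SpanID:  "00f067aa0ba902b7",
	}, values)

	// Consumer side: recover the producer's span so the new span can re-parent.
	parent, ok := Extract(values)
	fmt.Println(ok, parent.TraceID, parent.SpanID)
}
```

Because the context travels as an ordinary field of the entry, the broker needs no tracing support of its own.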
debug_traceBlockByNumber (an EVM execution trace) — that's a different "trace." Here, a
distributed trace is like following one transaction's journey, except across services instead of
opcodes. Same instinct (follow one thing through a system), different layer.
Recall the indexer reads blocks in batches (L7, XReadGroup Count). A consumer span that covers a whole batch can't have a single parent, because each message in the batch was published under its own span. So the code uses span links: LinksFromBatchHeaders attaches one
link per message to its publisher's span, preserving the N:1 producer→consumer relationship instead of forcing a
single parent. A small detail that shows how carefully the streaming model and the tracing model were made to agree.
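
The batching case can be sketched the same way. The code below is a hypothetical illustration of the "one link per message" idea, not the repo's LinksFromBatchHeaders: `Message`, `Link`, and `linksFromBatch` are invented names, and it assumes each message carries a W3C-style `traceparent` field in its values.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is one entry read from the stream (hypothetical shape).
type Message struct {
	ID     string
	Values map[string]interface{}
}

// Link points from the consumer's batch span to one producer span.
type Link struct {
	TraceID string
	SpanID  string
}

// linksFromBatch returns one Link per message that carries a well-formed
// traceparent. Messages without one are skipped rather than failing the batch.
func linksFromBatch(batch []Message) []Link {
	links := make([]Link, 0, len(batch))
	for _, msg := range batch {
		raw, _ := msg.Values["traceparent"].(string)
		parts := strings.Split(raw, "-")
		if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
			continue // no usable producer context on this message
		}
		links = append(links, Link{TraceID: parts[1], SpanID: parts[2]})
	}
	return links
}

func main() {
	batch := []Message{
		{ID: "1-0", Values: map[string]interface{}{
			"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}},
		{ID: "1-1", Values: map[string]interface{}{
			"traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"}},
		{ID: "1-2", Values: map[string]interface{}{"block": "19000002"}}, // untraced
	}
	// The batch span would be started with these links instead of one parent.
	for _, l := range linksFromBatch(batch) {
		fmt.Println(l.TraceID, l.SpanID)
	}
}
```

Skipping untraced messages is a design choice in this sketch: a missing header degrades the trace but never blocks processing.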
The metrics are assembled into six Grafana dashboards (in grafana/), each a view of one subsystem you now know cold:
| Dashboard | Watches (lesson) |
|---|---|
| overview | block lag, throughput, latency, errors — one-pane health |
| block-pipeline | ingestion, decoding, HOLDS updates, promotions (L1–4·L7) |
| enrichment-apis | worker + Etherscan/Blockscout/DefiLlama RED + breakers (L5) |
| risk-engine | cycle progress, throughput, AT_RISK/DebtRank latency (L6) |
| infrastructure | RPC, graph writes, reconciler, query-api (L8·L9) |
| graph-parity | compares two graph_id partitions over Bolt (L2·L6·L10) |
The graph-parity board is special: it speaks Bolt directly to Memgraph (Neo4j plugin) and diffs
two partitions — e.g. risk-graph-rt vs a test_carlos clone (L10). It's how the team enforces
the L6 parity requirement operationally.
graphwrite_stream_backlog warns at >1000 for 5m,
critical at >10000 for 5m. That alert is the human-facing edge of the watchdog story — it fires before the
cursor gap grows large enough that a crash would need a long balance-cache rebuild (L8). The repo also ships
Prometheus rule files for at-risk and oracle pricing (docs/*_prometheus_rules.yaml).
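
The "for 5m" part of those thresholds means the condition must hold continuously, not just once. The sketch below models that in plain Go under stated assumptions: it is not the repo's rule file and not Prometheus code; `Sample`, `firing`, `Evaluate`, and `minuteSeries` are hypothetical, and samples are assumed to arrive in time order, one per evaluation.

```go
package main

import (
	"fmt"
	"time"
)

type Severity int

const (
	OK Severity = iota
	Warning
	Critical
)

func (s Severity) String() string {
	return [...]string{"ok", "warning", "critical"}[s]
}

// Sample is one evaluation of the backlog gauge.
type Sample struct {
	At      time.Time
	Backlog float64
}

// firing reports whether the backlog has stayed above threshold in an
// unbroken run ending at the latest sample and lasting at least holdFor.
func firing(samples []Sample, threshold float64, holdFor time.Duration) bool {
	if len(samples) == 0 {
		return false
	}
	last := samples[len(samples)-1]
	runStart := last.At
	i := len(samples) - 1
	for ; i >= 0 && samples[i].Backlog > threshold; i-- {
		runStart = samples[i].At // walk back to where the run began
	}
	if i == len(samples)-1 {
		return false // the latest sample is not above the threshold
	}
	return last.At.Sub(runStart) >= holdFor
}

// Evaluate applies the two thresholds from the text: >1000 and >10000, each for 5m.
func Evaluate(samples []Sample) Severity {
	switch {
	case firing(samples, 10000, 5*time.Minute):
		return Critical
	case firing(samples, 1000, 5*time.Minute):
		return Warning
	default:
		return OK
	}
}

// minuteSeries builds samples one minute apart from a fixed start time.
func minuteSeries(backlogs ...float64) []Sample {
	start := time.Date(2024, 1, 1, 14, 0, 0, 0, time.UTC)
	out := make([]Sample, len(backlogs))
	for i, b := range backlogs {
		out[i] = Sample{At: start.Add(time.Duration(i) * time.Minute), Backlog: b}
	}
	return out
}

func main() {
	steady := minuteSeries(1500, 1500, 1500, 1500, 1500, 1500)
	blip := minuteSeries(1500, 1500, 900, 1500, 1500, 1500)
	fmt.Println(Evaluate(steady), Evaluate(blip))
}
```

A single dip below the threshold resets the run, which is why a brief backlog spike does not page anyone.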
graphwrite_stream_backlog is which metric shape, and which lesson's mechanism does it watch?
StreamCarrier, and span links for batched reads.
Grounded in: otel-collector-config.yaml (OTLP→Jaeger+Prometheus), pkg/telemetry/{telemetry,tracers,meters,propagation}.go
(named tracers, metric names, StreamCarrier inject/extract, LinksFromBatchHeaders), grafana/*.json (six dashboards),
docs/*_prometheus_rules.yaml (alert rules), README.md (dashboard table). Verify against source — the code is the truth.