Lesson 11 · Observability · Phase-1 Capstone

Watching the machine run

Every mechanism you've learned emits a concrete signal. This lesson shows you where to look for each one. ~12 min.

Builds on: L7 · L8 · L9 | The capstone | New: metrics · traces · logs | New: distributed tracing

You've learned the invariants (determinism, atomicity, idempotency), the failure modes (reorg, crash, divergence, backpressure), and the topology (one writer, two cursors, lanes). All of that is invisible in production unless it's instrumented. This final phase-1 lesson is the lens: the three pillars of observability, and a map from every concept in this course to the exact signal you'd watch.

1 · The three pillars

📊 Metrics

Aggregate numbers over time — counters, gauges, histograms. "How many? How fast? How backed up?" Cheap, always-on, great for alerts & dashboards. → Prometheus.

🔍 Traces

The path of one request across services, as a tree of timed spans. "Where did this block spend its time / where did it fail?" → Jaeger.

📝 Logs

Structured event records (logrus here). "What exactly happened at 14:03?" The detail you reach for after a metric or trace points you somewhere.

Metrics tell you something is wrong; traces tell you where; logs tell you what. You've already met some of these signals without naming them: the L8 watchdog logs, the L9 backlog gauge, and the L7 backpressure counter.

2 · The telemetry pipeline (OTel)

Every binary is instrumented with OpenTelemetry (vendor-neutral). Services emit traces+metrics over OTLP to a collector, which fans them out:

binaries (OTel SDK) → OTLP gRPC :4317 → otel-collector (routes)
    → Jaeger (traces · :16686)
    → Prometheus (metrics · :9091 · :8889)

Source: otel-collector-config.yaml (OTLP in → Jaeger + Prometheus out), pkg/telemetry/telemetry.go, tracers.go (one named tracer per package: feed, decoder, indexer, enrichment, risk…).

3 · Metrics: the signal map

Three metric shapes, and you'll recognise each subsystem's metrics instantly because they're named after the mechanisms you learned:

Shape | Means | Example here
Counter | Monotonic count (rate matters) | blockscout.api.calls, at_risk.cycle.errors
Gauge | Value that goes up & down | block_ingest.backpressure.blocked, graphwrite_stream_backlog
Histogram | Distribution (p50/p95/p99) | at_risk.cycle.duration, blockscout.api.duration
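
To make the three shapes concrete, here is a minimal stdlib-only sketch. The types Counter, Gauge, and Histogram and the bucket boundaries are hypothetical stand-ins chosen for illustration; they are not the OpenTelemetry instruments the repo registers in pkg/telemetry, and the percentile estimate is deliberately coarse.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Counter only ever goes up; dashboards usually plot its rate of change.
type Counter struct{ n int64 }

// Add ignores non-positive deltas so the value stays monotonic.
func (c *Counter) Add(delta int64) {
	if delta > 0 {
		c.n += delta
	}
}

func (c *Counter) Value() int64 { return c.n }

// Gauge holds the most recent value and may rise or fall.
type Gauge struct{ v int64 }

func (g *Gauge) Set(v int64)  { g.v = v }
func (g *Gauge) Value() int64 { return g.v }

// Histogram counts observations per bucket. Bounds are inclusive upper
// limits in ascending order; the final slot of counts is the overflow bucket.
type Histogram struct {
	bounds []float64
	counts []int64
	total  int64
}

func NewHistogram(bounds ...float64) *Histogram {
	return &Histogram{bounds: bounds, counts: make([]int64, len(bounds)+1)}
}

func (h *Histogram) Observe(v float64) {
	i := sort.SearchFloat64s(h.bounds, v) // index of the first bound >= v
	h.counts[i]++
	h.total++
}

// QuantileBound returns the upper bound of the bucket holding the q-quantile
// (0 < q <= 1). This is a coarse estimate; real backends interpolate.
func (h *Histogram) QuantileBound(q float64) (float64, bool) {
	if h.total == 0 {
		return 0, false
	}
	rank := int64(math.Ceil(q * float64(h.total)))
	if rank < 1 {
		rank = 1
	}
	var seen int64
	for i, c := range h.counts {
		seen += c
		if seen >= rank {
			if i == len(h.bounds) {
				return math.Inf(1), true
			}
			return h.bounds[i], true
		}
	}
	return math.Inf(1), true
}

func main() {
	calls := &Counter{}                     // same shape as blockscout.api.calls
	backlog := &Gauge{}                     // same shape as graphwrite_stream_backlog
	latency := NewHistogram(0.1, 0.5, 1, 5) // seconds; same shape as at_risk.cycle.duration

	for _, secs := range []float64{0.05, 0.2, 0.3, 0.4, 4.0} {
		calls.Add(1)
		latency.Observe(secs)
	}
	backlog.Set(1200)
	backlog.Set(300) // a gauge can fall again

	p50, _ := latency.QuantileBound(0.50)
	p95, _ := latency.QuantileBound(0.95)
	fmt.Println(calls.Value(), backlog.Value(), p50, p95)
}
```

The point of the sketch is the difference in what each shape can answer: the counter only supports "how many, how fast", the gauge only "how much right now", and the histogram is the only one that can say anything about p50 or p95.
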

⭐ The map: every lesson → a signal you can watch
This table is the capstone of phase 1: proof that the abstractions are real, measured things.

Concept (lesson) | Metric / signal
Backpressure (L7) | block_ingest.backpressure.blocked / .waits
Two cursors & divergence (L8) | Redis vs Memgraph BlockCursor gap (watchdog gauge)
Self-healing reconcilers (L8) | chainref_reconcile.drift.total / .heal.total
Single-writer backlog & lanes (L9) | graphwrite_stream_backlog{chain_id}
Enrichment external APIs & breakers (L5) | blockscout.api.{calls,errors,duration}, breaker.state
Risk engine cycles (L6) | at_risk.cycle.duration, at_risk.edges, at_risk.tokens.stamped
Bootstrap task DAG (L10) | bootstrap.task.attempts / .duration

A reusable pattern: RED
Notice the external-API metrics come in threes — Rate (calls), Errors (errors), Duration (duration). That's the RED method, a standard recipe for instrumenting any request-driven component. Once you see it, you can read the enrichment-apis dashboard at a glance.
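
As a rough illustration of the RED recipe, the sketch below wraps any call so that all three signals are recorded in one place. The RED type, its Observe method, and errUpstream are hypothetical names invented for this example; the repo records the equivalent values through OpenTelemetry instruments rather than a struct like this.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUpstream simulates a failed external API call (hypothetical).
var errUpstream = errors.New("upstream returned 502")

// RED bundles the three request-level signals for one dependency.
type RED struct {
	Calls     int64           // Rate: total calls; the rate is derived over time
	Errors    int64           // Errors: calls that returned a non-nil error
	Durations []time.Duration // Duration: one sample per call
}

// Observe runs fn, times it, and records the outcome.
func (r *RED) Observe(fn func() error) error {
	start := time.Now()
	err := fn()
	r.Calls++
	if err != nil {
		r.Errors++
	}
	r.Durations = append(r.Durations, time.Since(start))
	return err
}

// ErrorRatio is errors divided by calls, or 0 when nothing was observed.
func (r *RED) ErrorRatio() float64 {
	if r.Calls == 0 {
		return 0
	}
	return float64(r.Errors) / float64(r.Calls)
}

func main() {
	var blockscout RED
	_ = blockscout.Observe(func() error { return nil })
	_ = blockscout.Observe(func() error { return errUpstream })
	fmt.Printf("calls=%d errors=%d ratio=%.2f samples=%d\n",
		blockscout.Calls, blockscout.Errors, blockscout.ErrorRatio(), len(blockscout.Durations))
}
```

Wrapping the call site once means a new external API gets all three signals by construction, which is why the calls / errors / duration trio shows up so uniformly.
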

4 · Traces: one block, all three binaries

A metric says "stream backlog is rising." A trace says "this block took 4s, and 3.8s of it was in at_risk.writeback." Each instrumented operation opens a span (e.g. indexer.accumulate, enrichment.rpc.classify, risk.exposure.cycle); spans nest into a tree.

⭐⭐ The deep one: traces hop through the Redis stream
The binaries are separate processes — yet a single block's trace spans all of them. How? The trace context is injected into the stream message itself and re-extracted on the other side (pkg/telemetry/propagation.go):
// producer (block-ingest) — write side
telemetry.InjectIntoStreamValues(ctx, values)  // trace id rides inside the XADD
// consumer (indexer) — read side
ctx = telemetry.ExtractFromStreamValues(parent, values) // re-parent to the producer's span
So in Jaeger you see one trace: block-ingest fetch → (queue) → indexer decode → … → graph-writer apply, even though it crossed two process boundaries and a message broker. The StreamCarrier makes the Redis Stream trace-transparent. This is exactly the L7 backbone, now observable end to end.

feed.fetch_block ▸ block-ingest
indexer.accumulate ▸ indexer
decoder.decode_block
graphwrite apply ▸ graph-writer
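
The inject and extract steps described above can be sketched without any tracing library. This is an assumption-laden illustration, not the contents of pkg/telemetry/propagation.go: it assumes the context travels as a W3C Trace Context "traceparent" field inside the stream values, and SpanContext, Span, Inject, Extract, and StartChild are hypothetical stand-ins for the OpenTelemetry types and the repo's helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// SpanContext is a stripped-down stand-in for a tracing span context.
type SpanContext struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters
}

// Span records just enough to show parent/child wiring.
type Span struct {
	Name, TraceID, SpanID, ParentSpanID string
}

// traceparentKey is the field name defined by W3C Trace Context.
const traceparentKey = "traceparent"

// Inject writes sc into the values that the producer sends with XADD.
// Format: version-traceid-spanid-flags, where flags 01 means "sampled".
func Inject(sc SpanContext, values map[string]any) {
	values[traceparentKey] = "00-" + sc.TraceID + "-" + sc.SpanID + "-01"
}

// Extract recovers the producer's span context on the consumer side.
// It reports false when the field is missing or malformed.
func Extract(values map[string]any) (SpanContext, bool) {
	raw, ok := values[traceparentKey].(string)
	if !ok {
		return SpanContext{}, false
	}
	parts := strings.Split(raw, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return SpanContext{}, false
	}
	return SpanContext{TraceID: parts[1], SpanID: parts[2]}, true
}

// StartChild opens a consumer span under the extracted parent:
// same trace ID, a new span ID, and the producer span as parent.
func StartChild(name string, parent SpanContext, newSpanID string) Span {
	return Span{Name: name, TraceID: parent.TraceID, SpanID: newSpanID, ParentSpanID: parent.SpanID}
}

func main() {
	// Producer side (block-ingest): attach the context to the message.
	producer := SpanContext{TraceID: "4bf92f3577b34da6a3ce929d0e0e4736", SpanID: "00f067aa0ba902b7"}
	values := map[string]any{"block_number": "19000000"} // illustrative payload
	Inject(producer, values)

	// Consumer side (indexer): read it back and continue the same trace.
	parent, ok := Extract(values)
	if !ok {
		fmt.Println("no trace context; a new root trace would start here")
		return
	}
	child := StartChild("indexer.accumulate", parent, "b7ad6b7169203331")
	fmt.Printf("%s trace=%s parent=%s\n", child.Name, child.TraceID, child.ParentSpanID)
}
```

Because the consumer reuses the producer's trace ID, the tracing backend stitches both spans into one trace even though nothing but a stream message connected the two processes.
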
Your anchor — but mind the two meanings of "trace"
You know debug_traceBlockByNumber (an EVM execution trace) — that's a different "trace." Here, a distributed trace is like following one transaction's journey, except across services instead of opcodes. Same instinct (follow one thing through a system), different layer.

Span links for batched reads (L7 callback)

Recall that the indexer reads blocks in batches (L7, XReadGroup Count). A consumer span that processes a whole batch has many publishers behind it, so it cannot name a single parent. Instead the code uses span links: LinksFromBatchHeaders attaches one link per message to its publisher's span, preserving the many-to-one relationship between producer spans and the consumer span instead of forcing a single parent. A small detail that shows how carefully the streaming model and the tracing model were made to agree.
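
A minimal sketch of the link-per-message idea follows. It is not the repo's LinksFromBatchHeaders: Message, Link, and LinksFromBatch are hypothetical names, and the sketch assumes each message carries a W3C "traceparent" field, which may differ from what the real carrier writes.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is a stand-in for one Redis Stream entry read in a batch.
type Message struct {
	ID     string
	Values map[string]any
}

// Link references the producer span that published one message.
type Link struct{ TraceID, SpanID string }

// parseTraceparent reads a W3C traceparent value: version-traceid-spanid-flags.
func parseTraceparent(raw string) (Link, bool) {
	parts := strings.Split(raw, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return Link{}, false
	}
	return Link{TraceID: parts[1], SpanID: parts[2]}, true
}

// LinksFromBatch builds one link per message that carries a valid trace
// context. Messages without one are skipped rather than failing the batch.
func LinksFromBatch(batch []Message) []Link {
	links := make([]Link, 0, len(batch))
	for _, m := range batch {
		raw, ok := m.Values["traceparent"].(string)
		if !ok {
			continue
		}
		if l, ok := parseTraceparent(raw); ok {
			links = append(links, l)
		}
	}
	return links
}

func main() {
	batch := []Message{
		{ID: "1-0", Values: map[string]any{"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}},
		{ID: "2-0", Values: map[string]any{"block_number": "19000001"}}, // no context attached
		{ID: "3-0", Values: map[string]any{"traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"}},
	}
	// The consumer opens ONE span for the batch and attaches these links,
	// so each producer trace can still be reached from it.
	for _, l := range LinksFromBatch(batch) {
		fmt.Printf("link -> trace=%s span=%s\n", l.TraceID, l.SpanID)
	}
}
```

Links keep every producer trace reachable from the batch span without pretending that one of them is "the" parent, which is the property the paragraph above relies on.
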

5 · Dashboards & alerts

The metrics are assembled into six Grafana dashboards (in grafana/), each a view of one subsystem you now know cold:

Dashboard | Watches (lesson)
overview | block lag, throughput, latency, errors; one-pane health
block-pipeline | ingestion, decoding, HOLDS updates, promotions (L1–4·L7)
enrichment-apis | worker + Etherscan/Blockscout/DefiLlama RED + breakers (L5)
risk-engine | cycle progress, throughput, AT_RISK/DebtRank latency (L6)
infrastructure | RPC, graph writes, reconciler, query-api (L8·L9)
graph-parity | compares two graph_id partitions over Bolt (L2·L6·L10)

The graph-parity board is special: it speaks Bolt directly to Memgraph (Neo4j plugin) and diffs two partitions — e.g. risk-graph-rt vs a test_carlos clone (L10). It's how the team enforces the L6 parity requirement operationally.

Alerts close the loop you opened in L9
Metrics become alerts at thresholds. From L9: graphwrite_stream_backlog warns at >1000 for 5m, critical at >10000 for 5m. That alert is the human-facing edge of the watchdog story — it fires before the cursor gap grows large enough that a crash would need a long balance-cache rebuild (L8). The repo also ships Prometheus rule files for at-risk and oracle pricing (docs/*_prometheus_rules.yaml).
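
The "for 5m" part of those thresholds matters: a single spike should not page anyone. The sketch below models that behaviour in plain Go, using the two thresholds quoted above. It illustrates the semantics only and assumes nothing about the repo's actual rule files; Sample, at, and Evaluate are hypothetical names, and a real deployment would express this as a Prometheus alerting rule rather than application code.

```go
package main

import (
	"fmt"
	"time"
)

const (
	warnThreshold = 1000  // from the lesson: warn at > 1000
	critThreshold = 10000 // from the lesson: critical at > 10000
	holdFor       = 5 * time.Minute
)

// Sample is one reading of the backlog gauge, At an offset from the first reading.
type Sample struct {
	At      time.Duration
	Backlog int64
}

// at builds a Sample from whole minutes to keep call sites short.
func at(minutes int, backlog int64) Sample {
	return Sample{At: time.Duration(minutes) * time.Minute, Backlog: backlog}
}

// track updates "condition has been true since" bookkeeping for one threshold.
// Any reading that fails the condition resets the timer.
func track(cond, active bool, since, now time.Duration) (bool, time.Duration) {
	if !cond {
		return false, 0
	}
	if !active {
		return true, now
	}
	return true, since
}

// Evaluate returns the alert state after the last sample: a level fires only
// when its condition has held continuously for at least holdFor.
func Evaluate(samples []Sample) string {
	var warnSince, critSince time.Duration
	warnActive, critActive := false, false
	state := "ok"
	for _, s := range samples {
		warnActive, warnSince = track(s.Backlog > warnThreshold, warnActive, warnSince, s.At)
		critActive, critSince = track(s.Backlog > critThreshold, critActive, critSince, s.At)
		switch {
		case critActive && s.At-critSince >= holdFor:
			state = "critical"
		case warnActive && s.At-warnSince >= holdFor:
			state = "warning"
		default:
			state = "ok"
		}
	}
	return state
}

func main() {
	spike := []Sample{at(0, 1500), at(2, 900), at(4, 1500), at(7, 1500)}
	sustained := []Sample{at(0, 500), at(1, 1500), at(3, 2000), at(6, 3000)}
	fmt.Println("spike with a dip:", Evaluate(spike))
	fmt.Println("sustained backlog:", Evaluate(sustained))
}
```

The dip to 900 in the first series resets the timer, so a backlog that drains on its own never reaches the pager, while a backlog that stays high for the full window does.
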

Check yourself

1. You get paged: "block lag rising." Which pillar tells you where a slow block spent its time?
2. graphwrite_stream_backlog is which metric shape, and which lesson's mechanism does it watch?
3. How does one Jaeger trace span block-ingest, indexer, and graph-writer — three separate processes?
4. The enrichment-apis metrics come as calls / errors / duration. That trio is…
5. Why does the consumer use span links (not a single parent) when it reads a batch of blocks?
6. The graph-parity dashboard is special because it…
7. Roughly, what's the right order of tools when debugging a production issue here?
8. Why is OpenTelemetry used instead of writing straight to Prometheus/Jaeger?
↳ Ask your teacher
Try: "Show me a real span being opened in indexer code," · "How would I find the slowest block in the last hour in Jaeger?" · "Walk me through the overview dashboard's panels," · "What PromQL would compute the cursor gap?"

What you can now do

🎓 Phase 1 complete — you understand the whole system
Eleven lessons: the three-binary pipeline (L1) · the graph data model (L2) · decoding (L3) · the write path (L4) · enrichment & discovery (L5) · the risk engine (L6) · the streaming backbone (L7) · failure & recovery (L8) · the single-writer topology (L9) · bootstrap (L10) · and now observability (L11). You can trace a block from RPC to a risk number, argue why the system stays correct under failure, explain the disaster that shaped its write topology, and name the signal that watches each part. That's a genuinely deep, end-to-end mental model.

Grounded in: otel-collector-config.yaml (OTLP→Jaeger+Prometheus), pkg/telemetry/{telemetry,tracers,meters,propagation}.go (named tracers, metric names, StreamCarrier inject/extract, LinksFromBatchHeaders), grafana/*.json (six dashboards), docs/*_prometheus_rules.yaml (alert rules), README.md (dashboard table). Verify against source — the code is the truth.