Lesson 11 · Observability · Phase-1 Capstone

Watching the machine run

Every mechanism you've learned has a concrete signal. This is how you see them. ~12 min.

Builds on: L7 · L8 · L9 | The capstone | New: metrics · traces · logs | New: distributed tracing

You've learned the invariants (determinism, atomicity, idempotency), the failure modes (reorg, crash, divergence, backpressure), and the topology (one writer, two cursors, lanes). All of that is invisible in production unless it's instrumented. This final phase-1 lesson is the lens: the three pillars of observability, and a map from every concept in this course to the exact signal you'd watch.

1 · The three pillars

📊 Metrics

Aggregate numbers over time — counters, gauges, histograms. "How many? How fast? How backed up?" Cheap, always-on, great for alerts & dashboards. → Prometheus.

🔍 Traces

The path of one request across services, as a tree of timed spans. "Where did this block spend its time / where did it fail?" → Jaeger.

📝 Logs

Structured event records (logrus here). "What exactly happened at 14:03?" The detail you reach for after a metric or trace points you somewhere.

Metrics tell you something is wrong; traces tell you where; logs tell you what. You've already met several of these signals without naming them: the L8 watchdog logs, the L9 backlog gauge, the L7 backpressure counter.

2 · The telemetry pipeline (OTel)

Every binary is instrumented with OpenTelemetry (vendor-neutral). Services emit traces+metrics over OTLP to a collector, which fans them out:

binaries (OTel SDK) → OTLP gRPC :4317 → otel-collector (routes)
    → Jaeger (traces · :16686)
    ↘ Prometheus (metrics · :9091 · :8889)

Source: otel-collector-config.yaml (OTLP in → Jaeger + Prometheus out), pkg/telemetry/telemetry.go, tracers.go (one named tracer per package: feed, decoder, indexer, enrichment, risk…).

3 · Metrics: the signal map

There are three metric shapes. Each subsystem's metrics are easy to recognise because they are named after the mechanisms you learned:

Shape	Means	Example here
Counter	Monotonic count (rate matters)	`blockscout.api.calls`, `at_risk.cycle.errors`
Gauge	Value that goes up & down	`block_ingest.backpressure.blocked`, `graphwrite_stream_backlog`
Histogram	Distribution (p50/p95/p99)	`at_risk.cycle.duration`, `blockscout.api.duration`
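To make the three shapes concrete, here is a minimal stdlib-only sketch of how each one behaves. The types Counter, Gauge and Histogram below are hypothetical toys, not the OpenTelemetry SDK's instruments, and the histogram keeps raw samples where a real SDK would keep bucket counts.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Counter only ever goes up; dashboards usually plot its rate.
type Counter struct{ total float64 }

// Add ignores non-positive deltas so the counter stays monotonic.
func (c *Counter) Add(delta float64) {
	if delta > 0 {
		c.total += delta
	}
}

// Gauge holds the latest value and may move in either direction.
type Gauge struct{ value float64 }

// Set overwrites the current value.
func (g *Gauge) Set(v float64) { g.value = v }

// Histogram records observations so percentiles can be read back.
type Histogram struct{ samples []float64 }

// Observe records one measurement.
func (h *Histogram) Observe(v float64) { h.samples = append(h.samples, v) }

// Percentile returns the nearest-rank percentile for 0 < p <= 100.
func (h *Histogram) Percentile(p float64) float64 {
	if len(h.samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), h.samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	var calls Counter // e.g. an API call count
	calls.Add(1)
	calls.Add(1)

	var backlog Gauge // e.g. a stream backlog that rises and drains
	backlog.Set(1200)
	backlog.Set(300)

	var duration Histogram // e.g. call latency in milliseconds
	for _, ms := range []float64{10, 20, 30, 40, 500} {
		duration.Observe(ms)
	}
	fmt.Println(calls.total, backlog.value, duration.Percentile(50), duration.Percentile(95))
}
```

The single slow sample barely moves the median but dominates p95, which is why latency is recorded as a histogram rather than as an average.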

⭐ The map: every lesson → a signal you can watch

This table is the capstone of phase 1 — proof that the abstractions are real, measured things:

Concept (lesson)	Metric / signal
Backpressure (L7)	`block_ingest.backpressure.blocked` / `.waits`
Two cursors & divergence (L8)	Redis vs Memgraph `BlockCursor` gap (watchdog gauge)
Self-healing reconcilers (L8)	`chainref_reconcile.drift.total` / `.heal.total`
Single-writer backlog & lanes (L9)	`graphwrite_stream_backlog{chain_id}`
Enrichment external APIs & breakers (L5)	`blockscout.api.{calls,errors,duration}`, `breaker.state`
Risk engine cycles (L6)	`at_risk.cycle.duration`, `at_risk.edges`, `at_risk.tokens.stamped`
Bootstrap task DAG (L10)	`bootstrap.task.attempts` / `.duration`

A reusable pattern: RED

Notice the external-API metrics come in threes — Rate (calls), Errors (errors), Duration (duration). That's the RED method, a standard recipe for instrumenting any request-driven component. Once you see it, you can read the enrichment-apis dashboard at a glance.

4 · Traces: one block, all three binaries

A metric says "stream backlog is rising." A trace says "this block took 4s, and 3.8s of it was in at_risk.writeback." Each instrumented operation opens a span (e.g. indexer.accumulate, enrichment.rpc.classify, risk.exposure.cycle); spans nest into a tree.

⭐⭐ The deep one: traces hop through the Redis stream

The binaries are separate processes — yet a single block's trace spans all of them. How? The trace context is injected into the stream message itself and re-extracted on the other side (pkg/telemetry/propagation.go):

// producer (block-ingest) — write side
telemetry.InjectIntoStreamValues(ctx, values)  // trace id rides inside the XADD
// consumer (indexer) — read side
ctx = telemetry.ExtractFromStreamValues(parent, values) // re-parent to the producer's span

So in Jaeger you see one trace: block-ingest fetch → (queue) → indexer decode → … → graph-writer apply, even though it crossed two process boundaries and a message broker. The StreamCarrier makes the Redis Stream trace-transparent. This is exactly the L7 backbone, now observable end to end.

Span waterfall for one block (span ▸ service):
feed.fetch_block ▸ block-ingest
indexer.accumulate ▸ indexer
decoder.decode_block
graphwrite apply ▸ graph-writer
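The inject/extract idea can be sketched without any tracing library. This is not the code in pkg/telemetry/propagation.go: the inject and extract functions, the SpanContext type and the choice of a "traceparent" field are assumptions for illustration, using the W3C Trace Context string layout (version, trace id, parent span id, flags).

```go
package main

import (
	"fmt"
	"strings"
)

// traceparentKey reuses the W3C Trace Context header name as a stream field.
// Which field name the real StreamCarrier uses is an assumption here.
const traceparentKey = "traceparent"

// SpanContext is a pared-down stand-in for a tracing library's span context.
type SpanContext struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters
}

// inject writes the span context into the values that would be passed to XADD.
func inject(sc SpanContext, values map[string]any) {
	// "00" is the format version, "01" marks the trace as sampled.
	values[traceparentKey] = fmt.Sprintf("00-%s-%s-01", sc.TraceID, sc.SpanID)
}

// extract reads the span context back out on the consumer side.
// ok is false when the field is missing or malformed.
func extract(values map[string]any) (sc SpanContext, ok bool) {
	raw, isString := values[traceparentKey].(string)
	if !isString {
		return SpanContext{}, false
	}
	parts := strings.Split(raw, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return SpanContext{}, false
	}
	return SpanContext{TraceID: parts[1], SpanID: parts[2]}, true
}

func main() {
	producer := SpanContext{
		TraceID: "4bf92f3577b34da6a3ce929d0e0e4736",
		SpanID:  "00f067aa0ba902b7",
	}
	// The payload field is made up; the trace context rides alongside it.
	values := map[string]any{"block_number": "19000000"}
	inject(producer, values)

	parent, ok := extract(values)
	fmt.Println(ok, parent.TraceID == producer.TraceID)
}
```

Because the context travels inside the message rather than in process memory, the consumer can continue the same trace even if it reads the entry much later or after a restart.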

Your anchor — but mind the two meanings of "trace"

You know debug_traceBlockByNumber (an EVM execution trace) — that's a different "trace." Here, a distributed trace is like following one transaction's journey, except across services instead of opcodes. Same instinct (follow one thing through a system), different layer.

Span links for batched reads (L7 callback)

Recall that the indexer reads blocks in batches (L7, XReadGroup Count). A consumer span that processes a batch cannot have a single parent, because each message in the batch may carry a different producer span's context. So the code uses span links: LinksFromBatchHeaders attaches one link per message to its publisher's span, preserving the relationship between many producer spans and one consumer span instead of forcing a single parent. A small detail that shows how carefully the streaming model and the tracing model were made to agree.
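The one-link-per-message idea can be sketched as follows. This is a stdlib-only illustration of the concept, not the repository's LinksFromBatchHeaders: the Message and Link types and the linksFromBatch helper are hypothetical, and the trace context is represented as an opaque string.

```go
package main

import "fmt"

// Message is one stream entry; Traceparent is the context its producer injected.
type Message struct {
	ID          string
	Traceparent string
}

// Link points from the consumer's batch span back to one producer span.
type Link struct {
	MessageID   string
	Traceparent string
}

// linksFromBatch returns one link per message that carries trace context.
func linksFromBatch(batch []Message) []Link {
	links := make([]Link, 0, len(batch))
	for _, m := range batch {
		if m.Traceparent == "" {
			continue // no producer context to link to
		}
		links = append(links, Link{MessageID: m.ID, Traceparent: m.Traceparent})
	}
	return links
}

func main() {
	// IDs and context strings are placeholders, not real stream data.
	batch := []Message{
		{ID: "1-0", Traceparent: "ctx-a"},
		{ID: "2-0", Traceparent: "ctx-b"},
		{ID: "3-0"}, // published without trace context
	}
	for _, l := range linksFromBatch(batch) {
		fmt.Println(l.MessageID, "->", l.Traceparent)
	}
}
```

The batch span then starts with these links attached, so each producer's trace can still reach the consumer's work without any one of them being picked as the parent.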

5 · Dashboards & alerts

The metrics are assembled into six Grafana dashboards (in grafana/), each a view of one subsystem you now know cold:

Dashboard	Watches (lesson)
overview	block lag, throughput, latency, errors — one-pane health
block-pipeline	ingestion, decoding, HOLDS updates, promotions (L1–4·L7)
enrichment-apis	worker + Etherscan/Blockscout/DefiLlama RED + breakers (L5)
risk-engine	cycle progress, throughput, AT_RISK/DebtRank latency (L6)
infrastructure	RPC, graph writes, reconciler, query-api (L8·L9)
graph-parity	compares two `graph_id` partitions over Bolt (L2·L6·L10)

The graph-parity board is special: it speaks Bolt directly to Memgraph (Neo4j plugin) and diffs two partitions — e.g. risk-graph-rt vs a test_carlos clone (L10). It's how the team enforces the L6 parity requirement operationally.

Alerts close the loop you opened in L9

Metrics become alerts at thresholds. From L9: graphwrite_stream_backlog warns at >1000 for 5m, critical at >10000 for 5m. That alert is the human-facing edge of the watchdog story — it fires before the cursor gap grows large enough that a crash would need a long balance-cache rebuild (L8). The repo also ships Prometheus rule files for at-risk and oracle pricing (docs/*_prometheus_rules.yaml).

Check yourself

1. You get paged: "block lag rising." Which pillar tells you where a slow block spent its time?

2. graphwrite_stream_backlog is which metric shape, and which lesson's mechanism does it watch?

3. How does one Jaeger trace span block-ingest, indexer, and graph-writer — three separate processes?

4. The enrichment-apis metrics come as calls / errors / duration. That trio is…

5. Why does the consumer use span links (not a single parent) when it reads a batch of blocks?

6. The graph-parity dashboard is special because it…

7. Roughly, what's the right order of tools when debugging a production issue here?

8. Why is OpenTelemetry used instead of writing straight to Prometheus/Jaeger?

↳ Ask your teacher

Try: "Show me a real span being opened in indexer code" · "How would I find the slowest block in the last hour in Jaeger?" · "Walk me through the overview dashboard's panels" · "What PromQL would compute the cursor gap?"

What you can now do

Name the three pillars (metrics / traces / logs) and when to reach for each.
Describe the OTel pipeline: binaries → collector → Jaeger (traces) + Prometheus (metrics).
Map any mechanism from this course to its concrete signal (the signal map).
Explain distributed tracing across the Redis stream via StreamCarrier, and span links for batched reads.
Read the six dashboards and the backlog alert as the operator-facing edge of L7–L10.

🎓 Phase 1 complete — you understand the whole system

Eleven lessons: the three-binary pipeline (L1) · the graph data model (L2) · decoding (L3) · the write path (L4) · enrichment & discovery (L5) · the risk engine (L6) · the streaming backbone (L7) · failure & recovery (L8) · the single-writer topology (L9) · bootstrap (L10) · and now observability (L11). You can trace a block from RPC to a risk number, argue why the system stays correct under failure, explain the disaster that shaped its write topology, and name the signal that watches each part. That's a genuinely deep, end-to-end mental model.


Grounded in: otel-collector-config.yaml (OTLP→Jaeger+Prometheus), pkg/telemetry/{telemetry,tracers,meters,propagation}.go (named tracers, metric names, StreamCarrier inject/extract, LinksFromBatchHeaders), grafana/*.json (six dashboards), docs/*_prometheus_rules.yaml (alert rules), README.md (dashboard table). Verify against source — the code is the truth.