Lesson 28 · Project Inference · Discovery Internals

Attributing a contract to a protocol

Stage 10: how a bare contract becomes "belongs to Aave" — and why it's deliberately a heuristic. ~12 min.

Builds on: L24 · L23 · L2 · Anchor: Aave aTokens, Curve crvUSD · New: precedence cascade · New: canonical-slug parity

The graph knows a contract is a vault, has an admin, holds a token. One thing it can't read on-chain: which protocol does this contract belong to? There's no protocol() getter. Yet grouping contracts by project ("these 40 nodes are all Aave") drives the admin panel, rule scoping, and cost attribution. Project inference is the small heuristic that derives it, and studying it is a lesson in knowing when a dumb string match is the right tool.

Your anchor: you name protocols by their contract names

You already read attribution off names instinctively: a contract called AToken or VariableDebtToken is Aave; MetaMorpho is Morpho; crvUSD is Curve; wstETH is Lido. Block explorers encode the same intuition in their nametags ("Aave: Pool V3"). Project inference just makes that human pattern-matching explicit and deterministic so every node gets the same answer the batch pipeline would give.

1 · Three signals, in a precedence cascade

InferProject (pkg/enrichment/project.go) consults three sources, in descending order of trust, and returns on the first hit:

Nametag — substring-matched against knownNametagPatterns. Checked first because explorer nametags are human-curated and the most reliable signal available. "Aave: Pool V3" → aave

Contract name — lowercased, underscores stripped, substring-matched against knownProjects. The fallback when no nametag exists, drawn from the verified source name. "VariableDebtToken" → aave

labels_slug — first pipe-delimited token (InferProjectFromSlug), used when the explorer offers a slug but no usable nametag or contract name. "curve|amm|stableswap" → curve-finance

All three are plain strings.Contains matches — no RPC, no graph traversal, no ABI parsing. It's the cheapest stage in the whole pipeline, which is precisely why it runs as a quick label pass rather than an on-chain probe.
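The cascade above can be sketched in a few lines of Go. This is a minimal illustration, not the code in pkg/enrichment/project.go: the pattern type, the two tiny tables, the canonical map, the lowercasing of the nametag, and the "labels_slug" source value are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// pattern pairs a lowercase substring with the slug it maps to.
// Hypothetical stand-in for the source's projectPattern.
type pattern struct{ substr, slug string }

// Tiny illustrative tables, not the real ~50-entry lists.
var nametagPatterns = []pattern{{"aave", "aave"}, {"curve", "curve-finance"}}
var namePatterns = []pattern{{"variabledebt", "aave"}, {"metamorpho", "morpho"}}

// canonical is an assumed alias table used only for the slug path here.
var canonical = map[string]string{"curve": "curve-finance"}

// firstMatch returns the slug of the first pattern contained in s, or "".
func firstMatch(s string, table []pattern) string {
	for _, p := range table {
		if strings.Contains(s, p.substr) {
			return p.slug
		}
	}
	return ""
}

// inferProject walks the three signals in trust order and stops at the first hit.
// It returns the slug and the name of the signal that produced it.
func inferProject(nametag, contractName, labelsSlug string) (slug, source string) {
	// 1. Nametag (lowercasing is an assumption of this sketch).
	if s := firstMatch(strings.ToLower(nametag), nametagPatterns); s != "" {
		return s, "nametag"
	}
	// 2. Contract name: lowercased, underscores stripped.
	name := strings.ReplaceAll(strings.ToLower(contractName), "_", "")
	if s := firstMatch(name, namePatterns); s != "" {
		return s, "contract_name"
	}
	// 3. labels_slug: first pipe-delimited token, mapped to its canonical slug.
	first, _, _ := strings.Cut(labelsSlug, "|")
	if first == "" {
		return "", ""
	}
	if c, ok := canonical[first]; ok {
		first = c
	}
	return first, "labels_slug"
}

func main() {
	fmt.Println(inferProject("Aave: Pool V3", "Pool", ""))
	fmt.Println(inferProject("", "Variable_Debt_Token", ""))
	fmt.Println(inferProject("", "", "curve|amm|stableswap"))
}
```

Each branch returns as soon as it has an answer, so a usable nametag means the contract name and slug are never consulted.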

2 · Specificity ordering — the correctness trick

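Substring matching has an obvious trap: one string can match several patterns at once. A nametag for a crvUSD contract can contain both "crvusd" and "curve", and a name like "MetaMorpho" contains the generic "morpho". Whenever a specific pattern and a generic one resolve to different slugs, the result depends on which is tested first. The fix is purely structural: the pattern lists are ordered most-specific-first, and the first match wins: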

var knownProjects = []projectPattern{
    {"metamorpho", "morpho"},      // before the generic "morpho"
    {"morpho",     "morpho"},
    {"atoken",     "aave"},        // before the generic "aave"
    {"variabledebt", "aave"},
    {"crvusd", "curve-finance"},  // before "curve"
    {"curve",  "curve-finance"},
    // …~50 patterns total
}

This is the same "order is load-bearing" discipline you saw in L19's cap pipeline and L20's cell dedup — here it's what keeps a specific token from being swallowed by its protocol's generic prefix.
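To see why the order is load-bearing, here is a small sketch with two hypothetical patterns ("alphavault" and "alpha", invented for this example so that the specific and generic entries map to different slugs). It also includes a simple lint, shadowed, that flags any later pattern an earlier one would always beat; whether the real codebase has such a check is not stated in this lesson.

```go
package main

import (
	"fmt"
	"strings"
)

// pattern is a hypothetical stand-in for the source's projectPattern.
type pattern struct{ substr, slug string }

// Two orderings of the same invented patterns.
var specificFirst = []pattern{{"alphavault", "alpha-vaults"}, {"alpha", "alpha"}}
var genericFirst = []pattern{{"alpha", "alpha"}, {"alphavault", "alpha-vaults"}}

// firstMatch returns the slug of the first pattern contained in s, or "".
func firstMatch(s string, table []pattern) string {
	for _, p := range table {
		if strings.Contains(s, p.substr) {
			return p.slug
		}
	}
	return ""
}

// shadowed reports later patterns that can never win: if an earlier pattern
// is contained in a later one, every string matching the later pattern
// also matches the earlier one first.
func shadowed(table []pattern) []string {
	var out []string
	for i, early := range table {
		for _, late := range table[i+1:] {
			if strings.Contains(late.substr, early.substr) {
				out = append(out, late.substr+" is shadowed by "+early.substr)
			}
		}
	}
	return out
}

func main() {
	fmt.Println(firstMatch("alphavaultv2", specificFirst))
	fmt.Println(firstMatch("alphavaultv2", genericFirst))
	fmt.Println(shadowed(specificFirst), shadowed(genericFirst))
}
```

With the specific pattern first, the input resolves to the specific slug; with the generic pattern first, the specific entry is dead code and the lint says so.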

3 · Canonical slugs — this is a parity constraint

Every inferred value is piped through NormalizeLabelAPIProject before it's returned. Why does that matter? Because the realtime indexer and the Python batch pipeline both write a project field, and they must produce the byte-identical slug or the parity harness (L23) flags a mismatch:

// the patterns already encode canonical slugs; normalization is defence-in-depth
var labelAPIProjectNorm = map[string]string{ // excerpt; map shape assumed, check source
    "sushi": "sushiswap", "curve": "curve-finance", "maker": "sky", "eigencloud": "eigenlayer", // …
}

Two things the slug table quietly encodes

Rebrands: "maker" → "sky" and "dai" → "sky". MakerDAO became Sky, and the canonical slug carries that so old and new names converge on one project node.

Vendor drift: "eigencloud" → "eigenlayer", "sushi" → "sushiswap". The same protocol, named differently by different explorers, normalizes to one slug.

The goal isn't "the objectively right name"; it's the exact slug the batch pipeline writes, so RT and batch agree per node.
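A normalization step of this kind is easy to sketch. The version below is an assumption-laden illustration: canonical holds only the aliases this lesson mentions, and normalize and nonIdempotent are hypothetical names. The idempotence check is inspired by the name of TestInferProject_CanonicalIdempotent; what that test actually asserts is not shown here.

```go
package main

import "fmt"

// canonical maps known aliases to the slug the batch pipeline is said to write.
// Illustrative subset only.
var canonical = map[string]string{
	"sushi":      "sushiswap",
	"curve":      "curve-finance",
	"maker":      "sky",
	"dai":        "sky",
	"eigencloud": "eigenlayer",
}

// normalize returns the canonical slug for an alias; any other input,
// including an already-canonical slug, passes through unchanged.
func normalize(slug string) string {
	if c, ok := canonical[slug]; ok {
		return c
	}
	return slug
}

// nonIdempotent lists aliases whose canonical slug would change again if
// normalized a second time. An empty result means normalize(normalize(x))
// equals normalize(x) for every alias in the table.
func nonIdempotent() []string {
	var bad []string
	for alias, c := range canonical {
		if normalize(c) != c {
			bad = append(bad, alias)
		}
	}
	return bad
}

func main() {
	fmt.Println(normalize("maker"), normalize("sky"), normalize("aave"))
	fmt.Println(len(nonIdempotent()))
}
```

Idempotence matters for parity because a slug may pass through normalization more than once on its way to storage; a table where a canonical value is itself an alias would make the final slug depend on how many times it was applied.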

4 · The reserved-generic filter — refusing to guess

Some labels look like attribution but aren't a protocol. "stablecoin", "dex", "lending", "erc20-token", "safe", "mev-bot" — all describe a kind, not a project. Inference returns empty for these rather than stamping a meaningless BELONGS_TO:

var projectReservedSortedBy = map[string]bool{
    "token-contract": true, "stablecoin": true, "dex": true,
    "lending": true, "oracle": true, "bridge": true, // …treated as no-match
}
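The filter can be illustrated on the slug path, where a generic first token is the most likely input. This is a sketch under assumptions: reserved copies only the entries shown in the excerpt above, projectFromSlug is a hypothetical helper rather than the source's InferProjectFromSlug, and canonical-slug normalization is left out to keep the example focused.

```go
package main

import (
	"fmt"
	"strings"
)

// reserved holds labels that describe a kind of contract, not a project.
// Subset taken from the lesson's excerpt.
var reserved = map[string]bool{
	"token-contract": true, "stablecoin": true, "dex": true,
	"lending": true, "oracle": true, "bridge": true,
}

// projectFromSlug takes the first pipe-delimited token of a labels slug and
// refuses to attribute when that token is empty or a reserved generic.
func projectFromSlug(labelsSlug string) (string, bool) {
	first, _, _ := strings.Cut(strings.TrimSpace(labelsSlug), "|")
	if first == "" || reserved[first] {
		return "", false
	}
	return first, true
}

func main() {
	fmt.Println(projectFromSlug("stablecoin"))
	fmt.Println(projectFromSlug("curve|amm|stableswap"))
}
```

Returning a boolean alongside the slug makes "no attribution" an explicit outcome, so a caller can skip writing the edge instead of writing one to a node named "stablecoin".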

The trade-off, named

Project inference is a deliberate heuristic: cheap, deterministic, parity-matchable — but fuzzy. A contract named "DAIProxyHelper" would match "dai" → sky even if it's unrelated to Sky. The team accepts that imprecision because the alternative (structural inference from graph topology, or per-contract curation) costs far more for a field that's organizational, not safety-critical. The reserved-generic filter is the floor: better no attribution than a confidently-wrong one. Correctness here is defined as "matches batch", not "objectively perfect".
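The accepted imprecision is easy to reproduce. In the sketch below, explain is a hypothetical helper that returns the matched substring alongside the slug, which is one way a reviewer could audit why a name was attributed. The pattern table is a two-entry illustration, and "DailyRewardsDistributor" is an invented contract name.

```go
package main

import (
	"fmt"
	"strings"
)

// pattern is a hypothetical stand-in for the source's projectPattern.
type pattern struct{ substr, slug string }

// Illustrative subset; "dai" mapping to "sky" is the case the lesson cites.
var patterns = []pattern{{"atoken", "aave"}, {"dai", "sky"}}

// explain returns the inferred slug and the substring that triggered it.
// Both are empty when nothing matches.
func explain(contractName string) (slug, matched string) {
	name := strings.ReplaceAll(strings.ToLower(contractName), "_", "")
	for _, p := range patterns {
		if strings.Contains(name, p.substr) {
			return p.slug, p.substr
		}
	}
	return "", ""
}

func main() {
	for _, n := range []string{"DAIProxyHelper", "DailyRewardsDistributor", "PriceFeed"} {
		slug, why := explain(n)
		fmt.Printf("%s -> %q (matched %q)\n", n, slug, why)
	}
}
```

Short patterns are the risky ones: three letters like "dai" appear inside many unrelated words, which is the cost side of the trade-off described above.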

5 · What it produces and who reads it

Stage 10 of the L24 pipeline writes the inferred slug as the project field and as a BELONGS_TO edge to a protocol node, along with project_source (nametag / contract_name) and project_category (defi / infra / uncategorized). Downstream, the slug is the grouping key for the admin panel (L17, "show me everything in Aave"), rule scoping (L12, rules that target a protocol), and the cost-allocation model (attributing on-chain signal to the customer who controls a protocol). It's the connective tissue that turns a flat node set into protocols.
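As a grouping key, the project slug does its work in a single pass over the nodes. The sketch below assumes a minimal node type with hypothetical field names and placeholder addresses; it is not the graph model used by the indexer.

```go
package main

import (
	"fmt"
	"sort"
)

// node carries only the fields this sketch needs; names are hypothetical.
type node struct {
	Address string
	Project string // canonical slug, empty when inference refused to guess
}

// groupByProject buckets node addresses by project slug.
// Nodes without an attribution are left out rather than grouped under "".
func groupByProject(nodes []node) map[string][]string {
	groups := make(map[string][]string)
	for _, n := range nodes {
		if n.Project == "" {
			continue
		}
		groups[n.Project] = append(groups[n.Project], n.Address)
	}
	// Sort each bucket so output is stable across runs.
	for _, addrs := range groups {
		sort.Strings(addrs)
	}
	return groups
}

func main() {
	nodes := []node{
		{"0xbbb", "aave"}, {"0xaaa", "aave"},
		{"0xccc", "curve-finance"}, {"0xddd", ""},
	}
	fmt.Println(groupByProject(nodes)["aave"])
}
```

This is also where byte-identical slugs pay off: "curve" and "curve-finance" would land in two different buckets.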

Check yourself

1. Why does the system infer a contract's project from labels instead of reading it on-chain?

2. InferProject checks nametag, then contract name, then labels_slug. Why that order?

3. The pattern lists put "crvusd" before "curve" and "metamorpho" before "morpho". The reason is…

4. Every result is piped through NormalizeLabelAPIProject. What's the point?

5. The slug table maps both "maker" and "dai" to "sky". This encodes…

6. A contract's only label is "stablecoin". What does inference return?

7. A contract named "DAIProxyHelper" gets attributed to sky despite being unrelated. How does the team view this?

8. The project field this stage produces is used downstream primarily as…

↳ Ask your teacher

Try: "What's the full set of ~50 patterns, and how are new ones added?" · "How is project_category (defi/infra/uncategorized) decided?" · "Where does the BELONGS_TO edge + protocol node actually get written?" · "Does the cost-allocation model use project, and how?" · "How does TestInferProject_CanonicalIdempotent enforce parity?"

What you can now do

Explain why project attribution is a label heuristic — there's no on-chain protocol field to read.
Walk the precedence cascade (nametag → contract name → labels_slug) and justify the trust order.
Explain specificity ordering + first-match as the fix for substring collisions like crvusd vs curve.
Describe canonical-slug normalization as a parity constraint, and what rebrands/vendor-drift it encodes.
Explain the reserved-generic filter, the heuristic's accepted imprecision, and who consumes the project field.

A clean look at a "good-enough" subsystem

Not every part of a risk system is a careful algorithm. Project inference is intentionally a tuned string-matcher, parity-locked to batch, with a refusal-to-guess floor. Knowing where the codebase chooses cheap heuristics over precision — and why — is as much a part of understanding it as the at_risk math was.


Grounded in: pkg/enrichment/project.go (InferProject nametag→contract-name cascade, knownNametagPatterns + knownProjects specificity-ordered ~50-pattern lists, NormalizeLabelAPIProject + labelAPIProjectNorm canonical slugs incl. maker→sky / sushi→sushiswap, labelAPIGenericProjects + projectReservedSortedBy reserved-generic filter, InferProjectFromSlug first-token slug path; TestInferProject_CanonicalIdempotent parity test). Stage 10 of the enrichment pipeline (L24). Verify against source — the code is the truth.