Every LLM call writes a row. Every funnel event back-links to the call that drafted the email. After a month, "which pitch variant is winning" is a SQL query, not an opinion.
Fields logged on every run:

- stage_runs.prompt_id — versioned, scoped
- stage_runs.input_presence — which facts were available
- stage_runs.missing_required — which hard inputs forced an abort
- stage_runs.engine — anthropic / google / perplexity
- stage_runs.model — exact model id
- stage_runs.tokens_in + tokens_out
- stage_runs.output_meta — templates fired, signals emitted, etc.

Questions that become queryable:

- Which pitch_brief variant has the highest promote-to-reply rate on a given vertical
- Which prompts have the highest missing_required rate (signals to fix)
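As a sketch, those fields map onto a row type like the one below. The TypeScript shape and the concrete column types are assumptions, not the system's actual schema.

```ts
// Sketch of a stage_runs row; field names mirror the list above,
// the types (string ids, JSON-ish blobs) are assumptions.
interface StageRun {
  id: string;
  stage: string;                           // e.g. "pitch_brief"
  promptId: string;                        // versioned, scoped, e.g. "pitch_brief_v14"
  inputPresence: Record<string, boolean>;  // which facts were available
  missingRequired: string[];               // hard inputs that forced an abort, if any
  engine: "anthropic" | "google" | "perplexity";
  model: string;                           // exact model id
  tokensIn: number;
  tokensOut: number | null;                // backfilled after completion
  outputMeta: Record<string, unknown>;     // templates fired, signals emitted, etc.
  createdAt: Date;
}
```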
Every prompt in the system carries a version, a scope, and a tier. The version increments on any meaningful edit. The scope is a tuple — (lead_type, vertical, subcategory, brand) — and tiers are baseline, vertical-override, and subcategory-override. When the runtime needs a pitch_brief prompt, it walks a 5-step fallback resolution from most-specific to most-general scope and picks the first match.
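A minimal sketch of that fallback walk, assuming a registry lookup keyed by the scope tuple. The function names and the exact order in which scope dimensions drop away are illustrative, not taken from the system.

```ts
interface PromptScope {
  leadType?: string;
  vertical?: string;
  subcategory?: string;
  brand?: string;
}

// Walks from most-specific scope to baseline; the first registered prompt wins.
// `find` is a hypothetical registry lookup.
function resolvePrompt(
  stage: string,
  scope: Required<PromptScope>,
  find: (stage: string, scope: PromptScope) => string | undefined,
): string | undefined {
  const candidates: PromptScope[] = [
    scope,                                                       // 1. full tuple
    { leadType: scope.leadType, vertical: scope.vertical, subcategory: scope.subcategory }, // 2. subcategory override
    { leadType: scope.leadType, vertical: scope.vertical },      // 3. vertical override
    { leadType: scope.leadType },                                // 4. lead type only
    {},                                                          // 5. baseline tier
  ];
  for (const candidate of candidates) {
    const promptId = find(stage, candidate);
    if (promptId !== undefined) return promptId;
  }
  return undefined;
}
```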
Every call writes a stage_runs row before it executes — prompt_id, input presence, engine, model, token estimates. After completion, tokens_out and output_meta backfill. If the call aborts because hard inputs are missing, that's also recorded. This means even abort patterns are queryable: "which prompt has the highest abort rate, and on which input field?" That tells you what data to fetch, not what prompt to rewrite.
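A hedged sketch of that wrapper, with hypothetical db and callModel helpers; several columns (engine, model, token estimates) are omitted for brevity.

```ts
type Row = Record<string, unknown>;

interface Db {
  insert(table: string, row: Row): Promise<string>;         // returns the new row id
  update(table: string, id: string, patch: Row): Promise<void>;
}

async function runStage(
  db: Db,
  stage: string,
  promptId: string,
  inputs: Row,
  required: string[],
  callModel: (promptId: string, inputs: Row) => Promise<{ text: string; tokensOut: number; meta: Row }>,
): Promise<string | null> {
  const missing = required.filter((k) => inputs[k] == null);
  // The row is written before the model call, so even failures leave a record.
  const runId = await db.insert("stage_runs", {
    stage,
    prompt_id: promptId,
    input_presence: Object.fromEntries(Object.keys(inputs).map((k) => [k, inputs[k] != null] as const)),
    missing_required: missing,
  });
  if (missing.length > 0) {
    // Abort patterns stay queryable: which prompt aborts most, on which field.
    await db.update("stage_runs", runId, { status: "aborted_missing_required" });
    return null;
  }
  const result = await callModel(promptId, inputs);
  await db.update("stage_runs", runId, {
    status: "completed",
    tokens_out: result.tokensOut,
    output_meta: result.meta,
  });
  return result.text;
}
```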
Downstream, every email draft records pitch_prompt_id, email_prompt_id, and tune_prompt_id. Every send records email_prompt_id. Every funnel event on a lead — replied, call_booked, demo, closed_won, closed_lost — joins back to the exact prompt versions that drafted the outreach. After a quarter of operation, you can compare promote-to-reply rate between two pitch_brief variants on the same vertical-subcategory pair and pick the winner. The competitor running mail-merge cannot reconstruct this. Their prompts aren't versioned, their outputs aren't attributed.
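Once sends are joined back to funnel events, the comparison itself is a small aggregation. A sketch over already-attributed rows, with assumed row and field names:

```ts
interface AttributedSend {
  pitchPromptId: string;   // e.g. "pitch_brief_v14" vs "pitch_brief_v15"
  promoted: boolean;       // a human approved the draft
  replied: boolean;        // a reply funnel event joined back to this send
}

// Promote-to-reply rate per pitch_brief variant: replies / promoted drafts.
function replyRateByVariant(sends: AttributedSend[]): Map<string, number> {
  const byVariant = new Map<string, AttributedSend[]>();
  for (const send of sends) {
    const bucket = byVariant.get(send.pitchPromptId) ?? [];
    bucket.push(send);
    byVariant.set(send.pitchPromptId, bucket);
  }
  const rates = new Map<string, number>();
  for (const [variant, rows] of byVariant) {
    const promoted = rows.filter((r) => r.promoted);
    const replied = promoted.filter((r) => r.replied).length;
    rates.set(variant, promoted.length > 0 ? replied / promoted.length : 0);
  }
  return rates;
}
```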
```
--- entry ---
prompt_id: pitch_brief_v14
stage: pitch_brief
tier: vertical_override
scope:
  pack: devtools
  vertical: kubernetes
  subcategory: b2b
  brand: acme
input_contract:
  required: [lead.painSignals, lead.aiSummary, brand.kbExcerpts]
  soft: [lead.namedLeaders, lead.socialResearch]
  abort_on: any required missing
engine_pref: anthropic
model_pref: claude-sonnet-4-6
fallback_to: pitch_brief_v9 (baseline tier)

--- attached metrics (last 30d) ---
calls: 412
abort_rate: 7.0%          # missing_required: aiSummary
promote_rate: 68.4%       # human approves draft
reply_rate: — PLACEHOLDER —
call_booked_rate: — PLACEHOLDER —
avg_tokens_in / out: 3,914 / 842

--- variant under test ---
prompt_id: pitch_brief_v15   # 10% traffic split
delta: new system instruction on opener grounding
status: live · 9 days

# PLACEHOLDER — full registry available in app
```
A new prompt variant looks great after 12 calls. You promote it. It loses badly at scale.
Tier-based prompt resolution: 5-step fallback to baseline. Variants are flagged "experimental" until they cross a minimum-call threshold.
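A small sketch of that gate; the threshold value and field names are assumptions, not numbers from the system.

```ts
// A variant stays "experimental" and ineligible for promotion until it has
// enough calls behind it to trust its rates.
const MIN_CALLS_FOR_PROMOTION = 200; // hypothetical threshold

function promotionEligible(
  variant: { calls: number; promoteRate: number },
  baselinePromoteRate: number,
): boolean {
  if (variant.calls < MIN_CALLS_FOR_PROMOTION) return false; // still experimental
  return variant.promoteRate > baselinePromoteRate;
}
```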
An LLM provider silently updates the model behind a name. Performance changes overnight.
Exact model id is logged on every call. Date-keyed metrics surface step-changes. Pin to specific model versions when stability matters.
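One way the date-keyed view could surface a step-change, sketched over per-day, per-model metrics; the shapes and the jump threshold are assumptions.

```ts
interface DailyModelMetric {
  date: string;        // e.g. "2025-06-01"
  model: string;       // exact model id logged on every call
  promoteRate: number; // 0..1
}

// Flags days where a model's promote rate jumps by more than `threshold`
// versus the previous day for the same exact model id.
function flagStepChanges(series: DailyModelMetric[], threshold = 0.1): DailyModelMetric[] {
  const sorted = [...series].sort((a, b) =>
    a.model === b.model ? a.date.localeCompare(b.date) : a.model.localeCompare(b.model),
  );
  const flagged: DailyModelMetric[] = [];
  for (let i = 1; i < sorted.length; i++) {
    const prev = sorted[i - 1];
    const curr = sorted[i];
    if (curr.model === prev.model && Math.abs(curr.promoteRate - prev.promoteRate) > threshold) {
      flagged.push(curr);
    }
  }
  return flagged;
}
```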
A reply comes in two months after send. The prompt that drafted it has been edited.
Prompts are immutable on version increment. The exact prompt_id at send time is preserved on the email_message row, not refetched.
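A sketch of that send-time attribution, with illustrative field names: the prompt ids are copied onto the email_message row when the send happens, never looked up again.

```ts
interface EmailMessage {
  leadId: string;
  pitchPromptId: string;   // frozen at send time, e.g. "pitch_brief_v14"
  emailPromptId: string;
  tunePromptId: string;
  sentAt: Date;
}

function attributeSend(draft: {
  leadId: string;
  pitchPromptId: string;
  emailPromptId: string;
  tunePromptId: string;
}): EmailMessage {
  // Copy, don't reference: a later edit creates pitch_brief_v15 rather than
  // mutating v14, so this row stays a faithful record of what was sent.
  return { ...draft, sentAt: new Date() };
}
```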
The self-teaching loop isn't a stage. It's the bookkeeping that wraps every AI stage in the pipeline.
See the prompt registry, stage_runs, and funnel attribution in a 30-minute walkthrough.