Every LLM call writes a row. Every funnel event back-links to the call that drafted the email. After a month, "which pitch variant is winning" is a SQL query, not an opinion.
Fields logged on every run:

- stage_runs.prompt_id — versioned, scoped
- stage_runs.input_presence — which facts were available
- stage_runs.missing_required — which hard inputs forced an abort
- stage_runs.engine — anthropic / google / perplexity
- stage_runs.model — exact model id
- stage_runs.tokens_in + tokens_out
- stage_runs.output_meta — templates fired, signals emitted, etc.

Questions that become queryable:

- Which pitch_brief variant has the highest promote-to-reply rate on a given vertical
- Which prompts have the highest missing_required rate (signals to fix)
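As a sketch, those fields map onto a row type like the one below. The TypeScript shape and the concrete column types are assumptions, not the system's actual schema.

```ts
// Sketch of a stage_runs row; field names mirror the list above,
// the types (string ids, JSON-ish blobs) are assumptions.
interface StageRun {
  id: string;
  stage: string;                           // e.g. "pitch_brief"
  promptId: string;                        // versioned, scoped, e.g. "pitch_brief_v14"
  inputPresence: Record<string, boolean>;  // which facts were available
  missingRequired: string[];               // hard inputs that forced an abort, if any
  engine: "anthropic" | "google" | "perplexity";
  model: string;                           // exact model id
  tokensIn: number;
  tokensOut: number | null;                // backfilled after completion
  outputMeta: Record<string, unknown>;     // templates fired, signals emitted, etc.
  createdAt: Date;
}
```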
Every prompt in the system carries a version, a scope, and a tier. The version increments on any meaningful edit. The scope is a tuple — (lead_type, vertical, subcategory, brand) — and tiers are baseline, vertical-override, and subcategory-override. When the runtime needs a pitch_brief prompt, it walks a 5-step fallback resolution from most-specific to most-general scope and picks the first match.
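A minimal sketch of that fallback walk, assuming a registry lookup keyed by the scope tuple. The function names and the exact order in which scope dimensions drop away are illustrative, not taken from the system.

```ts
interface PromptScope {
  leadType?: string;
  vertical?: string;
  subcategory?: string;
  brand?: string;
}

// Walks from most-specific scope to baseline; the first registered prompt wins.
// `find` is a hypothetical registry lookup.
function resolvePrompt(
  stage: string,
  scope: Required<PromptScope>,
  find: (stage: string, scope: PromptScope) => string | undefined,
): string | undefined {
  const candidates: PromptScope[] = [
    scope,                                                       // 1. full tuple
    { leadType: scope.leadType, vertical: scope.vertical, subcategory: scope.subcategory }, // 2. subcategory override
    { leadType: scope.leadType, vertical: scope.vertical },      // 3. vertical override
    { leadType: scope.leadType },                                // 4. lead type only
    {},                                                          // 5. baseline tier
  ];
  for (const candidate of candidates) {
    const promptId = find(stage, candidate);
    if (promptId !== undefined) return promptId;
  }
  return undefined;
}
```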
Every call writes a stage_runs row before it executes — prompt_id, input presence, engine, model, token estimates. After completion, tokens_out and output_meta backfill. If the call aborts because hard inputs are missing, that's also recorded. This means even abort patterns are queryable: "which prompt has the highest abort rate, and on which input field?" That tells you what data to fetch, not what prompt to rewrite.
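A hedged sketch of that wrapper, with hypothetical db and callModel helpers; several columns (engine, model, token estimates) are omitted for brevity.

```ts
type Row = Record<string, unknown>;

interface Db {
  insert(table: string, row: Row): Promise<string>;         // returns the new row id
  update(table: string, id: string, patch: Row): Promise<void>;
}

async function runStage(
  db: Db,
  stage: string,
  promptId: string,
  inputs: Row,
  required: string[],
  callModel: (promptId: string, inputs: Row) => Promise<{ text: string; tokensOut: number; meta: Row }>,
): Promise<string | null> {
  const missing = required.filter((k) => inputs[k] == null);
  // The row is written before the model call, so even failures leave a record.
  const runId = await db.insert("stage_runs", {
    stage,
    prompt_id: promptId,
    input_presence: Object.fromEntries(Object.keys(inputs).map((k) => [k, inputs[k] != null] as const)),
    missing_required: missing,
  });
  if (missing.length > 0) {
    // Abort patterns stay queryable: which prompt aborts most, on which field.
    await db.update("stage_runs", runId, { status: "aborted_missing_required" });
    return null;
  }
  const result = await callModel(promptId, inputs);
  await db.update("stage_runs", runId, {
    status: "completed",
    tokens_out: result.tokensOut,
    output_meta: result.meta,
  });
  return result.text;
}
```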
Downstream, every email draft records pitch_prompt_id, email_prompt_id, and tune_prompt_id. Every send records email_prompt_id. Every funnel event on a lead — replied, call_booked, demo, closed_won, closed_lost — joins back to the exact prompt versions that drafted the outreach. After a quarter of operation, you can compare promote-to-reply rate between two pitch_brief variants on the same vertical-subcategory pair and pick the winner. The competitor running mail-merge cannot reconstruct this. Their prompts aren't versioned, their outputs aren't attributed.
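Once sends are joined back to funnel events, the comparison itself is a small aggregation. A sketch over already-attributed rows, with assumed row and field names:

```ts
interface AttributedSend {
  pitchPromptId: string;   // e.g. "pitch_brief_v14" vs "pitch_brief_v15"
  promoted: boolean;       // a human approved the draft
  replied: boolean;        // a reply funnel event joined back to this send
}

// Promote-to-reply rate per pitch_brief variant: replies / promoted drafts.
function replyRateByVariant(sends: AttributedSend[]): Map<string, number> {
  const byVariant = new Map<string, AttributedSend[]>();
  for (const send of sends) {
    const bucket = byVariant.get(send.pitchPromptId) ?? [];
    bucket.push(send);
    byVariant.set(send.pitchPromptId, bucket);
  }
  const rates = new Map<string, number>();
  for (const [variant, rows] of byVariant) {
    const promoted = rows.filter((r) => r.promoted);
    const replied = promoted.filter((r) => r.replied).length;
    rates.set(variant, promoted.length > 0 ? replied / promoted.length : 0);
  }
  return rates;
}
```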
```
--- entry ---
prompt_id: pitch_brief_v14
stage: pitch_brief
tier: vertical_override
scope:
  pack: devtools
  vertical: kubernetes
  subcategory: b2b
  brand: acme
input_contract:
  required: [lead.painSignals, lead.aiSummary, brand.kbExcerpts]
  soft: [lead.namedLeaders, lead.socialResearch]
  abort_on: any required missing
engine_pref: anthropic
model_pref: claude-sonnet-4-6
fallback_to: pitch_brief_v9 (baseline tier)

--- attached metrics (last 30d) ---
calls: 412
abort_rate: 7.0%          # missing_required: aiSummary
promote_rate: 68.4%       # human approves draft
reply_rate: — PLACEHOLDER —
call_booked_rate: — PLACEHOLDER —
avg_tokens_in / out: 3,914 / 842

--- variant under test ---
prompt_id: pitch_brief_v15   # 10% traffic split
delta: new system instruction on opener grounding
status: live · 9 days

# PLACEHOLDER — full registry available in app
```
A new prompt variant looks great after 12 calls. You promote it. It loses badly at scale.
Tier-based prompt resolution: 5-step fallback to baseline. Variants are flagged "experimental" until they cross a minimum-call threshold.
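A small sketch of that gate; the threshold value and field names are assumptions, not numbers from the system.

```ts
// A variant stays "experimental" and ineligible for promotion until it has
// enough calls behind it to trust its rates.
const MIN_CALLS_FOR_PROMOTION = 200; // hypothetical threshold

function promotionEligible(
  variant: { calls: number; promoteRate: number },
  baselinePromoteRate: number,
): boolean {
  if (variant.calls < MIN_CALLS_FOR_PROMOTION) return false; // still experimental
  return variant.promoteRate > baselinePromoteRate;
}
```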
An LLM provider silently updates the model behind a name. Performance changes overnight.
Exact model id is logged on every call. Date-keyed metrics surface step-changes. Pin to specific model versions when stability matters.
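One way the date-keyed view could surface a step-change, sketched over per-day, per-model metrics; the shapes and the jump threshold are assumptions.

```ts
interface DailyModelMetric {
  date: string;        // e.g. "2025-06-01"
  model: string;       // exact model id logged on every call
  promoteRate: number; // 0..1
}

// Flags days where a model's promote rate jumps by more than `threshold`
// versus the previous day for the same exact model id.
function flagStepChanges(series: DailyModelMetric[], threshold = 0.1): DailyModelMetric[] {
  const sorted = [...series].sort((a, b) =>
    a.model === b.model ? a.date.localeCompare(b.date) : a.model.localeCompare(b.model),
  );
  const flagged: DailyModelMetric[] = [];
  for (let i = 1; i < sorted.length; i++) {
    const prev = sorted[i - 1];
    const curr = sorted[i];
    if (curr.model === prev.model && Math.abs(curr.promoteRate - prev.promoteRate) > threshold) {
      flagged.push(curr);
    }
  }
  return flagged;
}
```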
A reply comes in two months after send. The prompt that drafted it has been edited.
Prompts are immutable on version increment. The exact prompt_id at send time is preserved on the email_message row, not refetched.
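A sketch of that send-time attribution, with illustrative field names: the prompt ids are copied onto the email_message row when the send happens, never looked up again.

```ts
interface EmailMessage {
  leadId: string;
  pitchPromptId: string;   // frozen at send time, e.g. "pitch_brief_v14"
  emailPromptId: string;
  tunePromptId: string;
  sentAt: Date;
}

function attributeSend(draft: {
  leadId: string;
  pitchPromptId: string;
  emailPromptId: string;
  tunePromptId: string;
}): EmailMessage {
  // Copy, don't reference: a later edit creates pitch_brief_v15 rather than
  // mutating v14, so this row stays a faithful record of what was sent.
  return { ...draft, sentAt: new Date() };
}
```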
The self-teaching loop isn't a stage. It's the bookkeeping that wraps every AI stage in the pipeline.
See the prompt registry, stage_runs, and funnel attribution in a 30-minute walkthrough.