Paitho
← Product / Stage 02 · Web Audit
Pipeline · Stage 02

Read the website
like a senior rep would.

Perplexity structured extraction over the homepage and key sub-pages. Channels, ecommerce maturity, traffic proxy, content recency — turned into typed fields the next nine stages can reason on.

What goes in, what comes out.

Inputs
Lead seed
  • lead.domain — the only required field
  • lead.packId — controls per-pack field weights and required fields
  • lead.socialHandles[] — optional seed handles if known
  • pack.auditFields[] — vertical-specific fields to extract (e.g. has_oss_repo for devtools, has_dealer_locator for logistics)
  • pack.minMaturity — gate value below which the audit short-circuits
Outputs
web_audit
  • web_audit.channels[]b2c | b2b | distributor | marketplace
  • web_audit.ecommerceshopify | woocommerce | custom | none
  • web_audit.trafficLevel — proxy bucket: low | mid | high
  • web_audit.has_request_quote, has_dealer_locator
  • web_audit.marketplaceDeps[] — Amazon, Etsy, ThomasNet, etc.
  • web_audit.socialFollowers — per-network counts
  • web_audit.contentRecency — newest blog post age in days
  • web_audit.hiringSignals[] — open roles by team
Schema-bound: strict_json: true · missing → null

One Perplexity call.
Typed back into a row.

channels→ enum[]
ecommerce→ enum
trafficLevel→ low | mid | high
has_request_quote→ bool | null
has_dealer_locator→ bool | null
marketplaceDeps→ string[]
socialFollowers→ map<net, int>
contentRecency→ days_since
logistics pack adds has_dealer_locator; devtools pack adds has_oss_repo

The audit is one model call, not a crawl. Perplexity Sonar is given the lead's domain, a list of structured fields to extract, and the schema each field must match. It reads the home page, the about page, the products or pricing page if present, and the most recent blog or news entry. The output comes back as JSON, validated against the schema, and written into the lead's web_audit column. No HTML scraping in our infra. No flaky DOM selectors that break on a Webflow redesign.

Field requirements are per-pack. The logistics pack needs has_dealer_locator and marketplaceDeps because the pitches downstream lean on those. The devtools pack needs has_oss_repo and recent_release_age. A pack declares which audit fields it needs, and the prompt is composed at run-time from that declaration. Add a field to a pack and every lead in that pack re-audits on the next refresh.

Traffic level is a proxy, not a measurement — we do not have access to the lead's analytics. The model estimates a bucket from public signals: visible review counts, marketplace listing rank, social follower distribution, and the cadence of dated content. Buckets are intentionally coarse. low | mid | high is a useful axis to sort on; a fake-precise visitor count is not. Where the bucket is uncertain, the field returns null and the lead waits for a human spot-check before qualification can fire.

Schema-bound. Per-pack composed.

prompt · web_audit_v12 · logistics +
--- system ---
You audit a B2B company's web presence. Read only what is publicly
available at the given domain. NEVER infer a field if the page does
not provide a basis. Return null and add the field to missing[].

--- inputs ---
domain:    {{lead.domain}}
pack:      {{lead.packId}}            # logistics
fields:    {{pack.auditFields | json}}

--- output schema ---
{
  "channels":           ("b2c"|"b2b"|"distributor"|"marketplace")[],
  "ecommerce":          "shopify"|"woocommerce"|"custom"|"none"|null,
  "trafficLevel":       "low"|"mid"|"high"|null,
  "has_request_quote":  boolean|null,
  "has_dealer_locator": boolean|null,
  "marketplaceDeps":    string[],
  "socialFollowers":    { linkedin?:int, x?:int, instagram?:int, facebook?:int },
  "contentRecency":     int|null,         # days since newest dated post
  "hiringSignals":      string[],         # role · team
  "missing":            string[]
}

--- guards ---
- Each enum value MUST come from the declared list. Unknown → null.
- trafficLevel is a coarse bucket; do not invent a visitor number.
- Quote the source URL anchor for every non-null field in _evidence.

# <!-- PLACEHOLDER — full prompt registry available in app -->

What can break.
And what catches it.

Risk
Hallucinated facts

Model invents an ecommerce platform or a dealer-locator that isn't on the site.

Mitigation

Every non-null field carries a URL-anchor quote in _evidence. Fields without evidence are dropped to null server-side before write.

Risk
Cloudflare or login walls

The page is behind a JS challenge or auth wall, audit returns nothing useful.

Mitigation

All required fields land in missing[]. Lead is tagged audit_blocked and never enters Qualification — it surfaces in a manual triage queue.

Risk
Stale audit drift

Lead's site is rebuilt; the audit on file no longer reflects reality.

Mitigation

Per-pack TTL on audit rows. Re-audit triggers automatically when a lead is promoted out of cold storage or when the pack's audit-field list changes.

Audit a real domain.
In a sandbox, in 2 minutes.

Drop a domain. Watch eight typed fields appear with quoted evidence.