AI Cost Optimization Engine

Multi-level intelligence.
Minimal token spend.

HealthCopilot routes every query through five progressive filters before an LLM ever sees it. Most queries are resolved from live data. The ones that need AI are matched to the cheapest capable model. Spend is capped, observed, and circuit-broken at every tier.

60%+ · Resolved without any LLM call
85%+ · Resolved without premium model
5 · Progressive routing layers
3 · Model tiers, auto-selected
The Problem with AI in Health Tech

Unconstrained LLM use is a financial liability.

Health platforms that route every query to a frontier LLM face three compounding problems. HealthCopilot is built to eliminate all three.

$
Runaway Token Spend
A member asking "what is my deductible balance?" does not need GPT-4. Sending every inquiry to a frontier model inflates inference costs by 10x or more.
!
No Governance on AI Depth
Without routing intelligence, the system cannot distinguish a simple status lookup from a complex clinical rationale request. Every query gets the same expensive treatment.
?
No Observability on Spend
Without per-query cost attribution, ops teams cannot detect anomalies, track model usage trends, or act before monthly budgets are breached.
The Five-Layer Routing Stack

Every query passes through five filters.

Each layer resolves what it can and passes the remainder down. By the time a query reaches the LLM tier, 85%+ of the original volume has already been handled at near-zero cost.

Layer 1
Deterministic Resolution
0 token cost
Policy lookups, claim status checks, preauthorization status, coverage balance queries, benefit schedule reads. All resolved from live structured data with zero LLM involvement. Answers are deterministic, auditable, and instant.
Claim status · Deductible balance · Preauth status · Provider network check · Coverage eligibility · Benefit limits
60-65% of all queries resolved here
35-40% pass to Layer 2
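A minimal sketch of how a deterministic resolution layer like this can work. The intent names, data shapes, and handlers below are illustrative, not HealthCopilot's actual implementation:

```python
# Sketch of Layer 1: recognized intents map straight to structured-data
# lookups, so no tokens are spent. Intent names and data shapes are illustrative.
DETERMINISTIC_INTENTS = {
    "claim_status": lambda db, member: db["claims"].get(member, "no claims on file"),
    "deductible_balance": lambda db, member: f"${db['deductibles'][member]:.2f} remaining",
}

def try_resolve_deterministic(intent, member_id, db):
    """Return a zero-token answer if the intent is deterministic, else None
    so the query falls through to Layer 2."""
    handler = DETERMINISTIC_INTENTS.get(intent)
    return handler(db, member_id) if handler else None

db = {"claims": {"m-1": "approved"}, "deductibles": {"m-1": 420.0}}
answer = try_resolve_deterministic("claim_status", "m-1", db)          # resolved at zero cost
unresolved = try_resolve_deterministic("clinical_dispute", "m-1", db)  # None: pass down
```

The key property is that the answer comes from live structured data, so it is deterministic and auditable by construction.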
Layer 2
Semantic Response Cache
Near-zero cost
Semantically similar queries that have been answered before are matched to cached responses. "How do I submit a claim?" and "What is the process for filing a claim?" retrieve the same cached answer. Cache is tenant-scoped and invalidated on policy changes.
Common policy questions · Repeated FAQ patterns · Same-session context · Multi-turn follow-ups
Additional 10-15% of total volume resolved here
20-25% reach the LLM tier
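The cache lookup could be sketched as follows. A toy bag-of-words vector stands in for a real sentence-embedding model (which would match true paraphrases like the claim-filing example above far better), and the similarity threshold is an assumption:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a production cache would use a sentence-embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):          # threshold is illustrative
        self.entries = []                       # (embedding, answer); tenant-scoped in practice
        self.threshold = threshold

    def get(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None                             # miss: query continues to Layer 3

    def put(self, query, answer):
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.put("How do I submit a claim", "Use the claims portal.")
```

Invalidation on policy changes would simply drop the tenant's entries, forcing fresh answers.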
Layer 3
Model Selection Router
Intelligent routing
Intent classification determines the minimum capable model for each remaining query. Complexity scoring, clinical depth flag, and reasoning requirement all feed the router. The system never calls a premium model for a task a micro model can handle.
Micro Tier
Haiku · GPT-3.5 · Flash
~70%
Simple clarifications, field explanations, format transforms, date conversions, and basic FAQ. Lowest cost per token.
Benefit explanation · Date formatting · Simple FAQ · Short summary
🧠
Mid Tier
Sonnet · GPT-4o-mini · Pro
~25%
Policy parsing, multi-step reasoning, rejection explanations, provider recommendations, and moderate clinical context synthesis.
Rejection rationale · Policy comparison · Claim analysis · Clinical summary
🏆
Premium Tier
Opus · GPT-4 · Ultra
~5%
Complex clinical dispute reasoning, multi-document evidence synthesis, high-stakes preauth justification. Triggered only when complexity score exceeds threshold.
Clinical disputes · Evidence synthesis · Appeal drafting · Complex preauth
Model selected — context compression applied before dispatch
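Tier selection from those signals could look like the sketch below. Only the premium gate at a complexity score of 80+ comes from this page; the mid-tier breakpoint is an assumed value:

```python
def select_tier(complexity, clinical_depth, needs_reasoning):
    """Pick the cheapest capable tier from the router's signals.
    complexity is a 0-100 score; the 80+ premium gate is from the page,
    the mid-tier breakpoint of 40 is illustrative."""
    if complexity >= 80:
        return "premium"          # Opus / GPT-4 / Ultra
    if complexity >= 40 or clinical_depth or needs_reasoning:
        return "mid"              # Sonnet / GPT-4o-mini / Pro
    return "micro"                # Haiku / GPT-3.5 / Flash
```

Because every branch falls through to a cheaper default, a premium model can never be selected by accident.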
Layer 4
Context Compression
Reduces token count 60-80%
Only the relevant slice of member context is included in the prompt. Conversation history is summarized, not verbatim-appended. Clinical data is scoped to the specific query domain. Policy rules are extracted as key-value facts rather than full document text. The average effective prompt is 60-80% smaller than with a naive full-context approach.
Conversation history: Full transcript → Rolling summary
Member policy: Full policy doc → Relevant clauses only
Clinical records: Full history → Visit-scoped facts
Response generated — output deduplication applied
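The scoping rules of Layer 4 can be sketched as a simple filter over the full context. Field names and the example data are illustrative:

```python
def compress_context(ctx, domain, visit):
    """Sketch of Layer 4: keep only the slice of context the query needs.
    Full transcript -> rolling summary; full policy -> relevant clauses;
    full clinical history -> facts from the visit in question."""
    return {
        "history_summary": ctx["history_summary"],
        "policy": [c["text"] for c in ctx["policy_clauses"] if c["domain"] == domain],
        "clinical": [f for f in ctx["clinical_facts"] if f["visit"] == visit],
    }

ctx = {
    "history_summary": "Member asked about knee MRI coverage.",
    "policy_clauses": [
        {"domain": "imaging", "text": "MRI covered with preauth."},
        {"domain": "dental", "text": "Two cleanings per year."},
    ],
    "clinical_facts": [
        {"visit": "2024-05", "note": "knee pain"},
        {"visit": "2021-01", "note": "flu"},
    ],
}
compact = compress_context(ctx, domain="imaging", visit="2024-05")
```

Everything outside the query's domain and visit is dropped before the prompt is assembled, which is where the 60-80% token reduction comes from.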
Layer 5
Output Deduplication Cache
Eliminates repeat spend
Generated responses are stored and matched against future semantically equivalent queries. Identical outcomes are never paid for twice. The dedup cache is member-scoped (for personal data) and tenant-scoped (for general policy content), with TTL tuned per content type.
Policy explanation dedup · Provider list caching · Benefit summary reuse · Session-level memory
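The scoping and TTL rules described above might be implemented along these lines. Content-type names and TTL values are illustrative:

```python
import time

class DedupCache:
    """Sketch of Layer 5: responses keyed by (content type, scope, normalized
    query). Personal content is member-scoped; general policy content is
    tenant-scoped. TTLs are per content type, as the text describes."""
    def __init__(self, ttls):
        self.ttls = ttls            # e.g. {"policy": 86400, "personal": 600} (illustrative)
        self.store = {}

    def _key(self, content_type, tenant, member, query):
        scope = member if content_type == "personal" else tenant
        return (content_type, scope, " ".join(query.lower().split()))

    def get(self, content_type, tenant, member, query, now=None):
        now = time.time() if now is None else now
        hit = self.store.get(self._key(content_type, tenant, member, query))
        if hit and now - hit[1] < self.ttls[content_type]:
            return hit[0]
        return None                 # expired or absent: a fresh response is generated

    def put(self, content_type, tenant, member, query, answer, now=None):
        now = time.time() if now is None else now
        self.store[self._key(content_type, tenant, member, query)] = (answer, now)

cache = DedupCache({"policy": 100, "personal": 10})
cache.put("policy", "tenant-1", None, "What is covered?", "See schedule A.", now=0)
```

Scoping personal answers to the member keeps one member's cached data from ever serving another member's query.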
Cost Guardrails

Six independent safety nets on every inference.

Even after routing optimization, hard guardrails enforce budget discipline at conversation, session, tenant, and platform level.

Conversation Budget Ceiling
Every conversation has a configurable maximum spend. If the ceiling is approached, the router is forced to downgrade the model tier or return a deterministic fallback response.
Configurable per tenant · Default: $0.05 per session
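A sketch of the ceiling check. The $0.05 default is from the page; the 80% downgrade point and the return labels are assumptions:

```python
def enforce_ceiling(spent, ceiling=0.05, next_call_estimate=0.0):
    """Sketch of the conversation budget ceiling: force a cheaper tier as the
    ceiling is approached, and a deterministic fallback once it would be hit.
    The 80% downgrade threshold is illustrative."""
    projected = spent + next_call_estimate
    if projected >= ceiling:
        return "deterministic_fallback"     # never exceed the ceiling
    if projected >= 0.8 * ceiling:
        return "force_downgrade"            # router must pick a cheaper tier
    return "proceed"
```

Checking the projected spend (current plus estimated next call) rather than only the running total is what lets the ceiling act before, not after, it is breached.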
📆
Monthly Tenant Spend Cap
Each tenant (insurer/TPA) has a monthly LLM spend cap. As usage approaches the cap, the router progressively enforces lower model tiers. At 100%, all queries fall back to deterministic.
Configurable per plan · Tracked in real-time
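Progressive enforcement might look like the sketch below. The page specifies only that 100% of cap forces deterministic-only; the intermediate breakpoints are assumed:

```python
def max_tier_for_usage(spend, cap):
    """Sketch: the highest model tier a tenant may use at its current
    monthly spend. Breakpoints at 75% and 90% are illustrative; only the
    100% -> deterministic rule comes from the page."""
    used = spend / cap
    if used >= 1.0:
        return "deterministic"
    if used >= 0.9:
        return "micro"
    if used >= 0.75:
        return "mid"
    return "premium"
```

The router would intersect this ceiling with the tier the query actually needs, so a capped tenant degrades gracefully instead of failing outright.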
🔌
Circuit Breaker
If per-minute or per-hour inference spend exceeds anomaly thresholds, the circuit breaker trips and all LLM calls are blocked for a cooldown period. The system continues serving deterministic responses uninterrupted.
Auto-resets · Ops team alerted immediately
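A rolling-window spend breaker could be sketched as follows. The threshold, window, and cooldown values are illustrative:

```python
class SpendCircuitBreaker:
    """Sketch: trips when inference spend in a rolling window exceeds an
    anomaly threshold, then blocks LLM calls for a cooldown and auto-resets.
    Deterministic responses are unaffected. All parameters are illustrative."""
    def __init__(self, threshold, window_s=60, cooldown_s=300):
        self.threshold, self.window_s, self.cooldown_s = threshold, window_s, cooldown_s
        self.events = []                # (timestamp, cost) pairs
        self.tripped_at = None

    def record(self, now, cost):
        self.events.append((now, cost))
        recent = sum(c for t, c in self.events if now - t <= self.window_s)
        if recent > self.threshold:
            self.tripped_at = now       # a real system would alert ops here

    def allow_llm(self, now):
        if self.tripped_at is not None and now - self.tripped_at < self.cooldown_s:
            return False                # blocked: serve deterministic fallback
        self.tripped_at = None          # cooldown elapsed: auto-reset
        return True

cb = SpendCircuitBreaker(threshold=1.0, window_s=60, cooldown_s=300)
```

Because the breaker gates only the LLM path, a tripped breaker degrades the system to Layers 1-2 rather than taking it down.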
📡
Real-Time Cost Alerting
Spend thresholds trigger alerts at 50%, 75%, and 90% of configured budgets. Ops teams receive webhook, email, or in-platform notifications before caps are reached, not after.
Webhook + in-platform + email delivery
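Threshold crossings can be detected on each spend update, so each alert level fires exactly once. The 50/75/90% levels are from the page; the function shape is illustrative:

```python
ALERT_LEVELS = (0.5, 0.75, 0.9)    # the 50% / 75% / 90% thresholds from the page

def crossed_alerts(prev_spend, new_spend, budget):
    """Return the alert levels crossed by this spend update. Checking the
    (prev, new] interval means each level fires once, not on every update."""
    return [lvl for lvl in ALERT_LEVELS
            if prev_spend < lvl * budget <= new_spend]
```

Each returned level would then fan out to the webhook, email, and in-platform channels.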
🔁
Provider Fallback Routing
If the selected provider is unavailable or over its rate limit, the router switches to the next cheapest capable provider automatically. On-premises deployments can route to a local model with zero marginal cost.
Multi-provider: OpenAI · Anthropic · Azure · Local
🔒
Complexity Score Gate
Before dispatching to premium tier, a complexity classifier re-evaluates the query. If the score does not meet the premium threshold, the model is downgraded to mid tier and the query is re-processed. Premium is never triggered by default.
Complexity scored on 0-100 scale · Premium gate at 80+
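The gate could be sketched as a re-check in front of premium dispatch. The 0-100 scale and the 80+ gate are from the page; the downgrade target is an assumption:

```python
PREMIUM_GATE = 80   # premium gate at a complexity score of 80+, per the page

def gate_premium(requested_tier, rescore):
    """Re-evaluate complexity just before premium dispatch. rescore is a
    callable returning a fresh 0-100 complexity score; if it misses the
    gate, the query is downgraded to mid tier instead."""
    if requested_tier != "premium":
        return requested_tier           # gate applies only to premium requests
    return "premium" if rescore() >= PREMIUM_GATE else "mid"
```

Requiring a second, independent score to agree is what makes "premium is never triggered by default" enforceable rather than aspirational.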
AI Provider Flexibility

Not locked to one provider. Ever.

HealthCopilot abstracts the AI layer. Providers are configured at the tenant level. Routing logic compares cost, latency, and capability before each dispatch.

OpenAI
GPT-4o · GPT-4o-mini · GPT-3.5
Default provider for most tenants. Full tier coverage from micro to premium.
Anthropic
Claude Opus · Sonnet · Haiku
Strong alternative for clinical reasoning tasks. Configurable as primary or fallback.
Azure OpenAI
Deployed models in client tenant
For clients who require data residency. Models run within the client's Azure subscription.
Local
Llama · Mistral · Custom
On-premises deployments can route to a locally-hosted model. Zero per-token cost. Meets maximum data control requirements.
How the router chooses a provider for each call
1. Check circuit breaker status for each provider
2. Match required model tier to available provider endpoints
3. Score candidates on cost-per-token and recent latency
4. Dispatch to cheapest capable endpoint. Log cost attribution.
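The four steps above can be sketched in a few lines. The provider records, prices, and the cost/latency weighting are all illustrative:

```python
def choose_provider(providers, tier):
    """Sketch of the per-call provider choice: drop tripped breakers, keep
    tier-capable endpoints, then score on cost plus recent latency. The
    0.0001/ms latency weight is an assumed trade-off, not a product value."""
    live = [p for p in providers if not p["breaker_open"]]            # step 1
    capable = [p for p in live if tier in p["tiers"]]                 # step 2
    if not capable:
        return None                        # caller falls back to deterministic
    # steps 3-4: score and dispatch to the cheapest capable endpoint
    return min(capable, key=lambda p: p["cost_per_1k"][tier] + 0.0001 * p["latency_ms"])

providers = [
    {"name": "openai", "breaker_open": False, "tiers": {"micro", "mid", "premium"},
     "cost_per_1k": {"micro": 0.0005, "mid": 0.003, "premium": 0.03}, "latency_ms": 400},
    {"name": "local", "breaker_open": False, "tiers": {"micro"},
     "cost_per_1k": {"micro": 0.0}, "latency_ms": 100},
]
```

A `None` result is itself a routing outcome: the platform answers deterministically rather than failing the query.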
Cost Observability

Every token spend is logged and attributed.

Ops teams have full visibility into inference spend at query, session, member, intent, and tenant level. Anomalies surface before they become problems.

AI Cost Dashboard
$0.031 · Avg cost/session
82% · Micro tier usage
4.1% · Premium tier usage
Intent Type         | Model Tier    | Tokens | Cost
Claim status        | Deterministic | 0      | $0.000
Benefit explanation | Micro         | 412    | $0.001
Rejection rationale | Mid           | 1,842  | $0.009
Clinical dispute    | Premium       | 3,210  | $0.048
Provider lookup     | Deterministic | 0      | $0.000
📊
Per-Query Cost Attribution
Every inference is tagged with member ID, intent type, model used, token count, and cost. Spend is fully traceable and auditable.
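An attribution record carrying the fields listed above might be shaped like this sketch (field names are illustrative):

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class CostRecord:
    """One attribution record per inference, carrying the fields the page
    lists: member, intent, model, token count, and cost."""
    member_id: str
    intent: str
    model_tier: str
    tokens: int
    cost_usd: float

# Emit one JSON line per inference to the audit log / cost stream.
rec = CostRecord("m-1", "rejection_rationale", "mid", 1842, 0.009)
line = json.dumps(asdict(rec))
```

One structured line per inference is what makes every downstream view — trends, model mix, per-tenant billing — a query rather than a reconstruction.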
📈
Trend and Anomaly Detection
Rolling spend windows detect sudden spikes. If premium tier usage exceeds normal variance, the system flags it before manual review catches it.
🗂️
Model Mix Reports
Daily and monthly reports show what percentage of volume was handled by each tier. These are the primary levers for cost tuning over time.
📤
Export and Billing Integration
Cost data is exportable per billing period. For multi-tenant platform deployments, per-tenant cost allocation is available for pass-through billing.
AI Cost Optimization

Built for finance-conscious enterprise.

The five-layer routing stack, the six cost guardrails, and the full observability layer are not add-ons. They are core to the platform architecture. Every deployment ships with them active from day one.