AI Cost Optimization Engine

Multi-level intelligence.
Minimal token spend.

HealthCopilot routes every query through five progressive filters before an LLM ever sees it. Most queries are resolved from live data. The ones that need AI are matched to the cheapest capable model. Spend is capped, observed, and circuit-broken at every tier.

60%+ · Resolved without any LLM call
85%+ · Resolved without premium model
5 · Progressive routing layers
3 · Model tiers, auto-selected
The Problem with AI in Health Tech

Unconstrained LLM use is a financial liability.

Health platforms that route every query to a frontier LLM face three compounding problems. HealthCopilot is built to eliminate all three.

$
Runaway Token Spend
A member asking "what is my deductible balance?" does not need GPT-4. Sending every inquiry to a frontier model inflates inference costs by 10x or more.
!
No Governance on AI Depth
Without routing intelligence, the system cannot distinguish a simple status lookup from a complex clinical rationale request. Every query gets the same expensive treatment.
?
No Observability on Spend
Without per-query cost attribution, ops teams cannot detect anomalies, track model usage trends, or act before monthly budgets are breached.
The Five-Layer Routing Stack

Every query passes through five filters.

Each layer resolves what it can and passes the remainder down. By the time a query reaches the LLM tier, 85%+ of the original volume has already been handled at near-zero cost.

Layer 1
Deterministic Resolution
0 token cost
Policy lookups, claim status checks, preauthorization status, coverage balance queries, benefit schedule reads. All resolved from live structured data with zero LLM involvement. Answers are deterministic, auditable, and instant.
Claim status · Deductible balance · Preauth status · Provider network check · Coverage eligibility · Benefit limits
60-65% of all queries resolved here
35-40% pass to Layer 2
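A minimal sketch of how a deterministic resolution layer like this can work. The intent names, data shapes, and handlers below are illustrative, not HealthCopilot's actual implementation:

```python
# Sketch of Layer 1: recognized intents map straight to structured-data
# lookups, so no tokens are spent. Intent names and data shapes are illustrative.
DETERMINISTIC_INTENTS = {
    "claim_status": lambda db, member: db["claims"].get(member, "no claims on file"),
    "deductible_balance": lambda db, member: f"${db['deductibles'][member]:.2f} remaining",
}

def try_resolve_deterministic(intent, member_id, db):
    """Return a zero-token answer if the intent is deterministic, else None
    so the query falls through to Layer 2."""
    handler = DETERMINISTIC_INTENTS.get(intent)
    return handler(db, member_id) if handler else None

db = {"claims": {"m-1": "approved"}, "deductibles": {"m-1": 420.0}}
answer = try_resolve_deterministic("claim_status", "m-1", db)          # resolved at zero cost
unresolved = try_resolve_deterministic("clinical_dispute", "m-1", db)  # None: pass down
```

The key property is that the answer comes from live structured data, so it is deterministic and auditable by construction.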
Layer 2
Semantic Response Cache
Near-zero cost
Semantically similar queries that have been answered before are matched to cached responses. "How do I submit a claim?" and "What is the process for filing a claim?" retrieve the same cached answer. Cache is tenant-scoped and invalidated on policy changes.
Common policy questions · Repeated FAQ patterns · Same-session context · Multi-turn follow-ups
Additional 10-15% of total volume resolved here
20-25% reach the LLM tier
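The cache lookup could be sketched as follows. A toy bag-of-words vector stands in for a real sentence-embedding model (which would match true paraphrases like the claim-filing example above far better), and the similarity threshold is an assumption:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a production cache would use a sentence-embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):          # threshold is illustrative
        self.entries = []                       # (embedding, answer); tenant-scoped in practice
        self.threshold = threshold

    def get(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None                             # miss: query continues to Layer 3

    def put(self, query, answer):
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.put("How do I submit a claim", "Use the claims portal.")
```

Invalidation on policy changes would simply drop the tenant's entries, forcing fresh answers.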
Layer 3
Model Selection Router
Intelligent routing
Intent classification determines the minimum capable model for each remaining query. Complexity scoring, clinical depth flag, and reasoning requirement all feed the router. The system never calls a premium model for a task a micro model can handle.
Micro Tier
Haiku · GPT-3.5 · Flash
~70%
Simple clarifications, field explanations, format transforms, date conversions, and basic FAQ. Lowest cost per token.
Benefit explanation · Date formatting · Simple FAQ · Short summary
🧠
Mid Tier
Sonnet · GPT-4o-mini · Pro
~25%
Policy parsing, multi-step reasoning, rejection explanations, provider recommendations, and moderate clinical context synthesis.
Rejection rationale · Policy comparison · Claim analysis · Clinical summary
🏆
Premium Tier
Opus · GPT-4 · Ultra
~5%
Complex clinical dispute reasoning, multi-document evidence synthesis, high-stakes preauth justification. Triggered only when complexity score exceeds threshold.
Clinical disputes · Evidence synthesis · Appeal drafting · Complex preauth
Model selected — context compression applied before dispatch
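Tier selection from those signals could look like the sketch below. Only the premium gate at a complexity score of 80+ comes from this page; the mid-tier breakpoint is an assumed value:

```python
def select_tier(complexity, clinical_depth, needs_reasoning):
    """Pick the cheapest capable tier from the router's signals.
    complexity is a 0-100 score; the 80+ premium gate is from the page,
    the mid-tier breakpoint of 40 is illustrative."""
    if complexity >= 80:
        return "premium"          # Opus / GPT-4 / Ultra
    if complexity >= 40 or clinical_depth or needs_reasoning:
        return "mid"              # Sonnet / GPT-4o-mini / Pro
    return "micro"                # Haiku / GPT-3.5 / Flash
```

Because every branch falls through to a cheaper default, a premium model can never be selected by accident.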
Layer 4
Context Compression
Reduces token count 60-80%
Only the relevant slice of member context is included in the prompt. Conversation history is summarized, not verbatim-appended. Clinical data is scoped to the specific query domain. Policy rules are extracted as key-value facts rather than full document text. The average effective prompt is 60-80% smaller than with a naive full-context approach.
Conversation history: Full transcript → Rolling summary
Member policy: Full policy doc → Relevant clauses only
Clinical records: Full history → Visit-scoped facts
Response generated — output deduplication applied
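The scoping rules of Layer 4 can be sketched as a simple filter over the full context. Field names and the example data are illustrative:

```python
def compress_context(ctx, domain, visit):
    """Sketch of Layer 4: keep only the slice of context the query needs.
    Full transcript -> rolling summary; full policy -> relevant clauses;
    full clinical history -> facts from the visit in question."""
    return {
        "history_summary": ctx["history_summary"],
        "policy": [c["text"] for c in ctx["policy_clauses"] if c["domain"] == domain],
        "clinical": [f for f in ctx["clinical_facts"] if f["visit"] == visit],
    }

ctx = {
    "history_summary": "Member asked about knee MRI coverage.",
    "policy_clauses": [
        {"domain": "imaging", "text": "MRI covered with preauth."},
        {"domain": "dental", "text": "Two cleanings per year."},
    ],
    "clinical_facts": [
        {"visit": "2024-05", "note": "knee pain"},
        {"visit": "2021-01", "note": "flu"},
    ],
}
compact = compress_context(ctx, domain="imaging", visit="2024-05")
```

Everything outside the query's domain and visit is dropped before the prompt is assembled, which is where the 60-80% token reduction comes from.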
Layer 5
Output Deduplication Cache
Eliminates repeat spend
Generated responses are stored and matched against future semantically equivalent queries. Identical outcomes are never paid for twice. The dedup cache is member-scoped (for personal data) and tenant-scoped (for general policy content), with TTL tuned per content type.
Policy explanation dedup · Provider list caching · Benefit summary reuse · Session-level memory
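The scoping and TTL rules described above might be implemented along these lines. Content-type names and TTL values are illustrative:

```python
import time

class DedupCache:
    """Sketch of Layer 5: responses keyed by (content type, scope, normalized
    query). Personal content is member-scoped; general policy content is
    tenant-scoped. TTLs are per content type, as the text describes."""
    def __init__(self, ttls):
        self.ttls = ttls            # e.g. {"policy": 86400, "personal": 600} (illustrative)
        self.store = {}

    def _key(self, content_type, tenant, member, query):
        scope = member if content_type == "personal" else tenant
        return (content_type, scope, " ".join(query.lower().split()))

    def get(self, content_type, tenant, member, query, now=None):
        now = time.time() if now is None else now
        hit = self.store.get(self._key(content_type, tenant, member, query))
        if hit and now - hit[1] < self.ttls[content_type]:
            return hit[0]
        return None                 # expired or absent: a fresh response is generated

    def put(self, content_type, tenant, member, query, answer, now=None):
        now = time.time() if now is None else now
        self.store[self._key(content_type, tenant, member, query)] = (answer, now)

cache = DedupCache({"policy": 100, "personal": 10})
cache.put("policy", "tenant-1", None, "What is covered?", "See schedule A.", now=0)
```

Scoping personal answers to the member keeps one member's cached data from ever serving another member's query.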
Cost Guardrails

Six independent safety nets on every inference.

Even after routing optimization, hard guardrails enforce budget discipline at conversation, session, tenant, and platform level.

Conversation Budget Ceiling
Every conversation has a configurable maximum spend. If the ceiling is approached, the router is forced to downgrade the model tier or return a deterministic fallback response.
Configurable per tenant · Default: $0.05 per session
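A sketch of the ceiling check. The $0.05 default is from the page; the 80% downgrade point and the return labels are assumptions:

```python
def enforce_ceiling(spent, ceiling=0.05, next_call_estimate=0.0):
    """Sketch of the conversation budget ceiling: force a cheaper tier as the
    ceiling is approached, and a deterministic fallback once it would be hit.
    The 80% downgrade threshold is illustrative."""
    projected = spent + next_call_estimate
    if projected >= ceiling:
        return "deterministic_fallback"     # never exceed the ceiling
    if projected >= 0.8 * ceiling:
        return "force_downgrade"            # router must pick a cheaper tier
    return "proceed"
```

Checking the projected spend (current plus estimated next call) rather than only the running total is what lets the ceiling act before, not after, it is breached.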
📆
Monthly Tenant Spend Cap
Each tenant (insurer/TPA) has a monthly LLM spend cap. As usage approaches the cap, the router progressively enforces lower model tiers. At 100%, all queries fall back to deterministic.
Configurable per plan · Tracked in real-time
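Progressive enforcement might look like the sketch below. The page specifies only that 100% of cap forces deterministic-only; the intermediate breakpoints are assumed:

```python
def max_tier_for_usage(spend, cap):
    """Sketch: the highest model tier a tenant may use at its current
    monthly spend. Breakpoints at 75% and 90% are illustrative; only the
    100% -> deterministic rule comes from the page."""
    used = spend / cap
    if used >= 1.0:
        return "deterministic"
    if used >= 0.9:
        return "micro"
    if used >= 0.75:
        return "mid"
    return "premium"
```

The router would intersect this ceiling with the tier the query actually needs, so a capped tenant degrades gracefully instead of failing outright.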
🔌
Circuit Breaker
If per-minute or per-hour inference spend exceeds anomaly thresholds, the circuit breaker trips and all LLM calls are blocked for a cooldown period. The system continues serving deterministic responses uninterrupted.
Auto-resets · Ops team alerted immediately
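A rolling-window spend breaker could be sketched as follows. The threshold, window, and cooldown values are illustrative:

```python
class SpendCircuitBreaker:
    """Sketch: trips when inference spend in a rolling window exceeds an
    anomaly threshold, then blocks LLM calls for a cooldown and auto-resets.
    Deterministic responses are unaffected. All parameters are illustrative."""
    def __init__(self, threshold, window_s=60, cooldown_s=300):
        self.threshold, self.window_s, self.cooldown_s = threshold, window_s, cooldown_s
        self.events = []                # (timestamp, cost) pairs
        self.tripped_at = None

    def record(self, now, cost):
        self.events.append((now, cost))
        recent = sum(c for t, c in self.events if now - t <= self.window_s)
        if recent > self.threshold:
            self.tripped_at = now       # a real system would alert ops here

    def allow_llm(self, now):
        if self.tripped_at is not None and now - self.tripped_at < self.cooldown_s:
            return False                # blocked: serve deterministic fallback
        self.tripped_at = None          # cooldown elapsed: auto-reset
        return True

cb = SpendCircuitBreaker(threshold=1.0, window_s=60, cooldown_s=300)
```

Because the breaker gates only the LLM path, a tripped breaker degrades the system to Layers 1-2 rather than taking it down.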
📡
Real-Time Cost Alerting
Spend thresholds trigger alerts at 50%, 75%, and 90% of configured budgets. Ops teams receive webhook, email, or in-platform notifications before caps are reached, not after.
Webhook + in-platform + email delivery
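Threshold crossings can be detected on each spend update, so each alert level fires exactly once. The 50/75/90% levels are from the page; the function shape is illustrative:

```python
ALERT_LEVELS = (0.5, 0.75, 0.9)    # the 50% / 75% / 90% thresholds from the page

def crossed_alerts(prev_spend, new_spend, budget):
    """Return the alert levels crossed by this spend update. Checking the
    (prev, new] interval means each level fires once, not on every update."""
    return [lvl for lvl in ALERT_LEVELS
            if prev_spend < lvl * budget <= new_spend]
```

Each returned level would then fan out to the webhook, email, and in-platform channels.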
🔁
Provider Fallback Routing
If the selected provider is unavailable or over its rate limit, the router switches to the next cheapest capable provider automatically. On-premises deployments can route to a local model with zero marginal cost.
Multi-provider: OpenAI · Anthropic · Azure · Local
🔒
Complexity Score Gate
Before dispatching to premium tier, a complexity classifier re-evaluates the query. If the score does not meet the premium threshold, the model is downgraded to mid tier and the query is re-processed. Premium is never triggered by default.
Complexity scored on 0-100 scale · Premium gate at 80+
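The gate could be sketched as a re-check in front of premium dispatch. The 0-100 scale and the 80+ gate are from the page; the downgrade target is an assumption:

```python
PREMIUM_GATE = 80   # premium gate at a complexity score of 80+, per the page

def gate_premium(requested_tier, rescore):
    """Re-evaluate complexity just before premium dispatch. rescore is a
    callable returning a fresh 0-100 complexity score; if it misses the
    gate, the query is downgraded to mid tier instead."""
    if requested_tier != "premium":
        return requested_tier           # gate applies only to premium requests
    return "premium" if rescore() >= PREMIUM_GATE else "mid"
```

Requiring a second, independent score to agree is what makes "premium is never triggered by default" enforceable rather than aspirational.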
AI Provider Flexibility

Not locked to one provider. Ever.

HealthCopilot abstracts the AI layer. Providers are configured at the tenant level. Routing logic compares cost, latency, and capability before each dispatch.

OpenAI
GPT-4o · GPT-4o-mini · GPT-3.5
Default provider for most tenants. Full tier coverage from micro to premium.
Anthropic
Claude Opus · Sonnet · Haiku
Strong alternative for clinical reasoning tasks. Configurable as primary or fallback.
Azure OpenAI
Deployed models in client tenant
For clients who require data residency. Models run within the client's Azure subscription.
Local
Llama · Mistral · Custom
On-premises deployments can route to a locally-hosted model. Zero per-token cost. Meets maximum data control requirements.
How the router chooses a provider for each call
1. Check circuit breaker status for each provider
2. Match required model tier to available provider endpoints
3. Score candidates on cost-per-token and recent latency
4. Dispatch to cheapest capable endpoint. Log cost attribution.
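The four steps above can be sketched in a few lines. The provider records, prices, and the cost/latency weighting are all illustrative:

```python
def choose_provider(providers, tier):
    """Sketch of the per-call provider choice: drop tripped breakers, keep
    tier-capable endpoints, then score on cost plus recent latency. The
    0.0001/ms latency weight is an assumed trade-off, not a product value."""
    live = [p for p in providers if not p["breaker_open"]]            # step 1
    capable = [p for p in live if tier in p["tiers"]]                 # step 2
    if not capable:
        return None                        # caller falls back to deterministic
    # steps 3-4: score and dispatch to the cheapest capable endpoint
    return min(capable, key=lambda p: p["cost_per_1k"][tier] + 0.0001 * p["latency_ms"])

providers = [
    {"name": "openai", "breaker_open": False, "tiers": {"micro", "mid", "premium"},
     "cost_per_1k": {"micro": 0.0005, "mid": 0.003, "premium": 0.03}, "latency_ms": 400},
    {"name": "local", "breaker_open": False, "tiers": {"micro"},
     "cost_per_1k": {"micro": 0.0}, "latency_ms": 100},
]
```

A `None` result is itself a routing outcome: the platform answers deterministically rather than failing the query.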
Cost Observability

Every token spend is logged and attributed.

Ops teams have full visibility into inference spend at query, session, member, intent, and tenant level. Anomalies surface before they become problems.

AI Cost Dashboard
$0.031 · Avg cost/session
82% · Micro tier usage
4.1% · Premium tier usage
Intent Type         | Model Tier    | Tokens | Cost
Claim status        | Deterministic | 0      | $0.000
Benefit explanation | Micro         | 412    | $0.001
Rejection rationale | Mid           | 1,842  | $0.009
Clinical dispute    | Premium       | 3,210  | $0.048
Provider lookup     | Deterministic | 0      | $0.000
📊
Per-Query Cost Attribution
Every inference is tagged with member ID, intent type, model used, token count, and cost. Spend is fully traceable and auditable.
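An attribution record carrying the fields listed above might be shaped like this sketch (field names are illustrative):

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class CostRecord:
    """One attribution record per inference, carrying the fields the page
    lists: member, intent, model, token count, and cost."""
    member_id: str
    intent: str
    model_tier: str
    tokens: int
    cost_usd: float

# Emit one JSON line per inference to the audit log / cost stream.
rec = CostRecord("m-1", "rejection_rationale", "mid", 1842, 0.009)
line = json.dumps(asdict(rec))
```

One structured line per inference is what makes every downstream view — trends, model mix, per-tenant billing — a query rather than a reconstruction.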
📈
Trend and Anomaly Detection
Rolling spend windows detect sudden spikes. If premium tier usage exceeds normal variance, the system flags it before manual review catches it.
🗂️
Model Mix Reports
Daily and monthly reports show what percentage of volume was handled by each tier. These are the primary levers for cost tuning over time.
📤
Export and Billing Integration
Cost data is exportable per billing period. For multi-tenant platform deployments, per-tenant cost allocation is available for pass-through billing.
AI Cost Optimization

Built for finance-conscious enterprise.

The five-layer routing stack, the six cost guardrails, and the full observability layer are not add-ons. They are core to the platform architecture. Every deployment ships with them active from day one.