01 Introduction
The Problem: Retrieval That Doesn't Understand Context
Every retrieval-augmented generation system faces the same fundamental problem: how do you retrieve the right context for a given query?
The standard answer — embed everything, search by cosine similarity — works until it doesn't. When your knowledge base covers multiple domains or specialties, cosine similarity has no way to prefer "surgical recovery guidelines" over "chronic disease management tips" for a patient heading into knee replacement surgery. Both might be semantically similar to the query. Only one is relevant.
Graph RAG (Microsoft, 2024) solved this by building a knowledge graph, partitioning it into communities, and having an LLM summarize each community. Retrieval then respects community structure — related knowledge stays together. It works. But it is expensive at volume, takes seconds per query, requires an LLM at every step, and produces non-deterministic outputs.
For regulated industries — healthcare, finance, legal — that's disqualifying.
Prior authorization alone processes 53 million requests per year in the US. Each requires retrieval that respects clinical context. CMS-0057-F (effective January 2027) mandates electronic processing with transparent decision rationale and 72-hour turnaround for urgent requests. A system that takes 3-5 seconds per query, returns different results each time, and can only explain its decisions as "the LLM decided" does not meet this bar.
We built a system that does.
02 Why LLM-Based Graph RAG Can't Serve Regulated Industries
The Four Walls
Healthcare systems, financial institutions, and legal operations deploying RAG-based retrieval hit four constraints simultaneously:
| Constraint | What it means | Why LLM-based RAG fails |
|---|---|---|
| Latency | Inline decisions in clinical workflows, not async queues | 3-5 seconds per LLM call; prior auth needs sub-100ms |
| Determinism | Same input must produce same output for audit compliance | Temperature > 0, sampling, API versioning all break reproducibility |
| Cost at volume | 100K+ queries per month at sustainable cost | $0.01-0.10 per LLM call compounds to $1K-10K/month for routing alone |
| Calibrated confidence | The system must know when to escalate — and prove that decision is reliable | LLM self-assessment ("I'm confident") has no formal guarantee |
These aren't nice-to-haves. HIPAA requires audit trails with reproducible decisions. CMS-0057-F requires transparent rationale for every prior authorization determination. Financial regulators require model governance with explainable outputs. A system that can't guarantee the same output on the same input fails compliance review before it reaches production.
What Graph RAG Gets Right — and What It Costs
Microsoft's Graph RAG solved a real problem: retrieval that respects community structure in knowledge bases. The insight is sound — related knowledge should stay together during retrieval.
But the implementation requires an LLM at every step:
| Step | What it does | What it costs |
|---|---|---|
| Entity extraction | LLM reads every document, extracts entities | Hundreds of LLM calls at index time |
| Community detection | Leiden algorithm partitions the graph | Hard partitions — each node in exactly one community |
| Query routing | LLM decides which community to search | Another LLM call per query |
| Answer synthesis | LLM summarizes retrieved context | Another LLM call per query |
| "Should I retrieve more?" | LLM self-assesses confidence | Uncalibrated — no formal guarantee |
Total: 3-5 seconds per query, non-deterministic outputs, hard community boundaries that lose information, and no way to prove the confidence assessment is reliable.
The Insight: You Already Have Communities
If you have a Bayesian inference pipeline — and many production ML systems in healthcare, finance, and recommendations do — you already have everything Graph RAG provides:
Soft communities from latent factor models (NMF, topic models, embeddings). Each knowledge chunk belongs to multiple communities with different affinities, not just one. A chunk about "elderly surgical patient with cardiac comorbidities" lives partially in the Surgical community, partially in the Cardiac community, and partially in Geriatrics. Hard partitions force a choice that loses information. Soft partitions preserve it.
Structured retrieval from evidence graphs built on co-occurrence statistics. No LLM needed to extract entities — the relationships already exist in the data.
Calibrated escalation from formal confidence scoring. Not "the model thinks it's confident" — a 4-signal gating mechanism validated with zero reversals on real clinical data.
The question isn't whether these components can replace Graph RAG. It's why we've been paying for LLM calls to do what Bayesian inference already does faster, cheaper, and with formal guarantees.
03 The Architecture: Three Components, Zero LLM Calls
The system replaces every LLM step in Graph RAG with an existing Bayesian component:
| Graph RAG (standard) | This system |
|---|---|
| LLM entity extraction | Evidence graph (pre-built from co-occurrence weights) |
| Leiden community detection | Soft community affinities (probabilistic, multi-membership) |
| LLM query routing | Bayesian posterior inference |
| LLM answer synthesis | Deterministic state machine |
| LLM "should I retrieve more?" | Calibrated confidence tiers (formally validated) |

Component 1: Community-Biased Retrieval
Standard cosine search retrieves by semantic similarity alone. Our system blends semantic similarity with community membership — biasing retrieval toward chunks that live in the same clinical (or domain-specific) neighborhood as the query.
The community signal comes from a Bayesian posterior computed over the query's structured inputs (diagnosis codes, procedure codes, clinical features). This posterior tells the retrieval system which domain context the query lives in, and retrieval preferentially surfaces chunks with high affinity to that context.
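The blend itself is a one-line score combination. A minimal sketch, with an illustrative function name, toy data, and a fixed seed (not the production implementation):

```python
import numpy as np

def blended_scores(query_vec, chunk_vecs, chunk_affinities, posterior, bias=0.5):
    """Blend semantic similarity with community affinity.

    query_vec:        (d,) unit-normalized query embedding
    chunk_vecs:       (N, d) unit-normalized chunk embeddings
    chunk_affinities: (N, K) soft community membership per chunk
    posterior:        (K,) Bayesian posterior over communities for this query
    bias:             0.0 = pure cosine, 1.0 = pure community signal
    """
    cosine = chunk_vecs @ query_vec            # (N,) semantic similarity
    community = chunk_affinities @ posterior   # (N,) affinity to the query's context
    return (1.0 - bias) * cosine + bias * community

# Toy data with a fixed seed, so the ranking is reproducible
rng = np.random.default_rng(0)
q = rng.normal(size=8)
q /= np.linalg.norm(q)
chunks = rng.normal(size=(5, 8))
chunks /= np.linalg.norm(chunks, axis=1, keepdims=True)
affinities = rng.dirichlet(np.ones(3), size=5)  # each chunk: distribution over 3 communities
posterior = np.array([0.7, 0.2, 0.1])           # query lives mostly in community 0
top3 = np.argsort(-blended_scores(q, chunks, affinities, posterior))[:3]
```

At `bias=0.0` this degenerates to plain cosine search; at `bias=1.0` it ranks purely by community affinity, which is the sweep reported in the benchmark tables below.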
Why this matters: On easy queries, community bias adds nothing — cosine already gets the right results. On hard queries (ambiguous inputs, conflicting signals from multiple domains), community bias lifts Precision@5 by 63.8%. The value shows up exactly where it should: when the query is ambiguous and pure semantic similarity isn't enough.
Component 2: Evidence Accumulation Loop
When the system is uncertain, it doesn't guess — it gathers more evidence. The agentic loop retrieves knowledge chunks, extracts structured features from them, and feeds those features back into the Bayesian posterior. This isn't "reasoning" or "chain-of-thought" — it's formal evidence accumulation that sharpens the posterior distribution.
The benchmark proves this works: 87.5% of uncertain cases upgraded to HIGH confidence after retrieval augmentation. The system retrieves context not just to find relevant documents, but to become more certain about its own assessment.
This breaks the circular criticism of agentic RAG ("you need to understand the question to retrieve relevant context"). The retrieval step extracts concrete, structured evidence — clinical features, domain indicators, contextual signals — that the Bayesian update formula incorporates directly. More concordant evidence → sharper posterior → higher confidence. The math guarantees it.
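The update step can be sketched as repeated Bayes-rule multiplication, assuming each extracted feature contributes a per-community likelihood (the numbers below are hypothetical):

```python
import numpy as np

def update_posterior(prior, feature_likelihoods):
    """Fold extracted evidence into the belief: multiply each feature's
    per-community likelihood into the current posterior and renormalize."""
    posterior = np.asarray(prior, dtype=float).copy()
    for likelihood in feature_likelihoods:
        posterior *= likelihood
        posterior /= posterior.sum()
    return posterior

# Ambiguous prior: split between Surgical (index 0) and Cardiac (index 1)
prior = np.array([0.45, 0.45, 0.10])
# Two features extracted from retrieved chunks, both favoring Surgical
evidence = [np.array([0.80, 0.15, 0.05]),
            np.array([0.70, 0.20, 0.10])]
posterior = update_posterior(prior, evidence)
# Concordant evidence sharpens the posterior toward Surgical
```

Starting from a 45/45 split, two concordant features push the Surgical mass above 0.9, which is exactly the MEDIUM-to-HIGH upgrade path described above.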
Component 3: Calibrated Escalation
The agent follows a deterministic state machine driven by formally validated confidence tiers:
HIGH confidence → Route directly (the system has enough evidence)
MEDIUM confidence → Retrieve additional evidence, re-assess, then route
LOW confidence → Escalate to human review (insufficient signal to decide)

Every agentic RAG system (LangChain, LlamaIndex, AutoGPT) uses an LLM to make this decide-retrieve-act decision. The LLM's "confidence" is uncalibrated self-assessment — research consistently shows LLM confidence is poorly calibrated on domain-specific tasks.
Our confidence tiers are backed by the Confidence Gate Theorem (CGT), validated with zero reversals on MIMIC-IV clinical data (10,000 encounters). When the system says HIGH, the prediction is reliably accurate. When it says LOW, it reliably escalates. Every threshold tested produces equal or better accuracy — no danger zones, no surprises.
For regulated industries, this is the critical differentiator. A compliance officer can audit the confidence scoring, verify the threshold calibration, and confirm that escalation decisions are formally grounded — not based on an LLM's self-assessment that no one can test or guarantee.
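The routing logic is small enough to read in full. A sketch with illustrative thresholds (the real deployment uses the CGT-validated values, not these):

```python
from enum import Enum

class Tier(Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"

def assess(score, high=0.85, low=0.55):
    """Map a confidence score to a tier; thresholds here are illustrative."""
    if score >= high:
        return Tier.HIGH
    if score >= low:
        return Tier.MEDIUM
    return Tier.LOW

def route(score, retrieve_and_rescore):
    """Deterministic decide-retrieve-act loop with a single retrieval pass.
    retrieve_and_rescore stands in for evidence accumulation: it returns
    an updated confidence score after retrieval augmentation."""
    tier = assess(score)
    if tier is Tier.HIGH:
        return ("route", tier)
    if tier is Tier.MEDIUM:
        tier = assess(retrieve_and_rescore(score))
        if tier is Tier.HIGH:
            return ("route", tier)
    return ("escalate", tier)
```

Because `assess` and `route` are pure functions of their inputs, the same query always follows the same path, which is what makes the decision trail auditable.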
04 Benchmark Results: MIMIC-IV Clinical Data
All benchmarks run on MIMIC-IV-trained artifacts: 3,461 ICD-10 codes, 12 detected pathways, 2,604 clinical features.
Queries at three difficulty levels:
- Easy: Strong, unambiguous clinical signal
- Medium: Sparse or ambiguous signal
- Hard: Deliberately conflicting signals from competing clinical domains
Result 1: Community Bias Lifts Retrieval Where It Matters
| Community Bias | Easy P@5 | Medium P@5 | Hard P@5 | Overall P@5 |
|---|---|---|---|---|
| None (cosine only) | 1.000 | 0.843 | 0.302 | 0.715 |
| Low (0.3) | 1.000 | 0.841 | 0.367 | 0.736 |
| Balanced (0.5) | 1.000 | 0.850 | 0.405 | 0.752 |
| High (0.7) | 1.000 | 0.824 | 0.460 | 0.761 |
| Full (1.0) | 1.000 | 0.791 | 0.495 | 0.762 |
The pattern: On easy queries, community bias adds nothing — cosine already achieves perfect precision. On hard queries with conflicting signals, community bias lifts Precision@5 from 0.302 to 0.495 — a 63.8% improvement. The value of structured retrieval shows up exactly where standard search fails.
Note the medium-query tradeoff: full community bias (1.0) slightly hurts medium queries by over-constraining retrieval when the community signal itself is uncertain. The balanced setting (0.5) optimizes across all difficulty levels.
Result 2: The Agentic Loop Resolves Uncertainty
| Difficulty | HIGH Confidence | MEDIUM | LOW | Retrieval Triggered |
|---|---|---|---|---|
| Easy | 100% | 0% | 0% | 0 (not needed) |
| Medium | 84.5% | 3.6% | 11.9% | 13 |
| Hard | 98.8% | 1.2% | 0% | 1 |
When retrieval is triggered (MEDIUM confidence cases), the system retrieves evidence, extracts features, updates the posterior, and re-assesses. 28 of 32 retrieval triggers resulted in MEDIUM→HIGH upgrades — an 87.5% success rate.
The system knows when it doesn't know enough, gathers specific evidence to resolve the uncertainty, and proves that the evidence actually helped. This is auditable evidence accumulation, not LLM "reasoning."
Result 3: 1000x Latency Advantage
| System | Mean Latency | P95 | P99 | Under 100ms |
|---|---|---|---|---|
| This system | 2.65ms | 5.14ms | 27.7ms | 100% |
| Microsoft Graph RAG | ~3,000-5,000ms | ~8,000ms | ~12,000ms | ~0% |
| LangChain Agentic RAG | ~2,000-4,000ms | ~6,000ms | ~10,000ms | ~0% |
| LlamaIndex ReAct Agent | ~1,500-3,000ms | ~5,000ms | ~8,000ms | ~0% |
Competitor latencies are published estimates for typical deployments, not head-to-head benchmarks on identical queries. Our latency is measured on the MIMIC-IV benchmark (252 queries, single CPU core).
Every query completes in under 100ms. The P99 is 27.7ms. No API calls, no token costs, no rate limits, no retry logic. Pure numerical computation on a single CPU core.
For a healthcare system processing 100K prior authorization requests per month, this is the difference between inline processing (2.65ms) and async queuing infrastructure (3,000ms+). At $0.01-0.10 per LLM call, the cost difference at volume is $1K-10K/month — just for the retrieval routing layer.
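The per-month arithmetic behind that range, using the per-call pricing cited above (kept in integer cents to avoid float rounding):

```python
queries_per_month = 100_000
cost_per_call_cents = (1, 10)  # $0.01-0.10 per LLM call, as cited above

monthly_usd_low = queries_per_month * cost_per_call_cents[0] / 100   # $1,000/month
monthly_usd_high = queries_per_month * cost_per_call_cents[1] / 100  # $10,000/month
```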
Result 4: Robust Under Adversarial Conditions
| Difficulty | Accuracy | vs Random (8.3%) | Lift |
|---|---|---|---|
| Easy (strong signal) | 100.0% | +91.7 pp | 12.0x |
| Medium (ambiguous signal) | 85.7% | +77.4 pp | 10.3x |
| Hard (conflicting signals) | 48.8% | +40.5 pp | 5.9x |
| Overall | 78.2% | +69.8 pp | 9.4x |
Even with deliberately conflicting inputs (codes from competing clinical domains injected into the same query), the system correctly identifies the target pathway 48.8% of the time — 5.9x better than random across 12 pathways. The evidence accumulation mechanism resolves conflicting signals by weighing contextual features that disambiguate the domain.
05 What No One Else Can Claim
Soft Communities vs. Hard Partitions
Every Graph RAG system today uses the Leiden algorithm (or Louvain, or similar) to partition knowledge graphs into communities. These are hard partitions: each node belongs to exactly one community.
Our system uses soft community membership: each knowledge chunk has a probability distribution over all communities. A chunk about "elderly surgical patient with cardiac comorbidities" belongs partially to the Surgical community, partially to Cardiac, and partially to Geriatrics — with specific, learned affinities.
This matters because real-world knowledge is inherently multi-faceted. A clinical guideline about post-surgical cardiac monitoring is relevant to both surgical and cardiac queries. Hard partitions force it into one community. Soft membership preserves all relevant associations and lets the query's context determine which affinity matters most.
No existing Graph RAG system uses soft community structure. Microsoft's Graph RAG, RAPTOR, and all derivatives use hard partitions. This is, to our knowledge, the first demonstration that probabilistic community membership can serve the same structural role in retrieval — with strictly greater expressiveness.
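The information loss from hard partitioning is easy to see numerically. A sketch with hypothetical affinities for the chunk described above:

```python
import numpy as np

communities = ["Surgical", "Cardiac", "Geriatrics"]
# Soft membership for the chunk: a distribution, not a single label
affinity = np.array([0.55, 0.30, 0.15])

# Hard partitioning (Leiden-style) would keep only the argmax community
hard_label = communities[int(affinity.argmax())]  # "Surgical"

# A cardiac-leaning query still finds the chunk under soft membership...
query_posterior = np.array([0.10, 0.80, 0.10])
soft_relevance = float(affinity @ query_posterior)
# ...but under a hard partition only the query's mass on "Surgical" counts
hard_relevance = float(query_posterior[communities.index(hard_label)])
```

Here the soft score (0.31) triples the hard score (0.10): the chunk's 0.30 Cardiac affinity, which the hard partition discards, is exactly what the cardiac query needs.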
Calibrated Confidence vs. LLM Self-Assessment
Every agentic RAG system relies on an LLM to decide "do I know enough, or should I retrieve more?" This decision is based on the LLM's self-assessed confidence — which research consistently shows is poorly calibrated, especially on domain-specific tasks. There is no formal guarantee that "I'm confident" correlates with accuracy.
Our confidence tiers are derived from a multi-signal gating mechanism validated by the Confidence Gate Theorem. On MIMIC-IV (10,000 real hospital encounters), the system achieves:
- Zero reversals across the entire confidence range — every threshold tested produces equal or better accuracy
- ECE of 0.032 — when the system says 80% confident, it's right about 80% of the time
- Clean monotonic ordering — HIGH confidence cases are reliably more accurate than MEDIUM, which are reliably more accurate than LOW
No existing agentic RAG system has published formally validated confidence calibration. This is the difference between "the system seems confident" and "we can prove the confidence signal is reliable" — a distinction that matters for regulatory compliance, clinical safety, and audit defensibility.
Deterministic Outputs
The system is fully deterministic: same input produces identical output, every time. This is a requirement for regulated industries:
- Healthcare (HIPAA): Audit trails require reproducible decisions
- Financial services: Model governance requires explainable, repeatable outputs
- Legal: Inconsistent outputs undermine credibility
Every LLM-based agentic system is inherently non-deterministic (temperature, sampling, API versioning). Our system is a numerical computation with fixed parameters producing identical floating-point results — auditable, reproducible, and compliant by construction.
Scope and Limitations
To be clear about what this system does and doesn't do:
- This is not a general-purpose QA system. There is no natural language generation. The system classifies queries into communities, retrieves relevant knowledge, and reports calibrated confidence. An LLM can be layered on top for answer synthesis — the retrieval and routing layer doesn't need one.
- This is not better than LLM-based Graph RAG on open-ended questions. If you need narrative synthesis from multiple documents, you need an LLM. This system is built for structured decision support: routing, classification, triage, and retrieval in domains where speed, determinism, and auditability matter more than fluency.
- The benchmark uses MIMIC-IV clinical artifacts (3,461 ICD-10 codes, 2,604 clinical features). Validation on free-text knowledge bases is planned but not yet benchmarked.
06 Deployment Profile
What a Deployment Looks Like
The system processes a structured query through three stages: detect the domain context, assess confidence, and route accordingly — with optional retrieval augmentation for uncertain cases.
For a clear case (strong, unambiguous signal):
Input: Structured query with diagnosis/procedure codes + clinical context
Stage 1 — DETECT: Bayesian posterior over domain communities (0.3ms)
Stage 2 — ASSESS: Multi-signal confidence scoring → HIGH (0.01ms)
Stage 3 — RETRIEVE: Community-biased search returns top-K results (1.5ms)
Output: Ranked results + community assignment + confidence tier + evidence summary
Total: ~1.8ms, deterministic, fully auditable

For an ambiguous case (conflicting or sparse signals):
Input: Structured query with mixed signals from competing domains
Stage 1 — DETECT: Bayesian posterior is split across communities → MEDIUM confidence
Stage 2 — RETRIEVE: Community-biased search returns chunks; features extracted
Stage 3 — UPDATE: Extracted features fed back into Bayesian posterior → confidence sharpens
Stage 4 — RESOLVE: Updated posterior resolves to HIGH confidence → route with evidence
Output: Same structured output, with retrieval augmentation logged in audit trail
Total: ~4-6ms, still deterministic, every step traceable

Integration Points
The system is designed to sit inside existing infrastructure as a retrieval and routing engine:
- API: FastAPI endpoint accepts structured queries, returns routing decisions with confidence scores and evidence summaries
- Models: Pretrained community embeddings and evidence graph weights, loadable from standard artifact files
- Configuration: Confidence thresholds, routing rules, and escalation criteria are configurable per deployment
- Dependencies: numpy, FAISS (optional). No LLM SDK, no API keys, no GPU required for inference
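As a shape sketch only (field names, codes, and the placeholder logic are hypothetical; the real FastAPI schema is deployment-specific), the request/response contract looks roughly like:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class RouteRequest:
    diagnosis_codes: List[str]                 # e.g. ICD-10
    procedure_codes: List[str]                 # e.g. CPT
    clinical_features: Dict[str, float] = field(default_factory=dict)

@dataclass
class RouteResponse:
    community: str              # detected domain community
    confidence_tier: str        # HIGH / MEDIUM / LOW
    confidence_score: float     # calibrated score behind the tier
    top_evidence: List[str]     # features that drove the decision
    retrieval_augmented: bool   # whether the evidence loop ran (audit trail)

def handle(req: RouteRequest) -> RouteResponse:
    # Placeholder logic standing in for the Bayesian pipeline
    score = 0.9 if req.diagnosis_codes else 0.4
    tier = "HIGH" if score >= 0.85 else "LOW"
    return RouteResponse("Surgical", tier, score, req.diagnosis_codes[:3], False)
```

Every field in the response maps to an item in the CMS-0057-F transparency row of the table below: community assignment, confidence score, evidence features, and whether augmentation was logged.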
What's Needed for a Pilot
- Historical data for community model training (minimum ~10K structured records)
- Integration point with the existing query intake system
- Domain review of learned communities and confidence thresholds
- Parallel run comparing system routing against existing manual or LLM-based process
The system trains in hours on a single GPU, deploys on CPU, and runs at sub-10ms latency with zero external dependencies.
Regulatory Alignment
| Requirement | How the system addresses it |
|---|---|
| CMS-0057-F: Electronic PA processing | Sub-100ms API endpoint, FHIR-compatible request/response |
| CMS-0057-F: Decision transparency | Every decision includes community assignment, confidence score, top evidence features, and routing rationale |
| CMS-0057-F: Timely decisions | 2.65ms mean vs. 48-hour manual average |
| HIPAA: Reproducible audit trails | Deterministic — same input, same output, every time |
| HIPAA: On-premises deployment | No external API calls. Runs entirely within the organization's infrastructure |
| State gold carding laws | Built-in provider performance tracking with configurable auto-approval criteria |
Resources
- Paper: Confidence Gate Theorem (arXiv 2603.09947)
- governed-rank: github.com/rdoku/governed-rank
- Contact: ronald@haskelabs.com
07 Conclusion
Graph RAG solved a real problem: retrieval that respects community structure in knowledge bases. But the implementation — LLMs at every step — introduced cost, latency, and non-determinism that make it impractical for regulated, real-time, or high-volume systems.
We show that the same capability can be achieved with existing Bayesian infrastructure:
- Soft community membership replaces hard partitions — strictly more expressive, probabilistic, and directly composable with Bayesian posteriors
- Community-biased retrieval replaces LLM-guided search — blending semantic similarity with domain context in a single retrieval operation
- Calibrated confidence tiers replace LLM self-assessment — formally validated with zero reversals on 10,000 real clinical encounters
- Deterministic routing replaces LLM "reasoning" — same input, same output, every time, with a complete audit trail
The benchmark on MIMIC-IV clinical data:
- 63.8% Precision@5 lift on adversarial queries
- 87.5% of uncertain cases resolved via evidence accumulation
- 2.65ms mean latency (1000x faster than LLM-based alternatives)
- 9.4x pathway detection accuracy over random baseline
- 100% of queries under 100ms
The prior authorization market alone processes 53 million requests per year. CMS-0057-F mandates electronic processing with transparent rationale by January 2027. The industry needs retrieval systems that are fast enough for inline decisions, deterministic enough for audit compliance, and smart enough to know when to escalate — with proof that the escalation decision is reliable.
This is the architecture for that problem.
What if Graph RAG didn't need an LLM — and was 1000x faster?
Standard Graph RAG uses an LLM to extract entities, build a graph, partition communities, and summarize answers. We replace every LLM step with existing Bayesian infrastructure — achieving 2.65ms mean latency, deterministic outputs, and retrieval that actually improves under adversarial conditions. Built for healthcare prior authorization, clinical triage, and any domain where speed, auditability, and calibrated uncertainty are non-negotiable.