01 Introduction
The Problem: Retrieval That Doesn't Understand Context
Every retrieval-augmented generation system faces the same fundamental problem: how do you retrieve the right context for a given query?
The standard answer — embed everything, search by cosine similarity — works until it doesn't. When your knowledge base covers multiple domains or specialties, cosine similarity has no way to prefer "surgical recovery guidelines" over "chronic disease management tips" for a patient heading into knee replacement surgery. Both might be semantically similar to the query. Only one is relevant.
Graph RAG (Microsoft, 2024) solved this by building a knowledge graph, partitioning it into communities, and having an LLM summarize each community. Retrieval then respects community structure — related knowledge stays together. It works. But it is expensive at volume, takes seconds per query, requires an LLM at every step, and produces non-deterministic outputs.
For regulated industries — healthcare, finance, legal — that's disqualifying.
Prior authorization alone processes 53 million requests per year in the US. Each requires retrieval that respects clinical context. CMS-0057-F (effective January 2027) mandates electronic processing with transparent decision rationale and 72-hour turnaround for urgent requests. A system that takes 3-5 seconds per query, returns different results each time, and can only explain its decisions as "the LLM decided" does not meet this bar.
We built a system that does.
02 Why LLM-Based Graph RAG Can't Serve Regulated Industries
The Four Walls
Healthcare systems, financial institutions, and legal operations deploying RAG-based retrieval hit four constraints simultaneously:
| Constraint | What it means | Why LLM-based RAG fails |
|---|---|---|
| Latency | Inline decisions in clinical workflows, not async queues | 3-5 seconds per LLM call; prior auth needs sub-100ms |
| Determinism | Same input must produce same output for audit compliance | Temperature > 0, sampling, API versioning all break reproducibility |
| Cost at volume | 100K+ queries per month at sustainable cost | $0.01-0.10 per LLM call compounds to $1K-10K/month for routing alone |
| Calibrated confidence | The system must know when to escalate — and prove that decision is reliable | LLM self-assessment ("I'm confident") has no formal guarantee |
These aren't nice-to-haves. HIPAA requires audit trails with reproducible decisions. CMS-0057-F requires transparent rationale for every prior authorization determination. Financial regulators require model governance with explainable outputs. A system that can't guarantee the same output on the same input fails compliance review before it reaches production.
What Graph RAG Gets Right — and What It Costs
Microsoft's Graph RAG solved a real problem: retrieval that respects community structure in knowledge bases. The insight is sound — related knowledge should stay together during retrieval.
But the implementation requires an LLM at every step:
| Step | What it does | What it costs |
|---|---|---|
| Entity extraction | LLM reads every document, extracts entities | Hundreds of LLM calls at index time |
| Community detection | Leiden algorithm partitions the graph | Hard partitions — each node in exactly one community |
| Query routing | LLM decides which community to search | Another LLM call per query |
| Answer synthesis | LLM summarizes retrieved context | Another LLM call per query |
| "Should I retrieve more?" | LLM self-assesses confidence | Uncalibrated — no formal guarantee |
Total: 3-5 seconds per query, non-deterministic outputs, hard community boundaries that lose information, and no way to prove the confidence assessment is reliable.
The Insight: You Already Have Communities
If you have a Bayesian inference pipeline — and many production ML systems in healthcare, finance, and recommendations do — you already have everything Graph RAG provides:
Soft communities from latent factor models (NMF, topic models, embeddings). Each knowledge chunk belongs to multiple communities with different affinities, not just one. A chunk about "elderly surgical patient with cardiac comorbidities" lives partially in the Surgical community, partially in the Cardiac community, and partially in Geriatrics. Hard partitions force a choice that loses information. Soft partitions preserve it.
Structured retrieval from evidence graphs built on co-occurrence statistics. No LLM needed to extract entities — the relationships already exist in the data.
Calibrated escalation from formal confidence scoring. Not "the model thinks it's confident" — a 4-signal gating mechanism validated with zero reversals on real clinical data.
The question isn't whether these components can replace Graph RAG. It's why we've been paying for LLM calls to do what Bayesian inference already does faster, cheaper, and with formal guarantees.
03 The Architecture: Three Components, Zero LLM Calls
The system replaces every LLM step in Graph RAG with an existing Bayesian component:
| Graph RAG (standard) | This system |
|---|---|
| LLM entity extraction | Evidence graph (pre-built from co-occurrence weights) |
| Leiden community detection | Soft community affinities (probabilistic, multi-membership) |
| LLM query routing | Bayesian posterior inference |
| LLM answer synthesis | Deterministic state machine |
| LLM "should I retrieve more?" | Calibrated confidence tiers (formally validated) |

Component 1: Community-Biased Retrieval
Standard cosine search retrieves by semantic similarity alone. Our system blends semantic similarity with community membership — biasing retrieval toward chunks that live in the same clinical (or domain-specific) neighborhood as the query.
The community signal comes from a Bayesian posterior computed over the query's structured inputs (diagnosis codes, procedure codes, clinical features). This posterior tells the retrieval system which domain context the query lives in, and retrieval preferentially surfaces chunks with high affinity to that context.
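The blend itself is a one-line score combination. A minimal sketch, with an illustrative function name, toy data, and a fixed seed (not the production implementation):

```python
import numpy as np

def blended_scores(query_vec, chunk_vecs, chunk_affinities, posterior, bias=0.5):
    """Blend semantic similarity with community affinity.

    query_vec:        (d,) unit-normalized query embedding
    chunk_vecs:       (N, d) unit-normalized chunk embeddings
    chunk_affinities: (N, K) soft community membership per chunk
    posterior:        (K,) Bayesian posterior over communities for this query
    bias:             0.0 = pure cosine, 1.0 = pure community signal
    """
    cosine = chunk_vecs @ query_vec            # (N,) semantic similarity
    community = chunk_affinities @ posterior   # (N,) affinity to the query's context
    return (1.0 - bias) * cosine + bias * community

# Toy data with a fixed seed, so the ranking is reproducible
rng = np.random.default_rng(0)
q = rng.normal(size=8)
q /= np.linalg.norm(q)
chunks = rng.normal(size=(5, 8))
chunks /= np.linalg.norm(chunks, axis=1, keepdims=True)
affinities = rng.dirichlet(np.ones(3), size=5)  # each chunk: distribution over 3 communities
posterior = np.array([0.7, 0.2, 0.1])           # query lives mostly in community 0
top3 = np.argsort(-blended_scores(q, chunks, affinities, posterior))[:3]
```

At `bias=0.0` this degenerates to plain cosine search; at `bias=1.0` it ranks purely by community affinity, which is the sweep reported in the benchmark tables below.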
Why this matters: On easy queries, community bias adds nothing — cosine already gets the right results. On hard queries (ambiguous inputs, conflicting signals from multiple domains), community bias lifts Precision@5 by 63.8%. The value shows up exactly where it should: when the query is ambiguous and pure semantic similarity isn't enough.
Component 2: Evidence Accumulation Loop
When the system is uncertain, it doesn't guess — it gathers more evidence. The agentic loop retrieves knowledge chunks, extracts structured features from them, and feeds those features back into the Bayesian posterior. This isn't "reasoning" or "chain-of-thought" — it's formal evidence accumulation that sharpens the posterior distribution.
The benchmark proves this works: 87.5% of uncertain cases upgraded to HIGH confidence after retrieval augmentation. The system retrieves context not just to find relevant documents, but to become more certain about its own assessment.
This breaks the circular criticism of agentic RAG ("you need to understand the question to retrieve relevant context"). The retrieval step extracts concrete, structured evidence — clinical features, domain indicators, contextual signals — that the Bayesian update formula incorporates directly. More concordant evidence → sharper posterior → higher confidence. The math guarantees it.
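The update step can be sketched as repeated Bayes-rule multiplication, assuming each extracted feature contributes a per-community likelihood (the numbers below are hypothetical):

```python
import numpy as np

def update_posterior(prior, feature_likelihoods):
    """Fold extracted evidence into the belief: multiply each feature's
    per-community likelihood into the current posterior and renormalize."""
    posterior = np.asarray(prior, dtype=float).copy()
    for likelihood in feature_likelihoods:
        posterior *= likelihood
        posterior /= posterior.sum()
    return posterior

# Ambiguous prior: split between Surgical (index 0) and Cardiac (index 1)
prior = np.array([0.45, 0.45, 0.10])
# Two features extracted from retrieved chunks, both favoring Surgical
evidence = [np.array([0.80, 0.15, 0.05]),
            np.array([0.70, 0.20, 0.10])]
posterior = update_posterior(prior, evidence)
# Concordant evidence sharpens the posterior toward Surgical
```

Starting from a 45/45 split, two concordant features push the Surgical mass above 0.9, which is exactly the MEDIUM-to-HIGH upgrade path described above.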
Component 3: Calibrated Escalation
The agent follows a deterministic state machine driven by formally validated confidence tiers:
HIGH confidence → Route directly (the system has enough evidence)
MEDIUM confidence → Retrieve additional evidence, re-assess, then route
LOW confidence → Escalate to human review (insufficient signal to decide)

Every agentic RAG system (LangChain, LlamaIndex, AutoGPT) uses an LLM to make this decide-retrieve-act decision. The LLM's "confidence" is uncalibrated self-assessment — research consistently shows LLM confidence is poorly calibrated on domain-specific tasks.
Our confidence tiers are backed by the Confidence Gate Theorem (CGT), validated with zero reversals on MIMIC-IV clinical data (10,000 encounters). When the system says HIGH, the prediction is reliably accurate. When it says LOW, it reliably escalates. Every threshold tested produces equal or better accuracy — no danger zones, no surprises.
For regulated industries, this is the critical differentiator. A compliance officer can audit the confidence scoring, verify the threshold calibration, and confirm that escalation decisions are formally grounded — not based on an LLM's self-assessment that no one can test or guarantee.
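The routing logic is small enough to read in full. A sketch with illustrative thresholds (the real deployment uses the CGT-validated values, not these):

```python
from enum import Enum

class Tier(Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"

def assess(score, high=0.85, low=0.55):
    """Map a confidence score to a tier; thresholds here are illustrative."""
    if score >= high:
        return Tier.HIGH
    if score >= low:
        return Tier.MEDIUM
    return Tier.LOW

def route(score, retrieve_and_rescore):
    """Deterministic decide-retrieve-act loop with a single retrieval pass.
    retrieve_and_rescore stands in for evidence accumulation: it returns
    an updated confidence score after retrieval augmentation."""
    tier = assess(score)
    if tier is Tier.HIGH:
        return ("route", tier)
    if tier is Tier.MEDIUM:
        tier = assess(retrieve_and_rescore(score))
        if tier is Tier.HIGH:
            return ("route", tier)
    return ("escalate", tier)
```

Because `assess` and `route` are pure functions of their inputs, the same query always follows the same path, which is what makes the decision trail auditable.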
04 Benchmark Results: MIMIC-IV Clinical Data
All benchmarks run on MIMIC-IV-trained artifacts: 3,461 ICD-10 codes, 12 detected pathways, 2,604 clinical features.
Queries at three difficulty levels:
- Easy: Strong, unambiguous clinical signal
- Medium: Sparse or ambiguous signal
- Hard: Deliberately conflicting signals from competing clinical domains
Result 1: Community Bias Lifts Retrieval Where It Matters
| Community Bias | Easy P@5 | Medium P@5 | Hard P@5 | Overall P@5 |
|---|---|---|---|---|
| None (cosine only) | 1.000 | 0.843 | 0.302 | 0.715 |
| Low (0.3) | 1.000 | 0.841 | 0.367 | 0.736 |
| Balanced (0.5) | 1.000 | 0.850 | 0.405 | 0.752 |
| High (0.7) | 1.000 | 0.824 | 0.460 | 0.761 |
| Full (1.0) | 1.000 | 0.791 | 0.495 | 0.762 |
The pattern: On easy queries, community bias adds nothing — cosine already achieves perfect precision. On hard queries with conflicting signals, community bias lifts Precision@5 from 0.302 to 0.495 — a 63.8% improvement. The value of structured retrieval shows up exactly where standard search fails.
Note the medium-query tradeoff: full community bias (1.0) slightly hurts medium queries by over-constraining retrieval when the community signal itself is uncertain. The balanced setting (0.5) optimizes across all difficulty levels.
Result 2: The Agentic Loop Resolves Uncertainty
| Difficulty | HIGH Confidence | MEDIUM | LOW | Retrieval Triggered |
|---|---|---|---|---|
| Easy | 100% | 0% | 0% | 0 (not needed) |
| Medium | 84.5% | 3.6% | 11.9% | 13 |
| Hard | 98.8% | 1.2% | 0% | 1 |
When retrieval is triggered (MEDIUM confidence cases), the system retrieves evidence, extracts features, updates the posterior, and re-assesses. 28 of 32 retrieval triggers resulted in MEDIUM→HIGH upgrades — an 87.5% success rate.
The system knows when it doesn't know enough, gathers specific evidence to resolve the uncertainty, and proves that the evidence actually helped. This is auditable evidence accumulation, not LLM "reasoning."
Result 3: 1000x Latency Advantage
| System | Mean Latency | P95 | P99 | Under 100ms |
|---|---|---|---|---|
| This system | 2.65ms | 5.14ms | 27.7ms | 100% |
| Microsoft Graph RAG | ~3,000-5,000ms | ~8,000ms | ~12,000ms | ~0% |
| LangChain Agentic RAG | ~2,000-4,000ms | ~6,000ms | ~10,000ms | ~0% |
| LlamaIndex ReAct Agent | ~1,500-3,000ms | ~5,000ms | ~8,000ms | ~0% |
Competitor latencies are published estimates for typical deployments, not head-to-head benchmarks on identical queries. Our latency is measured on the MIMIC-IV benchmark (252 queries, single CPU core).
Every query completes in under 100ms. The P99 is 27.7ms. No API calls, no token costs, no rate limits, no retry logic. Pure numerical computation on a single CPU core.
For a healthcare system processing 100K prior authorization requests per month, this is the difference between inline processing (2.65ms) and async queuing infrastructure (3,000ms+). At $0.01-0.10 per LLM call, the cost difference at volume is $1K-10K/month — just for the retrieval routing layer.
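The per-month arithmetic behind that range, using the per-call pricing cited above (kept in integer cents to avoid float rounding):

```python
queries_per_month = 100_000
cost_per_call_cents = (1, 10)  # $0.01-0.10 per LLM call, as cited above

monthly_usd_low = queries_per_month * cost_per_call_cents[0] / 100   # $1,000/month
monthly_usd_high = queries_per_month * cost_per_call_cents[1] / 100  # $10,000/month
```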
Result 4: Robust Under Adversarial Conditions
| Difficulty | Accuracy | vs Random (8.3%) | Lift |
|---|---|---|---|
| Easy (strong signal) | 100.0% | +91.7 pp | 12.0x |
| Medium (ambiguous signal) | 85.7% | +77.4 pp | 10.3x |
| Hard (conflicting signals) | 48.8% | +40.5 pp | 5.9x |
| Overall | 78.2% | +69.8 pp | 9.4x |
Even with deliberately conflicting inputs (codes from competing clinical domains injected into the same query), the system correctly identifies the target pathway 48.8% of the time — 5.9x better than random across 12 pathways. The evidence accumulation mechanism resolves conflicting signals by weighing contextual features that disambiguate the domain.
05 What No One Else Can Claim
Soft Communities vs. Hard Partitions
Every Graph RAG system today uses the Leiden algorithm (or Louvain, or similar) to partition knowledge graphs into communities. These are hard partitions: each node belongs to exactly one community.
Our system uses soft community membership: each knowledge chunk has a probability distribution over all communities. A chunk about "elderly surgical patient with cardiac comorbidities" belongs partially to the Surgical community, partially to Cardiac, and partially to Geriatrics — with specific, learned affinities.
This matters because real-world knowledge is inherently multi-faceted. A clinical guideline about post-surgical cardiac monitoring is relevant to both surgical and cardiac queries. Hard partitions force it into one community. Soft membership preserves all relevant associations and lets the query's context determine which affinity matters most.
No existing Graph RAG system uses soft community structure. Microsoft's Graph RAG, RAPTOR, and all derivatives use hard partitions. This is, to our knowledge, the first demonstration that probabilistic community membership can serve the same structural role in retrieval — with strictly greater expressiveness.
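The information loss from hard partitioning is easy to see numerically. A sketch with hypothetical affinities for the chunk described above:

```python
import numpy as np

communities = ["Surgical", "Cardiac", "Geriatrics"]
# Soft membership for the chunk: a distribution, not a single label
affinity = np.array([0.55, 0.30, 0.15])

# Hard partitioning (Leiden-style) would keep only the argmax community
hard_label = communities[int(affinity.argmax())]  # "Surgical"

# A cardiac-leaning query still finds the chunk under soft membership...
query_posterior = np.array([0.10, 0.80, 0.10])
soft_relevance = float(affinity @ query_posterior)
# ...but under a hard partition only the query's mass on "Surgical" counts
hard_relevance = float(query_posterior[communities.index(hard_label)])
```

Here the soft score (0.31) triples the hard score (0.10): the chunk's 0.30 Cardiac affinity, which the hard partition discards, is exactly what the cardiac query needs.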
Calibrated Confidence vs. LLM Self-Assessment
Every agentic RAG system relies on an LLM to decide "do I know enough, or should I retrieve more?" This decision is based on the LLM's self-assessed confidence — which research consistently shows is poorly calibrated, especially on domain-specific tasks. There is no formal guarantee that "I'm confident" correlates with accuracy.
Our confidence tiers are derived from a multi-signal gating mechanism validated by the Confidence Gate Theorem. On MIMIC-IV (10,000 real hospital encounters), the system achieves:
- Zero reversals across the entire confidence range — every threshold tested produces equal or better accuracy
- ECE of 0.032 — when the system says 80% confident, it's right about 80% of the time
- Clean monotonic ordering — HIGH confidence cases are reliably more accurate than MEDIUM, which are reliably more accurate than LOW
No existing agentic RAG system has published formally validated confidence calibration. This is the difference between "the system seems confident" and "we can prove the confidence signal is reliable" — a distinction that matters for regulatory compliance, clinical safety, and audit defensibility.
Deterministic Outputs
The system is fully deterministic: same input produces identical output, every time. This is a requirement for regulated industries:
- Healthcare (HIPAA): Audit trails require reproducible decisions
- Financial services: Model governance requires explainable, repeatable outputs
- Legal: Inconsistent outputs undermine credibility
Every LLM-based agentic system is inherently non-deterministic (temperature, sampling, API versioning). Our system is a numerical computation with fixed parameters producing identical floating-point results — auditable, reproducible, and compliant by construction.
Scope and Limitations
To be clear about what this system does and doesn't do:
- This is not a general-purpose QA system. There is no natural language generation. The system classifies queries into communities, retrieves relevant knowledge, and reports calibrated confidence. An LLM can be layered on top for answer synthesis — the retrieval and routing layer doesn't need one.
- This is not better than LLM-based Graph RAG on open-ended questions. If you need narrative synthesis from multiple documents, you need an LLM. This system is built for structured decision support: routing, classification, triage, and retrieval in domains where speed, determinism, and auditability matter more than fluency.
- The benchmark uses MIMIC-IV clinical artifacts (3,461 ICD-10 codes, 2,604 clinical features). Validation on free-text knowledge bases is planned but not yet benchmarked.
06 Deployment Profile
What a Deployment Looks Like
The system processes a structured query through three stages: detect the domain context, assess confidence, and route accordingly — with optional retrieval augmentation for uncertain cases.
For a clear case (strong, unambiguous signal):
Input: Structured query with diagnosis/procedure codes + clinical context
Stage 1 — DETECT: Bayesian posterior over domain communities (0.3ms)
Stage 2 — ASSESS: Multi-signal confidence scoring → HIGH (0.01ms)
Stage 3 — RETRIEVE: Community-biased search returns top-K results (1.5ms)
Output: Ranked results + community assignment + confidence tier + evidence summary
Total: ~1.8ms, deterministic, fully auditable

For an ambiguous case (conflicting or sparse signals):
Input: Structured query with mixed signals from competing domains
Stage 1 — DETECT: Bayesian posterior is split across communities → MEDIUM confidence
Stage 2 — RETRIEVE: Community-biased search returns chunks; features extracted
Stage 3 — UPDATE: Extracted features fed back into Bayesian posterior → confidence sharpens
Stage 4 — RESOLVE: Updated posterior resolves to HIGH confidence → route with evidence
Output: Same structured output, with retrieval augmentation logged in audit trail
Total: ~4-6ms, still deterministic, every step traceable

Integration Points
The system is designed to sit inside existing infrastructure as a retrieval and routing engine:
- API: FastAPI endpoint accepts structured queries, returns routing decisions with confidence scores and evidence summaries
- Models: Pretrained community embeddings and evidence graph weights, loadable from standard artifact files
- Configuration: Confidence thresholds, routing rules, and escalation criteria are configurable per deployment
- Dependencies: numpy, FAISS (optional). No LLM SDK, no API keys, no GPU required for inference
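As a shape sketch only (field names, codes, and the placeholder logic are hypothetical; the real FastAPI schema is deployment-specific), the request/response contract looks roughly like:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class RouteRequest:
    diagnosis_codes: List[str]                 # e.g. ICD-10
    procedure_codes: List[str]                 # e.g. CPT
    clinical_features: Dict[str, float] = field(default_factory=dict)

@dataclass
class RouteResponse:
    community: str              # detected domain community
    confidence_tier: str        # HIGH / MEDIUM / LOW
    confidence_score: float     # calibrated score behind the tier
    top_evidence: List[str]     # features that drove the decision
    retrieval_augmented: bool   # whether the evidence loop ran (audit trail)

def handle(req: RouteRequest) -> RouteResponse:
    # Placeholder logic standing in for the Bayesian pipeline
    score = 0.9 if req.diagnosis_codes else 0.4
    tier = "HIGH" if score >= 0.85 else "LOW"
    return RouteResponse("Surgical", tier, score, req.diagnosis_codes[:3], False)
```

Every field in the response maps to an item in the CMS-0057-F transparency row of the table below: community assignment, confidence score, evidence features, and whether augmentation was logged.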
What's Needed for a Pilot
- Historical data for community model training (minimum ~10K structured records)
- Integration point with the existing query intake system
- Domain review of learned communities and confidence thresholds
- Parallel run comparing system routing against existing manual or LLM-based process
The system trains in hours on a single GPU, deploys on CPU, and runs at sub-10ms latency with zero external dependencies.
Regulatory Alignment
| Requirement | How the system addresses it |
|---|---|
| CMS-0057-F: Electronic PA processing | Sub-100ms API endpoint, FHIR-compatible request/response |
| CMS-0057-F: Decision transparency | Every decision includes community assignment, confidence score, top evidence features, and routing rationale |
| CMS-0057-F: Timely decisions | 2.65ms mean vs. 48-hour manual average |
| HIPAA: Reproducible audit trails | Deterministic — same input, same output, every time |
| HIPAA: On-premises deployment | No external API calls. Runs entirely within the organization's infrastructure |
| State gold carding laws | Built-in provider performance tracking with configurable auto-approval criteria |
Resources
- Paper: Confidence Gate Theorem (arXiv 2603.09947)
- governed-rank: github.com/rdoku/governed-rank
- Contact: ronald@haskelabs.com
07 Conclusion
Graph RAG solved a real problem: retrieval that respects community structure in knowledge bases. But the implementation — LLMs at every step — introduced cost, latency, and non-determinism that make it impractical for regulated, real-time, or high-volume systems.
We show that the same capability can be achieved with existing Bayesian infrastructure:
- Soft community membership replaces hard partitions — strictly more expressive, probabilistic, and directly composable with Bayesian posteriors
- Community-biased retrieval replaces LLM-guided search — blending semantic similarity with domain context in a single retrieval operation
- Calibrated confidence tiers replace LLM self-assessment — formally validated with zero reversals on 10,000 real clinical encounters
- Deterministic routing replaces LLM "reasoning" — same input, same output, every time, with a complete audit trail
The benchmark on MIMIC-IV clinical data:
- 63.8% Precision@5 lift on adversarial queries
- 87.5% of uncertain cases resolved via evidence accumulation
- 2.65ms mean latency (1000x faster than LLM-based alternatives)
- 9.4x pathway detection accuracy over random baseline
- 100% of queries under 100ms
The prior authorization market alone processes 53 million requests per year. CMS-0057-F mandates electronic processing with transparent rationale by January 2027. The industry needs retrieval systems that are fast enough for inline decisions, deterministic enough for audit compliance, and smart enough to know when to escalate — with proof that the escalation decision is reliable.
This is the architecture for that problem.
What if Graph RAG didn't need an LLM — and was 1000x faster?
Standard Graph RAG uses an LLM to extract entities, build a graph, partition communities, and summarize answers. We replace every LLM step with existing Bayesian infrastructure — achieving 2.65ms mean latency, deterministic outputs, and retrieval that actually improves under adversarial conditions. Built for healthcare prior authorization, clinical triage, and any domain where speed, auditability, and calibrated uncertainty are non-negotiable.