
Tutorial: RAG Safety — Blocking Prompt Injection Without Breaking Retrieval

Attackers craft high-relevance documents with hidden prompt injections. This tutorial walks through the rag_safety notebook: how govern() removes injected documents from the citation window while preserving retrieval quality.

March 7, 2026 · 12 min read · governed-rank

All injected docs removed from top-10 with 65% quality retained

Retrieval-Augmented Generation (RAG) pipelines fetch documents from a corpus and feed them to an LLM as context. The retriever ranks documents by embedding similarity — how closely a document's vector matches the user's query. The top-ranked documents become the LLM's grounding: it reads them, trusts them, and cites them in its response.

This creates an attack surface. An adversary can plant documents in the corpus that score high on embedding similarity while containing prompt injection payloads — hidden instructions like "ignore previous instructions", "you are now DAN", or encoded system prompt overrides. This is the PoisonedRAG threat model: the user's query is benign, but the retrieved documents are malicious.

This tutorial walks through the rag_safety.ipynb notebook step by step. We generate 600 synthetic documents, show why the naive fix destroys retrieval quality, and demonstrate how govern() removes the threat with a single budget knob.

Why RAG is Different from Content Moderation

In content moderation, the correlation between engagement and toxicity is incidental — toxic content happens to be engaging because outrage drives clicks. In RAG safety, the correlation between relevance and risk is adversarial by design. Attackers deliberately craft documents that embed close to likely queries. They need high relevance to get retrieved, and they carry injection payloads that trigger risk detectors.

This distinction matters because it changes what any defense must accomplish. In content moderation, you are demoting content that is genuinely engaging — there is a real tradeoff. In RAG safety, the injected documents are not genuinely relevant in a useful sense. They are relevant only because the attacker designed them to be. The tradeoff is between removing the threat and preserving the ordering of the legitimate documents around it.

The Simulation: 600 Documents with PoisonedRAG

We generate 600 candidate documents: 500 legitimate and 100 injected with prompt injection patterns. The simulation captures two properties of real attacks:

Relevance scores (embedding similarity):

  • Legitimate docs are drawn from Beta(2, 3) — most have moderate relevance, few are very high. This matches real corpora where only a handful of documents are strong matches for any given query.
  • Injected docs are drawn from Beta(3, 2) — biased toward high relevance. A well-crafted injection document can score higher on embedding similarity than most legitimate documents.

Risk scores (PRRGate-style pattern detection):

  • We simulate a multi-signal risk detector that scores documents by counting matches across keyword families (instruction override, roleplay, encoding tricks) and structural markers (markdown fences, JSON blocks, base64 fragments).
  • The detection formula uses p = 1 - 0.6^m for keyword families and p = 1 - 0.7^m for structural markers, where m is the match count.
  • A quorum rule requires matches in at least 2 distinct families before flagging high risk. This reduces false positives on legitimate documents that happen to contain a single suspicious keyword (e.g., a document about "system prompts").
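The detection formula and quorum rule can be sketched as a small scoring function. This is a sketch, not the notebook's actual detector: `pattern_risk` is a hypothetical name, and the cap applied when the quorum fails is an assumption.

```python
def pattern_risk(m_families, m_struct):
    """Sketch of the PRRGate-style formula described above.

    m_families: number of distinct keyword families matched
    m_struct:   number of structural markers matched
    """
    p_kw = 1 - 0.6 ** m_families      # keyword-family signal
    p_struct = 1 - 0.7 ** m_struct    # structural-marker signal
    risk = 1 - (1 - p_kw) * (1 - p_struct)  # combine the two signals
    if m_families < 2:
        # Quorum rule: without matches in >= 2 distinct families, never
        # flag high risk. (The 0.3 cap is illustrative; the notebook's
        # exact handling may differ.)
        risk = min(risk, 0.3)
    return risk
```

A typical injected document with 3 keyword families and 2 structural markers scores `pattern_risk(3, 2) ≈ 0.894`, in line with the ~0.89 mean injected risk reported below, while a legitimate document with a single stray keyword stays capped at low risk.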
```python
import numpy as np
from mosaic import govern

np.random.seed(42)
n_legit, n_injected = 500, 100

# Legit docs: moderate relevance. Injected: biased high.
relevance_legit = np.random.beta(2, 3, n_legit)
relevance_injected = np.random.beta(3, 2, n_injected)
relevance = np.concatenate([relevance_legit, relevance_injected])

# PRRGate-style risk scoring. The notebook derives these from pattern
# matching; here the distributions stand in for the detector's output.
# Injected: 2-5 keyword families + structural markers -> high risk (~0.89)
# Legit: 0-1 families, no quorum -> low risk (~0.09)
risk_legit = np.random.beta(2, 20, n_legit)          # mean ~0.09
risk_injected = np.random.beta(20, 2.5, n_injected)  # mean ~0.89
risk = np.concatenate([risk_legit, risk_injected])

safety = 1.0 - risk  # steering signal: higher = safer
```

The measured correlation between relevance and risk is r = 0.326 — positive because the injected documents have both high relevance and high risk. The mean relevance of injected docs (0.591) is substantially higher than legitimate docs (0.401), and their mean risk (0.892) dwarfs the legitimate docs' risk (0.087).
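The positive relevance-risk correlation can be verified directly with a self-contained stand-in for the simulation (uniform low risk for legitimate docs, uniform high risk for injected ones). The exact value depends on the simulation details, but the sign and rough magnitude are robust:

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in scores mirroring the simulation: 500 legit, 100 injected
relevance = np.concatenate([rng.beta(2, 3, 500), rng.beta(3, 2, 100)])
risk = np.concatenate([rng.uniform(0.0, 0.2, 500), rng.uniform(0.8, 1.0, 100)])

# Positive, because injected docs are high on both axes
r = np.corrcoef(relevance, risk)[0, 1]
print(f"relevance-risk correlation: r = {r:.3f}")
```

The correlation exists purely because one subpopulation (the injected docs) sits high on both axes; within each group, relevance and risk are independent.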

The Threat: Injected Documents Dominate the Top-10

When we take the top 100 candidates by relevance (simulating a retrieval window) and rank them, injected documents flood the top positions:

| Rank | Doc | Relevance | Risk | Type |
|---|---|---|---|---|
| 1 | 78 | 0.965 | 0.066 | Legitimate |
| 2 | 523 | 0.946 | 0.921 | INJECTED |
| 3 | 562 | 0.931 | 0.863 | INJECTED |
| 4 | 517 | 0.916 | 0.873 | INJECTED |
| 5 | 539 | 0.912 | 0.919 | INJECTED |
| ... | ... | ... | ... | ... |

8 of the top 10 documents are injected. 2 are in the top 3 — the citation window where the LLM directly quotes and cites its sources. The attack succeeds: the LLM would ground its response on malicious documents that contain prompt injection payloads.

This is worse than the content moderation case (where 7 of 10 top posts were toxic) because the contamination is both more concentrated and more dangerous. A single injected document in the citation window can hijack the LLM's entire response.

Why Naive Penalties Fail

The obvious defense: subtract risk from relevance.

```python
penalty_weight = 0.5
# risk_lookup maps doc id -> detector risk score
naive_scores = {i: base_scores[i] - penalty_weight * risk_lookup[i]
                for i in base_scores}
```

This does remove the injected documents — their high risk scores (0.8–0.99) eat into their relevance advantage. Zero injected documents appear in the top 10 after penalization.

But the cost is steep. Because risk is correlated with relevance (r = 0.326), the penalty does not just move the bad documents down — it reshuffles the ordering of all documents. Documents that were confidently ranked by the retriever get reordered based on small differences in their risk scores, even when those risk differences are just noise from the detector.

| Method | Injected/3 | Injected/10 | Recall@10 | Tau | Quality |
|---|---|---|---|---|---|
| Base | 2 | 8 | 20% | 1.000 | 100.0% |
| Naive | 0 | 0 | 80% | 0.222 | 61.1% |

Kendall tau of 0.222 means that only 61.1% of the base retriever's pairwise orderings survived. Nearly 40% of the ranking decisions were overturned — not because they were wrong, but because the naive penalty could not distinguish "this document should move" from "this document is fine, leave it alone."
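The relationship between tau and "quality retained" used throughout the post is quality = (1 + tau) / 2, the fraction of pairwise orderings preserved. A minimal implementation makes the mapping concrete (a sketch assuming no tied scores; function names are illustrative):

```python
from itertools import combinations

def kendall_tau(order_a, order_b):
    """Kendall tau between two rankings of the same items (no ties assumed).

    order_a / order_b: rank position of item i in each ranking.
    """
    pairs = list(combinations(range(len(order_a)), 2))
    concordant = sum(
        1 for i, j in pairs
        if (order_a[i] - order_a[j]) * (order_b[i] - order_b[j]) > 0
    )
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)

def quality(tau):
    # Fraction of pairwise orderings that survive the re-ranking
    return (1 + tau) / 2
```

With tau = 0.222, `quality(0.222)` = 0.611: only 61.1% of the base retriever's pairwise decisions survive.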

The naive approach does achieve 80% Recall@10 (8 of the 10 best safe documents by base relevance are recovered into the top 10). This is actually higher than MOSAIC's Recall@10 because the naive penalty aggressively reshuffles everything, which happens to push many safe documents upward. But this aggressive reshuffling is precisely why tau collapses.

MOSAIC: Orthogonalize, Then Steer

govern() removes the relevance-risk correlation before steering. It mathematically subtracts the component of the safety signal that is aligned with the base relevance scores, leaving only the "new information" — the part that tells us something about safety that the relevance scores don't already capture.
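The orthogonalization step amounts to a least-squares projection: subtract from the steering vector its component along the (centered) base scores. A minimal sketch of the idea, not MOSAIC's actual implementation:

```python
import numpy as np

def orthogonalize(steer, base):
    """Remove the component of `steer` linearly explained by `base`."""
    base_c = base - base.mean()
    steer_c = steer - steer.mean()
    coef = (steer_c @ base_c) / (base_c @ base_c)  # projection coefficient
    ortho = steer_c - coef * base_c                # residual: the "new information"
    return ortho, coef
```

A negative coefficient, like the -0.7117 reported for this dataset, means the raw safety signal points against relevance: exactly the adversarial pattern where high-relevance documents carry high risk.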

```python
steer_weight = 0.5
# top_idx: indices of the top-100 candidates by base relevance
base_scores  = {int(i): float(relevance[i]) for i in top_idx}
steer_scores = {int(i): steer_weight * float(safety[i]) for i in top_idx}

result = govern(base_scores, steer_scores, budget=0.30)
```

Key diagnostics from the result:

  • Projection coefficient: −0.7117 — strongly negative. This is much larger in magnitude than the content moderation case (−0.1154). The negative sign means the safety signal was heavily anti-correlated with relevance — exactly what we expect when injected docs have both high relevance and high risk. Orthogonalization stripped out 71% of the safety signal's variance that was redundant with relevance. The remaining 29% is the genuinely new safety information.
  • Protected edges: 15 — the budget locked 30% of the most confident retrieval decisions.
  • Active constraints: 8 — at 8 positions, the safety signal wanted to reverse the ordering but the budget prevented it.

The result: zero injected documents in the top 3 or top 10, matching the naive approach. But Kendall tau is 0.306 versus naive's 0.222 — 65.3% versus 61.1% quality retained, a 4.2 percentage point gain.

Understanding the Quality Gap

Why is the quality difference moderate (65.3% vs 61.1%) rather than dramatic? Because the base ranking is heavily contaminated — 8 of the top 10 are injected. Any method that removes them must make large rank changes, which inherently reduces tau. The ceiling for quality retention is lower than in content moderation (where most of the top-10 were legitimate and only needed minor reordering).

MOSAIC's advantage is that it makes rank changes more surgically. The orthogonalized safety signal has zero correlation with relevance by construction, so it can only swap documents where the retriever is genuinely uncertain about their relative ordering. Naive penalty subtraction, by contrast, subtracts a correlated signal that interferes with confident retrieval decisions.

The Recall@10 metric tells a complementary story. Naive achieves 80% Recall@10 (recovers 8 of 10 best safe documents into the top 10) while MOSAIC achieves 60%. This is because MOSAIC moves fewer documents overall — some safe documents that were buried behind injected docs in the base ranking do not rise as far. The tradeoff is: naive recovers more safe documents but scrambles their ordering; MOSAIC recovers fewer but preserves the relative ordering of those it does include.
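Recall@10 as used here is the fraction of the 10 best safe documents (by base relevance) that end up in the final top 10. A sketch with an illustrative function name:

```python
def recall_at_k(final_order, best_safe_ids, k=10):
    """Fraction of the k best safe documents recovered into the final top k.

    final_order:   doc ids in final ranked order
    best_safe_ids: ids of the k best safe docs by base relevance
    """
    top_k = set(final_order[:k])
    return len(top_k & set(best_safe_ids)) / len(best_safe_ids)
```

Naive recovering 8 of the 10 best safe documents gives 0.8; MOSAIC recovering 6 gives 0.6, but with far more of their relative ordering intact.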

Head-to-Head Comparison

| Method | Attack | Mal/3 | Mal/10 | Recall@10 | Tau | Quality |
|---|---|---|---|---|---|---|
| Base | YES | 2 | 8 | 20% | 1.000 | 100.0% |
| Naive | NO | 0 | 0 | 80% | 0.222 | 61.1% |
| MOSAIC | NO | 0 | 0 | 60% | 0.306 | 65.3% |

Both methods completely block the attack. The difference is in collateral damage to ordering quality.

Tiered Gating: CITE / INCLUDE / EXCLUDE

Production RAG systems use tiered actions based on position in the ranking:

| Tier | Positions | Action |
|---|---|---|
| CITE | Top 3 | Direct citation — LLM uses these as grounding for its answer |
| INCLUDE | 4–10 | Available context — LLM reads but is less likely to directly quote |
| EXCLUDE | 11+ | Not passed to the LLM at all |
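The tier assignment is a function of final rank alone. A sketch using 0-indexed ranks (matching the audit receipts later in the post, where the top document is rank 0):

```python
def tier(rank):
    """Map a 0-indexed final rank to a gating tier (positions 1-3, 4-10, 11+)."""
    if rank < 3:
        return "CITE"     # direct grounding / citation window
    if rank < 10:
        return "INCLUDE"  # available context, rarely quoted directly
    return "EXCLUDE"      # not passed to the LLM at all
```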

Precision (fraction of legitimate documents) in each tier:

| Tier | Base | Naive | MOSAIC |
|---|---|---|---|
| CITE (top 3) | 33.3% | 100.0% | 100.0% |
| INCLUDE (4–10) | 14.3% | 100.0% | 100.0% |
| EXCLUDE (11+) | 65.6% | 56.7% | 56.7% |

Both methods achieve 100% precision in the CITE and INCLUDE tiers — every document the LLM sees is legitimate. The lower precision in EXCLUDE is actually desirable: it means injected documents were pushed out of the LLM's context window into the excluded zone.

Audit Receipts: Why Each Document Moved

Every document gets a GovernReceipt — a complete audit trail showing what happened at each pipeline stage:

```python
for r in sorted(result.receipts, key=lambda x: x.final_rank)[:5]:
    print(f"Doc {r.item}  base={r.base_score:.3f}  steer={r.steering_score:.3f}  "
          f"ortho={r.orthogonalized_steering:.3f}  final={r.final_score:.3f}  "
          f"rank: {r.base_rank} -> {r.final_rank}")
```
| Doc | Type | Base | Steer | Ortho | Final | Move |
|---|---|---|---|---|---|---|
| 78 | legit | 0.965 | 0.467 | +0.319 | 1.285 | 0 → 0 |
| 212 | legit | 0.835 | 0.460 | +0.219 | 1.054 | 12 → 1 |
| 127 | legit | 0.829 | 0.463 | +0.218 | 1.046 | 14 → 2 |
| 560 | INJECT | 0.908 | 0.006 | −0.183 | 0.951 | 5 → 11 |

Reading the receipts:

  • Doc 78 (legit, rank 0): the most relevant legitimate document stays at rank 0. Its positive Ortho value (+0.319) means the orthogonalized safety signal confirms it is safe. It did not move because the base ranker and safety signal agree.
  • Doc 212 (legit, rank 12 → 1): this safe document was buried at rank 12 behind injected docs. Its positive Ortho (+0.219) promotes it into the citation window.
  • Doc 560 (injected, rank 5 → 11): this injected document had high relevance (0.908) but its Ortho value is strongly negative (−0.183), meaning the orthogonalized safety signal identifies it as riskier than its relevance would suggest. It drops from rank 5 into the EXCLUDE tier.

The average displacement tells the full story: injected documents moved +33.5 positions (demoted), safe documents moved −21.4 positions (promoted). The safety signal correctly identifies the adversarial documents and removes them.

The Budget Sweep

Because the risk detector produces a very strong binary signal (injected docs have risk ~0.89 vs ~0.09 for legit), the budget sweep reveals an unusual pattern:

| Budget | Mal/3 | Mal/10 | Recall@10 | Tau | Quality |
|---|---|---|---|---|---|
| 0.00 | 0 | 0 | 80% | 0.313 | 65.6% |
| 0.10 | 0 | 0 | 80% | 0.324 | 66.2% |
| 0.20 | 0 | 0 | 70% | 0.311 | 65.5% |
| 0.30 | 0 | 0 | 60% | 0.306 | 65.3% |
| 0.50 | 0 | 0 | 50% | 0.312 | 65.6% |
| 0.70 | 0 | 0 | 40% | 0.297 | 64.9% |
| 1.00 | 0 | 3 | 10% | 0.458 | 72.9% |

The curve is flat: Mal/10 stays at zero from budget 0.00 all the way through 0.70. Only at budget 1.00 (all edges protected, no steering allowed) do injected documents reappear.

This is fundamentally different from the content moderation case, where the budget sweep showed a gradual tradeoff between toxicity and quality. Here, the risk signal is so strong that even minimal steering is sufficient to eliminate the threat entirely. The budget knob matters less for binary safety decisions and more for controlling how much the ordering of safe documents is rearranged.

Notice how Recall@10 trades off with budget: at budget 0.00 (maximum steering), Recall@10 is 80% — the steering aggressively promotes safe documents. At budget 0.70, Recall@10 drops to 40% because more edges are locked, limiting how far safe documents can rise. But the attack is blocked at every setting.

Comparing with Content Moderation

It is worth stepping back to compare the RAG safety results with the content moderation case:

| Metric | Content Moderation | RAG Safety |
|---|---|---|
| Correlation (r) | 0.424 | 0.326 |
| Projection coeff | −0.115 | −0.712 |
| Base contamination | 7/10 (70%) | 8/10 (80%) |
| MOSAIC quality (tau) | 0.510 | 0.306 |
| Budget sweep shape | Gradual tradeoff | Flat (binary threshold) |

The projection coefficient is 6x larger in the RAG case (−0.712 vs −0.115), meaning orthogonalization had much more work to do. The base contamination is higher (80% vs 70%), so any effective defense requires larger rank changes, which is why tau is lower.

The key difference is in the budget sweep. Content moderation shows a smooth, gradual tradeoff — every budget increment gives you a little more toxicity reduction at a little more quality cost. RAG safety shows a step function — any budget below 1.00 eliminates the threat, and budget 1.00 lets the threat through. This is because the risk detector produces a clean binary signal, not a noisy continuous one.

Key Takeaways

  1. Both naive and MOSAIC block the attack. With a strong risk detector (PRRGate-style pattern matching), even naive penalty subtraction removes all injected documents from the citation window. The question is not whether the attack is blocked, but at what cost to retrieval quality.
  2. MOSAIC preserves more pairwise orderings. Tau = 0.306 vs 0.222, i.e. 65.3% vs 61.1% quality retained, a 4.2 percentage point gain. The orthogonalization prevents unnecessary reshuffling of safe-vs-safe document orderings.
  3. The quality gap is moderate by design. When 80% of the top-10 is contaminated, any defense must make large changes. The ceiling for quality retention is inherently lower than in content moderation or fraud detection.
  4. The budget sweep is flat. The risk signal is strong enough that even budget = 0.70 blocks the attack. This is characteristic of strong binary safety signals — you don't need fine-grained budget tuning for the safety decision itself.
  5. Audit receipts make every decision explainable. Injected documents have negative orthogonalized steering and are displaced +33.5 positions on average (demoted); safe documents have positive orthogonalized steering and move −21.4 positions on average (promoted). Every rank change is traceable.
```python
from mosaic import govern

result = govern(retrieval_scores, safety_scores, budget=0.30)
```

Run the full notebook: `rag_safety.ipynb`

Try governed-rank

pip install governed-rank