Tutorial · Risk & Compliance

Tutorial: Fraud Detection — Steering Review Queues Toward High-Value Fraud

A fraud model ranks by probability, but a \$50K wire transfer matters more than a \$5 candy purchase. This tutorial walks through the fraud_detection notebook: how govern() steers toward high-impact fraud without flooding the queue with false positives.

March 7, 2026 · 10 min · governed-rank

10x fraud value captured in BLOCK tier vs base model

Fraud review teams have limited capacity. A team of 10 analysts can review maybe 100 transactions per day. The ML model ranks transactions by fraud probability — but a \$50,000 fraudulent wire transfer matters more than a \$5 fraudulent candy purchase. The base model ignores business impact entirely. It treats every fraud equally, regardless of dollar value.

This tutorial walks through the fraud_detection.ipynb notebook. We simulate 300 transactions with realistic distributions, show why naive impact-weighting floods the queue with false positives, and demonstrate how govern() steers toward high-value fraud while protecting the model's most confident calls.

The Review Queue Problem

We generate 300 transactions with log-normal dollar amounts ranging from \$5 to \$17,660. Fraud probabilities follow a Beta(1.5, 8) distribution — most transactions are low-risk, but a meaningful tail exists.

```python
import numpy as np
from mosaic import govern

np.random.seed(42)
n = 300

# Log-normal dollar amounts, clipped to [$5, $50,000]
amounts = np.exp(np.random.normal(4.0, 1.5, n))
amounts = np.clip(amounts, 5, 50000).astype(int)

# Fraud probability: Beta(1.5, 8) plus a mild amount-dependent bump
fraud_prob = np.random.beta(1.5, 8, n)
fraud_prob = fraud_prob + 0.1 * np.log1p(amounts) / np.log1p(50000)
fraud_prob = np.clip(fraud_prob, 0.01, 0.99)

is_fraud = (np.random.rand(n) < fraud_prob).astype(int)

# Business impact signal (log-scaled amount)
impact = np.log1p(amounts) / np.log1p(50000)
```

The dataset:

  • 300 transactions, dollar amounts from \$5 to \$17,660
  • Fraud rate: 19.3% (58 fraudulent cases out of 300)
  • Total fraud value: \$7,635 across all 58 cases
  • Correlation between fraud probability and impact: r = 0.131 — larger transactions are slightly riskier, but the relationship is weak.

We select the top 100 transactions by fraud probability as the review queue. When the team works this queue top-down by fraud probability alone, they catch fraud cases but miss the high-value ones. A \$15,000 fraudulent wire transfer sitting at position 40 in the queue might never get reviewed if the team only has time for the top 20.
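Building the queue is a single argsort. A minimal sketch, using a simplified stand-in for the simulated `fraud_prob` (in the notebook the scores come from the simulation above); `top_idx` is the name the later snippets assume:

```python
import numpy as np

np.random.seed(42)
# Stand-in scores; in the notebook these come from the simulation above
fraud_prob = np.clip(np.random.beta(1.5, 8, 300), 0.01, 0.99)

# Review queue: top 100 transactions by fraud probability, highest first
top_idx = np.argsort(fraud_prob)[::-1][:100]
base_scores = {int(i): float(fraud_prob[i]) for i in top_idx}
```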

Why Naive Impact-Weighting Fails

The obvious fix: add business impact to the fraud score.

```python
# base_scores (defined in the next section) maps queue index -> fraud probability
naive_weight = 0.5
naive_scores = {i: base_scores[i] + naive_weight * float(impact[i]) for i in base_scores}
```

This boosts ALL high-value transactions into the review queue — including legitimate ones. A \$15,000 legitimate wire transfer suddenly ranks above a \$200 confirmed fraud case. The queue fills with false positives: high-value transactions that the fraud model correctly identified as low-risk get pushed to the top because their dollar amount is large.

The correlation between fraud probability and impact (r = 0.131) means naive addition double-counts: transactions that are both high-risk AND high-value get boosted twice, while the real problem — surfacing high-value fraud that the model ranked too low — gets lost in noise.

MOSAIC: Prioritize High-Value Fraud

govern() orthogonalizes the impact signal against fraud probability before steering. The remaining impact signal can only reorder transactions where the fraud model is uncertain.

```python
# Restrict both signals to the top-100 review queue (top_idx)
base_scores  = {int(i): float(fraud_prob[i]) for i in top_idx}
steer_scores = {int(i): 1.0 * float(impact[i]) for i in top_idx}

result = govern(base_scores, steer_scores, budget=0.30)
```

Key diagnostics:

  • Projection coefficient: -0.1694 — negative, meaning high-impact transactions actually had slightly lower fraud scores on average within the top-100 queue. Orthogonalization strips out this misleading anti-correlation.
  • Protected edges: 15 — the budget locked the 30% most confident ordering decisions.
  • Active constraints: 6 — of those 15 protected edges, 6 actually bound the solution.

The budget protects the fraud model's most confident calls:

  • High-confidence fraud stays at the top (protected by large score gaps)
  • High-confidence legitimate stays at the bottom (also protected)
  • Uncertain middle gets reordered by business impact

This is exactly what the review team wants: do not second-guess the model where it is confident, but when the model is uncertain between two transactions, prioritize the one worth more.
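To build intuition for which edges the budget locks, here is an illustration of the gap-based idea on synthetic scores. This is only a sketch of the mechanism described above (protection by large score gaps), not govern()'s actual implementation:

```python
import numpy as np

# Hypothetical queue scores, sorted descending like the review queue
scores = np.sort(np.random.default_rng(0).beta(1.5, 8, 100))[::-1]
gaps = scores[:-1] - scores[1:]   # 99 adjacent-pair score gaps

budget = 0.30
n_protect = round(budget * 50)    # budget = fraction of the 50 protectable edges
protected = np.argsort(gaps)[::-1][:n_protect]  # widest gaps: most confident orderings
```

With budget = 0.30 this locks 15 edges, matching the diagnostic reported above.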

Head-to-Head: Value Captured Is the Key Metric

The right metric for fraud teams is not just precision or recall — it is fraud value captured. Catching a \$15,000 fraud is worth more than catching ten \$5 frauds.

| Method | Fraud @20 | Value @20 | Fraud @50 | Value @50 | Tau | Quality |
|--------|-----------|-----------|-----------|-----------|-------|---------|
| Base   | 7         | \$552     | 17        | \$1,562   | 1.000 | 100.0%  |
| Naive  | 8         | \$2,477   | 17        | \$4,193   | 0.518 | 75.9%   |
| MOSAIC | 6         | \$2,271   | 15        | \$4,165   | 0.336 | 66.8%   |

The base model catches 7 fraud cases in the top-20 of the queue, but those cases are worth only \$552. The model is optimized for fraud probability, not fraud value, so it surfaces many low-dollar cases.

MOSAIC captures \$2,271 in fraud value in the top-20 — a 4.1x improvement over the base model's \$552. It catches slightly fewer fraud cases (6 vs. 7), but those 6 are high-value fraud. A team that reviews the top 20 MOSAIC-ranked transactions recovers more than four times the dollar value.

The naive approach captures slightly more value (\$2,477 vs. \$2,271 in the top-20) and preserves more of the base ordering (tau = 0.518 vs. MOSAIC's 0.336). MOSAIC disrupts the ranking more than naive here because it is doing something different: concentrating high-VALUE fraud at the top of the queue, which requires more reordering. The real win shows up in the tiered gating analysis below, where MOSAIC's BLOCK tier captures 10.4x more fraud value (\$1,603 vs. \$154) at the same 40% precision, and fraud slipping through ALLOW drops from \$4,088 to \$767.

High-value false positives (legitimate transactions over \$1,000 that appear in the top-20): base has 1, naive has 2, MOSAIC has 2. The false-positive rate is comparable, but the true-positive value is dramatically different.
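The value-captured metric itself is one line: sum the dollar amounts of the true fraud cases in the reviewed prefix of a ranking. A sketch with hypothetical data (the helper name `value_captured` is ours, not the notebook's):

```python
import numpy as np

rng = np.random.default_rng(7)
amounts = rng.integers(5, 50_000, 100)   # hypothetical dollar amounts
is_fraud = rng.random(100) < 0.2         # hypothetical fraud labels
ranked = list(range(100))                # some ranking over the same 100 items

def value_captured(ranked, k=20):
    """Dollar value of true fraud among the first k reviewed items."""
    return int(sum(amounts[i] for i in ranked[:k] if is_fraud[i]))
```

The head-to-head numbers above are this metric evaluated at k = 20 and k = 50 for each method's ranking.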

Tiered Gating: Where MOSAIC Shines

Production fraud systems do not use binary approve/reject. They use tiered actions:

  • BLOCK (top 10%): Auto-decline the highest-risk transactions
  • REVIEW (next 20%): Send to manual review for step-up authentication
  • ALLOW (bottom 70%): Auto-approve low-risk transactions
```python
n_items = len(base_order)  # base_order: queue items in base-score order
tiers = [
    ("BLOCK",  0, int(0.10 * n_items)),
    ("REVIEW", int(0.10 * n_items), int(0.30 * n_items)),
    ("ALLOW",  int(0.30 * n_items), n_items),
]

for tier_name, start, end in tiers:
    tier_items = result.ranked_items[start:end]
    fraud_in_tier = [i for i in tier_items if is_fraud[i]]
    precision = len(fraud_in_tier) / len(tier_items)
    value = sum(amounts[i] for i in fraud_in_tier)
    print(f"{tier_name}: {len(tier_items)} items, {precision:.1%} precision, ${value:,} fraud value")
```
| Tier   | Method | Items | Fraud | Precision | Value   |
|--------|--------|-------|-------|-----------|---------|
| BLOCK  | Base   | 10    | 4     | 40.0%     | \$154   |
| BLOCK  | Naive  | 10    | 3     | 30.0%     | \$487   |
| BLOCK  | MOSAIC | 10    | 4     | 40.0%     | \$1,603 |
| REVIEW | Base   | 20    | 5     | 25.0%     | \$604   |
| REVIEW | Naive  | 20    | 8     | 40.0%     | \$2,443 |
| REVIEW | MOSAIC | 20    | 10    | 50.0%     | \$2,476 |
| ALLOW  | Base   | 70    | 23    | 32.9%     | \$4,088 |
| ALLOW  | Naive  | 70    | 21    | 30.0%     | \$1,916 |
| ALLOW  | MOSAIC | 70    | 18    | 25.7%     | \$767   |

The BLOCK tier tells the story. All three methods block 10 transactions:

  • Base blocks 4 fraud cases worth \$154. The auto-decline stops low-value fraud.
  • Naive blocks only 3 fraud cases worth \$487. Lower precision because legitimate high-value transactions pushed into the BLOCK tier.
  • MOSAIC blocks 4 fraud cases worth \$1,603. Same precision as base (40%), but the blocked fraud is worth 10.4x more.

MOSAIC's BLOCK tier captures the same number of fraudulent transactions as the base model but concentrates on the high-value cases. That is \$1,603 in automatically declined fraud versus \$154 — without increasing the false-positive rate.

The REVIEW tier is equally revealing. MOSAIC achieves 50% precision (10 fraud out of 20 reviewed) versus base's 25% (5 out of 20). Every second transaction the reviewer examines is actual fraud, worth \$2,476 total. The reviewers' time is spent efficiently.

Fraud Slipping Through: The ALLOW Tier

The most critical metric for a fraud team is not what they catch — it is what they miss. Fraud that slips through the ALLOW tier goes unreviewed and costs the business real money.

| Method | Fraud Cases in ALLOW | Fraud Value in ALLOW |
|--------|----------------------|----------------------|
| Base   | 23                   | \$4,088              |
| Naive  | 21                   | \$1,916              |
| MOSAIC | 18                   | \$767                |

MOSAIC reduces the dollar value of fraud slipping through to \$767 — an 81% reduction from the base model's \$4,088. Only 18 fraud cases slip through (vs. base's 23), and those 18 are overwhelmingly low-value cases. The high-value fraud has been pushed up into BLOCK or REVIEW where it gets caught.

This is the core value proposition for fraud teams: MOSAIC does not catch more fraud in absolute terms — it catches the fraud that matters. The \$767 that slips through MOSAIC versus the \$4,088 that slips through the base model represents a \$3,321 improvement in fraud losses.

Budget Sweep: Tuning the Tradeoff

Sweeping the budget from 0.00 to 1.00 reveals the tradeoff frontier:

```python
for b in [0.00, 0.10, 0.20, 0.30, 0.50, 0.70, 1.00]:
    r = govern(base_scores, steer_scores, budget=b)
    top20 = r.ranked_items[:20]
    fraud_caught = sum(1 for i in top20 if is_fraud[i])
    value = sum(amounts[i] for i in top20 if is_fraud[i])
    print(f"budget={b:.2f}: {fraud_caught} fraud, ${value:,} captured")
```
| Budget | Fraud @20 | Value @20 | Tau   | Quality |
|--------|-----------|-----------|-------|---------|
| 0.00   | 7         | \$2,930   | 0.326 | 66.3%   |
| 0.10   | 7         | \$2,930   | 0.332 | 66.6%   |
| 0.20   | 7         | \$2,841   | 0.338 | 66.9%   |
| 0.30   | 6         | \$2,271   | 0.336 | 66.8%   |
| 0.50   | 7         | \$2,841   | 0.356 | 67.8%   |
| 0.70   | 6         | \$2,538   | 0.401 | 70.1%   |
| 1.00   | 6         | \$1,978   | 0.582 | 79.1%   |

At budget = 0.00 (maximum steering), MOSAIC captures \$2,930 in fraud value in the top-20 — a 5.3x improvement over the base model's \$552. At budget = 1.00, all 50 protectable edges are locked, giving the highest quality retention (tau = 0.582). With 100 items in the queue, govern() protects up to 50 edges by default — the top-50 items in the queue (the critical decision zone) have their base ordering fully preserved, while the bottom 49 edges remain unprotected. MOSAIC still captures \$1,978 in the top-20 — 3.6x more than the base — because the orthogonalized steering signal reorders freely in the unprotected tail.

The tradeoff is smooth. As budget increases, quality retention improves (tau rises from 0.326 to 0.582) while fraud value captured gradually decreases. There is no cliff, no sudden collapse. The fraud team can choose their operating point based on their risk tolerance.

For most fraud teams, budget = 0.30 is a good starting point: it captures \$2,271 in the top-20 (4.1x improvement) while protecting the model's most confident decisions. Teams with higher risk tolerance can decrease the budget to capture even more value.

Understanding the Projection Coefficient

The projection coefficient of -0.1694 is more strongly negative than in the content moderation case (-0.1154). The negative sign means that within the top-100 queue, higher-impact transactions actually tend to have lower fraud probability scores. This seems counterintuitive — should not higher-value transactions be riskier?

In the full population (r = 0.131), there is a small positive correlation between amount and fraud probability. But in the top-100 by fraud probability, the relationship flips. The top-100 are all relatively high-risk transactions. Within that high-risk subset, the highest-dollar transactions tend to have slightly lower fraud scores (because truly massive fraud is rarer than small-ticket fraud). Orthogonalization strips out this misleading anti-correlation, ensuring that the impact signal can only reorder transactions where the fraud model is genuinely uncertain.

Without orthogonalization, naive impact-weighting would disproportionately boost the lower-fraud-probability transactions in the queue (because they happen to be higher-value), potentially pushing genuinely high-probability fraud cases down the list. MOSAIC prevents this.
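If govern() uses the standard least-squares construction (an assumption on our part; the notebook only reports the coefficient), the projection coefficient and the orthogonalized residual look like this in numpy:

```python
import numpy as np

rng = np.random.default_rng(1)
base = rng.beta(1.5, 8, 100)   # hypothetical fraud probabilities within the queue
steer = rng.random(100)        # hypothetical impact signal

b = base - base.mean()
s = steer - steer.mean()
coef = float(b @ s) / float(b @ b)   # the projection coefficient
resid = s - coef * b                 # impact signal with the base component removed

# The residual is uncorrelated with the base signal by construction
assert abs(float(resid @ b)) < 1e-9
```

A negative `coef`, as in this dataset, means the projected component being stripped out is the anti-correlation between impact and fraud score inside the queue.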

Reading the Diagnostics

The result object provides a full diagnostic picture:

  • n_protected_edges = 15: Out of 99 adjacent pairs in the 100-item queue, 15 were locked by budget = 0.30.
  • n_active_constraints = 6: Of those 15 protected edges, 6 actually bound the solution. The impact steering wanted to reverse those 6 orderings but could not.

The active ratio of 6/15 = 40% is moderate. The fraud model's most confident calls are being protected, but the impact signal has substantial room to reorder the uncertain middle of the queue. For fraud detection, this balance is appropriate: you want the model's highest-confidence fraud detections preserved at the top, while allowing business impact to prioritize within the uncertain tier.
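The active ratio is computed directly from those two fields. A sketch using a stand-in object with the attribute names reported above (assumed to match the result object's actual API):

```python
from types import SimpleNamespace

# Stand-in for the govern() result; attribute names as reported above
result = SimpleNamespace(n_protected_edges=15, n_active_constraints=6)

active_ratio = result.n_active_constraints / result.n_protected_edges
print(f"active ratio: {active_ratio:.0%}")  # prints "active ratio: 40%"
```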

How to Think About This

The fraud detection use case illustrates a key principle of MOSAIC: the steering signal does not need to be the same type as the base signal.

The base signal is fraud probability (a probability between 0 and 1). The steering signal is business impact (a log-scaled dollar amount). These are fundamentally different quantities. Naive addition of different-scale quantities is doubly problematic: you get both correlation interference AND scale mismatch.

MOSAIC handles both problems. Orthogonalization removes the correlation between fraud probability and business impact. The budget protects the fraud model's confident decisions. And the isotonic projection finds the optimal compromise. The result: high-value fraud rises in the queue, low-value fraud drops, and the model's most confident calls are preserved.

Production Deployment Pattern

For fraud teams deploying MOSAIC, the tiered gating analysis suggests a specific workflow:

  1. Set the tier thresholds based on operational capacity. If you can auto-block 10 transactions and manually review 20 per day, use 10/20/remaining.
  2. Start at budget = 0.30 and monitor two metrics: precision in the BLOCK tier (you want high confidence on auto-declines) and fraud value slipping through ALLOW (the business cost of missed fraud).
  3. Adjust the budget based on operational feedback. If reviewers report too many false positives in REVIEW, increase the budget to give the fraud model more control. If fraud losses in ALLOW are too high, decrease the budget to let the impact signal push more high-value cases up.
  4. Monitor the projection coefficient over time. If the correlation between fraud probability and impact changes (e.g., due to new fraud patterns), the projection coefficient will shift. A sudden change is a signal to investigate the fraud model's calibration.

The budget can be adjusted per-queue, per-day, or even per-shift. Morning queues with fresh analysts might use budget = 0.20 (more aggressive impact steering) while end-of-day queues might use budget = 0.50 (more conservative, trust the model more). This operational flexibility is unique to MOSAIC's budget-based approach.
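In practice this can be a small lookup keyed on the shift, seeded with the budgets suggested above (a hypothetical helper, not part of governed-rank):

```python
SHIFT_BUDGETS = {
    "morning": 0.20,     # fresh analysts: steer harder toward impact
    "end_of_day": 0.50,  # conservative: trust the model more
}

def budget_for(shift: str) -> float:
    """Pick the govern() budget for the current review shift."""
    return SHIFT_BUDGETS.get(shift, 0.30)  # default: recommended starting point
```

The chosen value is then passed straight through, e.g. govern(base_scores, steer_scores, budget=budget_for(shift)).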

Key Takeaway

Naive impact-weighting boosts ALL high-value transactions into the review queue — including legitimate ones. It reduces precision in the BLOCK tier from 40% to 30% because high-value legitimate transactions get pushed to the top.

MOSAIC orthogonalizes first, so the impact signal can only reorder transactions where the fraud model is uncertain. The result: BLOCK tier precision stays at 40% while fraud value captured jumps from \$154 to \$1,603 (10.4x). REVIEW tier precision improves from 25% to 50%. Fraud slipping through ALLOW drops from \$4,088 to \$767 (81% reduction).

```python
from mosaic import govern

result = govern(fraud_scores, impact_scores, budget=0.30)
```

Run the full notebook: `fraud_detection.ipynb`

Try governed-rank

`pip install governed-rank`
Tags: governed-rank, fraud-detection, review-queue, tiered-gating, tutorial