The COMPAS recidivism dataset is the canonical example of algorithmic bias in criminal justice. ProPublica's 2016 investigation showed that COMPAS risk scores systematically rank African-American defendants as higher risk than Caucasian defendants with similar profiles. The question: can we steer toward demographic parity without discarding the base model's predictive value?
This tutorial walks through the COMPAS section of the demo.ipynb notebook. We apply govern() to a 100-defendant sample, compute the Adverse Impact Ratio before and after steering, and inspect the audit trail to see exactly which defendants moved and why.
The Dataset
COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) assigns each defendant a decile score from 1 to 10, where 10 is highest risk. Following ProPublica's methodology, we filter to African-American and Caucasian defendants and sample 100 for analysis.
```python
import pandas as pd
from mosaic import govern

compas_url = "https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv"
compas_raw = pd.read_csv(compas_url)

# Filter following ProPublica's methodology
compas_raw = compas_raw[
    (compas_raw["days_b_screening_arrest"].between(-30, 30)) &
    (compas_raw["is_recid"] != -1) &
    (compas_raw["c_charge_degree"] != "O") &
    (compas_raw["score_text"] != "N/A")
].copy()

compas = compas_raw[compas_raw["race"].isin(["African-American", "Caucasian"])]
compas = compas.sample(n=100, random_state=42).reset_index(drop=True)
```

The sample of 100 defendants:
- 66 African-American, 34 Caucasian
- Mean decile score: African-American = 5.59, Caucasian = 4.09
- The 1.50-point gap in mean decile scores encodes the systemic bias: African-American defendants are rated as higher risk on average
The disparity is immediate and measurable. African-American defendants have higher average decile scores, which means they are ranked as higher risk. In a ranking context where lower risk is the favorable outcome, this translates to African-American defendants being systematically placed lower in the ranking.
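The group means can be reproduced with a one-line groupby. A minimal sketch on a toy stand-in for the filtered sample (column names follow the ProPublica CSV; the values here are illustrative, not the real sample):

```python
import pandas as pd

# Toy stand-in for the 100-defendant sample; the real analysis uses the
# ProPublica CSV loaded above.
compas = pd.DataFrame({
    "race": ["African-American"] * 4 + ["Caucasian"] * 2,
    "decile_score": [7, 6, 5, 4, 5, 3],
})

means = compas.groupby("race")["decile_score"].mean()
gap = means["African-American"] - means["Caucasian"]
print(round(gap, 2))  # 1.5 on this toy data; 1.50 on the real sample
```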
Steering Toward Demographic Parity
We invert the scores so higher = lower risk = better outcome. Being classified as low-risk is the favorable result in this context. The steering signal gives African-American defendants a positive boost.
```python
# Invert: higher = lower risk = favorable
base_scores = dict(zip(compas.index, (10.0 - compas["decile_score"]).astype(float)))

# Fairness boost for African-American defendants
steering = dict(zip(
    compas.index,
    1.5 * (compas["race"] == "African-American").astype(float)
))

result = govern(base_scores, steering, budget=0.30)
```

Key diagnostics:
- Projection coefficient: -0.0583 — negative, confirming that African-American defendants have lower base scores on average (after inversion, lower score = higher risk = less favorable). The fairness boost was partially correlated with low scores, and orthogonalization stripped out this redundancy. The remaining signal is pure fairness: it can only move defendants where the base model is uncertain.
- Protected edges: 15 — the budget locked in the 15 most confident ordering decisions among the 99 adjacent pairs (30% of the 50 protectable edges).
- Active constraints: 11 — of those 15 protected edges, 11 actually bound the solution. This is a high ratio (73%), meaning the fairness signal was fighting the base ranker on many fronts, and the budget was doing real work to preserve the model's most confident decisions.
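The projection coefficient can be read as an ordinary least-squares coefficient of the steering signal on the centered base scores. A minimal numpy sketch of that idea (an illustration of the concept, not MOSAIC's internal code):

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=100)                       # stand-in for inverted base scores
signal = (rng.random(100) < 0.66).astype(float)   # binary fairness boost

b = base - base.mean()                            # center both vectors
s = signal - signal.mean()

beta = (s @ b) / (b @ b)                          # projection coefficient
residual = s - beta * b                           # orthogonalized steering signal

# The residual carries no linear component along the base scores,
# so it can only reorder items, not fight the base model's information.
assert abs(residual @ b) < 1e-9
```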
The Adverse Impact Ratio and the 4/5ths Rule
The Adverse Impact Ratio (AIR) is the standard metric for evaluating fairness in ranked outcomes. It is used by the EEOC (Equal Employment Opportunity Commission) and applies broadly to any selection or ranking process:
```
AIR = favorable_rate(protected group) / favorable_rate(non-protected group)
```

where "favorable" means being in the top half of the ranking (classified as low-risk). The EEOC's 4/5ths rule: an AIR below 0.80 indicates potential adverse impact and may trigger regulatory scrutiny.
```python
def compute_adverse_impact_ratio(ranking, race_series, top_frac=0.5):
    n_top = int(len(ranking) * top_frac)
    top_items = set(ranking[:n_top])
    aa_total = (race_series == "African-American").sum()
    c_total = (race_series == "Caucasian").sum()
    aa_in_top = sum(1 for idx in top_items if race_series.iloc[idx] == "African-American")
    c_in_top = sum(1 for idx in top_items if race_series.iloc[idx] == "Caucasian")
    aa_rate = aa_in_top / aa_total if aa_total > 0 else 0
    c_rate = c_in_top / c_total if c_total > 0 else 0
    return aa_rate / c_rate if c_rate > 0 else 0
```

The results:
| | Base | MOSAIC |
|---|---|---|
| AA low-risk rate | 0.455 | 0.485 |
| Caucasian low-risk rate | 0.588 | 0.529 |
| Adverse Impact Ratio | 0.773 | 0.916 |
The base ranking produces an AIR of 0.773 — below the 4/5ths threshold of 0.80. This means African-American defendants are classified as low-risk at only 77.3% of the rate of Caucasian defendants. Under EEOC guidelines, this constitutes potential adverse impact.
After MOSAIC steering, the AIR improves to 0.916 — well above the 0.80 threshold. The improvement comes from two simultaneous movements: the African-American low-risk rate increases from 0.455 to 0.485, and the Caucasian low-risk rate decreases slightly from 0.588 to 0.529. The net effect is a substantial narrowing of the gap.
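As a sanity check, the AIR values follow directly from the counts implied by the rates in the table (66 African-American and 34 Caucasian defendants, 50 top-half slots; the counts below are inferred from the reported rates):

```python
# Base ranking: 30 of 66 AA and 20 of 34 Caucasian defendants in the top half.
base_air = (30 / 66) / (20 / 34)
# After steering: 32 AA and 18 Caucasian in the top half (still 50 slots).
mosaic_air = (32 / 66) / (18 / 34)

print(round(base_air, 3), round(mosaic_air, 3))  # 0.773 0.916
```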
Quality Retention
The critical question: how much predictive accuracy did we sacrifice for this fairness improvement?
```python
from scipy.stats import kendalltau

# base_positions / mosaic_positions: each defendant's position in the
# base and steered rankings
tau, _ = kendalltau(base_positions, mosaic_positions)
quality_retained = (1 + tau) / 2
```

- Kendall tau: 0.8998
- Quality retained: 95.0%
A tau of 0.8998 means 95.0% of all pairwise orderings are preserved. The ranking changed substantially in fairness terms (AIR from 0.773 to 0.916 — a 0.143 improvement) while preserving nearly all of the base model's ordering decisions. Only 5.0% of pairwise orderings changed, and those changes were concentrated in the uncertain middle of the ranking where the base model's score gaps were smallest.
The Audit Trail: Who Moved and Why
Every defendant gets a GovernReceipt explaining their rank movement. This is critical for regulated domains where every algorithmic decision must be explainable.
```python
for r in result.receipts:
    moved = r.base_rank - r.final_rank  # positive = improved
    if abs(moved) > 5:
        print(f"Defendant {r.item}  Race: {compas.loc[r.item, 'race']}")
        print(f"  COMPAS decile: {compas.loc[r.item, 'decile_score']}")
        print(f"  Rank: {r.base_rank} -> {r.final_rank} (moved {moved:+d})")
```

Overall movement statistics:
- Defendants who stayed in place: 2
- Defendants who moved: 98
- Average displacement: 4.8 positions
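The statistics above can be recomputed from the receipts. A sketch using stand-in receipt objects with just the fields the loop above relies on (`item`, `base_rank`, `final_rank`):

```python
from dataclasses import dataclass

@dataclass
class Receipt:      # stand-in for GovernReceipt
    item: int
    base_rank: int
    final_rank: int

# In the real notebook these come from result.receipts.
receipts = [Receipt(93, 12, 0), Receipt(1, 5, 5), Receipt(22, 74, 90)]

moves = [abs(r.base_rank - r.final_rank) for r in receipts]
stayed = sum(1 for m in moves if m == 0)
avg_displacement = sum(moves) / len(moves)
print(stayed, round(avg_displacement, 2))  # 1 9.33
```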
The largest improvements (moved toward lower risk):
- Defendant #93 (African-American, decile 1): rank 12 to 0 (+12 positions). A low-risk African-American defendant who was ranked too low by the base model. MOSAIC moved them to the top of the low-risk group.
- Defendant #94 (African-American, decile 1): rank 13 to 1 (+12 positions). Similar profile — a low-risk defendant pushed up.
- Defendant #96 (African-American, decile 2): rank 27 to 20 (+7 positions). Another low-risk defendant moved into a more favorable position.
The largest movements toward higher risk:
- Defendant #22 (Caucasian, decile 8): rank 74 to 90 (-16 positions). A high-risk Caucasian defendant who was ranked more favorably than their risk score warranted. MOSAIC moved them down.
- Defendant #3 (Caucasian, decile 2): rank 15 to 28 (-13 positions). A defendant who moved down to make room for African-American defendants with similar risk profiles.
The pattern is clear: African-American defendants in the uncertain middle of the ranking move up, while Caucasian defendants in the same zone move down. Defendants at the extremes — where the base model has large score gaps and the edges are protected — barely move at all. The budget ensures that the model's most confident decisions are preserved.
Budget Sweep: The Fairness-Accuracy Frontier
Sweeping from budget 0.00 to 1.00 reveals the tradeoff:
```python
for b in [0.00, 0.10, 0.20, 0.30, 0.50, 0.70, 1.00]:
    r = govern(base_scores, steering, budget=b)
    # compute AIR, Kendall tau, and quality retained for each budget
```

| Budget | AIR | Tau | Quality |
|---|---|---|---|
| 0.00 | 0.916 | 0.873 | 93.7% |
| 0.10 | 0.916 | 0.872 | 93.6% |
| 0.20 | 0.916 | 0.872 | 93.6% |
| 0.30 | 0.916 | 0.900 | 95.0% |
| 0.50 | 0.916 | 0.919 | 95.9% |
| 0.70 | 0.916 | 0.942 | 97.1% |
| 1.00 | 0.773 | 0.951 | 97.6% |
A striking pattern emerges: AIR stays at 0.916 across all budgets from 0.00 to 0.70. The fairness improvement is remarkably robust. Whether you protect 0% or 70% of edges, the Adverse Impact Ratio achieves the same value.
This happens because the orthogonalized fairness signal is concentrated in the uncertain middle of the ranking. Even at budget = 0.70 (70% of edges protected), the remaining 30% of edges are enough for the fairness signal to achieve full effect. The protected edges are not in the positions where the fairness signal needs to operate.
The difference across budgets shows up in quality, not fairness:
- At budget = 0.00: AIR = 0.916, quality = 93.7%
- At budget = 0.30: AIR = 0.916, quality = 95.0%
- At budget = 0.70: AIR = 0.916, quality = 97.1%
Higher budgets preserve more of the base ordering while achieving the same fairness outcome. The default budget = 0.30 is a good balance: you get the full fairness improvement (AIR = 0.916) with 95.0% quality retention.
At budget = 1.00, the maximum of 50 edges is protected: with 100 defendants, govern() caps protection at 50 of the 99 adjacent pairs, leaving the other 49 unprotected. Most of the steered movement would have to happen inside that protected zone, so tau is high (0.951), but the fairness signal has almost no room to operate where it matters most, and the AIR falls back to the base value of 0.773.
What the Projection Coefficient Tells You
The projection coefficient of -0.0583 is relatively small in magnitude compared to the content moderation case (-0.1154). This tells you something important: the fairness signal was not strongly correlated with the base scores.
Why? Because the fairness signal is binary (1 for African-American, 0 for Caucasian), while the base scores span a continuous range. The correlation is moderate because African-American defendants tend to have higher risk scores (lower inverted scores), but the overlap between the two groups is substantial. Orthogonalization still matters — without it, the fairness boost would disproportionately affect low-scoring defendants of both races — but the correction is smaller than in domains where the correlation is stronger.
Understanding the Active Constraint Ratio
The diagnostic shows 11 active constraints out of 15 protected edges (73%). This is a notably high ratio compared to the content moderation case (50%) and fraud detection (40%).
Why is it so high? Because the fairness signal is binary (1 for African-American, 0 for Caucasian). Binary signals create sharp boundaries: every adjacent pair where an African-American defendant is below a Caucasian defendant is a candidate for reversal. When the signal is binary rather than continuous, it pushes hard on many boundaries simultaneously.
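The "many boundaries" intuition can be made concrete: for a binary boost, every adjacent pair where a boosted item sits directly below a non-boosted one is a candidate reversal. A small illustrative count:

```python
# Ranking from best to worst; 1 = boosted group member, 0 = not boosted.
groups_in_rank_order = [0, 1, 0, 0, 1, 1, 0, 1]

# Count adjacent pairs where a boosted item sits just below a non-boosted
# one: each is a boundary the binary signal pushes to flip.
candidates = sum(
    1
    for above, below in zip(groups_in_rank_order, groups_in_rank_order[1:])
    if above == 0 and below == 1
)
print(candidates)  # 3
```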
The high active ratio means the budget is doing significant work. Without budget protection (at budget = 0.00), the fairness signal would reverse substantially more orderings. The budget constrains it to operate only where the base model is uncertain, which is why tau at budget = 0.00 (0.873) is still quite high — the orthogonalized signal cannot interfere with the base model's information content, only reorder where confidence is low.
Interpreting the Movements: Patterns in the Audit Trail
The audit trail reveals a consistent pattern in the movements:
African-American defendants who improved most are those with low decile scores (low risk) who were ranked below their risk level. Defendant #93 (decile 1) and #94 (decile 1) both improved by 12 positions. These are low-risk defendants who the base model placed too far down the ranking. MOSAIC recognizes the opportunity to improve fairness by moving them up, and the small score gaps around their original positions mean the base model was not confident about keeping them down.
Caucasian defendants who moved down most are those where the base model's ranking was generous relative to their risk level. Defendant #22 (decile 8, high risk) moved from rank 74 to rank 90 — a 16-position drop. This is a high-risk defendant who was ranked more favorably than their score warranted. MOSAIC's orthogonalized signal corrects for this.
- Defendants who did not move (2 out of 100) sit at positions where the score gap to their neighbor is in the top 30% (protected by the budget) and the steering signal does not push them against that constraint.
The average displacement of 4.8 positions means most defendants moved modestly. This is not a wholesale reshuffling — it is a targeted adjustment concentrated in the middle of the ranking where score gaps are small and the model's confidence is low.
Implications for Regulated Domains
The COMPAS analysis demonstrates several properties that matter for regulatory compliance:
- Traceability: Every defendant's rank movement is documented in the receipt. Regulators can audit individual decisions. If asked "why did defendant #93 move from rank 12 to rank 0?", the answer is specific: the orthogonalized fairness signal was positive for this defendant, no protected edge prevented the movement, and the PAV projection found this to be the optimal position.
- Predictability: The budget sweep shows the fairness-accuracy tradeoff is smooth and nearly monotonic. There are no surprises: raising the budget improves quality (aside from a 0.1-point dip between budgets 0.00 and 0.10) and never decreases AIR until budget = 1.00, where the policy is fully suppressed.
- Proportionality: The average displacement of 4.8 positions is modest. MOSAIC does not wholesale reshuffle the ranking — it makes targeted adjustments in the uncertain zone. The largest movement is 16 positions (defendant #22), and most movements are much smaller.
- Threshold compliance: The AIR of 0.916 comfortably exceeds the 4/5ths rule threshold of 0.80. There is a 0.116 margin of safety. Even under moderate budget changes, AIR remains at 0.916 from budget 0.00 to 0.70.
- Quality preservation: 95.0% of pairwise orderings are preserved at budget = 0.30. The base model's predictive value is largely intact. At budget = 0.70, quality rises to 97.1% while AIR stays at 0.916.
- Robustness: The identical AIR across budgets 0.00 to 0.70 demonstrates that the fairness improvement is not fragile. It does not depend on a precise budget setting. Any budget between 0.00 and 0.70 achieves the same fairness outcome, giving the practitioner wide latitude to prioritize quality without sacrificing fairness.
The Broader Fairness Question
It is worth noting what MOSAIC does and does not claim in the fairness context. MOSAIC is a post-processing reranking tool. It improves the Adverse Impact Ratio of an existing ranking by steering toward demographic parity. It does not address the upstream question of whether the COMPAS risk scores are accurate, whether the training data is biased, or whether the criminal justice system should use algorithmic risk assessment at all.
What MOSAIC provides is a transparent, auditable, and mathematically grounded way to reduce observed disparities in a ranking while preserving as much of the base model's ordering as possible. The projection coefficient (-0.0583) quantifies how much of the fairness signal was redundant with the base scores. The budget provides a knob for controlling the fairness-accuracy tradeoff. The audit trail documents every individual movement.
For organizations operating under regulatory scrutiny — whether in criminal justice, lending, hiring, or healthcare — this combination of transparency, controllability, and auditability is the core value proposition.
Key Takeaway
COMPAS risk scores encode racial disparities: African-American defendants have a mean decile of 5.59 versus 4.09 for Caucasian defendants. The base ranking produces an Adverse Impact Ratio of 0.773, failing the 4/5ths rule.
govern() steers toward demographic parity by orthogonalizing the fairness signal against the base scores. The result: AIR improves from 0.773 to 0.916 (passing the threshold with margin) while retaining 95.0% of ranking quality. The improvement is robust across budgets — AIR stays at 0.916 from budget 0.00 to 0.70. Every movement is documented in the audit trail: defendant #93 (African-American, decile 1) moved from rank 12 to rank 0; defendant #22 (Caucasian, decile 8) moved from rank 74 to rank 90.
```python
from mosaic import govern

result = govern(risk_scores, fairness_signal, budget=0.30)
```

Run the full notebook: `demo.ipynb`
Related Insights
Understanding governed-rank: How MOSAIC Steers Rankings Without Breaking Them
Every ranking system eventually needs a second objective. MOSAIC orthogonalizes the policy signal, protects confident decisions, and projects the optimal result — in three steps, one function call.
3 steps: zero-interference steering with full audit trail
Tutorial: Content Moderation — Demoting Toxicity Without Killing Engagement
Toxic content is engaging — outrage drives clicks. This tutorial walks through the content_moderation notebook: why naive penalties over-correct, and how govern() targets only the uncertain zone.
Toxicity drops in top-10 while ranking quality preserved
Tutorial: Fraud Detection — Steering Review Queues Toward High-Value Fraud
A fraud model ranks by probability, but a $50K wire transfer matters more than a $5 candy purchase. This tutorial walks through the fraud_detection notebook: how govern() steers toward high-impact fraud without flooding the queue with false positives.
10x fraud value captured in BLOCK tier vs base model