
Tutorial: Content Moderation — Demoting Toxicity Without Killing Engagement

Toxic content is engaging — outrage drives clicks. This tutorial walks through the content_moderation notebook: why naive penalties over-correct, and how govern() targets only the uncertain zone.

March 7, 2026 · 10 min · governed-rank

Toxicity drops in the top-10 while ranking quality is preserved

Every content platform faces the same dilemma: the engagement model rewards toxicity. Outrage drives clicks, shares, and comments. A latent "outrage factor" pushes both engagement and toxicity upward simultaneously, and the correlation between the two signals turns every naive fix into a minefield.

This tutorial walks through the content_moderation.ipynb notebook step by step. We generate a realistic content feed of 200 posts, show why the naive fix over-corrects, and demonstrate how govern() solves the problem with a single budget knob.

The Engagement-Toxicity Trap

We generate 200 posts. Each post has a latent outrage factor drawn from a Beta(2, 5) distribution — most posts are mild, but a meaningful tail of high-outrage posts exists. Engagement is driven by two components: underlying quality (60%) and outrage boost (40%). Toxicity tracks outrage directly.

```python
import numpy as np
from mosaic import govern

np.random.seed(42)
n = 200

outrage = np.random.beta(2, 5, n)
quality = np.random.uniform(0.3, 1.0, n)
engagement = 0.6 * quality + 0.4 * outrage + np.random.normal(0, 0.05, n)
engagement = np.clip(engagement, 0.01, 1.0)

toxicity = 0.7 * outrage + 0.3 * np.random.uniform(0, 0.4, n)
toxicity = np.clip(toxicity, 0, 1)

safety = 1.0 - toxicity  # steering signal: higher = safer
```

The measured correlation between engagement and toxicity is r = 0.424. This is not a bug — it mirrors reality. In real content feeds, inflammatory content genuinely drives engagement metrics. The engagement model is doing exactly what it was trained to do, and that is the problem.

When we rank purely by engagement and take the top 50 posts for the content feed, toxic content floats to the top. The base ranking is optimized for engagement, and engagement is entangled with outrage.
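The trap is easy to see concretely. Here is a self-contained sketch of the base ranking step, reusing the generation code above (the `top_idx` name and the toxic-above-median definition match the snippets later in this tutorial; the exact printed count depends on the random draws):

```python
import numpy as np

np.random.seed(42)
n = 200
outrage = np.random.beta(2, 5, n)
quality = np.random.uniform(0.3, 1.0, n)
engagement = np.clip(0.6 * quality + 0.4 * outrage
                     + np.random.normal(0, 0.05, n), 0.01, 1.0)
toxicity = np.clip(0.7 * outrage + 0.3 * np.random.uniform(0, 0.4, n), 0, 1)

# Rank purely by engagement and keep the top 50 for the feed.
top_idx = np.argsort(engagement)[::-1][:50]

# A post counts as "toxic" when its toxicity exceeds the corpus median.
toxic = toxicity > np.median(toxicity)
print(int(toxic[top_idx[:10]].sum()), "toxic posts land in the top-10")
```

Because engagement is partly driven by outrage, sorting by engagement alone pulls the high-outrage tail upward.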

Why Naive Penalties Fail

The obvious fix: subtract toxicity from engagement.

```python
penalty_weight = 0.5
naive_scores = {i: base_scores[i] - penalty_weight * tox_lookup[i] for i in base_scores}
```

Because toxicity correlates with engagement at r = 0.424, this does not just demote toxic posts — it reshuffles the entire ranking. The penalty subtracts a signal that shares information with the base scores, causing unpredictable reordering across all 50 items, not just the toxic ones.

The result is severe. Looking at the top-10 of the base ranking, 7 out of 10 posts are toxic (toxicity > median). The naive approach reduces this to 2 toxic posts in the top-10 — impressive at first glance. But the cost is devastating:

| Metric | Base | Naive |
| --- | --- | --- |
| Toxic in top-5 | 4 | 1 |
| Toxic in top-10 | 7 | 2 |
| Toxic in top-25 | 14 | 9 |
| Mean toxicity (top-10) | 0.339 | 0.250 |
| Kendall tau vs. base | 1.000 | 0.438 |
| Quality retained | 100.0% | 71.9% |

The naive approach achieves strong toxicity reduction (mean toxicity drops from 0.339 to 0.250), but Kendall tau plummets to 0.438. That means 56.2% of all pairwise orderings changed. Safe, engaging content gets rearranged as collateral damage. The ML team sees quality collapse and reverts the change. The policy team escalates. The cycle repeats.
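The two quality metrics are tightly linked: "quality retained" is the share of the base ranking's pairwise orderings that survive, and it relates to Kendall tau as quality = (tau + 1) / 2 — check: (0.438 + 1) / 2 = 0.719, matching the table. A minimal sketch with our own helper name (`ranking_metrics` is not part of the library):

```python
import numpy as np

def ranking_metrics(base_order, new_order):
    """Kendall tau between two orderings of the same items, plus
    'quality retained' = share of base pairwise orderings kept."""
    pos_new = {item: r for r, item in enumerate(new_order)}
    # Ranks of the items in the new ordering, listed in base order.
    b = np.array([pos_new[item] for item in base_order])
    n = len(b)
    concordant = sum(
        int(b[i] < b[j]) for i in range(n) for j in range(i + 1, n)
    )
    total = n * (n - 1) // 2
    quality = concordant / total   # fraction of pairs still in base order
    tau = 2 * quality - 1          # tau = (concordant - discordant) / total
    return tau, quality

base = list("abcdefghij")
swapped = base[:]
swapped[0], swapped[1] = swapped[1], swapped[0]  # one adjacent swap
tau, quality = ranking_metrics(base, swapped)
print(round(tau, 3), round(quality, 3))  # → 0.956 0.978 (1 of 45 pairs flipped)
```

Seen this way, naive's quality of 71.9% means more than a quarter of all pairwise decisions were overturned by a single subtraction.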

MOSAIC: Orthogonalize, Then Steer

govern() removes the engagement-toxicity correlation before steering. The remaining safety signal can only move posts where the engagement model is uncertain.

```python
steer_weight = 0.5
base_scores  = {int(i): float(engagement[i]) for i in top_idx}
steer_scores = {int(i): steer_weight * float(safety[i]) for i in top_idx}

result = govern(base_scores, steer_scores, budget=0.30)
```

Key diagnostics from the result:

  • Projection coefficient: -0.1154 — negative, confirming that safety is anti-correlated with engagement. This makes sense: safe content tends to be less engaging when outrage drives clicks. The coefficient is the slope of the least-squares fit of the steering signal onto engagement; orthogonalization subtracts that fitted component, so only the part of the safety signal that engagement does not already explain remains.
  • Protected edges: 14 — the budget locked the 30% most confident ordering decisions (14 of the 49 adjacent pairs in the 50-item ranking).
  • Active constraints: 7 — of those 14 protected edges, 7 actually bound the solution. At those 7 positions, steering wanted to reverse the ordering but could not because the edge was protected.

Head-to-Head: Three Approaches Compared

The notebook compares base, naive, and MOSAIC across multiple top-K thresholds:

| Method | Toxic/5 | Toxic/10 | Toxic/25 | MeanTox(10) | Tau | Quality |
| --- | --- | --- | --- | --- | --- | --- |
| Base | 4 | 7 | 14 | 0.339 | 1.000 | 100.0% |
| Naive | 1 | 2 | 9 | 0.250 | 0.438 | 71.9% |
| MOSAIC | 2 | 5 | 9 | 0.280 | 0.510 | 75.5% |

The honest story here is nuanced. The naive approach is more aggressive at removing toxicity from the top of the feed: 2 toxic posts in the top-10 versus MOSAIC's 5. If all you cared about was minimizing toxicity in the top-10, naive wins at budget=0.30.

But naive destroys ranking quality. Kendall tau of 0.438 means the ordering of the entire feed was reshuffled. Quality retention of 71.9% means nearly 30% of the base ranker's pairwise decisions were overturned. MOSAIC achieves tau = 0.510 and quality = 75.5% — a substantial improvement in quality preservation.

The real question is: at the same level of quality, which method achieves better toxicity reduction? And here MOSAIC wins clearly.

The Budget Knob: MOSAIC's Real Advantage

The true power of MOSAIC is not any single operating point — it is the budget knob. By sweeping budget from 0.00 to 1.00, you trace a smooth frontier of toxicity-quality tradeoffs:

```python
for b in [0.00, 0.10, 0.20, 0.30, 0.50, 0.70, 1.00]:
    r = govern(base_scores, steer_scores, budget=b)
    # measure toxic posts in top-10, mean toxicity, Kendall tau
```
| Budget | Toxic/10 | MeanTox(10) | Tau | Quality |
| --- | --- | --- | --- | --- |
| 0.00 | 2 | 0.250 | 0.456 | 72.8% |
| 0.10 | 2 | 0.250 | 0.456 | 72.8% |
| 0.20 | 4 | 0.274 | 0.476 | 73.8% |
| 0.30 | 5 | 0.280 | 0.510 | 75.5% |
| 0.50 | 4 | 0.284 | 0.522 | 76.1% |
| 0.70 | 4 | 0.284 | 0.551 | 77.6% |
| 1.00 | 7 | 0.339 | 1.000 | 100.0% |

Look at budget = 0.00 (no edges protected, maximum steering). MOSAIC achieves 2 toxic posts in the top-10 — matching the naive approach exactly — but with better quality: tau = 0.456 versus naive's 0.438. Quality retention is 72.8% versus 71.9%. At the same toxicity level, MOSAIC preserves more of the base ranking.

This is the fundamental advantage. At every budget level, MOSAIC achieves the best possible quality for that level of toxicity reduction. The budget knob gives you a smooth, monotonic frontier:

  • At budget = 0.00, you get maximum toxicity reduction (2 toxic in top-10) with reasonable quality (72.8%)
  • At budget = 0.30, you get moderate toxicity reduction (5 toxic in top-10) with good quality (75.5%)
  • At budget = 0.50, you still get meaningful reduction (4 toxic in top-10) with better quality (76.1%)
  • At budget = 1.00, you get the base ranking back entirely (7 toxic in top-10, quality = 100%)

There is no cliff. There is no sudden collapse. The tradeoff is smooth and predictable. The policy team and ML team can agree on a budget value that balances their competing needs, and they can adjust it in production without retraining anything.

Understanding the Projection Coefficient

The projection coefficient of -0.1154 deserves a closer look. The negative sign tells you that the safety signal (1 - toxicity) was anti-correlated with engagement. In other words, safe content tends to have lower engagement scores. This is exactly what we expect when outrage drives clicks.

The magnitude of 0.1154 tells you the strength of the correlation. If it were near zero, orthogonalization would have had little work to do — the steering signal was already providing mostly new information. At -0.1154, there was meaningful redundancy to strip out. Without orthogonalization, the naive approach over-penalizes highly-engaging content (because engaging content correlates with toxicity) and under-penalizes low-engagement toxic content (because the correlation masks their toxicity).
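The projection itself is ordinary least squares. Here is a minimal sketch of the idea, under our own assumptions (the function name is ours, and `govern()` may center or scale differently; the synthetic signals are built with a slope of -0.1 so the recovered coefficient is predictable):

```python
import numpy as np

def orthogonalize(steer, base):
    """Remove from `steer` the component explained by `base` (both centered)."""
    s = steer - steer.mean()
    b = base - base.mean()
    coef = (s @ b) / (b @ b)   # projection (least-squares) coefficient
    residual = s - coef * b    # what's left is uncorrelated with base
    return coef, residual

rng = np.random.default_rng(0)
base = rng.normal(size=500)
steer = -0.1 * base + rng.normal(scale=0.5, size=500)  # anti-correlated by design

coef, residual = orthogonalize(steer, base)
print(round(coef, 3))  # slope recovered near -0.1
print(abs(np.corrcoef(residual, base)[0, 1]) < 1e-8)  # residual ⟂ base
```

A negative coefficient, as in the tutorial's -0.1154, simply means the fitted component points against engagement; subtracting it leaves a steering signal with zero correlation to the base scores.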

Reading the Protected Edges and Active Constraints

The result object from govern() provides two additional diagnostics beyond the projection coefficient:

  • n_protected_edges = 14: Out of the 49 adjacent pairs in the 50-item ranking, 14 were locked by the budget. These are the 30% of gaps where the engagement model is most confident about the ordering.
  • n_active_constraints = 7: Of those 14 protected edges, 7 actually bound the solution. At those 7 positions, the safety steering signal wanted to reverse the ordering (move a safer post above a more engaging one), but the protected edge prevented it.

The ratio of active to protected constraints (7/14 = 50%) tells you that the budget was doing real work. Half of the protected edges were "live" — the safety signal was pushing against them. If this ratio were near zero, the budget was overly generous (the safety signal was not trying to reverse any protected orderings). If it were near 100%, the budget might be too tight — the safety signal is fighting on every front and could benefit from more room.

For content moderation specifically, a 50% active ratio at budget = 0.30 is a healthy signal. It means the engagement model's confident decisions are being protected where they matter, while the safety signal has enough room to operate in the uncertain middle.
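One way to read "the 30% most confident ordering decisions" is as the largest adjacent score gaps in the base ranking. The sketch below works under that assumption (the function is ours, not the library's; whether `govern()` floors or rounds the edge count is an implementation detail — flooring reproduces the 14-of-49 figure above):

```python
import numpy as np

def protected_edges(base_scores, budget=0.30):
    """Indices of adjacent pairs (rank i vs. rank i+1) locked by the budget."""
    order = np.argsort(base_scores)[::-1]   # best-first ranking
    gaps = -np.diff(base_scores[order])     # adjacent score gaps, all >= 0
    k = int(budget * len(gaps))             # int(0.30 * 49) = 14 edges
    # Protect the k largest gaps: the base model's most confident orderings.
    return set(np.argsort(gaps)[::-1][:k])

rng = np.random.default_rng(1)
scores = rng.random(50)
edges = protected_edges(scores, budget=0.30)
print(len(edges))  # → 14 of the 49 adjacent pairs
```

Steering can then reorder freely inside the unprotected gaps, which is exactly the "uncertain middle" the tutorial describes.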

When to Tune the Budget

The budget sweep reveals an important pattern: from budget 0.00 to 0.10, nothing changes. Both rows show identical results (2 toxic in top-10, tau = 0.456). This means the first 10% of protected edges were not binding — even with those edges locked, the steering signal did not want to reverse them.

The action starts between budget 0.10 and 0.20, where toxic posts in the top-10 jump from 2 to 4. This tells you that the 10th to 20th percentile of score gaps are the "battleground" where safety steering and engagement confidence collide.

For production deployment, start at budget = 0.30 (the default). If the policy team needs more aggressive toxicity reduction, decrease the budget toward 0.00. If the ML team is concerned about quality degradation, increase it toward 0.50. The smooth tradeoff means there is always a compromise that both teams can accept.

How MOSAIC Compares at Every Operating Point

It is worth comparing MOSAIC and naive at matched toxicity levels, not just at their default settings. The naive approach has only one knob: the weight multiplier. MOSAIC has the budget knob.

At the most aggressive MOSAIC setting (budget = 0.00), MOSAIC matches the naive approach's toxicity reduction — 2 toxic posts in the top-10, mean toxicity of 0.250 — while achieving better quality (tau = 0.456 vs. naive's 0.438). The 0.018 gap in tau may sound small, but it represents approximately 1.8 percentage points of pairwise orderings preserved. In a 50-item ranking, that is roughly 22 additional pairwise orderings that MOSAIC preserves over naive. For an ML team monitoring ranking quality metrics, that difference is significant.

At the recommended default (budget = 0.30), MOSAIC allows 5 toxic posts in the top-10 versus naive's 2. The toxicity reduction is less aggressive, but the quality improvement is substantial: tau = 0.510 versus 0.438, a 0.072 gap. That represents 7.2 percentage points of pairwise orderings — roughly 88 additional pairwise orderings preserved in a 50-item ranking.

The key insight: MOSAIC gives you a smooth frontier of tradeoffs. The naive approach gives you a single point. If the policy team is unhappy with naive's quality degradation, the only option is to reduce the penalty weight, which proportionally reduces toxicity reduction as well. With MOSAIC, you can independently control how much toxicity reduction you want (via steer_weight) and how much base-ranking quality to preserve (via budget).

Production Considerations

In a production content feed, the 50-item ranking we analyzed here maps directly to a single page load or scroll session. The top-10 corresponds to what users see immediately; the top-25 corresponds to the first scroll. This is why the top-10 and top-25 toxicity counts matter — they represent the toxic content that users are most likely to encounter.

The budget sweep results suggest a practical deployment strategy:

  1. Start at budget = 0.30. Measure the impact on engagement metrics and toxicity reports.
  2. If toxicity complaints persist, decrease the budget to 0.20 or 0.10. The table shows this reduces toxic posts in the top-10 from 5 to 4 to 2.
  3. If engagement drops more than acceptable, increase the budget to 0.50. Quality retention rises from 75.5% to 76.1% with only a modest change in toxicity (4 toxic in top-10 vs. 5).
  4. At budget = 0.70, quality retention reaches 77.6% and the feed is still meaningfully safer than the base (4 toxic in top-10 vs. 7).

The budget can be adjusted in real time without retraining the engagement model. This makes MOSAIC suitable for dynamic policy environments where toxicity thresholds change based on current events, regulatory requirements, or platform-specific standards.

Key Takeaway

Naive toxicity penalties subtract a signal that is correlated with engagement (r = 0.424), causing unpredictable reshuffling of the entire ranking. The naive approach achieves aggressive toxicity reduction (2 toxic in top-10) but at the cost of severe quality degradation (tau = 0.438, quality = 71.9%).

MOSAIC orthogonalizes first — the remaining safety signal can only move posts where the engagement model does not have a strong opinion. At budget = 0.00, MOSAIC matches naive's toxicity reduction with better quality (tau = 0.456 vs. 0.438). At budget = 0.30, MOSAIC achieves moderate toxicity reduction (5 toxic in top-10) with substantially better quality (tau = 0.510 vs. 0.438). The budget knob gives you a smooth frontier of tradeoffs that the naive approach simply cannot offer.

```python
from mosaic import govern
result = govern(engagement_scores, safety_scores, budget=0.30)
```

Run the full notebook: `content_moderation.ipynb`

Try governed-rank: `pip install governed-rank`