Technical Deep Dive · Strategy

Why Forcing Diversity Backfires — And What Works Instead

We tested 34 policy candidates on 54,544 real news sessions. Diversity-forcing policies scored 0.25× preference lift. Quality-based steering scored 2.00× lift AND increased diversity by 47%. The data says: optimize quality, get diversity for free.

March 7, 2026 · 14 min read · governed-rank

Quality steering: 2.00× preference lift AND +47% click diversity

Diversity mandates are everywhere in recommendation systems. Regulators want it. Product teams want it. The intuition is obvious: show users a broader range of content and everyone benefits. So teams build diversity objectives — promote underexposed categories, surface long-tail items, penalize popularity concentration.

It doesn't work. Not because the engineering is wrong, but because the objective is wrong.

This article presents results from 34 policy candidates tested on 54,544 real browsing sessions from the Adressa Norwegian news dataset. The findings are counterintuitive but statistically robust: forcing diversity directly produces the worst preference alignment of any policy we tested. Steering toward quality content produces both higher engagement AND higher diversity. The evidence is triple-validated with bootstrap confidence intervals, negative controls, and independent behavioral metrics.

The Experiment

The Adressa dataset contains one day of browsing logs from a Norwegian news site: 2.4GB of raw data, processed into 601K article events across 1,049 articles with at least 5 views each, distributed across 54,544 sessions with at least 3 articles per session.
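The filtering step can be sketched in a few lines. This is an illustrative reconstruction of the thresholds above (5 views per article, 3 articles per session), not the actual preprocessing pipeline, which also parses timestamps, categories, and active reading time:

```python
from collections import Counter

def preprocess(sessions, min_views=5, min_session_len=3):
    """Keep articles with at least `min_views` total views, then keep
    sessions that still contain at least `min_session_len` such articles.
    Sessions are lists of article ids (a simplified stand-in for the
    real event records)."""
    views = Counter(a for s in sessions for a in s)
    kept_articles = {a for a, n in views.items() if n >= min_views}
    return [
        [a for a in s if a in kept_articles]
        for s in sessions
        if sum(a in kept_articles for a in s) >= min_session_len
    ]
```

With the real thresholds this reduces 601K events to 1,049 articles across 54,544 sessions; the function only shows the shape of that reduction.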

We generated 34 policy candidates across 8 categories:

  • Category-based: nyheter (news), sports, culture, lifestyle, etc.
  • Temporal: morning, evening, business hours
  • Popularity: trending (top 10% by views), long-tail (bottom 30%), mid-tier
  • Quality: top 25% by average active reading time
  • Topics: local, weather, politics, business
  • Diversity: underexposed categories, small categories
  • Freshness: early morning publications
  • Author: prolific contributors

For each policy, we ran govern() and measured Preference Lift — the ratio between how much users actually consume content from that policy set versus how much the catalog contains:

                   P(article ∈ Policy | user reads article)
Preference Lift = ─────────────────────────────────────────
                         P(article ∈ Policy | catalog)
  • Lift > 1.0: Users disproportionately read content from this policy set. The policy is aligned with user preference.
  • Lift = 1.0: No signal. Random.
  • Lift < 1.0: Users avoid this content relative to its catalog share. The policy fights user behavior.

This metric measures user behavior, not MOSAIC's ability to promote items. A policy with high lift means users actually want what you're steering toward.
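The metric is cheap to compute. A minimal sketch, using toy data (the function and data are illustrative, not the govern() API):

```python
def preference_lift(reads, policy_set, catalog):
    """P(article in policy | read) / P(article in policy | catalog).
    `reads` is the list of articles users actually consumed."""
    p_read = sum(a in policy_set for a in reads) / len(reads)
    p_catalog = len(policy_set & catalog) / len(catalog)
    return p_read / p_catalog

catalog = {f"a{i}" for i in range(10)}
policy = {"a0", "a1"}              # 20% of the catalog
reads = ["a0", "a1", "a0", "a5"]   # 75% of reads fall in the policy set
lift = preference_lift(reads, policy, catalog)  # 0.75 / 0.20 = 3.75
```

A lift of 3.75 here means users read policy content at nearly four times its catalog share.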

The Results: Aligned vs. Misaligned Policies

Top Performers (Aligned)

| Policy | Preference Lift | 95% CI | Interpretation |
|---|---|---|---|
| popularity_trending | 9.32× | [9.29, 9.35] | Users strongly prefer trending content |
| topic_business | 3.00× | [2.92, 3.08] | Business news is significantly underexposed |
| quality_high_engagement | 2.00× | [1.96, 2.03] | Users prefer quality content |
| freshness_early | 1.45× | [1.41, 1.49] | Fresh morning news is underserved |

Worst Performers (Misaligned)

| Policy | Preference Lift | 95% CI | Interpretation |
|---|---|---|---|
| diversity_underexposed | 0.25× | [0.24, 0.26] | Users actively avoid underexposed categories |
| diversity_small_categories | 0.32× | [0.30, 0.34] | Small categories are small for a reason |
| popularity_longtail_30 | 0.01× | [0.01, 0.01] | Bottom 30% by views: nobody wants this |
| culture_opinion | 0.20× | [0.18, 0.22] | Culture/opinion content is avoided |
| lifestyle | 0.27× | [0.25, 0.29] | Lifestyle content is misaligned |

The pattern is stark. Every diversity-forcing policy scores below 0.35× — users actively avoid the content these policies promote. Every quality and trending policy scores above 1.4× — users disproportionately engage with what these policies surface. The confidence intervals don't overlap. This isn't noise.
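The confidence intervals come from bootstrap resampling. A minimal percentile-bootstrap sketch (this version resamples individual reads for brevity; the study's CIs would more plausibly resample at the session level):

```python
import random

def bootstrap_lift_ci(reads, policy_set, p_catalog,
                      n_boot=1000, seed=0, alpha=0.05):
    """Percentile bootstrap CI for preference lift: resample the reads
    with replacement, recompute the lift each time, and take the
    (alpha/2, 1 - alpha/2) quantiles of the resampled lifts."""
    rng = random.Random(seed)
    lifts = []
    for _ in range(n_boot):
        sample = rng.choices(reads, k=len(reads))  # with replacement
        p_read = sum(a in policy_set for a in sample) / len(sample)
        lifts.append(p_read / p_catalog)
    lifts.sort()
    lo = lifts[int(alpha / 2 * n_boot)]
    hi = lifts[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Non-overlapping intervals between the aligned and misaligned groups are what justify the "this isn't noise" claim.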

The Paradox: Quality Achieves What Diversity Cannot

Here's where it gets interesting. We measured diversity metrics — Shannon entropy of category distribution and Intra-List Distance (ILD) — for each policy's steered output:

| Policy | Preference Lift | Entropy Δ | ILD Δ | Novelty Δ |
|---|---|---|---|---|
| diversity_underexposed | 0.25× | +0.039 | +0.016 | +0.052 |
| popularity_longtail_30 | 0.01× | -0.074 | -0.029 | +0.108 |
| quality_high_engagement | 2.00× | +0.136 | +0.082 | +0.071 |

Read that again. The quality policy achieves 3.5× more entropy gain and 5× more ILD gain than the diversity-forcing policy — while also having 8× higher preference lift. The policy explicitly designed to increase diversity barely moves the needle. The quality policy moves it massively, as a side effect.

Why? High-quality articles span multiple categories naturally. A great sports analysis, a thorough business investigation, and a compelling culture piece all score high on reading time. When you steer toward quality, you surface excellent content from diverse categories. When you steer toward underexposed categories, you surface mediocre content from niche categories that users don't want.

Long-Tail Is Not Diversity

To understand this further, we ran a diversity analysis on Adressa long-tail steering:

| Metric | Base → Steered | Change | Interpretation |
|---|---|---|---|
| Novelty | 0.668 → 0.695 | +0.027 | Less popular items surfaced |
| Coverage | 29 → 72 items | +148% | More catalog exploration |
| Category Entropy | 1.05 → 1.01 | -0.037 | Fewer categories represented |
| ILD | 0.45 → 0.43 | -0.019 | Less within-list diversity |

Long-tail steering increases novelty and coverage (you see more obscure items) but decreases category diversity. The unpopular articles cluster in a few niche categories. You get more items from the tail, but they're all from the same two or three categories. This is novelty, not diversity — and users don't want it (0.01× preference lift).
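The two diversity metrics are standard. A self-contained sketch, using a 0/1 category-mismatch distance for ILD (a stand-in for whatever item-feature distance the real pipeline uses):

```python
import math
from collections import Counter
from itertools import combinations

def category_entropy(items):
    """Shannon entropy (nats) of the category distribution over a list
    of (article_id, category) pairs."""
    counts = Counter(cat for _, cat in items)
    n = len(items)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def intra_list_distance(items):
    """Mean pairwise distance across the list; here simply 1 if two
    items' categories differ, else 0."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(a != b for (_, a), (_, b) in pairs) / len(pairs)

steered = [("a1", "sports"), ("a2", "business"),
           ("a3", "sports"), ("a4", "culture")]
```

Novelty and coverage can rise while both of these fall, which is exactly the long-tail pattern in the table above.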

Negative Controls: Is the Signal Real?

A reasonable objection: "Your quality label is based on reading time. Users who spend more time naturally engage more. The 2.00× preference lift could be a tautology."

We generated three negative controls:

| Control | Preference Lift | 95% CI | Interpretation |
|---|---|---|---|
| CONTROL_random_25pct | 0.99× | [0.97, 1.01] | Random subset → no signal (as expected) |
| CONTROL_shuffled_quality | 0.46× | [0.44, 0.48] | Shuffled labels → noise destroys signal |
| CONTROL_inverse_quality | 0.93× | [0.90, 0.96] | Bottom 75% by reading time → users avoid it |

The random control confirms the baseline is unbiased (0.99× ≈ 1.0). The shuffled control confirms the signal depends on the actual quality labels, not just the set size. The inverse control is the smoking gun: the bottom 75% by reading time scores 0.93× — users slightly avoid low-quality content. Quality 2.00× vs. inverse quality 0.93× is a 2.15× difference. The signal is real.
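Constructing the controls is straightforward. This is one plausible construction (the study's exact recipe may differ, especially for the shuffled-label control): a size-matched random subset, a permuted reassignment of the quality label, and the inverse set:

```python
import random

def negative_controls(catalog, quality_set, seed=0):
    """Build three control sets for a quality label covering the top
    25% of the catalog: random (same size), shuffled (label permuted
    across articles), and inverse (everything below the quality cut)."""
    rng = random.Random(seed)
    arts = sorted(catalog)
    random_ctrl = set(rng.sample(arts, len(quality_set)))
    relabeled = arts[:]
    rng.shuffle(relabeled)                    # permute label assignment
    shuffled_ctrl = set(relabeled[:len(quality_set)])
    inverse_ctrl = set(catalog) - set(quality_set)
    return random_ctrl, shuffled_ctrl, inverse_ctrl
```

Each control is then fed through the same preference-lift computation as the real policy, so any artifact of set size or measurement would show up in the controls too.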

Robustness: 8 Configurations, Zero Sign Flips

We ran the quality preference lift across 8 configurations, spanning 4 random seeds and sample sizes from 1,000 to 10,000 sessions:

  • Lift range: [1.96, 2.08]
  • Lift mean ± std: 2.01 ± 0.034
  • Sign flips (lift dropping below 1.0): 0 out of 8
  • All lifts above 1.0: YES

The result is stable across seeds and sample sizes. The coefficient of variation is 1.7%. This is not a fragile finding.
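The sweep itself is just a nested loop over seeds and sample sizes, recomputing the lift on each subsample and counting sign flips. A sketch of that shape (the seed/size grid here is illustrative, not the study's exact 8-configuration grid):

```python
import random
import statistics

def lift(reads, policy_set, p_catalog):
    return (sum(a in policy_set for a in reads) / len(reads)) / p_catalog

def robustness_sweep(sessions, policy_set, p_catalog,
                     seeds=(0, 1, 2, 3), sizes=(1000, 2500, 5000, 10000)):
    """Recompute preference lift on subsamples of sessions drawn with
    different seeds and sizes; report range, mean, and sign flips
    (lift dropping below 1.0)."""
    lifts = []
    for seed in seeds:
        for size in sizes:
            rng = random.Random(seed)
            sample = rng.sample(sessions, min(size, len(sessions)))
            reads = [a for s in sample for a in s]
            lifts.append(lift(reads, policy_set, p_catalog))
    flips = sum(x < 1.0 for x in lifts)
    return min(lifts), max(lifts), statistics.mean(lifts), flips
```

A stable finding shows a narrow min-max range and zero flips, which is what the quality policy delivered.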

Behavioral Validation: No Leakage Proof

To definitively rule out label leakage (quality defined by reading time → lift measured by engagement → circular), we computed four behavioral metrics that are completely independent of reading time:

  1. Session Depth: Number of articles read per session (more clicks = deeper engagement)
  2. Category Breadth: Number of unique categories visited
  3. Click Diversity: Shannon entropy of category distribution within session
  4. Session Continuation: Whether users continue browsing after encountering an article

We compared sessions containing quality articles (top 25% by reading time) against sessions without:

| Metric | Quality Sessions | Non-Quality Sessions | Δ | 95% CI |
|---|---|---|---|---|
| Session Depth | 4.50 articles | 3.75 articles | +20% | [+0.72, +0.78] |
| Category Breadth | 2.15 categories | 1.71 categories | +26% | [+0.43, +0.45] |
| Click Diversity | 0.872 entropy | 0.595 entropy | +47% | [+0.27, +0.29] |
| Continuation | 4.50 articles | 3.75 articles | +20% | [+0.72, +0.78] |

Sample sizes: 38,348 quality sessions, 16,196 non-quality sessions. All confidence intervals are entirely above zero. These metrics measure clicks, categories, and browsing patterns — not reading time.

Users who encounter quality articles:

  • Read 20% more articles in the same session
  • Visit 26% more categories
  • Have 47% higher click diversity (entropy)

This means quality articles aren't just "long reads that inflate the reading time metric." They drive richer, more diverse browsing behavior. Users who encounter a great article explore more broadly, click on more categories, and stay longer in the session. The quality signal is real, and it drives both engagement and diversity through genuine user behavior.
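The comparison boils down to computing click-only metrics per session and averaging within the two groups. A minimal sketch with hypothetical session data (none of these metrics touches reading time):

```python
import math
from collections import Counter

def session_metrics(session):
    """Depth, category breadth, and click-diversity entropy for one
    session of (article_id, category) pairs."""
    cats = Counter(c for _, c in session)
    n = len(session)
    entropy = -sum((k / n) * math.log(k / n) for k in cats.values())
    return {"depth": n, "breadth": len(cats), "click_diversity": entropy}

def compare_groups(sessions, quality_articles):
    """Split sessions by whether they contain a quality article, then
    average each metric within the two groups."""
    groups = {True: [], False: []}
    for s in sessions:
        has_q = any(a in quality_articles for a, _ in s)
        groups[has_q].append(session_metrics(s))
    return {k: {m: sum(d[m] for d in v) / len(v) for m in v[0]}
            for k, v in groups.items() if v}
```

In the real data this split yields 38,348 quality sessions against 16,196 non-quality sessions, with every Δ confidently above zero.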

Portfolio Optimization: The Best of Both Worlds

If trending content has the highest preference lift (9.32×) and quality content has the best diversity gains, what happens when you combine them?

We ran a portfolio optimization sweeping the mix between trending and quality steering:

| w_Trending | w_Quality | Budget | Recall | Entropy | vs Trending-Only |
|---|---|---|---|---|---|
| 1.0 | 0.0 | 30% | 35.0% | 1.017 | baseline |
| 0.8 | 0.2 | 30% | 32.5% | 1.072 | +5.4% diversity |
| 0.6 | 0.4 | 30% | 30.6% | 1.178 | +15.8% diversity |
| 0.0 | 1.0 | 30% | 26.4% | 1.064 | +4.6% diversity |

The 60-40 trending-quality mix trades 4.4 percentage points of recall for +15.8% diversity gain. Pure quality steering actually has less diversity than the mix, because trending content adds variety that quality alone doesn't capture. The optimal portfolio is a blend.

This is the Pareto frontier in action: govern() doesn't just steer toward a single objective. By combining steering signals, you can navigate the tradeoff between engagement, diversity, and any other policy dimension — and the budget parameter controls how aggressively you navigate it.
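The mixing logic is a weighted blend of steering signals, with the budget capping how much of the catalog gets boosted. A sketch using 0/1 set membership as the signal (govern()'s actual interface and scoring are not shown here; the function and names are illustrative):

```python
def portfolio_boost(trending, quality, w_trending, budget, catalog):
    """Score each article as a weighted blend of two steering signals
    and boost the top `budget` fraction of the catalog."""
    w_q = 1.0 - w_trending
    scored = sorted(
        catalog,
        key=lambda a: w_trending * (a in trending) + w_q * (a in quality),
        reverse=True,
    )
    k = int(budget * len(scored))
    return set(scored[:k])

catalog = [f"a{i}" for i in range(10)]
boosted = portfolio_boost({"a0", "a1"}, {"a2", "a3", "a4"},
                          w_trending=0.6, budget=0.3, catalog=catalog)
```

Sweeping `w_trending` from 1.0 to 0.0 at a fixed budget traces out the recall-vs-entropy frontier in the table above.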

Time-of-Day Segmentation

The Objective Discovery Engine also revealed that preference patterns shift dramatically by time of day:

| Category | Morning Lift | Evening Lift | Ratio |
|---|---|---|---|
| Weather | 2.01× | 0.49× | 4.1× more in mornings |
| Family | 1.65× | 0.42× | 3.9× more in mornings |
| Cars | 0.07× | 0.72× | 10× less in mornings |
| Consumer | 0.03× | 0.25× | 8× less in mornings |

Same user, same catalog, different time of day — preference lift changes by 4-10×. Morning users want weather and family content. Evening users tolerate car and consumer content. A static diversity mandate treats all time segments equally and gets it wrong in both.

Context-aware steering isn't a luxury feature. It's the difference between a 2.01× aligned policy and a 0.03× misaligned one.
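Segmented lift is the same metric computed over time-partitioned reads. A sketch assuming reads carry an hour-of-day, with illustrative segment boundaries (the study's exact morning/evening windows are not specified here):

```python
def lift_by_daypart(timed_reads, policy_set, p_catalog,
                    morning=range(5, 12), evening=range(17, 24)):
    """Compute preference lift separately for morning and evening.
    `timed_reads` is a list of (article_id, hour) pairs."""
    def seg_lift(hours):
        reads = [a for a, h in timed_reads if h in hours]
        if not reads:
            return None
        p_read = sum(a in policy_set for a in reads) / len(reads)
        return p_read / p_catalog
    return {"morning": seg_lift(morning), "evening": seg_lift(evening)}
```

The same policy set can come out strongly aligned in one segment and strongly misaligned in the other, which is why a static mandate fails in both.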

The Takeaway

The data from 54,544 real sessions is unambiguous:

  1. Don't optimize diversity directly. The diversity_underexposed policy has 0.25× preference lift. Users actively avoid the content it promotes. Forcing it destroys engagement and produces no measurable diversity benefit.
  2. Optimize quality instead. The quality_high_engagement policy has 2.00× preference lift AND +47% click diversity. High-quality articles span categories naturally. Users engage more, browse more broadly, and visit more categories.
  3. Long-tail is not diversity. Long-tail steering increases novelty (you see more obscure items) but decreases category diversity. The unpopular items cluster in a few niche categories.
  4. The signal is real. Triple-validated: bootstrap CIs entirely above 1.0, inverse control at 0.93×, 0/8 sign flips across robustness checks, and independent behavioral metrics showing +20% session depth and +47% click diversity with CIs entirely above zero.
  5. Context matters. Morning vs. evening preference lift differs by 4-10×. Any policy that ignores context is leaving value on the table.

The counterintuitive conclusion: the fastest path to a diverse, engaging content experience is to stop trying to force diversity and start surfacing the best content in each category. Quality is the proxy for good diversity. The data is clear.

Try governed-rank

pip install governed-rank