Diversity mandates are everywhere in recommendation systems. Regulators want it. Product teams want it. The intuition is obvious: show users a broader range of content and everyone benefits. So teams build diversity objectives — promote underexposed categories, surface long-tail items, penalize popularity concentration.
It doesn't work. Not because the engineering is wrong, but because the objective is wrong.
This article presents results from 34 policy candidates tested on 54,544 real browsing sessions from the Adressa Norwegian news dataset. The findings are counterintuitive but statistically robust: forcing diversity directly produces the worst preference alignment of any policy we tested. Steering toward quality content produces both higher engagement AND higher diversity. The evidence is triple-validated with bootstrap confidence intervals, negative controls, and independent behavioral metrics.
The Experiment
The Adressa dataset contains one day of browsing logs from a Norwegian news site: 2.4GB of raw data, processed into 601K article events across 1,049 articles with at least 5 views each, distributed across 54,544 sessions with at least 3 articles per session.
We generated 34 policy candidates across 8 categories:
- Category-based: nyheter (news), sports, culture, lifestyle, etc.
- Temporal: morning, evening, business hours
- Popularity: trending (top 10% by views), long-tail (bottom 30%), mid-tier
- Quality: top 25% by average active reading time
- Topics: local, weather, politics, business
- Diversity: underexposed categories, small categories
- Freshness: early morning publications
- Author: prolific contributors
For each policy, we ran govern() and measured Preference Lift — the ratio between the policy's share of what users actually read and its share of the catalog:

Preference Lift = P(article ∈ Policy | user reads article) / P(article ∈ Policy | catalog)

- Lift > 1.0: Users disproportionately read content from this policy set. The policy is aligned with user preference.
- Lift = 1.0: No signal; indistinguishable from random.
- Lift < 1.0: Users avoid this content relative to its catalog share. The policy fights user behavior.
This metric measures user behavior, not MOSAIC's ability to promote items. A policy with high lift means users actually want what you're steering toward.
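As a concrete sketch of the metric (not MOSAIC's actual implementation; `preference_lift` and its arguments are hypothetical), the two conditional probabilities are just set-membership fractions:

```python
def preference_lift(read_ids, policy_ids, catalog_ids):
    """Ratio of the policy's share of reads to its share of the catalog.

    Hypothetical helper, illustrating the definition above.
    """
    policy = set(policy_ids)
    catalog = set(catalog_ids)
    p_read = sum(a in policy for a in read_ids) / len(read_ids)
    p_catalog = len(policy & catalog) / len(catalog)
    return p_read / p_catalog

# Toy example: the policy covers 2 of 10 catalog items (20% of catalog),
# but 4 of 5 reads land in the policy set (80% of reads).
catalog = list(range(10))
policy = [0, 1]
reads = [0, 1, 0, 2, 1]
print(preference_lift(reads, policy, catalog))  # 0.8 / 0.2 = 4.0
```

A lift of 4.0 in this toy example would mean users read policy content at four times its catalog share.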
The Results: Aligned vs. Misaligned Policies
Top Performers (Aligned)
| Policy | Preference Lift | 95% CI | Interpretation |
|---|---|---|---|
| popularity_trending | 9.32× | [9.29, 9.35] | Users strongly prefer trending content |
| topic_business | 3.00× | [2.92, 3.08] | Business news is significantly underexposed |
| quality_high_engagement | 2.00× | [1.96, 2.03] | Users prefer quality content |
| freshness_early | 1.45× | [1.41, 1.49] | Fresh morning news is underserved |
Worst Performers (Misaligned)
| Policy | Preference Lift | 95% CI | Interpretation |
|---|---|---|---|
| diversity_underexposed | 0.25× | [0.24, 0.26] | Users actively avoid underexposed categories |
| diversity_small_categories | 0.32× | [0.30, 0.34] | Small categories are small for a reason |
| popularity_longtail_30 | 0.01× | [0.01, 0.01] | Bottom 30% by views — nobody wants this |
| culture_opinion | 0.20× | [0.18, 0.22] | Culture/opinion content is avoided |
| lifestyle | 0.27× | [0.25, 0.29] | Lifestyle content is misaligned |
The pattern is stark. Every diversity-forcing policy scores below 0.35× — users actively avoid the content these policies promote. Every quality and trending policy scores above 1.4× — users disproportionately engage with what these policies surface. The confidence intervals don't overlap. This isn't noise.
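The reported intervals can be reproduced in spirit with a percentile bootstrap. This is an illustrative sketch, not MOSAIC's tooling: `bootstrap_lift_ci` and the per-session lift series it consumes are assumptions.

```python
import random


def bootstrap_lift_ci(session_lifts, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-session lift values.

    Hypothetical helper; resamples sessions with replacement and reads
    off the (alpha/2, 1 - alpha/2) percentiles of the resampled means.
    """
    rng = random.Random(seed)
    n = len(session_lifts)
    means = sorted(
        sum(rng.choice(session_lifts) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Toy series centered on 2.0: the CI should straddle 2.0.
data = [1.9] * 50 + [2.1] * 50
print(bootstrap_lift_ci(data))
```

With real data, a policy is "aligned" when the whole interval sits above 1.0, as in the tables above.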
The Paradox: Quality Achieves What Diversity Cannot
Here's where it gets interesting. We measured diversity metrics — Shannon entropy of category distribution and Intra-List Distance (ILD) — for each policy's steered output:
| Policy | Preference Lift | Entropy Δ | ILD Δ | Novelty Δ |
|---|---|---|---|---|
diversity_underexposed | 0.25× | +0.039 | +0.016 | +0.052 |
popularity_longtail_30 | 0.01× | -0.074 | -0.029 | +0.108 |
quality_high_engagement | 2.00× | +0.136 | +0.082 | +0.071 |
Read that again. The quality policy achieves 3.5× more entropy gain and 5× more ILD gain than the diversity-forcing policy — while also having 8× higher preference lift. The policy explicitly designed to increase diversity barely moves the needle. The quality policy moves it massively, as a side effect.
Why? High-quality articles span multiple categories naturally. A great sports analysis, a thorough business investigation, and a compelling culture piece all score high on reading time. When you steer toward quality, you surface excellent content from diverse categories. When you steer toward underexposed categories, you surface mediocre content from niche categories that users don't want.
Long-Tail Is Not Diversity
To understand this further, we ran a diversity analysis on Adressa long-tail steering:
| Metric | Base → Steered | Change | Interpretation |
|---|---|---|---|
| Novelty | 0.668 → 0.695 | +0.027 | Less popular items surfaced |
| Coverage | 29 → 72 items | +148% | More catalog exploration |
| Category Entropy | 1.05 → 1.01 | -0.037 | Fewer categories represented |
| ILD | 0.45 → 0.43 | -0.019 | Less within-list diversity |
Long-tail steering increases novelty and coverage (you see more obscure items) but decreases category diversity. The unpopular articles cluster in a few niche categories. You get more items from the tail, but they're all from the same two or three categories. This is novelty, not diversity — and users don't want it (0.01× preference lift).
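The entropy and ILD columns above follow standard definitions. A sketch, with the caveat that category-match distance is one common ILD choice and not necessarily the exact distance used in the experiment:

```python
import math
from collections import Counter
from itertools import combinations


def category_entropy(categories):
    """Shannon entropy (nats) of the category distribution of a list."""
    counts = Counter(categories)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())


def intra_list_distance(categories):
    """Mean pairwise distance: 1 if categories differ, 0 if they match."""
    pairs = list(combinations(categories, 2))
    return sum(a != b for a, b in pairs) / len(pairs)


# The long-tail pattern from the table: more items surfaced, but they
# cluster in a couple of niche categories, so both metrics drop.
base = ["news", "sports", "culture"]
steered = ["niche_a", "niche_a", "niche_a", "niche_b"]
print(category_entropy(base) > category_entropy(steered))        # True
print(intra_list_distance(base) > intra_list_distance(steered))  # True
```

This is exactly the "novelty without diversity" failure mode: coverage grows while entropy and ILD shrink.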
Negative Controls: Is the Signal Real?
A reasonable objection: "Your quality label is based on reading time. Users who spend more time naturally engage more. The 2.00× preference lift could be a tautology."
We generated three negative controls:
| Control | Preference Lift | 95% CI | Interpretation |
|---|---|---|---|
CONTROL_random_25pct | 0.99× | [0.97, 1.01] | Random subset → no signal (as expected) |
CONTROL_shuffled_quality | 0.46× | [0.44, 0.48] | Shuffled labels → noise destroys signal |
CONTROL_inverse_quality | 0.93× | [0.90, 0.96] | Bottom 75% by reading time → users slightly avoid it |
The random control confirms the baseline is unbiased (0.99× ≈ 1.0). The shuffled control confirms the signal depends on the actual quality labels, not just the set size. The inverse control is the smoking gun: the bottom 75% by reading time scores 0.93× — users slightly avoid low-quality content. Quality 2.00× vs. inverse quality 0.93× is a 2.15× difference. The signal is real.
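Generating controls like these is mechanical. A sketch with `make_controls` as a hypothetical helper; note that when only article ids are available, a shuffled-label control reduces to a random subset of the same size as the quality set:

```python
import random


def make_controls(articles, quality_ids, seed=0):
    """Build the three negative-control policy sets described above.

    Hypothetical helper. articles: all article ids;
    quality_ids: the top-25%-by-reading-time set.
    """
    rng = random.Random(seed)
    random_25pct = set(rng.sample(articles, len(articles) // 4))
    # Shuffling labels over ids is equivalent to drawing a random set
    # of the same size as the quality set.
    shuffled_quality = set(rng.sample(articles, len(quality_ids)))
    inverse_quality = set(articles) - set(quality_ids)  # bottom 75%
    return random_25pct, shuffled_quality, inverse_quality
```

Each control set is then scored with the same preference-lift pipeline as a real policy; only the labels differ.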
Robustness: 8 Configurations, Zero Sign Flips
We re-estimated the quality preference lift across 8 configurations (combinations of random seed and sample size, with samples ranging from 1,000 to 10,000 sessions):
- Lift range: [1.96, 2.08]
- Lift mean ± std: 2.01 ± 0.034
- Sign flips (lift dropping below 1.0): 0 out of 8
- All lifts above 1.0: YES
The result is stable across seeds and sample sizes. The coefficient of variation is 1.7%. This is not a fragile finding.
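A sweep of this shape can be sketched as follows; `lift_fn`, the seed grid, and the sample sizes are illustrative assumptions rather than the experiment's exact configuration:

```python
import random
import statistics


def robustness_sweep(lift_fn, sessions, seeds=(0, 1),
                     sizes=(1000, 2500, 5000, 10000)):
    """Re-estimate lift on a seed x sample-size grid and count sign flips.

    lift_fn maps a list of sessions to a lift estimate (assumed interface).
    A "sign flip" is any configuration whose lift falls below 1.0.
    """
    lifts = []
    for seed in seeds:
        rng = random.Random(seed)
        for size in sizes:
            sample = rng.sample(sessions, min(size, len(sessions)))
            lifts.append(lift_fn(sample))
    flips = sum(l < 1.0 for l in lifts)
    return min(lifts), max(lifts), statistics.mean(lifts), flips
```

A stable finding shows a tight min/max range and zero flips, as reported above.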
Behavioral Validation: No Leakage Proof
To definitively rule out label leakage (quality defined by reading time → lift measured by engagement → circular), we computed four behavioral metrics that are completely independent of reading time:
- Session Depth: Number of articles read per session (more clicks = deeper engagement)
- Category Breadth: Number of unique categories visited
- Click Diversity: Shannon entropy of category distribution within session
- Session Continuation: Whether users continue browsing after encountering an article
We compared sessions containing quality articles (top 25% by reading time) against sessions without:
| Metric | Quality Sessions | Non-Quality Sessions | Δ | 95% CI |
|---|---|---|---|---|
| Session Depth | 4.50 articles | 3.75 articles | +20% | [+0.72, +0.78] |
| Category Breadth | 2.15 categories | 1.71 categories | +26% | [+0.43, +0.45] |
| Click Diversity | 0.872 entropy | 0.595 entropy | +47% | [+0.27, +0.29] |
| Continuation | 4.50 articles | 3.75 articles | +20% | [+0.72, +0.78] |
Sample sizes: 38,348 quality sessions, 16,196 non-quality sessions. All confidence intervals are entirely above zero. These metrics measure clicks, categories, and browsing patterns — not reading time.
Users who encounter quality articles:
- Read 20% more articles in the same session
- Visit 26% more categories
- Have 47% higher click diversity (entropy)
This means quality articles aren't just "long reads that inflate the reading time metric." They drive richer, more diverse browsing behavior. Users who encounter a great article explore more broadly, click on more categories, and stay longer in the session. The quality signal is real, and it drives both engagement and diversity through genuine user behavior.
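The per-session behavioral metrics are computable from click logs alone, with no reference to reading time. A sketch with a hypothetical `session_metrics` helper:

```python
import math
from collections import Counter


def session_metrics(session, quality_ids):
    """Reading-time-independent metrics for one session.

    Hypothetical helper. session: list of (article_id, category) clicks;
    quality_ids: the top-25%-by-reading-time article set.
    """
    cats = [c for _, c in session]
    counts = Counter(cats)
    total = len(cats)
    entropy = -sum(n / total * math.log(n / total) for n in counts.values())
    return {
        "depth": total,                # articles clicked this session
        "breadth": len(counts),        # unique categories visited
        "entropy": entropy,            # click diversity
        "has_quality": any(a in quality_ids for a, _ in session),
    }


# Toy split mirroring the table: the quality session is deeper and broader.
q = session_metrics([(1, "news"), (2, "sport"), (3, "biz")], {1})
nq = session_metrics([(4, "news"), (5, "news")], {1})
print(q["depth"] > nq["depth"], q["entropy"] > nq["entropy"])  # True True
```

Averaging these dictionaries over sessions with and without quality articles reproduces the comparison in the table above.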
Portfolio Optimization: The Best of Both Worlds
If trending content has the highest preference lift (9.32×) and quality content has the best diversity gains, what happens when you combine them?
We ran a portfolio optimization sweeping the mix between trending and quality steering:
| w_Trending | w_Quality | Budget | Recall | Entropy | vs Trending-Only |
|---|---|---|---|---|---|
| 1.0 | 0.0 | 30% | 35.0% | 1.017 | baseline |
| 0.8 | 0.2 | 30% | 32.5% | 1.072 | +5.4% diversity |
| 0.6 | 0.4 | 30% | 30.6% | 1.178 | +15.8% diversity |
| 0.0 | 1.0 | 30% | 26.4% | 1.064 | +4.6% diversity |
The 60-40 trending-quality mix trades 4.4 percentage points of recall for +15.8% diversity gain. Pure quality steering actually has less diversity than the mix, because trending content adds variety that quality alone doesn't capture. The optimal portfolio is a blend.
This is the Pareto frontier in action: govern() doesn't just steer toward a single objective. By combining steering signals, you can navigate the tradeoff between engagement, diversity, and any other policy dimension — and the budget parameter controls how aggressively you navigate it.
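A minimal sketch of a weighted blend under a budget cap. govern()'s actual orthogonalization and projection steps are more involved; `blended_score` and its arguments are purely illustrative.

```python
def blended_score(base_scores, trend_boost, quality_boost,
                  w_trend=0.6, w_quality=0.4, budget=0.3):
    """Blend two steering signals into a base ranking score.

    Each argument maps item id -> score or boost in [0, 1].
    The budget caps how far any item's score can move.
    """
    out = {}
    for item, base in base_scores.items():
        steer = (w_trend * trend_boost.get(item, 0.0)
                 + w_quality * quality_boost.get(item, 0.0))
        out[item] = base + budget * steer  # budget limits total adjustment
    return out


# Two equally ranked items: one trending, one high-quality.
# With a 60-40 mix and 30% budget, the trending item edges ahead.
scores = blended_score({"a": 0.5, "b": 0.5}, {"a": 1.0}, {"b": 1.0})
print(scores)  # {'a': 0.68, 'b': 0.62}
```

Sweeping `w_trend`/`w_quality` while holding the budget fixed traces out the recall-vs-entropy frontier shown in the table.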
Time-of-Day Segmentation
The Objective Discovery Engine also revealed that preference patterns shift dramatically by time of day:
| Category | Morning Lift | Evening Lift | Ratio |
|---|---|---|---|
| Weather | 2.01× | 0.49× | 4.1× more in mornings |
| Family | 1.65× | 0.42× | 3.9× more in mornings |
| Cars | 0.07× | 0.72× | 10× less in mornings |
| Consumer | 0.03× | 0.25× | 8× less in mornings |
Same user, same catalog, different time of day — preference lift changes by 4-10×. Morning users want weather and family content. Evening users tolerate car and consumer content. A static diversity mandate treats all time segments equally and gets it wrong in both.
Context-aware steering isn't a luxury feature. It's the difference between a 2.01× aligned policy and a 0.03× misaligned one.
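Segmenting lift by daypart is a small extension of the base metric: condition the reads on an hour window before computing the ratio. The hour windows and `lift_by_daypart` below are illustrative assumptions:

```python
def lift_by_daypart(events, policy_ids, catalog_share):
    """Preference lift computed separately for morning and evening reads.

    Hypothetical helper. events: list of (hour, article_id) clicks;
    catalog_share: the policy's fraction of the catalog.
    """
    def lift(hours):
        reads = [a for h, a in events if h in hours]
        if not reads:
            return None  # no traffic in this window
        read_share = sum(a in policy_ids for a in reads) / len(reads)
        return read_share / catalog_share

    return lift(range(6, 12)), lift(range(18, 24))


# Toy log: article 1 (policy) is read in the morning, never in the evening.
events = [(8, 1), (8, 1), (8, 2), (20, 2), (20, 2)]
morning, evening = lift_by_daypart(events, {1}, catalog_share=0.5)
print(morning > 1.0 > evening)  # True
```

The same split works for any context dimension (device, locale, weekday), not just time of day.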
The Takeaway
The data from 54,544 real sessions is unambiguous:
- Don't optimize diversity directly. The diversity_underexposed policy has 0.25× preference lift. Users actively avoid the content it promotes. Forcing it destroys engagement and produces no measurable diversity benefit.
- Optimize quality instead. The quality_high_engagement policy has 2.00× preference lift AND +47% click diversity. High-quality articles span categories naturally. Users engage more, browse more broadly, and visit more categories.
- Long-tail is not diversity. Long-tail steering increases novelty (you see more obscure items) but decreases category diversity. The unpopular items cluster in a few niche categories.
- The signal is real. Triple-validated: bootstrap CIs entirely above 1.0, inverse control at 0.93×, 0/8 sign flips across robustness checks, and independent behavioral metrics showing +20% session depth and +47% click diversity with CIs entirely above zero.
- Context matters. Morning vs. evening preference lift differs by 4-10×. Any policy that ignores context is leaving value on the table.
The counterintuitive conclusion: the fastest path to a diverse, engaging content experience is to stop trying to force diversity and start surfacing the best content in each category. Quality is the proxy for good diversity. The data is clear.
Related Insights
- Tutorial: Objective Discovery — Finding Policies That Work Before You Deploy. Not all policies are worth pursuing. This tutorial walks through the objective_discovery notebook: run 7 candidate policies through govern() to discover which objectives align with users and which fight them. quality_depth is the only policy with both high preference lift and diversity gain.
- Understanding governed-rank: How MOSAIC Steers Rankings Without Breaking Them. Every ranking system eventually needs a second objective. MOSAIC orthogonalizes the policy signal, protects confident decisions, and projects the optimal result — three steps, one function call, zero-interference steering with a full audit trail.
- One Algorithm, Six Datasets, Four Domains. Does govern() actually generalize? We ran the same function call — no retuning — across grocery, movies, fashion, news, and music. Policy lifts range from 1.15× to 41.7× across 6 real datasets, and the budget parameter controls stability smoothly in every domain.