Diversity mandates are everywhere in recommendation systems. Regulators want it. Product teams want it. The intuition is obvious: show users a broader range of content and everyone benefits. So teams build diversity objectives — promote underexposed categories, surface long-tail items, penalize popularity concentration.
It doesn't work. Not because the engineering is wrong, but because the objective is wrong.
This article presents results from 34 policy candidates tested on 54,544 real browsing sessions from the Adressa Norwegian news dataset. The findings are counterintuitive but statistically robust: forcing diversity directly produces the worst preference alignment of any policy we tested. Steering toward quality content produces both higher engagement AND higher diversity. The evidence is triple-validated with bootstrap confidence intervals, negative controls, and independent behavioral metrics.
The Experiment
The Adressa dataset contains one day of browsing logs from a Norwegian news site: 2.4GB of raw data, processed into 601K article events across 1,049 articles with at least 5 views each, distributed across 54,544 sessions with at least 3 articles per session.
We generated 34 policy candidates across 8 categories:
- Category-based: nyheter (news), sports, culture, lifestyle, etc.
- Temporal: morning, evening, business hours
- Popularity: trending (top 10% by views), long-tail (bottom 30%), mid-tier
- Quality: top 25% by average active reading time
- Topics: local, weather, politics, business
- Diversity: underexposed categories, small categories
- Freshness: early morning publications
- Author: prolific contributors
For each policy, we ran govern() and measured Preference Lift — the ratio between the policy's share of what users actually read and its share of the catalog:

Preference Lift = P(article ∈ Policy | user reads article) / P(article ∈ Policy | catalog)

- Lift > 1.0: Users disproportionately read content from this policy set. The policy is aligned with user preference.
- Lift = 1.0: No signal; indistinguishable from random.
- Lift < 1.0: Users avoid this content relative to its catalog share. The policy fights user behavior.
This metric measures user behavior, not MOSAIC's ability to promote items. A policy with high lift means users actually want what you're steering toward.
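As a concrete sketch of the metric (not MOSAIC's actual implementation; `preference_lift` and its arguments are hypothetical), the two conditional probabilities are just set-membership fractions:

```python
def preference_lift(read_ids, policy_ids, catalog_ids):
    """Ratio of the policy's share of reads to its share of the catalog.

    Hypothetical helper, illustrating the definition above.
    """
    policy = set(policy_ids)
    catalog = set(catalog_ids)
    p_read = sum(a in policy for a in read_ids) / len(read_ids)
    p_catalog = len(policy & catalog) / len(catalog)
    return p_read / p_catalog

# Toy example: the policy covers 2 of 10 catalog items (20% of catalog),
# but 4 of 5 reads land in the policy set (80% of reads).
catalog = list(range(10))
policy = [0, 1]
reads = [0, 1, 0, 2, 1]
print(preference_lift(reads, policy, catalog))  # 0.8 / 0.2 = 4.0
```

A lift of 4.0 in this toy example would mean users read policy content at four times its catalog share.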
The Results: Aligned vs. Misaligned Policies
Top Performers (Aligned)
| Policy | Preference Lift | 95% CI | Interpretation |
|---|---|---|---|
| popularity_trending | 9.32× | [9.29, 9.35] | Users strongly prefer trending content |
| topic_business | 3.00× | [2.92, 3.08] | Business news is significantly underexposed |
| quality_high_engagement | 2.00× | [1.96, 2.03] | Users prefer quality content |
| freshness_early | 1.45× | [1.41, 1.49] | Fresh morning news is underserved |
Worst Performers (Misaligned)
| Policy | Preference Lift | 95% CI | Interpretation |
|---|---|---|---|
| diversity_underexposed | 0.25× | [0.24, 0.26] | Users actively avoid underexposed categories |
| diversity_small_categories | 0.32× | [0.30, 0.34] | Small categories are small for a reason |
| popularity_longtail_30 | 0.01× | [0.01, 0.01] | Bottom 30% by views — nobody wants this |
| culture_opinion | 0.20× | [0.18, 0.22] | Culture/opinion content is avoided |
| lifestyle | 0.27× | [0.25, 0.29] | Lifestyle content is misaligned |
The pattern is stark. Every diversity-forcing policy scores below 0.35× — users actively avoid the content these policies promote. Every quality and trending policy scores above 1.4× — users disproportionately engage with what these policies surface. The confidence intervals don't overlap. This isn't noise.
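The reported intervals can be reproduced in spirit with a percentile bootstrap. This is an illustrative sketch, not MOSAIC's tooling: `bootstrap_lift_ci` and the per-session lift series it consumes are assumptions.

```python
import random


def bootstrap_lift_ci(session_lifts, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-session lift values.

    Hypothetical helper; resamples sessions with replacement and reads
    off the (alpha/2, 1 - alpha/2) percentiles of the resampled means.
    """
    rng = random.Random(seed)
    n = len(session_lifts)
    means = sorted(
        sum(rng.choice(session_lifts) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Toy series centered on 2.0: the CI should straddle 2.0.
data = [1.9] * 50 + [2.1] * 50
print(bootstrap_lift_ci(data))
```

With real data, a policy is "aligned" when the whole interval sits above 1.0, as in the tables above.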
The Paradox: Quality Achieves What Diversity Cannot
Here's where it gets interesting. We measured diversity metrics — Shannon entropy of category distribution and Intra-List Distance (ILD) — for each policy's steered output:
| Policy | Preference Lift | Entropy Δ | ILD Δ | Novelty Δ |
|---|---|---|---|---|
diversity_underexposed | 0.25× | +0.039 | +0.016 | +0.052 |
popularity_longtail_30 | 0.01× | -0.074 | -0.029 | +0.108 |
quality_high_engagement | 2.00× | +0.136 | +0.082 | +0.071 |
Read that again. The quality policy achieves 3.5× more entropy gain and 5× more ILD gain than the diversity-forcing policy — while also having 8× higher preference lift. The policy explicitly designed to increase diversity barely moves the needle. The quality policy moves it massively, as a side effect.
Why? High-quality articles span multiple categories naturally. A great sports analysis, a thorough business investigation, and a compelling culture piece all score high on reading time. When you steer toward quality, you surface excellent content from diverse categories. When you steer toward underexposed categories, you surface mediocre content from niche categories that users don't want.
Long-Tail Is Not Diversity
To understand this further, we ran a diversity analysis on Adressa long-tail steering:
| Metric | Base → Steered | Change | Interpretation |
|---|---|---|---|
| Novelty | 0.668 → 0.695 | +0.027 | Less popular items surfaced |
| Coverage | 29 → 72 items | +148% | More catalog exploration |
| Category Entropy | 1.05 → 1.01 | -0.037 | Fewer categories represented |
| ILD | 0.45 → 0.43 | -0.019 | Less within-list diversity |
Long-tail steering increases novelty and coverage (you see more obscure items) but decreases category diversity. The unpopular articles cluster in a few niche categories. You get more items from the tail, but they're all from the same two or three categories. This is novelty, not diversity — and users don't want it (0.01× preference lift).
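The entropy and ILD columns above follow standard definitions. A sketch, with the caveat that category-match distance is one common ILD choice and not necessarily the exact distance used in the experiment:

```python
import math
from collections import Counter
from itertools import combinations


def category_entropy(categories):
    """Shannon entropy (nats) of the category distribution of a list."""
    counts = Counter(categories)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())


def intra_list_distance(categories):
    """Mean pairwise distance: 1 if categories differ, 0 if they match."""
    pairs = list(combinations(categories, 2))
    return sum(a != b for a, b in pairs) / len(pairs)


# The long-tail pattern from the table: more items surfaced, but they
# cluster in a couple of niche categories, so both metrics drop.
base = ["news", "sports", "culture"]
steered = ["niche_a", "niche_a", "niche_a", "niche_b"]
print(category_entropy(base) > category_entropy(steered))        # True
print(intra_list_distance(base) > intra_list_distance(steered))  # True
```

This is exactly the "novelty without diversity" failure mode: coverage grows while entropy and ILD shrink.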
Negative Controls: Is the Signal Real?
A reasonable objection: "Your quality label is based on reading time. Users who spend more time naturally engage more. The 2.00× preference lift could be a tautology."
We generated three negative controls:
| Control | Preference Lift | 95% CI | Interpretation |
|---|---|---|---|
CONTROL_random_25pct | 0.99× | [0.97, 1.01] | Random subset → no signal (as expected) |
CONTROL_shuffled_quality | 0.46× | [0.44, 0.48] | Shuffled labels → noise destroys signal |
CONTROL_inverse_quality | 0.93× | [0.90, 0.96] | Bottom 75% by reading time → users slightly avoid it |
The random control confirms the baseline is unbiased (0.99× ≈ 1.0). The shuffled control confirms the signal depends on the actual quality labels, not just the set size. The inverse control is the smoking gun: the bottom 75% by reading time scores 0.93× — users slightly avoid low-quality content. Quality 2.00× vs. inverse quality 0.93× is a 2.15× difference. The signal is real.
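Generating controls like these is mechanical. A sketch with `make_controls` as a hypothetical helper; note that when only article ids are available, a shuffled-label control reduces to a random subset of the same size as the quality set:

```python
import random


def make_controls(articles, quality_ids, seed=0):
    """Build the three negative-control policy sets described above.

    Hypothetical helper. articles: all article ids;
    quality_ids: the top-25%-by-reading-time set.
    """
    rng = random.Random(seed)
    random_25pct = set(rng.sample(articles, len(articles) // 4))
    # Shuffling labels over ids is equivalent to drawing a random set
    # of the same size as the quality set.
    shuffled_quality = set(rng.sample(articles, len(quality_ids)))
    inverse_quality = set(articles) - set(quality_ids)  # bottom 75%
    return random_25pct, shuffled_quality, inverse_quality
```

Each control set is then scored with the same preference-lift pipeline as a real policy; only the labels differ.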
Robustness: 8 Configurations, Zero Sign Flips
We re-estimated the quality preference lift across 8 configurations (combinations of random seed and sample size, with samples ranging from 1,000 to 10,000 sessions):
- Lift range: [1.96, 2.08]
- Lift mean ± std: 2.01 ± 0.034
- Sign flips (lift dropping below 1.0): 0 out of 8
- All lifts above 1.0: YES
The result is stable across seeds and sample sizes. The coefficient of variation is 1.7%. This is not a fragile finding.
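A sweep of this shape can be sketched as follows; `lift_fn`, the seed grid, and the sample sizes are illustrative assumptions rather than the experiment's exact configuration:

```python
import random
import statistics


def robustness_sweep(lift_fn, sessions, seeds=(0, 1),
                     sizes=(1000, 2500, 5000, 10000)):
    """Re-estimate lift on a seed x sample-size grid and count sign flips.

    lift_fn maps a list of sessions to a lift estimate (assumed interface).
    A "sign flip" is any configuration whose lift falls below 1.0.
    """
    lifts = []
    for seed in seeds:
        rng = random.Random(seed)
        for size in sizes:
            sample = rng.sample(sessions, min(size, len(sessions)))
            lifts.append(lift_fn(sample))
    flips = sum(l < 1.0 for l in lifts)
    return min(lifts), max(lifts), statistics.mean(lifts), flips
```

A stable finding shows a tight min/max range and zero flips, as reported above.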
Behavioral Validation: No Leakage Proof
To definitively rule out label leakage (quality defined by reading time → lift measured by engagement → circular), we computed four behavioral metrics that are completely independent of reading time:
- Session Depth: Number of articles read per session (more clicks = deeper engagement)
- Category Breadth: Number of unique categories visited
- Click Diversity: Shannon entropy of category distribution within session
- Session Continuation: Whether users continue browsing after encountering an article
We compared sessions containing quality articles (top 25% by reading time) against sessions without:
| Metric | Quality Sessions | Non-Quality Sessions | Δ | 95% CI |
|---|---|---|---|---|
| Session Depth | 4.50 articles | 3.75 articles | +20% | [+0.72, +0.78] |
| Category Breadth | 2.15 categories | 1.71 categories | +26% | [+0.43, +0.45] |
| Click Diversity | 0.872 entropy | 0.595 entropy | +47% | [+0.27, +0.29] |
| Continuation | 4.50 articles | 3.75 articles | +20% | [+0.72, +0.78] |
Sample sizes: 38,348 quality sessions, 16,196 non-quality sessions. All confidence intervals are entirely above zero. These metrics measure clicks, categories, and browsing patterns — not reading time.
Users who encounter quality articles:
- Read 20% more articles in the same session
- Visit 26% more categories
- Have 47% higher click diversity (entropy)
This means quality articles aren't just "long reads that inflate the reading time metric." They drive richer, more diverse browsing behavior. Users who encounter a great article explore more broadly, click on more categories, and stay longer in the session. The quality signal is real, and it drives both engagement and diversity through genuine user behavior.
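The per-session behavioral metrics are computable from click logs alone, with no reference to reading time. A sketch with a hypothetical `session_metrics` helper:

```python
import math
from collections import Counter


def session_metrics(session, quality_ids):
    """Reading-time-independent metrics for one session.

    Hypothetical helper. session: list of (article_id, category) clicks;
    quality_ids: the top-25%-by-reading-time article set.
    """
    cats = [c for _, c in session]
    counts = Counter(cats)
    total = len(cats)
    entropy = -sum(n / total * math.log(n / total) for n in counts.values())
    return {
        "depth": total,                # articles clicked this session
        "breadth": len(counts),        # unique categories visited
        "entropy": entropy,            # click diversity
        "has_quality": any(a in quality_ids for a, _ in session),
    }


# Toy split mirroring the table: the quality session is deeper and broader.
q = session_metrics([(1, "news"), (2, "sport"), (3, "biz")], {1})
nq = session_metrics([(4, "news"), (5, "news")], {1})
print(q["depth"] > nq["depth"], q["entropy"] > nq["entropy"])  # True True
```

Averaging these dictionaries over sessions with and without quality articles reproduces the comparison in the table above.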
Portfolio Optimization: The Best of Both Worlds
If trending content has the highest preference lift (9.32×) and quality content has the best diversity gains, what happens when you combine them?
We ran a portfolio optimization sweeping the mix between trending and quality steering:
| w_Trending | w_Quality | Budget | Recall | Entropy | vs Trending-Only |
|---|---|---|---|---|---|
| 1.0 | 0.0 | 30% | 35.0% | 1.017 | baseline |
| 0.8 | 0.2 | 30% | 32.5% | 1.072 | +5.4% diversity |
| 0.6 | 0.4 | 30% | 30.6% | 1.178 | +15.8% diversity |
| 0.0 | 1.0 | 30% | 26.4% | 1.064 | +4.6% diversity |
The 60-40 trending-quality mix trades 4.4 percentage points of recall for +15.8% diversity gain. Pure quality steering actually has less diversity than the mix, because trending content adds variety that quality alone doesn't capture. The optimal portfolio is a blend.
This is the Pareto frontier in action: govern() doesn't just steer toward a single objective. By combining steering signals, you can navigate the tradeoff between engagement, diversity, and any other policy dimension — and the budget parameter controls how aggressively you navigate it.
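A minimal sketch of a weighted blend under a budget cap. govern()'s actual orthogonalization and projection steps are more involved; `blended_score` and its arguments are purely illustrative.

```python
def blended_score(base_scores, trend_boost, quality_boost,
                  w_trend=0.6, w_quality=0.4, budget=0.3):
    """Blend two steering signals into a base ranking score.

    Each argument maps item id -> score or boost in [0, 1].
    The budget caps how far any item's score can move.
    """
    out = {}
    for item, base in base_scores.items():
        steer = (w_trend * trend_boost.get(item, 0.0)
                 + w_quality * quality_boost.get(item, 0.0))
        out[item] = base + budget * steer  # budget limits total adjustment
    return out


# Two equally ranked items: one trending, one high-quality.
# With a 60-40 mix and 30% budget, the trending item edges ahead.
scores = blended_score({"a": 0.5, "b": 0.5}, {"a": 1.0}, {"b": 1.0})
print(scores)  # {'a': 0.68, 'b': 0.62}
```

Sweeping `w_trend`/`w_quality` while holding the budget fixed traces out the recall-vs-entropy frontier shown in the table.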
Time-of-Day Segmentation
The Objective Discovery Engine also revealed that preference patterns shift dramatically by time of day:
| Category | Morning Lift | Evening Lift | Ratio |
|---|---|---|---|
| Weather | 2.01× | 0.49× | 4.1× more in mornings |
| Family | 1.65× | 0.42× | 3.9× more in mornings |
| Cars | 0.07× | 0.72× | 10× less in mornings |
| Consumer | 0.03× | 0.25× | 8× less in mornings |
Same user, same catalog, different time of day — preference lift changes by 4-10×. Morning users want weather and family content. Evening users tolerate car and consumer content. A static diversity mandate treats all time segments equally and gets it wrong in both.
Context-aware steering isn't a luxury feature. It's the difference between a 2.01× aligned policy and a 0.03× misaligned one.
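Segmenting lift by daypart is a small extension of the base metric: condition the reads on an hour window before computing the ratio. The hour windows and `lift_by_daypart` below are illustrative assumptions:

```python
def lift_by_daypart(events, policy_ids, catalog_share):
    """Preference lift computed separately for morning and evening reads.

    Hypothetical helper. events: list of (hour, article_id) clicks;
    catalog_share: the policy's fraction of the catalog.
    """
    def lift(hours):
        reads = [a for h, a in events if h in hours]
        if not reads:
            return None  # no traffic in this window
        read_share = sum(a in policy_ids for a in reads) / len(reads)
        return read_share / catalog_share

    return lift(range(6, 12)), lift(range(18, 24))


# Toy log: article 1 (policy) is read in the morning, never in the evening.
events = [(8, 1), (8, 1), (8, 2), (20, 2), (20, 2)]
morning, evening = lift_by_daypart(events, {1}, catalog_share=0.5)
print(morning > 1.0 > evening)  # True
```

The same split works for any context dimension (device, locale, weekday), not just time of day.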
The Takeaway
The data from 54,544 real sessions is unambiguous:
- Don't optimize diversity directly. The diversity_underexposed policy has 0.25× preference lift. Users actively avoid the content it promotes. Forcing it destroys engagement and produces no measurable diversity benefit.
- Optimize quality instead. The quality_high_engagement policy has 2.00× preference lift AND +47% click diversity. High-quality articles span categories naturally. Users engage more, browse more broadly, and visit more categories.
- Long-tail is not diversity. Long-tail steering increases novelty (you see more obscure items) but decreases category diversity. The unpopular items cluster in a few niche categories.
- The signal is real. Triple-validated: bootstrap CIs entirely above 1.0, inverse control at 0.93×, 0/8 sign flips across robustness checks, and independent behavioral metrics showing +20% session depth and +47% click diversity with CIs entirely above zero.
- Context matters. Morning vs. evening preference lift differs by 4-10×. Any policy that ignores context is leaving value on the table.
The counterintuitive conclusion: the fastest path to a diverse, engaging content experience is to stop trying to force diversity and start surfacing the best content in each category. Quality is the proxy for good diversity. The data is clear.
Related Insights
- Tutorial: Objective Discovery — Finding Policies That Work Before You Deploy. Not all policies are worth pursuing. This tutorial walks through the objective_discovery notebook: run 7 candidate policies through govern() to discover which objectives align with users and which fight them. quality_depth is the only policy with both high preference lift and diversity gain.
- Understanding governed-rank: How MOSAIC Steers Rankings Without Breaking Them. Every ranking system eventually needs a second objective. MOSAIC orthogonalizes the policy signal, protects confident decisions, and projects the optimal result — three steps, one function call, zero-interference steering with a full audit trail.
- One Algorithm, Six Datasets, Four Domains. Does govern() actually generalize? We ran the same function call — no retuning — across grocery, movies, fashion, news, and music. Policy lifts range from 1.15× to 41.7× across 6 real datasets, and the budget parameter controls stability smoothly in every domain.