Not all policies are worth pursuing. Some objectives align with user preferences — steering toward them helps everyone. Others fight user behavior — forcing them destroys engagement and produces nothing of value. The question is: how do you know which is which before running an expensive A/B test?
This tutorial walks through the objective_discovery.ipynb notebook. We use govern() as a decision platform: run 7 candidate policies through it, measure each policy's impact on engagement and diversity, and discover which objectives work BEFORE deploying anything.
## The Setup: A Synthetic News Catalog
We generate a synthetic news catalog with 500 articles across 7 categories:
| Category | Count | Share |
|---|---|---|
| news | 154 | 30.8% |
| sports | 87 | 17.4% |
| tech | 62 | 12.4% |
| lifestyle | 57 | 11.4% |
| business | 52 | 10.4% |
| culture | 44 | 8.8% |
| opinion | 44 | 8.8% |
Each article has engagement (the base model's score, with a slight category boost for news and sports), views (popularity, log-normal, correlated with engagement), reading time (a quality/depth proxy, driven by content quality NOT category), and freshness (hours since publication).
```python
import numpy as np
from mosaic import govern

np.random.seed(42)
n = 500
cat_names = ['news', 'sports', 'business', 'culture', 'lifestyle', 'tech', 'opinion']
cat_probs = [0.30, 0.20, 0.10, 0.08, 0.12, 0.12, 0.08]
categories = np.random.choice(cat_names, size=n, p=cat_probs)

# cat_boost encodes the engagement model's bias toward news and sports;
# the boost values, views, and hours_ago below are illustrative stand-ins
# for the notebook's own definitions
cat_boost = {c: 0.0 for c in cat_names}
cat_boost['news'] = cat_boost['sports'] = 0.15
engagement = np.random.beta(2, 5, n) + np.array([cat_boost[c] for c in categories])
reading_time = np.random.gamma(3, 2, n) + 2 * engagement
views = np.random.lognormal(3, 1, n) * (1 + engagement)  # popularity, correlated with engagement
hours_ago = np.random.uniform(0, 72, n)                  # freshness: hours since publication
```

The base ranking over-represents news and sports because the engagement model has a category bias. Culture and opinion articles are underexposed despite having content that some users would value.
The base top-50 (ranked purely by engagement) has a Shannon entropy of 2.155 across categories. A perfectly uniform distribution across 7 categories would have entropy of log2(7) = 2.807. The base entropy of 2.155 is well below uniform because news and sports dominate the top positions.
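The entropy numbers above can be reproduced in a few lines; a minimal sketch of base-2 Shannon entropy over the category counts of a top-k slice:

```python
import numpy as np

def category_entropy(cats):
    """Base-2 Shannon entropy of a list of category labels."""
    _, counts = np.unique(np.asarray(cats), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# A uniform split over 7 categories hits the log2(7) ceiling:
uniform = [c for c in range(7) for _ in range(10)]
print(round(category_entropy(uniform), 3))  # → 2.807
```

The base top-50's 2.155 comes from feeding it the category labels of the 50 highest-engagement articles instead.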
## Simulating User Behavior: 10,000 Reads
We simulate 10,000 reading events to model user behavior. Users prefer engaging, popular, and high-quality content — but they do not explicitly seek diversity. The simulation generates realistic reading patterns where some articles are read frequently and others are ignored.
The base top-50 captures 2,328 reads out of the 10,000 simulated events.
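The notebook's exact simulator is not reproduced here; a minimal sketch of the idea, where a utility score combines engagement, popularity, and quality, and reads are sampled in proportion to it (the weights and functional form are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
engagement = rng.beta(2, 5, n)      # base model score
views = rng.lognormal(3, 1, n)      # popularity
reading_time = rng.gamma(3, 2, n)   # quality/depth proxy

# Users favor engaging, popular, high-quality content (illustrative weights);
# note there is no diversity term: users do not explicitly seek diversity
utility = 1.5 * engagement + 0.5 * np.log1p(views) + 0.3 * reading_time
p_read = np.exp(utility) / np.exp(utility).sum()  # softmax over the catalog

reads = rng.choice(n, size=10_000, p=p_read)      # 10,000 reading events
read_counts = np.bincount(reads, minlength=n)     # per-article read totals
```

The softmax concentrates reads on high-utility articles, which yields the realistic skew described above: a few articles are read often, many are ignored.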
## Preference Lift: Measuring Policy Alignment
Preference lift is the key diagnostic. It measures whether users disproportionately read articles from a given policy set:
Lift = P(article in Policy | user reads it) / P(article in Policy | catalog)

- Lift > 1.0: users prefer these articles more than their catalog share would predict. The policy is aligned with user behavior.
- Lift = 1.0: users read these articles at exactly their catalog rate. The policy is neutral.
- Lift < 1.0: users avoid these articles relative to their catalog share. The policy is misaligned — forcing it will fight user behavior.
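Lift is a ratio of two shares, so it reduces to a few lines; a sketch assuming a boolean policy-membership array and per-article read counts (toy data below for a self-contained check):

```python
import numpy as np

def preference_lift(mask, read_counts):
    """P(article in policy | read) / P(article in policy | catalog)."""
    mask = np.asarray(mask, dtype=bool)
    read_share = read_counts[mask].sum() / read_counts.sum()
    catalog_share = mask.mean()
    return read_share / catalog_share

# Toy check: the policy covers half the catalog but captures all the reads
reads = np.array([10, 10, 0, 0])
mask = np.array([True, True, False, False])
print(preference_lift(mask, reads))  # → 2.0
```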
## The 7 Candidate Policies
We define 7 candidate policies, each representing a different editorial or strategic objective:
```python
policies = {
    'trending': views >= np.percentile(views, 90),
    'quality_depth': reading_time >= np.percentile(reading_time, 75),
    'freshness_recent': hours_ago <= 12,
    'topic_sports': categories == 'sports',
    'topic_business': categories == 'business',
    'diversity_underexposed': np.isin(categories, ['culture', 'opinion']),
    'longtail': views <= np.percentile(views, 30),
}
```

Preference lift results:
| Policy | Size | Lift |
|---|---|---|
| trending | 50 | 3.22x |
| quality_depth | 125 | 1.90x |
| topic_sports | 87 | 1.05x |
| freshness_recent | 208 | 1.02x |
| topic_business | 52 | 0.80x |
| diversity_underexposed | 88 | 0.36x |
| longtail | 151 | 0.35x |
The results split into three clear groups:
Strongly aligned (lift > 1.5x): Trending content has a 3.22x lift — users read trending articles at 3.22 times the rate their catalog share would predict. Quality/depth content has a 1.90x lift. These policies go with user behavior.
Neutral (lift near 1.0x): Sports (1.05x) and freshness (1.02x) are essentially neutral. Users read them at roughly their catalog rate. Steering toward these would have limited benefit.
Misaligned (lift < 1.0x): Business (0.80x), diversity/underexposed (0.36x), and longtail (0.35x) are actively misaligned. Users avoid longtail articles at a rate of 0.35x — they read them at only 35% of the rate their catalog share would predict. Forcing these policies would destroy engagement for minimal diversity gain.
This is the first critical insight: forcing diversity (underexposed categories) fights user behavior at 0.36x lift. Users are not avoiding culture and opinion articles by accident — they genuinely prefer other content.
## The Scorecard: govern() as Decision Platform
For each policy, we run govern() with the base engagement scores and measure two outcomes:
- Reads captured: how many of the 10,000 simulated reads land in the governed top-50 (engagement proxy)
- Category entropy: Shannon entropy of the top-50 category distribution (diversity proxy)
```python
base_scores = {i: float(engagement[i]) for i in range(n)}
steer_weight = 0.15

for name, mask in policies.items():
    steering = {i: steer_weight * float(mask[i]) for i in range(n)}
    r = govern(base_scores, steering, budget=0.30)
    top_k = r.ranked_items[:50]
    # measure entropy and reads captured for the governed top-50
```

Full scorecard:
| Policy | Size | Lift | Entropy | dEnt | Reads | dReads | Proj |
|---|---|---|---|---|---|---|---|
| trending | 50 | 3.22x | 1.902 | -0.254 | 2,857 | +529 | 0.0774 |
| quality_depth | 125 | 1.90x | 2.325 | +0.170 | 2,822 | +494 | 0.0101 |
| freshness_recent | 208 | 1.02x | 2.281 | +0.126 | 2,008 | -320 | 0.0039 |
| topic_sports | 87 | 1.05x | 2.105 | -0.051 | 2,205 | -123 | 0.0174 |
| topic_business | 52 | 0.80x | 2.315 | +0.159 | 2,137 | -191 | -0.0042 |
| diversity_underexposed | 88 | 0.36x | 2.570 | +0.415 | 2,071 | -257 | -0.0902 |
| longtail | 151 | 0.35x | 2.461 | +0.305 | 2,097 | -231 | -0.1869 |
The columns:
- Size: Number of articles in the policy set
- Lift: Preference lift (user alignment)
- Entropy: Shannon entropy of governed top-50 category distribution
- dEnt: Change in entropy vs. base (positive = more diverse)
- Reads: Simulated reads captured in governed top-50
- dReads: Change in reads vs. base (positive = more engagement)
- Proj: Projection coefficient (how much correlation was removed)
The scorecard reveals four quadrants:
| | More Reads (+dReads) | Fewer Reads (-dReads) |
|---|---|---|
| More Diverse (+dEnt) | quality_depth | diversity_underexposed, longtail, freshness, business |
| Less Diverse (-dEnt) | trending | topic_sports |
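The quadrant assignment is just the sign pattern of the two deltas; a small check using the (dEnt, dReads) values from the scorecard above:

```python
# (dEnt, dReads) deltas from the scorecard table
scorecard = {
    'trending':               (-0.254, +529),
    'quality_depth':          (+0.170, +494),
    'freshness_recent':       (+0.126, -320),
    'topic_sports':           (-0.051, -123),
    'topic_business':         (+0.159, -191),
    'diversity_underexposed': (+0.415, -257),
    'longtail':               (+0.305, -231),
}

both_improved = [name for name, (d_ent, d_reads) in scorecard.items()
                 if d_ent > 0 and d_reads > 0]
print(both_improved)  # → ['quality_depth']
```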
Only one policy lands in the top-left quadrant (more reads AND more diversity): quality_depth.
## The Gold Mine: Quality Beats Forced Diversity
This is the key finding of the entire notebook. quality_depth (top 25% by reading time) is the only policy that achieves both high preference lift AND positive diversity gain:
- Lift: 1.90x — users strongly prefer quality articles
- dEntropy: +0.170 — the top-50 becomes more diverse (entropy rises from 2.155 to 2.325)
- dReads: +494 — the governed top-50 captures 494 more reads than the base (2,822 vs. 2,328)
- Projection coefficient: 0.0101 — near zero, meaning quality is barely correlated with engagement. Orthogonalization had almost nothing to strip out. Quality provides genuinely new information.
Compare with the alternatives:
- Trending has higher lift (3.22x) and captures more reads (+529), but reduces diversity (dEntropy = -0.254). Steering toward trending concentrates the feed in news and sports.
- Diversity steering (underexposed categories) increases entropy the most (+0.415), but users do not want the content (lift = 0.36x) and reads drop by 257. Forcing culture and opinion articles onto users costs engagement.
- Quality threads the needle: users want it, AND it naturally diversifies the feed because quality articles come from every category, not just the dominant ones.
The insight is powerful: you do not need to force diversity. Surface quality content and diversity follows as a side effect. Quality articles from business, culture, tech, and opinion rise in the feed, naturally broadening the category distribution without fighting user preferences.
## Projection Coefficients: A Diagnostic Tool
The projection coefficients tell you how much each policy signal is correlated with the base engagement scores:
- trending: 0.0774 — positively correlated. Trending articles tend to already be highly ranked by engagement. Orthogonalization strips some of the redundancy.
- quality_depth: 0.0101 — near zero. Quality is nearly independent of engagement. This is why quality is such a good steering signal: it provides genuinely new information.
- diversity_underexposed: -0.0902 — negatively correlated. Underexposed categories (culture, opinion) tend to have lower engagement. Orthogonalization removes this anti-correlation, but the remaining signal is still misaligned with user preferences.
- longtail: -0.1869 — strongly negatively correlated. Longtail articles (low views) are anti-correlated with engagement. The large negative coefficient means orthogonalization had to strip a lot of redundancy. After stripping, the remaining signal is still misaligned (lift = 0.35x).
A near-zero projection coefficient is the sweet spot: it means the steering signal provides genuinely new information that the base ranker does not already capture. Quality is the best example.
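govern()'s internal orthogonalization is not shown in this excerpt; assuming it uses the standard least-squares projection (coefficient = covariance of the steering signal with the base scores, divided by the base variance), the diagnostic can be sketched as:

```python
import numpy as np

def projection_coefficient(steering, base):
    """Least-squares coefficient of the steering signal onto the base
    scores: the redundant component that orthogonalization strips out."""
    base_c = base - base.mean()
    steer_c = steering - steering.mean()
    return float(steer_c @ base_c / (base_c @ base_c))

rng = np.random.default_rng(1)
base = rng.normal(size=1000)
redundant = 0.5 * base + 0.1 * rng.normal(size=1000)  # mostly re-states base
fresh = rng.normal(size=1000)                          # independent signal

print(round(projection_coefficient(redundant, base), 1))  # → 0.5
print(projection_coefficient(fresh, base))                # near zero
```

A redundant signal yields a coefficient near its mixing weight; an independent signal yields a coefficient near zero, which is the quality_depth regime above.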
## Portfolio Optimization: Mixing Policies
The notebook also explores mixing trending and quality signals at different weights. This traces a portfolio frontier:
```python
trending = policies['trending'].astype(float)       # boolean masks → float signals
quality = policies['quality_depth'].astype(float)
weights = [(1.0, 0.0), (0.8, 0.2), (0.6, 0.4), (0.4, 0.6), (0.2, 0.8), (0.0, 1.0)]
for wt, wq in weights:
    steering = {i: steer_weight * (wt * trending[i] + wq * quality[i]) for i in range(n)}
    r = govern(base_scores, steering, budget=0.30)
```

| Mix (trend/quality) | Entropy | dEnt | Reads | dReads |
|---|---|---|---|---|
| 100/0 | 1.902 | -0.254 | 2,857 | +529 |
| 80/20 | 1.942 | -0.213 | 2,966 | +638 |
| 60/40 | 1.942 | -0.213 | 3,014 | +686 |
| 40/60 | 1.996 | -0.159 | 2,964 | +636 |
| 20/80 | 2.271 | +0.116 | 2,854 | +526 |
| 0/100 | 2.325 | +0.170 | 2,822 | +494 |
| base | 2.155 | +0.000 | 2,328 | +0 |
The 60/40 trending-quality mix captures the most reads of any mix: 3,014 (dReads = +686), a 29.5% improvement over the base. But diversity drops (dEntropy = -0.213). The 80/20 mix captures 2,966 reads with the same diversity penalty.
The critical observation: only the quality-dominant mixes clear both bars. Every trending-heavy mix (60/40 and above) improves reads but reduces diversity relative to the base ranking; the 20/80 and pure quality (0/100) mixes are the only operating points that improve both dimensions simultaneously.
| Mix | More Reads than Base? | More Diverse than Base? | Both? |
|---|---|---|---|
| 100/0 (pure trending) | Yes (+529) | No (-0.254) | No |
| 80/20 | Yes (+638) | No (-0.213) | No |
| 60/40 | Yes (+686) | No (-0.213) | No |
| 40/60 | Yes (+636) | No (-0.159) | No |
| 20/80 | Yes (+526) | Yes (+0.116) | Yes |
| 0/100 (pure quality) | Yes (+494) | Yes (+0.170) | Yes |
The 20/80 mix also achieves both improvements, trading some diversity gain (+0.116 vs. +0.170) for extra reads (+526 vs. +494). If you want maximum reads, use a trending-heavy mix and accept the diversity cost. If you want both reads and diversity, pick a quality-dominant mix: 20/80 for more reads, pure quality for more diversity.
## Why Forced Diversity Fails (And Quality Succeeds)
The data tells a clear story about why forced diversity is counterproductive. The diversity_underexposed policy (steering toward culture and opinion articles) has a preference lift of only 0.36x. Users read culture and opinion articles at 36% of the rate their catalog share would predict. Forcing these articles into the top-50 costs 257 reads (an 11.0% engagement drop) while achieving the highest entropy gain (+0.415).
The projection coefficient for diversity steering is -0.0902 — strongly negative. This means underexposed categories are anti-correlated with engagement. Orthogonalization strips out this anti-correlation, but the remaining signal is still misaligned with user preferences. Even after removing the engagement interference, users simply do not want these articles.
Quality steering achieves a better outcome through a different mechanism. Quality articles (top 25% by reading time) exist in every category. When you steer toward quality, you promote the best business articles, the best culture articles, the best opinion articles — content that users actually want to read. The natural consequence is a more diverse top-50, because the best articles are distributed across categories rather than concentrated in news and sports.
The projection coefficient for quality (0.0101) confirms this mechanism. Quality is nearly independent of engagement. It provides genuinely new information about which articles are worth promoting. Engagement captures popularity; quality captures depth. They measure different things, which is exactly why quality is such an effective steering signal.
## The Decision Framework
The notebook establishes a three-step framework for policy evaluation:
1. Compute preference lift for each candidate policy. Policies with lift below 1.0 are fighting user behavior and should be deprioritized or abandoned.
2. Run each policy through `govern()` and measure the scorecard: reads captured, category entropy, projection coefficient. The scorecard tells you what each policy achieves in practice, not just in theory.
3. Explore portfolio mixes of the top candidates. The frontier reveals tradeoffs and identifies operating points that improve multiple dimensions.
This framework replaces expensive A/B testing for policy selection. You still need A/B tests to validate the final choice in production, but you can eliminate clearly bad policies (longtail, forced diversity) before spending any production traffic on them.
## Key Takeaway
govern() is not just a reranking tool — it is a policy experimentation platform. Run candidate policies through the scorecard BEFORE deploying. The finding from this notebook: quality steering (top 25% by reading time) achieves a 1.90x preference lift, captures 494 more reads than the base (a 21.2% improvement), and increases category entropy by 0.170 — making it the only policy that improves both engagement and diversity simultaneously.
Forced diversity (steering toward underexposed categories) has a preference lift of only 0.36x, loses 257 reads versus the base, and produces a projection coefficient of -0.0902 indicating strong anti-correlation with engagement. It is a costly intervention that users actively resist.
The counterintuitive insight: you do not need to force diversity. Surface quality content and diversity follows as a side effect. This is exactly the kind of finding you can only discover by systematically evaluating candidate policies against user behavior data.
```python
from mosaic import govern

for policy_name, signal in candidate_policies.items():
    result = govern(base_scores, signal, budget=0.30)
    # measure lift, diversity, quality retention
```

Run the full notebook: `objective_discovery.ipynb`
## Related Insights
**Understanding governed-rank: How MOSAIC Steers Rankings Without Breaking Them**
Every ranking system eventually needs a second objective. MOSAIC orthogonalizes the policy signal, protects confident decisions, and projects the optimal result — in three steps, one function call.
*Zero-interference steering with a full audit trail.*
**Tutorial: Content Moderation — Demoting Toxicity Without Killing Engagement**
Toxic content is engaging — outrage drives clicks. This tutorial walks through the content_moderation notebook: why naive penalties over-correct, and how govern() targets only the uncertain zone.
*Toxicity drops in the top-10 while ranking quality is preserved.*
**Tutorial: Fraud Detection — Steering Review Queues Toward High-Value Fraud**
A fraud model ranks by probability, but a $50K wire transfer matters more than a $5 candy purchase. This tutorial walks through the fraud_detection notebook: how govern() steers toward high-impact fraud without flooding the queue with false positives.
*10x fraud value captured in the BLOCK tier vs. the base model.*