
Tutorial: Objective Discovery — Finding Policies That Work Before You Deploy

Not all policies are worth pursuing. This tutorial walks through the objective_discovery notebook: run 7 candidate policies through govern() to discover which objectives align with users and which fight them.

March 7, 2026 · 10 min · governed-rank

quality_depth is the only policy with high preference lift AND diversity gain

Not all policies are worth pursuing. Some objectives align with user preferences — steering toward them helps everyone. Others fight user behavior — forcing them destroys engagement and produces nothing of value. The question is: how do you know which is which before running an expensive A/B test?

This tutorial walks through the objective_discovery.ipynb notebook. We use govern() as a decision platform: run 7 candidate policies through it, measure each policy's impact on engagement and diversity, and discover which objectives work BEFORE deploying anything.

The Setup: A Synthetic News Catalog

We generate a synthetic news catalog with 500 articles across 7 categories:

Category    Count  Share
news          154  30.8%
sports         87  17.4%
tech           62  12.4%
lifestyle      57  11.4%
business       52  10.4%
culture        44   8.8%
opinion        44   8.8%

Each article has engagement (the base model's score, with a slight category boost for news and sports), views (popularity, log-normal, correlated with engagement), reading time (a quality/depth proxy, driven by content quality NOT category), and freshness (hours since publication).

```python
import numpy as np
from mosaic import govern

np.random.seed(42)
n = 500

cat_names = ['news', 'sports', 'business', 'culture', 'lifestyle', 'tech', 'opinion']
cat_probs = [0.30, 0.20, 0.10, 0.08, 0.12, 0.12, 0.08]
categories = np.random.choice(cat_names, size=n, p=cat_probs)

# Slight category boost for news and sports (illustrative values; the
# notebook's exact parameters may differ)
cat_boost = {c: 0.0 for c in cat_names}
cat_boost['news'], cat_boost['sports'] = 0.10, 0.05

engagement = np.random.beta(2, 5, n) + np.array([cat_boost[c] for c in categories])
views = np.random.lognormal(mean=3 + engagement, sigma=1.0)  # popularity, correlated with engagement
reading_time = np.random.gamma(3, 2, n) + 2 * engagement     # quality/depth proxy
hours_ago = np.random.uniform(0, 72, n)                      # hours since publication
```

The base ranking over-represents news and sports because the engagement model has a category bias. Culture and opinion articles are underexposed despite having content that some users would value.

The base top-50 (ranked purely by engagement) has a Shannon entropy of 2.155 across categories. A perfectly uniform distribution across 7 categories would have entropy of log2(7) = 2.807. The base entropy of 2.155 is well below uniform because news and sports dominate the top positions.
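Category entropy is straightforward to compute from the top-50's category labels. A minimal sketch (the helper name is illustrative, not from the library):

```python
import math
from collections import Counter

def category_entropy(cats):
    """Shannon entropy (in bits) of a list of category labels."""
    counts = Counter(cats)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A perfectly uniform spread over 7 categories hits the log2(7) ceiling:
uniform = ['news', 'sports', 'tech', 'lifestyle', 'business', 'culture', 'opinion'] * 7
print(round(category_entropy(uniform), 3))  # 2.807
```

Any skew away from uniform (news and sports crowding the top positions) pulls the entropy below that ceiling, which is what the base top-50's 2.155 reflects.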

Simulating User Behavior: 10,000 Reads

We simulate 10,000 reading events to model user behavior. Users prefer engaging, popular, and high-quality content — but they do not explicitly seek diversity. The simulation generates realistic reading patterns where some articles are read frequently and others are ignored.

The base top-50 captures 2,328 reads out of the 10,000 simulated events.

Preference Lift: Measuring Policy Alignment

Preference lift is the key diagnostic. It measures whether users disproportionately read articles from a given policy set:

Lift = P(article in Policy | user reads it) / P(article in Policy | catalog)
  • Lift > 1.0: users prefer these articles more than their catalog share would predict. The policy is aligned with user behavior.
  • Lift = 1.0: users read these articles at exactly their catalog rate. The policy is neutral.
  • Lift < 1.0: users avoid these articles relative to their catalog share. The policy is misaligned — forcing it will fight user behavior.
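The lift ratio reduces to two shares: the policy's share of observed reads over its share of the catalog. A minimal sketch, assuming a boolean mask per policy and an array of read article IDs (both names are illustrative):

```python
import numpy as np

def preference_lift(policy_mask, read_article_ids, n_catalog):
    """Lift = P(in policy | read) / P(in policy | catalog)."""
    read_share = np.mean(policy_mask[read_article_ids])  # share of reads hitting the policy set
    catalog_share = policy_mask.sum() / n_catalog        # policy's share of the catalog
    return read_share / catalog_share

# Toy check: policy covers 2 of 10 articles (20% of catalog),
# but 6 of 10 reads land on them -> lift = 0.6 / 0.2 = 3.0
mask = np.zeros(10, dtype=bool)
mask[[0, 1]] = True
reads = np.array([0, 1, 0, 1, 0, 1, 2, 3, 4, 5])
print(round(preference_lift(mask, reads, 10), 6))  # 3.0
```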

The 7 Candidate Policies

We define 7 candidate policies, each representing a different editorial or strategic objective:

```python
policies = {
    'trending':               views >= np.percentile(views, 90),
    'quality_depth':          reading_time >= np.percentile(reading_time, 75),
    'freshness_recent':       hours_ago <= 12,
    'topic_sports':           categories == 'sports',
    'topic_business':         categories == 'business',
    'diversity_underexposed': np.isin(categories, ['culture', 'opinion']),
    'longtail':               views <= np.percentile(views, 30),
}
```

Preference lift results:

Policy                  Size   Lift
trending                  50  3.22x
quality_depth            125  1.90x
topic_sports              87  1.05x
freshness_recent         208  1.02x
topic_business            52  0.80x
diversity_underexposed    88  0.36x
longtail                 151  0.35x

The results split into three clear groups:

Strongly aligned (lift > 1.5x): Trending content has a 3.22x lift — users read trending articles at 3.22 times the rate their catalog share would predict. Quality/depth content has a 1.90x lift. These policies go with user behavior.

Neutral (lift near 1.0x): Sports (1.05x) and freshness (1.02x) are essentially neutral. Users read them at roughly their catalog rate. Steering toward these would have limited benefit.

Misaligned (lift < 1.0x): Business (0.80x), diversity/underexposed (0.36x), and longtail (0.35x) are actively misaligned. Users avoid longtail articles at a rate of 0.35x — they read them at only 35% of the rate their catalog share would predict. Forcing these policies would destroy engagement for minimal diversity gain.

This is the first critical insight: forcing diversity (underexposed categories) fights user behavior at 0.36x lift. Users are not avoiding culture and opinion articles by accident — they genuinely prefer other content.

The Scorecard: govern() as Decision Platform

For each policy, we run govern() with the base engagement scores and measure two outcomes:

  • Reads captured: how many of the 10,000 simulated reads land in the governed top-50 (engagement proxy)
  • Category entropy: Shannon entropy of the top-50 category distribution (diversity proxy)
```python
base_scores = {i: float(engagement[i]) for i in range(n)}
steer_weight = 0.15

for name, mask in policies.items():
    steering = {i: steer_weight * float(mask[i]) for i in range(n)}
    r = govern(base_scores, steering, budget=0.30)
    top_k = r.ranked_items[:50]
    # measure entropy and reads captured
```

Full scorecard:

Policy                  Size   Lift  Entropy    dEnt  Reads  dReads     Proj
trending                  50  3.22x    1.902  -0.254  2,857    +529   0.0774
quality_depth            125  1.90x    2.325  +0.170  2,822    +494   0.0101
freshness_recent         208  1.02x    2.281  +0.126  2,008    -320   0.0039
topic_sports              87  1.05x    2.105  -0.051  2,205    -123   0.0174
topic_business            52  0.80x    2.315  +0.159  2,137    -191  -0.0042
diversity_underexposed    88  0.36x    2.570  +0.415  2,071    -257  -0.0902
longtail                 151  0.35x    2.461  +0.305  2,097    -231  -0.1869

The columns:

  • Size: Number of articles in the policy set
  • Lift: Preference lift (user alignment)
  • Entropy: Shannon entropy of governed top-50 category distribution
  • dEnt: Change in entropy vs. base (positive = more diverse)
  • Reads: Simulated reads captured in governed top-50
  • dReads: Change in reads vs. base (positive = more engagement)
  • Proj: Projection coefficient (how much correlation was removed)

The scorecard reveals four quadrants:

                      More Reads (+dReads)   Fewer Reads (-dReads)
More Diverse (+dEnt)  quality_depth          diversity_underexposed, longtail, freshness, business
Less Diverse (-dEnt)  trending               topic_sports

Only one policy lands in the top-left quadrant (more reads AND more diversity): quality_depth.
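The quadrant assignment is mechanical once you have the scorecard: a policy wins outright only if both deltas are positive. A quick sketch using the (dEnt, dReads) values from the scorecard above:

```python
scorecard = {
    # policy: (d_entropy, d_reads) from the scorecard
    'trending':               (-0.254, +529),
    'quality_depth':          (+0.170, +494),
    'freshness_recent':       (+0.126, -320),
    'topic_sports':           (-0.051, -123),
    'topic_business':         (+0.159, -191),
    'diversity_underexposed': (+0.415, -257),
    'longtail':               (+0.305, -231),
}

winners = [name for name, (d_ent, d_reads) in scorecard.items()
           if d_ent > 0 and d_reads > 0]
print(winners)  # ['quality_depth']
```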

The Gold Mine: Quality Beats Forced Diversity

This is the key finding of the entire notebook. quality_depth (top 25% by reading time) is the only policy that achieves both high preference lift AND positive diversity gain:

  • Lift: 1.90x — users strongly prefer quality articles
  • dEntropy: +0.170 — the top-50 becomes more diverse (entropy rises from 2.155 to 2.325)
  • dReads: +494 — the governed top-50 captures 494 more reads than the base (2,822 vs. 2,328)
  • Projection coefficient: 0.0101 — near zero, meaning quality is barely correlated with engagement. Orthogonalization had almost nothing to strip out. Quality provides genuinely new information.

Compare with the alternatives:

  • Trending has higher lift (3.22x) and captures more reads (+529), but reduces diversity (dEntropy = -0.254). Steering toward trending concentrates the feed in news and sports.
  • Diversity steering (underexposed categories) increases entropy the most (+0.415), but users do not want the content (lift = 0.36x) and reads drop by 257. Forcing culture and opinion articles onto users costs engagement.
  • Quality threads the needle: users want it, AND it naturally diversifies the feed because quality articles come from every category, not just the dominant ones.

The insight is powerful: you do not need to force diversity. Surface quality content and diversity follows as a side effect. Quality articles from business, culture, tech, and opinion rise in the feed, naturally broadening the category distribution without fighting user preferences.

Projection Coefficients: A Diagnostic Tool

The projection coefficients tell you how much each policy signal is correlated with the base engagement scores:

  • trending: 0.0774 — positively correlated. Trending articles tend to already be highly ranked by engagement. Orthogonalization strips some of the redundancy.
  • quality_depth: 0.0101 — near zero. Quality is nearly independent of engagement. This is why quality is such a good steering signal: it provides genuinely new information.
  • diversity_underexposed: -0.0902 — negatively correlated. Underexposed categories (culture, opinion) tend to have lower engagement. Orthogonalization removes this anti-correlation, but the remaining signal is still misaligned with user preferences.
  • longtail: -0.1869 — strongly negatively correlated. Longtail articles (low views) are anti-correlated with engagement. The large negative coefficient means orthogonalization had to strip a lot of redundancy. After stripping, the remaining signal is still misaligned (lift = 0.35x).

A near-zero projection coefficient is the sweet spot: it means the steering signal provides genuinely new information that the base ranker does not already capture. Quality is the best example.
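The library's internal normalization is not shown in the notebook, but the coefficient behaves like a standard centered least-squares projection: regress the steering signal onto the base scores, report the slope, and keep only the residual. A sketch under that assumption:

```python
import numpy as np

def orthogonalize(steering, base):
    """Remove the component of `steering` that lies along `base`
    (centered least-squares projection); returns (residual, coefficient)."""
    s = steering - steering.mean()
    b = base - base.mean()
    coef = np.dot(s, b) / np.dot(b, b)
    return s - coef * b, coef

rng = np.random.default_rng(0)
base = rng.normal(size=500)
steering = 0.5 * base + rng.normal(size=500)  # deliberately correlated signal
residual, coef = orthogonalize(steering, base)
print(abs(np.dot(residual, base - base.mean())) < 1e-9)  # True: residual is orthogonal to base
```

A large |coef| means much of the signal was redundant with engagement and got stripped; a coef near zero, as with quality, means the signal arrives already orthogonal.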

Portfolio Optimization: Mixing Policies

The notebook also explores mixing trending and quality signals at different weights. This traces a portfolio frontier:

```python
# `trending` and `quality` are the boolean policy masks from above, cast to float
trending = policies['trending'].astype(float)
quality = policies['quality_depth'].astype(float)

weights = [(1.0, 0.0), (0.8, 0.2), (0.6, 0.4), (0.4, 0.6), (0.2, 0.8), (0.0, 1.0)]

for wt, wq in weights:
    steering = {i: steer_weight * (wt * trending[i] + wq * quality[i]) for i in range(n)}
    r = govern(base_scores, steering, budget=0.30)
```
Mix (trend/quality)  Entropy    dEnt  Reads  dReads
100/0                  1.902  -0.254  2,857    +529
80/20                  1.942  -0.213  2,966    +638
60/40                  1.942  -0.213  3,014    +686
40/60                  1.996  -0.159  2,964    +636
20/80                  2.271  +0.116  2,854    +526
0/100                  2.325  +0.170  2,822    +494
base                   2.155  +0.000  2,328      +0

The 60/40 trending-quality mix captures the most reads of any mix: 3,014 (dReads = +686), a 29.5% improvement over the base. But diversity drops (dEntropy = -0.213). The 80/20 mix captures 2,966 reads with the same diversity penalty.

The critical observation: only the quality-dominant mixes (20/80 and pure quality 0/100) achieve BOTH more reads AND higher diversity than the base ranking. Every mix with 40% trending or more improves reads but reduces diversity. Pure quality is the operating point that improves both dimensions with the largest diversity gain.

Mix                    More Reads than Base?  More Diverse than Base?  Both?
100/0 (pure trending)  Yes (+529)             No (-0.254)              No
80/20                  Yes (+638)             No (-0.213)              No
60/40                  Yes (+686)             No (-0.213)              No
40/60                  Yes (+636)             No (-0.159)              No
20/80                  Yes (+526)             Yes (+0.116)             Yes
0/100 (pure quality)   Yes (+494)             Yes (+0.170)             Yes

The 20/80 mix also achieves both improvements, but with lower diversity gain (+0.116 vs. +0.170) and higher reads (+526 vs. +494). If you want maximum reads, use a trending-heavy mix and accept the diversity cost. If you want both reads and diversity, use pure quality.

Why Forced Diversity Fails (And Quality Succeeds)

The data tells a clear story about why forced diversity is counterproductive. The diversity_underexposed policy (steering toward culture and opinion articles) has a preference lift of only 0.36x. Users read culture and opinion articles at 36% of the rate their catalog share would predict. Forcing these articles into the top-50 costs 257 reads (an 11.0% engagement drop) while achieving the highest entropy gain (+0.415).

The projection coefficient for diversity steering is -0.0902 — strongly negative. This means underexposed categories are anti-correlated with engagement. Orthogonalization strips out this anti-correlation, but the remaining signal is still misaligned with user preferences. Even after removing the engagement interference, users simply do not want these articles.

Quality steering achieves a better outcome through a different mechanism. Quality articles (top 25% by reading time) exist in every category. When you steer toward quality, you promote the best business articles, the best culture articles, the best opinion articles — content that users actually want to read. The natural consequence is a more diverse top-50, because the best articles are distributed across categories rather than concentrated in news and sports.

The projection coefficient for quality (0.0101) confirms this mechanism. Quality is nearly independent of engagement. It provides genuinely new information about which articles are worth promoting. Engagement captures popularity; quality captures depth. They measure different things, which is exactly why quality is such an effective steering signal.

The Decision Framework

The notebook establishes a three-step framework for policy evaluation:

  1. Compute preference lift for each candidate policy. Policies with lift below 1.0 are fighting user behavior and should be deprioritized or abandoned.
  2. Run each policy through govern() and measure the scorecard: reads captured, category entropy, projection coefficient. The scorecard tells you what each policy achieves in practice, not just in theory.
  3. Explore portfolio mixes of the top candidates. The frontier reveals tradeoffs and identifies operating points that improve multiple dimensions.
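The first two steps can be condensed into a single screening pass. A minimal sketch (the function and helper names are illustrative, not part of the library; `lift_fn` and `scorecard_fn` stand in for the measurements shown earlier):

```python
def screen_policies(policies, lift_fn, scorecard_fn, min_lift=1.0):
    """Step 1: drop candidates whose preference lift fights user behavior.
    Step 2: build the govern() scorecard only for the survivors."""
    survivors = {name: sig for name, sig in policies.items()
                 if lift_fn(sig) >= min_lift}
    return {name: scorecard_fn(sig) for name, sig in survivors.items()}

# Toy run with precomputed lifts standing in for the real measurement:
lifts = {'quality_depth': 1.90, 'longtail': 0.35}
result = screen_policies({k: k for k in lifts},
                         lift_fn=lifts.get,
                         scorecard_fn=lambda sig: {'lift': lifts[sig]})
print(sorted(result))  # ['quality_depth']
```

Only the survivors of step 1 are worth the cost of step 2's full scorecard, and only the scorecard winners advance to step 3's portfolio search.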

This framework replaces expensive A/B testing for policy selection. You still need A/B tests to validate the final choice in production, but you can eliminate clearly bad policies (longtail, forced diversity) before spending any production traffic on them.

Key Takeaway

govern() is not just a reranking tool — it is a policy experimentation platform. Run candidate policies through the scorecard BEFORE deploying. The finding from this notebook: quality steering (top 25% by reading time) achieves a 1.90x preference lift, captures 494 more reads than the base (a 21.2% improvement), and increases category entropy by 0.170 — making it the only policy that improves both engagement and diversity simultaneously.

Forced diversity (steering toward underexposed categories) has a preference lift of only 0.36x, loses 257 reads versus the base, and produces a projection coefficient of -0.0902 indicating strong anti-correlation with engagement. It is a costly intervention that users actively resist.

The counterintuitive insight: you do not need to force diversity. Surface quality content and diversity follows as a side effect. This is exactly the kind of finding you can only discover by systematically evaluating candidate policies against user behavior data.

```python
from mosaic import govern

for policy_name, signal in candidate_policies.items():
    result = govern(base_scores, signal, budget=0.30)
    # measure lift, diversity, quality retention
```

Run the full notebook: `objective_discovery.ipynb`

Try governed-rank

pip install governed-rank

Tags: governed-rank · objective-discovery · policy-evaluation · diversity · tutorial