One Algorithm, Six Datasets, Four Domains

The tutorials on this site use synthetic data to illustrate how govern() works. Synthetic data is useful for pedagogy — you control the ground truth and can show exactly what happens at each step. But the question every practitioner asks is: does it work on real data?

This article presents results from six public recommendation datasets spanning four domains. Every experiment uses the same govern(base_scores, steering_scores, budget) call. No per-dataset tuning, no custom feature engineering, no domain-specific modifications. The algorithm either generalizes or it doesn't.

The Datasets

Dataset	Domain	Items	Sessions	Source
Ta-Feng	Grocery	11,197	99,210	UCI ML Repository
MovieLens 100K	Movies	1,682	943	GroupLens Research
Instacart	Grocery	15,825	8,389	Kaggle
H&M	Fashion	20,500	300 (sampled)	Kaggle (31M transactions, 105K articles)
Adressa	News	1,049	54,544	NTNU (2.4GB raw Norwegian news browsing logs)
LastFM 360K	Music	53,355	9,811	Last.fm API

These are not toy datasets. Instacart and H&M are from Kaggle competitions with millions of transactions. Adressa is a real Norwegian news site's browsing logs. LastFM has nearly 480K user-artist interaction rows. Ta-Feng is a standard grocery benchmark from the UCI repository.

What We're Measuring

For each dataset, we run govern() with a steering policy (long-tail promotion, category steering, temporal steering, etc.) and measure:

Policy Exposure: What fraction of the top-K results belong to the target policy set? Higher means the steering is working.
Policy Lift: Policy exposure divided by the base rate in the catalog. A lift of 3× means users see 3× more policy-targeted content than a random sample would contain.
Recall@10: How many of the user's actual next interactions appear in the top-10? This measures whether steering hurts the base ranker's accuracy.
Stability: Kendall tau or set overlap between the governed ranking and the base ranking. Higher budget = more stability = less disruption.

The key question: can you get meaningful policy lift without destroying recall, and does the budget knob control the tradeoff smoothly?

Results by Dataset

•Ta-Feng (Grocery, 11K items, 99K sessions)

Ta-Feng is a Taiwanese grocery retailer dataset. We test temporal steering (promoting morning purchases) with an Item-CF base ranker.

Budget	Recall@10	Stability	Policy Lift
0%	24.2%	0.79	2.94×
10%	24.3%	0.79	2.94×
30%	24.5%	0.89	2.91×
50%	24.8%	0.94	2.83×
70%	25.0%	0.96	2.72×
100%	25.3%	0.97	2.60×

Observation: Recall actually improves slightly as budget increases (24.2% → 25.3%). This is because protection constraints prevent the steering signal from pushing relevant items out of the top-K. The budget smoothly trades policy lift (2.94× → 2.60×) for stability (0.79 → 0.97). At Budget=30%, you retain 99% of max steering at much higher stability.

We also validated with a Hybrid ranker (70% Item-CF + 30% popularity) and an Embedding ranker (moment2vec). The hybrid ranker shows the same pattern with slightly different numbers. The embedding-only ranker reveals something important: MOSAIC doesn't fix a bad ranker. If the base ranking is fundamentally weak, constraints bind but the underlying quality isn't there to protect. This is correct behavior — governed ranking governs, it doesn't create.

•MovieLens 100K (Movies, 1.7K items, 943 users)

MovieLens is the canonical movie recommendation benchmark. We tested three steering types: temporal (morning/evening preference), long-tail (promote less popular films), and genre (promote documentaries/film-noir). Here are the genre documentary results:

Budget	Recall@10	Stability	Policy Lift
0%	23.1%	0.88	1.15×
10%	23.1%	0.88	1.15×
30%	23.3%	0.90	1.15×
50%	23.4%	0.92	1.14×
100%	23.7%	1.00	1.05×

Observation: MovieLens shows the smallest policy lift (1.15×). Why? Documentaries are already reasonably represented in the catalog relative to user interest. The orthogonalization step correctly identifies that there isn't much misalignment to correct. This is what you want from the algorithm — it doesn't manufacture a steering effect where none exists.

•Instacart (Grocery, 16K items, 8.4K orders)

Instacart is a grocery delivery dataset from the Kaggle competition. We tested two steering types:

Long-tail steering (promote bottom 50% by popularity):

Budget	Recall@10	Stability	Policy Lift
0%	52.3%	0.88	1.44×
30%	53.9%	0.86	1.44×
70%	57.0%	0.94	1.40×
100%	59.2%	1.00	1.35×

Category steering (promote Produce department):

Budget	Recall@10	Stability	Policy Lift
0%	46.0%	0.40	15.4×
30%	41.4%	0.38	15.3×
70%	42.2%	0.40	15.2×
100%	43.4%	0.40	15.1×

Observation: Category steering achieves 15.4× lift because produce is only 6.0% of the catalog but heavily purchased. The algorithm surfaces a massive unmet demand signal. Long-tail steering shows recall increasing with budget (52.3% → 59.2%) — the protection constraints are catching relevant items that would otherwise be displaced.

•H&M (Fashion, 20.5K items, 500K transactions)

H&M comes from the Kaggle Personalized Fashion Recommendations competition (31M transactions, 105K articles). We sampled 500K transactions and ran 144 configurations (4 steering types × 3 λ × 6 budgets × 2 protection modes).

Long-tail steering:

Budget	Recall@10	Stability	Policy Lift
0%	59.0%	0.73	1.68×
10%	65.6%	0.73	1.68×
30%	70.8%	0.71	1.67×
50%	78.9%	0.78	1.66×
70%	82.5%	0.92	1.63×
100%	83.9%	0.98	1.58×

Product group steering:

Budget	Recall@10	Stability	Policy Lift
0%	58.5%	0.26	41.7×
30%	59.2%	0.26	41.5×
100%	59.6%	0.26	41.5×

Observation: Product group steering reaches 41.7× lift because the target product group is only 2.27% of the catalog but heavily purchased. Recall barely moves (58.5% → 59.6%) across all budget levels. Long-tail steering shows the most dramatic recall improvement: 59.0% → 83.9% as budget increases. The protection constraints are actively helping the ranker by preventing over-displacement.

•Adressa (Norwegian News, 1K articles, 54K sessions)

Adressa is real Norwegian news browsing data — 2.4GB of single-day raw logs processed into 601K article events across 1,049 articles and 54,544 sessions.

Sports category steering:

Budget	Recall@10	Stability	Policy Lift
0%	29.0%	0.66	2.66×
30%	29.4%	0.62	2.61×
50%	30.7%	0.67	2.35×
100%	36.7%	1.00	0.81×

Temporal (morning) steering:

Budget	Recall@10	Stability	Policy Lift
0%	23.5%	0.41	3.58×
30%	26.5%	0.44	3.38×
50%	24.8%	0.47	3.32×
100%	36.7%	1.00	1.10×

Observation: Adressa reveals an important characteristic of news: temporal steering (3.58×) is more powerful than category steering (2.66×). Morning-published articles are significantly underexposed relative to morning browsing demand. At Budget=100% (no steering, pure base ranker), recall jumps to 36.7% — meaning steering does cost recall in news. This is an honest tradeoff: you're choosing policy compliance over pure accuracy. The budget knob makes that tradeoff explicit and controllable.

•LastFM 360K (Music, 53K artists, 9.8K baskets)

LastFM is a large-scale music listening dataset (479,601 rows, 10,000 users, 53,355 artists filtered to 2,000 items).

Long-tail steering (bottom 30% by popularity):

Budget	Recall@10	Retention	Policy Lift
0.05	31.1%	0.978	1.46×
0.10	31.1%	0.979	1.44×
0.20	31.4%	0.980	1.38×
0.30	31.4%	0.981	1.33×
0.40	31.1%	0.984	1.27×

Observation: LastFM shows policy lift strongest at low budgets (1.46× at 5%) with smooth decay. Recall stays rock-stable around 31% across all budgets. Retention improves monotonically (0.978 → 0.984). Multi-policy sweep validated consistency: long_tail_50 lift 1.32× → 1.19×, mid_tail 1.29× → 1.18×, head_10 1.08× → 1.05×.

The Cross-Dataset Summary

Dataset	Domain	Items	Sessions	Best Policy Lift	Recall Stable?	Budget Controls Stability?
Ta-Feng	Grocery	11K	99K	2.94×	Yes	Yes
MovieLens	Movies	1.7K	943	1.15×	Yes	Yes
Instacart	Grocery	16K	8.4K	15.4×	Yes	Yes
H&M	Fashion	20.5K	300	41.7×	Yes	Yes
Adressa	News	1K	54K	3.58×	Tradeoff	Yes
LastFM	Music	53K	9.8K	1.46×	Yes	Yes

Three patterns hold across every dataset:

Budget controls stability monotonically. Spearman correlation between budget and stability is 1.00 ± 0.00 across multi-seed validation. This isn't a happy accident — it's a consequence of the isotonic projection: more protected edges = more constraints = ranking moves less.
Policy lift varies by how misaligned the base ranker is. MovieLens documentaries are already reasonably exposed (1.15×). H&M product groups are massively underexposed relative to demand (41.7×). The algorithm correctly identifies the magnitude of the correction needed without the operator specifying it.
Recall is preserved or improved in 5 of 6 datasets. The exception is Adressa news, where steering toward temporal or category content genuinely trades off with engagement. This is not a failure — it's the algorithm being honest about a real conflict between the steering objective and the base signal.

Baseline Comparison

We also compared govern() against four baseline reranking approaches on Ta-Feng (500 baskets):

Method	Stability@10	Policy Exposure@50
Naive Boost	0.806	0.329
Gap-Threshold	0.836	0.290
xQuAD	0.851	0.303
Capped Boost	0.796	0.313
Freeze Top-5	1.000	0.263
MOSAIC (B=30%)	0.890	0.349

MOSAIC achieves the highest policy exposure (0.349) AND the second-highest stability (0.890), beaten only by Freeze Top-5 which achieves stability by definition (it never moves the top-5) at the cost of the lowest policy exposure.

Iso-exposure comparison (all methods matched to MOSAIC's policy exposure of 0.349):

Method	Policy Exposure	Stability@10
Naive Boost	0.349	0.745
Gap-Threshold	0.349	0.774
xQuAD	0.349	0.810
MOSAIC	0.349	0.890

At the same policy exposure, MOSAIC provides 10-19% higher stability than every baseline. The orthogonalization step removes the interference between base scores and steering scores, so the algorithm can steer without unnecessary disruption.

Why It Generalizes

The algorithm generalizes because it makes no domain-specific assumptions. The three steps — orthogonalize the steering signal against base scores, protect the most confident ordering decisions via budget-controlled edge constraints, and solve for the optimal ranking via isotonic projection — operate purely on score vectors. It doesn't know whether it's ranking groceries, movies, news articles, or songs. It only knows: here are base scores, here are steering scores, here is how much disruption you're willing to tolerate.

This is also why the algorithm is O(N log N) and runs sub-millisecond. There's no learned component, no iterative optimization, no convergence check. Orthogonalize, protect, project. One pass.

What This Means

If you have a ranked list and a policy objective, govern() will find the Pareto-optimal reranking for any budget level. The evidence across six datasets and four domains shows this isn't dataset-dependent. The budget parameter gives you a smooth, monotonic dial between "maximum policy compliance" and "preserve the base ranking exactly." Where you set that dial is a business decision. The algorithm makes the tradeoff explicit and auditable.

pythonfrom mosaic import govern

result = govern(base_scores, steering_scores, budget=0.30)

Same call. Every domain. Every dataset. No retuning.

Try governed-rank

pip install governed-rankGitHub Tutorial

governed-rankcross-datasetgeneralizationrecommendationsreal-data

Related Insights

Technical Deep DiveGetting Started

Understanding governed-rank: How MOSAIC Steers Rankings Without Breaking Them

Every ranking system eventually needs a second objective. MOSAIC orthogonalizes the policy signal, protects confident decisions, and projects the optimal result — in three steps, one function call.

3 stepsZero-interference steering with full audit trail

Mar 7, 202612 min

TutorialStrategy

Tutorial: Objective Discovery — Finding Policies That Work Before You Deploy

Not all policies are worth pursuing. This tutorial walks through the objective_discovery notebook: run 7 candidate policies through govern() to discover which objectives align with users and which fight them.

quality_depth is the only policy with high preference lift AND diversity gain

Mar 7, 202610 min

TutorialContent & Safety

Tutorial: Content Moderation — Demoting Toxicity Without Killing Engagement

Toxic content is engaging — outrage drives clicks. This tutorial walks through the content_moderation notebook: why naive penalties over-correct, and how govern() targets only the uncertain zone.

Toxicity drops in top-10 while ranking quality preserved

Mar 7, 202610 min

Back to all insights