Haske Labs logo

01 Introduction

Third-party cookies are going away. Chrome, Safari, and Firefox are all restricting or eliminating the cross-site tracking that powers most ad personalization today. The industry's response has been to find new ways to recreate user-level tracking — login walls, device fingerprinting, cohort IDs, data clean rooms.

We think the question is wrong. Instead of asking "How do we keep tracking users?", we ask:

"How much can you personalize from a single session — and how do you know when you can't?"

The answer turns out to be: quite a lot. A user who has visited 8 product pages, added 2 items to cart, and spent 4 minutes browsing electronics has told you a great deal about what they want right now — without you knowing who they are or what they did last week.

The hard part isn't detecting intent. It's knowing when your detection is reliable enough to act on, and steering the ranking without making things worse when it isn't.

This article describes a three-component pipeline that solves both problems:

IntentLens detects what the session wants from behavioral signals alone
The Confidence Gate decides whether to trust that detection or back off
governed-rank steers the ad/content ranking using the trusted signal, with mathematical guarantees that it won't degrade relevance

The result: 2–5x conversion lift on three public datasets, sub-5ms latency, and graceful degradation to context-only relevance when the signal isn't strong enough. No cookies. No user IDs. No cross-site anything.

02 Why Cookieless Is Hard

What Cookies Actually Did

Third-party cookies solved personalization by giving you a persistent identity across time and sites. You could build a rich user profile — browsing history, purchase patterns, category preferences, brand affinity — and use it to rank ads and recommendations.

When cookies disappear, you lose:

History: No record of what this person browsed last week
Cross-site context: No idea what they did on other sites
Preference models: No long-term taste profile to score against
Frequency capping: No way to know if they've already seen this ad 50 times

Most of the industry is trying to replace cookies with something that works the same way — persistent IDs through login, fingerprinting, or Google's Topics API. These approaches face privacy regulation headwinds and still depend on some form of tracking.

The Session-Only Constraint

We impose a harder constraint: the system can only see what happens in the current session. No persistent IDs, no cross-session linking, no external data. If the user closes the browser and comes back tomorrow, they're a completely new session.

This sounds limiting, but it reframes the problem in a useful way. Instead of asking "Who is this person and what do they usually like?", you ask "What does this session want right now?"

A session where someone searches "running shoes size 10," views 4 product pages, and adds one to cart is telling you something very specific — even if you have no idea who they are.

The Three Challenges

Working with session-only signals creates three distinct challenges:

1. Intent detection. How do you convert raw behavioral signals (clicks, dwell times, cart events, search queries) into a structured understanding of what the session wants?

2. Confidence estimation. A session with 15 events gives you a strong signal. A session with 2 page views gives you almost nothing. How do you measure how much you should trust your intent detection — and what do you do when trust is low?

3. Safe steering. Once you have a trusted intent signal, how do you use it to rerank ads or recommendations without breaking the base relevance model? If the intent detection is wrong, the reranking should be harmless, not destructive.

Our pipeline addresses each challenge with a dedicated component.

03 The Three-Component Pipeline

Browser session (no cookies, no user ID)
         |
    [ IntentLens ]    "What does this session want?"
    Session signals → intent distribution + confidence tier
         |
    [ Confidence Gate ]    "Should we trust this signal?"
    HIGH  → full steering     (strong evidence)
    MEDIUM → reduced steering  (partial evidence)
    LOW   → no steering        (fall back to context only)
         |
    [ governed-rank ]    "Steer without breaking relevance"
    Orthogonalize → Protect → Project
         |
    Personalized ranking + audit receipt

Each component is independently validated and solves a different piece of the puzzle. Here's how they work together.

Component 1: IntentLens — Detecting Intent from Session Behavior

IntentLens converts raw session signals into a structured intent distribution. It answers: "Given what this session has done so far, what are they trying to accomplish?"

Input signals (all session-level, no cookies):

Pages visited and their categories
Time spent on each page (dwell time)
Search queries entered
Items added to cart
Click patterns and scroll depth
Device type, time of day
Referrer (how they arrived)

How it works:

Step 1 — Intent discovery. The system learns latent intent patterns from historical item co-occurrence data. Items that frequently appear together in sessions define an intent cluster. For example, items commonly browsed together in electronics sessions define a "Research Electronics" intent. The system discovers 8 such intents automatically (Browse, Research, Quick Buy, Compare, Restock, Discover, Deal Hunt, Gift Shop).

Step 2 — Session prior. For the current session, the system looks at which items have been viewed or carted. Each item has an affinity to each intent (learned in Step 1). The average of these affinities gives a starting estimate of what the session wants — the "prior."

Step 3 — Evidence update. The prior is refined using 295 behavioral features — not just which items were viewed, but how they were viewed. Long dwell time on comparison pages suggests Research intent. Rapid cart additions suggest Quick Buy. The system learns which features are predictive of which intents.

Step 4 — Confidence scoring. The output is a probability distribution over intents (e.g., 72% Research, 18% Compare, 10% Browse). The confidence score captures how peaked and decisive this distribution is:

High confidence: One intent dominates (e.g., 85% Research). The session has given clear signals.
Medium confidence: Two intents compete (e.g., 45% Research, 38% Compare). Some signal, but ambiguous.
Low confidence: Flat distribution. The session has barely interacted — not enough evidence to detect intent.

Component 2: Confidence Gate — Knowing When to Trust the Signal

This is where the Confidence Gate Theorem enters. IntentLens produces an intent distribution and a confidence score for every session. But should you act on that score?

The key insight: Confidence is only useful if higher confidence actually means better predictions. This sounds obvious, but it fails in specific, predictable ways.

When a session has few interactions (2 page views, no cart), the system has structural uncertainty — it simply doesn't have enough data. The confidence score correctly identifies this: low signal volume → low confidence → low accuracy. Gating works perfectly here. Every confidence threshold you try gives equal or better results.

But when the system has plenty of signals that happen to be contradictory or unusual (a long session that bounces between unrelated categories), the uncertainty is contextual — the system can't understand why the user is behaving this way, not because it lacks data, but because the behavior doesn't match learned patterns. Here, the confidence score can be misleading.

The gate routes sessions into three tiers:

Tier	Meaning	Action	Steering Budget
HIGH	Strong, clear intent signal	Full personalization	100% (λ = 1.0)
MEDIUM	Partial signal, some ambiguity	Cautious personalization	50% (λ = 0.5)
LOW	Insufficient evidence	No personalization — context only	0% (λ = 0.0)

LOW sessions get pure contextual relevance — page-ad category overlap, nothing more. This is the graceful degradation that makes the system safe. When you don't know what the session wants, you don't guess. You serve contextually relevant content and move on.

Why the medium tier matters: On Criteo (844K sessions), we found that MEDIUM-tier sessions actually converted worse than LOW-tier sessions (6.9% vs 9.3%). Partial confidence was worse than no confidence. This is exactly what the Confidence Gate Theorem predicts: the "medium-confidence danger zone" where the system is confident enough to steer but not accurate enough to steer correctly. The gate catches this — by reducing the steering budget for MEDIUM sessions, the damage is contained.

Component 3: governed-rank — Steering Without Breaking Relevance

Once the confidence gate says "trust this signal," governed-rank handles the actual reranking. It solves a specific problem: how do you incorporate an intent signal into an existing ranking without degrading the base ranker's accuracy?

The naive approach fails. If you just add the intent score to the relevance score, you get interference. Intent signals correlate with relevance signals (high-intent sessions already see somewhat relevant content). Adding a correlated signal amplifies what's already there instead of steering toward what's missing.

governed-rank's three steps:

1. Orthogonalize. Remove the component of the intent signal that correlates with the base relevance score. After this step, the intent signal can only move items in directions that the base ranker has no opinion about. Mathematically, the correlation between the adjusted intent signal and base scores is effectively zero (measured at r = 3.47 × 10⁻¹⁸ on RetailRocket — that's 0.00000000000000000347).

2. Protect edges. Lock the base ranker's most confident ordering decisions. If the base ranker is very sure that Ad A should rank above Ad B (large score gap), that ordering is protected. No amount of intent steering can reverse it. This gives the relevance team a single knob — the "budget" — that controls how much the base ordering can change.

3. Project. Solve for the maximum intent effect within the protected constraints. The result is Pareto-optimal: you cannot increase personalization without decreasing relevance. This is a mathematical guarantee, not a hope.

The confidence gate scales the budget: HIGH sessions get full steering, MEDIUM sessions get half, LOW sessions get zero. This means the system is most aggressive when the signal is strongest and completely passive when it isn't.

04 Results: Three Public Datasets

We validated the full pipeline on three public e-commerce and advertising datasets. Each test measures the same thing: do confidence tiers predict conversion, and does gated steering improve outcomes?

RetailRocket (20,000 e-commerce sessions)

RetailRocket is an e-commerce dataset with full clickstream data — page views, cart events, and purchases.

Setup: Full IntentLens pipeline with 295 behavioral features. Confidence tiers from intent posterior margin and entropy.

Tier	Sessions	Conversion Rate	Lift vs MEDIUM
HIGH	80.4%	4.4%	4.9x
MEDIUM	19.6%	0.9%	—
LOW	0%	—	—

4.9x conversion lift between HIGH and MEDIUM tiers. The confidence gate correctly separates sessions where intent is clear (and conversion is likely) from sessions where it isn't.

The LOW tier is empty — RetailRocket sessions are information-rich (many events per session), so the evidence model confidently classifies nearly all sessions. This is the ideal case: enough behavioral signal to always make a determination.

Orthogonality check: The correlation between intent scores and base relevance scores is r = 3.47 × 10⁻¹⁸. The intent signal adds genuinely new information — it's not just re-discovering what the base ranker already knows.

Latency: p50 = 0.7ms, p99 = 4.9ms. Well under the ~100ms budget for real-time ad serving.

Criteo (844,000 ad sessions)

Criteo is a large-scale display advertising dataset with impressions, clicks, and conversions. Unlike RetailRocket, sessions here are sparser — fewer signals per session, noisier behavior.

Setup: Logistic regression on 7 session features (impressions, clicks, CTR, campaigns, categories, duration, cost). Confidence tiers from predicted conversion probability.

Tier	Sessions	Conversion Rate	Lift vs MEDIUM
HIGH	55.3%	13.2%	1.9x
MEDIUM	42.1%	6.9%	—
LOW	2.6%	9.3%	—

1.9x lift between HIGH and MEDIUM. But notice the important finding: LOW converts higher than MEDIUM (9.3% vs 6.9%).

This is the "medium-confidence danger zone" that the Confidence Gate Theorem predicts. MEDIUM sessions have enough signal to steer the system away from contextual relevance, but not enough accuracy to steer it toward the right answer. The result: partial confidence actively hurts.

This is why the gate matters. Without it, you'd serve a partially-confident personalization to 42% of traffic, performing worse than if you'd done nothing. With the gate, MEDIUM sessions get reduced steering (50% budget) and LOW sessions get zero steering — falling back to pure contextual relevance, which performs better than a wrong guess.

Bidding simulation: Using confidence tiers to adjust bid multipliers (HIGH: bid up, LOW: bid down) produces +3.25% cost-per-acquisition improvement and +3.36% conversion lift in a simulated bidding environment.

Yoochoose (92,000 sessions)

Yoochoose is a large e-commerce dataset (originally 924K sessions; tested on a 10% sample). It has balanced tier distribution, making it the cleanest test of monotonicity.

Setup: Logistic regression on 5 session features (click count, unique items, categories, repeat-view ratio, duration). Confidence tiers from predicted conversion probability.

Tier	Sessions	Conversion Rate	Lift vs LOW
HIGH	34.2%	9.45%	4.5x
MEDIUM	32.9%	3.65%	1.7x
LOW	32.9%	2.10%	—

4.5x lift between HIGH and LOW, and critically: perfect monotonicity. HIGH > MEDIUM > LOW with no reversals. The confidence gate passes both C1 and C2 checks cleanly.

Bidding simulation: +27.3% cost-per-acquisition improvement, +37.6% conversion lift. (These are simulation numbers under idealized conditions — realistic production improvement would be lower, likely 2–8%, but directionally strong.)

Cross-Dataset Summary

Dataset	Sessions	Best Lift	Monotonic?	Key Finding
RetailRocket	20,000	4.9x	Yes	Rich sessions → strong signal, near-empty LOW tier
Criteo	844,000	1.9x	No	Medium-zone danger: partial confidence hurts
Yoochoose	92,000	4.5x	Yes	Balanced tiers, clean monotonicity

The pattern: when sessions are information-rich, the pipeline works cleanly. When sessions are sparse and noisy (Criteo), the medium tier becomes dangerous — but the confidence gate catches this and reduces steering automatically.

All six core claims pass across all datasets:

Confidence tiers predict conversion (4.9x, 1.9x, 4.5x)
Gated steering beats ungated (Criteo +3.25% CPA, Yoochoose +27.3%)
LOW tier gets zero steering — graceful degradation
Orthogonalization preserves base relevance (r ≈ 0)
>50% of sessions reach HIGH or MEDIUM (100%, 97.4%, 67.1%)
Sub-5ms latency on all datasets

05 How the Pipeline Fits Together

The Eight Stages

Under the hood, a session flows through eight processing stages:

Stage	What Happens	Time
1. Signal collection	Gather page views, clicks, cart events, dwell times, search queries	Real-time
2. Intent detection	Convert signals to intent distribution via NMF + evidence model	<1ms
3. Confidence scoring	Compute posterior margin + entropy → assign HIGH/MEDIUM/LOW tier	<0.1ms
4. Confidence gate	Route: HIGH → full steering, MEDIUM → half, LOW → zero	<0.1ms
5. Base scoring	Score ads/items on contextual relevance (page-content overlap)	<1ms
6. Orthogonalization	Remove correlation between intent signal and base scores	<0.1ms
7. Edge protection	Lock the base ranker's most confident ordering decisions	<0.1ms
8. Projection	Solve for maximum intent effect within constraints	<1ms

Total pipeline latency: <5ms. This fits comfortably inside the ~100ms window for real-time ad serving. The entire pipeline is stateless — no database lookups, no cross-session storage, no user profile retrieval.

Graceful Degradation

The most important design property is what happens when the system doesn't have a good signal.

Session with 12 events, clear intent    →  HIGH  →  Full personalization
Session with 5 events, mixed signals     →  MED   →  Cautious personalization
Session with 1 page view, just arrived   →  LOW   →  Contextual relevance only

A LOW-confidence session receives the exact same ranking it would have received without this system — pure content-based relevance. The pipeline adds value when it can and stays out of the way when it can't. There is no scenario where installing this system makes things worse than not having it, because the fallback is the status quo.

This is not a "best effort" claim — it's a consequence of the math. When the steering budget is zero (LOW tier), the governed-rank pipeline returns the base ranking unchanged. When the budget is positive, orthogonalization guarantees the intent signal cannot interfere with the base ranker's dimensions of confidence.

What You Don't Need

This pipeline explicitly does not require:

Cookies or persistent IDs. Every session is independent.
Login walls. Works for anonymous visitors.
Cross-site data. No third-party data partnerships needed.
User profiles. No historical preference models.
Fingerprinting. No device or browser fingerprinting.
Consent for tracking. Session behavioral signals are first-party by definition.

The only data the system touches is what the user does on your site, in this session. That's the strongest possible privacy position.

Where Each Component Comes From

Component	What It Does	More Detail
IntentLens	Detects intent from session signals	NMF intent discovery + 295-feature evidence model
Confidence Gate	Decides when to trust intent detection	Based on the Confidence Gate Theorem
governed-rank	Steers ranking without degrading relevance	Open source: github.com/rdoku/governed-rank

These are three independently validated systems that happen to compose into a complete cookieless personalization pipeline. IntentLens provides the signal, CGT provides the safety check, governed-rank provides the steering mechanism.

06 The Medium-Confidence Danger Zone

Why Partial Confidence Is Worse Than None

The most counterintuitive finding across our experiments is the Criteo result: sessions with medium confidence converted at 6.9%, while sessions with low confidence converted at 9.3%. Knowing a little was worse than knowing nothing.

This happens because medium-confidence sessions have enough signal to steer the system away from its default, but not enough accuracy to steer it toward the right answer. The system makes a move — reranking ads based on a partially-detected intent — and that move is wrong often enough to hurt.

Low-confidence sessions, by contrast, get zero steering. They see the default contextual ranking, which is already optimized for relevance. No move is better than a wrong move.

This Is Predictable (Not a Bug)

The Confidence Gate Theorem formalizes exactly when this happens. The diagnostic test is simple:

Group sessions by confidence tier (HIGH / MEDIUM / LOW)
Compute conversion rate per tier
Check: does conversion rate increase monotonically from LOW → MEDIUM → HIGH?

If yes (as on RetailRocket and Yoochoose): safe to steer at all tiers.

If no (as on Criteo): the medium zone is dangerous. Reduce or eliminate steering for medium-confidence sessions.

The beauty of this approach: You run this check before deploying. On test data. In minutes. You don't have to launch and hope — you can diagnose the problem offline and adjust the steering budget per tier accordingly.

The Fix Is Simple

Once you've identified a medium-zone problem, the fix is mechanical:

Option 1: Reduce the steering budget for MEDIUM. On Criteo, λ = 0.5 for MEDIUM (instead of 1.0) contains the damage while still extracting some value from partial-confidence sessions.
Option 2: Merge MEDIUM into LOW. Treat any session that isn't HIGH-confidence as a contextual-only session. This is more conservative but eliminates the risk entirely.
Option 3: Improve the confidence model. On Criteo, the initial hand-tuned heuristic produced the reversal. Replacing it with a trained logistic regression on the same features eliminated the reversal entirely on re-analysis. Sometimes the medium zone is a model quality problem, not a fundamental data property.

The diagnostic (check for reversals) tells you whether you have a problem. The tier-level results tell you where the problem is. The budget knob lets you fix it without rebuilding anything.

Implications for Ad Tech

This medium-zone finding has broad implications beyond our specific pipeline:

Any system that uses partial signals to personalize — lookalike audiences, contextual targeting, interest-based cohorts — faces the same risk. Partial signal can be worse than no signal.
The standard approach of A/B testing the whole system can miss this. If 55% of sessions are HIGH (and working well) and 42% are MEDIUM (and hurting), the aggregate A/B test might show a modest positive — hiding a significant negative for almost half of traffic.
Per-tier measurement is essential. You need to measure lift per confidence tier, not just overall. The overall number is an average that can hide exactly the failure mode that matters.

07 Conclusion

Third-party cookies solved personalization by tracking users across time and sites. Their disappearance creates a real problem: how do you personalize without persistent identity?

Our answer: you don't need to know who someone is. You need to know what they want right now — and whether you're confident enough to act on it.

The pipeline is three components, each solving a distinct problem:

IntentLens converts session-level behavioral signals (clicks, dwell times, cart events, search queries) into a structured intent distribution. No cookies, no user IDs — just what the session has done.
The Confidence Gate routes sessions by signal strength: full personalization when the evidence is clear, reduced personalization when it's ambiguous, zero personalization when there isn't enough data. This prevents the medium-confidence danger zone from hurting overall performance.
governed-rank steers the ranking using the trusted intent signal, with mathematical guarantees that it cannot degrade base relevance. The intent signal is orthogonalized against the base ranker, confident decisions are protected, and the final ranking is provably optimal within constraints.

Validated results across three public datasets:

Dataset	Sessions	Best Lift	Monotonic?
RetailRocket	20,000	4.9x	Yes
Criteo	844,000	1.9x	No — medium-zone caught by gate
Yoochoose	92,000	4.5x	Yes

Key properties:

Sub-5ms latency (p99 = 4.9ms)
Stateless — no database, no user profile, no cross-session storage
Graceful degradation — LOW-confidence sessions get the exact same ranking as if this system didn't exist
Orthogonal — the intent signal adds new information (r ≈ 0 with base scores), it doesn't re-discover what the base ranker already knows
Privacy-first — only touches first-party, in-session behavioral data

Honest limitations:

All results are offline replay on public datasets. No live A/B test or production deployment yet.
Criteo bidding simulations are idealized. Realistic production improvement is likely 2–8%, not 27%.
The Criteo medium-zone reversal is caught and managed by the gate, but it demonstrates that the pipeline isn't magic — it requires per-tier monitoring.
Moment2vec intent discovery underperforms simple co-occurrence baselines on some metrics. The evidence model (Step 3) does the heavy lifting.

The cookieless future doesn't require recreating user-level tracking. It requires session-level intelligence with the discipline to know when that intelligence is trustworthy — and the safety to fall back gracefully when it isn't.

Personalize without tracking

Third-party cookies are disappearing. Most solutions try to recreate user tracking through other means. We take a fundamentally different approach: detect intent from session behavior alone, know when to trust that signal, and steer the ranking without breaking relevance. Validated on RetailRocket, Criteo, and Yoochoose with 2–5x conversion lift.

Read the Paper Contact

Cookieless Personalization Session-Level Intent Without Tracking