---
name: opensearch-function-scoring-algorithms
description: Search relevance and ranking on OpenSearch/Elasticsearch for a two-sided marketplace, covering candidate retrieval (hybrid BM25 + kNN, RRF, two-tower EBR), base relevance (BM25F, multi_match, LambdaMART), quality signals (Wilson lower bound, Bayesian average, rank_feature saturation/sigmoid), personalization (listing/user/session embeddings), spatial/temporal decay (gauss/exp), marketplace balance (conversion-weighted ranking, supply fairness, Pareto multi-objective), bias correction (IPS, click models, Thompson sampling), empirical evaluation (judgment sets, NDCG, ablation, A/B sizing, CUPED, regression suites), and diversity (MMR, DPP, max-per-host). Triggers on function_score, rank_feature, script_score, kNN, hybrid query, learning-to-rank, two-sided ranking, exposure fairness, NDCG, A/B testing, judgment set construction, ranking ablation, or "why is my OpenSearch ranking bad". Largely applies to Elasticsearch too, since function_score, rank_feature, script_score, and rescore are shared, although kNN and hybrid-query syntax differ between the two engines.
---
# Marketplace-Research OpenSearch Function Scoring Best Practices

A reference distillation of research-backed algorithms for ranking in two-sided marketplaces (Airbnb, Uber Eats, DoorDash, Etsy, eBay, Booking.com) implemented on OpenSearch or Elasticsearch. Contains **56 rules across 9 categories**, prioritised by cascade effect in the search ranking pipeline. Each rule explains the WHY (the cascade or the bias it corrects), shows incorrect-vs-correct code (OpenSearch JSON queries, Painless scripts, Python pre-processing, evaluation methodology), and links to the canonical source — KDD/SIGIR/WSDM papers, the OpenSearch documentation, and the engineering blogs of the marketplaces that proved these patterns at scale.

## When to Apply

Reach for this skill when:

- Designing a new marketplace search system on OpenSearch or Elasticsearch from scratch
- Tuning function_score / rank_feature / script_score queries that aren't moving the needle
- Setting up hybrid retrieval (BM25 + dense vectors) with Reciprocal Rank Fusion
- Choosing between HNSW and IVF for billion-scale ANN indexes
- Adding personalization via listing/user embeddings or two-tower architectures
- Correcting position bias in click logs before retraining an LTR model
- Designing exposure-fairness or new-listing cold-start exposure allocation
- Composing decay functions (gauss / exp / linear) over geo + date + freshness
- Diversifying the top window with MMR, DPP, or per-host caps
- Debugging "why does my top-10 show 8 listings from one host?" or "why does ranking favor popular incumbents?"
- **Building offline evaluation infrastructure** — graded judgment sets, NDCG@k pipelines, ablation studies, regression query suites
- **Designing A/B tests for ranking changes** — MDE / power / sample-size pre-computation, CUPED variance reduction, online-offline correlation calibration
- **Attributing lift to specific scoring components** — "did my new bias-correction help, or was it the embeddings, or both?"

The rules apply to any OpenSearch/Elasticsearch-backed marketplace search regardless of vertical — accommodation, food delivery, restaurants, services, jobs, secondhand goods, real estate. Triggers include "marketplace ranking", "search relevance", "function_score", "rank_feature", "script_score", "kNN", "hybrid search", "RRF", "learning to rank", "embedding-based retrieval", "two-tower", "position bias", "MMR", "supply fairness", "Pareto multi-objective", "NDCG", "judgment set", "ablation study", "CUPED", "A/B sample size", "ranking eval", and "why are my search results bad".

## The Search Ranking Lifecycle

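Categories are derived from the marketplace search ranking pipeline. Earlier stages cascade: a miss in recall (stage 1) cannot be repaired by any downstream boost, and a wrong base relevance score multiplies through every function score applied on top of it. Stage numbers in the diagram follow pipeline order, so stages 7 to 9 do not match the priority numbers in the table below, where diversity is listed last.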

```text
Query → [1] Recall → [2] Base Relevance → [3] Quality Signals → [4] Personalization
      → [5] Geo/Time Decay → [6] Marketplace Balance → [7] Diversity Re-rank → Results
                                                            ↑
                                          [8] Bias Correction (applied across all stages
                                                       and into training)
                                                            ↑
                                          [9] Evaluation & Measurement (the meta-layer:
                                                       judgment sets, NDCG, ablation, A/B
                                                       sizing, CUPED — without these you
                                                       can't tell if any rule helped)
```

## Rule Categories by Priority

| Priority | Category | Impact | Prefix | Rules |
|----------|----------|--------|--------|-------|
| 1 | Candidate Retrieval & Recall | CRITICAL | `recall-` | 6 |
| 2 | Base Relevance & Field Scoring | CRITICAL | `rel-` | 7 |
| 3 | Quality Signals & Confidence Bounds | HIGH | `qual-` | 6 |
| 4 | Personalization & Embeddings | HIGH | `pers-` | 7 |
| 5 | Spatial & Temporal Decay | HIGH | `decay-` | 5 |
| 6 | Two-Sided Marketplace Balance | HIGH | `market-` | 7 |
| 7 | Bias Correction & Online Learning | HIGH | `bias-` | 6 |
| 8 | Evaluation & Measurement | HIGH | `eval-` | 7 |
| 9 | Diversity & Re-ranking | MEDIUM-HIGH | `div-` | 5 |

## Quick Reference

### 1. Candidate Retrieval & Recall (CRITICAL)

- [`recall-hybrid-rrf`](references/recall-hybrid-rrf.md) — Use Hybrid BM25 + kNN with Reciprocal Rank Fusion
- [`recall-two-tower-ebr`](references/recall-two-tower-ebr.md) — Use Two-Tower Architecture for Embedding-Based Retrieval
- [`recall-prefilter-knn`](references/recall-prefilter-knn.md) — Apply Pre-Filter to kNN with Hard Constraints
- [`recall-hnsw-vs-ivf`](references/recall-hnsw-vs-ivf.md) — Choose HNSW for Latency, IVF for Memory at Scale
- [`recall-multi-stage`](references/recall-multi-stage.md) — Split Retrieval into Cheap Recall and Expensive Re-rank
- [`recall-query-expansion`](references/recall-query-expansion.md) — Apply Synonym Expansion at Index Time for Recall, Query Time for Precision
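
As a rough illustration of the fusion step behind `recall-hybrid-rrf`, the sketch below applies Reciprocal Rank Fusion to two ranked lists in plain Python. The document ids and the constant `k=60` are illustrative assumptions; in production the fusion would normally run inside the search engine rather than in application code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; tie-break on doc id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a lexical query and a vector query.
bm25_hits = ["a", "b", "c", "d"]
knn_hits = ["c", "a", "e"]
fused = reciprocal_rank_fusion([bm25_hits, knn_hits])
print(fused)
```

Documents that appear in both lists ("a" and "c") rise to the top because RRF only looks at ranks, so the incompatible score scales of BM25 and vector similarity never need to be normalised.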

### 2. Base Relevance & Field Scoring (CRITICAL)

- [`rel-bm25f-field-weights`](references/rel-bm25f-field-weights.md) — Tune BM25F Field Weights Before k1/b
- [`rel-multi-match-strategy`](references/rel-multi-match-strategy.md) — Pick multi_match Type by Query Shape, Not by Default
- [`rel-bm25-k1-b-tuning`](references/rel-bm25-k1-b-tuning.md) — Tune BM25 k1 and b Per-Field for Short Marketplace Documents
- [`rel-listwise-loss`](references/rel-listwise-loss.md) — Prefer Listwise (LambdaMART) over Pairwise (RankNet) LTR Loss
- [`rel-script-score-over-function-score`](references/rel-script-score-over-function-score.md) — Use script_score Query, Not function_score, for Composition
- [`rel-rescore-over-bool-should`](references/rel-rescore-over-bool-should.md) — Use rescore Phase for Heavy Scoring, Not bool/should at Retrieval
- [`rel-avoid-boost-inflation`](references/rel-avoid-boost-inflation.md) — Avoid Field-Boost Inflation Above ~10x
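
To make the role of `k1` and `b` in `rel-bm25-k1-b-tuning` concrete, here is a minimal sketch of a single-term BM25 score in the textbook Okapi form. Engine implementations may differ by constant factors, and the corpus statistics below are made-up numbers for illustration only.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one field (illustrative, not engine-exact)."""
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly field length is normalised; b=0 disables it.
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    # k1 controls how quickly repeated occurrences of the term saturate.
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Hypothetical corpus statistics for a short "title" field.
stats = dict(avg_doc_len=8, n_docs=10_000, doc_freq=100)

short_title = bm25_term_score(tf=1, doc_len=4, **stats)
long_title = bm25_term_score(tf=1, doc_len=16, **stats)
flat_short = bm25_term_score(tf=1, doc_len=4, b=0.0, **stats)
flat_long = bm25_term_score(tf=1, doc_len=16, b=0.0, **stats)

print(short_title > long_title)  # length normalisation favours the shorter title
print(flat_short == flat_long)   # with b=0 length no longer matters
```

On short marketplace fields a difference of a few tokens is a large relative change in length, which is one reason a lower `b` is often worth testing there.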

### 3. Quality Signals & Confidence Bounds (HIGH)

- [`qual-wilson-lower-bound`](references/qual-wilson-lower-bound.md) — Sort by Wilson Lower Bound, Not Average Rating
- [`qual-bayesian-average`](references/qual-bayesian-average.md) — Use Bayesian Average for Star Ratings with Low Sample Sizes
- [`qual-rank-feature-saturation`](references/qual-rank-feature-saturation.md) — Saturate Popularity Counts with rank_feature.saturation
- [`qual-rank-feature-sigmoid`](references/qual-rank-feature-sigmoid.md) — Apply Sigmoid Modifier for Bounded Ratio Signals
- [`qual-log1p-vs-saturation`](references/qual-log1p-vs-saturation.md) — Choose log1p over Saturation for Long-Tail Signal Preservation
- [`qual-completeness-score`](references/qual-completeness-score.md) — Score Listing Completeness as a Quality Signal
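
The two confidence-aware scores named in `qual-wilson-lower-bound` and `qual-bayesian-average` can be sketched as follows. These values would typically be pre-computed in Python and indexed as numeric fields; the prior mean and prior weight below are assumed values, not recommendations.

```python
import math


def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval for a binomial proportion."""
    if total == 0:
        return 0.0
    p = positive / total
    denom = 1.0 + z * z / total
    centre = p + z * z / (2.0 * total)
    margin = z * math.sqrt((p * (1.0 - p) + z * z / (4.0 * total)) / total)
    return (centre - margin) / denom


def bayesian_average(rating_sum, n, prior_mean, prior_weight):
    """Shrink a mean rating toward prior_mean using prior_weight pseudo-ratings."""
    return (prior_weight * prior_mean + rating_sum) / (prior_weight + n)


# A perfect record on 5 reviews ranks below 90% positive on 100 reviews.
few_reviews = wilson_lower_bound(5, 5)
many_reviews = wilson_lower_bound(90, 100)

# Two 5-star ratings, shrunk toward an assumed marketplace-wide mean of 4.2.
shrunk = bayesian_average(rating_sum=10, n=2, prior_mean=4.2, prior_weight=10)
print(few_reviews, many_reviews, shrunk)
```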

### 4. Personalization & Embeddings (HIGH)

- [`pers-listing-embeddings`](references/pers-listing-embeddings.md) — Train Listing Embeddings from Booking-Session Co-occurrence
- [`pers-type-embeddings-cold-start`](references/pers-type-embeddings-cold-start.md) — Use Type Embeddings for Cold-Start Users and Listings
- [`pers-real-time-session-vector`](references/pers-real-time-session-vector.md) — Update Session Vector in Real-Time from Click Events
- [`pers-multi-modal-embeddings`](references/pers-multi-modal-embeddings.md) — Use Multi-Modal Embeddings (Text + Image) for Recall
- [`pers-cross-encoder-rerank`](references/pers-cross-encoder-rerank.md) — Apply Cross-Encoder Re-rank on Top-50 for Personalization
- [`pers-tower-split-offline-online`](references/pers-tower-split-offline-online.md) — Split Item Tower Offline, Query Tower Online
- [`pers-contextual-features`](references/pers-contextual-features.md) — Inject Contextual Features into script_score

### 5. Spatial & Temporal Decay (HIGH)

- [`decay-gauss-geo`](references/decay-gauss-geo.md) — Use Gauss Decay for Geo Distance, Not Linear
- [`decay-exp-freshness`](references/decay-exp-freshness.md) — Use Exp Decay for Time Freshness, Gauss for Date Proximity
- [`decay-scale-calibration`](references/decay-scale-calibration.md) — Calibrate Decay Scale to the 0.5-Score Distance Target
- [`decay-offset-noise`](references/decay-offset-noise.md) — Add Offset to Decay Functions for Noisy Sparse Fields
- [`decay-multi-field-composition`](references/decay-multi-field-composition.md) — Compose Multi-Field Decay with Explicit Weights
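
For `decay-scale-calibration` it helps to see the curves numerically. The sketch below reimplements the gauss and exp decay shapes in Python, following the formulas documented for `function_score` decay functions; it is a calibration aid under that assumption, not the engine's code, and the distances are hypothetical.

```python
import math


def gauss_decay(distance, scale, decay=0.5, offset=0.0):
    """Gaussian decay: the score equals `decay` at distance offset + scale."""
    d = max(0.0, abs(distance) - offset)
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-d ** 2 / (2.0 * sigma_sq))


def exp_decay(distance, scale, decay=0.5, offset=0.0):
    """Exponential decay: the score equals `decay` at distance offset + scale."""
    d = max(0.0, abs(distance) - offset)
    lam = math.log(decay) / scale
    return math.exp(lam * d)


# With scale = 5 km and the default decay = 0.5, both curves cross 0.5 at 5 km.
at_scale = (gauss_decay(5, scale=5), exp_decay(5, scale=5))
# Beyond the scale the gaussian falls off faster than the exponential.
at_double = (gauss_decay(10, scale=5), exp_decay(10, scale=5))
# Inside the offset the score stays at 1.0.
inside_offset = gauss_decay(1, scale=5, offset=1)
print(at_scale, at_double, inside_offset)
```

Picking `scale` as the distance at which a listing should keep half its score makes the parameter directly interpretable by product owners.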

### 6. Two-Sided Marketplace Balance (HIGH)

- [`market-conversion-weighted-ranking`](references/market-conversion-weighted-ranking.md) — Weight Ranking by Conversion Rate, Not Click-Through Rate
- [`market-cold-start-exploration`](references/market-cold-start-exploration.md) — Boost Cold-Start Listings with Bounded Exposure Allocation
- [`market-supply-fairness-lorenz`](references/market-supply-fairness-lorenz.md) — Monitor Supply-Side Fairness with Lorenz/Gini Metrics
- [`market-host-quality-signals`](references/market-host-quality-signals.md) — Separate Host-Quality and Listing-Quality Signals
- [`market-inventory-health`](references/market-inventory-health.md) — Penalize Listings with Low Inventory Health
- [`market-pareto-multi-objective`](references/market-pareto-multi-objective.md) — Optimize Multi-Objective Ranking with Pareto-Aware Weights
- [`market-price-relevance`](references/market-price-relevance.md) — Score Price Relevance with Soft Bands, Not Hard Filters
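
The Gini coefficient behind `market-supply-fairness-lorenz` can be computed from per-host impression counts in a few lines. This is a minimal sketch; the impression counts are invented for illustration.

```python
def gini(exposures):
    """Gini coefficient of non-negative exposure counts (0 = equal, near 1 = concentrated)."""
    xs = sorted(exposures)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with i = 1..n over ascending x.
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Hypothetical weekly impressions per host.
even_exposure = gini([10, 10, 10, 10])
winner_takes_all = gini([0, 0, 0, 100])
print(even_exposure, winner_takes_all)
```

Tracking this value per market and per week gives a simple alarm for ranking changes that concentrate exposure on a few incumbents.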

### 7. Bias Correction & Online Learning (HIGH)

- [`bias-position-ips`](references/bias-position-ips.md) — Correct Position Bias with Inverse Propensity Scoring
- [`bias-click-models`](references/bias-click-models.md) — Estimate Click Propensities with PBM, Cascade, or DBN
- [`bias-thompson-sampling`](references/bias-thompson-sampling.md) — Explore Ranking Alternatives with Thompson Sampling
- [`bias-counterfactual-eval`](references/bias-counterfactual-eval.md) — Validate Ranking Changes with Counterfactual Evaluation
- [`bias-interleaved-evaluation`](references/bias-interleaved-evaluation.md) — Use Interleaved Evaluation for Low-Traffic Ranking Comparisons
- [`bias-popularity-debiasing`](references/bias-popularity-debiasing.md) — Subsample Popular Items in Embedding Training Negatives
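
The reweighting idea in `bias-position-ips` can be sketched as below: each click is weighted by the inverse of the estimated probability that its position was examined, with clipping to limit variance. The propensity table and the clipping threshold are assumed values; in practice the propensities would come from a click model or a randomisation experiment.

```python
def ips_weights(positions, clicks, propensity, max_weight=10.0):
    """Per-impression training weights: 1 / P(examined | position) for clicks, else 0."""
    weights = []
    for pos, clicked in zip(positions, clicks):
        if clicked:
            # Clip so that rare clicks at deep positions cannot dominate training.
            weights.append(min(1.0 / propensity[pos], max_weight))
        else:
            weights.append(0.0)
    return weights


# Hypothetical examination propensities by rank position.
propensity = {1: 1.0, 2: 0.5, 3: 0.25, 10: 0.05}

positions = [1, 3, 10, 2]
clicks = [1, 1, 1, 0]
weights = ips_weights(positions, clicks, propensity)
print(weights)
```

A click at position 3 counts four times as much as a click at position 1 here, which compensates for the fact that fewer users ever looked that far down.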

### 8. Evaluation & Measurement (HIGH)

- [`eval-graded-judgment-set`](references/eval-graded-judgment-set.md) — Build a Graded Judgment Set for Offline Evaluation
- [`eval-ndcg-primary-metric`](references/eval-ndcg-primary-metric.md) — Use NDCG@k as the Primary Offline Ranking Metric
- [`eval-online-offline-correlation`](references/eval-online-offline-correlation.md) — Validate Online-Offline Metric Correlation Before Trusting Offline Scores
- [`eval-ablation-attribution`](references/eval-ablation-attribution.md) — Run Ablation Studies to Attribute Lift to Specific Components
- [`eval-ab-sample-size-mde`](references/eval-ab-sample-size-mde.md) — Calculate A/B Sample Size from MDE Before Running
- [`eval-cuped-variance-reduction`](references/eval-cuped-variance-reduction.md) — Apply CUPED to Halve A/B Sample Size with Pre-Experiment Covariates
- [`eval-regression-query-suite`](references/eval-regression-query-suite.md) — Maintain a Regression Query Suite for Silent Quality Drops
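
For `eval-ndcg-primary-metric`, a minimal NDCG@k sketch over graded judgments looks like this. It assumes integer grades (0 = irrelevant) and the exponential gain `2^grade - 1`; the grades shown are illustrative.

```python
import math


def dcg_at_k(grades, k):
    """Discounted cumulative gain with exponential gain and log2 position discount."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))


def ndcg_at_k(ranked_grades, judged_grades, k):
    """NDCG@k: DCG of the returned order divided by DCG of the ideal order.

    ranked_grades: grades of the returned documents, in ranked order.
    judged_grades: grades of all judged documents for the query (any order).
    """
    ideal = dcg_at_k(sorted(judged_grades, reverse=True), k)
    return dcg_at_k(ranked_grades, k) / ideal if ideal > 0 else 0.0


judged = [3, 2, 0]
perfect = ndcg_at_k([3, 2, 0], judged, k=3)
reversed_order = ndcg_at_k([0, 2, 3], judged, k=3)
print(perfect, reversed_order)
```

Computing the ideal ordering from all judged documents, not only the returned ones, keeps a ranker from scoring 1.0 when it simply failed to retrieve the best listing.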

### 9. Diversity & Re-ranking (MEDIUM-HIGH)

- [`div-mmr-rerank`](references/div-mmr-rerank.md) — Apply MMR Rerank for Top-Window Diversity
- [`div-max-per-host`](references/div-max-per-host.md) — Cap Impressions Per Host with Max-Per-Group Constraint
- [`div-category-diversity`](references/div-category-diversity.md) — Diversify Categories Hierarchically in the Top Window
- [`div-dpp-quality-diversity`](references/div-dpp-quality-diversity.md) — Use Determinantal Point Processes for Joint Quality and Diversity
- [`div-window-penalty`](references/div-window-penalty.md) — Apply Window-Based Diversity Penalty in Rescore
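
The greedy selection loop behind `div-mmr-rerank` can be sketched as follows. The relevance scores, the host mapping, and the same-host similarity function are hypothetical stand-ins; a real system would more likely use embedding similarity.

```python
def mmr_rerank(candidates, relevance, similarity, lam=0.7, top_n=10):
    """Greedy Maximal Marginal Relevance over a candidate window."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < top_n:

        def mmr_score(c):
            # Penalise similarity to the closest item that is already selected.
            max_sim = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1.0 - lam) * max_sim

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical listings: "a" and "b" belong to the same host.
relevance = {"a": 0.9, "b": 0.85, "c": 0.6}
host = {"a": "h1", "b": "h1", "c": "h2"}


def same_host(x, y):
    return 1.0 if host[x] == host[y] else 0.0


reranked = mmr_rerank(["a", "b", "c"], relevance, same_host, lam=0.7)
print(reranked)
```

Listing "c" overtakes "b" despite lower relevance because "b" duplicates the host already shown in first place; raising `lam` toward 1.0 restores pure relevance order.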

## How to Use

For a focused question ("which decay function for geo distance?"), jump directly to the relevant rule (`decay-gauss-geo`) — each rule is self-contained with the WHY, OpenSearch query/Painless code, and the canonical source citation.

For a full ranking system review, work the categories top-to-bottom. The cascade ordering is real: get recall right first (no boost recovers a missed candidate), then base relevance (it is the multiplicand of every function score), then quality / personalization / decay / marketplace balance / bias correction in that order. Diversity is the last re-rank step over a well-ordered top window.

For correcting bias before retraining, start with `bias-position-ips` and `bias-click-models`. Applying IPS to position-confounded click data is usually among the highest-leverage changes for a marketplace that retrains LTR models on logged clicks, because uncorrected clicks mostly teach the model to reproduce the previous ranking.

**For testing multiple algorithms together and validating empirically**, start with `eval-graded-judgment-set` (build the foundation), `eval-ndcg-primary-metric` (pick the metric), then `eval-ablation-attribution` (attribute lift to specific components). Pair with `eval-online-offline-correlation` to verify your offline metric predicts online behavior, `eval-ab-sample-size-mde` + `eval-cuped-variance-reduction` for disciplined A/B testing, and `eval-regression-query-suite` to catch silent quality drops on named queries.
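
The sizing step in `eval-ab-sample-size-mde` and the variance reduction in `eval-cuped-variance-reduction` can be approximated with the standard two-sample normal formula. The baseline conversion rate, the MDE, and the covariate correlation below are assumed numbers for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(sigma, mde_abs, alpha=0.05, power=0.8):
    """Users per arm for a two-sided test: n = 2 * (z_a + z_b)^2 * sigma^2 / mde^2."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2.0 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde_abs ** 2)


def cuped_sample_size(n, rho):
    """CUPED scales metric variance, and so sample size, by (1 - rho^2)."""
    return math.ceil(n * (1.0 - rho ** 2))


# Assumed baseline: 5% conversion, detecting a 5% relative lift (0.25 points absolute).
baseline = 0.05
sigma = math.sqrt(baseline * (1.0 - baseline))
n_plain = sample_size_per_arm(sigma, mde_abs=0.0025)

# Assumed correlation between the pre-experiment covariate and the metric.
n_cuped = cuped_sample_size(n_plain, rho=0.5 ** 0.5)
print(n_plain, n_cuped)
```

Halving the sample size requires a correlation of about 0.71 between the covariate and the metric; weaker covariates still help, but proportionally less.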

To back a design decision with research, use the canonical reference at the end of each rule: KDD/SIGIR/WSDM papers, the relevant engineering blog (Airbnb, Pinterest, DoorDash, Etsy, Just Eat Takeaway, Thumbtack), or the OpenSearch documentation page.

Read [section definitions](references/_sections.md) for the cascade-impact rationale behind the category ordering, or [the rule template](assets/templates/_template.md) when adding a new rule.

## Reference Files

| File | Description |
|------|-------------|
| [references/_sections.md](references/_sections.md) | Category definitions and ordering by cascade impact |
| [AGENTS.md](AGENTS.md) | Compact TOC navigation (auto-built; do not edit by hand) |
| [assets/templates/_template.md](assets/templates/_template.md) | Template for authoring new rules |
| [metadata.json](metadata.json) | Version and authoritative reference URLs |