ADR-0002: Reciprocal Rank Fusion for hybrid search
Status: Accepted
Date: 2025-12 (approximate — decision predates ADR documentation)
Deciders: EtanHey
Context
BrainLayer needs to combine two search signals for retrieval:
- Semantic search — sqlite-vec KNN over bge-large-en-v1.5 embeddings (1024 dims). Good at finding conceptually related content ("how did I implement auth?" matches a chunk about "OAuth2 token refresh").
- Keyword search — FTS5 full-text search over chunk content, summaries, and tags. Good at finding exact matches that semantic search misses (specific function names, error codes, file paths).
Neither signal alone is sufficient. Semantic search misses exact terms; keyword search misses paraphrased intent. The question is how to merge the two ranked result lists into a single ranking.
Candidates considered:
| Method | Approach | Pros | Cons |
|---|---|---|---|
| Linear combination | α * semantic_score + (1-α) * keyword_score | Simple, tunable | Requires score normalization across different scales |
| Reciprocal Rank Fusion (RRF) | Σ 1/(k + rank_i) per result | Score-agnostic, no normalization needed, robust | Single hyperparameter (k), no weight tuning |
| Cross-encoder re-ranker | LLM re-scores merged candidates | Highest quality | Requires a second model, adds latency |
| Keyword-only fallback | Semantic first, FTS if no results | Simple | Misses boosting from keyword overlap |
Decision
Use Reciprocal Rank Fusion (RRF) to merge semantic and keyword results, with post-RRF boosting for importance and recency.
The implementation in search_repo.py works as follows:
- Retrieve candidates: run both semantic (top 30) and FTS5 (top 30) searches with the same filters.
- Compute RRF scores: for each unique chunk_id across both result sets:

      score = 0
      if chunk in semantic results: score += 1 / (k + semantic_rank)
      if chunk in FTS results: score += 1 / (k + fts_rank)

  where k = 60 (standard default from the original RRF paper).
- Post-RRF boosting: multiply the RRF score by two heuristic factors:
  - Importance boost: 1.0 + min(importance, 10) / 20, giving a range of 1.0x to 1.5x based on the chunk's enriched importance score (0-10).
  - Recency boost: 0.7 + 0.3 * exp(-0.023 * age_days), an exponential decay with a 30-day half-life, giving a range of 0.7x (old) to 1.0x (fresh).
- Sort and return top n_results by boosted score.
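The steps above can be sketched in Python as follows. This is a minimal illustration, not the code in search_repo.py: the function names, the shape of the `meta` mapping, and the use of 1-based ranks are assumptions made for the example; only k = 60 and the two boost formulas come from this ADR.

```python
import math

RRF_K = 60  # value stated in this ADR


def rrf_scores(semantic_ids, fts_ids, k=RRF_K):
    """Fuse two ranked lists of chunk ids (best first) into {chunk_id: score}."""
    # Assumption: ranks are 1-based, as in the original RRF paper.
    semantic_rank = {cid: r for r, cid in enumerate(semantic_ids, start=1)}
    fts_rank = {cid: r for r, cid in enumerate(fts_ids, start=1)}

    scores = {}
    for cid in semantic_rank.keys() | fts_rank.keys():
        score = 0.0
        if cid in semantic_rank:
            score += 1.0 / (k + semantic_rank[cid])
        if cid in fts_rank:
            score += 1.0 / (k + fts_rank[cid])
        scores[cid] = score
    return scores


def boost(score, importance, age_days):
    """Apply the importance and recency multipliers to an RRF score."""
    importance_boost = 1.0 + min(importance, 10) / 20         # 1.0x to 1.5x
    recency_boost = 0.7 + 0.3 * math.exp(-0.023 * age_days)   # 1.0x down to 0.7x
    return score * importance_boost * recency_boost


def hybrid_rank(semantic_ids, fts_ids, meta, n_results=10):
    """Return the top chunk ids; meta maps chunk_id -> (importance, age_days).

    The meta shape is hypothetical and chosen only to keep the sketch short.
    """
    fused = rrf_scores(semantic_ids, fts_ids)
    boosted = {cid: boost(s, *meta[cid]) for cid, s in fused.items()}
    return sorted(boosted, key=boosted.get, reverse=True)[:n_results]
```

With `semantic_ids = ["a", "b"]` and `fts_ids = ["b", "c"]` and equal metadata, chunk `b` ranks first because it is the only one that collects a term from both lists.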
Results are cached in a module-level LRU cache (128 entries, 60-second TTL) keyed on (store_path, query_text, embedding_hash, all_filters, k).
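One possible shape for such a cache is sketched below. It is an illustration under stated assumptions rather than the actual module-level cache: the class name `TTLLRUCache`, the helper `make_key`, and the injectable `clock` parameter are all hypothetical; only the 128-entry limit, the 60-second TTL, and the key fields come from this ADR.

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """LRU cache whose entries also expire after a fixed time-to-live."""

    def __init__(self, maxsize=128, ttl_seconds=60.0, clock=time.monotonic):
        self._maxsize = maxsize
        self._ttl = ttl_seconds
        self._clock = clock             # injectable so expiry can be tested
        self._entries = OrderedDict()   # key -> (stored_at, value)

    def get(self, key):
        """Return the cached value, or None on a miss or an expired entry."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]      # expired: drop it and report a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._entries[key] = (self._clock(), value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._maxsize:
            self._entries.popitem(last=False)  # evict least recently used


def make_key(store_path, query_text, embedding_hash, filters, k):
    """Build a hashable key; filters is assumed to be a flat dict."""
    return (store_path, query_text, embedding_hash,
            tuple(sorted(filters.items())), k)
```

Sorting the filter items makes the key independent of dict insertion order, so two calls with the same filters hit the same entry.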
Consequences
Positive
- No score normalization needed — RRF operates on ranks, not raw scores. This avoids the fragile calibration problem of combining L2 distances (semantic) with BM25 scores (FTS5) on different scales.
- Robust with minimal tuning: the single k parameter (60) is the standard default from Cormack, Clarke & Buettcher, "Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods", SIGIR 2009, and works well in practice. No per-query weight adjustment needed.
- Chunks appearing in both lists get a natural boost: a result ranked highly by both semantic and keyword search receives contributions from both terms, surfacing the most relevant content.
- Handles disjoint results gracefully — chunks appearing in only one list still get a score and can surface if ranked highly enough.
- Fast — the RRF merge itself is O(n + m) after both searches complete. The dominant cost is the semantic KNN scan, not the fusion.
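A quick arithmetic check shows how strong the dual-list boost is before any importance or recency multiplier is applied. The sketch assumes 1-based ranks (this ADR does not state the rank base), and the helper name `rrf` is hypothetical.

```python
K = 60  # value stated in this ADR


def rrf(*ranks, k=K):
    """RRF score for a chunk that appears at the given ranks (one per list)."""
    return sum(1.0 / (k + r) for r in ranks)


top_of_one_list = rrf(1)     # rank 1 in one list, absent from the other: 1/61
mid_in_both = rrf(5, 5)      # rank 5 in both lists: 2/65
tail_in_both = rrf(30, 30)   # last candidate in both lists: 2/90

# Even the last candidate in both lists outscores the best single-list hit.
assert mid_in_both > tail_in_both > top_of_one_list
```

Under these assumptions, with 30 candidates per list, any chunk found by both searches outscores every chunk found by only one of them. Single-list chunks fill the remaining slots, and the post-RRF multipliers can still reorder results afterwards.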
Negative
- k = 60 is hardcoded: while the standard default works well for general retrieval, domain-specific tuning (e.g., favouring keyword precision for error-code lookups vs. semantic recall for concept queries) would require making k configurable, and favouring one signal over the other would also need per-list weights, which plain RRF does not have. This is a known limitation.
- No learned relevance: RRF is a heuristic. A cross-encoder re-ranker would produce higher-quality rankings but at the cost of loading a second model and adding 100-500ms latency per query.
- Post-RRF boosting adds implicit bias — the importance and recency multipliers mean that a highly-important recent chunk can outrank a more semantically relevant older one. This is intentional (recent decisions matter more) but could surprise users searching for historical content.
- Cache invalidation is time-based only — the 60-second TTL means writes within the cache window are invisible to search. Acceptable for the typical MCP usage pattern (search → store → next prompt) but could cause stale results in rapid write-then-read loops.
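The size of the boosting bias can be estimated directly from the two multipliers. The sketch below is illustrative only: the helper name `boost_factor` and the example ranks, ages, and importance values are made up, and ranks are assumed to be 1-based.

```python
import math

K = 60  # value stated in this ADR


def boost_factor(importance, age_days):
    """Combined importance and recency multiplier from this ADR."""
    importance_boost = 1.0 + min(importance, 10) / 20
    recency_boost = 0.7 + 0.3 * math.exp(-0.023 * age_days)
    return importance_boost * recency_boost


# A year-old, low-importance chunk ranked 1st by both searches...
old_relevant = (2 / (K + 1)) * boost_factor(importance=2, age_days=365)
# ...loses to a fresh, high-importance chunk ranked 10th by both searches.
recent_important = (2 / (K + 10)) * boost_factor(importance=10, age_days=0)
assert recent_important > old_relevant

# Largest possible ratio between two multipliers: 1.5 / 0.7, about 2.14x.
max_swing = boost_factor(10, 0) / boost_factor(0, 10_000)

# The decay constant 0.023 corresponds to a half-life of about 30 days.
half_life_days = math.log(2) / 0.023
```

Note that the 30-day half-life applies to the decaying 0.3 term, not to the whole multiplier, which never drops below 0.7.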
Neutral
- The 3x over-fetch (top 30 from each source for a default n_results=10) provides a large enough candidate pool for RRF fusion without excessive DB load.
- Adding a third signal (e.g., knowledge graph proximity) would simply add another 1/(k + rank_i) term to the RRF sum: the algorithm extends naturally.
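To make the extension concrete, here is a hedged sketch of RRF over an arbitrary number of ranked lists. The function name `rrf_fuse_many` and the `graph_ids` list are hypothetical; no knowledge-graph signal is described as implemented in this ADR, and ranks are again assumed to be 1-based.

```python
def rrf_fuse_many(ranked_lists, k=60):
    """Sum 1 / (k + rank) over every list in which a chunk id appears."""
    scores = {}
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return scores


# Usage: a third, hypothetical signal is just one more list in the input.
semantic_ids = ["a", "b", "c"]
fts_ids = ["b", "d"]
graph_ids = ["c", "b"]  # illustrative placeholder for graph proximity results
fused = rrf_fuse_many([semantic_ids, fts_ids, graph_ids])
best = max(fused, key=fused.get)
```

Here `b` comes out on top because it is the only chunk present in all three lists.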