Information Retrieval: Cosine Similarity, Dense, Sparse & Reranking

When a user types a query into a search system backed by a RAG pipeline, something deceptively complex happens: the system must decide, out of potentially millions of document chunks, which handful are actually relevant. This is the Information Retrieval (IR) problem, and how you solve it determines whether your LLM sees the right context, or hallucinates in the dark.

This blog walks through the full modern IR stack: from the mathematical foundation of cosine similarity, through the rise of dense embeddings with transformer models like BERT, to sparse retrieval methods like BM25 and SPLADE, and finally to hybrid search and cross-encoder reranking, the combination that defines state-of-the-art RAG pipelines today.


1. The Foundation: Cosine Similarity

Before talking about embeddings at all, we need to understand the metric that makes vector search possible: cosine similarity.

The core idea is elegant. If you represent a query and a document as vectors in some $n$-dimensional space, their relevance can be approximated by the angle between those vectors. Two vectors pointing in the same direction are semantically related; two vectors that are orthogonal (90°) share nothing in common.

Cosine similarity is defined as:

$$\cos(\theta) = \frac{\vec{q} \cdot \vec{d}}{\|\vec{q}\| \cdot \|\vec{d}\|}$$

where $\vec{q}$ and $\vec{d}$ are the query and document vectors respectively, $\vec{q} \cdot \vec{d}$ is their dot product, and $\|\vec{v}\|$ denotes the Euclidean norm of a vector.

The result lives in $[-1, 1]$:

- $\cos(\theta) = 1$ → vectors are identical in direction (perfectly similar)

- $\cos(\theta) = 0$ → vectors are orthogonal (no similarity)

- $\cos(\theta) = -1$ → vectors point in opposite directions (maximum dissimilarity)

In practice, many embedding models normalise their output vectors to unit length ($\|\vec{v}\| = 1$), in which case cosine similarity reduces to a simple dot product: $\cos(\theta) = \vec{q} \cdot \vec{d}$. This is why vector databases can run retrieval with extremely fast BLAS operations.

A key geometric intuition: cosine similarity is magnitude-invariant. A document mentioning a concept ten times and one mentioning it once can have the same cosine score if they point in the same semantic direction. This is by design: we care about what a chunk is about, not how often it repeats itself. It is also one reason why cosine similarity is often preferred over the raw dot product for retrieval tasks where document length varies significantly.
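As a minimal sketch of the definition above (the vectors are made-up toy values, not real embeddings), the following NumPy snippet computes cosine similarity directly and then checks that, after normalising to unit length, a plain dot product gives the same number:

```python
import numpy as np

def cosine_similarity(q: np.ndarray, d: np.ndarray) -> float:
    """Cosine of the angle between q and d (both must be non-zero)."""
    return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))

def normalise(v: np.ndarray) -> np.ndarray:
    """Scale v to unit Euclidean length."""
    return v / np.linalg.norm(v)

q = np.array([1.0, 2.0, 0.0])
d = np.array([2.0, 4.0, 0.0])   # same direction as q, twice the magnitude
e = np.array([0.0, 0.0, 3.0])   # orthogonal to q

print(cosine_similarity(q, d))  # approximately 1.0: magnitude does not matter
print(cosine_similarity(q, e))  # 0.0: orthogonal vectors

# For unit-length vectors the dot product equals the cosine similarity.
print(float(np.dot(normalise(q), normalise(d))))
```

The second vector is just the first one scaled by two, which is the magnitude-invariance property in miniature.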


[Figure: BERT tokenisation and encoding pipeline]

For sentence-level retrieval tasks, the final hidden state of the [CLS] token is often used as the sentence embedding: a single vector that summarises the entire input (mean pooling over all token states is a common alternative). This is what bi-encoder models (such as those in sentence-transformers) fine-tune: they train BERT so that the [CLS] vector of a query and the [CLS] vector of a relevant document end up close in cosine space.

The dimensionality of this vector is model-specific: BERT-base produces 768-D vectors; larger models like bge-large-en produce 1024-D. Every extra dimension adds expressiveness but also increases memory and compute cost in the vector index.
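To put a rough number on that memory cost, here is a back-of-the-envelope sketch. It assumes vectors are stored as raw float32 values (4 bytes each) and ignores any overhead from the index structure itself; the function name and the one-million-chunk corpus size are illustrative choices.

```python
def index_size_gib(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw storage for num_vectors embeddings of the given dimensionality, in GiB."""
    return num_vectors * dim * bytes_per_value / 2**30

# One million chunks at the two dimensionalities mentioned above.
for dim in (768, 1024):
    print(dim, round(index_size_gib(1_000_000, dim), 2))
# 768 2.86
# 1024 3.81
```

Real indexes add graph or quantisation structures on top, so treat these figures as a lower bound for uncompressed storage.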


3. Dense Retrieval

Dense retrieval (also called semantic search or neural retrieval) is the technique of embedding both queries and documents into the same continuous vector space using a bi-encoder, and then finding the nearest document vectors to a query vector using Approximate Nearest Neighbour (ANN) search.

The term dense refers to the nature of the embedding vectors: virtually every dimension is non-zero, each encoding some facet of meaning. In a 768-D BERT embedding, you cannot point to a specific dimension and say "this one represents colour": the information is distributed across all dimensions.
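The offline/online split can be sketched in a few lines. This example makes two loud assumptions: the "embeddings" are random vectors standing in for real bi-encoder output, and the search is exact brute force rather than a true ANN index (a production system would use a structure such as HNSW instead).

```python
import numpy as np

rng = np.random.default_rng(0)

def normalise_rows(m: np.ndarray) -> np.ndarray:
    """Scale every row to unit length so dot product equals cosine similarity."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Offline: embed and store the corpus (random stand-ins for bi-encoder vectors).
doc_vectors = normalise_rows(rng.normal(size=(1000, 768)))

def search(query_vector: np.ndarray, doc_vectors: np.ndarray, top_k: int = 5):
    """Exact nearest-neighbour search by cosine similarity (stand-in for ANN)."""
    q = query_vector / np.linalg.norm(query_vector)
    scores = doc_vectors @ q                 # one dot product per document
    top = np.argsort(-scores)[:top_k]        # indices of the highest scores
    return [(int(i), float(scores[i])) for i in top]

# Online: a query that is a slightly perturbed copy of document 42.
query = doc_vectors[42] + 0.01 * rng.normal(size=768)
results = search(query, doc_vectors)
print(results[0][0])  # 42
```

Brute force costs one dot product per document per query, which is exactly the cost ANN indexes are designed to avoid at the scale of millions of vectors.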

[Figure: Dense retrieval: offline indexing and online query]

Dense retrieval: strengths and weaknesses

| Strengths | Weaknesses |
| --- | --- |
| Captures semantic similarity and synonyms | Misses exact keyword matches ("TS-999 error") |
| Works across paraphrase and language variation | Embedding dimensions are not interpretable |
| Single unified vector per document, compact index | Sensitive to embedding model quality and domain |
| Fast ANN search at scale (millions of docs) | Requires GPU for encoding; expensive to rebuild index |

4. Sparse Retrieval

Sparse retrieval is the classical backbone of information retrieval. Where dense vectors distribute meaning across all dimensions, sparse vectors allocate a dimension to each term in the vocabulary, and most values are zero. Only the terms that actually appear (or are predicted to be relevant) get a non-zero weight.

This sparsity is what makes inverted indexes possible: efficient structures that, for each term, store the list of documents containing it. Lookup stays fast even over billions of documents.
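A minimal sketch of an inverted index, assuming whitespace tokenisation and a three-document toy corpus (both are illustrative simplifications; real engines add stemming, compression, and positional data):

```python
from collections import defaultdict

docs = {
    0: "error code ts-999 when starting the service",
    1: "how to restart the service safely",
    2: "list of error codes and their meaning",
}

def build_inverted_index(docs: dict) -> dict:
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def lookup(index: dict, query: str) -> list:
    """Return ids of documents containing every query term (boolean AND)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

index = build_inverted_index(docs)
print(lookup(index, "error"))        # [0, 2]
print(lookup(index, "the service"))  # [0, 1]
```

Only the posting lists of the query terms are touched, which is why lookup cost depends on how common those terms are and not on the total corpus size.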

4a. BM25: The Classical Baseline

BM25 (Best Matching 25) has been the dominant sparse retrieval algorithm since the mid-1990s and remains a competitive baseline to this day. It extends TF-IDF by adding two important corrections:

1. Term frequency saturation: the relevance boost from repeated occurrences of a term flattens out towards an upper bound instead of growing linearly. Seeing a word 100 times is not 100× better than seeing it once.

2. Document length normalisation: long documents are penalised so that they do not win by sheer size.

The BM25 relevance score of document $d$ for query $q$ containing terms $q_1, \ldots, q_n$ is:

$$\text{BM25}(d, q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, d) \cdot (k_1 + 1)}{f(q_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}$$

where $\text{IDF}(q_i)$ is the inverse document frequency of term $q_i$, $f(q_i, d)$ is the term frequency, $|d|$ is document length, $\text{avgdl}$ is the average document length, and $k_1 \approx 1.5$, $b \approx 0.75$ are tuning parameters.

BM25 excels at exact entity matching for queries like "Error code TS-999", "Christopher Nolan", or "GDPR Article 17": terms that dense models often blur into nearby semantic neighbours.
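The scoring formula can be sketched directly in Python. Two assumptions to flag: the text above does not spell out the IDF term, so this sketch uses the smoothed form $\log\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)$ found in several popular implementations, and tokenisation is plain whitespace splitting. The corpus is a made-up toy.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list, k1: float = 1.5, b: float = 0.75) -> list:
    """Score every document against the query with BM25."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    avgdl = sum(len(tokens) for tokens in tokenised) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenised for term in set(tokens))

    scores = []
    for tokens in tokenised:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            f = tf[term]
            if f == 0:
                continue  # terms absent from the document contribute nothing
            # Assumed IDF variant (smoothed, always positive).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * f * (k1 + 1) / (f + length_norm)
        scores.append(score)
    return scores

docs = [
    "error code ts-999 appears on startup",
    "the car would not start this morning",
    "error error error error handling guide",
]
scores = bm25_scores("error code ts-999", docs)
print(scores)
```

The first document matches all three query terms and ranks highest; the third repeats "error" four times but, because of saturation, gains far less than four times the single-occurrence contribution; the second shares no terms and scores zero.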

4b. Learned Sparse Embeddings: SPLADE

Classical BM25 has one fundamental limitation: it can only match terms that literally appear in both the query and the document. If the user writes "car" and the document says "automobile", BM25 scores zero.

SPLADE (Sparse Lexical and Expansion model, from NAVER Labs) bridges this gap by using BERT to produce learned sparse vectors that include term expansion: the model predicts which vocabulary terms are semantically relevant to an input, even if they don't appear in it.

The SPLADE architecture feeds input tokens through BERT, then projects each token's hidden state back into the full vocabulary dimension ($|V| = 30{,}522$ for BERT-base), applies a ReLU activation to enforce non-negativity followed by a log saturation, and sums the result over all token positions:

$$w_j^{\text{SPLADE}} = \sum_{i} \log\left(1 + \text{ReLU}\left(W_{\text{lex}} \cdot H_i\right)_j\right)$$

where $H_i$ is the BERT hidden state at position $i$ and $j$ indexes the vocabulary (later SPLADE versions replace the sum with a max over positions). The result is a sparse weight vector over the entire vocabulary: most weights are zero, but the non-zero ones capture both the original terms and semantically related expansions. With sparsity ratios typically above 99.8%, these vectors can still be indexed with standard inverted index structures, enabling millisecond retrieval over millions of documents.
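The pooling step can be illustrated with a toy NumPy sketch. It applies a per-token log saturation, $\log(1 + \text{ReLU}(\cdot))$, and then sums over token positions, which is the pooling used in the original SPLADE paper. Everything else is a stand-in: the hidden states and projection matrix are random, the sizes are tiny, and the negative bias is a hypothetical device to mimic the sparsity that a trained model gets from regularisation.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden, vocab = 6, 16, 50        # toy sizes; BERT-base uses 768 and 30,522

H = rng.normal(size=(seq_len, hidden))    # stand-in for BERT hidden states H_i
W_lex = rng.normal(size=(hidden, vocab))  # stand-in for the vocabulary projection
bias = -6.0                               # hypothetical: pushes most logits below zero

logits = H @ W_lex + bias                 # shape (seq_len, vocab)
per_token = np.log1p(np.maximum(logits, 0.0))  # ReLU, then log(1 + x) saturation
weights = per_token.sum(axis=0)           # sum over token positions -> (vocab,)

nonzero = np.flatnonzero(weights)
print(f"{len(nonzero)} of {vocab} vocabulary entries are non-zero")
```

Only the non-zero entries need to be stored, as (term id, weight) pairs, which is what lets the result drop into an ordinary inverted index.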

BM25 vs SPLADE: key differences

| Property | BM25 | SPLADE |
| --- | --- | --- |
| Vocabulary coverage | Only literal query terms | Learned expansion via BERT MLM head |
| Synonyms / paraphrases | ❌ Blind to them | ✅ Expanded into the sparse vector |
| Index type | Classic inverted index | Inverted index (same structure, learned weights) |
| Interpretability | ✅ Fully interpretable weights | ✅ Weights map to real vocabulary terms |
| Training required | ❌ No, rule-based | ✅ Supervised fine-tuning needed |
| Speed at query time | ✅ Extremely fast | ✅ Fast (still sparse index lookup) |
| GPU needed | ❌ CPU only | ✅ GPU for encoding, CPU for index search |

5. BGE-M3: Dense + Sparse + Multi-Vector in One Model

A landmark development in the embedding space is BGE-M3 (Beijing Academy of AI, 2024, arxiv:2402.03216), the first model to unify all three major retrieval paradigms in a single checkpoint. The "M3" stands for Multi-Lingual (100+ languages), Multi-Functionality (dense + sparse + multi-vector), and Multi-Granularity (up to 8,192 tokens).

BGE-M3 uses a single BERT-style encoder backbone, but produces three types of output simultaneously:

- Dense: the [CLS] token's hidden state → 1024-D normalised vector for semantic search.

- Sparse (lexical): a linear + ReLU projection of each token's hidden state → one learned weight per input token, giving sparse lexical weights over the vocabulary (without SPLADE-style term expansion).

- Multi-vector (ColBERT-style): all token vectors are kept → late interaction scoring.

Training uses a novel self-knowledge distillation approach where relevance scores from the three retrieval heads are combined and used as a teacher signal for each head, letting them reinforce each other during training.

In practice, the BGE authors recommend starting with hybrid retrieval (dense + sparse) as the first stage, since it is faster than including the ColBERT-style multi-vector mode, and then applying a reranker to the top-K candidates. When all three modes are combined, the score is:

$$s_{\text{rank}} = w_1 \cdot s_{\text{dense}} + w_2 \cdot s_{\text{lex}} + w_3 \cdot s_{\text{mul}}$$

where typical values are $w_1 = 0.4$, $w_2 = 0.2$, $w_3 = 0.4$, though these should be tuned on a held-out dev set for each domain.
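As a small worked example of the weighted combination, the sketch below ranks two candidates. The per-mode scores are invented for illustration, and the sketch assumes the three scores are already on comparable scales, which in a real system may require normalisation first.

```python
def hybrid_score(s_dense: float, s_lex: float, s_mul: float,
                 weights: tuple = (0.4, 0.2, 0.4)) -> float:
    """Weighted sum of dense, lexical, and multi-vector scores."""
    w1, w2, w3 = weights
    return w1 * s_dense + w2 * s_lex + w3 * s_mul

# Hypothetical (dense, lexical, multi-vector) scores per candidate document.
candidates = {
    "doc_a": (0.82, 0.10, 0.75),
    "doc_b": (0.70, 0.65, 0.72),
}

ranked = sorted(candidates, key=lambda d: hybrid_score(*candidates[d]), reverse=True)
for doc in ranked:
    print(doc, round(hybrid_score(*candidates[doc]), 3))
# doc_b 0.698
# doc_a 0.648
```

Here doc_a has the better dense score, but doc_b's strong lexical match lifts it to the top once the modes are combined.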


6. Hybrid Search and Reciprocal Rank Fusion

Neither dense nor sparse retrieval is universally superior. Dense models grasp paraphrase and contextual meaning; sparse models nail exact entity matches. Hybrid search runs both in parallel and fuses their ranked result lists into a single ranking.

The de-facto fusion algorithm is Reciprocal Rank Fusion (RRF). For each document $d$ retrieved by at least one system, its RRF score aggregates its rank positions across all retrievers:

$$\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}$$

where $R$ is the set of retrievers, $\text{rank}_r(d)$ is the rank of document $d$ in retriever $r$'s result list (1-indexed), and $k$ is a smoothing constant (typically $k = 60$). A retriever that did not return $d$ contributes nothing to the sum.

The smoothing constant $k$ dampens the influence of any single top rank: a document ranked #1 in one system and absent from all others scores $\frac{1}{61} \approx 0.016$, which is relatively modest. Documents that rank well in several retrievers are rewarded additively: a document at rank #2 in both systems scores $2 \times \frac{1}{62} \approx 0.032$, roughly twice the score of a document that is #1 in just one.
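RRF is short enough to sketch in full. The document ids and the two ranked lists below are made up; the function name is an illustrative choice.

```python
from collections import defaultdict

def rrf(rankings: list, k: int = 60) -> list:
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-indexed; documents missing from a list add nothing.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

dense = ["a", "b", "c"]    # hypothetical dense retriever output, best first
sparse = ["d", "b", "e"]   # hypothetical sparse retriever output, best first

fused = rrf([dense, sparse])
for doc_id, score in fused:
    print(doc_id, round(score, 4))
```

Document "b" is second in both lists and scores 2/62 ≈ 0.0323, ahead of "a" and "d", which are each first in one list only and score 1/61 ≈ 0.0164. Because RRF uses only ranks, it needs no calibration between the very different score scales of dense and sparse retrievers.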

When hybrid search beats either method alone

- Mixed query types in the same system: some users ask conceptual questions (dense wins), others search for specific codes or names (sparse wins). Hybrid handles both without tuning per query type.

- Technical corpora (API docs, legal text, medical records) where exact term matching is critical and semantic understanding is also needed.

- Multilingual retrieval: BGE-M3 sparse weights degrade for cross-lingual queries (queries and documents share fewer vocabulary terms across languages), so pairing them with dense retrieval compensates.

- As a general rule: if you're building a production RAG system and you're not doing hybrid search, you're leaving significant recall on the table.

7. Cross-Encoder Reranking

Retrieval, even hybrid retrieval, has a fundamental limitation: it scores query and document independently. A bi-encoder embeds the query into a vector, embeds the document into a vector, and computes cosine similarity. This is fast, but it means the model never sees query and document together during scoring, so it misses subtle cross-term interactions.

Reranking with a cross-encoder adds a second stage that fixes this. Instead of encoding query and document separately, the cross-encoder processes them concatenated as a single input through the full transformer, allowing all attention heads to attend across both at once:

$$\text{input} = [\texttt{[CLS]} \; q_1 \ldots q_n \; \texttt{[SEP]} \; d_1 \ldots d_m \; \texttt{[SEP]}]$$

The [CLS] vector of this joint sequence is then passed through a single linear layer to produce a relevance score: a real number indicating how relevant the document is to the query. This score is typically more accurate than bi-encoder cosine similarity, but it requires a separate forward pass for every (query, document) pair, making it too slow to apply to an entire corpus.

The canonical two-stage pipeline is therefore:

1. Retrieval phase: hybrid search retrieves the top-$K$ candidates quickly (e.g., $K = 100$).

2. Reranking phase: the cross-encoder scores all $K$ pairs and re-sorts them, returning the top-$k$ to the LLM (e.g., $k$ between 5 and 20).
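The control flow of the two stages can be sketched independently of any particular model. In the snippet below, `retrieve` and `rerank_score` are hypothetical callables: in a real system the first would wrap hybrid search and the second a cross-encoder forward pass. The toy implementations (word overlap and Jaccard similarity) exist only to make the example runnable.

```python
from typing import Callable, List, Tuple

def two_stage_search(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    rerank_score: Callable[[str, str], float],
    num_candidates: int = 100,
    num_results: int = 5,
) -> List[Tuple[str, float]]:
    """Cheap retrieval of candidates, then expensive pairwise rescoring."""
    candidates = retrieve(query, num_candidates)                  # stage 1
    scored = [(doc, rerank_score(query, doc)) for doc in candidates]  # stage 2
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num_results]

corpus = [
    "how to reset a forgotten password",
    "password policy for new employees",
    "reset the router to factory settings",
    "holiday schedule for new employees",
]

def toy_retrieve(query: str, n: int) -> List[str]:
    """Stand-in for hybrid search: any document sharing a word with the query."""
    words = set(query.split())
    return [doc for doc in corpus if words & set(doc.split())][:n]

def toy_rerank(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: Jaccard overlap of the two word sets."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)

results = two_stage_search("reset forgotten password", toy_retrieve, toy_rerank,
                           num_results=2)
print(results[0][0])  # how to reset a forgotten password
```

The expensive scorer runs once per candidate, so reranking cost is bounded by `num_candidates` regardless of corpus size.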

[Figure: Two-stage retrieval + reranking pipeline]

Bi-encoder vs cross-encoder: the fundamental trade-off

| Property | Bi-encoder (retrieval) | Cross-encoder (reranker) |
| --- | --- | --- |
| Sees query + doc together? | ❌ No, encoded separately | ✅ Yes, joint attention |
| Latency per query | ✅ Milliseconds (ANN search) | ⚠️ Grows linearly with K |
| Scales to large corpus? | ✅ Yes, pre-index docs | ❌ No, must run per pair at query time |
| Relevance accuracy | ⚠️ Good but misses cross-term interactions | ✅ Excellent, full attention over both |
| Use in pipeline | Stage 1: retrieve candidates | Stage 2: re-rank candidates |
| Models | BAAI/bge-*, sentence-transformers | BAAI/bge-reranker-*, Cohere Rerank |

8. The MTEB Benchmark: Choosing Your Model

With dozens of embedding models available, how do you choose? The MTEB (Massive Text Embedding Benchmark), hosted as a leaderboard on Hugging Face, is the most widely used evaluation framework. It covers 8 task categories (retrieval, reranking, classification, clustering, semantic similarity, summarisation, bitext mining, and pair classification) across 58 datasets and 112 languages.

A few practical lessons from MTEB:

- The overall average can be misleading. Always filter by Retrieval if you're building a RAG system.

- Larger models (7-8B parameters) tend to outperform smaller ones, but at significant GPU cost. For many production use cases, a well-tuned 300-500M-parameter model like bge-large-en-v1.5 hits a good cost-quality balance.

- Domain-specific models (biomedical, legal, financial) usually outperform general-purpose ones within their domain, even if their overall MTEB score is lower.

- BGE-M3 ranks among the top multilingual models and is one of the few open-weight models supporting dense + sparse + multi-vector from a single checkpoint.


9. Putting It All Together: The Full Production Pipeline

A production-grade RAG retrieval pipeline combines everything above. The typical recommended stack from the BGE authors, and consistent with findings from Anthropic's Contextual Retrieval research, is:

Index time: Chunk documents → enrich chunks (deterministic or LLM-based) → encode with BGE-M3 (both dense and sparse) → store in a vector DB with hybrid index support (Qdrant, Milvus, Weaviate).

Query time: Encode query with same model → run hybrid search (dense ANN + sparse inverted index) → fuse with RRF → take top-100 candidates → pass to cross-encoder reranker → take top-20 → feed to LLM.

[Figure: Full RAG retrieval stack]

When to use each retrieval component

| Component | Use it when | Skip it when |
| --- | --- | --- |
| Dense only | Semantic queries, no exact entities, budget-constrained | Queries require exact term matching |
| Sparse (BM25) | Exact entity / code / identifier matching | Corpus is tiny or purely conceptual |
| Sparse (SPLADE) | You want lexical matching + semantic expansion, interpretable weights | No GPU budget for encoding; BM25 suffices |
| Hybrid + RRF | Mixed query types or production system needing high recall | Corpus is tiny (<1K docs) or pure semantic domain |
| Cross-encoder reranker | High-precision retrieval, customer-facing Q&A, medical/legal | Latency is <100ms hard constraint with no async |
| BGE-M3 unified | Multilingual corpus, want all three retrieval modes from one model | English-only, prefer smaller/faster separate models |

Interesting Papers