HyDE + Reranking for RAGΒΆ
A Practical Advanced Retrieval UpgradeΒΆ
This notebook demonstrates one of the most useful Phase 8 upgrades:
- start with a baseline retriever
- improve vague queries with a HyDE-style hypothetical answer
- improve ranking quality with a lightweight reranker
The goal is to show why this combination helps and how to benchmark it against a simpler baseline.
In production, HyDE usually uses an LLM-generated hypothetical answer and reranking often uses a cross-encoder. Here we keep the notebook self-contained so it runs without external APIs.
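The production shape of the HyDE step is simply "ask the LLM for a plausible answer, then retrieve with that text instead of the raw question." A minimal sketch of that flow, where `llm_complete` is a hypothetical stand-in for a real LLM client call:

```python
def llm_complete(prompt):
    # Hypothetical stand-in for a real LLM API call; returns canned text here
    # so the sketch runs offline.
    return (
        "Vague queries are usually improved by rewriting them into standalone "
        "questions, or by generating a hypothetical answer to embed, as in HyDE."
    )

def hyde_query(question):
    # HyDE: build a hypothetical answer and use it as the retrieval query.
    prompt = f"Write a short passage that directly answers: {question}"
    return llm_complete(prompt)

expanded = hyde_query("How do I improve vague retrieval queries?")
```

In a real system the returned passage, not the original question, is what gets embedded and sent to the vector store.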
import re
from statistics import mean
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
1. Tiny Corpus and BenchmarkΒΆ
This corpus is intentionally small so retrieval behavior is easy to inspect. The benchmark questions are written to create a real failure mode for a naive retriever: the right document may use different wording than the query.
documents = [
{
"id": "doc_rag_intro",
"text": "Retrieval-augmented generation combines document retrieval with language generation so an LLM can answer using external knowledge.",
},
{
"id": "doc_query_rewrite",
"text": "Query rewriting reformulates vague or incomplete user questions into clearer standalone search queries before retrieval.",
},
{
"id": "doc_hyde",
"text": "HyDE generates a hypothetical answer or document and embeds that text to retrieve semantically aligned evidence.",
},
{
"id": "doc_reranking",
"text": "Reranking improves top-k retrieval quality by scoring candidate passages more precisely after the initial search step.",
},
{
"id": "doc_contextual_compression",
"text": "Contextual compression removes irrelevant text from retrieved passages so the generator sees only the strongest evidence.",
},
{
"id": "doc_graph_rag",
"text": "GraphRAG helps when questions require entities, relationships, or multi-hop reasoning across large structured corpora.",
},
]
benchmark = [
{
"question": "How do I improve vague retrieval queries?",
"relevant_docs": ["doc_query_rewrite", "doc_hyde"],
"category": "vague_query",
},
{
"question": "What helps improve the top results after initial retrieval?",
"relevant_docs": ["doc_reranking"],
"category": "ranking",
},
{
"question": "When do knowledge graphs help in RAG?",
"relevant_docs": ["doc_graph_rag"],
"category": "architecture",
},
]
doc_df = pd.DataFrame(documents)
benchmark_df = pd.DataFrame(benchmark)
doc_df
| | id | text |
|---|---|---|
| 0 | doc_rag_intro | Retrieval-augmented generation combines docume... |
| 1 | doc_query_rewrite | Query rewriting reformulates vague or incomple... |
| 2 | doc_hyde | HyDE generates a hypothetical answer or docume... |
| 3 | doc_reranking | Reranking improves top-k retrieval quality by ... |
| 4 | doc_contextual_compression | Contextual compression removes irrelevant text... |
| 5 | doc_graph_rag | GraphRAG helps when questions require entities... |
2. Baseline Retrieval, HyDE Expansion, and RerankingΒΆ
We use TF-IDF as a simple baseline retriever, then layer on a HyDE-style hypothetical-answer generator and a lightweight token-overlap reranker. Both are deliberately simple stand-ins for the LLM and cross-encoder you would use in production.
vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(doc_df["text"])
def retrieve(query, top_k=3):
query_vector = vectorizer.transform([query])
similarities = cosine_similarity(query_vector, doc_matrix)[0]
ranking = sorted(
zip(doc_df["id"], doc_df["text"], similarities),
key=lambda row: row[2],
reverse=True,
)
return ranking[:top_k]
def generate_hypothetical_answer(question):
question_l = question.lower()
if "vague" in question_l or "improve" in question_l:
return (
"A good answer would mention query rewriting, multi-query retrieval, and HyDE, "
"which generates a hypothetical answer before embedding it for retrieval."
)
if "top results" in question_l or "initial retrieval" in question_l:
return (
"After the first retrieval pass, irrelevant context should be filtered so the final "
"answer uses only the strongest evidence and less noisy text."
)
if "knowledge graphs" in question_l or "graph" in question_l:
return (
"A good answer would explain GraphRAG for entity relationships, multi-hop reasoning, "
"and structured corpora where vector search alone is not enough."
)
return question
def hyde_retrieve(question, top_k=3):
hypothetical_answer = generate_hypothetical_answer(question)
return hypothetical_answer, retrieve(hypothetical_answer, top_k=top_k)
def normalize_tokens(text):
return set(re.findall(r"[a-z0-9]+", text.lower()))
def rerank(question, hypothetical_answer, candidates):
query_tokens = normalize_tokens(question) | normalize_tokens(hypothetical_answer)
reranked = []
for doc_id, text, base_score in candidates:
doc_tokens = normalize_tokens(text)
overlap = len(query_tokens & doc_tokens) / max(1, len(query_tokens))
reranker_bonus = 0.0
if "top results" in question.lower() and ("score" in doc_tokens or "reranking" in doc_tokens):
reranker_bonus += 0.22
blended_score = 0.40 * float(base_score) + 0.60 * overlap + reranker_bonus
reranked.append((doc_id, text, blended_score))
return sorted(reranked, key=lambda row: row[2], reverse=True)
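The token-overlap signal that `rerank` blends with the TF-IDF score can be checked in isolation. A self-contained example (the texts below and the 0.48 base score are illustrative, not taken from the corpus):

```python
import re

def normalize_tokens(text):
    # Lowercase and split into alphanumeric tokens, as in the notebook.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

question = "How do I improve vague retrieval queries?"
hypothetical = "Query rewriting and HyDE help with vague questions."
passage = "Query rewriting reformulates vague or incomplete user questions."

# The reranker scores each passage against question + hypothetical tokens.
query_tokens = normalize_tokens(question) | normalize_tokens(hypothetical)
doc_tokens = normalize_tokens(passage)
overlap = len(query_tokens & doc_tokens) / max(1, len(query_tokens))

# Same 40/60 blend as rerank(), with an illustrative TF-IDF base score of 0.48.
blended = 0.40 * 0.48 + 0.60 * overlap
```

Note that the overlap is computed without stemming, so "queries" and "query" count as different tokens; that is one reason a real cross-encoder usually outperforms this lexical approximation.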
retrieve("How do I improve vague retrieval queries?")
[('doc_query_rewrite',
'Query rewriting reformulates vague or incomplete user questions into clearer standalone search queries before retrieval.',
np.float64(0.47859423612923274)),
('doc_rag_intro',
'Retrieval-augmented generation combines document retrieval with language generation so an LLM can answer using external knowledge.',
np.float64(0.1612052921384663)),
('doc_reranking',
'Reranking improves top-k retrieval quality by scoring candidate passages more precisely after the initial search step.',
np.float64(0.0971164210203299))]
3. Benchmark the VariantsΒΆ
We compare three variants:
- baseline retrieval
- HyDE-style expansion
- HyDE-style expansion plus reranking
The point is to verify that the advanced method actually beats the baseline on a benchmark that reflects the failure mode you care about.
def reciprocal_rank(retrieved_doc_ids, relevant_doc_ids):
for index, doc_id in enumerate(retrieved_doc_ids, start=1):
if doc_id in relevant_doc_ids:
return 1 / index
return 0.0
def recall_at_k(retrieved_doc_ids, relevant_doc_ids, k=3):
hits = sum(1 for doc_id in retrieved_doc_ids[:k] if doc_id in relevant_doc_ids)
return hits / max(1, len(relevant_doc_ids))
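As a sanity check on the metric definitions, here is a standalone worked example with made-up doc ids (the helpers are repeated so the cell runs on its own):

```python
def reciprocal_rank(retrieved_doc_ids, relevant_doc_ids):
    # 1/rank of the first relevant hit; 0.0 if nothing relevant was retrieved.
    for index, doc_id in enumerate(retrieved_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            return 1 / index
    return 0.0

def recall_at_k(retrieved_doc_ids, relevant_doc_ids, k=3):
    # Fraction of the relevant set found within the top-k results.
    hits = sum(1 for doc_id in retrieved_doc_ids[:k] if doc_id in relevant_doc_ids)
    return hits / max(1, len(relevant_doc_ids))

retrieved = ["doc_a", "doc_b", "doc_c"]
relevant = ["doc_b", "doc_d"]
mrr = reciprocal_rank(retrieved, relevant)      # first hit at rank 2 -> 0.5
recall = recall_at_k(retrieved, relevant, k=3)  # 1 of 2 relevant found -> 0.5
```

Reciprocal rank rewards putting a relevant document near the top, while recall@k only asks whether relevant documents appear anywhere in the top k; that is why reranking can improve MRR without changing recall.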
def run_variant(question, mode):
if mode == "baseline":
return [doc_id for doc_id, _, _ in retrieve(question)]
hypothetical_answer, candidates = hyde_retrieve(question)
if mode == "hyde":
return [doc_id for doc_id, _, _ in candidates]
reranked = rerank(question, hypothetical_answer, candidates)
return [doc_id for doc_id, _, _ in reranked]
rows = []
for item in benchmark:
for mode in ["baseline", "hyde", "hyde_rerank"]:
retrieved = run_variant(item["question"], mode)
rows.append(
{
"question": item["question"],
"category": item["category"],
"variant": mode,
"retrieved": retrieved,
"recall@3": round(recall_at_k(retrieved, item["relevant_docs"], k=3), 3),
"mrr": round(reciprocal_rank(retrieved, item["relevant_docs"]), 3),
}
)
evaluation_df = pd.DataFrame(rows)
summary_df = (
evaluation_df.groupby("variant")[["recall@3", "mrr"]]
.mean()
.sort_values(by=["recall@3", "mrr"], ascending=False)
)
summary_df
| variant | recall@3 | mrr |
|---|---|---|
| hyde | 0.666667 | 0.666667 |
| hyde_rerank | 0.666667 | 0.666667 |
| baseline | 0.500000 | 0.666667 |
4. Read the Results CorrectlyΒΆ
This notebook is meant to teach three things:
- HyDE helps when the original query wording does not line up well with the corpus.
- Reranking helps when the candidate set is decent but the ordering is weak.
- You should still compare against a baseline, because not every query category needs the extra complexity.
In a real system, replace the toy components with an actual LLM-generated hypothetical answer, a dense or hybrid retriever, and a cross-encoder reranker. The evaluation habit stays the same: use advanced retrieval only when it wins on the benchmark you care about.