RAPTOR-Style Hierarchical Retrieval
RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) builds a tree of summaries over your chunk corpus. Leaf nodes are original chunks; higher nodes are progressively coarser summaries. At query time the system can retrieve at whatever level of detail matches the question.
This notebook implements a minimal, self-contained RAPTOR-style pipeline using only TF-IDF and rule-based summarisation so it runs without API keys.
What you will learn:
How to build a hierarchical summary tree from chunks.
Why tree-based retrieval can outperform flat retrieval on questions at coarser granularities.
How to compare flat, parent-child, and RAPTOR retrieval on a toy benchmark.
Prerequisite: Work through 12_parent_child_retrieval.ipynb first.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import statistics
1 – Build the Corpus
We use three sections of text, each split into two leaf chunks. This mirrors a small knowledge base about a fictional AI startup.
# ---------------------------------------------------------------------------
# Leaf chunks – the finest grain of information
# ---------------------------------------------------------------------------
leaf_chunks = [
# Section A – Founding
{"id": "A1", "section": "A", "text": "NovaMind was founded in 2022 by Dr. Lena Torres and Marcus Chen in Austin, Texas."},
{"id": "A2", "section": "A", "text": "The founders met at MIT where they published early work on retrieval-augmented generation."},
# Section B – Product
{"id": "B1", "section": "B", "text": "NovaMind's flagship product, Cortex, is an enterprise RAG engine that indexes internal documents."},
{"id": "B2", "section": "B", "text": "Cortex uses a two-stage pipeline: sparse BM25 pre-retrieval followed by dense cross-encoder reranking."},
# Section C – Funding
{"id": "C1", "section": "C", "text": "In July 2024 NovaMind raised a 40 million dollar Series B led by Sequoia Capital."},
{"id": "C2", "section": "C", "text": "The funds will be used to expand internationally and double the engineering team to 120 people."},
]
print(f"{len(leaf_chunks)} leaf chunks across {len(set(c['section'] for c in leaf_chunks))} sections")
6 leaf chunks across 3 sections
2 – Build the Summary Tree
RAPTOR clusters leaf nodes, then summarises each cluster to produce level-1 nodes. Level-1 nodes are then clustered and summarised into level-2 nodes, continuing until a single root summary exists.
Here we use a simple rule-based summariser (sentence concatenation + truncation) instead of an LLM. In production you would replace summarise() with a call to GPT-4o, Claude, or a local model.
# ---------------------------------------------------------------------------
# Rule-based summariser (substitute with LLM in production)
# ---------------------------------------------------------------------------
def summarise(texts: list[str], max_words: int = 40) -> str:
"""Combine texts and truncate to max_words as a cheap summary proxy."""
combined = " ".join(texts)
words = combined.split()
return " ".join(words[:max_words]) + ("..." if len(words) > max_words else "")
# ---------------------------------------------------------------------------
# Build the tree bottom-up
# ---------------------------------------------------------------------------
# Level 0: leaf chunks (already defined)
level_0 = {c["id"]: c["text"] for c in leaf_chunks}
# Level 1: one summary per section (group by section)
sections = sorted(set(c["section"] for c in leaf_chunks))
level_1 = {}
for sec in sections:
texts = [c["text"] for c in leaf_chunks if c["section"] == sec]
level_1[f"L1_{sec}"] = summarise(texts, max_words=30)
# Level 2: single root summary over all level-1 nodes
level_2 = {"ROOT": summarise(list(level_1.values()), max_words=50)}
# Combine all nodes into a single index
tree_nodes = {}
for nid, text in level_0.items():
tree_nodes[nid] = {"level": 0, "text": text}
for nid, text in level_1.items():
tree_nodes[nid] = {"level": 1, "text": text}
for nid, text in level_2.items():
tree_nodes[nid] = {"level": 2, "text": text}
print("=== Summary tree ===")
for nid, info in sorted(tree_nodes.items(), key=lambda x: (x[1]["level"], x[0])):
print(f" [{info['level']}] {nid:8s} {info['text'][:80]}")
=== Summary tree ===
[0] A1 NovaMind was founded in 2022 by Dr. Lena Torres and Marcus Chen in Austin, Texas
[0] A2 The founders met at MIT where they published early work on retrieval-augmented g
[0] B1 NovaMind's flagship product, Cortex, is an enterprise RAG engine that indexes in
[0] B2 Cortex uses a two-stage pipeline: sparse BM25 pre-retrieval followed by dense cr
[0] C1 In July 2024 NovaMind raised a 40 million dollar Series B led by Sequoia Capital
[0] C2 The funds will be used to expand internationally and double the engineering team
[1] L1_A NovaMind was founded in 2022 by Dr. Lena Torres and Marcus Chen in Austin, Texas
[1] L1_B NovaMind's flagship product, Cortex, is an enterprise RAG engine that indexes in
[1] L1_C In July 2024 NovaMind raised a 40 million dollar Series B led by Sequoia Capital
[2] ROOT NovaMind was founded in 2022 by Dr. Lena Torres and Marcus Chen in Austin, Texas
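In a real corpus you rarely have section labels to group by; RAPTOR discovers groupings by clustering embeddings (the paper uses soft Gaussian-mixture clustering over dense vectors). A minimal hard-clustering stand-in with k-means over TF-IDF vectors (the function cluster_chunks and the toy texts are illustrative, not part of the pipeline above):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_chunks(texts: list[str], n_clusters: int) -> list[int]:
    """Assign each chunk a cluster label via k-means over TF-IDF vectors."""
    vecs = TfidfVectorizer(stop_words="english").fit_transform(texts)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(vecs).tolist()

texts = [
    "NovaMind was founded in 2022 in Austin.",
    "The founders met at MIT.",
    "Cortex is an enterprise RAG engine.",
    "Cortex uses BM25 then cross-encoder reranking.",
]
labels = cluster_chunks(texts, n_clusters=2)
```

Each resulting cluster would then be summarised into one level-1 node, exactly as the section loop above does with its pre-defined groups.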
3 – Define the Benchmark
We create questions at three granularity levels:
| Type | Example | Best retrieval level |
|---|---|---|
| Detail | Who cofounded NovaMind? | Leaf (level 0) |
| Section | What does Cortex do? | Section summary (level 1) |
| Overview | Summarise NovaMind | Root summary (level 2) |
benchmark = [
# Detail questions – best answered by leaf chunks
{
"query": "Who cofounded NovaMind?",
"relevant_ids": {"A1", "A2"},
"category": "detail",
},
{
"query": "How much did NovaMind raise in Series B?",
"relevant_ids": {"C1"},
"category": "detail",
},
# Section-level question – best answered by a section summary
{
"query": "How does the Cortex product pipeline work?",
"relevant_ids": {"B1", "B2", "L1_B"},
"category": "section",
},
# Overview question – best answered by the root summary
{
"query": "Give me a high-level overview of NovaMind",
"relevant_ids": {"L1_A", "L1_B", "L1_C", "ROOT"},
"category": "overview",
},
]
print(f"{len(benchmark)} benchmark questions")
for q in benchmark:
print(f" [{q['category']:8s}] {q['query']}")
4 benchmark questions
[detail ] Who cofounded NovaMind?
[detail ] How much did NovaMind raise in Series B?
[section ] How does the Cortex product pipeline work?
[overview] Give me a high-level overview of NovaMind
4 – Retrieval Variants
We compare three strategies:
Flat – search only leaf chunks (level 0). This is standard RAG.
Parent-child – search leaf chunks, then expand to include the section summary.
RAPTOR – search all tree nodes (leaves + summaries) simultaneously.
def retrieve_top_k(query: str, candidates: dict[str, str], k: int = 3) -> list[tuple[str, float]]:
"""Return top-k (node_id, score) pairs by TF-IDF cosine similarity."""
ids = list(candidates.keys())
texts = list(candidates.values())
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(texts + [query])
scores = cosine_similarity(tfidf[-1:], tfidf[:-1]).flatten()
ranked = sorted(zip(ids, scores), key=lambda x: x[1], reverse=True)
return ranked[:k]
# --- Flat retrieval: only leaf chunks ---------------------------------
def flat_retrieve(query: str, k: int = 3) -> list[str]:
candidates = {nid: info["text"] for nid, info in tree_nodes.items() if info["level"] == 0}
return [nid for nid, _ in retrieve_top_k(query, candidates, k)]
# --- Parent-child: retrieve leaves then expand to section summary -----
def parent_child_retrieve(query: str, k: int = 3) -> list[str]:
leaf_results = flat_retrieve(query, k)
result_ids = list(leaf_results)
# expand: for each leaf, add its L1 section summary
for lid in leaf_results:
sec = next((c["section"] for c in leaf_chunks if c["id"] == lid), None)
if sec:
l1_id = f"L1_{sec}"
if l1_id not in result_ids:
result_ids.append(l1_id)
return result_ids
# --- RAPTOR: search the full tree ------------------------------------
def raptor_retrieve(query: str, k: int = 3) -> list[str]:
candidates = {nid: info["text"] for nid, info in tree_nodes.items()}
return [nid for nid, _ in retrieve_top_k(query, candidates, k)]
# Quick sanity check
sample_q = "Give me a high-level overview of NovaMind"
print("Flat: ", flat_retrieve(sample_q))
print("Parent-child:", parent_child_retrieve(sample_q))
print("RAPTOR: ", raptor_retrieve(sample_q))
Flat: ['B1', 'A1', 'C1']
Parent-child: ['B1', 'A1', 'C1', 'L1_B', 'L1_A', 'L1_C']
RAPTOR: ['B1', 'A1', 'ROOT']
5 – Evaluate
For each question we measure:
Recall@k – fraction of relevant IDs present in the retrieved set.
Highest level hit – the highest tree level among retrieved nodes that are relevant (0 = leaf, 1 = section, 2 = root, -1 = no relevant hit).
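Recall@k itself reduces to a set intersection; a standalone sketch (the helper name recall_at_k is ours, matching the inline computation in the loop below):

```python
def recall_at_k(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of relevant IDs that appear anywhere in the retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)

# e.g. flat retrieval for "Who cofounded NovaMind?" returns B1, A1, C1;
# only A1 of the two relevant chunks {A1, A2} is present
recall_at_k(["B1", "A1", "C1"], {"A1", "A2"})  # -> 0.5
```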
variants = {
"flat": flat_retrieve,
"parent_child": parent_child_retrieve,
"raptor": raptor_retrieve,
}
results = []
for q in benchmark:
for vname, vfn in variants.items():
retrieved = vfn(q["query"])
hits = set(retrieved) & q["relevant_ids"]
recall = len(hits) / len(q["relevant_ids"]) if q["relevant_ids"] else 0
# highest tree level among relevant hits
max_level = max((tree_nodes[h]["level"] for h in hits), default=-1)
results.append({
"category": q["category"],
"query": q["query"],
"variant": vname,
"retrieved": retrieved,
"recall": recall,
"max_level": max_level,
})
# ---------------------------------------------------------------------------
# Per-question results
# ---------------------------------------------------------------------------
print(f"{'Category':<10} {'Variant':<14} {'Recall':>7} {'MaxLvl':>7} Query")
print("-" * 80)
for r in results:
print(f"{r['category']:<10} {r['variant']:<14} {r['recall']:>7.2f} {r['max_level']:>7} {r['query'][:40]}")
# ---------------------------------------------------------------------------
# Aggregate
# ---------------------------------------------------------------------------
print("\n=== Aggregate recall by variant ===")
for vname in variants:
recalls = [r["recall"] for r in results if r["variant"] == vname]
print(f" {vname:<14} mean_recall={statistics.mean(recalls):.3f}")
print("\n=== Aggregate recall by category x variant ===")
for cat in ["detail", "section", "overview"]:
for vname in variants:
recalls = [r["recall"] for r in results if r["variant"] == vname and r["category"] == cat]
if recalls:
print(f" {cat:<10} {vname:<14} mean_recall={statistics.mean(recalls):.3f}")
Category Variant Recall MaxLvl Query
--------------------------------------------------------------------------------
detail flat 0.50 0 Who cofounded NovaMind?
detail parent_child 0.50 0 Who cofounded NovaMind?
detail raptor 0.50 0 Who cofounded NovaMind?
detail flat 1.00 0 How much did NovaMind raise in Series B?
detail parent_child 1.00 0 How much did NovaMind raise in Series B?
detail raptor 1.00 0 How much did NovaMind raise in Series B?
section flat 0.67 0 How does the Cortex product pipeline wor
section parent_child 1.00 1 How does the Cortex product pipeline wor
section raptor 0.67 1 How does the Cortex product pipeline wor
overview flat 0.00 -1 Give me a high-level overview of NovaMin
overview parent_child 0.75 1 Give me a high-level overview of NovaMin
overview raptor 0.25 2 Give me a high-level overview of NovaMin
=== Aggregate recall by variant ===
flat mean_recall=0.542
parent_child mean_recall=0.812
raptor mean_recall=0.604
=== Aggregate recall by category x variant ===
detail flat mean_recall=0.750
detail parent_child mean_recall=0.750
detail raptor mean_recall=0.750
section flat mean_recall=0.667
section parent_child mean_recall=1.000
section raptor mean_recall=0.667
overview flat mean_recall=0.000
overview parent_child mean_recall=0.750
overview raptor mean_recall=0.250
6 – Interpretation
| Observation | Explanation |
|---|---|
| Flat retrieval scores well on detail questions | Leaf chunks directly contain the answer tokens. |
| Flat struggles on overview questions | It can only return individual chunks – no summaries exist in its search space. |
| Parent-child improves section-level recall | Expanding to the section summary adds context the leaf alone misses. |
| RAPTOR is the only variant that surfaces the root summary | Searching all levels at once lets the root summary rank for broad queries, while leaf chunks still win for specific ones. With k=3 its overview recall (0.25) still trails parent-child (0.75), which expands to more nodes per query. |
Key takeaway
RAPTOR's value shows up when your question set spans multiple granularity levels. If every query is a narrow fact lookup, flat retrieval is fine. But the moment users ask "what is this about?" or "summarise the funding history", a tree index gives the retriever something appropriate to return.
Production notes
- Replace summarise() with an LLM call (GPT-4o-mini, Claude Haiku, etc.) for real abstractive summaries.
- Use clustering (k-means on embeddings) instead of pre-defined sections to discover natural groupings.
- Store the tree in your vector DB with a level metadata field so you can filter by granularity at query time.
- Consider collapsed tree retrieval (search all nodes together, as we did) versus tree traversal (start at the root and descend). The RAPTOR paper found collapsed search works better in most benchmarks.
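For contrast with the collapsed search used in this notebook, here is a minimal tree-traversal sketch: greedily descend from the root, always following the best-scoring child. The tiny inline tree and the lexical overlap_score stand-in for embedding similarity are illustrative only:

```python
def overlap_score(query: str, text: str) -> float:
    """Crude word-overlap similarity; stand-in for an embedding model."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q) or 1)

# node_id -> (text, child_ids); leaves have no children
tree = {
    "ROOT": ("NovaMind overview: founding, product, funding.", ["L1_A", "L1_B"]),
    "L1_A": ("Founding of NovaMind in Austin.", ["A1"]),
    "L1_B": ("Cortex product pipeline.", ["B1"]),
    "A1":   ("NovaMind was founded in 2022 in Austin, Texas.", []),
    "B1":   ("Cortex uses BM25 retrieval then reranking.", []),
}

def traverse(query: str, node: str = "ROOT") -> str:
    """Follow the best-matching child at each level until reaching a leaf."""
    text, children = tree[node]
    if not children:
        return node
    best = max(children, key=lambda c: overlap_score(query, tree[c][0]))
    return traverse(query, best)
```

Traversal always commits to one branch per level, which is why the collapsed variant tends to be more robust: a single wrong turn near the root hides every relevant leaf below the other branches.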
Next steps
Try adding a level-3 summary if you have a larger corpus.
Replace TF-IDF with a real embedding model and re-run.
Move on to 10_graphrag_visual_rag.ipynb for entity-based retrieval, or revisit 08_rag_technique_selection.md to choose your next upgrade.