
Parent-Child Retrieval for RAG

Preserve Local Relevance Without Losing Document Context

Flat chunk retrieval often returns a useful sentence or paragraph but loses the broader section that makes the answer complete. Parent-child retrieval fixes that by:

retrieving a small child chunk
mapping it back to a larger parent section
generating the final answer with both local evidence and wider context

It is a comparatively simple structured-retrieval technique, which makes it a useful one to learn before RAPTOR or GraphRAG.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

1. Parent Documents and Child Chunks

Each parent document represents a larger section. Each child chunk is a smaller passage indexed for retrieval.

parent_docs = [
    {
        "parent_id": "parent_query_expansion",
        "title": "Query expansion techniques",
        "text": "Query expansion improves weak retrieval by rewriting vague questions, generating multiple search queries, or using HyDE to embed a hypothetical answer. It helps when the user's wording does not match the corpus.",
    },
    {
        "parent_id": "parent_reranking",
        "title": "Reranking and context filtering",
        "text": "Reranking improves the ordering of retrieved passages after the first retrieval pass. Contextual compression removes noisy passages so the final context contains only the strongest evidence.",
    },
    {
        "parent_id": "parent_structured_retrieval",
        "title": "Structured retrieval for long documents",
        "text": "Parent-child retrieval preserves local relevance while expanding back to a larger section. RAPTOR organizes summaries hierarchically, and GraphRAG helps when answers depend on relationships and multi-hop reasoning.",
    },
]

child_chunks = [
    {
        "chunk_id": "chunk_query_rewrite",
        "parent_id": "parent_query_expansion",
        "text": "Query rewriting reformulates vague or incomplete user questions into clearer standalone search queries.",
    },
    {
        "chunk_id": "chunk_hyde",
        "parent_id": "parent_query_expansion",
        "text": "HyDE generates a hypothetical answer before retrieval to improve semantic alignment.",
    },
    {
        "chunk_id": "chunk_reranking",
        "parent_id": "parent_reranking",
        "text": "Reranking rescores candidate passages after initial retrieval to improve top-k quality.",
    },
    {
        "chunk_id": "chunk_compression",
        "parent_id": "parent_reranking",
        "text": "Contextual compression removes irrelevant text before generation.",
    },
    {
        "chunk_id": "chunk_parent_child",
        "parent_id": "parent_structured_retrieval",
        "text": "Parent-child retrieval retrieves fine-grained chunks, then expands to the larger parent section.",
    },
    {
        "chunk_id": "chunk_raptor",
        "parent_id": "parent_structured_retrieval",
        "text": "RAPTOR builds a tree of recursive summaries for long-document retrieval.",
    },
]

parent_df = pd.DataFrame(parent_docs)
child_df = pd.DataFrame(child_chunks)
parent_lookup = {row['parent_id']: row for row in parent_docs}

child_df

	chunk_id	parent_id	text
0	chunk_query_rewrite	parent_query_expansion	Query rewriting reformulates vague or incomple...
1	chunk_hyde	parent_query_expansion	HyDE generates a hypothetical answer before re...
2	chunk_reranking	parent_reranking	Reranking rescores candidate passages after in...
3	chunk_compression	parent_reranking	Contextual compression removes irrelevant text...
4	chunk_parent_child	parent_structured_retrieval	Parent-child retrieval retrieves fine-grained ...
5	chunk_raptor	parent_structured_retrieval	RAPTOR builds a tree of recursive summaries fo...
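
The child chunks above are written by hand. In a real pipeline they are usually derived from the parent text itself. The following is a minimal sketch of that step, assuming a naive regex sentence splitter; the helper `split_into_children` and its `max_sentences` parameter are hypothetical and not part of this notebook.

```python
import re

def split_into_children(parent, max_sentences=1):
    """Split a parent's text into small child chunks that keep a pointer to the parent."""
    # Naive sentence split (an assumption): break on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", parent["text"]) if s.strip()]
    children = []
    for start in range(0, len(sentences), max_sentences):
        children.append({
            "chunk_id": f"{parent['parent_id']}_chunk_{start // max_sentences}",
            "parent_id": parent["parent_id"],  # the link used later to expand back to the parent
            "text": " ".join(sentences[start:start + max_sentences]),
        })
    return children

# Parent section copied from parent_docs above.
example_parent = {
    "parent_id": "parent_reranking",
    "title": "Reranking and context filtering",
    "text": "Reranking improves the ordering of retrieved passages after the first retrieval pass. Contextual compression removes noisy passages so the final context contains only the strongest evidence.",
}

derived_children = split_into_children(example_parent)
for child in derived_children:
    print(child["chunk_id"], "->", child["text"])
```

Whatever the splitting rule, the part that matters for parent-child retrieval is that every child records its `parent_id`.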

2. Benchmark Questions

These questions are designed so the answer benefits from retrieving a precise child chunk while still needing the broader parent section for complete context.

benchmark = [
    {
        "question": "What helps when a user's wording does not match the corpus?",
        "relevant_chunk": "chunk_hyde",
        "relevant_parent": "parent_query_expansion",
        "category": "query_expansion",
    },
    {
        "question": "Which retrieval pattern keeps a precise match but expands back to a larger section?",
        "relevant_chunk": "chunk_parent_child",
        "relevant_parent": "parent_structured_retrieval",
        "category": "structured_retrieval",
    },
    {
        "question": "What should you do after the initial retrieval pass to improve top results?",
        "relevant_chunk": "chunk_reranking",
        "relevant_parent": "parent_reranking",
        "category": "reranking",
    },
]

pd.DataFrame(benchmark)

	question	relevant_chunk	relevant_parent	category
0	What helps when a user's wording does not matc...	chunk_hyde	parent_query_expansion	query_expansion
1	Which retrieval pattern keeps a precise match ...	chunk_parent_child	parent_structured_retrieval	structured_retrieval
2	What should you do after the initial retrieval...	chunk_reranking	parent_reranking	reranking
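
Each benchmark item labels both a child chunk and a parent, so the two labels have to agree with the `parent_id` links defined in Section 1. A small consistency check along these lines can catch labelling mistakes before evaluation. The helper `find_label_mismatches` is a hypothetical name, and the mapping is copied by hand from `child_chunks` to keep the sketch self-contained.

```python
# chunk_id -> parent_id, copied from the child_chunks list in Section 1.
chunk_to_parent = {
    "chunk_query_rewrite": "parent_query_expansion",
    "chunk_hyde": "parent_query_expansion",
    "chunk_reranking": "parent_reranking",
    "chunk_compression": "parent_reranking",
    "chunk_parent_child": "parent_structured_retrieval",
    "chunk_raptor": "parent_structured_retrieval",
}

# (relevant_chunk, relevant_parent) pairs, copied from the benchmark above.
benchmark_labels = [
    ("chunk_hyde", "parent_query_expansion"),
    ("chunk_parent_child", "parent_structured_retrieval"),
    ("chunk_reranking", "parent_reranking"),
]

def find_label_mismatches(labels, chunk_to_parent):
    """Return the label pairs whose chunk does not belong to the labelled parent."""
    return [
        (chunk_id, parent_id)
        for chunk_id, parent_id in labels
        if chunk_to_parent.get(chunk_id) != parent_id
    ]

mismatches = find_label_mismatches(benchmark_labels, chunk_to_parent)
print(mismatches)  # an empty list means every label pair is consistent
```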

3. Child Retrieval vs Parent-Child Retrieval

The first system returns only the top-2 child chunks. The second system retrieves the same child chunks and then expands each one to its parent section, keeping a parent only once when several chunks share it.

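# Fit TF-IDF on the child chunks only: children are the search index, parents are never scored directly.
vectorizer = TfidfVectorizer(stop_words="english")
child_matrix = vectorizer.fit_transform(child_df["text"])

def retrieve_child_chunks(question, top_k=2):
    """Return the top_k (chunk_id, parent_id, text, similarity) tuples for a question."""
    query_vector = vectorizer.transform([question])
    similarities = cosine_similarity(query_vector, child_matrix)[0]
    # Rank every child chunk by cosine similarity, highest first.
    ranking = sorted(
        zip(child_df["chunk_id"], child_df["parent_id"], child_df["text"], similarities),
        key=lambda row: row[3],
        reverse=True,
    )
    return ranking[:top_k]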

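def child_only_context(question):
    """Baseline: the context is just the text of the retrieved child chunks."""
    retrieved = retrieve_child_chunks(question, top_k=2)
    return {
        "chunks": [chunk_id for chunk_id, _, _, _ in retrieved],
        # No expansion step, so this variant never returns a parent section.
        "parents": [],
        "context": [text for _, _, text, _ in retrieved],
    }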

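def parent_child_context(question):
    """Retrieve child chunks, then replace them with their parent sections as context."""
    retrieved = retrieve_child_chunks(question, top_k=2)
    parent_ids = []
    parent_context = []

    # Walk the chunks in rank order and add each parent once,
    # so two chunks from the same section do not duplicate its text.
    for _, parent_id, _, _ in retrieved:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
            parent_context.append(parent_lookup[parent_id]["text"])

    return {
        "chunks": [chunk_id for chunk_id, _, _, _ in retrieved],
        "parents": parent_ids,
        "context": parent_context,
    }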

parent_child_context("Which retrieval pattern keeps a precise match but expands back to a larger section?")

{'chunks': ['chunk_parent_child', 'chunk_hyde'],
 'parents': ['parent_structured_retrieval', 'parent_query_expansion'],
 'context': ['Parent-child retrieval preserves local relevance while expanding back to a larger section. RAPTOR organizes summaries hierarchically, and GraphRAG helps when answers depend on relationships and multi-hop reasoning.',
  "Query expansion improves weak retrieval by rewriting vague questions, generating multiple search queries, or using HyDE to embed a hypothetical answer. It helps when the user's wording does not match the corpus."]}

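The introduction describes generating the final answer from both local evidence and wider context, while the `context` returned above contains only the parent text. One way the two could be combined is sketched below. `build_prompt` and the prompt wording are hypothetical, and no LLM call is made; the example texts are copied from the data in Section 1.

```python
def build_prompt(question, child_texts, parent_texts):
    """Assemble a prompt that lists the matched child chunks first, then their parent sections."""
    evidence = "\n".join(f"- {text}" for text in child_texts)
    background = "\n\n".join(parent_texts)
    return (
        "Answer the question using the context below.\n\n"
        f"Matched passages:\n{evidence}\n\n"
        f"Surrounding sections:\n{background}\n\n"
        f"Question: {question}"
    )

question = "Which retrieval pattern keeps a precise match but expands back to a larger section?"
prompt = build_prompt(
    question,
    child_texts=[
        "Parent-child retrieval retrieves fine-grained chunks, then expands to the larger parent section.",
    ],
    parent_texts=[
        "Parent-child retrieval preserves local relevance while expanding back to a larger section. "
        "RAPTOR organizes summaries hierarchically, and GraphRAG helps when answers depend on "
        "relationships and multi-hop reasoning.",
    ],
)
print(prompt)
```

Listing the child chunk separately tells the generator which sentence triggered the match, while the parent section supplies the surrounding detail.
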
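4. Evaluate the Two Retrieval Modes

We score each variant on two binary metrics. chunk_hit is 1 when the labelled child chunk appears among the retrieved chunks, and parent_hit is 1 when the labelled parent section appears among the returned parents. Because child_only_context always returns an empty parents list, its parent_hit is 0 by construction: the metric measures whether the wider section reaches the final context, not how well the chunks are ranked.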

rows = []
for item in benchmark:
    # Run both variants on the same question so their rows are directly comparable.
    variants = {
        "child_only": child_only_context(item["question"]),
        "parent_child": parent_child_context(item["question"]),
    }

    for variant, result in variants.items():
        rows.append({
            "question": item["question"],
            "category": item["category"],
            "variant": variant,
            # 1 if the labelled chunk / parent is in what this variant returned, else 0.
            "chunk_hit": int(item["relevant_chunk"] in result["chunks"]),
            "parent_hit": int(item["relevant_parent"] in result["parents"]),
            "retrieved_chunks": result["chunks"],
            "retrieved_parents": result["parents"],
        })

evaluation_df = pd.DataFrame(rows)
evaluation_df

	question	category	variant	chunk_hit	parent_hit	retrieved_chunks	retrieved_parents
0	What helps when a user's wording does not matc...	query_expansion	child_only	1	0	[chunk_query_rewrite, chunk_hyde]	[]
1	What helps when a user's wording does not matc...	query_expansion	parent_child	1	1	[chunk_query_rewrite, chunk_hyde]	[parent_query_expansion]
2	Which retrieval pattern keeps a precise match ...	structured_retrieval	child_only	1	0	[chunk_parent_child, chunk_hyde]	[]
3	Which retrieval pattern keeps a precise match ...	structured_retrieval	parent_child	1	1	[chunk_parent_child, chunk_hyde]	[parent_structured_retrieval, parent_query_exp...
4	What should you do after the initial retrieval...	reranking	child_only	1	0	[chunk_reranking, chunk_hyde]	[]
5	What should you do after the initial retrieval...	reranking	parent_child	1	1	[chunk_reranking, chunk_hyde]	[parent_reranking, parent_query_expansion]
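
In the table above, the parent_child rows for the second and third questions return two parents, only one of which is labelled relevant. parent_hit does not penalize the extra section. A minimal sketch of a complementary precision measure follows; `parent_precision` is a hypothetical helper, and the inputs are copied from the parent_child rows (the truncated second row is filled in from the full output shown in Section 3).

```python
# Labelled parent and retrieved parents for the three parent_child rows above.
parent_child_results = [
    {
        "relevant_parent": "parent_query_expansion",
        "retrieved_parents": ["parent_query_expansion"],
    },
    {
        "relevant_parent": "parent_structured_retrieval",
        "retrieved_parents": ["parent_structured_retrieval", "parent_query_expansion"],
    },
    {
        "relevant_parent": "parent_reranking",
        "retrieved_parents": ["parent_reranking", "parent_query_expansion"],
    },
]

def parent_precision(relevant_parent, retrieved_parents):
    """Fraction of the retrieved parents that match the labelled parent."""
    if not retrieved_parents:
        return 0.0
    matches = sum(parent_id == relevant_parent for parent_id in retrieved_parents)
    return matches / len(retrieved_parents)

precisions = [
    parent_precision(row["relevant_parent"], row["retrieved_parents"])
    for row in parent_child_results
]
mean_precision = sum(precisions) / len(precisions)
print(precisions, round(mean_precision, 3))  # [1.0, 0.5, 0.5] 0.667
```

With top_k=2, the second-ranked chunk can pull in an unrelated parent, which is the cost side of the wider context.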

summary_df = (
    evaluation_df.groupby("variant")[["chunk_hit", "parent_hit"]]
    .mean()
    .sort_values(by=["parent_hit", "chunk_hit"], ascending=False)
)
summary_df

	chunk_hit	parent_hit
variant
parent_child	1.0	1.0
child_only	1.0	0.0
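
Both variants rank chunks with the same `retrieve_child_chunks` call, which is why chunk_hit is identical. The parent_hit gap comes from what each variant returns, not from ranking quality. The sketch below illustrates this by mapping the child_only chunks back to the parents they imply. `implied_parent_hit` is a hypothetical helper, and the data is copied from Section 1 and the evaluation table.

```python
# chunk_id -> parent_id, copied from the child_chunks list in Section 1.
chunk_to_parent = {
    "chunk_query_rewrite": "parent_query_expansion",
    "chunk_hyde": "parent_query_expansion",
    "chunk_reranking": "parent_reranking",
    "chunk_compression": "parent_reranking",
    "chunk_parent_child": "parent_structured_retrieval",
    "chunk_raptor": "parent_structured_retrieval",
}

# Labelled parent and retrieved chunks for the three child_only rows.
child_only_results = [
    {"relevant_parent": "parent_query_expansion",
     "retrieved_chunks": ["chunk_query_rewrite", "chunk_hyde"]},
    {"relevant_parent": "parent_structured_retrieval",
     "retrieved_chunks": ["chunk_parent_child", "chunk_hyde"]},
    {"relevant_parent": "parent_reranking",
     "retrieved_chunks": ["chunk_reranking", "chunk_hyde"]},
]

def implied_parent_hit(relevant_parent, retrieved_chunks, chunk_to_parent):
    """1 if any retrieved chunk belongs to the labelled parent, else 0."""
    implied_parents = {chunk_to_parent[chunk_id] for chunk_id in retrieved_chunks}
    return int(relevant_parent in implied_parents)

implied_hits = [
    implied_parent_hit(row["relevant_parent"], row["retrieved_chunks"], chunk_to_parent)
    for row in child_only_results
]
print(implied_hits)  # [1, 1, 1]
```

The child-only retriever already points at the right sections; the expansion step is what places their full text in the context.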

5. What This Notebook Teaches

Parent-child retrieval is a strong next step when flat chunk retrieval finds the right local passage but loses the broader section needed for complete answers.

Why it matters:

child chunks preserve local relevance
parent sections restore surrounding context
the pattern is much simpler than RAPTOR or GraphRAG

That makes parent-child retrieval a practical structured-retrieval upgrade to learn before moving on to heavier architectures.