Parent-Child Retrieval for RAG

Preserve Local Relevance Without Losing Document Context

Flat chunk retrieval often returns a useful sentence or paragraph but loses the broader section that makes the answer complete. Parent-child retrieval fixes that by:

  1. retrieving a small child chunk

  2. mapping it back to a larger parent section

  3. generating the final answer with both local evidence and wider context

Because it needs only a mapping from each chunk to its parent, it is a good structured-retrieval upgrade to learn before RAPTOR or GraphRAG.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

1. Parent Documents and Child Chunks

Each parent document represents a larger section. Each child chunk is a smaller passage indexed for retrieval.

parent_docs = [
    {
        "parent_id": "parent_query_expansion",
        "title": "Query expansion techniques",
        "text": "Query expansion improves weak retrieval by rewriting vague questions, generating multiple search queries, or using HyDE to embed a hypothetical answer. It helps when the user's wording does not match the corpus.",
    },
    {
        "parent_id": "parent_reranking",
        "title": "Reranking and context filtering",
        "text": "Reranking improves the ordering of retrieved passages after the first retrieval pass. Contextual compression removes noisy passages so the final context contains only the strongest evidence.",
    },
    {
        "parent_id": "parent_structured_retrieval",
        "title": "Structured retrieval for long documents",
        "text": "Parent-child retrieval preserves local relevance while expanding back to a larger section. RAPTOR organizes summaries hierarchically, and GraphRAG helps when answers depend on relationships and multi-hop reasoning.",
    },
]

child_chunks = [
    {
        "chunk_id": "chunk_query_rewrite",
        "parent_id": "parent_query_expansion",
        "text": "Query rewriting reformulates vague or incomplete user questions into clearer standalone search queries.",
    },
    {
        "chunk_id": "chunk_hyde",
        "parent_id": "parent_query_expansion",
        "text": "HyDE generates a hypothetical answer before retrieval to improve semantic alignment.",
    },
    {
        "chunk_id": "chunk_reranking",
        "parent_id": "parent_reranking",
        "text": "Reranking rescored candidate passages after initial retrieval to improve top-k quality.",
    },
    {
        "chunk_id": "chunk_compression",
        "parent_id": "parent_reranking",
        "text": "Contextual compression removes irrelevant text before generation.",
    },
    {
        "chunk_id": "chunk_parent_child",
        "parent_id": "parent_structured_retrieval",
        "text": "Parent-child retrieval retrieves fine-grained chunks, then expands to the larger parent section.",
    },
    {
        "chunk_id": "chunk_raptor",
        "parent_id": "parent_structured_retrieval",
        "text": "RAPTOR builds a tree of recursive summaries for long-document retrieval.",
    },
]

parent_df = pd.DataFrame(parent_docs)
child_df = pd.DataFrame(child_chunks)
parent_lookup = {row['parent_id']: row for row in parent_docs}

child_df

   chunk_id             parent_id                    text
0  chunk_query_rewrite  parent_query_expansion       Query rewriting reformulates vague or incomple...
1  chunk_hyde           parent_query_expansion       HyDE generates a hypothetical answer before re...
2  chunk_reranking      parent_reranking             Reranking rescored candidate passages after in...
3  chunk_compression    parent_reranking             Contextual compression removes irrelevant text...
4  chunk_parent_child   parent_structured_retrieval  Parent-child retrieval retrieves fine-grained ...
5  chunk_raptor         parent_structured_retrieval  RAPTOR builds a tree of recursive summaries fo...
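
The child chunks in this notebook are written by hand. In a real pipeline they would usually be derived from the parent text. The sketch below shows one way to do that while keeping the `parent_id` link. The helper name `split_into_children`, the `max_sentences` parameter, and the regex-based sentence split are assumptions made for illustration, not part of the notebook.

```python
import re

def split_into_children(parent, max_sentences=1):
    """Split one parent's text into child chunks that remember their parent_id."""
    # Naive split after ., ! or ? followed by whitespace. This is an assumption:
    # real corpora usually need a proper sentence or token-based splitter.
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", parent["text"])
        if s.strip()
    ]
    children = []
    for start in range(0, len(sentences), max_sentences):
        children.append({
            # Hypothetical id scheme: parent id plus a running chunk index.
            "chunk_id": f"{parent['parent_id']}_chunk_{start // max_sentences}",
            "parent_id": parent["parent_id"],
            "text": " ".join(sentences[start:start + max_sentences]),
        })
    return children

example_parent = {
    "parent_id": "parent_reranking",
    "title": "Reranking and context filtering",
    "text": (
        "Reranking improves the ordering of retrieved passages after the first "
        "retrieval pass. Contextual compression removes noisy passages so the "
        "final context contains only the strongest evidence."
    ),
}
derived_children = split_into_children(example_parent)
```

Each derived chunk carries the same `parent_id` field that `parent_lookup` uses, so it could be indexed and expanded in the same way as the hand-written chunks.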

2. Benchmark Questions

These questions are designed so the answer benefits from retrieving a precise child chunk while still needing the broader parent section for complete context.

benchmark = [
    {
        "question": "What helps when a user's wording does not match the corpus?",
        "relevant_chunk": "chunk_hyde",
        "relevant_parent": "parent_query_expansion",
        "category": "query_expansion",
    },
    {
        "question": "Which retrieval pattern keeps a precise match but expands back to a larger section?",
        "relevant_chunk": "chunk_parent_child",
        "relevant_parent": "parent_structured_retrieval",
        "category": "structured_retrieval",
    },
    {
        "question": "What should you do after the initial retrieval pass to improve top results?",
        "relevant_chunk": "chunk_reranking",
        "relevant_parent": "parent_reranking",
        "category": "reranking",
    },
]

pd.DataFrame(benchmark)

   question                                           relevant_chunk      relevant_parent              category
0  What helps when a user's wording does not matc...  chunk_hyde          parent_query_expansion       query_expansion
1  Which retrieval pattern keeps a precise match ...  chunk_parent_child  parent_structured_retrieval  structured_retrieval
2  What should you do after the initial retrieval...  chunk_reranking     parent_reranking             reranking

3. Child Retrieval vs Parent-Child Retrieval

The first system returns only the top-ranked child chunks (top_k=2 here). The second system retrieves the same child chunks, then replaces them with their parent sections, keeping each parent only once.

vectorizer = TfidfVectorizer(stop_words="english")
child_matrix = vectorizer.fit_transform(child_df["text"])

def retrieve_child_chunks(question, top_k=2):
    """Rank child chunks by TF-IDF cosine similarity to the question."""
    query_vector = vectorizer.transform([question])
    similarities = cosine_similarity(query_vector, child_matrix)[0]
    # Each row is (chunk_id, parent_id, text, similarity), sorted highest score first.
    ranking = sorted(
        zip(child_df["chunk_id"], child_df["parent_id"], child_df["text"], similarities),
        key=lambda row: row[3],
        reverse=True,
    )
    return ranking[:top_k]

def child_only_context(question):
    retrieved = retrieve_child_chunks(question, top_k=2)
    return {
        "chunks": [chunk_id for chunk_id, _, _, _ in retrieved],
        "parents": [],
        "context": [text for _, _, text, _ in retrieved],
    }

def parent_child_context(question):
    """Retrieve child chunks, then expand each one to its parent section."""
    retrieved = retrieve_child_chunks(question, top_k=2)
    parent_ids = []
    parent_context = []

    # Keep each parent once, in the rank order of its best-scoring child.
    for _, parent_id, _, _ in retrieved:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
            parent_context.append(parent_lookup[parent_id]["text"])

    return {
        "chunks": [chunk_id for chunk_id, _, _, _ in retrieved],
        "parents": parent_ids,
        "context": parent_context,
    }

parent_child_context("Which retrieval pattern keeps a precise match but expands back to a larger section?")
{'chunks': ['chunk_parent_child', 'chunk_hyde'],
 'parents': ['parent_structured_retrieval', 'parent_query_expansion'],
 'context': ['Parent-child retrieval preserves local relevance while expanding back to a larger section. RAPTOR organizes summaries hierarchically, and GraphRAG helps when answers depend on relationships and multi-hop reasoning.',
  "Query expansion improves weak retrieval by rewriting vague questions, generating multiple search queries, or using HyDE to embed a hypothetical answer. It helps when the user's wording does not match the corpus."]}
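
The `parent_child_context` function returns only the parent text, while the introduction describes answering with both local evidence and wider context. The sketch below shows one possible way to pass both to a generator: it groups the matched child passages under their parent section. The function name `build_combined_context`, the output layout, and the example scores are hypothetical. The input tuples follow the `(chunk_id, parent_id, text, similarity)` shape returned by `retrieve_child_chunks`.

```python
def build_combined_context(retrieved, parent_lookup):
    """Group matched child passages under the text of their parent section."""
    grouped = {}  # parent_id -> matched child texts, in rank order
    for _chunk_id, parent_id, text, _score in retrieved:
        grouped.setdefault(parent_id, []).append(text)

    blocks = []
    for parent_id, child_texts in grouped.items():
        parent = parent_lookup[parent_id]
        matched = "\n".join(f"- {t}" for t in child_texts)
        blocks.append(
            f"Section: {parent['title']}\n"
            f"Matched passages:\n{matched}\n"
            f"Full section:\n{parent['text']}"
        )
    return "\n\n".join(blocks)

# Toy input; the similarity scores here are made up for illustration.
retrieved_example = [
    ("chunk_reranking", "parent_reranking",
     "Reranking rescored candidate passages after initial retrieval to improve top-k quality.", 0.5),
    ("chunk_compression", "parent_reranking",
     "Contextual compression removes irrelevant text before generation.", 0.2),
]
parent_lookup_example = {
    "parent_reranking": {
        "title": "Reranking and context filtering",
        "text": "Reranking improves the ordering of retrieved passages after the first retrieval pass.",
    }
}
combined = build_combined_context(retrieved_example, parent_lookup_example)
```

Two chunks from the same parent produce a single section block, so the parent text is not repeated in the prompt.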

4. Evaluate the Two Retrieval Modes

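We evaluate both chunk-level retrieval and parent-level coverage. chunk_hit records whether the expected child chunk appears in the top-k results, and parent_hit records whether the expected parent section is part of the returned context. Both variants share the same child retriever, so chunk_hit is identical across them. The child_only variant never returns parent sections, so its parent_hit is 0 by construction.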

rows = []
for item in benchmark:
    child_only = child_only_context(item["question"])
    parent_child = parent_child_context(item["question"])

    rows.append({
        "question": item["question"],
        "category": item["category"],
        "variant": "child_only",
        "chunk_hit": int(item["relevant_chunk"] in child_only["chunks"]),
        "parent_hit": int(item["relevant_parent"] in child_only["parents"]),
        "retrieved_chunks": child_only["chunks"],
        "retrieved_parents": child_only["parents"],
    })

    rows.append({
        "question": item["question"],
        "category": item["category"],
        "variant": "parent_child",
        "chunk_hit": int(item["relevant_chunk"] in parent_child["chunks"]),
        "parent_hit": int(item["relevant_parent"] in parent_child["parents"]),
        "retrieved_chunks": parent_child["chunks"],
        "retrieved_parents": parent_child["parents"],
    })

evaluation_df = pd.DataFrame(rows)
evaluation_df

   question                                           category              variant       chunk_hit  parent_hit  retrieved_chunks                   retrieved_parents
0  What helps when a user's wording does not matc...  query_expansion       child_only    1          0           [chunk_query_rewrite, chunk_hyde]  []
1  What helps when a user's wording does not matc...  query_expansion       parent_child  1          1           [chunk_query_rewrite, chunk_hyde]  [parent_query_expansion]
2  Which retrieval pattern keeps a precise match ...  structured_retrieval  child_only    1          0           [chunk_parent_child, chunk_hyde]   []
3  Which retrieval pattern keeps a precise match ...  structured_retrieval  parent_child  1          1           [chunk_parent_child, chunk_hyde]   [parent_structured_retrieval, parent_query_exp...
4  What should you do after the initial retrieval...  reranking             child_only    1          0           [chunk_reranking, chunk_hyde]      []
5  What should you do after the initial retrieval...  reranking             parent_child  1          1           [chunk_reranking, chunk_hyde]      [parent_reranking, parent_query_expansion]

summary_df = (
    evaluation_df.groupby("variant")[["chunk_hit", "parent_hit"]]
    .mean()
    .sort_values(by=["parent_hit", "chunk_hit"], ascending=False)
)
summary_df

              chunk_hit  parent_hit
variant
parent_child        1.0         1.0
child_only          1.0         0.0
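
One caveat when reading chunk_hit: Python's `sorted` is stable, so chunks tied at the same similarity keep their input order, and a chunk with zero similarity can still fill a top-k slot. In this toy corpus the first benchmark question appears to share no indexed term with chunk_hyde, so that hit may come from tie order rather than from similarity. The sketch below shows one way to guard against this with a minimum score. The helper `top_k_with_min_score`, its `min_score` parameter, and the example scores are assumptions for illustration, not values computed by the notebook.

```python
def top_k_with_min_score(scored_chunks, top_k=2, min_score=0.0):
    """Return up to top_k (chunk_id, score) pairs whose score exceeds min_score."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [(chunk_id, score) for chunk_id, score in ranked[:top_k] if score > min_score]

# Hypothetical similarity scores: only the first chunk overlaps with the query.
scored = [
    ("chunk_query_rewrite", 0.31),
    ("chunk_hyde", 0.0),
    ("chunk_reranking", 0.0),
]

# Plain top-2 keeps a zero-score chunk purely because of its input position.
plain_top2 = sorted(scored, key=lambda pair: pair[1], reverse=True)[:2]

# The filtered variant drops results that have no measurable similarity.
filtered_top2 = top_k_with_min_score(scored, top_k=2)
```

With a filter like this, a zero-similarity chunk would no longer count as a hit, which gives a stricter reading of the chunk_hit column.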

5. What This Notebook Teaches

Parent-child retrieval is a strong next step when flat chunk retrieval finds the right local passage but loses the broader section needed for complete answers.

Why it matters:

  • child chunks preserve local relevance

  • parent sections restore surrounding context

  • the pattern is much simpler than RAPTOR or GraphRAG

That makes parent-child retrieval one of the best structured-retrieval upgrades to learn before moving into heavier architectures.