Challenges: RAG Systems

Hands-on challenges to master Retrieval-Augmented Generation

🚀 Challenge 1: The Chunking Optimization Game

Difficulty: ⭐⭐ Beginner-Intermediate
Time: 45-60 minutes
Concepts: Text chunking, retrieval accuracy, semantic boundaries

The Problem

Chunking is critical for RAG - bad chunks = bad retrieval = bad answers. Find the optimal chunking strategy!

Your Task

  1. Take a long technical document (e.g., Python documentation, research paper)

  2. Create 10 test questions that require specific passages

  3. Try 5 different chunking strategies:

    • Fixed size (256, 512, 1024 tokens)

    • Sentence-based

    • Paragraph-based

    • Semantic (embeddings-based)

    • Hierarchical (sections → paragraphs → sentences)

  4. Measure which strategy retrieves the right passages most often
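The strategies in step 3 can start from something as simple as this fixed-size chunker; whitespace tokens stand in for real tokenizer tokens here, and the `chunk_size`/`overlap` defaults are illustrative:

```python
def chunk_fixed(text, chunk_size=256, overlap=32):
    """Split text into fixed-size chunks of whitespace tokens, with overlap."""
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Swap `text.split()` for a real tokenizer (e.g. tiktoken) when you measure, so chunk sizes match what the embedding model actually sees.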

Evaluation Metrics

# For each question, check if the correct passage appears in the top-3 results
hit_rate = correct_chunks_retrieved / total_questions

# Mean reciprocal rank (MRR): average of 1/rank of the correct chunk
mrr = mean([1 / rank for rank in chunk_positions])
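The snippet above is pseudocode; a runnable version, assuming you record the 1-based rank of the correct chunk for each question (or None if it was never retrieved), might look like:

```python
def evaluate_retrieval(ranks, k=3):
    """Compute hit rate@k and MRR from 1-based ranks (None = not retrieved)."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    hit_rate = hits / len(ranks)
    # Reciprocal rank is 0 when the correct chunk was never retrieved
    mrr = sum(1 / r if r is not None else 0.0 for r in ranks) / len(ranks)
    return hit_rate, mrr
```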

Success Criteria

  • Test all 5 chunking methods

  • Create visualization comparing methods

  • Identify when each method works best

  • Provide recommendations

💡 Hint Different content types need different strategies:

  • Code documentation: semantic chunking works well

  • Narrative text: paragraph-based is often good

  • Q&A: sentence-based can work

🚀 Challenge 2: Query Expansion Techniques

Difficulty: ⭐⭐⭐ Intermediate
Time: 1-2 hours
Concepts: Query understanding, multi-query retrieval, HyDE

The Problem

User queries are often vague or poorly worded. Expand them to improve retrieval!

Your Task

Implement 3 query expansion techniques:

Technique 1: Multi-Query Generation

# Original: "How to use python lists?"
# Expanded:
# - "Python list operations tutorial"
# - "Add items to Python list"
# - "List methods in Python"
# - "Python array vs list"
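Once you have variants like these, you retrieve for each and merge the ranked lists. A common merge is reciprocal rank fusion; the sketch below assumes a `retrieve(query)` function that returns a ranked list of document IDs:

```python
from collections import defaultdict

def fuse_multi_query(queries, retrieve, k=60):
    """Merge ranked lists from several query variants via reciprocal rank fusion."""
    scores = defaultdict(float)
    for q in queries:
        for rank, doc_id in enumerate(retrieve(q), start=1):
            # Documents retrieved by several variants, or at high ranks, score higher
            scores[doc_id] += 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The retrieval calls for each variant are independent, so they can run in parallel (the hint below this challenge makes the same point).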

Technique 2: Hypothetical Document Embeddings (HyDE)

# Original query: "What causes climate change?"
# Generate hypothetical answer, then search for it:
generated_answer = llm("Write a detailed answer about climate change causes...")
search_embedding = embed(generated_answer)

Technique 3: Query Decomposition

# Complex: "Compare Python and JavaScript for web development"
# Decompose:
# - "Python for web development features"
# - "JavaScript for web development features"
# - "Python vs JavaScript comparison"

Comparison Task

  • Test on 20 diverse questions

  • Compare retrieval accuracy for each method

  • Analyze latency and cost tradeoffs

  • Identify best use cases

💡 Hint Multi-query can be parallelized for speed. HyDE works great when you know the answer format. Query decomposition is powerful for complex questions.

🚀 Challenge 3: The Hallucination Hunter

Difficulty: ⭐⭐⭐⭐ Advanced
Time: 2-3 hours
Concepts: Faithfulness, fact verification, hallucination detection

The Problem

LLMs sometimes "hallucinate", generating plausible-sounding but incorrect information. Catch them!

Your Task

Build a hallucination detection system:

  1. Faithfulness Scoring

    • Check if answer is supported by retrieved context

    • Use entailment model or LLM-as-judge

    • Score 0-1 for how well grounded the answer is

  2. Citation Verification

    • Extract claims from answer

    • Verify each claim against source documents

    • Flag unsupported claims

  3. Confidence Calibration

    • Estimate answer confidence

    • Compare with actual correctness

    • Calibrate model to be more honest

Implementation

class HallucinationDetector:
    def check_faithfulness(self, answer, context):
        """Score how well answer is supported by context."""
        # TODO: Implement
        pass
    
    def verify_citations(self, answer, sources):
        """Verify each claim in answer."""
        claims = self.extract_claims(answer)
        verified = []
        for claim in claims:
            is_supported = self.verify_claim(claim, sources)
            verified.append({
                "claim": claim,
                "supported": is_supported,
                "confidence": ...
            })
        return verified
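Before wiring in an entailment model for `check_faithfulness`, a naive lexical-overlap baseline is useful for sanity checks. This is an illustrative stand-in, not a substitute for NLI: it only measures word overlap, not logical support:

```python
def lexical_faithfulness(answer, context):
    """Fraction of answer content words that also appear in the context (0-1)."""
    stop = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}
    answer_words = {w.lower().strip(".,") for w in answer.split()} - stop
    context_words = {w.lower().strip(".,") for w in context.split()}
    if not answer_words:
        return 1.0  # nothing to verify
    return len(answer_words & context_words) / len(answer_words)
```

A real faithfulness scorer should replace this with entailment checks per claim, but the baseline gives you something to compare against.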

Test Dataset

Create 30 questions with known hallucination triggers:

  • Questions outside knowledge base

  • Ambiguous questions

  • Questions with conflicting information

  • Questions requiring calculation/reasoning

💡 Hint Use entailment (NLI) models such as "microsoft/deberta-v3-large" fine-tuned for NLI. Compare multiple answer generations: consistent answers are more likely correct. Prompt engineering helps too: "Only answer if you're certain. Otherwise say 'I don't know.'"

🚀 Challenge 4: Conversational RAG

Difficulty: ⭐⭐⭐⭐ Advanced
Time: 3-4 hours
Concepts: Dialogue management, context tracking, memory

The Problem

Most RAG systems handle single questions. Build one that handles multi-turn conversations!

Your Task

Handle conversation like this:

User: "What are the benefits of Python?"
Bot: "Python offers readability, extensive libraries..." [uses RAG]

User: "What about performance?"  # Implicit: Python performance
Bot: "Python is slower than compiled languages..." [understands context]

User: "Compare it to Java"  # Implicit: Python vs Java performance
Bot: "Java is generally faster because..." [maintains full context]

Requirements

  • Track conversation history

  • Rewrite queries with context (coreference resolution)

  • Maintain entity tracking

  • Handle follow-up questions

  • Know when to retrieve vs use previous context

  • Manage token budget (conversation history grows!)

Conversation Management

class ConversationalRAG:
    def __init__(self):
        self.conversation_history = []
        self.entity_tracker = {}
    
    def rewrite_query_with_context(self, current_query, history):
        """Rewrite query to be standalone using conversation context."""
        # "What about performance?" → "What about Python performance?"
        pass
    
    def should_retrieve(self, query, history):
        """Decide if we need new retrieval or can use context."""
        # Avoid unnecessary retrievals for clarification questions
        pass
    
    def chat(self, user_message):
        # Rewrite query
        # Retrieve if needed
        # Generate with conversation context
        # Update history
        pass

💡 Hint Use an LLM to rewrite queries: "Given the conversation history, rewrite this query to be standalone." Keep a sliding window of the last N turns to manage tokens. Detect whether a query is a clarification or a new topic.
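One concrete piece of the requirements above, managing the token budget with a sliding window of recent turns, might look like this; whitespace tokens approximate real tokens, and the `{"text": ...}` turn format is an assumption:

```python
def window_history(history, max_tokens=1000):
    """Keep the most recent turns whose combined (approximate) token count fits."""
    kept, total = [], 0
    for turn in reversed(history):  # walk newest-first
        n = len(turn["text"].split())
        if total + n > max_tokens:
            break
        kept.append(turn)
        total += n
    return list(reversed(kept))  # restore chronological order
```

In production you would count tokens with the model's tokenizer and possibly summarize the dropped turns instead of discarding them.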

🚀 Challenge 5: Multi-Modal RAG

Difficulty: ⭐⭐⭐⭐⭐ Expert
Time: 4-6 hours
Concepts: Multi-modal embeddings, vision-language models, hybrid retrieval

The Problem

Real documents have images, tables, charts - not just text. Build RAG that handles it all!

Your Task

Build a system that processes:

  • Text: Standard RAG

  • Images: Visual search with CLIP

  • Tables: Structured data retrieval

  • Diagrams: Caption extraction + visual search

  • Code: Syntax-aware chunking

Example Use Case: Technical Documentation

User: "Show me the architecture diagram and explain the components"

System should:
1. Retrieve relevant diagram (image similarity)
2. Extract/generate diagram description
3. Retrieve text about components
4. Combine image + text in answer

💡 Hint Start with caption generation and text retrieval before attempting end-to-end multi-modal embeddings.

🚀 Challenge 6: Corrective RAG Loop

Difficulty: ⭐⭐⭐⭐⭐ Expert
Time: 3-5 hours
Concepts: CRAG, retrieval grading, retry policies, abstention

The Problem

Many RAG failures are not generation failures. They are retrieval failures that should have been caught before the model answered.

Your Task

Build a corrective loop that evaluates retrieval quality before the final answer is generated.

Requirements

  • Grade retrieved evidence for relevance and coverage

  • Retry with a rewritten query if retrieval quality is weak

  • Compress or filter noisy chunks before generation

  • Abstain when no trustworthy evidence is found

  • Log which step fixed the failure, if any

Suggested Pipeline

def corrective_rag(query):
    candidates = retrieve(query)
    grade = grade_retrieval(query, candidates)

    if grade < 0.5:
        better_query = rewrite_query(query)
        candidates = retrieve(better_query)
        grade = grade_retrieval(better_query, candidates)

    if grade < 0.5:
        return {"answer": "I don't have enough reliable evidence.", "status": "abstain"}

    context = compress_context(query, candidates)
    return {"answer": generate_answer(query, context), "status": "ok"}

Success Criteria

  • Retrieval failures are explicitly detected

  • Retry logic improves at least some failed cases

  • Unsupported questions do not produce confident hallucinations

  • You can show before/after examples from a failure set

💡 Hint Keep the grading simple first: use a small rubric for topical relevance, evidence coverage, and answerability.
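A minimal rubric grader along those lines might look like this; `judge(question, context)` is an assumed yes/no helper standing in for an LLM-as-judge call:

```python
def grade_retrieval(query, chunks, judge):
    """Average three yes/no rubric checks into a 0-1 retrieval grade."""
    context = "\n".join(chunks)
    checks = [
        f"Is this context on the same topic as the question '{query}'?",
        f"Does this context contain evidence needed to answer '{query}'?",
        f"Could '{query}' be answered from this context alone?",
    ]
    votes = [1 if judge(q, context) else 0 for q in checks]
    return sum(votes) / len(checks)
```

With three binary checks the grade lands on 0, 1/3, 2/3, or 1, which maps cleanly onto the 0.5 threshold in the pipeline above.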

🚀 Challenge 7: Hierarchical or Graph Retrieval

Difficulty: ⭐⭐⭐⭐⭐ Expert
Time: 4-6 hours
Concepts: RAPTOR, parent-child retrieval, GraphRAG, multi-hop reasoning

The Problem

Flat chunk retrieval breaks down when the answer is spread across sections, entities, or long reports.

Your Task

Implement one structured retrieval approach:

Option A: Parent-Child / Hierarchical Retrieval

  • Retrieve fine-grained chunks

  • Expand to their parent section or source document

  • Generate the final answer using both local evidence and larger context

Option B: RAPTOR-style Summarization Tree

  • Create chunk summaries recursively

  • Retrieve from summaries first, then drill down to leaves

  • Compare quality and latency against flat retrieval

Option C: GraphRAG Prototype

  • Extract entities and relations from documents

  • Build a lightweight graph

  • Retrieve by entity neighborhood plus semantic search

Success Criteria

  • Show at least 10 questions that require cross-section reasoning

  • Compare flat retrieval vs. your structured approach

  • Explain where the structured approach helps and where it adds overhead

  • Include failure cases, not just wins

💡 Hint If full GraphRAG is too heavy, parent-child retrieval is the best structured upgrade to implement first.
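The core of Option A can be prototyped in a few lines: retrieve fine-grained chunks, then expand each hit to its parent section and deduplicate. This assumes you build a chunk-to-parent mapping at index time:

```python
def expand_to_parents(retrieved_chunk_ids, chunk_to_parent, parent_texts):
    """Expand fine-grained hits to their parent sections, preserving hit order."""
    seen, parents = set(), []
    for cid in retrieved_chunk_ids:
        pid = chunk_to_parent[cid]
        if pid not in seen:  # deduplicate: many chunks share one parent
            seen.add(pid)
            parents.append(parent_texts[pid])
    return parents
```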

Multi-Modal RAG: Implementation Components

  1. Multi-Modal Embeddings:

    • Text: sentence-transformers

    • Images: CLIP

    • Tables: Table-specific embedders

  2. Hybrid Retrieval:

    • Combine results from different modalities

    • Weight by relevance and modality type

  3. Multi-Modal Generation:

    • GPT-4 Vision for image understanding

    • Generate answers referencing both text and images
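The hybrid-retrieval weighting step above can be sketched as a linear score fusion; the modality weights are illustrative knobs to tune, not established values:

```python
def merge_modalities(results_by_modality, weights):
    """Combine per-modality (doc_id, score) lists into one weighted ranking."""
    combined = {}
    for modality, results in results_by_modality.items():
        w = weights.get(modality, 1.0)  # default weight when untuned
        for doc_id, score in results:
            combined[doc_id] = combined.get(doc_id, 0.0) + w * score
    return sorted(combined, key=combined.get, reverse=True)
```

Note that raw scores from different embedding spaces are not directly comparable, so normalize per modality (or use rank-based fusion) before weighting.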

Multi-Modal RAG: Success Criteria

  • Process PDFs with images/tables

  • Retrieve relevant visuals for queries

  • Generate answers combining modalities

  • Handle queries like "show me", "diagram of", "table showing"

💡 Hint Use GPT-4 Vision or LLaVA for image understanding and CLIP for image-text similarity. Keep separate vector stores per modality, then merge results.

πŸ† Meta Challenge: RAG Optimization CompetitionΒΆ

Difficulty: ⭐⭐⭐⭐⭐ Expert
Time: 8-12 hours
Concepts: End-to-end optimization, systematic evaluation

The Ultimate Challenge

Build the best RAG system for a specific domain and prove it!

Competition Format

  1. Choose Domain: Medical, legal, technical docs, customer support, etc.

  2. Build System: Full RAG pipeline

  3. Create Benchmark: 100+ test questions with ground truth

  4. Optimize Everything:

    • Chunking strategy

    • Embedding model

    • Retrieval method

    • Re-ranking

    • Generation prompts

    • Cost/latency tradeoffs

Leaderboard Metrics

  • Accuracy: % of correct answers

  • Faithfulness: % of answers supported by context

  • Latency: Average response time

  • Cost: $ per 1000 queries

  • User Satisfaction: Human evaluation (1-5)
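The cost metric can be estimated from average token counts per query; the per-1k-token prices below are placeholders you should replace with your provider's actual rates:

```python
def cost_per_1000_queries(avg_input_tokens, avg_output_tokens,
                          price_in_per_1k=0.01, price_out_per_1k=0.03):
    """Estimate $ per 1000 queries from average token usage (placeholder prices)."""
    per_query = (avg_input_tokens / 1000) * price_in_per_1k \
              + (avg_output_tokens / 1000) * price_out_per_1k
    return per_query * 1000
```

Remember that RAG inflates input tokens substantially, since retrieved context is prepended to every prompt.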

Deliverables

  • Complete RAG system (code)

  • Benchmark dataset (questions + answers)

  • Evaluation results (metrics + analysis)

  • Technical report (methodology + findings)

  • Demo (Gradio/Streamlit app)

Bonus Points

  • Open-source your solution

  • Deploy publicly

  • Write blog post about optimizations

  • Beat baseline by >20% accuracy

📊 Challenge Progress Tracker

  • Challenge 1: Chunking Optimization

  • Challenge 2: Query Expansion

  • Challenge 3: Hallucination Hunter

  • Challenge 4: Conversational RAG

  • Challenge 5: Multi-Modal RAG

  • Challenge 6: Corrective RAG Loop

  • Challenge 7: Hierarchical or Graph Retrieval

  • Meta Challenge: RAG Optimization Competition

πŸ… Share Your WorkΒΆ

Post your challenge solutions:

  • GitHub: Share your repos

  • Discussions: Challenges Category

  • Blog: Write about your learnings

  • Twitter: Tag #ZeroToAI #RAGChallenge

💡 Tips for Success

  1. Start Simple: Get basic version working first

  2. Measure Everything: Metrics guide optimization

  3. Error Analysis: Study failures to improve

  4. Read Papers: Many techniques have research backing

  5. Use Tools: LangChain and LlamaIndex can speed things up

  6. Iterate: First version won't be perfect

📚 Helpful Resources

Happy building! πŸš€

Remember: RAG is about the journey of optimization, not just the destination!