Challenges: RAG Systems

Hands-on challenges to master Retrieval-Augmented Generation

🚀 Challenge 1: The Chunking Optimization Game

Difficulty: ⭐⭐ Beginner-Intermediate
Time: 45-60 minutes
Concepts: Text chunking, retrieval accuracy, semantic boundaries

The Problem

Chunking is critical for RAG - bad chunks = bad retrieval = bad answers. Find the optimal chunking strategy!

Your Task

  1. Take a long technical document (e.g., Python documentation, research paper)

  2. Create 10 test questions that require specific passages

  3. Try 5 different chunking strategies:

    • Fixed size (256, 512, 1024 tokens)

    • Sentence-based

    • Paragraph-based

    • Semantic (embeddings-based)

    • Hierarchical (sections → paragraphs → sentences)

  4. Measure which strategy retrieves the right passages most often
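The strategies in step 3 can start from something as simple as this fixed-size chunker; whitespace tokens stand in for real tokenizer tokens here, and the `chunk_size`/`overlap` defaults are illustrative:

```python
def chunk_fixed(text, chunk_size=256, overlap=32):
    """Split text into fixed-size chunks of whitespace tokens, with overlap."""
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Swap `text.split()` for a real tokenizer (e.g. tiktoken) when you measure, so chunk sizes match what the embedding model actually sees.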

Evaluation Metrics

# For each question, check if the correct passage appears in the top-3 results
hit_rate = correct_chunks_retrieved / total_questions

# Mean reciprocal rank (MRR): average of 1/rank of the correct chunk
mrr = mean([1 / rank for rank in chunk_positions])
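The snippet above is pseudocode; a runnable version, assuming you record the 1-based rank of the correct chunk for each question (or None if it was never retrieved), might look like:

```python
def evaluate_retrieval(ranks, k=3):
    """Compute hit rate@k and MRR from 1-based ranks (None = not retrieved)."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    hit_rate = hits / len(ranks)
    # Reciprocal rank is 0 when the correct chunk was never retrieved
    mrr = sum(1 / r if r is not None else 0.0 for r in ranks) / len(ranks)
    return hit_rate, mrr
```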

Success Criteria

  • Test all 5 chunking methods

  • Create visualization comparing methods

  • Identify when each method works best

  • Provide recommendations

💡 Hint Different content types need different strategies:

  • Code documentation: semantic chunking works well

  • Narrative text: paragraph-based is often good

  • Q&A: sentence-based can work

🚀 Challenge 2: Query Expansion Techniques

Difficulty: ⭐⭐⭐ Intermediate
Time: 1-2 hours
Concepts: Query understanding, multi-query retrieval, HyDE

The Problem

User queries are often vague or poorly worded. Expand them to improve retrieval!

Your Task

Implement 3 query expansion techniques:

Technique 1: Multi-Query Generation

# Original: "How to use python lists?"
# Expanded:
# - "Python list operations tutorial"
# - "Add items to Python list"
# - "List methods in Python"
# - "Python array vs list"
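Once you have variants like these, you retrieve for each and merge the ranked lists. A common merge is reciprocal rank fusion; the sketch below assumes a `retrieve(query)` function that returns a ranked list of document IDs:

```python
from collections import defaultdict

def fuse_multi_query(queries, retrieve, k=60):
    """Merge ranked lists from several query variants via reciprocal rank fusion."""
    scores = defaultdict(float)
    for q in queries:
        for rank, doc_id in enumerate(retrieve(q), start=1):
            # Documents retrieved by several variants, or at high ranks, score higher
            scores[doc_id] += 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The retrieval calls for each variant are independent, so they can run in parallel (the hint below this challenge makes the same point).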

Technique 2: Hypothetical Document Embeddings (HyDE)

# Original query: "What causes climate change?"
# Generate hypothetical answer, then search for it:
generated_answer = llm("Write a detailed answer about climate change causes...")
search_embedding = embed(generated_answer)

Technique 3: Query Decomposition

# Complex: "Compare Python and JavaScript for web development"
# Decompose:
# - "Python for web development features"
# - "JavaScript for web development features"
# - "Python vs JavaScript comparison"

Comparison Task

  • Test on 20 diverse questions

  • Compare retrieval accuracy for each method

  • Analyze latency and cost tradeoffs

  • Identify best use cases

💡 Hint Multi-query can be parallelized for speed. HyDE works great when you know the answer format. Query decomposition is powerful for complex questions.

🚀 Challenge 3: The Hallucination Hunter

Difficulty: ⭐⭐⭐⭐ Advanced
Time: 2-3 hours
Concepts: Faithfulness, fact verification, hallucination detection

The Problem

LLMs sometimes "hallucinate", generating plausible-sounding but incorrect information. Catch them!

Your Task

Build a hallucination detection system:

  1. Faithfulness Scoring

    • Check if answer is supported by retrieved context

    • Use entailment model or LLM-as-judge

    • Score 0-1 for how well grounded the answer is

  2. Citation Verification

    • Extract claims from answer

    • Verify each claim against source documents

    • Flag unsupported claims

  3. Confidence Calibration

    • Estimate answer confidence

    • Compare with actual correctness

    • Calibrate model to be more honest

Implementation

class HallucinationDetector:
    def check_faithfulness(self, answer, context):
        """Score how well answer is supported by context."""
        # TODO: Implement
        pass
    
    def verify_citations(self, answer, sources):
        """Verify each claim in answer."""
        claims = self.extract_claims(answer)
        verified = []
        for claim in claims:
            is_supported = self.verify_claim(claim, sources)
            verified.append({
                "claim": claim,
                "supported": is_supported,
                "confidence": ...
            })
        return verified
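Before wiring in an entailment model for `check_faithfulness`, a naive lexical-overlap baseline is useful for sanity checks. This is an illustrative stand-in, not a substitute for NLI: it only measures word overlap, not logical support:

```python
def lexical_faithfulness(answer, context):
    """Fraction of answer content words that also appear in the context (0-1)."""
    stop = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}
    answer_words = {w.lower().strip(".,") for w in answer.split()} - stop
    context_words = {w.lower().strip(".,") for w in context.split()}
    if not answer_words:
        return 1.0  # nothing to verify
    return len(answer_words & context_words) / len(answer_words)
```

A real faithfulness scorer should replace this with entailment checks per claim, but the baseline gives you something to compare against.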

Test Dataset

Create 30 questions with known hallucination triggers:

  • Questions outside knowledge base

  • Ambiguous questions

  • Questions with conflicting information

  • Questions requiring calculation/reasoning

💡 Hint Use entailment (NLI) models such as "microsoft/deberta-v3-large" fine-tuned for NLI. Compare multiple answer generations: consistent answers are more likely correct. Prompt engineering helps too: "Only answer if you're certain. Otherwise say 'I don't know.'"

🚀 Challenge 4: Conversational RAG

Difficulty: ⭐⭐⭐⭐ Advanced
Time: 3-4 hours
Concepts: Dialogue management, context tracking, memory

The Problem

Most RAG systems handle single questions. Build one that handles multi-turn conversations!

Your Task

Handle conversation like this:

User: "What are the benefits of Python?"
Bot: "Python offers readability, extensive libraries..." [uses RAG]

User: "What about performance?"  # Implicit: Python performance
Bot: "Python is slower than compiled languages..." [understands context]

User: "Compare it to Java"  # Implicit: Python vs Java performance
Bot: "Java is generally faster because..." [maintains full context]

Requirements

  • Track conversation history

  • Rewrite queries with context (coreference resolution)

  • Maintain entity tracking

  • Handle follow-up questions

  • Know when to retrieve vs use previous context

  • Manage token budget (conversation history grows!)

Conversation Management

class ConversationalRAG:
    def __init__(self):
        self.conversation_history = []
        self.entity_tracker = {}
    
    def rewrite_query_with_context(self, current_query, history):
        """Rewrite query to be standalone using conversation context."""
        # "What about performance?" → "What about Python performance?"
        pass
    
    def should_retrieve(self, query, history):
        """Decide if we need new retrieval or can use context."""
        # Avoid unnecessary retrievals for clarification questions
        pass
    
    def chat(self, user_message):
        # Rewrite query
        # Retrieve if needed
        # Generate with conversation context
        # Update history
        pass

💡 Hint Use an LLM to rewrite queries: "Given the conversation history, rewrite this query to be standalone." Keep a sliding window of the last N turns to manage tokens. Detect whether a query is a clarification or a new topic.
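One concrete piece of the requirements above, managing the token budget with a sliding window of recent turns, might look like this; whitespace tokens approximate real tokens, and the `{"text": ...}` turn format is an assumption:

```python
def window_history(history, max_tokens=1000):
    """Keep the most recent turns whose combined (approximate) token count fits."""
    kept, total = [], 0
    for turn in reversed(history):  # walk newest-first
        n = len(turn["text"].split())
        if total + n > max_tokens:
            break
        kept.append(turn)
        total += n
    return list(reversed(kept))  # restore chronological order
```

In production you would count tokens with the model's tokenizer and possibly summarize the dropped turns instead of discarding them.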

🚀 Challenge 5: Multi-Modal RAG

Difficulty: ⭐⭐⭐⭐⭐ Expert
Time: 4-6 hours
Concepts: Multi-modal embeddings, vision-language models, hybrid retrieval

The Problem

Real documents have images, tables, charts - not just text. Build RAG that handles it all!

Your Task

Build a system that processes:

  • Text: Standard RAG

  • Images: Visual search with CLIP

  • Tables: Structured data retrieval

  • Diagrams: Caption extraction + visual search

  • Code: Syntax-aware chunking

Example Use Case: Technical Documentation

User: "Show me the architecture diagram and explain the components"

System should:
1. Retrieve relevant diagram (image similarity)
2. Extract/generate diagram description
3. Retrieve text about components
4. Combine image + text in answer

💡 Hint Start with caption generation and text retrieval before attempting end-to-end multi-modal embeddings.

🚀 Challenge 6: Corrective RAG Loop

Difficulty: ⭐⭐⭐⭐⭐ Expert
Time: 3-5 hours
Concepts: CRAG, retrieval grading, retry policies, abstention

The Problem

Many RAG failures are not generation failures. They are retrieval failures that should have been caught before the model answered.

Your Task

Build a corrective loop that evaluates retrieval quality before the final answer is generated.

Requirements

  • Grade retrieved evidence for relevance and coverage

  • Retry with a rewritten query if retrieval quality is weak

  • Compress or filter noisy chunks before generation

  • Abstain when no trustworthy evidence is found

  • Log which step fixed the failure, if any

Suggested Pipeline

def corrective_rag(query):
    candidates = retrieve(query)
    grade = grade_retrieval(query, candidates)

    if grade < 0.5:
        better_query = rewrite_query(query)
        candidates = retrieve(better_query)
        grade = grade_retrieval(better_query, candidates)

    if grade < 0.5:
        return {"answer": "I don't have enough reliable evidence.", "status": "abstain"}

    context = compress_context(query, candidates)
    return {"answer": generate_answer(query, context), "status": "ok"}

Success Criteria

  • Retrieval failures are explicitly detected

  • Retry logic improves at least some failed cases

  • Unsupported questions do not produce confident hallucinations

  • You can show before/after examples from a failure set

💡 Hint Keep the grading simple first: use a small rubric for topical relevance, evidence coverage, and answerability.
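A minimal rubric grader along those lines might look like this; `judge(question, context)` is an assumed yes/no helper standing in for an LLM-as-judge call:

```python
def grade_retrieval(query, chunks, judge):
    """Average three yes/no rubric checks into a 0-1 retrieval grade."""
    context = "\n".join(chunks)
    checks = [
        f"Is this context on the same topic as the question '{query}'?",
        f"Does this context contain evidence needed to answer '{query}'?",
        f"Could '{query}' be answered from this context alone?",
    ]
    votes = [1 if judge(q, context) else 0 for q in checks]
    return sum(votes) / len(checks)
```

With three binary checks the grade lands on 0, 1/3, 2/3, or 1, which maps cleanly onto the 0.5 threshold in the pipeline above.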

🚀 Challenge 7: Hierarchical or Graph Retrieval

Difficulty: ⭐⭐⭐⭐⭐ Expert
Time: 4-6 hours
Concepts: RAPTOR, parent-child retrieval, GraphRAG, multi-hop reasoning

The Problem

Flat chunk retrieval breaks down when the answer is spread across sections, entities, or long reports.

Your Task

Implement one structured retrieval approach:

Option A: Parent-Child / Hierarchical Retrieval

  • Retrieve fine-grained chunks

  • Expand to their parent section or source document

  • Generate the final answer using both local evidence and larger context

Option B: RAPTOR-style Summarization Tree

  • Create chunk summaries recursively

  • Retrieve from summaries first, then drill down to leaves

  • Compare quality and latency against flat retrieval

Option C: GraphRAG Prototype

  • Extract entities and relations from documents

  • Build a lightweight graph

  • Retrieve by entity neighborhood plus semantic search

Success Criteria

  • Show at least 10 questions that require cross-section reasoning

  • Compare flat retrieval vs. your structured approach

  • Explain where the structured approach helps and where it adds overhead

  • Include failure cases, not just wins

💡 Hint If full GraphRAG is too heavy, parent-child retrieval is the best structured upgrade to implement first.
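The core of Option A can be prototyped in a few lines: retrieve fine-grained chunks, then expand each hit to its parent section and deduplicate. This assumes you build a chunk-to-parent mapping at index time:

```python
def expand_to_parents(retrieved_chunk_ids, chunk_to_parent, parent_texts):
    """Expand fine-grained hits to their parent sections, preserving hit order."""
    seen, parents = set(), []
    for cid in retrieved_chunk_ids:
        pid = chunk_to_parent[cid]
        if pid not in seen:  # deduplicate: many chunks share one parent
            seen.add(pid)
            parents.append(parent_texts[pid])
    return parents
```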

Multi-Modal RAG: Implementation Components

  1. Multi-Modal Embeddings:

    • Text: sentence-transformers

    • Images: CLIP

    • Tables: Table-specific embedders

  2. Hybrid Retrieval:

    • Combine results from different modalities

    • Weight by relevance and modality type

  3. Multi-Modal Generation:

    • GPT-4 Vision for image understanding

    • Generate answers referencing both text and images
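The hybrid-retrieval weighting step above can be sketched as a linear score fusion; the modality weights are illustrative knobs to tune, not established values:

```python
def merge_modalities(results_by_modality, weights):
    """Combine per-modality (doc_id, score) lists into one weighted ranking."""
    combined = {}
    for modality, results in results_by_modality.items():
        w = weights.get(modality, 1.0)  # default weight when untuned
        for doc_id, score in results:
            combined[doc_id] = combined.get(doc_id, 0.0) + w * score
    return sorted(combined, key=combined.get, reverse=True)
```

Note that raw scores from different embedding spaces are not directly comparable, so normalize per modality (or use rank-based fusion) before weighting.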

Multi-Modal RAG: Success Criteria

  • Process PDFs with images/tables

  • Retrieve relevant visuals for queries

  • Generate answers combining modalities

  • Handle queries like "show me", "diagram of", "table showing"

💡 Hint Use GPT-4 Vision or LLaVA for image understanding and CLIP for image-text similarity. Keep separate vector stores per modality, then merge results.

πŸ† Meta Challenge: RAG Optimization CompetitionΒΆ

Difficulty: ⭐⭐⭐⭐⭐ Expert
Time: 8-12 hours
Concepts: End-to-end optimization, systematic evaluation

The Ultimate Challenge

Build the best RAG system for a specific domain and prove it!

Competition Format

  1. Choose Domain: Medical, legal, technical docs, customer support, etc.

  2. Build System: Full RAG pipeline

  3. Create Benchmark: 100+ test questions with ground truth

  4. Optimize Everything:

    • Chunking strategy

    • Embedding model

    • Retrieval method

    • Re-ranking

    • Generation prompts

    • Cost/latency tradeoffs

Leaderboard Metrics

  • Accuracy: % of correct answers

  • Faithfulness: % of answers supported by context

  • Latency: Average response time

  • Cost: $ per 1000 queries

  • User Satisfaction: Human evaluation (1-5)
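The cost metric can be estimated from average token counts per query; the per-1k-token prices below are placeholders you should replace with your provider's actual rates:

```python
def cost_per_1000_queries(avg_input_tokens, avg_output_tokens,
                          price_in_per_1k=0.01, price_out_per_1k=0.03):
    """Estimate $ per 1000 queries from average token usage (placeholder prices)."""
    per_query = (avg_input_tokens / 1000) * price_in_per_1k \
              + (avg_output_tokens / 1000) * price_out_per_1k
    return per_query * 1000
```

Remember that RAG inflates input tokens substantially, since retrieved context is prepended to every prompt.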

Deliverables

  • Complete RAG system (code)

  • Benchmark dataset (questions + answers)

  • Evaluation results (metrics + analysis)

  • Technical report (methodology + findings)

  • Demo (Gradio/Streamlit app)

Bonus Points

  • Open-source your solution

  • Deploy publicly

  • Write blog post about optimizations

  • Beat baseline by >20% accuracy

📊 Challenge Progress Tracker

  • Challenge 1: Chunking Optimization

  • Challenge 2: Query Expansion

  • Challenge 3: Hallucination Hunter

  • Challenge 4: Conversational RAG

  • Challenge 5: Multi-Modal RAG

  • Challenge 6: Corrective RAG Loop

  • Challenge 7: Hierarchical or Graph Retrieval

  • Meta Challenge: RAG Optimization Competition

πŸ… Share Your WorkΒΆ

Post your challenge solutions:

  • GitHub: Share your repos

  • Discussions: Challenges Category

  • Blog: Write about your learnings

  • Twitter: Tag #ZeroToAI #RAGChallenge

💡 Tips for Success

  1. Start Simple: Get basic version working first

  2. Measure Everything: Metrics guide optimization

  3. Error Analysis: Study failures to improve

  4. Read Papers: Many techniques have research backing

  5. Use Tools: LangChain and LlamaIndex can speed things up

  6. Iterate: First version won't be perfect

📚 Helpful Resources

Happy building! πŸš€

Remember: RAG is about the journey of optimization, not just the destination!