Evaluating RAG Quality
Your semantic search endpoint works. Users can find tasks related to "Docker deployment" even when those exact words do not appear in task descriptions. But how do you know the answers are actually correct?
A user asks: "How do I mark a task as complete?" Your RAG system retrieves three chunks and generates a response. The response sounds confident. The format looks professional. But is the information accurate? Does it actually come from the retrieved documents, or did the LLM hallucinate details that seem plausible but are wrong?
This is the evaluation problem every RAG system faces. Without systematic measurement, you are flying blind. You might ship a system that confidently delivers incorrect information, damaging user trust and creating support burden.
RAG evaluation is not optional polish. It is the discipline that separates production systems from demos.
The Four Evaluation Dimensions
RAG systems fail in four distinct ways. Each requires a different measurement approach.
+----------------------------------------------------------------------+
|                      RAG Evaluation Dimensions                        |
+----------------------------------------------------------------------+
|                                                                        |
|  1. CORRECTNESS                       2. RELEVANCE                     |
|     Does the answer match the            Does the answer address       |
|     reference answer?                    the question asked?           |
|                                                                        |
|  3. GROUNDEDNESS                      4. RETRIEVAL QUALITY             |
|     Is every claim supported by          Did the retriever find        |
|     the retrieved context?               the right documents?          |
|                                                                        |
+----------------------------------------------------------------------+
Dimension 1: Correctness
Does the response match the expected answer? This requires reference answers to compare against.
When a user asks "How do I create a new task?" and your ground truth says "Use POST /tasks with title and description," the system must generate something semantically equivalent. Not identical words necessarily, but the same actionable information.
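Before reaching for an LLM judge, a quick proxy for "semantically equivalent" is to compare embeddings of the response and the reference. The sketch below is a simplified check, assuming OpenAIEmbeddings from langchain_openai and an arbitrary 0.85 cutoff; the LLM-as-Judge evaluators later in this lesson handle nuance that a raw similarity score misses.

import math

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def roughly_equivalent(response: str, reference: str, threshold: float = 0.85) -> bool:
    """Crude correctness proxy: is the response embedding close to the reference?"""
    # The 0.85 threshold is an illustrative assumption, not a calibrated value.
    resp_vec = embeddings.embed_query(response)
    ref_vec = embeddings.embed_query(reference)
    return cosine_similarity(resp_vec, ref_vec) >= threshold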
Dimension 2: Relevance
Does the response actually answer the question asked? A response might be factually accurate but completely irrelevant.
User asks about creating tasks. System responds with accurate information about deleting tasks. Factually correct, but useless. Relevance evaluation catches this.
Dimension 3: Groundedness
Is the response supported by the retrieved context? This catches hallucination, which is the RAG system's most dangerous failure mode.
The LLM might generate confident-sounding details that appear nowhere in the retrieved documents. "The API also supports batch operations for creating multiple tasks simultaneously" sounds helpful, but if no retrieved document mentions batch operations, this is hallucination.
Dimension 4: Retrieval Quality
Did the retriever find the right documents? Even a perfect generator cannot produce correct answers from irrelevant context.
Context precision measures whether retrieved documents are relevant to the query. Context recall measures whether the retrieved documents contain the information needed to answer correctly.
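To make those two definitions concrete, here is a simplified, label-based sketch. Frameworks like RAGAS judge relevance with an LLM instead; this version assumes you have hand-labeled relevance for each retrieved chunk and a short list of facts the reference answer requires.

def context_precision(relevance_labels: list[bool]) -> float:
    """Fraction of retrieved documents that are actually relevant to the query."""
    if not relevance_labels:
        return 0.0
    return sum(relevance_labels) / len(relevance_labels)


def context_recall(needed_facts: list[str], retrieved_text: str) -> float:
    """Fraction of required facts that appear in the retrieved context
    (crude substring check, stand-in for an LLM judgment)."""
    if not needed_facts:
        return 1.0
    found = sum(1 for fact in needed_facts if fact.lower() in retrieved_text.lower())
    return found / len(needed_facts)


# Example: 2 of 3 retrieved chunks are relevant; both required facts are covered
print(context_precision([True, True, False]))  # 0.67
print(context_recall(
    ["POST /tasks", "title"],
    "POST /tasks creates a new task. Required: title (string)."
))  # 1.0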
LLM-as-Judge Pattern
Manual evaluation does not scale. You cannot review every response from a production system. The solution is using LLMs to evaluate LLM outputs.
This sounds circular, but it works. An evaluator LLM examines outputs against specific criteria and returns structured judgments. The key is designing prompts that enforce precise evaluation criteria.
Correctness Evaluator
from typing import TypedDict

from langchain_openai import ChatOpenAI


class CorrectnessGrade(TypedDict):
    explanation: str
    is_correct: bool


# One judge model; each evaluator binds its own structured-output schema
judge_llm = ChatOpenAI(model="gpt-4o", temperature=0)

correctness_llm = judge_llm.with_structured_output(
    CorrectnessGrade,
    method="json_schema",
    strict=True,
)
CORRECTNESS_PROMPT = """
Compare the AI response to the reference answer.
Reference: {reference}
AI Response: {response}
Is the AI response factually correct and complete?
Provide explanation and boolean is_correct.
"""
def evaluate_correctness(
    inputs: dict,
    outputs: dict,
    reference_outputs: dict,
) -> bool:
    """LangSmith evaluator: compare the generated answer to the reference."""
    grade = correctness_llm.invoke([
        {"role": "system", "content": "You are a precise grading assistant."},
        {"role": "user", "content": CORRECTNESS_PROMPT.format(
            reference=reference_outputs.get("answer", ""),
            response=outputs.get("answer", ""),
        )},
    ])
    return grade["is_correct"]
Output:
>>> evaluate_correctness(
... inputs={"question": "How do I create a task?"},
... outputs={"answer": "Send POST to /tasks with title field"},
... reference_outputs={"answer": "Use POST /tasks with title and description"}
... )
True
The evaluator recognizes semantic equivalence even when wording differs. Both responses correctly identify POST /tasks as the endpoint.
Groundedness Evaluator
This catches hallucination by verifying every claim against retrieved context.
class GroundednessGrade(TypedDict):
    explanation: str
    is_grounded: bool


groundedness_llm = judge_llm.with_structured_output(
    GroundednessGrade,
    method="json_schema",
    strict=True,
)

GROUNDEDNESS_PROMPT = """
Verify that the response is fully supported by the context.
Context: {context}
Response: {response}
Is every claim in the response supported by the context?
Mark as NOT grounded if the response includes information not in the context.
"""


def evaluate_groundedness(inputs: dict, outputs: dict) -> bool:
    """LangSmith evaluator: flag answers not supported by the retrieved context."""
    grade = groundedness_llm.invoke([
        {"role": "system", "content": "You detect hallucinations."},
        {"role": "user", "content": GROUNDEDNESS_PROMPT.format(
            context=outputs.get("context", ""),
            response=outputs.get("answer", ""),
        )},
    ])
    return grade["is_grounded"]
Output:
>>> evaluate_groundedness(
... inputs={"question": "How do I create a task?"},
... outputs={
... "context": "POST /tasks endpoint accepts title and description fields.",
... "answer": "Use POST /tasks with title. You can also use batch mode for multiple tasks."
... }
... )
False
# Explanation: "batch mode" not mentioned in context - hallucination detected
Relevance Evaluator
class RelevanceGrade(TypedDict):
    explanation: str
    is_relevant: bool


relevance_llm = judge_llm.with_structured_output(
    RelevanceGrade,
    method="json_schema",
    strict=True,
)

RELEVANCE_PROMPT = """
Does the response directly address the question?
Question: {question}
Response: {response}
A relevant response:
- Answers the specific question asked
- Is helpful and actionable
- Does not include unrelated information
"""


def evaluate_relevance(inputs: dict, outputs: dict) -> bool:
    """LangSmith evaluator: check that the answer addresses the question asked."""
    grade = relevance_llm.invoke([
        {"role": "system", "content": "You assess answer relevance."},
        {"role": "user", "content": RELEVANCE_PROMPT.format(
            question=inputs.get("question", ""),
            response=outputs.get("answer", ""),
        )},
    ])
    return grade["is_relevant"]
Output:
>>> evaluate_relevance(
... inputs={"question": "How do I create a task?"},
... outputs={"answer": "DELETE /tasks/{id} removes a task permanently."}
... )
False
# Explanation: Response about deletion does not answer creation question
Retrieval Quality Evaluator
class RetrievalRelevanceGrade(TypedDict):
    explanation: str
    is_relevant: bool


retrieval_llm = judge_llm.with_structured_output(
    RetrievalRelevanceGrade,
    method="json_schema",
    strict=True,
)

RETRIEVAL_PROMPT = """
Are the retrieved documents relevant to answering this question?
Question: {question}
Retrieved Documents: {context}
Relevant retrieval means the documents contain information needed to answer the question.
"""


def evaluate_retrieval_relevance(inputs: dict, outputs: dict) -> bool:
    """LangSmith evaluator: check that the retriever surfaced useful documents."""
    grade = retrieval_llm.invoke([
        {"role": "system", "content": "You assess retrieval quality."},
        {"role": "user", "content": RETRIEVAL_PROMPT.format(
            question=inputs.get("question", ""),
            context=outputs.get("context", ""),
        )},
    ])
    return grade["is_relevant"]
Output:
>>> evaluate_retrieval_relevance(
... inputs={"question": "How do I create a task?"},
... outputs={"context": "POST /tasks creates a new task. Required: title (string)."}
... )
True
LangSmith Setup and Tracing
LangSmith provides observability for LLM applications. Every call through your RAG pipeline gets traced, allowing you to debug failures and run systematic evaluations.
Environment Configuration
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
export LANGCHAIN_API_KEY="your-api-key"
export LANGCHAIN_PROJECT="task-api-rag-eval"
Security Note: Never commit API keys to version control. Use environment variables or secret management tools.
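If you prefer configuring this from Python, the sketch below does the same thing programmatically. It assumes python-dotenv is installed and that LANGCHAIN_API_KEY lives in a git-ignored .env file.

import os

from dotenv import load_dotenv

load_dotenv()  # reads LANGCHAIN_API_KEY from a local .env file

os.environ.setdefault("LANGCHAIN_TRACING_V2", "true")
os.environ.setdefault("LANGCHAIN_ENDPOINT", "https://api.smith.langchain.com")
os.environ.setdefault("LANGCHAIN_PROJECT", "task-api-rag-eval")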
Traceable RAG Pipeline
The @traceable decorator automatically captures inputs, outputs, and timing for each function call.
from langsmith import traceable


@traceable(name="rag_pipeline")
def rag_pipeline(question: str) -> dict:
    """RAG pipeline with automatic tracing.

    Assumes retriever, llm, and qa_prompt were set up when you built the
    semantic search endpoint earlier in the project.
    """
    # Retrieve
    docs = retriever.invoke(question)

    # Generate
    context = "\n\n".join(doc.page_content for doc in docs)
    response = llm.invoke(
        qa_prompt.format(context=context, question=question)
    )

    return {
        "question": question,
        "context": context,
        "answer": response.content,
        "sources": [doc.metadata.get("source") for doc in docs],
    }
Output:
>>> result = rag_pipeline("How do I mark a task complete?")
>>> print(result)
{
'question': 'How do I mark a task complete?',
'context': 'PATCH /tasks/{id} updates task fields. Set status to completed...',
'answer': 'Use PATCH /tasks/{id} with status="completed" in the request body.',
'sources': ['api-reference.md', 'task-endpoints.md']
}
# Trace visible in LangSmith dashboard with timing, token counts, and full context
Creating Evaluation Datasets
LangSmith datasets enable systematic testing across question sets.
from langsmith import Client

client = Client()

# Create dataset
dataset = client.create_dataset(
    dataset_name="task-api-rag-eval",
    description="Evaluation dataset for Task API RAG system",
)

# Add examples with expected outputs
examples = [
    {
        "inputs": {"question": "How do I create a new task?"},
        "outputs": {"answer": "Use POST /tasks with title and description."},
    },
    {
        "inputs": {"question": "How do I mark a task complete?"},
        "outputs": {"answer": "Use PATCH /tasks/{id} with status='completed'."},
    },
    {
        "inputs": {"question": "What happens when I delete a task?"},
        "outputs": {"answer": "DELETE /tasks/{id} removes the task permanently."},
    },
]

for example in examples:
    client.create_example(
        inputs=example["inputs"],
        outputs=example["outputs"],
        dataset_id=dataset.id,
    )
Output:
>>> print(f"Created dataset with {len(examples)} examples")
Created dataset with 3 examples
Running Evaluations
def rag_target(inputs: dict) -> dict:
    """Adapter: LangSmith passes each example's inputs dict to the target."""
    return rag_pipeline(inputs["question"])


experiment_results = client.evaluate(
    rag_target,
    data="task-api-rag-eval",
    evaluators=[
        evaluate_correctness,
        evaluate_groundedness,
        evaluate_relevance,
        evaluate_retrieval_relevance,
    ],
    experiment_prefix="rag-v1",
    metadata={"model": "gpt-4o-mini", "retriever": "qdrant-hybrid"},
)
Output:
Evaluating over 3 examples...
Example 1/3: correctness=True, groundedness=True, relevance=True, retrieval=True
Example 2/3: correctness=True, groundedness=True, relevance=True, retrieval=True
Example 3/3: correctness=True, groundedness=False, relevance=True, retrieval=True
Results:
correctness: 100% (3/3)
groundedness: 67% (2/3)
relevance: 100% (3/3)
retrieval: 100% (3/3)
The evaluation reveals one groundedness failure. Investigating example 3 might show the response included extra information not present in retrieved context.
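One way to pin down the failure without leaving the terminal is to re-run the pipeline over the dataset examples and print whatever the groundedness evaluator rejects. The sketch below reuses client, rag_pipeline, and evaluate_groundedness from above, and assumes the list_examples method available in recent langsmith SDKs.

# Re-run each dataset example locally and surface ungrounded answers
for example in client.list_examples(dataset_name="task-api-rag-eval"):
    outputs = rag_pipeline(example.inputs["question"])
    if not evaluate_groundedness(example.inputs, outputs):
        print("Ungrounded answer for:", example.inputs["question"])
        print("Answer:", outputs["answer"])
        print("Context:", outputs["context"][:200], "...")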
RAGAS Metrics
RAGAS (Retrieval Augmented Generation Assessment) provides battle-tested metrics for RAG evaluation. It handles the complexity of comparing semantic similarity and measuring information coverage.
Installation
pip install ragas
Core RAGAS Metrics
| Metric | What It Measures | Needs Reference? | Range |
|---|---|---|---|
| Faithfulness | Is answer grounded in context? | No | 0-1 |
| Answer Relevancy | Does answer address question? | No | 0-1 |
| Context Precision | Are retrieved docs relevant? | No | 0-1 |
| Context Recall | Do retrieved docs cover answer? | Yes | 0-1 |
Running RAGAS Evaluation
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    ContextPrecision,
    ContextRecall,
    Faithfulness,
    ResponseRelevancy,
)

# Prepare data in RAGAS format
data = {
    "question": ["How do I create a task?"],
    "answer": ["Use POST /tasks endpoint with title field"],
    "contexts": [["POST /tasks creates new task. Required fields: title (string)."]],
    "ground_truth": ["Send POST request to /tasks with title and description"],
}
dataset = Dataset.from_dict(data)

# Evaluate with metric instances
results = evaluate(
    dataset,
    metrics=[
        Faithfulness(),        # Is the answer grounded in the context?
        ResponseRelevancy(),   # Does the answer address the question?
        ContextPrecision(),    # Are the retrieved docs relevant?
        ContextRecall(),       # Do the retrieved docs cover the answer?
    ],
)
print(results)
Output:
{'faithfulness': 0.95, 'response_relevancy': 0.88, 'context_precision': 0.92, 'context_recall': 0.85}
Note: RAGAS uses class-based metrics (e.g., Faithfulness()) rather than lowercase function names. The output keys use snake_case names like response_relevancy.
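When the dataset has more than one row, the aggregate scores hide which example failed. Assuming a recent RAGAS release where the result object exposes to_pandas(), you can inspect per-row scores; the 0.7 cutoff below is an arbitrary choice for illustration.

# Per-row inspection (column names follow the metric names shown above)
df = results.to_pandas()
print(df.columns.tolist())  # input columns plus one column per metric

low_faithfulness = df[df["faithfulness"] < 0.7]
print(f"{len(low_faithfulness)} responses may contain hallucinations")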
Interpreting RAGAS Scores
| Score Range | Meaning | Action |
|---|---|---|
| 0.85+ | Excellent | System working well |
| 0.70-0.84 | Good | Minor improvements possible |
| 0.50-0.69 | Needs Work | Investigate specific failures |
| Below 0.50 | Critical | Major system issues |
Debugging Poor Metrics
Each metric points to a different system component when low.
| Low Metric | Likely Cause | Fix |
|---|---|---|
| Context Precision | Poor retrieval | Improve embeddings, add metadata filters |
| Context Recall | Incomplete chunking | Reduce chunk size, add overlap |
| Faithfulness | Hallucination | Stricter prompt, lower temperature |
| Answer Relevancy | Off-topic response | Better prompt engineering |
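The two tables above translate naturally into a small triage helper. The sketch below is illustrative: the metric keys match the RAGAS output shown earlier, and the 0.70 threshold follows the score-interpretation table rather than any RAGAS default.

# Map low metrics to the likely cause and fix from the tables above
DIAGNOSIS = {
    "context_precision": "Poor retrieval: improve embeddings or add metadata filters",
    "context_recall": "Incomplete chunking: adjust chunk size and overlap",
    "faithfulness": "Hallucination: tighten the prompt and lower temperature",
    "response_relevancy": "Off-topic responses: revisit prompt engineering",
}


def diagnose(scores: dict[str, float], threshold: float = 0.70) -> list[str]:
    """Return a suggested fix for every known metric below the threshold."""
    return [
        f"{metric}={value:.2f} -> {DIAGNOSIS[metric]}"
        for metric, value in scores.items()
        if metric in DIAGNOSIS and value < threshold
    ]


print(diagnose({
    "faithfulness": 0.45,
    "response_relevancy": 0.90,
    "context_precision": 0.50,
    "context_recall": 0.80,
}))
# Flags faithfulness and context_precision; the other two metrics pass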
Designing Evaluation Datasets
The quality of your evaluation depends on your test dataset. A good dataset covers different question types and includes edge cases.
Dataset Design Principles
- Minimum 20-30 examples for reliable metrics
- Diverse question types: factual, procedural, conceptual
- Include edge cases: ambiguous queries, multi-hop reasoning
- Ground truth required for correctness and recall metrics
Example Dataset Structure
evaluation_examples = [
    # Factual questions (direct lookup)
    {
        "question": "What HTTP method creates a new task?",
        "answer": "POST",
        "type": "factual",
    },
    # Procedural questions (how-to)
    {
        "question": "How do I update a task's priority?",
        "answer": "Use PATCH /tasks/{id} with priority field in request body",
        "type": "procedural",
    },
    # Conceptual questions (understanding)
    {
        "question": "Why might a task search return no results?",
        "answer": "Search may fail if query terms don't match indexed content semantically",
        "type": "conceptual",
    },
    # Edge cases
    {
        "question": "What happens if I create a task without a title?",
        "answer": "API returns 422 Unprocessable Entity with validation error",
        "type": "edge_case",
    },
]
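To run these through the LangSmith workflow from earlier, each typed example can be loaded into the same dataset, keeping the question type as metadata so results can later be sliced by category. This sketch assumes the create_example parameters (dataset_name, metadata) available in recent langsmith SDKs and that the "task-api-rag-eval" dataset already exists.

# Load the typed examples into the existing LangSmith dataset
for item in evaluation_examples:
    client.create_example(
        inputs={"question": item["question"]},
        outputs={"answer": item["answer"]},
        metadata={"type": item["type"]},
        dataset_name="task-api-rag-eval",
    )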
Reflect on Your Skill
You built a rag-deployment skill in Lesson 0. Now test it with evaluation concepts.
Test Your Skill
Ask your skill:
I have a RAG system with these metrics:
- Faithfulness: 0.45
- Answer Relevancy: 0.90
- Context Precision: 0.50
- Context Recall: 0.80
What's wrong with my system and how do I fix it?
Identify Gaps
- Does your skill know about the four evaluation dimensions?
- Can it recommend specific fixes for low metrics?
- Does it suggest LangSmith or RAGAS integration?
Improve Your Skill
If gaps exist, update your skill:
Update my rag-deployment skill to include:
1. RAG evaluation framework with four dimensions
2. LLM-as-Judge evaluator patterns
3. RAGAS metric interpretation guide
4. Debugging strategies for each low-metric scenario
Try With AI
Prompt 1: Design Your Evaluation Dataset
I'm building a RAG system for my Task API that answers questions about
creating, updating, and searching tasks. Help me design an evaluation
dataset with 10 questions. Ask me: What are the most common user questions?
What edge cases should I test? What mistakes has the system made before?
What you're learning: Evaluation dataset design requires understanding real usage patterns. Your AI partner helps you think systematically about coverage.
Prompt 2: Debug a Failing Metric
My RAG system has faithfulness score of 0.55 but response_relevancy of 0.92.
Walk me through diagnosis: What does this pattern tell us? What should I
check first? Ask me about my prompt template, temperature settings, and
what kind of hallucinations I'm seeing.
What you're learning: Metric patterns reveal specific failure modes. Your AI partner guides diagnostic investigation through structured questioning.
Prompt 3: Create Evaluators for Your Domain
I need to create a custom evaluator for [describe your domain: legal documents,
medical FAQs, product documentation]. The standard evaluators don't capture
what matters in my field. Ask me: What does "correct" mean in my domain?
What are the dangerous failure modes? What compliance requirements apply?
What you're learning: Domain-specific evaluation requires understanding context that generic metrics miss. Your AI partner helps identify what matters for your specific use case.
Safety Note: When using LLM-as-Judge patterns in production, remember that evaluator LLMs can also make mistakes. For high-stakes applications, combine automated evaluation with periodic human review of edge cases.