Building Trust in AI, Part 2 of 9

Knowledge & Grounding: Context You Can Verify

How RAG systems provide context to LLMs – and why seeing the retrieval process matters.

LLMs are powerful, but they hallucinate. They confidently state things that aren’t true. RAG (Retrieval Augmented Generation) grounds responses in actual documents. But if you can’t see what was retrieved, you’re just trading one black box for another.

The Hallucination Problem

Ask an LLM about your company’s Q3 revenue, and it might give you a number. That number will sound confident. It might even be formatted correctly. But unless the LLM has been specifically trained on your financial data (it hasn’t), that number is fabricated.

The Danger: LLMs don’t know what they don’t know. They’re trained to produce fluent, confident text – not to express uncertainty. This is especially dangerous for domain-specific queries where the model lacks real knowledge.

RAG solves this by retrieving actual documents and including them in the context window. The LLM can then ground its response in real information. But this only builds trust if you can see the retrieval process.

How RAG Works

The RAG pipeline has four key stages. Each stage is an opportunity to either build or undermine trust:

1. Document Chunking – Split documents into semantically meaningful chunks, with overlap to preserve context across boundaries.
2. Embedding & Indexing – Convert each chunk to a vector using an embedding model and store it in a vector database.
3. Similarity Search – Convert the user query to a vector and find its nearest neighbors in vector space.
4. Context Assembly – Format the retrieved chunks with their scores and inject them into the LLM prompt.
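
To make the four stages concrete, here’s a minimal end-to-end sketch. It uses sentence-transformers for embeddings and a plain in-memory index; the function and variable names are illustrative, not taken from the implementation described later in this post.

# Minimal RAG pipeline sketch (illustrative names, in-memory index)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, widely used embedding model

# 1. Document chunking: naive fixed-size split with overlap
def chunk(text, size=1000, overlap=100):
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

# 2. Embedding & indexing
documents = ["... your document text ..."]
chunks = [c for doc in documents for c in chunk(doc)]
index = model.encode(chunks, convert_to_tensor=True)

# 3. Similarity search: embed the query, rank chunks by cosine similarity
query = "What was Q3 revenue?"
query_vec = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_vec, index)[0]
top = scores.topk(k=min(5, len(chunks)))

# 4. Context assembly: chunks plus scores, ready to inject into the prompt
context = "\n\n".join(
    f"[Score: {score:.2f}] {chunks[idx]}"
    for score, idx in zip(top.values.tolist(), top.indices.tolist())
)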

What Must Be Measured (And Why)

Here’s the critical question most RAG implementations miss: how do you know if your RAG system is working? Not “is it returning results,” but “are those results actually useful for producing accurate answers?”

The Measurement Gap: Many teams deploy RAG without measuring quality. They see responses and assume retrieval is working. But a RAG system can return documents and still produce hallucinated answers if the retrieved context is irrelevant, incomplete, or misused.

The Four Essential RAG Metrics

Each metric answers a specific question about RAG quality. Ignoring any one creates a blind spot:

  • Context Precision: Of the chunks retrieved, how many were actually relevant to the question?
  • Context Recall: Of all relevant chunks that exist, how many did we retrieve?
  • Faithfulness: Does the LLM’s response actually reflect what the retrieved documents say?
  • Answer Relevancy: Does the final answer actually address the user’s original question?

Why Each Metric Matters

Metric             | What It Catches                                                    | Target
Context Precision  | Retrieval noise – irrelevant documents filling the context window  | >0.80
Context Recall     | Missing information – key documents not retrieved                  | >0.75
Faithfulness       | LLM hallucination – claims not supported by retrieved context      | >0.85
Answer Relevancy   | Question drift – correct facts but wrong answer                    | >0.80
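
A simple way to operationalize these targets is a quality gate that fails loudly when any metric drops below its threshold. The thresholds below come straight from the table; the function name and sample scores are illustrative.

# Quality gate sketch: flag any metric that falls below its target
TARGETS = {
    "context_precision": 0.80,
    "context_recall": 0.75,
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
}

def check_rag_quality(scores: dict) -> dict:
    """Return pass/fail per metric so regressions are visible at a glance."""
    return {
        metric: {
            "score": scores[metric],
            "target": target,
            "passed": scores[metric] >= target,
        }
        for metric, target in TARGETS.items()
    }

# Example: faithfulness just below target gets flagged
print(check_rag_quality({
    "context_precision": 0.87,
    "context_recall": 0.79,
    "faithfulness": 0.83,    # below 0.85 -> fails the gate
    "answer_relevancy": 0.88,
}))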

How Measurement Actually Works

These metrics are typically measured using an LLM-as-Judge approach – a separate LLM evaluates the quality of the RAG output. This is where frameworks like Ragas and DeepEval come in:

# Example: Measuring RAG quality with Ragas
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy
)

# dataset: evaluation samples with the question, the generated answer,
# the retrieved contexts, and (for context recall) a reference answer
# Each metric evaluates a different aspect
result = evaluate(
    dataset,
    metrics=[
        context_precision,   # Precision of retrieved docs
        context_recall,      # Recall of relevant docs
        faithfulness,        # Response grounded in context?
        answer_relevancy     # Answer addresses question?
    ]
)

# Output: scores 0.0 to 1.0 for each metric
# {'context_precision': 0.87, 'faithfulness': 0.91, ...}

Trust Insight: The power of measurement is early detection. A faithfulness score dropping from 0.90 to 0.75 signals a problem BEFORE users start complaining about wrong answers. Measure continuously, not just at deployment.

The Operational Metrics Layer

Beyond quality, you need to track operational health:

  • Retrieval Latency: p50, p95, p99 response times for vector search
  • Average Similarity Score: Trending down suggests query-document drift
  • Zero-Result Rate: Percentage of queries with no chunks above threshold
  • Context Token Usage: Are you efficiently using the context window?
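
Here’s a minimal sketch of how those signals could be computed, assuming you log one record per retrieval with its latency, top score, and chunk count; the record fields and function name are illustrative.

# Operational health sketch from per-retrieval log records (illustrative schema)
import statistics

def retrieval_health(records: list[dict]) -> dict:
    # Assumes a reasonably large sample; quantiles() needs at least two points
    latencies = [r["latency_ms"] for r in records]
    cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "avg_top_score": statistics.mean(r["top_score"] for r in records),
        "zero_result_rate": sum(1 for r in records if r["num_chunks"] == 0) / len(records),
    }

# Example record: {"latency_ms": 42, "top_score": 0.91, "num_chunks": 5}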

Now that we know WHAT to measure and WHY, let’s examine HOW the RAG pipeline works at each stage.

Chunking: The First Decision Point

How you split documents determines what can be retrieved. Too small, and you lose context. Too large, and you waste tokens on irrelevant content.

# Chunking configuration
chunk_size = 1000      # Characters per chunk
chunk_overlap = 100    # Overlap to preserve context across boundaries

# Split on sentence boundaries when possible
# Don't break mid-sentence unless chunk exceeds size limit

Trust Insight: By running chunking locally, you can see exactly how your documents are split. If a retrieval misses something obvious, you can check whether the chunking strategy caused the problem.
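
A quick inspection sketch for exactly that kind of debugging, assuming chunks follow the {chunk_id, text, doc_id} shape used later in this post; the helper name is illustrative.

# Inspect how a document was split: chunk sizes and boundary text
def inspect_chunks(chunks: list[dict]) -> None:
    for chunk in chunks:
        text = chunk["text"]
        print(
            f"{chunk['chunk_id']}: {len(text)} chars | "
            f"starts: {text[:40]!r} | ends: {text[-40:]!r}"
        )

# If a retrieval misses an obvious fact, check whether that fact was split
# across two chunks or sits at a boundary without enough overlap.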

Vector Search: Finding Relevant Context

When a query comes in, it gets converted to a vector (a list of numbers that capture semantic meaning). This vector is compared against all indexed chunks to find the most similar ones.

[Diagram: semantic vector space showing the query, the matched chunks clustered nearby, and other chunks farther away]

The key insight: similarity scores tell you how confident the retrieval is. A score of 0.95 means near-perfect semantic match. A score of 0.6 means the chunk is somewhat related but might not contain what you need.
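
A small sketch of how such scores are produced, using sentence-transformers (which the implementation section below also uses); the model choice and example chunks are placeholders.

# What similarity scores look like: on-topic vs. loosely related chunks
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What was Q3 revenue?"
chunks = [
    "Q3 2024 revenue reached $2.4M, a 23% increase year over year.",   # on-topic
    "The office move to the new building is scheduled for November.",  # off-topic
]

query_vec = model.encode(query, convert_to_tensor=True)
chunk_vecs = model.encode(chunks, convert_to_tensor=True)
scores = util.cos_sim(query_vec, chunk_vecs)[0]

for text, score in zip(chunks, scores.tolist()):
    print(f"{score:.2f}  {text}")
# The on-topic chunk scores noticeably higher; exact values depend on the
# embedding model, which is why thresholds should be tuned, not assumed.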

What Retrieved Context Looks Like

When you run RAG locally, you can see exactly what was retrieved and how it scored:

doc_quarterly_report_chunk_3 Score: 0.92
Q3 2024 revenue reached $2.4M, representing a 23% increase over the same quarter last year. This growth was primarily driven by expansion in the enterprise segment, which now accounts for 67% of total revenue…
doc_quarterly_report_chunk_4 Score: 0.87
Key metrics for Q3: Customer acquisition cost (CAC) decreased by 15% while lifetime value (LTV) increased by 8%. The LTV/CAC ratio improved from 3.2 to 4.1, indicating healthier unit economics…
doc_board_presentation_chunk_7 Score: 0.71
Revenue projections for Q4 assume continued growth in enterprise accounts. Conservative estimate: $2.6M. Optimistic: $2.9M. Key risk factors include potential delays in two major contract renewals…

Now when the LLM says “Q3 revenue was $2.4M,” you can verify that it came from doc_quarterly_report_chunk_3 with a 0.92 similarity score. The response is grounded in real data.

Score Thresholds: The Trust Boundary

Not all retrievals are trustworthy. A low similarity score means the system is guessing. Local control lets you set and see the threshold:

# Only include chunks above threshold
score_threshold = 0.7

# Retrieve top-k chunks, then drop anything below the threshold
top_k = 5
results = [
    r for r in vector_store.search(query, top_k=top_k)  # VectorStore.search shown below
    if r["score"] >= score_threshold
]

# If no chunks meet threshold, indicate low confidence instead of guessing
if len(results) == 0:
    return {
        "confidence": "low",
        "message": "No relevant context found"
    }
Defaults at a glance: minimum score threshold 0.7, max chunks retrieved 5, chunk size 1,000 characters.

Context Assembly: What the LLM Actually Sees

The final step formats retrieved chunks into the prompt. This is where you see exactly what information the LLM has to work with:

# Format context for LLM prompt
def format_context_for_llm(documents):
    context_parts = []
    for i, doc in enumerate(documents, 1):
        context_parts.append(
            f"[Document {i}] (Score: {doc['score']:.3f})"
        )
        context_parts.append(doc['text'])
        context_parts.append("")  # Separator

    return "\n".join(context_parts)

The Pattern: Including the similarity score in the formatted context helps the LLM calibrate confidence. A document with score 0.92 should be treated as more reliable than one with score 0.71.
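
As a usage sketch, here’s one way the formatted context might be wrapped into the final prompt; the instruction wording is illustrative, not taken from the implementation.

# Wrap the formatted context and the user question into the final prompt
def build_prompt(query: str, documents: list[dict]) -> str:
    context = format_context_for_llm(documents)
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
        "Answer:"
    )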

Why Local RAG Builds Trust

Running retrieval locally gives you visibility that cloud APIs can’t provide:

  • Source verification: See exactly which documents were used for any response
  • Score transparency: Know the confidence level of every retrieval
  • Debugging capability: When answers are wrong, trace back to what was retrieved
  • Threshold tuning: Adjust quality/recall tradeoffs based on your requirements
  • Index inspection: Verify your documents are chunked and indexed correctly

Trust Transfer: When evaluating enterprise RAG platforms, you’ll know to ask: What’s your similarity threshold? Can I see retrieval scores? How are documents chunked? These questions come from understanding the mechanics.

Fact Verification: The Next Level

Basic RAG retrieves and presents. Advanced systems verify. A fact verification layer can:

  • Cross-reference claims against multiple sources
  • Flag contradictions between retrieved documents
  • Identify when claims go beyond what sources support
  • Track which specific sentences support which claims

This is where RAG transitions from “find relevant documents” to “verify this claim.” The architecture is similar, but the intent is different.
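
A hedged sketch of what that layer could look like: extract the claims from a response, then ask a judge whether each claim is supported by any retrieved chunk. Both extract_claims and judge_supports are hypothetical placeholders for whatever claim extraction and judging you use (an LLM-as-Judge call, an NLI model, etc.).

# Fact verification sketch: trace each claim in the response to source chunks
# extract_claims() and judge_supports() are hypothetical placeholders
def verify_response(response: str, chunks: list[dict]) -> list[dict]:
    report = []
    for claim in extract_claims(response):       # e.g. one factual statement per claim
        supporting = [
            c["chunk_id"] for c in chunks
            if judge_supports(claim, c["text"])  # LLM-as-Judge or NLI check
        ]
        report.append({
            "claim": claim,
            "supported": bool(supporting),
            "sources": supporting,
        })
    return report

# Claims with "supported": false are hallucination candidates; comparing each
# claim against every chunk (not just the top match) also surfaces contradictions.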

Our Actual Implementation

The RAG system described here is real, running code in our proof-of-concept. The knowledge-service (port 8006) implements semantic chunking, vector storage, and context retrieval.

knowledge-service: RAG Engine

From knowledge-service/app/core/rag_engine.py—actual chunking implementation:

# Actual implementation: Semantic chunking with overlap
class RAGEngine:
    """RAG Engine for document processing and retrieval"""

    def __init__(self):
        self.chunk_size = settings.RAG_CHUNK_SIZE        # 1000 chars
        self.chunk_overlap = settings.RAG_CHUNK_OVERLAP  # 100 chars

    def _chunk_text(self, text: str, doc_id: str):
        # Split into sentences first (preserve context)
        sentences = re.split(r'[.!?]+', text)

        chunks = []
        current_chunk = []
        current_length = 0
        chunk_num = 0

        for sentence in sentences:
            if current_length + len(sentence) > self.chunk_size:
                # Save chunk, keep trailing sentences as overlap for the next one
                chunks.append({
                    "chunk_id": f"{doc_id}_chunk_{chunk_num}",
                    "text": '. '.join(current_chunk),
                    "doc_id": doc_id
                })
                chunk_num += 1
                overlap_sentences = current_chunk[-2:]
                current_chunk = overlap_sentences
                current_length = sum(len(s) for s in current_chunk)
            current_chunk.append(sentence)
            current_length += len(sentence)

        # Flush the final partial chunk
        if current_chunk:
            chunks.append({
                "chunk_id": f"{doc_id}_chunk_{chunk_num}",
                "text": '. '.join(current_chunk),
                "doc_id": doc_id
            })
        return chunks

Vector Store Integration

The implementation uses ChromaDB for vector storage with sentence-transformers for embeddings:

# From vector_store.py
from typing import Dict, List

class VectorStore:
    def search(self, query: str, top_k: int = 5) -> List[Dict]:
        """Semantic similarity search with scores."""
        results = self.collection.query(
            query_texts=[query],
            n_results=top_k,
            include=["documents", "metadatas", "distances"]
        )
        # ChromaDB returns distances, so similarity = 1 - distance
        return [
            {"text": doc, "score": 1 - dist, **meta}
            for doc, dist, meta in zip(
                results["documents"][0],
                results["distances"][0],
                results["metadatas"][0]
            )
        ]
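
For context, here’s a minimal sketch of how chunks might be indexed and queried with ChromaDB; the collection name and sample data are placeholders, and the default embedding function stands in for whatever sentence-transformers model the service configures.

# Minimal ChromaDB usage sketch (placeholder names and data)
import chromadb

client = chromadb.Client()                           # in-memory client
collection = client.create_collection("documents")   # placeholder collection name

# Index chunks produced by the chunker above
collection.add(
    ids=["doc1_chunk_0", "doc1_chunk_1"],
    documents=["Q3 2024 revenue reached $2.4M...", "Key metrics for Q3: CAC decreased..."],
    metadatas=[{"doc_id": "doc1"}, {"doc_id": "doc1"}],
)

# Query: nearest chunks with distances included
results = collection.query(
    query_texts=["What was Q3 revenue?"],
    n_results=2,
    include=["documents", "metadatas", "distances"],
)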

Measuring RAG Quality

The eval-service (port 8005) implements the RAG metrics discussed earlier. From eval-service/app/core/rag_evaluator.py:

# Actual RAG evaluation implementation
async def evaluate_rag_response(
    query: str,
    response: str,
    retrieved_contexts: List[str]
) -> Dict[str, float]:
    return {
        "context_precision": await measure_precision(contexts),
        "context_recall": await measure_recall(contexts, query),
        "faithfulness": await measure_faithfulness(response, contexts),
        "answer_relevancy": await measure_relevancy(response, query)
    }
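
A small usage sketch, assuming the evaluator can be called directly; the sample values mirror the example below.

# Calling the evaluator directly (sample values for illustration)
import asyncio

scores = asyncio.run(evaluate_rag_response(
    query="What was Q3 revenue?",
    response="Q3 revenue was $2.4M, a 23% increase...",
    retrieved_contexts=["chunk_3: Q3 2024 revenue reached $2.4M..."],
))
print(scores)  # e.g. {'context_precision': 0.87, 'faithfulness': 0.91, ...}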

Context Verification: Closing the Loop

Retrieval shows you what documents were used. Verification proves the response actually matches those documents. The Eval Service (port 8005) closes this loop:

  • Faithfulness: 0.92 – claims match the retrieved context
  • Answer Relevancy: 0.88 – the answer addresses the question
  • Claims Grounded: 3/3 – every claim traced to a source

What Gets Measured

# POST /api/v1/evaluate/rag
{
  "query": "What was Q3 revenue?",
  "response": "Q3 revenue was $2.4M, a 23% increase...",
  "contexts": ["chunk_3: Q3 2024 revenue reached $2.4M..."]
}

# Response: verification result
{
  "faithfulness": { "score": 0.92, "passed": true },
  "answer_relevancy": { "score": 0.88, "passed": true },
  "claims_analysis": [
    { "claim": "Q3 revenue was $2.4M", "source": "chunk_3", "supported": true },
    { "claim": "23% increase YoY", "source": "chunk_3", "supported": true }
  ]
}

What This Proves: The LLM didn’t hallucinate. Each claim in the response is traced to a specific chunk. A faithfulness score of 0.92 means 92% of claims are directly supported by retrieved documents.

Knowledge Service provides the what (retrieved context). Eval Service provides the proof (faithfulness verification). Together: “Context You Can Verify.”
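
Putting the two services together, the round trip might look like the sketch below. The eval endpoint is the one shown above; the knowledge-service retrieval path, its payload, and the generate_answer helper are assumptions for illustration, not documented APIs.

# Round trip sketch: retrieve context, generate an answer, then verify it
import requests

query = "What was Q3 revenue?"

retrieval = requests.post(
    "http://localhost:8006/api/v1/retrieve",       # assumed knowledge-service endpoint
    json={"query": query, "top_k": 5},
).json()
contexts = [chunk["text"] for chunk in retrieval.get("chunks", [])]  # assumed response shape

response_text = generate_answer(query, contexts)   # placeholder for your LLM call

verification = requests.post(
    "http://localhost:8005/api/v1/evaluate/rag",   # endpoint shown above
    json={"query": query, "response": response_text, "contexts": contexts},
).json()
print(verification["faithfulness"]["score"], verification["answer_relevancy"]["score"])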

Coming Up Next

Knowledge grounding reduces hallucination, but what about harmful content? In the next post, we’ll explore Guardrails & Evaluation – how to detect and block unsafe content, and how to measure response quality systematically.
