Knowledge & Grounding: Context You Can Verify
How RAG systems provide context to LLMs – and why seeing the retrieval process matters.
LLMs are powerful, but they hallucinate. They confidently state things that aren’t true. RAG (Retrieval Augmented Generation) grounds responses in actual documents. But if you can’t see what was retrieved, you’re just trading one black box for another.
The Hallucination Problem
Ask an LLM about your company’s Q3 revenue, and it might give you a number. That number will sound confident. It might even be formatted correctly. But unless the LLM has been specifically trained on your financial data (it hasn’t), that number is fabricated.
The Danger: LLMs don’t know what they don’t know. They’re trained to produce fluent, confident text – not to express uncertainty. This is especially dangerous for domain-specific queries where the model lacks real knowledge.
RAG solves this by retrieving actual documents and including them in the context window. The LLM can then ground its response in real information. But this only builds trust if you can see the retrieval process.
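To make grounding concrete, here is a minimal sketch of a grounded prompt, assuming a simple template-based assembly; the wording and variable names are illustrative, not a prescribed format.
# Sketch: a grounded prompt template (wording is illustrative)
GROUNDED_PROMPT = """Answer the question using ONLY the context below.
If the context does not contain the answer, say so instead of guessing.

Context:
{context}

Question: {question}
"""

prompt = GROUNDED_PROMPT.format(
    context="[Document 1] Q3 2024 revenue reached $2.4M...",
    question="What was Q3 revenue?",
)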
How RAG Works
The RAG pipeline has four key stages: chunking, embedding and indexing, retrieval, and context assembly. Each stage is an opportunity to either build or undermine trust.
What Must Be Measured (And Why)
Here’s the critical question most RAG implementations miss: how do you know if your RAG system is working? Not “is it returning results,” but “are those results actually useful for producing accurate answers?”
The Measurement Gap: Many teams deploy RAG without measuring quality. They see responses and assume retrieval is working. But a RAG system can return documents and still produce hallucinated answers if the retrieved context is irrelevant, incomplete, or misused.
The Four Essential RAG Metrics
Each metric answers a specific question about RAG quality, and ignoring any one of them creates a blind spot.
Why Each Metric Matters
| Metric | What It Catches | Target |
|---|---|---|
| Context Precision | Retrieval noise – irrelevant documents filling context window | >0.80 |
| Context Recall | Missing information – key documents not retrieved | >0.75 |
| Faithfulness | LLM hallucination – claims not supported by retrieved context | >0.85 |
| Answer Relevancy | Question drift – correct facts but wrong answer | >0.80 |
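To make the table concrete, here is a deliberately simplified sketch of the ratio behind each metric. It assumes you already have relevance labels and claim-support judgments; in practice those judgments come from an LLM-as-Judge, as described next.
# Sketch: the intuition behind each metric as a simple ratio (simplified)
def context_precision_ratio(chunk_is_relevant: list[bool]) -> float:
    """Share of retrieved chunks that are actually relevant."""
    return sum(chunk_is_relevant) / len(chunk_is_relevant)

def context_recall_ratio(facts_found: int, facts_needed: int) -> float:
    """Share of the facts needed for a correct answer that were retrieved."""
    return facts_found / facts_needed

def faithfulness_ratio(claims_supported: int, claims_total: int) -> float:
    """Share of the answer's claims supported by the retrieved context."""
    return claims_supported / claims_total

def answer_relevancy_ratio(sentences_on_topic: int, sentences_total: int) -> float:
    """Rough proxy: share of the answer that actually addresses the question."""
    return sentences_on_topic / sentences_total

# Example: 4 of 5 retrieved chunks were relevant
print(context_precision_ratio([True, True, True, True, False]))  # 0.8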
How Measurement Actually Works
These metrics are typically measured using an LLM-as-Judge approach – a separate LLM evaluates the quality of the RAG output. This is where frameworks like Ragas and DeepEval come in:
# Example: Measuring RAG quality with Ragas
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy
)

# Each metric evaluates a different aspect. The dataset holds questions,
# generated answers, retrieved contexts, and ground-truth references.
result = evaluate(
    dataset,
    metrics=[
        context_precision,  # Precision of retrieved docs
        context_recall,     # Recall of relevant docs
        faithfulness,       # Response grounded in context?
        answer_relevancy    # Answer addresses question?
    ]
)

# Output: scores 0.0 to 1.0 for each metric
# {'context_precision': 0.87, 'faithfulness': 0.91, ...}
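For reference, a minimal sketch of how the dataset passed to evaluate() might be assembled, assuming the question/answer/contexts/ground_truth column convention; column names vary between Ragas releases, so check the version you use.
# Sketch: building a small evaluation dataset (column names may vary by version)
from datasets import Dataset

dataset = Dataset.from_dict({
    "question": ["What was Q3 revenue?"],
    "answer": ["Q3 revenue was $2.4M, a 23% increase."],
    "contexts": [["Q3 2024 revenue reached $2.4M, up 23% from Q3 2023."]],
    "ground_truth": ["Q3 2024 revenue was $2.4M."],
})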
Trust Insight: The power of measurement is early detection. A faithfulness score dropping from 0.90 to 0.75 signals a problem BEFORE users start complaining about wrong answers. Measure continuously, not just at deployment.
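One way to put that into practice is to evaluate a recurring sample of production traffic and alert when any metric drops below its target. A minimal sketch, using the targets from the table above:
# Sketch: alert when a quality metric drifts below its target
TARGETS = {
    "context_precision": 0.80,
    "context_recall": 0.75,
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
}

def check_drift(scores: dict) -> list[str]:
    """Return alert messages for any metric below its target."""
    return [
        f"{metric} = {scores[metric]:.2f} (target > {target:.2f})"
        for metric, target in TARGETS.items()
        if metric in scores and scores[metric] < target
    ]

# Yesterday's averages over sampled production queries (illustrative numbers)
print(check_drift({"faithfulness": 0.75, "context_precision": 0.87}))
# -> ['faithfulness = 0.75 (target > 0.85)']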
The Operational Metrics Layer
Beyond quality, you need to track operational health; a sketch for deriving these numbers from retrieval logs follows the list:
- Retrieval Latency: p50, p95, p99 response times for vector search
- Average Similarity Score: Trending down suggests query-document drift
- Zero-Result Rate: Percentage of queries with no chunks above threshold
- Context Token Usage: Are you efficiently using the context window?
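As promised above, here is a minimal sketch for computing these from per-query retrieval logs. The log field names (latency_ms, top_score, num_results) are assumptions about what your retrieval layer records; context token usage would come from the prompt-assembly step and is omitted here.
# Sketch: operational metrics from retrieval logs (field names are assumed)
import statistics

def operational_metrics(logs: list[dict]) -> dict:
    latencies = sorted(entry["latency_ms"] for entry in logs)
    pct = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    return {
        "latency_p50": pct[49],
        "latency_p95": pct[94],
        "latency_p99": pct[98],
        "avg_similarity": statistics.mean(e["top_score"] for e in logs),
        "zero_result_rate": sum(e["num_results"] == 0 for e in logs) / len(logs),
    }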
Now that we know WHAT to measure and WHY, let’s examine HOW the RAG pipeline works at each stage.
Chunking: The First Decision Point
How you split documents determines what can be retrieved. Too small, and you lose context. Too large, and you waste tokens on irrelevant content.
# Chunking configuration
chunk_size = 1000 # Characters per chunk
chunk_overlap = 100 # Overlap to preserve context across boundaries
# Split on sentence boundaries when possible
# Don't break mid-sentence unless chunk exceeds size limit
Trust Insight: By running chunking locally, you can see exactly how your documents are split. If a retrieval misses something obvious, you can check whether the chunking strategy caused the problem.
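One way to do that inspection is a small report over the chunker's output. The sketch below assumes chunks are dicts with chunk_id and text keys, matching the implementation shown later; the two sample chunks are placeholders.
# Sketch: inspecting chunk boundaries locally (sample chunks are placeholders)
def print_chunk_report(chunks: list[dict]) -> None:
    for chunk in chunks:
        preview = chunk["text"][:60].replace("\n", " ")
        print(f'{chunk["chunk_id"]}: {len(chunk["text"])} chars | {preview}...')

print_chunk_report([
    {"chunk_id": "doc_report_chunk_0", "text": "First chunk text..."},
    {"chunk_id": "doc_report_chunk_1", "text": "Second chunk text..."},
])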
Vector Search: Finding Relevant Context
When a query comes in, it gets converted to a vector (a list of numbers that capture semantic meaning). This vector is compared against all indexed chunks to find the most similar ones.
The key insight: similarity scores tell you how confident the retrieval is. A score of 0.95 means near-perfect semantic match. A score of 0.6 means the chunk is somewhat related but might not contain what you need.
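To see where those scores come from, here is a minimal sketch using sentence-transformers, the same embedding library used in the implementation described below. The model name and sample chunks are illustrative.
# Sketch: how similarity scores are produced (model and samples are illustrative)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What was Q3 revenue?"
chunks = [
    "Q3 2024 revenue reached $2.4M, up 23% from the prior year.",
    "The onboarding guide covers laptop setup and access requests.",
]

query_vec = model.encode(query)
chunk_vecs = model.encode(chunks)

# Cosine similarity: close to 1.0 means a near-perfect semantic match
scores = util.cos_sim(query_vec, chunk_vecs)[0]
for chunk, score in zip(chunks, scores):
    print(f"{float(score):.2f}  {chunk[:50]}")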
What Retrieved Context Looks Like
When you run RAG locally, you can see exactly what was retrieved and how it scored:
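Here is an illustrative example of that retrieval output for the revenue question; the chunk id and scores match the example discussed below, and the exact shape depends on your retrieval layer.
# Illustrative retrieval output for: "What was Q3 revenue?"
[
    {
        "chunk_id": "doc_quarterly_report_chunk_3",
        "score": 0.92,
        "text": "Q3 2024 revenue reached $2.4M, a 23% increase over Q3 2023..."
    },
    {
        "chunk_id": "doc_quarterly_report_chunk_7",
        "score": 0.71,
        "text": "..."
    }
]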
Now when the LLM says “Q3 revenue was $2.4M,” you can verify that it came from doc_quarterly_report_chunk_3 with a 0.92 similarity score. The response is grounded in real data.
Score Thresholds: The Trust Boundary
Not all retrievals are trustworthy. A low similarity score means the system is guessing. Local control lets you set and see the threshold:
# Only include chunks above threshold
score_threshold = 0.7
# Retrieve top-k chunks
top_k = 5

def retrieve_with_threshold(retrieved_chunks):
    # Keep only chunks scoring above the threshold, capped at top-k
    results = [c for c in retrieved_chunks if c["score"] >= score_threshold][:top_k]
    # If no chunks meet threshold, indicate low confidence
    if len(results) == 0:
        return {
            "confidence": "low",
            "message": "No relevant context found"
        }
    return results
Context Assembly: What the LLM Actually Sees
The final step formats retrieved chunks into the prompt. This is where you see exactly what information the LLM has to work with:
# Format context for LLM prompt
def format_context_for_llm(documents):
    context_parts = []
    for i, doc in enumerate(documents, 1):
        context_parts.append(
            f"[Document {i}] (Score: {doc['score']:.3f})"
        )
        context_parts.append(doc['text'])
        context_parts.append("")  # Separator
    return "\n".join(context_parts)
The Pattern: Including the similarity score in the formatted context helps the LLM calibrate confidence. A document with score 0.92 should be treated as more reliable than one with score 0.71.
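For example, with the two retrieval scores mentioned above, the assembled context handed to the LLM would look roughly like this:
[Document 1] (Score: 0.920)
Q3 2024 revenue reached $2.4M, a 23% increase over Q3 2023...

[Document 2] (Score: 0.710)
...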
Why Local RAG Builds Trust
Running retrieval locally gives you visibility that cloud APIs can’t provide:
- Source verification: See exactly which documents were used for any response
- Score transparency: Know the confidence level of every retrieval
- Debugging capability: When answers are wrong, trace back to what was retrieved
- Threshold tuning: Adjust quality/recall tradeoffs based on your requirements
- Index inspection: Verify your documents are chunked and indexed correctly
Trust Transfer: When evaluating enterprise RAG platforms, you’ll know to ask: What’s your similarity threshold? Can I see retrieval scores? How are documents chunked? These questions come from understanding the mechanics.
Fact Verification: The Next Level
Basic RAG retrieves and presents. Advanced systems verify. A fact verification layer can:
- Cross-reference claims against multiple sources
- Flag contradictions between retrieved documents
- Identify when claims go beyond what sources support
- Track which specific sentences support which claims
This is where RAG transitions from “find relevant documents” to “verify this claim.” The architecture is similar, but the intent is different.
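A rough sketch of that intent shift: extract claims from the response, then test each claim against each retrieved source. Everything below is an assumption about structure; the support check would be an LLM or NLI call in a real system and is stubbed out here as naive keyword overlap, and contradiction detection would need a similar model-based check.
# Sketch: claim-level verification loop (support check is a naive stand-in)
def supports(claim: str, source_text: str) -> bool:
    """Stand-in for an LLM/NLI judgment: crude keyword overlap."""
    claim_words = {w.lower().strip(".,$%") for w in claim.split()}
    source_words = {w.lower().strip(".,$%") for w in source_text.split()}
    return len(claim_words & source_words) >= len(claim_words) // 2

def verify_claims(claims: list[str], sources: dict[str, str]) -> list[dict]:
    results = []
    for claim in claims:
        supporting = [sid for sid, text in sources.items() if supports(claim, text)]
        results.append({
            "claim": claim,
            "supported": bool(supporting),  # does any source back this claim?
            "sources": supporting,          # which specific chunks support it
        })
    return results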
Our Actual Implementation
The RAG system described here is real, running code in our proof-of-concept. The knowledge-service (port 8006) implements semantic chunking, vector storage, and context retrieval.
knowledge-service: RAG Engine
From knowledge-service/app/core/rag_engine.py—actual chunking implementation:
# Actual implementation: Semantic chunking with overlap
import re

class RAGEngine:
    """RAG Engine for document processing and retrieval"""

    def __init__(self):
        # settings comes from the service's config module
        self.chunk_size = settings.RAG_CHUNK_SIZE        # 1000 chars
        self.chunk_overlap = settings.RAG_CHUNK_OVERLAP  # 100 chars

    def _chunk_text(self, text: str, doc_id: str):
        # Split into sentences first (preserve context)
        sentences = re.split(r'[.!?]+', text)
        chunks = []
        current_chunk = []
        current_length = 0
        chunk_num = 0
        for sentence in sentences:
            if current_length + len(sentence) > self.chunk_size:
                # Save chunk, keep overlap for next
                chunks.append({
                    "chunk_id": f"{doc_id}_chunk_{chunk_num}",
                    "text": '. '.join(current_chunk),
                    "doc_id": doc_id
                })
                chunk_num += 1
                # Carry the last sentences forward as overlap
                overlap_sentences = current_chunk[-2:]
                current_chunk = overlap_sentences
                current_length = sum(len(s) for s in current_chunk)
            current_chunk.append(sentence)
            current_length += len(sentence)
        # Flush whatever remains as the final chunk
        if current_chunk:
            chunks.append({
                "chunk_id": f"{doc_id}_chunk_{chunk_num}",
                "text": '. '.join(current_chunk),
                "doc_id": doc_id
            })
        return chunks
Vector Store Integration
The implementation uses ChromaDB for vector storage with sentence-transformers for embeddings:
# From vector_store.py
from typing import Dict, List

class VectorStore:
    def search(self, query: str, top_k: int = 5) -> List[Dict]:
        """Semantic similarity search with scores."""
        # self.collection is a ChromaDB collection built at index time
        results = self.collection.query(
            query_texts=[query],
            n_results=top_k,
            include=["documents", "metadatas", "distances"]
        )
        # ChromaDB returns one list per query; unpack the first query's
        # results and convert each distance into a similarity score
        return [
            {"text": doc, "score": 1 - dist, **meta}
            for doc, dist, meta in zip(
                results["documents"][0],
                results["distances"][0],
                results["metadatas"][0]
            )
        ]
Measuring RAG Quality
The eval-service (port 8005) implements the RAG metrics discussed earlier. From eval-service/app/core/rag_evaluator.py:
# Actual RAG evaluation implementation
async def evaluate_rag_response(
    query: str,
    response: str,
    retrieved_contexts: List[str]
) -> Dict[str, float]:
    return {
        "context_precision": await measure_precision(retrieved_contexts),
        "context_recall": await measure_recall(retrieved_contexts, query),
        "faithfulness": await measure_faithfulness(response, retrieved_contexts),
        "answer_relevancy": await measure_relevancy(response, query)
    }
Context Verification: Closing the Loop
Retrieval shows you what documents were used. Verification proves the response actually matches those documents. The Eval Service (port 8005) closes this loop:
- Faithfulness: claims match the retrieved context
- Answer Relevancy: the response addresses the question
- Claims Analysis: each claim is traced to its source
What Gets Measured
# POST /api/v1/evaluate/rag
{
    "query": "What was Q3 revenue?",
    "response": "Q3 revenue was $2.4M, a 23% increase...",
    "contexts": ["chunk_3: Q3 2024 revenue reached $2.4M..."]
}

# Response: verification result
{
    "faithfulness": { "score": 0.92, "passed": true },
    "answer_relevancy": { "score": 0.88, "passed": true },
    "claims_analysis": [
        { "claim": "Q3 revenue was $2.4M", "source": "chunk_3", "supported": true },
        { "claim": "23% increase YoY", "source": "chunk_3", "supported": true }
    ]
}
What This Proves: The LLM didn’t hallucinate. Each claim in the response is traced to a specific chunk. A faithfulness score of 0.92 means 92% of claims are directly supported by retrieved documents.
Knowledge Service provides the what (retrieved context). Eval Service provides the proof (faithfulness verification). Together: “Context You Can Verify.”
Coming Up Next
Knowledge grounding reduces hallucination, but what about harmful content? In the next post, we’ll explore Guardrails & Evaluation – how to detect and block unsafe content, and how to measure response quality systematically.