Building Trust in AI Part 3 of 9

Guardrails & Evaluation: Safety You Can See

Content filtering and quality assessment – two sides of the same trust coin.

AI safety isn’t just about blocking bad outputs. It’s about measuring quality systematically and knowing exactly when and why interventions happen. When guardrails fire, you should understand the trigger. When evaluations fail, you should see the scores.

First: Why Both Guardrails AND Evals?

A common question when building enterprise AI systems: Do I need guardrails, evals, or both? The answer is both, but they serve fundamentally different purposes and operate on completely different timescales.

The Core Distinction: Guardrails are real-time prevention. Evals are async quality measurement. One is a bouncer at the door. The other is a quality auditor reviewing recordings.

Dimension Guardrails Evals
Timing Real-time, every request Async, sampled or batched
Latency Requirement <100ms (in critical path) Seconds to minutes (acceptable)
Output Type Binary: ALLOW / WARN / BLOCK Continuous: 0.0 to 1.0 scores
Purpose Prevent immediate harm Measure quality patterns over time
When to Use Every action (synchronous) Sampled or periodic (async)
Examples PII blocking, toxicity rejection Hallucination rate, faithfulness score

Why you need both: Guardrails catch obvious problems immediately (a user sending SSNs to a cloud model), but they can’t measure subtle quality dimensions like “is this response faithful to the source documents?” That requires evals. Conversely, evals are too slow and expensive to run on every request, so they can’t provide real-time protection.

For a deeper exploration of this distinction and when to apply each approach, see Evals or Guardrails or Both?

🛡
Guardrails
Real-time protection: PII detection, toxicity filtering, injection prevention. Runs on every request before and after the LLM.
📊
Evaluation
Quality measurement: hallucination detection, relevance scoring, faithfulness checks. Systematic testing against benchmarks.

The Problem with “We Have Guardrails”

Every AI vendor says they have guardrails. Content moderation. Safety filters. PII protection. But what does that actually mean?

When you send a prompt to Claude or GPT-4, content filtering happens before your request reaches the model. If something gets blocked, you get a generic refusal. Why? What triggered it? Was it too conservative? You’ll never know. It’s a black box.

“Your request couldn’t be completed due to our content policy.”

— Every AI API, without explaining which policy or why

Part 1: Guardrails – The Safety Net

A comprehensive guardrails system handles multiple types of checks:

🔒 Data Classification

Classifies content as PUBLIC, INTERNAL, CONFIDENTIAL, or SECRET based on detected entities

👤 PII Detection

Finds names, emails, SSNs, credit cards, phone numbers – with option to redact

Toxicity Detection

ML-based analysis for toxicity, threats, insults, and identity attacks

🛡 Bias Detection

Identifies potentially biased language or stereotypes

🔨 Prompt Injection

Detects attempts to manipulate the model through crafted inputs

🗺 Privacy Routing

Recommends local vs cloud processing based on data sensitivity

The Guardrail Pipeline

Guardrails check both input and output:

Request Flow with Guardrails ────────────────────────────── User Input │ ▼ ┌──────────────────────────────────────────────────┐ │ GUARDRAILS LAYER │ │ │ │ 1. Data Classification │ │ └─▶ PUBLIC / INTERNAL / CONFIDENTIAL / SECRET│ │ │ │ 2. PII Detection (Presidio or Regex) │ │ └─▶ SSN, Email, Phone, Credit Card, Names │ │ │ │ 3. Toxicity Check (Detoxify ML or Keywords) │ │ └─▶ Score 0-1 across 6 categories │ │ │ │ 4. Prompt Injection Detection │ │ └─▶ Pattern matching for known attacks │ │ │ │ 5. Privacy Routing Recommendation │ │ └─▶ Local-only flag for sensitive data │ │ │ └──────────────────────────────────────────────────┘ │ ▼ Decision: ALLOW / WARN / BLOCK │ ▼ Orchestration Service → Model Execution

PII Detection: Two Approaches

PII detection can work at two levels of sophistication:

1. Regex-Based (Fast, Default)

Fast pattern matching for common PII formats:

# Regex patterns for common PII
SSN_PATTERN = r'\b\d{3}-\d{2}-\d{4}\b'
EMAIL_PATTERN = r'\b[\w.-]+@[\w.-]+\.\w+\b'
PHONE_PATTERN = r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'
CREDIT_CARD = r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b'

Pros: Fast, no dependencies, no ML overhead
Cons: Misses context-dependent PII (names, addresses)

2. Presidio-Based (Accurate, Optional)

Microsoft’s Presidio uses NLP to understand context:

from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()
results = analyzer.analyze(
    text="Contact John Smith at john.smith@acme.com",
    language="en"
)

# Results:
# - PERSON: "John Smith" (confidence: 0.85)
# - EMAIL: "john.smith@acme.com" (confidence: 0.95)

Pros: Catches names, addresses, contextual PII
Cons: Slower, requires ML models (~500MB)

Input:
“Please contact John Smith at john.smith@acme.com or 555-123-4567 regarding SSN 123-45-6789”

Output (redacted):
“Please contact [PERSON] at [EMAIL] or [PHONE] regarding SSN [SSN]

Trust Insight: Seeing exactly what was redacted lets you verify the guardrail is working correctly. Is it catching real PII or false positives? Is it missing any patterns you care about?

Toxicity Detection: ML vs Keywords

Similar story for toxicity – two approaches:

Keyword Matching (Default)

Simple but catches obvious cases:

TOXIC_KEYWORDS = ["hate", "kill", "attack", ...]

def check_toxicity(text):
    for keyword in TOXIC_KEYWORDS:
        if keyword in text.lower():
            return True
    return False

Detoxify ML (Optional)

BERT-based model trained on the Jigsaw toxic comments dataset:

from detoxify import Detoxify

model = Detoxify('original')
results = model.predict("I hate this product")

# Results (0-1 scores):
# {
#   'toxicity': 0.72,
#   'severe_toxicity': 0.01,
#   'obscene': 0.02,
#   'threat': 0.01,
#   'insult': 0.45,
#   'identity_attack': 0.01
# }

The ML approach understands nuance. “I hate this product” is different from “I hate you.” The model gives scores across 6 categories, so you can tune sensitivity per use case.

Category What It Detects Example
Toxicity General toxic content “This is terrible”
Severe Toxicity Extreme toxic content Explicit hate speech
Obscene Vulgar language Profanity
Threat Threatening language “I’ll make you pay”
Insult Personal attacks “You’re an idiot”
Identity Attack Attacks on identity groups Slurs, stereotypes

Data Classification

Every input gets classified into one of four levels:

Level Triggers Routing Recommendation
PUBLIC No sensitive content detected Any model (cloud or local)
INTERNAL Company names, project references Trusted providers
CONFIDENTIAL PII detected (names, emails, etc.) Local processing recommended
SECRET SSN, financial data, health records Local processing required

The classification doesn’t block requests – it provides a recommendation to the orchestration service. If you’re testing with synthetic data, you can override. If it’s real user data, the recommendation ensures it stays on local models.

Prompt Injection Detection

Malicious inputs try to override system instructions. Detection looks for patterns like:

# Injection patterns to detect
injection_patterns = [
    "ignore previous instructions",
    "disregard your training",
    "you are now",
    "act as if",
    "jailbreak",
    "pretend you are",
    "forget everything"
]

# Confidence score indicates how likely the injection attempt

Why False Positives Matter

Here’s something you learn fast when running guardrails: false positives are more annoying than false negatives in a research context.

A false positive means legitimate content gets flagged. If you’re writing a medical chatbot prototype and every mention of “symptoms” triggers the toxicity filter, that’s useless.

The Tuning Challenge: Setting the right thresholds is an art. Too strict = constant false positives. Too loose = things slip through. 0.5 is a common default toxicity threshold, but for domain-specific applications, adjust per-category.

Feature Flags for Flexibility

All the “heavy” ML features can be optional and disabled by default:

# Default: Fast, regex-based checks
config = {
    "presidio_enabled": False,   # Use regex PII
    "detoxify_enabled": False,   # Use keyword toxicity
    "bias_enabled": False        # Skip bias detection
}

# Full ML: Slower but more accurate
config = {
    "presidio_enabled": True,    # ML-based PII
    "detoxify_enabled": True,    # BERT toxicity
    "bias_enabled": True         # Bias detection
}

This lets you trade off speed vs accuracy depending on the use case. For quick iterations, regex is fine. For production validation, enable everything.

Part 2: Evaluation – Measuring Quality

While guardrails catch obvious problems, evaluation measures subtle quality dimensions. This is essential for continuous improvement.

The Evaluation Framework Stack

Multiple frameworks exist because different metrics matter for different use cases:

Framework Strength Key Metrics
DeepEval Comprehensive LLM testing Hallucination, Relevancy, Faithfulness, Bias, Toxicity
G-Eval LLM-as-Judge approach Coherence, Fluency, Relevance (custom criteria)
Ragas RAG-specific evaluation Context Precision, Context Recall, Answer Relevancy

What Evaluation Scores Look Like

Hallucination
Score: 0.85 Threshold: 0.70
Answer Relevancy
Score: 0.92 Threshold: 0.70
Faithfulness
Score: 0.68 Below threshold
Bias
Score: 0.12 Lower is better

LLM-as-Judge: Using AI to Evaluate AI

The most powerful evaluation approach uses one LLM to judge another. This enables evaluation of subjective qualities like coherence and helpfulness:

# G-Eval scoring prompt structure
evaluation_prompt = """
You are evaluating an AI response for {metric}.

Query: {query}
Response: {response}
Context: {context}

Score from 1-10 where:
1-3: Poor - Response fails to address the query
4-6: Acceptable - Response partially addresses the query
7-9: Good - Response fully addresses the query
10: Excellent - Response exceeds expectations

Provide your score and reasoning.
"""

# The judge model provides both score and explanation

The Power: LLM-as-Judge can evaluate arbitrary criteria. Define what “good” means for your use case, and the system measures it. This is impossible with rule-based evaluation.

RAG-Specific Evaluation

When using retrieved context, additional metrics matter:

  • Context Precision: How much of the retrieved context was actually relevant?
  • Context Recall: Did we retrieve everything we needed?
  • Faithfulness: Does the response stay true to the retrieved content?
  • Answer Relevancy: Does the answer actually address the question?
# RAG evaluation checks both retrieval and generation
rag_result = {
    "context_precision": 0.85,  # 85% of retrieved content was useful
    "context_recall": 0.78,     # Found 78% of relevant content
    "faithfulness": 0.92,       # Response closely follows context
    "answer_relevancy": 0.88   # Answer addresses the question
}

The Evaluation Pipeline

A complete evaluation system orchestrates multiple frameworks:

Test Cases
DeepEval
G-Eval
RAG Metrics
Report
# Evaluation pipeline configuration
config = {
    "frameworks": ["deepeval", "geval", "ragas"],
    "metrics": ["relevance", "coherence", "helpfulness"],
    "global_threshold": 0.7,
    "parallel_execution": True,
    "enable_rag_metrics": True,
    "include_recommendations": True
}

Automated Recommendations

Beyond scores, evaluation systems can provide actionable guidance:

Evaluation Recommendations 2 Issues

‘faithfulness’ metric has low pass rate (58%).
Consider reviewing prompt engineering or model configuration.

‘coherence’ shows high variance (range: 0.45).
Consider investigating inconsistent response patterns.

Putting It Together: The Complete Safety Picture

When guardrails and evaluation work together:

  1. Input guardrails sanitize and check incoming requests
  2. Routing selects the appropriate model (covered in Part 1)
  3. Context retrieval provides grounding (covered in Part 2)
  4. Output guardrails check the response for safety
  5. Evaluation scores quality metrics
  6. Logging captures everything for audit (covered in Part 5)

Trust Insight: This layered approach means you can trace any response back through the pipeline. Why was this blocked? What was the faithfulness score? Which guardrails fired? Everything is visible.

The Actual Implementation

The guardrails and evaluation described here aren’t hypothetical—they’re running services in the proof-of-concept. The implementation includes two dedicated microservices with ~1,500 lines of working code.

guardrails-service (Port 8001)

Here’s the actual service initialization:

# Actual implementation: Feature-flagged guardrails
class GuardrailsAIService:
    """
    Primary guardrails service using keyword-based + PII detection

    Features:
    - Data classification (PUBLIC/INTERNAL/CONFIDENTIAL/SECRET)
    - Privacy-based routing recommendations
    - PII detection and redaction
    - Toxicity detection
    - Prompt injection detection
    """

    def __init__(self,
        presidio_enabled: bool = False,
        detoxify_enabled: bool = False,
        bias_enabled: bool = False
    ):
        self.classifier = DataClassifier()

        # Feature-flagged ML detectors
        self.pii_detector = AdvancedPIIDetector(enabled=presidio_enabled)
        self.toxicity_detector = MLToxicityDetector(enabled=detoxify_enabled)
        self.bias_detector = BiasDetector(enabled=bias_enabled)

eval-service (Port 8005)

The evaluation service implements a 1,020-line pipeline supporting G-Eval, RAG metrics, and context-aware threshold adjustment:

# Actual implementation: Context-aware evaluation
class ContextAwareEvaluator:
    """Adjusts thresholds based on domain and sensitivity."""

    DOMAIN_ADJUSTMENTS = {
        "GENERAL": 0.0,
        "MEDICAL": 0.1,   # Higher bar for medical
        "LEGAL": 0.1,
        "FINANCIAL": 0.05
    }

    SENSITIVITY_ADJUSTMENTS = {
        "LOW": -0.05,     # More permissive
        "MEDIUM": 0.0,
        "HIGH": 0.1        # Stricter thresholds
    }

Key Architectural Decisions

  • Feature-flagged ML: Presidio/Detoxify optional—falls back to regex/keywords
  • Fail-open design: Guardrails failures never block responses
  • Tenant-scoped policies: Different teams can have different thresholds
  • Async evaluation: Quality scoring happens post-response

What You Learn Building This

  1. Guardrails are easier to build than to tune. The code is straightforward. Finding the right thresholds for your use case takes iteration.
  2. Context matters enormously. “I’m going to kill it at this presentation” shouldn’t trigger threat detection. Keyword matching doesn’t get this. ML does (mostly).
  3. Transparency builds trust. When you can see exactly why something was flagged, you trust the system more. Even when it’s wrong, you understand why.
  4. Fallbacks are essential. If Presidio isn’t available, fall back to regex. If Detoxify fails, fall back to keywords. Never let the guardrails crash the system.
  5. LLM-as-Judge is powerful. When a vendor says “we use content moderation,” you now know what questions to ask: Is it ML-based or pattern matching? What’s the false positive rate? Can I tune the thresholds?

Coming Up Next

Guardrails block harmful content and evaluation measures quality. But what about understanding why AI makes certain decisions? In the next post, we’ll explore Responsible AI – explainability techniques like LIME and SHAP that make AI reasoning visible.

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.