Structured Outputs & Evaluation

Master quality control for production agents through structured output validation and comprehensive evaluation frameworks

Part 1: Structured Outputs
From Free Text to Machine-Readable Data
🎯 Why Structure Matters
LLMs naturally generate unstructured text. But production systems need predictable, typed, validated outputs that can trigger downstream actions, integrate with APIs, populate databases, and flow through software pipelines. Structured outputs transform agents from conversational tools into reliable system components.
❌ Unstructured Problem
Agent says "Create a high-priority ticket for John about the login bug." How does your system parse this? What if it says "urgent" instead of "high"?
✅ Structured Solution
Agent outputs: {"assignee": "john@company.com", "priority": "high", "category": "bug", "title": "Login issue"}. Perfect for API calls!
❌ Unstructured Problem
"The meeting should probably be next Tuesday around 2pm with Sarah and Mike." Ambiguous, unparseable.
✅ Structured Solution
{"date": "2025-10-28", "time": "14:00", "duration_minutes": 60, "attendees": ["sarah@co", "mike@co"]}. Calendar API ready!
🔧 Pydantic: Type-Safe Output Validation
Simple Schema
Basic type validation with required fields
from pydantic import BaseModel

class TicketOutput(BaseModel):
    title: str
    priority: str  # high, medium, low
    assignee: str
    estimated_hours: int

# Validates types automatically
ticket = TicketOutput(**llm_output)
Advanced Validation
Constraints, enums, and custom validators
from pydantic import BaseModel, field_validator
from typing import List
from enum import Enum

class Priority(str, Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

class Ticket(BaseModel):
    title: str
    priority: Priority
    tags: List[str]
    hours: int

    @field_validator('hours')
    @classmethod
    def check_hours(cls, v):
        if v < 0 or v > 100:
            raise ValueError('hours must be between 0 and 100')
        return v
Nested Structures
Complex hierarchical data models
from typing import List
from pydantic import BaseModel

class Action(BaseModel):
    action_type: str
    owner: str
    deadline: str

class MeetingSummary(BaseModel):
    title: str
    date: str
    attendees: List[str]
    action_items: List[Action]
    key_decisions: List[str]

# Validates entire nested structure
summary = MeetingSummary(**llm_output)
Error Handling
Graceful validation failure management
from pydantic import ValidationError

try:
    ticket = TicketOutput(**llm_output)
    # Success: use validated ticket
    create_ticket_in_system(ticket)
    
except ValidationError as e:
    # Validation failed
    print(f"Errors: {e.errors()}")
    # Ask LLM to try again with error details
    retry_with_feedback(e.errors())
🔄 Structured Output Generation Flow
1
📋
Define Schema
Create a Pydantic model with the required fields, types, and constraints
2
🤖
Provide to LLM
Send the schema as a JSON format instruction via function calling or the system prompt
3
✅
Validate Output
Parse the LLM response through the Pydantic model to check types and constraints
4
🔁
Retry if Failed
If validation fails, provide the error details to the LLM and request a corrected output (see the sketch below)
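The loop below is a minimal sketch of this four-step flow. It assumes Pydantic v2 (for model_json_schema()) and an OpenAI-style chat client, and reuses the TicketOutput model from above; the model name and helper names (generate_ticket, SYSTEM_PROMPT) are placeholders, not part of any specific framework.

import json
from pydantic import ValidationError

SYSTEM_PROMPT = (
    "Extract a support ticket from the user's message. "
    "Respond with JSON only, matching this schema: "
    + json.dumps(TicketOutput.model_json_schema())
)

def generate_ticket(client, user_message: str, max_retries: int = 3) -> TicketOutput:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ]
    for _ in range(max_retries):
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder: any model that follows JSON instructions
            messages=messages,
            response_format={"type": "json_object"},
        )
        raw = response.choices[0].message.content
        try:
            # Step 3: validate the raw output against the schema
            return TicketOutput(**json.loads(raw))
        except (json.JSONDecodeError, ValidationError) as err:
            # Step 4: feed the errors back and request a corrected output
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"Your JSON was invalid: {err}. Return corrected JSON only.",
            })
    raise RuntimeError("Could not produce a valid ticket after retries")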
Part 2: Agent Evaluation
Measuring Performance & Driving Improvement
📦
Final Response Evaluation
(Black Box Testing)
Examine only the end result without looking at internal steps. Like judging a restaurant by the final dish: you don't care how it was made, just whether it's good (a code sketch follows below).
βœ… Best For
  • End-to-end testing
  • User experience validation
  • Quality benchmarking
  • A/B testing different agents
❌ Limitations
  • Can't debug failures
  • No visibility into reasoning
  • Hard to identify specific issues
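A minimal sketch of black-box evaluation, assuming a run_agent(question) -> str entry point and a handful of hand-written reference cases; the keyword check is deliberately simple and is often replaced by an LLM judge (covered later in this section).

test_cases = [
    {"input": "Reset my password",
     "must_contain": ["reset link", "email"]},
    {"input": "What's your refund policy?",
     "must_contain": ["30 days"]},
]

def evaluate_final_responses(run_agent, cases) -> float:
    passed = 0
    for case in cases:
        answer = run_agent(case["input"]).lower()
        # Black box: look only at the final answer, not the steps taken
        if all(phrase in answer for phrase in case["must_contain"]):
            passed += 1
    return passed / len(cases)

# e.g. accuracy = evaluate_final_responses(run_agent, test_cases)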
🎯
Single-Step Evaluation
(Unit Testing)
Test individual decisions in isolation: did the agent choose the right tool? Did it format arguments correctly? Like testing each ingredient separately (a code sketch follows below).
βœ… Best For
  • Tool selection accuracy
  • Argument validation
  • Fast iteration during development
  • Targeted debugging
❌ Limitations
  • Doesn't validate full task completion
  • Misses multi-step coordination issues
  • Can pass tests but fail in practice
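A sketch of single-step checks written as plain unit tests. It assumes the agent exposes an intermediate choose_tool(message) step that returns the tool name and arguments it would call; that interface, the agent object, and the tool names are illustrative, and TicketOutput is reused from Part 1.

def test_chooses_calendar_tool_for_scheduling():
    tool_call = agent.choose_tool("Book a meeting with Sarah next Tuesday at 2pm")
    assert tool_call.name == "create_calendar_event"

def test_ticket_arguments_are_well_formed():
    tool_call = agent.choose_tool("Urgent: login page is broken, assign it to John")
    assert tool_call.name == "create_ticket"
    # Reuse the production Pydantic schema to validate the generated arguments
    ticket = TicketOutput(**tool_call.arguments)
    assert ticket.priority == "high"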
🛤️
Trajectory Evaluation
(Flight Recorder)
Trace the complete execution path: every tool call, argument, result, and decision. Like a flight recorder capturing the entire journey, not just takeoff and landing (a code sketch follows below).
βœ… Best For
  • Deep debugging
  • Understanding agent reasoning
  • Comparing execution paths
  • Root cause analysis
❌ Limitations
  • Most complex to implement
  • Expensive (requires full logging)
  • Slower to run
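A sketch of trajectory scoring, assuming every step of a run is logged as a (tool, arguments) record; it compares the observed tool sequence against an expected reference path and reports how closely they match.

from dataclasses import dataclass

@dataclass
class Step:
    tool: str
    arguments: dict

def trajectory_match(observed: list[Step], expected_tools: list[str]) -> dict:
    observed_tools = [step.tool for step in observed]
    return {
        "exact_match": observed_tools == expected_tools,
        # In-order match: every expected tool appears, extra steps are tolerated
        "in_order": _is_subsequence(expected_tools, observed_tools),
        "extra_steps": max(0, len(observed_tools) - len(expected_tools)),
    }

def _is_subsequence(needle: list[str], haystack: list[str]) -> bool:
    it = iter(haystack)
    return all(tool in it for tool in needle)

# e.g. trajectory_match(logged_steps, ["search_kb", "create_ticket", "send_email"])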
πŸ“ Four Critical Evaluation Dimensions
🎯
1. Task Completion
  • Did the agent accomplish what the user asked?
  • Was the goal fully achieved or only partially?
  • Are there any unfinished sub-tasks?
  • Did the agent recognize when it couldn't complete something?
✨
2. Quality Control
  • Is the output in the correct format?
  • Did the agent follow instructions properly?
  • Was context used appropriately (not ignored or misapplied)?
  • Are there factual errors or hallucinations?
🔧
3. Tool Interaction
  • Did the agent choose the right tools for the task?
  • Were tool arguments valid and well-formed?
  • Did tool calls return useful results?
  • Was the tool sequence logical and efficient?
⚡
4. System Metrics
  • How fast did the agent respond (latency)?
  • How many tokens were consumed (cost)?
  • What's the failure/error rate?
  • Did the agent hit context window limits?
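One way to make these four dimensions concrete is to log them together for every evaluated run; the record below is an illustrative structure, not a standard schema.

from dataclasses import dataclass, field

@dataclass
class EvalRecord:
    # 1. Task completion
    task_completed: bool
    unfinished_subtasks: list[str] = field(default_factory=list)
    # 2. Quality control
    format_valid: bool = True
    quality_score: float = 0.0       # e.g. 1-5 from an LLM judge
    # 3. Tool interaction
    correct_tool_rate: float = 0.0   # fraction of steps that used the right tool
    invalid_argument_count: int = 0
    # 4. System metrics
    latency_seconds: float = 0.0
    total_tokens: int = 0
    error: str | None = None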
🤖 LLM-as-Judge: Automated Evaluation
How It Works
  • Use a separate LLM to evaluate agent outputs
  • Provide rubric, reference answer, actual output
  • LLM scores on dimensions (1-5 or pass/fail)
  • Can evaluate subjective qualities like tone, helpfulness
  • Much faster and cheaper than human evaluation
Best Practices
  • Use stronger model as judge (e.g., GPT-4 judging GPT-3.5)
  • Provide clear scoring criteria and examples
  • Combine with rule-based checks for objectivity
  • Validate judge accuracy with human spot-checks
  • Store judgments for analysis and improvement
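A minimal LLM-as-judge sketch, again assuming an OpenAI-style client; the rubric wording, model name, and 1-5 scale are placeholders to adapt to your task.

import json

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Rubric: 5 = fully correct and helpful, 3 = partially correct, 1 = wrong or unhelpful.
Question: {question}
Reference answer: {reference}
Assistant answer: {answer}
Respond with JSON: {{"score": <1-5>, "reason": "<one sentence>"}}"""

def judge(client, question: str, reference: str, answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # use a stronger model than the one being judged
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, answer=answer
            ),
        }],
        response_format={"type": "json_object"},
    )
    # Parse and return {"score": ..., "reason": ...}; store it for later analysis
    return json.loads(response.choices[0].message.content)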
🔄 The Continuous Improvement Loop
1
📊
Measure
Run evaluation across all dimensions using test cases and real traffic
2
🔍
Analyze
Identify patterns in failures, bottlenecks, and opportunities for improvement
3
🛠️
Improve
Fix tool descriptions, prompts, validation rules, or add missing capabilities
4
✅
Validate
Re-run evaluation to confirm improvements, then deploy and repeat (a minimal harness sketch follows below)
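The measure and validate steps can be wired into a tiny regression harness; the sketch below assumes the evaluate_final_responses helper from the black-box sketch earlier and two agent variants to compare.

def regression_check(run_agent_before, run_agent_after, cases, min_gain: float = 0.0) -> bool:
    baseline = evaluate_final_responses(run_agent_before, cases)
    candidate = evaluate_final_responses(run_agent_after, cases)
    print(f"baseline={baseline:.2%}  candidate={candidate:.2%}")
    # Only ship the change if it does not regress on the test suite
    return candidate >= baseline + min_gain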
💡 Key Insights from Evaluation
  • Unclear tool descriptions → Agent chooses wrong tools or uses them incorrectly
  • Excessive token usage → Need summarization, better memory management, or shorter prompts
  • Inefficient tool sequences → Agent needs better planning or more direct paths to goals
  • Low quality scores → Improve prompts, add examples, or use stronger base model
  • High failure rates → Add retry logic, better error handling, or fallback strategies
  • Context window overflow → Implement sliding window or summarization for memory