# Evaluation Prompts
Prompts for evaluating AI model outputs, comparing models, and assessing quality and safety. Use these when selecting models, benchmarking agent performance, or building evaluation pipelines.
Reference: 6 Questions Before Choosing a Model | 7 Failure Modes of Agents
## Prompts in This Collection
| Prompt File | Use Case | Related Framework |
|---|---|---|
| Model Comparison Prompts | Comparing outputs from different models on the same task | 6 Questions Before Choosing a Model |
| Quality Evaluation Prompts | Scoring output quality across multiple dimensions | 8 Patterns for AI Coding |
| Safety Evaluation Prompts | Testing for safety, alignment, and policy compliance | 7 AI Risks and Mitigations |
## Why Evaluate
From Chapter 3 (The AI Landscape): choosing a model is not a one-time decision. Models improve, pricing changes, and new providers emerge. Structured evaluation lets you make data-driven model decisions instead of relying on benchmarks that may not reflect your use case.
From Chapter 11 (Ethics, Governance, and Risk): safety evaluation is not optional. The companies that avoid AI incidents are the ones that test for failure modes before deploying.
## Evaluation Approach
These prompts support three evaluation methods:
- Human evaluation -- A person scores model outputs using the criteria in these prompts. Most accurate, least scalable.
- Model-as-judge -- A separate AI model scores outputs using these prompts as its rubric. Scalable, requires calibration.
- Automated metrics -- Programmatic checks (e.g., "does the output contain SQL injection?"). Most scalable, narrowest coverage.
For production evaluation pipelines, combine all three: automated metrics as a first pass, model-as-judge for batch evaluation, and human evaluation for edge cases and calibration. A minimal sketch of that flow follows.
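The Python sketch below illustrates one way to wire the three methods together. It is illustrative only: `automated_checks`, `judge_model`, `EvalResult`, and the 0.6 review threshold are hypothetical names and values, and `judge_model` would need to be connected to a real provider using one of the quality or safety evaluation prompts as its rubric.

```python
"""Minimal sketch of a three-stage evaluation pipeline (illustrative assumptions only)."""
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalResult:
    output_id: str
    passed_automated: bool
    judge_score: Optional[float] = None
    needs_human_review: bool = False


def automated_checks(output: str) -> bool:
    """First pass: cheap programmatic checks (here, a naive SQL-injection pattern)."""
    return not re.search(r"(?i)\b(drop\s+table|union\s+select)\b", output)


def judge_model(prompt: str, output: str, rubric: str) -> float:
    """Hypothetical placeholder for a model-as-judge call returning a 0-1 score.

    Replace with a real API call that uses one of the quality or safety
    evaluation prompts as the rubric.
    """
    raise NotImplementedError("wire this to your model provider")


def evaluate(output_id: str, prompt: str, output: str, rubric: str,
             review_threshold: float = 0.6) -> EvalResult:
    """Automated metrics first, model-as-judge second, humans for the edge cases."""
    result = EvalResult(output_id=output_id, passed_automated=automated_checks(output))
    if not result.passed_automated:
        # Hard failures skip the judge and go straight to human review.
        result.needs_human_review = True
        return result
    result.judge_score = judge_model(prompt, output, rubric)
    # Low-confidence judge scores form the human-evaluation queue.
    result.needs_human_review = result.judge_score < review_threshold
    return result
```

Running the cheap checks first keeps judge-model costs proportional to the outputs that pass them, and routing only hard failures and low-confidence scores to humans keeps the manual queue small enough to double as calibration data for the judge.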
## Related Resources
- Coding Prompts -- the prompts being evaluated
- Agent System Prompts -- agent outputs to evaluate
- 6 Questions Before Choosing a Model -- the decision framework these prompts support
- Permission Model Framework -- autonomy levels that safety evals should verify