# Evaluation Prompts
Prompts for evaluating AI model outputs, comparing models, and assessing quality and safety. Use these when selecting models, benchmarking agent performance, or building evaluation pipelines.
Reference: 6 Questions Before Choosing a Model | 7 Failure Modes of Agents
## Prompts in This Collection
| Prompt File | Use Case | Related Framework |
|---|---|---|
| Model Comparison Prompts | Comparing outputs from different models on the same task | 6 Questions Before Choosing a Model |
| Quality Evaluation Prompts | Scoring output quality across multiple dimensions | 8 Patterns for AI Coding |
| Safety Evaluation Prompts | Testing for safety, alignment, and policy compliance | 7 AI Risks and Mitigations |
## Why Evaluate
From Chapter 3 (The AI Landscape): choosing a model is not a one-time decision. Models improve, pricing changes, and new providers emerge. Structured evaluation lets you make data-driven model decisions instead of relying on benchmarks that may not reflect your use case.
From Chapter 11 (Ethics, Governance, and Risk): safety evaluation is not optional. The companies that avoid AI incidents are the ones that test for failure modes before deploying.
## Evaluation Approach
These prompts support three evaluation methods:
- Human evaluation -- A person scores model outputs using the criteria in these prompts. Most accurate, least scalable.
- Model-as-judge -- A separate AI model scores outputs using these prompts as its rubric. Scalable, requires calibration.
- Automated metrics -- Programmatic checks (e.g., "does the output contain SQL injection?"). Most scalable, narrowest coverage.
For production evaluation pipelines, combine all three: automated metrics as a first pass, model-as-judge for batch evaluation, and human evaluation for edge cases and calibration. A minimal sketch of that flow follows.
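The Python sketch below illustrates one way to wire the three methods together. It is illustrative only: `automated_checks`, `judge_model`, `EvalResult`, and the 0.6 review threshold are hypothetical names and values, and `judge_model` would need to be connected to a real provider using one of the quality or safety evaluation prompts as its rubric.

```python
"""Minimal sketch of a three-stage evaluation pipeline (illustrative assumptions only)."""
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalResult:
    output_id: str
    passed_automated: bool
    judge_score: Optional[float] = None
    needs_human_review: bool = False


def automated_checks(output: str) -> bool:
    """First pass: cheap programmatic checks (here, a naive SQL-injection pattern)."""
    return not re.search(r"(?i)\b(drop\s+table|union\s+select)\b", output)


def judge_model(prompt: str, output: str, rubric: str) -> float:
    """Hypothetical placeholder for a model-as-judge call returning a 0-1 score.

    Replace with a real API call that uses one of the quality or safety
    evaluation prompts as the rubric.
    """
    raise NotImplementedError("wire this to your model provider")


def evaluate(output_id: str, prompt: str, output: str, rubric: str,
             review_threshold: float = 0.6) -> EvalResult:
    """Automated metrics first, model-as-judge second, humans for the edge cases."""
    result = EvalResult(output_id=output_id, passed_automated=automated_checks(output))
    if not result.passed_automated:
        # Hard failures skip the judge and go straight to human review.
        result.needs_human_review = True
        return result
    result.judge_score = judge_model(prompt, output, rubric)
    # Low-confidence judge scores form the human-evaluation queue.
    result.needs_human_review = result.judge_score < review_threshold
    return result
```

Running the cheap checks first keeps judge-model costs proportional to the outputs that pass them, and routing only hard failures and low-confidence scores to humans keeps the manual queue small enough to double as calibration data for the judge.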
## Related Resources
- Coding Prompts -- the prompts being evaluated
- Agent System Prompts -- agent outputs to evaluate
- 6 Questions Before Choosing a Model -- the decision framework these prompts support
- Permission Model Framework -- autonomy levels that safety evals should verify