
Evaluation Prompts

Prompts for evaluating AI model outputs, comparing models, and assessing quality and safety. Use these when selecting models, benchmarking agent performance, or building evaluation pipelines.

Reference: 6 Questions Before Choosing a Model | 7 Failure Modes of Agents

Prompts in This Collection

| Prompt File | Use Case | Related Framework |
| --- | --- | --- |
| Model Comparison Prompts | Comparing outputs from different models on the same task | 6 Questions Before Choosing a Model |
| Quality Evaluation Prompts | Scoring output quality across multiple dimensions | 8 Patterns for AI Coding |
| Safety Evaluation Prompts | Testing for safety, alignment, and policy compliance | 7 AI Risks and Mitigations |

Why Evaluate

From Chapter 3 (The AI Landscape): choosing a model is not a one-time decision. Models improve, pricing changes, and new providers emerge. Structured evaluation lets you make data-driven model decisions instead of relying on benchmarks that may not reflect your use case.

From Chapter 11 (Ethics, Governance, and Risk): safety evaluation is not optional. The companies that avoid AI incidents are the ones that test for failure modes before deploying.

Evaluation Approach

These prompts support three evaluation methods:

  1. Human evaluation -- A person scores model outputs using the criteria in these prompts. Most accurate, least scalable.
  2. Model-as-judge -- A separate AI model scores outputs using these prompts as its rubric. Scalable, requires calibration.
  3. Automated metrics -- Programmatic checks (e.g., "does the output contain SQL injection?"). Most scalable, narrowest coverage; a minimal sketch follows this list.
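
As a concrete illustration of the third method, here is a minimal automated-metric check in Python. The pattern list and the `check_sql_injection` function are hypothetical names invented for this sketch; they are not part of the prompt collection, and the patterns are nowhere near an exhaustive rule set.

```python
import re

# Hypothetical automated check: flag outputs containing common
# SQL-injection fragments. The patterns are illustrative only.
FORBIDDEN_PATTERNS = [
    r"(?i)\bunion\s+select\b",
    r"(?i)\bdrop\s+table\b",
    r"(?i)'\s*or\s+'1'\s*=\s*'1",
]

def check_sql_injection(output: str) -> bool:
    """Return True if the output passes the check (no suspicious fragments)."""
    return not any(re.search(p, output) for p in FORBIDDEN_PATTERNS)

print(check_sql_injection("SELECT name FROM users WHERE id = 1"))       # True
print(check_sql_injection("1' OR '1'='1' UNION SELECT password FROM"))  # False
```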

For production evaluation pipelines, combine all three: automated metrics as a first pass, model-as-judge for batch evaluation, human evaluation for edge cases and calibration.
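
A minimal sketch of that combined pipeline is shown below, under stated assumptions: `judge_fn` stands in for whatever model-as-judge integration you wire up, and the 3.5 threshold is an illustrative cutoff on a 1-5 rubric, not a recommendation from these prompts.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class EvalResult:
    passed_automated: bool
    judge_score: Optional[float] = None   # rubric score, if the judge ran
    needs_human_review: bool = False

def evaluate_output(
    output: str,
    automated_check: Callable[[str], bool],   # e.g. check_sql_injection above
    judge_fn: Callable[[str], float],         # hypothetical model-as-judge call
    judge_threshold: float = 3.5,             # illustrative cutoff on a 1-5 rubric
) -> EvalResult:
    # Stage 1: automated metrics as a cheap first pass.
    if not automated_check(output):
        return EvalResult(passed_automated=False, needs_human_review=True)

    # Stage 2: model-as-judge scores the output against the prompt's rubric.
    score = judge_fn(output)

    # Stage 3: route borderline scores to human evaluation for calibration.
    return EvalResult(
        passed_automated=True,
        judge_score=score,
        needs_human_review=score < judge_threshold,
    )
```

One way to keep the judge itself calibrated, per the note above, is to periodically route a sample of outputs that passed all stages to human review as well.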