
Checklist: Model Selection

Choosing the right AI model for your use case, constraints, and business reality.

Use this checklist before evaluating any model. Gather your technical and business stakeholders, answer every question in sequence, and document your answers. Each section eliminates options -- by the end, your realistic choices should narrow to two or three candidates you can test against actual workloads. Revisit quarterly as the landscape shifts.

Derived from the "6 Questions Before Choosing a Model" and "Foundation Models Landscape" frameworks in Chapter 3.


Use Case Definition

  • Described the specific job you are hiring AI to do in one sentence (if you can't, you aren't ready to evaluate models)
  • Classified the task complexity: simple classification/extraction (doesn't need a frontier model) vs. multi-step reasoning or complex generation (requires frontier capabilities)
  • Identified the primary capability required: code generation, analytical reasoning, creative synthesis, vision/multimodal, long-document processing, or video analysis
  • Matched capability to model strengths: Claude for code and analytical reasoning, GPT for creative synthesis and vision, Gemini for video and long-context, DeepSeek for budget workloads with security controls
  • Documented the known failure modes that matter for your use case: chaotic people-pleaser (GPT), repetitive patterns (Gemini), over-caution (Claude), security vulnerabilities (DeepSeek)
  • Confirmed the use case is specific enough to benchmark -- not a vague "add AI to our product" directive (see the sketch below)
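One way to make "document your answers" concrete is to capture the use-case definition as data you can revisit at the next quarterly review. The following is a minimal Python sketch; the UseCaseSpec fields and the example values are illustrative placeholders, not a required schema.

```python
"""Sketch: document the use-case definition as reviewable data.
All field names and example values are illustrative placeholders."""
from dataclasses import dataclass, field

@dataclass
class UseCaseSpec:
    job_in_one_sentence: str                 # the specific job you are hiring AI to do
    complexity: str                          # "simple classification/extraction" or "multi-step reasoning"
    primary_capability: str                  # e.g., code, analytical reasoning, vision, long-context, video
    failure_modes_to_watch: list[str] = field(default_factory=list)

# Hypothetical example -- replace with your own answers.
spec = UseCaseSpec(
    job_in_one_sentence="Draft first-pass responses to tier-1 support tickets.",
    complexity="multi-step reasoning",
    primary_capability="analytical reasoning",
    failure_modes_to_watch=["over-caution", "people-pleasing hallucination"],
)
print(spec)
```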

Latency Requirements

  • Defined your latency tier: sub-second (real-time voice, fraud detection), 1-3 seconds (chatbots, search), 3-10 seconds (internal tools), or minutes+ (batch processing)
  • Verified that candidate models can meet your latency threshold (e.g., Mistral Large leads time-to-first-token at 0.30s; ChatGPT o1 averages 60.6s -- a 200x difference)
  • Eliminated any models that can't meet your latency requirements regardless of other capabilities
  • For real-time use cases, confirmed that latency trumps marginal capability improvements in your evaluation criteria
  • Tested latency under realistic load conditions, not just single-query benchmarks (see the sketch below)
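A minimal sketch of that last item, assuming an OpenAI-compatible streaming chat endpoint: it measures time-to-first-token and total latency per request while running many requests concurrently, then reports p50/p95. The endpoint URL, model id, concurrency level, request count, and API_KEY environment variable are all placeholders.

```python
"""Latency probe sketch: time-to-first-token and total latency under concurrent load.
Assumes an OpenAI-compatible streaming endpoint; URL, model, and load are placeholders."""
import asyncio
import os
import statistics
import time

import httpx

API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint
MODEL = "candidate-model"                                 # placeholder model id
CONCURRENCY = 20                                          # simulated parallel load (assumption)
REQUESTS = 100

async def probe(client: httpx.AsyncClient, sem: asyncio.Semaphore) -> tuple[float, float]:
    """Return (time_to_first_token, total_latency) in seconds for one streamed request."""
    async with sem:
        start = time.perf_counter()
        first_token = None
        async with client.stream(
            "POST",
            API_URL,
            headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
            json={"model": MODEL, "stream": True,
                  "messages": [{"role": "user", "content": "Summarize: latency matters."}]},
        ) as resp:
            async for line in resp.aiter_lines():
                if line.strip() and first_token is None:
                    first_token = time.perf_counter() - start  # first streamed chunk arrives
        return first_token or 0.0, time.perf_counter() - start

async def main() -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with httpx.AsyncClient(timeout=120) as client:
        results = await asyncio.gather(*(probe(client, sem) for _ in range(REQUESTS)))
    ttft = sorted(r[0] for r in results)
    total = sorted(r[1] for r in results)
    print(f"TTFT   p50={statistics.median(ttft):.2f}s  p95={ttft[int(0.95 * len(ttft))]:.2f}s")
    print(f"Total  p50={statistics.median(total):.2f}s  p95={total[int(0.95 * len(total))]:.2f}s")

if __name__ == "__main__":
    asyncio.run(main())
```

Run it against each candidate at the concurrency you actually expect in production; a model that looks fast on single queries can degrade badly once requests queue.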

Compliance Assessment

  • Identified all regulatory frameworks that apply: HIPAA (healthcare), SOX/financial regulations, FedRAMP (government), GDPR/CCPA (data privacy), industry-specific requirements
  • For regulated industries, treated compliance as the first elimination filter before all other considerations (see the sketch after this list)
  • Verified that candidate models and providers offer the required compliance certifications (e.g., FedRAMP High authorization, Business Associate Agreements)
  • Assessed whether data residency requirements mandate self-hosted or on-premises deployment (eliminates most cloud-only API providers)
  • Evaluated whether open-weight models (Llama, Mistral, self-hosted DeepSeek) are required for full data control in strict jurisdictions
  • For any candidate with known security vulnerabilities (e.g., DeepSeek's 100% jailbreak success rate in testing), confirmed a plan for mitigating controls is in place before deployment
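The "first elimination filter" item can be expressed as a short script so the compliance pass happens before anyone debates capability or price. The certification sets and deployment flags below are illustrative placeholders, not vendor claims; verify each against the provider's current compliance documentation.

```python
"""Compliance as the first elimination filter -- a sketch with illustrative metadata.
Certification and deployment fields are placeholders, not verified vendor claims."""
CANDIDATES = {
    "model-a": {"certs": {"SOC2", "HIPAA-BAA", "FedRAMP-High"}, "self_hostable": False},
    "model-b": {"certs": {"SOC2", "HIPAA-BAA"},                 "self_hostable": False},
    "model-c": {"certs": {"SOC2"},                              "self_hostable": True},
}

REQUIRED_CERTS = {"HIPAA-BAA"}   # e.g., a healthcare workload
REQUIRE_SELF_HOSTING = False     # set True if data residency mandates on-prem deployment

def passes_compliance(meta: dict) -> bool:
    """A candidate survives only if it meets every required certification and deployment constraint."""
    if not REQUIRED_CERTS.issubset(meta["certs"]):
        return False
    if REQUIRE_SELF_HOSTING and not meta["self_hostable"]:
        return False
    return True

shortlist = [name for name, meta in CANDIDATES.items() if passes_compliance(meta)]
print("Survived compliance filter:", shortlist)  # everything else is out, regardless of capability
```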

Cost Analysis

  • Estimated your query volume: under 1,000/day (cost is a rounding error) or 1M+/day (cost is existential)
  • Evaluated the three pricing models against your volume: per-query (low/unpredictable volume), committed capacity (40-80% discounts at high volume), or self-hosted (high upfront, near-zero marginal)
  • Calculated your annual token spend and compared against the cost break-even thresholds: under $50K/year favors APIs, $50K-$500K favors hybrid, above $500K favors self-hosting
  • Accounted for the full cost spectrum: premium tier (Claude Opus at $15 input / $75 output per million tokens), mid tier (GPT/Gemini Pro roughly 64% cheaper), budget tier (DeepSeek as low as $0.07 per million tokens for cached input)
  • For self-hosting, factored in the $150K+ annual overhead for engineering talent and operations on top of compute costs
  • Built a cost model that projects 12 months forward based on expected growth in query volume (see the sketch below)
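Here is a sketch of that 12-month cost model, assuming illustrative traffic, growth, and per-million-token rates (substitute your own contract pricing and measured token counts). It projects annual spend for three API tiers plus a rough self-hosted fixed cost, so the result can be compared against the break-even bands above.

```python
"""12-month cost projection sketch. All volumes, growth rates, and prices below
are illustrative assumptions -- replace them with your own data."""

QUERIES_PER_DAY = 50_000             # starting volume (assumption)
MONTHLY_GROWTH = 0.10                # 10% month-over-month growth (assumption)
TOKENS_IN, TOKENS_OUT = 1_500, 500   # average tokens per query (assumption)

# Illustrative $/million-token rates for the three pricing tiers.
TIERS = {
    "premium api": {"in": 15.00, "out": 75.00},
    "mid api":     {"in": 5.00,  "out": 15.00},
    "budget api":  {"in": 0.27,  "out": 1.10},
}
# $150K+ staffing overhead from the checklist plus an assumed GPU reservation estimate.
SELF_HOST_FIXED = 150_000 + 120_000

def annual_api_cost(rates: dict) -> float:
    """Sum 12 months of API spend with compounding query growth."""
    total, daily = 0.0, QUERIES_PER_DAY
    for _ in range(12):
        monthly_queries = daily * 30
        total += monthly_queries * (TOKENS_IN * rates["in"] + TOKENS_OUT * rates["out"]) / 1e6
        daily *= 1 + MONTHLY_GROWTH
    return total

for name, rates in TIERS.items():
    print(f"{name:<12} ~${annual_api_cost(rates):>12,.0f}/yr")
print(f"{'self-hosted':<12} ~${SELF_HOST_FIXED:>12,.0f}/yr fixed, plus compute at near-zero marginal cost")
# Compare against the break-even bands: under $50K favors APIs, $50K-$500K hybrid, above $500K self-hosting.
```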

Explainability Requirements

  • Determined whether your use case requires knowing why the AI reached a conclusion, not just what it concluded
  • For financial services: confirmed the model can produce audit trails that satisfy regulators
  • For healthcare: confirmed the model supports documentation sufficient for clinical risk analysis
  • For legal: confirmed the model can cite sources and ground responses in reference documents
  • Evaluated whether candidate model architectures support traceability -- some architectures make explainability nearly impossible
  • Decided whether explainability is a hard filter (eliminating models that can't provide it) or a soft preference (see the sketch below)
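To see how the hard-filter versus soft-preference decision changes the outcome, here is a tiny scoring sketch. The candidate names, scores, threshold, and weight are placeholders for your own evaluation data, not a prescribed rubric.

```python
"""Hard filter vs. soft preference -- a sketch of how the choice changes scoring.
Candidate names, scores, threshold, and weight are illustrative placeholders."""

candidates = [
    {"name": "model-a", "capability": 0.92, "explainability": 0.40},
    {"name": "model-b", "capability": 0.85, "explainability": 0.80},
]

EXPLAINABILITY_IS_HARD_FILTER = True   # regulated use case: eliminate, don't trade off
MIN_EXPLAINABILITY = 0.60
SOFT_WEIGHT = 0.3                      # weight used if treated as a preference instead

def score(c: dict) -> float | None:
    if EXPLAINABILITY_IS_HARD_FILTER:
        if c["explainability"] < MIN_EXPLAINABILITY:
            return None                # eliminated outright, no matter how capable
        return c["capability"]
    return (1 - SOFT_WEIGHT) * c["capability"] + SOFT_WEIGHT * c["explainability"]

for c in candidates:
    print(c["name"], score(c))
```

With the hard filter on, model-a is eliminated despite its higher capability score; with a soft preference, it would likely win. Make that choice deliberately, not by default.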

Vendor Strategy

  • Assessed your switching tolerance: if you choose wrong, how painful is it to change providers?
  • Identified potential lock-in vectors: interface lock-in (workflow configurations, prompt libraries embedded in platforms) and organizational friction (user retraining, trust rebuilding)
  • Pre-defined migration triggers: what price increase, performance degradation, or security incident justifies switching? Documented these thresholds now, not under pressure later
  • Evaluated a multi-model strategy: 37% of enterprises already support hybrid approaches because no single model meets all requirements
  • Built or planned routing infrastructure and abstraction layers that allow adding or swapping models without rewriting your application (see the sketch after this list)
  • Considered the startup vs. enterprise path: startups should start with closed APIs and optimize later; enterprises should plan for multi-model from day one
  • Scheduled quarterly re-evaluation of model selection -- the model that was right six months ago may not be right today
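A minimal sketch of the abstraction-layer idea: application code talks to a single ChatModel interface, adapters wrap each vendor, and a router picks (and falls back between) them per task type. Class names, routing keys, and the stubbed adapters are illustrative; real vendor SDK calls would sit behind the same interface.

```python
"""Provider-abstraction sketch: route tasks to models behind one interface so
swapping a vendor means changing one adapter, not the application.
Class names and routing rules are illustrative, not a prescribed architecture."""
from typing import Protocol

class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class VendorAAdapter:
    def complete(self, prompt: str) -> str:
        # Vendor A's SDK/API call would go here; stubbed for the sketch.
        return f"[vendor-a] {prompt[:40]}..."

class VendorBAdapter:
    def complete(self, prompt: str) -> str:
        # Vendor B's SDK/API call would go here; stubbed for the sketch.
        return f"[vendor-b] {prompt[:40]}..."

class Router:
    """Pick a model per task type; fall back to a secondary provider on failure."""
    def __init__(self, routes: dict[str, ChatModel], fallback: ChatModel):
        self.routes, self.fallback = routes, fallback

    def complete(self, task_type: str, prompt: str) -> str:
        model = self.routes.get(task_type, self.fallback)
        try:
            return model.complete(prompt)
        except Exception:
            # Log failures here -- they feed the migration triggers defined above.
            return self.fallback.complete(prompt)

router = Router(
    routes={"code": VendorAAdapter(), "creative": VendorBAdapter()},
    fallback=VendorBAdapter(),
)
print(router.complete("code", "Refactor this function to remove duplication."))
```

Keeping the adapter boundary thin is what makes the migration triggers in this section actionable: switching vendors becomes a one-adapter change plus re-testing, rather than an application rewrite.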

Next step: With all six sections answered, your candidate list should be down to two or three models. Run those candidates against your actual workloads -- not generic benchmarks -- and make the decision based on production performance. Document your selection rationale so you can revisit it at the next quarterly review.