Checklist: Building Your Data Strategy

Use this checklist when designing a data strategy for an AI-first product, evaluating an existing data pipeline, or assessing whether your data creates a defensible competitive advantage. It draws from the six data strategy failure patterns, the data flywheel framework, and the data moats assessment from Chapter 9.

Data Collection

  • You have validated product-market fit before investing in data infrastructure
  • Collection targets are defined by what improves the model, not by what's easiest to capture
  • Signals are captured from production systems and real user interactions, not just synthetic or test data
  • You have identified the specific edge cases your collection pipeline needs to surface (e.g., Tesla's 0.01% automatic edge case detection)
  • Internal users provide continuous usage data to seed the system before external launch
  • Your data strategy covers all five flywheel components: collection, storage, analysis, application, and feedback
  • You know the point at which adding more data stops improving model performance (see the sketch after this list)
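
One way to locate that saturation point is to train on growing subsets and watch where validation gains flatten. A minimal sketch using scikit-learn's `learning_curve`; the model choice and the `min_gain` threshold are illustrative assumptions, not recommendations from the chapter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

def find_data_saturation(X, y, min_gain=0.005):
    """Return the training-set size after which extra data adds
    less than `min_gain` validation score (illustrative threshold)."""
    sizes, _, val_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=np.linspace(0.1, 1.0, 8), cv=5,
    )
    means = val_scores.mean(axis=1)
    for i in range(1, len(means)):
        if means[i] - means[i - 1] < min_gain:
            return sizes[i - 1]   # gains flattened at this size
    return None                   # still improving; keep collecting
```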

Data Quality

  • Data accuracy is measured and tracked over time (the U.S. average declined from 63.5% to 26.6% between 2021 and 2024)
  • Automated checks detect duplicates, outliers, and missing values in incoming data (see the first sketch after this list)
  • Schema changes (missing columns, altered formats) trigger alerts, not silent failures
  • Data drift detection monitors shifts in input distributions over time (see the second sketch after this list)
  • Model performance is tracked across segments over time, not just in aggregate
  • Your team spends more time training models than cleaning data; if not, collection standards need tightening
  • AI-generated content is flagged and excluded from training sets to prevent model collapse
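
A minimal batch-check sketch in pandas covering the duplicate, missing-value, outlier, and schema items above; the expected columns and the thresholds are assumptions for illustration, not values from the chapter.

```python
import pandas as pd

EXPECTED_COLUMNS = {"user_id", "event", "value", "ts"}  # assumed schema

def check_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of issues; route non-empty results to alerting."""
    issues = []
    missing_cols = EXPECTED_COLUMNS - set(df.columns)
    if missing_cols:
        issues.append(f"schema drift: missing columns {sorted(missing_cols)}")
    dup = int(df.duplicated().sum())
    if dup:
        issues.append(f"{dup} duplicate rows")
    null_rate = df.isna().mean().max()
    if null_rate > 0.05:                        # illustrative threshold
        issues.append(f"max null rate {null_rate:.1%} exceeds 5%")
    if "value" in df.columns:
        z = (df["value"] - df["value"].mean()) / df["value"].std()
        outliers = int((z.abs() > 4).sum())     # crude z-score outlier test
        if outliers:
            issues.append(f"{outliers} outlier values (|z| > 4)")
    return issues
```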
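
For the drift item, one common approach (not necessarily the chapter's) is a two-sample Kolmogorov-Smirnov test comparing a live window of a feature against a frozen reference window:

```python
from scipy.stats import ks_2samp

def drifted(reference, current, alpha=0.01) -> bool:
    """Flag drift when the KS test rejects 'same distribution'.
    `alpha` is an illustrative significance level; tune per feature."""
    statistic, p_value = ks_2samp(reference, current)
    return p_value < alpha
```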

Flywheel Design

  • You can map your data flow through all five components: Collection, Storage, Analysis, Application, Feedback
  • The break point in your flywheel is identified (typically between storage and analysis, or between analysis and application; see the sketch after this list)
  • Your system has network learning, not just individual learning (deleting one user's data would affect other users' experience)
  • Insights from analysis translate into shipped product improvements, not slide decks
  • Feedback loops measure whether improvements actually generate more and better data
  • Deployment velocity is measured: you can ship, measure, and iterate within days, not months
  • Cold start strategy is defined: expert seeding, targeting complexity over volume, or building the loop architecture before data arrives
  • You have tested whether your flywheel compounds over time or plateaus after initial gains
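
One way to make the break point measurable: tag each record with the furthest flywheel stage it reached, then look for the stage with the biggest falloff. This is a hypothetical instrumentation sketch; the logging scheme is my assumption, not the chapter's method.

```python
from collections import Counter

STAGES = ["collection", "storage", "analysis", "application", "feedback"]

def find_break_point(furthest_stage_per_record: list[str]) -> str:
    """Each entry is the last stage a record passed through."""
    counts = Counter(furthest_stage_per_record)
    # reached[i] = records that made it to stage i or beyond
    reached = [sum(counts[s] for s in STAGES[i:]) for i in range(len(STAGES))]
    drops = [(STAGES[i], 1 - reached[i + 1] / reached[i])
             for i in range(len(STAGES) - 1) if reached[i]]
    return max(drops, key=lambda d: d[1])[0]  # stage with the worst falloff

# e.g. most records stall after storage:
print(find_break_point(["storage", "storage", "analysis", "feedback"]))
```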

Moat Assessment

  • Every data asset considered a competitive advantage has been run through the Moat Test (scored in the sketch after this list):
      ◦ Can a competitor buy this data?
      ◦ Can they scrape it?
      ◦ Can they partner for it?
      ◦ Does usage generate more of it?
      ◦ Is it embedded in customer workflows?
  • You distinguish between data you possess (commodity) and data your system generates through usage (compounding advantage)
  • Static datasets are treated as depreciating assets, not durable moats
  • Your moat strategy focuses on at least one of: workflow integration, execution velocity, or systems of intelligence
  • The five conditions for defensible data are evaluated:
      ◦ Continuous refreshment through usage
      ◦ High-quality domain specificity with intelligent curation
      ◦ Data governance that creates procurement advantage
      ◦ Deep workflow integration that creates switching costs
      ◦ Network effects that compound with each user
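
The Moat Test lends itself to a simple scorecard. An illustrative encoding follows; the equal weighting is my assumption, not the chapter's.

```python
from dataclasses import dataclass

@dataclass
class DataAsset:
    can_be_bought: bool
    can_be_scraped: bool
    can_be_partnered_for: bool
    usage_generates_more: bool
    embedded_in_workflows: bool

def moat_score(asset: DataAsset) -> int:
    """0 = commodity data, 5 = compounding advantage (illustrative scale)."""
    return sum([
        not asset.can_be_bought,
        not asset.can_be_scraped,
        not asset.can_be_partnered_for,
        asset.usage_generates_more,
        asset.embedded_in_workflows,
    ])

scraped_corpus = DataAsset(True, True, True, False, False)
usage_exhaust = DataAsset(False, False, False, True, True)
print(moat_score(scraped_corpus), moat_score(usage_exhaust))  # 0 5
```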

Infrastructure

  • Infrastructure complexity matches your current stage, not your aspirational scale
  • You can name the specific bottleneck each infrastructure component solves
  • Teams don't spend days or weeks configuring infrastructure for each new workload
  • Multi-provider strategies with fallback options are in place before reaching production scale (see the failover sketch after this list)
  • Systems can switch between AI providers or degrade gracefully during outages
  • Data is accessible across teams, not siloed in department-specific stores
  • Unit economics are viable: gross margins are above 40% (AI wrappers average 25-60%, vs. 70-90% for traditional SaaS; see the margin check after this list)
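
A minimal failover sketch for the multi-provider items; the provider functions here are hypothetical stubs standing in for real client calls, and the cached fallback is one possible degradation path among several.

```python
def call_primary(prompt: str) -> str:
    raise TimeoutError("primary provider down")   # stub: simulate an outage

def call_fallback(prompt: str) -> str:
    return f"[fallback] {prompt}"                 # stub: second provider

def cached_answer(prompt: str) -> str:
    return "[degraded] serving last known good response"

def complete(prompt: str) -> str:
    for provider in (call_primary, call_fallback):
        try:
            return provider(prompt)
        except Exception:
            continue                              # try the next provider
    return cached_answer(prompt)                  # degrade, don't fail

print(complete("summarize this ticket"))          # -> "[fallback] ..."
```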
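
And a trivial check against the 40% bar; the example revenue and cost figures are made up for illustration.

```python
def gross_margin(revenue: float, cogs: float) -> float:
    return (revenue - cogs) / revenue

# $100k revenue against $55k of inference + infrastructure cost:
print(f"{gross_margin(100_000, 55_000):.0%}")  # 45%, above the 40% bar
```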

Governance

  • Data collection complies with applicable privacy regulations (EU AI Act, industry-specific requirements)
  • Consent and data usage terms are clear to users and legally reviewed
  • Audit trails exist for how data flows through the system and into model training (see the sketch after this list)
  • Access controls define who can read, write, and delete training data
  • Data retention and deletion policies are documented and enforced
  • Governance architecture anticipates regulatory tightening rather than reacting to it
  • Compliance capabilities are treated as competitive advantage in enterprise procurement, not just a cost center
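
A minimal append-only audit-trail sketch covering the lineage item above; the event names, field names, and JSON-lines format are assumptions for illustration.

```python
import json
import time

def audit(event: str, record_id: str, actor: str, path="audit.log"):
    """Append one lineage event per line; never rewrite past entries."""
    entry = {"ts": time.time(), "event": event,
             "record_id": record_id, "actor": actor}
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

audit("collected", "rec-123", "ingest-service")
audit("included_in_training", "rec-123", "train-job-42")
```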

Source frameworks:

  • The 6 Data Strategy Mistakes That Stall Flywheels
  • Building Data Flywheels
  • Data Moats: What's Defensible vs. Replicable

Full chapter: Chapter 9: Data Strategy