Data Governance for AI
Quality, lineage, lifecycle, and access — governing the data that makes or breaks every model.
Models inherit every property of their data — including its flaws. Data governance is the discipline of knowing what data you have, where it came from, who may use it, and whether it's good enough. For the exam, know the concepts and the AWS services that support them.
Governance concepts
- Data quality — accurate, complete, consistent, current. Bad quality → bad models, no exceptions.
- Data lineage/provenance — documented origin and transformation history of every dataset; essential for audits and for trusting model behavior.
- Data cataloging — searchable inventory of datasets with metadata and classifications.
- Access control — who can read/use which data, enforced with IAM and Lake Formation permissions.
- Lifecycle & retention — how long data lives, when it's archived/deleted (lifecycle policies), honoring regulations and consent.
- Bias auditing — checking datasets for representativeness before training (Amazon SageMaker Clarify).
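To make the quality and bias-auditing bullets above more concrete, here is a minimal plain-Python sketch of two checks: completeness of a field, and how each group is represented in the data. It does not use SageMaker Clarify or any AWS API; the function names and the toy rows are hypothetical and for illustration only.

```python
from collections import Counter

def completeness(rows, field):
    """Fraction of rows where `field` is present and non-empty."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)

def group_shares(rows, field):
    """Share of each value of `field`, ignoring rows where it is missing."""
    counts = Counter(r[field] for r in rows if r.get(field) not in (None, ""))
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

# Hypothetical toy dataset, not real data.
rows = [
    {"age": 34, "region": "north"},
    {"age": None, "region": "north"},
    {"age": 51, "region": "south"},
    {"age": 29, "region": "north"},
]

age_completeness = completeness(rows, "age")   # 3 of 4 rows have an age
region_shares = group_shares(rows, "region")   # north is 3 of 4, south is 1 of 4
```

A skewed `region_shares` result like this one is the kind of signal a representativeness check is meant to surface before training.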
AWS services for data governance
- AWS Glue (Data Catalog and DataBrew): catalog datasets; profile and clean data visually without code.
- AWS Lake Formation: build governed data lakes with fine-grained (table/column) access permissions.
- Amazon DataZone: organization-wide data cataloging, sharing, and governance portal.
- Amazon SageMaker Feature Store: central, versioned repository of ML features for consistency between training and inference.
- Amazon S3 lifecycle policies and versioning: retention, archival, and recoverability for datasets.
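As a rough illustration of the fine-grained (column-level) permissions that Lake Formation provides, the sketch below builds the request for a column-level `SELECT` grant in the shape accepted by boto3's Lake Formation `grant_permissions` call. The helper name, role ARN, database, table, and column names are all placeholders, and the block only builds the request rather than sending it.

```python
def column_grant_request(principal_arn, database, table, columns):
    """Build keyword arguments for a column-level SELECT grant."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }

# Placeholder identifiers, assumed for illustration only.
request = column_grant_request(
    "arn:aws:iam::111122223333:role/AnalystRole",
    "sales_db",
    "customers",
    ["customer_id", "region"],
)

# With AWS credentials configured, this could be sent as:
#   boto3.client("lakeformation").grant_permissions(**request)
```

The principal here can read only the two listed columns; any other column in the table stays hidden from it.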
Keyword map: "fine-grained permissions on a data lake" → Lake Formation; "catalog and discover datasets across the org" → DataZone / Glue Data Catalog; "visually clean and normalize data without code" → Glue DataBrew; "track where training data came from" → lineage/provenance.
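To make "lineage/provenance" concrete, a lineage record can be pictured as a small structure that stores where a dataset came from, each transformation applied, and a fingerprint of the current content. This is a minimal sketch under assumed names (`LineageRecord`, the bucket path, the dataset name); it is not how any particular AWS service stores lineage.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class LineageRecord:
    dataset: str
    source: str                      # where the data originated
    content_sha256: str              # fingerprint of the current content
    transformations: list = field(default_factory=list)

    def add_step(self, description, new_content):
        """Record a transformation and re-fingerprint the resulting bytes."""
        self.transformations.append(description)
        self.content_sha256 = hashlib.sha256(new_content).hexdigest()

# Hypothetical raw data and source location.
raw = b"id,age\n1,34\n2,\n"
record = LineageRecord(
    dataset="customers_v1",
    source="s3://example-bucket/raw/customers.csv",
    content_sha256=hashlib.sha256(raw).hexdigest(),
)

# One cleaning step: the row with a missing age is dropped.
cleaned = b"id,age\n1,34\n"
record.add_step("dropped rows with missing age", cleaned)
```

An auditor reading `record` can see the origin, the ordered list of transformations, and a hash that ties the record to the exact bytes used for training.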
An auditor asks a company to prove where its model's training data originated and how it was transformed. What is this record called?