Data Governance for AI

Quality, lineage, lifecycle, and access — governing the data that makes or breaks every model.

Models inherit every property of their data — including its flaws. Data governance is the discipline of knowing what data you have, where it came from, who may use it, and whether it's good enough. For the exam, know the concepts and the AWS services that support them.

Governance concepts

Data quality: data is accurate, complete, consistent, and current. Poor-quality data produces poor models.
Data lineage/provenance: the documented origin and transformation history of every dataset; essential for audits and for trusting model behavior.
Data cataloging: a searchable inventory of datasets with metadata and classifications.
Access control: who can read or use which data, enforced with IAM policies and Lake Formation permissions.
Lifecycle and retention: how long data is kept and when it is archived or deleted (for example, via S3 lifecycle policies), in line with regulations and consent.
Bias auditing: checking datasets for representativeness before training, for example with Amazon SageMaker Clarify.
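To make the data quality concept concrete, here is a minimal sketch in plain Python that counts incomplete, duplicate, and stale records in a small dataset. The function name `profile_quality`, the field names, and the sample rows are all hypothetical; a real pipeline would more likely use a profiling tool than hand-written checks.

```python
from datetime import date, timedelta


def profile_quality(rows, required, key, date_field, max_age_days, today):
    """Return simple completeness, duplicate, and freshness counts."""
    # Completeness: a row is incomplete if any required field is None or empty.
    incomplete = sum(
        1 for r in rows if any(r.get(f) in (None, "") for f in required)
    )

    # Consistency: count rows whose key value was already seen.
    seen, duplicates = set(), 0
    for r in rows:
        k = r.get(key)
        if k in seen:
            duplicates += 1
        seen.add(k)

    # Currency: count rows last updated before the cutoff date.
    cutoff = today - timedelta(days=max_age_days)
    stale = sum(1 for r in rows if r[date_field] < cutoff)

    return {
        "rows": len(rows),
        "incomplete": incomplete,
        "duplicates": duplicates,
        "stale": stale,
    }


# Hypothetical sample data for illustration only.
rows = [
    {"id": 1, "label": "cat", "updated": date(2024, 5, 1)},
    {"id": 2, "label": "", "updated": date(2024, 5, 2)},
    {"id": 2, "label": "dog", "updated": date(2023, 1, 1)},
]

report = profile_quality(
    rows,
    required=["id", "label"],
    key="id",
    date_field="updated",
    max_age_days=365,
    today=date(2024, 6, 1),
)
print(report)
```

Accuracy is not covered here because it generally requires comparing against a trusted reference, which this sketch does not assume.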

AWS services for data governance

AWS Glue Data Catalog / Glue DataBrew

Glue Data Catalog stores dataset metadata (schemas, locations, classifications); Glue DataBrew profiles and cleans data visually without code.
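As a rough illustration of what a catalog entry holds, the sketch below builds a table definition and passes it to the Glue `create_table` API through boto3. The helper names, the database and table names, the S3 path, and the `data_sensitivity` parameter key are assumptions made for this example; only the request shape (`DatabaseName`, `TableInput`) comes from the Glue API.

```python
def build_table_input(name, s3_location, columns, sensitivity):
    """Build a Glue TableInput dict describing one dataset.

    `columns` is a list of (name, type) pairs. The `data_sensitivity`
    parameter is a hypothetical custom key, not a Glue built-in.
    """
    return {
        "Name": name,
        "Description": "Training dataset registered for discovery",
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": s3_location,
        },
        "Parameters": {
            "classification": "parquet",
            "data_sensitivity": sensitivity,
        },
    }


def register_table(database, table_input):
    """Register the table in the Glue Data Catalog (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    glue = boto3.client("glue")
    glue.create_table(DatabaseName=database, TableInput=table_input)


# Hypothetical names and path for illustration only.
table_input = build_table_input(
    name="customer_reviews",
    s3_location="s3://example-bucket/curated/customer_reviews/",
    columns=[("review_id", "string"), ("rating", "int"), ("text", "string")],
    sensitivity="internal",
)
print(table_input["StorageDescriptor"]["Columns"])
```

Calling `register_table("example_db", table_input)` would create the entry; it is left uncalled here because it requires an AWS account and an existing database.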

AWS Lake Formation

Build governed data lakes with fine-grained access permissions at the table and column level, plus row- and cell-level filtering.
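A column-level grant might look like the following sketch, which builds the request for the Lake Formation `grant_permissions` API and applies it through boto3. The role ARN, database, table, and column names are placeholders, and the helper functions are hypothetical wrappers written for this example.

```python
def build_column_grant(principal_arn, database, table, columns):
    """Build a grant_permissions request allowing SELECT on chosen columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def apply_grant(grant):
    """Send the grant to Lake Formation (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("lakeformation").grant_permissions(**grant)


# Placeholder identifiers for illustration only.
grant = build_column_grant(
    principal_arn="arn:aws:iam::111122223333:role/AnalystRole",
    database="sales",
    table="orders",
    columns=["order_id", "order_total"],
)
print(grant["Resource"]["TableWithColumns"]["ColumnNames"])
```

In this sketch the analyst role can query `order_id` and `order_total` but no other columns of the table, such as ones holding customer contact details.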

Amazon DataZone

Organization-wide data cataloging, sharing, and governance portal.

SageMaker Feature Store

Central, versioned repository of ML features for consistency between training and inference.

Amazon S3 lifecycle + versioning

Retention, archival, and recoverability for datasets.
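One way to express such a retention policy is sketched below: a lifecycle rule that archives objects, later deletes them, and expires old versions, applied together with bucket versioning through boto3. The bucket name, prefix, rule ID, and day counts are illustrative assumptions, not recommendations.

```python
def build_lifecycle(prefix, archive_after_days, delete_after_days, noncurrent_days):
    """Build an S3 lifecycle configuration for objects under `prefix`."""
    if delete_after_days <= archive_after_days:
        raise ValueError("deletion must come after archival")
    return {
        "Rules": [
            {
                "ID": "dataset-retention",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move current objects to archival storage.
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
                # Delete current objects at the end of the retention period.
                "Expiration": {"Days": delete_after_days},
                # Clean up older versions kept by versioning.
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
            }
        ]
    }


def apply_retention(bucket, lifecycle):
    """Enable versioning and attach the lifecycle rules (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_versioning(
        Bucket=bucket, VersioningConfiguration={"Status": "Enabled"}
    )
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=lifecycle
    )


# Illustrative values only.
lifecycle = build_lifecycle(
    prefix="training-data/",
    archive_after_days=90,
    delete_after_days=365,
    noncurrent_days=30,
)
print(lifecycle["Rules"][0]["Transitions"])
```

Versioning provides recoverability (an overwritten or deleted dataset can be restored), while the lifecycle rule handles archival and deletion on a schedule.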

Exam tip

Keyword map: "fine-grained permissions on a data lake" → Lake Formation; "catalog and discover datasets across the org" → DataZone / Glue Data Catalog; "visually clean and normalize data without code" → Glue DataBrew; "track where training data came from" → lineage/provenance.
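For the lineage/provenance keyword, a minimal sketch of what such a record could contain is shown below: the dataset's source plus an ordered list of transformation steps, each with a content hash so the data at that stage can be verified later. The `LineageRecord` class, the dataset name, and the S3 URI are hypothetical and not tied to any AWS service.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class LineageRecord:
    """Origin and transformation history for one dataset."""

    dataset: str
    source_uri: str
    steps: list = field(default_factory=list)

    def add_step(self, description, data):
        """Append a step with a SHA-256 hash of the data after that step."""
        self.steps.append(
            {
                "step": len(self.steps) + 1,
                "description": description,
                "sha256": hashlib.sha256(data).hexdigest(),
            }
        )


# Hypothetical dataset and source for illustration only.
record = LineageRecord(
    dataset="reviews-train",
    source_uri="s3://example-bucket/raw/reviews.csv",
)

raw = b"id,text\n1,Great\n2,\n"
record.add_step("ingested raw file", raw)

cleaned = b"id,text\n1,Great\n"
record.add_step("dropped rows with empty text", cleaned)

print(json.dumps(asdict(record), indent=2))
```

An auditor reading this record can see where the data came from and which steps changed it, and can recompute the hashes to confirm nothing was altered afterward.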

Knowledge check

An auditor asks a company to prove where its model's training data originated and how it was transformed. What is this record called?
