The ML Lifecycle & Data

From business problem to production model: the ML pipeline stages, data splits, and MLOps basics.

Models don't appear from thin air — they're products of a pipeline. The exam expects you to know the stages in order and what happens in each.

The ML lifecycle

1. Define the business problem: decide whether ML is even the right tool (rule-based logic is cheaper when rules are known and stable).
2. Collect and prepare data: gather, clean, deduplicate, handle missing values, label if needed. Usually the most time-consuming stage.
3. Feature engineering: select and transform the input variables (features) the model will learn from.
4. Train the model: the algorithm adjusts internal parameters to minimize error on training data. Hyperparameters (settings like learning rate) are tuned by you, not learned.
5. Evaluate: measure performance on data the model has never seen.
6. Deploy: host the model for inference (making predictions).
7. Monitor & retrain: watch for drift as real-world data changes; retrain when performance degrades.
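
The difference between learned parameters and hyperparameters in the training stage can be sketched with a toy example. This is a minimal illustration, not a production training loop: the data, the `train` function, and the chosen learning rate and epoch count are all assumptions made for the example.

```python
# Toy training loop: fit y_hat = w * x by gradient descent on mean squared error.
# The data and hyperparameter values below are illustrative assumptions.

def train(xs, ys, learning_rate, epochs):
    """Return the learned weight w for the model y_hat = w * x."""
    w = 0.0  # parameter: adjusted by the algorithm during training
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x, so training should find w near 2

# learning_rate and epochs are hyperparameters: chosen by you, not learned.
w = train(xs, ys, learning_rate=0.05, epochs=200)
print(round(w, 3))
```

Here `w` is what the algorithm learns, while `learning_rate` and `epochs` are settings you pick before training starts.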

Splitting your data

Split	Used for
Training set (~70–80%)	The examples the model learns from
Validation set (~10–15%)	Tuning hyperparameters and comparing model candidates
Test set (~10–15%)	Final, untouched measure of real-world performance
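
A three-way split like the one in the table can be sketched in a few lines. This is one possible approach using only the standard library; the function name `split_dataset`, the 70/15/15 fractions, and the fixed seed are assumptions chosen for illustration.

```python
import random

def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle rows, then cut them into train, validation, and test lists."""
    shuffled = list(rows)                      # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)      # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]          # whatever remains is the test set
    return train, val, test

rows = list(range(100))                        # stand-in for 100 labeled examples
train_set, val_set, test_set = split_dataset(rows)
print(len(train_set), len(val_set), len(test_set))
```

Shuffling before cutting matters when the source data is ordered (for example by date or by class), and the three lists never share a row.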

Watch out

Never evaluate a model on data it trained on — it will look deceptively good. That's like grading students on the exact questions they practiced.
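
To see why this happens, consider a deliberately bad "model" that only memorizes its training examples. The sketch below is a made-up illustration (the helper names and the divisible-by-3 labeling rule are assumptions), but it shows the gap between a training score and a score on unseen data.

```python
# A "model" that memorizes its training examples, to show why scoring on
# training data is misleading. All data here is invented for illustration.

def train_memorizer(examples):
    """Store every (input, label) pair seen during training."""
    return dict(examples)

def predict(model, x, default=0):
    """Return the memorized label, or a fixed guess for unseen inputs."""
    return model.get(x, default)

def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for x, y in examples if predict(model, x) == y)
    return correct / len(examples)

# Label is 1 when the number is divisible by 3, else 0.
train_examples = [(x, int(x % 3 == 0)) for x in range(0, 20)]
test_examples = [(x, int(x % 3 == 0)) for x in range(20, 30)]

model = train_memorizer(train_examples)
train_acc = accuracy(model, train_examples)  # scored on data it has already seen
test_acc = accuracy(model, test_examples)    # scored on data it has never seen
print(train_acc, test_acc)
```

The memorizer scores perfectly on the training examples yet has learned nothing about the underlying rule, which only the held-out score reveals.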

Inference: batch vs real time

Key points

Real-time inference: a hosted endpoint answers individual requests in milliseconds (chatbots, fraud checks at checkout). The endpoint stays running, so it costs more.
Batch inference: run predictions over a large dataset on a schedule (for example, scoring all customers overnight). Cheaper, and suited to cases where results are not needed immediately.
Edge inference: run the model on a local device where connectivity or latency rules out the cloud.
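
The real-time and batch patterns differ mainly in how the model is called. The sketch below uses a hypothetical `score` function as a stand-in for a trained model; in practice the real-time path would sit behind a hosted endpoint and the batch path would run as a scheduled job.

```python
# Sketch of the two calling patterns. `score` is a hypothetical stand-in for a
# trained model, and the threshold of 1000 is an arbitrary illustrative value.

def score(amount):
    """Toy fraud score: flag purchases above an arbitrary threshold."""
    return 1 if amount > 1000 else 0

def handle_request(amount):
    """Real-time style: one input in, one prediction out, right away."""
    return score(amount)

def run_batch(amounts):
    """Batch style: score a whole dataset in one scheduled pass."""
    return [score(a) for a in amounts]

single = handle_request(2500)                # e.g. a fraud check at checkout
nightly = run_batch([50, 1200, 999, 4000])   # e.g. scoring all customers overnight
print(single, nightly)
```

The model logic is identical in both cases; what changes is when it runs and how many inputs it sees per call.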

MLOps applies DevOps discipline to this lifecycle: version the data and models, automate pipelines, monitor for model drift and data drift, and make retraining repeatable. On AWS, Amazon SageMaker Pipelines, Model Registry, and Model Monitor implement these practices.
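
Data drift monitoring can be illustrated with a very simple check: compare a feature's live values against the values seen at training time. This is a simplified sketch, not how any particular monitoring service works; the function names, the sample values, and the threshold of 2.0 are assumptions.

```python
from statistics import mean, stdev

def drift_score(baseline, live):
    """How many baseline standard deviations the live mean has shifted."""
    return abs(mean(live) - mean(baseline)) / stdev(baseline)

def has_drifted(baseline, live, threshold=2.0):
    """Flag drift when the shift exceeds a chosen (hypothetical) threshold."""
    return drift_score(baseline, live) > threshold

# Values of one feature at training time versus two later samples.
baseline = [48, 52, 50, 49, 51, 50, 47, 53]
live_ok = [49, 51, 50, 52, 48]
live_shifted = [70, 72, 68, 71, 69]

print(has_drifted(baseline, live_ok))       # live mean matches the baseline mean
print(has_drifted(baseline, live_shifted))  # live mean has moved far from baseline
```

A check like this could run on a schedule and trigger the retraining step when it fires, which is the kind of repeatable loop MLOps aims for.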

Knowledge check

Which dataset provides the FINAL unbiased estimate of a model's real-world performance?