The ML Lifecycle & Data
From business problem to production model: the ML pipeline stages, data splits, and MLOps basics.
Models don't appear from thin air — they're products of a pipeline. The exam expects you to know the stages in order and what happens in each.
The ML lifecycle
- Define the business problem — and whether ML is even the right tool (rule-based logic is cheaper when rules are known and stable).
- Collect and prepare data — gather, clean, deduplicate, handle missing values, label if needed. Usually the most time-consuming stage.
- Feature engineering — select and transform the input variables (features) the model will learn from.
- Train the model — the algorithm adjusts internal parameters to minimize error on training data. Hyperparameters (settings like learning rate) are tuned by you, not learned.
- Evaluate — measure performance on data the model has never seen.
- Deploy — host the model for inference (making predictions).
- Monitor & retrain — watch for drift as real-world data changes; retrain when performance degrades.
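The difference between learned parameters and hyperparameters in the training stage can be sketched with a toy example. This is a minimal illustration, not a production training loop: the function names `train` and `mse`, the data, and the chosen values are all assumptions made for the sketch. The weight `w` is learned from the data, while `learning_rate` and `epochs` are hyperparameters you pick.

```python
# Toy model: fit y = w * x by gradient descent on mean squared error.
# `w` is a parameter (learned). `learning_rate` and `epochs` are
# hyperparameters (chosen by you before training starts).

def train(xs, ys, learning_rate=0.01, epochs=200):
    w = 0.0  # parameter: arbitrary starting value, adjusted by training
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w

def mse(w, xs, ys):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Illustrative data generated by y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w_good = train(xs, ys, learning_rate=0.01, epochs=200)
w_bad = train(xs, ys, learning_rate=0.2, epochs=20)  # step too large: diverges

print("well-tuned:", round(w_good, 3), "error:", mse(w_good, xs, ys))
print("badly tuned error:", mse(w_bad, xs, ys))
```

With the smaller learning rate the weight settles close to 2. With the larger one each update overshoots and the error grows, which is why hyperparameters are tuned rather than left to chance.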
Splitting your data
| Split | Used for |
|---|---|
| Training set (~70–80%) | The examples the model learns from |
| Validation set (~10–15%) | Tuning hyperparameters and comparing model candidates |
| Test set (~10–15%) | Final, untouched measure of real-world performance |
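The split in the table can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions: `split_dataset` is a hypothetical helper, the 80/10/10 fractions are one choice within the ranges above, and a random shuffle is assumed to be appropriate (it usually is not for time-ordered data).

```python
import random

def split_dataset(rows, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle rows, then cut them into train, validation, and test lists."""
    shuffled = list(rows)                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed makes the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]          # whatever remains is the test set
    return train, val, test

rows = list(range(1000))                       # stand-in for 1,000 labeled examples
train, val, test = split_dataset(rows)
print(len(train), len(val), len(test))         # 800 100 100
```

Each example lands in exactly one split, so the test set stays untouched until the final evaluation.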
Watch out
Never evaluate a model on data it trained on — it will look deceptively good. That's like grading students on the exact questions they practiced.
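A deliberately extreme sketch shows why. The "model" below is a hypothetical lookup table that memorizes its training examples, and the labels are random coin flips, so there is nothing real to learn. Both the data and the `predict` and `accuracy` helpers are assumptions made up for this illustration.

```python
import random

rng = random.Random(0)
# Hypothetical data: inputs are ids, labels are random 0/1 values.
data = [(i, rng.randint(0, 1)) for i in range(200)]
train, test = data[:150], data[150:]

# A "model" that simply memorizes every training example.
memory = dict(train)

def predict(x):
    # Unseen inputs are not in memory, so the model falls back to guessing 0.
    return memory.get(x, 0)

def accuracy(rows):
    """Fraction of rows where the prediction matches the label."""
    return sum(predict(x) == y for x, y in rows) / len(rows)

print("accuracy on training data:", accuracy(train))  # 1.0
print("accuracy on unseen data:  ", accuracy(test))   # only as good as a guess
```

The perfect training score says nothing about how the model behaves on new inputs. Only the held-out score reflects that.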
Inference: batch vs real time vs edge
Key points
- Real-time inference: a hosted endpoint answers individual requests in milliseconds (chatbots, fraud checks at checkout). The endpoint is always on, so it typically costs more.
- Batch inference: run predictions over a large dataset on a schedule (score all customers overnight). Cheaper, and suited to cases where nobody is waiting on an individual prediction.
- Edge inference: run the model on a local device where connectivity or latency rules out the cloud.
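The real-time and batch modes use the same model and differ only in how it is called. The sketch below is a rough illustration: `predict` is a hypothetical stand-in for a trained model, and `handle_request` and `run_batch_job` are made-up names, not part of any AWS API.

```python
def predict(features):
    """Hypothetical stand-in model: flag large transactions for review."""
    return "review" if features["amount"] > 1000 else "approve"

def handle_request(features):
    """Real-time style: one request in, one prediction out, right away."""
    return predict(features)

def run_batch_job(records):
    """Batch style: score a whole dataset in a single scheduled run."""
    return [predict(r) for r in records]

# Real time: a single checkout is scored while the customer waits.
print(handle_request({"amount": 2500}))

# Batch: yesterday's transactions are scored together overnight.
nightly = [{"amount": 40}, {"amount": 1800}, {"amount": 999}]
print(run_batch_job(nightly))
```

In practice the real-time path sits behind an always-on endpoint, while the batch path starts, processes its input, and shuts down.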
MLOps applies DevOps discipline to this lifecycle: version the data and models, automate pipelines, monitor for model drift and data drift, and make retraining repeatable. On AWS, Amazon SageMaker Pipelines, Model Registry, and Model Monitor implement these practices.
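To make data drift concrete, here is a minimal sketch of one possible check: compare the mean of a feature in live traffic against the training baseline, measured in baseline standard deviations. This is an assumption-laden illustration, not how SageMaker Model Monitor works internally. The function names, the sample values, and the threshold of 2.0 are all hypothetical.

```python
import statistics

def mean_shift_in_stdevs(baseline, live):
    """How far the live mean sits from the baseline mean, in baseline stdevs."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(statistics.mean(live) - mu) / sigma

def has_drifted(baseline, live, threshold=2.0):
    """Flag drift when the shift exceeds the (hypothetical) threshold."""
    return mean_shift_in_stdevs(baseline, live) > threshold

# Feature values seen at training time (illustrative numbers)
baseline = [48, 50, 52, 49, 51, 50, 47, 53]

live_ok = [49, 51, 50, 52]        # looks like the training data
live_shifted = [70, 72, 68, 71]   # the real world has changed

print(has_drifted(baseline, live_ok))       # False
print(has_drifted(baseline, live_shifted))  # True
```

A check like this would run on a schedule against recent inference inputs, and a positive result would be a signal to investigate and possibly retrain.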
Knowledge check
Question 1 of 4: Which dataset provides the FINAL unbiased estimate of a model's real-world performance?