Evaluating GenAI Applications
ROUGE, BLEU, and BERTScore; human evaluation and benchmarks; and tying model quality to business results.
Generated text has no single "correct answer," so evaluation is harder than classic ML. The exam expects you to know the metric names, what they're for, and when human judgment is required.
Automated metrics
| Metric | Measures | Typical task |
|---|---|---|
| ROUGE | Overlap of generated text with reference text (recall-oriented) | Summarization |
| BLEU | Precision of n-gram matches against references | Translation |
| BERTScore | Semantic similarity computed from contextual embeddings (meaning, not exact words) | General text quality |
| Perplexity | How well a language model predicts text (lower = better) | Language modeling |
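To make the precision/recall distinction in the table concrete, here is a minimal sketch, not a reference implementation. It assumes whitespace tokenization and a single reference, and the function names (`ngrams`, `overlap_scores`, `perplexity`) are hypothetical. Real BLEU combines several n-gram orders and adds a brevity penalty, and real ROUGE has several variants (such as ROUGE-L), so treat this only as an illustration of the core idea.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_scores(candidate, reference, n=1):
    """Clipped n-gram overlap between a candidate and one reference.

    Returns (precision, recall): precision divides by the candidate's
    n-gram count (the BLEU-style view), recall divides by the reference's
    n-gram count (the ROUGE-N-style view).
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    matched = sum((cand & ref).values())  # Counter & keeps the minimum count
    precision = matched / sum(cand.values()) if cand else 0.0
    recall = matched / sum(ref.values()) if ref else 0.0
    return precision, recall


def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


# Toy example: a short summary that is accurate but leaves content out.
reference = "the cat sat on the mat"
candidate = "the cat sat"
precision, recall = overlap_scores(candidate, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every candidate word appears in the reference, so precision is high, but half of the reference is missing, so recall is low. That is why a recall-oriented metric suits summarization, where the concern is whether key content was covered.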
Beyond automated scores
- Human evaluation remains the gold standard for helpfulness, tone, and safety.
- Benchmark datasets (e.g., MMLU, HELM, and curated test sets) compare models on standard tasks.
- Amazon Bedrock Model Evaluation runs automatic or human-based evaluations to compare models on your own data.
- For RAG: also evaluate retrieval quality — are the right documents being found? — and groundedness of answers.
- Business metrics decide success: task completion rate, deflection rate, user satisfaction (CSAT), average handle time, cost per interaction, conversion/ROI.
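The RAG and business-metric bullets above can be sketched as two small calculations. This is an illustrative example under stated assumptions: the function names, the document IDs, and the contact counts are all hypothetical, and recall@k is only one of several ways to score retrieval quality.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def deflection_rate(resolved_without_agent, total_contacts):
    """Share of contacts resolved by the assistant without a human handoff."""
    return resolved_without_agent / total_contacts if total_contacts else 0.0


# Hypothetical evaluation set: for each query, the ranked document IDs the
# retriever returned and the IDs a human reviewer marked as relevant.
eval_set = [
    {"retrieved": ["d1", "d7", "d3"], "relevant": ["d1", "d3"]},
    {"retrieved": ["d4", "d2", "d9"], "relevant": ["d5"]},
]
scores = [recall_at_k(q["retrieved"], q["relevant"], k=3) for q in eval_set]
mean_recall = sum(scores) / len(scores)
print(f"mean recall@3 = {mean_recall:.2f}")

# Hypothetical support numbers: 640 of 1,000 contacts needed no human agent.
print(f"deflection rate = {deflection_rate(640, 1000):.0%}")
```

The two numbers answer different questions: recall@k says whether the retriever is finding the right documents, while deflection rate says whether the application is delivering value to the business.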
Exam tip
Pure associations to memorize: ROUGE ↔ summarization, BLEU ↔ translation, BERTScore ↔ semantic similarity. And when a question asks how to know if a GenAI project "succeeded for the business," the answer involves business KPIs, not ROUGE.
Knowledge check
Question 1 of 4: Which metric is MOST commonly used to evaluate text SUMMARIZATION quality against reference summaries?