Evaluating GenAI Applications

ROUGE, BLEU, and BERTScore; human evaluation and benchmarks; and tying model quality to business results.

Generated text has no single "correct answer," so evaluation is harder than classic ML. The exam expects you to know the metric names, what they're for, and when human judgment is required.

Automated metrics

Metric | Measures | Typical task
ROUGE | Overlap of generated text with reference text (recall-oriented) | Summarization
BLEU | Precision of n-gram matches against references | Translation
BERTScore | Semantic similarity using embeddings (meaning, not exact words) | General text quality
Perplexity | How well a language model predicts text (lower = better) | Language modeling
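
To make the recall-versus-precision distinction in the table concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not a reference implementation: the function names (rouge_n_recall, ngram_precision, perplexity) are hypothetical, tokenization is naive whitespace splitting, and full BLEU additionally combines several n-gram orders and applies a brevity penalty. BERTScore is left out because it needs a pretrained embedding model.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(candidate, reference, n=1):
    """Return (clipped matches, candidate n-gram total, reference n-gram total)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    return sum((cand & ref).values()), sum(cand.values()), sum(ref.values())


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N-style recall: matched n-grams / n-grams in the reference."""
    matched, _, ref_total = overlap(candidate, reference, n)
    return matched / ref_total if ref_total else 0.0


def ngram_precision(candidate, reference, n=1):
    """BLEU-style clipped precision: matched n-grams / n-grams in the candidate."""
    matched, cand_total, _ = overlap(candidate, reference, n)
    return matched / cand_total if cand_total else 0.0


def perplexity(token_probs):
    """exp of the average negative log-probability assigned to each token."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


# Toy data (made up for illustration).
reference = "the cat sat on the mat".split()
candidate = "the cat sat".split()

print(rouge_n_recall(candidate, reference))   # 3 of 6 reference unigrams matched
print(ngram_precision(candidate, reference))  # 3 of 3 candidate unigrams matched
print(perplexity([0.25, 0.25, 0.25, 0.25]))   # model was choosing among ~4 options
```

The short candidate scores perfectly on precision but only half on recall, which is why a recall-oriented metric such as ROUGE suits summarization: it penalizes a summary that leaves out reference content.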

Beyond automated scores

  • Human evaluation remains the gold standard for helpfulness, tone, and safety.
  • Benchmark datasets (e.g., MMLU, HELM, and curated test sets) compare models on standard tasks.
  • Amazon Bedrock Model Evaluation runs automatic or human-based evaluations to compare models on your own data.
  • For RAG: also evaluate retrieval quality — are the right documents being found? — and groundedness of answers.
  • Business metrics decide success: task completion rate, deflection rate, user satisfaction (CSAT), average handle time, cost per interaction, conversion/ROI.
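
The retrieval and business bullets above can also be turned into numbers. The sketch below is hypothetical throughout: the Interaction fields, the function names, and the definition of deflection rate used here (share of interactions not handed to a human agent) are assumptions for illustration, and the sample data is made up. Groundedness is not covered because it usually needs human or model-based judgment.

```python
from dataclasses import dataclass


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


@dataclass
class Interaction:
    completed: bool   # the user finished the task (assumed log field)
    escalated: bool   # the conversation was handed to a human agent
    cost_usd: float   # model plus infrastructure cost for this interaction


def business_kpis(interactions):
    """Aggregate simple business metrics over a non-empty list of interactions."""
    n = len(interactions)
    return {
        "task_completion_rate": sum(i.completed for i in interactions) / n,
        "deflection_rate": sum(not i.escalated for i in interactions) / n,
        "cost_per_interaction": sum(i.cost_usd for i in interactions) / n,
    }


# Retrieval check: one of the two relevant documents is in the top 3.
print(recall_at_k(["d3", "d7", "d1", "d9"], ["d1", "d2"], k=3))

# Business check on four made-up interactions.
logs = [
    Interaction(completed=True, escalated=False, cost_usd=0.02),
    Interaction(completed=True, escalated=False, cost_usd=0.04),
    Interaction(completed=False, escalated=True, cost_usd=0.10),
    Interaction(completed=True, escalated=False, cost_usd=0.04),
]
print(business_kpis(logs))
```

Neither function looks at ROUGE or BLEU: retrieval quality is measured against a labeled set of relevant documents, and business success is measured from interaction logs.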

Exam tip

Pure associations to memorize: ROUGE ↔ summarization, BLEU ↔ translation, BERTScore ↔ semantic similarity. And when a question asks how to know if a GenAI project "succeeded for the business," the answer involves business KPIs, not ROUGE.

Knowledge check
Question 1 of 4

Which metric is MOST commonly used to evaluate text SUMMARIZATION quality against reference summaries?