Evaluating GenAI Applications

ROUGE, BLEU, and BERTScore; human evaluation and benchmarks; and tying model quality to business results.

Generated text has no single "correct answer," so evaluation is harder than classic ML. The exam expects you to know the metric names, what they're for, and when human judgment is required.

Automated metrics

Metric	Measures	Typical task
ROUGE	N-gram overlap between generated text and reference text (recall-oriented)	Summarization
BLEU	N-gram precision against reference texts, with a brevity penalty for short outputs	Translation
BERTScore	Semantic similarity using embeddings (meaning, not exact words)	General text quality
Perplexity	How well a language model predicts text (lower = better)	Language modeling
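
To make the recall-versus-precision distinction in the table concrete, here is a minimal sketch in plain Python. It is a simplification for study purposes, not the official ROUGE or BLEU implementation: it uses whitespace tokenization, a single reference, and no brevity penalty or stemming. The function names (`rouge_n_recall`, `clipped_precision`, `perplexity`) are illustrative choices, not a library API.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Share of reference n-grams that also appear in the candidate."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # matches, clipped to the smaller count
    total = sum(ref.values())
    return overlap / total if total else 0.0


def clipped_precision(candidate, reference, n=1):
    """Share of candidate n-grams that also appear in the reference (the core term in BLEU)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())
    total = sum(cand.values())
    return overlap / total if total else 0.0


def perplexity(token_probs):
    """exp of the average negative log-probability the model assigned to each token."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


reference = "the cat sat on the mat"
candidate = "the cat sat"

print(rouge_n_recall(candidate, reference))     # 3 of 6 reference unigrams are covered
print(clipped_precision(candidate, reference))  # 3 of 3 candidate unigrams are matched
print(perplexity([0.25, 0.25, 0.25, 0.25]))     # model is as unsure as a 4-way guess
```

The short candidate scores perfectly on precision but only half on recall, which is why a recall-oriented metric suits summarization (did the summary cover the reference?) while a precision-oriented one suits translation (is what was produced correct?).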

Beyond automated scores

Human evaluation remains the gold standard for helpfulness, tone, and safety.
Benchmark datasets (e.g., MMLU, HELM, and curated test sets) compare models on standard tasks.
Amazon Bedrock Model Evaluation runs automatic or human-based evaluations to compare models on your own data.
For RAG: also evaluate retrieval quality (are the right documents being found?) and the groundedness of answers (is each claim supported by the retrieved documents?).
Business metrics decide success: task completion rate, deflection rate, user satisfaction (CSAT), average handle time, cost per interaction, conversion/ROI.
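
Retrieval quality for RAG is often summarized with precision@k and recall@k. The sketch below shows one way to compute them, assuming you have a hand-labeled set of relevant document IDs for each test query. The function names and the document IDs are hypothetical, chosen only for illustration.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


# Hypothetical ranked retriever output and hand-labeled relevant documents
retrieved = ["doc7", "doc2", "doc9", "doc4"]
relevant = {"doc2", "doc4", "doc5"}

print(precision_at_k(retrieved, relevant, k=3))  # 1 of the top 3 is relevant
print(recall_at_k(retrieved, relevant, k=3))     # 1 of 3 relevant documents found
```

Low recall@k means the generator never sees the right context, so fixing the prompt or swapping the model is unlikely to help; the retriever needs attention first.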

Exam tip

Pure associations to memorize: ROUGE ↔ summarization, BLEU ↔ translation, BERTScore ↔ semantic similarity. And when a question asks how to know if a GenAI project "succeeded for the business," the answer involves business KPIs, not ROUGE.

Knowledge check

Which metric is MOST commonly used to evaluate text SUMMARIZATION quality against reference summaries?