Inference Parameters & GenAI Pricing

Temperature, top-p, max tokens — plus how generative AI is priced and the trade-offs that drive model choice.


The knobs on the model

Parameter	Controls	Low value	High value
Temperature	Randomness of token selection	Focused, near-deterministic, sometimes repetitive; good for facts and code	Creative, varied, riskier; good for brainstorming
Top-p (nucleus)	Limits choices to the smallest set of most likely tokens whose cumulative probability reaches p	Conservative word choice	Wider vocabulary variety
Top-k	Limits choices to the k most likely tokens	Very constrained	More diverse
Max tokens	Ceiling on response length	Short (and cheaper) outputs	Longer (and pricier) outputs
Stop sequences	Strings that end generation	—	—
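
A minimal sketch of how the three sampling knobs in the table could reshape a next-token distribution. The function name, the logits, and the order in which the filters are applied are assumptions for illustration; real providers may combine these parameters differently.

```python
import math


def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    if temperature <= 0:
        raise ValueError("this sketch needs temperature > 0")

    # Temperature: divide logits before softmax. Values below 1 sharpen the
    # distribution, values above 1 flatten it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # Top-k: keep only the k most likely tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept

    # Renormalize so the surviving tokens sum to 1.
    norm = sum(prob for _, prob in ranked)
    return {tok: prob / norm for tok, prob in ranked}


# Hypothetical logits for four candidate next tokens.
logits = {"the": 2.0, "a": 1.0, "this": 0.5, "zebra": -1.0}
cold = next_token_distribution(logits, temperature=0.2)
hot = next_token_distribution(logits, temperature=2.0)
top2 = next_token_distribution(logits, top_k=2)
nucleus = next_token_distribution(logits, top_p=0.9)
```

With these made-up logits, the low-temperature run puts almost all of the probability on "the", the high-temperature run spreads it across all four tokens, and both top-k and top-p remove the unlikely "zebra" from consideration entirely.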

Exam tip

One association answers most parameter questions: temperature = creativity dial. Factual Q&A or code → low temperature. Marketing copy or story ideas → higher temperature. Max tokens caps *length/cost*, not creativity.
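
To illustrate why max tokens and stop sequences affect only where a response ends, here is a toy sketch that cuts a pre-generated token stream. The function name `finish_generation` and the returned reason labels are hypothetical, and whether the stop sequence itself appears in the output varies by provider; this sketch assumes it is dropped.

```python
def finish_generation(tokens, max_tokens, stop_sequences=()):
    """Cut a token stream at max_tokens or at the first stop sequence."""
    text = ""
    for count, token in enumerate(tokens, start=1):
        text += token
        # Stop sequences: end as soon as any of them appears in the text.
        for stop in stop_sequences:
            index = text.find(stop)
            if index != -1:
                return text[:index], "stop_sequence"
        # Max tokens: a hard ceiling on how many tokens are emitted.
        if count >= max_tokens:
            return text, "max_tokens"
    return text, "end_of_stream"


# Hypothetical token stream the model would have produced.
tokens = ["Answer", ":", " 42", "\n", "Question", ":"]
stopped = finish_generation(tokens, max_tokens=10, stop_sequences=["\n"])
capped = finish_generation(tokens, max_tokens=2)
```

Neither setting changes which tokens the model prefers. They only decide how much of the stream is returned, which is why they drive length and cost rather than creativity.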

How GenAI is priced

Key points

On-demand (per token): pay for input and output tokens. Output tokens usually cost more per token than input tokens. Best for variable or low volume.
Provisioned Throughput: reserve model capacity for a fixed hourly fee. Best for steady, high-volume production. On Amazon Bedrock it has also generally been required for serving fine-tuned custom models.
Bigger models cost more per token and respond slower; smaller models are cheaper and faster. Choose the smallest model that meets quality requirements.
Customization costs stack: prompt engineering (cheapest) < RAG (storage + retrieval) < fine-tuning (training + hosting) < training from scratch (astronomical).
Self-hosting on SageMaker swaps token fees for instance-hours — more control, more responsibility.
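
The on-demand versus Provisioned Throughput trade-off above comes down to simple arithmetic, sketched below. Every price here is a made-up placeholder, not a real AWS rate, and the function names and the 730-hour month are assumptions for illustration.

```python
def on_demand_monthly_cost(input_tokens, output_tokens,
                           price_in_per_1k, price_out_per_1k):
    """Per-token billing: input and output tokens are priced separately."""
    return (input_tokens / 1000 * price_in_per_1k
            + output_tokens / 1000 * price_out_per_1k)


def provisioned_monthly_cost(hourly_fee, units=1, hours=730):
    """Reserved capacity: a flat hourly fee, regardless of tokens used."""
    return hourly_fee * units * hours


# Hypothetical prices: $0.003 per 1k input tokens, $0.015 per 1k output
# tokens, and $20 per hour for one unit of reserved capacity.
PRICE_IN, PRICE_OUT, HOURLY = 0.003, 0.015, 20.0

# Light workload: 100M input and 20M output tokens per month.
light = on_demand_monthly_cost(100_000_000, 20_000_000, PRICE_IN, PRICE_OUT)
# Heavy workload: 3B input and 600M output tokens per month.
heavy = on_demand_monthly_cost(3_000_000_000, 600_000_000, PRICE_IN, PRICE_OUT)
reserved = provisioned_monthly_cost(HOURLY)

print(f"light on-demand: ${light:,.0f}")
print(f"heavy on-demand: ${heavy:,.0f}")
print(f"provisioned:     ${reserved:,.0f}")
```

Under these placeholder numbers the light workload is far cheaper on demand, while the heavy workload costs more on demand than the reserved capacity would. A real comparison would also need to check that the reserved capacity can actually handle the traffic.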

Think of it like this

On-demand tokens are taxi fares — perfect for occasional trips. Provisioned Throughput is leasing the car with a driver for your daily commute — cheaper per mile once usage is steady and high.

Knowledge check

Question 1 of 3

A legal team's chatbot must produce consistent, factual answers with minimal randomness. Which parameter change helps most?
