Inference Parameters & GenAI Pricing
Temperature, top-p, max tokens — plus how generative AI is priced and the trade-offs that drive model choice.
The knobs on the model
| Parameter | Controls | Low value | High value |
|---|---|---|---|
| Temperature | Randomness of token selection | Focused, near-deterministic, sometimes repetitive; good for facts and code | Creative, varied, riskier; good for brainstorming |
| Top-p (nucleus) | Limits choices to the smallest set of most likely tokens whose cumulative probability reaches p | Conservative word choice | Wider vocabulary variety |
| Top-k | Limits choices to the k most likely tokens | Very constrained | More diverse |
| Max tokens | Ceiling on response length | Short (and cheaper) outputs | Longer (and pricier) outputs |
| Stop sequences | Strings that end generation | — | — |
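To make the table concrete, here is a minimal sketch of how temperature, top-k, and top-p could reshape a next-token distribution before sampling. It is an illustration only, not any provider's actual implementation: the function names and the toy logits are hypothetical, and real models apply these steps over vocabularies of tens of thousands of tokens.

```python
import math
import random


def filtered_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: probability} after temperature, top-k, and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 in this sketch")
    # Temperature divides the logits: below 1 sharpens the distribution, above 1 flattens it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Top-k: keep only the k most likely tokens.
    if top_k is not None:
        ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept
    # Renormalise so the surviving probabilities sum to 1.
    norm = sum(prob for _, prob in ranked)
    return {tok: prob / norm for tok, prob in ranked}


def sample_token(logits, rng=random, **params):
    """Draw one token from the filtered distribution."""
    dist = filtered_distribution(logits, **params)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


# Hypothetical logits for the next word after "The parties signed the ..."
toy_logits = {"contract": 2.0, "agreement": 1.5, "deal": 0.5, "pizza": -1.0}
cold = filtered_distribution(toy_logits, temperature=0.2)
hot = filtered_distribution(toy_logits, temperature=2.0)
print(f"P(contract) at T=0.2: {cold['contract']:.2f}")
print(f"P(contract) at T=2.0: {hot['contract']:.2f}")
```

At the low temperature almost all of the probability sits on the top token, while at the high temperature the unlikely tokens regain a real chance of being picked. Max tokens and stop sequences are not shown: they act on the generation loop (when to stop) rather than on the choice of each token.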
Exam tip
One association answers most parameter questions: temperature = creativity dial. Factual Q&A or code → low temperature. Marketing copy or story ideas → higher temperature. Max tokens caps *length/cost*, not creativity.
How GenAI is priced
Key points
- On-demand (per token): pay for input + output tokens. Output tokens usually cost more per token than input tokens. Best for variable or low volume.
- Provisioned Throughput: reserve model capacity for a fixed hourly fee. Best for steady, high-volume production (and typically required to serve fine-tuned custom models in Amazon Bedrock).
- Bigger models cost more per token and respond more slowly; smaller models are cheaper and faster. Choose the smallest model that meets quality requirements.
- Customization costs stack: prompt engineering (cheapest) < RAG (storage + retrieval) < fine-tuning (training + hosting) < training from scratch (astronomical).
- Self-hosting on SageMaker swaps token fees for instance-hours — more control, more responsibility.
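The on-demand versus provisioned trade-off comes down to simple arithmetic, sketched below. Every number here is a hypothetical placeholder, not a real rate for any model or provider, and the helper names are invented for illustration. The sketch also assumes a 730-hour month and ignores commitment terms and the throughput ceiling of reserved capacity.

```python
import math

HOURS_PER_MONTH = 730  # average month length; an assumption for this sketch


def on_demand_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Per-token billing: input and output tokens are priced separately."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k


def provisioned_cost(hourly_fee, hours=HOURS_PER_MONTH):
    """Reserved capacity: a flat fee per hour, regardless of traffic."""
    return hourly_fee * hours


def break_even_requests(hourly_fee, cost_per_request, hours=HOURS_PER_MONTH):
    """Smallest monthly request count at which provisioned is no more expensive."""
    return math.ceil(provisioned_cost(hourly_fee, hours) / cost_per_request)


# Hypothetical prices, NOT real rates.
PRICE_IN, PRICE_OUT = 0.003, 0.015   # dollars per 1,000 tokens
HOURLY_FEE = 40.0                    # dollars per hour of reserved capacity

# A typical request in this example: 1,000 input tokens and 500 output tokens.
per_request = on_demand_cost(1000, 500, PRICE_IN, PRICE_OUT)
monthly_fixed = provisioned_cost(HOURLY_FEE)
threshold = break_even_requests(HOURLY_FEE, per_request)

print(f"on-demand per request: ${per_request:.4f}")
print(f"provisioned per month: ${monthly_fixed:,.2f}")
print(f"break-even at about {threshold:,} requests per month")
```

Below the break-even volume, paying per token is cheaper; above it, the fixed hourly fee wins, provided the reserved capacity can actually handle that traffic. Lowering max tokens reduces the output-token term in `on_demand_cost`, which is why shorter responses are cheaper.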
Think of it like this
On-demand tokens are taxi fares — perfect for occasional trips. Provisioned Throughput is leasing the car with a driver for your daily commute — cheaper per mile once usage is steady and high.
Knowledge check
Question 1 of 3
A legal team's chatbot must produce consistent, factual answers with minimal randomness. Which parameter change helps most?