Inference Parameters & GenAI Pricing

Temperature, top-p, max tokens — plus how generative AI is priced and the trade-offs that drive model choice.

The knobs on the model

| Parameter | Controls | Low value | High value |
| --- | --- | --- | --- |
| Temperature | Randomness of token selection | Focused, deterministic, repetitive; good for facts/code | Creative, varied, riskier; good for brainstorming |
| Top-p (nucleus) | Limits choices to tokens covering p probability mass | Conservative word choice | Wider vocabulary variety |
| Top-k | Limits choices to the k most likely tokens | Very constrained | More diverse |
| Max tokens | Ceiling on response length | Short (and cheaper) outputs | Longer (and pricier) outputs |
| Stop sequences | Strings that end generation | | |

Exam tip

One association answers most parameter questions: temperature = creativity dial. Factual Q&A or code → low temperature. Marketing copy or story ideas → higher temperature. Max tokens caps *length/cost*, not creativity.
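
To make the dials concrete, here is a minimal sketch of how temperature, top-k, and top-p could reshape a next-token distribution. It is an illustration under stated assumptions, not how any particular model or service implements sampling: the function names and the logits are hypothetical, and the order of the filters (top-k, then top-p) is just one common choice.

```python
import math
import random

def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: probability} after temperature scaling and top-k / top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 in this sketch")
    # Temperature divides the logits before softmax: <1 sharpens, >1 flattens.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda pair: pair[1], reverse=True)
    # Top-k: keep only the k most likely tokens.
    if top_k is not None:
        ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept
    # Renormalise so the surviving probabilities sum to 1.
    norm = sum(prob for _, prob in ranked)
    return {tok: prob / norm for tok, prob in ranked}

def sample(distribution, rng=random):
    """Draw one token according to the filtered probabilities."""
    tokens = list(distribution)
    return rng.choices(tokens, weights=[distribution[t] for t in tokens], k=1)[0]

# Hypothetical logits for the word that follows "The contract is ..."
logits = {"binding": 2.0, "valid": 1.5, "void": 0.5, "purple": -1.0}
low = next_token_distribution(logits, temperature=0.2)    # mass piles onto "binding"
high = next_token_distribution(logits, temperature=2.0)   # mass spreads out
nucleus = next_token_distribution(logits, top_p=0.8)      # unlikely tokens are dropped
```

Comparing `low` and `high` shows why low temperature suits factual answers: the most likely token dominates, so repeated runs tend to agree. Max tokens and stop sequences do not appear here because they only decide when generation ends, not which token is picked.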

How GenAI is priced

Key points

  • On-demand (per token): pay for input + output tokens. Output tokens usually cost more. Best for variable or low volume.
  • Provisioned Throughput: reserve model capacity for a fixed hourly fee. Best for steady, high-volume production (and typically required to serve fine-tuned custom models).
  • Bigger models cost more per token and respond more slowly; smaller models are cheaper and faster. Choose the smallest model that meets quality requirements.
  • Customization costs stack: prompt engineering (cheapest) < RAG (storage + retrieval) < fine-tuning (training + hosting) < training from scratch (astronomical).
  • Self-hosting on SageMaker swaps token fees for instance-hours: more control, more responsibility.
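
The on-demand arithmetic from the first bullet can be sketched in a few lines. The prices below are hypothetical placeholders chosen only to show output tokens costing more than input tokens; real prices vary by model and region. The function names are also illustrative.

```python
# Hypothetical prices in USD per 1,000 tokens (assumptions, not real list prices).
INPUT_PRICE_PER_1K = 0.003
OUTPUT_PRICE_PER_1K = 0.015

def on_demand_cost(input_tokens, output_tokens,
                   input_price=INPUT_PRICE_PER_1K,
                   output_price=OUTPUT_PRICE_PER_1K):
    """Cost of one request: input and output tokens are billed at separate rates."""
    return (input_tokens / 1000) * input_price + (output_tokens / 1000) * output_price

def worst_case_cost(input_tokens, max_tokens, **prices):
    """Max tokens caps output length, so it also caps the output side of the bill."""
    return on_demand_cost(input_tokens, max_tokens, **prices)

# A 2,000-token prompt that produces a 500-token answer.
typical = on_demand_cost(input_tokens=2000, output_tokens=500)

# The same prompt with a tight versus a generous max-tokens ceiling.
capped_short = worst_case_cost(input_tokens=2000, max_tokens=200)
capped_long = worst_case_cost(input_tokens=2000, max_tokens=4000)

print(f"typical request:      ${typical:.4f}")
print(f"worst case, 200 cap:  ${capped_short:.4f}")
print(f"worst case, 4000 cap: ${capped_long:.4f}")
```

Swapping in a smaller model's (lower) per-token prices through the `input_price` and `output_price` arguments shows the "smallest model that meets quality requirements" rule as a direct cost reduction.
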
Think of it like this

On-demand tokens are taxi fares — perfect for occasional trips. Provisioned Throughput is leasing the car with a driver for your daily commute — cheaper per mile once usage is steady and high.
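
The taxi-versus-lease comparison comes down to a break-even volume. The sketch below assumes a hypothetical hourly fee and per-request cost, a 730-hour month, and that the reserved capacity can actually absorb the traffic; none of these figures come from a real price list.

```python
import math

HOURS_PER_MONTH = 730  # assumption: average number of hours in a month

def monthly_on_demand(requests_per_month, cost_per_request):
    """Taxi fares: the bill scales with every request."""
    return requests_per_month * cost_per_request

def monthly_provisioned(hourly_fee, hours=HOURS_PER_MONTH):
    """Leased car: a fixed fee regardless of how many requests are served."""
    return hourly_fee * hours

def break_even_requests(hourly_fee, cost_per_request, hours=HOURS_PER_MONTH):
    """Smallest monthly request count at which provisioned is no more expensive."""
    return math.ceil(monthly_provisioned(hourly_fee, hours) / cost_per_request)

# Hypothetical figures for illustration only.
hourly_fee = 20.0          # USD per hour of reserved capacity
cost_per_request = 0.0135  # USD per on-demand request

threshold = break_even_requests(hourly_fee, cost_per_request)
print(f"Provisioned pays off from about {threshold:,} requests per month")
```

Below the threshold the fixed fee is mostly wasted, which is why on-demand suits variable or low volume; above it, every additional request lowers the effective cost per request.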

Knowledge check
Question 1 of 3

A legal team's chatbot must produce consistent, factual answers with minimal randomness. Which parameter change helps most?