Skip to main content

LLM inference parameters

Inference parameters are the settings you pass with an LLM request to control how the model generates responses. They do not change the model weights. Instead, they impact the decoding process, such as:

  • How the next token is selected
  • How long the model can keep generating
  • When it should stop
  • How much repetition is allowed.

Some parameters mainly affect output quality and style, while others have direct serving implications for scheduling, throughput, memory usage, and production cost.

Common inference parameters

You will see these parameters in hosted APIs, OpenAI-compatible servers, inference frameworks like vLLM and SGLang, and agentic frameworks. Here is a quick summary of the common ones:

ParameterWhat it controlsCommon use
temperatureRandomness in token selectionLower for stable answers, higher for creative writing
top_pCumulative probability mass considered for samplingLimit sampling to a likely set of tokens
top_kMaximum number of candidate tokens consideredRemove very low-ranked tokens
max_tokensMaximum number of output tokensBound latency and cost
min_tokensMinimum number of output tokensAvoid responses that stop too early
stop / stop_token_idsText or token patterns that end generationStop before delimiters, sections, or tool boundaries
presence_penaltyPenalizes tokens that already appearedEncourage new topics or wording
frequency_penaltyPenalizes tokens based on repeat frequencyReduce repeated words or phrases
repetition_penaltyPenalizes repeated prompt or output tokensCommon in open-source serving stacks
seedRandom seed for samplingImprove reproducibility during testing
logprobsToken probability detailsDebug, score, or inspect outputs
n / best_ofNumber of candidate outputsGenerate alternatives, at higher cost

Not every provider supports every field and the exact names may vary. Even when the field name is the same, behavior can differ across models and frameworks. Treat these configurations as part of your evaluation surface, not as portable guarantees.

Temperature

temperature controls how much the model spreads probability across likely and unlikely tokens before sampling.

Lower temperature makes the probability distribution sharper. In other words, the model is more likely to choose the highest-probability token, so outputs become more stable and predictable.

Higher temperature flattens the distribution. Less likely tokens get more chance to appear, which can make outputs more varied, surprising, or creative.

Common patterns:

  • Use low temperature for factual question answering, extraction, classification, and structured workflows.
  • Use moderate temperature for chat, summarization, and product copy if you can accept some variation.
  • Use higher temperature for brainstorming, fiction, naming, and other creative tasks.
  • Use temperature: 0 or near-zero values when you want greedy or near-deterministic decoding.

Low temperature does not guarantee factual accuracy. It only reduces randomness in the decode step. A model can still give a confident wrong answer if the prompt lacks grounding, the model lacks knowledge, or the application does not verify outputs.

Top-p and top-k sampling

top_p and top_k limit which tokens are eligible before sampling.

Top-p

top_p, also called nucleus sampling, keeps the smallest set of tokens whose cumulative probability reaches a threshold. For example, top_p: 0.9 means the sampler considers the most likely tokens that together cover about 90% of the probability mass.

This adapts to the model's uncertainty. If the next token is obvious, the candidate set may be small. If many tokens are plausible, the set grows.

Top-k

top_k keeps only the k most likely tokens. For example, top_k: 50 means the model samples from the top 50 candidates and ignores everything below them.

This is simple and predictable, but it does not adapt to the shape of the probability distribution. Sometimes the top 50 tokens contain too many weak options. Sometimes there may be more than 50 reasonable options.

The visualizer below shows the same distribution under both filters. Switch between a peaky, mixed, and flat distribution, and notice how top-p keeps fewer tokens when one answer dominates and more when many are plausible. Top-k always keeps the same number, regardless of shape.

Top-p vs top-k
Same distribution, different filters. Top-p adapts to the shape; top-k doesn't.
Next token after "I love eating"
top-p0.90
pizza
22%
sushi
17%
pasta
14%
spicy
11%
ice
10%
bread
10%
fruits
8%
salad
8%
7 of 8 kept · 92% mass
top-k3
pizza
22%
sushi
17%
pasta
14%
spicy
11%
ice
10%
bread
10%
fruits
8%
salad
8%
3 of 8 kept · 53% mass

Which one should you tune

Many systems let you use temperaturetop_p, and top_k together. This can be useful, but it also makes behavior harder to reason about.

A practical starting point:

  • Tune temperature first.
  • Use greedy decoding if you want the model to select the highest probability token at each step. It is deterministic in a fixed serving setup, but can be prone to repetition.
  • Use top_p when you want to bound the long tail and keep sampling adaptive.
  • Use top_k when your inference framework or model family recommends it, or when you need a hard cap on candidate tokens.
  • Combine top_k and top_p: In many samplers, top-k removes the low-ranked tail first, then top-p refines the candidate set further.
  • Avoid changing all of them at once during evaluation.

Output length

Length parameters directly affect latency and cost because LLMs generate one token at a time during decode.

max_tokens sets the maximum number of tokens the model can produce. This is one of the most important production controls. If it is too low, responses get cut off. If it is too high, bad prompts or edge cases can waste GPU time and increase tail latency.

min_tokens asks the model to generate at least a certain number of tokens before it can stop. Use it carefully. It can help avoid empty or overly short responses, but it can also force the model to keep writing after the natural answer is done.

Good defaults depend on the application:

  • Short classification or extraction: Small max_tokens, often below 100.
  • Customer support answer: Enough room for a complete answer, but bounded to avoid rambling.
  • Code generation or long-form writing: Larger limits, with stronger monitoring for latency and cost.
  • Batch jobs: Explicit length limits so one bad input does not dominate the run.

For inference systems, output length also affects scheduling. Long generations hold active request state longer, consume KV cache longer, and can interfere with latency-sensitive traffic.

Stop sequences

Stop sequences tell the server to end generation when specific text appears. Some APIs use string stops such as stop, while lower-level engines may also support token IDs such as stop_token_ids.

They are useful when the output has a clear boundary:

  • Stop at "\n\nUser:" in a chat transcript format.
  • Stop at "</json>" or another delimiter in a structured prompt.
  • Stop after one list item, one SQL statement, or one tool call.

Stop sequences are not a replacement for schema enforcement. They only end generation when a sequence appears. If you need guaranteed JSON, use structured outputs or constrained decoding when your provider supports it.

Be careful with common substrings. A stop sequence that appears inside normal content can cut off valid answers.

Repetition penalties

Penalty parameters modify token scores based on what has already appeared. They are a practical tool for reducing loops and repetitive phrasing.

  • presence_penalty penalizes a token if it has appeared at all. Higher values encourage the model to introduce new tokens or topics.
  • frequency_penalty penalizes tokens more as they appear more often. This is useful when the model repeats the same word or phrase too many times.
  • repetition_penalty penalizes tokens that appeared in the prompt or generated text. Values above 1 usually discourage repetition, while values below 1 encourage it.

These parameters can help, but they are blunt instruments. If the model repeats itself because the prompt is ambiguous, the context is noisy, or the task asks for repetitive output, penalties may only hide the symptom. Too much penalty can also make writing awkward because the model avoids legitimate repeated terms.

Multiple candidates

Some APIs can return multiple completions for one prompt.

n usually means the number of outputs returned. best_of usually means the number of candidates generated internally before returning the best n.

This can help when you want several creative options or when another system will score candidates. The trade-off is cost. If you ask for five candidates, the system may do close to five times the generation work. best_of can be even more expensive because it may generate candidates that are never returned.

Use them deliberately:

  • Good fit: Brainstorming, reranking, test-time selection, evaluation data generation.
  • Poor fit: High-volume production requests since every extra token matters.
  • Risky fit: Latency-sensitive chat, because candidate generation can increase tail latency.

Reproducibility and log probabilities

LLM sampling involves randomness. As mentioned above, at each step the model picks the next token probabilistically based on parameters like temperature and top_p. seed initializes the random number generator that drives this sampling. With the same seed, identical inputs, identical parameters, and a stable serving setup, you can often get the same or near-identical output. This is useful for testing prompts, comparing model versions, or debugging an unexpected output.

However, do not treat seeds as a perfect production guarantee. Reproducibility can still change when the model version, tokenizer, serving framework, hardware kernels, batching behavior, or floating-point implementation changes.

logprobs returns probability information for generated tokens. Some systems also support prompt_logprobs, which reports probability information for prompt tokens.

These fields are useful for:

  • Inspecting why a model chose a token.
  • Building confidence heuristics.
  • Comparing candidate completions.
  • Debugging classification prompts.
  • Measuring how strongly the model prefers a constrained label.

They can increase response size and are not always supported by chat APIs. Enable them when you need the signal, not by default for every production request.

Advanced controls

More advanced inference parameters can restrict or alter token selection directly. Some common ones include:

  • logit_bias increases or decreases the score of specific tokens. This can nudge the model toward or away from specific words, labels, or formatting markers.
  • bad_words blocks certain word sequences.
  • allowed_token_ids restricts generation to a specific token set. These are powerful but easy to misuse because tokenization does not always match human-visible words.

Use advanced controls when the output is consumed by software. For example, extraction pipelines, function calling, and structured data generation usually need stronger guarantees than prompt instructions alone can provide.

There is no universal best parameter configuration. Good defaults depend on the task, the model family, and the serving stack. Different models can behave very differently even with the same settings.

Still, the following ranges are useful starting points for evaluation:

Use caseTemperatureTop-pmax_tokensNotes
Classification0.0–0.21.0Small, often < 20Prefer deterministic output
Extraction / structured parsing0.0–0.21.0SmallMinimize variation and formatting drift
RAG / factual QA0.1–0.50.9–1.0ModerateLower randomness may reduce hallucinations
General chat assistant0.5–0.80.9–1.0ModerateBalanced stability and variation
Summarization0.2–0.70.9–1.0ModerateDepends on how extractive vs. creative you want the summary to be
Code generation0.0–0.31.0Moderate to largeLower temperature usually improves syntax stability
Brainstorming / ideation0.7–1.20.9–0.95Moderate to largeEncourage more diverse outputs
Creative writing0.8–1.30.9–0.95LargeHigher diversity, but also higher instability

These are starting points, not rules. Always evaluate parameters with your actual prompts, model versions, and workloads.

For production systems, parameter tuning is also an infrastructure concern. For example, higher output lengths and more exploratory sampling can increase latency, GPU utilization, KV cache pressure, and tail latency under load.

FAQs

Are inference parameters the same as model hyperparameters?

No. Hyperparameters usually refer to settings used during training, such as learning rate or batch size. Inference parameters are request-time settings used when generating outputs from an already trained model.

Why did my output stop early?

Common causes include a low max_tokens, a stop sequence appearing in normal text, the model producing an end-of-sequence token, or a provider-side safety or length limit.

Do all models support the same parameters?

No. Hosted providers, OpenAI-compatible servers, and open-source engines expose different subsets. Some parameters may be ignored, rejected, or implemented differently depending on the serving stack.