LLM token sampling
- 2025-05-05
Text generation in Large Language Models (LLMs) proceeds by repeatedly choosing the next token from a probability distribution the model produces over its vocabulary. Sampling injects controlled stochasticity into that choice. Pure “greedy” decoding, which uses no sampling, deterministically picks the highest-probability token at every step and can yield outputs that lack creativity or diversity. To mitigate this, sampling techniques such as temperature scaling, token penalties, and truncation strategies (e.g., top-k and top-p/nucleus sampling) are applied to broaden the range of possible outputs, as sketched below.
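
The sketch below, in plain NumPy and not tied to any particular library, shows how two of these techniques can combine: temperature scaling reshapes the distribution over logits, and top-p (nucleus) truncation restricts sampling to the most probable tokens. The function name `sample_next_token` and the example logit values are illustrative, not taken from any real model.

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 1.0, top_p: float = 0.9) -> int:
    # Temperature scaling: higher temperature flattens the distribution,
    # lower temperature sharpens it toward greedy behaviour.
    scaled = logits / max(temperature, 1e-8)

    # Softmax converts the scaled logits into probabilities.
    exp = np.exp(scaled - scaled.max())
    probs = exp / exp.sum()

    # Top-p (nucleus) truncation: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, then renormalise.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()

    # Sample one token id from the truncated, renormalised distribution.
    return int(np.random.choice(kept, p=kept_probs))

# Illustrative logits: very low temperature behaves almost like greedy
# decoding, while temperature 1.0 allows more diverse choices.
logits = np.array([2.0, 1.5, 0.3, -1.0])
print(sample_next_token(logits, temperature=0.1, top_p=0.95))
print(sample_next_token(logits, temperature=1.0, top_p=0.95))
```

Token penalties (e.g., repetition or frequency penalties) fit into the same pipeline: they adjust the logits before the softmax step, typically by down-weighting tokens that have already appeared in the context.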