Top-k sampling is a decoding strategy used by language models during text generation to control the diversity of the output. Instead of considering the full vocabulary, it selects the next word from only the k most probable candidates, as determined by the model's predicted probabilities.
A detailed explanation of the steps within top-k sampling follows:
Let p=(p1, p2, …, pV) be the probability distribution over a vocabulary of size V, where pi is the probability that the ith word in the vocabulary is the next word in the sequence.
This probability distribution is computed by the language model, typically by applying a softmax function to the logits (unnormalized scores) the model produces for each word in the vocabulary:

pi = exp(zi) / (exp(z1) + exp(z2) + ⋯ + exp(zV))

where zi is the logit score for the ith word.
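The softmax step can be sketched in plain Python (the function name is illustrative); the max-logit subtraction is a standard trick to avoid overflow in exp:

```python
import math

def softmax(logits):
    # Subtract the largest logit before exponentiating, for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    # Each probability is exp(zi) divided by the sum over the whole vocabulary.
    return [e / total for e in exps]
```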
The top-k sampling method starts by sorting the vocabulary words by their predicted probabilities in descending order.
Let psorted=(p(1), p(2), …, p(V)) be the sorted list of probabilities, where p(1) ≥ p(2) ≥ ⋯ ≥ p(V).
Define k as the top-k parameter, where k ≤ V.
From the sorted list psorted, select the top k words with the highest probabilities. The corresponding set of indices Sk of these words is given by:

Sk = {(1), (2), …, (k)}

that is, the indices of the words carrying the k largest probabilities p(1), p(2), …, p(k).
The selected words form a reduced probability distribution over the top-k words.
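Selecting the index set Sk can be sketched as follows (a minimal illustration; in practice a partial sort such as heapq.nlargest or np.argpartition avoids sorting the full vocabulary):

```python
def top_k_indices(probs, k):
    # Indices of the k highest-probability entries, i.e. the set Sk.
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
```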
Since only the top k words are retained, their probabilities must be renormalized to ensure they sum to 1. The renormalized probability for the ith word in Sk is given by:

p'i = pi / Σj∈Sk pj

where the denominator is the sum of the probabilities of the top-k words.
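The renormalization step is a one-liner per word (a sketch; the function name is illustrative):

```python
def renormalize(probs, indices):
    # Divide each retained probability by the total mass of the top-k set,
    # so the reduced distribution sums to 1.
    total = sum(probs[i] for i in indices)
    return {i: probs[i] / total for i in indices}
```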
Finally, a word is randomly selected from the top-k set Sk according to the renormalized probabilities p'i.
The selected word becomes the next word in the sequence, and the process repeats for the generation of subsequent words.
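Putting all four steps together, one top-k sampling step might be sketched as below (pure Python; the function name and signature are illustrative, not from any particular library):

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    # 1. Softmax over the logits (max-subtraction for numerical stability).
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 2. Keep the indices of the k most probable words (the set Sk).
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # 3. Renormalize the retained probabilities to sum to 1.
    mass = sum(probs[i] for i in top)
    weights = [probs[i] / mass for i in top]
    # 4. Sample one index from Sk according to the renormalized weights.
    return rng.choices(top, weights=weights, k=1)[0]
```

In a generation loop, the returned index is appended to the sequence and the model is queried again for the next set of logits.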
Small k
With k = 1, sampling reduces to greedy decoding: the model always picks the single most probable word, leading to deterministic and often repetitive text. Small values of k in general keep the output conservative and less diverse.
Large k
As k approaches V, the model considers almost the entire vocabulary, introducing more randomness and creativity at the risk of incoherence.
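A quick self-contained experiment makes the contrast concrete (function name and logit values are illustrative): with k = 1 every draw returns the same token, while k = V spreads draws across the vocabulary.

```python
import math
import random

def top_k_sample(logits, k, rng):
    # Softmax, truncate to the k most probable indices, renormalize, sample.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    probs = [e / sum(exps) for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return rng.choices(top, weights=[probs[i] / mass for i in top], k=1)[0]

rng = random.Random(0)
logits = [3.0, 2.5, 2.0, 1.0, 0.5]  # toy 5-word vocabulary
for k in (1, 5):
    draws = {top_k_sample(logits, k, rng) for _ in range(200)}
    print(f"k={k}: {len(draws)} distinct tokens sampled")
```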
Choosing k therefore tunes the trade-off between the diversity and the coherence of the generated text.