Speculative Sampling

Last modified 2024-09-13 14:16:16

First published 2024-09-13 14:15:37

Speculative Sampling (LLM Series | Training & Inference Acceleration)

This post provides an overview, implementation, and time complexity analysis of DeepMind's paper Accelerating Large Language Model Decoding with Speculative Sampling.

Code for this blog post can be found at github.com/jaymody/speculative-sampling.

EDIT (Apr 13th, 2023): Updated code and time complexity to avoid the extra forward pass of the draft model (credits to KexinFeng).

Autoregressive Sampling

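The standard way of generating text from a language model is autoregressive sampling, which the paper also defines as an algorithm: run the model on the tokens so far, sample the next token from the distribution at the last position, append it, and repeat until N new tokens have been generated.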

In code:

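import numpy as np

def autoregressive_sampling(x, model, N):
    # sample(probs) is assumed to draw one token id from a probability vector
    n = len(x)
    T = len(x) + N  # target length: prompt tokens plus N new tokens

    while n < T:
        # model(x) has shape [seq_len, vocab_size]; row -1 is the next-token distribution
        x = np.append(x, sample(model(x)[-1]))
        n += 1

    return x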

Where:

x is a list of integers representing the token ids of the input text.
model is a language model (like GPT-2) that accepts as input a list of token ids of length seq_len and outputs a matrix of probabilities of shape [seq_len, vocab_size].
N is the number of tokens we want to decode.
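
To make the interface above concrete, here is a minimal runnable sketch. The names toy_model, VOCAB_SIZE, and rng are hypothetical and not from the original post, and the sample helper shown here is only one plausible way to draw a token id from a probability vector. The toy model is not a real language model; it simply returns a [seq_len, vocab_size] matrix whose rows sum to 1, which is all the loop requires.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 8  # hypothetical toy vocabulary size


def sample(probs):
    # Assumed helper: draw one token id from a probability vector [vocab_size].
    return rng.choice(len(probs), p=probs)


def toy_model(x):
    # Hypothetical stand-in for a language model. Returns a matrix of shape
    # [seq_len, vocab_size]; each row is a softmax over fixed pseudo-logits.
    logits = np.sin(np.outer(np.asarray(x) + 1, np.arange(1, VOCAB_SIZE + 1)))
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)


def autoregressive_sampling(x, model, N):
    n = len(x)
    T = len(x) + N
    while n < T:
        # Only the last row (next-token distribution) is used at each step.
        x = np.append(x, sample(model(x)[-1]))
        n += 1
    return x


prompt = np.array([1, 2, 3])
out = autoregressive_sampling(prompt, toy_model, N=5)
print(out)  # the 3 prompt tokens followed by 5 sampled token ids
```

Any callable with the same input and output shapes could be swapped in for toy_model, for example a wrapper around a real GPT-2 forward pass followed by a softmax.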

The time complexity of this algorithm is $O(N \cdot t_{\text{model}})$:

$N$: The number of iterations of our while loop, which is just the number of tokens to decode.
$t_{\text{model}}$: The time complexity of each iteration of the loop, which is just the time taken for a single forward pass of our model.
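
The N factor can be checked empirically by counting forward passes. The sketch below is an illustration under stated assumptions: CountingModel is a hypothetical wrapper (not from the original post) that returns a uniform distribution and increments a counter on every call, and sample is an assumed helper. It counts passes only; it does not measure wall-clock time.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample(probs):
    # Assumed helper: draw one token id from a probability vector.
    return rng.choice(len(probs), p=probs)


class CountingModel:
    # Hypothetical model stub that records how many forward passes were made.
    def __init__(self, vocab_size=8):
        self.vocab_size = vocab_size
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        # Uniform distribution at every position: shape [seq_len, vocab_size].
        return np.full((len(x), self.vocab_size), 1.0 / self.vocab_size)


def autoregressive_sampling(x, model, N):
    n = len(x)
    T = len(x) + N
    while n < T:
        x = np.append(x, sample(model(x)[-1]))
        n += 1
    return x


model = CountingModel()
out = autoregressive_sampling(np.array([0, 1]), model, N=10)
print(model.calls)  # one forward pass per decoded token
```

Decoding N = 10 tokens triggers exactly 10 calls to the model, so if each call costs $t_{\text{model}}$, the total cost is $N \cdot t_{\text{model}}$.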