Prefix Caching Explained: Efficient Cross-Request Reuse of the KV Cache

Latest recommended article published on 2026-05-08 21:15:19

Tags

#Artificial Intelligence #vLLM #KV Cache #AI

A video walkthrough of how Prefix Caching works is available here: https://www.bilibili.com/video/BV1jgTRzSEjS

This is the third article in the vLLM series. It explains how Prefix Caching is implemented in vLLM.

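1 What Is Prefix Caching

Prefix Caching is an inference optimization for large language models. The core idea is to keep the KV Cache computed for earlier requests (for example, previous turns of a conversation) so that later requests sharing the same prefix can reuse those intermediate results instead of recomputing them. This can significantly reduce time to first token and improve overall inference efficiency. Prefix Caching is especially useful in workloads with heavy prefix reuse, such as multi-turn chat and question answering over long documents.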

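Typical use cases for Prefix Caching in LLM inference include:

Few-shot learning: many requests contain the same few-shot examples and differ only in the final question. Prefix Caching lets them reuse the KV Cache of those examples, so the shared examples are not recomputed for every request.
Self-consistency: for a single question, several reasoning paths are sampled (the request is repeated multiple times) and the most consistent answer is selected. All of these requests share the same prefix (the question), so Prefix Caching lets each sample reuse the cached KV of the question and compute only the diverging answer.
Multi-turn chat: every turn of a conversation builds on the chat history so far. Prefix Caching lets each turn reuse the KV Cache of the earlier history and compute only the newly added question and answer.
Tree-of-thought: in complex reasoning tasks, a problem is decomposed into branches, each of which may branch further. Every branch shares the preceding search history as its prefix. Prefix Caching lets all branches share the cache of the common history and compute only the increment that is specific to each branch.

Prefix Caching only shortens the time spent processing the prompt (the prefill stage); it does not shorten the time spent generating new tokens (the decode stage).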
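To make the prefill-only saving concrete, here is a minimal sketch that counts how many leading tokens two prompts share and how many tokens the second prompt would still have to prefill. It is an illustration only, not vLLM code: the function name `shared_prefix_len` and the toy token IDs are assumptions made for this example, and it ignores the block granularity a real engine would work with.

```python
from typing import Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Return the number of leading token IDs that two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Toy token IDs (hypothetical): both prompts start with the same 6-token few-shot prefix.
few_shot = [11, 12, 13, 14, 15, 16]
prompt_a = few_shot + [21, 22]
prompt_b = few_shot + [31, 32, 33]

# If prompt_a was served first, prompt_b can reuse the KV of the shared prefix.
cached = shared_prefix_len(prompt_a, prompt_b)
to_prefill = len(prompt_b) - cached
print(f"reusable prefix tokens: {cached}, tokens still to prefill: {to_prefill}")
```

The decode stage is unaffected in this picture: every generated token is still produced one step at a time, whether or not the prompt was served from cache.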

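2 How PagedAttention and Prefix Caching Relate

PagedAttention addresses how the KV Cache is allocated "on demand" in GPU memory. Its paging mechanism allows the KV Cache to be stored non-contiguously and to grow dynamically, which greatly reduces memory fragmentation and makes memory management efficient.
Prefix Caching focuses on "not computing the same thing twice": when several requests share the same prompt prefix, the KV for that prefix is computed once and cached, and later requests reuse it directly. This significantly lowers time to first token, especially for multi-turn chat and long system prompts.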

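Dimension	PagedAttention	Prefix Caching
Focus	Efficient management of KV Cache memory allocation and fragmentation	Reusing the KV Cache of prefixes shared between requests to avoid repeated computation
Stage	The whole inference process, both prefill and decode	The prefill stage (processing the prompt before generation starts)
Cross-request?	Mainly manages the cache within a single request	Targets prefixes shared across different requests
Mechanism	Inspired by virtual-memory paging in operating systems; splits the KV Cache into blocks that are allocated and managed dynamically	Detects and caches the KV of identical prefixes with structures such as hashes or radix trees, and reuses it across requests
Main benefit	Addresses large KV Cache footprint, severe fragmentation, and difficult dynamic growth; improves GPU memory utilization and throughput	Avoids recomputing identical prefixes; significantly lowers time to first token and speeds up scenarios such as multi-turn chat
Typical use	Any high-concurrency or long-sequence inference workload	Long system prompts, few-shot prompts, chat-history reuse, multi-turn chat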
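The two ideas combine naturally: once the KV Cache is stored in fixed-size blocks, a shared prefix can be detected by giving each full block an identifier that depends on its own tokens and on everything before it. The following is a minimal sketch of that hashing idea under stated assumptions, not vLLM's actual code: `BLOCK_SIZE`, `block_hashes`, and `BlockPool` are hypothetical names, only full blocks are cached, no KV tensors are stored, and nothing is ever evicted.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per KV block (hypothetical; kept small for the demo)


def block_hashes(token_ids: List[int]) -> List[str]:
    """Hash every full block together with its parent's hash, so that equal
    hashes imply an identical prefix up to and including that block."""
    hashes: List[str] = []
    parent = ""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE  # ignore the partial tail block
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        parent = hashlib.sha256(f"{parent}|{block}".encode()).hexdigest()
        hashes.append(parent)
    return hashes


class BlockPool:
    """Toy pool that maps block hashes to physical block IDs."""

    def __init__(self) -> None:
        self.hash_to_block: Dict[str, int] = {}
        self.next_block_id = 0

    def allocate(self, token_ids: List[int]) -> Tuple[List[int], int]:
        """Return (physical block IDs of the full blocks, number of cached tokens)."""
        block_ids: List[int] = []
        hits = 0
        # No eviction in this toy, so cache hits always form a contiguous prefix.
        for h in block_hashes(token_ids):
            if h in self.hash_to_block:
                hits += 1  # block already computed by an earlier request
            else:
                self.hash_to_block[h] = self.next_block_id  # allocate on demand
                self.next_block_id += 1
            block_ids.append(self.hash_to_block[h])
        return block_ids, hits * BLOCK_SIZE


pool = BlockPool()
system = list(range(100, 108))       # 8 shared tokens = 2 full blocks
req1 = system + [1, 2, 3, 4, 5]      # 13 tokens -> 3 full blocks
req2 = system + [9, 8, 7, 6, 5]      # same prefix, different tail

blocks1, cached1 = pool.allocate(req1)
blocks2, cached2 = pool.allocate(req2)
print(blocks1, cached1)  # [0, 1, 2] 0
print(blocks2, cached2)  # [0, 1, 3] 8
```

The second request maps its first two logical blocks to the same physical blocks as the first request and only allocates a new block for its own tail, which is the cross-request reuse described in the table.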

3 RadixAttention

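The paper SGLang: Efficient Execution of Structured Language Model Programs proposes RadixAttention as a way to implement Prefix Caching.

The figure (from the SGLang paper) shows an example of RadixAttention operating with an LRU eviction policy. It depicts how the radix tree evolves as different requests arrive: two chat sessions, a batch of few-shot learning queries, and one round of self-consistency sampling. Each edge of the tree is labeled with a substring or a sequence of tokens, and nodes are color-coded by state:

Green marks newly added nodes,
blue marks cached nodes accessed at the current time step,
red marks nodes that have been evicted.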

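The steps are as follows:

Step (1): The radix tree is initially empty.
Step (2): The server receives the user message "Hello!" and generates the LLM reply "Hi!". The system prompt "You are a helpful assistant", the user message "Hello!", and the model reply "Hi!" are consolidated into a single edge that leads to a new node.
Step (3): A new prompt arrives. The server finds the prefix of this prompt in the tree (the first turn of the conversation) and reuses its KV cache. The new turn is appended to the tree as a new node.
Step (4): A new chat session starts. Node "b" is split into two nodes so that the two sessions can share the system prompt.
Step (5): The second session continues, but because of the memory limit, node "c" from step (4) is evicted. The new turn is appended after node "d".
Step (6): The server receives a few-shot learning query and inserts it into the tree. Because the query has no common prefix with the existing nodes, the root node is split.
Step (7): The server receives a batch of additional few-shot learning queries. They share the same few-shot examples, so node "e" from step (6) is split to enable sharing.
Step (8): The server receives a new message from the first chat session. Under the LRU policy, all nodes of the second session (nodes "g" and "h") are evicted.
Step (9): The server receives a request to sample more answers for the question in node "j" from step (8), probably for self-consistency sampling. To make room, nodes "i", "k", and "l" from step (8) are evicted.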
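The operations that appear in these steps (prefix matching, appending a node, splitting a node, and LRU eviction of leaves) can be sketched with a small radix tree over token IDs. This is a simplified illustration under stated assumptions, not SGLang's implementation: the names `Node`, `RadixCache`, `insert`, and `evict` are hypothetical, lookup and insertion are merged into one method, no KV tensors are stored, and there is no reference counting to protect nodes used by running requests.

```python
import itertools
from typing import Dict, List, Optional

_clock = itertools.count()  # logical clock used for LRU ordering


class Node:
    def __init__(self, key: List[int], parent: Optional["Node"]) -> None:
        self.key = key                          # token IDs on the edge into this node
        self.parent = parent
        self.children: Dict[int, "Node"] = {}   # first token of child edge -> child
        self.last_access = next(_clock)


class RadixCache:
    def __init__(self) -> None:
        self.root = Node([], None)

    def _split(self, node: Node, k: int) -> Node:
        """Split node's edge after k tokens and return the new upper node."""
        upper = Node(node.key[:k], node.parent)
        node.parent.children[node.key[0]] = upper
        node.key = node.key[k:]
        node.parent = upper
        upper.children[node.key[0]] = node
        return upper

    def insert(self, tokens: List[int]) -> int:
        """Insert tokens and return how many leading tokens were already cached."""
        node, i = self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                node.children[tokens[i]] = Node(tokens[i:], node)  # append new node
                return i
            k = 0  # length of the match along this edge
            while (k < len(child.key) and i + k < len(tokens)
                   and child.key[k] == tokens[i + k]):
                k += 1
            if k < len(child.key):
                child = self._split(child, k)  # share only the common part
            child.last_access = next(_clock)
            node, i = child, i + k
        return i

    def _leaves(self) -> List[Node]:
        stack, leaves = [self.root], []
        while stack:
            n = stack.pop()
            if not n.children and n is not self.root:
                leaves.append(n)
            stack.extend(n.children.values())
        return leaves

    def evict(self, num_tokens: int) -> int:
        """Evict least recently used leaves until num_tokens tokens are freed."""
        freed = 0
        while freed < num_tokens:
            leaves = self._leaves()
            if not leaves:
                break
            victim = min(leaves, key=lambda n: n.last_access)
            del victim.parent.children[victim.key[0]]
            freed += len(victim.key)
        return freed


cache = RadixCache()
system = [1, 2, 3]                      # toy IDs standing in for the system prompt
chat1_turn1 = system + [10, 11]         # first session, first turn
chat1_turn2 = chat1_turn1 + [12, 13]    # first session, second turn
chat2_turn1 = system + [20, 21]         # second session, shares only the system prompt

print(cache.insert(chat1_turn1))  # 0: nothing cached yet
print(cache.insert(chat1_turn2))  # 5: the whole first turn is reused
print(cache.insert(chat2_turn1))  # 3: the node is split so the system prompt is shared
print(cache.evict(2))             # 2: the least recently used leaf is dropped
```

Evicting leaves first keeps shared ancestors (such as the system prompt) alive for as long as any descendant still depends on them.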

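4 Prefix Caching in vLLM

Initially, vLLM supported only manual prefix caching: users had to mark the prefix boundary explicitly through the prefix_pos parameter.

PR: https://github.com/vllm-project/vllm/pull/1669

Starting with v0.4.0, vLLM introduced Automatic Prefix Caching (APC), which detects and reuses shared prefixes automatically, with no manual annotation.

PR: https://github.com/vllm-project/vllm/pull/2762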

4.1 Enabling Prefix Caching in vLLM

4.1.1 Environment Setup

Run the following commands to install vLLM.

# Install uv to manage Python virtual environments
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

# Install the GPU driver
wget https://cn.download.nvidia.com/tesla/565.57.01/NVIDIA-Linux-x86_64-565.57.01.run
sh NVIDIA-Linux-x86_64-565.57.01.run --silent

# Install the CUDA Toolkit (nvcc, include, lib64, etc.)
sudo apt update
sudo apt install -y nvidia-cuda-toolkit

# Create a Python virtual environment
uv venv vllm-demo --python 3.12 --seed
source vllm-demo/bin/activate

# Install vLLM
uv pip install vllm

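4.1.2 Offline Inference

Setting enable_prefix_caching=True in vLLM enables Automatic Prefix Caching. The code below demonstrates the feature. The first query, which asks for John Doe's age, has to build the full KV Cache. The second query asks for Zack Blue's age; because both prompts share the same long table as their prefix, vLLM automatically reuses the existing cache, which avoids most of the repeated computation and speeds up generation.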

import time

from vllm import LLM, SamplingParams

LONG_PROMPT = (
    "You are a helpful assistant that recognizes the content of tables in markdown format. Here is a table as follows.\n# Table\n"
    + """
| ID  | Name          | Age | Occupation    | Country       | Email                  | Phone Number   | Address                       |
|-----|---------------|-----|---------------|---------------|------------------------|----------------|------------------------------|
| 1   | John Doe      | 29  | Engineer      | USA           | john.doe@example.com   | 555-1234       | 123 Elm St, Springfield, IL  |
| 2   | Jane Smith    | 34  | Doctor        | Canada        | jane.smith@example.com | 555-5678       | 456 Oak St, Toronto, ON      |
| 3   | Alice Johnson | 27  | Teacher       | UK            | alice.j@example.com    | 555-8765       | 789 Pine St, London, UK      |
| 4   | Bob Brown     | 45  | Artist        | Australia     | bob.b@example.com      | 555-4321       | 321 Maple St, Sydney, NSW    |
| 5   | Carol White   | 31  | Scientist     | New Zealand   | carol.w@example.com    | 555-6789       | 654 Birch St, Wellington, NZ |
| 6   | Dave Green    | 28  | Lawyer        | Ireland       | dave.g@example.com     | 555-3456       | 987 Cedar St, Dublin, IE     |
| 7   | Emma Black    | 40  | Musician      | USA           | emma.b@example.com     | 555-1111       | 246 Ash St, New York, NY     |
| 8   | Frank Blue    | 37  | Chef          | Canada        | frank.b@example.com    | 555-2222       | 135 Spruce St, Vancouver, BC |
| 9   | Grace Yellow  | 50  | Engineer      | UK            | grace.y@example.com    | 555-3333       | 864 Fir St, Manchester, UK   |
| 10  | Henry Violet  | 32  | Artist        | Australia     | henry.v@example.com    | 555-4444       | 753 Willow St, Melbourne, VIC|
| 11  | Irene Orange  | 26  | Scientist     | New Zealand   | irene.o@example.com    | 555-5555       | 912 Poplar St, Auckland, NZ  |
| 12  | Jack Indigo   | 38  | Teacher       | Ireland       | jack.i@example.com     | 555-6666       | 159 Elm St, Cork, IE         |
| 13  | Karen Red     | 41  | Lawyer        | USA           | karen.r@example.com    | 555-7777       | 357 Cedar St, Boston, MA     |
| 14  | Leo Brown     | 30  | Chef          | Canada        | leo.b@example.com      | 555-8888       | 246 Oak St, Calgary, AB      |
| 15  | Mia Green     | 33  | Musician      | UK            | mia.g@example.com      | 555-9999       | 975 Pine St, Edinburgh, UK   |
| 16  | Noah Yellow   | 29  | Doctor        | Australia     | noah.y@example.com     | 555-0000       | 864 Birch St, Brisbane, QLD  |
| 17  | Olivia Blue   | 35  | Engineer      | New Zealand   | olivia.b@example.com   | 555-1212       | 753 Maple St, Hamilton, NZ   |
| 18  | Peter Black   | 42  | Artist        | Ireland       | peter.b@example.com    | 555-3434       | 912 Fir St, Limerick, IE     |
| 19  | Quinn White   | 28  | Scientist     | USA           | quinn.w@example.com    | 555-5656       | 159 Willow St, Seattle, WA   |
| 20  | Rachel Red    | 31  | Teacher       | Canada        | rachel.r@example.com   | 555-7878       | 357 Poplar St, Ottawa, ON    |
| 21  | Steve Green   | 44  | Lawyer        | UK            | steve.g@example.com    | 555-9090       | 753 Elm St, Birmingham, UK   |
| 22  | Tina Blue     | 36  | Musician      | Australia     | tina.b@example.com     | 555-1213       | 864 Cedar St, Perth, WA      |
| 23  | Umar Black    | 39  | Chef          | New Zealand   | umar.b@example.com     | 555-3435       | 975 Spruce St, Christchurch, NZ|
| 24  | Victor Yellow | 43  | Engineer      | Ireland       | victor.y@example.com   | 555-5657       | 246 Willow St, Galway, IE    |
| 25  | Wendy Orange  | 27  | Artist        | USA           | wendy.o@example.com    | 555-7879       | 135 Elm St, Denver, CO       |
| 26  | Xavier Green  | 34  | Scientist     | Canada        | xavier.g@example.com   | 555-9091       | 357 Oak St, Montreal, QC     |
| 27  | Yara Red      | 41  | Teacher       | UK            | yara.r@example.com     | 555-1214       | 975 Pine St, Leeds, UK       |
| 28  | Zack Blue     | 30  | Lawyer        | Australia     | zack.b@example.com     | 555-3436       | 135 Birch St, Adelaide, SA   |
| 29  | Amy White     | 33  | Musician      | New Zealand   | amy.w@example.com      | 555-5658       | 159 Maple St, Wellington, NZ |
| 30  | Ben Black     | 38  | Chef          | Ireland       | ben.b@example.com      | 555-7870       | 246 Fir St, Waterford, IE    |
"""
)

def get_generation_time(llm, sampling_params, prompts):
    # time the generation
    start_time = time.time()
    output = llm.generate(prompts, sampling_params=sampling_params)
    end_time = time.time()
    # print the output and generation time
    print("-" * 30)
    print(f"Output: {output[0].outputs[0].text}")
    print(f"Generation time: {end_time - start_time} seconds.")
    print("-" * 30)


def main():
    # set enable_prefix_caching=True to enable APC
    llm = LLM(model="deepseek-ai/deepseek-llm-7b-chat", enable_prefix_caching=True)

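    # NOTE: the source listing is cut off from this point on. The rest of main()
    # is a reconstruction based on the description above, not the author's exact
    # code; the sampling values and question wording are assumptions.
    sampling_params = SamplingParams(temperature=0, max_tokens=100)

    # First query: the KV Cache for LONG_PROMPT has to be built from scratch.
    get_generation_time(
        llm,
        sampling_params,
        LONG_PROMPT + "Question: what is the age of John Doe? Your answer: The age of John Doe is ",
    )

    # Second query: shares LONG_PROMPT as its prefix, so the cached KV can be reused.
    get_generation_time(
        llm,
        sampling_params,
        LONG_PROMPT + "Question: what is the age of Zack Blue? Your answer: The age of Zack Blue is ",
    )


if __name__ == "__main__":
    main()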
