Disclaimer: written purely out of personal interest; please forgive any mistakes or omissions.
Source version: v0.21.0
Relevant source file:
vllm/vllm/entrypoints/cli/serve.py at v0.21.0 · vllm-project/vllm · GitHub
Overview:
This post takes a detailed look at headless, one of the three launch modes of serve:
1. run_headless:
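Entry function for headless mode (used when api_server_count is less than 1)
Launch command: vllm serve --headless
However, this fails with RuntimeError: Did not receive response from front-end process within 5 minutes, and I have not yet found the cause. If anyone knows, I would appreciate some guidance.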

2. run_multi_api_server:
Entry point for launching multiple API server processes (used when api_server_count is greater than 1)
Launch command: vllm serve --api-server-count=2

3. uvloop.run(run_server(args))
Top-level entry point for a single HTTP API server instance (used when api_server_count equals 1)
Launch command: vllm serve
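To make the three thresholds concrete, here is a minimal sketch of the dispatch as described above. choose_launch_mode is a hypothetical helper, not vLLM code, and treating an explicit --headless flag as forcing headless mode is an assumption based on the launch command listed for mode 1.

```python
from argparse import Namespace


def choose_launch_mode(args: Namespace) -> str:
    """Return which of the three entry points would be used (hypothetical helper)."""
    # Assumption: --headless forces headless mode regardless of api_server_count.
    if getattr(args, "headless", False) or args.api_server_count < 1:
        return "run_headless"
    if args.api_server_count > 1:
        return "run_multi_api_server"
    # Exactly one API server, running in the current process.
    return "uvloop.run(run_server(args))"


if __name__ == "__main__":
    for count in (0, 1, 2):
        print(count, choose_launch_mode(Namespace(api_server_count=count)))
```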
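First, a quick recap of the previous article (see the figure below). The launch mode is determined by the following parameters:
data_parallel_external_lb and data_parallel_rank determine is_external_lb
data_parallel_hybrid_lb and data_parallel_start_rank determine is_hybrid_lb
(Note: is_external_lb and is_hybrid_lb are mutually exclusive; if both are true, the program aborts.)
enable_elastic_ep: adjusts api_server_count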
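A minimal sketch of the rules just listed, under an assumption: that each mode is switched on either by its boolean flag or by the corresponding rank argument being set (not None). resolve_lb_mode is a hypothetical name, the real checks in vLLM may differ in detail, and enable_elastic_ep is left out.

```python
from typing import Optional, Tuple


def resolve_lb_mode(
    data_parallel_external_lb: bool,
    data_parallel_rank: Optional[int],
    data_parallel_hybrid_lb: bool,
    data_parallel_start_rank: Optional[int],
) -> Tuple[bool, bool]:
    """Return (is_external_lb, is_hybrid_lb) under the stated assumption."""
    # Assumption: an explicit rank implies external LB, an explicit start rank implies hybrid LB.
    is_external_lb = data_parallel_external_lb or data_parallel_rank is not None
    is_hybrid_lb = data_parallel_hybrid_lb or data_parallel_start_rank is not None
    if is_external_lb and is_hybrid_lb:
        # The two modes are mutually exclusive, so the launch aborts.
        raise ValueError("external LB and hybrid LB modes are mutually exclusive")
    return is_external_lb, is_hybrid_lb


if __name__ == "__main__":
    print(resolve_lb_mode(False, 0, False, None))
    print(resolve_lb_mode(False, None, False, 2))
```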

run_headless source walkthrough
1. Validate api_server_count
if args.api_server_count > 1:
    raise ValueError("api_server_count can't be set in headless mode")
2. Create the runtime configuration vllm_config
# Create the EngineConfig.
engine_args = vllm.AsyncEngineArgs.from_cli_args(args)
usage_context = UsageContext.OPENAI_API_SERVER
vllm_config = engine_args.create_engine_config(
    usage_context=usage_context, headless=True
)
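AsyncEngineArgs.from_cli_args converts the parsed argparse namespace into an engine-args object. As far as I understand, the pattern is to copy every dataclass field with the same name off the namespace; the sketch below illustrates that pattern with a hypothetical MiniEngineArgs holding a few fields, not vLLM's actual class.

```python
import argparse
import dataclasses
from dataclasses import dataclass


@dataclass
class MiniEngineArgs:
    # Hypothetical subset of engine arguments, for illustration only.
    model: str = ""
    api_server_count: int = 1
    disable_log_stats: bool = False

    @classmethod
    def from_cli_args(cls, args: argparse.Namespace) -> "MiniEngineArgs":
        # Copy each dataclass field of the same name off the namespace;
        # namespace attributes that are not fields (e.g. headless) are ignored.
        names = [field.name for field in dataclasses.fields(cls)]
        return cls(**{name: getattr(args, name) for name in names})


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default="")
    parser.add_argument("--api-server-count", type=int, default=1)
    parser.add_argument("--disable-log-stats", action="store_true")
    parser.add_argument("--headless", action="store_true")
    ns = parser.parse_args(["--model", "my-model", "--headless", "--api-server-count", "0"])
    print(MiniEngineArgs.from_cli_args(ns))
```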
3. Validate data_parallel_hybrid_lb
if engine_args.data_parallel_hybrid_lb:
    raise ValueError("data_parallel_hybrid_lb is not applicable in headless mode")
4. Get the parallel config and validate local_engine_count
parallel_config = vllm_config.parallel_config
local_engine_count = parallel_config.data_parallel_size_local
if local_engine_count <= 0:
    raise ValueError("data_parallel_size_local must be > 0 in headless mode")
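Steps 1, 3 and 4 are plain guard checks. Collected into one standalone function they look like this; HeadlessSettings and validate_headless are hypothetical names that flatten the values the real code reads from args, engine_args and parallel_config.

```python
from dataclasses import dataclass


@dataclass
class HeadlessSettings:
    # Hypothetical flattened view of the values checked in steps 1, 3 and 4.
    api_server_count: int
    data_parallel_hybrid_lb: bool
    data_parallel_size_local: int


def validate_headless(s: HeadlessSettings) -> int:
    """Apply the headless guard checks and return the local engine count."""
    if s.api_server_count > 1:
        raise ValueError("api_server_count can't be set in headless mode")
    if s.data_parallel_hybrid_lb:
        raise ValueError("data_parallel_hybrid_lb is not applicable in headless mode")
    if s.data_parallel_size_local <= 0:
        raise ValueError("data_parallel_size_local must be > 0 in headless mode")
    return s.data_parallel_size_local


if __name__ == "__main__":
    print(validate_headless(HeadlessSettings(0, False, 2)))
```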
5. Register termination-signal handlers for graceful shutdown
# Excerpt from the body of run_headless(): nonlocal works here because
# shutdown_requested is a local variable of that enclosing function.
shutdown_requested = False

def signal_handler(signum, frame):
    nonlocal shutdown_requested
    logger.debug("Received %d signal.", signum)
    if not shutdown_requested:
        shutdown_requested = True
        raise SystemExit

signal.signal(signal.SIGTERM, signal_handler)
signal.signal(signal.SIGINT, signal_handler)
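The same flag-plus-SystemExit pattern can be tried in isolation. This sketch assumes a POSIX system, Python 3.8+ (for signal.raise_signal), and that it runs in the main thread, since only the main thread may install signal handlers. run_with_graceful_shutdown is a hypothetical helper: it mirrors how run_headless picks a bounded shutdown timeout only when a signal was received (step 8), but it swallows SystemExit so the outcome can be inspected.

```python
import signal
from typing import Callable, Optional, Tuple


def run_with_graceful_shutdown(
    work: Callable[[], None], shutdown_timeout: float = 30.0
) -> Tuple[bool, Optional[float]]:
    """Run work() under a SIGTERM handler; return (shutdown_requested, timeout)."""
    shutdown_requested = False

    def signal_handler(signum, frame):
        # nonlocal rebinds the flag that lives in the enclosing function.
        nonlocal shutdown_requested
        if not shutdown_requested:
            shutdown_requested = True
            raise SystemExit

    previous = signal.signal(signal.SIGTERM, signal_handler)
    try:
        work()
    except SystemExit:
        # The real code lets SystemExit propagate; the demo swallows it.
        pass
    finally:
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)
    # Only a requested shutdown gets a bounded wait; otherwise None.
    timeout = shutdown_timeout if shutdown_requested else None
    return shutdown_requested, timeout


if __name__ == "__main__":
    print(run_with_graceful_shutdown(lambda: signal.raise_signal(signal.SIGTERM)))  # (True, 30.0)
```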
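6. Logic for secondary nodes in a distributed deployment
Never triggered in a single-machine deployment (it only applies when one DP replica spans several nodes, i.e. node_rank_within_dp > 0)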
if parallel_config.node_rank_within_dp > 0:
    from vllm.version import __version__ as VLLM_VERSION

    # Run headless workers (for multi-node PP/TP).
    host = parallel_config.master_addr
    head_node_address = f"{host}:{parallel_config.master_port}"
    logger.info(
        "Launching vLLM (v%s) headless multiproc executor, "
        "with head node address %s for torch.distributed process group.",
        VLLM_VERSION,
        head_node_address,
    )
    executor = MultiprocExecutor(vllm_config, monitor_workers=False)
    executor.start_worker_monitor(inline=True)
    return
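This branch decides what a headless node actually runs. A minimal sketch of that decision, with headless_role as a hypothetical helper and its return labels made up for illustration:

```python
from typing import Optional, Tuple


def headless_role(
    node_rank_within_dp: int, master_addr: str, master_port: int
) -> Tuple[str, Optional[str]]:
    """Decide what a headless node runs (hypothetical labels)."""
    if node_rank_within_dp > 0:
        # Not the first node of its DP replica (multi-node TP/PP):
        # run workers only and join the head node's torch.distributed group.
        return "workers_only", f"{master_addr}:{master_port}"
    # First node of the replica (always true on a single machine):
    # go on to start local engine processes as in step 7.
    return "engine_manager", None


if __name__ == "__main__":
    print(headless_role(0, "10.0.0.1", 29501))
    print(headless_role(1, "10.0.0.1", 29501))
```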
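7. Instantiate CoreEngineProcManager, the engine process manager
It is dedicated to creating, managing and monitoring all local DP inference engine subprocesses (one per local DP rank).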
host = parallel_config.data_parallel_master_ip
port = parallel_config.data_parallel_rpc_port
handshake_address = get_tcp_uri(host, port)
logger.info(
    "Launching %d data parallel engine(s) in headless mode, "
    "with head node address %s.",
    local_engine_count,
    handshake_address,
)
# Create the engines.
engine_manager = CoreEngineProcManager(
    local_engine_count=local_engine_count,
    start_index=vllm_config.parallel_config.data_parallel_rank,
    local_start_index=0,
    vllm_config=vllm_config,
    local_client=False,
    handshake_address=handshake_address,
    executor_class=Executor.get_class(vllm_config),
    log_stats=not engine_args.disable_log_stats,
)
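Two details of step 7 can be sketched on their own. The first helper builds a tcp:// URI the way I assume get_tcp_uri does (bracketing IPv6 literals); the second lists which global DP ranks the local engines would take, assuming CoreEngineProcManager numbers them start_index + i and local_start_index + i. Both make_tcp_uri and engine_ranks are hypothetical names, not vLLM functions.

```python
import ipaddress
from typing import List, Tuple


def make_tcp_uri(host: str, port: int) -> str:
    """Return a tcp:// URI, wrapping IPv6 literals in brackets."""
    try:
        is_ipv6 = isinstance(ipaddress.ip_address(host), ipaddress.IPv6Address)
    except ValueError:
        # Not an IP literal (e.g. a hostname): use it as-is.
        is_ipv6 = False
    return f"tcp://[{host}]:{port}" if is_ipv6 else f"tcp://{host}:{port}"


def engine_ranks(
    local_engine_count: int, start_index: int, local_start_index: int = 0
) -> List[Tuple[int, int]]:
    """(global DP rank, local index) for each engine this node would start."""
    return [
        (start_index + i, local_start_index + i) for i in range(local_engine_count)
    ]


if __name__ == "__main__":
    print(make_tcp_uri("127.0.0.1", 29550))  # tcp://127.0.0.1:29550
    print(engine_ranks(local_engine_count=2, start_index=2))  # [(2, 0), (3, 1)]
```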
8. Start monitoring
try:
    engine_manager.monitor_engine_liveness()
finally:
    timeout = None
    if shutdown_requested:
        timeout = vllm_config.shutdown_timeout
        logger.info("Waiting up to %d seconds for processes to exit", timeout)
    engine_manager.shutdown(timeout=timeout)
    logger.info("Shutting down.")
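monitor_engine_liveness() blocks indefinitely: the main thread stays on this line for the whole run, which is what keeps the process resident. Meanwhile it maintains the parent-child TCP heartbeat (127.0.0.1:29550) and monitors the liveness of all worker subprocesses in real time.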
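The pattern in step 8 (block until some child process exits, then clean up in finally) can be sketched with the standard library. This is a swapped-in, polling-based illustration using subprocess.Popen, not necessarily how CoreEngineProcManager implements monitor_engine_liveness or shutdown; spawn_sleeper, monitor_liveness and shutdown are hypothetical names, and each child is a sleeping Python process standing in for an engine.

```python
import subprocess
import sys
import time
from typing import List, Optional


def spawn_sleeper(seconds: float) -> subprocess.Popen:
    # Stand-in for an engine process: a child Python that sleeps, then exits.
    return subprocess.Popen([sys.executable, "-c", f"import time; time.sleep({seconds})"])


def monitor_liveness(procs: List[subprocess.Popen], poll_interval: float = 0.05) -> int:
    """Block until any child exits; return the index of the first one found."""
    while True:
        for i, proc in enumerate(procs):
            if proc.poll() is not None:
                return i
        time.sleep(poll_interval)


def shutdown(procs: List[subprocess.Popen], timeout: Optional[float] = None) -> None:
    """Ask survivors to stop, wait up to `timeout` seconds in total, then kill."""
    for proc in procs:
        if proc.poll() is None:
            proc.terminate()
    deadline = None if timeout is None else time.monotonic() + timeout
    for proc in procs:
        remaining = None if deadline is None else max(0.0, deadline - time.monotonic())
        try:
            proc.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            # Did not exit in time: force it.
            proc.kill()
            proc.wait()


if __name__ == "__main__":
    children = [spawn_sleeper(0.2), spawn_sleeper(30)]
    try:
        print("child exited:", monitor_liveness(children))
    finally:
        shutdown(children, timeout=5)
```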
I originally meant to cover headless, multi_api_server and uvloop.run in a single post, but work has been busy lately, so the other two modes will be covered next time.
Key concepts
Python nested functions
def outer():
    # outer function
    x = 10

    # inner function, defined inside outer
    def inner():
        print(x)  # can read the outer variable

    # call the inner function
    inner()


outer()  # prints 10
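Pros
Encapsulation: helper logic is visible only inside the current function and does not pollute the global namespace;
Closures can capture variables from the enclosing scope, so they don't need to be passed in repeatedly;
Cohesion: related logic stays close together, which improves readability.
Cons
Deep nesting hurts readability (more than 2 levels is not recommended);
The call stack gains an extra frame, so debugging complex nesting is somewhat harder.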
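A short example of the closure point above: the nested function keeps a reference to the enclosing call's variable, so callers never pass it again. make_scaler is just an illustrative name.

```python
from typing import Callable


def make_scaler(factor: int) -> Callable[[int], int]:
    # scale is a nested function; it closes over factor from the
    # enclosing call, so callers never pass it again.
    def scale(x: int) -> int:
        return x * factor

    return scale


double = make_scaler(2)
triple = make_scaler(3)
print(double(5), triple(5))  # 10 15
```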
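The nonlocal keyword
nonlocal is used inside a nested function to declare that a name is neither a local of the current function nor a global, but a local variable of an enclosing function; assigning to it then rebinds that outer variable.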
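A minimal example of nonlocal, contrasted with what happens without it. Assigning to a name inside a function makes that name local to the function, so count += 1 without nonlocal raises UnboundLocalError. The signal_handler in step 5 relies on the same mechanism to set shutdown_requested. The function names here are illustrative only.

```python
def make_counter():
    count = 0

    def increment() -> int:
        nonlocal count  # rebind the enclosing function's count
        count += 1
        return count

    return increment


def broken_counter():
    count = 0

    def increment() -> int:
        # Without nonlocal, count is local to increment, so reading it
        # before assignment raises UnboundLocalError.
        count += 1
        return count

    return increment


counter = make_counter()
print(counter(), counter(), counter())  # 1 2 3

try:
    broken_counter()()
except UnboundLocalError as exc:
    print("UnboundLocalError:", exc)
```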
