RuntimeError: CUDA out of memory occurred when warming up sampler with 256 dummy requests. Please tr

Latest recommended article published on 2026-01-23 14:39:37


This content is licensed under CC 4.0 BY-SA.

Tags

#deep-learning #machine-learning #artificial-intelligence

CoPaw

CoPaw is a personal assistant open-sourced by agentscope, similar to openclaw, with a built-in Qwen3-4B-Instruct-2507 model deployed through vLLM.

1. Deploying the vLLM service fails with a GPU memory error

Error message:

ERROR 05-10 09:27:22 [core.py:400] RuntimeError: CUDA out of memory occurred when warming up sampler with 256 dummy requests. Please try lowering `max_num_seqs` or `gpu_memory_utilization` when initializing the engine.
Process EngineCore_0:
Traceback (most recent call last):
  File "/workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 1580, in _dummy_sampler_run
    sampler_output = self.sampler(logits=logits,
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/sample/sampler.py", line 49, in forward
    sampled = self.sample(logits, sampling_metadata)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/sample/sampler.py", line 115, in sample
    random_sampled = self.topk_topp_sampler(
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/sample/ops/topk_topp_sampler.py", line 91, in forward_native
    logits = apply_top_k_top_p(logits, k, p)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/sample/ops/topk_topp_sampler.py", line 189, in apply_top_k_top_p
    logits_sort, logits_idx = logits.sort(dim=-1, descending=False)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 298.00 MiB. GPU 0 has a total capacity of 31.36 GiB of which 308.31 MiB is free. Including non-PyTorch memory, this process has 29.94 GiB memory in use. Of the allocated memory 28.69 GiB is allocated by PyTorch, with 75.88 MiB allocated in private pools (e.g., CUDA Graphs), and 32.79 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/vllm/vllm/v1/engine/core.py", line 404, in run_engine_core
    raise e
  File "/workspace/vllm/vllm/v1/engine/core.py", line 391, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/engine/core.py", line 333, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/workspace/vllm/vllm/v1/engine/core.py", line 72, in __init__
    self._initialize_kv_caches(vllm_config)
  File "/workspace/vllm/vllm/v1/engine/core.py", line 158, in _initialize_kv_caches
    self.model_executor.initialize_from_config(kv_cache_configs)
  File "/workspace/vllm/vllm/v1/executor/abstract.py", line 65, in initialize_from_config
    self.collective_rpc("compile_or_warm_up_model")
  File "/workspace/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/utils.py", line 2555, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/worker/gpu_worker.py", line 254, in compile_or_warm_up_model
    self.model_runner._dummy_sampler_run(
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 1584, in _dummy_sampler_run
    raise RuntimeError(
RuntimeError: CUDA out of memory occurred when warming up sampler with 256 dummy requests. Please try lowering `max_num_seqs` or `gpu_memory_utilization` when initializing the engine.
[rank0]:[W510 09:27:22.759524463 ProcessGroupNCCL.cpp:1487] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 33, in <module>
    sys.exit(load_entry_point('vllm', 'console_scripts', 'vllm')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/entrypoints/cli/main.py", line 53, in main
    args.dispatch_function(args)
  File "/workspace/vllm/vllm/entrypoints/cli/serve.py", line 27, in cmd
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop-0.21.0-py3.12-linux-x86_64.egg/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop-0.21.0-py3.12-linux-x86_64.egg/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1077, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/engine/async_llm.py", line 151, in from_vllm_config
    return cls(
           ^^^^
  File "/workspace/vllm/vllm/v1/engine/async_llm.py", line 118, in __init__
    self.engine_core = core_client_class(
                       ^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/engine/core_client.py", line 649, in __init__
    super().__init__(
  File "/workspace/vllm/vllm/v1/engine/core_client.py", line 400, in __init__
    self._wait_for_engine_startup()
  File "/workspace/vllm/vllm/v1/engine/core_client.py", line 432, in _wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.
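
The figures that matter are buried in the torch.OutOfMemoryError line. As a minimal sketch, they can be pulled out and compared as follows. The helper name parse_oom and the regular expressions are illustrative assumptions written against the message format shown in this log, not a stable PyTorch interface.

```python
import re

# Memory figures copied from the torch.OutOfMemoryError line in the log above.
OOM_LINE = (
    "CUDA out of memory. Tried to allocate 298.00 MiB. "
    "GPU 0 has a total capacity of 31.36 GiB of which 308.31 MiB is free. "
    "Including non-PyTorch memory, this process has 29.94 GiB memory in use."
)

# Hypothetical patterns, matched against this specific message format only.
PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
    "total": r"total capacity of ([\d.]+) (MiB|GiB)",
    "free": r"of which ([\d.]+) (MiB|GiB) is free",
    "in_use": r"this process has ([\d.]+) (MiB|GiB) memory in use",
}


def parse_oom(line: str) -> dict[str, float]:
    """Return the memory figures found in an OOM message, converted to GiB."""
    figures = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, line)
        if match is None:
            raise ValueError(f"field {name!r} not found in message")
        value, unit = float(match.group(1)), match.group(2)
        figures[name] = value / 1024 if unit == "MiB" else value
    return figures


figures = parse_oom(OOM_LINE)
share = figures["in_use"] / figures["total"]
print(f"process holds {share:.1%} of the GPU")
```

For this log, the process already holds roughly 95% of the card when the sampler warm-up asks for more, which is why the error message points at `max_num_seqs` and `gpu_memory_utilization`.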

2. Solution

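Cause: the GPU ran out of memory. According to the log above, the process already held 29.94 GiB of the GPU's 31.36 GiB when the sampler warm-up tried to allocate another 298.00 MiB.
Approach: cap vLLM's GPU memory usage, or extend GPU memory virtually with CPU memory.
Suppose this is the vLLM launch script: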
vllm serve /opt/modules/Qwen3-8B --trust-remote-code \
--served-model-name chat_model

Adding one of the following options usually resolves the problem:
Method 1: cap GPU memory usage (0.8 means 80%): --gpu-memory-utilization 0.8
Method 2: add virtual memory: --cpu-offload-gb 10 --swap-space 10
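
To put Method 1 in numbers, here is a small sketch that assumes `--gpu-memory-utilization` is the fraction of total GPU memory that one vLLM instance may claim. The 31.36 GiB total is taken from the log above; the function name memory_budget_gib is made up for illustration.

```python
def memory_budget_gib(total_gib: float, utilization: float) -> tuple[float, float]:
    """Split total GPU memory into vLLM's budget and the part left outside it.

    Assumption: `utilization` is the fraction of total GPU memory that the
    vLLM instance is allowed to use (the meaning of --gpu-memory-utilization).
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    budget = total_gib * utilization
    return budget, total_gib - budget


TOTAL_GIB = 31.36  # "total capacity" reported in the OOM message

# Compare vLLM's default of 0.9 with the 0.8 used in this article.
for utilization in (0.9, 0.8):
    budget, outside = memory_budget_gib(TOTAL_GIB, utilization)
    print(f"utilization={utilization}: budget={budget:.2f} GiB, "
          f"outside budget={outside:.2f} GiB")
```

Dropping from 0.9 to 0.8 on this card moves about 3 GiB out of vLLM's budget, which is far more than the 298 MiB allocation that failed during warm-up.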

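--swap-space: size of the CPU swap space per GPU, in GiB. Default: 4
--cpu-offload-gb: the amount of space, in GiB, to offload to the CPU per GPU. The default is 0, which means no offloading. Intuitively, this parameter can be seen as a virtual way to increase GPU memory. For example, if you have a 24 GB GPU and set this to 10, you can effectively treat it as a 34 GB GPU. You can then load a 13B model with BF16 weights, which requires at least 26 GB of GPU memory. Note that this requires a fast CPU-GPU interconnect, because part of the model is loaded from CPU memory into GPU memory on the fly during every forward pass. Default: 0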
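
The arithmetic behind the 13B example can be sketched as follows. This counts weights only (2 bytes per BF16 parameter) and ignores the KV cache and activations, so it is a lower bound rather than a sizing tool; the function names are hypothetical.

```python
BYTES_PER_BF16_PARAM = 2  # BF16 stores each weight in 16 bits


def weight_size_gb(num_params: float, bytes_per_param: int = BYTES_PER_BF16_PARAM) -> float:
    """Approximate size of the model weights in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


def weights_fit(gpu_gb: float, cpu_offload_gb: float, weights_gb: float) -> bool:
    """Check whether the weights fit into GPU memory plus the offload space."""
    return weights_gb <= gpu_gb + cpu_offload_gb


weights = weight_size_gb(13e9)  # 13B parameters in BF16
print(f"weights: {weights:.0f} GB")
print("24 GB GPU, no offload:   ", weights_fit(24, 0, weights))
print("24 GB GPU, 10 GB offload:", weights_fit(24, 10, weights))
```

Without offloading, the 26 GB of weights do not fit on a 24 GB card; with `--cpu-offload-gb 10` the combined 34 GB is enough, at the cost of moving weights over the CPU-GPU link on each forward pass.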

Final script 1:
vllm serve /opt/modules/Qwen3-8B --trust-remote-code \
--served-model-name chat_model --gpu-memory-utilization 0.8

Final script 2 (does not take effect on some machines; script 1 is recommended):
vllm serve /opt/modules/Qwen3-8B --trust-remote-code \
--served-model-name chat_model --cpu-offload-gb 10 --swap-space 10
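
The error message itself names a second knob besides `gpu_memory_utilization`: `max_num_seqs`, which is where the 256 dummy requests in the warm-up come from. Below is a minimal sketch that assembles a launch command with both options. The helper build_serve_command is hypothetical, and the value 128 is only an illustrative choice, not a tested recommendation.

```python
import shlex
from typing import Optional


def build_serve_command(
    model_path: str,
    served_name: str,
    gpu_memory_utilization: Optional[float] = None,
    max_num_seqs: Optional[int] = None,
) -> list[str]:
    """Assemble the argv for `vllm serve`, adding only the options that are set."""
    argv = [
        "vllm", "serve", model_path,
        "--trust-remote-code",
        "--served-model-name", served_name,
    ]
    if gpu_memory_utilization is not None:
        argv += ["--gpu-memory-utilization", str(gpu_memory_utilization)]
    if max_num_seqs is not None:
        # Fewer concurrent sequences means a smaller sampler warm-up batch.
        argv += ["--max-num-seqs", str(max_num_seqs)]
    return argv


cmd = build_serve_command(
    "/opt/modules/Qwen3-8B",
    "chat_model",
    gpu_memory_utilization=0.8,
    max_num_seqs=128,  # illustrative value, half of the 256 in the error
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd)` directly. Lowering `max_num_seqs` trades peak concurrency for memory, so it is worth trying only if Method 1 alone is not enough.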

If you know of other ways to solve this, please share them.
