RuntimeError: CUDA out of memory occurred when warming up sampler with 256 dummy requests. Please tr

CoPaw

内置vllm部署的Qwen3-4B-Instruct-2507模型,agentscope开源的类似openclaw个人助手。

1. 部署vllm服务报gpu内存错误

报错信息:

ERROR 05-10 09:27:22 [core.py:400] RuntimeError: CUDA out of memory occurred when warming up sampler with 256 dummy requests. Please try lowering `max_num_seqs` or `gpu_memory_utilization` when initializing the engine.
Process EngineCore_0:
Traceback (most recent call last):
  File "/workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 1580, in _dummy_sampler_run
    sampler_output = self.sampler(logits=logits,
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/sample/sampler.py", line 49, in forward
    sampled = self.sample(logits, sampling_metadata)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/sample/sampler.py", line 115, in sample
    random_sampled = self.topk_topp_sampler(
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/sample/ops/topk_topp_sampler.py", line 91, in forward_native
    logits = apply_top_k_top_p(logits, k, p)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/sample/ops/topk_topp_sampler.py", line 189, in apply_top_k_top_p
    logits_sort, logits_idx = logits.sort(dim=-1, descending=False)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 298.00 MiB. GPU 0 has a total capacity of 31.36 GiB of which 308.31 MiB is free. Including non-PyTorch memory, this process has 29.94 GiB memory in use. Of the allocated memory 28.69 GiB is allocated by PyTorch, with 75.88 MiB allocated in private pools (e.g., CUDA Graphs), and 32.79 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/vllm/vllm/v1/engine/core.py", line 404, in run_engine_core
    raise e
  File "/workspace/vllm/vllm/v1/engine/core.py", line 391, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/engine/core.py", line 333, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/workspace/vllm/vllm/v1/engine/core.py", line 72, in __init__
    self._initialize_kv_caches(vllm_config)
  File "/workspace/vllm/vllm/v1/engine/core.py", line 158, in _initialize_kv_caches
    self.model_executor.initialize_from_config(kv_cache_configs)
  File "/workspace/vllm/vllm/v1/executor/abstract.py", line 65, in initialize_from_config
    self.collective_rpc("compile_or_warm_up_model")
  File "/workspace/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/utils.py", line 2555, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/worker/gpu_worker.py", line 254, in compile_or_warm_up_model
    self.model_runner._dummy_sampler_run(
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 1584, in _dummy_sampler_run
    raise RuntimeError(
RuntimeError: CUDA out of memory occurred when warming up sampler with 256 dummy requests. Please try lowering `max_num_seqs` or `gpu_memory_utilization` when initializing the engine.
[rank0]:[W510 09:27:22.759524463 ProcessGroupNCCL.cpp:1487] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 33, in <module>
    sys.exit(load_entry_point('vllm', 'console_scripts', 'vllm')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/entrypoints/cli/main.py", line 53, in main
    args.dispatch_function(args)
  File "/workspace/vllm/vllm/entrypoints/cli/serve.py", line 27, in cmd
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop-0.21.0-py3.12-linux-x86_64.egg/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop-0.21.0-py3.12-linux-x86_64.egg/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1077, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/engine/async_llm.py", line 151, in from_vllm_config
    return cls(
           ^^^^
  File "/workspace/vllm/vllm/v1/engine/async_llm.py", line 118, in __init__
    self.engine_core = core_client_class(
                       ^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/engine/core_client.py", line 649, in __init__
    super().__init__(
  File "/workspace/vllm/vllm/v1/engine/core_client.py", line 400, in __init__
    self._wait_for_engine_startup()
  File "/workspace/vllm/vllm/v1/engine/core_client.py", line 432, in _wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.

2. 解决办法

报错原因:  gpu显存不够
解决思路: 限制gpu的显存使用或者增加虚拟显存
假如这是vllm运行脚本:
vllm serve /opt/modules/Qwen3-8B --trust-remote-code \
--served-model-name chat_model

增加如下配置,一般情况下就能够解决问题:
方法1: 限制使用内存(0.8相当于80%) --gpu-memory-utilization 0.8
方法2: 增加虚拟内存   --cpu-offload-gb 10 --swap-space 10

--swap-space : 每GPU的CPU交换空间的大小(以GIB为单位)。 默认:4
--cpu-offload-gb : 每GPU,GIB中的空间可卸载到CPU。默认值为0,这意味着没有卸载。直观地,该参数可以看作是增加GPU内存大小的虚拟方法。例如,如果您有一个24 GB GPU并将其设置为10,则实际上可以将其视为34 GB GPU。然后,您可以加载带有BF16重量的13B型号,这至少需要26GB GPU内存。请注意,这需要快速的CPU-GPU互连,作为模型的一部分,从CPU存储器加载到每个模型向前通行中的GPU内存。默认:0

最终脚本1:
vllm serve /opt/modules/Qwen3-8B --trust-remote-code \
--served-model-name chat_model  --gpu-memory-utilization 0.8

最终脚本2(部分机器不生效,推荐使用脚本1):
vllm serve /opt/modules/Qwen3-8B --trust-remote-code \
--served-model-name chat_model --cpu-offload-gb 10 --swap-space 10

欢迎大佬留下更多解决办法的思路。

您可能感兴趣的与本文相关的镜像

CoPaw

CoPaw

AI应用
Qwen
Qwen3

内置vllm部署的Qwen3-4B-Instruct-2507模型,agentscope开源的类似openclaw个人助手。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

建杉

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值