A Basic TensorRT-LLM Demo #TensorRT-LLM 1.0 in Practice#

Original article · last modified 2025-11-06 10:52:26

This content is licensed under CC BY-SA 4.0.

Tags

#nlp #TensorRT-LLM

First published 2025-11-06 09:44:34

This article is a hands-on walkthrough to help you install and deploy TensorRT-LLM quickly.

A brief introduction to TensorRT-LLM

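TensorRT-LLM is NVIDIA's open-source library for optimizing large language model inference, designed specifically for NVIDIA GPUs. It combines a deep learning compiler, optimized kernels, and quantization techniques, supports mainstream large models across several generations of GPU architecture, and offers multi-precision compute, dynamic scheduling, and cluster deployment. It can substantially speed up inference (for example, more than 10,000 tokens per second on an H100, although the actual figure depends on the model and batch size) and reduce cost. It is widely used for real-time interaction and private enterprise deployments, and an ecosystem has formed around it through cooperation with many companies in China and abroad.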

Quick installation and verification of TensorRT-LLM

# Environment used for this installation
Ubuntu 22.04
python 3.10
# This article also uses conda to create the virtual environment

Create the virtual environment

conda create -n llm python=3.10 # add -y to skip the confirmation prompt
conda activate llm # activate the environment
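
Before installing anything, it can help to confirm that the shell really is inside the new environment. The sketch below is a minimal check, assuming the environment is named `llm` as above; the helper name `env_report` is made up for illustration. It relies on the `CONDA_DEFAULT_ENV` variable that conda sets when an environment is activated.

```python
import os
import sys


def env_report(version_info=sys.version_info, environ=os.environ,
               want=(3, 10), env_name="llm"):
    """Return a list of problems; an empty list means the setup matches."""
    problems = []
    if tuple(version_info[:2]) != want:
        problems.append(
            f"Python {version_info[0]}.{version_info[1]} is active, "
            f"expected {want[0]}.{want[1]}"
        )
    # conda exports CONDA_DEFAULT_ENV with the name of the active environment
    active = environ.get("CONDA_DEFAULT_ENV")
    if active != env_name:
        problems.append(f"conda environment is {active!r}, expected {env_name!r}")
    return problems


for line in env_report() or ["environment looks as expected"]:
    print(line)
```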

Download the source code

git clone https://gitcode.com/GitHub_Trending/te/TensorRT-LLM
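# The upstream documentation also describes a quick build from source; to save time, this article takes the simpler route of installing with pip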

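# Install system dependencies
sudo apt-get -y install libopenmpi-dev
# Install PyTorch (CUDA 12.8 build)
pip install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# Install TensorRT-LLM
pip install tensorrt_llm # using the Tsinghua PyPI mirror can speed this up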
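
A quick way to confirm that both packages landed in the active environment, without importing them (importing `tensorrt_llm` can take a while), is to ask the import system whether it can locate them. This is a small sketch using only the standard library; `missing_modules` is a hypothetical helper name.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(["torch", "tensorrt_llm"])
print("missing:", missing if missing else "none")
```

An empty result only shows that the modules are discoverable; the import test in the verification step is still what confirms that the native libraries load.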

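If pip finishes without errors, the installation is complete (the original post showed a screenshot of the pip output here).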

Verify the installation

import tensorrt_llm
from tensorrt_llm import BuildConfig, SamplingParams
 
print(f"TensorRT-LLM version: {tensorrt_llm.__version__}")
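
Since this walkthrough targets the 1.0 release, the printed version string can also be checked programmatically. The sketch below is one simple way to do that under stated assumptions: it only looks at the leading numeric part of the string and ignores any pre-release suffix such as `rc1`, and the helper names `release_tuple` and `is_at_least` are made up for illustration.

```python
import re


def release_tuple(version):
    """Extract the leading numeric release segment, e.g. '1.1.0rc2' -> (1, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_at_least(version, minimum=(1, 0)):
    """Compare release numbers after padding both sides with zeros."""
    release = release_tuple(version)
    width = max(len(release), len(minimum))
    padded_release = release + (0,) * (width - len(release))
    padded_minimum = tuple(minimum) + (0,) * (width - len(minimum))
    return padded_release >= padded_minimum


for candidate in ["0.21.0", "1.0.0", "1.1.0rc2"]:
    print(candidate, is_at_least(candidate))
```

In the verification script this would be called as `is_at_least(tensorrt_llm.__version__)`.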

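If the version prints without errors, the installation succeeded (the original post showed a screenshot of the output here).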

Basic usage of TensorRT-LLM

from tensorrt_llm.quantization import quantize_and_export
import torch
from tensorrt_llm import LLM, SamplingParams

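# Quantization settings (tuned for the small TinyLlama model).
# NOTE: the keys below are kept as the author wrote them. Check them against the
# signature of quantize_and_export in your installed version, because keyword
# arguments it does not accept may be rejected.
quant_config = {
    "quantize_weights": True,
    "quantize_activations": False,
    "use_int4_weights": True,
    "group_size": 64,
    "awq_block_size": 64,
    "calib_size": 256,
    "calib_batch_size": 4,
}

# Run quantization and export the result
quantize_and_export(
    model_dir="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    output_dir="tinyllama-1.1b-int4",
    dtype="float16",
    qformat="int4_awq",
    kv_cache_dtype="fp8",
    device="cuda",
    **quant_config,
)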

# Load the quantized model
llm = LLM(model="tinyllama-1.1b-int4", tensor_parallel_size=1)

# A simple generation task
prompts = [
    "The capital of France is",
    "Explain the theory of relativity in simple terms."
]

# Configure greedy decoding
sampling_params = SamplingParams(
    temperature=0.0,  # temperature 0 means greedy decoding (pick the most probable token)
    max_tokens=64     # maximum number of tokens to generate
)

# Run generation
results = llm.generate(prompts, sampling_params)

# Process and print the results
for i, result in enumerate(results):
    print(f"Prompt: {result.prompt}")
    print(f"Output: {result.outputs[0].text}\n")
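
The comment on `temperature=0.0` above says that it selects the most probable token at every step. The toy sketch below illustrates that idea in plain Python: it is not TensorRT-LLM's sampler, the logits are invented, and `softmax` and `pick_token` are hypothetical helper names. With temperature 0 the choice is a deterministic argmax; with a positive temperature the logits are scaled, turned into probabilities, and sampled.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, temperature=0.0, rng=random):
    """Return the index of the chosen token."""
    if temperature == 0.0:
        # Greedy decoding: always take the highest-scoring token
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [0.1, 2.5, -1.0]  # made-up scores for a three-token vocabulary
print("greedy choice:", pick_token(toy_logits))
print("sampled choice:", pick_token(toy_logits, temperature=0.8))
```

Greedy decoding makes the output reproducible across runs, which is convenient when comparing a quantized model against its full-precision baseline.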

The original post attached a precision comparison here as an image (figure not preserved).

git clone https://www.modelscope.cn/AI-ModelScope/TinyLlama-1.1B-Chat-v1.0.git # model download link

pip install modelscope -i https://pypi.tuna.tsinghua.edu.cn/simple

Summary and impressions of TensorRT-LLM

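TensorRT-LLM is NVIDIA's high-performance optimization framework built for large language model inference. Its core strength is deep adaptation to NVIDIA GPU hardware: through layer fusion, custom optimized kernels, dynamic batching, paged KV cache, and FP8/INT8 quantization, it markedly reduces inference latency and raises throughput, in some scenarios improving inference performance several times over. It supports end-to-end deployment of mainstream open-source models such as the Llama family and DeepSeek, and it provides a concise Python API that is compatible with the PyTorch ecosystem, which lowers the barrier to migrating and debugging models. It also supports multi-GPU and multi-node distributed inference, covering everything from small prototype tests to large-scale online services.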
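
To make the benefit of quantization concrete, the back-of-the-envelope sketch below estimates the weight footprint of the TinyLlama model used earlier. It rests on several assumptions: roughly 1.1 billion parameters (taken from the model name), every weight quantized (in practice some layers often stay in higher precision), and one FP16 scale per group of 64 weights with zero-points ignored. `weight_gib` is a hypothetical helper, and the numbers are estimates rather than measurements.

```python
def weight_gib(num_params, bits_per_weight, group_size=None, scale_bits=16):
    """Estimate weight storage in GiB, optionally adding per-group scale overhead."""
    bits = bits_per_weight
    if group_size:
        # One scale of `scale_bits` is shared by every `group_size` weights
        bits += scale_bits / group_size
    return num_params * bits / 8 / 2**30


params = 1.1e9  # assumed parameter count for TinyLlama-1.1B
fp16 = weight_gib(params, 16)
int4 = weight_gib(params, 4, group_size=64)
print(f"FP16 weights: {fp16:.2f} GiB")
print(f"INT4 (group size 64) weights: {int4:.2f} GiB")
print(f"reduction: {fp16 / int4:.2f}x")
```

Under these assumptions each weight costs 4.25 bits instead of 16, so the weights shrink by a little under a factor of four; the KV cache and activations are not included in this estimate.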