Installing TensorRT on Ubuntu and Resolving Related Issues

This content is licensed under CC 4.0 BY-SA.

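Installing and Configuring TensorRT with Conda on Ubuntu to Deploy the PyTorch X3D-M Model

This article records the full process of installing and configuring TensorRT from scratch on Ubuntu, keeping it compatible with PyTorch and its dependencies (CUDA, cuDNN) inside a Conda environment. At the end, we use torch2trt to convert the PyTorch X3D-M model into a TensorRT engine and run an inference test.

Environment Overview

Operating system: Ubuntu 22.04
Conda: installed
CUDA: 11.5
cuDNN: 8.9.5
PyTorch: installed via Conda
TensorRT: 8.4.3.1
Other dependencies: torch2trt, pytorchvideo, pycuda

Step 1: Check and Install CUDA

First, confirm whether CUDA is installed on the system and which version it is.

Check the CUDA version

Run the following command in a terminal: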

nvcc --version

Example output:

nvcc: NVIDIA (R) Cuda compiler driver
...
Cuda compilation tools, release 11.5, V11.5.119

This shows that CUDA 11.5 is installed on the system.
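The same check can be scripted so that later steps can branch on the CUDA release. The following is a minimal sketch, not part of the original notes: the helper names `parse_nvcc_release` and `detect_cuda_release` are hypothetical, and the regular expression assumes the `release X.Y` wording shown in the sample output above.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Return (major, minor) parsed from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def detect_cuda_release():
    """Run nvcc if it is on PATH; return None when the toolkit is not found."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    print("Detected CUDA release:", detect_cuda_release())
```

A return value of `None` means either that `nvcc` is not on `PATH` or that its output did not contain a release string, so it is worth checking `PATH` before concluding that CUDA is missing.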

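Install CUDA (if missing or in need of an upgrade)

If CUDA is not installed or needs upgrading, follow these steps:

Add the NVIDIA package repository:

(No notes were taken for this step.)

Verify the installation:

Run nvcc --version again to confirm that CUDA was installed successfully.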

Step 2: Check and Install cuDNN

Next, confirm whether cuDNN is installed.

Check the cuDNN version

Run the following command:

cat /usr/include/x86_64-linux-gnu/cudnn_version_v8.h | grep CUDNN_MAJOR -A 2

Example output:

#define CUDNN_MAJOR 8
#define CUDNN_MINOR 9
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

This shows that cuDNN 8.9.5 is installed.
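If you want the version as a value rather than as grep output, the header can be parsed directly. This is a small sketch under the assumption that the header contains the three `#define` lines shown above; `parse_cudnn_version` and `read_cudnn_version` are hypothetical helper names, and the default path is simply the one used in the command above.

```python
import re
from pathlib import Path


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from cudnn_version header text, or None."""
    parts = []
    for name in ("MAJOR", "MINOR", "PATCHLEVEL"):
        match = re.search(rf"#define\s+CUDNN_{name}\s+(\d+)", header_text)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


def read_cudnn_version(path="/usr/include/x86_64-linux-gnu/cudnn_version_v8.h"):
    """Parse the header at `path`; return None if the file does not exist."""
    header = Path(path)
    if not header.is_file():
        return None
    return parse_cudnn_version(header.read_text(errors="ignore"))


SAMPLE_HEADER = """#define CUDNN_MAJOR 8
#define CUDNN_MINOR 9
#define CUDNN_PATCHLEVEL 5
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
"""

if __name__ == "__main__":
    print("Sample header parses to:", parse_cudnn_version(SAMPLE_HEADER))
    print("Installed cuDNN:", read_cudnn_version())
```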

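Install cuDNN 8.9.5

If you need to install or upgrade cuDNN, follow these steps:

Remove the existing cuDNN (if an older version is present):

sudo apt-get remove --purge libcudnn8 libcudnn8-dev libcudnn8-samples

Download cuDNN 8.9.5:

Go to the NVIDIA cuDNN download page and download the cuDNN 8.9.5 local-repository .deb file for Ubuntu 22.04 (the packages installed below are the CUDA 11.x build, tagged +cuda11.8).

Install cuDNN: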

sudo dpkg -i cudnn-local-repo-ubuntu2204-8.9.5.30_1.0-1_amd64.deb
sudo apt-key add /var/cudnn-local-repo-ubuntu2204-8.9.5.30/cudnn-*****-keyring.gpg
sudo apt update
sudo apt install libcudnn8=8.9.5.30-1+cuda11.8 libcudnn8-dev=8.9.5.30-1+cuda11.8 libcudnn8-samples=8.9.5.30-1+cuda11.8

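Verify the installation:

Run the check command again and make sure the cuDNN version is 8.9.5.

Problem

E: Unable to locate package libcudnn8
E: Unable to locate package libcudnn8-dev
E: Unable to locate package libcudnn8-samples

Inspect /etc/apt/sources.list.d/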

ls /etc/apt/sources.list.d/
docker.list  google-chrome.list  nvidia-container-toolkit.list  official-package-repositories.list  tacit-dynamics-aps-foldersync-desktop.sources  vscode.list

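The /etc/apt/sources.list.d/ directory appears to contain no repository entries for cuDNN or CUDA, which is the likely reason apt cannot locate the libcudnn8 packages. We will add NVIDIA's CUDA and cuDNN repository manually.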
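The manual inspection above can be automated. The sketch below scans the apt source files for the NVIDIA download host that the repository line in the next section uses; the function names are hypothetical, and matching on that host name is an assumption about how the repository is declared on your machine.

```python
from pathlib import Path

# Host used by NVIDIA's CUDA apt repository (taken from the repository URL below).
NVIDIA_HOST = "developer.download.nvidia.com"


def mentions_nvidia_repo(text):
    """True if any non-comment line of an apt source file references NVIDIA_HOST."""
    return any(
        NVIDIA_HOST in line
        for line in text.splitlines()
        if not line.lstrip().startswith("#")
    )


def find_nvidia_sources(directory="/etc/apt/sources.list.d"):
    """Return the names of .list/.sources files that declare the NVIDIA repository."""
    root = Path(directory)
    if not root.is_dir():
        return []
    hits = []
    for path in sorted(root.iterdir()):
        if not path.is_file() or path.suffix not in (".list", ".sources"):
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue  # unreadable file: skip rather than fail the whole scan
        if mentions_nvidia_repo(text):
            hits.append(path.name)
    return hits


if __name__ == "__main__":
    print("Files declaring the NVIDIA repository:", find_nvidia_sources())
```

An empty list matches the situation described above, where none of the listed files points at the CUDA repository.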

2.1 Manually add the NVIDIA CUDA and cuDNN repository

Add the NVIDIA repository
We need to add NVIDIA's CUDA and cuDNN repository by hand. You can use the following command:
```
sudo bash -c 'echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /" > /etc/apt/sources.list.d/cuda.list'
```
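This command adds NVIDIA's CUDA repository to your APT sources list.

Import the NVIDIA public key
Next, add NVIDIA's GPG key manually (apt-key is deprecated on Ubuntu 22.04 and prints a warning, but it still works):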

sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub

Update APT and install cuDNN
Now refresh the APT package index and install the libcudnn8 packages:
```
sudo apt update
sudo apt install libcudnn8 libcudnn8-dev libcudnn8-samples
```

2.2 Verify the cuDNN installation

```bash
# After installation, check the cuDNN version again (same header path as in the earlier check):
cat /usr/include/x86_64-linux-gnu/cudnn_version_v8.h | grep CUDNN_MAJOR -A 2
```

Step 3: Install TensorRT

3.1 Clone the existing Conda environment (to avoid breaking it)

Assuming you already have a Conda environment named movinet, clone it under the name tensorrt:

conda create --name tensorrt --clone movinet
conda activate tensorrt

3.2 Install TensorRT

Install TensorRT with pip

Because the official Conda channels may not provide a suitable TensorRT build, installing with pip is recommended:

pip install nvidia-pyindex
pip install nvidia-tensorrt==8.4.3.1

Make sure the installed TensorRT version is compatible with the CUDA and cuDNN versions on the system.
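Before comparing versions by hand, it can help to list what is actually installed in the active environment. This is a minimal sketch that reads package metadata without importing the packages themselves; `installed_versions` is a hypothetical helper, and the distribution names are assumed to match the pip commands used in this article.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Names assumed from the pip commands in this article.
    wanted = ["nvidia-tensorrt", "torch", "torchvision", "pycuda"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```

Reading metadata instead of importing avoids loading CUDA libraries, so the check still works when one of the packages is broken.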

Step 4: Test the TensorRT Installation

Write and run a simple TensorRT test program to verify that the installation works.

Example test code

import numpy as np
import tensorrt as trt
import pycuda.autoinit
import pycuda.driver as cuda

print(f"TensorRT version: {trt.__version__}")
print(f"CUDA version: {cuda.get_version()}")
print(f"PyCUDA version: {pycuda.VERSION_TEXT}")

# Build a simple fully connected network
def create_network(network, input_tensor):
    # Read the input dimensions
    input_shape = input_tensor.shape

    # First fully connected layer (emulated with a matrix multiply plus a bias add)
    w1 = network.add_constant((32, input_shape[1]), np.random.randn(32, input_shape[1]).astype(np.float32))
    fc1 = network.add_matrix_multiply(input_tensor, trt.MatrixOperation.NONE, w1.get_output(0), trt.MatrixOperation.TRANSPOSE)
    b1 = network.add_constant((1, 32), np.random.randn(1, 32).astype(np.float32))
    fc1_bias = network.add_elementwise(fc1.get_output(0), b1.get_output(0), trt.ElementWiseOperation.SUM)

    # ReLU activation
    relu1 = network.add_activation(fc1_bias.get_output(0), trt.ActivationType.RELU)

    # Second fully connected layer
    w2 = network.add_constant((10, 32), np.random.randn(10, 32).astype(np.float32))
    fc2 = network.add_matrix_multiply(relu1.get_output(0), trt.MatrixOperation.NONE, w2.get_output(0), trt.MatrixOperation.TRANSPOSE)
    b2 = network.add_constant((1, 10), np.random.randn(1, 10).astype(np.float32))
    fc2_bias = network.add_elementwise(fc2.get_output(0), b2.get_output(0), trt.ElementWiseOperation.SUM)

    # Softmax output. Select axis 1 (the class dimension) explicitly: for a 2-D
    # tensor TensorRT defaults to axis 0, which has size 1 here and would make
    # every output exactly 1.0.
    softmax = network.add_softmax(fc2_bias.get_output(0))
    softmax.axes = 1 << 1
    return softmax

def main():
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    
    # Define input tensor
    input_tensor = network.add_input("input", trt.DataType.FLOAT, (1, 784))
    
    # Create network
    output_tensor = create_network(network, input_tensor).get_output(0)
    output_tensor.name = "output"
    network.mark_output(output_tensor)
    
    # Build engine
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 20)  # 1 MiB
    
    try:
        serialized_engine = builder.build_serialized_network(network, config)
    except Exception as e:
        print(f"Error building serialized network: {e}")
        return
    
    if serialized_engine is None:
        print("Failed to create serialized TensorRT engine.")
        return
    
    # Create runtime engine from serialized engine
    runtime = trt.Runtime(logger)
    try:
        engine = runtime.deserialize_cuda_engine(serialized_engine)
    except Exception as e:
        print(f"Error deserializing CUDA engine: {e}")
        return
    
    if engine is None:
        print("Failed to create TensorRT engine.")
        return
    
    print("Successfully created TensorRT engine.")
    
    # Prepare input data
    input_data = np.random.rand(1, 784).astype(np.float32)
    
    # Allocate output memory
    output = np.empty((1, 10), dtype=np.float32)
    
    # Create CUDA stream
    stream = cuda.Stream()
    
    # Allocate device memory
    d_input = cuda.mem_alloc(input_data.nbytes)
    d_output = cuda.mem_alloc(output.nbytes)
    
    # Create execution context
    context = engine.create_execution_context()
    
    try:
        # Copy input data to device
        cuda.memcpy_htod_async(d_input, input_data, stream)
        
        # Execute inference
        print("Available methods on context:", dir(context))
        
        if hasattr(context, 'execute_async_v2'):
            print("Using execute_async_v2")
            context.execute_async_v2(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
        elif hasattr(context, 'execute_async'):
            print("Using execute_async")
            context.execute_async(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
        elif hasattr(context, 'execute_v2'):
            print("Using execute_v2")
            context.execute_v2(bindings=[int(d_input), int(d_output)])
        else:
            print("No known execution method found")
            raise AttributeError("No suitable execution method found on IExecutionContext")
        
        # Copy output from device to host
        cuda.memcpy_dtoh_async(output, d_output, stream)
        
        # Synchronize stream
        stream.synchronize()
        
        print("Input shape:", input_data.shape)
        print("Output shape:", output.shape)
        print("Output:", output)
        
        print("TensorRT test completed successfully!")
    except Exception as e:
        print(f"An error occurred during execution: {e}")
    finally:
        # Clean up resources
        del context
        del engine
        del runtime

if __name__ == "__main__":
    main()

Run the test program

Make sure pycuda is installed:

pip install pycuda

Run the test program:

python test.py

Expected output:

TensorRT version: 8.4.3.1
CUDA version: (11, 5, 0)
PyCUDA version: 2024.1.2
[09/30/2024-17:25:00] [TRT] [W] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.4.0
[09/30/2024-17:25:00] [TRT] [W] Try increasing the workspace size to 4194304 bytes to get better performance.
[09/30/2024-17:25:00] [TRT] [W] Try increasing the workspace size to 4194304 bytes to get better performance.
[09/30/2024-17:25:01] [TRT] [W] Try increasing the workspace size to 4194304 bytes to get better performance.
[09/30/2024-17:25:01] [TRT] [W] Try increasing the workspace size to 4194304 bytes to get better performance.
[09/30/2024-17:25:01] [TRT] [W] Try increasing the workspace size to 4194304 bytes to get better performance.
[09/30/2024-17:25:01] [TRT] [W] Try increasing the workspace size to 4194304 bytes to get better performance.
[09/30/2024-17:25:01] [TRT] [W] Try increasing the workspace size to 4194304 bytes to get better performance.
[09/30/2024-17:25:01] [TRT] [W] Try increasing the workspace size to 4194304 bytes to get better performance.
[09/30/2024-17:25:01] [TRT] [W] Try increasing the workspace size to 4194304 bytes to get better performance.
[09/30/2024-17:25:01] [TRT] [W] Try increasing the workspace size to 4194304 bytes to get better performance.
[09/30/2024-17:25:01] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[09/30/2024-17:25:01] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
Successfully created TensorRT engine.
Available methods on context: ['__class__', '__del__', '__delattr__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'active_optimization_profile', 'all_binding_shapes_specified', 'all_shape_inputs_specified', 'debug_sync', 'device_memory', 'engine', 'enqueue_emits_profile', 'error_recorder', 'execute', 'execute_async', 'execute_async_v2', 'execute_v2', 'get_binding_shape', 'get_shape', 'get_strides', 'name', 'profiler', 'report_to_profiler', 'set_binding_shape', 'set_optimization_profile_async', 'set_shape_input']
Using execute_async_v2
Input shape: (1, 784)
Output shape: (1, 10)
Output: [[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
TensorRT test completed successfully!

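If you see an inference result and no serious errors, TensorRT is installed correctly. The all-ones output in this log was produced with the softmax layer's default axis: for a 2-D tensor TensorRT applies softmax over axis 0, which has size 1 here, so each value is normalized against itself. With softmax.axes = 1 << 1 set on the layer, the ten values instead form a probability distribution that sums to 1.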
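The effect of the softmax axis on a batch of one is easy to reproduce without a GPU. The following pure-Python sketch is an illustration only (it does not call TensorRT, and the helper names and logit values are made up): it normalizes a (1, 4) array once over axis 0 and once over axis 1.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_2d(rows, axis):
    """Softmax over axis 0 (down each column) or axis 1 (along each row)."""
    if axis == 1:
        return [softmax(row) for row in rows]
    # axis 0: normalize every column across the rows, then transpose back
    columns = [softmax(list(column)) for column in zip(*rows)]
    return [list(row) for row in zip(*columns)]


logits = [[0.5, -1.2, 3.0, 0.0]]  # shape (1, 4): one sample, four classes

over_batch = softmax_2d(logits, axis=0)    # each value normalized against itself
over_classes = softmax_2d(logits, axis=1)  # a distribution over the classes

if __name__ == "__main__":
    print("axis 0:", over_batch)
    print("axis 1:", over_classes)
```

With a single row, the axis-0 result is all ones regardless of the logits, which is the pattern visible in the log above; only the axis-1 result depends on the network's output.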

Step 5: Convert and Deploy the X3D-M Model

5.1 Install the required libraries

Make sure the following libraries are installed:

pip install torch torchvision pytorchvideo
pip install git+https://github.com/NVIDIA-AI-IOT/torch2trt.git

5.2 Convert the X3D-M model with `torch2trt`

Write and run the following script to convert the X3D-M model into a TensorRT engine.

Conversion script (`totrt.py`)

import torch
from pytorchvideo.models.hub import x3d_m
from torch2trt import torch2trt

# 1. Load the X3D-M model with pretrained weights
model = x3d_m(pretrained=True, progress=True, input_channel=3).eval().cuda()  # the hub function x3d_m selects the medium-sized X3D variant

# 2. Prepare example input data (a 16-frame 3x224x224 clip)
# Input tensor shape: (batch_size, channels, frames, height, width)
example_input = torch.randn(1, 3, 16, 224, 224).cuda()

# 3. Convert the PyTorch model to a TensorRT model
# Note: torch2trt uses FP32 by default; pass fp16_mode=True to enable FP16
model_trt = torch2trt(model, [example_input], fp16_mode=False)  # set fp16_mode=True if the GPU supports FP16

# 4. Save the TensorRT model
torch.save(model_trt.state_dict(), 'x3d_trt.pth')

print("Model converted and saved successfully!")

5.3 Run the conversion script

python totrt.py

Expected output:

Model converted and saved successfully!

5.4 Test the converted TensorRT model

Write and run the following script to load and test the TensorRT engine.

Inference test script (`test_trt_model.py`)

from torch2trt import TRTModule
import torch

# 1. Load the TensorRT model
model_trt = TRTModule()
model_trt.load_state_dict(torch.load('x3d_trt.pth'))

# 2. Prepare inference input (1 batch, 3 channels, 16 frames, 224x224 resolution)
input_data = torch.randn(1, 3, 16, 224, 224).cuda()

# 3. Run inference with the TensorRT model
output = model_trt(input_data)

# 4. Print the inference result
print("Inference output shape:", output.shape)
print("Inference output:", output)

Run the inference test script

python test_trt_model.py

Expected output:

Inference output shape: torch.Size([1, 400])
Inference output: tensor([[ ... ]], device='cuda:0')

This shows that the X3D-M model was converted to TensorRT successfully and can run inference.
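A successful forward pass does not yet tell you whether the engine is faster than the original model. One way to compare is a small timing helper such as the sketch below; `benchmark` is a hypothetical function, not part of torch2trt, and the commented usage assumes the `model_trt` and `input_data` names from the test script above.

```python
import time


def benchmark(fn, warmup=3, iters=10, sync=None):
    """Return the mean wall-clock latency of fn() in milliseconds.

    `sync` is an optional callable that blocks until queued GPU work has
    finished, so asynchronous kernels are included in the measurement.
    """
    for _ in range(warmup):
        fn()  # warm-up runs are excluded from the timing
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) * 1000.0 / iters


# Assumed usage with the objects from test_trt_model.py:
#   trt_ms = benchmark(lambda: model_trt(input_data), sync=torch.cuda.synchronize)
#   print(f"TensorRT: {trt_ms:.2f} ms per clip")

if __name__ == "__main__":
    print(f"Dummy workload: {benchmark(lambda: sum(range(10000))):.3f} ms")
```

Passing `torch.cuda.synchronize` as `sync` matters because CUDA calls return before the GPU finishes; without it the loop mostly measures launch overhead.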

Common Problems and Solutions

Problem 1: TensorRT and cuDNN version mismatch

Warning message:

TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.4.0

Explanation: TensorRT was built against cuDNN 8.4.1, but the cuDNN loaded at runtime is 8.4.0.

Solutions:

Update cuDNN to 8.4.1 (untested):

sudo apt-get remove --purge libcudnn8 libcudnn8-dev libcudnn8-samples
# Install after downloading cuDNN 8.4.1
sudo dpkg -i cudnn-local-repo-ubuntu2204-8.4.1.50_1.0-1_amd64.deb
sudo apt-key add /var/cudnn-local-repo-ubuntu2204-8.4.1.50/cudnn-local-archive-keyring.gpg
sudo apt update
sudo apt install libcudnn8 libcudnn8-dev libcudnn8-samples

Ignore the warning:

If inference runs without anomalies, you can choose to ignore this warning.
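To decide between the two options, it helps to see how far apart the two versions are. The sketch below parses the warning text and reports whether only the patch level differs; the helper names are hypothetical, and treating a patch-level difference as lower risk is a rule of thumb assumed here, not a guarantee from NVIDIA.

```python
import re

_PATTERN = re.compile(
    r"linked against cuDNN (\d+)\.(\d+)\.(\d+) but loaded cuDNN (\d+)\.(\d+)\.(\d+)"
)


def parse_cudnn_mismatch(message):
    """Return ((linked), (loaded)) version tuples from the warning, or None."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    numbers = tuple(int(group) for group in match.groups())
    return numbers[:3], numbers[3:]


def is_patch_level_only(linked, loaded):
    """True when major and minor agree and only the patch number differs."""
    return linked[:2] == loaded[:2] and linked[2] != loaded[2]


WARNING = "TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.4.0"

if __name__ == "__main__":
    linked, loaded = parse_cudnn_mismatch(WARNING)
    print("linked:", linked, "loaded:", loaded)
    print("patch-level difference only:", is_patch_level_only(linked, loaded))
```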

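Problem 2: `torchvision` warning

Warning message:

Failed to load image Python extension: '.../torchvision/image.so: undefined symbol: ...'

Solutions:

Reinstall torchvision:

This warning usually means the installed torchvision binary was built against a different PyTorch version, so make sure the reinstalled release matches your torch version.

pip uninstall torchvision
pip install torchvision --no-cache-dir

Ignore the warning:

If you do not use the image functions in torchvision.io, you can ignore this warning.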

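Problem 3: `torch2trt` conversion fails

Error message:

AttributeError: 'NoneType' object has no attribute 'serialize'

Solutions:

Make sure the input shape is correct:

For the X3D-M model, the input shape should be (batch_size, channels, frames, height, width), for example (1, 3, 16, 224, 224).

Check for TensorRT build errors:

This error means the TensorRT builder returned None instead of an engine. If the engine build fails, make sure TensorRT and its dependencies are installed correctly and that their versions match.

Use ONNX as an intermediate format:

If torch2trt cannot handle a complex model, try converting through ONNX instead.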
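The ONNX route could look roughly like the sketch below. It is an outline under several assumptions rather than a tested recipe for X3D-M: the file names and tensor names are placeholders, opset 13 is an arbitrary choice, and `trtexec` is assumed to be available from a TensorRT tar or deb installation (the pip wheel may not provide it). Whether every X3D-M operator exports cleanly is not verified here.

```python
import subprocess


def export_to_onnx(model, example_input, onnx_path="x3d_m.onnx"):
    """Export a PyTorch model to ONNX (file and tensor names are placeholders)."""
    import torch  # imported lazily so the command helper works without PyTorch

    model.eval()
    with torch.no_grad():
        torch.onnx.export(
            model,
            example_input,
            onnx_path,
            input_names=["input"],
            output_names=["output"],
            opset_version=13,  # assumed opset; adjust if an operator is unsupported
        )
    return onnx_path


def trtexec_command(onnx_path, engine_path, fp16=False):
    """Build the trtexec argument list that turns an ONNX file into an engine."""
    command = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        command.append("--fp16")
    return command


def build_engine(onnx_path, engine_path, fp16=False):
    """Run trtexec; raises CalledProcessError if the build fails."""
    subprocess.run(trtexec_command(onnx_path, engine_path, fp16), check=True)


if __name__ == "__main__":
    print(" ".join(trtexec_command("x3d_m.onnx", "x3d_m.engine")))
```

Keeping the command construction separate from the export makes it possible to inspect or log the exact `trtexec` invocation before running it.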

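Problem 4: TensorRT layer dimension mismatch during inference

Error message:

inputs with the same operation must have same number of dimensions, but have 5 and 3 dimensions respectively.

Solutions:

Make sure layer inputs and outputs have consistent dimensions:

In a hand-built TensorRT network, make sure the inputs to each operation have the same number of dimensions.

Use a standard conversion workflow:

Prefer standard tools (such as torch2trt or trtexec) for model conversion, and avoid adding complex layers by hand.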
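When a hand-built network does need an elementwise operation between tensors of different rank, the lower-rank operand has to be reshaped to the same number of dimensions first. The sketch below only computes the target shape; `align_shape` is a hypothetical helper, and applying the result in TensorRT (for example through a shuffle layer's reshape dimensions) is left as an assumption about your network.

```python
def align_shape(shape, rank, start_axis=0):
    """Pad `shape` with size-1 axes so it has `rank` dimensions.

    The original axes are placed starting at `start_axis`; every other
    axis gets size 1, which lets elementwise layers broadcast over it.
    """
    shape = tuple(shape)
    trailing = rank - start_axis - len(shape)
    if start_axis < 0 or trailing < 0:
        raise ValueError(
            f"shape {shape} does not fit into rank {rank} at axis {start_axis}"
        )
    return (1,) * start_axis + shape + (1,) * trailing


if __name__ == "__main__":
    # A per-channel tensor of shape (64, 1, 1) aligned with a 5-D
    # (batch, channels, frames, height, width) activation.
    print(align_shape((64, 1, 1), rank=5, start_axis=1))
```

Placing the channel axis at position 1 matches the (batch_size, channels, frames, height, width) layout used for X3D-M in this article; padding only with leading ones would line the channels up with the frames axis instead.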

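Summary

By following this document, you can install and configure TensorRT on Ubuntu and keep it compatible with PyTorch and its dependencies (CUDA, cuDNN) inside a Conda environment. Finally, you can use torch2trt to convert complex PyTorch models such as X3D-M into efficient TensorRT engines and accelerate inference.

Key steps recap: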