Deep Learning Model Acceleration: Converting a PyTorch Model to a TensorRT Model


Author: 是摸鱼啊 @ Zhihu (reposted with permission) | Source: https://zhuanlan.zhihu.com/p/597283907 | Editor: 小书童

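A while ago I was given an engineering task: accelerate the **MVSNet_pytorch** model (https://github.com/xy-guo/MVSNet_pytorch) to make inference more efficient. After some research and hands-on work I gained a reasonable understanding of model acceleration, which led to this article.

1. How Do We Accelerate a Model?

If the goal is higher efficiency through model acceleration, what exactly should we do?

A common approach today is to convert a model expressed in PyTorch, TensorFlow, or a similar framework into a TensorRT model. PyTorch and TensorFlow are familiar, but what is TensorRT?

TensorRT is NVIDIA's framework for accelerating model inference; in other words, it makes a trained model run faster at test time. For example, if your model takes 50 ms to process one image, it might take only about 10 ms after TensorRT acceleration (the actual gain depends on the model and the hardware). This article does not cover TensorRT in more detail; see the official website.

I split the overall acceleration workflow into two parts:

Model conversion: convert the PyTorch/TensorFlow model into a TensorRT model.
Model inference: run inference with the TensorRT model.

Note: I only performed the PyTorch -> TensorRT conversion, so the approach below is only known to work for PyTorch -> TensorRT; it is not guaranteed to apply to other kinds of model conversion.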

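2. Model Conversion

How do we obtain a TensorRT model from a PyTorch model? There are generally two ways:

Use **torch2trt** (https://github.com/NVIDIA-AI-IOT/torch2trt). torch2trt is a library that converts a PyTorch model directly into a TensorRT model, but it cannot guarantee that every model converts successfully; the MVSNet_pytorch model in this article, for example, failed. It is still worth a try, since it saves a lot of work when it succeeds.
**PyTorch -> ONNX -> TensorRT**. This is the most widely used route: first convert the PyTorch model into an ONNX model, then convert the ONNX model into a TensorRT model. This is the method this article focuses on.

3. PyTorch -> ONNX Conversion

The PyTorch -> ONNX conversion is fairly simple and only needs PyTorch's built-in API.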


          
            
# x is the tuple of example inputs used to trace the model
torch.onnx.export(model,
                  x,
                  "./ckpts/onnx_models/{}.onnx".format(model_name),
                  input_names=input_names,
                  output_names=output_names,
                  opset_version=16,
                  )

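For the meaning of each parameter, see this article (https://zhuanlan.zhihu.com/p/498425043) or the official documentation (https://pytorch.org/docs/stable/onnx.html). One parameter deserves emphasis: **opset_version**. ONNX is still evolving, and only a subset of PyTorch operators can currently be exported; a considerable number cannot. When converting, therefore, prefer the newest opset_version available to you so that more operators can be converted. The operators ONNX officially supports, and the opset version for each, are listed at https://github.com/onnx/onnx/blob/main/docs/Operators.md.

When I converted MVSNet_pytorch, the model used the torch.inverse() operator, which unfortunately could not be exported. In that situation, the following workarounds are available:

Do the transformation during data preparation and remove the operation from the model. (This is what I did. torch.inverse only inverts a matrix, so I inverted the matrix ahead of time, before model training, and fed the result into the model directly; the network no longer needs to invert anything, which avoids the conversion error.)
Replace it with an operator that has similar functionality.
Register a custom operator. This is harder and requires some understanding of the PyTorch source code.

That completes the PyTorch -> ONNX conversion. You can use onnxruntime (https://onnxruntime.ai/docs/tutorials/export-pytorch-model.html) to test whether the exported ONNX model is correct.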
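
The first workaround above (moving the matrix inversion out of the network and into data preparation) can be sketched roughly as follows. This is a minimal illustration rather than the author's actual preprocessing code: the helper name `add_inverse_proj_matrices` is hypothetical, the toy matrices stand in for real camera projection matrices, and the `(num_views, 4, 4)` shape is assumed from the shapes noted in the inference code later in this article.

```python
import numpy as np


def add_inverse_proj_matrices(sample):
    """Return a copy of `sample` with the inverse projection matrices precomputed.

    `sample["proj_matrices"]` is assumed to have shape (num_views, 4, 4).
    (Hypothetical helper, not part of MVSNet_pytorch.)
    """
    proj = np.asarray(sample["proj_matrices"], dtype=np.float64)
    # np.linalg.inv inverts each trailing 4x4 matrix independently.
    inv = np.linalg.inv(proj)
    out = dict(sample)
    out["inverse_proj_matrices"] = inv.astype(np.float32)
    return out


# Toy data: three invertible 4x4 matrices (a scale plus a translation each).
proj = np.tile(np.eye(4, dtype=np.float32), (3, 1, 1))
for v in range(3):
    proj[v, 0, 0] = 2.0 + v
    proj[v, :3, 3] = [v, 2.0 * v, 3.0 * v]

sample = add_inverse_proj_matrices({"proj_matrices": proj})
print(sample["inverse_proj_matrices"].shape)
```

With the inverse supplied as an extra input, the network only consumes `inverse_proj_matrices` and no longer needs to call torch.inverse() itself, so the unsupported operator never appears in the exported graph.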

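4. ONNX -> TensorRT Conversion

Before converting ONNX -> TensorRT, I strongly recommend simplifying the exported ONNX model with onnx-simplifier (https://github.com/daquexian/onnx-simplifier); otherwise the next conversion step may fail. onnx-simplifier is a tool for simplifying ONNX models. The ONNX model exported above is actually quite redundant and contains operations that are not needed (such as If nodes). These redundant parts can easily cause unnecessary errors in the ONNX -> TensorRT conversion and also increase the model's memory footprint, so simplifying the model is well worth doing.

Next we convert the ONNX model into a TensorRT model. First move the ONNX file into the TensorRT-8.5.1.7/bin directory, open a terminal there, and run the **official trtexec tool** to convert the model. The tool ships in the TensorRT folder downloaded earlier. For installing TensorRT, see the links at the end of this article.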


          
            
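# Run this command (--workspace is the builder workspace size in MiB;
# TensorRT 8.4 and later deprecate it in favor of --memPoolSize=workspace:6000)
./trtexec --onnx=mvsnet.onnx --saveEngine=mvsnet.trt --workspace=6000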

If no error is reported, a model named mvsnet.trt appears in the bin directory; this is the converted TensorRT model. That completes the model conversion part.
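
If several checkpoints need converting, the same trtexec invocation can be assembled from a script. The sketch below only builds and prints the command; the helper name `build_trtexec_command` is hypothetical, and it uses only the three flags shown in the command above.

```python
import shlex


def build_trtexec_command(onnx_path, engine_path, workspace_mib=6000,
                          trtexec="./trtexec"):
    """Assemble the trtexec argument list (hypothetical helper)."""
    return [
        trtexec,
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        f"--workspace={workspace_mib}",
    ]


cmd = build_trtexec_command("mvsnet.onnx", "mvsnet.trt")
print(shlex.join(cmd))
# To actually run the conversion from the TensorRT bin directory:
#   import subprocess; subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a single string avoids shell-quoting problems when model paths contain spaces.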

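5. Model Inference

In this part we run inference with the converted .trt model. The task is to load the model, feed in test data, and obtain the corresponding outputs.

First, write a TRTModule class, which plays the role of the forward pass using the TensorRT model.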


          
            
# Note: the binding-index API used here (get_binding_index, execute_async_v2, ...)
# matches TensorRT 8.x; it was removed in TensorRT 10.
class TRTModule(torch.nn.Module):
    def __init__(self, engine=None, input_names=None, output_names=None):
        super(TRTModule, self).__init__()
        self.engine = engine
        if self.engine is not None:
            # Create an execution context from the engine
            self.context = self.engine.create_execution_context()

        self.input_names = input_names
        self.output_names = output_names

    def forward(self, inputs):
        bindings = [None] * (len(self.input_names) + len(self.output_names))
        # Create the output tensors and allocate their memory
        outputs = [None] * len(self.output_names)
        for i, output_name in enumerate(self.output_names):
            idx = self.engine.get_binding_index(output_name)  # look up the binding index by name
            dtype = torch_dtype_from_trt(self.engine.get_binding_dtype(idx))  # data type of this binding
            shape = tuple(self.engine.get_binding_shape(idx))  # shape of this binding
            device = torch_device_from_trt(self.engine.get_location(idx))
            output = torch.empty(size=shape, dtype=dtype, device=device)
            outputs[i] = output
            bindings[idx] = output.data_ptr()  # bind the output data pointer

        for i, input_name in enumerate(self.input_names):
            idx = self.engine.get_binding_index(input_name)
            bindings[idx] = inputs[i].contiguous().data_ptr()

        self.context.execute_async_v2(
            bindings, torch.cuda.current_stream().cuda_stream
        )  # run inference

        outputs = tuple(outputs)
        if len(outputs) == 1:
            outputs = outputs[0]
        return outputs

Next, create a TRTModule instance, i.e. the model, and feed in test data to test it.


          
            
def main():
    logger = trt.Logger(trt.Logger.INFO)
    with open("./ckpts/trt_models/model_000015-sim.trt", "rb") as f, trt.Runtime(logger) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())  # read the local .trt file, get an ICudaEngine

    for idx in range(engine.num_bindings):  # inspect the name, dtype, and shape of each input/output
        is_input = engine.binding_is_input(idx)
        name = engine.get_binding_name(idx)
        op_type = engine.get_binding_dtype(idx)
        shape = engine.get_binding_shape(idx)
        print(f"idx: {idx}, is_input: {is_input}, binding_name: {name}, shape: {shape}, op_type: {op_type}")

    trt_model = TRTModule(engine=engine,
                          input_names=["in_imgs", "in_proj_matrices", "in_inverse_proj_matrices", "in_depth_values"],
                          output_names=["out_depth", "out_confidence"]
                          )

    torch_model = torch.load("./ckpts/torch_models/model_000015.pth").cuda()

    # create example data
    data_iter = iter(TestImgLoader)
    sample = next(data_iter)
    sample_cuda = tocuda(sample)
    x = (sample_cuda["imgs"],   # (1, 3, 3, 512, 640)
         sample_cuda["proj_matrices"],  # (1, 3, 4, 4)
         sample_cuda["inverse_proj_matrices"],  # (1, 3, 4, 4)
         sample_cuda["depth_values"])   # (1, 192)

    # define input and output names
    # input_names = ['in_imgs', 'in_proj_matrices', 'in_inverse_proj_matrices', 'in_depth_values']
    # output_names = ['out_depth', 'out_confidence']

    check_results(torch_model=torch_model, trt_model=trt_model, x=x)
    # check_speed(torch_model=torch_model, trt_model=trt_model, data_loader=TestImgLoader)
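
When comparing the two backends, a single squared-difference number can be hard to interpret, so it may help to also report absolute differences and a tolerance check. The following is a small NumPy sketch of that idea; the helper name `compare_outputs` and the tolerance values are assumptions chosen for illustration, not values from the original project.

```python
import numpy as np


def compare_outputs(reference, candidate, rtol=1e-3, atol=1e-4):
    """Summarize how closely `candidate` matches `reference` (hypothetical helper)."""
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    if ref.shape != cand.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {cand.shape}")
    abs_diff = np.abs(ref - cand)
    return {
        "max_abs_diff": float(abs_diff.max()),
        "mean_abs_diff": float(abs_diff.mean()),
        # True when |cand - ref| <= atol + rtol * |ref| holds element-wise
        "allclose": bool(np.allclose(cand, ref, rtol=rtol, atol=atol)),
    }


# Toy arrays standing in for a depth map produced by each backend.
ref = np.linspace(400.0, 900.0, 12).reshape(3, 4)
report = compare_outputs(ref, ref + 1e-5)
print(report)
```

For GPU tensors, each output would first be converted with something like `tensor.detach().cpu().numpy()` before being passed in.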

The complete code is as follows:


          
            
import sys  
import time  
  
import torch  
import tensorrt as trt  
import argparse  
  
from datasets import find_dataset_def  
from torch.utils.data import DataLoader  
from utils import *  
  
import warnings  
warnings.filterwarnings("ignore")  
  
  
parser = argparse.ArgumentParser(description='Predict depth, filter, and fuse. May be different from the original implementation')
parser.add_argument('--model', default='mvsnet', help='select model')

parser.add_argument('--dataset', default='dtu_yao_eval', help='select dataset')
parser.add_argument('--testpath', default='/media/qing_bo/sunxusen/mvs/data/DTU/mvs_testing/dtu/', help='testing data path')
parser.add_argument('--testlist', default='../../lists/dtu/test.txt', help='testing scan list')

parser.add_argument('--batch_size', type=int, default=1, help='testing batch size')
parser.add_argument('--numdepth', type=int, default=192, help='the number of depth values')
parser.add_argument('--interval_scale', type=float, default=1.06, help='the depth interval scale')

parser.add_argument('--loadckpt', default=None, help='load a specific checkpoint')
parser.add_argument('--outdir', default='./outputs', help='output dir')
parser.add_argument('--display', action='store_true', help='display depth images and masks')

# parse arguments and check
args = parser.parse_args()
  
# dataset, dataloader  
MVSDataset = find_dataset_def(args.dataset)  
test_dataset = MVSDataset(args.testpath, args.testlist, "test", 5, args.numdepth, args.interval_scale)  
TestImgLoader = DataLoader(test_dataset, args.batch_size, shuffle=False, num_workers=4, drop_last=False)  
  
  
def trt_version():
    return trt.__version__


def torch_device_from_trt(device):
    if device == trt.TensorLocation.DEVICE:
        return torch.device("cuda")
    elif device == trt.TensorLocation.HOST:
        return torch.device("cpu")
    else:
        # raise (not return) so that an unsupported location fails loudly
        raise TypeError("%s is not supported by torch" % device)


def torch_dtype_from_trt(dtype):
    if dtype == trt.int8:
        return torch.int8
    # Compare the major version numerically; comparing version strings
    # would misorder e.g. '10.0' and '7.0'.
    elif int(trt_version().split(".")[0]) >= 7 and dtype == trt.bool:
        return torch.bool
    elif dtype == trt.int32:
        return torch.int32
    elif dtype == trt.float16:
        return torch.float16
    elif dtype == trt.float32:
        return torch.float32
    else:
        raise TypeError("%s is not supported by torch" % dtype)
  
  
class TRTModule(torch.nn.Module):
    def __init__(self, engine=None, input_names=None, output_names=None):
        super(TRTModule, self).__init__()
        self.engine = engine
        if self.engine is not None:
            # Create an execution context from the engine
            self.context = self.engine.create_execution_context()

        self.input_names = input_names
        self.output_names = output_names

    def forward(self, inputs):
        bindings = [None] * (len(self.input_names) + len(self.output_names))
        # Create the output tensors and allocate their memory
        outputs = [None] * len(self.output_names)
        for i, output_name in enumerate(self.output_names):
            idx = self.engine.get_binding_index(output_name)  # look up the binding index by name
            dtype = torch_dtype_from_trt(self.engine.get_binding_dtype(idx))  # data type of this binding
            shape = tuple(self.engine.get_binding_shape(idx))  # shape of this binding
            device = torch_device_from_trt(self.engine.get_location(idx))
            output = torch.empty(size=shape, dtype=dtype, device=device)
            outputs[i] = output
            bindings[idx] = output.data_ptr()  # bind the output data pointer

        for i, input_name in enumerate(self.input_names):
            idx = self.engine.get_binding_index(input_name)
            bindings[idx] = inputs[i].contiguous().data_ptr()

        self.context.execute_async_v2(
            bindings, torch.cuda.current_stream().cuda_stream
        )  # run inference

        outputs = tuple(outputs)
        if len(outputs) == 1:
            outputs = outputs[0]
        return outputs
  
  
# Compare the outputs of the torch model and the TensorRT model
def check_results(torch_model, trt_model, x):
    with torch.no_grad():
        torch_output = torch_model(x[0], x[1], x[2], x[3])
        trt_output = trt_model(x)

    for k, v in torch_output.items():
        print(k, v.shape)

    for i in range(len(trt_output)):
        print(i, trt_output[i].shape)

    # Note: the values printed below are maximum squared differences
    print(f"depth: max_diff={torch.max((torch_output['depth'] - trt_output[0]) ** 2)}")
    print(f"photometric_confidence: max_diff={torch.max((torch_output['photometric_confidence'] - trt_output[1]) ** 2)}")
  
  
# Compare the speed of the torch model and the TensorRT model
def check_speed(torch_model, trt_model, data_loader):
    print("============================== Torch Model ==============================")
    print("[Torch] >>> begin.")
    t1 = time.time()
    with torch.no_grad():
        for batch_idx, sample in enumerate(data_loader):
            sample_cuda = tocuda(sample)
            x = (sample_cuda["imgs"],  # (1, 3, 3, 512, 640)
                 sample_cuda["proj_matrices"],  # (1, 3, 4, 4)
                 sample_cuda["inverse_proj_matrices"],  # (1, 3, 4, 4)
                 sample_cuda["depth_values"])  # (1, 192)

            torch_outputs = torch_model(x[0], x[1], x[2], x[3])
            # print('Iter {}/{}'.format(batch_idx, len(TestImgLoader)))
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the timer
    t2 = time.time()
    print(f"[Torch] >>> end, t={t2 - t1}")

    print("============================== TensorRT Model ==============================")
    print("[TensorRT] >>> begin.")
    t3 = time.time()
    with torch.no_grad():
        for batch_idx, sample in enumerate(data_loader):
            sample_cuda = tocuda(sample)
            x = (sample_cuda["imgs"],  # (1, 3, 3, 512, 640)
                 sample_cuda["proj_matrices"],  # (1, 3, 4, 4)
                 sample_cuda["inverse_proj_matrices"],  # (1, 3, 4, 4)
                 sample_cuda["depth_values"])  # (1, 192)
            trt_output = trt_model(x)
    torch.cuda.synchronize()  # execute_async_v2 is asynchronous, so sync before timing
    t4 = time.time()
    print(f"[TensorRT] end, t={t4 - t3}")
    print("function: check_speed finished. ")
  
  
def main():
    logger = trt.Logger(trt.Logger.INFO)
    with open("./ckpts/trt_models/model_000015-sim.trt", "rb") as f, trt.Runtime(logger) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())  # read the local .trt file, get an ICudaEngine

    for idx in range(engine.num_bindings):  # inspect the name, dtype, and shape of each input/output
        is_input = engine.binding_is_input(idx)
        name = engine.get_binding_name(idx)
        op_type = engine.get_binding_dtype(idx)
        shape = engine.get_binding_shape(idx)
        print(f"idx: {idx}, is_input: {is_input}, binding_name: {name}, shape: {shape}, op_type: {op_type}")

    trt_model = TRTModule(engine=engine,
                          input_names=["in_imgs", "in_proj_matrices", "in_inverse_proj_matrices", "in_depth_values"],
                          output_names=["out_depth", "out_confidence"]
                          )

    torch_model = torch.load("./ckpts/torch_models/model_000015.pth").cuda()

    # create example data
    data_iter = iter(TestImgLoader)
    sample = next(data_iter)
    sample_cuda = tocuda(sample)
    x = (sample_cuda["imgs"],   # (1, 3, 3, 512, 640)
         sample_cuda["proj_matrices"],  # (1, 3, 4, 4)
         sample_cuda["inverse_proj_matrices"],  # (1, 3, 4, 4)
         sample_cuda["depth_values"])   # (1, 192)

    # define input and output names
    # input_names = ['in_imgs', 'in_proj_matrices', 'in_inverse_proj_matrices', 'in_depth_values']
    # output_names = ['out_depth', 'out_confidence']

    check_results(torch_model=torch_model, trt_model=trt_model, x=x)
    # check_speed(torch_model=torch_model, trt_model=trt_model, data_loader=TestImgLoader)


if __name__ == '__main__':
    main()

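This part is written rather briefly; binding the inputs and outputs and creating the engine must be implemented according to your own model. The following articles may help:

Getting Started with the Latest TensorRT 8.2 (https://zhuanlan.zhihu.com/p/467401558)
How to Accelerate a Trained PyTorch Model with TensorRT?, by 伯恩legacy (https://zhuanlan.zhihu.com/p/88318324)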
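
When adapting the speed comparison to your own model, keep in mind that a loop over a data loader also times data loading, and that the first few calls of either backend tend to be slower than steady state. One possible way to isolate per-call latency is a small timing helper with warm-up runs, sketched below. The `benchmark` name and its parameters are hypothetical; for GPU code the `sync` argument would typically be `torch.cuda.synchronize`, while the toy call here runs on the CPU.

```python
import time


def benchmark(fn, warmup=3, iters=10, sync=None):
    """Return the mean wall-clock seconds per call of `fn` (hypothetical helper).

    `sync` is an optional callable that blocks until queued work has finished.
    """
    for _ in range(warmup):  # warm-up calls are not timed
        fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # make sure asynchronous work is included in the measurement
    return (time.perf_counter() - start) / iters


# Toy workload standing in for a model forward pass.
calls = []
mean_s = benchmark(lambda: calls.append(sum(range(1000))), warmup=2, iters=5)
print(f"mean latency: {mean_s * 1e3:.3f} ms over 5 runs")
```

A call such as `benchmark(lambda: trt_model(x), sync=torch.cuda.synchronize)` would then time only the forward pass on a fixed input.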

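6. Closing Remarks

That is the end of this article, which gave a rough overview of how to use TensorRT to accelerate a deep learning model. Model acceleration is an intricate task. Note that this article does not explain each topic in detail; it mainly offers an overall framework and workflow together with pointers toward solutions, as the many embedded links suggest. I hope readers will use this framework to study the relevant material with a clear direction and find solutions for their own tasks, rather than assuming this article alone will solve every problem.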
