yolo11 + sahi: a simple implementation that uses image slicing to improve real-time detection of small objects in video

Original article · last modified 2025-06-20 16:22:03

This content is licensed under CC 4.0 BY-SA.

Tags

#yolo #ultralytics #object-detection #video #python

First published 2025-06-16 17:47:28

1. Background

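I had been using yolo for simple image and video detection, and it handles most simple scenes well. For unusual camera angles and for small, distant objects, though, the default detection models do noticeably worse. One option is to label data and train a custom model, but I have not found training easy to tune or its results easy to predict.
So I ended up using sahi to slice each image into several smaller tiles and combining that with yolo's multi-image batch inference. Each object then takes up a larger share of the tile it appears in, which improves detection.
One caveat: as far as I remember, the official yolo example notes that sahi slicing is not particularly well suited to real-time detection, so you may need to balance resource usage against processing time yourself.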
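To make the "larger share" argument concrete, here is a rough back-of-the-envelope sketch. It assumes the detector resizes its input so that the longer side equals an `imgsz` of 640 (the Ultralytics default). The 1920-pixel frame and the 20-pixel object are made-up numbers, and `size_at_input` is a hypothetical helper, not part of either library.

```python
def size_at_input(obj_px: float, img_long_side: int, imgsz: int = 640) -> float:
    """Approximate object size in pixels after the image is resized for the network.

    Assumption: an aspect-preserving resize where the longer side becomes `imgsz`.
    """
    return obj_px * imgsz / img_long_side


# Hypothetical numbers: a 20 px wide car in a 1920 px wide frame.
full_frame_px = size_at_input(20, 1920)
# The same car inside a 256 px slice that is resized to the same input size.
slice_px = size_at_input(20, 256)

print(f"full frame: {full_frame_px:.1f} px, 256 px slice: {slice_px:.1f} px")
```

Under these assumptions the object ends up several times larger at the network input when it comes from a slice, which is the effect the slicing approach relies on.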

2. yolo environment on Linux (optional)

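The project was moved to a Linux server, so the environment has to be set up there. If your yolo environment is already configured (or you keep running on a previously configured Windows machine), skip this section.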

# Base code and venv environment
sudo apt install -y python3 python3-venv python3-pip
cd ~
git clone https://github.com/ultralytics/ultralytics
cd ultralytics
python3 -m venv yolo-env
source yolo-env/bin/activate

# PyTorch (adjust the version to match your hardware)
pip uninstall torch torchvision torchaudio
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# yolo dependencies
pip install -e .

# CUDA (adjust the version to match your hardware)
# https://developer.nvidia.com/cuda-12-8-1-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=24.04&target_type=deb_network
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8

# Add the matching PATH variables for bash
vim ~/.bashrc
# Append the following lines to the end of the file and save
export PATH=/usr/local/cuda-12.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH

3. sahi + yolo .pt model

The official yolo docs already include a fairly detailed demo of basic sahi usage:
https://docs.ultralytics.com/zh/guides/sahi-tiled-inference/

Environment:

pip install -U ultralytics sahi

Minimal example (based on the documentation above):

from sahi.predict import get_sliced_prediction
from sahi import AutoDetectionModel
from IPython.display import Image

detection_model = AutoDetectionModel.from_pretrained(
    model_type="ultralytics",
    model_path="models/yolo11n.pt",  # prepare the model file beforehand
    confidence_threshold=0.3,
    device="cuda:0",
)
# Slicing: tune the slice size and the overlap ratio to balance accuracy against speed
result = get_sliced_prediction(
    "demo_data/small-vehicles1.jpeg",
    detection_model,
    slice_height=256,
    slice_width=256,
    overlap_height_ratio=0.2,
    overlap_width_ratio=0.2,
)
result.export_visuals(export_dir="demo_data/")
Image("demo_data/prediction_visual.png")

Sliced inference detects more of the small, distant cars than a plain yolo predict run.

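If you only want to try it out, this is about all you need. Browse the sahi documentation, tune the parameters, and you can run predictions on images, videos, and so on.
See also the sahi GitHub repository.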
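To get a feel for how slice size and overlap affect the amount of work per frame, the sketch below counts the tiles a sliding window produces. It is a simplified stand-in written for illustration, not sahi's own slicing code; `tile_starts` and `tile_grid` are hypothetical helpers, and the 1920x1080 frame size is an assumed example.

```python
from typing import List, Tuple


def tile_starts(length: int, tile: int, overlap_ratio: float) -> List[int]:
    """Start offsets of tiles along one axis; the last tile is pulled back to fit."""
    step = tile - int(tile * overlap_ratio)
    if step <= 0:
        raise ValueError("overlap_ratio must leave a positive step")
    starts = []
    pos = 0
    while True:
        if pos + tile >= length:
            # Final tile: align it with the image border instead of overshooting.
            starts.append(max(0, length - tile))
            break
        starts.append(pos)
        pos += step
    return starts


def tile_grid(img_w: int, img_h: int, slice_w: int, slice_h: int,
              overlap_w: float, overlap_h: float) -> List[Tuple[int, int]]:
    """Top-left corners (x, y) of every tile."""
    return [(x, y)
            for y in tile_starts(img_h, slice_h, overlap_h)
            for x in tile_starts(img_w, slice_w, overlap_w)]


# Assumed 1920x1080 frame, 20% overlap in both directions.
small_tiles = tile_grid(1920, 1080, 256, 256, 0.2, 0.2)
large_tiles = tile_grid(1920, 1080, 512, 512, 0.2, 0.2)
print(len(small_tiles), len(large_tiles))
```

Smaller slices enlarge objects more but multiply the number of tiles the model has to process, which is the accuracy versus speed trade-off mentioned in the code comment above.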

4. sahi + TensorRT model

4.1 Background

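For real-time video detection, however, the default yolo .pt model plus sahi takes rather long per frame. I am currently considering two ways to address this:

One is to use a fast algorithm alongside the sahi + yolo predictions to make simple estimates between detections, so that fewer frames need a full yolo pass and the total time drops. I will not cover that here and will look into it later.
The other is to convert the model and see whether that improves throughput. The official yolo docs provide guidance on model export, so converting to TensorRT is one way to reduce processing time.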

4.2 TensorRT export

Official guide: https://docs.ultralytics.com/zh/modes/export/

Environment:
pip install --upgrade tensorrt
# If the code below complains about a missing package, install it as needed
pip install --no-cache-dir "onnx>=1.12.0" "onnxslim" "onnxruntime-gpu"
pip install nvidia-pyindex

Then add the export code following the guide; see the documentation for the individual export parameters:

from ultralytics import YOLO
from pathlib import Path

BASE_DIR = Path(__file__).parent.resolve()
# Load a model
model = YOLO(str(BASE_DIR / "yolo11x.pt"))  # load an official model

# Export the model
model.export(format="engine",  # TensorRT
            half=True,        # use FP16 (optional)
            int8=False,       # for INT8, set to True and provide a calibration dataset
            dynamic=True,     # enable dynamic shape support
            batch=8,          # batch size, so a group of sahi slices can be processed together
            )
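The `batch=8` setting matters later when slices are grouped for inference. The sketch below shows one way to split a list of slices into groups that never exceed the exported batch size. It assumes, as the Ultralytics export docs describe, that `batch` is the maximum number of images the engine handles at once; `plan_batches` is a hypothetical helper and the 15 slices are an assumed example.

```python
from typing import List, Sequence


def plan_batches(items: Sequence, runtime_batch: int, export_batch: int = 8) -> List[Sequence]:
    """Split `items` into consecutive groups of at most `runtime_batch` elements.

    Assumption: the engine was exported with batch=`export_batch`, so larger
    runtime groups would not fit.
    """
    if runtime_batch < 1:
        raise ValueError("runtime_batch must be at least 1")
    if runtime_batch > export_batch:
        raise ValueError("runtime_batch exceeds the batch size used at export time")
    return [items[i:i + runtime_batch] for i in range(0, len(items), runtime_batch)]


# Assumed example: 15 slices from one frame, engine exported with batch=8.
slice_ids = list(range(15))
batches = plan_batches(slice_ids, runtime_batch=8)
batch_sizes = [len(b) for b in batches]
print(batch_sizes)
```

The last group is usually smaller than the others, which is why dynamic shapes are convenient here.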

4.3 Combining with sahi

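The sahi docs say TensorRT is not among the model types supported out of the box, so we need to write a custom sahi model based on the existing model code.
The default yolo + sahi implementation is a good reference. Note that it does not support batching, so you need to adapt it slightly as you go: https://github.com/obss/sahi/blob/main/sahi/models/ultralytics.py
A simple implementation example follows: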

4.3.1 Custom subclass

The subclass has to override a few functions:

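# Imports mirror those of sahi/models/ultralytics.py; module paths may differ between sahi versions
from typing import Any, List, Optional

import cv2
import numpy as np
from sahi.models.base import DetectionModel
from sahi.prediction import ObjectPrediction
from sahi.utils.compatibility import fix_full_shape_list, fix_shift_amount_list
from sahi.utils.cv import get_coco_segmentation_from_bool_mask
from ultralytics import YOLO


class YoloTensorRTModel(DetectionModel):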
   def __init__(self, *args, task="detect", max_batch_size=6, max_predict_det=500, **kwargs):
       self._original_shape = None
       # Store the extra arguments: task type, maximum batch size, maximum detections per image
       self.task = task
       self.max_batch_size = max_batch_size
       self.max_predict_det = max_predict_det
       super().__init__(*args, **kwargs)

   # If a model object is passed to YoloTensorRTModel at init, set_model is called; otherwise load_model loads it from model_path
   def load_model(self):
       try:
           model = YOLO(model=self.model_path, task=self.task)
           self.set_model(model)
       except Exception as e:
           raise TypeError("model_path is not a valid yolo model path: ", e)

   def set_model(self, model: Any, **kwargs):
       self.model = model
       # set category_mapping
       if not self.category_mapping:
           category_mapping = {str(ind): category_name for ind, category_name in enumerate(self.category_names)}
           self.category_mapping = category_mapping

   def predict(self, image):
       results = self.model.predict(image)
       return results

   # Batched version of perform_inference, based on the upstream ultralytics model implementation in sahi
   def perform_inference(self, images):
       # Confirm model is loaded
       if self.model is None:
           raise ValueError("Model is not loaded, load it by calling .load_model()")
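       # 1. Run batched inference directly on the model
       #    (conf uses the threshold passed at construction instead of a hard-coded 0.25)
       prediction_result = self.model.predict(images, save_txt=False, save_conf=False, max_det=self.max_predict_det, conf=self.confidence_threshold)
       # 2. Extract boxes.data from each Results object (torch.Tensor, shape=(ni, 6))
       prediction_result = [per_result.boxes.data for per_result in prediction_result]  # data holds [x1, y1, x2, y2, conf, cls]
       self._original_predictions = prediction_result
       # 3. Record each image's original size for mapping results back later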
       self._original_shape = [
           (img.shape[0], img.shape[1]) if isinstance(img, np.ndarray)
           else (int(img.shape[2]), int(img.shape[3]))
           for img in images
       ]
       
   # Convert each slice's predictions to ObjectPrediction objects and collect them into the per-image list
   def _create_object_prediction_list_from_original_predictions(
           self,
           shift_amount_list: Optional[List[List[int]]] = [[0, 0]],
           full_shape_list: Optional[List[List[int]]] = None,
   ):
       """
       self._original_predictions is converted to a list of prediction.ObjectPrediction and set to
       self._object_prediction_list_per_image.
       Args:
           shift_amount_list: list of list
               To shift the box and mask predictions from sliced image to full sized image, should
               be in the form of List[[shift_x, shift_y],[shift_x, shift_y],...]
           full_shape_list: list of list
               Size of the full image after shifting, should be in the form of
               List[[height, width],[height, width],...]
       """
       original_predictions = self._original_predictions

       # compatibility for sahi v0.8.15 (format handling for older versions)
       shift_amount_list = fix_shift_amount_list(shift_amount_list)
       full_shape_list = fix_full_shape_list(full_shape_list)

       # handle all predictions
       object_prediction_list_per_image = []
       for image_ind, image_predictions in enumerate(original_predictions):
           shift_amount = shift_amount_list[image_ind]
           full_shape = None if full_shape_list is None else (
               full_shape_list[image_ind]
               if 0 <= image_ind < len(full_shape_list) and full_shape_list[image_ind] is not None
               else full_shape_list[0]
           )
           object_prediction_list = []

           # Extract boxes and optional masks/obb
           if self.has_mask or self.is_obb:
               boxes = image_predictions[0].cpu().detach().numpy()
               masks_or_points = image_predictions[1].cpu().detach().numpy()
           else:
               boxes = image_predictions.data.cpu().detach().numpy()
               masks_or_points = None

           # Process each prediction
           for pred_ind, prediction in enumerate(boxes):
               # Get bbox coordinates
               bbox = prediction[:4].tolist()
               score = prediction[4]
               category_id = int(prediction[5])
               category_name = self.category_mapping[str(category_id)]

               # Fix box coordinates
               bbox = [max(0, coord) for coord in bbox]
               if full_shape is not None:
                   bbox[0] = min(full_shape[1], bbox[0])
                   bbox[1] = min(full_shape[0], bbox[1])
                   bbox[2] = min(full_shape[1], bbox[2])
                   bbox[3] = min(full_shape[0], bbox[3])

               # Ignore invalid predictions
               if not (bbox[0] < bbox[2]) or not (bbox[1] < bbox[3]):
                   # logger.warning(f"ignoring invalid prediction with bbox: {bbox}")
                   continue

               # Get segmentation or OBB points
               segmentation = None
               if masks_or_points is not None:
                   if self.has_mask:
                       bool_mask = masks_or_points[pred_ind]
                       # Resize mask to original image size
                       bool_mask = cv2.resize(
                           bool_mask.astype(np.uint8), (self._original_shape[image_ind][1], self._original_shape[image_ind][0])
                       )
                       segmentation = get_coco_segmentation_from_bool_mask(bool_mask)
                   else:  # is_obb
                       obb_points = masks_or_points[pred_ind]  # Get OBB points for this prediction
                       segmentation = [obb_points.reshape(-1).tolist()]

                   if len(segmentation) == 0:
                       continue

               # Create and append object prediction
               object_prediction = ObjectPrediction(
                   bbox=bbox,
                   category_id=category_id,
                   score=score,
                   segmentation=segmentation,
                   category_name=category_name,
                   shift_amount=shift_amount,
                   full_shape=self._original_shape[image_ind] if full_shape is None else full_shape,  # (height, width)
               )
               object_prediction_list.append(object_prediction)

           object_prediction_list_per_image.append(object_prediction_list)

       self._object_prediction_list_per_image = object_prediction_list_per_image

   @property
   def category_names(self):
       return self.model.names.values()

   @property
   def has_mask(self):
       """
       Returns if model output contains segmentation mask
       """
       return self.model.overrides["task"] == "segment"

   @property
   def is_obb(self):
       """
       Returns if model output contains oriented bounding boxes
       """
       return self.model.overrides["task"] == "obb"
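The core idea behind mapping slice results back to the full image can be shown without any library code. The sketch below shifts a box from slice coordinates by the slice's starting pixel and clips it to the full image. It illustrates the principle only and does not reproduce the exact order of operations in the class above; `shift_and_clip` is a hypothetical helper and all numbers are made up.

```python
from typing import List, Optional, Sequence


def shift_and_clip(bbox: Sequence[float], shift_amount: Sequence[int],
                   full_shape: Sequence[int]) -> Optional[List[float]]:
    """Map an [x1, y1, x2, y2] box from slice coordinates to full-image coordinates.

    shift_amount is [shift_x, shift_y] (the slice's top-left corner),
    full_shape is (height, width). Returns None if the clipped box is empty.
    """
    shift_x, shift_y = shift_amount
    height, width = full_shape
    x1 = min(max(0, bbox[0] + shift_x), width)
    y1 = min(max(0, bbox[1] + shift_y), height)
    x2 = min(max(0, bbox[2] + shift_x), width)
    y2 = min(max(0, bbox[3] + shift_y), height)
    if not (x1 < x2 and y1 < y2):
        return None  # degenerate box after clipping
    return [x1, y1, x2, y2]


# Made-up example: a box found in the slice that starts at pixel (205, 410).
inside = shift_and_clip([10, 20, 50, 60], [205, 410], (1080, 1920))
# A box that falls entirely outside the full image after shifting is dropped.
outside = shift_and_clip([250, 0, 300, 40], [1700, 0], (1080, 1920))
print(inside, outside)
```

In the class above, the shift itself is deferred: `ObjectPrediction` stores `shift_amount` and `full_shape`, and the shifted box is produced later via `get_shifted_object_prediction()`.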

4.3.2 Sliced inference implementation

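The demo from the Ultralytics site in section 3 simply calls get_sliced_prediction to run inference, so here we use that function as a reference and rewrite it to support batching (treat the code as a reference; there is plenty of room for optimization):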

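# Imports follow sahi/predict.py; module paths may differ between sahi versions.
# tprint is a tagged logging helper that is not defined in this article;
# any print-like function that accepts a `tag` keyword works.
import math
import time
from typing import List, Optional

from sahi.predict import POSTPROCESS_NAME_TO_CLASS, filter_predictions
from sahi.prediction import ObjectPrediction, PredictionResult
from sahi.slicing import slice_image


# Sliced inference with batching support
def get_sliced_prediction_batch(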
   image,
   detection_model=None,
   slice_height=None,
   slice_width=None,
   overlap_height_ratio=0.0,
   overlap_width_ratio=0.0,
   perform_standard_pred: bool = True,
   postprocess_type: str = "GREEDYNMM",
   postprocess_match_metric: str = "IOS",
   postprocess_match_threshold: float = 0.5,
   postprocess_class_agnostic: bool = False,
   verbose: int = 0,
   merge_buffer_length: Optional[int] = None,
   auto_slice_resolution: bool = True,
   slice_export_prefix: Optional[str] = None,
   slice_dir: Optional[str] = None,
   exclude_classes_by_name: Optional[List[str]] = None,
   exclude_classes_by_id: Optional[List[int]] = None,
):
   # Timing records
   durations_in_seconds = dict()
   time_start_call = time_start = time.time()
   # Step 1: slice the image
   slice_image_result = slice_image(
       image=image,
       output_file_name=slice_export_prefix,
       output_dir=slice_dir,
       slice_height=slice_height,
       slice_width=slice_width,
       overlap_height_ratio=overlap_height_ratio,
       overlap_width_ratio=overlap_width_ratio,
       auto_slice_resolution=auto_slice_resolution,
   )
   num_slices = len(slice_image_result)
   time_end = time.time() - time_start
   durations_in_seconds["slice"] = time_end
   tprint(f"sahi slicing time: {time_end*1000:.2f}ms", tag="sahi")

   # Initialize the postprocess step that merges detections across slices
   if detection_model.is_obb:
       # Only NMS is supported for OBB model outputs
       postprocess_type = "NMS"
   # init match postprocess instance
   if postprocess_type not in POSTPROCESS_NAME_TO_CLASS.keys():
       raise ValueError(
           f"postprocess_type should be one of {list(POSTPROCESS_NAME_TO_CLASS.keys())} but given as {postprocess_type}"
       )
   postprocess_constructor = POSTPROCESS_NAME_TO_CLASS[postprocess_type]
   postprocess = postprocess_constructor(
       match_threshold=postprocess_match_threshold,
       match_metric=postprocess_match_metric,
       class_agnostic=postprocess_class_agnostic,
   )

   # create prediction input
   num_batch = detection_model.max_batch_size
   num_group = math.ceil(num_slices / num_batch)
   tprint(f"sahi total slices: {num_slices}, batch size: {num_batch}, groups: {num_group}", tag="sahi")
   time_start = time.time()
   object_prediction_list = []
   # perform sliced prediction
   for start in range(0, num_slices, num_batch):
       image_list = slice_image_result.images[start: start + num_batch]
       shift_amount_list = slice_image_result.starting_pixels[start: start + num_batch]
       # perform batch prediction
       detection_model.perform_inference(image_list)
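       # Post-processing: build the per-slice prediction lists used for stitching and de-duplication
       detection_model._create_object_prediction_list_from_original_predictions(
           shift_amount_list = shift_amount_list,  # the format is correct; the upstream definition of the return value is slightly off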
           full_shape_list = [
               slice_image_result.original_image_height,
               slice_image_result.original_image_width,
           ]
       )
       # Remap the category key/value pairs if a remapping is configured
       if detection_model.category_remapping:
           detection_model._apply_category_remapping()

       # postprocess matching predictions
       # The upstream code only supports a batch of 1 here: detection_model.object_prediction_list is hard-coded to return object_prediction_list_per_image[0]. With batching this has to change.
       object_prediction_list_t: List[ObjectPrediction] = []
       all_object_prediction_list: List[List[ObjectPrediction]] = detection_model.object_prediction_list_per_image
       for per_object_prediction_list in all_object_prediction_list:
           object_prediction_list_filter = filter_predictions(per_object_prediction_list, exclude_classes_by_name,exclude_classes_by_id)
           if postprocess is not None:
               object_prediction_list_filter = postprocess(object_prediction_list_filter)
           object_prediction_list_t.extend(object_prediction_list_filter)
       prediction_result = PredictionResult(
           image=image, object_prediction_list=object_prediction_list_t, durations_in_seconds=durations_in_seconds
       )
       # Shift the slice predictions of each batch group into full-image coordinates
       for object_prediction in prediction_result.object_prediction_list:
           if object_prediction:  # if not empty
               object_prediction_list.append(object_prediction.get_shifted_object_prediction())
       # Merge step: after adding this group's detections to the full-image list, run postprocess once more
       # if merge_buffer_length is not None and len(object_prediction_list) > merge_buffer_length:
       if postprocess is not None:
           object_prediction_list = postprocess(object_prediction_list)

   time_end = time.time()- time_start
   durations_in_seconds["prediction"] = time_end
   tprint(f"sahi inference and postprocess time: {time_end*1000:.2f}ms", tag="sahi")

   tprint(f"sahi total time: {(time.time() - time_start_call)*1000:.2f}ms", tag="sahi")
   return PredictionResult(
       image=image, object_prediction_list=object_prediction_list, durations_in_seconds=durations_in_seconds
   )
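The default `postprocess_match_metric="IOS"` is worth a closer look, because it behaves differently from the more familiar IOU when an object is cut by a slice border. The sketch below compares the two on a made-up pair of boxes. It takes IOS to mean intersection over the smaller box's area; the helpers are written for illustration and are not sahi functions.

```python
from typing import Sequence


def box_area(box: Sequence[float]) -> float:
    """Area of an [x1, y1, x2, y2] box (0 if the box is empty)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection_area(a: Sequence[float], b: Sequence[float]) -> float:
    """Area of the overlap between two boxes."""
    overlap = [max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])]
    return box_area(overlap)


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union."""
    inter = intersection_area(a, b)
    return inter / (box_area(a) + box_area(b) - inter)


def ios(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over the smaller of the two areas."""
    return intersection_area(a, b) / min(box_area(a), box_area(b))


# Made-up example: the whole car seen in one slice, and the part of the same
# car that a neighbouring slice detected.
whole_car = [100, 100, 200, 160]
partial_car = [160, 100, 200, 160]
print(f"IOU={iou(whole_car, partial_car):.2f}, IOS={ios(whole_car, partial_car):.2f}")
```

With a match threshold of 0.5, this pair would be merged under IOS but not under IOU, which is why IOS suits fragments produced by slicing.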

4.3.3 Example

A simple test:

from IPython.display import Image  # only displays the image inside a notebook

if __name__ == '__main__':
   detection_model = YoloTensorRTModel(
       model_path=str("model/yolo12x.engine"),
       confidence_threshold=0.2,
       device="cuda:0",  # or 'cpu'
       task = "detect"
   )
   # Slicing: tune the slice size and the overlap ratio to balance accuracy against speed
   result = get_sliced_prediction_batch(
       str("src/small-vehicles1.jpeg"),
       detection_model,
       overlap_height_ratio=0.2,
       overlap_width_ratio=0.2,
       slice_height=200,
       slice_width=300,
   )

   result.export_visuals(export_dir=str("data/"), hide_conf=True)
   Image(str("data/prediction_visual.png"))

4.3.4 Video detection

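For the video detection pipeline itself, see another article on this blog; it is not repeated here:
Link: https://blog.csdn.net/chaney_f/article/details/146204615
The main change is to the process_frame function from that article: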

# Original processing function for yolo video detection
def process_frame(model_in, frame_in):
    results = model_in.predict(frame_in)
    # Draw the detection boxes
    for result in results:
        for box in result.boxes:
            x1, y1, x2, y2 = map(int, box.xyxy[0])
            conf = box.conf[0].item()
            cls_id = int(box.cls[0])
            label = f"{model_in.names[cls_id]} {conf:.2f}"
            print(label)
            # Draw the rectangle and label (draw_rounded_rect is not defined here; presumably it comes from the linked article)
            draw_rounded_rect(frame_in, (x1, y1), (x2, y2), (0, 255, 0), 2,
                              cv2.LINE_AA, 10)  # green rounded rectangle (BGR)
            cv2.putText(frame_in, label, (x1, y1 - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
    # Return once, after every box has been drawn
    return frame_in, results

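In practice you just swap the yolo model and its predict call for the YoloTensorRTModel and get_sliced_prediction_batch from this article. It helps to add a parsing function that converts both sahi and yolo results into one common structure, so that drawing and other downstream processing can stay the same.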

# Parse the results returned by either approach
def parse_results(results, use_sahi):
   parsed = []
   if use_sahi:
       # SAHI returns a PredictionResult
       for pred in results.object_prediction_list:
           x1, y1, x2, y2 = pred.bbox.to_xyxy()
           parsed.append({
               "bbox": [x1, y1, x2, y2],
               "score": pred.score.value,
               "c_id": pred.category.id,
               "c_name": pred.category.name,
           })
   else:
       # Ultralytics YOLO returns a list of Results objects
       for result in results:
           boxes = result.boxes
           for box in boxes:
               x1, y1, x2, y2 = box.xyxy[0].tolist()
               parsed.append({
                   "bbox": [x1, y1, x2, y2],
                   "score": box.conf[0].item(),
                   "c_id": int(box.cls[0].item()),
                   "c_name": result.names[int(box.cls[0].item())],
               })
   return parsed
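Once both back ends produce the same list of dictionaries, the drawing step no longer needs to know which one was used. As a rough sketch, the hypothetical helper below turns parsed detections into integer corner points and label strings; `to_draw_items` and the `min_score` filter are illustrative additions, and the sample detections are made up.

```python
from typing import Dict, List, Tuple

# ((x1, y1), (x2, y2), label)
DrawItem = Tuple[Tuple[int, int], Tuple[int, int], str]


def to_draw_items(parsed: List[Dict], min_score: float = 0.0) -> List[DrawItem]:
    """Convert parsed detections into pixel corners and labels ready for drawing."""
    items = []
    for det in parsed:
        if det["score"] < min_score:
            continue  # skip low-confidence detections
        x1, y1, x2, y2 = (int(round(v)) for v in det["bbox"])
        label = f'{det["c_name"]} {det["score"]:.2f}'
        items.append(((x1, y1), (x2, y2), label))
    return items


# Made-up detections in the structure returned by parse_results.
parsed_example = [
    {"bbox": [10.4, 20.6, 50.0, 60.2], "score": 0.91, "c_id": 2, "c_name": "car"},
    {"bbox": [5.0, 5.0, 8.0, 9.0], "score": 0.12, "c_id": 2, "c_name": "car"},
]
draw_items = to_draw_items(parsed_example, min_score=0.2)
print(draw_items)
```

Each item can then be passed to `cv2.rectangle` and `cv2.putText` in the frame loop, regardless of whether the detections came from sahi or from plain yolo.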

5. Other notes

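With sahi slicing, detection of small objects is clearly better than with the model alone.
There are still problems, though. For real-time video detection, high accuracy means the inference latency plus drawing and other per-frame operations can be too much for some hardware. Even on the single 5080 card I am using now, it is only just fast enough after tuning the parameters.
In addition, starting a TensorRT model instance on Linux requests a large amount of GPU memory, far more than the original .pt model uses. This also needs further work, either on the parameters or with a different approach.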
