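1. Background
- I had been using yolo for simple image and video detection, and it handles most simple scenes well. For objects seen from unusual angles, and for small objects in distant views, the default detection model does not perform as well. One option is to label data and train a custom model, but I could not get a good handle on tuning the training or on the resulting quality.
- So I ended up using sahi to slice images: one image is cut into several smaller tiles, which are then fed to the yolo model as a batch. Each object takes up a larger share of its tile, which improves detection.
- A side note: as I recall, the official yolo example mentions that sahi slicing is not especially well suited to real-time detection, so you may need to balance resource usage against processing time yourself.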
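To make the "larger share of each tile" point concrete, here is a small arithmetic sketch. The frame size, object size, and slice size are made-up numbers for illustration, and `area_ratio` is a hypothetical helper, not part of sahi or ultralytics.

```python
def area_ratio(obj_w, obj_h, img_w, img_h):
    """Fraction of the image area covered by an object's bounding box."""
    return (obj_w * obj_h) / (img_w * img_h)


# Illustrative numbers only: a 20x20 px car in a 1920x1080 frame.
full = area_ratio(20, 20, 1920, 1080)
# The same car inside a 256x256 slice cut from that frame.
sliced = area_ratio(20, 20, 256, 256)

print(f"full frame: {full:.5%}, slice: {sliced:.5%}, gain: {sliced / full:.1f}x")
```

The gain is simply the ratio of the two image areas, so smaller slices raise it further, at the cost of more slices to run through the model.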
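2. yolo environment on Linux (optional)
The project was migrated to a Linux server, so the environment has to be set up there. If you already have a working yolo environment (or keep running on a previously configured Windows machine), skip this section.
# Base code and venv environment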
sudo apt install -y python3 python3-venv python3-pip
cd ~
git clone https://github.com/ultralytics/ultralytics
cd ultralytics
python3 -m venv yolo-env
source yolo-env/bin/activate
# PyTorch (adjust the version to your hardware)
pip uninstall -y torch torchvision torchaudio
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# yolo dependencies
pip install -e .
# CUDA (adjust the version to your hardware)
# https://developer.nvidia.com/cuda-12-8-1-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=24.04&target_type=deb_network
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8
# Add the matching PATH variables for bash
vim ~/.bashrc
# Append the following to the end of the file and save
export PATH=/usr/local/cuda-12.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH
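After the steps above, a quick sanity check can confirm that the shell and PyTorch both see CUDA. This is a minimal sketch: `check_env` is a hypothetical helper name, and it only uses `shutil.which` plus `torch.cuda.is_available()`, skipping the torch part if torch is not installed.

```python
import importlib.util
import shutil


def check_env():
    """Collect a few facts about the local CUDA / PyTorch setup."""
    info = {
        # None if the CUDA bin directory is not on PATH
        "nvcc_path": shutil.which("nvcc"),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if info["torch_installed"]:
        import torch

        info["cuda_available"] = torch.cuda.is_available()
    return info


print(check_env())
```

Run it inside the activated venv and in a new shell (or after `source ~/.bashrc`), so the PATH changes are picked up.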
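3. sahi + yolo .pt model
The official yolo documentation already has a fairly detailed demo of basic sahi usage:
https://docs.ultralytics.com/zh/guides/sahi-tiled-inference/
Environment:
pip install -U ultralytics sahi
Simple example (based on the documentation above):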
from sahi.predict import get_sliced_prediction
from sahi import AutoDetectionModel
from IPython.display import Image
detection_model = AutoDetectionModel.from_pretrained(
    model_type="ultralytics",
    model_path="models/yolo11n.pt",  # have the model file ready beforehand
    confidence_threshold=0.3,
    device="cuda:0",
)
# Slicing: tune the slice size and the overlap ratio to balance accuracy against speed
result = get_sliced_prediction(
    "demo_data/small-vehicles1.jpeg",
    detection_model,
    slice_height=256,
    slice_width=256,
    overlap_height_ratio=0.2,
    overlap_width_ratio=0.2,
)
result.export_visuals(export_dir="demo_data/")
Image("demo_data/prediction_visual.png")
This detects more of the small distant cars than a plain yolo predict does.

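If you only want to try it out, this is about all you need. From here, read the sahi documentation, tune the parameters, and run predictions on images, videos, and so on.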
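When tuning the slice size and overlap, it helps to know how many slices one frame turns into, since that drives inference time. The sketch below is a simplified sliding-window layout written for illustration; sahi's own slicing may differ in details, and `axis_starts` and `count_slices` are hypothetical helper names. The 1920x1080 frame size is an assumption.

```python
def axis_starts(length, window, overlap_ratio):
    """Start offsets of sliding windows along one axis (last window is pulled back to fit)."""
    step = window - int(window * overlap_ratio)
    starts = []
    pos = 0
    while True:
        if pos + window >= length:
            # Last window: align it with the image border instead of running past it.
            starts.append(max(0, length - window))
            break
        starts.append(pos)
        pos += step
    return starts


def count_slices(img_w, img_h, slice_w, slice_h, overlap_w, overlap_h):
    """Number of slices produced for one image under this simplified layout."""
    return len(axis_starts(img_w, slice_w, overlap_w)) * len(axis_starts(img_h, slice_h, overlap_h))


# Assumed 1920x1080 frame, overlap ratio 0.2 as in the example above.
for size in (256, 512):
    print(size, count_slices(1920, 1080, size, size, 0.2, 0.2))
```

Halving the slice side roughly quadruples the slice count, which is the main cost to weigh against the gain on small objects.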
sahi on GitHub: https://github.com/obss/sahi
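4. sahi + TensorRT model
4.1 Background
For real-time video detection, however, the default yolo .pt model plus sahi takes rather long per frame. I am considering two directions for now:
- Use a fast algorithm alongside the sahi + yolo predictions to make simple inferences between detections, so yolo runs on fewer frames and the total time goes down. I will leave this aside here and look into it later.
- Convert the model and see whether that improves throughput. The official yolo documentation provides guidance on model export, so converting to TensorRT is an option for cutting processing time.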

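4.2 TensorRT export
Official guide: https://docs.ultralytics.com/zh/modes/export/
Environment:
pip install --upgrade tensorrt
# If the code below complains about missing packages, install them as needed
pip install --no-cache-dir "onnx>=1.12.0" "onnxslim" "onnxruntime-gpu"
pip install nvidia-pyindex
Then add the export code following the guide; see the documentation for the individual export parameters: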
from ultralytics import YOLO
from pathlib import Path
BASE_DIR = Path(__file__).parent.resolve()
# Load a model
model = YOLO(str(BASE_DIR / "yolo11x.pt")) # load an official model
# Export the model
model.export(
    format="engine",  # TensorRT
    half=True,  # use FP16 (optional)
    int8=False,  # set to True for INT8 and provide a calibration dataset
    dynamic=True,  # enable dynamic input shapes
    batch=8,  # batch size, so a group of sahi slices can be processed together
)
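4.3 Integrating with sahi
According to the sahi documentation, TensorRT is not among the models supported out of the box, so a custom sahi model has to be written with the existing model code as a reference.
The default yolo + sahi implementation is a good starting point. Note that the default code does not support batching, so it needs slight adjustments while you follow it: https://github.com/obss/sahi/blob/main/sahi/models/ultralytics.py
A simple example implementation follows:
4.3.1 Custom subclass
The subclass needs to override some of the base-class methods: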
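# Import paths follow sahi's own ultralytics.py; adjust them if your sahi version differs
from typing import Any, List, Optional

import cv2
import numpy as np
from sahi.models.base import DetectionModel
from sahi.prediction import ObjectPrediction
from sahi.utils.compatibility import fix_full_shape_list, fix_shift_amount_list
from sahi.utils.cv import get_coco_segmentation_from_bool_mask
from ultralytics import YOLO


class YoloTensorRTModel(DetectionModel):
    def __init__(self, *args, task="detect", max_batch_size=6, max_predict_det=500, **kwargs):
        self._original_shape = None
        # Store the extra arguments: task type, maximum batch size, maximum detections per image
        self.task = task
        self.max_batch_size = max_batch_size
        self.max_predict_det = max_predict_det
        super().__init__(*args, **kwargs)

    # If a model object is passed to YoloTensorRTModel, set_model is called; otherwise load_model loads it from model_path
    def load_model(self):
        try:
            model = YOLO(model=self.model_path, task=self.task)
            self.set_model(model)
        except Exception as e:
            raise TypeError("model_path is not a valid yolo model path: ", e)

    def set_model(self, model: Any, **kwargs):
        self.model = model
        # set category_mapping
        if not self.category_mapping:
            category_mapping = {str(ind): category_name for ind, category_name in enumerate(self.category_names)}
            self.category_mapping = category_mapping

    def predict(self, image):
        results = self.model.predict(image)
        return results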
    # Batched version of perform_inference, based on the sahi source for the ultralytics framework
    def perform_inference(self, images):
        # Confirm model is loaded
        if self.model is None:
            raise ValueError("Model is not loaded, load it by calling .load_model()")
        # 1. Run batched inference directly on the model
        #    (note: conf is hard-coded here and does not follow confidence_threshold)
        prediction_result = self.model.predict(images, save_txt=False, save_conf=False, max_det=self.max_predict_det, conf=0.25)
        # 2. Extract boxes.data from each Results object (torch.Tensor, shape=(ni, 6));
        #    the columns are [x1, y1, x2, y2, conf, cls]
        prediction_result = [per_result.boxes.data for per_result in prediction_result]
        self._original_predictions = prediction_result
        # 3. Record the original size of each image so coordinates can be restored later
        self._original_shape = [
            (img.shape[0], img.shape[1]) if isinstance(img, np.ndarray)
            else (int(img.shape[2]), int(img.shape[3]))
            for img in images
        ]
    # Convert each slice's inference results toward full-image coordinates and collect them in the per-image list
    def _create_object_prediction_list_from_original_predictions(
        self,
        shift_amount_list: Optional[List[List[int]]] = [[0, 0]],
        full_shape_list: Optional[List[List[int]]] = None,
    ):
        """
        self._original_predictions is converted to a list of prediction.ObjectPrediction and set to
        self._object_prediction_list_per_image.
        Args:
            shift_amount_list: list of list
                To shift the box and mask predictions from sliced image to full sized image, should
                be in the form of List[[shift_x, shift_y],[shift_x, shift_y],...]
            full_shape_list: list of list
                Size of the full image after shifting, should be in the form of
                List[[height, width],[height, width],...]
        """
        original_predictions = self._original_predictions
        # compatibility for sahi v0.8.15 (format handling for older versions)
        shift_amount_list = fix_shift_amount_list(shift_amount_list)
        full_shape_list = fix_full_shape_list(full_shape_list)
        # handle all predictions
        object_prediction_list_per_image = []
        for image_ind, image_predictions in enumerate(original_predictions):
            shift_amount = shift_amount_list[image_ind]
            full_shape = None if full_shape_list is None else (
                full_shape_list[image_ind]
                if 0 <= image_ind < len(full_shape_list) and full_shape_list[image_ind] is not None
                else full_shape_list[0]
            )
            object_prediction_list = []
            # Extract boxes and optional masks/obb
            # Note: perform_inference above stores only boxes.data, so the mask/obb branch
            # is not usable as-is; only the "detect" task is covered by this example.
            if self.has_mask or self.is_obb:
                boxes = image_predictions[0].cpu().detach().numpy()
                masks_or_points = image_predictions[1].cpu().detach().numpy()
            else:
                boxes = image_predictions.data.cpu().detach().numpy()
                masks_or_points = None
            # Process each prediction
            for pred_ind, prediction in enumerate(boxes):
                # Get bbox coordinates
                bbox = prediction[:4].tolist()
                score = prediction[4]
                category_id = int(prediction[5])
                category_name = self.category_mapping[str(category_id)]
                # Fix box coordinates
                bbox = [max(0, coord) for coord in bbox]
                if full_shape is not None:
                    bbox[0] = min(full_shape[1], bbox[0])
                    bbox[1] = min(full_shape[0], bbox[1])
                    bbox[2] = min(full_shape[1], bbox[2])
                    bbox[3] = min(full_shape[0], bbox[3])
                # Ignore invalid predictions
                if not (bbox[0] < bbox[2]) or not (bbox[1] < bbox[3]):
                    # logger.warning(f"ignoring invalid prediction with bbox: {bbox}")
                    continue
                # Get segmentation or OBB points
                segmentation = None
                if masks_or_points is not None:
                    if self.has_mask:
                        bool_mask = masks_or_points[pred_ind]
                        # Resize mask to original image size
                        bool_mask = cv2.resize(
                            bool_mask.astype(np.uint8), (self._original_shape[image_ind][1], self._original_shape[image_ind][0])
                        )
                        segmentation = get_coco_segmentation_from_bool_mask(bool_mask)
                    else:  # is_obb
                        obb_points = masks_or_points[pred_ind]  # Get OBB points for this prediction
                        segmentation = [obb_points.reshape(-1).tolist()]
                    if len(segmentation) == 0:
                        continue
                # Create and append object prediction
                object_prediction = ObjectPrediction(
                    bbox=bbox,
                    category_id=category_id,
                    score=score,
                    segmentation=segmentation,
                    category_name=category_name,
                    shift_amount=shift_amount,
                    full_shape=self._original_shape[image_ind] if full_shape is None else full_shape,  # (height, width)
                )
                object_prediction_list.append(object_prediction)
            object_prediction_list_per_image.append(object_prediction_list)
        self._object_prediction_list_per_image = object_prediction_list_per_image
    @property
    def category_names(self):
        return self.model.names.values()

    @property
    def has_mask(self):
        """
        Returns if model output contains segmentation mask
        """
        return self.model.overrides["task"] == "segment"

    @property
    def is_obb(self):
        """
        Returns if model output contains oriented bounding boxes
        """
        return self.model.overrides["task"] == "obb"
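The `shift_amount` and `full_shape` values passed to each ObjectPrediction are what allow a slice-level box to be placed back on the full image. The arithmetic can be sketched in plain Python as below. This is only an illustration of the idea, not sahi's implementation; `to_full_image` is a hypothetical helper, and the numbers are made up.

```python
def to_full_image(bbox, shift_amount, full_shape):
    """Map an [x1, y1, x2, y2] box from slice coordinates to full-image coordinates.

    shift_amount is (shift_x, shift_y): the slice's top-left corner in the full image.
    full_shape is (height, width) of the full image.
    """
    shift_x, shift_y = shift_amount
    full_h, full_w = full_shape
    x1, y1, x2, y2 = bbox
    shifted = [x1 + shift_x, y1 + shift_y, x2 + shift_x, y2 + shift_y]
    # Clamp to the image borders so a box never extends past the frame.
    return [
        min(max(0, shifted[0]), full_w),
        min(max(0, shifted[1]), full_h),
        min(max(0, shifted[2]), full_w),
        min(max(0, shifted[3]), full_h),
    ]


# A box found in the slice whose top-left corner sits at (205, 410) of a 1920x1080 frame.
print(to_full_image([10, 20, 50, 60], (205, 410), (1080, 1920)))
```

Because every slice in a batch has its own shift, the batched version has to keep one `shift_amount` per image, which is why the method above indexes `shift_amount_list` by `image_ind`.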
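4.3.2 Sliced inference implementation
The demo from the ultralytics site in section 3 simply calls get_sliced_prediction for inference, so here we rewrite that sliced-inference function using it as a reference (reference code only; there is still plenty of room for optimization):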
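# Import paths follow sahi's predict.py; adjust them if your sahi version differs
import math
import time
from typing import List, Optional

from sahi.predict import POSTPROCESS_NAME_TO_CLASS, filter_predictions
from sahi.prediction import ObjectPrediction, PredictionResult
from sahi.slicing import slice_image

# tprint is my own logging helper (not shown here); replace it with print or logging as needed


# Sliced inference with batch support
# Note: perform_standard_pred, verbose and merge_buffer_length are accepted but not used in this version
def get_sliced_prediction_batch(
    image,
    detection_model=None,
    slice_height=None,
    slice_width=None,
    overlap_height_ratio=0.0,
    overlap_width_ratio=0.0,
    perform_standard_pred: bool = True,
    postprocess_type: str = "GREEDYNMM",
    postprocess_match_metric: str = "IOS",
    postprocess_match_threshold: float = 0.5,
    postprocess_class_agnostic: bool = False,
    verbose: int = 0,
    merge_buffer_length: Optional[int] = None,
    auto_slice_resolution: bool = True,
    slice_export_prefix: Optional[str] = None,
    slice_dir: Optional[str] = None,
    exclude_classes_by_name: Optional[List[str]] = None,
    exclude_classes_by_id: Optional[List[int]] = None,
):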
    # Timing records
    durations_in_seconds = dict()
    time_start_call = time_start = time.time()
    # Step 1: slice the image
    slice_image_result = slice_image(
        image=image,
        output_file_name=slice_export_prefix,
        output_dir=slice_dir,
        slice_height=slice_height,
        slice_width=slice_width,
        overlap_height_ratio=overlap_height_ratio,
        overlap_width_ratio=overlap_width_ratio,
        auto_slice_resolution=auto_slice_resolution,
    )
    num_slices = len(slice_image_result)
    time_end = time.time() - time_start
    durations_in_seconds["slice"] = time_end
    tprint(f"sahi slicing time: {time_end*1000:.2f}ms", tag="sahi")
    # Set up the postprocessing step, which merges detections across slices
    if detection_model.is_obb:
        # Only NMS is supported for OBB model outputs
        postprocess_type = "NMS"
    # init match postprocess instance
    if postprocess_type not in POSTPROCESS_NAME_TO_CLASS.keys():
        raise ValueError(
            f"postprocess_type should be one of {list(POSTPROCESS_NAME_TO_CLASS.keys())} but given as {postprocess_type}"
        )
    postprocess_constructor = POSTPROCESS_NAME_TO_CLASS[postprocess_type]
    postprocess = postprocess_constructor(
        match_threshold=postprocess_match_threshold,
        match_metric=postprocess_match_metric,
        class_agnostic=postprocess_class_agnostic,
    )
    # create prediction input
    num_batch = detection_model.max_batch_size
    num_group = math.ceil(num_slices / num_batch)
    tprint(f"sahi total slices: {num_slices}, batch size: {num_batch}, groups: {num_group}", tag="sahi")
    time_start = time.time()
    object_prediction_list = []
    # perform sliced prediction
    for start in range(0, num_slices, num_batch):
        image_list = slice_image_result.images[start: start + num_batch]
        shift_amount_list = slice_image_result.starting_pixels[start: start + num_batch]
        # perform batch prediction
        detection_model.perform_inference(image_list)
        # Postprocessing: stitching, de-duplication and so on
        detection_model._create_object_prediction_list_from_original_predictions(
            shift_amount_list=shift_amount_list,  # the format is correct; the upstream source's definition is slightly off
            full_shape_list=[
                slice_image_result.original_image_height,
                slice_image_result.original_image_width,
            ]
        )
        # Apply the category remapping, if any
        if detection_model.category_remapping:
            detection_model._apply_category_remapping()
        # postprocess matching predictions
        # The upstream source only supports batch size 1 here, so detection_model.object_prediction_list
        # is hard-wired to return object_prediction_list_per_image[0]. With batching this has to change.
        object_prediction_list_t: List[ObjectPrediction] = []
        all_object_prediction_list: List[List[ObjectPrediction]] = detection_model.object_prediction_list_per_image
        for per_object_prediction_list in all_object_prediction_list:
            object_prediction_list_filter = filter_predictions(per_object_prediction_list, exclude_classes_by_name, exclude_classes_by_id)
            if postprocess is not None:
                object_prediction_list_filter = postprocess(object_prediction_list_filter)
            object_prediction_list_t.extend(object_prediction_list_filter)
        prediction_result = PredictionResult(
            image=image, object_prediction_list=object_prediction_list_t, durations_in_seconds=durations_in_seconds
        )
        # Shift this batch's slice-level predictions into full-image coordinates
        for object_prediction in prediction_result.object_prediction_list:
            if object_prediction:  # if not empty
                object_prediction_list.append(object_prediction.get_shifted_object_prediction())
    # Final merge: once the detections from all slices are combined, run postprocess one more time
    # if merge_buffer_length is not None and len(object_prediction_list) > merge_buffer_length:
    if postprocess is not None:
        object_prediction_list = postprocess(object_prediction_list)
    time_end = time.time() - time_start
    durations_in_seconds["prediction"] = time_end
    tprint(f"sahi inference and postprocessing time: {time_end*1000:.2f}ms", tag="sahi")
    tprint(f"sahi total time: {(time.time() - time_start_call)*1000:.2f}ms", tag="sahi")
    return PredictionResult(
        image=image, object_prediction_list=object_prediction_list, durations_in_seconds=durations_in_seconds
    )
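The grouping of slices into batches in the loop above can be looked at in isolation. Here is a minimal sketch of it, using a hypothetical helper `batch_ranges` and made-up slice counts:

```python
import math


def batch_ranges(num_slices, max_batch_size):
    """Return (start, end) index pairs that split num_slices items into batches."""
    return [
        (start, min(start + max_batch_size, num_slices))
        for start in range(0, num_slices, max_batch_size)
    ]


# 15 slices with a maximum batch size of 6 give two full batches and one partial batch.
ranges = batch_ranges(15, 6)
print(ranges, "groups:", math.ceil(15 / 6))
```

The last batch is usually smaller than the others, so the engine has to accept a variable batch size (hence `dynamic=True` at export time). Presumably `max_batch_size` should also stay at or below the `batch` value used for the export (8 in section 4.2); I have not verified what happens beyond that.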
4.3.3 Example
A simple test:
from IPython.display import Image

if __name__ == '__main__':
    detection_model = YoloTensorRTModel(
        model_path=str("model/yolo12x.engine"),
        confidence_threshold=0.2,
        device="cuda:0",  # or 'cpu'
        task="detect",
    )
    # Slicing: tune the slice size and the overlap ratio to balance accuracy against speed
    result = get_sliced_prediction_batch(
        str("src/small-vehicles1.jpeg"),
        detection_model,
        overlap_height_ratio=0.2,
        overlap_width_ratio=0.2,
        slice_height=200,
        slice_width=300,
    )
    result.export_visuals(export_dir=str("data/"), hide_conf=True)
    Image(str("data/prediction_visual.png"))
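4.3.4 Using it for video detection
For the video detection pipeline, see another article on this blog; I will not repeat it here:
Link: https://blog.csdn.net/chaney_f/article/details/146204615
The main change is to the process_frame function in that article: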
# The original processing function for yolo video detection
def process_frame(model_in, frame_in):
    results = model_in.predict(frame_in)
    # Draw the detection boxes
    for result in results:
        for box in result.boxes:
            x1, y1, x2, y2 = map(int, box.xyxy[0])
            conf = box.conf[0].item()
            cls_id = int(box.cls[0])
            label = f"{model_in.names[cls_id]} {conf:.2f}"
            print(label)
            # Draw the rectangle and the label
            draw_rounded_rect(frame_in, (x1, y1), (x2, y2), (0, 255, 0), 2,
                              cv2.LINE_AA, 10)  # green rounded rectangle (BGR)
            cv2.putText(frame_in, label, (x1, y1 - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
    return frame_in, results
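In short, replace the yolo model used for inference and its predict call with the YoloTensorRTModel and get_sliced_prediction_batch from this post. It also helps to add a result-parsing function that converts sahi and yolo results into one common structure, so the drawing and other downstream steps can stay the same.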
# Parse the results returned by either approach
def parse_results(results, use_sahi):
    parsed = []
    if use_sahi:
        # SAHI returns a PredictionResult
        for pred in results.object_prediction_list:
            x1, y1, x2, y2 = pred.bbox.to_xyxy()
            parsed.append({
                "bbox": [x1, y1, x2, y2],
                "score": pred.score.value,
                "c_id": pred.category.id,
                "c_name": pred.category.name,
            })
    else:
        # Ultralytics YOLO returns a list of Results objects
        for result in results:
            boxes = result.boxes
            for box in boxes:
                x1, y1, x2, y2 = box.xyxy[0].tolist()
                parsed.append({
                    "bbox": [x1, y1, x2, y2],
                    "score": box.conf[0].item(),
                    "c_id": int(box.cls[0].item()),
                    "c_name": result.names[int(box.cls[0].item())],
                })
    return parsed
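Once both approaches produce the same list of dictionaries, the drawing step only needs that structure. A minimal sketch of the hand-off is below; `to_draw_items` is a hypothetical helper, the sample detections are made up, and the output tuples are meant to be fed to whatever drawing calls you already use (for example the rectangle and putText calls in process_frame).

```python
def to_draw_items(parsed, min_score=0.0):
    """Turn parse_results-style dicts into (top_left, bottom_right, label) tuples."""
    items = []
    for det in parsed:
        if det["score"] < min_score:
            continue  # drop low-confidence detections before drawing
        x1, y1, x2, y2 = (int(round(v)) for v in det["bbox"])
        label = f'{det["c_name"]} {det["score"]:.2f}'
        items.append(((x1, y1), (x2, y2), label))
    return items


# Made-up detections in the structure returned by parse_results.
sample = [
    {"bbox": [10.4, 20.6, 50.0, 60.2], "score": 0.87, "c_id": 2, "c_name": "car"},
    {"bbox": [100.0, 100.0, 120.0, 130.0], "score": 0.10, "c_id": 0, "c_name": "person"},
]
for top_left, bottom_right, label in to_draw_items(sample, min_score=0.25):
    print(top_left, bottom_right, label)
```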
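5. Other notes
Detection of small objects with sahi slicing is indeed better than with the model alone.
There are still problems, though. In real-time video detection, for example, getting high accuracy means that the processing latency plus our drawing and other operations can be a strain on some hardware. On the single 5080 card I use now, it is only barely adequate after parameter tuning.
In addition, starting a TensorRT model instance on Linux requests a lot of GPU memory, far more than the original .pt model uses. This also needs further work, either on the parameters or on a different approach.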
