YOLOv3 Dissected, Part 2: Data Input

Original post · latest recommended article published 2026-03-14 00:15:09 · 4.4k views

This content is licensed under CC 4.0 BY-SA.

Tags

#yolo #tensorflow #keras

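This article walks through the YOLOv3 data input pipeline in detail: how the source data is stored, how the data generator works, and how preprocessing handles coordinate conversion, IoU computation, and ground-truth construction. The goal is to help readers better understand how a YOLOv3 model is trained.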

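I. Source data storage. First, we have to make sure our data matches the format the reference code expects.

1. Images. Images are stored as files. Where they live doesn't matter, but the path passed in must be correct. For example: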

/opt/others/data2007/VOC2007/JPEGImages/000073.jpg
/opt/others/data2007/VOC2007/JPEGImages/000003.jpg

2. Annotations. They go into a single txt file (for example /opt/annotations.txt), like this:

/opt/others/data2007/VOC2007/JPEGImages/000012.jpg 156,97,351,270,6

/opt/others/data2007/VOC2007/JPEGImages/000032.jpg 104,78,375,183,0 133,88,197,123,0 195,180,213,229,14 26,189,44,238,14

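Each line has two parts. The first part is the image path. The second part is a sequence of groups of five comma-separated numbers: xmin, ymin, xmax, ymax and class_id, i.e. the coordinates of two diagonally opposite corners plus the class.

Why is the min corner at the top-left? I'll leave that for you to think about.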
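To make the line format concrete, here is a minimal parsing sketch. The function name parse_annotation_line is made up for illustration and is not part of the reference code; it also assumes the image path contains no spaces.

```python
def parse_annotation_line(line):
    """Split one annotation line into an image path and a list of boxes.

    Hypothetical helper: assumes the path has no spaces and every box is
    written as "xmin,ymin,xmax,ymax,class_id".
    """
    parts = line.split()
    image_path = parts[0]
    boxes = []
    for group in parts[1:]:
        xmin, ymin, xmax, ymax, class_id = (int(v) for v in group.split(","))
        boxes.append((xmin, ymin, xmax, ymax, class_id))
    return image_path, boxes


line = "/opt/others/data2007/VOC2007/JPEGImages/000012.jpg 156,97,351,270,6"
path, boxes = parse_annotation_line(line)
print(path)   # the image path
print(boxes)  # one (xmin, ymin, xmax, ymax, class_id) tuple per object
```

An image with several objects simply yields several tuples in the list.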

II. Data format.

(1) During data generation, the data passes through the following generator (data_generator):

import numpy as np

# get_random_data and preprocess_true_boxes come from the same code base
def data_generator(annotation_lines, batch_size, input_shape, anchors, num_classes):
    """data generator for fit_generator"""
    n = len(annotation_lines)
    i = 0
    while True:
        image_data = []
        box_data = []
        for b in range(batch_size):
            if i == 0:
                # Reshuffle at the start of every pass over the data
                np.random.shuffle(annotation_lines)
            # Data augmentation; any method that helps prevent overfitting can count as augmentation
            # Returns image = [416, 416, 3] and box = [n, 5]
            # Each box row is [xmin, ymin, xmax, ymax, class_id]: the original corner
            # coordinates, transformed together with the augmented image
            image, box = get_random_data(annotation_lines[i], input_shape, random=True)
            # [h, w, channel]
            image_data.append(image)
            # [n, 5]
            box_data.append(box)
            i = (i + 1) % n
        # image_data = [batchsize, h, w, channel]
        image_data = np.array(image_data)
        # box_data = [batchsize, n, 5]
        # (stacking needs the same n for every image, so get_random_data must return a fixed-size, zero-padded box array)
        box_data = np.array(box_data)
        # Inputs: box_data = [batchsize, n, 5], anchors = [9, 2], num_classes = 20
        # Output: y_true, a list of 3 arrays, each [batchsize, grid, grid, 3, 25]
        #   batchsize:  batch size
        #   grid, grid: feature-map height and width (13, 26 or 52 for a 416x416 input)
        #   3:          the anchors assigned to this output layer
        #   25:         x, y, w, h (relative to the input size, between 0 and 1), objectness, and a 20-way one-hot class
        y_true = preprocess_true_boxes(box_data, input_shape, anchors, num_classes)
        # image_data: batch x 416 x 416 x 3
        # y_true    : 3 x (batch x grid x grid x 3 x 25)
        # The first yielded element holds all model inputs: the images (batchsize x 416 x 416 x 3)
        # and the matching box information (batchsize x grid x grid x 3 x 25 per layer)
        yield [image_data, *y_true], np.zeros(batch_size)

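The get_random_data function performs data augmentation. Any preprocessing that helps prevent overfitting can be considered data augmentation, for example shifting, scaling, flipping, and color jittering the image. It also effectively enlarges the dataset. I won't go through the function in detail; I'll just mention two nice touches by the author.

1) Images are resized with bicubic interpolation.

2) During resizing, the image is pasted onto a gray background: the resized image is pasted onto a 416x416 gray canvas.

It finally returns an augmented image (type: np.ndarray, shape: 416x416x3) and the boxes transformed to match the image (type: np.ndarray, shape: n x 5). Here n is the number of boxes: one image can contain several objects, with one box per object. Since the data is produced batch by batch, at the end of each generator step the results are collected into lists, giving image_data with shape=(batchsize, 416, 416, 3) and box_data with shape=(batchsize, n, 5).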
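The "paste onto a gray canvas" step implies that the boxes must be scaled and shifted exactly like the image. The sketch below shows only that geometry for the plain, non-random case; it is not the author's get_random_data, the name letterbox_geometry is hypothetical, and the random jitter, flipping and color changes are left out.

```python
def letterbox_geometry(image_size, input_size, boxes):
    """Compute how an image and its boxes move when letterboxed.

    image_size: (width, height) of the original image
    input_size: (width, height) of the network input, e.g. (416, 416)
    boxes: list of (xmin, ymin, xmax, ymax, class_id) in original pixels
    Returns the resized image size, the paste offset, and the moved boxes.
    """
    iw, ih = image_size
    w, h = input_size
    # Keep the aspect ratio: use the smaller of the two scale factors
    scale = min(w / iw, h / ih)
    nw, nh = round(iw * scale), round(ih * scale)
    # Center the resized image on the canvas; the rest stays gray
    dx, dy = (w - nw) // 2, (h - nh) // 2
    new_boxes = [
        (xmin * scale + dx, ymin * scale + dy,
         xmax * scale + dx, ymax * scale + dy, class_id)
        for xmin, ymin, xmax, ymax, class_id in boxes
    ]
    return (nw, nh), (dx, dy), new_boxes


size, offset, moved = letterbox_geometry((832, 624), (416, 416), [(100, 200, 300, 400, 6)])
print(size, offset, moved)
```

An 832 x 624 image is scaled by 0.5 to 416 x 312 and pasted 52 pixels down, so every box is halved and shifted down by 52 pixels as well.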

(2) Data preprocessing (preprocess_true_boxes)

The box_data generated above is passed into this function, together with three parameters: input_shape, anchors, and num_classes.

input_shape = (416, 416)

anchors = [[10,13], [16,30], [33,23], [30,61], [62,45], [59,119], [116,90], [156,198], [373,326]]

num_classes = 20    # VOC has 20 classes and COCO has 80; we use VOC as the example

def preprocess_true_boxes(true_boxes, input_shape, anchors, num_classes):
    """Preprocess true boxes to training input format

    Parameters
    ----------
    true_boxes: array, shape=(m, T, 5)
        Absolute x_min, y_min, x_max, y_max, class_id relative to input_shape.
    input_shape: array-like, hw, multiples of 32
    anchors: array, shape=(N, 2), wh
    num_classes: integer

    Here: true_boxes=[batchsize, n, 5], anchors=[9, 2], num_classes=20

    Returns
    -------
    y_true: list of array, shape like yolo_outputs, xywh are relative value

    """
    assert (true_boxes[..., 4] < num_classes).all(), 'class id must be less than num_classes'
    num_layers = len(anchors) // 3
    anchor_mask = [[6, 7, 8], [3, 4, 5], [0, 1, 2]] if num_layers == 3 else [[3, 4, 5], [1, 2, 3]]
    # [batchsize, n, 5]
    true_boxes = np.array(true_boxes, dtype='float32')
    # (416, 416)
    input_shape = np.array(input_shape, dtype='int32')
    # Corner coordinates -> center coordinates
    # [batchsize, n, 2]
    boxes_xy = (true_boxes[..., 0:2] + true_boxes[..., 2:4]) // 2
    # Corner coordinates -> width and height in pixels
    # [batchsize, n, 2]
    boxes_wh = true_boxes[..., 2:4] - true_boxes[..., 0:2]
    # Absolute values -> ratios relative to the input size (416)
    # input_shape is (h, w), so [::-1] gives (w, h) to match (x, y)
    true_boxes[..., 0:2] = boxes_xy / input_shape[::-1]
    true_boxes[..., 2:4] = boxes_wh / input_shape[::-1]

    # batchsize
    m = true_boxes.shape[0]
    # [13x13, 26x26, 52x52]
    grid_shapes = [input_shape // {0: 32, 1: 16, 2: 8}[l] for l in range(num_layers)]
    # list of 3 arrays, each [batchsize, grid, grid, 3, 25]
    y_true = [np.zeros((m, grid_shapes[l][0], grid_shapes[l][1], len(anchor_mask[l]), 5 + num_classes),  dtype='float32') for l in range(num_layers)]

    # Add a leading axis for broadcasting
    # [1, 9, 2]
    anchors = np.expand_dims(anchors, 0)

    anchor_maxes = anchors / 2.
    anchor_mins = -anchor_maxes
    # Keep only boxes with positive width. As I understand it, mislabeled data or edge cases in
    # augmentation can produce bad boxes during conversion, and this step guards against such
    # outliers (all-zero padding rows are dropped by the same test)
    # [batchsize, n]
    valid_mask = boxes_wh[..., 0] > 0

    for b in range(m):
        # Drop the entries that contain no object
        # [N, 2]
        wh = boxes_wh[b, valid_mask[b]]
        if len(wh) == 0:
            continue
        # Expand dim to apply broadcasting.
        # [N, 1, 2]
        wh = np.expand_dims(wh, -2)
        box_maxes = wh / 2.
        box_mins = -box_maxes

        # box_mins/box_maxes and anchor_mins/anchor_maxes are the corners of each box
        # and each anchor when both are centered on the origin
        # Take the element-wise larger of the two min corners
        # Broadcasting: [N, 1, 2] against [1, 9, 2] gives [N, 9, 2]
        # (N boxes, 9 anchors, 2 columns for w and h)
        intersect_mins = np.maximum(box_mins, anchor_mins)
        # Take the element-wise smaller of the two max corners, also [N, 9, 2]
        intersect_maxes = np.minimum(box_maxes, anchor_maxes)
        # These two steps pair every box of the image with every anchor, like a Cartesian
        # product of the [N, 2] boxes and the [9, 2] anchors, done by broadcasting

        intersect_wh = np.maximum(intersect_maxes - intersect_mins, 0.)
        # Intersection area
        # [N, 9] * [N, 9] -> [N, 9]
        intersect_area = intersect_wh[..., 0] * intersect_wh[..., 1]
        # [N, 1]
        box_area = wh[..., 0] * wh[..., 1]
        # [1, 9]
        anchor_area = anchors[..., 0] * anchors[..., 1]
        # And there is the IoU!
        # [N, 9]: N boxes against 9 anchors
        iou = intersect_area / (box_area + anchor_area - intersect_area)
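        # For each box, pick the anchor with the highest IoU
        # (N,): for every box, the index (0-8) of its best anchor among the 9
        best_anchor = np.argmax(iou, axis=-1)
        # N is not necessarily 3; it depends on how many boxes the image has
        for t, n in enumerate(best_anchor):
            for l in range(num_layers):
                # n is the anchor index; it decides which output layer detects this box
                if n in anchor_mask[l]:
                    # true_boxes is [batchsize, n, 5]; its xywh are ratios between 0 and 1,
                    # obtained by dividing by the input size (416)
                    # Multiplying by the grid size (e.g. 13) maps them onto the feature map
                    # Map x first: column index
                    i = np.floor(true_boxes[b, t, 0] * grid_shapes[l][1]).astype('int32')
                    # Then map y: row index
                    j = np.floor(true_boxes[b, t, 1] * grid_shapes[l][0]).astype('int32')
                    # Which of this layer's anchors
                    k = anchor_mask[l].index(n)
                    # class_id
                    c = true_boxes[b, t, 4].astype('int32')
                    # Are w and h swapped here? No: the array is indexed [row j, column i].
                    # Now I finally see why the axes point right and down: that way the
                    # x and y axes line up with the matrix column and row indices
                    # Store xywh
                    # k is the position of the anchor within this layer's anchor set
                    y_true[l][b, j, i, k, 0:4] = true_boxes[b, t, 0:4]
                    # Confidence 1 means an object is present
                    y_true[l][b, j, i, k, 4] = 1
                    # One-hot class label
                    y_true[l][b, j, i, k, 5 + c] = 1
    # Returns a list of 3 arrays, each [batchsize, grid, grid, 3, 25]
    #   grid:  the grid cell (row, column) that the box center falls into
    #   3   :  which anchor of the layer's set; there are three sets of 3 anchors,
    #          and grid=13x13 uses the first set (anchor_mask[0])
    #   25  :  0-3 are xywh relative to the input size (416), 4 is the confidence,
    #          5-24 are the one-hot class

    return y_true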

1) Coordinate conversion

    boxes_xy = (true_boxes[..., 0:2] + true_boxes[..., 2:4]) // 2
    boxes_wh = true_boxes[..., 2:4] - true_boxes[..., 0:2]    
    true_boxes[..., 0:2] = boxes_xy / input_shape[::-1]
    true_boxes[..., 2:4] = boxes_wh / input_shape[::-1]

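The first two lines convert corner coordinates into a center point plus width and height. For example:

Take the four points (0, 0), (2, 0), (0, 2), (2, 2). In a 2D coordinate system these four points determine a rectangle.

If instead we have two values, the rectangle's center (1, 1) and its width and height (2, 2), we can determine the same rectangle.

YOLO uses the second form: one center point and one pair of lengths define a rectangle.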

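The last two lines divide the coordinates and the width and height by 416, so the resulting values lie between 0 and 1 and can be scaled back up uniformly on different feature maps. For example:

Suppose (x, y) = (0.5, 0.5) and (w, h) = (0.1, 0.1). On a 416 x 416 image this means a rectangle whose center is at (416x0.5, 416x0.5) = (208, 208) and whose width and height are (416x0.1, 416x0.1) = (41.6, 41.6). The benefit is that it doesn't matter whether your image is 416 x 416, 13 x 13, 100 x 300 or anything else: give me the size you need, and I can scale the box back up in proportion to that size. That is the visible effect, and it is also the principle by which the final rectangle is determined.

For the YOLO network this serves two purposes: values between 0 and 1 can be handled with the sigmoid function, and boxes can be rescaled across the multi-scale feature maps.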
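The conversion and the later scaling can be tried out in a few lines. This is only an illustrative sketch: the helper names are made up, and the center uses plain division where the original code uses floor division (// 2).

```python
def corners_to_relative(box, input_size=(416, 416)):
    """(xmin, ymin, xmax, ymax) in pixels -> (x, y, w, h) as ratios in [0, 1].

    input_size is (width, height). Hypothetical helper for illustration.
    """
    xmin, ymin, xmax, ymax = box
    w_in, h_in = input_size
    cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2
    return (cx / w_in, cy / h_in, (xmax - xmin) / w_in, (ymax - ymin) / h_in)


def relative_to_size(rel_box, size):
    """Scale a relative (x, y, w, h) box up to any (width, height)."""
    x, y, w, h = rel_box
    return (x * size[0], y * size[1], w * size[0], h * size[1])


def grid_cell(rel_box, grid):
    """Column and row of the grid cell that contains the box center."""
    return (int(rel_box[0] * grid), int(rel_box[1] * grid))


rel = corners_to_relative((104, 104, 312, 312))
print(rel)                                  # ratios, independent of image size
print(relative_to_size(rel, (416, 416)))    # back to input pixels
for grid in (13, 26, 52):
    print(grid, grid_cell(rel, grid))       # the same box on each feature map
```

The same relative box lands in a different cell on each feature map, which is exactly what the preprocessing relies on later.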

2) IoU computation logic

    anchor_maxes = anchors / 2.
    anchor_mins = -anchor_maxes
    # Keep only boxes with positive width. As I understand it, mislabeled data or edge cases in
    # augmentation can produce bad boxes during conversion, and this step guards against such
    # outliers (all-zero padding rows are dropped by the same test)
    # [batchsize, n]
    valid_mask = boxes_wh[..., 0] > 0

    for b in range(m):
        # Drop the entries that contain no object
        # [N, 2]
        wh = boxes_wh[b, valid_mask[b]]
        if len(wh) == 0:
            continue
        # Expand dim to apply broadcasting.
        # [N, 1, 2]
        wh = np.expand_dims(wh, -2)
        box_maxes = wh / 2.
        box_mins = -box_maxes

        # box_mins/box_maxes and anchor_mins/anchor_maxes are the corners of each box
        # and each anchor when both are centered on the origin
        # Take the element-wise larger of the two min corners
        # Broadcasting: [N, 1, 2] against [1, 9, 2] gives [N, 9, 2]
        # (N boxes, 9 anchors, 2 columns for w and h)
        intersect_mins = np.maximum(box_mins, anchor_mins)
        # Take the element-wise smaller of the two max corners, also [N, 9, 2]
        intersect_maxes = np.minimum(box_maxes, anchor_maxes)
        # These two steps pair every box of the image with every anchor, like a Cartesian
        # product of the [N, 2] boxes and the [9, 2] anchors, done by broadcasting

        intersect_wh = np.maximum(intersect_maxes - intersect_mins, 0.)
        # Intersection area
        # [N, 9] * [N, 9] -> [N, 9]
        intersect_area = intersect_wh[..., 0] * intersect_wh[..., 1]
        # [N, 1]
        box_area = wh[..., 0] * wh[..., 1]
        # [1, 9]
        anchor_area = anchors[..., 0] * anchors[..., 1]
        # And there is the IoU!
        # [N, 9]: N boxes against 9 anchors
        iou = intersect_area / (box_area + anchor_area - intersect_area)

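The steps above are YOLO's rather clever way of computing IoU. I'll explain the principle instead of analyzing the code line by line.

In theory, for two rectangles [AA-BB-CC-O] and [aa-bb-cc-o] given by their corner points, iou = ([AA-BB-CC-O]∩[aa-bb-cc-o]) / ([AA-BB-CC-O] + [aa-bb-cc-o] - [AA-BB-CC-O]∩[aa-bb-cc-o]), where each bracket stands for the area of that rectangle.

But because YOLO has converted the corner coordinates into a center plus width and height, the problem is transformed into computing

iou = ([A-B-C-D]∩[a-b-c-d]) / ([A-B-C-D] + [a-b-c-d] - [A-B-C-D]∩[a-b-c-d]), where B and b are the centers of the box and the anchor (which one is which doesn't matter). In the code both rectangles are centered on the origin, so the two centers coincide and only the widths and heights matter.

That is the computation logic.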
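Since the broadcasting is the part that is easiest to get lost in, here is a small standalone sketch of the same computation with the shapes spelled out. The function name shared_center_iou is hypothetical; the logic mirrors the excerpt above, assuming every box and anchor is centered on the origin.

```python
import numpy as np


def shared_center_iou(box_wh, anchor_wh):
    """IoU between N boxes and K anchors that all share one center.

    box_wh: (N, 2) widths and heights; anchor_wh: (K, 2). Returns (N, K).
    """
    box_wh = np.asarray(box_wh, dtype="float32")[:, None, :]        # (N, 1, 2)
    anchor_wh = np.asarray(anchor_wh, dtype="float32")[None, :, :]  # (1, K, 2)
    box_maxes, anchor_maxes = box_wh / 2.0, anchor_wh / 2.0
    # Broadcasting stretches the size-1 axes, so every box meets every anchor
    intersect_mins = np.maximum(-box_maxes, -anchor_maxes)          # (N, K, 2)
    intersect_maxes = np.minimum(box_maxes, anchor_maxes)           # (N, K, 2)
    intersect_wh = np.maximum(intersect_maxes - intersect_mins, 0.0)
    intersect_area = intersect_wh[..., 0] * intersect_wh[..., 1]    # (N, K)
    box_area = box_wh[..., 0] * box_wh[..., 1]                      # (N, 1)
    anchor_area = anchor_wh[..., 0] * anchor_wh[..., 1]             # (1, K)
    return intersect_area / (box_area + anchor_area - intersect_area)


anchors = [[10, 13], [16, 30], [33, 23], [30, 61], [62, 45],
           [59, 119], [116, 90], [156, 198], [373, 326]]
iou = shared_center_iou([[116, 90], [10, 13]], anchors)
print(iou.shape)                 # (2, 9)
print(np.argmax(iou, axis=-1))   # best anchor index for each box
```

A box whose width and height equal an anchor's gets an IoU of 1 with that anchor, so the first box picks anchor 6 and the second picks anchor 0.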

3) Constructing the ground truth

        best_anchor = np.argmax(iou, axis=-1)

        for t, n in enumerate(best_anchor):
            for l in range(num_layers):
                # n is the anchor index; it decides which output layer detects this box
                if n in anchor_mask[l]:
                    # true_boxes is [batchsize, n, 5]; its xywh are ratios between 0 and 1,
                    # obtained by dividing by the input size (416)
                    # Multiplying by the grid size (e.g. 13) maps them onto the feature map
                    # Map x first: column index
                    i = np.floor(true_boxes[b, t, 0] * grid_shapes[l][1]).astype('int32')
                    # Then map y: row index
                    j = np.floor(true_boxes[b, t, 1] * grid_shapes[l][0]).astype('int32')
                    # Which of this layer's anchors
                    k = anchor_mask[l].index(n)
                    # class_id
                    c = true_boxes[b, t, 4].astype('int32')
                    # Are w and h swapped here? No: the array is indexed [row j, column i].
                    # Now I finally see why the axes point right and down: that way the
                    # x and y axes line up with the matrix column and row indices
                    # Store xywh
                    # k is the position of the anchor within this layer's anchor set
                    y_true[l][b, j, i, k, 0:4] = true_boxes[b, t, 0:4]
                    # Confidence 1 means an object is present
                    y_true[l][b, j, i, k, 4] = 1
                    # One-hot class label
                    y_true[l][b, j, i, k, 5 + c] = 1

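This is where the ground truth, i.e. the Y input, gets built. Y is a list of 3 arrays, each of shape batchsize x grid x grid x anchor_id x 25.

Where:

a) 3 refers to the three output layers of the YOLO network: the first is 13x13x3x25, the second 26x26x3x25, and the third 52x52x3x25.

b) batchsize is the batch size.

c) grid is the feature-map size of each layer: grid = 13 for the first layer, grid = 26 for the second, and grid = 52 for the third.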

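d) anchor_id determines which anchor of which set is used for detection. This needs a more detailed explanation.

First, there are 3 sets of anchors with three anchors each, 9 in total: [[10,13], [16,30], [33,23], [30,61], [62,45], [59,119], [116,90], [156,198], [373,326]]. Anchors 1-3 are used by the third output layer, anchors 4-6 by the second, and anchors 7-9 by the first.

At this point you may be a little surprised: why do the large anchors slide over the small feature maps? Shouldn't larger anchors go with larger feature maps?

For example: take a 10 x 10 image. Convolving it with a 5 x 5 kernel at stride 1 gives a 6 x 6 output; convolving it with a 3 x 3 kernel at stride 1 gives an 8 x 8 output. So we can put it this way: the larger the kernel, the larger the receptive field, the more abstract the features, and the smaller the resulting feature map; the smaller the kernel, the smaller the receptive field, the more fragmented the features, and the larger the feature map. A feature map with a large receptive field needs larger anchors, and vice versa. In YOLOv3 the 13 x 13 map is the 416 x 416 input downsampled by 32, so each of its cells covers the largest region of the image. A smaller feature map therefore means a larger receptive field, which is why it gets the larger anchors.

So anchor_id here is the position of the chosen anchor within that layer's set. It comes from the index with the largest IoU (k = anchor_mask[l].index(n)).

e) 25 holds the relative xywh values (remember these are relative, i.e. between 0 and 1), the confidence, and the values for the 20 classes (4 + 1 + 20).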
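Putting items a) to e) together, the sketch below encodes a single box for a single image. It is a simplified illustration, not the author's function: the name encode_box is made up, there is no batch axis, and the best anchor index is passed in instead of being computed from IoU.

```python
import numpy as np

ANCHOR_MASK = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]  # layer 0 gets the largest anchors
GRIDS = [13, 26, 52]                             # 416 // 32, 416 // 16, 416 // 8
NUM_CLASSES = 20


def encode_box(rel_box, class_id, best_anchor):
    """Write one box into per-layer targets of shape (grid, grid, 3, 25).

    rel_box: (x, y, w, h) as ratios in [0, 1); best_anchor: index 0-8.
    """
    y_true = [np.zeros((g, g, 3, 5 + NUM_CLASSES), dtype="float32") for g in GRIDS]
    for layer, mask in enumerate(ANCHOR_MASK):
        if best_anchor in mask:
            g = GRIDS[layer]
            i = int(np.floor(rel_box[0] * g))   # column, from x
            j = int(np.floor(rel_box[1] * g))   # row, from y
            k = mask.index(best_anchor)         # position within this layer's set
            y_true[layer][j, i, k, 0:4] = rel_box
            y_true[layer][j, i, k, 4] = 1               # an object is present
            y_true[layer][j, i, k, 5 + class_id] = 1    # one-hot class
    return y_true


targets = encode_box((0.5, 0.25, 0.3, 0.2), class_id=6, best_anchor=7)
print([t.shape for t in targets])
print(targets[0][3, 6, 1])   # the only non-zero cell: row 3, column 6, anchor slot 1
```

Anchor 7 belongs to the first set, so only the 13 x 13 array is touched; the other two arrays stay all zeros for this box.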

4) Return value

yield [image_data, *y_true], np.zeros(batch_size)

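YOLO needs two groups of inputs. The first is the usual images, passed in as a NumPy array; the second is the Y values, which are used when computing the loss.

image_data shape = (batchsize, 416, 416, 3)

*y_true shape = 3 arrays of (batchsize, grid, grid, 3, 25)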
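To see the structure of one batch without loading any images, the stand-in below yields zero-filled arrays of the same shapes. The name dummy_batches is hypothetical. It assumes, as described above, that y_true is fed in as extra model inputs for the loss, so the second yielded element is only a placeholder target.

```python
import numpy as np


def dummy_batches(batch_size, num_classes=20, input_hw=(416, 416)):
    """Yield zero-filled batches shaped like the output of data_generator."""
    h, w = input_hw
    while True:
        image_data = np.zeros((batch_size, h, w, 3), dtype="float32")
        # One target array per output layer, for strides 32, 16 and 8
        y_true = [
            np.zeros((batch_size, h // s, w // s, 3, 5 + num_classes), dtype="float32")
            for s in (32, 16, 8)
        ]
        # Inputs first (images plus the three targets), then the placeholder target
        yield [image_data, *y_true], np.zeros(batch_size)


inputs, placeholder = next(dummy_batches(batch_size=2))
print([a.shape for a in inputs])
print(placeholder.shape)
```

The first element is a list of four arrays: the images followed by the three y_true arrays.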

OK, next time I'll walk you through the loss function part of YOLO.